root 73f242e3e4 distillation: Phase 9 — release freeze and operator handoff

Final phase. Adds:
  scripts/distillation/release_freeze.ts   ~330 lines, 6 release gates
  docs/distillation/operator-handoff.md    durable cold-start operator doc
  docs/distillation/recovery-runbook.md    failure-mode runbook by symptom
  scripts/distillation/distill.ts          +release-freeze subcommand

The release_freeze orchestrator runs every gate the system has:
  1. Clean git state (tolerates auto-regenerated reports)
  2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
  3. Phase commit verification (every Phase 0-8 commit resolves)
  4. Acceptance gate (22-invariant fixture E2E)
  5. audit-full (Phases 0-7 verified + drift detection)
  6. Tag availability check (distillation-v1.0.0 not yet existing)

Outputs:
  reports/distillation/release-freeze.md       human-readable manifest
  reports/distillation/release-manifest.json   machine-readable manifest

Manifest captures:
  - git_head + git_branch + released_at
  - phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
  - dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
  - latest audit baseline row
  - per-gate pass/fail with detail

Operator handoff doc covers:
  - phase map with commits + report locations
  - known-good commands
  - how to rerun audit-full + inspect drift
  - how to restore from last-good (git checkout distillation-v1.0.0)
  - how to add future phases without contaminating corpus
  - what NOT to modify casually (with file:reason mapping)
  - cumulative commits at v1.0.0

Recovery runbook covers, by symptom:
  - audit-full exit non-zero (per-phase diagnostics)
  - drift table flags warn (intentional vs regression)
  - acceptance fail vs audit-full pass divergence
  - run-all empty exports (counter-bisection order)
  - hash mismatch on identical input (determinism violation; CRITICAL)
  - replay logs growing unbounded (rotation guidance)
  - nuclear restore via git checkout distillation-v1.0.0

Spec constraints (per now.md Phase 9):
  - DO NOT add new intelligence features ✓ (zero new logic)
  - DO NOT change scoring/export logic ✓ (zero touches)
  - DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
    auto-regen tolerance documented in checkCleanGit)
  - DO NOT retrain anything ✓ (no model touches)

CLI:
  ./scripts/distill release-freeze   # exit 0 = release-ready

Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 23:54:31 -05:00

9.5 KiB

Raw Blame History

Distillation System — Recovery Runbook

Version: v1.0.0 Audience: Future operator (or future Claude session) inheriting this system in a broken state.

This is the failure-mode runbook. Read top-down, stop at the first symptom that matches.

Symptom 1: `audit-full` exits non-zero

A required check failed. The report at reports/distillation/phase8-full-audit-report.md will name the failing check verbatim. Map by phase:

P0 — recon doc missing

Cause: repo state corrupted; the recon doc has never existed at this commit. Fix:

git checkout distillation-v1.0.0 -- docs/recon/local-distillation-recon.md

P0 — tier-1 source stream missing

Cause: fresh-clone or post-rotation environment without source data. Severity: informational only (audit-full reports as required=false). Pipeline will produce 0 rows but won't fail. Fix: populate the source stream OR accept reduced output and note in the next report.

P1 — schema validators fail

Cause: somebody modified a Phase 1 schema and broke a validator. Diagnostic:

bun test auditor/schemas/distillation/ 2>&1 | grep -A2 "fail"

Fix: revert the schema change. Phase 1 schemas are versioned (_SCHEMA_VERSION = 1); they bump deliberately, never silently.

git diff distillation-v1.0.0 -- auditor/schemas/distillation/  # see what changed
git checkout distillation-v1.0.0 -- auditor/schemas/distillation/<changed_file>

P2 — materializer dry-run fails / writes 0

Cause: every source jsonl is empty OR every transform is broken. Diagnostic:

ls -la data/_kb/*.jsonl   # confirm sources have content
bun run scripts/distillation/build_evidence_index.ts --dry-run
# inspect skip reasons in data/_kb/distillation_skips.jsonl

Fix: identify the broken transform via per-source row counts. If a real-data shape changed (e.g. a new field name in a source jsonl), update the matching transform in scripts/distillation/transforms.ts. Add a fixture row to auditor/schemas/distillation/realdata.test.ts covering the new shape.

P3 — scored-runs distribution empty

Cause: data/scored-runs/ is missing OR the score categories are not landing. Fix:

./scripts/distill score   # re-runs scorer if data/evidence/ is populated

If still empty, the scorer rules in scripts/distillation/scorer.ts may have changed. Re-run bun test tests/distillation/scorer.test.ts — if those fail, revert.

P4 — SFT contamination firewall caught a leak (CRITICAL)

Severity: alert. A rejected or needs_human_review row landed in SFT. This is a non-negotiable spec violation. Stop immediately:

mv exports/sft/instruction_response.jsonl exports/sft/instruction_response.jsonl.QUARANTINED-$(date +%s)

Diagnostic: the audit report's "P4 SFT contamination firewall" check shows the count. The forbidden row is in the file you just renamed:

jq 'select(.quality_score != "accepted" and .quality_score != "partially_accepted")' exports/sft/instruction_response.jsonl.QUARANTINED-*

Root cause is one of:

SftSample schema validator was loosened — check auditor/schemas/distillation/sft_sample.ts against v1.0.0
Exporter SFT_NEVER constant was loosened — check scripts/distillation/export_sft.ts
Schema validation was bypassed (someone called appendFileSync directly) — find the offending caller via git log --oneline -p exports/sft/

Recovery: revert the offending change. Re-run ./scripts/distill export-sft. The fresh output should pass the audit's leak check.

P4 — Preference self-pair leaked

Cause: export_preference's pairing logic produced chosen_run_id == rejected_run_id. Schema validator should catch this. Diagnostic:

jq 'select(.chosen_run_id == .rejected_run_id) | .id' exports/preference/chosen_rejected.jsonl

Fix: check scripts/distillation/export_preference.ts::buildPair — the equality guard at the top must remain. Revert if missing.

P5 — RunSummary fails to validate

Cause: receipts harness emitted a malformed summary. Diagnostic:

LATEST=$(ls -1t reports/distillation/*/summary.json | head -1)
bun -e "import {validateRunSummary} from './auditor/schemas/distillation/run_summary'; const s = JSON.parse(await Bun.file('$LATEST').text()); console.log(validateRunSummary(s))"

Fix: the field that failed validation is named in the validator output. Either fix the harness in scripts/distillation/receipts.ts or revert.

P6 — acceptance gate fails an invariant

Cause: something in Phases 1-5 changed in a way the fixture catches. Diagnostic: run acceptance directly to see all 22 checks:

bun run scripts/distillation/acceptance.ts
# scan output; the failing check names what broke
cat reports/distillation/phase6-acceptance-report.md

Fix: the report's "Failures" section names the invariant. The 22 invariants are documented at the top of scripts/distillation/acceptance.ts. Locate the affected phase code, revert.

P7 — replay validation regressed

Cause: local model output failed the structural validator OR retrieval found 0 playbooks. Diagnostic:

./scripts/distill replay --task "<task that previously passed>" --local-only
# inspect the validation_result.reasons

Fix:

If validation reasons mention "filler/hedge phrase": the local model regressed (model swap?) — revert LH_REPLAY_LOCAL_MODEL to default
If retrieval is empty: exports/rag/playbooks.jsonl is empty — re-run export-rag

Symptom 2: drift table flags `warn`

A metric moved >20% from baseline. This is a SOFT alert — not a failure, but worth investigating before treating outputs as stable.

Diagnostic:

jq -r '.metrics' data/_kb/audit_baselines.jsonl | tail -5
# see the recent baseline trajectory
cat reports/distillation/phase8-full-audit-report.md
# read the drift table

Common causes:

New source data → record counts grow → expected, not a regression
Scorer rules changed → category distribution shifted → confirm intentional
Exporter filter loosened → SFT/RAG counts grow → CHECK contamination firewall first

If the drift is intentional, write a row to data/_kb/audit_baselines.jsonl documenting why (no schema for this — just append a JSON line with a notes field). Future audits will treat the new value as the new baseline.

Symptom 3: `acceptance` exits non-zero but `audit-full` doesn't

This is rare — acceptance is stricter (22 invariants on a fixture vs audit-full's 16 required checks on real data). The failing acceptance check usually points to a broken assumption that real data hides.

Diagnostic:

bun run scripts/distillation/acceptance.ts 2>&1 | head -50
ls /tmp/distillation_phase6_acceptance/data/evidence/  # the failed run leaves its temp root

Fix: the acceptance script keeps the temp root on fail (cleans only on pass). Inspect /tmp/distillation_phase6_acceptance/ to see what the pipeline produced vs expected. The fixture rows themselves are the contract — change the fixture deliberately, never to make a test pass.

Symptom 4: `run-all` produces empty exports

Cause: Phase 2 or Phase 3 ran on empty input.

Diagnostic order:

ls data/_kb/*.jsonl — sources present?
find data/evidence -name "*.jsonl" | xargs wc -l — any rows materialized?
find data/scored-runs -name "*.jsonl" | xargs wc -l — any rows scored?
wc -l data/_kb/distillation_skips.jsonl data/_kb/scoring_skips.jsonl — anything skipped?

The first counter that's 0 names the broken phase.

Symptom 5: hash mismatch on identical input

Cause: determinism violation. Same input → different output. This is a CRITICAL bug.

Diagnostic:

# Wipe outputs but keep sources, run twice with same recorded_at, compare run_hash
rm -rf data/evidence data/scored-runs exports
RA="2026-04-27T00:00:00.000Z"
./scripts/distill run-all  # captures run_id_1
LATEST_1=$(ls -1t reports/distillation/ | grep -v phase | head -1)

rm -rf data/evidence data/scored-runs exports
./scripts/distill run-all
LATEST_2=$(ls -1t reports/distillation/ | grep -v phase | head -1)

jq .run_hash reports/distillation/$LATEST_1/summary.json
jq .run_hash reports/distillation/$LATEST_2/summary.json
# These MUST match if recorded_at is fixed.

If they don't match: something in the pipeline introduced non-determinism. Common causes:

A Date.now() baked into output (other than the explicit recorded_at)
Math.random() or randomUUID() in a path that should be deterministic
A Map iteration order issue (rare in V8)
Concurrent writes to the same file

Bisect against distillation-v1.0.0 — find the commit that introduced the non-determinism, revert.

Symptom 6: replay logs growing unbounded

Phase 7 replay appends to data/_kb/replay_runs.jsonl with no rotation. Acceptable until file >100MB, then:

# Move and start fresh
mv data/_kb/replay_runs.jsonl data/_kb/replay_runs.archive.$(date +%s).jsonl
gzip data/_kb/replay_runs.archive.*.jsonl

This is documented as a Phase 7 carry-over. A future Phase 10+ could add rotation.

Last resort: nuclear restore

If nothing works:

git fetch --tags
git stash               # save uncommitted work
git checkout distillation-v1.0.0
./scripts/distill audit-full   # confirm 16/16 pass at v1.0.0
./scripts/distill acceptance   # confirm 22/22 pass

# Now diff against the broken state to see what changed
git diff distillation-v1.0.0..scrum/auto-apply-19814 -- scripts/distillation/ auditor/schemas/distillation/

The diff is your bug.

9.5 KiB Raw Blame History

Distillation System — Recovery Runbook

Symptom 1: audit-full exits non-zero

P0 — recon doc missing

P0 — tier-1 source stream missing

P1 — schema validators fail

P2 — materializer dry-run fails / writes 0

P3 — scored-runs distribution empty

P4 — SFT contamination firewall caught a leak (CRITICAL)

P4 — Preference self-pair leaked

P5 — RunSummary fails to validate

P6 — acceptance gate fails an invariant

P7 — replay validation regressed

Symptom 2: drift table flags warn

Symptom 3: acceptance exits non-zero but audit-full doesn't

Symptom 4: run-all produces empty exports

Symptom 5: hash mismatch on identical input

Symptom 6: replay logs growing unbounded

Last resort: nuclear restore

9.5 KiB

Raw Blame History

Symptom 1: `audit-full` exits non-zero

Symptom 2: drift table flags `warn`

Symptom 3: `acceptance` exits non-zero but `audit-full` doesn't

Symptom 4: `run-all` produces empty exports