lakehouse/docs/distillation/recovery-runbook.md
root 73f242e3e4 distillation: Phase 9 — release freeze and operator handoff
Final phase. Adds:
  scripts/distillation/release_freeze.ts   ~330 lines, 6 release gates
  docs/distillation/operator-handoff.md    durable cold-start operator doc
  docs/distillation/recovery-runbook.md    failure-mode runbook by symptom
  scripts/distillation/distill.ts          +release-freeze subcommand

The release_freeze orchestrator runs every gate the system has:
  1. Clean git state (tolerates auto-regenerated reports)
  2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
  3. Phase commit verification (every Phase 0-8 commit resolves)
  4. Acceptance gate (22-invariant fixture E2E)
  5. audit-full (Phases 0-7 verified + drift detection)
  6. Tag availability check (distillation-v1.0.0 not yet existing)

Outputs:
  reports/distillation/release-freeze.md       human-readable manifest
  reports/distillation/release-manifest.json   machine-readable manifest

Manifest captures:
  - git_head + git_branch + released_at
  - phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
  - dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
  - latest audit baseline row
  - per-gate pass/fail with detail

Operator handoff doc covers:
  - phase map with commits + report locations
  - known-good commands
  - how to rerun audit-full + inspect drift
  - how to restore from last-good (git checkout distillation-v1.0.0)
  - how to add future phases without contaminating corpus
  - what NOT to modify casually (with file:reason mapping)
  - cumulative commits at v1.0.0

Recovery runbook covers, by symptom:
  - audit-full exit non-zero (per-phase diagnostics)
  - drift table flags warn (intentional vs regression)
  - acceptance fail vs audit-full pass divergence
  - run-all empty exports (counter-bisection order)
  - hash mismatch on identical input (determinism violation; CRITICAL)
  - replay logs growing unbounded (rotation guidance)
  - nuclear restore via git checkout distillation-v1.0.0

Spec constraints (per now.md Phase 9):
  - DO NOT add new intelligence features ✓ (zero new logic)
  - DO NOT change scoring/export logic ✓ (zero touches)
  - DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
    auto-regen tolerance documented in checkCleanGit)
  - DO NOT retrain anything ✓ (no model touches)

CLI:
  ./scripts/distill release-freeze   # exit 0 = release-ready

Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:54:31 -05:00


# Distillation System — Recovery Runbook
**Version:** v1.0.0
**Audience:** Future operator (or future Claude session) inheriting this system in a broken state.
This is the failure-mode runbook. Read top-down, stop at the first symptom that matches.
## Symptom 1: `audit-full` exits non-zero
A required check failed. The report at `reports/distillation/phase8-full-audit-report.md` will name the failing check verbatim. Map by phase:
### P0 — recon doc missing
**Cause:** repo state is corrupted, or the recon doc never existed at this commit.
**Fix:**
```bash
git checkout distillation-v1.0.0 -- docs/recon/local-distillation-recon.md
```
### P0 — tier-1 source stream missing
**Cause:** fresh-clone or post-rotation environment without source data.
**Severity:** informational only (audit-full reports this check as `required=false`). The pipeline will produce 0 rows but won't fail.
**Fix:** populate the source stream OR accept reduced output and note in the next report.
### P1 — schema validators fail
**Cause:** somebody modified a Phase 1 schema and broke a validator.
**Diagnostic:**
```bash
bun test auditor/schemas/distillation/ 2>&1 | grep -A2 "fail"
```
**Fix:** revert the schema change. Phase 1 schemas are versioned (`_SCHEMA_VERSION = 1`); they bump deliberately, never silently.
```bash
git diff distillation-v1.0.0 -- auditor/schemas/distillation/ # see what changed
git checkout distillation-v1.0.0 -- auditor/schemas/distillation/<changed_file>
```
### P2 — materializer dry-run fails / writes 0
**Cause:** every source jsonl is empty OR every transform is broken.
**Diagnostic:**
```bash
ls -la data/_kb/*.jsonl # confirm sources have content
bun run scripts/distillation/build_evidence_index.ts --dry-run
# inspect skip reasons in data/_kb/distillation_skips.jsonl
```
**Fix:** identify the broken transform via per-source row counts. If a real-data shape changed (e.g. a new field name in a source jsonl), update the matching transform in `scripts/distillation/transforms.ts`. Add a fixture row to `auditor/schemas/distillation/realdata.test.ts` covering the new shape.
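A minimal sketch of a transform that tolerates a renamed source field and returns a skip reason instead of silently dropping the row. Everything here is hypothetical: the field names `task` and `task_text`, the `EvidenceRow` shape, and both function names are placeholders, and the real transforms in `scripts/distillation/transforms.ts` may be structured quite differently.

```typescript
// Hypothetical shapes for illustration only; not the project's real types.
type SourceRow = Record<string, unknown>;
type EvidenceRow = { task: string };
type TransformResult =
  | { ok: true; row: EvidenceRow }
  | { ok: false; skip_reason: string };

// Accept either the old field name or the new one, and say why a row was skipped.
function transformTaskRow(src: SourceRow): TransformResult {
  const raw = src["task"] ?? src["task_text"]; // "task_text" stands in for a renamed field
  if (typeof raw !== "string" || raw.trim() === "") {
    return { ok: false, skip_reason: "missing task/task_text" };
  }
  return { ok: true, row: { task: raw.trim() } };
}

// Tally outcomes so a transform that skips every row stands out immediately.
function tally(rows: SourceRow[]): { written: number; skipped: number } {
  let written = 0;
  let skipped = 0;
  for (const r of rows) {
    if (transformTaskRow(r).ok) written++;
    else skipped++;
  }
  return { written, skipped };
}
```

A `written` count of 0 alongside a non-zero `skipped` count for a single source is the pattern that points at a changed real-data shape.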
### P3 — scored-runs distribution empty
**Cause:** `data/scored-runs/` is missing OR the score categories are not landing.
**Fix:**
```bash
./scripts/distill score # re-runs scorer if data/evidence/ is populated
```
If still empty, the scorer rules in `scripts/distillation/scorer.ts` may have changed. Re-run `bun test tests/distillation/scorer.test.ts`; if those tests fail, revert the scorer changes.
### P4 — SFT contamination firewall caught a leak (CRITICAL)
**Severity:** alert. A `rejected` or `needs_human_review` row landed in SFT. This is a non-negotiable spec violation.
**Stop immediately:**
```bash
mv exports/sft/instruction_response.jsonl exports/sft/instruction_response.jsonl.QUARANTINED-$(date +%s)
```
**Diagnostic:** the audit report's "P4 SFT contamination firewall" check shows the count. The forbidden row is in the file you just renamed:
```bash
jq 'select(.quality_score != "accepted" and .quality_score != "partially_accepted")' exports/sft/instruction_response.jsonl.QUARANTINED-*
```
**Root cause is one of:**
1. SftSample schema validator was loosened — check `auditor/schemas/distillation/sft_sample.ts` against v1.0.0
2. Exporter `SFT_NEVER` constant was loosened — check `scripts/distillation/export_sft.ts`
3. Schema validation was bypassed (someone called `appendFileSync` directly) — find the offending caller via `git log --oneline -p exports/sft/`
**Recovery:** revert the offending change. Re-run `./scripts/distill export-sft`. The fresh output should pass the audit's leak check.
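The leak check can also be expressed in TypeScript, which helps when writing a regression test for the firewall. This is a sketch under stated assumptions: only the `quality_score` field and its two permitted values come from the jq filter above; the `SftRowLike` shape and the function names are hypothetical and are not the exporter's real API.

```typescript
// Labels this runbook treats as SFT-safe; the real enum may contain more values.
const SFT_ALLOWED = new Set<string>(["accepted", "partially_accepted"]);

// Hypothetical row shape: only `quality_score` is taken from the jq filter.
interface SftRowLike {
  id?: string;
  quality_score: string;
}

// Parse JSONL text (one JSON object per line), ignoring blank lines.
function parseJsonl(text: string): SftRowLike[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SftRowLike);
}

// Return every row that must not appear in an SFT export.
function findSftLeaks(rows: SftRowLike[]): SftRowLike[] {
  return rows.filter((row) => !SFT_ALLOWED.has(row.quality_score));
}
```

An allow-list is used rather than a deny-list so that any label added later is treated as a leak until someone deliberately permits it.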
### P4 — Preference self-pair leaked
**Cause:** export_preference's pairing logic produced `chosen_run_id == rejected_run_id`. Schema validator should catch this.
**Diagnostic:**
```bash
jq 'select(.chosen_run_id == .rejected_run_id) | .id' exports/preference/chosen_rejected.jsonl
```
**Fix:** check `scripts/distillation/export_preference.ts::buildPair` — the equality guard at the top must remain. Revert if missing.
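For orientation, an equality guard of the kind described might look like the sketch below. It is not the project's `buildPair`: the `ScoredRunLike` input shape, the `null` return convention, and the name `buildPairSketch` are assumptions made for illustration; only `chosen_run_id` and `rejected_run_id` come from the jq filter above.

```typescript
// Hypothetical input and output shapes.
interface ScoredRunLike {
  run_id: string;
}
interface PreferencePairLike {
  chosen_run_id: string;
  rejected_run_id: string;
}

// Refuse to pair a run with itself; return null so the caller can record a skip.
function buildPairSketch(
  chosen: ScoredRunLike,
  rejected: ScoredRunLike,
): PreferencePairLike | null {
  // The equality guard runs before any other work.
  if (chosen.run_id === rejected.run_id) return null;
  return { chosen_run_id: chosen.run_id, rejected_run_id: rejected.run_id };
}
```

Keeping the guard as the first statement means no later change to the pairing logic can route around it.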
### P5 — RunSummary fails to validate
**Cause:** receipts harness emitted a malformed summary.
**Diagnostic:**
```bash
LATEST=$(ls -1t reports/distillation/*/summary.json | head -1)
bun -e "import {validateRunSummary} from './auditor/schemas/distillation/run_summary'; const s = JSON.parse(await Bun.file('$LATEST').text()); console.log(validateRunSummary(s))"
```
**Fix:** the field that failed validation is named in the validator output. Either fix the harness in `scripts/distillation/receipts.ts` or revert.
### P6 — acceptance gate fails an invariant
**Cause:** something in Phases 1-5 changed in a way the fixture catches.
**Diagnostic:** run acceptance directly to see all 22 checks:
```bash
bun run scripts/distillation/acceptance.ts
# scan output; the failing check names what broke
cat reports/distillation/phase6-acceptance-report.md
```
**Fix:** the report's "Failures" section names the invariant. The 22 invariants are documented at the top of `scripts/distillation/acceptance.ts`. Locate the affected phase code, revert.
### P7 — replay validation regressed
**Cause:** local model output failed the structural validator OR retrieval found 0 playbooks.
**Diagnostic:**
```bash
./scripts/distill replay --task "<task that previously passed>" --local-only
# inspect the validation_result.reasons
```
**Fix:**
- If validation reasons mention "filler/hedge phrase": the local model regressed (model swap?) — revert `LH_REPLAY_LOCAL_MODEL` to default
- If retrieval is empty: `exports/rag/playbooks.jsonl` is empty — re-run export-rag
## Symptom 2: drift table flags `warn`
A metric moved >20% from baseline. This is a SOFT alert — not a failure, but worth investigating before treating outputs as stable.
**Diagnostic:**
```bash
tail -5 data/_kb/audit_baselines.jsonl | jq -c '.metrics'
# see the recent baseline trajectory
cat reports/distillation/phase8-full-audit-report.md
# read the drift table
```
**Common causes:**
- New source data → record counts grow → expected, not a regression
- Scorer rules changed → category distribution shifted → confirm intentional
- Exporter filter loosened → SFT/RAG counts grow → CHECK contamination firewall first
**If the drift is intentional**, write a row to `data/_kb/audit_baselines.jsonl` documenting why (no schema for this — just append a JSON line with a `notes` field). Future audits will treat the new value as the new baseline.
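The ">20% from baseline" rule at the top of this section can be sketched as a relative-change check. This is an illustration, not the audit's actual code: the function name, the strict `>` comparison, and the handling of a zero baseline are all assumptions.

```typescript
type DriftStatus = "ok" | "warn";

// Relative change from baseline; the default threshold 0.2 mirrors the ">20%" rule.
// Assumption: a zero baseline warns whenever the current value is non-zero.
function driftStatus(baseline: number, current: number, threshold = 0.2): DriftStatus {
  if (baseline === 0) return current === 0 ? "ok" : "warn";
  const relative = Math.abs(current - baseline) / Math.abs(baseline);
  return relative > threshold ? "warn" : "ok";
}
```

Under these assumptions, a record count going from 100 to 110 stays `ok`, while 100 to 130 (or 100 to 75) flags `warn`.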
## Symptom 3: `acceptance` exits non-zero but `audit-full` doesn't
This is rare — acceptance is stricter (22 invariants on a fixture vs audit-full's 16 required checks on real data). The failing acceptance check usually points to a broken assumption that real data hides.
**Diagnostic:**
```bash
bun run scripts/distillation/acceptance.ts 2>&1 | head -50
ls /tmp/distillation_phase6_acceptance/data/evidence/ # the failed run leaves its temp root
```
**Fix:** the acceptance script keeps the temp root on fail (cleans only on pass). Inspect `/tmp/distillation_phase6_acceptance/` to see what the pipeline produced vs expected. The fixture rows themselves are the contract — change the fixture deliberately, never to make a test pass.
## Symptom 4: `run-all` produces empty exports
**Cause:** Phase 2 or Phase 3 ran on empty input.
**Diagnostic order:**
1. `ls data/_kb/*.jsonl` — sources present?
2. `find data/evidence -name "*.jsonl" | xargs wc -l` — any rows materialized?
3. `find data/scored-runs -name "*.jsonl" | xargs wc -l` — any rows scored?
4. `wc -l data/_kb/distillation_skips.jsonl data/_kb/scoring_skips.jsonl` — anything skipped?
The first counter that's 0 names the broken phase.
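The bisection rule is simple enough to state as code. In this sketch the stage labels and the function name are illustrative, and the counts are assumed to have been gathered with the commands in steps 1 to 3.

```typescript
// Ordered counters: [stage label, row count]. Labels are illustrative.
type StageCount = [string, number];

// Return the first stage whose counter is 0, or null if every stage produced rows.
function firstEmptyStage(counts: StageCount[]): string | null {
  for (const [stage, rows] of counts) {
    if (rows === 0) return stage;
  }
  return null;
}

// Example: sources exist and evidence materialized, but nothing was scored.
const broken = firstEmptyStage([
  ["sources", 1200],
  ["evidence", 950],
  ["scored-runs", 0],
]);
```

Order matters: an empty earlier stage makes every later counter 0 as well, so only the first zero is informative.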
## Symptom 5: hash mismatch on identical input
**Cause:** determinism violation. Same input → different output. This is a CRITICAL bug.
**Diagnostic:**
```bash
# Wipe outputs but keep sources, run twice with same recorded_at, compare run_hash
rm -rf data/evidence data/scored-runs exports
RA="2026-04-27T00:00:00.000Z"
# NOTE: nothing below passes $RA to run-all. Pin recorded_at to this value
# using whatever mechanism your checkout's distill CLI provides.
./scripts/distill run-all
LATEST_1=$(ls -1t reports/distillation/*/summary.json | head -1)   # newest run summary
rm -rf data/evidence data/scored-runs exports
./scripts/distill run-all
LATEST_2=$(ls -1t reports/distillation/*/summary.json | head -1)
jq .run_hash "$LATEST_1"
jq .run_hash "$LATEST_2"
# These MUST match if recorded_at is fixed.
```
**If they don't match:** something in the pipeline introduced non-determinism. Common causes:
- A `Date.now()` baked into output (other than the explicit `recorded_at`)
- `Math.random()` or `randomUUID()` in a path that should be deterministic
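- A `Map` iteration order issue (rare: `Map` iterates in insertion order, so this only bites when the insertion order itself varies between runs)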
- Concurrent writes to the same file
Bisect against `distillation-v1.0.0` — find the commit that introduced the non-determinism, revert.
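One common way to rule out ordering-related non-determinism is to serialize with sorted keys before hashing. The sketch below shows that technique in isolation. It is not necessarily how `run_hash` is computed in this pipeline, and the `Json` type and `canonicalize` name are introduced here for illustration.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level, so the order in which
// keys were inserted cannot leak into the output string.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    // Array order is data, so it is preserved as-is.
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const body = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k] as Json));
  return "{" + body.join(",") + "}";
}

// Same data, different insertion order, identical serialization.
const first = canonicalize({ b: 1, a: { d: [1, { z: 1, y: 2 }], c: null } });
const second = canonicalize({ a: { c: null, d: [1, { y: 2, z: 1 }] }, b: 1 });
```

If two runs disagree even after canonical serialization, the difference is in the data itself (timestamps, random IDs, write races), not in key ordering.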
## Symptom 6: replay logs growing unbounded
Phase 7 replay appends to `data/_kb/replay_runs.jsonl` with no rotation. Acceptable until file >100MB, then:
```bash
# Move and start fresh
mv data/_kb/replay_runs.jsonl data/_kb/replay_runs.archive.$(date +%s).jsonl
gzip data/_kb/replay_runs.archive.*.jsonl
```
This is documented as a Phase 7 carry-over. A future Phase 10+ could add rotation.
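If rotation is ever automated, the decision and the naming could look like the sketch below. It is only an illustration of the manual steps above: the function names are hypothetical, and reading "100MB" as 10^8 bytes is an assumption.

```typescript
// Threshold from this section; treating "MB" as 10^6 bytes is an assumption.
const ROTATE_AT_BYTES = 100 * 1_000_000;

// True once the live log has grown past the threshold.
function shouldRotate(sizeBytes: number): boolean {
  return sizeBytes > ROTATE_AT_BYTES;
}

// Mirrors the `mv` target above: replay_runs.archive.<epoch seconds>.jsonl
function archivePath(livePath: string, now: Date): string {
  const epochSeconds = Math.floor(now.getTime() / 1000);
  return livePath.replace(/\.jsonl$/, `.archive.${epochSeconds}.jsonl`);
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside, keeps the helper testable and out of the deterministic paths discussed under Symptom 5.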
## Last resort: nuclear restore
If nothing works:
```bash
git fetch --tags
git stash                        # save uncommitted tracked changes; add -u to include untracked files
git checkout distillation-v1.0.0
./scripts/distill audit-full     # confirm 16/16 pass at v1.0.0
./scripts/distill acceptance     # confirm 22/22 pass
# Now diff against the broken state to see what changed.
# Replace scrum/auto-apply-19814 with the branch that holds the broken state if it differs.
git diff distillation-v1.0.0..scrum/auto-apply-19814 -- scripts/distillation/ auditor/schemas/distillation/
```
The diff is your bug.