lakehouse/docs/distillation/recovery-runbook.md

# Distillation System — Recovery Runbook

**Version:** v1.0.0
**Audience:** Future operator (or future Claude session) inheriting this system in a broken state.

This is the failure-mode runbook. Read top-down, stop at the first symptom that matches.

## Symptom 1: `audit-full` exits non-zero

A required check failed. The report at `reports/distillation/phase8-full-audit-report.md` will name the failing check verbatim. Map by phase:

### P0 — recon doc missing
**Cause:** repo state corrupted; the recon doc has never existed at this commit.
**Fix:**
```bash
git checkout distillation-v1.0.0 -- docs/recon/local-distillation-recon.md
```

### P0 — tier-1 source stream missing
**Cause:** fresh-clone or post-rotation environment without source data.
**Severity:** informational only (audit-full reports as required=false). Pipeline will produce 0 rows but won't fail.
**Fix:** populate the source stream OR accept reduced output and note in the next report.

### P1 — schema validators fail
**Cause:** somebody modified a Phase 1 schema and broke a validator.
**Diagnostic:**
```bash
bun test auditor/schemas/distillation/ 2>&1 | grep -A2 "fail"
```
**Fix:** revert the schema change. Phase 1 schemas are versioned (`_SCHEMA_VERSION = 1`); they bump deliberately, never silently.
```bash
git diff distillation-v1.0.0 -- auditor/schemas/distillation/  # see what changed
git checkout distillation-v1.0.0 -- auditor/schemas/distillation/<changed_file>
```

### P2 — materializer dry-run fails / writes 0
**Cause:** every source jsonl is empty OR every transform is broken.
**Diagnostic:**
```bash
ls -la data/_kb/*.jsonl   # confirm sources have content
bun run scripts/distillation/build_evidence_index.ts --dry-run
# inspect skip reasons in data/_kb/distillation_skips.jsonl
```
**Fix:** identify the broken transform via per-source row counts. If a real-data shape changed (e.g. a new field name in a source jsonl), update the matching transform in `scripts/distillation/transforms.ts`. Add a fixture row to `auditor/schemas/distillation/realdata.test.ts` covering the new shape.

### P3 — scored-runs distribution empty
**Cause:** `data/scored-runs/` is missing OR the score categories are not landing.
**Fix:**
```bash
./scripts/distill score   # re-runs scorer if data/evidence/ is populated
```
If still empty, the scorer rules in `scripts/distillation/scorer.ts` may have changed. Re-run `bun test tests/distillation/scorer.test.ts` — if those fail, revert.

### P4 — SFT contamination firewall caught a leak (CRITICAL)
**Severity:** alert. A `rejected` or `needs_human_review` row landed in SFT. This is a non-negotiable spec violation.
**Stop immediately:**
```bash
mv exports/sft/instruction_response.jsonl exports/sft/instruction_response.jsonl.QUARANTINED-$(date +%s)
```
**Diagnostic:** the audit report's "P4 SFT contamination firewall" check shows the count. The forbidden row is in the file you just renamed:
```bash
jq 'select(.quality_score != "accepted" and .quality_score != "partially_accepted")' exports/sft/instruction_response.jsonl.QUARANTINED-*
```
**Root cause is one of:**
1. SftSample schema validator was loosened — check `auditor/schemas/distillation/sft_sample.ts` against v1.0.0
2. Exporter `SFT_NEVER` constant was loosened — check `scripts/distillation/export_sft.ts`
3. Schema validation was bypassed (someone called `appendFileSync` directly) — find the offending caller via `git log --oneline -p exports/sft/`

**Recovery:** revert the offending change. Re-run `./scripts/distill export-sft`. The fresh output should pass the audit's leak check.

### P4 — Preference self-pair leaked
**Cause:** export_preference's pairing logic produced `chosen_run_id == rejected_run_id`. Schema validator should catch this.
**Diagnostic:**
```bash
jq 'select(.chosen_run_id == .rejected_run_id) | .id' exports/preference/chosen_rejected.jsonl
```
**Fix:** check `scripts/distillation/export_preference.ts::buildPair` — the equality guard at the top must remain. Revert if missing.

### P5 — RunSummary fails to validate
**Cause:** receipts harness emitted a malformed summary.
**Diagnostic:**
```bash
LATEST=$(ls -1t reports/distillation/*/summary.json | head -1)
bun -e "import {validateRunSummary} from './auditor/schemas/distillation/run_summary'; const s = JSON.parse(await Bun.file('$LATEST').text()); console.log(validateRunSummary(s))"
```
**Fix:** the field that failed validation is named in the validator output. Either fix the harness in `scripts/distillation/receipts.ts` or revert.

### P6 — acceptance gate fails an invariant
**Cause:** something in Phases 1-5 changed in a way the fixture catches.
**Diagnostic:** run acceptance directly to see all 22 checks:
```bash
bun run scripts/distillation/acceptance.ts
# scan output; the failing check names what broke
cat reports/distillation/phase6-acceptance-report.md
```
**Fix:** the report's "Failures" section names the invariant. The 22 invariants are documented at the top of `scripts/distillation/acceptance.ts`. Locate the affected phase code, revert.

### P7 — replay validation regressed
**Cause:** local model output failed the structural validator OR retrieval found 0 playbooks.
**Diagnostic:**
```bash
./scripts/distill replay --task "<task that previously passed>" --local-only
# inspect the validation_result.reasons
```
**Fix:**
- If validation reasons mention "filler/hedge phrase": the local model regressed (model swap?) — revert `LH_REPLAY_LOCAL_MODEL` to default
- If retrieval is empty: `exports/rag/playbooks.jsonl` is empty — re-run export-rag

## Symptom 2: drift table flags `warn`

A metric moved >20% from baseline. This is a SOFT alert — not a failure, but worth investigating before treating outputs as stable.

**Diagnostic:**
```bash
jq -r '.metrics' data/_kb/audit_baselines.jsonl | tail -5
# see the recent baseline trajectory
cat reports/distillation/phase8-full-audit-report.md
# read the drift table
```

**Common causes:**
- New source data → record counts grow → expected, not a regression
- Scorer rules changed → category distribution shifted → confirm intentional
- Exporter filter loosened → SFT/RAG counts grow → CHECK contamination firewall first

**If the drift is intentional**, write a row to `data/_kb/audit_baselines.jsonl` documenting why (no schema for this — just append a JSON line with a `notes` field). Future audits will treat the new value as the new baseline.

## Symptom 3: `acceptance` exits non-zero but `audit-full` doesn't

This is rare — acceptance is stricter (22 invariants on a fixture vs audit-full's 16 required checks on real data). The failing acceptance check usually points to a broken assumption that real data hides.

**Diagnostic:**
```bash
bun run scripts/distillation/acceptance.ts 2>&1 | head -50
ls /tmp/distillation_phase6_acceptance/data/evidence/  # the failed run leaves its temp root
```

**Fix:** the acceptance script keeps the temp root on fail (cleans only on pass). Inspect `/tmp/distillation_phase6_acceptance/` to see what the pipeline produced vs expected. The fixture rows themselves are the contract — change the fixture deliberately, never to make a test pass.

## Symptom 4: `run-all` produces empty exports

**Cause:** Phase 2 or Phase 3 ran on empty input.

**Diagnostic order:**
1. `ls data/_kb/*.jsonl` — sources present?
2. `find data/evidence -name "*.jsonl" | xargs wc -l` — any rows materialized?
3. `find data/scored-runs -name "*.jsonl" | xargs wc -l` — any rows scored?
4. `wc -l data/_kb/distillation_skips.jsonl data/_kb/scoring_skips.jsonl` — anything skipped?

The first counter that's 0 names the broken phase.

## Symptom 5: hash mismatch on identical input

**Cause:** determinism violation. Same input → different output. This is a CRITICAL bug.

**Diagnostic:**
```bash
# Wipe outputs but keep sources, run twice with same recorded_at, compare run_hash
rm -rf data/evidence data/scored-runs exports
RA="2026-04-27T00:00:00.000Z"
./scripts/distill run-all  # captures run_id_1
LATEST_1=$(ls -1t reports/distillation/ | grep -v phase | head -1)

rm -rf data/evidence data/scored-runs exports
./scripts/distill run-all
LATEST_2=$(ls -1t reports/distillation/ | grep -v phase | head -1)

jq .run_hash reports/distillation/$LATEST_1/summary.json
jq .run_hash reports/distillation/$LATEST_2/summary.json
# These MUST match if recorded_at is fixed.
```

**If they don't match:** something in the pipeline introduced non-determinism. Common causes:
- A `Date.now()` baked into output (other than the explicit `recorded_at`)
- `Math.random()` or `randomUUID()` in a path that should be deterministic
- A `Map` iteration order issue (rare in V8)
- Concurrent writes to the same file

Bisect against `distillation-v1.0.0` — find the commit that introduced the non-determinism, revert.

## Symptom 6: replay logs growing unbounded

Phase 7 replay appends to `data/_kb/replay_runs.jsonl` with no rotation. Acceptable until file >100MB, then:

```bash
# Move and start fresh
mv data/_kb/replay_runs.jsonl data/_kb/replay_runs.archive.$(date +%s).jsonl
gzip data/_kb/replay_runs.archive.*.jsonl
```

This is documented as a Phase 7 carry-over. A future Phase 10+ could add rotation.

## Last resort: nuclear restore

If nothing works:

```bash
git fetch --tags
git stash               # save uncommitted work
git checkout distillation-v1.0.0
./scripts/distill audit-full   # confirm 16/16 pass at v1.0.0
./scripts/distill acceptance   # confirm 22/22 pass

# Now diff against the broken state to see what changed
git diff distillation-v1.0.0..scrum/auto-apply-19814 -- scripts/distillation/ auditor/schemas/distillation/
```

The diff is your bug.