# Distillation System — Operator Handoff **Version:** v1.0.0 **Branch:** `scrum/auto-apply-19814` **Tag:** `distillation-v1.0.0` **Audit baseline:** `data/_kb/audit_baselines.jsonl` (auto-grown per audit-full run) This is the operator-level handoff for the distillation system. If you are picking this up cold, **read this doc first**, then `docs/recon/local-distillation-recon.md`. Skim the per-phase reports under `reports/distillation/` only when you need detail. ## What this system does Turns real Lakehouse execution traces (1052 records sampled at v1.0.0 freeze) into clean, gated training datasets: - **RAG corpus** — 446 grounded examples for retrieval-augmentation - **SFT corpus** — 351 instruction→response pairs (strict accepted-only) - **Preference corpus** — 83 chosen/rejected pairs (zero self-pairs, zero identical-text) It is **NOT** a model trainer. It is a **knowledge refinery** that produces training-safe substrate. The local-model "replay" runtime (Phase 7) demonstrates that retrieval against this substrate makes a 7B-class model behave like the system instead of fabricating audit verdicts. ## Phase map | Phase | What it does | Commit | Report | |---|---|---|---| | 0 | Recon doc — inventory of source streams + integration plan | 27b1d27 | `docs/recon/local-distillation-recon.md` | | 1 | 9 schemas + 51 schema tests + foundation types | 27b1d27 | (in commit body) | | 2 | Materializer: 12 source jsonls → unified EvidenceRecord at `data/evidence/YYYY/MM/DD/` | 1ea8029 | (in commit body) | | 3 | Deterministic Success Scorer: EvidenceRecord → ScoredRun (4 categories, no LLM) | c989253 | (in commit body) | | 4 | RAG/SFT/Preference exports + quarantine system | 68b6697 | `reports/distillation/phase4-export-report.md` | | 5 | Receipts harness — per-stage StageReceipt + RunSummary + DriftReport | 2cf359a | `reports/distillation/phase5-receipts-report.md` | | 6 | Acceptance gate — fixture-driven 22-invariant E2E test | 1b433a9 | `reports/distillation/phase6-acceptance-report.md` | | 7 | Replay layer — retrieval-driven local-model bootstrap | 681f39d | `reports/distillation/phase7-replay-report.md` | | 8 | Full system audit + drift baseline | 5bdd159 | `reports/distillation/phase8-full-audit-report.md` | The auditor rebuild (commit 20a039c) is wired to use the Phase 5 substrate: it now calls `lakehouse_answers_v1` matrix retrieval instead of tree-split shard summaries. Per-audit cost: 50× fewer cloud calls, 17× faster wall-clock. ## Known-good commands All commands run from `/home/profit/lakehouse`. Use `bun run scripts/distillation/distill.ts ` or `./scripts/distill ` if symlinked. ```bash # Build everything end-to-end with structured receipts ./scripts/distill run-all # Read a specific run's summary + drift ./scripts/distill receipts --run-id # Verify the system end-to-end on a deterministic fixture ./scripts/distill acceptance # Audit Phases 0-7 + drift detection vs prior baseline ./scripts/distill audit-full # Test a task through the replay layer (local model with retrieval) ./scripts/distill replay --task "" ./scripts/distill replay --task "" --no-retrieval # baseline / A/B ./scripts/distill replay --task "" --allow-escalation # try deepseek if local fails # Per-stage one-shot (rare — prefer run-all for receipts) ./scripts/distill build-evidence ./scripts/distill score ./scripts/distill export-rag ./scripts/distill export-sft # strict accepted-only ./scripts/distill export-sft --include-partial # opens to partially_accepted ./scripts/distill export-preference ./scripts/distill export-all ``` ## How to rerun the full audit ```bash ./scripts/distill audit-full ``` Reads: - on-disk `data/evidence/`, `data/scored-runs/`, `exports/{rag,sft,preference}/*` - the most recent run_id under `reports/distillation/` - the prior audit baseline at `data/_kb/audit_baselines.jsonl` Writes: - `reports/distillation/phase8-full-audit-report.md` - a new row to `data/_kb/audit_baselines.jsonl` (auto-grown — never overwrite) Exit code 0 = pass (every required check held). Non-zero = at least one required check failed. ## How to inspect drift Two levels: 1. **Per-run drift** — every `run-all` writes `reports/distillation//drift.json`. Compares to the most recent prior run. Severity `ok | warn | alert`. 2. **Cross-run baseline drift** — `audit-full` reads the latest baseline row from `data/_kb/audit_baselines.jsonl` and compares 10 tracked metrics (record counts, category distribution, export sizes, quarantine totals). Drift table appears in `reports/distillation/phase8-full-audit-report.md` with `>20%` flagged as `warn`. The baseline file is **append-only**. Don't truncate it — its value grows with the longitudinal record. If a metric flips `warn` after a code change, the row before that change is the diagnostic anchor. ## How to restore from last good state ```bash git fetch --tags git checkout distillation-v1.0.0 ./scripts/distill audit-full # confirm 16/16 required pass at v1.0.0 ``` If you've made changes that broke the system, hard reset to v1.0.0: ```bash git reset --hard distillation-v1.0.0 # destructive — loses uncommitted work ./scripts/distill acceptance # confirm 22/22 fixture invariants ./scripts/distill audit-full # confirm baseline match ``` ## How to add future phases without contaminating the corpus The corpus = `exports/rag/playbooks.jsonl` + `exports/sft/instruction_response.jsonl` + `exports/preference/chosen_rejected.jsonl`. These are training-safe **only if** every gate held. To add Phase 10+: 1. Add code under `scripts/distillation/.ts`. Do NOT modify Phases 0-8. 2. If your phase produces evidence, append to `data/_kb/.jsonl` and add a transform in `scripts/distillation/transforms.ts`. The materializer picks it up automatically. 3. If your phase needs a new schema, create `auditor/schemas/distillation/.ts` with `_SCHEMA_VERSION = 1` constant + validator + tests in `auditor/schemas/distillation/schemas.test.ts` (positive + negative fixtures). 4. Run `./scripts/distill audit-full` BEFORE merging. Confirm 16/16 still passes. 5. Run `./scripts/distill acceptance`. Confirm 22/22 still passes. 6. Re-run `./scripts/distill run-all`. Inspect drift in the new run's `drift.json`. Anything `>20%` in record counts means your phase moved the corpus — explain it in the commit. ## What NOT to modify casually These have explicit firewalls. Touching them = potentially weakening contamination prevention: | File | Why fragile | |---|---| | `auditor/schemas/distillation/sft_sample.ts` | The `quality_score` enum literally enforces "no rejected/needs_human_review in SFT". Loosening it = silent leak | | `scripts/distillation/export_sft.ts` `SFT_NEVER` constant | Second-layer defense. If schema fails, this catches it | | `scripts/distillation/export_sft.ts` re-read validation | Third layer — re-reads on-disk SFT and fails LOUD if forbidden quality_score appears | | `scripts/distillation/scorer.ts` category mapping | Changing rules → silent corpus shift. Run `audit-full` after any change to see drift | | `tests/fixtures/distillation/acceptance/` | The fixture is the gate. Changing it = changing the bar | | `data/_kb/audit_baselines.jsonl` | Append-only. Truncating loses longitudinal drift signal | If you must change one of these, run `audit-full` BEFORE and AFTER. The drift table will tell you exactly what your change moved. ## Receipt-vs-drift quick reference If `audit-full` flags a metric: - `>20%` swing in `p3_accepted` → scorer rules changed OR source data shifted - `>20%` swing in `p4_sft_rows` → SFT eligibility changed (check exporter filter) - `>20%` swing in `p4_total_quarantined` → either source data is dirtier OR a gate got tighter - Hash mismatch on identical input → determinism violation; revert immediately If `acceptance` fails: - 22 invariants are pinned in `scripts/distillation/acceptance.ts`. The failing one names what broke. - Spec invariants (1-22) are documented in `reports/distillation/phase6-acceptance-report.md`. ## Pointers to non-distillation systems The auditor (`auditor/`) and the gateway (`crates/gateway/`) are the consumers of the distillation substrate. They use it but are not part of it: - Auditor's `pr_audit` mode (`crates/gateway/src/v1/mode.rs`) retrieves from `lakehouse_answers_v1`. If you regenerate the RAG export, the auditor's context auto-improves on next call. - The gateway's `/v1/chat` is the entry point all model calls flow through. Receipts capture provider, model, latency, prompt+completion tokens. ## Provenance Every export row → traces to `data/scored-runs/.../.jsonl` line N → traces to `data/evidence/.../.jsonl` line N → traces to `data/_kb/.jsonl` line N. The `provenance.sig_hash` field (canonical sha256 of the source row, sorted keys) is the join key. If a downstream consumer asks "where did this SFT row come from", run: ```bash jq 'select(.id == "") | .provenance' exports/sft/instruction_response.jsonl # returns {source_file, line_offset, sig_hash, recorded_at} # Then: sed -n "$(( + 1))p" data/scored-runs/ # And so on back to data/_kb/.jsonl ``` ## Test discipline ```bash bun test tests/distillation/ auditor/schemas/distillation/ ``` At v1.0.0: **145 tests, 0 fail, 372 expect() calls, ~600ms.** Any new phase must keep this at 0 fail. ## Cumulative commits at v1.0.0 ``` 27b1d27 distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold 1ea8029 distillation: Phase 2 — Evidence View materializer + health audit c989253 distillation: Phase 3 — deterministic Success Scorer 68b6697 distillation: Phase 4 — dataset export layer 2cf359a distillation: Phase 5 — receipts harness (system-level observability) 1b433a9 distillation: Phase 6 — acceptance gate suite 20a039c auditor: rebuild on mode runner + drop tree-split (use distillation substrate) 681f39d distillation: Phase 7 — replay-driven local model bootstrapping 5bdd159 distillation: Phase 8 — full system audit distillation: Phase 9 — release freeze and operator handoff ```