Final phase. Adds:
scripts/distillation/release_freeze.ts ~330 lines, 6 release gates
docs/distillation/operator-handoff.md durable cold-start operator doc
docs/distillation/recovery-runbook.md failure-mode runbook by symptom
scripts/distillation/distill.ts +release-freeze subcommand
The release_freeze orchestrator runs every gate the system has:
1. Clean git state (tolerates auto-regenerated reports)
2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
3. Phase commit verification (every Phase 0-8 commit resolves)
4. Acceptance gate (22-invariant fixture E2E)
5. audit-full (Phases 0-7 verified + drift detection)
6. Tag availability check (distillation-v1.0.0 not yet existing)
Outputs:
reports/distillation/release-freeze.md human-readable manifest
reports/distillation/release-manifest.json machine-readable manifest
Manifest captures:
- git_head + git_branch + released_at
- phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
- dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
- latest audit baseline row
- per-gate pass/fail with detail
Operator handoff doc covers:
- phase map with commits + report locations
- known-good commands
- how to rerun audit-full + inspect drift
- how to restore from last-good (git checkout distillation-v1.0.0)
- how to add future phases without contaminating corpus
- what NOT to modify casually (with file:reason mapping)
- cumulative commits at v1.0.0
Recovery runbook covers, by symptom:
- audit-full exit non-zero (per-phase diagnostics)
- drift table flags warn (intentional vs regression)
- acceptance fail vs audit-full pass divergence
- run-all empty exports (counter-bisection order)
- hash mismatch on identical input (determinism violation; CRITICAL)
- replay logs growing unbounded (rotation guidance)
- nuclear restore via git checkout distillation-v1.0.0
Spec constraints (per now.md Phase 9):
- DO NOT add new intelligence features ✓ (zero new logic)
- DO NOT change scoring/export logic ✓ (zero touches)
- DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
auto-regen tolerance documented in checkCleanGit)
- DO NOT retrain anything ✓ (no model touches)
CLI:
./scripts/distill release-freeze # exit 0 = release-ready
Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Distillation System — Operator Handoff
Version: v1.0.0
Branch: scrum/auto-apply-19814
Tag: distillation-v1.0.0
Audit baseline: data/_kb/audit_baselines.jsonl (auto-grown per audit-full run)
This is the operator-level handoff for the distillation system. If you are picking this up cold, read this doc first, then docs/recon/local-distillation-recon.md. Skim the per-phase reports under reports/distillation/ only when you need detail.
What this system does
Turns real Lakehouse execution traces (1052 records sampled at v1.0.0 freeze) into clean, gated training datasets:
- RAG corpus — 446 grounded examples for retrieval-augmentation
- SFT corpus — 351 instruction→response pairs (strict accepted-only)
- Preference corpus — 83 chosen/rejected pairs (zero self-pairs, zero identical-text)
It is NOT a model trainer. It is a knowledge refinery that produces training-safe substrate. The local-model "replay" runtime (Phase 7) demonstrates that retrieval against this substrate makes a 7B-class model behave like the system instead of fabricating audit verdicts.
Phase map
| Phase | What it does | Commit | Report |
|---|---|---|---|
| 0 | Recon doc: inventory of source streams + integration plan | 27b1d27 | docs/recon/local-distillation-recon.md |
| 1 | 9 schemas + 51 schema tests + foundation types | 27b1d27 | (in commit body) |
| 2 | Materializer: 12 source jsonls → unified EvidenceRecord at data/evidence/YYYY/MM/DD/ | 1ea8029 | (in commit body) |
| 3 | Deterministic Success Scorer: EvidenceRecord → ScoredRun (4 categories, no LLM) | c989253 | (in commit body) |
| 4 | RAG/SFT/Preference exports + quarantine system | 68b6697 | reports/distillation/phase4-export-report.md |
| 5 | Receipts harness: per-stage StageReceipt + RunSummary + DriftReport | 2cf359a | reports/distillation/phase5-receipts-report.md |
| 6 | Acceptance gate: fixture-driven 22-invariant E2E test | 1b433a9 | reports/distillation/phase6-acceptance-report.md |
| 7 | Replay layer: retrieval-driven local-model bootstrap | 681f39d | reports/distillation/phase7-replay-report.md |
| 8 | Full system audit + drift baseline | 5bdd159 | reports/distillation/phase8-full-audit-report.md |
The auditor rebuild (commit 20a039c) is wired to use the Phase 5 substrate: it now calls lakehouse_answers_v1 matrix retrieval instead of tree-split shard summaries. Per-audit cost: 50× fewer cloud calls, 17× faster wall-clock.
Known-good commands
All commands run from /home/profit/lakehouse. Use bun run scripts/distillation/distill.ts <command> or ./scripts/distill <command> if symlinked.
# Build everything end-to-end with structured receipts
./scripts/distill run-all
# Read a specific run's summary + drift
./scripts/distill receipts --run-id <uuid>
# Verify the system end-to-end on a deterministic fixture
./scripts/distill acceptance
# Audit Phases 0-7 + drift detection vs prior baseline
./scripts/distill audit-full
# Test a task through the replay layer (local model with retrieval)
./scripts/distill replay --task "<input>"
./scripts/distill replay --task "<input>" --no-retrieval # baseline / A/B
./scripts/distill replay --task "<input>" --allow-escalation # try deepseek if local fails
# Per-stage one-shot (rare — prefer run-all for receipts)
./scripts/distill build-evidence
./scripts/distill score
./scripts/distill export-rag
./scripts/distill export-sft # strict accepted-only
./scripts/distill export-sft --include-partial # opens to partially_accepted
./scripts/distill export-preference
./scripts/distill export-all
How to rerun the full audit
./scripts/distill audit-full
Reads:
- on-disk data/evidence/, data/scored-runs/, exports/{rag,sft,preference}/*
- the most recent run_id under reports/distillation/
- the prior audit baseline at data/_kb/audit_baselines.jsonl
Writes:
- reports/distillation/phase8-full-audit-report.md
- a new row to data/_kb/audit_baselines.jsonl (auto-grown; never overwrite)
Exit code 0 = pass (every required check held). Non-zero = at least one required check failed.
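The exit-code rule can be sketched as a pure function over check results. The `CheckResult` shape and the `exitCodeFor` name are assumptions made for illustration; the real audit code may track its checks differently.

```typescript
// Hypothetical shape of one audit check; field names are illustrative only.
interface CheckResult {
  name: string;
  required: boolean; // only required checks can fail the run
  passed: boolean;
  detail?: string;
}

// Return 0 when every required check held, 1 otherwise.
// Failed optional checks are reported elsewhere and do not change the code.
function exitCodeFor(results: CheckResult[]): number {
  const failedRequired = results.filter((r) => r.required && !r.passed);
  return failedRequired.length === 0 ? 0 : 1;
}
```

Under this sketch a failed optional check still exits 0, which is why the rule is stated in terms of required checks only.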
How to inspect drift
Two levels:
- Per-run drift: every run-all writes reports/distillation/<run_id>/drift.json. Compares to the most recent prior run. Severity ok | warn | alert.
- Cross-run baseline drift: audit-full reads the latest baseline row from data/_kb/audit_baselines.jsonl and compares 10 tracked metrics (record counts, category distribution, export sizes, quarantine totals). Drift table appears in reports/distillation/phase8-full-audit-report.md with >20% flagged as warn.
The baseline file is append-only. Don't truncate it: its value grows with the longitudinal record. If a metric flips to warn after a code change, the row written just before that change is the diagnostic anchor.
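As a rough illustration of the cross-run comparison, the sketch below flags any metric that moved more than 20% from the baseline. The `DriftRow` shape, the function name, and the zero-baseline handling are assumptions; only the 20% threshold and the metric names come from this doc.

```typescript
type Severity = "ok" | "warn";

// Hypothetical drift-table row; the real report may carry more fields.
interface DriftRow {
  metric: string;
  baseline: number;
  current: number;
  pct_change: number;
  severity: Severity;
}

// Threshold stated in this doc: changes above 20% are flagged as warn.
const WARN_THRESHOLD_PCT = 20;

function compareToBaseline(
  baseline: Record<string, number>,
  current: Record<string, number>,
): DriftRow[] {
  // Sort metric names so the table order is stable between runs.
  return Object.keys(baseline).sort().map((metric) => {
    const base = baseline[metric];
    const cur = current[metric] ?? 0; // assumption: a missing metric counts as 0
    // Assumption: a zero baseline is 0% if unchanged, otherwise 100%.
    const pct = base === 0
      ? (cur === 0 ? 0 : 100)
      : (Math.abs(cur - base) / Math.abs(base)) * 100;
    const severity: Severity = pct > WARN_THRESHOLD_PCT ? "warn" : "ok";
    return { metric, baseline: base, current: cur, pct_change: pct, severity };
  });
}
```

For example, a baseline of `{ p4_sft_rows: 351 }` against a current value of 400 is a change of about 14% and stays `ok`, while a drop to 250 is about 29% and would be flagged `warn`.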
How to restore from last good state
git fetch --tags
git checkout distillation-v1.0.0
./scripts/distill audit-full # confirm 16/16 required pass at v1.0.0
If you've made changes that broke the system, hard reset to v1.0.0:
git reset --hard distillation-v1.0.0 # destructive — loses uncommitted work
./scripts/distill acceptance # confirm 22/22 fixture invariants
./scripts/distill audit-full # confirm baseline match
How to add future phases without contaminating the corpus
The corpus = exports/rag/playbooks.jsonl + exports/sft/instruction_response.jsonl + exports/preference/chosen_rejected.jsonl. These are training-safe only if every gate held. To add Phase 10+:
- Add code under scripts/distillation/<your_phase>.ts. Do NOT modify Phases 0-8.
- If your phase produces evidence, append to data/_kb/<your_stream>.jsonl and add a transform in scripts/distillation/transforms.ts. The materializer picks it up automatically.
- If your phase needs a new schema, create auditor/schemas/distillation/<your_schema>.ts with _SCHEMA_VERSION = 1 constant + validator + tests in auditor/schemas/distillation/schemas.test.ts (positive + negative fixtures).
- Run ./scripts/distill audit-full BEFORE merging. Confirm 16/16 still passes.
- Run ./scripts/distill acceptance. Confirm 22/22 still passes.
- Re-run ./scripts/distill run-all. Inspect drift in the new run's drift.json. Anything >20% in record counts means your phase moved the corpus; explain it in the commit.
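The schema step above might look something like the following. Every name here (`EXAMPLE_STREAM_SCHEMA_VERSION`, `ExampleStreamRecord`, `validateExampleStreamRecord`) and the record fields are hypothetical; the sketch only shows the pattern of a version constant plus a validator that reports all problems at once. A real schema module would also export these names.

```typescript
// Hypothetical version constant for a new schema.
const EXAMPLE_STREAM_SCHEMA_VERSION = 1;

// Hypothetical record shape; real schemas define their own fields.
interface ExampleStreamRecord {
  schema_version: number;
  id: string;
  recorded_at: string; // timestamp string parseable by Date.parse
}

type ValidationResult =
  | { ok: true; value: ExampleStreamRecord }
  | { ok: false; errors: string[] };

// Validate untrusted input and collect every error rather than stopping at the first.
function validateExampleStreamRecord(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["record must be an object"] };
  }
  const rec = input as Record<string, unknown>;
  const errors: string[] = [];
  if (rec.schema_version !== EXAMPLE_STREAM_SCHEMA_VERSION) {
    errors.push(`schema_version must be ${EXAMPLE_STREAM_SCHEMA_VERSION}`);
  }
  if (typeof rec.id !== "string" || rec.id.length === 0) {
    errors.push("id must be a non-empty string");
  }
  if (typeof rec.recorded_at !== "string" || Number.isNaN(Date.parse(rec.recorded_at))) {
    errors.push("recorded_at must be a parseable timestamp");
  }
  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    value: {
      schema_version: EXAMPLE_STREAM_SCHEMA_VERSION,
      id: rec.id as string,
      recorded_at: rec.recorded_at as string,
    },
  };
}
```

The positive and negative fixtures the step asks for would then be one record that passes and at least one record per rule that fails.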
What NOT to modify casually
These have explicit firewalls. Touching them = potentially weakening contamination prevention:
| File | Why fragile |
|---|---|
| auditor/schemas/distillation/sft_sample.ts | The quality_score enum literally enforces "no rejected/needs_human_review in SFT". Loosening it = silent leak |
| scripts/distillation/export_sft.ts SFT_NEVER constant | Second-layer defense. If schema fails, this catches it |
| scripts/distillation/export_sft.ts re-read validation | Third layer: re-reads on-disk SFT and fails LOUD if forbidden quality_score appears |
| scripts/distillation/scorer.ts category mapping | Changing rules → silent corpus shift. Run audit-full after any change to see drift |
| tests/fixtures/distillation/acceptance/ | The fixture is the gate. Changing it = changing the bar |
| data/_kb/audit_baselines.jsonl | Append-only. Truncating loses longitudinal drift signal |
If you must change one of these, run audit-full BEFORE and AFTER. The drift table will tell you exactly what your change moved.
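To make the layered SFT defense concrete, here is a minimal sketch of the second and third layers. The `SFT_NEVER` name and the four category values appear in this doc; the row shape, the function names, and the exact logic are assumptions and are not taken from export_sft.ts.

```typescript
type QualityScore = "accepted" | "partially_accepted" | "rejected" | "needs_human_review";

// Hypothetical minimal row shape for illustration.
interface ScoredRow {
  id: string;
  quality_score: QualityScore;
}

// Layer 2: categories that must never reach SFT, whatever flags are passed.
const SFT_NEVER: ReadonlySet<string> = new Set(["rejected", "needs_human_review"]);

// Strict accepted-only by default; includePartial opens to partially_accepted.
function selectSftRows(rows: ScoredRow[], includePartial = false): ScoredRow[] {
  return rows.filter((row) => {
    if (SFT_NEVER.has(row.quality_score)) return false;
    if (row.quality_score === "accepted") return true;
    return includePartial && row.quality_score === "partially_accepted";
  });
}

// Layer 3: re-read the written JSONL text and fail loudly on any forbidden value.
function assertNoForbiddenRows(jsonl: string): void {
  jsonl.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines, e.g. the trailing newline
    const row = JSON.parse(line) as { quality_score?: unknown };
    if (typeof row.quality_score === "string" && SFT_NEVER.has(row.quality_score)) {
      throw new Error(`forbidden quality_score "${row.quality_score}" at line ${index + 1}`);
    }
  });
}
```

The point of the redundancy is that each layer checks a different artifact (the type, the in-memory rows, the bytes on disk), so loosening any single one does not silently open the corpus.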
Receipt-vs-drift quick reference
If audit-full flags a metric:
- >20% swing in p3_accepted → scorer rules changed OR source data shifted
- >20% swing in p4_sft_rows → SFT eligibility changed (check exporter filter)
- >20% swing in p4_total_quarantined → either source data is dirtier OR a gate got tighter
- Hash mismatch on identical input → determinism violation; revert immediately
If acceptance fails:
- 22 invariants are pinned in scripts/distillation/acceptance.ts. The failing one names what broke.
- Spec invariants (1-22) are documented in reports/distillation/phase6-acceptance-report.md.
Pointers to non-distillation systems
The auditor (auditor/) and the gateway (crates/gateway/) are the consumers of the distillation substrate. They use it but are not part of it:
- Auditor's pr_audit mode (crates/gateway/src/v1/mode.rs) retrieves from lakehouse_answers_v1. If you regenerate the RAG export, the auditor's context auto-improves on next call.
- The gateway's /v1/chat is the entry point all model calls flow through. Receipts capture provider, model, latency, prompt+completion tokens.
Provenance
Every export row → traces to data/scored-runs/.../<source>.jsonl line N → traces to data/evidence/.../<source>.jsonl line N → traces to data/_kb/<source>.jsonl line N. The provenance.sig_hash field (canonical sha256 of the source row, sorted keys) is the join key.
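One way a canonical, sorted-key sha256 could be computed is sketched below. This is not the system's actual canonicalization: the serialization rules (compact JSON, keys sorted at every depth) and the `canonicalize` and `sigHash` names are assumptions, so hashes from this sketch should not be expected to match stored sig_hash values.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every depth, so key order in the
// source row never changes the output string.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") + "}";
}

// Hex-encoded sha256 of the canonical form.
function sigHash(row: Json): string {
  return createHash("sha256").update(canonicalize(row), "utf8").digest("hex");
}
```

Because the hash depends only on content, two rows that differ only in key order produce the same value, which is what lets a hash serve as a join key across the evidence, scored-run, and export layers.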
If a downstream consumer asks "where did this SFT row come from", run:
jq 'select(.id == "<sft_id>") | .provenance' exports/sft/instruction_response.jsonl
# returns {source_file, line_offset, sig_hash, recorded_at}
# Then:
sed -n "$((<line_offset> + 1))p" data/scored-runs/<source_file>
# And so on back to data/_kb/<original_source>.jsonl
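The same lookup can be scripted. The sketch below works on JSONL text already read into memory; the function names are hypothetical, the provenance field names follow the jq output above, and line_offset is treated as zero-based to match the `+ 1` in the sed command.

```typescript
// Field names follow the jq output shown above.
interface Provenance {
  source_file: string;
  line_offset: number;
  sig_hash: string;
  recorded_at: string;
}

// Find the provenance block for one SFT row id, or null if the id is absent.
function findProvenance(sftJsonl: string, sftId: string): Provenance | null {
  for (const line of sftJsonl.split("\n")) {
    if (line.trim() === "") continue;
    const row = JSON.parse(line) as { id?: string; provenance?: Provenance };
    if (row.id === sftId && row.provenance) return row.provenance;
  }
  return null;
}

// Return the line at a zero-based offset, or null when the offset is out of range.
function lineAtOffset(jsonl: string, lineOffset: number): string | null {
  const lines = jsonl.split("\n");
  return lineOffset >= 0 && lineOffset < lines.length ? lines[lineOffset] : null;
}
```

Repeating the pair of calls against each upstream file walks the chain back one layer at a time.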
Test discipline
bun test tests/distillation/ auditor/schemas/distillation/
At v1.0.0: 145 tests, 0 fail, 372 expect() calls, ~600ms. Any new phase must keep this at 0 fail.
Cumulative commits at v1.0.0
27b1d27 distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold
1ea8029 distillation: Phase 2 — Evidence View materializer + health audit
c989253 distillation: Phase 3 — deterministic Success Scorer
68b6697 distillation: Phase 4 — dataset export layer
2cf359a distillation: Phase 5 — receipts harness (system-level observability)
1b433a9 distillation: Phase 6 — acceptance gate suite
20a039c auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
681f39d distillation: Phase 7 — replay-driven local model bootstrapping
5bdd159 distillation: Phase 8 — full system audit
<this> distillation: Phase 9 — release freeze and operator handoff