Final phase. Adds:

  scripts/distillation/release_freeze.ts   ~330 lines, 6 release gates
  docs/distillation/operator-handoff.md    durable cold-start operator doc
  docs/distillation/recovery-runbook.md    failure-mode runbook by symptom
  scripts/distillation/distill.ts          +release-freeze subcommand

The release_freeze orchestrator runs every gate the system has:

  1. Clean git state (tolerates auto-regenerated reports)
  2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
  3. Phase commit verification (every Phase 0-8 commit resolves)
  4. Acceptance gate (22-invariant fixture E2E)
  5. audit-full (Phases 0-7 verified + drift detection)
  6. Tag availability check (distillation-v1.0.0 not yet existing)

Outputs:

  reports/distillation/release-freeze.md      human-readable manifest
  reports/distillation/release-manifest.json  machine-readable manifest

Manifest captures:

  - git_head + git_branch + released_at
  - phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
  - dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
  - latest audit baseline row
  - per-gate pass/fail with detail

Operator handoff doc covers:

  - phase map with commits + report locations
  - known-good commands
  - how to rerun audit-full + inspect drift
  - how to restore from last-good (git checkout distillation-v1.0.0)
  - how to add future phases without contaminating corpus
  - what NOT to modify casually (with file:reason mapping)
  - cumulative commits at v1.0.0

Recovery runbook covers, by symptom:

  - audit-full exit non-zero (per-phase diagnostics)
  - drift table flags warn (intentional vs regression)
  - acceptance fail vs audit-full pass divergence
  - run-all empty exports (counter-bisection order)
  - hash mismatch on identical input (determinism violation; CRITICAL)
  - replay logs growing unbounded (rotation guidance)
  - nuclear restore via git checkout distillation-v1.0.0

Spec constraints (per now.md Phase 9):

  - DO NOT add new intelligence features ✓ (zero new logic)
  - DO NOT change scoring/export logic ✓ (zero touches)
  - DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
    auto-regen tolerance documented in checkCleanGit)
  - DO NOT retrain anything ✓ (no model touches)

CLI:

  ./scripts/distill release-freeze   # exit 0 = release-ready

Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
192 lines
10 KiB
Markdown
# Distillation System — Operator Handoff

**Version:** v1.0.0
**Branch:** `scrum/auto-apply-19814`
**Tag:** `distillation-v1.0.0`
**Audit baseline:** `data/_kb/audit_baselines.jsonl` (auto-grown per audit-full run)

This is the operator-level handoff for the distillation system. If you are picking this up cold, **read this doc first**, then `docs/recon/local-distillation-recon.md`. Skim the per-phase reports under `reports/distillation/` only when you need detail.

## What this system does

Turns real Lakehouse execution traces (1052 records sampled at v1.0.0 freeze) into clean, gated training datasets:

- **RAG corpus** — 446 grounded examples for retrieval-augmentation
- **SFT corpus** — 351 instruction→response pairs (strict accepted-only)
- **Preference corpus** — 83 chosen/rejected pairs (zero self-pairs, zero identical-text)

It is **NOT** a model trainer. It is a **knowledge refinery** that produces training-safe substrate. The local-model "replay" runtime (Phase 7) demonstrates that retrieval against this substrate makes a 7B-class model behave like the system instead of fabricating audit verdicts.

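The two preference-corpus guarantees above (zero self-pairs, zero identical-text) can be illustrated with a small check. This is a minimal sketch, not the exporter's actual code: the `PreferencePair` shape and its field names are assumptions made for illustration, and the real schema may differ.

```typescript
// Hypothetical row shape; the real preference schema may use different field names.
interface PreferencePair {
  id: string;
  chosen_id: string;
  rejected_id: string;
  chosen_text: string;
  rejected_text: string;
}

type PairViolation = { id: string; reason: "self_pair" | "identical_text" };

// Return every pair that breaks one of the two stated invariants.
function findPairViolations(pairs: PreferencePair[]): PairViolation[] {
  const violations: PairViolation[] = [];
  for (const p of pairs) {
    if (p.chosen_id === p.rejected_id) {
      // The same run appears on both sides of the pair.
      violations.push({ id: p.id, reason: "self_pair" });
    } else if (p.chosen_text.trim() === p.rejected_text.trim()) {
      // Different runs, but nothing for a preference learner to distinguish.
      violations.push({ id: p.id, reason: "identical_text" });
    }
  }
  return violations;
}
```

An empty result corresponds to the "zero self-pairs, zero identical-text" claim; any non-empty result would be grounds to quarantine the offending pairs.
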
## Phase map

| Phase | What it does | Commit | Report |
|---|---|---|---|
| 0 | Recon doc — inventory of source streams + integration plan | 27b1d27 | `docs/recon/local-distillation-recon.md` |
| 1 | 9 schemas + 51 schema tests + foundation types | 27b1d27 | (in commit body) |
| 2 | Materializer: 12 source jsonls → unified EvidenceRecord at `data/evidence/YYYY/MM/DD/` | 1ea8029 | (in commit body) |
| 3 | Deterministic Success Scorer: EvidenceRecord → ScoredRun (4 categories, no LLM) | c989253 | (in commit body) |
| 4 | RAG/SFT/Preference exports + quarantine system | 68b6697 | `reports/distillation/phase4-export-report.md` |
| 5 | Receipts harness — per-stage StageReceipt + RunSummary + DriftReport | 2cf359a | `reports/distillation/phase5-receipts-report.md` |
| 6 | Acceptance gate — fixture-driven 22-invariant E2E test | 1b433a9 | `reports/distillation/phase6-acceptance-report.md` |
| 7 | Replay layer — retrieval-driven local-model bootstrap | 681f39d | `reports/distillation/phase7-replay-report.md` |
| 8 | Full system audit + drift baseline | 5bdd159 | `reports/distillation/phase8-full-audit-report.md` |

The auditor rebuild (commit 20a039c) is wired to use the Phase 5 substrate: it now calls `lakehouse_answers_v1` matrix retrieval instead of tree-split shard summaries. Per-audit cost: 50× fewer cloud calls, 17× faster wall-clock.

## Known-good commands

All commands run from `/home/profit/lakehouse`. Use `bun run scripts/distillation/distill.ts <command>` or `./scripts/distill <command>` if symlinked.

```bash
# Build everything end-to-end with structured receipts
./scripts/distill run-all

# Read a specific run's summary + drift
./scripts/distill receipts --run-id <uuid>

# Verify the system end-to-end on a deterministic fixture
./scripts/distill acceptance

# Audit Phases 0-7 + drift detection vs prior baseline
./scripts/distill audit-full

# Test a task through the replay layer (local model with retrieval)
./scripts/distill replay --task "<input>"
./scripts/distill replay --task "<input>" --no-retrieval      # baseline / A/B
./scripts/distill replay --task "<input>" --allow-escalation  # try deepseek if local fails

# Per-stage one-shot (rare — prefer run-all for receipts)
./scripts/distill build-evidence
./scripts/distill score
./scripts/distill export-rag
./scripts/distill export-sft                    # strict accepted-only
./scripts/distill export-sft --include-partial  # opens to partially_accepted
./scripts/distill export-preference
./scripts/distill export-all
```

## How to rerun the full audit

```bash
./scripts/distill audit-full
```

Reads:
- on-disk `data/evidence/`, `data/scored-runs/`, `exports/{rag,sft,preference}/*`
- the most recent run_id under `reports/distillation/`
- the prior audit baseline at `data/_kb/audit_baselines.jsonl`

Writes:
- `reports/distillation/phase8-full-audit-report.md`
- a new row to `data/_kb/audit_baselines.jsonl` (auto-grown — never overwrite)

Exit code 0 = pass (every required check held). Non-zero = at least one required check failed.

## How to inspect drift

Two levels:

1. **Per-run drift** — every `run-all` writes `reports/distillation/<run_id>/drift.json`. Compares to the most recent prior run. Severity `ok | warn | alert`.

2. **Cross-run baseline drift** — `audit-full` reads the latest baseline row from `data/_kb/audit_baselines.jsonl` and compares 10 tracked metrics (record counts, category distribution, export sizes, quarantine totals). Drift table appears in `reports/distillation/phase8-full-audit-report.md` with `>20%` flagged as `warn`.

The baseline file is **append-only**. Don't truncate it — its value grows with the longitudinal record. If a metric flips `warn` after a code change, the row before that change is the diagnostic anchor.

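The `>20%` rule for cross-run baseline drift can be sketched as a relative-change comparison per metric. This is an illustrative sketch under stated assumptions, not the code in `audit-full`: the function name, the row shape, and the handling of a zero baseline are all hypothetical, and only the 20% threshold and the `warn` label come from this document.

```typescript
type Severity = "ok" | "warn";

interface DriftRow {
  metric: string;
  baseline: number;
  current: number;
  changePct: number;
  severity: Severity;
}

// Threshold taken from this document; everything else here is assumed.
const WARN_THRESHOLD_PCT = 20;

// Compare each baseline metric to the current value and flag swings above the threshold.
function compareToBaseline(
  baseline: Record<string, number>,
  current: Record<string, number>,
): DriftRow[] {
  const rows: DriftRow[] = [];
  for (const metric of Object.keys(baseline).sort()) {
    const before = baseline[metric] ?? 0;
    const after = current[metric] ?? 0; // a missing metric counts as zero
    // Assumption: any move away from a zero baseline is reported as 100%.
    const changePct =
      before === 0
        ? (after === 0 ? 0 : 100)
        : (Math.abs(after - before) * 100) / Math.abs(before);
    const severity: Severity = changePct > WARN_THRESHOLD_PCT ? "warn" : "ok";
    rows.push({ metric, baseline: before, current: after, changePct, severity });
  }
  return rows;
}
```

For example, an SFT export that drops from 351 rows to 200 is a swing of about 43% and would be flagged `warn`, while a move from 100 to 110 accepted records (10%) would stay `ok`.
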
## How to restore from last good state

```bash
git fetch --tags
git checkout distillation-v1.0.0
./scripts/distill audit-full   # confirm 16/16 required pass at v1.0.0
```

If you've made changes that broke the system, hard reset to v1.0.0:

```bash
git reset --hard distillation-v1.0.0   # destructive — loses uncommitted work
./scripts/distill acceptance           # confirm 22/22 fixture invariants
./scripts/distill audit-full           # confirm baseline match
```

## How to add future phases without contaminating the corpus

The corpus = `exports/rag/playbooks.jsonl` + `exports/sft/instruction_response.jsonl` + `exports/preference/chosen_rejected.jsonl`. These are training-safe **only if** every gate held. To add Phase 10+:

1. Add code under `scripts/distillation/<your_phase>.ts`. Do NOT modify Phases 0-8.
2. If your phase produces evidence, append to `data/_kb/<your_stream>.jsonl` and add a transform in `scripts/distillation/transforms.ts`. The materializer picks it up automatically.
3. If your phase needs a new schema, create `auditor/schemas/distillation/<your_schema>.ts` with `_SCHEMA_VERSION = 1` constant + validator + tests in `auditor/schemas/distillation/schemas.test.ts` (positive + negative fixtures).
4. Run `./scripts/distill audit-full` BEFORE merging. Confirm 16/16 still passes.
5. Run `./scripts/distill acceptance`. Confirm 22/22 still passes.
6. Re-run `./scripts/distill run-all`. Inspect drift in the new run's `drift.json`. Anything `>20%` in record counts means your phase moved the corpus — explain it in the commit.

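Step 3 above (a versioned schema with a validator and positive/negative fixtures) might look roughly like the following. This is a hedged sketch: `ExampleRecord`, its fields, and `validateExampleRecord` are hypothetical names invented for illustration, and the existing schemas under `auditor/schemas/distillation/` may be structured differently. A real schema module would export these symbols.

```typescript
// Hypothetical schema; the record name and fields are assumptions for illustration.
const EXAMPLE_RECORD_SCHEMA_VERSION = 1;

interface ExampleRecord {
  schema_version: number;
  id: string;
  recorded_at: string; // ISO 8601 timestamp
}

type ValidationResult =
  | { ok: true; value: ExampleRecord }
  | { ok: false; errors: string[] };

// Validate an untrusted row, collecting every problem instead of stopping at the first.
function validateExampleRecord(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["expected an object"] };
  }
  const row = input as Record<string, unknown>;
  const errors: string[] = [];
  if (row["schema_version"] !== EXAMPLE_RECORD_SCHEMA_VERSION) {
    errors.push(`schema_version must be ${EXAMPLE_RECORD_SCHEMA_VERSION}`);
  }
  const id = row["id"];
  if (typeof id !== "string" || id.length === 0) {
    errors.push("id must be a non-empty string");
  }
  const recordedAt = row["recorded_at"];
  if (typeof recordedAt !== "string" || Number.isNaN(Date.parse(recordedAt))) {
    errors.push("recorded_at must be a parseable timestamp");
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: row as unknown as ExampleRecord };
}
```

A positive fixture would be a fully populated row at the current version; negative fixtures would each break exactly one rule so that a failing test names the rule that regressed.
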
## What NOT to modify casually

These have explicit firewalls. Touching them = potentially weakening contamination prevention:

| File | Why fragile |
|---|---|
| `auditor/schemas/distillation/sft_sample.ts` | The `quality_score` enum literally enforces "no rejected/needs_human_review in SFT". Loosening it = silent leak |
| `scripts/distillation/export_sft.ts` `SFT_NEVER` constant | Second-layer defense. If schema fails, this catches it |
| `scripts/distillation/export_sft.ts` re-read validation | Third layer — re-reads on-disk SFT and fails LOUD if forbidden quality_score appears |
| `scripts/distillation/scorer.ts` category mapping | Changing rules → silent corpus shift. Run `audit-full` after any change to see drift |
| `tests/fixtures/distillation/acceptance/` | The fixture is the gate. Changing it = changing the bar |
| `data/_kb/audit_baselines.jsonl` | Append-only. Truncating loses longitudinal drift signal |

If you must change one of these, run `audit-full` BEFORE and AFTER. The drift table will tell you exactly what your change moved.

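The second and third SFT defense layers in the table can be pictured as a deny-list filter followed by a re-read check over the written rows. The sketch below is an assumption-laden illustration, not the contents of `export_sft.ts`: the members of `SFT_NEVER`, the `ScoredRow` shape, and both function names are hypothetical, inferred only from the category names mentioned in this document.

```typescript
type Category = "accepted" | "partially_accepted" | "rejected" | "needs_human_review";

// Assumed contents; the real SFT_NEVER constant lives in export_sft.ts.
const SFT_NEVER: ReadonlySet<string> = new Set<string>(["rejected", "needs_human_review"]);

// Hypothetical row shape for illustration.
interface ScoredRow {
  id: string;
  quality_score: Category;
}

// Second layer: drop forbidden categories, then apply the eligibility rule.
function selectSftRows(rows: ScoredRow[], includePartial = false): ScoredRow[] {
  return rows.filter((r) => {
    if (SFT_NEVER.has(r.quality_score)) return false;
    if (r.quality_score === "accepted") return true;
    return includePartial && r.quality_score === "partially_accepted";
  });
}

// Third layer: re-read the written JSONL and throw if a forbidden score slipped through.
function assertNoForbiddenScores(jsonl: string): void {
  jsonl.split("\n").forEach((line, index) => {
    if (line.trim() === "") return;
    const row = JSON.parse(line) as { quality_score?: unknown };
    if (typeof row.quality_score === "string" && SFT_NEVER.has(row.quality_score)) {
      throw new Error(`forbidden quality_score "${row.quality_score}" at line ${index + 1}`);
    }
  });
}
```

The point of keeping the re-read check separate from the filter is that it inspects what actually landed on disk, so it still fires if the filter or the schema is loosened by mistake.
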
## Receipt-vs-drift quick reference

If `audit-full` flags a metric:
- `>20%` swing in `p3_accepted` → scorer rules changed OR source data shifted
- `>20%` swing in `p4_sft_rows` → SFT eligibility changed (check exporter filter)
- `>20%` swing in `p4_total_quarantined` → either source data is dirtier OR a gate got tighter
- Hash mismatch on identical input → determinism violation; revert immediately

If `acceptance` fails:
- 22 invariants are pinned in `scripts/distillation/acceptance.ts`. The failing one names what broke.
- Spec invariants (1-22) are documented in `reports/distillation/phase6-acceptance-report.md`.

## Pointers to non-distillation systems

The auditor (`auditor/`) and the gateway (`crates/gateway/`) are the consumers of the distillation substrate. They use it but are not part of it:

- Auditor's `pr_audit` mode (`crates/gateway/src/v1/mode.rs`) retrieves from `lakehouse_answers_v1`. If you regenerate the RAG export, the auditor's context auto-improves on next call.
- The gateway's `/v1/chat` is the entry point all model calls flow through. Receipts capture provider, model, latency, prompt+completion tokens.

## Provenance

Every export row → traces to `data/scored-runs/.../<source>.jsonl` line N → traces to `data/evidence/.../<source>.jsonl` line N → traces to `data/_kb/<source>.jsonl` line N. The `provenance.sig_hash` field (canonical sha256 of the source row, sorted keys) is the join key.

If a downstream consumer asks "where did this SFT row come from", run:

```bash
jq 'select(.id == "<sft_id>") | .provenance' exports/sft/instruction_response.jsonl
# returns {source_file, line_offset, sig_hash, recorded_at}
# Then:
sed -n "$((<line_offset> + 1))p" data/scored-runs/<source_file>
# And so on back to data/_kb/<original_source>.jsonl
```

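The "canonical sha256 of the source row, sorted keys" behind `provenance.sig_hash` can be sketched as follows. This is one plausible canonicalization, stated as an assumption: the serialization details (recursive key sorting, no whitespace, hex digest) and the names `canonicalize` and `sigHash` are illustrative, and the real implementation may differ in ways that change the resulting hash.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level so key order never affects the output.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const parts = Object.keys(value)
    .sort()
    .map((key) => JSON.stringify(key) + ":" + canonicalize(value[key] as Json));
  return "{" + parts.join(",") + "}";
}

// Hash the canonical form; rows with the same content in a different key order hash the same.
function sigHash(row: Json): string {
  return createHash("sha256").update(canonicalize(row), "utf8").digest("hex");
}
```

Because the hash depends only on content, it can serve as a join key across the scored, evidence, and source layers even when a stage rewrites a row with its keys in a different order.
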
## Test discipline

```bash
bun test tests/distillation/ auditor/schemas/distillation/
```

At v1.0.0: **145 tests, 0 fail, 372 expect() calls, ~600ms.** Any new phase must keep this at 0 fail.

## Cumulative commits at v1.0.0

```
27b1d27  distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold
1ea8029  distillation: Phase 2 — Evidence View materializer + health audit
c989253  distillation: Phase 3 — deterministic Success Scorer
68b6697  distillation: Phase 4 — dataset export layer
2cf359a  distillation: Phase 5 — receipts harness (system-level observability)
1b433a9  distillation: Phase 6 — acceptance gate suite
20a039c  auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
681f39d  distillation: Phase 7 — replay-driven local model bootstrapping
5bdd159  distillation: Phase 8 — full system audit
<this>   distillation: Phase 9 — release freeze and operator handoff
```