root 73f242e3e4 distillation: Phase 9 — release freeze and operator handoff

Final phase. Adds:
  scripts/distillation/release_freeze.ts   ~330 lines, 6 release gates
  docs/distillation/operator-handoff.md    durable cold-start operator doc
  docs/distillation/recovery-runbook.md    failure-mode runbook by symptom
  scripts/distillation/distill.ts          +release-freeze subcommand

The release_freeze orchestrator runs every gate the system has:
  1. Clean git state (tolerates auto-regenerated reports)
  2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
  3. Phase commit verification (every Phase 0-8 commit resolves)
  4. Acceptance gate (22-invariant fixture E2E)
  5. audit-full (Phases 0-7 verified + drift detection)
  6. Tag availability check (distillation-v1.0.0 not yet existing)

Outputs:
  reports/distillation/release-freeze.md       human-readable manifest
  reports/distillation/release-manifest.json   machine-readable manifest

Manifest captures:
  - git_head + git_branch + released_at
  - phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
  - dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
  - latest audit baseline row
  - per-gate pass/fail with detail

Operator handoff doc covers:
  - phase map with commits + report locations
  - known-good commands
  - how to rerun audit-full + inspect drift
  - how to restore from last-good (git checkout distillation-v1.0.0)
  - how to add future phases without contaminating corpus
  - what NOT to modify casually (with file:reason mapping)
  - cumulative commits at v1.0.0

Recovery runbook covers, by symptom:
  - audit-full exit non-zero (per-phase diagnostics)
  - drift table flags warn (intentional vs regression)
  - acceptance fail vs audit-full pass divergence
  - run-all empty exports (counter-bisection order)
  - hash mismatch on identical input (determinism violation; CRITICAL)
  - replay logs growing unbounded (rotation guidance)
  - nuclear restore via git checkout distillation-v1.0.0

Spec constraints (per now.md Phase 9):
  - DO NOT add new intelligence features ✓ (zero new logic)
  - DO NOT change scoring/export logic ✓ (zero touches)
  - DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
    auto-regen tolerance documented in checkCleanGit)
  - DO NOT retrain anything ✓ (no model touches)

CLI:
  ./scripts/distill release-freeze   # exit 0 = release-ready

Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 23:54:31 -05:00

10 KiB


Distillation System — Operator Handoff

Version: v1.0.0
Branch: scrum/auto-apply-19814
Tag: distillation-v1.0.0
Audit baseline: data/_kb/audit_baselines.jsonl (auto-grown per audit-full run)

This is the operator-level handoff for the distillation system. If you are picking this up cold, read this doc first, then docs/recon/local-distillation-recon.md. Skim the per-phase reports under reports/distillation/ only when you need detail.

What this system does

Turns real Lakehouse execution traces (1052 records sampled at v1.0.0 freeze) into clean, gated training datasets:

- RAG corpus: 446 grounded examples for retrieval augmentation
- SFT corpus: 351 instruction→response pairs (strict accepted-only)
- Preference corpus: 83 chosen/rejected pairs (zero self-pairs, zero identical-text)

It is NOT a model trainer. It is a knowledge refinery that produces training-safe substrate. The local-model "replay" runtime (Phase 7) demonstrates that retrieval against this substrate makes a 7B-class model behave like the system instead of fabricating audit verdicts.
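
The "zero self-pairs, zero identical-text" property of the preference corpus can be illustrated with a small check. This is only a sketch: the `PreferencePair` shape and its field names (`chosen_id`, `rejected_id`, `chosen_text`, `rejected_text`) are assumptions made for illustration and may not match the exporter's actual schema.

```typescript
// Hypothetical row shape; the real preference schema may differ.
interface PreferencePair {
  chosen_id: string;
  rejected_id: string;
  chosen_text: string;
  rejected_text: string;
}

// Returns human-readable violations; an empty array means the pair set is clean.
function findPairViolations(pairs: PreferencePair[]): string[] {
  const violations: string[] = [];
  pairs.forEach((p, i) => {
    // A self-pair prefers a run over itself and carries no training signal.
    if (p.chosen_id === p.rejected_id) {
      violations.push(`row ${i}: self-pair (${p.chosen_id})`);
    }
    // Identical text on both sides is equally useless, even across distinct runs.
    if (p.chosen_text.trim() === p.rejected_text.trim()) {
      violations.push(`row ${i}: identical text`);
    }
  });
  return violations;
}
```

A check along these lines would sit at export time, so that a violating pair is quarantined instead of written to the corpus.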

Phase map

| Phase | What it does | Commit | Report |
|---|---|---|---|
| 0 | Recon doc: inventory of source streams + integration plan | `27b1d27` | `docs/recon/local-distillation-recon.md` |
| 1 | 9 schemas + 51 schema tests + foundation types | `27b1d27` | (in commit body) |
| 2 | Materializer: 12 source jsonls → unified EvidenceRecord at `data/evidence/YYYY/MM/DD/` | `1ea8029` | (in commit body) |
| 3 | Deterministic Success Scorer: EvidenceRecord → ScoredRun (4 categories, no LLM) | `c989253` | (in commit body) |
| 4 | RAG/SFT/Preference exports + quarantine system | `68b6697` | `reports/distillation/phase4-export-report.md` |
| 5 | Receipts harness: per-stage StageReceipt + RunSummary + DriftReport | `2cf359a` | `reports/distillation/phase5-receipts-report.md` |
| 6 | Acceptance gate: fixture-driven 22-invariant E2E test | `1b433a9` | `reports/distillation/phase6-acceptance-report.md` |
| 7 | Replay layer: retrieval-driven local-model bootstrap | `681f39d` | `reports/distillation/phase7-replay-report.md` |
| 8 | Full system audit + drift baseline | `5bdd159` | `reports/distillation/phase8-full-audit-report.md` |

The auditor rebuild (commit 20a039c) is wired to use the Phase 5 substrate: it now calls lakehouse_answers_v1 matrix retrieval instead of tree-split shard summaries. Per-audit cost: 50× fewer cloud calls, 17× faster wall-clock.

Known-good commands

Run all commands from /home/profit/lakehouse. Invoke the CLI as bun run scripts/distillation/distill.ts <command>, or as ./scripts/distill <command> if the symlink is in place.

# Build everything end-to-end with structured receipts
./scripts/distill run-all

# Read a specific run's summary + drift
./scripts/distill receipts --run-id <uuid>

# Verify the system end-to-end on a deterministic fixture
./scripts/distill acceptance

# Audit Phases 0-7 + drift detection vs prior baseline
./scripts/distill audit-full

# Test a task through the replay layer (local model with retrieval)
./scripts/distill replay --task "<input>"
./scripts/distill replay --task "<input>" --no-retrieval     # baseline / A/B
./scripts/distill replay --task "<input>" --allow-escalation # try deepseek if local fails

# Per-stage one-shot (rare — prefer run-all for receipts)
./scripts/distill build-evidence
./scripts/distill score
./scripts/distill export-rag
./scripts/distill export-sft        # strict accepted-only
./scripts/distill export-sft --include-partial   # opens to partially_accepted
./scripts/distill export-preference
./scripts/distill export-all

How to rerun the full audit

./scripts/distill audit-full

Reads:

- on-disk data/evidence/, data/scored-runs/, exports/{rag,sft,preference}/*
- the most recent run_id under reports/distillation/
- the prior audit baseline at data/_kb/audit_baselines.jsonl

Writes:

- reports/distillation/phase8-full-audit-report.md
- a new row to data/_kb/audit_baselines.jsonl (auto-grown; never overwrite)

Exit code 0 = pass (every required check held). Non-zero = at least one required check failed.

How to inspect drift

Two levels:

1. Per-run drift: every run-all writes reports/distillation/<run_id>/drift.json, which compares the run to the most recent prior run. Severity is one of ok | warn | alert.
2. Cross-run baseline drift: audit-full reads the latest baseline row from data/_kb/audit_baselines.jsonl and compares 10 tracked metrics (record counts, category distribution, export sizes, quarantine totals). The drift table appears in reports/distillation/phase8-full-audit-report.md, with any change above 20% flagged as warn.

The baseline file is append-only. Don't truncate it: its value grows with the longitudinal record. If a metric flips to warn after a code change, the row written before that change is the diagnostic anchor.
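
The cross-run comparison can be pictured as a per-metric relative change checked against the 20% threshold. The following is a minimal sketch under stated assumptions, not the audit's actual implementation: `driftTable`, `DriftRow`, the zero-baseline handling, and the sample numbers are all hypothetical; only the metric names and the 20% threshold come from this document.

```typescript
type Severity = "ok" | "warn";

interface DriftRow {
  metric: string;
  previous: number;
  current: number;
  relativeChange: number; // fraction, e.g. 0.25 means 25%
  severity: Severity;
}

// Compare two baseline rows metric by metric.
// Metrics missing from the previous row are skipped, since there is nothing to compare.
function driftTable(
  previous: Record<string, number>,
  current: Record<string, number>,
  warnThreshold = 0.2,
): DriftRow[] {
  const rows: DriftRow[] = [];
  for (const metric of Object.keys(current)) {
    if (!(metric in previous)) continue;
    const prev = previous[metric];
    const curr = current[metric];
    // Assumption: any move away from a zero baseline counts as a 100% change.
    const relativeChange =
      prev === 0 ? (curr === 0 ? 0 : 1) : Math.abs(curr - prev) / Math.abs(prev);
    rows.push({
      metric,
      previous: prev,
      current: curr,
      relativeChange,
      severity: relativeChange > warnThreshold ? "warn" : "ok",
    });
  }
  return rows;
}

// Illustrative values only, not real baseline data.
const table = driftTable(
  { p3_accepted: 400, p4_sft_rows: 350 },
  { p3_accepted: 300, p4_sft_rows: 360 },
);
```

With these made-up numbers, `p3_accepted` moves by 25% and would be flagged warn, while `p4_sft_rows` moves by under 3% and stays ok.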

How to restore from last good state

git fetch --tags
git checkout distillation-v1.0.0   # leaves you on a detached HEAD at the tag
./scripts/distill audit-full   # confirm 16/16 required pass at v1.0.0

If you've made changes that broke the system, hard reset to v1.0.0:

git reset --hard distillation-v1.0.0   # destructive: discards uncommitted work and moves the current branch back to the tag
./scripts/distill acceptance           # confirm 22/22 fixture invariants
./scripts/distill audit-full           # confirm baseline match

How to add future phases without contaminating the corpus

The corpus = exports/rag/playbooks.jsonl + exports/sft/instruction_response.jsonl + exports/preference/chosen_rejected.jsonl. These are training-safe only if every gate held. To add Phase 10+:

1. Add code under scripts/distillation/<your_phase>.ts. Do NOT modify Phases 0-8.
2. If your phase produces evidence, append to data/_kb/<your_stream>.jsonl and add a transform in scripts/distillation/transforms.ts. The materializer picks it up automatically.
3. If your phase needs a new schema, create auditor/schemas/distillation/<your_schema>.ts with a _SCHEMA_VERSION = 1 constant, a validator, and tests in auditor/schemas/distillation/schemas.test.ts (positive + negative fixtures).
4. Run ./scripts/distill audit-full BEFORE merging. Confirm 16/16 still passes.
5. Run ./scripts/distill acceptance. Confirm 22/22 still passes.
6. Re-run ./scripts/distill run-all. Inspect drift in the new run's drift.json. Anything >20% in record counts means your phase moved the corpus; explain it in the commit.
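
For the new-schema step above, a schema module might look roughly like the sketch below. Every name here (`MY_STREAM_SCHEMA_VERSION`, `MyStreamRecord`, `validateMyStreamRecord`) and the record fields are hypothetical; the existing schemas under auditor/schemas/distillation/ are the authority on the real conventions.

```typescript
// Hypothetical schema for a new Phase 10+ evidence stream.
const MY_STREAM_SCHEMA_VERSION = 1;

interface MyStreamRecord {
  schema_version: number;
  id: string;
  recorded_at: string; // ISO 8601 timestamp
}

type ValidationResult =
  | { ok: true; value: MyStreamRecord }
  | { ok: false; errors: string[] };

// Validate an untrusted value and collect every problem instead of stopping at the first.
function validateMyStreamRecord(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["record must be an object"] };
  }
  const r = input as Record<string, unknown>;
  const errors: string[] = [];
  if (r.schema_version !== MY_STREAM_SCHEMA_VERSION) {
    errors.push(`schema_version must be ${MY_STREAM_SCHEMA_VERSION}`);
  }
  if (typeof r.id !== "string" || r.id.length === 0) {
    errors.push("id must be a non-empty string");
  }
  if (typeof r.recorded_at !== "string" || Number.isNaN(Date.parse(r.recorded_at))) {
    errors.push("recorded_at must be a parseable timestamp");
  }
  return errors.length === 0
    ? { ok: true, value: r as unknown as MyStreamRecord }
    : { ok: false, errors };
}
```

The positive and negative fixtures would then be one record that validates cleanly and at least one record per rule that is expected to fail.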

What NOT to modify casually

These have explicit firewalls. Touching them = potentially weakening contamination prevention:

| File | Why fragile |
|---|---|
| `auditor/schemas/distillation/sft_sample.ts` | The `quality_score` enum literally enforces "no rejected/needs_human_review in SFT". Loosening it = silent leak |
| `scripts/distillation/export_sft.ts` `SFT_NEVER` constant | Second-layer defense. If schema fails, this catches it |
| `scripts/distillation/export_sft.ts` re-read validation | Third layer: re-reads on-disk SFT and fails LOUD if forbidden quality_score appears |
| `scripts/distillation/scorer.ts` category mapping | Changing rules → silent corpus shift. Run `audit-full` after any change to see drift |
| `tests/fixtures/distillation/acceptance/` | The fixture is the gate. Changing it = changing the bar |
| `data/_kb/audit_baselines.jsonl` | Append-only. Truncating loses longitudinal drift signal |

If you must change one of these, run audit-full BEFORE and AFTER. The drift table will tell you exactly what your change moved.
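
To make the layering concrete, here is a rough sketch of how a second-layer filter and a third-layer re-read check could fit together. It is not the code in export_sft.ts: `selectSftRows`, `assertNoForbiddenRows`, and the `SftRow` shape are hypothetical, and the four `quality_score` values are inferred from the categories named in this document.

```typescript
// Assumed category names, taken from how this document refers to them.
type QualityScore = "accepted" | "partially_accepted" | "rejected" | "needs_human_review";

interface SftRow {
  id: string;
  quality_score: QualityScore;
}

// Layer 2: values that must never reach the SFT export, whatever flags are set.
const SFT_NEVER: ReadonlySet<string> = new Set(["rejected", "needs_human_review"]);

// Strict accepted-only by default; includePartial mirrors the --include-partial flag.
function selectSftRows(rows: SftRow[], includePartial = false): SftRow[] {
  return rows.filter((r) => {
    if (SFT_NEVER.has(r.quality_score)) return false;
    if (r.quality_score === "partially_accepted") return includePartial;
    return r.quality_score === "accepted";
  });
}

// Layer 3: re-read the serialized JSONL and fail loudly on any forbidden value.
function assertNoForbiddenRows(jsonl: string): void {
  jsonl.split("\n").forEach((line, i) => {
    if (line.trim() === "") return;
    const row = JSON.parse(line) as { quality_score?: string };
    if (row.quality_score !== undefined && SFT_NEVER.has(row.quality_score)) {
      throw new Error(`forbidden quality_score "${row.quality_score}" at line ${i + 1}`);
    }
  });
}
```

The point of the redundancy is that each layer checks a different artifact (the type, the in-memory rows, the bytes on disk), so loosening any single one should still be caught by the others.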

Receipt-vs-drift quick reference

If audit-full flags a metric:

>20% swing in p3_accepted → scorer rules changed OR source data shifted
>20% swing in p4_sft_rows → SFT eligibility changed (check exporter filter)
>20% swing in p4_total_quarantined → either source data is dirtier OR a gate got tighter
Hash mismatch on identical input → determinism violation; revert immediately

If acceptance fails:

22 invariants are pinned in scripts/distillation/acceptance.ts. The failing one names what broke.
Spec invariants (1-22) are documented in reports/distillation/phase6-acceptance-report.md.

Pointers to non-distillation systems

The auditor (auditor/) and the gateway (crates/gateway/) are the consumers of the distillation substrate. They use it but are not part of it:

- Auditor's pr_audit mode (crates/gateway/src/v1/mode.rs) retrieves from lakehouse_answers_v1. If you regenerate the RAG export, the auditor's context auto-improves on next call.
- The gateway's /v1/chat is the entry point all model calls flow through. Receipts capture provider, model, latency, prompt+completion tokens.

Provenance

Every export row traces to data/scored-runs/.../<source>.jsonl line N, which traces to data/evidence/.../<source>.jsonl line N, which in turn traces to data/_kb/<source>.jsonl line N. The provenance.sig_hash field (canonical sha256 of the source row, sorted keys) is the join key.

If a downstream consumer asks "where did this SFT row come from", run:

jq 'select(.id == "<sft_id>") | .provenance' exports/sft/instruction_response.jsonl
# returns {source_file, line_offset, sig_hash, recorded_at}
# Then:
sed -n "$((<line_offset> + 1))p" data/scored-runs/<source_file>
# And so on back to data/_kb/<original_source>.jsonl
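
A "canonical sha256 with sorted keys" can be sketched as follows. This is an assumption about the general technique, not a reproduction of the system's hashing: `canonicalize` and `sigHash` are hypothetical names, and the exact serialization the real `sig_hash` uses (whitespace, number formatting, nested-key handling) may differ, so hashes from this sketch should not be expected to match values on disk.

```typescript
import { createHash } from "node:crypto";

// Rebuild the value with object keys in sorted order so that logically
// equal rows serialize to the same string regardless of key order.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort()) {
      sorted[key] = canonicalize(obj[key]);
    }
    return sorted;
  }
  return value;
}

// Hex-encoded SHA-256 of the canonical JSON form of a row.
function sigHash(row: unknown): string {
  const canonicalJson = JSON.stringify(canonicalize(row));
  return createHash("sha256").update(canonicalJson, "utf8").digest("hex");
}
```

Because key order no longer affects the digest, the same source row yields the same hash at every stage, which is what lets the field serve as a join key across data/_kb, data/evidence, and data/scored-runs.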

Test discipline

bun test tests/distillation/ auditor/schemas/distillation/

At v1.0.0: 145 tests, 0 fail, 372 expect() calls, ~600ms. Any new phase must keep this at 0 fail.

Cumulative commits at v1.0.0

27b1d27  distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold
1ea8029  distillation: Phase 2 — Evidence View materializer + health audit
c989253  distillation: Phase 3 — deterministic Success Scorer
68b6697  distillation: Phase 4 — dataset export layer
2cf359a  distillation: Phase 5 — receipts harness (system-level observability)
1b433a9  distillation: Phase 6 — acceptance gate suite
20a039c  auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
681f39d  distillation: Phase 7 — replay-driven local model bootstrapping
5bdd159  distillation: Phase 8 — full system audit
<this>   distillation: Phase 9 — release freeze and operator handoff
