lakehouse/docs/distillation/operator-handoff.md
root 73f242e3e4 distillation: Phase 9 — release freeze and operator handoff
Final phase. Adds:
  scripts/distillation/release_freeze.ts   ~330 lines, 6 release gates
  docs/distillation/operator-handoff.md    durable cold-start operator doc
  docs/distillation/recovery-runbook.md    failure-mode runbook by symptom
  scripts/distillation/distill.ts          +release-freeze subcommand

The release_freeze orchestrator runs every gate the system has:
  1. Clean git state (tolerates auto-regenerated reports)
  2. Full test suite (bun test tests/distillation auditor/schemas/distillation)
  3. Phase commit verification (every Phase 0-8 commit resolves)
  4. Acceptance gate (22-invariant fixture E2E)
  5. audit-full (Phases 0-7 verified + drift detection)
  6. Tag availability check (distillation-v1.0.0 not yet existing)

Outputs:
  reports/distillation/release-freeze.md       human-readable manifest
  reports/distillation/release-manifest.json   machine-readable manifest

Manifest captures:
  - git_head + git_branch + released_at
  - phase→commit map for all 9 commits (Phase 0+1+2 scaffold through Phase 8 audit)
  - dataset counts at freeze (RAG/SFT/Preference/evidence/scored/quarantined)
  - latest audit baseline row
  - per-gate pass/fail with detail

Operator handoff doc covers:
  - phase map with commits + report locations
  - known-good commands
  - how to rerun audit-full + inspect drift
  - how to restore from last-good (git checkout distillation-v1.0.0)
  - how to add future phases without contaminating corpus
  - what NOT to modify casually (with file:reason mapping)
  - cumulative commits at v1.0.0

Recovery runbook covers, by symptom:
  - audit-full exit non-zero (per-phase diagnostics)
  - drift table flags warn (intentional vs regression)
  - acceptance fail vs audit-full pass divergence
  - run-all empty exports (counter-bisection order)
  - hash mismatch on identical input (determinism violation; CRITICAL)
  - replay logs growing unbounded (rotation guidance)
  - nuclear restore via git checkout distillation-v1.0.0

Spec constraints (per now.md Phase 9):
  - DO NOT add new intelligence features ✓ (zero new logic)
  - DO NOT change scoring/export logic ✓ (zero touches)
  - DO NOT weaken gates ✓ (gates only added, never relaxed beyond the
    auto-regen tolerance documented in checkCleanGit)
  - DO NOT retrain anything ✓ (no model touches)

CLI:
  ./scripts/distill release-freeze   # exit 0 = release-ready

Tag creation deferred to operator confirmation (the release-freeze
report prints the exact `git tag` command). Per CLAUDE.md guidance,
destructive/visible operations like tags require explicit user
authorization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:54:31 -05:00

# Distillation System — Operator Handoff
**Version:** v1.0.0
**Branch:** `scrum/auto-apply-19814`
**Tag:** `distillation-v1.0.0`
**Audit baseline:** `data/_kb/audit_baselines.jsonl` (auto-grown per audit-full run)
This is the operator-level handoff for the distillation system. If you are picking this up cold, **read this doc first**, then `docs/recon/local-distillation-recon.md`. Skim the per-phase reports under `reports/distillation/` only when you need detail.
## What this system does
Turns real Lakehouse execution traces (1052 records sampled at v1.0.0 freeze) into clean, gated training datasets:
- **RAG corpus** — 446 grounded examples for retrieval-augmentation
- **SFT corpus** — 351 instruction→response pairs (strict accepted-only)
- **Preference corpus** — 83 chosen/rejected pairs (zero self-pairs, zero identical-text)
It is **NOT** a model trainer. It is a **knowledge refinery** that produces training-safe substrate. The local-model "replay" runtime (Phase 7) demonstrates that retrieval against this substrate makes a 7B-class model behave like the system instead of fabricating audit verdicts.
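The "zero self-pairs, zero identical-text" property of the preference corpus can be pictured as a per-row check. The sketch below is illustrative only: the `PreferencePair` shape, its field names, the reading of "self-pair" as "both sides come from the same run", and the whitespace normalization are all assumptions, not the exporter's actual code.

```typescript
// Hypothetical row shape; the real exporter's field names may differ.
interface PreferencePair {
  chosen: { runId: string; text: string };
  rejected: { runId: string; text: string };
}

// Assumed normalization: trim and collapse whitespace before comparing text.
function normalizeText(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}

// Returns a reason when the pair is unusable, or null when it passes both checks.
function rejectReason(pair: PreferencePair): string | null {
  // Assumption: a "self-pair" is a pair whose two sides come from the same run.
  if (pair.chosen.runId === pair.rejected.runId) return "self-pair";
  if (normalizeText(pair.chosen.text) === normalizeText(pair.rejected.text)) {
    return "identical-text";
  }
  return null;
}
```

A pair that returns a non-null reason would be dropped or quarantined rather than exported; which of the two the real exporter does is not specified here.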
## Phase map
| Phase | What it does | Commit | Report |
|---|---|---|---|
| 0 | Recon doc — inventory of source streams + integration plan | 27b1d27 | `docs/recon/local-distillation-recon.md` |
| 1 | 9 schemas + 51 schema tests + foundation types | 27b1d27 | (in commit body) |
| 2 | Materializer: 12 source jsonls → unified EvidenceRecord at `data/evidence/YYYY/MM/DD/` | 1ea8029 | (in commit body) |
| 3 | Deterministic Success Scorer: EvidenceRecord → ScoredRun (4 categories, no LLM) | c989253 | (in commit body) |
| 4 | RAG/SFT/Preference exports + quarantine system | 68b6697 | `reports/distillation/phase4-export-report.md` |
| 5 | Receipts harness — per-stage StageReceipt + RunSummary + DriftReport | 2cf359a | `reports/distillation/phase5-receipts-report.md` |
| 6 | Acceptance gate — fixture-driven 22-invariant E2E test | 1b433a9 | `reports/distillation/phase6-acceptance-report.md` |
| 7 | Replay layer — retrieval-driven local-model bootstrap | 681f39d | `reports/distillation/phase7-replay-report.md` |
| 8 | Full system audit + drift baseline | 5bdd159 | `reports/distillation/phase8-full-audit-report.md` |
The auditor rebuild (commit 20a039c) is wired to use the Phase 5 substrate: it now calls `lakehouse_answers_v1` matrix retrieval instead of tree-split shard summaries. Per-audit cost: 50× fewer cloud calls, 17× faster wall-clock.
## Known-good commands
All commands run from `/home/profit/lakehouse`. Use `bun run scripts/distillation/distill.ts <command>` or `./scripts/distill <command>` if symlinked.
```bash
# Build everything end-to-end with structured receipts
./scripts/distill run-all
# Read a specific run's summary + drift
./scripts/distill receipts --run-id <uuid>
# Verify the system end-to-end on a deterministic fixture
./scripts/distill acceptance
# Audit Phases 0-7 + drift detection vs prior baseline
./scripts/distill audit-full
# Test a task through the replay layer (local model with retrieval)
./scripts/distill replay --task "<input>"
./scripts/distill replay --task "<input>" --no-retrieval # baseline / A/B
./scripts/distill replay --task "<input>" --allow-escalation # try deepseek if local fails
# Per-stage one-shot (rare — prefer run-all for receipts)
./scripts/distill build-evidence
./scripts/distill score
./scripts/distill export-rag
./scripts/distill export-sft # strict accepted-only
./scripts/distill export-sft --include-partial # opens to partially_accepted
./scripts/distill export-preference
./scripts/distill export-all
```
## How to rerun the full audit
```bash
./scripts/distill audit-full
```
Reads:
- on-disk `data/evidence/`, `data/scored-runs/`, `exports/{rag,sft,preference}/*`
- the most recent run_id under `reports/distillation/`
- the prior audit baseline at `data/_kb/audit_baselines.jsonl`
Writes:
- `reports/distillation/phase8-full-audit-report.md`
- a new row to `data/_kb/audit_baselines.jsonl` (auto-grown — never overwrite)
Exit code 0 = pass (every required check held). Non-zero = at least one required check failed.
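Because the baseline file is JSONL and is only ever grown, the "prior baseline" is its last non-empty line. A minimal sketch of that lookup, assuming one JSON object per line; `latestBaselineRow` is a hypothetical helper name, and the row layout is an assumption (only the metric names `p3_accepted` and `p4_sft_rows` come from this doc).

```typescript
// Hypothetical helper: given the full text of a JSONL baseline file,
// return its most recent row, or null when the file has no rows yet.
function latestBaselineRow(jsonlText: string): Record<string, unknown> | null {
  const lines = jsonlText.split("\n").filter((line) => line.trim().length > 0);
  const last = lines[lines.length - 1];
  if (last === undefined) return null;
  return JSON.parse(last) as Record<string, unknown>;
}

// Example input: two audit runs, oldest first (row layout is assumed).
const exampleBaselines =
  '{"p3_accepted": 100, "p4_sft_rows": 351}\n' +
  '{"p3_accepted": 120, "p4_sft_rows": 351}\n';
const newestBaseline = latestBaselineRow(exampleBaselines);
```

Here `newestBaseline` is the second row, the one with `p3_accepted` equal to 120.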
## How to inspect drift
Two levels:
1. **Per-run drift**: every `run-all` writes `reports/distillation/<run_id>/drift.json`, comparing the run to the most recent prior run. Severity is one of `ok | warn | alert`.
2. **Cross-run baseline drift**: `audit-full` reads the latest baseline row from `data/_kb/audit_baselines.jsonl` and compares 10 tracked metrics (record counts, category distribution, export sizes, quarantine totals). The drift table appears in `reports/distillation/phase8-full-audit-report.md`, with any change `>20%` flagged as `warn`.
The baseline file is **append-only**. Don't truncate it — its value grows with the longitudinal record. If a metric flips `warn` after a code change, the row before that change is the diagnostic anchor.
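The cross-run comparison described above amounts to a relative-change check per metric. The following is a rough sketch of that idea, not the audit's actual implementation: the `DriftRow` shape, the function name, and the handling of a zero baseline are assumptions; only the 20% threshold and the `warn` label come from this doc.

```typescript
type Metrics = Record<string, number>;

interface DriftRow {
  metric: string;
  baseline: number;
  current: number;
  changePct: number; // signed percent change relative to the baseline
  severity: "ok" | "warn";
}

// Compare every metric present in the baseline row against the current values.
// Metrics that exist only in `current` are ignored in this sketch.
function driftTable(baseline: Metrics, current: Metrics, thresholdPct = 20): DriftRow[] {
  return Object.keys(baseline)
    .sort()
    .map((metric): DriftRow => {
      const before = baseline[metric] ?? 0;
      const after = current[metric] ?? 0;
      // Assumption: any move off a zero baseline counts as unbounded drift.
      const changePct =
        before === 0 ? (after === 0 ? 0 : Infinity) : ((after - before) * 100) / before;
      const severity: DriftRow["severity"] =
        Math.abs(changePct) > thresholdPct ? "warn" : "ok";
      return { metric, baseline: before, current: after, changePct, severity };
    });
}
```

With a baseline of `{ p3_accepted: 100 }` and a current value of 130, the row carries a 30% change and is flagged `warn`; a move from 351 to 360 in `p4_sft_rows` stays `ok`.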
## How to restore from last good state
```bash
git fetch --tags
git checkout distillation-v1.0.0
./scripts/distill audit-full # confirm 16/16 required pass at v1.0.0
```
If you've made changes that broke the system, hard reset to v1.0.0:
```bash
git reset --hard distillation-v1.0.0   # destructive: discards uncommitted changes and moves the current branch back to the tag
./scripts/distill acceptance # confirm 22/22 fixture invariants
./scripts/distill audit-full # confirm baseline match
```
## How to add future phases without contaminating the corpus
The corpus = `exports/rag/playbooks.jsonl` + `exports/sft/instruction_response.jsonl` + `exports/preference/chosen_rejected.jsonl`. These are training-safe **only if** every gate held. To add Phase 10+:
1. Add code under `scripts/distillation/<your_phase>.ts`. Do NOT modify Phases 0-8.
2. If your phase produces evidence, append to `data/_kb/<your_stream>.jsonl` and add a transform in `scripts/distillation/transforms.ts`. The materializer picks it up automatically.
3. If your phase needs a new schema, create `auditor/schemas/distillation/<your_schema>.ts` with `_SCHEMA_VERSION = 1` constant + validator + tests in `auditor/schemas/distillation/schemas.test.ts` (positive + negative fixtures).
4. Run `./scripts/distill audit-full` BEFORE merging. Confirm 16/16 still passes.
5. Run `./scripts/distill acceptance`. Confirm 22/22 still passes.
6. Re-run `./scripts/distill run-all`. Inspect drift in the new run's `drift.json`. Anything `>20%` in record counts means your phase moved the corpus — explain it in the commit.
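As an illustration of step 3, a new schema file could pair a version constant with a validator along these lines. This is a sketch for an imagined stream: `ExampleRecord`, its fields, and `validateExampleRecord` are hypothetical names, and the existing schemas under `auditor/schemas/distillation/` may be structured differently.

```typescript
// Hypothetical schema for an imagined Phase 10 stream; every name is illustrative.
// In a real schema file these declarations would be exported.
const EXAMPLE_RECORD_SCHEMA_VERSION = 1;

interface ExampleRecord {
  schema_version: number;
  id: string;
  recorded_at: string; // timestamp string, e.g. ISO-8601
}

type ValidationResult =
  | { ok: true; value: ExampleRecord }
  | { ok: false; errors: string[] };

// Validate a parsed JSON value, collecting every problem instead of stopping at the first.
function validateExampleRecord(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return { ok: false, errors: ["record must be a JSON object"] };
  }
  const row = input as Record<string, unknown>;
  const errors: string[] = [];
  if (row["schema_version"] !== EXAMPLE_RECORD_SCHEMA_VERSION) {
    errors.push(`schema_version must be ${EXAMPLE_RECORD_SCHEMA_VERSION}`);
  }
  const id = row["id"];
  if (typeof id !== "string" || id.length === 0) {
    errors.push("id must be a non-empty string");
  }
  const recordedAt = row["recorded_at"];
  if (typeof recordedAt !== "string" || Number.isNaN(Date.parse(recordedAt))) {
    errors.push("recorded_at must be a parseable timestamp");
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: row as unknown as ExampleRecord };
}
```

The positive and negative fixtures called for in step 3 would then assert `ok: true` for a well-formed row and a specific error list for each malformed one.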
## What NOT to modify casually
These have explicit firewalls. Touching them = potentially weakening contamination prevention:
| File | Why fragile |
|---|---|
| `auditor/schemas/distillation/sft_sample.ts` | The `quality_score` enum literally enforces "no rejected/needs_human_review in SFT". Loosening it = silent leak |
| `scripts/distillation/export_sft.ts` `SFT_NEVER` constant | Second-layer defense. If schema fails, this catches it |
| `scripts/distillation/export_sft.ts` re-read validation | Third layer — re-reads on-disk SFT and fails LOUD if forbidden quality_score appears |
| `scripts/distillation/scorer.ts` category mapping | Changing rules → silent corpus shift. Run `audit-full` after any change to see drift |
| `tests/fixtures/distillation/acceptance/` | The fixture is the gate. Changing it = changing the bar |
| `data/_kb/audit_baselines.jsonl` | Append-only. Truncating loses longitudinal drift signal |
If you must change one of these, run `audit-full` BEFORE and AFTER. The drift table will tell you exactly what your change moved.
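To make the layering in the table concrete, here is a hedged sketch of how a deny-list and a post-write re-read can back up a schema-level enum. It is not the code in `export_sft.ts`: the category names are inferred from this doc, and the contents of `SFT_NEVER`, the row shape, and both function names are assumptions.

```typescript
// Category names inferred from this doc; the real enum may differ.
type Category = "accepted" | "partially_accepted" | "rejected" | "needs_human_review";

interface ScoredRow {
  id: string;
  quality_score: Category;
}

// Second layer (assumed contents): categories that must never reach SFT, whatever the flags.
const SFT_NEVER: ReadonlySet<Category> = new Set<Category>(["rejected", "needs_human_review"]);

function sftEligible(row: ScoredRow, includePartial = false): boolean {
  if (SFT_NEVER.has(row.quality_score)) return false;
  if (row.quality_score === "accepted") return true;
  return includePartial && row.quality_score === "partially_accepted";
}

// Third layer: re-validate the text that was actually written and fail loudly on any leak.
function assertNoForbiddenRows(writtenJsonl: string): void {
  for (const line of writtenJsonl.split("\n")) {
    if (line.trim() === "") continue;
    const row = JSON.parse(line) as { id?: string; quality_score?: string };
    if (row.quality_score === "rejected" || row.quality_score === "needs_human_review") {
      throw new Error(`forbidden quality_score in SFT export: ${row.id ?? "<no id>"}`);
    }
  }
}
```

The point of the redundancy is that each layer checks a different artifact: the type, the in-memory row, and the bytes on disk. Loosening any one of them removes a check the others do not duplicate.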
## Receipt-vs-drift quick reference
If `audit-full` flags a metric:
- `>20%` swing in `p3_accepted` → scorer rules changed OR source data shifted
- `>20%` swing in `p4_sft_rows` → SFT eligibility changed (check exporter filter)
- `>20%` swing in `p4_total_quarantined` → either source data is dirtier OR a gate got tighter
- Hash mismatch on identical input → determinism violation; revert immediately
If `acceptance` fails:
- 22 invariants are pinned in `scripts/distillation/acceptance.ts`. The failing one names what broke.
- Spec invariants (1-22) are documented in `reports/distillation/phase6-acceptance-report.md`.
## Pointers to non-distillation systems
The auditor (`auditor/`) and the gateway (`crates/gateway/`) are the consumers of the distillation substrate. They use it but are not part of it:
- Auditor's `pr_audit` mode (`crates/gateway/src/v1/mode.rs`) retrieves from `lakehouse_answers_v1`. If you regenerate the RAG export, the auditor's context auto-improves on next call.
- The gateway's `/v1/chat` is the entry point all model calls flow through. Receipts capture provider, model, latency, prompt+completion tokens.
## Provenance
Every export row traces back to `data/scored-runs/.../<source>.jsonl` line N, which traces to `data/evidence/.../<source>.jsonl` line N, which in turn traces to `data/_kb/<source>.jsonl` line N. The `provenance.sig_hash` field (canonical sha256 of the source row, sorted keys) is the join key.
If a downstream consumer asks "where did this SFT row come from", run:
```bash
jq 'select(.id == "<sft_id>") | .provenance' exports/sft/instruction_response.jsonl
# returns {source_file, line_offset, sig_hash, recorded_at}
# Then:
sed -n "$((<line_offset> + 1))p" data/scored-runs/<source_file>
# And so on back to data/_kb/<original_source>.jsonl
```
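The "canonical sha256 of the source row, sorted keys" idea behind `provenance.sig_hash` can be sketched as follows. This only illustrates the technique: the exact canonicalization the system uses (separators, number formatting, which fields are included) is not specified here, so hashes produced by this sketch should not be expected to match real `sig_hash` values. `canonicalJson` and `sigHash` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Serialize a parsed JSON value with object keys sorted at every depth,
// so that key order in the source row cannot change the hash.
// Intended for values that came from JSON.parse (no undefined, no functions).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hex-encoded SHA-256 of the canonical form.
function sigHash(row: unknown): string {
  return createHash("sha256").update(canonicalJson(row), "utf8").digest("hex");
}
```

Two rows that differ only in key order, such as `{"a":1,"b":2}` and `{"b":2,"a":1}`, produce the same hash under this scheme, which is what makes the value usable as a join key across the three layers.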
## Test discipline
```bash
bun test tests/distillation/ auditor/schemas/distillation/
```
At v1.0.0: **145 tests, 0 fail, 372 expect() calls, ~600ms.** Any new phase must keep this at 0 fail.
## Cumulative commits at v1.0.0
```
27b1d27 distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold
1ea8029 distillation: Phase 2 — Evidence View materializer + health audit
c989253 distillation: Phase 3 — deterministic Success Scorer
68b6697 distillation: Phase 4 — dataset export layer
2cf359a distillation: Phase 5 — receipts harness (system-level observability)
1b433a9 distillation: Phase 6 — acceptance gate suite
20a039c auditor: rebuild on mode runner + drop tree-split (use distillation substrate)
681f39d distillation: Phase 7 — replay-driven local model bootstrapping
5bdd159 distillation: Phase 8 — full system audit
<this> distillation: Phase 9 — release freeze and operator handoff
```