distillation: Phase 0 recon + Phase 1 schemas + Phase 2 transforms scaffold
Phase 0 — docs/recon/local-distillation-recon.md
Inventories the 23 knowledge-base JSONL streams + 20 vector corpora + auditor's
kb_index.ts as substrate for the now.md distillation pipeline. Maps
spec modules to existing producers, identifies real gaps, lists 9
schemas to formalize. ZERO implementation in recon — gating doc only.

Phase 1 — auditor/schemas/distillation/
9 schemas + foundation types + 48 tests passing in 502ms:

  types.ts                      shared validators + canonicalSha256
  evidence_record.ts            EVIDENCE_SCHEMA_VERSION=1, ModelRole enum
  scored_run.ts                 4 categories pinned, anchor_grounding ∈ [0,1]
  receipt.ts                    git_sha 40-char, sha256 file refs, validation_pass:bool
  playbook.ts                   non-empty source_run_ids + acceptance_criteria
  scratchpad_summary.ts         validation_status enum, hash sha256
  model_ledger.ts               success_rate ∈ [0,1], sample_count ≥ 1
  rag_sample.ts                 success_score ∈ {accepted, partially_accepted}
  sft_sample.ts                 quality_score MUST be 'accepted' (no leak)
  preference_sample.ts          chosen != rejected, source_run_ids must differ
  evidence_record.test.ts       10 tests, JSON-fixture round-trip
  schemas.test.ts               30 tests, inline fixtures
  realdata.test.ts              8 tests, real-JSONL probe

Real-data validation probe (one of the 3 notables from recon):
46 rows across 7 sources, 100% pass. distilled_facts/procedures alive.
Report at data/_kb/realdata_validation_report.md (also written by the
test). Confirms schema fits existing producers without migration.

Phase 2 scaffold — scripts/distillation/transforms.ts
Promoted PROBES from realdata.test.ts into a real TRANSFORMS array
covering 12 source streams (8 Tier 1 validated + 4 Tier 2 from
recon's untested-streams list). Pure functions: no I/O, no model
calls, no clock reads. Caller supplies recorded_at + sig_hash so
materializer is deterministic by construction.

Spec non-negotiables enforced at schema layer (defense in depth):
  - provenance{source_file, sig_hash, recorded_at} required everywhere
  - schema_version mismatch hard-rejects (forward-compat gate)
  - SFT no-leak: validateSftSample REJECTS partially_accepted, rejected,
    needs_human_review — three explicit tests
  - Every score has WHY (reasons non-empty)
  - Every playbook traces to source (source_run_ids non-empty)
  - Every preference has WHY (reason non-empty)
  - Receipts substantive (git_sha 40-char, sha256 64-char, validation_pass:bool)

Branch carries uncommitted auditor rebuild work (mode.rs + modes.toml
+ inference.ts + static.ts) blocked on upstream Ollama Cloud kimi-k2
500 ISE; held pending recon-driven design decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:30:38 -05:00

# Local Distillation Pipeline — Repo Recon
**Date:** 2026-04-26
**Status:** Phase 0 (read-only inventory — no implementation yet)
**Spec:** `/home/profit/now.md`
**Branch:** `scrum/auto-apply-19814` head `f753e11` (uncommitted: auditor rebuild)
This document inventories what already exists in the Lakehouse repo before we build the distillation substrate. It is the gating artifact: per the spec, no implementation lands until this document is settled.
The headline finding: **~70% of the spec's modules already have working substrate** in the form of JSONL streams, vector corpora, scoring gates, and a partial extraction pipeline (`distilled_facts.jsonl` / `distilled_procedures.jsonl`). The work is integration + formalization, not greenfield. The biggest risk is shipping a parallel system that drifts from what the existing scrum/auditor/observer loops actually produce.
---
## 1. Repo structure
```
/home/profit/lakehouse
├── crates/ # 15 Rust crates (see PRD.md)
│ ├── shared/ # types.rs, profiles/, model_matrix.rs, secrets.rs
│ ├── gateway/ # /v1/* HTTP surface, mode router, observer event fanout
│ ├── vectord/ # HNSW + pathway_memory.rs (88 traces) + Mem0 versioning
│ ├── catalogd/ # column types, manifests
│ ├── truth/ # Phase 42 — TOML rule engine for SQL/request gates
│ ├── validator/ # Phase 43 — staffing/devops validators
│ └── ...
├── auditor/ # TypeScript PR auditor (Bun runtime)
│ ├── audit.ts # orchestrator
│ ├── audit_one.ts # one-shot harness
│ ├── claim_parser.ts # extracts ship-claims from PR body
│ ├── fact_extractor.ts # LLM Team /api/run?mode=extract integration
│ ├── kb_index.ts # **already a queryable index over data/_kb/*.jsonl**
│ ├── kb_stats.ts
│ ├── checks/ # static.ts, dynamic.ts, inference.ts, kb_query.ts
│ ├── policy.ts # severity → block/warn/info gates
│ └── gitea.ts # PR poller
├── tests/
│ ├── real-world/ # scrum_master_pipeline.ts, scrum_applier.ts, runs/
│ ├── multi-agent/ # scenarios/, playbooks/
│ ├── architecture_smoke.ts
│ ├── battery/
│ └── agent_test/
├── scripts/
│ ├── build_answers_corpus.ts # NEW 2026-04-26: lakehouse_answers_v1
│ ├── build_lakehouse_corpus.ts # arch corpus
│ ├── build_symbols_corpus.ts # symbols corpus
│ ├── build_scrum_findings_corpus.ts # findings corpus
│ ├── vectorize_raw_corpus.ts
│ ├── mode_experiment.ts / mode_compare.ts / mode_pass{2,3,4,5}_*.ts
│ └── ...
├── sidecar/ # Python (Ollama embed adapter)
├── mcp-server/ # observer.ts, relevance.ts, ai_models.ts (port 3700/3800)
├── ui/ # public Bun UI (devop.live/lakehouse, port 3700)
├── data/ # **the substrate this document audits** (see §3)
└── docs/
├── PRD.md
├── PHASES.md
├── DECISIONS.md (ADRs 001-021)
├── SCRUM_MASTER_SPEC.md
├── MATRIX_AGENT_HANDOVER.md
├── MODE_RUNNER_TUNING_PLAN.md
└── recon/
└── local-distillation-recon.md (this file)
```
---
## 2. Existing components by spec module
### 2.1 Gateway / orchestrator
`crates/gateway/src/v1/mode.rs` is the **prompt-molder substrate** — task_class → mode → enrichment composer + LLM call. Five native modes (codereview_lakehouse, codereview_isolation, codereview_null, codereview_matrix_only, codereview_playbook_only) + staffing_inference_lakehouse + (uncommitted) pr_audit. Strong-model auto-downgrade gate based on Pass 5 variance test.
**Distillation relevance:** The mode runner already encodes "compose pathway memory + matrix retrieval + framing into a one-shot prompt." The distillation pipeline can call mode runner endpoints rather than reimplementing retrieval.
### 2.2 Observer / scratchpad
- `mcp-server/observer.ts` — Bun service on `:3800`. `/event`, `/relevance`, `/review` endpoints. Receives scrum + scenario + langfuse-bridge sources. KB preamble blends pathway + arch + answers (3 sources).
- `mcp-server/relevance.ts` — adjacency-pollution heuristic filter (added 2026-04-25).
- Scrum scratchpad: `tests/real-world/scrum_master_pipeline.ts::treeSplitFile` — text-only multi-shard scratchpad. Auditor curates the same way (`auditor/checks/inference.ts::treeSplitDiff`).
**Distillation relevance:** Scratchpads are unstructured text. Spec wants structured extraction (objective/completed/failed/pending). The `distilled_*.jsonl` streams (§3) already do half of this for the LLM Team runs — extending to the scrum scratchpad is the gap.
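The structured shape the spec asks for can be sketched as a type plus a trivial normalizer. The section markers (`OBJECTIVE:`, `DONE:`, `FAIL:`, `TODO:`) are hypothetical delimiters for illustration, not the real scratchpad format:

```typescript
// Target shape for normalized scratchpads (field names follow the spec's
// objective/completed/failed/pending framing; markers below are assumptions).
type ScratchpadSummary = {
  current_objective: string;
  completed_steps: string[];
  failed_steps: string[];
  pending_steps: string[];
};

function normalizeScratchpad(text: string): ScratchpadSummary {
  const out: ScratchpadSummary = {
    current_objective: "",
    completed_steps: [],
    failed_steps: [],
    pending_steps: [],
  };
  for (const line of text.split("\n")) {
    const t = line.trim();
    if (t.startsWith("OBJECTIVE:")) out.current_objective = t.slice("OBJECTIVE:".length).trim();
    else if (t.startsWith("DONE:")) out.completed_steps.push(t.slice(5).trim());
    else if (t.startsWith("FAIL:")) out.failed_steps.push(t.slice(5).trim());
    else if (t.startsWith("TODO:")) out.pending_steps.push(t.slice(5).trim());
  }
  return out;
}
```

The real normalizer would likely be LLM-assisted (the scratchpads are free text), with this deterministic shape as its validated output contract.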
### 2.3 Knowledge base / index
Two layers:
**Layer 1: append-only JSONL streams in `data/_kb/`** (see §3 for full inventory).
**Layer 2: vector corpora in `data/vectors/*.parquet`** + HNSW indexes.
Auditor's `kb_index.ts` already wraps the JSONLs as a queryable index. `kb_query.ts` check uses it to surface recurring patterns across PRs.
**Distillation relevance:** Layer 1 is the EvidenceCollector substrate. Layer 2 is the HybridIndexer substrate. Neither has a unified record schema across streams — that's the formalization work.
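The Layer 1 pattern is simple enough to sketch. A minimal JSONL index in the style of `kb_index.ts` (the real module's API may differ) parses newline-delimited JSON, tolerates malformed trailing lines, and buckets rows by `run_id`:

```typescript
// Minimal JSONL-index sketch: append-only streams can have a partial final
// line mid-write, so malformed rows are skipped rather than treated as fatal.
type KbRow = { run_id?: unknown; [k: string]: unknown };

function indexJsonl(text: string): Map<string, KbRow[]> {
  const byRun = new Map<string, KbRow[]>();
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    let row: KbRow;
    try {
      row = JSON.parse(line);
    } catch {
      continue; // tolerate a partial trailing write
    }
    const key = typeof row.run_id === "string" ? row.run_id : "_unkeyed";
    const bucket = byRun.get(key) ?? [];
    bucket.push(row);
    byRun.set(key, bucket);
  }
  return byRun;
}
```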
### 2.4 MCP / context integrations
- `.mcp.json` — MCP server configuration (gitea, etc.)
- `mcp-server/` — observer + relevance + ai_models surfaces
- LLM Team UI (port 5000) — `/api/run?mode=extract` is the only registered mode (per `feedback_endpoint_probe_discipline.md`); `code_review/patch/refactor` return "Unknown mode"
- `aibridge` crate — Rust ↔ Python sidecar; OpenAI-compat proxy as of `3a0b37e`
**Distillation relevance:** Existing call surfaces are already the right shape. Distillation pipeline runs ON the gateway via `/v1/*`, not on a parallel runtime.
### 2.5 PRD / requirements docs
- `docs/PRD.md` — phases 0-37 (shipped) + 38-44 productization
- `docs/CONTROL_PLANE_PRD.md` — long-horizon control plane (2026-04-22 pivot)
- `docs/PHASES.md` — phase tracker
- `docs/DECISIONS.md` — ADRs 001-021 (021 is semantic-correctness matrix layer)
- `docs/SCRUM_MASTER_SPEC.md` — scrum loop architecture + refactor timeline
- `docs/MODE_RUNNER_TUNING_PLAN.md` — open knobs
**Distillation relevance:** PRD is the ground truth for the PRD-drift comparator. PHASES.md + auditor's `phase_sweep_findings.jsonl` already encode partial drift reports.
### 2.6 Model routing logic
- `config/modes.toml` — task_class → mode/model registry (6 task classes including new pr_audit)
- `crates/gateway/src/v1/mode.rs::is_weak_model` — strong/weak heuristic for matrix corpus downgrade
- `data/_kb/model_trust.jsonl` (45K) — per-run model performance ledger (run_id, accepted_model, attempts_made, etc.)
- `data/_kb/mode_experiments.jsonl` (1.3M) — per-call mode runner telemetry (mode, model, latency_ms, sources, response, response_chars)
**Distillation relevance:** `mode_experiments.jsonl` is the cleanest per-call record we have — it's already an EvidenceRecord with everything except observer_notes and human_override fields. The Model Routing Ledger spec module is mostly an aggregation script over this jsonl + model_trust.jsonl.
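A minimal sketch of that aggregation, assuming illustrative row fields (`task_class`, `model`, `ok`, `latency_ms`) rather than the exact `mode_experiments.jsonl` shape:

```typescript
// Reduce per-call rows into a per-(task_class, model) ledger view.
// Field names here are assumptions for the sketch.
type CallRow = { task_class: string; model: string; ok: boolean; latency_ms: number };

function ledger(rows: CallRow[]) {
  const acc = new Map<string, { n: number; ok: number; latSum: number }>();
  for (const r of rows) {
    const key = `${r.task_class}\u0000${r.model}`;
    const a = acc.get(key) ?? { n: 0, ok: 0, latSum: 0 };
    a.n += 1;
    a.ok += r.ok ? 1 : 0;
    a.latSum += r.latency_ms;
    acc.set(key, a);
  }
  return [...acc.entries()].map(([key, a]) => {
    const [task_class, model] = key.split("\u0000");
    return {
      task_class,
      model,
      sample_count: a.n,
      success_rate: a.ok / a.n,
      latency_mean: a.latSum / a.n,
    };
  });
}
```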
### 2.7 Logs / traces
- Langfuse (port 3001, docker `langfuse`) — every `/v1/chat` and `/v1/respond` call (`crates/gateway/src/v1/langfuse_trace.rs`). Fire-and-forget.
- Observer `/event` — every `/v1/chat` call also fires here (`d1d97a0`)
- `data/_observer/ops.jsonl` — observer event log (mcp-server side)
- `data/_auditor/verdicts/*.json` — per-PR auditor verdict
- Systemd journals: lakehouse, lakehouse-sidecar, lakehouse-observer, lakehouse-auditor
**Distillation relevance:** Langfuse + observer events are the trace substrate, but they're not yet linked to the JSONL streams via shared run_id. Linkage is part of EvidenceRecord work.
### 2.8 Test framework
- Bun-native tests in `crates/*/src/**/*test*` (Rust) and `tests/*` (TypeScript)
- `tests/real-world/` — scrum master + applier integration
- `tests/architecture_smoke.ts` — PRD-invariant probe against 500k workers
- `tests/multi-agent/scenarios/` — 20+ scenario fixtures (Heritage_Foods, Riverfront_Steel, etc.)
- `auditor/fixtures/hybrid_38_40_45.ts` — auditor's own dynamic fixture
**Distillation relevance:** Test framework supports both Rust and TS. The acceptance-gate suite (Phase 6 of distillation plan) lands in `tests/distillation/`.
### 2.9 Data schemas (existing, implicit)
The shapes that matter, by JSONL:
| File | Key fields | Provenance fields |
|------|-----------|-------------------|
| `audits.jsonl` (2.6M) | full per-PR verdict | `pr_number`, `head_sha`, `audited_at` |
| `audit_facts.jsonl` (506K) | extracted facts/entities/relationships from auditor inference | `pr_number`, `head_sha`, `extracted_at`, `extractor`, `verifier`, `llm_team_run_id` |
| `audit_lessons.jsonl` (539K) | derived lessons from past audits | (similar to facts) |
| `audit_discrepancies.jsonl` | N=3 consensus splits — chosen/rejected pairs | `pr_number`, `head_sha`, `claim_idx`, `votes`, `resolution` |
| `scrum_reviews.jsonl` (448K) | per-file scrum review (forensic JSON or markdown) | `file`, `reviewed_at`, `accepted_model`, `accepted_on_attempt` |
| `auto_apply.jsonl` (14K) | applier action per file | `file`, `ts`, `action`, `patches_applied` |
| `mode_experiments.jsonl` (1.3M) | per-call mode runner telemetry | `ts`, `task_class`, `mode`, `model`, `file_path`, `sources`, `latency_ms` |
| `observer_escalations.jsonl` (1.9K) | observer-diagnosed failure clusters | `ts`, `sig_hash`, `cluster_size`, `analysis`, `mode`, `kb_preamble_chars` |
| `observer_reviews.jsonl` (97K) | observer hand-reviews of scrum attempts | (TBD) |
| `model_trust.jsonl` (45K) | per-run model trust ledger | `run_id`, `task_type`, `accepted_model`, `attempts_made`, `confidence_avg`, `errors`, `thin_rejections` |
| `outcomes.jsonl` (98K) | per-run scenario outcomes | `run_id`, `sig_hash`, `created_at`, `models`, `total_events`, `ok_events`, `total_citations`, `total_gap_signals` |
| `human_overrides.jsonl` (2.4K) | human-in-loop overrides | (TBD) |
| `overseer_corrections.jsonl` (21K) | overseer model corrections | (TBD) |
| `phase_sweep_findings.jsonl` (45K) | phase-audit drift findings | `phase`, `phase_name`, `status`, `claims_verified`, `claims_fake`, `claims_partial`, `findings`, `evidence`, `discovered_at` |
| `doc_drift_corrections.jsonl` (603B) | doc drift signals | (TBD) |
| `pathway_recommendations.jsonl` (57K) | pathway memory hot-swap recommendations | `run_id` |
| `signatures.jsonl` (270K) | run signatures for dedup/grouping | (TBD) |
| `classifications.jsonl` (52K) | task-type classifications | (TBD — likely the task_type taxonomy) |
| `contract_analyses.jsonl` (4.3K) | contract analysis runs (closest to canonical EvidenceRecord) | `ts`, `ok`, `permit_id`, `analysis`, `matrix_corpora`, `matrix_hits`, `matrix_ms`, `observer_verdict`, `observer_conf`, `observer_notes`, `observer_src`, `cost`, `duration_ms` |
| `distilled_facts.jsonl` (179K) | **already-distilled fact stream** | `run_id`, `sig_hash`, `created_at`, `extractor`, `verifier`, `categorizer`, `category`, `text`, `embedding`, `embed_dim`, `schema_version`, `source_label`, `source_service` |
| `distilled_procedures.jsonl` (21K) | **already-distilled procedure stream** | (same shape as facts) |
| `distilled_config_hints.jsonl` (22K) | **already-distilled config-hint stream** | (same shape) |
---
## 3. The data substrate (what's already produced)
### Schema observation
`distilled_facts.jsonl` and `distilled_procedures.jsonl` already match what now.md calls a normalized evidence record — almost. They have:
✅ run_id, sig_hash (provenance + dedup)
✅ extractor, verifier, categorizer (deterministic role labels)
✅ schema_version (forward-compat)
✅ embedding pre-computed (already in HybridIndexer Layer 2!)
✅ category, source_label, source_service (taxonomy + origin)
✅ text (the distilled content)
❌ no observer_notes
❌ no commands_run / tool_calls
❌ no validation_results / failure_markers
❌ no human_override
So: **the `distilled_*` streams are an EvidenceRecord prototype, narrowed to LLM-extracted text.** Extending the schema to cover the missing fields (or sourcing them via JOIN to other streams) is the Phase 1 work.
`contract_analyses.jsonl` is the **other** prototype — it carries observer integration fields (verdict, confidence, notes, src) plus retrieval telemetry (matrix_corpora, matrix_hits, matrix_ms) plus per-call cost/duration. Different shape, but more complete in some axes.
The right move is to **reconcile both shapes** into a single schema rather than picking one.
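As a sketch of what reconciliation means in practice, lifting a `distilled_facts`-shaped row into a unified record carries the fields it has and defaults the ones it lacks to null rather than inventing them. All names here are assumptions, not the final schema:

```typescript
// Lift a distilled_*-shaped row into a reconciled-record shape. Fields only
// contract_analyses carries (observer_notes, etc.) default to null so the
// record is honest about what this source stream actually knows.
type DistilledRow = { run_id: string; sig_hash: string; text: string; schema_version: number };

function liftDistilled(row: DistilledRow) {
  return {
    run_id: row.run_id,
    sig_hash: row.sig_hash,
    text: row.text,
    schema_version: row.schema_version,
    observer_notes: null,      // present only in contract_analyses
    commands_run: null,        // absent from both prototypes
    validation_results: null,
    human_override: null,
  };
}
```

A matching lift for `contract_analyses` rows would fill the observer fields and null out the distilled-text ones; the unified schema is the union, with nullability marking which producer supplied the row.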
### Vector corpora (HybridIndexer Layer 2)
20 corpora live in `data/vectors/*.parquet`:
- `lakehouse_arch_v1` — architecture corpus
- `lakehouse_symbols_v1` — symbol corpus (via tree-sitter or grep)
- `lakehouse_answers_v1` — gold-standard prior reviews + escalations (commit `0844206`)
- `scrum_findings_v1` — old, superseded by answers_v1
- `distilled_factual_v202604*`, `distilled_procedural_v202604*`, `distilled_config_hint_v202604*` — vectorized distilled streams
- `kb_team_runs_v1`, `kb_team_runs_agent`, `llm_team_runs_v1` — LLM Team artifact corpora
- `chicago_permits_v1`, `entity_brief_v1`, `ethereal_workers_v1`, `workers_500k_v8` — domain corpora
- `threat_intel_v1`, `sec_tickers_v1` — external
The hybrid retrieval pattern is established: `mode.rs` queries top_k from each named corpus, merges by score, takes the top 8, and drops adjacency-polluted hits via `/relevance`. **Keyword/BM25 is missing** (the spec asks for hybrid keyword + semantic) — but DataFusion in queryd can run substring/regex queries on the underlying Parquet, so the substrate is there.
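The merge step itself is small; a sketch with an illustrative hit shape:

```typescript
// Merge per-corpus hit lists by descending score, truncate to top k.
// Hit shape is illustrative, not the real mode.rs struct.
type Hit = { corpus: string; id: string; score: number };

function mergeTopK(perCorpus: Hit[][], k = 8): Hit[] {
  return perCorpus
    .flat()
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```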
---
## 4. Gap analysis (spec module → real gap)
| Spec module | What we have | Gap |
|------|------|-----|
| Evidence Collector | 23 source JSONLs, 2 prototype schemas (`distilled_*`, `contract_analyses`) | Unified `EvidenceRecord` schema spanning all sources + JOIN view by run_id/file/timestamp |
| Success Scorer | 5 scrum_applier gates, auditor verdicts, mode_compare grounding %, pathway replay rate, scrum verdict, observer accept/reject | Single deterministic function combining these into 4 categories with explicit reasons[] |
| Playbook Extractor | bug_fingerprints (semantic-correctness layer), `_playbook_memory/`, `_playbook_lessons/` (50+ JSON), distilled_procedures.jsonl | Full task-flow playbooks (model routing path + commands_run + recovery + escalation triggers); current playbooks are bug-pattern + staffing-fill, not procedural |
| Hybrid Indexer | 20 vector corpora + pathway_memory + auditor `kb_index.ts` | Keyword/BM25 layer; task-tag filters (the embedding side is solid) |
| Dataset Builder | nothing exporting in spec format | NET NEW — `build_rag_dataset.ts`, `build_sft_dataset.ts`, `build_preference_dataset.ts` |
| Scratchpad Normalizer | tree-split scratchpads (text), `distilled_*.jsonl` (LLM-extracted) | Structured normalization of scrum/auditor scratchpads into objective/completed/failed/pending JSON |
| PRD Drift Comparator | auditor inference + static + `phase_sweep_findings.jsonl` + `doc_drift_corrections.jsonl` | Per-repo-state snapshot (the existing pieces are per-PR or per-phase) |
| Model Routing Ledger | `model_trust.jsonl` + `mode_experiments.jsonl` + strong-model downgrade gate | Aggregated, queryable view by task_type × model_name |
| Receipts | per-call jsonl rows + auditor verdicts | Per-pipeline-stage `receipt.json` with git_sha + input/output hashes + record_counts |
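The Success Scorer gap row reduces to a small deterministic function; a sketch with placeholder gate names standing in for the real scrum/auditor signals:

```typescript
// Deterministic scorer sketch: gate signals in, one of the four pinned
// categories out, with a reason attached to every decision. Gate names
// are placeholders, not the real applier/auditor fields.
type Gates = { applierPassed: boolean; auditorBlocked: boolean; humanFlagged: boolean };

function scoreRun(g: Gates): { category: string; reasons: string[] } {
  if (g.humanFlagged) {
    return { category: "needs_human_review", reasons: ["human flag present"] };
  }
  if (g.auditorBlocked) {
    return { category: "rejected", reasons: ["auditor emitted blocking issue"] };
  }
  if (!g.applierPassed) {
    return { category: "partially_accepted", reasons: ["applier gates did not all pass"] };
  }
  return { category: "accepted", reasons: ["all deterministic gates passed"] };
}
```

The real function combines more signals (grounding %, replay rate, observer accept/reject), but the shape — branch order encodes severity, every branch emits a non-empty `reasons[]` — is the point.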
---
## 5. Risks
1. **Drift from existing loops.** The scrum, auditor, and observer pipelines all write into the substrate. A distillation pipeline that defines its own EvidenceRecord without conforming to those producers' shapes will drift. Mitigation: derive `EvidenceRecord` schema from existing JSONL keys, formalize what's there before adding new fields.
2. **Over-distillation as theater.** It's tempting to "extract" content from raw runs without checking whether the existing distilled_facts/procedures streams already cover the run. Mitigation: dedup by `sig_hash` against existing distilled streams before extracting; emit pure pass-through rows when the source already has a distilled twin.
3. **Stale extraction.** `distilled_facts.jsonl` was last touched 2026-04-23 — 3 days old. `distilled_config_hints.jsonl` similar. If the extraction pipeline that produces them has rotted, building on top of them propagates rot. Mitigation: run the distillation extractor once on a fresh run before treating these as canonical; verify schema_version still matches.
4. **No-leak invariant on SFT.** The spec is non-negotiable: rejected runs must NEVER appear in `exports/sft/instruction_response.jsonl`. Easy to violate via JOIN bugs. Mitigation: SFT export reads only `category=accepted` rows from `scored-runs/*.jsonl`; tests enforce this with a fixture containing rejected/partial mix.
5. **Provenance integrity.** Every export row must trace to a source jsonl row. Mitigation: `provenance` field is `{source_file, line_offset, sig_hash}`; export-side validator checks each row's source_file exists and contains a row with matching sig_hash.
6. **Receipts as security theater.** A receipt that just says "ran successfully" is worse than nothing. Mitigation: receipts include git_sha, sha256 of input/output files, record_counts (in vs out), and an explicit `validation_pass` boolean tied to schema validators.
7. **Hybrid index keyword side.** Adding BM25 over Parquet via DataFusion is doable but requires a custom UDF. If we punt this to "later," the hybrid in HybridIndexer is dishonest naming. Mitigation: ship Phase 1-5 with semantic-only and rename the module `SemanticIndexer`; add BM25 in a follow-up phase rather than claiming hybrid prematurely.
8. **Upstream model outage.** Just observed: `kimi-k2:1t` is currently 500-ing on Ollama Cloud. If distillation pipeline depends on a single model for verification, an outage breaks the whole pipeline. Mitigation: deterministic validators must NOT call any LLM; only the LLM-driven steps (initial extraction) should depend on cloud. Failures degrade gracefully — extracted text gets routed to `needs_human_review` not silently dropped.
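Two of these mitigations are small enough to sketch directly: the SFT no-leak filter (risk 4) and the export-side provenance check (risk 5). Row shapes are illustrative:

```typescript
// Risk 4 — no-leak: the SFT export is a filter that can only ever emit
// accepted rows, so a JOIN bug upstream cannot smuggle rejected runs in.
type Scored = { evidence_run_id: string; category: string };

function sftEligible(rows: Scored[]): Scored[] {
  return rows.filter((r) => r.category === "accepted");
}

// Risk 5 — provenance: every export row must name a source file whose rows
// include a matching sig_hash. The Map stands in for reading the real JSONLs.
type ExportRow = { provenance: { source_file: string; sig_hash: string } };

function validateProvenance(rows: ExportRow[], sources: Map<string, Set<string>>): string[] {
  const errors: string[] = [];
  for (const { provenance: p } of rows) {
    const hashes = sources.get(p.source_file);
    if (!hashes) errors.push(`missing source file: ${p.source_file}`);
    else if (!hashes.has(p.sig_hash)) errors.push(`sig_hash not in ${p.source_file}: ${p.sig_hash}`);
  }
  return errors;
}
```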
---
## 6. Recommended integration points
1. **Reuse `auditor/kb_index.ts` as the EvidenceCollector substrate.** It already reads JSONL streams. Extend it to emit the unified EvidenceRecord by JOINing across streams by `run_id`/`file`/`sig_hash`.
2. **Reuse `crates/shared/src/profiles/` as the schema home for model ledger entries.** `MemoryProfile` and `RetrievalProfile` are already typed. Add `ModelRoutingLedger` alongside.
3. **Reuse `mode_experiments.jsonl` as the per-call truth source.** It's the most complete record per call (mode, model, sources, response, latency_ms, ts). Treat it as the canonical "execution trace" for any /v1/mode/execute call.
4. **Reuse `data/vectors/*` as the HybridIndexer storage.** Don't add a parallel index — the Parquet + HNSW pattern is already proven. The new RAG export emits TO an existing-shaped corpus.
5. **Reuse `scripts/build_*_corpus.ts` as the dataset-building convention.** They're already idempotent, take env knobs (LH_GATEWAY, LH_CHUNK_SIZE, LH_OVERLAP), and POST to `/vectors/index`. The new export scripts follow the same shape.
6. **Reuse `mcp-server/observer.ts` as the validation event sink.** Distillation pipeline stages emit `/event` calls so a future UI can show pipeline progress alongside scrum + scenario events.
7. **Reuse `auditor/policy.ts` as the gate-pattern reference.** The 5-gate scrum_applier and the `policy.ts` severity dispatch both encode the discipline of "deterministic check first, model opinion never." Success Scorer follows the same pattern.
8. **Reuse `contract_analyses.jsonl` as the EvidenceRecord prototype.** It's the closest existing schema to what now.md asks for. Migrate its fields into the unified EvidenceRecord; backfill its rows into `data/evidence/`.
---
## 7. Schemas to formalize in Phase 1
Based on the inventory above, the schemas Phase 1 needs to define are:
1. **EvidenceRecord** — derived from `contract_analyses` + `mode_experiments` + observer fields + the spec's required fields (run_id, task_id, timestamp, model_name, model_role, input_hash, output_hash, source_files, commands_run, retrieved_context, observer_notes, scratchpad_summary, success_markers, failure_markers, validation_results, human_override, provenance)
2. **ScoredRun**`{evidence_run_id, category in {accepted, partially_accepted, rejected, needs_human_review}, reasons: string[], scored_at, scorer_version}`
3. **Playbook**`{playbook_id, task_type, problem_pattern, useful_context, model_routing_path, commands_worked, commands_failed, validation_steps, repo_files_touched, recovery_strategy, known_failure_modes, escalation_threshold, acceptance_criteria, source_run_ids, created_at}`
4. **ScratchpadSummary**`{run_id, current_objective, completed_steps, failed_steps, pending_steps, important_paths, decisions, unresolved_questions, validation_status, next_command, source_scratchpad_hash}`
5. **ModelLedgerEntry**`{model_name, model_provider, task_type, success_rate, failure_modes, best_partner_model, escalation_role, cost, latency_p50, latency_p95, context_window, sample_count, last_updated}`
6. **RagSample** — spec shape exactly
7. **SftSample** — spec shape exactly + strict `score=accepted` invariant
8. **PreferenceSample** — spec shape exactly + `chosen != rejected` invariant
9. **Receipt**`{command, git_sha, input_files: [{path, sha256}], output_files: [{path, sha256}], record_counts: {in, out}, validation_pass, errors, warnings, duration_ms, started_at, ended_at}`
Each schema lands in `crates/shared/src/schemas/distillation/` (Rust source-of-truth) + `auditor/schemas/distillation/` (TS validators). Phase 1 acceptance: every schema has 2+ positive fixtures (drawn from existing JSONL rows) and 2+ negative fixtures (missing required, wrong type, no provenance).
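A sketch of what a substantive receipt builder might look like, assuming the Receipt shape in item 9. File contents are passed in as strings here; the real script would hash files on disk:

```typescript
import { createHash } from "node:crypto";

// sha256 of a string payload, hex-encoded (64 chars).
const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Build a receipt that is checkable, not decorative: hashes tie it to exact
// inputs/outputs, record_counts expose row loss, validation_pass is explicit.
function makeReceipt(
  command: string,
  gitSha: string,
  inputs: Record<string, string>,
  outputs: Record<string, string>,
  validationPass: boolean
) {
  return {
    command,
    git_sha: gitSha,
    input_files: Object.entries(inputs).map(([path, body]) => ({ path, sha256: sha256(body) })),
    output_files: Object.entries(outputs).map(([path, body]) => ({ path, sha256: sha256(body) })),
    record_counts: { in: Object.keys(inputs).length, out: Object.keys(outputs).length },
    validation_pass: validationPass,
  };
}
```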
---
## 8. Phase 1 readiness checklist
Before Phase 1 starts, the following must be true:
- [x] Recon doc exists (this file)
- [x] Sample shapes captured for the 8+ source JSONLs the schemas derive from
- [x] Existing distilled_* streams audited — confirmed they're prototypes, not blockers
- [x] Existing vector corpora inventoried — confirmed HybridIndexer Layer 2 substrate is real
- [x] Risks listed with mitigations
- [x] Integration points named — derive, don't reinvent
Phase 1 is unblocked after this document is reviewed by the user. Implementation begins with `crates/shared/src/schemas/distillation/evidence_record.rs` + matching `auditor/schemas/distillation/evidence_record.ts` Zod validator + 2 positive / 2 negative fixtures drawn from `distilled_facts.jsonl` and `contract_analyses.jsonl`.
---
## 9. What this document is NOT
- Not a green-light to start implementation. The spec is explicit: schemas first, then everything else.
- Not a commitment to build all 9 schemas in parallel. Phase 1 ships the EvidenceRecord schema alone if necessary, with the others queued behind it.
- Not a replacement for the spec at `/home/profit/now.md`. Spec is canonical; this document maps spec onto current state.
- Not a survey of the staffing pipeline (`crates/validator/staffing/*`, scenarios/, etc.). Distillation is orthogonal — the staffing pipeline is one of the many sources distillation reads from, not its target.