# Local Distillation Pipeline — Repo Recon

**Date:** 2026-04-26
**Status:** Phase 0 (read-only inventory — no implementation yet)
**Spec:** `/home/profit/now.md`
**Branch:** `scrum/auto-apply-19814` head `f753e11` (uncommitted: auditor rebuild)

This document inventories what already exists in the Lakehouse repo before we build the distillation substrate. It is the gating artifact: per the spec, no implementation lands until this document is settled.

The headline finding: **~70% of the spec's modules already have working substrate** in the form of JSONL streams, vector corpora, scoring gates, and a partial extraction pipeline (`distilled_facts.jsonl` / `distilled_procedures.jsonl`). The work is integration + formalization, not greenfield. The biggest risk is shipping a parallel system that drifts from what the existing scrum/auditor/observer loops actually produce.

---

## 1. Repo structure

```
/home/profit/lakehouse
├── crates/              # 15 Rust crates (see PRD.md)
│   ├── shared/          # types.rs, profiles/, model_matrix.rs, secrets.rs
│   ├── gateway/         # /v1/* HTTP surface, mode router, observer event fanout
│   ├── vectord/         # HNSW + pathway_memory.rs (88 traces) + Mem0 versioning
│   ├── catalogd/        # column types, manifests
│   ├── truth/           # Phase 42 — TOML rule engine for SQL/request gates
│   ├── validator/       # Phase 43 — staffing/devops validators
│   └── ...
├── auditor/             # TypeScript PR auditor (Bun runtime)
│   ├── audit.ts         # orchestrator
│   ├── audit_one.ts     # one-shot harness
│   ├── claim_parser.ts  # extracts ship-claims from PR body
│   ├── fact_extractor.ts # LLM Team /api/run?mode=extract integration
│   ├── kb_index.ts      # **already a queryable index over data/_kb/*.jsonl**
│   ├── kb_stats.ts
│   ├── checks/          # static.ts, dynamic.ts, inference.ts, kb_query.ts
│   ├── policy.ts        # severity → block/warn/info gates
│   └── gitea.ts         # PR poller
├── tests/
│   ├── real-world/      # scrum_master_pipeline.ts, scrum_applier.ts, runs/
│   ├── multi-agent/     # scenarios/, playbooks/
│   ├── architecture_smoke.ts
│   ├── battery/
│   └── agent_test/
├── scripts/
│   ├── build_answers_corpus.ts        # NEW 2026-04-26: lakehouse_answers_v1
│   ├── build_lakehouse_corpus.ts      # arch corpus
│   ├── build_symbols_corpus.ts        # symbols corpus
│   ├── build_scrum_findings_corpus.ts # findings corpus
│   ├── vectorize_raw_corpus.ts
│   ├── mode_experiment.ts / mode_compare.ts / mode_pass{2,3,4,5}_*.ts
│   └── ...
├── sidecar/             # Python (Ollama embed adapter)
├── mcp-server/          # observer.ts, relevance.ts, ai_models.ts (port 3700/3800)
├── ui/                  # public Bun UI (devop.live/lakehouse, port 3700)
├── data/                # **the substrate this document audits** (see §3)
└── docs/
    ├── PRD.md
    ├── PHASES.md
    ├── DECISIONS.md (ADRs 001-021)
    ├── SCRUM_MASTER_SPEC.md
    ├── MATRIX_AGENT_HANDOVER.md
    ├── MODE_RUNNER_TUNING_PLAN.md
    └── recon/
        └── local-distillation-recon.md (this file)
```

---

## 2. Existing components by spec module

### 2.1 Gateway / orchestrator

`crates/gateway/src/v1/mode.rs` is the **prompt-molder substrate** — task_class → mode → enrichment composer + LLM call. Five native modes (codereview_lakehouse, codereview_isolation, codereview_null, codereview_matrix_only, codereview_playbook_only) + staffing_inference_lakehouse + (uncommitted) pr_audit. A strong-model auto-downgrade gate is keyed to the Pass 5 variance test.
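The molding step the mode runner performs — ranked enrichments plus framing composed into a one-shot prompt — can be sketched as follows. This is an illustrative TypeScript transliteration, not the real `mode.rs` code; the `Enrichment` shape and `composePrompt` name are hypothetical:

```typescript
// Hypothetical sketch of the enrichment-composition step (not the real mode.rs API).
interface Enrichment {
  source: string; // e.g. "pathway_memory", "lakehouse_arch_v1"
  text: string;
  score: number;  // retrieval score used for ranking
}

function composePrompt(
  framing: string,
  enrichments: Enrichment[],
  task: string,
  maxChars = 8000,
): string {
  // Highest-scoring context first, truncated to a character budget,
  // then the task appended as the final section of the one-shot prompt.
  const context = enrichments
    .slice()
    .sort((a, b) => b.score - a.score)
    .map((e) => `[${e.source}] ${e.text}`)
    .join("\n")
    .slice(0, maxChars);
  return `${framing}\n\n--- context ---\n${context}\n\n--- task ---\n${task}`;
}
```

The real composer lives in Rust behind `/v1/*`; the sketch only shows the shape of the molding (rank retrieved context by score, budget it, frame the task).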
**Distillation relevance:** The mode runner already encodes "compose pathway memory + matrix retrieval + framing into a one-shot prompt." The distillation pipeline can call mode-runner endpoints rather than reimplementing retrieval.

### 2.2 Observer / scratchpad

- `mcp-server/observer.ts` — Bun service on `:3800`. `/event`, `/relevance`, `/review` endpoints. Receives scrum + scenario + langfuse-bridge sources. KB preamble blends pathway + arch + answers (3 sources).
- `mcp-server/relevance.ts` — adjacency-pollution heuristic filter (added 2026-04-25).
- Scrum scratchpad: `tests/real-world/scrum_master_pipeline.ts::treeSplitFile` — text-only multi-shard scratchpad. The auditor curates the same way (`auditor/checks/inference.ts::treeSplitDiff`).

**Distillation relevance:** Scratchpads are unstructured text. The spec wants structured extraction (objective/completed/failed/pending). The `distilled_*.jsonl` streams (§3) already do half of this for the LLM Team runs — extending it to the scrum scratchpad is the gap.

### 2.3 Knowledge base / index

Two layers:

**Layer 1: append-only JSONL streams in `data/_kb/`** (see §3 for the full inventory).

**Layer 2: vector corpora in `data/vectors/*.parquet`** + HNSW indexes.

The auditor's `kb_index.ts` already wraps the JSONLs as a queryable index. The `kb_query.ts` check uses it to surface recurring patterns across PRs.

**Distillation relevance:** Layer 1 is the EvidenceCollector substrate. Layer 2 is the HybridIndexer substrate. Neither has a unified record schema across streams — that's the formalization work.

### 2.4 MCP / context integrations

- `.mcp.json` — MCP server configuration (gitea, etc.)
- `mcp-server/` — observer + relevance + ai_models surfaces
- LLM Team UI (port 5000) — `/api/run?mode=extract` is the only registered mode (per `feedback_endpoint_probe_discipline.md`); `code_review/patch/refactor` return "Unknown mode"
- `aibridge` crate — Rust ↔ Python sidecar; OpenAI-compat proxy as of `3a0b37e`

**Distillation relevance:** The existing call surfaces are already the right shape. The distillation pipeline runs ON the gateway via `/v1/*`, not on a parallel runtime.

### 2.5 PRD / requirements docs

- `docs/PRD.md` — phases 0-37 (shipped) + 38-44 productization
- `docs/CONTROL_PLANE_PRD.md` — long-horizon control plane (2026-04-22 pivot)
- `docs/PHASES.md` — phase tracker
- `docs/DECISIONS.md` — ADRs 001-021 (021 is the semantic-correctness matrix layer)
- `docs/SCRUM_MASTER_SPEC.md` — scrum loop architecture + refactor timeline
- `docs/MODE_RUNNER_TUNING_PLAN.md` — open knobs

**Distillation relevance:** The PRD is the ground truth for the PRD-drift comparator. PHASES.md + the auditor's `phase_sweep_findings.jsonl` already encode partial drift reports.

### 2.6 Model routing logic

- `config/modes.toml` — task_class → mode/model registry (6 task classes including the new pr_audit)
- `crates/gateway/src/v1/mode.rs::is_weak_model` — strong/weak heuristic for matrix corpus downgrade
- `data/_kb/model_trust.jsonl` (45K) — per-run model performance ledger (run_id, accepted_model, attempts_made, etc.)
- `data/_kb/mode_experiments.jsonl` (1.3M) — per-call mode runner telemetry (mode, model, latency_ms, sources, response, response_chars)

**Distillation relevance:** `mode_experiments.jsonl` is the cleanest per-call record we have — it's already an EvidenceRecord with everything except the observer_notes and human_override fields. The Model Routing Ledger spec module is mostly an aggregation script over this jsonl + model_trust.jsonl.

### 2.7 Logs / traces

- Langfuse (port 3001, docker `langfuse`) — every `/v1/chat` and `/v1/respond` call (`crates/gateway/src/v1/langfuse_trace.rs`).
Fire-and-forget.
- Observer `/event` — every `/v1/chat` call also fires here (`d1d97a0`)
- `data/_observer/ops.jsonl` — observer event log (mcp-server side)
- `data/_auditor/verdicts/*.json` — per-PR auditor verdicts
- Systemd journals: lakehouse, lakehouse-sidecar, lakehouse-observer, lakehouse-auditor

**Distillation relevance:** Langfuse + observer events are the trace substrate, but they're not yet linked to the JSONL streams via a shared run_id. Linkage is part of the EvidenceRecord work.

### 2.8 Test framework

- Rust tests in `crates/*/src/**/*test*` and Bun-native tests in `tests/*` (TypeScript)
- `tests/real-world/` — scrum master + applier integration
- `tests/architecture_smoke.ts` — PRD-invariant probe against 500k workers
- `tests/multi-agent/scenarios/` — 20+ scenario fixtures (Heritage_Foods, Riverfront_Steel, etc.)
- `auditor/fixtures/hybrid_38_40_45.ts` — the auditor's own dynamic fixture

**Distillation relevance:** The test framework supports both Rust and TS. The acceptance-gate suite (Phase 6 of the distillation plan) lands in `tests/distillation/`.
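As one concrete example of what an acceptance-gate test in `tests/distillation/` would assert, here is a hedged sketch of the spec's no-leak invariant (risk 4 in §5): rejected runs never reach the SFT export. The `ScoredRun`/`SftSample` shapes follow what §7 proposes; `leakedRunIds` is a hypothetical helper, not existing code:

```typescript
// Sketch of the no-leak gate: find run_ids in the SFT export that are not
// category=accepted in the scored-runs stream. Shapes are assumptions from §7.
interface ScoredRun {
  evidence_run_id: string;
  category: "accepted" | "partially_accepted" | "rejected" | "needs_human_review";
}
interface SftSample {
  run_id: string;
  instruction: string;
  response: string;
}

function leakedRunIds(scored: ScoredRun[], sft: SftSample[]): string[] {
  // Anything not explicitly accepted is banned from the SFT export.
  const banned = new Set(
    scored.filter((r) => r.category !== "accepted").map((r) => r.evidence_run_id),
  );
  return sft.filter((s) => banned.has(s.run_id)).map((s) => s.run_id);
}
```

The gate passes iff `leakedRunIds(...)` returns an empty array when run over the real `scored-runs/*.jsonl` rows and the SFT export.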
### 2.9 Data schemas (existing, implicit)

The shapes that matter, by JSONL:

| File | Key fields | Provenance fields |
|------|-----------|-------------------|
| `audits.jsonl` (2.6M) | full per-PR verdict | `pr_number`, `head_sha`, `audited_at` |
| `audit_facts.jsonl` (506K) | extracted facts/entities/relationships from auditor inference | `pr_number`, `head_sha`, `extracted_at`, `extractor`, `verifier`, `llm_team_run_id` |
| `audit_lessons.jsonl` (539K) | derived lessons from past audits | (similar to facts) |
| `audit_discrepancies.jsonl` | N=3 consensus splits — chosen/rejected pairs | `pr_number`, `head_sha`, `claim_idx`, `votes`, `resolution` |
| `scrum_reviews.jsonl` (448K) | per-file scrum review (forensic JSON or markdown) | `file`, `reviewed_at`, `accepted_model`, `accepted_on_attempt` |
| `auto_apply.jsonl` (14K) | applier action per file | `file`, `ts`, `action`, `patches_applied` |
| `mode_experiments.jsonl` (1.3M) | per-call mode runner telemetry | `ts`, `task_class`, `mode`, `model`, `file_path`, `sources`, `latency_ms` |
| `observer_escalations.jsonl` (1.9K) | observer-diagnosed failure clusters | `ts`, `sig_hash`, `cluster_size`, `analysis`, `mode`, `kb_preamble_chars` |
| `observer_reviews.jsonl` (97K) | observer hand-reviews of scrum attempts | (TBD) |
| `model_trust.jsonl` (45K) | per-run model trust ledger | `run_id`, `task_type`, `accepted_model`, `attempts_made`, `confidence_avg`, `errors`, `thin_rejections` |
| `outcomes.jsonl` (98K) | per-run scenario outcomes | `run_id`, `sig_hash`, `created_at`, `models`, `total_events`, `ok_events`, `total_citations`, `total_gap_signals` |
| `human_overrides.jsonl` (2.4K) | human-in-loop overrides | (TBD) |
| `overseer_corrections.jsonl` (21K) | overseer model corrections | (TBD) |
| `phase_sweep_findings.jsonl` (45K) | phase-audit drift findings | `phase`, `phase_name`, `status`, `claims_verified`, `claims_fake`, `claims_partial`, `findings`, `evidence`, `discovered_at` |
| `doc_drift_corrections.jsonl` (603B) | doc drift signals | (TBD) |
| `pathway_recommendations.jsonl` (57K) | pathway memory hot-swap recommendations | `run_id` |
| `signatures.jsonl` (270K) | run signatures for dedup/grouping | (TBD) |
| `classifications.jsonl` (52K) | task-type classifications | (TBD — likely the task_type taxonomy) |
| `contract_analyses.jsonl` (4.3K) | contract analysis runs (closest to canonical EvidenceRecord) | `ts`, `ok`, `permit_id`, `analysis`, `matrix_corpora`, `matrix_hits`, `matrix_ms`, `observer_verdict`, `observer_conf`, `observer_notes`, `observer_src`, `cost`, `duration_ms` |
| `distilled_facts.jsonl` (179K) | **already-distilled fact stream** | `run_id`, `sig_hash`, `created_at`, `extractor`, `verifier`, `categorizer`, `category`, `text`, `embedding`, `embed_dim`, `schema_version`, `source_label`, `source_service` |
| `distilled_procedures.jsonl` (21K) | **already-distilled procedure stream** | (same shape as facts) |
| `distilled_config_hints.jsonl` (22K) | **already-distilled config-hint stream** | (same shape) |

---

## 3. The data substrate (what's already produced)

### Schema observation

`distilled_facts.jsonl` and `distilled_procedures.jsonl` already match what now.md calls a normalized evidence record — almost. They have:

✅ run_id, sig_hash (provenance + dedup)
✅ extractor, verifier, categorizer (deterministic role labels)
✅ schema_version (forward-compat)
✅ embedding pre-computed (already in HybridIndexer Layer 2!)
✅ category, source_label, source_service (taxonomy + origin)
✅ text (the distilled content)

❌ no observer_notes
❌ no commands_run / tool_calls
❌ no validation_results / failure_markers
❌ no human_override

So: **the `distilled_*` streams are an EvidenceRecord prototype, narrowed to LLM-extracted text.** Extending the schema to cover the missing fields (or sourcing them via a JOIN to other streams) is the Phase 1 work.
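Written out as a TypeScript type, the observed row shape plus the missing fields looks roughly like this. Field names come from the §2.9 table; the value types are assumptions, and `isDistilledRow` is a minimal illustration, not the real validator:

```typescript
// Observed distilled_* row shape (field names from the data inventory; value
// types are assumptions). The optional fields at the bottom are the Phase 1
// gap — the current streams do NOT carry them.
interface DistilledRow {
  run_id: string;
  sig_hash: string;
  created_at: string;
  extractor: string;
  verifier: string;
  categorizer: string;
  category: string;
  text: string;
  embedding: number[];
  embed_dim: number;
  schema_version: string | number;
  source_label: string;
  source_service: string;
  // Missing today — the extension surface:
  observer_notes?: string;
  commands_run?: string[];
  validation_results?: Record<string, boolean>;
  human_override?: boolean;
}

// Minimal presence guard — a row parses iff the core keys exist.
function isDistilledRow(x: any): x is DistilledRow {
  return ["run_id", "sig_hash", "text", "embedding", "schema_version"].every(
    (k) => k in x,
  );
}
```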
`contract_analyses.jsonl` is the **other** prototype — it carries observer integration fields (verdict, confidence, notes, src) plus retrieval telemetry (matrix_corpora, matrix_hits, matrix_ms) plus per-call cost/duration. Different shape, but more complete on some axes. The right move is to **reconcile both shapes** into a single schema rather than picking one.

### Vector corpora (HybridIndexer Layer 2)

20 corpora live in `data/vectors/*.parquet`:

- `lakehouse_arch_v1` — architecture corpus
- `lakehouse_symbols_v1` — symbol corpus (via tree-sitter or grep)
- `lakehouse_answers_v1` — gold-standard prior reviews + escalations (commit `0844206`)
- `scrum_findings_v1` — old, superseded by answers_v1
- `distilled_factual_v202604*`, `distilled_procedural_v202604*`, `distilled_config_hint_v202604*` — vectorized distilled streams
- `kb_team_runs_v1`, `kb_team_runs_agent`, `llm_team_runs_v1` — LLM Team artifact corpora
- `chicago_permits_v1`, `entity_brief_v1`, `ethereal_workers_v1`, `workers_500k_v8` — domain corpora
- `threat_intel_v1`, `sec_tickers_v1` — external

The hybrid retrieval pattern is established: `mode.rs` queries top_k from each named corpus, merges by score, takes the top 8, and drops entries via `/relevance`. **Keyword/BM25 is missing** (the spec asks for hybrid keyword + semantic) — but DataFusion in queryd can run substring/regex queries on the underlying Parquet, so the substrate is there.

---

## 4. Gap analysis (spec module → real gap)

| Spec module | What we have | Gap |
|------|------|-----|
| Evidence Collector | 23 source JSONLs, 2 prototype schemas (`distilled_*`, `contract_analyses`) | Unified `EvidenceRecord` schema spanning all sources + JOIN view by run_id/file/timestamp |
| Success Scorer | 5 scrum_applier gates, auditor verdicts, mode_compare grounding %, pathway replay rate, scrum verdict, observer accept/reject | Single deterministic function combining these into 4 categories with explicit reasons[] |
| Playbook Extractor | bug_fingerprints (semantic-correctness layer), `_playbook_memory/`, `_playbook_lessons/` (50+ JSON), distilled_procedures.jsonl | Full task-flow playbooks (model routing path + commands_run + recovery + escalation triggers); current playbooks are bug-pattern + staffing-fill, not procedural |
| Hybrid Indexer | 20 vector corpora + pathway_memory + auditor `kb_index.ts` | Keyword/BM25 layer; task-tag filters (the embedding side is solid) |
| Dataset Builder | nothing exporting in spec format | NET NEW — `build_rag_dataset.ts`, `build_sft_dataset.ts`, `build_preference_dataset.ts` |
| Scratchpad Normalizer | tree-split scratchpads (text), `distilled_*.jsonl` (LLM-extracted) | Structured normalization of scrum/auditor scratchpads into objective/completed/failed/pending JSON |
| PRD Drift Comparator | auditor inference + static + `phase_sweep_findings.jsonl` + `doc_drift_corrections.jsonl` | Per-repo-state snapshot (the existing pieces are per-PR or per-phase) |
| Model Routing Ledger | `model_trust.jsonl` + `mode_experiments.jsonl` + strong-model downgrade gate | Aggregated, queryable view by task_type × model_name |
| Receipts | per-call jsonl rows + auditor verdicts | Per-pipeline-stage `receipt.json` with git_sha + input/output hashes + record_counts |

---

## 5. Risks

1. **Drift from existing loops.** The scrum, auditor, and observer pipelines all write into the substrate.
A distillation pipeline that defines its own EvidenceRecord without conforming to those producers' shapes will drift. Mitigation: derive the `EvidenceRecord` schema from existing JSONL keys; formalize what's there before adding new fields.
2. **Over-distillation as theater.** It's tempting to "extract" content from raw runs without checking whether the existing distilled_facts/procedures already cover the run. Mitigation: dedup by `sig_hash` against existing distilled streams before extracting; emit pure pass-through rows when the source already has a distilled twin.
3. **Stale extraction.** `distilled_facts.jsonl` was last touched 2026-04-23 — 3 days old. `distilled_config_hints.jsonl` is similar. If the extraction pipeline that produces them has rotted, building on top of them propagates the rot. Mitigation: run the distillation extractor once on a fresh run before treating these streams as canonical; verify schema_version still matches.
4. **No-leak invariant on SFT.** The spec is non-negotiable: rejected runs must NEVER appear in `exports/sft/instruction_response.jsonl`. This is easy to violate via JOIN bugs. Mitigation: the SFT export reads only `category=accepted` rows from `scored-runs/*.jsonl`; tests enforce this with a fixture containing a rejected/partial mix.
5. **Provenance integrity.** Every export row must trace to a source jsonl row. Mitigation: the `provenance` field is `{source_file, line_offset, sig_hash}`; an export-side validator checks that each row's source_file exists and contains a row with a matching sig_hash.
6. **Receipts as security theater.** A receipt that just says "ran successfully" is worse than nothing. Mitigation: receipts include git_sha, sha256 of input/output files, record_counts (in vs out), and an explicit `validation_pass` boolean tied to the schema validators.
7. **Hybrid index keyword side.** Adding BM25 over Parquet via DataFusion is doable but requires a custom UDF. If we punt this to "later," the "hybrid" in HybridIndexer is dishonest naming.
Mitigation: ship Phases 1-5 with semantic-only retrieval and rename the module `SemanticIndexer`; add BM25 in a follow-up phase rather than claiming hybrid prematurely.
8. **Upstream model outage.** Just observed: `kimi-k2:1t` is currently 500-ing on Ollama Cloud. If the distillation pipeline depends on a single model for verification, an outage breaks the whole pipeline. Mitigation: deterministic validators must NOT call any LLM; only the LLM-driven steps (initial extraction) should depend on the cloud. Failures degrade gracefully — extracted text gets routed to `needs_human_review`, not silently dropped.

---

## 6. Recommended integration points

1. **Reuse `auditor/kb_index.ts` as the EvidenceCollector substrate.** It already reads the JSONL streams. Extend it to emit the unified EvidenceRecord by JOINing across streams on `run_id`/`file`/`sig_hash`.
2. **Reuse `crates/shared/src/profiles/` as the schema home for model ledger entries.** `MemoryProfile` and `RetrievalProfile` are already typed. Add `ModelRoutingLedger` alongside.
3. **Reuse `mode_experiments.jsonl` as the per-call truth source.** It's the most complete record per call (mode, model, sources, response, latency_ms, ts). Treat it as the canonical "execution trace" for any /v1/mode/execute call.
4. **Reuse `data/vectors/*` as the HybridIndexer storage.** Don't add a parallel index — the Parquet + HNSW pattern is already proven. The new RAG export emits TO an existing-shaped corpus.
5. **Reuse `scripts/build_*_corpus.ts` as the dataset-building convention.** They're already idempotent, take env knobs (LH_GATEWAY, LH_CHUNK_SIZE, LH_OVERLAP), and POST to `/vectors/index`. The new export scripts follow the same shape.
6. **Reuse `mcp-server/observer.ts` as the validation event sink.** Distillation pipeline stages emit `/event` calls so a future UI can show pipeline progress alongside scrum + scenario events.
7. **Reuse `auditor/policy.ts` as the gate-pattern reference.** The 5-gate scrum_applier and the `policy.ts` severity dispatch both encode the discipline of "deterministic check first, model opinion never." The Success Scorer follows the same pattern.
8. **Reuse `contract_analyses.jsonl` as the EvidenceRecord prototype.** It's the closest existing schema to what now.md asks for. Migrate its fields into the unified EvidenceRecord; backfill its rows into `data/evidence/`.

---

## 7. Schemas to formalize in Phase 1

Based on the inventory above, the schemas Phase 1 needs to define are:

1. **EvidenceRecord** — derived from `contract_analyses` + `mode_experiments` + observer fields + the spec's required fields (run_id, task_id, timestamp, model_name, model_role, input_hash, output_hash, source_files, commands_run, retrieved_context, observer_notes, scratchpad_summary, success_markers, failure_markers, validation_results, human_override, provenance)
2. **ScoredRun** — `{evidence_run_id, category in {accepted, partially_accepted, rejected, needs_human_review}, reasons: string[], scored_at, scorer_version}`
3. **Playbook** — `{playbook_id, task_type, problem_pattern, useful_context, model_routing_path, commands_worked, commands_failed, validation_steps, repo_files_touched, recovery_strategy, known_failure_modes, escalation_threshold, acceptance_criteria, source_run_ids, created_at}`
4. **ScratchpadSummary** — `{run_id, current_objective, completed_steps, failed_steps, pending_steps, important_paths, decisions, unresolved_questions, validation_status, next_command, source_scratchpad_hash}`
5. **ModelLedgerEntry** — `{model_name, model_provider, task_type, success_rate, failure_modes, best_partner_model, escalation_role, cost, latency_p50, latency_p95, context_window, sample_count, last_updated}`
6. **RagSample** — spec shape exactly
7. **SftSample** — spec shape exactly + strict `score=accepted` invariant
8.
**PreferenceSample** — spec shape exactly + `chosen != rejected` invariant
9. **Receipt** — `{command, git_sha, input_files: [{path, sha256}], output_files: [{path, sha256}], record_counts: {in, out}, validation_pass, errors, warnings, duration_ms, started_at, ended_at}`

Each schema lands in `crates/shared/src/schemas/distillation/` (Rust source of truth) + `auditor/schemas/distillation/` (TS validators). Phase 1 acceptance: every schema has 2+ positive fixtures (drawn from existing JSONL rows) and 2+ negative fixtures (missing required field, wrong type, no provenance).

---

## 8. Phase 1 readiness checklist

Before Phase 1 starts, the following must be true:

- [x] Recon doc exists (this file)
- [x] Sample shapes captured for the 8+ source JSONLs the schemas derive from
- [x] Existing distilled_* streams audited — confirmed they're prototypes, not blockers
- [x] Existing vector corpora inventoried — confirmed the HybridIndexer Layer 2 substrate is real
- [x] Risks listed with mitigations
- [x] Integration points named — derive, don't reinvent

Phase 1 is unblocked after this document is reviewed by the user. Implementation begins with `crates/shared/src/schemas/distillation/evidence_record.rs` + a matching `auditor/schemas/distillation/evidence_record.ts` Zod validator + 2/2 fixtures from `distilled_facts.jsonl` and `contract_analyses.jsonl`.

---

## 9. What this document is NOT

- Not a green light to start implementation. The spec is explicit: schemas first, then everything else.
- Not a commitment to build all 9 schemas in parallel. Phase 1 ships the EvidenceRecord schema alone if necessary, with the others queued behind it.
- Not a replacement for the spec at `/home/profit/now.md`. The spec is canonical; this document maps the spec onto current state.
- Not a survey of the staffing pipeline (`crates/validator/staffing/*`, scenarios/, etc.). Distillation is orthogonal — the staffing pipeline is one of many sources distillation reads from, not its target.