# Mode Runner Tuning Plan **Date:** 2026-04-26 **Branch:** `scrum/auto-apply-19814` (PR #11) **Status:** Pass 5 variance test complete; conclusions locked. Implementation in progress. A fresh Claude session reading this + the pass5 row range in `data/_kb/mode_experiments.jsonl` should be able to continue the work without re-running anything. --- ## What we set out to do J's directive 2026-04-26 evening: "Mode runner experiment + corpus tightening." Symptom in memory before the session: scrum_review's matrix corpus was kept-rate 0/2 across every call — silent failure. Question: should we tighten the corpus, build new ones, or change retrieval? ## What we built Three new corpora indexed under `/vectors/index`: | Corpus | Builder | Docs | Chunks | Source | |---|---|---|---|---| | `lakehouse_arch_v1` | `scripts/build_lakehouse_corpus.ts` | 93 | 2119 | DECISIONS.md ADRs + standalone ADRs + PHASES.md + PRD.md + CONTROL_PLANE_PRD.md + SCRUM_MASTER_SPEC.md | | `scrum_findings_v1` | `scripts/build_scrum_findings_corpus.ts` | 168 | 1260 | Past `scrum_reviews.jsonl` rows | | `lakehouse_symbols_v1` | `scripts/build_symbols_corpus.ts` | 656 | 2470 | Regex-extracted `pub fn|struct|enum|trait` + `///` docs from `crates/**/*.rs` | Multi-corpus support added to the mode runner: - `crates/gateway/src/v1/mode.rs` — `matrix_corpus` is now `Vec` (string OR array in modes.toml/JSON via `deserialize_string_or_vec`) - Top-K retrieved from each corpus, merged by score, top 8 globally before relevance filter - Each chunk tagged with `corpus` for telemetry - Prompt assembly prefers `doc_id` over `source` so reviewer sees `[adr:009]` not `[lakehouse_arch]` Validation infra: - `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file, paid model - `scripts/mode_pass5_summarize.ts` — mean ± stddev + head-to-head wins/losses with parser handling 3 finding-table shapes (numbered, path-with-line, path-with-symbol) - `scripts/mode_compare.ts` — extended grouping key to `mode|corpus` (sorted+joined when multiple corpora) so multi-corpus sweeps don't last-write-wins-clobber ## What we learned ### Single-rep bake-off (free-tier `openai/gpt-oss-120b:free`, 3 files) Confirmed `lakehouse_arch_v1` adds +1.7 grounded findings/file vs isolation, 100% groundedness, −20s latency. **But:** matrix slightly *hurts* on small files (273-line `delta.rs`: lakehouse 7 vs isolation 9) and unlocks +9 findings on the large file (1355-line `pathway_memory.rs`). `scrum_findings_v1` produced 24% out-of-bounds line citations from cross-file line-number drift — **dangerous, excluded from defaults**. Only safe with same-file gating (TBD if needed). ### Single-rep bake-off (paid `x-ai/grok-4.1-fast`, 3 files × 4 conditions) Picture *flips* on a strong model. Composed corpus −1.4 grounded vs isolation. Symbols-alone slightly negative. Arch-alone negative. Suggested kitchen-sinking enrichment denigrates results when the model is good enough to handle the file directly. ### Pass 5 variance test (paid grok-4.1-fast, 5 reps × 4 conditions on `pathway_memory.rs`) | Condition | n | mean grounded ± σ | range | H2H vs isolation | Δ mean | |---|---|---|---|---|---| | **isolation** | 5 | 6.2 ± 1.3 | [5–8] | baseline | — | | arch_only | 5 | 5.2 ± 0.8 | [4–6] | 0W–3L–2T | −1.0 | | symbols_only | 5 | 6.4 ± 1.5 | [4–8] | 3W–2L–0T | +0.2 | | **composed (A+C)** | 5 | 4.4 ± 1.1 | [3–6] | **0W–5L–0T** | **−1.8** | **Composed loses 5/5 head-to-head against isolation on this file with this model.** Probability under random noise = 1/2⁵ = 3.1%. Statistically significant. Data window: rows in `data/_kb/mode_experiments.jsonl` where `ts > "2026-04-26T21:50:03Z"` and `file_path == "crates/vectord/src/pathway_memory.rs"`. Re-aggregate any time with `bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z`. ## Decisions taken 1. **Composed-corpus default is reverted.** `scrum_review.preferred_mode` switches from `codereview_lakehouse` → `codereview_isolation`. Matrix corpora stay defined in modes.toml but only fire when a caller explicitly forces `codereview_lakehouse` or one of the matrix-only experimental modes. 2. **Model-aware enrichment downgrade (α) is wired** in `crates/gateway/src/v1/mode.rs::execute`. When a caller resolves a "strong" model AND the resolved mode is `codereview_lakehouse`, the runner downgrades to `codereview_isolation` flag-set automatically. Strong patterns: `x-ai/grok-*`, `anthropic/*`, `openai/gpt-4*`, `openai/gpt-5*`, `deepseek/deepseek-v4*`, `moonshotai/kimi-k2*`, `google/gemini-2.5*`. Override via `LH_FORCE_FULL_ENRICHMENT=1` for diagnostic runs. 3. **`scrum_findings_v1` stays excluded from defaults** until same-file gating lands. Built and indexed; do not point any task class at it without that gate. ## Open follow-ups (not landed in this batch) - **Same-file gating for `scrum_findings_v1`** — restrict retrieval to chunks where `file_path == focus_file` so cross-file line-number drift can't happen. Then it becomes a per-file "what was found before" signal. - **Variance test on small files** — pass 5 was 1 file (the largest, where matrix-hurt was sharpest). Confirm direction holds on 273-line / 333-line files. ~15 min × 2 files = ~30 min. - **Verify weak-model gain holds with α** — the bake-off showed matrix helps free-tier `gpt-oss-120b:free` on the large file. After α is wired, re-run on a free-tier model to confirm full enrichment still fires for it. ~5 min. - **Higher-signal matrix (β fork)** — if we ever want matrix back as a default, it can't be whole-ADR/whole-section chunks. Better: only retrieve chunks where the focus file's defined symbols appear. Tighter signal, fewer chunks. Postponed. ## Reference data + tools - **Mode-runner code:** `crates/gateway/src/v1/mode.rs` - **Mode config:** `config/modes.toml` - **Per-call experiment log:** `data/_kb/mode_experiments.jsonl` - **Sweep harnesses:** - `scripts/mode_experiment.ts` — files × modes × 1 rep (default model: `x-ai/grok-4.1-fast`) - `scripts/mode_pass2_corpus_sweep.ts` — corpus × threshold sweep - `scripts/mode_pass3_variance.ts` — temp × reps on one mode - `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file - **Aggregators:** - `scripts/mode_compare.ts` — full per-mode comparison with grounding check - `scripts/mode_pass5_summarize.ts` — variance + head-to-head, robust to 3 table shapes - **Corpus builders (re-runnable when source docs / scrum_reviews / source code change):** - `scripts/build_lakehouse_corpus.ts` - `scripts/build_scrum_findings_corpus.ts` - `scripts/build_symbols_corpus.ts` ## Re-entry recipe (fresh session) ```bash cd /home/profit/lakehouse git log --oneline scrum/auto-apply-19814 -10 # what's recent cat docs/MODE_RUNNER_TUNING_PLAN.md # this file bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z # locked result curl -s http://localhost:3100/v1/mode/list | jq '.task_classes.scrum_review' # current config ``` If you want to reproduce the bake-off: ```bash # Strong model variance test (~17 min): bun run scripts/mode_pass5_variance_paid.ts # Weak-model regression (~10 min): LH_MODEL=openai/gpt-oss-120b:free LH_REPS=3 bun run scripts/mode_pass5_variance_paid.ts ```