v1/mode: model-aware enrichment downgrade + 3 corpora + variance harness
Pass 5 (5 reps × 4 conditions × 1 file on grok-4.1-fast) showed composing
matrix corpora is anti-additive on strong models — composed lakehouse_arch
+ symbols LOST 5/5 head-to-head vs codereview_isolation (Δ −1.8 grounded
findings, p=0.031). Default flips to isolation; matrix path now auto-
downgrades when the resolved model is strong.

Mode runner:
- matrix_corpus is Vec<String> (string OR array via deserialize_string_or_vec)
- top_k=6 from each corpus, merge by score, take top 8 globally
- chunk tag prefers doc_id over source so reviewer sees [adr:009] vs [lakehouse_arch]
- is_weak_model() gate auto-downgrades codereview_lakehouse → codereview_isolation
  for strong models (default-strong; weak = :free suffix or local last-resort)
- LH_FORCE_FULL_ENRICHMENT=1 bypasses for diagnostic runs
- EnrichmentSources.downgraded_from records when the gate fires

Three corpora indexed via /vectors/index (5849 chunks total):
- lakehouse_arch_v1 — ADRs + phases + PRD + scrum spec (93 docs, 2119 chunks)
- scrum_findings_v1 — past scrum_reviews.jsonl (168 docs, 1260 chunks; EXCLUDED
  from defaults — 24% out-of-bounds line citations from cross-file drift)
- lakehouse_symbols_v1 — regex-extracted pub items + /// docs (656 docs, 2470 chunks)

Experiment infra:
- scripts/build_*_corpus.ts — re-runnable when source content changes
- scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file
- scripts/mode_pass5_summarize.ts — mean ± σ + head-to-head, parser handles
  numbered + path-with-line + path-with-symbol finding tables
- scripts/mode_compare.ts — groups by mode|corpus when sweeps span corpora
- scripts/mode_experiment.ts — default model bumped to x-ai/grok-4.1-fast,
  --corpus flag for per-call override

Decisions + open follow-ups: docs/MODE_RUNNER_TUNING_PLAN.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 17:29:17 -05:00


# Mode Runner Tuning Plan
**Date:** 2026-04-26
**Branch:** `scrum/auto-apply-19814` (PR #11)
**Status:** Pass 5 variance test complete; conclusions locked. Implementation in progress.
A fresh Claude session reading this + the pass5 row range in `data/_kb/mode_experiments.jsonl` should be able to continue the work without re-running anything.
---
## What we set out to do
J's directive 2026-04-26 evening: "Mode runner experiment + corpus tightening."
Symptom in memory before the session: scrum_review's matrix corpus had a kept rate of 0/2 on every call — a silent failure. Question: should we tighten the corpus, build new ones, or change retrieval?
## What we built
Three new corpora indexed under `/vectors/index`:
| Corpus | Builder | Docs | Chunks | Source |
|---|---|---|---|---|
| `lakehouse_arch_v1` | `scripts/build_lakehouse_corpus.ts` | 93 | 2119 | DECISIONS.md ADRs + standalone ADRs + PHASES.md + PRD.md + CONTROL_PLANE_PRD.md + SCRUM_MASTER_SPEC.md |
| `scrum_findings_v1` | `scripts/build_scrum_findings_corpus.ts` | 168 | 1260 | Past `scrum_reviews.jsonl` rows |
| `lakehouse_symbols_v1` | `scripts/build_symbols_corpus.ts` | 656 | 2470 | Regex-extracted `pub fn|struct|enum|trait` + `///` docs from `crates/**/*.rs` |
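The symbols corpus row above describes a regex pass over `crates/**/*.rs`. A minimal sketch of that kind of extraction, assuming the builder pairs each `pub` item with the run of `///` doc comments immediately above it (the real `build_symbols_corpus.ts` presumably also handles nesting, generics, and chunking; `extractSymbols` is a hypothetical name):

```typescript
// Sketch: pull `pub fn|struct|enum|trait` items and their preceding
// /// doc-comment run out of Rust source text.
function extractSymbols(rustSource: string): { doc: string; signature: string }[] {
  const out: { doc: string; signature: string }[] = [];
  let docBuf: string[] = [];
  for (const raw of rustSource.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("///")) {
      docBuf.push(line.slice(3).trim()); // accumulate the doc-comment run
    } else if (/^pub\s+(fn|struct|enum|trait)\b/.test(line)) {
      out.push({ doc: docBuf.join(" "), signature: line });
      docBuf = [];
    } else {
      docBuf = []; // any other line breaks the doc-comment run
    }
  }
  return out;
}
```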
Multi-corpus support added to the mode runner:
- `crates/gateway/src/v1/mode.rs`: `matrix_corpus` is now `Vec<String>` (string OR array in modes.toml/JSON via `deserialize_string_or_vec`)
- Top-K retrieved from each corpus, merged by score, top 8 globally before relevance filter
- Each chunk tagged with `corpus` for telemetry
- Prompt assembly prefers `doc_id` over `source` so reviewer sees `[adr:009]` not `[lakehouse_arch]`
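The merge step in the bullets above can be sketched as follows. This is a hedged TypeScript rendering of the Rust logic (the real code lives in `mode.rs`); the `Chunk` shape and the function name are assumptions, and K=6/N=8 follow the runner defaults described in this doc:

```typescript
interface Chunk { corpus: string; doc_id: string; score: number; text: string }

// Top-K from each corpus, merged by score, global top-N,
// before the relevance filter runs.
function mergeRetrieval(perCorpus: Chunk[][], k = 6, n = 8): Chunk[] {
  const candidates = perCorpus.flatMap(chunks =>
    [...chunks].sort((a, b) => b.score - a.score).slice(0, k)
  );
  return candidates.sort((a, b) => b.score - a.score).slice(0, n);
}
```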
Validation infra:
- `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file, paid model
- `scripts/mode_pass5_summarize.ts` — mean ± stddev + head-to-head wins/losses with parser handling 3 finding-table shapes (numbered, path-with-line, path-with-symbol)
- `scripts/mode_compare.ts` — extended grouping key to `mode|corpus` (sorted+joined when multiple corpora) so multi-corpus sweeps don't last-write-wins-clobber
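The `mode|corpus` grouping key can be sketched like this, assuming sorted-then-joined corpus names so that `[arch, symbols]` and `[symbols, arch]` land in the same bucket instead of clobbering each other (the `+` separator and `groupKey` name are hypothetical):

```typescript
// Grouping key for sweep rows: mode plus the sorted, joined corpus list.
function groupKey(mode: string, corpora: string[]): string {
  return `${mode}|${[...corpora].sort().join("+")}`;
}
```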
## What we learned
### Single-rep bake-off (free-tier `openai/gpt-oss-120b:free`, 3 files)
Confirmed `lakehouse_arch_v1` adds +1.7 grounded findings/file vs isolation, 100% groundedness, 20s latency. **But:** matrix slightly *hurts* on small files (273-line `delta.rs`: lakehouse 7 vs isolation 9) and unlocks +9 findings on the large file (1355-line `pathway_memory.rs`).
`scrum_findings_v1` produced 24% out-of-bounds line citations from cross-file line-number drift — **dangerous, excluded from defaults**. Only safe with same-file gating (TBD if needed).
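A grounding check of the kind behind that 24% figure is straightforward to state: a finding whose cited line number falls outside the focus file's length cannot be grounded in that file. A minimal sketch (function name hypothetical; the real check presumably lives in the summarizer/aggregator scripts):

```typescript
// Fraction of cited line numbers that fall outside [1, fileLineCount].
function outOfBoundsRate(citedLines: number[], fileLineCount: number): number {
  if (citedLines.length === 0) return 0;
  const oob = citedLines.filter(n => n < 1 || n > fileLineCount).length;
  return oob / citedLines.length;
}
```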
### Single-rep bake-off (paid `x-ai/grok-4.1-fast`, 3 files × 4 conditions)
Picture *flips* on a strong model. Composed corpus −1.4 grounded vs isolation. Symbols-alone slightly negative. Arch-alone negative. This suggested that kitchen-sinking enrichment degrades results when the model is good enough to handle the file directly.
### Pass 5 variance test (paid grok-4.1-fast, 5 reps × 4 conditions on `pathway_memory.rs`)
| Condition | n | mean grounded ± σ | range | H2H vs isolation | Δ mean |
|---|---|---|---|---|---|
| **isolation** | 5 | 6.2 ± 1.3 | [5–8] | baseline | — |
| arch_only | 5 | 5.2 ± 0.8 | [4–6] | 0W3L2T | −1.0 |
| symbols_only | 5 | 6.4 ± 1.5 | [4–8] | 3W2L0T | +0.2 |
| **composed (A+C)** | 5 | 4.4 ± 1.1 | [3–6] | **0W5L0T** | **−1.8** |
**Composed loses 5/5 head-to-head against isolation on this file with this model.** Probability under random noise = 1/2⁵ = 3.1%. Statistically significant.
Data window: rows in `data/_kb/mode_experiments.jsonl` where `ts > "2026-04-26T21:50:03Z"` and `file_path == "crates/vectord/src/pathway_memory.rs"`. Re-aggregate any time with `bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z`.
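The 1/2⁵ figure is a one-sided sign test: under the null hypothesis that composed vs isolation is a coin flip per rep, the chance of losing all 5 head-to-heads is binomial. A small sketch of that arithmetic (function names hypothetical, not from the summarizer):

```typescript
// n choose k, iteratively, to avoid factorial overflow.
function binom(n: number, k: number): number {
  let r = 1;
  for (let i = 1; i <= k; i++) r = (r * (n - i + 1)) / i;
  return r;
}

// One-sided sign test: P(at least `losses` losses in n fair coin flips).
function signTestP(losses: number, n: number): number {
  let p = 0;
  for (let k = losses; k <= n; k++) p += binom(n, k) / 2 ** n;
  return p;
}
```

For the 0W5L0T result above, `signTestP(5, 5)` gives 1/32 = 0.03125, matching the 3.1% quoted.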
## Decisions taken
1. **Composed-corpus default is reverted.** `scrum_review.preferred_mode` switches from `codereview_lakehouse` → `codereview_isolation`. Matrix corpora stay defined in modes.toml but only fire when a caller explicitly forces `codereview_lakehouse` or one of the matrix-only experimental modes.
2. **Model-aware enrichment downgrade (α) is wired** in `crates/gateway/src/v1/mode.rs::execute`. When the resolved model is "strong" AND the resolved mode is `codereview_lakehouse`, the runner automatically downgrades to `codereview_isolation`. Strong patterns: `x-ai/grok-*`, `anthropic/*`, `openai/gpt-4*`, `openai/gpt-5*`, `deepseek/deepseek-v4*`, `moonshotai/kimi-k2*`, `google/gemini-2.5*`. Override via `LH_FORCE_FULL_ENRICHMENT=1` for diagnostic runs.
3. **`scrum_findings_v1` stays excluded from defaults** until same-file gating lands. Built and indexed; do not point any task class at it without that gate.
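Decision 2's gate can be sketched in TypeScript (the real implementation is Rust in `mode.rs::execute`). Glob patterns are simplified to prefix checks, and `resolveMode` is a hypothetical name; the pattern list mirrors the one above:

```typescript
const STRONG_PREFIXES = [
  "x-ai/grok-", "anthropic/", "openai/gpt-4", "openai/gpt-5",
  "deepseek/deepseek-v4", "moonshotai/kimi-k2", "google/gemini-2.5",
];

// Strong model + lakehouse mode => downgrade to isolation,
// unless the diagnostic override env var is set.
function resolveMode(
  requestedMode: string,
  model: string,
  env: Record<string, string | undefined>,
): string {
  if (env.LH_FORCE_FULL_ENRICHMENT === "1") return requestedMode; // diagnostic bypass
  const strong = STRONG_PREFIXES.some(p => model.startsWith(p));
  if (strong && requestedMode === "codereview_lakehouse") {
    return "codereview_isolation"; // strong models do better without matrix enrichment
  }
  return requestedMode;
}
```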
## Open follow-ups (not landed in this batch)
- **Same-file gating for `scrum_findings_v1`** — restrict retrieval to chunks where `file_path == focus_file` so cross-file line-number drift can't happen. Then it becomes a per-file "what was found before" signal.
- **Variance test on small files** — pass 5 was 1 file (the largest, where matrix-hurt was sharpest). Confirm direction holds on 273-line / 333-line files. ~15 min × 2 files = ~30 min.
- **Verify weak-model gain holds with α** — the bake-off showed matrix helps free-tier `gpt-oss-120b:free` on the large file. After α is wired, re-run on a free-tier model to confirm full enrichment still fires for it. ~5 min.
- **Higher-signal matrix (β fork)** — if we ever want matrix back as a default, it can't be whole-ADR/whole-section chunks. Better: only retrieve chunks where the focus file's defined symbols appear. Tighter signal, fewer chunks. Postponed.
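The same-file gate in the first follow-up above is a one-line filter once findings chunks carry a `file_path`. A hedged sketch (types and names hypothetical, since the gate has not landed):

```typescript
interface FindingChunk { file_path: string; score: number; text: string }

// Restrict scrum_findings_v1 retrieval to the focus file so
// line citations can't drift across files.
function gateSameFile(chunks: FindingChunk[], focusFile: string): FindingChunk[] {
  return chunks.filter(c => c.file_path === focusFile);
}
```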
## Reference data + tools
- **Mode-runner code:** `crates/gateway/src/v1/mode.rs`
- **Mode config:** `config/modes.toml`
- **Per-call experiment log:** `data/_kb/mode_experiments.jsonl`
- **Sweep harnesses:**
  - `scripts/mode_experiment.ts` — files × modes × 1 rep (default model: `x-ai/grok-4.1-fast`)
  - `scripts/mode_pass2_corpus_sweep.ts` — corpus × threshold sweep
  - `scripts/mode_pass3_variance.ts` — temp × reps on one mode
  - `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file
- **Aggregators:**
  - `scripts/mode_compare.ts` — full per-mode comparison with grounding check
  - `scripts/mode_pass5_summarize.ts` — variance + head-to-head, robust to 3 table shapes
- **Corpus builders (re-runnable when source docs / scrum_reviews / source code change):**
  - `scripts/build_lakehouse_corpus.ts`
  - `scripts/build_scrum_findings_corpus.ts`
  - `scripts/build_symbols_corpus.ts`
## Re-entry recipe (fresh session)
```bash
cd /home/profit/lakehouse
git log --oneline scrum/auto-apply-19814 -10 # what's recent
cat docs/MODE_RUNNER_TUNING_PLAN.md # this file
bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z # locked result
curl -s http://localhost:3100/v1/mode/list | jq '.task_classes.scrum_review' # current config
```
If you want to reproduce the bake-off:
```bash
# Strong model variance test (~17 min):
bun run scripts/mode_pass5_variance_paid.ts
# Weak-model regression (~10 min):
LH_MODEL=openai/gpt-oss-120b:free LH_REPS=3 bun run scripts/mode_pass5_variance_paid.ts
```