v1/mode: model-aware enrichment downgrade + 3 corpora + variance harness
Pass 5 (5 reps × 4 conditions × 1 file on grok-4.1-fast) showed composing
matrix corpora is anti-additive on strong models — composed lakehouse_arch
+ symbols LOST 5/5 head-to-head vs codereview_isolation (Δ −1.8 grounded
findings, p=0.031). Default flips to isolation; matrix path now auto-
downgrades when the resolved model is strong.

Mode runner:
- matrix_corpus is Vec<String> (string OR array via deserialize_string_or_vec)
- top_k=6 from each corpus, merge by score, take top 8 globally
- chunk tag prefers doc_id over source so reviewer sees [adr:009] vs [lakehouse_arch]
- is_weak_model() gate auto-downgrades codereview_lakehouse → codereview_isolation
  for strong models (default-strong; weak = :free suffix or local last-resort)
- LH_FORCE_FULL_ENRICHMENT=1 bypasses for diagnostic runs
- EnrichmentSources.downgraded_from records when the gate fires

Three corpora indexed via /vectors/index (5849 chunks total):
- lakehouse_arch_v1 — ADRs + phases + PRD + scrum spec (93 docs, 2119 chunks)
- scrum_findings_v1 — past scrum_reviews.jsonl (168 docs, 1260 chunks; EXCLUDED
  from defaults — 24% out-of-bounds line citations from cross-file drift)
- lakehouse_symbols_v1 — regex-extracted pub items + /// docs (656 docs, 2470 chunks)

Experiment infra:
- scripts/build_*_corpus.ts — re-runnable when source content changes
- scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file
- scripts/mode_pass5_summarize.ts — mean ± σ + head-to-head, parser handles
  numbered + path-with-line + path-with-symbol finding tables
- scripts/mode_compare.ts — groups by mode|corpus when sweeps span corpora
- scripts/mode_experiment.ts — default model bumped to x-ai/grok-4.1-fast,
  --corpus flag for per-call override

Decisions + open follow-ups: docs/MODE_RUNNER_TUNING_PLAN.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 17:29:17 -05:00


# Mode Runner Tuning Plan
**Date:** 2026-04-26
**Branch:** `scrum/auto-apply-19814` (PR #11)
**Status:** Pass 5 variance test complete; conclusions locked. Implementation in progress.
A fresh Claude session reading this + the pass5 row range in `data/_kb/mode_experiments.jsonl` should be able to continue the work without re-running anything.
---
## What we set out to do
J's directive 2026-04-26 evening: "Mode runner experiment + corpus tightening."
Symptom in memory before the session: scrum_review's matrix corpus had a kept rate of 0/2 on every call — a silent failure. Question: should we tighten the corpus, build new ones, or change retrieval?
## What we built
Three new corpora indexed under `/vectors/index`:
| Corpus | Builder | Docs | Chunks | Source |
|---|---|---|---|---|
| `lakehouse_arch_v1` | `scripts/build_lakehouse_corpus.ts` | 93 | 2119 | DECISIONS.md ADRs + standalone ADRs + PHASES.md + PRD.md + CONTROL_PLANE_PRD.md + SCRUM_MASTER_SPEC.md |
| `scrum_findings_v1` | `scripts/build_scrum_findings_corpus.ts` | 168 | 1260 | Past `scrum_reviews.jsonl` rows |
| `lakehouse_symbols_v1` | `scripts/build_symbols_corpus.ts` | 656 | 2470 | Regex-extracted `pub fn|struct|enum|trait` + `///` docs from `crates/**/*.rs` |
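The symbols corpus row above describes a regex pass over `crates/**/*.rs`. A minimal sketch of that kind of extraction, assuming the builder pairs each `pub` item with the run of `///` doc comments immediately above it (the real `build_symbols_corpus.ts` presumably also handles nesting, generics, and chunking; `extractSymbols` is a hypothetical name):

```typescript
// Sketch: pull `pub fn|struct|enum|trait` items and their preceding
// /// doc-comment run out of Rust source text.
function extractSymbols(rustSource: string): { doc: string; signature: string }[] {
  const out: { doc: string; signature: string }[] = [];
  let docBuf: string[] = [];
  for (const raw of rustSource.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("///")) {
      docBuf.push(line.slice(3).trim()); // accumulate the doc-comment run
    } else if (/^pub\s+(fn|struct|enum|trait)\b/.test(line)) {
      out.push({ doc: docBuf.join(" "), signature: line });
      docBuf = [];
    } else {
      docBuf = []; // any other line breaks the doc-comment run
    }
  }
  return out;
}
```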
Multi-corpus support added to the mode runner:
- `crates/gateway/src/v1/mode.rs`: `matrix_corpus` is now `Vec<String>` (string OR array in modes.toml/JSON via `deserialize_string_or_vec`)
- Top-K retrieved from each corpus, merged by score, top 8 globally before relevance filter
- Each chunk tagged with `corpus` for telemetry
- Prompt assembly prefers `doc_id` over `source` so reviewer sees `[adr:009]` not `[lakehouse_arch]`
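The merge step in the bullets above can be sketched as follows. This is a hedged TypeScript rendering of the Rust logic (the real code lives in `mode.rs`); the `Chunk` shape and the function name are assumptions, and K=6/N=8 follow the runner defaults described in this doc:

```typescript
interface Chunk { corpus: string; doc_id: string; score: number; text: string }

// Top-K from each corpus, merged by score, global top-N,
// before the relevance filter runs.
function mergeRetrieval(perCorpus: Chunk[][], k = 6, n = 8): Chunk[] {
  const candidates = perCorpus.flatMap(chunks =>
    [...chunks].sort((a, b) => b.score - a.score).slice(0, k)
  );
  return candidates.sort((a, b) => b.score - a.score).slice(0, n);
}
```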
Validation infra:
- `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file, paid model
- `scripts/mode_pass5_summarize.ts` — mean ± stddev + head-to-head wins/losses with parser handling 3 finding-table shapes (numbered, path-with-line, path-with-symbol)
- `scripts/mode_compare.ts` — extended grouping key to `mode|corpus` (sorted+joined when multiple corpora) so multi-corpus sweeps don't last-write-wins-clobber
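The `mode|corpus` grouping key can be sketched like this, assuming sorted-then-joined corpus names so that `[arch, symbols]` and `[symbols, arch]` land in the same bucket instead of clobbering each other (the `+` separator and `groupKey` name are hypothetical):

```typescript
// Grouping key for sweep rows: mode plus the sorted, joined corpus list.
function groupKey(mode: string, corpora: string[]): string {
  return `${mode}|${[...corpora].sort().join("+")}`;
}
```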
## What we learned
### Single-rep bake-off (free-tier `openai/gpt-oss-120b:free`, 3 files)
Confirmed `lakehouse_arch_v1` adds +1.7 grounded findings/file vs isolation, 100% groundedness, 20s latency. **But:** matrix slightly *hurts* on small files (273-line `delta.rs`: lakehouse 7 vs isolation 9) and unlocks +9 findings on the large file (1355-line `pathway_memory.rs`).
`scrum_findings_v1` produced 24% out-of-bounds line citations from cross-file line-number drift — **dangerous, excluded from defaults**. Only safe with same-file gating (TBD if needed).
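A grounding check of the kind behind that 24% figure is straightforward to state: a finding whose cited line number falls outside the focus file's length cannot be grounded in that file. A minimal sketch (function name hypothetical; the real check presumably lives in the summarizer/aggregator scripts):

```typescript
// Fraction of cited line numbers that fall outside [1, fileLineCount].
function outOfBoundsRate(citedLines: number[], fileLineCount: number): number {
  if (citedLines.length === 0) return 0;
  const oob = citedLines.filter(n => n < 1 || n > fileLineCount).length;
  return oob / citedLines.length;
}
```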
### Single-rep bake-off (paid `x-ai/grok-4.1-fast`, 3 files × 4 conditions)
Picture *flips* on a strong model. Composed corpus −1.4 grounded vs isolation. Symbols-alone slightly negative. Arch-alone negative. This suggested that kitchen-sinking enrichment degrades results when the model is good enough to handle the file directly.
### Pass 5 variance test (paid grok-4.1-fast, 5 reps × 4 conditions on `pathway_memory.rs`)
| Condition | n | mean grounded ± σ | range | H2H vs isolation | Δ mean |
|---|---|---|---|---|---|
| **isolation** | 5 | 6.2 ± 1.3 | [5–8] | baseline | — |
| arch_only | 5 | 5.2 ± 0.8 | [4–6] | 0W3L2T | −1.0 |
| symbols_only | 5 | 6.4 ± 1.5 | [4–8] | 3W2L0T | +0.2 |
| **composed (A+C)** | 5 | 4.4 ± 1.1 | [3–6] | **0W5L0T** | **−1.8** |
**Composed loses 5/5 head-to-head against isolation on this file with this model.** Probability under random noise = 1/2⁵ = 3.1%. Statistically significant.
Data window: rows in `data/_kb/mode_experiments.jsonl` where `ts > "2026-04-26T21:50:03Z"` and `file_path == "crates/vectord/src/pathway_memory.rs"`. Re-aggregate any time with `bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z`.
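The 1/2⁵ figure is a one-sided sign test: under the null hypothesis that composed vs isolation is a coin flip per rep, the chance of losing all 5 head-to-heads is binomial. A small sketch of that arithmetic (function names hypothetical, not from the summarizer):

```typescript
// n choose k, iteratively, to avoid factorial overflow.
function binom(n: number, k: number): number {
  let r = 1;
  for (let i = 1; i <= k; i++) r = (r * (n - i + 1)) / i;
  return r;
}

// One-sided sign test: P(at least `losses` losses in n fair coin flips).
function signTestP(losses: number, n: number): number {
  let p = 0;
  for (let k = losses; k <= n; k++) p += binom(n, k) / 2 ** n;
  return p;
}
```

For the 0W5L0T result above, `signTestP(5, 5)` gives 1/32 = 0.03125, matching the 3.1% quoted.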
## Decisions taken
1. **Composed-corpus default is reverted.** `scrum_review.preferred_mode` switches from `codereview_lakehouse` → `codereview_isolation`. Matrix corpora stay defined in modes.toml but only fire when a caller explicitly forces `codereview_lakehouse` or one of the matrix-only experimental modes.
2. **Model-aware enrichment downgrade (α) is wired** in `crates/gateway/src/v1/mode.rs::execute`. When the resolved model is "strong" AND the resolved mode is `codereview_lakehouse`, the runner automatically downgrades to `codereview_isolation`. Strong patterns: `x-ai/grok-*`, `anthropic/*`, `openai/gpt-4*`, `openai/gpt-5*`, `deepseek/deepseek-v4*`, `moonshotai/kimi-k2*`, `google/gemini-2.5*`. Override via `LH_FORCE_FULL_ENRICHMENT=1` for diagnostic runs.
3. **`scrum_findings_v1` stays excluded from defaults** until same-file gating lands. Built and indexed; do not point any task class at it without that gate.
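Decision 2's gate can be sketched in TypeScript (the real implementation is Rust in `mode.rs::execute`). Glob patterns are simplified to prefix checks, and `resolveMode` is a hypothetical name; the pattern list mirrors the one above:

```typescript
const STRONG_PREFIXES = [
  "x-ai/grok-", "anthropic/", "openai/gpt-4", "openai/gpt-5",
  "deepseek/deepseek-v4", "moonshotai/kimi-k2", "google/gemini-2.5",
];

// Strong model + lakehouse mode => downgrade to isolation,
// unless the diagnostic override env var is set.
function resolveMode(
  requestedMode: string,
  model: string,
  env: Record<string, string | undefined>,
): string {
  if (env.LH_FORCE_FULL_ENRICHMENT === "1") return requestedMode; // diagnostic bypass
  const strong = STRONG_PREFIXES.some(p => model.startsWith(p));
  if (strong && requestedMode === "codereview_lakehouse") {
    return "codereview_isolation"; // strong models do better without matrix enrichment
  }
  return requestedMode;
}
```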
## Open follow-ups (not landed in this batch)
- **Same-file gating for `scrum_findings_v1`** — restrict retrieval to chunks where `file_path == focus_file` so cross-file line-number drift can't happen. Then it becomes a per-file "what was found before" signal.
- **Variance test on small files** — pass 5 was 1 file (the largest, where matrix-hurt was sharpest). Confirm direction holds on 273-line / 333-line files. ~15 min × 2 files = ~30 min.
- **Verify weak-model gain holds with α** — the bake-off showed matrix helps free-tier `gpt-oss-120b:free` on the large file. After α is wired, re-run on a free-tier model to confirm full enrichment still fires for it. ~5 min.
- **Higher-signal matrix (β fork)** — if we ever want matrix back as a default, it can't be whole-ADR/whole-section chunks. Better: only retrieve chunks where the focus file's defined symbols appear. Tighter signal, fewer chunks. Postponed.
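The same-file gate in the first follow-up above is a one-line filter once findings chunks carry a `file_path`. A hedged sketch (types and names hypothetical, since the gate has not landed):

```typescript
interface FindingChunk { file_path: string; score: number; text: string }

// Restrict scrum_findings_v1 retrieval to the focus file so
// line citations can't drift across files.
function gateSameFile(chunks: FindingChunk[], focusFile: string): FindingChunk[] {
  return chunks.filter(c => c.file_path === focusFile);
}
```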
## Reference data + tools
- **Mode-runner code:** `crates/gateway/src/v1/mode.rs`
- **Mode config:** `config/modes.toml`
- **Per-call experiment log:** `data/_kb/mode_experiments.jsonl`
- **Sweep harnesses:**
  - `scripts/mode_experiment.ts` — files × modes × 1 rep (default model: `x-ai/grok-4.1-fast`)
  - `scripts/mode_pass2_corpus_sweep.ts` — corpus × threshold sweep
  - `scripts/mode_pass3_variance.ts` — temp × reps on one mode
  - `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file
- **Aggregators:**
  - `scripts/mode_compare.ts` — full per-mode comparison with grounding check
  - `scripts/mode_pass5_summarize.ts` — variance + head-to-head, robust to 3 table shapes
- **Corpus builders (re-runnable when source docs / scrum_reviews / source code change):**
  - `scripts/build_lakehouse_corpus.ts`
  - `scripts/build_scrum_findings_corpus.ts`
  - `scripts/build_symbols_corpus.ts`
## Re-entry recipe (fresh session)
```bash
cd /home/profit/lakehouse
git log --oneline scrum/auto-apply-19814 -10 # what's recent
cat docs/MODE_RUNNER_TUNING_PLAN.md # this file
bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z # locked result
curl -s http://localhost:3100/v1/mode/list | jq '.task_classes.scrum_review' # current config
```
If you want to reproduce the bake-off:
```bash
# Strong model variance test (~17 min):
bun run scripts/mode_pass5_variance_paid.ts
# Weak-model regression (~10 min):
LH_MODEL=openai/gpt-oss-120b:free LH_REPS=3 bun run scripts/mode_pass5_variance_paid.ts
```