Pass 5 (5 reps × 4 conditions × 1 file on grok-4.1-fast) showed composing matrix corpora is anti-additive on strong models — composed lakehouse_arch + symbols LOST 5/5 head-to-head vs codereview_isolation (Δ −1.8 grounded findings, p=0.031). Default flips to isolation; matrix path now auto- downgrades when the resolved model is strong. Mode runner: - matrix_corpus is Vec<String> (string OR array via deserialize_string_or_vec) - top_k=6 from each corpus, merge by score, take top 8 globally - chunk tag prefers doc_id over source so reviewer sees [adr:009] vs [lakehouse_arch] - is_weak_model() gate auto-downgrades codereview_lakehouse → codereview_isolation for strong models (default-strong; weak = :free suffix or local last-resort) - LH_FORCE_FULL_ENRICHMENT=1 bypasses for diagnostic runs - EnrichmentSources.downgraded_from records when the gate fires Three corpora indexed via /vectors/index (5849 chunks total): - lakehouse_arch_v1 — ADRs + phases + PRD + scrum spec (93 docs, 2119 chunks) - scrum_findings_v1 — past scrum_reviews.jsonl (168 docs, 1260 chunks; EXCLUDED from defaults — 24% out-of-bounds line citations from cross-file drift) - lakehouse_symbols_v1 — regex-extracted pub items + /// docs (656 docs, 2470 chunks) Experiment infra: - scripts/build_*_corpus.ts — re-runnable when source content changes - scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file - scripts/mode_pass5_summarize.ts — mean ± σ + head-to-head, parser handles numbered + path-with-line + path-with-symbol finding tables - scripts/mode_compare.ts — groups by mode|corpus when sweeps span corpora - scripts/mode_experiment.ts — default model bumped to x-ai/grok-4.1-fast, --corpus flag for per-call override Decisions + open follow-ups: docs/MODE_RUNNER_TUNING_PLAN.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7.2 KiB
Mode Runner Tuning Plan
Date: 2026-04-26
Branch: scrum/auto-apply-19814 (PR #11)
Status: Pass 5 variance test complete; conclusions locked. Implementation in progress.
A fresh Claude session reading this + the pass5 row range in data/_kb/mode_experiments.jsonl should be able to continue the work without re-running anything.
What we set out to do
J's directive 2026-04-26 evening: "Mode runner experiment + corpus tightening."
Symptom in memory before the session: scrum_review's matrix corpus was kept-rate 0/2 across every call — silent failure. Question: should we tighten the corpus, build new ones, or change retrieval?
What we built
Three new corpora indexed under /vectors/index:
| Corpus | Builder | Docs | Chunks | Source |
|---|---|---|---|---|
lakehouse_arch_v1 |
scripts/build_lakehouse_corpus.ts |
93 | 2119 | DECISIONS.md ADRs + standalone ADRs + PHASES.md + PRD.md + CONTROL_PLANE_PRD.md + SCRUM_MASTER_SPEC.md |
scrum_findings_v1 |
scripts/build_scrum_findings_corpus.ts |
168 | 1260 | Past scrum_reviews.jsonl rows |
lakehouse_symbols_v1 |
scripts/build_symbols_corpus.ts |
656 | 2470 | Regex-extracted `pub fn |
Multi-corpus support added to the mode runner:
crates/gateway/src/v1/mode.rs—matrix_corpusis nowVec<String>(string OR array in modes.toml/JSON viadeserialize_string_or_vec)- Top-K retrieved from each corpus, merged by score, top 8 globally before relevance filter
- Each chunk tagged with
corpusfor telemetry - Prompt assembly prefers
doc_idoversourceso reviewer sees[adr:009]not[lakehouse_arch]
Validation infra:
scripts/mode_pass5_variance_paid.ts— N reps × M conditions on one file, paid modelscripts/mode_pass5_summarize.ts— mean ± stddev + head-to-head wins/losses with parser handling 3 finding-table shapes (numbered, path-with-line, path-with-symbol)scripts/mode_compare.ts— extended grouping key tomode|corpus(sorted+joined when multiple corpora) so multi-corpus sweeps don't last-write-wins-clobber
What we learned
Single-rep bake-off (free-tier openai/gpt-oss-120b:free, 3 files)
Confirmed lakehouse_arch_v1 adds +1.7 grounded findings/file vs isolation, 100% groundedness, −20s latency. But: matrix slightly hurts on small files (273-line delta.rs: lakehouse 7 vs isolation 9) and unlocks +9 findings on the large file (1355-line pathway_memory.rs).
scrum_findings_v1 produced 24% out-of-bounds line citations from cross-file line-number drift — dangerous, excluded from defaults. Only safe with same-file gating (TBD if needed).
Single-rep bake-off (paid x-ai/grok-4.1-fast, 3 files × 4 conditions)
Picture flips on a strong model. Composed corpus −1.4 grounded vs isolation. Symbols-alone slightly negative. Arch-alone negative. Suggested kitchen-sinking enrichment denigrates results when the model is good enough to handle the file directly.
Pass 5 variance test (paid grok-4.1-fast, 5 reps × 4 conditions on pathway_memory.rs)
| Condition | n | mean grounded ± σ | range | H2H vs isolation | Δ mean |
|---|---|---|---|---|---|
| isolation | 5 | 6.2 ± 1.3 | [5–8] | baseline | — |
| arch_only | 5 | 5.2 ± 0.8 | [4–6] | 0W–3L–2T | −1.0 |
| symbols_only | 5 | 6.4 ± 1.5 | [4–8] | 3W–2L–0T | +0.2 |
| composed (A+C) | 5 | 4.4 ± 1.1 | [3–6] | 0W–5L–0T | −1.8 |
Composed loses 5/5 head-to-head against isolation on this file with this model. Probability under random noise = 1/2⁵ = 3.1%. Statistically significant.
Data window: rows in data/_kb/mode_experiments.jsonl where ts > "2026-04-26T21:50:03Z" and file_path == "crates/vectord/src/pathway_memory.rs". Re-aggregate any time with bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z.
Decisions taken
-
Composed-corpus default is reverted.
scrum_review.preferred_modeswitches fromcodereview_lakehouse→codereview_isolation. Matrix corpora stay defined in modes.toml but only fire when a caller explicitly forcescodereview_lakehouseor one of the matrix-only experimental modes. -
Model-aware enrichment downgrade (α) is wired in
crates/gateway/src/v1/mode.rs::execute. When a caller resolves a "strong" model AND the resolved mode iscodereview_lakehouse, the runner downgrades tocodereview_isolationflag-set automatically. Strong patterns:x-ai/grok-*,anthropic/*,openai/gpt-4*,openai/gpt-5*,deepseek/deepseek-v4*,moonshotai/kimi-k2*,google/gemini-2.5*. Override viaLH_FORCE_FULL_ENRICHMENT=1for diagnostic runs. -
scrum_findings_v1stays excluded from defaults until same-file gating lands. Built and indexed; do not point any task class at it without that gate.
Open follow-ups (not landed in this batch)
- Same-file gating for
scrum_findings_v1— restrict retrieval to chunks wherefile_path == focus_fileso cross-file line-number drift can't happen. Then it becomes a per-file "what was found before" signal. - Variance test on small files — pass 5 was 1 file (the largest, where matrix-hurt was sharpest). Confirm direction holds on 273-line / 333-line files. ~15 min × 2 files = ~30 min.
- Verify weak-model gain holds with α — the bake-off showed matrix helps free-tier
gpt-oss-120b:freeon the large file. After α is wired, re-run on a free-tier model to confirm full enrichment still fires for it. ~5 min. - Higher-signal matrix (β fork) — if we ever want matrix back as a default, it can't be whole-ADR/whole-section chunks. Better: only retrieve chunks where the focus file's defined symbols appear. Tighter signal, fewer chunks. Postponed.
Reference data + tools
- Mode-runner code:
crates/gateway/src/v1/mode.rs - Mode config:
config/modes.toml - Per-call experiment log:
data/_kb/mode_experiments.jsonl - Sweep harnesses:
scripts/mode_experiment.ts— files × modes × 1 rep (default model:x-ai/grok-4.1-fast)scripts/mode_pass2_corpus_sweep.ts— corpus × threshold sweepscripts/mode_pass3_variance.ts— temp × reps on one modescripts/mode_pass5_variance_paid.ts— N reps × M conditions on one file
- Aggregators:
scripts/mode_compare.ts— full per-mode comparison with grounding checkscripts/mode_pass5_summarize.ts— variance + head-to-head, robust to 3 table shapes
- Corpus builders (re-runnable when source docs / scrum_reviews / source code change):
scripts/build_lakehouse_corpus.tsscripts/build_scrum_findings_corpus.tsscripts/build_symbols_corpus.ts
Re-entry recipe (fresh session)
cd /home/profit/lakehouse
git log --oneline scrum/auto-apply-19814 -10 # what's recent
cat docs/MODE_RUNNER_TUNING_PLAN.md # this file
bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z # locked result
curl -s http://localhost:3100/v1/mode/list | jq '.task_classes.scrum_review' # current config
If you want to reproduce the bake-off:
# Strong model variance test (~17 min):
bun run scripts/mode_pass5_variance_paid.ts
# Weak-model regression (~10 min):
LH_MODEL=openai/gpt-oss-120b:free LH_REPS=3 bun run scripts/mode_pass5_variance_paid.ts