Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
Pass 5 (5 reps × 4 conditions × 1 file on grok-4.1-fast) showed composing matrix corpora is anti-additive on strong models — composed lakehouse_arch + symbols LOST 5/5 head-to-head vs codereview_isolation (Δ −1.8 grounded findings, p=0.031). Default flips to isolation; matrix path now auto- downgrades when the resolved model is strong. Mode runner: - matrix_corpus is Vec<String> (string OR array via deserialize_string_or_vec) - top_k=6 from each corpus, merge by score, take top 8 globally - chunk tag prefers doc_id over source so reviewer sees [adr:009] vs [lakehouse_arch] - is_weak_model() gate auto-downgrades codereview_lakehouse → codereview_isolation for strong models (default-strong; weak = :free suffix or local last-resort) - LH_FORCE_FULL_ENRICHMENT=1 bypasses for diagnostic runs - EnrichmentSources.downgraded_from records when the gate fires Three corpora indexed via /vectors/index (5849 chunks total): - lakehouse_arch_v1 — ADRs + phases + PRD + scrum spec (93 docs, 2119 chunks) - scrum_findings_v1 — past scrum_reviews.jsonl (168 docs, 1260 chunks; EXCLUDED from defaults — 24% out-of-bounds line citations from cross-file drift) - lakehouse_symbols_v1 — regex-extracted pub items + /// docs (656 docs, 2470 chunks) Experiment infra: - scripts/build_*_corpus.ts — re-runnable when source content changes - scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file - scripts/mode_pass5_summarize.ts — mean ± σ + head-to-head, parser handles numbered + path-with-line + path-with-symbol finding tables - scripts/mode_compare.ts — groups by mode|corpus when sweeps span corpora - scripts/mode_experiment.ts — default model bumped to x-ai/grok-4.1-fast, --corpus flag for per-call override Decisions + open follow-ups: docs/MODE_RUNNER_TUNING_PLAN.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
115 lines
7.2 KiB
Markdown
115 lines
7.2 KiB
Markdown
# Mode Runner Tuning Plan
|
||
|
||
**Date:** 2026-04-26
|
||
**Branch:** `scrum/auto-apply-19814` (PR #11)
|
||
**Status:** Pass 5 variance test complete; conclusions locked. Implementation in progress.
|
||
|
||
A fresh Claude session reading this + the pass5 row range in `data/_kb/mode_experiments.jsonl` should be able to continue the work without re-running anything.
|
||
|
||
---
|
||
|
||
## What we set out to do
|
||
|
||
J's directive 2026-04-26 evening: "Mode runner experiment + corpus tightening."
|
||
|
||
Symptom in memory before the session: scrum_review's matrix corpus was kept-rate 0/2 across every call — silent failure. Question: should we tighten the corpus, build new ones, or change retrieval?
|
||
|
||
## What we built
|
||
|
||
Three new corpora indexed under `/vectors/index`:
|
||
|
||
| Corpus | Builder | Docs | Chunks | Source |
|
||
|---|---|---|---|---|
|
||
| `lakehouse_arch_v1` | `scripts/build_lakehouse_corpus.ts` | 93 | 2119 | DECISIONS.md ADRs + standalone ADRs + PHASES.md + PRD.md + CONTROL_PLANE_PRD.md + SCRUM_MASTER_SPEC.md |
|
||
| `scrum_findings_v1` | `scripts/build_scrum_findings_corpus.ts` | 168 | 1260 | Past `scrum_reviews.jsonl` rows |
|
||
| `lakehouse_symbols_v1` | `scripts/build_symbols_corpus.ts` | 656 | 2470 | Regex-extracted `pub fn|struct|enum|trait` + `///` docs from `crates/**/*.rs` |
|
||
|
||
Multi-corpus support added to the mode runner:
|
||
- `crates/gateway/src/v1/mode.rs` — `matrix_corpus` is now `Vec<String>` (string OR array in modes.toml/JSON via `deserialize_string_or_vec`)
|
||
- Top-K retrieved from each corpus, merged by score, top 8 globally before relevance filter
|
||
- Each chunk tagged with `corpus` for telemetry
|
||
- Prompt assembly prefers `doc_id` over `source` so reviewer sees `[adr:009]` not `[lakehouse_arch]`
|
||
|
||
Validation infra:
|
||
- `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file, paid model
|
||
- `scripts/mode_pass5_summarize.ts` — mean ± stddev + head-to-head wins/losses with parser handling 3 finding-table shapes (numbered, path-with-line, path-with-symbol)
|
||
- `scripts/mode_compare.ts` — extended grouping key to `mode|corpus` (sorted+joined when multiple corpora) so multi-corpus sweeps don't last-write-wins-clobber
|
||
|
||
## What we learned
|
||
|
||
### Single-rep bake-off (free-tier `openai/gpt-oss-120b:free`, 3 files)
|
||
|
||
Confirmed `lakehouse_arch_v1` adds +1.7 grounded findings/file vs isolation, 100% groundedness, −20s latency. **But:** matrix slightly *hurts* on small files (273-line `delta.rs`: lakehouse 7 vs isolation 9) and unlocks +9 findings on the large file (1355-line `pathway_memory.rs`).
|
||
|
||
`scrum_findings_v1` produced 24% out-of-bounds line citations from cross-file line-number drift — **dangerous, excluded from defaults**. Only safe with same-file gating (TBD if needed).
|
||
|
||
### Single-rep bake-off (paid `x-ai/grok-4.1-fast`, 3 files × 4 conditions)
|
||
|
||
Picture *flips* on a strong model. Composed corpus −1.4 grounded vs isolation. Symbols-alone slightly negative. Arch-alone negative. Suggested kitchen-sinking enrichment denigrates results when the model is good enough to handle the file directly.
|
||
|
||
### Pass 5 variance test (paid grok-4.1-fast, 5 reps × 4 conditions on `pathway_memory.rs`)
|
||
|
||
| Condition | n | mean grounded ± σ | range | H2H vs isolation | Δ mean |
|
||
|---|---|---|---|---|---|
|
||
| **isolation** | 5 | 6.2 ± 1.3 | [5–8] | baseline | — |
|
||
| arch_only | 5 | 5.2 ± 0.8 | [4–6] | 0W–3L–2T | −1.0 |
|
||
| symbols_only | 5 | 6.4 ± 1.5 | [4–8] | 3W–2L–0T | +0.2 |
|
||
| **composed (A+C)** | 5 | 4.4 ± 1.1 | [3–6] | **0W–5L–0T** | **−1.8** |
|
||
|
||
**Composed loses 5/5 head-to-head against isolation on this file with this model.** Probability under random noise = 1/2⁵ = 3.1%. Statistically significant.
|
||
|
||
Data window: rows in `data/_kb/mode_experiments.jsonl` where `ts > "2026-04-26T21:50:03Z"` and `file_path == "crates/vectord/src/pathway_memory.rs"`. Re-aggregate any time with `bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z`.
|
||
|
||
## Decisions taken
|
||
|
||
1. **Composed-corpus default is reverted.** `scrum_review.preferred_mode` switches from `codereview_lakehouse` → `codereview_isolation`. Matrix corpora stay defined in modes.toml but only fire when a caller explicitly forces `codereview_lakehouse` or one of the matrix-only experimental modes.
|
||
|
||
2. **Model-aware enrichment downgrade (α) is wired** in `crates/gateway/src/v1/mode.rs::execute`. When a caller resolves a "strong" model AND the resolved mode is `codereview_lakehouse`, the runner downgrades to `codereview_isolation` flag-set automatically. Strong patterns: `x-ai/grok-*`, `anthropic/*`, `openai/gpt-4*`, `openai/gpt-5*`, `deepseek/deepseek-v4*`, `moonshotai/kimi-k2*`, `google/gemini-2.5*`. Override via `LH_FORCE_FULL_ENRICHMENT=1` for diagnostic runs.
|
||
|
||
3. **`scrum_findings_v1` stays excluded from defaults** until same-file gating lands. Built and indexed; do not point any task class at it without that gate.
|
||
|
||
## Open follow-ups (not landed in this batch)
|
||
|
||
- **Same-file gating for `scrum_findings_v1`** — restrict retrieval to chunks where `file_path == focus_file` so cross-file line-number drift can't happen. Then it becomes a per-file "what was found before" signal.
|
||
- **Variance test on small files** — pass 5 was 1 file (the largest, where matrix-hurt was sharpest). Confirm direction holds on 273-line / 333-line files. ~15 min × 2 files = ~30 min.
|
||
- **Verify weak-model gain holds with α** — the bake-off showed matrix helps free-tier `gpt-oss-120b:free` on the large file. After α is wired, re-run on a free-tier model to confirm full enrichment still fires for it. ~5 min.
|
||
- **Higher-signal matrix (β fork)** — if we ever want matrix back as a default, it can't be whole-ADR/whole-section chunks. Better: only retrieve chunks where the focus file's defined symbols appear. Tighter signal, fewer chunks. Postponed.
|
||
|
||
## Reference data + tools
|
||
|
||
- **Mode-runner code:** `crates/gateway/src/v1/mode.rs`
|
||
- **Mode config:** `config/modes.toml`
|
||
- **Per-call experiment log:** `data/_kb/mode_experiments.jsonl`
|
||
- **Sweep harnesses:**
|
||
- `scripts/mode_experiment.ts` — files × modes × 1 rep (default model: `x-ai/grok-4.1-fast`)
|
||
- `scripts/mode_pass2_corpus_sweep.ts` — corpus × threshold sweep
|
||
- `scripts/mode_pass3_variance.ts` — temp × reps on one mode
|
||
- `scripts/mode_pass5_variance_paid.ts` — N reps × M conditions on one file
|
||
- **Aggregators:**
|
||
- `scripts/mode_compare.ts` — full per-mode comparison with grounding check
|
||
- `scripts/mode_pass5_summarize.ts` — variance + head-to-head, robust to 3 table shapes
|
||
- **Corpus builders (re-runnable when source docs / scrum_reviews / source code change):**
|
||
- `scripts/build_lakehouse_corpus.ts`
|
||
- `scripts/build_scrum_findings_corpus.ts`
|
||
- `scripts/build_symbols_corpus.ts`
|
||
|
||
## Re-entry recipe (fresh session)
|
||
|
||
```bash
|
||
cd /home/profit/lakehouse
|
||
git log --oneline scrum/auto-apply-19814 -10 # what's recent
|
||
cat docs/MODE_RUNNER_TUNING_PLAN.md # this file
|
||
bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z # locked result
|
||
curl -s http://localhost:3100/v1/mode/list | jq '.task_classes.scrum_review' # current config
|
||
```
|
||
|
||
If you want to reproduce the bake-off:
|
||
|
||
```bash
|
||
# Strong model variance test (~17 min):
|
||
bun run scripts/mode_pass5_variance_paid.ts
|
||
|
||
# Weak-model regression (~10 min):
|
||
LH_MODEL=openai/gpt-oss-120b:free LH_REPS=3 bun run scripts/mode_pass5_variance_paid.ts
|
||
```
|