root 2dbc8dbc83

lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts

v1/mode: model-aware enrichment downgrade + 3 corpora + variance harness

Pass 5 (5 reps × 4 conditions × 1 file on grok-4.1-fast) showed composing
matrix corpora is anti-additive on strong models — composed lakehouse_arch
+ symbols LOST 5/5 head-to-head vs codereview_isolation (Δ −1.8 grounded
findings, p=0.031). Default flips to isolation; matrix path now auto-
downgrades when the resolved model is strong.

Mode runner:
- matrix_corpus is Vec<String> (string OR array via deserialize_string_or_vec)
- top_k=6 from each corpus, merge by score, take top 8 globally
- chunk tag prefers doc_id over source so reviewer sees [adr:009] vs [lakehouse_arch]
- is_weak_model() gate auto-downgrades codereview_lakehouse → codereview_isolation
  for strong models (default-strong; weak = :free suffix or local last-resort)
- LH_FORCE_FULL_ENRICHMENT=1 bypasses for diagnostic runs
- EnrichmentSources.downgraded_from records when the gate fires

Three corpora indexed via /vectors/index (5849 chunks total):
- lakehouse_arch_v1 — ADRs + phases + PRD + scrum spec (93 docs, 2119 chunks)
- scrum_findings_v1 — past scrum_reviews.jsonl (168 docs, 1260 chunks; EXCLUDED
  from defaults — 24% out-of-bounds line citations from cross-file drift)
- lakehouse_symbols_v1 — regex-extracted pub items + /// docs (656 docs, 2470 chunks)

Experiment infra:
- scripts/build_*_corpus.ts — re-runnable when source content changes
- scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file
- scripts/mode_pass5_summarize.ts — mean ± σ + head-to-head, parser handles
  numbered + path-with-line + path-with-symbol finding tables
- scripts/mode_compare.ts — groups by mode|corpus when sweeps span corpora
- scripts/mode_experiment.ts — default model bumped to x-ai/grok-4.1-fast,
  --corpus flag for per-call override

Decisions + open follow-ups: docs/MODE_RUNNER_TUNING_PLAN.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 17:29:17 -05:00

7.2 KiB

Raw Blame History

Mode Runner Tuning Plan

Date: 2026-04-26 Branch: scrum/auto-apply-19814 (PR #11) Status: Pass 5 variance test complete; conclusions locked. Implementation in progress.

A fresh Claude session reading this + the pass5 row range in data/_kb/mode_experiments.jsonl should be able to continue the work without re-running anything.

What we set out to do

J's directive 2026-04-26 evening: "Mode runner experiment + corpus tightening."

Symptom in memory before the session: scrum_review's matrix corpus was kept-rate 0/2 across every call — silent failure. Question: should we tighten the corpus, build new ones, or change retrieval?

What we built

Three new corpora indexed under /vectors/index:

Corpus	Builder	Docs	Chunks	Source
`lakehouse_arch_v1`	`scripts/build_lakehouse_corpus.ts`	93	2119	DECISIONS.md ADRs + standalone ADRs + PHASES.md + PRD.md + CONTROL_PLANE_PRD.md + SCRUM_MASTER_SPEC.md
`scrum_findings_v1`	`scripts/build_scrum_findings_corpus.ts`	168	1260	Past `scrum_reviews.jsonl` rows
`lakehouse_symbols_v1`	`scripts/build_symbols_corpus.ts`	656	2470	Regex-extracted `pub fn

Multi-corpus support added to the mode runner:

crates/gateway/src/v1/mode.rs — matrix_corpus is now Vec<String> (string OR array in modes.toml/JSON via deserialize_string_or_vec)
Top-K retrieved from each corpus, merged by score, top 8 globally before relevance filter
Each chunk tagged with corpus for telemetry
Prompt assembly prefers doc_id over source so reviewer sees [adr:009] not [lakehouse_arch]

Validation infra:

scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file, paid model
scripts/mode_pass5_summarize.ts — mean ± stddev + head-to-head wins/losses with parser handling 3 finding-table shapes (numbered, path-with-line, path-with-symbol)
scripts/mode_compare.ts — extended grouping key to mode|corpus (sorted+joined when multiple corpora) so multi-corpus sweeps don't last-write-wins-clobber

What we learned

Single-rep bake-off (free-tier `openai/gpt-oss-120b:free`, 3 files)

Confirmed lakehouse_arch_v1 adds +1.7 grounded findings/file vs isolation, 100% groundedness, −20s latency. But: matrix slightly hurts on small files (273-line delta.rs: lakehouse 7 vs isolation 9) and unlocks +9 findings on the large file (1355-line pathway_memory.rs).

scrum_findings_v1 produced 24% out-of-bounds line citations from cross-file line-number drift — dangerous, excluded from defaults. Only safe with same-file gating (TBD if needed).

Single-rep bake-off (paid `x-ai/grok-4.1-fast`, 3 files × 4 conditions)

Picture flips on a strong model. Composed corpus −1.4 grounded vs isolation. Symbols-alone slightly negative. Arch-alone negative. Suggested kitchen-sinking enrichment denigrates results when the model is good enough to handle the file directly.

Pass 5 variance test (paid grok-4.1-fast, 5 reps × 4 conditions on `pathway_memory.rs`)

Condition	n	mean grounded ± σ	range	H2H vs isolation	Δ mean
isolation	5	6.2 ± 1.3	[5–8]	baseline	—
arch_only	5	5.2 ± 0.8	[4–6]	0W–3L–2T	−1.0
symbols_only	5	6.4 ± 1.5	[4–8]	3W–2L–0T	+0.2
composed (A+C)	5	4.4 ± 1.1	[3–6]	0W–5L–0T	−1.8

Composed loses 5/5 head-to-head against isolation on this file with this model. Probability under random noise = 1/2⁵ = 3.1%. Statistically significant.

Data window: rows in data/_kb/mode_experiments.jsonl where ts > "2026-04-26T21:50:03Z" and file_path == "crates/vectord/src/pathway_memory.rs". Re-aggregate any time with bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z.

Decisions taken

Composed-corpus default is reverted. scrum_review.preferred_mode switches from codereview_lakehouse → codereview_isolation. Matrix corpora stay defined in modes.toml but only fire when a caller explicitly forces codereview_lakehouse or one of the matrix-only experimental modes.
Model-aware enrichment downgrade (α) is wired in crates/gateway/src/v1/mode.rs::execute. When a caller resolves a "strong" model AND the resolved mode is codereview_lakehouse, the runner downgrades to codereview_isolation flag-set automatically. Strong patterns: x-ai/grok-*, anthropic/*, openai/gpt-4*, openai/gpt-5*, deepseek/deepseek-v4*, moonshotai/kimi-k2*, google/gemini-2.5*. Override via LH_FORCE_FULL_ENRICHMENT=1 for diagnostic runs.
scrum_findings_v1 stays excluded from defaults until same-file gating lands. Built and indexed; do not point any task class at it without that gate.

Open follow-ups (not landed in this batch)

Same-file gating for scrum_findings_v1 — restrict retrieval to chunks where file_path == focus_file so cross-file line-number drift can't happen. Then it becomes a per-file "what was found before" signal.
Variance test on small files — pass 5 was 1 file (the largest, where matrix-hurt was sharpest). Confirm direction holds on 273-line / 333-line files. ~15 min × 2 files = ~30 min.
Verify weak-model gain holds with α — the bake-off showed matrix helps free-tier gpt-oss-120b:free on the large file. After α is wired, re-run on a free-tier model to confirm full enrichment still fires for it. ~5 min.
Higher-signal matrix (β fork) — if we ever want matrix back as a default, it can't be whole-ADR/whole-section chunks. Better: only retrieve chunks where the focus file's defined symbols appear. Tighter signal, fewer chunks. Postponed.

Reference data + tools

Mode-runner code: crates/gateway/src/v1/mode.rs
Mode config: config/modes.toml
Per-call experiment log: data/_kb/mode_experiments.jsonl
Sweep harnesses:
- scripts/mode_experiment.ts — files × modes × 1 rep (default model: x-ai/grok-4.1-fast)
- scripts/mode_pass2_corpus_sweep.ts — corpus × threshold sweep
- scripts/mode_pass3_variance.ts — temp × reps on one mode
- scripts/mode_pass5_variance_paid.ts — N reps × M conditions on one file
Aggregators:
- scripts/mode_compare.ts — full per-mode comparison with grounding check
- scripts/mode_pass5_summarize.ts — variance + head-to-head, robust to 3 table shapes
Corpus builders (re-runnable when source docs / scrum_reviews / source code change):
- scripts/build_lakehouse_corpus.ts
- scripts/build_scrum_findings_corpus.ts
- scripts/build_symbols_corpus.ts

Re-entry recipe (fresh session)

cd /home/profit/lakehouse
git log --oneline scrum/auto-apply-19814 -10               # what's recent
cat docs/MODE_RUNNER_TUNING_PLAN.md                        # this file
bun run scripts/mode_pass5_summarize.ts --since 2026-04-26T21:50:03Z  # locked result
curl -s http://localhost:3100/v1/mode/list | jq '.task_classes.scrum_review'  # current config

If you want to reproduce the bake-off:

# Strong model variance test (~17 min):
bun run scripts/mode_pass5_variance_paid.ts

# Weak-model regression (~10 min):
LH_MODEL=openai/gpt-oss-120b:free LH_REPS=3 bun run scripts/mode_pass5_variance_paid.ts

7.2 KiB Raw Blame History Unescape Escape