root 52bb216c2d
Some checks failed
lakehouse/auditor 1 blocking issue: todo!() macro call in tests/real-world/scrum_master_pipeline.ts
mode_compare: grounding check + control flag + emoji-tolerant section detection
Three fixes after the playbook_only confabulation surfaced in
2026-04-26 experiment (8 'findings' on a 333-line file all citing
lines 378-945 — fully fabricated from pathway-memory pattern names).

(1) Aggregator regex bug — section detection failed on emoji-prefixed
markdown headers like `## 🔎 Ranked Findings`. The original regex
required word chars right after #{1,3}\s+, so the patches table
header `## 🛠️ Concrete Patch Suggestions` was never recognized as
a stop boundary, double-counting every finding. Fix: allow
non-letter chars (emoji/space) between # and the keyword.

(2) Grounding check — for each finding row in the response, extract
backtick-quoted symbols + cited line numbers; verify symbols exist
in the actual focus file and lines fall within EOF. Computes
grounded/total ratio per mode. Surfaces 'OOB' (out-of-bounds) count
explicitly so confabulation is visible at a glance. Confirms what
hand-grading found: codereview_playbook_only's 8 findings on
service.rs were 1/8 grounded with 7 OOB.

(3) Control mode tagging — codereview_null and codereview_playbook_only
are designed as falsifiers (baseline / lossy ceiling) and their
numerical wins should never be read as recommendations. Output
marks them with ⚗ glyph + warning footer.

Per-mode aggregate is now sorted by groundedness, not raw count.
Per-mode-vs-lakehouse comparison uses grounded findings, not raw —
so confabulation can no longer score a "win".

Updated SCRUM_MASTER_SPEC.md with refactor timeline pointing at
the 2026-04-25/26 commits (observer fix, relevance filter, retire
wire, mode router, enrichment runner, parameterized experiment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:44:21 -05:00
..