STATE_OF_PLAY: v4 split-threshold result + adjacent-query observation

- Reality test table extends from #001-#003 to #001-#004; v4 row marked as "the honest configuration" because OOD cross-pollination is gone. - Shape B section gains the split-threshold rationale (boost safe at loose, inject structurally riskier so tighter). - Verbatim drop framing rewritten — v3→v4 is configuration evolution, not regression. - OPEN: closed "Shape B cap/decay" + the conditional Q15 boost-math item (Shape B + split threshold addressed both). Replaced with two finer-grained follow-ups: adjacent-query Q6↔Q7 swap (might be correct, verify with v4 re-judge metric) and liberal-paraphrase recovery loss (Q9/Q15 missed because qwen2.5 drifted >0.20). - RECENT VERIFIED WAVE adds 94fc3b6 + 67d1957. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:26:23 -05:00 · 2026-04-30 07:26:23 -05:00 · 87cbd10090
commit 87cbd10090
parent 67d1957b87
1 changed files with 10 additions and 7 deletions
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@ -1,7 +1,7 @@
 # STATE OF PLAY — Lakehouse-Go

-**Last verified:** 2026-04-30 ~07:05 CDT
-**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003 (verbatim 7/8 lift v1, paraphrase 6/6 recovery v3 with Shape B), not memory.
+**Last verified:** 2026-04-30 ~07:25 CDT
+**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.

 > **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.

@ -103,15 +103,16 @@ The 5-loop substrate's load-bearing gate (per `project_small_model_pipeline_visi
 |---|---|---|---|---|
 | `playbook_lift_001` | boost-only | **7/8 (87.5%)** | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
 | `playbook_lift_002` | boost-only | 2/2 | **0/2** | Boost can't promote answers OUT of regular top-K — paraphrase gap exposed. |
-| `playbook_lift_003` | **Shape B** | 2/6 | **6/6 → top-1** | Shape B injects recorded answers into paraphrase results. Learning property holds. |
+| `playbook_lift_003` | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
+| `playbook_lift_004` | **Shape B + split threshold (0.5 boost / 0.20 inject)** | **6/8 (75%)** | **6/8 (75%)** | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |

-**Shape B** (`InjectPlaybookMisses` in `internal/matrix/playbook.go`): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = `playbook_hit_distance × BoostFactor`. Caller re-sorts + truncates. Documented at `playbook.go:22-27` since v0; v3 shipped the implementation.
+**Shape B** (`InjectPlaybookMisses` in `internal/matrix/playbook.go`): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = `playbook_hit_distance × BoostFactor`. Caller re-sorts + truncates. Documented at `playbook.go:22-27` since v0; v3 shipped the implementation. v4 added the split-threshold defense (`DefaultPlaybookMaxInjectDistance = 0.20` while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.

 OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.

 Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.

-**Verbatim lift drop v1→v3 (7→2) is NOT a regression.** Shape B cross-pollinates: a single strong recording (e.g. Q2's w-4435 for "OSHA-30 forklift Wisconsin") surfaces as warm top-1 for several other queries whose embeddings sit within `DefaultPlaybookMaxDistance` (0.5). The lift metric counts "warm top-1 == cold judge best for THIS query" — cross-pollinated lifts don't register even when they're reasonable. A v4 metric would re-judge warm results to measure quality lift, not rank-of-cold-judge-best (filed as OPEN below).
+**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.

 ### Harness expansion (2026-04-30 ~05:30 CDT)

@ -182,8 +183,8 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | Item | What | When to act |
 |---|---|---|
 | **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
-| **Shape B injection cap / decay** | Run #003 surfaced w-4435 (Q2's recording) as warm top-1 for ~8 queries because cosine on their text fell within `DefaultPlaybookMaxDistance`. Either (a) cap injections per query (max 1 from any single recording), (b) decay BoostFactor by playbook hit distance so far hits inject at higher distance, or (c) accept cross-pollination as a feature. Open design call. | When the v4 re-judge metric shows cross-pollination is hurting quality. |
-| **Q15 boost-math edge case** | "Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so boost-only couldn't promote. Shape B may already address this (now injects directly), but worth verifying with a follow-up run. | After v4 metric lands — re-check whether Shape B closed this case. |
+| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
+| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
 | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
 | **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
 | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@ -210,6 +211,8 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | `9ce067b` | observerd: test that locks ADR-005 5.3 (provenance recorded post-run) |
 | `e9822f0` | playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery) |
 | `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
+| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
+| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |

 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).