STATE_OF_PLAY: v4 split-threshold result + adjacent-query observation

- Reality test table extends from #001-#003 to #001-#004; v4 row marked
  as "the honest configuration" because OOD cross-pollination is gone.
- Shape B section gains the split-threshold rationale (boost safe at
  loose, inject structurally riskier so tighter).
- Verbatim drop framing rewritten — v3→v4 is configuration evolution,
  not regression.
- OPEN: closed "Shape B cap/decay" + the conditional Q15 boost-math
  item (Shape B + split threshold addressed both). Replaced with two
  finer-grained follow-ups: adjacent-query Q6↔Q7 swap (might be
  correct, verify with v4 re-judge metric) and liberal-paraphrase
  recovery loss (Q9/Q15 missed because qwen2.5 drifted >0.20).
- RECENT VERIFIED WAVE adds 94fc3b6 + 67d1957.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-30 07:26:23 -05:00
parent 67d1957b87
commit 87cbd10090

View File

@ -1,7 +1,7 @@
# STATE OF PLAY — Lakehouse-Go
**Last verified:** 2026-04-30 ~07:05 CDT
**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003 (verbatim 7/8 lift v1, paraphrase 6/6 recovery v3 with Shape B), not memory.
**Last verified:** 2026-04-30 ~07:25 CDT
**Verified by:** live probes + `just verify` PASS + reality tests #001/#002/#003/#004 (v4 with split inject threshold: 6/8 verbatim lift + 6/8 paraphrase recovery + zero OOD cross-pollination), not memory.
> **Read this FIRST.** When the user says "we're working on lakehouse," default to the Go rewrite (this repo); the Rust legacy at `/home/profit/lakehouse/` is maintenance-only. If memory contradicts this file, this file wins. Update it when something is verified working — not when a phase finishes.
@ -103,15 +103,16 @@ The 5-loop substrate's load-bearing gate (per `project_small_model_pipeline_visi
|---|---|---|---|---|
| `playbook_lift_001` | boost-only | **7/8 (87.5%)** | not tested | Cosine + boost re-rank works for verbatim replay. Substrate live. |
| `playbook_lift_002` | boost-only | 2/2 | **0/2** | Boost can't promote answers OUT of regular top-K — paraphrase gap exposed. |
| `playbook_lift_003` | **Shape B** | 2/6 | **6/6 → top-1** | Shape B injects recorded answers into paraphrase results. Learning property holds. |
| `playbook_lift_003` | Shape B (loose 0.5) | 2/6 | 6/6 → top-1 | Shape B injects, but cross-pollinates: w-4435 surfaces as warm top-1 for unrelated OOD queries (dental/RN/SWE). |
| `playbook_lift_004` | **Shape B + split threshold (0.5 boost / 0.20 inject)** | **6/8 (75%)** | **6/8 (75%)** | OOD cross-pollination GONE; system refuses to inject when it's not confident. The honest configuration. |
**Shape B** (`InjectPlaybookMisses` in `internal/matrix/playbook.go`): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = `playbook_hit_distance × BoostFactor`. Caller re-sorts + truncates. Documented at `playbook.go:22-27` since v0; v3 shipped the implementation.
**Shape B** (`InjectPlaybookMisses` in `internal/matrix/playbook.go`): when warm-pass retrieval doesn't already include a playbook hit's answer, append a synthetic Result with distance = `playbook_hit_distance × BoostFactor`. Caller re-sorts + truncates. Documented at `playbook.go:22-27` since v0; v3 shipped the implementation. v4 added the split-threshold defense (`DefaultPlaybookMaxInjectDistance = 0.20` while boost stays at 0.50) — boost is safe at loose thresholds because it only re-ranks results already in retrieval; inject is structurally riskier so its threshold is tighter.
OOD honesty (dental hygienist / RN / software engineer queries) holds across all three runs — judge rates them 1, system doesn't fabricate matches. Cross-corpus boosts (e- ↔ w- swaps) confirmed in v1 + v3.
Evidence: `reports/reality-tests/playbook_lift_{001,002,003}.{json,md}`. Per the report's rubric (lift ≥ 50% = matrix doing real work), 6/6 paraphrase recovery is the validation that matters — verbatim replay is structurally the easy case.
**Verbatim lift drop v1→v3 (7→2) is NOT a regression.** Shape B cross-pollinates: a single strong recording (e.g. Q2's w-4435 for "OSHA-30 forklift Wisconsin") surfaces as warm top-1 for several other queries whose embeddings sit within `DefaultPlaybookMaxDistance` (0.5). The lift metric counts "warm top-1 == cold judge best for THIS query" — cross-pollinated lifts don't register even when they're reasonable. A v4 metric would re-judge warm results to measure quality lift, not rank-of-cold-judge-best (filed as OPEN below).
**v3 → v4 is the configuration evolution.** v3 ran with one threshold (0.5) for both boost and inject — paraphrase recovery hit 6/6 but a single strong recording (w-4435 for "OSHA-30 forklift Wisconsin") cross-pollinated to OOD queries (dental hygienist / RN / software engineer) because their text vectors fell within 0.5 cosine. v4 split: boost stays at 0.5 (safe — only re-ranks existing results), inject tightens to 0.20 (true paraphrase territory). Result: OOD cross-pollination eliminated, paraphrase recovery 6/8, verbatim lift 6/8. The two paraphrases that missed (Q9, Q15) were liberally rephrased — drift > 0.20 cosine.
### Harness expansion (2026-04-30 ~05:30 CDT)
@ -182,8 +183,8 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| Item | What | When to act |
|---|---|---|
| **Reality test v4: re-judge warm results** | Current lift metric counts "warm top-1 == cold judge best." Shape B cross-pollinates recorded answers across queries (run #003 dropped verbatim-lift 7→2 because w-4435 surfaced as warm top-1 for several queries whose cold-judge-best was a different worker). True quality lift would re-judge warm results and measure judge agreement. ~30 min — adds a 4th pass to the driver. | Next time we want to measure Shape B's actual quality, not rank-of-original-judge-best. |
| **Shape B injection cap / decay** | Run #003 surfaced w-4435 (Q2's recording) as warm top-1 for ~8 queries because cosine on their text fell within `DefaultPlaybookMaxDistance`. Either (a) cap injections per query (max 1 from any single recording), (b) decay BoostFactor by playbook hit distance so far hits inject at higher distance, or (c) accept cross-pollination as a feature. Open design call. | When the v4 re-judge metric shows cross-pollination is hurting quality. |
| **Q15 boost-math edge case** | "Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so boost-only couldn't promote. Shape B may already address this (now injects directly), but worth verifying with a follow-up run. | After v4 metric lands — re-check whether Shape B closed this case. |
| **Adjacent-query cross-pollination** | After v4's split threshold, OOD cross-pollination is gone but Q6 ("Forklift-certified loader") ↔ Q7 ("Hazmat-certified warehouse worker, cold storage") still swap recordings as warm top-1 because their embeddings are within 0.20 cosine of each other. Likely correct (genuinely similar staffing queries), but worth verifying with the v4 re-judge metric — if the judge agrees both directions are good matches, accept; if not, tighten further (e.g. 0.15) or add a same-query-only mode. | Co-decision with v4 re-judge. |
| **Liberal-paraphrase recovery loss** | Q9 + Q15 in run #004 lost paraphrase recovery because qwen2.5 rephrased liberally enough to drift past 0.20 inject threshold. Acceptable (system refusing to inject when not confident), but might be tightenable with a different paraphrase prompt (more conservative wording variation) or a per-pair `paraphrase_max_drift` measurement. Cosmetic vs. real depends on whether realistic coordinator queries drift like qwen2.5's rephrases do. | When real coordinator queries are available for a calibration run. |
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
@ -210,6 +211,8 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| `9ce067b` | observerd: test that locks ADR-005 5.3 (provenance recorded post-run) |
| `e9822f0` | playbook_lift v2: paraphrase pass — exposed boost-only limit (0/2 paraphrase recovery) |
| `154a72e` | matrix: Shape B (`InjectPlaybookMisses`) — 6/6 paraphrase recovery in run #003 |
| `94fc3b6` | STATE_OF_PLAY: capture v3 reality test + Shape B + cross-pollination |
| `67d1957` | matrix: split boost/inject thresholds (0.5 / 0.20) — kills cross-pollination, run #004 6/8+6/8 |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).