diff --git a/reports/reality-tests/real_006_findings.md b/reports/reality-tests/real_006_findings.md
index d15a0f7..5aeab4f 100644
--- a/reports/reality-tests/real_006_findings.md
+++ b/reports/reality-tests/real_006_findings.md
@@ -101,31 +101,78 @@ Q19 Machine Operators → cold = warm e-1251 (clean)
 Q43 Packer → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed
 ```
 
-**Q43 regressed from rating 5 (perfect match) to rating 2 (weak)
-even though `warm_boosted_count=0` and `playbook_recorded=false`.**
-Same query, different warm top-1, no boost flag set. The playbook
-recording from Q18 (Shipping Clerks at Midway/Chicago) reaches Q43
-(Packer at Midway/Chicago) — same client+city, different role —
-through the playbook corpus retrieval surface, even though the role
-gate exists.
+**Diagnosis (2026-05-05 follow-up): the leak source isn't Q18 — it's Q49.**
 
-This is the **same pattern real_001 surfaced** (Q5/Q10 demoted by
-Q2's playbook), and the role-gate fix from real_002 (`roleEqual`
-on `Role` field) was supposed to close it. Possible explanations:
+Three queries in real_006 touch `w-279`:
 
-1. Role extractor failed on either Q18 ("Shipping Clerks") or Q43
-   ("Packer") — leaving an empty role bypasses the gate (gate is
-   "permissive on empty" by design)
-2. Gate fires on boost path but not on Shape B inject path — and
-   "boost=0" in the JSON is `warm_boosted_count` (count of
-   re-ranked entries), not a flag for "no playbook influence at all"
-3. Cosine-level drift: the playbook entry just happens to be close
-   enough to Q43 in raw cosine space that warm-pass retrieval picks
-   up `w-279` directly without going through boost or inject
+| # | role-extract | client | city | result | playbook? |
+|---|---|---|---|---|---|
+| Q8 Packers Indianapolis IN Heritage Foods | Packers | Heritage Foods | Indianapolis | w-279 (cold = judge-best) | no |
+| **Q49** Packers **Indianapolis IN** Midway Distribution | Packers | **Midway Distribution** | Indianapolis | cold e-2746 → warm w-279 (judge-best) | **yes — recorded** |
+| Q43 Packer **Chicago IL** Midway Distribution | Packer | **Midway Distribution** | Chicago | cold e-7746 (rating 5) → warm w-279 (rating 2) | no |
 
-The other regressions (Q4 Centennial Packaging Flint MI, Q25 above)
-are smaller (3→2 and 2→1) and likely judge consistency drift on
-borderline candidates. Q43 is the structural one.
+Q49 recorded `w-279` with role=`Packers`, client=`Midway Distribution`,
+city=`Indianapolis`. When Q43 ran with role=`Packer`,
+client=`Midway Distribution`, city=`Chicago`:
+
+- `roleEqual("Packer", "Packers")` → both normalize to `"packer"` →
+  **gate passes (correctly, by design)**
+- Q49's recorded query embedding is close enough to Q43's that the
+  playbook hit's distance falls inside `DefaultPlaybookMaxInjectDistance = 0.20`
+  (role + client + count + time-token dominate cosine; only the city
+  and the singular/plural noun differ)
+- Inject fires; `w-279` (an Indianapolis worker) surfaces at Q43's
+  warm top-1 in **Chicago**
+- Judge correctly rates this 2/5 — wrong city
+
+**The role gate IS working as designed.** What's missing is a
+**city gate** (or more generally, a metadata-equality gate on the
+demand attributes that don't appear in the role field). Real_002's
+fix anticipated cross-role bleed (Forklift → CNC); it didn't
+anticipate cross-city bleed within the same client+role.
+
+**Why prior tests missed this:** real_001-005 sourced from rows 0-9
+of fill_events.parquet. Among those 10 rows there was no
+Midway-Distribution × Packer × (different cities) pair. real_006
+includes rows 10-59 which contain Q43 (Chicago) and Q49 (Indianapolis)
+on the same client+role — a structurally new combination the
+substrate hadn't been tested against.
+
+The methodology gap closing on itself: the offset-flag fix that
+surfaced real_006's headline number (-12 pts strict) also surfaced
+a real cross-city leak the gate doesn't catch.
+
+**The other regressions** (Q4 Centennial Packaging Flint, Q25
+Riverfront Steel Quality Techs) are smaller (3→2 and 2→1) and look
+like judge consistency drift on borderline candidates. Q43 is the
+structural one.
+
+### Concrete fix surface (next session)
+
+A city gate alongside the role gate would close this:
+
+1. **`PlaybookEntry`** gains `City string` (or generalize to
+   `Metadata map[string]string`). Recorded at `playbookRecord` time
+   from the same query the role extractor parses.
+2. **`InjectPlaybookMisses` + `ApplyPlaybookBoost`** add a
+   `cityEqual(queryCity, hit.Entry.City)` check after the role
+   check. Same "permissive on empty" semantics as `roleEqual`.
+3. **Bin extractor** adds a city-extract regex (e.g.
+   `\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`) to capture
+   the city + state token from the standard query shape.
+4. **Unit tests** mirror the existing role-gate tests, locking the
+   exact Q43/Q49 scenario as a regression gate (and add an integration-
+   level test that record(role=Packers, city=Indianapolis) followed by
+   search(role=Packer, city=Chicago) doesn't surface the recorded
+   answer).
+
+Estimated scope: 1 new field, 2 new gate checks, 1 new regex, ~5
+tests. Same shape as real_002's role-gate fix. Half a session.
+
+Open question: same-metro normalization ("Detroit MI" ≈ "Dearborn MI"?)
+would help with real-world dispatch where coordinators legitimately
+route across nearby cities. Punt that to future work — strict equality
+closes the structural bleed without over-engineering.
 
 ---
 
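
The city gate proposed in the "Concrete fix surface" list above could look roughly like the Go sketch below. It is an illustration under assumptions, not the repository's code: `extractCity` and the two query strings are hypothetical, the body of `cityEqual` is a guess at "permissive on empty" semantics (only its name and call shape come from the findings), and the regex is the one quoted in step 3. The queries assume a "... in <City> <ST> ..." shape, which the regex requires but the table rows do not confirm.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cityRe is the city-extract pattern quoted in step 3 of the findings.
// Capture group 1 holds the city name; the two-letter state is matched
// but not captured.
var cityRe = regexp.MustCompile(`\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`)

// extractCity (hypothetical helper) returns the city from a query, or ""
// when the query does not match the assumed "... in <City> <ST> ..." shape.
func extractCity(query string) string {
	m := cityRe.FindStringSubmatch(query)
	if m == nil {
		return ""
	}
	return m[1]
}

// cityEqual is a guess at the proposed gate: case-insensitive strict
// equality, permissive when either side is empty (assumed to mirror the
// "permissive on empty" behaviour described for roleEqual).
func cityEqual(queryCity, entryCity string) bool {
	q := strings.ToLower(strings.TrimSpace(queryCity))
	e := strings.ToLower(strings.TrimSpace(entryCity))
	if q == "" || e == "" {
		return true
	}
	return q == e
}

func main() {
	// Hypothetical query strings modelled on Q49 (recorded) and Q43 (searching).
	recorded := extractCity("4 Packers in Indianapolis IN for Midway Distribution")
	searching := extractCity("4 Packer in Chicago IL for Midway Distribution")

	// With the gate in place, the Indianapolis entry would be rejected for
	// the Chicago query, even though the role gate passes.
	fmt.Println(recorded, searching, cityEqual(searching, recorded))
}
```

Under these assumptions the Q49 entry is filtered out for Q43, while an entry recorded without a city still passes. That keeps older playbook entries usable but also means a failed city extraction silently disables the gate.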