real_006 diagnosis: Q43 leak is cross-city, not cross-role

Traced w-279's path through the substrate. The leak source is Q49
(Packers in Indianapolis IN for Midway Distribution), NOT Q18 as
the initial reading suspected.

Q49 recorded w-279 with role=Packers, client=Midway Distribution,
city=Indianapolis. Q43 (Packer in Chicago IL for Midway Distribution)
ran later. roleEqual("Packer","Packers") → both normalize to
"packer" → role gate passes (correctly, by design — they ARE the
same role under plural-strip). Cosine distance between Q49's
recorded query and Q43's query is small enough to fit inside the
0.20 inject threshold because role + client + count + time-token
dominate the embedding (only the city and singular/plural noun
differ). Inject fires, w-279 surfaces at Q43's warm top-1 in
Chicago, judge correctly rates 2/5 — wrong city.

The role gate IS working. What's missing is a CITY gate. Real_002's
fix targeted cross-role bleed (Forklift → CNC). real_006 surfaced
cross-city bleed within same role + same client — a hole prior
tests structurally couldn't reach because they all sourced from
rows 0-9 where no such pair existed.

Concrete fix surface documented (1 new field, 2 gate checks, 1
regex, ~5 tests). Half a session of work, same shape as real_002.
Not implementing tonight — diagnosis only.

The 18 unit-level role-gate tests still pass, confirming the gate
is doing what it was specified to do. The bug is a missing
specification, not a broken implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 95f155b017
commit eb4308d8fd

@@ -101,31 +101,78 @@ Q19 Machine Operators → cold = warm e-1251 (clean)
 Q43 Packer → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed
 ```
 
-**Q43 regressed from rating 5 (perfect match) to rating 2 (weak)
-even though `warm_boosted_count=0` and `playbook_recorded=false`.**
-Same query, different warm top-1, no boost flag set. The playbook
-recording from Q18 (Shipping Clerks at Midway/Chicago) reaches Q43
-(Packer at Midway/Chicago) — same client+city, different role —
-through the playbook corpus retrieval surface, even though the role
-gate exists.
+**Diagnosis (2026-05-05 follow-up): the leak source isn't Q18 — it's Q49.**
 
-This is the **same pattern real_001 surfaced** (Q5/Q10 demoted by
-Q2's playbook), and the role-gate fix from real_002 (`roleEqual`
-on `Role` field) was supposed to close it. Possible explanations:
+Three queries in real_006 touch `w-279`:
 
-1. Role extractor failed on either Q18 ("Shipping Clerks") or Q43
-   ("Packer") — leaving an empty role bypasses the gate (gate is
-   "permissive on empty" by design)
-2. Gate fires on boost path but not on Shape B inject path — and
-   "boost=0" in the JSON is `warm_boosted_count` (count of
-   re-ranked entries), not a flag for "no playbook influence at all"
-3. Cosine-level drift: the playbook entry just happens to be close
-   enough to Q43 in raw cosine space that warm-pass retrieval picks
-   up `w-279` directly without going through boost or inject
+| # | role-extract | client | city | result | playbook? |
+|---|---|---|---|---|---|
+| Q8 Packers Indianapolis IN Heritage Foods | Packers | Heritage Foods | Indianapolis | w-279 (cold = judge-best) | no |
+| **Q49** Packers **Indianapolis IN** Midway Distribution | Packers | **Midway Distribution** | Indianapolis | cold e-2746 → warm w-279 (judge-best) | **yes — recorded** |
+| Q43 Packer **Chicago IL** Midway Distribution | Packer | **Midway Distribution** | Chicago | cold e-7746 (rating 5) → warm w-279 (rating 2) | no |
 
-The other regressions (Q4 Centennial Packaging Flint MI, Q25 above)
-are smaller (3→2 and 2→1) and likely judge consistency drift on
-borderline candidates. Q43 is the structural one.
+Q49 recorded `w-279` with role=`Packers`, client=`Midway Distribution`,
+city=`Indianapolis`. When Q43 ran with role=`Packer`,
+client=`Midway Distribution`, city=`Chicago`:
+
+- `roleEqual("Packer", "Packers")` → both normalize to `"packer"` →
+  **gate passes (correctly, by design)**
+- Q49's recorded query embedding is close enough to Q43's that the
+  playbook hit's distance falls inside `DefaultPlaybookMaxInjectDistance = 0.20`
+  (role + client + count + time-token dominate cosine; only the city
+  and the singular/plural noun differ)
+- Inject fires; `w-279` (an Indianapolis worker) surfaces at Q43's
+  warm top-1 in **Chicago**
+- Judge correctly rates this 2/5 — wrong city
+
+**The role gate IS working as designed.** What's missing is a
+**city gate** (or more generally, a metadata-equality gate on the
+demand attributes that don't appear in the role field). Real_002's
+fix anticipated cross-role bleed (Forklift → CNC); it didn't
+anticipate cross-city bleed within the same client+role.
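
One way to picture the more general metadata-equality gate is a comparison over attribute maps. The sketch below is hypothetical: `metadataEqual` and the key names `client` and `city` are illustrative assumptions, and the skip-on-empty rule is borrowed from the "permissive on empty" behavior described for `roleEqual`.

```go
package main

import (
	"fmt"
	"strings"
)

// metadataEqual reports whether every attribute present on BOTH sides
// matches, ignoring case and surrounding whitespace. Keys that are missing
// or empty on either side are skipped (permissive on empty).
func metadataEqual(query, entry map[string]string) bool {
	for key, qv := range query {
		ev, ok := entry[key]
		qv, ev = strings.TrimSpace(qv), strings.TrimSpace(ev)
		if !ok || qv == "" || ev == "" {
			continue
		}
		if !strings.EqualFold(qv, ev) {
			return false
		}
	}
	return true
}

func main() {
	q49 := map[string]string{"client": "Midway Distribution", "city": "Indianapolis"}
	q43 := map[string]string{"client": "Midway Distribution", "city": "Chicago"}
	legacy := map[string]string{"client": "Midway Distribution"} // recorded before a city field existed

	fmt.Println(metadataEqual(q43, q49))    // false: city differs
	fmt.Println(metadataEqual(q43, legacy)) // true: no city recorded, so the gate stays open
}
```

The permissive branch keeps older entries usable, at the cost that any entry recorded without a city can still bleed across cities.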
+
+**Why prior tests missed this:** real_001-005 sourced from rows 0-9
+of fill_events.parquet. Among those 10 rows there was no
+Midway-Distribution × Packer × (different cities) pair. real_006
+includes rows 10-59 which contain Q43 (Chicago) and Q49 (Indianapolis)
+on the same client+role — a structurally new combination the
+substrate hadn't been tested against.
+
+The methodology gap closing on itself: the offset-flag fix that
+surfaced real_006's headline number (-12 pts strict) also surfaced
+a real cross-city leak the gate doesn't catch.
+
+**The other regressions** (Q4 Centennial Packaging Flint, Q25
+Riverfront Steel Quality Techs) are smaller (3→2 and 2→1) and look
+like judge consistency drift on borderline candidates. Q43 is the
+structural one.
+
+### Concrete fix surface (next session)
+
+A city gate alongside the role gate would close this:
+
+1. **`PlaybookEntry`** gains `City string` (or generalize to
+   `Metadata map[string]string`). Recorded at `playbookRecord` time
+   from the same query the role extractor parses.
+2. **`InjectPlaybookMisses` + `ApplyPlaybookBoost`** add a
+   `cityEqual(queryCity, hit.Entry.City)` check after the role
+   check. Same "permissive on empty" semantics as `roleEqual`.
+3. **Bin extractor** adds a city-extract regex (e.g.
+   `\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`) to capture
+   the city + state token from the standard query shape.
+4. **Unit tests** mirror the existing role-gate tests, locking the
+   exact Q43/Q49 scenario as a regression gate (and add an integration-
+   level test that record(role=Packers, city=Indianapolis) followed by
+   search(role=Packer, city=Chicago) doesn't surface the recorded
+   answer).
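
Steps 1 to 3 of this list can be sketched together in Go. Treat it as an illustration, not the planned implementation: the `PlaybookEntry` shape is reduced to two fields, `extractCity` is a hypothetical helper name, and the query strings assume the "<role> in <City> <ST> for <client>" shape used in the commit message. The regex and the `cityEqual` semantics are the ones proposed in the list.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cityRe is the pattern proposed in the notes: " in <City> <ST>".
var cityRe = regexp.MustCompile(`\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`)

// extractCity returns the captured city name, or "" when nothing matches.
func extractCity(query string) string {
	m := cityRe.FindStringSubmatch(query)
	if m == nil {
		return ""
	}
	return m[1]
}

// cityEqual uses strict, case-insensitive equality and is permissive on empty.
func cityEqual(a, b string) bool {
	a, b = strings.TrimSpace(a), strings.TrimSpace(b)
	if a == "" || b == "" {
		return true
	}
	return strings.EqualFold(a, b)
}

// PlaybookEntry is a hypothetical minimal shape; the real struct has more fields.
type PlaybookEntry struct {
	WorkerID string
	City     string
}

func main() {
	// Q49 records w-279 together with the city parsed from its own query.
	recorded := PlaybookEntry{
		WorkerID: "w-279",
		City:     extractCity("Packers in Indianapolis IN for Midway Distribution"),
	}
	// Q43 arrives later for the same client and role, in a different city.
	queryCity := extractCity("Packer in Chicago IL for Midway Distribution")

	fmt.Println(recorded.City, queryCity)                               // Indianapolis Chicago
	fmt.Println("inject allowed:", cityEqual(queryCity, recorded.City)) // inject allowed: false
}
```

The same pair of queries would make a natural regression case for step 4: record with Indianapolis, search with Chicago, and assert that `w-279` is not injected.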
+
+Estimated scope: 1 new field, 2 new gate checks, 1 new regex, ~5
+tests. Same shape as real_002's role-gate fix. Half a session.
+
+Open question: same-metro normalization ("Detroit MI" ≈ "Dearborn MI"?)
+would help with real-world dispatch where coordinators legitimately
+route across nearby cities. Punt that to future work — strict equality
+closes the structural bleed without over-engineering.
 
 ---
 