Traced w-279's path through the substrate. The leak source is Q49
(Packers in Indianapolis IN for Midway Distribution), NOT Q18 as
the initial reading suspected.

Q49 recorded w-279 with role=Packers, client=Midway Distribution,
city=Indianapolis. Q43 (Packer in Chicago IL for Midway Distribution)
ran later. roleEqual("Packer","Packers") → both normalize to
"packer" → role gate passes (correctly, by design: they ARE the
same role under plural-strip). Cosine distance between Q49's
recorded query and Q43's query is small enough to fit inside the
0.20 inject threshold because role + client + count + time-token
dominate the embedding (only the city and singular/plural noun
differ). Inject fires, w-279 surfaces at Q43's warm top-1 in
Chicago, judge correctly rates 2/5 (wrong city).

The role gate IS working. What's missing is a CITY gate. real_002's
fix targeted cross-role bleed (Forklift → CNC). real_006 surfaced
cross-city bleed within same role + same client, a hole prior
tests structurally couldn't reach because they all sourced from
rows 0-9 where no such pair existed.

Concrete fix surface documented (1 new field, 2 gate checks, 1
regex, ~5 tests). Half a session of work, same shape as real_002.
Not implementing tonight; diagnosis only.

The 18 unit-level role-gate tests still pass, confirming the gate
is doing what it was specified to do. The bug is a missing
specification, not a broken implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Reality test real_006: distribution-shift findings

**Run:** 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
**Judge:** `qwen2.5:latest` (Ollama, local); anchor's recommended judge, ~9s/query
**Queries:** 50 from `tests/reality/real_coord_queries_v3.txt` (rows 10-59 of fill_events.parquet, single `need` style)
**Corpora:** `workers,ethereal_workers` (5K + 10K)
**Local-only:** zero cloud calls per PRD line 70.

Companion to `playbook_lift_real_006.{json,md}`. That's the harness output; this is the reading.

---

## Why this test exists

real_001-005 all sourced their queries from the **first 10 rows** of
`fill_events.parquet`. `gen_real_queries.go` had `-limit N` but no
`-offset N`, so every "real" reality test ran on the same memorized
slice. The published "8 / 10 cold-pass top-1 = judge-best" was a
property of those 10 rows, not measured generalization. real_006
closes the methodology gap: new `-offset` flag samples rows 10-59 (5×
the count, never seen by the substrate).

---

## Headline: substrate generalizes (mostly)

| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---:|---:|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | **41 / 50 (82%)** | **HOLDS** |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 (20%) | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | **HOLDS** |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |

**Reading:** the substrate's *rank* behavior generalizes cleanly: the
top-1 worker is judge-best at the same rate on fresh data as on
memorized data. The *quality* of top-1 (rating ≥ 2) drops 12 points
(80% → 68%), which means 7 of the 41 "no-discovery" (rank-match)
queries had a cold top-1 the judge rated 1 (irrelevant) while the
corpus had nothing better. Honest signal: parts of the v3 slice are
in territory the workers corpus doesn't cover well.

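The percentages and the "7 of 41" figure in the Reading follow directly from the counts in the table. A minimal Go sketch of that arithmetic, where the helper `pct` is illustrative and not part of the harness:

```go
package main

import "fmt"

// pct returns num/den as a percentage rounded to the nearest whole point.
func pct(num, den int) int {
	return (num*100 + den/2) / den
}

func main() {
	const (
		total       = 50 // real_006 queries
		rankMatch   = 41 // cold top-1 = judge-best
		strictMatch = 34 // rank match AND rating >= 2
		baseline    = 80 // real_001: 8 / 10, in percent
	)

	fmt.Printf("rank match:   %d%%\n", pct(rankMatch, total))
	fmt.Printf("strict match: %d%%\n", pct(strictMatch, total))
	fmt.Printf("strict delta vs real_001: %d pts\n", pct(strictMatch, total)-baseline)
	fmt.Printf("rank-match queries rated 1: %d\n", rankMatch-strictMatch)
}
```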
The verbatim-lift property (discovery → warm top-1) is **clean at
9/9**, matching real_001's 2/2 perfectly. When the playbook records,
the recorded answer comes back next time. That's the load-bearing
learning property.

---

## Cluster analysis: the cross-pollination question

real_001 found that same-(client, city) clusters cause Shape A boost
to bleed across roles. real_002's role-gate fix (`roleEqual`) was
supposed to close that. real_006 has *more* cluster opportunities than
real_001 did:

| Cluster | Count | Result |
|---|---:|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean (see below) |
| Heritage Foods + Gary IN | 3 | **clean**: distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | **bleed: Q43 regressed** |

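One way to enumerate clusters like the ones in this table is to group queries by a (client, city) key. The sketch below is an assumption about how that could be done, not the harness's own code. The `Query` and `clusterKey` types and the `clusters` helper are hypothetical. The sample rows are copied from the listings in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// Query is a hypothetical, simplified view of one reality-test query.
type Query struct {
	ID, Role, Client, City string
}

// clusterKey groups queries that share a client and a city.
type clusterKey struct {
	Client, City string
}

// clusters returns the query IDs per (client, city) pair, keeping only
// pairs hit by at least minSize queries.
func clusters(qs []Query, minSize int) map[clusterKey][]string {
	byKey := make(map[clusterKey][]string)
	for _, q := range qs {
		k := clusterKey{q.Client, q.City}
		byKey[k] = append(byKey[k], q.ID)
	}
	for k, ids := range byKey {
		if len(ids) < minSize {
			delete(byKey, k)
		}
	}
	return byKey
}

func main() {
	qs := []Query{
		{"Q14", "Assemblers", "Heritage Foods", "Gary IN"},
		{"Q22", "Material Handler", "Heritage Foods", "Gary IN"},
		{"Q42", "Machine Operator", "Heritage Foods", "Gary IN"},
		{"Q18", "Shipping Clerks", "Midway Distribution", "Chicago IL"},
		{"Q19", "Machine Operators", "Midway Distribution", "Chicago IL"},
		{"Q43", "Packer", "Midway Distribution", "Chicago IL"},
		{"Q49", "Packers", "Midway Distribution", "Indianapolis IN"},
	}
	found := clusters(qs, 2)
	keys := make([]clusterKey, 0, len(found))
	for k := range found {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i].Client < keys[j].Client })
	for _, k := range keys {
		fmt.Printf("%s + %s: %v\n", k.Client, k.City, found[k])
	}
}
```

Under this key Q49 (Indianapolis) does not land in the Midway Distribution + Chicago cluster, so a pair that differs only by city is invisible to a (client, city) grouping.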
### Heritage Foods + Gary IN (3 queries, all clean)

```
Q14 Assemblers       → e-1315
Q22 Material Handler → e-18
Q42 Machine Operator → e-1089
```

Three different roles → three different workers. Zero boosts fired,
zero playbooks recorded. **Role-disambiguation works at the cosine
level for this cluster.** Comparable to real_002's role-gate
demonstration.

### Riverfront Steel + Columbus OH (4 queries, partial)

```
Q9  Assemblers       → w-281 (cold = warm, no boost)
Q25 Quality Techs    → w-281 (cold = warm, no boost) ← same worker as Q9!
Q26 Machine Operator → w-4815 (clean)
Q32 Material Handler → e-8676 → w-2589 (judge promoted, playbook recorded)
```

Q9 and Q25 both surface `w-281` cold-pass for *different roles*.
That's a **cosine-level confusion** in the workers corpus, not a
playbook bleed. The substrate isn't breaking; the corpus contains a
worker whose resume embeds close to both "Assemblers" and "Quality
Techs" in this client+city. The judge's rating for Q25 dropped 2 → 1
on rejudge, which is the LLM's own consistency drift, not a substrate
fault. Worth noting but not a bug.

### Midway Distribution + Chicago IL (3 queries): the regression

```
Q18 Shipping Clerks   → cold w-4504 → warm w-1522 (boost=1, playbook recorded)
Q19 Machine Operators → cold = warm e-1251 (clean)
Q43 Packer            → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed
```

**Diagnosis (2026-05-05 follow-up): the leak source isn't Q18; it's Q49.**

Three queries in real_006 touch `w-279`:

| Query | role-extract | client | city | result | playbook? |
|---|---|---|---|---|---|
| Q8 Packers Indianapolis IN Heritage Foods | Packers | Heritage Foods | Indianapolis | w-279 (cold = judge-best) | no |
| **Q49** Packers **Indianapolis IN** Midway Distribution | Packers | **Midway Distribution** | Indianapolis | cold e-2746 → warm w-279 (judge-best) | **yes, recorded** |
| Q43 Packer **Chicago IL** Midway Distribution | Packer | **Midway Distribution** | Chicago | cold e-7746 (rating 5) → warm w-279 (rating 2) | no |

Q49 recorded `w-279` with role=`Packers`, client=`Midway Distribution`,
city=`Indianapolis`. When Q43 ran with role=`Packer`,
client=`Midway Distribution`, city=`Chicago`:

- `roleEqual("Packer", "Packers")` → both normalize to `"packer"` →
  **gate passes (correctly, by design)**
- Q49's recorded query embedding is close enough to Q43's that the
  playbook hit's distance falls inside `DefaultPlaybookMaxInjectDistance = 0.20`
  (role + client + count + time-token dominate cosine; only the city
  and the singular/plural noun differ)
- Inject fires; `w-279` (an Indianapolis worker) surfaces at Q43's
  warm top-1 in **Chicago**
- Judge correctly rates this 2/5: wrong city

**The role gate IS working as designed.** What's missing is a
**city gate** (or more generally, a metadata-equality gate on the
demand attributes that don't appear in the role field). real_002's
fix anticipated cross-role bleed (Forklift → CNC); it didn't
anticipate cross-city bleed within the same client+role.

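The gate sequence described above can be sketched in Go. This is a sketch under stated assumptions, not the substrate's implementation. The names `roleEqual` and `DefaultPlaybookMaxInjectDistance` come from this document, but their bodies here are guesses. `normalizeRole`, `shouldInject`, the trimmed `PlaybookEntry` shape, the `<=` comparison and the `0.12` distance are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Threshold value quoted in the diagnosis above.
const DefaultPlaybookMaxInjectDistance = 0.20

// PlaybookEntry is a hypothetical, trimmed-down shape: the entry
// carries a role but no city.
type PlaybookEntry struct {
	Role     string
	WorkerID string
}

// normalizeRole lowercases and strips one trailing "s". This is an
// assumed stand-in for the real plural-strip logic.
func normalizeRole(r string) string {
	return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(r)), "s")
}

// roleEqual is permissive on empty: a missing role on either side passes.
func roleEqual(a, b string) bool {
	if strings.TrimSpace(a) == "" || strings.TrimSpace(b) == "" {
		return true
	}
	return normalizeRole(a) == normalizeRole(b)
}

// shouldInject applies the two checks described above: distance, then role.
// Nothing here looks at the city.
func shouldInject(queryRole string, e PlaybookEntry, distance float64) bool {
	return distance <= DefaultPlaybookMaxInjectDistance && roleEqual(queryRole, e.Role)
}

func main() {
	q49 := PlaybookEntry{Role: "Packers", WorkerID: "w-279"}

	// 0.12 is a placeholder distance, not a measured value.
	fmt.Println(shouldInject("Packer", q49, 0.12))   // true: same role, city never checked
	fmt.Println(shouldInject("Forklift", q49, 0.12)) // false: cross-role bleed is blocked
}
```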
**Why prior tests missed this:** real_001-005 sourced from rows 0-9
of fill_events.parquet. Among those 10 rows there was no
Midway-Distribution × Packer × (different cities) pair. real_006
includes rows 10-59, which contain Q43 (Chicago) and Q49 (Indianapolis)
on the same client+role, a structurally new combination the
substrate hadn't been tested against.

The methodology fix did double duty: the `-offset` flag that
produced real_006's headline number (-12 pts strict) also surfaced
a real cross-city leak the gate doesn't catch.

**The other regressions** (Q4 Centennial Packaging Flint, Q25
Riverfront Steel Quality Techs) are smaller (3→2 and 2→1) and look
like judge consistency drift on borderline candidates. Q43 is the
structural one.

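The split between the structural regression and the probable judge drift can be made mechanical by looking at the size of the rating drop. The sketch below is illustrative only. The `Rejudge` type, the `triage` helper and the 2-point cutoff are assumptions, not harness settings. The ratings are the ones reported in this document.

```go
package main

import "fmt"

// Rejudge holds a query's top-1 rating before and after.
type Rejudge struct {
	ID            string
	Before, After int
}

// drop returns how many rating points were lost (0 if none).
func (r Rejudge) drop() int {
	if r.After >= r.Before {
		return 0
	}
	return r.Before - r.After
}

// triage labels a regression. The cutoff of 2 points is an
// illustrative assumption.
func triage(r Rejudge) string {
	switch d := r.drop(); {
	case d == 0:
		return "ok"
	case d >= 2:
		return "investigate"
	default:
		return "possible judge drift"
	}
}

func main() {
	for _, r := range []Rejudge{{"Q43", 5, 2}, {"Q4", 3, 2}, {"Q25", 2, 1}} {
		fmt.Printf("%s: %d -> %d (%s)\n", r.ID, r.Before, r.After, triage(r))
	}
}
```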
### Concrete fix surface (next session)

A city gate alongside the role gate would close this:

1. **`PlaybookEntry`** gains `City string` (or generalize to
   `Metadata map[string]string`). Recorded at `playbookRecord` time
   from the same query the role extractor parses.
2. **`InjectPlaybookMisses` + `ApplyPlaybookBoost`** add a
   `cityEqual(queryCity, hit.Entry.City)` check after the role
   check. Same "permissive on empty" semantics as `roleEqual`.
3. **Bin extractor** adds a city-extract regex (e.g.
   `\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`) to pick
   the city out of the city + state token in the standard query
   shape. The group captures the city; the two-letter state is
   matched but not captured.
4. **Unit tests** mirror the existing role-gate tests, locking the
   exact Q43/Q49 scenario as a regression gate (and add an
   integration-level test that record(role=Packers, city=Indianapolis)
   followed by search(role=Packer, city=Chicago) doesn't surface the
   recorded answer).

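The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation. The regex is the example from step 3 and the name `cityEqual` is from step 2. The helper `extractCity`, the `PlaybookEntry` fields shown and the case-insensitive comparison are hypothetical. The query strings are the short forms quoted in the commit message, so real query text may carry extra tokens such as count and time.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cityRe is the example regex from step 3; group 1 is the city name.
var cityRe = regexp.MustCompile(`\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`)

// extractCity returns the city from a query, or "" when none is found.
func extractCity(query string) string {
	m := cityRe.FindStringSubmatch(query)
	if m == nil {
		return ""
	}
	return m[1]
}

// PlaybookEntry is a hypothetical shape showing only the fields used here.
type PlaybookEntry struct {
	Role, City, WorkerID string
}

// cityEqual is permissive on empty, like roleEqual: an unknown city on
// either side does not block the hit. Case-insensitive matching is an
// assumption of this sketch.
func cityEqual(queryCity, entryCity string) bool {
	if queryCity == "" || entryCity == "" {
		return true
	}
	return strings.EqualFold(strings.TrimSpace(queryCity), strings.TrimSpace(entryCity))
}

func main() {
	q49 := "Packers in Indianapolis IN for Midway Distribution"
	q43 := "Packer in Chicago IL for Midway Distribution"

	// Record time: the city is stored next to the role.
	entry := PlaybookEntry{Role: "Packers", City: extractCity(q49), WorkerID: "w-279"}

	fmt.Println(cityEqual(extractCity(q43), entry.City)) // false: Chicago vs Indianapolis
	fmt.Println(cityEqual(extractCity(q49), entry.City)) // true: same city
	fmt.Println(cityEqual("", entry.City))               // true: permissive on empty
}
```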
Estimated scope: 1 new field, 2 new gate checks, 1 new regex, ~5
tests. Same shape as real_002's role-gate fix. Half a session.

Open question: same-metro normalization ("Detroit MI" ≈ "Dearborn MI"?)
would help with real-world dispatch where coordinators legitimately
route across nearby cities. Punt that to future work: strict equality
closes the structural bleed without over-engineering.

---

## What this confirms vs falsifies

**Confirmed:**
- Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
- Verbatim lift works (9/9 discoveries → warm top-1)
- Role-disambiguation works at cosine level for clean role-distinct
  query distributions (Heritage Foods cluster is the proof)
- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)

**Falsified / weakened:**
- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on
  the strict (rating ≥ 2) interpretation. Real number on broader
  data is 68%, not 80%. Headline rank-match number (82%) holds.
- real_002's role-gate fix is **not structurally airtight**. Q43
  shows the cluster-bleed pattern can still fire under conditions
  the prior tests didn't reach. The follow-up diagnosis above
  answers the question of which path leaks: gate scope. The role
  extractor and role gate behaved as designed; no city check exists
  to stop the inject.

---

## Next moves (informed by this evidence)

1. **Diagnose Q43 specifically** (done in the 2026-05-05 follow-up
   above): `w-279` reaches Q43's warm top-1 via inject from Q49's
   playbook entry, not Q18's. The remaining work is the city gate
   under "Concrete fix surface".
2. **Strengthen the corpus for the role-city combos that scored
   low rating** (the 7 queries where cold top-1 was rating=1). The
   workers corpus has gaps the v3 slice surfaced.
3. **Don't ship the "80% generalizes" framing as-is.** The number
   real_006 measured (82% rank, 68% rating ≥ 2) is the honest one
   to publish.

This is what reality tests are for. Numbers from the memorized slice
gave a clean story; numbers from the held-out slice show where it
needs work.

---

## Repro

```bash
cd /home/profit/golangLAKEHOUSE
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt

PATH=/usr/local/go/bin:$PATH \
RUN_ID=real_006 \
JUDGE_MODEL=qwen2.5:latest \
QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
WITH_PARAPHRASE=1 \
WITH_REJUDGE=1 \
bash scripts/playbook_lift.sh
```

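The `-limit 50 -offset 10` pair in the first command is what selects rows 10-59. A rough sketch of that windowing is below. It is not the code in `gen_real_queries.go`. The `window` helper, the clamping behavior, the flag defaults and the stand-in row data are all assumptions.

```go
package main

import (
	"flag"
	"fmt"
)

// window returns rows[offset : offset+limit], clamped to the slice bounds.
// A limit <= 0 is treated as "no limit" (an assumption for this sketch).
func window(rows []string, offset, limit int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset > len(rows) {
		offset = len(rows)
	}
	end := len(rows)
	if limit > 0 && offset+limit < end {
		end = offset + limit
	}
	return rows[offset:end]
}

func main() {
	limit := flag.Int("limit", 50, "number of rows to emit")
	offset := flag.Int("offset", 10, "rows to skip before emitting")
	flag.Parse()

	// Stand-in for the rows of fill_events.parquet.
	rows := make([]string, 100)
	for i := range rows {
		rows[i] = fmt.Sprintf("row-%d", i)
	}

	picked := window(rows, *offset, *limit)
	if len(picked) > 0 {
		fmt.Printf("%d rows: %s .. %s\n", len(picked), picked[0], picked[len(picked)-1])
	}
}
```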
Local-only. No cloud calls.