golangLAKEHOUSE/reports/reality-tests/real_006_findings.md
root eb4308d8fd real_006 diagnosis: Q43 leak is cross-city, not cross-role
Traced w-279's path through the substrate. The leak source is Q49
(Packers in Indianapolis IN for Midway Distribution), NOT Q18 as
the initial reading suspected.

Q49 recorded w-279 with role=Packers, client=Midway Distribution,
city=Indianapolis. Q43 (Packer in Chicago IL for Midway Distribution)
ran later. roleEqual("Packer","Packers") → both normalize to
"packer" → role gate passes (correctly, by design — they ARE the
same role under plural-strip). Cosine distance between Q49's
recorded query and Q43's query is small enough to fit inside the
0.20 inject threshold because role + client + count + time-token
dominate the embedding (only the city and singular/plural noun
differ). Inject fires, w-279 surfaces at Q43's warm top-1 in
Chicago, judge correctly rates 2/5 — wrong city.

The role gate IS working. What's missing is a CITY gate. Real_002's
fix targeted cross-role bleed (Forklift → CNC). real_006 surfaced
cross-city bleed within same role + same client — a hole prior
tests structurally couldn't reach because they all sourced from
rows 0-9 where no such pair existed.

Concrete fix surface documented (1 new field, 2 gate checks, 1
regex, ~5 tests). Half a session of work, same shape as real_002.
Not implementing tonight — diagnosis only.

The 18 unit-level role-gate tests still pass, confirming the gate
is doing what it was specified to do. The bug is a missing
specification, not a broken implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 05:13:19 -05:00

# Reality test real_006 — distribution-shift findings
**Run:** 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
**Judge:** `qwen2.5:latest` (Ollama, local) — anchor's recommended judge, ~9s/query
**Queries:** 50 from `tests/reality/real_coord_queries_v3.txt` (rows 10-59 of fill_events.parquet, single `need` style)
**Corpora:** `workers,ethereal_workers` (5K + 10K)
**Local-only:** zero cloud calls per PRD line 70.
Companion to `playbook_lift_real_006.{json,md}`. That's the harness output; this is the reading.
---
## Why this test exists
real_001-005 all sourced their queries from the **first 10 rows** of
`fill_events.parquet`. `gen_real_queries.go` had `-limit N` but no
`-offset N`, so every "real" reality test ran on the same memorized
slice. The published "8 / 10 cold-pass top-1 = judge-best" was a
property of those 10 rows, not measured generalization. real_006
closes the methodology gap: new `-offset` flag samples rows 10-59 (5×
the count, never seen by the substrate).
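The offset/limit window can be sketched as below. This is a minimal illustration only: `sliceRows` is a hypothetical helper, the rows are placeholder strings, and the real `gen_real_queries.go` reads parquet and may implement the flags differently.

```go
package main

import "fmt"

// sliceRows returns at most limit rows starting at offset.
// Hypothetical helper for illustration; not taken from gen_real_queries.go.
func sliceRows(rows []string, offset, limit int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset >= len(rows) {
		return nil
	}
	end := len(rows)
	if limit >= 0 && offset+limit < end {
		end = offset + limit
	}
	return rows[offset:end]
}

func main() {
	// Placeholder stand-in for the rows of fill_events.parquet.
	rows := make([]string, 100)
	for i := range rows {
		rows[i] = fmt.Sprintf("row-%d", i)
	}

	memorized := sliceRows(rows, 0, 10) // the slice real_001-005 kept reusing
	heldOut := sliceRows(rows, 10, 50)  // the slice real_006 samples

	fmt.Println(memorized[0], memorized[len(memorized)-1], len(memorized)) // row-0 row-9 10
	fmt.Println(heldOut[0], heldOut[len(heldOut)-1], len(heldOut))         // row-10 row-59 50
}
```

With only `-limit`, every run starts at row 0, so the window can grow but never move; the offset is what makes the two slices disjoint.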
---
## Headline — substrate generalizes (mostly)
| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---:|---:|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | **41 / 50 (82%)** | **HOLDS** |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | **HOLDS** |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |
**Reading:** the substrate's *rank* behavior generalizes cleanly — the
top-1 worker is judge-approved at the same rate on fresh data as on
memorized data. The *quality* of top-1 (rating ≥ 2) drops 12 points,
which means 7 of the 41 "no-discovery" queries had cold top-1 the
judge rated 1 (irrelevant) but the corpus had nothing better. Honest
signal: parts of the v3 slice are in territory the workers corpus
doesn't cover well.
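The arithmetic behind the two headline rates, as a small check. The counts are the ones in the table above; `pct` is a hypothetical helper.

```go
package main

import "fmt"

// pct returns n/total as a percentage, rounded to the nearest whole number.
func pct(n, total int) int {
	return (n*100 + total/2) / total
}

func main() {
	const total = 50
	rankMatch := 41                // cold top-1 = judge-best
	ratedOne := 7                  // rank matches whose top-1 the judge rated 1
	strict := rankMatch - ratedOne // rank match AND rating >= 2

	fmt.Printf("rank match: %d/%d = %d%%\n", rankMatch, total, pct(rankMatch, total))
	fmt.Printf("strict:     %d/%d = %d%%\n", strict, total, pct(strict, total))
	fmt.Printf("strict vs real_001 (80%%): %+d pts\n", pct(strict, total)-80)
}
```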
The verbatim-lift property (discovery → warm top-1) is **clean at
9/9**, matching real_001's 2/2 perfectly. When the playbook records,
the recorded answer comes back next time. That's the load-bearing
learning property.
---
## Cluster analysis — the cross-pollination question
real_001 found that same-(client, city) clusters cause Shape A boost
to bleed across roles. Real_002's role-gate fix (`roleEqual`) was
supposed to close that. real_006 has *more* cluster opportunities than
real_001 did:
| Cluster | Count | Result |
|---|---:|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean — see below |
| Heritage Foods + Gary IN | 3 | **clean** — distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | **bleed: Q43 regressed** |
### Heritage Foods + Gary IN (3 queries, all clean)
```
Q14 Assemblers → e-1315
Q22 Material Handler → e-18
Q42 Machine Operator → e-1089
```
Three different roles → three different workers. Zero boosts fired,
zero playbooks recorded. **Role-disambiguation works at the cosine
level for this cluster.** Comparable to real_002's role-gate
demonstration.
### Riverfront Steel + Columbus OH (4 queries, partial)
```
Q9 Assemblers → w-281 (cold = warm, no boost)
Q25 Quality Techs → w-281 (cold = warm, no boost) ← same worker as Q9!
Q26 Machine Operator → w-4815 (clean)
Q32 Material Handler → e-8676 → w-2589 (judge promoted, playbook recorded)
```
Q9 and Q25 both surface `w-281` on the cold pass for *different roles*.
That's a **cosine-level confusion** in the workers corpus, not a
playbook bleed. The substrate isn't breaking; the corpus contains a
worker whose resume embeds close to both "Assemblers" and "Quality
Techs" in this client+city. The judge rating for Q25 dropped 2 → 1 on
rejudge, which is the LLM's own consistency drift, not a substrate
fault. Worth noting but not a bug.
### Midway Distribution + Chicago IL (3 queries) — the regression
```
Q18 Shipping Clerks → cold w-4504 → warm w-1522 (boost=1, playbook recorded)
Q19 Machine Operators → cold = warm e-1251 (clean)
Q43 Packer → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed
```
**Diagnosis (2026-05-05 follow-up): the leak source isn't Q18 — it's Q49.**
Three queries in real_006 touch `w-279`:
| # | role-extract | client | city | result | playbook? |
|---|---|---|---|---|---|
| Q8 Packers Indianapolis IN Heritage Foods | Packers | Heritage Foods | Indianapolis | w-279 (cold = judge-best) | no |
| **Q49** Packers **Indianapolis IN** Midway Distribution | Packers | **Midway Distribution** | Indianapolis | cold e-2746 → warm w-279 (judge-best) | **yes — recorded** |
| Q43 Packer **Chicago IL** Midway Distribution | Packer | **Midway Distribution** | Chicago | cold e-7746 (rating 5) → warm w-279 (rating 2) | no |
Q49 recorded `w-279` with role=`Packers`, client=`Midway Distribution`,
city=`Indianapolis`. When Q43 ran with role=`Packer`,
client=`Midway Distribution`, city=`Chicago`:
- `roleEqual("Packer", "Packers")` → both normalize to `"packer"` →
**gate passes (correctly, by design)**
- Q49's recorded query embedding is close enough to Q43's that the
playbook hit's distance falls inside `DefaultPlaybookMaxInjectDistance = 0.20`
(role + client + count + time-token dominate cosine; only the city
and the singular/plural noun differ)
- Inject fires; `w-279` (an Indianapolis worker) surfaces at Q43's
warm top-1 in **Chicago**
- Judge correctly rates this 2/5 — wrong city
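The decision path in the bullets above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `normalizeRole` is a naive trailing-"s" strip, `shouldInject` is a hypothetical name, the `<=` comparison is a guess at how the threshold is applied, and the distance 0.15 is a made-up value standing in for "inside 0.20". Only `roleEqual`, its permissive-on-empty behavior, and `DefaultPlaybookMaxInjectDistance = 0.20` come from the findings.

```go
package main

import (
	"fmt"
	"strings"
)

// Threshold value quoted in the findings.
const DefaultPlaybookMaxInjectDistance = 0.20

// normalizeRole lowercases and strips one trailing "s".
// ASSUMPTION: a naive stand-in for the real plural-strip logic.
func normalizeRole(role string) string {
	r := strings.ToLower(strings.TrimSpace(role))
	return strings.TrimSuffix(r, "s")
}

// roleEqual compares normalized roles and is permissive when either is empty.
func roleEqual(a, b string) bool {
	na, nb := normalizeRole(a), normalizeRole(b)
	if na == "" || nb == "" {
		return true
	}
	return na == nb
}

// shouldInject (hypothetical name) applies the two checks described above:
// the role gate and the distance threshold. Note there is no city check.
func shouldInject(queryRole, entryRole string, distance float64) bool {
	return roleEqual(queryRole, entryRole) && distance <= DefaultPlaybookMaxInjectDistance
}

func main() {
	// Q43 (Packer, Chicago) against Q49's entry (Packers, Indianapolis).
	// 0.15 is a hypothetical distance, not a measured one.
	fmt.Println(shouldInject("Packer", "Packers", 0.15)) // true: the leak

	// The cross-role case real_002 targeted is still blocked.
	fmt.Println(shouldInject("Forklift", "CNC", 0.15)) // false
}
```

Neither check ever sees the city, so a near-identical query for a different city passes both.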
**The role gate IS working as designed.** What's missing is a
**city gate** (or more generally, a metadata-equality gate on the
demand attributes that don't appear in the role field). Real_002's
fix anticipated cross-role bleed (Forklift → CNC); it didn't
anticipate cross-city bleed within the same client+role.
**Why prior tests missed this:** real_001-005 sourced from rows 0-9
of fill_events.parquet. Among those 10 rows there was no
Midway-Distribution × Packer × (different cities) pair. real_006
includes rows 10-59 which contain Q43 (Chicago) and Q49 (Indianapolis)
on the same client+role — a structurally new combination the
substrate hadn't been tested against.
The methodology fix paid off twice: the `-offset` flag that produced
real_006's headline number (-12 pts strict) also surfaced a real
cross-city leak that the role gate doesn't catch.
**The other regressions** (Q4 Centennial Packaging Flint, Q25
Riverfront Steel Quality Techs) are smaller (3→2 and 2→1) and look
like judge consistency drift on borderline candidates. Q43 is the
structural one.
### Concrete fix surface (next session)
A city gate alongside the role gate would close this:
1. **`PlaybookEntry`** gains `City string` (or generalize to
`Metadata map[string]string`). Recorded at `playbookRecord` time
from the same query the role extractor parses.
2. **`InjectPlaybookMisses` + `ApplyPlaybookBoost`** add a
`cityEqual(queryCity, hit.Entry.City)` check after the role
check. Same "permissive on empty" semantics as `roleEqual`.
3. **Bin extractor** adds a city-extract regex (e.g.
`\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`) to capture
the city + state token from the standard query shape.
4. **Unit tests** mirror the existing role-gate tests, locking the
exact Q43/Q49 scenario as a regression gate (and add an integration-
level test that record(role=Packers, city=Indianapolis) followed by
search(role=Packer, city=Chicago) doesn't surface the recorded
answer).
Estimated scope: 1 new field, 2 new gate checks, 1 new regex, ~5
tests. Same shape as real_002's role-gate fix. Half a session.
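A minimal sketch of the city-extract regex from item 3, using Go's standard `regexp` package. `extractCity` is a hypothetical helper, and the sample queries use the shortened "role in City ST for Client" form; the real query text also carries count and time tokens.

```go
package main

import (
	"fmt"
	"regexp"
)

// cityRe is the pattern proposed in the fix surface. Group 1 holds the
// city name; the two-letter state must follow but is not captured.
var cityRe = regexp.MustCompile(`\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`)

// extractCity returns the city named in the query, or "" when the
// pattern does not match. Hypothetical helper, not the real extractor.
func extractCity(query string) string {
	m := cityRe.FindStringSubmatch(query)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	queries := []string{
		"Packer in Chicago IL for Midway Distribution",
		"Packers in Indianapolis IN for Midway Distribution",
		"Assemblers for Riverfront Steel", // no city phrase
	}
	for _, q := range queries {
		fmt.Printf("%q -> %q\n", q, extractCity(q))
	}
}
```

As written, the pattern only accepts city words made of one capital followed by lowercase letters, so names such as "St. Louis" or "McAllen" would come back empty and fall through to the permissive-on-empty path.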
Open question: same-metro normalization ("Detroit MI" ≈ "Dearborn MI"?)
would help with real-world dispatch where coordinators legitimately
route across nearby cities. Punt that to future work — strict equality
closes the structural bleed without over-engineering.
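The strict-equality gate can be sketched against the exact Q49/Q43 scenario. Everything here is illustrative: the `PlaybookEntry` shape is reduced to the fields the gate needs, `normalizeRole` is a naive plural-strip stand-in, and `gatesPass` is a hypothetical name for the combined check.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaybookEntry is a reduced, hypothetical shape; City is the proposed field.
type PlaybookEntry struct {
	WorkerID string
	Role     string
	City     string
}

// normalizeRole is a naive stand-in for the real plural-strip logic.
func normalizeRole(role string) string {
	return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(role)), "s")
}

// roleEqual is permissive when either side is empty.
func roleEqual(a, b string) bool {
	na, nb := normalizeRole(a), normalizeRole(b)
	return na == "" || nb == "" || na == nb
}

// cityEqual uses strict, case-insensitive equality with the same
// permissive-on-empty semantics as roleEqual.
func cityEqual(a, b string) bool {
	a, b = strings.TrimSpace(a), strings.TrimSpace(b)
	if a == "" || b == "" {
		return true
	}
	return strings.EqualFold(a, b)
}

// gatesPass (hypothetical name) runs the city check after the role check.
func gatesPass(queryRole, queryCity string, e PlaybookEntry) bool {
	return roleEqual(queryRole, e.Role) && cityEqual(queryCity, e.City)
}

func main() {
	q49 := PlaybookEntry{WorkerID: "w-279", Role: "Packers", City: "Indianapolis"}

	fmt.Println(gatesPass("Packer", "Chicago", q49))       // false: Q43 no longer gets w-279
	fmt.Println(gatesPass("Packers", "Indianapolis", q49)) // true: same-city replay still lifts
	fmt.Println(gatesPass("Packer", "", q49))              // true: permissive when the query has no city
}
```

The third case is the trade-off of permissive-on-empty: a query whose city fails to extract still passes the gate, so extraction coverage decides how much the gate actually blocks.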
---
## What this confirms vs falsifies
**Confirmed:**
- Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
- Verbatim lift works (9/9 discoveries → warm top-1)
- Role-disambiguation works at cosine level for clean role-distinct
query distributions (Heritage Foods cluster is the proof)
- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)
**Falsified / weakened:**
- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on
the strict (rating ≥ 2) interpretation. Real number on broader
data is 68%, not 80%. Headline rank-match number (82%) holds.
- Real_002's role-gate fix is **not structurally airtight**. Q43
shows the cluster-bleed pattern can still fire under conditions
the prior tests didn't reach. Per the Q43 diagnosis above, the
leaking path is gate scope (there is no city check), not an
extractor failure.
---
## Next moves (informed by this evidence)
1. **Diagnose Q43 specifically** (done in the 2026-05-05 follow-up
above; the source was Q49's playbook entry, not Q18's): re-run the
role extractor on its query text, check whether Q18's playbook entry
has a role field recorded, look at the warm-pass top-K to see whether
`w-279` reaches there via boost, inject, or cosine-only.
2. **Strengthen the corpus for the role-city combos that scored
low rating** (the 7 queries where cold top-1 was rating=1). The
workers corpus has gaps the v3 slice surfaced.
3. **Don't ship the "80% generalizes" framing as-is.** The number
real_006 measured (82% rank, 68% rating ≥ 2) is the honest one
to publish.
This is what reality tests are for. Numbers from the memorized slice
gave a clean story; numbers from the held-out slice show where it
needs work.
---
## Repro
```bash
cd /home/profit/golangLAKEHOUSE
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt
PATH=/usr/local/go/bin:$PATH \
RUN_ID=real_006 \
JUDGE_MODEL=qwen2.5:latest \
QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
WITH_PARAPHRASE=1 \
WITH_REJUDGE=1 \
bash scripts/playbook_lift.sh
```
Local-only. No cloud calls.