golangLAKEHOUSE/reports/reality-tests/real_006_findings.md
root eb4308d8fd real_006 diagnosis: Q43 leak is cross-city, not cross-role
Traced w-279's path through the substrate. The leak source is Q49
(Packers in Indianapolis IN for Midway Distribution), NOT Q18 as
the initial reading suspected.

Q49 recorded w-279 with role=Packers, client=Midway Distribution,
city=Indianapolis. Q43 (Packer in Chicago IL for Midway Distribution)
ran later. roleEqual("Packer","Packers") → both normalize to
"packer" → role gate passes (correctly, by design — they ARE the
same role under plural-strip). Cosine distance between Q49's
recorded query and Q43's query is small enough to fit inside the
0.20 inject threshold because role + client + count + time-token
dominate the embedding (only the city and singular/plural noun
differ). Inject fires, w-279 surfaces at Q43's warm top-1 in
Chicago, judge correctly rates 2/5 — wrong city.

The role gate IS working. What's missing is a CITY gate. Real_002's
fix targeted cross-role bleed (Forklift → CNC). real_006 surfaced
cross-city bleed within same role + same client — a hole prior
tests structurally couldn't reach because they all sourced from
rows 0-9 where no such pair existed.

Concrete fix surface documented (1 new field, 2 gate checks, 1
regex, ~5 tests). Half a session of work, same shape as real_002.
Not implementing tonight — diagnosis only.

The 18 unit-level role-gate tests still pass, confirming the gate
is doing what it was specified to do. The bug is a missing
specification, not a broken implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 05:13:19 -05:00


Reality test real_006 — distribution-shift findings

Run: 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
Judge: qwen2.5:latest (Ollama, local); anchor's recommended judge, ~9s/query
Queries: 50 from tests/reality/real_coord_queries_v3.txt (rows 10-59 of fill_events.parquet, single need style)
Corpora: workers,ethereal_workers (5K + 10K)
Local-only: zero cloud calls per PRD line 70.

Companion to playbook_lift_real_006.{json,md}. That's the harness output; this is the reading.


Why this test exists

real_001-005 all sourced their queries from the first 10 rows of fill_events.parquet. gen_real_queries.go had -limit N but no -offset N, so every "real" reality test ran on the same memorized slice. The published "8 / 10 cold-pass top-1 = judge-best" was a property of those 10 rows, not measured generalization. real_006 closes the methodology gap: new -offset flag samples rows 10-59 (5× the count, never seen by the substrate).


Headline — substrate generalizes (mostly)

| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---|---|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | 41 / 50 (82%) | HOLDS |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | HOLDS |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |

Reading: the substrate's rank behavior generalizes cleanly — the top-1 worker is judge-approved at the same rate on fresh data as on memorized data. The quality of top-1 (rating ≥ 2) drops 12 points, which means 7 of the 41 "no-discovery" queries had cold top-1 the judge rated 1 (irrelevant) but the corpus had nothing better. Honest signal: parts of the v3 slice are in territory the workers corpus doesn't cover well.
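The arithmetic behind that reading, as a small sketch. The counts come from the table above; the `pct` helper is just for illustration.

```go
package main

import "fmt"

// pct returns n/d as a percentage.
func pct(n, d int) float64 { return 100 * float64(n) / float64(d) }

func main() {
	const total = 50
	rankMatch := 41   // cold top-1 == judge-best
	strictMatch := 34 // rank match AND judge rating >= 2

	fmt.Printf("rank match:   %.0f%%\n", pct(rankMatch, total))
	fmt.Printf("strict match: %.0f%%\n", pct(strictMatch, total))
	// Rank matches that the judge still rated 1 (nothing better in the corpus).
	fmt.Printf("rank matches rated 1: %d\n", rankMatch-strictMatch)
	// Strict rate on real_006 minus the 8/10 published for real_001.
	fmt.Printf("strict delta vs real_001: %.0f pts\n", pct(strictMatch, total)-pct(8, 10))
}
```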

The verbatim-lift property (discovery → warm top-1) is clean at 9/9, matching real_001's 2/2 perfectly. When the playbook records, the recorded answer comes back next time. That's the load-bearing learning property.


Cluster analysis — the cross-pollination question

real_001 found that same-(client, city) clusters cause Shape A boost to bleed across roles. Real_002's role-gate fix (roleEqual) was supposed to close that. real_006 has more cluster opportunities than real_001 did:

| Cluster | Count | Result |
|---|---|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean (see below) |
| Heritage Foods + Gary IN | 3 | clean: distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | bleed: Q43 regressed |

Heritage Foods + Gary IN (3 queries, all clean)

Q14 Assemblers       → e-1315
Q22 Material Handler → e-18
Q42 Machine Operator → e-1089

Three different roles → three different workers. Zero boosts fired, zero playbooks recorded. Role-disambiguation works at the cosine level for this cluster. Comparable to real_002's role-gate demonstration.

Riverfront Steel + Columbus OH (4 queries, partial)

Q9  Assemblers       → w-281    (cold = warm, no boost)
Q25 Quality Techs    → w-281    (cold = warm, no boost) ← same worker as Q9!
Q26 Machine Operator → w-4815   (clean)
Q32 Material Handler → e-8676 → w-2589  (judge promoted, playbook recorded)

Q9 and Q25 both surface w-281 on the cold pass for different roles. That is cosine-level confusion in the workers corpus, not playbook bleed: no boost fired on either query. The substrate isn't breaking; the corpus contains a worker whose resume embeds close to both "Assemblers" and "Quality Techs" for this client+city. The judge's rating for Q25 dropped 2 → 1 on rejudge, which is the LLM's own consistency drift, not a substrate fault. Worth noting but not a bug.
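One way to spot this kind of overlap mechanically is to group cold-pass top-1 results by worker and flag any worker returned for more than one role. The sketch below is hypothetical (the `coldResult` type and `rolesByWorker` helper are not part of the harness) and uses the four Riverfront Steel rows listed above.

```go
package main

import (
	"fmt"
	"sort"
)

// coldResult is a hypothetical record of one query's cold-pass top-1.
type coldResult struct {
	Query, Role, Worker string
}

// rolesByWorker returns, for each worker that was cold top-1 for more than
// one distinct role, the sorted list of those roles.
func rolesByWorker(results []coldResult) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, r := range results {
		if seen[r.Worker] == nil {
			seen[r.Worker] = map[string]bool{}
		}
		seen[r.Worker][r.Role] = true
	}
	out := map[string][]string{}
	for worker, roles := range seen {
		if len(roles) < 2 {
			continue
		}
		for role := range roles {
			out[worker] = append(out[worker], role)
		}
		sort.Strings(out[worker])
	}
	return out
}

func main() {
	cluster := []coldResult{
		{"Q9", "Assemblers", "w-281"},
		{"Q25", "Quality Techs", "w-281"},
		{"Q26", "Machine Operator", "w-4815"},
		{"Q32", "Material Handler", "e-8676"},
	}
	fmt.Println(rolesByWorker(cluster)) // map[w-281:[Assemblers Quality Techs]]
}
```

A hit from this check only says the same worker ranked first for two roles. Whether that is corpus overlap or playbook bleed still depends on whether a boost or inject fired.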

Midway Distribution + Chicago IL (3 queries) — the regression

Q18 Shipping Clerks   → cold w-4504 → warm w-1522  (boost=1, playbook recorded)
Q19 Machine Operators → cold = warm e-1251         (clean)
Q43 Packer            → cold e-7746 (rating 5) → warm w-279 (rating 2)  ← regressed

Diagnosis (2026-05-05 follow-up): the leak source isn't Q18 — it's Q49.

Three queries in real_006 touch w-279:

| # | role-extract | client | city | result | playbook? |
|---|---|---|---|---|---|
| Q8: Packers Indianapolis IN Heritage Foods | Packers | Heritage Foods | Indianapolis | w-279 (cold = judge-best) | no |
| Q49: Packers Indianapolis IN Midway Distribution | Packers | Midway Distribution | Indianapolis | cold e-2746 → warm w-279 (judge-best) | yes (recorded) |
| Q43: Packer Chicago IL Midway Distribution | Packer | Midway Distribution | Chicago | cold e-7746 (rating 5) → warm w-279 (rating 2) | no |

Q49 recorded w-279 with role=Packers, client=Midway Distribution, city=Indianapolis. When Q43 ran with role=Packer, client=Midway Distribution, city=Chicago:

  • roleEqual("Packer", "Packers") → both normalize to "packer" → gate passes (correctly, by design)
  • Q49's recorded query embedding is close enough to Q43's that the playbook hit's distance falls inside DefaultPlaybookMaxInjectDistance = 0.20 (role + client + count + time-token dominate cosine; only the city and the singular/plural noun differ)
  • Inject fires; w-279 (an Indianapolis worker) surfaces at Q43's warm top-1 in Chicago
  • Judge correctly rates this 2/5 — wrong city
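The sequence above can be sketched as follows. This is a reconstruction under assumptions, not the repository's code: the plural-strip in `normalizeRole` is a naive trailing-"s" trim, `shouldInject` is a hypothetical name, the comparison is assumed to be inclusive (`<=`), and the 0.15 distance is illustrative rather than measured. Only `roleEqual`, the permissive-on-empty behavior, and the 0.20 threshold come from the report.

```go
package main

import (
	"fmt"
	"strings"
)

// Threshold value quoted in the report.
const DefaultPlaybookMaxInjectDistance = 0.20

// normalizeRole lowercases and strips one trailing "s".
// Assumption: the real plural-strip may be more careful than this.
func normalizeRole(r string) string {
	r = strings.ToLower(strings.TrimSpace(r))
	return strings.TrimSuffix(r, "s")
}

// roleEqual is permissive on empty: a missing role on either side passes.
func roleEqual(a, b string) bool {
	if a == "" || b == "" {
		return true
	}
	return normalizeRole(a) == normalizeRole(b)
}

// shouldInject applies the two checks described above: distance and role.
// Note that the city never enters the decision.
func shouldInject(queryRole, entryRole string, distance float64) bool {
	return distance <= DefaultPlaybookMaxInjectDistance && roleEqual(queryRole, entryRole)
}

func main() {
	// Q43 (Packer) against Q49's entry (Packers): role gate passes, inject fires.
	fmt.Println(shouldInject("Packer", "Packers", 0.15))
	// The cross-role case real_002 targeted: blocked by the role gate.
	fmt.Println(shouldInject("Forklift", "CNC", 0.15))
}
```

The first call returns true and the second false, which is the report's point: the role gate does its job, and nothing in this decision path looks at the city.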

The role gate IS working as designed. What's missing is a city gate (or more generally, a metadata-equality gate on the demand attributes that don't appear in the role field). Real_002's fix anticipated cross-role bleed (Forklift → CNC); it didn't anticipate cross-city bleed within the same client+role.

Why prior tests missed this: real_001-005 sourced from rows 0-9 of fill_events.parquet. Among those 10 rows there was no Midway-Distribution × Packer × (different cities) pair. real_006 includes rows 10-59 which contain Q43 (Chicago) and Q49 (Indianapolis) on the same client+role — a structurally new combination the substrate hadn't been tested against.

The methodology fix paid off twice: the -offset flag that produced real_006's headline number (-12 pts strict) also surfaced a real cross-city leak that the role gate doesn't catch.

The other regressions (Q4 Centennial Packaging Flint, Q25 Riverfront Steel Quality Techs) are smaller (3→2 and 2→1) and look like judge consistency drift on borderline candidates. Q43 is the structural one.

Concrete fix surface (next session)

A city gate alongside the role gate would close this:

  1. PlaybookEntry gains City string (or generalize to Metadata map[string]string). Recorded at playbookRecord time from the same query the role extractor parses.
  2. InjectPlaybookMisses + ApplyPlaybookBoost add a cityEqual(queryCity, hit.Entry.City) check after the role check. Same "permissive on empty" semantics as roleEqual.
  3. Bin extractor adds a city-extract regex (e.g. \s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}) to capture the city + state token from the standard query shape.
  4. Unit tests mirror the existing role-gate tests, locking the exact Q43/Q49 scenario as a regression gate (and add an integration- level test that record(role=Packers, city=Indianapolis) followed by search(role=Packer, city=Chicago) doesn't surface the recorded answer).
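Steps 1, 2 and 4 can be sketched as below, assuming strict case-insensitive city equality and the same permissive-on-empty rule as roleEqual. The struct layout, the `WorkerID` field, the `gatesPass` helper and the naive plural-strip are all hypothetical; only the names PlaybookEntry, City, roleEqual and cityEqual come from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaybookEntry with the proposed City field (other fields are assumed).
type PlaybookEntry struct {
	Role     string
	City     string // proposed new field, recorded at playbookRecord time
	WorkerID string
}

// roleEqual: naive plural-strip comparison, permissive on empty (assumption).
func roleEqual(a, b string) bool {
	if a == "" || b == "" {
		return true
	}
	norm := func(s string) string {
		return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(s)), "s")
	}
	return norm(a) == norm(b)
}

// cityEqual: strict case-insensitive equality, permissive on empty.
func cityEqual(a, b string) bool {
	if a == "" || b == "" {
		return true
	}
	return strings.EqualFold(strings.TrimSpace(a), strings.TrimSpace(b))
}

// gatesPass runs the city check after the role check, as step 2 proposes.
func gatesPass(queryRole, queryCity string, e PlaybookEntry) bool {
	return roleEqual(queryRole, e.Role) && cityEqual(queryCity, e.City)
}

func main() {
	q49 := PlaybookEntry{Role: "Packers", City: "Indianapolis", WorkerID: "w-279"}

	fmt.Println(gatesPass("Packer", "Chicago", q49))       // Q43: blocked by the city gate
	fmt.Println(gatesPass("Packers", "Indianapolis", q49)) // same city: allowed
	fmt.Println(gatesPass("Packer", "", q49))              // no city extracted: permissive
}
```

The first case is the Q43/Q49 regression scenario from step 4: with the city gate in place, the Indianapolis entry no longer qualifies for a Chicago query. The third case shows the cost of permissive-on-empty semantics: if the extractor fails to find a city, the gate stays open.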

Estimated scope: 1 new field, 2 new gate checks, 1 new regex, ~5 tests. Same shape as real_002's role-gate fix. Half a session.
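The one new regex can be tried in isolation. The sketch below uses the pattern from step 3 verbatim with Go's regexp package; the `extractCity` helper and the sample query strings are hypothetical, since the report does not quote the exact query text. As written, the pattern matches the two-letter state but only captures the city in group 1.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from step 3. Group 1 is the city; the state is matched
// but not captured.
var cityRe = regexp.MustCompile(`\s+in\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+[A-Z]{2}`)

// extractCity returns the city token, or ok=false when the query does not
// follow the "... in <City> <ST> ..." shape. (Hypothetical helper.)
func extractCity(query string) (city string, ok bool) {
	m := cityRe.FindStringSubmatch(query)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Illustrative query shapes, not copied from the v3 query file.
	for _, q := range []string{
		"Need 3 Packers in Indianapolis IN for Midway Distribution",
		"Need 1 Packer in Chicago IL for Midway Distribution",
		"Need 2 Welders in Fort Wayne IN for Example Client",
		"Need 2 Packers for Midway Distribution",
	} {
		city, ok := extractCity(q)
		fmt.Printf("%q ok=%v\n", city, ok)
	}
}
```

A query with no "in <City> <ST>" phrase yields ok=false, which under permissive-on-empty semantics means the city gate would not fire for it.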

Open question: same-metro normalization ("Detroit MI" ≈ "Dearborn MI"?) would help with real-world dispatch where coordinators legitimately route across nearby cities. Punt that to future work — strict equality closes the structural bleed without over-engineering.


What this confirms vs falsifies

Confirmed:

  • Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
  • Verbatim lift works (9/9 discoveries → warm top-1)
  • Role-disambiguation works at cosine level for clean role-distinct query distributions (Heritage Foods cluster is the proof)
  • Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)

Falsified / weakened:

  • "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on the strict (rating ≥ 2) interpretation. Real number on broader data is ~68%, not 80%. Headline rank-match number (82%) holds.
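  • Real_002's role-gate fix does not close cluster bleed on its own. Q43 shows the pattern can still fire under conditions the prior tests didn't reach. The original open question was which path is leaking: extractor failure, gate scope, or cosine drift. The follow-up diagnosis above answers it: gate scope. The role gate passes correctly, and no city gate exists.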

Next moves (informed by this evidence)

  1. Close the Q43 leak. The 2026-05-05 follow-up diagnosis traced it to Q49's playbook entry reaching Q43 via inject, not to Q18. The remaining work is the city gate described under "Concrete fix surface".
  2. Strengthen the corpus for the role-city combos that scored low rating (the 7 queries where cold top-1 was rating=1). The workers corpus has gaps the v3 slice surfaced.
  3. Don't ship the "80% generalizes" framing as-is. The number real_006 measured (82% rank, 68% rating ≥ 2) is the honest one to publish.

This is what reality tests are for. Numbers from the memorized slice gave a clean story; numbers from the held-out slice show where it needs work.


Repro

cd /home/profit/golangLAKEHOUSE
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt

PATH=/usr/local/go/bin:$PATH \
  RUN_ID=real_006 \
  JUDGE_MODEL=qwen2.5:latest \
  QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
  WITH_PARAPHRASE=1 \
  WITH_REJUDGE=1 \
  bash scripts/playbook_lift.sh

Local-only. No cloud calls.