# Reality test real_006: distribution-shift findings

**Run:** 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
**Judge:** `qwen2.5:latest` (Ollama, local), the anchor's recommended judge, ~9s/query
**Queries:** 50 from `tests/reality/real_coord_queries_v3.txt` (rows 10-59 of `fill_events.parquet`, single `need` style)
**Corpora:** `workers,ethereal_workers` (5K + 10K)
**Local-only:** zero cloud calls per PRD line 70.

Companion to `playbook_lift_real_006.{json,md}`. That's the harness output; this is the reading.

---

## Why this test exists

real_001-005 all sourced their queries from the **first 10 rows** of `fill_events.parquet`. `gen_real_queries.go` had `-limit N` but no `-offset N`, so every "real" reality test ran on the same memorized slice. The published "8 / 10 cold-pass top-1 = judge-best" was a property of those 10 rows, not measured generalization.

real_006 closes the methodology gap: a new `-offset` flag samples rows 10-59 (5× the count, never seen by the substrate).

---

## Headline: substrate generalizes (mostly)

| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---:|---:|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | **41 / 50 (82%)** | **HOLDS** |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | **HOLDS** |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |

**Reading:** the substrate's *rank* behavior generalizes cleanly: the top-1 worker is judge-approved at the same rate on fresh data as on memorized data.
The *quality* of top-1 (rating ≥ 2) drops 12 points (80% → 68%): 7 of the 41 "no-discovery" queries (41 rank matches minus 34 rated ≥ 2) had a cold top-1 that the judge rated 1 (irrelevant), but the corpus had nothing better. Honest signal: parts of the v3 slice are in territory the workers corpus doesn't cover well.

The verbatim-lift property (discovery → warm top-1) is **clean at 9/9**, matching real_001's 2/2 perfectly. When the playbook records, the recorded answer comes back next time. That's the load-bearing learning property.

---

## Cluster analysis: the cross-pollination question

real_001 found that same-(client, city) clusters cause Shape A boost to bleed across roles. real_002's role-gate fix (`roleEqual`) was supposed to close that. real_006 has *more* cluster opportunities than real_001 did:

| Cluster | Count | Result |
|---|---:|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean (see below) |
| Heritage Foods + Gary IN | 3 | **clean**: distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | **bleed: Q43 regressed** |

### Heritage Foods + Gary IN (3 queries, all clean)

```
Q14 Assemblers        → e-1315
Q22 Material Handler  → e-18
Q42 Machine Operator  → e-1089
```

Three different roles → three different workers. Zero boosts fired, zero playbooks recorded. **Role-disambiguation works at the cosine level for this cluster.** Comparable to real_002's role-gate demonstration.

### Riverfront Steel + Columbus OH (4 queries, partial)

```
Q9  Assemblers        → w-281 (cold = warm, no boost)
Q25 Quality Techs     → w-281 (cold = warm, no boost)  ← same worker as Q9!
Q26 Machine Operator  → w-4815 (clean)
Q32 Material Handler  → e-8676 → w-2589 (judge promoted, playbook recorded)
```

Q9 and Q25 both surface `w-281` cold-pass for *different roles*. That's a **cosine-level confusion** in the workers corpus, not a playbook bleed.
The substrate isn't breaking; the corpus contains a worker whose resume embeds close to both "Assemblers" and "Quality Techs" in this client+city. The judge rating for Q25 dropped 2 → 1 on rejudge even though cold and warm top-1 are the same worker, which points to the LLM's own consistency drift, not a substrate fault. Worth noting but not a bug.

### Midway Distribution + Chicago IL (3 queries): the regression

```
Q18 Shipping Clerks    → cold w-4504 → warm w-1522 (boost=1, playbook recorded)
Q19 Machine Operators  → cold = warm e-1251 (clean)
Q43 Packer             → cold e-7746 (rating 5) → warm w-279 (rating 2)  ← regressed
```

**Q43 regressed from rating 5 (perfect match) to rating 2 (weak) even though `warm_boosted_count=0` and `playbook_recorded=false`.** Same query, different warm top-1, no boost flag set. The playbook recording from Q18 (Shipping Clerks at Midway/Chicago) appears to reach Q43 (Packer at Midway/Chicago), which shares the client+city but not the role, through the playbook corpus retrieval surface, even though the role gate exists.

This is the **same pattern real_001 surfaced** (Q5/Q10 demoted by Q2's playbook), and the role-gate fix from real_002 (`roleEqual` on the `Role` field) was supposed to close it. Possible explanations:

1. The role extractor failed on either Q18 ("Shipping Clerks") or Q43 ("Packer"). An empty role bypasses the gate (the gate is "permissive on empty" by design).
2. The gate fires on the boost path but not on the Shape B inject path. "boost=0" in the JSON is `warm_boosted_count` (the count of re-ranked entries), not a flag for "no playbook influence at all".
3. Cosine-level drift: the playbook entry just happens to be close enough to Q43 in raw cosine space that warm-pass retrieval picks up `w-279` directly without going through boost or inject.

The other regressions (Q4 Centennial Packaging Flint MI, Q25 above) are smaller (3→2 and 2→1) and likely judge consistency drift on borderline candidates. Q43 is the structural one.
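
Explanation 1 is easiest to see with a small sketch of a gate that is permissive on empty. The body of `roleEqual` below is a hypothetical stand-in (the real implementation is not shown in this report), and the lowercase-and-trim normalization is an assumption made only for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRole lowercases and trims a role string.
// Assumption: the real extractor may normalize differently.
func normalizeRole(r string) string {
	return strings.ToLower(strings.TrimSpace(r))
}

// roleEqual is a hypothetical sketch of a "permissive on empty" gate:
// if either side has no role, there is nothing to compare, so the
// playbook entry is allowed through.
func roleEqual(queryRole, playbookRole string) bool {
	q := normalizeRole(queryRole)
	p := normalizeRole(playbookRole)
	if q == "" || p == "" {
		return true // permissive on empty
	}
	return q == p
}

func main() {
	// Both roles extracted: the cross-role match is blocked.
	fmt.Println(roleEqual("Packer", "Shipping Clerks")) // false
	// Extractor returned nothing for the query: the gate lets it through.
	fmt.Println(roleEqual("", "Shipping Clerks")) // true
}
```

Under this assumed design, a single failed extraction on either side is enough to turn the gate into a no-op for that pair, which is why checking the extracted roles for Q18 and Q43 is the cheapest first diagnostic.
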

---

## What this confirms vs falsifies

**Confirmed:**

- Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
- Verbatim lift works (9/9 discoveries → warm top-1)
- Role-disambiguation works at cosine level for clean role-distinct query distributions (Heritage Foods cluster is the proof)
- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)

**Falsified / weakened:**

- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on the strict (rating ≥ 2) interpretation. The figure on the held-out slice is 68% (34/50), not 80%. The headline rank-match number (82%) holds.
- real_002's role-gate fix is **not structurally airtight**. Q43 shows the cluster-bleed pattern can still fire under conditions the prior tests didn't reach. Open question: which path is leaking: extractor failure, gate scope, or cosine drift?

---

## Next moves (informed by this evidence)

1. **Diagnose Q43 specifically**: re-run the role extractor on its query text, check whether Q18's playbook entry has a role field recorded, and look at the warm-pass top-K to see whether `w-279` gets there via boost, inject, or cosine-only.
2. **Strengthen the corpus for the role-city combos that scored a low rating** (the 7 queries where cold top-1 was rating=1). The workers corpus has gaps the v3 slice surfaced.
3. **Don't ship the "80% generalizes" framing as-is.** The number real_006 measured (82% rank, 68% rating ≥ 2) is the honest one to publish.

This is what reality tests are for. Numbers from the memorized slice gave a clean story; numbers from the held-out slice show where it needs work.
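
The percentages quoted in move 3 and in the headline table can be re-derived from the raw counts. A small sketch in Go follows; the `pct` helper is illustrative and not part of the harness, and the counts are copied from the headline table:

```go
package main

import "fmt"

// pct returns n/d expressed as a percentage.
func pct(n, d int) float64 {
	return 100 * float64(n) / float64(d)
}

func main() {
	const (
		queries     = 50 // real_006 query count
		rankMatch   = 41 // cold top-1 = judge-best
		strictMatch = 34 // cold top-1 = judge-best AND rating >= 2
	)

	fmt.Printf("rank match:   %.0f%%\n", pct(rankMatch, queries))
	fmt.Printf("strict match: %.0f%%\n", pct(strictMatch, queries))
	// real_001 scored 8/10 on both readings, so the strict delta is 68 - 80.
	fmt.Printf("strict delta vs real_001: %.0f pts\n", pct(strictMatch, queries)-pct(8, 10))
	// Rank-matched queries whose top-1 was still rated below 2.
	fmt.Printf("rank-matched but low-rated: %d queries\n", rankMatch-strictMatch)
}
```

The last line is where the "7 queries" in move 2 comes from: 41 rank matches minus the 34 that also cleared rating ≥ 2.
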

---

## Repro

```bash
cd /home/profit/golangLAKEHOUSE

# Build the query generator and write the held-out slice (rows 10-59).
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt

# Run the harness with paraphrase and rejudge passes enabled.
PATH=/usr/local/go/bin:$PATH \
  RUN_ID=real_006 \
  JUDGE_MODEL=qwen2.5:latest \
  QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
  WITH_PARAPHRASE=1 \
  WITH_REJUDGE=1 \
  bash scripts/playbook_lift.sh
```

Local-only. No cloud calls.
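
For readers who want to see what `-limit 50 -offset 10` amounts to, here is a minimal sketch of offset/limit windowing using Go's standard `flag` package. It is not the code in `gen_real_queries.go`, which this report does not show; the `window` function, the flag defaults, and the stand-in row slice are all assumptions:

```go
package main

import (
	"flag"
	"fmt"
)

// window returns rows[offset : offset+limit], clamped to the slice bounds.
// A negative limit is treated as "no limit". Hypothetical helper.
func window(rows []int, offset, limit int) []int {
	if offset < 0 {
		offset = 0
	}
	if offset > len(rows) {
		offset = len(rows)
	}
	end := offset + limit
	if limit < 0 || end > len(rows) {
		end = len(rows)
	}
	return rows[offset:end]
}

func main() {
	limit := flag.Int("limit", 50, "number of rows to sample")
	offset := flag.Int("offset", 10, "rows to skip before sampling")
	flag.Parse()

	// Stand-in for the parquet rows: 100 row indices.
	rows := make([]int, 100)
	for i := range rows {
		rows[i] = i
	}

	got := window(rows, *offset, *limit)
	if len(got) > 0 {
		fmt.Printf("rows %d-%d (%d rows)\n", got[0], got[len(got)-1], len(got))
	}
}
```

With an offset of 0 and a limit of 10 the same function reproduces the slice that real_001-005 reused, which is the gap the new flag closes.
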