Methodology fix: gen_real_queries.go gains a -offset N flag. Every prior real_NNN test sourced queries from rows 0-9 of fill_events.parquet (default -limit 10), so the substrate's published "8/10 cold-pass top-1 = judge-best" was measured on a memorized slice, not held-out data. real_006 samples 50 fresh rows (offset 10, never seen by the workers or ethereal_workers corpora). Same harness, same local qwen2.5:latest judge, same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:
- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's 8/10 (80%) — substrate generalizes at rank level.
- Strict (rating ≥ 2): 34/50 (68%) — a 12-point drop from real_001's 80%. 7 of the 41 "no-discovery" queries had a cold top-1 the judge rated 1; the corpus has gaps for some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean; matches real_001's 2/2).
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank.
- Quality regressed: 3/50 — Q43 is the structural one.

Q43 (Packer at Midway Distribution / Chicago IL) regressed from rating 5 to rating 2 on the warm pass with `warm_boosted_count=0` and `playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city) recorded a playbook entry. The regression suggests Q18's recording leaked into Q43 via the warm-pass playbook corpus retrieval surface even though the role gate from real_002 should have blocked it. Three possible paths: the extractor failed on one query, the gate fires on the boost path but not the Shape B inject, or cosine drift puts the recorded worker close enough to Q43's embedding that warm-pass retrieval picks it up directly. Diagnosis is the next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25 surface the same worker w-281 for Assemblers vs Quality Techs at cold pass), but no playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: the substrate works on the fresh distribution at the rank level, verbatim lift is real, paraphrase recovery is real. What this falsifies: real_002's role-gate fix is not structurally airtight. The bleed pattern can still fire under conditions the prior tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reality test real_006 — distribution-shift findings
Run: 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
Judge: qwen2.5:latest (Ollama, local) — anchor's recommended judge, ~9s/query
Queries: 50 from tests/reality/real_coord_queries_v3.txt (rows 10-59 of fill_events.parquet, single need style)
Corpora: workers,ethereal_workers (5K + 10K)
Local-only: zero cloud calls per PRD line 70.
Companion to playbook_lift_real_006.{json,md}. That's the harness output; this is the reading.
Why this test exists
real_001-005 all sourced their queries from the first 10 rows of
fill_events.parquet. gen_real_queries.go had -limit N but no
-offset N, so every "real" reality test ran on the same memorized
slice. The published "8 / 10 cold-pass top-1 = judge-best" was a
property of those 10 rows, not measured generalization. real_006
closes the methodology gap: new -offset flag samples rows 10-59 (5×
the count, never seen by the substrate).
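The flag change is small; here is a minimal sketch of the offset/limit slicing it adds (names and row handling are illustrative assumptions, not the actual gen_real_queries.go):

```go
package main

import (
	"flag"
	"fmt"
)

// sliceRows returns rows[offset : offset+limit], clamped to the slice bounds.
// This is the whole behavioral change: -offset 10 -limit 50 selects rows 10-59
// instead of the default rows 0-9 that every prior real_NNN test reused.
func sliceRows(rows []string, offset, limit int) []string {
	if offset > len(rows) {
		offset = len(rows)
	}
	end := offset + limit
	if end > len(rows) {
		end = len(rows)
	}
	return rows[offset:end]
}

func main() {
	limit := flag.Int("limit", 10, "number of rows to sample")
	offset := flag.Int("offset", 0, "first row to sample (0 = the memorized slice)")
	flag.Parse()

	// Stand-in for fill_events.parquet rows.
	rows := make([]string, 100)
	for i := range rows {
		rows[i] = fmt.Sprintf("row-%03d", i)
	}
	for _, r := range sliceRows(rows, *offset, *limit) {
		fmt.Println(r)
	}
}
```

The clamp matters at the tail of the file: asking for 50 rows at offset 10 of a 40-row file should yield 30 rows, not a panic.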
Headline — substrate generalizes (mostly)
| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---|---|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | 41 / 50 (82%) | HOLDS |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | HOLDS |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |
Reading: the substrate's rank behavior generalizes cleanly — the top-1 worker is judge-approved at the same rate on fresh data as on memorized data. The quality of top-1 (rating ≥ 2) drops 12 points, which means 7 of the 41 "no-discovery" queries had a cold top-1 that the judge rated 1 (irrelevant) but the corpus had nothing better. Honest signal: parts of the v3 slice are in territory the workers corpus doesn't cover well.
The verbatim-lift property (discovery → warm top-1) is clean at 9/9, matching real_001's 2/2 perfectly. When the playbook records, the recorded answer comes back next time. That's the load-bearing learning property.
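Reduced to code, the rank-match vs strict split is two counters over the per-query records. This sketch uses assumed field names (not the harness's actual JSON schema in playbook_lift_real_006.json) to reproduce the 82%/68% arithmetic:

```go
package main

import "fmt"

// Result is a hypothetical per-query record; the real harness output
// may name these fields differently.
type Result struct {
	ColdTop1IsJudgeBest bool // rank match: cold top-1 is also the judge's best
	ColdTop1Rating      int  // judge rating of the cold top-1 (1 = irrelevant)
}

// rates returns the rank-match rate and the strict rate
// (rank match AND rating >= 2) over a run's results.
func rates(rs []Result) (rankMatch, strict float64) {
	var rm, st int
	for _, r := range rs {
		if r.ColdTop1IsJudgeBest {
			rm++
			if r.ColdTop1Rating >= 2 {
				st++
			}
		}
	}
	n := float64(len(rs))
	return float64(rm) / n, float64(st) / n
}

func main() {
	// real_006 shape: 41/50 rank matches, 7 of which the judge rated 1,
	// leaving 34/50 strict.
	rs := make([]Result, 50)
	for i := 0; i < 41; i++ {
		rs[i] = Result{ColdTop1IsJudgeBest: true, ColdTop1Rating: 3}
	}
	for i := 0; i < 7; i++ {
		rs[i].ColdTop1Rating = 1
	}
	rm, st := rates(rs)
	fmt.Printf("rank-match %.0f%%, strict %.0f%%\n", rm*100, st*100) // rank-match 82%, strict 68%
}
```

The 12-point gap between the two rates is exactly the 7 rating-1 queries; no other moving part is involved.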
Cluster analysis — the cross-pollination question
real_001 found that same-(client, city) clusters cause Shape A boost
to bleed across roles. real_002's role-gate fix (roleEqual) was
supposed to close that. real_006 has more cluster opportunities than
real_001 did:
| Cluster | Count | Result |
|---|---|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean — see below |
| Heritage Foods + Gary IN | 3 | clean — distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | bleed: Q43 regressed |
Heritage Foods + Gary IN (3 queries, all clean)
Q14 Assemblers → e-1315
Q22 Material Handler → e-18
Q42 Machine Operator → e-1089
Three different roles → three different workers. Zero boosts fired, zero playbooks recorded. Role-disambiguation works at the cosine level for this cluster. Comparable to real_002's role-gate demonstration.
Riverfront Steel + Columbus OH (4 queries, partial)
Q9 Assemblers → w-281 (cold = warm, no boost)
Q25 Quality Techs → w-281 (cold = warm, no boost) ← same worker as Q9!
Q26 Machine Operator → w-4815 (clean)
Q32 Material Handler → e-8676 → w-2589 (judge promoted, playbook recorded)
Q9 and Q25 both surface w-281 cold-pass for different roles —
that's a cosine-level confusion in the workers corpus, not a
playbook bleed. The substrate isn't breaking; the corpus contains a
worker whose resume embeds close to both "Assemblers" and "Quality
Techs" in this client+city. Q25's judge rating dropped 2 → 1 on
rejudge, which is the LLM's own consistency drift, not a substrate
fault. Worth noting but not a bug.
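The Q9/Q25 double hit is plain embedding geometry: one resume vector can sit almost equidistant from two role queries. A toy illustration (the vectors are fabricated for the example; real embeddings are high-dimensional):

```go
package main

import (
	"fmt"
	"math"
)

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Toy 3-d "embeddings": a resume mixing assembly and QC language
	// lands between the two role queries.
	assemblers := []float64{1.0, 0.1, 0.0}
	qualityTechs := []float64{0.1, 1.0, 0.0}
	w281 := []float64{0.7, 0.7, 0.1}
	fmt.Printf("%.3f vs %.3f\n", cosine(w281, assemblers), cosine(w281, qualityTechs))
}
```

With these toy vectors the two similarities come out equal, which is the shape of the w-281 situation: neither role query wins decisively, so the same worker tops both cold passes.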
Midway Distribution + Chicago IL (3 queries) — the regression
Q18 Shipping Clerks → cold w-4504 → warm w-1522 (boost=1, playbook recorded)
Q19 Machine Operators → cold = warm e-1251 (clean)
Q43 Packer → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed
Q43 regressed from rating 5 (perfect match) to rating 2 (weak)
even though warm_boosted_count=0 and playbook_recorded=false.
Same query, different warm top-1, no boost flag set. The playbook
recording from Q18 (Shipping Clerks at Midway/Chicago) reaches Q43
(Packer at Midway/Chicago) — same client+city, different role —
through the playbook corpus retrieval surface, even though the role
gate exists.
This is the same pattern real_001 surfaced (Q5/Q10 demoted by
Q2's playbook), and the role-gate fix from real_002 (roleEqual
on Role field) was supposed to close it. Possible explanations:
- Role extractor failed on either Q18 ("Shipping Clerks") or Q43 ("Packer") — an empty role bypasses the gate (the gate is "permissive on empty" by design)
- Gate fires on the boost path but not on the Shape B inject path — and "boost=0" in the JSON is `warm_boosted_count` (the count of re-ranked entries), not a flag for "no playbook influence at all"
- Cosine-level drift: the playbook entry happens to embed close enough to Q43 in raw cosine space that warm-pass retrieval picks up `w-279` directly, without going through boost or inject
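The first explanation is easy to sketch. roleEqual is named by real_002 but its body isn't shown here, so this is a hypothetical reconstruction whose only load-bearing property is the permissive-on-empty branch: an extractor failure on either side opens the gate.

```go
package main

import (
	"fmt"
	"strings"
)

// roleEqual is an assumed reconstruction of real_002's gate, not its
// actual source. The property under test: an empty role on EITHER side
// passes the gate, so a silent extractor failure disables role blocking.
func roleEqual(a, b string) bool {
	a = strings.TrimSpace(strings.ToLower(a))
	b = strings.TrimSpace(strings.ToLower(b))
	if a == "" || b == "" {
		return true // permissive on empty: extractor failure bypasses the gate
	}
	return a == b
}

func main() {
	fmt.Println(roleEqual("Shipping Clerks", "Packer")) // false: gate blocks
	fmt.Println(roleEqual("", "Packer"))                // true: empty role leaks through
}
```

If this is the leak, the fix choice is between failing closed on empty roles and making the extractor's failure loud enough to catch in the harness output.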
The other regressions (Q4 Centennial Packaging Flint MI, Q25 above) are smaller (3→2 and 2→1) and likely judge consistency drift on borderline candidates. Q43 is the structural one.
What this confirms vs falsifies
Confirmed:
- Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
- Verbatim lift works (9/9 discoveries → warm top-1)
- Role-disambiguation works at cosine level for clean role-distinct query distributions (Heritage Foods cluster is the proof)
- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)
Falsified / weakened:
- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on the strict (rating ≥ 2) interpretation. Real number on broader data is ~68%, not 80%. Headline rank-match number (82%) holds.
- real_002's role-gate fix is not structurally airtight. Q43 shows the cluster-bleed pattern can still fire under conditions the prior tests didn't reach. Open question: which path is leaking — extractor failure, gate scope, or cosine drift?
Next moves (informed by this evidence)
- Diagnose Q43 specifically: re-run the role extractor on its query text, check whether Q18's playbook entry has a role field recorded, and look at the warm-pass top-K to see whether `w-279` reaches there via boost, inject, or cosine-only.
- Strengthen the corpus for the role-city combos that scored low (the 7 queries where the cold top-1 was rating=1). The workers corpus has gaps the v3 slice surfaced.
- Don't ship the "80% generalizes" framing as-is. The number real_006 measured (82% rank, 68% rating ≥ 2) is the honest one to publish.
This is what reality tests are for. Numbers from the memorized slice gave a clean story; numbers from the held-out slice show where it needs work.
Repro
cd /home/profit/golangLAKEHOUSE
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt
PATH=/usr/local/go/bin:$PATH \
RUN_ID=real_006 \
JUDGE_MODEL=qwen2.5:latest \
QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
WITH_PARAPHRASE=1 \
WITH_REJUDGE=1 \
bash scripts/playbook_lift.sh
Local-only. No cloud calls.