root 95f155b017 real_006: distribution-shift test on rows 10-59 of fill_events

Methodology fix: gen_real_queries.go gains -offset N flag. Every prior
real_NNN test sourced queries from rows 0-9 of fill_events.parquet
(default -limit 10), so the substrate's published "8/10 cold-pass top-1
= judge-best" was measured on a memorized slice, not held-out data.

real_006 samples 50 fresh rows (offset 10, never seen by the workers
or ethereal_workers corpora). Same harness, same local qwen2.5:latest
judge, same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:

- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's
  8/10 (80%) — substrate generalizes at rank level.
- Strict (rank match and cold top-1 rating ≥ 2): 34/50 (68%), a
  12-point drop from real_001's 80%. ~7 of the 41 "no-discovery"
  queries had a cold top-1 the judge rated 1; the corpus has gaps for
  some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2)
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank
- Quality regressed: 3/50 — Q43 is the structural one

Q43 (Packer at Midway Distribution / Chicago IL) regressed from
rating 5 to rating 2 on warm pass with `warm_boosted_count=0` and
`playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city)
recorded a playbook entry. The regression suggests Q18's recording
leaked into Q43 via the warm-pass playbook corpus retrieval surface
even though the role gate from real_002 should have blocked it.
Three possible paths: (1) the extractor failed on one query, (2) the
gate fires on the boost path but not on Shape B inject, or (3) cosine
drift puts the recorded worker close enough to Q43's embedding that
warm-pass retrieval picks it up directly. Diagnosis is the next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25
  surface the same worker w-281 for Assemblers vs Quality Techs at
  cold-pass), but no playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: substrate works on the fresh distribution at the
rank level, verbatim lift is real, paraphrase recovery is real.

What this falsifies: real_002's role-gate fix is not structurally
airtight. The bleed pattern can still fire under conditions the
prior tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-05 04:54:03 -05:00

11 KiB


Playbook-Lift Reality Test — Run real_006

Generated: 2026-05-05T09:50:08.929241389Z
Judge: qwen2.5:latest (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
Corpora: workers,ethereal_workers
Workers limit: 5000
Queries: tests/reality/real_coord_queries_v3.txt (50 executed)
K per pass: 10
Paraphrase pass: ENABLED
Re-judge pass: ENABLED
Evidence: reports/reality-tests/playbook_lift_real_006.json

Headline

Metric	Value
Total queries run	50
Cold-pass discoveries (judge-best ≠ top-1, playbook entry recorded)	9
Warm-pass lifts (recorded playbook → top-1)	9
No change (no playbook entry recorded; judge-best already top-1 in 34 of these)	41
Playbook boosts triggered (warm pass)	10
Mean Δ top-1 distance (warm − cold)	-0.05014307
Paraphrase pass — recorded answer at rank 0 (top-1)	6 / 9
Paraphrase pass — recorded answer at any rank in top-K	9 / 9
Quality lift (warm top-1 rating > cold top-1 rating)	11 / 50
Quality neutral (warm top-1 rating = cold top-1 rating)	36 / 50
Quality regressed (warm top-1 rating < cold top-1 rating)	3 / 50

Verbatim lift rate: 9 of 9 discoveries became top-1 after warm pass.

Per-query results

#	Query	Cold top-1	Cold judge-best (rank/rating)	Recorded?	Warm top-1	Judge-best warm rank	Lift
1	Need 1 Loader in Kansas City MO starting at 17:30 for Corner	w-4806	0/5	—	w-4806	0	no
2	Need 2 Assemblers in Cincinnati OH starting at 14:30 for Gre	w-4371	0/4	—	w-4371	0	no
3	Need 1 Forklift Operator in Lexington KY starting at 08:30 f	e-8263	1/5	✓ w-4636	w-4636	0	YES
4	Need 2 Assemblers in Flint MI starting at 08:30 for Centenni	e-7186	1/4	✓ e-9319	e-9319	0	YES
5	Need 2 Welders in Indianapolis IN starting at 10:00 for Nort	e-5834	0/4	—	e-5834	0	no
6	Need 2 Material Handlers in Cincinnati OH starting at 13:00	e-4871	4/2	—	e-4871	4	no
7	Need 3 Pickers in Flint MI starting at 17:00 for Centennial	e-5571	0/2	—	e-5571	0	no
8	Need 3 Packers in Indianapolis IN starting at 09:00 for Heri	w-279	0/4	—	w-279	0	no
9	Need 3 Assemblers in Columbus OH starting at 17:30 for River	w-281	0/4	—	w-281	0	no
10	Need 5 Machine Operators in Cleveland OH starting at 14:30 f	e-8279	0/4	—	e-8279	0	no
11	Need 2 Assemblers in Grand Rapids MI starting at 13:00 for C	w-4502	0/3	—	w-4502	0	no
12	Need 2 Pickers in Akron OH starting at 10:30 for Summit Indu	e-5655	2/2	—	e-5655	2	no
13	Need 3 Quality Techs in Lexington KY starting at 12:30 for K	e-6369	0/4	—	e-6369	0	no
14	Need 4 Assemblers in Gary IN starting at 12:00 for Heritage	e-1315	0/2	—	e-1315	0	no
15	Need 3 Packers in Toledo OH starting at 16:00 for Cornerston	e-4887	1/2	—	e-4887	1	no
16	Need 4 Warehouse Associates in Fort Wayne IN starting at 13:	w-4434	0/4	—	w-4434	0	no
17	Need 4 Assemblers in Columbus OH starting at 13:00 for Midwa	w-281	0/4	—	w-281	0	no
18	Need 2 Shipping Clerks in Chicago IL starting at 17:00 for M	w-4504	1/4	✓ w-1522	w-1522	0	YES
19	Need 2 Machine Operators in Chicago IL starting at 11:00 for	e-1251	0/4	—	e-1251	0	no
20	Need 3 CNC Operators in Grand Rapids MI starting at 10:00 fo	w-792	0/3	—	w-792	0	no
21	Need 1 Warehouse Associate in Lexington KY starting at 09:30	e-2331	0/4	—	e-2331	0	no
22	Need 2 Material Handlers in Gary IN starting at 11:00 for He	e-18	0/2	—	e-18	0	no
23	Need 5 Assemblers in Fort Wayne IN starting at 16:30 for Mid	e-6271	7/3	—	e-6271	7	no
24	Need 1 Loader in Cincinnati OH starting at 08:00 for Summit	e-8843	1/5	✓ w-4473	w-4473	0	YES
25	Need 3 Quality Techs in Columbus OH starting at 12:00 for Ri	w-281	0/2	—	w-281	0	no
26	Need 1 Machine Operator in Columbus OH starting at 09:30 for	w-4815	0/4	—	w-4815	0	no
27	Need 3 Machine Operators in Madison WI starting at 12:00 for	w-2027	0/4	—	w-2027	0	no
28	Need 2 Material Handlers in Kansas City MO starting at 11:30	e-6774	0/3	—	e-6774	0	no
29	Need 3 Loaders in Flint MI starting at 16:00 for Parallel Ma	w-4875	0/2	—	w-4875	0	no
30	Need 2 Welders in Louisville KY starting at 13:00 for Horizo	w-2267	0/4	—	w-2267	0	no
31	Need 1 CNC Operator in Flint MI starting at 10:30 for Horizo	e-317	7/3	—	e-317	7	no
32	Need 1 Material Handler in Columbus OH starting at 15:30 for	e-8676	1/4	✓ w-2589	w-2589	0	YES
33	Need 2 Forklift Operators in Louisville KY starting at 14:30	w-1830	0/4	—	w-1830	0	no
34	Need 2 Warehouse Associates in Chicago IL starting at 10:00	w-4743	7/4	✓ e-9171	e-9171	0	YES
35	Need 2 Material Handlers in Gary IN starting at 15:00 for Pa	w-4236	1/2	—	w-4236	1	no
36	Need 1 Forklift Operator in Grand Rapids MI starting at 10:0	w-3227	3/2	—	w-3227	3	no
37	Need 2 Pickers in Louisville KY starting at 12:30 for Corner	e-6489	2/4	✓ e-7622	e-7622	0	YES
38	Need 2 Loaders in Indianapolis IN starting at 17:30 for Midw	e-9877	0/4	—	e-9877	0	no
39	Need 1 Shipping Clerk in Indianapolis IN starting at 12:30 f	w-4635	0/5	—	w-4635	0	no
40	Need 2 Assemblers in Cincinnati OH starting at 08:00 for Key	w-4945	0/4	—	w-4945	0	no
41	Need 5 Quality Techs in Kansas City MO starting at 11:30 for	e-5633	0/4	—	e-5633	0	no
42	Need 2 Machine Operators in Gary IN starting at 10:00 for He	e-1089	0/2	—	e-1089	0	no
43	Need 1 Packer in Chicago IL starting at 09:30 for Midway Dis	e-7746	0/5	—	w-279	1	no
44	Need 2 Pickers in Lexington KY starting at 17:30 for Vanguar	e-3375	0/4	—	e-3375	0	no
45	Need 2 Maintenance Techs in Grand Rapids MI starting at 17:0	e-6083	0/2	—	e-6083	0	no
46	Need 1 Material Handler in Detroit MI starting at 10:30 for	w-3286	0/5	—	w-3286	0	no
47	Need 1 Welder in Akron OH starting at 15:00 for Summit Indus	e-6149	0/2	—	e-6149	0	no
48	Need 1 Shipping Clerk in Cincinnati OH starting at 13:30 for	e-4218	3/5	✓ w-3488	w-3488	0	YES
49	Need 5 Packers in Indianapolis IN starting at 10:30 for Midw	e-2746	2/4	✓ w-279	w-279	0	YES
50	Need 1 Forklift Operator in Louisville KY starting at 10:30	w-1830	0/4	—	w-1830	0	no
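
The table above can be cross-checked against the headline counts with a short tally. This is a minimal sketch: the dictionary is transcribed by hand from the rows whose cold judge-best rank is not 0, and the names `NON_TOP1` and `tally` are illustrative, not part of the harness.

```python
# Rows of the per-query table whose judge-best result was NOT at rank 0 on the
# cold pass: query number -> (judge-best rank, judge rating, playbook recorded?).
# Transcribed by hand from the table; every other query has judge-best at rank 0.
NON_TOP1 = {
    3: (1, 5, True), 4: (1, 4, True), 6: (4, 2, False), 12: (2, 2, False),
    15: (1, 2, False), 18: (1, 4, True), 23: (7, 3, False), 24: (1, 5, True),
    31: (7, 3, False), 32: (1, 4, True), 34: (7, 4, True), 35: (1, 2, False),
    36: (3, 2, False), 37: (2, 4, True), 48: (3, 5, True), 49: (2, 4, True),
}
TOTAL_QUERIES = 50


def tally(non_top1, total):
    """Count rank-0 rows, recorded rows, and unrecorded non-rank-0 rows."""
    recorded = {q for q, (_, _, rec) in non_top1.items() if rec}
    unrecorded = set(non_top1) - recorded
    return {
        "rank0": total - len(non_top1),
        "recorded": len(recorded),
        "unrecorded_non_top1": len(unrecorded),
        "min_recorded_rating": min(non_top1[q][1] for q in recorded),
        "max_unrecorded_rating": max(non_top1[q][1] for q in unrecorded),
    }


summary = tally(NON_TOP1, TOTAL_QUERIES)
print(summary)
```

Read this way, the 41 "No change" queries split into 34 with judge-best at rank 0 and 7 where judge-best sat lower in the list but nothing was recorded. In this table those 7 all have a judge-best rating of 3 or lower, while every recorded entry has a rating of 4 or higher. The report does not state a recording threshold, so treat that as an observation about this run rather than a documented rule.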

Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose cold pass (Pass 1) recorded a playbook entry, the judge model rephrased the query, and the rephrased version was sent through warm matrix.search. The rank of the recorded answer ID in those results tests whether cosine similarity on the embedded paraphrase finds the recorded query's vector.

#	Original (≤40c)	Paraphrase (≤60c)	Recorded answer	Paraphrase top-1	Recorded rank	Paraphrase lift
3	Need 1 Forklift Operator in Lexington KY	Vanguard Components requires a Forklift Operator in Lexingto	w-4636	w-4636	0	YES
4	Need 2 Assemblers in Flint MI starting a	Centennial Packaging requires 2 Assemblers to start at 08:30	e-9319	e-9319	0	YES
18	Need 2 Shipping Clerks in Chicago IL sta	Looking for 2 Shipping Clerks in Chicago, IL to start at 5:0	w-1522	w-4504	1	no
24	Need 1 Loader in Cincinnati OH starting	Summit Industrial requires 1 Loader position from 08:00 onwa	w-4473	w-4473	0	YES
32	Need 1 Material Handler in Columbus OH s	Looking for a Material Handler in Columbus, OH who can start	w-2589	w-2589	0	YES
34	Need 2 Warehouse Associates in Chicago I	Looking for 2 Warehouse Associates to work from 10:00 onward	e-9171	e-9171	0	YES
37	Need 2 Pickers in Louisville KY starting	Looking for 2 Pickers in Louisville, KY to start at 12:30 fo	e-7622	e-6489	2	no
48	Need 1 Shipping Clerk in Cincinnati OH s	Summit Industrial requires a Shipping Clerk in Cincinnati, O	w-3488	w-3488	0	YES
49	Need 5 Packers in Indianapolis IN starti	Looking for 5 packers in Indianapolis, IN to start at 10:30	w-279	e-2746	1	no
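
The two paraphrase headline numbers follow directly from the "Recorded rank" column. A minimal sketch, with the ranks transcribed by hand from the table above and K taken from the run header; `paraphrase_rates` is an illustrative helper, not a harness function.

```python
# 0-based rank of the recorded answer in each paraphrase result list,
# transcribed from the table above (query number -> rank).
RECORDED_RANK = {3: 0, 4: 0, 18: 1, 24: 0, 32: 0, 34: 0, 37: 2, 48: 0, 49: 1}
K = 10  # "K per pass" from the run header


def paraphrase_rates(ranks, k):
    """Return (hits at rank 0, hits anywhere in top-k, total paraphrased queries)."""
    top1 = sum(1 for r in ranks.values() if r == 0)
    in_top_k = sum(1 for r in ranks.values() if 0 <= r < k)
    return top1, in_top_k, len(ranks)


top1, in_top_k, total = paraphrase_rates(RECORDED_RANK, K)
print(f"top-1: {top1}/{total}, any rank in top-{K}: {in_top_k}/{total}")
```

The three misses (Q18, Q37, Q49) all kept the recorded answer at rank 1 or 2, so the playbook entry was retrieved but did not displace the top result.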

Honesty caveats

- Judge IS the ground truth proxy. Without human-labeled relevance, the LLM judge's verdict is what defines "best." If qwen2.5:latest rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
- Score-1.0 boost = distance halved. Playbook math is distance' = distance × (1 - 0.5 × score). Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
- Verbatim vs paraphrase. The verbatim lift rate (above) is the cheap case: same query, recorded playbook, expected boost. The paraphrase pass (when enabled) is the actual learning property: similar-but-different queries hitting a recorded playbook. Compare verbatim and paraphrase lift rates; paraphrase should be lower (semantic-distance gates some playbook hits), but non-zero is the meaningful signal.
- Multi-corpus skew. Default corpora=workers,ethereal_workers. If all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
- Judge resolution. This run used qwen2.5:latest from env JUDGE_MODEL=qwen2.5:latest. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
- Paraphrase generation also uses the judge. The same model that rates relevance also rephrases queries. A judge that's bad at rating staffing queries is probably also bad at rephrasing them. Worth sanity-checking a sample of paraphrase_query values in the JSON before trusting the paraphrase lift number.
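
The "Score-1.0 boost" caveat above can be made concrete with a small sketch of the stated formula. It assumes only the judge-best result is boosted and the cold top-1 keeps its original distance. The function names and the example distances are hypothetical, not taken from the harness or from this run.

```python
def boosted_distance(distance: float, score: float) -> float:
    """Playbook math as stated in the caveat: distance' = distance * (1 - 0.5 * score)."""
    return distance * (1.0 - 0.5 * score)


def can_lift(judge_best_distance: float, cold_top1_distance: float,
             score: float = 1.0) -> bool:
    """True if the boosted judge-best distance reaches the unboosted cold top-1.

    Assumption: the cold top-1 itself receives no boost. A tie counts as a
    lift here; the real outcome of a tie depends on the search's sort order.
    """
    return boosted_distance(judge_best_distance, score) <= cold_top1_distance


# Hypothetical distances, chosen only to show both sides of the 2x condition.
print(can_lift(0.30, 0.20))  # 0.30 is within 2x of 0.20
print(can_lift(0.50, 0.20))  # 0.50 is beyond 2x of 0.20
```

With score = 1.0 the boost halves the distance, so a judge-best result more than twice as far away as the cold top-1 cannot be promoted by the boost alone.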

Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why (judge variance, distance gap too wide, or playbook math too gentle). The score=1.0 / 0.5× formula may need retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
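
The lift-rate rules above can be written as a small decision helper. A minimal sketch: `next_move` is an illustrative name. The report gives no guidance for lift rates between 20% and 50% and no numeric threshold for a "low" discovery rate, so the code leaves both unspecified rather than guessing.

```python
def next_move(lifts: int, discoveries: int) -> str:
    """Map the verbatim lift rate onto the report's stated next moves."""
    if discoveries == 0:
        return "no discoveries: lift rate is undefined"
    rate = lifts / discoveries
    if rate >= 0.5:
        return "playbook is doing real work: move to paraphrase queries and tag-based boost"
    if rate < 0.2:
        return "investigate: judge variance, distance gap, or playbook math too gentle"
    return "between thresholds: not specified by the report"


# Headline numbers from this run: 9 warm-pass lifts out of 9 discoveries.
print(next_move(9, 9))

# Discovery rate for this run (9 of 50 queries); the report sets no cutoff for "low".
discovery_rate = 9 / 50
print(f"discovery rate: {discovery_rate:.0%}")
```

For real_006 the lift rate is 9/9, which lands in the first branch.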
