Methodology fix: gen_real_queries.go gains -offset N flag.

Every prior real_NNN test sourced queries from rows 0-9 of fill_events.parquet (default -limit 10), so the substrate's published "8/10 cold-pass top-1 = judge-best" was measured on a memorized slice, not held-out data. real_006 samples 50 fresh rows (offset 10, never seen by the workers or ethereal_workers corpora). Same harness, same local qwen2.5:latest judge, same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:
- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's 8/10 (80%); substrate generalizes at rank level.
- Strict (rating ≥ 2): 34/50 (68%), a 12-point drop from real_001's 80%. ~7 of 41 "no-discovery" queries had cold top-1 the judge rated 1; the corpus has gaps for some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2)
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank
- Quality regressed: 3/50; Q43 is the structural one

Q43 (Packer at Midway Distribution / Chicago IL) regressed from rating 5 to rating 2 on warm pass with `warm_boosted_count=0` and `playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city) recorded a playbook entry. The regression suggests Q18's recording leaked into Q43 via the warm-pass playbook corpus retrieval surface even though the role gate from real_002 should have blocked it. Three possible paths:
- extractor failed on one query
- gate fires on boost path but not Shape B inject
- cosine drift puts the recorded worker close enough to Q43's embedding that warm-pass retrieval picks it up directly

Diagnosis is the next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25 surface same worker w-281 for Assemblers vs Quality Techs at cold-pass), but no playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: substrate works on the fresh distribution at the rank level, verbatim lift is real, paraphrase recovery is real.

What this falsifies: real_002's role-gate fix is not structurally airtight. The bleed pattern can still fire under conditions the prior tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
152 lines
11 KiB
Markdown
# Playbook-Lift Reality Test — Run real_006

**Generated:** 2026-05-05T09:50:08.929241389Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/real_coord_queries_v3.txt` (50 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_real_006.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 50 |
| Cold-pass discoveries (judge-best ≠ top-1) | 9 |
| Warm-pass lifts (recorded playbook → top-1) | 9 |
| No change (judge-best already top-1, no playbook needed) | 41 |
| Playbook boosts triggered (warm pass) | 10 |
| Mean Δ top-1 distance (warm − cold) | -0.05014307 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **6 / 9** |
| Paraphrase pass — recorded answer at any rank in top-K | 9 / 9 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **11 / 50** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 36 / 50 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 50 |

**Verbatim lift rate:** 9 of 9 discoveries became top-1 after warm pass.
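
The headline rates follow directly from the counts in the table. As a minimal sanity-check sketch (the `rate` helper and the constant names are illustrative and not part of the harness), the three quality buckets should account for all 50 queries, and the verbatim lift rate is warm-pass lifts divided by cold-pass discoveries:

```go
package main

import "fmt"

// rate returns part/total as a percentage; it returns 0 when total is 0.
// Hypothetical helper for illustration, not taken from the harness.
func rate(part, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(part) / float64(total)
}

func main() {
	// Counts copied from the Headline table.
	const (
		total       = 50
		discoveries = 9
		warmLifts   = 9
		lifted      = 11
		neutral     = 36
		regressed   = 3
	)

	// The three quality buckets should account for every query.
	fmt.Println("buckets partition the run:", lifted+neutral+regressed == total)

	fmt.Printf("discovery rate:     %.0f%%\n", rate(discoveries, total))
	fmt.Printf("verbatim lift rate: %.0f%%\n", rate(warmLifts, discoveries))
	fmt.Printf("quality lift:       %.0f%%\n", rate(lifted, total))
	fmt.Printf("quality regressed:  %.0f%%\n", rate(regressed, total))
}
```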

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 1 Loader in Kansas City MO starting at 17:30 for Corner | w-4806 | 0/5 | — | w-4806 | 0 | no |
| 2 | Need 2 Assemblers in Cincinnati OH starting at 14:30 for Gre | w-4371 | 0/4 | — | w-4371 | 0 | no |
| 3 | Need 1 Forklift Operator in Lexington KY starting at 08:30 f | e-8263 | 1/5 | ✓ w-4636 | w-4636 | 0 | **YES** |
| 4 | Need 2 Assemblers in Flint MI starting at 08:30 for Centenni | e-7186 | 1/4 | ✓ e-9319 | e-9319 | 0 | **YES** |
| 5 | Need 2 Welders in Indianapolis IN starting at 10:00 for Nort | e-5834 | 0/4 | — | e-5834 | 0 | no |
| 6 | Need 2 Material Handlers in Cincinnati OH starting at 13:00 | e-4871 | 4/2 | — | e-4871 | 4 | no |
| 7 | Need 3 Pickers in Flint MI starting at 17:00 for Centennial | e-5571 | 0/2 | — | e-5571 | 0 | no |
| 8 | Need 3 Packers in Indianapolis IN starting at 09:00 for Heri | w-279 | 0/4 | — | w-279 | 0 | no |
| 9 | Need 3 Assemblers in Columbus OH starting at 17:30 for River | w-281 | 0/4 | — | w-281 | 0 | no |
| 10 | Need 5 Machine Operators in Cleveland OH starting at 14:30 f | e-8279 | 0/4 | — | e-8279 | 0 | no |
| 11 | Need 2 Assemblers in Grand Rapids MI starting at 13:00 for C | w-4502 | 0/3 | — | w-4502 | 0 | no |
| 12 | Need 2 Pickers in Akron OH starting at 10:30 for Summit Indu | e-5655 | 2/2 | — | e-5655 | 2 | no |
| 13 | Need 3 Quality Techs in Lexington KY starting at 12:30 for K | e-6369 | 0/4 | — | e-6369 | 0 | no |
| 14 | Need 4 Assemblers in Gary IN starting at 12:00 for Heritage | e-1315 | 0/2 | — | e-1315 | 0 | no |
| 15 | Need 3 Packers in Toledo OH starting at 16:00 for Cornerston | e-4887 | 1/2 | — | e-4887 | 1 | no |
| 16 | Need 4 Warehouse Associates in Fort Wayne IN starting at 13: | w-4434 | 0/4 | — | w-4434 | 0 | no |
| 17 | Need 4 Assemblers in Columbus OH starting at 13:00 for Midwa | w-281 | 0/4 | — | w-281 | 0 | no |
| 18 | Need 2 Shipping Clerks in Chicago IL starting at 17:00 for M | w-4504 | 1/4 | ✓ w-1522 | w-1522 | 0 | **YES** |
| 19 | Need 2 Machine Operators in Chicago IL starting at 11:00 for | e-1251 | 0/4 | — | e-1251 | 0 | no |
| 20 | Need 3 CNC Operators in Grand Rapids MI starting at 10:00 fo | w-792 | 0/3 | — | w-792 | 0 | no |
| 21 | Need 1 Warehouse Associate in Lexington KY starting at 09:30 | e-2331 | 0/4 | — | e-2331 | 0 | no |
| 22 | Need 2 Material Handlers in Gary IN starting at 11:00 for He | e-18 | 0/2 | — | e-18 | 0 | no |
| 23 | Need 5 Assemblers in Fort Wayne IN starting at 16:30 for Mid | e-6271 | 7/3 | — | e-6271 | 7 | no |
| 24 | Need 1 Loader in Cincinnati OH starting at 08:00 for Summit | e-8843 | 1/5 | ✓ w-4473 | w-4473 | 0 | **YES** |
| 25 | Need 3 Quality Techs in Columbus OH starting at 12:00 for Ri | w-281 | 0/2 | — | w-281 | 0 | no |
| 26 | Need 1 Machine Operator in Columbus OH starting at 09:30 for | w-4815 | 0/4 | — | w-4815 | 0 | no |
| 27 | Need 3 Machine Operators in Madison WI starting at 12:00 for | w-2027 | 0/4 | — | w-2027 | 0 | no |
| 28 | Need 2 Material Handlers in Kansas City MO starting at 11:30 | e-6774 | 0/3 | — | e-6774 | 0 | no |
| 29 | Need 3 Loaders in Flint MI starting at 16:00 for Parallel Ma | w-4875 | 0/2 | — | w-4875 | 0 | no |
| 30 | Need 2 Welders in Louisville KY starting at 13:00 for Horizo | w-2267 | 0/4 | — | w-2267 | 0 | no |
| 31 | Need 1 CNC Operator in Flint MI starting at 10:30 for Horizo | e-317 | 7/3 | — | e-317 | 7 | no |
| 32 | Need 1 Material Handler in Columbus OH starting at 15:30 for | e-8676 | 1/4 | ✓ w-2589 | w-2589 | 0 | **YES** |
| 33 | Need 2 Forklift Operators in Louisville KY starting at 14:30 | w-1830 | 0/4 | — | w-1830 | 0 | no |
| 34 | Need 2 Warehouse Associates in Chicago IL starting at 10:00 | w-4743 | 7/4 | ✓ e-9171 | e-9171 | 0 | **YES** |
| 35 | Need 2 Material Handlers in Gary IN starting at 15:00 for Pa | w-4236 | 1/2 | — | w-4236 | 1 | no |
| 36 | Need 1 Forklift Operator in Grand Rapids MI starting at 10:0 | w-3227 | 3/2 | — | w-3227 | 3 | no |
| 37 | Need 2 Pickers in Louisville KY starting at 12:30 for Corner | e-6489 | 2/4 | ✓ e-7622 | e-7622 | 0 | **YES** |
| 38 | Need 2 Loaders in Indianapolis IN starting at 17:30 for Midw | e-9877 | 0/4 | — | e-9877 | 0 | no |
| 39 | Need 1 Shipping Clerk in Indianapolis IN starting at 12:30 f | w-4635 | 0/5 | — | w-4635 | 0 | no |
| 40 | Need 2 Assemblers in Cincinnati OH starting at 08:00 for Key | w-4945 | 0/4 | — | w-4945 | 0 | no |
| 41 | Need 5 Quality Techs in Kansas City MO starting at 11:30 for | e-5633 | 0/4 | — | e-5633 | 0 | no |
| 42 | Need 2 Machine Operators in Gary IN starting at 10:00 for He | e-1089 | 0/2 | — | e-1089 | 0 | no |
| 43 | Need 1 Packer in Chicago IL starting at 09:30 for Midway Dis | e-7746 | 0/5 | — | w-279 | 1 | no |
| 44 | Need 2 Pickers in Lexington KY starting at 17:30 for Vanguar | e-3375 | 0/4 | — | e-3375 | 0 | no |
| 45 | Need 2 Maintenance Techs in Grand Rapids MI starting at 17:0 | e-6083 | 0/2 | — | e-6083 | 0 | no |
| 46 | Need 1 Material Handler in Detroit MI starting at 10:30 for | w-3286 | 0/5 | — | w-3286 | 0 | no |
| 47 | Need 1 Welder in Akron OH starting at 15:00 for Summit Indus | e-6149 | 0/2 | — | e-6149 | 0 | no |
| 48 | Need 1 Shipping Clerk in Cincinnati OH starting at 13:30 for | e-4218 | 3/5 | ✓ w-3488 | w-3488 | 0 | **YES** |
| 49 | Need 5 Packers in Indianapolis IN starting at 10:30 for Midw | e-2746 | 2/4 | ✓ w-279 | w-279 | 0 | **YES** |
| 50 | Need 1 Forklift Operator in Louisville KY starting at 10:30 | w-1830 | 0/4 | — | w-1830 | 0 | no |

---

## Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 3 | Need 1 Forklift Operator in Lexington KY | Vanguard Components requires a Forklift Operator in Lexingto | w-4636 | w-4636 | 0 | **YES** |
| 4 | Need 2 Assemblers in Flint MI starting a | Centennial Packaging requires 2 Assemblers to start at 08:30 | e-9319 | e-9319 | 0 | **YES** |
| 18 | Need 2 Shipping Clerks in Chicago IL sta | Looking for 2 Shipping Clerks in Chicago, IL to start at 5:0 | w-1522 | w-4504 | 1 | no |
| 24 | Need 1 Loader in Cincinnati OH starting | Summit Industrial requires 1 Loader position from 08:00 onwa | w-4473 | w-4473 | 0 | **YES** |
| 32 | Need 1 Material Handler in Columbus OH s | Looking for a Material Handler in Columbus, OH who can start | w-2589 | w-2589 | 0 | **YES** |
| 34 | Need 2 Warehouse Associates in Chicago I | Looking for 2 Warehouse Associates to work from 10:00 onward | e-9171 | e-9171 | 0 | **YES** |
| 37 | Need 2 Pickers in Louisville KY starting | Looking for 2 Pickers in Louisville, KY to start at 12:30 fo | e-7622 | e-6489 | 2 | no |
| 48 | Need 1 Shipping Clerk in Cincinnati OH s | Summit Industrial requires a Shipping Clerk in Cincinnati, O | w-3488 | w-3488 | 0 | **YES** |
| 49 | Need 5 Packers in Indianapolis IN starti | Looking for 5 packers in Indianapolis, IN to start at 10:30 | w-279 | e-2746 | 1 | no |
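
The two paraphrase figures in the Headline can be recomputed from the "Recorded rank" column. This is a minimal sketch: the `recovery` function is hypothetical, and treating a negative rank as "not found" is an assumed convention that the report does not define.

```go
package main

import "fmt"

// recovery counts how many recorded answers reappear at rank 0 and how many
// reappear anywhere in the top k. Ranks are 0-based; a negative rank is an
// ASSUMED "not found" marker.
func recovery(ranks []int, k int) (top1, inTopK int) {
	for _, r := range ranks {
		if r < 0 || r >= k {
			continue // recorded answer missing from the top-k results
		}
		inTopK++
		if r == 0 {
			top1++
		}
	}
	return top1, inTopK
}

func main() {
	// "Recorded rank" column for queries 3, 4, 18, 24, 32, 34, 37, 48, 49.
	ranks := []int{0, 0, 1, 0, 0, 0, 2, 0, 1}
	top1, inTopK := recovery(ranks, 10) // K per pass is 10 in this run
	fmt.Printf("top-1: %d / %d\n", top1, len(ranks))
	fmt.Printf("any rank in top-K: %d / %d\n", inTopK, len(ranks))
}
```

With the ranks from the table this yields 6 / 9 at top-1 and 9 / 9 anywhere in the top-K, matching the Headline rows.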

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. At score 1.0, lift requires the
   judge-best result's pre-boost distance to be less than 2× the cold top-1's
   distance; otherwise even halving doesn't promote it. Tight clusters → little
   visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case: same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare the verbatim and paraphrase
   lift rates. Paraphrase should be lower (semantic-distance gates some
   playbook hits), but any non-zero rate is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers`. If all
   judge-best results land in one corpus, the matrix layer's purpose isn't
   being tested. Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
   env JUDGE_MODEL=qwen2.5:latest.
   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.
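
Caveat 2's boost arithmetic can be made concrete with a small worked example. This sketch applies the quoted formula and checks whether a boosted judge-best result would overtake the cold top-1. It assumes the cold top-1 itself receives no boost, the distances are made-up illustration values, and `boosted` and `promotes` are hypothetical names rather than harness functions.

```go
package main

import "fmt"

// boosted applies the playbook formula quoted in caveat 2:
// distance' = distance * (1 - 0.5 * score).
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether the boosted judge-best result would overtake the
// cold top-1. ASSUMPTION: the top-1 itself receives no boost.
func promotes(bestDist, top1Dist, score float64) bool {
	return boosted(bestDist, score) < top1Dist
}

func main() {
	const top1 = 0.40 // made-up cold top-1 distance

	// 0.70 is under 2 x 0.40, so halving it wins; 0.90 is over, so it does not.
	for _, best := range []float64{0.70, 0.90} {
		fmt.Printf("best=%.2f boosted=%.2f promoted=%v\n",
			best, boosted(best, 1.0), promotes(best, top1, 1.0))
	}
}
```

A score below 1.0 shrinks the window further: at score 0.5 the multiplier is 0.75, so the judge-best result must start within roughly 1.33× of the top-1 distance.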
## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why (judge variance, distance gap too
  wide, or playbook math too gentle). The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
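
The first two branches above can be written as a small decision helper. This is only a sketch: `nextMove` and its return strings are hypothetical, and the list gives no guidance for lift rates from 20% up to 50% or a numeric threshold for a "low" discovery rate, so those cases are left explicitly unspecified rather than guessed.

```go
package main

import "fmt"

// nextMove maps the verbatim lift rate (lifts / discoveries) onto the
// branches listed under "Next moves". Integer math avoids float rounding.
func nextMove(lifts, discoveries int) string {
	switch {
	case discoveries == 0:
		return "no discoveries" // lift rate is undefined without discoveries
	case lifts*100 >= 50*discoveries:
		return "proceed" // lift rate >= 50%
	case lifts*100 < 20*discoveries:
		return "investigate" // lift rate < 20%
	default:
		return "unspecified" // 20% to 50%: the list gives no rule
	}
}

func main() {
	// This run: 9 warm-pass lifts out of 9 cold-pass discoveries.
	fmt.Println(nextMove(9, 9))
}
```

For this run (9 of 9) the helper lands on the first branch.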