real_006: distribution-shift test on rows 10-59 of fill_events

Methodology fix: gen_real_queries.go gains -offset N flag. Every prior
real_NNN test sourced queries from rows 0-9 of fill_events.parquet
(default -limit 10), so the substrate's published "8/10 cold-pass top-1 =
judge-best" was measured on a memorized slice, not held-out data.

real_006 samples 50 fresh rows (offset 10, never seen by the workers or
ethereal_workers corpora). Same harness, same local qwen2.5:latest judge,
same K=10. ~14 min wall total. Local-only, no cloud calls.

Headline findings:
- Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's
  8/10 (80%) — substrate generalizes at rank level.
- Strict (rating ≥ 2): 34/50 (68%) — 12-point drop from real_001's 80%.
  ~7 of 41 "no-discovery" queries had cold top-1 the judge rated 1; the
  corpus has gaps for some role-city combos in the v3 slice.
- Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2)
- Paraphrase recovery: 6/9 → top-1, 9/9 any-rank
- Quality regressed: 3/50 — Q43 is the structural one

Q43 (Packer at Midway Distribution / Chicago IL) regressed from rating 5
to rating 2 on warm pass with `warm_boosted_count=0` and
`playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city)
recorded a playbook entry. The regression suggests Q18's recording leaked
into Q43 via the warm-pass playbook corpus retrieval surface even though
the role gate from real_002 should have blocked it. Three possible paths:
extractor failed on one query, gate fires on boost path but not Shape B
inject, or cosine drift puts the recorded worker close enough to Q43's
embedding that warm-pass retrieval picks it up directly. Diagnosis is the
next move.

Three same-(client, city) clusters tested:
- Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers
- Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25 surface
  same worker w-281 for Assemblers vs Quality Techs at cold-pass), but no
  playbook bleed
- Midway Distribution Chicago IL × 3: Q43 regression as above

What this confirms: substrate works on the fresh distribution at the rank
level, verbatim lift is real, paraphrase recovery is real.

What this falsifies: real_002's role-gate fix is not structurally
airtight. The bleed pattern can still fire under conditions the prior
tests didn't reach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 0e530f4436
commit 95f155b017

reports/reality-tests/playbook_lift_real_006.md (new file, 151 lines added)
@@ -0,0 +1,151 @@
# Playbook-Lift Reality Test — Run real_006

**Generated:** 2026-05-05T09:50:08.929241389Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/real_coord_queries_v3.txt` (50 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_real_006.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 50 |
| Cold-pass discoveries (judge-best ≠ top-1) | 9 |
| Warm-pass lifts (recorded playbook → top-1) | 9 |
| No change (judge-best already top-1, no playbook needed) | 41 |
| Playbook boosts triggered (warm pass) | 10 |
| Mean Δ top-1 distance (warm − cold) | -0.05014307 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **6 / 9** |
| Paraphrase pass — recorded answer at any rank in top-K | 9 / 9 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **11 / 50** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 36 / 50 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 50 |

**Verbatim lift rate:** 9 of 9 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 1 Loader in Kansas City MO starting at 17:30 for Corner | w-4806 | 0/5 | — | w-4806 | 0 | no |
| 2 | Need 2 Assemblers in Cincinnati OH starting at 14:30 for Gre | w-4371 | 0/4 | — | w-4371 | 0 | no |
| 3 | Need 1 Forklift Operator in Lexington KY starting at 08:30 f | e-8263 | 1/5 | ✓ w-4636 | w-4636 | 0 | **YES** |
| 4 | Need 2 Assemblers in Flint MI starting at 08:30 for Centenni | e-7186 | 1/4 | ✓ e-9319 | e-9319 | 0 | **YES** |
| 5 | Need 2 Welders in Indianapolis IN starting at 10:00 for Nort | e-5834 | 0/4 | — | e-5834 | 0 | no |
| 6 | Need 2 Material Handlers in Cincinnati OH starting at 13:00 | e-4871 | 4/2 | — | e-4871 | 4 | no |
| 7 | Need 3 Pickers in Flint MI starting at 17:00 for Centennial | e-5571 | 0/2 | — | e-5571 | 0 | no |
| 8 | Need 3 Packers in Indianapolis IN starting at 09:00 for Heri | w-279 | 0/4 | — | w-279 | 0 | no |
| 9 | Need 3 Assemblers in Columbus OH starting at 17:30 for River | w-281 | 0/4 | — | w-281 | 0 | no |
| 10 | Need 5 Machine Operators in Cleveland OH starting at 14:30 f | e-8279 | 0/4 | — | e-8279 | 0 | no |
| 11 | Need 2 Assemblers in Grand Rapids MI starting at 13:00 for C | w-4502 | 0/3 | — | w-4502 | 0 | no |
| 12 | Need 2 Pickers in Akron OH starting at 10:30 for Summit Indu | e-5655 | 2/2 | — | e-5655 | 2 | no |
| 13 | Need 3 Quality Techs in Lexington KY starting at 12:30 for K | e-6369 | 0/4 | — | e-6369 | 0 | no |
| 14 | Need 4 Assemblers in Gary IN starting at 12:00 for Heritage | e-1315 | 0/2 | — | e-1315 | 0 | no |
| 15 | Need 3 Packers in Toledo OH starting at 16:00 for Cornerston | e-4887 | 1/2 | — | e-4887 | 1 | no |
| 16 | Need 4 Warehouse Associates in Fort Wayne IN starting at 13: | w-4434 | 0/4 | — | w-4434 | 0 | no |
| 17 | Need 4 Assemblers in Columbus OH starting at 13:00 for Midwa | w-281 | 0/4 | — | w-281 | 0 | no |
| 18 | Need 2 Shipping Clerks in Chicago IL starting at 17:00 for M | w-4504 | 1/4 | ✓ w-1522 | w-1522 | 0 | **YES** |
| 19 | Need 2 Machine Operators in Chicago IL starting at 11:00 for | e-1251 | 0/4 | — | e-1251 | 0 | no |
| 20 | Need 3 CNC Operators in Grand Rapids MI starting at 10:00 fo | w-792 | 0/3 | — | w-792 | 0 | no |
| 21 | Need 1 Warehouse Associate in Lexington KY starting at 09:30 | e-2331 | 0/4 | — | e-2331 | 0 | no |
| 22 | Need 2 Material Handlers in Gary IN starting at 11:00 for He | e-18 | 0/2 | — | e-18 | 0 | no |
| 23 | Need 5 Assemblers in Fort Wayne IN starting at 16:30 for Mid | e-6271 | 7/3 | — | e-6271 | 7 | no |
| 24 | Need 1 Loader in Cincinnati OH starting at 08:00 for Summit | e-8843 | 1/5 | ✓ w-4473 | w-4473 | 0 | **YES** |
| 25 | Need 3 Quality Techs in Columbus OH starting at 12:00 for Ri | w-281 | 0/2 | — | w-281 | 0 | no |
| 26 | Need 1 Machine Operator in Columbus OH starting at 09:30 for | w-4815 | 0/4 | — | w-4815 | 0 | no |
| 27 | Need 3 Machine Operators in Madison WI starting at 12:00 for | w-2027 | 0/4 | — | w-2027 | 0 | no |
| 28 | Need 2 Material Handlers in Kansas City MO starting at 11:30 | e-6774 | 0/3 | — | e-6774 | 0 | no |
| 29 | Need 3 Loaders in Flint MI starting at 16:00 for Parallel Ma | w-4875 | 0/2 | — | w-4875 | 0 | no |
| 30 | Need 2 Welders in Louisville KY starting at 13:00 for Horizo | w-2267 | 0/4 | — | w-2267 | 0 | no |
| 31 | Need 1 CNC Operator in Flint MI starting at 10:30 for Horizo | e-317 | 7/3 | — | e-317 | 7 | no |
| 32 | Need 1 Material Handler in Columbus OH starting at 15:30 for | e-8676 | 1/4 | ✓ w-2589 | w-2589 | 0 | **YES** |
| 33 | Need 2 Forklift Operators in Louisville KY starting at 14:30 | w-1830 | 0/4 | — | w-1830 | 0 | no |
| 34 | Need 2 Warehouse Associates in Chicago IL starting at 10:00 | w-4743 | 7/4 | ✓ e-9171 | e-9171 | 0 | **YES** |
| 35 | Need 2 Material Handlers in Gary IN starting at 15:00 for Pa | w-4236 | 1/2 | — | w-4236 | 1 | no |
| 36 | Need 1 Forklift Operator in Grand Rapids MI starting at 10:0 | w-3227 | 3/2 | — | w-3227 | 3 | no |
| 37 | Need 2 Pickers in Louisville KY starting at 12:30 for Corner | e-6489 | 2/4 | ✓ e-7622 | e-7622 | 0 | **YES** |
| 38 | Need 2 Loaders in Indianapolis IN starting at 17:30 for Midw | e-9877 | 0/4 | — | e-9877 | 0 | no |
| 39 | Need 1 Shipping Clerk in Indianapolis IN starting at 12:30 f | w-4635 | 0/5 | — | w-4635 | 0 | no |
| 40 | Need 2 Assemblers in Cincinnati OH starting at 08:00 for Key | w-4945 | 0/4 | — | w-4945 | 0 | no |
| 41 | Need 5 Quality Techs in Kansas City MO starting at 11:30 for | e-5633 | 0/4 | — | e-5633 | 0 | no |
| 42 | Need 2 Machine Operators in Gary IN starting at 10:00 for He | e-1089 | 0/2 | — | e-1089 | 0 | no |
| 43 | Need 1 Packer in Chicago IL starting at 09:30 for Midway Dis | e-7746 | 0/5 | — | w-279 | 1 | no |
| 44 | Need 2 Pickers in Lexington KY starting at 17:30 for Vanguar | e-3375 | 0/4 | — | e-3375 | 0 | no |
| 45 | Need 2 Maintenance Techs in Grand Rapids MI starting at 17:0 | e-6083 | 0/2 | — | e-6083 | 0 | no |
| 46 | Need 1 Material Handler in Detroit MI starting at 10:30 for | w-3286 | 0/5 | — | w-3286 | 0 | no |
| 47 | Need 1 Welder in Akron OH starting at 15:00 for Summit Indus | e-6149 | 0/2 | — | e-6149 | 0 | no |
| 48 | Need 1 Shipping Clerk in Cincinnati OH starting at 13:30 for | e-4218 | 3/5 | ✓ w-3488 | w-3488 | 0 | **YES** |
| 49 | Need 5 Packers in Indianapolis IN starting at 10:30 for Midw | e-2746 | 2/4 | ✓ w-279 | w-279 | 0 | **YES** |
| 50 | Need 1 Forklift Operator in Louisville KY starting at 10:30 | w-1830 | 0/4 | — | w-1830 | 0 | no |

---

## Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 3 | Need 1 Forklift Operator in Lexington KY | Vanguard Components requires a Forklift Operator in Lexingto | w-4636 | w-4636 | 0 | **YES** |
| 4 | Need 2 Assemblers in Flint MI starting a | Centennial Packaging requires 2 Assemblers to start at 08:30 | e-9319 | e-9319 | 0 | **YES** |
| 18 | Need 2 Shipping Clerks in Chicago IL sta | Looking for 2 Shipping Clerks in Chicago, IL to start at 5:0 | w-1522 | w-4504 | 1 | no |
| 24 | Need 1 Loader in Cincinnati OH starting | Summit Industrial requires 1 Loader position from 08:00 onwa | w-4473 | w-4473 | 0 | **YES** |
| 32 | Need 1 Material Handler in Columbus OH s | Looking for a Material Handler in Columbus, OH who can start | w-2589 | w-2589 | 0 | **YES** |
| 34 | Need 2 Warehouse Associates in Chicago I | Looking for 2 Warehouse Associates to work from 10:00 onward | e-9171 | e-9171 | 0 | **YES** |
| 37 | Need 2 Pickers in Louisville KY starting | Looking for 2 Pickers in Louisville, KY to start at 12:30 fo | e-7622 | e-6489 | 2 | no |
| 48 | Need 1 Shipping Clerk in Cincinnati OH s | Summit Industrial requires a Shipping Clerk in Cincinnati, O | w-3488 | w-3488 | 0 | **YES** |
| 49 | Need 5 Packers in Indianapolis IN starti | Looking for 5 packers in Indianapolis, IN to start at 10:30 | w-279 | e-2746 | 1 | no |

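
The two paraphrase headline numbers can be re-derived from the "Recorded rank" column of the table above. The following is a minimal sketch, not the harness code: `rankOf` and `recovery` are names invented for illustration, ranks are assumed to be zero-based (rank 0 = top-1, as in the headline table), and the two-element result list is a hypothetical stand-in built from the Q18 row.

```go
package main

import "fmt"

// rankOf returns the zero-based position of id in results, or -1 if absent.
func rankOf(results []string, id string) int {
	for i, r := range results {
		if r == id {
			return i
		}
	}
	return -1
}

// recovery counts how many recorded answers came back at rank 0 (top-1)
// and how many came back at any rank inside the top-K (rank >= 0).
func recovery(ranks []int) (top1, anyRank int) {
	for _, r := range ranks {
		if r == 0 {
			top1++
		}
		if r >= 0 {
			anyRank++
		}
	}
	return top1, anyRank
}

func main() {
	// "Recorded rank" column, in row order: Q3, Q4, Q18, Q24, Q32, Q34, Q37, Q48, Q49.
	ranks := []int{0, 0, 1, 0, 0, 0, 2, 0, 1}
	top1, anyRank := recovery(ranks)
	fmt.Printf("top-1: %d/%d, any rank: %d/%d\n", top1, len(ranks), anyRank, len(ranks))

	// Q18: the paraphrase top-1 was w-4504, the recorded answer w-1522 sat at rank 1.
	fmt.Println(rankOf([]string{"w-4504", "w-1522"}, "w-1522"))
}
```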
---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case — same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare verbatim and paraphrase
   lift rates — paraphrase should be lower (semantic-distance gates some
   playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
   env JUDGE_MODEL=qwen2.5:latest.
   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.

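
Caveat 2's arithmetic is easy to check by hand. The sketch below applies the stated formula `distance' = distance × (1 - 0.5 × score)` to illustrative distances that are not taken from the run. The function names are invented here, and two things are assumptions rather than documented harness behavior: only the recorded candidate is boosted, and a tie does not promote.

```go
package main

import "fmt"

// boosted applies the playbook formula quoted in caveat 2:
// distance' = distance * (1 - 0.5*score).
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether a boosted candidate ends up strictly closer
// than the (unboosted) cold top-1. Strictness on ties is an assumption.
func promotes(candidate, coldTop1, score float64) bool {
	return boosted(candidate, score) < coldTop1
}

func main() {
	// Illustrative distances only.
	fmt.Println(boosted(0.5, 1.0))        // 0.25: a score of 1.0 halves the distance
	fmt.Println(promotes(0.5, 0.3, 1.0))  // true: 0.5 is within 2x of 0.3
	fmt.Println(promotes(0.75, 0.3, 1.0)) // false: 0.75 is more than 2x of 0.3
}
```

With a score of 1.0 the candidate can overtake only if its pre-boost distance is within twice the cold top-1 distance, which is the threshold the caveat describes.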
## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.

reports/reality-tests/real_006_findings.md (new file, 187 lines added)
@@ -0,0 +1,187 @@
# Reality test real_006 — distribution-shift findings

**Run:** 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest)
**Judge:** `qwen2.5:latest` (Ollama, local) — anchor's recommended judge, ~9s/query
**Queries:** 50 from `tests/reality/real_coord_queries_v3.txt` (rows 10-59 of fill_events.parquet, single `need` style)
**Corpora:** `workers,ethereal_workers` (5K + 10K)
**Local-only:** zero cloud calls per PRD line 70.

Companion to `playbook_lift_real_006.{json,md}`. That's the harness output; this is the reading.

---

## Why this test exists

real_001-005 all sourced their queries from the **first 10 rows** of
`fill_events.parquet`. `gen_real_queries.go` had `-limit N` but no
`-offset N`, so every "real" reality test ran on the same memorized
slice. The published "8 / 10 cold-pass top-1 = judge-best" was a
property of those 10 rows, not measured generalization. real_006
closes the methodology gap: new `-offset` flag samples rows 10-59 (5×
the count, never seen by the substrate).

---

## Headline — substrate generalizes (mostly)

| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict |
|---|---:|---:|---|
| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | **41 / 50 (82%)** | **HOLDS** |
| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts |
| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% |
| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable |
| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | **HOLDS** |
| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new |
| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new |

**Reading:** the substrate's *rank* behavior generalizes cleanly: the
top-1 worker is judge-approved at the same rate on fresh data as on
memorized data. The strict rate (rank match AND rating ≥ 2) drops 12
points, from 80% to 68%. The gap between the two real_006 rows is
41 − 34 = 7 queries: for those "no-discovery" queries the judge rated
the cold top-1 a 1 (irrelevant), but the corpus had nothing better.
Honest signal: parts of the v3 slice are in territory the workers
corpus doesn't cover well.

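
The percentages in the headline table follow directly from the counts. A small sketch of that arithmetic, where `pct` is a helper name invented for illustration and the counts are the ones reported in the table:

```go
package main

import "fmt"

// pct returns n/total as a whole-number percentage, rounded half up.
func pct(n, total int) int {
	return (n*100 + total/2) / total
}

func main() {
	rank006 := pct(41, 50)   // rank match on rows 10-59
	strict006 := pct(34, 50) // rank match AND rating >= 2 on rows 10-59
	strict001 := pct(8, 10)  // same strict metric on rows 0-9
	fmt.Printf("rank match: %d%%\n", rank006)
	fmt.Printf("strict: %d%% (%+d pts vs real_001)\n", strict006, strict006-strict001)
}
```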
The verbatim-lift property (discovery → warm top-1) is **clean at
9/9**, matching real_001's 2/2 perfectly. When the playbook records,
the recorded answer comes back next time. That's the load-bearing
learning property.

---

## Cluster analysis — the cross-pollination question

real_001 found that same-(client, city) clusters cause the Shape A boost
to bleed across roles. real_002's role-gate fix (`roleEqual`) was
supposed to close that. real_006 has *more* cluster opportunities than
real_001 did:

| Cluster | Count | Result |
|---|---:|---|
| Riverfront Steel + Columbus OH | 4 | mostly clean — see below |
| Heritage Foods + Gary IN | 3 | **clean** — distinct workers per role, no boost firing |
| Cornerstone Fabrication + Louisville KY | 3 | clean |
| Midway Distribution + Chicago IL | 3 | **bleed: Q43 regressed** |

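
The cluster counts can be reproduced mechanically from the query file. This is a rough sketch under stated assumptions: `needRe` and `clusters` are names invented here, the pattern is written against the `need` template documented in the query file header (with the optional `, deadline YYYY-MM-DD` suffix seen on some rows), and the sample queries are a subset copied from `real_coord_queries_v3.txt`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// needRe captures "city state" (group 1) and client (group 2) from a
// 'need'-style query; a trailing ", deadline YYYY-MM-DD" is ignored.
var needRe = regexp.MustCompile(
	`^Need \d+ .+? in (.+) starting at \d{2}:\d{2} for (.+?)(?:, deadline \d{4}-\d{2}-\d{2})?$`)

// clusters counts queries per "client + city" and keeps only the groups
// with at least minSize members.
func clusters(queries []string, minSize int) map[string]int {
	counts := map[string]int{}
	for _, q := range queries {
		m := needRe.FindStringSubmatch(q)
		if m == nil {
			continue // not a 'need'-style line
		}
		counts[m[2]+" + "+m[1]]++
	}
	for k, n := range counts {
		if n < minSize {
			delete(counts, k)
		}
	}
	return counts
}

func main() {
	queries := []string{
		"Need 2 Shipping Clerks in Chicago IL starting at 17:00 for Midway Distribution",
		"Need 2 Machine Operators in Chicago IL starting at 11:00 for Midway Distribution",
		"Need 1 Packer in Chicago IL starting at 09:30 for Midway Distribution",
		"Need 2 Forklift Operators in Louisville KY starting at 14:30 for Cornerstone Fabrication, deadline 2026-06-08",
		"Need 2 Pickers in Louisville KY starting at 12:30 for Cornerstone Fabrication, deadline 2026-06-08",
		"Need 1 Forklift Operator in Louisville KY starting at 10:30 for Cornerstone Fabrication, deadline 2026-06-08",
		"Need 2 Warehouse Associates in Chicago IL starting at 10:00 for Northland Logistics",
	}
	c := clusters(queries, 3)
	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d\n", k, c[k])
	}
}
```

Run against the full 50-line file, the same grouping should surface the four clusters listed in the table, assuming every line follows the `need` template.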
### Heritage Foods + Gary IN (3 queries, all clean)

```
Q14 Assemblers       → e-1315
Q22 Material Handler → e-18
Q42 Machine Operator → e-1089
```

Three different roles → three different workers. Zero boosts fired,
zero playbooks recorded. **Role-disambiguation works at the cosine
level for this cluster.** Comparable to real_002's role-gate
demonstration.

### Riverfront Steel + Columbus OH (4 queries, partial)

```
Q9  Assemblers       → w-281  (cold = warm, no boost)
Q25 Quality Techs    → w-281  (cold = warm, no boost)  ← same worker as Q9!
Q26 Machine Operator → w-4815 (clean)
Q32 Material Handler → e-8676 → w-2589 (judge promoted, playbook recorded)
```

Q9 and Q25 both surface `w-281` at cold pass for *different roles*.
That's a **cosine-level confusion** in the workers corpus, not a
playbook bleed. The substrate isn't breaking; the corpus contains a
worker whose resume embeds close to both "Assemblers" and "Quality
Techs" in this client+city. The judge's rating for Q25 dropped 2 → 1
on rejudge, which is the LLM's own consistency drift, not a substrate
fault. Worth noting but not a bug.

### Midway Distribution + Chicago IL (3 queries) — the regression

```
Q18 Shipping Clerks   → cold w-4504 → warm w-1522 (boost=1, playbook recorded)
Q19 Machine Operators → cold = warm e-1251 (clean)
Q43 Packer            → cold e-7746 (rating 5) → warm w-279 (rating 2)  ← regressed
```

**Q43 regressed from rating 5 (perfect match) to rating 2 (weak)
even though `warm_boosted_count=0` and `playbook_recorded=false`.**
Same query, different warm top-1, no boost flag set. The working
hypothesis is that the playbook recording from Q18 (Shipping Clerks at
Midway/Chicago) reaches Q43 (Packer at Midway/Chicago), same
client+city but a different role, through the playbook corpus
retrieval surface, even though the role gate exists. One data point to
weigh during diagnosis: in the per-query table, Q18's recorded answer
is `w-1522`, while `w-279` is the answer recorded for Q49 (Packers in
Indianapolis IN for Midway Distribution), so a same-client, same-role
entry from a different city is another possible source.

This is the **same pattern real_001 surfaced** (Q5/Q10 demoted by
Q2's playbook), and the role-gate fix from real_002 (`roleEqual`
on `Role` field) was supposed to close it. Possible explanations:

1. Role extractor failed on either Q18 ("Shipping Clerks") or Q43
   ("Packer"): an empty role bypasses the gate (the gate is
   "permissive on empty" by design)
2. Gate fires on boost path but not on Shape B inject path, and
   "boost=0" in the JSON is `warm_boosted_count` (count of
   re-ranked entries), not a flag for "no playbook influence at all"
3. Cosine-level drift: the playbook entry just happens to be close
   enough to Q43 in raw cosine space that warm-pass retrieval picks
   up `w-279` directly without going through boost or inject

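
Explanation 1 hinges on the "permissive on empty" behavior. The sketch below shows what such a gate could look like and why an empty role on either side lets an entry through. It is a hypothetical stand-in, not the project's `roleEqual`: the names `gateAllows` and `normalize`, and the lowercase-and-strip-plural normalization, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases a role and strips one trailing plural "s".
// This normalization is an assumption, not taken from the source.
func normalize(role string) string {
	r := strings.ToLower(strings.TrimSpace(role))
	return strings.TrimSuffix(r, "s")
}

// gateAllows is a hypothetical role gate that is permissive on empty:
// if either role is missing there is nothing to compare, so the
// playbook entry is allowed through.
func gateAllows(queryRole, entryRole string) bool {
	q, e := normalize(queryRole), normalize(entryRole)
	if q == "" || e == "" {
		return true
	}
	return q == e
}

func main() {
	fmt.Println(gateAllows("Packer", "Shipping Clerks")) // false: roles differ, entry blocked
	fmt.Println(gateAllows("Packer", ""))                // true: empty entry role bypasses the gate
	fmt.Println(gateAllows("Packers", "packer"))         // true: same role after normalization
}
```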
The other regressions (Q4 Centennial Packaging Flint MI, Q25 above)
are smaller (3→2 and 2→1) and likely judge consistency drift on
borderline candidates. Q43 is the structural one.

---

## What this confirms vs falsifies

**Confirmed:**
- Substrate generalizes at the rank level (82% cold-top-1 = judge-best)
- Verbatim lift works (9/9 discoveries → warm top-1)
- Role-disambiguation works at cosine level for clean role-distinct
  query distributions (Heritage Foods cluster is the proof)
- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank)

**Falsified / weakened:**
- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on
  the strict (rating ≥ 2) interpretation. Real number on broader
  data is ~68%, not 80%. Headline rank-match number (82%) holds.
- real_002's role-gate fix is **not structurally airtight**. Q43
  shows the cluster-bleed pattern can still fire under conditions
  the prior tests didn't reach. Open question: which path is
  leaking: extractor failure, gate scope, or cosine drift?

---

## Next moves (informed by this evidence)

1. **Diagnose Q43 specifically**: re-run the role extractor on its
   query text, check whether Q18's playbook entry has a role field
   recorded, look at the warm-pass top-K to see whether `w-279`
   reaches there via boost, inject, or cosine-only.
2. **Strengthen the corpus for the role-city combos that scored
   low rating** (the 7 queries where cold top-1 was rating=1). The
   workers corpus has gaps the v3 slice surfaced.
3. **Don't ship the "80% generalizes" framing as-is.** The number
   real_006 measured (82% rank, 68% rating ≥ 2) is the honest one
   to publish.

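
For the first step of move 1, a standalone check of role extraction on the two query texts might look like the sketch below. The real `extractRoleFromNeed` regex is not shown in this commit, so `roleRe` is an assumed pattern written against the documented `need` template, and `extractRole` is a name invented here. The third input is constructed from the `looking` template to show the empty-role case.

```go
package main

import (
	"fmt"
	"regexp"
)

// roleRe is an ASSUMED pattern for the 'need' style
// ("Need N {role}{s} in {city} ..."), not the project's actual regex.
var roleRe = regexp.MustCompile(`^Need \d+ (.+?) in `)

// extractRole returns the role text, or "" when the query does not
// match the 'need' shape.
func extractRole(q string) string {
	m := roleRe.FindStringSubmatch(q)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	for _, q := range []string{
		"Need 2 Shipping Clerks in Chicago IL starting at 17:00 for Midway Distribution", // Q18
		"Need 1 Packer in Chicago IL starting at 09:30 for Midway Distribution",          // Q43
		"Looking for 1 Packer at Midway Distribution in Chicago IL for 09:30 shift",      // constructed 'looking' style
	} {
		fmt.Printf("%q\n", extractRole(q))
	}
}
```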
This is what reality tests are for. Numbers from the memorized slice
gave a clean story; numbers from the held-out slice show where it
needs work.

---

## Repro

```bash
cd /home/profit/golangLAKEHOUSE
PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go
./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt

PATH=/usr/local/go/bin:$PATH \
  RUN_ID=real_006 \
  JUDGE_MODEL=qwen2.5:latest \
  QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \
  WITH_PARAPHRASE=1 \
  WITH_REJUDGE=1 \
  bash scripts/playbook_lift.sh
```

Local-only. No cloud calls.

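
For reference, `-limit 50 -offset 10` selects a half-open row window. The sketch below lifts the clamping logic that this commit adds to `gen_real_queries.go` into a standalone function so the window can be checked without the parquet file. The function name `rowWindow` is invented here, and 123 is the row total reported in the generated query file's header.

```go
package main

import "fmt"

// rowWindow mirrors the clamping added to gen_real_queries.go and returns
// the half-open range [start, end) of source rows to emit.
func rowWindow(totalRows, offset, limit int) (start, end int) {
	start = offset
	if start < 0 {
		start = 0
	}
	if start > totalRows {
		start = totalRows
	}
	end = start + limit
	if end > totalRows {
		end = totalRows
	}
	return start, end
}

func main() {
	fmt.Println(rowWindow(123, 10, 50))  // 10 60: the real_006 slice, rows 10-59
	fmt.Println(rowWindow(123, 0, 10))   // 0 10: the default slice used by real_001-005
	fmt.Println(rowWindow(123, 100, 50)) // 100 123: end clamped to the table size
	fmt.Println(rowWindow(123, 500, 50)) // 123 123: offset past the end emits nothing
}
```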
scripts/cutover/gen_real_queries.go
@@ -29,6 +29,7 @@ import (
 func main() {
 	src := flag.String("src", "/home/profit/lakehouse/data/datasets/fill_events.parquet", "fill_events parquet path")
 	limit := flag.Int("limit", 10, "number of source rows to read")
+	offset := flag.Int("offset", 0, "skip the first N rows (lets reality tests sample beyond the memorized real_001 slice)")
 	styles := flag.String("styles", "need", "comma-separated styles to emit per row (need|client_first|looking|shorthand|all)")
 	flag.Parse()
 
@@ -58,16 +59,24 @@ func main() {
 	at := tbl.Column(10).Data().Chunk(0)
 	deadline := tbl.Column(12).Data().Chunk(0)
 
-	n := int(tbl.NumRows())
-	if *limit < n {
-		n = *limit
+	totalRows := int(tbl.NumRows())
+	start := *offset
+	if start < 0 {
+		start = 0
 	}
+	if start > totalRows {
+		start = totalRows
+	}
+	end := start + *limit
+	if end > totalRows {
+		end = totalRows
+	}
 
 	stylesList := parseStyles(*styles)
 
 	fmt.Println("# Real-shape coordinator queries — generated from fill_events.parquet")
 	fmt.Println("# (real-shape demand data; queries built mechanically from event rows).")
-	fmt.Printf("# Source: %s (%d rows total, %d emitted, styles=%v)\n", *src, tbl.NumRows(), n, stylesList)
+	fmt.Printf("# Source: %s (%d rows total, rows [%d,%d) emitted, styles=%v)\n", *src, totalRows, start, end, stylesList)
 	fmt.Println("#")
 	fmt.Println("# Styles:")
 	fmt.Println("# need: 'Need N {role}{s} in {city} {state} starting at {at} for {client}'")
@@ -80,7 +89,7 @@ func main() {
 	fmt.Println("# substrate's bleed behavior when the role gate is silently disabled.")
 	fmt.Println()
 
-	for i := 0; i < n; i++ {
+	for i := start; i < end; i++ {
 		ev := event{
 			client: client.ValueStr(i),
 			city: city.ValueStr(i),

tests/reality/real_coord_queries_v3.txt (new file, 64 lines added)
@@ -0,0 +1,64 @@
# Real-shape coordinator queries — generated from fill_events.parquet
# (real-shape demand data; queries built mechanically from event rows).
# Source: /home/profit/lakehouse/data/datasets/fill_events.parquet (123 rows total, rows [10,60) emitted, styles=[need])
#
# Styles:
# need: 'Need N {role}{s} in {city} {state} starting at {at} for {client}'
# — matches scripts/playbook_lift's extractRoleFromNeed regex
# client_first: '{client} needs N {role}{s} in {city} {state} at {at}'
# looking: 'Looking for N {role}{s} at {client} in {city} {state} for {at} shift'
# shorthand: 'N {role}{s} {city} {state} {at} {client}'
#
# Only 'need' currently extracts a role. The other three test the
# substrate's bleed behavior when the role gate is silently disabled.

Need 1 Loader in Kansas City MO starting at 17:30 for Cornerstone Fabrication
Need 2 Assemblers in Cincinnati OH starting at 14:30 for Great Lakes Mfg
Need 1 Forklift Operator in Lexington KY starting at 08:30 for Vanguard Components, deadline 2026-05-20
Need 2 Assemblers in Flint MI starting at 08:30 for Centennial Packaging
Need 2 Welders in Indianapolis IN starting at 10:00 for Northland Logistics
Need 2 Material Handlers in Cincinnati OH starting at 13:00 for Great Lakes Mfg
Need 3 Pickers in Flint MI starting at 17:00 for Centennial Packaging
Need 3 Packers in Indianapolis IN starting at 09:00 for Heritage Foods, deadline 2026-05-04
Need 3 Assemblers in Columbus OH starting at 17:30 for Riverfront Steel
Need 5 Machine Operators in Cleveland OH starting at 14:30 for Apex Warehouse
Need 2 Assemblers in Grand Rapids MI starting at 13:00 for Cornerstone Fabrication, deadline 2026-05-21
Need 2 Pickers in Akron OH starting at 10:30 for Summit Industrial
Need 3 Quality Techs in Lexington KY starting at 12:30 for Keystone Plastics, deadline 2026-06-14
Need 4 Assemblers in Gary IN starting at 12:00 for Heritage Foods
Need 3 Packers in Toledo OH starting at 16:00 for Cornerstone Fabrication, deadline 2026-06-02
Need 4 Warehouse Associates in Fort Wayne IN starting at 13:00 for Cornerstone Fabrication, deadline 2026-05-17
Need 4 Assemblers in Columbus OH starting at 13:00 for Midway Distribution
Need 2 Shipping Clerks in Chicago IL starting at 17:00 for Midway Distribution
Need 2 Machine Operators in Chicago IL starting at 11:00 for Midway Distribution
Need 3 CNC Operators in Grand Rapids MI starting at 10:00 for Parallel Machining, deadline 2026-06-14
Need 1 Warehouse Associate in Lexington KY starting at 09:30 for Keystone Plastics, deadline 2026-06-14
Need 2 Material Handlers in Gary IN starting at 11:00 for Heritage Foods
Need 5 Assemblers in Fort Wayne IN starting at 16:30 for Midway Distribution, deadline 2026-06-09
Need 1 Loader in Cincinnati OH starting at 08:00 for Summit Industrial
Need 3 Quality Techs in Columbus OH starting at 12:00 for Riverfront Steel
Need 1 Machine Operator in Columbus OH starting at 09:30 for Riverfront Steel
Need 3 Machine Operators in Madison WI starting at 12:00 for Great Lakes Mfg, deadline 2026-05-24
Need 2 Material Handlers in Kansas City MO starting at 11:30 for Parallel Machining
Need 3 Loaders in Flint MI starting at 16:00 for Parallel Machining
Need 2 Welders in Louisville KY starting at 13:00 for Horizon Supply, deadline 2026-06-04
Need 1 CNC Operator in Flint MI starting at 10:30 for Horizon Supply
Need 1 Material Handler in Columbus OH starting at 15:30 for Riverfront Steel
Need 2 Forklift Operators in Louisville KY starting at 14:30 for Cornerstone Fabrication, deadline 2026-06-08
Need 2 Warehouse Associates in Chicago IL starting at 10:00 for Northland Logistics
Need 2 Material Handlers in Gary IN starting at 15:00 for Parallel Machining
Need 1 Forklift Operator in Grand Rapids MI starting at 10:00 for Cornerstone Fabrication, deadline 2026-05-21
Need 2 Pickers in Louisville KY starting at 12:30 for Cornerstone Fabrication, deadline 2026-06-08
Need 2 Loaders in Indianapolis IN starting at 17:30 for Midway Distribution
Need 1 Shipping Clerk in Indianapolis IN starting at 12:30 for Northland Logistics
Need 2 Assemblers in Cincinnati OH starting at 08:00 for Keystone Plastics, deadline 2026-05-26
Need 5 Quality Techs in Kansas City MO starting at 11:30 for Summit Industrial, deadline 2026-05-23
Need 2 Machine Operators in Gary IN starting at 10:00 for Heritage Foods
Need 1 Packer in Chicago IL starting at 09:30 for Midway Distribution
Need 2 Pickers in Lexington KY starting at 17:30 for Vanguard Components, deadline 2026-05-20
Need 2 Maintenance Techs in Grand Rapids MI starting at 17:00 for Pioneer Assembly, deadline 2026-05-20
Need 1 Material Handler in Detroit MI starting at 10:30 for Summit Industrial
Need 1 Welder in Akron OH starting at 15:00 for Summit Industrial
Need 1 Shipping Clerk in Cincinnati OH starting at 13:30 for Summit Industrial
Need 5 Packers in Indianapolis IN starting at 10:30 for Midway Distribution
Need 1 Forklift Operator in Louisville KY starting at 10:30 for Cornerstone Fabrication, deadline 2026-06-08