From 95f155b017ba9cbbf437c768b02a0cda61fd3faf Mon Sep 17 00:00:00 2001 From: root Date: Tue, 5 May 2026 04:54:03 -0500 Subject: [PATCH] real_006: distribution-shift test on rows 10-59 of fill_events MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Methodology fix: gen_real_queries.go gains -offset N flag. Every prior real_NNN test sourced queries from rows 0-9 of fill_events.parquet (default -limit 10), so the substrate's published "8/10 cold-pass top-1 = judge-best" was measured on a memorized slice, not held-out data. real_006 samples 50 fresh rows (offset 10, never seen by the workers or ethereal_workers corpora). Same harness, same local qwen2.5:latest judge, same K=10. ~14 min wall total. Local-only, no cloud calls. Headline findings: - Cold-pass top-1 = judge-best (rank match): 41/50 (82%) vs real_001's 8/10 (80%) — substrate generalizes at rank level. - Strict (rating ≥ 2): 34/50 (68%) — 12-point drop from real_001's 80%. ~7 of 41 "no-discovery" queries had cold top-1 the judge rated 1; the corpus has gaps for some role-city combos in the v3 slice. - Verbatim lift: 9/9 discoveries → warm top-1 (clean, matches real_001 2/2) - Paraphrase recovery: 6/9 → top-1, 9/9 any-rank - Quality regressed: 3/50 — Q43 is the structural one Q43 (Packer at Midway Distribution / Chicago IL) regressed from rating 5 to rating 2 on warm pass with `warm_boosted_count=0` and `playbook_recorded=false`. Q18 (Shipping Clerks at the same client+city) recorded a playbook entry. The regression suggests Q18's recording leaked into Q43 via the warm-pass playbook corpus retrieval surface even though the role gate from real_002 should have blocked it. Three possible paths: extractor failed on one query, gate fires on boost path but not Shape B inject, or cosine drift puts the recorded worker close enough to Q43's embedding that warm-pass retrieval picks it up directly. Diagnosis is the next move. Three same-(client, city) clusters tested: - Heritage Foods Gary IN × 3 distinct roles: clean, distinct workers - Riverfront Steel Columbus OH × 4: cosine-level confusion (Q9/Q25 surface same worker w-281 for Assemblers vs Quality Techs at cold- pass), but no playbook bleed - Midway Distribution Chicago IL × 3: Q43 regression as above What this confirms: substrate works on the fresh distribution at the rank level, verbatim lift is real, paraphrase recovery is real. What this falsifies: real_002's role-gate fix is not structurally airtight. The bleed pattern can still fire under conditions the prior tests didn't reach. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../reality-tests/playbook_lift_real_006.md | 151 ++++++++++++++ reports/reality-tests/real_006_findings.md | 187 ++++++++++++++++++ scripts/cutover/gen_real_queries.go | 19 +- tests/reality/real_coord_queries_v3.txt | 64 ++++++ 4 files changed, 416 insertions(+), 5 deletions(-) create mode 100644 reports/reality-tests/playbook_lift_real_006.md create mode 100644 reports/reality-tests/real_006_findings.md create mode 100644 tests/reality/real_coord_queries_v3.txt diff --git a/reports/reality-tests/playbook_lift_real_006.md b/reports/reality-tests/playbook_lift_real_006.md new file mode 100644 index 0000000..e2f5a2f --- /dev/null +++ b/reports/reality-tests/playbook_lift_real_006.md @@ -0,0 +1,151 @@ +# Playbook-Lift Reality Test — Run real_006 + +**Generated:** 2026-05-05T09:50:08.929241389Z +**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest) +**Corpora:** `workers,ethereal_workers` +**Workers limit:** 5000 +**Queries:** `tests/reality/real_coord_queries_v3.txt` (50 executed) +**K per pass:** 10 +**Paraphrase pass:** ENABLED +**Re-judge pass:** ENABLED +**Evidence:** `reports/reality-tests/playbook_lift_real_006.json` + +--- + +## Headline + +| Metric | Value | +|---|---:| +| Total queries run | 50 | +| Cold-pass discoveries (judge-best ≠ top-1) | 9 | +| Warm-pass lifts (recorded playbook → top-1) | 9 | +| No change (judge-best already top-1, no playbook needed) | 41 | +| Playbook boosts triggered (warm pass) | 10 | +| Mean Δ top-1 distance (warm − cold) | -0.05014307 | +| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **6 / 9** | +| Paraphrase pass — recorded answer at any rank in top-K | 9 / 9 | +| **Quality lift** (warm top-1 rating > cold top-1 rating) | **11 / 50** | +| Quality neutral (warm top-1 rating = cold top-1 rating) | 36 / 50 | +| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 50 | + +**Verbatim lift rate:** 9 of 9 discoveries became top-1 after warm pass. + +--- + +## Per-query results + +| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift | +|---|---|---|---|---|---|---|---| +| 1 | Need 1 Loader in Kansas City MO starting at 17:30 for Corner | w-4806 | 0/5 | — | w-4806 | 0 | no | +| 2 | Need 2 Assemblers in Cincinnati OH starting at 14:30 for Gre | w-4371 | 0/4 | — | w-4371 | 0 | no | +| 3 | Need 1 Forklift Operator in Lexington KY starting at 08:30 f | e-8263 | 1/5 | ✓ w-4636 | w-4636 | 0 | **YES** | +| 4 | Need 2 Assemblers in Flint MI starting at 08:30 for Centenni | e-7186 | 1/4 | ✓ e-9319 | e-9319 | 0 | **YES** | +| 5 | Need 2 Welders in Indianapolis IN starting at 10:00 for Nort | e-5834 | 0/4 | — | e-5834 | 0 | no | +| 6 | Need 2 Material Handlers in Cincinnati OH starting at 13:00 | e-4871 | 4/2 | — | e-4871 | 4 | no | +| 7 | Need 3 Pickers in Flint MI starting at 17:00 for Centennial | e-5571 | 0/2 | — | e-5571 | 0 | no | +| 8 | Need 3 Packers in Indianapolis IN starting at 09:00 for Heri | w-279 | 0/4 | — | w-279 | 0 | no | +| 9 | Need 3 Assemblers in Columbus OH starting at 17:30 for River | w-281 | 0/4 | — | w-281 | 0 | no | +| 10 | Need 5 Machine Operators in Cleveland OH starting at 14:30 f | e-8279 | 0/4 | — | e-8279 | 0 | no | +| 11 | Need 2 Assemblers in Grand Rapids MI starting at 13:00 for C | w-4502 | 0/3 | — | w-4502 | 0 | no | +| 12 | Need 2 Pickers in Akron OH starting at 10:30 for Summit Indu | e-5655 | 2/2 | — | e-5655 | 2 | no | +| 13 | Need 3 Quality Techs in Lexington KY starting at 12:30 for K | e-6369 | 0/4 | — | e-6369 | 0 | no | +| 14 | Need 4 Assemblers in Gary IN starting at 12:00 for Heritage | e-1315 | 0/2 | — | e-1315 | 0 | no | +| 15 | Need 3 Packers in Toledo OH starting at 16:00 for Cornerston | e-4887 | 1/2 | — | e-4887 | 1 | no | +| 16 | Need 4 Warehouse Associates in Fort Wayne IN starting at 13: | w-4434 | 0/4 | — | w-4434 | 0 | no | +| 17 | Need 4 Assemblers in Columbus OH starting at 13:00 for Midwa | w-281 | 0/4 | — | w-281 | 0 | no | +| 18 | Need 2 Shipping Clerks in Chicago IL starting at 17:00 for M | w-4504 | 1/4 | ✓ w-1522 | w-1522 | 0 | **YES** | +| 19 | Need 2 Machine Operators in Chicago IL starting at 11:00 for | e-1251 | 0/4 | — | e-1251 | 0 | no | +| 20 | Need 3 CNC Operators in Grand Rapids MI starting at 10:00 fo | w-792 | 0/3 | — | w-792 | 0 | no | +| 21 | Need 1 Warehouse Associate in Lexington KY starting at 09:30 | e-2331 | 0/4 | — | e-2331 | 0 | no | +| 22 | Need 2 Material Handlers in Gary IN starting at 11:00 for He | e-18 | 0/2 | — | e-18 | 0 | no | +| 23 | Need 5 Assemblers in Fort Wayne IN starting at 16:30 for Mid | e-6271 | 7/3 | — | e-6271 | 7 | no | +| 24 | Need 1 Loader in Cincinnati OH starting at 08:00 for Summit | e-8843 | 1/5 | ✓ w-4473 | w-4473 | 0 | **YES** | +| 25 | Need 3 Quality Techs in Columbus OH starting at 12:00 for Ri | w-281 | 0/2 | — | w-281 | 0 | no | +| 26 | Need 1 Machine Operator in Columbus OH starting at 09:30 for | w-4815 | 0/4 | — | w-4815 | 0 | no | +| 27 | Need 3 Machine Operators in Madison WI starting at 12:00 for | w-2027 | 0/4 | — | w-2027 | 0 | no | +| 28 | Need 2 Material Handlers in Kansas City MO starting at 11:30 | e-6774 | 0/3 | — | e-6774 | 0 | no | +| 29 | Need 3 Loaders in Flint MI starting at 16:00 for Parallel Ma | w-4875 | 0/2 | — | w-4875 | 0 | no | +| 30 | Need 2 Welders in Louisville KY starting at 13:00 for Horizo | w-2267 | 0/4 | — | w-2267 | 0 | no | +| 31 | Need 1 CNC Operator in Flint MI starting at 10:30 for Horizo | e-317 | 7/3 | — | e-317 | 7 | no | +| 32 | Need 1 Material Handler in Columbus OH starting at 15:30 for | e-8676 | 1/4 | ✓ w-2589 | w-2589 | 0 | **YES** | +| 33 | Need 2 Forklift Operators in Louisville KY starting at 14:30 | w-1830 | 0/4 | — | w-1830 | 0 | no | +| 34 | Need 2 Warehouse Associates in Chicago IL starting at 10:00 | w-4743 | 7/4 | ✓ e-9171 | e-9171 | 0 | **YES** | +| 35 | Need 2 Material Handlers in Gary IN starting at 15:00 for Pa | w-4236 | 1/2 | — | w-4236 | 1 | no | +| 36 | Need 1 Forklift Operator in Grand Rapids MI starting at 10:0 | w-3227 | 3/2 | — | w-3227 | 3 | no | +| 37 | Need 2 Pickers in Louisville KY starting at 12:30 for Corner | e-6489 | 2/4 | ✓ e-7622 | e-7622 | 0 | **YES** | +| 38 | Need 2 Loaders in Indianapolis IN starting at 17:30 for Midw | e-9877 | 0/4 | — | e-9877 | 0 | no | +| 39 | Need 1 Shipping Clerk in Indianapolis IN starting at 12:30 f | w-4635 | 0/5 | — | w-4635 | 0 | no | +| 40 | Need 2 Assemblers in Cincinnati OH starting at 08:00 for Key | w-4945 | 0/4 | — | w-4945 | 0 | no | +| 41 | Need 5 Quality Techs in Kansas City MO starting at 11:30 for | e-5633 | 0/4 | — | e-5633 | 0 | no | +| 42 | Need 2 Machine Operators in Gary IN starting at 10:00 for He | e-1089 | 0/2 | — | e-1089 | 0 | no | +| 43 | Need 1 Packer in Chicago IL starting at 09:30 for Midway Dis | e-7746 | 0/5 | — | w-279 | 1 | no | +| 44 | Need 2 Pickers in Lexington KY starting at 17:30 for Vanguar | e-3375 | 0/4 | — | e-3375 | 0 | no | +| 45 | Need 2 Maintenance Techs in Grand Rapids MI starting at 17:0 | e-6083 | 0/2 | — | e-6083 | 0 | no | +| 46 | Need 1 Material Handler in Detroit MI starting at 10:30 for | w-3286 | 0/5 | — | w-3286 | 0 | no | +| 47 | Need 1 Welder in Akron OH starting at 15:00 for Summit Indus | e-6149 | 0/2 | — | e-6149 | 0 | no | +| 48 | Need 1 Shipping Clerk in Cincinnati OH starting at 13:30 for | e-4218 | 3/5 | ✓ w-3488 | w-3488 | 0 | **YES** | +| 49 | Need 5 Packers in Indianapolis IN starting at 10:30 for Midw | e-2746 | 2/4 | ✓ w-279 | w-279 | 0 | **YES** | +| 50 | Need 1 Forklift Operator in Louisville KY starting at 10:30 | w-1830 | 0/4 | — | w-1830 | 0 | no | + +--- + +## Paraphrase pass — does the playbook help similar-but-different queries? + +For each query whose Pass 1 cold pass recorded a playbook entry, the +judge model rephrased the query, and the rephrased version was sent +through warm matrix.search. The recorded answer ID's rank in those +results tests whether cosine on the embedded paraphrase finds the +recorded query's vector. + +| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift | +|---|---|---|---|---|---|---| +| 3 | Need 1 Forklift Operator in Lexington KY | Vanguard Components requires a Forklift Operator in Lexingto | w-4636 | w-4636 | 0 | **YES** | +| 4 | Need 2 Assemblers in Flint MI starting a | Centennial Packaging requires 2 Assemblers to start at 08:30 | e-9319 | e-9319 | 0 | **YES** | +| 18 | Need 2 Shipping Clerks in Chicago IL sta | Looking for 2 Shipping Clerks in Chicago, IL to start at 5:0 | w-1522 | w-4504 | 1 | no | +| 24 | Need 1 Loader in Cincinnati OH starting | Summit Industrial requires 1 Loader position from 08:00 onwa | w-4473 | w-4473 | 0 | **YES** | +| 32 | Need 1 Material Handler in Columbus OH s | Looking for a Material Handler in Columbus, OH who can start | w-2589 | w-2589 | 0 | **YES** | +| 34 | Need 2 Warehouse Associates in Chicago I | Looking for 2 Warehouse Associates to work from 10:00 onward | e-9171 | e-9171 | 0 | **YES** | +| 37 | Need 2 Pickers in Louisville KY starting | Looking for 2 Pickers in Louisville, KY to start at 12:30 fo | e-7622 | e-6489 | 2 | no | +| 48 | Need 1 Shipping Clerk in Cincinnati OH s | Summit Industrial requires a Shipping Clerk in Cincinnati, O | w-3488 | w-3488 | 0 | **YES** | +| 49 | Need 5 Packers in Indianapolis IN starti | Looking for 5 packers in Indianapolis, IN to start at 10:30 | w-279 | e-2746 | 1 | no | + +--- + +## Honesty caveats + +1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM + judge's verdict is what defines "best." If `qwen2.5:latest` rates badly, + the lift number is meaningless. To validate the judge itself, sample 5–10 + verdicts manually and check agreement. +2. **Score-1.0 boost = distance halved.** Playbook math is + `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best + result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise + even halving doesn't promote it. Tight clusters → little visible lift. +3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap + case — same query, recorded playbook, expected boost. The paraphrase + pass (when enabled) is the actual learning property: similar-but-different + queries hitting a recorded playbook. Compare verbatim and paraphrase + lift rates — paraphrase should be lower (semantic-distance gates some + playbook hits) but non-zero is the meaningful signal. +4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best + results land in one corpus, the matrix layer's purpose isn't being tested. + Check per-corpus distribution in the JSON. +5. **Judge resolution.** This run used `qwen2.5:latest` from + env JUDGE_MODEL=qwen2.5:latest. + Bumping the judge for run #N+1 means editing one line in lakehouse.toml. +6. **Paraphrase generation also uses the judge.** The same model that rates + relevance also rephrases queries. A judge that's bad at rating staffing + queries is probably also bad at rephrasing them. Worth sanity-checking + a sample of `paraphrase_query` values in the JSON before trusting the + paraphrase lift number. + +## Next moves + +- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real + work. Move to paraphrase queries + tag-based boost (currently ignored). +- If lift rate < 20%: investigate why — judge variance, distance gap too + wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need + retuning. +- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is + already close to optimal on this query distribution. Either the corpus + is too narrow or the queries are too easy. diff --git a/reports/reality-tests/real_006_findings.md b/reports/reality-tests/real_006_findings.md new file mode 100644 index 0000000..d15a0f7 --- /dev/null +++ b/reports/reality-tests/real_006_findings.md @@ -0,0 +1,187 @@ +# Reality test real_006 — distribution-shift findings + +**Run:** 2026-05-05 04:41:46 → 04:50:08 CDT (8m22s driver wall, ~14 min including ingest) +**Judge:** `qwen2.5:latest` (Ollama, local) — anchor's recommended judge, ~9s/query +**Queries:** 50 from `tests/reality/real_coord_queries_v3.txt` (rows 10-59 of fill_events.parquet, single `need` style) +**Corpora:** `workers,ethereal_workers` (5K + 10K) +**Local-only:** zero cloud calls per PRD line 70. + +Companion to `playbook_lift_real_006.{json,md}`. That's the harness output; this is the reading. + +--- + +## Why this test exists + +real_001-005 all sourced their queries from the **first 10 rows** of +`fill_events.parquet`. `gen_real_queries.go` had `-limit N` but no +`-offset N`, so every "real" reality test ran on the same memorized +slice. The published "8 / 10 cold-pass top-1 = judge-best" was a +property of those 10 rows, not measured generalization. real_006 +closes the methodology gap: new `-offset` flag samples rows 10-59 (5× +the count, never seen by the substrate). + +--- + +## Headline — substrate generalizes (mostly) + +| Metric | real_001 (10 queries, rows 0-9) | real_006 (50 queries, rows 10-59) | Verdict | +|---|---:|---:|---| +| Cold-pass top-1 = judge-best (rank match) | 8 / 10 (80%) | **41 / 50 (82%)** | **HOLDS** | +| Cold-pass top-1 = judge-best AND rating ≥ 2 | 8 / 10 (80%) | 34 / 50 (68%) | -12 pts | +| Mean cold top-1 judge rating | ~3.3 | 3.08 | -7% | +| Discoveries (judge promoted non-top-1) | 2 / 10 | 9 / 50 (18%) | comparable | +| Verbatim lift (discovery → warm top-1) | 2 / 2 (100%) | 9 / 9 (100%) | **HOLDS** | +| Paraphrase recovery → top-1 | n/a (disabled) | 6 / 9 (67%) | new | +| Quality regressed on rejudge | 0 (test absent) | 3 / 50 (6%) | new | + +**Reading:** the substrate's *rank* behavior generalizes cleanly — the +top-1 worker is judge-approved at the same rate on fresh data as on +memorized data. The *quality* of top-1 (rating ≥ 2) drops 12 points, +which means 7 of the 41 "no-discovery" queries had cold top-1 the +judge rated 1 (irrelevant) but the corpus had nothing better. Honest +signal: parts of the v3 slice are in territory the workers corpus +doesn't cover well. + +The verbatim-lift property (discovery → warm top-1) is **clean at +9/9**, matching real_001's 2/2 perfectly. When the playbook records, +the recorded answer comes back next time. That's the load-bearing +learning property. + +--- + +## Cluster analysis — the cross-pollination question + +real_001 found that same-(client, city) clusters cause Shape A boost +to bleed across roles. Real_002's role-gate fix (`roleEqual`) was +supposed to close that. real_006 has *more* cluster opportunities than +real_001 did: + +| Cluster | Count | Result | +|---|---:|---| +| Riverfront Steel + Columbus OH | 4 | mostly clean — see below | +| Heritage Foods + Gary IN | 3 | **clean** — distinct workers per role, no boost firing | +| Cornerstone Fabrication + Louisville KY | 3 | clean | +| Midway Distribution + Chicago IL | 3 | **bleed: Q43 regressed** | + +### Heritage Foods + Gary IN (3 queries, all clean) + +``` +Q14 Assemblers → e-1315 +Q22 Material Handler → e-18 +Q42 Machine Operator → e-1089 +``` + +Three different roles → three different workers. Zero boosts fired, +zero playbooks recorded. **Role-disambiguation works at the cosine +level for this cluster.** Comparable to real_002's role-gate +demonstration. + +### Riverfront Steel + Columbus OH (4 queries, partial) + +``` +Q9 Assemblers → w-281 (cold = warm, no boost) +Q25 Quality Techs → w-281 (cold = warm, no boost) ← same worker as Q9! +Q26 Machine Operator → w-4815 (clean) +Q32 Material Handler → e-8676 → w-2589 (judge promoted, playbook recorded) +``` + +Q9 and Q25 both surface `w-281` cold-pass for *different roles* — +that's a **cosine-level confusion** in the workers corpus, not a +playbook bleed. The substrate isn't breaking; the corpus contains a +worker whose resume embeds close to both "Assemblers" and "Quality +Techs" in this client+city. Judge-rating Q25 dropped 2 → 1 on +rejudge, which is the LLM's own consistency drift, not a substrate +fault. Worth noting but not a bug. + +### Midway Distribution + Chicago IL (3 queries) — the regression + +``` +Q18 Shipping Clerks → cold w-4504 → warm w-1522 (boost=1, playbook recorded) +Q19 Machine Operators → cold = warm e-1251 (clean) +Q43 Packer → cold e-7746 (rating 5) → warm w-279 (rating 2) ← regressed +``` + +**Q43 regressed from rating 5 (perfect match) to rating 2 (weak) +even though `warm_boosted_count=0` and `playbook_recorded=false`.** +Same query, different warm top-1, no boost flag set. The playbook +recording from Q18 (Shipping Clerks at Midway/Chicago) reaches Q43 +(Packer at Midway/Chicago) — same client+city, different role — +through the playbook corpus retrieval surface, even though the role +gate exists. + +This is the **same pattern real_001 surfaced** (Q5/Q10 demoted by +Q2's playbook), and the role-gate fix from real_002 (`roleEqual` +on `Role` field) was supposed to close it. Possible explanations: + +1. Role extractor failed on either Q18 ("Shipping Clerks") or Q43 + ("Packer") — leaving an empty role bypasses the gate (gate is + "permissive on empty" by design) +2. Gate fires on boost path but not on Shape B inject path — and + "boost=0" in the JSON is `warm_boosted_count` (count of + re-ranked entries), not a flag for "no playbook influence at all" +3. Cosine-level drift: the playbook entry just happens to be close + enough to Q43 in raw cosine space that warm-pass retrieval picks + up `w-279` directly without going through boost or inject + +The other regressions (Q4 Centennial Packaging Flint MI, Q25 above) +are smaller (3→2 and 2→1) and likely judge consistency drift on +borderline candidates. Q43 is the structural one. + +--- + +## What this confirms vs falsifies + +**Confirmed:** +- Substrate generalizes at the rank level (82% cold-top-1 = judge-best) +- Verbatim lift works (9/9 discoveries → warm top-1) +- Role-disambiguation works at cosine level for clean role-distinct + query distributions (Heritage Foods cluster is the proof) +- Paraphrase recovery is real (6/9 → top-1, 9/9 any-rank) + +**Falsified / weakened:** +- "8/10 cold-pass top-1 = judge-best" was 12 points optimistic on + the strict (rating ≥ 2) interpretation. Real number on broader + data is ~68%, not 80%. Headline rank-match number (82%) holds. +- Real_002's role-gate fix is **not structurally airtight**. Q43 + shows the cluster-bleed pattern can still fire under conditions + the prior tests didn't reach. Open question: which path is + leaking — extractor failure, gate scope, or cosine drift? + +--- + +## Next moves (informed by this evidence) + +1. **Diagnose Q43 specifically**: re-run the role extractor on its + query text, check whether Q18's playbook entry has a role field + recorded, look at the warm-pass top-K to see whether `w-279` + reaches there via boost, inject, or cosine-only. +2. **Strengthen the corpus for the role-city combos that scored + low rating** (the 7 queries where cold top-1 was rating=1). The + workers corpus has gaps the v3 slice surfaced. +3. **Don't ship the "80% generalizes" framing as-is.** The number + real_006 measured (82% rank, 68% rating ≥ 2) is the honest one + to publish. + +This is what reality tests are for. Numbers from the memorized slice +gave a clean story; numbers from the held-out slice show where it +needs work. + +--- + +## Repro + +```bash +cd /home/profit/golangLAKEHOUSE +PATH=/usr/local/go/bin:$PATH go build -o bin/gen_real_queries ./scripts/cutover/gen_real_queries.go +./bin/gen_real_queries -limit 50 -offset 10 -styles need > tests/reality/real_coord_queries_v3.txt + +PATH=/usr/local/go/bin:$PATH \ + RUN_ID=real_006 \ + JUDGE_MODEL=qwen2.5:latest \ + QUERIES_FILE=tests/reality/real_coord_queries_v3.txt \ + WITH_PARAPHRASE=1 \ + WITH_REJUDGE=1 \ + bash scripts/playbook_lift.sh +``` + +Local-only. No cloud calls. diff --git a/scripts/cutover/gen_real_queries.go b/scripts/cutover/gen_real_queries.go index 7f6d781..6eb54fc 100644 --- a/scripts/cutover/gen_real_queries.go +++ b/scripts/cutover/gen_real_queries.go @@ -29,6 +29,7 @@ import ( func main() { src := flag.String("src", "/home/profit/lakehouse/data/datasets/fill_events.parquet", "fill_events parquet path") limit := flag.Int("limit", 10, "number of source rows to read") + offset := flag.Int("offset", 0, "skip the first N rows (lets reality tests sample beyond the memorized real_001 slice)") styles := flag.String("styles", "need", "comma-separated styles to emit per row (need|client_first|looking|shorthand|all)") flag.Parse() @@ -58,16 +59,24 @@ func main() { at := tbl.Column(10).Data().Chunk(0) deadline := tbl.Column(12).Data().Chunk(0) - n := int(tbl.NumRows()) - if *limit < n { - n = *limit + totalRows := int(tbl.NumRows()) + start := *offset + if start < 0 { + start = 0 + } + if start > totalRows { + start = totalRows + } + end := start + *limit + if end > totalRows { + end = totalRows } stylesList := parseStyles(*styles) fmt.Println("# Real-shape coordinator queries — generated from fill_events.parquet") fmt.Println("# (real-shape demand data; queries built mechanically from event rows).") - fmt.Printf("# Source: %s (%d rows total, %d emitted, styles=%v)\n", *src, tbl.NumRows(), n, stylesList) + fmt.Printf("# Source: %s (%d rows total, rows [%d,%d) emitted, styles=%v)\n", *src, totalRows, start, end, stylesList) fmt.Println("#") fmt.Println("# Styles:") fmt.Println("# need: 'Need N {role}{s} in {city} {state} starting at {at} for {client}'") @@ -80,7 +89,7 @@ func main() { fmt.Println("# substrate's bleed behavior when the role gate is silently disabled.") fmt.Println() - for i := 0; i < n; i++ { + for i := start; i < end; i++ { ev := event{ client: client.ValueStr(i), city: city.ValueStr(i), diff --git a/tests/reality/real_coord_queries_v3.txt b/tests/reality/real_coord_queries_v3.txt new file mode 100644 index 0000000..c303c41 --- /dev/null +++ b/tests/reality/real_coord_queries_v3.txt @@ -0,0 +1,64 @@ +# Real-shape coordinator queries — generated from fill_events.parquet +# (real-shape demand data; queries built mechanically from event rows). +# Source: /home/profit/lakehouse/data/datasets/fill_events.parquet (123 rows total, rows [10,60) emitted, styles=[need]) +# +# Styles: +# need: 'Need N {role}{s} in {city} {state} starting at {at} for {client}' +# — matches scripts/playbook_lift's extractRoleFromNeed regex +# client_first: '{client} needs N {role}{s} in {city} {state} at {at}' +# looking: 'Looking for N {role}{s} at {client} in {city} {state} for {at} shift' +# shorthand: 'N {role}{s} {city} {state} {at} {client}' +# +# Only 'need' currently extracts a role. The other three test the +# substrate's bleed behavior when the role gate is silently disabled. + +Need 1 Loader in Kansas City MO starting at 17:30 for Cornerstone Fabrication +Need 2 Assemblers in Cincinnati OH starting at 14:30 for Great Lakes Mfg +Need 1 Forklift Operator in Lexington KY starting at 08:30 for Vanguard Components, deadline 2026-05-20 +Need 2 Assemblers in Flint MI starting at 08:30 for Centennial Packaging +Need 2 Welders in Indianapolis IN starting at 10:00 for Northland Logistics +Need 2 Material Handlers in Cincinnati OH starting at 13:00 for Great Lakes Mfg +Need 3 Pickers in Flint MI starting at 17:00 for Centennial Packaging +Need 3 Packers in Indianapolis IN starting at 09:00 for Heritage Foods, deadline 2026-05-04 +Need 3 Assemblers in Columbus OH starting at 17:30 for Riverfront Steel +Need 5 Machine Operators in Cleveland OH starting at 14:30 for Apex Warehouse +Need 2 Assemblers in Grand Rapids MI starting at 13:00 for Cornerstone Fabrication, deadline 2026-05-21 +Need 2 Pickers in Akron OH starting at 10:30 for Summit Industrial +Need 3 Quality Techs in Lexington KY starting at 12:30 for Keystone Plastics, deadline 2026-06-14 +Need 4 Assemblers in Gary IN starting at 12:00 for Heritage Foods +Need 3 Packers in Toledo OH starting at 16:00 for Cornerstone Fabrication, deadline 2026-06-02 +Need 4 Warehouse Associates in Fort Wayne IN starting at 13:00 for Cornerstone Fabrication, deadline 2026-05-17 +Need 4 Assemblers in Columbus OH starting at 13:00 for Midway Distribution +Need 2 Shipping Clerks in Chicago IL starting at 17:00 for Midway Distribution +Need 2 Machine Operators in Chicago IL starting at 11:00 for Midway Distribution +Need 3 CNC Operators in Grand Rapids MI starting at 10:00 for Parallel Machining, deadline 2026-06-14 +Need 1 Warehouse Associate in Lexington KY starting at 09:30 for Keystone Plastics, deadline 2026-06-14 +Need 2 Material Handlers in Gary IN starting at 11:00 for Heritage Foods +Need 5 Assemblers in Fort Wayne IN starting at 16:30 for Midway Distribution, deadline 2026-06-09 +Need 1 Loader in Cincinnati OH starting at 08:00 for Summit Industrial +Need 3 Quality Techs in Columbus OH starting at 12:00 for Riverfront Steel +Need 1 Machine Operator in Columbus OH starting at 09:30 for Riverfront Steel +Need 3 Machine Operators in Madison WI starting at 12:00 for Great Lakes Mfg, deadline 2026-05-24 +Need 2 Material Handlers in Kansas City MO starting at 11:30 for Parallel Machining +Need 3 Loaders in Flint MI starting at 16:00 for Parallel Machining +Need 2 Welders in Louisville KY starting at 13:00 for Horizon Supply, deadline 2026-06-04 +Need 1 CNC Operator in Flint MI starting at 10:30 for Horizon Supply +Need 1 Material Handler in Columbus OH starting at 15:30 for Riverfront Steel +Need 2 Forklift Operators in Louisville KY starting at 14:30 for Cornerstone Fabrication, deadline 2026-06-08 +Need 2 Warehouse Associates in Chicago IL starting at 10:00 for Northland Logistics +Need 2 Material Handlers in Gary IN starting at 15:00 for Parallel Machining +Need 1 Forklift Operator in Grand Rapids MI starting at 10:00 for Cornerstone Fabrication, deadline 2026-05-21 +Need 2 Pickers in Louisville KY starting at 12:30 for Cornerstone Fabrication, deadline 2026-06-08 +Need 2 Loaders in Indianapolis IN starting at 17:30 for Midway Distribution +Need 1 Shipping Clerk in Indianapolis IN starting at 12:30 for Northland Logistics +Need 2 Assemblers in Cincinnati OH starting at 08:00 for Keystone Plastics, deadline 2026-05-26 +Need 5 Quality Techs in Kansas City MO starting at 11:30 for Summit Industrial, deadline 2026-05-23 +Need 2 Machine Operators in Gary IN starting at 10:00 for Heritage Foods +Need 1 Packer in Chicago IL starting at 09:30 for Midway Distribution +Need 2 Pickers in Lexington KY starting at 17:30 for Vanguard Components, deadline 2026-05-20 +Need 2 Maintenance Techs in Grand Rapids MI starting at 17:00 for Pioneer Assembly, deadline 2026-05-20 +Need 1 Material Handler in Detroit MI starting at 10:30 for Summit Industrial +Need 1 Welder in Akron OH starting at 15:00 for Summit Industrial +Need 1 Shipping Clerk in Cincinnati OH starting at 13:30 for Summit Industrial +Need 5 Packers in Indianapolis IN starting at 10:30 for Midway Distribution +Need 1 Forklift Operator in Louisville KY starting at 10:30 for Cornerstone Fabrication, deadline 2026-06-08