From 3263254f1cc3c29dbf3fdc0585780d2c6868943b Mon Sep 17 00:00:00 2001 From: root Date: Thu, 30 Apr 2026 21:42:02 -0500 Subject: [PATCH] reality_test real_003: 40-query paraphrase stress + extractor extension MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Stress-tests the role gate with 40 queries (10 fill_events rows × 4 styles): need, client_first, looking, shorthand. Each row's role + client + city stays the same; only the surface phrasing changes. real_003 (original extractor) confirmed the shorthand-vs-shorthand failure mode: CNC Operator shorthand recording leaked w-2404 onto Forklift Operator shorthand query within the same Beacon Freight Detroit cluster. Both record + query had empty role (extractor returns "" for shorthand because there's no separator between role and city), gate disabled, distance check passed, bleed fired. Fix: extended extractRoleFromNeed to handle client_first ("{client} needs N {role} in...") and looking ("Looking for N {role} at...") patterns. Shorthand left intentionally unmatched — "Forklift Operator Detroit" is shape-indistinguishable from "Forklift" + "Operator Detroit" without an LLM extractor or known-cities lookup. real_003b (extended extractor) verifies bleed closed across all 4 styles for this dataset. Forklift Operator queries keep w-2136 (the cold-pass-correct match) regardless of which style the query came in. Same-role boosts now fire correctly across styles — a CNC Operator recording made in `looking` style boosts the CNC need-form query. scripts/cutover/gen_real_queries.go: added -styles flag with values need|client_first|looking|shorthand|all (default need preserves real_001/002 behavior). tests/reality/real_coord_queries_v2.txt is the 40-query stress file. scripts/playbook_lift/main_test.go: 10 sub-tests lock the four documented patterns + shorthand limitation + lift-suite-style queries (no clean role, returns empty as expected). 
Aggregate metrics: - real_003 (original): disc=7, lift=7, boost=14, meanΔ=-0.108 - real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202 The growth reflects more LEGITIMATE same-role same-cluster transfer firing across styles, not bleed (verified by per-cluster bleed table — Forklift Operator queries unchanged across all 4 styles). Known limitation documented in real_003_findings.md: same-cluster, same-role queries in shorthand still embed close enough that a shorthand recording could bleed onto a different-role shorthand query if both record + query strip role. Closing this requires LLM extraction or known-cities lookup at record + query time. Co-Authored-By: Claude Opus 4.7 (1M context) --- STATE_OF_PLAY.md | 1 + .../reality-tests/playbook_lift_real_003.md | 116 ++++++++++++++++ .../reality-tests/playbook_lift_real_003b.md | 116 ++++++++++++++++ reports/reality-tests/real_003_findings.md | 126 ++++++++++++++++++ scripts/cutover/gen_real_queries.go | 105 ++++++++++++--- scripts/playbook_lift/main.go | 56 ++++++-- scripts/playbook_lift/main_test.go | 76 +++++++++++ tests/reality/real_coord_queries_v2.txt | 54 ++++++++ 8 files changed, 616 insertions(+), 34 deletions(-) create mode 100644 reports/reality-tests/playbook_lift_real_003.md create mode 100644 reports/reality-tests/playbook_lift_real_003b.md create mode 100644 reports/reality-tests/real_003_findings.md create mode 100644 scripts/playbook_lift/main_test.go create mode 100644 tests/reality/real_coord_queries_v2.txt diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md index 1de0d9d..ac916bd 100644 --- a/STATE_OF_PLAY.md +++ b/STATE_OF_PLAY.md @@ -263,6 +263,7 @@ The list is intentionally short. Items move to closed when the work demands them | (prep) | G5 cutover prep: `embed_parity` probe — Rust `/ai/embed` ↔ Go `/v1/embed` 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in `reports/cutover/SUMMARY.md`. 
Wire-format remap (`embeddings`/`vectors`, `dimensions`/`dimension`) is the only real cutover work; math is provably equivalent. | | (probe) | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Findings: `reports/reality-tests/real_001_findings.md`. | | (fix) | Cross-role gate: `Role` on `PlaybookEntry`, `QueryRole` on `SearchRequest`, gate fires in both `ApplyPlaybookBoost` + `InjectPlaybookMisses`. `roleEqual` handles case + plural. Backward-compat: empty role on either side = gate disabled (preserves lift suite + free-form callers). 5 new unit tests use exact real_001 distance + role values. Re-run real_002: bleed closed (Q#5 Pickers, Q#10 CNC Operator stay at cold-pass top-1; same-role lifts still fire). Closes OPEN #1. Findings: `reports/reality-tests/real_002_findings.md`. | +| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. | Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds). 
diff --git a/reports/reality-tests/playbook_lift_real_003.md b/reports/reality-tests/playbook_lift_real_003.md new file mode 100644 index 0000000..652ef13 --- /dev/null +++ b/reports/reality-tests/playbook_lift_real_003.md @@ -0,0 +1,116 @@ +# Playbook-Lift Reality Test — Run real_003 + +**Generated:** 2026-05-01T02:27:31.394679694Z +**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge) +**Corpora:** `workers,ethereal_workers` +**Workers limit:** 5000 +**Queries:** `tests/reality/real_coord_queries_v2.txt` (40 executed) +**K per pass:** 10 +**Paraphrase pass:** disabled +**Re-judge pass:** disabled +**Evidence:** `reports/reality-tests/playbook_lift_real_003.json` + +--- + +## Headline + +| Metric | Value | +|---|---:| +| Total queries run | 40 | +| Cold-pass discoveries (judge-best ≠ top-1) | 7 | +| Warm-pass lifts (recorded playbook → top-1) | 7 | +| No change (judge-best already top-1, no playbook needed) | 33 | +| Playbook boosts triggered (warm pass) | 14 | +| Mean Δ top-1 distance (warm − cold) | -0.10771029 | + + + +**Verbatim lift rate:** 7 of 7 discoveries became top-1 after warm pass. + +--- + +## Per-query results + +| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? 
| Warm top-1 | Judge-best warm rank | Lift | +|---|---|---|---|---|---|---|---| +| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | e-9573 | 0/4 | — | e-9573 | 0 | no | +| 2 | Parallel Machining needs 5 Warehouse Associates in Kansas Ci | e-7538 | 0/4 | — | e-7538 | 0 | no | +| 3 | Looking for 5 Warehouse Associates at Parallel Machining in | e-9573 | 0/4 | — | e-9573 | 0 | no | +| 4 | 5 Warehouse Associates Kansas City MO 09:00 Parallel Machini | e-9573 | 0/4 | — | e-9573 | 0 | no | +| 5 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 6 | Beacon Freight needs 1 Forklift Operator in Detroit MI at 15 | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 7 | Looking for 1 Forklift Operator at Beacon Freight in Detroit | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 8 | 1 Forklift Operator Detroit MI 15:00 Beacon Freight | w-4766 | 0/5 | — | w-2404 | 1 | no | +| 9 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-2820 | 4/4 | ✓ e-4769 | e-4769 | 0 | **YES** | +| 10 | Midway Distribution needs 4 Loaders in Indianapolis IN at 12 | e-6419 | 4/4 | ✓ e-4769 | e-4769 | 0 | **YES** | +| 11 | Looking for 4 Loaders at Midway Distribution in Indianapolis | e-2820 | 1/2 | — | e-4769 | 2 | no | +| 12 | 4 Loaders Indianapolis IN 12:00 Midway Distribution | e-2820 | 6/5 | ✓ e-4769 | e-4769 | 0 | **YES** | +| 13 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | e-9237 | 0/4 | — | w-565 | 1 | no | +| 14 | Cornerstone Fabrication needs 3 Warehouse Associates in Fort | e-9237 | 4/4 | ✓ w-565 | w-565 | 0 | **YES** | +| 15 | Looking for 3 Warehouse Associates at Cornerstone Fabricatio | e-9237 | 3/4 | ✓ w-565 | w-565 | 0 | **YES** | +| 16 | 3 Warehouse Associates Fort Wayne IN 17:30 Cornerstone Fabri | e-9237 | 0/4 | — | w-565 | 1 | no | +| 17 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | w-2136 | 0/2 | — | w-2136 | 0 | no | +| 18 | Beacon Freight needs 4 Pickers in Detroit MI at 
13:30 | w-2136 | 0/2 | — | w-2136 | 0 | no | +| 19 | Looking for 4 Pickers at Beacon Freight in Detroit MI for 13 | e-7948 | 0/1 | — | e-7948 | 0 | no | +| 20 | 4 Pickers Detroit MI 13:30 Beacon Freight | e-7948 | 3/2 | — | e-7948 | 3 | no | +| 21 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-9191 | 0/2 | — | e-9191 | 0 | no | +| 22 | Parallel Machining needs 2 Packers in Joliet IL at 09:30 | e-9191 | 7/3 | — | e-9191 | 7 | no | +| 23 | Looking for 2 Packers at Parallel Machining in Joliet IL for | e-9191 | 0/2 | — | e-9191 | 0 | no | +| 24 | 2 Packers Joliet IL 09:30 Parallel Machining | e-9191 | 6/3 | — | e-9191 | 6 | no | +| 25 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-2582 | 4/3 | — | w-2582 | 4 | no | +| 26 | Heritage Foods needs 3 Assemblers in Flint MI at 08:30 | w-2582 | 0/2 | — | w-2582 | 0 | no | +| 27 | Looking for 3 Assemblers at Heritage Foods in Flint MI for 0 | w-4817 | 0/2 | — | w-4817 | 0 | no | +| 28 | 3 Assemblers Flint MI 08:30 Heritage Foods | w-4124 | 2/2 | — | w-4124 | 2 | no | +| 29 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-6019 | 0/1 | — | e-6019 | 0 | no | +| 30 | Parallel Machining needs 3 Packers in Flint MI at 12:30 | e-6019 | 4/2 | — | e-6019 | 4 | no | +| 31 | Looking for 3 Packers at Parallel Machining in Flint MI for | e-6019 | 0/1 | — | e-6019 | 0 | no | +| 32 | 3 Packers Flint MI 12:30 Parallel Machining | e-6019 | 0/2 | — | e-6019 | 0 | no | +| 33 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 1/3 | — | w-122 | 2 | no | +| 34 | Pioneer Assembly needs 1 Shipping Clerk in Flint MI at 17:00 | w-3988 | 1/3 | — | w-122 | 2 | no | +| 35 | Looking for 1 Shipping Clerk at Pioneer Assembly in Flint MI | w-3988 | 2/3 | — | w-122 | 0 | no | +| 36 | 1 Shipping Clerk Flint MI 17:00 Pioneer Assembly | w-2564 | 2/4 | ✓ w-122 | w-122 | 0 | **YES** | +| 37 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | w-2136 | 6/3 | — | w-2404 | 7 
| no | +| 38 | Beacon Freight needs 1 CNC Operator in Detroit MI at 17:30 | w-2404 | 0/5 | — | w-2404 | 0 | no | +| 39 | Looking for 1 CNC Operator at Beacon Freight in Detroit MI f | e-9958 | 1/2 | — | w-2404 | 2 | no | +| 40 | 1 CNC Operator Detroit MI 17:30 Beacon Freight | e-5546 | 2/5 | ✓ w-2404 | w-2404 | 0 | **YES** | + +--- + +## Honesty caveats + +1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM + judge's verdict is what defines "best." If the judge rates badly, + the lift number is meaningless. To validate the judge itself, sample 5–10 + verdicts manually and check agreement. +2. **Score-1.0 boost = distance halved.** Playbook math is + `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best + result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise + even halving doesn't promote it. Tight clusters → little visible lift. +3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap + case — same query, recorded playbook, expected boost. The paraphrase + pass (when enabled) is the actual learning property: similar-but-different + queries hitting a recorded playbook. Compare verbatim and paraphrase + lift rates — paraphrase should be lower (semantic-distance gates some + playbook hits) but non-zero is the meaningful signal. +4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best + results land in one corpus, the matrix layer's purpose isn't being tested. + Check per-corpus distribution in the JSON. +5. **Judge resolution.** This run used `qwen2.5:latest` from + config [models].local_judge. + Bumping the judge for run #N+1 means editing one line in lakehouse.toml. +6. **Paraphrase generation also uses the judge.** The same model that rates + relevance also rephrases queries. A judge that's bad at rating staffing + queries is probably also bad at rephrasing them. 
Worth sanity-checking + a sample of `paraphrase_query` values in the JSON before trusting the + paraphrase lift number. + +## Next moves + +- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real + work. Move to paraphrase queries + tag-based boost (currently ignored). +- If lift rate < 20%: investigate why — judge variance, distance gap too + wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need + retuning. +- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is + already close to optimal on this query distribution. Either the corpus + is too narrow or the queries are too easy. diff --git a/reports/reality-tests/playbook_lift_real_003b.md b/reports/reality-tests/playbook_lift_real_003b.md new file mode 100644 index 0000000..4d19182 --- /dev/null +++ b/reports/reality-tests/playbook_lift_real_003b.md @@ -0,0 +1,116 @@ +# Playbook-Lift Reality Test — Run real_003b + +**Generated:** 2026-05-01T02:38:56.283100116Z +**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge) +**Corpora:** `workers,ethereal_workers` +**Workers limit:** 5000 +**Queries:** `tests/reality/real_coord_queries_v2.txt` (40 executed) +**K per pass:** 10 +**Paraphrase pass:** disabled +**Re-judge pass:** disabled +**Evidence:** `reports/reality-tests/playbook_lift_real_003b.json` + +--- + +## Headline + +| Metric | Value | +|---|---:| +| Total queries run | 40 | +| Cold-pass discoveries (judge-best ≠ top-1) | 11 | +| Warm-pass lifts (recorded playbook → top-1) | 10 | +| No change (judge-best already top-1, no playbook needed) | 30 | +| Playbook boosts triggered (warm pass) | 31 | +| Mean Δ top-1 distance (warm − cold) | -0.20235376 | + + + +**Verbatim lift rate:** 10 of 11 discoveries became top-1 after warm pass. + +--- + +## Per-query results + +| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? 
| Warm top-1 | Judge-best warm rank | Lift | +|---|---|---|---|---|---|---|---| +| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | e-7863 | 0/4 | — | e-7863 | 0 | no | +| 2 | Parallel Machining needs 5 Warehouse Associates in Kansas Ci | e-8089 | 1/4 | ✓ e-7863 | e-7863 | 0 | **YES** | +| 3 | Looking for 5 Warehouse Associates at Parallel Machining in | e-7538 | 0/4 | — | e-7863 | 1 | no | +| 4 | 5 Warehouse Associates Kansas City MO 09:00 Parallel Machini | e-7538 | 0/4 | — | e-7863 | 1 | no | +| 5 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 6 | Beacon Freight needs 1 Forklift Operator in Detroit MI at 15 | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 7 | Looking for 1 Forklift Operator at Beacon Freight in Detroit | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 8 | 1 Forklift Operator Detroit MI 15:00 Beacon Freight | w-2136 | 0/5 | — | w-2136 | 0 | no | +| 9 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | w-2742 | 1/4 | ✓ w-4397 | w-4397 | 0 | **YES** | +| 10 | Midway Distribution needs 4 Loaders in Indianapolis IN at 12 | w-2742 | 2/5 | ✓ w-4397 | w-4397 | 0 | **YES** | +| 11 | Looking for 4 Loaders at Midway Distribution in Indianapolis | w-2742 | 2/4 | ✓ w-4397 | w-4397 | 0 | **YES** | +| 12 | 4 Loaders Indianapolis IN 12:00 Midway Distribution | w-2742 | 1/5 | ✓ w-4397 | w-4397 | 0 | **YES** | +| 13 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | w-3370 | 0/4 | — | w-1398 | 1 | no | +| 14 | Cornerstone Fabrication needs 3 Warehouse Associates in Fort | w-3370 | 0/4 | — | w-1398 | 1 | no | +| 15 | Looking for 3 Warehouse Associates at Cornerstone Fabricatio | w-1784 | 1/4 | ✓ w-1398 | w-1398 | 0 | **YES** | +| 16 | 3 Warehouse Associates Fort Wayne IN 17:30 Cornerstone Fabri | e-8661 | 0/4 | — | w-1398 | 1 | no | +| 17 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | e-7644 | 0/2 | — | w-1367 | 1 | no | +| 18 | Beacon Freight needs 4 Pickers 
in Detroit MI at 13:30 | e-7644 | 0/2 | — | w-1367 | 1 | no | +| 19 | Looking for 4 Pickers at Beacon Freight in Detroit MI for 13 | e-438 | 2/3 | — | w-1367 | 3 | no | +| 20 | 4 Pickers Detroit MI 13:30 Beacon Freight | e-7644 | 8/4 | ✓ w-1367 | w-1367 | 0 | **YES** | +| 21 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-846 | 8/3 | — | e-2120 | 0 | no | +| 22 | Parallel Machining needs 2 Packers in Joliet IL at 09:30 | e-846 | 9/4 | ✓ e-2120 | e-2120 | 0 | **YES** | +| 23 | Looking for 2 Packers at Parallel Machining in Joliet IL for | e-846 | 1/2 | — | e-2120 | 2 | no | +| 24 | 2 Packers Joliet IL 09:30 Parallel Machining | e-7105 | 4/3 | — | e-2120 | 0 | no | +| 25 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-2582 | 0/2 | — | w-2582 | 0 | no | +| 26 | Heritage Foods needs 3 Assemblers in Flint MI at 08:30 | w-2582 | 0/2 | — | w-2582 | 0 | no | +| 27 | Looking for 3 Assemblers at Heritage Foods in Flint MI for 0 | w-4817 | 0/2 | — | w-4817 | 0 | no | +| 28 | 3 Assemblers Flint MI 08:30 Heritage Foods | w-4124 | 1/2 | — | w-4124 | 1 | no | +| 29 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-6019 | 0/1 | — | e-2120 | 1 | no | +| 30 | Parallel Machining needs 3 Packers in Flint MI at 12:30 | e-6019 | 0/1 | — | e-2120 | 1 | no | +| 31 | Looking for 3 Packers at Parallel Machining in Flint MI for | e-6019 | 0/1 | — | e-2120 | 1 | no | +| 32 | 3 Packers Flint MI 12:30 Parallel Machining | e-6019 | 0/2 | — | e-2120 | 1 | no | +| 33 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 3/4 | ✓ w-1367 | w-122 | 1 | no | +| 34 | Pioneer Assembly needs 1 Shipping Clerk in Flint MI at 17:00 | w-3988 | 1/3 | — | w-122 | 3 | no | +| 35 | Looking for 1 Shipping Clerk at Pioneer Assembly in Flint MI | w-3988 | 2/3 | — | w-122 | 0 | no | +| 36 | 1 Shipping Clerk Flint MI 17:00 Pioneer Assembly | w-2564 | 2/4 | ✓ w-122 | w-122 | 0 | **YES** | +| 37 | Need 1 CNC Operator in Detroit MI starting at 
17:30 for Beac | w-2404 | 0/5 | — | e-637 | 1 | no | +| 38 | Beacon Freight needs 1 CNC Operator in Detroit MI at 17:30 | w-2404 | 0/5 | — | e-637 | 1 | no | +| 39 | Looking for 1 CNC Operator at Beacon Freight in Detroit MI f | e-8106 | 1/4 | ✓ e-637 | e-637 | 0 | **YES** | +| 40 | 1 CNC Operator Detroit MI 17:30 Beacon Freight | w-2404 | 0/5 | — | e-637 | 1 | no | + +--- + +## Honesty caveats + +1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM + judge's verdict is what defines "best." If the judge rates badly, + the lift number is meaningless. To validate the judge itself, sample 5–10 + verdicts manually and check agreement. +2. **Score-1.0 boost = distance halved.** Playbook math is + `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best + result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise + even halving doesn't promote it. Tight clusters → little visible lift. +3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap + case — same query, recorded playbook, expected boost. The paraphrase + pass (when enabled) is the actual learning property: similar-but-different + queries hitting a recorded playbook. Compare verbatim and paraphrase + lift rates — paraphrase should be lower (semantic-distance gates some + playbook hits) but non-zero is the meaningful signal. +4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best + results land in one corpus, the matrix layer's purpose isn't being tested. + Check per-corpus distribution in the JSON. +5. **Judge resolution.** This run used `qwen2.5:latest` from + config [models].local_judge. + Bumping the judge for run #N+1 means editing one line in lakehouse.toml. +6. **Paraphrase generation also uses the judge.** The same model that rates + relevance also rephrases queries. A judge that's bad at rating staffing + queries is probably also bad at rephrasing them. 
Worth sanity-checking + a sample of `paraphrase_query` values in the JSON before trusting the + paraphrase lift number. + +## Next moves + +- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real + work. Move to paraphrase queries + tag-based boost (currently ignored). +- If lift rate < 20%: investigate why — judge variance, distance gap too + wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need + retuning. +- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is + already close to optimal on this query distribution. Either the corpus + is too narrow or the queries are too easy. diff --git a/reports/reality-tests/real_003_findings.md b/reports/reality-tests/real_003_findings.md new file mode 100644 index 0000000..a9fa24d --- /dev/null +++ b/reports/reality-tests/real_003_findings.md @@ -0,0 +1,126 @@ +# Reality test real_003 / real_003b — paraphrase stress + extractor extension + +40 queries (10 fill_events rows × 4 query styles) re-run twice: +- **real_003**: with the original `extractRoleFromNeed` regex (only + matches `^Need\s+\d+\s+\S+\s+in\s+` — the real_001 form) +- **real_003b**: with the extractor extended to also handle + `client_first` (`{client} needs N {role} in...`) and `looking` + (`Looking for N {role} at...`). Shorthand still falls through to + empty role. + +## Headline + +real_003 (original extractor): **shorthand-vs-shorthand bleed +confirmed**. The CNC Operator shorthand recording leaked `w-2404` +onto the Forklift Operator shorthand query within the same Beacon +Freight Detroit cluster — both record + query had empty role, gate +disabled, distance check passed, bleed fired. + +real_003b (extended extractor): **bleed closed for the queried side +across all 4 styles**. Forklift Operator queries keep `w-2136` (the +cold-pass-correct match) regardless of which style the query came +in. 
Same-role boosts now fire correctly across styles — a CNC +Operator recording made in `looking` style boosts the CNC need-form +query. + +## Per-style behavior — real_003 (original extractor) + +| Style | Bleed? | Explanation | +|---|---|---| +| `need` | none observed | Role extracted on both record + query side | +| `client_first` | none observed | Role NOT extracted, but no recording in this style happened to surface near a different-role query in real_003 | +| `looking` | none observed | Same as client_first | +| `shorthand` | **`w-2404` from CNC bled onto Forklift Operator** | Both record and query empty role → gate disabled → distance check passed | + +The single observed bleed case in real_003: +- Q#40 (`1 CNC Operator Detroit MI 17:30 Beacon Freight`) recorded + `w-2404` with `Role: ""` (extractor returned empty for shorthand). +- Q#8 (`1 Forklift Operator Detroit MI 15:00 Beacon Freight`) embedded + within ~0.137 cosine distance of Q#40 (same client + city + count + time + tokens dominate the embedding). +- `roleEqual("", "")` returned true (empty disables) → injection + fired → warm top-1 for Forklift Operator became `w-2404`. + +## Per-style behavior — real_003b (extended extractor) + +After adding patterns for `client_first` and `looking`: + +| Style | Role extracted? | Cross-role bleed observed? | +|---|---|---| +| `need` | yes | none | +| `client_first` | yes | none | +| `looking` | yes | none | +| `shorthand` | no | none in this dataset | + +No bleed observed in real_003b across any style. Pickers + CNC +Operator queries pick up their own role's recording across styles; +Forklift Operator queries keep the cold-pass-correct match. 
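The gate behavior behind the real_003 bleed case can be sketched in Go as follows. This is a minimal sketch under stated assumptions, not the code in this patch: `gateAllows` and `normalizeRole` are hypothetical names, and the trailing-"s" plural handling is a naive stand-in. The real `roleEqual` lives on the matrix side and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRole is a hypothetical stand-in for the case + plural handling
// that roleEqual is described as doing: trim, lowercase, drop one trailing "s".
func normalizeRole(role string) string {
	return strings.TrimSuffix(strings.ToLower(strings.TrimSpace(role)), "s")
}

// gateAllows (hypothetical name) models the lenient cross-role gate:
// an empty role on either side disables the gate, so the boost is allowed.
func gateAllows(recordRole, queryRole string) bool {
	if recordRole == "" || queryRole == "" {
		return true // gate disabled; only the distance check remains
	}
	return normalizeRole(recordRole) == normalizeRole(queryRole)
}

func main() {
	// Shorthand record vs shorthand query: both roles empty, gate disabled.
	fmt.Println(gateAllows("", ""))
	// Roles extracted on both sides: the cross-role boost is blocked.
	fmt.Println(gateAllows("CNC Operator", "Forklift Operator"))
	// Same role, different case and plural form: allowed.
	fmt.Println(gateAllows("Pickers", "picker"))
}
```

With both roles empty the only remaining defense is the distance check, which is the path the shorthand CNC Operator recording took onto the shorthand Forklift Operator query.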
+ +## Why the shorthand failure mode didn't fire in real_003b + +The extended extractor closes the bleed at the **query** side: for +`need`, `client_first`, and `looking`, the queryRole is now non-empty, +so a recording with a different non-empty role is rejected. A recording +with an empty role still passes, because the gate stays lenient whenever +either side is empty (the same semantic that let the real_003 bleed +through). + +So why isn't there a bleed? + +Two reasons real_003b's data didn't trigger one: + +1. **The Pickers shorthand recording** has `Role: ""`. Forklift Operator + queries embed > 0.20 cosine distance from "4 Pickers Detroit MI ..." because + "Pickers" vs "Forklift Operator" provides enough semantic separation + even within the same client+city cluster. The distance gate caught + what the role gate let through. +2. **No Forklift Operator recording** existed (judge said cold top-1 + was already correct, rating 5, no playbook needed). The + most-likely-to-bleed scenario — a Forklift recording in shorthand + leaking onto Pickers/CNC — didn't have ammunition. + +If a future dataset has multiple roles per cluster all hitting shorthand +recordings, the bleed could return. **Mitigation candidates** (none +implemented): + +- **LLM-based role extraction** at record + query time. Robust, slow. +- **Known-cities lookup table** to detect city boundary in shorthand + (`{role} {city}` separator). 50 US states + ~5000 cities = small + static table. Fast, brittle on new cities. +- **Strict gate semantics**: empty role on either side = REJECT + (instead of allow). Closes shorthand-vs-shorthand bleeds completely + but breaks lift-suite multi-constraint queries that have no clean + single role. 
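As an illustration of the known-cities candidate, here is a minimal Go sketch. Everything in it is an assumption for illustration: `extractShorthandRole`, `knownCities`, and `isDigits` are hypothetical names that do not exist in this patch, the six-city table stands in for a real static dataset, and the matching assumes ASCII city names.

```go
package main

import (
	"fmt"
	"strings"
)

// knownCities is a tiny illustrative table (hypothetical); a real table
// would be loaded from a static dataset of several thousand cities.
var knownCities = []string{
	"Kansas City", "Fort Wayne", "Indianapolis", "Detroit", "Joliet", "Flint",
}

// isDigits reports whether s is non-empty and made only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// extractShorthandRole (hypothetical) handles "N {role} {city} {state} ...".
// It returns the text between the count and the earliest known city,
// or "" when the count is missing or no known city appears.
func extractShorthandRole(query string) string {
	fields := strings.Fields(query)
	if len(fields) < 3 || !isDigits(fields[0]) {
		return ""
	}
	rest := strings.Join(fields[1:], " ")
	// Pad with spaces so a city only matches on whole-word boundaries.
	padded := " " + strings.ToLower(rest) + " "
	best := -1
	for _, city := range knownCities {
		// A hit at padded index p means the city starts at rest[p].
		p := strings.Index(padded, " "+strings.ToLower(city)+" ")
		if p > 0 && (best == -1 || p < best) {
			best = p
		}
	}
	if best == -1 {
		return ""
	}
	return strings.TrimSpace(rest[:best])
}

func main() {
	fmt.Println(extractShorthandRole("1 Forklift Operator Detroit MI 15:00 Beacon Freight"))
	fmt.Println(extractShorthandRole("5 Warehouse Associates Kansas City MO 09:00 Parallel Machining"))
	// A city missing from the table yields "", which leaves the gate disabled.
	fmt.Printf("%q\n", extractShorthandRole("3 Packers Springfield IL 12:30 Parallel Machining"))
}
```

The last call shows the brittleness noted in the list: a city missing from the table falls back to an empty role, so the gate is disabled again for that query.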
+ +## Aggregate metrics + +| Run | n | discoveries | lifts | boost_total | meanΔ | +|---|---:|---:|---:|---:|---:| +| real_003 (original extractor) | 40 | 7 | 7 | 14 | -0.108 | +| real_003b (extended extractor) | 40 | 11 | 10 | 31 | -0.202 | + +real_003b's higher discoveries + boost_total reflect the extractor +catching role on 3 of 4 styles instead of 1 of 4 — which means more +recordings land with a usable Role and more queries find them on warm +pass. The growth is *legitimate same-role same-cluster transfer*, not +bleed. + +`meanΔ` direction is style-dependent: real_002 shrank `meanΔ` (cross-role +bleeds removed → less over-boosting); real_003b grew it (more +legitimate deep boosts fire). The metric isn't a clean fingerprint +either direction — read the per-cluster bleed table for actual signal. + +## Repro + +```bash +# Generate 40-query stress file (10 rows × 4 styles) +go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt + +# Run with extended extractor (current main) +QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_003b \ + WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh +``` + +Evidence: +- `reports/reality-tests/playbook_lift_real_003.{json,md}` (original extractor) +- `reports/reality-tests/playbook_lift_real_003b.{json,md}` (extended) +- `tests/reality/real_coord_queries_v2.txt` (40 queries × 4 styles) diff --git a/scripts/cutover/gen_real_queries.go b/scripts/cutover/gen_real_queries.go index 939364d..7f6d781 100644 --- a/scripts/cutover/gen_real_queries.go +++ b/scripts/cutover/gen_real_queries.go @@ -19,6 +19,7 @@ import ( "flag" "fmt" "log" + "strings" "github.com/apache/arrow-go/v18/arrow/memory" "github.com/apache/arrow-go/v18/parquet/file" @@ -27,7 +28,8 @@ import ( func main() { src := flag.String("src", "/home/profit/lakehouse/data/datasets/fill_events.parquet", "fill_events parquet path") - limit := flag.Int("limit", 10, "number of queries to generate") + limit 
:= flag.Int("limit", 10, "number of source rows to read") + styles := flag.String("styles", "need", "comma-separated styles to emit per row (need|client_first|looking|shorthand|all)") flag.Parse() r, err := file.OpenParquetFile(*src, false) @@ -61,35 +63,98 @@ func main() { n = *limit } + stylesList := parseStyles(*styles) + fmt.Println("# Real-shape coordinator queries — generated from fill_events.parquet") fmt.Println("# (real-shape demand data; queries built mechanically from event rows).") - fmt.Printf("# Source: %s (%d rows total, %d emitted)\n", *src, tbl.NumRows(), n) + fmt.Printf("# Source: %s (%d rows total, %d emitted, styles=%v)\n", *src, tbl.NumRows(), n, stylesList) fmt.Println("#") - fmt.Println("# Format: client + count + role + city/state + start time +") - fmt.Println("# (optional deadline). Mimics the natural language a coordinator would") - fmt.Println("# type into a dispatch tool when triaging the next-up demand.") + fmt.Println("# Styles:") + fmt.Println("# need: 'Need N {role}{s} in {city} {state} starting at {at} for {client}'") + fmt.Println("# — matches scripts/playbook_lift's extractRoleFromNeed regex") + fmt.Println("# client_first: '{client} needs N {role}{s} in {city} {state} at {at}'") + fmt.Println("# looking: 'Looking for N {role}{s} at {client} in {city} {state} for {at} shift'") + fmt.Println("# shorthand: 'N {role}{s} {city} {state} {at} {client}'") + fmt.Println("#") + fmt.Println("# 'need', 'client_first', and 'looking' extract a role; 'shorthand' does not,") + fmt.Println("# so it tests the substrate's bleed behavior when the role gate is silently disabled.") fmt.Println() for i := 0; i < n; i++ { - c := client.ValueStr(i) - cy := city.ValueStr(i) - st := state.ValueStr(i) - ro := role.ValueStr(i) - ct := count.ValueStr(i) - t := at.ValueStr(i) - dl := deadline.ValueStr(i) - - // Phrase one is the urgent ask; phrase two is the natural rephrase - // a coordinator might use when typing fast. 
Different syntax, - // same intent — exercises the embedder's paraphrase tolerance. - q := fmt.Sprintf("Need %s %s in %s %s starting at %s for %s", ct, pluralize(ro, ct), cy, st, t, c) - if dl != "" && dl != "(null)" { - q += fmt.Sprintf(", deadline %s", dl) + ev := event{ + client: client.ValueStr(i), + city: city.ValueStr(i), + state: state.ValueStr(i), + role: role.ValueStr(i), + count: count.ValueStr(i), + at: at.ValueStr(i), + } + if dl := deadline.ValueStr(i); dl != "" && dl != "(null)" { + ev.deadline = dl + } + for _, s := range stylesList { + fmt.Println(formatQuery(ev, s)) } - fmt.Println(q) } } +type event struct { + client, city, state, role, count, at, deadline string +} + +func formatQuery(e event, style string) string { + r := pluralize(e.role, e.count) + switch style { + case "client_first": + // Anchored on "needs ... in"; extractRoleFromNeed matches this via its client_first pattern. + return fmt.Sprintf("%s needs %s %s in %s %s at %s", e.client, e.count, r, e.city, e.state, e.at) + case "looking": + return fmt.Sprintf("Looking for %s %s at %s in %s %s for %s shift", e.count, r, e.client, e.city, e.state, e.at) + case "shorthand": + return fmt.Sprintf("%s %s %s %s %s %s", e.count, r, e.city, e.state, e.at, e.client) + default: + // "need" form — the original real_001 shape, regex-extractor wins. + q := fmt.Sprintf("Need %s %s in %s %s starting at %s for %s", e.count, r, e.city, e.state, e.at, e.client) + if e.deadline != "" { + q += ", deadline " + e.deadline + } + return q + } +} + +// parseStyles unpacks the comma-separated -styles flag, with "all" +// expanding to every supported style and unknown tokens dropped +// (with a log line so callers know). 
+func parseStyles(csv string) []string { + all := []string{"need", "client_first", "looking", "shorthand"} + if strings.TrimSpace(csv) == "all" { + return all + } + out := []string{} + for _, s := range strings.Split(csv, ",") { + s = strings.TrimSpace(s) + if s == "" { + continue + } + known := false + for _, a := range all { + if a == s { + known = true + break + } + } + if !known { + log.Printf("gen_real_queries: unknown style %q — skipping", s) + continue + } + out = append(out, s) + } + if len(out) == 0 { + return []string{"need"} + } + return out +} + func pluralize(role, count string) string { if count == "1" { return role diff --git a/scripts/playbook_lift/main.go b/scripts/playbook_lift/main.go index 3ed99a3..5e82bfc 100644 --- a/scripts/playbook_lift/main.go +++ b/scripts/playbook_lift/main.go @@ -631,25 +631,53 @@ func appendNote(existing, add string) string { // refactor; harmless for now. var _ = sort.Slice -// extractRoleFromNeed pulls the role out of "Need N {role}{s} in {city}" -// shape queries — the fill_events-derived form used by real_NNN runs. -// Returns "" for any query that doesn't match (free-form lift-suite -// queries fall back to empty, leaving the cross-role gate disabled). +// extractRoleFromNeed pulls the role out of staffing-shape queries. +// Returns "" for any query that doesn't match a known anchor pattern +// (free-form lift-suite queries + shorthand-style fall back to empty, +// leaving the cross-role gate disabled). // -// The pattern is permissive: the count can be any digits, and the -// role is everything between the count and " in ". This catches -// "Need 5 Warehouse Associates in Kansas City" → "Warehouse Associates"; -// roleEqual on the matrix side handles plurals + case. +// Patterns covered (in priority order): +// need: "Need N {role}{s} in {city} ..." +// looking: "Looking for N {role}{s} at {client} in {city} ..." +// client_first: "{client} needs N {role}{s} in {city} ..." 
+//
+// Pattern explicitly NOT covered:
+// shorthand: "N {role}{s} {city} {state} {at} {client}"
+// Because there's no separator between role and city in shorthand
+// ("Forklift Operator Detroit" is shape-indistinguishable from
+// "Forklift" + "Operator Detroit"), a regex can't reliably extract
+// role here. real_003 confirmed shorthand-vs-shorthand cross-role
+// bleed: a CNC Operator shorthand recording leaked w-2404 onto a
+// Forklift Operator shorthand query within the same Beacon Freight
+// Detroit cluster. Closing that requires either an LLM extractor at
+// record+query time or a known-cities lookup table.
 //
 // Lives here (not in internal/matrix) because role extraction from
 // free-form text is a caller concern; matrix only consumes the
 // already-resolved Role string. A future LLM-based extractor would
-// replace this regex without changing matrix's gate logic.
+// replace this function without changing matrix's gate logic.
 func extractRoleFromNeed(query string) string {
-	re := regexp.MustCompile(`(?i)^Need\s+\d+\s+(.+?)\s+in\s+`)
-	m := re.FindStringSubmatch(query)
-	if len(m) < 2 {
-		return ""
+	for _, re := range roleExtractRegexes {
+		if m := re.FindStringSubmatch(query); len(m) >= 2 {
+			return strings.TrimSpace(m[1])
+		}
 	}
-	return strings.TrimSpace(m[1])
+	return ""
+}
+
+// roleExtractRegexes is ordered: prefix-anchored patterns ("Need",
+// "Looking for") come first and the open-prefix client_first pattern
+// comes last, so a looser pattern never shadows a stricter one if
+// more are added. Compiled once at package init via MustCompile.
+var roleExtractRegexes = []*regexp.Regexp{
+	// "Need N {role} in ..." — the original real_001 form.
+	regexp.MustCompile(`(?i)^Need\s+\d+\s+(.+?)\s+in\s+`),
+	// "Looking for N {role} at ..." — the looking style. Anchor on
+	// "at" because the role is followed by client (preceded by "at"),
+	// not by city directly.
+	regexp.MustCompile(`(?i)^Looking\s+for\s+\d+\s+(.+?)\s+at\s+`),
+	// "{client} needs N {role} in ..." — the client_first style.
+	// Lazy on the client side via .+?, then "needs", then count,
+	// then role, then "in".
+	regexp.MustCompile(`(?i)^.+?\s+needs\s+\d+\s+(.+?)\s+in\s+`),
 }
diff --git a/scripts/playbook_lift/main_test.go b/scripts/playbook_lift/main_test.go
new file mode 100644
index 0000000..39ec0dc
--- /dev/null
+++ b/scripts/playbook_lift/main_test.go
@@ -0,0 +1,76 @@
+package main
+
+import "testing"
+
+// TestExtractRoleFromNeed locks the four query-shape patterns documented
+// in real_003_findings.md so a future change to the regex can't silently
+// drop coverage of any production-shape style. Real_001 used `need`-only;
+// real_003 confirmed `shorthand` cross-role bleed; the extended
+// extractor in real_003b covers `client_first` + `looking` and leaves
+// `shorthand` as a known limitation (no separator between role and city).
+func TestExtractRoleFromNeed(t *testing.T) {
+	cases := []struct {
+		name  string
+		query string
+		want  string
+	}{
+		{
+			"need style — original real_001 form",
+			"Need 1 Forklift Operator in Detroit MI starting at 15:00 for Beacon Freight",
+			"Forklift Operator",
+		},
+		{
+			"need with deadline trailer",
+			"Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Freight, deadline 2026-05-28",
+			"Pickers",
+		},
+		{
+			"client_first style — added in real_003b",
+			"Beacon Freight needs 1 Forklift Operator in Detroit MI at 15:00",
+			"Forklift Operator",
+		},
+		{
+			"client_first with multi-word client",
+			"Parallel Machining needs 5 Warehouse Associates in Kansas City MO at 09:00",
+			"Warehouse Associates",
+		},
+		{
+			"looking style — added in real_003b",
+			"Looking for 1 Forklift Operator at Beacon Freight in Detroit MI for 15:00 shift",
+			"Forklift Operator",
+		},
+		{
+			"looking with multi-word role + 4-digit count",
+			"Looking for 1234 Senior Production Supervisors at Heritage Foods in Flint MI for 08:30 shift",
+			"Senior Production Supervisors",
+		},
+		{
+			"shorthand — known limitation, returns empty",
+			"1 Forklift Operator Detroit MI 15:00 Beacon Freight",
+			"",
+		},
+		{
+			"shorthand multi-word city — also empty",
+			"5 Warehouse Associates Kansas City MO 09:00 Parallel Machining",
+			"",
+		},
+		{
+			"lift-suite multi-constraint — no clean role, returns empty",
+			"Forklift operator with OSHA-30, warehouse experience, day shift availability",
+			"",
+		},
+		{
+			"OOD honesty signal — lift-suite, returns empty",
+			"Dental hygienist with three years experience, Indianapolis area",
+			"",
+		},
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			got := extractRoleFromNeed(c.query)
+			if got != c.want {
+				t.Errorf("extractRoleFromNeed(%q) = %q, want %q", c.query, got, c.want)
+			}
+		})
+	}
+}
diff --git a/tests/reality/real_coord_queries_v2.txt b/tests/reality/real_coord_queries_v2.txt
new file mode 100644
index 0000000..df6f221
--- /dev/null
+++ b/tests/reality/real_coord_queries_v2.txt
@@ -0,0 +1,54 @@
+# Real-shape coordinator queries — generated from fill_events.parquet
+# (real-shape demand data; queries built mechanically from event rows).
+# Source: /home/profit/lakehouse/data/datasets/fill_events.parquet (123 rows total, 10 emitted, styles=[need client_first looking shorthand])
+#
+# Styles:
+# need: 'Need N {role}{s} in {city} {state} starting at {at} for {client}'
+# — matches scripts/playbook_lift's extractRoleFromNeed regex
+# client_first: '{client} needs N {role}{s} in {city} {state} at {at}'
+# looking: 'Looking for N {role}{s} at {client} in {city} {state} for {at} shift'
+# shorthand: 'N {role}{s} {city} {state} {at} {client}'
+#
+# Only 'need' currently extracts a role. The other three test the
+# substrate's bleed behavior when the role gate is silently disabled.
+
+Need 5 Warehouse Associates in Kansas City MO starting at 09:00 for Parallel Machining
+Parallel Machining needs 5 Warehouse Associates in Kansas City MO at 09:00
+Looking for 5 Warehouse Associates at Parallel Machining in Kansas City MO for 09:00 shift
+5 Warehouse Associates Kansas City MO 09:00 Parallel Machining
+Need 1 Forklift Operator in Detroit MI starting at 15:00 for Beacon Freight, deadline 2026-05-28
+Beacon Freight needs 1 Forklift Operator in Detroit MI at 15:00
+Looking for 1 Forklift Operator at Beacon Freight in Detroit MI for 15:00 shift
+1 Forklift Operator Detroit MI 15:00 Beacon Freight
+Need 4 Loaders in Indianapolis IN starting at 12:00 for Midway Distribution
+Midway Distribution needs 4 Loaders in Indianapolis IN at 12:00
+Looking for 4 Loaders at Midway Distribution in Indianapolis IN for 12:00 shift
+4 Loaders Indianapolis IN 12:00 Midway Distribution
+Need 3 Warehouse Associates in Fort Wayne IN starting at 17:30 for Cornerstone Fabrication, deadline 2026-05-17
+Cornerstone Fabrication needs 3 Warehouse Associates in Fort Wayne IN at 17:30
+Looking for 3 Warehouse Associates at Cornerstone Fabrication in Fort Wayne IN for 17:30 shift
+3 Warehouse Associates Fort Wayne IN 17:30 Cornerstone Fabrication
+Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Freight, deadline 2026-05-28
+Beacon Freight needs 4 Pickers in Detroit MI at 13:30
+Looking for 4 Pickers at Beacon Freight in Detroit MI for 13:30 shift
+4 Pickers Detroit MI 13:30 Beacon Freight
+Need 2 Packers in Joliet IL starting at 09:30 for Parallel Machining
+Parallel Machining needs 2 Packers in Joliet IL at 09:30
+Looking for 2 Packers at Parallel Machining in Joliet IL for 09:30 shift
+2 Packers Joliet IL 09:30 Parallel Machining
+Need 3 Assemblers in Flint MI starting at 08:30 for Heritage Foods
+Heritage Foods needs 3 Assemblers in Flint MI at 08:30
+Looking for 3 Assemblers at Heritage Foods in Flint MI for 08:30 shift
+3 Assemblers Flint MI 08:30 Heritage Foods
+Need 3 Packers in Flint MI starting at 12:30 for Parallel Machining
+Parallel Machining needs 3 Packers in Flint MI at 12:30
+Looking for 3 Packers at Parallel Machining in Flint MI for 12:30 shift
+3 Packers Flint MI 12:30 Parallel Machining
+Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pioneer Assembly
+Pioneer Assembly needs 1 Shipping Clerk in Flint MI at 17:00
+Looking for 1 Shipping Clerk at Pioneer Assembly in Flint MI for 17:00 shift
+1 Shipping Clerk Flint MI 17:00 Pioneer Assembly
+Need 1 CNC Operator in Detroit MI starting at 17:30 for Beacon Freight, deadline 2026-05-28
+Beacon Freight needs 1 CNC Operator in Detroit MI at 17:30
+Looking for 1 CNC Operator at Beacon Freight in Detroit MI for 17:30 shift
+1 CNC Operator Detroit MI 17:30 Beacon Freight