playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)

real_003 left a known-weak hole: shorthand-style queries ("{count} {role} {city} {state} ...") have no separator between role and city, so a regex can't reliably extract — leaving the cross-role gate disabled when both record AND query are shorthand. This commit adds a roleExtractor with regex-first + LLM fallback: - Regex first (fast, deterministic) — handles need + client_first + looking from real_003b. ~75% of styles, no LLM cost paid. - LLM fallback when regex returns empty AND model is configured — Ollama-shape /api/chat with format=json, schema-tight prompt, temperature 0. ~1-3s on local qwen2.5. - Per-process cache — paraphrase + rejudge passes reuse the same query 4× per run; cache prevents 4× LLM cost. - Off-by-default — opt-in via -llm-role-extract flag (CLI) and LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping config unchanged unless explicitly enabled. 8 new tests in scripts/playbook_lift/main_test.go: - TestRoleExtractor_RegexFirst: LLM not called when regex matches - TestRoleExtractor_LLMFallback: shorthand goes to LLM - TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved - TestRoleExtractor_Cache: 3 calls = 1 LLM hit - TestRoleExtractor_NilSafe: nil receiver runs regex only - TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths - TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic witness for the real_003 scenario — both record + query are shorthand, regex returns "" for both, LLM produces DIFFERENT role tokens for CNC vs Forklift, so matrix gate's cross-role rejection (locked separately in TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires correctly. This is the load-bearing verification. Reality test real_004 ran the same 40-query stress as real_003 with LLM extraction on. Cross-style same-role boosts fired correctly across all 4 styles for Loaders + Packers + Shipping Clerk clusters (including shorthand → other-style transfer). No cross-role bleed observed. The reality test alone can't be a clean "with vs without" comparison (HNSW build is non-deterministic across runs, and real_004 stochastics didn't trigger a shorthand recording at all), which is why the unit-test witness exists. Production note (in real_004_findings.md): LLM extraction is for reality-test coverage of arbitrary query shapes. Production should extract role at INGEST time (when the inbox parser already runs an LLM) and pass already-resolved role through requests — same shape as multi_coord_stress's existing Demand{Role: ...} model. The hot path should never need the harness extractor's per-query LLM cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:51:27 -05:00 · 2026-04-30 22:51:27 -05:00 · 0331288641
commit 0331288641
parent 3263254f1c
6 changed files with 572 additions and 9 deletions
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@ -264,6 +264,7 @@ The list is intentionally short. Items move to closed when the work demands them
 | (probe)   | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Findings: `reports/reality-tests/real_001_findings.md`. |
 | (fix)     | Cross-role gate: `Role` on `PlaybookEntry`, `QueryRole` on `SearchRequest`, gate fires in both `ApplyPlaybookBoost` + `InjectPlaybookMisses`. `roleEqual` handles case + plural. Backward-compat: empty role on either side = gate disabled (preserves lift suite + free-form callers). 5 new unit tests use exact real_001 distance + role values. Re-run real_002: bleed closed (Q#5 Pickers, Q#10 CNC Operator stay at cold-pass top-1; same-role lifts still fire). Closes OPEN #1. Findings: `reports/reality-tests/real_002_findings.md`. |
 | (probe)   | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
 | (fix)     | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
--- a/reports/reality-tests/playbook_lift_real_004.md
+++ b/reports/reality-tests/playbook_lift_real_004.md
@ -0,0 +1,116 @@
 # Playbook-Lift Reality Test — Run real_004
 **Generated:** 2026-05-01T03:48:39.932829485Z
 **Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
 **Corpora:** `workers,ethereal_workers`
 **Workers limit:** 5000
 **Queries:** `tests/reality/real_coord_queries_v2.txt` (40 executed)
 **K per pass:** 10
 **Paraphrase pass:** disabled
 **Re-judge pass:** disabled
 **Evidence:** `reports/reality-tests/playbook_lift_real_004.json`
 ---
 ## Headline
 | Metric | Value |
 |---|---:|
 | Total queries run | 40 |
 | Cold-pass discoveries (judge-best ≠ top-1) | 5 |
 | Warm-pass lifts (recorded playbook → top-1) | 4 |
 | No change (judge-best already top-1, no playbook needed) | 36 |
 | Playbook boosts triggered (warm pass) | 16 |
 | Mean Δ top-1 distance (warm − cold) | -0.11025716 |
 **Verbatim lift rate:** 4 of 5 discoveries became top-1 after warm pass.
 ---
 ## Per-query results
 | # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
 |---|---|---|---|---|---|---|---|
 | 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | w-3288 | 0/4 | — | w-3288 | 0 | no |
 | 2 | Parallel Machining needs 5 Warehouse Associates in Kansas Ci | w-3288 | 0/4 | — | w-3288 | 0 | no |
 | 3 | Looking for 5 Warehouse Associates at Parallel Machining in  | w-3288 | 0/4 | — | w-3288 | 0 | no |
 | 4 | 5 Warehouse Associates Kansas City MO 09:00 Parallel Machini | w-3288 | 0/4 | — | w-3288 | 0 | no |
 | 5 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | w-2136 | 0/5 | — | w-2136 | 0 | no |
 | 6 | Beacon Freight needs 1 Forklift Operator in Detroit MI at 15 | e-6723 | 2/2 | — | e-6723 | 2 | no |
 | 7 | Looking for 1 Forklift Operator at Beacon Freight in Detroit | w-4766 | 0/5 | — | w-4766 | 0 | no |
 | 8 | 1 Forklift Operator Detroit MI 15:00 Beacon Freight | e-5818 | 0/4 | — | e-5818 | 0 | no |
 | 9 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-9877 | 0/4 | — | w-398 | 2 | no |
 | 10 | Midway Distribution needs 4 Loaders in Indianapolis IN at 12 | e-4537 | 4/5 | ✓ w-2148 | w-398 | 1 | no |
 | 11 | Looking for 4 Loaders at Midway Distribution in Indianapolis | w-962 | 6/4 | ✓ w-398 | w-398 | 0 | **YES** |
 | 12 | 4 Loaders Indianapolis IN 12:00 Midway Distribution | e-9877 | 0/4 | — | w-2148 | 2 | no |
 | 13 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | w-2857 | 1/4 | ✓ w-1063 | w-1063 | 0 | **YES** |
 | 14 | Cornerstone Fabrication needs 3 Warehouse Associates in Fort | w-4703 | 0/2 | — | w-1063 | 1 | no |
 | 15 | Looking for 3 Warehouse Associates at Cornerstone Fabricatio | w-1784 | 0/3 | — | w-1063 | 1 | no |
 | 16 | 3 Warehouse Associates Fort Wayne IN 17:30 Cornerstone Fabri | w-2754 | 2/3 | — | w-1063 | 3 | no |
 | 17 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | e-2247 | 1/2 | — | e-2247 | 1 | no |
 | 18 | Beacon Freight needs 4 Pickers in Detroit MI at 13:30 | e-2247 | 2/2 | — | e-2247 | 2 | no |
 | 19 | Looking for 4 Pickers at Beacon Freight in Detroit MI for 13 | e-2247 | 1/2 | — | e-2247 | 1 | no |
 | 20 | 4 Pickers Detroit MI 13:30 Beacon Freight | e-2247 | 0/3 | — | e-2247 | 0 | no |
 | 21 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-846 | 2/3 | — | e-2120 | 0 | no |
 | 22 | Parallel Machining needs 2 Packers in Joliet IL at 09:30 | e-846 | 2/4 | ✓ e-2120 | e-2120 | 0 | **YES** |
 | 23 | Looking for 2 Packers at Parallel Machining in Joliet IL for | e-846 | 1/2 | — | e-2120 | 2 | no |
 | 24 | 2 Packers Joliet IL 09:30 Parallel Machining | e-846 | 2/3 | — | e-2120 | 0 | no |
 | 25 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-4124 | 9/3 | — | w-4124 | 9 | no |
 | 26 | Heritage Foods needs 3 Assemblers in Flint MI at 08:30 | w-4124 | 3/3 | — | w-4124 | 3 | no |
 | 27 | Looking for 3 Assemblers at Heritage Foods in Flint MI for 0 | w-4124 | 2/2 | — | w-4124 | 2 | no |
 | 28 | 3 Assemblers Flint MI 08:30 Heritage Foods | w-4124 | 3/3 | — | w-4124 | 3 | no |
 | 29 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-6019 | 8/2 | — | e-6019 | 8 | no |
 | 30 | Parallel Machining needs 3 Packers in Flint MI at 12:30 | e-6019 | 9/2 | — | e-6019 | 9 | no |
 | 31 | Looking for 3 Packers at Parallel Machining in Flint MI for  | e-6019 | 0/1 | — | e-6019 | 0 | no |
 | 32 | 3 Packers Flint MI 12:30 Parallel Machining | e-2006 | 0/3 | — | e-2006 | 0 | no |
 | 33 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 2/4 | ✓ w-1367 | w-1367 | 0 | **YES** |
 | 34 | Pioneer Assembly needs 1 Shipping Clerk in Flint MI at 17:00 | w-3988 | 1/3 | — | w-1367 | 2 | no |
 | 35 | Looking for 1 Shipping Clerk at Pioneer Assembly in Flint MI | w-3988 | 0/2 | — | w-1367 | 1 | no |
 | 36 | 1 Shipping Clerk Flint MI 17:00 Pioneer Assembly | e-4849 | 0/2 | — | w-1367 | 1 | no |
 | 37 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | e-489 | 5/3 | — | e-489 | 5 | no |
 | 38 | Beacon Freight needs 1 CNC Operator in Detroit MI at 17:30 | e-489 | 0/2 | — | e-489 | 0 | no |
 | 39 | Looking for 1 CNC Operator at Beacon Freight in Detroit MI f | e-489 | 0/2 | — | e-489 | 0 | no |
 | 40 | 1 CNC Operator Detroit MI 17:30 Beacon Freight | e-489 | 3/3 | — | e-489 | 3 | no |
 ---
 ## Honesty caveats
 1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
 2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
 3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case — same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare verbatim and paraphrase
   lift rates — paraphrase should be lower (semantic-distance gates some
   playbook hits) but non-zero is the meaningful signal.
 4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.
 5. **Judge resolution.** This run used `qwen2.5:latest` from
   config [models].local_judge.
   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
 6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.
 ## Next moves
 - If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
 - If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
 - If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
--- a/reports/reality-tests/real_004_findings.md
+++ b/reports/reality-tests/real_004_findings.md
@ -0,0 +1,104 @@
 # Reality test real_004 — LLM-based role extractor closes shorthand hole
 real_003 / real_003b had a documented limitation: the regex extractor
 can't separate role from city in shorthand-style queries
 (`{count} {role} {city} {state} ...`) because there's no anchor
 between role and city. real_004 closes this with an LLM fallback —
 qwen2.5 format=json, called only when the regex returns empty.
 ## Architecture
 `roleExtractor` struct in `scripts/playbook_lift/main.go`:
 1. **Regex first** (fast, deterministic) — handles need + client_first
   + looking patterns from real_003b.
 2. **LLM fallback** when regex returns empty AND model is configured —
   Ollama-shape `/api/chat` with format=json, schema-tight system
   prompt, temperature 0.
 3. **Per-process cache** — paraphrase + rejudge passes hit the same
   query 4× per run; cache prevents 4× LLM cost per query.
 4. **Off-by-default** — `-llm-role-extract` flag (CLI) +
   `LLM_ROLE_EXTRACT=1` env var (harness) opt in. Default behavior
   is unchanged from real_003b.
 ## Verification — what's proven
 **Unit tests (8 new in `scripts/playbook_lift/main_test.go`):**
 - `TestRoleExtractor_RegexFirst` — when regex matches, LLM is NOT
  called (cost discipline preserved on the 75% of queries the regex
  handles).
 - `TestRoleExtractor_LLMFallback` — shorthand query goes to LLM,
  result is used.
 - `TestRoleExtractor_LLMOffLeavesEmpty` — without model configured,
  shorthand returns empty (current default).
 - `TestRoleExtractor_Cache` — 3 calls to same query = 1 LLM hit.
 - `TestRoleExtractor_NilSafe` — nil receiver runs regex only;
  matrixSearch + playbookRecord don't need a guard.
 - `TestExtractRoleViaLLM_HTTPError` + `_BadJSON` — failure paths
  surface error so caller can fall back cleanly.
 - `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the synthetic
  witness for the real_003 scenario: when both record AND query are
  shorthand (regex returns "" for both), LLM produces DIFFERENT role
  tokens for CNC Operator vs Forklift Operator queries → matrix
  gate's cross-role rejection fires correctly. This is the load-bearing
  verification — paired with `internal/matrix/playbook_test.go`'s
  `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` (which uses
  the exact role tokens this test produces).
 ## Verification — what real_004 (live harness) shows
 Same 40-query stress file as real_003. With `LLM_ROLE_EXTRACT=1`:
 | Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
 |---|---:|---:|---:|---|
 | real_003  (regex only) | 7  | 1 (CNC) | 14 | **YES — w-2404 leaked** |
 | real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
 | real_004  (regex + LLM)    | 5  | 0 | 16 | none observed |
 real_004's shorthand-recordings = 0 is stochastic (HNSW build is
 non-deterministic across runs, so cold-pass top-1 varies and so does
 which queries trigger discovery). The LLM is doing work — boosts
 fired across all 4 styles for the Loaders + Packers + Shipping Clerk
 clusters, including shorthand queries. But the dataset didn't
 naturally produce a shorthand recording in this run, so we can't
 read a clean "with-vs-without" reality-test signal on the bleed.
 **This is why the unit-test witness exists.** The reality test confirms
 no regressions; the unit test proves the extraction-layer correctness
 on the exact failure mode real_003 surfaced. Together they cover the
 fix.
 ## Cost note
 LLM extraction adds ~1-3s per shorthand query on local qwen2.5. Per
 real_004 timing the harness took ~10 minutes for 40 queries with LLM
 on (vs ~2 min in real_003b). The cache makes paraphrase + rejudge
 passes free; first-touch shorthand queries pay the LLM cost once.
 For production hot paths (inbox parsing, retrieve), the LLM cost is
 prohibitive. The right architecture there is to extract role at
 INGEST time (when queries land in the inbox parser) and pass the
 already-resolved role through the request — same pattern as
 multi_coord_stress's existing `Demand{Role: ...}` shape. The harness
 LLM extractor is for reality-test coverage of arbitrary query shapes;
 production should never need it.
 ## Repro
 ```bash
 # Generate the same 40-query stress file
 go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt
 # With LLM extractor (closes shorthand hole)
 QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
  ./scripts/playbook_lift.sh
 # Without LLM extractor (regex-only — current default)
 QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
 ```
 Evidence: `reports/reality-tests/playbook_lift_real_004.{json,md}`.
--- a/scripts/playbook_lift.sh
+++ b/scripts/playbook_lift.sh
@ -283,6 +283,16 @@ fi
 if [ "$WITH_REJUDGE" = "1" ]; then
  EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
 fi
 # LLM_ROLE_EXTRACT=1 enables the qwen2.5 fallback extractor for
 # shorthand-style queries (real_003 known limitation). Adds ~1-3s
 # per shorthand query but closes the cross-role bleed at the
 # extraction layer. Default off.
 if [ "${LLM_ROLE_EXTRACT:-0}" = "1" ]; then
  EXTRA_FLAGS="$EXTRA_FLAGS -llm-role-extract"
  if [ -n "${LLM_ROLE_MODEL:-}" ]; then
    EXTRA_FLAGS="$EXTRA_FLAGS -llm-role-model $LLM_ROLE_MODEL"
  fi
 fi
 ./bin/playbook_lift \
  -config  "$CONFIG_PATH" \
  -gateway "http://127.0.0.1:3110" \
--- a/scripts/playbook_lift/main.go
+++ b/scripts/playbook_lift/main.go
@ -148,6 +148,8 @@ func main() {
 	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
 	withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
 	withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
 	llmRoleExtract := flag.Bool("llm-role-extract", false, "fall back to LLM (qwen2.5 format=json) when the regex extractor returns empty — closes the shorthand-style cross-role bleed surfaced in real_003 at the cost of ~1-3s/query")
 	llmRoleModel := flag.String("llm-role-model", "qwen2.5:latest", "Ollama model used for LLM role extraction; ignored when -llm-role-extract is off")
 	flag.Parse()
 	// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@ -176,6 +178,23 @@ func main() {
 	log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)
 	hc := &http.Client{Timeout: 60 * time.Second}
 	// Package-global role extractor — used by matrixSearch +
 	// playbookRecord. Off-by-default so the existing harness behavior
 	// (regex-only extraction) is unchanged unless -llm-role-extract.
 	{
 		mdl := ""
 		if *llmRoleExtract {
 			mdl = *llmRoleModel
 			log.Printf("[lift] llm role extraction ON (model=%s) — shorthand queries get LLM fallback", mdl)
 		}
 		globalRoleExtractor = &roleExtractor{
 			hc:        hc,
 			ollamaURL: *ollama,
 			model:     mdl,
 		}
 	}
 	runs := make([]queryRun, 0, len(qs))
 	totalDelta := float32(0)
 	playbookBoostedTotal := 0
@ -482,11 +501,12 @@ func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, us
 		"per_corpus_k": k,
 		"use_playbook": usePlaybook,
 	}
-	// Mechanical role extraction (real_001 cross-role bleed fix).
+	// Role extraction (real_001 + real_003 cross-role bleed fixes).
-	// Empty when the query doesn't match the "Need N {role}{s} in"
+	// Goes through globalRoleExtractor so shorthand-style queries get
-	// pattern — gate stays disabled and the harness behaves as before
+	// LLM fallback when -llm-role-extract is on. Empty result leaves
-	// for the lift suite's free-form queries.
+	// the gate disabled — harness preserves current behavior on
-	if role := extractRoleFromNeed(query); role != "" {
+	// truly-unparseable shapes.
 	if role := globalRoleExtractor.extract(query); role != "" {
 		body["query_role"] = role
 	}
 	bs, _ := json.Marshal(body)
@ -516,9 +536,10 @@ func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, s
 		"score":         score,
 		"tags":          []string{"reality-test", "playbook-lift-001"},
 	}
-	// Same mechanical role extraction as matrixSearch — recorded role
+	// Same extractor as matrixSearch — shared cache, same LLM fallback
-	// lets retrieve-time gate fire on cross-role queries (real_001 fix).
+	// rules. Recorded role lets retrieve-time gate fire on cross-role
-	if role := extractRoleFromNeed(query); role != "" {
+	// queries (real_001 + real_003 fixes).
 	if role := globalRoleExtractor.extract(query); role != "" {
 		body["role"] = role
 	}
 	bs, _ := json.Marshal(body)
@ -631,6 +652,110 @@ func appendNote(existing, add string) string {
 // refactor; harmless for now.
 var _ = sort.Slice
 // globalRoleExtractor is set in main() and read by matrixSearch +
 // playbookRecord. nil-safe: a nil receiver on roleExtractor.extract
 // degrades to regex-only behavior, so the field is checked once at
 // startup and never re-validated per call.
 var globalRoleExtractor *roleExtractor
 // roleExtractor combines the fast-path regex with an optional LLM
 // fallback so callers can pay for shorthand coverage only when they
 // need it. Per real_003_findings.md: shorthand-style queries
 // ("N {role} {city} {state} {at} {client}") have no separator
 // between role and city, so a regex can't reliably extract — but a
 // small LLM with a tight format=json prompt can. Cost is ~1-3s per
 // extraction on local qwen2.5; cached per-process so paraphrase
 // passes don't pay twice for the same query.
 //
 // Empty cache + off-by-default LLM = the existing real_003b behavior
 // is unchanged unless callers explicitly enable LLM mode.
 type roleExtractor struct {
 	hc        *http.Client
 	ollamaURL string
 	model     string // "" disables LLM fallback
 	cache     map[string]string
 }
 // extract returns the role for a query. Tries regex first (fast,
 // deterministic); if regex misses and LLM is configured, calls Ollama;
 // caches the result either way. Returns "" when both miss — the
 // caller's gate stays disabled, preserving current behavior on
 // truly-unparseable shapes.
 func (r *roleExtractor) extract(query string) string {
 	if r == nil {
 		return extractRoleFromNeed(query)
 	}
 	if cached, ok := r.cache[query]; ok {
 		return cached
 	}
 	role := extractRoleFromNeed(query)
 	if role == "" && r.model != "" {
 		if v, err := extractRoleViaLLM(r.hc, r.ollamaURL, r.model, query); err == nil {
 			role = v
 		} else {
 			log.Printf("[lift] llm-role-extract failed (%v) — falling back to empty for %q", err, abbrev(query, 60))
 		}
 	}
 	if r.cache == nil {
 		r.cache = make(map[string]string)
 	}
 	r.cache[query] = role
 	return role
 }
 // extractRoleViaLLM asks the Ollama-shape /api/chat to identify the
 // staffing role in a free-form query. Tight schema + format=json so
 // parsing is deterministic. Empty role string is a valid response —
 // the model may decline to extract when the query has no clean role
 // (e.g. lift-suite multi-constraint queries).
 func extractRoleViaLLM(hc *http.Client, ollamaURL, model, query string) (string, error) {
 	system := `You are a staffing-domain role extractor.
 Output JSON only: {"role": "<role string or empty>"}.
 Rules:
 - Identify the staffing role (job title) the query is asking for.
 - Return only the role noun phrase — e.g. "Forklift Operator", "Pickers".
 - Preserve plurality from the query (don't singularize).
 - Strip qualifiers ("OSHA-30 certified") — we want the bare role.
 - If the query has no clean staffing role, return "" (empty string).
 - Do NOT explain. Just emit the JSON.`
 	body, _ := json.Marshal(map[string]any{
 		"model":  model,
 		"stream": false,
 		"format": "json",
 		"messages": []map[string]string{
 			{"role": "system", "content": system},
 			{"role": "user", "content": query},
 		},
 		"options": map[string]any{"temperature": 0.0},
 	})
 	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
 	req.Header.Set("Content-Type", "application/json")
 	resp, err := hc.Do(req)
 	if err != nil {
 		return "", err
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode/100 != 2 {
 		return "", fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
 	}
 	rb, _ := io.ReadAll(resp.Body)
 	var ollamaResp struct {
 		Message struct {
 			Content string `json:"content"`
 		} `json:"message"`
 	}
 	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
 		return "", err
 	}
 	var out struct {
 		Role string `json:"role"`
 	}
 	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
 		return "", fmt.Errorf("decode role: %w (content=%q)", err, ollamaResp.Message.Content)
 	}
 	return strings.TrimSpace(out.Role), nil
 }
 // extractRoleFromNeed pulls the role out of staffing-shape queries.
 // Returns "" for any query that doesn't match a known anchor pattern
 // (free-form lift-suite queries + shorthand-style fall back to empty,
--- a/scripts/playbook_lift/main_test.go
+++ b/scripts/playbook_lift/main_test.go
@ -1,6 +1,14 @@
 package main
-import "testing"
+import (
 	"encoding/json"
 	"io"
 	"net/http"
 	"net/http/httptest"
 	"strings"
 	"sync/atomic"
 	"testing"
 )
 // TestExtractRoleFromNeed locks the four query-shape patterns documented
 // in real_003_findings.md so a future change to the regex can't silently
@ -74,3 +82,202 @@ func TestExtractRoleFromNeed(t *testing.T) {
 		})
 	}
 }
 // fakeOllama returns a minimal Ollama-shape /api/chat handler that
 // echoes a fixed role string. Counts hits so caching tests can assert
 // on call count.
 func fakeOllama(role string) (*httptest.Server, *int64) {
 	var hits int64
 	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		if r.URL.Path != "/api/chat" {
 			http.NotFound(w, r)
 			return
 		}
 		atomic.AddInt64(&hits, 1)
 		_, _ = io.ReadAll(r.Body)
 		// Ollama wraps the model's response under message.content as
 		// a JSON-string. The model's output is itself the
 		// {"role":"..."} JSON our extractor parses.
 		inner, _ := json.Marshal(map[string]string{"role": role})
 		out, _ := json.Marshal(map[string]any{
 			"message": map[string]string{"content": string(inner)},
 		})
 		w.Header().Set("Content-Type", "application/json")
 		_, _ = w.Write(out)
 	}))
 	return srv, &hits
 }
 // TestRoleExtractor_RegexFirst locks the priority order: when the
 // regex matches, the LLM must NOT be called even if configured. Same
 // behavior the real_003b shipping config relies on (don't pay LLM
 // cost on need/client_first/looking queries).
 func TestRoleExtractor_RegexFirst(t *testing.T) {
 	srv, hits := fakeOllama("LLM-WONT-BE-CALLED")
 	defer srv.Close()
 	rx := &roleExtractor{
 		hc:        srv.Client(),
 		ollamaURL: srv.URL,
 		model:     "test-model", // LLM is configured...
 	}
 	got := rx.extract("Need 5 Forklift Operators in Detroit MI starting at 09:00 for ACME")
 	if got != "Forklift Operators" {
 		t.Errorf("regex should win on Need-form query, got %q", got)
 	}
 	if *hits != 0 {
 		t.Errorf("LLM should not be called when regex matched, got %d hits", *hits)
 	}
 }
 // TestRoleExtractor_LLMFallback locks the shorthand-coverage path:
 // when regex returns empty AND LLM is configured, the LLM is called
 // and its result is used. Closes the real_003 shorthand bleed at the
 // extraction layer.
 func TestRoleExtractor_LLMFallback(t *testing.T) {
 	srv, hits := fakeOllama("CNC Operator")
 	defer srv.Close()
 	rx := &roleExtractor{
 		hc:        srv.Client(),
 		ollamaURL: srv.URL,
 		model:     "test-model",
 	}
 	got := rx.extract("1 CNC Operator Detroit MI 17:30 Beacon Freight")
 	if got != "CNC Operator" {
 		t.Errorf("LLM fallback should fire on shorthand, got %q", got)
 	}
 	if *hits != 1 {
 		t.Errorf("expected exactly 1 LLM hit, got %d", *hits)
 	}
 }
 // TestRoleExtractor_LLMOffLeavesEmpty locks the opt-in default: with
 // LLM model unset, shorthand queries return empty (preserves
 // real_003b shipping config — no LLM cost paid by default).
 func TestRoleExtractor_LLMOffLeavesEmpty(t *testing.T) {
 	rx := &roleExtractor{} // model = "" disables LLM
 	got := rx.extract("1 CNC Operator Detroit MI 17:30 Beacon Freight")
 	if got != "" {
 		t.Errorf("with LLM off, shorthand should return empty, got %q", got)
 	}
 }
 // TestRoleExtractor_Cache locks the per-process cache: re-extracting
 // the same query must NOT call the LLM twice. Paraphrase passes hit
 // the same query under cold + warm + paraphrase + rejudge — without
 // the cache that's 4× LLM cost per query.
 func TestRoleExtractor_Cache(t *testing.T) {
 	srv, hits := fakeOllama("Pickers")
 	defer srv.Close()
 	rx := &roleExtractor{
 		hc:        srv.Client(),
 		ollamaURL: srv.URL,
 		model:     "test-model",
 	}
 	q := "4 Pickers Detroit MI 13:30 Beacon Freight"
 	for i := 0; i < 3; i++ {
 		got := rx.extract(q)
 		if got != "Pickers" {
 			t.Errorf("call %d: got %q, want Pickers", i, got)
 		}
 	}
 	if *hits != 1 {
 		t.Errorf("expected 1 LLM hit (rest cached), got %d", *hits)
 	}
 }
 // TestRoleExtractor_NilSafe locks the nil-receiver behavior so call
 // sites in matrixSearch + playbookRecord don't need a guard. nil
 // extractor degrades to regex-only.
 func TestRoleExtractor_NilSafe(t *testing.T) {
 	var rx *roleExtractor
 	got := rx.extract("Need 1 Loader in Indianapolis IN starting at 12:00 for Midway")
 	if got != "Loader" {
 		t.Errorf("nil receiver should still run regex, got %q", got)
 	}
 	got = rx.extract("1 CNC Operator Detroit MI 17:30 Beacon")
 	if got != "" {
 		t.Errorf("nil receiver should return empty when regex misses (no LLM), got %q", got)
 	}
 }
 // TestExtractRoleViaLLM_HTTPError locks the failure path: HTTP non-2xx
 // surfaces as an error so the caller can fall back to empty cleanly.
 func TestExtractRoleViaLLM_HTTPError(t *testing.T) {
 	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		http.Error(w, "internal", http.StatusInternalServerError)
 	}))
 	defer srv.Close()
 	_, err := extractRoleViaLLM(srv.Client(), srv.URL, "test-model", "anything")
 	if err == nil || !strings.Contains(err.Error(), "HTTP 500") {
 		t.Errorf("expected HTTP 500 error, got %v", err)
 	}
 }
 // TestRoleExtractor_ClosesCrossRoleShorthandBleed is the synthetic
 // witness for the real_003 finding: when both record AND query are
 // shorthand-style (no anchor between role and city), the regex
 // returns "" for both. With LLM off, both sides go empty → gate
 // disabled → cross-role bleed possible. With LLM on, both sides
 // extract a non-empty role → role-mismatch is caught at the matrix
 // gate.
 //
 // This test isolates the EXTRACTION layer (does the harness produce
 // the right role token?). The matrix gate's behavior on those tokens
 // is locked separately in internal/matrix/playbook_test.go's
 // TestInjectPlaybookMisses_RoleGateRejectsCrossRole.
 func TestRoleExtractor_ClosesCrossRoleShorthandBleed(t *testing.T) {
 	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		body, _ := io.ReadAll(r.Body)
 		// Determine which role to return based on the user message.
 		role := ""
 		if strings.Contains(string(body), "CNC") {
 			role = "CNC Operator"
 		} else if strings.Contains(string(body), "Forklift") {
 			role = "Forklift Operator"
 		}
 		inner, _ := json.Marshal(map[string]string{"role": role})
 		out, _ := json.Marshal(map[string]any{
 			"message": map[string]string{"content": string(inner)},
 		})
 		_, _ = w.Write(out)
 	}))
 	defer srv.Close()
 	rx := &roleExtractor{
 		hc:        srv.Client(),
 		ollamaURL: srv.URL,
 		model:     "test-model",
 	}
 	// Both queries are shorthand — regex would return "" for both.
 	// LLM extracts a real role for each, and they differ.
 	cnc := rx.extract("1 CNC Operator Detroit MI 17:30 Beacon Freight")
 	fork := rx.extract("1 Forklift Operator Detroit MI 15:00 Beacon Freight")
 	if cnc != "CNC Operator" {
 		t.Errorf("CNC shorthand should extract via LLM, got %q", cnc)
 	}
 	if fork != "Forklift Operator" {
 		t.Errorf("Forklift shorthand should extract via LLM, got %q", fork)
 	}
 	if cnc == fork {
 		t.Errorf("cross-role shorthand should produce DIFFERENT role tokens (cnc=%q fork=%q)", cnc, fork)
 	}
 }
 // TestExtractRoleViaLLM_BadJSON locks the model-output validation: if
 // the LLM returns non-JSON content, the error is surfaced (don't
 // silently treat it as empty role).
 func TestExtractRoleViaLLM_BadJSON(t *testing.T) {
 	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		out, _ := json.Marshal(map[string]any{
 			"message": map[string]string{"content": "this is not json"},
 		})
 		_, _ = w.Write(out)
 	}))
 	defer srv.Close()
 	_, err := extractRoleViaLLM(srv.Client(), srv.URL, "test-model", "anything")
 	if err == nil || !strings.Contains(err.Error(), "decode role") {
 		t.Errorf("expected decode error, got %v", err)
 	}
 }