real_003 left a known weak spot: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract the role. That
leaves the cross-role gate disabled when both record AND query are
shorthand.
This commit adds a roleExtractor with regex-first + LLM fallback:
- Regex first (fast, deterministic) — handles need + client_first +
looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
Ollama-shape /api/chat with format=json, schema-tight prompt,
temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
config unchanged unless explicitly enabled.
8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
witness for the real_003 scenario — both record + query are
shorthand, regex returns "" for both, LLM produces DIFFERENT
role tokens for CNC vs Forklift, so matrix gate's cross-role
rejection (locked separately in
TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
correctly. This is the load-bearing verification.
Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and this
real_004 run happened not to trigger a shorthand recording at all),
which is why the unit-test witness exists.
Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reality test real_004 — LLM-based role extractor closes shorthand hole
real_003 / real_003b had a documented limitation: the regex extractor
can't separate role from city in shorthand-style queries
({count} {role} {city} {state} ...) because there's no anchor
between role and city. real_004 closes this with an LLM fallback —
qwen2.5 format=json, called only when the regex returns empty.
Architecture
roleExtractor struct in scripts/playbook_lift/main.go:
- Regex first (fast, deterministic): handles need + client_first + looking patterns from real_003b.
- LLM fallback when regex returns empty AND model is configured: Ollama-shape /api/chat with format=json, schema-tight system prompt, temperature 0.
- Per-process cache: paraphrase + rejudge passes hit the same query 4× per run; cache prevents 4× LLM cost per query.
- Off-by-default: -llm-role-extract flag (CLI) + LLM_ROLE_EXTRACT=1 env var (harness) opt in. Default behavior is unchanged from real_003b.
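The control flow above can be sketched roughly as follows. This is an illustration, not the code in scripts/playbook_lift/main.go: the needPattern regex, the Extract method name, and the injected llm function are all assumptions made so the sketch runs without a model.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// needPattern is a hypothetical stand-in for the real_003b regexes; it only
// matches queries shaped like "need 3 Forklift Operator in Detroit MI".
var needPattern = regexp.MustCompile(`(?i)\bneed\s+\d+\s+(.+?)\s+in\s+`)

// regexRole returns the lowercased role captured by the regex, or "" on no match.
func regexRole(query string) string {
	m := needPattern.FindStringSubmatch(query)
	if m == nil {
		return ""
	}
	return strings.ToLower(strings.TrimSpace(m[1]))
}

// roleExtractor tries the regex first and falls back to an injected LLM call.
type roleExtractor struct {
	llm   func(query string) (string, error) // nil means the LLM fallback is off
	mu    sync.Mutex
	cache map[string]string
}

// Extract is safe on a nil receiver, in which case only the regex runs.
func (e *roleExtractor) Extract(query string) string {
	if role := regexRole(query); role != "" {
		return role // regex hit: no LLM cost paid
	}
	if e == nil || e.llm == nil {
		return "" // fallback disabled: shorthand stays unresolved
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	if role, ok := e.cache[query]; ok {
		return role
	}
	role, err := e.llm(query)
	if err != nil {
		return "" // not cached, so a later call can retry
	}
	if e.cache == nil {
		e.cache = make(map[string]string)
	}
	e.cache[query] = role
	return role
}

func main() {
	calls := 0
	e := &roleExtractor{llm: func(q string) (string, error) {
		calls++
		return "cnc operator", nil
	}}
	fmt.Println(e.Extract("need 3 Forklift Operator in Detroit MI"))
	for i := 0; i < 3; i++ {
		fmt.Println(e.Extract("2 CNC Operator Detroit MI"))
	}
	fmt.Println("llm calls:", calls)
}
```

In this sketch the three shorthand calls reach the fake LLM once and the need-style query never reaches it. Holding the mutex across the LLM call keeps the sketch short; a harness that extracts roles concurrently might prefer a per-key lock.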
Verification — what's proven
Unit tests (8 new in scripts/playbook_lift/main_test.go):
- TestRoleExtractor_RegexFirst: when regex matches, LLM is NOT called (cost discipline preserved on the 75% of queries the regex handles).
- TestRoleExtractor_LLMFallback: shorthand query goes to LLM, result is used.
- TestRoleExtractor_LLMOffLeavesEmpty: without model configured, shorthand returns empty (current default).
- TestRoleExtractor_Cache: 3 calls to same query = 1 LLM hit.
- TestRoleExtractor_NilSafe: nil receiver runs regex only; matrixSearch + playbookRecord don't need a guard.
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths surface error so caller can fall back cleanly.
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: the synthetic witness for the real_003 scenario: when both record AND query are shorthand (regex returns "" for both), LLM produces DIFFERENT role tokens for CNC Operator vs Forklift Operator queries → matrix gate's cross-role rejection fires correctly. This is the load-bearing verification, paired with internal/matrix/playbook_test.go's TestInjectPlaybookMisses_RoleGateRejectsCrossRole (which uses the exact role tokens this test produces).
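For the HTTP failure paths, a minimal sketch of an Ollama-shape /api/chat call is below. The request fields (model, messages, stream, format, options.temperature) follow Ollama's chat API. The function signature, the system prompt wording, the {"role": ...} reply schema, and the newFakeChatServer helper are assumptions for illustration, not the harness's actual code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// chatRequest mirrors the non-streaming Ollama /api/chat request body.
type chatRequest struct {
	Model    string         `json:"model"`
	Messages []chatMessage  `json:"messages"`
	Stream   bool           `json:"stream"`
	Format   string         `json:"format"`
	Options  map[string]any `json:"options"`
}

type chatResponse struct {
	Message chatMessage `json:"message"`
}

// extractRoleViaLLM posts one query and returns the "role" field of the JSON
// the model emits. Every failure is returned so the caller can fall back.
func extractRoleViaLLM(client *http.Client, baseURL, model, query string) (string, error) {
	body, err := json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			// Hypothetical prompt; the real schema-tight prompt is not shown in the text.
			{Role: "system", Content: `Reply with JSON only: {"role": "<job role>"}`},
			{Role: "user", Content: query},
		},
		Stream:  false,
		Format:  "json",
		Options: map[string]any{"temperature": 0},
	})
	if err != nil {
		return "", err
	}
	resp, err := client.Post(baseURL+"/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("chat endpoint returned %s", resp.Status)
	}
	var envelope chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&envelope); err != nil {
		return "", fmt.Errorf("decode response envelope: %w", err)
	}
	var payload struct {
		Role string `json:"role"`
	}
	if err := json.Unmarshal([]byte(envelope.Message.Content), &payload); err != nil {
		return "", fmt.Errorf("decode role JSON: %w", err)
	}
	return payload.Role, nil
}

// newFakeChatServer stands in for Ollama: it answers every request with the
// given status and, on 200, wraps content as the assistant message.
func newFakeChatServer(status int, content string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if status != http.StatusOK {
			http.Error(w, "boom", status)
			return
		}
		json.NewEncoder(w).Encode(chatResponse{Message: chatMessage{Role: "assistant", Content: content}})
	}))
}

func main() {
	cases := []struct {
		status  int
		content string
	}{
		{http.StatusOK, `{"role": "CNC Operator"}`},
		{http.StatusInternalServerError, ""},
		{http.StatusOK, "not json"},
	}
	for _, c := range cases {
		srv := newFakeChatServer(c.status, c.content)
		role, err := extractRoleViaLLM(srv.Client(), srv.URL, "qwen2.5", "2 CNC Operator Detroit MI")
		fmt.Printf("role=%q err=%v\n", role, err)
		srv.Close()
	}
}
```

The second and third cases correspond to the _HTTPError and _BadJSON paths: both return a non-nil error with an empty role, which lets the caller treat the query as unresolved rather than caching a bad value.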
Verification — what real_004 (live harness) shows
Same 40-query stress file as real_003. With LLM_ROLE_EXTRACT=1:
| Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
|---|---|---|---|---|
| real_003 (regex only) | 7 | 1 (CNC) | 14 | YES — w-2404 leaked |
| real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
| real_004 (regex + LLM) | 5 | 0 | 16 | none observed |
real_004's shorthand-recordings = 0 is stochastic (HNSW build is non-deterministic across runs, so cold-pass top-1 varies and so does which queries trigger discovery). The LLM is doing work — boosts fired across all 4 styles for the Loaders + Packers + Shipping Clerk clusters, including shorthand queries. But the dataset didn't naturally produce a shorthand recording in this run, so we can't read a clean "with-vs-without" reality-test signal on the bleed.
This is why the unit-test witness exists. The reality test confirms no regressions; the unit test proves the extraction-layer correctness on the exact failure mode real_003 surfaced. Together they cover the fix.
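Why an empty role lets bleed through can be shown with a small sketch of the gate's decision. The roleGateAllows name, the case-insensitive comparison, and the "either side empty means allow" rule are assumptions; the text only confirms that the gate is disabled when both record and query roles are empty.

```go
package main

import (
	"fmt"
	"strings"
)

// roleGateAllows reports whether a recorded playbook entry may boost a query.
// Assumption: an empty role on either side means "unknown", so the gate has
// nothing to compare and cannot reject.
func roleGateAllows(recordRole, queryRole string) bool {
	if recordRole == "" || queryRole == "" {
		return true
	}
	return strings.EqualFold(recordRole, queryRole)
}

func main() {
	// Regex-only on shorthand: both roles empty, so a cross-role boost passes.
	fmt.Println(roleGateAllows("", "")) // true
	// With LLM extraction the roles differ, so the boost is rejected.
	fmt.Println(roleGateAllows("CNC Operator", "Forklift Operator")) // false
	// Same role in a different style still transfers.
	fmt.Println(roleGateAllows("Loaders", "loaders")) // true
}
```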
Cost note
LLM extraction adds ~1-3s per shorthand query on local qwen2.5. Per real_004 timing, the harness took ~10 minutes for 40 queries with LLM on (vs ~2 min in real_003b). The cache makes paraphrase + rejudge passes free; first-touch shorthand queries pay the LLM cost once.
For production hot paths (inbox parsing, retrieve), the LLM cost is
prohibitive. The right architecture there is to extract role at
INGEST time (when queries land in the inbox parser) and pass the
already-resolved role through the request — same pattern as
multi_coord_stress's existing Demand{Role: ...} shape. The harness
LLM extractor is for reality-test coverage of arbitrary query shapes;
production should never need it.
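A sketch of what "pass the already-resolved role through the request" could look like. Only the Demand{Role: ...} shape is named in the text; the Query field, the resolveRole helper, and the fallback parameter are hypothetical.

```go
package main

import "fmt"

// Demand is a hypothetical request shape; only the Role field is named in the text.
type Demand struct {
	Query string
	Role  string // resolved once at ingest time
}

// resolveRole prefers the ingest-time role. The fallback stands in for the
// harness extractor and is only consulted when the request carries no role.
func resolveRole(d Demand, fallback func(query string) string) string {
	if d.Role != "" {
		return d.Role
	}
	if fallback == nil {
		return ""
	}
	return fallback(d.Query)
}

func main() {
	slowCalls := 0
	slow := func(q string) string {
		slowCalls++
		return "forklift operator"
	}
	// Production-style request: role already attached, fallback untouched.
	fmt.Println(resolveRole(Demand{Query: "2 CNC Operator Detroit MI", Role: "cnc operator"}, slow))
	// Harness-style request: no role, so the fallback runs.
	fmt.Println(resolveRole(Demand{Query: "3 Forklift Operator Detroit MI"}, slow))
	fmt.Println("fallback calls:", slowCalls)
}
```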
Repro
# Generate the same 40-query stress file
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt
# With LLM extractor (closes shorthand hole)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
./scripts/playbook_lift.sh
# Without LLM extractor (regex-only — current default)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
Evidence: reports/reality-tests/playbook_lift_real_004.{json,md}.