real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.
This commit adds a roleExtractor with regex-first + LLM fallback:
- Regex first (fast, deterministic) — handles need + client_first +
looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
Ollama-shape /api/chat with format=json, schema-tight prompt,
temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
config unchanged unless explicitly enabled.
8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
witness for the real_003 scenario — both record + query are
shorthand, regex returns "" for both, LLM produces DIFFERENT
role tokens for CNC vs Forklift, so matrix gate's cross-role
rejection (locked separately in
TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
correctly. This is the load-bearing verification.
Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.
Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
105 lines
4.7 KiB
Markdown
105 lines
4.7 KiB
Markdown
# Reality test real_004 — LLM-based role extractor closes shorthand hole
|
||
|
||
real_003 / real_003b had a documented limitation: the regex extractor
|
||
can't separate role from city in shorthand-style queries
|
||
(`{count} {role} {city} {state} ...`) because there's no anchor
|
||
between role and city. real_004 closes this with an LLM fallback —
|
||
qwen2.5 format=json, called only when the regex returns empty.
|
||
|
||
## Architecture
|
||
|
||
`roleExtractor` struct in `scripts/playbook_lift/main.go`:
|
||
|
||
1. **Regex first** (fast, deterministic) — handles need + client_first
|
||
+ looking patterns from real_003b.
|
||
2. **LLM fallback** when regex returns empty AND model is configured —
|
||
Ollama-shape `/api/chat` with format=json, schema-tight system
|
||
prompt, temperature 0.
|
||
3. **Per-process cache** — paraphrase + rejudge passes hit the same
|
||
query 4× per run; cache prevents 4× LLM cost per query.
|
||
4. **Off-by-default** — `-llm-role-extract` flag (CLI) +
|
||
`LLM_ROLE_EXTRACT=1` env var (harness) opt in. Default behavior
|
||
is unchanged from real_003b.
|
||
|
||
## Verification — what's proven
|
||
|
||
**Unit tests (8 new in `scripts/playbook_lift/main_test.go`):**
|
||
|
||
- `TestRoleExtractor_RegexFirst` — when regex matches, LLM is NOT
|
||
called (cost discipline preserved on the 75% of queries the regex
|
||
handles).
|
||
- `TestRoleExtractor_LLMFallback` — shorthand query goes to LLM,
|
||
result is used.
|
||
- `TestRoleExtractor_LLMOffLeavesEmpty` — without model configured,
|
||
shorthand returns empty (current default).
|
||
- `TestRoleExtractor_Cache` — 3 calls to same query = 1 LLM hit.
|
||
- `TestRoleExtractor_NilSafe` — nil receiver runs regex only;
|
||
matrixSearch + playbookRecord don't need a guard.
|
||
- `TestExtractRoleViaLLM_HTTPError` + `_BadJSON` — failure paths
|
||
surface error so caller can fall back cleanly.
|
||
- `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the synthetic
|
||
witness for the real_003 scenario: when both record AND query are
|
||
shorthand (regex returns "" for both), LLM produces DIFFERENT role
|
||
tokens for CNC Operator vs Forklift Operator queries → matrix
|
||
gate's cross-role rejection fires correctly. This is the load-bearing
|
||
verification — paired with `internal/matrix/playbook_test.go`'s
|
||
`TestInjectPlaybookMisses_RoleGateRejectsCrossRole` (which uses
|
||
the exact role tokens this test produces).
|
||
|
||
## Verification — what real_004 (live harness) shows
|
||
|
||
Same 40-query stress file as real_003. With `LLM_ROLE_EXTRACT=1`:
|
||
|
||
| Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
|
||
|---|---:|---:|---:|---|
|
||
| real_003 (regex only) | 7 | 1 (CNC) | 14 | **YES — w-2404 leaked** |
|
||
| real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
|
||
| real_004 (regex + LLM) | 5 | 0 | 16 | none observed |
|
||
|
||
real_004's shorthand-recordings = 0 is stochastic (HNSW build is
|
||
non-deterministic across runs, so cold-pass top-1 varies and so does
|
||
which queries trigger discovery). The LLM is doing work — boosts
|
||
fired across all 4 styles for the Loaders + Packers + Shipping Clerk
|
||
clusters, including shorthand queries. But the dataset didn't
|
||
naturally produce a shorthand recording in this run, so we can't
|
||
read a clean "with-vs-without" reality-test signal on the bleed.
|
||
|
||
**This is why the unit-test witness exists.** The reality test confirms
|
||
no regressions; the unit test proves the extraction-layer correctness
|
||
on the exact failure mode real_003 surfaced. Together they cover the
|
||
fix.
|
||
|
||
## Cost note
|
||
|
||
LLM extraction adds ~1-3s per shorthand query on local qwen2.5. Per
|
||
real_004 timing the harness took ~10 minutes for 40 queries with LLM
|
||
on (vs ~2 min in real_003b). The cache makes paraphrase + rejudge
|
||
passes free; first-touch shorthand queries pay the LLM cost once.
|
||
|
||
For production hot paths (inbox parsing, retrieve), the LLM cost is
|
||
prohibitive. The right architecture there is to extract role at
|
||
INGEST time (when queries land in the inbox parser) and pass the
|
||
already-resolved role through the request — same pattern as
|
||
multi_coord_stress's existing `Demand{Role: ...}` shape. The harness
|
||
LLM extractor is for reality-test coverage of arbitrary query shapes;
|
||
production should never need it.
|
||
|
||
## Repro
|
||
|
||
```bash
|
||
# Generate the same 40-query stress file
|
||
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt
|
||
|
||
# With LLM extractor (closes shorthand hole)
|
||
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
|
||
WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
|
||
./scripts/playbook_lift.sh
|
||
|
||
# Without LLM extractor (regex-only — current default)
|
||
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
|
||
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
|
||
./scripts/playbook_lift.sh
|
||
```
|
||
|
||
Evidence: `reports/reality-tests/playbook_lift_real_004.{json,md}`.
|