golangLAKEHOUSE/reports/reality-tests/real_004_findings.md
root 0331288641 playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:51:27 -05:00

# Reality test real_004 — LLM-based role extractor closes shorthand hole
real_003 / real_003b had a documented limitation: the regex extractor
can't separate role from city in shorthand-style queries
(`{count} {role} {city} {state} ...`) because there's no anchor
between role and city. real_004 closes this with an LLM fallback —
qwen2.5 format=json, called only when the regex returns empty.
## Architecture
`roleExtractor` struct in `scripts/playbook_lift/main.go`:
1. **Regex first** (fast, deterministic): handles the need, client_first,
and looking patterns from real_003b.
2. **LLM fallback** when regex returns empty AND a model is configured:
Ollama-shape `/api/chat` with format=json, schema-tight system
prompt, temperature 0.
3. **Per-process cache**: paraphrase + rejudge passes hit the same
query 4× per run; the cache prevents paying the LLM cost 4× per query.
4. **Off-by-default**: opt in via the `-llm-role-extract` flag (CLI) and
the `LLM_ROLE_EXTRACT=1` env var (harness). Default behavior
is unchanged from real_003b.
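
The four points above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/playbook_lift/main.go`: the `llmFunc` type, the `Extract` method name, the struct fields, and the single `needRe` pattern are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// llmFunc is a hypothetical stand-in for the /api/chat call.
type llmFunc func(query string) (string, error)

// needRe is an illustrative "need"-style pattern only; the real_003b
// regex set is not reproduced here.
var needRe = regexp.MustCompile(`(?i)^need\s+\d+\s+(.+?)\s+in\s+`)

// roleExtractor tries the regex first, then an optional LLM, with a cache.
type roleExtractor struct {
	llm   llmFunc // nil means the LLM fallback is off (the default)
	mu    sync.Mutex
	cache map[string]string
}

// regexRole returns the lowercased role, or "" when no pattern matches.
func regexRole(q string) string {
	if m := needRe.FindStringSubmatch(q); m != nil {
		return strings.ToLower(strings.TrimSpace(m[1]))
	}
	return ""
}

// Extract is safe on a nil receiver: it then behaves as regex-only.
func (e *roleExtractor) Extract(q string) string {
	if role := regexRole(q); role != "" {
		return role // regex hit: no LLM cost paid
	}
	if e == nil || e.llm == nil {
		return "" // fallback disabled: shorthand stays unresolved
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	if role, ok := e.cache[q]; ok {
		return role // repeat query: served from the per-process cache
	}
	role, err := e.llm(q)
	if err != nil {
		return "" // errors are not cached, so a later call can retry
	}
	role = strings.ToLower(strings.TrimSpace(role))
	if e.cache == nil {
		e.cache = make(map[string]string)
	}
	e.cache[q] = role
	return role
}

func main() {
	calls := 0
	e := &roleExtractor{llm: func(q string) (string, error) {
		calls++
		return "CNC Operator", nil
	}}
	fmt.Println(e.Extract("Need 3 Forklift Operators in Detroit MI"))
	for i := 0; i < 3; i++ {
		fmt.Println(e.Extract("3 CNC Operator Detroit MI"))
	}
	fmt.Println("LLM calls:", calls)

	var off *roleExtractor // nil receiver: regex only
	fmt.Printf("%q\n", off.Extract("3 CNC Operator Detroit MI"))
}
```

Holding the mutex across the LLM call keeps the sketch short but serializes concurrent lookups; a harness that extracts roles in parallel would likely want a finer-grained scheme.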
## Verification — what's proven
**Unit tests (8 new in `scripts/playbook_lift/main_test.go`):**
- `TestRoleExtractor_RegexFirst` — when regex matches, LLM is NOT
called (cost discipline preserved on the 75% of queries the regex
handles).
- `TestRoleExtractor_LLMFallback` — shorthand query goes to LLM,
result is used.
- `TestRoleExtractor_LLMOffLeavesEmpty` — without model configured,
shorthand returns empty (current default).
- `TestRoleExtractor_Cache` — 3 calls to same query = 1 LLM hit.
- `TestRoleExtractor_NilSafe` — nil receiver runs regex only;
matrixSearch + playbookRecord don't need a guard.
- `TestExtractRoleViaLLM_HTTPError` + `_BadJSON` — failure paths
surface error so caller can fall back cleanly.
- `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the synthetic
witness for the real_003 scenario: when both record AND query are
shorthand (regex returns "" for both), LLM produces DIFFERENT role
tokens for CNC Operator vs Forklift Operator queries → matrix
gate's cross-role rejection fires correctly. This is the load-bearing
verification — paired with `internal/matrix/playbook_test.go`'s
`TestInjectPlaybookMisses_RoleGateRejectsCrossRole` (which uses
the exact role tokens this test produces).
## Verification — what real_004 (live harness) shows
Same 40-query stress file as real_003. real_004 ran with `LLM_ROLE_EXTRACT=1`;
the earlier runs are listed for comparison:

| Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
|---|---:|---:|---:|---|
| real_003 (regex only) | 7 | 1 (CNC) | 14 | **YES (w-2404 leaked)** |
| real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
| real_004 (regex + LLM) | 5 | 0 | 16 | none observed |

real_004's shorthand-recordings = 0 is stochastic (HNSW build is
non-deterministic across runs, so cold-pass top-1 varies and so does
which queries trigger discovery). The LLM is doing work — boosts
fired across all 4 styles for the Loaders + Packers + Shipping Clerk
clusters, including shorthand queries. But the dataset didn't
naturally produce a shorthand recording in this run, so we can't
read a clean "with-vs-without" reality-test signal on the bleed.

**This is why the unit-test witness exists.** The reality test confirms
no regressions; the unit test proves the extraction-layer correctness
on the exact failure mode real_003 surfaced. Together they cover the
fix.
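
The failure mode and its fix can be shown in a few lines. The sketch below is not the gate in `internal/matrix`; `roleGateAllows` is a hypothetical name, and the rule that an empty role on either side disables the comparison is an assumption extrapolated from the "both shorthand" case described above.

```go
package main

import (
	"fmt"
	"strings"
)

// roleGateAllows reports whether a recorded playbook entry may boost a query.
// Assumption: with no role on either side there is nothing to compare,
// so the gate lets the entry through.
func roleGateAllows(recordRole, queryRole string) bool {
	if recordRole == "" || queryRole == "" {
		return true // gate effectively disabled
	}
	return strings.EqualFold(recordRole, queryRole)
}

func main() {
	// Regex-only on shorthand: both roles are "", so a CNC recording
	// can bleed into a Forklift query.
	fmt.Println("regex only:", roleGateAllows("", ""))

	// With LLM extraction the roles differ, so the gate rejects.
	fmt.Println("with LLM:  ", roleGateAllows("cnc operator", "forklift operator"))

	// Same role across styles still passes.
	fmt.Println("same role: ", roleGateAllows("loaders", "Loaders"))
}
```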
## Cost note
LLM extraction adds ~1-3s per shorthand query on local qwen2.5. In
real_004 the harness took ~10 minutes for 40 queries with the LLM
on (vs ~2 min in real_003b). The cache makes repeat lookups in the
paraphrase + rejudge passes free; first-touch shorthand queries pay
the LLM cost once.

For production hot paths (inbox parsing, retrieve), the LLM cost is
prohibitive. The right architecture there is to extract role at
INGEST time (when queries land in the inbox parser) and pass the
already-resolved role through the request — same pattern as
multi_coord_stress's existing `Demand{Role: ...}` shape. The harness
LLM extractor is for reality-test coverage of arbitrary query shapes;
production should never need it.
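
A rough sketch of that ingest-time pattern follows. Only `Demand` and its `Role` field come from the text above; the `Raw` field, the `roleResolver` type, and the `ingest` and `retrieve` functions are hypothetical names chosen for the example.

```go
package main

import "fmt"

// Demand mirrors the Demand{Role: ...} shape; Raw is a hypothetical field.
type Demand struct {
	Raw  string
	Role string // resolved once, at ingest
}

// roleResolver stands in for whatever the inbox parser uses to resolve a role.
type roleResolver func(raw string) string

// ingest pays the resolution cost a single time per incoming query.
func ingest(raw string, resolve roleResolver) Demand {
	return Demand{Raw: raw, Role: resolve(raw)}
}

// retrieve is the hot path: it reads d.Role and never calls a resolver.
func retrieve(d Demand) string {
	return fmt.Sprintf("search role=%q query=%q", d.Role, d.Raw)
}

func main() {
	calls := 0
	resolve := func(raw string) string {
		calls++
		return "cnc operator"
	}
	d := ingest("3 CNC Operator Detroit MI", resolve)
	for i := 0; i < 3; i++ {
		fmt.Println(retrieve(d))
	}
	fmt.Println("resolver calls:", calls)
}
```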
## Repro
```bash
# Generate the same 40-query stress file
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt
# With LLM extractor (closes shorthand hole)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
./scripts/playbook_lift.sh
# Without LLM extractor (regex-only — current default)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_004.{json,md}`.