# Reality test real_004 — LLM-based role extractor closes shorthand hole

real_003 / real_003b had a documented limitation: the regex extractor
can't separate role from city in shorthand-style queries
(`{count} {role} {city} {state} ...`) because there's no anchor
between role and city. real_004 closes this with an LLM fallback —
qwen2.5 format=json, called only when the regex returns empty.

## Architecture

`roleExtractor` struct in `scripts/playbook_lift/main.go`:

1. **Regex first** (fast, deterministic) — handles need + client_first
   + looking patterns from real_003b.
2. **LLM fallback** when regex returns empty AND model is configured —
   Ollama-shape `/api/chat` with format=json, schema-tight system
   prompt, temperature 0.
3. **Per-process cache** — paraphrase + rejudge passes hit the same
   query 4× per run; cache prevents 4× LLM cost per query.
4. **Off-by-default** — `-llm-role-extract` flag (CLI) +
   `LLM_ROLE_EXTRACT=1` env var (harness) opt in. Default behavior
   is unchanged from real_003b.
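
The four points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `scripts/playbook_lift/main.go`: the `llmFunc` type, the `Extract` method name, the `calls` counter, and the single `needRe` pattern are all hypothetical stand-ins for the real struct fields and the real_003b regex set.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// llmFunc stands in for the Ollama-shape chat call (hypothetical signature).
type llmFunc func(query string) (string, error)

// roleExtractor is an illustrative stand-in, not the struct in main.go.
type roleExtractor struct {
	llm   llmFunc // nil means the LLM fallback is off, which is the default
	mu    sync.Mutex
	cache map[string]string
	calls int // LLM calls made, kept only so the demo can show cache hits
}

// needRe is a hypothetical stand-in for the real_003b regex patterns.
var needRe = regexp.MustCompile(`(?i)\bneed\s+\d+\s+(.+?)\s+in\b`)

// regexRole returns the lowercased role, or "" when no pattern matches.
func regexRole(q string) string {
	if m := needRe.FindStringSubmatch(q); m != nil {
		return strings.ToLower(m[1])
	}
	return ""
}

// Extract tries the regex first and only then the cached LLM fallback.
// It is safe on a nil receiver, which degrades to regex-only behavior.
func (e *roleExtractor) Extract(q string) string {
	if role := regexRole(q); role != "" {
		return role
	}
	if e == nil || e.llm == nil {
		return ""
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	if role, ok := e.cache[q]; ok {
		return role
	}
	e.calls++
	role, err := e.llm(q)
	if err != nil {
		return "" // errors are not cached, so a later call can retry
	}
	if e.cache == nil {
		e.cache = make(map[string]string)
	}
	e.cache[q] = role
	return role
}

func main() {
	e := &roleExtractor{llm: func(string) (string, error) { return "cnc operator", nil }}
	fmt.Println(e.Extract("Need 3 Forklift Operators in Detroit MI")) // regex path
	for i := 0; i < 3; i++ {
		fmt.Println(e.Extract("4 CNC Operator Detroit MI")) // LLM once, then cache
	}
	fmt.Println("LLM calls:", e.calls)
}
```

Checking the regex before touching the receiver is what lets a nil extractor behave exactly like the regex-only default.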

## Verification — what's proven

**Unit tests (8 new in `scripts/playbook_lift/main_test.go`):**

- `TestRoleExtractor_RegexFirst` — when regex matches, LLM is NOT
  called (cost discipline preserved on the 75% of queries the regex
  handles).
- `TestRoleExtractor_LLMFallback` — shorthand query goes to LLM,
  result is used.
- `TestRoleExtractor_LLMOffLeavesEmpty` — without model configured,
  shorthand returns empty (current default).
- `TestRoleExtractor_Cache` — 3 calls to same query = 1 LLM hit.
- `TestRoleExtractor_NilSafe` — nil receiver runs regex only;
  matrixSearch + playbookRecord don't need a guard.
- `TestExtractRoleViaLLM_HTTPError` and `TestExtractRoleViaLLM_BadJSON`
  cover the failure paths: each surfaces an error so the caller can
  fall back cleanly.
- `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the synthetic
  witness for the real_003 scenario: when both record AND query are
  shorthand (regex returns "" for both), LLM produces DIFFERENT role
  tokens for CNC Operator vs Forklift Operator queries → matrix
  gate's cross-role rejection fires correctly. This is the load-bearing
  verification — paired with `internal/matrix/playbook_test.go`'s
  `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` (which uses
  the exact role tokens this test produces).

## Verification — what real_004 (live harness) shows

Same 40-query stress file as real_003. With `LLM_ROLE_EXTRACT=1`:

| Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
|---|---:|---:|---:|---|
| real_003  (regex only) | 7  | 1 (CNC) | 14 | **YES — w-2404 leaked** |
| real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
| real_004  (regex + LLM)    | 5  | 0 | 16 | none observed |

real_004's shorthand-recordings count of 0 reflects run-to-run
variance, not the fix itself: the HNSW build is non-deterministic
across runs, so the cold-pass top-1 varies, and with it which queries
trigger discovery. The LLM is doing work: boosts fired across all 4
styles for the Loaders, Packers, and Shipping Clerk clusters,
including shorthand queries. But this run did not happen to produce a
shorthand recording, so it gives no clean "with-vs-without"
reality-test signal on the bleed.

**This is why the unit-test witness exists.** The reality test confirms
no regressions; the unit test proves the extraction-layer correctness
on the exact failure mode real_003 surfaced. Together they cover the
fix.
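
The failure mode and its fix can be illustrated with a toy role gate. This sketch assumes the gate can only reject when both roles are known, which is how an empty role on both sides would let a cross-role record through. `roleGateAllows` is a hypothetical name and does not reproduce the logic in `internal/matrix`.

```go
package main

import "fmt"

// roleGateAllows is a hypothetical stand-in for the matrix role gate.
// Assumption: an empty role means "unknown", and the gate cannot reject
// a pairing it knows nothing about.
func roleGateAllows(recordRole, queryRole string) bool {
	if recordRole == "" || queryRole == "" {
		return true
	}
	return recordRole == queryRole
}

func main() {
	// Regex-only on shorthand: both roles are empty, so the record passes.
	fmt.Println(roleGateAllows("", ""))
	// With LLM extraction: distinct role tokens, so the gate rejects.
	fmt.Println(roleGateAllows("cnc operator", "forklift operator"))
	// Same role on both sides still passes.
	fmt.Println(roleGateAllows("cnc operator", "cnc operator"))
}
```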

## Cost note

LLM extraction adds ~1-3s per shorthand query on local qwen2.5. Per
real_004 timing the harness took ~10 minutes for 40 queries with LLM
on (vs ~2 min in real_003b). The cache makes paraphrase + rejudge
passes free; first-touch shorthand queries pay the LLM cost once.

For production hot paths (inbox parsing, retrieve), the LLM cost is
prohibitive. The right architecture there is to extract role at
INGEST time (when queries land in the inbox parser) and pass the
already-resolved role through the request — same pattern as
multi_coord_stress's existing `Demand{Role: ...}` shape. The harness
LLM extractor is for reality-test coverage of arbitrary query shapes;
production should never need it.
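
A rough sketch of the ingest-time pattern is below. Only the `Demand{Role: ...}` shape comes from the text above. The other fields, the `ingest` and `retrieveKey` functions, and the `resolve` callback are hypothetical and exist only to show the role being resolved once and then carried through the request.

```go
package main

import "fmt"

// Demand mirrors the Demand{Role: ...} shape; City and Count are
// hypothetical fields added for the example.
type Demand struct {
	Role  string
	City  string
	Count int
}

// ingest resolves the role once, at the point the query lands.
// resolve stands in for whatever extractor runs at ingest time.
func ingest(raw, city string, count int, resolve func(string) string) Demand {
	return Demand{Role: resolve(raw), City: city, Count: count}
}

// retrieveKey represents a hot path: it reads the already-resolved role
// and never calls an extractor.
func retrieveKey(d Demand) string {
	return fmt.Sprintf("%s|%s", d.Role, d.City)
}

func main() {
	resolveCalls := 0
	resolve := func(string) string {
		resolveCalls++
		return "cnc operator"
	}
	d := ingest("4 CNC Operator Detroit MI", "Detroit", 4, resolve)
	for i := 0; i < 3; i++ {
		fmt.Println(retrieveKey(d))
	}
	fmt.Println("resolve calls:", resolveCalls)
}
```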

## Repro

```bash
# Generate the same 40-query stress file
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt

# With LLM extractor (closes shorthand hole)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
  ./scripts/playbook_lift.sh

# Without LLM extractor (regex-only — current default)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_004.{json,md}`.