golangLAKEHOUSE/reports/reality-tests/real_004_findings.md
root 0331288641 playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)
real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:51:27 -05:00


Reality test real_004 — LLM-based role extractor closes shorthand hole

real_003 / real_003b had a documented limitation: the regex extractor can't separate role from city in shorthand-style queries ({count} {role} {city} {state} ...) because there's no anchor between role and city. real_004 closes this with an LLM fallback — qwen2.5 format=json, called only when the regex returns empty.

Architecture

roleExtractor struct in scripts/playbook_lift/main.go:

  1. Regex first (fast, deterministic): handles the need + client_first + looking patterns from real_003b.
  2. LLM fallback when regex returns empty AND model is configured: Ollama-shape /api/chat with format=json, schema-tight system prompt, temperature 0.
  3. Per-process cache: paraphrase + rejudge passes hit the same query 4× per run; cache prevents 4× LLM cost per query.
  4. Off by default: the -llm-role-extract flag (CLI) + LLM_ROLE_EXTRACT=1 env var (harness) opt in. Default behavior is unchanged from real_003b.
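
The four points above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/playbook_lift/main.go: the `needPattern` regex, the `llm` function field, and the `Extract` method name are assumptions made for the example, and the real extractor covers more query styles.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// needPattern is a hypothetical stand-in for the real_003b regexes; it only
// recognises "need <count> <role> in <city> ..." style queries.
var needPattern = regexp.MustCompile(`(?i)^need\s+\d+\s+(.+?)\s+in\s+`)

// roleExtractor tries the regex first and calls the LLM only when the regex
// finds nothing and a fallback is configured.
type roleExtractor struct {
	llm   func(query string) (string, error) // nil means LLM extraction is off
	mu    sync.Mutex
	cache map[string]string // per-process cache keyed by the raw query
}

func regexRole(query string) string {
	if m := needPattern.FindStringSubmatch(query); m != nil {
		return strings.ToLower(strings.TrimSpace(m[1]))
	}
	return ""
}

// Extract is safe on a nil receiver, in which case only the regex runs.
func (e *roleExtractor) Extract(query string) string {
	if role := regexRole(query); role != "" {
		return role // regex hit: no LLM cost paid
	}
	if e == nil || e.llm == nil {
		return "" // opt-in default: shorthand stays unresolved
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	if role, ok := e.cache[query]; ok {
		return role
	}
	role, err := e.llm(query)
	if err != nil {
		return "" // treat failure as "no role"; failures are not cached
	}
	if e.cache == nil {
		e.cache = make(map[string]string)
	}
	e.cache[query] = role
	return role
}

func main() {
	calls := 0
	e := &roleExtractor{llm: func(string) (string, error) {
		calls++
		return "forklift operator", nil
	}}
	fmt.Println(e.Extract("need 3 Welders in Detroit MI"))
	for i := 0; i < 3; i++ {
		fmt.Println(e.Extract("2 Forklift Operator Detroit MI"))
	}
	fmt.Println("LLM calls:", calls)
}
```

Holding the mutex across the LLM call keeps the sketch short but serialises concurrent lookups; a harness that extracts roles in parallel would likely want a finer-grained scheme.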

Verification — what's proven

Unit tests (8 new in scripts/playbook_lift/main_test.go):

  • TestRoleExtractor_RegexFirst — when regex matches, LLM is NOT called (cost discipline preserved on the 75% of queries the regex handles).
  • TestRoleExtractor_LLMFallback — shorthand query goes to LLM, result is used.
  • TestRoleExtractor_LLMOffLeavesEmpty — without model configured, shorthand returns empty (current default).
  • TestRoleExtractor_Cache — 3 calls to same query = 1 LLM hit.
  • TestRoleExtractor_NilSafe — nil receiver runs regex only; matrixSearch + playbookRecord don't need a guard.
  • TestExtractRoleViaLLM_HTTPError + _BadJSON — failure paths surface error so caller can fall back cleanly.
  • TestRoleExtractor_ClosesCrossRoleShorthandBleed — the synthetic witness for the real_003 scenario: when both record AND query are shorthand (regex returns "" for both), LLM produces DIFFERENT role tokens for CNC Operator vs Forklift Operator queries → matrix gate's cross-role rejection fires correctly. This is the load-bearing verification — paired with internal/matrix/playbook_test.go's TestInjectPlaybookMisses_RoleGateRejectsCrossRole (which uses the exact role tokens this test produces).
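
For orientation, the LLM call and the two failure paths covered by TestExtractRoleViaLLM_HTTPError and _BadJSON might look like the sketch below. The request fields follow the Ollama /api/chat shape (format=json, temperature 0, non-streaming). The system prompt text, the `{"role": ...}` reply schema, the function signature, and the `parseRoleContent` helper are assumptions for illustration, not the harness's actual code.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string         `json:"model"`
	Messages []chatMessage  `json:"messages"`
	Stream   bool           `json:"stream"`
	Format   string         `json:"format"`
	Options  map[string]any `json:"options"`
}

type chatResponse struct {
	Message chatMessage `json:"message"`
}

// systemPrompt is a hypothetical prompt; the real one is not shown in this report.
const systemPrompt = `Extract the job role from the staffing query. Reply with JSON only: {"role": "<role>"}.`

// parseRoleContent decodes the model's JSON reply (assumed schema) and
// normalises the role token.
func parseRoleContent(content string) (string, error) {
	var out struct {
		Role string `json:"role"`
	}
	if err := json.Unmarshal([]byte(content), &out); err != nil {
		return "", fmt.Errorf("decode role JSON: %w", err)
	}
	return strings.ToLower(strings.TrimSpace(out.Role)), nil
}

// extractRoleViaLLM returns an error on transport failure, a non-200 status,
// or malformed JSON, so the caller can fall back to an empty role.
func extractRoleViaLLM(ctx context.Context, client *http.Client, baseURL, model, query string) (string, error) {
	body, err := json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "system", Content: systemPrompt},
			{Role: "user", Content: query},
		},
		Stream:  false,
		Format:  "json",
		Options: map[string]any{"temperature": 0},
	})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/api/chat", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("chat endpoint returned %s", resp.Status)
	}
	var cr chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&cr); err != nil {
		return "", fmt.Errorf("decode chat response: %w", err)
	}
	return parseRoleContent(cr.Message.Content)
}

func main() {
	role, err := parseRoleContent(`{"role": "Forklift Operator"}`)
	fmt.Println(role, err)
	_, err = parseRoleContent(`not json`)
	fmt.Println("bad JSON rejected:", err != nil)
}
```

Keeping the reply parsing in its own function lets the bad-JSON path be tested without a server, while the HTTP-error path can be exercised with a stub endpoint.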

Verification — what real_004 (live harness) shows

Same 40-query stress file as real_003. With LLM_ROLE_EXTRACT=1:

| Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
|---|---|---|---|---|
| real_003 (regex only) | 7 | 1 (CNC) | 14 | YES (w-2404 leaked) |
| real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
| real_004 (regex + LLM) | 5 | 0 | 16 | none observed |

real_004's shorthand-recordings = 0 is stochastic (HNSW build is non-deterministic across runs, so cold-pass top-1 varies and so does which queries trigger discovery). The LLM is doing work — boosts fired across all 4 styles for the Loaders + Packers + Shipping Clerk clusters, including shorthand queries. But the dataset didn't naturally produce a shorthand recording in this run, so we can't read a clean "with-vs-without" reality-test signal on the bleed.

This is why the unit-test witness exists. The reality test confirms no regressions; the unit test proves the extraction-layer correctness on the exact failure mode real_003 surfaced. Together they cover the fix.
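
To make the failure mode concrete, here is a toy version of a cross-role gate. It assumes the gate is skipped whenever either role is empty, which is one reading of "disabled when both record AND query are shorthand". The function name and exact rule are hypothetical; the real gate lives in internal/matrix and may differ.

```go
package main

import "fmt"

// roleGateAllows is a hypothetical gate: it rejects a playbook record only
// when both roles are known and differ. An empty role on either side means
// there is nothing to compare, so the gate lets the record through.
func roleGateAllows(recordRole, queryRole string) bool {
	if recordRole == "" || queryRole == "" {
		return true // gate effectively disabled
	}
	return recordRole == queryRole
}

func main() {
	// Regex only: both shorthand texts extract to "", so a CNC record can
	// leak into a Forklift query.
	fmt.Println(roleGateAllows("", ""))
	// With the LLM fallback, both sides carry distinct role tokens and the
	// gate rejects the cross-role record.
	fmt.Println(roleGateAllows("cnc operator", "forklift operator"))
	// Same role across styles is still allowed.
	fmt.Println(roleGateAllows("forklift operator", "forklift operator"))
}
```

Under this reading, the extractor's job is simply to make sure neither side reaches the gate with an empty role.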

Cost note

LLM extraction adds ~1-3s per shorthand query on local qwen2.5. Per real_004 timing, the harness took ~10 minutes for 40 queries with LLM on (vs ~2 min in real_003b). The cache makes paraphrase + rejudge passes free; each first-touch shorthand query pays the LLM cost once.

For production hot paths (inbox parsing, retrieve), the LLM cost is prohibitive. The right architecture there is to extract role at INGEST time (when queries land in the inbox parser) and pass the already-resolved role through the request — same pattern as multi_coord_stress's existing Demand{Role: ...} shape. The harness LLM extractor is for reality-test coverage of arbitrary query shapes; production should never need it.
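
A rough sketch of that ingest-time pattern is below, assuming a `Demand` struct that carries the resolved role. Only the `Role` field is named in this report; the `Query` field, `ingest`, `resolveRole`, and `sameRole` are hypothetical names for the example.

```go
package main

import "fmt"

// Demand mirrors the Demand{Role: ...} shape mentioned above. Every field
// other than Role is hypothetical.
type Demand struct {
	Role  string
	Query string // hypothetical: raw query text kept for retrieval
}

// ingest resolves the role once, at the point where a parser already runs.
// resolveRole is a hypothetical stand-in for that parser's role output.
func ingest(raw string, resolveRole func(string) string) Demand {
	return Demand{Role: resolveRole(raw), Query: raw}
}

// sameRole is what a hot-path check reduces to once the role travels with
// the request: a string comparison, with no extractor call.
func sameRole(d Demand, recordRole string) bool {
	return d.Role != "" && d.Role == recordRole
}

func main() {
	resolves := 0
	d := ingest("2 Forklift Operator Detroit MI", func(string) string {
		resolves++
		return "forklift operator"
	})
	// Repeated hot-path checks reuse the resolved role.
	fmt.Println(sameRole(d, "forklift operator"), sameRole(d, "cnc operator"))
	fmt.Println("role resolutions:", resolves)
}
```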

Repro

# Generate the same 40-query stress file
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt

# With LLM extractor (closes shorthand hole)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
  ./scripts/playbook_lift.sh

# Without LLM extractor (regex-only — current default)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh

Evidence: reports/reality-tests/playbook_lift_real_004.{json,md}.