playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)

real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract — leaving the
cross-role gate disabled when both record AND query are shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: root
Date: 2026-04-30 22:51:27 -05:00
Parent: 3263254f1c
Commit: 0331288641
6 changed files with 572 additions and 9 deletions


@@ -264,6 +264,7 @@ The list is intentionally short. Items move to closed when the work demands them
| (probe) | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Findings: `reports/reality-tests/real_001_findings.md`. |
| (fix) | Cross-role gate: `Role` on `PlaybookEntry`, `QueryRole` on `SearchRequest`, gate fires in both `ApplyPlaybookBoost` + `InjectPlaybookMisses`. `roleEqual` handles case + plural. Backward-compat: empty role on either side = gate disabled (preserves lift suite + free-form callers). 5 new unit tests use exact real_001 distance + role values. Re-run real_002: bleed closed (Q#5 Pickers, Q#10 CNC Operator stay at cold-pass top-1; same-role lifts still fire). Closes OPEN #1. Findings: `reports/reality-tests/real_002_findings.md`. |
| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@@ -0,0 +1,116 @@
# Playbook-Lift Reality Test — Run real_004
**Generated:** 2026-05-01T03:48:39.932829485Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/real_coord_queries_v2.txt` (40 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_004.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 40 |
| Cold-pass discoveries (judge-best ≠ top-1) | 5 |
| Warm-pass lifts (recorded playbook → top-1) | 4 |
| No change (judge-best already top-1, no playbook needed) | 36 |
| Playbook boosts triggered (warm pass) | 16 |
| Mean Δ top-1 distance (warm − cold) | -0.11025716 |
**Verbatim lift rate:** 4 of 5 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | w-3288 | 0/4 | — | w-3288 | 0 | no |
| 2 | Parallel Machining needs 5 Warehouse Associates in Kansas Ci | w-3288 | 0/4 | — | w-3288 | 0 | no |
| 3 | Looking for 5 Warehouse Associates at Parallel Machining in | w-3288 | 0/4 | — | w-3288 | 0 | no |
| 4 | 5 Warehouse Associates Kansas City MO 09:00 Parallel Machini | w-3288 | 0/4 | — | w-3288 | 0 | no |
| 5 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | w-2136 | 0/5 | — | w-2136 | 0 | no |
| 6 | Beacon Freight needs 1 Forklift Operator in Detroit MI at 15 | e-6723 | 2/2 | — | e-6723 | 2 | no |
| 7 | Looking for 1 Forklift Operator at Beacon Freight in Detroit | w-4766 | 0/5 | — | w-4766 | 0 | no |
| 8 | 1 Forklift Operator Detroit MI 15:00 Beacon Freight | e-5818 | 0/4 | — | e-5818 | 0 | no |
| 9 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-9877 | 0/4 | — | w-398 | 2 | no |
| 10 | Midway Distribution needs 4 Loaders in Indianapolis IN at 12 | e-4537 | 4/5 | ✓ w-2148 | w-398 | 1 | no |
| 11 | Looking for 4 Loaders at Midway Distribution in Indianapolis | w-962 | 6/4 | ✓ w-398 | w-398 | 0 | **YES** |
| 12 | 4 Loaders Indianapolis IN 12:00 Midway Distribution | e-9877 | 0/4 | — | w-2148 | 2 | no |
| 13 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | w-2857 | 1/4 | ✓ w-1063 | w-1063 | 0 | **YES** |
| 14 | Cornerstone Fabrication needs 3 Warehouse Associates in Fort | w-4703 | 0/2 | — | w-1063 | 1 | no |
| 15 | Looking for 3 Warehouse Associates at Cornerstone Fabricatio | w-1784 | 0/3 | — | w-1063 | 1 | no |
| 16 | 3 Warehouse Associates Fort Wayne IN 17:30 Cornerstone Fabri | w-2754 | 2/3 | — | w-1063 | 3 | no |
| 17 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | e-2247 | 1/2 | — | e-2247 | 1 | no |
| 18 | Beacon Freight needs 4 Pickers in Detroit MI at 13:30 | e-2247 | 2/2 | — | e-2247 | 2 | no |
| 19 | Looking for 4 Pickers at Beacon Freight in Detroit MI for 13 | e-2247 | 1/2 | — | e-2247 | 1 | no |
| 20 | 4 Pickers Detroit MI 13:30 Beacon Freight | e-2247 | 0/3 | — | e-2247 | 0 | no |
| 21 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-846 | 2/3 | — | e-2120 | 0 | no |
| 22 | Parallel Machining needs 2 Packers in Joliet IL at 09:30 | e-846 | 2/4 | ✓ e-2120 | e-2120 | 0 | **YES** |
| 23 | Looking for 2 Packers at Parallel Machining in Joliet IL for | e-846 | 1/2 | — | e-2120 | 2 | no |
| 24 | 2 Packers Joliet IL 09:30 Parallel Machining | e-846 | 2/3 | — | e-2120 | 0 | no |
| 25 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-4124 | 9/3 | — | w-4124 | 9 | no |
| 26 | Heritage Foods needs 3 Assemblers in Flint MI at 08:30 | w-4124 | 3/3 | — | w-4124 | 3 | no |
| 27 | Looking for 3 Assemblers at Heritage Foods in Flint MI for 0 | w-4124 | 2/2 | — | w-4124 | 2 | no |
| 28 | 3 Assemblers Flint MI 08:30 Heritage Foods | w-4124 | 3/3 | — | w-4124 | 3 | no |
| 29 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-6019 | 8/2 | — | e-6019 | 8 | no |
| 30 | Parallel Machining needs 3 Packers in Flint MI at 12:30 | e-6019 | 9/2 | — | e-6019 | 9 | no |
| 31 | Looking for 3 Packers at Parallel Machining in Flint MI for | e-6019 | 0/1 | — | e-6019 | 0 | no |
| 32 | 3 Packers Flint MI 12:30 Parallel Machining | e-2006 | 0/3 | — | e-2006 | 0 | no |
| 33 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 2/4 | ✓ w-1367 | w-1367 | 0 | **YES** |
| 34 | Pioneer Assembly needs 1 Shipping Clerk in Flint MI at 17:00 | w-3988 | 1/3 | — | w-1367 | 2 | no |
| 35 | Looking for 1 Shipping Clerk at Pioneer Assembly in Flint MI | w-3988 | 0/2 | — | w-1367 | 1 | no |
| 36 | 1 Shipping Clerk Flint MI 17:00 Pioneer Assembly | e-4849 | 0/2 | — | w-1367 | 1 | no |
| 37 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | e-489 | 5/3 | — | e-489 | 5 | no |
| 38 | Beacon Freight needs 1 CNC Operator in Detroit MI at 17:30 | e-489 | 0/2 | — | e-489 | 0 | no |
| 39 | Looking for 1 CNC Operator at Beacon Freight in Detroit MI f | e-489 | 0/2 | — | e-489 | 0 | no |
| 40 | 1 CNC Operator Detroit MI 17:30 Beacon Freight | e-489 | 3/3 | — | e-489 | 3 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If the judge model rates badly,
the lift number is meaningless. To validate the judge itself, sample 5-10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
config [models].local_judge.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
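Caveat 2's boost math can be sanity-checked numerically; the distances below are made up for illustration, not taken from this run:

```go
package main

import "fmt"

// boosted applies the playbook formula from caveat 2:
// distance' = distance × (1 - 0.5 × score). A score-1.0 hit halves
// the distance, so the judge-best result is only promoted when its
// pre-boost distance is within 2× of the cold top-1's.
func boosted(distance, score float32) float32 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := float32(0.30) // hypothetical cold-pass top-1 distance
	// Judge-best at 0.55 (within 2× of 0.30): halving promotes it.
	fmt.Println(boosted(0.55, 1.0) < coldTop1) // true
	// Judge-best at 0.65 (outside 2×): even a full halving falls short.
	fmt.Println(boosted(0.65, 1.0) < coldTop1) // false
}
```

This is why tight clusters show little visible lift: the closer the cold results already are, the smaller the window in which a boost can reorder them.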
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate whether the cause is judge variance, a
  distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5×
  formula may need retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@@ -0,0 +1,104 @@
# Reality test real_004 — LLM-based role extractor closes shorthand hole
real_003 / real_003b had a documented limitation: the regex extractor
can't separate role from city in shorthand-style queries
(`{count} {role} {city} {state} ...`) because there's no anchor
between role and city. real_004 closes this with an LLM fallback —
qwen2.5 format=json, called only when the regex returns empty.
## Architecture
`roleExtractor` struct in `scripts/playbook_lift/main.go`:
1. **Regex first** (fast, deterministic) — handles need + client_first
+ looking patterns from real_003b.
2. **LLM fallback** when regex returns empty AND model is configured —
Ollama-shape `/api/chat` with format=json, schema-tight system
prompt, temperature 0.
3. **Per-process cache** — paraphrase + rejudge passes hit the same
query 4× per run; cache prevents 4× LLM cost per query.
4. **Off-by-default** — the `-llm-role-extract` flag (CLI) +
   `LLM_ROLE_EXTRACT=1` env var (harness) opt in. Default behavior
   is unchanged from real_003b.
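The precedence the extractor enforces can be sketched in a few lines of Go. The stubs below are hypothetical stand-ins (the real `roleExtractor` lives in `scripts/playbook_lift/main.go`); what matters is the order: regex first, LLM only on a regex miss, both outcomes cached.

```go
package main

import "fmt"

// extractSketch mirrors the extractor's precedence with stub functions:
// cache hit wins, then the deterministic regex, then (only on a regex
// miss) the LLM. Empty results are cached too, so repeat passes never
// re-pay the LLM cost.
func extractSketch(cache map[string]string, regex, llm func(string) string, q string) string {
	if v, ok := cache[q]; ok {
		return v // paraphrase/rejudge passes reuse the same query
	}
	role := regex(q) // fast, deterministic path
	if role == "" && llm != nil {
		role = llm(q) // paid only when the regex can't anchor the role
	}
	cache[q] = role // cache even empty results
	return role
}

func main() {
	llmCalls := 0
	regex := func(q string) string { return "" } // shorthand: regex always misses
	llm := func(q string) string { llmCalls++; return "CNC Operator" }
	cache := map[string]string{}
	for i := 0; i < 3; i++ {
		extractSketch(cache, regex, llm, "1 CNC Operator Detroit MI 17:30 Beacon Freight")
	}
	fmt.Println(llmCalls) // 1 — the cache absorbed the repeats
}
```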
## Verification — what's proven
**Unit tests (8 new in `scripts/playbook_lift/main_test.go`):**
- `TestRoleExtractor_RegexFirst` — when regex matches, LLM is NOT
called (cost discipline preserved on the 75% of queries the regex
handles).
- `TestRoleExtractor_LLMFallback` — shorthand query goes to LLM,
result is used.
- `TestRoleExtractor_LLMOffLeavesEmpty` — without model configured,
shorthand returns empty (current default).
- `TestRoleExtractor_Cache` — 3 calls to same query = 1 LLM hit.
- `TestRoleExtractor_NilSafe` — nil receiver runs regex only;
matrixSearch + playbookRecord don't need a guard.
- `TestExtractRoleViaLLM_HTTPError` + `_BadJSON` — failure paths
surface error so caller can fall back cleanly.
- `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the synthetic
witness for the real_003 scenario: when both record AND query are
shorthand (regex returns "" for both), LLM produces DIFFERENT role
tokens for CNC Operator vs Forklift Operator queries → matrix
gate's cross-role rejection fires correctly. This is the load-bearing
verification — paired with `internal/matrix/playbook_test.go`'s
`TestInjectPlaybookMisses_RoleGateRejectsCrossRole` (which uses
the exact role tokens this test produces).
## Verification — what real_004 (live harness) shows
Same 40-query stress file as real_003. With `LLM_ROLE_EXTRACT=1`:
| Run | recordings | shorthand recordings | boosts | Detroit cluster bleed |
|---|---:|---:|---:|---|
| real_003 (regex only) | 7 | 1 (CNC) | 14 | **YES — w-2404 leaked** |
| real_003b (extended regex) | 11 | 1 (Pickers) | 31 | none observed |
| real_004 (regex + LLM) | 5 | 0 | 16 | none observed |
real_004's shorthand-recordings = 0 is stochastic (HNSW build is
non-deterministic across runs, so cold-pass top-1 varies and so does
which queries trigger discovery). The LLM is doing work — boosts
fired across all 4 styles for the Loaders + Packers + Shipping Clerk
clusters, including shorthand queries. But the dataset didn't
naturally produce a shorthand recording in this run, so we can't
read a clean "with-vs-without" reality-test signal on the bleed.
**This is why the unit-test witness exists.** The reality test confirms
no regressions; the unit test proves the extraction-layer correctness
on the exact failure mode real_003 surfaced. Together they cover the
fix.
## Cost note
LLM extraction adds ~1-3s per shorthand query on local qwen2.5. Per
real_004 timing the harness took ~10 minutes for 40 queries with LLM
on (vs ~2 min in real_003b). The cache makes paraphrase + rejudge
passes free; first-touch shorthand queries pay the LLM cost once.
For production hot paths (inbox parsing, retrieve), the LLM cost is
prohibitive. The right architecture there is to extract role at
INGEST time (when queries land in the inbox parser) and pass the
already-resolved role through the request — same pattern as
multi_coord_stress's existing `Demand{Role: ...}` shape. The harness
LLM extractor is for reality-test coverage of arbitrary query shapes;
production should never need it.
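A minimal sketch of that ingest-time shape, assuming an illustrative `Demand` struct (names here are hypothetical, not the real multi_coord_stress definition):

```go
package main

import "fmt"

// Demand sketches the ingest-time shape the note describes: the inbox
// parser (which already runs an LLM) resolves the role once, and every
// downstream request carries it.
type Demand struct {
	Role  string // resolved at ingest, e.g. "Forklift Operator"
	Count int
	City  string
}

// searchBody is a hypothetical request builder: the hot path just copies
// the already-resolved role into query_role, so the cross-role gate
// stays armed without any per-query LLM call.
func searchBody(d Demand, query string) map[string]any {
	body := map[string]any{"query": query}
	if d.Role != "" {
		body["query_role"] = d.Role
	}
	return body
}

func main() {
	d := Demand{Role: "Forklift Operator", Count: 1, City: "Detroit"}
	fmt.Println(searchBody(d, "1 Forklift Operator Detroit MI 15:00")["query_role"])
}
```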
## Repro
```bash
# Generate the same 40-query stress file
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt
# With LLM extractor (closes shorthand hole)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 LLM_ROLE_EXTRACT=1 \
./scripts/playbook_lift.sh
# Without LLM extractor (regex-only — current default)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_004_regex_only \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_004.{json,md}`.


@@ -283,6 +283,16 @@ fi
if [ "$WITH_REJUDGE" = "1" ]; then
  EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
fi
# LLM_ROLE_EXTRACT=1 enables the qwen2.5 fallback extractor for
# shorthand-style queries (real_003 known limitation). Adds ~1-3s
# per shorthand query but closes the cross-role bleed at the
# extraction layer. Default off.
if [ "${LLM_ROLE_EXTRACT:-0}" = "1" ]; then
  EXTRA_FLAGS="$EXTRA_FLAGS -llm-role-extract"
  if [ -n "${LLM_ROLE_MODEL:-}" ]; then
    EXTRA_FLAGS="$EXTRA_FLAGS -llm-role-model $LLM_ROLE_MODEL"
  fi
fi
./bin/playbook_lift \
  -config "$CONFIG_PATH" \
  -gateway "http://127.0.0.1:3110" \


@@ -148,6 +148,8 @@ func main() {
	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
	withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
	withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
	llmRoleExtract := flag.Bool("llm-role-extract", false, "fall back to LLM (qwen2.5 format=json) when the regex extractor returns empty — closes the shorthand-style cross-role bleed surfaced in real_003 at the cost of ~1-3s/query")
	llmRoleModel := flag.String("llm-role-model", "qwen2.5:latest", "Ollama model used for LLM role extraction; ignored when -llm-role-extract is off")
	flag.Parse()
	// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@@ -176,6 +178,23 @@ func main() {
	log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)
	hc := &http.Client{Timeout: 60 * time.Second}
	// Package-global role extractor — used by matrixSearch +
	// playbookRecord. Off-by-default so the existing harness behavior
	// (regex-only extraction) is unchanged unless -llm-role-extract.
	{
		mdl := ""
		if *llmRoleExtract {
			mdl = *llmRoleModel
			log.Printf("[lift] llm role extraction ON (model=%s) — shorthand queries get LLM fallback", mdl)
		}
		globalRoleExtractor = &roleExtractor{
			hc:        hc,
			ollamaURL: *ollama,
			model:     mdl,
		}
	}
	runs := make([]queryRun, 0, len(qs))
	totalDelta := float32(0)
	playbookBoostedTotal := 0
@@ -482,11 +501,12 @@ func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, us
		"per_corpus_k": k,
		"use_playbook": usePlaybook,
	}
	// Mechanical role extraction (real_001 cross-role bleed fix).
	// Empty when the query doesn't match the "Need N {role}{s} in"
	// pattern — gate stays disabled and the harness behaves as before
	// for the lift suite's free-form queries.
	if role := extractRoleFromNeed(query); role != "" {
	// Role extraction (real_001 + real_003 cross-role bleed fixes).
	// Goes through globalRoleExtractor so shorthand-style queries get
	// LLM fallback when -llm-role-extract is on. Empty result leaves
	// the gate disabled — harness preserves current behavior on
	// truly-unparseable shapes.
	if role := globalRoleExtractor.extract(query); role != "" {
		body["query_role"] = role
	}
	bs, _ := json.Marshal(body)
@@ -516,9 +536,10 @@ func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, s
		"score": score,
		"tags":  []string{"reality-test", "playbook-lift-001"},
	}
	// Same mechanical role extraction as matrixSearch — recorded role
	// lets retrieve-time gate fire on cross-role queries (real_001 fix).
	if role := extractRoleFromNeed(query); role != "" {
	// Same extractor as matrixSearch — shared cache, same LLM fallback
	// rules. Recorded role lets retrieve-time gate fire on cross-role
	// queries (real_001 + real_003 fixes).
	if role := globalRoleExtractor.extract(query); role != "" {
		body["role"] = role
	}
	bs, _ := json.Marshal(body)
@@ -631,6 +652,110 @@ func appendNote(existing, add string) string {
// refactor; harmless for now.
var _ = sort.Slice
// globalRoleExtractor is set in main() and read by matrixSearch +
// playbookRecord. nil-safe: a nil receiver on roleExtractor.extract
// degrades to regex-only behavior, so the field is checked once at
// startup and never re-validated per call.
var globalRoleExtractor *roleExtractor
// roleExtractor combines the fast-path regex with an optional LLM
// fallback so callers can pay for shorthand coverage only when they
// need it. Per real_003_findings.md: shorthand-style queries
// ("N {role} {city} {state} {at} {client}") have no separator
// between role and city, so a regex can't reliably extract — but a
// small LLM with a tight format=json prompt can. Cost is ~1-3s per
// extraction on local qwen2.5; cached per-process so paraphrase
// passes don't pay twice for the same query.
//
// Empty cache + off-by-default LLM = the existing real_003b behavior
// is unchanged unless callers explicitly enable LLM mode.
type roleExtractor struct {
	hc        *http.Client
	ollamaURL string
	model     string // "" disables LLM fallback
	cache     map[string]string
}
// extract returns the role for a query. Tries regex first (fast,
// deterministic); if regex misses and LLM is configured, calls Ollama;
// caches the result either way. Returns "" when both miss — the
// caller's gate stays disabled, preserving current behavior on
// truly-unparseable shapes.
func (r *roleExtractor) extract(query string) string {
	if r == nil {
		return extractRoleFromNeed(query)
	}
	if cached, ok := r.cache[query]; ok {
		return cached
	}
	role := extractRoleFromNeed(query)
	if role == "" && r.model != "" {
		if v, err := extractRoleViaLLM(r.hc, r.ollamaURL, r.model, query); err == nil {
			role = v
		} else {
			log.Printf("[lift] llm-role-extract failed (%v) — falling back to empty for %q", err, abbrev(query, 60))
		}
	}
	if r.cache == nil {
		r.cache = make(map[string]string)
	}
	r.cache[query] = role
	return role
}
// extractRoleViaLLM asks the Ollama-shape /api/chat to identify the
// staffing role in a free-form query. Tight schema + format=json so
// parsing is deterministic. Empty role string is a valid response —
// the model may decline to extract when the query has no clean role
// (e.g. lift-suite multi-constraint queries).
func extractRoleViaLLM(hc *http.Client, ollamaURL, model, query string) (string, error) {
	system := `You are a staffing-domain role extractor.
Output JSON only: {"role": "<role string or empty>"}.
Rules:
- Identify the staffing role (job title) the query is asking for.
- Return only the role noun phrase e.g. "Forklift Operator", "Pickers".
- Preserve plurality from the query (don't singularize).
- Strip qualifiers ("OSHA-30 certified") we want the bare role.
- If the query has no clean staffing role, return "" (empty string).
- Do NOT explain. Just emit the JSON.`
	body, _ := json.Marshal(map[string]any{
		"model":  model,
		"stream": false,
		"format": "json",
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": query},
		},
		"options": map[string]any{"temperature": 0.0},
	})
	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return "", fmt.Errorf("ollama chat: HTTP %d", resp.StatusCode)
	}
	rb, _ := io.ReadAll(resp.Body)
	var ollamaResp struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return "", err
	}
	var out struct {
		Role string `json:"role"`
	}
	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &out); err != nil {
		return "", fmt.Errorf("decode role: %w (content=%q)", err, ollamaResp.Message.Content)
	}
	return strings.TrimSpace(out.Role), nil
}
// extractRoleFromNeed pulls the role out of staffing-shape queries.
// Returns "" for any query that doesn't match a known anchor pattern
// (free-form lift-suite queries + shorthand-style fall back to empty,


@@ -1,6 +1,14 @@
package main
import "testing"
import (
	"encoding/json"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
	"testing"
)
// TestExtractRoleFromNeed locks the four query-shape patterns documented
// in real_003_findings.md so a future change to the regex can't silently
@@ -74,3 +82,202 @@ func TestExtractRoleFromNeed(t *testing.T) {
		})
	}
}
// fakeOllama returns a minimal Ollama-shape /api/chat handler that
// echoes a fixed role string. Counts hits so caching tests can assert
// on call count.
func fakeOllama(role string) (*httptest.Server, *int64) {
	var hits int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/chat" {
			http.NotFound(w, r)
			return
		}
		atomic.AddInt64(&hits, 1)
		_, _ = io.ReadAll(r.Body)
		// Ollama wraps the model's response under message.content as
		// a JSON string. The model's output is itself the
		// {"role":"..."} JSON our extractor parses.
		inner, _ := json.Marshal(map[string]string{"role": role})
		out, _ := json.Marshal(map[string]any{
			"message": map[string]string{"content": string(inner)},
		})
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write(out)
	}))
	return srv, &hits
}
// TestRoleExtractor_RegexFirst locks the priority order: when the
// regex matches, the LLM must NOT be called even if configured. Same
// behavior the real_003b shipping config relies on (don't pay LLM
// cost on need/client_first/looking queries).
func TestRoleExtractor_RegexFirst(t *testing.T) {
	srv, hits := fakeOllama("LLM-WONT-BE-CALLED")
	defer srv.Close()
	rx := &roleExtractor{
		hc:        srv.Client(),
		ollamaURL: srv.URL,
		model:     "test-model", // LLM is configured...
	}
	got := rx.extract("Need 5 Forklift Operators in Detroit MI starting at 09:00 for ACME")
	if got != "Forklift Operators" {
		t.Errorf("regex should win on Need-form query, got %q", got)
	}
	if *hits != 0 {
		t.Errorf("LLM should not be called when regex matched, got %d hits", *hits)
	}
}
// TestRoleExtractor_LLMFallback locks the shorthand-coverage path:
// when regex returns empty AND LLM is configured, the LLM is called
// and its result is used. Closes the real_003 shorthand bleed at the
// extraction layer.
func TestRoleExtractor_LLMFallback(t *testing.T) {
	srv, hits := fakeOllama("CNC Operator")
	defer srv.Close()
	rx := &roleExtractor{
		hc:        srv.Client(),
		ollamaURL: srv.URL,
		model:     "test-model",
	}
	got := rx.extract("1 CNC Operator Detroit MI 17:30 Beacon Freight")
	if got != "CNC Operator" {
		t.Errorf("LLM fallback should fire on shorthand, got %q", got)
	}
	if *hits != 1 {
		t.Errorf("expected exactly 1 LLM hit, got %d", *hits)
	}
}
// TestRoleExtractor_LLMOffLeavesEmpty locks the opt-in default: with
// LLM model unset, shorthand queries return empty (preserves
// real_003b shipping config — no LLM cost paid by default).
func TestRoleExtractor_LLMOffLeavesEmpty(t *testing.T) {
	rx := &roleExtractor{} // model = "" disables LLM
	got := rx.extract("1 CNC Operator Detroit MI 17:30 Beacon Freight")
	if got != "" {
		t.Errorf("with LLM off, shorthand should return empty, got %q", got)
	}
}
// TestRoleExtractor_Cache locks the per-process cache: re-extracting
// the same query must NOT call the LLM twice. Paraphrase passes hit
// the same query under cold + warm + paraphrase + rejudge — without
// the cache that's 4× LLM cost per query.
func TestRoleExtractor_Cache(t *testing.T) {
	srv, hits := fakeOllama("Pickers")
	defer srv.Close()
	rx := &roleExtractor{
		hc:        srv.Client(),
		ollamaURL: srv.URL,
		model:     "test-model",
	}
	q := "4 Pickers Detroit MI 13:30 Beacon Freight"
	for i := 0; i < 3; i++ {
		got := rx.extract(q)
		if got != "Pickers" {
			t.Errorf("call %d: got %q, want Pickers", i, got)
		}
	}
	if *hits != 1 {
		t.Errorf("expected 1 LLM hit (rest cached), got %d", *hits)
	}
}
// TestRoleExtractor_NilSafe locks the nil-receiver behavior so call
// sites in matrixSearch + playbookRecord don't need a guard. nil
// extractor degrades to regex-only.
func TestRoleExtractor_NilSafe(t *testing.T) {
	var rx *roleExtractor
	got := rx.extract("Need 1 Loader in Indianapolis IN starting at 12:00 for Midway")
	if got != "Loader" {
		t.Errorf("nil receiver should still run regex, got %q", got)
	}
	got = rx.extract("1 CNC Operator Detroit MI 17:30 Beacon")
	if got != "" {
		t.Errorf("nil receiver should return empty when regex misses (no LLM), got %q", got)
	}
}
// TestExtractRoleViaLLM_HTTPError locks the failure path: HTTP non-2xx
// surfaces as an error so the caller can fall back to empty cleanly.
func TestExtractRoleViaLLM_HTTPError(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "internal", http.StatusInternalServerError)
	}))
	defer srv.Close()
	_, err := extractRoleViaLLM(srv.Client(), srv.URL, "test-model", "anything")
	if err == nil || !strings.Contains(err.Error(), "HTTP 500") {
		t.Errorf("expected HTTP 500 error, got %v", err)
	}
}
// TestRoleExtractor_ClosesCrossRoleShorthandBleed is the synthetic
// witness for the real_003 finding: when both record AND query are
// shorthand-style (no anchor between role and city), the regex
// returns "" for both. With LLM off, both sides go empty → gate
// disabled → cross-role bleed possible. With LLM on, both sides
// extract a non-empty role → role-mismatch is caught at the matrix
// gate.
//
// This test isolates the EXTRACTION layer (does the harness produce
// the right role token?). The matrix gate's behavior on those tokens
// is locked separately in internal/matrix/playbook_test.go's
// TestInjectPlaybookMisses_RoleGateRejectsCrossRole.
func TestRoleExtractor_ClosesCrossRoleShorthandBleed(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		// Determine which role to return based on the user message.
		role := ""
		if strings.Contains(string(body), "CNC") {
			role = "CNC Operator"
		} else if strings.Contains(string(body), "Forklift") {
			role = "Forklift Operator"
		}
		inner, _ := json.Marshal(map[string]string{"role": role})
		out, _ := json.Marshal(map[string]any{
			"message": map[string]string{"content": string(inner)},
		})
		_, _ = w.Write(out)
	}))
	defer srv.Close()
	rx := &roleExtractor{
		hc:        srv.Client(),
		ollamaURL: srv.URL,
		model:     "test-model",
	}
	// Both queries are shorthand — regex would return "" for both.
	// LLM extracts a real role for each, and they differ.
	cnc := rx.extract("1 CNC Operator Detroit MI 17:30 Beacon Freight")
	fork := rx.extract("1 Forklift Operator Detroit MI 15:00 Beacon Freight")
	if cnc != "CNC Operator" {
		t.Errorf("CNC shorthand should extract via LLM, got %q", cnc)
	}
	if fork != "Forklift Operator" {
		t.Errorf("Forklift shorthand should extract via LLM, got %q", fork)
	}
	if cnc == fork {
		t.Errorf("cross-role shorthand should produce DIFFERENT role tokens (cnc=%q fork=%q)", cnc, fork)
	}
}
// TestExtractRoleViaLLM_BadJSON locks the model-output validation: if
// the LLM returns non-JSON content, the error is surfaced (don't
// silently treat it as empty role).
func TestExtractRoleViaLLM_BadJSON(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		out, _ := json.Marshal(map[string]any{
			"message": map[string]string{"content": "this is not json"},
		})
		_, _ = w.Write(out)
	}))
	defer srv.Close()
	_, err := extractRoleViaLLM(srv.Client(), srv.URL, "test-model", "anything")
	if err == nil || !strings.Contains(err.Error(), "decode role") {
		t.Errorf("expected decode error, got %v", err)
	}
}