root 0331288641 playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)

real_003 left a known-weak hole: shorthand-style queries
("{count} {role} {city} {state} ...") have no separator between
role and city, so a regex can't reliably extract the role. That
leaves the cross-role gate disabled when both record AND query are
shorthand.

This commit adds a roleExtractor with regex-first + LLM fallback:

- Regex first (fast, deterministic) — handles need + client_first +
  looking from real_003b. ~75% of styles, no LLM cost paid.
- LLM fallback when regex returns empty AND model is configured —
  Ollama-shape /api/chat with format=json, schema-tight prompt,
  temperature 0. ~1-3s on local qwen2.5.
- Per-process cache — paraphrase + rejudge passes reuse the same
  query 4× per run; cache prevents 4× LLM cost.
- Off-by-default — opt-in via -llm-role-extract flag (CLI) and
  LLM_ROLE_EXTRACT=1 env var (harness wrapper). real_003b shipping
  config unchanged unless explicitly enabled.
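
The four bullets above can be sketched roughly as follows. This is an illustration, not the code from the commit: `needRe` stands in for a single regex style (the real_003b patterns are not reproduced here), `llmFunc` is a hypothetical hook in place of the Ollama `/api/chat` call, and leaving failed LLM calls uncached is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// needRe is a stand-in for one regex style ("need 3 welders in Austin TX").
// The actual patterns from real_003b are not shown in the commit message.
var needRe = regexp.MustCompile(`(?i)\bneed\s+\d+\s+(.+?)\s+in\b`)

// llmFunc abstracts the fallback call so a test can inject a fake (hypothetical name).
type llmFunc func(query string) (string, error)

// roleExtractor tries the regex first, then an optional LLM, caching LLM results.
type roleExtractor struct {
	llm   llmFunc // nil means LLM extraction is off, which is the default
	mu    sync.Mutex
	cache map[string]string
}

// regexRole returns the lowercased role, or "" when no pattern matches.
func regexRole(q string) string {
	if m := needRe.FindStringSubmatch(q); m != nil {
		return strings.ToLower(strings.TrimSpace(m[1]))
	}
	return ""
}

// Extract is safe on a nil receiver: it then runs the regex path only.
func (e *roleExtractor) Extract(q string) string {
	if role := regexRole(q); role != "" {
		return role // fast path, no LLM cost
	}
	if e == nil || e.llm == nil {
		return "" // opt-in not enabled: role stays empty
	}
	// The lock is held across the LLM call, which serializes callers;
	// acceptable for a sketch, not tuned for throughput.
	e.mu.Lock()
	defer e.mu.Unlock()
	if role, ok := e.cache[q]; ok {
		return role
	}
	role, err := e.llm(q)
	if err != nil {
		return "" // assumption: failures are not cached, so a retry can succeed
	}
	role = strings.ToLower(strings.TrimSpace(role))
	if e.cache == nil {
		e.cache = make(map[string]string)
	}
	e.cache[q] = role
	return role
}

func main() {
	calls := 0
	ex := &roleExtractor{llm: func(q string) (string, error) {
		calls++
		return "forklift operator", nil
	}}
	fmt.Println(ex.Extract("need 3 welders in Austin TX"))   // regex path
	fmt.Println(ex.Extract("2 forklift operator Dallas TX")) // LLM path
	fmt.Println(ex.Extract("2 forklift operator Dallas TX")) // served from cache
	fmt.Println("LLM calls:", calls)
}
```

Injecting the LLM call as a function keeps the regex-first, cache, and nil-receiver behaviours testable without a running model.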

8 new tests in scripts/playbook_lift/main_test.go:
- TestRoleExtractor_RegexFirst: LLM not called when regex matches
- TestRoleExtractor_LLMFallback: shorthand goes to LLM
- TestRoleExtractor_LLMOffLeavesEmpty: opt-in default preserved
- TestRoleExtractor_Cache: 3 calls = 1 LLM hit
- TestRoleExtractor_NilSafe: nil receiver runs regex only
- TestExtractRoleViaLLM_HTTPError + _BadJSON: failure paths
- TestRoleExtractor_ClosesCrossRoleShorthandBleed: synthetic
  witness for the real_003 scenario — both record + query are
  shorthand, regex returns "" for both, LLM produces DIFFERENT
  role tokens for CNC vs Forklift, so matrix gate's cross-role
  rejection (locked separately in
  TestInjectPlaybookMisses_RoleGateRejectsCrossRole) fires
  correctly. This is the load-bearing verification.
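
A minimal sketch of why distinct role tokens matter to the gate. The function name `allowPlaybookHit` is hypothetical, and treating an empty role on either side as "unknown, gate cannot fire" is an assumption drawn from the description above, not the matrix code itself.

```go
package main

import (
	"fmt"
	"strings"
)

// allowPlaybookHit reports whether a recorded playbook entry may influence a query.
// Assumption: an empty role on either side means "unknown", so the gate cannot reject.
func allowPlaybookHit(recordRole, queryRole string) bool {
	r := strings.ToLower(strings.TrimSpace(recordRole))
	q := strings.ToLower(strings.TrimSpace(queryRole))
	if r == "" || q == "" {
		return true // gate disabled: this is the shorthand hole when regex returns ""
	}
	return r == q
}

func main() {
	// Regex-only on shorthand: both roles empty, so a cross-role hit slips through.
	fmt.Println(allowPlaybookHit("", "")) // true
	// With LLM extraction the tokens differ and the gate rejects.
	fmt.Println(allowPlaybookHit("cnc operator", "forklift operator")) // false
	// Same role still passes.
	fmt.Println(allowPlaybookHit("CNC Operator", "cnc operator")) // true
}
```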

Reality test real_004 ran the same 40-query stress as real_003 with
LLM extraction on. Cross-style same-role boosts fired correctly
across all 4 styles for Loaders + Packers + Shipping Clerk clusters
(including shorthand → other-style transfer). No cross-role bleed
observed. The reality test alone can't be a clean "with vs without"
comparison (HNSW build is non-deterministic across runs, and
real_004 stochastics didn't trigger a shorthand recording at all),
which is why the unit-test witness exists.

Production note (in real_004_findings.md): LLM extraction is for
reality-test coverage of arbitrary query shapes. Production should
extract role at INGEST time (when the inbox parser already runs an
LLM) and pass already-resolved role through requests — same shape
as multi_coord_stress's existing Demand{Role: ...} model. The hot
path should never need the harness extractor's per-query LLM cost.
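
One way to picture the production shape, as a sketch only: the role travels with the request and the per-query extractor is consulted only when it is missing. Only `Demand{Role: ...}` comes from the note above; the `Raw` field and `roleFor` helper are hypothetical.

```go
package main

import "fmt"

// Demand mirrors the Demand{Role: ...} shape named above. Only Role comes
// from the source; Raw is a hypothetical field for this sketch.
type Demand struct {
	Role string // resolved once at ingest by the inbox parser
	Raw  string // original request text
}

// roleFor prefers the role resolved at ingest and only falls back to a
// per-query extractor (the harness path) when none was supplied.
func roleFor(d Demand, extract func(string) string) string {
	if d.Role != "" {
		return d.Role
	}
	if extract == nil {
		return ""
	}
	return extract(d.Raw)
}

func main() {
	// slow stands in for the harness LLM extractor.
	slow := func(string) string { return "forklift operator" }
	fmt.Println(roleFor(Demand{Role: "cnc operator", Raw: "4 cnc operator Tulsa OK"}, slow))
	fmt.Println(roleFor(Demand{Raw: "2 forklift operator Dallas TX"}, slow))
}
```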

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 22:51:27 -05:00

multi_coord_stress_001.md

multi-coord stress harness — Phase 1 of 48-hour mock

2026-04-30 07:55:29 -05:00

multi_coord_stress_002.md

multi-coord stress Phase 1.5: shared-role contracts + paraphrase handover

2026-04-30 08:03:16 -05:00

multi_coord_stress_003.md

multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap

2026-04-30 08:19:29 -05:00

multi_coord_stress_004.md

multi-coord stress Phase 2: ExcludeIDs + fresh-resume + 200-worker swap

2026-04-30 08:19:29 -05:00

multi_coord_stress_005.md

embedd: bump default to nomic-embed-text-v2-moe (475M MoE, 768d drop-in)

2026-04-30 08:26:52 -05:00

multi_coord_stress_006.md

observerd: /observer/inbox + multi-coord stress phase 1c (priority-ordered events)

2026-04-30 08:34:36 -05:00

multi_coord_stress_007.md

multi_coord_stress: LLM-parsed inbox demands (qwen2.5)

2026-04-30 14:51:19 -05:00

multi_coord_stress_008.md

multi_coord_stress: judge re-rates inbox top-1 — recovers honesty signal

2026-04-30 16:16:49 -05:00

multi_coord_stress_009.md

langfuse: Go-side client + Phase 1c instrumentation

2026-04-30 16:25:03 -05:00

multi_coord_stress_010.md

multi_coord_stress: fresh_workers two-tier index — fresh-resume now top-1

2026-04-30 16:31:45 -05:00

multi_coord_stress_011.md

multi_coord_stress: full Langfuse coverage — every phase + every call

2026-04-30 16:43:32 -05:00

playbook_lift_001.md

playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)

2026-04-30 06:22:21 -05:00

playbook_lift_002.md

playbook_lift v2: paraphrase pass + run #002 finds boost-only limit

2026-04-30 06:47:41 -05:00

playbook_lift_003.md

matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery

2026-04-30 07:06:13 -05:00

playbook_lift_004.md

matrix: split boost / inject thresholds — kills Shape B cross-pollination

2026-04-30 07:24:55 -05:00

playbook_lift_005.md

playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%

2026-04-30 07:42:04 -05:00

playbook_lift_real_001.md

reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed

2026-04-30 20:18:40 -05:00

playbook_lift_real_002.md

matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1)

2026-04-30 20:34:10 -05:00

playbook_lift_real_003.md

reality_test real_003: 40-query paraphrase stress + extractor extension

2026-04-30 21:42:02 -05:00

playbook_lift_real_003b.md

reality_test real_003: 40-query paraphrase stress + extractor extension

2026-04-30 21:42:02 -05:00

playbook_lift_real_004.md

playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)

2026-04-30 22:51:27 -05:00

README.md

phase 3: playbook_lift harness reads judge from config

2026-04-29 23:57:28 -05:00

real_001_findings.md

reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed

2026-04-30 20:18:40 -05:00

real_002_findings.md

matrix: cross-role playbook gate — closes real_001 bleed (OPEN #1)

2026-04-30 20:34:10 -05:00

real_003_findings.md

reality_test real_003: 40-query paraphrase stress + extractor extension

2026-04-30 21:42:02 -05:00

real_004_findings.md

playbook_lift: LLM-based role extractor closes shorthand bleed (real_004)

2026-04-30 22:51:27 -05:00

README.md

reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure product outcomes, not substrate health. The 21 smokes prove the system runs; the proof harness proves the system's claims hold; reality tests answer: does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?

This is the gate from project_small_model_pipeline_vision.md: "the playbook + matrix indexer must give the results we're looking for." Single load-bearing criterion. Throughput, scaling, code elegance are secondary.

What lives here

Each reality test is a numbered run that produces:

- `<test>_<NNN>.json`: raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md`: human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in tree as historical baseline.

Test catalog

`playbook_lift_<NNN>` — does the playbook actually lift the right answer?

- Driver: scripts/playbook_lift.sh → bin/playbook_lift
- Queries: tests/reality/playbook_lift_queries.txt
- Pipeline: cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run? If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.
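
The per-query bookkeeping behind that question might look like the sketch below. The struct and field names are hypothetical; the harness's actual JSON schema is not shown here.

```go
package main

import "fmt"

// queryResult holds the three IDs compared per query.
// Field names are hypothetical.
type queryResult struct {
	ColdTop1  string // cosine top-1 on the cold pass
	JudgeBest string // answer the LLM judge preferred on the cold pass
	WarmTop1  string // top-1 on the warm pass, after the playbook record
}

// isDiscovery: the judge preferred something other than cosine top-1.
func (r queryResult) isDiscovery() bool {
	return r.JudgeBest != "" && r.JudgeBest != r.ColdTop1
}

// isLift: a discovery that the warm pass promoted to top-1.
func (r queryResult) isLift() bool {
	return r.isDiscovery() && r.WarmTop1 == r.JudgeBest
}

func main() {
	runs := []queryResult{
		{ColdTop1: "w1", JudgeBest: "w1", WarmTop1: "w1"}, // cosine already right
		{ColdTop1: "w2", JudgeBest: "w7", WarmTop1: "w7"}, // discovery, lifted
		{ColdTop1: "w3", JudgeBest: "w9", WarmTop1: "w3"}, // discovery, not lifted
	}
	discoveries, lifts := 0, 0
	for _, r := range runs {
		if r.isDiscovery() {
			discoveries++
		}
		if r.isLift() {
			lifts++
		}
	}
	fmt.Printf("discoveries=%d lifts=%d\n", discoveries, lifts) // discoveries=2 lifts=1
}
```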

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.

Running a reality test

# Defaults: judge resolved from lakehouse.toml [models].local_judge,
# workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
# (env JUDGE_MODEL overrides the config tier)
JUDGE_MODEL=qwen3:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh

Judge resolution priority (Phase 3, 2026-04-29):

1. `-judge` flag on the Go driver (explicit override)
2. `JUDGE_MODEL` env var (operator override)
3. `lakehouse.toml` `[models].local_judge` (default)
4. Hardcoded `qwen3.5:latest` (last-resort fallback if config missing)

This means model bumps land in lakehouse.toml, not in this script or the Go driver. Bumping local_judge to a stronger local model (e.g. when qwen4 ships) takes one line.
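
The resolution order above amounts to "first non-empty value wins". A minimal sketch, assuming the three inputs have already been read by the caller; the function name `resolveJudge` is hypothetical and the driver's real flag and TOML parsing is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackJudge is the last-resort model named in the priority list.
const fallbackJudge = "qwen3.5:latest"

// resolveJudge returns the first non-empty value in priority order:
// -judge flag, JUDGE_MODEL env var, lakehouse.toml [models].local_judge.
func resolveJudge(flagVal, envVal, configVal string) string {
	for _, v := range []string{flagVal, envVal, configVal} {
		if s := strings.TrimSpace(v); s != "" {
			return s
		}
	}
	return fallbackJudge
}

func main() {
	// "cfg-model" is a placeholder for whatever local_judge is set to.
	fmt.Println(resolveJudge("", "qwen3:latest", "cfg-model")) // qwen3:latest
	fmt.Println(resolveJudge("", "", "cfg-model"))             // cfg-model
	fmt.Println(resolveJudge("", "", ""))                      // qwen3.5:latest
}
```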

Requires: Ollama on :11434 with nomic-embed-text + the resolved judge model loaded. Skips cleanly (exit 0) if Ollama is absent.

Interpreting results

Two thresholds (50% and 20%) split the playbook_lift results into three bands:

| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes: playbook is doing real work, move to paraphrase queries |
| 20-50% | Lift exists but inconsistent: investigate boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight: diagnose before adding more components |

A separate concern: discovery rate (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
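
Both checks reduce to simple ratios. The sketch below uses integer arithmetic to avoid floating-point edge cases at the band boundaries; the function names and verdict strings are hypothetical, and the 8-of-40 discovery count is an invented input. The 7/8 case matches the headline of run #001.

```go
package main

import "fmt"

// liftVerdict maps lifts/discoveries onto the three bands of the table.
// Integer comparisons: rate >= 50% is lifts*2 >= discoveries, and
// rate >= 20% is lifts*5 >= discoveries.
func liftVerdict(lifts, discoveries int) string {
	switch {
	case discoveries == 0:
		return "no discoveries"
	case lifts*2 >= discoveries:
		return "loop closes"
	case lifts*5 >= discoveries:
		return "inconsistent"
	default:
		return "not pulling weight"
	}
}

// lowHeadroom reports whether discovery rate (discoveries/queries) is below 30%,
// meaning cosine is already close to the judge on this query distribution.
func lowHeadroom(discoveries, queries int) bool {
	if queries == 0 {
		return true
	}
	return discoveries*10 < queries*3
}

func main() {
	fmt.Println(liftVerdict(7, 8))  // loop closes (87.5%)
	fmt.Println(liftVerdict(1, 4))  // inconsistent (25%)
	fmt.Println(liftVerdict(1, 10)) // not pulling weight (10%)
	fmt.Println(lowHeadroom(8, 40)) // true (20% discovery rate)
}
```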

What this is not

- Not a benchmark. No comparison against external systems; only internal cold-vs-warm.
- Not a regression gate. Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- Not human-validated. The LLM judge is the ground-truth proxy. Manually sample 5-10 verdicts per run to check that the judge isn't pathological.
