root b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)

The 5-loop substrate's load-bearing gate is verified: playbook +
matrix indexer produce measurable lift. Per the report's rubric,
lift ≥ 50% of discoveries means the matrix is doing real work;
7/8 = 87.5% clears that bar.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
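
A minimal sketch of the wrong-field-name drift idea from fixes 1 and 3, assuming strict JSON decoding is how the handler rejects the old shape. The type `sqlProbe` and the helper `decodeStrict` are hypothetical names for illustration; only the `q` → `sql` rename comes from this commit, and the real tests in cmd/queryd/main_test.go may work differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// sqlProbe is a hypothetical stand-in for queryd's request body.
// Only the "sql" field name is taken from the commit message.
type sqlProbe struct {
	SQL string `json:"sql"`
}

// decodeStrict rejects unknown fields and an empty "sql" value, so a caller
// still sending the old {"q": ...} shape fails instead of being silently ignored.
func decodeStrict(body []byte) (sqlProbe, error) {
	var p sqlProbe
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return sqlProbe{}, err
	}
	if p.SQL == "" {
		return sqlProbe{}, fmt.Errorf("missing required field %q", "sql")
	}
	return p, nil
}

func main() {
	_, errOld := decodeStrict([]byte(`{"q": "SELECT COUNT(*) FROM t"}`))
	_, errNew := decodeStrict([]byte(`{"sql": "SELECT COUNT(*) FROM t"}`))
	fmt.Println("old shape rejected:", errOld != nil)
	fmt.Println("new shape accepted:", errNew == nil)
}
```

Wrapped in a `go test` case, a check like this would fail as soon as either side of the contract renames the field again.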

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 06:22:21 -05:00

4.9 KiB


Playbook-Lift Reality Test — Run 001

Generated: 2026-04-30T10:50:22.550677651Z
Judge: qwen2.5:latest (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
Corpora: workers,ethereal_workers
Workers limit: 5000
Queries: tests/reality/playbook_lift_queries.txt (21 executed)
K per pass: 10
Evidence: reports/reality-tests/playbook_lift_001.json

Headline

Metric	Value
Total queries run	21
Cold-pass discoveries (judge-best ≠ top-1)	8
Warm-pass lifts (recorded playbook → top-1)	7
No lift (judge-best already top-1, nothing recorded, or recorded but not promoted)	14
Playbook boosts triggered (warm pass)	9
Mean Δ top-1 distance (warm − cold)	-0.053097825

Lift rate: 7 of 8 discoveries (87.5%) became top-1 after the warm pass.

Per-query results

#	Query	Cold top-1	Cold judge-best (rank/rating)	Recorded?	Warm top-1	Judge-best warm rank	Lift
1	Forklift operator with OSHA-30, warehouse experience, day sh	e-2085	2/4	✓ w-2019	w-2019	0	YES
2	OSHA-30 certified forklift operator in Wisconsin, cold stora	e-6293	7/3	—	e-6293	7	no
3	Production worker with confined-space cert and hazmat traini	w-4552	7/3	—	w-4552	7	no
4	CDL Class A driver, clean record, willing to do regional 4-d	w-3272	0/1	—	w-3272	0	no
5	Warehouse lead with current OSHA-30 certification, NOT OSHA-	w-4833	5/4	✓ w-195	w-195	0	YES
6	Forklift-certified loader, certification must be active, dis	e-2975	2/4	✓ w-3821	w-3821	0	YES
7	Hazmat-certified warehouse worker comfortable with cold stor	w-4965	2/4	✓ w-4257	w-4257	0	YES
8	Bilingual production worker with team-lead experience and tr	w-4115	0/4	—	w-4115	0	no
9	Inventory specialist with confined-space cert and compliance	w-3819	1/3	—	w-3819	1	no
10	Warehouse worker who can run inventory cycles and lead a sma	e-8132	0/4	—	e-8132	0	no
11	Production line worker comfortable filling in as line superv	w-2377	3/4	✓ w-2954	w-2954	0	YES
12	Customer service rep willing to cross-train into dispatch or	e-1332	2/2	—	e-1332	2	no
13	Reliable production line lead with strong attendance and lea	e-4284	6/4	✓ e-5778	e-5778	0	YES
14	Highly responsive forklift operator available for last-minut	e-3695	2/4	✓ e-5385	e-5385	0	YES
15	Engaged warehouse associate with strong safety compliance re	e-7646	9/4	✓ e-2028	w-4257	1	no
16	CDL-A driver based in IL or WI, willing to run regional 4-da	w-3272	7/2	—	w-3272	7	no
17	Bilingual customer service rep in Indianapolis or Cincinnati	e-4240	6/2	—	e-4240	6	no
18	Production supervisor open to Midwest relocation for permane	w-1876	0/2	—	w-1876	0	no
19	Dental hygienist with three years experience, Indianapolis a	w-211	0/1	—	w-211	0	no
20	Registered nurse with ICU experience, willing to take per-di	w-577	0/1	—	w-577	0	no
21	Software engineer with React and TypeScript, three years exp	w-2407	0/1	—	w-2407	0	no
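
The headline counts can be re-derived from the table above. The sketch below is one way to do that, assuming a ✓ in the "Recorded?" column marks a discovery and that a lift means a recorded query whose judge-best result reaches warm rank 0; both assumptions match the headline numbers but are not confirmed by the harness source. The `row` type and `tally` function are illustrative names.

```go
package main

import "fmt"

// row keeps three columns from the per-query table.
// Field names are illustrative, not taken from the harness.
type row struct {
	ColdRank int  // rank of the judge-best result in the cold pass
	Recorded bool // "Recorded?" column shows a check mark
	WarmRank int  // rank of the judge-best result in the warm pass
}

// tally counts discoveries (recorded playbooks) and lifts
// (recorded playbooks whose judge-best result reached warm rank 0).
func tally(rows []row) (discoveries, lifts int) {
	for _, r := range rows {
		if !r.Recorded {
			continue
		}
		discoveries++
		if r.WarmRank == 0 {
			lifts++
		}
	}
	return discoveries, lifts
}

// run001 transcribes queries 1 through 21 from the table, in order.
var run001 = []row{
	{2, true, 0}, {7, false, 7}, {7, false, 7}, {0, false, 0},
	{5, true, 0}, {2, true, 0}, {2, true, 0}, {0, false, 0},
	{1, false, 1}, {0, false, 0}, {3, true, 0}, {2, false, 2},
	{6, true, 0}, {2, true, 0}, {9, true, 1}, {7, false, 7},
	{6, false, 6}, {0, false, 0}, {0, false, 0}, {0, false, 0},
	{0, false, 0},
}

func main() {
	d, l := tally(run001)
	rate := 100 * float64(l) / float64(d)
	fmt.Printf("discoveries=%d lifts=%d rate=%.1f%%\n", d, l, rate)
	// prints: discoveries=8 lifts=7 rate=87.5%
}
```

Query 15 is the one recorded playbook that did not lift: its judge-best result moved from cold rank 9 to warm rank 1.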

Honesty caveats

Judge IS the ground truth proxy. Without human-labeled relevance, the LLM judge's verdict is what defines "best." If qwen2.5:latest rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
Score-1.0 boost = distance halved. Playbook math is distance' = distance × (1 - 0.5 × score), so a score of 1.0 multiplies the distance by 0.5. For the boosted judge-best result to overtake an unboosted cold top-1, its pre-boost distance must be ≤ 2× the cold top-1's distance; otherwise even halving doesn't promote it. Tight clusters → little visible lift.
Same-query replay is the cheap case. Real lift comes from similar but not identical queries hitting a recorded playbook. This run only tests verbatim replay. A v2 should add paraphrase queries.
Multi-corpus skew. Default corpora=workers,ethereal_workers — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
Judge resolution. This run used qwen2.5:latest, resolved from the env JUDGE_MODEL override. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
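
The boost arithmetic in the second caveat can be checked with a short sketch. It implements the stated formula distance' = distance × (1 - 0.5 × score) and assumes the cold top-1 is not itself boosted and that ties do not promote; the function names and the sample distances are made up for illustration and are not taken from the run.

```go
package main

import "fmt"

// boost applies the report's formula: distance' = distance * (1 - 0.5*score).
func boost(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether a boosted judge-best result ends up strictly
// closer than an unboosted cold top-1. Tie handling is an assumption.
func promotes(judgeBestDist, top1Dist, score float64) bool {
	return boost(judgeBestDist, score) < top1Dist
}

func main() {
	// Illustrative distances only.
	fmt.Println(promotes(0.50, 0.30, 1.0)) // 0.25 vs 0.30: within the 2x window
	fmt.Println(promotes(0.70, 0.30, 1.0)) // 0.35 vs 0.30: outside the 2x window
	fmt.Println(promotes(0.50, 0.30, 0.5)) // weaker score: 0.375 vs 0.30
}
```

With a score below 1.0 the window shrinks: at score 0.5 the multiplier is 0.75, so the judge-best result must start within roughly 1.33× of the cold top-1's distance.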

Next moves

If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
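
The thresholds above can be expressed as a small decision helper. This is a sketch only: the 50% and 20% cut-offs come from the list above, while the wording for the 20–50% band and for the zero-discovery case is an assumption, since the report does not say what to do there. The function name `nextMove` is hypothetical.

```go
package main

import "fmt"

// nextMove maps a lift count and discovery count to the report's guidance.
// The middle band and the zero-discovery message are assumptions.
func nextMove(lifts, discoveries int) string {
	if discoveries == 0 {
		return "no discoveries: cosine may already be near optimal on these queries"
	}
	rate := float64(lifts) / float64(discoveries)
	switch {
	case rate >= 0.5:
		return "matrix layer + playbook is doing real work: move to paraphrase queries"
	case rate < 0.2:
		return "investigate: judge variance, distance gap, or playbook math"
	default:
		return "inconclusive: the report gives no guidance between 20% and 50%"
	}
}

func main() {
	fmt.Println(nextMove(7, 8)) // run 001: 87.5% lift rate
}
```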
