golangLAKEHOUSE/reports/reality-tests/playbook_lift_001.md
root b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, 30× faster, held lift theory)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
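
The kind of check behind a wrong-field-name drift detector can be sketched as follows. This is a minimal sketch, not the contents of cmd/queryd/main_test.go: the `sqlRequest` struct and `decodeStrict` helper are hypothetical names, and only the `q` → `sql` rename comes from fix 3 above. It assumes the handler decodes strictly, so a stale field name fails instead of silently producing an empty query.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sqlRequest is a hypothetical stand-in for queryd's request body.
// Only the "sql" field name is taken from the commit message.
type sqlRequest struct {
	SQL string `json:"sql"`
}

// decodeStrict rejects bodies that carry unknown fields (such as the
// legacy "q") or that omit the sql value entirely.
func decodeStrict(body string) (sqlRequest, error) {
	var req sqlRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return req, err
	}
	if req.SQL == "" {
		return req, fmt.Errorf("missing required field %q", "sql")
	}
	return req, nil
}

func main() {
	_, err := decodeStrict(`{"q": "SELECT COUNT(*) FROM t"}`)
	fmt.Println("legacy field rejected:", err != nil)

	req, err := decodeStrict(`{"sql": "SELECT COUNT(*) FROM t"}`)
	fmt.Println("current field accepted:", err == nil, req.SQL)
}
```

Wrapping the same two calls in a `go test` function turns a renamed field into a unit-test failure rather than a failed reality run.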

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:22:21 -05:00


Playbook-Lift Reality Test — Run 001

Generated: 2026-04-30T10:50:22.550677651Z
Judge: qwen2.5:latest (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
Corpora: workers,ethereal_workers
Workers limit: 5000
Queries: tests/reality/playbook_lift_queries.txt (21 executed)
K per pass: 10
Evidence: reports/reality-tests/playbook_lift_001.json


Headline

| Metric | Value |
|---|---|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 8 |
| Warm-pass lifts (recorded playbook → top-1) | 7 |
| No change (judge-best already top-1, no playbook needed) | 14 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.053097825 |

Lift rate: 7 of 8 discoveries became top-1 after warm pass.


Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-2085 | 2/4 | ✓ w-2019 | w-2019 | 0 | YES |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | | e-6293 | 7 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-4552 | 7/3 | | w-4552 | 7 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3272 | 0/1 | | w-3272 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-4833 | 5/4 | ✓ w-195 | w-195 | 0 | YES |
| 6 | Forklift-certified loader, certification must be active, dis | e-2975 | 2/4 | ✓ w-3821 | w-3821 | 0 | YES |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | w-4965 | 2/4 | ✓ w-4257 | w-4257 | 0 | YES |
| 8 | Bilingual production worker with team-lead experience and tr | w-4115 | 0/4 | | w-4115 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-3819 | 1/3 | | w-3819 | 1 | no |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-8132 | 0/4 | | e-8132 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-2377 | 3/4 | ✓ w-2954 | w-2954 | 0 | YES |
| 12 | Customer service rep willing to cross-train into dispatch or | e-1332 | 2/2 | | e-1332 | 2 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-4284 | 6/4 | ✓ e-5778 | e-5778 | 0 | YES |
| 14 | Highly responsive forklift operator available for last-minut | e-3695 | 2/4 | ✓ e-5385 | e-5385 | 0 | YES |
| 15 | Engaged warehouse associate with strong safety compliance re | e-7646 | 9/4 | ✓ e-2028 | w-4257 | 1 | no |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-3272 | 7/2 | | w-3272 | 7 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-4240 | 6/2 | | e-4240 | 6 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-1876 | 0/2 | | w-1876 | 0 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | w-211 | 0/1 | | w-211 | 0 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-577 | 0/1 | | w-577 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-2407 | 0/1 | | w-2407 | 0 | no |

Honesty caveats

  1. Judge IS the ground truth proxy. Without human-labeled relevance, the LLM judge's verdict is what defines "best." If qwen2.5:latest rates badly, the lift number is meaningless. To validate the judge itself, sample 5 to 10 verdicts manually and check agreement.
  2. Score-1.0 boost = distance halved. Playbook math is distance' = distance × (1 - 0.5 × score). Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
  3. Same-query replay is the cheap case. Real lift comes from similar but not identical queries hitting a recorded playbook. This run only tests verbatim replay. A v2 should add paraphrase queries.
  4. Multi-corpus skew. Default corpora=workers,ethereal_workers — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
  5. Judge resolution. This run used qwen2.5:latest from env JUDGE_MODEL override (qwen2.5:latest). Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
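
The boost arithmetic in caveat 2 can be worked through in a few lines. This is a minimal sketch of the quoted formula only, assuming the cold top-1 itself receives no boost. The function names `boostedDistance` and `canPromote` are hypothetical, and the distances in `main` are illustrative values, not numbers from this run.

```go
package main

import "fmt"

// boostedDistance applies the formula quoted in caveat 2:
// distance' = distance × (1 − 0.5 × score).
func boostedDistance(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// canPromote reports whether a boosted judge-best result would at least
// tie the cold top-1. Assumption: the cold top-1 is not boosted.
func canPromote(judgeBestDist, coldTop1Dist, score float64) bool {
	return boostedDistance(judgeBestDist, score) <= coldTop1Dist
}

func main() {
	// Illustrative distances only.
	fmt.Println(boostedDistance(0.40, 1.0)) // score 1.0 halves the distance
	fmt.Println(canPromote(0.40, 0.25, 1.0)) // 0.40 is within 2× of 0.25
	fmt.Println(canPromote(0.60, 0.25, 1.0)) // 0.60 is beyond 2× of 0.25
}
```

With score 1.0 the condition reduces to judge-best distance ≤ 2 × cold top-1 distance, which is the bound stated in the caveat.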

Next moves

  • If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
  • If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
  • If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
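
The thresholds above can be expressed as a small decision helper. This is a minimal sketch under stated assumptions: `liftRate` and `nextMove` are hypothetical names, and the report does not say what to do for rates between 20% and 50%, so that band is returned as unspecified rather than guessed.

```go
package main

import "fmt"

// liftRate returns lifts/discoveries. The boolean is false when there
// were no discoveries, in which case the rate is undefined.
func liftRate(lifts, discoveries int) (float64, bool) {
	if discoveries == 0 {
		return 0, false
	}
	return float64(lifts) / float64(discoveries), true
}

// nextMove maps a lift rate onto the report's rubric.
func nextMove(rate float64) string {
	switch {
	case rate >= 0.50:
		return "extend: paraphrase queries + tag-based boost"
	case rate < 0.20:
		return "investigate: judge variance, distance gap, playbook math"
	default:
		return "unspecified by the report (between 20% and 50%)"
	}
}

func main() {
	// Counts from the Headline table: 7 lifts out of 8 discoveries.
	rate, ok := liftRate(7, 8)
	if !ok {
		fmt.Println("no discoveries: cosine top-1 already matched the judge")
		return
	}
	fmt.Printf("lift rate %.1f%% -> %s\n", rate*100, nextMove(rate))
}
```

For this run the helper prints a lift rate of 87.5% and selects the first branch.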