golangLAKEHOUSE/tests/reality/playbook_lift_queries.txt
root b2e45f7f26 playbook_lift: harness expansion + reality test #001 (7/8 lift, 87.5%)
The 5-loop substrate's load-bearing gate is verified — playbook +
matrix indexer give the results we're looking for. Per the report's
rubric, lift ≥ 50% of discoveries means matrix is doing real work;
7/8 = 87.5% blew through that.
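
The rubric arithmetic above can be sketched as follows. This is a minimal illustration only: `liftRate` and `passesRubric` are hypothetical names, not functions from the harness, and the 50% threshold is taken from the rubric as quoted in this message.

```go
package main

import "fmt"

// liftRate returns lifts/discoveries, or 0 when there were no discoveries.
// Hypothetical helper; the real harness may compute this differently.
func liftRate(lifts, discoveries int) float64 {
	if discoveries == 0 {
		return 0
	}
	return float64(lifts) / float64(discoveries)
}

// passesRubric applies the quoted threshold: lift on at least 50% of
// discoveries. Integer arithmetic avoids float comparison at the boundary.
func passesRubric(lifts, discoveries int) bool {
	return discoveries > 0 && 2*lifts >= discoveries
}

func main() {
	// Numbers from reality test #001: 7 lifts out of 8 discoveries.
	fmt.Printf("lift %.1f%% pass=%v\n", liftRate(7, 8)*100, passesRubric(7, 8))
	// prints: lift 87.5% pass=true
}
```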

Harness was structurally hiding bugs behind a 5-daemon stripped boot.
Expanding to the full 10-daemon prod stack surfaced 7 fixes in cascade:

1. driver→matrixd: {"query": ...} → {"query_text": ...} field name
2. harness temp toml missing [s3] → wrong default bucket → catalogd
   rehydrate 500 on first call
3. harness→queryd SQL probe: {"q": ...} → {"sql": ...} field name
4. expand boot from 5 → 10 daemons in dep-ordered launch
5. add SQL surface probe (3-row CSV ingest → COUNT(*)=3 assertion)
6. candidates corpus was synthetic SWE-tech (Swift/iOS, Scala/Spark) —
   wrong domain for staffing queries; replaced with ethereal_workers
   (10K rows, real staffing schema, "e-" id prefix to avoid collision
   with workers' "w-"). staffing_workers driver gains -index-name +
   -id-prefix flags so the same binary serves both corpora
7. local_judge qwen3.5:latest is a vision-SSM 256K-ctx build running
   ~30s per judge call against the lift loop; reverted to
   qwen2.5:latest (~1s/call, roughly 30× faster; lift result held)

Each contract drift (1, 3) is now locked into a cmd/<bin>/main_test.go
so future drift fires in `go test`, not in a reality run. R-005 closed:

- cmd/matrixd/main_test.go (new) — playbook record drift detector +
  score bounds + 6 routes mounted
- cmd/queryd/main_test.go — wrong-field-name drift detector
- cmd/pathwayd/main_test.go (new) — 9 routes + add round-trip + retire
- cmd/observerd/main_test.go (new) — 4 routes + invalid-op + unknown-mode

`go test ./cmd/{matrixd,queryd,pathwayd,observerd}` all green.
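
One way a field-name drift detector like the ones listed above could work is strict JSON decoding, sketched below. This is an assumption about the approach, not the contents of the actual main_test.go files: `searchRequest` and `decodeStrict` are hypothetical names, and only the `query_text` field name comes from fix 1.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// searchRequest is a hypothetical stand-in for matrixd's request body.
// Only the "query_text" field name is taken from the commit message.
type searchRequest struct {
	QueryText string `json:"query_text"`
}

// decodeStrict rejects bodies that carry unknown fields (such as the old
// "query" name) or that leave query_text empty.
func decodeStrict(body string) (searchRequest, error) {
	var req searchRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return req, err
	}
	if req.QueryText == "" {
		return req, fmt.Errorf("query_text is required")
	}
	return req, nil
}

func main() {
	_, errOld := decodeStrict(`{"query": "forklift operator"}`)
	_, errNew := decodeStrict(`{"query_text": "forklift operator"}`)
	fmt.Println(errOld != nil, errNew == nil) // prints: true true
}
```

In a real test file the same checks would sit inside a `func TestXxx(t *testing.T)` and call `t.Fatalf` on a mismatch, so a renamed field fails `go test` instead of a reality run.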

Reality test results (reports/reality-tests/playbook_lift_001.{json,md}):
  Queries              21 (staffing-domain, 7 categories)
  Discoveries          8 (judge ≠ cosine top-1)
  Lifts                7/8 (87.5%)
  Boosts triggered     9
  Mean Δ distance      -0.053 (warm closer than cold)
  OOD honesty          dental/RN/SWE rated 1, no fake matches
  Cross-corpus boosts  confirmed (e- ↔ w- swaps in lifts)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:22:21 -05:00

# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Lift only fires when the judge picks something different from cosine
# top-1, so queries are weighted toward multi-constraint asks where
# cosine has to compromise. Single-axis queries ("forklift operator")
# give cosine an easy win and the harness can't tell if the playbook
# is doing anything.
#
# 21 queries, 7 categories × 3 each (OOD = 2 + 1 buffer).
# --- Multi-constraint role + cert + geo (3) ---
Forklift operator with OSHA-30, warehouse experience, day shift availability
OSHA-30 certified forklift operator in Wisconsin, cold storage experience, day shift only
Production worker with confined-space cert and hazmat training, Indianapolis area
# --- Cert-discriminator (cosine confuses lookalikes) (3) ---
CDL Class A driver, clean record, willing to do regional 4-day routes
Warehouse lead with current OSHA-30 certification, NOT OSHA-10, team management experience
Forklift-certified loader, certification must be active, distinct from general warehouse staff
# --- Skill-intersection (multi-tag must all be present) (3) ---
Hazmat-certified warehouse worker comfortable with cold storage operations
Bilingual production worker with team-lead experience and training delivery skills
Inventory specialist with confined-space cert and compliance background
# --- Adjacent-role ambiguity (judge can pick better fit) (3) ---
Warehouse worker who can run inventory cycles and lead a small team
Production line worker comfortable filling in as line supervisor when needed
Customer service rep willing to cross-train into dispatch or scheduling
# --- Soft-attribute + role (uses reliability/availability/engagement scores) (3) ---
Reliable production line lead with strong attendance and lean manufacturing background
Highly responsive forklift operator available for last-minute shift coverage
Engaged warehouse associate with strong safety compliance record
# --- Geographic specificity (multi-state, regional preference) (3) ---
CDL-A driver based in IL or WI, willing to run regional 4-day routes
Bilingual customer service rep in Indianapolis or Cincinnati metro, Spanish and English
Production supervisor open to Midwest relocation for permanent role
# --- OOD honesty signal (system should return low-confidence, not bogus matches) (3) ---
Dental hygienist with three years experience, Indianapolis area
Registered nurse with ICU experience, willing to take per-diem shifts
Software engineer with React and TypeScript, three years experience
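
The file format described in the header comment (one query per non-blank, non-comment line) can be read with a loader along the following lines. This is a minimal sketch, not the harness's actual loader: `parseQueries` is a hypothetical name, and treating any line whose first non-space character is `#` as a comment is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseQueries returns one query per non-blank, non-comment line.
// Assumption: a comment is any line whose first non-space character is '#'.
func parseQueries(text string) []string {
	var queries []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		queries = append(queries, line)
	}
	return queries
}

func main() {
	sample := "# --- OOD honesty signal (3) ---\n" +
		"Dental hygienist with three years experience, Indianapolis area\n" +
		"\n" +
		"Registered nurse with ICU experience, willing to take per-diem shifts\n"
	for i, q := range parseQueries(sample) {
		fmt.Printf("%d: %s\n", i+1, q)
	}
}
```

Run against the full file, a loader like this should yield 21 queries, which gives a cheap sanity check before a reality run.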