golangLAKEHOUSE/tests/reality/playbook_lift_queries.txt
root 3dd7d9fe30 reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.
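The two-pass lift check above can be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: `Result`, `Lift`, `bestRated`, `rankOf`, and `measureLift` are hypothetical names, the real matrix.search response shape is not shown in this commit, and reading "becomes top-1" as "was not top-1 cold, is top-1 warm" is an assumption.

```go
package main

import "fmt"

// Result is a hypothetical search hit (assumed shape, not the real API).
type Result struct {
	ID       string
	Distance float64
}

// Lift summarizes one query's cold-vs-warm comparison (hypothetical type).
type Lift struct {
	RecordedID string // ID the judge rated highest in the cold pass
	ColdRank   int    // 0-based rank of that ID in the cold results
	WarmRank   int    // 0-based rank in the warm results, -1 if absent
	Lifted     bool   // assumed rule: not top-1 cold, top-1 warm
}

// bestRated returns the index of the highest rating; ties keep the
// earlier (closer-by-distance) result.
func bestRated(ratings []int) int {
	best := 0
	for i, r := range ratings {
		if r > ratings[best] {
			best = i
		}
	}
	return best
}

// rankOf returns the 0-based position of id in results, or -1.
func rankOf(results []Result, id string) int {
	for i, r := range results {
		if r.ID == id {
			return i
		}
	}
	return -1
}

// measureLift picks the judge's favorite from the cold pass and checks
// where it lands in the warm pass. ok is false on malformed input.
func measureLift(cold []Result, ratings []int, warm []Result) (l Lift, ok bool) {
	if len(cold) == 0 || len(ratings) != len(cold) {
		return Lift{}, false
	}
	i := bestRated(ratings)
	w := rankOf(warm, cold[i].ID)
	return Lift{
		RecordedID: cold[i].ID,
		ColdRank:   i,
		WarmRank:   w,
		Lifted:     i != 0 && w == 0,
	}, true
}

func main() {
	cold := []Result{{"a", 0.10}, {"b", 0.20}, {"c", 0.30}}
	ratings := []int{2, 5, 3} // judge prefers "b", which is not top-1 by distance
	warm := []Result{{"b", 0.05}, {"a", 0.10}, {"c", 0.30}}
	l, ok := measureLift(cold, ratings, warm)
	fmt.Printf("ok=%v lift=%+v\n", ok, l)
}
```

Keeping `ColdRank` and `WarmRank` alongside the boolean lets a report distinguish "already top-1 cold" from "moved up but not to top-1", which a bare pass/fail flag would hide.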

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy: the
same small-model thesis applied to evaluation. The generated reports
state that limitation explicitly.

Driver compiles cleanly; a full run requires Ollama plus the
workers/candidates ingest. The harness skips cleanly if Ollama is absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:22:36 -05:00


# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Goal: 20+ queries, weighted toward the kinds of asks a staffing
# coordinator would actually issue. Specific roles + certifications +
# constraints surface playbook lift better than generic
# "find a worker"-style queries.
#
# Placeholders (5) — J: replace + extend to 20+ for the real test.
Forklift operator with OSHA-30, warehouse experience, day shift availability
Bilingual customer service rep, Spanish + English, two years call-center experience
CDL Class A driver, clean record, willing to do regional 4-day routes
Production line supervisor with lean manufacturing background
Dental hygienist with three years experience, Indianapolis area