reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles cleanly; a full run requires Ollama + workers/candidates
ingest. Skips (exit 0) if Ollama is absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 8278eb9a87
commit 3dd7d9fe30

.gitignore (vendored, 4 changed lines)
@@ -39,10 +39,14 @@ vendor/
 # Use /reports/* + un-ignore so git can traverse into reports/.
 /reports/*
 !/reports/scrum/
+!/reports/reality-tests/
 # Inside the audit directory, the per-run _evidence/ dump (smoke logs,
 # command output) IS runtime — track the dir, ignore its contents.
 /reports/scrum/_evidence/*
 !/reports/scrum/_evidence/.gitkeep
+# Reality-test JSON evidence is runtime — track the dir + MD reports
+# (committed deliberately as outcome record), ignore per-run JSON.
+/reports/reality-tests/*.json

 # Proof harness runtime output — same pattern as reports/scrum/_evidence.
 # Track the directory but ignore per-run subdirs.
reports/reality-tests/README.md (new file, 69 lines)
@@ -0,0 +1,69 @@
# reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *makes the claims it claims*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**

This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, and code elegance are secondary.

---

## What lives here

Each reality test is a numbered run that produces:

- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in tree as historical baselines.

---
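An abridged sketch of the evidence JSON's shape, using the field names the driver writes under `summary` and `runs`; all values (counts, IDs, timestamps) are illustrative:

```json
{
  "summary": {
    "total": 20,
    "with_discovery": 8,
    "lift_count": 5,
    "no_change": 12,
    "mean_top1_delta_distance": -0.012,
    "playbook_boosted_total": 9,
    "generated_at": "2025-01-01T00:00:00Z"
  },
  "runs": [
    {
      "query": "Forklift operator with OSHA-30, warehouse experience, day shift availability",
      "cold_top1_id": "w-0042",
      "cold_judge_best_rank": 3,
      "cold_judge_best_rating": 5,
      "playbook_recorded": true,
      "warm_top1_id": "w-0107",
      "lift": true
    }
  ]
}
```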
## Test catalog

### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?

**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.

---
## Running a reality test

```bash
# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```

Requires: Ollama on `:11434` with `nomic-embed-text` + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.
---

## Interpreting results

Three thresholds matter on the `playbook_lift` tests:

| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work; move to paraphrase queries |
| 20–50% | Lift exists but is inconsistent — investigate the boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight — diagnose before adding more components |
A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
---
## What this is not

- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground-truth proxy. Sample 5–10 verdicts manually per run to sanity-check that the judge isn't pathological.
scripts/playbook_lift.sh (new executable file, 233 lines)
@@ -0,0 +1,233 @@
#!/usr/bin/env bash
# Playbook-lift reality test — measure whether the 5-loop substrate
# (matrix retrieve+merge + playbook + small-model judge) actually beats
# raw cosine on staffing queries.
#
# Pipeline:
#   1. Boot the Go stack (storaged, embedd, vectord, matrixd, gateway)
#   2. Ingest workers (default 5000) + candidates corpora
#   3. Run the playbook_lift driver: cold pass → judge → record →
#      warm pass → measure
#   4. Generate markdown report from the JSON evidence
#
# Output:
#   reports/reality-tests/playbook_lift_<N>.json — raw evidence
#   reports/reality-tests/playbook_lift_<N>.md   — human report
#
# Requires: Ollama on :11434 with nomic-embed-text + the judge model
# loaded. Skips (exit 0) if Ollama is absent.
#
# Usage:
#   ./scripts/playbook_lift.sh              # run #001 with defaults
#   RUN_ID=002 ./scripts/playbook_lift.sh   # explicit run id
#   JUDGE_MODEL=qwen2.5:latest ./scripts/playbook_lift.sh
#   WORKERS_LIMIT=2000 ./scripts/playbook_lift.sh

set -euo pipefail
cd "$(dirname "$0")/.."

export PATH="$PATH:/usr/local/go/bin"

RUN_ID="${RUN_ID:-001}"
JUDGE_MODEL="${JUDGE_MODEL:-qwen3.5:latest}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
QUERIES_FILE="${QUERIES_FILE:-tests/reality/playbook_lift_queries.txt}"
CORPORA="${CORPORA:-workers,candidates}"
K="${K:-10}"

OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"

if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[lift] Ollama not reachable on :11434 — skipping"
  exit 0
fi

if ! curl -sS http://localhost:11434/api/tags | jq -e --arg m "$JUDGE_MODEL" \
    '.models[] | select(.name == $m)' >/dev/null 2>&1; then
  echo "[lift] judge model '$JUDGE_MODEL' not loaded in Ollama — pull it first"
  exit 1
fi

echo "[lift] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/staffing_candidates \
  ./scripts/playbook_lift

pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3

PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/lift.toml"

cleanup() {
  echo "[lift] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM

cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"

[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""

[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF

poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}

echo "[lift] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }

echo
echo "[lift] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"

echo
echo "[lift] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "warmup" 2>&1 \
  | grep -v "^\[candidates\]\(matrix\|reality\)" || true

echo
echo "[lift] running driver — judge=$JUDGE_MODEL · queries=$QUERIES_FILE · k=$K"
./bin/playbook_lift \
  -gateway "http://127.0.0.1:3110" \
  -ollama "http://localhost:11434" \
  -queries "$QUERIES_FILE" \
  -corpora "$CORPORA" \
  -judge "$JUDGE_MODEL" \
  -k "$K" \
  -out "$OUT_JSON"

echo
echo "[lift] generating markdown report → $OUT_MD"
generate_md() {
  local json="$1" md="$2"
  local total discovery lift no_change boosted mean_delta gen_at
  total=$(jq -r '.summary.total' "$json")
  discovery=$(jq -r '.summary.with_discovery' "$json")
  lift=$(jq -r '.summary.lift_count' "$json")
  no_change=$(jq -r '.summary.no_change' "$json")
  boosted=$(jq -r '.summary.playbook_boosted_total' "$json")
  mean_delta=$(jq -r '.summary.mean_top1_delta_distance' "$json")
  gen_at=$(jq -r '.summary.generated_at' "$json")

  cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}

**Generated:** ${gen_at}
**Judge:** \`${JUDGE_MODEL}\` (Ollama)
**Corpora:** \`${CORPORA}\`
**Workers limit:** ${WORKERS_LIMIT}
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Evidence:** \`${OUT_JSON}\`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | ${total} |
| Cold-pass discoveries (judge-best ≠ top-1) | ${discovery} |
| Warm-pass lifts (recorded playbook → top-1) | ${lift} |
| No change (judge-best already top-1, no playbook needed) | ${no_change} |
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |

**Lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
MDEOF

  jq -r '.runs | to_entries[] |
    [
      (.key + 1 | tostring),
      (.value.query | .[0:60]),
      .value.cold_top1_id,
      ((.value.cold_judge_best_rank | tostring) + "/" + (.value.cold_judge_best_rating | tostring)),
      (if .value.playbook_recorded then "✓ " + (.value.playbook_target_id // "") else "—" end),
      .value.warm_top1_id,
      (.value.warm_judge_best_rank | tostring),
      (if .value.lift then "**YES**" else "no" end)
    ] | "| " + join(" | ") + " |"
  ' "$json" >> "$md"

  cat >> "$md" <<MDEOF

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If \`${JUDGE_MODEL}\` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   \`distance' = distance × (1 - 0.5 × score)\`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Same-query replay is the cheap case.** Real lift comes from *similar but
   not identical* queries hitting a recorded playbook. This run only tests
   verbatim replay. A v2 should add paraphrase queries.
4. **Multi-corpus skew.** Default corpora=\`${CORPORA}\` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.

## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
MDEOF
}

generate_md "$OUT_JSON" "$OUT_MD"

echo
echo "[lift] DONE"
echo "[lift] evidence: $OUT_JSON"
echo "[lift] report:   $OUT_MD"
scripts/playbook_lift/main.go (new file, 391 lines)
@@ -0,0 +1,391 @@
// Playbook-lift reality test driver. Two-pass design:
//
// Pass 1 (cold): for each query → matrix.search use_playbook=false →
//                LLM judge rates top-K → record playbook entry pointing
//                at the highest-rated result (which may NOT be top-1
//                by distance — that's the discovery worth boosting).
//
// Pass 2 (warm): same queries → use_playbook=true → measure how the
//                ranking shifted.
//
// Lift = real if pass-2 brings the LLM-judged-best result into top-1
// more often than pass-1. If lift ≈ 0, the playbook is just confirming
// what cosine already said and the 5-loop thesis is unproven.
//
// Honest about what this measures: with no human-labeled ground truth,
// the LLM judge IS the ground truth proxy. That's the small-model
// pipeline thesis itself — the same model class that runs the inner
// loop is also what we trust to evaluate it. If you don't trust the
// judge, the lift number is meaningless; that's a separate problem
// for ground-truth labeling.
//
// Usage (driven by scripts/playbook_lift.sh):
//
//	playbook_lift -gateway http://127.0.0.1:3110 \
//	  -queries tests/reality/playbook_lift_queries.txt \
//	  -judge qwen3.5:latest \
//	  -corpora workers,candidates \
//	  -k 10 \
//	  -out reports/reality-tests/playbook_lift_001.json
package main

import (
	"bytes"
	"encoding/json"
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"sort"
	"strings"
	"time"
)

type matrixResult struct {
	ID       string          `json:"id"`
	Distance float32         `json:"distance"`
	Corpus   string          `json:"corpus"`
	Metadata json.RawMessage `json:"metadata,omitempty"`
}

type matrixResp struct {
	Results         []matrixResult `json:"results"`
	PerCorpusCounts map[string]int `json:"per_corpus_counts"`
	PlaybookBoosted int            `json:"playbook_boosted,omitempty"`
}

type judgeVerdict struct {
	Rating int    `json:"rating"`
	Reason string `json:"reason"`
}

type queryRun struct {
	Query string `json:"query"`

	ColdTop1ID          string  `json:"cold_top1_id"`
	ColdTop1Distance    float32 `json:"cold_top1_distance"`
	ColdJudgeBestID     string  `json:"cold_judge_best_id"`
	ColdJudgeBestRank   int     `json:"cold_judge_best_rank"`
	ColdJudgeBestRating int     `json:"cold_judge_best_rating"`
	ColdRatings         []int   `json:"cold_ratings"`

	PlaybookRecorded bool   `json:"playbook_recorded"`
	PlaybookID       string `json:"playbook_target_id,omitempty"`

	WarmTop1ID        string  `json:"warm_top1_id"`
	WarmTop1Distance  float32 `json:"warm_top1_distance"`
	WarmBoostedCount  int     `json:"warm_boosted_count"`
	WarmJudgeBestRank int     `json:"warm_judge_best_rank"`

	Lift bool   `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
	Note string `json:"note,omitempty"`
}

type summary struct {
	Total                 int       `json:"total"`
	WithDiscovery         int       `json:"with_discovery"` // judge-best != cold top-1
	LiftCount             int       `json:"lift_count"`     // top-1 changed warm → judge-best
	NoChange              int       `json:"no_change"`
	MeanTop1DeltaDistance float32   `json:"mean_top1_delta_distance"`
	PlaybookBoostedTotal  int       `json:"playbook_boosted_total"`
	GeneratedAt           time.Time `json:"generated_at"`
}

func main() {
	gw := flag.String("gateway", "http://127.0.0.1:3110", "Go gateway base URL")
	ollama := flag.String("ollama", "http://127.0.0.1:11434", "Ollama base URL for LLM judge")
	queries := flag.String("queries", "tests/reality/playbook_lift_queries.txt", "query corpus path")
	corporaCSV := flag.String("corpora", "workers,candidates", "comma-separated matrix corpora")
	judge := flag.String("judge", "qwen3.5:latest", "Ollama model for relevance judging")
	k := flag.Int("k", 10, "top-k from matrix.search per pass")
	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSON path")
	flag.Parse()

	corpora := strings.Split(*corporaCSV, ",")

	qs, err := loadQueries(*queries)
	if err != nil {
		log.Fatalf("load queries: %v", err)
	}
	if len(qs) == 0 {
		log.Fatalf("no queries in %s", *queries)
	}
	log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)

	hc := &http.Client{Timeout: 60 * time.Second}
	runs := make([]queryRun, 0, len(qs))
	totalDelta := float32(0)
	playbookBoostedTotal := 0
	withDiscovery := 0
	liftCount := 0
	noChange := 0

	// Pass 1 (cold) + record playbooks based on judge verdicts.
	for i, q := range qs {
		log.Printf("[lift] (%d/%d cold) %s", i+1, len(qs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, false)
		if err != nil {
			log.Printf("  cold search failed: %v — skipping", err)
			continue
		}
		if len(resp.Results) == 0 {
			log.Printf("  cold returned 0 results — skipping")
			continue
		}
		ratings := make([]int, len(resp.Results))
		bestRank := 0
		bestRating := -1
		for j, r := range resp.Results {
			rating := judgeRate(hc, *ollama, *judge, q, r)
			ratings[j] = rating
			if rating > bestRating {
				bestRating = rating
				bestRank = j
			}
		}
		run := queryRun{
			Query:               q,
			ColdTop1ID:          resp.Results[0].ID,
			ColdTop1Distance:    resp.Results[0].Distance,
			ColdJudgeBestID:     resp.Results[bestRank].ID,
			ColdJudgeBestRank:   bestRank,
			ColdJudgeBestRating: bestRating,
			ColdRatings:         ratings,
		}
		// Record a playbook only if the judge best is not already top-1
		// (otherwise we're boosting something cosine already crowned).
		if bestRank > 0 && bestRating >= 4 {
			withDiscovery++
			if err := playbookRecord(hc, *gw, q, resp.Results[bestRank].ID, resp.Results[bestRank].Corpus, 1.0); err != nil {
				log.Printf("  playbook record failed: %v", err)
				run.Note = "playbook record failed: " + err.Error()
			} else {
				run.PlaybookRecorded = true
				run.PlaybookID = resp.Results[bestRank].ID
			}
		} else if bestRank == 0 {
			run.Note = "judge-best already top-1 cold — no playbook needed"
		} else {
			run.Note = fmt.Sprintf("judge-best rating %d below threshold (4) — no playbook", bestRating)
		}
		runs = append(runs, run)
	}

	// Pass 2 (warm) on the same queries.
	for i := range runs {
		q := runs[i].Query
		log.Printf("[lift] (%d/%d warm) %s", i+1, len(runs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, true)
		if err != nil || len(resp.Results) == 0 {
			runs[i].Note = appendNote(runs[i].Note, fmt.Sprintf("warm search failed: %v", err))
			continue
		}
		runs[i].WarmTop1ID = resp.Results[0].ID
		runs[i].WarmTop1Distance = resp.Results[0].Distance
		runs[i].WarmBoostedCount = resp.PlaybookBoosted
		playbookBoostedTotal += resp.PlaybookBoosted

		// Find where the cold judge-best ID landed in the warm ranking.
		warmRank := -1
		for j, r := range resp.Results {
			if r.ID == runs[i].ColdJudgeBestID {
				warmRank = j
				break
			}
		}
		runs[i].WarmJudgeBestRank = warmRank

		switch {
		case runs[i].PlaybookRecorded && warmRank == 0:
			runs[i].Lift = true
			liftCount++
		case !runs[i].PlaybookRecorded:
			noChange++
		default:
			noChange++
		}
		totalDelta += runs[i].WarmTop1Distance - runs[i].ColdTop1Distance
	}

	sum := summary{
		Total:                 len(runs),
		WithDiscovery:         withDiscovery,
		LiftCount:             liftCount,
		NoChange:              noChange,
		MeanTop1DeltaDistance: 0,
		PlaybookBoostedTotal:  playbookBoostedTotal,
		GeneratedAt:           time.Now().UTC(),
	}
	if len(runs) > 0 {
		sum.MeanTop1DeltaDistance = totalDelta / float32(len(runs))
	}

	if err := writeJSON(*out, runs, sum); err != nil {
		log.Fatalf("write %s: %v", *out, err)
	}
	log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
		sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
	log.Printf("[lift] results → %s", *out)
}

func loadQueries(path string) ([]string, error) {
	bs, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, line := range strings.Split(string(bs), "\n") {
		s := strings.TrimSpace(line)
		if s == "" || strings.HasPrefix(s, "#") {
			continue
		}
		out = append(out, s)
	}
	return out, nil
}

func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, usePlaybook bool) (*matrixResp, error) {
	body := map[string]any{
		"query_text":   query,
		"corpora":      corpora,
		"k":            k,
		"per_corpus_k": k,
		"use_playbook": usePlaybook,
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/search", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	rb, _ := io.ReadAll(resp.Body)
	if resp.StatusCode/100 != 2 {
		return nil, fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	var out matrixResp
	if err := json.Unmarshal(rb, &out); err != nil {
		return nil, fmt.Errorf("unmarshal: %w (body=%s)", err, abbrev(string(rb), 200))
	}
	return &out, nil
}

func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, score float64) error {
	body := map[string]any{
		"query":         query,
		"answer_id":     answerID,
		"answer_corpus": answerCorpus,
		"score":         score,
		"tags":          []string{"reality-test", "playbook-lift-001"},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/playbooks/record", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		rb, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	return nil
}

// judgeRate calls Ollama's /api/chat directly and asks for a 1-5 rating
// of the result against the query. Returns 0 on any failure (treated as
// "couldn't judge, exclude from best-of consideration").
func judgeRate(hc *http.Client, ollamaURL, model, query string, r matrixResult) int {
	system := `You rate retrieval results for a staffing co-pilot.
Rate the result 1-5 against the query:
5 = perfect match (this person/job IS what was asked for)
4 = strong match (right field, right level, minor mismatches)
3 = adjacent match (related field or partial overlap)
2 = weak/tangential match
1 = irrelevant
Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
	user := fmt.Sprintf("Query: %q\n\nResult corpus: %s\nResult ID: %s\nResult metadata:\n%s",
		query, r.Corpus, r.ID, string(r.Metadata))

	body := map[string]any{
		"model":  model,
		"stream": false,
		"format": "json",
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": user},
		},
		"options": map[string]any{"temperature": 0},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return 0
	}
	rb, _ := io.ReadAll(resp.Body)
	var ollamaResp struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return 0
	}
	var v judgeVerdict
	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &v); err != nil {
		return 0
	}
	if v.Rating < 1 || v.Rating > 5 {
		return 0
	}
	return v.Rating
}

func writeJSON(path string, runs []queryRun, sum summary) error {
	if err := os.MkdirAll(filepath_dir(path), 0o755); err != nil {
		return err
	}
	out := struct {
		Summary summary    `json:"summary"`
		Runs    []queryRun `json:"runs"`
	}{Summary: sum, Runs: runs}
	bs, err := json.MarshalIndent(out, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, bs, 0o644)
}

func filepath_dir(p string) string {
	if i := strings.LastIndex(p, "/"); i >= 0 {
		return p[:i]
	}
	return "."
}

func abbrev(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n] + "…"
}

func appendNote(existing, add string) string {
	if existing == "" {
		return add
	}
	return existing + "; " + add
}

// Keep the sort import compiling until a future refactor actually uses
// it — Go treats unused imports as compile errors, not warnings.
var _ = sort.Slice
tests/reality/playbook_lift_queries.txt (new file, 18 lines)
@@ -0,0 +1,18 @@
# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Goal: 20 queries, weighted toward the kinds of asks a staffing
# coordinator would actually issue. Specific roles + certifications +
# constraints surface playbook lift better than generic "find a worker"
# style queries.
#
# Placeholders (5) — J: replace + extend to 20+ for the real test.

Forklift operator with OSHA-30, warehouse experience, day shift availability
Bilingual customer service rep, Spanish + English, two years call-center experience
CDL Class A driver, clean record, willing to do regional 4-day routes
Production line supervisor with lean manufacturing background
Dental hygienist with three years experience, Indianapolis area