reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
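The lift criterion described above reduces to a two-flag check per query. A minimal sketch with made-up values (the flag names mirror the driver's JSON fields):

```bash
# One query's outcome flags, as the driver computes them (values made up):
playbook_recorded=true   # cold pass recorded the judge-best as a playbook entry
warm_judge_best_rank=0   # cold judge-best landed at rank 0 in the warm pass

# Lift counts only when both hold: a playbook was recorded (the judge
# disagreed with cosine) AND the warm pass promoted that answer to top-1.
if [ "$playbook_recorded" = true ] && [ "$warm_judge_best_rank" -eq 0 ]; then
  lift=yes
else
  lift=no
fi
echo "lift=$lift"
```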
parent 8278eb9a87 · commit 3dd7d9fe30
4 .gitignore vendored
@@ -39,10 +39,14 @@ vendor/
# Use /reports/* + un-ignore so git can traverse into reports/.
/reports/*
!/reports/scrum/
!/reports/reality-tests/
# Inside the audit directory, the per-run _evidence/ dump (smoke logs,
# command output) IS runtime — track the dir, ignore its contents.
/reports/scrum/_evidence/*
!/reports/scrum/_evidence/.gitkeep
# Reality-test JSON evidence is runtime — track the dir + MD reports
# (committed deliberately as outcome record), ignore per-run JSON.
/reports/reality-tests/*.json

# Proof harness runtime output — same pattern as reports/scrum/_evidence.
# Track the directory but ignore per-run subdirs.
69 reports/reality-tests/README.md Normal file
@@ -0,0 +1,69 @@
# reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *honors the claims it makes*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**

This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, and code elegance are secondary.

---

## What lives here

Each reality test is a numbered run that produces:

- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in the tree as historical baselines.

---

## Test catalog

### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?

**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.

---

## Running a reality test

```bash
# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```

Requires: Ollama on `:11434` with `nomic-embed-text` + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.

---

## Interpreting results

Three thresholds matter on the `playbook_lift` tests:

| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work; move to paraphrase queries |
| 20–50% | Lift exists but is inconsistent — investigate the boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight — diagnose before adding more components |

A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).

---

## What this is not

- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground-truth proxy. Sample 5–10 verdicts manually per run to sanity-check that the judge isn't pathological.
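The lift-rate and discovery-rate thresholds above can be checked straight from a run's JSON evidence. A minimal sketch using jq, as the harness itself does; the evidence file here is an illustrative stand-in (field names match the driver's summary block, the numbers are made up):

```bash
# Illustrative evidence file — shape matches the driver's summary block.
cat > /tmp/lift_demo.json <<'EOF'
{"summary": {"total": 20, "with_discovery": 8, "lift_count": 5}}
EOF

total=$(jq -r '.summary.total' /tmp/lift_demo.json)
discovery=$(jq -r '.summary.with_discovery' /tmp/lift_demo.json)
lift=$(jq -r '.summary.lift_count' /tmp/lift_demo.json)

# Lift rate = lifts / discoveries; discovery rate = discoveries / total.
# Integer math is fine for threshold checks; guard divide-by-zero.
lift_rate=0
[ "$discovery" -gt 0 ] && lift_rate=$(( 100 * lift / discovery ))
discovery_rate=$(( 100 * discovery / total ))

echo "discovery rate: ${discovery_rate}% · lift rate: ${lift_rate}%"
```

With these made-up numbers the run would land in the ≥ 50% "loop closes" band with a healthy 40% discovery rate.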
233 scripts/playbook_lift.sh Executable file
@@ -0,0 +1,233 @@
#!/usr/bin/env bash
# Playbook-lift reality test — measure whether the 5-loop substrate
# (matrix retrieve+merge + playbook + small-model judge) actually beats
# raw cosine on staffing queries.
#
# Pipeline:
#   1. Boot the Go stack (storaged, embedd, vectord, matrixd, gateway)
#   2. Ingest workers (default 5000) + candidates corpora
#   3. Run the playbook_lift driver: cold pass → judge → record →
#      warm pass → measure
#   4. Generate markdown report from the JSON evidence
#
# Output:
#   reports/reality-tests/playbook_lift_<N>.json — raw evidence
#   reports/reality-tests/playbook_lift_<N>.md   — human report
#
# Requires: Ollama on :11434 with nomic-embed-text + the judge model
# loaded. Skips (exit 0) if Ollama is absent.
#
# Usage:
#   ./scripts/playbook_lift.sh                    # run #001 with defaults
#   RUN_ID=002 ./scripts/playbook_lift.sh         # explicit run id
#   JUDGE_MODEL=qwen2.5:latest ./scripts/playbook_lift.sh
#   WORKERS_LIMIT=2000 ./scripts/playbook_lift.sh

set -euo pipefail
cd "$(dirname "$0")/.."

export PATH="$PATH:/usr/local/go/bin"

RUN_ID="${RUN_ID:-001}"
JUDGE_MODEL="${JUDGE_MODEL:-qwen3.5:latest}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
QUERIES_FILE="${QUERIES_FILE:-tests/reality/playbook_lift_queries.txt}"
CORPORA="${CORPORA:-workers,candidates}"
K="${K:-10}"

OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"

if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[lift] Ollama not reachable on :11434 — skipping"
  exit 0
fi

if ! curl -sS http://localhost:11434/api/tags | jq -e --arg m "$JUDGE_MODEL" \
    '.models[] | select(.name == $m)' >/dev/null 2>&1; then
  echo "[lift] judge model '$JUDGE_MODEL' not loaded in Ollama — pull it first"
  exit 1
fi

echo "[lift] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/staffing_candidates \
  ./scripts/playbook_lift

pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3

PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/lift.toml"

cleanup() {
  echo "[lift] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM

cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"

[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""

[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF

poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}

echo "[lift] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }

echo
echo "[lift] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"

echo
echo "[lift] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "warmup" 2>&1 \
  | grep -v "^\[candidates\]\(matrix\|reality\)" || true

echo
echo "[lift] running driver — judge=$JUDGE_MODEL · queries=$QUERIES_FILE · k=$K"
./bin/playbook_lift \
  -gateway "http://127.0.0.1:3110" \
  -ollama "http://localhost:11434" \
  -queries "$QUERIES_FILE" \
  -corpora "$CORPORA" \
  -judge "$JUDGE_MODEL" \
  -k "$K" \
  -out "$OUT_JSON"

echo
echo "[lift] generating markdown report → $OUT_MD"
generate_md() {
  local json="$1" md="$2"
  local total discovery lift no_change boosted mean_delta gen_at
  total=$(jq -r '.summary.total' "$json")
  discovery=$(jq -r '.summary.with_discovery' "$json")
  lift=$(jq -r '.summary.lift_count' "$json")
  no_change=$(jq -r '.summary.no_change' "$json")
  boosted=$(jq -r '.summary.playbook_boosted_total' "$json")
  mean_delta=$(jq -r '.summary.mean_top1_delta_distance' "$json")
  gen_at=$(jq -r '.summary.generated_at' "$json")

  cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}

**Generated:** ${gen_at}
**Judge:** \`${JUDGE_MODEL}\` (Ollama)
**Corpora:** \`${CORPORA}\`
**Workers limit:** ${WORKERS_LIMIT}
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Evidence:** \`${OUT_JSON}\`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | ${total} |
| Cold-pass discoveries (judge-best ≠ top-1) | ${discovery} |
| Warm-pass lifts (recorded playbook → top-1) | ${lift} |
| No lift (judge-best already top-1, or boost insufficient) | ${no_change} |
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |

**Lift rate:** ${lift} of ${discovery} discoveries became top-1 after the warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
MDEOF

  jq -r '.runs | to_entries[] |
    [
      (.key + 1 | tostring),
      (.value.query | .[0:60]),
      .value.cold_top1_id,
      ((.value.cold_judge_best_rank | tostring) + "/" + (.value.cold_judge_best_rating | tostring)),
      (if .value.playbook_recorded then "✓ " + (.value.playbook_target_id // "") else "—" end),
      .value.warm_top1_id,
      (.value.warm_judge_best_rank | tostring),
      (if .value.lift then "**YES**" else "no" end)
    ] | "| " + join(" | ") + " |"
  ' "$json" >> "$md"

  cat >> "$md" <<MDEOF

---

## Honesty caveats

1. **Judge IS the ground-truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If \`${JUDGE_MODEL}\` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   \`distance' = distance × (1 - 0.5 × score)\`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance; otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Same-query replay is the cheap case.** Real lift comes from *similar but
   not identical* queries hitting a recorded playbook. This run only tests
   verbatim replay. A v2 should add paraphrase queries.
4. **Multi-corpus skew.** Default corpora=\`${CORPORA}\` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check the per-corpus distribution in the JSON.

## Next moves

- If the lift rate is ≥ 50% of discoveries: the matrix layer + playbook is
  doing real work. Move to paraphrase queries + tag-based boost (currently
  ignored).
- If the lift rate is < 20%: investigate why — judge variance, distance gap
  too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may
  need retuning.
- If the discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
MDEOF
}

generate_md "$OUT_JSON" "$OUT_MD"

echo
echo "[lift] DONE"
echo "[lift] evidence: $OUT_JSON"
echo "[lift] report:   $OUT_MD"
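The boost formula in the report's honesty caveats is easy to sanity-check numerically. A sketch with made-up distances (not from any real run) showing when a score-1.0 boost promotes the judge-best result:

```bash
# Cold pass: top-1 at distance 0.30; judge-best sits lower with 0.55.
top1=0.30
judge_best=0.55
score=1.0

# Playbook boost from the honesty caveats: distance' = distance × (1 - 0.5 × score)
boosted=$(awk -v d="$judge_best" -v s="$score" 'BEGIN { printf "%.4f", d * (1 - 0.5 * s) }')
echo "boosted distance: $boosted"

# The boost promotes judge-best iff its halved distance undercuts the
# (unboosted) cold top-1 — i.e. its pre-boost distance is under 2× top-1's.
if awk -v b="$boosted" -v t="$top1" 'BEGIN { exit !(b < t) }'; then
  echo "lift: judge-best becomes warm top-1"
else
  echo "no lift: gap too wide even after halving"
fi
```

Here 0.55 halves to 0.2750, which beats 0.30, so this query lifts; at a pre-boost distance of 0.65 it would not.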
391 scripts/playbook_lift/main.go Normal file
@@ -0,0 +1,391 @@
// Playbook-lift reality test driver. Two-pass design:
//
// Pass 1 (cold): for each query → matrix.search use_playbook=false →
//                LLM judge rates top-K → record playbook entry pointing
//                at the highest-rated result (which may NOT be top-1
//                by distance — that's the discovery worth boosting).
//
// Pass 2 (warm): same queries → use_playbook=true → measure how the
//                ranking shifted.
//
// Lift = real if pass-2 brings the LLM-judged-best result into top-1
// more often than pass-1. If lift ≈ 0, the playbook is just confirming
// what cosine already said and the 5-loop thesis is unproven.
//
// Honest about what this measures: with no human-labeled ground truth,
// the LLM judge IS the ground-truth proxy. That's the small-model
// pipeline thesis itself — the same model class that runs the inner
// loop is also what we trust to evaluate it. If you don't trust the
// judge, the lift number is meaningless; that's a separate problem
// for ground-truth labeling.
//
// Usage (driven by scripts/playbook_lift.sh):
//
//	playbook_lift -gateway http://127.0.0.1:3110 \
//	  -queries tests/reality/playbook_lift_queries.txt \
//	  -judge qwen3.5:latest \
//	  -corpora workers,candidates \
//	  -k 10 \
//	  -out reports/reality-tests/playbook_lift_001.json
package main

import (
	"bytes"
	"encoding/json"
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"strings"
	"time"
)

type matrixResult struct {
	ID       string          `json:"id"`
	Distance float32         `json:"distance"`
	Corpus   string          `json:"corpus"`
	Metadata json.RawMessage `json:"metadata,omitempty"`
}

type matrixResp struct {
	Results         []matrixResult `json:"results"`
	PerCorpusCounts map[string]int `json:"per_corpus_counts"`
	PlaybookBoosted int            `json:"playbook_boosted,omitempty"`
}

type judgeVerdict struct {
	Rating int    `json:"rating"`
	Reason string `json:"reason"`
}

type queryRun struct {
	Query string `json:"query"`

	ColdTop1ID          string  `json:"cold_top1_id"`
	ColdTop1Distance    float32 `json:"cold_top1_distance"`
	ColdJudgeBestID     string  `json:"cold_judge_best_id"`
	ColdJudgeBestRank   int     `json:"cold_judge_best_rank"`
	ColdJudgeBestRating int     `json:"cold_judge_best_rating"`
	ColdRatings         []int   `json:"cold_ratings"`

	PlaybookRecorded bool   `json:"playbook_recorded"`
	PlaybookID       string `json:"playbook_target_id,omitempty"`

	WarmTop1ID        string  `json:"warm_top1_id"`
	WarmTop1Distance  float32 `json:"warm_top1_distance"`
	WarmBoostedCount  int     `json:"warm_boosted_count"`
	WarmJudgeBestRank int     `json:"warm_judge_best_rank"`

	Lift bool   `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
	Note string `json:"note,omitempty"`
}

type summary struct {
	Total                 int       `json:"total"`
	WithDiscovery         int       `json:"with_discovery"` // judge-best != cold top-1
	LiftCount             int       `json:"lift_count"`     // judge-best became top-1 warm
	NoChange              int       `json:"no_change"`
	MeanTop1DeltaDistance float32   `json:"mean_top1_delta_distance"`
	PlaybookBoostedTotal  int       `json:"playbook_boosted_total"`
	GeneratedAt           time.Time `json:"generated_at"`
}

func main() {
	gw := flag.String("gateway", "http://127.0.0.1:3110", "Go gateway base URL")
	ollama := flag.String("ollama", "http://127.0.0.1:11434", "Ollama base URL for LLM judge")
	queries := flag.String("queries", "tests/reality/playbook_lift_queries.txt", "query corpus path")
	corporaCSV := flag.String("corpora", "workers,candidates", "comma-separated matrix corpora")
	judge := flag.String("judge", "qwen3.5:latest", "Ollama model for relevance judging")
	k := flag.Int("k", 10, "top-k from matrix.search per pass")
	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSON path")
	flag.Parse()

	corpora := strings.Split(*corporaCSV, ",")

	qs, err := loadQueries(*queries)
	if err != nil {
		log.Fatalf("load queries: %v", err)
	}
	if len(qs) == 0 {
		log.Fatalf("no queries in %s", *queries)
	}
	log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)

	hc := &http.Client{Timeout: 60 * time.Second}
	runs := make([]queryRun, 0, len(qs))
	totalDelta := float32(0)
	playbookBoostedTotal := 0
	withDiscovery := 0
	liftCount := 0
	noChange := 0

	// Pass 1 (cold) + record playbooks based on judge verdicts.
	for i, q := range qs {
		log.Printf("[lift] (%d/%d cold) %s", i+1, len(qs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, false)
		if err != nil {
			log.Printf("  cold search failed: %v — skipping", err)
			continue
		}
		if len(resp.Results) == 0 {
			log.Printf("  cold returned 0 results — skipping")
			continue
		}
		ratings := make([]int, len(resp.Results))
		bestRank := 0
		bestRating := -1
		for j, r := range resp.Results {
			rating := judgeRate(hc, *ollama, *judge, q, r)
			ratings[j] = rating
			if rating > bestRating {
				bestRating = rating
				bestRank = j
			}
		}
		run := queryRun{
			Query:               q,
			ColdTop1ID:          resp.Results[0].ID,
			ColdTop1Distance:    resp.Results[0].Distance,
			ColdJudgeBestID:     resp.Results[bestRank].ID,
			ColdJudgeBestRank:   bestRank,
			ColdJudgeBestRating: bestRating,
			ColdRatings:         ratings,
		}
		// Record a playbook only if the judge best is not already top-1
		// (otherwise we're boosting something cosine already crowned).
		if bestRank > 0 && bestRating >= 4 {
			withDiscovery++
			if err := playbookRecord(hc, *gw, q, resp.Results[bestRank].ID, resp.Results[bestRank].Corpus, 1.0); err != nil {
				log.Printf("  playbook record failed: %v", err)
				run.Note = "playbook record failed: " + err.Error()
			} else {
				run.PlaybookRecorded = true
				run.PlaybookID = resp.Results[bestRank].ID
			}
		} else if bestRank == 0 {
			run.Note = "judge-best already top-1 cold — no playbook needed"
		} else {
			run.Note = fmt.Sprintf("judge-best rating %d below threshold (4) — no playbook", bestRating)
		}
		runs = append(runs, run)
	}

	// Pass 2 (warm) on the same queries.
	for i := range runs {
		q := runs[i].Query
		log.Printf("[lift] (%d/%d warm) %s", i+1, len(runs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, true)
		if err != nil || len(resp.Results) == 0 {
			runs[i].Note = appendNote(runs[i].Note, fmt.Sprintf("warm search failed: %v", err))
			continue
		}
		runs[i].WarmTop1ID = resp.Results[0].ID
		runs[i].WarmTop1Distance = resp.Results[0].Distance
		runs[i].WarmBoostedCount = resp.PlaybookBoosted
		playbookBoostedTotal += resp.PlaybookBoosted

		// Find where the cold judge-best ID landed in the warm ranking.
		warmRank := -1
		for j, r := range resp.Results {
			if r.ID == runs[i].ColdJudgeBestID {
				warmRank = j
				break
			}
		}
		runs[i].WarmJudgeBestRank = warmRank

		if runs[i].PlaybookRecorded && warmRank == 0 {
			runs[i].Lift = true
			liftCount++
		} else {
			// Covers both "no playbook recorded" and "recorded but boost
			// insufficient" — the per-query JSON fields disambiguate.
			noChange++
		}
		totalDelta += runs[i].WarmTop1Distance - runs[i].ColdTop1Distance
	}

	sum := summary{
		Total:                len(runs),
		WithDiscovery:        withDiscovery,
		LiftCount:            liftCount,
		NoChange:             noChange,
		MeanTop1DeltaDistance: 0,
		PlaybookBoostedTotal: playbookBoostedTotal,
		GeneratedAt:          time.Now().UTC(),
	}
	if len(runs) > 0 {
		sum.MeanTop1DeltaDistance = totalDelta / float32(len(runs))
	}

	if err := writeJSON(*out, runs, sum); err != nil {
		log.Fatalf("write %s: %v", *out, err)
	}
	log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
		sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
	log.Printf("[lift] results → %s", *out)
}

func loadQueries(path string) ([]string, error) {
	bs, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, line := range strings.Split(string(bs), "\n") {
		s := strings.TrimSpace(line)
		if s == "" || strings.HasPrefix(s, "#") {
			continue
		}
		out = append(out, s)
	}
	return out, nil
}

func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, usePlaybook bool) (*matrixResp, error) {
	body := map[string]any{
		"query_text":   query,
		"corpora":      corpora,
		"k":            k,
		"per_corpus_k": k,
		"use_playbook": usePlaybook,
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/search", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	rb, _ := io.ReadAll(resp.Body)
	if resp.StatusCode/100 != 2 {
		return nil, fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	var out matrixResp
	if err := json.Unmarshal(rb, &out); err != nil {
		return nil, fmt.Errorf("unmarshal: %w (body=%s)", err, abbrev(string(rb), 200))
	}
	return &out, nil
}

func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, score float64) error {
	body := map[string]any{
		"query":         query,
		"answer_id":     answerID,
		"answer_corpus": answerCorpus,
		"score":         score,
		"tags":          []string{"reality-test", "playbook-lift-001"},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/playbooks/record", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		rb, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	return nil
}

// judgeRate calls Ollama's /api/chat directly and asks for a 1-5 rating
// of the result against the query. Returns 0 on any failure (treated as
// "couldn't judge, exclude from best-of consideration").
func judgeRate(hc *http.Client, ollamaURL, model, query string, r matrixResult) int {
	system := `You rate retrieval results for a staffing co-pilot.
Rate the result 1-5 against the query:
5 = perfect match (this person/job IS what was asked for)
4 = strong match (right field, right level, minor mismatches)
3 = adjacent match (related field or partial overlap)
2 = weak/tangential match
1 = irrelevant
Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
	user := fmt.Sprintf("Query: %q\n\nResult corpus: %s\nResult ID: %s\nResult metadata:\n%s",
		query, r.Corpus, r.ID, string(r.Metadata))

	body := map[string]any{
		"model":  model,
		"stream": false,
		"format": "json",
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": user},
		},
		"options": map[string]any{"temperature": 0},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return 0
	}
	rb, _ := io.ReadAll(resp.Body)
	var ollamaResp struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return 0
	}
	var v judgeVerdict
	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &v); err != nil {
		return 0
	}
	if v.Rating < 1 || v.Rating > 5 {
		return 0
	}
	return v.Rating
}

func writeJSON(path string, runs []queryRun, sum summary) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	out := struct {
		Summary summary    `json:"summary"`
		Runs    []queryRun `json:"runs"`
	}{Summary: sum, Runs: runs}
	bs, err := json.MarshalIndent(out, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, bs, 0o644)
}

// abbrev truncates s to n runes (not bytes, so multibyte queries don't
// get cut mid-character) and appends an ellipsis.
func abbrev(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n]) + "…"
}

func appendNote(existing, add string) string {
	if existing == "" {
		return add
	}
	return existing + "; " + add
}
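judgeRate's "return 0 on any failure" contract can be exercised without a live Ollama. A hypothetical shell analogue, with jq standing in for the Go JSON parsing; the verdict strings are made up:

```bash
# Verdicts as they would arrive in the Ollama message content.
good='{"rating": 4, "reason": "right field, right level"}'
bad='not json at all'
out_of_range='{"rating": 9, "reason": "overeager judge"}'

rate() {
  # Echo the rating if it parses as an integer in 1..5; otherwise 0 —
  # mirroring judgeRate's "couldn't judge, exclude from best-of" contract.
  local r
  r=$(printf '%s' "$1" | jq -r '.rating // empty' 2>/dev/null) || r=""
  case "$r" in
    [1-5]) echo "$r" ;;
    *)     echo 0 ;;
  esac
}

echo "good         → $(rate "$good")"
echo "bad          → $(rate "$bad")"
echo "out_of_range → $(rate "$out_of_range")"
```

Collapsing every failure mode (transport error, non-JSON output, out-of-range rating) to 0 keeps broken verdicts from winning best-of, at the cost of not distinguishing "judge unavailable" from "judge said irrelevant".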
18 tests/reality/playbook_lift_queries.txt Normal file
@@ -0,0 +1,18 @@
# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Goal: 20 queries, weighted toward the kinds of asks a staffing
# coordinator would actually issue. Specific roles + certifications +
# constraints surface playbook lift better than generic "find a worker"
# style queries.
#
# Placeholders (5) — J: replace + extend to 20+ for the real test.

Forklift operator with OSHA-30, warehouse experience, day shift availability
Bilingual customer service rep, Spanish + English, two years call-center experience
CDL Class A driver, clean record, willing to do regional 4-day routes
Production line supervisor with lean manufacturing background
Dental hygienist with three years experience, Indianapolis area