reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?

First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
  rates top-K → record playbook entry pointing at the highest-rated
  result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
  ranking shift. Lift = real if recorded answer becomes top-1.

Files:
- scripts/playbook_lift/main.go         driver (391 LoC)
- scripts/playbook_lift.sh              stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt  query corpus (5 placeholders;
                                            J writes real 20+)
- reports/reality-tests/README.md       framework + interpretation
- .gitignore                            track reports/reality-tests/
                                        but ignore per-run JSON evidence

This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.

Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-29 23:22:36 -05:00
parent 8278eb9a87
commit 3dd7d9fe30
5 changed files with 715 additions and 0 deletions

4
.gitignore vendored

@@ -39,10 +39,14 @@ vendor/
# Use /reports/* + un-ignore so git can traverse into reports/.
/reports/*
!/reports/scrum/
!/reports/reality-tests/
# Inside the audit directory, the per-run _evidence/ dump (smoke logs,
# command output) IS runtime — track the dir, ignore its contents.
/reports/scrum/_evidence/*
!/reports/scrum/_evidence/.gitkeep
# Reality-test JSON evidence is runtime — track the dir + MD reports
# (committed deliberately as outcome record), ignore per-run JSON.
/reports/reality-tests/*.json
# Proof harness runtime output — same pattern as reports/scrum/_evidence.
# Track the directory but ignore per-run subdirs.

69
reports/reality-tests/README.md

@@ -0,0 +1,69 @@
# reports/reality-tests — does the 5-loop substrate actually work?
Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *backs the claims it makes*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**
This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, code elegance are secondary.
---
## What lives here
Each reality test is a numbered run that produces:
- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves
Runs are append-only. Earlier runs stay in tree as historical baseline.
---
## Test catalog
### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?
**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.
The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.
See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.
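Concretely, the boost math decides whether a discovery can become a lift at all. A minimal sketch, using the boost formula the run reports describe (`distance' = distance × (1 - 0.5 × score)`) with invented distances:

```go
package main

import "fmt"

// playbookBoost applies the boost formula from the run reports:
// distance' = distance * (1 - 0.5*score). A score of 1.0 halves the distance.
func playbookBoost(distance, score float32) float32 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := float32(0.30)  // hypothetical cold top-1 distance
	judgeBest := float32(0.55) // hypothetical judge-best, ranked lower cold

	boosted := playbookBoost(judgeBest, 1.0)
	// Lift requires the boosted judge-best to undercut the cold top-1,
	// i.e. its pre-boost distance must be under 2x the top-1 distance.
	fmt.Printf("boosted=%.3f lift=%v\n", boosted, boosted < coldTop1)
}
```

With these hypothetical numbers the boosted judge-best (0.275) squeaks under the cold top-1 (0.30), so verbatim replay would lift it; at a pre-boost distance of 0.60 or more, even a full-score boost could not.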
---
## Running a reality test
```bash
# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh
# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh
# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```
Requires: Ollama on `:11434` with `nomic-embed-text` + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.
---
## Interpreting results
Three thresholds matter on the `playbook_lift` tests:
| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work, move to paraphrase queries |
| 20-50% | Lift exists but inconsistent — investigate boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight; diagnose before adding more components |
A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug, but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
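These thresholds can be applied mechanically once `lift_count` and `with_discovery` are read from the summary JSON. A sketch (cut-off values from the table above; the verdict strings are illustrative, not part of the harness):

```go
package main

import "fmt"

// verdict buckets a lift rate (lifts / discoveries) against the
// thresholds in the interpretation table.
func verdict(lifts, discoveries int) string {
	if discoveries == 0 {
		return "no discoveries: cosine already optimal on this query set"
	}
	rate := float64(lifts) / float64(discoveries)
	switch {
	case rate >= 0.5:
		return "loop closes: move to paraphrase queries"
	case rate >= 0.2:
		return "inconsistent lift: investigate boost math or judge variance"
	default:
		return "loop not pulling its weight: diagnose first"
	}
}

func main() {
	fmt.Println(verdict(3, 5)) // 60% of discoveries lifted
}
```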
---
## What this is not
- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground truth proxy. Sample 5-10 verdicts manually per run to sanity-check the judge isn't pathological.

233
scripts/playbook_lift.sh Executable file

@@ -0,0 +1,233 @@
#!/usr/bin/env bash
# Playbook-lift reality test — measure whether the 5-loop substrate
# (matrix retrieve+merge + playbook + small-model judge) actually beats
# raw cosine on staffing queries.
#
# Pipeline:
# 1. Boot the Go stack (storaged, embedd, vectord, matrixd, gateway)
# 2. Ingest workers (default 5000) + candidates corpora
# 3. Run the playbook_lift driver: cold pass → judge → record →
# warm pass → measure
# 4. Generate markdown report from the JSON evidence
#
# Output:
# reports/reality-tests/playbook_lift_<N>.json — raw evidence
# reports/reality-tests/playbook_lift_<N>.md — human report
#
# Requires: Ollama on :11434 with nomic-embed-text + the judge model
# loaded. Skips (exit 0) if Ollama is absent.
#
# Usage:
# ./scripts/playbook_lift.sh # run #001 with defaults
# RUN_ID=002 ./scripts/playbook_lift.sh # explicit run id
# JUDGE_MODEL=qwen2.5:latest ./scripts/playbook_lift.sh
# WORKERS_LIMIT=2000 ./scripts/playbook_lift.sh
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
RUN_ID="${RUN_ID:-001}"
JUDGE_MODEL="${JUDGE_MODEL:-qwen3.5:latest}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
QUERIES_FILE="${QUERIES_FILE:-tests/reality/playbook_lift_queries.txt}"
CORPORA="${CORPORA:-workers,candidates}"
K="${K:-10}"
OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
echo "[lift] Ollama not reachable on :11434 — skipping"
exit 0
fi
if ! curl -sS http://localhost:11434/api/tags | jq -e --arg m "$JUDGE_MODEL" \
'.models[] | select(.name == $m)' >/dev/null 2>&1; then
echo "[lift] judge model '$JUDGE_MODEL' not loaded in Ollama — pull it first"
exit 1
fi
echo "[lift] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
./scripts/staffing_workers ./scripts/staffing_candidates \
./scripts/playbook_lift
pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/lift.toml"
cleanup() {
echo "[lift] cleanup"
for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
poll_health() {
local port="$1" deadline=$(($(date +%s) + 5))
while [ "$(date +%s)" -lt "$deadline" ]; do
if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
sleep 0.05
done
return 1
}
echo "[lift] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[lift] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"
echo
echo "[lift] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "warmup" 2>&1 \
| grep -v "^\[candidates\]\(matrix\|reality\)" || true
echo
echo "[lift] running driver — judge=$JUDGE_MODEL · queries=$QUERIES_FILE · k=$K"
./bin/playbook_lift \
-gateway "http://127.0.0.1:3110" \
-ollama "http://localhost:11434" \
-queries "$QUERIES_FILE" \
-corpora "$CORPORA" \
-judge "$JUDGE_MODEL" \
-k "$K" \
-out "$OUT_JSON"
echo
echo "[lift] generating markdown report → $OUT_MD"
generate_md() {
local json="$1" md="$2"
local total discovery lift no_change boosted mean_delta gen_at
total=$(jq -r '.summary.total' "$json")
discovery=$(jq -r '.summary.with_discovery' "$json")
lift=$(jq -r '.summary.lift_count' "$json")
no_change=$(jq -r '.summary.no_change' "$json")
boosted=$(jq -r '.summary.playbook_boosted_total' "$json")
mean_delta=$(jq -r '.summary.mean_top1_delta_distance' "$json")
gen_at=$(jq -r '.summary.generated_at' "$json")
cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}
**Generated:** ${gen_at}
**Judge:** \`${JUDGE_MODEL}\` (Ollama)
**Corpora:** \`${CORPORA}\`
**Workers limit:** ${WORKERS_LIMIT}
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Evidence:** \`${OUT_JSON}\`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | ${total} |
| Cold-pass discoveries (judge-best ≠ top-1) | ${discovery} |
| Warm-pass lifts (recorded playbook → top-1) | ${lift} |
| No change (judge-best already top-1, no playbook needed) | ${no_change} |
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
**Lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
MDEOF
jq -r '.runs | to_entries[] |
[
(.key + 1 | tostring),
(.value.query | .[0:60]),
.value.cold_top1_id,
((.value.cold_judge_best_rank | tostring) + "/" + (.value.cold_judge_best_rating | tostring)),
(if .value.playbook_recorded then "✓ " + (.value.playbook_target_id // "") else "—" end),
.value.warm_top1_id,
(.value.warm_judge_best_rank | tostring),
(if .value.lift then "**YES**" else "no" end)
] | "| " + join(" | ") + " |"
' "$json" >> "$md"
cat >> "$md" <<MDEOF
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If \`${JUDGE_MODEL}\` rates badly,
the lift number is meaningless. To validate the judge itself, sample 5-10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
\`distance' = distance × (1 - 0.5 × score)\`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Same-query replay is the cheap case.** Real lift comes from *similar but
not identical* queries hitting a recorded playbook. This run only tests
verbatim replay. A v2 should add paraphrase queries.
4. **Multi-corpus skew.** Default corpora=\`${CORPORA}\`; if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.
MDEOF
}
generate_md "$OUT_JSON" "$OUT_MD"
echo
echo "[lift] DONE"
echo "[lift] evidence: $OUT_JSON"
echo "[lift] report: $OUT_MD"

391
scripts/playbook_lift/main.go

@@ -0,0 +1,391 @@
// Playbook-lift reality test driver. Two-pass design:
//
// Pass 1 (cold): for each query → matrix.search use_playbook=false →
// LLM judge rates top-K → record playbook entry pointing
// at the highest-rated result (which may NOT be top-1
// by distance — that's the discovery worth boosting).
//
// Pass 2 (warm): same queries → use_playbook=true → measure how the
// ranking shifted.
//
// Lift = real if pass-2 brings the LLM-judged-best result into top-1
// more often than pass-1. If lift ≈ 0, the playbook is just confirming
// what cosine already said and the 5-loop thesis is unproven.
//
// Honest about what this measures: with no human-labeled ground truth,
// the LLM judge IS the ground truth proxy. That's the small-model
// pipeline thesis itself — the same model class that runs the inner
// loop is also what we trust to evaluate it. If you don't trust the
// judge, the lift number is meaningless; that's a separate problem
// for ground-truth labeling.
//
// Usage (driven by scripts/playbook_lift.sh):
// playbook_lift -gateway http://127.0.0.1:3110 \
// -queries tests/reality/playbook_lift_queries.txt \
// -judge qwen3.5:latest \
// -corpora workers,candidates \
// -k 10 \
// -out reports/reality-tests/playbook_lift_001.json
package main
import (
"bytes"
"encoding/json"
"flag"
"fmt"
"io"
"log"
"net/http"
"os"
"sort"
"strings"
"time"
)
type matrixResult struct {
ID string `json:"id"`
Distance float32 `json:"distance"`
Corpus string `json:"corpus"`
Metadata json.RawMessage `json:"metadata,omitempty"`
}
type matrixResp struct {
Results []matrixResult `json:"results"`
PerCorpusCounts map[string]int `json:"per_corpus_counts"`
PlaybookBoosted int `json:"playbook_boosted,omitempty"`
}
type judgeVerdict struct {
Rating int `json:"rating"`
Reason string `json:"reason"`
}
type queryRun struct {
Query string `json:"query"`
ColdTop1ID string `json:"cold_top1_id"`
ColdTop1Distance float32 `json:"cold_top1_distance"`
ColdJudgeBestID string `json:"cold_judge_best_id"`
ColdJudgeBestRank int `json:"cold_judge_best_rank"`
ColdJudgeBestRating int `json:"cold_judge_best_rating"`
ColdRatings []int `json:"cold_ratings"`
PlaybookRecorded bool `json:"playbook_recorded"`
PlaybookID string `json:"playbook_target_id,omitempty"`
WarmTop1ID string `json:"warm_top1_id"`
WarmTop1Distance float32 `json:"warm_top1_distance"`
WarmBoostedCount int `json:"warm_boosted_count"`
WarmJudgeBestRank int `json:"warm_judge_best_rank"`
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
Note string `json:"note,omitempty"`
}
type summary struct {
Total int `json:"total"`
WithDiscovery int `json:"with_discovery"` // judge-best != cold top-1
LiftCount int `json:"lift_count"` // judge-best became top-1 on the warm pass
NoChange int `json:"no_change"`
MeanTop1DeltaDistance float32 `json:"mean_top1_delta_distance"`
PlaybookBoostedTotal int `json:"playbook_boosted_total"`
GeneratedAt time.Time `json:"generated_at"`
}
func main() {
gw := flag.String("gateway", "http://127.0.0.1:3110", "Go gateway base URL")
ollama := flag.String("ollama", "http://127.0.0.1:11434", "Ollama base URL for LLM judge")
queries := flag.String("queries", "tests/reality/playbook_lift_queries.txt", "query corpus path")
corporaCSV := flag.String("corpora", "workers,candidates", "comma-separated matrix corpora")
judge := flag.String("judge", "qwen3.5:latest", "Ollama model for relevance judging")
k := flag.Int("k", 10, "top-k from matrix.search per pass")
out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSON path")
flag.Parse()
corpora := strings.Split(*corporaCSV, ",")
qs, err := loadQueries(*queries)
if err != nil {
log.Fatalf("load queries: %v", err)
}
if len(qs) == 0 {
log.Fatalf("no queries in %s", *queries)
}
log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)
hc := &http.Client{Timeout: 60 * time.Second}
runs := make([]queryRun, 0, len(qs))
totalDelta := float32(0)
playbookBoostedTotal := 0
withDiscovery := 0
liftCount := 0
noChange := 0
// Pass 1 (cold) + record playbooks based on judge verdicts.
for i, q := range qs {
log.Printf("[lift] (%d/%d cold) %s", i+1, len(qs), abbrev(q, 60))
resp, err := matrixSearch(hc, *gw, q, corpora, *k, false)
if err != nil {
log.Printf(" cold search failed: %v — skipping", err)
continue
}
if len(resp.Results) == 0 {
log.Printf(" cold returned 0 results — skipping")
continue
}
ratings := make([]int, len(resp.Results))
bestRank := 0
bestRating := -1
for j, r := range resp.Results {
rating := judgeRate(hc, *ollama, *judge, q, r)
ratings[j] = rating
if rating > bestRating {
bestRating = rating
bestRank = j
}
}
run := queryRun{
Query: q,
ColdTop1ID: resp.Results[0].ID,
ColdTop1Distance: resp.Results[0].Distance,
ColdJudgeBestID: resp.Results[bestRank].ID,
ColdJudgeBestRank: bestRank,
ColdJudgeBestRating: bestRating,
ColdRatings: ratings,
}
// Record a playbook only if the judge best is not already top-1
// (otherwise we're boosting something cosine already crowned).
if bestRank > 0 && bestRating >= 4 {
withDiscovery++
if err := playbookRecord(hc, *gw, q, resp.Results[bestRank].ID, resp.Results[bestRank].Corpus, 1.0); err != nil {
log.Printf(" playbook record failed: %v", err)
run.Note = "playbook record failed: " + err.Error()
} else {
run.PlaybookRecorded = true
run.PlaybookID = resp.Results[bestRank].ID
}
} else if bestRank == 0 {
run.Note = "judge-best already top-1 cold — no playbook needed"
} else {
run.Note = fmt.Sprintf("judge-best rating %d below threshold (4) — no playbook", bestRating)
}
runs = append(runs, run)
}
// Pass 2 (warm) on the same queries.
for i := range runs {
q := runs[i].Query
log.Printf("[lift] (%d/%d warm) %s", i+1, len(runs), abbrev(q, 60))
resp, err := matrixSearch(hc, *gw, q, corpora, *k, true)
if err != nil || len(resp.Results) == 0 {
runs[i].Note = appendNote(runs[i].Note, fmt.Sprintf("warm search failed or returned 0 results: %v", err))
continue
}
runs[i].WarmTop1ID = resp.Results[0].ID
runs[i].WarmTop1Distance = resp.Results[0].Distance
runs[i].WarmBoostedCount = resp.PlaybookBoosted
playbookBoostedTotal += resp.PlaybookBoosted
// Find where the cold judge-best ID landed in the warm ranking.
warmRank := -1
for j, r := range resp.Results {
if r.ID == runs[i].ColdJudgeBestID {
warmRank = j
break
}
}
runs[i].WarmJudgeBestRank = warmRank
switch {
case runs[i].PlaybookRecorded && warmRank == 0:
runs[i].Lift = true
liftCount++
case !runs[i].PlaybookRecorded:
noChange++
default:
noChange++
}
totalDelta += runs[i].WarmTop1Distance - runs[i].ColdTop1Distance
}
sum := summary{
Total: len(runs),
WithDiscovery: withDiscovery,
LiftCount: liftCount,
NoChange: noChange,
MeanTop1DeltaDistance: 0,
PlaybookBoostedTotal: playbookBoostedTotal,
GeneratedAt: time.Now().UTC(),
}
if len(runs) > 0 {
sum.MeanTop1DeltaDistance = totalDelta / float32(len(runs))
}
if err := writeJSON(*out, runs, sum); err != nil {
log.Fatalf("write %s: %v", *out, err)
}
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
log.Printf("[lift] results → %s", *out)
}
func loadQueries(path string) ([]string, error) {
bs, err := os.ReadFile(path)
if err != nil {
return nil, err
}
var out []string
for _, line := range strings.Split(string(bs), "\n") {
s := strings.TrimSpace(line)
if s == "" || strings.HasPrefix(s, "#") {
continue
}
out = append(out, s)
}
return out, nil
}
func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, usePlaybook bool) (*matrixResp, error) {
body := map[string]any{
"query_text": query,
"corpora": corpora,
"k": k,
"per_corpus_k": k,
"use_playbook": usePlaybook,
}
bs, _ := json.Marshal(body)
req, _ := http.NewRequest("POST", gw+"/v1/matrix/search", bytes.NewReader(bs))
req.Header.Set("Content-Type", "application/json")
resp, err := hc.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
rb, _ := io.ReadAll(resp.Body)
if resp.StatusCode/100 != 2 {
return nil, fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
}
var out matrixResp
if err := json.Unmarshal(rb, &out); err != nil {
return nil, fmt.Errorf("unmarshal: %w (body=%s)", err, abbrev(string(rb), 200))
}
return &out, nil
}
func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, score float64) error {
body := map[string]any{
"query": query,
"answer_id": answerID,
"answer_corpus": answerCorpus,
"score": score,
"tags": []string{"reality-test", "playbook-lift-001"},
}
bs, _ := json.Marshal(body)
req, _ := http.NewRequest("POST", gw+"/v1/matrix/playbooks/record", bytes.NewReader(bs))
req.Header.Set("Content-Type", "application/json")
resp, err := hc.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode/100 != 2 {
rb, _ := io.ReadAll(resp.Body)
return fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
}
return nil
}
// judgeRate calls Ollama's /api/chat directly and asks for a 1-5 rating
// of the result against the query. Returns 0 on any failure (treated as
// "couldn't judge, exclude from best-of consideration").
func judgeRate(hc *http.Client, ollamaURL, model, query string, r matrixResult) int {
system := `You rate retrieval results for a staffing co-pilot.
Rate the result 1-5 against the query:
5 = perfect match (this person/job IS what was asked for)
4 = strong match (right field, right level, minor mismatches)
3 = adjacent match (related field or partial overlap)
2 = weak/tangential match
1 = irrelevant
Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
user := fmt.Sprintf("Query: %q\n\nResult corpus: %s\nResult ID: %s\nResult metadata:\n%s",
query, r.Corpus, r.ID, string(r.Metadata))
body := map[string]any{
"model": model,
"stream": false,
"format": "json",
"messages": []map[string]string{
{"role": "system", "content": system},
{"role": "user", "content": user},
},
"options": map[string]any{"temperature": 0},
}
bs, _ := json.Marshal(body)
req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(bs))
req.Header.Set("Content-Type", "application/json")
resp, err := hc.Do(req)
if err != nil {
return 0
}
defer resp.Body.Close()
if resp.StatusCode/100 != 2 {
return 0
}
rb, _ := io.ReadAll(resp.Body)
var ollamaResp struct {
Message struct {
Content string `json:"content"`
} `json:"message"`
}
if err := json.Unmarshal(rb, &ollamaResp); err != nil {
return 0
}
var v judgeVerdict
if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &v); err != nil {
return 0
}
if v.Rating < 1 || v.Rating > 5 {
return 0
}
return v.Rating
}
func writeJSON(path string, runs []queryRun, sum summary) error {
if err := os.MkdirAll(dirOf(path), 0o755); err != nil {
return err
}
out := struct {
Summary summary `json:"summary"`
Runs []queryRun `json:"runs"`
}{Summary: sum, Runs: runs}
bs, err := json.MarshalIndent(out, "", " ")
if err != nil {
return err
}
return os.WriteFile(path, bs, 0o644)
}
// dirOf is a minimal stand-in for filepath.Dir, kept local to avoid
// importing path/filepath for a single call.
func dirOf(p string) string {
if i := strings.LastIndex(p, "/"); i >= 0 {
return p[:i]
}
return "."
}
func abbrev(s string, n int) string {
// Truncate by runes so multi-byte text isn't cut mid-character.
r := []rune(s)
if len(r) <= n {
return s
}
return string(r[:n]) + "…"
}
func appendNote(existing, add string) string {
if existing == "" {
return add
}
return existing + "; " + add
}
// Blank reference keeps the currently-unused "sort" import compiling;
// drop this together with the import once sort gains a real use.
var _ = sort.Slice

18
tests/reality/playbook_lift_queries.txt

@@ -0,0 +1,18 @@
# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Goal: 20 queries, weighted toward the kinds of asks a staffing
# coordinator would actually issue. Specific roles + certifications +
# constraints surface playbook lift better than generic "find a worker"
# style queries.
#
# Placeholders (5) — J: replace + extend to 20+ for the real test.
Forklift operator with OSHA-30, warehouse experience, day shift availability
Bilingual customer service rep, Spanish + English, two years call-center experience
CDL Class A driver, clean record, willing to do regional 4-day routes
Production line supervisor with lean manufacturing background
Dental hygienist with three years experience, Indianapolis area