reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles clean; full run requires Ollama + workers/candidates
ingest. Skips cleanly if Ollama absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
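The lift criterion described above reduces to a two-flag check per query. A minimal sketch with made-up values (the flag names mirror the driver's JSON fields):

```bash
# One query's outcome flags, as the driver computes them (values made up):
playbook_recorded=true   # cold pass recorded the judge-best as a playbook entry
warm_judge_best_rank=0   # cold judge-best landed at rank 0 in the warm pass

# Lift counts only when both hold: a playbook was recorded (the judge
# disagreed with cosine) AND the warm pass promoted that answer to top-1.
if [ "$playbook_recorded" = true ] && [ "$warm_judge_best_rank" -eq 0 ]; then
  lift=yes
else
  lift=no
fi
echo "lift=$lift"
```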
parent 8278eb9a87 · commit 3dd7d9fe30
4 .gitignore vendored
@@ -39,10 +39,14 @@ vendor/
# Use /reports/* + un-ignore so git can traverse into reports/.
/reports/*
!/reports/scrum/
!/reports/reality-tests/
# Inside the audit directory, the per-run _evidence/ dump (smoke logs,
# command output) IS runtime — track the dir, ignore its contents.
/reports/scrum/_evidence/*
!/reports/scrum/_evidence/.gitkeep
# Reality-test JSON evidence is runtime — track the dir + MD reports
# (committed deliberately as outcome record), ignore per-run JSON.
/reports/reality-tests/*.json

# Proof harness runtime output — same pattern as reports/scrum/_evidence.
# Track the directory but ignore per-run subdirs.
69 reports/reality-tests/README.md Normal file
@@ -0,0 +1,69 @@
# reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *honors the claims it makes*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**

This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, and code elegance are secondary.

---

## What lives here

Each reality test is a numbered run that produces:

- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in the tree as historical baselines.

---

## Test catalog

### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?

**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.

---

## Running a reality test

```bash
# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```

Requires: Ollama on `:11434` with `nomic-embed-text` + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.

---

## Interpreting results

Three thresholds matter on the `playbook_lift` tests:

| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work; move to paraphrase queries |
| 20–50% | Lift exists but is inconsistent — investigate the boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight — diagnose before adding more components |

A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).

---

## What this is not

- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground-truth proxy. Sample 5–10 verdicts manually per run to sanity-check that the judge isn't pathological.
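The lift-rate and discovery-rate thresholds above can be checked straight from a run's JSON evidence. A minimal sketch using jq, as the harness itself does; the evidence file here is an illustrative stand-in (field names match the driver's summary block, the numbers are made up):

```bash
# Illustrative evidence file — shape matches the driver's summary block.
cat > /tmp/lift_demo.json <<'EOF'
{"summary": {"total": 20, "with_discovery": 8, "lift_count": 5}}
EOF

total=$(jq -r '.summary.total' /tmp/lift_demo.json)
discovery=$(jq -r '.summary.with_discovery' /tmp/lift_demo.json)
lift=$(jq -r '.summary.lift_count' /tmp/lift_demo.json)

# Lift rate = lifts / discoveries; discovery rate = discoveries / total.
# Integer math is fine for threshold checks; guard divide-by-zero.
lift_rate=0
[ "$discovery" -gt 0 ] && lift_rate=$(( 100 * lift / discovery ))
discovery_rate=$(( 100 * discovery / total ))

echo "discovery rate: ${discovery_rate}% · lift rate: ${lift_rate}%"
```

With these made-up numbers the run would land in the ≥ 50% "loop closes" band with a healthy 40% discovery rate.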
233 scripts/playbook_lift.sh Executable file
@@ -0,0 +1,233 @@
#!/usr/bin/env bash
# Playbook-lift reality test — measure whether the 5-loop substrate
# (matrix retrieve+merge + playbook + small-model judge) actually beats
# raw cosine on staffing queries.
#
# Pipeline:
#   1. Boot the Go stack (storaged, embedd, vectord, matrixd, gateway)
#   2. Ingest workers (default 5000) + candidates corpora
#   3. Run the playbook_lift driver: cold pass → judge → record →
#      warm pass → measure
#   4. Generate markdown report from the JSON evidence
#
# Output:
#   reports/reality-tests/playbook_lift_<N>.json — raw evidence
#   reports/reality-tests/playbook_lift_<N>.md   — human report
#
# Requires: Ollama on :11434 with nomic-embed-text + the judge model
# loaded. Skips (exit 0) if Ollama is absent.
#
# Usage:
#   ./scripts/playbook_lift.sh                    # run #001 with defaults
#   RUN_ID=002 ./scripts/playbook_lift.sh         # explicit run id
#   JUDGE_MODEL=qwen2.5:latest ./scripts/playbook_lift.sh
#   WORKERS_LIMIT=2000 ./scripts/playbook_lift.sh

set -euo pipefail
cd "$(dirname "$0")/.."

export PATH="$PATH:/usr/local/go/bin"

RUN_ID="${RUN_ID:-001}"
JUDGE_MODEL="${JUDGE_MODEL:-qwen3.5:latest}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
QUERIES_FILE="${QUERIES_FILE:-tests/reality/playbook_lift_queries.txt}"
CORPORA="${CORPORA:-workers,candidates}"
K="${K:-10}"

OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"

if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[lift] Ollama not reachable on :11434 — skipping"
  exit 0
fi

if ! curl -sS http://localhost:11434/api/tags | jq -e --arg m "$JUDGE_MODEL" \
    '.models[] | select(.name == $m)' >/dev/null 2>&1; then
  echo "[lift] judge model '$JUDGE_MODEL' not loaded in Ollama — pull it first"
  exit 1
fi

echo "[lift] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/staffing_candidates \
  ./scripts/playbook_lift

pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3

PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/lift.toml"

cleanup() {
  echo "[lift] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM

cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"

[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""

[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF

poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}

echo "[lift] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }

echo
echo "[lift] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"

echo
echo "[lift] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "warmup" 2>&1 \
  | grep -v "^\[candidates\]\(matrix\|reality\)" || true

echo
echo "[lift] running driver — judge=$JUDGE_MODEL · queries=$QUERIES_FILE · k=$K"
./bin/playbook_lift \
  -gateway "http://127.0.0.1:3110" \
  -ollama "http://localhost:11434" \
  -queries "$QUERIES_FILE" \
  -corpora "$CORPORA" \
  -judge "$JUDGE_MODEL" \
  -k "$K" \
  -out "$OUT_JSON"

echo
echo "[lift] generating markdown report → $OUT_MD"
generate_md() {
  local json="$1" md="$2"
  local total discovery lift no_change boosted mean_delta gen_at
  total=$(jq -r '.summary.total' "$json")
  discovery=$(jq -r '.summary.with_discovery' "$json")
  lift=$(jq -r '.summary.lift_count' "$json")
  no_change=$(jq -r '.summary.no_change' "$json")
  boosted=$(jq -r '.summary.playbook_boosted_total' "$json")
  mean_delta=$(jq -r '.summary.mean_top1_delta_distance' "$json")
  gen_at=$(jq -r '.summary.generated_at' "$json")

  cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}

**Generated:** ${gen_at}
**Judge:** \`${JUDGE_MODEL}\` (Ollama)
**Corpora:** \`${CORPORA}\`
**Workers limit:** ${WORKERS_LIMIT}
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Evidence:** \`${OUT_JSON}\`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | ${total} |
| Cold-pass discoveries (judge-best ≠ top-1) | ${discovery} |
| Warm-pass lifts (recorded playbook → top-1) | ${lift} |
| No lift (judge-best already top-1, or boost insufficient) | ${no_change} |
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |

**Lift rate:** ${lift} of ${discovery} discoveries became top-1 after the warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
MDEOF

  jq -r '.runs | to_entries[] |
    [
      (.key + 1 | tostring),
      (.value.query | .[0:60]),
      .value.cold_top1_id,
      ((.value.cold_judge_best_rank | tostring) + "/" + (.value.cold_judge_best_rating | tostring)),
      (if .value.playbook_recorded then "✓ " + (.value.playbook_target_id // "") else "—" end),
      .value.warm_top1_id,
      (.value.warm_judge_best_rank | tostring),
      (if .value.lift then "**YES**" else "no" end)
    ] | "| " + join(" | ") + " |"
  ' "$json" >> "$md"

  cat >> "$md" <<MDEOF

---

## Honesty caveats

1. **Judge IS the ground-truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If \`${JUDGE_MODEL}\` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   \`distance' = distance × (1 - 0.5 × score)\`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance; otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Same-query replay is the cheap case.** Real lift comes from *similar but
   not identical* queries hitting a recorded playbook. This run only tests
   verbatim replay. A v2 should add paraphrase queries.
4. **Multi-corpus skew.** Default corpora=\`${CORPORA}\` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check the per-corpus distribution in the JSON.

## Next moves

- If the lift rate is ≥ 50% of discoveries: the matrix layer + playbook is
  doing real work. Move to paraphrase queries + tag-based boost (currently
  ignored).
- If the lift rate is < 20%: investigate why — judge variance, distance gap
  too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may
  need retuning.
- If the discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
MDEOF
}

generate_md "$OUT_JSON" "$OUT_MD"

echo
echo "[lift] DONE"
echo "[lift] evidence: $OUT_JSON"
echo "[lift] report:   $OUT_MD"
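The boost formula in the report's honesty caveats is easy to sanity-check numerically. A sketch with made-up distances (not from any real run) showing when a score-1.0 boost promotes the judge-best result:

```bash
# Cold pass: top-1 at distance 0.30; judge-best sits lower with 0.55.
top1=0.30
judge_best=0.55
score=1.0

# Playbook boost from the honesty caveats: distance' = distance × (1 - 0.5 × score)
boosted=$(awk -v d="$judge_best" -v s="$score" 'BEGIN { printf "%.4f", d * (1 - 0.5 * s) }')
echo "boosted distance: $boosted"

# The boost promotes judge-best iff its halved distance undercuts the
# (unboosted) cold top-1 — i.e. its pre-boost distance is under 2× top-1's.
if awk -v b="$boosted" -v t="$top1" 'BEGIN { exit !(b < t) }'; then
  echo "lift: judge-best becomes warm top-1"
else
  echo "no lift: gap too wide even after halving"
fi
```

Here 0.55 halves to 0.2750, which beats 0.30, so this query lifts; at a pre-boost distance of 0.65 it would not.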
391 scripts/playbook_lift/main.go Normal file
@@ -0,0 +1,391 @@
// Playbook-lift reality test driver. Two-pass design:
//
// Pass 1 (cold): for each query → matrix.search use_playbook=false →
//                LLM judge rates top-K → record playbook entry pointing
//                at the highest-rated result (which may NOT be top-1
//                by distance — that's the discovery worth boosting).
//
// Pass 2 (warm): same queries → use_playbook=true → measure how the
//                ranking shifted.
//
// Lift = real if pass-2 brings the LLM-judged-best result into top-1
// more often than pass-1. If lift ≈ 0, the playbook is just confirming
// what cosine already said and the 5-loop thesis is unproven.
//
// Honest about what this measures: with no human-labeled ground truth,
// the LLM judge IS the ground-truth proxy. That's the small-model
// pipeline thesis itself — the same model class that runs the inner
// loop is also what we trust to evaluate it. If you don't trust the
// judge, the lift number is meaningless; that's a separate problem
// for ground-truth labeling.
//
// Usage (driven by scripts/playbook_lift.sh):
//
//	playbook_lift -gateway http://127.0.0.1:3110 \
//	  -queries tests/reality/playbook_lift_queries.txt \
//	  -judge qwen3.5:latest \
//	  -corpora workers,candidates \
//	  -k 10 \
//	  -out reports/reality-tests/playbook_lift_001.json
package main

import (
	"bytes"
	"encoding/json"
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"strings"
	"time"
)

type matrixResult struct {
	ID       string          `json:"id"`
	Distance float32         `json:"distance"`
	Corpus   string          `json:"corpus"`
	Metadata json.RawMessage `json:"metadata,omitempty"`
}

type matrixResp struct {
	Results         []matrixResult `json:"results"`
	PerCorpusCounts map[string]int `json:"per_corpus_counts"`
	PlaybookBoosted int            `json:"playbook_boosted,omitempty"`
}

type judgeVerdict struct {
	Rating int    `json:"rating"`
	Reason string `json:"reason"`
}

type queryRun struct {
	Query string `json:"query"`

	ColdTop1ID          string  `json:"cold_top1_id"`
	ColdTop1Distance    float32 `json:"cold_top1_distance"`
	ColdJudgeBestID     string  `json:"cold_judge_best_id"`
	ColdJudgeBestRank   int     `json:"cold_judge_best_rank"`
	ColdJudgeBestRating int     `json:"cold_judge_best_rating"`
	ColdRatings         []int   `json:"cold_ratings"`

	PlaybookRecorded bool   `json:"playbook_recorded"`
	PlaybookID       string `json:"playbook_target_id,omitempty"`

	WarmTop1ID        string  `json:"warm_top1_id"`
	WarmTop1Distance  float32 `json:"warm_top1_distance"`
	WarmBoostedCount  int     `json:"warm_boosted_count"`
	WarmJudgeBestRank int     `json:"warm_judge_best_rank"`

	Lift bool   `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
	Note string `json:"note,omitempty"`
}

type summary struct {
	Total                 int       `json:"total"`
	WithDiscovery         int       `json:"with_discovery"` // judge-best != cold top-1
	LiftCount             int       `json:"lift_count"`     // judge-best became top-1 warm
	NoChange              int       `json:"no_change"`
	MeanTop1DeltaDistance float32   `json:"mean_top1_delta_distance"`
	PlaybookBoostedTotal  int       `json:"playbook_boosted_total"`
	GeneratedAt           time.Time `json:"generated_at"`
}

func main() {
	gw := flag.String("gateway", "http://127.0.0.1:3110", "Go gateway base URL")
	ollama := flag.String("ollama", "http://127.0.0.1:11434", "Ollama base URL for LLM judge")
	queries := flag.String("queries", "tests/reality/playbook_lift_queries.txt", "query corpus path")
	corporaCSV := flag.String("corpora", "workers,candidates", "comma-separated matrix corpora")
	judge := flag.String("judge", "qwen3.5:latest", "Ollama model for relevance judging")
	k := flag.Int("k", 10, "top-k from matrix.search per pass")
	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSON path")
	flag.Parse()

	corpora := strings.Split(*corporaCSV, ",")

	qs, err := loadQueries(*queries)
	if err != nil {
		log.Fatalf("load queries: %v", err)
	}
	if len(qs) == 0 {
		log.Fatalf("no queries in %s", *queries)
	}
	log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)

	hc := &http.Client{Timeout: 60 * time.Second}
	runs := make([]queryRun, 0, len(qs))
	totalDelta := float32(0)
	playbookBoostedTotal := 0
	withDiscovery := 0
	liftCount := 0
	noChange := 0

	// Pass 1 (cold) + record playbooks based on judge verdicts.
	for i, q := range qs {
		log.Printf("[lift] (%d/%d cold) %s", i+1, len(qs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, false)
		if err != nil {
			log.Printf("  cold search failed: %v — skipping", err)
			continue
		}
		if len(resp.Results) == 0 {
			log.Printf("  cold returned 0 results — skipping")
			continue
		}
		ratings := make([]int, len(resp.Results))
		bestRank := 0
		bestRating := -1
		for j, r := range resp.Results {
			rating := judgeRate(hc, *ollama, *judge, q, r)
			ratings[j] = rating
			if rating > bestRating {
				bestRating = rating
				bestRank = j
			}
		}
		run := queryRun{
			Query:               q,
			ColdTop1ID:          resp.Results[0].ID,
			ColdTop1Distance:    resp.Results[0].Distance,
			ColdJudgeBestID:     resp.Results[bestRank].ID,
			ColdJudgeBestRank:   bestRank,
			ColdJudgeBestRating: bestRating,
			ColdRatings:         ratings,
		}
		// Record a playbook only if the judge best is not already top-1
		// (otherwise we're boosting something cosine already crowned).
		if bestRank > 0 && bestRating >= 4 {
			withDiscovery++
			if err := playbookRecord(hc, *gw, q, resp.Results[bestRank].ID, resp.Results[bestRank].Corpus, 1.0); err != nil {
				log.Printf("  playbook record failed: %v", err)
				run.Note = "playbook record failed: " + err.Error()
			} else {
				run.PlaybookRecorded = true
				run.PlaybookID = resp.Results[bestRank].ID
			}
		} else if bestRank == 0 {
			run.Note = "judge-best already top-1 cold — no playbook needed"
		} else {
			run.Note = fmt.Sprintf("judge-best rating %d below threshold (4) — no playbook", bestRating)
		}
		runs = append(runs, run)
	}

	// Pass 2 (warm) on the same queries.
	for i := range runs {
		q := runs[i].Query
		log.Printf("[lift] (%d/%d warm) %s", i+1, len(runs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, true)
		if err != nil || len(resp.Results) == 0 {
			runs[i].Note = appendNote(runs[i].Note, fmt.Sprintf("warm search failed: %v", err))
			continue
		}
		runs[i].WarmTop1ID = resp.Results[0].ID
		runs[i].WarmTop1Distance = resp.Results[0].Distance
		runs[i].WarmBoostedCount = resp.PlaybookBoosted
		playbookBoostedTotal += resp.PlaybookBoosted

		// Find where the cold judge-best ID landed in the warm ranking.
		warmRank := -1
		for j, r := range resp.Results {
			if r.ID == runs[i].ColdJudgeBestID {
				warmRank = j
				break
			}
		}
		runs[i].WarmJudgeBestRank = warmRank

		if runs[i].PlaybookRecorded && warmRank == 0 {
			runs[i].Lift = true
			liftCount++
		} else {
			// Covers both "no playbook recorded" and "recorded but boost
			// insufficient" — the per-query JSON fields disambiguate.
			noChange++
		}
		totalDelta += runs[i].WarmTop1Distance - runs[i].ColdTop1Distance
	}

	sum := summary{
		Total:                len(runs),
		WithDiscovery:        withDiscovery,
		LiftCount:            liftCount,
		NoChange:             noChange,
		MeanTop1DeltaDistance: 0,
		PlaybookBoostedTotal: playbookBoostedTotal,
		GeneratedAt:          time.Now().UTC(),
	}
	if len(runs) > 0 {
		sum.MeanTop1DeltaDistance = totalDelta / float32(len(runs))
	}

	if err := writeJSON(*out, runs, sum); err != nil {
		log.Fatalf("write %s: %v", *out, err)
	}
	log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
		sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
	log.Printf("[lift] results → %s", *out)
}

func loadQueries(path string) ([]string, error) {
	bs, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, line := range strings.Split(string(bs), "\n") {
		s := strings.TrimSpace(line)
		if s == "" || strings.HasPrefix(s, "#") {
			continue
		}
		out = append(out, s)
	}
	return out, nil
}

func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, usePlaybook bool) (*matrixResp, error) {
	body := map[string]any{
		"query_text":   query,
		"corpora":      corpora,
		"k":            k,
		"per_corpus_k": k,
		"use_playbook": usePlaybook,
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/search", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	rb, _ := io.ReadAll(resp.Body)
	if resp.StatusCode/100 != 2 {
		return nil, fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	var out matrixResp
	if err := json.Unmarshal(rb, &out); err != nil {
		return nil, fmt.Errorf("unmarshal: %w (body=%s)", err, abbrev(string(rb), 200))
	}
	return &out, nil
}

func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, score float64) error {
	body := map[string]any{
		"query":         query,
		"answer_id":     answerID,
		"answer_corpus": answerCorpus,
		"score":         score,
		"tags":          []string{"reality-test", "playbook-lift-001"},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/playbooks/record", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		rb, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	return nil
}

// judgeRate calls Ollama's /api/chat directly and asks for a 1-5 rating
// of the result against the query. Returns 0 on any failure (treated as
// "couldn't judge, exclude from best-of consideration").
func judgeRate(hc *http.Client, ollamaURL, model, query string, r matrixResult) int {
	system := `You rate retrieval results for a staffing co-pilot.
Rate the result 1-5 against the query:
5 = perfect match (this person/job IS what was asked for)
4 = strong match (right field, right level, minor mismatches)
3 = adjacent match (related field or partial overlap)
2 = weak/tangential match
1 = irrelevant
Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
	user := fmt.Sprintf("Query: %q\n\nResult corpus: %s\nResult ID: %s\nResult metadata:\n%s",
		query, r.Corpus, r.ID, string(r.Metadata))

	body := map[string]any{
		"model":  model,
		"stream": false,
		"format": "json",
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": user},
		},
		"options": map[string]any{"temperature": 0},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return 0
	}
	rb, _ := io.ReadAll(resp.Body)
	var ollamaResp struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return 0
	}
	var v judgeVerdict
	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &v); err != nil {
		return 0
	}
	if v.Rating < 1 || v.Rating > 5 {
		return 0
	}
	return v.Rating
}

func writeJSON(path string, runs []queryRun, sum summary) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	out := struct {
		Summary summary    `json:"summary"`
		Runs    []queryRun `json:"runs"`
	}{Summary: sum, Runs: runs}
	bs, err := json.MarshalIndent(out, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, bs, 0o644)
}

// abbrev truncates s to n runes (not bytes, so multibyte queries don't
// get cut mid-character) and appends an ellipsis.
func abbrev(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n]) + "…"
}

func appendNote(existing, add string) string {
	if existing == "" {
		return add
	}
	return existing + "; " + add
}
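judgeRate's "return 0 on any failure" contract can be exercised without a live Ollama. A hypothetical shell analogue, with jq standing in for the Go JSON parsing; the verdict strings are made up:

```bash
# Verdicts as they would arrive in the Ollama message content.
good='{"rating": 4, "reason": "right field, right level"}'
bad='not json at all'
out_of_range='{"rating": 9, "reason": "overeager judge"}'

rate() {
  # Echo the rating if it parses as an integer in 1..5; otherwise 0 —
  # mirroring judgeRate's "couldn't judge, exclude from best-of" contract.
  local r
  r=$(printf '%s' "$1" | jq -r '.rating // empty' 2>/dev/null) || r=""
  case "$r" in
    [1-5]) echo "$r" ;;
    *)     echo 0 ;;
  esac
}

echo "good         → $(rate "$good")"
echo "bad          → $(rate "$bad")"
echo "out_of_range → $(rate "$out_of_range")"
```

Collapsing every failure mode (transport error, non-JSON output, out-of-range rating) to 0 keeps broken verdicts from winning best-of, at the cost of not distinguishing "judge unavailable" from "judge said irrelevant".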
18 tests/reality/playbook_lift_queries.txt Normal file
@@ -0,0 +1,18 @@
# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Goal: 20 queries, weighted toward the kinds of asks a staffing
# coordinator would actually issue. Specific roles + certifications +
# constraints surface playbook lift better than generic "find a worker"
# style queries.
#
# Placeholders (5) — J: replace + extend to 20+ for the real test.

Forklift operator with OSHA-30, warehouse experience, day shift availability
Bilingual customer service rep, Spanish + English, two years call-center experience
CDL Class A driver, clean record, willing to do regional 4-day routes
Production line supervisor with lean manufacturing background
Dental hygienist with three years experience, Indianapolis area