reality-tests: playbook-lift harness — does the 5-loop substrate beat raw cosine?
First reality test driver. Two-pass design:
- Pass 1 (cold): matrix.search use_playbook=false → small-model judge
rates top-K → record playbook entry pointing at the highest-rated
result (which may NOT be top-1 by distance — that's the discovery).
- Pass 2 (warm): same queries with use_playbook=true → measure
ranking shift. Lift = real if recorded answer becomes top-1.
Files:
- scripts/playbook_lift/main.go driver (391 LoC)
- scripts/playbook_lift.sh stack-bring-up + report gen
- tests/reality/playbook_lift_queries.txt query corpus (5 placeholders;
J writes real 20+)
- reports/reality-tests/README.md framework + interpretation
- .gitignore track reports/reality-tests/
but ignore per-run JSON evidence
This answers the gate from project_small_model_pipeline_vision.md:
"the playbook + matrix indexer must give the results we're looking
for." Without ground-truth labels, the LLM judge is the proxy — the
same small-model thesis applied to evaluation. Honest about that
limitation in the generated reports.
Driver compiles cleanly; a full run requires Ollama + workers/candidates
ingest. Skips (exit 0) if Ollama is absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 8278eb9a87
commit 3dd7d9fe30

.gitignore (vendored, 4 changed lines)
@@ -39,10 +39,14 @@ vendor/
 # Use /reports/* + un-ignore so git can traverse into reports/.
 /reports/*
 !/reports/scrum/
+!/reports/reality-tests/
 # Inside the audit directory, the per-run _evidence/ dump (smoke logs,
 # command output) IS runtime — track the dir, ignore its contents.
 /reports/scrum/_evidence/*
 !/reports/scrum/_evidence/.gitkeep
+# Reality-test JSON evidence is runtime — track the dir + MD reports
+# (committed deliberately as outcome record), ignore per-run JSON.
+/reports/reality-tests/*.json

 # Proof harness runtime output — same pattern as reports/scrum/_evidence.
 # Track the directory but ignore per-run subdirs.
reports/reality-tests/README.md (new file, 69 lines)
@@ -0,0 +1,69 @@
# reports/reality-tests — does the 5-loop substrate actually work?

Reality tests measure **product outcomes**, not substrate health. The 21 smokes prove the system *runs*; the proof harness proves the system *makes the claims it claims*; reality tests answer: **does the small-model pipeline + matrix indexer + playbook give measurably better results than raw cosine?**

This is the gate from `project_small_model_pipeline_vision.md`: *"the playbook + matrix indexer must give the results we're looking for."* Single load-bearing criterion. Throughput, scaling, and code elegance are secondary.

---

## What lives here

Each reality test is a numbered run that produces:

- `<test>_<NNN>.json` — raw structured evidence (per-query data, summary metrics)
- `<test>_<NNN>.md` — human-readable report with headline metrics, per-query table, honesty caveats, next moves

Runs are append-only. Earlier runs stay in tree as historical baselines.

---
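An abridged sketch of the evidence JSON's shape, using the field names the driver writes under `summary` and `runs`; all values (counts, IDs, timestamps) are illustrative:

```json
{
  "summary": {
    "total": 20,
    "with_discovery": 8,
    "lift_count": 5,
    "no_change": 12,
    "mean_top1_delta_distance": -0.012,
    "playbook_boosted_total": 9,
    "generated_at": "2025-01-01T00:00:00Z"
  },
  "runs": [
    {
      "query": "Forklift operator with OSHA-30, warehouse experience, day shift availability",
      "cold_top1_id": "w-0042",
      "cold_judge_best_rank": 3,
      "cold_judge_best_rating": 5,
      "playbook_recorded": true,
      "warm_top1_id": "w-0107",
      "lift": true
    }
  ]
}
```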
## Test catalog

### `playbook_lift_<NNN>` — does the playbook actually lift the right answer?

**Driver:** `scripts/playbook_lift.sh` → `bin/playbook_lift`
**Queries:** `tests/reality/playbook_lift_queries.txt`
**Pipeline:** cold pass → LLM judge → playbook record → warm pass → measure ranking shift.

The headline question: **when the LLM judge finds a better answer than cosine top-1, can the playbook boost it to top-1 on the next run?** If yes, the learning loop closes; if no, the matrix layer + playbook is infrastructure for a thesis that doesn't pay rent.

See the run reports for honesty caveats — chiefly that the LLM judge IS the ground-truth proxy.

---
## Running a reality test

```bash
# Defaults: judge=qwen3.5:latest, workers limit 5000, run id 001
./scripts/playbook_lift.sh

# Re-run with a different judge to check inter-judge agreement
JUDGE_MODEL=qwen2.5:latest RUN_ID=002 ./scripts/playbook_lift.sh

# Smaller scale for fast iteration
WORKERS_LIMIT=1000 K=5 RUN_ID=dev ./scripts/playbook_lift.sh
```

Requires: Ollama on `:11434` with `nomic-embed-text` + the chosen judge model loaded. Skips cleanly (exit 0) if Ollama is absent.
---

## Interpreting results

Three thresholds matter on the `playbook_lift` tests:

| Lift rate (lifts / discoveries) | Verdict |
|---|---|
| ≥ 50% | Loop closes — playbook is doing real work; move to paraphrase queries |
| 20–50% | Lift exists but is inconsistent — investigate the boost math (`score × 0.5`) or judge variance |
| < 20% | Loop is not pulling its weight — diagnose before adding more components |
A separate concern: **discovery rate** (cold judge-best ≠ cold top-1). If discovery is itself rare (< 30% of queries), cosine is already close to optimal on this query distribution and the matrix+playbook layer has little headroom. That's not necessarily a bug — but it means the value gate has to come from somewhere else (multi-corpus retrieval, domain-specific tags, drift signal).
---
## What this is not

- **Not a benchmark.** No comparison against external systems; only internal cold-vs-warm.
- **Not a regression gate.** Each run is a snapshot. Scores will drift with corpus changes, judge updates, and playbook math tuning. Don't wire `just verify` to demand a minimum lift.
- **Not human-validated.** The LLM judge is the ground-truth proxy. Sample 5–10 verdicts manually per run to sanity-check that the judge isn't pathological.
scripts/playbook_lift.sh (new executable file, 233 lines)
@@ -0,0 +1,233 @@
#!/usr/bin/env bash
# Playbook-lift reality test — measure whether the 5-loop substrate
# (matrix retrieve+merge + playbook + small-model judge) actually beats
# raw cosine on staffing queries.
#
# Pipeline:
#   1. Boot the Go stack (storaged, embedd, vectord, matrixd, gateway)
#   2. Ingest workers (default 5000) + candidates corpora
#   3. Run the playbook_lift driver: cold pass → judge → record →
#      warm pass → measure
#   4. Generate markdown report from the JSON evidence
#
# Output:
#   reports/reality-tests/playbook_lift_<N>.json — raw evidence
#   reports/reality-tests/playbook_lift_<N>.md   — human report
#
# Requires: Ollama on :11434 with nomic-embed-text + the judge model
# loaded. Skips (exit 0) if Ollama is absent.
#
# Usage:
#   ./scripts/playbook_lift.sh              # run #001 with defaults
#   RUN_ID=002 ./scripts/playbook_lift.sh   # explicit run id
#   JUDGE_MODEL=qwen2.5:latest ./scripts/playbook_lift.sh
#   WORKERS_LIMIT=2000 ./scripts/playbook_lift.sh

set -euo pipefail
cd "$(dirname "$0")/.."

export PATH="$PATH:/usr/local/go/bin"

RUN_ID="${RUN_ID:-001}"
JUDGE_MODEL="${JUDGE_MODEL:-qwen3.5:latest}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
QUERIES_FILE="${QUERIES_FILE:-tests/reality/playbook_lift_queries.txt}"
CORPORA="${CORPORA:-workers,candidates}"
K="${K:-10}"

OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"

if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[lift] Ollama not reachable on :11434 — skipping"
  exit 0
fi

if ! curl -sS http://localhost:11434/api/tags | jq -e --arg m "$JUDGE_MODEL" \
    '.models[] | select(.name == $m)' >/dev/null 2>&1; then
  echo "[lift] judge model '$JUDGE_MODEL' not loaded in Ollama — pull it first"
  exit 1
fi

echo "[lift] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/embedd ./cmd/vectord ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/staffing_candidates \
  ./scripts/playbook_lift

pkill -f "bin/(storaged|embedd|vectord|matrixd|gateway)" 2>/dev/null || true
sleep 0.3

PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/lift.toml"

cleanup() {
  echo "[lift] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM

cat > "$CFG" <<EOF
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"

[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""

[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF

poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}

echo "[lift] launching stack..."
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }

echo
echo "[lift] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"

echo
echo "[lift] ingest candidates..."
./bin/staffing_candidates -skip-populate=false -query "warmup" 2>&1 \
  | grep -v "^\[candidates\]\(matrix\|reality\)" || true

echo
echo "[lift] running driver — judge=$JUDGE_MODEL · queries=$QUERIES_FILE · k=$K"
./bin/playbook_lift \
  -gateway "http://127.0.0.1:3110" \
  -ollama "http://localhost:11434" \
  -queries "$QUERIES_FILE" \
  -corpora "$CORPORA" \
  -judge "$JUDGE_MODEL" \
  -k "$K" \
  -out "$OUT_JSON"

echo
echo "[lift] generating markdown report → $OUT_MD"
generate_md() {
  local json="$1" md="$2"
  local total discovery lift no_change boosted mean_delta gen_at
  total=$(jq -r '.summary.total' "$json")
  discovery=$(jq -r '.summary.with_discovery' "$json")
  lift=$(jq -r '.summary.lift_count' "$json")
  no_change=$(jq -r '.summary.no_change' "$json")
  boosted=$(jq -r '.summary.playbook_boosted_total' "$json")
  mean_delta=$(jq -r '.summary.mean_top1_delta_distance' "$json")
  gen_at=$(jq -r '.summary.generated_at' "$json")

  cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}

**Generated:** ${gen_at}
**Judge:** \`${JUDGE_MODEL}\` (Ollama)
**Corpora:** \`${CORPORA}\`
**Workers limit:** ${WORKERS_LIMIT}
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Evidence:** \`${OUT_JSON}\`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | ${total} |
| Cold-pass discoveries (judge-best ≠ top-1) | ${discovery} |
| Warm-pass lifts (recorded playbook → top-1) | ${lift} |
| No change (judge-best already top-1, no playbook needed) | ${no_change} |
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |

**Lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
MDEOF

  jq -r '.runs | to_entries[] |
    [
      (.key + 1 | tostring),
      (.value.query | .[0:60]),
      .value.cold_top1_id,
      ((.value.cold_judge_best_rank | tostring) + "/" + (.value.cold_judge_best_rating | tostring)),
      (if .value.playbook_recorded then "✓ " + (.value.playbook_target_id // "") else "—" end),
      .value.warm_top1_id,
      (.value.warm_judge_best_rank | tostring),
      (if .value.lift then "**YES**" else "no" end)
    ] | "| " + join(" | ") + " |"
  ' "$json" >> "$md"

  cat >> "$md" <<MDEOF

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If \`${JUDGE_MODEL}\` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   \`distance' = distance × (1 - 0.5 × score)\`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Same-query replay is the cheap case.** Real lift comes from *similar but
   not identical* queries hitting a recorded playbook. This run only tests
   verbatim replay. A v2 should add paraphrase queries.
4. **Multi-corpus skew.** Default corpora=\`${CORPORA}\` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.

## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
MDEOF
}

generate_md "$OUT_JSON" "$OUT_MD"

echo
echo "[lift] DONE"
echo "[lift] evidence: $OUT_JSON"
echo "[lift] report:   $OUT_MD"
scripts/playbook_lift/main.go (new file, 391 lines)
@@ -0,0 +1,391 @@
// Playbook-lift reality test driver. Two-pass design:
//
// Pass 1 (cold): for each query → matrix.search use_playbook=false →
//                LLM judge rates top-K → record playbook entry pointing
//                at the highest-rated result (which may NOT be top-1
//                by distance — that's the discovery worth boosting).
//
// Pass 2 (warm): same queries → use_playbook=true → measure how the
//                ranking shifted.
//
// Lift = real if pass-2 brings the LLM-judged-best result into top-1
// more often than pass-1. If lift ≈ 0, the playbook is just confirming
// what cosine already said and the 5-loop thesis is unproven.
//
// Honest about what this measures: with no human-labeled ground truth,
// the LLM judge IS the ground truth proxy. That's the small-model
// pipeline thesis itself — the same model class that runs the inner
// loop is also what we trust to evaluate it. If you don't trust the
// judge, the lift number is meaningless; that's a separate problem
// for ground-truth labeling.
//
// Usage (driven by scripts/playbook_lift.sh):
//
//	playbook_lift -gateway http://127.0.0.1:3110 \
//	  -queries tests/reality/playbook_lift_queries.txt \
//	  -judge qwen3.5:latest \
//	  -corpora workers,candidates \
//	  -k 10 \
//	  -out reports/reality-tests/playbook_lift_001.json
package main

import (
	"bytes"
	"encoding/json"
	"flag"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"sort"
	"strings"
	"time"
)

type matrixResult struct {
	ID       string          `json:"id"`
	Distance float32         `json:"distance"`
	Corpus   string          `json:"corpus"`
	Metadata json.RawMessage `json:"metadata,omitempty"`
}

type matrixResp struct {
	Results         []matrixResult `json:"results"`
	PerCorpusCounts map[string]int `json:"per_corpus_counts"`
	PlaybookBoosted int            `json:"playbook_boosted,omitempty"`
}

type judgeVerdict struct {
	Rating int    `json:"rating"`
	Reason string `json:"reason"`
}

type queryRun struct {
	Query string `json:"query"`

	ColdTop1ID          string  `json:"cold_top1_id"`
	ColdTop1Distance    float32 `json:"cold_top1_distance"`
	ColdJudgeBestID     string  `json:"cold_judge_best_id"`
	ColdJudgeBestRank   int     `json:"cold_judge_best_rank"`
	ColdJudgeBestRating int     `json:"cold_judge_best_rating"`
	ColdRatings         []int   `json:"cold_ratings"`

	PlaybookRecorded bool   `json:"playbook_recorded"`
	PlaybookID       string `json:"playbook_target_id,omitempty"`

	WarmTop1ID        string  `json:"warm_top1_id"`
	WarmTop1Distance  float32 `json:"warm_top1_distance"`
	WarmBoostedCount  int     `json:"warm_boosted_count"`
	WarmJudgeBestRank int     `json:"warm_judge_best_rank"`

	Lift bool   `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
	Note string `json:"note,omitempty"`
}

type summary struct {
	Total                 int       `json:"total"`
	WithDiscovery         int       `json:"with_discovery"` // judge-best != cold top-1
	LiftCount             int       `json:"lift_count"`     // top-1 changed warm → judge-best
	NoChange              int       `json:"no_change"`
	MeanTop1DeltaDistance float32   `json:"mean_top1_delta_distance"`
	PlaybookBoostedTotal  int       `json:"playbook_boosted_total"`
	GeneratedAt           time.Time `json:"generated_at"`
}

func main() {
	gw := flag.String("gateway", "http://127.0.0.1:3110", "Go gateway base URL")
	ollama := flag.String("ollama", "http://127.0.0.1:11434", "Ollama base URL for LLM judge")
	queries := flag.String("queries", "tests/reality/playbook_lift_queries.txt", "query corpus path")
	corporaCSV := flag.String("corpora", "workers,candidates", "comma-separated matrix corpora")
	judge := flag.String("judge", "qwen3.5:latest", "Ollama model for relevance judging")
	k := flag.Int("k", 10, "top-k from matrix.search per pass")
	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSON path")
	flag.Parse()

	corpora := strings.Split(*corporaCSV, ",")

	qs, err := loadQueries(*queries)
	if err != nil {
		log.Fatalf("load queries: %v", err)
	}
	if len(qs) == 0 {
		log.Fatalf("no queries in %s", *queries)
	}
	log.Printf("[lift] %d queries · corpora=%v · k=%d · judge=%s", len(qs), corpora, *k, *judge)

	hc := &http.Client{Timeout: 60 * time.Second}
	runs := make([]queryRun, 0, len(qs))
	totalDelta := float32(0)
	playbookBoostedTotal := 0
	withDiscovery := 0
	liftCount := 0
	noChange := 0

	// Pass 1 (cold) + record playbooks based on judge verdicts.
	for i, q := range qs {
		log.Printf("[lift] (%d/%d cold) %s", i+1, len(qs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, false)
		if err != nil {
			log.Printf("  cold search failed: %v — skipping", err)
			continue
		}
		if len(resp.Results) == 0 {
			log.Printf("  cold returned 0 results — skipping")
			continue
		}
		ratings := make([]int, len(resp.Results))
		bestRank := 0
		bestRating := -1
		for j, r := range resp.Results {
			rating := judgeRate(hc, *ollama, *judge, q, r)
			ratings[j] = rating
			if rating > bestRating {
				bestRating = rating
				bestRank = j
			}
		}
		run := queryRun{
			Query:               q,
			ColdTop1ID:          resp.Results[0].ID,
			ColdTop1Distance:    resp.Results[0].Distance,
			ColdJudgeBestID:     resp.Results[bestRank].ID,
			ColdJudgeBestRank:   bestRank,
			ColdJudgeBestRating: bestRating,
			ColdRatings:         ratings,
		}
		// Record a playbook only if the judge best is not already top-1
		// (otherwise we're boosting something cosine already crowned).
		if bestRank > 0 && bestRating >= 4 {
			withDiscovery++
			if err := playbookRecord(hc, *gw, q, resp.Results[bestRank].ID, resp.Results[bestRank].Corpus, 1.0); err != nil {
				log.Printf("  playbook record failed: %v", err)
				run.Note = "playbook record failed: " + err.Error()
			} else {
				run.PlaybookRecorded = true
				run.PlaybookID = resp.Results[bestRank].ID
			}
		} else if bestRank == 0 {
			run.Note = "judge-best already top-1 cold — no playbook needed"
		} else {
			run.Note = fmt.Sprintf("judge-best rating %d below threshold (4) — no playbook", bestRating)
		}
		runs = append(runs, run)
	}

	// Pass 2 (warm) on the same queries.
	for i := range runs {
		q := runs[i].Query
		log.Printf("[lift] (%d/%d warm) %s", i+1, len(runs), abbrev(q, 60))
		resp, err := matrixSearch(hc, *gw, q, corpora, *k, true)
		if err != nil || len(resp.Results) == 0 {
			runs[i].Note = appendNote(runs[i].Note, fmt.Sprintf("warm search failed: %v", err))
			continue
		}
		runs[i].WarmTop1ID = resp.Results[0].ID
		runs[i].WarmTop1Distance = resp.Results[0].Distance
		runs[i].WarmBoostedCount = resp.PlaybookBoosted
		playbookBoostedTotal += resp.PlaybookBoosted

		// Find where the cold judge-best ID landed in the warm ranking.
		warmRank := -1
		for j, r := range resp.Results {
			if r.ID == runs[i].ColdJudgeBestID {
				warmRank = j
				break
			}
		}
		runs[i].WarmJudgeBestRank = warmRank

		switch {
		case runs[i].PlaybookRecorded && warmRank == 0:
			runs[i].Lift = true
			liftCount++
		case !runs[i].PlaybookRecorded:
			noChange++
		default:
			noChange++
		}
		totalDelta += runs[i].WarmTop1Distance - runs[i].ColdTop1Distance
	}

	sum := summary{
		Total:                 len(runs),
		WithDiscovery:         withDiscovery,
		LiftCount:             liftCount,
		NoChange:              noChange,
		MeanTop1DeltaDistance: 0,
		PlaybookBoostedTotal:  playbookBoostedTotal,
		GeneratedAt:           time.Now().UTC(),
	}
	if len(runs) > 0 {
		sum.MeanTop1DeltaDistance = totalDelta / float32(len(runs))
	}

	if err := writeJSON(*out, runs, sum); err != nil {
		log.Fatalf("write %s: %v", *out, err)
	}
	log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
		sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)
	log.Printf("[lift] results → %s", *out)
}

func loadQueries(path string) ([]string, error) {
	bs, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, line := range strings.Split(string(bs), "\n") {
		s := strings.TrimSpace(line)
		if s == "" || strings.HasPrefix(s, "#") {
			continue
		}
		out = append(out, s)
	}
	return out, nil
}

func matrixSearch(hc *http.Client, gw, query string, corpora []string, k int, usePlaybook bool) (*matrixResp, error) {
	body := map[string]any{
		"query_text":   query,
		"corpora":      corpora,
		"k":            k,
		"per_corpus_k": k,
		"use_playbook": usePlaybook,
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/search", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	rb, _ := io.ReadAll(resp.Body)
	if resp.StatusCode/100 != 2 {
		return nil, fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	var out matrixResp
	if err := json.Unmarshal(rb, &out); err != nil {
		return nil, fmt.Errorf("unmarshal: %w (body=%s)", err, abbrev(string(rb), 200))
	}
	return &out, nil
}

func playbookRecord(hc *http.Client, gw, query, answerID, answerCorpus string, score float64) error {
	body := map[string]any{
		"query":         query,
		"answer_id":     answerID,
		"answer_corpus": answerCorpus,
		"score":         score,
		"tags":          []string{"reality-test", "playbook-lift-001"},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", gw+"/v1/matrix/playbooks/record", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		rb, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("status %d: %s", resp.StatusCode, string(rb))
	}
	return nil
}

// judgeRate calls Ollama's /api/chat directly and asks for a 1-5 rating
// of the result against the query. Returns 0 on any failure (treated as
// "couldn't judge, exclude from best-of consideration").
func judgeRate(hc *http.Client, ollamaURL, model, query string, r matrixResult) int {
	system := `You rate retrieval results for a staffing co-pilot.
Rate the result 1-5 against the query:
5 = perfect match (this person/job IS what was asked for)
4 = strong match (right field, right level, minor mismatches)
3 = adjacent match (related field or partial overlap)
2 = weak/tangential match
1 = irrelevant
Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
	user := fmt.Sprintf("Query: %q\n\nResult corpus: %s\nResult ID: %s\nResult metadata:\n%s",
		query, r.Corpus, r.ID, string(r.Metadata))

	body := map[string]any{
		"model":  model,
		"stream": false,
		"format": "json",
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": user},
		},
		"options": map[string]any{"temperature": 0},
	}
	bs, _ := json.Marshal(body)
	req, _ := http.NewRequest("POST", ollamaURL+"/api/chat", bytes.NewReader(bs))
	req.Header.Set("Content-Type", "application/json")
	resp, err := hc.Do(req)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return 0
	}
	rb, _ := io.ReadAll(resp.Body)
	var ollamaResp struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return 0
	}
	var v judgeVerdict
	if err := json.Unmarshal([]byte(ollamaResp.Message.Content), &v); err != nil {
		return 0
	}
	if v.Rating < 1 || v.Rating > 5 {
		return 0
	}
	return v.Rating
}

func writeJSON(path string, runs []queryRun, sum summary) error {
	if err := os.MkdirAll(filepath_dir(path), 0o755); err != nil {
		return err
	}
	out := struct {
		Summary summary    `json:"summary"`
		Runs    []queryRun `json:"runs"`
	}{Summary: sum, Runs: runs}
	bs, err := json.MarshalIndent(out, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, bs, 0o644)
}

func filepath_dir(p string) string {
	if i := strings.LastIndex(p, "/"); i >= 0 {
		return p[:i]
	}
	return "."
}

func abbrev(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n] + "…"
}

func appendNote(existing, add string) string {
	if existing == "" {
		return add
	}
	return existing + "; " + add
}

// Keep the sort import compiling until a future refactor actually uses
// it — Go treats unused imports as compile errors, not warnings.
var _ = sort.Slice
tests/reality/playbook_lift_queries.txt (new file, 18 lines)
@@ -0,0 +1,18 @@
# Playbook lift reality test — staffing query corpus.
#
# Each non-blank, non-comment line is one query. The harness will run
# each through matrix.search (cold pass, then warm pass with playbook),
# ask the LLM judge to rate top-K results, and record lift metrics.
#
# Goal: 20 queries, weighted toward the kinds of asks a staffing
# coordinator would actually issue. Specific roles + certifications +
# constraints surface playbook lift better than generic "find a worker"
# style queries.
#
# Placeholders (5) — J: replace + extend to 20+ for the real test.

Forklift operator with OSHA-30, warehouse experience, day shift availability
Bilingual customer service rep, Spanish + English, two years call-center experience
CDL Class A driver, clean record, willing to do regional 4-day routes
Production line supervisor with lean manufacturing background
Dental hygienist with three years experience, Indianapolis area