golangLAKEHOUSE/scripts/playbook_lift.sh
root b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating;
  bucketing sketched below).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
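
Bucketing sketch (illustrative; RejudgeAttempted/Quality* and
WarmTop1Rating are this commit's fields, ColdTop1Rating is an assumed
name for the Pass-1 rating; r is a queryRun, s the summary):

    if r.WarmTop1Rating != nil { // nil = no rejudge ran for this query
        s.RejudgeAttempted++
        switch {
        case *r.WarmTop1Rating > r.ColdTop1Rating: // ColdTop1Rating: assumed name
            s.QualityLifted++
        case *r.WarmTop1Rating < r.ColdTop1Rating:
            s.QualityRegressed++
        default:
            s.QualityNeutral++
        }
    }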

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm — Shape B took mediocre matches and substituted substantively
better ones. Two of the 3 regressions were small (-1 each); the third
(-3) is the Q11 failure below.

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:42:04 -05:00

#!/usr/bin/env bash
# Playbook-lift reality test — measure whether the 5-loop substrate
# (matrix retrieve+merge + playbook + small-model judge) actually beats
# raw cosine on staffing queries.
#
# Pipeline:
#   1. Boot the full Go HTTP stack (storaged, catalogd, ingestd, queryd,
#      embedd, vectord, pathwayd, observerd, matrixd, gateway). Earlier
#      versions booted only the 5 daemons matrix.search needs, which
#      gave a falsely clean "everything works" signal — we now exercise
#      the prod-realistic daemon graph so daemons that observe (observerd)
#      or persist (pathwayd) are actually in the loop.
#   2. SQL surface probe — ingest a 3-row CSV via /v1/ingest (catalogd
#      → ingestd → queryd refresh), assert SELECT COUNT(*)=3. Proves the
#      ingestd→catalogd→queryd path is wired even though the lift driver
#      itself is vector-only retrieval.
#   3. Ingest workers (default 5000) + candidates corpora into vectord.
#   4. Run the playbook_lift driver: cold pass → judge → record →
#      warm pass → measure.
#   5. Generate markdown report from the JSON evidence.
#
# Output:
#   reports/reality-tests/playbook_lift_<N>.json — raw evidence
#   reports/reality-tests/playbook_lift_<N>.md   — human report
#
# Requires: Ollama on :11434 with nomic-embed-text + the judge model
# loaded. Skips (exit 0) if Ollama is absent.
#
# Usage:
#   ./scripts/playbook_lift.sh                        # run #001 with defaults
#   RUN_ID=002 ./scripts/playbook_lift.sh             # explicit run id
#   JUDGE_MODEL=qwen2.5:latest ./scripts/playbook_lift.sh
#   WORKERS_LIMIT=2000 ./scripts/playbook_lift.sh
set -euo pipefail
cd "$(dirname "$0")/.."
export PATH="$PATH:/usr/local/go/bin"
RUN_ID="${RUN_ID:-001}"
# JUDGE_MODEL: empty means "let the Go driver resolve from
# lakehouse.toml [models].local_judge". Set explicitly to override.
JUDGE_MODEL="${JUDGE_MODEL:-}"
WORKERS_LIMIT="${WORKERS_LIMIT:-5000}"
QUERIES_FILE="${QUERIES_FILE:-tests/reality/playbook_lift_queries.txt}"
CORPORA="${CORPORA:-workers,ethereal_workers}"
K="${K:-10}"
CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
# WITH_PARAPHRASE=1 (default) adds a Pass 3 — for each query whose
# Pass 1 cold pass recorded a playbook, generate a paraphrase via the
# judge and re-query with playbook=true. The paraphrase pass is the
# actual learning-property test (does cosine on paraphrase find the
# recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
# quality lift (warm rating vs cold rating). Catches cases where Shape B
# surfaces a different-but-equally-good answer (which the rank-based
# lift metric misses). +21 judge calls (~30s on qwen2.5).
WITH_REJUDGE="${WITH_REJUDGE:-1}"
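# Example (illustrative): fastest verbatim-only smoke run, with both
# optional passes disabled:
#   WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh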
OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
if ! curl -sS --max-time 3 http://localhost:11434/api/tags >/dev/null 2>&1; then
  echo "[lift] Ollama not reachable on :11434 — skipping"
  exit 0
fi
# Resolve judge from config when not set explicitly — needed for the
# Ollama model-presence check below. Mirrors the Go driver's priority.
EFFECTIVE_JUDGE="$JUDGE_MODEL"
if [ -z "$EFFECTIVE_JUDGE" ] && [ -f "$CONFIG_PATH" ]; then
EFFECTIVE_JUDGE="$(grep -E '^local_judge\s*=' "$CONFIG_PATH" | head -1 | sed -E 's/.*=\s*"([^"]+)".*/\1/')"
fi
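# (illustrative) the grep/sed pair above extracts the quoted model name
# from a config line shaped like:
#   local_judge = "qwen2.5:latest"
# Note it takes the first local_judge key anywhere in the file, not only
# the one under [models].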
EFFECTIVE_JUDGE="${EFFECTIVE_JUDGE:-qwen3.5:latest}"
if ! curl -sS http://localhost:11434/api/tags | jq -e --arg m "$EFFECTIVE_JUDGE" \
    '.models[] | select(.name == $m)' >/dev/null 2>&1; then
  echo "[lift] judge model '$EFFECTIVE_JUDGE' not loaded in Ollama — pull it first"
  exit 1
fi
# Compute a single string for "where did the judge come from" so the
# log line + the markdown report don't have to chain :+/:- substitutions
# (those silently fuse "env JUDGE_MODEL" + the value into "env JUDGE_MODELx"
# without a separator — the bug Opus caught on lift_001's report).
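# (illustrative) the buggy chained form was roughly
#   "${JUDGE_MODEL:+env JUDGE_MODEL}${JUDGE_MODEL:-config [models].local_judge}"
# which renders "env JUDGE_MODELqwen2.5:latest" when JUDGE_MODEL is set;
# the explicit if/else below keeps the separator visible.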
if [ -n "$JUDGE_MODEL" ]; then
JUDGE_SOURCE="env JUDGE_MODEL=${JUDGE_MODEL}"
else
JUDGE_SOURCE="config [models].local_judge"
fi
echo "[lift] judge resolved to: $EFFECTIVE_JUDGE (from $JUDGE_SOURCE)"
echo "[lift] building binaries..."
go build -o bin/ ./cmd/storaged ./cmd/catalogd ./cmd/ingestd ./cmd/queryd \
  ./cmd/embedd ./cmd/vectord ./cmd/pathwayd ./cmd/observerd \
  ./cmd/matrixd ./cmd/gateway \
  ./scripts/staffing_workers ./scripts/staffing_candidates \
  ./scripts/playbook_lift
# Anchor pkill to bin/<name>$ so we don't accidentally hit unrelated
# binaries — and exclude chatd (independent of retrieval, stays up).
pkill -f "bin/(storaged|catalogd|ingestd|queryd|embedd|vectord|pathwayd|observerd|matrixd|gateway)" 2>/dev/null || true
sleep 0.3
PIDS=()
TMP="$(mktemp -d)"
CFG="$TMP/lift.toml"
cleanup() {
  echo "[lift] cleanup"
  for p in "${PIDS[@]:-}"; do [ -n "${p:-}" ] && kill "$p" 2>/dev/null || true; done
  rm -rf "$TMP"
}
trap cleanup EXIT INT TERM
cat > "$CFG" <<EOF
# [s3] tells storaged which bucket to talk to. Without it, defaults
# resolve to "lakehouse-primary" (no -go-) which doesn't exist on this
# box and catalogd's rehydrate fails with NoSuchBucket. Access keys
# come from the secrets file (storaged -secrets defaults to
# /etc/lakehouse/secrets-go.toml), not this temp toml.
[s3]
endpoint = "http://localhost:9000"
region = "us-east-1"
bucket = "lakehouse-go-primary"
use_path_style = true
[gateway]
bind = "127.0.0.1:3110"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
ingestd_url = "http://127.0.0.1:3213"
queryd_url = "http://127.0.0.1:3214"
vectord_url = "http://127.0.0.1:3215"
embedd_url = "http://127.0.0.1:3216"
pathwayd_url = "http://127.0.0.1:3217"
matrixd_url = "http://127.0.0.1:3218"
observerd_url = "http://127.0.0.1:3219"
[storaged]
bind = "127.0.0.1:3211"
[catalogd]
bind = "127.0.0.1:3212"
storaged_url = "http://127.0.0.1:3211"
[ingestd]
bind = "127.0.0.1:3213"
storaged_url = "http://127.0.0.1:3211"
catalogd_url = "http://127.0.0.1:3212"
max_ingest_bytes = 268435456
[queryd]
bind = "127.0.0.1:3214"
catalogd_url = "http://127.0.0.1:3212"
secrets_path = "/etc/lakehouse/secrets-go.toml"
# Aggressive refresh so the SQL probe table appears within ~1s of
# ingestd registering it, instead of the prod default 30s.
refresh_every = "1s"
[embedd]
bind = "127.0.0.1:3216"
provider_url = "http://localhost:11434"
default_model = "nomic-embed-text"
[vectord]
bind = "127.0.0.1:3215"
storaged_url = ""
[pathwayd]
bind = "127.0.0.1:3217"
persist_path = ""
[observerd]
bind = "127.0.0.1:3219"
persist_path = ""
[matrixd]
bind = "127.0.0.1:3218"
embedd_url = "http://127.0.0.1:3216"
vectord_url = "http://127.0.0.1:3215"
EOF
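# poll_health <port>: loop on http://127.0.0.1:<port>/health for up to ~5s;
# returns 0 once the daemon answers, 1 on timeout.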
poll_health() {
  local port="$1" deadline=$(($(date +%s) + 5))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if curl -sS --max-time 1 "http://127.0.0.1:$port/health" >/dev/null 2>&1; then return 0; fi
    sleep 0.05
  done
  return 1
}
echo "[lift] launching stack (10 daemons; chatd stays up independently)..."
# Order respects dependencies: storaged → catalogd (needs storaged) →
# ingestd (needs storaged+catalogd) → queryd (needs catalogd) → embedd →
# vectord → pathwayd → observerd → matrixd (needs embedd+vectord) →
# gateway (needs all of them).
./bin/storaged -config "$CFG" > /tmp/storaged.log 2>&1 & PIDS+=($!)
poll_health 3211 || { echo "storaged failed"; exit 1; }
./bin/catalogd -config "$CFG" > /tmp/catalogd.log 2>&1 & PIDS+=($!)
poll_health 3212 || { echo "catalogd failed"; exit 1; }
./bin/ingestd -config "$CFG" > /tmp/ingestd.log 2>&1 & PIDS+=($!)
poll_health 3213 || { echo "ingestd failed"; exit 1; }
./bin/queryd -config "$CFG" > /tmp/queryd.log 2>&1 & PIDS+=($!)
poll_health 3214 || { echo "queryd failed"; exit 1; }
./bin/embedd -config "$CFG" > /tmp/embedd.log 2>&1 & PIDS+=($!)
poll_health 3216 || { echo "embedd failed"; exit 1; }
./bin/vectord -config "$CFG" > /tmp/vectord.log 2>&1 & PIDS+=($!)
poll_health 3215 || { echo "vectord failed"; exit 1; }
./bin/pathwayd -config "$CFG" > /tmp/pathwayd.log 2>&1 & PIDS+=($!)
poll_health 3217 || { echo "pathwayd failed"; exit 1; }
./bin/observerd -config "$CFG" > /tmp/observerd.log 2>&1 & PIDS+=($!)
poll_health 3219 || { echo "observerd failed"; exit 1; }
./bin/matrixd -config "$CFG" > /tmp/matrixd.log 2>&1 & PIDS+=($!)
poll_health 3218 || { echo "matrixd failed"; exit 1; }
./bin/gateway -config "$CFG" > /tmp/gateway.log 2>&1 & PIDS+=($!)
poll_health 3110 || { echo "gateway failed"; exit 1; }
echo
echo "[lift] SQL surface probe — ingest 3-row CSV, assert SELECT COUNT(*)=3..."
PROBE_CSV="$TMP/sql_probe.csv"
cat > "$PROBE_CSV" <<CSVEOF
id,name,role
1,Alice,Forklift Operator
2,Bob,Production Worker
3,Charlie,Warehouse Associate
CSVEOF
INGEST_RESP="$(curl -sS -F "file=@$PROBE_CSV" "http://127.0.0.1:3110/v1/ingest?name=lift_sql_probe")"
echo "[lift] ingest response: $INGEST_RESP"
# Poll up to 5s for queryd to discover the manifest. refresh_every=1s
# is a lower bound; under load or slow disks the manifest may not be
# visible in a fixed sleep, which would 4xx the SQL probe spuriously.
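# (illustrative) a passing /v1/sql response is shaped like
#   {"rows":[[3]]}
# (only .rows[0][0] is inspected below; any other fields are ignored).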
PROBE_COUNT=ERR
SQL_RESP=""
deadline=$(($(date +%s) + 5))
while [ "$(date +%s)" -lt "$deadline" ]; do
  SQL_RESP="$(curl -sS -X POST http://127.0.0.1:3110/v1/sql \
    -H 'content-type: application/json' \
    -d '{"sql":"SELECT COUNT(*) FROM lift_sql_probe"}')"
  PROBE_COUNT="$(echo "$SQL_RESP" | jq -r '.rows[0][0] // "ERR"' 2>/dev/null || echo "ERR")"
  [ "$PROBE_COUNT" = "3" ] && break
  sleep 0.25
done
if [ "$PROBE_COUNT" = "3" ]; then
  echo "[lift] ✓ SQL surface probe passed (rowcount=3)"
else
  echo "[lift] ✗ SQL surface probe FAILED after 5s (got: $SQL_RESP)"
  exit 1
fi
echo
echo "[lift] ingest workers (limit=$WORKERS_LIMIT)..."
./bin/staffing_workers -limit "$WORKERS_LIMIT"
echo
echo "[lift] ingest ethereal_workers (10K, second staffing-domain corpus)..."
# ethereal_workers is the right second corpus for staffing-domain reality
# tests: same schema as workers_500k but a different population (Material
# Handlers, Admin Assistants, etc.) so the matrix layer's multi-corpus
# retrieve+merge actually has TWO relevant corpora to compose against.
# Earlier versions used scripts/staffing_candidates against the SWE-tech
# candidates parquet (Swift/iOS, Scala/Spark, Rust/DataFusion) — wrong
# domain for staffing queries; effectively dead-corpus noise.
# id-prefix "e-" prevents collisions with workers' "w-" since both files
# count worker_id from 1.
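# (illustrative) with the prefixes, worker_id 42 lands as "w-42" in workers
# and "e-42" in ethereal_workers, so the two corpora cannot collide in vectord.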
./bin/staffing_workers \
  -parquet "/home/profit/lakehouse/data/datasets/ethereal_workers.parquet" \
  -index-name ethereal_workers \
  -id-prefix "e-" \
  -limit 0
echo
echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE · k=$K"
# -judge "$JUDGE_MODEL" passes either the explicit env override or
# the empty string. The Go driver treats empty -judge as "not set"
# and runs its own resolution chain (env → config → fallback). When
# JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
# regardless of what its env-lookup would find — flag wins by design.
EXTRA_FLAGS=""
if [ "$WITH_PARAPHRASE" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
fi
if [ "$WITH_REJUDGE" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
fi
./bin/playbook_lift \
  -config "$CONFIG_PATH" \
  -gateway "http://127.0.0.1:3110" \
  -ollama "http://localhost:11434" \
  -queries "$QUERIES_FILE" \
  -corpora "$CORPORA" \
  -judge "$JUDGE_MODEL" \
  -k "$K" \
  -out "$OUT_JSON" \
  $EXTRA_FLAGS
echo
echo "[lift] generating markdown report → $OUT_MD"
generate_md() {
  local json="$1" md="$2"
  local total discovery lift no_change boosted mean_delta gen_at
  local p_attempted p_top1 p_anyrank p_block
  local rj_attempted q_lifted q_neutral q_regressed rj_block
  total=$(jq -r '.summary.total' "$json")
  discovery=$(jq -r '.summary.with_discovery' "$json")
  lift=$(jq -r '.summary.lift_count' "$json")
  no_change=$(jq -r '.summary.no_change' "$json")
  boosted=$(jq -r '.summary.playbook_boosted_total' "$json")
  mean_delta=$(jq -r '.summary.mean_top1_delta_distance' "$json")
  gen_at=$(jq -r '.summary.generated_at' "$json")
  p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
  p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
  p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
  rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
  q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
  q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
  q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")
  # Only emit the paraphrase block when --with-paraphrase actually ran
  # (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
  # leave the headline clean. String continuation lines stay at column 0
  # so the emitted markdown rows carry no leading whitespace.
  p_block=""
  if [ "$p_attempted" != "0" ] && [ "$p_attempted" != "null" ]; then
    p_block="| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **${p_top1} / ${p_attempted}** |
| Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
  fi
  rj_block=""
  if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
    rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
  fi
  cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}
**Generated:** ${gen_at}
**Judge:** \`${EFFECTIVE_JUDGE}\` (Ollama, resolved from ${JUDGE_SOURCE})
**Corpora:** \`${CORPORA}\`
**Workers limit:** ${WORKERS_LIMIT}
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
**Evidence:** \`${OUT_JSON}\`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | ${total} |
| Cold-pass discoveries (judge-best ≠ top-1) | ${discovery} |
| Warm-pass lifts (recorded playbook → top-1) | ${lift} |
| No change (judge-best already top-1, no playbook needed) | ${no_change} |
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
${p_block}
${rj_block}
**Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
MDEOF
  jq -r '.runs | to_entries[] |
    [
      (.key + 1 | tostring),
      (.value.query | .[0:60]),
      .value.cold_top1_id,
      ((.value.cold_judge_best_rank | tostring) + "/" + (.value.cold_judge_best_rating | tostring)),
      (if .value.playbook_recorded then "✓ " + (.value.playbook_target_id // "") else "—" end),
      .value.warm_top1_id,
      (.value.warm_judge_best_rank | tostring),
      (if .value.lift then "**YES**" else "no" end)
    ] | "| " + join(" | ") + " |"
  ' "$json" >> "$md"
  # Paraphrase per-query table — only emit when the pass ran, and only
  # for queries where Pass 1 recorded a playbook (others have no
  # paraphrase_query field).
  if [ "$p_attempted" != "0" ] && [ "$p_attempted" != "null" ]; then
    cat >> "$md" <<MDEOF
---
## Paraphrase pass — does the playbook help similar-but-different queries?
For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.
| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
MDEOF
    jq -r '.runs | to_entries[] |
      select(.value.playbook_recorded == true and (.value.paraphrase_query // "") != "") |
      [
        (.key + 1 | tostring),
        (.value.query | .[0:40]),
        ((.value.paraphrase_query // "") | .[0:60]),
        (.value.playbook_target_id // "—"),
        (.value.paraphrase_top1_id // "—"),
        (.value.paraphrase_recorded_rank | tostring),
        (if .value.paraphrase_lift then "**YES**" else "no" end)
      ] | "| " + join(" | ") + " |"
    ' "$json" >> "$md"
  fi
cat >> "$md" <<MDEOF
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If \`${EFFECTIVE_JUDGE}\` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5-10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   \`distance' = distance × (1 - 0.5 × score)\`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
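   Worked example (illustrative numbers): a judge-best hit at pre-boost
   distance 0.30 with score 1.0 boosts to 0.30 × 0.5 = 0.15, enough to
   overtake a cold top-1 at 0.20 but not one at 0.14 (since 0.30 > 2 × 0.14).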
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case — same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare verbatim and paraphrase
   lift rates — paraphrase should be lower (semantic distance gates some
   playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=\`${CORPORA}\` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used \`${EFFECTIVE_JUDGE}\` from
   ${JUDGE_SOURCE}. Bumping the judge for run #N+1 means editing one line
   in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of \`paraphrase_query\` values in the JSON before trusting the
   paraphrase lift number.

## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
MDEOF
}
generate_md "$OUT_JSON" "$OUT_MD"
echo
echo "[lift] DONE"
echo "[lift] evidence: $OUT_JSON"
echo "[lift] report: $OUT_MD"