playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%
The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't distinguish "Shape B surfaced a strictly-better answer" from "Shape B shuffled ranks but quality is unchanged" from "Shape B replaced a good answer with a wrong one." This commit adds Pass 4: judge the warm top-1 with the same prompt as the cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral, QualityRegressed (counts of warm rating > / == / < cold rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):
- Quality lifted     5 / 21 (24%) — 3× +2 rating, 2× +1 rating
- Quality neutral   13 / 21 (62%) — includes OOD queries holding 1
- Quality regressed  3 / 21 (14%)
- Net rating delta +3 across 21 queries (+0.14 average)

The 5 lifts all replaced a rating-2 cold top-1 with a rating-3 or rating-4 warm top-1 — Shape B took mediocre matches and substituted substantively better ones. Two of the three regressions were small (-1 each); the third was not (-3). Q11 is the cautionary tale: cold top-1 "production line worker" (rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator" e-5729 (rating 1). Adjacent-domain cross-pollination — production worker and forklift operator embed within 0.20 cosine because both are warehouse-adjacent staffing queries, even though the judge correctly distinguishes them. The split-threshold defense (0.5 boost / 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed neutral at rating 1) but not adjacent-domain cross-pollination.
Net product verdict: working and net-positive on quality, but the worst case (Q11, 4→1) is customer-visible and warrants a tighter inject threshold OR an additional gate beyond cosine distance. Filed in STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
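The split-threshold defense referenced above can be sketched as a single gate on cosine distance. This is an illustrative sketch, not the driver's actual API — the `playbookAction` helper and its return values are hypothetical; only the 0.20 inject / 0.5 boost thresholds come from this commit:

```go
package main

import "fmt"

// playbookAction is a hypothetical sketch of the split-threshold
// defense: a recorded playbook entry is injected as a candidate only
// when the new query embeds within 0.20 cosine distance of the
// recorded query, and merely boosted (re-ranked if already retrieved)
// out to 0.5. Beyond that it is ignored.
func playbookAction(dist float64) string {
	switch {
	case dist <= 0.20:
		return "inject" // entry may displace the cold top-1 outright
	case dist <= 0.50:
		return "boost" // entry only re-ranked if cosine already found it
	default:
		return "ignore"
	}
}

func main() {
	// OOD query (e.g. dental hygienist vs a forklift entry): far away, ignored.
	fmt.Println(playbookAction(0.62))
	// Adjacent-domain query (the Q11 failure mode: production line worker
	// vs forklift entry e-5729): inside 0.20, so it injects.
	fmt.Println(playbookAction(0.18))
}
```

This makes the Q11 failure concrete: the gate sees only geometry, so two warehouse-adjacent staffing queries that the judge can tell apart still land inside the inject band.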
This commit is contained in:
parent 87cbd10090
commit b13b5cd7a1

reports/reality-tests/playbook_lift_005.md — 120 lines, new file

@@ -0,0 +1,120 @@
# Playbook-Lift Reality Test — Run 005

**Generated:** 2026-04-30T12:40:48.475901847Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_005.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 5 |
| No change (judge-best already top-1, no playbook needed) | 16 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |

**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |

---

## Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose cold pass (Pass 1) recorded a playbook entry, the judge model rephrased the query, and the rephrased version was sent through warm matrix.search. The recorded answer ID's rank in those results tests whether cosine on the embedded paraphrase finds the recorded query's vector.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for | e-5729 | e-5729 | 0 | **YES** |
| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |

---
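The "Recorded rank" column above follows a rank-or-−1 convention (rows 9 and 15 show −1 when the recorded answer fell out of top-K). A minimal sketch of that lookup, assuming a hypothetical `recordedRank` helper rather than the driver's actual code:

```go
package main

import "fmt"

// recordedRank returns the position of the recorded answer ID in the
// paraphrase search results, or -1 when it is absent from top-K —
// the same convention the table above uses.
func recordedRank(results []string, recordedID string) int {
	for rank, id := range results {
		if id == recordedID {
			return rank
		}
	}
	return -1
}

func main() {
	// Query 1: recorded e-5729 surfaced at rank 0 → paraphrase lift.
	fmt.Println(recordedRank([]string{"e-5729", "w-205", "w-1566"}, "e-5729"))
	// Query 15: recorded w-49 absent from the paraphrase results → -1.
	fmt.Println(recordedRank([]string{"w-984", "e-2743"}, "w-49"))
}
```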
## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM judge's verdict is what defines "best." If `qwen2.5:latest` rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap case — same query, recorded playbook, expected boost. The paraphrase pass (when enabled) is the actual learning property: similar-but-different queries hitting a recorded playbook. Compare verbatim and paraphrase lift rates — paraphrase should be lower (semantic-distance gates some playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from env JUDGE_MODEL=qwen2.5:latest. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates relevance also rephrases queries. A judge that's bad at rating staffing queries is probably also bad at rephrasing them. Worth sanity-checking a sample of `paraphrase_query` values in the JSON before trusting the paraphrase lift number.
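Caveat 2's boost formula can be checked with a worked example. The formula is from this report; the `boosted` helper and the sample distances are illustrative:

```go
package main

import "fmt"

// boosted applies the playbook math from caveat 2:
// distance' = distance × (1 − 0.5 × score).
// At score = 1.0 the distance is halved, so a boosted entry can only
// overtake the cold top-1 when its pre-boost distance is less than
// 2× the cold top-1's distance.
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := 0.30
	// Pre-boost 0.55 halves to 0.275 < 0.30 → promoted past cold top-1.
	fmt.Println(boosted(0.55, 1.0) < coldTop1)
	// Pre-boost 0.65 halves to 0.325 ≥ 0.30 → even a full-score boost
	// cannot promote it; this is the "tight clusters" limit.
	fmt.Println(boosted(0.65, 1.0) < coldTop1)
}
```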
## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
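The first two bullets above can be read as a decision function on the lift rate. A sketch under stated assumptions — the `nextMove` helper, its return strings, and the "inconclusive" middle bucket (for rates between 20% and 50%, which the report leaves unspecified) are mine, not part of the harness:

```go
package main

import "fmt"

// nextMove maps verbatim lift rate (lifts / discoveries) onto the
// decision thresholds from "Next moves". The middle bucket is an
// assumption: the report only defines the ≥50% and <20% cases.
func nextMove(lifts, discoveries int) string {
	if discoveries == 0 {
		return "investigate discovery rate" // cosine may already be near-optimal
	}
	rate := float64(lifts) / float64(discoveries)
	switch {
	case rate >= 0.5:
		return "promote"
	case rate < 0.2:
		return "investigate"
	default:
		return "inconclusive"
	}
}

func main() {
	// Run #005: 5 lifts over 7 discoveries ≈ 71% → promote.
	fmt.Println(nextMove(5, 7))
}
```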
@@ -52,6 +52,11 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
 # actual learning-property test (does cosine on paraphrase find the
 # recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
 WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
+# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
+# quality lift (warm rating vs cold rating). Catches cases where Shape B
+# surfaces a different-but-equally-good answer (which the rank-based
+# lift metric misses). +21 judge calls (~30s on qwen2.5).
+WITH_REJUDGE="${WITH_REJUDGE:-1}"
 
 OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
 OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
@@ -271,9 +276,12 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE"
 # and runs its own resolution chain (env → config → fallback). When
 # JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
 # regardless of what its env-lookup would find — flag wins by design.
-PARAPHRASE_FLAG=""
+EXTRA_FLAGS=""
 if [ "$WITH_PARAPHRASE" = "1" ]; then
-  PARAPHRASE_FLAG="-with-paraphrase"
+  EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
+fi
+if [ "$WITH_REJUDGE" = "1" ]; then
+  EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
 fi
 ./bin/playbook_lift \
   -config "$CONFIG_PATH" \
@@ -284,7 +292,7 @@ fi
   -judge "$JUDGE_MODEL" \
   -k "$K" \
   -out "$OUT_JSON" \
-  $PARAPHRASE_FLAG
+  $EXTRA_FLAGS
 
 echo
 echo "[lift] generating markdown report → $OUT_MD"
@@ -302,6 +310,10 @@ generate_md() {
 p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
 p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
 p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
+rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
+q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
+q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
+q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")
 
 # Only emit the paraphrase block when --with-paraphrase actually ran
 # (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
@@ -312,6 +324,13 @@ generate_md() {
 | Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
 fi
+
+rj_block=""
+if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
+rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
+| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
+| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
+fi
 
 cat > "$md" <<MDEOF
 # Playbook-Lift Reality Test — Run ${RUN_ID}
 
@@ -322,6 +341,7 @@ generate_md() {
 **Queries:** \`${QUERIES_FILE}\` (${total} executed)
 **K per pass:** ${K}
 **Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
+**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
 **Evidence:** \`${OUT_JSON}\`
 
 ---
@@ -337,6 +357,7 @@ generate_md() {
 | Playbook boosts triggered (warm pass) | ${boosted} |
 | Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
 ${p_block}
+${rj_block}
 
 **Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.
 
@@ -75,12 +75,19 @@ type queryRun struct {
 	PlaybookRecorded bool   `json:"playbook_recorded"`
 	PlaybookID       string `json:"playbook_target_id,omitempty"`
 
 	WarmTop1ID        string  `json:"warm_top1_id"`
 	WarmTop1Distance  float32 `json:"warm_top1_distance"`
 	WarmBoostedCount  int     `json:"warm_boosted_count"`
-	WarmJudgeBestRank int     `json:"warm_judge_best_rank"`
+	WarmJudgeBestRank int     `json:"warm_judge_best_rank"` // rank of cold judge-best in warm — NOT the warm pass's own judge-best
+	WarmTop1Metadata  json.RawMessage `json:"-"`            // cached for Pass 4 rejudge; not emitted
+
+	// WarmTop1Rating: only populated when --with-rejudge. Compare to
+	// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
+	// *int so absence (no rejudge pass) and a 0-rating verdict are
+	// distinguishable.
+	WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
 
 	Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
 
 	// Paraphrase pass — only populated when --with-paraphrase. Tests
 	// the playbook's actual learning property: does a recorded entry
@@ -114,6 +121,17 @@ type summary struct {
 	ParaphraseTop1Lifts   int `json:"paraphrase_top1_lifts,omitempty"`    // recorded answer surfaced at rank 0
 	ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K
 
+	// Re-judge pass aggregates — only populated when --with-rejudge.
+	// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
+	// rather than rank-of-cold-judge-best lift. The latter conflates
+	// "warm surfaced a different but equally-good result" with "warm
+	// shuffled ranks but the answer was the same"; quality lift
+	// disambiguates them.
+	RejudgeAttempted int `json:"rejudge_attempted,omitempty"` // queries that ran the rejudge pass
+	QualityLifted    int `json:"quality_lifted,omitempty"`    // warm-top-1 rating > cold-top-1 rating
+	QualityNeutral   int `json:"quality_neutral,omitempty"`   // ratings equal (could be same or different item)
+	QualityRegressed int `json:"quality_regressed,omitempty"` // warm-top-1 rating < cold-top-1 rating
+
 	GeneratedAt time.Time `json:"generated_at"`
 }
 
@@ -128,6 +146,7 @@ func main() {
 	k := flag.Int("k", 10, "top-k from matrix.search per pass")
 	out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
 	withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
+	withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
 	flag.Parse()
 
 	// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@@ -225,6 +244,7 @@ func main() {
 		}
 		runs[i].WarmTop1ID = resp.Results[0].ID
 		runs[i].WarmTop1Distance = resp.Results[0].Distance
+		runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
 		runs[i].WarmBoostedCount = resp.PlaybookBoosted
 		playbookBoostedTotal += resp.PlaybookBoosted
 
@@ -304,6 +324,47 @@ func main() {
 		}
 	}
 
+	// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
+	// top-1 against the same prompt as cold ratings, then compare to
+	// cold top-1 rating. This measures QUALITY lift (did the playbook
+	// produce a better candidate?) rather than just rank-of-cold-judge-
+	// best lift (did the recorded answer move to top-1, even if cold's
+	// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
+	// run #003's verbatim 2/6 didn't tell us whether Shape B was
+	// surfacing better OR same-quality alternatives.
+	rejudgeAttempted := 0
+	qualityLifted := 0
+	qualityNeutral := 0
+	qualityRegressed := 0
+	if *withRejudge {
+		log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
+		for i := range runs {
+			if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
+				continue // warm pass didn't complete for this query
+			}
+			rejudgeAttempted++
+			result := matrixResult{
+				ID:       runs[i].WarmTop1ID,
+				Distance: runs[i].WarmTop1Distance,
+				Metadata: runs[i].WarmTop1Metadata,
+			}
+			warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
+			runs[i].WarmTop1Rating = &warmRating
+			coldRating := 0
+			if len(runs[i].ColdRatings) > 0 {
+				coldRating = runs[i].ColdRatings[0]
+			}
+			switch {
+			case warmRating > coldRating:
+				qualityLifted++
+			case warmRating < coldRating:
+				qualityRegressed++
+			default:
+				qualityNeutral++
+			}
+		}
+	}
+
 	sum := summary{
 		Total:         len(runs),
 		WithDiscovery: withDiscovery,
@@ -314,6 +375,10 @@ func main() {
 		ParaphraseAttempted:   paraphraseAttempted,
 		ParaphraseTop1Lifts:   paraphraseTop1Lifts,
 		ParaphraseAnyRankHits: paraphraseAnyRankHits,
+		RejudgeAttempted:      rejudgeAttempted,
+		QualityLifted:         qualityLifted,
+		QualityNeutral:        qualityNeutral,
+		QualityRegressed:      qualityRegressed,
 		GeneratedAt:           time.Now().UTC(),
 	}
 	if len(runs) > 0 {
@@ -323,11 +388,11 @@ func main() {
 	if err := writeJSON(*out, runs, sum); err != nil {
 		log.Fatalf("write %s: %v", *out, err)
 	}
-	if *withParaphrase {
-		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
+	if *withParaphrase || *withRejudge {
+		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
 			sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
 			sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
-			sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
+			sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
 	} else {
 		log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
 			sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)