playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%

The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
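
A minimal consumer-side sketch of the nil-distinguishable rating
(hypothetical reader code, not part of this commit; only the
`warm_top1_rating` field name and the *int convention come from the
driver):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryRun is trimmed to the one field this sketch cares about.
type queryRun struct {
	Query          string `json:"query"`
	WarmTop1Rating *int   `json:"warm_top1_rating,omitempty"`
}

func main() {
	for _, raw := range []string{
		`{"query":"q-a"}`,                      // no rejudge pass → key absent → nil
		`{"query":"q-b","warm_top1_rating":0}`, // rejudge ran, judge rated 0 → non-nil
	} {
		var qr queryRun
		if err := json.Unmarshal([]byte(raw), &qr); err != nil {
			panic(err)
		}
		if qr.WarmTop1Rating == nil {
			fmt.Printf("%s: not rejudged\n", qr.Query)
		} else {
			fmt.Printf("%s: warm rating %d\n", qr.Query, *qr.WarmTop1Rating)
		}
	}
}
```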

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)
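
  (Derivation of the net delta: lifts contribute 3×(+2) + 2×(+1) = +8;
  regressions contribute (-1) + (-1) + (-3) = -5; +8 - 5 = +3, and
  +3 / 21 ≈ +0.14.)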

All 5 lifts replaced a rating-2 cold top-1 with a rating-3 or rating-4
warm top-1 — Shape B took mediocre matches and substituted substantively
better ones. Two of the 3 regressions were small (-1, -1); the third
(-3) is the Q11 case below.

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). This is adjacent-domain cross-pollination: the
production-worker query and the forklift-operator entry embed within
0.20 cosine distance of each other because both are warehouse-adjacent
staffing queries, even though the judge correctly distinguishes them.
The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.
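
A minimal sketch of the split-threshold gate described above (the
0.5/0.20 thresholds and the boost formula come from this run; the
function name and signature are hypothetical, not the shipped
matrix-layer API):

```go
package main

import "fmt"

// Illustrative only — applyPlaybook is a hypothetical stand-in for the
// matrix layer. Boost math per the report: d' = d × (1 - 0.5 × score).
const (
	boostThreshold  float32 = 0.5  // entries this close get their distance boosted
	injectThreshold float32 = 0.20 // entries this close may be injected as candidates
)

func applyPlaybook(entryDist, score, resultDist float32) (float32, bool) {
	if entryDist <= boostThreshold {
		resultDist *= 1 - 0.5*score
	}
	return resultDist, entryDist <= injectThreshold
}

func main() {
	// Q11-style failure: an adjacent-domain entry at ~0.18 clears both
	// gates even though the judge would rate the substitution a 1.
	d, inj := applyPlaybook(0.18, 1.0, 0.30)
	fmt.Printf("boosted=%.3f inject=%v\n", d, inj) // boosted=0.150 inject=true
}
```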

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit b13b5cd7a1 · parent 87cbd10090
root · 2026-04-30 07:42:04 -05:00
3 changed files with 217 additions and 11 deletions


@@ -0,0 +1,120 @@
# Playbook-Lift Reality Test — Run 005
**Generated:** 2026-04-30T12:40:48.475901847Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Re-judge pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_005.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 5 |
| No change (judge-best already top-1, no playbook needed) | 16 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **5 / 7** |
| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
| **Quality lift** (warm top-1 rating > cold top-1 rating) | **5 / 21** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |
**Verbatim lift rate:** 5 of 7 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | **YES** |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |
---
## Paraphrase pass — does the playbook help similar-but-different queries?
For each query whose cold pass (Pass 1) recorded a playbook entry, the
judge model rephrased the query and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase still finds the
recorded query's vector. A recorded rank of -1 means the answer did not
appear anywhere in the top-K.
| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for | e-5729 | e-5729 | 0 | **YES** |
| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | **YES** |
| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
the lift number is meaningless. To validate the judge itself, sample 5–10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance; otherwise
even halving doesn't promote it. Tight clusters → little visible lift
(worked example after this list).
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
env JUDGE_MODEL=qwen2.5:latest.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
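**Worked example for caveat 2** (numbers illustrative, not taken from this
run): with score = 1.0 the boost halves distance,
`distance' = d × (1 - 0.5 × 1.0) = 0.5 × d`. If the cold top-1 sits at
distance 0.30, a judge-best result at pre-boost 0.55 boosts to 0.275 and is
promoted; at pre-boost 0.65 it boosts to 0.325 and stays behind the cold
top-1, which is exactly the ≤ 2× bound.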
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why (judge variance, distance gap too
  wide, or playbook math too gentle). The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@@ -52,6 +52,11 @@ CONFIG_PATH="${CONFIG_PATH:-lakehouse.toml}"
# actual learning-property test (does cosine on paraphrase find the
# recorded entry?). Set WITH_PARAPHRASE=0 for a faster verbatim-only run.
WITH_PARAPHRASE="${WITH_PARAPHRASE:-1}"
# WITH_REJUDGE=1 (default) adds a Pass 4 — judge warm top-1 to measure
# quality lift (warm rating vs cold rating). Catches cases where Shape B
# surfaces a different-but-equally-good answer (which the rank-based
# lift metric misses). +21 judge calls (~30s on qwen2.5).
WITH_REJUDGE="${WITH_REJUDGE:-1}"
OUT_JSON="reports/reality-tests/playbook_lift_${RUN_ID}.json"
OUT_MD="reports/reality-tests/playbook_lift_${RUN_ID}.md"
@@ -271,9 +276,12 @@ echo "[lift] running driver — judge=$EFFECTIVE_JUDGE · queries=$QUERIES_FILE
# and runs its own resolution chain (env → config → fallback). When
# JUDGE_MODEL IS set, the explicit -judge wins inside the Go driver
# regardless of what its env-lookup would find — flag wins by design.
PARAPHRASE_FLAG=""
EXTRA_FLAGS=""
if [ "$WITH_PARAPHRASE" = "1" ]; then
PARAPHRASE_FLAG="-with-paraphrase"
EXTRA_FLAGS="$EXTRA_FLAGS -with-paraphrase"
fi
if [ "$WITH_REJUDGE" = "1" ]; then
EXTRA_FLAGS="$EXTRA_FLAGS -with-rejudge"
fi
./bin/playbook_lift \
-config "$CONFIG_PATH" \
@@ -284,7 +292,7 @@ fi
-judge "$JUDGE_MODEL" \
-k "$K" \
-out "$OUT_JSON" \
$PARAPHRASE_FLAG
$EXTRA_FLAGS
echo
echo "[lift] generating markdown report → $OUT_MD"
@@ -302,6 +310,10 @@ generate_md() {
p_attempted=$(jq -r '.summary.paraphrase_attempted // 0' "$json")
p_top1=$(jq -r '.summary.paraphrase_top1_lifts // 0' "$json")
p_anyrank=$(jq -r '.summary.paraphrase_any_rank_hits // 0' "$json")
rj_attempted=$(jq -r '.summary.rejudge_attempted // 0' "$json")
q_lifted=$(jq -r '.summary.quality_lifted // 0' "$json")
q_neutral=$(jq -r '.summary.quality_neutral // 0' "$json")
q_regressed=$(jq -r '.summary.quality_regressed // 0' "$json")
# Only emit the paraphrase block when --with-paraphrase actually ran
# (i.e. .summary.paraphrase_attempted > 0). For verbatim-only runs we
@@ -312,6 +324,13 @@ generate_md() {
| Paraphrase pass — recorded answer at any rank in top-K | ${p_anyrank} / ${p_attempted} |"
fi
rj_block=""
if [ "$rj_attempted" != "0" ] && [ "$rj_attempted" != "null" ]; then
rj_block="| **Quality lift** (warm top-1 rating > cold top-1 rating) | **${q_lifted} / ${rj_attempted}** |
| Quality neutral (warm top-1 rating = cold top-1 rating) | ${q_neutral} / ${rj_attempted} |
| Quality regressed (warm top-1 rating < cold top-1 rating) | ${q_regressed} / ${rj_attempted} |"
fi
cat > "$md" <<MDEOF
# Playbook-Lift Reality Test — Run ${RUN_ID}
@@ -322,6 +341,7 @@ generate_md() {
**Queries:** \`${QUERIES_FILE}\` (${total} executed)
**K per pass:** ${K}
**Paraphrase pass:** $([ "$WITH_PARAPHRASE" = "1" ] && echo "ENABLED" || echo "disabled")
**Re-judge pass:** $([ "$WITH_REJUDGE" = "1" ] && echo "ENABLED" || echo "disabled")
**Evidence:** \`${OUT_JSON}\`
---
@@ -337,6 +357,7 @@ generate_md() {
| Playbook boosts triggered (warm pass) | ${boosted} |
| Mean Δ top-1 distance (warm − cold) | ${mean_delta} |
${p_block}
${rj_block}
**Verbatim lift rate:** ${lift} of ${discovery} discoveries became top-1 after warm pass.
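
Usage note: both optional passes are env-gated and on by default in the
harness, so a prod run needs no extra flags; export `WITH_REJUDGE=0` (and
optionally `WITH_PARAPHRASE=0`) before invoking the harness for a faster
rank-only run.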


@@ -75,12 +75,19 @@ type queryRun struct {
PlaybookRecorded bool `json:"playbook_recorded"`
PlaybookID string `json:"playbook_target_id,omitempty"`
WarmTop1ID string `json:"warm_top1_id"`
WarmTop1Distance float32 `json:"warm_top1_distance"`
WarmBoostedCount int `json:"warm_boosted_count"`
WarmJudgeBestRank int `json:"warm_judge_best_rank"`
WarmTop1ID string `json:"warm_top1_id"`
WarmTop1Distance float32 `json:"warm_top1_distance"`
WarmBoostedCount int `json:"warm_boosted_count"`
WarmJudgeBestRank int `json:"warm_judge_best_rank"` // rank of cold judge-best in warm — NOT the warm pass's own judge-best
WarmTop1Metadata json.RawMessage `json:"-"` // cached for Pass 4 rejudge; not emitted
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
// WarmTop1Rating: only populated when --with-rejudge. Compare to
// ColdRatings[0] (== cold top-1 rating) to measure quality lift.
// *int so absence (no rejudge pass) and a 0-rating verdict are
// distinguishable.
WarmTop1Rating *int `json:"warm_top1_rating,omitempty"`
Lift bool `json:"lift"` // judge-best was below top-1 cold, but top-1 warm
// Paraphrase pass — only populated when --with-paraphrase. Tests
// the playbook's actual learning property: does a recorded entry
@@ -114,6 +121,17 @@ type summary struct {
ParaphraseTop1Lifts int `json:"paraphrase_top1_lifts,omitempty"` // recorded answer surfaced at rank 0
ParaphraseAnyRankHits int `json:"paraphrase_any_rank_hits,omitempty"` // recorded answer surfaced at any rank in top-K
// Re-judge pass aggregates — only populated when --with-rejudge.
// Measures QUALITY lift (warm top-1 rating vs cold top-1 rating)
// rather than rank-of-cold-judge-best lift. The latter conflates
// "warm surfaced a different but equally-good result" with "warm
// shuffled ranks but the answer was the same"; quality lift
// disambiguates them.
RejudgeAttempted int `json:"rejudge_attempted,omitempty"` // queries that ran the rejudge pass
QualityLifted int `json:"quality_lifted,omitempty"` // warm-top-1 rating > cold-top-1 rating
QualityNeutral int `json:"quality_neutral,omitempty"` // ratings equal (could be same or different item)
QualityRegressed int `json:"quality_regressed,omitempty"` // warm-top-1 rating < cold-top-1 rating
GeneratedAt time.Time `json:"generated_at"`
}
@@ -128,6 +146,7 @@ func main() {
k := flag.Int("k", 10, "top-k from matrix.search per pass")
out := flag.String("out", "reports/reality-tests/playbook_lift_001.json", "output JSONL path")
withParaphrase := flag.Bool("with-paraphrase", false, "after warm pass, generate a paraphrase via the judge model and re-query with playbook=true to test the learning property")
withRejudge := flag.Bool("with-rejudge", false, "after warm pass, judge warm top-1 to measure QUALITY lift (vs cold top-1 rating), not just rank-of-cold-judge-best")
flag.Parse()
// Judge resolution priority: explicit flag > $JUDGE_MODEL env >
@@ -225,6 +244,7 @@ func main() {
}
runs[i].WarmTop1ID = resp.Results[0].ID
runs[i].WarmTop1Distance = resp.Results[0].Distance
runs[i].WarmTop1Metadata = resp.Results[0].Metadata // cache for Pass 4 rejudge
runs[i].WarmBoostedCount = resp.PlaybookBoosted
playbookBoostedTotal += resp.PlaybookBoosted
@@ -304,6 +324,47 @@ func main() {
}
}
// Pass 4 (warm-rejudge) — opt-in via --with-rejudge. Judge warm
// top-1 against the same prompt as cold ratings, then compare to
// cold top-1 rating. This measures QUALITY lift (did the playbook
// produce a better candidate?) rather than just rank-of-cold-judge-
// best lift (did the recorded answer move to top-1, even if cold's
// top-1 was already good?). See STATE_OF_PLAY OPEN — added because
// run #003's verbatim 2/6 didn't tell us whether Shape B was
// surfacing better OR same-quality alternatives.
rejudgeAttempted := 0
qualityLifted := 0
qualityNeutral := 0
qualityRegressed := 0
if *withRejudge {
log.Printf("[lift] warm-rejudge pass: measuring quality lift (warm top-1 rating vs cold top-1 rating)")
for i := range runs {
if runs[i].WarmTop1ID == "" || len(runs[i].WarmTop1Metadata) == 0 {
continue // warm pass didn't complete for this query
}
rejudgeAttempted++
result := matrixResult{
ID: runs[i].WarmTop1ID,
Distance: runs[i].WarmTop1Distance,
Metadata: runs[i].WarmTop1Metadata,
}
warmRating := judgeRate(hc, *ollama, *judge, runs[i].Query, result)
runs[i].WarmTop1Rating = &warmRating
coldRating := 0
if len(runs[i].ColdRatings) > 0 {
coldRating = runs[i].ColdRatings[0]
}
switch {
case warmRating > coldRating:
qualityLifted++
case warmRating < coldRating:
qualityRegressed++
default:
qualityNeutral++
}
}
}
sum := summary{
Total: len(runs),
WithDiscovery: withDiscovery,
@@ -314,6 +375,10 @@ func main() {
ParaphraseAttempted: paraphraseAttempted,
ParaphraseTop1Lifts: paraphraseTop1Lifts,
ParaphraseAnyRankHits: paraphraseAnyRankHits,
RejudgeAttempted: rejudgeAttempted,
QualityLifted: qualityLifted,
QualityNeutral: qualityNeutral,
QualityRegressed: qualityRegressed,
GeneratedAt: time.Now().UTC(),
}
if len(runs) > 0 {
@@ -323,11 +388,11 @@ func main() {
if err := writeJSON(*out, runs, sum); err != nil {
log.Fatalf("write %s: %v", *out, err)
}
if *withParaphrase {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1, %d/%d→anyrank",
if *withParaphrase || *withRejudge {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f · paraphrase=%d/%d→top1 · quality=lifted%d/neutral%d/regressed%d",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance,
sum.ParaphraseTop1Lifts, sum.ParaphraseAttempted,
sum.ParaphraseAnyRankHits, sum.ParaphraseAttempted)
sum.QualityLifted, sum.QualityNeutral, sum.QualityRegressed)
} else {
log.Printf("[lift] DONE — %d queries · discovery=%d · lift=%d · boosted=%d · meanΔdist=%.4f",
sum.Total, sum.WithDiscovery, sum.LiftCount, sum.PlaybookBoostedTotal, sum.MeanTop1DeltaDistance)