root b13b5cd7a1 playbook_lift v4 metric: warm-top-1 re-judge — quality lift +24%/-14%

The rank-based "lift" metric (warm-top-1 == cold-judge-best) doesn't
distinguish "Shape B surfaced a strictly-better answer" from "Shape B
shuffled ranks but quality is unchanged" from "Shape B replaced a good
answer with a wrong one." This commit adds Pass 4: judge warm top-1
with the same prompt as cold ratings, then bucket the comparison.

Implementation:
- New --with-rejudge driver flag (default off).
- New WITH_REJUDGE harness env (default 1, on for prod runs).
- queryRun gains WarmTop1Metadata (cached during Pass 2 for the
  rejudge call) + WarmTop1Rating *int (nil-distinguishable; nil = no
  rejudge, 0..5 = rating).
- summary gains RejudgeAttempted, QualityLifted, QualityNeutral,
  QualityRegressed (counts of warm-rating > / == / < cold-rating).
- Markdown headline gains a Quality block when rejudge ran.
- ~21 extra judge calls (~30s on qwen2.5).
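
The bucketing described in the list above can be sketched roughly as follows. This is a minimal illustration, not the committed code: WarmTop1Rating and the summary field names come from this message, while ColdTop1Rating, bucketQuality and intPtr are hypothetical names introduced for the example.

```go
package main

import "fmt"

// queryRun keeps only the fields needed for bucketing.
// ColdTop1Rating is an assumed field name; WarmTop1Rating follows the commit.
type queryRun struct {
	ColdTop1Rating int  // judge rating of the cold top-1 (hypothetical field)
	WarmTop1Rating *int // nil = no rejudge, otherwise 0..5
}

// summary mirrors the counters named in the commit message.
type summary struct {
	RejudgeAttempted int
	QualityLifted    int
	QualityNeutral   int
	QualityRegressed int
}

// bucketQuality compares warm and cold ratings for every re-judged run.
// Runs with a nil WarmTop1Rating are skipped, so a real rating of 0 is
// still counted.
func bucketQuality(runs []queryRun) summary {
	var s summary
	for _, r := range runs {
		if r.WarmTop1Rating == nil {
			continue
		}
		s.RejudgeAttempted++
		switch warm := *r.WarmTop1Rating; {
		case warm > r.ColdTop1Rating:
			s.QualityLifted++
		case warm < r.ColdTop1Rating:
			s.QualityRegressed++
		default:
			s.QualityNeutral++
		}
	}
	return s
}

// intPtr is a small helper for building *int literals.
func intPtr(v int) *int { return &v }

func main() {
	runs := []queryRun{
		{ColdTop1Rating: 2, WarmTop1Rating: intPtr(4)}, // lifted
		{ColdTop1Rating: 4, WarmTop1Rating: intPtr(1)}, // regressed
		{ColdTop1Rating: 1, WarmTop1Rating: intPtr(0)}, // regressed, rating 0 is not "missing"
		{ColdTop1Rating: 3, WarmTop1Rating: nil},       // never re-judged
	}
	fmt.Printf("%+v\n", bucketQuality(runs))
}
```

The pointer is what lets a rating of 0 be told apart from "rejudge did not run"; a plain int would conflate the two.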

Run #005 result (split inject threshold 0.20 + paraphrase + rejudge):

  Quality lifted     5 / 21  (24%)  — 3× +2 rating, 2× +1 rating
  Quality neutral   13 / 21  (62%)  — includes OOD queries holding 1
  Quality regressed  3 / 21  (14%)
  Net rating delta  +3 across 21 queries (+0.14 average)

The 5 lifts were all rating-2 cold replaced with rating-3 or rating-4
warm: Shape B took mediocre matches and substituted substantively
better ones. Two of the 3 regressions were small (-1, -1); the third
was Q11 at -3. Lifts sum to +8 and regressions to -5, hence the net +3.

Q11 is the cautionary tale: cold top-1 "production line worker"
(rating 4) got replaced by Q1's recorded "forklift OSHA-30 operator"
e-5729 (rating 1). Adjacent-domain cross-pollination — production
worker and forklift operator embed within 0.20 cosine because both
are warehouse-adjacent staffing queries, even though the judge
correctly distinguishes them. The split-threshold defense (0.5 boost
/ 0.20 inject) catches OOD cross-pollination (Q19/Q20/Q21 all stayed
neutral at rating 1) but not adjacent-domain cross-pollination.
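
One way to read the split-threshold defense is sketched below. The two constants are the values quoted above; everything else is an assumption made for illustration, including the idea that the thresholds apply to the cosine distance between the incoming query and a recorded playbook query, and the names decide, playbookAction and the action constants.

```go
package main

import "fmt"

// Threshold values are taken from the commit message; how they are applied
// is assumed here, not confirmed by the source.
const (
	boostMaxDistance  = 0.5  // within this distance the playbook may re-rank
	injectMaxDistance = 0.20 // within this distance it may insert its recorded answer
)

type playbookAction int

const (
	actionNone playbookAction = iota
	actionBoost
	actionInject
)

func (a playbookAction) String() string {
	switch a {
	case actionInject:
		return "inject"
	case actionBoost:
		return "boost"
	default:
		return "none"
	}
}

// decide maps the cosine distance between the incoming query and a recorded
// playbook query to the strongest action allowed at that distance.
func decide(queryDistance float64) playbookAction {
	switch {
	case queryDistance <= injectMaxDistance:
		return actionInject
	case queryDistance <= boostMaxDistance:
		return actionBoost
	default:
		return actionNone
	}
}

func main() {
	// Illustrative distances only; they are not measurements from run #005.
	for _, d := range []float64{0.15, 0.35, 0.80} {
		fmt.Printf("distance %.2f -> %s\n", d, decide(d))
	}
}
```

Under this reading, a gate based on distance alone cannot separate Q11 from a true paraphrase: any adjacent-domain pair that embeds within 0.20 passes the inject check.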

Net product verdict: working, net-positive on quality, but the worst
case (Q11 4→1) is customer-visible and warrants a tighter inject
threshold OR an additional gate beyond cosine distance. Filed in
STATE_OF_PLAY OPEN as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 07:42:04 -05:00


Playbook-Lift Reality Test — Run 005

Generated: 2026-04-30T12:40:48.475901847Z
Judge: qwen2.5:latest (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
Corpora: workers,ethereal_workers
Workers limit: 5000
Queries: tests/reality/playbook_lift_queries.txt (21 executed)
K per pass: 10
Paraphrase pass: ENABLED
Re-judge pass: ENABLED
Evidence: reports/reality-tests/playbook_lift_005.json

Headline

| Metric | Value |
|---|---|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 5 |
| No change (judge-best already top-1, no playbook needed) | 16 |
| Playbook boosts triggered (warm pass) | 9 |
| Mean Δ top-1 distance (warm − cold) | -0.076170966 |
| Paraphrase pass — recorded answer at rank 0 (top-1) | 5 / 7 |
| Paraphrase pass — recorded answer at any rank in top-K | 5 / 7 |
| Quality lift (warm top-1 rating > cold top-1 rating) | 5 / 21 |
| Quality neutral (warm top-1 rating = cold top-1 rating) | 13 / 21 |
| Quality regressed (warm top-1 rating < cold top-1 rating) | 3 / 21 |

Verbatim lift rate: 5 of 7 discoveries became top-1 after warm pass.

Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-5670 | 2/4 | ✓ e-5729 | e-5729 | 0 | YES |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-6293 | 7/3 | — | w-1566 | 8 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-602 | 0/2 | — | w-3575 | 1 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3854 | 0/1 | — | w-3854 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-1807 | 6/3 | — | w-1807 | 6 | no |
| 6 | Forklift-certified loader, certification must be active, dis | w-1807 | 3/4 | ✓ w-205 | w-4257 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | e-4910 | 2/4 | ✓ w-4257 | w-205 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4988 | 0/4 | — | w-4988 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-388 | 3/4 | ✓ w-3575 | w-3575 | 0 | YES |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-3011 | 0/4 | — | e-3011 | 0 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1387 | 0/4 | — | e-5729 | 1 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | w-1451 | 0/2 | — | w-1451 | 0 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-7360 | 5/4 | ✓ w-2886 | w-2886 | 0 | YES |
| 14 | Highly responsive forklift operator available for last-minut | e-6108 | 5/4 | ✓ w-1566 | w-1566 | 0 | YES |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 2/4 | ✓ w-49 | w-49 | 0 | YES |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-2486 | 5/2 | — | w-2486 | 5 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-9749 | 9/2 | — | e-9749 | 9 | no |
| 18 | Production supervisor open to Midwest relocation for permane | w-379 | 6/3 | — | w-379 | 6 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-6772 | 0/1 | — | w-3575 | 1 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-379 | 0/1 | — | w-379 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-1773 | 0/1 | — | w-1773 | 0 | no |

Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose cold pass (Pass 1) recorded a playbook entry, the judge model rephrased the query, and the rephrased version was sent through warm matrix.search. The rank of the recorded answer ID in those results tests whether cosine on the embedded paraphrase finds the recorded query's vector. A recorded rank of -1 means the recorded answer did not appear in the top-K at all.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, looking for | e-5729 | e-5729 | 0 | YES |
| 6 | Forklift-certified loader, certification | Loader requiring active forklift certification, this must no | w-205 | w-205 | 0 | YES |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-4257 | w-4257 | 0 | YES |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certified confi | w-3575 | w-49 | -1 | no |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent punctu | w-2886 | w-2886 | 0 | YES |
| 14 | Highly responsive forklift operator avai | Available forklift operator ready for urgent shift coverage | w-1566 | w-1566 | 0 | YES |
| 15 | Engaged warehouse associate with strong | Warehouse associate dedicated to engagement and boasting a r | w-49 | w-984 | -1 | no |
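
The "Recorded rank" column above can be derived with a lookup along these lines. This is a sketch under assumptions: recordedRank is a hypothetical helper, result IDs are assumed to arrive as an ordered slice of strings, and the "other-..." IDs are placeholders, not real results from this run.

```go
package main

import "fmt"

// recordedRank returns the zero-based position of recordedID in the ordered
// top-K result IDs, or -1 when the recorded answer is not in the list.
func recordedRank(topK []string, recordedID string) int {
	for i, id := range topK {
		if id == recordedID {
			return i
		}
	}
	return -1
}

func main() {
	// Query 1: the paraphrase top-1 is the recorded answer itself.
	q1 := []string{"e-5729", "other-a", "other-b"}
	// Query 9: the paraphrase top-1 is w-49 and w-3575 is absent from top-K.
	q9 := []string{"w-49", "other-c", "other-d"}

	for _, c := range []struct {
		name     string
		topK     []string
		recorded string
	}{
		{"Q1", q1, "e-5729"},
		{"Q9", q9, "w-3575"},
	} {
		rank := recordedRank(c.topK, c.recorded)
		fmt.Printf("%s recorded rank=%d paraphrase lift=%t\n", c.name, rank, rank == 0)
	}
}
```

In this sketch a paraphrase lift is counted only at rank 0, which matches the table: every YES row has recorded rank 0.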

Honesty caveats

Judge IS the ground truth proxy. Without human-labeled relevance, the LLM judge's verdict is what defines "best." If qwen2.5:latest rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
Score-1.0 boost = distance halved. Playbook math is distance' = distance × (1 - 0.5 × score). Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
Verbatim vs paraphrase. The verbatim lift rate (above) is the cheap case — same query, recorded playbook, expected boost. The paraphrase pass (when enabled) is the actual learning property: similar-but-different queries hitting a recorded playbook. Compare verbatim and paraphrase lift rates — paraphrase should be lower (semantic-distance gates some playbook hits) but non-zero is the meaningful signal.
Multi-corpus skew. Default corpora=workers,ethereal_workers — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
Judge resolution. This run used qwen2.5:latest from env JUDGE_MODEL=qwen2.5:latest. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
Paraphrase generation also uses the judge. The same model that rates relevance also rephrases queries. A judge that's bad at rating staffing queries is probably also bad at rephrasing them. Worth sanity-checking a sample of paraphrase_query values in the JSON before trusting the paraphrase lift number.
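
The boost formula in the caveats above, distance' = distance × (1 - 0.5 × score), and its 2× promotion bound can be checked with a few lines of Go. This is a sketch only: boosted and canPromote are hypothetical names, the distances are made-up values, and the model assumes the judge-best result is the only one boosted while the cold top-1 keeps its original distance.

```go
package main

import "fmt"

// boosted applies the playbook formula from the caveats:
// distance' = distance * (1 - 0.5*score). A score of 1.0 halves the distance.
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// canPromote reports whether the boosted judge-best result ends up at or
// below the cold top-1 distance. Assumes the cold top-1 is not boosted.
func canPromote(bestDistance, top1Distance, score float64) bool {
	return boosted(bestDistance, score) <= top1Distance
}

func main() {
	const top1 = 0.30 // illustrative cold top-1 distance

	// 0.50 is within 2x of 0.30, so halving it is enough to promote.
	fmt.Println(boosted(0.50, 1.0), canPromote(0.50, top1, 1.0))

	// 0.70 is beyond 2x of 0.30, so even a full-score boost falls short.
	fmt.Println(boosted(0.70, 1.0), canPromote(0.70, top1, 1.0))
}
```

With a score below 1.0 the bound tightens further, since the multiplier (1 - 0.5 × score) moves from 0.5 back toward 1.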

Next moves

If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
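
The first two decision rules above can be written down as a small classifier. This is a sketch with hypothetical names (nextMove and the returned labels); the 50% and 20% thresholds are the ones stated above, and the band between them is labeled unspecified because the report does not say what to do there.

```go
package main

import "fmt"

// nextMove applies the lift-rate thresholds from "Next moves" to a count of
// warm-pass lifts over cold-pass discoveries.
func nextMove(lifts, discoveries int) string {
	if discoveries == 0 {
		return "undefined" // no discoveries, so no lift rate to evaluate
	}
	rate := float64(lifts) / float64(discoveries)
	switch {
	case rate >= 0.5:
		return "expand" // move on to paraphrase queries and tag-based boost
	case rate < 0.2:
		return "investigate" // judge variance, distance gap, or playbook math
	default:
		return "unspecified" // the report gives no rule for 20% to 50%
	}
}

func main() {
	// Run 005 headline: 5 warm-pass lifts out of 7 cold-pass discoveries.
	fmt.Println(nextMove(5, 7))
}
```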
