# Playbook-Lift Reality Test — Run 003

**Generated:** 2026-04-30T12:03:36.939020926Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_003.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1, recorded to playbook) | 6 |
| Warm-pass lifts (recorded playbook → top-1) | 2 |
| No lift (every query without a warm-pass lift, including those where judge-best was already cold top-1) | 19 |
| Playbook boosts triggered (warm pass) | 6 |
| Mean Δ top-1 distance (warm − cold) | -0.16369006 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **6 / 6** |
| Paraphrase pass — recorded answer at any rank in top-K | 6 / 6 |

**Verbatim lift rate:** 2 of 6 discoveries (33%) became top-1 after the warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-4079 | 3/3 | — | w-4435 | 6 | no |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-8354 | 2/4 | ✓ w-4435 | w-3004 | 1 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-943 | 0/2 | — | w-392 | 3 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3272 | 0/1 | — | w-4435 | 3 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-2759 | 0/2 | — | e-5778 | 3 | no |
| 6 | Forklift-certified loader, certification must be active, dis | e-3143 | 0/2 | — | w-3004 | 3 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | w-2844 | 8/4 | ✓ w-3004 | w-4435 | 2 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4749 | 0/4 | — | w-4260 | 3 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-153 | 6/4 | ✓ w-392 | w-392 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-4744 | 9/4 | ✓ w-4260 | w-3004 | 1 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1010 | 0/3 | — | w-3004 | 3 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | e-3302 | 2/2 | — | w-4435 | 4 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-4284 | 6/4 | ✓ e-5778 | e-5778 | 0 | **YES** |
| 14 | Highly responsive forklift operator available for last-minut | e-6762 | 1/2 | — | w-4435 | 4 | no |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 1/4 | ✓ w-2523 | w-3004 | 1 | no |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-3272 | 3/2 | — | w-4435 | 6 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-8449 | 0/1 | — | w-4435 | 1 | no |
| 18 | Production supervisor open to Midwest relocation for permane | e-9292 | 4/3 | — | w-4435 | 7 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | w-943 | 0/1 | — | w-392 | 3 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-2998 | 0/1 | — | w-4435 | 3 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-2897 | 0/1 | — | w-4435 | 2 | no |

---

## Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose cold pass (Pass 1) recorded a playbook entry, the
judge model rephrased the query, and the paraphrase was sent through the
warm `matrix.search`. The rank of the recorded answer ID in those results
tests whether cosine similarity on the embedded paraphrase lands close
enough to the recorded query's vector to pick up the playbook entry.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 2 | OSHA-30 certified forklift operator in W | Looking for a OSHA-30 trained forklift driver based in Wisco | w-4435 | w-4435 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-3004 | w-3004 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certification i | w-392 | w-392 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory c | Seeking a warehouse worker capable of conducting inventory c | w-4260 | w-4260 | 0 | **YES** |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent attend | e-5778 | e-5778 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong  | Warehouse associate currently engaged with a robust history  | w-2523 | w-2523 | 0 | **YES** |
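
The rank check described in this section can be sketched in Go as below. This is a minimal illustration, not the harness code: `recordedRank` and the sample result slice are hypothetical, and the real shape of the `matrix.search` results is not shown in this report.

```go
package main

import "fmt"

// recordedRank returns the 0-based rank of recordedID in results and true,
// or (-1, false) when the ID does not appear in the top-K at all.
// (Hypothetical helper; the harness may represent results differently.)
func recordedRank(results []string, recordedID string) (int, bool) {
	for i, id := range results {
		if id == recordedID {
			return i, true
		}
	}
	return -1, false
}

func main() {
	// Illustrative ordering only; the report lists top-1 IDs, not full top-K lists.
	topK := []string{"w-4435", "w-3004", "e-5778"}

	rank, found := recordedRank(topK, "w-4435")
	fmt.Println(rank, found) // 0 true

	rank, found = recordedRank(topK, "w-392")
	fmt.Println(rank, found) // -1 false
}
```

Returning an explicit `found` flag keeps a top-1 hit (rank 0) distinguishable from a miss when the value is later serialized.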

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. At score 1.0, lift requires the
   judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance;
   otherwise even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case: same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) tests the actual learning property, where a
   similar-but-different query hits a recorded playbook. Compare the two
   rates. The paraphrase rate is expected to be lower, because semantic
   distance gates some playbook hits, so any non-zero paraphrase rate is
   the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
   env `JUDGE_MODEL=qwen2.5:latest`.
   Bumping the judge for run #N+1 means editing one line in `lakehouse.toml`.
6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.
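
The boost arithmetic in caveat 2 can be checked with a short Go sketch. The formula is the one quoted above; the function names and sample distances are hypothetical, and the sketch assumes the cold top-1 itself receives no boost.

```go
package main

import "fmt"

// boostedDistance applies the playbook formula quoted in caveat 2:
// distance' = distance × (1 - 0.5 × score), with score expected in [0, 1].
func boostedDistance(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// canLift reports whether the boosted judge-best result would tie or beat
// the cold top-1, assuming the top-1 is not boosted (assumption of this sketch).
func canLift(judgeBestDist, coldTop1Dist, score float64) bool {
	return boostedDistance(judgeBestDist, score) <= coldTop1Dist
}

func main() {
	// Illustrative distances, not values from this run.
	fmt.Println(boostedDistance(0.5, 1.0)) // 0.25
	fmt.Println(canLift(0.5, 0.25, 1.0))   // true: exactly 2× the top-1 distance
	fmt.Println(canLift(0.75, 0.25, 1.0))  // false: 3× the top-1 distance
}
```

With a score below 1.0 the boost is weaker, so the judge-best result has to start closer than 2× the top-1 distance to be promoted.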

## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
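
The lift-rate thresholds above can be expressed as a small Go sketch. The 50% and 20% cut-offs come from the list; the function names and labels are hypothetical, and the band between 20% and 50% is marked as unspecified because the list gives no guidance for it.

```go
package main

import "fmt"

// liftRate returns lifts / discoveries, or 0 when there were no discoveries.
func liftRate(lifts, discoveries int) float64 {
	if discoveries == 0 {
		return 0
	}
	return float64(lifts) / float64(discoveries)
}

// classifyLift maps a lift rate onto the bands named in "Next moves".
// The labels are illustrative, not identifiers from the harness.
func classifyLift(rate float64) string {
	switch {
	case rate >= 0.5:
		return "doing-real-work"
	case rate < 0.2:
		return "investigate"
	default:
		return "unspecified"
	}
}

func main() {
	// Headline numbers from this run: 2 lifts out of 6 discoveries.
	rate := liftRate(2, 6)
	fmt.Printf("%.2f %s\n", rate, classifyLift(rate)) // 0.33 unspecified
}
```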