LAKEHOUSE/reports/reality-tests/playbook_lift_003.md
root 154a72ea5e matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.
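The injection step can be sketched as a standalone Go function. This is an illustrative sketch, not the code in internal/matrix — the `Result` and `PlaybookHit` types and the `boostFactor` constant are assumed stand-ins for the package's real definitions:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical stand-ins for the matrix package's real types.
type Result struct {
	ID       string
	Distance float64
}

type PlaybookHit struct {
	AnswerID string
	Distance float64 // distance between query embedding and recorded query
}

const boostFactor = 0.5 // assumed value; the real BoostFactor lives in internal/matrix

// injectPlaybookMisses appends a synthetic Result for every playbook hit
// whose answer is absent from the warm-pass results. Injected distance is
// hit.Distance * boostFactor so injections share the boost path's scale.
func injectPlaybookMisses(results []Result, hits []PlaybookHit) []Result {
	if len(hits) == 0 {
		return results // fast-path no-op
	}
	present := make(map[string]bool, len(results))
	for _, r := range results {
		present[r.ID] = true
	}
	for _, h := range hits {
		if present[h.AnswerID] {
			continue // already surfaced; the boost path handles it
		}
		present[h.AnswerID] = true // dedupe multi-hit same answer
		results = append(results, Result{ID: h.AnswerID, Distance: h.Distance * boostFactor})
	}
	return results
}

func main() {
	results := []Result{{"w-943", 0.30}, {"e-4079", 0.35}}
	hits := []PlaybookHit{{AnswerID: "w-4435", Distance: 0.40}}
	merged := injectPlaybookMisses(results, hits)
	// Caller re-sorts + truncates after both boost and inject have run.
	sort.Slice(merged, func(i, j int) bool { return merged[i].Distance < merged[j].Distance })
	fmt.Println(merged[0].ID) // prints "w-4435": 0.40 × 0.5 = 0.20 beats 0.30
}
```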

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery        6
  Verbatim lift             2 / 6
  **Paraphrase top-1**      **6 / 6**
  Paraphrase any-rank in K  6 / 6
  Mean Δ top-1 distance     -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge warm pass to measure true judge improvement.

Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:06:13 -05:00


Playbook-Lift Reality Test — Run 003

Generated: 2026-04-30T12:03:36.939020926Z
Judge: qwen2.5:latest (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
Corpora: workers,ethereal_workers
Workers limit: 5000
Queries: tests/reality/playbook_lift_queries.txt (21 executed)
K per pass: 10
Paraphrase pass: ENABLED
Evidence: reports/reality-tests/playbook_lift_003.json


Headline

| Metric | Value |
| --- | --- |
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 6 |
| Warm-pass lifts (recorded playbook → top-1) | 2 |
| No change (judge-best already top-1, no playbook needed) | 19 |
| Playbook boosts triggered (warm pass) | 6 |
| Mean Δ top-1 distance (warm − cold) | -0.16369006 |
| Paraphrase pass — recorded answer at rank 0 (top-1) | 6 / 6 |
| Paraphrase pass — recorded answer at any rank in top-K | 6 / 6 |

Verbatim lift rate: 2 of 6 discoveries became top-1 after warm pass.


Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded (answer) | Warm top-1 | Judge-best warm rank | Lift |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-4079 | 3/3 | | w-4435 | 6 | no |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-8354 | 2/4 | ✓ w-4435 | w-3004 | 1 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-943 | 0/2 | | w-392 | 3 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3272 | 0/1 | | w-4435 | 3 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-2759 | 0/2 | | e-5778 | 3 | no |
| 6 | Forklift-certified loader, certification must be active, dis | e-3143 | 0/2 | | w-3004 | 3 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | w-2844 | 8/4 | ✓ w-3004 | w-4435 | 2 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-4749 | 0/4 | | w-4260 | 3 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-153 | 6/4 | ✓ w-392 | w-392 | 0 | YES |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-4744 | 9/4 | ✓ w-4260 | w-3004 | 1 | no |
| 11 | Production line worker comfortable filling in as line superv | w-1010 | 0/3 | | w-3004 | 3 | no |
| 12 | Customer service rep willing to cross-train into dispatch or | e-3302 | 2/2 | | w-4435 | 4 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-4284 | 6/4 | ✓ e-5778 | e-5778 | 0 | YES |
| 14 | Highly responsive forklift operator available for last-minut | e-6762 | 1/2 | | w-4435 | 4 | no |
| 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 1/4 | ✓ w-2523 | w-3004 | 1 | no |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-3272 | 3/2 | | w-4435 | 6 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-8449 | 0/1 | | w-4435 | 1 | no |
| 18 | Production supervisor open to Midwest relocation for permane | e-9292 | 4/3 | | w-4435 | 7 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | w-943 | 0/1 | | w-392 | 3 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-2998 | 0/1 | | w-4435 | 3 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-2897 | 0/1 | | w-4435 | 2 | no |

Paraphrase pass — does the playbook help similar-but-different queries?

For each query whose Pass 1 cold pass recorded a playbook entry, the judge model rephrased the query, and the rephrased version was sent through warm matrix.search. The recorded answer ID's rank in those results tests whether cosine on the embedded paraphrase finds the recorded query's vector.

| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | OSHA-30 certified forklift operator in W | Looking for a OSHA-30 trained forklift driver based in Wisco | w-4435 | w-4435 | null | YES |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-3004 | w-3004 | null | YES |
| 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certification i | w-392 | w-392 | null | YES |
| 10 | Warehouse worker who can run inventory c | Seeking a warehouse worker capable of conducting inventory c | w-4260 | w-4260 | null | YES |
| 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent attend | e-5778 | e-5778 | null | YES |
| 15 | Engaged warehouse associate with strong | Warehouse associate currently engaged with a robust history | w-2523 | w-2523 | null | YES |

Honesty caveats

  1. Judge IS the ground truth proxy. Without human-labeled relevance, the LLM judge's verdict is what defines "best." If qwen2.5:latest rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
  2. Score-1.0 boost = distance halved. Playbook math is distance' = distance × (1 - 0.5 × score). Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
  3. Verbatim vs paraphrase. The verbatim lift rate (above) is the cheap case — same query, recorded playbook, expected boost. The paraphrase pass (when enabled) is the actual learning property: similar-but-different queries hitting a recorded playbook. Compare verbatim and paraphrase lift rates — paraphrase should be lower (semantic-distance gates some playbook hits) but non-zero is the meaningful signal.
  4. Multi-corpus skew. Default corpora=workers,ethereal_workers — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
  5. Judge resolution. This run used qwen2.5:latest from env JUDGE_MODEL=qwen2.5:latest. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
  6. Paraphrase generation also uses the judge. The same model that rates relevance also rephrases queries. A judge that's bad at rating staffing queries is probably also bad at rephrasing them. Worth sanity-checking a sample of paraphrase_query values in the JSON before trusting the paraphrase lift number.
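Caveat 2's arithmetic can be checked numerically. A minimal sketch — the distances below are illustrative, not taken from this run:

```go
package main

import "fmt"

// boosted applies the playbook formula from caveat 2:
// distance' = distance × (1 − 0.5 × score).
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := 0.30
	judgeBest := 0.55 // pre-boost distance of the judge-best result

	// At score 1.0 the distance is halved, so promotion requires the
	// judge-best pre-boost distance to be under 2× the cold top-1's.
	fmt.Println(boosted(judgeBest, 1.0) < coldTop1) // 0.275 < 0.30 → true

	judgeBest = 0.65 // past the 2× gap: even halving can't promote it
	fmt.Println(boosted(judgeBest, 1.0) < coldTop1) // 0.325 < 0.30 → false
}
```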

Next moves

  • If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
  • If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
  • If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.