golangLAKEHOUSE/reports/reality-tests/playbook_lift_real_003.md
root 3263254f1c reality_test real_003: 40-query paraphrase stress + extractor extension
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.

real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.

Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.
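
The pattern split described above can be sketched as follows. This is a minimal illustration, not the repository's `extractRoleFromNeed`: the function name `extractRoleSketch`, the `rolePatterns` table, and the exact regexes are assumptions, and some sample queries are truncated exactly as they appear in the report table.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rolePatterns holds one illustrative regex per phrasing style that has a
// separator word after the role. These are assumptions, not the real code.
var rolePatterns = []*regexp.Regexp{
	// need: "Need 5 Warehouse Associates in Kansas City MO ..."
	regexp.MustCompile(`(?i)^need\s+\d+\s+(.+?)\s+in\s`),
	// looking: "Looking for 4 Loaders at Midway Distribution in ..."
	regexp.MustCompile(`(?i)^looking\s+for\s+\d+\s+(.+?)\s+at\s`),
	// client_first: "Beacon Freight needs 4 Pickers in Detroit MI ..."
	regexp.MustCompile(`(?i)^.+?\s+needs\s+\d+\s+(.+?)\s+in\s`),
}

// extractRoleSketch returns the role phrase, or "" when no pattern matches.
// Shorthand has no separator between role and city, so it falls through.
func extractRoleSketch(query string) string {
	query = strings.TrimSpace(query)
	for _, re := range rolePatterns {
		if m := re.FindStringSubmatch(query); m != nil {
			return m[1]
		}
	}
	return ""
}

func main() {
	queries := []string{
		"Need 1 Forklift Operator in Detroit MI starting at 15:00 for",
		"Beacon Freight needs 4 Pickers in Detroit MI at 13:30",
		"Looking for 1 Forklift Operator at Beacon Freight in Detroit",
		"1 Forklift Operator Detroit MI 15:00 Beacon Freight", // shorthand
	}
	for _, q := range queries {
		fmt.Printf("%q -> %q\n", q, extractRoleSketch(q))
	}
}
```

The shorthand query returns an empty role here, which is the case that leaves the role gate disabled.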

real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.

scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.

scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).

Aggregate metrics:
- real_003  (original): disc=7,  lift=7,  boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).

Known limitation documented in real_003_findings.md: same-cluster,
same-role queries in shorthand still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.
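
One way the known-cities option could work is sketched below. Nothing here exists in the repository: `knownCities`, `roleFromShorthand`, and the assumed shorthand shape `N {role} {city} ...` are all hypothetical, and the city list is just the six cities that appear in the report table.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knownCities is a hypothetical lookup table. A real one would presumably be
// built from the fill_events data rather than hard-coded.
var knownCities = []string{
	"Kansas City", "Detroit", "Indianapolis", "Fort Wayne", "Joliet", "Flint",
}

// roleFromShorthand assumes the shape "N {role} {city} ..." and cuts the role
// at the earliest known city. It returns "" when the shape does not fit.
func roleFromShorthand(query string) string {
	fields := strings.Fields(query)
	if len(fields) < 3 {
		return ""
	}
	if _, err := strconv.Atoi(fields[0]); err != nil {
		return "" // shorthand is assumed to start with a head count
	}
	// The trailing space lets a city at the very end of the query match too.
	rest := strings.Join(fields[1:], " ") + " "
	cut := -1
	for _, city := range knownCities {
		if i := strings.Index(rest, " "+city+" "); i > 0 && (cut == -1 || i < cut) {
			cut = i
		}
	}
	if cut == -1 {
		return ""
	}
	return rest[:cut]
}

func main() {
	for _, q := range []string{
		"1 Forklift Operator Detroit MI 15:00 Beacon Freight",
		"1 CNC Operator Detroit MI 17:30 Beacon Freight",
		"4 Loaders Indianapolis IN 12:00 Midway Distribution",
	} {
		fmt.Printf("%q -> %q\n", q, roleFromShorthand(q))
	}
}
```

With distinct roles recovered on both the record and the query side, the two Detroit shorthand queries would no longer both reach the gate with an empty role. The match is case-sensitive and would misfire on a role or client name that contains a city name, so it is a starting point only.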

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:42:02 -05:00


# Playbook-Lift Reality Test — Run real_003
**Generated:** 2026-05-01T02:27:31.394679694Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/real_coord_queries_v2.txt` (40 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_003.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 40 |
| Cold-pass discoveries (judge-best ≠ top-1) | 7 |
| Warm-pass lifts (recorded playbook → top-1) | 7 |
| No change (judge-best already top-1, no playbook needed) | 33 |
| Playbook boosts triggered (warm pass) | 14 |
| Mean Δ top-1 distance (warm − cold) | -0.10771029 |

**Verbatim lift rate:** 7 of 7 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | e-9573 | 0/4 | — | e-9573 | 0 | no |
| 2 | Parallel Machining needs 5 Warehouse Associates in Kansas Ci | e-7538 | 0/4 | — | e-7538 | 0 | no |
| 3 | Looking for 5 Warehouse Associates at Parallel Machining in | e-9573 | 0/4 | — | e-9573 | 0 | no |
| 4 | 5 Warehouse Associates Kansas City MO 09:00 Parallel Machini | e-9573 | 0/4 | — | e-9573 | 0 | no |
| 5 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | w-2136 | 0/5 | — | w-2136 | 0 | no |
| 6 | Beacon Freight needs 1 Forklift Operator in Detroit MI at 15 | w-2136 | 0/5 | — | w-2136 | 0 | no |
| 7 | Looking for 1 Forklift Operator at Beacon Freight in Detroit | w-2136 | 0/5 | — | w-2136 | 0 | no |
| 8 | 1 Forklift Operator Detroit MI 15:00 Beacon Freight | w-4766 | 0/5 | — | w-2404 | 1 | no |
| 9 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-2820 | 4/4 | ✓ e-4769 | e-4769 | 0 | **YES** |
| 10 | Midway Distribution needs 4 Loaders in Indianapolis IN at 12 | e-6419 | 4/4 | ✓ e-4769 | e-4769 | 0 | **YES** |
| 11 | Looking for 4 Loaders at Midway Distribution in Indianapolis | e-2820 | 1/2 | — | e-4769 | 2 | no |
| 12 | 4 Loaders Indianapolis IN 12:00 Midway Distribution | e-2820 | 6/5 | ✓ e-4769 | e-4769 | 0 | **YES** |
| 13 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | e-9237 | 0/4 | — | w-565 | 1 | no |
| 14 | Cornerstone Fabrication needs 3 Warehouse Associates in Fort | e-9237 | 4/4 | ✓ w-565 | w-565 | 0 | **YES** |
| 15 | Looking for 3 Warehouse Associates at Cornerstone Fabricatio | e-9237 | 3/4 | ✓ w-565 | w-565 | 0 | **YES** |
| 16 | 3 Warehouse Associates Fort Wayne IN 17:30 Cornerstone Fabri | e-9237 | 0/4 | — | w-565 | 1 | no |
| 17 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | w-2136 | 0/2 | — | w-2136 | 0 | no |
| 18 | Beacon Freight needs 4 Pickers in Detroit MI at 13:30 | w-2136 | 0/2 | — | w-2136 | 0 | no |
| 19 | Looking for 4 Pickers at Beacon Freight in Detroit MI for 13 | e-7948 | 0/1 | — | e-7948 | 0 | no |
| 20 | 4 Pickers Detroit MI 13:30 Beacon Freight | e-7948 | 3/2 | — | e-7948 | 3 | no |
| 21 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-9191 | 0/2 | — | e-9191 | 0 | no |
| 22 | Parallel Machining needs 2 Packers in Joliet IL at 09:30 | e-9191 | 7/3 | — | e-9191 | 7 | no |
| 23 | Looking for 2 Packers at Parallel Machining in Joliet IL for | e-9191 | 0/2 | — | e-9191 | 0 | no |
| 24 | 2 Packers Joliet IL 09:30 Parallel Machining | e-9191 | 6/3 | — | e-9191 | 6 | no |
| 25 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-2582 | 4/3 | — | w-2582 | 4 | no |
| 26 | Heritage Foods needs 3 Assemblers in Flint MI at 08:30 | w-2582 | 0/2 | — | w-2582 | 0 | no |
| 27 | Looking for 3 Assemblers at Heritage Foods in Flint MI for 0 | w-4817 | 0/2 | — | w-4817 | 0 | no |
| 28 | 3 Assemblers Flint MI 08:30 Heritage Foods | w-4124 | 2/2 | — | w-4124 | 2 | no |
| 29 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-6019 | 0/1 | — | e-6019 | 0 | no |
| 30 | Parallel Machining needs 3 Packers in Flint MI at 12:30 | e-6019 | 4/2 | — | e-6019 | 4 | no |
| 31 | Looking for 3 Packers at Parallel Machining in Flint MI for | e-6019 | 0/1 | — | e-6019 | 0 | no |
| 32 | 3 Packers Flint MI 12:30 Parallel Machining | e-6019 | 0/2 | — | e-6019 | 0 | no |
| 33 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 1/3 | — | w-122 | 2 | no |
| 34 | Pioneer Assembly needs 1 Shipping Clerk in Flint MI at 17:00 | w-3988 | 1/3 | — | w-122 | 2 | no |
| 35 | Looking for 1 Shipping Clerk at Pioneer Assembly in Flint MI | w-3988 | 2/3 | — | w-122 | 0 | no |
| 36 | 1 Shipping Clerk Flint MI 17:00 Pioneer Assembly | w-2564 | 2/4 | ✓ w-122 | w-122 | 0 | **YES** |
| 37 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | w-2136 | 6/3 | — | w-2404 | 7 | no |
| 38 | Beacon Freight needs 1 CNC Operator in Detroit MI at 17:30 | w-2404 | 0/5 | — | w-2404 | 0 | no |
| 39 | Looking for 1 CNC Operator at Beacon Freight in Detroit MI f | e-9958 | 1/2 | — | w-2404 | 2 | no |
| 40 | 1 CNC Operator Detroit MI 17:30 Beacon Freight | e-5546 | 2/5 | ✓ w-2404 | w-2404 | 0 | **YES** |
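
The headline counts can be cross-checked against the rows above with a short tally. This sketch transcribes four columns per query (cold judge-best rank, rating, recorded flag, warm judge-best rank). It assumes a "discovery" is a row with a recorded playbook entry, because 19 rows have a non-zero cold rank but only 7 are recorded. In this table the recorded rows are exactly those with cold rank > 0 and rating ≥ 4; that threshold is an inference from the data, not a documented rule. The names `row` and `tally` are illustrative.

```go
package main

import "fmt"

// row holds the four columns needed from the per-query table.
type row struct {
	coldRank int  // cold-pass rank of the judge-best result (0 = top-1)
	rating   int  // judge rating of that result
	recorded bool // whether a playbook entry was recorded (the ✓ column)
	warmRank int  // warm-pass rank of the judge-best result
}

// rows transcribes queries 1-40 in order, four phrasing styles per line.
var rows = []row{
	{0, 4, false, 0}, {0, 4, false, 0}, {0, 4, false, 0}, {0, 4, false, 0}, // 1-4
	{0, 5, false, 0}, {0, 5, false, 0}, {0, 5, false, 0}, {0, 5, false, 1}, // 5-8
	{4, 4, true, 0}, {4, 4, true, 0}, {1, 2, false, 2}, {6, 5, true, 0}, // 9-12
	{0, 4, false, 1}, {4, 4, true, 0}, {3, 4, true, 0}, {0, 4, false, 1}, // 13-16
	{0, 2, false, 0}, {0, 2, false, 0}, {0, 1, false, 0}, {3, 2, false, 3}, // 17-20
	{0, 2, false, 0}, {7, 3, false, 7}, {0, 2, false, 0}, {6, 3, false, 6}, // 21-24
	{4, 3, false, 4}, {0, 2, false, 0}, {0, 2, false, 0}, {2, 2, false, 2}, // 25-28
	{0, 1, false, 0}, {4, 2, false, 4}, {0, 1, false, 0}, {0, 2, false, 0}, // 29-32
	{1, 3, false, 2}, {1, 3, false, 2}, {2, 3, false, 0}, {2, 4, true, 0}, // 33-36
	{6, 3, false, 7}, {0, 5, false, 0}, {1, 2, false, 2}, {2, 5, true, 0}, // 37-40
}

// tally counts recorded rows (discoveries), recorded rows whose judge-best
// result reached top-1 in the warm pass (lifts), and rows matching the
// inferred rule "cold rank > 0 and rating >= 4".
func tally(rs []row) (discoveries, lifts, inferred int) {
	for _, r := range rs {
		if r.recorded {
			discoveries++
			if r.warmRank == 0 {
				lifts++
			}
		}
		if r.coldRank > 0 && r.rating >= 4 {
			inferred++
		}
	}
	return discoveries, lifts, inferred
}

func main() {
	d, l, inf := tally(rows)
	fmt.Printf("discoveries=%d lifts=%d no-change=%d inferred-rule=%d\n",
		d, l, len(rows)-d, inf)
}
```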
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
the lift number is meaningless. To validate the judge itself, sample 5–10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
config [models].local_judge.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
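
The boost formula in caveat 2 can be worked through numerically. The sketch below uses hypothetical distances (0.30, 0.50, 0.70) and assumes the cold top-1 result receives no boost of its own; `boostedDistance` and `liftsOver` are illustrative names, not functions from the codebase.

```go
package main

import "fmt"

// boostedDistance applies the playbook math quoted in caveat 2:
// distance' = distance × (1 - 0.5 × score).
func boostedDistance(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// liftsOver reports whether the boosted judge-best result would rank ahead of
// an unboosted top-1 result (smaller distance ranks first).
func liftsOver(judgeBestDist, top1Dist, score float64) bool {
	return boostedDistance(judgeBestDist, score) < top1Dist
}

func main() {
	const top1 = 0.30 // hypothetical cold top-1 distance
	for _, d := range []float64{0.50, 0.70} {
		fmt.Printf("judge-best %.2f -> boosted %.2f, lifts over %.2f: %v\n",
			d, boostedDistance(d, 1.0), top1, liftsOver(d, top1, 1.0))
	}
}
```

At score 1.0 a judge-best distance of 0.50 halves to 0.25 and overtakes a top-1 at 0.30, while 0.70 halves to 0.35 and does not. That is the 2× bound in practice.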
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why: judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.
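
The lift-rate thresholds above amount to a small decision rule, sketched here. The function `nextMove` and its return strings are illustrative. The report gives no guidance for rates between 20% and 50%, or for the discovery-rate bullet, so this sketch returns a placeholder for the first and leaves the second out.

```go
package main

import "fmt"

// nextMove maps the verbatim lift rate (lifts / discoveries) onto the
// follow-ups listed under "Next moves". The 20-50% band is not covered by
// the report, so that branch is a placeholder.
func nextMove(discoveries, lifts int) string {
	if discoveries == 0 {
		return "lift rate undefined: no discoveries"
	}
	rate := float64(lifts) / float64(discoveries)
	switch {
	case rate >= 0.5:
		return "move to paraphrase queries + tag-based boost"
	case rate < 0.2:
		return "investigate judge variance, distance gap, playbook math"
	default:
		return "no guidance in the report for this band"
	}
}

func main() {
	// Headline numbers from this run: 7 lifts out of 7 discoveries.
	fmt.Println(nextMove(7, 7))
}
```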