# Reality test real_003 / real_003b: paraphrase stress + extractor extension

40 queries (10 fill_events rows × 4 query styles), run twice:

- **real_003**: with the original `extractRoleFromNeed` regex (only matches `^Need\s+\d+\s+\S+\s+in\s+`, the real_001 form)
- **real_003b**: with the extractor extended to also handle `client_first` (`{client} needs N {role} in...`) and `looking` (`Looking for N {role} at...`). Shorthand still falls through to empty role.

## Headline

real_003 (original extractor): **shorthand-vs-shorthand bleed confirmed**. The CNC Operator shorthand recording leaked `w-2404` onto the Forklift Operator shorthand query within the same Beacon Freight Detroit cluster: record and query both had an empty role, so the role gate was disabled, the distance check passed, and the bleed fired.

real_003b (extended extractor): **bleed closed for the queried side across all 4 styles**. Forklift Operator queries keep `w-2136` (the cold-pass-correct match) regardless of which style the query came in. Same-role boosts now fire correctly across styles: a CNC Operator recording made in `looking` style boosts the CNC need-form query.

## Per-style behavior: real_003 (original extractor)

| Style | Bleed? | Explanation |
|---|---|---|
| `need` | none observed | Role extracted on both record + query side |
| `client_first` | none observed | Role NOT extracted, but no recording in this style happened to surface near a different-role query in real_003 |
| `looking` | none observed | Same as client_first |
| `shorthand` | **`w-2404` from CNC bled onto Forklift Operator** | Both record and query empty role → gate disabled → distance check passed |

The single observed bleed case in real_003:

- Q#40 (`1 CNC Operator Detroit MI 17:30 Beacon Freight`) recorded `w-2404` with `Role: ""` (extractor returned empty for shorthand).
- Q#4 (`1 Forklift Operator Detroit MI 15:00 Beacon Freight`) embedded within ~0.137 cosine distance of Q#40 (same client + city + count + time tokens dominate the embedding).
- `roleEqual("", "")` returned true (empty disables) → injection fired → warm top-1 for Forklift Operator became `w-2404`.

## Per-style behavior: real_003b (extended extractor)

After adding patterns for `client_first` and `looking`:

| Style | Role extracted? | Cross-role bleed observed? |
|---|---|---|
| `need` | yes | none |
| `client_first` | yes | none |
| `looking` | yes | none |
| `shorthand` | no | none in this dataset |

No bleed observed in real_003b across any style. Pickers and CNC Operator queries pick up their own role's recording across styles; Forklift Operator queries keep the cold-pass-correct match.

## Why the shorthand failure mode didn't fire in real_003b

The extended extractor only partly closes the bleed at the **query** side. For `need`, `client_first`, and `looking`, the queryRole is now non-empty, so a recording that also carries a role is gated on a real comparison. A recording with an empty role still gets `roleEqual(role, "")` = true, because the lenient semantic applies whenever either side is empty. The extractor alone therefore does not explain the missing bleed. Two reasons real_003b's data didn't trigger one:

1. **The Pickers shorthand recording** has `Role: ""`. Forklift Operator queries embed > 0.20 cosine distance from "4 Pickers Detroit MI ..." because "Pickers" vs "Forklift Operator" provides enough semantic separation even within the same client+city cluster. The distance gate caught what the role gate let through.
2. **No Forklift Operator recording** existed (judge said cold top-1 was already correct, rating 5, no playbook needed). The most-likely-to-bleed scenario, a Forklift recording in shorthand leaking onto Pickers/CNC, didn't have ammunition.

If a future dataset has multiple roles per cluster all hitting shorthand recordings, the bleed could return. **Mitigation candidates** (none implemented):

- **LLM-based role extraction** at record + query time. Robust, slow.
- **Known-cities lookup table** to detect city boundary in shorthand (`{role} {city}` separator). 50 US states + ~5000 cities = small static table. Fast, brittle on new cities.
- **Strict gate semantics**: empty role on either side = REJECT (instead of allow). Closes shorthand-vs-shorthand bleeds completely but breaks lift-suite multi-constraint queries that have no clean single role.

## Aggregate metrics

| Run | n | discoveries | lifts | boost_total | meanΔ |
|---|---:|---:|---:|---:|---:|
| real_003 (original extractor) | 40 | 7 | 7 | 14 | -0.108 |
| real_003b (extended extractor) | 40 | 11 | 10 | 31 | -0.202 |

real_003b's higher discoveries + boost_total reflect the extractor catching role on 3 of 4 styles instead of 1 of 4, which means more recordings land with a usable Role and more queries find them on warm pass. The growth is *legitimate same-role same-cluster transfer*, not bleed.

`meanΔ` direction is style-dependent: real_002 shrank `meanΔ` (cross-role bleeds removed → less over-boosting); real_003b grew it in magnitude relative to real_003 (more legitimate deep boosts fire). The metric isn't a clean fingerprint in either direction; read the per-cluster bleed table for actual signal.

## Repro

```bash
# Generate 40-query stress file (10 rows × 4 styles)
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt

# Run with extended extractor (current main)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_003b \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
```

Evidence:

- `reports/reality-tests/playbook_lift_real_003.{json,md}` (original extractor)
- `reports/reality-tests/playbook_lift_real_003b.{json,md}` (extended)
- `tests/reality/real_coord_queries_v2.txt` (40 queries: 10 rows × 4 styles)
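
## Sketch: extractor patterns and gate semantics

The interaction between role extraction and the role gate described in this report can be illustrated with a small Go sketch. It is not the repository's code: the three regexes, `extractRole`, `roleEqualLenient`, `roleEqualStrict`, `shouldInject`, and the `0.20` distance threshold are all hypothetical names and values chosen for illustration. The patterns capture multi-word roles with a lazy group, which differs from the `\S+` quoted for the original extractor.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical patterns for three of the four query styles.
// Shorthand deliberately has no pattern, matching the report.
var rolePatterns = []*regexp.Regexp{
	regexp.MustCompile(`^Need\s+\d+\s+(.+?)\s+in\s+`),          // need
	regexp.MustCompile(`^.+?\s+needs\s+\d+\s+(.+?)\s+in\s+`),   // client_first
	regexp.MustCompile(`^Looking\s+for\s+\d+\s+(.+?)\s+at\s+`), // looking
}

// extractRole returns the lowercased role, or "" when no pattern matches.
func extractRole(query string) string {
	for _, re := range rolePatterns {
		if m := re.FindStringSubmatch(query); m != nil {
			return strings.ToLower(strings.TrimSpace(m[1]))
		}
	}
	return ""
}

// roleGate decides whether a recording's role is compatible with a query's.
type roleGate func(recRole, queryRole string) bool

// roleEqualLenient: an empty role on either side disables the gate.
func roleEqualLenient(recRole, queryRole string) bool {
	if recRole == "" || queryRole == "" {
		return true
	}
	return strings.EqualFold(recRole, queryRole)
}

// roleEqualStrict: an empty role on either side rejects (mitigation candidate).
func roleEqualStrict(recRole, queryRole string) bool {
	if recRole == "" || queryRole == "" {
		return false
	}
	return strings.EqualFold(recRole, queryRole)
}

// maxCosineDist is an assumed threshold, not a value taken from the codebase.
const maxCosineDist = 0.20

// shouldInject requires both the distance check and the role gate to pass.
func shouldInject(recRole, queryRole string, dist float64, gate roleGate) bool {
	return dist <= maxCosineDist && gate(recRole, queryRole)
}

func main() {
	rec := extractRole("1 CNC Operator Detroit MI 17:30 Beacon Freight")    // shorthand recording
	qry := extractRole("1 Forklift Operator Detroit MI 15:00 Beacon Freight") // shorthand query
	fmt.Printf("recRole=%q queryRole=%q\n", rec, qry)
	fmt.Println("lenient injects:", shouldInject(rec, qry, 0.137, roleEqualLenient))
	fmt.Println("strict injects: ", shouldInject(rec, qry, 0.137, roleEqualStrict))

	// With roles extracted on both sides, the lenient gate compares them.
	rec2 := extractRole("Looking for 1 CNC Operator at Beacon Freight")
	qry2 := extractRole("Need 1 Forklift Operator in Detroit MI")
	fmt.Println("cross-role injects:", shouldInject(rec2, qry2, 0.137, roleEqualLenient))
}
```

Under these assumptions, the lenient gate lets the shorthand pair through at distance 0.137 while the strict gate rejects it, and the lenient gate only blocks a cross-role match once both sides carry an extracted role. The strict variant also rejects any query without a clean single role, which is the trade-off noted under the mitigation candidates.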