
# Reality test real_003 / real_003b — paraphrase stress + extractor extension

40 queries (10 fill_events rows × 4 query styles), run twice:
- **real_003**: with the original `extractRoleFromNeed` regex (only
  matches `^Need\s+\d+\s+\S+\s+in\s+` — the real_001 form)
- **real_003b**: with the extractor extended to also handle
  `client_first` (`{client} needs N {role} in...`) and `looking`
  (`Looking for N {role} at...`). Shorthand still falls through to
  empty role.
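
The three handled styles can be sketched as one pattern per style. This is a minimal illustration, not the repository's `extractRoleFromNeed`: the function name `extractRoleSketch`, the pattern list, and the lazy multi-word capture group are all assumptions of the sketch, and the example query strings are made up to match the templates above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rolePatterns is a hypothetical pattern set, one entry per query style.
// Each pattern captures the role in group 1. The lazy `(.+?)` capture is an
// assumption of this sketch so that two-word roles fit.
var rolePatterns = []*regexp.Regexp{
	// need: "Need N {role} in ..."
	regexp.MustCompile(`^Need\s+\d+\s+(.+?)\s+in\s+`),
	// client_first: "{client} needs N {role} in ..."
	regexp.MustCompile(`^.+?\s+needs\s+\d+\s+(.+?)\s+in\s+`),
	// looking: "Looking for N {role} at ..."
	regexp.MustCompile(`^Looking\s+for\s+\d+\s+(.+?)\s+at\s+`),
}

// extractRoleSketch returns the role for the three handled styles and ""
// for anything else, including shorthand.
func extractRoleSketch(query string) string {
	q := strings.TrimSpace(query)
	for _, re := range rolePatterns {
		if m := re.FindStringSubmatch(q); m != nil {
			return strings.TrimSpace(m[1])
		}
	}
	return ""
}

func main() {
	queries := []string{
		"Need 1 Forklift Operator in Detroit MI",
		"Beacon Freight needs 1 CNC Operator in Detroit MI",
		"Looking for 4 Pickers at Beacon Freight",
		"1 CNC Operator Detroit MI 17:30 Beacon Freight", // shorthand
	}
	for _, q := range queries {
		fmt.Printf("%q -> %q\n", q, extractRoleSketch(q))
	}
}
```

Shorthand has no keyword separating role from city, which is why no pattern in this sketch can claim it and the role stays empty.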

## Headline

real_003 (original extractor): **shorthand-vs-shorthand bleed
confirmed**. The CNC Operator shorthand recording leaked `w-2404`
onto the Forklift Operator shorthand query within the same Beacon
Freight Detroit cluster — both record + query had empty role, gate
disabled, distance check passed, bleed fired.

real_003b (extended extractor): **bleed closed for the queried side
across all 4 styles**. Forklift Operator queries keep `w-2136` (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.

## Per-style behavior — real_003 (original extractor)

| Style | Bleed? | Explanation |
|---|---|---|
| `need` | none observed | Role extracted on both record + query side |
| `client_first` | none observed | Role NOT extracted, but no recording in this style happened to surface near a different-role query in real_003 |
| `looking` | none observed | Same as client_first |
| `shorthand` | **`w-2404` from CNC bled onto Forklift Operator** | Both record and query empty role → gate disabled → distance check passed |

The single observed bleed case in real_003:
- Q#40 (`1 CNC Operator Detroit MI 17:30 Beacon Freight`) recorded
  `w-2404` with `Role: ""` (extractor returned empty for shorthand).
- Q#4 (`1 Forklift Operator Detroit MI 15:00 Beacon Freight`) embedded
  at a cosine distance of ~0.137 from Q#40 (same client + city + count +
  time tokens dominate the embedding).
- `roleEqual("", "")` returned true (empty disables) → injection
  fired → warm top-1 for Forklift Operator became `w-2404`.
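
The chain above can be reproduced with a small sketch of the two gates. Only the lenient empty-role behavior and the ~0.137 distance come from this report; the names `roleEqualLenient` and `shouldInject`, the case-insensitive comparison, and the `0.20` threshold are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// roleEqualLenient mirrors the lenient semantic described above: an empty
// role on either side disables the gate. The case-insensitive comparison
// is an assumption of this sketch.
func roleEqualLenient(a, b string) bool {
	if a == "" || b == "" {
		return true
	}
	return strings.EqualFold(a, b)
}

// shouldInject is a hypothetical combination of the role gate and the
// distance check: a recording is injected only if both pass.
func shouldInject(recRole, queryRole string, dist, maxDist float64) bool {
	return roleEqualLenient(recRole, queryRole) && dist <= maxDist
}

func main() {
	const maxDist = 0.20 // assumed threshold, not taken from the codebase

	// Shorthand vs shorthand: both roles empty, so only distance decides.
	fmt.Println(shouldInject("", "", 0.137, maxDist)) // true

	// Same distance with roles extracted on both sides: the gate rejects.
	fmt.Println(shouldInject("CNC Operator", "Forklift Operator", 0.137, maxDist)) // false
}
```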

## Per-style behavior — real_003b (extended extractor)

After adding patterns for `client_first` and `looking`:

| Style | Role extracted? | Cross-role bleed observed? |
|---|---|---|
| `need` | yes | none |
| `client_first` | yes | none |
| `looking` | yes | none |
| `shorthand` | no | none in this dataset |

No bleed observed in real_003b across any style. Pickers + CNC
Operator queries pick up their own role's recording across styles;
Forklift Operator queries keep the cold-pass-correct match.

## Why the shorthand failure mode didn't fire in real_003b

The extended extractor works on the **query** side: for `need`,
`client_first`, and `looking`, the queryRole is now non-empty, so it can
gate any recording that also carries a role. It does not gate shorthand
recordings: `roleEqual` returns true whenever either side is empty, so a
recording with `Role: ""` still passes against any queryRole. The role
gate alone therefore does not explain the absence of bleed.

Two reasons real_003b's data didn't trigger one:

1. **The Pickers shorthand recording** has `Role: ""`. Forklift Operator
   queries embed at a cosine distance > 0.20 from "4 Pickers Detroit MI ..."
   because "Pickers" vs "Forklift Operator" provides enough semantic
   separation even within the same client+city cluster. The distance gate
   caught what the role gate let through.
2. **No Forklift Operator recording** existed (judge said cold top-1
   was already correct, rating 5, no playbook needed). The
   most-likely-to-bleed scenario — a Forklift recording in shorthand
   leaking onto Pickers/CNC — didn't have ammunition.

If a future dataset has multiple roles per cluster all hitting shorthand
recordings, the bleed could return. **Mitigation candidates** (none
implemented):

- **LLM-based role extraction** at record + query time. Robust, slow.
- **Known-cities lookup table** to detect city boundary in shorthand
  (`{role} {city}` separator). 50 US states + ~5000 cities = small
  static table. Fast, brittle on new cities.
- **Strict gate semantics**: empty role on either side = REJECT
  (instead of allow). Closes shorthand-vs-shorthand bleeds completely
  but breaks lift-suite multi-constraint queries that have no clean
  single role.
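
Two of these candidates are small enough to sketch. The code below is illustrative only and not implemented in the codebase: `knownCities`, `extractShorthandRole`, `roleEqualStrict`, and the `{count} {role} {city} ...` token layout are assumptions, and the two-entry city table stands in for the full static table.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knownCities is a tiny stand-in for the static lookup table (lowercased).
var knownCities = map[string]bool{
	"detroit":      true,
	"grand rapids": true,
}

// maxCityTokens bounds how many tokens a city name may span in this sketch.
const maxCityTokens = 3

// extractShorthandRole assumes the layout "{count} {role} {city} ..." and
// returns the tokens between the count and the first known city. It returns
// "" when the count is missing or no known city is found.
func extractShorthandRole(query string) string {
	toks := strings.Fields(query)
	if len(toks) < 3 {
		return ""
	}
	if _, err := strconv.Atoi(toks[0]); err != nil {
		return ""
	}
	// The role needs at least one token, so the city starts at index 2 or later.
	for i := 2; i < len(toks); i++ {
		// Try the longest candidate first so multi-word cities win.
		for n := maxCityTokens; n >= 1; n-- {
			if i+n > len(toks) {
				continue
			}
			cand := strings.ToLower(strings.Join(toks[i:i+n], " "))
			if knownCities[cand] {
				return strings.Join(toks[1:i], " ")
			}
		}
	}
	return ""
}

// roleEqualStrict sketches the strict gate: an empty role on either side
// rejects instead of allowing.
func roleEqualStrict(a, b string) bool {
	if a == "" || b == "" {
		return false
	}
	return strings.EqualFold(a, b)
}

func main() {
	fmt.Printf("%q\n", extractShorthandRole("1 CNC Operator Detroit MI 17:30 Beacon Freight"))
	fmt.Println(roleEqualStrict("", ""))
}
```

A city missing from the table yields an empty role again, which is the brittleness noted in the list; the strict gate would then reject that recording outright rather than let it through.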

## Aggregate metrics

| Run | n | discoveries | lifts | boost_total | meanΔ |
|---|---:|---:|---:|---:|---:|
| real_003  (original extractor) | 40 | 7  | 7  | 14 | -0.108 |
| real_003b (extended extractor) | 40 | 11 | 10 | 31 | -0.202 |

real_003b's higher discoveries + boost_total reflect the extractor
catching role on 3 of 4 styles instead of 1 of 4 — which means more
recordings land with a usable Role and more queries find them on warm
pass. The growth is *legitimate same-role same-cluster transfer*, not
bleed.

`meanΔ` does not move in a consistent direction across runs: real_002
shrank it (cross-role bleeds removed → less over-boosting); real_003b grew
it in magnitude, from -0.108 to -0.202 (more legitimate deep boosts fire).
The metric isn't a clean fingerprint in either direction; read the
per-cluster bleed table for the actual signal.

## Repro

```bash
# Generate 40-query stress file (10 rows × 4 styles)
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt

# Run with extended extractor (current main)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_003b \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
```

Evidence:
- `reports/reality-tests/playbook_lift_real_003.{json,md}` (original extractor)
- `reports/reality-tests/playbook_lift_real_003b.{json,md}` (extended)
- `tests/reality/real_coord_queries_v2.txt` (40 queries: 10 rows × 4 styles)