golangLAKEHOUSE/reports/reality-tests/real_003_findings.md
root 3263254f1c reality_test real_003: 40-query paraphrase stress + extractor extension
Stress-tests the role gate with 40 queries (10 fill_events rows × 4
styles): need, client_first, looking, shorthand. Each row's role +
client + city stays the same; only the surface phrasing changes.

real_003 (original extractor) confirmed the shorthand-vs-shorthand
failure mode: CNC Operator shorthand recording leaked w-2404 onto
Forklift Operator shorthand query within the same Beacon Freight
Detroit cluster. Both record + query had empty role (extractor
returns "" for shorthand because there's no separator between role
and city), gate disabled, distance check passed, bleed fired.

Fix: extended extractRoleFromNeed to handle client_first
("{client} needs N {role} in...") and looking ("Looking for N
{role} at...") patterns. Shorthand left intentionally unmatched —
"Forklift Operator Detroit" is shape-indistinguishable from
"Forklift" + "Operator Detroit" without an LLM extractor or known-
cities lookup.

real_003b (extended extractor) verifies bleed closed across all 4
styles for this dataset. Forklift Operator queries keep w-2136 (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles — a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.

scripts/cutover/gen_real_queries.go: added -styles flag with values
need|client_first|looking|shorthand|all (default need preserves
real_001/002 behavior). tests/reality/real_coord_queries_v2.txt is
the 40-query stress file.

scripts/playbook_lift/main_test.go: 10 sub-tests lock the four
documented patterns + shorthand limitation + lift-suite-style
queries (no clean role, returns empty as expected).

Aggregate metrics:
- real_003  (original): disc=7,  lift=7,  boost=14, meanΔ=-0.108
- real_003b (extended): disc=11, lift=10, boost=31, meanΔ=-0.202
The growth reflects more LEGITIMATE same-role same-cluster transfer
firing across styles, not bleed (verified by per-cluster bleed
table — Forklift Operator queries unchanged across all 4 styles).

Known limitation documented in real_003_findings.md: same-cluster
shorthand queries for different roles still embed close enough that a
shorthand recording could bleed onto a different-role shorthand
query if both record + query strip role. Closing this requires
LLM extraction or known-cities lookup at record + query time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:42:02 -05:00


# Reality test real_003 / real_003b — paraphrase stress + extractor extension
40 queries (10 fill_events rows × 4 query styles) re-run twice:
- **real_003**: with the original `extractRoleFromNeed` regex (only
matches `^Need\s+\d+\s+\S+\s+in\s+` — the real_001 form)
- **real_003b**: with the extractor extended to also handle
`client_first` (`{client} needs N {role} in...`) and `looking`
(`Looking for N {role} at...`). Shorthand still falls through to
empty role.
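
The three matched shapes can be sketched as one anchored pattern per style. This is a minimal illustration, not the project's actual `extractRoleFromNeed`: the function name `extractRoleSketch`, the lazy `(.+?)` capture (used here so multi-word roles such as "CNC Operator" match), and the text after `in`/`at` in the sample queries are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rolePatterns holds one anchored regex per supported query style.
// Capture group 1 is the role. These expressions are illustrative
// assumptions, not the ones used in the repository.
var rolePatterns = []*regexp.Regexp{
	// need: "Need N {role} in ..."
	regexp.MustCompile(`^Need\s+\d+\s+(.+?)\s+in\s+`),
	// client_first: "{client} needs N {role} in ..."
	regexp.MustCompile(`^.+?\s+needs\s+\d+\s+(.+?)\s+in\s+`),
	// looking: "Looking for N {role} at ..."
	regexp.MustCompile(`^Looking\s+for\s+\d+\s+(.+?)\s+at\s+`),
}

// extractRoleSketch returns the role captured by the first matching
// pattern, or "" when no pattern matches (the shorthand case).
func extractRoleSketch(query string) string {
	q := strings.TrimSpace(query)
	for _, re := range rolePatterns {
		if m := re.FindStringSubmatch(q); m != nil {
			return strings.TrimSpace(m[1])
		}
	}
	return ""
}

func main() {
	queries := []string{
		"Need 2 Forklift Operator in Detroit MI",
		"Beacon Freight needs 1 CNC Operator in Detroit MI",
		"Looking for 4 Pickers at Beacon Freight",
		"1 Forklift Operator Detroit MI 15:00 Beacon Freight", // shorthand
	}
	for _, q := range queries {
		fmt.Printf("%q -> %q\n", q, extractRoleSketch(q))
	}
}
```

The shorthand query falls through every pattern because nothing separates the role from the city, which is the limitation the rest of this report works around.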
## Headline
real_003 (original extractor): **shorthand-vs-shorthand bleed
confirmed**. The CNC Operator shorthand recording leaked `w-2404`
onto the Forklift Operator shorthand query within the same Beacon
Freight Detroit cluster — both record + query had empty role, gate
disabled, distance check passed, bleed fired.
real_003b (extended extractor): **no bleed observed on the query side
in any of the 4 styles**. Forklift Operator queries keep `w-2136` (the
cold-pass-correct match) regardless of which style the query came
in. Same-role boosts now fire correctly across styles: a CNC
Operator recording made in `looking` style boosts the CNC need-form
query.
## Per-style behavior — real_003 (original extractor)
| Style | Bleed? | Explanation |
|---|---|---|
| `need` | none observed | Role extracted on both record + query side |
| `client_first` | none observed | Role NOT extracted, but no recording in this style happened to surface near a different-role query in real_003 |
| `looking` | none observed | Same as client_first |
| `shorthand` | **`w-2404` from CNC bled onto Forklift Operator** | Both record and query empty role → gate disabled → distance check passed |

The single observed bleed case in real_003:
- Q#40 (`1 CNC Operator Detroit MI 17:30 Beacon Freight`) recorded
`w-2404` with `Role: ""` (extractor returned empty for shorthand).
- Q#4 (`1 Forklift Operator Detroit MI 15:00 Beacon Freight`) embedded
within ~0.137 cosine distance of Q#40 (same client + city + count +
time tokens dominate the embedding).
- `roleEqual("", "")` returned true (empty disables) → injection
fired → warm top-1 for Forklift Operator became `w-2404`.
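
The two-gate interaction in this case can be sketched as below. It is a simplified model under stated assumptions: `roleEqualLenient` and `shouldInject` are hypothetical names, the `0.20` cutoff is an assumed value (the report does not state the real threshold), and case-insensitive comparison is a guess at how non-empty roles are compared.

```go
package main

import (
	"fmt"
	"strings"
)

// maxInjectDistance is a hypothetical cosine-distance cutoff. The actual
// threshold is not stated in this report.
const maxInjectDistance = 0.20

// roleEqualLenient models the lenient semantics: an empty role on either
// side disables the role gate, so the comparison returns true.
func roleEqualLenient(recordRole, queryRole string) bool {
	if recordRole == "" || queryRole == "" {
		return true
	}
	// Case-insensitive comparison is an assumption for this sketch.
	return strings.EqualFold(recordRole, queryRole)
}

// shouldInject fires only when both the role gate and the distance gate pass.
func shouldInject(recordRole, queryRole string, dist float64) bool {
	return roleEqualLenient(recordRole, queryRole) && dist <= maxInjectDistance
}

func main() {
	// Shorthand vs shorthand: both roles empty, distance ~0.137.
	fmt.Println(shouldInject("", "", 0.137)) // true: the bleed case
	// Same distance, but both roles extracted and different.
	fmt.Println(shouldInject("CNC Operator", "Forklift Operator", 0.137)) // false
	// Empty record role, but the embedding is far enough away.
	fmt.Println(shouldInject("", "Forklift Operator", 0.25)) // false
}
```

With both roles empty, only the distance gate stands between a recording and a different-role query, and same-cluster shorthand queries sit well inside it.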
## Per-style behavior — real_003b (extended extractor)
After adding patterns for `client_first` and `looking`:
| Style | Role extracted? | Cross-role bleed observed? |
|---|---|---|
| `need` | yes | none |
| `client_first` | yes | none |
| `looking` | yes | none |
| `shorthand` | no | none in this dataset |

No bleed observed in real_003b across any style. Pickers + CNC
Operator queries pick up their own role's recording across styles;
Forklift Operator queries keep the cold-pass-correct match.
## Why the shorthand failure mode didn't fire in real_003b
The extended extractor makes queryRole non-empty for `need`,
`client_first`, and `looking` queries, so a recording with a different
non-empty role is rejected. It does not close the gap for empty-role
recordings: `roleEqual(role, "")` is still true (lenient), so a shorthand
recording with `Role: ""` passes the role gate against any query.
The extractor alone therefore does not explain the missing bleed. Two
properties of real_003b's data kept it from triggering one:
1. **The Pickers shorthand recording** has `Role: ""`. Forklift Operator
queries embed > 0.20 cosine distance from "4 Pickers Detroit MI ..."
because "Pickers" vs "Forklift Operator" provides enough semantic
separation even within the same client+city cluster. The distance gate
caught what the role gate let through.
2. **No Forklift Operator recording** existed (judge said cold top-1
was already correct, rating 5, no playbook needed). The
most-likely-to-bleed scenario — a Forklift recording in shorthand
leaking onto Pickers/CNC — didn't have ammunition.
If a future dataset has multiple roles per cluster all hitting shorthand
recordings, the bleed could return. **Mitigation candidates** (none
implemented):
- **LLM-based role extraction** at record + query time. Robust, slow.
- **Known-cities lookup table** to detect city boundary in shorthand
(`{role} {city}` separator). 50 US states + ~5000 cities = small
static table. Fast, brittle on new cities.
- **Strict gate semantics**: empty role on either side = REJECT
(instead of allow). Closes shorthand-vs-shorthand bleeds completely
but breaks lift-suite multi-constraint queries that have no clean
single role.
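
As an illustration of the known-cities candidate, the sketch below splits a shorthand query at the first recognized city. It is not implemented in the repository: `roleFromShorthand`, the two-entry `knownCities` table, and the assumed `{N} {role} {city} ...` token order are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knownCities is a tiny stand-in for the static lookup table; keys are
// lowercase. A real table would be loaded from data, not hard-coded.
var knownCities = map[string]bool{
	"detroit":      true,
	"grand rapids": true,
}

// roleFromShorthand assumes the shape "{N} {role} {city} ..." and returns
// the tokens between the leading count and the first known city. It
// returns "" when the count is missing or no city is recognized.
func roleFromShorthand(query string) string {
	tokens := strings.Fields(query)
	if len(tokens) < 3 {
		return ""
	}
	if _, err := strconv.Atoi(tokens[0]); err != nil {
		return "" // not a shorthand query with a leading count
	}
	// Start at index 2 so the role keeps at least one token.
	for i := 2; i < len(tokens); i++ {
		// Try two-word city names before one-word names.
		for width := 2; width >= 1; width-- {
			if i+width > len(tokens) {
				continue
			}
			city := strings.ToLower(strings.Join(tokens[i:i+width], " "))
			if knownCities[city] {
				return strings.Join(tokens[1:i], " ")
			}
		}
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", roleFromShorthand("1 Forklift Operator Detroit MI 15:00 Beacon Freight"))
	fmt.Printf("%q\n", roleFromShorthand("4 Pickers Detroit MI 09:00 Beacon Freight"))
	fmt.Printf("%q\n", roleFromShorthand("2 Welder Springfield IL 08:00 Acme")) // city not in table
}
```

The last call shows the brittleness noted above: a city missing from the table leaves the role empty, and the lenient gate applies again.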
## Aggregate metrics
| Run | n | discoveries | lifts | boost_total | meanΔ |
|---|---:|---:|---:|---:|---:|
| real_003 (original extractor) | 40 | 7 | 7 | 14 | -0.108 |
| real_003b (extended extractor) | 40 | 11 | 10 | 31 | -0.202 |

real_003b's higher discoveries + boost_total reflect the extractor
catching role on 3 of 4 styles instead of 1 of 4 — which means more
recordings land with a usable Role and more queries find them on warm
pass. The growth is *legitimate same-role same-cluster transfer*, not
bleed.
The direction of `meanΔ` depends on the run: real_002 shrank `meanΔ`
(cross-role bleeds removed → less over-boosting); real_003b grew it (more
legitimate deep boosts fire). The metric isn't a clean fingerprint in
either direction; read the per-cluster bleed table for actual signal.
## Repro
```bash
# Generate 40-query stress file (10 rows × 4 styles)
go run scripts/cutover/gen_real_queries.go -limit 10 -styles all > tests/reality/real_coord_queries_v2.txt
# Run with extended extractor (current main)
QUERIES_FILE=tests/reality/real_coord_queries_v2.txt RUN_ID=real_003b \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
```
Evidence:
- `reports/reality-tests/playbook_lift_real_003.{json,md}` (original extractor)
- `reports/reality-tests/playbook_lift_real_003b.{json,md}` (extended)
- `tests/reality/real_coord_queries_v2.txt` (40 queries: 10 rows × 4 styles)