# Reality test real_001 — findings

First retrieval probe with **real-shape coordinator queries** (sourced
from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`),
fed through the standard `playbook_lift.sh` harness with paraphrase
+ rejudge passes disabled (no recorded playbooks for these queries
on first touch, so those passes have nothing to measure).

Companion to `playbook_lift_real_001.md` — that file is the
auto-generated harness output; this file captures the reading.

---

## Substrate works on real-shape queries

8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role +
city + state + client queries are well within the substrate's competence.
This is the headline.

The matrix layer doesn't need to do clever things on these. Cosine on
`v2-moe` against the 5K worker corpus already surfaces a sensible match
in 8 / 10 cases — the v2-moe upgrade + the workers corpus (with role +
city + skills + certs in the resume text) is doing real work.

## Cross-pollination across same-client / same-city queries

Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:

| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES — recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |

Q#2 records a playbook for `e-6193` after the judge promotes it to
rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm
top-1 even though:

- Neither query has its own recorded playbook (Playbook column = no).
- Neither warm pass triggers a Shape B inject (boosted = 0).
- The roles are *different* — Forklift, Pickers, CNC Operator are
  distinct staffing categories.

So `e-6193` is reaching warm-top-1 via Shape A's distance-based boost:
the playbook-corpus entry tagged with Q#2's query text is close enough
in cosine to Q#5 and Q#10's embeddings (same client + city dominate)
that the boost halves the distance and promotes the worker.
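
The promotion described above can be illustrated with a minimal sketch. It assumes the boost applies to any worker whose playbook entry lies within `DefaultPlaybookMaxDistance` of the new query and that it halves that worker's distance before re-ranking. The type names, the `boostShapeA` function, the `<=` comparison, and the distances are hypothetical illustrations, not the matrix layer's actual code or measured values from this run.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical retrieval result; lower Distance is better.
type Hit struct {
	WorkerID string
	Distance float64
}

// PlaybookEntry is a hypothetical playbook-corpus match for the current query.
type PlaybookEntry struct {
	WorkerID      string
	QueryDistance float64 // cosine distance from the new query to the recorded query text
}

// DefaultPlaybookMaxDistance uses the value quoted in this report.
const DefaultPlaybookMaxDistance = 0.5

// boostShapeA halves the distance of every hit whose worker has a playbook
// entry within maxDist of the query, then re-sorts by distance.
func boostShapeA(hits []Hit, entries []PlaybookEntry, maxDist float64) []Hit {
	boosted := make(map[string]bool)
	for _, e := range entries {
		if e.QueryDistance <= maxDist { // assumption: inclusive threshold
			boosted[e.WorkerID] = true
		}
	}
	out := make([]Hit, len(hits))
	copy(out, hits)
	for i := range out {
		if boosted[out[i].WorkerID] {
			out[i].Distance /= 2
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	return out
}

func main() {
	// Illustrative distances only, not measured values from real_001.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.44}}
	playbook := []PlaybookEntry{{"e-6193", 0.20}} // recorded on a different query
	warm := boostShapeA(cold, playbook, DefaultPlaybookMaxDistance)
	fmt.Println(warm[0].WorkerID) // e-6193, because 0.44/2 = 0.22 < 0.30
}
```

Nothing in this sketch looks at role: the only inputs are distances, which is why a same-client, same-city entry can outrank a role-correct cold hit.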

For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`**
(judge rating 4 at rank 0) in favor of a worker who was approved by
the judge for a *different role* on a *different query*.

## Why the lift suite missed this

The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario
buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
bucket is a distinct semantic neighborhood, so recorded playbook
entries don't compete. A same-client, same-city cluster like Beacon
Freight Detroit never appears in the synthetic distribution.

Real coordinator demand clusters on `(client, city)` because that is
the shape of dispatch traffic: same client across roles, same city
across days. The Beacon Freight Detroit cluster is exactly the pattern
the synthetic bucketing rules out. So the synthetic harness reports
clean lift numbers while a same-client cluster bleeds.

## Why the judge gate doesn't catch it

The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is
**per-injection at record time** — Shape B inject calls
`gate.Approve(query, hit)` before adding a candidate. It does not
fire at retrieve time. A worker approved for Forklift at Beacon
Freight stays in the playbook corpus and rides along on later
Beacon Freight queries via Shape A boost without a second
judge call.

This is intentional in the design: judge calls are 1-3s on local
qwen2.5, so we batch them at record time. But the design didn't
anticipate the same-client cluster, where the boost surface is
much wider than per-query independence assumes.
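
A minimal sketch of that scoping, for illustration only. The `Gate` interface mirrors the `gate.Approve(query, hit)` call named above, but its string-typed signature, the `Playbook` type, and the substring check standing in for cosine distance are all assumptions, not the code in `internal/matrix/judge.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// Gate mirrors the gate.Approve(query, hit) call named in this report.
// The string-typed signature is an assumption for illustration.
type Gate interface {
	Approve(query, workerID string) bool
}

// countingGate approves everything and counts how often it is asked.
type countingGate struct{ calls int }

func (g *countingGate) Approve(query, workerID string) bool {
	g.calls++
	return true
}

type entry struct {
	query    string
	workerID string
}

// Playbook is a hypothetical stand-in for the playbook corpus.
type Playbook struct {
	gate    Gate
	entries []entry
}

// Record is the only place the gate is consulted.
func (p *Playbook) Record(query, workerID string) bool {
	if !p.gate.Approve(query, workerID) {
		return false
	}
	p.entries = append(p.entries, entry{query, workerID})
	return true
}

// Candidates returns workers from entries that share the query's client and
// city. The substring match stands in for the cosine-distance check; no judge
// call happens here.
func (p *Playbook) Candidates(query, clientCity string) []string {
	var out []string
	for _, e := range p.entries {
		if strings.Contains(e.query, clientCity) && strings.Contains(query, clientCity) {
			out = append(out, e.workerID)
		}
	}
	return out
}

func main() {
	g := &countingGate{}
	pb := &Playbook{gate: g}
	pb.Record("Forklift Operator Detroit Beacon Freight", "e-6193")
	got := pb.Candidates("CNC Operator Detroit Beacon Freight", "Detroit Beacon Freight")
	fmt.Println(got, g.calls) // [e-6193] 1
}
```

The gate count stays at 1 however many later queries reuse the entry, which is the gap the same-client cluster exposes.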

## Mitigation options (none yet implemented)

In rough order of cost:

1. **Role-scoped playbook corpus** — include `role` (extracted at
   record time, possibly via the existing demand-parser) as
   metadata on each playbook entry. Restrict Shape A boost to
   matches where query-role and playbook-role agree. Cheap;
   doesn't need a new judge call.

2. **Tighten Shape A distance for cross-role queries** — currently
   `DefaultPlaybookMaxDistance = 0.5`. If the new query embeds
   close in `(client, city)` but far in `role`, the boost still
   fires because cosine doesn't separate the axes. Could derive
   a tighter threshold from intra-role vs inter-role distance
   distributions.

3. **Per-retrieve judge re-gate** — call the judge on the warm
   top-1 vs cold top-1 and demote the warm result if the judge
   prefers cold. Highest correctness, ~2× retrieve latency. Not
   viable for the hot path.

(1) is the obvious first fix. A role extractor already exists:
`internal/chat/inbox.go` parses inbox demands with an LLM, and the same
qwen2.5 `format=json` call shape could run when a playbook is recorded.
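
A rough sketch of what the role check in option (1) could look like, assuming each playbook entry carries a `Role` string filled in at record time. The `PlaybookEntry` fields and the `eligibleForBoost` helper are hypothetical; only the 0.5 threshold comes from this report.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaybookEntry is a hypothetical playbook record with role metadata attached.
type PlaybookEntry struct {
	WorkerID      string
	Role          string  // assumed to be extracted at record time
	QueryDistance float64 // cosine distance from the new query to the recorded query
}

// DefaultPlaybookMaxDistance uses the value quoted in this report.
const DefaultPlaybookMaxDistance = 0.5

// eligibleForBoost allows a Shape A boost only when the entry is within
// maxDist and its role matches the query's role (case-insensitive).
// A missing role on either side fails closed: no boost.
func eligibleForBoost(queryRole string, e PlaybookEntry, maxDist float64) bool {
	if e.QueryDistance > maxDist {
		return false
	}
	q := strings.TrimSpace(queryRole)
	r := strings.TrimSpace(e.Role)
	if q == "" || r == "" {
		return false
	}
	return strings.EqualFold(q, r)
}

func main() {
	// Illustrative distance only, not a measured value from real_001.
	e := PlaybookEntry{WorkerID: "e-6193", Role: "Forklift Operator", QueryDistance: 0.20}
	fmt.Println(eligibleForBoost("forklift operator", e, DefaultPlaybookMaxDistance)) // true
	fmt.Println(eligibleForBoost("CNC Operator", e, DefaultPlaybookMaxDistance))      // false
}
```

Failing closed on an empty role is a design choice in this sketch: it trades some lost lift when extraction fails for never boosting across an unknown role.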

## What the substrate gets right

- 80% cold-pass-correct on a query distribution it has never seen
  before: a strong v2-moe + corpus signal.
- The two queries that triggered a discovery (Q#2, Q#9) lifted cleanly:
  recorded playbook → warm top-1 = judge-best at rank 0. The
  basic Shape B mechanism works on real-shape queries.
- Cross-pollination *only* fires within same-client+city clusters,
  not across them — the substrate is not behaving randomly. The
  bleed has clear semantics; it just exceeds what the inject
  gate's per-query scope catches.

## Repro

```bash
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.