# Reality test real_001: findings

First retrieval probe with **real-shape coordinator queries** (sourced from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`), fed through the standard `playbook_lift.sh` harness with the paraphrase and rejudge passes disabled (these queries have no recorded playbooks on first touch, so those passes have nothing to measure).

Companion to `playbook_lift_real_001.md`: that file is the auto-generated harness output; this file captures the reading.

---

## Substrate works on real-shape queries

8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role + city + state + client queries are well within the substrate's competence.

This is the headline. The matrix layer doesn't need to do clever things on these. Cosine on `v2-moe` against the 5K worker corpus already surfaces a sensible match in 8 / 10 cases; the v2-moe upgrade + the workers corpus (with role + city + skills + certs in the resume text) is doing real work.

## Cross-pollination across same-client / same-city queries

Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:

| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES: recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |

Q#2 records a playbook for `e-6193` after the judge promotes it to rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm top-1 even though:

- Neither query has its own recorded playbook (Playbook column = no).
- Neither warm pass triggers a Shape B inject (Boosted = 0).
- The roles are *different*: Forklift, Pickers, and CNC Operator are distinct staffing categories.

So `e-6193` is reaching warm top-1 via Shape A's distance-based boost: the playbook-corpus entry tagged with Q#2's query text is close enough in cosine to the embeddings of Q#5 and Q#10 (same client + city dominate) that the boost halves the distance and promotes the worker.

For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`** (judge rating 4 at rank 0) in favor of a worker who was approved by the judge for a *different role* on a *different query*.

## Why the lift suite missed this

The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each bucket is a distinct semantic neighborhood, so recorded playbook entries don't compete. The cluster doesn't exist in the synthetic distribution.

Real coordinator demand clusters on `(client, city)` because that's how dispatch traffic is shaped: same client across roles, same city across days. The Beacon Freight Detroit cluster is exactly the kind of cluster the synthetic bucketing prevents.

So the synthetic harness reports clean lift numbers while a same-client cluster bleeds.

## Why the judge gate doesn't catch it

The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is **per-injection at record time**: Shape B inject calls `gate.Approve(query, hit)` before adding a candidate. It does not fire at retrieve time. A worker approved for Forklift at Beacon Freight stays in the playbook corpus and rides along on later Beacon Freight queries via Shape A boost without a second judge call.

This is intentional in the design: judge calls are 1-3s on local qwen2.5, so we batch them at record time. But the design didn't anticipate the same-client cluster, where the boost surface is much wider than per-query independence assumes.

## Mitigation options (none yet implemented)

In rough order of cost:

1. **Role-scoped playbook corpus**: include `role` (extracted at record time, possibly via the existing demand-parser) as metadata on each playbook entry.
   Restrict Shape A boost to matches where query-role and playbook-role agree. Cheap; doesn't need a new judge call.
2. **Tighten Shape A distance for cross-role queries**: currently `DefaultPlaybookMaxDistance = 0.5`. If the new query embeds close in `(client, city)` but far in `role`, the boost still fires because cosine doesn't separate the axes. Could derive a tighter threshold from intra-role vs inter-role distance distributions.
3. **Per-retrieve judge re-gate**: call the judge on the warm top-1 vs cold top-1 and demote the warm result if the judge prefers cold. Highest correctness, ~2× retrieve latency. Not viable for the hot path.

(1) is the obvious first fix. The role-extractor already exists (see the LLM-parsed inbox demands in `internal/chat/inbox.go`; the same qwen2.5 `format=json` call shape can run at playbook record time).

## What the substrate gets right

- 80% cold-pass-correct on a query distribution it was never trained on: strong v2-moe + corpus signal.
- The two queries where the judge found a better candidate than cold top-1 (Q#2, Q#9) lifted cleanly: recorded playbook → warm top-1 = judge-best at rank 0. The basic Shape B mechanism works on real-shape queries.
- Cross-pollination *only* fires within same-client+city clusters, not across them, so the substrate is not behaving randomly. The bleed has clear semantics; it just exceeds what the inject gate's per-query scope catches.

## Repro

```bash
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.
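
## Sketch: role-scoped Shape A boost (mitigation 1)

Mitigation option (1) above could look roughly like the following. This is a minimal sketch, not the code in `internal/matrix`: the `PlaybookEntry` and `Candidate` types and the `boostRoleScoped` function are hypothetical names introduced for illustration, and the distances in `main` are made up rather than measured. The only values taken from the findings are the `0.5` max distance and the "boost halves the distance" behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// Mirrors the DefaultPlaybookMaxDistance = 0.5 value quoted in the findings.
const defaultPlaybookMaxDistance = 0.5

// PlaybookEntry is a hypothetical playbook-corpus record carrying the role
// that would be extracted at record time.
type PlaybookEntry struct {
	WorkerID string
	Role     string
}

// Candidate is a hypothetical retrieval hit; lower Distance ranks higher.
type Candidate struct {
	WorkerID string
	Distance float64
}

// sameRole compares roles case-insensitively, ignoring surrounding spaces.
func sameRole(a, b string) bool {
	return strings.EqualFold(strings.TrimSpace(a), strings.TrimSpace(b))
}

// boostRoleScoped halves the candidate's distance only when the playbook
// entry names the same worker, sits within the max distance of the query,
// and was recorded for the same role as the query. Otherwise the candidate
// is returned unchanged.
func boostRoleScoped(c Candidate, e PlaybookEntry, queryRole string, playbookDist float64) Candidate {
	if e.WorkerID != c.WorkerID {
		return c
	}
	if playbookDist > defaultPlaybookMaxDistance {
		return c
	}
	if !sameRole(e.Role, queryRole) {
		return c // cross-role match: skip the boost
	}
	c.Distance /= 2
	return c
}

func main() {
	entry := PlaybookEntry{WorkerID: "e-6193", Role: "Forklift Operator"}
	cand := Candidate{WorkerID: "e-6193", Distance: 0.5} // illustrative value

	same := boostRoleScoped(cand, entry, "Forklift Operator", 0.1)
	cross := boostRoleScoped(cand, entry, "CNC Operator", 0.1)
	fmt.Printf("same-role distance: %.2f\n", same.Distance)
	fmt.Printf("cross-role distance: %.2f\n", cross.Distance)
}
```

Under these assumptions, a CNC Operator query at the same client and city would no longer inherit the boost recorded for a Forklift Operator query, while the same-role lift stays intact. Exact string matching on role is the simplest possible comparison; whether roles need normalizing (for example "Pickers" vs "Picker") depends on what the extractor emits.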