root 7f2f112e6a reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed
First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:18:40 -05:00


# Reality test real_001 — findings
First retrieval probe with **real-shape coordinator queries** (sourced
from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`),
fed through the standard `playbook_lift.sh` harness with paraphrase
+ rejudge passes disabled (no recorded playbooks for these queries
on first touch, so those passes have nothing to measure).
Companion to `playbook_lift_real_001.md`: that file is the
auto-generated harness output; this file captures the reading.

---

## Substrate works on real-shape queries
8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role +
city + state + client queries are well within the substrate's competence.
This is the headline.

The matrix layer doesn't need to do clever things on these. Cosine on
`v2-moe` against the 5K worker corpus already surfaces a sensible match
in 8 / 10 cases: the v2-moe upgrade and the workers corpus (with role +
city + skills + certs in the resume text) are doing real work.
## Cross-pollination across same-client / same-city queries
Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:
| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES — recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |

Q#2 records a playbook for `e-6193` after the judge promotes it to
rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm
top-1 even though:
- Neither query has its own recorded playbook (Playbook column = no).
- Neither warm pass triggers a Shape B inject (boosted = 0).
- The roles are *different*: Forklift Operator, Pickers, and CNC Operator
are distinct staffing categories.

So `e-6193` reaches warm top-1 via Shape A's distance-based boost:
the playbook-corpus entry tagged with Q#2's query text is close enough
in cosine distance to the Q#5 and Q#10 embeddings (same client + city
dominate) that the boost halves the distance and promotes the worker.
For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`**
(judge rating 4 at rank 0) in favor of a worker who was approved by
the judge for a *different role* on a *different query*.
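
The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the code in the repository: the `Hit` and `PlaybookEntry` types, the `ApplyShapeABoost` function, and the distances in `main` are all hypothetical. Only the 0.5 max distance and the "boost halves the distance" behavior come from this report.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical retrieval result: a worker ID and its cosine
// distance to the query (lower is better).
type Hit struct {
	WorkerID string
	Distance float64
}

// PlaybookEntry is a hypothetical recorded playbook: the approved worker
// plus the distance from the new query to the recorded query text.
type PlaybookEntry struct {
	WorkerID      string
	QueryDistance float64
}

const (
	// DefaultPlaybookMaxDistance is the value quoted in this report.
	DefaultPlaybookMaxDistance = 0.5
	// boostFactor models "the boost halves the distance" (assumed constant).
	boostFactor = 0.5
)

// ApplyShapeABoost halves the distance of every hit whose worker appears in
// a playbook entry close enough to the query, then re-sorts by distance.
// Note that nothing here looks at the role of either query.
func ApplyShapeABoost(hits []Hit, entries []PlaybookEntry) []Hit {
	boosted := make(map[string]bool)
	for _, e := range entries {
		if e.QueryDistance <= DefaultPlaybookMaxDistance {
			boosted[e.WorkerID] = true
		}
	}
	out := append([]Hit(nil), hits...)
	for i := range out {
		if boosted[out[i].WorkerID] {
			out[i].Distance *= boostFactor
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	return out
}

func main() {
	// Illustrative distances only, not measurements from the run.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.45}}
	playbook := []PlaybookEntry{{"e-6193", 0.35}} // recorded by the Forklift query
	warm := ApplyShapeABoost(cold, playbook)
	fmt.Println("cold top-1:", cold[0].WorkerID, "warm top-1:", warm[0].WorkerID)
}
```

With these made-up numbers the playbook entry sits inside the max distance because client and city match, so the halved distance for `e-6193` drops below `w-3759` and the cold-pass winner is demoted.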
## Why the lift suite missed this
The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario
buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
bucket is a distinct semantic neighborhood, so recorded playbook
entries don't compete. The cluster doesn't exist in the synthetic
distribution.

Real coordinator demand clusters on `(client, city)` because that's
how dispatch traffic is shaped: same client across roles, same city
across days. The synthetic bucketing rules out exactly this kind of
overlap, so the Beacon Freight Detroit cluster has no synthetic
counterpart, and the synthetic harness reports clean lift numbers
while a same-client cluster bleeds.
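
One way to check whether a query file contains such clusters is to group queries by `(client, city, state)` and list the roles in each group. The sketch below assumes queries follow the generator's template ("Need {count} {role}s in {city} {state} starting at {at} for {client}") with a two-letter state code; the regex, the `RolesByCluster` helper, and the sample queries (including the made-up second client) are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// queryRe follows the coordinator template; capture groups are
// 1=count, 2=role phrase, 3=city, 4=state, 5=start, 6=client.
var queryRe = regexp.MustCompile(
	`^Need (\d+) (.+?) in (.+) ([A-Z]{2}) starting at (.+?) for (.+)$`)

// clusterKey identifies a same-client, same-city cluster.
type clusterKey struct{ Client, City, State string }

// RolesByCluster returns the distinct role phrases seen per cluster,
// in first-seen order. Lines that do not match the template are skipped.
func RolesByCluster(queries []string) map[clusterKey][]string {
	seen := make(map[clusterKey]map[string]bool)
	out := make(map[clusterKey][]string)
	for _, q := range queries {
		m := queryRe.FindStringSubmatch(strings.TrimSpace(q))
		if m == nil {
			continue
		}
		k := clusterKey{Client: m[6], City: m[3], State: m[4]}
		if seen[k] == nil {
			seen[k] = make(map[string]bool)
		}
		if !seen[k][m[2]] {
			seen[k][m[2]] = true
			out[k] = append(out[k], m[2])
		}
	}
	return out
}

func main() {
	// Illustrative queries; the {at} format and the second client are assumed.
	queries := []string{
		"Need 3 Forklift Operators in Detroit MI starting at 06:00 for Beacon Freight",
		"Need 5 Pickers in Detroit MI starting at 07:00 for Beacon Freight",
		"Need 2 CDL Drivers in Joliet IL starting at 05:00 for Example Logistics",
	}
	for k, roles := range RolesByCluster(queries) {
		if len(roles) > 1 { // multi-role clusters are where cross-role bleed can occur
			fmt.Printf("%s / %s %s: %v\n", k.Client, k.City, k.State, roles)
		}
	}
}
```

Run against a synthetic query file, a check like this would be expected to report no multi-role clusters, which is the gap this section describes.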
## Why the judge gate doesn't catch it
The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is
**per-injection at record time** — Shape B inject calls
`gate.Approve(query, hit)` before adding a candidate. It does not
fire at retrieve time. A worker approved for Forklift at Beacon
Freight stays in the playbook corpus and rides along on later
Beacon Freight queries via Shape A boost without a second
judge call.

This is intentional in the design: judge calls take 1-3 s on local
qwen2.5, so we batch them at record time. But the design didn't
anticipate the same-client cluster, where one recorded entry can
boost far more queries than the per-query independence assumption
allows.
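
The record-time-only gating can be illustrated with a toy model. Only the `gate.Approve(query, hit)` call shape comes from this report; the `Gate` interface signature, the `Playbook` type, and the counting stand-in judge below are assumptions made for the sketch.

```go
package main

import "fmt"

// Gate mirrors the gate.Approve(query, hit) call named in this report.
// The parameter and return types are assumptions.
type Gate interface {
	Approve(query, workerID string) bool
}

// countingGate is a stand-in judge that approves everything and counts calls.
type countingGate struct{ calls int }

func (g *countingGate) Approve(query, workerID string) bool {
	g.calls++
	return true
}

// Playbook is a hypothetical in-memory playbook corpus.
type Playbook struct {
	gate    Gate
	entries map[string]string // recorded query text -> approved worker
}

func NewPlaybook(g Gate) *Playbook {
	return &Playbook{gate: g, entries: make(map[string]string)}
}

// Record is the only place the gate runs (record time).
func (p *Playbook) Record(query, workerID string) bool {
	if !p.gate.Approve(query, workerID) {
		return false
	}
	p.entries[query] = workerID
	return true
}

// BoostCandidates stands in for the retrieve-time lookup: it returns every
// approved worker without consulting the gate again.
func (p *Playbook) BoostCandidates() []string {
	out := make([]string, 0, len(p.entries))
	for _, w := range p.entries {
		out = append(out, w)
	}
	return out
}

func main() {
	judge := &countingGate{}
	pb := NewPlaybook(judge)
	pb.Record("Forklift Operator, Beacon Freight, Detroit", "e-6193")

	// Two later retrievals for other roles reuse the approved entry.
	_ = pb.BoostCandidates()
	_ = pb.BoostCandidates()
	fmt.Println("judge calls:", judge.calls)
}
```

The judge is consulted once, at record time, and both later retrievals see `e-6193` as a boost candidate without any further approval.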
## Mitigation options (none yet implemented)
In rough order of cost:
1. **Role-scoped playbook corpus** — include `role` (extracted at
record time, possibly via the existing demand-parser) as
metadata on each playbook entry. Restrict Shape A boost to
matches where query-role and playbook-role agree. Cheap;
doesn't need a new judge call.
2. **Tighten Shape A distance for cross-role queries** — currently
`DefaultPlaybookMaxDistance = 0.5`. If the new query embeds
close in `(client, city)` but far in `role`, the boost still
fires because cosine doesn't separate the axes. Could derive
a tighter threshold from intra-role vs inter-role distance
distributions.
3. **Per-retrieve judge re-gate** — call the judge on the warm
top-1 vs cold top-1 and demote the warm result if the judge
prefers cold. Highest correctness, ~2× retrieve latency. Not
viable for the hot path.

(1) is the obvious first fix. The role extractor already exists
(see the LLM-parsed inbox demands in `internal/chat/inbox.go`); the
same qwen2.5 `format=json` call shape can run at playbook record time.
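
A minimal sketch of option (1), assuming each playbook entry carries a `Role` field filled in at record time. The types, the `ApplyRoleScopedBoost` function, the case-insensitive role comparison, and the distances are hypothetical; only the 0.5 max distance and the halving boost come from this report.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hit is a hypothetical retrieval result (lower distance is better).
type Hit struct {
	WorkerID string
	Distance float64
}

// PlaybookEntry is a hypothetical playbook record with role metadata.
type PlaybookEntry struct {
	WorkerID      string
	Role          string  // assumed: extracted at record time
	QueryDistance float64 // distance from the new query to the recorded query
}

const (
	DefaultPlaybookMaxDistance = 0.5 // value quoted in this report
	boostFactor                = 0.5 // "the boost halves the distance"
)

// sameRole compares roles case-insensitively; a real implementation would
// likely need normalization (plurals, synonyms), which is out of scope here.
func sameRole(a, b string) bool {
	return strings.EqualFold(strings.TrimSpace(a), strings.TrimSpace(b))
}

// ApplyRoleScopedBoost boosts a hit only when the playbook entry is close
// enough AND was recorded for the same role as the current query.
func ApplyRoleScopedBoost(queryRole string, hits []Hit, entries []PlaybookEntry) []Hit {
	boosted := make(map[string]bool)
	for _, e := range entries {
		if e.QueryDistance <= DefaultPlaybookMaxDistance && sameRole(queryRole, e.Role) {
			boosted[e.WorkerID] = true
		}
	}
	out := append([]Hit(nil), hits...)
	for i := range out {
		if boosted[out[i].WorkerID] {
			out[i].Distance *= boostFactor
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	return out
}

func main() {
	// Illustrative distances only, not measurements from the run.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.45}}
	playbook := []PlaybookEntry{{"e-6193", "Forklift Operator", 0.35}}

	fmt.Println(ApplyRoleScopedBoost("CNC Operator", cold, playbook)[0].WorkerID)
	fmt.Println(ApplyRoleScopedBoost("Forklift Operator", cold, playbook)[0].WorkerID)
}
```

In this toy setup the CNC Operator query keeps `w-3759` at top-1 because the role check blocks the boost, while a repeat Forklift Operator query still gets the `e-6193` lift. No judge call is added at retrieve time, which is why this option is the cheapest of the three.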
## What the substrate gets right
- 80% cold-pass-correct on a query distribution it was never
trained for: a strong v2-moe + corpus signal.
- The two queries where the judge found a better candidate than the
cold top-1 (Q#2, Q#9) lifted cleanly: recorded playbook → warm
top-1 = judge-best at rank 0. The basic Shape B mechanism works on
real-shape queries.
- Cross-pollination *only* fires within same-client+city clusters,
not across them — the substrate is not behaving randomly. The
bleed has clear semantics; it just exceeds what the inject
gate's per-query scope catches.
## Repro
```bash
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.