First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
    go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
    QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
        WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
130 lines
5.4 KiB
Markdown
# Reality test real_001: findings

First retrieval probe with **real-shape coordinator queries** (sourced
from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`),
fed through the standard `playbook_lift.sh` harness with the paraphrase
and rejudge passes disabled (no recorded playbooks exist for these
queries on first touch, so those passes have nothing to measure).

Companion to `playbook_lift_real_001.md`: that file is the
auto-generated harness output; this file captures the reading.

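For orientation, the coordinator phrasing the generator targets is "Need {count} {role}s in {city} {state} starting at {at} for {client}". The sketch below shows only that template step. It is not the code in `gen_real_queries.go`: the `FillEvent` struct, its field names, and the sample count and start time are assumptions made for illustration, and parquet reading is left out entirely.

```go
package main

import "fmt"

// FillEvent holds only the fields the template needs. The real parquet
// schema is not shown in this document, so these names are assumptions.
type FillEvent struct {
	Count  int
	Role   string
	City   string
	State  string
	At     string
	Client string
}

// coordinatorQuery renders one row in the coordinator phrasing:
// "Need {count} {role}s in {city} {state} starting at {at} for {client}".
func coordinatorQuery(e FillEvent) string {
	return fmt.Sprintf("Need %d %ss in %s %s starting at %s for %s",
		e.Count, e.Role, e.City, e.State, e.At, e.Client)
}

func main() {
	// Count and At are made-up sample values, not data from real_001.
	e := FillEvent{
		Count: 3, Role: "Forklift Operator",
		City: "Detroit", State: "MI",
		At: "06:00", Client: "Beacon Freight",
	}
	fmt.Println(coordinatorQuery(e))
}
```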

---

## Substrate works on real-shape queries

8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role +
city + state + client queries are well within the substrate's competence.
This is the headline.

The matrix layer doesn't need to do clever things on these. Cosine on
`v2-moe` against the 5K worker corpus already surfaces a sensible match
in 8 / 10 cases: the v2-moe upgrade + the workers corpus (with role +
city + skills + certs in the resume text) is doing real work.

## Cross-pollination across same-client / same-city queries

Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:

| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES: recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |

Q#2 records a playbook for `e-6193` after the judge promotes it to
rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm
top-1 even though:

- Neither query has its own recorded playbook (Playbook column = no).
- Neither warm pass triggers a Shape B inject (boosted = 0).
- The roles are *different*: Forklift, Pickers, CNC Operator are
  distinct staffing categories.

So `e-6193` is reaching warm top-1 via Shape A's distance-based boost:
the playbook-corpus entry tagged with Q#2's query text is close enough
in cosine to Q#5 and Q#10's embeddings (same client + city dominate)
that the boost halves the distance and promotes the worker.

For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`**
(judge rating 4 at rank 0) in favor of a worker who was approved by
the judge for a *different role* on a *different query*.

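To make the demotion concrete, here is a minimal sketch of a halving boost of the kind described above. It is not the matrix layer's code. It assumes the boost fires whenever a playbook entry's recorded query sits within `DefaultPlaybookMaxDistance` (0.5) of the incoming query, and that firing halves the recorded worker's distance. All type names, function names, and distance values are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// maxPlaybookDistance mirrors the DefaultPlaybookMaxDistance = 0.5 value
// quoted in this document; how it is applied here is an assumption.
const maxPlaybookDistance = 0.5

// Hit is one retrieved worker and its cosine distance to the query.
type Hit struct {
	WorkerID string
	Distance float64
}

// PlaybookEntry is a recorded (query, worker) pair. QueryDistance is the
// distance from the recorded query text to the incoming query.
type PlaybookEntry struct {
	WorkerID      string
	QueryDistance float64
}

// applyShapeA halves the distance of every hit whose worker has a nearby
// playbook entry, then re-ranks. Note that role is never consulted.
func applyShapeA(hits []Hit, entries []PlaybookEntry) []Hit {
	boosted := make(map[string]bool)
	for _, e := range entries {
		if e.QueryDistance <= maxPlaybookDistance {
			boosted[e.WorkerID] = true
		}
	}
	out := make([]Hit, len(hits))
	copy(out, hits)
	for i := range out {
		if boosted[out[i].WorkerID] {
			out[i].Distance /= 2
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Distance < out[j].Distance
	})
	return out
}

func main() {
	// Hypothetical distances for a CNC Operator query in the Q#10 position.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.44}}
	// The Forklift playbook entry is close because client + city dominate.
	entries := []PlaybookEntry{{"e-6193", 0.35}}

	warm := applyShapeA(cold, entries)
	fmt.Println("warm top-1:", warm[0].WorkerID) // e-6193 (0.44/2 = 0.22 < 0.30)
}
```

With these made-up numbers, `e-6193` trails `w-3759` in the cold pass but overtakes it once halved, even though nothing in the boost path knows the recorded query was for a different role.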
## Why the lift suite missed this

The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario
buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
bucket is a distinct semantic neighborhood, so recorded playbook
entries don't compete. The cluster doesn't exist in the synthetic
distribution.

Real coordinator demand clusters on `(client, city)` because that is
the shape of dispatch traffic: same client across roles, same city
across days. The Beacon Freight Detroit cluster is exactly the kind of
overlap the synthetic bucketing rules out. So the synthetic harness
reports clean lift numbers while a same-client cluster bleeds.

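One cheap way to check whether a query file contains this kind of cluster is to group queries by `(client, city)` and flag any group that spans more than one role. The sketch below assumes each query has already been parsed into structured fields. The `Query` and `ClusterKey` types are hypothetical, and the fourth query is an invented control row rather than one from the run.

```go
package main

import "fmt"

// Query is a parsed coordinator query; the field set is an assumption.
type Query struct {
	ID     int
	Role   string
	Client string
	City   string
}

// ClusterKey is the (client, city) pair that real demand clusters on.
type ClusterKey struct {
	Client string
	City   string
}

// multiRoleClusters returns the distinct roles seen per (client, city),
// keeping only clusters that contain more than one role.
func multiRoleClusters(qs []Query) map[ClusterKey][]string {
	seen := make(map[ClusterKey]map[string]bool)
	roles := make(map[ClusterKey][]string)
	for _, q := range qs {
		k := ClusterKey{q.Client, q.City}
		if seen[k] == nil {
			seen[k] = make(map[string]bool)
		}
		if !seen[k][q.Role] {
			seen[k][q.Role] = true
			roles[k] = append(roles[k], q.Role)
		}
	}
	out := make(map[ClusterKey][]string)
	for k, rs := range roles {
		if len(rs) > 1 {
			out[k] = rs
		}
	}
	return out
}

func main() {
	qs := []Query{
		{2, "Forklift Operator", "Beacon Freight", "Detroit"},
		{5, "Pickers", "Beacon Freight", "Detroit"},
		{10, "CNC Operator", "Beacon Freight", "Detroit"},
		{99, "CDL Driver", "Example Client", "Chicago"}, // invented control row
	}
	for k, rs := range multiRoleClusters(qs) {
		fmt.Printf("%s / %s: %d roles %v\n", k.Client, k.City, len(rs), rs)
	}
}
```

Run against the synthetic query file, a check like this would be expected to return nothing if the buckets really are disjoint on client and city.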
## Why the judge gate doesn't catch it

The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is
**per-injection at record time**: Shape B inject calls
`gate.Approve(query, hit)` before adding a candidate. It does not
fire at retrieve time. A worker approved for Forklift at Beacon
Freight stays in the playbook corpus and rides along on later
Beacon Freight queries via Shape A boost without a second
judge call.

This is intentional in the design: judge calls take 1-3s on local
qwen2.5, so we batch them at record time. But the design didn't
anticipate the same-client cluster, where the boost surface is
much wider than per-query independence assumes.

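The scoping gap can be shown with a toy model. The sketch below borrows only the `gate.Approve(query, hit)` call shape from the text. The `Gate` interface, its string parameters, the `PlaybookCorpus` type, and both method names are assumptions, and the distance check on the boost path is omitted for brevity.

```go
package main

import "fmt"

// Gate mirrors the gate.Approve(query, hit) call named above; the
// string parameter types are an assumption.
type Gate interface {
	Approve(query, workerID string) bool
}

// countingGate approves everything and counts how often it is asked.
type countingGate struct{ calls int }

func (g *countingGate) Approve(query, workerID string) bool {
	g.calls++
	return true
}

type entry struct{ query, workerID string }

// PlaybookCorpus is a hypothetical stand-in for the playbook corpus.
type PlaybookCorpus struct {
	gate    Gate
	entries []entry
}

// Record is the only path that consults the gate.
func (c *PlaybookCorpus) Record(query, workerID string) bool {
	if !c.gate.Approve(query, workerID) {
		return false
	}
	c.entries = append(c.entries, entry{query, workerID})
	return true
}

// BoostCandidates stands in for the Shape A retrieve path. It never
// calls the gate, so an approval for one query carries over to others.
func (c *PlaybookCorpus) BoostCandidates(query string) []string {
	ids := make([]string, 0, len(c.entries))
	for _, e := range c.entries {
		ids = append(ids, e.workerID)
	}
	return ids
}

func main() {
	g := &countingGate{}
	c := &PlaybookCorpus{gate: g}

	c.Record("Forklift Operator, Beacon Freight, Detroit MI", "e-6193")
	c.BoostCandidates("Pickers, Beacon Freight, Detroit MI")
	c.BoostCandidates("CNC Operator, Beacon Freight, Detroit MI")

	fmt.Println("judge calls:", g.calls) // 1
}
```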
## Mitigation options (none yet implemented)

In rough order of cost:

1. **Role-scoped playbook corpus**: include `role` (extracted at
   record time, possibly via the existing demand-parser) as
   metadata on each playbook entry. Restrict Shape A boost to
   matches where query-role and playbook-role agree. Cheap;
   doesn't need a new judge call.

2. **Tighten Shape A distance for cross-role queries**: currently
   `DefaultPlaybookMaxDistance = 0.5`. If the new query embeds
   close in `(client, city)` but far in `role`, the boost still
   fires because cosine doesn't separate the axes. Could derive
   a tighter threshold from intra-role vs inter-role distance
   distributions.

3. **Per-retrieve judge re-gate**: call the judge on the warm
   top-1 vs cold top-1 and demote the warm result if the judge
   prefers cold. Highest correctness, ~2× retrieve latency. Not
   viable for the hot path.

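For option 2, one simple way to derive a threshold from the two distributions is to take the midpoint between the largest same-role distance and the smallest cross-role distance, and to give up when the two overlap. This is only one possible rule, not something the harness implements. The function name and the sample distances are invented, and real distributions may well overlap, in which case a single cosine threshold cannot separate roles.

```go
package main

import "fmt"

// deriveThreshold returns a cutoff halfway between the largest intra-role
// distance and the smallest inter-role distance. It reports false when
// either sample is empty or the two ranges overlap.
func deriveThreshold(intra, inter []float64) (float64, bool) {
	if len(intra) == 0 || len(inter) == 0 {
		return 0, false
	}
	maxIntra := intra[0]
	for _, d := range intra[1:] {
		if d > maxIntra {
			maxIntra = d
		}
	}
	minInter := inter[0]
	for _, d := range inter[1:] {
		if d < minInter {
			minInter = d
		}
	}
	if maxIntra >= minInter {
		return 0, false // distributions overlap; no clean cutoff exists
	}
	return (maxIntra + minInter) / 2, true
}

func main() {
	// Made-up playbook-to-query distances, split by role agreement.
	intra := []float64{0.125, 0.25} // same role
	inter := []float64{0.5, 0.75}   // different role, same client + city

	if th, ok := deriveThreshold(intra, inter); ok {
		fmt.Printf("candidate max distance: %.3f\n", th) // 0.375
	} else {
		fmt.Println("no separating threshold")
	}
}
```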
(1) is the obvious first fix. The role-extractor already exists
(see the LLM-parsed inbox demands in `internal/chat/inbox.go`; the
same qwen2.5 `format=json` call shape can run at playbook record time).

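A role-matched boost gate for option 1 could look roughly like the sketch below. It assumes each playbook entry carries a `Role` string extracted at record time and that the incoming query's role is available at retrieve time. The type, the function names, and the choice to fail closed on an empty role are all assumptions, not existing code.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaybookEntry is a hypothetical playbook record with role metadata.
type PlaybookEntry struct {
	WorkerID      string
	Role          string  // extracted at record time
	QueryDistance float64 // recorded query vs. incoming query
}

// normalizeRole makes role comparison case- and whitespace-insensitive.
func normalizeRole(r string) string {
	return strings.ToLower(strings.TrimSpace(r))
}

// boostEligible allows the Shape A boost only when the entry is within
// maxDistance AND the roles agree. Missing roles never match.
func boostEligible(queryRole string, e PlaybookEntry, maxDistance float64) bool {
	if e.QueryDistance > maxDistance {
		return false
	}
	q, p := normalizeRole(queryRole), normalizeRole(e.Role)
	if q == "" || p == "" {
		return false
	}
	return q == p
}

func main() {
	// Distance is made up; the entry mirrors the Q#2 playbook record.
	e := PlaybookEntry{WorkerID: "e-6193", Role: "Forklift Operator", QueryDistance: 0.35}

	fmt.Println(boostEligible("Forklift Operator", e, 0.5)) // true
	fmt.Println(boostEligible("Pickers", e, 0.5))           // false
	fmt.Println(boostEligible("CNC Operator", e, 0.5))      // false
}
```

Exact string matching is the simplest possible rule. If the extractor emits variants such as "Forklift Op" for the same role, the comparison would need a canonical role vocabulary instead.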
## What the substrate gets right

- 80% cold-pass-correct on a query distribution it was never
  trained for: strong v2-moe + corpus signal.
- The two queries that did trigger a discovery (Q#2, Q#9) lifted
  cleanly: recorded playbook → warm top-1 = judge-best at rank 0.
  The basic Shape B mechanism works on real-shape queries.
- Cross-pollination *only* fires within same-client+city clusters,
  not across them, so the substrate is not behaving randomly. The
  bleed has clear semantics; it just exceeds what the inject
  gate's per-query scope catches.

## Repro

```bash
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.