First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".
Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.
Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.
Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.
Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.
Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.
Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.
Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading
Repro:
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reality test real_001 — findings
First retrieval probe with real-shape coordinator queries (sourced
from fill_events.parquet via scripts/cutover/gen_real_queries.go),
fed through the standard playbook_lift.sh harness with paraphrase +
rejudge passes disabled (no recorded playbooks for these queries on first touch, so those passes have nothing to measure).
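As a rough illustration of the row-to-query translation, here is a minimal sketch using the template quoted in the commit message ("Need {count} {role}s in {city} {state} starting at {at} for {client}"). The `FillEvent` struct, its field names, and the sample values are assumptions for illustration; the real parquet schema and the actual code in gen_real_queries.go may differ.

```go
package main

import "fmt"

// FillEvent is a hypothetical row shape; the real fill_events.parquet
// schema is not shown in this document.
type FillEvent struct {
	Count  int
	Role   string
	City   string
	State  string
	At     string
	Client string
}

// ToCoordinatorQuery renders one row with the quoted template.
// Pluralization is the template's naive "+s", nothing smarter.
func ToCoordinatorQuery(e FillEvent) string {
	return fmt.Sprintf("Need %d %ss in %s %s starting at %s for %s",
		e.Count, e.Role, e.City, e.State, e.At, e.Client)
}

func main() {
	// Count and At are made-up sample values.
	e := FillEvent{Count: 3, Role: "Forklift Operator", City: "Detroit",
		State: "MI", At: "06:00", Client: "Beacon Freight"}
	fmt.Println(ToCoordinatorQuery(e))
}
```

Note that the "+s" pluralization would double the suffix for a role already stored in plural form, so a real generator presumably has to handle that case.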
Companion to playbook_lift_real_001.md — that file is the
auto-generated harness output; this file captures the reading.
Substrate works on real-shape queries
8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role + city + state + client queries are well within the substrate's competence. This is the headline.
The matrix layer doesn't need to do clever things on these. Cosine on
v2-moe against the 5K worker corpus already surfaces a sensible match
in 8 / 10 cases — the v2-moe upgrade + the workers corpus (with role +
city + skills + certs in the resume text) is doing real work.
Cross-pollination across same-client / same-city queries
Queries #2, #5, #10 all target Beacon Freight in Detroit MI:
| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | e-5617 | e-6193 | 1 | YES: recorded e-6193 fits Forklift Operator Detroit Beacon Freight |
| 5 | Pickers | e-5620 | e-6193 | 0 | no |
| 10 | CNC Operator | w-3759 | e-6193 | 0 | no |
Q#2 records a playbook for e-6193 after the judge promotes it to
rank 0 in the cold pass. Q#5 and Q#10 then inherit e-6193 at warm
top-1 even though:
- Neither query has its own recorded playbook (Playbook column = no).
- Neither warm pass triggers a Shape B inject (boosted = 0).
- The roles are different — Forklift, Pickers, CNC Operator are distinct staffing categories.
So e-6193 is reaching warm-top-1 via Shape A's distance-based boost:
the playbook-corpus entry tagged with Q#2's query text is close enough
in cosine to Q#5 and Q#10's embeddings (same client + city dominate)
that the boost halves the distance and promotes the worker.
For Q#10 specifically, this demoted the cold-pass-correct w-3759
(judge rating 4 at rank 0) in favor of a worker who was approved by
the judge for a different role on a different query.
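The re-ranking described above can be sketched as follows. This is a minimal model, not the matrix layer's actual code: the type names, the `applyShapeA` function, and all distance values are hypothetical. Only the halving behavior and the 0.5 maximum playbook distance come from this document.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a retrieved worker with its cosine distance to the query.
type Hit struct {
	WorkerID string
	Distance float64
}

// PlaybookMatch is a playbook-corpus entry near the new query.
// Distance is measured between the new query and the recorded query text.
type PlaybookMatch struct {
	WorkerID string
	Distance float64
}

// Mirrors DefaultPlaybookMaxDistance = 0.5 as cited in this document.
const maxPlaybookDistance = 0.5

// applyShapeA halves the distance of every hit whose worker appears in a
// playbook match within maxPlaybookDistance, then re-sorts ascending.
// Role is not consulted anywhere, which is the bleed described above.
func applyShapeA(hits []Hit, matches []PlaybookMatch) []Hit {
	boosted := map[string]bool{}
	for _, m := range matches {
		if m.Distance <= maxPlaybookDistance {
			boosted[m.WorkerID] = true
		}
	}
	out := make([]Hit, len(hits))
	copy(out, hits)
	for i := range out {
		if boosted[out[i].WorkerID] {
			out[i].Distance /= 2
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	return out
}

func main() {
	// Illustrative distances only; not taken from the harness output.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.44}}
	matches := []PlaybookMatch{{"e-6193", 0.35}}
	warm := applyShapeA(cold, matches)
	fmt.Println(warm[0].WorkerID) // e-6193 overtakes w-3759 after halving
}
```

With these sample numbers the boosted worker drops from 0.44 to 0.22 and passes the cold top-1 at 0.30, even though nothing in the boost path checks whether the roles agree.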
Why the lift suite missed this
The synthetic playbook_lift_queries.txt uses 7 disjoint scenario
buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
bucket is a distinct semantic neighborhood, so recorded playbook
entries don't compete. The cluster doesn't exist in the synthetic
distribution.
Real coordinator demand clusters on (client, city) because that's
how dispatch traffic is shaped: same client across roles, same city
across days. The Beacon Freight Detroit cluster is exactly the kind of
overlap the synthetic bucketing rules out. So the synthetic harness
reports clean lift numbers while a same-client cluster bleeds.
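One way to check a query set for this kind of overlap before running the harness is to group queries by (client, city) and flag groups that span more than one role. The sketch below is an assumption about how such a check could look; the `Query` struct and `crossRoleClusters` are hypothetical names, and the fourth sample query is a placeholder rather than real data.

```go
package main

import "fmt"

// Query is a hypothetical parsed form of one coordinator query.
type Query struct {
	ID     int
	Role   string
	Client string
	City   string
}

type clusterKey struct{ Client, City string }

// crossRoleClusters returns query IDs grouped by (client, city), keeping
// only the groups that contain at least two distinct roles.
func crossRoleClusters(qs []Query) map[clusterKey][]int {
	ids := map[clusterKey][]int{}
	roles := map[clusterKey]map[string]bool{}
	for _, q := range qs {
		k := clusterKey{q.Client, q.City}
		ids[k] = append(ids[k], q.ID)
		if roles[k] == nil {
			roles[k] = map[string]bool{}
		}
		roles[k][q.Role] = true
	}
	for k := range ids {
		if len(roles[k]) < 2 {
			delete(ids, k) // single-role group: no cross-role bleed possible
		}
	}
	return ids
}

func main() {
	qs := []Query{
		{2, "Forklift Operator", "Beacon Freight", "Detroit"},
		{5, "Pickers", "Beacon Freight", "Detroit"},
		{10, "CNC Operator", "Beacon Freight", "Detroit"},
		{99, "Placeholder Role", "Placeholder Client", "Placeholder City"}, // not real data
	}
	for k, v := range crossRoleClusters(qs) {
		fmt.Println(k.Client, k.City, v)
	}
}
```

Run against a synthetic set built from disjoint buckets, a check like this would be expected to return an empty map, which is the gap the real distribution exposed.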
Why the judge gate doesn't catch it
The judge gate (internal/matrix/judge.go, wired in 5a3364f) is
per-injection at record time — Shape B inject calls
gate.Approve(query, hit) before adding a candidate. It does not
fire at retrieve time. A worker approved for Forklift at Beacon
Freight stays in the playbook corpus and rides along on later
Beacon Freight queries via Shape A boost without a second
judge call.
This is intentional in the design: judge calls are 1-3s on local qwen2.5, so we batch them at record time. But the design didn't anticipate the same-client cluster, where the boost surface is much wider than per-query independence assumes.
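The asymmetry between the record path and the retrieve path can be modeled in a few lines. Everything here is a hypothetical stand-in: only the call name `Approve(query, hit)` comes from this document, and the real signature in internal/matrix/judge.go is not shown, so the types and the always-true verdict below are assumptions.

```go
package main

import "fmt"

// Gate is a hypothetical stand-in for the judge gate.
type Gate interface {
	Approve(query, workerID string) bool
}

// countingGate approves everything and counts how often it is asked.
type countingGate struct{ calls int }

func (g *countingGate) Approve(query, workerID string) bool {
	g.calls++
	return true // stand-in for a real judge verdict
}

// Playbook holds approved workers; entries are never re-gated.
type Playbook struct {
	gate    Gate
	workers []string
}

// Record gates the candidate once, at record time.
func (p *Playbook) Record(query, workerID string) bool {
	if !p.gate.Approve(query, workerID) {
		return false
	}
	p.workers = append(p.workers, workerID)
	return true
}

// BoostCandidates is the retrieve-time view: it makes no gate call,
// so an earlier approval carries over to every later query.
func (p *Playbook) BoostCandidates(query string) []string {
	return p.workers
}

func main() {
	g := &countingGate{}
	p := &Playbook{gate: g}
	p.Record("Forklift Operator, Beacon Freight, Detroit", "e-6193")
	p.BoostCandidates("Pickers, Beacon Freight, Detroit")
	p.BoostCandidates("CNC Operator, Beacon Freight, Detroit")
	fmt.Println(g.calls) // one judge call covers three queries
}
```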
Mitigation options (none yet implemented)
In rough order of cost:
1. Role-scoped playbook corpus: include role (extracted at record time, possibly via the existing demand-parser) as metadata on each playbook entry. Restrict Shape A boost to matches where query-role and playbook-role agree. Cheap; doesn't need a new judge call.
2. Tighten Shape A distance for cross-role queries: currently DefaultPlaybookMaxDistance = 0.5. If the new query embeds close in (client, city) but far in role, the boost still fires because cosine doesn't separate the axes. Could derive a tighter threshold from intra-role vs inter-role distance distributions.
3. Per-retrieve judge re-gate: call the judge on the warm top-1 vs cold top-1 and demote the warm result if the judge prefers cold. Highest correctness, ~2× retrieve latency. Not viable for the hot path.
(1) is the obvious first fix. The role-extractor already exists
(see internal/chat/inbox.go LLM-parsed inbox demands; same
qwen2.5 format=json shape can run on playbook record).
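A minimal sketch of option (1), assuming the role has already been extracted for both the query and the playbook entry. The `PlaybookEntry` struct and `shouldBoost` are hypothetical names, and the case-insensitive exact match is an assumption; a real implementation might need to normalize role variants first.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaybookEntry carries the proposed role metadata alongside the
// distance from the new query to the recorded query text.
type PlaybookEntry struct {
	WorkerID string
	Role     string
	Distance float64
}

// Mirrors DefaultPlaybookMaxDistance = 0.5 as cited in this document.
const maxPlaybookDistance = 0.5

// shouldBoost allows the Shape A boost only when the entry is within
// range AND its recorded role matches the query's role.
func shouldBoost(queryRole string, e PlaybookEntry) bool {
	if e.Distance > maxPlaybookDistance {
		return false
	}
	return strings.EqualFold(strings.TrimSpace(queryRole), strings.TrimSpace(e.Role))
}

func main() {
	// Distance is an illustrative value, not a measured one.
	entry := PlaybookEntry{WorkerID: "e-6193", Role: "Forklift Operator", Distance: 0.35}
	fmt.Println(shouldBoost("Forklift Operator", entry)) // true
	fmt.Println(shouldBoost("CNC Operator", entry))      // false
}
```

Under this gate the Q#10 case would keep its cold top-1, since the only nearby playbook entry was recorded for a different role.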
What the substrate gets right
- 80% cold-pass-correct on a query distribution it was never trained for: a strong v2-moe + corpus signal.
- The two queries where discovery did fire (Q#2, Q#9) lifted cleanly: recorded playbook → warm top-1 = judge-best at rank 0. The basic Shape B mechanism works on real-shape queries.
- Cross-pollination only fires within same-client+city clusters, not across them — the substrate is not behaving randomly. The bleed has clear semantics; it just exceeds what the inject gate's per-query scope catches.
Repro
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
Evidence: reports/reality-tests/playbook_lift_real_001.{json,md}.