root 7f2f112e6a reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed

First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 20:18:40 -05:00

5.4 KiB


Reality test real_001 — findings

First retrieval probe with real-shape coordinator queries (sourced from fill_events.parquet via scripts/cutover/gen_real_queries.go), fed through the standard playbook_lift.sh harness with the paraphrase and rejudge passes disabled. These queries have no recorded playbooks on first touch, so those passes would have nothing to measure.
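
The row-to-query translation can be pictured with a minimal sketch. It assumes a `FillEvent` struct and a `CoordinatorQuery` helper, both hypothetical names; the real column names in fill_events.parquet and the actual code in gen_real_queries.go may differ. Only the sentence template is taken from the commit message, and the count and start time in `main` are made-up illustrative values.

```go
package main

import "fmt"

// FillEvent is a hypothetical stand-in for one row of fill_events.parquet.
// The real schema may use different column names and types.
type FillEvent struct {
	Count  int
	Role   string
	City   string
	State  string
	At     string
	Client string
}

// CoordinatorQuery renders a row with the template quoted in the commit
// message: "Need {count} {role}s in {city} {state} starting at {at} for {client}".
func CoordinatorQuery(e FillEvent) string {
	return fmt.Sprintf("Need %d %ss in %s %s starting at %s for %s",
		e.Count, e.Role, e.City, e.State, e.At, e.Client)
}

func main() {
	// Count and At are illustrative values, not data from the parquet file.
	e := FillEvent{
		Count:  3,
		Role:   "Forklift Operator",
		City:   "Detroit",
		State:  "MI",
		At:     "06:00",
		Client: "Beacon Freight",
	}
	fmt.Println(CoordinatorQuery(e))
}
```

Because client, city, and state appear verbatim in every generated sentence, queries for the same client and city share most of their tokens regardless of role.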

Companion to playbook_lift_real_001.md — that file is the auto-generated harness output; this file captures the reading.

Substrate works on real-shape queries

8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role + city + state + client queries are well within the substrate's competence. This is the headline.

The matrix layer doesn't need to do clever things on these. Cosine on v2-moe against the 5K worker corpus already surfaces a sensible match in 8 / 10 cases — the v2-moe upgrade + the workers corpus (with role + city + skills + certs in the resume text) is doing real work.

Cross-pollination across same-client / same-city queries

Queries #2, #5, #10 all target Beacon Freight in Detroit MI:

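| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|------|------------|------------|---------|----------|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES, recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |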

Q#2 records a playbook for e-6193 after the judge promotes it to rank 0 in the cold pass. Q#5 and Q#10 then inherit e-6193 at warm top-1 even though:

- Neither query has its own recorded playbook (Playbook column = no).
- Neither warm pass triggers a Shape B inject (Boosted = 0).
- The roles are different: Forklift Operator, Pickers, and CNC Operator are distinct staffing categories.

So e-6193 is reaching warm top-1 via Shape A's distance-based boost. The playbook-corpus entry is tagged with Q#2's query text, and because client and city dominate the embedding, that text sits close enough in cosine distance to Q#5 and Q#10 for the boost to fire. The boost halves the worker's distance and promotes it.

For Q#10 specifically, this demoted the cold-pass-correct w-3759 (judge rating 4 at rank 0) in favor of a worker who was approved by the judge for a different role on a different query.
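
The mechanism can be sketched as follows. This is a toy model under stated assumptions, not the matrix layer's code: `Hit`, `PlaybookEntry`, and `ApplyShapeA` are hypothetical names, the halving and the 0.5 cutoff are taken from the text, and every distance in `main` is an illustrative number rather than a value from the harness output.

```go
package main

import (
	"fmt"
	"sort"
)

// DefaultPlaybookMaxDistance is the cutoff named in the text.
const DefaultPlaybookMaxDistance = 0.5

// Hit is a hypothetical retrieval result; lower distance ranks higher.
type Hit struct {
	WorkerID string
	Distance float64 // cosine distance from the query to the worker
}

// PlaybookEntry is a hypothetical playbook-corpus match for the query.
type PlaybookEntry struct {
	WorkerID string
	Distance float64 // cosine distance from the query to the recorded query text
}

// ApplyShapeA halves the distance of any hit whose worker has a playbook
// entry within the cutoff, then re-ranks. Note that role is never consulted.
func ApplyShapeA(hits []Hit, entries []PlaybookEntry) []Hit {
	boosted := map[string]bool{}
	for _, e := range entries {
		if e.Distance <= DefaultPlaybookMaxDistance {
			boosted[e.WorkerID] = true
		}
	}
	out := make([]Hit, len(hits))
	copy(out, hits)
	for i := range out {
		if boosted[out[i].WorkerID] {
			out[i].Distance /= 2
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Distance < out[j].Distance })
	return out
}

func main() {
	// Illustrative distances only: a CNC Operator query where w-3759 wins cold.
	cold := []Hit{{"w-3759", 0.30}, {"e-6193", 0.44}}
	// The Forklift playbook entry is near because client and city match.
	playbook := []PlaybookEntry{{"e-6193", 0.20}}
	warm := ApplyShapeA(cold, playbook)
	fmt.Println(warm[0].WorkerID)
}
```

In this toy run the boosted worker overtakes the cold-pass winner purely because the recorded query text is nearby, which is the same shape as the Q#10 demotion.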

Why the lift suite missed this

The synthetic playbook_lift_queries.txt uses 7 disjoint scenario buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each bucket is a distinct semantic neighborhood, so recorded playbook entries don't compete. The cluster doesn't exist in the synthetic distribution.

Real coordinator demand clusters on (client, city) because that is the shape of dispatch traffic: the same client across roles, the same city across days. The synthetic bucketing rules out exactly the kind of cluster that Beacon Freight Detroit forms, so the synthetic harness reports clean lift numbers while a same-client cluster bleeds.

Why the judge gate doesn't catch it

The judge gate (internal/matrix/judge.go, wired in 5a3364f) is per-injection at record time — Shape B inject calls gate.Approve(query, hit) before adding a candidate. It does not fire at retrieve time. A worker approved for Forklift at Beacon Freight stays in the playbook corpus and rides along on later Beacon Freight queries via Shape A boost without a second judge call.

This is intentional in the design: judge calls are 1-3s on local qwen2.5, so we batch them at record time. But the design didn't anticipate the same-client cluster, where the boost surface is much wider than per-query independence assumes.
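
A small sketch of that timing gap, under loud assumptions: the `Gate` interface only mirrors the `gate.Approve(query, hit)` call named above, with parameter and return types guessed; `CountingGate`, `Corpus`, `Record`, and `BoostCandidates` are hypothetical names; and the distance filter is omitted so the example stays focused on when the judge runs.

```go
package main

import (
	"fmt"
	"sort"
)

// Gate mirrors the gate.Approve(query, hit) call from the text.
// The string parameters and bool result are assumptions.
type Gate interface {
	Approve(query, workerID string) bool
}

// CountingGate is a hypothetical stub judge: it approves everything and
// counts how often it was asked.
type CountingGate struct{ Calls int }

func (g *CountingGate) Approve(query, workerID string) bool {
	g.Calls++
	return true
}

// Corpus is a toy playbook corpus: recorded query text -> approved worker.
type Corpus struct {
	gate    Gate
	entries map[string]string
}

func NewCorpus(g Gate) *Corpus {
	return &Corpus{gate: g, entries: map[string]string{}}
}

// Record is the only path that consults the judge.
func (c *Corpus) Record(query, workerID string) bool {
	if !c.gate.Approve(query, workerID) {
		return false
	}
	c.entries[query] = workerID
	return true
}

// BoostCandidates is what a later retrieve sees: every recorded worker,
// with no gate call. Distance filtering is omitted in this sketch.
func (c *Corpus) BoostCandidates(query string) []string {
	ids := make([]string, 0, len(c.entries))
	for _, id := range c.entries {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	g := &CountingGate{}
	c := NewCorpus(g)
	c.Record("Forklift Operator Detroit Beacon Freight", "e-6193")
	fmt.Println(c.BoostCandidates("Pickers Detroit Beacon Freight"))
	fmt.Println(c.BoostCandidates("CNC Operator Detroit Beacon Freight"))
	fmt.Println("judge calls:", g.Calls)
}
```

The judge is consulted once, for the Forklift record; both later retrieves see e-6193 as a boost candidate without any further approval.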

Mitigation options (none yet implemented)

In rough order of cost:

1. Role-scoped playbook corpus: include role (extracted at record time, possibly via the existing demand-parser) as metadata on each playbook entry. Restrict Shape A boost to matches where query-role and playbook-role agree. Cheap; doesn't need a new judge call.
2. Tighten Shape A distance for cross-role queries: currently DefaultPlaybookMaxDistance = 0.5. If the new query embeds close in (client, city) but far in role, the boost still fires because cosine doesn't separate the axes. Could derive a tighter threshold from intra-role vs inter-role distance distributions.
3. Per-retrieve judge re-gate: call the judge on the warm top-1 vs cold top-1 and demote the warm result if the judge prefers cold. Highest correctness, ~2× retrieve latency. Not viable for the hot path.

(1) is the obvious first fix. The role extractor already exists: internal/chat/inbox.go parses inbox demands with an LLM, and the same qwen2.5 format=json call shape can run at playbook record time.
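
A minimal sketch of what the role-scoped gate might look like, assuming the role has already been extracted for both the query and the playbook entry. `PlaybookEntry.Role`, `normalizeRole`, and `BoostEligible` are hypothetical names, and treating a missing role as "no boost" is a design assumption of this sketch, not something the source specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// DefaultPlaybookMaxDistance is the cutoff named in the text.
const DefaultPlaybookMaxDistance = 0.5

// PlaybookEntry is a hypothetical playbook-corpus entry carrying the role
// extracted at record time as metadata.
type PlaybookEntry struct {
	WorkerID string
	Role     string
	Distance float64 // cosine distance from the query to the recorded query text
}

// normalizeRole makes the comparison case- and whitespace-insensitive.
func normalizeRole(r string) string {
	return strings.ToLower(strings.TrimSpace(r))
}

// BoostEligible allows a Shape A boost only when the entry is within the
// distance cutoff AND the roles agree. An unknown role on either side fails
// closed (assumption: a missed boost is cheaper than a cross-role bleed).
func BoostEligible(queryRole string, e PlaybookEntry) bool {
	if e.Distance > DefaultPlaybookMaxDistance {
		return false
	}
	q, p := normalizeRole(queryRole), normalizeRole(e.Role)
	return q != "" && q == p
}

func main() {
	// Illustrative distance; the role and worker ID come from the Q#2 playbook.
	entry := PlaybookEntry{WorkerID: "e-6193", Role: "Forklift Operator", Distance: 0.20}
	fmt.Println(BoostEligible("Forklift Operator", entry))
	fmt.Println(BoostEligible("CNC Operator", entry))
}
```

With this check in front of the boost, the Q#5 and Q#10 cases would keep their cold ranking, while a repeat Forklift Operator query for the same client would still benefit from the recorded playbook. Exact string matching is the simplest policy; whether near-synonym roles should share boosts is left open here.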

What the substrate gets right

- 80% cold-pass-correct on a query distribution it was never trained for: a strong v2-moe + corpus signal.
- The two queries that triggered discovery (Q#2, Q#9) lifted cleanly: recorded playbook → warm top-1 = judge-best at rank 0. The basic Shape B mechanism works on real-shape queries.
- Cross-pollination only fires within same-client+city clusters, not across them, so the substrate is not behaving randomly. The bleed has clear semantics; it just exceeds what the inject gate's per-query scope catches.

Repro

go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh

Evidence: reports/reality-tests/playbook_lift_real_001.{json,md}.
