reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed

First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-30 20:18:40 -05:00
parent 5687ec65c2
commit 7f2f112e6a
5 changed files with 339 additions and 4 deletions


@@ -220,10 +220,11 @@ The list is intentionally short. Items move to closed when the work demands them
| # | Item | When to act |
|---|---|---|
| 1 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
| 2 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
| 3 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
| 4 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
| 1 | **Same-client+city cross-role bleed** — reality_test real_001 (`reports/reality-tests/real_001_findings.md`) caught Shape A boost from a recorded playbook hitting different-role queries on same `(client, city)` cluster. Fix candidate: role-scoped playbook corpus metadata + Shape A boost gate on role match. The judge gate handles per-injection but doesn't fire at retrieve time. | Before real coordinator demand actually starts hitting the substrate. |
| 2 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
| 3 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
| 4 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
| 5 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
---
@@ -261,6 +262,7 @@ The list is intentionally short. Items move to closed when the work demands them
| `68d9e55` | shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2 |
| `a2fa9a2` | scrum_review: pipe diff via temp files — fixes argv overflow on large bundles |
| (prep) | G5 cutover prep: `embed_parity` probe — Rust `/ai/embed` ↔ Go `/v1/embed` 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in `reports/cutover/SUMMARY.md`. Wire-format remap (`embeddings`/`vectors`, `dimensions`/`dimension`) is the only real cutover work; math is provably equivalent. |
| (probe) | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Becomes new OPEN #1. Findings: `reports/reality-tests/real_001_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@@ -0,0 +1,86 @@
# Playbook-Lift Reality Test — Run real_001
**Generated:** 2026-05-01T01:15:54.879146128Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/real_coord_queries.txt` (10 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_001.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 10 |
| Cold-pass discoveries (judge-best ≠ top-1) | 2 |
| Warm-pass lifts (recorded playbook → top-1) | 2 |
| No change (judge-best already top-1, no playbook needed) | 8 |
| Playbook boosts triggered (warm pass) | 2 |
| Mean Δ top-1 distance (warm − cold) | -0.12680706 |
**Verbatim lift rate:** 2 of 2 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | w-4781 | 0/4 | — | w-4781 | 0 | no |
| 2 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | e-5617 | 2/4 | ✓ e-6193 | e-6193 | 0 | **YES** |
| 3 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-9877 | 0/4 | — | e-9877 | 0 | no |
| 4 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | w-3370 | 0/4 | — | w-3370 | 0 | no |
| 5 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | e-5620 | 1/2 | — | e-6193 | 2 | no |
| 6 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-1746 | 0/2 | — | e-1746 | 0 | no |
| 7 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-4124 | 0/2 | — | w-4124 | 0 | no |
| 8 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-7325 | 0/2 | — | e-7325 | 0 | no |
| 9 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 4/4 | ✓ w-1367 | w-1367 | 0 | **YES** |
| 10 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | w-3759 | 0/4 | — | e-6193 | 1 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If the judge model rates badly,
the lift number is meaningless. To validate the judge itself, sample 5–10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
config [models].local_judge.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@@ -0,0 +1,129 @@
# Reality test real_001 — findings
First retrieval probe with **real-shape coordinator queries** (sourced
from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`),
fed through the standard `playbook_lift.sh` harness with paraphrase
+ rejudge passes disabled (no recorded playbooks for these queries
on first touch, so those passes have nothing to measure).
Companion to `playbook_lift_real_001.md` — that file is the
auto-generated harness output; this file captures the reading.
---
## Substrate works on real-shape queries
8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role +
city + state + client queries are well within the substrate's competence.
This is the headline.
The matrix layer doesn't need to do clever things on these. Cosine on
`v2-moe` against the 5K worker corpus already surfaces a sensible match
in 8 / 10 cases — the v2-moe upgrade + the workers corpus (with role +
city + skills + certs in the resume text) is doing real work.
## Cross-pollination across same-client / same-city queries
Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:
| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES — recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |
Q#2 records a playbook for `e-6193` after the judge promotes it to
rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm
top-1 even though:
- Neither query has its own recorded playbook (column 4 = no).
- Neither warm pass triggers a Shape B inject (boosted = 0).
- The roles are *different* — Forklift, Pickers, CNC Operator are
distinct staffing categories.
So `e-6193` is reaching warm-top-1 via Shape A's distance-based boost:
the playbook-corpus entry tagged with Q#2's query text is close enough
in cosine to Q#5 and Q#10's embeddings (same client + city dominate)
that the boost halves the distance and promotes the worker.
For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`**
(judge rating 4 at rank 0) in favor of a worker who was approved by
the judge for a *different role* on a *different query*.
## Why the lift suite missed this
The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario
buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
bucket is a distinct semantic neighborhood, so recorded playbook
entries don't compete. The cluster doesn't exist in the synthetic
distribution.
Real coordinator demand clusters on `(client, city)` because that's
how dispatch traffic takes shape: same client across roles, same city
across days. The Beacon Freight Detroit cluster is what the synthetic
bucketing prevents. So the synthetic harness reports clean lift
numbers while a same-client cluster bleeds.
## Why the judge gate doesn't catch it
The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is
**per-injection at record time** — Shape B inject calls
`gate.Approve(query, hit)` before adding a candidate. It does not
fire at retrieve time. A worker approved for Forklift at Beacon
Freight stays in the playbook corpus and rides along on later
Beacon Freight queries via Shape A boost without a second
judge call.
This is intentional in the design: judge calls are 1-3s on local
qwen2.5, so we batch them at record time. But the design didn't
anticipate the same-client cluster, where the boost surface is
much wider than per-query independence assumes.
## Mitigation options (none yet implemented)
In rough order of cost:
1. **Role-scoped playbook corpus** — include `role` (extracted at
record time, possibly via the existing demand-parser) as
metadata on each playbook entry. Restrict Shape A boost to
matches where query-role and playbook-role agree. Cheap;
doesn't need a new judge call.
2. **Tighten Shape A distance for cross-role queries** — currently
`DefaultPlaybookMaxDistance = 0.5`. If the new query embeds
close in `(client, city)` but far in `role`, the boost still
fires because cosine doesn't separate the axes. Could derive
a tighter threshold from intra-role vs inter-role distance
distributions.
3. **Per-retrieve judge re-gate** — call the judge on the warm
top-1 vs cold top-1 and demote the warm result if the judge
prefers cold. Highest correctness, ~2× retrieve latency. Not
viable for the hot path.
(1) is the obvious first fix. The role-extractor already exists
(see the LLM-parsed inbox demands in `internal/chat/inbox.go`; the
same qwen2.5 format=json shape can run at playbook record time).
## What the substrate gets right
- 80% cold-pass-correct on a query distribution it was never trained
on — strong v2-moe + corpus signal.
- The two queries that did discover (Q#2, Q#9) lifted cleanly:
recorded playbook → warm top-1 = judge-best at rank 0. The
basic Shape B mechanism works on real-shape queries.
- Cross-pollination *only* fires within same-client+city clusters,
not across them — the substrate is not behaving randomly. The
bleed has clear semantics; it just exceeds what the inject
gate's per-query scope catches.
## Repro
```bash
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.


@@ -0,0 +1,100 @@
// gen_real_queries — pull N rows from fill_events.parquet and translate
// each into a coordinator-style natural-language query.
//
// Output: one query per line, written to stdout (intended for redirect
// into tests/reality/real_coord_queries.txt and then fed to
// scripts/playbook_lift.sh via QUERIES_FILE=<path>).
//
// Why: the lift harness's standard query corpus is hand-crafted to
// stress multi-constraint matching. Real coordinator demand has a
// different distribution — single-role, single-geo, count + time —
// and we want to probe whether the substrate handles that shape too.
// The fill_events parquet on the Rust side is the closest thing to
// "real demand" we have on disk (123 rows, sourced from staffing
// fixture generation but shaped like genuine fills).
package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	src := flag.String("src", "/home/profit/lakehouse/data/datasets/fill_events.parquet", "fill_events parquet path")
	limit := flag.Int("limit", 10, "number of queries to generate")
	flag.Parse()

	r, err := file.OpenParquetFile(*src, false)
	if err != nil {
		log.Fatalf("open %s: %v", *src, err)
	}
	defer r.Close()

	pr, err := pqarrow.NewFileReader(r, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
	if err != nil {
		log.Fatalf("pqarrow reader: %v", err)
	}
	tbl, err := pr.ReadTable(context.Background())
	if err != nil {
		log.Fatalf("read table: %v", err)
	}
	defer tbl.Release()

	// Field order must match parquet schema (see scripts/cutover dev probe):
	// 3=client, 5=city, 6=state, 7=role, 8=count, 10=at, 12=deadline.
	client := tbl.Column(3).Data().Chunk(0)
	city := tbl.Column(5).Data().Chunk(0)
	state := tbl.Column(6).Data().Chunk(0)
	role := tbl.Column(7).Data().Chunk(0)
	count := tbl.Column(8).Data().Chunk(0)
	at := tbl.Column(10).Data().Chunk(0)
	deadline := tbl.Column(12).Data().Chunk(0)

	n := int(tbl.NumRows())
	if *limit < n {
		n = *limit
	}

	fmt.Println("# Real-shape coordinator queries — generated from fill_events.parquet")
	fmt.Println("# (real-shape demand data; queries built mechanically from event rows).")
	fmt.Printf("# Source: %s (%d rows total, %d emitted)\n", *src, tbl.NumRows(), n)
	fmt.Println("#")
	fmt.Println("# Format: client + count + role + city/state + start time +")
	fmt.Println("# (optional deadline). Mimics the natural language a coordinator would")
	fmt.Println("# type into a dispatch tool when triaging the next-up demand.")
	fmt.Println()

	for i := 0; i < n; i++ {
		c := client.ValueStr(i)
		cy := city.ValueStr(i)
		st := state.ValueStr(i)
		ro := role.ValueStr(i)
		ct := count.ValueStr(i)
		t := at.ValueStr(i)
		dl := deadline.ValueStr(i)

		// Mechanical template: count + role + city/state + start time +
		// client, plus an optional deadline clause — the shape a
		// coordinator types when triaging the next-up demand.
		q := fmt.Sprintf("Need %s %s in %s %s starting at %s for %s", ct, pluralize(ro, ct), cy, st, t, c)
		if dl != "" && dl != "(null)" {
			q += fmt.Sprintf(", deadline %s", dl)
		}
		fmt.Println(q)
	}
}

func pluralize(role, count string) string {
	if count == "1" {
		return role
	}
	// "Warehouse Associate" → "Warehouse Associates"; "Loader" → "Loaders".
	// Naive but fits the staffing-domain vocabulary in fill_events.
	return role + "s"
}


@@ -0,0 +1,18 @@
# Real-shape coordinator queries — generated from fill_events.parquet
# (real-shape demand data; queries built mechanically from event rows).
# Source: /home/profit/lakehouse/data/datasets/fill_events.parquet (123 rows total, 10 emitted)
#
# Format: client + count + role + city/state + start time +
# (optional deadline). Mimics the natural language a coordinator would
# type into a dispatch tool when triaging the next-up demand.
Need 5 Warehouse Associates in Kansas City MO starting at 09:00 for Parallel Machining
Need 1 Forklift Operator in Detroit MI starting at 15:00 for Beacon Freight, deadline 2026-05-28
Need 4 Loaders in Indianapolis IN starting at 12:00 for Midway Distribution
Need 3 Warehouse Associates in Fort Wayne IN starting at 17:30 for Cornerstone Fabrication, deadline 2026-05-17
Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Freight, deadline 2026-05-28
Need 2 Packers in Joliet IL starting at 09:30 for Parallel Machining
Need 3 Assemblers in Flint MI starting at 08:30 for Heritage Foods
Need 3 Packers in Flint MI starting at 12:30 for Parallel Machining
Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pioneer Assembly
Need 1 CNC Operator in Detroit MI starting at 17:30 for Beacon Freight, deadline 2026-05-28