reality_test real_001: real-shape coordinator queries — surfaces cross-role bleed

First retrieval probe with non-synthetic query distribution. Pulls
N rows from /home/profit/lakehouse/data/datasets/fill_events.parquet
(real-shape demand data) and translates each to the natural language
a coordinator would type: "Need {count} {role}s in {city} {state}
starting at {at} for {client}".

Headline: 8/10 cold-pass top-1 = judge-best on real distribution.
Substrate works on queries it was never trained for. v2-moe + workers
corpus carry the load.

Surfaced finding (the real value of running this): same-client+city
queries cluster, and Shape A's distance boost bleeds across roles
within the cluster. Q#2 (Forklift @ Beacon Freight Detroit) records
e-6193 in the playbook corpus. Q#5 (Pickers same client+city) and
Q#10 (CNC Operator same client+city) inherit e-6193 at warm top-1
even though:
- Neither query has its own recorded playbook.
- Neither warm pass triggers a Shape B inject (boosted=0).
- The roles are different staffing categories.

Q#10 specifically demoted the cold-pass-correct w-3759 (judge rating
4 at rank 0) for a worker who was approved by the judge for a
different role on a different query.

Why the lift suite missed it: synthetic queries use 7 disjoint
scenario buckets (forklift+OSHA+WI / CDL+IL / etc.). Real demand
clusters on (client, city). The cluster doesn't exist in the
synthetic distribution.

Why the judge gate doesn't catch it: the gate (5a3364f) is
per-injection at record time. After approval the worker rides Shape A
distance boosts on all later same-cluster queries with no second
gate call.

Becomes new OPEN #1. Fix candidate: role-scoped playbook corpus
metadata + Shape A boost gate on role match. Cheap; doesn't need
new judge calls.

Files:
- scripts/cutover/gen_real_queries.go: parquet → coordinator NL
- tests/reality/real_coord_queries.txt: 10 generated queries
- reports/reality-tests/playbook_lift_real_001.md: harness output
- reports/reality-tests/real_001_findings.md: the reading

Repro:
  go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
  QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
    WITH_PARAPHRASE=0 WITH_REJUDGE=0 ./scripts/playbook_lift.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-30 20:18:40 -05:00
parent 5687ec65c2
commit 7f2f112e6a
5 changed files with 339 additions and 4 deletions


@@ -220,10 +220,11 @@ The list is intentionally short. Items move to closed when the work demands them
| # | Item | When to act |
|---|---|---|
| 1 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
| 2 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
| 3 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
| 4 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
| 1 | **Same-client+city cross-role bleed** — reality_test real_001 (`reports/reality-tests/real_001_findings.md`) caught Shape A boost from a recorded playbook hitting different-role queries on same `(client, city)` cluster. Fix candidate: role-scoped playbook corpus metadata + Shape A boost gate on role match. The judge gate handles per-injection but doesn't fire at retrieve time. | Before real coordinator demand actually starts hitting the substrate. |
| 2 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
| 3 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
| 4 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
| 5 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
---
@@ -261,6 +262,7 @@ The list is intentionally short. Items move to closed when the work demands them
| `68d9e55` | shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2 |
| `a2fa9a2` | scrum_review: pipe diff via temp files — fixes argv overflow on large bundles |
| (prep) | G5 cutover prep: `embed_parity` probe — Rust `/ai/embed` ↔ Go `/v1/embed` 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in `reports/cutover/SUMMARY.md`. Wire-format remap (`embeddings`/`vectors`, `dimensions`/`dimension`) is the only real cutover work; math is provably equivalent. |
| (probe) | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Becomes new OPEN #1. Findings: `reports/reality-tests/real_001_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@@ -0,0 +1,86 @@
# Playbook-Lift Reality Test — Run real_001
**Generated:** 2026-05-01T01:15:54.879146128Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/real_coord_queries.txt` (10 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_001.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 10 |
| Cold-pass discoveries (judge-best ≠ top-1) | 2 |
| Warm-pass lifts (recorded playbook → top-1) | 2 |
| No change (judge-best already top-1, no playbook needed) | 8 |
| Playbook boosts triggered (warm pass) | 2 |
| Mean Δ top-1 distance (warm − cold) | -0.12680706 |
**Verbatim lift rate:** 2 of 2 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | w-4781 | 0/4 | — | w-4781 | 0 | no |
| 2 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | e-5617 | 2/4 | ✓ e-6193 | e-6193 | 0 | **YES** |
| 3 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-9877 | 0/4 | — | e-9877 | 0 | no |
| 4 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | w-3370 | 0/4 | — | w-3370 | 0 | no |
| 5 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | e-5620 | 1/2 | — | e-6193 | 2 | no |
| 6 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-1746 | 0/2 | — | e-1746 | 0 | no |
| 7 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-4124 | 0/2 | — | w-4124 | 0 | no |
| 8 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-7325 | 0/2 | — | e-7325 | 0 | no |
| 9 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 4/4 | ✓ w-1367 | w-1367 | 0 | **YES** |
| 10 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | w-3759 | 0/4 | — | e-6193 | 1 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If the judge model rates badly,
the lift number is meaningless. To validate the judge itself, sample 5–10
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
config [models].local_judge.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@@ -0,0 +1,129 @@
# Reality test real_001 — findings
First retrieval probe with **real-shape coordinator queries** (sourced
from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`),
fed through the standard `playbook_lift.sh` harness with paraphrase
+ rejudge passes disabled (no recorded playbooks for these queries
on first touch, so those passes have nothing to measure).
Companion to `playbook_lift_real_001.md` — that file is the
auto-generated harness output; this file captures the reading.
---
## Substrate works on real-shape queries
8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role +
city + state + client queries are well within the substrate's competence.
This is the headline.
The matrix layer doesn't need to do clever things on these. Cosine on
`v2-moe` against the 5K worker corpus already surfaces a sensible match
in 8 / 10 cases — the v2-moe upgrade + the workers corpus (with role +
city + skills + certs in the resume text) is doing real work.
## Cross-pollination across same-client / same-city queries
Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:
| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
|---|---|---|---|---|---|
| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES — recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |
Q#2 records a playbook for `e-6193` after the judge promotes it to
rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm
top-1 even though:
- Neither query has its own recorded playbook (column 4 = no).
- Neither warm pass triggers a Shape B inject (boosted = 0).
- The roles are *different* — Forklift, Pickers, CNC Operator are
distinct staffing categories.
So `e-6193` is reaching warm-top-1 via Shape A's distance-based boost:
the playbook-corpus entry tagged with Q#2's query text is close enough
in cosine to Q#5 and Q#10's embeddings (same client + city dominate)
that the boost halves the distance and promotes the worker.
For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`**
(judge rating 4 at rank 0) in favor of a worker who was approved by
the judge for a *different role* on a *different query*.
## Why the lift suite missed this
The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario
buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
bucket is a distinct semantic neighborhood, so recorded playbook
entries don't compete. The cluster doesn't exist in the synthetic
distribution.
Real coordinator demand clusters on `(client, city)` because that's
how dispatch traffic takes shape: same client across roles, same city
across days. The Beacon Freight Detroit cluster is what the synthetic
bucketing prevents. So the synthetic harness reports clean lift
numbers while a same-client cluster bleeds.
## Why the judge gate doesn't catch it
The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is
**per-injection at record time** — Shape B inject calls
`gate.Approve(query, hit)` before adding a candidate. It does not
fire at retrieve time. A worker approved for Forklift at Beacon
Freight stays in the playbook corpus and rides along on later
Beacon Freight queries via Shape A boost without a second
judge call.
This is intentional in the design: judge calls are 1-3s on local
qwen2.5, so we batch them at record time. But the design didn't
anticipate the same-client cluster, where the boost surface is
much wider than per-query independence assumes.
## Mitigation options (none yet implemented)
In rough order of cost:
1. **Role-scoped playbook corpus** — include `role` (extracted at
record time, possibly via the existing demand-parser) as
metadata on each playbook entry. Restrict Shape A boost to
matches where query-role and playbook-role agree. Cheap;
doesn't need a new judge call.
2. **Tighten Shape A distance for cross-role queries** — currently
`DefaultPlaybookMaxDistance = 0.5`. If the new query embeds
close in `(client, city)` but far in `role`, the boost still
fires because cosine doesn't separate the axes. Could derive
a tighter threshold from intra-role vs inter-role distance
distributions.
3. **Per-retrieve judge re-gate** — call the judge on the warm
top-1 vs cold top-1 and demote the warm result if the judge
prefers cold. Highest correctness, ~2× retrieve latency. Not
viable for the hot path.
(1) is the obvious first fix. The role-extractor already exists
(see the LLM-parsed inbox demands in `internal/chat/inbox.go`; the
same qwen2.5 format=json shape can run at playbook record time).
## What the substrate gets right
- 80% cold-pass-correct on a query distribution it was never trained
on — strong v2-moe + corpus signal.
- The two queries that did discover (Q#2, Q#9) lifted cleanly:
recorded playbook → warm top-1 = judge-best at rank 0. The
basic Shape B mechanism works on real-shape queries.
- Cross-pollination *only* fires within same-client+city clusters,
not across them — the substrate is not behaving randomly. The
bleed has clear semantics; it just exceeds what the inject
gate's per-query scope catches.
## Repro
```bash
go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.


@@ -0,0 +1,100 @@
// gen_real_queries — pull N rows from fill_events.parquet and translate
// each into a coordinator-style natural-language query.
//
// Output: one query per line, written to stdout (intended for redirect
// into tests/reality/real_coord_queries.txt and then fed to
// scripts/playbook_lift.sh via QUERIES_FILE=<path>).
//
// Why: the lift harness's standard query corpus is hand-crafted to
// stress multi-constraint matching. Real coordinator demand has a
// different distribution — single-role, single-geo, count + time —
// and we want to probe whether the substrate handles that shape too.
// The fill_events parquet on the Rust side is the closest thing to
// "real demand" we have on disk (123 rows, sourced from staffing
// fixture generation but shaped like genuine fills).
package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet/file"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	src := flag.String("src", "/home/profit/lakehouse/data/datasets/fill_events.parquet", "fill_events parquet path")
	limit := flag.Int("limit", 10, "number of queries to generate")
	flag.Parse()

	r, err := file.OpenParquetFile(*src, false)
	if err != nil {
		log.Fatalf("open %s: %v", *src, err)
	}
	defer r.Close()

	pr, err := pqarrow.NewFileReader(r, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
	if err != nil {
		log.Fatalf("pqarrow reader: %v", err)
	}
	tbl, err := pr.ReadTable(context.Background())
	if err != nil {
		log.Fatalf("read table: %v", err)
	}
	defer tbl.Release()

	// Field order must match parquet schema (see scripts/cutover dev probe):
	// 3=client, 5=city, 6=state, 7=role, 8=count, 10=at, 12=deadline.
	client := tbl.Column(3).Data().Chunk(0)
	city := tbl.Column(5).Data().Chunk(0)
	state := tbl.Column(6).Data().Chunk(0)
	role := tbl.Column(7).Data().Chunk(0)
	count := tbl.Column(8).Data().Chunk(0)
	at := tbl.Column(10).Data().Chunk(0)
	deadline := tbl.Column(12).Data().Chunk(0)

	n := int(tbl.NumRows())
	if *limit < n {
		n = *limit
	}

	fmt.Println("# Real-shape coordinator queries — generated from fill_events.parquet")
	fmt.Println("# (real-shape demand data; queries built mechanically from event rows).")
	fmt.Printf("# Source: %s (%d rows total, %d emitted)\n", *src, tbl.NumRows(), n)
	fmt.Println("#")
	fmt.Println("# Format: client + count + role + city/state + start time +")
	fmt.Println("# (optional deadline). Mimics the natural language a coordinator would")
	fmt.Println("# type into a dispatch tool when triaging the next-up demand.")
	fmt.Println()

	for i := 0; i < n; i++ {
		c := client.ValueStr(i)
		cy := city.ValueStr(i)
		st := state.ValueStr(i)
		ro := role.ValueStr(i)
		ct := count.ValueStr(i)
		t := at.ValueStr(i)
		dl := deadline.ValueStr(i)

		// Mechanical template: count + role + city/state + start time +
		// client, plus an optional deadline clause — the shape a
		// coordinator types when triaging the next-up demand.
		q := fmt.Sprintf("Need %s %s in %s %s starting at %s for %s", ct, pluralize(ro, ct), cy, st, t, c)
		if dl != "" && dl != "(null)" {
			q += fmt.Sprintf(", deadline %s", dl)
		}
		fmt.Println(q)
	}
}

func pluralize(role, count string) string {
	if count == "1" {
		return role
	}
	// "Warehouse Associate" → "Warehouse Associates"; "Loader" → "Loaders".
	// Naive but fits the staffing-domain vocabulary in fill_events.
	return role + "s"
}


@@ -0,0 +1,18 @@
# Real-shape coordinator queries — generated from fill_events.parquet
# (real-shape demand data; queries built mechanically from event rows).
# Source: /home/profit/lakehouse/data/datasets/fill_events.parquet (123 rows total, 10 emitted)
#
# Format: client + count + role + city/state + start time +
# (optional deadline). Mimics the natural language a coordinator would
# type into a dispatch tool when triaging the next-up demand.
Need 5 Warehouse Associates in Kansas City MO starting at 09:00 for Parallel Machining
Need 1 Forklift Operator in Detroit MI starting at 15:00 for Beacon Freight, deadline 2026-05-28
Need 4 Loaders in Indianapolis IN starting at 12:00 for Midway Distribution
Need 3 Warehouse Associates in Fort Wayne IN starting at 17:30 for Cornerstone Fabrication, deadline 2026-05-17
Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Freight, deadline 2026-05-28
Need 2 Packers in Joliet IL starting at 09:30 for Parallel Machining
Need 3 Assemblers in Flint MI starting at 08:30 for Heritage Foods
Need 3 Packers in Flint MI starting at 12:30 for Parallel Machining
Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pioneer Assembly
Need 1 CNC Operator in Detroit MI starting at 17:30 for Beacon Freight, deadline 2026-05-28