diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md
index ec61d2c..9212d24 100644
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@@ -220,10 +220,11 @@ The list is intentionally short. Items move to closed when the work demands them
 | # | Item | When to act |
 |---|---|---|
-| 1 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
-| 2 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
-| 3 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
-| 4 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
+| 1 | **Same-client+city cross-role bleed** — reality_test real_001 (`reports/reality-tests/real_001_findings.md`) caught Shape A boost from a recorded playbook hitting different-role queries on same `(client, city)` cluster. Fix candidate: role-scoped playbook corpus metadata + Shape A boost gate on role match. The judge gate handles per-injection but doesn't fire at retrieve time. | Before real coordinator demand actually starts hitting the substrate. |
+| 2 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
+| 3 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
+| 4 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
+| 5 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
 
 ---
 
@@ -261,6 +262,7 @@ The list is intentionally short. Items move to closed when the work demands them
 | `68d9e55` | shared: auto-emit Langfuse trace+span per HTTP request — closes OPEN #2 |
 | `a2fa9a2` | scrum_review: pipe diff via temp files — fixes argv overflow on large bundles |
 | (prep) | G5 cutover prep: `embed_parity` probe — Rust `/ai/embed` ↔ Go `/v1/embed` 5/5 cos=1.000 (both v1 and v2-moe). Verdict + drift catalog in `reports/cutover/SUMMARY.md`. Wire-format remap (`embeddings`/`vectors`, `dimensions`/`dimension`) is the only real cutover work; math is provably equivalent. |
+| (probe) | Reality test real_001: 10 real-shape queries from `fill_events.parquet` through lift harness. 8/10 cold-pass top-1 = judge-best (substrate works on real distribution). Surfaced **same-client+city cross-role bleed** — Shape A boost from Forklift-Operator playbook landed on CNC-Operator query, demoting the cold-pass-correct worker. Becomes new OPEN #1. Findings: `reports/reality-tests/real_001_findings.md`. |
 
 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
diff --git a/reports/reality-tests/playbook_lift_real_001.md b/reports/reality-tests/playbook_lift_real_001.md
new file mode 100644
index 0000000..ede867c
--- /dev/null
+++ b/reports/reality-tests/playbook_lift_real_001.md
@@ -0,0 +1,86 @@
+# Playbook-Lift Reality Test — Run real_001
+
+**Generated:** 2026-05-01T01:15:54.879146128Z
+**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
+**Corpora:** `workers,ethereal_workers`
+**Workers limit:** 5000
+**Queries:** `tests/reality/real_coord_queries.txt` (10 executed)
+**K per pass:** 10
+**Paraphrase pass:** disabled
+**Re-judge pass:** disabled
+**Evidence:** `reports/reality-tests/playbook_lift_real_001.json`
+
+---
+
+## Headline
+
+| Metric | Value |
+|---|---:|
+| Total queries run | 10 |
+| Cold-pass discoveries (judge-best ≠ top-1) | 2 |
+| Warm-pass lifts (recorded playbook → top-1) | 2 |
+| No change (judge-best already top-1, no playbook needed) | 8 |
+| Playbook boosts triggered (warm pass) | 2 |
+| Mean Δ top-1 distance (warm − cold) | -0.12680706 |
+
+**Verbatim lift rate:** 2 of 2 discoveries became top-1 after warm pass.
+
+---
+
+## Per-query results
+
+| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
+|---|---|---|---|---|---|---|---|
+| 1 | Need 5 Warehouse Associates in Kansas City MO starting at 09 | w-4781 | 0/4 | — | w-4781 | 0 | no |
+| 2 | Need 1 Forklift Operator in Detroit MI starting at 15:00 for | e-5617 | 2/4 | ✓ e-6193 | e-6193 | 0 | **YES** |
+| 3 | Need 4 Loaders in Indianapolis IN starting at 12:00 for Midw | e-9877 | 0/4 | — | e-9877 | 0 | no |
+| 4 | Need 3 Warehouse Associates in Fort Wayne IN starting at 17: | w-3370 | 0/4 | — | w-3370 | 0 | no |
+| 5 | Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Fr | e-5620 | 1/2 | — | e-6193 | 2 | no |
+| 6 | Need 2 Packers in Joliet IL starting at 09:30 for Parallel M | e-1746 | 0/2 | — | e-1746 | 0 | no |
+| 7 | Need 3 Assemblers in Flint MI starting at 08:30 for Heritage | w-4124 | 0/2 | — | w-4124 | 0 | no |
+| 8 | Need 3 Packers in Flint MI starting at 12:30 for Parallel Ma | e-7325 | 0/2 | — | e-7325 | 0 | no |
+| 9 | Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pion | w-3988 | 4/4 | ✓ w-1367 | w-1367 | 0 | **YES** |
+| 10 | Need 1 CNC Operator in Detroit MI starting at 17:30 for Beac | w-3759 | 0/4 | — | e-6193 | 1 | no |
+
+---
+
+## Honesty caveats
+
+1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
+   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
+   the lift number is meaningless. To validate the judge itself, sample 5–10
+   verdicts manually and check agreement.
+2. **Score-1.0 boost = distance halved.** Playbook math is
+   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
+   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
+   even halving doesn't promote it. Tight clusters → little visible lift.
+3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
+   case — same query, recorded playbook, expected boost. The paraphrase
+   pass (when enabled) is the actual learning property: similar-but-different
+   queries hitting a recorded playbook. Compare verbatim and paraphrase
+   lift rates — paraphrase should be lower (semantic distance gates some
+   playbook hits) but non-zero is the meaningful signal.
+4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
+   results land in one corpus, the matrix layer's purpose isn't being tested.
+   Check per-corpus distribution in the JSON.
+5. **Judge resolution.** This run used `qwen2.5:latest`, resolved from
+   config [models].local_judge. Bumping the judge for run #N+1 means
+   editing one line in lakehouse.toml.
+6. **Paraphrase generation also uses the judge.** The same model that rates
+   relevance also rephrases queries. A judge that's bad at rating staffing
+   queries is probably also bad at rephrasing them. Worth sanity-checking
+   a sample of `paraphrase_query` values in the JSON before trusting the
+   paraphrase lift number.
+
+## Next moves
+
+- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
+  work. Move to paraphrase queries + tag-based boost (currently ignored).
+- If lift rate < 20%: investigate why — judge variance, distance gap too
+  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
+  retuning.
+- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
+  already close to optimal on this query distribution. Either the corpus
+  is too narrow or the queries are too easy.
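The caveat-2 arithmetic is small enough to check in isolation. A minimal sketch of the `distance' = distance × (1 - 0.5 × score)` formula and the promotion bound it implies — the `boost` helper and the numbers are illustrative, not the actual matrix-layer code:

```go
package main

import "fmt"

// boost applies the playbook formula from caveat 2:
// distance' = distance * (1 - 0.5*score). At score=1.0 the distance
// is halved; at score=0 it is untouched.
func boost(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := 0.30  // cold-pass winner's distance (illustrative)
	judgeBest := 0.55 // judge-best result, ranked below top-1 cold

	// A score-1.0 boost halves the judge-best distance: 0.55 -> 0.275,
	// which beats 0.30, so the warm pass promotes it to top-1.
	fmt.Println(boost(judgeBest, 1.0) < coldTop1) // true

	// If the judge-best distance exceeds 2x the cold top-1's, even a
	// full halving cannot promote it (the "<= 2x" bound in caveat 2).
	fmt.Println(boost(0.65, 1.0) < coldTop1) // false: 0.325 > 0.30
}
```

This also shows why tight clusters yield little visible lift: when every candidate sits within a narrow distance band, the judge-best is usually already within 2× of the cold top-1, so the halving changes rank order by only a hair.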
diff --git a/reports/reality-tests/real_001_findings.md b/reports/reality-tests/real_001_findings.md
new file mode 100644
index 0000000..3112bb7
--- /dev/null
+++ b/reports/reality-tests/real_001_findings.md
@@ -0,0 +1,129 @@
+# Reality test real_001 — findings
+
+First retrieval probe with **real-shape coordinator queries** (sourced
+from `fill_events.parquet` via `scripts/cutover/gen_real_queries.go`),
+fed through the standard `playbook_lift.sh` harness with paraphrase
++ rejudge passes disabled (no recorded playbooks for these queries
+on first touch, so those passes have nothing to measure).
+
+Companion to `playbook_lift_real_001.md` — that file is the
+auto-generated harness output; this file captures the reading.
+
+---
+
+## Substrate works on real-shape queries
+
+8 / 10 queries had cold-pass top-1 = judge-best (rating ≥ 2). Single-role
++ city + state + client queries are well within the substrate's competence.
+This is the headline.
+
+The matrix layer doesn't need to do clever things on these. Cosine on
+`v2-moe` against the 5K worker corpus already surfaces a sensible match
+in 8 / 10 cases — the v2-moe upgrade + the workers corpus (with role +
+city + skills + certs in the resume text) is doing real work.
+
+## Cross-pollination across same-client / same-city queries
+
+Queries #2, #5, #10 all target **Beacon Freight in Detroit MI**:
+
+| # | Role | Cold top-1 | Warm top-1 | Boosted | Playbook |
+|---|---|---|---|---|---|
+| 2 | Forklift Operator | `e-5617` | `e-6193` | 1 | YES — recorded `e-6193 fits Forklift Operator Detroit Beacon Freight` |
+| 5 | Pickers | `e-5620` | `e-6193` | 0 | no |
+| 10 | CNC Operator | `w-3759` | `e-6193` | 0 | no |
+
+Q#2 records a playbook for `e-6193` after the judge promotes it to
+rank 0 in the cold pass. Q#5 and Q#10 then inherit `e-6193` at warm
+top-1 even though:
+
+- Neither query has its own recorded playbook (Playbook column = no).
+- Neither warm pass triggers a Shape B inject (Boosted = 0).
+- The roles are *different* — Forklift, Pickers, CNC Operator are
+  distinct staffing categories.
+
+So `e-6193` is reaching warm top-1 via Shape A's distance-based boost:
+the playbook-corpus entry tagged with Q#2's query text is close enough
+in cosine to Q#5 and Q#10's embeddings (same client + city dominate)
+that the boost halves the distance and promotes the worker.
+
+For Q#10 specifically, this **demoted the cold-pass-correct `w-3759`**
+(judge rating 4 at rank 0) in favor of a worker who was approved by
+the judge for a *different role* on a *different query*.
+
+## Why the lift suite missed this
+
+The synthetic `playbook_lift_queries.txt` uses 7 disjoint scenario
+buckets: forklift+OSHA+WI, CDL+IL, hazmat+coldstorage, etc. Each
+bucket is a distinct semantic neighborhood, so recorded playbook
+entries don't compete. The cluster doesn't exist in the synthetic
+distribution.
+
+Real coordinator demand clusters on `(client, city)` because that is
+how dispatch traffic is shaped: same client across roles, same city
+across days. The Beacon Freight Detroit cluster is exactly what the
+synthetic bucketing prevents. So the synthetic harness reports clean
+lift numbers while a same-client cluster bleeds.
+
+## Why the judge gate doesn't catch it
+
+The judge gate (`internal/matrix/judge.go`, wired in `5a3364f`) is
+**per-injection at record time** — Shape B inject calls
+`gate.Approve(query, hit)` before adding a candidate. It does not
+fire at retrieve time. A worker approved for Forklift at Beacon
+Freight stays in the playbook corpus and rides along on later
+Beacon Freight queries via Shape A boost without a second
+judge call.
+
+This is intentional in the design: judge calls are 1–3 s on local
+qwen2.5, so we batch them at record time. But the design didn't
+anticipate the same-client cluster, where the boost surface is
+much wider than per-query independence assumes.
+
+## Mitigation options (none yet implemented)
+
+In rough order of cost:
+
+1. **Role-scoped playbook corpus** — include `role` (extracted at
+   record time, possibly via the existing demand-parser) as
+   metadata on each playbook entry. Restrict Shape A boost to
+   matches where query-role and playbook-role agree. Cheap;
+   doesn't need a new judge call.
+
+2. **Tighten Shape A distance for cross-role queries** — currently
+   `DefaultPlaybookMaxDistance = 0.5`. If the new query embeds
+   close in `(client, city)` but far in `role`, the boost still
+   fires because cosine doesn't separate the axes. Could derive
+   a tighter threshold from intra-role vs inter-role distance
+   distributions.
+
+3. **Per-retrieve judge re-gate** — call the judge on the warm
+   top-1 vs cold top-1 and demote the warm result if the judge
+   prefers cold. Highest correctness, ~2× retrieve latency. Not
+   viable for the hot path.
+
+(1) is the obvious first fix. The role-extractor already exists
+(see `internal/chat/inbox.go` LLM-parsed inbox demands; the same
+qwen2.5 format=json shape can run on playbook record).
+
+## What the substrate gets right
+
+- 80% cold-pass-correct on a query distribution it has never seen —
+  strong v2-moe + corpus signal.
+- The two queries that did discover (Q#2, Q#9) lifted cleanly:
+  recorded playbook → warm top-1 = judge-best at rank 0. The
+  basic Shape B mechanism works on real-shape queries.
+- Cross-pollination *only* fires within same-client+city clusters,
+  not across them — the substrate is not behaving randomly. The
+  bleed has clear semantics; it just exceeds what the inject
+  gate's per-query scope catches.
+
+## Repro
+
+```bash
+go run scripts/cutover/gen_real_queries.go -limit 10 > tests/reality/real_coord_queries.txt
+QUERIES_FILE=tests/reality/real_coord_queries.txt RUN_ID=real_001 \
+  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
+  ./scripts/playbook_lift.sh
+```
+
+Evidence: `reports/reality-tests/playbook_lift_real_001.{json,md}`.
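Mitigation (1) is small enough to sketch. The shape below is hypothetical — `PlaybookEntry`, `shapeABoost`, and the wiring are illustrative stand-ins, not the actual `internal/matrix` API — but it shows the intended gate: the Shape A distance boost only fires when the entry's recorded role matches the incoming query's parsed role:

```go
package main

import (
	"fmt"
	"strings"
)

// PlaybookEntry is a hypothetical stand-in for a playbook-corpus row,
// extended with the role captured at record time (mitigation 1).
type PlaybookEntry struct {
	WorkerID string
	Role     string // e.g. "Forklift Operator", parsed when the playbook is recorded
}

// shapeABoost halves the distance (the score=1.0 case of
// distance*(1-0.5*score)) only when the query role matches the recorded
// role; cross-role hits, the real_001 bleed, pass through unboosted.
func shapeABoost(distance float64, queryRole string, e PlaybookEntry) float64 {
	if !strings.EqualFold(queryRole, e.Role) {
		return distance // same (client, city), different role: no boost
	}
	return distance * 0.5
}

func main() {
	forklift := PlaybookEntry{WorkerID: "e-6193", Role: "Forklift Operator"}

	// Q#2 shape: same role, the boost fires as before.
	fmt.Println(shapeABoost(0.40, "Forklift Operator", forklift)) // 0.2

	// Q#10 shape: a CNC Operator query hitting the same Beacon Freight
	// Detroit entry is left at its raw distance, so it can no longer
	// demote the cold-pass-correct worker.
	fmt.Println(shapeABoost(0.40, "CNC Operator", forklift)) // 0.4
}
```

No extra judge call is needed at retrieve time; the only new cost is the one role extraction at record time, which the existing demand-parser shape already covers.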
diff --git a/scripts/cutover/gen_real_queries.go b/scripts/cutover/gen_real_queries.go
new file mode 100644
index 0000000..939364d
--- /dev/null
+++ b/scripts/cutover/gen_real_queries.go
@@ -0,0 +1,100 @@
+// gen_real_queries — pull N rows from fill_events.parquet and translate
+// each into a coordinator-style natural-language query.
+//
+// Output: one query per line, written to stdout (intended for redirect
+// into tests/reality/real_coord_queries.txt and then fed to
+// scripts/playbook_lift.sh as --queries=).
+//
+// Why: the lift harness's standard query corpus is hand-crafted to
+// stress multi-constraint matching. Real coordinator demand has a
+// different distribution — single-role, single-geo, count + time —
+// and we want to probe whether the substrate handles that shape too.
+// The fill_events parquet on the Rust side is the closest thing to
+// "real demand" we have on disk (123 rows, sourced from staffing
+// fixture generation but shaped like genuine fills).
+package main
+
+import (
+	"context"
+	"flag"
+	"fmt"
+	"log"
+
+	"github.com/apache/arrow-go/v18/arrow/memory"
+	"github.com/apache/arrow-go/v18/parquet/file"
+	"github.com/apache/arrow-go/v18/parquet/pqarrow"
+)
+
+func main() {
+	src := flag.String("src", "/home/profit/lakehouse/data/datasets/fill_events.parquet", "fill_events parquet path")
+	limit := flag.Int("limit", 10, "number of queries to generate")
+	flag.Parse()
+
+	r, err := file.OpenParquetFile(*src, false)
+	if err != nil {
+		log.Fatalf("open %s: %v", *src, err)
+	}
+	defer r.Close()
+
+	pr, err := pqarrow.NewFileReader(r, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
+	if err != nil {
+		log.Fatalf("pqarrow reader: %v", err)
+	}
+	tbl, err := pr.ReadTable(context.Background())
+	if err != nil {
+		log.Fatalf("read table: %v", err)
+	}
+	defer tbl.Release()
+
+	// Field order must match parquet schema (see scripts/cutover dev probe):
+	// 3=client, 5=city, 6=state, 7=role, 8=count, 10=at, 12=deadline.
+	client := tbl.Column(3).Data().Chunk(0)
+	city := tbl.Column(5).Data().Chunk(0)
+	state := tbl.Column(6).Data().Chunk(0)
+	role := tbl.Column(7).Data().Chunk(0)
+	count := tbl.Column(8).Data().Chunk(0)
+	at := tbl.Column(10).Data().Chunk(0)
+	deadline := tbl.Column(12).Data().Chunk(0)
+
+	n := int(tbl.NumRows())
+	if *limit < n {
+		n = *limit
+	}
+
+	fmt.Println("# Real-shape coordinator queries — generated from fill_events.parquet")
+	fmt.Println("# (real-shape demand data; queries built mechanically from event rows).")
+	fmt.Printf("# Source: %s (%d rows total, %d emitted)\n", *src, tbl.NumRows(), n)
+	fmt.Println("#")
+	fmt.Println("# Format: count + role + city/state + start time + client +")
+	fmt.Println("# (optional deadline). Mimics the natural language a coordinator would")
+	fmt.Println("# type into a dispatch tool when triaging the next-up demand.")
+	fmt.Println()
+
+	for i := 0; i < n; i++ {
+		c := client.ValueStr(i)
+		cy := city.ValueStr(i)
+		st := state.ValueStr(i)
+		ro := role.ValueStr(i)
+		ct := count.ValueStr(i)
+		t := at.ValueStr(i)
+		dl := deadline.ValueStr(i)
+
+		// One mechanical phrasing per row: count + pluralized role +
+		// city/state + start time + client, deadline appended when present.
+		// (Paraphrase tolerance is exercised by the lift harness's own
+		// paraphrase pass, not by this generator.)
+		q := fmt.Sprintf("Need %s %s in %s %s starting at %s for %s", ct, pluralize(ro, ct), cy, st, t, c)
+		if dl != "" && dl != "(null)" {
+			q += fmt.Sprintf(", deadline %s", dl)
+		}
+		fmt.Println(q)
+	}
+}
+
+func pluralize(role, count string) string {
+	if count == "1" {
+		return role
+	}
+	// "Warehouse Associate" → "Warehouse Associates"; "Loader" → "Loaders".
+	// Naive but fits the staffing-domain vocabulary in fill_events.
+	return role + "s"
+}
diff --git a/tests/reality/real_coord_queries.txt b/tests/reality/real_coord_queries.txt
new file mode 100644
index 0000000..662cb59
--- /dev/null
+++ b/tests/reality/real_coord_queries.txt
@@ -0,0 +1,18 @@
+# Real-shape coordinator queries — generated from fill_events.parquet
+# (real-shape demand data; queries built mechanically from event rows).
+# Source: /home/profit/lakehouse/data/datasets/fill_events.parquet (123 rows total, 10 emitted)
+#
+# Format: count + role + city/state + start time + client +
+# (optional deadline). Mimics the natural language a coordinator would
+# type into a dispatch tool when triaging the next-up demand.
+
+Need 5 Warehouse Associates in Kansas City MO starting at 09:00 for Parallel Machining
+Need 1 Forklift Operator in Detroit MI starting at 15:00 for Beacon Freight, deadline 2026-05-28
+Need 4 Loaders in Indianapolis IN starting at 12:00 for Midway Distribution
+Need 3 Warehouse Associates in Fort Wayne IN starting at 17:30 for Cornerstone Fabrication, deadline 2026-05-17
+Need 4 Pickers in Detroit MI starting at 13:30 for Beacon Freight, deadline 2026-05-28
+Need 2 Packers in Joliet IL starting at 09:30 for Parallel Machining
+Need 3 Assemblers in Flint MI starting at 08:30 for Heritage Foods
+Need 3 Packers in Flint MI starting at 12:30 for Parallel Machining
+Need 1 Shipping Clerk in Flint MI starting at 17:00 for Pioneer Assembly
+Need 1 CNC Operator in Detroit MI starting at 17:30 for Beacon Freight, deadline 2026-05-28
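The generator's mechanical formatting can be checked against the emitted file without touching parquet. A standalone sketch: `pluralize` is copied from `gen_real_queries.go`, and `buildQuery` is a hypothetical helper wrapping the same `fmt.Sprintf` template as the generator's main loop, with two rows' values hard-coded from the file above:

```go
package main

import "fmt"

// pluralize mirrors gen_real_queries.go: naive "+s" except for count "1".
func pluralize(role, count string) string {
	if count == "1" {
		return role
	}
	return role + "s"
}

// buildQuery wraps the Sprintf template from the generator's main loop
// (hypothetical helper; the real script inlines this in the row loop).
func buildQuery(count, role, city, state, at, client, deadline string) string {
	q := fmt.Sprintf("Need %s %s in %s %s starting at %s for %s",
		count, pluralize(role, count), city, state, at, client)
	if deadline != "" && deadline != "(null)" {
		q += fmt.Sprintf(", deadline %s", deadline)
	}
	return q
}

func main() {
	// Row 1 of real_coord_queries.txt: no deadline, plural role.
	fmt.Println(buildQuery("5", "Warehouse Associate", "Kansas City", "MO",
		"09:00", "Parallel Machining", ""))
	// Row 2: count "1" keeps the role singular; deadline is appended.
	fmt.Println(buildQuery("1", "Forklift Operator", "Detroit", "MI",
		"15:00", "Beacon Freight", "2026-05-28"))
}
```

Running it reproduces rows 1 and 2 of `tests/reality/real_coord_queries.txt` verbatim, which is the invariant a future run `real_002` would want to hold before comparing lift numbers across runs.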