diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md
index b142fa2..0e924c0 100644
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@@ -266,6 +266,7 @@ The list is intentionally short. Items move to closed when the work demands them
 | (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
 | (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
 | (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
+| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
diff --git a/reports/reality-tests/playbook_lift_real_005.md b/reports/reality-tests/playbook_lift_real_005.md
new file mode 100644
index 0000000..39be31f
--- /dev/null
+++ b/reports/reality-tests/playbook_lift_real_005.md
@@ -0,0 +1,81 @@
+# Playbook-Lift Reality Test — Run real_005
+
+**Generated:** 2026-05-01T04:04:14.242729367Z
+**Judge:** `qwen2.5:latest` (Ollama, resolved from config `[models].local_judge`)
+**Corpora:** `workers,ethereal_workers`
+**Workers limit:** 5000
+**Queries:** `tests/reality/negation_queries.txt` (5 executed)
+**K per pass:** 10
+**Paraphrase pass:** disabled
+**Re-judge pass:** disabled
+**Evidence:** `reports/reality-tests/playbook_lift_real_005.json`
+
+---
+
+## Headline
+
+| Metric | Value |
+|---|---:|
+| Total queries run | 5 |
+| Cold-pass discoveries (judge-best ≠ top-1) | 0 |
+| Warm-pass lifts (recorded playbook → top-1) | 0 |
+| No change (judge-best already top-1, no playbook needed) | 5 |
+| Playbook boosts triggered (warm pass) | 0 |
+| Mean Δ top-1 distance (warm − cold) | 0 |
+
+**Verbatim lift rate:** 0 of 0 discoveries became top-1 after warm pass.
+
+---
+
+## Per-query results
+
+| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
+|---|---|---|---|---|---|---|---|
+| 1 | Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detr | e-1723 | 4/2 | — | e-1723 | 4 | no |
+| 2 | Need 3 Warehouse Associates, but NOT anyone from Beacon Frei | w-2937 | 0/4 | — | w-2937 | 0 | no |
+| 3 | Looking for Pickers in Indianapolis, excluding the Cornersto | e-5033 | 0/1 | — | e-5033 | 0 | no |
+| 4 | 1 CNC Operator needed in Flint MI - we cannot use any Detroi | w-1360 | 0/2 | — | w-1360 | 0 | no |
+| 5 | Need 2 Loaders in Joliet IL but exclude all currently-placed | w-2998 | 0/4 | — | w-2998 | 0 | no |
+
+---
+
+## Honesty caveats
+
+1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
+   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
+   the lift number is meaningless. To validate the judge itself, sample 5–10
+   verdicts manually and check agreement.
+2. **Score-1.0 boost = distance halved.** Playbook math is
+   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
+   result's pre-boost distance to be ≤ 2× the cold top-1's distance; otherwise
+   even halving doesn't promote it. Tight clusters → little visible lift.
+   (See the sketch after this list for the arithmetic.)
+3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
+   case — same query, recorded playbook, expected boost. The paraphrase
+   pass (when enabled) is the actual learning property: similar-but-different
+   queries hitting a recorded playbook. Compare verbatim and paraphrase
+   lift rates — paraphrase should be lower (semantic-distance gates some
+   playbook hits) but non-zero is the meaningful signal.
+4. **Multi-corpus skew.** Default corpora = `workers,ethereal_workers` — if all judge-best
+   results land in one corpus, the matrix layer's purpose isn't being tested.
+   Check the per-corpus distribution in the JSON.
+5. **Judge resolution.** This run used `qwen2.5:latest` from
+   config `[models].local_judge`.
+   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
+6. **Paraphrase generation also uses the judge.** The same model that rates
+   relevance also rephrases queries. A judge that's bad at rating staffing
+   queries is probably also bad at rephrasing them. Worth sanity-checking
+   a sample of `paraphrase_query` values in the JSON before trusting the
+   paraphrase lift number.
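+
+As a worked illustration of caveat 2, a minimal Go sketch of the boost
+arithmetic. The standalone `playbookBoost` helper and both distances are
+hypothetical illustrations, not values from this run; the formula is the
+one documented above:
+
+```go
+package main
+
+import "fmt"
+
+// playbookBoost applies the documented playbook formula:
+// distance' = distance * (1 - 0.5 * score)
+func playbookBoost(distance, score float64) float64 {
+	return distance * (1 - 0.5*score)
+}
+
+func main() {
+	coldTop1 := 0.40  // cold-pass top-1 distance (illustrative)
+	judgeBest := 0.85 // judge-best pre-boost distance (illustrative)
+
+	// Even a perfect score of 1.0 only halves the distance:
+	// 0.85 * 0.5 = 0.425, which is still > 0.40, so the judge-best
+	// result is not promoted past the cold top-1.
+	boosted := playbookBoost(judgeBest, 1.0)
+	fmt.Printf("boosted=%.3f promoted=%v\n", boosted, boosted < coldTop1)
+}
+```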
+
+## Next moves
+
+- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
+  work. Move to paraphrase queries + tag-based boost (currently ignored).
+- If lift rate < 20%: investigate why — judge variance, distance gap too
+  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
+  retuning.
+- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
+  already close to optimal on this query distribution. Either the corpus
+  is too narrow or the queries are too easy.
diff --git a/reports/reality-tests/real_005_findings.md b/reports/reality-tests/real_005_findings.md
new file mode 100644
index 0000000..6cee0a1
--- /dev/null
+++ b/reports/reality-tests/real_005_findings.md
@@ -0,0 +1,111 @@
+# Reality test real_005 — negation probe (substrate has none, judge catches it)
+
+5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
+Freight", etc. — run through the standard `playbook_lift.sh` harness.
+Goal: characterize whether the substrate has any negation handling
+or silently treats "NOT X" as "X".
+
+## Headline
+
+**The substrate has zero negation handling.** Cosine on dense
+embeddings treats "NOT in Detroit" the same as "in Detroit" plus
+the noise word "NOT" — there is no logical-quantifier representation
+in the embedding space. The judge catches the failure
+post-retrieval (low ratings), but the retrieval itself doesn't honor
+the negation.
+
+## Per-query results
+
+| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
+|---|---|---:|---|---|
+| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
+| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
+| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
+| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
+| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |
+
+Q2 and Q5 only "passed" because the non-negated signals (role + city)
+were strong enough to pull workers from outside the negated set
+naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit /
+Cornerstone roster / Detroit-area) is the dominant content word, so
+cosine pulls workers from exactly the location/client the
+coordinator told us to avoid.
+
+## Why this is structural, not fixable in the substrate
+
+LLM-style decoder models can handle negation via attention patterns
+during generation. Embedding models compress text into a single dense
+vector where token-level structure is lost. There is no "NOT" operator
+in cosine space — the literature is clear on this; it's an active
+research area (e.g. negation-aware contrastive training).
+
+Mitigation paths in our substrate:
+
+1. **Pre-process the query with an LLM** to extract negations as
+   structured filters before retrieval. Same shape as the
+   `roleExtractor` — qwen2.5 format=json, schema like
+   `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
+   Cost: ~1-3s/query. (Sketched below, after this list.)
+2. **Surface a "low-confidence" signal** when the judge rates everything
+   below a threshold (already implicit — `discovery=0` in this run).
+   Promote that to an operator-visible signal so the UI can prompt
+   "this query has constraints I couldn't honor; please exclude
+   manually."
+3. **Use structured `ExcludeIDs`** at the API boundary, populated by
+   the UI. The substrate already supports this (added in the
+   multi-coord stress 200-worker swap). Coordinators in a UI
+   shouldn't type "NOT Beacon Freight" — they should click an
+   exclude button.
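+
+A minimal Go sketch of path (1), mirroring the existing `roleExtractor`
+shape. The `negationFilters` type, the `extractNegations` name, and the
+prompt are hypothetical illustrations, not shipped code; the JSON schema
+is the one proposed above, and the call assumes Ollama's stock
+`/api/generate` endpoint with `format=json`:
+
+```go
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"fmt"
+	"net/http"
+)
+
+// negationFilters is the structured output proposed in path (1).
+type negationFilters struct {
+	Positive         string   `json:"positive"`
+	ExcludeLocations []string `json:"exclude_locations"`
+	ExcludeClients   []string `json:"exclude_clients"`
+}
+
+// extractNegations asks a local model to split a query into a positive
+// part plus exclusion lists before retrieval runs.
+func extractNegations(query string) (negationFilters, error) {
+	var out negationFilters
+	prompt := fmt.Sprintf(`Return JSON with keys "positive", "exclude_locations", "exclude_clients" for this staffing query: %q`, query)
+	body, _ := json.Marshal(map[string]any{
+		"model":  "qwen2.5:latest",
+		"prompt": prompt,
+		"format": "json", // constrain Ollama to emit valid JSON
+		"stream": false,
+	})
+	resp, err := http.Post("http://localhost:11434/api/generate",
+		"application/json", bytes.NewReader(body))
+	if err != nil {
+		return out, err
+	}
+	defer resp.Body.Close()
+	var wrapper struct {
+		Response string `json:"response"`
+	}
+	if err := json.NewDecoder(resp.Body).Decode(&wrapper); err != nil {
+		return out, err
+	}
+	err = json.Unmarshal([]byte(wrapper.Response), &out)
+	return out, err
+}
+
+func main() {
+	f, err := extractNegations("Need 3 Warehouse Associates, but NOT anyone from Beacon Freight")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Printf("retrieve %q, excluding clients %v\n", f.Positive, f.ExcludeClients)
+}
+```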
+
+## Architectural recommendation
+
+**(3)** is the right answer for production. UIs solve negation
+cheaper than NLP. The substrate's job is to make exclusion machinery
+available (`ExcludeIDs` is already there) and surface honesty signals
+when its retrieval doesn't fit the query (the judge-rating distribution
+is already there). Adding NL-negation handling would be product
+debt — it would let coordinators type sloppier queries and then
+silently fail when the LLM extractor misses a phrasing.
+
+**(2)** is a small UX improvement worth shipping eventually: when
+all top-K judge ratings are ≤ 2/5, surface "no good match found —
+consider tightening constraints" instead of silently returning the
+cold top-1. This is one query-response shape change, not a substrate
+change.
+
+**(1)** is research-grade work. Don't ship it until production traffic
+demonstrates coordinators actually type natural-language negations
+rather than using exclude affordances.
+
+## Honesty signal validation
+
+The judge IS doing its job. Q1, Q3, Q4 had judge ratings entirely in
+the 1-2/5 range, with no rating ≥ 4. That means in production, a
+judge-rating-distribution monitor would flag: "this query produced 0
+results with quality score ≥ 4." That's an actionable operator signal
+that requires no new code in the substrate. (A one-loop sketch of the
+monitor appears in the appendix at the end of this file.)
+
+## What this probe does NOT cover
+
+- **Quantifier negation** ("at most 3 of these workers"): a different
+  failure mode, also unhandled, also won't be added to the substrate.
+- **Conditional constraints** ("if no forklift ops are available, fall
+  back to material handlers"): same.
+- **Soft preferences** ("prefer locals over commuters"): partially
+  handled via tag boost; not tested in this probe.
+
+These are deferred to "when production traffic shows them."
+
+## Repro
+
+```bash
+# Already-shipped queries file
+cat tests/reality/negation_queries.txt
+
+# Run with default config (no LLM extractor, no paraphrase)
+QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
+  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
+  ./scripts/playbook_lift.sh
+```
+
+Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.
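+
+## Appendix: honesty-monitor sketch
+
+A minimal Go sketch of the judge-rating-distribution monitor described
+under "Honesty signal validation". The `lowConfidence` name and
+standalone layout are hypothetical, not shipped code; the ratings are
+Q1's and Q2's actual top-10 values from the per-query table:
+
+```go
+package main
+
+import "fmt"
+
+// lowConfidence flags a query when none of its top-K judge ratings
+// reaches the quality threshold.
+func lowConfidence(ratings []int, threshold int) bool {
+	for _, r := range ratings {
+		if r >= threshold {
+			return false
+		}
+	}
+	return true
+}
+
+func main() {
+	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // Q1: negation ignored
+	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // Q2: accidentally OK
+
+	fmt.Println("Q1 flagged:", lowConfidence(q1, 4)) // true
+	fmt.Println("Q2 flagged:", lowConfidence(q2, 4)) // false
+}
+```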
diff --git a/tests/reality/negation_queries.txt b/tests/reality/negation_queries.txt
new file mode 100644
index 0000000..073962c
--- /dev/null
+++ b/tests/reality/negation_queries.txt
@@ -0,0 +1,17 @@
+# Negation reality-test queries — real_005
+#
+# Each query carries an explicit negation that should suppress some
+# subset of workers/clients/cities. Cosine on dense embeddings has
+# known weaknesses around negation: "NOT Detroit" still tokenizes
+# "Detroit" and pulls Detroit-aligned vectors near. Without explicit
+# negation handling, the system may silently surface exactly the
+# entities the coordinator excluded.
+#
+# Test goal: characterize whether the substrate degrades silently
+# (treats "NOT X" as "X") or surfaces an honesty signal.
+
+Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detroit pool is reserved for another contract)
+Need 3 Warehouse Associates, but NOT anyone from Beacon Freight
+Looking for Pickers in Indianapolis, excluding the Cornerstone Fabrication roster
+1 CNC Operator needed in Flint MI - we cannot use any Detroit-area workers due to non-compete
+Need 2 Loaders in Joliet IL but exclude all currently-placed Heritage Foods workers