reality_test real_005: negation probe — substrate gap is correctly out-of-scope
Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".
Headline: the substrate has zero negation handling. Cosine on dense
embeddings tokenizes "NOT in Detroit" identically to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.
Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
because the role+city signal pulled a non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK
The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.
No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
(already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
pretend to honor unparseable constraints
Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.
Findings: reports/reality-tests/real_005_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
434f466288
commit
cca32344f3
@ -266,6 +266,7 @@ The list is intentionally short. Items move to closed when the work demands them
| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
81
reports/reality-tests/playbook_lift_real_005.md
Normal file
@ -0,0 +1,81 @@
# Playbook-Lift Reality Test — Run real_005

**Generated:** 2026-05-01T04:04:14.242729367Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/negation_queries.txt` (5 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_005.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 5 |
| Cold-pass discoveries (judge-best ≠ top-1) | 0 |
| Warm-pass lifts (recorded playbook → top-1) | 0 |
| No change (judge-best already top-1, no playbook needed) | 5 |
| Playbook boosts triggered (warm pass) | 0 |
| Mean Δ top-1 distance (warm − cold) | 0 |

**Verbatim lift rate:** 0 of 0 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detr | e-1723 | 4/2 | — | e-1723 | 4 | no |
| 2 | Need 3 Warehouse Associates, but NOT anyone from Beacon Frei | w-2937 | 0/4 | — | w-2937 | 0 | no |
| 3 | Looking for Pickers in Indianapolis, excluding the Cornersto | e-5033 | 0/1 | — | e-5033 | 0 | no |
| 4 | 1 CNC Operator needed in Flint MI - we cannot use any Detroi | w-1360 | 0/2 | — | w-1360 | 0 | no |
| 5 | Need 2 Loaders in Joliet IL but exclude all currently-placed | w-2998 | 0/4 | — | w-2998 | 0 | no |

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM judge's verdict is what defines "best." If `qwen2.5:latest` rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap case — same query, recorded playbook, expected boost. The paraphrase pass (when enabled) is the actual learning property: similar-but-different queries hitting a recorded playbook. Compare verbatim and paraphrase lift rates — paraphrase should be lower (semantic-distance gates some playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from config [models].local_judge. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates relevance also rephrases queries. A judge that's bad at rating staffing queries is probably also bad at rephrasing them. Worth sanity-checking a sample of `paraphrase_query` values in the JSON before trusting the paraphrase lift number.
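The promotion condition in caveat 2 is easy to sanity-check numerically. A minimal Go sketch of the boost arithmetic (the function names here are illustrative, not the harness's actual identifiers):

```go
package main

import "fmt"

// boosted applies the playbook boost from caveat 2:
// distance' = distance * (1 - 0.5*score).
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// canPromote reports whether a judge-best result at judgeDist can beat
// the cold top-1 at top1Dist after a full score=1.0 boost (halving).
func canPromote(judgeDist, top1Dist float64) bool {
	return boosted(judgeDist, 1.0) < top1Dist
}

func main() {
	// Judge-best at 0.70 vs cold top-1 at 0.40: inside the 2x window,
	// so halving (0.35) undercuts the top-1 and the result promotes.
	fmt.Println(canPromote(0.70, 0.40)) // true
	// Judge-best at 0.90 vs top-1 at 0.40: outside the 2x window,
	// even halving (0.45) fails to promote.
	fmt.Println(canPromote(0.90, 0.40)) // false
}
```

This is why tight distance clusters show little visible lift: the halving only matters when the judge-best result starts within twice the cold top-1's distance.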
## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
111
reports/reality-tests/real_005_findings.md
Normal file
@ -0,0 +1,111 @@
# Reality test real_005 — negation probe (substrate has none, judge catches it)

Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon Freight", etc. — through the standard `playbook_lift.sh` harness. Goal: characterize whether the substrate has any negation handling or silently treats "NOT X" as "X".

## Headline

**The substrate has zero negation handling.** Cosine on dense embeddings tokenizes "NOT in Detroit" the same as "in Detroit" plus the noise word "NOT" — there is no logical-quantifier representation in the embedding space. The judge catches the failure post-retrieval (low ratings), but the retrieval itself doesn't honor the negation.
## Per-query results

| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city) were strong enough to pull workers from outside the negated set naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit / Cornerstone roster / Detroit-area) is the dominant content word, so cosine pulls workers from exactly the location/client the coordinator told us to avoid.
## Why this is structural, not fixable in the substrate

LLM-style decoder models can do negation via attention patterns over generation. Embedding models compress text into a single dense vector where token-level structure is lost. There is no "NOT" operator in cosine space — the literature is clear on this; it's an active research area (e.g. negation-aware contrastive training).

Mitigation paths in our substrate:

1. **Pre-process the query with an LLM** to extract negations as structured filters before retrieval. Same shape as the `roleExtractor` — qwen2.5 format=json, schema like `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`. Cost: ~1-3s/query.
2. **Surface a "low-confidence" signal** when the judge rates everything below a threshold (already implicit — `discovery=0` in this run). Promote that to an operator-visible signal so the UI can prompt "this query has constraints I couldn't honor; please exclude manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by the UI. The substrate already supports this (added in the multi-coord stress 200-worker swap). Coordinators in a UI shouldn't type "NOT Beacon Freight" — they should click an exclude button.
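A minimal sketch of path (1) feeding path (3). Only `ExcludeIDs` and `roleExtractor` are names from this codebase; `NegationFilter`, `applyExclusions`, and the worker shape below are hypothetical stand-ins, not the substrate's actual types:

```go
package main

import "fmt"

// NegationFilter is a hypothetical shape for what an LLM extractor
// (qwen2.5, format=json) could return for a negated query.
type NegationFilter struct {
	Positive         string   // the query with negated clauses stripped
	ExcludeLocations []string // e.g. ["Detroit"]
	ExcludeClients   []string // e.g. ["Beacon Freight"]
}

// worker is an illustrative stand-in for a retrieval hit.
type worker struct {
	ID, City, Client string
}

// applyExclusions drops hits whose city or client was negated away,
// i.e. the post-retrieval half of an ExcludeIDs-style filter.
func applyExclusions(hits []worker, f NegationFilter) []worker {
	excluded := map[string]bool{}
	for _, loc := range f.ExcludeLocations {
		excluded[loc] = true
	}
	for _, cl := range f.ExcludeClients {
		excluded[cl] = true
	}
	var out []worker
	for _, h := range hits {
		if !excluded[h.City] && !excluded[h.Client] {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	f := NegationFilter{
		Positive:         "Need 5 Forklift Operators in Aurora IL",
		ExcludeLocations: []string{"Detroit"},
	}
	hits := []worker{
		{"w-1", "Detroit", "Beacon Freight"},
		{"w-2", "Aurora", "Heritage Foods"},
	}
	for _, h := range applyExclusions(hits, f) {
		fmt.Println(h.ID) // only w-2 survives the Detroit exclusion
	}
}
```

The point of the shape: retrieval stays pure cosine over `Positive`, and exclusion happens as a structured filter that the judge never has to rescue.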
## Architectural recommendation

**(3)** is the right answer for production. UIs solve negation cheaper than NLP. The substrate's job is to make exclusion machinery available (`ExcludeIDs` is already there) and surface honesty signals when its retrieval doesn't fit the query (the judge-rating distribution is already there). Adding NL-negation handling would be product debt — it would let coordinators type sloppier queries and then silently fail when the LLM extractor misses a phrasing.

**(2)** is a small UX improvement worth shipping eventually: when all top-K judge ratings are ≤ 2/5, surface "no good match found — consider tightening constraints" instead of returning the cold top-1 silently. This is one query-response shape change, not a substrate change.

**(1)** is research-grade work. Don't ship it until production traffic demonstrates coordinators actually type natural-language negations rather than using exclude affordances.
## Honesty signal validation

The judge IS doing its job. Q1, Q3, Q4 had judge ratings of mostly 1/5, with no rating ≥ 4. That means in production, a judge-rating-distribution monitor would flag: "this query produced 0 results with quality score ≥ 4." That's an actionable operator signal without requiring any new code in the substrate.
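That monitor is a few lines of Go. A sketch (the function name and threshold are illustrative; a real monitor would read the ratings out of the run JSON):

```go
package main

import "fmt"

// noGoodMatch reports whether every top-K judge rating sits at or below
// the bad-match threshold, i.e. retrieval couldn't honor the query.
func noGoodMatch(ratings []int, threshold int) bool {
	for _, r := range ratings {
		if r > threshold {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // real_005 Q1 ratings
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // real_005 Q2 ratings
	fmt.Println(noGoodMatch(q1, 2)) // true: surface "no good match found"
	fmt.Println(noGoodMatch(q2, 2)) // false: top-1 rated 4/5
}
```

Against this run's data, Q1 trips the flag and Q2 does not, which matches the per-query table above.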
## What this probe does NOT cover

- **Quantifier negation** ("at most 3 of these workers"): different failure mode, also unhandled, also won't be added to the substrate.
- **Conditional constraints** ("if no forklift ops available, fall back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially handled via tag boost; not tested in this probe.

These are deferred to "when production traffic shows them."

## Repro

```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt

# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.
17
tests/reality/negation_queries.txt
Normal file
@ -0,0 +1,17 @@
# Negation reality-test queries — real_005
#
# Each query carries an explicit negation that should suppress some
# subset of workers/clients/cities. Cosine on dense embeddings has
# known weaknesses around negation: "NOT Detroit" still tokenizes
# "Detroit" and pulls Detroit-aligned vectors near. Without explicit
# negation handling, the system may silently surface exactly the
# entities the coordinator excluded.
#
# Test goal: characterize whether the substrate degrades silently
# (treats "NOT X" as "X") or surfaces an honesty signal.

Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detroit pool is reserved for another contract)
Need 3 Warehouse Associates, but NOT anyone from Beacon Freight
Looking for Pickers in Indianapolis, excluding the Cornerstone Fabrication roster
1 CNC Operator needed in Flint MI - we cannot use any Detroit-area workers due to non-compete
Need 2 Loaders in Joliet IL but exclude all currently-placed Heritage Foods workers