# Reality test real_005 — negation probe (substrate has none, judge catches it)

Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon Freight", etc. — through the standard `playbook_lift.sh` harness. Goal: characterize whether the substrate has any negation handling or silently treats "NOT X" as "X".

## Headline

**The substrate has zero negation handling.** Cosine on dense embeddings tokenizes "NOT in Detroit" the same as "in Detroit" plus the noise token "NOT" — there is no logical-quantifier representation in the embedding space. The judge catches the failure post-retrieval (low ratings), but the retrieval itself doesn't honor the negation.

## Per-query results

| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city) were strong enough to pull workers from outside the negated set naturally. Q1, Q3, and Q4 hit the wall: the negated entity (Detroit / Cornerstone roster / Detroit-area) is the dominant content word, so cosine pulls workers from exactly the location or client the coordinator told us to avoid.

## Why this is structural, not fixable in the substrate

LLM-style decoder models can do negation via attention patterns during generation. Embedding models compress text into a single dense vector where token-level structure is lost.
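A toy bag-of-words cosine (not the production embedder, which is dense) makes the failure mode concrete: "NOT" is just one more token, so the negated query stays almost on top of the positive one.

```python
# Toy illustration: with token-count vectors, adding "NOT" barely moves the
# query in cosine space. Dense embedders fail the same way, just less visibly.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts under a bag-of-words model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)

positive = "forklift operators in detroit"
negated = "forklift operators NOT in detroit"
print(round(cosine(positive, negated), 3))  # 0.894 — "NOT" is noise, not logic
```

The negation flips the intended result set entirely, yet the query vector moves by almost nothing — which is why Q1 retrieved exactly the Detroit workers it was told to avoid.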
There is no "NOT" operator in cosine space — the literature is clear on this; it's an active research area (e.g. negation-aware contrastive training).

Mitigation paths in our substrate:

1. **Pre-process the query with an LLM** to extract negations as structured filters before retrieval. Same shape as the `roleExtractor` — qwen2.5 format=json, schema like `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`. Cost: ~1-3 s/query.
2. **Surface a "low-confidence" signal** when the judge rates everything below a threshold (already implicit — `discovery=0` in this run). Promote that to an operator-visible signal so the UI can prompt "this query has constraints I couldn't honor; please exclude manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by the UI. The substrate already supports this (added in the multi-coord stress 200-worker swap). Coordinators in a UI shouldn't type "NOT Beacon Freight" — they should click an exclude button.

## Architectural recommendation

**(3)** is the right answer for production. UIs solve negation more cheaply than NLP does. The substrate's job is to make exclusion machinery available (`ExcludeIDs` is already there) and to surface honesty signals when its retrieval doesn't fit the query (the judge-rating distribution is already there). Adding NL-negation handling would be product debt — it would let coordinators type sloppier queries and then fail silently whenever the LLM extractor misses a phrasing.

**(2)** is a small UX improvement worth shipping eventually: when all top-K judge ratings are ≤ 2/5, surface "no good match found — consider tightening constraints" instead of silently returning the cold top-1. This is one query-response shape change, not a substrate change.

**(1)** is research-grade work. Don't ship it until production traffic demonstrates that coordinators actually type natural-language negations rather than using the exclude affordances.

## Honesty signal validation

The judge IS doing its job.
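The threshold check behind recommendation (2) is a few lines. A minimal sketch, assuming the same "no rating ≥ 4" bar used by the monitor (function and variable names are hypothetical, not shipped config):

```python
# Sketch of the operator-visible honesty signal from recommendation (2):
# flag a query when the judge found zero results clearing the quality bar.
def should_flag(judge_ratings: list[int], good_threshold: int = 4) -> bool:
    """True when no retrieved result was rated >= good_threshold."""
    return not any(r >= good_threshold for r in judge_ratings)

q1_ratings = [1, 1, 1, 1, 2, 1, 2, 2, 2, 2]  # from the table above: all bad
q2_ratings = [4, 4, 3, 2, 3, 4, 1, 4, 4, 1]  # top-1 4/5: accidentally OK

print(should_flag(q1_ratings))  # True  -> surface "no good match found"
print(should_flag(q2_ratings))  # False -> return results as usual
```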
Q1, Q3, Q4 had judge ratings of mostly 1/5, with no rating ≥ 4. That means in production, a judge-rating-distribution monitor would flag: "this query produced 0 results with quality score ≥ 4." That's an actionable operator signal without requiring any new code in the substrate.

## What this probe does NOT cover

- **Quantifier negation** ("at most 3 of these workers"): different failure mode, also unhandled, also won't be added to the substrate.
- **Conditional constraints** ("if no forklift ops available, fall back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially handled via tag boost; not tested in this probe.

These are deferred to "when production traffic shows them."

## Repro

```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt

# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.
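For reference, mitigation (1)'s structured-filter shape can be sketched as a hard post-filter over retrieved candidates. The JSON schema matches the one quoted in the mitigation list; everything else (the `Worker` fields, the filter function) is a hypothetical illustration, not the substrate's API:

```python
# Sketch of mitigation (1): the LLM extractor emits JSON matching the schema
# {"positive": ..., "exclude_locations": [...], "exclude_clients": [...]};
# retrieval runs on "positive" only, then exclusions apply as hard filters.
# Worker fields and names here are hypothetical.
import json
from dataclasses import dataclass

@dataclass
class Worker:
    id: str
    city: str
    client: str

def apply_negation_filters(extractor_json: str, candidates: list[Worker]) -> list[Worker]:
    f = json.loads(extractor_json)
    bad_locs = {x.lower() for x in f.get("exclude_locations", [])}
    bad_clients = {x.lower() for x in f.get("exclude_clients", [])}
    return [w for w in candidates
            if w.city.lower() not in bad_locs and w.client.lower() not in bad_clients]

extracted = '{"positive": "5 Forklift Operators in Aurora IL", "exclude_locations": ["Detroit"], "exclude_clients": []}'
pool = [Worker("w1", "Aurora", "Heritage Foods"), Worker("w2", "Detroit", "Beacon Freight")]
print([w.id for w in apply_negation_filters(extracted, pool)])  # ['w1']
```

Note the failure mode called out above still applies: if the extractor misses a phrasing, the filter is empty and retrieval silently reverts to today's behavior — which is why this stays research-grade until traffic justifies it.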