golangLAKEHOUSE/reports/reality-tests/real_005_findings.md
root cca32344f3 reality_test real_005: negation probe — substrate gap is correctly out-of-scope
Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has any negation handling or silently treats
"NOT X" as "X".

Headline: the substrate has zero negation handling. Cosine on dense
embeddings treats "NOT in Detroit" as identical to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
  because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK

The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
  (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
  pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:06:06 -05:00

# Reality test real_005 — negation probe (substrate has none, judge catches it)
Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard `playbook_lift.sh` harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".

## Headline
**The substrate has zero negation handling.** Cosine over dense
embeddings treats "NOT in Detroit" the same as "in Detroit" plus
the noise word "NOT" — there is no logical-quantifier representation
in the embedding space. The judge catches the failure
post-retrieval (low ratings), but the retrieval itself doesn't honor
the negation.
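
To see the gap directly, here is a minimal probe sketch. It assumes an
Ollama-style local embeddings endpoint and uses a placeholder model
name; neither is confirmed as the substrate's actual wiring.

```go
// Hypothetical probe: embed the positive and negated phrasings and
// compare cosine similarity. Endpoint and model name are assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math"
	"net/http"
)

func embed(text string) ([]float64, error) {
	body, _ := json.Marshal(map[string]string{
		"model":  "nomic-embed-text", // placeholder, not necessarily the substrate's model
		"prompt": text,
	})
	resp, err := http.Post("http://localhost:11434/api/embeddings",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Embedding []float64 `json:"embedding"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embedding, nil
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a, err := embed("Forklift Operators in Detroit")
	if err != nil {
		panic(err)
	}
	b, err := embed("Forklift Operators NOT in Detroit")
	if err != nil {
		panic(err)
	}
	// Expect a value near 1.0: the "NOT" token barely moves the vector.
	fmt.Printf("cosine(pos, neg) = %.4f\n", cosine(a, b))
}
```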
## Per-query results
| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city)
were strong enough to pull workers from outside the negated set
naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit /
Cornerstone roster / Detroit-area) is the dominant content word, so
cosine pulls workers from exactly the location/client the
coordinator told us to avoid.
## Why this is structural, not fixable in the substrate
LLM-style decoder models can handle negation via attention patterns
during generation. Embedding models compress text into a single dense
vector where token-level structure is lost. There is no "NOT" operator
in cosine space — the literature is clear on this limitation; fixing it
is an active research area (e.g. negation-aware contrastive training).
Mitigation paths in our substrate:
1. **Pre-process query with an LLM** to extract negations as
structured filters before retrieval. Same shape as the
`roleExtractor` — qwen2.5 format=json, schema like
`{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
Cost: ~1-3s/query. See the sketch after this list.
2. **Surface "low-confidence" signal** when the judge rates everything
below a threshold (already implicit — `discovery=0` in this run).
Promote that to an operator-visible signal so the UI can prompt
"this query has constraints I couldn't honor; please exclude
manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by
the UI. The substrate already supports this (added in the
multi-coord stress 200-worker swap). Coordinators in a UI
shouldn't type "NOT Beacon Freight" — they should click an
exclude button.
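
A minimal sketch of path (1). The qwen2.5 model, `format=json`, and the
filter schema come from item 1 above; the endpoint URL, prompt wording,
and function names are assumptions, not the substrate's actual wiring.

```go
// Sketch of mitigation (1): pre-extract negations into structured
// filters before retrieval, mirroring the roleExtractor shape.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// NegationFilter matches the schema described above.
type NegationFilter struct {
	Positive         string   `json:"positive"`
	ExcludeLocations []string `json:"exclude_locations"`
	ExcludeClients   []string `json:"exclude_clients"`
}

func extractNegations(query string) (*NegationFilter, error) {
	prompt := "Extract the positive intent and any excluded locations or " +
		"clients from this staffing query as JSON with keys positive, " +
		"exclude_locations, exclude_clients. Query: " + query
	body, _ := json.Marshal(map[string]any{
		"model":  "qwen2.5",
		"prompt": prompt,
		"format": "json", // constrain the model to emit valid JSON
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	var f NegationFilter
	if err := json.Unmarshal([]byte(out.Response), &f); err != nil {
		return nil, err
	}
	return &f, nil
}

func main() {
	f, err := extractNegations("Need 5 Forklift Operators in Aurora IL, NOT in Detroit")
	if err != nil {
		panic(err)
	}
	// Retrieval would embed f.Positive and apply the excludes as hard filters.
	fmt.Printf("%+v\n", f)
}
```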
## Architectural recommendation
**(3)** is the right answer for production. UIs solve negation
cheaper than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and surface honesty signals
when its retrieval doesn't fit the query (judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries that then
fail silently whenever the LLM extractor misses a phrasing.
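
For concreteness, a sketch of the exclusion path. Only `ExcludeIDs`
itself is confirmed above; the request and hit shapes around it are
hypothetical.

```go
// Sketch of hard exclusion at the API boundary. The guarantee comes
// from filtering retrieved hits, not from embedding-space similarity.
package main

import "fmt"

type Hit struct {
	WorkerID string
	Score    float64
}

type SearchRequest struct {
	Query      string   `json:"query"`       // positive intent, embedded as usual
	TopK       int      `json:"top_k"`
	ExcludeIDs []string `json:"exclude_ids"` // populated by the UI's exclude click
}

// applyExcludes drops excluded workers after the cosine retrieval pass,
// making the exclusion a hard guarantee rather than an embedding-space hope.
func applyExcludes(hits []Hit, exclude []string) []Hit {
	skip := make(map[string]bool, len(exclude))
	for _, id := range exclude {
		skip[id] = true
	}
	kept := hits[:0]
	for _, h := range hits {
		if !skip[h.WorkerID] {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	hits := []Hit{{WorkerID: "w1", Score: 0.91}, {WorkerID: "w2", Score: 0.88}}
	fmt.Println(applyExcludes(hits, []string{"w2"})) // [{w1 0.91}]
}
```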
**(2)** is a small UX improvement worth shipping eventually: when
all top-K judge ratings are ≤ 2/5, surface "no good match found —
consider tightening constraints" instead of returning the cold-top-1
silently. This is one query-response shape change, not a substrate
change.
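
A minimal sketch of that response-shape change, assuming judge ratings
ride along with each hit; the field names are hypothetical.

```go
// Sketch of (2): attach an operator-facing notice when every top-K
// judge rating is <= 2/5, instead of silently returning cold-top-1.
package main

import "fmt"

type Rated struct {
	WorkerID    string `json:"worker_id"`
	JudgeRating int    `json:"judge_rating"` // 1-5
}

type LiftResponse struct {
	Hits   []Rated `json:"hits"`
	Notice string  `json:"notice,omitempty"`
}

func respond(hits []Rated) LiftResponse {
	allBad := true
	for _, h := range hits {
		if h.JudgeRating > 2 {
			allBad = false
			break
		}
	}
	r := LiftResponse{Hits: hits}
	if allBad {
		r.Notice = "no good match found — consider tightening constraints"
	}
	return r
}

func main() {
	// Ratings all <= 2, as in Q1 above, so the notice fires.
	fmt.Println(respond([]Rated{{"w1", 1}, {"w2", 1}}).Notice)
}
```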
**(1)** is research-grade work. Don't ship until production traffic
demonstrates coordinators actually type natural-language negations
rather than using exclude affordances.
## Honesty signal validation
The judge IS doing its job. Q1, Q3, Q4 had judge ratings of mostly
1/5, with no rating ≥ 4. That means in production, a
judge-rating-distribution monitor would flag: "this query produced 0
results with quality score ≥ 4." That's an actionable operator signal
without requiring any new code in the substrate.
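
That monitor reduces to one predicate over the rating list. A sketch,
using the actual Q1 and Q2 ratings from the table above (the function
name is illustrative):

```go
// flagNoGoodMatch reports whether a query produced zero results with
// quality score >= 4, i.e. the actionable operator signal described above.
package main

import "fmt"

func flagNoGoodMatch(ratings []int) bool {
	for _, r := range ratings {
		if r >= 4 {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // Q1: flags
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // Q2: does not flag
	fmt.Println(flagNoGoodMatch(q1), flagNoGoodMatch(q2)) // true false
}
```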
## What this probe does NOT cover
- **Quantifier negation** ("at most 3 of these workers"): different
failure mode, also unhandled, also won't be added to substrate.
- **Conditional constraints** ("if no forklift ops available, fall
back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially
handled via tag boost; not tested in this probe.
These are deferred to "when production traffic shows them."
## Repro
```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt
# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.