golangLAKEHOUSE/reports/reality-tests/real_005_findings.md
root cca32344f3 reality_test real_005: negation probe — substrate gap is correctly out-of-scope
Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has any negation handling or silently treats
"NOT X" as "X".

Headline: the substrate has zero negation handling. Cosine on dense
embeddings treats "NOT in Detroit" as identical to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
  because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK

The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
  (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
  pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:06:06 -05:00

# Reality test real_005 — negation probe (substrate has none, judge catches it)
Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard `playbook_lift.sh` harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".

## Headline
**The substrate has zero negation handling.** Cosine over dense
embeddings treats "NOT in Detroit" the same as "in Detroit" plus
the noise word "NOT" — there is no logical-quantifier representation
in the embedding space. The judge catches the failure
post-retrieval (low ratings), but the retrieval itself doesn't honor
the negation.
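
To see the gap directly, here is a minimal probe sketch. It assumes an
Ollama-style local embeddings endpoint and uses a placeholder model
name; neither is confirmed as the substrate's actual wiring.

```go
// Hypothetical probe: embed the positive and negated phrasings and
// compare cosine similarity. Endpoint and model name are assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math"
	"net/http"
)

func embed(text string) ([]float64, error) {
	body, _ := json.Marshal(map[string]string{
		"model":  "nomic-embed-text", // placeholder, not necessarily the substrate's model
		"prompt": text,
	})
	resp, err := http.Post("http://localhost:11434/api/embeddings",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Embedding []float64 `json:"embedding"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embedding, nil
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	a, err := embed("Forklift Operators in Detroit")
	if err != nil {
		panic(err)
	}
	b, err := embed("Forklift Operators NOT in Detroit")
	if err != nil {
		panic(err)
	}
	// Expect a value near 1.0: the "NOT" token barely moves the vector.
	fmt.Printf("cosine(pos, neg) = %.4f\n", cosine(a, b))
}
```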
## Per-query results
| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city)
were strong enough to pull workers from outside the negated set
naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit /
Cornerstone roster / Detroit-area) is the dominant content word, so
cosine pulls workers from exactly the location/client the
coordinator told us to avoid.
## Why this is structural, not fixable in the substrate
LLM-style decoder models can handle negation via attention patterns
during generation. Embedding models compress text into a single dense
vector where token-level structure is lost. There is no "NOT" operator
in cosine space — the literature is clear on this limitation; fixing it
is an active research area (e.g. negation-aware contrastive training).
Mitigation paths in our substrate:
1. **Pre-process query with an LLM** to extract negations as
structured filters before retrieval. Same shape as the
`roleExtractor` — qwen2.5 format=json, schema like
`{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
Cost: ~1-3s/query. See the sketch after this list.
2. **Surface "low-confidence" signal** when the judge rates everything
below a threshold (already implicit — `discovery=0` in this run).
Promote that to an operator-visible signal so the UI can prompt
"this query has constraints I couldn't honor; please exclude
manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by
the UI. The substrate already supports this (added in the
multi-coord stress 200-worker swap). Coordinators in a UI
shouldn't type "NOT Beacon Freight" — they should click an
exclude button.
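
A minimal sketch of path (1). The qwen2.5 model, `format=json`, and the
filter schema come from item 1 above; the endpoint URL, prompt wording,
and function names are assumptions, not the substrate's actual wiring.

```go
// Sketch of mitigation (1): pre-extract negations into structured
// filters before retrieval, mirroring the roleExtractor shape.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// NegationFilter matches the schema described above.
type NegationFilter struct {
	Positive         string   `json:"positive"`
	ExcludeLocations []string `json:"exclude_locations"`
	ExcludeClients   []string `json:"exclude_clients"`
}

func extractNegations(query string) (*NegationFilter, error) {
	prompt := "Extract the positive intent and any excluded locations or " +
		"clients from this staffing query as JSON with keys positive, " +
		"exclude_locations, exclude_clients. Query: " + query
	body, _ := json.Marshal(map[string]any{
		"model":  "qwen2.5",
		"prompt": prompt,
		"format": "json", // constrain the model to emit valid JSON
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	var f NegationFilter
	if err := json.Unmarshal([]byte(out.Response), &f); err != nil {
		return nil, err
	}
	return &f, nil
}

func main() {
	f, err := extractNegations("Need 5 Forklift Operators in Aurora IL, NOT in Detroit")
	if err != nil {
		panic(err)
	}
	// Retrieval would embed f.Positive and apply the excludes as hard filters.
	fmt.Printf("%+v\n", f)
}
```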
## Architectural recommendation
**(3)** is the right answer for production. UIs solve negation
cheaper than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and surface honesty signals
when its retrieval doesn't fit the query (judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries that then
fail silently whenever the LLM extractor misses a phrasing.
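
For concreteness, a sketch of the exclusion path. Only `ExcludeIDs`
itself is confirmed above; the request and hit shapes around it are
hypothetical.

```go
// Sketch of hard exclusion at the API boundary. The guarantee comes
// from filtering retrieved hits, not from embedding-space similarity.
package main

import "fmt"

type Hit struct {
	WorkerID string
	Score    float64
}

type SearchRequest struct {
	Query      string   `json:"query"`       // positive intent, embedded as usual
	TopK       int      `json:"top_k"`
	ExcludeIDs []string `json:"exclude_ids"` // populated by the UI's exclude click
}

// applyExcludes drops excluded workers after the cosine retrieval pass,
// making the exclusion a hard guarantee rather than an embedding-space hope.
func applyExcludes(hits []Hit, exclude []string) []Hit {
	skip := make(map[string]bool, len(exclude))
	for _, id := range exclude {
		skip[id] = true
	}
	kept := hits[:0]
	for _, h := range hits {
		if !skip[h.WorkerID] {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	hits := []Hit{{WorkerID: "w1", Score: 0.91}, {WorkerID: "w2", Score: 0.88}}
	fmt.Println(applyExcludes(hits, []string{"w2"})) // [{w1 0.91}]
}
```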
**(2)** is a small UX improvement worth shipping eventually: when
all top-K judge ratings are ≤ 2/5, surface "no good match found —
consider tightening constraints" instead of returning the cold-top-1
silently. This is one query-response shape change, not a substrate
change.
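
A minimal sketch of that response-shape change, assuming judge ratings
ride along with each hit; the field names are hypothetical.

```go
// Sketch of (2): attach an operator-facing notice when every top-K
// judge rating is <= 2/5, instead of silently returning cold-top-1.
package main

import "fmt"

type Rated struct {
	WorkerID    string `json:"worker_id"`
	JudgeRating int    `json:"judge_rating"` // 1-5
}

type LiftResponse struct {
	Hits   []Rated `json:"hits"`
	Notice string  `json:"notice,omitempty"`
}

func respond(hits []Rated) LiftResponse {
	allBad := true
	for _, h := range hits {
		if h.JudgeRating > 2 {
			allBad = false
			break
		}
	}
	r := LiftResponse{Hits: hits}
	if allBad {
		r.Notice = "no good match found — consider tightening constraints"
	}
	return r
}

func main() {
	// Ratings all <= 2, as in Q1 above, so the notice fires.
	fmt.Println(respond([]Rated{{"w1", 1}, {"w2", 1}}).Notice)
}
```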
**(1)** is research-grade work. Don't ship until production traffic
demonstrates coordinators actually type natural-language negations
rather than using exclude affordances.
## Honesty signal validation
The judge IS doing its job. Q1, Q3, Q4 had judge ratings of mostly
1/5, with no rating ≥ 4. That means in production, a
judge-rating-distribution monitor would flag: "this query produced 0
results with quality score ≥ 4." That's an actionable operator signal
without requiring any new code in the substrate.
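
That monitor reduces to one predicate over the rating list. A sketch,
using the actual Q1 and Q2 ratings from the table above (the function
name is illustrative):

```go
// flagNoGoodMatch reports whether a query produced zero results with
// quality score >= 4, i.e. the actionable operator signal described above.
package main

import "fmt"

func flagNoGoodMatch(ratings []int) bool {
	for _, r := range ratings {
		if r >= 4 {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // Q1: flags
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // Q2: does not flag
	fmt.Println(flagNoGoodMatch(q1), flagNoGoodMatch(q2)) // true false
}
```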
## What this probe does NOT cover
- **Quantifier negation** ("at most 3 of these workers"): different
failure mode, also unhandled, also won't be added to substrate.
- **Conditional constraints** ("if no forklift ops available, fall
back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially
handled via tag boost; not tested in this probe.
These are deferred to "when production traffic shows them."
## Repro
```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt
# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.