Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora
IL, NOT in Detroit", "excluding Cornerstone Fabrication roster",
etc.) through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".
Headline: the substrate has zero negation handling. Cosine over
dense embeddings treats "NOT in Detroit" the same as "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.
Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK
The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.
No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
(already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
pretend to honor unparseable constraints
Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.
Findings: reports/reality-tests/real_005_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
112 lines
5.0 KiB
Markdown
# Reality test real_005 — negation probe (substrate has none, judge catches it)

Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard `playbook_lift.sh` harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".

## Headline

**The substrate has zero negation handling.** Cosine over dense
embeddings treats "NOT in Detroit" the same as "in Detroit" plus
the noise word "NOT" — there is no logical-quantifier representation
in the embedding space. The judge catches the failure post-retrieval
(low ratings), but the retrieval itself doesn't honor the negation.

## Per-query results

| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city)
were strong enough to pull workers from outside the negated set
naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit /
Cornerstone roster / Detroit-area) is the dominant content word, so
cosine pulls workers from exactly the location/client the
coordinator told us to avoid.

## Why this is structural, not fixable in the substrate

LLM-style decoder models can do negation via attention patterns over
generation. Embedding models compress text into a single dense vector
where token-level structure is lost. There is no "NOT" operator in
cosine space — the literature is clear on this; it's an active
research area (e.g. negation-aware contrastive training).

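As a toy illustration of why the "NOT" token is lost (a naive bag-of-words cosine, not the real dense encoder; `bow_cosine` and the example strings are invented for this sketch, but real embedding models degrade the same way):

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over naive bag-of-words token counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# "NOT" is just one more token: it barely moves the vector, so the
# negated query still matches a Detroit worker almost as strongly.
doc = "forklift operator in detroit"
print(bow_cosine("forklift operator in detroit", doc))      # 1.0
print(bow_cosine("forklift operator not in detroit", doc))  # ~0.894
```

The negated query loses roughly a tenth of similarity, nowhere near enough to push the excluded entity out of the top-K.
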
Mitigation paths in our substrate:

1. **Pre-process query with an LLM** to extract negations as
   structured filters before retrieval. Same shape as the
   `roleExtractor` — qwen2.5 format=json, schema like
   `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
   Cost: ~1-3s/query.
2. **Surface a "low-confidence" signal** when the judge rates everything
   below a threshold (already implicit — `discovery=0` in this run).
   Promote that to an operator-visible signal so the UI can prompt
   "this query has constraints I couldn't honor; please exclude
   manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by
   the UI. The substrate already supports this (added in the
   multi-coord stress 200-worker swap). Coordinators in a UI
   shouldn't type "NOT Beacon Freight" — they should click an
   exclude button.

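A minimal sketch of the parsing side of option 1, assuming the extractor returns the JSON schema above (`NegationFilters` and `parse_extractor_output` are hypothetical names for this sketch, not substrate code; the LLM call itself is out of scope):

```python
import json
from dataclasses import dataclass, field

@dataclass
class NegationFilters:
    positive: str
    exclude_locations: list = field(default_factory=list)
    exclude_clients: list = field(default_factory=list)

def parse_extractor_output(raw: str) -> NegationFilters:
    """Parse the extractor's format=json reply into structured filters.
    On malformed output, fall back to the raw query as positive-only,
    so a bad extraction never silently drops the whole query."""
    try:
        d = json.loads(raw)
        return NegationFilters(
            positive=d.get("positive", ""),
            exclude_locations=list(d.get("exclude_locations", [])),
            exclude_clients=list(d.get("exclude_clients", [])),
        )
    except (json.JSONDecodeError, AttributeError):
        return NegationFilters(positive=raw)

raw = ('{"positive": "Need 5 Forklift Operators in Aurora IL",'
       ' "exclude_locations": ["Detroit"], "exclude_clients": []}')
print(parse_extractor_output(raw).exclude_locations)  # ['Detroit']
```

The fallback branch is the important design choice: extractor failure degrades to today's behavior rather than a new failure mode.
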
## Architectural recommendation

**(3)** is the right answer for production. UIs solve negation
cheaper than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and surface honesty signals
when its retrieval doesn't fit the query (judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries and then
silently fail when the LLM extractor misses a phrasing.

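The exclusion semantics that (3) relies on amount to a set-membership post-filter over ranked candidates. A sketch (`apply_exclude_ids` and the worker IDs are illustrative; the real `ExcludeIDs` handling sits at the API boundary):

```python
def apply_exclude_ids(ranked, exclude_ids):
    """Drop candidates whose worker ID is in the UI-populated exclude
    set. `ranked` is a list of (worker_id, distance) pairs, best first;
    survivor order is preserved, so ranking is untouched."""
    excluded = set(exclude_ids)
    return [(wid, dist) for wid, dist in ranked if wid not in excluded]

ranked = [("w-101", 0.39), ("w-202", 0.41), ("w-303", 0.45)]
print(apply_exclude_ids(ranked, ["w-202"]))
# [('w-101', 0.39), ('w-303', 0.45)]
```

Because the filter is exact-match on IDs, it cannot misfire the way an NL extractor can: the coordinator clicked the exclusion, so there is no phrasing to misparse.
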
**(2)** is a small UX improvement worth shipping eventually: when
all top-K judge ratings are ≤ 2/5, surface "no good match found —
consider tightening constraints" instead of returning the cold-top-1
silently. This is one query-response shape change, not a substrate
change.

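The threshold check behind (2) is tiny; a sketch (`low_confidence` is a hypothetical name, and the real version would wire into the query-response shape rather than print):

```python
def low_confidence(ratings, threshold=2):
    """True when every top-K judge rating is at or below `threshold`,
    i.e. retrieval could not honor the query's constraints. An empty
    rating list is treated as not-flagged rather than low-confidence."""
    return bool(ratings) and all(r <= threshold for r in ratings)

# Ratings from the table above: Q1 should flag, Q2 should not.
q1 = [1, 1, 1, 1, 2, 1, 2, 2, 2, 2]
q2 = [4, 4, 3, 2, 3, 4, 1, 4, 4, 1]
print(low_confidence(q1), low_confidence(q2))  # True False
```
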
**(1)** is research-grade work. Don't ship until production traffic
demonstrates coordinators actually type natural-language negations
rather than using exclude affordances.

## Honesty signal validation

The judge IS doing its job. Q1, Q3, and Q4 had judge ratings of
mostly 1/5, with no rating ≥ 4. That means in production, a
judge-rating-distribution monitor would flag: "this query produced
0 results with quality score ≥ 4." That's an actionable operator
signal without requiring any new code in the substrate.

## What this probe does NOT cover

- **Quantifier negation** ("at most 3 of these workers"): different
  failure mode, also unhandled, also won't be added to substrate.
- **Conditional constraints** ("if no forklift ops available, fall
  back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially
  handled via tag boost; not tested in this probe.

These are deferred to "when production traffic shows them."

## Repro

```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt

# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.