Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".
Headline: substrate has zero negation handling. Cosine on dense
embeddings scores "NOT in Detroit" the same as "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.
Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK
The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.
No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
(already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
pretend to honor unparseable constraints
Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.
Findings: reports/reality-tests/real_005_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Reality test real_005 — negation probe (substrate has none, judge catches it)
Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard playbook_lift.sh harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".
## Headline
The substrate has zero negation handling. Cosine over dense embeddings scores "NOT in Detroit" the same as "in Detroit" plus the noise word "NOT" — there is no logical-quantifier representation in the embedding space. The judge catches the failure post-retrieval (low ratings), but the retrieval itself doesn't honor the negation.
## Per-query results
| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | 1,1,1,1,2,1,2,2,2,2 |
all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | 4,4,3,2,3,4,1,4,4,1 |
top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | 1,1,1,1,1,1,1,1,1,1 |
unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | 2,1,1,1,2,1,1,1,2,1 |
all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | 4,3,4,4,1,4,2,2,1,2 |
top-1 4/5 — accidentally OK |
Q2 and Q5 only "passed" because the non-negated signals (role + city) were strong enough to pull workers from outside the negated set naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit / Cornerstone roster / Detroit-area) is the dominant content word, so cosine pulls workers from exactly the location/client the coordinator told us to avoid.
## Why this is structural, not fixable in the substrate
LLM-style decoder models can do negation via attention patterns over generation. Embedding models compress text into a single dense vector where token-level structure is lost. There is no "NOT" operator in cosine space — the literature is clear on this; it's an active research area (e.g. negation-aware contrastive training).
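If you want to see the gap directly, a minimal probe is to embed a negated and a non-negated variant of the same query and compare cosine similarity. The sketch below assumes a Go toolchain and embeddings served by a local Ollama instance; the endpoint, the placeholder model name, and the claim that the substrate embeds via Ollama are assumptions, not details confirmed by this report.

```go
// Embed a negated and a non-negated variant of the same query and compare
// cosine similarity. Endpoint and model name are placeholders, not confirmed
// substrate details.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"math"
	"net/http"
)

// embed fetches a dense embedding for text from Ollama's /api/embeddings.
func embed(text string) ([]float64, error) {
	body, _ := json.Marshal(map[string]string{
		"model":  "nomic-embed-text", // placeholder embedding model
		"prompt": text,
	})
	resp, err := http.Post("http://localhost:11434/api/embeddings",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Embedding []float64 `json:"embedding"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embedding, nil
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	neg, err := embed("Forklift Operators in Aurora IL, NOT in Detroit")
	if err != nil {
		log.Fatal(err)
	}
	pos, err := embed("Forklift Operators in Aurora IL, in Detroit")
	if err != nil {
		log.Fatal(err)
	}
	// Expectation: the two score nearly identically, which is why retrieval
	// cannot act on the "NOT".
	fmt.Printf("cosine similarity = %.3f\n", cosine(neg, pos))
}
```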
Mitigation paths in our substrate:

1. Pre-process the query with an LLM to extract negations as structured
   filters before retrieval. Same shape as the `roleExtractor` — qwen2.5
   with `format=json`, schema like
   `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
   Cost: ~1-3s/query. (A sketch of this shape follows the list.)
2. Surface a "low-confidence" signal when the judge rates everything below
   a threshold (already implicit — `discovery=0` in this run). Promote that
   to an operator-visible signal so the UI can prompt "this query has
   constraints I couldn't honor; please exclude manually."
3. Use structured `ExcludeIDs` at the API boundary, populated by the UI.
   The substrate already supports this (added in the multi-coord stress
   200-worker swap). Coordinators in a UI shouldn't type "NOT Beacon
   Freight" — they should click an exclude button.
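For scoping purposes only (the recommendation below is to not ship this), here is roughly what option (1) implies in code, assuming a Go substrate and an Ollama-style `format=json` call mirroring what this report says the existing `roleExtractor` does. Every struct, URL, and function name below is illustrative, not an existing substrate API.

```go
// Sketch of option (1): ask the LLM to split a coordinator query into a
// positive part plus exclusion lists. Names and endpoint are illustrative.
package extract

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// NegationFilter mirrors the schema suggested in option (1).
type NegationFilter struct {
	Positive         string   `json:"positive"`
	ExcludeLocations []string `json:"exclude_locations"`
	ExcludeClients   []string `json:"exclude_clients"`
}

// ExtractNegations sends the query to the LLM and parses its JSON output
// into structured filters usable before retrieval.
func ExtractNegations(query string) (*NegationFilter, error) {
	prompt := "Split this staffing query into JSON with keys positive, " +
		"exclude_locations, exclude_clients. Query: " + query
	reqBody, _ := json.Marshal(map[string]any{
		"model":  "qwen2.5",
		"prompt": prompt,
		"format": "json", // constrain the model to emit valid JSON
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(reqBody))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Ollama wraps the model output in a "response" string field.
	var wrapper struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&wrapper); err != nil {
		return nil, err
	}
	var filter NegationFilter
	if err := json.Unmarshal([]byte(wrapper.Response), &filter); err != nil {
		return nil, fmt.Errorf("extractor returned non-schema JSON: %w", err)
	}
	return &filter, nil
}
```

The ~1-3s/query cost quoted above is the round trip through this extra LLM call, on top of the retrieval itself.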
## Architectural recommendation
(3) is the right answer for production. UIs solve negation
cheaper than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and to surface honesty signals
when its retrieval doesn't fit the query (the judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries that then
fail silently when the LLM extractor misses a phrasing.
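For illustration, option (3) amounts to a request shape like the following; only the `ExcludeIDs` field is confirmed as already supported, and the surrounding struct and function are assumptions.

```go
// Illustration of option (3): the UI resolves an "exclude" click to worker
// IDs and sends them explicitly, so the query text never carries negation.
// Only the ExcludeIDs field is confirmed by this report; the rest is assumed.
package api

type MatchRequest struct {
	Query      string   `json:"query"`       // positive intent only
	TopK       int      `json:"top_k"`       // number of candidates to return
	ExcludeIDs []string `json:"exclude_ids"` // populated by UI exclude clicks
}

// buildRequest shows the shape a frontend would send after the coordinator
// clicks "exclude Beacon Freight" rather than typing "NOT Beacon Freight".
func buildRequest(beaconFreightWorkerIDs []string) MatchRequest {
	return MatchRequest{
		Query:      "Need 3 Warehouse Associates",
		TopK:       10,
		ExcludeIDs: beaconFreightWorkerIDs,
	}
}
```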
(2) is a small UX improvement worth shipping eventually: when all top-K judge ratings are ≤ 2/5, surface "no good match found — consider tightening constraints" instead of returning the cold-top-1 silently. This is one query-response shape change, not a substrate change.
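A minimal sketch of that response-shape change, assuming judge ratings are available as a per-result integer slice at response time; names here are hypothetical.

```go
// Sketch of the low-confidence response shape for option (2).
package match

const lowConfidenceCeiling = 2 // ratings at or below this mean "can't honor"

// lowConfidence reports whether every top-K judge rating is <= 2/5.
func lowConfidence(ratings []int) bool {
	if len(ratings) == 0 {
		return true
	}
	for _, r := range ratings {
		if r > lowConfidenceCeiling {
			return false
		}
	}
	return true
}

// In the query handler, instead of silently returning the cold top-1:
//
//	if lowConfidence(judgeRatings) {
//		resp.Results = nil
//		resp.Message = "no good match found, consider tightening constraints"
//	}
```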
(1) is research-grade work. Don't ship until production traffic demonstrates coordinators actually type natural-language negations rather than using exclude affordances.
## Honesty signal validation
The judge IS doing its job. Q1, Q3, Q4 had judge ratings of mostly 1/5, with no rating ≥ 4. That means in production, a judge-rating-distribution monitor would flag: "this query produced 0 results with quality score ≥ 4." That's an actionable operator signal without requiring any new code in the substrate.
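As a sketch, that monitor is a few lines over data the harness already emits per run; the map shape and function name below are hypothetical.

```go
// Run-level version of the honesty signal: given the judge ratings each
// query in a run produced, flag queries with zero results rated >= 4.
package monitor

func flagUnhonoredQueries(runRatings map[string][]int) []string {
	var flagged []string
	for query, ratings := range runRatings {
		good := 0
		for _, r := range ratings {
			if r >= 4 {
				good++
			}
		}
		if good == 0 {
			// Operator-visible: "0 results with quality score >= 4".
			flagged = append(flagged, query)
		}
	}
	return flagged
}
```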
## What this probe does NOT cover
- Quantifier negation ("at most 3 of these workers"): different failure mode, also unhandled, also won't be added to substrate.
- Conditional constraints ("if no forklift ops available, fall back to material handlers"): same.
- Soft preferences ("prefer locals over commuters"): partially handled via tag boost; not tested in this probe.
These are deferred to "when production traffic shows them."
## Repro

```sh
# Already-shipped queries file
cat tests/reality/negation_queries.txt

# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```
Evidence: reports/reality-tests/playbook_lift_real_005.{json,md}.