reality_test real_005: negation probe — substrate gap is correctly out-of-scope

Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has any negation handling or silently treats
"NOT X" as "X".

Headline: the substrate has zero negation handling. Cosine on dense
embeddings treats "NOT in Detroit" the same as "in Detroit" plus
noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
  because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK

The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
  (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
  pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit cca32344f3 (parent 434f466288)
root, 2026-04-30 23:06:06 -05:00
4 changed files with 210 additions and 0 deletions


@ -266,6 +266,7 @@ The list is intentionally short. Items move to closed when the work demands them
| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@ -0,0 +1,81 @@
# Playbook-Lift Reality Test — Run real_005
**Generated:** 2026-05-01T04:04:14.242729367Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/negation_queries.txt` (5 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_005.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 5 |
| Cold-pass discoveries (judge-best ≠ top-1) | 0 |
| Warm-pass lifts (recorded playbook → top-1) | 0 |
| No change (judge-best already top-1, no playbook needed) | 5 |
| Playbook boosts triggered (warm pass) | 0 |
| Mean Δ top-1 distance (warm − cold) | 0 |
**Verbatim lift rate:** 0 of 0 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detr | e-1723 | 4/2 | — | e-1723 | 4 | no |
| 2 | Need 3 Warehouse Associates, but NOT anyone from Beacon Frei | w-2937 | 0/4 | — | w-2937 | 0 | no |
| 3 | Looking for Pickers in Indianapolis, excluding the Cornersto | e-5033 | 0/1 | — | e-5033 | 0 | no |
| 4 | 1 CNC Operator needed in Flint MI - we cannot use any Detroi | w-1360 | 0/2 | — | w-1360 | 0 | no |
| 5 | Need 2 Loaders in Joliet IL but exclude all currently-placed | w-2998 | 0/4 | — | w-2998 | 0 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
config [models].local_judge.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
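Caveat 2's promotion condition can be checked directly. A minimal sketch of
the boost arithmetic, with hypothetical function names (the actual math lives
in the harness):

```go
package main

import "fmt"

// boost applies the playbook discount: distance' = distance * (1 - 0.5*score).
func boost(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether a judge-best result at judgeDist would overtake
// the cold top-1 at top1Dist after a full score=1.0 boost. With score=1.0
// the distance is halved, so promotion needs judgeDist < 2 * top1Dist.
func promotes(judgeDist, top1Dist float64) bool {
	return boost(judgeDist, 1.0) < top1Dist
}

func main() {
	fmt.Println(promotes(0.70, 0.40)) // 0.35 < 0.40 → true
	fmt.Println(promotes(0.90, 0.40)) // 0.45 < 0.40 → false
}
```

This is why tight clusters show little visible lift: when the judge-best
distance already sits near the cold top-1's, halving it changes nothing
rank-wise, and when it sits beyond 2×, halving isn't enough.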
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why: judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@ -0,0 +1,111 @@
# Reality test real_005 — negation probe (substrate has none, judge catches it)
Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard `playbook_lift.sh` harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".
## Headline
**The substrate has zero negation handling.** Dense embeddings
represent "NOT in Detroit" the same as "in Detroit" plus the noise
token "NOT" — there is no logical-quantifier representation in the
embedding space. The judge catches the failure post-retrieval (low
ratings), but the retrieval itself doesn't honor the negation.
## Per-query results
| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |
Q2 and Q5 only "passed" because the non-negated signals (role + city)
were strong enough to pull workers from outside the negated set
naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit /
Cornerstone roster / Detroit-area) is the dominant content word, so
cosine pulls workers from exactly the location/client the
coordinator told us to avoid.
## Why this is structural, not fixable in the substrate
Decoder-style LLMs can handle negation via attention patterns during
generation. Embedding models compress text into a single dense vector
where token-level structure is lost. There is no "NOT" operator in
cosine space — this is a known limitation of dense retrieval and an
active research area (e.g. negation-aware contrastive training).
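The direction of the failure can be seen without an embedding model at all.
A toy bag-of-words cosine (an illustration only, not the substrate's actual
embedder) shows that adding "not" barely moves the query vector:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// bow builds a bag-of-words count vector. A crude stand-in for a dense
// embedder, used only to illustrate why "NOT" is just one more token.
func bow(text string) map[string]float64 {
	v := map[string]float64{}
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		v[tok]++
	}
	return v
}

// cosine computes the cosine similarity of two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, x := range a {
		dot += x * b[k]
		na += x * x
	}
	for _, y := range b {
		nb += y * y
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	pos := bow("forklift operators in detroit")
	neg := bow("forklift operators not in detroit")
	fmt.Printf("%.3f\n", cosine(pos, neg)) // 0.894 — the negated query stays close
}
```

A real dense embedder is not bag-of-words, but the geometry is analogous:
the negated query lands near the positive one, so the negated entity's
neighbors still dominate the top-K.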
Mitigation paths in our substrate:
1. **Pre-process query with an LLM** to extract negations as
structured filters before retrieval. Same shape as the
`roleExtractor` — qwen2.5 format=json, schema like
`{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
Cost: ~1-3s/query.
2. **Surface "low-confidence" signal** when the judge rates everything
below a threshold (already implicit — `discovery=0` in this run).
Promote that to an operator-visible signal so the UI can prompt
"this query has constraints I couldn't honor; please exclude
manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by
the UI. The substrate already supports this (added in the
multi-coord stress 200-worker swap). Coordinators in a UI
shouldn't type "NOT Beacon Freight" — they should click an
exclude button.
## Architectural recommendation
**(3)** is the right answer for production. UIs solve negation
cheaper than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and surface honesty signals
when its retrieval doesn't fit the query (judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries and then
silently fail when the LLM extractor misses a phrasing.
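The (3) path is mostly API-boundary plumbing. A minimal sketch of the
exclusion step, with hypothetical names (the real `ExcludeIDs` handling lives
at the substrate's API boundary):

```go
package main

import "fmt"

// filterExcluded drops candidate IDs present in excludeIDs — the shape of
// the exclusion a UI-populated ExcludeIDs field implies. Illustrative only.
func filterExcluded(candidates []string, excludeIDs map[string]bool) []string {
	var kept []string
	for _, id := range candidates {
		if !excludeIDs[id] {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	candidates := []string{"w-2937", "w-1360", "e-1723"}
	exclude := map[string]bool{"w-1360": true} // populated by a UI exclude click
	fmt.Println(filterExcluded(candidates, exclude)) // [w-2937 e-1723]
}
```

The point of the sketch: exclusion by ID is exact and cheap, whereas
NL-negation is probabilistic and silent when it misses.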
**(2)** is a small UX improvement worth shipping eventually: when
all top-K judge ratings are ≤ 2/5, surface "no good match found —
consider tightening constraints" instead of returning the cold-top-1
silently. This is one query-response shape change, not a substrate
change.
**(1)** is research-grade work. Don't ship until production traffic
demonstrates coordinators actually type natural-language negations
rather than using exclude affordances.
## Honesty signal validation
The judge IS doing its job. Q1, Q3, and Q4 had judge ratings of mostly
1/5, with no rating ≥ 4. In production, a judge-rating-distribution
monitor would flag: "this query produced 0 results with quality
score ≥ 4." That's an actionable operator signal without requiring
any new code in the substrate.
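That monitor is a few lines. A sketch, assuming a hypothetical
`lowConfidence` helper over the judge's top-K ratings:

```go
package main

import "fmt"

// lowConfidence reports whether no top-K result reached the quality bar —
// the operator-visible honesty signal described above. The bar of 4 matches
// the "0 results with quality score >= 4" monitor; the name is illustrative.
func lowConfidence(ratings []int, bar int) bool {
	for _, r := range ratings {
		if r >= bar {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // Q1's top-10 judge ratings
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // Q2's top-10 judge ratings
	fmt.Println(lowConfidence(q1, 4), lowConfidence(q2, 4)) // true false
}
```

Q1 trips the signal; Q2's accidental 4/5 top-1 clears it, which is the
correct behavior for both cases.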
## What this probe does NOT cover
- **Quantifier negation** ("at most 3 of these workers"): different
failure mode, also unhandled, also won't be added to substrate.
- **Conditional constraints** ("if no forklift ops available, fall
back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially
handled via tag boost; not tested in this probe.
These are deferred to "when production traffic shows them."
## Repro
```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt
# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.


@ -0,0 +1,17 @@
# Negation reality-test queries — real_005
#
# Each query carries an explicit negation that should suppress some
# subset of workers/clients/cities. Cosine on dense embeddings has
# known weaknesses around negation: "NOT Detroit" still tokenizes
# "Detroit" and pulls Detroit-aligned vectors near. Without explicit
# negation handling, the system may silently surface exactly the
# entities the coordinator excluded.
#
# Test goal: characterize whether the substrate degrades silently
# (treats "NOT X" as "X") or surfaces an honesty signal.
Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detroit pool is reserved for another contract)
Need 3 Warehouse Associates, but NOT anyone from Beacon Freight
Looking for Pickers in Indianapolis, excluding the Cornerstone Fabrication roster
1 CNC Operator needed in Flint MI - we cannot use any Detroit-area workers due to non-compete
Need 2 Loaders in Joliet IL but exclude all currently-placed Heritage Foods workers