reality_test real_005: negation probe — substrate gap is correctly out-of-scope

Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has any negation handling or silently treats
"NOT X" as "X".

Headline: the substrate has zero negation handling. Cosine on dense
embeddings treats "NOT in Detroit" the same as "in Detroit" plus
noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.

Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
  because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK

The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.

No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
  (already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
  pretend to honor unparseable constraints

Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.

Findings: reports/reality-tests/real_005_findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit cca32344f3 (parent 434f466288)
root, 2026-04-30 23:06:06 -05:00
4 changed files with 210 additions and 0 deletions


@ -266,6 +266,7 @@ The list is intentionally short. Items move to closed when the work demands them
| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).


@ -0,0 +1,81 @@
# Playbook-Lift Reality Test — Run real_005
**Generated:** 2026-05-01T04:04:14.242729367Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/negation_queries.txt` (5 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_005.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 5 |
| Cold-pass discoveries (judge-best ≠ top-1) | 0 |
| Warm-pass lifts (recorded playbook → top-1) | 0 |
| No change (judge-best already top-1, no playbook needed) | 5 |
| Playbook boosts triggered (warm pass) | 0 |
| Mean Δ top-1 distance (warm − cold) | 0 |
**Verbatim lift rate:** 0 of 0 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detr | e-1723 | 4/2 | — | e-1723 | 4 | no |
| 2 | Need 3 Warehouse Associates, but NOT anyone from Beacon Frei | w-2937 | 0/4 | — | w-2937 | 0 | no |
| 3 | Looking for Pickers in Indianapolis, excluding the Cornersto | e-5033 | 0/1 | — | e-5033 | 0 | no |
| 4 | 1 CNC Operator needed in Flint MI - we cannot use any Detroi | w-1360 | 0/2 | — | w-1360 | 0 | no |
| 5 | Need 2 Loaders in Joliet IL but exclude all currently-placed | w-2998 | 0/4 | — | w-2998 | 0 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
config [models].local_judge.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
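Caveat 2's promotion condition can be checked directly. A minimal sketch of
the boost arithmetic, with hypothetical function names (the actual math lives
in the harness):

```go
package main

import "fmt"

// boost applies the playbook discount: distance' = distance * (1 - 0.5*score).
func boost(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// promotes reports whether a judge-best result at judgeDist would overtake
// the cold top-1 at top1Dist after a full score=1.0 boost. With score=1.0
// the distance is halved, so promotion needs judgeDist < 2 * top1Dist.
func promotes(judgeDist, top1Dist float64) bool {
	return boost(judgeDist, 1.0) < top1Dist
}

func main() {
	fmt.Println(promotes(0.70, 0.40)) // 0.35 < 0.40 → true
	fmt.Println(promotes(0.90, 0.40)) // 0.45 < 0.40 → false
}
```

This is why tight clusters show little visible lift: when the judge-best
distance already sits near the cold top-1's, halving it changes nothing
rank-wise, and when it sits beyond 2×, halving isn't enough.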
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why: judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.


@ -0,0 +1,111 @@
# Reality test real_005 — negation probe (substrate has none, judge catches it)
Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard `playbook_lift.sh` harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".
## Headline
**The substrate has zero negation handling.** Dense embeddings
represent "NOT in Detroit" the same as "in Detroit" plus the noise
token "NOT" — there is no logical-quantifier representation in the
embedding space. The judge catches the failure post-retrieval (low
ratings), but the retrieval itself doesn't honor the negation.
## Per-query results
| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |
Q2 and Q5 only "passed" because the non-negated signals (role + city)
were strong enough to pull workers from outside the negated set
naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit /
Cornerstone roster / Detroit-area) is the dominant content word, so
cosine pulls workers from exactly the location/client the
coordinator told us to avoid.
## Why this is structural, not fixable in the substrate
Decoder-style LLMs can handle negation via attention patterns during
generation. Embedding models compress text into a single dense vector
where token-level structure is lost. There is no "NOT" operator in
cosine space — this is a known limitation of dense retrieval and an
active research area (e.g. negation-aware contrastive training).
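The direction of the failure can be seen without an embedding model at all.
A toy bag-of-words cosine (an illustration only, not the substrate's actual
embedder) shows that adding "not" barely moves the query vector:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// bow builds a bag-of-words count vector. A crude stand-in for a dense
// embedder, used only to illustrate why "NOT" is just one more token.
func bow(text string) map[string]float64 {
	v := map[string]float64{}
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		v[tok]++
	}
	return v
}

// cosine computes the cosine similarity of two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, x := range a {
		dot += x * b[k]
		na += x * x
	}
	for _, y := range b {
		nb += y * y
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	pos := bow("forklift operators in detroit")
	neg := bow("forklift operators not in detroit")
	fmt.Printf("%.3f\n", cosine(pos, neg)) // 0.894 — the negated query stays close
}
```

A real dense embedder is not bag-of-words, but the geometry is analogous:
the negated query lands near the positive one, so the negated entity's
neighbors still dominate the top-K.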
Mitigation paths in our substrate:
1. **Pre-process query with an LLM** to extract negations as
structured filters before retrieval. Same shape as the
`roleExtractor` — qwen2.5 format=json, schema like
`{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
Cost: ~1-3s/query.
2. **Surface "low-confidence" signal** when the judge rates everything
below a threshold (already implicit — `discovery=0` in this run).
Promote that to an operator-visible signal so the UI can prompt
"this query has constraints I couldn't honor; please exclude
manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by
the UI. The substrate already supports this (added in the
multi-coord stress 200-worker swap). Coordinators in a UI
shouldn't type "NOT Beacon Freight" — they should click an
exclude button.
## Architectural recommendation
**(3)** is the right answer for production. UIs solve negation
cheaper than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and surface honesty signals
when its retrieval doesn't fit the query (judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries and then
silently fail when the LLM extractor misses a phrasing.
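The (3) path is mostly API-boundary plumbing. A minimal sketch of the
exclusion step, with hypothetical names (the real `ExcludeIDs` handling lives
at the substrate's API boundary):

```go
package main

import "fmt"

// filterExcluded drops candidate IDs present in excludeIDs — the shape of
// the exclusion a UI-populated ExcludeIDs field implies. Illustrative only.
func filterExcluded(candidates []string, excludeIDs map[string]bool) []string {
	var kept []string
	for _, id := range candidates {
		if !excludeIDs[id] {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	candidates := []string{"w-2937", "w-1360", "e-1723"}
	exclude := map[string]bool{"w-1360": true} // populated by a UI exclude click
	fmt.Println(filterExcluded(candidates, exclude)) // [w-2937 e-1723]
}
```

The point of the sketch: exclusion by ID is exact and cheap, whereas
NL-negation is probabilistic and silent when it misses.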
**(2)** is a small UX improvement worth shipping eventually: when
all top-K judge ratings are ≤ 2/5, surface "no good match found —
consider tightening constraints" instead of returning the cold-top-1
silently. This is one query-response shape change, not a substrate
change.
**(1)** is research-grade work. Don't ship until production traffic
demonstrates coordinators actually type natural-language negations
rather than using exclude affordances.
## Honesty signal validation
The judge IS doing its job. Q1, Q3, and Q4 had judge ratings of mostly
1/5, with no rating ≥ 4. In production, a judge-rating-distribution
monitor would flag: "this query produced 0 results with quality
score ≥ 4." That's an actionable operator signal without requiring
any new code in the substrate.
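That monitor is a few lines. A sketch, assuming a hypothetical
`lowConfidence` helper over the judge's top-K ratings:

```go
package main

import "fmt"

// lowConfidence reports whether no top-K result reached the quality bar —
// the operator-visible honesty signal described above. The bar of 4 matches
// the "0 results with quality score >= 4" monitor; the name is illustrative.
func lowConfidence(ratings []int, bar int) bool {
	for _, r := range ratings {
		if r >= bar {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // Q1's top-10 judge ratings
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // Q2's top-10 judge ratings
	fmt.Println(lowConfidence(q1, 4), lowConfidence(q2, 4)) // true false
}
```

Q1 trips the signal; Q2's accidental 4/5 top-1 clears it, which is the
correct behavior for both cases.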
## What this probe does NOT cover
- **Quantifier negation** ("at most 3 of these workers"): different
failure mode, also unhandled, also won't be added to substrate.
- **Conditional constraints** ("if no forklift ops available, fall
back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially
handled via tag boost; not tested in this probe.
These are deferred to "when production traffic shows them."
## Repro
```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt
# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
./scripts/playbook_lift.sh
```
Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.


@ -0,0 +1,17 @@
# Negation reality-test queries — real_005
#
# Each query carries an explicit negation that should suppress some
# subset of workers/clients/cities. Cosine on dense embeddings has
# known weaknesses around negation: "NOT Detroit" still tokenizes
# "Detroit" and pulls Detroit-aligned vectors near. Without explicit
# negation handling, the system may silently surface exactly the
# entities the coordinator excluded.
#
# Test goal: characterize whether the substrate degrades silently
# (treats "NOT X" as "X") or surfaces an honesty signal.
Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detroit pool is reserved for another contract)
Need 3 Warehouse Associates, but NOT anyone from Beacon Freight
Looking for Pickers in Indianapolis, excluding the Cornerstone Fabrication roster
1 CNC Operator needed in Flint MI - we cannot use any Detroit-area workers due to non-compete
Need 2 Loaders in Joliet IL but exclude all currently-placed Heritage Foods workers