reality_test real_005: negation probe — substrate gap is correctly out-of-scope
Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".
Headline: the substrate has zero negation handling. Cosine on dense
embeddings tokenizes "NOT in Detroit" identically to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.
Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
because the role+city signal pulled a non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK
The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.
No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
(already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
pretend to honor unparseable constraints
Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.
Findings: reports/reality-tests/real_005_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
434f466288
commit
cca32344f3
@ -266,6 +266,7 @@ The list is intentionally short. Items move to closed when the work demands them
| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
81
reports/reality-tests/playbook_lift_real_005.md
Normal file
@ -0,0 +1,81 @@
# Playbook-Lift Reality Test — Run real_005

**Generated:** 2026-05-01T04:04:14.242729367Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/negation_queries.txt` (5 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_005.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 5 |
| Cold-pass discoveries (judge-best ≠ top-1) | 0 |
| Warm-pass lifts (recorded playbook → top-1) | 0 |
| No change (judge-best already top-1, no playbook needed) | 5 |
| Playbook boosts triggered (warm pass) | 0 |
| Mean Δ top-1 distance (warm − cold) | 0 |

**Verbatim lift rate:** 0 of 0 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detr | e-1723 | 4/2 | — | e-1723 | 4 | no |
| 2 | Need 3 Warehouse Associates, but NOT anyone from Beacon Frei | w-2937 | 0/4 | — | w-2937 | 0 | no |
| 3 | Looking for Pickers in Indianapolis, excluding the Cornersto | e-5033 | 0/1 | — | e-5033 | 0 | no |
| 4 | 1 CNC Operator needed in Flint MI - we cannot use any Detroi | w-1360 | 0/2 | — | w-1360 | 0 | no |
| 5 | Need 2 Loaders in Joliet IL but exclude all currently-placed | w-2998 | 0/4 | — | w-2998 | 0 | no |

---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM judge's verdict is what defines "best." If `qwen2.5:latest` rates badly, the lift number is meaningless. To validate the judge itself, sample 5–10 verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap case — same query, recorded playbook, expected boost. The paraphrase pass (when enabled) is the actual learning property: similar-but-different queries hitting a recorded playbook. Compare verbatim and paraphrase lift rates — paraphrase should be lower (semantic-distance gates some playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best results land in one corpus, the matrix layer's purpose isn't being tested. Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from config [models].local_judge. Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates relevance also rephrases queries. A judge that's bad at rating staffing queries is probably also bad at rephrasing them. Worth sanity-checking a sample of `paraphrase_query` values in the JSON before trusting the paraphrase lift number.
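The promotion condition in caveat 2 is easy to sanity-check numerically. A minimal Go sketch of the boost arithmetic (the function names here are illustrative, not the harness's actual identifiers):

```go
package main

import "fmt"

// boosted applies the playbook boost from caveat 2:
// distance' = distance * (1 - 0.5*score).
func boosted(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

// canPromote reports whether a judge-best result at judgeDist can beat
// the cold top-1 at top1Dist after a full score=1.0 boost (halving).
func canPromote(judgeDist, top1Dist float64) bool {
	return boosted(judgeDist, 1.0) < top1Dist
}

func main() {
	// Judge-best at 0.70 vs cold top-1 at 0.40: inside the 2x window,
	// so halving (0.35) undercuts the top-1 and the result promotes.
	fmt.Println(canPromote(0.70, 0.40)) // true
	// Judge-best at 0.90 vs top-1 at 0.40: outside the 2x window,
	// even halving (0.45) fails to promote.
	fmt.Println(canPromote(0.90, 0.40)) // false
}
```

This is why tight distance clusters show little visible lift: the halving only matters when the judge-best result starts within twice the cold top-1's distance.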
## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is already close to optimal on this query distribution. Either the corpus is too narrow or the queries are too easy.
111
reports/reality-tests/real_005_findings.md
Normal file
@ -0,0 +1,111 @@
# Reality test real_005 — negation probe (substrate has none, judge catches it)

Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon Freight", etc. — through the standard `playbook_lift.sh` harness. Goal: characterize whether the substrate has any negation handling or silently treats "NOT X" as "X".

## Headline

**The substrate has zero negation handling.** Cosine on dense embeddings tokenizes "NOT in Detroit" the same as "in Detroit" plus the noise word "NOT" — there is no logical-quantifier representation in the embedding space. The judge catches the failure post-retrieval (low ratings), but the retrieval itself doesn't honor the negation.
## Per-query results

| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city) were strong enough to pull workers from outside the negated set naturally. Q1, Q3, Q4 hit the wall: the negated entity (Detroit / Cornerstone roster / Detroit-area) is the dominant content word, so cosine pulls workers from exactly the location/client the coordinator told us to avoid.
## Why this is structural, not fixable in the substrate

LLM-style decoder models can do negation via attention patterns over generation. Embedding models compress text into a single dense vector where token-level structure is lost. There is no "NOT" operator in cosine space — the literature is clear on this; it's an active research area (e.g. negation-aware contrastive training).

Mitigation paths in our substrate:

1. **Pre-process the query with an LLM** to extract negations as structured filters before retrieval. Same shape as the `roleExtractor` — qwen2.5 format=json, schema like `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`. Cost: ~1-3s/query.
2. **Surface a "low-confidence" signal** when the judge rates everything below a threshold (already implicit — `discovery=0` in this run). Promote that to an operator-visible signal so the UI can prompt "this query has constraints I couldn't honor; please exclude manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by the UI. The substrate already supports this (added in the multi-coord stress 200-worker swap). Coordinators in a UI shouldn't type "NOT Beacon Freight" — they should click an exclude button.
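A minimal sketch of path (1) feeding path (3). Only `ExcludeIDs` and `roleExtractor` are names from this codebase; `NegationFilter`, `applyExclusions`, and the worker shape below are hypothetical stand-ins, not the substrate's actual types:

```go
package main

import "fmt"

// NegationFilter is a hypothetical shape for what an LLM extractor
// (qwen2.5, format=json) could return for a negated query.
type NegationFilter struct {
	Positive         string   // the query with negated clauses stripped
	ExcludeLocations []string // e.g. ["Detroit"]
	ExcludeClients   []string // e.g. ["Beacon Freight"]
}

// worker is an illustrative stand-in for a retrieval hit.
type worker struct {
	ID, City, Client string
}

// applyExclusions drops hits whose city or client was negated away,
// i.e. the post-retrieval half of an ExcludeIDs-style filter.
func applyExclusions(hits []worker, f NegationFilter) []worker {
	excluded := map[string]bool{}
	for _, loc := range f.ExcludeLocations {
		excluded[loc] = true
	}
	for _, cl := range f.ExcludeClients {
		excluded[cl] = true
	}
	var out []worker
	for _, h := range hits {
		if !excluded[h.City] && !excluded[h.Client] {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	f := NegationFilter{
		Positive:         "Need 5 Forklift Operators in Aurora IL",
		ExcludeLocations: []string{"Detroit"},
	}
	hits := []worker{
		{"w-1", "Detroit", "Beacon Freight"},
		{"w-2", "Aurora", "Heritage Foods"},
	}
	for _, h := range applyExclusions(hits, f) {
		fmt.Println(h.ID) // only w-2 survives the Detroit exclusion
	}
}
```

The point of the shape: retrieval stays pure cosine over `Positive`, and exclusion happens as a structured filter that the judge never has to rescue.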
## Architectural recommendation

**(3)** is the right answer for production. UIs solve negation cheaper than NLP. The substrate's job is to make exclusion machinery available (`ExcludeIDs` is already there) and surface honesty signals when its retrieval doesn't fit the query (the judge-rating distribution is already there). Adding NL-negation handling would be product debt — it would let coordinators type sloppier queries and then silently fail when the LLM extractor misses a phrasing.

**(2)** is a small UX improvement worth shipping eventually: when all top-K judge ratings are ≤ 2/5, surface "no good match found — consider tightening constraints" instead of returning the cold top-1 silently. This is one query-response shape change, not a substrate change.

**(1)** is research-grade work. Don't ship it until production traffic demonstrates coordinators actually type natural-language negations rather than using exclude affordances.
## Honesty signal validation

The judge IS doing its job. Q1, Q3, Q4 had judge ratings of mostly 1/5, with no rating ≥ 4. That means in production, a judge-rating-distribution monitor would flag: "this query produced 0 results with quality score ≥ 4." That's an actionable operator signal without requiring any new code in the substrate.
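That monitor is a few lines of Go. A sketch (the function name and threshold are illustrative; a real monitor would read the ratings out of the run JSON):

```go
package main

import "fmt"

// noGoodMatch reports whether every top-K judge rating sits at or below
// the bad-match threshold, i.e. retrieval couldn't honor the query.
func noGoodMatch(ratings []int, threshold int) bool {
	for _, r := range ratings {
		if r > threshold {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // real_005 Q1 ratings
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // real_005 Q2 ratings
	fmt.Println(noGoodMatch(q1, 2)) // true: surface "no good match found"
	fmt.Println(noGoodMatch(q2, 2)) // false: top-1 rated 4/5
}
```

Against this run's data, Q1 trips the flag and Q2 does not, which matches the per-query table above.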
## What this probe does NOT cover

- **Quantifier negation** ("at most 3 of these workers"): different failure mode, also unhandled, also won't be added to the substrate.
- **Conditional constraints** ("if no forklift ops available, fall back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially handled via tag boost; not tested in this probe.

These are deferred to "when production traffic shows them."

## Repro

```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt

# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.
17
tests/reality/negation_queries.txt
Normal file
@ -0,0 +1,17 @@
# Negation reality-test queries — real_005
#
# Each query carries an explicit negation that should suppress some
# subset of workers/clients/cities. Cosine on dense embeddings has
# known weaknesses around negation: "NOT Detroit" still tokenizes
# "Detroit" and pulls Detroit-aligned vectors near. Without explicit
# negation handling, the system may silently surface exactly the
# entities the coordinator excluded.
#
# Test goal: characterize whether the substrate degrades silently
# (treats "NOT X" as "X") or surfaces an honesty signal.

Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detroit pool is reserved for another contract)
Need 3 Warehouse Associates, but NOT anyone from Beacon Freight
Looking for Pickers in Indianapolis, excluding the Cornerstone Fabrication roster
1 CNC Operator needed in Flint MI - we cannot use any Detroit-area workers due to non-compete
Need 2 Loaders in Joliet IL but exclude all currently-placed Heritage Foods workers