reality_test real_005: negation probe — substrate gap is correctly out-of-scope
Ran 5 explicit-negation queries ("Need Forklift Operators in Aurora IL,
NOT in Detroit", "excluding Cornerstone Fabrication roster", etc.)
through the standard playbook_lift harness. Goal: characterize
whether the substrate has negation handling or silently treats
"NOT X" as "X".

Headline: the substrate has zero negation handling. Cosine on dense
embeddings treats "NOT in Detroit" identically to "in Detroit"
plus noise — there is no logical-quantifier representation in the
embedding space. This is a structural property of dense embeddings,
not a substrate bug.
Per-query observations:
- Q1 (Aurora IL, NOT Detroit): all top-10 rated 1-2/5 by judge
- Q2 (NOT Beacon Freight): top-1 rated 4/5 — accidentally OK
because role+city signal pulled non-Beacon worker naturally
- Q3 (excluding Cornerstone): unanimous 1/5 across top-10
- Q4 (NOT Detroit-area): all top-10 rated 1-2/5
- Q5 (exclude Heritage Foods): top-1 rated 4/5 — accidentally OK
The judge IS the safety net: when retrieval can't honor the
constraint, the judge refuses to approve any result. That's the
honesty signal — `discovery=0` for the run aggregates it.
No code change. The architectural answer for production is:
- UI surfaces an "exclude" affordance that populates ExcludeIDs
(already supported, added in multi-coord stress 200-worker swap)
- Coordinators don't type natural-language negation — they click
- Substrate's role: surface honesty signal (judge ratings) + don't
pretend to honor unparseable constraints
Adding NL-negation handling at the substrate level would be product
debt — it would let coordinators type sloppier queries that
silently fail when the LLM extractor misses a phrasing. Don't ship
until production traffic demonstrates demand for it.
Findings: reports/reality-tests/real_005_findings.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in: parent 434f466288 · commit cca32344f3
@@ -266,6 +266,7 @@ The list is intentionally short. Items move to closed when the work demands them

| (probe) | Reality test real_003: 40 queries (10 fill_events rows × 4 styles — `need` / `client_first` / `looking` / `shorthand`). Confirmed shorthand-vs-shorthand bleed is real: CNC Operator shorthand recording leaked `w-2404` onto Forklift Operator shorthand query (both empty role, gate disabled). Extended `extractRoleFromNeed` to handle `client_first` + `looking` patterns; shorthand stays empty (regex can't separate role from city without anchor). Re-run real_003b: bleed closed across all 4 styles in this dataset. 10 new sub-tests in `scripts/playbook_lift/main_test.go` lock the patterns + the documented shorthand limitation. Findings: `reports/reality-tests/real_003_findings.md`. |
| (fix) | LLM-based role extractor (real_004): `roleExtractor` struct with regex-first → qwen2.5 format=json fallback → per-process cache. Opt-in via `-llm-role-extract` flag + `LLM_ROLE_EXTRACT=1` env. Off-by-default preserves real_003b shipping config. 8 new tests including `TestRoleExtractor_ClosesCrossRoleShorthandBleed` — the load-bearing witness pairing with the matrix-side `TestInjectPlaybookMisses_RoleGateRejectsCrossRole` to prove the extraction-layer + gate-layer compose correctly on the exact real_003 failure mode. Findings: `reports/reality-tests/real_004_findings.md`. |
| (scrum) | 3-lineage scrum review on `7f2f112..0331288` (Opus + Kimi + Qwen3-coder via `scripts/scrum_review.sh`). Convergent finding (3/3): `roleNormalize` plural-stripper mangled non-plural-s tokens (Sales → Sale, Logistics → Logistic). **Fixed**: `nonPluralSWords` allowlist + `-ss` ending check + `strings.ToLower`/`TrimSpace` cleanup. New tests `TestRoleNormalize_NonPluralS` + `TestRoleEqual_NonPluralS` lock the edge cases. Kimi 2 BLOCKs were false positives (model-truncation artifacts per `feedback_cross_lineage_review.md`). Disposition: `reports/scrum/_evidence/2026-04-30/verdicts/role_gate_v1_disposition.md` (local). |
| (probe) | Negation reality test real_005: 5 explicit-negation queries ("NOT in Detroit", "excluding Cornerstone roster", etc.). Confirmed substrate has **zero negation handling** — cosine treats "NOT X" as "X" + noise. Judge IS the safety net (Q1/Q3/Q4 rated all top-10 results 1-2/5 — operator-visible honesty signal). **No code change needed**: production UI should handle exclusion via `ExcludeIDs` (already supported, added in multi-coord stress 200-worker swap), not via NL-negation. Findings: `reports/reality-tests/real_005_findings.md`. |

Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
reports/reality-tests/playbook_lift_real_005.md · 81 lines · new file
@@ -0,0 +1,81 @@
# Playbook-Lift Reality Test — Run real_005

**Generated:** 2026-05-01T04:04:14.242729367Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from config [models].local_judge)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/negation_queries.txt` (5 executed)
**K per pass:** 10
**Paraphrase pass:** disabled
**Re-judge pass:** disabled
**Evidence:** `reports/reality-tests/playbook_lift_real_005.json`

---

## Headline

| Metric | Value |
|---|---:|
| Total queries run | 5 |
| Cold-pass discoveries (judge-best ≠ top-1) | 0 |
| Warm-pass lifts (recorded playbook → top-1) | 0 |
| No change (judge-best already top-1, no playbook needed) | 5 |
| Playbook boosts triggered (warm pass) | 0 |
| Mean Δ top-1 distance (warm − cold) | 0 |

**Verbatim lift rate:** 0 of 0 discoveries became top-1 after warm pass.

---

## Per-query results

| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detr… | e-1723 | 4/2 | — | e-1723 | 4 | no |
| 2 | Need 3 Warehouse Associates, but NOT anyone from Beacon Frei… | w-2937 | 0/4 | — | w-2937 | 0 | no |
| 3 | Looking for Pickers in Indianapolis, excluding the Cornersto… | e-5033 | 0/1 | — | e-5033 | 0 | no |
| 4 | 1 CNC Operator needed in Flint MI - we cannot use any Detroi… | w-1360 | 0/2 | — | w-1360 | 0 | no |
| 5 | Need 2 Loaders in Joliet IL but exclude all currently-placed… | w-2998 | 0/4 | — | w-2998 | 0 | no |
---

## Honesty caveats

1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance; otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case — same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare verbatim and paraphrase
   lift rates — paraphrase should be lower (the semantic-distance gate blocks some
   playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora = `workers,ethereal_workers` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check the per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
   config [models].local_judge. Bumping the judge for run #N+1 means editing
   one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.
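The boost arithmetic in caveat 2 is easy to sanity-check numerically. A minimal sketch (the distances below are made up for illustration; only the formula is from this report):

```go
package main

import "fmt"

// playbookBoost applies the report's formula:
// distance' = distance × (1 − 0.5 × score).
// A score of 1.0 therefore halves the distance, and nothing more.
func playbookBoost(distance, score float64) float64 {
	return distance * (1 - 0.5*score)
}

func main() {
	coldTop1 := 0.40 // hypothetical cold top-1 distance

	// Judge-best within 2× of the cold top-1: a full-score boost promotes it.
	near := playbookBoost(0.70, 1.0) // 0.35 — now beats 0.40
	// Judge-best beyond 2×: even halving cannot promote it.
	far := playbookBoost(0.90, 1.0) // 0.45 — still behind 0.40

	fmt.Printf("near=%.2f lift=%v\n", near, near < coldTop1)
	fmt.Printf("far=%.2f lift=%v\n", far, far < coldTop1)
}
```

The 2× window in caveat 2 falls directly out of the 0.5× factor: halving is the best the boost can ever do.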

## Next moves

- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
- If the discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
reports/reality-tests/real_005_findings.md · 111 lines · new file

@@ -0,0 +1,111 @@
# Reality test real_005 — negation probe (substrate has none, judge catches it)

Ran 5 explicit-negation queries — "NOT in Detroit", "excluding Beacon
Freight", etc. — through the standard `playbook_lift.sh` harness.
Goal: characterize whether the substrate has any negation handling
or silently treats "NOT X" as "X".

## Headline

**The substrate has zero negation handling.** Cosine on dense
embeddings treats "NOT in Detroit" the same as "in Detroit" plus
the noise word "NOT" — there is no logical-quantifier representation
in the embedding space. The judge catches the failure
post-retrieval (low ratings), but the retrieval itself doesn't honor
the negation.
## Per-query results

| Q | Query (head) | cold_top1_dist | judge ratings (top-10) | judge says |
|---|---|---:|---|---|
| 1 | "Need 5 Forklift Operators in Aurora IL, NOT in Detroit" | 0.386 | `1,1,1,1,2,1,2,2,2,2` | all bad — system can't honor |
| 2 | "Need 3 Warehouse Associates, but NOT from Beacon Freight" | 0.453 | `4,4,3,2,3,4,1,4,4,1` | top-1 4/5 — accidentally OK |
| 3 | "Looking for Pickers in Indianapolis, excluding Cornerstone Fabrication" | 0.468 | `1,1,1,1,1,1,1,1,1,1` | unanimous fail |
| 4 | "1 CNC Operator in Flint MI - cannot use Detroit-area" | 0.434 | `2,1,1,1,2,1,1,1,2,1` | all bad |
| 5 | "Need 2 Loaders in Joliet IL but exclude Heritage Foods workers" | 0.439 | `4,3,4,4,1,4,2,2,1,2` | top-1 4/5 — accidentally OK |

Q2 and Q5 only "passed" because the non-negated signals (role + city)
were strong enough to pull workers from outside the negated set
naturally. Q1, Q3, and Q4 hit the wall: the negated entity (Detroit /
the Cornerstone roster / the Detroit area) is the dominant content word, so
cosine pulls workers from exactly the location or client the
coordinator told us to avoid.

## Why this is structural, not fixable in the substrate

LLM-style decoder models can handle negation via attention patterns during
generation. Embedding models compress text into a single dense vector
where token-level structure is lost. There is no "NOT" operator in
cosine space — the literature is clear on this, and it remains an active
research area (e.g. negation-aware contrastive training).

Mitigation paths in our substrate:

1. **Pre-process the query with an LLM** to extract negations as
   structured filters before retrieval. Same shape as the
   `roleExtractor` — qwen2.5 format=json, schema like
   `{"positive": "...", "exclude_locations": [...], "exclude_clients": [...]}`.
   Cost: ~1-3 s/query.
2. **Surface a "low-confidence" signal** when the judge rates everything
   below a threshold (already implicit — `discovery=0` in this run).
   Promote that to an operator-visible signal so the UI can prompt
   "this query has constraints I couldn't honor; please exclude
   manually."
3. **Use structured `ExcludeIDs`** at the API boundary, populated by
   the UI. The substrate already supports this (added in the
   multi-coord stress 200-worker swap). Coordinators in a UI
   shouldn't type "NOT Beacon Freight" — they should click an
   exclude button.
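Path (1)'s regex-first stage could look something like the following. This is a hedged sketch, not the shipped `roleExtractor` code: the struct fields mirror the hypothetical JSON schema above, the phrase patterns and the client-vs-location routing are illustrative assumptions, and a real implementation would fall back to the LLM when the regex misses.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// NegationFilter mirrors the hypothetical format=json schema from
// mitigation path (1). Field names are illustrative, not shipped code.
type NegationFilter struct {
	Positive         string
	ExcludeLocations []string
	ExcludeClients   []string
}

// negationPhrase matches simple "NOT in X" / "excluding X" clauses.
// It deliberately only captures Capitalized entity spans; anything it
// misses would go to the LLM fallback in a real extractor.
var negationPhrase = regexp.MustCompile(
	`\b(?:but\s+)?(?:NOT\s+(?:anyone\s+from|in|from)\s+|[Ee]xclud(?:e|ing)\s+(?:all\s+|the\s+)?|cannot\s+use\s+any\s+)([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)`)

func extractNegations(query string) NegationFilter {
	f := NegationFilter{Positive: query}
	for _, m := range negationPhrase.FindAllStringSubmatch(query, -1) {
		entity := strings.TrimSpace(m[1])
		// Crude routing for the sketch: company-ish names go to clients,
		// everything else to locations. A real extractor would use the LLM.
		if strings.Contains(entity, "Freight") ||
			strings.Contains(entity, "Foods") ||
			strings.Contains(entity, "Fabrication") {
			f.ExcludeClients = append(f.ExcludeClients, entity)
		} else {
			f.ExcludeLocations = append(f.ExcludeLocations, entity)
		}
		// Strip the negation clause so only the positive intent is embedded.
		f.Positive = strings.Replace(f.Positive, m[0], "", 1)
	}
	f.Positive = strings.Join(strings.Fields(f.Positive), " ")
	return f
}

func main() {
	f := extractNegations("Need 3 Warehouse Associates, but NOT anyone from Beacon Freight")
	fmt.Printf("positive=%q exclude_clients=%v\n", f.Positive, f.ExcludeClients)
}
```

The payoff is that the retrieval embeds only `Positive` while the exclusions become hard filters — the same shape the `ExcludeIDs` path already enforces at the API boundary.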

## Architectural recommendation

**(3)** is the right answer for production. UIs solve negation more
cheaply than NLP. The substrate's job is to make exclusion machinery
available (`ExcludeIDs` is already there) and to surface honesty signals
when its retrieval doesn't fit the query (the judge-rating distribution
is already there). Adding NL-negation handling would be product
debt — it would let coordinators type sloppier queries that then
fail silently when the LLM extractor misses a phrasing.

**(2)** is a small UX improvement worth shipping eventually: when
all top-K judge ratings are ≤ 2/5, surface "no good match found —
consider tightening constraints" instead of silently returning the
cold top-1. This is one query-response shape change, not a substrate
change.

**(1)** is research-grade work. Don't ship it until production traffic
demonstrates that coordinators actually type natural-language negations
rather than using exclude affordances.

## Honesty signal validation

The judge IS doing its job. Q1, Q3, and Q4 had judge ratings of mostly
1/5, with no rating ≥ 4. That means in production, a judge-rating-distribution
monitor would flag: "this query produced 0 results with
quality score ≥ 4." That's an actionable operator signal without
requiring any new code in the substrate.
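That monitor is a one-liner over the per-result ratings. A minimal sketch — the function shape and the ≥ 4 threshold are assumptions, and the rating slices are Q1 and Q2 from this run:

```go
package main

import "fmt"

// flagLowConfidence reports whether a query's top-K judge ratings contain
// no result at or above minGood — the "0 results with quality ≥ 4" signal.
func flagLowConfidence(ratings []int, minGood int) bool {
	for _, r := range ratings {
		if r >= minGood {
			return false
		}
	}
	return true
}

func main() {
	q1 := []int{1, 1, 1, 1, 2, 1, 2, 2, 2, 2} // Q1: negation not honored
	q2 := []int{4, 4, 3, 2, 3, 4, 1, 4, 4, 1} // Q2: accidentally OK

	fmt.Println(flagLowConfidence(q1, 4)) // true  — surface the warning
	fmt.Println(flagLowConfidence(q2, 4)) // false — at least one 4/5
}
```

Wiring this into the query-response shape is exactly recommendation (2): the substrate already produces the ratings; only the surfacing is new.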

## What this probe does NOT cover

- **Quantifier negation** ("at most 3 of these workers"): a different
  failure mode, also unhandled, also not planned for the substrate.
- **Conditional constraints** ("if no forklift ops available, fall
  back to material handlers"): same.
- **Soft preferences** ("prefer locals over commuters"): partially
  handled via tag boost; not tested in this probe.

These are deferred to "when production traffic shows them."

## Repro

```bash
# Already-shipped queries file
cat tests/reality/negation_queries.txt

# Run with default config (no LLM extractor, no paraphrase)
QUERIES_FILE=tests/reality/negation_queries.txt RUN_ID=real_005 \
  WITH_PARAPHRASE=0 WITH_REJUDGE=0 \
  ./scripts/playbook_lift.sh
```

Evidence: `reports/reality-tests/playbook_lift_real_005.{json,md}`.
tests/reality/negation_queries.txt · 17 lines · new file

@@ -0,0 +1,17 @@
# Negation reality-test queries — real_005
#
# Each query carries an explicit negation that should suppress some
# subset of workers/clients/cities. Cosine on dense embeddings has
# known weaknesses around negation: "NOT Detroit" still tokenizes
# "Detroit" and pulls Detroit-aligned vectors near. Without explicit
# negation handling, the system may silently surface exactly the
# entities the coordinator excluded.
#
# Test goal: characterize whether the substrate degrades silently
# (treats "NOT X" as "X") or surfaces an honesty signal.

Need 5 Forklift Operators in Aurora IL, NOT in Detroit (Detroit pool is reserved for another contract)
Need 3 Warehouse Associates, but NOT anyone from Beacon Freight
Looking for Pickers in Indianapolis, excluding the Cornerstone Fabrication roster
1 CNC Operator needed in Flint MI - we cannot use any Detroit-area workers due to non-compete
Need 2 Loaders in Joliet IL but exclude all currently-placed Heritage Foods workers