matrix: split boost / inject thresholds — kills Shape B cross-pollination

Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift
Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental
hygienist), Q20 (RN), Q21 (software engineer), and 6 other unrelated
staffing queries. Cause: InjectPlaybookMisses inherited the same
DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is
structurally riskier than boost — boost only re-ranks results that
already retrieved on their own merits, while inject FORCES a result
into top-K, so a loose match cross-pollinates wrong-domain answers.

Empirical motivation from v3:
  Implied playbook hit distances for cross-pollinated cases: 0.20-0.46
  Implied distances for the 6/6 paraphrase recoveries:        0.23-0.30
  Threshold of 0.20 should keep most paraphrases, kill the OOD bleed.

Implementation:
- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (override).
- InjectPlaybookMisses signature gains maxInjectDist param; hits whose
  Distance exceeds it are skipped (boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract
  with one tight + one loose hit, asserting only the tight one injects.
- Existing tests pass explicit threshold (0 = default for tight tests,
  0.5 for the dedupe test which uses 0.30 hits).

Run #004 result on identical queries with the split threshold:

  Verbatim discovery        8 (vs v3's 6 — judge variance, separate)
  Verbatim lift             6 / 8 (75%)
  Paraphrase top-1          6 / 8 (75%)
  Paraphrase any-rank in K  6 / 8

OOD queries Q19/Q20/Q21 ALL show warm top-1 = cold top-1 (no
injection) — cross-pollination eliminated where it was wrong-direction.
Mean Δ top-1 distance dropped from -0.164 (v3, distorted) to -0.071
(v4, comparable to v1's -0.053).

Two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5
rephrased liberally enough to drift past 0.20 — Q9: "Inventory
specialist..." → "Individual needed for inventory management..." and
Q15: "Engaged warehouse associate..." → "Warehouse associate currently
engaged with a robust history...". The system correctly refusing to
inject when it's not confident is the right product behavior; the
boost path still re-ranks recorded answers when they appear in regular
retrieval.

The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔
"Hazmat warehouse worker") is legitimate — these are genuinely similar
staffing queries and the judge ranks both directions as plausible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-30 07:24:55 -05:00
parent 94fc3b67ec
commit 67d1957b87
4 changed files with 217 additions and 10 deletions

View File

@ -49,8 +49,31 @@ const DefaultPlaybookTopK = 3
// query is similar enough to count." 0.5 lets in genuinely related
// queries while excluding pure-coincidence neighbors. Caller can
// override per-request as we learn what works for staffing data.
//
// This threshold gates the BOOST path (re-rank in place), which is
// safe at loose thresholds because boost only modifies results already
// in regular retrieval. The INJECT path uses a tighter ceiling — see
// DefaultPlaybookMaxInjectDistance.
const DefaultPlaybookMaxDistance = 0.5
// DefaultPlaybookMaxInjectDistance is the SHAPE B cosine ceiling for
// "this past query is similar enough to FORCE its answer into the
// result set." Tighter than DefaultPlaybookMaxDistance because inject
// is structurally riskier than boost: it adds a result the embedding
// didn't surface, so a loose match can cross-pollinate the wrong
// answer into unrelated queries.
//
// Empirical motivation (playbook_lift_003): Q2's recording for an
// OSHA-30 forklift operator surfaced as warm top-1 for the dental
// hygienist / RN / software engineer OOD queries because their text
// vectors fell within 0.5 cosine of "OSHA-30 forklift Wisconsin."
// 0.20 would have rejected those (implied playbook distances 0.38-0.46)
// while keeping all 6 paraphrase recoveries (≤ 0.30 implied).
//
// Boost path stays at 0.5 — re-ranking results that already retrieved
// by their own merits is safe even when the playbook match is loose.
const DefaultPlaybookMaxInjectDistance = 0.20
// PlaybookEntry is what gets stored as metadata on each playbook
// vector. RecordedAt is captured at write time; callers should not
// set it (the recorder fills it in).
@ -174,10 +197,18 @@ type PlaybookHit struct {
//
// Returns the (possibly extended) results slice and how many synthetic
// rows were appended. Caller MUST re-sort + truncate to K afterwards.
func InjectPlaybookMisses(results []Result, hits []PlaybookHit) ([]Result, int) {
//
// maxInjectDist filters which hits qualify for injection — hits whose
// playbook-corpus cosine distance exceeds it are skipped (the boost
// path may still re-rank them in place). Pass 0 (or any non-positive
// value) to use DefaultPlaybookMaxInjectDistance.
func InjectPlaybookMisses(results []Result, hits []PlaybookHit, maxInjectDist float32) ([]Result, int) {
if len(hits) == 0 {
return results, 0
}
if maxInjectDist <= 0 {
maxInjectDist = float32(DefaultPlaybookMaxInjectDistance)
}
present := make(map[string]bool, len(results))
for _, r := range results {
present[r.Corpus+"|"+r.ID] = true
@ -188,6 +219,13 @@ func InjectPlaybookMisses(results []Result, hits []PlaybookHit) ([]Result, int)
// Multiple hits to the same answer collapse to one injection.
bestForKey := make(map[string]PlaybookHit)
for _, h := range hits {
// Inject-specific tighter threshold (boost path's threshold is
// looser; this prevents cross-pollination of wrong-domain
// answers into queries whose text happens to fall within
// boost-distance of an unrelated recording).
if h.Distance > maxInjectDist {
continue
}
key := h.Entry.AnswerCorpus + "|" + h.Entry.AnswerID
if present[key] {
continue

View File

@ -187,7 +187,7 @@ func TestInjectPlaybookMisses_AddsMissingAnswers(t *testing.T) {
},
},
}
out, injected := InjectPlaybookMisses(results, hits)
out, injected := InjectPlaybookMisses(results, hits, 0)
if injected != 1 {
t.Fatalf("expected 1 injected, got %d", injected)
}
@ -240,7 +240,7 @@ func TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent(t *testing.T) {
},
},
}
out, injected := InjectPlaybookMisses(results, hits)
out, injected := InjectPlaybookMisses(results, hits, 0)
if injected != 0 {
t.Errorf("expected 0 injected (answer already present), got %d", injected)
}
@ -266,7 +266,7 @@ func TestInjectPlaybookMisses_DedupesPerAnswer(t *testing.T) {
Entry: PlaybookEntry{QueryText: "q2", AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0},
},
}
out, injected := InjectPlaybookMisses(results, hits)
out, injected := InjectPlaybookMisses(results, hits, 0.5) // explicit loose threshold so 0.30 hits qualify
if injected != 1 {
t.Errorf("expected 1 injection (deduped), got %d", injected)
}
@ -280,10 +280,51 @@ func TestInjectPlaybookMisses_DedupesPerAnswer(t *testing.T) {
}
}
// TestInjectPlaybookMisses_RespectsInjectThreshold locks the
// cross-pollination defense added after run #003: hits whose playbook
// distance exceeds the inject threshold are skipped, preventing the
// "OSHA-30 forklift" recording from surfacing as warm top-1 for an
// unrelated dental-hygienist query just because their text vectors
// happened to fall within boost-threshold (0.5).
func TestInjectPlaybookMisses_RespectsInjectThreshold(t *testing.T) {
results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
// Two hits: one within tight inject threshold, one beyond it but
// within boost threshold. Only the tight one should inject.
hits := []PlaybookHit{
{
PlaybookID: "tight",
Distance: 0.10, // within inject (true paraphrase territory)
Entry: PlaybookEntry{QueryText: "q1", AnswerID: "w-tight", AnswerCorpus: "workers", Score: 1.0},
},
{
PlaybookID: "loose",
Distance: 0.40, // boost-eligible but inject-rejected
Entry: PlaybookEntry{QueryText: "q2", AnswerID: "w-loose", AnswerCorpus: "workers", Score: 1.0},
},
}
// Default threshold (0 → DefaultPlaybookMaxInjectDistance = 0.20)
out, injected := InjectPlaybookMisses(results, hits, 0)
if injected != 1 {
t.Errorf("expected 1 injection (only the tight hit qualifies), got %d", injected)
}
gotTight := false
for _, r := range out {
if r.ID == "w-tight" {
gotTight = true
}
if r.ID == "w-loose" {
t.Errorf("loose hit (distance > inject threshold) was injected anyway")
}
}
if !gotTight {
t.Error("tight hit should have been injected")
}
}
// TestInjectPlaybookMisses_EmptyHits is a fast-path no-op check.
func TestInjectPlaybookMisses_EmptyHits(t *testing.T) {
results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
out, injected := InjectPlaybookMisses(results, nil)
out, injected := InjectPlaybookMisses(results, nil, 0)
if injected != 0 {
t.Errorf("expected 0 injection, got %d", injected)
}

View File

@ -53,8 +53,14 @@ type Result struct {
// PlaybookCorpus: index name; empty = DefaultPlaybookCorpus.
// PlaybookTopK: number of similar past queries to consider; 0 =
// DefaultPlaybookTopK.
// PlaybookMaxDistance: cosine ceiling for "similar enough"; 0 =
// DefaultPlaybookMaxDistance.
// PlaybookMaxDistance: cosine ceiling for "similar enough" on the
// BOOST path (re-rank in place); 0 = DefaultPlaybookMaxDistance.
// PlaybookMaxInjectDistance: tighter cosine ceiling for the SHAPE B
// INJECT path; 0 = DefaultPlaybookMaxInjectDistance. Splitting the
// two thresholds is intentional — boost is safe at loose thresholds
// because it only re-ranks results that already retrieved on their
// own merits, while inject forces results in and so cross-pollinates
// wrong-domain answers if the threshold is too loose.
//
// Metadata filter (post-retrieval structured gate):
// MetadataFilter: map of metadata-field → expected value. Results
@ -76,8 +82,9 @@ type SearchRequest struct {
UsePlaybook bool `json:"use_playbook,omitempty"`
PlaybookCorpus string `json:"playbook_corpus,omitempty"`
PlaybookTopK int `json:"playbook_top_k,omitempty"`
PlaybookMaxDistance float64 `json:"playbook_max_distance,omitempty"`
MetadataFilter map[string]any `json:"metadata_filter,omitempty"`
PlaybookMaxDistance float64 `json:"playbook_max_distance,omitempty"`
PlaybookMaxInjectDistance float64 `json:"playbook_max_inject_distance,omitempty"`
MetadataFilter map[string]any `json:"metadata_filter,omitempty"`
}
// SearchResponse wraps the merged results plus per-corpus return
@ -233,8 +240,12 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
} else if len(hits) > 0 {
resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
maxInjectDist := float32(req.PlaybookMaxInjectDistance)
if maxInjectDist <= 0 {
maxInjectDist = float32(DefaultPlaybookMaxInjectDistance)
}
var injected int
resp.Results, injected = InjectPlaybookMisses(resp.Results, hits)
resp.Results, injected = InjectPlaybookMisses(resp.Results, hits, maxInjectDist)
resp.PlaybookInjected = injected
if injected > 0 {
// Re-sort + truncate after injection. ApplyPlaybookBoost

View File

@ -0,0 +1,117 @@
# Playbook-Lift Reality Test — Run 004
**Generated:** 2026-04-30T12:23:36.594892386Z
**Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
**Corpora:** `workers,ethereal_workers`
**Workers limit:** 5000
**Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
**K per pass:** 10
**Paraphrase pass:** ENABLED
**Evidence:** `reports/reality-tests/playbook_lift_004.json`
---
## Headline
| Metric | Value |
|---|---:|
| Total queries run | 21 |
| Cold-pass discoveries (judge-best ≠ top-1) | 8 |
| Warm-pass lifts (recorded playbook → top-1) | 6 |
| No change (judge-best already top-1, no playbook needed) | 15 |
| Playbook boosts triggered (warm pass) | 8 |
| Mean Δ top-1 distance (warm cold) | -0.070719235 |
| **Paraphrase pass — recorded answer at rank 0 (top-1)** | **6 / 8** |
| Paraphrase pass — recorded answer at any rank in top-K | 6 / 8 |
**Verbatim lift rate:** 6 of 8 discoveries became top-1 after warm pass.
---
## Per-query results
| # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
|---|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-4983 | 1/4 | ✓ e-5729 | e-5729 | 0 | **YES** |
| 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-868 | 9/3 | — | e-7308 | -1 | no |
| 3 | Production worker with confined-space cert and hazmat traini | w-4583 | 1/2 | — | w-1231 | 2 | no |
| 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3272 | 0/1 | — | w-3272 | 0 | no |
| 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-2356 | 3/2 | — | w-2356 | 3 | no |
| 6 | Forklift-certified loader, certification must be active, dis | e-3940 | 3/4 | ✓ w-330 | e-7453 | 1 | no |
| 7 | Hazmat-certified warehouse worker comfortable with cold stor | w-4633 | 4/4 | ✓ e-7453 | w-330 | 1 | no |
| 8 | Bilingual production worker with team-lead experience and tr | w-2983 | 0/4 | — | w-2983 | 0 | no |
| 9 | Inventory specialist with confined-space cert and compliance | w-3037 | 7/4 | ✓ w-1231 | w-1231 | 0 | **YES** |
| 10 | Warehouse worker who can run inventory cycles and lead a sma | e-6649 | 1/4 | ✓ w-4113 | w-4113 | 0 | **YES** |
| 11 | Production line worker comfortable filling in as line superv | w-1010 | 3/4 | ✓ w-1153 | w-1153 | 0 | **YES** |
| 12 | Customer service rep willing to cross-train into dispatch or | e-6474 | 1/2 | — | e-6474 | 1 | no |
| 13 | Reliable production line lead with strong attendance and lea | e-4284 | 0/3 | — | e-4284 | 0 | no |
| 14 | Highly responsive forklift operator available for last-minut | e-285 | 4/4 | ✓ e-7308 | e-7308 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong safety compliance re | e-8404 | 5/4 | ✓ w-3242 | w-3242 | 0 | **YES** |
| 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-3257 | 4/2 | — | w-3257 | 4 | no |
| 17 | Bilingual customer service rep in Indianapolis or Cincinnati | w-1387 | 0/1 | — | w-1387 | 0 | no |
| 18 | Production supervisor open to Midwest relocation for permane | e-7478 | 1/2 | — | e-7478 | 1 | no |
| 19 | Dental hygienist with three years experience, Indianapolis a | e-2544 | 0/1 | — | e-2544 | 0 | no |
| 20 | Registered nurse with ICU experience, willing to take per-di | w-419 | 0/1 | — | w-419 | 0 | no |
| 21 | Software engineer with React and TypeScript, three years exp | w-334 | 0/1 | — | w-334 | 0 | no |
---
## Paraphrase pass — does the playbook help similar-but-different queries?
For each query whose Pass 1 cold pass recorded a playbook entry, the
judge model rephrased the query, and the rephrased version was sent
through warm matrix.search. The recorded answer ID's rank in those
results tests whether cosine on the embedded paraphrase finds the
recorded query's vector.
| # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
|---|---|---|---|---|---|---|
| 1 | Forklift operator with OSHA-30, warehous | Seeking forklift operator certified in OSHA-30, with backgro | e-5729 | e-5729 | 0 | **YES** |
| 6 | Forklift-certified loader, certification | Loader with active forklift certification, separate from reg | w-330 | w-330 | 0 | **YES** |
| 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | e-7453 | e-7453 | 0 | **YES** |
| 9 | Inventory specialist with confined-space | Individual needed for inventory management with certificatio | w-1231 | w-987 | -1 | no |
| 10 | Warehouse worker who can run inventory c | Seeking a warehouse worker capable of conducting inventory c | w-4113 | w-4113 | 0 | **YES** |
| 11 | Production line worker comfortable filli | Seeking a production line worker capable of temporarily step | w-1153 | w-1153 | 0 | **YES** |
| 14 | Highly responsive forklift operator avai | Available for urgent forklift operation shifts requiring imm | e-7308 | e-7308 | 0 | **YES** |
| 15 | Engaged warehouse associate with strong | Warehouse associate currently engaged with a robust history | w-3242 | e-2615 | -1 | no |
---
## Honesty caveats
1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
the lift number is meaningless. To validate the judge itself, sample 510
verdicts manually and check agreement.
2. **Score-1.0 boost = distance halved.** Playbook math is
`distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
even halving doesn't promote it. Tight clusters → little visible lift.
3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
case — same query, recorded playbook, expected boost. The paraphrase
pass (when enabled) is the actual learning property: similar-but-different
queries hitting a recorded playbook. Compare verbatim and paraphrase
lift rates — paraphrase should be lower (semantic-distance gates some
playbook hits) but non-zero is the meaningful signal.
4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
results land in one corpus, the matrix layer's purpose isn't being tested.
Check per-corpus distribution in the JSON.
5. **Judge resolution.** This run used `qwen2.5:latest` from
env JUDGE_MODEL=qwen2.5:latest.
Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
6. **Paraphrase generation also uses the judge.** The same model that rates
relevance also rephrases queries. A judge that's bad at rating staffing
queries is probably also bad at rephrasing them. Worth sanity-checking
a sample of `paraphrase_query` values in the JSON before trusting the
paraphrase lift number.
## Next moves
- If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
work. Move to paraphrase queries + tag-based boost (currently ignored).
- If lift rate < 20%: investigate why judge variance, distance gap too
wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
retuning.
- If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
already close to optimal on this query distribution. Either the corpus
is too narrow or the queries are too easy.