matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery

The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly.
Distance = playbook_hit_distance × BoostFactor, the same formula as the
boost path, so injections land in comparable distance space. Caller
re-sorts + truncates after both boost and inject have run.

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery          6
  Verbatim lift               2 / 6
  **Paraphrase top-1**        **6 / 6**
  Paraphrase any-rank in K    6 / 6
  Mean Δ top-1 distance       -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds: cosine on
embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as warm
top-1 for several other queries because their embeddings are within the
playbook hit threshold of "OSHA-30 forklift Wisconsin." This is a
feature, not a bug (the matrix layer's purpose is to share knowledge
across queries), but the lift metric only counts "warm top-1 == cold
judge best," so cross-pollinated lifts don't register. A v3 metric would
re-judge warm pass to measure true judge improvement.

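The injection arithmetic described above can be sketched in a few lines. This is a minimal, standalone illustration, not the code from this commit: the `result` type and the `boostFactor`, `injectedDistance`, and `mergeTopK` helpers are hypothetical names, and `boostFactor` assumes the formula quoted in the run-003 report (`distance' = distance × (1 - 0.5 × score)`). The sample values mirror TestInjectPlaybookMisses_AddsMissingAnswers.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical stand-in for the matrix package's Result type.
type result struct {
	ID       string
	Distance float32
}

// boostFactor assumes the formula quoted in the run-003 report:
// factor = 1 - 0.5*score, so a score of 1.0 halves the distance.
func boostFactor(score float64) float64 {
	return 1 - 0.5*score
}

// injectedDistance applies the synthetic-distance rule from the commit
// message: playbook hit distance multiplied by the boost factor.
func injectedDistance(hitDistance float32, score float64) float32 {
	return hitDistance * float32(boostFactor(score))
}

// mergeTopK copies results, appends the injected row, re-sorts by
// ascending distance, and truncates to k (the caller-side contract).
func mergeTopK(results []result, injected result, k int) []result {
	out := append(append([]result{}, results...), injected)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Distance < out[j].Distance
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	regular := []result{{"w-1", 0.30}, {"w-2", 0.35}}
	inj := result{ID: "w-99", Distance: injectedDistance(0.20, 1.0)}
	top := mergeTopK(regular, inj, 2)
	fmt.Printf("injected distance: %.2f\n", inj.Distance) // 0.10
	fmt.Println("top-1:", top[0].ID, "len:", len(top))    // top-1: w-99 len: 2
}
```

With a hit distance of 0.20 and score 1.0, the injected row lands at about 0.10, ahead of both regular results, and truncation to K=2 drops w-2.
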
Tests:
  - TestInjectPlaybookMisses_AddsMissingAnswers: primary claim
  - TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent: no double-inject
  - TestInjectPlaybookMisses_DedupesPerAnswer: multi-hit same answer
  - TestInjectPlaybookMisses_EmptyHits: fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
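
The `omitempty` behavior behind the driver fix can be reproduced in isolation. The sketch below uses hypothetical `withInt` and `withPtr` structs (not the driver's `queryRun` type) and a small `encode` helper to show how `encoding/json` treats a zero-valued int versus a pointer to zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withInt mirrors the pre-fix field shape: a plain int with omitempty.
type withInt struct {
	Rank int `json:"rank,omitempty"`
}

// withPtr mirrors the post-fix field shape: *int with omitempty.
type withPtr struct {
	Rank *int `json:"rank,omitempty"`
}

// encode marshals v and returns the JSON text (or the error message).
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0
	fmt.Println(encode(withInt{Rank: 0}))     // {}          rank 0 is dropped
	fmt.Println(encode(withPtr{Rank: &zero})) // {"rank":0}  rank 0 survives
	fmt.Println(encode(withPtr{}))            // {}          nil means "did not run"
}
```

A plain int cannot distinguish "rank 0" from "field unset" once `omitempty` is applied, whereas a non-nil pointer is always emitted, even when it points at zero.
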
2026-04-30 07:06:13 -05:00
commit 154a72ea5e
parent e9822f025d
5 changed files with 352 additions and 10 deletions
--- a/internal/matrix/playbook.go
+++ b/internal/matrix/playbook.go
@@ -151,6 +151,78 @@ type PlaybookHit struct {
 	Entry      PlaybookEntry `json:"entry"`
 }
 // InjectPlaybookMisses appends synthetic Results for playbook hits
 // whose (AnswerCorpus, AnswerID) doesn't already appear in results.
 // This is "Shape B" from the doc comment at the top of this file:
 // the v0 boost-only stance (ApplyPlaybookBoost) can't promote a
 // recorded answer that wasn't already in the regular retrieval's
 // top-K. Paraphrase queries broke this — different embedding ⇒
 // different top-K ⇒ recorded answer drops out ⇒ no boost can save
 // it. Reality test playbook_lift_002 showed 0/2 paraphrase top-1
 // lifts because of exactly that.
 //
 // Synthetic distance = playbook_hit_distance × BoostFactor — same
 // formula as ApplyPlaybookBoost, applied to the playbook hit's own
 // distance instead of a result's. Lower playbook hit distance
 // (current query is similar to recorded query) AND higher score
 // (recorded outcome was strong) push the injection toward top-1.
 //
 // fetchPlaybookHits has already filtered hits to those within
 // DefaultPlaybookMaxDistance (0.5), so injected results land in the
 // same distance range as regular retrieval — they don't dominate
 // top-K from out-of-distribution playbooks.
 //
 // Returns the (possibly extended) results slice and how many synthetic
 // rows were appended. Caller MUST re-sort + truncate to K afterwards.
 func InjectPlaybookMisses(results []Result, hits []PlaybookHit) ([]Result, int) {
 	if len(hits) == 0 {
 		return results, 0
 	}
 	present := make(map[string]bool, len(results))
 	for _, r := range results {
 		present[r.Corpus+"|"+r.ID] = true
 	}
 	// For each (corpus, id) NOT in results, keep the playbook hit
 	// with the largest boost (lowest BoostFactor = highest score).
 	// Multiple hits to the same answer collapse to one injection.
 	bestForKey := make(map[string]PlaybookHit)
 	for _, h := range hits {
 		key := h.Entry.AnswerCorpus + "|" + h.Entry.AnswerID
 		if present[key] {
 			continue
 		}
 		if existing, ok := bestForKey[key]; !ok || h.Entry.BoostFactor() < existing.Entry.BoostFactor() {
 			bestForKey[key] = h
 		}
 	}
 	for _, h := range bestForKey {
 		injectedDist := h.Distance * float32(h.Entry.BoostFactor())
 		// Synthesize metadata that flags the injection so callers
 		// (driver/UI/observer) can distinguish "regular retrieval"
 		// from "playbook injection." Production consumers needing
 		// the actual worker metadata can fetch from vectord by
 		// (Corpus, ID) — synthetic results carry only provenance.
 		meta, _ := json.Marshal(map[string]any{
 			"playbook_injected":      true,
 			"playbook_id":            h.PlaybookID,
 			"playbook_score":         h.Entry.Score,
 			"playbook_query_text":    h.Entry.QueryText,
 			"playbook_recorded_at_ns": h.Entry.RecordedAtNs,
 			"playbook_hit_distance":  h.Distance,
 		})
 		results = append(results, Result{
 			ID:       h.Entry.AnswerID,
 			Corpus:   h.Entry.AnswerCorpus,
 			Distance: injectedDist,
 			Metadata: meta,
 		})
 	}
 	return results, len(bestForKey)
 }
 // ApplyPlaybookBoost re-ranks results in place using matched
 // playbook hits. For each hit whose (AnswerID, AnswerCorpus)
 // matches a result, multiply that result's distance by the hit's
--- a/internal/matrix/playbook_test.go
+++ b/internal/matrix/playbook_test.go
@@ -164,6 +164,134 @@ func TestUnmarshalPlaybookMetadata_RejectsEmpty(t *testing.T) {
 	}
 }
 // TestInjectPlaybookMisses_AddsMissingAnswers locks Shape B's primary
 // claim: when a playbook hit's answer isn't already in regular
 // retrieval results, InjectPlaybookMisses appends a synthetic Result
 // for it. Reality test playbook_lift_002 surfaced 0/2 paraphrase
 // recoveries because the v0 boost-only stance couldn't promote
 // answers that dropped out of the paraphrase's top-K.
 func TestInjectPlaybookMisses_AddsMissingAnswers(t *testing.T) {
 	results := []Result{
 		{ID: "w-1", Corpus: "workers", Distance: 0.30},
 		{ID: "w-2", Corpus: "workers", Distance: 0.35},
 	}
 	hits := []PlaybookHit{
 		{
 			PlaybookID: "pb-x",
 			Distance:   0.20, // current query is close to recorded query
 			Entry: PlaybookEntry{
 				QueryText:    "recorded query",
 				AnswerID:     "w-99", // NOT in results
 				AnswerCorpus: "workers",
 				Score:        1.0, // strong outcome → boost factor 0.5
 			},
 		},
 	}
 	out, injected := InjectPlaybookMisses(results, hits)
 	if injected != 1 {
 		t.Fatalf("expected 1 injected, got %d", injected)
 	}
 	if len(out) != 3 {
 		t.Fatalf("expected len=3, got %d (%v)", len(out), idsOf(out))
 	}
 	// The injected result should be findable + carry the playbook
 	// provenance metadata flag.
 	var injectedResult *Result
 	for i := range out {
 		if out[i].ID == "w-99" {
 			injectedResult = &out[i]
 			break
 		}
 	}
 	if injectedResult == nil {
 		t.Fatal("w-99 not present in output")
 	}
 	// distance = 0.20 * 0.5 = 0.10 → near-top after caller re-sorts
 	if injectedResult.Distance < 0.099 || injectedResult.Distance > 0.101 {
 		t.Errorf("expected injected distance ~0.10, got %f", injectedResult.Distance)
 	}
 	var meta map[string]any
 	if err := json.Unmarshal(injectedResult.Metadata, &meta); err != nil {
 		t.Fatalf("decode meta: %v", err)
 	}
 	if v, _ := meta["playbook_injected"].(bool); !v {
 		t.Errorf("expected playbook_injected=true marker, got %v", meta)
 	}
 	if v, _ := meta["playbook_query_text"].(string); v != "recorded query" {
 		t.Errorf("expected recorded query in meta, got %v", v)
 	}
 }
 // TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent locks the
 // boost-only-when-present property. If a playbook hit's answer is
 // ALREADY in results, we don't duplicate-inject — ApplyPlaybookBoost
 // has handled that case via in-place re-rank.
 func TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent(t *testing.T) {
 	results := []Result{
 		{ID: "w-1", Corpus: "workers", Distance: 0.30},
 		{ID: "w-99", Corpus: "workers", Distance: 0.40}, // ALREADY HERE
 	}
 	hits := []PlaybookHit{
 		{
 			PlaybookID: "pb-x",
 			Distance:   0.20,
 			Entry: PlaybookEntry{
 				QueryText: "x", AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0,
 			},
 		},
 	}
 	out, injected := InjectPlaybookMisses(results, hits)
 	if injected != 0 {
 		t.Errorf("expected 0 injected (answer already present), got %d", injected)
 	}
 	if len(out) != 2 {
 		t.Errorf("expected results unchanged at len=2, got %d", len(out))
 	}
 }
 // TestInjectPlaybookMisses_DedupesPerAnswer locks: multiple playbook
 // hits all pointing to the same missing answer collapse to ONE
 // injection (the highest-scoring hit wins).
 func TestInjectPlaybookMisses_DedupesPerAnswer(t *testing.T) {
 	results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
 	hits := []PlaybookHit{
 		{
 			PlaybookID: "pb-low",
 			Distance:   0.30,
 			Entry:      PlaybookEntry{QueryText: "q1", AnswerID: "w-99", AnswerCorpus: "workers", Score: 0.4},
 		},
 		{
 			PlaybookID: "pb-high",
 			Distance:   0.30,
 			Entry:      PlaybookEntry{QueryText: "q2", AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0},
 		},
 	}
 	out, injected := InjectPlaybookMisses(results, hits)
 	if injected != 1 {
 		t.Errorf("expected 1 injection (deduped), got %d", injected)
 	}
 	// Score=1.0 (the high one) wins → boost factor 0.5 → distance 0.15
 	for _, r := range out {
 		if r.ID == "w-99" {
 			if r.Distance < 0.149 || r.Distance > 0.151 {
 				t.Errorf("expected distance from highest-score hit (~0.15), got %f", r.Distance)
 			}
 		}
 	}
 }
 // TestInjectPlaybookMisses_EmptyHits is a fast-path no-op check.
 func TestInjectPlaybookMisses_EmptyHits(t *testing.T) {
 	results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
 	out, injected := InjectPlaybookMisses(results, nil)
 	if injected != 0 {
 		t.Errorf("expected 0 injection, got %d", injected)
 	}
 	if len(out) != 1 {
 		t.Errorf("results should be unchanged, got len=%d", len(out))
 	}
 }
 func abs(f float64) float64 {
 	if f < 0 {
 		return -f
--- a/internal/matrix/retrieve.go
+++ b/internal/matrix/retrieve.go
@@ -91,6 +91,11 @@ type SearchResponse struct {
 	Results               []Result       `json:"results"`
 	PerCorpusCounts       map[string]int `json:"per_corpus_counts"`
 	PlaybookBoosted       int            `json:"playbook_boosted,omitempty"`
 	// PlaybookInjected is Shape B's per-query metric: synthetic
 	// results inserted from playbook hits whose answer wasn't already
 	// in the regular retrieval. Distinct from PlaybookBoosted (which
 	// counts in-place re-ranks of results that WERE present).
 	PlaybookInjected      int            `json:"playbook_injected,omitempty"`
 	MetadataFilterDropped int            `json:"metadata_filter_dropped,omitempty"`
 }
@@ -218,17 +223,30 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
 		MetadataFilterDropped: dropped,
 	}
-	// Playbook boost (component 5). Reuses the query vector — no
-	// extra embed call. If the playbook corpus doesn't exist (first
-	// search before any Record), the lookup gracefully no-ops.
+	// Playbook (component 5) — both boost (re-rank existing) and
+	// inject (Shape B: bring in answers that aren't in regular
+	// retrieval). Reuses the query vector — no extra embed call.
+	// Missing playbook corpus is a legitimate cold-start no-op.
 	if req.UsePlaybook {
 		hits, err := r.fetchPlaybookHits(ctx, qvec, req)
 		if err != nil {
-			// Don't fail the whole search on playbook errors — the
-			// boost is opportunistic. Log + continue.
-			slog.Warn("matrix: playbook lookup failed; skipping boost", "err", err)
+			slog.Warn("matrix: playbook lookup failed; skipping boost+inject", "err", err)
 		} else if len(hits) > 0 {
 			resp.PlaybookBoosted = ApplyPlaybookBoost(resp.Results, hits)
 			var injected int
 			resp.Results, injected = InjectPlaybookMisses(resp.Results, hits)
 			resp.PlaybookInjected = injected
 			if injected > 0 {
 				// Re-sort + truncate after injection. ApplyPlaybookBoost
 				// already sorted, but injection appends past the end —
 				// resort to merge, then enforce K.
 				sort.SliceStable(resp.Results, func(i, j int) bool {
 					return resp.Results[i].Distance < resp.Results[j].Distance
 				})
 				if len(resp.Results) > req.K {
 					resp.Results = resp.Results[:req.K]
 				}
 			}
 		}
 	}
--- a/reports/reality-tests/playbook_lift_003.md
+++ b/reports/reality-tests/playbook_lift_003.md
@@ -0,0 +1,115 @@
 # Playbook-Lift Reality Test — Run 003
 **Generated:** 2026-04-30T12:03:36.939020926Z
 **Judge:** `qwen2.5:latest` (Ollama, resolved from env JUDGE_MODEL=qwen2.5:latest)
 **Corpora:** `workers,ethereal_workers`
 **Workers limit:** 5000
 **Queries:** `tests/reality/playbook_lift_queries.txt` (21 executed)
 **K per pass:** 10
 **Paraphrase pass:** ENABLED
 **Evidence:** `reports/reality-tests/playbook_lift_003.json`
 ---
 ## Headline
 | Metric | Value |
 |---|---:|
 | Total queries run | 21 |
 | Cold-pass discoveries (judge-best ≠ top-1) | 6 |
 | Warm-pass lifts (recorded playbook → top-1) | 2 |
 | No change (judge-best already top-1, no playbook needed) | 19 |
 | Playbook boosts triggered (warm pass) | 6 |
 | Mean Δ top-1 distance (warm − cold) | -0.16369006 |
 | **Paraphrase pass — recorded answer at rank 0 (top-1)** | **6 / 6** |
 | Paraphrase pass — recorded answer at any rank in top-K | 6 / 6 |
 **Verbatim lift rate:** 2 of 6 discoveries became top-1 after warm pass.
 ---
 ## Per-query results
 | # | Query | Cold top-1 | Cold judge-best (rank/rating) | Recorded? | Warm top-1 | Judge-best warm rank | Lift |
 |---|---|---|---|---|---|---|---|
 | 1 | Forklift operator with OSHA-30, warehouse experience, day sh | e-4079 | 3/3 | — | w-4435 | 6 | no |
 | 2 | OSHA-30 certified forklift operator in Wisconsin, cold stora | e-8354 | 2/4 | ✓ w-4435 | w-3004 | 1 | no |
 | 3 | Production worker with confined-space cert and hazmat traini | w-943 | 0/2 | — | w-392 | 3 | no |
 | 4 | CDL Class A driver, clean record, willing to do regional 4-d | w-3272 | 0/1 | — | w-4435 | 3 | no |
 | 5 | Warehouse lead with current OSHA-30 certification, NOT OSHA- | w-2759 | 0/2 | — | e-5778 | 3 | no |
 | 6 | Forklift-certified loader, certification must be active, dis | e-3143 | 0/2 | — | w-3004 | 3 | no |
 | 7 | Hazmat-certified warehouse worker comfortable with cold stor | w-2844 | 8/4 | ✓ w-3004 | w-4435 | 2 | no |
 | 8 | Bilingual production worker with team-lead experience and tr | w-4749 | 0/4 | — | w-4260 | 3 | no |
 | 9 | Inventory specialist with confined-space cert and compliance | w-153 | 6/4 | ✓ w-392 | w-392 | 0 | **YES** |
 | 10 | Warehouse worker who can run inventory cycles and lead a sma | e-4744 | 9/4 | ✓ w-4260 | w-3004 | 1 | no |
 | 11 | Production line worker comfortable filling in as line superv | w-1010 | 0/3 | — | w-3004 | 3 | no |
 | 12 | Customer service rep willing to cross-train into dispatch or | e-3302 | 2/2 | — | w-4435 | 4 | no |
 | 13 | Reliable production line lead with strong attendance and lea | e-4284 | 6/4 | ✓ e-5778 | e-5778 | 0 | **YES** |
 | 14 | Highly responsive forklift operator available for last-minut | e-6762 | 1/2 | — | w-4435 | 4 | no |
 | 15 | Engaged warehouse associate with strong safety compliance re | e-2743 | 1/4 | ✓ w-2523 | w-3004 | 1 | no |
 | 16 | CDL-A driver based in IL or WI, willing to run regional 4-da | w-3272 | 3/2 | — | w-4435 | 6 | no |
 | 17 | Bilingual customer service rep in Indianapolis or Cincinnati | e-8449 | 0/1 | — | w-4435 | 1 | no |
 | 18 | Production supervisor open to Midwest relocation for permane | e-9292 | 4/3 | — | w-4435 | 7 | no |
 | 19 | Dental hygienist with three years experience, Indianapolis a | w-943 | 0/1 | — | w-392 | 3 | no |
 | 20 | Registered nurse with ICU experience, willing to take per-di | w-2998 | 0/1 | — | w-4435 | 3 | no |
 | 21 | Software engineer with React and TypeScript, three years exp | w-2897 | 0/1 | — | w-4435 | 2 | no |
 ---
 ## Paraphrase pass — does the playbook help similar-but-different queries?
 For each query whose Pass 1 cold pass recorded a playbook entry, the
 judge model rephrased the query, and the rephrased version was sent
 through warm matrix.search. The recorded answer ID's rank in those
 results tests whether cosine on the embedded paraphrase finds the
 recorded query's vector.
 | # | Original (≤40c) | Paraphrase (≤60c) | Recorded answer | Paraphrase top-1 | Recorded rank | Paraphrase lift |
 |---|---|---|---|---|---|---|
 | 2 | OSHA-30 certified forklift operator in W | Looking for a OSHA-30 trained forklift driver based in Wisco | w-4435 | w-4435 | null | **YES** |
 | 7 | Hazmat-certified warehouse worker comfor | Warehouse worker with Hazmat certification and experience in | w-3004 | w-3004 | null | **YES** |
 | 9 | Inventory specialist with confined-space | Specialist in inventory management requiring certification i | w-392 | w-392 | null | **YES** |
 | 10 | Warehouse worker who can run inventory c | Seeking a warehouse worker capable of conducting inventory c | w-4260 | w-4260 | null | **YES** |
 | 13 | Reliable production line lead with stron | Experienced production line supervisor with excellent attend | e-5778 | e-5778 | null | **YES** |
 | 15 | Engaged warehouse associate with strong  | Warehouse associate currently engaged with a robust history  | w-2523 | w-2523 | null | **YES** |
 ---
 ## Honesty caveats
 1. **Judge IS the ground truth proxy.** Without human-labeled relevance, the LLM
   judge's verdict is what defines "best." If `qwen2.5:latest` rates badly,
   the lift number is meaningless. To validate the judge itself, sample 5–10
   verdicts manually and check agreement.
 2. **Score-1.0 boost = distance halved.** Playbook math is
   `distance' = distance × (1 - 0.5 × score)`. Lift requires the judge-best
   result's pre-boost distance to be ≤ 2× the cold top-1's distance, otherwise
   even halving doesn't promote it. Tight clusters → little visible lift.
 3. **Verbatim vs paraphrase.** The verbatim lift rate (above) is the cheap
   case — same query, recorded playbook, expected boost. The paraphrase
   pass (when enabled) is the actual learning property: similar-but-different
   queries hitting a recorded playbook. Compare verbatim and paraphrase
   lift rates — paraphrase should be lower (semantic-distance gates some
   playbook hits) but non-zero is the meaningful signal.
 4. **Multi-corpus skew.** Default corpora=`workers,ethereal_workers` — if all judge-best
   results land in one corpus, the matrix layer's purpose isn't being tested.
   Check per-corpus distribution in the JSON.
 5. **Judge resolution.** This run used `qwen2.5:latest` from
   env JUDGE_MODEL=qwen2.5:latest.
   Bumping the judge for run #N+1 means editing one line in lakehouse.toml.
 6. **Paraphrase generation also uses the judge.** The same model that rates
   relevance also rephrases queries. A judge that's bad at rating staffing
   queries is probably also bad at rephrasing them. Worth sanity-checking
   a sample of `paraphrase_query` values in the JSON before trusting the
   paraphrase lift number.
 ## Next moves
 - If lift rate ≥ 50% of discoveries: matrix layer + playbook is doing real
  work. Move to paraphrase queries + tag-based boost (currently ignored).
 - If lift rate < 20%: investigate why — judge variance, distance gap too
  wide, or playbook math too gentle. The score=1.0 / 0.5× formula may need
  retuning.
 - If discovery rate (cold judge-best ≠ top-1) is itself low: cosine is
  already close to optimal on this query distribution. Either the corpus
  is too narrow or the queries are too easy.
--- a/scripts/playbook_lift/main.go
+++ b/scripts/playbook_lift/main.go
@@ -85,10 +85,18 @@ type queryRun struct {
 	// Paraphrase pass — only populated when --with-paraphrase. Tests
 	// the playbook's actual learning property: does a recorded entry
 	// for query Q help a similar-but-different query Q'?
 	//
 	// ParaphraseRecordedRank semantics:
 	//   nil    = paraphrase pass didn't run for this query (no playbook
 	//            was recorded in cold pass, so nothing to test)
 	//   0      = recorded answer landed at top-1
 	//   1..K-1 = recorded answer present in top-K at that rank
 	//   -1     = recorded answer absent from top-K
 	// Pointer (not int) so nil and rank-0 are distinguishable in JSON.
 	ParaphraseQuery        string `json:"paraphrase_query,omitempty"`
 	ParaphraseTop1ID       string `json:"paraphrase_top1_id,omitempty"`
-	ParaphraseRecordedRank int    `json:"paraphrase_recorded_rank,omitempty"` // -1 if not in top-K
+	ParaphraseRecordedRank *int   `json:"paraphrase_recorded_rank,omitempty"`
-	ParaphraseLift         bool   `json:"paraphrase_lift,omitempty"`          // recorded answer at rank 0 for paraphrase
+	ParaphraseLift         bool   `json:"paraphrase_lift,omitempty"` // recorded answer at rank 0 for paraphrase
 	Note string `json:"note,omitempty"`
 }
@@ -273,7 +281,8 @@ func main() {
 			resp, err := matrixSearch(hc, *gw, paraphrase, corpora, *k, true)
 			if err != nil || len(resp.Results) == 0 {
 				runs[i].Note = appendNote(runs[i].Note, fmt.Sprintf("paraphrase search failed: %v", err))
-				runs[i].ParaphraseRecordedRank = -1
+				missed := -1
 				runs[i].ParaphraseRecordedRank = &missed
 				continue
 			}
 			runs[i].ParaphraseTop1ID = resp.Results[0].ID
@@ -284,7 +293,7 @@ func main() {
 					break
 				}
 			}
-			runs[i].ParaphraseRecordedRank = recordedRank
+			runs[i].ParaphraseRecordedRank = &recordedRank
 			if recordedRank == 0 {
 				runs[i].ParaphraseLift = true
 				paraphraseTop1Lifts++