matrix: judge-gated Shape B inject — closes lift-suite tail issues
Lift suite run #004 left two unresolved tail issues:

- Q6 ("Forklift loader") ↔ Q7 ("Hazmat warehouse, cold storage") swap recordings as warm top-1 because their embeddings are within 0.20 cosine distance of each other. The distance gate can't tell them apart.
- Q9 + Q15 lose paraphrase recovery when qwen2.5 rephrases past the 0.20 threshold. Distance says "drift too far"; sometimes the drift is real (skip), sometimes the paraphrase is still on-domain (don't want to skip).

Multi-coord run #008's judge re-rating proved the LLM can distinguish: the Q3 crane case landed at distance 0.23 (looks tight) but rating 1 (irrelevant). The judge sees domain mismatch the embedder doesn't.

This commit lifts that pattern into the matrix substrate. Shape B inject now optionally routes every candidate through a judge gate before the rank insert lands. Distance + judge BOTH have to approve.

internal/matrix/playbook.go:
- InjectPlaybookMisses signature gains a query string + an optional InjectGate. nil gate preserves pre-judge-gating behavior (current tests already pass with nil).
- New InjectGate interface + InjectGateFunc adapter for tests and non-LLM callers.
- Per-candidate gate.Approve(query, hit) call inserted between the dedup and the inject. Rejected candidates skip silently; injected count reflects the post-gate decision.

internal/matrix/judge.go (new, 152 lines):
- LLMJudgeGate calls an Ollama-shape /api/chat endpoint with the same 1-5 staffing-rubric prompt that worked in multi_coord run #008. Fail-closed on HTTP/JSON errors (don't inject if the judge can't speak; a miss is better than a wrong-domain inject).
- NewLLMJudgeGate returns nil when URL or Model is empty, matching InjectGate's nil-means-no-judge semantics.

internal/matrix/retrieve.go:
- SearchRequest gains JudgeURL, JudgeModel, JudgeMinRating fields. Search() builds an LLMJudgeGate when they are set and passes nil otherwise. Backward compatible: existing callers see no behavior change.

Tests:
- TestInjectPlaybookMisses_GateRejectsCandidate (rejectAll → 0 injected, even with tight distance)
- TestInjectPlaybookMisses_GateApprovesCandidate (approveAll → same as nil-gate behavior)
- TestInjectPlaybookMisses_GateSeesCorrectQuery (gate receives CURRENT query + RECORDED query separately so it can score the (current, candidate) pair)
- All 5 existing inject tests updated to new signature

go test ./internal/matrix → all 8 inject tests pass.
go test ./internal/matrix ./internal/shared ./cmd/{matrixd,queryd,pathwayd,observerd} → all green.

STATE_OF_PLAY:
- OPEN item #1 (judge-gated injection) closed.
- DO NOT RELITIGATE adds the substrate-level judge-gate lock.
- OPEN list now 5 rows (was 6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 247e36e687
commit 5a3364f539
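The commit message describes `InjectGateFunc` as an adapter for tests and non-LLM callers. A minimal sketch of what such a caller might look like follows, assuming trimmed stand-in copies of `PlaybookHit`, `PlaybookEntry`, `InjectGate` and `InjectGateFunc` (only the fields used here are reproduced). `corpusGate` and `approved` are hypothetical helpers written for this illustration and are not part of the commit.

```go
package main

import "fmt"

// PlaybookEntry and PlaybookHit are trimmed stand-ins for the types in
// internal/matrix; only the fields this sketch needs are reproduced.
type PlaybookEntry struct {
	QueryText    string
	AnswerID     string
	AnswerCorpus string
}

type PlaybookHit struct {
	Distance float32
	Entry    PlaybookEntry
}

// InjectGate and InjectGateFunc mirror the shapes added by the commit.
type InjectGate interface {
	Approve(query string, hit PlaybookHit) bool
}

type InjectGateFunc func(query string, hit PlaybookHit) bool

func (f InjectGateFunc) Approve(q string, h PlaybookHit) bool { return f(q, h) }

// corpusGate is a hypothetical metadata-only gate: it approves a
// candidate only when its answer corpus is on the allow list.
func corpusGate(allowed ...string) InjectGate {
	ok := make(map[string]bool, len(allowed))
	for _, c := range allowed {
		ok[c] = true
	}
	return InjectGateFunc(func(_ string, hit PlaybookHit) bool {
		return ok[hit.Entry.AnswerCorpus]
	})
}

// approved applies a gate the same way the inject loop does:
// a nil gate lets every candidate through.
func approved(query string, hits []PlaybookHit, gate InjectGate) []PlaybookHit {
	var out []PlaybookHit
	for _, h := range hits {
		if gate != nil && !gate.Approve(query, h) {
			continue
		}
		out = append(out, h)
	}
	return out
}

func main() {
	hits := []PlaybookHit{
		{Distance: 0.10, Entry: PlaybookEntry{AnswerID: "w-99", AnswerCorpus: "workers"}},
		{Distance: 0.12, Entry: PlaybookEntry{AnswerID: "c-7", AnswerCorpus: "contracts"}},
	}
	fmt.Println(len(approved("forklift loader", hits, nil)))                   // 2
	fmt.Println(len(approved("forklift loader", hits, corpusGate("workers")))) // 1
}
```

Because the gate is a plain interface, a filter like this can be combined with the distance threshold without touching the LLM path at all.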
@@ -202,6 +202,7 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 - **Boost / inject use SEPARATE thresholds.** Boost stays at `DefaultPlaybookMaxDistance = 0.5` (safe — only re-ranks results already in retrieval). Inject uses tighter `DefaultPlaybookMaxInjectDistance = 0.20` (Shape B forces results in, so loose match cross-pollinates wrong-domain). Don't merge them.
 - **Multi-coord product theory is empirically VALIDATED** by run #011 (Phase 3). Per-coordinator playbook namespaces (`playbook_alice` etc.) with cross-coordinator handover (Bob takes Alice's contract using `playbook_corpus=playbook_alice`) work end-to-end including paraphrase handover. Don't propose to "test if multi-coord works" — it does.
 - **Auth posture is locked per ADR-006.** Non-loopback bind requires `auth.token` (mechanical gate at `shared.Run`). Operators set the token via `token_env` (defaults to `AUTH_TOKEN`) loaded by systemd `EnvironmentFile=/etc/lakehouse/auth.env`, NOT in the committed TOML. Internal services use `AllowedIPs`; external boundary uses Bearer. Token rotation is dual-token via `secondary_tokens`. TLS terminates at the edge (nginx/Caddy), not in-process. Don't re-litigate.
+- **Shape B inject has a judge-gate substrate.** `InjectPlaybookMisses` takes an optional `InjectGate` (interface) that approves each candidate before the rank insert. `LLMJudgeGate` (Ollama-shape /api/chat client) is the default impl; nil gate = pre-judge-gating distance-only behavior preserved for backward compat. Caller wires via `SearchRequest.{JudgeURL, JudgeModel, JudgeMinRating}`. Closes the lift-suite tail issues (Q6↔Q7 adjacent-query swap + Q9/Q15 paraphrase drift) at substrate level.
 - **Fresh content uses two-tier indexing.** Fresh resumes go to `fresh_workers` corpus, not the main `workers` index. coder/hnsw incremental adds to a populated 5K+ graph have recall issues; the small hot index has no such crowding. Periodic merge (post-G3) consolidates fresh→main. Canonical NRT pattern.
 - **`embedd.default_model = "nomic-embed-text-v2-moe"`** (475M MoE, 768d). Don't bump to `nomic-embed-text` (137M) "for speed" — diversity scores fell from 0.000 → 0.080 same-role-across-contracts on the smaller model. Cost (~5× slower ingest) is acceptable for once-per-deploy work.
 - **Inbox flow: parse + search + judge + trace.** `/v1/observer/inbox` records the body; coordinator/driver parses via LLM (qwen2.5 format=json), runs `matrix.search` on the parsed query, judges top-1 against the ORIGINAL body, emits Langfuse spans through it all. Don't replace the judge re-rate with a distance-only gate — tight distance + low rating is the load-bearing honesty signal.
@@ -219,12 +220,11 @@ The list is intentionally short. Items move to closed when the work demands them

 | # | Item | When to act |
 |---|---|---|
-| 1 | **Judge-gated playbook injection** — close lift-suite tail issues (Q6↔Q7 swap, Q9/Q15 paraphrase drift) by routing every Shape B injection through the judge before the rank insert lands. Multi-coord run #008 already proved the judge can distinguish tight-but-wrong from tight-and-right; this lifts that pattern into the matrix substrate. ~1.5 hr. | When playbook quality starts mattering more than retrieval throughput. |
-| 2 | **Wider Langfuse instrumentation across daemons** — `internal/langfuse/middleware.go` that auto-emits one span per HTTP request from every daemon's `shared.Run`. Production traffic gets free trace visibility without per-handler wiring. | When production traffic actually starts hitting the gateway. |
-| 3 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
-| 4 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
-| 5 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
-| 6 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |
+| 1 | **Wider Langfuse instrumentation across daemons** — `internal/langfuse/middleware.go` that auto-emits one span per HTTP request from every daemon's `shared.Run`. Production traffic gets free trace visibility without per-handler wiring. | When production traffic actually starts hitting the gateway. |
+| 2 | **Periodic fresh→main index merge** — two-tier pattern works but `fresh_workers` grows monotonically. A scheduled job that re-ingests the fresh corpus into `workers` (with the v2-moe embedder) + clears fresh closes the loop. | When `fresh_workers` crosses ~500 items in production. |
+| 3 | **Distillation full port** — `57d0df1` shipped scorer + contamination firewall (E partial). SFT export pipeline + audit_baselines lineage still on the Rust side. | When distillation becomes a production dependency. |
+| 4 | **Drift quantification** — `be65f85` is "scorer drift first." Full distribution-drift signal is underspecified everywhere; this is research, not a port. | Open research item; no calendar. |
+| 5 | **Operational nice-to-haves** — real-time wall-clock for the stress harness; chatd fixture-mode storage half (mock S3 for CI without MinIO); liberal-paraphrase calibration once real coordinator queries land. | When any of these block someone. |

 ---

internal/matrix/judge.go (new file, +152)
@@ -0,0 +1,152 @@
+package matrix
+
+import (
+	"bytes"
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"log/slog"
+	"net/http"
+	"strings"
+	"time"
+)
+
+// LLMJudgeGate is an InjectGate implementation that uses an Ollama-
+// compatible chat endpoint (or chatd's /v1/chat) to rate the
+// (query, candidate) pair on a 1-5 rubric, then approves the
+// injection iff rating >= MinRating.
+//
+// The HTTP path is intentionally generic — works against any
+// endpoint that speaks Ollama's /api/chat shape: bare Ollama,
+// chatd's /v1/chat, or anything else honoring the same JSON.
+// Per-call timeout is bounded by the parent ctx + the http.Client.
+//
+// Best-effort posture: a judge call that fails (network, JSON
+// decode, anything) returns Approve=false. Same fail-closed default
+// as the inject path's distance gate — when the judge can't speak,
+// don't inject (better silent miss than confident wrong-domain).
+//
+// Usage from retrieve.go:
+//	gate := matrix.NewLLMJudgeGate(req.JudgeURL, req.JudgeModel,
+//		req.JudgeMinRating, hc)
+//	results, injected = matrix.InjectPlaybookMisses(req.QueryText,
+//		results, hits, maxInjectDist, gate)
+type LLMJudgeGate struct {
+	URL        string
+	Model      string
+	MinRating  int
+	HTTPClient *http.Client
+}
+
+// NewLLMJudgeGate is the constructor. Defaults: minRating 3, 10s
+// HTTP timeout. URL must include the path (e.g.
+// "http://localhost:11434/api/chat" for bare Ollama). Returns nil
+// when URL or Model is empty — caller treats nil InjectGate as
+// "no judge configured, default-approve" per InjectPlaybookMisses
+// contract.
+func NewLLMJudgeGate(url, model string, minRating int, hc *http.Client) *LLMJudgeGate {
+	if url == "" || model == "" {
+		return nil
+	}
+	if minRating <= 0 {
+		minRating = 3
+	}
+	if hc == nil {
+		hc = &http.Client{Timeout: 10 * time.Second}
+	}
+	return &LLMJudgeGate{
+		URL:        url,
+		Model:      model,
+		MinRating:  minRating,
+		HTTPClient: hc,
+	}
+}
+
+// Approve calls the LLM judge with a query+candidate prompt; returns
+// true iff the judge's rating meets MinRating. Errors return false
+// (fail-closed — see type doc).
+func (g *LLMJudgeGate) Approve(query string, hit PlaybookHit) bool {
+	if g == nil || query == "" {
+		// No judge or no query to judge against — treat as approve.
+		// Empty-query case mirrors InjectPlaybookMisses' contract:
+		// callers without a query string can't usefully judge.
+		return true
+	}
+	rating := g.rate(query, hit)
+	return rating >= g.MinRating
+}
+
+func (g *LLMJudgeGate) rate(query string, hit PlaybookHit) int {
+	system := `You rate retrieval results for a staffing co-pilot.
+Rate the result 1-5 against the query:
+5 = perfect match (this person/role IS what was asked for)
+4 = strong match (right field, right level, minor mismatches)
+3 = adjacent match (related field or partial overlap)
+2 = weak/tangential match
+1 = irrelevant
+Output JSON only: {"rating": N, "reason": "<one sentence>"}.`
+	// We pass the recorded query text + answer ID to give the judge
+	// minimal context. Production might also fetch the answer's
+	// metadata, but that requires a second HTTP hop; the recorded
+	// query is usually enough to sniff wrong-domain matches.
+	user := fmt.Sprintf("Query: %q\n\nCandidate playbook entry:\n recorded_query: %q\n answer_id: %s\n answer_corpus: %s\n recorded_score: %.2f",
+		query, hit.Entry.QueryText, hit.Entry.AnswerID, hit.Entry.AnswerCorpus, hit.Entry.Score)
+
+	body, _ := json.Marshal(map[string]any{
+		"model":  g.Model,
+		"stream": false,
+		"format": "json",
+		"messages": []map[string]string{
+			{"role": "system", "content": system},
+			{"role": "user", "content": user},
+		},
+		"options": map[string]any{"temperature": 0},
+	})
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer cancel()
+	req, err := http.NewRequestWithContext(ctx, "POST", g.URL, bytes.NewReader(body))
+	if err != nil {
+		slog.Warn("matrix.judge: build request", "err", err)
+		return 0
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err := g.HTTPClient.Do(req)
+	if err != nil {
+		slog.Warn("matrix.judge: HTTP", "err", err, "url", g.URL)
+		return 0
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode/100 != 2 {
+		slog.Warn("matrix.judge: non-2xx", "status", resp.StatusCode, "url", g.URL)
+		return 0
+	}
+	rb, _ := io.ReadAll(resp.Body)
+	var ollamaResp struct {
+		Message struct {
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
+		slog.Warn("matrix.judge: decode envelope", "err", err)
+		return 0
+	}
+	var v struct {
+		Rating int `json:"rating"`
+	}
+	// Some chat endpoints wrap content in markdown code fences even
+	// with format=json. Strip leading/trailing whitespace + fences.
+	content := strings.TrimSpace(ollamaResp.Message.Content)
+	content = strings.TrimPrefix(content, "```json")
+	content = strings.TrimPrefix(content, "```")
+	content = strings.TrimSuffix(content, "```")
+	content = strings.TrimSpace(content)
+	if err := json.Unmarshal([]byte(content), &v); err != nil {
+		slog.Warn("matrix.judge: decode rating", "err", err, "content", content)
+		return 0
+	}
+	if v.Rating < 1 || v.Rating > 5 {
+		return 0
+	}
+	return v.Rating
+}
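The tail of `rate` in judge.go strips markdown fences from the judge's reply and treats anything unparseable or out of range as rating 0, which can never meet `MinRating`. The sketch below isolates that step so it can be exercised without an HTTP server. `parseRating`, `approve` and the `fence` variable are hypothetical names for this illustration; the commit keeps the logic inline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fence is the three-backtick markdown fence, built at runtime so the
// literal sequence does not appear in this listing.
var fence = strings.Repeat("`", 3)

// parseRating extracts a 1-5 rating from a judge reply. It returns 0
// for anything it cannot use, so callers fail closed.
func parseRating(content string) int {
	content = strings.TrimSpace(content)
	content = strings.TrimPrefix(content, fence+"json")
	content = strings.TrimPrefix(content, fence)
	content = strings.TrimSuffix(content, fence)
	content = strings.TrimSpace(content)

	var v struct {
		Rating int `json:"rating"`
	}
	if err := json.Unmarshal([]byte(content), &v); err != nil {
		return 0
	}
	if v.Rating < 1 || v.Rating > 5 {
		return 0
	}
	return v.Rating
}

// approve applies the MinRating comparison used by the gate.
func approve(content string, minRating int) bool {
	return parseRating(content) >= minRating
}

func main() {
	replies := []string{
		`{"rating": 4, "reason": "right field"}`,
		fence + "json\n{\"rating\": 2}\n" + fence,
		"sorry, I cannot rate this",
		`{"rating": 9}`,
	}
	for _, r := range replies {
		fmt.Println(parseRating(r), approve(r, 3))
	}
}
```

Returning 0 rather than an error keeps the caller's comparison a single expression; the trade-off is that a malformed reply and a genuinely low rating become indistinguishable unless the warning logs are consulted.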
@@ -202,7 +202,24 @@ type PlaybookHit struct {
 // playbook-corpus cosine distance exceeds it are skipped (the boost
 // path may still re-rank them in place). Pass 0 (or any non-positive
 // value) to use DefaultPlaybookMaxInjectDistance.
-func InjectPlaybookMisses(results []Result, hits []PlaybookHit, maxInjectDist float32) ([]Result, int) {
+//
+// gate is an optional approval callback called once per CANDIDATE
+// (post-distance-filter, post-dedup) before injection. Returning
+// false rejects that candidate. Use nil for the historical "all
+// distance-eligible candidates inject" behavior.
+//
+// Multi-coord run #008's judge re-rating proved that distance + LLM
+// rating disagree often enough to matter (Q3 crane: dist 0.23 looks
+// confident, judge says 1/5 = irrelevant). Lift-suite tail issues
+// (Q6↔Q7 swap, Q9/Q15 paraphrase drift) are exactly this shape —
+// embedding-tight but wrong-domain. The gate parameter lets callers
+// route those candidates through a judge before the inject lands.
+//
+// query is the current search's query text — passed to the gate so
+// it can score (query, candidate) pairs without re-deriving from
+// SearchRequest. Empty when the caller doesn't have it (gate
+// implementations should treat empty query as "skip judge, allow").
+func InjectPlaybookMisses(query string, results []Result, hits []PlaybookHit, maxInjectDist float32, gate InjectGate) ([]Result, int) {
 	if len(hits) == 0 {
 		return results, 0
 	}
@@ -235,7 +252,16 @@ func InjectPlaybookMisses(results []Result, hits []PlaybookHit, maxInjectDist fl
 		}
 	}

+	injected := 0
 	for _, h := range bestForKey {
+		// Judge gate (per OPEN item #1, closed by this commit):
+		// post-distance-filter, ask the gate whether the candidate
+		// actually fits the current query before letting it inject.
+		// Closes the lift-suite tail issues where embedding said
+		// "tight" but a judge said "wrong domain."
+		if gate != nil && !gate.Approve(query, h) {
+			continue
+		}
 		injectedDist := h.Distance * float32(h.Entry.BoostFactor())
 		// Synthesize metadata that flags the injection so callers
 		// (driver/UI/observer) can distinguish "regular retrieval"
@@ -256,11 +282,34 @@ func InjectPlaybookMisses(results []Result, hits []PlaybookHit, maxInjectDist fl
 			Distance: injectedDist,
 			Metadata: meta,
 		})
+		injected++
 	}

-	return results, len(bestForKey)
+	return results, injected
 }

+// InjectGate is the optional approval callback for Shape B inject.
+// Called once per candidate (after distance filter, after dedup).
+// Returning false rejects that candidate. Implementations:
+//   - LLMJudgeGate (this package, see judge.go): Ollama LLM rates the
+//     (query, candidate) pair against a 1-5 rubric.
+//   - InjectGateFunc (this package): zero-deps adapter for arbitrary
+//     caller logic — useful in tests + when callers want non-LLM
+//     gating (e.g. metadata-only filters).
+//
+// nil InjectGate = pre-judge-gating behavior (all distance-eligible
+// candidates inject); preserves backward compatibility.
+type InjectGate interface {
+	Approve(query string, hit PlaybookHit) bool
+}
+
+// InjectGateFunc adapts a plain function to the InjectGate interface.
+// Used heavily in tests; production callers usually use LLMJudgeGate.
+type InjectGateFunc func(query string, hit PlaybookHit) bool
+
+// Approve makes InjectGateFunc satisfy InjectGate.
+func (f InjectGateFunc) Approve(q string, h PlaybookHit) bool { return f(q, h) }
+
 // ApplyPlaybookBoost re-ranks results in place using matched
 // playbook hits. For each hit whose (AnswerID, AnswerCorpus)
 // matches a result, multiply that result's distance by the hit's
@@ -187,7 +187,7 @@ func TestInjectPlaybookMisses_AddsMissingAnswers(t *testing.T) {
 			},
 		},
 	}
-	out, injected := InjectPlaybookMisses(results, hits, 0)
+	out, injected := InjectPlaybookMisses("test query", results, hits, 0, nil)
 	if injected != 1 {
 		t.Fatalf("expected 1 injected, got %d", injected)
 	}
@@ -240,7 +240,7 @@ func TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent(t *testing.T) {
 			},
 		},
 	}
-	out, injected := InjectPlaybookMisses(results, hits, 0)
+	out, injected := InjectPlaybookMisses("test query", results, hits, 0, nil)
 	if injected != 0 {
 		t.Errorf("expected 0 injected (answer already present), got %d", injected)
 	}
@@ -266,7 +266,7 @@ func TestInjectPlaybookMisses_DedupesPerAnswer(t *testing.T) {
 			Entry: PlaybookEntry{QueryText: "q2", AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0},
 		},
 	}
-	out, injected := InjectPlaybookMisses(results, hits, 0.5) // explicit loose threshold so 0.30 hits qualify
+	out, injected := InjectPlaybookMisses("test query", results, hits, 0.5, nil) // explicit loose threshold so 0.30 hits qualify
 	if injected != 1 {
 		t.Errorf("expected 1 injection (deduped), got %d", injected)
 	}
@@ -280,6 +280,89 @@ func TestInjectPlaybookMisses_DedupesPerAnswer(t *testing.T) {
 	}
 }

+// TestInjectPlaybookMisses_GateRejectsCandidate locks the judge-gate
+// path (OPEN item #1, closed by this commit). When the InjectGate
+// returns false on a candidate, the candidate is skipped — even if
+// distance would otherwise allow it. Closes the lift-suite tail
+// issues where embedding said "tight" but a judge said "wrong domain."
+func TestInjectPlaybookMisses_GateRejectsCandidate(t *testing.T) {
+	results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
+	hits := []PlaybookHit{
+		{
+			PlaybookID: "pb-x",
+			Distance: 0.10, // tight in cosine — would inject without gate
+			Entry: PlaybookEntry{
+				QueryText: "recorded crane operator query",
+				AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0,
+			},
+		},
+	}
+	rejectAll := InjectGateFunc(func(string, PlaybookHit) bool { return false })
+	out, injected := InjectPlaybookMisses("forklift loader query", results, hits, 0, rejectAll)
+	if injected != 0 {
+		t.Errorf("rejectAll gate should skip injection, got %d injected", injected)
+	}
+	if len(out) != 1 {
+		t.Errorf("results should be unchanged at len=1, got %d", len(out))
+	}
+}
+
+// TestInjectPlaybookMisses_GateApprovesCandidate locks the
+// always-approve gate path: behavior matches nil-gate (current
+// distance-only filter). Useful for tests that want to assert
+// "judge-gate API is wired" without an actual decision.
+func TestInjectPlaybookMisses_GateApprovesCandidate(t *testing.T) {
+	results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
+	hits := []PlaybookHit{
+		{
+			PlaybookID: "pb-x",
+			Distance: 0.10,
+			Entry: PlaybookEntry{
+				QueryText: "x", AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0,
+			},
+		},
+	}
+	approveAll := InjectGateFunc(func(string, PlaybookHit) bool { return true })
+	out, injected := InjectPlaybookMisses("test query", results, hits, 0, approveAll)
+	if injected != 1 {
+		t.Errorf("approveAll gate should inject, got %d", injected)
+	}
+	if len(out) != 2 {
+		t.Errorf("results should grow to 2, got %d", len(out))
+	}
+}
+
+// TestInjectPlaybookMisses_GateSeesCorrectQuery locks the gate's
+// query+hit visibility — the gate must receive the CURRENT search's
+// query (not the recorded one) so it can judge the (current_query,
+// candidate) pair. The recorded query lives on hit.Entry.QueryText.
+func TestInjectPlaybookMisses_GateSeesCorrectQuery(t *testing.T) {
+	results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
+	hits := []PlaybookHit{
+		{
+			PlaybookID: "pb-x",
+			Distance: 0.10,
+			Entry: PlaybookEntry{
+				QueryText: "RECORDED",
+				AnswerID: "w-99", AnswerCorpus: "workers", Score: 1.0,
+			},
+		},
+	}
+	var seenQuery, seenRecordedQuery string
+	gate := InjectGateFunc(func(q string, h PlaybookHit) bool {
+		seenQuery = q
+		seenRecordedQuery = h.Entry.QueryText
+		return true
+	})
+	_, _ = InjectPlaybookMisses("CURRENT", results, hits, 0, gate)
+	if seenQuery != "CURRENT" {
+		t.Errorf("gate received query=%q, want CURRENT", seenQuery)
+	}
+	if seenRecordedQuery != "RECORDED" {
+		t.Errorf("gate received recorded=%q, want RECORDED", seenRecordedQuery)
+	}
+}
+
 // TestInjectPlaybookMisses_RespectsInjectThreshold locks the
 // cross-pollination defense added after run #003: hits whose playbook
 // distance exceeds the inject threshold are skipped, preventing the
@@ -303,7 +386,7 @@ func TestInjectPlaybookMisses_RespectsInjectThreshold(t *testing.T) {
 		},
 	}
 	// Default threshold (0 → DefaultPlaybookMaxInjectDistance = 0.20)
-	out, injected := InjectPlaybookMisses(results, hits, 0)
+	out, injected := InjectPlaybookMisses("test query", results, hits, 0, nil)
 	if injected != 1 {
 		t.Errorf("expected 1 injection (only the tight hit qualifies), got %d", injected)
 	}
@@ -324,7 +407,7 @@ func TestInjectPlaybookMisses_RespectsInjectThreshold(t *testing.T) {
 // TestInjectPlaybookMisses_EmptyHits is a fast-path no-op check.
 func TestInjectPlaybookMisses_EmptyHits(t *testing.T) {
 	results := []Result{{ID: "w-1", Corpus: "workers", Distance: 0.30}}
-	out, injected := InjectPlaybookMisses(results, nil, 0)
+	out, injected := InjectPlaybookMisses("test query", results, nil, 0, nil)
 	if injected != 0 {
 		t.Errorf("expected 0 injection, got %d", injected)
 	}
@@ -84,6 +84,14 @@ type SearchRequest struct {
 	PlaybookTopK int `json:"playbook_top_k,omitempty"`
 	PlaybookMaxDistance float64 `json:"playbook_max_distance,omitempty"`
 	PlaybookMaxInjectDistance float64 `json:"playbook_max_inject_distance,omitempty"`
+	// JudgeURL: when set, every Shape B injection candidate is
+	// rated by an LLM at this Ollama-shape /api/chat endpoint
+	// (chatd's /v1/chat works too). Candidates with rating <
+	// JudgeMinRating are skipped. Empty = no judge gate (current
+	// behavior — distance-only filter).
+	JudgeURL string `json:"judge_url,omitempty"`
+	JudgeModel string `json:"judge_model,omitempty"`
+	JudgeMinRating int `json:"judge_min_rating,omitempty"`
 	MetadataFilter map[string]any `json:"metadata_filter,omitempty"`
 	// ExcludeIDs filters out specific worker IDs post-retrieval.
 	// Real-world driver: a coordinator places 200 workers at a
@@ -289,8 +297,14 @@ func (r *Retriever) Search(ctx context.Context, req SearchRequest) (*SearchRespo
 		if maxInjectDist <= 0 {
 			maxInjectDist = float32(DefaultPlaybookMaxInjectDistance)
 		}
+		// Optional LLM judge gate (per OPEN item #1). nil when
+		// JudgeURL/JudgeModel are unset → distance-only filter.
+		var gate InjectGate
+		if g := NewLLMJudgeGate(req.JudgeURL, req.JudgeModel, req.JudgeMinRating, nil); g != nil {
+			gate = g
+		}
 		var injected int
-		resp.Results, injected = InjectPlaybookMisses(resp.Results, hits, maxInjectDist)
+		resp.Results, injected = InjectPlaybookMisses(req.QueryText, resp.Results, hits, maxInjectDist, gate)
 		resp.PlaybookInjected = injected
 		if injected > 0 {
 			// Re-sort + truncate after injection. ApplyPlaybookBoost
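The retrieve.go hunk assigns the constructor's result to the `InjectGate` variable only when the pointer is non-nil. In Go, storing a nil `*T` in an interface value yields an interface that compares unequal to nil, so a direct assignment would make the `gate != nil` check in the inject loop true even with no judge configured. The sketch below shows the difference with stand-in types; `Gate`, `judgeGate`, `newJudgeGate`, `wireNaive` and `wireGuarded` are hypothetical names for this illustration only.

```go
package main

import "fmt"

// Gate stands in for matrix.InjectGate.
type Gate interface {
	Approve(query string) bool
}

// judgeGate stands in for *LLMJudgeGate.
type judgeGate struct{ url string }

// Approve tolerates a nil receiver, like the guard in the diff.
// The non-nil branch is a placeholder, not the real rating logic.
func (g *judgeGate) Approve(query string) bool {
	if g == nil {
		return true
	}
	return query != ""
}

// newJudgeGate returns a nil pointer when no URL is configured.
func newJudgeGate(url string) *judgeGate {
	if url == "" {
		return nil
	}
	return &judgeGate{url: url}
}

// wireNaive converts the pointer straight to the interface. With an
// empty URL the result holds a typed nil and is NOT equal to nil.
func wireNaive(url string) Gate {
	return newJudgeGate(url)
}

// wireGuarded follows the pattern in the hunk: only assign when the
// pointer is non-nil, so "no judge" stays a true nil interface.
func wireGuarded(url string) Gate {
	var gate Gate
	if g := newJudgeGate(url); g != nil {
		gate = g
	}
	return gate
}

func main() {
	fmt.Println(wireNaive("") == nil)   // false
	fmt.Println(wireGuarded("") == nil) // true
}
```

With the guard in place the nil-gate fast path is taken exactly when no judge is configured, which is what keeps existing callers on the distance-only behavior.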