Item 3 — geo-filtered playbook boost; diagnostic logging

ROOT CAUSE (found via instrumentation, not hunch):
After a 20-scenario corpus batch, only 6/40 successful (role, city)
combos ever triggered playbook_memory citations on subsequent runs.
Added `playbook_boost:` tracing::info! line in vectord::service to log
boost map size vs candidate pool vs match count. One query revealed:

  boosts=170 sources=50 parsed=50 matched=0

170 endorsed workers came back from compute_boost_for — but zero were
in the 50-candidate Toledo pool. The boost map was pulling globally-
ranked semantic neighbors (top-100 playbooks across ALL cities),
dominated by Kansas City / Chicago / Detroit forklift playbooks the
Toledo SQL filter would never admit. The mechanism was correct at the
per-playbook level; the problem was pool intersection.

FIX (surgical, not cap-tuning):
- playbook_memory::compute_boost_for_filtered(): accepts optional
  (city, state) filter. When set, skips playbooks from other geos
  BEFORE cosine-ranking, so top-k is within the target city.
- Backwards-compatible: compute_boost_for() calls the filtered variant
  with None — existing callers unchanged.
- service::hybrid_search(): extracts target (city, state) from the
  executor's SQL filter via a small parser (extract_target_geo),
  passes to compute_boost_for_filtered.
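
The pathology and the fix reduce to a few lines. A toy sketch with made-up
entries (not the real BoostEntry types or scores): rank globally and
intersect afterwards, and a geo-filtered pool can miss entirely; filter
BEFORE taking top-k and the intersection survives.

```python
# Toy model: global top-k boost keys vs a city-filtered candidate pool.
playbooks = (
    [{"name": f"kc{i}", "city": "Kansas City", "score": 0.9} for i in range(3)]
    + [{"name": "t1", "city": "Toledo", "score": 0.6}]
)
pool = {"t1"}  # candidate names admitted by the Toledo SQL filter

def top_k(entries, k):
    # Highest-similarity-first truncation, standing in for the cosine sort.
    return sorted(entries, key=lambda e: -e["score"])[:k]

# Old behavior: global top-k, then intersect with the pool → zero matches.
global_top = {e["name"] for e in top_k(playbooks, 3)}
print(len(global_top & pool))  # → 0

# Fixed behavior: geo-filter first, then top-k → the match survives.
toledo_top = {e["name"] for e in top_k(
    [e for e in playbooks if e["city"] == "Toledo"], 3)}
print(len(toledo_top & pool))  # → 1
```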

VERIFIED:
  Before fix: boosts=170 sources=50 parsed=50 matched=0   (0% hit)
  After fix:  boosts=36  sources=50 parsed=50 matched=11  (22% hit)
Top-k=10 now has 7/10 boosted workers with 2-3 citations each.
Boost values 0.075-0.113 on cosine scores 0.67-0.74 — meaningful
reorder without saturation.
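
The re-rank itself is plain additive arithmetic; a minimal sketch with
hypothetical scores and boosts in the ranges reported above (not the
actual run data):

```python
# (name, cosine_score, playbook_boost) — illustrative values only.
sources = [("a", 0.74, 0.0), ("b", 0.70, 0.09), ("c", 0.67, 0.11)]

# Additive boost, then re-sort descending: boosted workers can flip
# above an unboosted hit without drowning out the cosine signal.
ranked = sorted(sources, key=lambda s: -(s[1] + s[2]))
print([name for name, _, _ in ranked])  # → ['b', 'c', 'a']
```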

scripts/kb_measure.py:
Aggregator that reads data/_kb/*.jsonl and playbooks/*/results.json,
reports fill rate, citation density, recommender confidence trend,
and zero-citation-ok combos (item 3 target signal). Used to measure
before/after on bigger batches.

Diagnostic logging stays — the class of "boosts computed but not
matched" bug can recur if the SQL filter format ever drifts, and
without the counter it's invisible. Every hybrid_search with
use_playbook_memory=true now logs its boost stats.
root 2026-04-20 21:35:04 -05:00
parent 330cb90f99
commit a663698571
3 changed files with 283 additions and 1 deletions


@@ -213,6 +213,28 @@ impl PlaybookMemory {
query_embedding: &[f32],
top_k_playbooks: usize,
base_weight: f32,
) -> HashMap<(String, String, String), BoostEntry> {
self.compute_boost_for_filtered(query_embedding, top_k_playbooks, base_weight, None).await
}
/// Same as `compute_boost_for` but only considers playbooks whose
/// (city, state) matches the caller's target. This is the honest
/// fix for the "boosts=170 matched=0" pathology: globally-ranked
/// semantic neighbors include playbooks from every city the query
/// could never reach via its SQL filter. When the caller knows the
/// target geo, restricting here collapses noise and raises the
/// endorsed-worker hit rate. Pass None for the original behavior.
///
/// 2026-04-21 — added after a corpus-density batch of 25 runs
/// showed only 6/40 successful (role, city) combos ever triggered
/// a citation on subsequent runs. Diagnostic logging proved the
/// boost map had 170 keys but the 50-candidate pool matched 0.
pub async fn compute_boost_for_filtered(
&self,
query_embedding: &[f32],
top_k_playbooks: usize,
base_weight: f32,
target_geo: Option<(&str, &str)>,
) -> HashMap<(String, String, String), BoostEntry> {
let state = self.state.read().await;
let entries = state.entries.clone();
@@ -225,8 +247,18 @@ impl PlaybookMemory {
drop(state);
// Brute-force cosine. Empty / missing embeddings just skip.
// When target_geo is set, pre-filter to matching playbooks BEFORE
// cosine sort — that way top-k is within the city, not across
// all cities.
let mut scored: Vec<(f32, &PlaybookEntry)> = entries
.iter()
.filter(|e| match (target_geo, &e.city, &e.state) {
(None, _, _) => true,
(Some((tc, ts)), Some(ec), Some(es)) => {
ec.eq_ignore_ascii_case(tc) && es.eq_ignore_ascii_case(ts)
}
_ => false,
})
.filter_map(|e| e.embedding.as_ref().map(|v| (cosine(query_embedding, v), e)))
.collect();
scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));


@@ -803,22 +803,61 @@ async fn hybrid_search(
// set. Additive boost on the existing vector score, then re-sort.
if req.use_playbook_memory {
let boost_k = req.playbook_memory_k.unwrap_or(playbook_memory::DEFAULT_TOP_K_PLAYBOOKS);
// Extract target (city, state) from the SQL filter so
// compute_boost_for can skip playbooks from other cities that
// would never intersect the candidate pool. The executor's
// filter shape is stable: `... city = 'Toledo' AND state = 'OH' ...`.
// Case-insensitive match, tolerant of single quotes and spaces.
let target_geo = req.sql_filter.as_deref().and_then(extract_target_geo);
// We embedded the question as `qv` above — reuse it for the
// playbook similarity lookup so we don't double-pay Ollama.
let boosts = state.playbook_memory.compute_boost_for(&qv, boost_k, 0.5).await;
let boosts = state.playbook_memory
.compute_boost_for_filtered(
&qv,
boost_k,
0.5,
target_geo.as_ref().map(|(c, s)| (c.as_str(), s.as_str())),
)
.await;
// Diagnostics for Phase 19 boost pipeline. Logged so item 3
// investigation has ground truth:
// - boosts.len(): how many (city,state,name) keys surfaced for
// this query (0 = playbook_memory found nothing semantically
// similar to the question).
// - parsed: how many candidate chunks parsed cleanly into
// (name,city,state) via parse_worker_chunk.
// - matched: how many parsed keys matched an entry in boosts.
// 2026-04-21 — 20-scenario batch showed 34/40 ok combos never
// got a citation. These counters pin whether the gap is on the
// SIMILARITY side (boosts empty) or the MATCH side (parsed vs
// boosted keys mismatch — e.g. name format drift).
let mut parsed_count = 0usize;
let mut matched_count = 0usize;
for src in sources.iter_mut() {
// Parse "{Name} — {Role} in {City}, {State}. …" chunk. Being
// defensive: chunks from other datasets may not follow this
// exact shape, so absent fields just skip the boost.
if let Some((name, city, state)) = parse_worker_chunk(&src.chunk_text) {
parsed_count += 1;
let key = (city, state, name);
if let Some(entry) = boosts.get(&key) {
src.score += entry.boost;
src.playbook_boost = entry.boost;
src.playbook_citations = entry.citations.clone();
matched_count += 1;
}
}
}
tracing::info!(
"playbook_boost: boosts={} sources={} parsed={} matched={} target_geo={:?} (query='{}')",
boosts.len(),
sources.len(),
parsed_count,
matched_count,
target_geo,
req.question.chars().take(60).collect::<String>(),
);
// Re-rank: boosted scores can flip ordering.
sources.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));
// Finally trim to the caller's requested top_k — we pulled fetch_k
@@ -2059,6 +2098,54 @@ struct LanceRecallQuery {
/// "{Name} — {Role} in {City}, {State}. Skills: …".
/// Returns None if the chunk doesn't match the shape; callers simply
/// skip the boost for that hit.
/// Pull (city, state) out of a SQL filter that uses
/// `city = 'Toledo' AND state = 'OH'` style equality. Returns None if
/// either is missing — the caller keeps the original global boost map
/// behavior (no geo narrowing). Case-insensitive on the column name
/// so `CITY=` or `City =` also work.
fn extract_target_geo(sql_filter: &str) -> Option<(String, String)> {
fn grab_eq(src: &str, col: &str) -> Option<String> {
// Very small parser, resilient enough for the executor's
// filter shapes. Matches `col = 'value'` or `col='value'` with
// case-insensitive column name.
let lower = src.to_ascii_lowercase();
let col_lower = col.to_ascii_lowercase();
let mut search_from = 0usize;
while let Some(off) = lower[search_from..].find(&col_lower) {
let pos = search_from + off;
// Require word boundary before the column name so "city"
// inside "civilian_rank" doesn't false-match.
let prior_ok = pos == 0
|| (!lower.as_bytes()[pos - 1].is_ascii_alphanumeric()
&& lower.as_bytes()[pos - 1] != b'_');
let after = pos + col_lower.len();
if !prior_ok || after >= src.len() {
search_from = pos + col_lower.len();
continue;
}
// Walk past whitespace, require '='.
let mut i = after;
while i < src.len() && src.as_bytes()[i] == b' ' { i += 1; }
if i >= src.len() || src.as_bytes()[i] != b'=' { search_from = pos + col_lower.len(); continue; }
i += 1;
while i < src.len() && src.as_bytes()[i] == b' ' { i += 1; }
// Value is single-quoted literal; extract until the next '.
if i >= src.len() || src.as_bytes()[i] != b'\'' { search_from = pos + col_lower.len(); continue; }
i += 1;
let start = i;
while i < src.len() && src.as_bytes()[i] != b'\'' { i += 1; }
if i > start {
return Some(src[start..i].to_string());
}
search_from = pos + col_lower.len();
}
None
}
let city = grab_eq(sql_filter, "city")?;
let state = grab_eq(sql_filter, "state")?;
Some((city, state))
}
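
For illustration only — not part of the commit — the same
grab-the-quoted-literal idea can be sketched as a Python regex, where a
lookbehind stands in for the hand-rolled word-boundary check:

```python
import re

def extract_target_geo_py(sql_filter):
    """Hypothetical port: pull (city, state) from `col = 'value'`
    equality clauses. Column match is case-insensitive; the lookbehind
    rejects matches inside longer identifiers (e.g. a sub_city column)."""
    def grab(col):
        m = re.search(
            r"(?<![A-Za-z0-9_])" + col + r"\s*=\s*'([^']+)'",
            sql_filter,
            re.IGNORECASE,
        )
        return m.group(1) if m else None

    city, state = grab("city"), grab("state")
    return (city, state) if city and state else None
```

Behaves like the Rust version on the executor's filter shape:
`extract_target_geo_py("role='picker' AND city = 'Toledo' AND state='OH'")`
gives `("Toledo", "OH")`, while a filter with no real `city` column yields
`None` and the caller keeps the global boost map.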
fn parse_worker_chunk(chunk: &str) -> Option<(String, String, String)> {
// "Name — Role in City, ST. …" → split on "—" then " in " then ","
let (name_part, rest) = chunk.split_once('—')?;
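
The rest of the parser is cut off in this view; as a hypothetical Python
sketch of the documented shape (mirrors the doc comment above, not the
actual Rust body):

```python
def parse_worker_chunk_py(chunk):
    """Parse "{Name} — {Role} in {City}, {State}. …" → (name, city, state).
    Returns None when the chunk doesn't match the shape, so callers can
    simply skip the boost — same defensive intent as the Rust fn."""
    if "—" not in chunk:
        return None
    name, _, rest = chunk.partition("—")
    _role, sep, loc = rest.partition(" in ")
    if not sep or "," not in loc:
        return None
    city, _, tail = loc.partition(",")
    # State is the token before the first period: "OH. Skills: …" → "OH".
    state = tail.strip().split(".", 1)[0].strip()
    if not state:
        return None
    return (name.strip(), city.strip(), state)
```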

scripts/kb_measure.py Executable file

@@ -0,0 +1,163 @@
#!/usr/bin/env python3
"""Aggregate KB state for item 3 decision.
Reads data/_kb/*.jsonl and tests/multi-agent/playbooks/*/results.json
to answer:
- How many distinct signatures exist?
- Total runs, avg ok rate, avg citations per event?
- Which (role, city) combos have NEVER gotten a citation?
- Recommender confidence progression (cold → medium → high)?
- Mean turn count trend across runs (proxy for efficiency).
Run after `scripts/run_kb_batch.sh` completes. Writes a markdown
summary to tests/multi-agent/playbooks/kb_measurement.md and prints
to stdout.
"""
import json
import os
import sys
from collections import Counter, defaultdict
from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]
KB = ROOT / "data" / "_kb"
PLAYBOOKS = ROOT / "tests" / "multi-agent" / "playbooks"
def load_jsonl(p):
    if not p.exists():
        return []
    out = []
    for line in p.read_text().splitlines():
        if line.strip():
            try:
                out.append(json.loads(line))
            except json.JSONDecodeError:
                pass
    return out


def main():
    sigs = load_jsonl(KB / "signatures.jsonl")
    outcomes = load_jsonl(KB / "outcomes.jsonl")
    recs = load_jsonl(KB / "pathway_recommendations.jsonl")
    corrections = load_jsonl(KB / "error_corrections.jsonl")

    # --- Basic counts ---
    print(f"Signatures: {len(sigs)}")
    print(f"Outcomes: {len(outcomes)}")
    print(f"Recommendations: {len(recs)}")
    print(f"Error corrections: {len(corrections)}")
    print()

    # --- Recommender confidence progression ---
    conf_counts = Counter(r.get("confidence", "?") for r in recs)
    print("Recommender confidence distribution:")
    for c in ("high", "medium", "low"):
        print(f"  {c:8s}: {conf_counts.get(c, 0)}")
    print()

    # Time-ordered confidence
    recs_sorted = sorted(recs, key=lambda r: r.get("generated_at", ""))
    neighbor_counts = [len(r.get("neighbors_consulted", [])) for r in recs_sorted]
    if neighbor_counts:
        print("Neighbors consulted over time (first → last):")
        print(f"  first 3: {neighbor_counts[:3]}")
        print(f"  last 3:  {neighbor_counts[-3:]}")
        print(f"  max:     {max(neighbor_counts)}")
        print()

    # --- Fill rate + citation density per run ---
    if outcomes:
        total_ok = sum(o["ok_events"] for o in outcomes)
        total_events = sum(o["total_events"] for o in outcomes)
        total_cites = sum(o.get("total_citations", 0) for o in outcomes)
        total_turns = sum(o.get("total_turns", 0) for o in outcomes)
        print(f"Fill rate: {total_ok}/{total_events} = {100*total_ok/max(1, total_events):.1f}%")
        print(f"Avg citations per run: {total_cites/len(outcomes):.2f}")
        print(f"Avg turns per run: {total_turns/len(outcomes):.1f}")
        print()

        # First 5 runs vs last 5 — does it get better?
        sorted_out = sorted(outcomes, key=lambda o: o.get("created_at", ""))
        if len(sorted_out) >= 10:
            first = sorted_out[:5]
            last = sorted_out[-5:]
            fok = sum(o["ok_events"] for o in first) / sum(o["total_events"] for o in first)
            lok = sum(o["ok_events"] for o in last) / sum(o["total_events"] for o in last)
            fcit = sum(o.get("total_citations", 0) for o in first) / 5
            lcit = sum(o.get("total_citations", 0) for o in last) / 5
            print(f"First 5 runs ok rate: {100*fok:.1f}%  avg cites: {fcit:.2f}")
            print(f"Last 5 runs ok rate:  {100*lok:.1f}%  avg cites: {lcit:.2f}")
            print()

    # --- Per-(role, city) citation coverage ---
    cite_by_combo = Counter()
    combo_attempts = Counter()
    for o in outcomes:
        for ev in o.get("per_event", []):
            key = (ev.get("role", "?"), "?")  # city not in per_event summary
            combo_attempts[key] += 1

    # Read the playbook dirs for full event detail (has city)
    cites_by_role_city = defaultdict(lambda: {"attempts": 0, "citations": 0, "ok": 0})
    for o in outcomes:
        run_dir = PLAYBOOKS / o["run_id"]
        results_file = run_dir / "results.json"
        if not results_file.exists():
            continue
        try:
            results = json.loads(results_file.read_text())
        except Exception:
            continue
        for r in results:
            e = r.get("event", {})
            key = (e.get("role"), e.get("city"), e.get("state"))
            cites_by_role_city[key]["attempts"] += 1
            cites_by_role_city[key]["citations"] += len(r.get("playbook_citations") or [])
            if r.get("ok"):
                cites_by_role_city[key]["ok"] += 1

    combos_with_cites = [(k, v) for k, v in cites_by_role_city.items() if v["citations"] > 0]
    combos_zero_cites = [(k, v) for k, v in cites_by_role_city.items() if v["citations"] == 0 and v["ok"] > 0]
    print(f"(role, city, state) combos with any citation: {len(combos_with_cites)}")
    print(f"(role, city, state) combos with ok fills but 0 cites: {len(combos_zero_cites)}")
    print()
    if combos_with_cites:
        print("Top 10 combos by citation count:")
        for (role, city, state), v in sorted(combos_with_cites, key=lambda x: -x[1]["citations"])[:10]:
            print(f"  {role:25s} {city:15s} {state}: {v['citations']} cites across {v['attempts']} attempts ({v['ok']} ok)")
        print()

    # --- Write markdown report ---
    lines = ["# KB Measurement Report", ""]
    lines.append(f"Generated from {len(outcomes)} runs across {len(sigs)} distinct signatures.")
    lines.append("")
    lines.append("## Recommender confidence")
    for c in ("high", "medium", "low"):
        lines.append(f"- {c}: {conf_counts.get(c, 0)}")
    lines.append("")
    lines.append("## Overall fill + citation")
    if outcomes:
        lines.append(f"- Fill rate: **{total_ok}/{total_events}** ({100*total_ok/max(1, total_events):.1f}%)")
        lines.append(f"- Avg citations per run: **{total_cites/len(outcomes):.2f}**")
        lines.append(f"- Avg turns per run: {total_turns/len(outcomes):.1f}")
    lines.append("")
    lines.append("## Citation coverage by (role, city, state)")
    lines.append(f"- Combos with ≥1 citation: {len(combos_with_cites)}")
    lines.append(f"- Combos with ok fills but 0 citations: {len(combos_zero_cites)}")
    lines.append("")
    lines.append("## Item 3 decision signal")
    if combos_zero_cites:
        lines.append("Non-zero: there are **combos that succeeded but never triggered playbook_memory boost**. Candidates for item 3 investigation:")
        for (role, city, state), v in combos_zero_cites[:5]:
            lines.append(f"- {role} in {city}, {state}: {v['ok']}/{v['attempts']} ok, 0 cites")
    else:
        lines.append("All ok combos got at least some citation firing. Boost mechanism is healthy; raising the cap may help but isn't forced.")
    lines.append("")

    out = PLAYBOOKS / "kb_measurement.md"
    out.write_text("\n".join(lines))
    print(f"✓ markdown report → {out}")


if __name__ == "__main__":
    main()