1 Commits

Author SHA1 Message Date
root
9588bd82ae matrix: relevance filter — SPEC §3.4 component 3 of 5
Faithful port of mcp-server/relevance.ts (Rust observer's adjacency-
pollution filter). Same 5-signal scoring, same default threshold 0.3.
Adds POST /v1/matrix/relevance endpoint via matrixd.

Scoring signals (additive, can sign-flip):
  path_match     +1.0  chunk source/doc_id encodes focus.path
  filename_match +0.6  chunk text mentions focus's filename
  defined_match  +0.6  chunk text mentions focus.defined_symbols
  token_overlap  +0.4  jaccard of non-stopword tokens
  prefix_match   +0.3  chunk source shares first-2-segment prefix
  import_penalty -0.5  mentions ONLY imported symbols, no defined ones

What this does and doesn't do:
  - DOES filter code-aware corpora (eventually lakehouse_arch_v1,
    lakehouse_symbols_v1, scrum_findings_v1) — drops chunks about
    code the focus file IMPORTS rather than DEFINES, the
    "adjacency pollution" pattern that makes a reviewer LLM
    hallucinate imported-crate internals as belonging to the focus
  - DOES NOT meaningfully filter staffing data — the candidates
    reality test 2026-04-29 had "exact skill match buried at #3"
    which is a different problem (semantic-only ranking dominated
    by secondary text). Staffing needs structured filtering
    (status gates, location gates) that lives outside this
    package — future work, not in SPEC §3.4 yet

Headline smoke assertion: focus = crates/queryd/src/db.go which
defines Connector and imports catalogd::Registry. The filter
scores:
  Connector chunk: +0.68  (defined_match fires, kept)
  Registry chunk: -0.46  (import_only penalty fires, dropped)
  unrelated junk:  0.00  (no signals, dropped)

That's a 1.14-point gap between what we ARE and what we IMPORT —
the entire purpose of the filter.

Tests:
  - 9 unit tests in internal/matrix/relevance_test.go covering
    Tokenize, Jaccard, ExtractDefinedSymbols (Rust + TS),
    ExtractImportedSymbols, FilePrefix, ScoreRelevance per-signal,
    FilterChunks threshold splitting, and the headline
    AdjacencyPollutionScenario
  - scripts/relevance_smoke.sh integration smoke (3 assertions PASS):
    adjacency-pollution scenario, empty-chunks 400, threshold honored

13-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:13:22 -05:00