9 Commits

Author SHA1 Message Date
root
67d1957b87 matrix: split boost / inject thresholds — kills Shape B cross-pollination
Run #003 surfaced rampant cross-pollination: Q2's "OSHA-30 forklift
Wisconsin" recording (w-4435) became warm top-1 for Q19 (dental
hygienist), Q20 (RN), Q21 (software engineer), and 6 other unrelated
staffing queries. Cause: InjectPlaybookMisses inherited the same
DefaultPlaybookMaxDistance (0.5) as the boost path, but inject is
structurally riskier than boost — boost only re-ranks results that
already retrieved on their own merits, while inject FORCES a result
into top-K, so a loose match cross-pollinates wrong-domain answers.

Empirical motivation from v3:
  Implied playbook hit distances for cross-pollinated cases: 0.20-0.46
  Implied distances for the 6/6 paraphrase recoveries:        0.23-0.30
  Threshold of 0.20 should keep most paraphrases, kill the OOD bleed.

Implementation:
- New DefaultPlaybookMaxInjectDistance = 0.20 in playbook.go.
- New PlaybookMaxInjectDistance field on SearchRequest (override).
- InjectPlaybookMisses signature gains maxInjectDist param; hits whose
  Distance exceeds it are skipped (boost path may still re-rank them).
- TestInjectPlaybookMisses_RespectsInjectThreshold locks the contract
  with one tight + one loose hit, asserting only the tight one injects.
- Existing tests pass explicit threshold (0 = default for tight tests,
  0.5 for the dedupe test which uses 0.30 hits).

Run #004 result on identical queries with the split threshold:

  Verbatim discovery        8 (vs v3's 6 — judge variance, separate)
  Verbatim lift             6 / 8 (75%)
  Paraphrase top-1          6 / 8 (75%)
  Paraphrase any-rank in K  6 / 8

OOD queries Q19/Q20/Q21 ALL show warm top-1 = cold top-1 (no
injection) — cross-pollination eliminated where it was wrong-direction.
Mean Δ top-1 distance dropped from -0.164 (v3, distorted) to -0.071
(v4, comparable to v1's -0.053).

Two paraphrases missed in v4 (Q9, Q15) were ones where qwen2.5
rephrased liberally enough to drift past 0.20 — Q9: "Inventory
specialist..." → "Individual needed for inventory management..." and
Q15: "Engaged warehouse associate..." → "Warehouse associate currently
engaged with a robust history...". The system correctly refusing to
inject when it's not confident is the right product behavior; the
boost path still re-ranks recorded answers when they appear in regular
retrieval.

The Q6 ↔ Q7 cross-pollination ("Forklift-certified loader" ↔
"Hazmat warehouse worker") is legitimate — these are genuinely similar
staffing queries and the judge ranks both directions as plausible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:24:55 -05:00
root
154a72ea5e matrix: Shape B — inject playbook misses + 6/6 paraphrase recovery
The v0 boost-only stance documented in internal/matrix/playbook.go:22-27
("the boost only re-ranks results that ALREADY surfaced from the regular
retrieval") couldn't promote recorded answers that dropped out of a
paraphrase's top-K. playbook_lift_002 surfaced exactly that gap: 0/2
paraphrase recoveries because the recorded answers weren't in regular
retrieval at all (rank=-1).

Shape B: when warm-pass retrieval doesn't surface a playbook hit's
answer, inject a synthetic Result for it directly. Distance =
playbook_hit_distance × BoostFactor — same formula as the boost path so
injections land in comparable distance space. Caller re-sorts +
truncates after both boost and inject have run.

Result on playbook_lift_003 (Shape B + paraphrase pass):

  Verbatim discovery        6
  Verbatim lift             2 / 6
  **Paraphrase top-1**      **6 / 6**
  Paraphrase any-rank in K  6 / 6
  Mean Δ top-1 distance     -0.1637 (warm closer than cold)

Every paraphrase the judge generated landed the v1-recorded answer at
top-1 of the new query's results. The learning property holds — cosine
on embed(paraphrase) finds the recorded query's vector within
DefaultPlaybookMaxDistance (0.5), and Shape B injects the answer.

Verbatim lift dropped from v1's 7/8 because Shape B cross-pollinates
recorded answers across queries. w-4435 (Q2's recording) appears as
warm top-1 for several other queries because their embeddings are
within the playbook hit threshold of "OSHA-30 forklift Wisconsin." This
is a feature, not a bug — the matrix layer's purpose is to share
knowledge across queries — but the lift metric only counts "warm top-1
== cold judge best," so cross-pollinated lifts don't register. A v3
metric would re-judge warm pass to measure true judge improvement.

Tests:
- TestInjectPlaybookMisses_AddsMissingAnswers — primary claim
- TestInjectPlaybookMisses_SkipsAnswersAlreadyPresent — no double-inject
- TestInjectPlaybookMisses_DedupesPerAnswer — multi-hit same answer
- TestInjectPlaybookMisses_EmptyHits — fast-path no-op

Driver fix: ParaphraseRecordedRank int → *int. The `omitempty` int
silently dropped rank=0 (top-1, the WANTED value) from JSON, making the
v003 report show "null" instead of "0" for every successful recovery.
Pointer keeps nil/rank-0 distinguishable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 07:06:13 -05:00
root
622e124b8f phase 2: matrix.downgrade reads WeakModels from config
migrate the strong-model auto-downgrade gate from a hardcoded weak
list to cfg.Models.WeakModels. backward compatible: existing API
preserved, callers that don't migrate keep using DefaultWeakModels.

changes:
- internal/matrix/downgrade.go: split IsWeakModel into rule-based
  base (`:free` suffix/infix) + literal-list lookup. New
  IsWeakModelInList(model, list) takes the config-supplied list.
  DowngradeInput grows a WeakModels field; nil falls back to
  DefaultWeakModels (preserves pre-phase-2 behavior).
- internal/workflow/modes.go: add MatrixDowngradeWithWeakList(list)
  factory mirroring MatrixSearch's pattern. Plain MatrixDowngrade
  kept for backward compat.
- cmd/matrixd/main.go: handlers struct holds weakModels populated
  from cfg.Models.WeakModels at startup; handleDowngrade threads it
  into every DowngradeInput.
- cmd/observerd/main.go: registerBuiltinModes accepts weakModels
  and uses the factory variant. observerd reads cfg.Models.WeakModels
  in main().

end-to-end verified: downgrade + matrix + observer + workflow smokes
all pass. Existing TestMaybeDowngrade_TruthTable + TestIsWeakModel
unchanged (backward compat). Two new tests cover the config path:
- TestIsWeakModelInList — covers rule + literal + empty + nil
- TestMaybeDowngrade_WithConfigList — verifies cfg list overrides
  default

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 23:52:18 -05:00
root
b199093d1f B: matrix metadata filter — post-retrieval structured gate
Addresses the reality-test gap surfaced by the candidates and
multi-corpus e2e runs (0d1553c, a97881d): semantic-only retrieval
can't gate by status / state / availability. SearchRequest now
takes an optional MetadataFilter map; results whose metadata
doesn't match every key are dropped before top-K truncation.

Filter value semantics:
  string|number|bool → exact equality (JSON-canonical, so 1 ≡ 1.0)
  []any              → OR within key (any element matching wins)
  AND across keys: every filter key must match.

Missing key in metadata = drop. Malformed metadata = drop. Filter
absent or empty = pass through (zero overhead).

The response now reports MetadataFilterDropped so callers can see
how aggressive the filter was without re-querying.

Caveat (also captured in code comment): this is POST-retrieval, not
PRE-filtering via SQL. Aggressive filters can shrink the result set
below K; caller should bump PerCorpusK to compensate. A queryd-
backed pre-filter is a future commit; this lands the user-visible
fix today.

Tests:
  - 7 unit tests (internal/matrix/filter_test.go) covering: nil/
    empty filter pass-through, missing-metadata always-fails,
    single-value exact match (incl. numeric 5 ≡ 5.0), AND across
    keys, OR within list, bool match, malformed JSON metadata
  - matrix_smoke.sh: new assertion #7 — filter
    label∈{"a near","b near"} drops the 4 mid/far entries from the
    6-entry pool, keeping exactly 2 (one per corpus, both with the
    matching label). Dropped count surfaces in the response.

15-smoke regression all green. vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:08:56 -05:00
root
a730fc2016 scrum fixes: 4 real findings landed, 4 false positives dismissed
Cross-lineage scrum review on the 12 commits of this session
(afbb506..06e7152) via Rust gateway :3100 with Opus + Kimi +
Qwen3-coder. Results:

  Real findings landed:
    1. Opus BLOCK — vectord BatchAdd intra-batch duplicates panic
       coder/hnsw's "node not added" length-invariant. Fixed with
       last-write-wins dedup inside BatchAdd before the pre-pass.
       Regression test TestBatchAdd_IntraBatchDedup added.
    2. Opus + Kimi convergent WARN — strings.Contains(err.Error(),
       "status 404") was brittle string-matching to detect cold-
       start playbook state. Fixed: ErrCorpusNotFound sentinel
       returned by searchCorpus on HTTP 404; fetchPlaybookHits
       uses errors.Is.
    3. Opus WARN — corpusingest.Run returned nil on total batch
       failure, masking broken pipelines as "empty corpora." Fixed:
       Stats.FailedBatches counter, ErrPartialFailure sentinel
       returned when nonzero. New regression test
       TestRun_NonzeroFailedBatchesReturnsError.
    4. Opus WARN — dead var _ = io.EOF in staffing_500k/main.go
       was justified by a fictional comment. Removed.

  Drivers (staffing_500k, staffing_candidates, staffing_workers)
  updated to handle ErrPartialFailure gracefully — print warn, keep
  running queries — rather than fatal'ing on transient hiccups
  while still surfacing the failure clearly in the output.

  Documented (no code change):
    - Opus WARN: matrixd /matrix/downgrade reads
      LH_FORCE_FULL_ENRICHMENT from process env when body omits
      it. Comment now explains the opinionated default and points
      callers wanting deterministic behavior to pass the field
      explicitly.

  False positives dismissed (caught and verified, NOT acted on):
    A. Kimi BLOCK on errors.Is + wrapped error in cmd/matrixd:223.
       Verified false: Search wraps with %w (fmt.Errorf("%w: %v",
       ErrEmbed, err)), so errors.Is matches the chain correctly.
    B. Kimi INFO "BatchAdd has no unit tests." Verified false:
       batch_bench_test.go has BenchmarkBatchAdd; the new dedup
       test TestBatchAdd_IntraBatchDedup adds another.
    C. Opus BLOCK on missing finite/zero-norm pre-validation in
       cmd/vectord:280-291. Verified false: line 272 already calls
       vectord.ValidateVector before BatchAdd, so finite + zero-
       norm IS checked. Pre-validation is exhaustive.
    D. Opus WARN on relevance.go tokenRe (Opus self-corrected
       mid-finding when realizing leading char counts toward token
       length).

  Qwen3-coder returned NO FINDINGS — known issue with very long
  diffs through the OpenRouter free tier; lineage rotation worked
  as designed (Opus + Kimi between them caught everything Qwen
  would have).

15-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade, playbook).
Unit tests all green (corpusingest +1, vectord +1).

Per feedback_cross_lineage_review.md: convergent finding #2 (404
detection) is the highest-signal one — both Opus and Kimi
flagged it independently. The other Opus findings stand on
single-reviewer signal but each one verified against the actual
code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:42:39 -05:00
root
06e71520c4 matrix: playbook memory + boost — SPEC §3.4 component 5 of 5 (LEARNING LOOP)
Closes SPEC §3.4. The matrix indexer is now a learning meta-index per
feedback_meta_index_vision.md — every successful (query → answer)
pair recorded via /matrix/playbooks/record boosts that answer for
future similar queries.

This is the architectural piece that lifts vectord from "static
hybrid search" to the meta-index J originally framed in Phase 19 of
the Rust system.

What's new:
  - internal/matrix/playbook.go — PlaybookEntry, PlaybookHit,
    ApplyPlaybookBoost. Pure-function boost math:
      distance' = distance * (1 - 0.5 * score)
    Score 0 = no boost (factor 1.0); score 1 = halve distance
    (factor 0.5). Capped at 0.5 deliberately so a single high-
    confidence playbook can't dominate the base ranking forever
    (runaway-feedback-loop guard).
  - Retriever.Record(entry, corpus) — embeds query_text, ensures
    playbook corpus exists (idempotent), upserts via deterministic
    sha256-derived ID (last score wins on re-record of same triple).
  - Retriever.Search extended with UsePlaybook + PlaybookCorpus +
    PlaybookTopK + PlaybookMaxDistance. Reuses the query vector —
    no extra embed call. Missing-corpus 404 = no-op (cold-start
    state before any Record call), not an error.
  - POST /v1/matrix/playbooks/record (matrixd) — caller submits
    {query_text, answer_id, answer_corpus, score, tags?}; gets
    {playbook_id} back.

Storage: a vectord index named "playbook_memory" (configurable per
request) with embed(query_text) as the vector and the
PlaybookEntry JSON as metadata. Just another corpus — observable
from /vectors/index, persistable through G1P, etc.

Match key for boost: (AnswerID, AnswerCorpus). Cross-corpus ID
collisions don't false-match — verified by
TestApplyPlaybookBoost_CorpusAttributionRespected.

End-to-end smoke (scripts/playbook_smoke.sh, all assertions PASS):
  - Baseline search: widget-c at distance 0.6566 (rank 3)
  - Record playbook: query → widget-c, score=1.0
  - Re-search with use_playbook=true:
      widget-c distance: 0.3283 (rank 2)
      ratio: 0.5 EXACTLY (matches boost math precisely)
      playbook_boosted: 1
  - widget-c jumped from #3 to #2 — learning loop visible

Tests:
  - 8 unit tests in internal/matrix/playbook_test.go covering
    Validate, BoostFactor (5 cases), the no-boost identity, the
    boost-moves-result-up scenario, highest-score wins on duplicate
    matches, cross-corpus attribution, JSON round-trip, and
    rejection of empty metadata
  - scripts/playbook_smoke.sh integration test (3 assertions PASS)

15-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade, playbook).

SPEC §3.4 NOW COMPLETE: 5 of 5 components shipped. The matrix
indexer's port is done as a substrate; remaining work is operational
(rating signal sources, telemetry, eventual structured filtering for
staffing data — none in §3.4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:34:24 -05:00
root
3968ec8a7b matrix: strong-model downgrade gate — SPEC §3.4 component 4 of 5
Pure-Go port of mode.rs::execute's pass5 downgrade gate (Rust
2026-04-26). Adds POST /v1/matrix/downgrade endpoint via matrixd.

The gate captures the pass5 finding: composing matrix corpora into
codereview_lakehouse on a strong model LOST 5/5 head-to-head reps
against matrix-free codereview_isolation on grok-4.1-fast (p=0.031).
Strong models have enough native capacity that bug fingerprints +
adversarial framing + file content carry them; matrix chunks
displace depth-of-analysis.

Logic (matches Rust mode.rs:614-632):
  if mode == codereview_lakehouse
     && !forced_mode
     && !LH_FORCE_FULL_ENRICHMENT
     && !is_weak_model(model)
  → flip to codereview_isolation, record downgraded_from

is_weak_model captures the empirical weak-list:
  - `:free` suffix or `:free/` infix (OpenRouter free tier)
  - qwen3.5:latest, qwen3:latest (local last-resort rungs)
  - everything else → strong by default

Tests:
  - 3 unit tests in internal/matrix/downgrade_test.go: IsWeakModel
    coverage, MaybeDowngrade truth table (5 rows), forced-mode
    precedence (forced beats every other bypass)
  - scripts/downgrade_smoke.sh: 6 assertions through gateway covering
    all 5 truth-table rows + empty-mode 400

14-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade).

SPEC §3.4 progress: 4 of 5 components shipped (corpus builders,
multi-corpus retrieve+merge, relevance filter, downgrade gate).
Last component is learning-loop integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:17:55 -05:00
root
9588bd82ae matrix: relevance filter — SPEC §3.4 component 3 of 5
Faithful port of mcp-server/relevance.ts (Rust observer's adjacency-
pollution filter). Same 5-signal scoring, same default threshold 0.3.
Adds POST /v1/matrix/relevance endpoint via matrixd.

Scoring signals (additive, can sign-flip):
  path_match     +1.0  chunk source/doc_id encodes focus.path
  filename_match +0.6  chunk text mentions focus's filename
  defined_match  +0.6  chunk text mentions focus.defined_symbols
  token_overlap  +0.4  jaccard of non-stopword tokens
  prefix_match   +0.3  chunk source shares first-2-segment prefix
  import_penalty -0.5  mentions ONLY imported symbols, no defined ones

What this does and doesn't do:
  - DOES filter code-aware corpora (eventually lakehouse_arch_v1,
    lakehouse_symbols_v1, scrum_findings_v1) — drops chunks about
    code the focus file IMPORTS rather than DEFINES, the
    "adjacency pollution" pattern that makes a reviewer LLM
    hallucinate imported-crate internals as belonging to the focus
  - DOES NOT meaningfully filter staffing data — the candidates
    reality test 2026-04-29 had "exact skill match buried at #3"
    which is a different problem (semantic-only ranking dominated
    by secondary text). Staffing needs structured filtering
    (status gates, location gates) that lives outside this
    package — future work, not in SPEC §3.4 yet

Headline smoke assertion: focus = crates/queryd/src/db.go which
defines Connector and imports catalogd::Registry. The filter
scores:
  Connector chunk: +0.68  (defined_match fires, kept)
  Registry chunk: -0.46  (import_only penalty fires, dropped)
  unrelated junk:  0.00  (no signals, dropped)

That's a 1.14-point gap between what we ARE and what we IMPORT —
the entire purpose of the filter.

Tests:
  - 9 unit tests in internal/matrix/relevance_test.go covering
    Tokenize, Jaccard, ExtractDefinedSymbols (Rust + TS),
    ExtractImportedSymbols, FilePrefix, ScoreRelevance per-signal,
    FilterChunks threshold splitting, and the headline
    AdjacencyPollutionScenario
  - scripts/relevance_smoke.sh integration smoke (3 assertions PASS):
    adjacency-pollution scenario, empty-chunks 400, threshold honored

13-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:13:22 -05:00
root
c1d96b7b60 matrixd: multi-corpus retrieve+merge — SPEC §3.4 component 2 of 5
Lands the matrix indexer's first piece per docs/SPEC.md §3.4:
multi-corpus retrieve+merge with corpus attribution per result.
Future components (relevance filter, downgrade gate, learning-loop
integration) layer on top of this surface.

Architecture:
  - internal/matrix/retrieve.go — Retriever takes (query, corpora,
    k, per_corpus_k), parallel-fans across vectord indexes, merges
    by distance ascending, preserves corpus origin per hit
  - cmd/matrixd — HTTP service on :3217, fronts /v1/matrix/*
  - gateway proxy + [matrixd] config + lakehouse.toml entry
  - Either query_text (matrix calls embedd) or query_vector
    (caller pre-embedded) — vector takes precedence if both set

Error policy: fail-loud on any corpus error. Silent partial returns
would lie about coverage, defeating the matrix's whole purpose.
Bubbles vectord errors as 502 (upstream), validation as 400.

Smoke (scripts/matrix_smoke.sh, 6 assertions PASS first try):
  - /matrix/corpora lists indexes
  - Multi-corpus search returns hits from BOTH corpora
  - Top hit is the globally-closest across all corpora
    (b-near beats a-near at distance 0.05 vs 0.1 — proves merge)
  - Metadata round-trips through the merge
  - Distances ascending in result list
  - Negative paths: empty corpora → 400, missing corpus → 502,
    no query → 400

12-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:39:17 -05:00