golangLAKEHOUSE

Author	SHA1	Message	Date
root	eb0dfdff04	vectord: v2 envelope + handleMerge robustness — actions post_role_gate_v1 scrum	2026-05-01 01:20:37 -05:00
    3-lineage scrum on 434f466..0d4f033 surfaced one convergent finding (Opus + Kimi) and 3 Opus-only real bugs. All actioned in this commit. Two false positives (Kimi rollback misreading, Opus stale-comment claim) verified + rejected — both required manual control-flow inspection to refute, matching the documented Kimi-truncation behavior in feedback_cross_lineage_review.md.

    Convergent fix — DecodeIndex lost nil-meta items:
    - Envelope version bumped 1 → 2.
    - New v2 field: IDs []string carries the canonical ID set explicitly, independent of meta map's nil-vs-{} sparseness.
    - DecodeIndex accepts both versions: v2 reads from env.IDs; v1 falls back to meta-key inference (with the documented limitation that nil-meta items are invisible — preserved for backward-compat with already-persisted indexes).
    - Encode emits v2 going forward.
    - 2 new regression tests:
      - TestEncodeDecode_NilMetaItemsSurviveRoundTrip: items added with nil metadata MUST survive Encode → Decode and remain visible to IDs(). Pre-fix would have yielded IDs() == [].
      - TestDecodeIndex_V1BackwardCompat: hand-crafted v1 envelope still decodes (proves the fallback path).

    Opus-only fixes:
    - handleMerge: non-ErrIndexNotFound errors at h.reg.Get(name) / h.reg.Get(req.Dest) now return 500 + log instead of falling through with nil src/dest pointers (which would panic on the next deref). Real bug — only the sentinel error was handled.
    - internal/drift/drift.go: mathLog wrapper removed; math.Log inlined. Wrapper added no value (math was already imported).
    - internal/distillation/audit_baseline.go: BuildAuditDriftTable's bubble sort replaced with sort.Slice. Idiomatic + shorter.

    Rejected after verification:
    - Kimi WARN "missing rollback on partial merge": misread the control flow. Code at cmd/vectord/main.go:404-414 does NOT delete from src when dest.Add fails (continue before reaching src.Delete). Only successful Adds trigger Deletes.
    - Opus INFO "TimestampUnixNano comment references missing field": field exists at scripts/multi_coord_stress/main.go:128. Opus saw only the diff context, not the full file.

    Deferred (no fired trigger):
    - Opus WARN "no per-index lock during merge": no concurrent merge callers today (operators run merge as deliberate one-shot job). Worth a lock if/when matrixd or chatd start auto-triggering.

    Disposition: reports/scrum/_evidence/2026-05-01/verdicts/post_role_gate_v1_disposition.md.

    Build + vet + tests green; 2 new regression tests + all prior tests unchanged.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
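The v1/v2 decode rule described in eb0dfdff04 can be sketched roughly as follows. This is a minimal illustration, not the repository's DecodeIndex: the `envelope` struct, its JSON field names, and the `decodeIDs` helper are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// envelope is a hypothetical stand-in for the persisted index envelope.
// The JSON field names here are assumptions, not the project's wire format.
type envelope struct {
	Version int                        `json:"version"`
	IDs     []string                   `json:"ids,omitempty"`
	Meta    map[string]json.RawMessage `json:"meta,omitempty"`
}

// decodeIDs recovers the ID set from an envelope.
// v2 trusts the explicit IDs list; v1 can only infer IDs from meta keys,
// so v1 items that were stored with nil metadata stay invisible.
func decodeIDs(raw []byte) ([]string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	var ids []string
	switch env.Version {
	case 2:
		ids = append(ids, env.IDs...)
	case 1:
		for id := range env.Meta {
			ids = append(ids, id)
		}
	default:
		return nil, fmt.Errorf("unsupported envelope version %d", env.Version)
	}
	sort.Strings(ids) // stable order for callers and tests
	return ids, nil
}

func main() {
	// "b" has no metadata: visible in v2, lost in v1.
	v2 := []byte(`{"version":2,"ids":["a","b"],"meta":{"a":{"k":1}}}`)
	v1 := []byte(`{"version":1,"meta":{"a":{"k":1}}}`)
	ids2, _ := decodeIDs(v2)
	ids1, _ := decodeIDs(v1)
	fmt.Println(ids2, ids1)
}
```

The sketch makes the documented v1 limitation concrete: an item without metadata has no meta key, so only the explicit v2 list can carry it.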
root	b216b7e5b6	fix the other 4: close all OPEN-list items in one wave	2026-04-30 23:42:11 -05:00
    Substantial wave addressing all 4 prior OPEN items. Three closed in full, one partially (the speculative half deliberately deferred).

    OPEN #1 — Periodic fresh→main index merge (FULL):
    - POST /v1/vectors/index/{src}/merge with {dest, clear_source}
    - Idempotent on re-runs (existing-in-dest items skipped)
    - internal/vectord/index.go: new Index.IDs() snapshot method + i.ids tracker field as canonical ID set, independent of meta map's nil-vs-{} sparseness (was a real bug — IDs() backed by meta alone missed items added with nil metadata)
    - 4 cmd-level integration tests (happy path drain+clear, dim mismatch, dest not found, self-merge rejection) + 1 unit test
    - DecodeIndex backward-compat: old envelopes restore i.ids from meta keys (best effort; new items going forward use the tracker)

    OPEN #2 — Distillation SFT export (SUBSTRATE):
    - internal/distillation/sft_export.go ports the load-bearing half: IsSftNever predicate + ListScoredRunFiles (data/scored-runs/YYYY/MM/DD walk) + LoadScoredRunsFromFile + partial ExportSft.
    - Synthesis (instruction/input/response generation) deferred to a separate wave — too big for this session, but the substrate makes the next wave a port-not-design exercise.
    - TestSftNever_PinsExpectedSet locks the contamination firewall set: if a future commit adds/removes from SftNever, this test fails — forcing the change through review.
    - 5 new tests; firewall fires end-to-end through the partial port.

    OPEN #3 — Distribution drift via PSI (FULL):
    - internal/drift/drift.go: ComputeDistributionDrift via Population Stability Index. Standard finance/risk metric, well-defined verdict tiers (stable < 0.10, minor 0.10–0.25, major ≥ 0.25).
    - Equal-width bucketing over combined min/max so neither dist falls outside; epsilon-clamping for empty buckets so log doesn't blow up. Per-bucket breakdown for drilldown.
    - Pairs with the existing ComputeScorerDrift: scorer drift is categorical, distribution drift is continuous. Different shapes, same package.
    - 7 new tests covering identical-is-stable, hard-shift-is-major, moderate-detected-not-stable, empty-inputs-safe, all-identical-safe, bucket-counts-conserved, num-buckets-clamping.

    OPEN #4 — Ops nice-to-haves (PARTIAL — wall-clock done, others deferred):
    - (a) Real-time wall-clock for stress harness: per-phase elapsed time logged to stdout as it runs (`[stress] phase NAME starting (T+12.3s)` + `[stress] phase NAME done — 8.5s (T+20.8s)`). Output.PhaseTimings + Output.TotalElapsedMs in JSON.
    - (b) chatd fixture-mode S3 mock + (c) liberal-paraphrase calibration: not actioned — no fired trigger, would be speculative. Documented as deferred-until-need rather than ignored. Per the project's discipline ("don't add features beyond what the task requires").

    OPEN list now empty / steady-state. Future items will land as production triggers fire.

    Build + vet + tests green; 18 new tests across the 4 closures.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
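For readers unfamiliar with the Population Stability Index named under OPEN #3, here is a minimal sketch using the standard definition, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over bucket fractions. It is not the code in internal/drift/drift.go: the function names, the epsilon value, and the choice to return 0 for empty or all-identical inputs are assumptions. Only the bucketing scheme and the verdict tiers come from the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// epsilon is an assumed clamp for empty buckets so math.Log stays finite.
const epsilon = 1e-6

// psi computes the Population Stability Index between two samples using
// equal-width buckets over their combined min/max.
func psi(baseline, current []float64, numBuckets int) float64 {
	if len(baseline) == 0 || len(current) == 0 || numBuckets < 1 {
		return 0 // assumption: degenerate inputs are reported as no drift
	}
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, sample := range [][]float64{baseline, current} {
		for _, v := range sample {
			lo = math.Min(lo, v)
			hi = math.Max(hi, v)
		}
	}
	if hi == lo {
		return 0 // every value identical: nothing to bucket
	}
	width := (hi - lo) / float64(numBuckets)

	// fractions returns the clamped share of xs falling in each bucket.
	fractions := func(xs []float64) []float64 {
		out := make([]float64, numBuckets)
		for _, v := range xs {
			b := int((v - lo) / width)
			if b >= numBuckets {
				b = numBuckets - 1 // the maximum lands in the last bucket
			}
			out[b]++
		}
		for i := range out {
			out[i] = math.Max(out[i]/float64(len(xs)), epsilon)
		}
		return out
	}

	e, a := fractions(baseline), fractions(current)
	total := 0.0
	for i := range e {
		total += (a[i] - e[i]) * math.Log(a[i]/e[i])
	}
	return total
}

// verdict maps a PSI value onto the tiers quoted in the commit message.
func verdict(p float64) string {
	switch {
	case p < 0.10:
		return "stable"
	case p < 0.25:
		return "minor"
	default:
		return "major"
	}
}

func main() {
	base := []float64{0, 0.1, 0.2, 0.3}
	shifted := []float64{9.7, 9.8, 9.9, 10}
	fmt.Println(verdict(psi(base, base, 10)), verdict(psi(base, shifted, 10)))
}
```

Bucketing over the combined range means no value from either sample can fall outside the buckets, and the clamp keeps an empty bucket from producing ln(0).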
root	a730fc2016	scrum fixes: 4 real findings landed, 4 false positives dismissed	2026-04-29 19:42:39 -05:00
    Cross-lineage scrum review on the 12 commits of this session (afbb506..06e7152) via Rust gateway :3100 with Opus + Kimi + Qwen3-coder. Results:

    Real findings landed:
    1. Opus BLOCK — vectord BatchAdd intra-batch duplicates panic coder/hnsw's "node not added" length-invariant. Fixed with last-write-wins dedup inside BatchAdd before the pre-pass. Regression test TestBatchAdd_IntraBatchDedup added.
    2. Opus + Kimi convergent WARN — strings.Contains(err.Error(), "status 404") was brittle string-matching to detect cold-start playbook state. Fixed: ErrCorpusNotFound sentinel returned by searchCorpus on HTTP 404; fetchPlaybookHits uses errors.Is.
    3. Opus WARN — corpusingest.Run returned nil on total batch failure, masking broken pipelines as "empty corpora." Fixed: Stats.FailedBatches counter, ErrPartialFailure sentinel returned when nonzero. New regression test TestRun_NonzeroFailedBatchesReturnsError.
    4. Opus WARN — dead var _ = io.EOF in staffing_500k/main.go was justified by a fictional comment. Removed.

    Drivers (staffing_500k, staffing_candidates, staffing_workers) updated to handle ErrPartialFailure gracefully — print warn, keep running queries — rather than fatal'ing on transient hiccups while still surfacing the failure clearly in the output.

    Documented (no code change):
    - Opus WARN: matrixd /matrix/downgrade reads LH_FORCE_FULL_ENRICHMENT from process env when body omits it. Comment now explains the opinionated default and points callers wanting deterministic behavior to pass the field explicitly.

    False positives dismissed (caught and verified, NOT acted on):
    A. Kimi BLOCK on errors.Is + wrapped error in cmd/matrixd:223. Verified false: Search wraps with %w (fmt.Errorf("%w: %v", ErrEmbed, err)), so errors.Is matches the chain correctly.
    B. Kimi INFO "BatchAdd has no unit tests." Verified false: batch_bench_test.go has BenchmarkBatchAdd; the new dedup test TestBatchAdd_IntraBatchDedup adds another.
    C. Opus BLOCK on missing finite/zero-norm pre-validation in cmd/vectord:280-291. Verified false: line 272 already calls vectord.ValidateVector before BatchAdd, so finite + zero-norm IS checked. Pre-validation is exhaustive.
    D. Opus WARN on relevance.go tokenRe (Opus self-corrected mid-finding when realizing leading char counts toward token length).

    Qwen3-coder returned NO FINDINGS — known issue with very long diffs through the OpenRouter free tier; lineage rotation worked as designed (Opus + Kimi between them caught everything Qwen would have).

    15-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap, pathway, matrix, relevance, downgrade, playbook). Unit tests all green (corpusingest +1, vectord +1).

    Per feedback_cross_lineage_review.md: convergent finding #2 (404 detection) is the highest-signal one — both Opus and Kimi flagged it independently. The other Opus findings stand on single-reviewer signal but each one verified against the actual code.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
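Finding #2 in a730fc2016 (sentinel error plus errors.Is instead of substring matching) follows a common Go pattern, sketched below. `ErrCorpusNotFound` is named in the commit; `classifyStatus` and `fetchHits` are hypothetical helpers that only model the HTTP status handling, not the real searchCorpus or fetchPlaybookHits.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrCorpusNotFound is the sentinel named in the commit message.
var ErrCorpusNotFound = errors.New("corpus not found")

// classifyStatus is a hypothetical helper: it turns an HTTP status into
// either the sentinel (404) or an ordinary error (any other failure).
func classifyStatus(status int, body string) error {
	switch {
	case status == http.StatusNotFound:
		return ErrCorpusNotFound
	case status >= 400:
		return fmt.Errorf("search failed: status %d: %s", status, body)
	default:
		return nil
	}
}

// fetchHits is a hypothetical caller. It wraps the error with %w so the
// sentinel stays reachable through the chain, then tests with errors.Is.
func fetchHits(status int, body string) (coldStart bool, err error) {
	if cerr := classifyStatus(status, body); cerr != nil {
		wrapped := fmt.Errorf("fetch playbook hits: %w", cerr)
		if errors.Is(wrapped, ErrCorpusNotFound) {
			return true, nil // cold start: no corpus yet, not a failure
		}
		return false, wrapped
	}
	return false, nil
}

func main() {
	cold, err := fetchHits(http.StatusNotFound, "")
	fmt.Println(cold, err)

	// A 500 whose body happens to mention "status 404" is still a failure;
	// a substring check on err.Error() would have misread it as cold start.
	cold, err = fetchHits(http.StatusInternalServerError, "upstream said status 404")
	fmt.Println(cold, err != nil)
}
```

The second call in `main` shows why the substring check was brittle: the error text contains "status 404" even though the response was a 500.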
root	f1c188323c	vectord: BatchAdd — single-lock variadic batch (Option A)	2026-04-29 18:05:48 -05:00
    Replaces the per-item Add loop in the HTTP handler with one call to Index.BatchAdd, which acquires the write-lock once and pushes the whole batch through coder/hnsw's variadic Graph.Add. Pre-validation stays in the handler so per-item error messages keep their item-index precision.

    Microbench (internal/vectord/batch_bench_test.go) at d=768 cosine:
      N=16     SingleAdd 283µs/op   →  BatchAdd 170µs/op   1.66×
      N=128    SingleAdd 7.9ms/op   →  BatchAdd 7.5ms/op   1.05×
      N=1024   SingleAdd 87.5ms/op  →  BatchAdd 83.4ms/op  1.05×

    Win is biggest at staffing-driver batch sizes (N=16) where per-call lock + validation overhead is a meaningful fraction. At larger N the inner HNSW neighborhood search per insert dominates, which is the load-bearing finding for Option B (sharded indexes): the throughput ceiling lives inside the library, not at the lock, so sharding to N parallel Graphs is the only path to true concurrent-Add throughput.

    g1, g1p, g2 smokes all PASS post-change.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
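The single-lock idea behind BatchAdd can be illustrated with a toy index. This is a sketch under loud assumptions: a plain map stands in for the coder/hnsw graph, the `item` and `index` types are invented for the example, and the last-write-wins dedup shown is the behavior the later scrum-fix commit a730fc2016 describes, not necessarily how this commit shipped.

```go
package main

import (
	"fmt"
	"sync"
)

// item and index are hypothetical; the map stands in for the HNSW graph.
type item struct {
	ID  string
	Vec []float32
}

type index struct {
	mu    sync.RWMutex
	items map[string][]float32
}

// Add takes the write lock once per item (the pre-change shape).
func (ix *index) Add(it item) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	ix.items[it.ID] = it.Vec
}

// BatchAdd takes the write lock once for the whole batch. Duplicate IDs
// inside the batch are collapsed so only the last occurrence is stored.
// It returns the number of distinct items written.
func (ix *index) BatchAdd(batch []item) int {
	// Record the final position of each ID before taking the lock.
	last := make(map[string]int, len(batch))
	for i, it := range batch {
		last[it.ID] = i
	}

	ix.mu.Lock()
	defer ix.mu.Unlock()
	written := 0
	for i, it := range batch {
		if last[it.ID] != i {
			continue // superseded by a later entry with the same ID
		}
		ix.items[it.ID] = it.Vec
		written++
	}
	return written
}

func main() {
	ix := &index{items: make(map[string][]float32)}
	n := ix.BatchAdd([]item{
		{ID: "w-1", Vec: []float32{1, 0}},
		{ID: "w-2", Vec: []float32{0, 1}},
		{ID: "w-1", Vec: []float32{0.5, 0.5}}, // replaces the first w-1
	})
	fmt.Println(n, ix.items["w-1"])
}
```

As the microbench in the commit suggests, this only removes per-call lock overhead; the cost of each insert inside the graph is unchanged.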
root	8b92518d21	G1P: vectord persistence to storaged + scrum (3 fixes incl. 3-way convergent)	2026-04-29 01:33:23 -05:00
    Adds optional persistence to vectord (G1's HNSW vector search). Single-file framed format per index — eliminates the torn-write class that the 3-way convergent scrum finding identified:

      _vectors/<name>.lhv1 — single binary blob:
        [4 bytes magic "LHV1"]
        [4 bytes envelope_len uint32 BE]
        [envelope bytes — JSON params + metadata + version]
        [graph bytes — raw hnsw.Graph.Export]

    Pre-extraction: internal/catalogd/store_client.go → internal/storeclient/ shared package, since both catalogd and vectord need it. Same pattern as the pre-D5 catalogclient extraction.

    Optional via [vectord].storaged_url config (empty = ephemeral mode). On startup: List + Load each persisted index. After Create / batch Add / DELETE: Save (or Delete from storaged). Save failures are logged-not-fatal — in-memory state is the source of truth in flight.

    Acceptance smoke G1P 8/8 PASS — kill+restart preserves state, post-restart search returns dist=0 (graph round-trips exactly), DELETE removes the file, post-delete restart shows count=0.

    All 8 smokes (D1-D6 + G1 + G1P) PASS deterministically. The g1_smoke gained scripts/g1_smoke.toml that disables persistence so the in-memory API test stays decoupled from any rehydrate-from-storaged state contamination.

    Cross-lineage scrum on shipped code:
    - Opus 4.7 (opencode): 1 BLOCK + 5 WARN + 3 INFO
    - Kimi K2-0905 (openrouter): 1 BLOCK + 2 WARN
    - Qwen3-coder (openrouter): 2 BLOCK + 2 WARN + 1 INFO

    Fixed (3 — 1 convergent + 2 single-reviewer):
    - C1 (Opus + Kimi + Qwen 3-WAY CONVERGENT WARN): Save was non-atomic across two PUTs — envelope-succeeds + graph-fails left a half-saved index that passed the "both present" List filter and silently mismatched metadata against vectors on Load. Fix: collapse to single framed file (no torn-write window possible).
    - O-B1 (Opus BLOCK): isNotFound substring-matched "key not found" against the wrapped error message — brittle, any 5xx body containing that text would silently misclassify as missing. Fix: errors.Is(err, storeclient.ErrKeyNotFound).
    - O-I3 (Opus INFO): handleAdd pre-validation only covered id+dim; NaN/Inf/zero-norm could still fail mid-batch leaving partial commits. Fix: extend pre-validation to call ValidateVector (newly exported) per item before any commit.

    Dismissed (3 false positives):
    - K-B1 + Q-B1 ("safeKey double-escapes %2F segments") — false convergent. Wire-protocol escape is decoded by storaged's chi router on the way in; on-disk key is the original literal. %2F round-trips correctly through PathEscape → URL → chi decode → S3 key.
    - Q-B2 ("List vulnerable to race conditions") — vectord is single-process; no concurrent Save against List in the same vectord.

    Deferred (3): rehydrate per-index timeout (G2+ multi-index scale), saveAfter request ctx (matches G0 timeout deferral), Encode RLock during slow writer (documented as buffer-only API).

    The C1 finding is the strongest signal of the cross-lineage filter: three independent reviewers all flagged the same torn-write hazard. Single-file framing eliminates the class — there's now no Persistor state where envelope and graph can disagree.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
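The LHV1 layout listed in 8b92518d21 (magic, big-endian envelope length, envelope bytes, graph bytes) can be read and written roughly like this. The `frame` and `unframe` names and the error messages are assumptions for illustration; only the byte layout is taken from the commit message.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// magic matches the 4-byte tag given in the commit message.
var magic = []byte("LHV1")

// frame packs an envelope and a graph blob into one buffer:
// [magic][uint32 BE envelope length][envelope][graph].
func frame(envelope, graph []byte) []byte {
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(envelope)))

	buf := make([]byte, 0, 8+len(envelope)+len(graph))
	buf = append(buf, magic...)
	buf = append(buf, lenBuf[:]...)
	buf = append(buf, envelope...)
	buf = append(buf, graph...)
	return buf
}

// unframe splits a framed blob back into envelope and graph bytes,
// rejecting blobs that are too short, mis-tagged, or truncated.
func unframe(blob []byte) (envelope, graph []byte, err error) {
	if len(blob) < 8 {
		return nil, nil, errors.New("blob shorter than 8-byte header")
	}
	if !bytes.Equal(blob[:4], magic) {
		return nil, nil, errors.New("bad magic")
	}
	n := binary.BigEndian.Uint32(blob[4:8])
	if uint64(n) > uint64(len(blob)-8) {
		return nil, nil, errors.New("envelope length exceeds blob size")
	}
	end := 8 + int(n)
	return blob[8:end], blob[end:], nil
}

func main() {
	blob := frame([]byte(`{"version":1}`), []byte{0xde, 0xad})
	env, graph, err := unframe(blob)
	fmt.Println(string(env), len(graph), err)
}
```

Because both parts travel in one object, a single PUT either lands both or neither, which is the property the C1 fix relies on.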
root	b8c072cf0b	G1: vectord — HNSW vector search via coder/hnsw · 6 scrum fixes applied	2026-04-29 00:50:28 -05:00
    First G1+ piece. Standalone vectord service with in-memory HNSW indexes keyed by string IDs and optional opaque JSON metadata. Wraps github.com/coder/hnsw v0.6.1 (pure Go, no cgo). New port :3215 with /v1/vectors/* routed through gateway.

    API:
      POST   /v1/vectors/index                 create
      GET    /v1/vectors/index                 list
      GET    /v1/vectors/index/{name}          get info
      DELETE /v1/vectors/index/{name}
      POST   /v1/vectors/index/{name}/add      (batch)
      POST   /v1/vectors/index/{name}/search

    Acceptance smoke 7/7 PASS — including recall=1 on inserted vector w-042 (cosine distance 5.96e-8, float32 precision noise), 200-vector batch round-trip, dim mismatch → 400, missing index → 404, duplicate create → 409.

    Two upstream library quirks worked around in the wrapper:
    1. coder/hnsw.Add panics with "node not added" on re-adding an existing key (length-invariant fires because internal delete+re-add doesn't change Len). Pre-Delete fixes for n>1.
    2. Delete of the LAST node leaves layers[0] non-empty but entryless; next Add SIGSEGVs in Dims(). Workaround: when re-adding to a 1-node graph, recreate the underlying graph fresh via resetGraphLocked().

    Cross-lineage scrum on shipped code:
    - Opus 4.7 (opencode): 0 BLOCK + 4 WARN + 3 INFO
    - Kimi K2-0905 (openrouter): 2 BLOCK + 2 WARN + 1 INFO
    - Qwen3-coder (openrouter): "No BLOCKs" (4 tokens)

    Fixed (4 real + 2 cleanup):
    - O-W1: Lookup returned the raw []float32 from coder/hnsw — caller mutation would corrupt index. Now copies before return.
    - O-W3: NaN/Inf vectors poison HNSW (distance comparisons return false for both < and >, breaking heap invariants). Zero-norm under cosine produces NaN. Now validated at Add time.
    - K-B1: Re-adding with nil metadata silently cleared the existing entry — JSON-omitted "metadata" field deserializes as nil, making upsert non-idempotent. Now nil = "leave alone"; explicit {} or Delete to clear.
    - O-W4: Batch Add with mid-batch failure left items 0..N-1 committed and item N rejected. Now pre-validates all IDs+dims before any Add.
    - O-I1: jsonItoa hand-roll replaced with strconv.Itoa — no measured allocation win.
    - O-I2: distanceFn re-resolved per Search → use stored i.g.Distance.

    Dismissed (2 false positives):
    - K-B2 "MaxBytesReader applied after full read" — false, applied BEFORE Decode in decodeJSON
    - K-W1 "Search distances under read lock might see invalidated slices from concurrent Add" — false, RWMutex serializes write-lock during Add against read-lock during Search

    Deferred (3): HTTP server timeouts (consistent G0 punt), Content-Type validation (internal service behind gateway), Lookup dim assertion (in-memory state can't drift).

    The K-B1 finding is worth pausing on: nil metadata on re-add is the kind of API ergonomics bug only a code-reading reviewer catches — smoke would never detect it because the smoke always sends explicit metadata. Three lines changed in Add; the resulting API matches what callers actually expect.

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
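The O-W3 and O-W4 fixes in b8c072cf0b combine into a validate-everything-first pattern, sketched below. The lowercase `validateVector` and `validateBatch` are hypothetical stand-ins, not the project's ValidateVector, and the sketch always rejects zero-norm vectors even though the commit only calls that out as a problem under cosine distance.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// validateVector rejects vectors that would poison distance comparisons:
// wrong dimension, NaN or Inf components, or zero norm (assumed cosine).
func validateVector(v []float32, dim int) error {
	if len(v) != dim {
		return fmt.Errorf("dim mismatch: got %d, want %d", len(v), dim)
	}
	var sumSq float64
	for i, x := range v {
		f := float64(x)
		if math.IsNaN(f) || math.IsInf(f, 0) {
			return fmt.Errorf("component %d is not finite", i)
		}
		sumSq += f * f
	}
	if sumSq == 0 {
		return errors.New("zero-norm vector")
	}
	return nil
}

// validateBatch checks every item before anything is committed, so a bad
// item N cannot leave items 0..N-1 already added. The item index is kept
// in the error so callers can report which entry failed.
func validateBatch(batch [][]float32, dim int) error {
	for i, v := range batch {
		if err := validateVector(v, dim); err != nil {
			return fmt.Errorf("item %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	batch := [][]float32{
		{1, 0, 0},
		{0, 0, 0}, // zero norm: the whole batch is rejected up front
	}
	fmt.Println(validateBatch(batch, 3))
}
```

Running the check as a separate pass costs one extra scan of the batch but keeps Add all-or-nothing from the caller's point of view.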

6 Commits