Cross-lineage scrum review on the 12 commits of this session
(afbb506..06e7152) via Rust gateway :3100 with Opus + Kimi +
Qwen3-coder. Results:
Real findings landed:
1. Opus BLOCK — intra-batch duplicate IDs in vectord BatchAdd trip
coder/hnsw's "node not added" length-invariant panic. Fixed with
last-write-wins dedup inside BatchAdd before the pre-pass.
Regression test TestBatchAdd_IntraBatchDedup added.
2. Opus + Kimi convergent WARN — strings.Contains(err.Error(),
"status 404") was brittle string-matching to detect cold-
start playbook state. Fixed: ErrCorpusNotFound sentinel
returned by searchCorpus on HTTP 404; fetchPlaybookHits
uses errors.Is.
3. Opus WARN — corpusingest.Run returned nil on total batch
failure, masking broken pipelines as "empty corpora." Fixed:
Stats.FailedBatches counter, ErrPartialFailure sentinel
returned when nonzero. New regression test
TestRun_NonzeroFailedBatchesReturnsError.
4. Opus WARN — dead var _ = io.EOF in staffing_500k/main.go
was justified by a fictional comment. Removed.
Drivers (staffing_500k, staffing_candidates, staffing_workers)
updated to handle ErrPartialFailure gracefully: print a warning and
keep running queries rather than fatal'ing on transient hiccups,
while still surfacing the failure clearly in the output.
Documented (no code change):
- Opus WARN: matrixd /matrix/downgrade reads
LH_FORCE_FULL_ENRICHMENT from process env when body omits
it. Comment now explains the opinionated default and points
callers wanting deterministic behavior to pass the field
explicitly.
False positives dismissed (caught and verified, NOT acted on):
A. Kimi BLOCK on errors.Is + wrapped error in cmd/matrixd:223.
Verified false: Search wraps with %w (fmt.Errorf("%w: %v",
ErrEmbed, err)), so errors.Is matches the chain correctly.
B. Kimi INFO "BatchAdd has no unit tests." Verified false:
batch_bench_test.go has BenchmarkBatchAdd; the new dedup
test TestBatchAdd_IntraBatchDedup adds another.
C. Opus BLOCK on missing finite/zero-norm pre-validation in
cmd/vectord:280-291. Verified false: line 272 already calls
vectord.ValidateVector before BatchAdd, so finite + zero-
norm IS checked. Pre-validation is exhaustive.
D. Opus WARN on relevance.go tokenRe (Opus self-corrected
mid-finding after realizing the leading char counts toward token
length).
Qwen3-coder returned NO FINDINGS — known issue with very long
diffs through the OpenRouter free tier; lineage rotation worked
as designed (Opus + Kimi between them caught everything Qwen
would have).
15-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix, relevance, downgrade, playbook).
Unit tests all green (corpusingest +1, vectord +1).
Per feedback_cross_lineage_review.md: convergent finding #2 (404
detection) is the highest-signal one — both Opus and Kimi flagged
it independently. The other Opus findings stand on single-reviewer
signal, but each was verified against the actual code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the per-item Add loop in the HTTP handler with one call to
Index.BatchAdd, which acquires the write-lock once and pushes the
whole batch through coder/hnsw's variadic Graph.Add. Pre-validation
stays in the handler so per-item error messages keep their item-index
precision.
Microbench (internal/vectord/batch_bench_test.go) at d=768 cosine:
N=16 SingleAdd 283µs/op → BatchAdd 170µs/op 1.66×
N=128 SingleAdd 7.9ms/op → BatchAdd 7.5ms/op 1.05×
N=1024 SingleAdd 87.5ms/op → BatchAdd 83.4ms/op 1.05×
The win is biggest at staffing-driver batch sizes (N=16), where
per-call lock + validation overhead is a meaningful fraction. At
larger N the inner HNSW neighborhood search per insert dominates,
which is the load-bearing finding for Option B (sharded indexes):
the throughput ceiling lives inside the library, not at the lock,
so sharding to N parallel Graphs is the only path to true
concurrent-Add throughput.
g1, g1p, g2 smokes all PASS post-change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional persistence to vectord (G1's HNSW vector search). Single-
file framed format per index — eliminates the torn-write class that
the 3-way convergent scrum finding identified:
_vectors/<name>.lhv1 — single binary blob:
[4 bytes magic "LHV1"]
[4 bytes envelope_len uint32 BE]
[envelope bytes — JSON params + metadata + version]
[graph bytes — raw hnsw.Graph.Export]
Pre-extraction: internal/catalogd/store_client.go → internal/storeclient/
shared package, since both catalogd and vectord need it. Same pattern as
the pre-D5 catalogclient extraction.
Optional via [vectord].storaged_url config (empty = ephemeral mode).
On startup: List + Load each persisted index. After Create / batch Add /
DELETE: Save (or Delete from storaged). Save failures are logged-not-
fatal — in-memory state is the source of truth in flight.
Acceptance smoke G1P 8/8 PASS — kill+restart preserves state, post-
restart search returns dist=0 (graph round-trips exactly), DELETE
removes the file, post-delete restart shows count=0.
All 8 smokes (D1-D6 + G1 + G1P) PASS deterministically. The g1_smoke
gained scripts/g1_smoke.toml that disables persistence so the
in-memory API test stays decoupled from any rehydrate-from-storaged
state contamination.
Cross-lineage scrum on shipped code:
- Opus 4.7 (opencode): 1 BLOCK + 5 WARN + 3 INFO
- Kimi K2-0905 (openrouter): 1 BLOCK + 2 WARN
- Qwen3-coder (openrouter): 2 BLOCK + 2 WARN + 1 INFO
Fixed (3 — 1 convergent + 2 single-reviewer):
C1 (Opus + Kimi + Qwen 3-WAY CONVERGENT WARN): Save was non-atomic
across two PUTs — envelope-succeeds + graph-fails left a half-
saved index that passed the "both present" List filter and
silently mismatched metadata against vectors on Load.
Fix: collapse to single framed file (no torn-write window
possible).
O-B1 (Opus BLOCK): isNotFound substring-matched "key not found"
against the wrapped error message — brittle: any 5xx body
containing that text would be silently misclassified as missing.
Fix: errors.Is(err, storeclient.ErrKeyNotFound).
O-I3 (Opus INFO): handleAdd pre-validation only covered id+dim;
NaN/Inf/zero-norm could still fail mid-batch leaving partial
commits. Fix: extend pre-validation to call ValidateVector
(newly exported) per item before any commit.
Dismissed (3 false positives):
K-B1 + Q-B1 ("safeKey double-escapes %2F segments") — false
convergent. Wire-protocol escape is decoded by storaged's chi
router on the way in; on-disk key is the original literal.
%2F round-trips correctly through PathEscape → URL → chi decode
→ S3 key.
Q-B2 ("List vulnerable to race conditions") — vectord is single-
process; no concurrent Save against List in the same vectord.
Deferred (3): rehydrate per-index timeout (G2+ multi-index scale),
saveAfter request ctx (matches G0 timeout deferral), Encode RLock
during slow writer (documented as buffer-only API).
The C1 finding is the strongest signal of the cross-lineage filter:
three independent reviewers all flagged the same torn-write hazard.
Single-file framing eliminates the class — there's now no Persistor
state where envelope and graph can disagree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First G1+ piece. Standalone vectord service with in-memory HNSW
indexes keyed by string IDs and optional opaque JSON metadata.
Wraps github.com/coder/hnsw v0.6.1 (pure Go, no cgo). New port
:3215 with /v1/vectors/* routed through gateway.
API:
POST /v1/vectors/index create
GET /v1/vectors/index list
GET /v1/vectors/index/{name} get info
DELETE /v1/vectors/index/{name}
POST /v1/vectors/index/{name}/add (batch)
POST /v1/vectors/index/{name}/search
Acceptance smoke 7/7 PASS — including recall=1 on inserted vector
w-042 (cosine distance 5.96e-8, float32 precision noise), 200-
vector batch round-trip, dim mismatch → 400, missing index → 404,
duplicate create → 409.
Two upstream library quirks worked around in the wrapper:
1. coder/hnsw.Add panics with "node not added" on re-adding an
existing key (the length-invariant fires because the internal
delete+re-add doesn't change Len). Workaround: Delete before
re-Add when n>1.
2. Delete of the LAST node leaves layers[0] non-empty but
entryless; next Add SIGSEGVs in Dims(). Workaround: when
re-adding to a 1-node graph, recreate the underlying graph
fresh via resetGraphLocked().
Cross-lineage scrum on shipped code:
- Opus 4.7 (opencode): 0 BLOCK + 4 WARN + 3 INFO
- Kimi K2-0905 (openrouter): 2 BLOCK + 2 WARN + 1 INFO
- Qwen3-coder (openrouter): "No BLOCKs" (4 tokens)
Fixed (4 real + 2 cleanup):
O-W1: Lookup returned the raw []float32 from coder/hnsw — caller
mutation would corrupt index. Now copies before return.
O-W3: NaN/Inf vectors poison HNSW (distance comparisons return
false for both < and >, breaking heap invariants). Zero-norm
under cosine produces NaN. Now validated at Add time.
K-B1: Re-adding with nil metadata silently cleared the existing
entry — JSON-omitted "metadata" field deserializes as nil,
making upsert non-idempotent. Now nil = "leave alone"; explicit
{} or Delete to clear.
O-W4: Batch Add with mid-batch failure left items 0..N-1
committed and item N rejected. Now pre-validates all IDs+dims
before any Add.
O-I1: jsonItoa hand-roll replaced with strconv.Itoa — no
measured allocation win.
O-I2: distanceFn re-resolved per Search → use stored i.g.Distance.
Dismissed (2 false positives):
K-B2 "MaxBytesReader applied after full read" — false, applied
BEFORE Decode in decodeJSON
K-W1 "Search distances under read lock might see invalidated
slices from concurrent Add" — false, RWMutex serializes
write-lock during Add against read-lock during Search
Deferred (3): HTTP server timeouts (consistent G0 punt),
Content-Type validation (internal service behind gateway), Lookup
dim assertion (in-memory state can't drift).
The K-B1 finding is worth pausing on: nil metadata on re-add is
the kind of API ergonomics bug only a code-reading reviewer
catches — smoke would never detect it because the smoke always
sends explicit metadata. Three lines changed in Add; the resulting
API matches what callers actually expect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>