root 8b92518d21 G1P: vectord persistence to storaged + scrum (3 fixes incl. 3-way convergent)
Adds optional persistence to vectord (G1's HNSW vector search). Single-
file framed format per index — eliminates the torn-write class that
the 3-way convergent scrum finding identified:

  _vectors/<name>.lhv1  — single binary blob:
      [4 bytes magic "LHV1"]
      [4 bytes envelope_len uint32 BE]
      [envelope bytes — JSON params + metadata + version]
      [graph bytes — raw hnsw.Graph.Export]
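
A minimal sketch of the framing above, assuming the layout is exactly magic, a big-endian uint32 length, the envelope, then the graph bytes to end of blob. The names `frame`, `unframe`, and `errBadFrame` are hypothetical and are not taken from the shipped Persistor:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// magic is the 4-byte tag from the layout above.
var magic = []byte("LHV1")

// errBadFrame is a hypothetical sentinel used only by this sketch.
var errBadFrame = errors.New("lhv1: malformed frame")

// frame lays out magic, big-endian envelope length, envelope, then graph.
func frame(envelope, graph []byte) []byte {
	var hdr [8]byte
	copy(hdr[:4], magic)
	binary.BigEndian.PutUint32(hdr[4:], uint32(len(envelope)))
	out := make([]byte, 0, len(hdr)+len(envelope)+len(graph))
	out = append(out, hdr[:]...)
	out = append(out, envelope...)
	return append(out, graph...)
}

// unframe splits a blob back into its envelope and graph sections.
func unframe(blob []byte) (envelope, graph []byte, err error) {
	if len(blob) < 8 || !bytes.Equal(blob[:4], magic) {
		return nil, nil, fmt.Errorf("%w: short header or bad magic", errBadFrame)
	}
	n := binary.BigEndian.Uint32(blob[4:8])
	if uint64(n) > uint64(len(blob)-8) {
		return nil, nil, fmt.Errorf("%w: envelope_len %d exceeds payload", errBadFrame, n)
	}
	end := 8 + int(n)
	return blob[8:end], blob[end:], nil
}

func main() {
	blob := frame([]byte(`{"version":1}`), []byte{0xde, 0xad})
	env, graph, err := unframe(blob)
	fmt.Println(string(env), len(graph), err)
}
```

Because the envelope length travels inside the same object as both payloads, a reader either sees a complete frame or rejects the blob; there is no second key that can be missing or stale.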

Pre-extraction: internal/catalogd/store_client.go → internal/storeclient/
shared package, since both catalogd and vectord need it. Same pattern as
the pre-D5 catalogclient extraction.

Optional via [vectord].storaged_url config (empty = ephemeral mode).
On startup: List + Load each persisted index. After Create / batch Add /
DELETE: Save (or Delete from storaged). Save failures are logged-not-
fatal — in-memory state is the source of truth in flight.

Acceptance smoke G1P 8/8 PASS — kill+restart preserves state, post-
restart search returns dist=0 (graph round-trips exactly), DELETE
removes the file, post-delete restart shows count=0.

All 8 smokes (D1-D6 + G1 + G1P) PASS deterministically. The g1_smoke
gained scripts/g1_smoke.toml that disables persistence so the
in-memory API test stays decoupled from any rehydrate-from-storaged
state contamination.

Cross-lineage scrum on shipped code:
  - Opus 4.7 (opencode):                     1 BLOCK + 5 WARN + 3 INFO
  - Kimi K2-0905 (openrouter):               1 BLOCK + 2 WARN
  - Qwen3-coder (openrouter):                2 BLOCK + 2 WARN + 1 INFO

Fixed (3 — 1 convergent + 2 single-reviewer):
  C1 (Opus + Kimi + Qwen 3-WAY CONVERGENT WARN): Save was non-atomic
    across two PUTs — envelope-succeeds + graph-fails left a half-
    saved index that passed the "both present" List filter and
    silently mismatched metadata against vectors on Load.
    Fix: collapse to single framed file (no torn-write window
    possible).
  O-B1 (Opus BLOCK): isNotFound substring-matched "key not found"
    against the wrapped error message — brittle, any 5xx body
    containing that text would silently misclassify as missing.
    Fix: errors.Is(err, storeclient.ErrKeyNotFound).
  O-I3 (Opus INFO): handleAdd pre-validation only covered id+dim;
    NaN/Inf/zero-norm could still fail mid-batch leaving partial
    commits. Fix: extend pre-validation to call ValidateVector
    (newly exported) per item before any commit.
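
To illustrate the O-B1 fix above, here is a small sketch contrasting substring matching with `errors.Is`. `ErrKeyNotFound` is a local stand-in for `storeclient.ErrKeyNotFound`, and the error strings are invented for the example:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrKeyNotFound stands in for storeclient.ErrKeyNotFound (assumed sentinel).
var ErrKeyNotFound = errors.New("storeclient: key not found")

// isNotFoundBySubstring is the brittle form: it matches any message
// that happens to contain the phrase.
func isNotFoundBySubstring(err error) bool {
	return err != nil && strings.Contains(err.Error(), "key not found")
}

// isNotFound walks the wrap chain and matches only the sentinel.
func isNotFound(err error) bool {
	return errors.Is(err, ErrKeyNotFound)
}

// sampleErrors builds one genuinely-missing error and one server
// error whose body merely mentions the phrase (both hypothetical).
func sampleErrors() (missing, serverErr error) {
	missing = fmt.Errorf("load _vectors/demo.lhv1: %w", ErrKeyNotFound)
	serverErr = errors.New(`storaged: 500: backend reported "key not found" during compaction`)
	return missing, serverErr
}

func main() {
	missing, serverErr := sampleErrors()
	fmt.Println("substring:", isNotFoundBySubstring(missing), isNotFoundBySubstring(serverErr))
	fmt.Println("errors.Is:", isNotFound(missing), isNotFound(serverErr))
}
```

The substring check reports both errors as "missing"; the sentinel check reports only the wrapped `ErrKeyNotFound`.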

Dismissed (3 false positives):
  K-B1 + Q-B1 ("safeKey double-escapes %2F segments") — false
    convergent. Wire-protocol escape is decoded by storaged's chi
    router on the way in; on-disk key is the original literal.
    %2F round-trips correctly through PathEscape → URL → chi decode
    → S3 key.
  Q-B2 ("List vulnerable to race conditions") — vectord is single-
    process; no concurrent Save against List in the same vectord.
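
The K-B1 + Q-B1 dismissal can be checked with the standard library alone. In this sketch `url.PathUnescape` is only a stand-in for the single decode the message attributes to storaged's router, and `wireRoundTrip` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"net/url"
)

// wireRoundTrip escapes a key for use as one URL path segment, then
// decodes it once, as a router would on the way in.
func wireRoundTrip(key string) (wire, decoded string, err error) {
	wire = url.PathEscape(key)
	decoded, err = url.PathUnescape(wire)
	return wire, decoded, err
}

func main() {
	// One key with a literal slash, one with a literal "%2F".
	for _, key := range []string{"idx/a", "idx%2Fa"} {
		wire, decoded, err := wireRoundTrip(key)
		fmt.Printf("key=%q wire=%q decoded=%q err=%v\n", key, wire, decoded, err)
	}
}
```

A key that already contains a literal `%2F` is escaped to `%252F` on the wire and comes back as `%2F` after one decode, so the stored key equals the original as long as exactly one decode happens.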

Deferred (3): rehydrate per-index timeout (G2+ multi-index scale),
saveAfter request ctx (matches G0 timeout deferral), Encode RLock
during slow writer (documented as buffer-only API).

The C1 finding is the strongest signal of the cross-lineage filter:
three independent reviewers all flagged the same torn-write hazard.
Single-file framing eliminates the class — there's now no Persistor
state where envelope and graph can disagree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:33:23 -05:00


// Package vectord owns the vector-search surface: HNSW indexes
// keyed by string IDs with optional opaque JSON metadata. The
// underlying library is github.com/coder/hnsw (pure Go, no cgo).
//
// The index itself lives in memory. Persistence to storaged and
// rehydration across restart (G1P) are layered on top through
// Encode and DecodeIndex; keeping the storaged client out of this
// file makes the index API easier to test and keeps the storaged
// dependency optional for downstream tooling.
package vectord

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"math"
	"sync"

	"github.com/coder/hnsw"
)

// Distance names accepted by IndexParams.Distance.
const (
	DistanceCosine    = "cosine"
	DistanceEuclidean = "euclidean"
)

// Default HNSW parameters. These match coder/hnsw's NewGraph
// defaults, which are tuned for OpenAI-shaped embeddings (1536-d,
// but the hyperparameters generalize).
const (
	DefaultM        = 16
	DefaultEfSearch = 20
)

// IndexParams describes one vector index. Once an Index is built,
// these are fixed; changing M / dimension / distance requires a
// rebuild.
type IndexParams struct {
	Name      string `json:"name"`
	Dimension int    `json:"dimension"`
	M         int    `json:"m"`
	EfSearch  int    `json:"ef_search"`
	Distance  string `json:"distance"`
}

// Result is one search hit. Distance semantics depend on the
// configured distance function: for cosine it's `1 - cos(a,b)`;
// for euclidean it's the L2 norm of `a - b`. Either way, smaller
// = closer and the result list is sorted ascending.
type Result struct {
	ID       string          `json:"id"`
	Distance float32         `json:"distance"`
	Metadata json.RawMessage `json:"metadata,omitempty"`
}

// Index wraps a coder/hnsw graph plus a side map of opaque JSON
// metadata per ID. Concurrency: read-heavy via Search (read-lock);
// Add and Delete take the write lock.
type Index struct {
	params IndexParams
	g      *hnsw.Graph[string]
	meta   map[string]json.RawMessage
	mu     sync.RWMutex
}

// Errors surfaced to HTTP handlers. Sentinel-based so the wire
// layer can map to status codes via errors.Is.
var (
	ErrDimensionMismatch = errors.New("vectord: vector dimension mismatch")
	ErrUnknownDistance   = errors.New("vectord: unknown distance function")
	ErrInvalidParams     = errors.New("vectord: invalid index params")
)

// NewIndex builds a fresh index from validated params.
func NewIndex(p IndexParams) (*Index, error) {
	if p.Name == "" {
		return nil, fmt.Errorf("%w: empty name", ErrInvalidParams)
	}
	if p.Dimension <= 0 {
		return nil, fmt.Errorf("%w: dimension must be > 0 (got %d)", ErrInvalidParams, p.Dimension)
	}
	if p.M <= 0 {
		p.M = DefaultM
	}
	if p.EfSearch <= 0 {
		p.EfSearch = DefaultEfSearch
	}
	if p.Distance == "" {
		p.Distance = DistanceCosine
	}
	dist, err := distanceFn(p.Distance)
	if err != nil {
		return nil, err
	}
	g := hnsw.NewGraph[string]()
	g.M = p.M
	g.EfSearch = p.EfSearch
	g.Distance = dist
	// Ml stays at the library default (0.25); exposing it as a knob
	// is a G2 concern when we have real tuning data.
	return &Index{
		params: p,
		g:      g,
		meta:   make(map[string]json.RawMessage),
	}, nil
}

// distanceFn maps the string name to the underlying function.
// Easier to unit-test than calling out to coder/hnsw's registry.
func distanceFn(name string) (hnsw.DistanceFunc, error) {
	switch name {
	case DistanceCosine, "":
		return hnsw.CosineDistance, nil
	case DistanceEuclidean:
		return hnsw.EuclideanDistance, nil
	}
	return nil, fmt.Errorf("%w: %q (want cosine or euclidean)", ErrUnknownDistance, name)
}

// Params returns a copy of the immutable index params.
func (i *Index) Params() IndexParams { return i.params }

// Len returns the number of vectors currently in the index.
func (i *Index) Len() int {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.g.Len()
}

// Add inserts a vector with optional metadata, with replace
// semantics for the vector: if id already exists, the prior
// vector is removed first. Dim must match the index dim or
// ErrDimensionMismatch is returned.
//
// Metadata semantics (post-scrum K-B1): nil meta is "leave
// existing alone" (upsert-style); to clear metadata, pass an
// empty `{}` or Delete+Add. This avoids silent metadata loss
// when the JSON `metadata` field is omitted on re-add.
//
// Validates that all vector components are finite (post-scrum
// O-W3). NaN/Inf in any component poisons HNSW: distance
// comparisons return false for both `<` and `>`, breaking the
// search heap invariants. Zero-norm vectors are also rejected
// under cosine distance, because cos(0,x) = NaN.
//
// Note: coder/hnsw's Graph.Add panics on re-adding an existing
// key (internal "node not added" length-invariant check). We
// pre-Delete to make Add idempotent on re-insert.
func (i *Index) Add(id string, vec []float32, meta json.RawMessage) error {
	if id == "" {
		return errors.New("vectord: empty id")
	}
	if len(vec) != i.params.Dimension {
		return fmt.Errorf("%w: index dim=%d, got=%d", ErrDimensionMismatch, i.params.Dimension, len(vec))
	}
	if err := validateVector(vec, i.params.Distance); err != nil {
		return err
	}
	i.mu.Lock()
	defer i.mu.Unlock()
	// coder/hnsw has two sharp edges on re-add:
	//  1. Add of an existing key panics with "node not added"
	//     (length-invariant fires because internal delete+re-add
	//     doesn't change Len). Pre-Delete fixes this for n>1.
	//  2. Delete of the LAST node leaves layers[0] non-empty but
	//     entryless; the next Add SIGSEGVs in Dims() because
	//     entry().Value is nil. We rebuild the graph in that case.
	_, exists := i.g.Lookup(id)
	if exists {
		if i.g.Len() == 1 {
			i.resetGraphLocked()
		} else {
			i.g.Delete(id)
		}
	}
	i.g.Add(hnsw.MakeNode(id, vec))
	if meta != nil {
		// Per scrum K-B1 (Kimi): only OVERWRITE on explicit non-nil.
		// nil = "leave existing meta alone" (upsert). To clear, the
		// caller should send an empty `{}` body or Delete the id.
		i.meta[id] = meta
	}
	return nil
}

// resetGraphLocked recreates the underlying coder/hnsw Graph with
// the same params. Caller MUST hold i.mu (write-lock). Used to
// dodge the library's "delete the last node, then segfault on
// next Add" bug (see Add for details). The metadata map is left
// untouched: the only entry involved is the one being re-added,
// and Add either overwrites it (non-nil meta) or keeps it on
// purpose (nil meta, per K-B1).
func (i *Index) resetGraphLocked() {
	g := hnsw.NewGraph[string]()
	g.M = i.params.M
	g.EfSearch = i.params.EfSearch
	g.Distance = i.g.Distance
	i.g = g
}

// ValidateVector is the exported form of validateVector: the HTTP
// handler pre-validates batches before committing, so it needs the
// same predicate Add uses internally. Per scrum O-I3 (G1P).
func ValidateVector(vec []float32, distance string) error {
	return validateVector(vec, distance)
}

// validateVector rejects vectors that would poison the HNSW
// graph or produce NaN distances. Per scrum O-W3 (Opus, G1).
func validateVector(vec []float32, distance string) error {
	var sumSq float64
	for j, v := range vec {
		f := float64(v)
		if math.IsNaN(f) || math.IsInf(f, 0) {
			return fmt.Errorf("vectord: vec[%d] is non-finite (got %v)", j, v)
		}
		sumSq += f * f
	}
	// distanceFn treats "" as cosine, so the zero-norm check does
	// too; otherwise an exported caller passing an unnormalized
	// distance name would skip it.
	if (distance == DistanceCosine || distance == "") && sumSq == 0 {
		return errors.New("vectord: zero-norm vector under cosine distance")
	}
	return nil
}

// Delete removes id from the index. Returns true if present.
func (i *Index) Delete(id string) bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	delete(i.meta, id)
	return i.g.Delete(id)
}

// Search returns the k nearest neighbors of query, sorted
// ascending by distance.
//
// Note: coder/hnsw's Search returns `[]Node[K]` without distances:
// they're computed internally in the search candidate heap but
// dropped from the public API. We recompute distance from the
// returned vectors. O(k·dim) per search, negligible at typical
// k=10 / dim<2048.
func (i *Index) Search(query []float32, k int) ([]Result, error) {
	if len(query) != i.params.Dimension {
		return nil, fmt.Errorf("%w: index dim=%d, got=%d", ErrDimensionMismatch, i.params.Dimension, len(query))
	}
	if k <= 0 {
		return nil, errors.New("vectord: k must be > 0")
	}
	i.mu.RLock()
	defer i.mu.RUnlock()
	// Per scrum O-I2 (Opus): use the stored distance function
	// directly rather than re-resolving the string name on every
	// search. The graph's Distance is set once at NewIndex.
	dist := i.g.Distance
	hits := i.g.Search(query, k)
	out := make([]Result, len(hits))
	for j, n := range hits {
		out[j] = Result{
			ID:       n.Key,
			Distance: dist(query, n.Value),
			Metadata: i.meta[n.Key],
		}
	}
	return out, nil
}

// IndexEnvelope is the JSON shape persisted alongside the binary
// HNSW graph bytes. params + metadata + format version travel
// together; the graph itself is opaque binary that round-trips
// through hnsw.Graph.Export / Import.
type IndexEnvelope struct {
	Version  int                        `json:"version"`
	Params   IndexParams                `json:"params"`
	Metadata map[string]json.RawMessage `json:"metadata"`
}

// envelopeVersion bumps when the on-disk JSON shape changes
// incompatibly. Reading a future version returns ErrVersionMismatch
// rather than producing a half-decoded index.
const envelopeVersion = 1

// ErrVersionMismatch is returned by DecodeIndex when the envelope
// claims a version this build doesn't understand.
var ErrVersionMismatch = errors.New("vectord: unknown envelope version")

// Encode writes the index's JSON envelope (params + metadata) and
// the binary HNSW graph bytes to two writers. As of G1P the
// persistor frames both streams into a single LHV1 blob under one
// storaged key, so the two-writer shape is an in-process detail
// rather than an on-disk layout.
//
// envelopeW receives params+metadata as JSON; graphW receives the
// raw output of hnsw.Graph.Export. Both should be in-memory
// buffers: the read lock is held for the whole call, and a slow
// writer would block Add and Delete.
func (i *Index) Encode(envelopeW, graphW io.Writer) error {
	i.mu.RLock()
	defer i.mu.RUnlock()
	env := IndexEnvelope{
		Version:  envelopeVersion,
		Params:   i.params,
		Metadata: i.meta,
	}
	if err := json.NewEncoder(envelopeW).Encode(env); err != nil {
		return fmt.Errorf("encode envelope: %w", err)
	}
	if err := i.g.Export(graphW); err != nil {
		return fmt.Errorf("export graph: %w", err)
	}
	return nil
}

// DecodeIndex reconstructs an Index from a previously-Encoded pair
// of streams. The returned Index is independent: closing either
// reader after this call doesn't affect the result.
func DecodeIndex(envelopeR, graphR io.Reader) (*Index, error) {
	var env IndexEnvelope
	if err := json.NewDecoder(envelopeR).Decode(&env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Version != envelopeVersion {
		return nil, fmt.Errorf("%w: have %d, got %d",
			ErrVersionMismatch, envelopeVersion, env.Version)
	}
	idx, err := NewIndex(env.Params)
	if err != nil {
		return nil, err
	}
	if err := idx.g.Import(graphR); err != nil {
		return nil, fmt.Errorf("import graph: %w", err)
	}
	if env.Metadata != nil {
		idx.meta = env.Metadata
	}
	return idx, nil
}

// Lookup returns the stored vector + metadata for an id.
//
// Per scrum O-W1 (Opus): the vector is COPIED before return.
// coder/hnsw's Lookup hands back the underlying graph slice;
// caller mutation would corrupt the index without locking.
func (i *Index) Lookup(id string) (vec []float32, meta json.RawMessage, ok bool) {
	i.mu.RLock()
	defer i.mu.RUnlock()
	v, found := i.g.Lookup(id)
	if !found {
		return nil, nil, false
	}
	out := make([]float32, len(v))
	copy(out, v)
	return out, i.meta[id], true
}
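
As a usage sketch for the exported ValidateVector (scrum O-I3), the following shows the validate-everything-then-commit shape the handler is described as using. It is a stand-alone approximation: `item`, `addBatch`, and the map-backed `store` are hypothetical, and `checkVector` mirrors validateVector for the cosine case rather than calling into this package:

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// item is a hypothetical batch element (id + vector).
type item struct {
	ID     string
	Vector []float32
}

// checkVector mirrors validateVector for cosine distance: reject
// non-finite components and zero-norm vectors.
func checkVector(vec []float32) error {
	var sumSq float64
	for j, v := range vec {
		f := float64(v)
		if math.IsNaN(f) || math.IsInf(f, 0) {
			return fmt.Errorf("vec[%d] is non-finite (got %v)", j, v)
		}
		sumSq += f * f
	}
	if sumSq == 0 {
		return errors.New("zero-norm vector under cosine distance")
	}
	return nil
}

// addBatch validates every item first and only then commits, so a
// bad item late in the batch cannot leave earlier items half-applied.
func addBatch(store map[string][]float32, dim int, items []item) error {
	for n, it := range items {
		if it.ID == "" {
			return fmt.Errorf("item %d: empty id", n)
		}
		if len(it.Vector) != dim {
			return fmt.Errorf("item %d: dim=%d, want %d", n, len(it.Vector), dim)
		}
		if err := checkVector(it.Vector); err != nil {
			return fmt.Errorf("item %d: %w", n, err)
		}
	}
	for _, it := range items {
		store[it.ID] = it.Vector
	}
	return nil
}

func main() {
	store := map[string][]float32{}
	err := addBatch(store, 2, []item{
		{ID: "a", Vector: []float32{1, 0}},
		{ID: "b", Vector: []float32{0, 0}}, // zero-norm: rejects the whole batch
	})
	fmt.Println(err, len(store))
}
```

The two-pass loop costs one extra scan of the batch and in exchange gives all-or-nothing behavior without needing rollback.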