root 8b92518d21 G1P: vectord persistence to storaged + scrum (3 fixes incl. 3-way convergent)
Adds optional persistence to vectord (G1's HNSW vector search). Single-
file framed format per index — eliminates the torn-write class that
the 3-way convergent scrum finding identified:

  _vectors/<name>.lhv1  — single binary blob:
      [4 bytes magic "LHV1"]
      [4 bytes envelope_len uint32 BE]
      [envelope bytes — JSON params + metadata + version]
      [graph bytes — raw hnsw.Graph.Export]

Pre-extraction: internal/catalogd/store_client.go → internal/storeclient/
shared package, since both catalogd and vectord need it. Same pattern as
the pre-D5 catalogclient extraction.

Optional via [vectord].storaged_url config (empty = ephemeral mode).
On startup: List + Load each persisted index. After Create / batch Add /
DELETE: Save (or Delete from storaged). Save failures are logged-not-
fatal — in-memory state is the source of truth in flight.

Acceptance smoke G1P 8/8 PASS — kill+restart preserves state, post-
restart search returns dist=0 (graph round-trips exactly), DELETE
removes the file, post-delete restart shows count=0.

All 8 smokes (D1-D6 + G1 + G1P) PASS deterministically. The g1_smoke
gained a config, scripts/g1_smoke.toml, that disables persistence so
the in-memory API test cannot be contaminated by state rehydrated
from storaged.

Cross-lineage scrum on shipped code:
  - Opus 4.7 (opencode):                     1 BLOCK + 5 WARN + 3 INFO
  - Kimi K2-0905 (openrouter):               1 BLOCK + 2 WARN
  - Qwen3-coder (openrouter):                2 BLOCK + 2 WARN + 1 INFO

Fixed (3 — 1 convergent + 2 single-reviewer):
  C1 (Opus + Kimi + Qwen 3-WAY CONVERGENT WARN): Save was non-atomic
    across two PUTs — envelope-succeeds + graph-fails left a half-
    saved index that passed the "both present" List filter and
    silently mismatched metadata against vectors on Load.
    Fix: collapse to single framed file (no torn-write window
    possible).
  O-B1 (Opus BLOCK): isNotFound substring-matched "key not found"
    against the wrapped error message. That is brittle: any 5xx body
    containing that text would be silently misclassified as missing.
    Fix: errors.Is(err, storeclient.ErrKeyNotFound).
  O-I3 (Opus INFO): handleAdd pre-validation only covered id+dim;
    NaN/Inf/zero-norm could still fail mid-batch leaving partial
    commits. Fix: extend pre-validation to call ValidateVector
    (newly exported) per item before any commit.

Dismissed (3 false positives):
  K-B1 + Q-B1 ("safeKey double-escapes %2F segments") — false
    convergent. Wire-protocol escape is decoded by storaged's chi
    router on the way in; on-disk key is the original literal.
    %2F round-trips correctly through PathEscape → URL → chi decode
    → S3 key.
  Q-B2 ("List vulnerable to race conditions") — vectord is single-
    process; no concurrent Save against List in the same vectord.
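
The K-B1 + Q-B1 dismissal can be checked with the standard library
alone. The sketch below assumes the router decodes a path segment
exactly once, and uses url.PathUnescape as a stand-in for that decode;
roundTrip is a hypothetical helper, not safeKey itself.

```go
package main

import (
	"fmt"
	"net/url"
)

// roundTrip escapes a key as a single URL path segment, then decodes it
// once, standing in for what a router does on the way in.
func roundTrip(key string) (wire, decoded string, err error) {
	wire = url.PathEscape(key)
	decoded, err = url.PathUnescape(wire)
	return wire, decoded, err
}

func main() {
	// The second key already contains a literal "%2F".
	for _, key := range []string{"_vectors/docs.lhv1", "odd%2Fname"} {
		wire, decoded, err := roundTrip(key)
		fmt.Println(wire, decoded == key, err)
	}
}
```

A literal "%2F" in the key goes over the wire as "%252F" and decodes
back to "%2F", so the stored key matches the original.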

Deferred (3): rehydrate per-index timeout (G2+ multi-index scale),
saveAfter request ctx (matches G0 timeout deferral), Encode RLock
during slow writer (documented as buffer-only API).

The C1 finding is the strongest signal of the cross-lineage filter:
three independent reviewers all flagged the same torn-write hazard.
Single-file framing eliminates the class — there's now no Persistor
state where envelope and graph can disagree.
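
The difference between the two layouts can be modelled with a store
whose Put starts failing partway through. This is an illustrative
sketch: flakyStore, the ".envelope" and ".graph" key names, and the
unframed concatenation are hypothetical, and it assumes a failed Put
leaves the previous object untouched.

```go
package main

import (
	"errors"
	"fmt"
)

// flakyStore is a hypothetical in-memory store whose Put starts failing
// after a fixed number of successful writes.
type flakyStore struct {
	objects  map[string][]byte
	putsLeft int
}

func (s *flakyStore) Put(key string, body []byte) error {
	if s.putsLeft == 0 {
		return errors.New("put failed")
	}
	s.putsLeft--
	s.objects[key] = body
	return nil
}

// saveTwoFiles models the old layout: envelope and graph as separate objects.
func saveTwoFiles(s *flakyStore, name string, env, graph []byte) error {
	if err := s.Put(name+".envelope", env); err != nil {
		return err
	}
	return s.Put(name+".graph", graph)
}

// saveOneFile models the new layout: one object holding both parts
// (framing omitted to keep the sketch short).
func saveOneFile(s *flakyStore, name string, env, graph []byte) error {
	return s.Put(name+".lhv1", append(append([]byte{}, env...), graph...))
}

func main() {
	// Two-file layout: the second save fails after the envelope is written.
	two := &flakyStore{objects: map[string][]byte{}, putsLeft: 3}
	_ = saveTwoFiles(two, "docs", []byte("env-v1"), []byte("graph-v1"))
	err := saveTwoFiles(two, "docs", []byte("env-v2"), []byte("graph-v2"))
	fmt.Println(err, string(two.objects["docs.envelope"]), string(two.objects["docs.graph"]))
	// prints: put failed env-v2 graph-v1

	// One-file layout: the failed save leaves the previous blob intact.
	one := &flakyStore{objects: map[string][]byte{}, putsLeft: 1}
	_ = saveOneFile(one, "docs", []byte("env-v1"), []byte("graph-v1"))
	err = saveOneFile(one, "docs", []byte("env-v2"), []byte("graph-v2"))
	fmt.Println(err, string(one.objects["docs.lhv1"]))
	// prints: put failed env-v1graph-v1
}
```

In the two-file run both objects are present but belong to different
versions, which is the state a "both present" filter cannot detect.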

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:33:23 -05:00


// persistor.go drives Index ↔ storaged round-trips for vectord.
// Each index is stored as a single framed file. This is the
// post-G1P-scrum C1 fix: the prior two-file format had a torn-write
// class where a successful envelope PUT followed by a failed graph
// PUT would pass the "both present" List filter and load mismatched
// state.
//
//	_vectors/<name>.lhv1 (single binary blob, framed):
//	  [4 bytes magic "LHV1"]
//	  [4 bytes envelope_len uint32 big-endian]
//	  [envelope bytes: JSON envelope from Index.Encode]
//	  [graph bytes: rest of file, raw hnsw.Graph.Export]
//
// One Put per index means no torn-write window. One object per index
// means List has no half-saved orphans to filter.
package vectord

import (
	"bytes"
	"context"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"sort"
	"strings"

	"git.agentview.dev/profit/golangLAKEHOUSE/internal/storeclient"
)

// VectorPrefix is the storaged key prefix that holds vector index
// state. Objects under this prefix are NOT user vectors (those
// live in the index in memory); this is the persistence sidecar.
const VectorPrefix = "_vectors/"

// fileSuffix is the extension for the framed combined file.
// The version is carried in the suffix and the magic so future
// formats can land alongside without rewriting existing files.
const fileSuffix = ".lhv1"

// magic is the 4-byte file header: "LHV1" (Lakehouse-go Vectord v1).
var magic = [4]byte{'L', 'H', 'V', '1'}

// Store is the subset of the storeclient API that persistor needs.
// Defined as an interface so unit tests can inject a fake without
// spinning up real storaged.
type Store interface {
	Put(ctx context.Context, key string, body []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string) ([]string, error)
}

// ErrKeyMissing is returned by Load when the persisted file is
// absent. With the single-file format (post-G1P-scrum C1), partial
// persistence isn't a concern: a torn write would leave a file with
// bad framing, which surfaces as a parse error, not a silent miss.
var ErrKeyMissing = errors.New("persistor: index file missing")

// ErrBadFormat is returned by Load when the persisted bytes don't
// match the expected magic or framing. Possible causes: bit-rot,
// version skew (a future format we don't speak), a partial PUT
// from before storaged grew transactional writes, or an operator
// edit.
var ErrBadFormat = errors.New("persistor: bad file format")

// Persistor wires Index Encode/Decode to a Store.
type Persistor struct {
	store Store
}

// NewPersistor returns a Persistor backed by s.
func NewPersistor(s Store) *Persistor {
	return &Persistor{store: s}
}

// Save encodes idx into a single framed blob and writes it.
// Atomic at the storaged layer: one Put → no torn-write hazard.
func (p *Persistor) Save(ctx context.Context, idx *Index) error {
	var envBuf, graphBuf bytes.Buffer
	if err := idx.Encode(&envBuf, &graphBuf); err != nil {
		return fmt.Errorf("encode %q: %w", idx.params.Name, err)
	}
	body := frame(envBuf.Bytes(), graphBuf.Bytes())
	if err := p.store.Put(ctx, fileKey(idx.params.Name), body); err != nil {
		return fmt.Errorf("put: %w", err)
	}
	return nil
}

// Load reconstructs an Index from the persisted single-file blob.
// ErrKeyMissing if the file is absent; ErrBadFormat on framing
// mismatch (corruption / version skew / operator edit).
func (p *Persistor) Load(ctx context.Context, name string) (*Index, error) {
	body, err := p.store.Get(ctx, fileKey(name))
	if err != nil {
		// Per scrum O-B1 (Opus): use errors.Is on the typed
		// sentinel, not substring matching. The prior substring
		// approach silently misclassified any 5xx body that
		// happened to contain "key not found" as missing.
		if errors.Is(err, storeclient.ErrKeyNotFound) {
			return nil, fmt.Errorf("%w: %q", ErrKeyMissing, name)
		}
		return nil, fmt.Errorf("get %q: %w", name, err)
	}
	envBytes, graphBytes, err := unframe(body)
	if err != nil {
		return nil, fmt.Errorf("unframe %q: %w", name, err)
	}
	idx, err := DecodeIndex(bytes.NewReader(envBytes), bytes.NewReader(graphBytes))
	if err != nil {
		return nil, fmt.Errorf("decode %q: %w", name, err)
	}
	return idx, nil
}

// Delete removes the persisted file for an index. Idempotent on
// missing: storaged DELETE is itself idempotent.
func (p *Persistor) Delete(ctx context.Context, name string) error {
	if err := p.store.Delete(ctx, fileKey(name)); err != nil {
		return fmt.Errorf("delete %q: %w", name, err)
	}
	return nil
}

// List returns persisted index names, sorted. The single-file format
// means no half-saved-orphan filtering is needed: bad framing on Load
// surfaces as ErrBadFormat per index, which the rehydrate caller
// can log and skip.
func (p *Persistor) List(ctx context.Context) ([]string, error) {
	keys, err := p.store.List(ctx, VectorPrefix)
	if err != nil {
		return nil, fmt.Errorf("list: %w", err)
	}
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		if !strings.HasPrefix(k, VectorPrefix) || !strings.HasSuffix(k, fileSuffix) {
			continue
		}
		name := strings.TrimSuffix(strings.TrimPrefix(k, VectorPrefix), fileSuffix)
		if name != "" {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out, nil
}

// fileKey maps an index name to its storaged object key.
func fileKey(name string) string { return VectorPrefix + name + fileSuffix }

// frame produces the on-disk byte layout:
//
//	[4 bytes magic "LHV1"]
//	[4 bytes envelope_len uint32 big-endian]
//	[envelope bytes]
//	[graph bytes: rest of file]
func frame(envBytes, graphBytes []byte) []byte {
	out := make([]byte, 0, 8+len(envBytes)+len(graphBytes))
	out = append(out, magic[:]...)
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(envBytes)))
	out = append(out, lenBuf[:]...)
	out = append(out, envBytes...)
	out = append(out, graphBytes...)
	return out
}

// unframe reverses frame. Returns ErrBadFormat on any mismatch.
func unframe(body []byte) (envBytes, graphBytes []byte, err error) {
	if len(body) < 8 {
		return nil, nil, fmt.Errorf("%w: body too short (%d bytes)", ErrBadFormat, len(body))
	}
	if !bytes.Equal(body[:4], magic[:]) {
		return nil, nil, fmt.Errorf("%w: bad magic %q", ErrBadFormat, body[:4])
	}
	envLen := binary.BigEndian.Uint32(body[4:8])
	// Compare in uint64 so a huge envelope_len cannot overflow int on
	// 32-bit platforms and slip past the bounds check.
	if uint64(envLen) > uint64(len(body)-8) {
		return nil, nil, fmt.Errorf("%w: envelope_len=%d exceeds body=%d", ErrBadFormat, envLen, len(body))
	}
	end := 8 + int(envLen)
	envBytes = body[8:end]
	graphBytes = body[end:]
	return envBytes, graphBytes, nil
}

// (io import retained for future Save/Load that streams; current
// path is buffer-based.)
var _ = io.Discard