root 8b92518d21 G1P: vectord persistence to storaged + scrum (3 fixes incl. 3-way convergent)
Adds optional persistence to vectord (G1's HNSW vector search). Single-
file framed format per index — eliminates the torn-write class that
the 3-way convergent scrum finding identified:

  _vectors/<name>.lhv1  — single binary blob:
      [4 bytes magic "LHV1"]
      [4 bytes envelope_len uint32 BE]
      [envelope bytes — JSON params + metadata + version]
      [graph bytes — raw hnsw.Graph.Export]

Pre-extraction: internal/catalogd/store_client.go → internal/storeclient/
shared package, since both catalogd and vectord need it. Same pattern as
the pre-D5 catalogclient extraction.

Optional via [vectord].storaged_url config (empty = ephemeral mode).
On startup: List + Load each persisted index. After Create / batch Add /
DELETE: Save (or Delete from storaged). Save failures are logged-not-
fatal — in-memory state is the source of truth in flight.

Acceptance smoke G1P 8/8 PASS — kill+restart preserves state, post-
restart search returns dist=0 (graph round-trips exactly), DELETE
removes the file, post-delete restart shows count=0.

All 8 smokes (D1-D6 + G1 + G1P) PASS deterministically. The g1_smoke
gained scripts/g1_smoke.toml that disables persistence so the
in-memory API test stays decoupled from any rehydrate-from-storaged
state contamination.

Cross-lineage scrum on shipped code:
  - Opus 4.7 (opencode):                     1 BLOCK + 5 WARN + 3 INFO
  - Kimi K2-0905 (openrouter):               1 BLOCK + 2 WARN
  - Qwen3-coder (openrouter):                2 BLOCK + 2 WARN + 1 INFO

Fixed (3 — 1 convergent + 2 single-reviewer):
  C1 (Opus + Kimi + Qwen 3-WAY CONVERGENT WARN): Save was non-atomic
    across two PUTs — envelope-succeeds + graph-fails left a half-
    saved index that passed the "both present" List filter and
    silently mismatched metadata against vectors on Load.
    Fix: collapse to single framed file (no torn-write window
    possible).
  O-B1 (Opus BLOCK): isNotFound substring-matched "key not found"
    against the wrapped error message — brittle, any 5xx body
    containing that text would silently misclassify as missing.
    Fix: errors.Is(err, storeclient.ErrKeyNotFound).
  O-I3 (Opus INFO): handleAdd pre-validation only covered id+dim;
    NaN/Inf/zero-norm could still fail mid-batch leaving partial
    commits. Fix: extend pre-validation to call ValidateVector
    (newly exported) per item before any commit.

Dismissed (3 false positives):
  K-B1 + Q-B1 ("safeKey double-escapes %2F segments") — false
    convergent. Wire-protocol escape is decoded by storaged's chi
    router on the way in; on-disk key is the original literal.
    %2F round-trips correctly through PathEscape → URL → chi decode
    → S3 key.
  Q-B2 ("List vulnerable to race conditions") — vectord is single-
    process; no concurrent Save against List in the same vectord.

Deferred (3): rehydrate per-index timeout (G2+ multi-index scale),
saveAfter request ctx (matches G0 timeout deferral), Encode RLock
during slow writer (documented as buffer-only API).

The C1 finding is the strongest signal of the cross-lineage filter:
three independent reviewers all flagged the same torn-write hazard.
Single-file framing eliminates the class — there's now no Persistor
state where envelope and graph can disagree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:33:23 -05:00

184 lines
6.4 KiB
Go
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

// Package shared also provides the TOML config loader. Per ADR
// equivalent of Rust ADR-006 (TOML config over env vars), every
// service reads `lakehouse.toml` with sane defaults and env
// overrides. Config is hot-reload-unaware in G0; reload-on-SIGHUP
// is a G1+ concern.
package shared
import (
"errors"
"fmt"
"io/fs"
"log/slog"
"os"
"github.com/pelletier/go-toml/v2"
)
// Config is the unified Lakehouse config. Each service reads only
// the section it cares about, but they all share the same file so
// operators have one place to look.
type Config struct {
Gateway GatewayConfig `toml:"gateway"`
Storaged ServiceConfig `toml:"storaged"`
Catalogd CatalogConfig `toml:"catalogd"`
Ingestd IngestConfig `toml:"ingestd"`
Queryd QuerydConfig `toml:"queryd"`
Vectord VectordConfig `toml:"vectord"`
S3 S3Config `toml:"s3"`
Log LogConfig `toml:"log"`
}
// IngestConfig adds ingestd-specific knobs. ingestd needs to PUT
// parquet to storaged AND register manifests with catalogd, so it
// holds two upstream URLs in addition to its own bind.
//
// MaxIngestBytes caps the multipart body size. CSVs are typically
// 4-6× larger than the resulting Snappy-compressed Parquet, so 256
// MiB CSV → ~50 MiB Parquet — well under storaged's 256 MiB PUT
// cap. Real-scale validation (2026-04-29) showed 500K workers ×
// 18 cols = 344 MiB CSV → 71 MiB Parquet; bumping this knob to
// 512 MiB is the documented path for that workload.
type IngestConfig struct {
Bind string `toml:"bind"`
StoragedURL string `toml:"storaged_url"`
CatalogdURL string `toml:"catalogd_url"`
MaxIngestBytes int64 `toml:"max_ingest_bytes"`
}
// GatewayConfig adds the upstream URLs the reverse proxy fronts.
// Each route family (/v1/storage, /v1/catalog, /v1/ingest, /v1/sql,
// /v1/vectors) has its own upstream so we can scale services
// independently or move them to different boxes without touching
// gateway code.
type GatewayConfig struct {
Bind string `toml:"bind"`
StoragedURL string `toml:"storaged_url"`
CatalogdURL string `toml:"catalogd_url"`
IngestdURL string `toml:"ingestd_url"`
QuerydURL string `toml:"queryd_url"`
VectordURL string `toml:"vectord_url"`
}
// VectordConfig adds vectord-specific knobs. StoragedURL is
// optional — empty string disables persistence, useful for ephemeral
// dev or test runs. When set, indexes Save after every state change
// and Load on startup.
type VectordConfig struct {
Bind string `toml:"bind"`
StoragedURL string `toml:"storaged_url"`
}
// QuerydConfig adds queryd-specific knobs. queryd talks DuckDB
// directly to MinIO via DuckDB's httpfs extension (so no storaged
// URL needed), and reads the catalog over HTTP for view registration.
// SecretsPath defaults to /etc/lakehouse/secrets-go.toml — the same
// file storaged uses, since both services need the S3 credentials.
type QuerydConfig struct {
Bind string `toml:"bind"`
CatalogdURL string `toml:"catalogd_url"`
SecretsPath string `toml:"secrets_path"`
RefreshEvery string `toml:"refresh_every"` // duration string, e.g. "30s"
}
// CatalogConfig adds catalogd-specific knobs on top of the standard
// bind. StoragedURL points at the storaged service for manifest
// persistence; G0 defaults to the localhost bind.
type CatalogConfig struct {
Bind string `toml:"bind"`
StoragedURL string `toml:"storaged_url"`
}
// ServiceConfig is the per-binary bind config. Default Bind ""
// means "use the service's hardcoded G0 default" — see DefaultConfig.
type ServiceConfig struct {
Bind string `toml:"bind"`
}
// S3Config holds S3-compatible storage settings. Endpoint blank →
// AWS default. Bucket "" → "lakehouse-primary".
type S3Config struct {
Endpoint string `toml:"endpoint"`
Region string `toml:"region"`
Bucket string `toml:"bucket"`
AccessKeyID string `toml:"access_key_id"`
SecretAccessKey string `toml:"secret_access_key"`
UsePathStyle bool `toml:"use_path_style"`
}
// LogConfig — slog level for now; structured fields land G1+.
type LogConfig struct {
Level string `toml:"level"`
}
// DefaultConfig returns the G0 dev defaults. Ports are shifted to
// 3110+ to coexist with the live Rust lakehouse on 3100/3201-3204
// during the migration. G5 cutover flips gateway back to 3100.
func DefaultConfig() Config {
return Config{
Gateway: GatewayConfig{
Bind: "127.0.0.1:3110",
StoragedURL: "http://127.0.0.1:3211",
CatalogdURL: "http://127.0.0.1:3212",
IngestdURL: "http://127.0.0.1:3213",
QuerydURL: "http://127.0.0.1:3214",
VectordURL: "http://127.0.0.1:3215",
},
Storaged: ServiceConfig{Bind: "127.0.0.1:3211"},
Catalogd: CatalogConfig{Bind: "127.0.0.1:3212", StoragedURL: "http://127.0.0.1:3211"},
Ingestd: IngestConfig{
Bind: "127.0.0.1:3213",
StoragedURL: "http://127.0.0.1:3211",
CatalogdURL: "http://127.0.0.1:3212",
MaxIngestBytes: 256 << 20, // 256 MiB; bump per deployment via lakehouse.toml
},
Vectord: VectordConfig{
Bind: "127.0.0.1:3215",
StoragedURL: "http://127.0.0.1:3211",
},
Queryd: QuerydConfig{
Bind: "127.0.0.1:3214",
CatalogdURL: "http://127.0.0.1:3212",
SecretsPath: "/etc/lakehouse/secrets-go.toml",
RefreshEvery: "30s",
},
S3: S3Config{
Endpoint: "http://localhost:9000",
Region: "us-east-1",
Bucket: "lakehouse-primary",
UsePathStyle: true,
},
Log: LogConfig{Level: "info"},
}
}
// LoadConfig reads `lakehouse.toml` from path; if path is empty or
// the file doesn't exist, returns DefaultConfig. Any decode error is
// fatal (we don't want a misconfigured service silently falling back
// to defaults — that's the kind of bug you find at 2am).
//
// Per Opus + Qwen WARN #3: when path WAS given but the file is
// missing, log a warning so silent default-fallback doesn't hide
// misconfiguration. Empty path is fine (caller didn't ask for a
// file); non-empty + missing is suspicious.
func LoadConfig(path string) (Config, error) {
cfg := DefaultConfig()
if path == "" {
return cfg, nil
}
b, err := os.ReadFile(path)
if errors.Is(err, fs.ErrNotExist) {
slog.Warn("config file not found, using defaults",
"path", path,
"hint", "create the file or pass -config /path/to/lakehouse.toml")
return cfg, nil
}
if err != nil {
return cfg, fmt.Errorf("read config: %w", err)
}
if err := toml.Unmarshal(b, &cfg); err != nil {
return cfg, fmt.Errorf("parse config: %w", err)
}
return cfg, nil
}