Closes the documented 500K-test gap (memory project_golang_lakehouse:
"storaged 256 MiB PUT cap blocks single-file LHV1 persistence above
~150K vectors at d=768"). Vectord persistence under "_vectors/" now
gets a 4 GiB cap; everything else (parquets, manifests, ingest)
keeps the 256 MiB default.
Why per-prefix and not "raise globally":
- 256 MiB cap is a real DoS protection — runaway clients can't
drain the daemon. Raising it for ALL traffic would expand the
attack surface for routine paths that have no need.
- Per-prefix preserves existing protection while opening the one
documented production-scale path.
Why not split LHV1 across multiple keys (the alternative):
- G1P shipped a single-Put framed format SPECIFICALLY to eliminate
the torn-write class (memory: "Single Put eliminates the torn-
write class that the 3-way convergent scrum finding identified").
- Multi-key LHV1 would re-introduce the half-saved-state failure
mode we just paid to fix. Streaming via existing manager.Uploader
is the better architectural answer.
Why not bump the cap operationally via env/config:
- Future operator-driven cap can drop in cleanly via the
maxPutBytesFor function. Started with hardcoded 4 GiB to keep
this commit small; config knob is a follow-up if production
workloads diverge from the documented 500K-vector ceiling.
manager.Uploader is already streaming-multipart on the outbound
S3 side; the inbound MaxBytesReader cap is a safety gate, not a
memory bottleneck. So raising it for vectord just lets the
existing streaming path actually flow, without introducing new
memory pressure (4-slot semaphore × 4 GiB worst case = 16 GiB
only if all slots simultaneously max out — vanishingly unlikely).
Implementation:
cmd/storaged/main.go:
new constant maxPutBytesVectors = 4 GiB (covers >700K vectors @ d=768)
new constant vectorsPrefix = "_vectors/" (synced with vectord.VectorPrefix)
new function maxPutBytesFor(key) → cap-by-prefix
handlePut: ContentLength check + MaxBytesReader use the per-key cap
cmd/storaged/main_test.go (3 new test funcs):
TestMaxPutBytesFor: 7 cases incl. nested prefix, substring-but-not-
prefix, empty key, parquet/manifest paths.
TestVectorPrefixSyncWithVectord: regression test that asserts
vectorsPrefix == vectord.VectorPrefix. A future rename surfaces
here instead of silently bypassing the larger cap.
TestVectorCapAccommodates500KStaffingTest: bounds the cap above
the documented production workload (~700 MiB conservative).
Verified:
go test ./cmd/storaged/ — all green (was 1 func, now 4)
just verify — 9 smokes still pass · 32s wall
just proof contract — 53/0/1 unchanged
Out of scope for this commit (deserves its own):
- Heavy integration smoke: 200K dim=768 synthetic vectors → ~700
MiB LHV1 → kill+restart vectord → recall=1. ~5-10 min wall;
follow-up if you want production-scale persistence verified
end-to-end. Unit tests + existing g1p_smoke cover the wiring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
97 lines
3.4 KiB
Go
97 lines
3.4 KiB
Go
package main
|
|
|
|
import (
|
|
"testing"
|
|
|
|
"git.agentview.dev/profit/golangLAKEHOUSE/internal/vectord"
|
|
)
|
|
|
|
func TestValidateKey(t *testing.T) {
|
|
cases := []struct {
|
|
name string
|
|
key string
|
|
wantErr bool
|
|
}{
|
|
{"happy path", "data/workers_500k.parquet", false},
|
|
{"deeply nested", "a/b/c/d/e/f/g.txt", false},
|
|
{"empty", "", true},
|
|
{"too long", string(make([]byte, 1025)), true},
|
|
{"NUL byte", "foo\x00bar", true},
|
|
{"leading slash", "/foo/bar", true},
|
|
{"dotdot component", "data/../etc/passwd", true},
|
|
{"dotdot at end", "data/..", true},
|
|
{"dotdot at start", "../etc", true},
|
|
{"single dot is fine", "data/./x.txt", false},
|
|
{"dotdot embedded is fine", "data/.. /x.txt", false}, // not a path component
|
|
{"newline", "foo\nbar", true},
|
|
{"carriage return", "foo\rbar", true},
|
|
{"tab", "foo\tbar", true},
|
|
{"unicode is fine", "résumé/data.txt", false},
|
|
}
|
|
for _, tc := range cases {
|
|
t.Run(tc.name, func(t *testing.T) {
|
|
err := validateKey(tc.key)
|
|
if tc.wantErr && err == nil {
|
|
t.Errorf("expected error for %q, got nil", tc.key)
|
|
}
|
|
if !tc.wantErr && err != nil {
|
|
t.Errorf("expected ok for %q, got %v", tc.key, err)
|
|
}
|
|
})
|
|
}
|
|
}
|
|
|
|
func TestMaxPutBytesFor(t *testing.T) {
|
|
// Vectord LHV1 persistence gets the larger cap so single-file
|
|
// indexes above ~150K vectors at d=768 (the 500K staffing test
|
|
// gap) can survive a Save without 413. Default cap stays at
|
|
// 256 MiB for everything else (parquets, manifests, etc.).
|
|
cases := []struct {
|
|
name string
|
|
key string
|
|
want int64
|
|
}{
|
|
{"vectord LHV1 file", "_vectors/workers_500k.lhv1", maxPutBytesVectors},
|
|
{"vectord nested", "_vectors/some/nested/index.lhv1", maxPutBytesVectors},
|
|
{"parquet under datasets/", "datasets/workers/abc.parquet", maxPutBytesDefault},
|
|
{"catalog manifest", "_catalog/manifests/foo.bin", maxPutBytesDefault},
|
|
{"plain key", "x.txt", maxPutBytesDefault},
|
|
{"empty key", "", maxPutBytesDefault},
|
|
{"key with vectors-ish substring (not prefix)", "datasets/_vectors/x", maxPutBytesDefault},
|
|
}
|
|
for _, tc := range cases {
|
|
t.Run(tc.name, func(t *testing.T) {
|
|
got := maxPutBytesFor(tc.key)
|
|
if got != tc.want {
|
|
t.Errorf("maxPutBytesFor(%q) = %d, want %d", tc.key, got, tc.want)
|
|
}
|
|
})
|
|
}
|
|
}
|
|
|
|
// TestVectorPrefixSyncWithVectord locks the prefix value to vectord's
|
|
// VectorPrefix constant. If a future refactor renames either side,
|
|
// this test surfaces the drift before vectord saves silently bypass
|
|
// the larger cap.
|
|
func TestVectorPrefixSyncWithVectord(t *testing.T) {
|
|
if vectorsPrefix != vectord.VectorPrefix {
|
|
t.Errorf("storaged vectorsPrefix=%q out of sync with vectord.VectorPrefix=%q",
|
|
vectorsPrefix, vectord.VectorPrefix)
|
|
}
|
|
}
|
|
|
|
// TestVectorCapAccommodates500KStaffingTest reads the documented gap
|
|
// (memory project_golang_lakehouse.md): "storaged 256 MiB PUT cap
|
|
// blocks single-file LHV1 persistence above ~150K vectors at d=768."
|
|
// 500K vectors at d=768 with HNSW graph overhead is approximately
|
|
// 4.5 GiB resident, ~700 MiB on disk after compression.
|
|
// 4 GiB cap covers more than the documented production workload.
|
|
func TestVectorCapAccommodates500KStaffingTest(t *testing.T) {
|
|
const fiveHundredKVectorsAtD768Disk = 700 << 20 // ~700 MiB conservative estimate
|
|
if maxPutBytesVectors < fiveHundredKVectorsAtD768Disk {
|
|
t.Errorf("maxPutBytesVectors=%d (%.0f MiB) below documented production "+
|
|
"workload of ~700 MiB for 500K vectors at d=768",
|
|
maxPutBytesVectors, float64(maxPutBytesVectors)/(1<<20))
|
|
}
|
|
}
|