golangLAKEHOUSE/reports/cutover/multitier_100k.md
root 89ca72d471 materializer + replay ports + vectord substrate fix verified at scale
Two threads landing together — the doc edits interleave so they ship
in a single commit.

1. **vectord substrate fix verified at original scale** (closes the
   2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
   scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
   Throughput dropped 1,115 → 438/sec because previously-broken
   scenarios now do real HNSW Add work — honest cost of correctness.
   The fix (i.vectors side-store + safeGraphAdd recover wrappers +
   smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
   footprint that originally surfaced the bug.

2. **Materializer port** — internal/materializer + cmd/materializer +
   scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
   (12 transforms) + build_evidence_index.ts (idempotency, day-partition,
   receipt). On-wire JSON shape matches TS so Bun and Go runs are
   interchangeable. 14 tests green.

3. **Replay port** — internal/replay + cmd/replay +
   scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
   (retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
   phase 7 live invocation on the Go side. Both runtimes append to the
   same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.

Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.

STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:31:02 -05:00

# Multi-tier load test — 100k workers, 6 scenarios, real validators

J's request: a much more sophisticated test using the 100k corpus from the Rust legacy database, exercising the new EmailValidator + FillValidator, plus profile-swap and other realistic coordinator workflow scenarios.

## Setup

- Corpus: 100,000 workers from /home/profit/lakehouse/data/datasets/workers_100k.parquet, ingested into Go vectord via staffing_workers -limit 100000 (~55 minutes). Index: workers on persistent stack, dim=768.
- Persistent Go stack on :4110+:4211-:4219 (11 daemons, 3-layer isolation from smoke harness).
- Bun frontend at :3700 (not used by this test — direct hits to Go gateway).
- Validator pool: 200 in-process workers (test-w-XXX IDs) with matched city/state/role pairs across 35 unique combos.
- Tool: scripts/cutover/multitier/main.go — 6-scenario harness with weighted random scenario selection per goroutine.

## Six scenarios + weights

| Scenario | Weight | Steps | Validators |
| --- | --- | --- | --- |
| cold_search_email | 35% | search → email outreach + validate | EmailValidator |
| surge_fill_validate | 15% | search → fill proposal (2 workers) → FillValidator → record | FillValidator |
| profile_swap | 15% | original search → swap with ExcludeIDs → no-overlap check | (none — substrate-only) |
| repeat_cache | 15% | same query × 5 → cache effectiveness measure | (none) |
| sms_validate | 10% | search → SMS draft (≤160 chars, contains phone for SSN false-positive test) → validate | EmailValidator (kind=sms) |
| playbook_record_replay | 10% | cold search → record → warm search w/ use_playbook=true | (none — exercises learning loop) |

## Results — sustained 5-minute run, conc=50

| Scenario | Runs | Fail% | p50 | p95 | p99 | max |
| --- | --- | --- | --- | --- | --- | --- |
| cold_search_email | 117,406 | 0.0% | 2.22ms | 5.37ms | 8.61ms | 452ms |
| surge_fill_validate | 50,091 | 98.8% | 5.02ms | 13.14ms | 44.02ms | 681ms |
| profile_swap | 50,263 | 0.0% | 4.45ms | 9.65ms | 14.04ms | 461ms |
| repeat_cache | 50,576 | 0.0% | 11.73ms | 21.03ms | 29.92ms | 453ms |
| sms_validate | 33,524 | 0.0% | 2.13ms | 5.24ms | 8.48ms | 467ms |
| playbook_record_replay | 33,397 | 96.8% | 391ms | 477ms | 719ms | 1,018ms |
| TOTAL | 335,257 | | | | | |

1,115 scenarios per second sustained over 5 minutes. 4 of 6 scenarios at 0% failure across 251,769 successful workflows.

Cache effectiveness (repeat_cache scenario, 5 sequential queries each): 50,576 × 5 = 252,880 cached searches, all returning the same top-K with no failures. The matrixd retrieve path scales fine on the 100k corpus.

## Resource footprint at 100k corpus

| Daemon | CPU% | RSS | Note |
| --- | --- | --- | --- |
| persistent-vectord | 76% | 1.23GB | linear with 100k vectors (vs 82MB at 5k) |
| persistent-matrixd | 75% | 26MB | bottleneck at conc=50+ (1 core pegged) |
| persistent-gateway | 30% | 26MB | proxy + auth |
| persistent-embedd | 21% | 97MB | embed cache + Ollama bridge |
| persistent-storaged | 11% | 82MB | rehydrate I/O active |
| (5 other daemons) | ~0% | ~25MB each | idle |
| Total | | ~1.7GB | |

Compare to Rust gateway under similar load: 14.9GB RSS. Even at 100k workers, Go uses ~10× less memory with explicit per-daemon attribution.

## What the test exposed (substrate finding)

The two scenarios that hit /v1/matrix/playbooks/record (surge_fill_validate, playbook_record_replay) failed at a 96-98% rate. The failure stack points to a coder/hnsw v0.6.1 nil-pointer dereference in layerNode.search (graph.go:95), triggered during HNSW Add to the small-state playbook_memory index.

Reproduction:

  1. Empty playbook_memory index (length=0)
  2. First record succeeds (length=1)
  3. Subsequent record under concurrent load → coder/hnsw panics
  4. Repeated concurrent records → the index transitions through degenerate states where the entry node is nil

Root cause: coder/hnsw v0.6.1 doesn't handle the len=0/1 edge case correctly when the graph has been Delete'd-then-Add'd. The vectord wrapper has a partial guard (resets graph on len=1 during re-add) but doesn't catch every degenerate state.

Workaround applied: added a recover() guard in internal/vectord/index.go BatchAdd — panics now return errors instead of killing the request handler. Daemon stays up; clients get HTTP 500 with a clear "DELETE the index to recover" hint.
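
The recover() pattern the workaround uses looks roughly like this. The `safeAdd` name and error text are illustrative, not the literal internal/vectord/index.go code; the point is that any panic from the underlying hnsw Add surfaces as an error the request handler can turn into an HTTP 500.

```go
package main

import "fmt"

// safeAdd wraps a graph-mutating call so a panic inside it (e.g. the
// layerNode.search nil deref) becomes an error instead of killing the
// daemon's request handler.
func safeAdd(add func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("hnsw add panicked: %v (DELETE the index to recover)", r)
		}
	}()
	add()
	return nil
}

func main() {
	err := safeAdd(func() { panic("nil pointer in layerNode.search") })
	fmt.Println(err)
}
```

Because the recover sits in a deferred closure that assigns the named return, the panic is fully absorbed and the goroutine keeps serving.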

Operator recovery: when /v1/matrix/playbooks/record starts returning 500s, run:

    curl -X DELETE http://localhost:4215/vectors/index/playbook_memory

The next record will recreate the index fresh.

Proper fix (deferred): either (a) upstream patch to coder/hnsw, (b) write a different small-index Add path that always rebuilds from scratch when len < threshold, or (c) switch playbook_memory to a different vector store (Lance? in-memory map for the playbook-corpus shape, since playbook entries are small).
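
Option (b) can be sketched as follows, under the assumption that vectors live in a panic-safe side map that acts as source of truth. All names and the threshold value here are illustrative; the `graph` field stands in for the real HNSW structure.

```go
package main

import "fmt"

// Below this size, Add never does incremental HNSW insertion: it rebuilds
// the graph wholesale from the side map, sidestepping degenerate small-graph
// states entirely.
const smallIndexRebuildThreshold = 32

type index struct {
	vectors map[string][]float32 // side-store: survives any graph panic
	graph   []string             // stand-in for the rebuilt HNSW graph
}

func (i *index) add(id string, vec []float32) {
	i.vectors[id] = vec
	if len(i.vectors) < smallIndexRebuildThreshold {
		i.rebuild() // small index: always rebuild from scratch
		return
	}
	// large index: incremental HNSW Add would go here
}

// rebuild discards all graph state and re-derives it from the side map.
func (i *index) rebuild() {
	i.graph = i.graph[:0]
	for id := range i.vectors {
		i.graph = append(i.graph, id)
	}
}

func main() {
	idx := &index{vectors: map[string][]float32{}}
	idx.add("w1", []float32{0.1})
	idx.add("w2", []float32{0.2})
	fmt.Println(len(idx.graph)) // 2
}
```

The rebuild-below-threshold shape trades O(n²) work on tiny indexes for never touching the library's fragile len=0/1 paths, which is cheap when entries are small, as playbook entries are.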

## What the test confirmed (production-readiness)

Across 335k scenarios in 5 minutes:

  1. Search at 100k corpus is fast — p99 8.6ms on cold path, matching the 5k corpus characteristics. HNSW search is O(log n) so 20× corpus growth barely registered.
  2. Validator integration works at load — 117,406 EmailValidator passes in cold_search_email + 33,524 in sms_validate. The in-process validators don't bottleneck.
  3. Profile swap with ExcludeIDs is correct — 50,263 swaps, zero overlap detected between original + swap result sets. The ExcludeIDs filter holds.
  4. Embed cache effectiveness verified — repeat_cache scenario (5 sequential queries each) yielded 252,880 cached searches with no failures and consistent latencies. Cache hit rate is high enough that 100k-corpus search costs match 5k-corpus search costs in p50.
  5. SMS-shape phone-number false-positive guard works — 33,524 SMS drafts containing "Call 555-123-4567" (phone shape that ALMOST matches SSN-shape NNN-NN-NNNN) all passed the EmailValidator's flanking-digit guard.
  6. Cross-daemon HTTP overhead is negligible — matrixd→vectord→embedd round-trips at ~2-12ms p50 across scenarios.
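
The flanking-digit guard behind point 5 can be sketched like this. It is an illustrative reimplementation of the idea, not the EmailValidator's actual code; since Go's RE2 regexp has no lookarounds, the flanking check is done manually on the match boundaries.

```go
package main

import (
	"fmt"
	"regexp"
)

// ssnShape matches the bare NNN-NN-NNNN shape.
var ssnShape = regexp.MustCompile(`\d{3}-\d{2}-\d{4}`)

// looksLikeSSN reports whether s contains an SSN-shaped run that is NOT
// flanked by other digits. A flanking digit means the shape is a fragment
// of a longer number (e.g. part of a phone number), not a standalone SSN.
func looksLikeSSN(s string) bool {
	for _, loc := range ssnShape.FindAllStringIndex(s, -1) {
		start, end := loc[0], loc[1]
		if start > 0 && isDigit(s[start-1]) {
			continue // digit immediately before: fragment, skip
		}
		if end < len(s) && isDigit(s[end]) {
			continue // digit immediately after: fragment, skip
		}
		return true
	}
	return false
}

func isDigit(b byte) bool { return '0' <= b && b <= '9' }

func main() {
	fmt.Println(looksLikeSSN("SSN 123-45-6789"))   // true
	fmt.Println(looksLikeSSN("Call 555-123-4567")) // false
}
```

The phone shape NNN-NNN-NNNN never segments as NNN-NN-NNNN at its dash positions, and any SSN-shaped run embedded in a longer digit string gets rejected by the boundary check, which is what keeps the 33,524 SMS drafts clean.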

## What this DOES NOT cover

- Real coordinator demand patterns — bodies rotated round-robin; real workloads have arrival-rate variability + burst clustering.
- Multi-host horizontal scale — single-machine load.
- Sustained for hours — 5-minute window; long-tail leaks (file handles, goroutine pools, MinIO connections) not tested.
- Concurrent ingest + load — the 100k ingest finished BEFORE the test ran. Mixed read/write at scale is a separate probe.
- Real Bun frontend in path — direct-to-Go for max throughput. Bun adds ~5x latency overhead per the earlier g5_load_test.md.

## Repro

    # Stack must be up:
    ./scripts/cutover/start_go_stack.sh

    # Ingest 100k workers (one-time, ~55 min):
    ./bin/staffing_workers -limit 100000 \
      -parquet /home/profit/lakehouse/data/datasets/workers_100k.parquet \
      -gateway http://127.0.0.1:4110 -drop=true

    # Reset playbook_memory if it's in a degenerate state:
    curl -X DELETE http://127.0.0.1:4215/vectors/index/playbook_memory

    # Build + run multitier:
    go build -o bin/multitier ./scripts/cutover/multitier
    ./bin/multitier -gateway http://127.0.0.1:4110 -concurrency 50 -duration 300s

    # Stderr is parseable JSON for CI integration.

## Decisions tracker delta

Add to docs/ARCHITECTURE_COMPARISON.md Decisions tracker:

| Date | Decision | Effect |
| --- | --- | --- |
| 2026-05-01 | playbook_record under load triggers coder/hnsw v0.6.1 nil-deref | Recover guard added in BatchAdd; daemon stays up. Real fix open: upstream patch OR small-index custom Add path OR alternate store. |
| 2026-05-01 (later) | Real fix landed. | vectord lifts source-of-truth out of coder/hnsw via i.vectors map[string][]float32 side store; safeGraphAdd/safeGraphDelete recover panics; warm-path Add falls back to rebuild on failure; rebuildGraphLocked reads from the panic-safe side map. Re-ran multitier 60s/conc=50: 0 failures across 19,622 scenarios (was 96-98% on 2/6). p50 on previously-failing scenarios moves 5ms (instant fail) → 551ms (real Add work — honest cost of correctness). Memory cost: ~2× for vectors. STATE_OF_PLAY captures the architecture invariant. |
| 2026-05-02 | Full-scale verification. | Re-ran multitier at the original failure-surfacing footprint (5min @ conc=50). Result: 132,211 scenarios at 438.5/sec, 0 failures across all 6 classes. Throughput dropped from pre-fix 1,115/sec → 438/sec because previously-broken scenarios (96-98% fail) now do real HNSW Add work instead of fast nil-deref panics. Healthy tails: surge_fill_validate p50=28.9ms / p99=1.53s, playbook_record_replay p50=504ms / p99=2.32s — small-index rebuild kicking in under sustained churn, working as designed. Substrate fix scales beyond the 19.6k-scenario probe; closing the open thread. |

## Conclusion

Pre-fix (2026-05-01): 335,257 scenarios in 5min, 4/6 classes at 0% failure, 2/6 hit a coder/hnsw v0.6.1 nil-deref under playbook record churn. Operator recovery via DELETE + recreate.

Post-fix (2026-05-02): 132,211 scenarios in 5min @ conc=50, 6/6 classes at 0% failure. Throughput moved 1,115/sec → 438/sec because the formerly fast-failing scenarios are now doing real HNSW Add work — that's the honest cost of correctness, not a regression. The fix (i.vectors side-store + safeGraphAdd recover wrappers + small-index rebuild threshold of 32 + saveTask write coalescing) shifts vectord's source-of-truth out of coder/hnsw so panics can't lose data and the daemon recovers automatically.

This is the most production-shape test we've run. The harness mixes search, validator calls (in-process), HTTP cross-daemon round-trips, playbook recording, and cache exercise. The result is more honest than a single-endpoint load test, and post-fix all six workflows work cleanly at scale.