Two threads landing together — the doc edits interleave so they ship in a single commit. 1. **vectord substrate fix verified at original scale** (closes the 2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211 scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix). Throughput dropped 1,115 → 438/sec because previously-broken scenarios now do real HNSW Add work — honest cost of correctness. The fix (i.vectors side-store + safeGraphAdd recover wrappers + smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the footprint that originally surfaced the bug. 2. **Materializer port** — internal/materializer + cmd/materializer + scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts (12 transforms) + build_evidence_index.ts (idempotency, day-partition, receipt). On-wire JSON shape matches TS so Bun and Go runs are interchangeable. 14 tests green. 3. **Replay port** — internal/replay + cmd/replay + scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts (retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL phase 7 live invocation on the Go side. Both runtimes append to the same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green. Side effect on internal/distillation/types.go: EvidenceRecord gained prompt_tokens, completion_tokens, and metadata fields to mirror the TS shape the materializer transforms produce. STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions tracker moves the materializer + replay items from _open_ to DONE and adds the substrate-fix scale verification row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.9 KiB
Multi-tier load test — 100k workers, 6 scenarios, real validators
J's request: a much more sophisticated test using the 100k corpus from the Rust legacy database, exercising the new EmailValidator + FillValidator, plus profile-swap and other realistic coordinator workflow scenarios.
Setup
- Corpus: 100,000 workers from
/home/profit/lakehouse/data/datasets/workers_100k.parquet, ingested into Go vectord viastaffing_workers -limit 100000(~55 minutes). Index:workerson persistent stack, dim=768. - Persistent Go stack on
:4110+:4211-:4219(11 daemons, 3-layer isolation from smoke harness). - Bun frontend at
:3700(not used by this test — direct hits to Go gateway). - Validator pool: 200 in-process workers (
test-w-XXXIDs) with matched city/state/role pairs across 35 unique combos. - Tool:
scripts/cutover/multitier/main.go— 6-scenario harness with weighted random scenario selection per goroutine.
Six scenarios + weights
| Scenario | Weight | Steps | Validators |
|---|---|---|---|
cold_search_email |
35% | search → email outreach + validate | EmailValidator |
surge_fill_validate |
15% | search → fill proposal (2 workers) → FillValidator → record | FillValidator |
profile_swap |
15% | original search → swap with ExcludeIDs → no-overlap check |
(none — substrate-only) |
repeat_cache |
15% | same query × 5 → cache effectiveness measure | (none) |
sms_validate |
10% | search → SMS draft (≤160 chars, contains phone for SSN false-positive test) → validate | EmailValidator (kind=sms) |
playbook_record_replay |
10% | cold search → record → warm search w/ use_playbook=true |
(none — exercises learning loop) |
Results — sustained 5-minute run, conc=50
| Scenario | Runs | Fail% | p50 | p95 | p99 | max |
|---|---|---|---|---|---|---|
cold_search_email |
117,406 | 0.0% | 2.22ms | 5.37ms | 8.61ms | 452ms |
surge_fill_validate |
50,091 | 98.8% | 5.02ms | 13.14ms | 44.02ms | 681ms |
profile_swap |
50,263 | 0.0% | 4.45ms | 9.65ms | 14.04ms | 461ms |
repeat_cache |
50,576 | 0.0% | 11.73ms | 21.03ms | 29.92ms | 453ms |
sms_validate |
33,524 | 0.0% | 2.13ms | 5.24ms | 8.48ms | 467ms |
playbook_record_replay |
33,397 | 96.8% | 391ms | 477ms | 719ms | 1,018ms |
| TOTAL | 335,257 | — | — | — | — | — |
1,115 scenarios per second sustained over 5 minutes. 4 of 6 scenarios at 0% failure across 251,769 successful workflows.
Cache effectiveness (repeat_cache scenario, 5 sequential queries each): 50,576 × 5 = 252,880 cached searches, all returning the same top-K with no failures. The matrixd retrieve path scales fine on the 100k corpus.
Resource footprint at 100k corpus
| Daemon | CPU% | RSS | Note |
|---|---|---|---|
| persistent-vectord | 76% | 1.23GB | linear with 100k vectors (vs 82MB at 5k) |
| persistent-matrixd | 75% | 26MB | bottleneck at conc=50+ (1 core pegged) |
| persistent-gateway | 30% | 26MB | proxy + auth |
| persistent-embedd | 21% | 97MB | embed cache + Ollama bridge |
| persistent-storaged | 11% | 82MB | rehydrate I/O active |
| (5 other daemons) | ~0% | ~25MB each | idle |
| Total | — | ~1.7GB |
Compare to Rust gateway under similar load: 14.9GB RSS. Even at 100k workers, Go uses ~10× less memory with explicit per-daemon attribution.
What the test exposed (substrate finding)
The two scenarios that hit /v1/matrix/playbooks/record
(surge_fill_validate, playbook_record_replay) failed at 96-98% rate.
Failure stack identified: coder/hnsw v0.6.1 nil pointer in
layerNode.search (graph.go:95) triggered during HNSW Add to the
small-state playbook_memory index.
Reproduction:
- Empty playbook_memory index (length=0)
- First record succeeds (length=1)
- Subsequent record under concurrent load → coder/hnsw panics
- Repeated concurrent records → index transitions through degenerate states where entry node is nil
Root cause: coder/hnsw v0.6.1 doesn't handle the len=0/1 edge case correctly when the graph has been Delete'd-then-Add'd. The vectord wrapper has a partial guard (resets graph on len=1 during re-add) but doesn't catch every degenerate state.
Workaround applied: added a recover() guard in
internal/vectord/index.go BatchAdd — panics now return errors
instead of killing the request handler. Daemon stays up; clients
get HTTP 500 with a clear "DELETE the index to recover" hint.
Operator recovery: when /v1/matrix/playbooks/record starts
returning 500s, run:
curl -X DELETE http://localhost:4215/vectors/index/playbook_memory
Next record will recreate the index fresh.
Proper fix (deferred): either (a) upstream patch to coder/hnsw, (b) write a different small-index Add path that always rebuilds from scratch when len < threshold, or (c) switch playbook_memory to a different vector store (Lance? in-memory map for the playbook-corpus shape, since playbook entries are small).
What the test confirmed (production-readiness)
Across 335k scenarios in 5 minutes:
- Search at 100k corpus is fast — p99 8.6ms on cold path,
matching the 5k corpus characteristics. HNSW search is
O(log n)so 20× corpus growth barely registered. - Validator integration works at load — 117,406 EmailValidator passes in cold_search_email + 33,524 in sms_validate. The in-process validators don't bottleneck.
- Profile swap with ExcludeIDs is correct — 50,263 swaps, zero overlap detected between original + swap result sets. The ExcludeIDs filter holds.
- Embed cache effectiveness verified — repeat_cache scenario (5 sequential queries each) yielded 252,880 cached searches with no failures and consistent latencies. Cache hit rate is high enough that 100k-corpus search costs match 5k-corpus search costs in p50.
- SMS-shape phone-number false-positive guard works — 33,524 SMS drafts containing "Call 555-123-4567" (phone shape that ALMOST matches SSN-shape NNN-NN-NNNN) all passed the EmailValidator's flanking-digit guard.
- Cross-daemon HTTP overhead is negligible — matrixd→vectord→embedd round-trips at ~2-12ms p50 across scenarios.
What this DOES NOT cover
- Real coordinator demand patterns — bodies rotated round-robin; real workloads have arrival-rate variability + burst clustering.
- Multi-host horizontal scale — single-machine load.
- Sustained for hours — 5-minute window; long-tail leaks (file handles, goroutine pools, MinIO connections) not tested.
- Concurrent ingest + load — the 100k ingest finished BEFORE the test ran. Mixed read/write at scale is a separate probe.
- Real Bun frontend in path — direct-to-Go for max throughput.
Bun adds ~5x latency overhead per the earlier
g5_load_test.md.
Repro
# Stack must be up:
./scripts/cutover/start_go_stack.sh
# Ingest 100k workers (one-time, ~55 min):
./bin/staffing_workers -limit 100000 \
-parquet /home/profit/lakehouse/data/datasets/workers_100k.parquet \
-gateway http://127.0.0.1:4110 -drop=true
# Reset playbook_memory if it's in a degenerate state:
curl -X DELETE http://127.0.0.1:4215/vectors/index/playbook_memory
# Build + run multitier:
go build -o bin/multitier ./scripts/cutover/multitier
./bin/multitier -gateway http://127.0.0.1:4110 -concurrency 50 -duration 300s
# Stderr is parseable JSON for CI integration.
Decisions tracker delta
Add to docs/ARCHITECTURE_COMPARISON.md Decisions tracker:
| Date | Decision | Effect |
|---|---|---|
| 2026-05-01 | playbook_record under load triggers coder/hnsw v0.6.1 nil-deref | Recover guard added in BatchAdd; daemon stays up. Real fix open: upstream patch OR small-index custom Add path OR alternate store. |
| 2026-05-01 (later) | Real fix landed. vectord lifts source-of-truth out of coder/hnsw via i.vectors map[string][]float32 side store; safeGraphAdd/safeGraphDelete recover panics; warm-path Add falls back to rebuild on failure; rebuildGraphLocked reads from the panic-safe side map. Re-ran multitier 60s/conc=50: 0 failures across 19,622 scenarios (was 96-98% on 2/6). p50 on previously-failing scenarios moves 5ms (instant fail) → 551ms (real Add work — honest cost of correctness). Memory cost: ~2× for vectors. STATE_OF_PLAY captures the architecture invariant. |
|
| 2026-05-02 | Full-scale verification. Re-ran multitier at the original failure-surfacing footprint (5min @ conc=50). Result: 132,211 scenarios at 438.5/sec, 0 failures across all 6 classes. Throughput dropped from pre-fix 1,115/sec → 438/sec because previously-broken scenarios (96-98% fail) now do real HNSW Add work instead of fast nil-deref panics. Healthy tails: surge_fill_validate p50=28.9ms / p99=1.53s, playbook_record_replay p50=504ms / p99=2.32s — small-index rebuild kicking in under sustained churn, working as designed. Substrate fix scales beyond the 19.6k-scenario probe; closing the open thread. |
Conclusion
Pre-fix (2026-05-01): 335,257 scenarios in 5min, 4/6 classes at 0% failure, 2/6 hit a coder/hnsw v0.6.1 nil-deref under playbook record churn. Operator recovery via DELETE + recreate.
Post-fix (2026-05-02): 132,211 scenarios in 5min @ conc=50, 6/6 classes at 0% failure. Throughput moved 1,115/sec → 438/sec because the formerly fast-failing scenarios are now doing real HNSW Add work — that's the honest cost of correctness, not a regression. The fix (i.vectors side-store + safeGraphAdd recover wrappers + small-index rebuild threshold of 32 + saveTask write coalescing) shifts vectord's source-of-truth out of coder/hnsw so panics can't lose data and the daemon recovers automatically.
This is the most production-shape test we've run. The harness mixes search, validator calls (in-process), HTTP cross-daemon round-trips, playbook recording, and cache exercise. The result is more honest than a single-endpoint load test, and post-fix all six workflows work cleanly at scale.