golangLAKEHOUSE/reports/cutover/multitier_100k.md
root 89ca72d471 materializer + replay ports + vectord substrate fix verified at scale
Two threads landing together — the doc edits interleave so they ship
in a single commit.

1. **vectord substrate fix verified at original scale** (closes the
   2026-05-01 thread). Re-ran multitier 5min @ conc=50: 132,211
   scenarios at 438/sec, 6/6 classes at 0% failure (was 4/6 pre-fix).
   Throughput dropped 1,115 → 438/sec because previously-broken
   scenarios now do real HNSW Add work — honest cost of correctness.
   The fix (i.vectors side-store + safeGraphAdd recover wrappers +
   smallIndexRebuildThreshold=32 + saveTask coalescing) holds at the
   footprint that originally surfaced the bug.

2. **Materializer port** — internal/materializer + cmd/materializer +
   scripts/materializer_smoke.sh. Ports scripts/distillation/transforms.ts
   (12 transforms) + build_evidence_index.ts (idempotency, day-partition,
   receipt). On-wire JSON shape matches TS so Bun and Go runs are
   interchangeable. 14 tests green.

3. **Replay port** — internal/replay + cmd/replay +
   scripts/replay_smoke.sh. Ports scripts/distillation/replay.ts
   (retrieve → bundle → /v1/chat → validate → log). Closes audit-FULL
   phase 7 live invocation on the Go side. Both runtimes append to the
   same data/_kb/replay_runs.jsonl (schema=replay_run.v1). 14 tests green.

Side effect on internal/distillation/types.go: EvidenceRecord gained
prompt_tokens, completion_tokens, and metadata fields to mirror the TS
shape the materializer transforms produce.

STATE_OF_PLAY refreshed to 2026-05-02; ARCHITECTURE_COMPARISON decisions
tracker moves the materializer + replay items from _open_ to DONE and
adds the substrate-fix scale verification row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 03:31:02 -05:00

# Multi-tier load test — 100k workers, 6 scenarios, real validators

J's request: a much more sophisticated test using the 100k corpus from the Rust legacy database, exercising the new EmailValidator + FillValidator, plus profile-swap and other realistic coordinator workflow scenarios.

## Setup

- Corpus: 100,000 workers from /home/profit/lakehouse/data/datasets/workers_100k.parquet, ingested into Go vectord via staffing_workers -limit 100000 (~55 minutes). Index: workers on persistent stack, dim=768.
- Persistent Go stack on :4110+:4211-:4219 (11 daemons, 3-layer isolation from smoke harness).
- Bun frontend at :3700 (not used by this test — direct hits to Go gateway).
- Validator pool: 200 in-process workers (test-w-XXX IDs) with matched city/state/role pairs across 35 unique combos.
- Tool: scripts/cutover/multitier/main.go — 6-scenario harness with weighted random scenario selection per goroutine.

## Six scenarios + weights

| Scenario | Weight | Steps | Validators |
| --- | --- | --- | --- |
| cold_search_email | 35% | search → email outreach + validate | EmailValidator |
| surge_fill_validate | 15% | search → fill proposal (2 workers) → FillValidator → record | FillValidator |
| profile_swap | 15% | original search → swap with ExcludeIDs → no-overlap check | (none — substrate-only) |
| repeat_cache | 15% | same query × 5 → cache effectiveness measure | (none) |
| sms_validate | 10% | search → SMS draft (≤160 chars, contains phone for SSN false-positive test) → validate | EmailValidator (kind=sms) |
| playbook_record_replay | 10% | cold search → record → warm search w/ use_playbook=true | (none — exercises learning loop) |

## Results — sustained 5-minute run, conc=50

| Scenario | Runs | Fail% | p50 | p95 | p99 | max |
| --- | --- | --- | --- | --- | --- | --- |
| cold_search_email | 117,406 | 0.0% | 2.22ms | 5.37ms | 8.61ms | 452ms |
| surge_fill_validate | 50,091 | 98.8% | 5.02ms | 13.14ms | 44.02ms | 681ms |
| profile_swap | 50,263 | 0.0% | 4.45ms | 9.65ms | 14.04ms | 461ms |
| repeat_cache | 50,576 | 0.0% | 11.73ms | 21.03ms | 29.92ms | 453ms |
| sms_validate | 33,524 | 0.0% | 2.13ms | 5.24ms | 8.48ms | 467ms |
| playbook_record_replay | 33,397 | 96.8% | 391ms | 477ms | 719ms | 1,018ms |
| TOTAL | 335,257 | | | | | |

1,115 scenarios per second sustained over 5 minutes. 4 of 6 scenarios at 0% failure across 251,769 successful workflows.

Cache effectiveness (repeat_cache scenario, 5 sequential queries each): 50,576 × 5 = 252,880 cached searches, all returning the same top-K with no failures. The matrixd retrieve path scales fine on the 100k corpus.

## Resource footprint at 100k corpus

| Daemon | CPU% | RSS | Note |
| --- | --- | --- | --- |
| persistent-vectord | 76% | 1.23GB | linear with 100k vectors (vs 82MB at 5k) |
| persistent-matrixd | 75% | 26MB | bottleneck at conc=50+ (1 core pegged) |
| persistent-gateway | 30% | 26MB | proxy + auth |
| persistent-embedd | 21% | 97MB | embed cache + Ollama bridge |
| persistent-storaged | 11% | 82MB | rehydrate I/O active |
| (5 other daemons) | ~0% | ~25MB each | idle |
| Total | | ~1.7GB | |

Compare to Rust gateway under similar load: 14.9GB RSS. Even at 100k workers, Go uses ~10× less memory with explicit per-daemon attribution.

## What the test exposed (substrate finding)

The two scenarios that hit /v1/matrix/playbooks/record (surge_fill_validate, playbook_record_replay) failed at a 96-98% rate. The failure stack points to a coder/hnsw v0.6.1 nil-pointer dereference in layerNode.search (graph.go:95), triggered during HNSW Add to the small-state playbook_memory index.

Reproduction:

  1. Empty playbook_memory index (length=0)
  2. First record succeeds (length=1)
  3. Subsequent record under concurrent load → coder/hnsw panics
  4. Repeated concurrent records → the index transitions through degenerate states where the entry node is nil

Root cause: coder/hnsw v0.6.1 doesn't handle the len=0/1 edge case correctly when the graph has been Delete'd-then-Add'd. The vectord wrapper has a partial guard (resets graph on len=1 during re-add) but doesn't catch every degenerate state.

Workaround applied: added a recover() guard in internal/vectord/index.go BatchAdd — panics now return errors instead of killing the request handler. Daemon stays up; clients get HTTP 500 with a clear "DELETE the index to recover" hint.
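
The recover() pattern the workaround uses looks roughly like this. The `safeAdd` name and error text are illustrative, not the literal internal/vectord/index.go code; the point is that any panic from the underlying hnsw Add surfaces as an error the request handler can turn into an HTTP 500.

```go
package main

import "fmt"

// safeAdd wraps a graph-mutating call so a panic inside it (e.g. the
// layerNode.search nil deref) becomes an error instead of killing the
// daemon's request handler.
func safeAdd(add func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("hnsw add panicked: %v (DELETE the index to recover)", r)
		}
	}()
	add()
	return nil
}

func main() {
	err := safeAdd(func() { panic("nil pointer in layerNode.search") })
	fmt.Println(err)
}
```

Because the recover sits in a deferred closure that assigns the named return, the panic is fully absorbed and the goroutine keeps serving.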

Operator recovery: when /v1/matrix/playbooks/record starts returning 500s, run:

    curl -X DELETE http://localhost:4215/vectors/index/playbook_memory

The next record will recreate the index fresh.

Proper fix (deferred): either (a) upstream patch to coder/hnsw, (b) write a different small-index Add path that always rebuilds from scratch when len < threshold, or (c) switch playbook_memory to a different vector store (Lance? in-memory map for the playbook-corpus shape, since playbook entries are small).
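
Option (b) can be sketched as follows, under the assumption that vectors live in a panic-safe side map that acts as source of truth. All names and the threshold value here are illustrative; the `graph` field stands in for the real HNSW structure.

```go
package main

import "fmt"

// Below this size, Add never does incremental HNSW insertion: it rebuilds
// the graph wholesale from the side map, sidestepping degenerate small-graph
// states entirely.
const smallIndexRebuildThreshold = 32

type index struct {
	vectors map[string][]float32 // side-store: survives any graph panic
	graph   []string             // stand-in for the rebuilt HNSW graph
}

func (i *index) add(id string, vec []float32) {
	i.vectors[id] = vec
	if len(i.vectors) < smallIndexRebuildThreshold {
		i.rebuild() // small index: always rebuild from scratch
		return
	}
	// large index: incremental HNSW Add would go here
}

// rebuild discards all graph state and re-derives it from the side map.
func (i *index) rebuild() {
	i.graph = i.graph[:0]
	for id := range i.vectors {
		i.graph = append(i.graph, id)
	}
}

func main() {
	idx := &index{vectors: map[string][]float32{}}
	idx.add("w1", []float32{0.1})
	idx.add("w2", []float32{0.2})
	fmt.Println(len(idx.graph)) // 2
}
```

The rebuild-below-threshold shape trades O(n²) work on tiny indexes for never touching the library's fragile len=0/1 paths, which is cheap when entries are small, as playbook entries are.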

## What the test confirmed (production-readiness)

Across 335k scenarios in 5 minutes:

  1. Search at 100k corpus is fast — p99 8.6ms on cold path, matching the 5k corpus characteristics. HNSW search is O(log n) so 20× corpus growth barely registered.
  2. Validator integration works at load — 117,406 EmailValidator passes in cold_search_email + 33,524 in sms_validate. The in-process validators don't bottleneck.
  3. Profile swap with ExcludeIDs is correct — 50,263 swaps, zero overlap detected between original + swap result sets. The ExcludeIDs filter holds.
  4. Embed cache effectiveness verified — repeat_cache scenario (5 sequential queries each) yielded 252,880 cached searches with no failures and consistent latencies. Cache hit rate is high enough that 100k-corpus search costs match 5k-corpus search costs in p50.
  5. SMS-shape phone-number false-positive guard works — 33,524 SMS drafts containing "Call 555-123-4567" (phone shape that ALMOST matches SSN-shape NNN-NN-NNNN) all passed the EmailValidator's flanking-digit guard.
  6. Cross-daemon HTTP overhead is negligible — matrixd→vectord→embedd round-trips at ~2-12ms p50 across scenarios.
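
The flanking-digit guard behind point 5 can be sketched like this. It is an illustrative reimplementation of the idea, not the EmailValidator's actual code; since Go's RE2 regexp has no lookarounds, the flanking check is done manually on the match boundaries.

```go
package main

import (
	"fmt"
	"regexp"
)

// ssnShape matches the bare NNN-NN-NNNN shape.
var ssnShape = regexp.MustCompile(`\d{3}-\d{2}-\d{4}`)

// looksLikeSSN reports whether s contains an SSN-shaped run that is NOT
// flanked by other digits. A flanking digit means the shape is a fragment
// of a longer number (e.g. part of a phone number), not a standalone SSN.
func looksLikeSSN(s string) bool {
	for _, loc := range ssnShape.FindAllStringIndex(s, -1) {
		start, end := loc[0], loc[1]
		if start > 0 && isDigit(s[start-1]) {
			continue // digit immediately before: fragment, skip
		}
		if end < len(s) && isDigit(s[end]) {
			continue // digit immediately after: fragment, skip
		}
		return true
	}
	return false
}

func isDigit(b byte) bool { return '0' <= b && b <= '9' }

func main() {
	fmt.Println(looksLikeSSN("SSN 123-45-6789"))   // true
	fmt.Println(looksLikeSSN("Call 555-123-4567")) // false
}
```

The phone shape NNN-NNN-NNNN never segments as NNN-NN-NNNN at its dash positions, and any SSN-shaped run embedded in a longer digit string gets rejected by the boundary check, which is what keeps the 33,524 SMS drafts clean.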

## What this DOES NOT cover

- Real coordinator demand patterns — bodies rotated round-robin; real workloads have arrival-rate variability + burst clustering.
- Multi-host horizontal scale — single-machine load.
- Sustained for hours — 5-minute window; long-tail leaks (file handles, goroutine pools, MinIO connections) not tested.
- Concurrent ingest + load — the 100k ingest finished BEFORE the test ran. Mixed read/write at scale is a separate probe.
- Real Bun frontend in path — direct-to-Go for max throughput. Bun adds ~5x latency overhead per the earlier g5_load_test.md.

## Repro

    # Stack must be up:
    ./scripts/cutover/start_go_stack.sh

    # Ingest 100k workers (one-time, ~55 min):
    ./bin/staffing_workers -limit 100000 \
      -parquet /home/profit/lakehouse/data/datasets/workers_100k.parquet \
      -gateway http://127.0.0.1:4110 -drop=true

    # Reset playbook_memory if it's in a degenerate state:
    curl -X DELETE http://127.0.0.1:4215/vectors/index/playbook_memory

    # Build + run multitier:
    go build -o bin/multitier ./scripts/cutover/multitier
    ./bin/multitier -gateway http://127.0.0.1:4110 -concurrency 50 -duration 300s

    # Stderr is parseable JSON for CI integration.

## Decisions tracker delta

Add to docs/ARCHITECTURE_COMPARISON.md Decisions tracker:

| Date | Decision | Effect |
| --- | --- | --- |
| 2026-05-01 | playbook_record under load triggers coder/hnsw v0.6.1 nil-deref | Recover guard added in BatchAdd; daemon stays up. Real fix open: upstream patch OR small-index custom Add path OR alternate store. |
| 2026-05-01 (later) | Real fix landed. | vectord lifts source-of-truth out of coder/hnsw via i.vectors map[string][]float32 side store; safeGraphAdd/safeGraphDelete recover panics; warm-path Add falls back to rebuild on failure; rebuildGraphLocked reads from the panic-safe side map. Re-ran multitier 60s/conc=50: 0 failures across 19,622 scenarios (was 96-98% on 2/6). p50 on previously-failing scenarios moves 5ms (instant fail) → 551ms (real Add work — honest cost of correctness). Memory cost: ~2× for vectors. STATE_OF_PLAY captures the architecture invariant. |
| 2026-05-02 | Full-scale verification. | Re-ran multitier at the original failure-surfacing footprint (5min @ conc=50). Result: 132,211 scenarios at 438.5/sec, 0 failures across all 6 classes. Throughput dropped from pre-fix 1,115/sec → 438/sec because previously-broken scenarios (96-98% fail) now do real HNSW Add work instead of fast nil-deref panics. Healthy tails: surge_fill_validate p50=28.9ms / p99=1.53s, playbook_record_replay p50=504ms / p99=2.32s — small-index rebuild kicking in under sustained churn, working as designed. Substrate fix scales beyond the 19.6k-scenario probe; closing the open thread. |

## Conclusion

Pre-fix (2026-05-01): 335,257 scenarios in 5min, 4/6 classes at 0% failure, 2/6 hit a coder/hnsw v0.6.1 nil-deref under playbook record churn. Operator recovery via DELETE + recreate.

Post-fix (2026-05-02): 132,211 scenarios in 5min @ conc=50, 6/6 classes at 0% failure. Throughput moved 1,115/sec → 438/sec because the formerly fast-failing scenarios are now doing real HNSW Add work — that's the honest cost of correctness, not a regression. The fix (i.vectors side-store + safeGraphAdd recover wrappers + small-index rebuild threshold of 32 + saveTask write coalescing) shifts vectord's source-of-truth out of coder/hnsw so panics can't lose data and the daemon recovers automatically.

This is the most production-shape test we've run. The harness mixes search, validator calls (in-process), HTTP cross-daemon round-trips, playbook recording, and cache exercise. The result is more honest than a single-endpoint load test, and post-fix all six workflows work cleanly at scale.