# Multi-tier load test — 100k workers, 6 scenarios, real validators

J's request: a much more sophisticated test using the 100k corpus
from the Rust legacy database, exercising the new EmailValidator +
FillValidator, plus profile-swap and other realistic coordinator
workflow scenarios.

## Setup

- **Corpus**: 100,000 workers from
  `/home/profit/lakehouse/data/datasets/workers_100k.parquet`,
  ingested into Go vectord via `staffing_workers -limit 100000`
  (~55 minutes). Index: `workers` on persistent stack, dim=768.
- **Persistent Go stack** on `:4110+:4211-:4219` (11 daemons,
  3-layer isolation from smoke harness).
- **Bun frontend** at `:3700` (not used by this test — direct hits to
  Go gateway).
- **Validator pool**: 200 in-process workers (`test-w-XXX` IDs)
  with matched city/state/role pairs across 35 unique combos.
- **Tool**: `scripts/cutover/multitier/main.go` — 6-scenario
  harness with weighted random scenario selection per goroutine.
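As a rough illustration of that per-goroutine weighted selection, here is a minimal Go sketch; the weights mirror the table below, but the types and function names are hypothetical rather than the actual `multitier/main.go` code:

```go
// Illustrative sketch only: weighted scenario selection, one draw per
// goroutine iteration. Names and types are hypothetical.
package main

import (
	"fmt"
	"math/rand"
)

type scenario struct {
	name   string
	weight int // percent; weights below sum to 100
}

var scenarios = []scenario{
	{"cold_search_email", 35},
	{"surge_fill_validate", 15},
	{"profile_swap", 15},
	{"repeat_cache", 15},
	{"sms_validate", 10},
	{"playbook_record_replay", 10},
}

// pickScenario draws one scenario with probability proportional to its weight.
func pickScenario(rng *rand.Rand) scenario {
	r := rng.Intn(100)
	for _, s := range scenarios {
		if r < s.weight {
			return s
		}
		r -= s.weight
	}
	return scenarios[len(scenarios)-1] // unreachable while weights sum to 100
}

func main() {
	// Each load goroutine would own its own rng and loop until the
	// -duration deadline; here we only show the draw distribution.
	rng := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 1_000_000; i++ {
		counts[pickScenario(rng).name]++
	}
	fmt.Println(counts) // roughly a 35/15/15/15/10/10 split
}
```

In the real harness each of the conc=50 goroutines presumably makes a draw like this at the top of every loop iteration, which is what produces the per-scenario run counts in the results table.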
## Six scenarios + weights

| Scenario | Weight | Steps | Validators |
|---|---:|---|---|
| `cold_search_email` | 35% | search → email outreach + validate | EmailValidator |
| `surge_fill_validate` | 15% | search → fill proposal (2 workers) → FillValidator → record | FillValidator |
| `profile_swap` | 15% | original search → swap with `ExcludeIDs` → no-overlap check | (none — substrate-only) |
| `repeat_cache` | 15% | same query × 5 → cache effectiveness measure | (none) |
| `sms_validate` | 10% | search → SMS draft (≤160 chars, contains phone for SSN false-positive test) → validate | EmailValidator (kind=sms) |
| `playbook_record_replay` | 10% | cold search → record → warm search w/ `use_playbook=true` | (none — exercises learning loop) |
## Results — sustained 5-minute run, conc=50

| Scenario | Runs | Fail% | p50 | p95 | p99 | max |
|---|---:|---:|---:|---:|---:|---:|
| `cold_search_email` | 117,406 | **0.0%** | 2.22ms | 5.37ms | 8.61ms | 452ms |
| `surge_fill_validate` | 50,091 | 98.8% | 5.02ms | 13.14ms | 44.02ms | 681ms |
| `profile_swap` | 50,263 | **0.0%** | 4.45ms | 9.65ms | 14.04ms | 461ms |
| `repeat_cache` | 50,576 | **0.0%** | 11.73ms | 21.03ms | 29.92ms | 453ms |
| `sms_validate` | 33,524 | **0.0%** | 2.13ms | 5.24ms | 8.48ms | 467ms |
| `playbook_record_replay` | 33,397 | 96.8% | 391ms | 477ms | 719ms | 1,018ms |
| **TOTAL** | **335,257** | — | — | — | — | — |

**1,115 scenarios per second** sustained over 5 minutes. **4 of 6
scenarios at 0% failure** across 251,769 successful workflows.

Cache effectiveness (repeat_cache scenario, 5 sequential queries
each): 50,576 × 5 = **252,880 cached searches**, all returning the
same top-K with no failures. The matrixd retrieve path scales fine
on the 100k corpus.
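A consistency check of that shape is easy to sketch in Go; `runSearch` here is a hypothetical stand-in for the harness's search call against the gateway, and only the compare-the-top-K logic is the point:

```go
// Sketch of the repeat_cache style check: same query five times, require
// an identical top-K every time. runSearch is a hypothetical stand-in for
// the harness's gateway search call.
package sketch

import (
	"fmt"
	"reflect"
)

func checkRepeatCache(runSearch func(query string, topK int) ([]string, error),
	query string, topK, repeats int) error {
	first, err := runSearch(query, topK)
	if err != nil {
		return err
	}
	for i := 1; i < repeats; i++ {
		got, err := runSearch(query, topK)
		if err != nil {
			return err
		}
		if !reflect.DeepEqual(got, first) {
			return fmt.Errorf("repeat %d returned a different top-%d than the first run", i, topK)
		}
	}
	return nil
}
```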
## Resource footprint at 100k corpus

| Daemon | CPU% | RSS | Note |
|---|---:|---:|---|
| persistent-vectord | 76% | **1.23GB** | linear with 100k vectors (vs 82MB at 5k) |
| persistent-matrixd | 75% | 26MB | bottleneck at conc=50+ (1 core pegged) |
| persistent-gateway | 30% | 26MB | proxy + auth |
| persistent-embedd | 21% | 97MB | embed cache + Ollama bridge |
| persistent-storaged | 11% | 82MB | rehydrate I/O active |
| (5 other daemons) | ~0% | ~25MB each | idle |
| **Total** | — | **~1.7GB** | |

Compare to Rust gateway under similar load: **14.9GB RSS**. Even at
100k workers, Go uses **~10× less memory** with explicit per-daemon
attribution.
## What the test exposed (substrate finding)

The two scenarios that hit `/v1/matrix/playbooks/record`
(surge_fill_validate, playbook_record_replay) failed at 96-98% rate.
Failure stack identified: **coder/hnsw v0.6.1 nil pointer in
`layerNode.search` (graph.go:95)** triggered during HNSW Add to the
small-state playbook_memory index.

**Reproduction:**
1. Empty playbook_memory index (length=0)
2. First record succeeds (length=1)
3. Subsequent record under concurrent load → coder/hnsw panics
4. Repeated concurrent records → index transitions through
   degenerate states where entry node is nil

**Root cause:** coder/hnsw v0.6.1 doesn't handle the len=0/1
edge case correctly when the graph has been Delete'd-then-Add'd.
The vectord wrapper has a partial guard (resets graph on len=1
during re-add) but doesn't catch every degenerate state.

**Workaround applied:** added a `recover()` guard in
`internal/vectord/index.go` BatchAdd — panics now return errors
instead of killing the request handler. Daemon stays up; clients
get HTTP 500 with a clear "DELETE the index to recover" hint.
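The guard itself is the standard Go move of turning a panic into an error at an API boundary; a minimal sketch of the idea follows, with a stand-in interface instead of the real coder/hnsw graph and hypothetical names rather than the actual `internal/vectord/index.go` code:

```go
// Minimal sketch of a recover() guard around a panicking Add. The Index
// type and graphAdder interface are placeholders, not the real vectord code.
package sketch

import "fmt"

type graphAdder interface {
	Add(id string, vec []float32)
}

type Index struct {
	graph graphAdder
}

// safeAdd converts a panic from the underlying Add into an ordinary error,
// so a degenerate index state surfaces as an HTTP 500 with a recovery hint
// instead of killing the request handler.
func (idx *Index) safeAdd(id string, vec []float32) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("hnsw add panicked for %q: %v (DELETE the index to recover)", id, r)
		}
	}()
	idx.graph.Add(id, vec)
	return nil
}
```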
**Operator recovery:** when `/v1/matrix/playbooks/record` starts
returning 500s, run:

```bash
curl -X DELETE http://localhost:4215/vectors/index/playbook_memory
```

Next record will recreate the index fresh.

**Proper fix (deferred):** either (a) upstream patch to coder/hnsw,
(b) write a different small-index Add path that always rebuilds
from scratch when len < threshold, or (c) switch playbook_memory
to a different vector store (Lance, or an in-memory map suited to
the playbook-corpus shape, since playbook entries are small).
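Option (b) is roughly the shape the eventual fix took (see the decisions tracker below: a side store plus a rebuild threshold of 32). A minimal sketch of that path, with hypothetical names; a plain map holds the source of truth and small indexes are rebuilt from it rather than mutated in place:

```go
// Hypothetical sketch of a small-index Add path (option b). Names echo the
// fix described later (side store + rebuild threshold of 32) but this is
// not the real internal/vectord code.
package sketch

const smallIndexRebuildThreshold = 32

// graph is a stand-in for the coder/hnsw graph; only Add is needed here.
type graph interface {
	Add(id string, vec []float32)
}

type index struct {
	vectors  map[string][]float32             // side store: panic-safe source of truth
	g        graph                            // derived structure, rebuildable at any time
	newGraph func(map[string][]float32) graph // builds a fresh graph from the side store
}

func (ix *index) add(id string, vec []float32) {
	ix.vectors[id] = vec // record the vector first so a later panic cannot lose it
	if len(ix.vectors) < smallIndexRebuildThreshold {
		// Small, churn-heavy indexes (playbook_memory) never take the
		// incremental path that hits the degenerate-state panic; rebuild
		// the whole graph from the side store instead.
		ix.g = ix.newGraph(ix.vectors)
		return
	}
	ix.g.Add(id, vec) // larger indexes: incremental Add behind the recover guard
}
```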
## What the test confirmed (production-readiness)

Across 335k scenarios in 5 minutes:

1. **Search at 100k corpus is fast** — p99 8.6ms on cold path,
   matching the 5k corpus characteristics. HNSW search is
   `O(log n)`, so 20× corpus growth barely registered
   (log₂100,000 ≈ 16.6 vs log₂5,000 ≈ 12.3, about 1.35× the depth).
2. **Validator integration works at load** — 117,406 EmailValidator
   passes in cold_search_email + 33,524 in sms_validate. The
   in-process validators don't bottleneck.
3. **Profile swap with ExcludeIDs is correct** — 50,263 swaps,
   zero overlap detected between original + swap result sets.
   The ExcludeIDs filter holds (sketched after this list).
4. **Embed cache effectiveness verified** — repeat_cache scenario
   (5 sequential queries each) yielded 252,880 cached searches
   with no failures and consistent latencies. Cache hit rate is
   high enough that 100k-corpus search costs match 5k-corpus
   search costs in p50.
5. **SMS-shape phone-number false-positive guard works** —
   33,524 SMS drafts containing "Call 555-123-4567" (phone shape
   that ALMOST matches SSN-shape NNN-NN-NNNN) all passed the
   EmailValidator's flanking-digit guard.
6. **Cross-daemon HTTP overhead is negligible** —
   matrixd→vectord→embedd round-trips at ~2-12ms p50 across
   scenarios.
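The no-overlap assertion behind item 3 is cheap to express; a minimal Go sketch follows, with hypothetical hit and request types rather than the real gateway payloads:

```go
// Sketch of the profile_swap no-overlap check. SearchHit and SearchRequest
// are hypothetical; the real request/response types live in the harness
// and gateway, not in this doc.
package sketch

type SearchHit struct{ WorkerID string }

type SearchRequest struct {
	Query      string
	TopK       int
	ExcludeIDs []string // IDs from the original search, excluded on the swap
}

// swapRequest builds the second search from the first result set.
func swapRequest(query string, topK int, original []SearchHit) SearchRequest {
	ids := make([]string, 0, len(original))
	for _, h := range original {
		ids = append(ids, h.WorkerID)
	}
	return SearchRequest{Query: query, TopK: topK, ExcludeIDs: ids}
}

// overlap returns worker IDs present in both result sets; the scenario
// passes only when this comes back empty.
func overlap(original, swapped []SearchHit) []string {
	seen := make(map[string]bool, len(original))
	for _, h := range original {
		seen[h.WorkerID] = true
	}
	var dup []string
	for _, h := range swapped {
		if seen[h.WorkerID] {
			dup = append(dup, h.WorkerID)
		}
	}
	return dup
}
```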
## What this DOES NOT cover

- **Real coordinator demand patterns** — request bodies rotated round-robin;
  real workloads have arrival-rate variability + burst clustering.
- **Multi-host horizontal scale** — single-machine load.
- **Sustained for hours** — 5-minute window; long-tail leaks
  (file handles, goroutine pools, MinIO connections) not tested.
- **Concurrent ingest + load** — the 100k ingest finished BEFORE
  the test ran. Mixed read/write at scale is a separate probe.
- **Real Bun frontend in path** — direct-to-Go for max throughput.
  Bun adds ~5× latency overhead per the earlier `g5_load_test.md`.
## Repro

```bash
# Stack must be up:
./scripts/cutover/start_go_stack.sh

# Ingest 100k workers (one-time, ~55 min):
./bin/staffing_workers -limit 100000 \
  -parquet /home/profit/lakehouse/data/datasets/workers_100k.parquet \
  -gateway http://127.0.0.1:4110 -drop=true

# Reset playbook_memory if it's in a degenerate state:
curl -X DELETE http://127.0.0.1:4215/vectors/index/playbook_memory

# Build + run multitier:
go build -o bin/multitier ./scripts/cutover/multitier
./bin/multitier -gateway http://127.0.0.1:4110 -concurrency 50 -duration 300s

# Stderr is parseable JSON for CI integration.
```
## Decisions tracker delta

Add to `docs/ARCHITECTURE_COMPARISON.md` Decisions tracker:

| Date | Decision | Effect |
|---|---|---|
| 2026-05-01 | playbook_record under load triggers coder/hnsw v0.6.1 nil-deref | **Recover guard added** in BatchAdd; daemon stays up. **Real fix open**: upstream patch OR small-index custom Add path OR alternate store. |
| 2026-05-01 (later) | **Real fix landed.** vectord lifts source-of-truth out of coder/hnsw via `i.vectors map[string][]float32` side store; `safeGraphAdd`/`safeGraphDelete` recover panics; warm-path Add falls back to rebuild on failure; `rebuildGraphLocked` reads from the panic-safe side map. | Re-ran multitier 60s/conc=50: **0 failures across 19,622 scenarios** (was 96-98% on 2/6). p50 on previously-failing scenarios moves 5ms (instant fail) → 551ms (real Add work — honest cost of correctness). Memory cost: ~2× for vectors. STATE_OF_PLAY captures the architecture invariant. |
| 2026-05-02 | **Full-scale verification.** Re-ran multitier at the original failure-surfacing footprint (5min @ conc=50). | Result: **132,211 scenarios at 438.5/sec, 0 failures across all 6 classes.** Throughput dropped from pre-fix 1,115/sec → 438/sec because previously-broken scenarios (96-98% fail) now do real HNSW Add work instead of fast nil-deref panics. Healthy tails: `surge_fill_validate` p50=28.9ms / p99=1.53s, `playbook_record_replay` p50=504ms / p99=2.32s — small-index rebuild kicking in under sustained churn, working as designed. **Substrate fix scales beyond the 19.6k-scenario probe; closing the open thread.** |
## Conclusion

**Pre-fix (2026-05-01):** 335,257 scenarios in 5min, 4/6 classes at 0%
failure, 2/6 hit a coder/hnsw v0.6.1 nil-deref under playbook record
churn. Operator recovery via DELETE + recreate.

**Post-fix (2026-05-02):** 132,211 scenarios in 5min @ conc=50,
**6/6 classes at 0% failure**. Throughput moved 1,115/sec → 438/sec
because the formerly fast-failing scenarios are now doing real HNSW
Add work — that's the honest cost of correctness, not a regression.
The fix (i.vectors side-store + safeGraphAdd recover wrappers +
small-index rebuild threshold of 32 + saveTask write coalescing)
shifts vectord's source-of-truth out of coder/hnsw so panics can't
lose data and the daemon recovers automatically.
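The saveTask coalescing piece is a plain debounce: many Add-triggered persistence requests inside a window collapse into one snapshot write. A minimal sketch of that pattern, with hypothetical names rather than the real vectord saveTask:

```go
// Sketch of write coalescing: repeated requestSave calls while a write is
// pending collapse into the one already scheduled. Names are hypothetical.
package sketch

import (
	"sync"
	"time"
)

type saver struct {
	mu        sync.Mutex
	scheduled bool
	window    time.Duration
	save      func() // the actual snapshot-to-disk call
}

// requestSave is called after every mutation; at most one write per window
// actually runs.
func (s *saver) requestSave() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.scheduled {
		return // a write is already pending; this request rides along with it
	}
	s.scheduled = true
	time.AfterFunc(s.window, func() {
		s.mu.Lock()
		s.scheduled = false
		s.mu.Unlock()
		s.save()
	})
}
```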
This is the most production-shape test we've run. The harness mixes
search, validator calls (in-process), HTTP cross-daemon round-trips,
playbook recording, and cache exercise. The result is more honest
than a single-endpoint load test, and post-fix all six workflows
work cleanly at scale.