Larger-scale follow-up to the original load test. Three axis expansions: corpus 200→5K workers, body variety 6→200 distinct queries, concurrency sweep 10/50/100/200, plus mixed embed+search workload.

Concurrency sweep on /v1/matrix/search direct (3 min each):
  conc=10:  486,733 req · 2,704 RPS · p50 2.19ms · p99 6.7ms
  conc=50:  1,148,543 req · 6,381 RPS · p50 7.08ms · p99 20ms
  conc=100: 1,253,389 req · 6,963 RPS · p50 13.34ms · p99 37ms
  conc=200: 1,460,676 req · 8,114 RPS · p50 23.45ms · p99 56ms

Mixed embed+search at 60 conc each, 90s:
  /v1/embed:         1,127,854 req · 12,531 RPS · p50 3.31ms · p99 14.6ms
  /v1/matrix/search: 392,229 req · 4,358 RPS · p50 12.68ms · p99 33.8ms

TOTAL: 5,869,424 requests across ~13.5 minutes. ZERO errors.

Resource footprint during peak load:
  matrixd 105% CPU, 33MB RSS (bottleneck, pegs 1 core)
  vectord 39% CPU, 82MB RSS
  gateway 44% CPU, 41MB RSS
  embedd  30% CPU, 67MB RSS
  Total RSS across 11 daemons: ~370MB

Compare to Rust gateway under similar load: 14.9GB RSS, 374% CPU. Go uses ~40x less memory + spreads load across daemons rather than packing into one mega-process.

Saturation analysis:
- conc 10→50: +135% RPS (linear-ish scaling)
- conc 50→100: +9% RPS (saturation begins)
- conc 100→200: +17% RPS (matrixd 1-core pegged)

Headroom paths if production exceeds current demand:
1. Run multiple matrixd instances behind a load balancer. Substrate is stateless (recordings via storaged), horizontal scale is straightforward.
2. Profile matrixd's per-request work (role-gate + judge-eligibility + result merge).
3. Skip Bun for hot endpoints (direct nginx → Go = 5.7x previously measured).

Evidence: reports/cutover/g5_load_test_big.md (full tables + methodology + repro script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G5 cutover slice — bigger load test (5K corpus, 200 bodies, conc-sweep + mixed)
Larger-scale follow-up to g5_load_test.md. Three axis expansions, plus a mixed workload:
- Corpus: 200 → 5,000 workers
- Body variety: 6 → 200 distinct queries (4 styles × 50 fill_events rows)
- Concurrency sweep: 10 → 50 → 100 → 200, 3 minutes each
- Mixed workload: parallel embed + search at 60 conc each, 90s
All hits are direct to Go gateway on :4110/v1/matrix/search and
:4110/v1/embed — no Bun frontend in this test (Bun adds proxy
overhead but isn't the substrate's limit).
Setup
- Persistent Go stack on :4110 + :4211-:4219 (11 daemons, 11h+ uptime at test time)
- Workers corpus: 5,000 rows from workers_500k.parquet (real production data)
- Search bodies: 200 distinct queries via gen_real_queries -limit 50 -styles all (need / client_first / looking / shorthand × 50 = 200 unique inputs to stress the embed cache + matrix retrieve path)
Concurrency sweep — /v1/matrix/search direct
| Conc | Duration | Requests | RPS | p50 | p95 | p99 | max | errors |
|---|---|---|---|---|---|---|---|---|
| 10 | 3m | 486,733 | 2,704 | 2.19ms | 2.99ms | 6.72ms | 651ms | 0 |
| 50 | 3m | 1,148,543 | 6,381 | 7.08ms | 14.96ms | 20.20ms | 77ms | 0 |
| 100 | 3m | 1,253,389 | 6,963 | 13.34ms | 27.44ms | 36.96ms | 182ms | 0 |
| 200 | 3m | 1,460,676 | 8,114 | 23.45ms | 44.20ms | 56.38ms | 225ms | 0 |
| Total | 12 min | 4,349,341 | — | — | — | — | — | 0 |
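The RPS column can be rechecked from the request counts. A minimal sketch, assuming each run lasted exactly the nominal 180 seconds; the reported figures differ from this by under 1 RPS, presumably because the measured elapsed time was not exactly 180s:

```python
# Request counts copied from the sweep table above.
SWEEP = {10: 486_733, 50: 1_148_543, 100: 1_253_389, 200: 1_460_676}
DURATION_S = 180  # nominal run length; the real elapsed time is an assumption


def rps(requests: int, duration_s: float) -> float:
    """Average requests per second over a fixed-duration run."""
    return requests / duration_s


sweep_rps = {conc: rps(n, DURATION_S) for conc, n in SWEEP.items()}
total_requests = sum(SWEEP.values())

for conc, value in sweep_rps.items():
    print(f"conc={conc:>3}: {value:,.1f} RPS")
print(f"total: {total_requests:,} requests")
```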
Mixed workload — /v1/embed + /v1/matrix/search in parallel
| Endpoint | Conc | Duration | Requests | RPS | p50 | p95 | p99 |
|---|---|---|---|---|---|---|---|
| /v1/embed | 60 | 90s | 1,127,854 | 12,531 | 3.31ms | 10.22ms | 14.59ms |
| /v1/matrix/search | 60 | 90s | 392,229 | 4,358 | 12.68ms | 26.08ms | 33.78ms |
| Total | 120 | 90s | 1,520,083 | 16,889 | — | — | — |
Both endpoints compete for the same matrixd / vectord / embedd processes. Zero errors across 1.52M requests in 90s.
Resource footprint during load (peak observed)
| Daemon | CPU% | RSS | Notes |
|---|---|---|---|
| persistent-matrixd | 105% | 33MB | bottleneck — pegging 1 core |
| persistent-gateway | 44% | 41MB | proxy + auth |
| persistent-vectord | 39% | 82MB | HNSW search |
| persistent-embedd | 30% | 67MB | embed cache + Ollama bridge |
| persistent-storaged | 0.1% | 22MB | idle (read-mostly) |
| (5 other daemons) | ~0% | ~25MB each | idle |
| Total | — | ~370MB | across all 11 daemons |
Compare to the Rust gateway during similar load earlier today: 14.9GB RSS, 374% CPU. Go uses ~40× less memory (14.9GB vs ~370MB), and its busiest process (matrixd at 105%) draws roughly 3.6× less CPU than the single Rust process. Go spreads load across daemons; Rust packs it into one mega-process.
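The ratios behind this comparison can be rechecked from the table. A small sketch using only the figures reported here; it assumes 1GB = 1000MB and reads "CPU concentration" as the busiest single process, which is an interpretation rather than something the measurements state:

```python
# Figures copied from the resource table and the Rust comparison.
GO_TOTAL_RSS_MB = 370            # approximate total across all 11 daemons
RUST_RSS_MB = 14.9 * 1000        # 14.9GB, assuming 1GB = 1000MB
GO_CPU_PCT = {"matrixd": 105, "gateway": 44, "vectord": 39, "embedd": 30}
RUST_CPU_PCT = 374

mem_ratio = RUST_RSS_MB / GO_TOTAL_RSS_MB
busiest_ratio = RUST_CPU_PCT / max(GO_CPU_PCT.values())
go_cpu_total = sum(GO_CPU_PCT.values())  # the four active daemons only

print(f"memory: ~{mem_ratio:.0f}x less in Go")
print(f"busiest process: ~{busiest_ratio:.1f}x less CPU in Go")
print(f"Go CPU summed over active daemons: {go_cpu_total}%")
```

Summed over the four active daemons, Go's CPU use is closer to the Rust figure than the per-process ratio suggests; the difference is that no single Go process carries it all.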
Aggregate
| Metric | Value |
|---|---|
| Total requests | 5,869,424 |
| Total wall time | ~13.5 minutes |
| Errors (any kind) | 0 |
| Peak RSS across all 11 daemons | ~370MB |
5.87 million requests, zero errors. This is the substrate's production-readiness signal at scale.
What scales, what saturates
RPS scaling (search):
- 10 → 50 conc: +135% RPS, +224% p50 latency. Sub-linear scaling is expected (Little's law + Go's GMP scheduler context-switching).
- 50 → 100 conc: +9% RPS, +88% p50 latency. Saturation begins.
- 100 → 200 conc: +17% RPS, +76% p50 latency. matrixd is at ~105% CPU (1 core pegged), so doubling concurrency past 100 mostly adds queue depth rather than throughput.
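The step-to-step changes and the Little's law reading can be reproduced from the sweep table. A rough sketch: it uses the rounded table values, so the percentages land within about a point of the figures listed, and it uses p50 as a stand-in for mean latency, which is an assumption (the mean is not reported):

```python
# (concurrency, RPS, p50 latency in ms) copied from the sweep table.
RUNS = [(10, 2704, 2.19), (50, 6381, 7.08), (100, 6963, 13.34), (200, 8114, 23.45)]


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def in_flight(rps: float, latency_ms: float) -> float:
    """Little's law: mean requests in flight = throughput x mean time in system."""
    return rps * latency_ms / 1000.0


steps = [
    (a[0], b[0], pct_change(a[1], b[1]), pct_change(a[2], b[2]))
    for a, b in zip(RUNS, RUNS[1:])
]
for lo, hi, d_rps, d_p50 in steps:
    print(f"{lo:>3} -> {hi:<3}: RPS {d_rps:+.1f}%, p50 {d_p50:+.1f}%")
for conc, run_rps, p50 in RUNS:
    print(f"conc={conc:>3}: ~{in_flight(run_rps, p50):.0f} requests in flight")
```

From conc=50 upward the estimate sits close to the configured concurrency, which is what a fixed-concurrency load generator produces once the server is the limit: extra clients show up as latency rather than throughput.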
What saturates:
- matrixd is the bottleneck at conc=100+. Pegs one CPU core.
- vectord at 39% has headroom — HNSW search is fast.
- embedd at 30% has headroom: the cache hit rate is high, because only 200 distinct bodies are requested and all of them stay resident in the 4096-cap LRU.
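The cache effect on embedd can be illustrated with a generic LRU simulation. This is a sketch of the general technique only; embedd's actual cache implementation is not shown in this report, and the `LRUCache` class and `hit_rate` helper below are hypothetical:

```python
from collections import OrderedDict


class LRUCache:
    """Generic least-recently-used cache that only tracks key presence."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def lookup(self, key) -> bool:
        """Return True on a hit; on a miss, insert the key and evict the oldest if full."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return True
        self._items[key] = True
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop least recently used
        return False


def hit_rate(keys, capacity: int = 4096) -> float:
    cache = LRUCache(capacity)
    hits = sum(cache.lookup(k) for k in keys)
    return hits / len(keys)


warm = [i % 200 for i in range(100_000)]  # 200 bodies rotated round-robin
cold = list(range(100_000))               # every query unique

print(f"rotating 200 bodies: {hit_rate(warm):.3f}")
print(f"all-unique queries:  {hit_rate(cold):.3f}")
```

With 200 rotating keys the only misses are the first 200 lookups; with all-unique keys nothing ever hits, which is the cold-cache case this test does not exercise.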
Headroom paths (if you ever need more throughput):
- Run multiple matrixd instances behind a load balancer. Substrate is stateless (recordings persist via storaged), so horizontal scale is straightforward.
- Optimize matrixd's per-request work: role-gate + judge-eligibility + result merge. The hot path is a candidate for profiling.
- Skip Bun for hot endpoints — direct nginx → Go shaved 5.7× from the original load test.
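For the first path, the simplest balancing policy is round-robin across instances, which works when instances hold no per-request state. A minimal sketch of that policy only; the `RoundRobin` class and the upstream addresses are hypothetical, and a real deployment would more likely rely on an existing load balancer:

```python
import itertools


class RoundRobin:
    """Hand out upstream addresses in a fixed rotation."""

    def __init__(self, upstreams):
        upstreams = list(upstreams)
        if not upstreams:
            raise ValueError("at least one upstream is required")
        self._cycle = itertools.cycle(upstreams)

    def pick(self) -> str:
        return next(self._cycle)


# Hypothetical addresses for three matrixd instances (not from this report).
balancer = RoundRobin(["127.0.0.1:5001", "127.0.0.1:5002", "127.0.0.1:5003"])
picks = [balancer.pick() for _ in range(6)]
print(picks)
```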
What this load test does NOT cover
- Cold-cache embed: the 200 bodies fit easily within the 4096-entry LRU, so the cache hit rate is effectively 100% after the first round. Cold workloads (every query unique) would bottleneck on Ollama at ~30-50ms/embed.
- Sustained for hours: the longest continuous run was 3 minutes (12 minutes of search sweep in total, plus 90s mixed). Memory or file-handle leaks would only surface over multi-hour runs.
- Real coordinator demand patterns — bodies rotated round-robin; real workloads would have arrival-rate variability and burst patterns.
- Cross-daemon failure — what happens if vectord crashes mid-query? Smoke tests cover restart, but in-flight failure recovery wasn't exercised.
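On the demand-pattern gap: a fixed-concurrency generator keeps a constant number of requests in flight, while real traffic arrives at irregular intervals. One common way to model that is a Poisson arrival process. The sketch below only generates arrival timestamps; the `poisson_arrivals` helper is hypothetical and the 100 RPS rate is illustrative, not a measured arrival pattern:

```python
import random


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0):
    """Return sorted arrival times (seconds) with exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    while True:
        t += rng.expovariate(rate_per_s)  # mean gap is 1 / rate_per_s
        if t >= duration_s:
            return times
        times.append(t)


arrivals = poisson_arrivals(rate_per_s=100, duration_s=10)

# Bucket arrivals per second to make the burstiness visible.
per_second = [0] * 10
for t in arrivals:
    per_second[int(t)] += 1
print(per_second)
```

Even at a steady average rate, the per-second counts vary, so a replay driven by these timestamps would exercise queueing behaviour that round-robin rotation at fixed concurrency does not.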
Repro
# Stack must be up:
./scripts/cutover/start_go_stack.sh
# Ingest 5K workers (~3 min):
./bin/staffing_workers -limit 5000 -gateway http://127.0.0.1:4110 -drop=true
# Generate 200-body search file:
go run ./scripts/cutover/gen_real_queries -limit 50 -styles all > /tmp/big_test_queries.txt
# (then convert to JSON bodies — see this doc for the python3 conversion snippet)
# Concurrency sweep:
for conc in 10 50 100 200; do
  ./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
    -bodies-file /tmp/big_search_bodies.txt \
    -concurrency "$conc" -duration 180s
done
# Mixed:
./bin/loadgen -url http://127.0.0.1:4110/v1/embed \
  -bodies-file /tmp/embed_bodies.txt \
  -concurrency 60 -duration 90s &
./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
  -bodies-file /tmp/big_search_bodies.txt \
  -concurrency 60 -duration 90s &
wait
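The conversion step between the generated query list and the bodies file is not spelled out above. A hedged sketch of what such a step might look like, assuming one query per input line and one JSON object per output line; the `to_json_bodies` helper is hypothetical, and the field name is a placeholder because the real /v1/matrix/search request schema is not shown in this report:

```python
import json


def to_json_bodies(lines, field: str):
    """Wrap each non-empty line in a one-field JSON object, one body per line."""
    bodies = []
    for line in lines:
        text = line.strip()
        if text:
            bodies.append(json.dumps({field: text}, ensure_ascii=False))
    return bodies


# Made-up sample queries; "query" is a placeholder field name, not the real schema.
sample = ["need 3 forklift operators tomorrow", "", "  2 welders, night shift  "]
for body in to_json_bodies(sample, field="query"):
    print(body)
```

In use, the input would be the lines of /tmp/big_test_queries.txt and the output would be written to /tmp/big_search_bodies.txt, with the field name swapped for whatever the endpoint actually expects.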
Conclusion
The Go substrate handled 5.87 million requests in ~13.5 minutes with zero errors, ~370MB total RSS, and matrixd as the visible bottleneck at concurrency 100 and above. That throughput is well above any staffing-domain demand level (<1 RPS typical per coordinator, <100 RPS even with hundreds of coordinators concurrent), so the substrate is production-ready for current demand.
The matrixd-saturates pattern is operationally good news: you know exactly which daemon to scale first if/when you grow past current demand. Substrate is well-shaped for horizontal growth.