commit 3a2823c02f: g5 cutover: bigger load test — 5.87M req, 0 errors, 370MB RSS
Larger-scale follow-up to the original load test. Three axis
expansions (corpus 200→5K workers, body variety 6→200 distinct
queries, concurrency sweep 10/50/100/200), plus a mixed
embed+search workload.

Concurrency sweep on /v1/matrix/search direct (3 min each):
  conc=10:  486,733 req  · 2,704 RPS · p50 2.19ms · p99 6.7ms
  conc=50:  1,148,543 req · 6,381 RPS · p50 7.08ms · p99 20ms
  conc=100: 1,253,389 req · 6,963 RPS · p50 13.34ms · p99 37ms
  conc=200: 1,460,676 req · 8,114 RPS · p50 23.45ms · p99 56ms

Mixed embed+search at 60 conc each, 90s:
  /v1/embed: 1,127,854 req · 12,531 RPS · p50 3.31ms · p99 14.6ms
  /v1/matrix/search: 392,229 req · 4,358 RPS · p50 12.68ms · p99 33.8ms

TOTAL: 5,869,424 requests across ~13.5 minutes. ZERO errors.

Resource footprint during peak load:
  matrixd  105% CPU, 33MB RSS (bottleneck — pegs 1 core)
  vectord   39% CPU, 82MB RSS
  gateway   44% CPU, 41MB RSS
  embedd    30% CPU, 67MB RSS
  Total RSS across 11 daemons: ~370MB

Compare to Rust gateway under similar load: 14.9GB RSS, 374% CPU.
Go uses ~40x less memory + spreads load across daemons rather
than packing into one mega-process.

Saturation analysis:
- conc 10→50: +135% RPS (strongest scaling, already sub-linear)
- conc 50→100: +9% RPS (saturation begins)
- conc 100→200: +17% RPS (matrixd 1-core pegged)

Headroom paths if production exceeds current demand:
1. Run multiple matrixd instances behind a load balancer.
   Substrate is stateless (recordings via storaged), so horizontal
   scale is straightforward.
2. Profile matrixd's per-request work (role-gate + judge-eligibility
   + result merge).
3. Skip Bun for hot endpoints (direct nginx → Go measured 5.7x
   faster previously).

Evidence: reports/cutover/g5_load_test_big.md (full tables +
methodology + repro script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 05:18:00 -05:00

G5 cutover slice — bigger load test (5K corpus, 200 bodies, conc-sweep + mixed)

Larger-scale follow-up to g5_load_test.md. Three axis expansions, plus a new mixed workload:

  • Corpus: 200 → 5,000 workers
  • Body variety: 6 → 200 distinct queries (4 styles × 50 fill_events rows)
  • Concurrency sweep: 10 → 50 → 100 → 200, 3 minutes each
  • Mixed workload: parallel embed + search at 60 conc each, 90s

All requests hit the Go gateway directly on :4110/v1/matrix/search and :4110/v1/embed — no Bun frontend in this test (Bun adds proxy overhead, but it isn't the substrate's limit).

Setup

  • Persistent Go stack on :4110+:4211-:4219 (11 daemons, 11h+ uptime at test time)
  • Workers corpus: 5,000 rows from workers_500k.parquet (real production data)
  • Search bodies: 200 distinct queries via gen_real_queries -limit 50 -styles all (need / client_first / looking / shorthand × 50 = 200 unique inputs to stress the embed cache + matrix retrieve path)
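
Bodies are rotated round-robin across concurrent workers (also noted in the caveats near the end). A minimal sketch of that rotation pattern, with hypothetical names (loadgen's actual implementation isn't shown in this doc):

package main

import (
	"fmt"
	"sync/atomic"
)

// bodyRotator hands out request bodies round-robin across goroutines.
// With 200 bodies and millions of requests, every query is exercised
// evenly, which keeps the embed LRU permanently warm after round one.
type bodyRotator struct {
	bodies [][]byte
	next   atomic.Uint64 // safe for any number of concurrent workers
}

func (r *bodyRotator) Next() []byte {
	i := r.next.Add(1) - 1
	return r.bodies[i%uint64(len(r.bodies))]
}

func main() {
	r := &bodyRotator{bodies: [][]byte{
		[]byte(`{"query":"need a welder asap"}`),
		[]byte(`{"query":"client first: warehouse shift"}`),
	}}
	for i := 0; i < 4; i++ {
		fmt.Println(string(r.Next())) // cycles: body 0, body 1, body 0, body 1
	}
}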

Concurrency sweep — /v1/matrix/search direct

Conc    Duration   Requests    RPS     p50       p95       p99       max     errors
10      3m           486,733   2,704    2.19ms    2.99ms    6.72ms   651ms   0
50      3m         1,148,543   6,381    7.08ms   14.96ms   20.20ms    77ms   0
100     3m         1,253,389   6,963   13.34ms   27.44ms   36.96ms   182ms   0
200     3m         1,460,676   8,114   23.45ms   44.20ms   56.38ms   225ms   0
Total   12m        4,349,341                                                 0

Mixed workload — /v1/embed + /v1/matrix/search in parallel

Endpoint            Conc   Duration   Requests    RPS      p50       p95       p99
/v1/embed           60     90s        1,127,854   12,531    3.31ms   10.22ms   14.59ms
/v1/matrix/search   60     90s          392,229    4,358   12.68ms   26.08ms   33.78ms
Total               120    90s        1,520,083   16,889

Both endpoints compete for the same matrixd / vectord / embedd processes. Zero errors across 1.52M requests in 90s.

Resource footprint during load (peak observed)

Daemon                CPU%    RSS           Notes
persistent-matrixd    105%    33MB          bottleneck — pegging 1 core
persistent-gateway     44%    41MB          proxy + auth
persistent-vectord     39%    82MB          HNSW search
persistent-embedd      30%    67MB          embed cache + Ollama bridge
persistent-storaged    0.1%   22MB          idle (read-mostly)
(5 other daemons)      ~0%    ~25MB each    idle
Total                          ~370MB across all 11 daemons

Compare to the Rust gateway under similar load earlier today: 14.9GB RSS, 374% CPU. Go uses ~40× less memory, and where Rust packs everything into one mega-process, Go spreads the load across daemons (peak single-process CPU here was matrixd's 105% vs Rust's 374%).

Aggregate

Metric                            Value
Total requests                    5,869,424
Total wall time                   ~13.5 minutes
Errors (any kind)                 0
Peak RSS across all 11 daemons    ~370MB

5.87 million requests, zero errors. This is the substrate's production-readiness signal at scale.

What scales, what saturates

  • 10 → 50 conc: +135% RPS, +224% p50 latency. Sub-linear scaling is expected (Little's law + Go's GMP scheduler context-switching).
  • 50 → 100 conc: +9% RPS, +88% p50 latency. Saturation begins.
  • 100 → 200 conc: +17% RPS, +76% p50 latency. Saturation point: matrixd is at ~105% CPU (1 core pegged); doubling concurrency past 100 adds mostly queue depth, not throughput.
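
As a sanity check, Little's law (in-flight ≈ RPS × latency; p50 used as a rough stand-in for the mean) recovers the offered concurrency from the measured numbers:

  conc=10:  2,704 RPS × 2.19ms  ≈ 6 in flight   (well under 10: not yet queue-bound)
  conc=100: 6,963 RPS × 13.34ms ≈ 93 in flight  (≈ 100: the pipeline is full)
  conc=200: 8,114 RPS × 23.45ms ≈ 190 in flight (≈ 200: extra workers mostly sit in queue)

Once the recovered in-flight count tracks the offered concurrency, added workers buy latency rather than throughput.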

What saturates:

  • matrixd is the bottleneck at conc=100+. Pegs one CPU core.
  • vectord at 39% has headroom — HNSW search is fast.
  • embedd at 30% has headroom — cache hit rate is high (200 bodies × millions of requests means everything stays in 4096-cap LRU).
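
For intuition on that last point, a minimal sketch of a capacity-bounded LRU of the kind embedd implies (hypothetical names; embedd's real cache isn't shown here). With 200 distinct keys against a 4,096-entry cap, the eviction branch never runs, so every request after the first round is a hit:

package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key string
	vec []float32
}

// lruCache caches query → embedding with a hard capacity bound.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // query text -> element in order
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Get(key string) ([]float32, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).vec, true
}

func (c *lruCache) Put(key string, vec []float32) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).vec = vec
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		// Evict the least-recently-used entry. With 200 keys against
		// cap 4096 this branch is never reached: 100% hits after warm-up.
		back := c.order.Back()
		c.order.Remove(back)
		delete(c.items, back.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, vec: vec})
}

func main() {
	c := newLRU(4096)
	c.Put("need a welder", []float32{0.1, 0.2})
	if vec, ok := c.Get("need a welder"); ok {
		fmt.Println("hit:", vec)
	}
}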

Headroom paths (if you ever need more throughput):

  1. Run multiple matrixd instances behind a load balancer. Substrate is stateless (recordings persist via storaged), so horizontal scale is straightforward; a sketch follows this list.
  2. Optimize matrixd's per-request work (role-gate + judge-eligibility + result merge). The hot path could be profile-guided.
  3. Skip Bun for hot endpoints — direct nginx → Go measured 5.7× faster in the original load test.
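
A minimal sketch of path 1 using Go's stdlib reverse proxy, fanning /v1/matrix/search across two matrixd instances (ports and wiring are hypothetical; the real stack would supply these from its own config):

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return u
}

func main() {
	// Hypothetical: two matrixd instances; stateless, so any one can serve.
	backends := []*httputil.ReverseProxy{
		httputil.NewSingleHostReverseProxy(mustParse("http://127.0.0.1:4215")),
		httputil.NewSingleHostReverseProxy(mustParse("http://127.0.0.1:4225")),
	}
	var next uint64
	http.HandleFunc("/v1/matrix/search", func(w http.ResponseWriter, r *http.Request) {
		// Round-robin: an atomic counter picks the next backend.
		i := atomic.AddUint64(&next, 1) % uint64(len(backends))
		backends[i].ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":4115", nil))
}

Because recordings persist via storaged, no session state pins a query to one instance, which is what makes the round-robin safe.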

What this load test does NOT cover

  • Cold-cache embed — 200 bodies × LRU cap 4096 = 100% cache hit rate after the first round. Cold workloads (every query unique) would bottleneck on Ollama at ~30-50ms/embed (see the arithmetic after this list).
  • Sustained multi-hour load — the whole run was ~13.5 minutes. Memory/file-handle leaks would only surface over much longer windows.
  • Real coordinator demand patterns — bodies rotated round-robin; real workloads would have arrival-rate variability and burst patterns.
  • Cross-daemon failure — what happens if vectord crashes mid-query? Smoke tests cover restart, but in-flight failure recovery wasn't exercised.
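
The cold-cache bound in the first bullet is plain arithmetic: at ~40ms per embed, one sequential Ollama stream tops out near 1/0.040 ≈ 25 embeds/s, and N parallel streams only reach roughly 25·N RPS, versus the ~12,500 RPS the warm cache sustained here. A fully cold workload shifts the bottleneck from matrixd to Ollama.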

Repro

# Stack must be up:
./scripts/cutover/start_go_stack.sh

# Ingest 5K workers (~3 min):
./bin/staffing_workers -limit 5000 -gateway http://127.0.0.1:4110 -drop=true

# Generate 200-body search file:
go run ./scripts/cutover/gen_real_queries -limit 50 -styles all > /tmp/big_test_queries.txt
# (then convert each line to a JSON body in /tmp/big_search_bodies.txt — see this doc for the python3 conversion snippet)

# Concurrency sweep:
for conc in 10 50 100 200; do
  ./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
    -bodies-file /tmp/big_search_bodies.txt \
    -concurrency $conc -duration 180s
done

# Mixed:
./bin/loadgen -url http://127.0.0.1:4110/v1/embed \
  -bodies-file /tmp/embed_bodies.txt \
  -concurrency 60 -duration 90s &
./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
  -bodies-file /tmp/big_search_bodies.txt \
  -concurrency 60 -duration 90s &
wait

Conclusion

The Go substrate handled 5.87 million requests across ~13.5 minutes with zero errors, ~370MB total RSS, and matrixd as the visible bottleneck at concurrency 100+. That is production-ready at well above any staffing-domain demand level (<1 RPS typical per coordinator, <100 RPS even with hundreds of coordinators concurrent).

The matrixd-saturates pattern is operationally good news: you know exactly which daemon to scale first if/when demand grows past current levels. The substrate is well-shaped for horizontal growth.