commit 3a2823c02f: g5 cutover: bigger load test — 5.87M req, 0 errors, 370MB RSS
Larger-scale follow-up to the original load test. Three axis
expansions (corpus 200→5K workers, body variety 6→200 distinct
queries, concurrency sweep 10/50/100/200), plus a mixed
embed+search workload.

Concurrency sweep on /v1/matrix/search direct (3 min each):
  conc=10:  486,733 req  · 2,704 RPS · p50 2.19ms · p99 6.7ms
  conc=50:  1,148,543 req · 6,381 RPS · p50 7.08ms · p99 20ms
  conc=100: 1,253,389 req · 6,963 RPS · p50 13.34ms · p99 37ms
  conc=200: 1,460,676 req · 8,114 RPS · p50 23.45ms · p99 56ms

Mixed embed+search at 60 conc each, 90s:
  /v1/embed: 1,127,854 req · 12,531 RPS · p50 3.31ms · p99 14.6ms
  /v1/matrix/search: 392,229 req · 4,358 RPS · p50 12.68ms · p99 33.8ms

TOTAL: 5,869,424 requests across ~13.5 minutes. ZERO errors.

Resource footprint during peak load:
  matrixd  105% CPU, 33MB RSS (bottleneck — pegs 1 core)
  vectord   39% CPU, 82MB RSS
  gateway   44% CPU, 41MB RSS
  embedd    30% CPU, 67MB RSS
  Total RSS across 11 daemons: ~370MB

Compare to Rust gateway under similar load: 14.9GB RSS, 374% CPU.
Go uses ~40x less memory + spreads load across daemons rather
than packing into one mega-process.

Saturation analysis:
- conc 10→50: +135% RPS (strongest scaling, already sub-linear)
- conc 50→100: +9% RPS (saturation begins)
- conc 100→200: +17% RPS (matrixd 1-core pegged)

Headroom paths if production exceeds current demand:
1. Run multiple matrixd instances behind a load balancer.
   Substrate is stateless (recordings via storaged), so horizontal
   scale is straightforward.
2. Profile matrixd's per-request work (role-gate + judge-eligibility
   + result merge).
3. Skip Bun for hot endpoints (direct nginx → Go measured 5.7x
   faster previously).

Evidence: reports/cutover/g5_load_test_big.md (full tables +
methodology + repro script).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 05:18:00 -05:00

G5 cutover slice — bigger load test (5K corpus, 200 bodies, conc-sweep + mixed)

Larger-scale follow-up to g5_load_test.md. Three axis expansions, plus a new mixed workload:

  • Corpus: 200 → 5,000 workers
  • Body variety: 6 → 200 distinct queries (4 styles × 50 fill_events rows)
  • Concurrency sweep: 10 → 50 → 100 → 200, 3 minutes each
  • Mixed workload: parallel embed + search at 60 conc each, 90s

All requests hit the Go gateway directly on :4110/v1/matrix/search and :4110/v1/embed — no Bun frontend in this test (Bun adds proxy overhead, but it isn't the substrate's limit).

Setup

  • Persistent Go stack on :4110+:4211-:4219 (11 daemons, 11h+ uptime at test time)
  • Workers corpus: 5,000 rows from workers_500k.parquet (real production data)
  • Search bodies: 200 distinct queries via gen_real_queries -limit 50 -styles all (need / client_first / looking / shorthand × 50 = 200 unique inputs to stress the embed cache + matrix retrieve path)
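
Bodies are rotated round-robin across concurrent workers (also noted in the caveats near the end). A minimal sketch of that rotation pattern, with hypothetical names (loadgen's actual implementation isn't shown in this doc):

package main

import (
	"fmt"
	"sync/atomic"
)

// bodyRotator hands out request bodies round-robin across goroutines.
// With 200 bodies and millions of requests, every query is exercised
// evenly, which keeps the embed LRU permanently warm after round one.
type bodyRotator struct {
	bodies [][]byte
	next   atomic.Uint64 // safe for any number of concurrent workers
}

func (r *bodyRotator) Next() []byte {
	i := r.next.Add(1) - 1
	return r.bodies[i%uint64(len(r.bodies))]
}

func main() {
	r := &bodyRotator{bodies: [][]byte{
		[]byte(`{"query":"need a welder asap"}`),
		[]byte(`{"query":"client first: warehouse shift"}`),
	}}
	for i := 0; i < 4; i++ {
		fmt.Println(string(r.Next())) // cycles: body 0, body 1, body 0, body 1
	}
}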

Concurrency sweep — /v1/matrix/search direct

Conc    Duration   Requests    RPS     p50       p95       p99       max     errors
10      3m           486,733   2,704    2.19ms    2.99ms    6.72ms   651ms   0
50      3m         1,148,543   6,381    7.08ms   14.96ms   20.20ms    77ms   0
100     3m         1,253,389   6,963   13.34ms   27.44ms   36.96ms   182ms   0
200     3m         1,460,676   8,114   23.45ms   44.20ms   56.38ms   225ms   0
Total   12m        4,349,341                                                 0

Mixed workload — /v1/embed + /v1/matrix/search in parallel

Endpoint            Conc   Duration   Requests    RPS      p50       p95       p99
/v1/embed           60     90s        1,127,854   12,531    3.31ms   10.22ms   14.59ms
/v1/matrix/search   60     90s          392,229    4,358   12.68ms   26.08ms   33.78ms
Total               120    90s        1,520,083   16,889

Both endpoints compete for the same matrixd / vectord / embedd processes. Zero errors across 1.52M requests in 90s.

Resource footprint during load (peak observed)

Daemon                CPU%    RSS           Notes
persistent-matrixd    105%    33MB          bottleneck — pegging 1 core
persistent-gateway     44%    41MB          proxy + auth
persistent-vectord     39%    82MB          HNSW search
persistent-embedd      30%    67MB          embed cache + Ollama bridge
persistent-storaged    0.1%   22MB          idle (read-mostly)
(5 other daemons)      ~0%    ~25MB each    idle
Total                          ~370MB across all 11 daemons

Compare to the Rust gateway under similar load earlier today: 14.9GB RSS, 374% CPU. Go uses ~40× less memory, and where Rust packs everything into one mega-process, Go spreads the load across daemons (peak single-process CPU here was matrixd's 105% vs Rust's 374%).

Aggregate

Metric                            Value
Total requests                    5,869,424
Total wall time                   ~13.5 minutes
Errors (any kind)                 0
Peak RSS across all 11 daemons    ~370MB

5.87 million requests, zero errors. This is the substrate's production-readiness signal at scale.

What scales, what saturates

  • 10 → 50 conc: +135% RPS, +224% p50 latency. Sub-linear scaling is expected (Little's law + Go's GMP scheduler context-switching).
  • 50 → 100 conc: +9% RPS, +88% p50 latency. Saturation begins.
  • 100 → 200 conc: +17% RPS, +76% p50 latency. Saturation point: matrixd is at ~105% CPU (1 core pegged); doubling concurrency past 100 adds mostly queue depth, not throughput.
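
As a sanity check, Little's law (in-flight ≈ RPS × latency; p50 used as a rough stand-in for the mean) recovers the offered concurrency from the measured numbers:

  conc=10:  2,704 RPS × 2.19ms  ≈ 6 in flight   (well under 10: not yet queue-bound)
  conc=100: 6,963 RPS × 13.34ms ≈ 93 in flight  (≈ 100: the pipeline is full)
  conc=200: 8,114 RPS × 23.45ms ≈ 190 in flight (≈ 200: extra workers mostly sit in queue)

Once the recovered in-flight count tracks the offered concurrency, added workers buy latency rather than throughput.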

What saturates:

  • matrixd is the bottleneck at conc=100+. Pegs one CPU core.
  • vectord at 39% has headroom — HNSW search is fast.
  • embedd at 30% has headroom — cache hit rate is high (200 bodies × millions of requests means everything stays in 4096-cap LRU).
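
For intuition on that last point, a minimal sketch of a capacity-bounded LRU of the kind embedd implies (hypothetical names; embedd's real cache isn't shown here). With 200 distinct keys against a 4,096-entry cap, the eviction branch never runs, so every request after the first round is a hit:

package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key string
	vec []float32
}

// lruCache caches query → embedding with a hard capacity bound.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // query text -> element in order
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Get(key string) ([]float32, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).vec, true
}

func (c *lruCache) Put(key string, vec []float32) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).vec = vec
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		// Evict the least-recently-used entry. With 200 keys against
		// cap 4096 this branch is never reached: 100% hits after warm-up.
		back := c.order.Back()
		c.order.Remove(back)
		delete(c.items, back.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, vec: vec})
}

func main() {
	c := newLRU(4096)
	c.Put("need a welder", []float32{0.1, 0.2})
	if vec, ok := c.Get("need a welder"); ok {
		fmt.Println("hit:", vec)
	}
}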

Headroom paths (if you ever need more throughput):

  1. Run multiple matrixd instances behind a load balancer. Substrate is stateless (recordings persist via storaged), so horizontal scale is straightforward; a sketch follows this list.
  2. Optimize matrixd's per-request work (role-gate + judge-eligibility + result merge). The hot path could be profile-guided.
  3. Skip Bun for hot endpoints — direct nginx → Go measured 5.7× faster in the original load test.
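
A minimal sketch of path 1 using Go's stdlib reverse proxy, fanning /v1/matrix/search across two matrixd instances (ports and wiring are hypothetical; the real stack would supply these from its own config):

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return u
}

func main() {
	// Hypothetical: two matrixd instances; stateless, so any one can serve.
	backends := []*httputil.ReverseProxy{
		httputil.NewSingleHostReverseProxy(mustParse("http://127.0.0.1:4215")),
		httputil.NewSingleHostReverseProxy(mustParse("http://127.0.0.1:4225")),
	}
	var next uint64
	http.HandleFunc("/v1/matrix/search", func(w http.ResponseWriter, r *http.Request) {
		// Round-robin: an atomic counter picks the next backend.
		i := atomic.AddUint64(&next, 1) % uint64(len(backends))
		backends[i].ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":4115", nil))
}

Because recordings persist via storaged, no session state pins a query to one instance, which is what makes the round-robin safe.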

What this load test does NOT cover

  • Cold-cache embed — 200 bodies × LRU cap 4096 = 100% cache hit rate after the first round. Cold workloads (every query unique) would bottleneck on Ollama at ~30-50ms/embed (see the arithmetic after this list).
  • Sustained multi-hour load — the whole run was ~13.5 minutes. Memory/file-handle leaks would only surface over much longer windows.
  • Real coordinator demand patterns — bodies rotated round-robin; real workloads would have arrival-rate variability and burst patterns.
  • Cross-daemon failure — what happens if vectord crashes mid-query? Smoke tests cover restart, but in-flight failure recovery wasn't exercised.
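
The cold-cache bound in the first bullet is plain arithmetic: at ~40ms per embed, one sequential Ollama stream tops out near 1/0.040 ≈ 25 embeds/s, and N parallel streams only reach roughly 25·N RPS, versus the ~12,500 RPS the warm cache sustained here. A fully cold workload shifts the bottleneck from matrixd to Ollama.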

Repro

# Stack must be up:
./scripts/cutover/start_go_stack.sh

# Ingest 5K workers (~3 min):
./bin/staffing_workers -limit 5000 -gateway http://127.0.0.1:4110 -drop=true

# Generate 200-body search file:
go run ./scripts/cutover/gen_real_queries -limit 50 -styles all > /tmp/big_test_queries.txt
# (then convert each line to a JSON body in /tmp/big_search_bodies.txt — see this doc for the python3 conversion snippet)

# Concurrency sweep:
for conc in 10 50 100 200; do
  ./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
    -bodies-file /tmp/big_search_bodies.txt \
    -concurrency $conc -duration 180s
done

# Mixed:
./bin/loadgen -url http://127.0.0.1:4110/v1/embed \
  -bodies-file /tmp/embed_bodies.txt \
  -concurrency 60 -duration 90s &
./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
  -bodies-file /tmp/big_search_bodies.txt \
  -concurrency 60 -duration 90s &
wait

Conclusion

The Go substrate handled 5.87 million requests across ~13.5 minutes with zero errors, ~370MB total RSS, and matrixd as the visible bottleneck at concurrency 100+. That is production-ready at well above any staffing-domain demand level (<1 RPS typical per coordinator, <100 RPS even with hundreds of coordinators concurrent).

The matrixd-saturates pattern is operationally good news: you know exactly which daemon to scale first if/when demand grows past current levels. The substrate is well-shaped for horizontal growth.