# G5 cutover slice — bigger load test (5K corpus, 200 bodies, conc-sweep + mixed)

Larger-scale follow-up to `g5_load_test.md`. Three axis expansions plus a mixed workload:

- **Corpus**: 200 → 5,000 workers
- **Body variety**: 6 → 200 distinct queries (4 styles × 50 fill_events rows)
- **Concurrency sweep**: 10 → 50 → 100 → 200, 3 minutes each
- **Mixed workload**: parallel embed + search at 60 conc each, 90s

All requests hit the Go gateway directly on `:4110/v1/matrix/search` and `:4110/v1/embed` — no Bun frontend in this test (Bun adds proxy overhead but isn't the substrate's limit).

## Setup

- Persistent Go stack on `:4110+:4211-:4219` (11 daemons, 11h+ uptime at test time)
- Workers corpus: 5,000 rows from `workers_500k.parquet` (real production data)
- Search bodies: 200 distinct queries via `gen_real_queries -limit 50 -styles all` (need / client_first / looking / shorthand × 50 = 200 unique inputs to stress the embed cache + matrix retrieve path)

## Concurrency sweep — `/v1/matrix/search` direct

| Conc | Duration | Requests | RPS | p50 | p95 | p99 | max | errors |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 10 | 3m | 486,733 | **2,704** | 2.19ms | 2.99ms | 6.72ms | 651ms | **0** |
| 50 | 3m | 1,148,543 | **6,381** | 7.08ms | 14.96ms | 20.20ms | 77ms | **0** |
| 100 | 3m | 1,253,389 | **6,963** | 13.34ms | 27.44ms | 36.96ms | 182ms | **0** |
| 200 | 3m | 1,460,676 | **8,114** | 23.45ms | 44.20ms | 56.38ms | 225ms | **0** |
| **Total** | **12 min** | **4,349,341** | — | — | — | — | — | **0** |

## Mixed workload — `/v1/embed` + `/v1/matrix/search` in parallel

| Endpoint | Conc | Duration | Requests | RPS | p50 | p95 | p99 |
|---|---:|---:|---:|---:|---:|---:|---:|
| `/v1/embed` | 60 | 90s | 1,127,854 | 12,531 | 3.31ms | 10.22ms | 14.59ms |
| `/v1/matrix/search` | 60 | 90s | 392,229 | 4,358 | 12.68ms | 26.08ms | 33.78ms |
| **Total** | **120** | **90s** | **1,520,083** | **16,889** | — | — | — |

Both endpoints competed for the same matrixd / vectord / embedd processes. Zero errors across 1.52M requests in 90s.

## Resource footprint during load (peak observed)

| Daemon | CPU% | RSS | Role |
|---|---:|---:|---|
| persistent-matrixd | 105% | 33MB | bottleneck — pegging 1 core |
| persistent-gateway | 44% | 41MB | proxy + auth |
| persistent-vectord | 39% | 82MB | HNSW search |
| persistent-embedd | 30% | 67MB | embed cache + Ollama bridge |
| persistent-storaged | 0.1% | 22MB | idle (read-mostly) |
| (5 other daemons) | ~0% | ~25MB each | idle |
| **Total** | — | **~370MB** | across all 11 daemons |

Compare to the Rust gateway under similar load earlier today: **14.9GB RSS, 374% CPU**. **Go uses ~40× less memory** and roughly 4× less CPU concentration (Go spreads load across daemons; Rust packs everything into one mega-process).

## Aggregate

| Metric | Value |
|---|---:|
| Total requests | **5,869,424** |
| Total wall time | ~13.5 minutes |
| Errors (any kind) | **0** |
| Peak RSS across all 11 daemons | ~370MB |

**5.87 million requests, zero errors.** This is the substrate's production-readiness signal at scale.

## What scales, what saturates

### RPS scaling (search):

- 10 → 50 conc: +135% RPS, +224% p50 latency. Sub-linear scaling is expected (Little's law + Go's GMP scheduler context-switching; see the worked check after this list).
- 50 → 100 conc: +9% RPS, +88% p50 latency. **Saturation begins.**
- 100 → 200 conc: +17% RPS, +76% p50 latency. **Saturation point**: matrixd is at ~105% CPU (1 core pegged); doubling concurrency past 100 adds queue depth, not throughput.
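A quick consistency check against Little's law, using the sweep numbers above. A closed-loop load generator at fixed concurrency N with no think time keeps exactly N requests in flight, and once throughput is pinned at its maximum, mean latency has to grow linearly with N. Note the law constrains the *mean* latency, not p50, so agreement with the p50 column is approximate:

```latex
% Little's law: in-flight requests = throughput x mean latency
L = \lambda \, \bar{W}
% Closed-loop generator at fixed concurrency N with no think time: L = N.
% Past saturation, \lambda is pinned near \lambda_{\max}, so:
\bar{W} \approx N / \lambda_{\max}
% Check against the sweep (p50 shown for comparison; the law uses the mean):
N = 100:\quad \bar{W} = \tfrac{100}{6963\,\mathrm{req/s}} \approx 14.4\,\mathrm{ms} \quad (\text{p50: } 13.34\,\mathrm{ms})
N = 200:\quad \bar{W} = \tfrac{200}{8114\,\mathrm{req/s}} \approx 24.6\,\mathrm{ms} \quad (\text{p50: } 23.45\,\mathrm{ms})
```

The close match is exactly the "queue depth, not throughput" pattern: past conc=100, added concurrency is spent waiting on matrixd.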
### What saturates:

- **matrixd** is the bottleneck at conc=100+. Pegs one CPU core.
- **vectord** at 39% has headroom — HNSW search is fast.
- **embedd** at 30% has headroom — the cache hit rate is high (200 bodies × millions of requests means everything stays in the 4096-cap LRU).

### Headroom paths (if you ever need more throughput):

1. **Run multiple matrixd instances** behind a load balancer. The substrate is stateless (recordings persist via storaged), so horizontal scaling is straightforward (see the sketch after the conclusion).
2. **Optimize matrixd's per-request work** — role-gate + judge-eligibility + result merge. The hot path could be profile-guided.
3. **Skip Bun for hot endpoints** — going direct (nginx → Go) was a 5.7× improvement in the original load test.

## What this load test does NOT cover

- **Cold-cache embed** — 200 bodies × LRU cap 4096 = 100% cache hit rate after the first round. Cold workloads (every query unique) would bottleneck on Ollama at ~30-50ms/embed.
- **Sustained load for hours** — 12 minutes per endpoint. Memory/file-handle leaks would only surface over multi-hour runs.
- **Real coordinator demand patterns** — bodies rotated round-robin; real workloads would have arrival-rate variability and burst patterns.
- **Cross-daemon failure** — what happens if vectord crashes mid-query? Smoke tests cover restart, but in-flight failure recovery wasn't exercised.

## Repro

```bash
# Stack must be up:
./scripts/cutover/start_go_stack.sh

# Ingest 5K workers (~3 min):
./bin/staffing_workers -limit 5000 -gateway http://127.0.0.1:4110 -drop=true

# Generate 200-body search file:
go run ./scripts/cutover/gen_real_queries -limit 50 -styles all > /tmp/big_test_queries.txt
# (then convert to JSON bodies — see this doc for the python3 conversion snippet)

# Concurrency sweep:
for conc in 10 50 100 200; do
  ./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
    -bodies-file /tmp/big_search_bodies.txt \
    -concurrency $conc -duration 180s
done

# Mixed:
./bin/loadgen -url http://127.0.0.1:4110/v1/embed \
  -bodies-file /tmp/embed_bodies.txt \
  -concurrency 60 -duration 90s &
./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
  -bodies-file /tmp/big_search_bodies.txt \
  -concurrency 60 -duration 90s &
wait
```

## Conclusion

The Go substrate handled 5.87 million requests across ~13.5 minutes with zero errors, ~370MB total RSS, and matrixd as the visible bottleneck at concurrency 100+. That is production-ready well above any staffing-domain demand level (<1 RPS typical per coordinator; <100 RPS even with hundreds of coordinators concurrent). The matrixd-saturates pattern is operationally good news: you know exactly which daemon to scale first if/when you grow past current demand. The substrate is well-shaped for horizontal growth.
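To make headroom path 1 concrete, here is a minimal, hypothetical sketch (not code from this repo) of fanning search traffic out across several matrixd instances with Go's standard `net/http/httputil` reverse proxy. The backend ports `4301-4303` and the listen port `:4310` are made up for illustration; a balancer of this shape could sit between the gateway and a matrixd pool, since any instance can serve any request:

```go
// matrixd_lb.go — hypothetical round-robin fan-out over several matrixd
// instances. Ports and paths are illustrative only.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Assumed: three matrixd instances started on these ports.
	backends := []string{
		"http://127.0.0.1:4301",
		"http://127.0.0.1:4302",
		"http://127.0.0.1:4303",
	}
	proxies := make([]*httputil.ReverseProxy, len(backends))
	for i, b := range backends {
		u, err := url.Parse(b)
		if err != nil {
			log.Fatal(err)
		}
		proxies[i] = httputil.NewSingleHostReverseProxy(u)
	}

	var next uint64
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Round-robin: matrixd is stateless per request (recordings
		// persist via storaged), so any instance can take any query.
		i := atomic.AddUint64(&next, 1) % uint64(len(proxies))
		proxies[i].ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":4310", nil))
}
```

With matrixd pegging one core at ~8.1K RPS, N instances behind a balancer like this would be the first lever if demand ever approached that ceiling.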