
# G5 cutover slice — bigger load test (5K corpus, 200 bodies, conc-sweep + mixed)
Larger-scale follow-up to `g5_load_test.md`. Three axis expansions plus a mixed workload:
- **Corpus**: 200 → 5,000 workers
- **Body variety**: 6 → 200 distinct queries (4 styles × 50 fill_events rows)
- **Concurrency sweep**: 10 → 50 → 100 → 200, 3 minutes each
- **Mixed workload**: parallel embed + search at 60 conc each, 90s
All requests hit the Go gateway directly on `:4110/v1/matrix/search` and
`:4110/v1/embed`; no Bun frontend is in the path for this test (Bun adds
proxy overhead but is not the substrate's limit).
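For orientation, a direct hit on each endpoint looks roughly like this. The JSON field names (`text`, `query`, `top_k`) are illustrative assumptions; this report does not pin down the body schema:
```bash
# Hypothetical request shapes; field names are assumptions, not from this report.
curl -s http://127.0.0.1:4110/v1/embed \
  -H 'Content-Type: application/json' \
  -d '{"text": "need 2 RNs for weekend ICU coverage"}'

curl -s http://127.0.0.1:4110/v1/matrix/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "need 2 RNs for weekend ICU coverage", "top_k": 10}'
```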
## Setup
- Persistent Go stack on `:4110+:4211-:4219` (11 daemons, 11h+ uptime
at test time)
- Workers corpus: 5,000 rows from `workers_500k.parquet` (real
production data)
- Search bodies: 200 distinct queries via
`gen_real_queries -limit 50 -styles all` (need / client_first /
looking / shorthand × 50 = 200 unique inputs to stress
the embed cache + matrix retrieve path)
## Concurrency sweep — `/v1/matrix/search` direct
| Conc | Duration | Requests | RPS | p50 | p95 | p99 | max | errors |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 10 | 3m | 486,733 | **2,704** | 2.19ms | 2.99ms | 6.72ms | 651ms | **0** |
| 50 | 3m | 1,148,543 | **6,381** | 7.08ms | 14.96ms | 20.20ms | 77ms | **0** |
| 100 | 3m | 1,253,389 | **6,963** | 13.34ms | 27.44ms | 36.96ms | 182ms | **0** |
| 200 | 3m | 1,460,676 | **8,114** | 23.45ms | 44.20ms | 56.38ms | 225ms | **0** |
| **Total** | **12 min** | **4,349,341** | — | — | — | — | — | **0** |
## Mixed workload — `/v1/embed` + `/v1/matrix/search` in parallel
| Endpoint | Conc | Duration | Requests | RPS | p50 | p95 | p99 |
|---|---:|---:|---:|---:|---:|---:|---:|
| `/v1/embed` | 60 | 90s | 1,127,854 | 12,531 | 3.31ms | 10.22ms | 14.59ms |
| `/v1/matrix/search` | 60 | 90s | 392,229 | 4,358 | 12.68ms | 26.08ms | 33.78ms |
| **Total** | **120** | **90s** | **1,520,083** | **16,889** | — | — | — |
Both endpoints compete for the same matrixd / vectord / embedd
processes. Zero errors across 1.52M requests in 90 seconds.
## Resource footprint during load (peak observed)
| Daemon | CPU% | RSS | Role |
|---|---:|---:|---|
| persistent-matrixd | 105% | 33MB | bottleneck — pegging 1 core |
| persistent-gateway | 44% | 41MB | proxy + auth |
| persistent-vectord | 39% | 82MB | HNSW search |
| persistent-embedd | 30% | 67MB | embed cache + Ollama bridge |
| persistent-storaged | 0.1% | 22MB | idle (read-mostly) |
| (5 other daemons) | ~0% | ~25MB each | idle |
| **Total** | — | **~370MB** | across all 11 daemons |
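The table can be re-captured with something along these lines (a sketch; it assumes the daemons keep the `persistent-` process-name prefix shown above):
```bash
# Snapshot CPU% and RSS (MB) per persistent-* daemon, plus a summed RSS total.
ps -eo pcpu,rss,comm | awk '
  /persistent-/ { printf "%-24s %6.1f%% %8.1f MB\n", $3, $1, $2/1024; total += $2 }
  END           { printf "%-24s %16.1f MB total\n", "all daemons", total/1024 }'
```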
Compare to the Rust gateway under similar load earlier today:
**14.9GB RSS, 374% CPU**. **Go uses ~40× less memory**
(14.9GB / 370MB ≈ 40×) and concentrates CPU far less (374% vs.
matrixd's 105% peak): Go spreads load across daemons where Rust
packs everything into one mega-process.
## Aggregate
| Metric | Value |
|---|---:|
| Total requests | **5,869,424** |
| Total wall time | ~13.5 minutes |
| Errors (any kind) | **0** |
| Peak RSS across all 11 daemons | ~370MB |
**5.87 million requests, zero errors.** This is the substrate's
production-readiness signal at scale.
## What scales, what saturates
### RPS scaling (search):
- 10 → 50 conc: +135% RPS, +224% p50 latency. Sub-linear scaling is
expected (Little's law plus Go's GMP scheduler context-switching; a
consistency check follows this list).
- 50 → 100 conc: +9% RPS, +88% p50 latency. **Saturation begins.**
- 100 → 200 conc: +17% RPS, +76% p50 latency. **Saturation point**:
matrixd is at ~105% CPU (1 core pegged); doubling concurrency
past 100 adds queue depth, not throughput.
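A quick Little's law check (`L = λ·W`, using p50 as a rough stand-in for mean latency, so treat it as approximate) says the sweep's own numbers are self-consistent: 6,963 RPS × 13.34ms ≈ 93 requests in flight at conc=100, and 8,114 RPS × 23.45ms ≈ 190 at conc=200, both near the offered concurrency. Past saturation, extra concurrency shows up almost entirely as latency (queue depth), not throughput.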
### What saturates:
- **matrixd** is the bottleneck at conc=100+. Pegs one CPU core.
- **vectord** at 39% has headroom — HNSW search is fast.
- **embedd** at 30% has headroom — cache hit rate is high (200 bodies
× millions of requests means everything stays in 4096-cap LRU).
### Headroom paths (if you ever need more throughput):
1. **Run multiple matrixd instances** behind a load balancer. The
substrate is stateless (recordings persist via storaged), so
horizontal scale is straightforward; a config sketch follows this list.
2. **Profile and optimize matrixd's per-request work** — role-gate +
judge-eligibility + result merge. The hot path is the natural first
profiling target.
3. **Skip Bun for hot endpoints** — direct nginx → Go measured
5.7× faster in the original load test.
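A minimal sketch of path 1, assuming matrixd speaks HTTP and using hypothetical replica ports (nothing below is taken from the current stack layout):
```nginx
# Hypothetical: three matrixd replicas behind nginx, least-loaded routing.
upstream matrixd_pool {
    least_conn;                 # send each request to the least-busy replica
    server 127.0.0.1:4215;      # matrixd replica 1 (hypothetical port)
    server 127.0.0.1:4225;      # matrixd replica 2
    server 127.0.0.1:4235;      # matrixd replica 3
}

server {
    listen 8080;                # illustrative front port
    location /v1/matrix/ {
        proxy_pass http://matrixd_pool;
    }
}
```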
## What this load test does NOT cover
- **Cold-cache embed** — 200 bodies × LRU cap 4096 = 100% cache
hit rate after first round. Cold workloads (every query unique)
would bottleneck on Ollama at ~30-50ms/embed.
- **Sustained for hours** — the longest run here was the 12-minute
search sweep (90s for mixed). Memory or file-handle leaks would only
surface over multi-hour runs.
- **Real coordinator demand patterns** — bodies rotated round-robin;
real workloads would have arrival-rate variability and burst
patterns.
- **Cross-daemon failure** — what happens if vectord crashes
mid-query? Smoke tests cover restart, but in-flight failure
recovery wasn't exercised; one way to exercise it is sketched
after this list.
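A sketch of that failure-injection pass (assumptions: the `persistent-vectord` process name from the resource table, and that `start_go_stack.sh` is safe to re-run over a partially running stack):
```bash
# Hypothetical failure-injection run: hard-kill vectord mid-load,
# restart the stack, and check loadgen's error count afterwards.
./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
  -bodies-file /tmp/big_search_bodies.txt \
  -concurrency 50 -duration 120s &
sleep 30
pkill -9 -f persistent-vectord        # simulate a crash with requests in flight
sleep 5
./scripts/cutover/start_go_stack.sh   # assumption: idempotent restart
wait
```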
## Repro
```bash
# Stack must be up:
./scripts/cutover/start_go_stack.sh
# Ingest 5K workers (~3 min):
./bin/staffing_workers -limit 5000 -gateway http://127.0.0.1:4110 -drop=true
# Generate 200-body search file:
go run ./scripts/cutover/gen_real_queries -limit 50 -styles all > /tmp/big_test_queries.txt
# (then convert each query line to a JSON body; a jq sketch follows this block)
# Concurrency sweep:
for conc in 10 50 100 200; do
  ./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
    -bodies-file /tmp/big_search_bodies.txt \
    -concurrency "$conc" -duration 180s
done
# Mixed:
./bin/loadgen -url http://127.0.0.1:4110/v1/embed \
  -bodies-file /tmp/embed_bodies.txt \
  -concurrency 60 -duration 90s &
./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \
  -bodies-file /tmp/big_search_bodies.txt \
  -concurrency 60 -duration 90s &
wait
```
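The conversion step is left out of the script above; one possible shape for it, assuming one raw query per line in and a `{"query": ...}` body per line out (both assumptions):
```bash
# Hypothetical conversion: raw query lines -> compact JSON bodies, one per line.
jq -R -c '{query: ., top_k: 10}' /tmp/big_test_queries.txt > /tmp/big_search_bodies.txt
```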
## Conclusion
The Go substrate handled 5.87 million requests across ~13.5 minutes
with zero errors, ~370MB total RSS, and matrixd as the one visible
bottleneck at concurrency 100+. That is production-ready at well above
any staffing-domain demand level (<1 RPS typical per coordinator,
<100 RPS even with hundreds of concurrent coordinators).
The matrixd-saturates-first pattern is operationally good news: you
know exactly which daemon to scale first if and when demand grows past
today's. The substrate is well-shaped for horizontal growth.