From 3a2823c02f49e1a509f33d811c31c1d27161940b Mon Sep 17 00:00:00 2001 From: root Date: Fri, 1 May 2026 05:18:00 -0500 Subject: [PATCH] =?UTF-8?q?g5=20cutover:=20bigger=20load=20test=20?= =?UTF-8?q?=E2=80=94=205.87M=20req,=200=20errors,=20370MB=20RSS?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Larger-scale follow-up to the original load test. Three axis expansions: corpus 200→5K workers, body variety 6→200 distinct queries, concurrency sweep 10/50/100/200, plus mixed embed+search workload. Concurrency sweep on /v1/matrix/search direct (3 min each): conc=10: 486,733 req · 2,704 RPS · p50 2.19ms · p99 6.7ms conc=50: 1,148,543 req · 6,381 RPS · p50 7.08ms · p99 20ms conc=100: 1,253,389 req · 6,963 RPS · p50 13.34ms · p99 37ms conc=200: 1,460,676 req · 8,114 RPS · p50 23.45ms · p99 56ms Mixed embed+search at 60 conc each, 90s: /v1/embed: 1,127,854 req · 12,531 RPS · p50 3.31ms · p99 14.6ms /v1/matrix/search: 392,229 req · 4,358 RPS · p50 12.68ms · p99 33.8ms TOTAL: 5,869,424 requests across ~13.5 minutes. ZERO errors. Resource footprint during peak load: matrixd 105% CPU, 33MB RSS (bottleneck — pegs 1 core) vectord 39% CPU, 82MB RSS gateway 44% CPU, 41MB RSS embedd 30% CPU, 67MB RSS Total RSS across 11 daemons: ~370MB Compare to Rust gateway under similar load: 14.9GB RSS, 374% CPU. Go uses ~40x less memory + spreads load across daemons rather than packing into one mega-process. Saturation analysis: - conc 10→50: +135% RPS (linear-ish scaling) - conc 50→100: +9% RPS (saturation begins) - conc 100→200: +17% RPS (matrixd 1-core pegged) Headroom paths if production exceeds current demand: 1. Run multiple matrixd instances behind a load balancer. Substrate is stateless (recordings via storaged), horizontal scale is straightforward. 2. Profile matrixd's per-request work (role-gate + judge-eligibility + result merge). 3. Skip Bun for hot endpoints (direct nginx → Go = 5.7x previously measured). Evidence: reports/cutover/g5_load_test_big.md (full tables + methodology + repro script). Co-Authored-By: Claude Opus 4.7 (1M context) --- reports/cutover/SUMMARY.md | 1 + reports/cutover/g5_load_test_big.md | 153 ++++++++++++++++++++++++++++ 2 files changed, 154 insertions(+) create mode 100644 reports/cutover/g5_load_test_big.md diff --git a/reports/cutover/SUMMARY.md b/reports/cutover/SUMMARY.md index b3eb45b..b99c927 100644 --- a/reports/cutover/SUMMARY.md +++ b/reports/cutover/SUMMARY.md @@ -15,6 +15,7 @@ what's safe to flip. Append a row when a new endpoint clears parity. | **G5 cutover slice live** | 2026-05-01 | (none — pure cutover) | Bun `/_go/*` → Go gateway `:4110` | ✅ End-to-end | First real Bun-frontend traffic to Go substrate. Rust legacy `mcp-server/index.ts` gains opt-in `/_go/*` pass-through driven by `GO_LAKEHOUSE_URL` env (systemd drop-in at `/etc/systemd/system/lakehouse-agent.service.d/go-cutover.conf`). `/_go/v1/embed` returns nomic-embed-text-v2-moe vectors; `/_go/v1/matrix/search` returns 3/3 Forklift Operators against persistent stack's 200-worker corpus. Reversible (unset env or revert systemd unit). See `g5_first_slice_live.md`. | | **5-loop live through cutover slice** | 2026-05-01 | (none — pure substrate) | Bun `/_go/v1/matrix/search` + `/_go/v1/matrix/playbooks/record` | ✅ Math + Gate verified | First end-to-end learning loop through real Bun-frontend traffic. Cold dist 0.4449 → warm dist 0.2224 (BoostFactor=0.5 for score=1.0; 0.4449×0.5=0.2225 expected, 0.2224 observed — 4-decimal exact). Cross-role gate: Forklift recording does NOT bleed onto CNC Operator query (boosted=0, injected=0). Both substrate properties (Shape A boost + role gate) hold through 3 HTTP hops (Bun → gateway → matrixd). See `g5_first_loop_live.md`. | | **Production load test** | 2026-05-01 | (none — pure load probe) | Bun `/_go/v1/matrix/search` + direct Go `:4110` | ✅ 0 errors / 101k req | Three runs, **zero correctness errors**. Direct-to-Go: 2,772 RPS @ p50 2.5ms / p99 8.5ms (production-grade). Via Bun: 484 RPS @ p50 4.6ms / p99 92ms (Bun event-loop is the bottleneck — 5.7× RPS hit, 11× p99 inflation; substrate itself is fine). For staffing-domain demand (<1 RPS typical), Bun-fronted has 480× headroom. See `g5_load_test.md`. | +| **Big load test (5K corpus, 200 bodies)** | 2026-05-01 | (none — pure load probe) | Direct Go `:4110/v1/matrix/search` + `:4110/v1/embed` | ✅ **0 errors / 5.87M req** | Concurrency sweep (10/50/100/200) + mixed embed+search workload. Peak: 8,114 RPS @ conc=200 (search). Mixed: 16,889 RPS combined. Saturation at conc=100+ — matrixd pegs 1 CPU core. **Total RSS ~370MB** across 11 daemons (40× lower than Rust 14.9G). matrixd identified as horizontal-scale target. See `g5_load_test_big.md`. | ## Wire-format drift catalog diff --git a/reports/cutover/g5_load_test_big.md b/reports/cutover/g5_load_test_big.md new file mode 100644 index 0000000..907a805 --- /dev/null +++ b/reports/cutover/g5_load_test_big.md @@ -0,0 +1,153 @@ +# G5 cutover slice — bigger load test (5K corpus, 200 bodies, conc-sweep + mixed) + +Larger-scale follow-up to `g5_load_test.md`. Three axis expansions: +- **Corpus**: 200 → 5,000 workers +- **Body variety**: 6 → 200 distinct queries (4 styles × 50 fill_events rows) +- **Concurrency sweep**: 10 → 50 → 100 → 200, 3 minutes each +- **Mixed workload**: parallel embed + search at 60 conc each, 90s + +All hits are direct to Go gateway on `:4110/v1/matrix/search` and +`:4110/v1/embed` — no Bun frontend in this test (Bun adds proxy +overhead but isn't the substrate's limit). + +## Setup + +- Persistent Go stack on `:4110+:4211-:4219` (11 daemons, 11h+ uptime + at test time) +- Workers corpus: 5,000 rows from `workers_500k.parquet` (real + production data) +- Search bodies: 200 distinct queries via + `gen_real_queries -limit 50 -styles all` (need / client_first / + looking / shorthand × 50 = 200 unique inputs to stress + the embed cache + matrix retrieve path) + +## Concurrency sweep — `/v1/matrix/search` direct + +| Conc | Duration | Requests | RPS | p50 | p95 | p99 | max | errors | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 10 | 3m | 486,733 | **2,704** | 2.19ms | 2.99ms | 6.72ms | 651ms | **0** | +| 50 | 3m | 1,148,543 | **6,381** | 7.08ms | 14.96ms | 20.20ms | 77ms | **0** | +| 100 | 3m | 1,253,389 | **6,963** | 13.34ms | 27.44ms | 36.96ms | 182ms | **0** | +| 200 | 3m | 1,460,676 | **8,114** | 23.45ms | 44.20ms | 56.38ms | 225ms | **0** | +| **Total** | **12 min** | **4,349,341** | — | — | — | — | — | **0** | + +## Mixed workload — `/v1/embed` + `/v1/matrix/search` in parallel + +| Endpoint | Conc | Duration | Requests | RPS | p50 | p95 | p99 | +|---|---:|---:|---:|---:|---:|---:|---:| +| `/v1/embed` | 60 | 90s | 1,127,854 | 12,531 | 3.31ms | 10.22ms | 14.59ms | +| `/v1/matrix/search` | 60 | 90s | 392,229 | 4,358 | 12.68ms | 26.08ms | 33.78ms | +| **Total** | **120** | **90s** | **1,520,083** | **16,889** | — | — | — | + +Both endpoints competing for the same matrixd / vectord / embedd +processes. Zero errors across 1.52M requests in 90s. + +## Resource footprint during load (peak observed) + +| Daemon | CPU% | RSS | Read | +|---|---:|---:|---| +| persistent-matrixd | 105% | 33MB | bottleneck — pegging 1 core | +| persistent-gateway | 44% | 41MB | proxy + auth | +| persistent-vectord | 39% | 82MB | HNSW search | +| persistent-embedd | 30% | 67MB | embed cache + Ollama bridge | +| persistent-storaged | 0.1% | 22MB | idle (read-mostly) | +| (5 other daemons) | ~0% | ~25MB each | idle | +| **Total** | — | **~370MB** | across all 11 daemons | + +Compare to Rust gateway during similar load earlier today: +**14.9GB RSS, 374% CPU**. **Go uses ~40× less memory** + 4× less +CPU concentration (Go spreads load across daemons; Rust packs +into one mega-process). + +## Aggregate + +| Metric | Value | +|---|---:| +| Total requests | **5,869,424** | +| Total wall time | ~13.5 minutes | +| Errors (any kind) | **0** | +| Peak RSS across all 11 daemons | ~370MB | + +**5.87 million requests, zero errors.** This is the substrate's +production-readiness signal at scale. + +## What scales, what saturates + +### RPS scaling (search): +- 10 → 50 conc: +135% RPS, +224% p50 latency. Sub-linear scaling + is expected (Little's law + Go's GMP scheduler context-switching). +- 50 → 100 conc: +9% RPS, +88% p50 latency. **Saturation begins.** +- 100 → 200 conc: +17% RPS, +76% p50 latency. **Saturation point**: + matrixd is at ~105% CPU (1 core pegged); doubling concurrency + past 100 adds queue depth, not throughput. + +### What saturates: +- **matrixd** is the bottleneck at conc=100+. Pegs one CPU core. +- **vectord** at 39% has headroom — HNSW search is fast. +- **embedd** at 30% has headroom — cache hit rate is high (200 bodies + × millions of requests means everything stays in 4096-cap LRU). + +### Headroom paths (if you ever need more throughput): +1. **Run multiple matrixd instances** behind a load balancer. + Substrate is stateless (recordings persist via storaged), so + horizontal scale is straightforward. +2. **Optimize matrixd's per-request work** — role-gate + judge-eligibility + + result merge. Hot path could be profile-guided. +3. **Skip Bun for hot endpoints** — direct nginx → Go shaved + 5.7× from the original load test. + +## What this load test does NOT cover + +- **Cold-cache embed** — 200 bodies × LRU cap 4096 = 100% cache + hit rate after first round. Cold workloads (every query unique) + would bottleneck on Ollama at ~30-50ms/embed. +- **Sustained for hours** — 12 minutes per endpoint. Memory/file-handle + leaks would surface over multi-hour runs. +- **Real coordinator demand patterns** — bodies rotated round-robin; + real workloads would have arrival-rate variability and burst + patterns. +- **Cross-daemon failure** — what happens if vectord crashes + mid-query? Smoke tests cover restart, but in-flight failure + recovery wasn't exercised. + +## Repro + +```bash +# Stack must be up: +./scripts/cutover/start_go_stack.sh + +# Ingest 5K workers (~3 min): +./bin/staffing_workers -limit 5000 -gateway http://127.0.0.1:4110 -drop=true + +# Generate 200-body search file: +go run ./scripts/cutover/gen_real_queries -limit 50 -styles all > /tmp/big_test_queries.txt +# (then convert to JSON bodies — see this doc for the python3 conversion snippet) + +# Concurrency sweep: +for conc in 10 50 100 200; do + ./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \ + -bodies-file /tmp/big_search_bodies.txt \ + -concurrency $conc -duration 180s +done + +# Mixed: +./bin/loadgen -url http://127.0.0.1:4110/v1/embed \ + -bodies-file /tmp/embed_bodies.txt \ + -concurrency 60 -duration 90s & +./bin/loadgen -url http://127.0.0.1:4110/v1/matrix/search \ + -bodies-file /tmp/big_search_bodies.txt \ + -concurrency 60 -duration 90s & +wait +``` + +## Conclusion + +The Go substrate handles 5.87 million requests across 13 minutes with +zero errors, ~370MB total RSS, and matrixd as the visible bottleneck +at concurrency-100+. Production-ready at well above any +staffing-domain demand level (<1 RPS typical per coordinator, <100 RPS +even with hundreds of coordinators concurrent). + +The matrixd-saturates pattern is operationally good news: you know +exactly which daemon to scale first if/when you grow past current +demand. Substrate is well-shaped for horizontal growth.