golangLAKEHOUSE/reports/cutover/g5_load_test.md

# G5 cutover slice — production load test

Sustained-traffic load test against the cutover slice. Companion to
`g5_first_loop_live.md` (which proved learning-loop math) — this
report proves the substrate holds up under concurrent load.

## Setup

- Persistent Go stack on `:4110+:4211-:4219` (11 daemons)
- Workers corpus: 200 rows, in-memory + persisted to MinIO
- Bun mcp-server on `:3700` with `GO_LAKEHOUSE_URL=http://127.0.0.1:4110`
- Load generator: `scripts/cutover/loadgen/` — Go binary, 6-query
  rotating body mix (Forklift/CNC/Warehouse/Picker/Loader/Shipping)
- All queries `use_playbook=false` (cold-pass retrieval only — the
  load test isolates retrieval performance from learning-loop costs)

## Results

| Run | Path | Concurrency | Duration | Requests | RPS | p50 | p95 | p99 | max | errors |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | Bun `/_go/*` → Go | 1 | 10s | 4,085 | 408 | 1.3ms | 3.2ms | 32ms | 215ms | 0 |
| 2 | Bun `/_go/*` → Go | 10 | 30s | 14,527 | 484 | 4.6ms | 76ms | 92ms | 372ms | 0 |
| 3 | Direct → Go (`:4110`) | 10 | 30s | 83,158 | **2,772** | 2.5ms | 7.2ms | 8.5ms | 16ms | 0 |

**Total: 101,770 requests, zero errors.**

## Read

### What the load test confirmed

1. **Zero correctness errors across 101k requests.** Matrix gate +
   vectord HNSW + embedd cache + gateway proxy all hold under
   sustained concurrent traffic. No 5xx, no transport errors, no
   panics. This was the load test's primary question.

2. **Direct-to-Go performance is production-grade.** 2,772 RPS at
   p50 2.5ms / p99 8.5ms / max 16ms on a single host. The substrate
   itself has no scaling cliff at concurrency=10.

3. **The substrate's tail latency is well-bounded direct.** p99
   8.5ms means 99% of requests complete in under 9ms. For a vector-
   search workload (which involves embed → HNSW search → metadata
   join), that's a strong number.

### What the load test exposed

**Bun frontend is the bottleneck.** Adding Bun's reframing layer
collapses throughput by 5.7× and inflates p99 by 11×:

| Metric | Direct | Via Bun | Cost |
|---|---:|---:|---|
| RPS | 2,772 | 484 | -82% |
| p50 latency | 2.5ms | 4.6ms | +84% |
| p99 latency | 8.5ms | 92ms | +982% |
| max latency | 16ms | 372ms | +2,225% |

The p99/max cliff (>10× worse via Bun) suggests Bun's single-process
JS event loop is queueing under concurrent requests. This is a
known characteristic of Node/Bun in proxy mode — the event loop
serializes I/O completions, and at concurrency=10 the queue depth
during fan-out shows up as tail-latency cliffs.

### What this means for production

**For staffing-domain demand levels** (single-coordinator workflows
typically run <1 RPS even at peak), the Bun-fronted 484 RPS path
has 480× headroom. No urgency to optimize Bun out of the data path.

**If/when concurrent demand grows orders of magnitude** (e.g. 100+
simultaneous coordinators, automated pipelines), the optimization
path is clear: route nginx → Go directly for `/v1/matrix/search`
(or other hot endpoints), skip Bun for those. The 5.7× throughput
gain isn't gated on Go-side optimization — it's gated on Bun
reframing exit.

**The substrate itself is production-ready.** Zero errors, sub-10ms
p99 direct, no concurrency bugs surfaced under sustained load. The
load test's null result on correctness is the load test's signal.

## What this load test does NOT cover

- **Embedder hot path**: bodies rotate across 6 queries, so embed
  cache hits frequently. Cold-cache RPS would be lower.
- **Larger corpus**: 200 workers is a small index. HNSW search
  costs scale with `O(log n)` so 5K or 500K row corpora would
  show small additional latency, but the experiment isn't done.
- **Mixed read/write**: load is read-only. Concurrent
  ingest+search hasn't been tested under sustained load.
- **Multi-host cluster**: single-process load on one box. Horizontal
  scaling characteristics unknown.
- **Real chatd/observer/pathway calls**: load test bodies set
  `use_playbook=false` to isolate the matrix→vectord retrieve
  path. Full 5-loop traffic (with playbook lookup + judge gate)
  has different RPS characteristics.

## Repro

```bash
# Stack must be up:
./scripts/cutover/start_go_stack.sh
./bin/staffing_workers -limit 200 -gateway http://127.0.0.1:4110 -drop=true

# Build loadgen:
go build -o bin/loadgen ./scripts/cutover/loadgen

# Three runs:
./bin/loadgen -url http://localhost:3700/_go/v1/matrix/search -concurrency 1  -duration 10s
./bin/loadgen -url http://localhost:3700/_go/v1/matrix/search -concurrency 10 -duration 30s
./bin/loadgen -url http://localhost:4110/v1/matrix/search       -concurrency 10 -duration 30s
```

JSON summary on stderr is parseable for CI integration.