J's reset: I'd been iterating on pipeline internals without a
driver. The PRD says staffing is the REFERENCE consumer, not the
domain driver — the architecture is the thing. This test makes
that explicit.
8 sections exercise the PRD §Shared requirements against live
production-shaped data (500k workers parquet, 50k-chunk vector
index, 768d nomic embeddings):
1. preconditions — gateway + sidecar alive
2. catalog lookup — workers_500k resolves to 500000 rows
3. SQL at scale — count(*) + geo filter on 500k rows
4. vector search — /vectors/search returns top-k
5. hybrid SQL+vector — /vectors/hybrid with sql_filter
6. playbook_memory — /vectors/playbook_memory/stats
7. pathway_memory — ADR-021 stats + bug_fingerprints
8. truth gate — DROP TABLE blocked with 403
No cloud calls. Completes in ~5 seconds. Exits non-zero on any
failure; failure messages print "these are the next things to fix."
First-run measurements against current code:
- 500k COUNT(*) = 22ms, OH-filtered = 20ms (invariant met)
- vector search p=368ms on 10-NN
- hybrid p=4662ms, returned 0 Toledo-OH hits (two signals worth
investigating: the latency AND the empty result)
- playbook_memory = 0 entries (rebuild never fired since boot)
The 11/11 pass means the substrate's contract is intact. The
measurements tell us WHERE to look next, not what to speculate.
Going forward: this script is the canary. Run it after every
substantive change. If a section flips from pass to fail, that IS
the regression; roll back or fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>