Generalizes the staffing_500k driver's embed-and-push loop into
internal/corpusingest. Per docs/SPEC.md §3.4 component 1 (corpus
builders): adding a new staffing/code/playbook corpus is now one
Source impl + one main.go calling Run, not 200 lines of pipeline
copy-paste.
API:
    type Source interface { Next() (Row, error) }
    func Run(ctx, Config, Source) (Stats, error)
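A hedged sketch of the supporting types as a driver sees them; only
IndexName, Dimension, DropExisting, Limit, Stats.Scanned, and
Stats.Embedded are named in this message, so the remaining field names
are illustrative:

    // Sketch only. Workers, BatchSize, and Meta are illustrative
    // names; io.EOF as the end-of-stream signal is an assumed
    // convention.
    type Row struct {
        ID   string            // added to the index under this ID
        Text string            // empty => counted as scanned, not added
        Meta map[string]string // illustrative: payload passed to vectord
    }

    type Config struct {
        IndexName    string // required (validated)
        Dimension    int    // required (validated)
        DropExisting bool   // DELETE the index before ingest
        Limit        int    // stop early after this many rows (0 = all)
        Workers      int    // illustrative: parallel embed workers
        BatchSize    int    // illustrative: rows per embed/add batch
    }

    type Stats struct {
        Scanned  int // rows read from the Source
        Embedded int // rows embedded and pushed; delta vs Scanned drives re-runs
    }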
Library owns:
- Index lifecycle (create, optional drop-existing, idempotent
reuse on 409)
- Parallel embed dispatcher (configurable workers + batch size)
- Vectord push batching
- Progress logging + Stats reporting
- Partial-failure semantics (log + continue per-batch errors;
operator decides on re-run via Stats.Embedded vs Scanned delta)
Per-corpus driver owns: source parsing + column→Row mapping +
post-ingest validation queries.
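A hypothetical minimal driver under those contracts, for concreteness;
the module path, csvSource, and the two-column mapping are
illustrative, not the real staffing_500k code:

    package main

    import (
        "context"
        "encoding/csv"
        "flag"
        "log"
        "os"

        "example.local/internal/corpusingest" // illustrative module path
    )

    // csvSource adapts a CSV file to corpusingest.Source; the
    // column->Row mapping is the driver's whole job.
    type csvSource struct{ r *csv.Reader }

    func (s *csvSource) Next() (corpusingest.Row, error) {
        rec, err := s.r.Read() // io.EOF at end of file ends the run
        if err != nil {
            return corpusingest.Row{}, err
        }
        // Assumed layout: column 0 = worker ID, column 1 = combined text.
        return corpusingest.Row{ID: rec[0], Text: rec[1]}, nil
    }

    func main() {
        drop := flag.Bool("drop", true, "DELETE the index before ingest")
        flag.Parse()

        f, err := os.Open("workers_500k.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        stats, err := corpusingest.Run(context.Background(), corpusingest.Config{
            IndexName:    "workers_500k",
            Dimension:    768,
            DropExisting: *drop,
        }, &csvSource{r: csv.NewReader(f)})
        if err != nil {
            log.Fatal(err)
        }
        // Per-batch failures are logged and skipped inside Run; the
        // operator re-runs if Embedded lags Scanned.
        log.Printf("scanned=%d embedded=%d", stats.Scanned, stats.Embedded)
    }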
Refactors scripts/staffing_500k/main.go to use it. The driver is now
~190 lines (was 339), with the embed/add plumbing replaced by one Run
call. A -drop flag is added so callers can opt out of the destructive
DELETE-first behavior (the default is still true to keep the 500K
test's clean-recall semantics).
Unit tests (internal/corpusingest/ingest_test.go, 8/8 PASS):
- Pipeline shape: 50 rows / 16 batch → 4 embed + 4 add calls,
every ID added exactly once, vectors at correct dimension
- DropExisting fires DELETE
- 409 on create → reuse existing index
- Limit stops early
- Empty Text rows skipped (counted as scanned, not added; this case is
sketched after this list)
- Required IndexName + Dimension validation
- Context cancel stops mid-pipeline
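A hedged sketch of the empty-Text case in the shape the list implies;
sliceSource is a hypothetical test double, and pointing Config at
httptest fakes for embedd/vectord is an assumption about the harness,
not the real wiring:

    package corpusingest_test

    import (
        "context"
        "io"
        "testing"

        "example.local/internal/corpusingest" // illustrative module path
    )

    // sliceSource is a hypothetical in-memory Source for tests.
    type sliceSource struct {
        rows []corpusingest.Row
        i    int
    }

    func (s *sliceSource) Next() (corpusingest.Row, error) {
        if s.i == len(s.rows) {
            return corpusingest.Row{}, io.EOF // assumed end-of-stream signal
        }
        r := s.rows[s.i]
        s.i++
        return r, nil
    }

    func TestEmptyTextSkipped(t *testing.T) {
        src := &sliceSource{rows: []corpusingest.Row{
            {ID: "a", Text: "welder"},
            {ID: "b", Text: ""}, // scanned, never embedded or added
            {ID: "c", Text: "machinist"},
        }}
        stats, err := corpusingest.Run(context.Background(), corpusingest.Config{
            IndexName: "t",
            Dimension: 4,
            // Backend endpoints would point at httptest fakes here
            // (field names for that wiring are not shown in this message).
        }, src)
        if err != nil {
            t.Fatal(err)
        }
        if stats.Scanned != 3 || stats.Embedded != 2 {
            t.Fatalf("scanned=%d embedded=%d, want 3/2",
                stats.Scanned, stats.Embedded)
        }
    }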
Real bug caught and fixed by the test suite: if embedd ever returns
fewer vectors than texts in the request (degraded backend), the
addBatch loop would panic with an index-out-of-range. The worker now
length-checks the response and logs + skips on mismatch (sketched
below).
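A minimal sketch of that guard, assuming embedFn/addFn stand-ins for
the embedd and vectord clients (the shape of the fix, not the verbatim
worker code):

    package corpusingest

    import (
        "context"
        "log"
    )

    type batchRow struct{ ID, Text string }

    // processBatch embeds one batch and pushes it, returning how many
    // rows were actually added so Embedded can lag Scanned.
    func processBatch(
        ctx context.Context,
        rows []batchRow,
        embedFn func(context.Context, []string) ([][]float32, error),
        addFn func(id string, vec []float32) error,
    ) (added int) {
        texts := make([]string, len(rows))
        for i, r := range rows {
            texts[i] = r.Text
        }
        vecs, err := embedFn(ctx, texts)
        if err != nil {
            log.Printf("embed batch failed: %v (logged, continuing)", err)
            return 0
        }
        // The fix: a degraded backend can return fewer vectors than
        // texts, and indexing vecs[i] without this check panics.
        if len(vecs) != len(texts) {
            log.Printf("embed returned %d vectors for %d texts; batch skipped",
                len(vecs), len(texts))
            return 0
        }
        for i, v := range vecs {
            if err := addFn(rows[i].ID, v); err != nil {
                log.Printf("add %s: %v (logged, continuing)", rows[i].ID, err)
                continue
            }
            added++
        }
        return added
    }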
12-smoke regression sweep all green (D1-D6, G1, G1P, G2, storaged_cap,
pathway, matrix); go vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scripts/staffing_500k/main.go: driver that reads workers_500k.csv,
embeds the combined text per worker via /v1/embed, adds the vectors to
the vectord index "workers_500k", then runs the canonical staffing
queries against the populated index. A reproducible end-to-end test of
the staffing co-pilot pipeline at production scale.
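A hedged sketch of one query round trip; only the /v1/embed route is
confirmed above, so the ports, payload shapes, and the vectord search
route are assumptions:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        // 1. Embed the query text (the 40-59ms leg of the ~50ms budget).
        qb, _ := json.Marshal(map[string]any{
            "texts": []string{"electrician with industrial wiring"},
        })
        resp, err := http.Post("http://localhost:8001/v1/embed", // port assumed
            "application/json", bytes.NewReader(qb))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        var er struct {
            Vectors [][]float32 `json:"vectors"` // field name assumed
        }
        if err := json.NewDecoder(resp.Body).Decode(&er); err != nil {
            panic(err)
        }

        // 2. Top-K search against the populated index (the 1-3ms leg;
        //    route and request shape are assumptions).
        sb, _ := json.Marshal(map[string]any{
            "index": "workers_500k", "vector": er.Vectors[0], "k": 5,
        })
        sresp, err := http.Post("http://localhost:8002/v1/search", // assumed
            "application/json", bytes.NewReader(sb))
        if err != nil {
            panic(err)
        }
        defer sresp.Body.Close()
        fmt.Println(sresp.Status)
    }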
Run results (2026-04-29 ~02:30):
- 500,000 vectors ingested in 35m 36s (~234/sec avg)
- vectord peak RSS 4.5 GB (~9 KB/vector incl. HNSW graph)
- Query latency: embed 40-59ms + search 1-3ms = ~50ms end-to-end
- GPU avg ~65% (Ollama is not the bottleneck; vectord Add is)
Semantic recall on canonical queries:
- "electrician with industrial wiring": top 2 are literal Electricians (d=0.30)
- "CNC operator with first article": Assembler / Quality Techs (adjacent, d=0.24)
- "forklift driver OSHA-30": warehouse roles (d=0.33)
- "warehouse picker night shift bilingual": Material Handlers (d=0.31)
- "dental hygienist": Production Workers at d=0.49+; correctly
LOW-similarity, signaling "no dental hygienists in this manufacturing
dataset" rather than hallucinating a fake match.
Documented gaps:
- storaged's 256 MiB PUT cap blocks single-file LHV1 persistence
above ~150K vectors at d=768. The test ran with persistence disabled.
- vectord Add is RWMutex-serialized (sketched below); with the GPU at
~65% utilization this is the throughput cap. Concurrent Adds would be
2-3x faster but require a careful audit of coder/hnsw thread-safety
(the G1 scrum documented two known quirks).
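To make the second gap concrete, a schematic of the serialization
point; this is not vectord's code, and graph is a stand-in for the
coder/hnsw type:

    package sketch

    import "sync"

    type graph struct{} // stand-in for the coder/hnsw graph

    func (g *graph) insert(id string, vec []float32) { /* elided */ }

    type index struct {
        mu sync.RWMutex
        g  graph
    }

    // Every Add takes the write lock, so ingest is one-at-a-time no
    // matter how many embed workers feed it: the GPU sits at ~65%
    // while Adds queue behind the mutex.
    func (ix *index) Add(id string, vec []float32) {
        ix.mu.Lock()
        defer ix.mu.Unlock()
        ix.g.insert(id, vec)
    }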
PHASE_G0_KICKOFF.md gains a "Staffing scale test" section with the
full metrics plus the list of gaps surfaced. The architectural payoff
is real: six binaries, one HTTP route, ~50ms from text query to top-K
semantically relevant workers across 500K records.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>