root 166470f532 corpusingest: extract reusable text→vector ingest pipeline
Generalizes the staffing_500k driver's embed-and-push loop into
internal/corpusingest. Per docs/SPEC.md §3.4 component 1 (corpus
builders): adding a new staffing/code/playbook corpus is now one
Source impl + one main.go calling Run, not 200 lines of pipeline
copy-paste.

API:
  type Source interface { Next() (Row, error) }
  func Run(ctx, Config, Source) (Stats, error)

Library owns:
  - Index lifecycle (create, optional drop-existing, idempotent
    reuse on 409)
  - Parallel embed dispatcher (configurable workers + batch size)
  - Vectord push batching
  - Progress logging + Stats reporting
  - Partial-failure semantics (log + continue per-batch errors;
    operator decides on re-run via Stats.Embedded vs Scanned delta)

Per-corpus driver owns: source parsing + column→Row mapping +
post-ingest validation queries.

Refactor scripts/staffing_500k/main.go to use it. Driver is now
~190 lines (was 339), with the embed/add plumbing replaced by one
Run call. -drop flag added so callers can opt out of the destructive
DELETE-first behavior (default still true to keep the 500K test
clean-recall semantics).

Unit tests (internal/corpusingest/ingest_test.go, 8/8 PASS):
  - Pipeline shape: 50 rows / 16 batch → 4 embed + 4 add calls,
    every ID added exactly once, vectors at correct dimension
  - DropExisting fires DELETE
  - 409 on create → reuse existing index
  - Limit stops early
  - Empty Text rows skipped (counted as scanned, not added)
  - Required IndexName + Dimension validation
  - Context cancel stops mid-pipeline

Real bug caught and fixed by the test suite: if embedd ever returns
fewer vectors than texts in the request (degraded backend), the
addBatch loop would panic with index-out-of-range. Worker now
length-checks the response and logs+skips on mismatch.

12-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix). vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:47:18 -05:00
..