2 Commits

Author SHA1 Message Date
root
0d1553ca88 candidates corpus: first deep-field reality test on real staffing data
Lands the second staffing corpus and the first end-to-end reality test
through the full Go pipeline: parquet → corpusingest → embedd →
vectord → matrixd → gateway.

What's new:
  - scripts/staffing_candidates/main.go — parquet Source over
    candidates.parquet (1000 rows, 11 cols), single-chunk arrow-go
    pqarrow read. Embed text: "Candidate skills: <s>. Based in
    <city>, <state>. <years> years experience. Status: <status>.
    <first> <last>." IDs prefixed "c-" so multi-corpus merges
    against workers ("w-") stay unambiguous.
  - scripts/candidates_e2e.sh — first integration smoke that runs
    the full stack (storaged + embedd + vectord + matrixd + gateway),
    ingests via corpusingest, runs a real query through
    /v1/matrix/search, prints results. Ephemeral mode (vectord
    persistence disabled via custom toml) so re-runs don't pollute
    MinIO _vectors/ and break g1p_smoke's "only-one-persisted-index"
    assertion.

Real bug caught + fixed in corpusingest:
  When LogProgress > 0, the progress goroutine's only exit was
  ctx.Done(). With context.Background() in the production driver,
  Run hung forever after the pipeline finished. Added a stopProgress
  channel that close()s after wg.Wait(). Regression test
  TestRun_ProgressLoggerExits bounds Run's wall to 2s with
  LogProgress=50ms.

This is the bug the unit tests didn't catch because every prior test
set LogProgress: 0. Reality test surfaced it on first real-data
run — exactly the hyperfocus-and-find-architectural-weakness
property J framed as the reason for the Go pass.

End-to-end output (1000 candidates, query "Python AWS Docker
engineer in Chicago available now"):

  populate: scanned=1000 embedded=1000 added=1000 wall=3.5s
  matrix returned 5 hits in 26ms

The result quality is the interesting signal: top-5 had ZERO
Chicago candidates, ZERO active-status candidates, and the exact-
skill-match (Python,AWS,Docker) ranked #3 not #1. Pipeline works;
retrieval quality has real architectural limits (no structured
filtering, no relevance gate, semantic-only ranking dominated by
secondary signals like "1 year experience" and "engineer"). This
motivates SPEC §3.4 components 3 (relevance filter) and
eventually structured filtering — exactly the kind of finding the
deep field reality tests are supposed to surface before Enterprise
cutover.

12-smoke regression sweep all green. 9 corpusingest unit tests
including the new regression. vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:06:27 -05:00
root
166470f532 corpusingest: extract reusable text→vector ingest pipeline
Generalizes the staffing_500k driver's embed-and-push loop into
internal/corpusingest. Per docs/SPEC.md §3.4 component 1 (corpus
builders): adding a new staffing/code/playbook corpus is now one
Source impl + one main.go calling Run, not 200 lines of pipeline
copy-paste.

API:
  type Source interface { Next() (Row, error) }
  func Run(ctx, Config, Source) (Stats, error)

Library owns:
  - Index lifecycle (create, optional drop-existing, idempotent
    reuse on 409)
  - Parallel embed dispatcher (configurable workers + batch size)
  - Vectord push batching
  - Progress logging + Stats reporting
  - Partial-failure semantics (log + continue per-batch errors;
    operator decides on re-run via Stats.Embedded vs Scanned delta)

Per-corpus driver owns: source parsing + column→Row mapping +
post-ingest validation queries.

Refactor scripts/staffing_500k/main.go to use it. Driver is now
~190 lines (was 339), with the embed/add plumbing replaced by one
Run call. -drop flag added so callers can opt out of the destructive
DELETE-first behavior (default still true to keep the 500K test
clean-recall semantics).

Unit tests (internal/corpusingest/ingest_test.go, 8/8 PASS):
  - Pipeline shape: 50 rows / 16 batch → 4 embed + 4 add calls,
    every ID added exactly once, vectors at correct dimension
  - DropExisting fires DELETE
  - 409 on create → reuse existing index
  - Limit stops early
  - Empty Text rows skipped (counted as scanned, not added)
  - Required IndexName + Dimension validation
  - Context cancel stops mid-pipeline

Real bug caught and fixed by the test suite: if embedd ever returns
fewer vectors than texts in the request (degraded backend), the
addBatch loop would panic with index-out-of-range. Worker now
length-checks the response and logs+skips on mismatch.

12-smoke regression sweep all green (D1-D6, G1, G1P, G2,
storaged_cap, pathway, matrix). vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:47:18 -05:00