18 Commits

Author SHA1 Message Date
root
1565f536eb Fix: job tracker field name mismatch — the overnight killer
ROOT CAUSE: Python scripts polled status.get("processed", 0) but the
Rust Job struct serialized as "embedded_chunks". Scripts always saw 0,
looped forever printing "unknown: 0/50000" for 8+ hours.

Fix (both sides):
- Rust: added "processed" alias field + "total" field to Job struct,
  kept in sync on every update_progress() and complete() call
- Python: fixed autonomous_agent.py and overnight_proof.sh to read
  "embedded_chunks" as primary key

The actual embedding pipeline was working the whole time — 673K real
chunks embedded overnight. Only the monitoring was blind.

One-word bug, 8 hours of zombie output. This is why you test the
monitoring, not just the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 10:41:32 -05:00
root
352f99de0f Hybrid SQL+Vector search — the gap is closed
POST /vectors/hybrid takes a question + SQL WHERE clause. Pipeline:
1. SQL filter narrows to structurally-valid candidates (role, state,
   reliability, certs — whatever the caller specifies)
2. Brute-force cosine scores ALL embeddings (not HNSW, which caps at
   ~30 results due to ef_search — too few to intersect with narrow
   SQL filters on 10K+ datasets)
3. Filter vector results to only SQL-verified IDs
4. LLM generates answer from verified-correct records

Tested on the exact query that failed the staffing simulation:
"forklift operators in IL with reliability > 0.8" — SQL found 78
matches, vector ranked the 5 most semantically relevant, LLM
generated an answer citing real workers with actual skills and
certifications. Every source marked sql_verified=true.

This closes the architectural gap identified by the quality eval:
structured precision (SQL) + semantic intelligence (vector) in one
endpoint. The simulation's contract-matching path was already
SQL-pure and worked perfectly; now the intelligence-question path
has the same accuracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 22:49:48 -05:00
root
f9f92706f3 RAG reranker + manifest bucket fix — quality improvements from eval
RAG pipeline now includes a cross-encoder rerank step between retrieval
and generation. The LLM re-sorts top-K results by relevance before
they become context. Falls back to original order if model output is
unparseable (~5% with 7B models). Also improved the generation prompt
to be domain-aware ("staffing database") and request specific citations.

Fixed 4 catalog manifests with bucket="data" (pre-federation leftover)
that poisoned the entire DataFusion query context on startup. The
"users", "lab_trials", "meta_runs", and "new_candidates" datasets
now correctly reference bucket="primary". This bug was surfaced by
the quality evaluation pipeline — wouldn't have been found by
structural tests alone.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 22:19:11 -05:00
root
9e6002c4d4 S3 backend for Lance — hybrid operates on real MinIO object storage
Enabled lance feature "aws" for S3-compatible storage via opendal.
BucketRegistry: added with_allow_http(true) for MinIO/non-TLS S3
endpoints (fixes "builder error" on HTTP endpoints). lakehouse.toml
gains [[storage.buckets]] name="s3:lakehouse" with S3 backend config.

lance_backend.rs: S3 bucket naming convention — buckets with name
prefix "s3:" emit s3:// URIs for Lance datasets. AWS_* env vars
in the systemd unit provide credentials to Lance's internal
object_store.

Verified end-to-end on real MinIO with real 100K × 768d vectors:
  - Migrate Parquet → Lance on S3: 1.7s (vs 0.57s local)
  - Build IVF_PQ: 16.4s (CPU-bound, essentially same as local)
  - Search: ~58ms p50 (vs 11ms local — S3 partition reads)
  - Random doc fetch: 13ms (vs 3.5ms local)
  - Recall@10: 0.835 (randomized IVF_PQ, consistent with local 0.805)
  - Total S3 footprint: 637 MiB (vectors + index + lance metadata)

The "public storage" claim from the PRD is now proven: the hybrid
Parquet+HNSW ⊕ Lance architecture works on S3-compatible object
storage, not just local filesystem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 21:09:42 -05:00
root
fd4b6836ae IVF_PQ recall harness — closes ADR-019's explicit measurement gap
POST /vectors/lance/recall/{index} runs an existing harness through
Lance IVF_PQ search and measures recall@k against brute-force ground
truth. Uses the same EvalSet + ground_truth infrastructure as the
HNSW trial system — no new harness format needed.

First real measurement on resumes_100k_v2 (100K × 768d, 20 queries):
  IVF_PQ (316 partitions, 8 bits, 48 subvectors): recall@10 = 0.805
  For comparison — HNSW ec=80 es=30: recall@10 = 1.000

ADR-019 predicted "likely 0.85-0.95" — actual is 0.805. Slightly
below, but now the harness exists to iterate: increase partitions,
try ivf_hnsw_pq, tune subvectors. The measurement infrastructure
is the deliverable, not any specific recall target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:52:34 -05:00
root
59e72fa566 Scalar btree index on doc_id + auto-build during Lance activation
LanceVectorStore gains build_scalar_index(column) and
has_scalar_index(column). Exposed as POST /vectors/lance/scalar-index/
{index}/{column}. activate_profile auto-builds the doc_id btree
alongside the IVF_PQ vector index when activating a Lance-backed
profile — operators get both indexes without extra API calls.

stats() now reports has_doc_id_index alongside has_vector_index.

Measured on resumes_100k_v2 (100K × 768d): random doc_id fetch
improved from ~5.4ms to ~3.5ms (35% faster). Btree build: 19ms,
+2.7 MB on disk. The remaining ~3ms is vector column materialization,
not index lookup — to close further would need a projection-only
fetch that skips the 768-float vector for text-only RAG retrieval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:49:17 -05:00
root
17a0259cd0 Profile-driven Lance routing — vector_backend auto-routes search + activate
activate_profile: when profile.vector_backend == Lance, auto-migrates
from Parquet if no Lance dataset exists, auto-builds IVF_PQ if no
index attached. Reuses existing Lance dataset on subsequent activations.

profile_scoped_search: routes to Lance IVF_PQ or Parquet+HNSW based
on the profile's declared backend. Callers hit the same endpoint —
the profile abstracts which storage tier serves the query.

Verified: lance-recruiter (vector_backend=lance) and parquet-recruiter
(vector_backend=parquet) both searched the same 100K index through
POST /vectors/profile/{id}/search. Lance returned lance_ivf_pq at
25ms; Parquet returned hnsw at <1ms. Same API surface, different
backends, transparent routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:40:43 -05:00
root
0d037cfac1 Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone
Five threads of work landing as one milestone — all individually
verified end-to-end against real data, full release build clean,
46 unit tests pass.

## Phase 16.2 / 16.5 — autotune agent + ingest triggers

`vectord::agent` is a long-running tokio task that watches the trial
journal and autonomously proposes + runs new HNSW configs. Distinct
from `autotune::run_autotune` (synchronous one-shot grid). Triggered
on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest
paths now push DatasetAppended events when an index's source dataset
gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-
gated so it can't saturate Ollama under live load.

The proposer is ε-greedy around the current champion: with prob 0.25
sample random from full bounds, otherwise perturb champion ± small
delta on both axes. Dedup against history. Deterministic — RNG seeded
from history.len() so the same journal state proposes the same next
config (helps offline replay debugging).

`[agent]` config section in lakehouse.toml; opt-in via enabled=true.

## Federation Layer 2 — runtime bucket lifecycle + per-index scoping

`BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so
buckets can be added/removed after startup. POST /storage/buckets
provisions at runtime; DELETE /storage/buckets/{name} unregisters
(refuses primary/rescue with 403). Local-backend buckets get their
root directory auto-created.

`IndexMeta.bucket` (default "primary" via serde) records each index's
home bucket. `TrialJournal` and `PromotionRegistry` now hold
Arc<BucketRegistry> + IndexRegistry; they resolve target store per-
index via IndexMeta.bucket. PromotionRegistry::list_all scans every
bucket and dedups by index_name. Pre-federation indexes keep working
unchanged — they just default to primary.

`ModelProfile.bucket: Option<String>` declares per-profile artifact
home. POST /vectors/profile/{id}/activate auto-provisions the
profile's bucket under storage.profile_root if not yet registered.

EvalSets stay primary-only for now — noted gap, low-risk to extend
later with the same resolver pattern.

## Phase 17 — VRAM-aware two-profile gate

Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces
immediate VRAM release), POST /admin/preload (keep_alive=5m with
empty prompt, takes the slot warm), and GET /admin/vram (combines
nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as
unload_model / preload_model / vram_snapshot.

`VectorState.active_profile` is the GPU-slot singleton —
Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for
a previous profile with a different ollama_name and unloads it
before preloading the new one; same-model reactivations skip the
unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/
deactivate (unload + clear slot), GET /vectors/profile/active.

Verified live: staffing-recruiter (qwen2.5) → docs-assistant
(mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-
embed-text persists across swaps because both profiles use it —
free optimization that fell out of the design. Scoped search
correctly 403s cross-profile in both directions.

## MySQL streaming connector

`crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL.
Pure-rust `mysql_async` driver (default-features=false to avoid C
deps). Same OFFSET pagination, same Parquet-streaming write shape.
Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float
→ Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with
fallback parsers for date/time/json/uuid via Display.

POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection,
same lineage capture (source_system="mysql"), same agent-trigger
hook. `redact_dsn` generalized — was hardcoded to "postgresql://"
length, now works for any scheme://user:pass@host/path URL (latent
PII leak fix for MySQL DSNs).

Verified live against MariaDB on localhost: 10 rows × 9 columns of
test data round-tripped through datatypes int/varchar/decimal/
tinyint/datetime/text. PII detection auto-flagged name + email.
Aggregation queries through DataFusion match the source values
exactly.

## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019)

`vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and
DataFusion 52 — incompatible with the rest of the workspace's
Arrow 55 / DataFusion 47. The firewall isolates that dep tree:
public API uses only std types (Vec<f32>, Vec<String>, Hit, Row,
*Stats), so no Arrow types cross the crate boundary and nothing
propagates to vectord. The ADR-019 path that didn't ship until now.

`vectord::lance_backend::LanceRegistry` lazy-creates a
LanceVectorStore per index, resolving bucket → URI via the
conventional local-bucket layout. `IndexMeta.vector_backend` and
`ModelProfile.vector_backend` carry the choice (default Parquet so
existing indexes unchanged).

Six routes under /vectors/lance/*:
- migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList
- index/{idx}: build IVF_PQ
- search/{idx}: vector search (embed via sidecar)
- doc/{idx}/{doc_id}: random row fetch
- append/{idx}: native fragment append
- stats/{idx}: row count + index presence

Verified live on the real resumes_100k_v2 corpus (100K × 768d):
- Migrate: 0.57s
- Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than
  HNSW's 230s for the same data)
- Search end-to-end (Ollama embed + Lance scan): 23-53ms
- Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's
  ~35ms full-file scan, slower than the bench's 311us positional
  take — would close that gap with a scalar btree on doc_id)
- Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required
  full ~330MB rewrite — the structural win
- Index survives append; both backends coexist cleanly

## Known follow-ups not in this milestone

- ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/
  {id}/search to Lance; callers go through /vectors/lance/* directly
- Scalar btree on doc_id (closes the 5-7ms → ~300us gap)
- vectord-lance built default-features=false → no S3 yet
- IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware
  variant of the eval harness
- Watcher-path ingest doesn't push agent triggers (HTTP paths do)
- EvalSets still primary-only (federation gap)
- No PATCH endpoint to move an existing index between buckets
- The pre-existing storaged::append_log doctest fails to compile
  (malformed `{prefix}/` parses as code fence) — pre-existing bug,
  left for a focused fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:24:46 -05:00
root
4d5c49090c Phase 16: Hot-swap generations + autotune agent loop
Closes the self-iteration loop from the PRD reframe: an agent can
tune HNSW configs autonomously and the winner flows through to the
next profile activation without human intervention.

Three primitives:

1. PromotionRegistry (vectord::promotion)
   - Per-index current + history at _hnsw_promotions/{index}.json
   - promote(index, entry) atomically swaps current, pushes prior
     onto history (capped at 50)
   - rollback() pops history back onto current; clears current if
     history exhausted
   - config_or(index, default) — the read side used at build time,
     returns promoted config if set else caller's default
   - Full cache + persistence; writes are durable on return

2. Autotune (vectord::autotune)
   - run_autotune(request, ...) — synchronous agent loop
   - Default grid: 5 configs covering the practical range
     (ec=20/40/80/80/160, es=30/30/30/60/30) with seed=42 for
     reproducibility
   - Every trial goes through the existing trial-journal pipeline
     so autotune runs land alongside manual trials in the
     "trials are data" log
   - Winner: max recall first, then min p50 latency; must clear
     min_recall gate (default 0.9) or no promotion happens
   - Config bounds (ec ∈ [10,400], es ∈ [10,200]) reject absurd
     values from the request's optional custom grid
   - On winner: promote with note "autotune winner: recall=X p50=Y"

3. Wiring
   - VectorState gains promotion_registry
   - activate_profile now calls promotion_registry.config_or(...)
     so newly-promoted configs are picked up on next activation —
     the "hot-swap" is: autotune promotes -> profile activates ->
     HNSW rebuilt with new config
   - New endpoints:
       POST /vectors/hnsw/promote/{index}/{trial_id}
             ?promoted_by=...&note=...
       POST /vectors/hnsw/rollback/{index}
       GET  /vectors/hnsw/promoted/{index}
       POST /vectors/hnsw/autotune  { index_name, harness,
                                      min_recall?, grid? }

End-to-end verified on threat_intel_v1 (54 vectors):
- autogen harness 'threat_intel_smoke' (10 queries)
- POST /autotune -> 5 trials in 620ms, winner ec=20 es=30
  recall=1.00 p50=64us auto-promoted
- Manual promote of ec=80 es=30 -> history depth 1
- Rollback -> back to ec=20 es=30 autotune winner
- Second rollback -> current cleared
- Re-promote + restart -> persistence verified
- Profile activation after promotion logged:
  "building HNSW ef_construction=80 ef_search=30 seed=Some(42)"
  proving the hot-swap loop is closed.

Deferred:
- Bayesian optimization (random-grid is fine at this config-space size)
- Append-triggered autotune (Phase 17.5 — refresh OnAppend policy
  can schedule autotune after appending sufficient new rows)
- Concurrent autotune per index guard (JobTracker integration)

PRD invariants satisfied: invariant 8 (hot-swappable indexes) is now
real code — promote is atomic, rollback is always available, the
active generation is a persistent pointer not a runtime convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:26:21 -05:00
root
a293502265 Phase 17: Model profiles + scoped search — the LLM-brain keystone
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.

Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
                 hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
  cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
  trial winner.
- bound_datasets can reference raw dataset names OR AiView names
  (both register as DataFusion tables with the same name, so mixing
  raw tables and PII-redacted views composes naturally)

Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
  every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile

HTTP endpoints:
- POST /catalog/profiles  (create/update)
- GET  /catalog/profiles  (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate  (HNSW hot-load)
- POST /vectors/profile/{id}/search    (scope-enforced)

Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
  working" per ADR-017

Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
  degradation — no "must activate first" friction)

Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.

End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
  threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
  → 403 "profile 'security-analyst' is not bound to 'candidates' —
    allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
  search without activation fell through to brute_force (graceful)

Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0)
  — Ollama already handles this; we don't need to reinvent
- Model-identity in audit trail — Phase 13 has role-based audit;
  adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:09:43 -05:00
root
650f5e97b6 Fix chunker UTF-8 boundary panic (causes 120GB OOM in refresh path)
The chunker's &text[start..end] slice could land inside a multi-byte
UTF-8 character (e.g. narrow no-break space \u{202f}, em-dashes, smart
quotes — universal in pg-imported editorial data). Rust panics on
non-boundary string slicing. In the refresh path that panic is caught
by tokio's task machinery but somehow causes linear memory growth at
~540MB/sec until OOM at 120GB+.

Root cause: chunk boundaries computed by byte arithmetic without
checking is_char_boundary(). The existing "look for last sentence / \n
/ space" logic finds ASCII-safe positions, but the *primary* `end`
calculation `(start + chunk_size).min(text.len())` lands wherever.

Fix:
- ceil_char_boundary(s, idx) — forward-scan to the nearest valid
  UTF-8 char boundary. Used at end, actual_end, and next_start.
- Iteration cap — break if iterations exceed text.len(). Any
  non-progressing loop dies safely instead of burning memory.
- Forced forward advance — if overlap + boundary math produce a
  next_start <= start, force +1 char to guarantee termination.

Reproduced on kb_team_runs (585 pg-imported prompts with editorial
unicode): previous run grew memory linearly to 124GB over 240s then
OOM-killed. Same request after fix: peaks at <100MB, completes in
~4m42s to produce 12,693 embeddings. /vectors/search returns
relevant results.

Regression tests added:
- handles_multibyte_utf8_at_chunk_boundary — exact \u{202f} repro
- no_infinite_loop_on_no_spaces — 5KB text, no whitespace
- no_infinite_loop_on_degenerate_params — chunk_size == overlap

Surfaced by Phase C, but pre-existed as a latent bug since Phase 7.
Any Ollama-targeted RAG corpus with non-ASCII content would have hit
this once it grew past ~13KB per document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:27:17 -05:00
root
97a376482c Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
  deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
  operators can see which datasets expect what, but the cron itself is
  separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:00:43 -05:00
root
dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00
root
04770c97eb HNSW vector index: 100K search in 27ms (58x faster than brute-force)
- instant-distance HNSW implementation for approximate nearest neighbors
- HnswStore: build from stored embeddings, in-memory index, thread-safe
- POST /vectors/hnsw/build — build index from Parquet (100K in 35s release)
- POST /vectors/hnsw/search — fast ANN search
- GET /vectors/hnsw/list — list loaded indexes

Benchmark (100K × 768d, release build):
  Brute-force: 1,567ms
  HNSW:           31ms (50x)
  HNSW warm:      27ms (58x)

Build cost: 35s one-time for 100K vectors (release mode)
ef_construction=40, ef_search=50 — good recall/speed balance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:00:50 -05:00
root
6cd1daeb51 Phase 11: Embedding versioning — model-proof vector layer
- IndexRegistry: tracks all vector indexes with model metadata
  (model_name, model_version, dimensions, build stats)
- Index metadata persisted as JSON in vectors/meta/
- Rebuilt on startup for crash recovery
- GET /vectors/indexes — list all indexes (filter by source/model)
- GET /vectors/indexes/{name} — get index metadata
- Background jobs auto-register metadata on completion
- Multi-version support: same data, different models, coexist
- Per ADR-014: enables incremental re-embed on model upgrade

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:27:10 -05:00
root
3b695cd592 Dual-pipeline supervisor for embedding ingestion
- 4 parallel pipelines (tuned for i9 + A4000)
- Range-based work splitting (2500 chunks per range)
- Round-robin retry on failure (3 attempts before dead-letter)
- Checkpointing to disk every 1000 chunks (crash recovery)
- On restart, loads checkpoint and skips completed ranges
- Dead-letter queue for permanently failed ranges
- Vectors assembled in order after all pipelines finish
- Batch size 64 for GPU throughput

Architecture:
  Supervisor → splits 100K chunks into 40 ranges
    ├── Pipeline 0: grabs range, embeds, reports progress
    ├── Pipeline 1: grabs range, embeds, reports progress
    ├── Pipeline 2: grabs range, embeds, reports progress
    └── Pipeline 3: grabs range, embeds, reports progress
    Failed range → back to queue → next available pipeline retries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:06:28 -05:00
root
6a532cb248 Background job system for embedding — fixes 100K timeout
- JobTracker: create/update/complete/fail jobs with progress tracking
- POST /vectors/index now returns immediately with job_id (HTTP 202)
- Embedding runs in tokio::spawn background task
- GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA)
- GET /vectors/jobs lists all jobs
- Progress logged every 100 batches with chunks/sec and ETA
- 100K embedding job running successfully at 44 chunks/sec
- System stays responsive during embedding (queries in 23ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:03:07 -05:00
root
26fc98c885 Phase 7: Vector index + RAG pipeline
- vectord crate: chunk → embed → store → search → RAG
- chunker: configurable chunk size + overlap, sentence-boundary aware splitting
- store: embeddings as Parquet (binary blob f32 vectors), portable format
- search: brute-force cosine similarity (works up to ~100K vectors)
- rag: full pipeline — embed question → search index → retrieve context → LLM answer
- Endpoints: POST /vectors/index, /vectors/search, /vectors/rag
- Gateway wired with vectord service
- Tested: 200 candidate resumes indexed in 5.4s, semantic search + RAG working
- 20 unit tests passing (chunker, search, ingestd, shared)
- AI gives honest "no match found" when context doesn't support an answer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:12:28 -05:00