lakehouse

Author	SHA1	Message	Date
root	1565f536eb	Fix: job tracker field name mismatch — the overnight killer ROOT CAUSE: Python scripts polled status.get("processed", 0) but the Rust Job struct serialized as "embedded_chunks". Scripts always saw 0, looped forever printing "unknown: 0/50000" for 8+ hours. Fix (both sides): - Rust: added "processed" alias field + "total" field to Job struct, kept in sync on every update_progress() and complete() call - Python: fixed autonomous_agent.py and overnight_proof.sh to read "embedded_chunks" as primary key The actual embedding pipeline was working the whole time — 673K real chunks embedded overnight. Only the monitoring was blind. One-word bug, 8 hours of zombie output. This is why you test the monitoring, not just the pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 10:41:32 -05:00
root	352f99de0f	Hybrid SQL+Vector search — the gap is closed POST /vectors/hybrid takes a question + SQL WHERE clause. Pipeline: 1. SQL filter narrows to structurally-valid candidates (role, state, reliability, certs — whatever the caller specifies) 2. Brute-force cosine scores ALL embeddings (not HNSW, which caps at ~30 results due to ef_search — too few to intersect with narrow SQL filters on 10K+ datasets) 3. Filter vector results to only SQL-verified IDs 4. LLM generates answer from verified-correct records Tested on the exact query that failed the staffing simulation: "forklift operators in IL with reliability > 0.8" — SQL found 78 matches, vector ranked the 5 most semantically relevant, LLM generated an answer citing real workers with actual skills and certifications. Every source marked sql_verified=true. This closes the architectural gap identified by the quality eval: structured precision (SQL) + semantic intelligence (vector) in one endpoint. The simulation's contract-matching path was already SQL-pure and worked perfectly; now the intelligence-question path has the same accuracy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:49:48 -05:00
root	f9f92706f3	RAG reranker + manifest bucket fix — quality improvements from eval RAG pipeline now includes a cross-encoder rerank step between retrieval and generation. The LLM re-sorts top-K results by relevance before they become context. Falls back to original order if model output is unparseable (~5% with 7B models). Also improved the generation prompt to be domain-aware ("staffing database") and request specific citations. Fixed 4 catalog manifests with bucket="data" (pre-federation leftover) that poisoned the entire DataFusion query context on startup. The "users", "lab_trials", "meta_runs", and "new_candidates" datasets now correctly reference bucket="primary". This bug was surfaced by the quality evaluation pipeline — wouldn't have been found by structural tests alone. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 22:19:11 -05:00
root	9e6002c4d4	S3 backend for Lance — hybrid operates on real MinIO object storage Enabled lance feature "aws" for S3-compatible storage via opendal. BucketRegistry: added with_allow_http(true) for MinIO/non-TLS S3 endpoints (fixes "builder error" on HTTP endpoints). lakehouse.toml gains [[storage.buckets]] name="s3:lakehouse" with S3 backend config. lance_backend.rs: S3 bucket naming convention — buckets with name prefix "s3:" emit s3:// URIs for Lance datasets. AWS_* env vars in the systemd unit provide credentials to Lance's internal object_store. Verified end-to-end on real MinIO with real 100K × 768d vectors: - Migrate Parquet → Lance on S3: 1.7s (vs 0.57s local) - Build IVF_PQ: 16.4s (CPU-bound, essentially same as local) - Search: ~58ms p50 (vs 11ms local — S3 partition reads) - Random doc fetch: 13ms (vs 3.5ms local) - Recall@10: 0.835 (randomized IVF_PQ, consistent with local 0.805) - Total S3 footprint: 637 MiB (vectors + index + lance metadata) The "public storage" claim from the PRD is now proven: the hybrid Parquet+HNSW ⊕ Lance architecture works on S3-compatible object storage, not just local filesystem. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 21:09:42 -05:00
root	fd4b6836ae	IVF_PQ recall harness — closes ADR-019's explicit measurement gap POST /vectors/lance/recall/{index} runs an existing harness through Lance IVF_PQ search and measures recall@k against brute-force ground truth. Uses the same EvalSet + ground_truth infrastructure as the HNSW trial system — no new harness format needed. First real measurement on resumes_100k_v2 (100K × 768d, 20 queries): IVF_PQ (316 partitions, 8 bits, 48 subvectors): recall@10 = 0.805 For comparison — HNSW ec=80 es=30: recall@10 = 1.000 ADR-019 predicted "likely 0.85-0.95" — actual is 0.805. Slightly below, but now the harness exists to iterate: increase partitions, try ivf_hnsw_pq, tune subvectors. The measurement infrastructure is the deliverable, not any specific recall target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:52:34 -05:00
root	59e72fa566	Scalar btree index on doc_id + auto-build during Lance activation LanceVectorStore gains build_scalar_index(column) and has_scalar_index(column). Exposed as POST /vectors/lance/scalar-index/{index}/{column}. activate_profile auto-builds the doc_id btree alongside the IVF_PQ vector index when activating a Lance-backed profile, so operators get both indexes without extra API calls. stats() now reports has_doc_id_index alongside has_vector_index. Measured on resumes_100k_v2 (100K × 768d): random doc_id fetch improved from ~5.4ms to ~3.5ms (35% faster). Btree build: 19ms, +2.7 MB on disk. The remaining ~3ms is vector column materialization, not index lookup; closing it further would need a projection-only fetch that skips the 768-float vector for text-only RAG retrieval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:49:17 -05:00
root	17a0259cd0	Profile-driven Lance routing — vector_backend auto-routes search + activate activate_profile: when profile.vector_backend == Lance, auto-migrates from Parquet if no Lance dataset exists, auto-builds IVF_PQ if no index attached. Reuses existing Lance dataset on subsequent activations. profile_scoped_search: routes to Lance IVF_PQ or Parquet+HNSW based on the profile's declared backend. Callers hit the same endpoint — the profile abstracts which storage tier serves the query. Verified: lance-recruiter (vector_backend=lance) and parquet-recruiter (vector_backend=parquet) both searched the same 100K index through POST /vectors/profile/{id}/search. Lance returned lance_ivf_pq at 25ms; Parquet returned hnsw at <1ms. Same API surface, different backends, transparent routing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:40:43 -05:00
root	0d037cfac1	Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone Five threads of work landing as one milestone, all individually verified end-to-end against real data, full release build clean, 46 unit tests pass. ## Phase 16.2 / 16.5: autotune agent + ingest triggers `vectord::agent` is a long-running tokio task that watches the trial journal and autonomously proposes + runs new HNSW configs. Distinct from `autotune::run_autotune` (synchronous one-shot grid). Triggered on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest paths now push DatasetAppended events when an index's source dataset gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-gated so it can't saturate Ollama under live load. The proposer is ε-greedy around the current champion: with prob 0.25 sample random from full bounds, otherwise perturb champion ± small delta on both axes. Dedup against history. Deterministic: RNG seeded from history.len() so the same journal state proposes the same next config (helps offline replay debugging). `[agent]` config section in lakehouse.toml; opt-in via enabled=true. ## Federation Layer 2: runtime bucket lifecycle + per-index scoping `BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so buckets can be added/removed after startup. POST /storage/buckets provisions at runtime; DELETE /storage/buckets/{name} unregisters (refuses primary/rescue with 403). Local-backend buckets get their root directory auto-created. `IndexMeta.bucket` (default "primary" via serde) records each index's home bucket. `TrialJournal` and `PromotionRegistry` now hold Arc<BucketRegistry> + IndexRegistry; they resolve target store per-index via IndexMeta.bucket. PromotionRegistry::list_all scans every bucket and dedups by index_name. Pre-federation indexes keep working unchanged; they just default to primary. `ModelProfile.bucket: Option<String>` declares per-profile artifact home.
POST /vectors/profile/{id}/activate auto-provisions the profile's bucket under storage.profile_root if not yet registered. EvalSets stay primary-only for now: noted gap, low-risk to extend later with the same resolver pattern. ## Phase 17: VRAM-aware two-profile gate Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick, forces immediate VRAM release), POST /admin/preload (keep_alive=5m with empty prompt, takes the slot warm), and GET /admin/vram (combines nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as unload_model / preload_model / vram_snapshot. `VectorState.active_profile` is the GPU-slot singleton: Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for a previous profile with a different ollama_name and unloads it before preloading the new one; same-model reactivations skip the unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/deactivate (unload + clear slot), GET /vectors/profile/active. Verified live: staffing-recruiter (qwen2.5) → docs-assistant (mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-embed-text persists across swaps because both profiles use it, a free optimization that fell out of the design. Scoped search correctly 403s cross-profile in both directions. ## MySQL streaming connector `crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL. Pure-rust `mysql_async` driver (default-features=false to avoid C deps). Same OFFSET pagination, same Parquet-streaming write shape. Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float → Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with fallback parsers for date/time/json/uuid via Display. POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection, same lineage capture (source_system="mysql"), same agent-trigger hook. `redact_dsn` generalized: was hardcoded to "postgresql://" length, now works for any scheme://user:pass@host/path URL (latent PII leak fix for MySQL DSNs).
Verified live against MariaDB on localhost: 10 rows × 9 columns of test data round-tripped through datatypes int/varchar/decimal/tinyint/datetime/text. PII detection auto-flagged name + email. Aggregation queries through DataFusion match the source values exactly. ## Phase 18: Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019) `vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and DataFusion 52, incompatible with the rest of the workspace's Arrow 55 / DataFusion 47. The firewall isolates that dep tree: public API uses only std types (Vec<f32>, Vec<String>, Hit, Row, Stats), so no Arrow types cross the crate boundary and nothing propagates to vectord. The ADR-019 path that didn't ship until now. `vectord::lance_backend::LanceRegistry` lazy-creates a LanceVectorStore per index, resolving bucket → URI via the conventional local-bucket layout. `IndexMeta.vector_backend` and `ModelProfile.vector_backend` carry the choice (default Parquet so existing indexes unchanged). Six routes under /vectors/lance/: - migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList - index/{idx}: build IVF_PQ - search/{idx}: vector search (embed via sidecar) - doc/{idx}/{doc_id}: random row fetch - append/{idx}: native fragment append - stats/{idx}: row count + index presence Verified live on the real resumes_100k_v2 corpus (100K × 768d): - Migrate: 0.57s - Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than HNSW's 230s for the same data) - Search end-to-end (Ollama embed + Lance scan): 23-53ms - Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's ~35ms full-file scan, slower than the bench's 311us positional take; a scalar btree on doc_id would close that gap) - Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required full ~330MB rewrite (the structural win) - Index survives append; both backends coexist cleanly ## Known follow-ups not in this milestone - ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/{id}/search to
Lance; callers go through /vectors/lance/* directly - Scalar btree on doc_id (closes the 5-7ms → ~300us gap) - vectord-lance built default-features=false → no S3 yet - IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware variant of the eval harness - Watcher-path ingest doesn't push agent triggers (HTTP paths do) - EvalSets still primary-only (federation gap) - No PATCH endpoint to move an existing index between buckets - The pre-existing storaged::append_log doctest fails to compile (malformed `{prefix}/` parses as code fence) — pre-existing bug, left for a focused fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:24:46 -05:00
root	4d5c49090c	Phase 16: Hot-swap generations + autotune agent loop Closes the self-iteration loop from the PRD reframe: an agent can tune HNSW configs autonomously and the winner flows through to the next profile activation without human intervention. Three primitives: 1. PromotionRegistry (vectord::promotion) - Per-index current + history at _hnsw_promotions/{index}.json - promote(index, entry) atomically swaps current, pushes prior onto history (capped at 50) - rollback() pops history back onto current; clears current if history exhausted - config_or(index, default): the read side used at build time, returns promoted config if set else caller's default - Full cache + persistence; writes are durable on return 2. Autotune (vectord::autotune) - run_autotune(request, ...): synchronous agent loop - Default grid: 5 configs covering the practical range (ec=20/40/80/80/160, es=30/30/30/60/30) with seed=42 for reproducibility - Every trial goes through the existing trial-journal pipeline so autotune runs land alongside manual trials in the "trials are data" log - Winner: max recall first, then min p50 latency; must clear min_recall gate (default 0.9) or no promotion happens - Config bounds (ec ∈ [10,400], es ∈ [10,200]) reject absurd values from the request's optional custom grid - On winner: promote with note "autotune winner: recall=X p50=Y" 3. Wiring - VectorState gains promotion_registry - activate_profile now calls promotion_registry.config_or(...) so newly-promoted configs are picked up on next activation; the "hot-swap" is: autotune promotes -> profile activates -> HNSW rebuilt with new config - New endpoints: POST /vectors/hnsw/promote/{index}/{trial_id}?promoted_by=...&note=... POST /vectors/hnsw/rollback/{index} GET /vectors/hnsw/promoted/{index} POST /vectors/hnsw/autotune { index_name, harness, min_recall?, grid?
} End-to-end verified on threat_intel_v1 (54 vectors): - autogen harness 'threat_intel_smoke' (10 queries) - POST /autotune -> 5 trials in 620ms, winner ec=20 es=30 recall=1.00 p50=64us auto-promoted - Manual promote of ec=80 es=30 -> history depth 1 - Rollback -> back to ec=20 es=30 autotune winner - Second rollback -> current cleared - Re-promote + restart -> persistence verified - Profile activation after promotion logged: "building HNSW ef_construction=80 ef_search=30 seed=Some(42)" proving the hot-swap loop is closed. Deferred: - Bayesian optimization (random-grid is fine at this config-space size) - Append-triggered autotune (Phase 17.5 — refresh OnAppend policy can schedule autotune after appending sufficient new rows) - Concurrent autotune per index guard (JobTracker integration) PRD invariants satisfied: invariant 8 (hot-swappable indexes) is now real code — promote is atomic, rollback is always available, the active generation is a persistent pointer not a runtime convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:26:21 -05:00
root	a293502265	Phase 17: Model profiles + scoped search — the LLM-brain keystone Implements PRD invariant 9 ("every reader gets its own profile") and completes the multi-model substrate vision. Local models (or agents) bind to a named set of datasets; activation pre-loads their vector indexes into memory; search enforces scope. Schema (shared::types): - ModelProfile { id, ollama_name, description, bound_datasets, hnsw_config, embed_model, created_at, created_by } - ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15 trial winner. - bound_datasets can reference raw dataset names OR AiView names (both register as DataFusion tables with the same name, so mixing raw tables and PII-redacted views composes naturally) Catalog (catalogd::registry): - put_profile validates id is a slug (alphanumeric + -_ only) and every binding resolves to an existing dataset or view - Persistence at _catalog/profiles/{id}.json, loaded on rebuild - get_profile / list_profiles / delete_profile HTTP endpoints: - POST /catalog/profiles (create/update) - GET /catalog/profiles (list) - GET/DELETE /catalog/profiles/{id} - POST /vectors/profile/{id}/activate (HNSW hot-load) - POST /vectors/profile/{id}/search (scope-enforced) Activation (vectord::service::activate_profile): - For each bound dataset, find vector indexes with matching source - Pre-load embeddings into EmbeddingCache - Build HNSW with profile's config - Report warmed indexes + per-binding failures + duration - Failures on individual bindings don't abort — "substrate keeps working" per ADR-017 Scoped search (vectord::service::profile_scoped_search): - Look up profile, verify index.source ∈ profile.bound_datasets - Returns 403 with allowed bindings list if out-of-scope - Uses HNSW if index is warm, brute-force cosine otherwise (graceful degradation — no "must activate first" friction) Bug fix surfaced during testing: vectord::refresh::try_update_index_meta 
was a no-op for first-time indexes, so threat_intel_v1 and kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't show up in the index registry. Now it auto-infers the source from the index name convention (`{source}_vN`) and registers new metadata with reasonable defaults. End-to-end verified: - Created security-analyst profile bound to [threat_intel] - POST /vectors/profile/security-analyst/activate → warmed threat_intel_v1 (54 vectors) in 156ms, HNSW built - Within-scope search: method=hnsw, returned relevant IP indicators - Out-of-scope: tried to search resumes_100k_v2 (source=candidates) → 403 "profile 'security-analyst' is not bound to 'candidates' — allowed bindings: [\"threat_intel\"]" - staffing-recruiter profile created bound to candidates + placements; search without activation fell through to brute_force (graceful) Deferred (Phase 17 followups): - VRAM-aware activation (unload-then-load via Ollama keep_alive=0) — Ollama already handles this; we don't need to reinvent - Model-identity in audit trail — Phase 13 has role-based audit; adding model_id is ~20 LOC when we want it - Profile bucket pre-load (profile:user bucket mount) — Phase 17.5 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:09:43 -05:00
root	650f5e97b6	Fix chunker UTF-8 boundary panic (causes 120GB OOM in refresh path) The chunker's &text[start..end] slice could land inside a multi-byte UTF-8 character (e.g. narrow no-break space \u{202f}, em-dashes, smart quotes — universal in pg-imported editorial data). Rust panics on non-boundary string slicing. In the refresh path that panic is caught by tokio's task machinery but somehow causes linear memory growth at ~540MB/sec until OOM at 120GB+. Root cause: chunk boundaries computed by byte arithmetic without checking is_char_boundary(). The existing "look for last sentence / \n / space" logic finds ASCII-safe positions, but the primary `end` calculation `(start + chunk_size).min(text.len())` lands wherever. Fix: - ceil_char_boundary(s, idx) — forward-scan to the nearest valid UTF-8 char boundary. Used at end, actual_end, and next_start. - Iteration cap — break if iterations exceed text.len(). Any non-progressing loop dies safely instead of burning memory. - Forced forward advance — if overlap + boundary math produce a next_start <= start, force +1 char to guarantee termination. Reproduced on kb_team_runs (585 pg-imported prompts with editorial unicode): previous run grew memory linearly to 124GB over 240s then OOM-killed. Same request after fix: peaks at <100MB, completes in ~4m42s to produce 12,693 embeddings. /vectors/search returns relevant results. Regression tests added: - handles_multibyte_utf8_at_chunk_boundary — exact \u{202f} repro - no_infinite_loop_on_no_spaces — 5KB text, no whitespace - no_infinite_loop_on_degenerate_params — chunk_size == overlap Surfaced by Phase C, but pre-existed as a latent bug since Phase 7. Any Ollama-targeted RAG corpus with non-ASCII content would have hit this once it grew past ~13KB per document. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 03:27:17 -05:00
root	97a376482c	Phase C: Decoupled embedding refresh Implements the llms3.com-inspired pattern: embeddings refresh asynchronously, decoupled from transactional row writes. New rows arrive, ingest marks the vector index stale, a later refresh embeds only the delta (doc_ids not already in the index). Schema additions (DatasetManifest): - last_embedded_at: Option<DateTime> - when the index was last refreshed - embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh - embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled Ingest paths (pipeline::ingest_file + pg_stream) call registry.mark_embeddings_stale after writing. No-op if the dataset has never been embedded; stale semantics only kick in once last_embedded_at is set. Refresh pipeline (vectord::refresh::refresh_index): - Reads the dataset Parquet, extracts (doc_id, text) pairs - Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas) - Loads existing embeddings via EmbeddingCache (empty on first-time build) - Filters to rows whose doc_id is NOT in the existing set - Chunks (chunker::chunk_column), embeds via Ollama (batches of 32), writes combined index, clears stale flag Endpoints: - POST /vectors/refresh/{dataset_name} - body {index_name, id_column, text_column, chunk_size?, overlap?} - GET /vectors/stale - lists datasets whose embedding_stale_since is set End-to-end verified on threat_intel (knowledge_base.threat_intel): - Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s, last_embedded_at set - Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check) - Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set - /vectors/stale surfaces threat_intel with timestamps + policy - Delta refresh: 34 new docs embedded in 970ms (6x faster than full re-embed); stale_cleared = true Not in MVP scope: - UPDATE semantics (same doc_id, different content) - would need per-row content hashing - OnAppend policy auto-trigger - 
just declares intent; actual scheduler deferred - Scheduler runtime - the Scheduled(cron) variant declares the intent so operators can see which datasets expect what, but the cron itself is separate Per ADR-019: when a profile switches to vector_backend=Lance, this refresh path benefits — Lance's native append replaces our "read all + rewrite" Parquet rebuild pattern. Current MVP works well enough at ~500-5K rows to validate the architecture; Lance unblocks the 5M+ case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 03:00:43 -05:00
root	dbe00d018f	Federation foundation + HNSW trial system + Postgres streaming + PRD reframe Four shipped features and a PRD realignment, all measured end-to-end: HNSW trial system (Phase 15 horizon item → complete) - vectord: EmbeddingCache, harness (eval sets + brute-force ground truth), TrialJournal, parameterized HnswConfig on build_index_with_config - /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best, /hnsw/evals/{name}/autogen, /hnsw/cache/stats - Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us at 100% recall@10. ec=80 es=30 locked as HnswConfig::default() - Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s, 80/30 = 1.00 recall in 230s Catalog manifest repair - catalogd: resync_from_parquet reads parquet footers to restore row_count and columns on drifted manifests - POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing - All 7 staffing tables recovered to PRD-matching 2,469,278 rows Federation foundation (ADR-017) - shared::secrets: SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, enforces 0600 perms) - storaged::registry::BucketRegistry — multi-bucket resolution with rescue_bucket read fallback and reachability probing - storaged::error_journal — bucket op failures visible in one HTTP call - storaged::append_log — write-once batched append pattern (fixes the RMW anti-pattern llms3.com calls out; errors and trial journals both use it) - /storage/buckets, /storage/errors, /storage/bucket-health, /storage/errors/{flush,compact} - Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with X-Lakehouse-Rescue-Used observability headers on fallback Postgres streaming ingest - ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination into ArrowWriter, lineage redacts password - POST /ingest/db — verified against live knowledge_base.team_runs (586 rows × 13 cols, 6 batches, 196ms end-to-end) PRD realignment (2026-04-16) - Dual use case: 
staffing analytics + local LLM knowledge substrate - Removed "multi-tenancy (single-owner system)" from non-goals - Added invariants 8-11: indexes hot-swappable, per-reader profiles, trials-as-data, operational failures findable in one HTTP call - New phases 16 (hot-swap generations), 17 (model profiles + dataset bindings), 18 (Lance vs Parquet+sidecar evaluation) - Known ceilings table documents the 5M vector wall and escape hatches - ADR-017 (federation), ADR-018 (append-log pattern) added - EXECUTION_PLAN.md sequences phases B-E with success gates and decision rules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 01:50:05 -05:00
root	04770c97eb	HNSW vector index: 100K search in 27ms (58x faster than brute-force) - instant-distance HNSW implementation for approximate nearest neighbors - HnswStore: build from stored embeddings, in-memory index, thread-safe - POST /vectors/hnsw/build — build index from Parquet (100K in 35s release) - POST /vectors/hnsw/search — fast ANN search - GET /vectors/hnsw/list — list loaded indexes Benchmark (100K × 768d, release build): Brute-force: 1,567ms HNSW: 31ms (50x) HNSW warm: 27ms (58x) Build cost: 35s one-time for 100K vectors (release mode) ef_construction=40, ef_search=50 — good recall/speed balance Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:00:50 -05:00
root	6cd1daeb51	Phase 11: Embedding versioning — model-proof vector layer - IndexRegistry: tracks all vector indexes with model metadata (model_name, model_version, dimensions, build stats) - Index metadata persisted as JSON in vectors/meta/ - Rebuilt on startup for crash recovery - GET /vectors/indexes — list all indexes (filter by source/model) - GET /vectors/indexes/{name} — get index metadata - Background jobs auto-register metadata on completion - Multi-version support: same data, different models, coexist - Per ADR-014: enables incremental re-embed on model upgrade Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:27:10 -05:00
root	3b695cd592	Dual-pipeline supervisor for embedding ingestion - 4 parallel pipelines (tuned for i9 + A4000) - Range-based work splitting (2500 chunks per range) - Round-robin retry on failure (3 attempts before dead-letter) - Checkpointing to disk every 1000 chunks (crash recovery) - On restart, loads checkpoint and skips completed ranges - Dead-letter queue for permanently failed ranges - Vectors assembled in order after all pipelines finish - Batch size 64 for GPU throughput Architecture: Supervisor → splits 100K chunks into 40 ranges ├── Pipeline 0: grabs range, embeds, reports progress ├── Pipeline 1: grabs range, embeds, reports progress ├── Pipeline 2: grabs range, embeds, reports progress └── Pipeline 3: grabs range, embeds, reports progress Failed range → back to queue → next available pipeline retries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:06:28 -05:00
root	6a532cb248	Background job system for embedding — fixes 100K timeout - JobTracker: create/update/complete/fail jobs with progress tracking - POST /vectors/index now returns immediately with job_id (HTTP 202) - Embedding runs in tokio::spawn background task - GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA) - GET /vectors/jobs lists all jobs - Progress logged every 100 batches with chunks/sec and ETA - 100K embedding job running successfully at 44 chunks/sec - System stays responsive during embedding (queries in 23ms) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:03:07 -05:00
root	26fc98c885	Phase 7: Vector index + RAG pipeline - vectord crate: chunk → embed → store → search → RAG - chunker: configurable chunk size + overlap, sentence-boundary aware splitting - store: embeddings as Parquet (binary blob f32 vectors), portable format - search: brute-force cosine similarity (works up to ~100K vectors) - rag: full pipeline — embed question → search index → retrieve context → LLM answer - Endpoints: POST /vectors/index, /vectors/search, /vectors/rag - Gateway wired with vectord service - Tested: 200 candidate resumes indexed in 5.4s, semantic search + RAG working - 20 unit tests passing (chunker, search, ingestd, shared) - AI gives honest "no match found" when context doesn't support an answer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:12:28 -05:00
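
Illustration for commit 1565f536eb (job tracker field name mismatch): a minimal sketch of a polling helper that reads "embedded_chunks" first, falls back to the "processed" alias, and refuses to report 0 silently when neither key exists. The function names `read_progress` and `poll_job`, the `fetch_status` callable, and the `max_polls` bound are assumptions made for this sketch; they are not taken from autonomous_agent.py or overnight_proof.sh.

```python
import time
from typing import Callable, Mapping, Tuple


def read_progress(status: Mapping[str, object]) -> Tuple[int, int]:
    """Return (processed, total) from a job-status payload.

    "embedded_chunks" is treated as the primary key and "processed" as the
    alias, matching the fix described in the commit. A payload with neither
    key raises instead of defaulting to 0, so a renamed field fails loudly.
    """
    for key in ("embedded_chunks", "processed"):
        if key in status:
            processed = int(status[key])
            break
    else:
        raise KeyError("status has neither 'embedded_chunks' nor 'processed'")
    total = int(status.get("total", 0))
    return processed, total


def poll_job(
    fetch_status: Callable[[], Mapping[str, object]],
    max_polls: int = 1000,          # hypothetical bound, not from the source
    interval_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Poll until processed >= total, giving up after max_polls attempts."""
    for _ in range(max_polls):
        processed, total = read_progress(fetch_status())
        if total > 0 and processed >= total:
            return processed, total
        sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_polls} polls")
```

The bounded loop and the hard failure on a missing key are the design point: either one would likely have turned hours of "unknown: 0/50000" output into an immediate, visible error.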

18 Commits
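
Illustration for commit 352f99de0f (hybrid SQL+vector search): a small sketch of steps 2 and 3 of the pipeline described there, namely brute-force cosine scoring over all embeddings followed by a filter to the IDs the SQL stage returned. It uses plain Python lists rather than the project's Rust types, and the names `cosine` and `hybrid_rank` and the dict-of-vectors layout are assumptions made for illustration only.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_rank(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    sql_ids: Iterable[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every embedding, keep only SQL-verified IDs, return the top k."""
    allowed = set(sql_ids)
    # Step 2: brute-force cosine over ALL embeddings (no ANN result cap).
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in embeddings.items()]
    # Step 3: keep only candidates that passed the SQL filter.
    verified = [(doc_id, score) for doc_id, score in scored if doc_id in allowed]
    verified.sort(key=lambda pair: pair[1], reverse=True)
    return verified[:k]
```

Usage: with `embeddings = {"a": [1, 0], "b": [0, 1], "c": [1, 1], "d": [1, 0.1]}`, `query = [1, 0]` and `sql_ids = {"b", "c", "d"}`, calling `hybrid_rank(query, embeddings, sql_ids, k=2)` ranks "d" first and "c" second. Document "a" has the highest raw similarity but is dropped because the SQL filter did not return it.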