lakehouse

Author	SHA1	Message	Date
root	0d037cfac1	Phases 16.2 + L2 + 17 VRAM gate + MySQL + 18 Lance hybrid milestone Five threads of work landing as one milestone — all individually verified end-to-end against real data, full release build clean, 46 unit tests pass. ## Phase 16.2 / 16.5 — autotune agent + ingest triggers `vectord::agent` is a long-running tokio task that watches the trial journal and autonomously proposes + runs new HNSW configs. Distinct from `autotune::run_autotune` (synchronous one-shot grid). Triggered on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest paths now push DatasetAppended events when an index's source dataset gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown- gated so it can't saturate Ollama under live load. The proposer is ε-greedy around the current champion: with prob 0.25 sample random from full bounds, otherwise perturb champion ± small delta on both axes. Dedup against history. Deterministic — RNG seeded from history.len() so the same journal state proposes the same next config (helps offline replay debugging). `[agent]` config section in lakehouse.toml; opt-in via enabled=true. ## Federation Layer 2 — runtime bucket lifecycle + per-index scoping `BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so buckets can be added/removed after startup. POST /storage/buckets provisions at runtime; DELETE /storage/buckets/{name} unregisters (refuses primary/rescue with 403). Local-backend buckets get their root directory auto-created. `IndexMeta.bucket` (default "primary" via serde) records each index's home bucket. `TrialJournal` and `PromotionRegistry` now hold Arc<BucketRegistry> + IndexRegistry; they resolve target store per- index via IndexMeta.bucket. PromotionRegistry::list_all scans every bucket and dedups by index_name. Pre-federation indexes keep working unchanged — they just default to primary. `ModelProfile.bucket: Option<String>` declares per-profile artifact home. POST /vectors/profile/{id}/activate auto-provisions the profile's bucket under storage.profile_root if not yet registered. EvalSets stay primary-only for now — noted gap, low-risk to extend later with the same resolver pattern. ## Phase 17 — VRAM-aware two-profile gate Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces immediate VRAM release), POST /admin/preload (keep_alive=5m with empty prompt, takes the slot warm), and GET /admin/vram (combines nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as unload_model / preload_model / vram_snapshot. `VectorState.active_profile` is the GPU-slot singleton — Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for a previous profile with a different ollama_name and unloads it before preloading the new one; same-model reactivations skip the unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/ deactivate (unload + clear slot), GET /vectors/profile/active. Verified live: staffing-recruiter (qwen2.5) → docs-assistant (mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic- embed-text persists across swaps because both profiles use it — free optimization that fell out of the design. Scoped search correctly 403s cross-profile in both directions. ## MySQL streaming connector `crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL. Pure-rust `mysql_async` driver (default-features=false to avoid C deps). Same OFFSET pagination, same Parquet-streaming write shape. Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float → Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with fallback parsers for date/time/json/uuid via Display. POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection, same lineage capture (source_system="mysql"), same agent-trigger hook. `redact_dsn` generalized — was hardcoded to "postgresql://" length, now works for any scheme://user:pass@host/path URL (latent PII leak fix for MySQL DSNs). Verified live against MariaDB on localhost: 10 rows × 9 columns of test data round-tripped through datatypes int/varchar/decimal/ tinyint/datetime/text. PII detection auto-flagged name + email. Aggregation queries through DataFusion match the source values exactly. ## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019) `vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and DataFusion 52 — incompatible with the rest of the workspace's Arrow 55 / DataFusion 47. The firewall isolates that dep tree: public API uses only std types (Vec<f32>, Vec<String>, Hit, Row, Stats), so no Arrow types cross the crate boundary and nothing propagates to vectord. The ADR-019 path that didn't ship until now. `vectord::lance_backend::LanceRegistry` lazy-creates a LanceVectorStore per index, resolving bucket → URI via the conventional local-bucket layout. `IndexMeta.vector_backend` and `ModelProfile.vector_backend` carry the choice (default Parquet so existing indexes unchanged). Six routes under /vectors/lance/: - migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList - index/{idx}: build IVF_PQ - search/{idx}: vector search (embed via sidecar) - doc/{idx}/{doc_id}: random row fetch - append/{idx}: native fragment append - stats/{idx}: row count + index presence Verified live on the real resumes_100k_v2 corpus (100K × 768d): - Migrate: 0.57s - Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than HNSW's 230s for the same data) - Search end-to-end (Ollama embed + Lance scan): 23-53ms - Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's ~35ms full-file scan, slower than the bench's 311us positional take — would close that gap with a scalar btree on doc_id) - Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required full ~330MB rewrite — the structural win - Index survives append; both backends coexist cleanly ## Known follow-ups not in this milestone - ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/ {id}/search to Lance; callers go through /vectors/lance/* directly - Scalar btree on doc_id (closes the 5-7ms → ~300us gap) - vectord-lance built default-features=false → no S3 yet - IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware variant of the eval harness - Watcher-path ingest doesn't push agent triggers (HTTP paths do) - EvalSets still primary-only (federation gap) - No PATCH endpoint to move an existing index between buckets - The pre-existing storaged::append_log doctest fails to compile (malformed `{prefix}/` parses as code fence) — pre-existing bug, left for a focused fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 20:24:46 -05:00
root	24f1249a62	Federation layer 2: header routing + cross-bucket SQL Three pieces of the multi-bucket federation made real: 1. Catalog migration (POST /catalog/migrate-buckets) - One-shot normalizer for ObjectRef.bucket field - Empty -> "primary"; legacy "data"/"local" -> "primary" - Idempotent; re-running on canonical state is no-op - Ran on existing catalog: 12 refs renamed from "data", 2 already "primary", all 14 now canonical 2. X-Lakehouse-Bucket header middleware on ingest - resolve_bucket() helper extracts header, returns (bucket_name, store) or 404 with valid bucket list - ingest_file and ingest_db_stream now route writes per-request - Defaults to "primary" when header absent - pipeline::ingest_file_to_bucket records the actual bucket on the ObjectRef so catalog stays the source of truth for "where does this data live" - Verified: ingest with X-Lakehouse-Bucket: testing lands in data/_testing/, ingest without header lands in data/, bad header returns 404 with hint 3. queryd registers every bucket with DataFusion - QueryEngine now holds Arc<BucketRegistry> instead of single store - build_context iterates all buckets, registers each as a separate ObjectStore under URL scheme "lakehouse-{bucket}://" - ListingTable URLs include the per-object bucket scheme so DataFusion routes scans automatically based on ObjectRef.bucket - Profile bucket names like "profile:user" sanitized to "lakehouse-profile-user" since URL host segments can't contain ":" - Tolerant of duplicate manifest entries (pre-existing pipeline::ingest_file behavior creates a fresh dataset id per ingest); duplicates skipped with debug log - Backward compat: legacy "lakehouse://data/" URL still registered pointing at primary Success gate: cross-bucket CROSS JOIN SELECT p.name, p.role, a.species FROM people_test p (bucket: testing) CROSS JOIN animals a (bucket: primary) LIMIT 5 returns rows correctly. DataFusion routed each scan to its bucket's ObjectStore based on the URL scheme. No regressions: SELECT COUNT(*) FROM candidates still returns 100000 from the primary bucket. Deferred to Phase 17: - POST /profile/{user}/activate (HNSW hot-load on profile switch) - vectord storage paths becoming bucket-scoped (trial journals, eval sets per-profile) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 08:52:32 -05:00
root	97a376482c	Phase C: Decoupled embedding refresh Implements the llms3.com-inspired pattern: embeddings refresh asynchronously, decoupled from transactional row writes. New rows arrive, ingest marks the vector index stale, a later refresh embeds only the delta (doc_ids not already in the index). Schema additions (DatasetManifest): - last_embedded_at: Option<DateTime> - when the index was last refreshed - embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh - embedding_refresh_policy: Option<RefreshPolicy> - Manual \| OnAppend \| Scheduled Ingest paths (pipeline::ingest_file + pg_stream) call registry.mark_embeddings_stale after writing. No-op if the dataset has never been embedded — stale semantics only kick in once last_embedded_at is set. Refresh pipeline (vectord::refresh::refresh_index): - Reads the dataset Parquet, extracts (doc_id, text) pairs - Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas) - Loads existing embeddings via EmbeddingCache (empty on first-time build) - Filters to rows whose doc_id is NOT in the existing set - Chunks (chunker::chunk_column), embeds via Ollama (batches of 32), writes combined index, clears stale flag Endpoints: - POST /vectors/refresh/{dataset_name} - body {index_name, id_column, text_column, chunk_size?, overlap?} - GET /vectors/stale - lists datasets whose embedding_stale_since is set End-to-end verified on threat_intel (knowledge_base.threat_intel): - Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s, last_embedded_at set - Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check) - Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set - /vectors/stale surfaces threat_intel with timestamps + policy - Delta refresh: 34 new docs embedded in 970ms (6x faster than full re-embed); stale_cleared = true Not in MVP scope: - UPDATE semantics (same doc_id, different content) - would need per-row content hashing - OnAppend policy auto-trigger - just declares intent; actual scheduler deferred - Scheduler runtime - the Scheduled(cron) variant declares the intent so operators can see which datasets expect what, but the cron itself is separate Per ADR-019: when a profile switches to vector_backend=Lance, this refresh path benefits — Lance's native append replaces our "read all + rewrite" Parquet rebuild pattern. Current MVP works well enough at ~500-5K rows to validate the architecture; Lance unblocks the 5M+ case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 03:00:43 -05:00
root	dbe00d018f	Federation foundation + HNSW trial system + Postgres streaming + PRD reframe Four shipped features and a PRD realignment, all measured end-to-end: HNSW trial system (Phase 15 horizon item → complete) - vectord: EmbeddingCache, harness (eval sets + brute-force ground truth), TrialJournal, parameterized HnswConfig on build_index_with_config - /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best, /hnsw/evals/{name}/autogen, /hnsw/cache/stats - Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us at 100% recall@10. ec=80 es=30 locked as HnswConfig::default() - Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s, 80/30 = 1.00 recall in 230s Catalog manifest repair - catalogd: resync_from_parquet reads parquet footers to restore row_count and columns on drifted manifests - POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing - All 7 staffing tables recovered to PRD-matching 2,469,278 rows Federation foundation (ADR-017) - shared::secrets: SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, enforces 0600 perms) - storaged::registry::BucketRegistry — multi-bucket resolution with rescue_bucket read fallback and reachability probing - storaged::error_journal — bucket op failures visible in one HTTP call - storaged::append_log — write-once batched append pattern (fixes the RMW anti-pattern llms3.com calls out; errors and trial journals both use it) - /storage/buckets, /storage/errors, /storage/bucket-health, /storage/errors/{flush,compact} - Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with X-Lakehouse-Rescue-Used observability headers on fallback Postgres streaming ingest - ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination into ArrowWriter, lineage redacts password - POST /ingest/db — verified against live knowledge_base.team_runs (586 rows × 13 cols, 6 batches, 196ms end-to-end) PRD realignment (2026-04-16) - Dual use case: staffing analytics + local LLM knowledge substrate - Removed "multi-tenancy (single-owner system)" from non-goals - Added invariants 8-11: indexes hot-swappable, per-reader profiles, trials-as-data, operational failures findable in one HTTP call - New phases 16 (hot-swap generations), 17 (model profiles + dataset bindings), 18 (Lance vs Parquet+sidecar evaluation) - Known ceilings table documents the 5M vector wall and escape hatches - ADR-017 (federation), ADR-018 (append-log pattern) added - EXECUTION_PLAN.md sequences phases B-E with success gates and decision rules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 01:50:05 -05:00
root	9992b5f135	Database connector: PostgreSQL → Parquet import - POST /ingest/postgres/tables — list all tables in a database - POST /ingest/postgres/import — import table → Parquet → catalog → queryable - Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean, text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010) - Auto PII detection + lineage on import - Empty password support for trust auth - Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols) from local knowledge_base Postgres database — immediately queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:14:16 -05:00
root	294f3f6a49	Scheduled ingest: file watcher auto-ingests from ./inbox - Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable - Polls every 10 seconds (configurable) - Processed files moved to ./inbox/processed/ - Failed files moved to ./inbox/failed/ - Dedup: same file dropped twice = no-op - Watcher starts automatically on gateway boot - Tested: CSV dropped → queryable in <15s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:04:40 -05:00
root	35f0559d78	Phase 14: Schema evolution with AI migration rules - Schema diff detection: compare old vs new schema, identify changes (added, removed, type changed, renamed columns) - Fuzzy rename detection: "first_name" → "full_name" detected by shared word parts - Auto-generated migration rules: direct map, cast, concat, split, drop Each rule has confidence score (0.0-1.0) - AI migration prompt builder: generates LLM prompt for complex schema changes LLM suggests JSON migration rules when heuristics aren't enough - 5 new unit tests (detect added, removed, type change, rename, rule generation) - 30 total unit tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 19:31:19 -05:00
root	9e53caaec3	Phase 10: Rich catalog v2 — metadata as product - DatasetManifest expanded: description, owner, sensitivity, columns, lineage, freshness contract, tags, row_count - All new fields use #[serde(default)] for backward compatibility - PII auto-detection: scans column names for email, phone, SSN, salary, address, DOB, medical terms — flags as PII/PHI/Financial - Column-level metadata: name, type, sensitivity, is_pii flag - Lineage tracking: source_system, source_file, ingest_job, timestamp - Ingest pipeline auto-populates: PII scan, column meta, lineage, row count - PATCH /catalog/datasets/by-name/{name}/metadata — update metadata - Catalog responses now include all rich fields - 25 unit tests passing (5 new PII detection tests) Per ADR-013: datasets without metadata become mystery files. This makes every ingested file self-describing from day one. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 09:15:09 -05:00
root	bb05c4412e	Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support - ingestd crate: detect file type → parse → schema detection → Parquet → catalog - CSV: auto-detect column types (int, float, bool, string), handles $, %, commas Strips dollar signs from amounts, flexible row parsing, sanitized column names - JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c) - PDF: text extraction via lopdf, one row per page (source_file, page_number, text) - Text/SMS: line-based ingestion with line numbers - Dedup: SHA-256 content hash, re-ingest same file = no-op - Gateway: POST /ingest/file multipart upload, 256MB body limit - Schema detection per ADR-010: ambiguous types default to String - 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup) - Tested: messy CSV with missing data, dollar amounts, N/A values → queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:07:31 -05:00

9 Commits