4 Commits

root · 24f1249a62 · 2026-04-16 08:52:32 -05:00
Federation layer 2: header routing + cross-bucket SQL
Three pieces of the multi-bucket federation made real:

1. Catalog migration (POST /catalog/migrate-buckets)
   - One-shot normalizer for the ObjectRef.bucket field (sketched after
     this list)
   - Empty -> "primary"; legacy "data"/"local" -> "primary"
   - Idempotent; re-running on canonical state is no-op
   - Ran on existing catalog: 12 refs renamed from "data", 2 already
     "primary", all 14 now canonical

2. X-Lakehouse-Bucket header middleware on ingest
   - resolve_bucket() helper extracts the header and returns
     (bucket_name, store), or 404 with the valid bucket list (sketched
     after this list)
   - ingest_file and ingest_db_stream now route writes per-request
   - Defaults to "primary" when header absent
   - pipeline::ingest_file_to_bucket records the actual bucket on the
     ObjectRef so catalog stays the source of truth for "where does this
     data live"
   - Verified: ingest with X-Lakehouse-Bucket: testing lands in
     data/_testing/, ingest without header lands in data/, bad header
     returns 404 with hint

3. queryd registers every bucket with DataFusion
   - QueryEngine now holds Arc<BucketRegistry> instead of a single store
   - build_context iterates all buckets, registering each as a separate
     ObjectStore under the URL scheme "lakehouse-{bucket}://" (sketched
     after this list)
   - ListingTable URLs include the per-object bucket scheme so
     DataFusion routes scans automatically based on ObjectRef.bucket
   - Profile bucket names like "profile:user" sanitized to
     "lakehouse-profile-user" since URL host segments can't contain ":"
   - Tolerant of duplicate manifest entries (pre-existing
     pipeline::ingest_file behavior creates a fresh dataset id per
     ingest); duplicates skipped with debug log
   - Backward compat: legacy "lakehouse://data/" URL still registered
     pointing at primary
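
A minimal sketch of item 1's normalization rule (the function name and
return shape are assumptions; the real migration walks every ObjectRef
and persists the rename):

  // Maps a stored ObjectRef.bucket value to its canonical form. Returning
  // None for already-canonical names is what makes re-runs a no-op.
  fn canonical_bucket(bucket: &str) -> Option<&'static str> {
      match bucket {
          "" | "data" | "local" => Some("primary"),
          _ => None, // already canonical; leave untouched
      }
  }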
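
Item 2's helper, roughly (assuming an http::HeaderMap; BucketRegistry's
get/names methods and ApiError are stand-ins for whatever the gateway
actually uses):

  use std::sync::Arc;
  use object_store::ObjectStore;

  fn resolve_bucket(
      headers: &http::HeaderMap,
      registry: &BucketRegistry,
  ) -> Result<(String, Arc<dyn ObjectStore>), ApiError> {
      let name = headers
          .get("x-lakehouse-bucket")
          .and_then(|v| v.to_str().ok())
          .unwrap_or("primary") // header absent -> primary
          .to_string();
      match registry.get(&name) {
          Some(store) => Ok((name, store)),
          // the 404 carries the valid names so a typo is self-diagnosing
          None => Err(ApiError::not_found(format!(
              "unknown bucket {name:?}; valid buckets: {:?}",
              registry.names()
          ))),
      }
  }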
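
And item 3's registration loop, sketched (registry iteration and error
handling simplified; scheme sanitization per the note above):

  use datafusion::prelude::SessionContext;

  fn build_context(registry: &BucketRegistry) -> SessionContext {
      let ctx = SessionContext::new();
      for (name, store) in registry.iter() {
          // "profile:user" -> "lakehouse-profile-user", since a URL
          // scheme can't contain ':'
          let scheme = format!("lakehouse-{}", name.replace(':', "-"));
          let url = url::Url::parse(&format!("{scheme}://")).unwrap();
          ctx.register_object_store(&url, store.clone());
      }
      // backward compat: keep the legacy URL pointing at primary
      let legacy = url::Url::parse("lakehouse://data/").unwrap();
      ctx.register_object_store(&legacy, registry.get("primary").unwrap());
      ctx
  }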

Success gate: cross-bucket CROSS JOIN
  SELECT p.name, p.role, a.species
  FROM people_test p          -- bucket: testing
  CROSS JOIN animals a        -- bucket: primary
  LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.

No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.

Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
  eval sets per-profile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

root · 97a376482c · 2026-04-16 03:00:43 -05:00
Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
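
Roughly, on the struct (the DateTime timezone parameter and the cron
payload type are assumptions; existing fields elided):

  #[derive(Serialize, Deserialize)]
  pub enum RefreshPolicy {
      Manual,
      OnAppend,
      Scheduled(String), // cron expression, declared but not yet executed
  }

  pub struct DatasetManifest {
      // ...existing fields...
      pub last_embedded_at: Option<DateTime<Utc>>,
      pub embedding_stale_since: Option<DateTime<Utc>>,
      pub embedding_refresh_policy: Option<RefreshPolicy>,
  }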

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
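
The guard, sketched (signature simplified; keeping the earliest stale
timestamp on repeated writes is an assumption of this sketch):

  pub fn mark_embeddings_stale(manifest: &mut DatasetManifest) {
      // staleness is only meaningful relative to a previous refresh
      if manifest.last_embedded_at.is_some()
          && manifest.embedding_stale_since.is_none()
      {
          manifest.embedding_stale_since = Some(Utc::now());
      }
  }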

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag
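
The heart of the delta, simplified to plain collections (the real code
reads rows from Parquet and existing ids via EmbeddingCache):

  use std::collections::HashSet;

  // Keep only (doc_id, text) rows the index hasn't embedded yet; the
  // survivors are chunked and sent to Ollama in batches of 32.
  fn delta_rows(
      rows: Vec<(String, String)>,
      existing: &HashSet<String>,
  ) -> Vec<(String, String)> {
      rows.into_iter()
          .filter(|(doc_id, _)| !existing.contains(doc_id))
          .collect()
  }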

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
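
For illustration, a refresh call from a Rust client (reqwest with the
json feature; host, port, index name, and column names are placeholders):

  #[tokio::main]
  async fn main() -> Result<(), reqwest::Error> {
      let resp = reqwest::Client::new()
          .post("http://localhost:8080/vectors/refresh/threat_intel")
          .json(&serde_json::json!({
              "index_name": "threat_intel_idx", // placeholder
              "id_column": "id",                // placeholder
              "text_column": "summary",         // placeholder
              "chunk_size": 512                 // optional per the body spec
          }))
          .send()
          .await?;
      println!("{}", resp.text().await?);
      Ok(())
  }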

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - the policy only declares intent; the
  actual on-append trigger is deferred
- Scheduler runtime - likewise, the Scheduled(cron) variant records which
  datasets expect a scheduled refresh so operators can see it, but the
  cron runner itself is separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

root · 9e53caaec3 · 2026-03-27 09:15:09 -05:00
Phase 10: Rich catalog v2 — metadata as product
- DatasetManifest expanded: description, owner, sensitivity, columns,
  lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
  address, DOB, and medical terms, flagging columns as PII/PHI/Financial
  (sketched after this list)
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)
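
A sketch of the column-name scan (term lists abbreviated; which term
maps to which category is an assumption here):

  fn detect_sensitivity(column: &str) -> Option<&'static str> {
      let c = column.to_ascii_lowercase();
      const PII: &[&str] = &["email", "phone", "ssn", "address", "dob"];
      const PHI: &[&str] = &["medical"];
      const FINANCIAL: &[&str] = &["salary"];
      if PII.iter().any(|t| c.contains(t)) {
          Some("PII")
      } else if PHI.iter().any(|t| c.contains(t)) {
          Some("PHI")
      } else if FINANCIAL.iter().any(|t| c.contains(t)) {
          Some("Financial")
      } else {
          None // column carries no detected sensitivity
      }
  }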

Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

root · bb05c4412e · 2026-03-27 08:07:31 -05:00
Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support
- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detects column types (int, float, bool, string); handles $, %,
  and thousands commas, strips dollar signs from amounts, parses rows
  flexibly, and sanitizes column names (inference sketched after this list)
- JSON: array or newline-delimited input; nested objects flattened
  (a.b.c → a_b_c) - flattening sketched after this list
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable
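
Per-value inference for the messy cases above, sketched (symbol handling
per this commit; precedence and the N/A rule are assumptions):

  fn infer_type(raw: &str) -> &'static str {
      let v = raw.trim();
      if v.is_empty() || v.eq_ignore_ascii_case("n/a") {
          return "string"; // ambiguous/missing defaults to String (ADR-010)
      }
      // strip $, %, and thousands commas before numeric parsing
      let cleaned = v
          .trim_start_matches('$')
          .trim_end_matches('%')
          .replace(',', "");
      if cleaned.parse::<i64>().is_ok() {
          "int"
      } else if cleaned.parse::<f64>().is_ok() {
          "float"
      } else if matches!(v.to_ascii_lowercase().as_str(), "true" | "false") {
          "bool"
      } else {
          "string"
      }
  }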
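
And the JSON flattening, sketched with serde_json (array handling is
left out here and may differ in the real parser):

  use serde_json::{Map, Value};

  // Recursively lifts nested objects into flat keys: a.b.c -> a_b_c.
  fn flatten(prefix: &str, obj: &Map<String, Value>, out: &mut Map<String, Value>) {
      for (k, v) in obj {
          let key = if prefix.is_empty() {
              k.clone()
          } else {
              format!("{prefix}_{k}")
          };
          match v {
              Value::Object(inner) => flatten(&key, inner, out),
              other => {
                  out.insert(key, other.clone());
              }
          }
      }
  }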

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>