Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.
Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
actor, reason }
- All tombstones for a dataset must share one row_key_column —
enforced at write so the query-time filter remains a single
WHERE NOT IN (...) clause
Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
(POST .../tombstones/compact will work once we expose it)
Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd
HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
{ row_key_column, row_key_values[], actor, reason }
Returns rows_tombstoned count + per-value failure list (207 on
partial success).
- GET same path lists active tombstones with full audit info.
Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw data registers as "__raw__{name}"; the public
  "{name}" becomes a DataFusion view (sketch below):
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
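A minimal sketch of that view construction (function name hypothetical;
the real code lives in queryd::context):

  fn tombstone_view_sql(name: &str, key_col: &str, values: &[String]) -> String {
      // Quote each tombstoned key as a SQL string literal, escaping
      // embedded quotes; callers only build the view when values is
      // non-empty, so NOT IN () can't occur.
      let list = values
          .iter()
          .map(|v| format!("'{}'", v.replace('\'', "''")))
          .collect::<Vec<_>>()
          .join(", ");
      // CAST AS VARCHAR lets one filter serve string and integer keys.
      format!(
          "SELECT * FROM \"__raw__{name}\" \
           WHERE CAST(\"{key_col}\" AS VARCHAR) NOT IN ({list})"
      )
  }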
End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
(Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk
Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.
Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
doesn't read tombstones during merge. Tombstoned rows are still
on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
tombstone records carry their own actor+reason+timestamp so the
audit trail is intact, but cross-referencing with the mutation
event log would help compliance reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.
Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }
Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup
Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
redaction expressions per column
- Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
- Hash: digest(value, 'sha256')
- Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
real view, not a query rewrite
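A sketch of the per-column redaction expressions, assuming the Redaction
enum from shared::types and DataFusion's substr / repeat / char_length /
digest scalar functions:

  enum Redaction { Null, Hash, Mask { keep_prefix: usize, keep_suffix: usize } }

  fn redaction_expr(col: &str, r: &Redaction) -> String {
      match r {
          // Null: column stays in the projection, value is erased.
          Redaction::Null => format!("CAST(NULL AS VARCHAR) AS \"{col}\""),
          // Hash: stable one-way fingerprint of the value.
          Redaction::Hash => format!("digest(\"{col}\", 'sha256') AS \"{col}\""),
          // Mask: prefix + stars + suffix, e.g. CAN******01.
          Redaction::Mask { keep_prefix, keep_suffix } => format!(
              "substr(\"{col}\", 1, {keep_prefix}) || \
               repeat('*', char_length(\"{col}\") - {keep_prefix} - {keep_suffix}) || \
               substr(\"{col}\", char_length(\"{col}\") - {keep_suffix} + 1) \
               AS \"{col}\""
          ),
      }
  }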
HTTP (catalogd::service):
- POST /catalog/views (create)
- GET /catalog/views (list)
- GET /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}
End-to-end test on candidates (100K rows, 15 columns):
candidates_safe view:
columns: candidate_id, first_name, city, state, vertical,
skills, years_experience, status
row_filter: status != 'blocked'
redaction: candidate_id mask(prefix=3, suffix=2)
SELECT * FROM candidates_safe LIMIT 5
-> 8 columns only, candidate_id shown as "CAN******01"
(PII fields email/phone/last_name absent from result)
SELECT email FROM candidates_safe
-> fails (column not in projection)
SELECT email FROM candidates
-> succeeds (raw table still accessible by name —
Phase 13 access control is the gate, not the view itself)
Survives restart — view definitions reload from object storage.
Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
redactions could be misleading (a Hash on Int64 returns a hex string
that won't equi-join with another hash on the same value type)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pieces of the multi-bucket federation made real:
1. Catalog migration (POST /catalog/migrate-buckets)
- One-shot normalizer for ObjectRef.bucket field
- Empty -> "primary"; legacy "data"/"local" -> "primary"
- Idempotent; re-running on canonical state is no-op
- Ran on existing catalog: 12 refs renamed from "data", 2 already
"primary", all 14 now canonical
2. X-Lakehouse-Bucket header middleware on ingest
- resolve_bucket() helper extracts header, returns
(bucket_name, store) or 404 with valid bucket list
- ingest_file and ingest_db_stream now route writes per-request
- Defaults to "primary" when header absent
- pipeline::ingest_file_to_bucket records the actual bucket on the
ObjectRef so catalog stays the source of truth for "where does this
data live"
- Verified: ingest with X-Lakehouse-Bucket: testing lands in
data/_testing/, ingest without header lands in data/, bad header
returns 404 with hint
3. queryd registers every bucket with DataFusion
- QueryEngine now holds Arc<BucketRegistry> instead of single store
- build_context iterates all buckets, registers each as a separate
ObjectStore under URL scheme "lakehouse-{bucket}://"
- ListingTable URLs include the per-object bucket scheme so
DataFusion routes scans automatically based on ObjectRef.bucket
- Profile bucket names like "profile:user" sanitized to
"lakehouse-profile-user" since URL host segments can't contain ":"
- Tolerant of duplicate manifest entries (pre-existing
pipeline::ingest_file behavior creates a fresh dataset id per
ingest); duplicates skipped with debug log
- Backward compat: legacy "lakehouse://data/" URL still registered
pointing at primary
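Registration, roughly (assuming BucketRegistry yields name/store pairs):

  use std::sync::Arc;
  use datafusion::prelude::SessionContext;
  use object_store::ObjectStore;
  use url::Url;

  fn register_buckets(ctx: &SessionContext, buckets: &[(String, Arc<dyn ObjectStore>)]) {
      for (name, store) in buckets {
          // URL hosts can't contain ':', so "profile:user" -> "profile-user".
          let safe = name.replace(':', "-");
          let url = Url::parse(&format!("lakehouse-{safe}://")).unwrap();
          ctx.register_object_store(&url, Arc::clone(store));
      }
  }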
Success gate: cross-bucket CROSS JOIN
SELECT p.name, p.role, a.species
FROM people_test p (bucket: testing)
CROSS JOIN animals a (bucket: primary)
LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.
No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.
Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
eval sets per-profile)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The chunker's &text[start..end] slice could land inside a multi-byte
UTF-8 character (e.g. narrow no-break space \u{202f}, em-dashes, smart
quotes — universal in pg-imported editorial data). Rust panics on
non-boundary string slicing. In the refresh path that panic is caught
by tokio's task machinery but somehow causes linear memory growth at
~540MB/sec until OOM at 120GB+.
Root cause: chunk boundaries computed by byte arithmetic without
checking is_char_boundary(). The existing "look for last sentence / \n
/ space" logic finds ASCII-safe positions, but the *primary* `end`
calculation `(start + chunk_size).min(text.len())` lands wherever.
Fix:
- ceil_char_boundary(s, idx) — forward-scan to the nearest valid
UTF-8 char boundary. Used at end, actual_end, and next_start.
- Iteration cap — break if iterations exceed text.len(). Any
non-progressing loop dies safely instead of burning memory.
- Forced forward advance — if overlap + boundary math produce a
next_start <= start, force +1 char to guarantee termination.
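The boundary snap, approximately (str::ceil_char_boundary is still
unstable on stable Rust, hence the hand-rolled helper):

  fn ceil_char_boundary(s: &str, idx: usize) -> usize {
      if idx >= s.len() {
          return s.len();
      }
      // UTF-8 chars are 1-4 bytes, so a boundary is at most 3 bytes ahead.
      let mut i = idx;
      while !s.is_char_boundary(i) {
          i += 1;
      }
      i
  }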
Reproduced on kb_team_runs (585 pg-imported prompts with editorial
unicode): previous run grew memory linearly to 124GB over 240s then
OOM-killed. Same request after fix: peaks at <100MB, completes in
~4m42s to produce 12,693 embeddings. /vectors/search returns
relevant results.
Regression tests added:
- handles_multibyte_utf8_at_chunk_boundary — exact \u{202f} repro
- no_infinite_loop_on_no_spaces — 5KB text, no whitespace
- no_infinite_loop_on_degenerate_params — chunk_size == overlap
Surfaced by Phase C, but pre-existed as a latent bug since Phase 7.
Any Ollama-targeted RAG corpus with non-ASCII content would have hit
this once it grew past ~13KB per document.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive;
ingest marks the vector index stale; a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
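The delta step at its core (illustrative types; the real code works over
Arrow columns):

  use std::collections::HashSet;

  fn delta_docs(
      rows: Vec<(String, String)>,    // (doc_id, text) from the Parquet scan
      existing: &HashSet<String>,     // doc_ids loaded from EmbeddingCache
  ) -> Vec<(String, String)> {
      rows.into_iter()
          // First-time build: existing is empty, so everything passes.
          .filter(|(doc_id, _)| !existing.contains(doc_id))
          .collect()
  }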
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron itself is
separate
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions.
Results (see docs/ADR-019-vector-storage.md for full scorecard):
Cold load: Parquet 0.17s vs Lance 0.13s (tie — not ≥2× threshold)
Disk size: 330.3 MB vs 330.4 MB (tie)
Search p50: 873us vs 2229us (Parquet 2.55× faster)
Search p95: 1413us vs 4998us (Parquet 3.54× faster)
Index build: 230s (ec=80) vs 16s (IVF_PQ) (Lance 14× faster)
Random access: 35ms (scan) vs 311us (Lance 112× faster)
Append 10K rows: full rewrite vs 0.08s/+31MB (Lance structural win)
Decision (ADR-019): hybrid, not migrate-or-reject.
- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
architecturally: random row access (RAG text fetch), append-heavy
pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — 5M ceiling now says "switch to Lance"
instead of "migrate" since Lance runs alongside, not instead of
Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack
DataFusion 47 + Arrow 55). Kept off the critical path until API is
stable enough to promote into vectord::lance_store.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: Dashboard auto-fired 6+ API calls on load, then Ask tab fired
7 DESCRIBE queries per question — 15+ concurrent requests from WASM.
Fixes:
- Schema context cached after first build (7 DESCRIBE → 0 on subsequent questions)
- Dashboard lazy-loads only when tab clicked (not on app mount)
- Default tab changed back to Ask (no background API storm)
- std::sync::Mutex for WASM compat (no tokio in browser)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ResultStore: execute query, store batches server-side, serve pages on demand
- POST /query/paged → returns query_id + total_rows + page count (no rows)
- GET /query/page/{id}/{page}?size=100 → returns one page of rows
- RecordBatch slicing for efficient page extraction from Arrow batches
- LRU eviction: keeps 50 most recent query results in memory
- Tested: 100K rows → 1,000 pages of 100, any page fetchable by number
- Supervisor pattern: chunk results, serve on demand, retry-safe (idempotent GET)
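Page extraction in sketch form — RecordBatch::slice is zero-copy, so a
page is just offsets into the stored batches:

  use datafusion::arrow::record_batch::RecordBatch;

  fn page(batches: &[RecordBatch], page_no: usize, size: usize) -> Vec<RecordBatch> {
      let mut skip = page_no * size;
      let mut need = size;
      let mut out = Vec::new();
      for b in batches {
          if skip >= b.num_rows() {
              skip -= b.num_rows();
              continue;
          }
          let take = need.min(b.num_rows() - skip);
          out.push(b.slice(skip, take));
          need -= take;
          skip = 0;
          if need == 0 {
              break;
          }
      }
      out
  }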
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Schema context limited to 7 core staffing tables (was all 12+)
- Results table capped at 200 rows to prevent DOM explosion
- Shows "first 200 of N rows" when truncated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clean_sql now uses 3 strategies in priority order:
1. Extract from ```sql...``` markdown blocks
2. Find first SELECT/WITH/INSERT statement in text
3. Strip leading "sql" keyword fallback
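A compact sketch of the three strategies:

  fn clean_sql(raw: &str) -> String {
      // 1. Prefer a fenced ```sql block when the model produced one.
      if let Some(start) = raw.find("```sql") {
          let body = &raw[start + 6..];
          if let Some(end) = body.find("```") {
              return body[..end].trim().to_string();
          }
      }
      // 2. Else take everything from the earliest statement keyword.
      //    to_ascii_uppercase preserves byte offsets, unlike to_uppercase.
      let upper = raw.to_ascii_uppercase();
      if let Some(pos) = ["SELECT", "WITH", "INSERT"]
          .into_iter()
          .filter_map(|kw| upper.find(kw))
          .min()
      {
          return raw[pos..].trim().to_string();
      }
      // 3. Last resort: strip a bare leading "sql" tag.
      raw.trim().trim_start_matches("sql").trim().to_string()
  }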
Tested against 5 real model output patterns:
- Clean SQL ✓
- "sql" prefixed ✓
- Markdown fenced ✓
- Explanation before ```sql block ✓
- Explanation with SELECT buried in text ✓
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous: only retried on "Schema error" or "No field named"
Now: retries on any error (type mismatches, execution errors, etc.)
Model gets full error message + schema to write corrected SQL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- clean_sql() strips markdown fences, leading "sql" keyword, trailing explanations
- Schema context now includes table relationships (JOIN paths)
- Explicit note: "vertical only in candidates/clients/job_orders, JOIN for others"
- Full column paths (table.column) in schema to reduce ambiguity
- Auto-retry on schema errors feeds error + schema back to model
- TESTED: 4 questions all return correct results:
"highest avg salary" → IT $2,213 ✓
"top 5 earning over $50/hr" → correct candidates ✓
"most placements by vertical" → Industrial 10,096 ✓
"revenue by client" → 1,996 clients ✓
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Prompt now says "CRITICAL: ONLY use columns from schema, do NOT invent"
- Strips markdown backticks from model output
- Auto-retry: if SQL fails with "Schema error" or "No field named",
feeds the error + schema back to the model for a corrected query
- Both button click and Enter key paths have retry logic
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- POST /ingest/postgres/tables — list all tables in a database
- POST /ingest/postgres/import — import table → Parquet → catalog → queryable
- Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean,
text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010)
- Auto PII detection + lineage on import
- Empty password support for trust auth
- Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols)
from local knowledge_base Postgres database — immediately queryable
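One plausible shape for the mapping (exact Arrow integer widths are an
assumption — the commit only pins int2/4/8 to integer types):

  use datafusion::arrow::datatypes::DataType;

  fn map_pg_type(pg_type: &str) -> DataType {
      match pg_type {
          "int2" => DataType::Int16,
          "int4" => DataType::Int32,
          "int8" => DataType::Int64,
          "float4" | "float8" => DataType::Float64,
          "bool" => DataType::Boolean,
          // text / varchar / jsonb / timestamp and anything unknown:
          // Utf8 as the safe default per ADR-010.
          _ => DataType::Utf8,
      }
  }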
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable
- Polls every 10 seconds (configurable)
- Processed files moved to ./inbox/processed/
- Failed files moved to ./inbox/failed/
- Dedup: same file dropped twice = no-op
- Watcher starts automatically on gateway boot
- Tested: CSV dropped → queryable in <15s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Schema diff detection: compare old vs new schema, identify changes
(added, removed, type changed, renamed columns)
- Fuzzy rename detection: "first_name" → "full_name" detected by shared word parts
- Auto-generated migration rules: direct map, cast, concat, split, drop
Each rule has confidence score (0.0-1.0)
- AI migration prompt builder: generates LLM prompt for complex schema changes
LLM suggests JSON migration rules when heuristics aren't enough
- 5 new unit tests (detect added, removed, type change, rename, rule generation)
- 30 total unit tests passing
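The shared-word-parts heuristic, roughly (Jaccard overlap on snake_case
segments; the scoring is illustrative):

  use std::collections::HashSet;

  fn rename_confidence(old: &str, new: &str) -> f64 {
      // "first_name" vs "full_name" share {"name"} -> 1/3 ≈ 0.33
      let a: HashSet<&str> = old.split('_').collect();
      let b: HashSet<&str> = new.split('_').collect();
      let shared = a.intersection(&b).count() as f64;
      let total = a.union(&b).count() as f64;
      if total == 0.0 { 0.0 } else { shared / total }
  }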
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AccessControl: agent roles with allowed sensitivity levels
- 4 default roles: admin (all), recruiter (PII ok), analyst (financial ok), agent (internal only)
- Field-level masking: determines which columns to mask per agent based on sensitivity
- Query audit log: tracks every query with agent, datasets, PII fields accessed
- Endpoints: GET/POST /access/roles, GET /access/audit, POST /access/check
- Toggleable via config (auth.enabled)
- 100K embedding: supervisor now sustained 125/sec (2.9x vs single pipeline)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ToolRegistry: named tools with parameter validation and audit logging
- 6 built-in staffing tools:
search_candidates (skills, city, state, experience, availability)
get_candidate (by ID)
revenue_by_client (top N by billed revenue)
recruiter_performance (placements, revenue per recruiter)
cold_leads (called N+ times, never placed)
open_jobs (by vertical, city)
- Each tool: name, description, params, permission level (read/write/admin)
- SQL template with validated parameter substitution
- Full audit trail: every invocation logged with agent, params, result
- Endpoints: GET /tools (list), GET /tools/{name} (schema),
POST /tools/{name}/call (execute), GET /tools/audit (log)
- Per ADR-015: governed interface before raw SQL for agents
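Parameter binding in sketch form — values are checked against the tool's
declared params first, then substituted as escaped literals (hypothetical
helper; real write-path tools need stricter type validation):

  fn bind_template(template: &str, params: &[(&str, &str)]) -> String {
      let mut sql = template.to_string();
      for (name, value) in params {
          // Escape embedded quotes so a param can't break out of its literal.
          let literal = format!("'{}'", value.replace('\'', "''"));
          sql = sql.replace(&format!("{{{name}}}"), &literal);
      }
      sql
  }

  // bind_template("SELECT * FROM candidates WHERE city = {city}",
  //               &[("city", "Denver")])
  // -> SELECT * FROM candidates WHERE city = 'Denver'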
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- IndexRegistry: tracks all vector indexes with model metadata
(model_name, model_version, dimensions, build stats)
- Index metadata persisted as JSON in vectors/meta/
- Rebuilt on startup for crash recovery
- GET /vectors/indexes — list all indexes (filter by source/model)
- GET /vectors/indexes/{name} — get index metadata
- Background jobs auto-register metadata on completion
- Multi-version support: same data, different models, coexist
- Per ADR-014: enables incremental re-embed on model upgrade
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DatasetManifest expanded: description, owner, sensitivity, columns,
lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
address, DOB, medical terms — flags as PII/PHI/Financial
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)
Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.
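The scan itself is substring matching on lowercased column names — a
sketch (the real keyword list is longer and also maps medical terms to
PHI):

  fn detect_sensitivity(column: &str) -> Option<&'static str> {
      let c = column.to_lowercase();
      const PII: &[&str] = &["email", "phone", "ssn", "address", "dob"];
      const FINANCIAL: &[&str] = &["salary", "pay_rate", "billing"];
      if PII.iter().any(|k| c.contains(k)) {
          Some("PII")
      } else if FINANCIAL.iter().any(|k| c.contains(k)) {
          Some("Financial")
      } else {
          None
      }
  }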
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- journald crate: immutable event log for every data mutation
- Events: entity_type, entity_id, field, action, old_value, new_value,
actor, source, workspace_id, timestamp
- In-memory buffer with configurable flush threshold (default 100 events)
- Flush writes events as Parquet to journal/ directory
- Query: GET /journal/history/{entity_id} — full history of any record
- Query: GET /journal/recent?limit=50 — latest events across all entities
- Convenience methods: record_insert, record_update, record_ingest
- Stats: GET /journal/stats — buffer size, persisted file count
- Manual flush: POST /journal/flush
- Per ADR-012: events are never modified or deleted
This is the single most important future-proofing decision.
Once history is lost, it's gone forever.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- JobTracker: create/update/complete/fail jobs with progress tracking
- POST /vectors/index now returns immediately with job_id (HTTP 202)
- Embedding runs in tokio::spawn background task
- GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA)
- GET /vectors/jobs lists all jobs
- Progress logged every 100 batches with chunks/sec and ETA
- 100K embedding job running successfully at 44 chunks/sec
- System stays responsive during embedding (queries in 23ms)
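The 202 shape, assuming a JobTracker with create/complete (the real one
also records chunk counts, rate, and ETA; stub shown for completeness):

  use std::sync::Arc;
  use axum::{http::StatusCode, Json};
  use serde_json::{json, Value};

  struct JobTracker;
  impl JobTracker {
      fn create(&self, _kind: &str) -> u64 { 1 }
      fn complete(&self, _id: u64) {}
  }

  async fn start_index_job(tracker: Arc<JobTracker>) -> (StatusCode, Json<Value>) {
      let job_id = tracker.create("embed");
      let t = Arc::clone(&tracker);
      // Detached task: the HTTP response returns immediately while the
      // embedding loop runs; GET /vectors/jobs/{id} polls the tracker.
      tokio::spawn(async move {
          // ... chunk + embed here, updating progress per batch ...
          t.complete(job_id);
      });
      (StatusCode::ACCEPTED, Json(json!({ "job_id": job_id })))
  }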
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WorkspaceManager: create/get/list workspaces with daily/weekly/monthly/pinned tiers
- Saved searches: agent stores SQL queries in workspace context
- Shortlist: tag candidates/records to a workspace with notes
- Activity log: track calls, emails, updates per workspace per agent
- Instant handoff: transfer workspace ownership with full history
Zero data copy — just a pointer swap; the receiving agent sees everything
- Persistence: workspaces stored as JSON in object storage, rebuilt on startup
- Endpoints: /workspaces/create, /{id}, /{id}/handoff, /{id}/search,
/{id}/shortlist, /{id}/activity
- Tested: Sarah creates workspace, saves searches, shortlists 3 candidates,
logs activity, hands off to Mike who continues seamlessly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB)
Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files for row-level updates
Write deltas without rewriting base files, merge at query time
- Compaction: POST /query/compact merges deltas into base Parquet
- Query engine: checks cache first, falls back to Parquet, merges deltas
- Benchmarked on 2.47M rows:
1M row JOIN: 854ms cold → 96ms hot (8.9x speedup)
100K filter: 62ms cold → 21ms hot (3x speedup)
1.1M rows cached in 408MB RAM
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous regex routes for /catalog, /storage, /health intercepted main site.
Now all lakehouse API calls go through /lakehouse/api/ prefix, stripped by nginx rewrite.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem
- queryd: ListingTable registration from catalog ObjectRefs with schema inference
- queryd: POST /query/sql returns JSON {columns, rows, row_count}
- queryd→catalogd wiring: reads all datasets, registers as named tables
- gateway: wires QueryEngine with shared store + registry
- e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>