lakehouse

Author	SHA1	Message	Date
root	9992b5f135	Database connector: PostgreSQL → Parquet import - POST /ingest/postgres/tables — list all tables in a database - POST /ingest/postgres/import — import table → Parquet → catalog → queryable - Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean, text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010) - Auto PII detection + lineage on import - Empty password support for trust auth - Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols) from local knowledge_base Postgres database — immediately queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:14:16 -05:00
root	294f3f6a49	Scheduled ingest: file watcher auto-ingests from ./inbox - Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable - Polls every 10 seconds (configurable) - Processed files moved to ./inbox/processed/ - Failed files moved to ./inbox/failed/ - Dedup: same file dropped twice = no-op - Watcher starts automatically on gateway boot - Tested: CSV dropped → queryable in <15s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:04:40 -05:00
root	eae51977ab	Scale test: 2.47M rows + 10K vector index benchmarked. Benchmarks on 128GB RAM server: - 100K candidate filter (skills+city+status): 257ms - 1M timesheet aggregation (revenue by client): 942ms - 800K call log cross-reference (cold leads): 642ms - Triple JOIN recruiter performance: 487ms - 500K email open rate aggregation: 259ms - COUNT all 2.47M rows: 84ms - 10K vector search (cosine similarity): ~450ms - Embedding throughput: 49 chunks/sec via Ollama - RAG correctly refuses to hallucinate when no match exists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:31:37 -05:00
root	26fc98c885	Phase 7: Vector index + RAG pipeline - vectord crate: chunk → embed → store → search → RAG - chunker: configurable chunk size + overlap, sentence-boundary aware splitting - store: embeddings as Parquet (binary blob f32 vectors), portable format - search: brute-force cosine similarity (works up to ~100K vectors) - rag: full pipeline — embed question → search index → retrieve context → LLM answer - Endpoints: POST /vectors/index, /vectors/search, /vectors/rag - Gateway wired with vectord service - Tested: 200 candidate resumes indexed in 5.4s, semantic search + RAG working - 20 unit tests passing (chunker, search, ingestd, shared) - AI gives honest "no match found" when context doesn't support an answer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:12:28 -05:00
root	bb05c4412e	Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support - ingestd crate: detect file type → parse → schema detection → Parquet → catalog - CSV: auto-detect column types (int, float, bool, string), handles $, %, commas; strips dollar signs from amounts, flexible row parsing, sanitized column names - JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c) - PDF: text extraction via lopdf, one row per page (source_file, page_number, text) - Text/SMS: line-based ingestion with line numbers - Dedup: SHA-256 content hash, re-ingest same file = no-op - Gateway: POST /ingest/file multipart upload, 256MB body limit - Schema detection per ADR-010: ambiguous types default to String - 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup) - Tested: messy CSV with missing data, dollar amounts, N/A values → queryable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:07:31 -05:00
root	387ce0074c	UI: full-stack test coverage with tabs for Query, Storage, AI, Status - Query tab: SQL editor with results table (existing) - Storage tab: list objects, register datasets pointing at storage keys - AI tab: embed (nomic-embed-text), generate (qwen2.5), rerank with scored results - Status tab: health checks for all 5 services + functional tests (embed, generate, SQL) - nginx: added /lakehouse/ and API proxy paths to devop.live config - Loaded 3 sample datasets: employees, events, products - Fixed Rust 2024 reserved keyword `gen` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:56:18 -05:00
root	01373c0e45	Phase 5: hardening — gRPC, observability, auth, config - proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService - proto crate: tonic-build codegen from proto definitions - catalogd: gRPC CatalogService implementation - gateway: dual HTTP (:3100) + gRPC (:3101) servers - gateway: OpenTelemetry tracing with stdout exporter - gateway: API key auth middleware (toggleable) - shared: TOML config system with typed structs and defaults - lakehouse.toml config file - ADR-006 and ADR-007 documented Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:37:07 -05:00
root	50a8c8013f	Phase 4: Dioxus frontend with dataset browser and SQL query editor - ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table - ui: dynamic API base URL (same-origin for nginx, port-based for local dev) - gateway: CORS enabled for cross-origin requests - nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin - justfile: ui-build, ui-serve, sidecar, up commands added Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 06:24:15 -05:00
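
Commit 9992b5f135 describes an automatic PostgreSQL-to-Parquet type mapping. A minimal sketch of that mapping follows, assuming type names arrive as plain strings such as `int4` or `varchar`. The `ColumnType` enum and `map_pg_type` function are hypothetical names for illustration, not the repository's actual code, and sending every unlisted type to `Utf8` is an assumption based on the "safe default per ADR-010" note.

```rust
/// Hypothetical target column types. The variant names mirror the commit
/// message (Int, Float64, Boolean, Utf8), not the actual crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int,
    Float64,
    Boolean,
    Utf8,
}

/// Map a PostgreSQL type name to a column type.
/// Assumption: anything not listed in the commit message falls back to Utf8.
pub fn map_pg_type(pg_type: &str) -> ColumnType {
    match pg_type.trim().to_ascii_lowercase().as_str() {
        "int2" | "int4" | "int8" => ColumnType::Int,
        "float4" | "float8" => ColumnType::Float64,
        "bool" => ColumnType::Boolean,
        // text, varchar, jsonb, timestamp, and any unrecognized type
        _ => ColumnType::Utf8,
    }
}
```

Under these assumptions, `map_pg_type("int8")` yields `ColumnType::Int`, while `map_pg_type("jsonb")` and an unknown name such as `map_pg_type("inet")` both yield `ColumnType::Utf8`.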

8 Commits
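
Commit 26fc98c885 mentions brute-force cosine similarity search over stored embeddings. The sketch below shows one way such a search could look, assuming embeddings are held in memory as `Vec<f32>`. The function names `cosine_similarity` and `top_k` are hypothetical, and returning 0.0 for mismatched lengths or zero-norm vectors is a choice made here for illustration, not something the commit states.

```rust
/// Cosine similarity of two vectors.
/// Assumption: mismatched lengths or zero-norm inputs score 0.0.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Score every stored vector against the query and return the `k` best
/// as (index, score) pairs, highest score first. This is a linear scan,
/// so cost grows with the number of stored vectors.
pub fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // total_cmp gives a total order on f32, so the sort cannot panic on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Because every query scans all stored vectors, this approach stays simple but scales linearly, which is consistent with the commit's note that brute force works up to roughly 100K vectors.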