7 Commits

Author SHA1 Message Date
root
eae51977ab Scale test: 2.47M rows + 10K vector index benchmarked
Benchmarks on 128GB RAM server:
- 100K candidate filter (skills+city+status): 257ms
- 1M timesheet aggregation (revenue by client): 942ms
- 800K call log cross-reference (cold leads): 642ms
- Triple JOIN recruiter performance: 487ms
- 500K email open rate aggregation: 259ms
- COUNT all 2.47M rows: 84ms
- 10K vector search (cosine similarity): ~450ms
- Embedding throughput: 49 chunks/sec via Ollama
- RAG correctly refuses to hallucinate when no match exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:31:37 -05:00
root
26fc98c885 Phase 7: Vector index + RAG pipeline
- vectord crate: chunk → embed → store → search → RAG
- chunker: configurable chunk size + overlap, sentence-boundary aware splitting
- store: embeddings as Parquet (binary blob f32 vectors), portable format
- search: brute-force cosine similarity (works up to ~100K vectors)
- rag: full pipeline — embed question → search index → retrieve context → LLM answer
- Endpoints: POST /vectors/index, /vectors/search, /vectors/rag
- Gateway wired with vectord service
- Tested: 200 candidate resumes indexed in 5.4s, semantic search + RAG working
- 20 unit tests passing (chunker, search, ingestd, shared)
- AI gives honest "no match found" when context doesn't support an answer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:12:28 -05:00
root
bb05c4412e Phase 6: Ingest pipeline — CSV, JSON, PDF, text file support
- ingestd crate: detect file type → parse → schema detection → Parquet → catalog
- CSV: auto-detect column types (int, float, bool, string), handles $, %, commas
  Strips dollar signs from amounts, flexible row parsing, sanitized column names
- JSON: array or newline-delimited, nested object flattening (a.b.c → a_b_c)
- PDF: text extraction via lopdf, one row per page (source_file, page_number, text)
- Text/SMS: line-based ingestion with line numbers
- Dedup: SHA-256 content hash, re-ingest same file = no-op
- Gateway: POST /ingest/file multipart upload, 256MB body limit
- Schema detection per ADR-010: ambiguous types default to String
- 12 unit tests passing (CSV parsing, JSON flattening, type inference, dedup)
- Tested: messy CSV with missing data, dollar amounts, N/A values → queryable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 08:07:31 -05:00
root
b37e171e10 UI redesign: Ask, Explore, SQL, System tabs
- Ask: natural language → AI generates SQL → DataFusion executes → results
  Shows the AI-over-data-lake story: schema introspection → LLM → query
- Explore: click dataset → schema + preview + AI-generated summary
- SQL: raw DataFusion SQL editor with Ctrl+Enter
- System: health grid testing all 5 services + embeddings + generation
- Example prompts for quick demo
- Dark theme with accent styling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:24:51 -05:00
root
387ce0074c UI: full-stack test coverage with tabs for Query, Storage, AI, Status
- Query tab: SQL editor with results table (existing)
- Storage tab: list objects, register datasets pointing at storage keys
- AI tab: embed (nomic-embed-text), generate (qwen2.5), rerank with scored results
- Status tab: health checks for all 5 services + functional tests (embed, generate, SQL)
- nginx: added /lakehouse/ and API proxy paths to devop.live config
- Loaded 3 sample datasets: employees, events, products
- Fixed Rust 2024 reserved keyword `gen`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:56:18 -05:00
root
01373c0e45 Phase 5: hardening — gRPC, observability, auth, config
- proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
- proto crate: tonic-build codegen from proto definitions
- catalogd: gRPC CatalogService implementation
- gateway: dual HTTP (:3100) + gRPC (:3101) servers
- gateway: OpenTelemetry tracing with stdout exporter
- gateway: API key auth middleware (toggleable)
- shared: TOML config system with typed structs and defaults
- lakehouse.toml config file
- ADR-006 and ADR-007 documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:37:07 -05:00
root
50a8c8013f Phase 4: Dioxus frontend with dataset browser and SQL query editor
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:24:15 -05:00