# PRD: Lakehouse — Rust-First Object Storage System **Status:** Active — Phase 0-5 complete, entering production path **Created:** 2026-03-27 **Owner:** J --- ## Problem Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text. A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it. We need a system where: - Any data source (CSV, DB export, PDF, JSON) can be ingested without pre-defined schemas - Structured data is queryable via SQL at scale (millions of rows, sub-second) - Unstructured data is searchable via AI embeddings (semantic retrieval) - An LLM can answer natural language questions against all of it - Everything runs locally — no cloud APIs, total data privacy - The system is rebuildable from repository + object storage alone --- ## Solution A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation. ### Locked Stack | Layer | Technology | Locked | |---|---|---| | Frontend | Dioxus | Yes | | API | Axum + Tokio | Yes | | Object Storage Interface | Apache Arrow `object_store` | Yes | | Storage Backend | LocalFileSystem → RustFS → S3 | Yes | | Query Engine | DataFusion | Yes | | Data Format | Parquet + Arrow | Yes | | RPC (internal) | tonic (gRPC) | Yes | | AI Runtime | Ollama (local models) | Yes | | AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes | | Vector Index | TBD — evaluate `hora`, `qdrant` crate, or HNSW from scratch | **Open** | No new frameworks without documented ADR. --- ## Architecture ### Services | Service | Responsibility | |---|---| | **gateway** | HTTP/gRPC ingress, routing, auth, CORS, body limits | | **catalogd** | Metadata control plane — dataset registry, schema versions, manifests | | **storaged** | Object I/O — read/write/list/delete via `object_store` | | **queryd** | SQL execution — DataFusion over Parquet, MemTable hot cache | | **ingestd** | *NEW* — Ingest pipeline: CSV/JSON/DB → normalize → Parquet → catalog | | **vectord** | *NEW* — Embedding store + vector index: chunk → embed → index → search | | **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar | | **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs | | **shared** | Types, errors, Arrow helpers, config, protobuf definitions | ### Data Flow ``` Raw data → ingestd (normalize, chunk, detect schema) ├→ storaged (Parquet files to object storage) ├→ catalogd (register dataset + schema) ├→ vectord (embed text chunks, build index) └→ queryd (auto-register as queryable table) User question → gateway ├→ vectord (semantic search for relevant chunks) ← RAG path ├→ queryd (SQL over structured data) ← Analytics path └→ aibridge → Ollama (generate answer from context) ``` ### Query Paths **Analytical (SQL):** "What's the average bill rate for .NET devs in Chicago?" → DataFusion scans Parquet columnar, returns in <200ms **Semantic (RAG):** "Find candidates who could do data engineering work" → Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers **Hybrid:** "Which clients are we losing money on, and why?" → SQL for margin calculations + RAG over client notes/emails for context → LLM synthesizes ### Invariants 1. Object storage = source of truth for all data 2. catalogd = sole metadata authority 3. No raw data in catalog — only pointers 4. vectord stores embeddings AS Parquet (portable, not a proprietary format) 5. ingestd is idempotent — re-ingesting the same file is a no-op 6. Hot cache is a performance layer, not a source of truth — eviction is safe 7. All services modular and independently replaceable --- ## Phases ### Phase 0-5: Foundation ✅ COMPLETE - Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine - Python sidecar with real Ollama models (embed, generate, rerank) - Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs - gRPC, OpenTelemetry, auth middleware, TOML config - Validated with 286K row staffing company dataset across 7 tables - Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms ### Phase 6: Ingest Pipeline Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable. | Step | Deliverable | Gate | |---|---|---| | 6.1 | `ingestd` crate with CSV parser → Arrow RecordBatch → Parquet | CSV file → queryable dataset | | 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file → flat Parquet | | 6.3 | Schema detection — infer column types from data | No manual schema definition needed | | 6.4 | Deduplication — detect and skip already-ingested files (content hash) | Re-ingest same file = no-op | | 6.5 | Text chunking — split large text fields for embedding | Long text → overlapping chunks | | 6.6 | Auto-registration — ingest writes to storage AND registers in catalog | Single API call: file in → queryable | | 6.7 | Gateway endpoint: `POST /ingest` with file upload | Upload CSV from browser → query in seconds | **Gate:** Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps. **Risk:** Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let user override. ### Phase 7: Vector Index + RAG Pipeline Make unstructured data searchable by meaning, not just keywords. | Step | Deliverable | Gate | |---|---|---| | 7.1 | `vectord` crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet | | 7.2 | Chunking strategy — configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks | | 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL) | Semantic search works, correctness verified | | 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms | | 7.5 | RAG endpoint: `POST /rag` — question → embed → search → retrieve → generate | Natural language question → grounded answer | | 7.6 | Auto-embed on ingest — text columns automatically embedded during ingest | No separate embedding step needed | | 7.7 | Hybrid search — combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) | **Gate:** Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with LLM explanation. **Risk: HNSW in Rust at scale.** This is the hardest technical problem. Options: - `hora` crate — Rust-native ANN, but less mature than FAISS - Store HNSW index as a serialized file alongside Parquet data - Fallback: brute-force scan is fine up to ~100K vectors; optimize later - Nuclear option: use Qdrant as an external vector store (breaks "no new services" rule) **Decision needed:** Evaluate `hora` vs external Qdrant vs brute-force at J's data scale. ### Phase 8: Hot Cache + Incremental Updates Make frequently-accessed data fast, and handle real-time updates without full rewrite. | Step | Deliverable | Gate | |---|---|---| | 8.1 | MemTable hot cache — pin active datasets in memory | Queries on hot data: <10ms | | 8.2 | Cache policy — LRU eviction based on access patterns | Memory-bounded, auto-manages | | 8.3 | Incremental writes — append new rows without rewriting entire Parquet file | Update one candidate's phone → no full table rewrite | | 8.4 | Merge-on-read — query combines base Parquet + delta files | Correct results from base + updates | | 8.5 | Compaction — periodic merge of delta files into base Parquet | Prevent delta file proliferation | | 8.6 | Upsert semantics — insert or update by primary key | Same candidate ID → update in place | **Gate:** Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms. **Risk: This is the Delta Lake problem.** Full ACID transactions over Parquet files is what Databricks spent years building. We're NOT building Delta Lake — we're building a pragmatic version: - Append-only delta files (easy) - Merge-on-read (moderate) - Compaction (moderate) - Full ACID isolation (NOT attempting — single-writer model instead) ### Phase 9+: Future (not designed yet) - Database connector ingest (PostgreSQL, MySQL, MSSQL → Parquet) - PDF/document ingest (OCR → text → chunks → embed) - Scheduled ingest (cron-based file watching) - Multi-node query distribution - Row-level access control - Audit log (who queried what, when) --- ## Reference Dataset: Staffing Company Validated with realistic staffing company data: | Table | Rows | Description | |---|---|---| | candidates | 15,000 | Names, phones, emails, zip, skills, resume text, availability | | clients | 500 | Companies, contacts, verticals, bill rates | | job_orders | 3,000 | Positions with descriptions, requirements, rates | | placements | 8,000 | Candidate↔job matches with dates, rates, recruiters | | timesheets | 120,000 | Weekly hours, bill/pay totals, approvals | | call_log | 80,000 | Phone CDR — who called whom, duration, disposition | | email_log | 60,000 | Email tracking — subject, opened, direction | | **Total** | **286,500** | **7 tables, cross-referenced** | Proven queries: - Candidate search by skills + location + availability: 80ms - Revenue by client with profit margins (JOIN 120K timesheets): 142ms - Cold lead detection (candidates called 5+ times, never placed): 94ms - Margin analysis by vertical (JOIN placements → job orders): 53ms - Natural language → AI-generated SQL → execution → results: ~3s (model inference) --- ## Available Local Models | Model | Use | |---|---| | `nomic-embed-text` | Embeddings (768d) — semantic search, RAG retrieval | | `qwen2.5` | SQL generation, structured output, summarization | | `mistral` | General generation, longer context | | `gemma2` | General generation | | `llama3.2` | General generation, lightweight | --- ## Non-Goals - Multi-tenancy (single-owner system) - Cloud deployment (local-first, always) - Full ACID transactions (single-writer model is sufficient) - Real-time streaming / CDC (batch ingest is the model) - Replacing the CRM (this is the analytical layer BEHIND the CRM) - Custom file formats (Parquet is the format, period) --- ## Risks | Risk | Severity | Mitigation | |---|---|---| | Vector search in Rust at scale | **High** | Start brute-force, evaluate `hora` crate, Qdrant as fallback | | Incremental updates on Parquet | **High** | Delta files + merge-on-read, NOT full Delta Lake | | Legacy data messiness | **High** | Conservative schema detection, default to string, user overrides | | Schema evolution across ingests | **Medium** | Schema fingerprinting + versioned manifests | | Memory pressure from hot cache | **Medium** | LRU eviction, configurable memory limit | | HNSW index persistence | **Medium** | Serialize alongside Parquet, rebuild on startup | | Python sidecar as bottleneck | **Low** | Can replace with direct Ollama HTTP from Rust later | --- ## Operating Rules 1. PRD > architecture > phases > status > git 2. Git is memory, not chat 3. No undocumented changes 4. No silent architecture drift 5. Always work in smallest valid step 6. Always verify before moving on 7. **New:** Flag when something is genuinely hard vs just engineering work 8. **New:** If a phase reveals the approach is wrong, update the PRD before continuing