lakehouse/docs/PRD.md
root 6740a017c7 PRD v2: production roadmap with ingest, vector search, hot cache phases
- Phase 6: Ingest pipeline (CSV/JSON → schema detect → Parquet → catalog)
- Phase 7: Vector index + RAG (embed → HNSW → semantic search → LLM answer)
- Phase 8: Hot cache + incremental updates (MemTable, delta files, merge-on-read)
- ADR-008 through ADR-011: embeddings as Parquet, delta files not Delta Lake,
  schema defaults to string, not a CRM replacement
- Staffing company reference dataset (286K rows, 7 tables)
- Honest risk assessment: vector search at scale and incremental updates are hard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:54:24 -05:00

258 lines
12 KiB
Markdown

# PRD: Lakehouse — Rust-First Object Storage System
**Status:** Active — Phase 0-5 complete, entering production path
**Created:** 2026-03-27
**Owner:** J
---
## Problem
Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.
A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.
We need a system where:
- Any data source (CSV, DB export, PDF, JSON) can be ingested without pre-defined schemas
- Structured data is queryable via SQL at scale (millions of rows, sub-second)
- Unstructured data is searchable via AI embeddings (semantic retrieval)
- An LLM can answer natural language questions against all of it
- Everything runs locally — no cloud APIs, total data privacy
- The system is rebuildable from repository + object storage alone
---
## Solution
A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.
### Locked Stack
| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow `object_store` | Yes |
| Storage Backend | LocalFileSystem → RustFS → S3 | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
| Vector Index | TBD — evaluate `hora`, `qdrant` crate, or HNSW from scratch | **Open** |
No new frameworks without documented ADR.
---
## Architecture
### Services
| Service | Responsibility |
|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth, CORS, body limits |
| **catalogd** | Metadata control plane — dataset registry, schema versions, manifests |
| **storaged** | Object I/O — read/write/list/delete via `object_store` |
| **queryd** | SQL execution — DataFusion over Parquet, MemTable hot cache |
| **ingestd** | *NEW* — Ingest pipeline: CSV/JSON/DB → normalize → Parquet → catalog |
| **vectord** | *NEW* — Embedding store + vector index: chunk → embed → index → search |
| **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs |
| **shared** | Types, errors, Arrow helpers, config, protobuf definitions |
### Data Flow
```
Raw data → ingestd (normalize, chunk, detect schema)
├→ storaged (Parquet files to object storage)
├→ catalogd (register dataset + schema)
├→ vectord (embed text chunks, build index)
└→ queryd (auto-register as queryable table)
User question → gateway
├→ vectord (semantic search for relevant chunks) ← RAG path
├→ queryd (SQL over structured data) ← Analytics path
└→ aibridge → Ollama (generate answer from context)
```
### Query Paths
**Analytical (SQL):** "What's the average bill rate for .NET devs in Chicago?"
→ DataFusion scans Parquet columnar, returns in <200ms
**Semantic (RAG):** "Find candidates who could do data engineering work"
Embed question vector search across resume embeddings retrieve top chunks LLM answers
**Hybrid:** "Which clients are we losing money on, and why?"
SQL for margin calculations + RAG over client notes/emails for context LLM synthesizes
### Invariants
1. Object storage = source of truth for all data
2. catalogd = sole metadata authority
3. No raw data in catalog only pointers
4. vectord stores embeddings AS Parquet (portable, not a proprietary format)
5. ingestd is idempotent re-ingesting the same file is a no-op
6. Hot cache is a performance layer, not a source of truth eviction is safe
7. All services modular and independently replaceable
---
## Phases
### Phase 0-5: Foundation ✅ COMPLETE
- Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
- Python sidecar with real Ollama models (embed, generate, rerank)
- Dioxus UI with Ask (NLSQL), Explore, SQL, System tabs
- gRPC, OpenTelemetry, auth middleware, TOML config
- Validated with 286K row staffing company dataset across 7 tables
- Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms
### Phase 6: Ingest Pipeline
Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.
| Step | Deliverable | Gate |
|---|---|---|
| 6.1 | `ingestd` crate with CSV parser Arrow RecordBatch Parquet | CSV file queryable dataset |
| 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file flat Parquet |
| 6.3 | Schema detection infer column types from data | No manual schema definition needed |
| 6.4 | Deduplication detect and skip already-ingested files (content hash) | Re-ingest same file = no-op |
| 6.5 | Text chunking split large text fields for embedding | Long text overlapping chunks |
| 6.6 | Auto-registration ingest writes to storage AND registers in catalog | Single API call: file in queryable |
| 6.7 | Gateway endpoint: `POST /ingest` with file upload | Upload CSV from browser query in seconds |
**Gate:** Upload a raw CSV or JSON file auto-detected schema stored as Parquet registered immediately queryable via SQL. No manual steps.
**Risk:** Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let user override.
### Phase 7: Vector Index + RAG Pipeline
Make unstructured data searchable by meaning, not just keywords.
| Step | Deliverable | Gate |
|---|---|---|
| 7.1 | `vectord` crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet |
| 7.2 | Chunking strategy configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks |
| 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL) | Semantic search works, correctness verified |
| 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms |
| 7.5 | RAG endpoint: `POST /rag` question embed search retrieve generate | Natural language question grounded answer |
| 7.6 | Auto-embed on ingest text columns automatically embedded during ingest | No separate embedding step needed |
| 7.7 | Hybrid search combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) |
**Gate:** Ingest 15K candidate resumes auto-embed ask "find someone who could handle our Kubernetes migration" system returns relevant candidates ranked by semantic match, with LLM explanation.
**Risk: HNSW in Rust at scale.** This is the hardest technical problem. Options:
- `hora` crate Rust-native ANN, but less mature than FAISS
- Store HNSW index as a serialized file alongside Parquet data
- Fallback: brute-force scan is fine up to ~100K vectors; optimize later
- Nuclear option: use Qdrant as an external vector store (breaks "no new services" rule)
**Decision needed:** Evaluate `hora` vs external Qdrant vs brute-force at J's data scale.
### Phase 8: Hot Cache + Incremental Updates
Make frequently-accessed data fast, and handle real-time updates without full rewrite.
| Step | Deliverable | Gate |
|---|---|---|
| 8.1 | MemTable hot cache pin active datasets in memory | Queries on hot data: <10ms |
| 8.2 | Cache policy LRU eviction based on access patterns | Memory-bounded, auto-manages |
| 8.3 | Incremental writes append new rows without rewriting entire Parquet file | Update one candidate's phone no full table rewrite |
| 8.4 | Merge-on-read query combines base Parquet + delta files | Correct results from base + updates |
| 8.5 | Compaction periodic merge of delta files into base Parquet | Prevent delta file proliferation |
| 8.6 | Upsert semantics insert or update by primary key | Same candidate ID update in place |
**Gate:** Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.
**Risk: This is the Delta Lake problem.** Full ACID transactions over Parquet files is what Databricks spent years building. We're NOT building Delta Lake we're building a pragmatic version:
- Append-only delta files (easy)
- Merge-on-read (moderate)
- Compaction (moderate)
- Full ACID isolation (NOT attempting single-writer model instead)
### Phase 9+: Future (not designed yet)
- Database connector ingest (PostgreSQL, MySQL, MSSQL Parquet)
- PDF/document ingest (OCR text chunks embed)
- Scheduled ingest (cron-based file watching)
- Multi-node query distribution
- Row-level access control
- Audit log (who queried what, when)
---
## Reference Dataset: Staffing Company
Validated with realistic staffing company data:
| Table | Rows | Description |
|---|---|---|
| candidates | 15,000 | Names, phones, emails, zip, skills, resume text, availability |
| clients | 500 | Companies, contacts, verticals, bill rates |
| job_orders | 3,000 | Positions with descriptions, requirements, rates |
| placements | 8,000 | Candidatejob matches with dates, rates, recruiters |
| timesheets | 120,000 | Weekly hours, bill/pay totals, approvals |
| call_log | 80,000 | Phone CDR who called whom, duration, disposition |
| email_log | 60,000 | Email tracking subject, opened, direction |
| **Total** | **286,500** | **7 tables, cross-referenced** |
Proven queries:
- Candidate search by skills + location + availability: 80ms
- Revenue by client with profit margins (JOIN 120K timesheets): 142ms
- Cold lead detection (candidates called 5+ times, never placed): 94ms
- Margin analysis by vertical (JOIN placements job orders): 53ms
- Natural language AI-generated SQL execution results: ~3s (model inference)
---
## Available Local Models
| Model | Use |
|---|---|
| `nomic-embed-text` | Embeddings (768d) semantic search, RAG retrieval |
| `qwen2.5` | SQL generation, structured output, summarization |
| `mistral` | General generation, longer context |
| `gemma2` | General generation |
| `llama3.2` | General generation, lightweight |
---
## Non-Goals
- Multi-tenancy (single-owner system)
- Cloud deployment (local-first, always)
- Full ACID transactions (single-writer model is sufficient)
- Real-time streaming / CDC (batch ingest is the model)
- Replacing the CRM (this is the analytical layer BEHIND the CRM)
- Custom file formats (Parquet is the format, period)
---
## Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Vector search in Rust at scale | **High** | Start brute-force, evaluate `hora` crate, Qdrant as fallback |
| Incremental updates on Parquet | **High** | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | **High** | Conservative schema detection, default to string, user overrides |
| Schema evolution across ingests | **Medium** | Schema fingerprinting + versioned manifests |
| Memory pressure from hot cache | **Medium** | LRU eviction, configurable memory limit |
| HNSW index persistence | **Medium** | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | **Low** | Can replace with direct Ollama HTTP from Rust later |
---
## Operating Rules
1. PRD > architecture > phases > status > git
2. Git is memory, not chat
3. No undocumented changes
4. No silent architecture drift
5. Always work in smallest valid step
6. Always verify before moving on
7. **New:** Flag when something is genuinely hard vs just engineering work
8. **New:** If a phase reveals the approach is wrong, update the PRD before continuing