PRD: Lakehouse — Rust-First Object Storage System

Status: Active — Phase 0-5 complete, entering production path
Created: 2026-03-27
Owner: J


Problem

Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.

A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.

We need a system where:

  • Any data source (CSV, DB export, PDF, JSON) can be ingested without pre-defined schemas
  • Structured data is queryable via SQL at scale (millions of rows, sub-second)
  • Unstructured data is searchable via AI embeddings (semantic retrieval)
  • An LLM can answer natural language questions against all of it
  • Everything runs locally — no cloud APIs, total data privacy
  • The system is rebuildable from repository + object storage alone

Solution

A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.

Locked Stack

| Layer | Technology | Locked |
| --- | --- | --- |
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow object_store | Yes |
| Storage Backend | LocalFileSystem → RustFS → S3 | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
| Vector Index | TBD — evaluate hora, qdrant crate, or HNSW from scratch | Open |

No new frameworks without a documented ADR.


Architecture

Services

| Service | Responsibility |
| --- | --- |
| gateway | HTTP/gRPC ingress, routing, auth, CORS, body limits |
| catalogd | Metadata control plane — dataset registry, schema versions, manifests |
| storaged | Object I/O — read/write/list/delete via object_store |
| queryd | SQL execution — DataFusion over Parquet, MemTable hot cache |
| ingestd | NEW — Ingest pipeline: CSV/JSON/DB → normalize → Parquet → catalog |
| vectord | NEW — Embedding store + vector index: chunk → embed → index → search |
| aibridge | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| ui | Dioxus frontend — Ask, Explore, SQL, System tabs |
| shared | Types, errors, Arrow helpers, config, protobuf definitions |

Data Flow

Raw data → ingestd (normalize, chunk, detect schema)
    ├→ storaged (Parquet files to object storage)
    ├→ catalogd (register dataset + schema)
    ├→ vectord (embed text chunks, build index)
    └→ queryd  (auto-register as queryable table)

User question → gateway
    ├→ vectord (semantic search for relevant chunks)  ← RAG path
    ├→ queryd  (SQL over structured data)             ← Analytics path
    └→ aibridge → Ollama (generate answer from context)

Query Paths

Analytical (SQL): "What's the average bill rate for .NET devs in Chicago?" → DataFusion scans Parquet columnar, returns in <200ms

Semantic (RAG): "Find candidates who could do data engineering work" → Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers

Hybrid: "Which clients are we losing money on, and why?" → SQL for margin calculations + RAG over client notes/emails for context → LLM synthesizes

Invariants

  1. Object storage = source of truth for all data
  2. catalogd = sole metadata authority
  3. No raw data in catalog — only pointers
  4. vectord stores embeddings AS Parquet (portable, not a proprietary format)
  5. ingestd is idempotent — re-ingesting the same file is a no-op
  6. Hot cache is a performance layer, not a source of truth — eviction is safe
  7. All services modular and independently replaceable

Phases

Phase 0-5: Foundation (COMPLETE)

  • Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
  • Python sidecar with real Ollama models (embed, generate, rerank)
  • Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs
  • gRPC, OpenTelemetry, auth middleware, TOML config
  • Validated with a 286K-row staffing company dataset across 7 tables
  • Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms

Phase 6: Ingest Pipeline

Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.

| Step | Deliverable | Gate |
| --- | --- | --- |
| 6.1 | ingestd crate with CSV parser → Arrow RecordBatch → Parquet | CSV file → queryable dataset |
| 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file → flat Parquet |
| 6.3 | Schema detection — infer column types from data | No manual schema definition needed |
| 6.4 | Deduplication — detect and skip already-ingested files (content hash) | Re-ingest same file = no-op |
| 6.5 | Text chunking — split large text fields for embedding | Long text → overlapping chunks |
| 6.6 | Auto-registration — ingest writes to storage AND registers in catalog | Single API call: file in → queryable |
| 6.7 | Gateway endpoint: POST /ingest with file upload | Upload CSV from browser → query in seconds |

Gate: Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps.
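
Step 6.4's no-op guarantee can hang off a plain content hash. A minimal sketch, assuming the sha2 and hex crates; `catalog_has_hash` is a hypothetical stand-in for a catalogd lookup:

```rust
use sha2::{Digest, Sha256};

/// Hex SHA-256 of the raw file bytes, used as the dedup key.
fn content_hash(bytes: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(bytes);
    hex::encode(hasher.finalize())
}

/// Ingest guard: skip files the catalog has already seen.
/// `catalog_has_hash` stands in for a real catalogd call.
fn should_ingest(bytes: &[u8], catalog_has_hash: impl Fn(&str) -> bool) -> bool {
    !catalog_has_hash(&content_hash(bytes))
}
```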

Risk: Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let user override.
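
That conservative inference can be mechanical: promote a column past string only when every non-empty value parses. A sketch against the arrow crate's schema types (the sampling strategy and the always-nullable choice are illustrative, not decided):

```rust
use arrow::datatypes::{DataType, Field};

/// Conservative per-column type inference: a column is promoted past
/// Utf8 only if EVERY non-empty value parses as the candidate type.
/// Empty strings count as nulls and never force a type.
fn infer_field(name: &str, values: &[&str]) -> Field {
    let non_empty: Vec<&str> = values.iter().copied().filter(|v| !v.is_empty()).collect();
    let all = |pred: fn(&str) -> bool| !non_empty.is_empty() && non_empty.iter().copied().all(pred);

    let dtype = if all(|v| v.parse::<i64>().is_ok()) {
        DataType::Int64
    } else if all(|v| v.parse::<f64>().is_ok()) {
        DataType::Float64
    } else if all(|v| matches!(v, "true" | "false")) {
        DataType::Boolean
    } else {
        DataType::Utf8 // the safe default for messy data
    };
    Field::new(name, dtype, true) // always nullable: real exports have gaps
}
```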

Phase 7: Vector Index + RAG Pipeline

Make unstructured data searchable by meaning, not just keywords.

| Step | Deliverable | Gate |
| --- | --- | --- |
| 7.1 | vectord crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet |
| 7.2 | Chunking strategy — configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks |
| 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL) | Semantic search works, correctness verified |
| 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms |
| 7.5 | RAG endpoint: POST /rag — question → embed → search → retrieve → generate | Natural language question → grounded answer |
| 7.6 | Auto-embed on ingest — text columns automatically embedded during ingest | No separate embedding step needed |
| 7.7 | Hybrid search — combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) |

Gate: Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with LLM explanation.
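
Invariant 4 fixes the 7.1 layout: embeddings are rows in ordinary Parquet, one per chunk. A possible Arrow schema sketch, assuming a recent arrow release and nomic-embed-text's 768 dimensions; field names are illustrative:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};

/// One row per embedded chunk. The whole table round-trips through
/// Parquet, so the vector store is rebuildable from object storage.
fn embedding_schema(dims: i32) -> Schema {
    Schema::new(vec![
        Field::new("doc_id", DataType::Utf8, false),
        Field::new("chunk_id", DataType::Int32, false),
        Field::new("chunk_text", DataType::Utf8, false),
        Field::new(
            "vector",
            DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, false)), dims),
            false,
        ),
    ])
}

// embedding_schema(768) matches nomic-embed-text.
```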

Risk: HNSW in Rust at scale. This is the hardest technical problem. Options:

  • hora crate — Rust-native ANN, but less mature than FAISS
  • Store HNSW index as a serialized file alongside Parquet data
  • Fallback: brute-force scan is fine up to ~100K vectors; optimize later
  • Nuclear option: use Qdrant as an external vector store (breaks "no new services" rule)

Decision needed: Evaluate hora vs external Qdrant vs brute-force at J's data scale.
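
Whatever wins, the brute-force fallback is small enough to write today. A sketch of the full-scan path in plain Rust (the 7.3 deliverable routes this through DataFusion SQL instead; this is the standalone version):

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Brute-force top-k over every chunk vector: O(n*d) per query,
/// serviceable at the ~100K-vector scale noted above.
fn top_k<'a>(query: &[f32], chunks: &'a [(String, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = chunks
        .iter()
        .map(|(id, v)| (id.as_str(), cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(k);
    scored
}
```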

Phase 8: Hot Cache + Incremental Updates

Make frequently accessed data fast, and handle incremental updates without rewriting whole Parquet files.

| Step | Deliverable | Gate |
| --- | --- | --- |
| 8.1 | MemTable hot cache — pin active datasets in memory | Queries on hot data: <10ms |
| 8.2 | Cache policy — LRU eviction based on access patterns | Memory-bounded, self-managing |
| 8.3 | Incremental writes — append new rows without rewriting entire Parquet file | Update one candidate's phone → no full table rewrite |
| 8.4 | Merge-on-read — query combines base Parquet + delta files | Correct results from base + updates |
| 8.5 | Compaction — periodic merge of delta files into base Parquet | Prevent delta file proliferation |
| 8.6 | Upsert semantics — insert or update by primary key | Same candidate ID → update in place |

Gate: Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.
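
DataFusion already provides the 8.1 building block: a MemTable holds RecordBatches in memory and registers like any other table. A sketch, assuming the batches were just loaded from the dataset's Parquet (helper name is illustrative):

```rust
use std::sync::Arc;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

/// Pin an already-loaded dataset in memory so hot queries skip object
/// storage. Eviction stays safe: Parquet remains the source of truth
/// (invariant 6). Assumes `batches` is non-empty with one schema.
fn pin_hot(ctx: &SessionContext, name: &str, batches: Vec<RecordBatch>) -> datafusion::error::Result<()> {
    let schema = batches[0].schema();
    let table = MemTable::try_new(schema, vec![batches])?;
    ctx.register_table(name, Arc::new(table))?;
    Ok(())
}
```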

Risk: This is the Delta Lake problem. Full ACID transactions over Parquet files are what Databricks spent years building. We're NOT building Delta Lake — we're building a pragmatic version:

  • Append-only delta files (easy)
  • Merge-on-read (moderate)
  • Compaction (moderate)
  • Full ACID isolation (NOT attempting — single-writer model instead)
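
Merge-on-read then falls out of plain SQL once base and delta are registered as separate tables. A sketch, assuming a per-row _ingested_at timestamp and candidate_id as the primary key, both hypothetical names:

```rust
use datafusion::prelude::*;

/// Merge-on-read: union base + delta, keep the newest row per key.
async fn merged_view(ctx: &SessionContext) -> datafusion::error::Result<DataFrame> {
    ctx.register_parquet("base", "data/candidates/base.parquet", ParquetReadOptions::default())
        .await?;
    ctx.register_parquet("delta", "data/candidates/delta/", ParquetReadOptions::default())
        .await?;

    ctx.sql(
        "SELECT * FROM (
           SELECT *, ROW_NUMBER() OVER (PARTITION BY candidate_id
                                        ORDER BY _ingested_at DESC) AS rn
           FROM (SELECT * FROM base UNION ALL SELECT * FROM delta) AS all_rows
         ) AS ranked
         WHERE rn = 1",
    )
    .await
}
```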

Phase 9+: Future (not designed yet)

  • Database connector ingest (PostgreSQL, MySQL, MSSQL → Parquet)
  • PDF/document ingest (OCR → text → chunks → embed)
  • Scheduled ingest (cron-based file watching)
  • Multi-node query distribution
  • Row-level access control
  • Audit log (who queried what, when)

Reference Dataset: Staffing Company

Validated with realistic staffing company data:

| Table | Rows | Description |
| --- | --- | --- |
| candidates | 15,000 | Names, phones, emails, zip, skills, resume text, availability |
| clients | 500 | Companies, contacts, verticals, bill rates |
| job_orders | 3,000 | Positions with descriptions, requirements, rates |
| placements | 8,000 | Candidate↔job matches with dates, rates, recruiters |
| timesheets | 120,000 | Weekly hours, bill/pay totals, approvals |
| call_log | 80,000 | Phone CDR — who called whom, duration, disposition |
| email_log | 60,000 | Email tracking — subject, opened, direction |
| Total | 286,500 | 7 tables, cross-referenced |

Proven queries:

  • Candidate search by skills + location + availability: 80ms
  • Revenue by client with profit margins (JOIN 120K timesheets): 142ms
  • Cold lead detection (candidates called 5+ times, never placed): 94ms
  • Margin analysis by vertical (JOIN placements → job orders): 53ms
  • Natural language → AI-generated SQL → execution → results: ~3s (model inference)
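
For flavor, the cold lead query can be expressed directly against the registered tables. A sketch through a DataFusion session; column names beyond the table names above are illustrative:

```rust
use datafusion::prelude::*;

/// Candidates called 5+ times but never placed (the cold lead query).
async fn cold_leads(ctx: &SessionContext) -> datafusion::error::Result<DataFrame> {
    ctx.sql(
        "SELECT c.candidate_id, c.name, COUNT(cl.call_id) AS calls
         FROM candidates c
         JOIN call_log cl ON cl.candidate_id = c.candidate_id
         LEFT JOIN placements p ON p.candidate_id = c.candidate_id
         WHERE p.placement_id IS NULL
         GROUP BY c.candidate_id, c.name
         HAVING COUNT(cl.call_id) >= 5",
    )
    .await
}
```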

Available Local Models

| Model | Use |
| --- | --- |
| nomic-embed-text | Embeddings (768d) — semantic search, RAG retrieval |
| qwen2.5 | SQL generation, structured output, summarization |
| mistral | General generation, longer context |
| gemma2 | General generation |
| llama3.2 | General generation, lightweight |
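
For reference, the "direct Ollama HTTP from Rust" escape hatch noted under Risks is one POST per chunk. A sketch with reqwest (json feature) and serde_json, assuming Ollama's /api/embeddings endpoint on its default port:

```rust
use serde_json::{json, Value};

/// Embed one chunk by calling Ollama directly, bypassing the sidecar.
async fn embed(client: &reqwest::Client, text: &str) -> Result<Vec<f32>, reqwest::Error> {
    let body: Value = client
        .post("http://localhost:11434/api/embeddings")
        .json(&json!({ "model": "nomic-embed-text", "prompt": text }))
        .send()
        .await?
        .json()
        .await?;
    // Response shape assumed: {"embedding": [f64, ...]}.
    let vec = body["embedding"]
        .as_array()
        .map(|a| a.iter().filter_map(|v| v.as_f64()).map(|f| f as f32).collect())
        .unwrap_or_default();
    Ok(vec)
}
```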

Non-Goals

  • Multi-tenancy (single-owner system)
  • Cloud deployment (local-first, always)
  • Full ACID transactions (single-writer model is sufficient)
  • Real-time streaming / CDC (batch ingest is the model)
  • Replacing the CRM (this is the analytical layer BEHIND the CRM)
  • Custom file formats (Parquet is the format, period)

Risks

| Risk | Severity | Mitigation |
| --- | --- | --- |
| Vector search in Rust at scale | High | Start brute-force, evaluate hora crate, Qdrant as fallback |
| Incremental updates on Parquet | High | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | High | Conservative schema detection, default to string, user overrides |
| Schema evolution across ingests | Medium | Schema fingerprinting + versioned manifests |
| Memory pressure from hot cache | Medium | LRU eviction, configurable memory limit |
| HNSW index persistence | Medium | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | Low | Can replace with direct Ollama HTTP from Rust later |

Operating Rules

  1. PRD > architecture > phases > status > git
  2. Git is memory, not chat
  3. No undocumented changes
  4. No silent architecture drift
  5. Always work in smallest valid step
  6. Always verify before moving on
  7. New: Flag when something is genuinely hard vs just engineering work
  8. New: If a phase reveals the approach is wrong, update the PRD before continuing