PRD: Lakehouse — Rust-First Object Storage System
Status: Active — Phase 0-5 complete, entering production path
Created: 2026-03-27
Owner: J
Problem
Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.
A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.
We need a system where:
- Any data source (CSV, DB export, PDF, JSON) can be ingested without pre-defined schemas
- Structured data is queryable via SQL at scale (millions of rows, sub-second)
- Unstructured data is searchable via AI embeddings (semantic retrieval)
- An LLM can answer natural language questions against all of it
- Everything runs locally — no cloud APIs, total data privacy
- The system is rebuildable from repository + object storage alone
Solution
A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.
Locked Stack
| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow object_store | Yes |
| Storage Backend | LocalFileSystem → RustFS → S3 | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
| Vector Index | TBD — evaluate hora, qdrant crate, or HNSW from scratch | Open |
No new frameworks without documented ADR.
Architecture
Services
| Service | Responsibility |
|---|---|
| gateway | HTTP/gRPC ingress, routing, auth, CORS, body limits |
| catalogd | Metadata control plane — dataset registry, schema versions, manifests |
| storaged | Object I/O — read/write/list/delete via object_store |
| queryd | SQL execution — DataFusion over Parquet, MemTable hot cache |
| ingestd | NEW — Ingest pipeline: CSV/JSON/DB → normalize → Parquet → catalog |
| vectord | NEW — Embedding store + vector index: chunk → embed → index → search |
| aibridge | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| ui | Dioxus frontend — Ask, Explore, SQL, System tabs |
| shared | Types, errors, Arrow helpers, config, protobuf definitions |
Data Flow
Raw data → ingestd (normalize, chunk, detect schema)
├→ storaged (Parquet files to object storage)
├→ catalogd (register dataset + schema)
├→ vectord (embed text chunks, build index)
└→ queryd (auto-register as queryable table)
User question → gateway
├→ vectord (semantic search for relevant chunks) ← RAG path
├→ queryd (SQL over structured data) ← Analytics path
└→ aibridge → Ollama (generate answer from context)
Query Paths
Analytical (SQL): "What's the average bill rate for .NET devs in Chicago?" → DataFusion scans the columnar Parquet data and returns in <200ms
Semantic (RAG): "Find candidates who could do data engineering work" → Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers
Hybrid: "Which clients are we losing money on, and why?" → SQL for margin calculations + RAG over client notes/emails for context → LLM synthesizes
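To make the analytical path concrete, here is a minimal sketch of how queryd might run such a question with DataFusion. The table names, columns, and file paths are illustrative assumptions, not the actual catalog layout.

```rust
// Hedged sketch of the analytical path: DataFusion querying registered Parquet
// datasets directly. Table and column names are illustrative assumptions.
use datafusion::prelude::*;

async fn avg_bill_rate() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("placements", "data/placements.parquet", ParquetReadOptions::default()).await?;
    ctx.register_parquet("candidates", "data/candidates.parquet", ParquetReadOptions::default()).await?;

    // ".NET devs in Chicago" expressed as plain SQL over the columnar files.
    let df = ctx
        .sql(
            "SELECT AVG(p.bill_rate) AS avg_bill_rate
             FROM placements p
             JOIN candidates c ON c.candidate_id = p.candidate_id
             WHERE c.city = 'Chicago' AND c.skills LIKE '%.NET%'",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```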
Invariants
- Object storage = source of truth for all data
- catalogd = sole metadata authority
- No raw data in catalog — only pointers
- vectord stores embeddings AS Parquet (portable, not a proprietary format)
- ingestd is idempotent — re-ingesting the same file is a no-op
- Hot cache is a performance layer, not a source of truth — eviction is safe
- All services modular and independently replaceable
Phases
Phase 0-5: Foundation ✅ COMPLETE
- Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
- Python sidecar with real Ollama models (embed, generate, rerank)
- Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs
- gRPC, OpenTelemetry, auth middleware, TOML config
- Validated with 286K row staffing company dataset across 7 tables
- Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms
Phase 6: Ingest Pipeline
Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.
| Step | Deliverable | Gate |
|---|---|---|
| 6.1 | ingestd crate with CSV parser → Arrow RecordBatch → Parquet | CSV file → queryable dataset |
| 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file → flat Parquet |
| 6.3 | Schema detection — infer column types from data | No manual schema definition needed |
| 6.4 | Deduplication — detect and skip already-ingested files (content hash) | Re-ingest same file = no-op |
| 6.5 | Text chunking — split large text fields for embedding | Long text → overlapping chunks |
| 6.6 | Auto-registration — ingest writes to storage AND registers in catalog | Single API call: file in → queryable |
| 6.7 | Gateway endpoint: POST /ingest with file upload | Upload CSV from browser → query in seconds |
Gate: Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps.
Risk: Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let user override.
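A minimal sketch of the 6.1 path (CSV → Arrow RecordBatch → Parquet) using the arrow and parquet crates. Exact builder APIs vary by crate version, and the schema is assumed to have been inferred upstream (6.3); this is not the real ingestd code.

```rust
// Hedged sketch: stream a CSV file into a Parquet file via Arrow RecordBatches.
use std::fs::File;
use std::sync::Arc;

use arrow::csv::ReaderBuilder;
use arrow::datatypes::Schema;
use parquet::arrow::ArrowWriter;

fn csv_to_parquet(
    csv_path: &str,
    parquet_path: &str,
    schema: Arc<Schema>,
) -> Result<(), Box<dyn std::error::Error>> {
    // Read the CSV into Arrow RecordBatches against the (previously inferred) schema.
    let input = File::open(csv_path)?;
    let reader = ReaderBuilder::new(schema.clone())
        .with_header(true) // newer arrow-csv; older versions call this has_header
        .build(input)?;

    // Stream the batches straight into a Parquet writer.
    let output = File::create(parquet_path)?;
    let mut writer = ArrowWriter::try_new(output, schema, None)?;
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.close()?;
    Ok(())
}
```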
Phase 7: Vector Index + RAG Pipeline
Make unstructured data searchable by meaning, not just keywords.
| Step | Deliverable | Gate |
|---|---|---|
| 7.1 | vectord crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet |
| 7.2 | Chunking strategy — configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks |
| 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL) | Semantic search works, correctness verified |
| 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms |
| 7.5 | RAG endpoint: POST /rag — question → embed → search → retrieve → generate | Natural language question → grounded answer |
| 7.6 | Auto-embed on ingest — text columns automatically embedded during ingest | No separate embedding step needed |
| 7.7 | Hybrid search — combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) |
Gate: Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with LLM explanation.
Risk: HNSW in Rust at scale. This is the hardest technical problem. Options:
- hora crate — Rust-native ANN, but less mature than FAISS
- Store HNSW index as a serialized file alongside Parquet data
- Fallback: brute-force scan is fine up to ~100K vectors; optimize later
- Nuclear option: use Qdrant as an external vector store (breaks "no new services" rule)
Decision needed: Evaluate hora vs external Qdrant vs brute-force at J's data scale.
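For reference, the brute-force fallback (7.3) is only a few lines of Rust. A minimal sketch, with illustrative names (SearchHit, top_k) that are not the real vectord API:

```rust
// Hedged sketch of brute-force semantic search: cosine similarity over
// in-memory vectors, fine up to roughly 100K vectors per the risk note above.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

struct SearchHit {
    doc_id: String,
    score: f32,
}

fn top_k(query: &[f32], corpus: &[(String, Vec<f32>)], k: usize) -> Vec<SearchHit> {
    let mut hits: Vec<SearchHit> = corpus
        .iter()
        .map(|(doc_id, v)| SearchHit { doc_id: doc_id.clone(), score: cosine(query, v) })
        .collect();
    // Sort descending by similarity and keep the k best matches.
    hits.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));
    hits.truncate(k);
    hits
}
```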
Phase 8: Hot Cache + Incremental Updates
Make frequently-accessed data fast, and handle real-time updates without full rewrite.
| Step | Deliverable | Gate |
|---|---|---|
| 8.1 | MemTable hot cache — pin active datasets in memory | Queries on hot data: <10ms |
| 8.2 | Cache policy — LRU eviction based on access patterns | Memory-bounded, auto-manages |
| 8.3 | Incremental writes — append new rows without rewriting entire Parquet file | Update one candidate's phone → no full table rewrite |
| 8.4 | Merge-on-read — query combines base Parquet + delta files | Correct results from base + updates |
| 8.5 | Compaction — periodic merge of delta files into base Parquet | Prevent delta file proliferation |
| 8.6 | Upsert semantics — insert or update by primary key | Same candidate ID → update in place |
Gate: Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.
Risk: This is the Delta Lake problem. Full ACID transactions over Parquet files are what Databricks spent years building. We're NOT building Delta Lake — we're building a pragmatic version:
- Append-only delta files (easy)
- Merge-on-read (moderate)
- Compaction (moderate)
- Full ACID isolation (NOT attempting — single-writer model instead)
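A hedged sketch of what merge-on-read (8.4) could look like in queryd: register the base Parquet and the delta files with DataFusion, then resolve the newest row per primary key. Table names, paths, and the updated_at ordering column are assumptions, not the real schema.

```rust
// Hedged sketch of merge-on-read: UNION base and delta, keep latest row per key.
use datafusion::prelude::*;

async fn merged_candidates(ctx: &SessionContext) -> datafusion::error::Result<DataFrame> {
    // Base Parquet plus a directory of append-only delta files.
    ctx.register_parquet("candidates_base", "data/candidates/base.parquet", ParquetReadOptions::default()).await?;
    ctx.register_parquet("candidates_delta", "data/candidates/delta/", ParquetReadOptions::default()).await?;

    // Keep only the newest version of each candidate_id across base + deltas.
    ctx.sql(
        "SELECT * FROM (
             SELECT *,
                    ROW_NUMBER() OVER (PARTITION BY candidate_id ORDER BY updated_at DESC) AS rn
             FROM (SELECT * FROM candidates_base UNION ALL SELECT * FROM candidates_delta) AS all_rows
         ) AS merged
         WHERE rn = 1",
    )
    .await
}
```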
Phase 8.5: Agent Workspaces ✅ COMPLETE
Per-contract overlays with daily/weekly/monthly tiers and instant handoff.
- WorkspaceManager with saved searches, shortlists, activity logs
- Zero-copy handoff between agents (pointer swap, not data copy)
- Persisted to object storage, rebuilt on startup
Phase 9: Event Journal — Never Destroy Evidence
Principle: Every mutation is appended, never overwritten. This is the one decision that's impossible to retrofit — once history is lost, it's gone forever.
| Step | Deliverable | Gate |
|---|---|---|
| 9.1 | journald crate: append-only event log as Parquet | Every write/update/delete logged with who, when, what, old value, new value |
| 9.2 | Event schema: entity, field, old_value, new_value, actor, timestamp, source, workspace_id | Standardized across all mutations |
| 9.3 | Journal query: SELECT * FROM journal WHERE entity = 'CAND-001' ORDER BY timestamp | Full history of any record |
| 9.4 | Replay capability: rebuild any dataset's state at any point in time | Time-travel queries |
| 9.5 | Journal compaction: roll old events into monthly summary Parquet files | Prevent unbounded growth |
Gate: Change a candidate's phone number. Query shows the change. Journal shows old value, new value, who changed it, when, and why. Replay to yesterday's state.
Why now: In 3 years, compliance, AI auditability, and "why did the agent recommend this candidate" all require mutation history. Adding it later means you only have history from that day forward.
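A minimal sketch of the 9.2 event record as a Rust type. The fields follow the schema listed above; the serde derives and concrete types are assumptions.

```rust
// Hedged sketch of a journald event (Phase 9.2) as it might be written to Parquet.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
struct JournalEvent {
    entity: String,              // e.g. "CAND-001"
    field: String,               // which column changed
    old_value: Option<String>,
    new_value: Option<String>,
    actor: String,               // user or agent that made the change
    timestamp: i64,              // epoch millis, UTC
    source: String,              // ingest job, UI edit, agent tool, ...
    workspace_id: Option<String>,
}
```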
Phase 10: Rich Catalog v2 — Metadata as Product
Principle: Every dataset should be self-describing. A new team member (or AI agent) should understand what data exists, who owns it, how fresh it is, and what's sensitive — without asking anyone.
| Step | Deliverable | Gate |
|---|---|---|
| 10.1 | Catalog schema upgrade: add owner, sensitivity, freshness_sla, description, tags, lineage | GET /catalog/datasets returns rich metadata |
| 10.2 | Sensitivity classification: PII, PHI, financial, public, internal | Sensitive fields tagged at ingest |
| 10.3 | Lineage tracking: source_system → ingest_job → dataset → derived_dataset | "Where did this data come from?" answerable |
| 10.4 | Freshness contracts: expected_update_frequency, last_updated, stale_after | Alert when data goes stale |
| 10.5 | Dataset contracts: required columns, type expectations, validation rules | Ingest rejects data that breaks the contract |
| 10.6 | Auto-documentation: AI generates dataset description from schema + sample data | New datasets self-describe on ingest |
Gate: Ingest a CSV. System auto-detects PII columns (email, phone, SSN patterns), tags them, generates a description, sets owner, and tracks lineage back to the source file.
Why now: Every dataset you ingest without metadata becomes a "mystery file" in 6 months. The metadata layer makes the difference between a searchable knowledge platform and a data graveyard.
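As a sketch, a 10.1 catalog entry could be represented roughly like this. Field names mirror the PRD; the enum and types are illustrative assumptions, not the real catalogd schema.

```rust
// Hedged sketch of a catalogd v2 dataset record (Phase 10.1).
#[derive(Debug, Clone)]
enum Sensitivity { Public, Internal, Pii, Phi, Financial }

#[derive(Debug, Clone)]
struct DatasetMetadata {
    name: String,
    owner: String,
    description: String,            // may be AI-generated on ingest (10.6)
    tags: Vec<String>,
    sensitivity: Vec<Sensitivity>,  // rollup of field-level sensitivity tags
    freshness_sla: Option<String>,  // expected update frequency; alert when stale
    lineage: Vec<String>,           // source_system → ingest_job → dataset → derived_dataset
    last_updated: i64,              // epoch millis
}
```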
Phase 11: Embedding Versioning — Model-Proof Vector Layer
Principle: Embedding models will change. If you don't track which model created which vectors, upgrading means re-embedding everything from scratch.
| Step | Deliverable | Gate |
|---|---|---|
| 11.1 | Vector index metadata: model_name, model_version, dimensions, created_at | Every index knows its embedding model |
| 11.2 | Multi-version indexes: same data, different models, coexist | Search specifies which model version |
| 11.3 | Incremental re-embed: only new/changed docs get re-embedded on model upgrade | Model swap doesn't require full re-embed |
| 11.4 | A/B search: query both old and new model, compare results | Validate model upgrade before committing |
Gate: Upgrade from nomic-embed-text to a new model. Old index still works. New index builds incrementally. Compare search quality. Switch when ready.
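A small sketch of the 11.1 index metadata, plus a guard that a query embedded with one model cannot search an index built with another. The struct and function names are illustrative.

```rust
// Hedged sketch: every vector index carries its embedding model identity (Phase 11.1).
#[derive(Debug, Clone, PartialEq)]
struct IndexMetadata {
    model_name: String,     // e.g. "nomic-embed-text"
    model_version: String,
    dimensions: usize,      // e.g. 768
    created_at: i64,        // epoch millis
}

// Refuse to mix vectors from different models or dimensionalities.
fn check_compatible(index: &IndexMetadata, query_model: &str, query_dims: usize) -> Result<(), String> {
    if index.model_name != query_model || index.dimensions != query_dims {
        return Err(format!(
            "query embedded with {query_model} ({query_dims}d) cannot search index built with {} ({}d)",
            index.model_name, index.dimensions
        ));
    }
    Ok(())
}
```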
Phase 12: Tool Registry — Agent-Safe Business Actions
Principle: In 3 years, AI agents won't just query — they'll act. Instead of every agent getting raw SQL access, expose named, governed, audited business actions.
| Step | Deliverable | Gate |
|---|---|---|
| 12.1 | Tool definition: name, description, parameters, permissions, audit_level | search_candidates(skills, city, min_years) as a registered tool |
| 12.2 | Tool execution: validates params, checks permissions, logs usage, runs query | Agent calls tool, gets results, action is logged |
| 12.3 | Read vs write tools: read tools are permissive, write tools require confirmation | get_candidate = auto-approved, update_phone = requires review |
| 12.4 | MCP-compatible interface: expose tools via Model Context Protocol | Any MCP-compatible agent (Claude, GPT, local) can use them |
| 12.5 | Rate limiting + quotas per agent/tool | Prevent runaway agent from overwhelming the system |
Gate: An AI agent calls search_candidates(skills="Python,AWS", city="Chicago", available=true) → gets results → calls shortlist_candidate(workspace_id, candidate_id, reason) → action is logged, auditable, reversible.
Why now: The tool interface is cheap to build (it's just named endpoints with validation). But retrofitting audit logging and permission checks onto raw SQL access is a nightmare. Build the governed interface first.
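A hedged sketch of the 12.1 tool definition and the governed execution wrapper. Names like ToolDefinition and AuditLevel are illustrative, and the MCP wiring is out of scope here.

```rust
// Hedged sketch of a registered tool (Phases 12.1-12.3). The registry validates
// parameters, gates write actions, and logs the call before anything touches data.
use std::collections::HashMap;

#[derive(Debug, Clone)]
enum AuditLevel { Log, LogAndReview }

#[derive(Debug, Clone)]
struct ToolDefinition {
    name: String,                 // e.g. "search_candidates"
    description: String,
    required_params: Vec<String>, // e.g. ["skills", "city"]
    write_action: bool,           // write tools require confirmation (12.3)
    audit_level: AuditLevel,
}

fn execute_tool(
    tool: &ToolDefinition,
    params: &HashMap<String, String>,
    actor: &str,
) -> Result<(), String> {
    // 1. Validate parameters against the tool contract.
    for p in &tool.required_params {
        if !params.contains_key(p) {
            return Err(format!("missing required parameter '{p}' for {}", tool.name));
        }
    }
    // 2. Write tools are held until a human confirms.
    if tool.write_action {
        return Err(format!("{} requires confirmation before execution", tool.name));
    }
    // 3. An audit entry (actor, tool, params) would be appended to the event journal here.
    println!("audit: {actor} called {} with {:?}", tool.name, params);
    Ok(())
}
```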
Phase 13: Security & Access Control
| Step | Deliverable | Gate |
|---|---|---|
| 13.1 | Field-level sensitivity tags (PII, PHI, financial) in catalog | Sensitive fields identified |
| 13.2 | Row-level access policies (agent A sees their candidates only) | Policy evaluated at query time |
| 13.3 | Column masking (show last 4 of SSN, redact salary for non-managers) | Masked results based on role |
| 13.4 | Query audit log (who queried what, when, which fields) | Every data access recorded |
| 13.5 | Policy-as-code (TOML/YAML rules, not hardcoded) | Non-engineer can update access rules |
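A small sketch of 13.5 policy-as-code as it might deserialize from a TOML or YAML file. The rule shape (role, dataset, masked columns, row filter) is an assumption, not a committed design.

```rust
// Hedged sketch of an access policy rule (Phase 13.5), loaded from config
// rather than hardcoded in application code.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct AccessPolicy {
    role: String,                 // e.g. "recruiter"
    dataset: String,              // e.g. "candidates"
    masked_columns: Vec<String>,  // e.g. ["ssn", "salary"], masked at query time
    row_filter: Option<String>,   // e.g. "owner_agent = :current_agent"
}
```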
Phase 14: Schema Evolution + AI Migration
| Step | Deliverable | Gate |
|---|---|---|
| 14.1 | Schema diff detection: old schema vs new ingest → list changes | "Column renamed: first_name → full_name" |
| 14.2 | AI-generated migration rules: LLM suggests column mappings | "full_name = concat(first_name, ' ', last_name)" |
| 14.3 | Migration preview: show how old data maps to new schema before applying | Human approves before data transforms |
| 14.4 | Versioned schemas in catalog: v1, v2, v3 coexist | Queries specify version or use latest |
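A minimal sketch of how 14.1 schema diffs might be modeled before the LLM proposes migration rules. The enum variants are illustrative.

```rust
// Hedged sketch of the Phase 14.1 schema diff, the input to AI-generated migration rules.
#[derive(Debug, Clone)]
enum SchemaChange {
    ColumnAdded { name: String, data_type: String },
    ColumnDropped { name: String },
    ColumnRenamed { from: String, to: String },            // e.g. first_name → full_name
    TypeChanged { name: String, from: String, to: String },
}
```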
Phase 15+: Horizon
- Federated multi-bucket query (client A's S3 + client B's S3 + yours)
- Database connector ingest (PostgreSQL, MySQL, MSSQL → Parquet via CDC)
- PDF OCR for scanned documents (Tesseract integration)
- Scheduled ingest (cron-based file watching, S3 event triggers)
- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Multi-node query distribution (DataFusion supports this architecturally)
- Video/audio transcript ingest + multimodal embeddings
Reference Dataset: Staffing Company
Scale-tested on a 128 GB RAM server:
| Table | Rows | Size | Description |
|---|---|---|---|
| candidates | 100,000 | 10.1 MB | Names, phones, emails, zip, skills, resume text |
| clients | 2,000 | 33 KB | Companies, contacts, verticals |
| job_orders | 15,000 | 0.9 MB | Positions with descriptions, requirements, rates |
| placements | 50,000 | 1.2 MB | Candidate↔job matches with rates, recruiters |
| timesheets | 1,000,000 | 16.7 MB | Weekly hours, bill/pay totals, approvals |
| call_log | 800,000 | 34.3 MB | Phone CDR — who called whom, duration, disposition |
| email_log | 500,000 | 16.0 MB | Email tracking — subject, opened, direction |
| Total | 2,467,000 | 79 MB | 7 tables, cross-referenced |
Benchmarks (2.47M rows)
| Query | Cold (Parquet) | Hot (MemCache) | Speedup |
|---|---|---|---|
| 100K candidate filter (skills+city+status) | 257ms | 21ms | 12x |
| 1M timesheet aggregation + JOIN | 942ms | 96ms | 9.8x |
| 800K call log cross-reference (cold leads) | 642ms | — | — |
| Triple JOIN recruiter performance | 487ms | — | — |
| 500K email open rate aggregation | 259ms | — | — |
| COUNT all 2.47M rows | 84ms | — | — |
| 10K vector semantic search (cosine) | 450ms | — | — |
| Natural language → AI SQL → execute | ~3s | — | (model inference) |
Vector Search
- 10K candidate resumes embedded in 204s (49 chunks/sec via Ollama)
- Semantic search over 10K vectors: ~450ms (brute-force cosine)
- RAG pipeline: question → embed → search → retrieve → LLM answer with citations
- The LLM declines to answer rather than hallucinating when the retrieved context doesn't support an answer
Agent Workspaces
- Create per-contract workspace with saved searches + shortlists
- Instant handoff between agents — zero data copy
- Full activity timeline preserved across handoffs
Available Local Models
| Model | Use |
|---|---|
| nomic-embed-text | Embeddings (768d) — semantic search, RAG retrieval |
| qwen2.5 | SQL generation, structured output, summarization |
| mistral | General generation, longer context |
| gemma2 | General generation |
| llama3.2 | General generation, lightweight |
Non-Goals
- Multi-tenancy (single-owner system)
- Cloud deployment (local-first, always)
- Full ACID transactions (single-writer model is sufficient)
- Real-time streaming / CDC (batch ingest is the model)
- Replacing the CRM (this is the analytical layer BEHIND the CRM)
- Custom file formats (Parquet is the format, period)
Risks
Technical Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Vector search in Rust at scale | High | Start brute-force, evaluate hora crate, Qdrant as fallback |
| Incremental updates on Parquet | High | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | High | Conservative schema detection, default to string, user overrides |
| 100K+ embedding timeout | High | Async background job with progress, not single HTTP request |
| Schema evolution across ingests | Medium | Schema fingerprinting + versioned manifests (Phase 14) |
| Memory pressure from hot cache | Medium | LRU eviction, configurable memory limit (tested: 408MB for 1.1M rows) |
| HNSW index persistence | Medium | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | Low | Can replace with direct Ollama HTTP from Rust later |
Strategic Risks (Future-Proofing)
| Risk | Impact | Phase |
|---|---|---|
| No mutation history → can't audit AI decisions | Critical — compliance, trust | Phase 9 (event journal) |
| No metadata → datasets become mystery files | High — onboarding, discovery | Phase 10 (rich catalog) |
| Embeddings locked to one model | High — can't upgrade models | Phase 11 (versioning) |
| Raw SQL as only interface → ungoverned agent access | High — security, auditability | Phase 12 (tool registry) |
| No sensitivity classification → compliance exposure | Medium — grows with data volume | Phase 13 (access control) |
| No schema evolution handling → ingest breaks on format change | Medium — grows with source count | Phase 14 (AI migration) |
Design Principles (Future-Proofing)
These are the decisions that still look smart after the stack changes:
- Store the truth openly. Parquet on object storage. No proprietary formats. Any engine can read it.
- Describe it richly. Every dataset has an owner, lineage, sensitivity tags, freshness contract.
- Never destroy evidence. Every mutation is journaled. Rebuild any state at any point in time.
- Secure it centrally. Permissions live in the data layer, not application code.
- Expose it through reusable interfaces. Named tools with contracts, not raw SQL for every consumer.
- Version everything. Schemas, embeddings, models — all versioned, all coexist during migration.
- Make unstructured data first-class. Every document gets: storage, text extraction, entity tags, chunks, embeddings, linkage.
- Separate storage from compute from intelligence. Scale each independently. Replace any layer without touching the others.
Operating Rules
- PRD > architecture > phases > status > git
- Git is memory, not chat
- No undocumented changes
- No silent architecture drift
- Always work in smallest valid step
- Always verify before moving on
- Flag when something is genuinely hard vs just engineering work
- If a phase reveals the approach is wrong, update the PRD before continuing
- Cheap-now, expensive-later decisions get built first (event journal, metadata, versioning)
- Build the governed interface before the raw interface (tools before SQL for agents)