# PRD: Lakehouse — Rust-First Object Storage System

**Status:** Active
**Created:** 2026-03-27
**Owner:** J

---

## Problem

Traditional data platforms couple storage, compute, and metadata into monolithic databases. This creates vendor lock-in, scaling bottlenecks, and opaque data access. AI workloads bolt onto these systems awkwardly, sharing resources with transactional queries.

We need a system where:

- Object storage is the source of truth (not a database)
- Metadata, access, and execution are controlled by Rust services
- Queries run directly over object storage via Arrow/Parquet
- AI inference is isolated and swappable
- The entire system is rebuildable from repository + docs alone

---

## Solution

A modular Rust service mesh over S3-compatible object storage.

### Locked Stack

| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow `object_store` | Yes |
| Storage Backend | RustFS (fallback: SeaweedFS) | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |

No new frameworks. No exceptions.
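The locked stack implies a single Cargo workspace containing the Rust services. A minimal sketch of the workspace manifest — the `crates/` layout and member names are assumptions, not mandated by this PRD, and the Dioxus `ui` app may live inside or outside the Rust workspace:

```toml
# Hypothetical workspace layout; actual crate paths may differ.
[workspace]
resolver = "2"
members = [
    "crates/gateway",
    "crates/catalogd",
    "crates/storaged",
    "crates/queryd",
    "crates/aibridge",
    "crates/shared",
]
```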
---

## Architecture

### Services

| Service | Responsibility |
|---|---|
| **gateway** | HTTP ingress, routing, auth envelope, middleware |
| **catalogd** | Metadata control plane — dataset registry, schema versions, manifest index |
| **storaged** | Object I/O — read/write/list/delete via `object_store` crate |
| **queryd** | SQL execution — DataFusion over registered Parquet datasets |
| **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| **ui** | Dioxus frontend — dataset browser, query editor, results viewer |
| **shared** | Types, errors, Arrow helpers, protobuf definitions |

### AI Sidecar

Python FastAPI process that adapts Ollama's HTTP API into Arrow-compatible formats:

- `POST /embed` → `nomic-embed-text` via Ollama
- `POST /generate` → configurable model (qwen2.5, mistral, gemma2, llama3.2)
- `POST /rerank` → cross-encoder reranking via generate endpoint

No mocks. No stubs. Real models from day one. Ollama manages model lifecycle, GPU scheduling, and caching. The sidecar is a stateless passthrough.

### Data Flow

```
Client → gateway → catalogd  (metadata lookup)
                 → storaged  (object read/write)
                 → queryd    (SQL execution over Parquet)
                 → aibridge → sidecar → Ollama (inference)
```

### Invariants

1. Object storage = source of truth for all data
2. catalogd = sole metadata authority (datasets, schemas, manifests)
3. No raw data stored in catalog — only pointers (bucket, key, schema fingerprint)
4. storaged never interprets data — dumb pipe with presigned URLs
5. queryd registers tables via catalog pointers, not by scanning storage
6. aibridge is stateless — the Python sidecar is replaceable without touching Rust
7. All services are modular and independently replaceable

### Dependency Graph

```
shared ← storaged ← catalogd ← queryd
shared ← aibridge
gateway → {storaged, catalogd, queryd, aibridge}
ui → gateway (HTTP only, no crate dependency)
```

---

## Phases

### Phase 0: Bootstrap

Workspace compiles, gateway serves a health check, structured logging works.
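The health-check part of this gate can be smoke-tested without the real stack. A stdlib-only sketch of a `GET /health` round trip — the actual gateway uses Axum + Tokio, and `serve_health_once`, the ephemeral port-0 bind, and the `ok` body are all illustrative:

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Bind an ephemeral port and answer exactly one request with 200 OK.
/// Stdlib-only stand-in for the Axum gateway's /health route.
fn serve_health_once() -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("addr");
    thread::spawn(move || {
        if let Ok((mut stream, _)) = listener.accept() {
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf); // consume the request bytes
            let body = "ok";
            let resp = format!(
                "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
                body.len(),
                body
            );
            let _ = stream.write_all(resp.as_bytes());
        }
    });
    addr
}

/// Issue a raw GET /health and return the full response text.
fn check_health(addr: SocketAddr) -> String {
    let mut stream = TcpStream::connect(addr).expect("connect");
    stream
        .write_all(b"GET /health HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
        .expect("write");
    let mut out = String::new();
    stream.read_to_string(&mut out).expect("read");
    out
}

fn main() {
    let addr = serve_health_once();
    let resp = check_health(addr);
    assert!(resp.starts_with("HTTP/1.1 200 OK"));
    println!("health gate passed");
}
```

The same shape — spawn the service, hit the endpoint, assert on the status line — carries over directly once Axum replaces the hand-rolled listener.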
**Gate:** `cargo build` clean, `GET /health` returns 200, logs on stdout, docs committed.

### Phase 1: Storage + Catalog

Write Parquet to object storage, register it in the catalog, read it back.

**Gate:** Upload Parquet → register dataset → retrieve metadata → read back. All via gateway HTTP.

### Phase 2: Query Engine

SQL queries over registered Parquet datasets via DataFusion.

**Gate:** `SELECT * FROM dataset LIMIT 10` returns correct results. Resolution goes through the catalog.

### Phase 3: AI Integration

Python sidecar with real Ollama models. Embeddings, generation, reranking.

**Gate:** Rust sends text → Python → Ollama → real embeddings return as Arrow-compatible floats.

### Phase 4: Frontend

Dioxus UI: dataset browser, query editor, results table.

**Gate:** User can browse datasets and run queries from the browser.

### Phase 5: Hardening

gRPC internals, OpenTelemetry, auth, config-driven startup.

**Gate:** Services communicate via gRPC. Traces propagate. Auth is enforced. The system is restartable from repo + config.

---

## Available Local Models

| Model | Use |
|---|---|
| `nomic-embed-text` | Embeddings (768d) |
| `qwen2.5` | Code generation, structured output |
| `mistral` | General generation |
| `gemma2` | General generation |
| `llama3.2` | General generation |

Model selection via environment variables. No hardcoded model names in Rust code.

---

## Non-Goals

- Multi-tenancy
- Streaming ingestion / CDC
- Custom file formats
- Query caching / materialized views
- Wrapping `object_store` with another abstraction
- Cloud deployment (local-first)

---

## Risks

| Risk | Severity | Mitigation |
|---|---|---|
| RustFS immaturity | High | Start with LocalFileSystem, test against MinIO, RustFS last. SeaweedFS fallback. |
| DataFusion table registration overhead | Medium | Lazy registration + LRU cache of SessionContext instances. |
| Catalog consistency without DB | Medium | Write-ahead: persist manifest before in-memory update. Rebuild from storage on restart. |
| Dioxus WASM gaps | Medium | Phase 4 is last. Fall back to plain HTML if blocked. |
| Schema evolution | Medium | Schema fingerprinting in Phase 1. Validate before query. |

---

## Operating Rules

1. PRD > architecture > phases > status > git
2. Git is memory, not chat
3. No undocumented changes
4. No silent architecture drift
5. Always work in the smallest valid step
6. Always verify before moving on
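The "model selection via environment variables, no hardcoded model names" rule above can be sketched in a few lines. The variable name `GEN_MODEL` is an assumption for illustration; only the lookup mechanism matters:

```rust
use std::env;

/// Resolve a model name from the environment. There is deliberately no
/// hardcoded default: an unset variable is a configuration error, which
/// keeps model names out of Rust source entirely.
fn model_from_env(var: &str) -> Result<String, String> {
    env::var(var).map_err(|_| format!("{var} must be set (no hardcoded model names)"))
}

fn main() {
    // GEN_MODEL is a hypothetical variable name, not mandated by the PRD.
    match model_from_env("GEN_MODEL") {
        Ok(model) => println!("generation model: {model}"),
        Err(e) => eprintln!("config error: {e}"),
    }
}
```

Swapping `qwen2.5` for `llama3.2` then becomes a deployment change, never a recompile.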