PRD: Lakehouse — Rust-First Object Storage System
Status: Active Created: 2026-03-27 Owner: J
Problem
Traditional data platforms couple storage, compute, and metadata into monolithic databases. This creates vendor lock-in, scaling bottlenecks, and opaque data access. AI workloads bolt onto these systems awkwardly, sharing resources with transactional queries.
We need a system where:
- Object storage is the source of truth (not a database)
- Metadata, access, and execution are controlled by Rust services
- Queries run directly over object storage via Arrow/Parquet
- AI inference is isolated and swappable
- The entire system is rebuildable from repository + docs alone
Solution
A modular Rust service mesh over S3-compatible object storage.
Locked Stack
| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow object_store | Yes |
| Storage Backend | RustFS (fallback: SeaweedFS) | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
No new frameworks. No exceptions.
Architecture
Services
| Service | Responsibility |
|---|---|
| gateway | HTTP ingress, routing, auth envelope, middleware |
| catalogd | Metadata control plane — dataset registry, schema versions, manifest index |
| storaged | Object I/O — read/write/list/delete via object_store crate |
| queryd | SQL execution — DataFusion over registered Parquet datasets |
| aibridge | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| ui | Dioxus frontend — dataset browser, query editor, results viewer |
| shared | Types, errors, Arrow helpers, protobuf definitions |
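The shared crate's core types might look like the following sketch. Field choices (string bucket/key, a `u64` fingerprint) and the hashing approach are assumptions for illustration, not the repo's actual definitions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Opaque identifier for a registered dataset.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DatasetId(pub String);

/// Pointer into object storage: the catalog stores only this, never raw data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ObjectRef {
    pub bucket: String,
    pub key: String,
}

/// Hash of a schema's field names and types, used to detect drift.
/// Note: std's DefaultHasher is not stable across Rust releases; a real
/// persisted fingerprint would use a fixed algorithm.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchemaFingerprint(pub u64);

impl SchemaFingerprint {
    /// Fingerprint a schema given (field name, type name) pairs.
    pub fn of(fields: &[(&str, &str)]) -> Self {
        let mut h = DefaultHasher::new();
        for (name, ty) in fields {
            name.hash(&mut h);
            ty.hash(&mut h);
        }
        SchemaFingerprint(h.finish())
    }
}

/// Everything the catalog knows about one dataset version.
#[derive(Debug, Clone)]
pub struct DatasetManifest {
    pub id: DatasetId,
    pub object: ObjectRef,
    pub schema: SchemaFingerprint,
}
```

This keeps the catalog invariant visible in the type system: a `DatasetManifest` carries only pointers and a fingerprint, no payload.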
AI Sidecar
Python FastAPI process that adapts Ollama's HTTP API into Arrow-compatible formats:
- POST /embed → nomic-embed-text via Ollama
- POST /generate → configurable model (qwen2.5, mistral, gemma2, llama3.2)
- POST /rerank → cross-encoder reranking via generate endpoint
No mocks. No stubs. Real models from day one. Ollama manages model lifecycle, GPU scheduling, caching. Sidecar is stateless passthrough.
Data Flow
Client → gateway → catalogd (metadata lookup)
→ storaged (object read/write)
→ queryd (SQL execution over Parquet)
→ aibridge → sidecar → Ollama (inference)
Invariants
- Object storage = source of truth for all data
- catalogd = sole metadata authority (datasets, schemas, manifests)
- No raw data stored in catalog — only pointers (bucket, key, schema fingerprint)
- storaged never interprets data — dumb pipe with presigned URLs
- queryd registers tables via catalog pointers, not by scanning storage
- aibridge is stateless — Python sidecar is replaceable without touching Rust
- All services are modular and independently replaceable
Dependency Graph
shared ← storaged ← catalogd ← queryd
shared ← aibridge
gateway → {storaged, catalogd, queryd, aibridge}
ui → gateway (HTTP only, no crate dependency)
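The graph above maps onto a single Cargo workspace. A plausible manifest is sketched below; the `crates/` directory layout is an assumption, not the repo's actual tree:

```toml
# Workspace sketch — crate paths are illustrative.
[workspace]
resolver = "2"
members = [
    "crates/shared",
    "crates/storaged",
    "crates/catalogd",
    "crates/queryd",
    "crates/aibridge",
    "crates/gateway",
]
```

The ui crate is deliberately absent from the dependency set of any service: it talks to the gateway over HTTP only.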
Phases
Phase 0: Bootstrap
Workspace compiles, gateway serves health check, structured logging works.
Gate: cargo build clean, GET /health returns 200, logs on stdout, docs committed.
Phase 1: Storage + Catalog
Write Parquet to object storage, register in catalog, read back.
Gate: Upload Parquet → register dataset → retrieve metadata → read back. All via gateway HTTP.
Phase 2: Query Engine
SQL queries over registered Parquet datasets via DataFusion.
Gate: SELECT * FROM dataset LIMIT 10 returns correct results. Resolution goes through catalog.
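The invariant that queryd resolves tables through catalog pointers (never by scanning storage) and validates schemas before query can be sketched with std only. The `CatalogEntry` shape and error handling here are illustrative assumptions:

```rust
use std::collections::HashMap;

/// What the catalog hands queryd: a storage pointer plus the expected schema hash.
#[derive(Debug, Clone, PartialEq)]
struct CatalogEntry {
    bucket: String,
    key: String,
    schema_fingerprint: u64,
}

/// Resolve a dataset strictly through the catalog and validate its schema
/// fingerprint before the table would be registered with the query engine.
fn resolve_for_query(
    catalog: &HashMap<String, CatalogEntry>,
    dataset: &str,
    observed_fingerprint: u64,
) -> Result<CatalogEntry, String> {
    let entry = catalog
        .get(dataset)
        .ok_or_else(|| format!("unknown dataset: {dataset}"))?;
    // Invariant: validate schema before query; refuse on drift.
    if entry.schema_fingerprint != observed_fingerprint {
        return Err(format!("schema drift on {dataset}"));
    }
    Ok(entry.clone())
}
```

In the real system the `Ok` path would feed DataFusion table registration; the point of the sketch is that storage is never listed or scanned to find tables.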
Phase 3: AI Integration
Python sidecar with real Ollama models. Embeddings, generation, reranking.
Gate: Rust sends text → Python → Ollama → real embeddings return as Arrow-compatible floats.
Phase 4: Frontend
Dioxus UI: dataset browser, query editor, results table.
Gate: User can browse datasets and run queries from browser.
Phase 5: Hardening
gRPC internals, OpenTelemetry, auth, config-driven startup.
Gate: Services communicate via gRPC. Traces propagate. Auth enforced. System restartable from repo + config.
Available Local Models
| Model | Use |
|---|---|
| nomic-embed-text | Embeddings (768d) |
| qwen2.5 | Code generation, structured output |
| mistral | General generation |
| gemma2 | General generation |
| llama3.2 | General generation |
Model selection via environment variables. No hardcoded model names in Rust code.
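Env-driven selection can be a one-function sketch. The variable naming scheme (`AIBRIDGE_<TASK>_MODEL`) and the defaults are assumptions, not the repo's actual configuration keys:

```rust
use std::env;

/// Pick the model for a sidecar task from the environment, falling back to a
/// default. Variable names and defaults here are illustrative.
fn model_for(task: &str) -> String {
    let var = format!("AIBRIDGE_{}_MODEL", task.to_uppercase());
    let default = match task {
        "embed" => "nomic-embed-text",
        _ => "qwen2.5",
    };
    env::var(&var).unwrap_or_else(|_| default.to_string())
}
```

Operators can then swap `mistral` for `qwen2.5` (or any pulled Ollama model) without a rebuild.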
Non-Goals
- Multi-tenancy
- Streaming ingestion / CDC
- Custom file formats
- Query caching / materialized views
- Wrapping object_store with another abstraction
- Cloud deployment (local-first)
Risks
| Risk | Severity | Mitigation |
|---|---|---|
| RustFS immaturity | High | Start with LocalFileSystem, test against MinIO, RustFS last. SeaweedFS fallback. |
| DataFusion table registration overhead | Medium | Lazy registration + LRU cache of SessionContext instances. |
| Catalog consistency without DB | Medium | Write-ahead: persist manifest before in-memory update. Rebuild from storage on restart. |
| Dioxus WASM gaps | Medium | Phase 4 is last. Fallback to plain HTML if blocked. |
| Schema evolution | Medium | Schema fingerprinting in Phase 1. Validate before query. |
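The write-ahead mitigation for catalog consistency can be sketched with std alone: persist the manifest durably (temp file, then atomic rename) before mutating the in-memory index, so a restart can always rebuild the index from what is on disk. Paths and the plain-string manifest body are assumptions:

```rust
use std::collections::HashMap;
use std::fs;
use std::io::Write;
use std::path::Path;

/// Persist a manifest, then update the in-memory index. If the process dies
/// between the two steps, restart rebuilds the index from persisted
/// manifests, so the durable copy is always the source of truth.
fn register_manifest(
    dir: &Path,
    index: &mut HashMap<String, String>,
    dataset: &str,
    manifest_body: &str,
) -> std::io::Result<()> {
    let tmp = dir.join(format!("{dataset}.manifest.tmp"));
    let fin = dir.join(format!("{dataset}.manifest"));
    // 1. Write-ahead: durable copy first (write temp, fsync, atomic rename).
    let mut f = fs::File::create(&tmp)?;
    f.write_all(manifest_body.as_bytes())?;
    f.sync_all()?;
    fs::rename(&tmp, &fin)?;
    // 2. Only now mutate in-memory state.
    index.insert(dataset.to_string(), fin.display().to_string());
    Ok(())
}
```

In the real system the durable copy would go to object storage rather than the local filesystem, but the ordering discipline is the same.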
Operating Rules
- PRD > architecture > phases > status > git
- Git is memory, not chat
- No undocumented changes
- No silent architecture drift
- Always work in smallest valid step
- Always verify before moving on