
PRD: Lakehouse — Rust-First Object Storage System

Status: Active · Created: 2026-03-27 · Owner: J


Problem

Traditional data platforms couple storage, compute, and metadata into monolithic databases. This creates vendor lock-in, scaling bottlenecks, and opaque data access. AI workloads bolt onto these systems awkwardly, sharing resources with transactional queries.

We need a system where:

  • Object storage is the source of truth (not a database)
  • Metadata, access, and execution are controlled by Rust services
  • Queries run directly over object storage via Arrow/Parquet
  • AI inference is isolated and swappable
  • The entire system is rebuildable from repository + docs alone

Solution

A modular Rust service mesh over S3-compatible object storage.

Locked Stack

| Layer | Technology | Locked |
|-------|------------|--------|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow object_store | Yes |
| Storage Backend | RustFS (fallback: SeaweedFS) | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |

No new frameworks. No exceptions.


Architecture

Services

| Service | Responsibility |
|---------|----------------|
| gateway | HTTP ingress, routing, auth envelope, middleware |
| catalogd | Metadata control plane — dataset registry, schema versions, manifest index |
| storaged | Object I/O — read/write/list/delete via object_store crate |
| queryd | SQL execution — DataFusion over registered Parquet datasets |
| aibridge | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| ui | Dioxus frontend — dataset browser, query editor, results viewer |
| shared | Types, errors, Arrow helpers, protobuf definitions |
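
The crate layout above might map to a workspace manifest along these lines. This is a sketch: only the crate names come from the service table; the member paths, resolver setting, and the timing comment on ui are assumptions.

```toml
# Workspace root Cargo.toml (illustrative; member paths are assumptions)
[workspace]
resolver = "2"
members = [
    "crates/shared",
    "crates/gateway",
    "crates/catalogd",
    "crates/storaged",
    "crates/queryd",
    "crates/aibridge",
    # "crates/ui",  # Dioxus frontend; arrives in Phase 4
]
```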

AI Sidecar

Python FastAPI process that adapts Ollama's HTTP API into Arrow-compatible formats:

  • POST /embed → nomic-embed-text via Ollama
  • POST /generate → configurable model (qwen2.5, mistral, gemma2, llama3.2)
  • POST /rerank → cross-encoder reranking via generate endpoint

No mocks. No stubs. Real models from day one. Ollama manages model lifecycle, GPU scheduling, caching. Sidecar is stateless passthrough.
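
On the Rust side, the sidecar contract can be pinned down with plain request types. A minimal std-only sketch: the field names ("model", "input") are assumptions about the sidecar's JSON schema, not fixed by this PRD, and a real aibridge would serialize with serde rather than hand-rolling JSON.

```rust
/// Request body for the sidecar's POST /embed endpoint.
/// Field names are illustrative assumptions about the sidecar schema.
pub struct EmbedRequest {
    pub model: String, // resolved from the environment, never hardcoded
    pub input: String,
}

impl EmbedRequest {
    /// Hand-rolled JSON for illustration; production code would use serde.
    pub fn to_json(&self) -> String {
        format!(
            r#"{{"model":"{}","input":"{}"}}"#,
            self.model.replace('"', "\\\""),
            self.input.replace('"', "\\\"")
        )
    }
}

fn main() {
    let req = EmbedRequest {
        model: "nomic-embed-text".to_string(),
        input: "hello lakehouse".to_string(),
    };
    println!("{}", req.to_json());
}
```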

Data Flow

Client → gateway → catalogd (metadata lookup)
                  → storaged (object read/write)
                  → queryd  (SQL execution over Parquet)
                  → aibridge → sidecar → Ollama (inference)
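
The gateway's fan-out above is a pure routing decision before any I/O happens. A sketch of that decision as a function — the URL prefixes here are hypothetical, since the PRD does not fix URL shapes:

```rust
/// Resolve a gateway request path to the backing service.
/// Path prefixes are hypothetical; the PRD does not specify URL shapes.
fn route(path: &str) -> Option<&'static str> {
    if path.starts_with("/datasets") {
        Some("catalogd") // metadata lookup
    } else if path.starts_with("/objects") {
        Some("storaged") // object read/write
    } else if path.starts_with("/query") {
        Some("queryd") // SQL over Parquet
    } else if path.starts_with("/ai") {
        Some("aibridge") // inference via sidecar
    } else {
        None
    }
}

fn main() {
    assert_eq!(route("/query/sql"), Some("queryd"));
    assert_eq!(route("/unknown"), None);
    println!("routing ok");
}
```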

Invariants

  1. Object storage = source of truth for all data
  2. catalogd = sole metadata authority (datasets, schemas, manifests)
  3. No raw data stored in catalog — only pointers (bucket, key, schema fingerprint)
  4. storaged never interprets data — dumb pipe with presigned URLs
  5. queryd registers tables via catalog pointers, not by scanning storage
  6. aibridge is stateless — Python sidecar is replaceable without touching Rust
  7. All services are modular and independently replaceable

Dependency Graph

queryd → catalogd → storaged → shared
aibridge → shared
gateway → {storaged, catalogd, queryd, aibridge}
ui → gateway (HTTP only, no crate dependency)

Phases

Phase 0: Bootstrap

Workspace compiles, gateway serves health check, structured logging works.

Gate: cargo build clean, GET /health returns 200, logs on stdout, docs committed.

Phase 1: Storage + Catalog

Write Parquet to object storage, register in catalog, read back.

Gate: Upload Parquet → register dataset → retrieve metadata → read back. All via gateway HTTP.

Phase 2: Query Engine

SQL queries over registered Parquet datasets via DataFusion.

Gate: SELECT * FROM dataset LIMIT 10 returns correct results. Resolution goes through catalog.

Phase 3: AI Integration

Python sidecar with real Ollama models. Embeddings, generation, reranking.

Gate: Rust sends text → Python → Ollama → real embeddings return as Arrow-compatible floats.
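
The last step of that gate amounts to decoding a numeric array at the Rust↔Python boundary. A std-only sketch, assuming the sidecar returns a bare, non-empty JSON float array (the actual response shape is not specified here; a real aibridge would use serde_json):

```rust
/// Parse a bare JSON float array like "[0.1, -0.2, 3.0]" into Vec<f32>.
/// Assumes a non-empty array; illustration only, real code would use serde_json.
fn parse_embedding(body: &str) -> Result<Vec<f32>, std::num::ParseFloatError> {
    body.trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(|s| s.trim().parse::<f32>())
        .collect()
}

fn main() {
    let v = parse_embedding("[0.1, -0.2, 3.0]").unwrap();
    println!("{} dims: {:?}", v.len(), v);
}
```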

Phase 4: Frontend

Dioxus UI: dataset browser, query editor, results table.

Gate: User can browse datasets and run queries from browser.

Phase 5: Hardening

gRPC internals, OpenTelemetry, auth, config-driven startup.

Gate: Services communicate via gRPC. Traces propagate. Auth enforced. System restartable from repo + config.


Available Local Models

| Model | Use |
|-------|-----|
| nomic-embed-text | Embeddings (768d) |
| qwen2.5 | Code generation, structured output |
| mistral | General generation |
| gemma2 | General generation |
| llama3.2 | General generation |

Model selection via environment variables. No hardcoded model names in Rust code.
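
That rule could look like the following sketch. The environment variable names are illustrative assumptions; the PRD only requires that no model name appears in Rust source.

```rust
use std::env;

/// Map a task to the environment variable that names its model.
/// Variable names here are assumptions, not fixed by the PRD.
fn var_name(task: &str) -> Option<&'static str> {
    match task {
        "embed" => Some("LAKEHOUSE_EMBED_MODEL"),
        "generate" => Some("LAKEHOUSE_GENERATE_MODEL"),
        "rerank" => Some("LAKEHOUSE_RERANK_MODEL"),
        _ => None,
    }
}

/// Resolve the model for a task from the environment at call time,
/// so no model name is ever hardcoded in Rust.
fn model_for(task: &str) -> Option<String> {
    env::var(var_name(task)?).ok()
}

fn main() {
    match model_for("generate") {
        Some(m) => println!("generate model: {m}"),
        None => eprintln!("LAKEHOUSE_GENERATE_MODEL not set"),
    }
}
```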


Non-Goals

  • Multi-tenancy
  • Streaming ingestion / CDC
  • Custom file formats
  • Query caching / materialized views
  • Wrapping object_store with another abstraction
  • Cloud deployment (local-first)

Risks

| Risk | Severity | Mitigation |
|------|----------|------------|
| RustFS immaturity | High | Start with LocalFileSystem, test against MinIO, RustFS last. SeaweedFS fallback. |
| DataFusion table registration overhead | Medium | Lazy registration + LRU cache of SessionContext instances. |
| Catalog consistency without DB | Medium | Write-ahead: persist manifest before in-memory update. Rebuild from storage on restart. |
| Dioxus WASM gaps | Medium | Phase 4 is last. Fallback to plain HTML if blocked. |
| Schema evolution | Medium | Schema fingerprinting in Phase 1. Validate before query. |
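
The write-ahead mitigation can be sketched std-only: persist the manifest bytes first, update the in-memory index only after the write succeeds, and rebuild the index by re-reading persisted manifests on restart. The local filesystem path and the opaque string manifest are placeholders; a real catalogd would persist to object storage.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// In-memory catalog index, rebuildable from persisted manifests.
struct Catalog {
    index: HashMap<String, String>, // dataset_id -> manifest body (placeholder)
}

impl Catalog {
    /// Write-ahead: persist first, then update memory. If the write
    /// fails, the in-memory index stays consistent with storage.
    fn register(&mut self, dir: &Path, id: &str, manifest: &str) -> io::Result<()> {
        fs::write(dir.join(format!("{id}.manifest")), manifest)?;
        self.index.insert(id.to_string(), manifest.to_string());
        Ok(())
    }

    /// Rebuild the index by scanning persisted manifests on restart.
    fn rebuild(dir: &Path) -> io::Result<Self> {
        let mut index = HashMap::new();
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            if path.extension().is_some_and(|e| e == "manifest") {
                if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                    index.insert(stem.to_string(), fs::read_to_string(&path)?);
                }
            }
        }
        Ok(Catalog { index })
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("lakehouse_manifests");
    fs::create_dir_all(&dir)?;
    let mut cat = Catalog { index: HashMap::new() };
    cat.register(&dir, "events", r#"{"bucket":"b","key":"k"}"#)?;
    let rebuilt = Catalog::rebuild(&dir)?;
    assert_eq!(rebuilt.index.get("events"), cat.index.get("events"));
    println!("rebuild consistent with write-ahead state");
    Ok(())
}
```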

Operating Rules

  1. PRD > architecture > phases > status > git
  2. Git is memory, not chat
  3. No undocumented changes
  4. No silent architecture drift
  5. Always work in smallest valid step
  6. Always verify before moving on