
PRD: Lakehouse — Rust-First Object Storage System

Status: Active · Created: 2026-03-27 · Owner: J


Problem

Traditional data platforms couple storage, compute, and metadata into monolithic databases. This creates vendor lock-in, scaling bottlenecks, and opaque data access. AI workloads bolt onto these systems awkwardly, sharing resources with transactional queries.

We need a system where:

  • Object storage is the source of truth (not a database)
  • Metadata, access, and execution are controlled by Rust services
  • Queries run directly over object storage via Arrow/Parquet
  • AI inference is isolated and swappable
  • The entire system is rebuildable from repository + docs alone

Solution

A modular Rust service mesh over S3-compatible object storage.

Locked Stack

| Layer | Technology | Locked |
|-------|------------|--------|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow object_store | Yes |
| Storage Backend | RustFS (fallback: SeaweedFS) | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |

No new frameworks. No exceptions.


Architecture

Services

| Service | Responsibility |
|---------|----------------|
| gateway | HTTP ingress, routing, auth envelope, middleware |
| catalogd | Metadata control plane — dataset registry, schema versions, manifest index |
| storaged | Object I/O — read/write/list/delete via object_store crate |
| queryd | SQL execution — DataFusion over registered Parquet datasets |
| aibridge | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| ui | Dioxus frontend — dataset browser, query editor, results viewer |
| shared | Types, errors, Arrow helpers, protobuf definitions |
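
The crate layout above might map to a workspace manifest along these lines. This is a sketch: only the crate names come from the service table; the member paths, resolver setting, and the timing comment on ui are assumptions.

```toml
# Workspace root Cargo.toml (illustrative; member paths are assumptions)
[workspace]
resolver = "2"
members = [
    "crates/shared",
    "crates/gateway",
    "crates/catalogd",
    "crates/storaged",
    "crates/queryd",
    "crates/aibridge",
    # "crates/ui",  # Dioxus frontend; arrives in Phase 4
]
```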

AI Sidecar

Python FastAPI process that adapts Ollama's HTTP API into Arrow-compatible formats:

  • POST /embed → nomic-embed-text via Ollama
  • POST /generate → configurable model (qwen2.5, mistral, gemma2, llama3.2)
  • POST /rerank → cross-encoder reranking via generate endpoint

No mocks. No stubs. Real models from day one. Ollama manages model lifecycle, GPU scheduling, caching. Sidecar is stateless passthrough.
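
On the Rust side, the sidecar contract can be pinned down with plain request types. A minimal std-only sketch: the field names ("model", "input") are assumptions about the sidecar's JSON schema, not fixed by this PRD, and a real aibridge would serialize with serde rather than hand-rolling JSON.

```rust
/// Request body for the sidecar's POST /embed endpoint.
/// Field names are illustrative assumptions about the sidecar schema.
pub struct EmbedRequest {
    pub model: String, // resolved from the environment, never hardcoded
    pub input: String,
}

impl EmbedRequest {
    /// Hand-rolled JSON for illustration; production code would use serde.
    pub fn to_json(&self) -> String {
        format!(
            r#"{{"model":"{}","input":"{}"}}"#,
            self.model.replace('"', "\\\""),
            self.input.replace('"', "\\\"")
        )
    }
}

fn main() {
    let req = EmbedRequest {
        model: "nomic-embed-text".to_string(),
        input: "hello lakehouse".to_string(),
    };
    println!("{}", req.to_json());
}
```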

Data Flow

Client → gateway → catalogd (metadata lookup)
                  → storaged (object read/write)
                  → queryd  (SQL execution over Parquet)
                  → aibridge → sidecar → Ollama (inference)
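
The gateway's fan-out above is a pure routing decision before any I/O happens. A sketch of that decision as a function — the URL prefixes here are hypothetical, since the PRD does not fix URL shapes:

```rust
/// Resolve a gateway request path to the backing service.
/// Path prefixes are hypothetical; the PRD does not specify URL shapes.
fn route(path: &str) -> Option<&'static str> {
    if path.starts_with("/datasets") {
        Some("catalogd") // metadata lookup
    } else if path.starts_with("/objects") {
        Some("storaged") // object read/write
    } else if path.starts_with("/query") {
        Some("queryd") // SQL over Parquet
    } else if path.starts_with("/ai") {
        Some("aibridge") // inference via sidecar
    } else {
        None
    }
}

fn main() {
    assert_eq!(route("/query/sql"), Some("queryd"));
    assert_eq!(route("/unknown"), None);
    println!("routing ok");
}
```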

Invariants

  1. Object storage = source of truth for all data
  2. catalogd = sole metadata authority (datasets, schemas, manifests)
  3. No raw data stored in catalog — only pointers (bucket, key, schema fingerprint)
  4. storaged never interprets data — dumb pipe with presigned URLs
  5. queryd registers tables via catalog pointers, not by scanning storage
  6. aibridge is stateless — Python sidecar is replaceable without touching Rust
  7. All services are modular and independently replaceable

Dependency Graph

queryd → catalogd → storaged → shared
aibridge → shared
gateway → {storaged, catalogd, queryd, aibridge}
ui → gateway (HTTP only, no crate dependency)

Phases

Phase 0: Bootstrap

Workspace compiles, gateway serves health check, structured logging works.

Gate: cargo build clean, GET /health returns 200, logs on stdout, docs committed.

Phase 1: Storage + Catalog

Write Parquet to object storage, register in catalog, read back.

Gate: Upload Parquet → register dataset → retrieve metadata → read back. All via gateway HTTP.

Phase 2: Query Engine

SQL queries over registered Parquet datasets via DataFusion.

Gate: SELECT * FROM dataset LIMIT 10 returns correct results. Resolution goes through catalog.

Phase 3: AI Integration

Python sidecar with real Ollama models. Embeddings, generation, reranking.

Gate: Rust sends text → Python → Ollama → real embeddings return as Arrow-compatible floats.
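
The last step of that gate amounts to decoding a numeric array at the Rust↔Python boundary. A std-only sketch, assuming the sidecar returns a bare, non-empty JSON float array (the actual response shape is not specified here; a real aibridge would use serde_json):

```rust
/// Parse a bare JSON float array like "[0.1, -0.2, 3.0]" into Vec<f32>.
/// Assumes a non-empty array; illustration only, real code would use serde_json.
fn parse_embedding(body: &str) -> Result<Vec<f32>, std::num::ParseFloatError> {
    body.trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(|s| s.trim().parse::<f32>())
        .collect()
}

fn main() {
    let v = parse_embedding("[0.1, -0.2, 3.0]").unwrap();
    println!("{} dims: {:?}", v.len(), v);
}
```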

Phase 4: Frontend

Dioxus UI: dataset browser, query editor, results table.

Gate: User can browse datasets and run queries from browser.

Phase 5: Hardening

gRPC internals, OpenTelemetry, auth, config-driven startup.

Gate: Services communicate via gRPC. Traces propagate. Auth enforced. System restartable from repo + config.


Available Local Models

| Model | Use |
|-------|-----|
| nomic-embed-text | Embeddings (768d) |
| qwen2.5 | Code generation, structured output |
| mistral | General generation |
| gemma2 | General generation |
| llama3.2 | General generation |

Model selection via environment variables. No hardcoded model names in Rust code.
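
That rule could look like the following sketch. The environment variable names are illustrative assumptions; the PRD only requires that no model name appears in Rust source.

```rust
use std::env;

/// Map a task to the environment variable that names its model.
/// Variable names here are assumptions, not fixed by the PRD.
fn var_name(task: &str) -> Option<&'static str> {
    match task {
        "embed" => Some("LAKEHOUSE_EMBED_MODEL"),
        "generate" => Some("LAKEHOUSE_GENERATE_MODEL"),
        "rerank" => Some("LAKEHOUSE_RERANK_MODEL"),
        _ => None,
    }
}

/// Resolve the model for a task from the environment at call time,
/// so no model name is ever hardcoded in Rust.
fn model_for(task: &str) -> Option<String> {
    env::var(var_name(task)?).ok()
}

fn main() {
    match model_for("generate") {
        Some(m) => println!("generate model: {m}"),
        None => eprintln!("LAKEHOUSE_GENERATE_MODEL not set"),
    }
}
```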


Non-Goals

  • Multi-tenancy
  • Streaming ingestion / CDC
  • Custom file formats
  • Query caching / materialized views
  • Wrapping object_store with another abstraction
  • Cloud deployment (local-first)

Risks

| Risk | Severity | Mitigation |
|------|----------|------------|
| RustFS immaturity | High | Start with LocalFileSystem, test against MinIO, RustFS last. SeaweedFS fallback. |
| DataFusion table registration overhead | Medium | Lazy registration + LRU cache of SessionContext instances. |
| Catalog consistency without DB | Medium | Write-ahead: persist manifest before in-memory update. Rebuild from storage on restart. |
| Dioxus WASM gaps | Medium | Phase 4 is last. Fallback to plain HTML if blocked. |
| Schema evolution | Medium | Schema fingerprinting in Phase 1. Validate before query. |
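
The write-ahead mitigation can be sketched std-only: persist the manifest bytes first, update the in-memory index only after the write succeeds, and rebuild the index by re-reading persisted manifests on restart. The local filesystem path and the opaque string manifest are placeholders; a real catalogd would persist to object storage.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// In-memory catalog index, rebuildable from persisted manifests.
struct Catalog {
    index: HashMap<String, String>, // dataset_id -> manifest body (placeholder)
}

impl Catalog {
    /// Write-ahead: persist first, then update memory. If the write
    /// fails, the in-memory index stays consistent with storage.
    fn register(&mut self, dir: &Path, id: &str, manifest: &str) -> io::Result<()> {
        fs::write(dir.join(format!("{id}.manifest")), manifest)?;
        self.index.insert(id.to_string(), manifest.to_string());
        Ok(())
    }

    /// Rebuild the index by scanning persisted manifests on restart.
    fn rebuild(dir: &Path) -> io::Result<Self> {
        let mut index = HashMap::new();
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            if path.extension().is_some_and(|e| e == "manifest") {
                if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                    index.insert(stem.to_string(), fs::read_to_string(&path)?);
                }
            }
        }
        Ok(Catalog { index })
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("lakehouse_manifests");
    fs::create_dir_all(&dir)?;
    let mut cat = Catalog { index: HashMap::new() };
    cat.register(&dir, "events", r#"{"bucket":"b","key":"k"}"#)?;
    let rebuilt = Catalog::rebuild(&dir)?;
    assert_eq!(rebuilt.index.get("events"), cat.index.get("events"));
    println!("rebuild consistent with write-ahead state");
    Ok(())
}
```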

Operating Rules

  1. PRD > architecture > phases > status > git
  2. Git is memory, not chat
  3. No undocumented changes
  4. No silent architecture drift
  5. Always work in smallest valid step
  6. Always verify before moving on