Phase Tracker

Phase 0: Bootstrap

  • 0.1 — Cargo workspace with all crate stubs compiling
  • 0.2 — shared crate: error types, ObjectRef, DatasetId (sketch after this list)
  • 0.3 — gateway with Axum: GET /health → 200
  • 0.4 — tracing + tracing-subscriber wired in gateway
  • 0.5 — justfile with build, test, run recipes
  • 0.6 — docs committed to git
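
A minimal sketch of the 0.2 shared types, assuming the thiserror crate; the fields and error variants are illustrative, only the type names come from this tracker.

```rust
// shared (sketch): core identifiers plus a workspace-wide error enum.
use thiserror::Error;

/// Opaque identifier for a registered dataset.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DatasetId(pub String);

/// Reference to an object in the storage backend.
#[derive(Debug, Clone)]
pub struct ObjectRef {
    pub key: String,
    pub size_bytes: u64,
}

/// Error type shared by every crate in the workspace.
#[derive(Debug, Error)]
pub enum Error {
    #[error("not found: {0}")]
    NotFound(String),
    #[error("storage: {0}")]
    Storage(String),
    #[error("catalog: {0}")]
    Catalog(String),
}
```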

Gate: PASSED — All crates compile. Gateway runs. Logs emit. Docs committed.
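
A minimal sketch of the 0.3/0.4 gateway wiring, assuming axum 0.7 and tracing-subscriber; the bind address is illustrative (the tracker later fixes the gateway HTTP port at 3100).

```rust
// gateway (sketch): GET /health route plus tracing-subscriber init.
use axum::{routing::get, Router};

#[tokio::main]
async fn main() {
    // 0.4: structured logs on stdout.
    tracing_subscriber::fmt().init();

    // 0.3: GET /health -> 200 with a small body.
    let app = Router::new().route("/health", get(|| async { "ok" }));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3100").await.unwrap();
    tracing::info!("gateway listening on :3100");
    axum::serve(listener, app).await.unwrap();
}
```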

Phase 1: Storage + Catalog

  • 1.1 — storaged: object_store backend init (LocalFileSystem)
  • 1.2 — storaged: Axum endpoints (PUT/GET/DELETE/LIST /objects/{key})
  • 1.3 — shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting (sketch after this list)
  • 1.4 — catalogd/registry.rs: in-memory index + manifest persistence to object storage
  • 1.5 — catalogd/schema.rs: schema fingerprinting (merged into shared/arrow_helpers.rs)
  • 1.6 — catalogd service: POST/GET /datasets + GET /datasets/by-name/{name}
  • 1.7 — gateway routes to storaged + catalogd with shared state
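
A minimal sketch of the 1.3 helpers, assuming the arrow and parquet crates; the fingerprint scheme (hashing field names and types) and the function names are assumptions, not necessarily what shared/arrow_helpers.rs does.

```rust
// shared/arrow_helpers.rs (sketch): RecordBatch -> Parquet bytes, plus a schema fingerprint.
use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Serialize a batch into an in-memory Parquet file.
pub fn batch_to_parquet(batch: &RecordBatch) -> parquet::errors::Result<Vec<u8>> {
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    Ok(buf)
}

/// Fingerprint a schema by hashing its field names and data types.
pub fn schema_fingerprint(schema: &Schema) -> u64 {
    let mut hasher = DefaultHasher::new();
    for field in schema.fields() {
        field.name().hash(&mut hasher);
        format!("{:?}", field.data_type()).hash(&mut hasher);
    }
    hasher.finish()
}
```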

Gate: PASSED — PUT object → register dataset → list → get by name. All via gateway HTTP.
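
A sketch of the object path this gate exercises, assuming object_store 0.10 or newer (put takes a PutPayload); the data root and key are illustrative, and the real service exposes this through the Axum /objects routes from 1.2.

```rust
// storaged (sketch): LocalFileSystem-backed put/get round trip.
use bytes::Bytes;
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

async fn roundtrip() -> object_store::Result<()> {
    let store = LocalFileSystem::new_with_prefix("/tmp/lakehouse-data")?;

    // PUT /objects/{key}
    let key = Path::from("datasets/example.parquet");
    store.put(&key, Bytes::from_static(b"parquet bytes").into()).await?;

    // GET /objects/{key}
    let bytes = store.get(&key).await?.bytes().await?;
    assert_eq!(&bytes[..], b"parquet bytes");
    Ok(())
}
```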

Phase 2: Query Engine

  • 2.1 — queryd: SessionContext + object_store config (custom scheme to avoid path doubling)
  • 2.2 — queryd: ListingTable from catalog ObjectRefs with schema inference
  • 2.3 — queryd service: POST /query/sql → JSON (columns + rows + row_count)
  • 2.4 — queryd → catalogd wiring (reads dataset list, registers as tables)
  • 2.5 — gateway routes /query with QueryEngine state

Gate: PASSED — SELECT *, WHERE/ORDER BY, COUNT/AVG all return correct results via catalog.
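
A minimal sketch of the 2.2 to 2.4 flow with DataFusion; the table name and path are illustrative, and the real service registers its object_store under a custom URL scheme (2.1) and builds ListingTables from catalog ObjectRefs rather than a local directory.

```rust
// queryd (sketch): register a Parquet location and run SQL through a SessionContext.
use datafusion::prelude::{ParquetReadOptions, SessionContext};

async fn run_query() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // 2.2/2.4: in the service this location comes from catalogd's dataset list.
    ctx.register_parquet("trips", "/tmp/lakehouse-data/datasets/", ParquetReadOptions::default())
        .await?;

    // 2.3: the body of POST /query/sql ends up here.
    let df = ctx.sql("SELECT count(*) AS n FROM trips").await?;
    let batches = df.collect().await?;
    println!("rows: {}", batches.iter().map(|b| b.num_rows()).sum::<usize>());
    Ok(())
}
```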

Phase 3: AI Integration

  • 3.1 — Python sidecar: FastAPI + Ollama (embed/generate/rerank) — real models, no mocks
  • 3.2 — Dockerfile for sidecar
  • 3.3 — aibridge/client.rs: reqwest HTTP client with 120s timeout
  • 3.4 — aibridge service: Axum proxy endpoints (POST /ai/embed, /ai/generate, /ai/rerank)
  • 3.5 — Model config via env vars (EMBED_MODEL, GEN_MODEL, RERANK_MODEL, SIDECAR_URL)

Gate: PASSED — Gateway → aibridge → sidecar → Ollama → real 768d embeddings + generation.
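
A minimal sketch of the 3.3 client, assuming reqwest with the json feature; the /embed path, request/response shapes, and fallback model name are assumptions, while the 120s timeout and the EMBED_MODEL variable come from this tracker.

```rust
// aibridge/client.rs (sketch): call the Python sidecar over HTTP with a 120s timeout.
use serde::{Deserialize, Serialize};
use std::time::Duration;

#[derive(Serialize)]
struct EmbedRequest<'a> {
    model: &'a str,
    input: &'a str,
}

#[derive(Deserialize)]
struct EmbedResponse {
    embedding: Vec<f32>,
}

async fn embed(sidecar_url: &str, text: &str) -> Result<Vec<f32>, reqwest::Error> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(120)) // 3.3: generation can be slow
        .build()?;

    // 3.5: model selection via env var; the fallback name is a placeholder.
    let model = std::env::var("EMBED_MODEL").unwrap_or_else(|_| "nomic-embed-text".into());

    let resp: EmbedResponse = client
        .post(format!("{sidecar_url}/embed"))
        .json(&EmbedRequest { model: model.as_str(), input: text })
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    Ok(resp.embedding)
}
```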

Phase 4: Frontend

  • 4.1 — Dioxus scaffold, WASM build (dx build --platform web)
  • 4.2 — Dataset browser (sidebar, click to select, refresh)
  • 4.3 — Query editor + results table (Ctrl+Enter to run, column types, row count)
  • 4.4 — Error display + loading states
  • 4.5 — Nginx proxy (lakehouse.devop.live), same-origin API detection

Gate: PASSED — Browse datasets and query from browser at lakehouse.devop.live.
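
A sketch of how the 4.5 same-origin API detection might look in the WASM frontend; the logic and port fallback are assumptions, and it needs web-sys with the Window and Location features enabled.

```rust
// frontend (sketch): pick the API base URL at runtime.
fn api_base() -> String {
    let origin = web_sys::window()
        .and_then(|w| w.location().origin().ok())
        .unwrap_or_default();
    if origin.is_empty() || origin.contains("localhost") {
        // Local `dx serve`: talk to the gateway directly.
        "http://localhost:3100".to_string()
    } else {
        // Deployed behind the nginx proxy: same origin, no CORS.
        origin
    }
}
```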

Phase 5: Hardening

  • 5.1 — Proto definitions (lakehouse.proto: CatalogService, QueryService, StorageService, AiService)
  • 5.2 — Internal gRPC: CatalogService on :3101, proto crate with tonic codegen (build.rs sketch below)
  • 5.3 — OpenTelemetry tracing: stdout exporter, configurable via lakehouse.toml
  • 5.4 — Auth middleware: X-API-Key header check, toggleable via config
  • 5.5 — Config-driven startup: lakehouse.toml (gateway, storage, catalog, sidecar, ai, auth, observability)
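
A minimal sketch of the 5.1/5.2 codegen step; the proto path is an assumption.

```rust
// proto crate build.rs (sketch): generate Rust types and service stubs from lakehouse.proto.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Emits CatalogService, QueryService, StorageService and AiService bindings
    // into OUT_DIR; requires protoc on the build host.
    tonic_build::compile_protos("proto/lakehouse.proto")?;
    Ok(())
}
```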

Gate: PASSED — gRPC on :3101, OTel traces, auth ready, system starts from repo + lakehouse.toml.
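
A minimal sketch of the 5.5 config loading, assuming serde and the toml crate; only the gateway and auth sections are shown, and the field names and defaults are illustrative.

```rust
// shared config (sketch): typed sections deserialized from lakehouse.toml, with defaults.
use serde::Deserialize;

#[derive(Deserialize, Debug)]
pub struct Config {
    #[serde(default)]
    pub gateway: GatewayConfig,
    #[serde(default)]
    pub auth: AuthConfig,
}

#[derive(Deserialize, Debug)]
#[serde(default)]
pub struct GatewayConfig {
    pub http_port: u16,
    pub grpc_port: u16,
}

impl Default for GatewayConfig {
    fn default() -> Self {
        Self { http_port: 3100, grpc_port: 3101 }
    }
}

#[derive(Deserialize, Debug, Default)]
#[serde(default)]
pub struct AuthConfig {
    /// 5.4: toggles the X-API-Key middleware.
    pub enabled: bool,
    pub api_key: Option<String>,
}

pub fn load(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&raw)?)
}
```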