lakehouse/docs/DECISIONS.md
root 76f6fba5de Phase B: Lance pilot — hybrid decision with measured benchmark
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions.

Results (see docs/ADR-019-vector-storage.md for full scorecard):

  Cold load:        Parquet 0.17s   vs Lance 0.13s   (tie — not ≥2× threshold)
  Disk size:        330.3 MB        vs 330.4 MB      (tie)
  Search p50:       873us           vs 2229us        (Parquet 2.55× faster)
  Search p95:       1413us          vs 4998us        (Parquet 3.54× faster)
  Index build:      230s (ec=80)    vs 16s (IVF_PQ)  (Lance 14× faster)
  Random access:    35ms (scan)     vs 311us         (Lance 112× faster)
  Append 10K rows:  full rewrite    vs 0.08s/+31MB   (Lance structural win)

Decision (ADR-019): hybrid, not migrate-or-reject.

- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
  2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
  architecturally: random row access (RAG text fetch), append-heavy
  pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
  builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — 5M ceiling now says "switch to Lance"
  instead of "migrate" since Lance runs alongside, not instead of

Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57 incompatible with main stack
DataFusion 47 + Arrow 55). Kept off the critical path until API is
stable enough to promote into vectord::lance_store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 02:37:11 -05:00

97 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Architecture Decision Records
## ADR-001: Object storage as source of truth
**Date:** 2026-03-27
**Decision:** All data lives in S3-compatible object storage. No traditional database.
**Rationale:** Eliminates DB operational overhead, enables infinite scale at storage tier, forces clean separation of data and metadata.
## ADR-002: Catalog metadata persistence
**Date:** 2026-03-27
**Decision:** catalogd persists manifests as Parquet files in object storage. In-memory index rebuilt on startup.
**Rationale:** No external DB dependency. Storage is already the source of truth. Write-ahead pattern ensures consistency.
## ADR-003: Real models only (no mocks)
**Date:** 2026-03-27
**Decision:** AI sidecar hits Ollama with real models from Phase 3 onward. No stub/mock endpoints.
**Rationale:** Local Ollama instance available with nomic-embed-text, qwen2.5, mistral, gemma2, llama3.2. Mocks hide integration bugs.
## ADR-004: Python sidecar as Ollama adapter
**Date:** 2026-03-27
**Decision:** Python FastAPI sidecar is a thin HTTP adapter over Ollama's API. No model loading in Python.
**Rationale:** Ollama handles model lifecycle, GPU scheduling, caching. Sidecar stays stateless and lightweight — no torch/transformers deps.
## ADR-005: HTTP-first, gRPC later
**Date:** 2026-03-27
**Decision:** All inter-service communication uses HTTP through Phase 4. gRPC migration in Phase 5.
**Rationale:** HTTP is simpler to debug, test, and iterate on. gRPC adds protobuf compilation and streaming complexity before APIs stabilize.
## ADR-006: TOML config over environment variables
**Date:** 2026-03-27
**Decision:** System configuration via lakehouse.toml file, with sane defaults for all values.
**Rationale:** Config files are versionable, self-documenting, and support structured data. Env vars remain available as overrides. System must be restartable from repo + config alone.
## ADR-007: Dual HTTP+gRPC on gateway
**Date:** 2026-03-27
**Decision:** Gateway serves HTTP on :3100 (external) and gRPC on :3101 (internal). Both run in the same process.
**Rationale:** Single binary simplifies deployment. HTTP stays for browser/curl access. gRPC provides typed contracts for service-to-service calls. No premature microservice split.
## ADR-008: Embeddings stored as Parquet, not a proprietary vector DB
**Date:** 2026-03-27
**Decision:** Vector embeddings stored as Parquet files (doc_id, chunk_text, vector columns). Vector index (HNSW) serialized as a sidecar file.
**Rationale:** Keeps all data in one portable format. No vendor lock-in to Pinecone/Weaviate/Qdrant. Vectors are queryable via DataFusion like any other data. Trade-off: brute-force search is fine up to ~100K vectors; HNSW needed beyond that.
## ADR-009: Incremental updates via delta files, not Delta Lake
**Date:** 2026-03-27
**Decision:** Updates append to delta Parquet files. Queries merge base + deltas at read time. Periodic compaction merges deltas into base. Single-writer model (no concurrent writers).
**Rationale:** Full ACID over Parquet (Delta Lake/Iceberg) is a multi-year project. Our use case is single-writer (one ingest pipeline) with read-heavy workloads. Merge-on-read with compaction is sufficient and dramatically simpler.
## ADR-010: Schema detection defaults to string
**Date:** 2026-03-27
**Decision:** Ingest pipeline infers column types from data. When ambiguous or mixed, defaults to String rather than failing.
**Rationale:** Legacy data is messy. A column with "123", "N/A", and "" is a string, not an integer. Downstream queries can CAST as needed. Better to ingest everything than reject on type errors.
## ADR-011: This is not a CRM replacement
**Date:** 2026-03-27
**Decision:** The lakehouse is the analytical layer BEHIND operational systems. It ingests exports, not live data. CRM/ATS stays for daily operations.
**Rationale:** Operational systems need single-record CRUD, permissions, UI workflows. The lakehouse answers cross-cutting questions that no single operational system can. They complement, not compete.
## ADR-012: Event journal — append-only mutation history
**Date:** 2026-03-27
**Decision:** Every data mutation (insert, update, delete) is appended to an immutable event journal. The journal stores: entity, field, old_value, new_value, actor, timestamp, source, workspace_id. Events are never modified or deleted.
**Rationale:** This is the single most important future-proofing decision. AI auditability ("why did the agent recommend this candidate?"), compliance ("who changed this PII field?"), and time-travel queries ("what did this record look like on March 1st?") all require mutation history. This is impossible to retrofit — once history is lost, it's gone. The cost to implement is low (append-only Parquet), the cost of NOT implementing grows every day.
## ADR-013: Rich metadata is a product, not a byproduct
**Date:** 2026-03-27
**Decision:** Every dataset in the catalog carries: owner, sensitivity classification, lineage (source_system → ingest → dataset), freshness SLA, description, and tags. Auto-detected where possible, required on manual ingest.
**Rationale:** Datasets without metadata become "mystery files" within months. As data volume grows and AI agents consume data, the metadata layer is what makes the platform discoverable, governable, and trustworthy. Legacy companies that skip this step end up with expensive data swamps instead of data platforms.
## ADR-014: Embedding versioning — model-proof vector layer
**Date:** 2026-03-27
**Decision:** Every vector index tracks: model_name, model_version, dimensions, created_at. Multiple index versions for the same data coexist. Re-embedding on model upgrade is incremental (only new/changed docs).
**Rationale:** Embedding models improve rapidly. nomic-embed-text today, something better in 6 months. Without version tracking, upgrading means re-embedding the entire corpus. With versioning, you can A/B test new models, migrate incrementally, and maintain backward compatibility.
## ADR-015: Tool registry before raw SQL for agents
**Date:** 2026-03-27
**Decision:** AI agents interact with the system through named, governed tools (search_candidates, update_phone, create_placement) rather than raw SQL. Tools have parameter validation, permission checks, audit logging, and rate limits.
**Rationale:** In 3 years, most data access will be by AI agents, not humans typing SQL. Giving agents raw SQL access is ungovernable — you can't audit, permission, or rate-limit individual operations. Named tools with contracts are the interface that scales. Building the governed interface first prevents the technical debt of retrofitting controls onto raw access.
## ADR-016: Agent workspaces as first-class concept
**Date:** 2026-03-27
**Decision:** Each contract/search gets a named workspace with saved queries, shortlists, activity logs, and delta layers. Workspaces have daily/weekly/monthly tiers and support instant zero-copy handoff between agents.
**Rationale:** Staffing workflows are inherently agent-centric — a recruiter works a contract, builds context, then may need to hand it off. The workspace captures that context in a structured, queryable, transferable format. Without it, handoff means "read the email thread and figure it out."
## ADR-017: Federated multi-bucket storage
**Date:** 2026-04-16
**Decision:** Every `ObjectRef` belongs to exactly one named bucket. Three bucket classes: `primary` (system default, always present), `profile:{user}` (per-user/per-model workspace bucket), and named tenant buckets (`client_a`, `client_eu`). A single shared `rescue_bucket` handles read fallback on target failure. Writes hard-fail on unreachable target; no silent fallback. Every bucket op failure lands in an error journal at `primary://_errors/bucket_errors/`, queryable via `/storage/errors`. Credentials never live in `lakehouse.toml` — a pluggable `SecretsProvider` trait resolves opaque `secret_ref` handles.
**Rationale:** The single-backend assumption breaks when we want: tenant data isolation, data residency (EU bucket), per-profile workspaces for local models. DataFusion already supports multiple registered object stores; catalogd is the cross-bucket metadata authority. Rescue bucket + visible error journal makes operational failures diagnosable in one HTTP call. See `docs/ADR-017-federation.md` for full design + success gates.
## ADR-018: Write-once batched append pattern for journals
**Date:** 2026-04-16
**Decision:** All append-only journals (error journal, HNSW trial journal, future audit logs) use the `storaged::append_log::AppendLog` helper. Events accumulate in an in-memory buffer; on threshold or explicit `flush()`, the buffer is written as one new timestamped file (`batch_{epoch_us}.jsonl`). Existing files are never rewritten. `compact()` merges all batches into one with a fresh timestamp, preserving chronological sort order.
**Rationale:** Object stores have no append primitive. Naive "read-modify-write the whole JSONL file on every event" is O(N²) cumulative work and creates the classic small-file / rewrite-amplification anti-pattern that llms3.com flags as the top lakehouse pitfall. Write-once batching is the LSM-tree idea applied to small JSONL events — bounded write amplification, append-only semantics, optional compaction for read efficiency. The in-memory ring buffer preserves O(1) recent-event reads for the `/storage/errors` and `/hnsw/trials` query endpoints.
## ADR-019: Vector storage — Parquet+HNSW primary, Lance secondary (hybrid)
**Date:** 2026-04-16
**Decision:** Keep Parquet + binary-blob vectors + in-RAM HNSW as the primary vector backend. Add Lance as a second backend available per-profile for workloads where Lance wins architecturally. Per-profile `vector_backend: Parquet | Lance` field becomes part of Phase 17 model profiles. Implementation kicks off via the standalone `crates/lance-bench` crate and is promoted into `vectord::lance_store` when the API stabilizes.
**Rationale:** Head-to-head benchmark on the 100K × 768d `resumes_100k_v2` index (see `docs/ADR-019-vector-storage.md` for the full scorecard). Parquet+HNSW wins current-scale search latency by 2.55× (873us vs 2229us p50). Lance wins index build time by 14× (16s vs 230s), random row access by 112× (311us vs ~35ms full-file scan), and append speed structurally (0.08s vs full Parquet rewrite). Neither strictly dominates — the dual-use PRD framing (staffing + LLM brain) means both workloads exist in the same system. Keeps ADR-008's "Parquet is the format" principle intact for dataset tables; adds Lance as a purpose-built vector-tier option without discarding the tuned HNSW stack.