lakehouse/docs/DECISIONS.md

# Architecture Decision Records

## ADR-001: Object storage as source of truth
**Date:** 2026-03-27
**Decision:** All data lives in S3-compatible object storage. No traditional database.
**Rationale:** Eliminates DB operational overhead, enables effectively unbounded scale at the storage tier, and forces clean separation of data and metadata.

## ADR-002: Catalog metadata persistence
**Date:** 2026-03-27
**Decision:** catalogd persists manifests as Parquet files in object storage. In-memory index rebuilt on startup.
**Rationale:** No external DB dependency. Storage is already the source of truth. Write-ahead pattern ensures consistency.

## ADR-003: Real models only (no mocks)
**Date:** 2026-03-27
**Decision:** AI sidecar hits Ollama with real models from Phase 3 onward. No stub/mock endpoints.
**Rationale:** Local Ollama instance available with nomic-embed-text, qwen2.5, mistral, gemma2, llama3.2. Mocks hide integration bugs.

## ADR-004: Python sidecar as Ollama adapter
**Date:** 2026-03-27
**Decision:** Python FastAPI sidecar is a thin HTTP adapter over Ollama's API. No model loading in Python.
**Rationale:** Ollama handles model lifecycle, GPU scheduling, caching. Sidecar stays stateless and lightweight — no torch/transformers deps.

## ADR-005: HTTP-first, gRPC later
**Date:** 2026-03-27
**Decision:** All inter-service communication uses HTTP through Phase 4. gRPC migration in Phase 5.
**Rationale:** HTTP is simpler to debug, test, and iterate on. gRPC adds protobuf compilation and streaming complexity before APIs stabilize.

## ADR-006: TOML config over environment variables
**Date:** 2026-03-27
**Decision:** System configuration via lakehouse.toml file, with sane defaults for all values.
**Rationale:** Config files are versionable, self-documenting, and support structured data. Env vars remain available as overrides. System must be restartable from repo + config alone.

## ADR-007: Dual HTTP+gRPC on gateway
**Date:** 2026-03-27
**Decision:** Gateway serves HTTP on :3100 (external) and gRPC on :3101 (internal). Both run in the same process.
**Rationale:** Single binary simplifies deployment. HTTP stays for browser/curl access. gRPC provides typed contracts for service-to-service calls. No premature microservice split.

## ADR-008: Embeddings stored as Parquet, not a proprietary vector DB
**Date:** 2026-03-27
**Decision:** Vector embeddings stored as Parquet files (doc_id, chunk_text, vector columns). Vector index (HNSW) serialized as a sidecar file.
**Rationale:** Keeps all data in one portable format. No vendor lock-in to Pinecone/Weaviate/Qdrant. Vectors are queryable via DataFusion like any other data. Trade-off: brute-force search is fine up to ~100K vectors; HNSW needed beyond that.
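
A sketch of the brute-force path mentioned in the trade-off, assuming rows have already been read from Parquet into dicts with the `doc_id`, `chunk_text`, and `vector` columns. Cosine similarity is an assumed metric; the record does not name one.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_search(rows, query, k=3):
    """Score every row against the query and return the top-k doc_ids."""
    scored = [(cosine(row["vector"], query), row["doc_id"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

Each query costs one pass over every vector, which is why the linear scan stops being acceptable as the corpus grows and an HNSW sidecar takes over.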

## ADR-009: Incremental updates via delta files, not Delta Lake
**Date:** 2026-03-27
**Decision:** Updates append to delta Parquet files. Queries merge base + deltas at read time. Periodic compaction merges deltas into base. Single-writer model (no concurrent writers).
**Rationale:** Full ACID over Parquet (Delta Lake/Iceberg) is a multi-year project. Our use case is single-writer (one ingest pipeline) with read-heavy workloads. Merge-on-read with compaction is sufficient and dramatically simpler.
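
Merge-on-read with compaction can be sketched as follows. The delta row layout (`op`, `id`, `row`) is hypothetical, and plain lists stand in for Parquet files; the point is only the ordering rule that later deltas win.

```python
def merge_on_read(base, deltas):
    """Apply delta batches, oldest first, on top of the base rows.

    base:   list of row dicts, each with an 'id' key
    deltas: list of batches; each change is {'op': 'upsert', 'id', 'row'}
            or {'op': 'delete', 'id'}  (assumed layout)
    """
    merged = {row["id"]: row for row in base}
    for batch in deltas:
        for change in batch:
            if change["op"] == "delete":
                merged.pop(change["id"], None)
            else:
                merged[change["id"]] = change["row"]
    return sorted(merged.values(), key=lambda row: row["id"])


def compact(base, deltas):
    """Fold all deltas into a new base and return it with an empty delta list."""
    return merge_on_read(base, deltas), []
```

With a single writer, the batch order is unambiguous, so no conflict resolution beyond "last write wins" is needed.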

## ADR-010: Schema detection defaults to string
**Date:** 2026-03-27
**Decision:** Ingest pipeline infers column types from data. When ambiguous or mixed, defaults to String rather than failing.
**Rationale:** Legacy data is messy. A column with "123", "N/A", and "" is a string, not an integer. Downstream queries can CAST as needed. Better to ingest everything than reject on type errors.
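
The fallback rule can be illustrated with a small inference function. The type names (`Int64`, `Float64`, `String`) and the try-narrowest-first order are assumptions for this sketch, not the ingest pipeline's actual logic.

```python
def infer_column_type(values):
    """Return 'Int64', 'Float64', or 'String' for a column of raw string cells."""

    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "String"  # nothing to infer from: take the safe default
    if all_parse(int):
        return "Int64"
    if all_parse(float):
        return "Float64"
    return "String"  # mixed or ambiguous: ingest as text instead of failing
```

For the column from the rationale, `infer_column_type(["123", "N/A", ""])` returns `"String"`, while `["1", "2"]` yields `"Int64"` and `["1", "2.5"]` yields `"Float64"`.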

## ADR-011: This is not a CRM replacement
**Date:** 2026-03-27
**Decision:** The lakehouse is the analytical layer BEHIND operational systems. It ingests exports, not live data. CRM/ATS stays for daily operations.
**Rationale:** Operational systems need single-record CRUD, permissions, UI workflows. The lakehouse answers cross-cutting questions that no single operational system can. The two complement each other; they do not compete.

## ADR-012: Event journal — append-only mutation history
**Date:** 2026-03-27
**Decision:** Every data mutation (insert, update, delete) is appended to an immutable event journal. The journal stores: entity, field, old_value, new_value, actor, timestamp, source, workspace_id. Events are never modified or deleted.
**Rationale:** This is the single most important future-proofing decision. AI auditability ("why did the agent recommend this candidate?"), compliance ("who changed this PII field?"), and time-travel queries ("what did this record look like on March 1st?") all require mutation history. This is impossible to retrofit: once history is lost, it's gone. The cost to implement is low (append-only Parquet); the cost of NOT implementing grows every day.
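
To make the time-travel case concrete, here is a sketch that replays journal events up to a point in time. The field names match the list in the decision; the `state_as_of` helper and the in-memory list standing in for the Parquet journal are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: an event is never modified after creation
class JournalEvent:
    entity: str
    field: str
    old_value: object
    new_value: object
    actor: str
    timestamp: datetime
    source: str
    workspace_id: str


def state_as_of(events, entity, as_of):
    """Rebuild one entity's field values by replaying events up to `as_of`."""
    state = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.entity == entity and event.timestamp <= as_of:
            state[event.field] = event.new_value
    return state
```

The same replay answers the audit question: filtering on `field` and reading `actor` shows who changed a value and when.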

## ADR-013: Rich metadata is a product, not a byproduct
**Date:** 2026-03-27
**Decision:** Every dataset in the catalog carries: owner, sensitivity classification, lineage (source_system → ingest → dataset), freshness SLA, description, and tags. Auto-detected where possible, required on manual ingest.
**Rationale:** Datasets without metadata become "mystery files" within months. As data volume grows and AI agents consume data, the metadata layer is what makes the platform discoverable, governable, and trustworthy. Legacy companies that skip this step end up with expensive data swamps instead of data platforms.

## ADR-014: Embedding versioning — model-proof vector layer
**Date:** 2026-03-27
**Decision:** Every vector index tracks: model_name, model_version, dimensions, created_at. Multiple index versions for the same data coexist. Re-embedding on model upgrade is incremental (only new/changed docs).
**Rationale:** Embedding models improve rapidly. nomic-embed-text today, something better in 6 months. Without version tracking, upgrading means re-embedding the entire corpus. With versioning, you can A/B test new models, migrate incrementally, and maintain backward compatibility.
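
A sketch of the bookkeeping that makes incremental re-embedding possible. The four tracked fields come from the decision; the per-document content hash and the `pending_docs` / `mark_embedded` helpers are hypothetical additions for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class IndexVersion:
    model_name: str
    model_version: str
    dimensions: int
    created_at: datetime
    # Assumed bookkeeping: doc_id -> hash of the text that was embedded.
    embedded: dict = field(default_factory=dict)


def pending_docs(docs, index):
    """Return doc_ids that are new or changed relative to this index version."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if index.embedded.get(doc_id) != content_hash(text)
    )


def mark_embedded(index, doc_id, text):
    index.embedded[doc_id] = content_hash(text)
```

A freshly created index version starts with everything pending, but because the older version coexists and keeps serving queries, the backfill can proceed in the background and later runs only touch new or changed documents.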

## ADR-015: Tool registry before raw SQL for agents
**Date:** 2026-03-27
**Decision:** AI agents interact with the system through named, governed tools (search_candidates, update_phone, create_placement) rather than raw SQL. Tools have parameter validation, permission checks, audit logging, and rate limits.
**Rationale:** We expect that within 3 years most data access will come from AI agents, not humans typing SQL. Giving agents raw SQL access is ungovernable: you can't audit, apply permissions to, or rate-limit individual operations. Named tools with contracts are the interface that scales. Building the governed interface first prevents the technical debt of retrofitting controls onto raw access.
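
A minimal sketch of a governed tool call, covering parameter validation, a permission check, and audit logging (rate limiting is omitted for brevity). The `ToolRegistry` class, the role strings, and the handler signature are all hypothetical; only the tool name `update_phone` comes from the decision.

```python
class ToolError(Exception):
    """Raised when a call is rejected before reaching the handler."""


class ToolRegistry:
    def __init__(self):
        self._tools = {}
        self.audit_log = []

    def register(self, name, handler, required_params, required_role):
        self._tools[name] = (handler, set(required_params), required_role)

    def call(self, name, agent, roles, **params):
        if name not in self._tools:
            raise ToolError(f"unknown tool: {name}")
        handler, required, role = self._tools[name]
        if role not in roles:
            raise ToolError(f"{agent} lacks role {role!r} for {name}")
        missing = required - set(params)
        if missing:
            raise ToolError(f"{name} missing params: {sorted(missing)}")
        result = handler(**params)
        # Only calls that passed every gate reach the audit log in this sketch.
        self.audit_log.append({"tool": name, "agent": agent, "params": params})
        return result
```

Each rejection happens per named operation, which is the granularity raw SQL cannot offer.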

## ADR-016: Agent workspaces as first-class concept
**Date:** 2026-03-27
**Decision:** Each contract/search gets a named workspace with saved queries, shortlists, activity logs, and delta layers. Workspaces have daily/weekly/monthly tiers and support instant zero-copy handoff between agents.
**Rationale:** Staffing workflows are inherently agent-centric — a recruiter works a contract, builds context, then may need to hand it off. The workspace captures that context in a structured, queryable, transferable format. Without it, handoff means "read the email thread and figure it out."

## ADR-017: Federated multi-bucket storage
**Date:** 2026-04-16
**Decision:** Every `ObjectRef` belongs to exactly one named bucket. Three bucket classes: `primary` (system default, always present), `profile:{user}` (per-user/per-model workspace bucket), and named tenant buckets (`client_a`, `client_eu`). A single shared `rescue_bucket` handles read fallback on target failure. Writes hard-fail on unreachable target; no silent fallback. Every bucket op failure lands in an error journal at `primary://_errors/bucket_errors/`, queryable via `/storage/errors`. Credentials never live in `lakehouse.toml` — a pluggable `SecretsProvider` trait resolves opaque `secret_ref` handles.
**Rationale:** The single-backend assumption breaks once we need tenant data isolation, data residency (EU bucket), or per-profile workspaces for local models. DataFusion already supports multiple registered object stores; catalogd is the cross-bucket metadata authority. Rescue bucket + visible error journal make operational failures diagnosable in one HTTP call. See `docs/ADR-017-federation.md` for full design + success gates.
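
The asymmetric failure policy (reads fall back, writes hard-fail, every failure is journaled) can be sketched like this. Dicts stand in for buckets and `None` simulates an unreachable one; the `FederatedStore` class is hypothetical, and how objects get into the rescue bucket is outside this sketch.

```python
class BucketUnreachable(Exception):
    pass


class FederatedStore:
    def __init__(self, buckets, rescue_name="rescue_bucket"):
        self.buckets = buckets          # name -> dict, or None if unreachable
        self.rescue_name = rescue_name
        self.error_journal = []         # stand-in for primary://_errors/bucket_errors/

    def _open(self, name):
        bucket = self.buckets.get(name)
        if bucket is None:
            raise BucketUnreachable(name)
        return bucket

    def read(self, bucket, key):
        try:
            return self._open(bucket)[key]
        except BucketUnreachable:
            self.error_journal.append({"op": "read", "bucket": bucket, "key": key})
            return self._open(self.rescue_name)[key]  # read fallback only

    def write(self, bucket, key, data):
        try:
            self._open(bucket)[key] = data
        except BucketUnreachable:
            self.error_journal.append({"op": "write", "bucket": bucket, "key": key})
            raise  # hard-fail: never redirect a write silently
```

Redirecting a failed write would leave the caller believing data landed in a bucket where it did not, which is the silent fallback the decision rules out.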

## ADR-018: Write-once batched append pattern for journals
**Date:** 2026-04-16
**Decision:** All append-only journals (error journal, HNSW trial journal, future audit logs) use the `storaged::append_log::AppendLog` helper. Events accumulate in an in-memory buffer; on threshold or explicit `flush()`, the buffer is written as one new timestamped file (`batch_{epoch_us}.jsonl`). Existing files are never rewritten. `compact()` merges all batches into one with a fresh timestamp, preserving chronological sort order.
**Rationale:** Object stores have no append primitive. Naive "read-modify-write the whole JSONL file on every event" is O(N²) cumulative work and creates the classic small-file / rewrite-amplification anti-pattern that llms3.com flags as the top lakehouse pitfall. Write-once batching is the LSM-tree idea applied to small JSONL events — bounded write amplification, append-only semantics, optional compaction for read efficiency. The in-memory ring buffer preserves O(1) recent-event reads for the `/storage/errors` and `/hnsw/trials` query endpoints.
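
The batching pattern can be sketched in a few lines. This is not the `storaged::append_log::AppendLog` implementation: a dict stands in for the object store, the injectable `clock_us` parameter is an assumption added for testability, and the recent-event ring buffer is left out.

```python
import json
import time


class AppendLog:
    def __init__(self, store, prefix, threshold=100, clock_us=None):
        self.store = store              # dict standing in for an object store
        self.prefix = prefix
        self.threshold = threshold
        self.buffer = []
        self.clock_us = clock_us or (lambda: time.time_ns() // 1_000)

    def _batch_keys(self):
        keys = [k for k in self.store if k.startswith(f"{self.prefix}/batch_")]
        # Order by the embedded timestamp numerically, not lexically.
        return sorted(keys, key=lambda k: int(k.rsplit("_", 1)[1].removesuffix(".jsonl")))

    def append(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        """Write the buffer as one new file; existing files are never touched."""
        if not self.buffer:
            return None
        key = f"{self.prefix}/batch_{self.clock_us()}.jsonl"
        self.store[key] = "".join(json.dumps(e) + "\n" for e in self.buffer)
        self.buffer = []
        return key

    def compact(self):
        """Merge all batches, oldest first, into one freshly timestamped file."""
        self.flush()
        old_keys = self._batch_keys()
        if len(old_keys) < 2:
            return
        merged = "".join(self.store[k] for k in old_keys)
        new_key = f"{self.prefix}/batch_{self.clock_us()}.jsonl"
        self.store[new_key] = merged
        for key in old_keys:
            if key != new_key:
                del self.store[key]

    def read_all(self):
        events = []
        for key in self._batch_keys():
            events.extend(json.loads(line) for line in self.store[key].splitlines())
        return events
```

Each event is written once at flush time and at most once more per compaction, instead of being rewritten on every subsequent append.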

## ADR-019: Vector storage — Parquet+HNSW primary, Lance secondary (hybrid)
**Date:** 2026-04-16
**Decision:** Keep Parquet + binary-blob vectors + in-RAM HNSW as the primary vector backend. Add Lance as a second backend available per-profile for workloads where Lance wins architecturally. Per-profile `vector_backend: Parquet | Lance` field becomes part of Phase 17 model profiles. Implementation kicks off via the standalone `crates/lance-bench` crate and is promoted into `vectord::lance_store` when the API stabilizes.
**Rationale:** Head-to-head benchmark on the 100K × 768d `resumes_100k_v2` index (see `docs/ADR-019-vector-storage.md` for the full scorecard). Parquet+HNSW wins current-scale search latency by 2.55× (873 µs vs 2229 µs p50). Lance wins index build time by 14× (16 s vs 230 s), random row access by 112× (311 µs vs ~35 ms full-file scan), and append speed structurally (0.08 s vs full Parquet rewrite). Neither strictly dominates: the dual-use PRD framing (staffing + LLM brain) means both workloads exist in the same system. Keeps ADR-008's "Parquet is the format" principle intact for dataset tables; adds Lance as a purpose-built vector-tier option without discarding the tuned HNSW stack.

## ADR-020: `register()` is idempotent by name with a schema-fingerprint gate
**Date:** 2026-04-19
**Decision:** `catalogd::Registry::register(name, fingerprint, objects)` is idempotent on `name`. If no manifest for `name` exists, create one. If one exists with the same `schema_fingerprint`, reuse its `DatasetId`, replace `objects`, bump `updated_at`, and write through. If one exists with a different `schema_fingerprint`, reject with `409 Conflict` (HTTP) / `FAILED_PRECONDITION` (gRPC). A one-shot operator endpoint `POST /catalog/dedupe` collapses any pre-existing duplicates (preferring the manifest with a non-null `row_count`, then the most recently updated).
**Rationale:** Registry was keyed by surrogate `DatasetId` with no uniqueness constraint on `name`, so every caller that re-registered (re-ingest, external cron, gRPC retry) silently created a parallel manifest pointing at the same parquet — accumulating 308× `successful_playbooks` in live state before detection. The fingerprint gate turns re-ingest into an explicit no-op (matching PRD invariant #5 "ingestd is idempotent — re-ingesting the same file is a no-op") while forcing schema drift to be visible instead of silently clobbering. 409 status separates policy rejections from server errors, which matters for the Phase 12 tool-consumer ecosystem. Concurrency: the write lock is held across the storage write to close the check→insert TOCTOU window; serializing registers is acceptable because registers-per-second is low on the ingest path. Audit: idempotent-register events are visible as bumps to the stored manifest's `updated_at` field and in `catalogd` tracing output (tracing is non-durable, operator view only); `DedupeReport` is the return-value audit for cleanup runs. No event-journal entries are emitted — ADR-012 scopes the journal to row-level mutations, not catalog-manifest operations.
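
The three-way branch in the decision can be sketched as follows. This is an in-memory illustration rather than the `catalogd::Registry` code: the manifest is a plain dict, the storage write-through is reduced to a comment, and `SchemaConflict` stands in for the `409 Conflict` / `FAILED_PRECONDITION` mapping.

```python
import threading
import uuid
from datetime import datetime, timezone


class SchemaConflict(Exception):
    """Would surface as 409 Conflict (HTTP) or FAILED_PRECONDITION (gRPC)."""


class Registry:
    def __init__(self):
        self._lock = threading.Lock()
        self._by_name = {}

    def register(self, name, fingerprint, objects):
        # One lock spans check and insert, closing the TOCTOU window.
        with self._lock:
            now = datetime.now(timezone.utc)
            existing = self._by_name.get(name)
            if existing is None:
                manifest = {
                    "dataset_id": str(uuid.uuid4()),
                    "name": name,
                    "schema_fingerprint": fingerprint,
                    "objects": list(objects),
                    "updated_at": now,
                }
                self._by_name[name] = manifest  # real code would write through here
                return manifest
            if existing["schema_fingerprint"] != fingerprint:
                raise SchemaConflict(f"schema drift on dataset {name!r}")
            existing["objects"] = list(objects)  # same schema: reuse the id
            existing["updated_at"] = now
            return existing

    def manifests(self):
        with self._lock:
            return list(self._by_name.values())
```

Re-registering with the same fingerprint returns the original `dataset_id`, so a retry or a repeated cron run can no longer create a parallel manifest.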