Phases 9-15 designed based on "future regret" analysis:

- Phase 9: Event journal (append-only mutation history, can't retrofit)
- Phase 10: Rich catalog v2 (ownership, sensitivity, lineage, freshness)
- Phase 11: Embedding versioning (model-proof vector layer)
- Phase 12: Tool registry (governed agent actions via MCP)
- Phase 13: Security & access control (field-level, row-level, audit)
- Phase 14: Schema evolution with AI migration rules
- Phase 15+: Federated query, DB connectors, OCR, fine-tuned models

8 design principles: store truth openly, describe richly, never destroy evidence, secure centrally, expose through tools, version everything, unstructured first-class, separate storage/compute/intelligence.

ADR-012 through ADR-016 documenting key future-proofing decisions.
Updated benchmarks: 2.47M rows, hot cache 9.8x speedup.
Updated operating rules: cheap-now/expensive-later built first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Architecture Decision Records
## ADR-001: Object storage as source of truth
**Date:** 2026-03-27

**Decision:** All data lives in S3-compatible object storage. No traditional database.

**Rationale:** Eliminates DB operational overhead, lets the storage tier scale without a practical ceiling, and forces clean separation of data and metadata.
## ADR-002: Catalog metadata persistence
**Date:** 2026-03-27

**Decision:** catalogd persists manifests as Parquet files in object storage. In-memory index rebuilt on startup.

**Rationale:** No external DB dependency. Storage is already the source of truth. Write-ahead pattern ensures consistency.
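A minimal sketch of the persist-then-index pattern described above. To stay self-contained it uses a plain dict as a stand-in for the object store and JSON as a stand-in for Parquet; the `Catalog` class, its methods, and the `manifests/` key prefix are hypothetical names, not catalogd's actual API.

```python
import json


class Catalog:
    """Toy catalog: a dict stands in for the object store, JSON for Parquet."""

    def __init__(self, store: dict[str, bytes]) -> None:
        self.store = store
        self.index: dict[str, dict] = {}
        self.rebuild_index()

    def rebuild_index(self) -> None:
        # On startup the index is derived purely from persisted manifests.
        self.index = {
            key.removeprefix("manifests/").removesuffix(".json"): json.loads(blob)
            for key, blob in self.store.items()
            if key.startswith("manifests/")
        }

    def register(self, name: str, manifest: dict) -> None:
        # Write-ahead: persist to storage first, then update memory.
        self.store[f"manifests/{name}.json"] = json.dumps(manifest).encode()
        self.index[name] = manifest


store: dict[str, bytes] = {}
Catalog(store).register("candidates", {"files": ["candidates/base.parquet"], "rows": 3})
restarted = Catalog(store)  # simulates a process restart against the same storage
```

Because the manifest is written before the in-memory index changes, a crash between the two steps loses nothing: the next startup rebuilds the index from storage.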
## ADR-003: Real models only (no mocks)
**Date:** 2026-03-27

**Decision:** AI sidecar hits Ollama with real models from Phase 3 onward. No stub/mock endpoints.

**Rationale:** Local Ollama instance available with nomic-embed-text, qwen2.5, mistral, gemma2, llama3.2. Mocks hide integration bugs.
## ADR-004: Python sidecar as Ollama adapter
**Date:** 2026-03-27

**Decision:** Python FastAPI sidecar is a thin HTTP adapter over Ollama's API. No model loading in Python.

**Rationale:** Ollama handles model lifecycle, GPU scheduling, caching. Sidecar stays stateless and lightweight, with no torch/transformers deps.
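To illustrate how thin the adapter can be, here is a sketch that only translates a call into an HTTP request for Ollama's `/api/embeddings` endpoint. It uses the standard library instead of FastAPI so it runs without dependencies; the function names and the `OLLAMA_URL` constant are assumptions, and a FastAPI route in the real sidecar would presumably wrap something like `embed`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; adjust as needed


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Translate a sidecar call into a request for Ollama's /api/embeddings."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Forward to Ollama and return the vector; needs a running Ollama instance."""
    with urllib.request.urlopen(build_embed_request(text, model), timeout=30) as resp:
        return json.load(resp)["embedding"]


req = build_embed_request("senior rust engineer")
```

Nothing here loads a model or holds state between calls, which is the property the decision is after.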
## ADR-005: HTTP-first, gRPC later
**Date:** 2026-03-27

**Decision:** All inter-service communication uses HTTP through Phase 4. gRPC migration in Phase 5.

**Rationale:** HTTP is simpler to debug, test, and iterate on. gRPC adds protobuf compilation and streaming complexity before APIs stabilize.
## ADR-006: TOML config over environment variables
**Date:** 2026-03-27

**Decision:** System configuration via lakehouse.toml file, with sane defaults for all values.

**Rationale:** Config files are versionable, self-documenting, and support structured data. Env vars remain available as overrides. System must be restartable from repo + config alone.
## ADR-007: Dual HTTP+gRPC on gateway
**Date:** 2026-03-27

**Decision:** Gateway serves HTTP on :3100 (external) and gRPC on :3101 (internal). Both run in the same process.

**Rationale:** Single binary simplifies deployment. HTTP stays for browser/curl access. gRPC provides typed contracts for service-to-service calls. No premature microservice split.
## ADR-008: Embeddings stored as Parquet, not a proprietary vector DB
**Date:** 2026-03-27

**Decision:** Vector embeddings stored as Parquet files (doc_id, chunk_text, vector columns). Vector index (HNSW) serialized as a sidecar file.

**Rationale:** Keeps all data in one portable format. No vendor lock-in to Pinecone/Weaviate/Qdrant. Vectors are queryable via DataFusion like any other data. Trade-off: brute-force search is fine up to ~100K vectors; HNSW needed beyond that.
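The brute-force side of that trade-off is simple enough to sketch directly. The rows below mimic the (doc_id, chunk_text, vector) columns as in-memory dicts rather than Parquet, and the function names and sample vectors are illustrative only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_search(rows: list[dict], query: list[float], k: int = 2) -> list[tuple[str, float]]:
    """Score every row against the query and return the top-k (doc_id, score) pairs."""
    scored = [(row["doc_id"], cosine(row["vector"], query)) for row in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


rows = [
    {"doc_id": "a", "chunk_text": "rust backend engineer", "vector": [1.0, 0.0, 0.0]},
    {"doc_id": "b", "chunk_text": "forklift operator", "vector": [0.0, 1.0, 0.0]},
    {"doc_id": "c", "chunk_text": "systems programmer", "vector": [0.9, 0.1, 0.0]},
]
hits = brute_force_search(rows, query=[1.0, 0.0, 0.0])
```

Every query touches every vector, so cost grows linearly with corpus size; that linear scan is what an HNSW index replaces once the corpus outgrows it.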
## ADR-009: Incremental updates via delta files, not Delta Lake
**Date:** 2026-03-27

**Decision:** Updates append to delta Parquet files. Queries merge base + deltas at read time. Periodic compaction merges deltas into base. Single-writer model (no concurrent writers).

**Rationale:** Full ACID over Parquet (Delta Lake/Iceberg) is a multi-year project. Our use case is single-writer (one ingest pipeline) with read-heavy workloads. Merge-on-read with compaction is sufficient and dramatically simpler.
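Merge-on-read with compaction can be sketched over plain lists of dicts standing in for the base and delta Parquet files. The `id` key column and the `_deleted` tombstone flag are assumptions made for the example; the source does not say how deletes are encoded.

```python
def merge_on_read(base: list[dict], deltas: list[list[dict]], key: str = "id") -> list[dict]:
    """Apply delta batches over the base in write order; the last write wins."""
    merged = {row[key]: row for row in base}
    for batch in deltas:  # oldest delta first
        for row in batch:
            if row.get("_deleted"):  # hypothetical tombstone marker
                merged.pop(row[key], None)
            else:
                merged[row[key]] = row
    return list(merged.values())


def compact(base: list[dict], deltas: list[list[dict]], key: str = "id") -> tuple[list[dict], list]:
    """Fold all deltas into a new base and return it with an empty delta list."""
    return merge_on_read(base, deltas, key), []


base = [{"id": 1, "phone": "555-0100"}, {"id": 2, "phone": "555-0101"}]
deltas = [
    [{"id": 2, "phone": "555-0199"}],
    [{"id": 1, "_deleted": True}, {"id": 3, "phone": "555-0102"}],
]
view = merge_on_read(base, deltas)
new_base, new_deltas = compact(base, deltas)
```

With a single writer, delta order is simply write order, so no conflict resolution is needed beyond "last write wins".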
## ADR-010: Schema detection defaults to string
**Date:** 2026-03-27

**Decision:** Ingest pipeline infers column types from data. When ambiguous or mixed, defaults to String rather than failing.

**Rationale:** Legacy data is messy. A column with "123", "N/A", and "" is a string, not an integer. Downstream queries can CAST as needed. Better to ingest everything than reject on type errors.
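A minimal sketch of the fallback rule, assuming a three-type ladder (integer, float, string) and type names chosen for illustration. The real pipeline may recognise more types and may be stricter than Python's built-in parsers about what counts as a number.

```python
def infer_column_type(values: list[str]) -> str:
    """Return 'Int64', 'Float64', or 'String' for a column of raw text values."""

    def fits(cast) -> bool:
        # True only if every value in the column parses with `cast`.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "String"  # nothing to infer from
    if fits(int):
        return "Int64"
    if fits(float):
        return "Float64"
    return "String"  # ambiguous or mixed: keep everything rather than reject


mixed = infer_column_type(["123", "N/A", ""])
clean = infer_column_type(["1", "2", "3"])
```

The column from the rationale ("123", "N/A", "") lands on String because a single unparseable value is enough to rule out the numeric types.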
## ADR-011: This is not a CRM replacement
**Date:** 2026-03-27

**Decision:** The lakehouse is the analytical layer BEHIND operational systems. It ingests exports, not live data. CRM/ATS stays for daily operations.

**Rationale:** Operational systems need single-record CRUD, permissions, UI workflows. The lakehouse answers cross-cutting questions that no single operational system can. The two complement each other; they do not compete.
## ADR-012: Event journal — append-only mutation history
**Date:** 2026-03-27

**Decision:** Every data mutation (insert, update, delete) is appended to an immutable event journal. The journal stores: entity, field, old_value, new_value, actor, timestamp, source, workspace_id. Events are never modified or deleted.

**Rationale:** This is the single most important future-proofing decision. AI auditability ("why did the agent recommend this candidate?"), compliance ("who changed this PII field?"), and time-travel queries ("what did this record look like on March 1st?") all require mutation history. This is impossible to retrofit: once history is lost, it's gone. The cost to implement is low (append-only Parquet), while the cost of NOT implementing grows every day.
## ADR-013: Rich metadata is a product, not a byproduct
**Date:** 2026-03-27

**Decision:** Every dataset in the catalog carries: owner, sensitivity classification, lineage (source_system → ingest → dataset), freshness SLA, description, and tags. Auto-detected where possible, required on manual ingest.

**Rationale:** Datasets without metadata become "mystery files" within months. As data volume grows and AI agents consume data, the metadata layer is what makes the platform discoverable, governable, and trustworthy. Legacy companies that skip this step end up with expensive data swamps instead of data platforms.
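As a rough illustration of "required on manual ingest", the metadata fields can be modelled as a record with a validation step. The sensitivity levels, the hours-based SLA unit, and the specific rules are assumptions for the example, not the catalog's actual schema.

```python
from dataclasses import dataclass, field

SENSITIVITY_LEVELS = {"public", "internal", "confidential", "pii"}  # hypothetical scale


@dataclass
class DatasetMetadata:
    name: str
    owner: str
    sensitivity: str
    lineage: list[str]  # e.g. ["source_system", "ingest", "dataset"]
    freshness_sla_hours: int
    description: str
    tags: list[str] = field(default_factory=list)


def validate_manual_ingest(meta: DatasetMetadata) -> list[str]:
    """Return the list of problems; an empty list means the entry is acceptable."""
    problems = []
    if not meta.owner.strip():
        problems.append("owner is required")
    if meta.sensitivity not in SENSITIVITY_LEVELS:
        problems.append(f"unknown sensitivity: {meta.sensitivity!r}")
    if len(meta.lineage) < 2:
        problems.append("lineage needs at least a source and a dataset")
    if meta.freshness_sla_hours <= 0:
        problems.append("freshness SLA must be positive")
    if not meta.description.strip():
        problems.append("description is required")
    return problems


good = DatasetMetadata("candidates", "data-team", "pii",
                       ["ats_export", "ingest", "candidates"], 24,
                       "Candidate master records", ["staffing"])
bad = DatasetMetadata("mystery", "", "secret", [], 0, "")
```

Returning every problem at once, instead of failing on the first, lets a manual ingest form show the full list of missing fields in one pass.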
## ADR-014: Embedding versioning — model-proof vector layer
**Date:** 2026-03-27

**Decision:** Every vector index tracks: model_name, model_version, dimensions, created_at. Multiple index versions for the same data coexist. Re-embedding on model upgrade is incremental (only new/changed docs).

**Rationale:** Embedding models improve rapidly. nomic-embed-text today, something better in 6 months. Without version tracking, upgrading means re-embedding the entire corpus. With versioning, you can A/B test new models, migrate incrementally, and maintain backward compatibility.
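A sketch of how per-index version tracking could drive incremental work. Each index version records which text it has already embedded (by content hash), so it can report only the documents that are new or changed from its own point of view. The content-hash bookkeeping, the method names, the second model, and the version labels and dimensions are illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def text_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class IndexVersion:
    model_name: str
    model_version: str
    dimensions: int
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    embedded: dict[str, str] = field(default_factory=dict)  # doc_id -> hash of embedded text

    def stale_docs(self, docs: dict[str, str]) -> list[str]:
        """Doc ids that are new to this index or whose text changed since embedding."""
        return [d for d, text in docs.items() if self.embedded.get(d) != text_hash(text)]

    def mark_embedded(self, doc_id: str, text: str) -> None:
        self.embedded[doc_id] = text_hash(text)


docs = {"a": "rust engineer", "b": "forklift operator"}
current = IndexVersion("nomic-embed-text", "v1.5", 768)    # illustrative version label
for doc_id, text in docs.items():
    current.mark_embedded(doc_id, text)
candidate = IndexVersion("hypothetical-embed", "v2", 1024)  # made-up newer model

# Both versions coexist over the same corpus, keyed by (model_name, model_version).
indexes = {(i.model_name, i.model_version): i for i in (current, candidate)}

docs["b"] = "certified forklift operator"  # changed
docs["c"] = "systems programmer"           # new
```

The existing index only needs `b` and `c`, while the candidate index can be filled in the background and compared against the current one before any cutover.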
## ADR-015: Tool registry before raw SQL for agents
**Date:** 2026-03-27

**Decision:** AI agents interact with the system through named, governed tools (search_candidates, update_phone, create_placement) rather than raw SQL. Tools have parameter validation, permission checks, audit logging, and rate limits.

**Rationale:** We expect that within 3 years most data access will come from AI agents, not humans typing SQL. Giving agents raw SQL access is ungovernable: you can't audit, permission, or rate-limit individual operations. Named tools with contracts are the interface that scales. Building the governed interface first prevents the technical debt of retrofitting controls onto raw access.
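The four controls named in the decision (parameter validation, permission checks, audit logging, rate limits) can be sketched as a single gate that every tool call passes through. The class names, the role model, and the per-minute limit are hypothetical; only the tool name `update_phone` comes from the text.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    handler: Callable[..., object]
    params: dict[str, type]  # parameter name -> required type
    required_role: str
    max_calls_per_minute: int


class ToolRegistry:
    def __init__(self) -> None:
        self.tools: dict[str, Tool] = {}
        self.audit_log: list[dict] = []
        self._calls: dict[tuple[str, str], list[float]] = {}

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, agent: str, roles: set[str], name: str, /, **kwargs):
        tool = self.tools[name]
        if tool.required_role not in roles:
            return self._finish(agent, name, kwargs, "denied: missing role")
        if set(kwargs) != set(tool.params) or not all(
            isinstance(kwargs[p], t) for p, t in tool.params.items()
        ):
            return self._finish(agent, name, kwargs, "rejected: bad parameters")
        now = time.monotonic()
        recent = [t for t in self._calls.get((agent, name), []) if now - t < 60]
        if len(recent) >= tool.max_calls_per_minute:
            return self._finish(agent, name, kwargs, "throttled")
        self._calls[(agent, name)] = recent + [now]
        return self._finish(agent, name, kwargs, "ok", tool.handler(**kwargs))

    def _finish(self, agent, name, args, status, result=None):
        # Every attempt is logged, including the ones that were refused.
        self.audit_log.append({"agent": agent, "tool": name, "args": args, "status": status})
        return status, result


registry = ToolRegistry()
registry.register(Tool(
    "update_phone",
    lambda candidate_id, phone: f"{candidate_id} -> {phone}",
    {"candidate_id": int, "phone": str},
    required_role="recruiter",
    max_calls_per_minute=1,
))
first = registry.call("agent:7", {"recruiter"}, "update_phone", candidate_id=42, phone="555-0199")
second = registry.call("agent:7", {"recruiter"}, "update_phone", candidate_id=42, phone="555-0199")
denied = registry.call("agent:9", {"viewer"}, "update_phone", candidate_id=42, phone="555-0199")
```

Here the second call is throttled and the third is denied, yet all three appear in `audit_log`, which is the per-operation visibility raw SQL access cannot give.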
## ADR-016: Agent workspaces as first-class concept
**Date:** 2026-03-27

**Decision:** Each contract/search gets a named workspace with saved queries, shortlists, activity logs, and delta layers. Workspaces have daily/weekly/monthly tiers and support instant zero-copy handoff between agents.

**Rationale:** Staffing workflows are inherently agent-centric: a recruiter works a contract, builds context, then may need to hand it off. The workspace captures that context in a structured, queryable, transferable format. Without it, handoff means "read the email thread and figure it out."
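One possible reading of "zero-copy handoff" is that the workspace's contents stay where they are and only the owner reference changes. The sketch below models that interpretation; the `Workspace` fields are taken from the decision (delta layers omitted for brevity), while the `hand_off` method and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str
    owner: str
    tier: str  # "daily", "weekly", or "monthly"
    saved_queries: list[str] = field(default_factory=list)
    shortlist: list[int] = field(default_factory=list)
    activity_log: list[str] = field(default_factory=list)

    def hand_off(self, new_owner: str) -> None:
        """Reassign ownership in place; no saved state is copied or rebuilt."""
        self.activity_log.append(f"handoff: {self.owner} -> {new_owner}")
        self.owner = new_owner


ws = Workspace("acme-backend-contract", "recruiter:alice", "weekly")
ws.saved_queries.append("SELECT * FROM candidates WHERE skill = 'rust'")
ws.shortlist.extend([42, 57])
shortlist_before = ws.shortlist

ws.hand_off("recruiter:bob")
```

After the handoff the new owner sees the same queries, shortlist, and activity history, and the handoff itself is recorded in the log.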