lakehouse/docs/DECISIONS.md
root dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms → HNSW 873µs
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: ec/es = 20/30 gives 0.96
  recall@10 with an 8s build; 80/30 gives 1.00 with a 230s build

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00


# Architecture Decision Records
## ADR-001: Object storage as source of truth
**Date:** 2026-03-27
**Decision:** All data lives in S3-compatible object storage. No traditional database.
**Rationale:** Eliminates DB operational overhead, enables effectively unbounded scale at the storage tier, and forces a clean separation of data and metadata.
## ADR-002: Catalog metadata persistence
**Date:** 2026-03-27
**Decision:** catalogd persists manifests as Parquet files in object storage. In-memory index rebuilt on startup.
**Rationale:** No external DB dependency. Storage is already the source of truth. Write-ahead pattern ensures consistency.
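A minimal sketch of the pattern this implies, with an in-memory stand-in for object storage; the real catalogd persists manifests as Parquet and its types differ, so everything here (`ObjectStore`, `CatalogIndex`, the field layout) is illustrative only:

```rust
// Sketch: "persist manifest first, rebuild the index on startup".
// A HashMap stands in for the S3-compatible backend.
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct Manifest {
    dataset: String,
    row_count: u64,
    columns: Vec<String>,
}

/// Stand-in for the object store: path -> manifest record.
#[derive(Default)]
struct ObjectStore {
    objects: HashMap<String, Manifest>,
}

impl ObjectStore {
    // Write-ahead: the manifest is persisted before the in-memory index changes.
    fn put_manifest(&mut self, m: Manifest) {
        self.objects.insert(format!("_catalog/{}/manifest.parquet", m.dataset), m);
    }
    fn list_manifests(&self) -> impl Iterator<Item = &Manifest> {
        self.objects.values()
    }
}

/// In-memory index; disposable, because storage is the source of truth.
#[derive(Default)]
struct CatalogIndex {
    by_name: HashMap<String, Manifest>,
}

impl CatalogIndex {
    /// Startup path: rebuild the whole index from persisted manifests.
    fn rebuild(store: &ObjectStore) -> Self {
        let mut idx = CatalogIndex::default();
        for m in store.list_manifests() {
            idx.by_name.insert(m.dataset.clone(), m.clone());
        }
        idx
    }
}

fn main() {
    let mut store = ObjectStore::default();
    store.put_manifest(Manifest {
        dataset: "resumes".into(),
        row_count: 100_000,
        columns: vec!["doc_id".into(), "chunk_text".into(), "vector".into()],
    });
    // Simulated restart: the index is reconstructed purely from storage.
    let index = CatalogIndex::rebuild(&store);
    assert_eq!(index.by_name["resumes"].row_count, 100_000);
}
```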
## ADR-003: Real models only (no mocks)
**Date:** 2026-03-27
**Decision:** AI sidecar hits Ollama with real models from Phase 3 onward. No stub/mock endpoints.
**Rationale:** Local Ollama instance available with nomic-embed-text, qwen2.5, mistral, gemma2, llama3.2. Mocks hide integration bugs.
## ADR-004: Python sidecar as Ollama adapter
**Date:** 2026-03-27
**Decision:** Python FastAPI sidecar is a thin HTTP adapter over Ollama's API. No model loading in Python.
**Rationale:** Ollama handles model lifecycle, GPU scheduling, caching. Sidecar stays stateless and lightweight — no torch/transformers deps.
## ADR-005: HTTP-first, gRPC later
**Date:** 2026-03-27
**Decision:** All inter-service communication uses HTTP through Phase 4. gRPC migration in Phase 5.
**Rationale:** HTTP is simpler to debug, test, and iterate on. gRPC adds protobuf compilation and streaming complexity before APIs stabilize.
## ADR-006: TOML config over environment variables
**Date:** 2026-03-27
**Decision:** System configuration via lakehouse.toml file, with sane defaults for all values.
**Rationale:** Config files are versionable, self-documenting, and support structured data. Env vars remain available as overrides. System must be restartable from repo + config alone.
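A hedged sketch of the config-with-defaults pattern, assuming the `serde` and `toml` crates; the field names and the `LAKEHOUSE_HTTP_PORT` override variable are illustrative, not the real `lakehouse.toml` schema (ports 3100/3101 come from ADR-007):

```rust
// Sketch: every key has a sane default, env vars stay available as overrides.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(default)]
struct Config {
    gateway_http_port: u16,
    gateway_grpc_port: u16,
    object_store_endpoint: String, // illustrative default below
}

impl Default for Config {
    fn default() -> Self {
        Self {
            gateway_http_port: 3100,
            gateway_grpc_port: 3101,
            object_store_endpoint: "http://localhost:9000".into(),
        }
    }
}

fn load(toml_text: &str) -> Config {
    // Any key missing from the file falls back to Default::default().
    let mut cfg: Config = toml::from_str(toml_text).unwrap_or_default();
    // Hypothetical env override, applied after the file is read.
    if let Ok(port) = std::env::var("LAKEHOUSE_HTTP_PORT") {
        if let Ok(p) = port.parse() {
            cfg.gateway_http_port = p;
        }
    }
    cfg
}

fn main() {
    let cfg = load("gateway_http_port = 3100\n");
    println!("{cfg:?}");
}
```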
## ADR-007: Dual HTTP+gRPC on gateway
**Date:** 2026-03-27
**Decision:** Gateway serves HTTP on :3100 (external) and gRPC on :3101 (internal). Both run in the same process.
**Rationale:** Single binary simplifies deployment. HTTP stays for browser/curl access. gRPC provides typed contracts for service-to-service calls. No premature microservice split.
## ADR-008: Embeddings stored as Parquet, not a proprietary vector DB
**Date:** 2026-03-27
**Decision:** Vector embeddings stored as Parquet files (doc_id, chunk_text, vector columns). Vector index (HNSW) serialized as a sidecar file.
**Rationale:** Keeps all data in one portable format. No vendor lock-in to Pinecone/Weaviate/Qdrant. Vectors are queryable via DataFusion like any other data. Trade-off: brute-force search is fine up to ~100K vectors; HNSW needed beyond that.
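A sketch of what the embeddings table schema could look like, assuming a recent `arrow-rs` (where `DataType::FixedSizeList` takes an `Arc`'d field); column names follow the decision above, the 768 dimension matches nomic-embed-text, and the sidecar path in the comment is an illustrative naming choice:

```rust
// Sketch of the embeddings table schema: plain Parquet columns, nothing proprietary.
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};

fn embeddings_schema(dims: i32) -> Schema {
    Schema::new(vec![
        Field::new("doc_id", DataType::Utf8, false),
        Field::new("chunk_text", DataType::Utf8, false),
        // One fixed-size f32 list per row; queryable via DataFusion like any column.
        Field::new(
            "vector",
            DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, false)), dims),
            false,
        ),
    ])
}

fn main() {
    let schema = embeddings_schema(768);
    println!("{schema:#?}");
    // The HNSW index is serialized separately as a sidecar object next to the
    // Parquet files (e.g. <dataset>/index.hnsw) and can always be rebuilt from them.
}
```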
## ADR-009: Incremental updates via delta files, not Delta Lake
**Date:** 2026-03-27
**Decision:** Updates append to delta Parquet files. Queries merge base + deltas at read time. Periodic compaction merges deltas into base. Single-writer model (no concurrent writers).
**Rationale:** Full ACID over Parquet (Delta Lake/Iceberg) is a multi-year project. Our use case is single-writer (one ingest pipeline) with read-heavy workloads. Merge-on-read with compaction is sufficient and dramatically simpler.
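A minimal, std-only sketch of the merge-on-read semantics (last write wins per key, deltas applied in write order); the row shape and the absence of delete tombstones are simplifications, not the real ingest types:

```rust
// Sketch: base rows plus delta rows merged at read time; compaction rewrites the base.
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct Row {
    id: u64,
    phone: String,
    // In the real system this is a Parquet row; deletes would carry a tombstone flag.
}

fn merge_read(base: &[Row], deltas: &[Vec<Row>]) -> Vec<Row> {
    let mut merged: BTreeMap<u64, Row> = base.iter().cloned().map(|r| (r.id, r)).collect();
    // Deltas are applied in write order; later delta files override earlier ones.
    for delta in deltas {
        for row in delta {
            merged.insert(row.id, row.clone());
        }
    }
    merged.into_values().collect()
}

fn main() {
    let base = vec![Row { id: 1, phone: "555-0100".into() }];
    let deltas = vec![vec![Row { id: 1, phone: "555-0199".into() }]];
    // The read path sees the update without the base file ever being rewritten.
    let view = merge_read(&base, &deltas);
    assert_eq!(view[0].phone, "555-0199");
    // Periodic compaction: write `view` out as the new base, then drop the deltas.
}
```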
## ADR-010: Schema detection defaults to string
**Date:** 2026-03-27
**Decision:** Ingest pipeline infers column types from data. When ambiguous or mixed, defaults to String rather than failing.
**Rationale:** Legacy data is messy. A column with "123", "N/A", and "" is a string, not an integer. Downstream queries can CAST as needed. Better to ingest everything than reject on type errors.
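A small sketch of the fallback rule; the real pipeline recognizes more types and has explicit null handling, but the principle is the same:

```rust
// Sketch: try narrower types first, fall back to string instead of failing ingest.
#[derive(Debug, PartialEq)]
enum ColumnType {
    Int64,
    Float64,
    Utf8,
}

fn infer(values: &[&str]) -> ColumnType {
    if values.iter().all(|v| v.parse::<i64>().is_ok()) {
        ColumnType::Int64
    } else if values.iter().all(|v| v.parse::<f64>().is_ok()) {
        ColumnType::Float64
    } else {
        // Messy data ("N/A", "") never rejects the file; it lands as a string
        // and downstream queries CAST as needed.
        ColumnType::Utf8
    }
}

fn main() {
    assert_eq!(infer(&["123", "456"]), ColumnType::Int64);
    assert_eq!(infer(&["123", "N/A", ""]), ColumnType::Utf8);
}
```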
## ADR-011: This is not a CRM replacement
**Date:** 2026-03-27
**Decision:** The lakehouse is the analytical layer BEHIND operational systems. It ingests exports, not live data. CRM/ATS stays for daily operations.
**Rationale:** Operational systems need single-record CRUD, permissions, UI workflows. The lakehouse answers cross-cutting questions that no single operational system can. They complement, not compete.
## ADR-012: Event journal — append-only mutation history
**Date:** 2026-03-27
**Decision:** Every data mutation (insert, update, delete) is appended to an immutable event journal. The journal stores: entity, field, old_value, new_value, actor, timestamp, source, workspace_id. Events are never modified or deleted.
**Rationale:** This is the single most important future-proofing decision. AI auditability ("why did the agent recommend this candidate?"), compliance ("who changed this PII field?"), and time-travel queries ("what did this record look like on March 1st?") all require mutation history. This is impossible to retrofit — once history is lost, it's gone. The cost to implement is low (append-only Parquet), the cost of NOT implementing grows every day.
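A sketch of the event record with the fields named above; the types and example values are illustrative, and in practice events would be appended as Parquet batches using the pattern in ADR-018:

```rust
// Sketch of one journal entry; the journal only ever exposes append.
#[derive(Clone, Debug)]
struct MutationEvent {
    entity: String,              // e.g. "candidate:42"
    field: String,               // e.g. "phone"
    old_value: Option<String>,
    new_value: Option<String>,
    actor: String,               // human user or agent/tool id
    timestamp_us: i64,           // event time, microseconds since epoch
    source: String,              // ingest pipeline, API, agent tool, ...
    workspace_id: Option<String>,
}

struct EventJournal {
    events: Vec<MutationEvent>,
}

impl EventJournal {
    // Append-only discipline: no update or delete methods exist.
    fn append(&mut self, e: MutationEvent) {
        self.events.push(e);
    }
}

fn main() {
    let mut journal = EventJournal { events: Vec::new() };
    journal.append(MutationEvent {
        entity: "candidate:42".into(),
        field: "phone".into(),
        old_value: Some("555-0100".into()),
        new_value: Some("555-0199".into()),
        actor: "agent:shortlist-bot".into(),
        timestamp_us: 1_750_000_000_000_000,
        source: "tool:update_phone".into(),
        workspace_id: Some("contract-7".into()),
    });
    println!("{} event(s)", journal.events.len());
}
```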
## ADR-013: Rich metadata is a product, not a byproduct
**Date:** 2026-03-27
**Decision:** Every dataset in the catalog carries: owner, sensitivity classification, lineage (source_system → ingest → dataset), freshness SLA, description, and tags. Auto-detected where possible, required on manual ingest.
**Rationale:** Datasets without metadata become "mystery files" within months. As data volume grows and AI agents consume data, the metadata layer is what makes the platform discoverable, governable, and trustworthy. Legacy companies that skip this step end up with expensive data swamps instead of data platforms.
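A sketch of the per-dataset metadata record; the enum variants, SLA units, and example values are illustrative choices, not the catalogd schema:

```rust
// Sketch of the metadata every dataset carries in the catalog.
#[derive(Debug)]
enum Sensitivity {
    Public,
    Internal,
    Pii,
}

#[derive(Debug)]
struct DatasetMetadata {
    owner: String,
    sensitivity: Sensitivity,
    lineage: Vec<String>,     // source_system -> ingest job -> dataset
    freshness_sla_hours: u32, // how stale the dataset is allowed to get
    description: String,
    tags: Vec<String>,
}

fn main() {
    let meta = DatasetMetadata {
        owner: "data-eng".into(),
        sensitivity: Sensitivity::Pii,
        lineage: vec!["crm_export".into(), "ingestd:csv".into(), "candidates".into()],
        freshness_sla_hours: 24,
        description: "Candidate master records from the nightly CRM export".into(),
        tags: vec!["staffing".into(), "candidates".into()],
    };
    println!("{meta:#?}");
}
```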
## ADR-014: Embedding versioning — model-proof vector layer
**Date:** 2026-03-27
**Decision:** Every vector index tracks: model_name, model_version, dimensions, created_at. Multiple index versions for the same data coexist. Re-embedding on model upgrade is incremental (only new/changed docs).
**Rationale:** Embedding models improve rapidly. nomic-embed-text today, something better in 6 months. Without version tracking, upgrading means re-embedding the entire corpus. With versioning, you can A/B test new models, migrate incrementally, and maintain backward compatibility.
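A sketch of version tracking plus the incremental re-embed filter; the field names, per-doc content hash, and model version string are assumptions for illustration:

```rust
// Sketch: an index version remembers what it embedded, so only new or changed
// docs are re-embedded on a model upgrade.
use std::collections::HashMap;

#[derive(Debug)]
struct IndexVersion {
    model_name: String,
    model_version: String,
    dimensions: usize,
    created_at: String,
    // doc_id -> content hash embedded under this index version
    embedded: HashMap<String, u64>,
}

/// Only docs that are new or whose content changed need re-embedding.
fn docs_to_embed<'a>(
    index: &IndexVersion,
    corpus: &'a [(String, u64)], // (doc_id, current content hash)
) -> Vec<&'a String> {
    corpus
        .iter()
        .filter(|(id, hash)| index.embedded.get(id) != Some(hash))
        .map(|(id, _)| id)
        .collect()
}

fn main() {
    let mut embedded = HashMap::new();
    embedded.insert("doc-1".to_string(), 111);
    let v1 = IndexVersion {
        model_name: "nomic-embed-text".into(),
        model_version: "v1".into(),
        dimensions: 768,
        created_at: "2026-03-27".into(),
        embedded,
    };
    let corpus = vec![("doc-1".to_string(), 111), ("doc-2".to_string(), 222)];
    // doc-1 is unchanged, doc-2 is new: only doc-2 gets embedded.
    assert_eq!(docs_to_embed(&v1, &corpus), vec![&"doc-2".to_string()]);
}
```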
## ADR-015: Tool registry before raw SQL for agents
**Date:** 2026-03-27
**Decision:** AI agents interact with the system through named, governed tools (search_candidates, update_phone, create_placement) rather than raw SQL. Tools have parameter validation, permission checks, audit logging, and rate limits.
**Rationale:** In 3 years, most data access will be by AI agents, not humans typing SQL. Giving agents raw SQL access is ungovernable — you can't audit, permission, or rate-limit individual operations. Named tools with contracts are the interface that scales. Building the governed interface first prevents the technical debt of retrofitting controls onto raw access.
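A std-only sketch of the governed-tool shape; the trait, parameter representation, and audit hook are illustrative, not the real registry API:

```rust
// Sketch: agents call named tools through a registry, never raw SQL.
use std::collections::HashMap;

type Params = HashMap<String, String>;

trait Tool {
    fn name(&self) -> &'static str;
    fn validate(&self, params: &Params) -> Result<(), String>;
    fn execute(&self, params: &Params) -> Result<String, String>;
}

struct SearchCandidates;

impl Tool for SearchCandidates {
    fn name(&self) -> &'static str {
        "search_candidates"
    }
    fn validate(&self, params: &Params) -> Result<(), String> {
        params.contains_key("skill").then_some(()).ok_or("missing param: skill".into())
    }
    fn execute(&self, params: &Params) -> Result<String, String> {
        Ok(format!("candidates with skill {}", params["skill"]))
    }
}

/// The registry is where permission checks, rate limits, and audit logging attach.
struct ToolRegistry {
    tools: HashMap<&'static str, Box<dyn Tool>>,
}

impl ToolRegistry {
    fn call(&self, agent: &str, tool: &str, params: &Params) -> Result<String, String> {
        let t = self.tools.get(tool).ok_or("unknown tool")?;
        t.validate(params)?;
        let out = t.execute(params)?;
        // Audit: every call is itself an event (ADR-012), keyed by agent + tool + params.
        println!("audit: {agent} called {tool}");
        Ok(out)
    }
}

fn main() {
    let mut tools: HashMap<&'static str, Box<dyn Tool>> = HashMap::new();
    tools.insert("search_candidates", Box::new(SearchCandidates));
    let registry = ToolRegistry { tools };
    let mut p = Params::new();
    p.insert("skill".into(), "rust".into());
    println!("{:?}", registry.call("agent:recruiter-1", "search_candidates", &p));
}
```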
## ADR-016: Agent workspaces as first-class concept
**Date:** 2026-03-27
**Decision:** Each contract/search gets a named workspace with saved queries, shortlists, activity logs, and delta layers. Workspaces have daily/weekly/monthly tiers and support instant zero-copy handoff between agents.
**Rationale:** Staffing workflows are inherently agent-centric — a recruiter works a contract, builds context, then may need to hand it off. The workspace captures that context in a structured, queryable, transferable format. Without it, handoff means "read the email thread and figure it out."
## ADR-017: Federated multi-bucket storage
**Date:** 2026-04-16
**Decision:** Every `ObjectRef` belongs to exactly one named bucket. Three bucket classes: `primary` (system default, always present), `profile:{user}` (per-user/per-model workspace bucket), and named tenant buckets (`client_a`, `client_eu`). A single shared `rescue_bucket` handles read fallback on target failure. Writes hard-fail on unreachable target; no silent fallback. Every bucket op failure lands in an error journal at `primary://_errors/bucket_errors/`, queryable via `/storage/errors`. Credentials never live in `lakehouse.toml` — a pluggable `SecretsProvider` trait resolves opaque `secret_ref` handles.
**Rationale:** The single-backend assumption breaks when we want: tenant data isolation, data residency (EU bucket), per-profile workspaces for local models. DataFusion already supports multiple registered object stores; catalogd is the cross-bucket metadata authority. Rescue bucket + visible error journal makes operational failures diagnosable in one HTTP call. See `docs/ADR-017-federation.md` for full design + success gates.
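An illustrative sketch of the three rules (read fallback to rescue, hard-fail writes, opaque secret refs); the type shapes here are not the real `storaged::registry` / `shared::secrets` APIs:

```rust
// Sketch of the federation rules in ADR-017, with simplified types.
use std::collections::HashMap;

trait SecretsProvider {
    // Resolves an opaque handle (e.g. "s3_client_a") into credentials;
    // config files only ever store the handle, never the secret.
    fn resolve(&self, secret_ref: &str) -> Option<String>;
}

#[derive(Clone)]
struct Bucket {
    name: String,
    reachable: bool,
    secret_ref: String,
}

struct BucketRegistry {
    buckets: HashMap<String, Bucket>,
    rescue: Bucket,
}

impl BucketRegistry {
    /// Read path: target bucket, else the shared rescue bucket, flagged for observability.
    fn resolve_read(&self, target: &str) -> (Bucket, bool) {
        match self.buckets.get(target) {
            Some(b) if b.reachable => (b.clone(), false),
            _ => (self.rescue.clone(), true), // true -> X-Lakehouse-Rescue-Used header
        }
    }
    /// Write path: no silent fallback; an unreachable target is a hard error
    /// that would also be appended to the bucket error journal.
    fn resolve_write(&self, target: &str) -> Result<Bucket, String> {
        match self.buckets.get(target) {
            Some(b) if b.reachable => Ok(b.clone()),
            _ => Err(format!("bucket '{target}' unreachable; write refused")),
        }
    }
}

fn main() {
    let primary = Bucket { name: "primary".into(), reachable: true, secret_ref: "s3_primary".into() };
    let eu = Bucket { name: "client_eu".into(), reachable: false, secret_ref: "s3_client_eu".into() };
    let reg = BucketRegistry {
        buckets: HashMap::from([("primary".into(), primary.clone()), ("client_eu".into(), eu)]),
        rescue: primary, // using primary as rescue here is just for the example
    };
    let (_bucket, rescue_used) = reg.resolve_read("client_eu");
    assert!(rescue_used);
    assert!(reg.resolve_write("client_eu").is_err());
}
```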
## ADR-018: Write-once batched append pattern for journals
**Date:** 2026-04-16
**Decision:** All append-only journals (error journal, HNSW trial journal, future audit logs) use the `storaged::append_log::AppendLog` helper. Events accumulate in an in-memory buffer; on threshold or explicit `flush()`, the buffer is written as one new timestamped file (`batch_{epoch_us}.jsonl`). Existing files are never rewritten. `compact()` merges all batches into one with a fresh timestamp, preserving chronological sort order.
**Rationale:** Object stores have no append primitive. Naive "read-modify-write the whole JSONL file on every event" is O(N²) cumulative work and creates the classic small-file / rewrite-amplification anti-pattern that llms3.com flags as the top lakehouse pitfall. Write-once batching is the LSM-tree idea applied to small JSONL events — bounded write amplification, append-only semantics, optional compaction for read efficiency. The in-memory ring buffer preserves O(1) recent-event reads for the `/storage/errors` and `/hnsw/trials` query endpoints.
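A std-only sketch of the write-once batching idea; the buffer, file naming, and compaction mirror the description above but are not the real `storaged::append_log` API (a map stands in for the bucket):

```rust
// Sketch: buffer events in memory, flush the whole buffer as one new batch
// object, never rewrite old batches, compact on demand.
use std::collections::BTreeMap;

#[derive(Default)]
struct AppendLog {
    buffer: Vec<String>,                    // pending JSONL lines (a ring buffer in the real thing)
    objects: BTreeMap<String, Vec<String>>, // key -> lines; stand-in for the bucket prefix
    flush_threshold: usize,
    clock_us: u64,                          // monotonic stand-in for epoch microseconds
}

impl AppendLog {
    fn append(&mut self, line: String) {
        self.buffer.push(line);
        if self.buffer.len() >= self.flush_threshold {
            self.flush();
        }
    }

    /// Write-once: the whole buffer becomes one new `batch_{epoch_us}.jsonl` object.
    fn flush(&mut self) {
        if self.buffer.is_empty() {
            return;
        }
        self.clock_us += 1;
        let key = format!("batch_{:016}.jsonl", self.clock_us);
        self.objects.insert(key, std::mem::take(&mut self.buffer));
    }

    /// Merge all batches into one fresh object, preserving chronological order.
    fn compact(&mut self) {
        let merged: Vec<String> =
            std::mem::take(&mut self.objects).into_values().flatten().collect();
        self.clock_us += 1;
        self.objects.insert(format!("batch_{:016}.jsonl", self.clock_us), merged);
    }
}

fn main() {
    let mut log = AppendLog { flush_threshold: 2, ..Default::default() };
    for i in 0..5 {
        log.append(format!("{{\"event\":{i}}}"));
    }
    log.flush();  // three batch files: 2 + 2 + 1 events
    assert_eq!(log.objects.len(), 3);
    log.compact(); // one merged batch, order preserved
    assert_eq!(log.objects.len(), 1);
}
```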