lakehouse/docs/DECISIONS.md

Architecture Decision Records

ADR-001: Object storage as source of truth

Date: 2026-03-27

Decision: All data lives in S3-compatible object storage. No traditional database.

Rationale: Eliminates DB operational overhead, enables infinite scale at storage tier, forces clean separation of data and metadata.

ADR-002: Catalog metadata persistence

Date: 2026-03-27

Decision: catalogd persists manifests as Parquet files in object storage. In-memory index rebuilt on startup.

Rationale: No external DB dependency. Storage is already the source of truth. Write-ahead pattern ensures consistency.
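The write-then-rebuild shape can be sketched as follows. The real catalogd writes Parquet manifests to object storage; this local-disk JSON version, with hypothetical `persist_manifest`/`rebuild_index` helpers, only illustrates the two halves of the pattern: an atomic write so readers never see a torn manifest, and a startup scan that rebuilds the in-memory index from storage alone.

```python
import json
import os
import tempfile

def persist_manifest(manifest: dict, path: str) -> None:
    """Write-ahead style persist: write to a temp file, fsync,
    then atomically rename over the target path."""
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on POSIX

def rebuild_index(manifest_paths: list[str]) -> dict:
    """Startup path: scan persisted manifests and rebuild the
    in-memory table index; no external DB consulted."""
    index = {}
    for p in manifest_paths:
        with open(p) as f:
            m = json.load(f)
        index[m["table"]] = m
    return index
```

Because the rename is atomic, a crash mid-write leaves at worst an orphaned `.tmp` file, never a corrupt manifest.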

ADR-003: Real models only (no mocks)

Date: 2026-03-27

Decision: AI sidecar hits Ollama with real models from Phase 3 onward. No stub/mock endpoints.

Rationale: Local Ollama instance available with nomic-embed-text, qwen2.5, mistral, gemma2, llama3.2. Mocks hide integration bugs.

ADR-004: Python sidecar as Ollama adapter

Date: 2026-03-27

Decision: Python FastAPI sidecar is a thin HTTP adapter over Ollama's API. No model loading in Python.

Rationale: Ollama handles model lifecycle, GPU scheduling, caching. Sidecar stays stateless and lightweight — no torch/transformers deps.
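How thin "thin adapter" is can be sketched with stdlib-only code (the actual sidecar uses FastAPI; the function names here are hypothetical). All the adapter does is shape a JSON request for Ollama's `/api/embeddings` endpoint and unwrap the response — no model weights, no torch, no state:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port

def embed_request(model: str, text: str) -> urllib.request.Request:
    """Build the POST request for Ollama's /api/embeddings endpoint.
    Pure request shaping; no model code runs in the sidecar."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(model: str, text: str) -> list[float]:
    """Forward to Ollama and return the embedding vector."""
    with urllib.request.urlopen(embed_request(model, text)) as resp:
        return json.load(resp)["embedding"]
```

Because the sidecar holds no model state, it can be restarted or scaled horizontally without touching GPU memory.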

ADR-005: HTTP-first, gRPC later

Date: 2026-03-27

Decision: All inter-service communication uses HTTP through Phase 4. gRPC migration in Phase 5.

Rationale: HTTP is simpler to debug, test, and iterate on. gRPC adds protobuf compilation and streaming complexity before APIs stabilize.

ADR-006: TOML config over environment variables

Date: 2026-03-27

Decision: System configuration via lakehouse.toml file, with sane defaults for all values.

Rationale: Config files are versionable, self-documenting, and support structured data. Env vars remain available as overrides. System must be restartable from repo + config alone.

ADR-007: Dual HTTP+gRPC on gateway

Date: 2026-03-27

Decision: Gateway serves HTTP on :3100 (external) and gRPC on :3101 (internal). Both run in the same process.

Rationale: Single binary simplifies deployment. HTTP stays for browser/curl access. gRPC provides typed contracts for service-to-service calls. No premature microservice split.

ADR-008: Embeddings stored as Parquet, not a proprietary vector DB

Date: 2026-03-27

Decision: Vector embeddings stored as Parquet files (doc_id, chunk_text, vector columns). Vector index (HNSW) serialized as a sidecar file.

Rationale: Keeps all data in one portable format. No vendor lock-in to Pinecone/Weaviate/Qdrant. Vectors are queryable via DataFusion like any other data. Trade-off: brute-force search is fine up to ~100K vectors; HNSW needed beyond that.
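The brute-force path this trade-off refers to is a linear scan over the embedding rows. A minimal sketch, treating the rows as already read out of the Parquet file (the real query path would go through DataFusion):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Brute-force scan over (doc_id, chunk_text, vector) rows.
    O(n) per query — acceptable up to ~100K vectors, after which
    an HNSW index (serialized as a sidecar file) takes over."""
    return sorted(rows, key=lambda r: cosine(query, r["vector"]), reverse=True)[:k]
```

Nothing here is vendor-specific: the same Parquet file feeds both this scan and ordinary DataFusion SQL, which is the portability argument in a nutshell.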

ADR-009: Incremental updates via delta files, not Delta Lake

Date: 2026-03-27

Decision: Updates append to delta Parquet files. Queries merge base + deltas at read time. Periodic compaction merges deltas into base. Single-writer model (no concurrent writers).

Rationale: Full ACID over Parquet (Delta Lake/Iceberg) is a multi-year project. Our use case is single-writer (one ingest pipeline) with read-heavy workloads. Merge-on-read with compaction is sufficient and dramatically simpler.
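The read path can be sketched as a last-writer-wins merge keyed on the row's primary key. The `_deleted` tombstone convention is an assumption for illustration; the real delta file layout may differ:

```python
def merge_on_read(base: list[dict], deltas: list[list[dict]], key: str = "id") -> list[dict]:
    """Read path: apply delta files in append order over the base.
    Later rows win; a row with `_deleted: True` tombstones its key."""
    merged = {row[key]: row for row in base}
    for delta in deltas:  # single writer, so append order is total order
        for row in delta:
            if row.get("_deleted"):
                merged.pop(row[key], None)
            else:
                merged[row[key]] = row
    return list(merged.values())

def compact(base: list[dict], deltas: list[list[dict]], key: str = "id"):
    """Compaction: fold all deltas into a new base, leaving no deltas.
    Queries against (new_base, []) return the same rows as before."""
    return merge_on_read(base, deltas, key), []
```

The single-writer assumption is what keeps this simple: with exactly one ingest pipeline appending deltas, "append order" is a total order and no transaction log is needed to arbitrate concurrent writers.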

ADR-010: Schema detection defaults to string

Date: 2026-03-27

Decision: Ingest pipeline infers column types from data. When ambiguous or mixed, defaults to String rather than failing.

Rationale: Legacy data is messy. A column with "123", "N/A", and "" is a string, not an integer. Downstream queries can CAST as needed. Better to ingest everything than reject on type errors.
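The demote-to-String rule can be sketched as a single pass over sampled cell values. The type names (`Int64`, `Float64`, `String`) and the choice to treat empty cells as nulls rather than type evidence are illustrative assumptions:

```python
def infer_type(values: list[str]) -> str:
    """Infer a column type from sampled cell strings. Any non-empty
    value that fails to parse demotes the column to String instead
    of raising — ingest never rejects a row on a type error."""
    def parses(v: str, cast) -> bool:
        try:
            cast(v)
            return True
        except ValueError:
            return False

    # Empty cells are treated as nulls, not as type evidence.
    nonempty = [v for v in values if v.strip() != ""]
    if nonempty and all(parses(v, int) for v in nonempty):
        return "Int64"
    if nonempty and all(parses(v, float) for v in nonempty):
        return "Float64"
    return "String"
```

So the ADR's own example — a column containing "123", "N/A", and "" — infers as String, and a downstream query recovers the numbers with an explicit CAST.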

ADR-011: This is not a CRM replacement

Date: 2026-03-27

Decision: The lakehouse is the analytical layer BEHIND operational systems. It ingests exports, not live data. CRM/ATS stays for daily operations.

Rationale: Operational systems need single-record CRUD, permissions, UI workflows. The lakehouse answers cross-cutting questions that no single operational system can. They complement, not compete.