# PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores

**Status:** Active — Phase 0-14 complete; federation foundation + HNSW trial system shipped 2026-04-16; entering Phase 16 (hot-swap + model profiles)

**Created:** 2026-03-27

**Last reframed:** 2026-04-16 — from "staffing analytics platform" to "dual-use knowledge substrate" (see §Problem below)

**Owner:** J

---

## Problem

### Use case 1 — Staffing analytics (reference implementation)
Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.

A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.

### Use case 2 — Local AI knowledge substrate (the second half)

Local LLM workloads need a substrate for ingesting, indexing, and retrieving large knowledge corpora. Each running model (or agent) has its own context — documents it cares about, a vector index tuned to its domain, a scoped view of the catalog. That infrastructure is architecturally identical to the staffing problem: ingest messy data, index it, query it, hand it to an AI. Building one substrate that serves both prevents fragmentation.

Concretely this means a running Ollama model like `qwen2.5:7b` or `claude-code-local` should be able to:

- Bind to a named set of datasets
- Get a scoped vector index pre-warmed for its domain
- Issue searches that only see its bound data
- Have its trial/tuning history isolated from other models
- Swap between knowledge generations (today's, yesterday's) without rebuild

The same infrastructure that lets a recruiter query 2.47M rows of staffing data also lets a local 7B model answer questions grounded in a 500K-chunk documentation corpus. Same substrate, different tenant.

### Shared requirements

- Any data source (CSV, DB export, PDF, JSON, Postgres table) can be ingested without pre-defined schemas
- Structured data is queryable via SQL at scale (millions of rows, sub-second)
- Unstructured data is searchable via AI embeddings with per-profile indexes
- An LLM can answer natural language questions against scoped data
- Indexes can be hot-swapped between generations without rebuild downtime
- Trials are first-class data — the system remembers how it was tuned
- Everything runs locally — no cloud APIs, total data privacy
- The system is rebuildable from repository + object storage alone

---
## Solution

A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.

### Locked Stack

| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow `object_store` | Yes |
| Storage Backend | LocalFileSystem → RustFS → S3 | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
| Vector Index | Custom HNSW (in-RAM) + Lance per-profile backend (ADR-019) | Yes |

No new frameworks without a documented ADR.

---
## Architecture

### Services

| Service | Responsibility |
|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth, CORS, body limits, `X-Lakehouse-Bucket` header routing |
| **catalogd** | Metadata control plane — dataset registry, schema versions, manifests, per-dataset resync from parquet footers |
| **storaged** | Object I/O — `BucketRegistry` (multi-backend), rescue fallback, error journal, append-log batching pattern |
| **queryd** | SQL execution — DataFusion over Parquet, MemTable hot cache, delta merge-on-read |
| **ingestd** | Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog |
| **vectord** | Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness) |
| **journald** | Append-only mutation event log (ADR-012) — distinct from storaged error journal |
| **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs |
| **shared** | Types, errors, Arrow helpers, config, protobuf definitions, **secrets provider trait**, **PII detection** |

**Federation building blocks** (shipped 2026-04-16; a secrets-provider sketch follows the list):

- `shared::secrets::SecretsProvider` trait + `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` (0600 enforced)
- `storaged::registry::BucketRegistry` — multi-bucket resolution with `rescue_bucket` read fallback
- `storaged::append_log::AppendLog` — write-once batched append pattern (no RMW, no small-file problem)
- `storaged::error_journal::ErrorJournal` — bucket operation failure log at `primary://_errors/bucket_errors/batch_*.jsonl`
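
A minimal sketch of the `SecretsProvider` shape described in the first bullet, assuming the `toml` crate for parsing; the trait methods, error handling, and flat key/value layout are illustrative assumptions, not the shipped `shared::secrets` API.

```rust
// Illustrative only: a file-backed secrets provider that refuses to load a
// secrets file readable by group or world (the 0600 rule noted above).
use std::{collections::HashMap, fs, io, os::unix::fs::PermissionsExt, path::PathBuf};

pub trait SecretsProvider: Send + Sync {
    /// Return the secret for `key`, if present.
    fn get(&self, key: &str) -> Option<String>;
}

pub struct FileSecretsProvider {
    values: HashMap<String, String>,
}

impl FileSecretsProvider {
    /// Load a flat `key = "value"` TOML file, enforcing 0600 permissions.
    pub fn load(path: PathBuf) -> io::Result<Self> {
        let mode = fs::metadata(&path)?.permissions().mode();
        if mode & 0o077 != 0 {
            return Err(io::Error::new(
                io::ErrorKind::PermissionDenied,
                "secrets file must be 0600",
            ));
        }
        let text = fs::read_to_string(&path)?;
        let values: HashMap<String, String> = toml::from_str(&text)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        Ok(Self { values })
    }
}

impl SecretsProvider for FileSecretsProvider {
    fn get(&self, key: &str) -> Option<String> {
        self.values.get(key).cloned()
    }
}
```
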
### Data Flow

```
Raw data → ingestd (normalize, chunk, detect schema)
    ├→ storaged (Parquet files to object storage)
    ├→ catalogd (register dataset + schema)
    ├→ vectord (embed text chunks, build index)
    └→ queryd (auto-register as queryable table)

User question → gateway
    ├→ vectord (semantic search for relevant chunks) ← RAG path
    ├→ queryd (SQL over structured data) ← Analytics path
    └→ aibridge → Ollama (generate answer from context)
```

### Query Paths

**Analytical (SQL):** "What's the average bill rate for .NET devs in Chicago?"
→ DataFusion scans Parquet columnar, returns in <200ms

**Semantic (RAG):** "Find candidates who could do data engineering work"
→ Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers

**Hybrid:** "Which clients are we losing money on, and why?"
→ SQL for margin calculations + RAG over client notes/emails for context → LLM synthesizes
### Invariants

1. Object storage = source of truth for all data
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable, not a proprietary format) — see ADR-018 for the Parquet-vs-Lance trade review
5. ingestd is idempotent — re-ingesting the same file is a no-op
6. Hot cache is a performance layer, not a source of truth — eviction is safe
7. All services modular and independently replaceable
8. **Indexes are hot-swappable.** A new index generation can be built in the background while the current one serves queries. Promotion is atomic (pointer swap). Rollback to a prior generation is always possible. (Phase 16)
9. **Every reader gets its own profile.** Human operators, AI agents, and local models are all clients of the same substrate. Each has a named profile with its own bucket, vector indexes, trial history, and dataset bindings. Profiles are a first-class architectural concept, not a tenancy afterthought. (Phase 17)
10. **Trials are data, not logs.** Every index build is a trial with measurable metrics. The trial journal IS the agent's memory for how to tune itself. Stored as write-once batched JSONL per the ADR-018 append-log pattern.
11. **Operational failures are findable in one HTTP call.** The bucket error journal, trial journal, and audit log all expose `/storage/errors`, `/hnsw/trials`, `/access/audit` with structured filter + aggregation. No `grep` archaeology to answer "what broke?"

---
## Phases

### Phase 0-5: Foundation ✅ COMPLETE

- Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
- Python sidecar with real Ollama models (embed, generate, rerank)
- Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs
- gRPC, OpenTelemetry, auth middleware, TOML config
- Validated with 286K row staffing company dataset across 7 tables
- Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms

### Phase 6: Ingest Pipeline

Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.

| Step | Deliverable | Gate |
|---|---|---|
| 6.1 | `ingestd` crate with CSV parser → Arrow RecordBatch → Parquet | CSV file → queryable dataset |
| 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file → flat Parquet |
| 6.3 | Schema detection — infer column types from data | No manual schema definition needed |
| 6.4 | Deduplication — detect and skip already-ingested files (content hash) | Re-ingest same file = no-op |
| 6.5 | Text chunking — split large text fields for embedding | Long text → overlapping chunks |
| 6.6 | Auto-registration — ingest writes to storage AND registers in catalog | Single API call: file in → queryable |
| 6.7 | Gateway endpoint: `POST /ingest` with file upload | Upload CSV from browser → query in seconds |

**Gate:** Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps.

**Risk:** Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let the user override.
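
As a concrete illustration of that mitigation, a minimal conservative type-inference pass in the spirit of 6.3, assuming the `arrow` crate's `DataType`; the function name and sampling strategy are hypothetical, not the ingestd implementation.

```rust
// Illustrative only: a column is promoted past Utf8 only if every non-null
// sample parses as the narrower type; anything mixed stays a string.
use arrow::datatypes::DataType;

fn infer_column_type(samples: &[Option<&str>]) -> DataType {
    let mut all_int = true;
    let mut all_float = true;
    let mut saw_value = false;
    for v in samples.iter().flatten() {
        saw_value = true;
        if v.parse::<i64>().is_err() {
            all_int = false;
        }
        if v.parse::<f64>().is_err() {
            all_float = false;
        }
    }
    match (saw_value, all_int, all_float) {
        (false, _, _) => DataType::Utf8, // all-null column: stay permissive
        (_, true, _) => DataType::Int64,
        (_, _, true) => DataType::Float64,
        _ => DataType::Utf8, // mixed or unparseable: default to string
    }
}
```
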
### Phase 7: Vector Index + RAG Pipeline

Make unstructured data searchable by meaning, not just keywords.

| Step | Deliverable | Gate |
|---|---|---|
| 7.1 | `vectord` crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet |
| 7.2 | Chunking strategy — configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks |
| 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL; sketch at the end of this phase) | Semantic search works, correctness verified |
| 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms |
| 7.5 | RAG endpoint: `POST /rag` — question → embed → search → retrieve → generate | Natural language question → grounded answer |
| 7.6 | Auto-embed on ingest — text columns automatically embedded during ingest | No separate embedding step needed |
| 7.7 | Hybrid search — combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) |

**Gate:** Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with LLM explanation.

**Risk: HNSW in Rust at scale.** This is the hardest technical problem. Options:

- `hora` crate — Rust-native ANN, but less mature than FAISS
- Store the HNSW index as a serialized file alongside the Parquet data
- Fallback: brute-force scan is fine up to ~100K vectors; optimize later
- Nuclear option: use Qdrant as an external vector store (breaks the "no new services" rule)

**Decision needed:** Evaluate `hora` vs external Qdrant vs brute-force at J's data scale. (Resolved 2026-04-16: a custom HNSW with trial system shipped; see Phase 15.)
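
A minimal sketch of the brute-force fallback from 7.3 and the options above: exact cosine similarity over every stored vector, the path that is "fine up to ~100K vectors." Function names are hypothetical; in the real pipeline this runs as a cosine-similarity expression inside DataFusion rather than a hand-rolled loop.

```rust
// Illustrative only: exhaustive cosine top-k over (doc_id, vector) pairs.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt() + f32::EPSILON)
}

/// Return the `k` best (doc_id, score) pairs for `query` by exhaustive scan.
fn brute_force_top_k<'a>(
    query: &[f32],
    rows: impl Iterator<Item = (&'a str, &'a [f32])>,
    k: usize,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = rows
        .map(|(id, v)| (id.to_string(), cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending by similarity
    scored.truncate(k);
    scored
}
```
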
### Phase 8: Hot Cache + Incremental Updates

Make frequently accessed data fast, and handle real-time updates without full rewrites.

| Step | Deliverable | Gate |
|---|---|---|
| 8.1 | MemTable hot cache — pin active datasets in memory | Queries on hot data: <10ms |
| 8.2 | Cache policy — LRU eviction based on access patterns | Memory-bounded, auto-manages |
| 8.3 | Incremental writes — append new rows without rewriting entire Parquet file | Update one candidate's phone → no full table rewrite |
| 8.4 | Merge-on-read — query combines base Parquet + delta files | Correct results from base + updates |
| 8.5 | Compaction — periodic merge of delta files into base Parquet | Prevent delta file proliferation |
| 8.6 | Upsert semantics — insert or update by primary key | Same candidate ID → update in place |

**Gate:** Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.

**Risk: This is the Delta Lake problem.** Full ACID transactions over Parquet files are what Databricks spent years building. We're NOT building Delta Lake — we're building a pragmatic version (merge-on-read sketched after this list):

- Append-only delta files (easy)
- Merge-on-read (moderate)
- Compaction (moderate)
- Full ACID isolation (NOT attempting — single-writer model instead)
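
A minimal sketch of the pragmatic merge-on-read rule from 8.3/8.4 under the single-writer model: replay append-only delta files over the base file, and the newest write per primary key wins. `Row` stands in for an Arrow RecordBatch row; the names are illustrative, not the queryd implementation. Compaction (8.5) is then just writing the merged result back as the new base and dropping the deltas.

```rust
// Illustrative only: base + ordered deltas, last write per primary key wins.
use std::collections::HashMap;

#[derive(Clone)]
struct Row {
    id: String,          // primary key (8.6 upsert semantics)
    fields: Vec<String>, // stand-in for the remaining columns
}

fn merge_on_read(base: Vec<Row>, deltas: Vec<Vec<Row>>) -> Vec<Row> {
    // Start from the base Parquet contents...
    let mut latest: HashMap<String, Row> =
        base.into_iter().map(|r| (r.id.clone(), r)).collect();
    // ...then replay delta files oldest-to-newest; later versions overwrite.
    for delta in deltas {
        for row in delta {
            latest.insert(row.id.clone(), row);
        }
    }
    latest.into_values().collect()
}
```
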
### Phase 8.5: Agent Workspaces ✅ COMPLETE

Per-contract overlays with daily/weekly/monthly tiers and instant handoff.

- WorkspaceManager with saved searches, shortlists, activity logs
- Zero-copy handoff between agents (pointer swap, not data copy)
- Persisted to object storage, rebuilt on startup

### Phase 9: Event Journal — Never Destroy Evidence

**Principle:** Every mutation is appended, never overwritten. This is the one decision that's impossible to retrofit — once history is lost, it's gone forever.

| Step | Deliverable | Gate |
|---|---|---|
| 9.1 | `journald` crate: append-only event log as Parquet | Every write/update/delete logged with who, when, what, old value, new value |
| 9.2 | Event schema: entity, field, old_value, new_value, actor, timestamp, source, workspace_id (struct sketch at the end of this phase) | Standardized across all mutations |
| 9.3 | Journal query: `SELECT * FROM journal WHERE entity = 'CAND-001' ORDER BY timestamp` | Full history of any record |
| 9.4 | Replay capability: rebuild any dataset's state at any point in time | Time-travel queries |
| 9.5 | Journal compaction: roll old events into monthly summary Parquet files | Prevent unbounded growth |

**Gate:** Change a candidate's phone number. Query shows the change. Journal shows old value, new value, who changed it, when, and why. Replay to yesterday's state.

**Why now:** In 3 years, compliance, AI auditability, and "why did the agent recommend this candidate" all require mutation history. Adding it later means you only have history from that day forward.
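
For orientation, the 9.2 event schema as a plain struct, assuming serde derives; the shipped journald schema is Arrow/Parquet, and the exact types shown here are assumptions.

```rust
// Illustrative only: one append-only journal event per mutation.
#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)]
pub struct JournalEvent {
    pub entity: String,            // e.g. "CAND-001"
    pub field: String,             // e.g. "phone"
    pub old_value: Option<String>,
    pub new_value: Option<String>,
    pub actor: String,             // human user, agent, or model profile id
    pub timestamp: i64,            // epoch millis; events are appended, never rewritten
    pub source: String,            // which service or ingest job emitted the mutation
    pub workspace_id: Option<String>,
}
```
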
### Phase 10: Rich Catalog v2 — Metadata as Product

**Principle:** Every dataset should be self-describing. A new team member (or AI agent) should understand what data exists, who owns it, how fresh it is, and what's sensitive — without asking anyone.

| Step | Deliverable | Gate |
|---|---|---|
| 10.1 | Catalog schema upgrade: add owner, sensitivity, freshness_sla, description, tags, lineage | `GET /catalog/datasets` returns rich metadata |
| 10.2 | Sensitivity classification: PII, PHI, financial, public, internal | Sensitive fields tagged at ingest |
| 10.3 | Lineage tracking: source_system → ingest_job → dataset → derived_dataset | "Where did this data come from?" answerable |
| 10.4 | Freshness contracts: expected_update_frequency, last_updated, stale_after | Alert when data goes stale |
| 10.5 | Dataset contracts: required columns, type expectations, validation rules | Ingest rejects data that breaks the contract |
| 10.6 | Auto-documentation: AI generates dataset description from schema + sample data | New datasets self-describe on ingest |

**Gate:** Ingest a CSV. System auto-detects PII columns (email, phone, SSN patterns), tags them, generates a description, sets owner, and tracks lineage back to the source file.

**Why now:** Every dataset you ingest without metadata becomes a "mystery file" in 6 months. The metadata layer makes the difference between a searchable knowledge platform and a data graveyard.

### Phase 11: Embedding Versioning — Model-Proof Vector Layer

**Principle:** Embedding models will change. If you don't track which model created which vectors, upgrading means re-embedding everything from scratch.

| Step | Deliverable | Gate |
|---|---|---|
| 11.1 | Vector index metadata: model_name, model_version, dimensions, created_at (sketch at the end of this phase) | Every index knows its embedding model |
| 11.2 | Multi-version indexes: same data, different models, coexist | Search specifies which model version |
| 11.3 | Incremental re-embed: only new/changed docs get re-embedded on model upgrade | Model swap doesn't require full re-embed |
| 11.4 | A/B search: query both old and new model, compare results | Validate model upgrade before committing |

**Gate:** Upgrade from nomic-embed-text to a new model. Old index still works. New index builds incrementally. Compare search quality. Switch when ready.
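
A small sketch of the 11.1 per-index metadata and the check behind 11.3's incremental re-embed; field names mirror the table above, while the types are assumptions.

```rust
// Illustrative only: every vector index records the model that produced it.
pub struct VectorIndexMeta {
    pub index_name: String,    // logical index, e.g. "resumes_100k_v2"
    pub model_name: String,    // e.g. "nomic-embed-text"
    pub model_version: String,
    pub dimensions: usize,     // 768 for nomic-embed-text
    pub created_at: i64,       // epoch millis
}

/// A chunk only needs re-embedding if it was embedded under a different
/// model version than the index it is being added to (11.3).
pub fn needs_reembed(chunk_model_version: &str, index: &VectorIndexMeta) -> bool {
    chunk_model_version != index.model_version
}
```
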
### Phase 12: Tool Registry — Agent-Safe Business Actions

**Principle:** In 3 years, AI agents won't just query — they'll act. Instead of every agent getting raw SQL access, expose named, governed, audited business actions.

| Step | Deliverable | Gate |
|---|---|---|
| 12.1 | Tool definition: name, description, parameters, permissions, audit_level (sketch at the end of this phase) | `search_candidates(skills, city, min_years)` as a registered tool |
| 12.2 | Tool execution: validates params, checks permissions, logs usage, runs query | Agent calls tool, gets results, action is logged |
| 12.3 | Read vs write tools: read tools are permissive, write tools require confirmation | `get_candidate` = auto-approved, `update_phone` = requires review |
| 12.4 | MCP-compatible interface: expose tools via Model Context Protocol | Any MCP-compatible agent (Claude, GPT, local) can use them |
| 12.5 | Rate limiting + quotas per agent/tool | Prevent a runaway agent from overwhelming the system |

**Gate:** An AI agent calls `search_candidates(skills="Python,AWS", city="Chicago", available=true)` → gets results → calls `shortlist_candidate(workspace_id, candidate_id, reason)` → action is logged, auditable, reversible.

**Why now:** The tool interface is cheap to build (it's just named endpoints with validation). But retrofitting audit logging and permission checks onto raw SQL access is a nightmare. Build the governed interface first.
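
A sketch of the 12.1 tool definition and the 12.3 read/write gate; the struct and enum shapes are assumptions for illustration, not the registry API.

```rust
// Illustrative only: a registered tool plus the auto-execute decision.
pub enum AuditLevel {
    Log,          // record the call
    LogAndReview, // record the call and require human review
}

pub enum ToolKind {
    Read,  // permissive, auto-approved
    Write, // requires confirmation before execution
}

pub struct ToolDef {
    pub name: &'static str,                  // e.g. "search_candidates"
    pub description: &'static str,
    pub parameters: &'static [&'static str], // e.g. ["skills", "city", "min_years"]
    pub kind: ToolKind,
    pub audit_level: AuditLevel,
}

/// 12.2/12.3: read tools run immediately; write tools need explicit
/// confirmation. Every invocation is logged regardless of the outcome.
pub fn may_auto_execute(tool: &ToolDef, caller_confirmed: bool) -> bool {
    match tool.kind {
        ToolKind::Read => true,
        ToolKind::Write => caller_confirmed,
    }
}
```
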
### Phase 13: Security & Access Control

| Step | Deliverable | Gate |
|---|---|---|
| 13.1 | Field-level sensitivity tags (PII, PHI, financial) in catalog | Sensitive fields identified |
| 13.2 | Row-level access policies (agent A sees their candidates only) | Policy evaluated at query time |
| 13.3 | Column masking (show last 4 of SSN, redact salary for non-managers) | Masked results based on role |
| 13.4 | Query audit log (who queried what, when, which fields) | Every data access recorded |
| 13.5 | Policy-as-code (TOML/YAML rules, not hardcoded; example after this table) | Non-engineer can update access rules |

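
One way the 13.5 policy-as-code file could deserialize, assuming TOML parsed with serde; the rule fields and the example rule are invented for illustration, not the shipped policy schema.

```rust
// Illustrative only: access rules as data, editable without a recompile.
use serde::Deserialize;

#[derive(Deserialize)]
struct Policy {
    rules: Vec<Rule>,
}

#[derive(Deserialize)]
struct Rule {
    dataset: String,            // which dataset the rule applies to
    role: String,               // who it applies to
    mask_columns: Vec<String>,  // 13.3 column masking
    row_filter: Option<String>, // 13.2 row-level predicate
}

fn load_policy(text: &str) -> Result<Policy, toml::de::Error> {
    toml::from_str(text)
}

// Example rule a non-engineer could edit:
//
// [[rules]]
// dataset = "candidates"
// role = "recruiter"
// mask_columns = ["ssn", "salary"]
// row_filter = "recruiter_id = :me"
```
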
### Phase 14: Schema Evolution + AI Migration

| Step | Deliverable | Gate |
|---|---|---|
| 14.1 | Schema diff detection: old schema vs new ingest → list changes | "Column renamed: first_name → full_name" |
| 14.2 | AI-generated migration rules: LLM suggests column mappings | "full_name = concat(first_name, ' ', last_name)" |
| 14.3 | Migration preview: show how old data maps to new schema before applying | Human approves before data transforms |
| 14.4 | Versioned schemas in catalog: v1, v2, v3 coexist | Queries specify version or use latest |

### Phase 15: Infrastructure horizon items

- [x] HNSW vector index with trial system (shipped 2026-04-16)
- [x] Federation foundation — ADR-017 (shipped 2026-04-16)
- [x] Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
- [ ] Federation layer 2 — X-Lakehouse-Bucket middleware, catalog migration, cross-bucket SQL in queryd
- [ ] PDF OCR for scanned documents (Tesseract integration)
- [ ] Scheduled ingest (cron-based file watching, S3 event triggers)
- [ ] Multi-node query distribution (DataFusion supports this architecturally)
### Phase 16: Hot-Swap Index Generations

Make indexes upgrade in place without dropping queries (a pointer-swap sketch follows this phase).

| Step | Deliverable | Gate |
|---|---|---|
| 16.1 | "Active generation" pointer per logical index name | `/vectors/search` routes to the current champion automatically |
| 16.2 | Background trial runner: watches trial journal, proposes configs (random search / Bayesian), fires `/hnsw/trial` | Agent autonomously tunes without a human POSTing each config |
| 16.3 | Promotion endpoint: `POST /hnsw/promote/{index}/{trial_id}` atomically swaps the active pointer | Next search hits the new config, zero downtime |
| 16.4 | Rollback: `POST /hnsw/rollback/{index}` reverts to the previous generation | Bad promotion recoverable in milliseconds |
| 16.5 | Dataset-append triggers: when `POST /ingest/file` writes to a dataset with attached vector indexes, schedule an automatic re-trial (not a full rebuild) | New docs get embedded + indexed without manual intervention |

**Gate:** Run the trial agent for 10 minutes against `resumes_100k_v2` with a fresh eval set. It explores the `ef_construction × ef_search` space, promotes the Pareto winner, and continues running. Zero human clicks. All trials and promotions appear in `/hnsw/trials/resumes_100k_v2`.

**Risk:** The agent loops into a bad region (e.g. always proposes ef_construction=1). Mitigation: a hardcoded config-space constraint + a minimum-quality gate (don't promote anything with recall <0.9).
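
A minimal sketch of the generation pointer behind 16.1, 16.3, and 16.4: promotion and rollback are pointer swaps, and in-flight searches keep whichever generation they started on alive. Struct and method names are illustrative, not the vectord API.

```rust
// Illustrative only: one "active generation" pointer per logical index name.
use std::sync::{Arc, RwLock};

pub struct HnswIndex {
    pub trial_id: String,
    // built graph, ef_search, eval metrics, ...
}

pub struct IndexGenerations {
    active: RwLock<Arc<HnswIndex>>,
    previous: RwLock<Option<Arc<HnswIndex>>>,
}

impl IndexGenerations {
    pub fn new(initial: Arc<HnswIndex>) -> Self {
        Self { active: RwLock::new(initial), previous: RwLock::new(None) }
    }

    /// What `/vectors/search` reads: cloning the Arc means searches already
    /// running keep the old generation alive across a promotion.
    pub fn active(&self) -> Arc<HnswIndex> {
        self.active.read().unwrap().clone()
    }

    /// 16.3 promotion: atomic pointer swap, previous champion kept for rollback.
    pub fn promote(&self, candidate: Arc<HnswIndex>) {
        let old = {
            let mut active = self.active.write().unwrap();
            std::mem::replace(&mut *active, candidate)
        };
        *self.previous.write().unwrap() = Some(old);
    }

    /// 16.4 rollback: revert to the prior generation if one exists.
    pub fn rollback(&self) -> bool {
        let prev = self.previous.write().unwrap().take();
        match prev {
            Some(p) => {
                *self.active.write().unwrap() = p;
                true
            }
            None => false,
        }
    }
}
```
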
### Phase 17: Model Profiles + Dataset Bindings

Make "different models see different data" real instead of a config string.

| Step | Deliverable | Gate |
|---|---|---|
| 17.1 | `ModelProfile` manifest: id, ollama_name, bucket, bound_datasets[], hnsw_config, embed_model (struct sketch at the end of this phase) | `GET /models` lists profiles; `POST /models` creates one |
| 17.2 | Profile activation endpoint: `POST /profile/{id}/activate` — warms the EmbeddingCache for bound indexes, builds HNSW with the profile's config | Next search against bound indexes is <1ms cold |
| 17.3 | Model-scoped search: `POST /search?model=X` filters to bound datasets only | Model A can't see Model B's datasets unless explicitly shared |
| 17.4 | VRAM-aware activation: only one (or a small N) model loaded at a time on the 16GB A4000 | Activating model B unloads model A via Ollama's keep_alive=0 |
| 17.5 | Audit: every tool invocation by a model is logged with the model's identity | `GET /models/{id}/audit` shows exactly what each model touched |

**Gate:** Two model profiles defined: `staffing-recruiter` (bound to candidates/placements/timesheets) and `docs-assistant` (bound to a documentation corpus). Activate staffing-recruiter, search for candidates — works. Switch to docs-assistant, same search — returns zero from staffing (not bound) but finds docs. VRAM shows only one embedding model loaded at a time.

**VRAM reality:** A 16GB A4000 realistically holds 1-2 loaded models concurrently. "Multi-model" in practice means sequential swaps between profiles, not parallel serving. The profile abstraction makes that swap clean.
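
The 17.1 manifest as a struct sketch, including the per-profile `vector_backend` field that Phase 18 (ADR-019) adds; field names follow the table above, while the types and enum variants are assumptions.

```rust
// Illustrative only: one profile per reader (human, agent, or local model).
pub enum VectorBackend {
    Parquet, // in-RAM HNSW over Parquet-stored embeddings (primary)
    Lance,   // disk-resident indexes for append-heavy or >5M-vector profiles
}

pub struct HnswConfig {
    pub ef_construction: usize, // e.g. 80
    pub ef_search: usize,       // e.g. 30
}

pub struct ModelProfile {
    pub id: String,                    // e.g. "staffing-recruiter"
    pub ollama_name: String,           // e.g. "qwen2.5:7b"
    pub bucket: String,                // profile-scoped bucket
    pub bound_datasets: Vec<String>,   // 17.3: searches are filtered to these
    pub hnsw_config: HnswConfig,
    pub embed_model: String,           // e.g. "nomic-embed-text"
    pub vector_backend: VectorBackend, // ADR-019: Parquet or Lance per profile
}
```
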
### Phase 18: Storage format decision (Lance evaluation)

The question was raised on 2026-04-16, after J's LLMS3 knowledge base identified Lance as an `alternative_to` Parquet for vector workloads. Current stack: Parquet with binary-blob vector columns + an in-RAM HNSW sidecar. Evaluated against: Lance's native vector format with disk-resident indexes.

| Step | Deliverable | Decision criteria |
|---|---|---|
| 18.1 | ✅ Parallel Lance-backed vector index for `resumes_100k_v2` in standalone `crates/lance-bench` | Built 2026-04-16 |
| 18.2 | ✅ Head-to-head benchmark across 8 dimensions (cold-load, search latency, disk, index build, random access, append) | Complete |
| 18.3 | ✅ ADR-019 committed with measured data and decision | See `docs/ADR-019-vector-storage.md` |

**Outcome:** Hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance joins as a second backend for Phase 16 hot-swap (14× faster index builds), Phase C / append-heavy workloads (0.08s appends vs a full rewrite), RAG random-access retrieval (112× faster), and indexes past the ~5M RAM ceiling.

Per-profile `vector_backend: Parquet | Lance` becomes part of Phase 17 (model profiles). See ADR-019 for the full scorecard and caveats.
### Phase 19+: Further horizon

- PDF OCR for scanned documents (Tesseract integration)
- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Video/audio transcript ingest + multimodal embeddings
- True distributed query (DataFusion multi-node) — only if single-machine ceilings bite

---

## Known ceilings (honest)

The current stack has measurable limits. Documenting them so future decisions aren't based on wishful thinking.

| Dimension | Current ceiling | Breaks at | Escape hatch |
|---|---|---|---|
| Vector count per index (Parquet+HNSW in-RAM) | ~5M on 128GB | Past 5M | Switch that profile's `vector_backend` to Lance per ADR-019 — IVF_PQ stays on disk-resident quantized codes |
| Concurrent active indexes | ~50-100 at 100K vectors each | 10M×50 configurations | Lance disk-resident + per-profile activation |
| Rows per dataset | 2.47M proven, probably 100M+ fine | Approaches DataFusion memory limits | DataFusion predicate pushdown + partition pruning (existing) |
| Concurrent loaded models | 1-2 on 16GB VRAM (A4000) | 3+ models simultaneously | Not a substrate problem — architectural, driven by Ollama |
| Trial journal growth per index | Thousands of trials, batched JSONL | High-frequency auto-tuning agent | Compaction via `/hnsw/trials/{idx}/compact` |
| Error journal growth | Bounded by ring buffer (2000 events in-memory) + batched JSONL on disk | Continuous failure scenarios | Compaction + retention policy (TODO) |

---
## Reference Workloads

### Workload 1: Staffing Company

Scale-tested on a 128GB RAM server:

| Table | Rows | Size | Description |
|---|---|---|---|
| candidates | 100,000 | 10.1 MB | Names, phones, emails, zip, skills, resume text |
| clients | 2,000 | 33 KB | Companies, contacts, verticals |
| job_orders | 15,000 | 0.9 MB | Positions with descriptions, requirements, rates |
| placements | 50,000 | 1.2 MB | Candidate↔job matches with rates, recruiters |
| timesheets | 1,000,000 | 16.7 MB | Weekly hours, bill/pay totals, approvals |
| call_log | 800,000 | 34.3 MB | Phone CDR — who called whom, duration, disposition |
| email_log | 500,000 | 16.0 MB | Email tracking — subject, opened, direction |
| **Total** | **2,467,000** | **79 MB** | **7 tables, cross-referenced** |

### Benchmarks (2.47M rows)

| Query | Cold (Parquet) | Hot (MemCache) | Speedup |
|---|---|---|---|
| 100K candidate filter (skills+city+status) | 257ms | 21ms | 12x |
| 1M timesheet aggregation + JOIN | 942ms | 96ms | 9.8x |
| 800K call log cross-reference (cold leads) | 642ms | — | — |
| Triple JOIN recruiter performance | 487ms | — | — |
| 500K email open rate aggregation | 259ms | — | — |
| COUNT all 2.47M rows | 84ms | — | — |
| 10K vector semantic search (cosine) | 450ms | — | — |
| Natural language → AI SQL → execute | ~3s | — | (model inference) |

### Vector Search

- 10K candidate resumes embedded in 204s (49 chunks/sec via Ollama)
- Semantic search over 10K vectors: ~450ms (brute-force cosine)
- RAG pipeline: question → embed → search → retrieve → LLM answer with citations
- AI correctly refuses to hallucinate when context doesn't support an answer

### Agent Workspaces

- Create per-contract workspace with saved searches + shortlists
- Instant handoff between agents — zero data copy
- Full activity timeline preserved across handoffs
### Workload 2: Local LLM Knowledge Base

The second use case this substrate is built for. Reference corpus: the running `knowledge_base` Postgres database (586 team runs, response cache history, pipeline runs, threat intel) + the LLMS3.com published corpus (~243 enriched documents).

Target scale on the same 128GB server:

- Documents: 10K-100K per model profile
- Chunks after chunking: 500K-5M per profile
- Embedding dimensions: 768 (nomic-embed-text)
- Query latency: <100ms semantic search, <3s end-to-end RAG including LLM generation
- Concurrent model profiles: 2-5 configured, 1-2 active at a time (VRAM-bound)

Measured to date (Phase 7 + Phase 16 prep):

- 100K candidate-resume chunks embedded in 10 min via Ollama nomic-embed-text
- HNSW search at 100% recall, ~1ms p50 on 100K vectors (ec=80 es=30 locked as default)
- Trial journal instrumented and working for parameter tuning

Gaps still to close for this workload:

- Model profiles (Phase 17) — today, "model" is a string, not a first-class entity
- Hot-swap generations (Phase 16) — today, rebuild = downtime
- Scale past 5M vectors — decided in Phase 18 (ADR-019): switch that profile's `vector_backend` to Lance

---
## Available Local Models

| Model | Use |
|---|---|
| `nomic-embed-text` | Embeddings (768d) — semantic search, RAG retrieval |
| `qwen2.5` | SQL generation, structured output, summarization |
| `mistral` | General generation, longer context |
| `gemma2` | General generation |
| `llama3.2` | General generation, lightweight |

---
## Non-Goals

- Cloud deployment (local-first, always)
- Full ACID transactions (single-writer model is sufficient)
- Real-time streaming / CDC (batch ingest is the model; scheduled refresh, not transactional replication)
- Replacing the CRM (this is the analytical + AI layer BEHIND the CRM)
- Custom file formats — Parquet for datasets + sidecar indexes for vectors (see ADR-018 for why we stayed Parquet instead of migrating to Lance, and ADR-019 for the later decision to add Lance as an opt-in per-profile backend rather than a replacement)
- Hard multi-tenant isolation (profiles and federation provide soft isolation; this is not a SaaS platform with adversarial tenants — the operator is single-trust)

Removed from prior non-goals (2026-04-16):

- ~~Multi-tenancy (single-owner system)~~ — federation + profile buckets are now first-class; soft multi-tenancy is a design goal. Hard multi-tenancy (adversarial tenants on shared infrastructure) remains out of scope.

---
## Risks

### Technical Risks

| Risk | Severity | Mitigation |
|---|---|---|
| Vector search in Rust at scale | **High** | Start brute-force, evaluate `hora` crate, Qdrant as fallback |
| Incremental updates on Parquet | **High** | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | **High** | Conservative schema detection, default to string, user overrides |
| 100K+ embedding timeout | **High** | Async background job with progress, not single HTTP request |
| Schema evolution across ingests | **Medium** | Schema fingerprinting + versioned manifests (Phase 14) |
| Memory pressure from hot cache | **Medium** | LRU eviction, configurable memory limit (tested: 408MB for 1.1M rows) |
| HNSW index persistence | **Medium** | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | **Low** | Can replace with direct Ollama HTTP from Rust later |

### Strategic Risks (Future-Proofing)

| Risk | Impact | Phase |
|---|---|---|
| No mutation history → can't audit AI decisions | **Critical** — compliance, trust | Phase 9 (event journal) |
| No metadata → datasets become mystery files | **High** — onboarding, discovery | Phase 10 (rich catalog) |
| Embeddings locked to one model | **High** — can't upgrade models | Phase 11 (versioning) |
| Raw SQL as only interface → ungoverned agent access | **High** — security, auditability | Phase 12 (tool registry) |
| No sensitivity classification → compliance exposure | **Medium** — grows with data volume | Phase 13 (access control) |
| No schema evolution handling → ingest breaks on format change | **Medium** — grows with source count | Phase 14 (AI migration) |

---
## Design Principles (Future-Proofing)

These are the decisions that still look smart after the stack changes:

1. **Store the truth openly.** Parquet on object storage. No proprietary formats. Any engine can read it.
2. **Describe it richly.** Every dataset has an owner, lineage, sensitivity tags, freshness contract.
3. **Never destroy evidence.** Every mutation is journaled. Rebuild any state at any point in time.
4. **Secure it centrally.** Permissions live in the data layer, not application code.
5. **Expose it through reusable interfaces.** Named tools with contracts, not raw SQL for every consumer.
6. **Version everything.** Schemas, embeddings, models — all versioned, all coexist during migration.
7. **Make unstructured data first-class.** Every document gets: storage, text extraction, entity tags, chunks, embeddings, linkage.
8. **Separate storage from compute from intelligence.** Scale each independently. Replace any layer without touching the others.

---

## Operating Rules

1. PRD > architecture > phases > status > git
2. Git is memory, not chat
3. No undocumented changes
4. No silent architecture drift
5. Always work in smallest valid step
6. Always verify before moving on
7. Flag when something is genuinely hard vs just engineering work
8. If a phase reveals the approach is wrong, update the PRD before continuing
9. **Cheap-now, expensive-later decisions get built first** (event journal, metadata, versioning)
10. **Build the governed interface before the raw interface** (tools before SQL for agents)