root dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe

Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 01:50:05 -05:00


PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores

Status: Active. Phase 0-14 complete; federation foundation + HNSW trial system shipped 2026-04-16; entering Phase 16 (hot-swap + model profiles)
Created: 2026-03-27
Last reframed: 2026-04-16, from "staffing analytics platform" to "dual-use knowledge substrate" (see §Problem below)
Owner: J

Problem

Use case 1 — Staffing analytics (reference implementation)

Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.

A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.

Use case 2 — Local AI knowledge substrate (the second half)

Local LLM workloads need a substrate for ingesting, indexing, and retrieving large knowledge corpora. Each running model (or agent) has its own context — documents it cares about, a vector index tuned to its domain, a scoped view of the catalog. That infrastructure is architecturally identical to the staffing problem: ingest messy data, index it, query it, hand it to an AI. Building one substrate that serves both prevents fragmentation.

Concretely, this means a running Ollama model like qwen2.5:7b or claude-code-local should be able to:

Bind to a named set of datasets
Get a scoped vector index pre-warmed for its domain
Issue searches that only see its bound data
Have its trial/tuning history isolated from other models
Swap between knowledge generations (today's, yesterday's) without rebuild

The same infrastructure that lets a recruiter query 2.47M rows of staffing data also lets a local 7B model answer questions grounded in a 500K-chunk documentation corpus. Same substrate, different tenant.

Shared requirements

Any data source (CSV, DB export, PDF, JSON, Postgres table) can be ingested without pre-defined schemas
Structured data is queryable via SQL at scale (millions of rows, sub-second)
Unstructured data is searchable via AI embeddings with per-profile indexes
An LLM can answer natural language questions against scoped data
Indexes can be hot-swapped between generations without rebuild downtime
Trials are first-class data — the system remembers how it was tuned
Everything runs locally — no cloud APIs, total data privacy
The system is rebuildable from repository + object storage alone

Solution

A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.

Locked Stack

Layer	Technology	Locked
Frontend	Dioxus	Yes
API	Axum + Tokio	Yes
Object Storage Interface	Apache Arrow `object_store`	Yes
Storage Backend	LocalFileSystem → RustFS → S3	Yes
Query Engine	DataFusion	Yes
Data Format	Parquet + Arrow	Yes
RPC (internal)	tonic (gRPC)	Yes
AI Runtime	Ollama (local models)	Yes
AI Boundary	Python FastAPI sidecar → Ollama HTTP API	Yes
Vector Index	TBD — evaluate `hora`, `qdrant` crate, or HNSW from scratch	Open

No new frameworks without documented ADR.

Architecture

Services

Service	Responsibility
gateway	HTTP/gRPC ingress, routing, auth, CORS, body limits, `X-Lakehouse-Bucket` header routing
catalogd	Metadata control plane — dataset registry, schema versions, manifests, per-dataset resync from parquet footers
storaged	Object I/O — `BucketRegistry` (multi-backend), rescue fallback, error journal, append-log batching pattern
queryd	SQL execution — DataFusion over Parquet, MemTable hot cache, delta merge-on-read
ingestd	Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog
vectord	Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness)
journald	Append-only mutation event log (ADR-012) — distinct from storaged error journal
aibridge	Rust↔Python boundary — HTTP client to FastAPI sidecar
ui	Dioxus frontend — Ask, Explore, SQL, System tabs
shared	Types, errors, Arrow helpers, config, protobuf definitions, secrets provider trait, PII detection

Federation building blocks (shipped 2026-04-16):

shared::secrets::SecretsProvider trait + FileSecretsProvider reading /etc/lakehouse/secrets.toml (0600 enforced)
storaged::registry::BucketRegistry — multi-bucket resolution with rescue_bucket read fallback
storaged::append_log::AppendLog — write-once batched append pattern (no RMW, no small-file problem)
storaged::error_journal::ErrorJournal — bucket operation failure log at primary://_errors/bucket_errors/batch_*.jsonl

Data Flow

Raw data → ingestd (normalize, chunk, detect schema)
    ├→ storaged (Parquet files to object storage)
    ├→ catalogd (register dataset + schema)
    ├→ vectord (embed text chunks, build index)
    └→ queryd  (auto-register as queryable table)

User question → gateway
    ├→ vectord (semantic search for relevant chunks)  ← RAG path
    ├→ queryd  (SQL over structured data)             ← Analytics path
    └→ aibridge → Ollama (generate answer from context)

Query Paths

Analytical (SQL): "What's the average bill rate for .NET devs in Chicago?" → DataFusion scans Parquet columnar, returns in <200ms

Semantic (RAG): "Find candidates who could do data engineering work" → Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers

Hybrid: "Which clients are we losing money on, and why?" → SQL for margin calculations + RAG over client notes/emails for context → LLM synthesizes

Invariants

Object storage = source of truth for all data
catalogd = sole metadata authority
No raw data in catalog — only pointers
vectord stores embeddings AS Parquet (portable, not a proprietary format); see Phase 18 and the planned ADR-019 for the Parquet-vs-Lance review
ingestd is idempotent — re-ingesting the same file is a no-op
Hot cache is a performance layer, not a source of truth — eviction is safe
All services modular and independently replaceable
Indexes are hot-swappable. A new index generation can be built in the background while the current one serves queries. Promotion is atomic (pointer swap). Rollback to a prior generation is always possible. (Phase 16)
Every reader gets its own profile. Human operators, AI agents, and local models are all clients of the same substrate. Each has a named profile with its own bucket, vector indexes, trial history, and dataset bindings. Profiles are a first-class architectural concept, not a tenancy afterthought. (Phase 17)
Trials are data, not logs. Every index build is a trial with measurable metrics. The trial journal IS the agent's memory for how to tune itself. Stored as write-once batched JSONL per the ADR-018 append-log pattern.
Operational failures are findable in one HTTP call. The bucket error journal, trial journal, and audit log all expose /storage/errors, /hnsw/trials, /access/audit with structured filter + aggregation. No grep archaeology to answer "what broke?"
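
The write-once batched append pattern behind the trial and error journals can be sketched as below. This is an illustration under stated assumptions, not the `storaged::append_log` code: the class shape, the `batch_size` default, and the zero-padded file naming are hypothetical, and a local directory stands in for object storage.

```python
import json
import os
from typing import Any, Dict, List


class AppendLog:
    """Buffers events and writes each flush as a new, never-rewritten file."""

    def __init__(self, directory: str, batch_size: int = 100) -> None:
        self.directory = directory
        self.batch_size = batch_size
        self.buffer: List[Dict[str, Any]] = []
        self.seq = 0  # a real log would need names that stay unique across restarts
        os.makedirs(directory, exist_ok=True)

    def append(self, event: Dict[str, Any]) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        path = os.path.join(self.directory, f"batch_{self.seq:08d}.jsonl")
        # Mode "x" fails if the file exists, so a batch is never read-modify-written.
        with open(path, "x", encoding="utf-8") as f:
            for event in self.buffer:
                f.write(json.dumps(event) + "\n")
        self.seq += 1
        self.buffer.clear()

    def read_all(self) -> List[Dict[str, Any]]:
        events: List[Dict[str, Any]] = []
        for name in sorted(os.listdir(self.directory)):
            if name.startswith("batch_") and name.endswith(".jsonl"):
                with open(os.path.join(self.directory, name), encoding="utf-8") as f:
                    events.extend(json.loads(line) for line in f if line.strip())
        return events
```

Batching bounds the number of small files, and exclusive creation means existing batches are only ever read, never rewritten.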

Phases

Phase 0-5: Foundation ✅ COMPLETE

Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
Python sidecar with real Ollama models (embed, generate, rerank)
Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs
gRPC, OpenTelemetry, auth middleware, TOML config
Validated with 286K row staffing company dataset across 7 tables
Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms

Phase 6: Ingest Pipeline

Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.

Step	Deliverable	Gate
6.1	`ingestd` crate with CSV parser → Arrow RecordBatch → Parquet	CSV file → queryable dataset
6.2	JSON ingest (newline-delimited JSON, nested objects)	JSON file → flat Parquet
6.3	Schema detection — infer column types from data	No manual schema definition needed
6.4	Deduplication — detect and skip already-ingested files (content hash)	Re-ingest same file = no-op
6.5	Text chunking — split large text fields for embedding	Long text → overlapping chunks
6.6	Auto-registration — ingest writes to storage AND registers in catalog	Single API call: file in → queryable
6.7	Gateway endpoint: `POST /ingest` with file upload	Upload CSV from browser → query in seconds

Gate: Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps.
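
Step 6.4 (skip already-ingested files by content hash) could look roughly like this. It is a sketch only: the choice of SHA-256, the `IngestLedger` name, and the in-memory set are assumptions, since the PRD says "content hash" without naming an algorithm or a storage location.

```python
import hashlib
from typing import Set


def content_hash(data: bytes) -> str:
    """Hex digest of the raw file bytes (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


class IngestLedger:
    """Hypothetical record of which file contents have already been ingested."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def ingest(self, data: bytes) -> bool:
        """Return True if the file was new, False if it was a duplicate (no-op)."""
        digest = content_hash(data)
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True
```

Hashing the bytes rather than the filename means a renamed copy of the same file is still recognized as a duplicate.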

Risk: Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let user override.
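
The "conservative type inference (default to string)" mitigation can be illustrated with a small sketch. The function name and the three-type vocabulary are hypothetical; the real ingestd rules are not described in this document.

```python
from typing import Iterable, Optional


def infer_column_type(values: Iterable[Optional[str]]) -> str:
    """Return 'int', 'float', or 'string'; anything ambiguous stays 'string'."""
    seen_value = False
    all_int = True
    all_float = True
    for raw in values:
        if raw is None or raw.strip() == "":
            continue  # nulls and blanks do not vote on the type
        seen_value = True
        text = raw.strip()
        try:
            int(text)
        except ValueError:
            all_int = False
        try:
            # Note: Python's float() also accepts "nan" and "inf";
            # a production parser would likely be stricter.
            float(text)
        except ValueError:
            all_float = False
    if not seen_value:
        return "string"  # no evidence at all: fall back to string
    if all_int:
        return "int"
    if all_float:
        return "float"
    return "string"
```

A single non-numeric value such as "n/a" demotes the whole column to string, which loses type precision but never rejects or corrupts a row.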

Phase 7: Vector Index + RAG Pipeline

Make unstructured data searchable by meaning, not just keywords.

Step	Deliverable	Gate
7.1	`vectord` crate with embedding storage as Parquet (doc_id, chunk_text, vector)	Embeddings stored as portable Parquet
7.2	Chunking strategy — configurable chunk size + overlap for text columns	Large text fields split into embeddable chunks
7.3	Brute-force vector search via DataFusion (cosine similarity SQL)	Semantic search works, correctness verified
7.4	HNSW index for fast approximate nearest neighbor	Search over 100K+ vectors in <50ms
7.5	RAG endpoint: `POST /rag` — question → embed → search → retrieve → generate	Natural language question → grounded answer
7.6	Auto-embed on ingest — text columns automatically embedded during ingest	No separate embedding step needed
7.7	Hybrid search — combine SQL filters with vector similarity	"Java devs in Chicago" (SQL) + "who could do data engineering" (semantic)

Gate: Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with LLM explanation.
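
Step 7.2 (configurable chunk size + overlap) can be sketched as a sliding window. This version counts characters, which is an assumption; the PRD does not say whether chunks are measured in characters, words, or tokens.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of `chunk_size` that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 2)` yields four chunks, each sharing two characters with its neighbor, so a phrase that straddles a boundary still appears whole in one chunk.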

Risk: HNSW in Rust at scale. This is the hardest technical problem. Options:

hora crate — Rust-native ANN, but less mature than FAISS
Store HNSW index as a serialized file alongside Parquet data
Fallback: brute-force scan is fine up to ~100K vectors; optimize later
Nuclear option: use Qdrant as an external vector store (breaks "no new services" rule)

Decision needed: Evaluate hora vs external Qdrant vs brute-force at J's data scale.
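
Whichever option wins, the brute-force scan remains useful as ground truth for measuring recall of an approximate index. A minimal pure-Python sketch, with hypothetical function names and no claim to match vectord's internals:

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Exact nearest neighbors: score every vector, keep the k best indices."""
    order = sorted(
        range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True
    )
    return order[:k]


def recall_at_k(approx: Sequence[int], exact: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate result also returned."""
    if not exact:
        return 1.0
    return len(set(approx) & set(exact)) / len(exact)
```

Recall@k compares result sets rather than order, so an approximate index that returns the right neighbors in a different order still scores 1.0.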

Phase 8: Hot Cache + Incremental Updates

Make frequently-accessed data fast, and handle real-time updates without full rewrite.

Step	Deliverable	Gate
8.1	MemTable hot cache — pin active datasets in memory	Queries on hot data: <10ms
8.2	Cache policy — LRU eviction based on access patterns	Memory-bounded, auto-manages
8.3	Incremental writes — append new rows without rewriting entire Parquet file	Update one candidate's phone → no full table rewrite
8.4	Merge-on-read — query combines base Parquet + delta files	Correct results from base + updates
8.5	Compaction — periodic merge of delta files into base Parquet	Prevent delta file proliferation
8.6	Upsert semantics — insert or update by primary key	Same candidate ID → update in place

Gate: Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.

Risk: This is the Delta Lake problem. Full ACID transactions over Parquet files are what Databricks spent years building. We're NOT building Delta Lake; we're building a pragmatic version:

Append-only delta files (easy)
Merge-on-read (moderate)
Compaction (moderate)
Full ACID isolation (NOT attempting — single-writer model instead)
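
The append-only delta plus merge-on-read combination can be sketched with plain dicts standing in for Parquet rows. The `_deleted` tombstone field and the function name are hypothetical, and the sketch assumes delta batches arrive oldest first from a single writer.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def merge_on_read(base: Iterable[Row], deltas: Iterable[Iterable[Row]], key: str) -> List[Row]:
    """Overlay delta batches (oldest first) on base rows; last write per key wins."""
    merged: Dict[Any, Row] = {row[key]: dict(row) for row in base}
    for batch in deltas:
        for row in batch:
            if row.get("_deleted"):
                # Hypothetical tombstone marker: drop the row from the view.
                merged.pop(row[key], None)
            else:
                # Upsert: new keys are inserted, existing keys get fields overwritten.
                merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())
```

The base file is never rewritten at query time; compaction would periodically fold the deltas into a new base using the same merge.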

Phase 8.5: Agent Workspaces ✅ COMPLETE

Per-contract overlays with daily/weekly/monthly tiers and instant handoff.

WorkspaceManager with saved searches, shortlists, activity logs
Zero-copy handoff between agents (pointer swap, not data copy)
Persisted to object storage, rebuilt on startup

Phase 9: Event Journal — Never Destroy Evidence

Principle: Every mutation is appended, never overwritten. This is the one decision that's impossible to retrofit — once history is lost, it's gone forever.

Step	Deliverable	Gate
9.1	`journald` crate: append-only event log as Parquet	Every write/update/delete logged with who, when, what, old value, new value
9.2	Event schema: entity, field, old_value, new_value, actor, timestamp, source, workspace_id	Standardized across all mutations
9.3	Journal query: `SELECT * FROM journal WHERE entity = 'CAND-001' ORDER BY timestamp`	Full history of any record
9.4	Replay capability: rebuild any dataset's state at any point in time	Time-travel queries
9.5	Journal compaction: roll old events into monthly summary Parquet files	Prevent unbounded growth

Gate: Change a candidate's phone number. Query shows the change. Journal shows old value, new value, who changed it, when, and why. Replay to yesterday's state.
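
The replay step in this gate can be sketched as folding events up to a cutoff. The `Event` tuple below uses a subset of the 9.2 schema (source and workspace_id are omitted), and integer timestamps are an assumption made for brevity.

```python
from typing import Any, Dict, Iterable, NamedTuple


class Event(NamedTuple):
    entity: str
    field: str
    old_value: Any
    new_value: Any
    actor: str
    timestamp: int  # assumption: epoch seconds


def replay(events: Iterable[Event], as_of: int) -> Dict[str, Dict[str, Any]]:
    """Rebuild entity state by applying every event with timestamp <= as_of."""
    state: Dict[str, Dict[str, Any]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.timestamp > as_of:
            break  # events are sorted, so nothing later can apply
        state.setdefault(ev.entity, {})[ev.field] = ev.new_value
    return state
```

Because events are only ever appended, "yesterday's state" is the same fold with an earlier cutoff.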

Why now: In 3 years, compliance, AI auditability, and "why did the agent recommend this candidate" all require mutation history. Adding it later means you only have history from that day forward.

Phase 10: Rich Catalog v2 — Metadata as Product

Principle: Every dataset should be self-describing. A new team member (or AI agent) should understand what data exists, who owns it, how fresh it is, and what's sensitive — without asking anyone.

Step	Deliverable	Gate
10.1	Catalog schema upgrade: add owner, sensitivity, freshness_sla, description, tags, lineage	`GET /catalog/datasets` returns rich metadata
10.2	Sensitivity classification: PII, PHI, financial, public, internal	Sensitive fields tagged at ingest
10.3	Lineage tracking: source_system → ingest_job → dataset → derived_dataset	"Where did this data come from?" answerable
10.4	Freshness contracts: expected_update_frequency, last_updated, stale_after	Alert when data goes stale
10.5	Dataset contracts: required columns, type expectations, validation rules	Ingest rejects data that breaks the contract
10.6	Auto-documentation: AI generates dataset description from schema + sample data	New datasets self-describe on ingest

Gate: Ingest a CSV. System auto-detects PII columns (email, phone, SSN patterns), tags them, generates a description, sets owner, and tracks lineage back to the source file.
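
Pattern-based PII detection for this gate might be sketched as below. The regular expressions are simplified, US-centric assumptions, and the 0.8 match threshold and function name are hypothetical; none of them are taken from the shipped detector.

```python
import re
from typing import Dict, Iterable, Set

# Simplified, US-style patterns; assumptions of this sketch only.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def detect_pii(columns: Dict[str, Iterable[str]], threshold: float = 0.8) -> Dict[str, Set[str]]:
    """Tag a column when at least `threshold` of its non-blank values match a pattern."""
    tags: Dict[str, Set[str]] = {}
    for name, values in columns.items():
        sample = [v.strip() for v in values if v and v.strip()]
        if not sample:
            continue
        for label, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in sample if pattern.match(v))
            if hits / len(sample) >= threshold:
                tags.setdefault(name, set()).add(label)
    return tags
```

Matching on values rather than column names catches PII in columns with unhelpful headers, at the cost of occasional false positives on look-alike data.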

Why now: Every dataset you ingest without metadata becomes a "mystery file" in 6 months. The metadata layer makes the difference between a searchable knowledge platform and a data graveyard.

Phase 11: Embedding Versioning — Model-Proof Vector Layer

Principle: Embedding models will change. If you don't track which model created which vectors, upgrading means re-embedding everything from scratch.

Step	Deliverable	Gate
11.1	Vector index metadata: model_name, model_version, dimensions, created_at	Every index knows its embedding model
11.2	Multi-version indexes: same data, different models, coexist	Search specifies which model version
11.3	Incremental re-embed: only new/changed docs get re-embedded on model upgrade	Model swap doesn't require full re-embed
11.4	A/B search: query both old and new model, compare results	Validate model upgrade before committing

Gate: Upgrade from nomic-embed-text to a new model. Old index still works. New index builds incrementally. Compare search quality. Switch when ready.

Phase 12: Tool Registry — Agent-Safe Business Actions

Principle: In 3 years, AI agents won't just query — they'll act. Instead of every agent getting raw SQL access, expose named, governed, audited business actions.

Step	Deliverable	Gate
12.1	Tool definition: name, description, parameters, permissions, audit_level	`search_candidates(skills, city, min_years)` as a registered tool
12.2	Tool execution: validates params, checks permissions, logs usage, runs query	Agent calls tool, gets results, action is logged
12.3	Read vs write tools: read tools are permissive, write tools require confirmation	`get_candidate` = auto-approved, `update_phone` = requires review
12.4	MCP-compatible interface: expose tools via Model Context Protocol	Any MCP-compatible agent (Claude, GPT, local) can use them
12.5	Rate limiting + quotas per agent/tool	Prevent runaway agent from overwhelming the system

Gate: An AI agent calls search_candidates(skills="Python,AWS", city="Chicago", available=true) → gets results → calls shortlist_candidate(workspace_id, candidate_id, reason) → action is logged, auditable, reversible.

Why now: The tool interface is cheap to build (it's just named endpoints with validation). But retrofitting audit logging and permission checks onto raw SQL access is a nightmare. Build the governed interface first.
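
A governed tool call (validate parameters, check permission, log usage, run) can be sketched in a few lines. The `Tool` and `ToolRegistry` classes, the `writes` flag, and the `confirmed` argument are all hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    params: Dict[str, type]  # parameter name -> required Python type
    handler: Callable[..., Any]
    writes: bool = False  # write tools need explicit confirmation


@dataclass
class ToolRegistry:
    tools: Dict[str, Tool] = field(default_factory=dict)
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, agent: str, tool_name: str, confirmed: bool = False, **kwargs: Any) -> Any:
        tool = self.tools[tool_name]
        for param, expected in tool.params.items():
            if not isinstance(kwargs.get(param), expected):
                raise TypeError(f"{tool_name}: {param} must be {expected.__name__}")
        if tool.writes and not confirmed:
            raise PermissionError(f"{tool_name} is a write tool and needs confirmation")
        result = tool.handler(**kwargs)
        # Only calls that actually ran are recorded in this sketch.
        self.audit_log.append({"agent": agent, "tool": tool_name, "args": kwargs})
        return result
```

Because every call passes through `call`, validation, permission checks, and audit logging sit in one place instead of being retrofitted onto raw SQL access.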

Phase 13: Security & Access Control

Step	Deliverable	Gate
13.1	Field-level sensitivity tags (PII, PHI, financial) in catalog	Sensitive fields identified
13.2	Row-level access policies (agent A sees their candidates only)	Policy evaluated at query time
13.3	Column masking (show last 4 of SSN, redact salary for non-managers)	Masked results based on role
13.4	Query audit log (who queried what, when, which fields)	Every data access recorded
13.5	Policy-as-code (TOML/YAML rules, not hardcoded)	Non-engineer can update access rules
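
Column masking driven by policy data rather than hardcoded branches (steps 13.3 and 13.5) could be sketched like this. The policy dict stands in for a parsed TOML/YAML file; its keys (`mask`, `unless_role`) and the rule names are hypothetical.

```python
from typing import Any, Dict

# Stand-in for rules loaded from a TOML/YAML policy file (hypothetical schema).
POLICY: Dict[str, Dict[str, str]] = {
    "ssn": {"mask": "last4"},
    "salary": {"mask": "redact", "unless_role": "manager"},
}


def apply_policy(row: Dict[str, Any], role: str, policy: Dict[str, Dict[str, str]] = POLICY) -> Dict[str, Any]:
    """Return a copy of `row` with masked columns according to `policy`."""
    out = dict(row)
    for column, rule in policy.items():
        if column not in out:
            continue
        if "unless_role" in rule and rule["unless_role"] == role:
            continue  # this role is exempt from the rule
        if rule["mask"] == "last4":
            out[column] = "***-**-" + str(out[column])[-4:]
        elif rule["mask"] == "redact":
            out[column] = None
    return out
```

Since the rules are plain data, changing who may see a column is an edit to the policy file, not a code change.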

Phase 14: Schema Evolution + AI Migration

Step	Deliverable	Gate
14.1	Schema diff detection: old schema vs new ingest → list changes	"Column renamed: first_name → full_name"
14.2	AI-generated migration rules: LLM suggests column mappings	"full_name = concat(first_name, ' ', last_name)"
14.3	Migration preview: show how old data maps to new schema before applying	Human approves before data transforms
14.4	Versioned schemas in catalog: v1, v2, v3 coexist	Queries specify version or use latest
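
Step 14.1 (schema diff detection) reduces to comparing two column-to-type maps. In this sketch a schema is a plain dict and the message wording is an assumption. A rename shows up as one removal plus one addition; pairing them into "renamed" is left to the mapping step in 14.2.

```python
from typing import Dict, List


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List removed columns, type changes, then added columns."""
    changes: List[str] = []
    for name in old:
        if name not in new:
            changes.append(f"Column removed: {name}")
        elif old[name] != new[name]:
            changes.append(f"Type changed: {name} {old[name]} -> {new[name]}")
    for name in new:
        if name not in old:
            changes.append(f"Column added: {name}")
    return changes
```

An empty result means the new ingest is schema-compatible and needs no migration preview.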

Phase 15: Infrastructure horizon items

HNSW vector index with trial system (shipped 2026-04-16)
Federation foundation — ADR-017 (shipped 2026-04-16)
Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
Federation layer 2 — X-Lakehouse-Bucket middleware, catalog migration, cross-bucket SQL in queryd
PDF OCR for scanned documents (Tesseract integration)
Scheduled ingest (cron-based file watching, S3 event triggers)
Multi-node query distribution (DataFusion supports this architecturally)

Phase 16: Hot-Swap Index Generations

Make indexes upgrade-in-place without dropping queries.

Step	Deliverable	Gate
16.1	"Active generation" pointer per logical index name	`/vectors/search` routes to current champion automatically
16.2	Background trial runner: watches trial journal, proposes configs (random search / Bayesian), fires `/hnsw/trial`	Agent autonomously tunes without human POSTing each config
16.3	Promotion endpoint: `POST /hnsw/promote/{index}/{trial_id}` atomically swaps active pointer	Next search hits new config, zero downtime
16.4	Rollback: `POST /hnsw/rollback/{index}` reverts to previous generation	Bad promotion recoverable in milliseconds
16.5	Dataset-append triggers: when `POST /ingest/file` writes to a dataset with attached vector indexes, schedule automatic re-trial (not full rebuild)	New docs get embedded + indexed without manual intervention

Gate: Run the trial agent for 10 minutes against resumes_100k_v2 with a fresh eval set. It explores the ef_construction × ef_search space, promotes the Pareto winner, continues running. Zero human clicks. All trials and promotions appear in /hnsw/trials/resumes_100k_v2.

Risk: Agent loops into a bad region (e.g. always proposes ef_construction=1). Mitigation: a hardcoded config space constraint + minimum-quality gate (don't promote anything with recall <0.9).
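
The active-generation pointer, the minimum-quality gate, and rollback fit in one small class. This is a sketch with hypothetical names; the only value taken from the text above is the 0.9 recall floor.

```python
import threading
from typing import List, Optional


class GenerationPointer:
    """Tracks which trial is the active index generation for one logical index."""

    def __init__(self, min_recall: float = 0.9) -> None:
        self.min_recall = min_recall
        self._history: List[str] = []  # oldest first; last entry is active
        self._lock = threading.Lock()

    def active(self) -> Optional[str]:
        with self._lock:
            return self._history[-1] if self._history else None

    def promote(self, trial_id: str, recall: float) -> bool:
        """Swap the active pointer unless the trial fails the quality gate."""
        if recall < self.min_recall:
            return False
        with self._lock:
            self._history.append(trial_id)
        return True

    def rollback(self) -> Optional[str]:
        """Revert to the previous generation; the first generation is kept."""
        with self._lock:
            if len(self._history) > 1:
                self._history.pop()
            return self._history[-1] if self._history else None
```

Promotion and rollback only move a pointer under a lock, so searches never wait on an index build; the build itself happens before `promote` is called.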

Phase 17: Model Profiles + Dataset Bindings

Make "different models see different data" real instead of a config string.

Step	Deliverable	Gate
17.1	`ModelProfile` manifest: id, ollama_name, bucket, bound_datasets[], hnsw_config, embed_model	`GET /models` lists profiles; `POST /models` creates one
17.2	Profile activation endpoint: `POST /profile/{id}/activate` — warms EmbeddingCache for bound indexes, builds HNSW with profile's config	Next search against bound indexes is <1ms cold
17.3	Model-scoped search: `POST /search?model=X` filters to bound datasets only	Model A can't see Model B's datasets unless explicitly shared
17.4	VRAM-aware activation: only one (or small N) model loaded at a time on 16GB A4000	Activating model B unloads model A via Ollama's keep_alive=0
17.5	Audit: every tool invocation by a model is logged with model identity	`GET /models/{id}/audit` shows exactly what each model touched

Gate: Two model profiles defined: staffing-recruiter (bound to candidates/placements/timesheets) and docs-assistant (bound to a documentation corpus). Activate staffing-recruiter, search for candidates — works. Switch to docs-assistant, same search — returns zero from staffing (not bound) but finds docs. VRAM shows only one embedding model loaded at a time.

VRAM reality: 16GB A4000 realistically holds 1-2 loaded models concurrently. "Multi-model" in practice means sequential swap between profiles, not parallel serving. The profile abstraction makes this swap clean.
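
Model-scoped search from the gate above can be illustrated with a filter over dataset bindings. The sketch filters hits after the fact for brevity; a real implementation would more likely restrict which indexes are searched in the first place. The `ModelProfile` fields shown are a subset of the 17.1 manifest, and the hit layout is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class ModelProfile:
    id: str
    bound_datasets: FrozenSet[str]


def scoped_search(profile: ModelProfile, hits: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only hits whose dataset is bound to the given profile."""
    return [h for h in hits if h["dataset"] in profile.bound_datasets]
```

With a `staffing-recruiter` profile bound to candidates, placements, and timesheets, a hit from a documentation dataset is dropped, and the reverse holds for `docs-assistant`.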

Phase 18: Storage format decision (Lance evaluation)

This question was raised on 2026-04-16, after J's LLMS3 knowledge base identified Lance as an alternative to Parquet for vector workloads. The current stack is Parquet with binary-blob vector columns plus an in-RAM HNSW sidecar. It is evaluated against Lance's native vector format with disk-resident indexes.

Step	Deliverable	Decision criteria
18.1	Parallel Lance-backed vector index for `resumes_100k_v2` behind feature flag	Both implementations coexist, benchmarkable
18.2	Head-to-head benchmark: cold-load, search latency, disk size, append cost	See criteria below
18.3	ADR-019 documenting the decision with measured data	Commit or reject with evidence

Decision rules:

Lance wins on cold-load by ≥2× AND matches search latency → migrate vector layer to Lance. Dataset Parquet stays.
Lance is within 50% of current → stay on current stack, document ceiling explicitly.
Lance loses → close the door, move on.

Phase 19+: Further horizon

PDF OCR for scanned documents (Tesseract integration)
Specialized fine-tuned models per domain (staffing matcher, resume parser)
Video/audio transcript ingest + multimodal embeddings
True distributed query (DataFusion multi-node) — only if single-machine ceilings bite

Known ceilings (honest)

The current stack has measurable limits. Documenting them so future decisions aren't based on wishful thinking.

Dimension	Current ceiling	Breaks at	Escape hatch
Vector count per index	~5M vectors on 128GB RAM	10M+ (serious web crawl)	Phase 18 Lance migration OR mmap'd embeddings
Concurrent active indexes	~50-100 at 100K vectors each	10M×50 configurations	Lance disk-resident + per-profile activation
Rows per dataset	2.47M proven, probably 100M+ fine	Approaches DataFusion memory limits	DataFusion predicate pushdown + partition pruning (existing)
Concurrent loaded models	1-2 on 16GB VRAM (A4000)	3+ models simultaneous	Not our problem — architectural, driven by Ollama
Trial journal growth per index	Thousands of trials, batched JSONL	High-frequency auto-tuning agent	Compaction via `/hnsw/trials/{idx}/compact`
Error journal growth	Bounded by ring buffer (2000 events in-memory) + batched JSONL on disk	Continuous failure scenarios	Compaction + retention policy (TODO)
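
The in-memory side of the error journal ceiling (a ring buffer of 2000 events) can be sketched with a bounded deque. The class name and the event fields are hypothetical; only the 2000-event capacity comes from the table.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional


class ErrorRing:
    """Keeps only the most recent `capacity` error events in memory."""

    def __init__(self, capacity: int = 2000) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._events: Deque[Dict[str, Any]] = deque(maxlen=capacity)

    def record(self, event: Dict[str, Any]) -> None:
        self._events.append(event)

    def recent(self, bucket: Optional[str] = None) -> List[Dict[str, Any]]:
        """Return buffered events, oldest first, optionally filtered by bucket."""
        return [e for e in self._events if bucket is None or e.get("bucket") == bucket]
```

Memory use stays constant under a continuous failure storm; anything older than the window has to come from the batched JSONL on disk.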

Reference Workloads

Workload 1: Staffing Company

Scale-tested on 128GB RAM server:

Table	Rows	Size	Description
candidates	100,000	10.1 MB	Names, phones, emails, zip, skills, resume text
clients	2,000	33 KB	Companies, contacts, verticals
job_orders	15,000	0.9 MB	Positions with descriptions, requirements, rates
placements	50,000	1.2 MB	Candidate↔job matches with rates, recruiters
timesheets	1,000,000	16.7 MB	Weekly hours, bill/pay totals, approvals
call_log	800,000	34.3 MB	Phone CDR — who called whom, duration, disposition
email_log	500,000	16.0 MB	Email tracking — subject, opened, direction
Total	2,467,000	79 MB	7 tables, cross-referenced

Benchmarks (2.47M rows)

Query	Cold (Parquet)	Hot (MemCache)	Speedup
100K candidate filter (skills+city+status)	257ms	21ms	12x
1M timesheet aggregation + JOIN	942ms	96ms	9.8x
800K call log cross-reference (cold leads)	642ms	—	—
Triple JOIN recruiter performance	487ms	—	—
500K email open rate aggregation	259ms	—	—
COUNT all 2.47M rows	84ms	—	—
10K vector semantic search (cosine)	450ms	—	—
Natural language → AI SQL → execute	~3s	—	(model inference)

Vector Search

10K candidate resumes embedded in 204s (49 chunks/sec via Ollama)
Semantic search over 10K vectors: ~450ms (brute-force cosine)
RAG pipeline: question → embed → search → retrieve → LLM answer with citations
AI correctly refuses to hallucinate when context doesn't support an answer

Agent Workspaces

Create per-contract workspace with saved searches + shortlists
Instant handoff between agents — zero data copy
Full activity timeline preserved across handoffs

Workload 2: Local LLM Knowledge Base

The second use case this substrate is built for. Reference corpus: the running knowledge_base Postgres database (586 team runs, response cache history, pipeline runs, threat intel) + LLMS3.com published corpus (~243 enriched documents).

Target scale on same 128GB server:

Documents: 10K-100K per model profile
Chunks after chunking: 500K-5M per profile
Embedding dimensions: 768 (nomic-embed-text)
Query latency: <100ms semantic search, <3s end-to-end RAG including LLM generation
Concurrent model profiles: 2-5 configured, 1-2 active at a time (VRAM-bound)

Measured to date (Phase 7 + Phase 16 prep):

100K candidate-resume chunks embedded in 10 min via Ollama nomic-embed-text
HNSW search at 100% recall, ~1ms p50 on 100K vectors (ec=80 es=30 locked as default)
Trial journal instrumented and working for parameter tuning

Gaps still to close for this workload:

Model profiles (Phase 17) — today, "model" is a string, not a first-class entity
Hot-swap generations (Phase 16) — today, rebuild = downtime
Scale past 5M vectors — needs Phase 18 Lance evaluation to decide path

Available Local Models

Model	Use
`nomic-embed-text`	Embeddings (768d) — semantic search, RAG retrieval
`qwen2.5`	SQL generation, structured output, summarization
`mistral`	General generation, longer context
`gemma2`	General generation
`llama3.2`	General generation, lightweight

Non-Goals

Cloud deployment (local-first, always)
Full ACID transactions (single-writer model is sufficient)
Real-time streaming / CDC (batch ingest is the model; scheduled refresh, not transactional replication)
Replacing the CRM (this is the analytical + AI layer BEHIND the CRM)
Custom file formats: Parquet for datasets + sidecar indexes for vectors (the Parquet-vs-Lance question for the vector layer is evaluated in Phase 18 and will be recorded in ADR-019; see Known ceilings for the limits the current choice implies)
Hard multi-tenant isolation (profiles and federation provide soft isolation; this is not a SaaS platform with adversarial tenants — operator is single-trust)

Removed from prior non-goals (2026-04-16):

~~Multi-tenancy (single-owner system)~~: federation + profile buckets are now first-class, and soft multi-tenancy is a design goal. Hard multi-tenancy (adversarial tenants on shared infrastructure) remains out of scope.

Risks

Technical Risks

Risk	Severity	Mitigation
Vector search in Rust at scale	High	Start brute-force, evaluate `hora` crate, Qdrant as fallback
Incremental updates on Parquet	High	Delta files + merge-on-read, NOT full Delta Lake
Legacy data messiness	High	Conservative schema detection, default to string, user overrides
100K+ embedding timeout	High	Async background job with progress, not single HTTP request
Schema evolution across ingests	Medium	Schema fingerprinting + versioned manifests (Phase 14)
Memory pressure from hot cache	Medium	LRU eviction, configurable memory limit (tested: 408MB for 1.1M rows)
HNSW index persistence	Medium	Serialize alongside Parquet, rebuild on startup
Python sidecar as bottleneck	Low	Can replace with direct Ollama HTTP from Rust later

Strategic Risks (Future-Proofing)

Risk	Impact	Phase
No mutation history → can't audit AI decisions	Critical — compliance, trust	Phase 9 (event journal)
No metadata → datasets become mystery files	High — onboarding, discovery	Phase 10 (rich catalog)
Embeddings locked to one model	High — can't upgrade models	Phase 11 (versioning)
Raw SQL as only interface → ungoverned agent access	High — security, auditability	Phase 12 (tool registry)
No sensitivity classification → compliance exposure	Medium — grows with data volume	Phase 13 (access control)
No schema evolution handling → ingest breaks on format change	Medium — grows with source count	Phase 14 (AI migration)

Design Principles (Future-Proofing)

These are the decisions that still look smart after the stack changes:

Store the truth openly. Parquet on object storage. No proprietary formats. Any engine can read it.
Describe it richly. Every dataset has an owner, lineage, sensitivity tags, freshness contract.
Never destroy evidence. Every mutation is journaled. Rebuild any state at any point in time.
Secure it centrally. Permissions live in the data layer, not application code.
Expose it through reusable interfaces. Named tools with contracts, not raw SQL for every consumer.
Version everything. Schemas, embeddings, models — all versioned, all coexist during migration.
Make unstructured data first-class. Every document gets: storage, text extraction, entity tags, chunks, embeddings, linkage.
Separate storage from compute from intelligence. Scale each independently. Replace any layer without touching the others.

Operating Rules

PRD > architecture > phases > status > git
Git is memory, not chat
No undocumented changes
No silent architecture drift
Always work in smallest valid step
Always verify before moving on
Flag when something is genuinely hard vs just engineering work
If a phase reveals the approach is wrong, update the PRD before continuing
Cheap-now, expensive-later decisions get built first (event journal, metadata, versioning)
Build the governed interface before the raw interface (tools before SQL for agents)
