Phase 21 — Rust port of scratchpad + tree-split primitives (companion to
the 2026-04-21 TS shipment). New crates/aibridge modules:
context.rs — estimate_tokens (chars/4 ceil), context_window_for,
assert_context_budget returning a BudgetCheck with
numeric diagnostics on both success and overflow.
Windows table mirrors config/models.json.
continuation.rs — generate_continuable<G: TextGenerator>. Handles the
two failure modes: empty-response from thinking
models (geometric 2x budget backoff up to budget_cap)
and truncated-non-empty (continuation with partial
as scratchpad). is_structurally_complete balances
braces then JSON.parse-checks. Guards the degen case
"all retries empty, don't loop on empty partial".
tree_split.rs — generate_tree_split map->reduce with running
scratchpad. Per-shard + reduce-prompt go through
assert_context_budget first; loud-fails rather than
silently truncating. Oldest-digest-first scratchpad
truncation at scratchpad_budget (default 6000 t).
TextGenerator trait (native async-fn-in-trait, edition 2024). AiClient
implements it; ScriptedGenerator test double lets tests inject canned
sequences without a live Ollama.
GenerateRequest gained think: Option<bool> — forwards to sidecar for
per-call hidden-reasoning opt-out on hot-path JSON emitters. Three
existing callsites updated (rag.rs x2, service.rs hybrid answer).
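
A minimal sketch of the chars/4 estimate and the budget check described above. The BudgetCheck field names and the Result-based signature are assumptions for illustration, not the actual context.rs API.

```rust
/// Rough token estimate: one token per four characters, rounded up.
pub fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

/// Numeric diagnostics returned on both success and overflow.
/// Field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct BudgetCheck {
    pub estimated_tokens: usize,
    pub window: usize,
    /// Tokens over the window; 0 when the prompt fits.
    pub overflow: usize,
}

/// Returns Ok when the prompt fits the window and Err otherwise, so the
/// caller can fail loudly instead of silently truncating.
pub fn assert_context_budget(prompt: &str, window: usize) -> Result<BudgetCheck, BudgetCheck> {
    let estimated_tokens = estimate_tokens(prompt);
    let check = BudgetCheck {
        estimated_tokens,
        window,
        overflow: estimated_tokens.saturating_sub(window),
    };
    if check.overflow == 0 { Ok(check) } else { Err(check) }
}
```

Returning the same struct on both paths keeps the diagnostics available to logging regardless of outcome.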
Phase 27 — Playbook versioning. PlaybookEntry gained four optional
fields (all #[serde(default)] so pre-Phase-27 state loads as roots):
version u32, default 1
parent_id Option<String>, previous version's playbook_id
superseded_at Option<String>, set when newer version replaces
superseded_by Option<String>, the playbook_id that replaced
New methods:
revise_entry(parent_id, new_entry) — appends new version, stamps
superseded_at+superseded_by on parent, inherits parent_id and sets
version = parent + 1 on the new entry. Rejects revising a retired
or already-superseded parent (tip-of-chain is the only valid
revise target).
history(playbook_id) — returns full chain root->tip from any node.
Walks parent_id back to root, then superseded_by forward to tip.
Cycle-safe.
Superseded entries excluded from boost (same rule as retired): filter
in compute_boost_for_filtered_with_role (both active-entries prefilter
and geo-filtered path), rebuild_geo_index, and upsert_entry's existing-
idx search. status_counts returns (total, retired, superseded, failures);
/status JSON reports active = total - retired - superseded.
Endpoints:
POST /vectors/playbook_memory/revise
GET /vectors/playbook_memory/history/{id}
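
The chain walk behind history(playbook_id) could look roughly like the sketch below. It assumes entries are held in a map keyed by playbook_id and trims PlaybookEntry to the fields the walk needs; both are simplifications, not the vectord layout.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone)]
pub struct PlaybookEntry {
    pub playbook_id: String,
    pub version: u32,
    pub parent_id: Option<String>,
    pub superseded_by: Option<String>,
}

/// Returns the chain root->tip for any node in it. Unknown ids give an
/// empty chain. Visited sets stop the walk if the links form a cycle.
pub fn history<'a>(
    entries: &'a HashMap<String, PlaybookEntry>,
    start: &str,
) -> Vec<&'a PlaybookEntry> {
    let mut current = match entries.get(start) {
        Some(entry) => entry,
        None => return Vec::new(),
    };
    // Walk parent_id back to the root.
    let mut seen_back: HashSet<&str> = HashSet::new();
    seen_back.insert(current.playbook_id.as_str());
    while let Some(parent) = current.parent_id.as_deref().and_then(|id| entries.get(id)) {
        if !seen_back.insert(parent.playbook_id.as_str()) {
            break;
        }
        current = parent;
    }
    // Walk superseded_by forward to the tip.
    let mut chain = vec![current];
    let mut seen_fwd: HashSet<&str> = HashSet::new();
    seen_fwd.insert(current.playbook_id.as_str());
    while let Some(next) = current.superseded_by.as_deref().and_then(|id| entries.get(id)) {
        if !seen_fwd.insert(next.playbook_id.as_str()) {
            break;
        }
        chain.push(next);
        current = next;
    }
    chain
}
```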
Doc-sync — PHASES.md + PRD.md drifted from git after Phases 24-26
shipped. Fixes applied:
- Phase 24 marked shipped (commit b95dd86) with detail of observer
HTTP ingest + scenario outcome streaming. PRD "NOT YET WIRED"
rewritten to reflect shipped state.
- Phase 25 (validity windows, commit e0a843d) added to PHASES +
PRD.
- Phase 26 (Mem0 upsert + Letta hot cache, commit 640db8c) added.
- Phase 27 entry added to both docs.
- Phase 19.6 time decay corrected: was documented as "deferred",
actually wired via BOOST_HALF_LIFE_DAYS = 30.0 in playbook_memory.rs.
- Phase E/Phase 8 tombstone-at-compaction limit note updated —
Phase E.2 closed it.
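
For reference, a half-life constant like BOOST_HALF_LIFE_DAYS usually implies exponential decay. The sketch below assumes that form; the function name and the exact formula are illustrative and not taken from playbook_memory.rs.

```rust
pub const BOOST_HALF_LIFE_DAYS: f64 = 30.0;

/// Halves the boost every BOOST_HALF_LIFE_DAYS days.
/// Negative ages are treated as zero (no decay).
pub fn decayed_boost(base_boost: f64, age_days: f64) -> f64 {
    base_boost * 0.5_f64.powf(age_days.max(0.0) / BOOST_HALF_LIFE_DAYS)
}
```

Under this assumption a playbook that is 30 days old contributes half its original boost, and one that is 60 days old contributes a quarter.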
Tests: 8 new version_tests in vectord (chain-metadata stamping,
retired/superseded parent rejection, boost exclusion, history from
root/tip/middle, legacy default round-trip, status counts). 25 new
aibridge tests (context/continuation/tree_split). Workspace total
145 green (was 120).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores
Status: Active — Phases 0-18 shipped; hybrid SQL+Vector search operational; IVF_PQ recall tuned to 1.000 at p50 ≈ 7.4ms via nprobes+refine; autonomous agent rotates across full index portfolio; cron-scheduled ingest; eval federation complete
Created: 2026-03-27
Last updated: 2026-04-20 — portfolio-wide autotune, real cron, evals federation, bucket-migrate, IVF_PQ recall 0.805 → 1.000
Owner: J
Problem
Use case 1 — Staffing analytics (reference implementation)
Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.
A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.
Use case 2 — Local AI knowledge substrate (the second half)
Local LLM workloads need a substrate for ingesting, indexing, and retrieving large knowledge corpora. Each running model (or agent) has its own context — documents it cares about, a vector index tuned to its domain, a scoped view of the catalog. That infrastructure is architecturally identical to the staffing problem: ingest messy data, index it, query it, hand it to an AI. Building one substrate that serves both prevents fragmentation.
Concretely this means a running Ollama model like qwen2.5:7b or claude-code-local should be able to:
- Bind to a named set of datasets
- Get a scoped vector index pre-warmed for its domain
- Issue searches that only see its bound data
- Have its trial/tuning history isolated from other models
- Swap between knowledge generations (today's, yesterday's) without rebuild
The same infrastructure that lets a recruiter query 2.47M rows of staffing data also lets a local 7B model answer questions grounded in a 500K-chunk documentation corpus. Same substrate, different tenant.
Shared requirements
- Any data source (CSV, DB export, PDF, JSON, Postgres table) can be ingested without pre-defined schemas
- Structured data is queryable via SQL at scale (millions of rows, sub-second)
- Unstructured data is searchable via AI embeddings with per-profile indexes
- An LLM can answer natural language questions against scoped data
- Indexes can be hot-swapped between generations without rebuild downtime
- Trials are first-class data — the system remembers how it was tuned
- Everything runs locally — no cloud APIs, total data privacy
- The system is rebuildable from repository + object storage alone
Solution
A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.
Locked Stack
| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow object_store | Yes |
| Storage Backend | LocalFileSystem → RustFS → S3 | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Python FastAPI sidecar → Ollama HTTP API | Yes |
| Vector Index | TBD — evaluate hora, qdrant crate, or HNSW from scratch | Open |
No new frameworks without documented ADR.
Architecture
Services
| Service | Responsibility |
|---|---|
| gateway | HTTP/gRPC ingress, routing, auth, CORS, body limits, X-Lakehouse-Bucket header routing |
| catalogd | Metadata control plane — dataset registry, schema versions, manifests, per-dataset resync from parquet footers |
| storaged | Object I/O — BucketRegistry (multi-backend), rescue fallback, error journal, append-log batching pattern |
| queryd | SQL execution — DataFusion over Parquet, MemTable hot cache, delta merge-on-read |
| ingestd | Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog |
| vectord | Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness) |
| journald | Append-only mutation event log (ADR-012) — distinct from storaged error journal |
| aibridge | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| ui | Dioxus frontend — Ask, Explore, SQL, System tabs |
| shared | Types, errors, Arrow helpers, config, protobuf definitions, secrets provider trait, PII detection |
Federation building blocks (shipped 2026-04-16):
- shared::secrets::SecretsProvider trait + FileSecretsProvider reading /etc/lakehouse/secrets.toml (0600 enforced)
- storaged::registry::BucketRegistry — multi-bucket resolution with rescue_bucket read fallback
- storaged::append_log::AppendLog — write-once batched append pattern (no RMW, no small-file problem)
- storaged::error_journal::ErrorJournal — bucket operation failure log at primary://_errors/bucket_errors/batch_*.jsonl
Data Flow
Raw data → ingestd (normalize, chunk, detect schema)
├→ storaged (Parquet files to object storage)
├→ catalogd (register dataset + schema)
├→ vectord (embed text chunks, build index)
└→ queryd (auto-register as queryable table)
User question → gateway
├→ vectord (semantic search for relevant chunks) ← RAG path
├→ queryd (SQL over structured data) ← Analytics path
└→ aibridge → Ollama (generate answer from context)
Query Paths
Analytical (SQL): "What's the average bill rate for .NET devs in Chicago?" → DataFusion scans Parquet columnar, returns in <200ms
Semantic (RAG): "Find candidates who could do data engineering work" → Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers
Hybrid (shipped 2026-04-17): "Find reliable forklift operators in Illinois with OSHA certs"
→ POST /vectors/hybrid with sql_filter + question: SQL narrows to structurally-valid candidates (role, state, reliability, certs), brute-force cosine ranks by semantic relevance within the filtered set, LLM generates answer from SQL-verified records only. Zero hallucinations on the staffing simulation (16/16 positions filled, all workers verified against golden data).
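
The ranking step of the hybrid path can be sketched as brute-force cosine similarity over the candidates that survived the SQL filter. The SQL stage is assumed to have already run; rank_filtered and its tuple layout are hypothetical names for illustration.

```rust
/// Cosine similarity; returns 0.0 for mismatched lengths or zero vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Scores every SQL-filtered candidate against the query embedding and
/// keeps the top_k highest-scoring ids.
pub fn rank_filtered<'a>(
    query: &[f32],
    candidates: &'a [(String, Vec<f32>)],
    top_k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = candidates
        .iter()
        .map(|(id, vector)| (id.as_str(), cosine_similarity(query, vector)))
        .collect();
    scored.sort_by(|l, r| r.1.total_cmp(&l.1));
    scored.truncate(top_k);
    scored
}
```

Because the SQL filter shrinks the candidate set first, a linear scan stays cheap and no approximate index is needed for this step.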
Invariants
- Object storage = source of truth for all data
- catalogd = sole metadata authority
- No raw data in catalog — only pointers
- vectord stores embeddings AS Parquet (portable, not a proprietary format) — see ADR-018 for the Parquet-vs-Lance trade review
- ingestd is idempotent — re-ingesting the same file is a no-op
- Hot cache is a performance layer, not a source of truth — eviction is safe
- All services modular and independently replaceable
- Indexes are hot-swappable. A new index generation can be built in the background while the current one serves queries. Promotion is atomic (pointer swap). Rollback to a prior generation is always possible. (Phase 16)
- Every reader gets its own profile. Human operators, AI agents, and local models are all clients of the same substrate. Each has a named profile with its own bucket, vector indexes, trial history, and dataset bindings. Profiles are a first-class architectural concept, not a tenancy afterthought. (Phase 17)
- Trials are data, not logs. Every index build is a trial with measurable metrics. The trial journal IS the agent's memory for how to tune itself. Stored as write-once batched JSONL per the ADR-018 append-log pattern.
- Operational failures are findable in one HTTP call. The bucket error journal, trial journal, and audit log all expose /storage/errors, /hnsw/trials, /access/audit with structured filter + aggregation. No grep archaeology to answer "what broke?"
- Playbooks feed the index, not just the log. A completed playbook isn't just a record of what worked — it's a signal that shapes future rankings. Every successful_playbooks row contributes to the playbook-memory vector index, so semantically-similar future operations re-rank toward workers that have actually succeeded in comparable fills. This is the "system gets smarter over time" dimension that distinguishes this substrate from a static search engine. (Phase 19)
Vision drift acknowledged (2026-04-20)
The system as shipped through Phase 18 is a hybrid SQL+vector search engine with a playbook log. The original pitch (and the "staffing AI co-pilot" framing) implied a meta-index that learns from playbooks over time — hot-swap profiles weren't just routing, they were knowledge generations that compounded. That learning loop was never built; playbooks were write-only. Phase 19 closes that gap explicitly.
The feedback signal is statistical + semantic, not neural. No model training — the index reads the playbook journal, computes operation-similarity, and boosts endorsed workers at query time. Rebuildable from successful_playbooks alone, same as every other derived index.
Phases
Phase 0-5: Foundation ✅ COMPLETE
- Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
- Python sidecar with real Ollama models (embed, generate, rerank)
- Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs
- gRPC, OpenTelemetry, auth middleware, TOML config
- Validated with 286K row staffing company dataset across 7 tables
- Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms
Phase 6: Ingest Pipeline
Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.
| Step | Deliverable | Gate |
|---|---|---|
| 6.1 | ingestd crate with CSV parser → Arrow RecordBatch → Parquet | CSV file → queryable dataset |
| 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file → flat Parquet |
| 6.3 | Schema detection — infer column types from data | No manual schema definition needed |
| 6.4 | Deduplication — detect and skip already-ingested files (content hash) | Re-ingest same file = no-op |
| 6.5 | Text chunking — split large text fields for embedding | Long text → overlapping chunks |
| 6.6 | Auto-registration — ingest writes to storage AND registers in catalog | Single API call: file in → queryable |
| 6.7 | Gateway endpoint: POST /ingest with file upload | Upload CSV from browser → query in seconds |
Gate: Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps.
Risk: Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let user override.
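
The conservative-inference mitigation could be sketched as follows: a column gets a narrow type only when every non-empty sample parses as that type, and anything mixed falls back to string. ColumnType and infer_column_type are hypothetical names, not ingestd's actual API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int64,
    Float64,
    Boolean,
    Utf8,
}

/// Infers a column type from sampled string values.
/// Empty strings count as nulls and are ignored; any disagreement
/// between samples falls back to Utf8.
pub fn infer_column_type(values: &[&str]) -> ColumnType {
    let non_null: Vec<&str> = values
        .iter()
        .map(|v| v.trim())
        .filter(|v| !v.is_empty())
        .collect();
    if non_null.is_empty() {
        return ColumnType::Utf8;
    }
    if non_null.iter().all(|v| v.parse::<i64>().is_ok()) {
        return ColumnType::Int64;
    }
    if non_null.iter().all(|v| v.parse::<f64>().is_ok()) {
        return ColumnType::Float64;
    }
    if non_null
        .iter()
        .all(|v| v.eq_ignore_ascii_case("true") || v.eq_ignore_ascii_case("false"))
    {
        return ColumnType::Boolean;
    }
    ColumnType::Utf8
}
```

A single stray value such as "n/a" is enough to keep the column as a string, which is the safe direction: a user override can narrow it later, while a wrong numeric guess would reject rows.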
Phase 7: Vector Index + RAG Pipeline
Make unstructured data searchable by meaning, not just keywords.
| Step | Deliverable | Gate |
|---|---|---|
| 7.1 | vectord crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet |
| 7.2 | Chunking strategy — configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks |
| 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL) | Semantic search works, correctness verified |
| 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms |
| 7.5 | RAG endpoint: POST /rag — question → embed → search → retrieve → generate | Natural language question → grounded answer |
| 7.6 | Auto-embed on ingest — text columns automatically embedded during ingest | No separate embedding step needed |
| 7.7 | Hybrid search — combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) |
Gate: Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with LLM explanation.
Risk: HNSW in Rust at scale. This is the hardest technical problem. Options:
- hora crate — Rust-native ANN, but less mature than FAISS
- Store HNSW index as a serialized file alongside Parquet data
- Fallback: brute-force scan is fine up to ~100K vectors; optimize later
- Nuclear option: use Qdrant as an external vector store (breaks "no new services" rule)
Decision needed: Evaluate hora vs external Qdrant vs brute-force at J's data scale.
Phase 8: Hot Cache + Incremental Updates
Make frequently-accessed data fast, and handle real-time updates without full rewrite.
| Step | Deliverable | Gate |
|---|---|---|
| 8.1 | MemTable hot cache — pin active datasets in memory | Queries on hot data: <10ms |
| 8.2 | Cache policy — LRU eviction based on access patterns | Memory-bounded, auto-manages |
| 8.3 | Incremental writes — append new rows without rewriting entire Parquet file | Update one candidate's phone → no full table rewrite |
| 8.4 | Merge-on-read — query combines base Parquet + delta files | Correct results from base + updates |
| 8.5 | Compaction — periodic merge of delta files into base Parquet | Prevent delta file proliferation |
| 8.6 | Upsert semantics — insert or update by primary key | Same candidate ID → update in place |
Gate: Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.
Risk: This is the Delta Lake problem. Full ACID transactions over Parquet files is what Databricks spent years building. We're NOT building Delta Lake — we're building a pragmatic version:
- Append-only delta files (easy)
- Merge-on-read (moderate)
- Compaction (moderate)
- Full ACID isolation (NOT attempting — single-writer model instead)
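
Merge-on-read under the single-writer model can be sketched as "base rows first, then each delta in write order, last write per key wins". The Row shape and merge_on_read name are assumptions for illustration; the real path works on Parquet record batches rather than structs.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: String,
    pub phone: String,
}

/// Combines base rows with append-only deltas. Deltas are applied in
/// order, so a later delta overrides earlier values for the same id.
/// The output is sorted by id.
pub fn merge_on_read(base: &[Row], deltas: &[Vec<Row>]) -> Vec<Row> {
    let mut merged: BTreeMap<&str, &Row> = BTreeMap::new();
    for row in base {
        merged.insert(row.id.as_str(), row);
    }
    for delta in deltas {
        for row in delta {
            merged.insert(row.id.as_str(), row);
        }
    }
    merged.into_values().cloned().collect()
}
```

Compaction is then the same merge written back as a new base file, after which the consumed deltas can be dropped.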
Phase 8.5: Agent Workspaces ✅ COMPLETE
Per-contract overlays with daily/weekly/monthly tiers and instant handoff.
- WorkspaceManager with saved searches, shortlists, activity logs
- Zero-copy handoff between agents (pointer swap, not data copy)
- Persisted to object storage, rebuilt on startup
Phase 9: Event Journal — Never Destroy Evidence
Principle: Every mutation is appended, never overwritten. This is the one decision that's impossible to retrofit — once history is lost, it's gone forever.
| Step | Deliverable | Gate |
|---|---|---|
| 9.1 | journald crate: append-only event log as Parquet | Every write/update/delete logged with who, when, what, old value, new value |
| 9.2 | Event schema: entity, field, old_value, new_value, actor, timestamp, source, workspace_id | Standardized across all mutations |
| 9.3 | Journal query: SELECT * FROM journal WHERE entity = 'CAND-001' ORDER BY timestamp | Full history of any record |
| 9.4 | Replay capability: rebuild any dataset's state at any point in time | Time-travel queries |
| 9.5 | Journal compaction: roll old events into monthly summary Parquet files | Prevent unbounded growth |
Gate: Change a candidate's phone number. Query shows the change. Journal shows old value, new value, who changed it, when, and why. Replay to yesterday's state.
Why now: In 3 years, compliance, AI auditability, and "why did the agent recommend this candidate" all require mutation history. Adding it later means you only have history from that day forward.
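
Replay (step 9.4) can be sketched as folding the journal up to a cutoff. The Event struct below keeps only a subset of the 9.2 schema, uses an integer timestamp, and treats a missing new_value as a delete; all three are simplifying assumptions.

```rust
use std::collections::HashMap;

pub struct Event {
    pub entity: String,
    pub field: String,
    /// None models a delete of the field.
    pub new_value: Option<String>,
    pub timestamp: u64,
}

/// Rebuilds (entity, field) -> value state as of `as_of` by applying
/// every event at or before that time in timestamp order.
pub fn replay_as_of(events: &[Event], as_of: u64) -> HashMap<(String, String), String> {
    let mut ordered: Vec<&Event> = events.iter().filter(|e| e.timestamp <= as_of).collect();
    // Stable sort: events with equal timestamps keep their journal order.
    ordered.sort_by_key(|e| e.timestamp);
    let mut state = HashMap::new();
    for event in ordered {
        let key = (event.entity.clone(), event.field.clone());
        match &event.new_value {
            Some(value) => {
                state.insert(key, value.clone());
            }
            None => {
                state.remove(&key);
            }
        }
    }
    state
}
```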
Phase 10: Rich Catalog v2 — Metadata as Product
Principle: Every dataset should be self-describing. A new team member (or AI agent) should understand what data exists, who owns it, how fresh it is, and what's sensitive — without asking anyone.
| Step | Deliverable | Gate |
|---|---|---|
| 10.1 | Catalog schema upgrade: add owner, sensitivity, freshness_sla, description, tags, lineage | GET /catalog/datasets returns rich metadata |
| 10.2 | Sensitivity classification: PII, PHI, financial, public, internal | Sensitive fields tagged at ingest |
| 10.3 | Lineage tracking: source_system → ingest_job → dataset → derived_dataset | "Where did this data come from?" answerable |
| 10.4 | Freshness contracts: expected_update_frequency, last_updated, stale_after | Alert when data goes stale |
| 10.5 | Dataset contracts: required columns, type expectations, validation rules | Ingest rejects data that breaks the contract |
| 10.6 | Auto-documentation: AI generates dataset description from schema + sample data | New datasets self-describe on ingest |
Gate: Ingest a CSV. System auto-detects PII columns (email, phone, SSN patterns), tags them, generates a description, sets owner, and tracks lineage back to the source file.
Why now: Every dataset you ingest without metadata becomes a "mystery file" in 6 months. The metadata layer makes the difference between a searchable knowledge platform and a data graveyard.
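
A rough sketch of pattern-based PII tagging for the gate above, using hand-written checks on sampled values. The heuristics, the 80% agreement threshold, and every name here are assumptions for illustration, not the shared crate's detector.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PiiKind {
    Email,
    Phone,
    Ssn,
}

/// Matches the ddd-dd-dddd shape only.
fn looks_like_ssn(v: &str) -> bool {
    let b = v.as_bytes();
    b.len() == 11
        && b.iter().enumerate().all(|(i, c)| {
            if i == 3 || i == 6 { *c == b'-' } else { c.is_ascii_digit() }
        })
}

fn looks_like_email(v: &str) -> bool {
    match v.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && domain.contains('.')
                && !domain.starts_with('.')
                && !domain.ends_with('.')
                && !v.contains(char::is_whitespace)
        }
        None => false,
    }
}

/// 10 to 15 digits, with only common phone punctuation around them.
fn looks_like_phone(v: &str) -> bool {
    let digits = v.chars().filter(|c| c.is_ascii_digit()).count();
    let allowed = v.chars().all(|c| c.is_ascii_digit() || " ()+-.".contains(c));
    allowed && (10..=15).contains(&digits)
}

pub fn classify_value(v: &str) -> Option<PiiKind> {
    if looks_like_ssn(v) {
        Some(PiiKind::Ssn)
    } else if looks_like_email(v) {
        Some(PiiKind::Email)
    } else if looks_like_phone(v) {
        Some(PiiKind::Phone)
    } else {
        None
    }
}

/// Tags a column when at least 80% of its non-empty samples match the
/// kind detected on the first non-empty sample.
pub fn detect_column_pii(samples: &[&str]) -> Option<PiiKind> {
    let non_empty: Vec<&str> = samples
        .iter()
        .map(|s| s.trim())
        .filter(|s| !s.is_empty())
        .collect();
    let first = classify_value(non_empty.first()?)?;
    let hits = non_empty
        .iter()
        .filter(|s| classify_value(s) == Some(first))
        .count();
    if hits * 5 >= non_empty.len() * 4 { Some(first) } else { None }
}
```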
Phase 11: Embedding Versioning — Model-Proof Vector Layer
Principle: Embedding models will change. If you don't track which model created which vectors, upgrading means re-embedding everything from scratch.
| Step | Deliverable | Gate |
|---|---|---|
| 11.1 | Vector index metadata: model_name, model_version, dimensions, created_at | Every index knows its embedding model |
| 11.2 | Multi-version indexes: same data, different models, coexist | Search specifies which model version |
| 11.3 | Incremental re-embed: only new/changed docs get re-embedded on model upgrade | Model swap doesn't require full re-embed |
| 11.4 | A/B search: query both old and new model, compare results | Validate model upgrade before committing |
Gate: Upgrade from nomic-embed-text to a new model. Old index still works. New index builds incrementally. Compare search quality. Switch when ready.
Phase 12: Tool Registry — Agent-Safe Business Actions
Principle: In 3 years, AI agents won't just query — they'll act. Instead of every agent getting raw SQL access, expose named, governed, audited business actions.
| Step | Deliverable | Gate |
|---|---|---|
| 12.1 | Tool definition: name, description, parameters, permissions, audit_level | search_candidates(skills, city, min_years) as a registered tool |
| 12.2 | Tool execution: validates params, checks permissions, logs usage, runs query | Agent calls tool, gets results, action is logged |
| 12.3 | Read vs write tools: read tools are permissive, write tools require confirmation | get_candidate = auto-approved, update_phone = requires review |
| 12.4 | MCP-compatible interface: expose tools via Model Context Protocol | Any MCP-compatible agent (Claude, GPT, local) can use them |
| 12.5 | Rate limiting + quotas per agent/tool | Prevent runaway agent from overwhelming the system |
Gate: An AI agent calls search_candidates(skills="Python,AWS", city="Chicago", available=true) → gets results → calls shortlist_candidate(workspace_id, candidate_id, reason) → action is logged, auditable, reversible.
Why now: The tool interface is cheap to build (it's just named endpoints with validation). But retrofitting audit logging and permission checks onto raw SQL access is a nightmare. Build the governed interface first.
Phase 13: Security & Access Control
| Step | Deliverable | Gate |
|---|---|---|
| 13.1 | Field-level sensitivity tags (PII, PHI, financial) in catalog | Sensitive fields identified |
| 13.2 | Row-level access policies (agent A sees their candidates only) | Policy evaluated at query time |
| 13.3 | Column masking (show last 4 of SSN, redact salary for non-managers) | Masked results based on role |
| 13.4 | Query audit log (who queried what, when, which fields) | Every data access recorded |
| 13.5 | Policy-as-code (TOML/YAML rules, not hardcoded) | Non-engineer can update access rules |
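
Column masking (step 13.3) might reduce to small per-field functions applied while results are rendered. The Role enum and both function names are hypothetical, and in the phase as described the rules would come from policy files rather than code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Manager,
    Recruiter,
}

/// Shows only the last four digits of an SSN. Values that do not
/// contain exactly nine digits are masked completely.
pub fn mask_ssn(ssn: &str) -> String {
    let digits: Vec<char> = ssn.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.len() != 9 {
        return "***-**-****".to_string();
    }
    let last4: String = digits[5..].iter().collect();
    format!("***-**-{}", last4)
}

/// Redacts salary for non-managers.
pub fn render_salary(role: Role, salary: u32) -> String {
    match role {
        Role::Manager => salary.to_string(),
        Role::Recruiter => "[redacted]".to_string(),
    }
}
```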
Phase 14: Schema Evolution + AI Migration
| Step | Deliverable | Gate |
|---|---|---|
| 14.1 | Schema diff detection: old schema vs new ingest → list changes | "Column renamed: first_name → full_name" |
| 14.2 | AI-generated migration rules: LLM suggests column mappings | "full_name = concat(first_name, ' ', last_name)" |
| 14.3 | Migration preview: show how old data maps to new schema before applying | Human approves before data transforms |
| 14.4 | Versioned schemas in catalog: v1, v2, v3 coexist | Queries specify version or use latest |
Phase 15: Infrastructure horizon items
- HNSW vector index with trial system (shipped 2026-04-16)
- Federation foundation — ADR-017 (shipped 2026-04-16)
- Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
- Federation layer 2 — runtime bucket lifecycle, per-index bucket scoping, profile bucket auto-provisioning (shipped 2026-04-17)
- MySQL streaming connector — mirrors Postgres path, verified on live MariaDB (shipped 2026-04-17)
- PDF OCR for scanned documents — Tesseract 5.5 fallback when lopdf yields no text (shipped 2026-04-17)
- Scheduled ingest — interval-based per-source schedules with CRUD + run-now + auto-trigger agent (shipped 2026-04-17)
- Multi-node query distribution (DataFusion supports this architecturally)
Phase 16: Hot-Swap Index Generations
Make indexes upgrade-in-place without dropping queries.
| Step | Deliverable | Gate |
|---|---|---|
| 16.1 | "Active generation" pointer per logical index name | /vectors/search routes to current champion automatically |
| 16.2 | Background trial runner: watches trial journal, proposes configs (random search / Bayesian), fires /hnsw/trial | Agent autonomously tunes without human POSTing each config |
| 16.3 | Promotion endpoint: POST /hnsw/promote/{index}/{trial_id} atomically swaps active pointer | Next search hits new config, zero downtime |
| 16.4 | Rollback: POST /hnsw/rollback/{index} reverts to previous generation | Bad promotion recoverable in milliseconds |
| 16.5 | Dataset-append triggers: when POST /ingest/file writes to a dataset with attached vector indexes, schedule automatic re-trial (not full rebuild) | New docs get embedded + indexed without manual intervention |
Gate: Run the trial agent for 10 minutes against resumes_100k_v2 with a fresh eval set. It explores the ef_construction × ef_search space, promotes the Pareto winner, continues running. Zero human clicks. All trials and promotions appear in /hnsw/trials/resumes_100k_v2.
Risk: Agent loops into a bad region (e.g. always proposes ef_construction=1). Mitigation: a hardcoded config space constraint + minimum-quality gate (don't promote anything with recall <0.9).
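
Promotion, rollback, and the minimum-quality gate can be sketched with a single lock around the active and previous generation pointers. Type and method names are hypothetical; only the 0.9 recall floor comes from the text above.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
pub struct Generation {
    pub trial_id: String,
    pub recall: f64,
}

pub const MIN_PROMOTE_RECALL: f64 = 0.9;

struct Slots {
    active: Arc<Generation>,
    previous: Option<Arc<Generation>>,
}

/// Holds the active generation for one logical index name.
pub struct IndexPointer {
    slots: RwLock<Slots>,
}

impl IndexPointer {
    pub fn new(initial: Generation) -> Self {
        Self {
            slots: RwLock::new(Slots { active: Arc::new(initial), previous: None }),
        }
    }

    /// Readers clone the Arc, so an in-flight search keeps its generation
    /// alive even if a promotion happens mid-query.
    /// unwrap() panics only if another thread panicked while holding the lock.
    pub fn active(&self) -> Arc<Generation> {
        Arc::clone(&self.slots.read().unwrap().active)
    }

    /// Swaps the active pointer unless the candidate is below the recall gate.
    pub fn promote(&self, candidate: Generation) -> Result<(), String> {
        if candidate.recall < MIN_PROMOTE_RECALL {
            return Err(format!(
                "recall {} is below the {} gate",
                candidate.recall, MIN_PROMOTE_RECALL
            ));
        }
        let mut slots = self.slots.write().unwrap();
        let old = std::mem::replace(&mut slots.active, Arc::new(candidate));
        slots.previous = Some(old);
        Ok(())
    }

    /// Restores the previous generation; returns false if there is none.
    pub fn rollback(&self) -> bool {
        let mut slots = self.slots.write().unwrap();
        match slots.previous.take() {
            Some(prev) => {
                slots.active = prev;
                true
            }
            None => false,
        }
    }
}
```

This sketch keeps one level of rollback history; deeper rollback would need a stack of generations instead of a single previous slot.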
Phase 17: Model Profiles + Dataset Bindings
Make "different models see different data" real instead of a config string.
| Step | Deliverable | Gate |
|---|---|---|
| 17.1 | ModelProfile manifest: id, ollama_name, bucket, bound_datasets[], hnsw_config, embed_model | GET /models lists profiles; POST /models creates one |
| 17.2 | Profile activation endpoint: POST /profile/{id}/activate — warms EmbeddingCache for bound indexes, builds HNSW with profile's config | Next search against bound indexes is <1ms cold |
| 17.3 | Model-scoped search: POST /search?model=X filters to bound datasets only | Model A can't see Model B's datasets unless explicitly shared |
| 17.4 | VRAM-aware activation: only one (or small N) model loaded at a time on 16GB A4000 | Activating model B unloads model A via Ollama's keep_alive=0 |
| 17.5 | Audit: every tool invocation by a model is logged with model identity | GET /models/{id}/audit shows exactly what each model touched |
Gate: Two model profiles defined: staffing-recruiter (bound to candidates/placements/timesheets) and docs-assistant (bound to a documentation corpus). Activate staffing-recruiter, search for candidates — works. Switch to docs-assistant, same search — returns zero from staffing (not bound) but finds docs. VRAM shows only one embedding model loaded at a time.
VRAM reality: 16GB A4000 realistically holds 1-2 loaded models concurrently. "Multi-model" in practice means sequential swap between profiles, not parallel serving. The profile abstraction makes this swap clean.
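
Model-scoped search (step 17.3) comes down to filtering by the profile's dataset bindings. The sketch below filters hits after the fact for brevity; ModelProfile is trimmed to two fields, and Hit and scope_hits are hypothetical names.

```rust
use std::collections::HashSet;

pub struct ModelProfile {
    pub id: String,
    pub bound_datasets: HashSet<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub dataset: String,
    pub doc_id: String,
    pub score: f32,
}

/// Keeps only hits from datasets the profile is bound to.
pub fn scope_hits(profile: &ModelProfile, hits: Vec<Hit>) -> Vec<Hit> {
    hits.into_iter()
        .filter(|hit| profile.bound_datasets.contains(&hit.dataset))
        .collect()
}
```

A production path would more likely restrict the search to bound indexes up front, so unbound data is never scored at all.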
Phase 18: Storage format decision (Lance evaluation)
This question was raised on 2026-04-16, after J's LLMS3 knowledge base identified Lance as alternative_to Parquet for vector workloads. Current stack: Parquet with binary-blob vector columns + in-RAM HNSW sidecar. Evaluated against: Lance native vector format with disk-resident indexes.
| Step | Deliverable | Decision criteria |
|---|---|---|
| 18.1 | ✅ Parallel Lance-backed vector index for resumes_100k_v2 in standalone crates/lance-bench | Built 2026-04-16 |
| 18.2 | ✅ Head-to-head benchmark across 8 dimensions (cold-load, search latency, disk, index build, random access, append) | Complete |
| 18.3 | ✅ ADR-019 committed with measured data and decision | See docs/ADR-019-vector-storage.md |
Outcome: Hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance joins as a second backend for Phase 16 hot-swap (14× faster index builds), Phase C/append workloads (0.08s vs full rewrite), RAG random-access retrieval (112× faster), and indexes past the ~5M RAM ceiling.
Per-profile vector_backend: Parquet | Lance becomes part of Phase 17 (model profiles). See ADR-019 for the full scorecard and caveats.
Phase 19: Playbook memory (meta-index) — the feedback loop
Make successful playbooks actually improve future searches. Today successful_playbooks is a write-only log; future-you looks at it and thinks "cool, we filled Toledo welders once" — but the index has no idea it happened, so the next Toledo-welder search ranks the same as if none of those fills had existed. Phase 19 closes the loop.
| Step | Deliverable | Gate |
|---|---|---|
| 19.1 | Embed every successful_playbooks row — operation + approach + context → one chunk per playbook | A new dataset playbook_memory appears in catalog with N rows = row count of successful_playbooks |
| 19.2 | Vector index on playbook_memory (HNSW or Lance — whichever agent-parquet profile uses) | /vectors/search against playbook_memory returns semantically similar past playbooks |
| 19.3 | Endorsement extraction: each playbook row has fills[] (worker_ids it succeeded with). Parse them out at ingest time and store in a sidecar playbook_endorsements Parquet keyed by playbook_id | SELECT * FROM playbook_endorsements WHERE playbook_id = 'X' returns the worker_ids |
| 19.4 | /vectors/hybrid gains opt-in use_playbook_memory: bool. When true: after hybrid ranks candidates, find top-K similar past playbooks (semantic search over playbook_memory), extract endorsed worker_ids, add a bounded boost to candidates in the endorsed set, re-rank | A search where the "right" worker is known from a prior playbook ranks higher with the flag than without |
| 19.5 | Write-through from multi-agent orchestrator: when two agents seal a playbook, it appends to successful_playbooks AND triggers a refresh of playbook_memory (via existing Phase C stale-mark path). Next query sees the new signal. | Run the orchestrator → inspect playbook_memory → see a new row. Run the same query before/after → ranking differs. |
| 19.6 | Ceiling-aware boost: cap the per-worker boost so one popular worker can't dominate future searches. Boost decays with time (optional) so stale playbooks matter less. | Synthetic test: 100 playbooks all filled with the same worker_id; the 101st search still returns a mix, not just that one worker |
Gate: Run a real search before and after a new successful playbook lands. The endorsed workers from similar past operations rank higher in the second call. Demonstrable with a diff of the two result sets.
Why this is the right version of "meta-index": The alternative, training a neural re-ranker on (query, candidate, outcome) triples, is a weeks-long ML story and requires labeled outcome data we don't really have. The statistical-semantic version here is rebuildable from the existing playbook journal, explainable ("boosted because of similar playbooks X, Y, Z"), and invalidatable (delete a playbook → boost goes away on next rebuild). It gets 80% of the payoff at 10% of the cost. Neural re-ranking stays a future option if the statistical version proves insufficient.
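
The ceiling-aware boost from step 19.6 can be sketched as a per-worker sum that saturates at a cap. The weight and cap parameters and the bounded_boosts name are illustrative assumptions, not values from the codebase.

```rust
use std::collections::HashMap;

/// Accumulates similarity-weighted boosts per worker, clamped at
/// max_boost so one heavily endorsed worker cannot dominate ranking.
/// Each endorsement is (worker_id, similarity of the source playbook).
pub fn bounded_boosts(
    endorsements: &[(&str, f32)],
    weight: f32,
    max_boost: f32,
) -> HashMap<String, f32> {
    let mut boosts: HashMap<String, f32> = HashMap::new();
    for (worker_id, similarity) in endorsements {
        let entry = boosts.entry((*worker_id).to_string()).or_insert(0.0);
        *entry = (*entry + similarity * weight).min(max_boost);
    }
    boosts
}
```

With the cap in place, the synthetic test from 19.6 (100 playbooks endorsing one worker) leaves that worker with the same ceiling boost as a worker endorsed a handful of times, so base relevance still decides the mix.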
Non-goals for this phase:
- Neural training / fine-tuning. Statistical feedback only.
- Hard guarantees about recall lift magnitude. "Measurably better on the demo query" is the gate, not a universal quality claim.
- Real-time recomputation on every playbook. Batched refresh via the existing stale-marking path is sufficient.
Phase 19 refinement (WIRED 2026-04-21): geo-filter + role prefilter on boost
Item-3 diagnostic pass surfaced that compute_boost_for was ranking playbooks globally by cosine similarity, while candidates came from an SQL-filtered city. Result: boost map had 170 endorsed workers, 0 intersected the 50 Nashville-filtered candidates. Zero citations where there should have been dozens.
Fix — in crates/vectord/src/playbook_memory.rs:
- compute_boost_for_filtered(target_geo) — skip playbooks from other cities before cosine sort.
- compute_boost_for_filtered_with_role(target_geo, target_role) — multi-strategy: exact (role, city, state) match earns similarity=1.0 and fills up to half the top_k; cosine fallback fills the rest. Mirrors Mem0/Zep 2026 guidance on parallel-strategy rerank.
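
The multi-strategy selection can be sketched as below, assuming each playbook already carries its cosine similarity to the query. The Playbook struct and select_playbooks are hypothetical and simplify whatever compute_boost_for_filtered_with_role does internally.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Playbook {
    pub id: String,
    pub role: String,
    pub city: String,
    pub state: String,
    /// Precomputed cosine similarity to the query.
    pub cosine: f32,
}

/// Returns up to top_k (playbook_id, similarity) pairs: exact role
/// matches inside the target geo score 1.0 and fill at most half the
/// slots; the remaining slots go to the best cosine matches in the
/// same geo. Playbooks from other cities are never considered.
pub fn select_playbooks(
    playbooks: &[Playbook],
    target_role: &str,
    target_city: &str,
    target_state: &str,
    top_k: usize,
) -> Vec<(String, f32)> {
    let in_geo: Vec<&Playbook> = playbooks
        .iter()
        .filter(|p| {
            p.city.eq_ignore_ascii_case(target_city)
                && p.state.eq_ignore_ascii_case(target_state)
        })
        .collect();
    let mut selected: Vec<(String, f32)> = in_geo
        .iter()
        .filter(|p| p.role.eq_ignore_ascii_case(target_role))
        .take(top_k / 2)
        .map(|p| (p.id.clone(), 1.0))
        .collect();
    let mut rest: Vec<&Playbook> = in_geo
        .iter()
        .copied()
        .filter(|p| !selected.iter().any(|(id, _)| *id == p.id))
        .collect();
    rest.sort_by(|a, b| b.cosine.total_cmp(&a.cosine));
    let remaining = top_k.saturating_sub(selected.len());
    selected.extend(rest.into_iter().take(remaining).map(|p| (p.id.clone(), p.cosine)));
    selected
}
```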
In crates/vectord/src/service.rs:
- extract_target_geo and extract_target_role pull both from the executor's SQL filter.
- tracing::info! emits playbook_boost: boosts=N sources=N parsed=N matched=N target_geo=? target_role=? on every hybrid_search. Silent-truncation class of bug now visible.
Citation lift measured: avg citations per run 0.32 → 1.38 after geo filter; then 2 → 28 in the single-scenario Riverfront Steel re-run after role prefilter landed. 14× delta on same scenario.
Unit tests: extract_target_geo_basic, _missing_state_returns_none, _word_boundary (rejects "civilian" substring), extract_target_role_basic, _none_when_absent, _multi_word — all pass (cargo test -p vectord --lib extractor_tests).
Phase 20: Model Matrix + Overseer Tiers (WIRED 2026-04-21)
Five-tier routing declared in config/models.json. Hot path (T1/T2) stays local (qwen3.5 + qwen3 after mistral was dropped for 0/14 fill rate on complex scenarios). Cloud for overview (T3 gpt-oss:120b), strategic (T4 qwen3.5:397b), and gatekeeper (T5 kimi-k2-thinking). Every tier declares context_window + context_budget + overflow_policy.
- T1 hot: 50-200 calls/scenario, local only: qwen3.5:latest executor, think:false
- T2 review: 5-14 calls/event, local only: qwen3:latest reviewer, think:false
- T3 overview: 1-3 calls/scenario, cloud primary: gpt-oss:120b on Ollama Cloud, thinking on
- T4 strategic: 1-10 calls/day, cloud primary
- T5 gatekeeper: 1-5 calls/day, audit-logged
T3 checkpoints + cross-day lessons wired. Lessons archive to data/_playbook_lessons/ and load back at next scenario start as prior_lessons in executor context. Cloud passthrough verified on stress_01 scenario with LH_OVERVIEW_CLOUD=1 — gpt-oss:120b response latency consistently 4-8s, diagnosing city-pivot ("Gary IN → Chicago IL, 40mi") when target city has zero supply.
think:false is the key mechanical finding — qwen3.5 burns ~650 tokens of hidden reasoning before emitting response; hot-path JSON emitters MUST disable thinking or continuation has to paper over empty returns. T3/T4 overseers KEEP thinking (that's the point).
Kimi-k2.6 upgrade path: Current Ollama Cloud key returns 403 on kimi-k2.6 (ollama run kimi-k2.6:cloud requires ollama signin with pro-tier account). kimi-k2.5 substitutes on the current tier — same family, strong at tool calling. Swap to k2.6 is a one-line change in applyToolLevel once the subscription lands.
Phase 21: Scratchpad + Tree-Split Continuation
Why this is a phase and not an optimization: bumping max_tokens until a response stops truncating is a tourniquet — J called this out explicitly. As playbooks accumulate into the hundreds and responses grow, eventually SOME request will exceed SOME model's window, and we can't solve it by raising a number. The stable answer is two primitives that let us handle arbitrary-size work without losing context: a scratchpad that glues multi-call responses together, and a tree split that shards oversized inputs and reduces them back.
Two primitives (WIRED 2026-04-21 in tests/multi-agent/agent.ts):
- generateContinuable(): handles OUTPUT overflow. Calls the model; checks structural completeness (for JSON: matched braces + JSON.parse success; for text: non-empty). If incomplete, calls again with "continue from here" + the partial response as scratchpad. Up to max_continuations times. No max_tokens tuning needed: if thinking ate the initial budget, continuation picks up the slack.
- generateTreeSplit(): handles INPUT overflow. Caller passes an array of shards (semantic chunks of the corpus). For each shard: map call with running scratchpad digest. Final reduce call produces the answer. Scratchpad truncates oldest content if it approaches its own budget. If a single shard still overflows, assertContextBudget throws; the caller must re-shard at finer granularity, NOT silently truncate.
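The continuation loop can be sketched as follows. This is a minimal illustration, not the code in agent.ts: the `Generate` signature and the continuation prompt wording are assumptions, and the sketch relies on JSON.parse alone, which already fails on unbalanced braces.

```typescript
// Hypothetical generator signature; the real one in agent.ts may differ.
type Generate = (prompt: string) => Promise<string>;

// Text is complete when non-empty; JSON is complete when it parses.
function isStructurallyComplete(text: string, shape: "json" | "text"): boolean {
  const t = text.trim();
  if (t.length === 0) return false;
  if (shape === "text") return true;
  try {
    JSON.parse(t);
    return true;
  } catch {
    return false;
  }
}

async function generateContinuable(
  generate: Generate,
  prompt: string,
  shape: "json" | "text",
  maxContinuations = 3,
): Promise<string> {
  let partial = await generate(prompt);
  for (let i = 0; i < maxContinuations; i++) {
    if (isStructurallyComplete(partial, shape)) return partial;
    // Feed the partial back as scratchpad and ask only for the remainder.
    const next = await generate(
      `${prompt}\n\nPartial response so far:\n${partial}\n\nContinue from here. Output only the remainder.`,
    );
    partial += next;
  }
  if (isStructurallyComplete(partial, shape)) return partial;
  // Loud failure with numbers, never a silently truncated return.
  throw new Error(
    `incomplete after ${maxContinuations} continuations (${partial.length} chars so far)`,
  );
}
```

An empty first response (thinking consumed the budget) takes the same path as a truncated one: the next call simply continues from an empty scratchpad, bounded by maxContinuations.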
Guarantees:
- No agent call can silently truncate. Either it completes, continues, or throws with numbers.
- No corpus is too big: generateTreeSplit handles any size the caller can shard.
- Scratchpad is the glue between multi-call responses; context is never lost, only compacted.
- Token estimation uses chars / 4 (biased safe ~15%) until we wire the provider's tokenizer.
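A minimal sketch of how the budget check, scratchpad compaction, and map → reduce fit together. The function bodies, prompt wording, and the synchronous `ModelCall` stand-in are assumptions made for illustration; only the names estimateTokens, assertContextBudget, and generateTreeSplit come from the text.

```typescript
// chars / 4, rounded up: the stand-in estimator named in the text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Throws with numbers instead of letting a call truncate silently.
function assertContextBudget(prompt: string, contextBudget: number): number {
  const estimated = estimateTokens(prompt);
  if (estimated > contextBudget) {
    throw new Error(
      `context overflow: estimated ${estimated} tokens > budget ${contextBudget}`,
    );
  }
  return estimated;
}

// Drop the oldest digests until the scratchpad fits its own budget.
function compactScratchpad(digests: string[], scratchpadBudget: number): string[] {
  const kept = [...digests];
  while (kept.length > 1 && estimateTokens(kept.join("\n")) > scratchpadBudget) {
    kept.shift();
  }
  return kept;
}

// Synchronous stand-in for a model call (assumption; real calls are async).
type ModelCall = (prompt: string) => string;

function generateTreeSplit(
  generate: ModelCall,
  question: string,
  shards: string[],
  contextBudget: number,
  scratchpadBudget: number,
): string {
  let digests: string[] = [];
  for (const shard of shards) {
    const prompt = `${question}\n\nNotes so far:\n${digests.join("\n")}\n\nShard:\n${shard}`;
    // Map step: if this throws, the caller must re-shard at finer granularity.
    assertContextBudget(prompt, contextBudget);
    digests = compactScratchpad([...digests, generate(prompt)], scratchpadBudget);
  }
  const reducePrompt = `${question}\n\nNotes:\n${digests.join("\n")}\n\nAnswer from the notes.`;
  assertContextBudget(reducePrompt, contextBudget);
  return generate(reducePrompt);
}
```

Checking the budget before every call, including the reduce call, is what turns an oversized shard into an explicit error rather than a quietly clipped prompt.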
What lives where now:
- agent.ts::estimateTokens() + assertContextBudget() + generateContinuable() + generateTreeSplit(): WIRED
- scenario.ts executor + reviewer + overviewGenerate calls: migrated to generateContinuable
- config/models.json: context_window + context_budget + overflow_policies per tier (declarative)
Next sprint (Rust side, so gateway tools share it):
- crates/aibridge/src/continuation.rs: port of generateContinuable
- crates/aibridge/src/tree_split.rs: port of generateTreeSplit
- crates/storaged/src/chunk_cache.rs: precomputed shards keyed by corpus hash (avoid re-chunking on every T4 run)
- /metrics counter: context_continuations_total{model,shape,succeeded}
Status: TS primitives WIRED. Rust port pending. The escalation path (tree split → bigger-context cloud model → kimi-k2:1t's 1M window → split decision into sub-decisions) is declared in config/models.json under context_management.overflow_policies.
Phase 21 status update (WIRED 2026-04-21 evening)
Additional primitives landed after the initial commit:
- think: boolean flag plumbed through generate(), generateCloud(), generateContinuable(), and into sidecar's /generate endpoint. Enables per-call opt-out of hidden reasoning for hot-path JSON emitters. Verified: qwen3.5 with think:false + num_predict:400 returns clean {"worker_id":...} on first call; without think:false, 650 tokens eaten by reasoning, response empty.
- Cloud executor routing: ACTIVE_EXECUTOR_CLOUD / ACTIVE_REVIEWER_CLOUD flags let per-staffer tool_level route executor to Ollama Cloud when weak local model (qwen2.5) would collapse. Verified on kimi-k2.5 via Ollama Cloud: clean JSON emission, think:false honored.
Rust port of continuation + tree-split primitives remains queued for next sprint (crates/aibridge/src/continuation.rs, tree_split.rs).
Phase 22: Internal Knowledge Library (KB)
Meta-layer over Phase 19 playbook_memory. Playbook memory answers "which WORKERS worked for this event." The KB answers "which CONFIG worked for this playbook signature." Subject changes from workers to the system itself — model choice, budget hints, overflow policies, pathway notes.
Files (data/_kb/):
- signatures.jsonl: (sig_hash, embedding[], first_seen, last_seen, run_count). Sig = stable hash of the sequence of (kind, role, count, city, state) across events.
- outcomes.jsonl: per-run record {sig, run_id, models, ok/total, turns, citations, per-event summary, elapsed}.
- pathway_recommendations.jsonl: AI-synthesized for next run {confidence, rationale, top_models, budget_hints, pathway_notes, neighbors_consulted}.
- error_corrections.jsonl: detected fail→succeed pairs on same sig, diff of what changed.
- config_snapshots.jsonl: history of models.json changes + why.
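One way to get a stable signature from an event sequence is sketched below. The hash function is an assumption: FNV-1a (32-bit, over UTF-16 code units) is used only to keep the example dependency-free, and the text does not say which hash kb.ts uses. `EventSpec`, `fnv1a`, and `signatureOf` are hypothetical names.

```typescript
// Hypothetical event shape carrying the five fields named in the text.
interface EventSpec {
  kind: string;
  role: string;
  count: number;
  city: string;
  state: string;
}

// FNV-1a, 32-bit. An assumption for this sketch; any stable hash would do.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("00000000" + (h >>> 0).toString(16)).slice(-8);
}

// Serialize fields in a fixed order so object key order cannot change the hash.
// Event order is preserved on purpose: the signature covers the sequence.
function signatureOf(events: EventSpec[]): string {
  const canonical = JSON.stringify(
    events.map((e) => [e.kind, e.role, e.count, e.city, e.state]),
  );
  return fnv1a(canonical);
}
```

Serializing to a fixed-order tuple before hashing is the part that makes the hash stable across runs; the choice of hash function is secondary.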
Cycle (event-driven, not wall-clock):
- Scenario ends → kb.indexRun() extracts signature, embeds spec digest, appends outcome.
- kb.recommendFor(nextSpec) finds k-NN signatures via cosine, feeds their outcome history + recent error corrections to the overview model, writes a structured recommendation.
- Next scenario starts → kb.loadRecommendation(spec) pulls the newest rec for this sig, injects pathway_notes into guidanceFor() alongside prior lessons.
Why file-based for MVP: Phase 19 playbook_memory is already a catalogd dataset. KB is a separate meta-layer; keep it file-based first to iterate without a gateway schema migration. Rust port (and promotion to vectord-indexed corpus for neighbor search at scale) lands once shape stabilizes — mirrors how Phase 21 primitives were TS-first → Rust next sprint.
What the overview model gets asked:
- Target scenario digest
- Top-k neighbor signatures with avg ok rate, best model combo per neighbor
- Recent error corrections (sig, before/after model set)
What it outputs (JSON-constrained):
- confidence (high/medium/low)
- rationale (2-3 sentences)
- top_models {executor, reviewer, overview}
- budget_hints {executor_max_tokens, reviewer_max_tokens, executor_think}
- pathway_notes (concrete pre-run advice)
Status (WIRED 2026-04-21): tests/multi-agent/kb.ts holds all primitives. scenario.ts reads rec at start, indexes + recommends at end. Cold start gracefully writes a "low confidence, no history" rec so the second run has a floor to build on. snapshotConfig() wired to fire at every scenario start — active model set + tool_level + cloud flags hashed and appended to config_snapshots.jsonl.
Phase 22 item B — cloud rescue (WIRED): When an event fails and cloud T3 is enabled, requestCloudRemediation() feeds the failure trace (SQL filters attempted, row counts, reviewer drift reasons, gap signals, contract terms) to cloud and parses a JSON remediation with new_city / new_state / new_role / new_count / rationale. Event retries once with the pivot. Verified 1/3 rescues succeeded on stress_01 (Gary IN → South Bend IN pivot filled a Welder that local drift-aborted). Sanitizer splits "City, ST" comma-packed outputs so downstream SQL doesn't get Hammond, IN, IN.
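The "City, ST" sanitizer can be sketched as below. This is an illustration under assumptions: `Remediation` carries only the two geo fields here, `sanitizeGeo` is a hypothetical name, and preferring an explicit new_state over the packed one is a guess at the intended behavior.

```typescript
// Only the geo fields of the remediation JSON are modeled in this sketch.
interface Remediation {
  new_city?: string;
  new_state?: string;
}

// Split a comma-packed "City, ST" value so the city column never carries a state.
function sanitizeGeo(r: Remediation): Remediation {
  const city = (r.new_city ?? "").trim();
  const m = /^(.+?),\s*([A-Za-z]{2})$/.exec(city);
  if (!m) return r; // nothing packed, leave the record untouched
  // Assumption: an explicit new_state wins over the one packed into the city.
  const explicit = (r.new_state ?? "").trim();
  const state = (explicit.length > 0 ? explicit : m[2]).toUpperCase();
  return { ...r, new_city: m[1].trim(), new_state: state };
}
```

With this in place, a response of new_city "Hammond, IN" plus new_state "IN" yields city "Hammond" and state "IN" instead of a doubled state downstream.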
Phase 23: Staffer identity + competence-weighted retrieval (WIRED 2026-04-21)
Answers "who handled this" as a first-class dimension of the matrix index. Senior staffers' playbooks rank higher than juniors' on similar scenarios via competence × similarity score. Auto-discovers "reliable performer" worker labels via cross-staffer endorsement overlap.
Schema (scenario.ts ScenarioSpec):
- contract?: ContractTerms (deadline, budget_per_hour_max, local_bonus_per_hour, local_bonus_radius_mi, fill_requirement). Propagates into T3 checkpoint + cloud rescue prompts so cloud reasons about trade-offs (pivot-within-radius before budget-pivot-further).
- staffer?: Staffer ({id, name, tenure_months, role, tool_level}). tool_level controls subsystems available to this run:
  - full: qwen3.5 + qwen3 local + cloud T3 + cloud rescue
  - local: qwen3.5 + qwen3 local + local gpt-oss:20b T3 + rescue
  - basic: kimi-k2.5 cloud exec + qwen3 local reviewer + local T3, no rescue
  - minimal: kimi-k2.5 cloud exec + qwen3 local reviewer, NO T3, NO rescue; tests whether playbook inheritance carries knowledge alone
KB staffer indexing (data/_kb/staffers.jsonl):
- Recomputed per-staffer on every run: total_runs, fill_rate, avg_turns_per_event, avg_citations_per_run, rescue_rate, competence_score.
- competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate. Bounded 0..1.
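The weighted sum above can be written out directly. A minimal sketch, assuming all four inputs are already normalized to 0..1; how turn_efficiency and citation_density are derived from avg_turns_per_event and avg_citations_per_run is not stated here, so that step is left out. `StafferMetrics` and `competenceScore` are hypothetical names.

```typescript
// Inputs assumed pre-normalized to the 0..1 range (assumption for this sketch).
interface StafferMetrics {
  fill_rate: number;
  turn_efficiency: number;
  citation_density: number;
  rescue_rate: number;
}

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weights copied from the formula in the text; they sum to 1.0.
function competenceScore(m: StafferMetrics): number {
  const raw =
    0.45 * clamp01(m.fill_rate) +
    0.2 * clamp01(m.turn_efficiency) +
    0.2 * clamp01(m.citation_density) +
    0.15 * clamp01(m.rescue_rate);
  return clamp01(raw);
}
```

For example, fill_rate 0.8 with turn_efficiency 0.5, citation_density 0.5, and rescue_rate 0 gives 0.36 + 0.10 + 0.10 + 0 = 0.56.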
Weighted neighbor retrieval:
- findNeighbors in kb.ts returns weighted_score = cosine × max_staffer_competence (floor 0.3). Senior playbooks rank above junior playbooks on similar scenarios.
- pathway_recommendations include best_staffer_id / best_staffer_competence so cloud knows WHOSE playbook it's synthesizing from.
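A sketch of the competence-weighted ranking, under assumptions: `SignatureRecord` and its `staffer_competences` field are hypothetical, and "floor 0.3" is read as max(best competence, 0.3) so that a signature with no strong staffer is damped rather than zeroed.

```typescript
// Hypothetical record shape; the real signatures.jsonl row may differ.
interface SignatureRecord {
  sig_hash: string;
  embedding: number[];
  staffer_competences: number[];
}

interface Neighbor {
  sig_hash: string;
  cosine: number;
  weighted_score: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

function findNeighbors(
  target: number[],
  records: SignatureRecord[],
  k: number,
): Neighbor[] {
  return records
    .map((r) => {
      const best = r.staffer_competences.length > 0 ? Math.max(...r.staffer_competences) : 0;
      // Floor at 0.3 so low-competence history is damped, not discarded.
      const competence = Math.max(0.3, best);
      const cos = cosine(target, r.embedding);
      return { sig_hash: r.sig_hash, cosine: cos, weighted_score: cos * competence };
    })
    .sort((x, y) => y.weighted_score - x.weighted_score)
    .slice(0, k);
}
```

Two signatures with identical cosine similarity are separated purely by the competence of the best staffer who ran them, which is the intended "senior above junior" ordering.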
Cross-staffer auto-discovery:
- scripts/kb_staffer_report.py emits leaderboard + workers endorsed across ≥2 staffers on same signature.
- Validated output: Rachel D. Lewis (Welder Nashville) endorsed 12× across 4 staffers; Christina Watson (Machine Op Indianapolis) 11×. These are the highest-confidence "reliable performer" labels the system produced without human tagging.
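The overlap rule (same signature, same worker, at least two distinct staffers) can be sketched as a group-and-count. The report itself is a Python script; this TypeScript version is an illustration only, and the `Endorsement` row shape is an assumption.

```typescript
// Hypothetical flattened row: one endorsement of one worker in one run.
interface Endorsement {
  sig_hash: string;
  staffer_id: string;
  worker: string;
}

interface Overlap {
  sig_hash: string;
  worker: string;
  staffers: number;
  endorsements: number;
}

function crossStafferOverlap(rows: Endorsement[], minStaffers = 2): Overlap[] {
  const groups = new Map<
    string,
    { sig_hash: string; worker: string; staffers: Set<string>; endorsements: number }
  >();
  for (const r of rows) {
    const key = JSON.stringify([r.sig_hash, r.worker]);
    let g = groups.get(key);
    if (!g) {
      g = { sig_hash: r.sig_hash, worker: r.worker, staffers: new Set<string>(), endorsements: 0 };
      groups.set(key, g);
    }
    g.staffers.add(r.staffer_id); // distinct staffers, not raw count
    g.endorsements += 1;
  }
  return Array.from(groups.values())
    .filter((g) => g.staffers.size >= minStaffers)
    .map((g) => ({
      sig_hash: g.sig_hash,
      worker: g.worker,
      staffers: g.staffers.size,
      endorsements: g.endorsements,
    }))
    .sort((a, b) => b.endorsements - a.endorsements);
}
```

Counting distinct staffers separately from total endorsements is what keeps one prolific staffer from manufacturing a "reliable performer" label alone.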
Demo infrastructure:
- tests/multi-agent/gen_staffer_demo.ts: 4 personas × 3 contracts = 12 scenario specs.
- scripts/run_staffer_demo.sh: sequential batch with cloud T3.
- scripts/kb_staffer_report.py: leaderboard + top/bottom differential + cross-staffer overlap.
Phase 24: Observer / Autotune integration (SHIPPED 2026-04-20, commit b95dd86)
The gap: lakehouse-observer.service wrapped MCP :3700, while tests/multi-agent/scenario.ts hit gateway :3100 directly. Observer idle at 0 ops across 3600+ cycles — scenarios invisible to ERROR_ANALYZER and PLAYBOOK_BUILDER, autotune running blind to outcomes.
What shipped:
- observer.ts Bun HTTP listener on OBSERVER_PORT (default 3800): GET /health, GET /stats (totals, by_source, recent scenario digest), POST /event for scenario outcomes.
- ObservedOp carries provenance: source="scenario" | "mcp" + staffer_id + sig_hash + event_kind + geo + rescue flags.
- recordExternalOp(): shared ring-buffer insert; main analyzer + playbook builder no longer care where the op came from.
- persistOp() fix: old path POSTed to /ingest/file?name=observed_operations which has REPLACE semantics (wiped prior ops); now uses append-friendly write-through.
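The shared ring-buffer insert can be sketched as follows. This is not the observer.ts code: `OpRing` is a hypothetical class, the `ObservedOp` fields are a subset of those listed above, and the capacity is a constructor parameter because the observer's real buffer size is not given here.

```typescript
// Subset of the provenance fields named in the text.
interface ObservedOp {
  source: "scenario" | "mcp";
  staffer_id?: string;
  sig_hash?: string;
  event_kind?: string;
}

class OpRing {
  private buf: ObservedOp[] = [];
  private next = 0;
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Same insert path regardless of where the op came from.
  record(op: ObservedOp): void {
    if (this.buf.length < this.capacity) this.buf.push(op);
    else this.buf[this.next] = op; // overwrite the oldest slot
    this.next = (this.next + 1) % this.capacity;
  }

  // Oldest to newest.
  recent(): ObservedOp[] {
    if (this.buf.length < this.capacity) return this.buf.slice();
    return this.buf.slice(this.next).concat(this.buf.slice(0, this.next));
  }

  // Counts per source, in the spirit of the by_source field on /stats.
  bySource(): Record<string, number> {
    const counts: Record<string, number> = {};
    for (const op of this.buf) counts[op.source] = (counts[op.source] ?? 0) + 1;
    return counts;
  }
}
```

Because both scenario and MCP events go through record(), downstream consumers only ever read recent() and never need to branch on origin.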
Phase 25: Validity windows + playbook retirement (SHIPPED 2026-04-21, commit e0a843d)
Zep 2026-era finding: temporal validity is the single highest-value memory-hygiene primitive. PlaybookEntry gained schema_fingerprint / valid_until / retired_at / retirement_reason. compute_boost_for_filtered_with_role skips retired + expired before geo/cosine ranking. Two retirement paths: retire_one(id, reason) for manual, retire_on_schema_drift(city, state, fp, reason) for batch migration sweep. Endpoint: POST /vectors/playbook_memory/retire.
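The skip rule and the batch sweep can be sketched as below. This is a TypeScript illustration of Rust behavior, with assumptions labeled: `ValidityEntry` mirrors only the fields named above, and the sweep is assumed to retire active entries in the given city whose fingerprint differs from the current one.

```typescript
// Hypothetical TS mirror of the Phase 25 fields on PlaybookEntry.
interface ValidityEntry {
  playbook_id: string;
  city: string;
  state: string;
  schema_fingerprint: string;
  valid_until?: string; // ISO-8601 timestamp
  retired_at?: string;
  retirement_reason?: string;
}

// An entry takes part in boost only if it is neither retired nor expired.
function isActive(e: ValidityEntry, now: Date): boolean {
  if (e.retired_at !== undefined) return false;
  if (e.valid_until !== undefined && Date.parse(e.valid_until) <= now.getTime()) {
    return false;
  }
  return true;
}

// Assumed semantics: retire every active entry in (city, state) whose
// fingerprint differs from currentFp. Returns the number retired.
function retireOnSchemaDrift(
  entries: ValidityEntry[],
  city: string,
  state: string,
  currentFp: string,
  reason: string,
  now: Date,
): number {
  let retired = 0;
  for (const e of entries) {
    if (e.city !== city || e.state !== state) continue;
    if (!isActive(e, now) || e.schema_fingerprint === currentFp) continue;
    e.retired_at = now.toISOString();
    e.retirement_reason = reason;
    retired++;
  }
  return retired;
}
```

Running the isActive check before geo and cosine ranking means a retired entry costs one comparison and never reaches the scoring path.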
Phase 26: Mem0 upsert + Letta geo hot cache (SHIPPED 2026-04-21, commit 640db8c)
Same-day re-seed no longer duplicates rows. /seed with append=true routes through upsert_entry which decides ADD / UPDATE / NOOP on (operation, day, city, state). Playbook_id stays stable on UPDATE so existing citations remain valid. PlaybookMemory.geo_index: HashMap<(city, state), Vec<idx>> rebuilt on every mutation; geo-filtered boost queries skip the scan and hit O(1) lookup — sub-ms at current scale, same code path scales to 100K+ entries.
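The ADD / UPDATE / NOOP decision and the geo index rebuild can be sketched together. This is a TypeScript illustration of the Rust behavior, with assumptions: `SeedEntry` and `PlaybookMemorySketch` are hypothetical names, and NOOP is taken to mean "same key and identical endorsed workers", which the text does not spell out.

```typescript
// Hypothetical entry shape; endorsed_workers stands in for the payload.
interface SeedEntry {
  playbook_id: string;
  operation: string;
  day: string;
  city: string;
  state: string;
  endorsed_workers: string[];
}

type UpsertOutcome = "ADD" | "UPDATE" | "NOOP";

class PlaybookMemorySketch {
  entries: SeedEntry[] = [];
  geoIndex = new Map<string, number[]>();

  upsert(incoming: SeedEntry): UpsertOutcome {
    // Identity key: (operation, day, city, state).
    const idx = this.entries.findIndex(
      (e) =>
        e.operation === incoming.operation &&
        e.day === incoming.day &&
        e.city === incoming.city &&
        e.state === incoming.state,
    );
    if (idx === -1) {
      this.entries.push(incoming);
      this.rebuildGeoIndex();
      return "ADD";
    }
    const existing = this.entries[idx];
    // Assumption: an identical payload means there is nothing to do.
    if (JSON.stringify(existing.endorsed_workers) === JSON.stringify(incoming.endorsed_workers)) {
      return "NOOP";
    }
    // Keep the original playbook_id so existing citations stay valid.
    this.entries[idx] = { ...incoming, playbook_id: existing.playbook_id };
    this.rebuildGeoIndex();
    return "UPDATE";
  }

  // Rebuilt on every mutation, as described in the text.
  private rebuildGeoIndex(): void {
    this.geoIndex = new Map();
    this.entries.forEach((e, i) => {
      const key = JSON.stringify([e.city, e.state]);
      const bucket = this.geoIndex.get(key);
      if (bucket) bucket.push(i);
      else this.geoIndex.set(key, [i]);
    });
  }

  // Geo-filtered lookups hit the index instead of scanning all entries.
  inGeo(city: string, state: string): SeedEntry[] {
    const idxs = this.geoIndex.get(JSON.stringify([city, state])) ?? [];
    return idxs.map((i) => this.entries[i]);
  }
}
```

Rebuilding the whole index on each mutation is the simplest correct choice at current scale; an incremental update would only matter if mutations became frequent relative to reads.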
Phase 27: Playbook versioning (SHIPPED 2026-04-21)
PlaybookEntry gained version: u32 (default 1), parent_id, superseded_at, superseded_by — all #[serde(default)] so pre-Phase-27 state loads as roots. revise_entry(parent_id, new_entry) appends a new version, stamps the parent superseded, rejects revising a retired or already-superseded parent. history(id) returns the root→tip chain from any node. Superseded entries excluded from boost (same rule as retired). Endpoints: POST /vectors/playbook_memory/revise, GET /vectors/playbook_memory/history/{id}. /status reports superseded as a distinct counter. 8 new tests; 51/51 vectord lib tests green.
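The revise and history rules can be sketched as follows. This is a TypeScript illustration of the Rust methods, not their implementation: `VersionedEntry` and `VersionChain` are hypothetical names, and the timestamp is passed in rather than read from a clock to keep the example deterministic.

```typescript
// Hypothetical TS mirror of the versioning fields on PlaybookEntry.
interface VersionedEntry {
  playbook_id: string;
  version: number;
  parent_id?: string;
  superseded_at?: string;
  superseded_by?: string;
  retired_at?: string;
}

class VersionChain {
  private byId = new Map<string, VersionedEntry>();

  addRoot(playbookId: string): VersionedEntry {
    const root: VersionedEntry = { playbook_id: playbookId, version: 1 };
    this.byId.set(playbookId, root);
    return root;
  }

  // Only the tip of a chain is a valid revise target.
  revise(parentId: string, newId: string, now: string): VersionedEntry {
    const parent = this.byId.get(parentId);
    if (!parent) throw new Error(`unknown parent ${parentId}`);
    if (parent.retired_at !== undefined) throw new Error(`parent ${parentId} is retired`);
    if (parent.superseded_at !== undefined) {
      throw new Error(`parent ${parentId} already superseded by ${parent.superseded_by}`);
    }
    const child: VersionedEntry = {
      playbook_id: newId,
      version: parent.version + 1,
      parent_id: parentId,
    };
    parent.superseded_at = now;
    parent.superseded_by = newId;
    this.byId.set(newId, child);
    return child;
  }

  // Root → tip chain starting from any node in it.
  history(id: string): VersionedEntry[] {
    const start = this.byId.get(id);
    if (!start) return [];
    let node: VersionedEntry = start;
    while (node.parent_id !== undefined) {
      const parent = this.byId.get(node.parent_id);
      if (!parent) break;
      node = parent;
    }
    const chain: VersionedEntry[] = [node];
    while (node.superseded_by !== undefined) {
      const nextNode = this.byId.get(node.superseded_by);
      if (!nextNode) break;
      node = nextNode;
      chain.push(node);
    }
    return chain;
  }
}
```

Rejecting revisions of anything but the tip keeps each chain linear, so history() never has to choose between branches.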
Phase 28+: Further horizon
- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Video/audio transcript ingest + multimodal embeddings
- Neural re-ranker over (query, candidate, outcome) triples — only if Phase 19's statistical feedback plateaus below usable recall
- True distributed query (DataFusion multi-node) — only if single-machine ceilings bite
- Playbook versioning (version + parent_id + retired_at) across gateway + catalogd + mcp-server; the vectord side shipped in Phase 27
- Playbook board (6-phase deep_analysis applied to playbook ranking)
Known ceilings (honest)
The current stack has measurable limits. Documenting them so future decisions aren't based on wishful thinking.
| Dimension | Current ceiling | Breaks at | Escape hatch |
|---|---|---|---|
| Vector count per index (Parquet+HNSW in-RAM) | ~5M on 128GB | Past 5M | Switch that profile's vector_backend to Lance per ADR-019 — IVF_PQ stays on disk-resident quantized codes |
| Concurrent active indexes | ~50-100 at 100K vectors each | 10M×50 configurations | Lance disk-resident + per-profile activation |
| Rows per dataset | 2.47M proven, probably 100M+ fine | Approaches DataFusion memory limits | DataFusion predicate pushdown + partition pruning (existing) |
| Concurrent loaded models | 1-2 on 16GB VRAM (A4000) | 3+ models simultaneous | Not our problem — architectural, driven by Ollama |
| Trial journal growth per index | Thousands of trials, batched JSONL | High-frequency auto-tuning agent | Compaction via /hnsw/trials/{idx}/compact |
| Error journal growth | Bounded by ring buffer (2000 events in-memory) + batched JSONL on disk | Continuous failure scenarios | Compaction + retention policy (TODO) |
Reference Workloads
Workload 1: Staffing Company
Scale-tested on 128GB RAM server:
| Table | Rows | Size | Description |
|---|---|---|---|
| candidates | 100,000 | 10.1 MB | Names, phones, emails, zip, skills, resume text |
| clients | 2,000 | 33 KB | Companies, contacts, verticals |
| job_orders | 15,000 | 0.9 MB | Positions with descriptions, requirements, rates |
| placements | 50,000 | 1.2 MB | Candidate↔job matches with rates, recruiters |
| timesheets | 1,000,000 | 16.7 MB | Weekly hours, bill/pay totals, approvals |
| call_log | 800,000 | 34.3 MB | Phone CDR — who called whom, duration, disposition |
| email_log | 500,000 | 16.0 MB | Email tracking — subject, opened, direction |
| Total | 2,467,000 | 79 MB | 7 tables, cross-referenced |
Benchmarks (2.47M rows)
| Query | Cold (Parquet) | Hot (MemCache) | Speedup |
|---|---|---|---|
| 100K candidate filter (skills+city+status) | 257ms | 21ms | 12x |
| 1M timesheet aggregation + JOIN | 942ms | 96ms | 9.8x |
| 800K call log cross-reference (cold leads) | 642ms | — | — |
| Triple JOIN recruiter performance | 487ms | — | — |
| 500K email open rate aggregation | 259ms | — | — |
| COUNT all 2.47M rows | 84ms | — | — |
| 10K vector semantic search (cosine) | 450ms | — | — |
| Natural language → AI SQL → execute | ~3s | — | (model inference) |
Vector Search
- 10K candidate resumes embedded in 204s (49 chunks/sec via Ollama)
- Semantic search over 10K vectors: ~450ms (brute-force cosine)
- RAG pipeline: question → embed → search → retrieve → LLM answer with citations
- AI correctly refuses to hallucinate when context doesn't support an answer
Agent Workspaces
- Create per-contract workspace with saved searches + shortlists
- Instant handoff between agents — zero data copy
- Full activity timeline preserved across handoffs
Workload 2: Local LLM Knowledge Base
The second use case this substrate is built for. Reference corpus: the running knowledge_base Postgres database (586 team runs, response cache history, pipeline runs, threat intel) + LLMS3.com published corpus (~243 enriched documents).
Target scale on same 128GB server:
- Documents: 10K-100K per model profile
- Chunks after chunking: 500K-5M per profile
- Embedding dimensions: 768 (nomic-embed-text)
- Query latency: <100ms semantic search, <3s end-to-end RAG including LLM generation
- Concurrent model profiles: 2-5 configured, 1-2 active at a time (VRAM-bound)
Measured to date (Phase 7 + Phase 16 prep):
- 100K candidate-resume chunks embedded in 10 min via Ollama nomic-embed-text
- HNSW search at 100% recall, ~1ms p50 on 100K vectors (ec=80 es=30 locked as default)
- Trial journal instrumented and working for parameter tuning
Gaps still to close for this workload:
- Model profiles (Phase 17) — today, "model" is a string, not a first-class entity
- Hot-swap generations (Phase 16) — today, rebuild = downtime
- Scale past 5M vectors — needs Phase 18 Lance evaluation to decide path
Available Local Models
| Model | Use |
|---|---|
| nomic-embed-text | Embeddings (768d): semantic search, RAG retrieval |
| qwen2.5 | SQL generation, structured output, summarization |
| mistral | General generation, longer context |
| gemma2 | General generation |
| llama3.2 | General generation, lightweight |
Non-Goals
- Cloud deployment (local-first, always)
- Full ACID transactions (single-writer model is sufficient)
- Real-time streaming / CDC (batch ingest is the model; scheduled refresh, not transactional replication)
- Replacing the CRM (this is the analytical + AI layer BEHIND the CRM)
- Custom file formats — Parquet for datasets + sidecar indexes for vectors (see ADR-018 for why we stayed Parquet instead of migrating to Lance, and the ceilings that choice implies)
- Hard multi-tenant isolation (profiles and federation provide soft isolation; this is not a SaaS platform with adversarial tenants — operator is single-trust)
Removed from prior non-goals (2026-04-16):
- Multi-tenancy (single-owner system): no longer a non-goal. Federation + profile buckets are now first-class; soft multi-tenancy is a design goal. Hard adversarial multi-tenancy (adversarial tenants on shared infrastructure) remains out of scope.
Risks
Technical Risks
| Risk | Severity | Mitigation |
|---|---|---|
| Vector search in Rust at scale | High | Start brute-force, evaluate hora crate, Qdrant as fallback |
| Incremental updates on Parquet | High | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | High | Conservative schema detection, default to string, user overrides |
| 100K+ embedding timeout | High | Async background job with progress, not single HTTP request |
| Schema evolution across ingests | Medium | Schema fingerprinting + versioned manifests (Phase 14) |
| Memory pressure from hot cache | Medium | LRU eviction, configurable memory limit (tested: 408MB for 1.1M rows) |
| HNSW index persistence | Medium | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | Low | Can replace with direct Ollama HTTP from Rust later |
Strategic Risks (Future-Proofing)
| Risk | Impact | Phase |
|---|---|---|
| No mutation history → can't audit AI decisions | Critical — compliance, trust | Phase 9 (event journal) |
| No metadata → datasets become mystery files | High — onboarding, discovery | Phase 10 (rich catalog) |
| Embeddings locked to one model | High — can't upgrade models | Phase 11 (versioning) |
| Raw SQL as only interface → ungoverned agent access | High — security, auditability | Phase 12 (tool registry) |
| No sensitivity classification → compliance exposure | Medium — grows with data volume | Phase 13 (access control) |
| No schema evolution handling → ingest breaks on format change | Medium — grows with source count | Phase 14 (AI migration) |
Design Principles (Future-Proofing)
These are the decisions that still look smart after the stack changes:
- Store the truth openly. Parquet on object storage. No proprietary formats. Any engine can read it.
- Describe it richly. Every dataset has an owner, lineage, sensitivity tags, freshness contract.
- Never destroy evidence. Every mutation is journaled. Rebuild any state at any point in time.
- Secure it centrally. Permissions live in the data layer, not application code.
- Expose it through reusable interfaces. Named tools with contracts, not raw SQL for every consumer.
- Version everything. Schemas, embeddings, models — all versioned, all coexist during migration.
- Make unstructured data first-class. Every document gets: storage, text extraction, entity tags, chunks, embeddings, linkage.
- Separate storage from compute from intelligence. Scale each independently. Replace any layer without touching the others.
Operating Rules
- PRD > architecture > phases > status > git
- Git is memory, not chat
- No undocumented changes
- No silent architecture drift
- Always work in smallest valid step
- Always verify before moving on
- Flag when something is genuinely hard vs just engineering work
- If a phase reveals the approach is wrong, update the PRD before continuing
- Cheap-now, expensive-later decisions get built first (event journal, metadata, versioning)
- Build the governed interface before the raw interface (tools before SQL for agents)