# PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores

> ## ▶ Direction pivot — 2026-04-22
>
> **`docs/CONTROL_PLANE_PRD.md` is now the long-horizon architecture target.**
> Lakehouse (this document) is being refactored to conform to the Universal AI
> Control Plane architecture — six layers, `/v1/*` universal API, multi-provider
> routing, Truth Layer for DevOps constraints.
>
> This PRD remains the source of truth for **everything Phases 0-37** — the
> shipped staffing / AI-substrate system below is preserved as the reference
> implementation and the first domain-specific consumer of the control plane.
> All existing phases, ADRs, and invariants continue to apply.
>
> **Phases 38+ live in `CONTROL_PLANE_PRD.md`** (Universal API, provider
> adapters, routing engine, expanded profile system, Truth Layer, validation
> pipeline, caller migration).
>
> Cross-read: this PRD for what's shipped, CONTROL_PLANE_PRD for where it's going.

---

**Status:** Active — Phases 0-36 shipped, Phase 37 (hot-swap async) TODO; control-plane pivot in flight; E2E test 9.0/10; stress test passes
**Created:** 2026-03-27
**Last updated:** 2026-04-22 — Control-plane pivot; Gateway seed concurrency (Semaphore); stress test with 6 diverse tasks
**Owner:** J

---

## Problem

### Use case 1 — Staffing analytics (reference implementation)

Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.

A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange.
Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.

### Use case 2 — Local AI knowledge substrate (the second half)

Local LLM workloads need a substrate for ingesting, indexing, and retrieving large knowledge corpora. Each running model (or agent) has its own context — documents it cares about, a vector index tuned to its domain, a scoped view of the catalog. That infrastructure is architecturally identical to the staffing problem: ingest messy data, index it, query it, hand it to an AI. Building one substrate that serves both prevents fragmentation.

Concretely this means a running Ollama model like `qwen2.5:7b` or `claude-code-local` should be able to:

- Bind to a named set of datasets
- Get a scoped vector index pre-warmed for its domain
- Issue searches that only see its bound data
- Have its trial/tuning history isolated from other models
- Swap between knowledge generations (today's, yesterday's) without rebuild

The same infrastructure that lets a recruiter query 2.47M rows of staffing data also lets a local 7B model answer questions grounded in a 500K-chunk documentation corpus. Same substrate, different tenant.
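The "searches that only see its bound data" bullet above is the core of per-model scoping. A minimal in-process sketch, assuming simplified, illustrative types (`ModelProfile`, `Substrate`, and `scoped_search` are hypothetical names, not the shipped API):

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified profile: a named reader with its own dataset bindings.
struct ModelProfile {
    id: String,
    bound_datasets: HashSet<String>,
}

/// Toy catalog mapping dataset name -> rows (stands in for the real substrate).
struct Substrate {
    datasets: HashMap<String, Vec<String>>,
}

impl Substrate {
    /// A search issued through a profile only sees the profile's bound datasets.
    fn scoped_search(&self, profile: &ModelProfile, term: &str) -> Vec<String> {
        self.datasets
            .iter()
            .filter(|(name, _)| profile.bound_datasets.contains(*name))
            .flat_map(|(_, rows)| rows.iter())
            .filter(|row| row.contains(term))
            .cloned()
            .collect()
    }
}

fn main() {
    let mut datasets = HashMap::new();
    datasets.insert("candidates".to_string(), vec!["Java dev, Chicago".to_string()]);
    datasets.insert("docs_corpus".to_string(), vec!["Java GC tuning guide".to_string()]);
    let substrate = Substrate { datasets };

    let recruiter = ModelProfile {
        id: "staffing-recruiter".to_string(),
        bound_datasets: ["candidates".to_string()].into_iter().collect(),
    };

    // The recruiter profile sees candidate rows but never the docs corpus.
    let hits = substrate.scoped_search(&recruiter, "Java");
    assert_eq!(hits, vec!["Java dev, Chicago".to_string()]);
    println!("{}: {:?}", recruiter.id, hits);
}
```

The design point is that scoping lives in the substrate, not in the caller: a model cannot opt out of its bindings by phrasing a query differently.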
### Shared requirements

- Any data source (CSV, DB export, PDF, JSON, Postgres table) can be ingested without pre-defined schemas
- Structured data is queryable via SQL at scale (millions of rows, sub-second)
- Unstructured data is searchable via AI embeddings with per-profile indexes
- An LLM can answer natural language questions against scoped data
- Indexes can be hot-swapped between generations without rebuild downtime
- Trials are first-class data — the system remembers how it was tuned
- Everything runs locally — no cloud APIs, total data privacy
- The system is rebuildable from repository + object storage alone

---

## Solution

A modular Rust service mesh over S3-compatible object storage, with a local AI layer for embeddings and generation.

### Locked Stack

| Layer | Technology | Locked |
|---|---|---|
| Frontend | Dioxus | Yes |
| API | Axum + Tokio | Yes |
| Object Storage Interface | Apache Arrow `object_store` | Yes |
| Storage Backend | LocalFileSystem → RustFS → S3 | Yes |
| Query Engine | DataFusion | Yes |
| Data Format | Parquet + Arrow | Yes |
| RPC (internal) | tonic (gRPC) | Yes |
| AI Runtime | Ollama (local models) | Yes |
| AI Boundary | Direct Ollama HTTP API (gateway → Ollama, no sidecar). Updated 2026-05-02 per ADR-022 — was "Python FastAPI sidecar → Ollama HTTP API". The sidecar's lab_ui/pipeline_lab Python remains as dev-only tooling (not on the hot path). | Yes |
| Vector Index | TBD — evaluate `hora`, `qdrant` crate, or HNSW from scratch | **Open** |

No new frameworks without a documented ADR.
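The hot-swap requirement above reduces to an atomic pointer swap with retained history. A minimal in-process sketch, assuming a simplified single-writer model (`HotSwapIndex` and `Generation` are illustrative names, not the shipped types):

```rust
use std::sync::{Arc, RwLock};

/// Illustrative only: a "generation" is just a label standing in for a built index.
#[derive(Clone, Debug)]
struct Generation {
    trial_id: String,
}

/// Active-generation pointer with promote/rollback. Promotion is a pointer
/// swap from the reader's perspective; history makes rollback always possible.
struct HotSwapIndex {
    history: RwLock<Vec<Arc<Generation>>>, // last element = active generation
}

impl HotSwapIndex {
    fn new(initial: Generation) -> Self {
        Self { history: RwLock::new(vec![Arc::new(initial)]) }
    }

    /// Queries read whatever is active at call time; background builds never block reads.
    fn active(&self) -> Arc<Generation> {
        self.history.read().unwrap().last().unwrap().clone()
    }

    /// Promote a new champion built in the background.
    fn promote(&self, next: Generation) {
        self.history.write().unwrap().push(Arc::new(next));
    }

    /// Revert to the previous generation, if one exists.
    fn rollback(&self) -> bool {
        let mut h = self.history.write().unwrap();
        if h.len() > 1 { h.pop(); true } else { false }
    }
}

fn main() {
    let idx = HotSwapIndex::new(Generation { trial_id: "t1".into() });
    idx.promote(Generation { trial_id: "t2".into() });
    assert_eq!(idx.active().trial_id, "t2"); // next search hits the new config
    assert!(idx.rollback());                 // bad promotion recoverable
    assert_eq!(idx.active().trial_id, "t1");
}
```

A production version would persist the history and guard promotion behind a quality gate, but the reader-facing contract is the same: reads always resolve through the active pointer.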
---

## Architecture

### Services

| Service | Responsibility |
|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth, CORS, body limits, `X-Lakehouse-Bucket` header routing |
| **catalogd** | Metadata control plane — dataset registry, schema versions, manifests, per-dataset resync from parquet footers |
| **storaged** | Object I/O — `BucketRegistry` (multi-backend), rescue fallback, error journal, append-log batching pattern |
| **queryd** | SQL execution — DataFusion over Parquet, MemTable hot cache, delta merge-on-read |
| **ingestd** | Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog |
| **vectord** | Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness) |
| **journald** | Append-only mutation event log (ADR-012) — distinct from storaged's error journal |
| **aibridge** | AI client — direct Ollama HTTP (per ADR-022, 2026-05-02; was the Rust↔Python sidecar boundary). Owns the LRU embed cache. |
| **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs |
| **shared** | Types, errors, Arrow helpers, config, protobuf definitions, **secrets provider trait**, **PII detection** |
| **mcp-server** | Agent gateway (Bun) — MCP tools, intelligence endpoints, scenario observer (:3700) |
| **observer** | Autonomous iteration loop — records ops, error analysis, playbook consolidation (:3800) |
| **scenario** | Day-in-the-life orchestrator — T5 tiers, KB integration, continuation tree-split (:bun) |

**Federation building blocks** (shipped 2026-04-16):

- `shared::secrets::SecretsProvider` trait + `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` (0600 enforced)
- `storaged::registry::BucketRegistry` — multi-bucket resolution with `rescue_bucket` read fallback
- `storaged::append_log::AppendLog` — write-once batched append pattern (no RMW, no small-file problem)
- `storaged::error_journal::ErrorJournal` — bucket operation failure log at `primary://_errors/bucket_errors/batch_*.jsonl`

### Data Flow

```
Raw data → ingestd (normalize, chunk, detect schema)
            ├→ storaged (Parquet files to object storage)
            ├→ catalogd (register dataset + schema)
            ├→ vectord (embed text chunks, build index)
            └→ queryd (auto-register as queryable table)

User question → gateway
            ├→ vectord (semantic search for relevant chunks)    ← RAG path
            ├→ queryd (SQL over structured data)                ← Analytics path
            └→ aibridge → Ollama (generate answer from context)
```

### Query Paths

**Analytical (SQL):** "What's the average bill rate for .NET devs in Chicago?" → DataFusion scans Parquet columnar, returns in <200ms

**Semantic (RAG):** "Find candidates who could do data engineering work" → Embed question → vector search across resume embeddings → retrieve top chunks → LLM answers

**Hybrid (shipped 2026-04-17):** "Find reliable forklift operators in Illinois with OSHA certs" → `POST /vectors/hybrid` with `sql_filter` + `question`: SQL narrows to structurally valid candidates (role, state, reliability, certs), brute-force cosine ranks by semantic relevance within the filtered set, and the LLM generates an answer from SQL-verified records only. Zero hallucinations on the staffing simulation (16/16 positions filled, all workers verified against golden data).

### Invariants

1. Object storage = source of truth for all data
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable, not a proprietary format) — see ADR-018 for the Parquet-vs-Lance trade review
5. ingestd is idempotent — re-ingesting the same file is a no-op
6. Hot cache is a performance layer, not a source of truth — eviction is safe
7. All services modular and independently replaceable
8. **Indexes are hot-swappable.** A new index generation can be built in the background while the current one serves queries. Promotion is atomic (pointer swap). Rollback to a prior generation is always possible. (Phase 16)
9. **Every reader gets its own profile.** Human operators, AI agents, and local models are all clients of the same substrate. Each has a named profile with its own bucket, vector indexes, trial history, and dataset bindings. Profiles are a first-class architectural concept, not a tenancy afterthought. (Phase 17)
10. **Trials are data, not logs.** Every index build is a trial with measurable metrics. The trial journal IS the agent's memory for how to tune itself. Stored as write-once batched JSONL per the ADR-018 append-log pattern.
11. **Operational failures are findable in one HTTP call.** The bucket error journal, trial journal, and audit log expose `/storage/errors`, `/hnsw/trials`, and `/access/audit` with structured filter + aggregation. No `grep` archaeology to answer "what broke?"
12. **Playbooks feed the index, not just the log.** A completed playbook isn't just a record of what worked — it's a signal that shapes future rankings. Every `successful_playbooks` row contributes to the playbook-memory vector index, so semantically similar future operations re-rank toward workers that have actually succeeded in comparable fills. This is the "system gets smarter over time" dimension that distinguishes this substrate from a static search engine. (Phase 19)

---

## Vision drift acknowledged (2026-04-20)

The system as shipped through Phase 18 is a **hybrid SQL+vector search engine with a playbook log**. The original pitch (and the "staffing AI co-pilot" framing) implied a **meta-index that learns from playbooks over time** — hot-swap profiles weren't just routing, they were knowledge generations that compounded. That learning loop was never built; playbooks were write-only.

Phase 19 closes that gap explicitly. The feedback signal is **statistical + semantic**, not neural. No model training — the index reads the playbook journal, computes operation similarity, and boosts endorsed workers at query time.
Rebuildable from `successful_playbooks` alone, same as every other derived index.

---

## Phases

### Phase 0-5: Foundation ✅ COMPLETE

- Rust workspace, Axum gateway, object storage, catalog, DataFusion query engine
- Python sidecar with real Ollama models (embed, generate, rerank)
- Dioxus UI with Ask (NL→SQL), Explore, SQL, System tabs
- gRPC, OpenTelemetry, auth middleware, TOML config
- Validated with a 286K-row staffing company dataset across 7 tables
- Cross-reference queries (JOINs across candidates, placements, timesheets, calls) in <150ms

### Phase 6: Ingest Pipeline

Build the data on-ramp. Accept messy real-world data, normalize it, make it queryable.

| Step | Deliverable | Gate |
|---|---|---|
| 6.1 | `ingestd` crate with CSV parser → Arrow RecordBatch → Parquet | CSV file → queryable dataset |
| 6.2 | JSON ingest (newline-delimited JSON, nested objects) | JSON file → flat Parquet |
| 6.3 | Schema detection — infer column types from data | No manual schema definition needed |
| 6.4 | Deduplication — detect and skip already-ingested files (content hash) | Re-ingest same file = no-op |
| 6.5 | Text chunking — split large text fields for embedding | Long text → overlapping chunks |
| 6.6 | Auto-registration — ingest writes to storage AND registers in catalog | Single API call: file in → queryable |
| 6.7 | Gateway endpoint: `POST /ingest` with file upload | Upload CSV from browser → query in seconds |

**Gate:** Upload a raw CSV or JSON file → auto-detected schema → stored as Parquet → registered → immediately queryable via SQL. No manual steps.

**Risk:** Schema detection on messy data (mixed types, nulls, inconsistent formatting). Mitigation: conservative type inference (default to string), let the user override.

### Phase 7: Vector Index + RAG Pipeline

Make unstructured data searchable by meaning, not just keywords.
| Step | Deliverable | Gate |
|---|---|---|
| 7.1 | `vectord` crate with embedding storage as Parquet (doc_id, chunk_text, vector) | Embeddings stored as portable Parquet |
| 7.2 | Chunking strategy — configurable chunk size + overlap for text columns | Large text fields split into embeddable chunks |
| 7.3 | Brute-force vector search via DataFusion (cosine similarity SQL) | Semantic search works, correctness verified |
| 7.4 | HNSW index for fast approximate nearest neighbor | Search over 100K+ vectors in <50ms |
| 7.5 | RAG endpoint: `POST /rag` — question → embed → search → retrieve → generate | Natural language question → grounded answer |
| 7.6 | Auto-embed on ingest — text columns automatically embedded during ingest | No separate embedding step needed |
| 7.7 | Hybrid search — combine SQL filters with vector similarity | "Java devs in Chicago" (SQL) + "who could do data engineering" (semantic) |

**Gate:** Ingest 15K candidate resumes → auto-embed → ask "find someone who could handle our Kubernetes migration" → system returns relevant candidates ranked by semantic match, with an LLM explanation.

**Risk: HNSW in Rust at scale.** This is the hardest technical problem. Options:

- `hora` crate — Rust-native ANN, but less mature than FAISS
- Store the HNSW index as a serialized file alongside Parquet data
- Fallback: brute-force scan is fine up to ~100K vectors; optimize later
- Nuclear option: use Qdrant as an external vector store (breaks the "no new services" rule)

**Decision needed:** Evaluate `hora` vs external Qdrant vs brute-force at J's data scale.

### Phase 8: Hot Cache + Incremental Updates

Make frequently accessed data fast, and handle real-time updates without a full rewrite.
| Step | Deliverable | Gate |
|---|---|---|
| 8.1 | MemTable hot cache — pin active datasets in memory | Queries on hot data: <10ms |
| 8.2 | Cache policy — LRU eviction based on access patterns | Memory-bounded, auto-manages |
| 8.3 | Incremental writes — append new rows without rewriting the entire Parquet file | Update one candidate's phone → no full table rewrite |
| 8.4 | Merge-on-read — query combines base Parquet + delta files | Correct results from base + updates |
| 8.5 | Compaction — periodic merge of delta files into base Parquet | Prevent delta file proliferation |
| 8.6 | Upsert semantics — insert or update by primary key | Same candidate ID → update in place |

**Gate:** Update a single row in a 15K-row dataset. Query reflects the change immediately. No full Parquet rewrite. Memory cache serves hot data in <10ms.

**Risk: This is the Delta Lake problem.** Full ACID transactions over Parquet files is what Databricks spent years building. We're NOT building Delta Lake — we're building a pragmatic version:

- Append-only delta files (easy)
- Merge-on-read (moderate)
- Compaction (moderate)
- Full ACID isolation (NOT attempting — single-writer model instead)

### Phase 8.5: Agent Workspaces ✅ COMPLETE

Per-contract overlays with daily/weekly/monthly tiers and instant handoff.

- WorkspaceManager with saved searches, shortlists, activity logs
- Zero-copy handoff between agents (pointer swap, not data copy)
- Persisted to object storage, rebuilt on startup

### Phase 9: Event Journal — Never Destroy Evidence

**Principle:** Every mutation is appended, never overwritten. This is the one decision that's impossible to retrofit — once history is lost, it's gone forever.
| Step | Deliverable | Gate |
|---|---|---|
| 9.1 | `journald` crate: append-only event log as Parquet | Every write/update/delete logged with who, when, what, old value, new value |
| 9.2 | Event schema: entity, field, old_value, new_value, actor, timestamp, source, workspace_id | Standardized across all mutations |
| 9.3 | Journal query: `SELECT * FROM journal WHERE entity = 'CAND-001' ORDER BY timestamp` | Full history of any record |
| 9.4 | Replay capability: rebuild any dataset's state at any point in time | Time-travel queries |
| 9.5 | Journal compaction: roll old events into monthly summary Parquet files | Prevent unbounded growth |

**Gate:** Change a candidate's phone number. Query shows the change. Journal shows old value, new value, who changed it, when, and why. Replay to yesterday's state.

**Why now:** In 3 years, compliance, AI auditability, and "why did the agent recommend this candidate" all require mutation history. Adding it later means you only have history from that day forward.

### Phase 10: Rich Catalog v2 — Metadata as Product

**Principle:** Every dataset should be self-describing. A new team member (or AI agent) should understand what data exists, who owns it, how fresh it is, and what's sensitive — without asking anyone.

| Step | Deliverable | Gate |
|---|---|---|
| 10.1 | Catalog schema upgrade: add owner, sensitivity, freshness_sla, description, tags, lineage | `GET /catalog/datasets` returns rich metadata |
| 10.2 | Sensitivity classification: PII, PHI, financial, public, internal | Sensitive fields tagged at ingest |
| 10.3 | Lineage tracking: source_system → ingest_job → dataset → derived_dataset | "Where did this data come from?" answerable |
| 10.4 | Freshness contracts: expected_update_frequency, last_updated, stale_after | Alert when data goes stale |
| 10.5 | Dataset contracts: required columns, type expectations, validation rules | Ingest rejects data that breaks the contract |
| 10.6 | Auto-documentation: AI generates dataset description from schema + sample data | New datasets self-describe on ingest |

**Gate:** Ingest a CSV. System auto-detects PII columns (email, phone, SSN patterns), tags them, generates a description, sets owner, and tracks lineage back to the source file.

**Why now:** Every dataset you ingest without metadata becomes a "mystery file" in 6 months. The metadata layer makes the difference between a searchable knowledge platform and a data graveyard.

### Phase 11: Embedding Versioning — Model-Proof Vector Layer

**Principle:** Embedding models will change. If you don't track which model created which vectors, upgrading means re-embedding everything from scratch.

| Step | Deliverable | Gate |
|---|---|---|
| 11.1 | Vector index metadata: model_name, model_version, dimensions, created_at | Every index knows its embedding model |
| 11.2 | Multi-version indexes: same data, different models, coexist | Search specifies which model version |
| 11.3 | Incremental re-embed: only new/changed docs get re-embedded on model upgrade | Model swap doesn't require a full re-embed |
| 11.4 | A/B search: query both old and new model, compare results | Validate model upgrade before committing |

**Gate:** Upgrade from nomic-embed-text to a new model. Old index still works. New index builds incrementally. Compare search quality. Switch when ready.

### Phase 12: Tool Registry — Agent-Safe Business Actions

**Principle:** In 3 years, AI agents won't just query — they'll act. Instead of every agent getting raw SQL access, expose named, governed, audited business actions.
| Step | Deliverable | Gate |
|---|---|---|
| 12.1 | Tool definition: name, description, parameters, permissions, audit_level | `search_candidates(skills, city, min_years)` as a registered tool |
| 12.2 | Tool execution: validates params, checks permissions, logs usage, runs query | Agent calls tool, gets results, action is logged |
| 12.3 | Read vs write tools: read tools are permissive, write tools require confirmation | `get_candidate` = auto-approved, `update_phone` = requires review |
| 12.4 | MCP-compatible interface: expose tools via Model Context Protocol | Any MCP-compatible agent (Claude, GPT, local) can use them |
| 12.5 | Rate limiting + quotas per agent/tool | Prevent a runaway agent from overwhelming the system |

**Gate:** An AI agent calls `search_candidates(skills="Python,AWS", city="Chicago", available=true)` → gets results → calls `shortlist_candidate(workspace_id, candidate_id, reason)` → action is logged, auditable, reversible.

**Why now:** The tool interface is cheap to build (it's just named endpoints with validation). But retrofitting audit logging and permission checks onto raw SQL access is a nightmare. Build the governed interface first.
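The read-vs-write gate and the always-on audit log in 12.2/12.3 can be sketched in a few lines. A minimal in-process sketch under simplified assumptions — `Registry`, `Tool`, and the handlers here are illustrative, not the shipped API:

```rust
use std::collections::HashMap;

/// Illustrative tool registry: named actions with a permission gate and an audit log.
#[derive(Clone, Copy, PartialEq)]
enum ToolKind { Read, Write }

struct Tool {
    kind: ToolKind,
    run: fn(&HashMap<String, String>) -> String, // handler: named params -> result
}

struct Registry {
    tools: HashMap<&'static str, Tool>,
    audit: Vec<String>, // every invocation is recorded, allowed or denied
}

impl Registry {
    /// Read tools are permissive; write tools require explicit confirmation.
    fn call(
        &mut self,
        name: &str,
        params: HashMap<String, String>,
        confirmed: bool,
    ) -> Result<String, String> {
        let (kind, run) = match self.tools.get(name) {
            Some(t) => (t.kind, t.run),
            None => return Err("unknown tool".to_string()),
        };
        if kind == ToolKind::Write && !confirmed {
            self.audit.push(format!("DENIED {name} (unconfirmed write)"));
            return Err("write tool requires confirmation".to_string());
        }
        self.audit.push(format!("OK {name} params={params:?}"));
        Ok(run(&params))
    }
}

fn main() {
    let mut reg = Registry { tools: HashMap::new(), audit: Vec::new() };
    reg.tools.insert("get_candidate", Tool {
        kind: ToolKind::Read,
        run: |p| format!("candidate {}", p.get("id").cloned().unwrap_or_default()),
    });
    reg.tools.insert("update_phone", Tool { kind: ToolKind::Write, run: |_| "updated".to_string() });

    let mut p = HashMap::new();
    p.insert("id".to_string(), "CAND-001".to_string());
    assert!(reg.call("get_candidate", p.clone(), false).is_ok()); // read: auto-approved
    assert!(reg.call("update_phone", p.clone(), false).is_err()); // write: needs review
    assert!(reg.call("update_phone", p, true).is_ok());           // confirmed write
    assert_eq!(reg.audit.len(), 3); // denied attempts are audited too
}
```

Note the denied call still lands in the audit log — that is the property that is painful to retrofit onto raw SQL access.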
### Phase 13: Security & Access Control

| Step | Deliverable | Gate |
|---|---|---|
| 13.1 | Field-level sensitivity tags (PII, PHI, financial) in catalog | Sensitive fields identified |
| 13.2 | Row-level access policies (agent A sees their candidates only) | Policy evaluated at query time |
| 13.3 | Column masking (show last 4 of SSN, redact salary for non-managers) | Masked results based on role |
| 13.4 | Query audit log (who queried what, when, which fields) | Every data access recorded |
| 13.5 | Policy-as-code (TOML/YAML rules, not hardcoded) | Non-engineer can update access rules |

### Phase 14: Schema Evolution + AI Migration

| Step | Deliverable | Gate |
|---|---|---|
| 14.1 | Schema diff detection: old schema vs new ingest → list changes | "Column renamed: first_name → full_name" |
| 14.2 | AI-generated migration rules: LLM suggests column mappings | "full_name = concat(first_name, ' ', last_name)" |
| 14.3 | Migration preview: show how old data maps to new schema before applying | Human approves before data transforms |
| 14.4 | Versioned schemas in catalog: v1, v2, v3 coexist | Queries specify a version or use latest |

### Phase 15: Infrastructure horizon items

- [x] HNSW vector index with trial system (shipped 2026-04-16)
- [x] Federation foundation — ADR-017 (shipped 2026-04-16)
- [x] Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
- [x] Federation layer 2 — runtime bucket lifecycle, per-index bucket scoping, profile bucket auto-provisioning (shipped 2026-04-17)
- [x] MySQL streaming connector — mirrors the Postgres path, verified on live MariaDB (shipped 2026-04-17)
- [x] PDF OCR for scanned documents — Tesseract 5.5 fallback when lopdf yields no text (shipped 2026-04-17)
- [x] Scheduled ingest — interval-based per-source schedules with CRUD + run-now + auto-trigger agent (shipped 2026-04-17)
- [ ] Multi-node query distribution (DataFusion supports this architecturally)

### Phase 16: Hot-Swap Index Generations

Make indexes upgrade in place without dropping queries.

| Step | Deliverable | Gate |
|---|---|---|
| 16.1 | "Active generation" pointer per logical index name | `/vectors/search` routes to the current champion automatically |
| 16.2 | Background trial runner: watches the trial journal, proposes configs (random search / Bayesian), fires `/hnsw/trial` | Agent autonomously tunes without a human POSTing each config |
| 16.3 | Promotion endpoint: `POST /hnsw/promote/{index}/{trial_id}` atomically swaps the active pointer | Next search hits the new config, zero downtime |
| 16.4 | Rollback: `POST /hnsw/rollback/{index}` reverts to the previous generation | Bad promotion recoverable in milliseconds |
| 16.5 | Dataset-append triggers: when `POST /ingest/file` writes to a dataset with attached vector indexes, schedule an automatic re-trial (not a full rebuild) | New docs get embedded + indexed without manual intervention |

**Gate:** Run the trial agent for 10 minutes against `resumes_100k_v2` with a fresh eval set. It explores the `ef_construction × ef_search` space, promotes the Pareto winner, and continues running. Zero human clicks. All trials and promotions appear in `/hnsw/trials/resumes_100k_v2`.

**Risk:** The agent loops into a bad region (e.g. always proposes ef_construction=1). Mitigation: a hardcoded config-space constraint + a minimum-quality gate (don't promote anything with recall <0.9).

### Phase 17: Model Profiles + Dataset Bindings

Make "different models see different data" real instead of a config string.
| Step | Deliverable | Gate |
|---|---|---|
| 17.1 | `ModelProfile` manifest: id, ollama_name, bucket, bound_datasets[], hnsw_config, embed_model | `GET /models` lists profiles; `POST /models` creates one |
| 17.2 | Profile activation endpoint: `POST /profile/{id}/activate` — warms the EmbeddingCache for bound indexes, builds HNSW with the profile's config | Next search against bound indexes is <1ms cold |
| 17.3 | Model-scoped search: `POST /search?model=X` filters to bound datasets only | Model A can't see Model B's datasets unless explicitly shared |
| 17.4 | VRAM-aware activation: only one (or a small N) model loaded at a time on the 16GB A4000 | Activating model B unloads model A via Ollama's keep_alive=0 |
| 17.5 | Audit: every tool invocation by a model is logged with model identity | `GET /models/{id}/audit` shows exactly what each model touched |

**Gate:** Two model profiles defined: `staffing-recruiter` (bound to candidates/placements/timesheets) and `docs-assistant` (bound to a documentation corpus). Activate staffing-recruiter, search for candidates — works. Switch to docs-assistant, same search — returns zero from staffing (not bound) but finds docs. VRAM shows only one embedding model loaded at a time.

**VRAM reality:** A 16GB A4000 realistically holds 1-2 loaded models concurrently. "Multi-model" in practice means sequential swap between profiles, not parallel serving. The profile abstraction makes this swap clean.

### Phase 18: Storage format decision (Lance evaluation)

The question was raised 2026-04-16 after J's LLMS3 knowledge base identified Lance as `alternative_to` Parquet for vector workloads. Current stack: Parquet with binary-blob vector columns + an in-RAM HNSW sidecar. Evaluated against: Lance's native vector format with disk-resident indexes.
| Step | Deliverable | Decision criteria |
|---|---|---|
| 18.1 | ✅ Parallel Lance-backed vector index for `resumes_100k_v2` in standalone `crates/lance-bench` | Built 2026-04-16 |
| 18.2 | ✅ Head-to-head benchmark across 8 dimensions (cold-load, search latency, disk, index build, random access, append) | Complete |
| 18.3 | ✅ ADR-019 committed with measured data and decision | See `docs/ADR-019-vector-storage.md` |

**Outcome:** Hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance joins as a second backend for Phase 16 hot-swap (14× faster index builds), Phase C append workloads (0.08s vs a full rewrite), RAG random-access retrieval (112× faster), and indexes past the ~5M-vector RAM ceiling. Per-profile `vector_backend: Parquet | Lance` becomes part of Phase 17 (model profiles). See ADR-019 for the full scorecard and caveats.

### Phase 19: Playbook memory (meta-index) — the feedback loop

Make successful playbooks actually improve future searches. Today `successful_playbooks` is a write-only log; future-you looks at it and thinks "cool, we filled Toledo welders once" — but the index has no idea it happened, so the next Toledo-welder search ranks the same as if none of those fills had existed. Phase 19 closes the loop.

| Step | Deliverable | Gate |
|---|---|---|
| 19.1 | Embed every `successful_playbooks` row — operation + approach + context → one chunk per playbook | A new dataset `playbook_memory` appears in the catalog with N rows = row count of `successful_playbooks` |
| 19.2 | Vector index on `playbook_memory` (HNSW or Lance — whichever the `agent-parquet` profile uses) | `/vectors/search` against `playbook_memory` returns semantically similar past playbooks |
| 19.3 | Endorsement extraction: each playbook row has `fills[]` (worker_ids it succeeded with). Parse them out at ingest time and store them in a sidecar `playbook_endorsements` Parquet keyed by playbook_id | `SELECT * FROM playbook_endorsements WHERE playbook_id = 'X'` returns the worker_ids |
| 19.4 | `/vectors/hybrid` gains opt-in `use_playbook_memory: bool`. When true: after hybrid ranks candidates, find the top-K similar past playbooks (semantic search over `playbook_memory`), extract endorsed worker_ids, add a bounded boost to candidates in the endorsed set, re-rank | A search where the "right" worker is known from a prior playbook ranks higher with the flag than without |
| 19.5 | Write-through from the multi-agent orchestrator: when two agents seal a playbook, it appends to `successful_playbooks` AND triggers a refresh of `playbook_memory` (via the existing Phase C stale-mark path). The next query sees the new signal. | Run the orchestrator → inspect `playbook_memory` → see a new row. Run the same query before/after → ranking differs. |
| 19.6 | Ceiling-aware boost: cap the per-worker boost so one popular worker can't dominate future searches. Boost decays with time (optional) so stale playbooks matter less. | Synthetic test: 100 playbooks all filled with the same worker_id; the 101st search still returns a mix, not just that one worker |

**Gate:** Run a real search before and after a new successful playbook lands. The endorsed workers from similar past operations rank higher in the second call. Demonstrable with a diff of the two result sets.

**Why this is the right version of "meta-index":** The alternative — training a neural re-ranker on (query, candidate, outcome) triples — is a weeks-long ML story and requires labeled outcome data we don't really have. The statistical-semantic version here is rebuildable from the existing playbook journal, explainable ("boosted because of similar playbooks X, Y, Z"), and invalidatable (delete a playbook → the boost goes away on the next rebuild). It gets 80% of the payoff at 10% of the cost.
Neural re-ranking stays as a future option if it bites.

**Non-goals for this phase:**

- Neural training / fine-tuning. Statistical feedback only.
- Hard guarantees about recall-lift magnitude. "Measurably better on the demo query" is the gate, not a universal quality claim.
- Real-time recomputation on every playbook. Batched refresh via the existing stale-marking path is sufficient.

### Phase 19 refinement (WIRED 2026-04-21): geo-filter + role prefilter on boost

The Item-3 diagnostic pass surfaced that `compute_boost_for` was ranking playbooks globally by cosine similarity, while candidates came from an SQL-filtered city. Result: the boost map had 170 endorsed workers, 0 of which intersected the 50 Nashville-filtered candidates. Zero citations where there should have been dozens.

Fix — in `crates/vectord/src/playbook_memory.rs`:

- `compute_boost_for_filtered(target_geo)` — skip playbooks from other cities before the cosine sort.
- `compute_boost_for_filtered_with_role(target_geo, target_role)` — multi-strategy: an exact (role, city, state) match earns similarity=1.0 and fills up to half the top_k; cosine fallback fills the rest. Mirrors Mem0/Zep 2026 guidance on parallel-strategy rerank.

In `crates/vectord/src/service.rs`:

- `extract_target_geo` and `extract_target_role` pull both from the executor's SQL filter.
- `tracing::info!` emits `playbook_boost: boosts=N sources=N parsed=N matched=N target_geo=? target_role=?` on every hybrid_search. The silent-truncation class of bug is now visible.

Citation lift measured: avg citations per run 0.32 → 1.38 after the geo filter; then 2 → 28 in the single-scenario Riverfront Steel re-run after the role prefilter landed — a 14× delta on the same scenario.

Unit tests: `extract_target_geo_basic`, `_missing_state_returns_none`, `_word_boundary` (rejects the "civilian" substring), `extract_target_role_basic`, `_none_when_absent`, `_multi_word` — all pass (`cargo test -p vectord --lib extractor_tests`).
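The capped boost from 19.6 plus the geo-filter fix above reduce to a small amount of arithmetic. A minimal sketch under simplified assumptions — flat structs, an illustrative cap value, no time decay; these are not the shipped `vectord` types:

```rust
use std::collections::HashMap;

/// Illustrative shapes only — stand-ins for playbook_memory rows.
struct Playbook {
    city: &'static str,
    similarity: f32,           // cosine similarity to the current operation
    fills: Vec<&'static str>,  // endorsed worker_ids
}

/// Sum similarity-weighted endorsements per worker, but only from playbooks in
/// the target city (the geo-filter fix), and cap the per-worker boost so one
/// popular worker can't dominate future searches (the 19.6 ceiling).
fn compute_boost(
    playbooks: &[Playbook],
    target_geo: &str,
    cap: f32,
) -> HashMap<&'static str, f32> {
    let mut boost: HashMap<&'static str, f32> = HashMap::new();
    for pb in playbooks.iter().filter(|pb| pb.city == target_geo) {
        for &worker in &pb.fills {
            let b = boost.entry(worker).or_insert(0.0);
            *b = (*b + pb.similarity).min(cap);
        }
    }
    boost
}

fn main() {
    let playbooks = vec![
        Playbook { city: "Nashville", similarity: 0.9,  fills: vec!["W1", "W2"] },
        Playbook { city: "Toledo",    similarity: 0.95, fills: vec!["W3"] }, // geo-filtered out
        Playbook { city: "Nashville", similarity: 0.8,  fills: vec!["W1"] },
    ];
    let boost = compute_boost(&playbooks, "Nashville", 1.0);
    assert!(boost.get("W3").is_none()); // other-city endorsement never reaches the boost map
    assert_eq!(*boost.get("W2").unwrap(), 0.9);
    assert_eq!(*boost.get("W1").unwrap(), 1.0); // 0.9 + 0.8 capped at 1.0
}
```

Without the `city` filter, W3's 0.95-similarity endorsement would outrank both Nashville workers while intersecting zero Nashville candidates — exactly the 170-boosts/0-citations failure described above.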
### Phase 20: Model Matrix + Overseer Tiers (WIRED 2026-04-21)

Five-tier routing is declared in `config/models.json`. The hot path (T1/T2) stays local (qwen3.5 + qwen3, after mistral was dropped for a 0/14 fill rate on complex scenarios). Cloud serves overview (T3 gpt-oss:120b), strategic (T4 qwen3.5:397b), and gatekeeper (T5 kimi-k2-thinking). Every tier declares `context_window` + `context_budget` + `overflow_policy`.

- T1 hot: 50-200 calls/scenario, local only — `qwen3.5:latest` executor, `think:false`
- T2 review: 5-14 calls/event, local only — `qwen3:latest` reviewer, `think:false`
- T3 overview: 1-3 calls/scenario, cloud primary — `gpt-oss:120b` on Ollama Cloud, thinking on
- T4 strategic: 1-10 calls/day, cloud primary
- T5 gatekeeper: 1-5 calls/day, audit-logged

T3 checkpoints + cross-day lessons are wired. Lessons archive to `data/_playbook_lessons/` and load back at the next scenario start as `prior_lessons` in the executor context. Cloud passthrough was verified on the stress_01 scenario with `LH_OVERVIEW_CLOUD=1` — `gpt-oss:120b` response latency is consistently 4-8s, diagnosing the city-pivot ("Gary IN → Chicago IL, 40mi") when the target city has zero supply.

`think:false` is the key mechanical finding — qwen3.5 burns ~650 tokens of hidden reasoning before emitting a response; hot-path JSON emitters MUST disable thinking, or continuation has to paper over empty returns. T3/T4 overseers KEEP thinking (that's the point).

**Kimi-k2.6 upgrade path:** The current Ollama Cloud key returns 403 on kimi-k2.6 (`ollama run kimi-k2.6:cloud` requires `ollama signin` with a pro-tier account). kimi-k2.5 substitutes on the current tier — same family, strong at tool calling. The swap to k2.6 is a one-line change in `applyToolLevel` once the subscription lands.

### Phase 21: Scratchpad + Tree-Split Continuation

**Why this is a phase and not an optimization:** bumping `max_tokens` until a response stops truncating is a tourniquet — J called this out explicitly.
As playbooks accumulate into the hundreds and responses grow, eventually SOME request will exceed SOME model's window, and we can't solve it by raising a number. The stable answer is two primitives that let us handle arbitrary-size work without losing context: a scratchpad that glues multi-call responses together, and a tree split that shards oversized inputs and reduces them back.

**Two primitives (WIRED 2026-04-21 in `tests/multi-agent/agent.ts`):**

1. **`generateContinuable()`** — handles OUTPUT overflow. Calls the model; checks structural completeness (for JSON: matched braces + JSON.parse success; for text: non-empty). If incomplete, calls again with "continue from here" + the partial response as scratchpad. Up to `max_continuations` times. No `max_tokens` tuning needed — if thinking ate the initial budget, continuation picks up the slack.
2. **`generateTreeSplit()`** — handles INPUT overflow. The caller passes an array of shards (semantic chunks of the corpus). For each shard: a map call with a running scratchpad digest. A final reduce call produces the answer. The scratchpad truncates its oldest content if it approaches its own budget. If a single shard still overflows, `assertContextBudget` throws — the caller must re-shard at finer granularity, NOT silently truncate.

**Guarantees:**

1. No agent call can silently truncate. Either it completes, continues, or throws with numbers.
2. No corpus is too big — `generateTreeSplit` handles any size the caller can shard.
3. The scratchpad is the glue between multi-call responses; context is never lost, only compacted.
4. Token estimation uses `chars / 4` (biased safe by ~15%) until we wire the provider's tokenizer.
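The continuation loop can be sketched without any model client: `callModel` is an injected stub below, and the completeness check mirrors the matched-braces + `JSON.parse` rule described above. Function names are illustrative, not the shipped `agent.ts` API:

```typescript
type Shape = "json" | "text";

// Structural completeness: JSON must have balanced braces AND parse;
// plain text just has to be non-empty.
function isComplete(partial: string, shape: Shape): boolean {
  if (shape === "text") return partial.trim().length > 0;
  let depth = 0;
  for (const ch of partial) {
    if (ch === "{") depth++;
    if (ch === "}") depth--;
  }
  if (depth !== 0) return false;
  try {
    JSON.parse(partial);
    return true;
  } catch {
    return false;
  }
}

// Continuation loop: keep the partial response as a scratchpad and ask the
// model to continue from it, up to maxContinuations extra calls. Either the
// result is structurally complete or the caller gets a loud throw — never a
// silent truncation.
function generateContinuableSketch(
  callModel: (prompt: string, scratchpad: string) => string,
  prompt: string,
  shape: Shape,
  maxContinuations = 3,
): string {
  let scratchpad = callModel(prompt, "");
  for (let i = 0; i < maxContinuations && !isComplete(scratchpad, shape); i++) {
    scratchpad += callModel(`continue from here:\n${prompt}`, scratchpad);
  }
  if (!isComplete(scratchpad, shape)) {
    throw new Error(`incomplete after ${maxContinuations} continuations`);
  }
  return scratchpad;
}
```

This is why no `max_tokens` tuning is needed: a response clipped mid-object simply fails the brace check and triggers another call with the partial as scratchpad.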
**What lives where now:**

- `agent.ts::estimateTokens()` + `assertContextBudget()` + `generateContinuable()` + `generateTreeSplit()` — WIRED
- `scenario.ts` executor + reviewer + overviewGenerate calls — migrated to `generateContinuable`
- `config/models.json` — context_window + context_budget + overflow_policies per tier (declarative)

**Next sprint (Rust side, so gateway tools share it):**

- `crates/aibridge/src/continuation.rs` — port of `generateContinuable`
- `crates/aibridge/src/tree_split.rs` — port of `generateTreeSplit`
- `crates/storaged/src/chunk_cache.rs` — precomputed shards keyed by corpus hash (avoid re-chunking on every T4 run)
- `/metrics` counter: `context_continuations_total{model,shape,succeeded}`

**Status:** TS primitives WIRED. Rust port pending. The escalation path (tree split → bigger-context cloud model → kimi-k2:1t's 1M window → split decision into sub-decisions) is declared in `config/models.json` under `context_management.overflow_policies`.

### Phase 21 status update (WIRED 2026-04-21 evening)

Additional primitives landed after the initial commit:

- **`think: boolean`** flag plumbed through `generate()`, `generateCloud()`, `generateContinuable()`, and into the sidecar's `/generate` endpoint. Enables per-call opt-out of hidden reasoning for hot-path JSON emitters. Verified: qwen3.5 with `think:false` + `num_predict:400` returns clean `{"worker_id":...}` on the first call; without `think:false`, 650 tokens are eaten by reasoning and the response comes back empty.
- **Cloud executor routing** — `ACTIVE_EXECUTOR_CLOUD` / `ACTIVE_REVIEWER_CLOUD` flags let a per-staffer tool_level route the executor to Ollama Cloud when a weak local model (qwen2.5) would collapse. Verified on kimi-k2.5 via Ollama Cloud: clean JSON emission, think:false honored.

The Rust port of the continuation + tree-split primitives remains queued for next sprint (`crates/aibridge/src/continuation.rs`, `tree_split.rs`).

### Phase 22: Internal Knowledge Library (KB)

Meta-layer over Phase 19 playbook_memory.
Playbook memory answers "which WORKERS worked for this event." The KB answers "which CONFIG worked for this playbook signature." The subject changes from workers to the system itself — model choice, budget hints, overflow policies, pathway notes.

**Files (`data/_kb/`):**

- `signatures.jsonl` — (sig_hash, embedding[], first_seen, last_seen, run_count). Sig = stable hash of the sequence of (kind, role, count, city, state) across events.
- `outcomes.jsonl` — per-run record: {sig, run_id, models, ok/total, turns, citations, per-event summary, elapsed}.
- `pathway_recommendations.jsonl` — AI-synthesized for the next run: {confidence, rationale, top_models, budget_hints, pathway_notes, neighbors_consulted}.
- `error_corrections.jsonl` — detected fail→succeed pairs on the same sig, with a diff of what changed.
- `config_snapshots.jsonl` — history of models.json changes + why.

**Cycle (event-driven, not wall-clock):**

1. Scenario ends → `kb.indexRun()` extracts the signature, embeds the spec digest, appends the outcome.
2. `kb.recommendFor(nextSpec)` finds k-NN signatures via cosine, feeds their outcome history + recent error corrections to the overview model, writes a structured recommendation.
3. Next scenario starts → `kb.loadRecommendation(spec)` pulls the newest rec for this sig, injects `pathway_notes` into `guidanceFor()` alongside prior lessons.

**Why file-based for MVP:** Phase 19 playbook_memory is already a catalogd dataset. The KB is a separate meta-layer; keep it file-based first to iterate without a gateway schema migration. The Rust port (and promotion to a vectord-indexed corpus for neighbor search at scale) lands once the shape stabilizes — mirrors how Phase 21 primitives were TS-first → Rust next sprint.
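Steps 1-2 of the cycle reduce to a stable signature hash plus cosine k-NN over stored signature embeddings. A minimal sketch, assuming simple shapes; the FNV-style `sigHash` below is a toy stand-in, not the hash the shipped `kb.ts` actually uses:

```typescript
interface SignatureRow {
  sig: string;
  embedding: number[];
}

// Toy stable hash over the (kind, role, count, city, state) event sequence —
// illustrative only; the real sig_hash algorithm is unspecified here.
function sigHash(events: Array<[string, string, number, string, string]>): string {
  let h = 2166136261;
  for (const ch of JSON.stringify(events)) {
    h = (h ^ ch.charCodeAt(0)) >>> 0;
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h.toString(16);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Step 2 of the cycle: k-NN over stored signature embeddings. The neighbors'
// outcome histories are what get fed to the overview model.
function findNeighborSigs(
  query: number[],
  rows: SignatureRow[],
  k: number,
): Array<{ sig: string; score: number }> {
  return rows
    .map((r) => ({ sig: r.sig, score: cosine(query, r.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Hash stability is the property that matters: the same event sequence must land on the same `signatures.jsonl` row across runs, or outcome history never accumulates.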
**What the overview model gets asked:**

- Target scenario digest
- Top-k neighbor signatures with avg ok rate, best model combo per neighbor
- Recent error corrections (sig, before/after model set)

**What it outputs (JSON-constrained):**

- confidence (high/medium/low)
- rationale (2-3 sentences)
- top_models {executor, reviewer, overview}
- budget_hints {executor_max_tokens, reviewer_max_tokens, executor_think}
- pathway_notes (concrete pre-run advice)

**Status (WIRED 2026-04-21):** `tests/multi-agent/kb.ts` holds all primitives. scenario.ts reads the rec at start, indexes + recommends at end. Cold start gracefully writes a "low confidence, no history" rec so the second run has a floor to build on. `snapshotConfig()` is wired to fire at every scenario start — active model set + tool_level + cloud flags are hashed and appended to `config_snapshots.jsonl`.

**Phase 22 item B — cloud rescue (WIRED):** When an event fails and cloud T3 is enabled, `requestCloudRemediation()` feeds the failure trace (SQL filters attempted, row counts, reviewer drift reasons, gap signals, contract terms) to cloud and parses a JSON remediation with new_city / new_state / new_role / new_count / rationale. The event retries once with the pivot. Verified: 1/3 rescues succeeded on stress_01 (a Gary IN → South Bend IN pivot filled a Welder that local drift-aborted). A sanitizer splits "City, ST" comma-packed outputs so downstream SQL doesn't get `Hammond, IN, IN`.

### Phase 23: Staffer identity + competence-weighted retrieval (WIRED 2026-04-21)

Answers "who handled this" as a first-class dimension of the matrix index. Senior staffers' playbooks rank higher than juniors' on similar scenarios via a competence × similarity score. Auto-discovers "reliable performer" worker labels via cross-staffer endorsement overlap.

**Schema (`scenario.ts` ScenarioSpec):**

- `contract?: ContractTerms` — deadline, budget_per_hour_max, local_bonus_per_hour, local_bonus_radius_mi, fill_requirement. Propagates into T3 checkpoint + cloud rescue prompts so cloud reasons about trade-offs (pivot-within-radius before budget-pivot-further).
- `staffer?: Staffer` — {id, name, tenure_months, role, tool_level}. tool_level controls the subsystems available to this run:
  - `full` — qwen3.5 + qwen3 local + cloud T3 + cloud rescue
  - `local` — qwen3.5 + qwen3 local + local gpt-oss:20b T3 + rescue
  - `basic` — **kimi-k2.5 cloud** exec + qwen3 local reviewer + local T3, no rescue
  - `minimal` — kimi-k2.5 cloud exec + qwen3 local reviewer, NO T3, NO rescue — tests whether playbook inheritance carries knowledge alone

**KB staffer indexing (`data/_kb/staffers.jsonl`):**

- Recomputed per-staffer on every run: total_runs, fill_rate, avg_turns_per_event, avg_citations_per_run, rescue_rate, competence_score.
- `competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate`. Bounded 0..1.

**Weighted neighbor retrieval:**

- `findNeighbors` in `kb.ts` returns `weighted_score = cosine × max_staffer_competence` (floor 0.3). Senior playbooks rank above junior playbooks on similar scenarios.
- `pathway_recommendations` include `best_staffer_id` / `best_staffer_competence` so cloud knows WHOSE playbook it's synthesizing from.

**Cross-staffer auto-discovery:**

- `scripts/kb_staffer_report.py` emits a leaderboard + workers endorsed across ≥2 staffers on the same signature.
- Validated output: Rachel D. Lewis (Welder Nashville) endorsed 12× across 4 staffers; Christina Watson (Machine Op Indianapolis) 11×. These are the highest-confidence "reliable performer" labels the system has produced without human tagging.

**Demo infrastructure:**

- `tests/multi-agent/gen_staffer_demo.ts` — 4 personas × 3 contracts = 12 scenario specs.
- `scripts/run_staffer_demo.sh` — sequential batch with cloud T3.
- `scripts/kb_staffer_report.py` — leaderboard + top/bottom differential + cross-staffer overlap.
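The competence formula and the weighted-retrieval rule above can be sketched directly. The interface shape is an assumption; the weights, the 0..1 bound, and the 0.3 floor are from the spec:

```typescript
interface StafferStats {
  fillRate: number;        // 0..1
  turnEfficiency: number;  // 0..1, higher = fewer turns per event
  citationDensity: number; // 0..1, normalized citations per run
  rescueRate: number;      // 0..1
}

// competence_score = 0.45·fill_rate + 0.20·turn_efficiency
//                  + 0.20·citation_density + 0.15·rescue_rate, bounded 0..1.
function competenceScore(s: StafferStats): number {
  const raw =
    0.45 * s.fillRate +
    0.20 * s.turnEfficiency +
    0.20 * s.citationDensity +
    0.15 * s.rescueRate;
  return Math.min(1, Math.max(0, raw));
}

// weighted_score = cosine × max_staffer_competence, with a 0.3 floor so a
// cold-start staffer doesn't zero out an otherwise-similar playbook.
function weightedNeighborScore(cosineSim: number, maxCompetence: number): number {
  return cosineSim * Math.max(0.3, maxCompetence);
}
```

The floor is the interesting design choice: without it, a new staffer's playbooks would be unretrievable until they had history, which would prevent them from ever building history.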
### Phase 24: Observer / Autotune integration (SHIPPED 2026-04-20, commit b95dd86)

The gap: `lakehouse-observer.service` wrapped MCP :3700, while `tests/multi-agent/scenario.ts` hit gateway :3100 directly. The observer sat idle at 0 ops across 3600+ cycles — scenarios invisible to ERROR_ANALYZER and PLAYBOOK_BUILDER, autotune running blind to outcomes.

**What shipped:**

- `observer.ts` Bun HTTP listener on `OBSERVER_PORT` (default 3800): `GET /health`, `GET /stats` (totals, by_source, recent scenario digest), `POST /event` for scenario outcomes.
- `ObservedOp` carries provenance — `source="scenario" | "mcp"` + `staffer_id` + `sig_hash` + `event_kind` + geo + rescue flags.
- `recordExternalOp()` — shared ring-buffer insert; the main analyzer + playbook builder no longer care where the op came from.
- `persistOp()` fix: the old path POSTed to `/ingest/file?name=observed_operations`, which has REPLACE semantics (wiped prior ops); it now uses an append-friendly write-through.

### Phase 25: Validity windows + playbook retirement (SHIPPED 2026-04-21, commit e0a843d)

Zep 2026-era finding: temporal validity is the single highest-value memory-hygiene primitive. `PlaybookEntry` gained `schema_fingerprint` / `valid_until` / `retired_at` / `retirement_reason`. `compute_boost_for_filtered_with_role` skips retired + expired entries before geo/cosine ranking. Two retirement paths: `retire_one(id, reason)` for manual retirement, `retire_on_schema_drift(city, state, fp, reason)` for a batch migration sweep. Endpoint: `POST /vectors/playbook_memory/retire`.

### Phase 26: Mem0 upsert + Letta geo hot cache (SHIPPED 2026-04-21, commit 640db8c)

Same-day re-seed no longer duplicates rows. `/seed` with `append=true` routes through `upsert_entry`, which decides ADD / UPDATE / NOOP on `(operation, day, city, state)`. Playbook_id stays stable on UPDATE so existing citations remain valid.
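The ADD / UPDATE / NOOP decision keyed on `(operation, day, city, state)` is a small pure rule. A TypeScript illustration of it, not the shipped Rust `upsert_entry`; the `payload` field is a hypothetical stand-in for everything an entry carries beyond its key:

```typescript
interface SeedEntry {
  operation: string;
  day: string; // e.g. "2026-04-21"
  city: string;
  state: string;
  payload: string; // stand-in for the entry body beyond its key
}

type UpsertDecision = "ADD" | "UPDATE" | "NOOP";

const keyOf = (e: SeedEntry) => [e.operation, e.day, e.city, e.state].join("|");

// ADD when the key is unseen; UPDATE when the key exists but the body changed
// (playbook_id stays stable, so citations keep resolving); NOOP when nothing
// changed — this is why same-day re-seed no longer duplicates rows.
function upsertDecision(
  existing: Map<string, SeedEntry>,
  incoming: SeedEntry,
): UpsertDecision {
  const prior = existing.get(keyOf(incoming));
  if (!prior) return "ADD";
  return prior.payload === incoming.payload ? "NOOP" : "UPDATE";
}
```

This also explains the Phase 36 E2E note later in this document: "persist=0 is expected — same day+op = NOOP, not add."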
`PlaybookMemory.geo_index: HashMap<(city, state), Vec>` is rebuilt on every mutation; geo-filtered boost queries skip the scan and hit an O(1) lookup — sub-ms at current scale, and the same code path scales to 100K+ entries.

### Phase 27: Playbook versioning (SHIPPED 2026-04-21)

`PlaybookEntry` gained `version: u32` (default 1), `parent_id`, `superseded_at`, `superseded_by` — all `#[serde(default)]` so pre-Phase-27 state loads as roots. `revise_entry(parent_id, new_entry)` appends a new version, stamps the parent superseded, and rejects revising a retired or already-superseded parent. `history(id)` returns the root→tip chain from any node. Superseded entries are excluded from boost (same rule as retired). Endpoints: `POST /vectors/playbook_memory/revise`, `GET /vectors/playbook_memory/history/{id}`. `/status` reports `superseded` as a distinct counter. 8 new tests; 51/51 vectord lib tests green.

### Phase 28: Agent Gateway + MCP Server (SHIPPED 2026-04-21)

The agent gateway wraps the Rust substrate with an MCP-first interface (Model Context Protocol), enabling Claude Code, GPT agents, and internal scripts to interact with the lakehouse through named tools rather than raw HTTP. Built on Bun (not Node), it serves as the "front door" for all AI consumers.
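The named-tool surface can be sketched as a registry mapping tool names to substrate endpoints, with unknown names failing loudly. The registry shape and `resolveTool` helper are assumptions for illustration; the tool names and endpoints are a subset of the shipped set:

```typescript
interface ToolSpec {
  method: "GET" | "POST";
  endpoint: string; // path on the Rust substrate
}

// Agents see named tools with contracts, never raw routes — the governed
// interface comes before the raw interface (operating rule 10).
const TOOLS: Record<string, ToolSpec> = {
  search_workers: { method: "POST", endpoint: "/vectors/hybrid" },
  query_sql: { method: "POST", endpoint: "/query/sql" },
  rag_question: { method: "POST", endpoint: "/vectors/rag" },
  get_playbooks: { method: "GET", endpoint: "/playbooks" },
  vram_status: { method: "GET", endpoint: "/vram" },
};

// Resolve a tool call to a concrete request; an unknown tool throws instead
// of falling through to an ungoverned route.
function resolveTool(name: string): ToolSpec {
  const spec = TOOLS[name];
  if (!spec) throw new Error(`unknown tool: ${name}`);
  return spec;
}
```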
**Files:** `mcp-server/index.ts` (2241 lines)

| Tool | Purpose | Endpoint |
|---|---|---|
| `search_workers` | Hybrid SQL+vector (the core) | `POST /vectors/hybrid` |
| `query_sql` | Analytical SQL on any dataset | `POST /query/sql` |
| `match_contract` | Find workers for a job order | `POST /match` |
| `get_worker` | Single worker by ID | `POST /worker/:id` |
| `rag_question` | Full RAG pipeline | `POST /vectors/rag` |
| `log_success` | Record operation → playbook | `POST /log` |
| `log_failure` | Record failed fill (negative signal) | `POST /log_failure` |
| `get_playbooks` | Retrieve past successes | `GET /playbooks` |
| `swap_profile` | Hot-swap model+data context | `POST /profile/:id` |
| `vram_status` | GPU introspection | `GET /vram` |

**HTTP Routes (internal agent consumption):**

- `/search` — hybrid search with client blacklist + rate enrichment
- `/verify` — claim verification against golden dataset
- `/context` — self-orientation for agents
- `/clients/:client/blacklist` — per-client worker exclusion
- `/memory/query` — unified memory surface (Phase 24 refinement)
- `/system/summary` — truthful row counts via SQL
- `/models/matrix` — read `config/models.json`

### Phase 29: Intelligence Suite (SHIPPED 2026-04-21)

Market intelligence endpoints surfaced through the MCP server — real-world demand signals from public data, cross-referenced with the worker bench for staffing gap analysis.
**Files:** `mcp-server/index.ts` continued

| Endpoint | Purpose | Data Source |
|---|---|---|
| `/intelligence/market` | Building permits → demand forecast | Chicago Socrata API |
| `/intelligence/staffing_forecast` | 30-day demand by role | Chicago permits + bench |
| `/intelligence/permit_contracts` | Permits + Phase 19 ranked candidates | Chicago permits + Workers 500K |
| `/intelligence/activity` | Activity feed + learned patterns | successful_playbooks |
| `/intelligence/brief` | Parallel analytics across 500K profiles | Workers 500K |
| `/intelligence/chat` | Natural language → routed queries | Hybrid + RAG |

**Market signals captured:**

- Largest permits by cost (top 50)
- Work type breakout (electrical, mechanical, masonry, plumbing)
- Cross-reference with IL bench supply/available/reliable
- Demand forecast using the $150K/worker industry heuristic

### Phase 30: Observer + Autotune Integration (SHIPPED 2026-04-20, commit b95dd86)

The gap: `lakehouse-observer.service` wrapped MCP :3700, while `tests/multi-agent/scenario.ts` hit gateway :3100 directly. The observer sat idle at 0 ops across 3600+ cycles — scenarios invisible to ERROR_ANALYZER and PLAYBOOK_BUILDER, autotune running blind to outcomes.

**Files:** `mcp-server/observer.ts` (335 lines)

**What shipped:**

- `observer.ts` Bun HTTP listener on `OBSERVER_PORT` (default 3800): `GET /health`, `GET /stats` (totals, by_source, recent scenario digest), `POST /event` for scenario outcomes.
- `ObservedOp` carries provenance — `source="scenario" | "mcp"` + `staffer_id` + `sig_hash` + `event_kind` + geo + rescue flags.
- `recordExternalOp()` — shared ring-buffer insert; the main analyzer + playbook builder no longer care where the op came from.
- `persistOp()` fix: the old path POSTed to `/ingest/file?name=observed_operations`, which has REPLACE semantics (wiped prior ops); it now uses an append-friendly write-through to `data/_observer/ops.jsonl`.
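The shared ring-buffer insert can be sketched as a fixed-capacity buffer whose oldest slots get overwritten, keeping observer memory bounded. Class and field names are assumptions; capacity is a parameter here (the shipped ring size for observed ops is not pinned down in this document):

```typescript
interface ObservedOp {
  source: "scenario" | "mcp"; // provenance, per the ObservedOp spec above
  staffer_id?: string;
  sig_hash?: string;
  event_kind?: string;
}

// Fixed-capacity ring: once full, each new op overwrites the oldest slot,
// so memory stays bounded no matter how long the observer runs.
class OpRing {
  private buf: ObservedOp[] = [];
  private next = 0;

  constructor(private capacity: number) {}

  record(op: ObservedOp): void {
    if (this.buf.length < this.capacity) {
      this.buf.push(op);
    } else {
      this.buf[this.next] = op; // overwrite the oldest slot
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Roughly what GET /stats reports: totals plus a by_source breakdown.
  stats(): { total: number; bySource: Record<string, number> } {
    const bySource: Record<string, number> = {};
    for (const op of this.buf) {
      bySource[op.source] = (bySource[op.source] ?? 0) + 1;
    }
    return { total: this.buf.length, bySource };
  }
}
```

Because analyzers read from the shared ring rather than from the transport, `source` provenance is the only thing distinguishing a scenario op from an MCP op downstream.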
**HTTP API:**

- `GET /health` — ring size
- `GET /stats` — totals, by_source, recent scenario ops, rescue stats
- `POST /event` — record scenario outcomes with provenance

### Phase 31: Scenario Orchestrator with Tiers (SHIPPED 2026-04-21)

A day-in-the-life scenario test running six events against the live substrate. Exercises baseline_fill, recurring, expansion, emergency, misplacement. Routes through the same executor/reviewer loop as the single-task orchestrator, with event-specific constraints.

**Files:** `tests/multi-agent/scenario.ts` (1772 lines)

**Model Matrix (configurable per-tier):**

| Tier | Purpose | Model | Context Budget |
|---|---|---|---|
| T1 hot | 50-200 calls/scenario | qwen3.5:latest | 8K, think:false |
| T2 review | 5-14 calls/event | qwen3:latest | 8K, think:false |
| T3 overview | 1-3 calls/scenario | gpt-oss:120b | 32K, thinking on |
| T4 strategic | 1-10 calls/day | gpt-oss:120b | 64K |
| T5 gatekeeper | 1-5 calls/day | kimi-k2.6 | 128K, audit-logged |

**Staffer tool_levels (Phase 23):**

- `full` — qwen3.5 + qwen3 local + cloud T3 + cloud rescue
- `local` — qwen3.5 + qwen3 local + local gpt-oss:20b T3 + rescue
- `basic` — kimi-k2.5 cloud exec + qwen3 local reviewer, no rescue
- `minimal` — kimi-k2.5 cloud exec + qwen3 local reviewer, NO T3, NO rescue

**Key features:**

- `generateContinuable()` — handles output overflow with continuation
- `generateTreeSplit()` — handles input overflow via sharding
- `think:false` flag — disables hidden reasoning for hot-path JSON emitters
- Cloud executor routing — `ACTIVE_EXECUTOR_CLOUD` / `ACTIVE_REVIEWER_CLOUD` flags
- Cloud rescue (Phase 22B) — `requestCloudRemediation()` on failure
- KB integration at start (load recommendation) and end (index + recommend)

### Phase 32: Knowledge Library + Staffer Indexing (SHIPPED 2026-04-21)

Meta-layer over Phase 19 playbook_memory. Tracks which configs worked for which playbook signatures.
**Files:** `tests/multi-agent/kb.ts` (600 lines)

**Files under `data/_kb/`:**

- `signatures.jsonl` — (sig_hash, embedding[], first_seen, last_seen, run_count)
- `outcomes.jsonl` — per-run: {sig, run_id, models, ok/total, turns, citations}
- `pathway_recommendations.jsonl` — AI-synthesized for next run
- `error_corrections.jsonl` — detected fail→succeed pairs
- `config_snapshots.jsonl` — history of model changes
- `staffers.jsonl` — per-staffer competence scores

**Staffer competence scoring:**

```
competence_score = 0.45·fill_rate + 0.20·turn_efficiency + 0.20·citation_density + 0.15·rescue_rate
```

**Weighted neighbor retrieval:**

- `findNeighbors` returns `weighted_score = cosine × max_staffer_competence` (floor 0.3)
- Senior playbooks rank above junior playbooks on similar scenarios

### Phase 33: Validity Windows + Playbook Retirement (SHIPPED 2026-04-21)

Zep 2026-era finding: temporal validity is the single highest-value memory-hygiene primitive.

**What shipped:**

- `PlaybookEntry` gained `schema_fingerprint` / `valid_until` / `retired_at` / `retirement_reason`
- `compute_boost_for_filtered_with_role` skips retired + expired before geo/cosine ranking
- Two retirement paths: `retire_one(id, reason)` manual, `retire_on_schema_drift(city, state, fp, reason)` batch
- Endpoint: `POST /vectors/playbook_memory/retire`

### Phase 34: Mem0 Upsert + Geo Hot Cache (SHIPPED 2026-04-21, commit 640db8c)

Same-day re-seed no longer duplicates rows.

**What shipped:**

- `/seed` with `append=true` routes through `upsert_entry` — decides ADD / UPDATE / NOOP on `(operation, day, city, state)`
- Playbook_id stays stable on UPDATE so existing citations remain valid
- `PlaybookMemory.geo_index: HashMap<(city, state), Vec>` rebuilt on every mutation
- O(1) geo-filtered boost queries — sub-ms at current scale

### Phase 36: Gateway Seed Concurrency Fix (SHIPPED 2026-04-22)

Problem: Concurrent `playbook_memory/seed` calls caused socket collisions with the Python AI sidecar. Two staffing coordinators hitting seed simultaneously → one socket error.

Solution:

- Added `embed_semaphore: Arc<Semaphore>` to `VectorState` (permits=1)
- The seed endpoint acquires a permit before embedding and releases it after
- Serializes embed calls to the sidecar, eliminating the collision

E2E test result: 8.0/10 avg (persist=0 is expected — same day+op = NOOP, not add)

### Phase 37: Hot-Swap Endpoint Hang (TODO)

Problem: `POST /vectors/profile/{id}/activate` hangs for 2-4 minutes because it does heavy work synchronously:

- Loads embeddings into memory
- Builds HNSW indexes
- Preloads the Ollama model
- Auto-provisions buckets

Current workaround: the client uses a 5-second timeout (the test passes but doesn't verify hot-swap).

**Required fix:** Convert to an async job pattern:

1. The endpoint accepts the request, spawns a background task, and returns a job_id immediately
2. The client polls `GET /jobs/{job_id}` for status
3. Progress events are streamed via SSE during activation
4. The background task updates job status on completion

This is a known architectural-debt item; not blocking, since tests use the timeout workaround.

### Phase 35+: Further horizon

- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Video/audio transcript ingest + multimodal embeddings
- Neural re-ranker over (query, candidate, outcome) triples — only if Phase 19's statistical feedback plateaus
- True distributed query (DataFusion multi-node) — only if single-machine ceilings bite

---

## Known ceilings (honest)

The current stack has measurable limits. Documenting them here so future decisions aren't based on wishful thinking.
| Dimension | Current ceiling | Breaks at | Escape hatch |
|---|---|---|---|
| Vector count per index (Parquet+HNSW in-RAM) | ~5M on 128GB | Past 5M | Switch that profile's `vector_backend` to Lance per ADR-019 — IVF_PQ stays on disk-resident quantized codes |
| Concurrent active indexes | ~50-100 at 100K vectors each | 10M×50 configurations | Lance disk-resident + per-profile activation |
| Rows per dataset | 2.47M proven, probably 100M+ fine | Approaches DataFusion memory limits | DataFusion predicate pushdown + partition pruning (existing) |
| Concurrent loaded models | 1-2 on 16GB VRAM (A4000) | 3+ models simultaneous | Not our problem — architectural, driven by Ollama |
| Trial journal growth per index | Thousands of trials, batched JSONL | High-frequency auto-tuning agent | Compaction via `/hnsw/trials/{idx}/compact` |
| Error journal growth | Bounded by ring buffer (2000 events in-memory) + batched JSONL on disk | Continuous failure scenarios | Compaction + retention policy (TODO) |

---

## Reference Workloads

### Workload 1: Staffing Company

Scale-tested on a 128GB RAM server:

| Table | Rows | Size | Description |
|---|---|---|---|
| candidates | 100,000 | 10.1 MB | Names, phones, emails, zip, skills, resume text |
| clients | 2,000 | 33 KB | Companies, contacts, verticals |
| job_orders | 15,000 | 0.9 MB | Positions with descriptions, requirements, rates |
| placements | 50,000 | 1.2 MB | Candidate↔job matches with rates, recruiters |
| timesheets | 1,000,000 | 16.7 MB | Weekly hours, bill/pay totals, approvals |
| call_log | 800,000 | 34.3 MB | Phone CDR — who called whom, duration, disposition |
| email_log | 500,000 | 16.0 MB | Email tracking — subject, opened, direction |
| **Total** | **2,467,000** | **79 MB** | **7 tables, cross-referenced** |

### Benchmarks (2.47M rows)

| Query | Cold (Parquet) | Hot (MemCache) | Speedup |
|---|---|---|---|
| 100K candidate filter (skills+city+status) | 257ms | 21ms | 12x |
| 1M timesheet aggregation + JOIN | 942ms | 96ms | 9.8x |
| 800K call log cross-reference (cold leads) | 642ms | — | — |
| Triple JOIN recruiter performance | 487ms | — | — |
| 500K email open rate aggregation | 259ms | — | — |
| COUNT all 2.47M rows | 84ms | — | — |
| 10K vector semantic search (cosine) | 450ms | — | — |
| Natural language → AI SQL → execute | ~3s | — | (model inference) |

### Vector Search

- 10K candidate resumes embedded in 204s (49 chunks/sec via Ollama)
- Semantic search over 10K vectors: ~450ms (brute-force cosine)
- RAG pipeline: question → embed → search → retrieve → LLM answer with citations
- The AI correctly refuses to hallucinate when the context doesn't support an answer

### Agent Workspaces

- Create a per-contract workspace with saved searches + shortlists
- Instant handoff between agents — zero data copy
- Full activity timeline preserved across handoffs

### Workload 2: Local LLM Knowledge Base

The second use case this substrate is built for. Reference corpus: the running `knowledge_base` Postgres database (586 team runs, response cache history, pipeline runs, threat intel) + the LLMS3.com published corpus (~243 enriched documents).
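The question → embed → search → retrieve → answer pipeline from the Vector Search notes above can be sketched as pure orchestration, with the embed and generate stages injected so the shape is visible without a live Ollama instance. Everything here is illustrative (the sketch assumes normalized embeddings, so a dot product stands in for cosine); citations come straight from retrieved chunk IDs, which is what keeps answers grounded:

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Orchestration only: embed/generate are stand-ins for the real model calls.
function ragAnswer(
  question: string,
  corpus: Chunk[],
  embed: (text: string) => number[],
  generate: (prompt: string) => string,
  topK = 3,
): { answer: string; citations: string[] } {
  const q = embed(question);
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, v, i) => sum + v * b[i], 0);
  const hits = corpus
    .map((c) => ({ c, score: dot(q, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  const context = hits.map((h) => `[${h.c.id}] ${h.c.text}`).join("\n");
  return {
    // Context-only prompting is what lets the model refuse rather than
    // hallucinate when retrieval comes back empty or off-topic.
    answer: generate(`Answer from the context only.\n${context}\nQ: ${question}`),
    citations: hits.map((h) => h.c.id),
  };
}
```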
Target scale on the same 128GB server:

- Documents: 10K-100K per model profile
- Chunks after chunking: 500K-5M per profile
- Embedding dimensions: 768 (nomic-embed-text)
- Query latency: <100ms semantic search, <3s end-to-end RAG including LLM generation
- Concurrent model profiles: 2-5 configured, 1-2 active at a time (VRAM-bound)

Measured to date (Phase 7 + Phase 16 prep):

- 100K candidate-resume chunks embedded in 10 min via Ollama nomic-embed-text
- HNSW search at 100% recall, ~1ms p50 on 100K vectors (ec=80 es=30 locked as default)
- Trial journal instrumented and working for parameter tuning

Gaps still to close for this workload:

- Model profiles (Phase 17) — today, "model" is a string, not a first-class entity
- Hot-swap generations (Phase 16) — today, rebuild = downtime
- Scale past 5M vectors — needs the Phase 18 Lance evaluation to decide the path

---

## Available Local Models

| Model | Use |
|---|---|
| `nomic-embed-text` | Embeddings (768d) — semantic search, RAG retrieval |
| `qwen2.5` | SQL generation, structured output, summarization |
| `mistral` | General generation, longer context |
| `gemma2` | General generation |
| `llama3.2` | General generation, lightweight |

---

## Non-Goals

- Cloud deployment (local-first, always)
- Full ACID transactions (the single-writer model is sufficient)
- Real-time streaming / CDC (batch ingest is the model; scheduled refresh, not transactional replication)
- Replacing the CRM (this is the analytical + AI layer BEHIND the CRM)
- Custom file formats — Parquet for datasets + sidecar indexes for vectors (see ADR-018 for why we stayed Parquet instead of migrating to Lance, and the ceilings that choice implies)
- Hard multi-tenant isolation (profiles and federation provide soft isolation; this is not a SaaS platform with adversarial tenants — the operator is single-trust)

Removed from prior non-goals (2026-04-16):

- ~~Multi-tenancy (single-owner system)~~ — federation + profile buckets are now first-class; soft multi-tenancy is a design goal. Hard adversarial multi-tenancy (adversarial tenants on shared infrastructure) remains out of scope.

---

## Risks

### Technical Risks

| Risk | Severity | Mitigation |
|---|---|---|
| Vector search in Rust at scale | **High** | Start brute-force, evaluate the `hora` crate, Qdrant as fallback |
| Incremental updates on Parquet | **High** | Delta files + merge-on-read, NOT full Delta Lake |
| Legacy data messiness | **High** | Conservative schema detection, default to string, user overrides |
| 100K+ embedding timeout | **High** | Async background job with progress, not a single HTTP request |
| Schema evolution across ingests | **Medium** | Schema fingerprinting + versioned manifests (Phase 14) |
| Memory pressure from hot cache | **Medium** | LRU eviction, configurable memory limit (tested: 408MB for 1.1M rows) |
| HNSW index persistence | **Medium** | Serialize alongside Parquet, rebuild on startup |
| Python sidecar as bottleneck | **Low** | Can replace with direct Ollama HTTP from Rust later |

### Strategic Risks (Future-Proofing)

| Risk | Impact | Phase |
|---|---|---|
| No mutation history → can't audit AI decisions | **Critical** — compliance, trust | Phase 9 (event journal) |
| No metadata → datasets become mystery files | **High** — onboarding, discovery | Phase 10 (rich catalog) |
| Embeddings locked to one model | **High** — can't upgrade models | Phase 11 (versioning) |
| Raw SQL as only interface → ungoverned agent access | **High** — security, auditability | Phase 12 (tool registry) |
| No sensitivity classification → compliance exposure | **Medium** — grows with data volume | Phase 13 (access control) |
| No schema evolution handling → ingest breaks on format change | **Medium** — grows with source count | Phase 14 (AI migration) |

---

## Design Principles (Future-Proofing)

These are the decisions that still look smart after the stack changes:

1. **Store the truth openly.** Parquet on object storage. No proprietary formats. Any engine can read it.
2.
**Describe it richly.** Every dataset has an owner, lineage, sensitivity tags, freshness contract.
3. **Never destroy evidence.** Every mutation is journaled. Rebuild any state at any point in time.
4. **Secure it centrally.** Permissions live in the data layer, not application code.
5. **Expose it through reusable interfaces.** Named tools with contracts, not raw SQL for every consumer.
6. **Version everything.** Schemas, embeddings, models — all versioned, all coexist during migration.
7. **Make unstructured data first-class.** Every document gets: storage, text extraction, entity tags, chunks, embeddings, linkage.
8. **Separate storage from compute from intelligence.** Scale each independently. Replace any layer without touching the others.

---

## Operating Rules

1. PRD > architecture > phases > status > git
2. Git is memory, not chat
3. No undocumented changes
4. No silent architecture drift
5. Always work in the smallest valid step
6. Always verify before moving on
7. Flag when something is genuinely hard vs just engineering work
8. If a phase reveals the approach is wrong, update the PRD before continuing
9. **Cheap-now, expensive-later decisions get built first** (event journal, metadata, versioning)
10. **Build the governed interface before the raw interface** (tools before SQL for agents)