# PRD — Universal AI Control Plane

**Status:** Long-horizon architecture target as of 2026-04-22. Lakehouse Phases 0-37 (`docs/PRD.md`) are preserved as the reference implementation and first domain-specific consumer. Phases 38+ (control-plane layers) are sequenced below.

**Current domain: staffing.** The immediate proving ground is the staffing substrate already built — synthetic workers_500k, contracts, emails, SMS drafts, playbook memory. Everything Phases 38-44 ship is validated first against that domain. The DevOps / Terraform / Ansible framing from the original PRD draft stays as a **long-horizon target** — architecture-compatible but not in current scope. See §Long-horizon domains at the bottom.

**Owner:** J

**Cross-read:** `docs/PRD.md` for what's shipped (staffing + AI substrate, 13 crates, ~3M rows). This doc covers the layered architecture those pieces now fit into.

---

## Phase Sequencing (Phases 38-44)

Ship each phase before starting the next. Each ends with green tests + a docs update.

| Phase | Layer | What ships | Est. LOC | Risk |
|---|---|---|---|---|
| 38 | Layer 1 skeleton | `/v1/chat`, `/v1/usage`, `/v1/sessions` routes forwarding to existing `aibridge` → Ollama. Bot migrates as first consumer. | ~400 | Low — additive, no existing routes touched |
| 39 | Layer 3 adapters | `aibridge::ProviderAdapter` trait; Ollama + one new (OpenRouter). `/v1/chat` routes by config. | ~500 | Low-medium |
| 40 | Layer 2 engine | Rules-based routing (`config/routing.toml`), fallback chains, cost gating. Add Gemini + Claude adapters. | ~600 | Medium |
| 41 | Profile split | Separate Retrieval / Memory / Execution / Observer profiles; Phase 17 backward-compat. Absorbs Phase 37 hot-swap-async. | ~300 | Medium |
| 42 | Truth Layer | New `crates/truth`; Terraform/Ansible schemas; `/v1/context` serves rules to router + observer. | ~700 | Medium |
| 43 | Validation pipeline | Syntax/lint/dry-run/policy gates per output type. Plugs into Layer 5 execution loop. | ~400 | Medium |
| 44 | Caller migration | All internal callers route through `/v1/chat`. Direct sidecar access deprecated. | ~200 | Low |

**Total ≈3100 LOC.** Phase 37 (hot-swap async) folds into Phase 41 — it's an Execution-Profile activation concern.

---

## Phase 38 — Universal API Skeleton

**Goal:** An OpenAI-compatible `/v1/*` surface exists and forwards to the existing aibridge → Ollama path. Nothing about multi-provider yet — just the SHAPE, so every downstream piece (adapters, routing, usage accounting) has a surface to plug into.

**Ships:**

- `crates/gateway/src/v1/mod.rs` — router + `/v1/chat`, `/v1/usage`, `/v1/sessions`
- `crates/gateway/src/v1/ollama.rs` — shape adapter (OpenAI chat ↔ existing aibridge `GenerateRequest`)
- One-line `nest("/v1", ...)` in `crates/gateway/src/main.rs`
- Unit test: `POST /v1/chat` roundtrips through a mocked provider

**Gate:**

- `curl -X POST localhost:3100/v1/chat -H 'Content-Type: application/json' -d '{"model":"qwen3.5:latest","messages":[{"role":"user","content":"hi"}]}'` returns a valid OpenAI-shape response.
- `GET localhost:3100/v1/usage` returns `{requests, prompt_tokens, completion_tokens, total_tokens}`.
- `GET localhost:3100/v1/sessions` returns `{data:[]}` (stub; real impl Phase 41).
- `cargo test -p gateway` green.

**Non-goals (explicit):** streaming, tool calls, function calling, session state, multi-provider, fallback, cost gating.

**Risk:** Low — additive, doesn't touch existing routes. Worst case: `/v1/*` returns 502 and we fix the adapter. No existing caller is affected.
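To pin down what "OpenAI-shape" means for the Phase 38 gate, here is a minimal sketch of the `/v1/chat` request/response types as serde structs. The field names follow the OpenAI chat-completions wire format; the type names and module placement are illustrative, not a committed schema (the real home is `crates/gateway/src/v1/mod.rs`).

```rust
use serde::{Deserialize, Serialize};

/// Incoming body for `POST /v1/chat` (OpenAI chat-completions shape).
#[derive(Clone, Deserialize)]
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
}

#[derive(Clone, Serialize, Deserialize)]
pub struct ChatMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

/// Outgoing body: the same shape regardless of which provider answered.
#[derive(Serialize)]
pub struct ChatResponse {
    pub id: String,
    pub model: String,
    pub choices: Vec<Choice>,
    pub usage: Usage,
}

#[derive(Serialize)]
pub struct Choice {
    pub index: u32,
    pub message: ChatMessage,
    pub finish_reason: String,
}

/// Token counts; these feed the `/v1/usage` aggregates.
#[derive(Serialize)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
}
```

The `Clone` derives are not required by Phase 38 itself; they anticipate the Phase 40 fallback chain, where a request may be retried against a second provider.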
---

## Phase 39 — Provider Adapter Refactor

**Goal:** `aibridge` grows a `ProviderAdapter` trait. The Ollama implementation wraps existing sidecar code. One new provider lands as proof: **OpenRouter** (simplest — it's OpenAI-compatible, so the adapter is mostly passthrough).

**Ships:**

- `crates/aibridge/src/provider.rs` — `ProviderAdapter` trait with `chat()` + `embed()` + `unload()` methods
- `crates/aibridge/src/providers/ollama.rs` — existing sidecar code moved behind the trait
- `crates/aibridge/src/providers/openrouter.rs` — new, HTTP client to `openrouter.ai/api/v1/chat/completions`
- `config/providers.toml` — provider registry (name, base_url, auth, default_models)
- `/v1/chat` routes by the `model` field: prefix match (e.g. `openrouter/anthropic/claude-3.5-sonnet` → OpenRouter; bare names → Ollama)

**Gate:**

- `/v1/chat` with `model: "qwen3.5:latest"` hits Ollama → green
- `/v1/chat` with `model: "openrouter/openai/gpt-4o-mini"` hits OpenRouter (key from secrets.toml) → green
- Neither call leaks provider-specific fields upward. The response is always the `/v1/chat` shape.

**Non-goals:** Fallback chain (Phase 40), cost gating (Phase 40), Gemini/Claude adapters (Phase 40).

**Risk:** Low-medium. The trait extraction is mostly a rearrange; OpenRouter is thin. The biggest risk is secret-loading conventions — `SecretsProvider` is already in place, so reuse that path.
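A sketch of what the Phase 39 trait could look like. The method list (`chat`, `embed`, `unload`) comes from this PRD; the signatures, the `ProviderError` type, the use of the `async_trait` crate, and the `provider_for` helper are assumptions for illustration only.

```rust
use async_trait::async_trait;

// `ChatRequest` / `ChatResponse` are the `/v1/chat` shapes from the
// Phase 38 sketch above. The error type here is a placeholder.
use crate::v1::{ChatRequest, ChatResponse};

#[derive(Debug)]
pub struct ProviderError(pub String);

#[async_trait]
pub trait ProviderAdapter: Send + Sync {
    /// Normalized chat call: takes and returns the `/v1/chat` shapes,
    /// never provider-specific ones (the Layer 3 guarantee).
    async fn chat(&self, req: ChatRequest) -> Result<ChatResponse, ProviderError>;

    /// Embedding call for the retrieval path.
    async fn embed(&self, input: &[String]) -> Result<Vec<Vec<f32>>, ProviderError>;

    /// Release local resources (a no-op for remote providers).
    async fn unload(&self) -> Result<(), ProviderError>;
}

/// Prefix-match routing by `model` (hypothetical helper): names carrying
/// a registered provider prefix route there; bare names go to Ollama.
pub fn provider_for(model: &str) -> &'static str {
    match model.split_once('/') {
        Some(("openrouter", _)) => "openrouter",
        _ => "ollama",
    }
}
```

Keeping `chat()` in terms of the universal shapes is what makes the Phase 39 gate ("no provider-specific fields leak upward") checkable at the type level.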
---

## Phase 40 — Routing & Policy Engine

**Goal:** Replace hardcoded T1-T5 routing with a rules engine. Add Gemini + Claude adapters. Cost gating is enforced at the router level.

**Ships:**

- `crates/aibridge/src/routing.rs` — rules engine (matches on: task type, token budget, previous attempt failures, profile ID)
- `config/routing.toml` — rules in TOML (human-editable, hot-reloadable)
- `crates/aibridge/src/providers/gemini.rs` — `generativelanguage.googleapis.com` adapter
- `crates/aibridge/src/providers/claude.rs` — `api.anthropic.com` adapter
- Fallback chain support: if the primary returns 5xx or times out, try the next provider in the chain
- Cost gate: per-request budget + per-provider daily budget

**Gate:**

- A rule like "local models for simple JSON emitters, cloud for reasoning" fires correctly by task type
- Primary fails → the fallback provider is hit, and the response still matches the `/v1/chat` shape
- Daily budget hit → subsequent requests return 429 with a clear `Retry-After` header
- `/v1/usage` reports a per-provider breakdown

**Non-goals:** Retrieval Profile split (Phase 41), Truth Layer (Phase 42).

**Risk:** Medium. Multi-provider auth + cost tracking is cross-cutting. Mitigation: every provider call is wrapped in a single `dispatch()` function, and all observability flows through there.
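A sketch of what one `config/routing.toml` rule could deserialize into, plus the single `dispatch()` chokepoint the risk note calls for. The match inputs (task type, token budget) come from this PRD; the TOML field names, the chain-of-model-names shape, and the dispatch signature are assumptions.

```rust
use serde::Deserialize;

use crate::provider::{ProviderAdapter, ProviderError};
use crate::v1::{ChatRequest, ChatResponse};

/// One rule as it might appear in `config/routing.toml`
/// (illustrative field names, not a committed schema):
///
/// [[rule]]
/// task_type = "json_emit"
/// max_prompt_tokens = 4096
/// chain = ["qwen3.5:latest", "openrouter/openai/gpt-4o-mini"]
#[derive(Deserialize)]
pub struct RoutingRule {
    pub task_type: String,
    pub max_prompt_tokens: Option<u64>,
    /// Fallback chain, tried in order on 5xx / timeout.
    pub chain: Vec<String>,
}

#[derive(Deserialize)]
pub struct RoutingConfig {
    #[serde(rename = "rule", default)]
    pub rules: Vec<RoutingRule>,
}

/// The single chokepoint every provider call flows through, so fallback
/// and cost accounting live in exactly one place.
pub async fn dispatch(
    chain: &[Box<dyn ProviderAdapter>],
    req: ChatRequest,
) -> Result<ChatResponse, ProviderError> {
    let mut last_err = ProviderError("empty fallback chain".into());
    for provider in chain {
        match provider.chat(req.clone()).await {
            Ok(resp) => return Ok(resp), // first success wins
            Err(e) => last_err = e,      // 5xx / timeout: try next in chain
        }
    }
    Err(last_err)
}
```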
---

## Phase 41 — Profile System Expansion (+ Phase 37 hot-swap async folded in)

**Goal:** The existing `ModelProfile` (Phase 17) becomes **ExecutionProfile**. Three new profile types land alongside it. Profile activation is async — it returns a job_id and the work runs in the background (the Phase 37 deliverable).

**Ships:**

- `crates/shared/src/profiles/` — `ExecutionProfile`, `RetrievalProfile`, `MemoryProfile`, `ObserverProfile`
- `crates/catalogd` gains per-profile-type CRUD endpoints (`/catalog/profiles/retrieval`, etc.)
- `crates/vectord/src/activation.rs` — `ActivationTracker` with the background-job pattern (Phase 37 content)
- `POST /vectors/profile/{id}/activate` returns 202 + job_id; poll at `GET /vectors/profile/jobs/{id}`
- Single-flight guard: refuse a new activation if one is pending/running
- Backward compat: `ModelProfile` still loads, aliased to ExecutionProfile

**Gate:**

- Activate a profile → returns 202 in <100ms → job completes in the background → `/vectors/profile/jobs/{id}` shows progress + a final report
- `tests/multi-agent/run_stress.ts` Phase 3 (hot-swap stress) passes (was SKIPPED)
- Retrieval + Memory + Observer profiles can be created independently of the Execution profile

**Non-goals:** Truth Layer (Phase 42), validation (Phase 43), caller migration (Phase 44).

**Risk:** Medium. Schema change + async refactor. Mitigation: `#[serde(default)]` on all new fields; existing profiles load unchanged.

---

## Phase 42 — Truth Layer (staffing rules first)

**Goal:** A new `crates/truth` crate holds immutable task-class constraints, served via `/v1/context` to the router and observer. No layer can override truth. **Staffing rules ship first**; Terraform/Ansible rule shapes are scaffolded but unpopulated until the long-horizon phase.

**Ships:**

- `crates/truth/src/lib.rs` — `TruthStore` with schema loading (TOML/YAML rules)
- `crates/truth/src/staffing.rs` — staffing rule shapes:
  - Worker eligibility (active status, not blacklisted for the client, geo match, role match, availability window)
  - Contract invariants (deadline present; role/count/city/state populated; budget_per_hour_max ≥ 0)
  - PII handling (redaction rules on fields tagged `PII` before any cloud call — covers the existing Phase 10 sensitivity tags)
  - Client blacklist enforcement (auto-applied before any fill proposal)
  - Fill requirements (endorsed_names count matches target_count; no duplicate worker_ids within a single fill)
- `crates/truth/src/devops.rs` — **scaffold only**: an empty rule struct for Terraform/Ansible, populated in the long-horizon phase. Keeps the dispatcher signature stable so no refactor is needed later.
- `truth/` dir at repo root — rule files, versioned in git
- `/v1/context` endpoint — returns the applicable rules for a task class (`staffing.fill`, `staffing.rescue`, `staffing.sms_draft`, etc.)
- The router consults truth before dispatching: if a task violates a rule, hard-fail with a structured error + rule citation (matches the existing Phase 13 access-control pattern)

**Gate:**

- Submit a fill proposal where a worker is client-blacklisted — the router returns 422 + a rule citation, and no cloud tokens are burned
- Submit a fill with `endorsed_names.length != target_count` — 422 before dispatch
- The observer cannot promote a correction that violates truth (rejected at the router gate)
- PII redaction verified: SSN / salary fields are stripped from prompts before cloud calls
- Truth reload is explicit (no file-watch hot reload in this phase)

**Non-goals:** Validation execution (Phase 43), policy learning / evolution (deferred), actual Terraform/Ansible rules (long-horizon phase).

**Risk:** Medium. Domain-specific rule enumeration takes discovery — start with a minimal rule set (5-10 staffing rules, derived from the existing Phase 10-13 work) and grow it organically as real fills surface edge cases.
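A sketch of the router-side truth gate for one task class. The rules checked here mirror the fill requirements and blacklist enforcement listed above; the struct shapes, rule IDs, and `check_fill` signature are assumptions about how `crates/truth` might expose them.

```rust
use std::collections::HashSet;

/// Minimal fill-proposal view the truth gate needs (illustrative shape).
pub struct FillProposal {
    pub client_id: String,
    pub target_count: usize,
    pub endorsed: Vec<Endorsement>,
}

pub struct Endorsement {
    pub worker_id: String,
    pub name: String,
}

/// Structured error carried back as the 422 body: a rule citation,
/// matching the Phase 13 access-control pattern.
pub struct RuleViolation {
    pub rule_id: &'static str,
    pub detail: String,
}

/// Hard gate consulted before dispatch. Truth can reject; nothing
/// downstream (observer included) can override the rejection.
pub fn check_fill(
    p: &FillProposal,
    client_blacklist: &HashSet<String>,
) -> Result<(), RuleViolation> {
    // Fill requirement: endorsed count must match target_count.
    if p.endorsed.len() != p.target_count {
        return Err(RuleViolation {
            rule_id: "staffing.fill/count",
            detail: format!("endorsed {} != target {}", p.endorsed.len(), p.target_count),
        });
    }
    // Blacklist enforcement: no client-blacklisted worker in a proposal.
    if let Some(w) = p.endorsed.iter().find(|e| client_blacklist.contains(&e.worker_id)) {
        return Err(RuleViolation {
            rule_id: "staffing.fill/blacklist",
            detail: format!("worker {} is blacklisted for client {}", w.worker_id, p.client_id),
        });
    }
    // Fill requirement: no duplicate worker_ids within a single fill.
    let mut seen = HashSet::new();
    if let Some(w) = p.endorsed.iter().find(|e| !seen.insert(e.worker_id.clone())) {
        return Err(RuleViolation {
            rule_id: "staffing.fill/dupes",
            detail: format!("duplicate worker_id {}", w.worker_id),
        });
    }
    Ok(())
}
```

A rejection here happens before any provider dispatch, which is what makes the "no cloud tokens burned" gate testable.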
---

## Phase 43 — Validation Pipeline (staffing outputs first)

**Goal:** Staffing outputs run through schema / completeness / consistency / policy gates. This plugs into the Layer 5 execution loop — a failure triggers observer-correction iteration.

This is where the **0→85% pattern reproduces on real staffing tasks** — the iteration loop with validation in place is what made small models successful.

**Ships:**

- `crates/validator/src/lib.rs` — `Validator` trait: `validate(artifact) -> Result` + an `Artifact` enum over output types
- `crates/validator/src/staffing/fill.rs` — fill-proposal validator:
  - Schema compliance (propose_done shape matches `{fills: [{candidate_id, name}]}`)
  - Completeness (endorsed count == target_count)
  - Worker existence (every candidate_id present in workers_500k via SQL lookup)
  - Status check (every worker has status=active, not_on_client_blacklist)
  - Geo/role match (worker city/state/role matches the contract)
- `crates/validator/src/staffing/email.rs` — generated email/SMS drafts:
  - Schema (TO/BODY fields present)
  - Length (SMS ≤ 160 chars; email subject ≤ 78 chars)
  - PII absence (no SSN / salary leaked into outgoing text)
  - Worker-name consistency (the name in the message matches the worker record)
- `crates/validator/src/staffing/playbook.rs` — sealed playbook:
  - Operation format (`fill: Role xN in City, ST`)
  - endorsed_names non-empty, ≤ target_count × 2
  - fingerprint populated (Phase 25 validity-window requirement)
- `crates/validator/src/devops.rs` — **scaffold only**: stubbed Terraform/Ansible validators (`terraform validate`, `ansible-lint`) for the long-horizon phase
- Task execution loop in the gateway: generate → validate → if fail, observer correction + retry (bounded by `max_iterations=3`)
- Validation results logged to the observer (`data/_observer/ops.jsonl`) + KB (`data/_kb/outcomes.jsonl`)

**Gate:**

- Generate a fill proposal → the validator catches a phantom worker_id → observer + cloud rescue propose a correction → retry → green. This reproduces the 0→85% pattern on the live staffing pipeline.
- `/v1/usage` shows iteration count per task, the provider fallback chain, and tokens-per-iteration. Cost attribution per task class is visible.
- The 14× citation-lift finding from Phase 19 refinement reproduces on similar geos once the validation gates are in place.

**Non-goals:** Caller migration (Phase 44), wired Terraform/Ansible validation (long-horizon).

**Risk:** Medium. Validation shapes have to match actual executor outputs; the mitigation is using real scenario runs as test fixtures (we have ~100 of them in `tests/multi-agent/playbooks/`).
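A sketch of the `Validator` trait and one concrete gate, the email/SMS draft checks listed above. The trait name and the `Artifact` enum come from this PRD; the variant and field shapes, the `Issue` type, and the gate labels are assumptions.

```rust
/// Output types under validation (staffing variants shown; DevOps
/// variants arrive in the long-horizon phase).
pub enum Artifact {
    EmailDraft(Draft),
    SmsDraft(Draft),
    // FillProposal(...), SealedPlaybook(...) — the other Phase 43 types
}

/// Illustrative draft shape; `subject` is ignored for SMS.
pub struct Draft {
    pub to: String,
    pub subject: String,
    pub body: String,
}

/// One validation issue; a failed run carries these back to the
/// observer as correction hints.
pub struct Issue {
    pub gate: &'static str,
    pub detail: String,
}

pub trait Validator {
    fn validate(&self, artifact: &Artifact) -> Result<(), Vec<Issue>>;
}

/// Draft validator implementing the schema + length gates above.
pub struct DraftValidator;

impl Validator for DraftValidator {
    fn validate(&self, artifact: &Artifact) -> Result<(), Vec<Issue>> {
        let mut issues = Vec::new();
        match artifact {
            Artifact::SmsDraft(d) => {
                if d.to.is_empty() {
                    issues.push(Issue { gate: "schema", detail: "missing TO field".into() });
                }
                if d.body.chars().count() > 160 {
                    issues.push(Issue { gate: "length", detail: "SMS body over 160 chars".into() });
                }
            }
            Artifact::EmailDraft(d) => {
                if d.to.is_empty() || d.body.is_empty() {
                    issues.push(Issue { gate: "schema", detail: "missing TO/BODY field".into() });
                }
                if d.subject.chars().count() > 78 {
                    issues.push(Issue { gate: "length", detail: "subject over 78 chars".into() });
                }
            }
        }
        if issues.is_empty() { Ok(()) } else { Err(issues) }
    }
}
```

The PII-absence and worker-name-consistency gates would slot in as further checks on the same `Issue` list; they need the worker record and sensitivity tags in scope, so they are omitted from this sketch.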
---

## Phase 44 — Caller Migration + Direct-Provider Deprecation

**Goal:** Every internal LLM caller routes through `/v1/chat`. Direct sidecar / direct Ollama / direct OpenAI calls are removed or explicitly deprecated with a warning.

**Ships:**

- `aibridge::AiClient` becomes a thin `/v1/chat` client (was direct-to-sidecar)
- `crates/vectord::agent` (autotune): routes through `/v1`
- `crates/vectord::autotune`: routes through `/v1`
- `tests/multi-agent/agent.ts::generate()`: routes through `/v1`
- `bot/propose.ts`: routes through `/v1` (already proposed as Phase 38's test consumer, formalized here)
- Lint rule / grep pre-commit hook: no `fetch.*:3200/generate` outside the provider adapters

**Gate:**

- `grep -r "localhost:3200/generate\|/api/generate"` returns only adapter files + deprecation shims
- `/v1/usage` accounts for every LLM call in the system within a 1-minute window after hitting a fresh scenario
- A full scenario passes end-to-end without any caller bypassing `/v1/*`

**Non-goals:** New features. This phase is purely mechanical migration.

**Risk:** Low. Mechanical. Tests catch regressions.

---

## Long-horizon domains (not in current phase sequence)

The architecture was drafted with DevOps execution (Terraform, Ansible) as the eventual target. **That remains aspirational, not current scope** — we don't start wiring `terraform validate` / `ansible-lint` until the staffing domain proves the six-layer architecture at scale.

What "proves at scale" means concretely:

- Phases 38-44 all shipped against staffing, with green tests
- The live staffing pipeline handles **multiple concurrent contracts** with emails + SMS + indexed playbooks via `/v1/*`
- The observed **iteration success lift** (the 0→85% pattern) reproduced on varied staffing scenarios, not just the original proof-of-concept
- Token + cost accounting stable across provider fallback chains under real load
- Truth Layer rules prevent real fill errors before cloud burn (not just theoretical)

When staffing hits that bar, the DevOps domain lights up by:

- Populating `crates/truth/src/devops.rs` with real Terraform/Ansible rule shapes
- Populating `crates/validator/src/devops.rs` with `terraform validate` / `ansible-lint` shell-outs
- Adding DevOps task classes to the `/v1/context` rule lookup
- No architectural changes needed — the dispatcher, router, and execution loop stay identical

Other candidate long-horizon domains (same pattern):

- Code generation tasks (validation via `cargo check` / `bun test`)
- SQL query generation (validation via EXPLAIN + schema compliance)
- Data pipeline definitions (validation via lineage check + schema compliance)

None of these are in the current roadmap. **Staffing first, production-proven, then expand.**

---

## 1. Purpose

Design and implement a universal AI control-plane API that enables:

- **deterministic high-stakes task execution** — the immediate domain is staffing fills (contracts, workers, emails, SMS) at scale; the same architecture extends later to DevOps (Terraform, Ansible) without redesign
- iterative capability amplification via observer loops
- hybrid local + cloud model orchestration
- structured knowledge + memory + playbook reuse
- controlled improvement over time through validated iteration

The system prioritizes **validated pipeline success over raw model intelligence**.

### Current scope — staffing at scale

The architecture must make the already-built staffing substrate reliably answer millions of inputs: pull real data, graph it across contracts, handle multiple concurrent contracts, index emails + SMS + playbooks via the hybrid SQL+vector method, and get **faster and better each iteration** via the feedback loops (Phase 19 playbook boost, Phase 22 KB pathway recommender, Phase 24 observer, Phase 26 Mem0 upsert). DevOps is an eventual domain — see §Long-horizon domains.

## 2. Core Objectives

### 2.1 Functional Goals

- Provide a single universal API for all AI interactions
- Support multi-provider routing (local, flat-rate, token-based)
- Enable iterative execution loops with observer correction
- Store and reuse successful execution playbooks
- Integrate: S3-based knowledge storage, LanceDB retrieval/indexing, Mem0 memory layer, MCP tool ecosystem

### 2.2 Non-Functional Goals

- Deterministic behavior under constrained execution
- Full observability and cost accounting
- Safe DevOps execution (no uncontrolled mutation)
- Profile-driven routing and execution
- Reproducibility of successful runs

## 3. System Architecture

### 3.1 Layer Overview

**Layer 1 — Universal API**

Single entry point for all applications. Endpoints:

- `/v1/chat`
- `/v1/respond`
- `/v1/tools`
- `/v1/context`
- `/v1/usage`
- `/v1/sessions`

All programs must use this layer. No direct provider calls allowed.

**Layer 2 — Routing & Policy Engine**

Responsibilities: provider selection, fallback logic, cost gating, premium access control, profile enforcement. Routing is based on: task type, constraints, execution profile, system health.

**Layer 3 — Provider Adapter Layer**

Normalizes all providers: Ollama (local), OpenRouter, Gemini (direct), Claude (direct or routed), future providers. Guarantee: no provider-specific logic leaks upward.

**Layer 4 — Knowledge & Memory Plane**

- Knowledge (S3 + LanceDB): raw documents, processed chunks, embeddings, index profiles
- Memory (Mem0): extracted facts, entity-linked memory, session-aware retrieval
- Playbooks: successful execution traces, reusable patterns, correction strategies

**Layer 5 — Execution Loop**

Each task runs through: Retrieval → Planning → Generation → Validation → Observer feedback → Iteration (if needed).

**Layer 6 — Observability & Accounting**

Every request logs: tokens (input/output), cost, latency, provider, fallback chain, profile used, iteration delta.

## 4. Execution Model

### 4.1 Iterative Loop

Each task follows: **Attempt → Validate → Observe → Adjust → Retry**

Constraints:

- max iterations (default: 3)
- minimum improvement threshold
- cost ceiling per task

See the loop sketch after §4.3.

### 4.2 Observer Role

The observer can: analyze failures, suggest corrections, recommend profile changes. The observer cannot: modify the truth layer, auto-promote changes, override constraints.

### 4.3 Cloud Escalation

Cloud models (Gemini, Claude) are used for: structural correction, reasoning gaps, complex decomposition. They are not used for: brute-force retries, bulk execution.
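A sketch of the §4.1 loop with its three constraints wired in. The constraint names (max iterations, minimum improvement threshold, cost ceiling) come from this section; the score/cost plumbing and every type name here are illustrative.

```rust
pub struct LoopLimits {
    pub max_iterations: u32,     // default: 3
    pub min_improvement: f64,    // stop if the score delta falls below this
    pub cost_ceiling_cents: u64, // per-task budget
}

pub enum LoopOutcome<T> {
    Success(T),
    Exhausted { iterations: u32, reason: &'static str },
}

/// One attempt: an artifact, a validation score in [0, 1]
/// (1.0 = every validator passed), and the cost of the attempt.
pub struct Attempt<T> {
    pub artifact: T,
    pub score: f64,
    pub cost_cents: u64,
}

/// Attempt → Validate → Observe → Adjust → Retry, bounded. The closure
/// covers generation + validation + observer adjustment for one pass.
pub fn run_loop<T>(
    mut attempt: impl FnMut(u32) -> Attempt<T>,
    limits: &LoopLimits,
) -> LoopOutcome<T> {
    let (mut spent, mut best_score) = (0u64, 0.0f64);
    for i in 0..limits.max_iterations {
        let a = attempt(i);
        spent += a.cost_cents;
        if a.score >= 1.0 {
            return LoopOutcome::Success(a.artifact); // all validators green
        }
        if spent >= limits.cost_ceiling_cents {
            return LoopOutcome::Exhausted { iterations: i + 1, reason: "cost ceiling" };
        }
        // Minimum improvement threshold: give up once the observer's
        // adjustments stop moving the validation score.
        if i > 0 && a.score - best_score < limits.min_improvement {
            return LoopOutcome::Exhausted { iterations: i + 1, reason: "no improvement" };
        }
        best_score = best_score.max(a.score);
    }
    LoopOutcome::Exhausted { iterations: limits.max_iterations, reason: "max iterations" }
}
```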
## 5. Profile System

### 5.1 Profile Types

- **Retrieval Profile** — chunking strategy, embedding method, reranking rules
- **Memory Profile** — memory weighting, context injection rules
- **Execution Profile** — allowed providers, tool access, risk level
- **Observer Profile** — mutation aggressiveness, iteration strategy

### 5.2 Profile Constraints

- only one major profile change per iteration
- profiles must produce measurable deltas
- promotion requires repeated success

## 6. Truth Layer (Critical)

Defines non-negotiable constraints:

- Terraform rules
- Ansible structure requirements
- security policies
- organization standards

Rules are:

- immutable at runtime
- referenced by all layers
- impossible for the observer to override

## 7. Playbook System

### 7.1 Playbook Definition

Each successful run produces: task class, context used, steps executed, tools used, output artifacts, validation results, cost/latency, success score.

### 7.2 Playbook Lifecycle

- created on success
- reused for similar tasks
- decayed over time
- pruned if ineffective

## 8. Validation System

All DevOps outputs must pass: syntax validation, linting, dry-run, policy compliance. On failure, iteration continues or the task fails.

## 9. MCP Integration

MCP servers provide: tools, external data, execution capabilities. All MCP outputs must be: normalized, validated, schema-compliant. No direct MCP output reaches the model.

## 10. Token Accounting & Budget Control

Each request tracks: input tokens, output tokens, retries, fallback cost. Policies: premium providers gated, cost ceilings enforced, per-task budget limits.

## 11. Failure Handling

**Recoverable failures:** bad decomposition, missing steps, weak retrieval → observer + iteration.

**Hard failures:** missing truth data, invalid task classification, unsafe execution → termination + error report.

## 12. Success Criteria

A task is successful only if:

- the output is valid
- all validators pass
- there are no policy violations
- the result is reproducible
- cost is within limits

## 13. Key Risks & Mitigations

- **Observer drift** → bounded authority, confidence tracking
- **Memory poisoning** → validation layer, memory weighting
- **Cost explosion** → token accounting, iteration caps (budget-gate sketch below)
- **Retrieval errors** → post-retrieval validation, profile tuning
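To make the §10 policies and the cost-explosion mitigation concrete, a minimal sketch of a per-provider daily budget gate. The 429-plus-`Retry-After` behavior comes from the Phase 40 gate; everything else here (type names, cents-based accounting, the reset-at-midnight window) is an assumption.

```rust
use std::collections::HashMap;

/// Per-provider spend tracking for the §10 budget policies. Shapes are
/// illustrative; the real accounting sits behind `/v1/usage`.
pub struct BudgetGate {
    daily_limit_cents: HashMap<String, u64>, // per-provider daily budget
    spent_today_cents: HashMap<String, u64>,
}

pub enum BudgetDecision {
    Allow,
    /// Maps to the Phase 40 behavior: 429 plus a `Retry-After` header
    /// pointing at the next budget window.
    Deny { retry_after_secs: u64 },
}

impl BudgetGate {
    /// Checked before dispatch; `estimated_cents` is the worst-case cost
    /// of the attempt (prompt tokens + max completion tokens at list price).
    pub fn check(
        &self,
        provider: &str,
        estimated_cents: u64,
        secs_to_window_reset: u64,
    ) -> BudgetDecision {
        let limit = self.daily_limit_cents.get(provider).copied().unwrap_or(0);
        let spent = self.spent_today_cents.get(provider).copied().unwrap_or(0);
        if spent + estimated_cents > limit {
            BudgetDecision::Deny { retry_after_secs: secs_to_window_reset }
        } else {
            BudgetDecision::Allow
        }
    }

    /// Recorded after the provider call returns with actual usage.
    pub fn record(&mut self, provider: &str, actual_cents: u64) {
        *self.spent_today_cents.entry(provider.to_string()).or_insert(0) += actual_cents;
    }
}
```

Gating on the estimate rather than the actual keeps a single oversized request from blowing through the ceiling before accounting catches up.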