# PRD — Universal AI Control Plane

**Status:** Long-horizon architecture target as of 2026-04-22. Lakehouse Phases 0-37 (`docs/PRD.md`) are preserved as the reference implementation and first domain-specific consumer. Phases 38+ (control-plane layers) are sequenced below.

**Current domain: staffing.** The immediate proving ground is the staffing substrate already built — synthetic workers_500k, contracts, emails, SMS drafts, playbook memory. Everything Phases 38-44 ship is validated first against that domain. The DevOps / Terraform / Ansible framing from the original PRD draft stays as a **long-horizon target** — architecture-compatible but not in current scope. See §Long-horizon domains at the bottom.

**Owner:** J

**Cross-read:** `docs/PRD.md` for what's shipped (staffing + AI substrate, 13 crates, ~3M rows). This doc covers the layered architecture those pieces now fit into.

---

## Phase Sequencing (Phases 38-44)

Ship each phase before starting the next. Each ends with green tests + a docs update.

| Phase | Layer | What ships | Est. LOC | Risk |
|---|---|---|---|---|
| 38 | Layer 1 skeleton | `/v1/chat`, `/v1/usage`, `/v1/sessions` routes forwarding to existing `aibridge` → Ollama. Bot migrates as first consumer. | ~400 | Low — additive, no existing routes touched |
| 39 | Layer 3 adapters | `aibridge::ProviderAdapter` trait; Ollama + one new (OpenRouter). `/v1/chat` routes by config. | ~500 | Low-medium |
| 40 | Layer 2 engine | Rules-based routing (`config/routing.toml`), fallback chains, cost gating. Add Gemini + Claude adapters. | ~600 | Medium |
| 41 | Profile split | Separate Retrieval / Memory / Execution / Observer profiles; Phase 17 backward-compat. Absorbs Phase 37 hot-swap-async. | ~300 | Medium |
| 42 | Truth Layer | New `crates/truth`; Terraform/Ansible schemas; `/v1/context` serves rules to router + observer. | ~700 | Medium |
| 43 | Validation pipeline | Syntax/lint/dry-run/policy gates per output type. Plugs into Layer 5 execution loop. | ~400 | Medium |
| 44 | Caller migration | All internal callers route through `/v1/chat`. Direct sidecar access deprecated. | ~200 | Low |

**Total ≈3100 LOC.** Phase 37 (hot-swap async) folds into Phase 41 — it's an Execution-Profile activation concern.

---

## Phase 38 — Universal API Skeleton

**Goal:** An OpenAI-compatible `/v1/*` surface exists and forwards to the existing aibridge → Ollama path. Nothing about multi-provider yet — just the SHAPE, so every downstream piece (adapters, routing, usage accounting) has a surface to plug into.

**Ships:**

- `crates/gateway/src/v1/mod.rs` — router + `/v1/chat`, `/v1/usage`, `/v1/sessions`
- `crates/gateway/src/v1/ollama.rs` — shape adapter (OpenAI chat ↔ existing aibridge `GenerateRequest`)
- One-line `nest("/v1", ...)` in `crates/gateway/src/main.rs`
- Unit test: `POST /v1/chat` roundtrips through a mocked provider

**Gate:**

- `curl -X POST localhost:3100/v1/chat -H 'Content-Type: application/json' -d '{"model":"qwen3.5:latest","messages":[{"role":"user","content":"hi"}]}'` returns a valid OpenAI-shape response.
- `GET localhost:3100/v1/usage` returns `{requests, prompt_tokens, completion_tokens, total_tokens}`.
- `GET localhost:3100/v1/sessions` returns `{data:[]}` (stub; real impl Phase 41).
- `cargo test -p gateway` green.

**Non-goals (explicit):** streaming, tool calls, function calling, session state, multi-provider, fallback, cost gating.

**Risk:** Low — additive, doesn't touch existing routes. Worst case: `/v1/*` returns 502 and we fix the adapter. No existing caller is affected.
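To pin down what "OpenAI-shape" means for the Phase 38 gate, here is a minimal sketch of the `/v1/chat` request/response types as serde structs. The field names follow the OpenAI chat-completions wire format; the type names and module placement are illustrative, not a committed schema (the real home is `crates/gateway/src/v1/mod.rs`).

```rust
use serde::{Deserialize, Serialize};

/// Incoming body for `POST /v1/chat` (OpenAI chat-completions shape).
#[derive(Clone, Deserialize)]
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
}

#[derive(Clone, Serialize, Deserialize)]
pub struct ChatMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

/// Outgoing body: the same shape regardless of which provider answered.
#[derive(Serialize)]
pub struct ChatResponse {
    pub id: String,
    pub model: String,
    pub choices: Vec<Choice>,
    pub usage: Usage,
}

#[derive(Serialize)]
pub struct Choice {
    pub index: u32,
    pub message: ChatMessage,
    pub finish_reason: String,
}

/// Token counts; these feed the `/v1/usage` aggregates.
#[derive(Serialize)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
}
```

The `Clone` derives are not required by Phase 38 itself; they anticipate the Phase 40 fallback chain, where a request may be retried against a second provider.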
---

## Phase 39 — Provider Adapter Refactor

**Goal:** `aibridge` grows a `ProviderAdapter` trait. The Ollama implementation wraps existing sidecar code. One new provider lands as proof: **OpenRouter** (simplest — it's OpenAI-compatible, so the adapter is mostly passthrough).

**Ships:**

- `crates/aibridge/src/provider.rs` — `ProviderAdapter` trait with `chat()` + `embed()` + `unload()` methods
- `crates/aibridge/src/providers/ollama.rs` — existing sidecar code moved behind the trait
- `crates/aibridge/src/providers/openrouter.rs` — new, HTTP client to `openrouter.ai/api/v1/chat/completions`
- `config/providers.toml` — provider registry (name, base_url, auth, default_models)
- `/v1/chat` routes by the `model` field: prefix match (e.g. `openrouter/anthropic/claude-3.5-sonnet` → OpenRouter; bare names → Ollama)

**Gate:**

- `/v1/chat` with `model: "qwen3.5:latest"` hits Ollama → green
- `/v1/chat` with `model: "openrouter/openai/gpt-4o-mini"` hits OpenRouter (key from secrets.toml) → green
- Neither call leaks provider-specific fields upward. The response is always the `/v1/chat` shape.

**Non-goals:** Fallback chain (Phase 40), cost gating (Phase 40), Gemini/Claude adapters (Phase 40).

**Risk:** Low-medium. The trait extraction is mostly a rearrange; OpenRouter is thin. The biggest risk is secret-loading conventions — `SecretsProvider` is already in place, so reuse that path.
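A sketch of what the Phase 39 trait could look like. The method list (`chat`, `embed`, `unload`) comes from this PRD; the signatures, the `ProviderError` type, the use of the `async_trait` crate, and the `provider_for` helper are assumptions for illustration only.

```rust
use async_trait::async_trait;

// `ChatRequest` / `ChatResponse` are the `/v1/chat` shapes from the
// Phase 38 sketch above. The error type here is a placeholder.
use crate::v1::{ChatRequest, ChatResponse};

#[derive(Debug)]
pub struct ProviderError(pub String);

#[async_trait]
pub trait ProviderAdapter: Send + Sync {
    /// Normalized chat call: takes and returns the `/v1/chat` shapes,
    /// never provider-specific ones (the Layer 3 guarantee).
    async fn chat(&self, req: ChatRequest) -> Result<ChatResponse, ProviderError>;

    /// Embedding call for the retrieval path.
    async fn embed(&self, input: &[String]) -> Result<Vec<Vec<f32>>, ProviderError>;

    /// Release local resources (a no-op for remote providers).
    async fn unload(&self) -> Result<(), ProviderError>;
}

/// Prefix-match routing by `model` (hypothetical helper): names carrying
/// a registered provider prefix route there; bare names go to Ollama.
pub fn provider_for(model: &str) -> &'static str {
    match model.split_once('/') {
        Some(("openrouter", _)) => "openrouter",
        _ => "ollama",
    }
}
```

Keeping `chat()` in terms of the universal shapes is what makes the Phase 39 gate ("no provider-specific fields leak upward") checkable at the type level.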
---

## Phase 40 — Routing & Policy Engine

**Goal:** Replace hardcoded T1-T5 routing with a rules engine. Add Gemini + Claude adapters. Cost gating is enforced at the router level.

**Ships:**

- `crates/aibridge/src/routing.rs` — rules engine (matches on: task type, token budget, previous attempt failures, profile ID)
- `config/routing.toml` — rules in TOML (human-editable, hot-reloadable)
- `crates/aibridge/src/providers/gemini.rs` — `generativelanguage.googleapis.com` adapter
- `crates/aibridge/src/providers/claude.rs` — `api.anthropic.com` adapter
- Fallback chain support: if the primary returns 5xx or times out, try the next provider in the chain
- Cost gate: per-request budget + per-provider daily budget

**Gate:**

- A rule like "local models for simple JSON emitters, cloud for reasoning" fires correctly by task type
- Primary fails → the fallback provider is hit, and the response still matches the `/v1/chat` shape
- Daily budget hit → subsequent requests return 429 with a clear `Retry-After` header
- `/v1/usage` reports a per-provider breakdown

**Non-goals:** Retrieval Profile split (Phase 41), Truth Layer (Phase 42).

**Risk:** Medium. Multi-provider auth + cost tracking is cross-cutting. Mitigation: every provider call is wrapped in a single `dispatch()` function, and all observability flows through there.
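A sketch of what one `config/routing.toml` rule could deserialize into, plus the single `dispatch()` chokepoint the risk note calls for. The match inputs (task type, token budget) come from this PRD; the TOML field names, the chain-of-model-names shape, and the dispatch signature are assumptions.

```rust
use serde::Deserialize;

use crate::provider::{ProviderAdapter, ProviderError};
use crate::v1::{ChatRequest, ChatResponse};

/// One rule as it might appear in `config/routing.toml`
/// (illustrative field names, not a committed schema):
///
/// [[rule]]
/// task_type = "json_emit"
/// max_prompt_tokens = 4096
/// chain = ["qwen3.5:latest", "openrouter/openai/gpt-4o-mini"]
#[derive(Deserialize)]
pub struct RoutingRule {
    pub task_type: String,
    pub max_prompt_tokens: Option<u64>,
    /// Fallback chain, tried in order on 5xx / timeout.
    pub chain: Vec<String>,
}

#[derive(Deserialize)]
pub struct RoutingConfig {
    #[serde(rename = "rule", default)]
    pub rules: Vec<RoutingRule>,
}

/// The single chokepoint every provider call flows through, so fallback
/// and cost accounting live in exactly one place.
pub async fn dispatch(
    chain: &[Box<dyn ProviderAdapter>],
    req: ChatRequest,
) -> Result<ChatResponse, ProviderError> {
    let mut last_err = ProviderError("empty fallback chain".into());
    for provider in chain {
        match provider.chat(req.clone()).await {
            Ok(resp) => return Ok(resp), // first success wins
            Err(e) => last_err = e,      // 5xx / timeout: try next in chain
        }
    }
    Err(last_err)
}
```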
---

## Phase 41 — Profile System Expansion (+ Phase 37 hot-swap async folded in)

**Goal:** The existing `ModelProfile` (Phase 17) becomes **ExecutionProfile**. Three new profile types land alongside it. Profile activation is async — it returns a job_id and the work runs in the background (the Phase 37 deliverable).

**Ships:**

- `crates/shared/src/profiles/` — `ExecutionProfile`, `RetrievalProfile`, `MemoryProfile`, `ObserverProfile`
- `crates/catalogd` gains per-profile-type CRUD endpoints (`/catalog/profiles/retrieval`, etc.)
- `crates/vectord/src/activation.rs` — `ActivationTracker` with the background-job pattern (Phase 37 content)
- `POST /vectors/profile/{id}/activate` returns 202 + job_id; poll at `GET /vectors/profile/jobs/{id}`
- Single-flight guard: refuse a new activation if one is pending/running
- Backward compat: `ModelProfile` still loads, aliased to ExecutionProfile

**Gate:**

- Activate a profile → returns 202 in <100ms → job completes in the background → `/vectors/profile/jobs/{id}` shows progress + a final report
- `tests/multi-agent/run_stress.ts` Phase 3 (hot-swap stress) passes (was SKIPPED)
- Retrieval + Memory + Observer profiles can be created independently of the Execution profile

**Non-goals:** Truth Layer (Phase 42), validation (Phase 43), caller migration (Phase 44).

**Risk:** Medium. Schema change + async refactor. Mitigation: `#[serde(default)]` on all new fields; existing profiles load unchanged.

---

## Phase 42 — Truth Layer (staffing rules first)

**Goal:** A new `crates/truth` crate holds immutable task-class constraints, served via `/v1/context` to the router and observer. No layer can override truth. **Staffing rules ship first**; Terraform/Ansible rule shapes are scaffolded but unpopulated until the long-horizon phase.

**Ships:**

- `crates/truth/src/lib.rs` — `TruthStore` with schema loading (TOML/YAML rules)
- `crates/truth/src/staffing.rs` — staffing rule shapes:
  - Worker eligibility (active status, not blacklisted for the client, geo match, role match, availability window)
  - Contract invariants (deadline present; role/count/city/state populated; budget_per_hour_max ≥ 0)
  - PII handling (redaction rules on fields tagged `PII` before any cloud call — covers the existing Phase 10 sensitivity tags)
  - Client blacklist enforcement (auto-applied before any fill proposal)
  - Fill requirements (endorsed_names count matches target_count; no duplicate worker_ids within a single fill)
- `crates/truth/src/devops.rs` — **scaffold only**: an empty rule struct for Terraform/Ansible, populated in the long-horizon phase. Keeps the dispatcher signature stable so no refactor is needed later.
- `truth/` dir at repo root — rule files, versioned in git
- `/v1/context` endpoint — returns the applicable rules for a task class (`staffing.fill`, `staffing.rescue`, `staffing.sms_draft`, etc.)
- The router consults truth before dispatching: if a task violates a rule, hard-fail with a structured error + rule citation (matches the existing Phase 13 access-control pattern)

**Gate:**

- Submit a fill proposal where a worker is client-blacklisted — the router returns 422 + a rule citation, and no cloud tokens are burned
- Submit a fill with `endorsed_names.length != target_count` — 422 before dispatch
- The observer cannot promote a correction that violates truth (rejected at the router gate)
- PII redaction verified: SSN / salary fields are stripped from prompts before cloud calls
- Truth reload is explicit (no file-watch hot reload in this phase)

**Non-goals:** Validation execution (Phase 43), policy learning / evolution (deferred), actual Terraform/Ansible rules (long-horizon phase).

**Risk:** Medium. Domain-specific rule enumeration takes discovery — start with a minimal rule set (5-10 staffing rules, derived from the existing Phase 10-13 work) and grow it organically as real fills surface edge cases.
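A sketch of the router-side truth gate for one task class. The rules checked here mirror the fill requirements and blacklist enforcement listed above; the struct shapes, rule IDs, and `check_fill` signature are assumptions about how `crates/truth` might expose them.

```rust
use std::collections::HashSet;

/// Minimal fill-proposal view the truth gate needs (illustrative shape).
pub struct FillProposal {
    pub client_id: String,
    pub target_count: usize,
    pub endorsed: Vec<Endorsement>,
}

pub struct Endorsement {
    pub worker_id: String,
    pub name: String,
}

/// Structured error carried back as the 422 body: a rule citation,
/// matching the Phase 13 access-control pattern.
pub struct RuleViolation {
    pub rule_id: &'static str,
    pub detail: String,
}

/// Hard gate consulted before dispatch. Truth can reject; nothing
/// downstream (observer included) can override the rejection.
pub fn check_fill(
    p: &FillProposal,
    client_blacklist: &HashSet<String>,
) -> Result<(), RuleViolation> {
    // Fill requirement: endorsed count must match target_count.
    if p.endorsed.len() != p.target_count {
        return Err(RuleViolation {
            rule_id: "staffing.fill/count",
            detail: format!("endorsed {} != target {}", p.endorsed.len(), p.target_count),
        });
    }
    // Blacklist enforcement: no client-blacklisted worker in a proposal.
    if let Some(w) = p.endorsed.iter().find(|e| client_blacklist.contains(&e.worker_id)) {
        return Err(RuleViolation {
            rule_id: "staffing.fill/blacklist",
            detail: format!("worker {} is blacklisted for client {}", w.worker_id, p.client_id),
        });
    }
    // Fill requirement: no duplicate worker_ids within a single fill.
    let mut seen = HashSet::new();
    if let Some(w) = p.endorsed.iter().find(|e| !seen.insert(e.worker_id.clone())) {
        return Err(RuleViolation {
            rule_id: "staffing.fill/dupes",
            detail: format!("duplicate worker_id {}", w.worker_id),
        });
    }
    Ok(())
}
```

A rejection here happens before any provider dispatch, which is what makes the "no cloud tokens burned" gate testable.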
---

## Phase 43 — Validation Pipeline (staffing outputs first)

**Goal:** Staffing outputs run through schema / completeness / consistency / policy gates. This plugs into the Layer 5 execution loop — a failure triggers observer-correction iteration.

This is where the **0→85% pattern reproduces on real staffing tasks** — the iteration loop with validation in place is what made small models successful.

**Ships:**

- `crates/validator/src/lib.rs` — `Validator` trait: `validate(artifact) -> Result` + an `Artifact` enum over output types
- `crates/validator/src/staffing/fill.rs` — fill-proposal validator:
  - Schema compliance (propose_done shape matches `{fills: [{candidate_id, name}]}`)
  - Completeness (endorsed count == target_count)
  - Worker existence (every candidate_id present in workers_500k via SQL lookup)
  - Status check (every worker has status=active, not_on_client_blacklist)
  - Geo/role match (worker city/state/role matches the contract)
- `crates/validator/src/staffing/email.rs` — generated email/SMS drafts:
  - Schema (TO/BODY fields present)
  - Length (SMS ≤ 160 chars; email subject ≤ 78 chars)
  - PII absence (no SSN / salary leaked into outgoing text)
  - Worker-name consistency (the name in the message matches the worker record)
- `crates/validator/src/staffing/playbook.rs` — sealed playbook:
  - Operation format (`fill: Role xN in City, ST`)
  - endorsed_names non-empty, ≤ target_count × 2
  - fingerprint populated (Phase 25 validity-window requirement)
- `crates/validator/src/devops.rs` — **scaffold only**: stubbed Terraform/Ansible validators (`terraform validate`, `ansible-lint`) for the long-horizon phase
- Task execution loop in the gateway: generate → validate → if fail, observer correction + retry (bounded by `max_iterations=3`)
- Validation results logged to the observer (`data/_observer/ops.jsonl`) + KB (`data/_kb/outcomes.jsonl`)

**Gate:**

- Generate a fill proposal → the validator catches a phantom worker_id → observer + cloud rescue propose a correction → retry → green. This reproduces the 0→85% pattern on the live staffing pipeline.
- `/v1/usage` shows iteration count per task, the provider fallback chain, and tokens-per-iteration. Cost attribution per task class is visible.
- The 14× citation-lift finding from Phase 19 refinement reproduces on similar geos once the validation gates are in place.

**Non-goals:** Caller migration (Phase 44), wired Terraform/Ansible validation (long-horizon).

**Risk:** Medium. Validation shapes have to match actual executor outputs; the mitigation is using real scenario runs as test fixtures (we have ~100 of them in `tests/multi-agent/playbooks/`).
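A sketch of the `Validator` trait and one concrete gate, the email/SMS draft checks listed above. The trait name and the `Artifact` enum come from this PRD; the variant and field shapes, the `Issue` type, and the gate labels are assumptions.

```rust
/// Output types under validation (staffing variants shown; DevOps
/// variants arrive in the long-horizon phase).
pub enum Artifact {
    EmailDraft(Draft),
    SmsDraft(Draft),
    // FillProposal(...), SealedPlaybook(...) — the other Phase 43 types
}

/// Illustrative draft shape; `subject` is ignored for SMS.
pub struct Draft {
    pub to: String,
    pub subject: String,
    pub body: String,
}

/// One validation issue; a failed run carries these back to the
/// observer as correction hints.
pub struct Issue {
    pub gate: &'static str,
    pub detail: String,
}

pub trait Validator {
    fn validate(&self, artifact: &Artifact) -> Result<(), Vec<Issue>>;
}

/// Draft validator implementing the schema + length gates above.
pub struct DraftValidator;

impl Validator for DraftValidator {
    fn validate(&self, artifact: &Artifact) -> Result<(), Vec<Issue>> {
        let mut issues = Vec::new();
        match artifact {
            Artifact::SmsDraft(d) => {
                if d.to.is_empty() {
                    issues.push(Issue { gate: "schema", detail: "missing TO field".into() });
                }
                if d.body.chars().count() > 160 {
                    issues.push(Issue { gate: "length", detail: "SMS body over 160 chars".into() });
                }
            }
            Artifact::EmailDraft(d) => {
                if d.to.is_empty() || d.body.is_empty() {
                    issues.push(Issue { gate: "schema", detail: "missing TO/BODY field".into() });
                }
                if d.subject.chars().count() > 78 {
                    issues.push(Issue { gate: "length", detail: "subject over 78 chars".into() });
                }
            }
        }
        if issues.is_empty() { Ok(()) } else { Err(issues) }
    }
}
```

The PII-absence and worker-name-consistency gates would slot in as further checks on the same `Issue` list; they need the worker record and sensitivity tags in scope, so they are omitted from this sketch.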
---

## Phase 44 — Caller Migration + Direct-Provider Deprecation

**Goal:** Every internal LLM caller routes through `/v1/chat`. Direct sidecar / direct Ollama / direct OpenAI calls are removed or explicitly deprecated with a warning.

**Ships:**

- `aibridge::AiClient` becomes a thin `/v1/chat` client (was direct-to-sidecar)
- `crates/vectord::agent` (autotune): routes through `/v1`
- `crates/vectord::autotune`: routes through `/v1`
- `tests/multi-agent/agent.ts::generate()`: routes through `/v1`
- `bot/propose.ts`: routes through `/v1` (already proposed as Phase 38's test consumer, formalized here)
- Lint rule / grep pre-commit hook: no `fetch.*:3200/generate` outside the provider adapters

**Gate:**

- `grep -r "localhost:3200/generate\|/api/generate"` returns only adapter files + deprecation shims
- `/v1/usage` accounts for every LLM call in the system within a 1-minute window after hitting a fresh scenario
- A full scenario passes end-to-end without any caller bypassing `/v1/*`

**Non-goals:** New features. This phase is purely mechanical migration.

**Risk:** Low. Mechanical. Tests catch regressions.

---

## Long-horizon domains (not in current phase sequence)

The architecture was drafted with DevOps execution (Terraform, Ansible) as the eventual target. **That remains aspirational, not current scope** — we don't start wiring `terraform validate` / `ansible-lint` until the staffing domain proves the six-layer architecture at scale.

What "proves at scale" means concretely:

- Phases 38-44 all shipped against staffing, with green tests
- The live staffing pipeline handles **multiple concurrent contracts** with emails + SMS + indexed playbooks via `/v1/*`
- The observed **iteration success lift** (the 0→85% pattern) reproduced on varied staffing scenarios, not just the original proof-of-concept
- Token + cost accounting stable across provider fallback chains under real load
- Truth Layer rules prevent real fill errors before cloud burn (not just theoretical)

When staffing hits that bar, the DevOps domain lights up by:

- Populating `crates/truth/src/devops.rs` with real Terraform/Ansible rule shapes
- Populating `crates/validator/src/devops.rs` with `terraform validate` / `ansible-lint` shell-outs
- Adding DevOps task classes to the `/v1/context` rule lookup
- No architectural changes needed — the dispatcher, router, and execution loop stay identical

Other candidate long-horizon domains (same pattern):

- Code generation tasks (validation via `cargo check` / `bun test`)
- SQL query generation (validation via EXPLAIN + schema compliance)
- Data pipeline definitions (validation via lineage check + schema compliance)

None of these are in the current roadmap. **Staffing first, production-proven, then expand.**

---

## 1. Purpose

Design and implement a universal AI control-plane API that enables:

- **deterministic high-stakes task execution** — the immediate domain is staffing fills (contracts, workers, emails, SMS) at scale; the same architecture extends later to DevOps (Terraform, Ansible) without redesign
- iterative capability amplification via observer loops
- hybrid local + cloud model orchestration
- structured knowledge + memory + playbook reuse
- controlled improvement over time through validated iteration

The system prioritizes **validated pipeline success over raw model intelligence**.

### Current scope — staffing at scale

The architecture must make the already-built staffing substrate reliably answer millions of inputs: pull real data, graph it across contracts, handle multiple concurrent contracts, index emails + SMS + playbooks via the hybrid SQL+vector method, and get **faster and better each iteration** via the feedback loops (Phase 19 playbook boost, Phase 22 KB pathway recommender, Phase 24 observer, Phase 26 Mem0 upsert). DevOps is an eventual domain — see §Long-horizon domains.

## 2. Core Objectives

### 2.1 Functional Goals

- Provide a single universal API for all AI interactions
- Support multi-provider routing (local, flat-rate, token-based)
- Enable iterative execution loops with observer correction
- Store and reuse successful execution playbooks
- Integrate: S3-based knowledge storage, LanceDB retrieval/indexing, Mem0 memory layer, MCP tool ecosystem

### 2.2 Non-Functional Goals

- Deterministic behavior under constrained execution
- Full observability and cost accounting
- Safe DevOps execution (no uncontrolled mutation)
- Profile-driven routing and execution
- Reproducibility of successful runs

## 3. System Architecture

### 3.1 Layer Overview

**Layer 1 — Universal API**

Single entry point for all applications. Endpoints:

- `/v1/chat`
- `/v1/respond`
- `/v1/tools`
- `/v1/context`
- `/v1/usage`
- `/v1/sessions`

All programs must use this layer. No direct provider calls allowed.

**Layer 2 — Routing & Policy Engine**

Responsibilities: provider selection, fallback logic, cost gating, premium access control, profile enforcement. Routing is based on: task type, constraints, execution profile, system health.

**Layer 3 — Provider Adapter Layer**

Normalizes all providers: Ollama (local), OpenRouter, Gemini (direct), Claude (direct or routed), future providers. Guarantee: no provider-specific logic leaks upward.

**Layer 4 — Knowledge & Memory Plane**

- Knowledge (S3 + LanceDB): raw documents, processed chunks, embeddings, index profiles
- Memory (Mem0): extracted facts, entity-linked memory, session-aware retrieval
- Playbooks: successful execution traces, reusable patterns, correction strategies

**Layer 5 — Execution Loop**

Each task runs through: Retrieval → Planning → Generation → Validation → Observer feedback → Iteration (if needed).

**Layer 6 — Observability & Accounting**

Every request logs: tokens (input/output), cost, latency, provider, fallback chain, profile used, iteration delta.

## 4. Execution Model

### 4.1 Iterative Loop

Each task follows: **Attempt → Validate → Observe → Adjust → Retry**

Constraints:

- max iterations (default: 3)
- minimum improvement threshold
- cost ceiling per task

See the loop sketch after §4.3.

### 4.2 Observer Role

The observer can: analyze failures, suggest corrections, recommend profile changes. The observer cannot: modify the truth layer, auto-promote changes, override constraints.

### 4.3 Cloud Escalation

Cloud models (Gemini, Claude) are used for: structural correction, reasoning gaps, complex decomposition. They are not used for: brute-force retries, bulk execution.
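A sketch of the §4.1 loop with its three constraints wired in. The constraint names (max iterations, minimum improvement threshold, cost ceiling) come from this section; the score/cost plumbing and every type name here are illustrative.

```rust
pub struct LoopLimits {
    pub max_iterations: u32,     // default: 3
    pub min_improvement: f64,    // stop if the score delta falls below this
    pub cost_ceiling_cents: u64, // per-task budget
}

pub enum LoopOutcome<T> {
    Success(T),
    Exhausted { iterations: u32, reason: &'static str },
}

/// One attempt: an artifact, a validation score in [0, 1]
/// (1.0 = every validator passed), and the cost of the attempt.
pub struct Attempt<T> {
    pub artifact: T,
    pub score: f64,
    pub cost_cents: u64,
}

/// Attempt → Validate → Observe → Adjust → Retry, bounded. The closure
/// covers generation + validation + observer adjustment for one pass.
pub fn run_loop<T>(
    mut attempt: impl FnMut(u32) -> Attempt<T>,
    limits: &LoopLimits,
) -> LoopOutcome<T> {
    let (mut spent, mut best_score) = (0u64, 0.0f64);
    for i in 0..limits.max_iterations {
        let a = attempt(i);
        spent += a.cost_cents;
        if a.score >= 1.0 {
            return LoopOutcome::Success(a.artifact); // all validators green
        }
        if spent >= limits.cost_ceiling_cents {
            return LoopOutcome::Exhausted { iterations: i + 1, reason: "cost ceiling" };
        }
        // Minimum improvement threshold: give up once the observer's
        // adjustments stop moving the validation score.
        if i > 0 && a.score - best_score < limits.min_improvement {
            return LoopOutcome::Exhausted { iterations: i + 1, reason: "no improvement" };
        }
        best_score = best_score.max(a.score);
    }
    LoopOutcome::Exhausted { iterations: limits.max_iterations, reason: "max iterations" }
}
```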
## 5. Profile System

### 5.1 Profile Types

- **Retrieval Profile** — chunking strategy, embedding method, reranking rules
- **Memory Profile** — memory weighting, context injection rules
- **Execution Profile** — allowed providers, tool access, risk level
- **Observer Profile** — mutation aggressiveness, iteration strategy

### 5.2 Profile Constraints

- only one major profile change per iteration
- profiles must produce measurable deltas
- promotion requires repeated success

## 6. Truth Layer (Critical)

Defines non-negotiable constraints:

- Terraform rules
- Ansible structure requirements
- security policies
- organization standards

Rules are:

- immutable at runtime
- referenced by all layers
- impossible for the observer to override

## 7. Playbook System

### 7.1 Playbook Definition

Each successful run produces: task class, context used, steps executed, tools used, output artifacts, validation results, cost/latency, success score.

### 7.2 Playbook Lifecycle

- created on success
- reused for similar tasks
- decayed over time
- pruned if ineffective

## 8. Validation System

All DevOps outputs must pass: syntax validation, linting, dry-run, policy compliance. On failure, iteration continues or the task fails.

## 9. MCP Integration

MCP servers provide: tools, external data, execution capabilities. All MCP outputs must be: normalized, validated, schema-compliant. No direct MCP output reaches the model.

## 10. Token Accounting & Budget Control

Each request tracks: input tokens, output tokens, retries, fallback cost. Policies: premium providers gated, cost ceilings enforced, per-task budget limits.

## 11. Failure Handling

**Recoverable failures:** bad decomposition, missing steps, weak retrieval → observer + iteration.

**Hard failures:** missing truth data, invalid task classification, unsafe execution → termination + error report.

## 12. Success Criteria

A task is successful only if:

- the output is valid
- all validators pass
- there are no policy violations
- the result is reproducible
- cost is within limits

## 13. Key Risks & Mitigations

- **Observer drift** → bounded authority, confidence tracking
- **Memory poisoning** → validation layer, memory weighting
- **Cost explosion** → token accounting, iteration caps (budget-gate sketch below)
- **Retrieval errors** → post-retrieval validation, profile tuning
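To make the §10 policies and the cost-explosion mitigation concrete, a minimal sketch of a per-provider daily budget gate. The 429-plus-`Retry-After` behavior comes from the Phase 40 gate; everything else here (type names, cents-based accounting, the reset-at-midnight window) is an assumption.

```rust
use std::collections::HashMap;

/// Per-provider spend tracking for the §10 budget policies. Shapes are
/// illustrative; the real accounting sits behind `/v1/usage`.
pub struct BudgetGate {
    daily_limit_cents: HashMap<String, u64>, // per-provider daily budget
    spent_today_cents: HashMap<String, u64>,
}

pub enum BudgetDecision {
    Allow,
    /// Maps to the Phase 40 behavior: 429 plus a `Retry-After` header
    /// pointing at the next budget window.
    Deny { retry_after_secs: u64 },
}

impl BudgetGate {
    /// Checked before dispatch; `estimated_cents` is the worst-case cost
    /// of the attempt (prompt tokens + max completion tokens at list price).
    pub fn check(
        &self,
        provider: &str,
        estimated_cents: u64,
        secs_to_window_reset: u64,
    ) -> BudgetDecision {
        let limit = self.daily_limit_cents.get(provider).copied().unwrap_or(0);
        let spent = self.spent_today_cents.get(provider).copied().unwrap_or(0);
        if spent + estimated_cents > limit {
            BudgetDecision::Deny { retry_after_secs: secs_to_window_reset }
        } else {
            BudgetDecision::Allow
        }
    }

    /// Recorded after the provider call returns with actual usage.
    pub fn record(&mut self, provider: &str, actual_cents: u64) {
        *self.spent_today_cents.entry(provider.to_string()).or_insert(0) += actual_cents;
    }
}
```

Gating on the estimate rather than the actual keeps a single oversized request from blowing through the ceiling before accounting catches up.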