PRD — Universal AI Control Plane

Status: Long-horizon architecture target as of 2026-04-22. Lakehouse Phases 0-37 (docs/PRD.md) are preserved as the reference implementation and first domain-specific consumer. Phases 38+ (control-plane layers) are sequenced below.

Current domain: staffing. The immediate proving ground is the staffing substrate already built — synthetic workers_500k, contracts, emails, SMS drafts, playbook memory. Everything shipped in Phases 38-44 is validated first against that domain. The DevOps / Terraform / Ansible framing from the original PRD draft stays as a long-horizon target — architecture-compatible but not in current scope. See §Long-horizon domains at the bottom.

Owner: J

Cross-read: docs/PRD.md for what's shipped (staffing + AI substrate, 13 crates, ~3M rows). This doc for the layered architecture those pieces now fit into.


Phase Sequencing (Phases 38-44)

Ship each phase before starting the next. Each ends with green tests + docs update.

| Phase | Layer | What ships | Est. LOC | Risk |
|-------|-------|------------|----------|------|
| 38 | Layer 1 skeleton | /v1/chat, /v1/usage, /v1/sessions routes forwarding to existing aibridge → Ollama. Bot migrates as first consumer. | ~400 | Low — additive, no existing routes touched |
| 39 | Layer 3 adapters | aibridge::ProviderAdapter trait; Ollama + one new (OpenRouter). /v1/chat routes by config. | ~500 | Low-medium |
| 40 | Layer 2 engine | Rules-based routing (config/routing.toml), fallback chains, cost gating. Add Gemini + Claude adapters. | ~600 | Medium |
| 41 | Profile split | Separate Retrieval / Memory / Execution / Observer profiles; Phase 17 backward-compat. Absorbs Phase 37 hot-swap-async. | ~300 | Medium |
| 42 | Truth Layer | New crates/truth; Terraform/Ansible schemas; /v1/context serves rules to router + observer. | ~700 | Medium |
| 43 | Validation pipeline | Syntax/lint/dry-run/policy gates per output type. Plugs into Layer 5 execution loop. | ~400 | Medium |
| 44 | Caller migration | All internal callers route through /v1/chat. Direct sidecar access deprecated. | ~200 | Low |

Total ≈3100 LOC. Phase 37 (hot-swap async) folds into Phase 41 — it's an Execution-Profile activation concern.


Phase 38 — Universal API Skeleton

Goal: OpenAI-compatible /v1/* surface exists and forwards to existing aibridge → Ollama. Nothing about multi-provider yet — just the SHAPE, so every downstream piece (adapters, routing, usage accounting) has a surface to plug into.
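
A minimal sketch of the shape adapter in crates/gateway/src/v1/ollama.rs — the field names for the existing aibridge GenerateRequest are assumptions for illustration, not the real struct:

```rust
use serde::{Deserialize, Serialize};

// OpenAI-compatible request body accepted by POST /v1/chat.
#[derive(Deserialize)]
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
}

#[derive(Serialize, Deserialize)]
pub struct ChatMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

// Hypothetical stand-in for the existing aibridge request type.
pub struct GenerateRequest {
    pub model: String,
    pub prompt: String,
}

// Flatten the chat transcript into the single-prompt shape the current
// sidecar path expects. Phase 39+ replaces this with a proper adapter.
pub fn to_generate_request(req: &ChatRequest) -> GenerateRequest {
    let prompt = req
        .messages
        .iter()
        .map(|m| format!("{}: {}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n");
    GenerateRequest { model: req.model.clone(), prompt }
}
```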

Ships:

  • crates/gateway/src/v1/mod.rs — router + /v1/chat, /v1/usage, /v1/sessions
  • crates/gateway/src/v1/ollama.rs — shape adapter (OpenAI chat ↔ existing aibridge GenerateRequest)
  • One-line nest("/v1", ...) in crates/gateway/src/main.rs
  • Unit test: POST /v1/chat roundtrips through mocked provider

Gate:

  • curl -X POST localhost:3100/v1/chat -d '{"model":"qwen3.5:latest","messages":[{"role":"user","content":"hi"}]}' returns valid OpenAI-shape response.
  • GET localhost:3100/v1/usage returns {requests, prompt_tokens, completion_tokens, total_tokens}.
  • GET localhost:3100/v1/sessions returns {data:[]} (stub; real impl Phase 41).
  • cargo test -p gateway green.

Non-goals (explicit): streaming, tool calls, function calling, session state, multi-provider, fallback, cost gating.

Risk: Low — additive, doesn't touch existing routes. Worst case: /v1/* returns 502 and we fix the adapter. No existing caller affected.


Phase 39 — Provider Adapter Refactor

Goal: aibridge grows a ProviderAdapter trait. Ollama implementation wraps existing sidecar code. One new provider lands as proof: OpenRouter (simplest — it's OpenAI-compatible, so adapter is mostly passthrough).
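
A sketch of the trait shape and the prefix routing, assuming async-trait and anyhow as dependencies; exact signatures are illustrative, not the final API:

```rust
use async_trait::async_trait;

#[derive(Debug)]
pub struct ChatResponse {
    pub content: String,
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
}

#[async_trait]
pub trait ProviderAdapter: Send + Sync {
    /// Complete a chat request; the /v1/chat handler never sees provider-specific fields.
    async fn chat(&self, model: &str, messages: &[(String, String)]) -> anyhow::Result<ChatResponse>;
    /// Embed a batch of texts (Ollama today; hosted providers may return an error).
    async fn embed(&self, model: &str, texts: &[String]) -> anyhow::Result<Vec<Vec<f32>>>;
    /// Release locally held model resources (no-op for hosted providers).
    async fn unload(&self, model: &str) -> anyhow::Result<()>;
}

/// Route by model-name prefix: "openrouter/..." goes to OpenRouter, bare names to Ollama.
pub fn provider_key(model: &str) -> (&str, &str) {
    match model.split_once('/') {
        Some((provider, rest)) if provider == "openrouter" => ("openrouter", rest),
        _ => ("ollama", model),
    }
}
```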

Ships:

  • crates/aibridge/src/provider.rs — ProviderAdapter trait with chat() + embed() + unload() methods
  • crates/aibridge/src/providers/ollama.rs — existing sidecar code moved behind the trait
  • crates/aibridge/src/providers/openrouter.rs — new, HTTP client to openrouter.ai/api/v1/chat/completions
  • config/providers.toml — provider registry (name, base_url, auth, default_models)
  • /v1/chat routes by model field: prefix match (e.g. openrouter/anthropic/claude-3.5-sonnet → OpenRouter; bare names → Ollama)

Gate:

  • /v1/chat with model: "qwen3.5:latest" hits Ollama → green
  • /v1/chat with model: "openrouter/openai/gpt-4o-mini" hits OpenRouter (key from secrets.toml) → green
  • Neither call leaks provider-specific fields upward. Response is always the /v1/chat shape.

Non-goals: Fallback chain (Phase 40), cost gating (Phase 40), Gemini/Claude adapters (Phase 40).

Risk: Low-medium. The trait extraction is mostly a rearrange; OpenRouter is thin. Biggest risk is secret-loading conventions — SecretsProvider is already in place, so reuse that path.


Phase 40 — Routing & Policy Engine + Observability Recovery

Goal: Replace hardcoded T1-T5 routing with a rules engine. Add Gemini + Claude adapters. Cost gating enforced at router level. Reinstate Langfuse + Gitea MCP — recovery of the observability + repo-ops stack J built previously (see project_lost_stack memory).
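
A sketch of what a config/routing.toml rule could deserialize into, with first-match dispatch by task type and a fallback chain; the field names are assumptions, not the final schema (assumes serde, toml, anyhow):

```rust
use serde::Deserialize;

// Example rule in config/routing.toml (illustrative):
//   [[rule]]
//   task_type = "staffing.sms_draft"
//   max_prompt_tokens = 2048
//   chain = ["qwen3.5:latest", "openrouter/openai/gpt-4o-mini"]
//   max_cost_usd = 0.02
#[derive(Deserialize)]
pub struct RoutingConfig {
    pub rule: Vec<Rule>,
}

#[derive(Deserialize)]
pub struct Rule {
    /// Task class this rule applies to, e.g. "staffing.sms_draft".
    pub task_type: String,
    /// Skip this rule if the prompt exceeds the token budget.
    pub max_prompt_tokens: Option<u32>,
    /// Primary model first, then fallbacks on 5xx / timeout.
    pub chain: Vec<String>,
    /// Per-request cost ceiling; the router rejects if the estimate exceeds it.
    pub max_cost_usd: Option<f64>,
}

impl RoutingConfig {
    pub fn load(path: &str) -> anyhow::Result<Self> {
        Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
    }

    /// First matching rule wins; None means no rule covers the task.
    pub fn route(&self, task_type: &str, prompt_tokens: u32) -> Option<&Rule> {
        self.rule.iter().find(|r| {
            r.task_type == task_type
                && r.max_prompt_tokens.map_or(true, |cap| prompt_tokens <= cap)
        })
    }
}
```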

Ships — routing:

  • crates/aibridge/src/routing.rs — rules engine (match on: task type, token budget, previous attempt failures, profile ID)
  • config/routing.toml — rules in TOML (human-editable, hot-reloadable)
  • crates/aibridge/src/providers/gemini.rs — generativelanguage.googleapis.com adapter
  • crates/aibridge/src/providers/claude.rs — api.anthropic.com adapter
  • Fallback chain support: if primary returns 5xx or times out, try next in chain
  • Cost gate: per-request budget + daily budget per-provider

Ships — observability (was lost, now restored):

  • Langfuse self-hosted via Docker Compose. Single source of truth for every LLM call trace: prompt / response / tokens / cost / latency / provider / fallback chain / profile used. UI at localhost:3000. Keys in /etc/lakehouse/secrets.toml.
  • crates/aibridge/src/langfuse.rs — thin fire-and-forget trace emitter. Every /v1/chat call spawns a background task that POSTs to langfuse/api/public/ingestion. Non-blocking: trace failures never affect the response (sketched after this list).
  • Langfuse → observer pipe — mcp-server/langfuse_bridge.ts or similar. Polls Langfuse's trace API at interval, forwards completed traces to observer :3800/event with source: "langfuse". KB now sees cost/latency deltas per model, not just outcome deltas.
  • Gitea MCP reconnect — the MCP server binary still installed at /home/profit/.bun/install/cache/gitea-mcp@0.0.10/ gets wired into mcp-server/index.ts tool registry. Agents can open PRs, comment on issues, list commits via named tools. Closes Phase 28's repo-ops gap.
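
A sketch of the fire-and-forget emitter, assuming tokio, reqwest, and tracing as dependencies; the ingestion payload shape is illustrative only, not Langfuse's exact batch schema:

```rust
use serde_json::json;

/// Emit one trace without blocking the /v1/chat response path.
/// Any failure is logged and dropped — tracing must never affect callers.
pub fn emit_trace(
    client: reqwest::Client,
    base_url: String,
    public_key: String,
    secret_key: String,
    trace: serde_json::Value,
) {
    tokio::spawn(async move {
        let result = client
            .post(format!("{base_url}/api/public/ingestion"))
            .basic_auth(public_key, Some(secret_key))
            .json(&json!({ "batch": [trace] })) // illustrative payload shape
            .send()
            .await;
        if let Err(err) = result {
            tracing::warn!(?err, "langfuse trace emit failed (ignored)");
        }
    });
}
```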

Gate:

  • Rule like "local models for simple JSON emitters, cloud for reasoning" fires correctly by task type
  • Primary fails → fallback provider hits, response still matches /v1/chat shape
  • Daily budget hit → subsequent requests return 429 with clear retry-at header
  • /v1/usage reports per-provider breakdown
  • Every /v1/chat call appears in Langfuse UI with correct prompt, response, latency, token count within 2 seconds of the request completing
  • Langfuse → observer pipe delivers trace deltas to KB: GET :3800/stats?source=langfuse shows non-zero count after a few scenarios run
  • Gitea MCP tools callable — list_prs, open_pr, comment_on_issue exposed in mcp-server/index.ts, verifiable via a quick agent scenario

Non-goals: Retrieval Profile split (Phase 41), Truth Layer (Phase 42). Langfuse self-hosted UI customization / SSO.

Risk: Medium. Multi-provider auth + cost tracking is cross-cutting; Langfuse adds 4-5 Docker containers (PostgreSQL, ClickHouse, Redis, web, worker). Mitigation: every provider call wrapped in a single dispatch() function so observability flows through one point; Langfuse Docker Compose is their supported deployment path, well-tested.


Phase 41 — Profile System Expansion (+ Phase 37 hot-swap async folded in)

Goal: The existing ModelProfile (Phase 17) becomes ExecutionProfile. Three new profile types land alongside. Profile activation is async — returns job_id, work runs in background (Phase 37 deliverable).
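
A sketch of the single-flight activation pattern listed under Ships below; the job-bookkeeping details (uuid dependency, state enum) are assumptions for illustration:

```rust
use std::sync::Mutex;
use uuid::Uuid;

#[derive(Clone, Debug)]
pub enum JobState { Running, Done(String), Failed(String) }

#[derive(Default)]
pub struct ActivationTracker {
    current: Mutex<Option<(Uuid, JobState)>>,
}

impl ActivationTracker {
    /// Single-flight guard: refuse a new activation while one is running.
    /// On success the handler returns HTTP 202 with this job_id.
    pub fn try_start(&self) -> Result<Uuid, &'static str> {
        let mut slot = self.current.lock().unwrap();
        if matches!(slot.as_ref(), Some((_, JobState::Running))) {
            return Err("activation already in progress");
        }
        let job_id = Uuid::new_v4();
        *slot = Some((job_id, JobState::Running));
        Ok(job_id)
    }

    /// Called by the background task when the swap finishes (or fails).
    pub fn finish(&self, job_id: Uuid, outcome: JobState) {
        let mut slot = self.current.lock().unwrap();
        if let Some((id, state)) = slot.as_mut() {
            if *id == job_id { *state = outcome; }
        }
    }

    /// Polled by GET /vectors/profile/jobs/{id}.
    pub fn status(&self, job_id: Uuid) -> Option<JobState> {
        self.current.lock().unwrap().as_ref()
            .filter(|(id, _)| *id == job_id)
            .map(|(_, state)| state.clone())
    }
}
```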

Ships:

  • crates/shared/src/profiles/ — ExecutionProfile, RetrievalProfile, MemoryProfile, ObserverProfile
  • crates/catalogd gains per-profile-type CRUD endpoints (/catalog/profiles/retrieval, etc.)
  • crates/vectord/src/activation.rs — ActivationTracker with background-job pattern (Phase 37 content)
  • POST /vectors/profile/{id}/activate returns 202 + job_id, polling at GET /vectors/profile/jobs/{id}
  • Single-flight guard: refuse new activation if one is pending/running
  • Backward compat: ModelProfile still loads, aliased to ExecutionProfile

Gate:

  • Activate a profile → returns 202 in <100ms → job completes in background → /vectors/profile/jobs/{id} shows progress + final report
  • tests/multi-agent/run_stress.ts Phase 3 (hot-swap stress) passes (was SKIPPED)
  • Retrieval + Memory + Observer profiles can be created independently of Execution profile

Non-goals: Truth Layer (Phase 42), validation (Phase 43), caller migration (Phase 44).

Risk: Medium. Schema change + async refactor. Mitigation: #[serde(default)] on all new fields; existing profiles load unchanged.


Phase 42 — Truth Layer (staffing rules first)

Goal: New crates/truth crate holds immutable task-class constraints. Served via /v1/context to router and observer. No layer can override truth. Staffing rules ship first; Terraform/Ansible rule shapes are scaffolded but unpopulated until the long-horizon phase.
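
A sketch of one staffing invariant as a truth-layer check — the client-blacklist gate plus the fill-count invariant — returning a structured error with a rule citation, which the router turns into the 422 described below. Names are illustrative:

```rust
use serde::Serialize;
use std::collections::HashSet;

#[derive(Serialize)]
pub struct RuleViolation {
    pub rule_id: &'static str, // cited back to the caller in the 422 body
    pub message: String,
}

pub struct FillProposal {
    pub client_id: String,
    pub target_count: usize,
    pub endorsed_worker_ids: Vec<String>,
}

pub struct TruthStore {
    /// (client_id, worker_id) pairs that must never be matched.
    pub client_blacklist: HashSet<(String, String)>,
}

impl TruthStore {
    /// The router calls this before any dispatch; a violation hard-fails the
    /// task with 422 + citation and burns zero cloud tokens.
    pub fn check_fill(&self, p: &FillProposal) -> Result<(), RuleViolation> {
        for worker_id in &p.endorsed_worker_ids {
            if self.client_blacklist.contains(&(p.client_id.clone(), worker_id.clone())) {
                return Err(RuleViolation {
                    rule_id: "staffing.client_blacklist",
                    message: format!("worker {worker_id} is blacklisted for client {}", p.client_id),
                });
            }
        }
        if p.endorsed_worker_ids.len() != p.target_count {
            return Err(RuleViolation {
                rule_id: "staffing.fill_count",
                message: format!(
                    "endorsed {} workers, contract requires {}",
                    p.endorsed_worker_ids.len(),
                    p.target_count
                ),
            });
        }
        Ok(())
    }
}
```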

Ships:

  • crates/truth/src/lib.rs — TruthStore with schema loading (TOML/YAML rules)
  • crates/truth/src/staffing.rs — staffing rule shapes:
    • Worker eligibility (active status, not blacklisted for client, geo match, role match, availability window)
    • Contract invariants (deadline present, role/count/city/state populated, budget_per_hour_max ≥ 0)
    • PII handling (redaction rules on fields tagged PII before any cloud call — covers existing Phase 10 sensitivity tags)
    • Client blacklist enforcement (auto-applied before any fill proposal)
    • Fill requirements (endorsed_names count matches target_count, no duplicate worker_ids within a single fill)
  • crates/truth/src/devops.rs — scaffold only: empty rule struct for Terraform/Ansible, populated in the long-horizon phase. Keeps the dispatcher signature stable so no refactor is needed later.
  • truth/ dir at repo root — rule files, versioned in git
  • /v1/context endpoint — returns applicable rules for a task class (staffing.fill, staffing.rescue, staffing.sms_draft, etc.)
  • Router consults truth before dispatching: if task violates a rule, hard-fail with structured error + rule citation (matches existing Phase 13 access-control pattern)

Gate:

  • Submit a fill proposal where a worker is client-blacklisted — router returns 422 + rule citation, no cloud tokens burned
  • Submit a fill with endorsed_names.length != target_count — 422 before dispatch
  • Observer cannot promote a correction that violates truth (rejected at router gate)
  • PII redaction verified: SSN / salary fields stripped from prompts before cloud calls
  • Truth reload is explicit (no file-watch hot reload in this phase)

Non-goals: Validation execution (Phase 43), policy learning / evolution (deferred), actual Terraform/Ansible rules (long-horizon phase).

Risk: Medium. Domain-specific rule enumeration takes discovery — start with a minimal rule set (5-10 staffing rules, derived from existing Phase 10-13 work) and grow organically as real fills surface edge cases.


Phase 43 — Validation Pipeline (staffing outputs first)

Goal: Staffing outputs run through schema / completeness / consistency / policy gates. Plug into Layer 5 execution loop — failure triggers observer-correction iteration. This is where the 0→85% pattern reproduces on real staffing tasks — the iteration loop with validation in place is what made small models successful.
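
A sketch of the Validator trait and the bounded generate → validate → correct loop, with assumed names for the generation and observer hooks; the real gateway loop will thread providers, usage accounting, and logging through these points:

```rust
pub struct Report { pub passed: bool, pub issues: Vec<String> }

pub enum Artifact { FillProposal(String), EmailDraft(String), SealedPlaybook(String) }

pub trait Validator {
    fn validate(&self, artifact: &Artifact) -> Report;
}

/// Bounded execution loop: generate, validate, ask the observer for a
/// correction on failure, retry. max_iterations=3 keeps cost bounded.
pub fn run_task(
    mut generate: impl FnMut(Option<&str>) -> Artifact, // correction hint -> new attempt
    observer_correct: impl Fn(&Report) -> String,        // failure report -> correction hint
    validator: &dyn Validator,
    max_iterations: usize,
) -> Result<Artifact, Report> {
    let mut hint: Option<String> = None;
    let mut last_report = Report { passed: false, issues: vec![] };
    for _ in 0..max_iterations {
        let artifact = generate(hint.as_deref());
        let report = validator.validate(&artifact);
        if report.passed {
            return Ok(artifact);
        }
        hint = Some(observer_correct(&report));
        last_report = report;
    }
    Err(last_report)
}
```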

Ships:

  • crates/validator/src/lib.rs — Validator trait: validate(artifact) -> Result<Report, ValidationError> + Artifact enum over output types
  • crates/validator/src/staffing/fill.rs — fill-proposal validator:
    • Schema compliance (propose_done shape matches {fills: [{candidate_id, name}]})
    • Completeness (endorsed count == target_count)
    • Worker existence (every candidate_id present in workers_500k via SQL lookup)
    • Status check (every worker has status=active, not_on_client_blacklist)
    • Geo/role match (worker city/state/role matches contract)
  • crates/validator/src/staffing/email.rs — generated email/SMS drafts (sketched after this list):
    • Schema (TO/BODY fields present)
    • Length (SMS ≤ 160 chars; email subject ≤ 78 chars)
    • PII absence (no SSN / salary leaked into outgoing text)
    • Worker-name consistency (name in message matches worker record)
  • crates/validator/src/staffing/playbook.rs — sealed playbook:
    • Operation format (fill: Role xN in City, ST)
    • endorsed_names non-empty, ≤ target_count × 2
    • fingerprint populated (Phase 25 validity window requirement)
  • crates/validator/src/devops.rs — scaffold only: stubbed Terraform/Ansible validators (terraform validate, ansible-lint) for the long-horizon phase
  • Task execution loop in gateway: generate → validate → if fail, observer correction + retry (bounded by max_iterations=3)
  • Validation results logged to observer (data/_observer/ops.jsonl) + KB (data/_kb/outcomes.jsonl)
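
A sketch of the email/SMS draft checks from the list above (field presence, length caps, PII absence). The SSN pattern here is a simplistic placeholder; the real PII rules come from the Phase 10 sensitivity tags:

```rust
pub struct Draft {
    pub to: String,
    pub subject: Option<String>, // None for SMS
    pub body: String,
}

/// Returns a list of validation issues; empty means the draft passes.
pub fn validate_draft(d: &Draft, is_sms: bool) -> Vec<String> {
    let mut issues = Vec::new();
    if d.to.trim().is_empty() {
        issues.push("missing TO field".into());
    }
    if is_sms && d.body.chars().count() > 160 {
        issues.push(format!("SMS body is {} chars, limit 160", d.body.chars().count()));
    }
    if let Some(subject) = &d.subject {
        if subject.chars().count() > 78 {
            issues.push("email subject exceeds 78 chars".into());
        }
    }
    // Naive SSN-shaped pattern check (###-##-####); real PII rules are tag-driven.
    let looks_like_ssn = d.body.split_whitespace().any(|w| {
        let parts: Vec<&str> = w.split('-').collect();
        parts.len() == 3
            && parts[0].len() == 3 && parts[1].len() == 2 && parts[2].len() == 4
            && parts.iter().all(|p| p.chars().all(|c| c.is_ascii_digit()))
    });
    if looks_like_ssn {
        issues.push("possible SSN leaked into outgoing text".into());
    }
    issues
}
```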

Gate:

  • Generate a fill proposal → validator catches a phantom worker_id → observer + cloud rescue propose correction → retry → green. This reproduces the 0→85% pattern on the live staffing pipeline.
  • /v1/usage shows iteration count per task, provider fallback chain, and tokens-per-iteration. Cost attribution per task class visible.
  • Reproduces the 14× citation-lift finding from Phase 19 refinement on similar geos once the validation gates are in place.

Non-goals: Caller migration (Phase 44), Terraform/Ansible wired validation (long-horizon).

Risk: Medium. Validation shapes have to match actual executor outputs; mitigation is using real scenario runs as test fixtures (we have ~100 of them in tests/multi-agent/playbooks/).


Phase 44 — Caller Migration + Direct-Provider Deprecation

Goal: Every internal LLM caller routes through /v1/chat. Direct sidecar / direct Ollama / direct OpenAI calls are removed or explicitly deprecated with a warning.
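
A sketch of what the thin client looks like after migration — every caller builds an OpenAI-shape body and POSTs to the gateway, never to a provider directly. Assumes reqwest and anyhow; the response field access follows the OpenAI shape but the real struct may be typed rather than raw JSON:

```rust
use serde_json::{json, Value};

pub struct AiClient {
    http: reqwest::Client,
    base_url: String, // e.g. "http://localhost:3100"
}

impl AiClient {
    pub fn new(base_url: impl Into<String>) -> Self {
        Self { http: reqwest::Client::new(), base_url: base_url.into() }
    }

    /// All LLM traffic funnels through /v1/chat so routing, truth gating,
    /// usage accounting, and tracing see every call.
    pub async fn chat(&self, model: &str, user_prompt: &str) -> anyhow::Result<String> {
        let body = json!({
            "model": model,
            "messages": [{ "role": "user", "content": user_prompt }],
        });
        let resp: Value = self.http
            .post(format!("{}/v1/chat", self.base_url))
            .json(&body)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        // OpenAI-compatible shape: choices[0].message.content
        Ok(resp["choices"][0]["message"]["content"]
            .as_str()
            .unwrap_or_default()
            .to_string())
    }
}
```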

Ships:

  • aibridge::AiClient becomes a thin /v1/chat client (was direct-to-sidecar)
  • crates/vectord::agent (autotune): routes through /v1
  • crates/vectord::autotune: routes through /v1
  • tests/multi-agent/agent.ts::generate(): routes through /v1
  • bot/propose.ts: routes through /v1 (already proposed as Phase 38's test consumer, formalized here)
  • Lint rule / grep pre-commit hook: no fetch.*:3200/generate outside the provider adapters

Gate:

  • grep -r "localhost:3200/generate\|/api/generate" returns only adapter files + deprecation shims
  • /v1/usage accounts for every LLM call in the system within a 1-minute window after hitting a fresh scenario
  • Full scenario passes end-to-end without any caller bypassing /v1/*

Non-goals: New features. This phase is purely mechanical migration.

Risk: Low. Mechanical. Tests catch regressions.


Phase 45 — Doc-drift detection + context7 integration

Goal: Playbooks know which external docs they were written against. When those docs change (Docker adds a feature, npm lib goes major, Terraform renames a resource), the playbook is automatically flagged. Small models never run confidently-outdated procedures — the drift signal reaches them before the next execution does.

Why this phase exists at all: The 0→85% thesis depends on the hyperfocus lane staying valid. External doc drift invalidates the lane silently — popular playbooks can compound the wrong way, accumulating boost while growing more wrong. Phase 25 already retires playbooks on internal schema drift; Phase 45 is the same mechanism against external doc drift. This is the completion of the learning loop, not an optional add-on.
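
A sketch of the DocRef record and the two predicates the rest of the phase hangs off: drift comparison against the version reported by the context7 bridge, and the boost-exclusion rule. The bridge response field and the wrapper struct below are assumptions for illustration:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Clone)]
pub struct DocRef {
    pub tool: String,         // e.g. "docker"
    pub version_seen: String, // version the playbook was written against
    pub snippet_hash: Option<String>,
    pub source_url: Option<String>,
    pub seen_at: DateTime<Utc>,
}

// Illustrative wrapper for the drift-related PlaybookEntry fields.
#[derive(Serialize, Deserialize, Default)]
pub struct PlaybookDocState {
    #[serde(default)]
    pub doc_refs: Vec<DocRef>,                 // empty for pre-Phase-45 entries
    #[serde(default)]
    pub doc_drift_flagged_at: Option<String>,  // set when a drift check flags this entry
    #[serde(default)]
    pub doc_drift_reviewed_at: Option<String>, // set by the human /resolve step
}

/// Verdict for one doc_refs entry after asking the context7 bridge
/// (GET /docs/:tool/version) what the current version is.
pub fn drifted(doc_ref: &DocRef, version_current: &str) -> bool {
    doc_ref.version_seen != version_current
}

/// Flagged-but-unreviewed entries drop out of the boost pool — the same
/// exclusion rule already applied to retired and superseded playbooks.
pub fn excluded_from_boost(state: &PlaybookDocState) -> bool {
    state.doc_drift_flagged_at.is_some() && state.doc_drift_reviewed_at.is_none()
}
```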

Ships:

  • shared::types::DocRef — { tool: String, version_seen: String, snippet_hash: Option<String>, source_url: Option<String>, seen_at: DateTime<Utc> }
  • PlaybookEntry.doc_refs: Vec<DocRef> — #[serde(default)] so pre-Phase-45 entries load as empty vec
  • /vectors/playbook_memory/seed + /revise accept doc_refs in the request body
  • /vectors/playbook_memory/doc_drift/check/{id} — manual drift check: looks up each doc_refs[] entry via the context7 bridge, returns per-tool {version_seen, version_current, drifted: bool} plus overall verdict
  • /vectors/playbook_memory/doc_drift/scan — batch scan across all active playbooks (scheduled path for Phase 45.2)
  • mcp-server/context7_bridge.ts — Bun HTTP bridge. Exposes GET /docs/:tool/version + GET /docs/:tool/:version/diff?since=X against the installed context7 MCP plugin. Gateway calls this over localhost.
  • PlaybookMemory::compute_boost_for_filtered_with_role — excludes entries where doc_drift_flagged_at.is_some() && doc_drift_reviewed_at.is_none() (same rule as retired + superseded)
  • Overview model synthesis writes data/_kb/doc_drift_corrections.jsonl per detected drift: {playbook_id, tool, version_seen, version_current, diff_summary, recommended_action, generated_at}
  • Human-in-the-loop re-seal path: /vectors/playbook_memory/doc_drift/resolve/{id} — marks reviewed, optionally triggers revise_entry if procedure changed

Gate:

  • Seal a playbook referencing Docker 24.x → doc_refs captured. Bump Docker version behind the scenes → /doc_drift/check/{id} returns drifted: true, from: 24.0.7, to: 25.0.1, summary: "...". The boosted playbook count on next /vectors/hybrid query drops by 1 (drift-flagged skipped).
  • doc_drift_corrections.jsonl contains the overview model's synthesis for the drift with at least: summary of change, recommended action, cost/impact estimate.
  • Human calls /doc_drift/resolve/{id} after reviewing → playbook returns to active boost pool (or supersedes via Phase 27 if procedure materially changed).
  • Unit tests: DocRef serde default (legacy entries load as empty), drift check against mocked context7 bridge, boost exclusion when drifted+unreviewed.

Non-goals (explicit):

  • Automatic re-seal without human review. Drift-detection → flag, not silent rewrite.
  • Cross-playbook propagation of one drift diagnosis. Each playbook reviewed individually (aggregation later if warranted).
  • Generating the updated procedure. T3 suggests; human or separate bot (see bot/) writes.

Risk: Medium. The context7 bridge is new infrastructure (Bun ↔ context7 MCP plugin ↔ HTTP shape for gateway consumption). Mitigation: context7 plugin is already installed; its MCP tools return structured JSON; the bridge is thin adapter code. Start with single-tool drift check (Docker) before broadening.


Long-horizon domains (not in current phase sequence)

The architecture was drafted with DevOps execution (Terraform, Ansible) as the eventual target. That remains aspirational, not current scope — we don't start wiring terraform validate / ansible-lint until the staffing domain proves the six-layer architecture at scale.

What "proves at scale" means concretely:

  • Phases 38-44 all shipped against staffing, green tests
  • Live staffing pipeline handles multiple concurrent contracts with emails + SMS + indexed playbooks via /v1/*
  • Observed iteration success lift (the 0→85% pattern) reproduced on varied staffing scenarios, not just the original proof-of-concept
  • Token + cost accounting stable across provider fallback chains under real load
  • Truth Layer rules prevent real fill errors before cloud burn (not just theoretical)

When staffing hits that bar, the DevOps domain lights up by:

  • Populating crates/truth/src/devops.rs with real Terraform/Ansible rule shapes
  • Populating crates/validator/src/devops.rs with terraform validate / ansible-lint shell-out
  • Adding DevOps task classes to /v1/context rule lookup
  • No architectural changes needed — the dispatcher, router, and execution loop stay identical.

Other candidate long-horizon domains (same pattern):

  • Code generation tasks (validation via cargo check / bun test)
  • SQL query generation (validation via EXPLAIN + schema compliance)
  • Data pipeline definitions (validation via lineage check + schema compliance)

None of these are in the current roadmap. Staffing first, production-proven, then expand.


1. Purpose

Design and implement a universal AI control-plane API that enables:

  • deterministic high-stakes task execution — the immediate domain is staffing fills (contracts, workers, emails, SMS) at scale; the same architecture extends later to DevOps (Terraform, Ansible) without redesign
  • iterative capability amplification via observer loops
  • hybrid local + cloud model orchestration
  • structured knowledge + memory + playbook reuse
  • controlled improvement over time through validated iteration

The system prioritizes validated pipeline success over raw model intelligence.

Current scope — staffing at scale

The architecture must make the already-built staffing substrate reliably answer millions of inputs: pull real data, graph it across contracts, handle multiple concurrent contracts, index emails + SMS + playbooks via the hybrid SQL+vector method, and get faster and better each iteration via the feedback loops (Phase 19 playbook boost, Phase 22 KB pathway recommender, Phase 24 observer, Phase 26 Mem0 upsert).

DevOps is an eventual domain — see §Long-horizon domains.

2. Core Objectives

2.1 Functional Goals

  • Provide a single universal API for all AI interactions
  • Support multi-provider routing (local, flat-rate, token-based)
  • Enable iterative execution loops with observer correction
  • Store and reuse successful execution playbooks
  • Integrate: S3-based knowledge storage, LanceDB retrieval/indexing, Mem0 memory layer, MCP tool ecosystem

2.2 Non-Functional Goals

  • Deterministic behavior under constrained execution
  • Full observability and cost accounting
  • Safe DevOps execution (no uncontrolled mutation)
  • Profile-driven routing and execution
  • Reproducibility of successful runs

3. System Architecture

3.1 Layer Overview

Layer 1 — Universal API

Single entry point for all applications. Endpoints:

  • /v1/chat
  • /v1/respond
  • /v1/tools
  • /v1/context
  • /v1/usage
  • /v1/sessions

All programs must use this layer. No direct provider calls allowed.

Layer 2 — Routing & Policy Engine

Responsibilities: provider selection, fallback logic, cost gating, premium access control, profile enforcement. Routing based on: task type, constraints, execution profile, system health.

Layer 3 — Provider Adapter Layer

Normalizes all providers: Ollama (local), OpenRouter, Gemini (direct), Claude (direct or routed), future providers. Guarantee: no provider-specific logic leaks upward.

Layer 4 — Knowledge & Memory Plane

  • Knowledge (S3 + LanceDB): raw documents, processed chunks, embeddings, index profiles
  • Memory (Mem0): extracted facts, entity-linked memory, session-aware retrieval
  • Playbooks: successful execution traces, reusable patterns, correction strategies

Layer 5 — Execution Loop

Each task runs through: Retrieval → Planning → Generation → Validation → Observer feedback → Iteration (if needed).

Layer 6 — Observability & Accounting

Every request logs: tokens (input/output), cost, latency, provider, fallback chain, profile used, iteration delta.

4. Execution Model

4.1 Iterative Loop

Each task follows: Attempt → Validate → Observe → Adjust → Retry

Constraints:

  • max iterations (default: 3)
  • minimum improvement threshold
  • cost ceiling per task

4.2 Observer Role

Observer can: analyze failure, suggest corrections, recommend profile changes. Observer cannot: modify truth layer, auto-promote changes, override constraints.

4.3 Cloud Escalation

Cloud models (Gemini, Claude) are used for: structural correction, reasoning gaps, complex decomposition. They are not used for: brute-force retries, bulk execution.

5. Profile System

5.1 Profile Types

  • Retrieval Profile — chunking strategy, embedding method, reranking rules
  • Memory Profile — memory weighting, context injection rules
  • Execution Profile — allowed providers, tool access, risk level
  • Observer Profile — mutation aggressiveness, iteration strategy

5.2 Profile Constraints

  • only one major profile change per iteration
  • profiles must produce measurable deltas
  • promotion requires repeated success

6. Truth Layer (Critical)

Defines non-negotiable constraints:

  • Terraform rules
  • Ansible structure requirements
  • security policies
  • organization standards

Rules:

  • immutable at runtime
  • referenced by all layers
  • cannot be overridden by observer

7. Playbook System

7.1 Playbook Definition

Each successful run produces: task class, context used, steps executed, tools used, output artifacts, validation results, cost/latency, success score.

7.2 Playbook Lifecycle

  • created on success
  • reused for similar tasks
  • decayed over time
  • pruned if ineffective

8. Validation System

All DevOps outputs must pass: syntax validation, linting, dry-run, policy compliance. Failure → iteration continues or task fails.

9. MCP Integration

MCP servers provide: tools, external data, execution capabilities. All MCP outputs must be: normalized, validated, schema-compliant. No direct MCP output reaches the model.

10. Token Accounting & Budget Control

Each request tracks: input tokens, output tokens, retries, fallback cost. Policies: premium providers gated, cost ceilings enforced, per-task budget limits.

11. Failure Handling

Recoverable failures: bad decomposition, missing steps, weak retrieval → observer + iteration.

Hard failures: missing truth data, invalid task classification, unsafe execution → termination + error report.

12. Success Criteria

A task is successful only if:

  • output is valid
  • all validators pass
  • no policy violations
  • result is reproducible
  • cost within limits

13. Key Risks & Mitigations

  • Observer drift → bounded authority, confidence tracking
  • Memory poisoning → validation layer, memory weighting
  • Cost explosion → token accounting, iteration caps
  • Retrieval errors → post-retrieval validation, profile tuning