PRD — Universal AI Control Plane

Status: Long-horizon architecture target as of 2026-04-22. Lakehouse Phases 0-37 (docs/PRD.md) are preserved as the reference implementation and first domain-specific consumer. Phases 38+ (control-plane layers) are sequenced below.

Current domain: staffing. The immediate proving ground is the staffing substrate already built — synthetic workers_500k, contracts, emails, SMS drafts, playbook memory. Everything shipped in Phases 38-44 is validated first against that domain. The DevOps / Terraform / Ansible framing from the original PRD draft stays as a long-horizon target — architecture-compatible but not in current scope. See §Long-horizon domains at the bottom.

Owner: J

Cross-read: docs/PRD.md for what's shipped (staffing + AI substrate, 13 crates, ~3M rows). This doc for the layered architecture those pieces now fit into.


Phase Sequencing (Phases 38-44)

Ship each phase before starting the next. Each ends with green tests + docs update.

| Phase | Layer | What ships | Est. LOC | Risk |
|-------|-------|------------|----------|------|
| 38 | Layer 1 skeleton | /v1/chat, /v1/usage, /v1/sessions routes forwarding to existing aibridge → Ollama. Bot migrates as first consumer. | ~400 | Low — additive, no existing routes touched |
| 39 | Layer 3 adapters | aibridge::ProviderAdapter trait; Ollama + one new (OpenRouter). /v1/chat routes by config. | ~500 | Low-medium |
| 40 | Layer 2 engine | Rules-based routing (config/routing.toml), fallback chains, cost gating. Add Gemini + Claude adapters. | ~600 | Medium |
| 41 | Profile split | Separate Retrieval / Memory / Execution / Observer profiles; Phase 17 backward-compat. Absorbs Phase 37 hot-swap-async. | ~300 | Medium |
| 42 | Truth Layer | New crates/truth; Terraform/Ansible schemas; /v1/context serves rules to router + observer. | ~700 | Medium |
| 43 | Validation pipeline | Syntax/lint/dry-run/policy gates per output type. Plugs into Layer 5 execution loop. | ~400 | Medium |
| 44 | Caller migration | All internal callers route through /v1/chat. Direct sidecar access deprecated. | ~200 | Low |

Total ≈3100 LOC. Phase 37 (hot-swap async) folds into Phase 41 — it's an Execution-Profile activation concern.


Phase 38 — Universal API Skeleton

Goal: OpenAI-compatible /v1/* surface exists and forwards to existing aibridge → Ollama. Nothing about multi-provider yet — just the SHAPE, so every downstream piece (adapters, routing, usage accounting) has a surface to plug into.
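
A minimal sketch of the shape adapter in crates/gateway/src/v1/ollama.rs — the field names for the existing aibridge GenerateRequest are assumptions for illustration, not the real struct:

```rust
use serde::{Deserialize, Serialize};

// OpenAI-compatible request body accepted by POST /v1/chat.
#[derive(Deserialize)]
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
}

#[derive(Serialize, Deserialize)]
pub struct ChatMessage {
    pub role: String, // "system" | "user" | "assistant"
    pub content: String,
}

// Hypothetical stand-in for the existing aibridge request type.
pub struct GenerateRequest {
    pub model: String,
    pub prompt: String,
}

// Flatten the chat transcript into the single-prompt shape the current
// sidecar path expects. Phase 39+ replaces this with a proper adapter.
pub fn to_generate_request(req: &ChatRequest) -> GenerateRequest {
    let prompt = req
        .messages
        .iter()
        .map(|m| format!("{}: {}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n");
    GenerateRequest { model: req.model.clone(), prompt }
}
```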

Ships:

  • crates/gateway/src/v1/mod.rs — router + /v1/chat, /v1/usage, /v1/sessions
  • crates/gateway/src/v1/ollama.rs — shape adapter (OpenAI chat ↔ existing aibridge GenerateRequest)
  • One-line nest("/v1", ...) in crates/gateway/src/main.rs
  • Unit test: POST /v1/chat roundtrips through mocked provider

Gate:

  • curl -X POST localhost:3100/v1/chat -d '{"model":"qwen3.5:latest","messages":[{"role":"user","content":"hi"}]}' returns valid OpenAI-shape response.
  • GET localhost:3100/v1/usage returns {requests, prompt_tokens, completion_tokens, total_tokens}.
  • GET localhost:3100/v1/sessions returns {data:[]} (stub; real impl Phase 41).
  • cargo test -p gateway green.

Non-goals (explicit): streaming, tool calls, function calling, session state, multi-provider, fallback, cost gating.

Risk: Low — additive, doesn't touch existing routes. Worst case: /v1/* returns 502 and we fix the adapter. No existing caller affected.


Phase 39 — Provider Adapter Refactor

Goal: aibridge grows a ProviderAdapter trait. Ollama implementation wraps existing sidecar code. One new provider lands as proof: OpenRouter (simplest — it's OpenAI-compatible, so adapter is mostly passthrough).
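
A sketch of the trait shape and the prefix routing, assuming async-trait and anyhow as dependencies; exact signatures are illustrative, not the final API:

```rust
use async_trait::async_trait;

#[derive(Debug)]
pub struct ChatResponse {
    pub content: String,
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
}

#[async_trait]
pub trait ProviderAdapter: Send + Sync {
    /// Complete a chat request; the /v1/chat handler never sees provider-specific fields.
    async fn chat(&self, model: &str, messages: &[(String, String)]) -> anyhow::Result<ChatResponse>;
    /// Embed a batch of texts (Ollama today; hosted providers may return an error).
    async fn embed(&self, model: &str, texts: &[String]) -> anyhow::Result<Vec<Vec<f32>>>;
    /// Release locally held model resources (no-op for hosted providers).
    async fn unload(&self, model: &str) -> anyhow::Result<()>;
}

/// Route by model-name prefix: "openrouter/..." goes to OpenRouter, bare names to Ollama.
pub fn provider_key(model: &str) -> (&str, &str) {
    match model.split_once('/') {
        Some((provider, rest)) if provider == "openrouter" => ("openrouter", rest),
        _ => ("ollama", model),
    }
}
```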

Ships:

  • crates/aibridge/src/provider.rs — ProviderAdapter trait with chat() + embed() + unload() methods
  • crates/aibridge/src/providers/ollama.rs — existing sidecar code moved behind the trait
  • crates/aibridge/src/providers/openrouter.rs — new, HTTP client to openrouter.ai/api/v1/chat/completions
  • config/providers.toml — provider registry (name, base_url, auth, default_models)
  • /v1/chat routes by model field: prefix match (e.g. openrouter/anthropic/claude-3.5-sonnet → OpenRouter; bare names → Ollama)

Gate:

  • /v1/chat with model: "qwen3.5:latest" hits Ollama → green
  • /v1/chat with model: "openrouter/openai/gpt-4o-mini" hits OpenRouter (key from secrets.toml) → green
  • Neither call leaks provider-specific fields upward. Response is always the /v1/chat shape.

Non-goals: Fallback chain (Phase 40), cost gating (Phase 40), Gemini/Claude adapters (Phase 40).

Risk: Low-medium. The trait extraction is mostly a rearrange; OpenRouter is thin. Biggest risk is secret-loading conventions — SecretsProvider is already in place, so reuse that path.


Phase 40 — Routing & Policy Engine + Observability Recovery

Goal: Replace hardcoded T1-T5 routing with a rules engine. Add Gemini + Claude adapters. Cost gating enforced at router level. Reinstate Langfuse + Gitea MCP — recovery of the observability + repo-ops stack J built previously (see project_lost_stack memory).
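
A sketch of what a config/routing.toml rule could deserialize into, with first-match dispatch by task type and a fallback chain; the field names are assumptions, not the final schema (assumes serde, toml, anyhow):

```rust
use serde::Deserialize;

// Example rule in config/routing.toml (illustrative):
//   [[rule]]
//   task_type = "staffing.sms_draft"
//   max_prompt_tokens = 2048
//   chain = ["qwen3.5:latest", "openrouter/openai/gpt-4o-mini"]
//   max_cost_usd = 0.02
#[derive(Deserialize)]
pub struct RoutingConfig {
    pub rule: Vec<Rule>,
}

#[derive(Deserialize)]
pub struct Rule {
    /// Task class this rule applies to, e.g. "staffing.sms_draft".
    pub task_type: String,
    /// Skip this rule if the prompt exceeds the token budget.
    pub max_prompt_tokens: Option<u32>,
    /// Primary model first, then fallbacks on 5xx / timeout.
    pub chain: Vec<String>,
    /// Per-request cost ceiling; the router rejects if the estimate exceeds it.
    pub max_cost_usd: Option<f64>,
}

impl RoutingConfig {
    pub fn load(path: &str) -> anyhow::Result<Self> {
        Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
    }

    /// First matching rule wins; None means no rule covers the task.
    pub fn route(&self, task_type: &str, prompt_tokens: u32) -> Option<&Rule> {
        self.rule.iter().find(|r| {
            r.task_type == task_type
                && r.max_prompt_tokens.map_or(true, |cap| prompt_tokens <= cap)
        })
    }
}
```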

Ships — routing:

  • crates/aibridge/src/routing.rs — rules engine (match on: task type, token budget, previous attempt failures, profile ID)
  • config/routing.toml — rules in TOML (human-editable, hot-reloadable)
  • crates/aibridge/src/providers/gemini.rs — generativelanguage.googleapis.com adapter
  • crates/aibridge/src/providers/claude.rs — api.anthropic.com adapter
  • Fallback chain support: if primary returns 5xx or times out, try next in chain
  • Cost gate: per-request budget + daily budget per-provider

Ships — observability (was lost, now restored):

  • Langfuse self-hosted via Docker Compose. Single source of truth for every LLM call trace: prompt / response / tokens / cost / latency / provider / fallback chain / profile used. UI at localhost:3000. Keys in /etc/lakehouse/secrets.toml.
  • crates/aibridge/src/langfuse.rs — thin fire-and-forget trace emitter. Every /v1/chat call spawns a background task that POSTs to langfuse/api/public/ingestion. Non-blocking: trace failures never affect the response (sketched after this list).
  • Langfuse → observer pipe — mcp-server/langfuse_bridge.ts or similar. Polls Langfuse's trace API at interval, forwards completed traces to observer :3800/event with source: "langfuse". KB now sees cost/latency deltas per model, not just outcome deltas.
  • Gitea MCP reconnect — the MCP server binary still installed at /home/profit/.bun/install/cache/gitea-mcp@0.0.10/ gets wired into mcp-server/index.ts tool registry. Agents can open PRs, comment on issues, list commits via named tools. Closes Phase 28's repo-ops gap.
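
A sketch of the fire-and-forget emitter, assuming tokio, reqwest, and tracing as dependencies; the ingestion payload shape is illustrative only, not Langfuse's exact batch schema:

```rust
use serde_json::json;

/// Emit one trace without blocking the /v1/chat response path.
/// Any failure is logged and dropped — tracing must never affect callers.
pub fn emit_trace(
    client: reqwest::Client,
    base_url: String,
    public_key: String,
    secret_key: String,
    trace: serde_json::Value,
) {
    tokio::spawn(async move {
        let result = client
            .post(format!("{base_url}/api/public/ingestion"))
            .basic_auth(public_key, Some(secret_key))
            .json(&json!({ "batch": [trace] })) // illustrative payload shape
            .send()
            .await;
        if let Err(err) = result {
            tracing::warn!(?err, "langfuse trace emit failed (ignored)");
        }
    });
}
```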

Gate:

  • Rule like "local models for simple JSON emitters, cloud for reasoning" fires correctly by task type
  • Primary fails → fallback provider hits, response still matches /v1/chat shape
  • Daily budget hit → subsequent requests return 429 with clear retry-at header
  • /v1/usage reports per-provider breakdown
  • Every /v1/chat call appears in Langfuse UI with correct prompt, response, latency, token count within 2 seconds of the request completing
  • Langfuse → observer pipe delivers trace deltas to KB: GET :3800/stats?source=langfuse shows non-zero count after a few scenarios run
  • Gitea MCP tools callable — list_prs, open_pr, comment_on_issue exposed in mcp-server/index.ts, verifiable via a quick agent scenario

Non-goals: Retrieval Profile split (Phase 41), Truth Layer (Phase 42). Langfuse self-hosted UI customization / SSO.

Risk: Medium. Multi-provider auth + cost tracking is cross-cutting; Langfuse adds 4-5 Docker containers (PostgreSQL, ClickHouse, Redis, web, worker). Mitigation: every provider call wrapped in a single dispatch() function so observability flows through one point; Langfuse Docker Compose is their supported deployment path, well-tested.


Phase 41 — Profile System Expansion (+ Phase 37 hot-swap async folded in)

Goal: The existing ModelProfile (Phase 17) becomes ExecutionProfile. Three new profile types land alongside. Profile activation is async — returns job_id, work runs in background (Phase 37 deliverable).
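
A sketch of the single-flight activation pattern listed under Ships below; the job-bookkeeping details (uuid dependency, state enum) are assumptions for illustration:

```rust
use std::sync::Mutex;
use uuid::Uuid;

#[derive(Clone, Debug)]
pub enum JobState { Running, Done(String), Failed(String) }

#[derive(Default)]
pub struct ActivationTracker {
    current: Mutex<Option<(Uuid, JobState)>>,
}

impl ActivationTracker {
    /// Single-flight guard: refuse a new activation while one is running.
    /// On success the handler returns HTTP 202 with this job_id.
    pub fn try_start(&self) -> Result<Uuid, &'static str> {
        let mut slot = self.current.lock().unwrap();
        if matches!(slot.as_ref(), Some((_, JobState::Running))) {
            return Err("activation already in progress");
        }
        let job_id = Uuid::new_v4();
        *slot = Some((job_id, JobState::Running));
        Ok(job_id)
    }

    /// Called by the background task when the swap finishes (or fails).
    pub fn finish(&self, job_id: Uuid, outcome: JobState) {
        let mut slot = self.current.lock().unwrap();
        if let Some((id, state)) = slot.as_mut() {
            if *id == job_id { *state = outcome; }
        }
    }

    /// Polled by GET /vectors/profile/jobs/{id}.
    pub fn status(&self, job_id: Uuid) -> Option<JobState> {
        self.current.lock().unwrap().as_ref()
            .filter(|(id, _)| *id == job_id)
            .map(|(_, state)| state.clone())
    }
}
```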

Ships:

  • crates/shared/src/profiles/ — ExecutionProfile, RetrievalProfile, MemoryProfile, ObserverProfile
  • crates/catalogd gains per-profile-type CRUD endpoints (/catalog/profiles/retrieval, etc.)
  • crates/vectord/src/activation.rs — ActivationTracker with background-job pattern (Phase 37 content)
  • POST /vectors/profile/{id}/activate returns 202 + job_id, polling at GET /vectors/profile/jobs/{id}
  • Single-flight guard: refuse new activation if one is pending/running
  • Backward compat: ModelProfile still loads, aliased to ExecutionProfile

Gate:

  • Activate a profile → returns 202 in <100ms → job completes in background → /vectors/profile/jobs/{id} shows progress + final report
  • tests/multi-agent/run_stress.ts Phase 3 (hot-swap stress) passes (was SKIPPED)
  • Retrieval + Memory + Observer profiles can be created independently of Execution profile

Non-goals: Truth Layer (Phase 42), validation (Phase 43), caller migration (Phase 44).

Risk: Medium. Schema change + async refactor. Mitigation: #[serde(default)] on all new fields; existing profiles load unchanged.


Phase 42 — Truth Layer (staffing rules first)

Goal: New crates/truth crate holds immutable task-class constraints. Served via /v1/context to router and observer. No layer can override truth. Staffing rules ship first; Terraform/Ansible rule shapes are scaffolded but unpopulated until the long-horizon phase.
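
A sketch of one staffing invariant as a truth-layer check — the client-blacklist gate plus the fill-count invariant — returning a structured error with a rule citation, which the router turns into the 422 described below. Names are illustrative:

```rust
use serde::Serialize;
use std::collections::HashSet;

#[derive(Serialize)]
pub struct RuleViolation {
    pub rule_id: &'static str, // cited back to the caller in the 422 body
    pub message: String,
}

pub struct FillProposal {
    pub client_id: String,
    pub target_count: usize,
    pub endorsed_worker_ids: Vec<String>,
}

pub struct TruthStore {
    /// (client_id, worker_id) pairs that must never be matched.
    pub client_blacklist: HashSet<(String, String)>,
}

impl TruthStore {
    /// The router calls this before any dispatch; a violation hard-fails the
    /// task with 422 + citation and burns zero cloud tokens.
    pub fn check_fill(&self, p: &FillProposal) -> Result<(), RuleViolation> {
        for worker_id in &p.endorsed_worker_ids {
            if self.client_blacklist.contains(&(p.client_id.clone(), worker_id.clone())) {
                return Err(RuleViolation {
                    rule_id: "staffing.client_blacklist",
                    message: format!("worker {worker_id} is blacklisted for client {}", p.client_id),
                });
            }
        }
        if p.endorsed_worker_ids.len() != p.target_count {
            return Err(RuleViolation {
                rule_id: "staffing.fill_count",
                message: format!(
                    "endorsed {} workers, contract requires {}",
                    p.endorsed_worker_ids.len(),
                    p.target_count
                ),
            });
        }
        Ok(())
    }
}
```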

Ships:

  • crates/truth/src/lib.rs — TruthStore with schema loading (TOML/YAML rules)
  • crates/truth/src/staffing.rs — staffing rule shapes:
    • Worker eligibility (active status, not blacklisted for client, geo match, role match, availability window)
    • Contract invariants (deadline present, role/count/city/state populated, budget_per_hour_max ≥ 0)
    • PII handling (redaction rules on fields tagged PII before any cloud call — covers existing Phase 10 sensitivity tags)
    • Client blacklist enforcement (auto-applied before any fill proposal)
    • Fill requirements (endorsed_names count matches target_count, no duplicate worker_ids within a single fill)
  • crates/truth/src/devops.rs — scaffold only: empty rule struct for Terraform/Ansible, populated in the long-horizon phase. Keeps the dispatcher signature stable so no refactor is needed later.
  • truth/ dir at repo root — rule files, versioned in git
  • /v1/context endpoint — returns applicable rules for a task class (staffing.fill, staffing.rescue, staffing.sms_draft, etc.)
  • Router consults truth before dispatching: if task violates a rule, hard-fail with structured error + rule citation (matches existing Phase 13 access-control pattern)

Gate:

  • Submit a fill proposal where a worker is client-blacklisted — router returns 422 + rule citation, no cloud tokens burned
  • Submit a fill with endorsed_names.length != target_count — 422 before dispatch
  • Observer cannot promote a correction that violates truth (rejected at router gate)
  • PII redaction verified: SSN / salary fields stripped from prompts before cloud calls
  • Truth reload is explicit (no file-watch hot reload in this phase)

Non-goals: Validation execution (Phase 43), policy learning / evolution (deferred), actual Terraform/Ansible rules (long-horizon phase).

Risk: Medium. Domain-specific rule enumeration takes discovery — start with a minimal rule set (5-10 staffing rules, derived from existing Phase 10-13 work) and grow organically as real fills surface edge cases.


Phase 43 — Validation Pipeline (staffing outputs first)

Goal: Staffing outputs run through schema / completeness / consistency / policy gates. Plug into Layer 5 execution loop — failure triggers observer-correction iteration. This is where the 0→85% pattern reproduces on real staffing tasks — the iteration loop with validation in place is what made small models successful.
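
A sketch of the Validator trait and the bounded generate → validate → correct loop, with assumed names for the generation and observer hooks; the real gateway loop will thread providers, usage accounting, and logging through these points:

```rust
pub struct Report { pub passed: bool, pub issues: Vec<String> }

pub enum Artifact { FillProposal(String), EmailDraft(String), SealedPlaybook(String) }

pub trait Validator {
    fn validate(&self, artifact: &Artifact) -> Report;
}

/// Bounded execution loop: generate, validate, ask the observer for a
/// correction on failure, retry. max_iterations=3 keeps cost bounded.
pub fn run_task(
    mut generate: impl FnMut(Option<&str>) -> Artifact, // correction hint -> new attempt
    observer_correct: impl Fn(&Report) -> String,        // failure report -> correction hint
    validator: &dyn Validator,
    max_iterations: usize,
) -> Result<Artifact, Report> {
    let mut hint: Option<String> = None;
    let mut last_report = Report { passed: false, issues: vec![] };
    for _ in 0..max_iterations {
        let artifact = generate(hint.as_deref());
        let report = validator.validate(&artifact);
        if report.passed {
            return Ok(artifact);
        }
        hint = Some(observer_correct(&report));
        last_report = report;
    }
    Err(last_report)
}
```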

Ships:

  • crates/validator/src/lib.rs — Validator trait: validate(artifact) -> Result<Report, ValidationError> + Artifact enum over output types
  • crates/validator/src/staffing/fill.rs — fill-proposal validator:
    • Schema compliance (propose_done shape matches {fills: [{candidate_id, name}]})
    • Completeness (endorsed count == target_count)
    • Worker existence (every candidate_id present in workers_500k via SQL lookup)
    • Status check (every worker has status=active, not_on_client_blacklist)
    • Geo/role match (worker city/state/role matches contract)
  • crates/validator/src/staffing/email.rs — generated email/SMS drafts (sketched after this list):
    • Schema (TO/BODY fields present)
    • Length (SMS ≤ 160 chars; email subject ≤ 78 chars)
    • PII absence (no SSN / salary leaked into outgoing text)
    • Worker-name consistency (name in message matches worker record)
  • crates/validator/src/staffing/playbook.rs — sealed playbook:
    • Operation format (fill: Role xN in City, ST)
    • endorsed_names non-empty, ≤ target_count × 2
    • fingerprint populated (Phase 25 validity window requirement)
  • crates/validator/src/devops.rs — scaffold only: stubbed Terraform/Ansible validators (terraform validate, ansible-lint) for the long-horizon phase
  • Task execution loop in gateway: generate → validate → if fail, observer correction + retry (bounded by max_iterations=3)
  • Validation results logged to observer (data/_observer/ops.jsonl) + KB (data/_kb/outcomes.jsonl)
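
A sketch of the email/SMS draft checks from the list above (field presence, length caps, PII absence). The SSN pattern here is a simplistic placeholder; the real PII rules come from the Phase 10 sensitivity tags:

```rust
pub struct Draft {
    pub to: String,
    pub subject: Option<String>, // None for SMS
    pub body: String,
}

/// Returns a list of validation issues; empty means the draft passes.
pub fn validate_draft(d: &Draft, is_sms: bool) -> Vec<String> {
    let mut issues = Vec::new();
    if d.to.trim().is_empty() {
        issues.push("missing TO field".into());
    }
    if is_sms && d.body.chars().count() > 160 {
        issues.push(format!("SMS body is {} chars, limit 160", d.body.chars().count()));
    }
    if let Some(subject) = &d.subject {
        if subject.chars().count() > 78 {
            issues.push("email subject exceeds 78 chars".into());
        }
    }
    // Naive SSN-shaped pattern check (###-##-####); real PII rules are tag-driven.
    let looks_like_ssn = d.body.split_whitespace().any(|w| {
        let parts: Vec<&str> = w.split('-').collect();
        parts.len() == 3
            && parts[0].len() == 3 && parts[1].len() == 2 && parts[2].len() == 4
            && parts.iter().all(|p| p.chars().all(|c| c.is_ascii_digit()))
    });
    if looks_like_ssn {
        issues.push("possible SSN leaked into outgoing text".into());
    }
    issues
}
```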

Gate:

  • Generate a fill proposal → validator catches a phantom worker_id → observer + cloud rescue propose correction → retry → green. This reproduces the 0→85% pattern on the live staffing pipeline.
  • /v1/usage shows iteration count per task, provider fallback chain, and tokens-per-iteration. Cost attribution per task class visible.
  • Reproduces the 14× citation-lift finding from Phase 19 refinement on similar geos once the validation gates are in place.

Non-goals: Caller migration (Phase 44), Terraform/Ansible wired validation (long-horizon).

Risk: Medium. Validation shapes have to match actual executor outputs; mitigation is using real scenario runs as test fixtures (we have ~100 of them in tests/multi-agent/playbooks/).


Phase 44 — Caller Migration + Direct-Provider Deprecation

Goal: Every internal LLM caller routes through /v1/chat. Direct sidecar / direct Ollama / direct OpenAI calls are removed or explicitly deprecated with a warning.
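
A sketch of what the thin client looks like after migration — every caller builds an OpenAI-shape body and POSTs to the gateway, never to a provider directly. Assumes reqwest and anyhow; the response field access follows the OpenAI shape but the real struct may be typed rather than raw JSON:

```rust
use serde_json::{json, Value};

pub struct AiClient {
    http: reqwest::Client,
    base_url: String, // e.g. "http://localhost:3100"
}

impl AiClient {
    pub fn new(base_url: impl Into<String>) -> Self {
        Self { http: reqwest::Client::new(), base_url: base_url.into() }
    }

    /// All LLM traffic funnels through /v1/chat so routing, truth gating,
    /// usage accounting, and tracing see every call.
    pub async fn chat(&self, model: &str, user_prompt: &str) -> anyhow::Result<String> {
        let body = json!({
            "model": model,
            "messages": [{ "role": "user", "content": user_prompt }],
        });
        let resp: Value = self.http
            .post(format!("{}/v1/chat", self.base_url))
            .json(&body)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        // OpenAI-compatible shape: choices[0].message.content
        Ok(resp["choices"][0]["message"]["content"]
            .as_str()
            .unwrap_or_default()
            .to_string())
    }
}
```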

Ships:

  • aibridge::AiClient becomes a thin /v1/chat client (was direct-to-sidecar)
  • crates/vectord::agent (autotune): routes through /v1
  • crates/vectord::autotune: routes through /v1
  • tests/multi-agent/agent.ts::generate(): routes through /v1
  • bot/propose.ts: routes through /v1 (already proposed as Phase 38's test consumer, formalized here)
  • Lint rule / grep pre-commit hook: no fetch.*:3200/generate outside the provider adapters

Gate:

  • grep -r "localhost:3200/generate\|/api/generate" returns only adapter files + deprecation shims
  • /v1/usage accounts for every LLM call in the system within a 1-minute window after hitting a fresh scenario
  • Full scenario passes end-to-end without any caller bypassing /v1/*

Non-goals: New features. This phase is purely mechanical migration.

Risk: Low. Mechanical. Tests catch regressions.


Phase 45 — Doc-drift detection + context7 integration

Goal: Playbooks know which external docs they were written against. When those docs change (Docker adds a feature, npm lib goes major, Terraform renames a resource), the playbook is automatically flagged. Small models never run confidently-outdated procedures — the drift signal reaches them before the next execution does.

Why this phase exists at all: The 0→85% thesis depends on the hyperfocus lane staying valid. External doc drift invalidates the lane silently — popular playbooks can compound the wrong way, accumulating boost while growing more wrong. Phase 25 already retires playbooks on internal schema drift; Phase 45 is the same mechanism against external doc drift. This is the completion of the learning loop, not an optional add-on.
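
A sketch of the DocRef record and the two predicates the rest of the phase hangs off: drift comparison against the version reported by the context7 bridge, and the boost-exclusion rule. The bridge response field and the wrapper struct below are assumptions for illustration:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Clone)]
pub struct DocRef {
    pub tool: String,         // e.g. "docker"
    pub version_seen: String, // version the playbook was written against
    pub snippet_hash: Option<String>,
    pub source_url: Option<String>,
    pub seen_at: DateTime<Utc>,
}

// Illustrative wrapper for the drift-related PlaybookEntry fields.
#[derive(Serialize, Deserialize, Default)]
pub struct PlaybookDocState {
    #[serde(default)]
    pub doc_refs: Vec<DocRef>,                 // empty for pre-Phase-45 entries
    #[serde(default)]
    pub doc_drift_flagged_at: Option<String>,  // set when a drift check flags this entry
    #[serde(default)]
    pub doc_drift_reviewed_at: Option<String>, // set by the human /resolve step
}

/// Verdict for one doc_refs entry after asking the context7 bridge
/// (GET /docs/:tool/version) what the current version is.
pub fn drifted(doc_ref: &DocRef, version_current: &str) -> bool {
    doc_ref.version_seen != version_current
}

/// Flagged-but-unreviewed entries drop out of the boost pool — the same
/// exclusion rule already applied to retired and superseded playbooks.
pub fn excluded_from_boost(state: &PlaybookDocState) -> bool {
    state.doc_drift_flagged_at.is_some() && state.doc_drift_reviewed_at.is_none()
}
```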

Ships:

  • shared::types::DocRef — { tool: String, version_seen: String, snippet_hash: Option<String>, source_url: Option<String>, seen_at: DateTime<Utc> }
  • PlaybookEntry.doc_refs: Vec<DocRef> — #[serde(default)] so pre-Phase-45 entries load as empty vec
  • /vectors/playbook_memory/seed + /revise accept doc_refs in the request body
  • /vectors/playbook_memory/doc_drift/check/{id} — manual drift check: looks up each doc_refs[] entry via the context7 bridge, returns per-tool {version_seen, version_current, drifted: bool} plus overall verdict
  • /vectors/playbook_memory/doc_drift/scan — batch scan across all active playbooks (scheduled path for Phase 45.2)
  • mcp-server/context7_bridge.ts — Bun HTTP bridge. Exposes GET /docs/:tool/version + GET /docs/:tool/:version/diff?since=X against the installed context7 MCP plugin. Gateway calls this over localhost.
  • PlaybookMemory::compute_boost_for_filtered_with_role — excludes entries where doc_drift_flagged_at.is_some() && doc_drift_reviewed_at.is_none() (same rule as retired + superseded)
  • Overview model synthesis writes data/_kb/doc_drift_corrections.jsonl per detected drift: {playbook_id, tool, version_seen, version_current, diff_summary, recommended_action, generated_at}
  • Human-in-the-loop re-seal path: /vectors/playbook_memory/doc_drift/resolve/{id} — marks reviewed, optionally triggers revise_entry if procedure changed

Gate:

  • Seal a playbook referencing Docker 24.x → doc_refs captured. Bump Docker version behind the scenes → /doc_drift/check/{id} returns drifted: true, from: 24.0.7, to: 25.0.1, summary: "...". The boosted playbook count on next /vectors/hybrid query drops by 1 (drift-flagged skipped).
  • doc_drift_corrections.jsonl contains the overview model's synthesis for the drift with at least: summary of change, recommended action, cost/impact estimate.
  • Human calls /doc_drift/resolve/{id} after reviewing → playbook returns to active boost pool (or supersedes via Phase 27 if procedure materially changed).
  • Unit tests: DocRef serde default (legacy entries load as empty), drift check against mocked context7 bridge, boost exclusion when drifted+unreviewed.

Non-goals (explicit):

  • Automatic re-seal without human review. Drift-detection → flag, not silent rewrite.
  • Cross-playbook propagation of one drift diagnosis. Each playbook reviewed individually (aggregation later if warranted).
  • Generating the updated procedure. T3 suggests; human or separate bot (see bot/) writes.

Risk: Medium. The context7 bridge is new infrastructure (Bun ↔ context7 MCP plugin ↔ HTTP shape for gateway consumption). Mitigation: context7 plugin is already installed; its MCP tools return structured JSON; the bridge is thin adapter code. Start with single-tool drift check (Docker) before broadening.


Long-horizon domains (not in current phase sequence)

The architecture was drafted with DevOps execution (Terraform, Ansible) as the eventual target. That remains aspirational, not current scope — we don't start wiring terraform validate / ansible-lint until the staffing domain proves the six-layer architecture at scale.

What "proves at scale" means concretely:

  • Phases 38-44 all shipped against staffing, green tests
  • Live staffing pipeline handles multiple concurrent contracts with emails + SMS + indexed playbooks via /v1/*
  • Observed iteration success lift (the 0→85% pattern) reproduced on varied staffing scenarios, not just the original proof-of-concept
  • Token + cost accounting stable across provider fallback chains under real load
  • Truth Layer rules prevent real fill errors before cloud burn (not just theoretical)

When staffing hits that bar, the DevOps domain lights up by:

  • Populating crates/truth/src/devops.rs with real Terraform/Ansible rule shapes
  • Populating crates/validator/src/devops.rs with terraform validate / ansible-lint shell-out
  • Adding DevOps task classes to /v1/context rule lookup
  • No architectural changes needed — the dispatcher, router, and execution loop stay identical.

Other candidate long-horizon domains (same pattern):

  • Code generation tasks (validation via cargo check / bun test)
  • SQL query generation (validation via EXPLAIN + schema compliance)
  • Data pipeline definitions (validation via lineage check + schema compliance)

None of these are in the current roadmap. Staffing first, production-proven, then expand.


1. Purpose

Design and implement a universal AI control-plane API that enables:

  • deterministic high-stakes task execution — the immediate domain is staffing fills (contracts, workers, emails, SMS) at scale; the same architecture extends later to DevOps (Terraform, Ansible) without redesign
  • iterative capability amplification via observer loops
  • hybrid local + cloud model orchestration
  • structured knowledge + memory + playbook reuse
  • controlled improvement over time through validated iteration

The system prioritizes validated pipeline success over raw model intelligence.

Current scope — staffing at scale

The architecture must make the already-built staffing substrate reliably answer millions of inputs: pull real data, graph it across contracts, handle multiple concurrent contracts, index emails + SMS + playbooks via the hybrid SQL+vector method, and get faster and better each iteration via the feedback loops (Phase 19 playbook boost, Phase 22 KB pathway recommender, Phase 24 observer, Phase 26 Mem0 upsert).

DevOps is an eventual domain — see §Long-horizon domains.

2. Core Objectives

2.1 Functional Goals

  • Provide a single universal API for all AI interactions
  • Support multi-provider routing (local, flat-rate, token-based)
  • Enable iterative execution loops with observer correction
  • Store and reuse successful execution playbooks
  • Integrate: S3-based knowledge storage, LanceDB retrieval/indexing, Mem0 memory layer, MCP tool ecosystem

2.2 Non-Functional Goals

  • Deterministic behavior under constrained execution
  • Full observability and cost accounting
  • Safe DevOps execution (no uncontrolled mutation)
  • Profile-driven routing and execution
  • Reproducibility of successful runs

3. System Architecture

3.1 Layer Overview

Layer 1 — Universal API

Single entry point for all applications. Endpoints:

  • /v1/chat
  • /v1/respond
  • /v1/tools
  • /v1/context
  • /v1/usage
  • /v1/sessions

All programs must use this layer. No direct provider calls allowed.

Layer 2 — Routing & Policy Engine

Responsibilities: provider selection, fallback logic, cost gating, premium access control, profile enforcement. Routing based on: task type, constraints, execution profile, system health.

Layer 3 — Provider Adapter Layer

Normalizes all providers: Ollama (local), OpenRouter, Gemini (direct), Claude (direct or routed), future providers. Guarantee: no provider-specific logic leaks upward.

Layer 4 — Knowledge & Memory Plane

  • Knowledge (S3 + LanceDB): raw documents, processed chunks, embeddings, index profiles
  • Memory (Mem0): extracted facts, entity-linked memory, session-aware retrieval
  • Playbooks: successful execution traces, reusable patterns, correction strategies

Layer 5 — Execution Loop

Each task runs through: Retrieval → Planning → Generation → Validation → Observer feedback → Iteration (if needed).

Layer 6 — Observability & Accounting

Every request logs: tokens (input/output), cost, latency, provider, fallback chain, profile used, iteration delta.

4. Execution Model

4.1 Iterative Loop

Each task follows: Attempt → Validate → Observe → Adjust → Retry

Constraints:

  • max iterations (default: 3)
  • minimum improvement threshold
  • cost ceiling per task

4.2 Observer Role

Observer can: analyze failure, suggest corrections, recommend profile changes. Observer cannot: modify truth layer, auto-promote changes, override constraints.

4.3 Cloud Escalation

Cloud models (Gemini, Claude) are used for: structural correction, reasoning gaps, complex decomposition. They are not used for: brute-force retries, bulk execution.

5. Profile System

5.1 Profile Types

  • Retrieval Profile — chunking strategy, embedding method, reranking rules
  • Memory Profile — memory weighting, context injection rules
  • Execution Profile — allowed providers, tool access, risk level
  • Observer Profile — mutation aggressiveness, iteration strategy

5.2 Profile Constraints

  • only one major profile change per iteration
  • profiles must produce measurable deltas
  • promotion requires repeated success

6. Truth Layer (Critical)

Defines non-negotiable constraints:

  • Terraform rules
  • Ansible structure requirements
  • security policies
  • organization standards

Rules:

  • immutable at runtime
  • referenced by all layers
  • cannot be overridden by observer

7. Playbook System

7.1 Playbook Definition

Each successful run produces: task class, context used, steps executed, tools used, output artifacts, validation results, cost/latency, success score.

7.2 Playbook Lifecycle

  • created on success
  • reused for similar tasks
  • decayed over time
  • pruned if ineffective

8. Validation System

All DevOps outputs must pass: syntax validation, linting, dry-run, policy compliance. Failure → iteration continues or task fails.

9. MCP Integration

MCP servers provide: tools, external data, execution capabilities. All MCP outputs must be: normalized, validated, schema-compliant. No direct MCP output reaches the model.

10. Token Accounting & Budget Control

Each request tracks: input tokens, output tokens, retries, fallback cost. Policies: premium providers gated, cost ceilings enforced, per-task budget limits.

11. Failure Handling

Recoverable failures: bad decomposition, missing steps, weak retrieval → observer + iteration.

Hard failures: missing truth data, invalid task classification, unsafe execution → termination + error report.

12. Success Criteria

A task is successful only if:

  • output is valid
  • all validators pass
  • no policy violations
  • result is reproducible
  • cost within limits

13. Key Risks & Mitigations

  • Observer drift → bounded authority, confidence tracking
  • Memory poisoning → validation layer, memory weighting
  • Cost explosion → token accounting, iteration caps
  • Retrieval errors → post-retrieval validation, profile tuning