# Multi-Agent Pipeline Architecture

## Overview

This document describes the architecture for the production multi-agent pipeline system, including Vault token management, agent lifecycle, error handling, and observability integration.

**Document Date:** 2026-01-24
**Status:** IMPLEMENTED

---

## 1. Pipeline Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                       PIPELINE LIFECYCLE                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌───────────────┐     ┌───────────┐
│  SPAWN  │────▶│ RUNNING │────▶│ REPORT  │────▶│ ORCHESTRATING │────▶│ COMPLETED │
└─────────┘     └─────────┘     └─────────┘     └───────────────┘     └───────────┘
     │               │               │                  │                   │
     │               │               │                  │                   │
┌────▼────┐     ┌────▼────┐     ┌────▼────┐       ┌─────▼─────┐       ┌─────▼─────┐
│ Issue   │     │ Agent   │     │ Report  │       │ ALPHA+BETA│       │ Consensus │
│ Vault   │     │ Status  │     │ Ready   │       │ Parallel  │       │ Achieved  │
│ Token   │     │ Updates │     │         │       │           │       │           │
└─────────┘     └─────────┘     └─────────┘       └───────────┘       └───────────┘
                                                        │
                                                ┌───────▼───────┐
                                                │ Error/Stuck?  │
                                                └───────┬───────┘
                                                        │ YES
                                                ┌───────▼───────┐
                                                │  SPAWN GAMMA  │
                                                │ (Diagnostic)  │
                                                └───────────────┘
```

---

## 2. Vault Token Management

### 2.1 Token Lifecycle

Each pipeline receives a dedicated, long-lived Vault token that persists through the entire orchestration:

```
Pipeline Start
      │
      ▼
┌────────────────────────────────────────┐
│ 1. Request Pipeline Token from Vault   │
│    - AppRole: pipeline-orchestrator    │
│    - TTL: 2 hours (renewable)          │
│    - Policies: pipeline-agent          │
└────────────────────────────────────────┘
      │
      ▼
┌────────────────────────────────────────┐
│ 2. Store Token in Redis                │
│    Key: pipeline:{id}:vault_token      │
│    + Encrypted with transit key        │
└────────────────────────────────────────┘
      │
      ▼
┌────────────────────────────────────────┐
│ 3. Pass Token to All Agents            │
│    - ALPHA, BETA, GAMMA inherit        │
│    - Token renewal every 30 min        │
└────────────────────────────────────────┘
      │
      ▼
┌────────────────────────────────────────┐
│ 4. Observability Monitors Token        │
│    - Can revoke for policy violation   │
│    - Logs all token usage              │
└────────────────────────────────────────┘
      │
      ▼
┌────────────────────────────────────────┐
│ 5. Token Revoked on Completion         │
│    - Or on error threshold breach      │
└────────────────────────────────────────┘
```

### 2.2 Token Policies

**Pipeline Agent Policy (`pipeline-agent.hcl`):**

```hcl
# Read API keys for OpenRouter
path "secret/data/api-keys/*" {
  capabilities = ["read"]
}

# Read service credentials (DragonflyDB)
path "secret/data/services/*" {
  capabilities = ["read"]
}

# Agent-specific secrets
path "secret/data/agents/{{identity.entity.aliases.auth_approle.metadata.pipeline_id}}/*" {
  capabilities = ["read", "create", "update"]
}

# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}
```

### 2.3 Token Revocation Triggers

Observability can revoke a pipeline token mid-run for:

| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL violation | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |

---

## 3. Report → Orchestration Transition

### 3.1 Automatic Trigger

When a pipeline reaches REPORT phase with `auto_continue=true`:

```typescript
async function checkPipelineCompletion(pipelineId: string) {
  // ... existing completion check ...

  if (autoContinue && anySuccess) {
    // Trigger OpenRouter orchestration
    triggerOrchestration(pipelineId, taskId, objective, model, timeout);
  }
}
```

### 3.2 Manual Trigger

API endpoint for manual orchestration trigger:

```
POST /api/pipeline/continue
Body: { pipeline_id, model?, timeout? }
```

### 3.3 Orchestration Process

1. **Status Update**: Pipeline status → `ORCHESTRATING`
2. **Agent Spawn**: Launch ALPHA and BETA agents in parallel
3. **WebSocket Broadcast**: Real-time status to UI
4. **Monitor Loop**: Check for stuck/conflict conditions
5. **GAMMA Spawn**: If thresholds exceeded, spawn mediator
6. **Consensus**: Drive to final agreement
7. **Completion**: Status → `COMPLETED` or `FAILED`

---

## 4. Agent Multiplication and Handoff

### 4.1 Agent Roles

| Agent | Role | Spawn Condition |
|-------|------|-----------------|
| ALPHA | Research & Analysis | Always (initial) |
| BETA | Implementation & Synthesis | Always (initial) |
| GAMMA | Mediator & Resolver | On error/stuck/conflict/complexity |

### 4.2 Spawn Conditions

```typescript
const SPAWN_CONDITIONS = {
  STUCK: {
    threshold: 30, // seconds of inactivity
    description: "Spawn GAMMA when agents stuck"
  },
  CONFLICT: {
    threshold: 3, // unresolved conflicts
    description: "Spawn GAMMA for mediation"
  },
  COMPLEXITY: {
    threshold: 0.8, // complexity score
    description: "Spawn GAMMA for decomposition"
  },
  SUCCESS: {
    threshold: 1.0, // task completion
    description: "Spawn GAMMA for validation"
  }
};
```

### 4.3 Handoff Protocol

When GAMMA spawns, it receives:

- Full blackboard state (problem, solutions, progress)
- Message log from ALPHA/BETA
- Spawn reason and context
- Authority to direct other agents

```typescript
// GAMMA handoff message
{
  type: "HANDOFF",
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT",
    tasks?: string[],
    diagnosis?: string,
    recommended_actions?: string[]
  }
}
```

### 4.4 Agent Lifecycle States

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐    ┌───────────┐
│ CREATED  │───▶│   BUSY   │───▶│ WAITING  │───▶│ HANDED-OFF│───▶│ SUCCEEDED │
└──────────┘    └──────────┘    └──────────┘    └───────────┘    └───────────┘
     │                                                   │
     │              ┌──────────┐                         │
     └─────────────▶│  ERROR   │◀────────────────────────┘
                    └──────────┘
```

UI displays each agent with:

- Current state (color-coded)
- Progress percentage
- Current task description
- Message count (sent/received)
- Error count

---

## 5. Observability Integration

### 5.1 Real-Time Metrics

All metrics stored in DragonflyDB with WebSocket broadcast:

```typescript
// Metrics keys
`metrics:${taskId}` → {
  total_messages: number,
  direct_messages: number,
  blackboard_writes: number,
  blackboard_reads: number,
  conflicts_detected: number,
  conflicts_resolved: number,
  gamma_spawned: boolean,
  gamma_spawn_reason: string,
  performance_score: number
}
```

### 5.2 Error Loop Handling

```
    Error Detected
           │
           ▼
┌─────────────────────┐
│ Log to bug_watcher  │
│ (SQLite + Redis)    │
└─────────────────────┘
           │
           ▼
┌─────────────────────┐     ┌─────────────────────┐
│ Check Error Budget  │────▶│ Budget Exceeded?    │
└─────────────────────┘     └─────────────────────┘
                                       │ YES
                                       ▼
                            ┌─────────────────────┐
                            │ Spawn Diagnostic    │
                            │ Pipeline with       │
                            │ Error Context       │
                            └─────────────────────┘
```

### 5.3 Status Broadcasting

WebSocket events broadcast to UI:

| Event | Payload | Trigger |
|-------|---------|---------|
| `pipeline_started` | pipeline_id, task_id | Pipeline spawn |
| `agent_status` | agent_id, status | Any status change |
| `agent_message` | agent, message | Agent log output |
| `consensus_event` | proposal_id, votes | Consensus activity |
| `orchestration_started` | model, agents | Orchestration begin |
| `orchestration_complete` | status, metrics | Orchestration end |
| `error_threshold` | pipeline_id, errors | Error budget breach |
| `token_revoked` | pipeline_id, reason | Vault revocation |

### 5.4 Structured Handoff Reports

On error threshold breach, generate handoff report:

```json
{
  "report_type": "error_handoff",
  "pipeline_id": "pipeline-abc123",
  "timestamp": "2026-01-24T22:30:00Z",
  "summary": {
    "total_errors": 6,
    "error_types": ["api_timeout", "validation_failure"],
    "affected_agents": ["ALPHA"],
    "last_successful_checkpoint": "ckpt-xyz"
  },
  "context": {
    "task_objective": "...",
    "progress_at_failure": 0.45,
    "blackboard_snapshot": {...}
  },
  "recommended_actions": [
    "Reduce API call rate",
    "Split task into smaller subtasks"
  ]
}
```

---

## 6. UI Components

### 6.1 Pipeline Status Panel

```
┌──────────────────────────────────────────────────────────────────┐
│ Pipeline: pipeline-abc123                         [ORCHESTRATING]│
├──────────────────────────────────────────────────────────────────┤
│ Objective: Design distributed event-driven architecture...       │
│ Model: anthropic/claude-sonnet-4                                 │
│ Started: 2026-01-24 22:15:00 UTC                                 │
├──────────────────────────────────────────────────────────────────┤
│ AGENTS                                                           │
│ ┌─────────┐  ┌─────────┐  ┌─────────┐                            │
│ │  ALPHA  │  │  BETA   │  │  GAMMA  │                            │
│ │ ████░░░ │  │ ██████░ │  │ ░░░░░░░ │                            │
│ │   45%   │  │   75%   │  │ PENDING │                            │
│ │ WORKING │  │ WAITING │  │         │                            │
│ └─────────┘  └─────────┘  └─────────┘                            │
├──────────────────────────────────────────────────────────────────┤
│ METRICS                                                          │
│ Messages: 24 │ Conflicts: 1/1 resolved │ Score: 72%              │
├──────────────────────────────────────────────────────────────────┤
│ RECENT ACTIVITY                                                  │
│ [22:16:32] ALPHA: Generated 3 initial proposals                  │
│ [22:16:45] BETA: Evaluating proposal prop-a1b2c3                 │
│ [22:17:01] BETA: Proposal accepted with score 0.85               │
└──────────────────────────────────────────────────────────────────┘
```

### 6.2 Agent Lifecycle Cards

Each agent displays:

- Role badge (ALPHA/BETA/GAMMA)
- Status indicator with color
- Progress bar
- Current task label
- Message counters
- Error indicator (if any)

---

## 7. Implementation Checklist

### Backend (server.ts)

- [x] Pipeline spawn with auto_continue
- [x] Orchestration trigger after REPORT
- [x] Agent process spawning (Python + Bun)
- [x] WebSocket status broadcasting
- [x] Diagnostic agent (GAMMA) spawning on error
- [x] Vault token issuance per pipeline
- [x] Token renewal loop (every 30 minutes)
- [x] Observability-driven revocation
- [x] Error threshold monitoring
- [x] Structured handoff reports

### Coordination (coordination.ts)

- [x] Blackboard shared memory
- [x] MessageBus point-to-point
- [x] AgentStateManager
- [x] SpawnController conditions
- [x] MetricsCollector
- [x] Token integration via pipeline context
- [x] Error budget tracking

### Orchestrator (orchestrator.ts)

- [x] Multi-agent initialization
- [x] GAMMA spawn on conditions
- [x] Consensus checking
- [x] Performance analysis
- [x] Receive pipeline ID from environment
- [x] Error reporting to observability

### UI/API

- [x] Pipeline list view
- [x] Real-time log streaming
- [x] Agent lifecycle status API
- [x] Pipeline metrics endpoint
- [x] Error budget API
- [x] Token status/revoke/renew APIs
- [x] Handoff report generation
- [x] Diagnostic pipeline spawning

---

## 8. API Endpoints

### Pipeline Control

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/spawn` | POST | Spawn pipeline with auto_continue |
| `/api/pipeline/continue` | POST | Manually trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/logs` | GET | Get pipeline logs |
| `/api/pipeline/metrics` | GET | Get pipeline metrics |

### Agent Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/agents` | GET | List all agents |
| `/api/agents/:id/status` | GET | Get agent status |
| `/api/agents/:id/messages` | GET | Get agent message log |

### Observability

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/observability/errors` | GET | Get error summary |
| `/api/observability/handoff` | POST | Generate handoff report |
| `/api/observability/revoke` | POST | Trigger token revocation |

---

*Last updated: 2026-01-24*
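
### Example: Building a Manual Trigger Request

As a rough illustration of the manual trigger listed under Pipeline Control, the sketch below builds (without sending) a request for `POST /api/pipeline/continue`. The body fields come from section 3.2. The helper name `buildContinueRequest`, the `http://localhost:3000` base URL, and the JSON `Content-Type` header are assumptions made for this example, not documented server behavior.

```typescript
// Body shape from section 3.2. The units of `timeout` are not specified there.
interface ContinueBody {
  pipeline_id: string;
  model?: string;
  timeout?: number;
}

interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: assembles the request but does not send it.
function buildContinueRequest(baseUrl: string, body: ContinueBody): PreparedRequest {
  if (body.pipeline_id.trim() === "") {
    throw new Error("pipeline_id is required");
  }
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/api/pipeline/continue`,
    init: {
      method: "POST",
      // Assumption: the server accepts a JSON-encoded body.
      headers: { "Content-Type": "application/json" },
      // Optional fields left undefined are omitted by JSON.stringify.
      body: JSON.stringify(body),
    },
  };
}

// Assumed base URL and example identifiers, for illustration only.
const req = buildContinueRequest("http://localhost:3000/", {
  pipeline_id: "pipeline-abc123",
  model: "anthropic/claude-sonnet-4",
});
console.log(req.url, req.init.body);
```

The returned `url` and `init` can be passed to `fetch(url, init)`. Keeping construction separate from sending makes the payload easy to inspect before it reaches the server.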