# Multi-Agent Pipeline Architecture
## Overview
This document describes the architecture of the production multi-agent pipeline system, including Vault token management, agent lifecycle, error handling, and observability integration.

**Document Date:** 2026-01-24

**Status:** IMPLEMENTED

---
## 1. Pipeline Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ PIPELINE LIFECYCLE │
└─────────────────────────────────────────────────────────────────┘
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌───────────────┐ ┌───────────┐
│ SPAWN │────▶│ RUNNING │────▶│ REPORT │────▶│ ORCHESTRATING │────▶│ COMPLETED │
└─────────┘ └─────────┘ └─────────┘ └───────────────┘ └───────────┘
│ │ │ │ │
│ │ │ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Issue │ │ Agent │ │ Report │ │ ALPHA+BETA│ │ Consensus │
│ Vault │ │ Status │ │ Ready │ │ Parallel │ │ Achieved │
│ Token │ │ Updates │ │ │ │ │ │ │
└─────────┘ └─────────┘ └─────────┘ └───────────┘ └───────────┘
┌───────▼───────┐
│ Error/Stuck? │
└───────┬───────┘
│ YES
┌───────▼───────┐
│ SPAWN GAMMA │
│ (Diagnostic) │
└───────────────┘
```
---
## 2. Vault Token Management
### 2.1 Token Lifecycle
Each pipeline receives a dedicated Vault token (2-hour TTL, renewed every 30 minutes) that persists through the entire orchestration:
```
Pipeline Start
┌──────────────────────────────────────┐
│ 1. Request Pipeline Token from Vault │
│    - AppRole: pipeline-orchestrator  │
│    - TTL: 2 hours (renewable)        │
│    - Policies: pipeline-agent        │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ 2. Store Token in Redis              │
│    Key: pipeline:{id}:vault_token    │
│    + Encrypted with transit key      │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ 3. Pass Token to All Agents          │
│    - ALPHA, BETA, GAMMA inherit      │
│    - Token renewal every 30 min      │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ 4. Observability Monitors Token      │
│    - Can revoke for policy violation │
│    - Logs all token usage            │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│ 5. Token Revoked on Completion       │
│    - Or on error threshold breach    │
└──────────────────────────────────────┘
```
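
The timings in the lifecycle above can be expressed as a few small helpers. This is a minimal sketch, not the code in `server.ts`: the function names are hypothetical, timestamps are assumed to be epoch milliseconds, and it assumes each successful renewal restores the full 2-hour TTL.

```typescript
// Values come from the lifecycle diagram; helper names are illustrative.
const TOKEN_TTL_MS = 2 * 60 * 60 * 1000;   // TTL: 2 hours (renewable)
const RENEW_INTERVAL_MS = 30 * 60 * 1000;  // token renewal every 30 min

// Storage key for the (transit-encrypted) pipeline token.
function vaultTokenKey(pipelineId: string): string {
  return `pipeline:${pipelineId}:vault_token`;
}

// True once a full renewal interval has elapsed since the last renewal.
function renewalDue(lastRenewedAt: number, now: number): boolean {
  return now - lastRenewedAt >= RENEW_INTERVAL_MS;
}

// True if the token went a full TTL without renewal and must be re-issued.
function tokenExpired(lastRenewedAt: number, now: number): boolean {
  return now - lastRenewedAt >= TOKEN_TTL_MS;
}
```

Renewing at a quarter of the TTL leaves room for several failed renewal attempts before the token actually lapses.
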
### 2.2 Token Policies
**Pipeline Agent Policy (`pipeline-agent.hcl`):**
```hcl
# Read API keys for OpenRouter
path "secret/data/api-keys/*" {
  capabilities = ["read"]
}

# Read service credentials (DragonflyDB)
path "secret/data/services/*" {
  capabilities = ["read"]
}

# Agent-specific secrets
# Note: Vault policy templating expects a mount accessor in this position;
# `auth_approle` stands in for the AppRole accessor shown by `vault auth list`.
path "secret/data/agents/{{identity.entity.aliases.auth_approle.metadata.pipeline_id}}/*" {
  capabilities = ["read", "create", "update"]
}

# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}
```
### 2.3 Token Revocation Triggers
Observability can revoke a pipeline token mid-run under any of the following conditions:
| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL violation | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |
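
The table above can be read as a single decision function. The sketch below is illustrative only: the type and field names are hypothetical, and the evaluation order (policy violation first, resource abuse last) is an assumption the table does not state. For resource abuse it returns only the first step (rate limiting); the later escalation to revocation is not modeled.

```typescript
type RevocationAction =
  | "NONE"
  | "REVOKE_AND_SPAWN_DIAGNOSTIC"
  | "REVOKE_AGENT_TOKEN"
  | "FULL_REVOCATION"
  | "RATE_LIMIT";

// Hypothetical snapshot of what observability has measured for one pipeline.
interface PipelineSample {
  errorsPerMinute: number;
  secondsSinceProgress: number;
  criticalViolation: boolean;
  apiCallsPerMinute: number;
}

// Thresholds are taken from the table; the ordering is an assumption.
function revocationAction(s: PipelineSample): RevocationAction {
  if (s.criticalViolation) return "FULL_REVOCATION";
  if (s.errorsPerMinute > 5) return "REVOKE_AND_SPAWN_DIAGNOSTIC";
  if (s.secondsSinceProgress > 60) return "REVOKE_AGENT_TOKEN";
  if (s.apiCallsPerMinute > 100) return "RATE_LIMIT";
  return "NONE";
}
```
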
---
## 3. Report → Orchestration Transition
### 3.1 Automatic Trigger
When a pipeline reaches the REPORT phase with `auto_continue=true` and at least one agent has succeeded, orchestration is triggered automatically:
```typescript
async function checkPipelineCompletion(pipelineId: string) {
  // ... existing completion check ...
  if (autoContinue && anySuccess) {
    // Trigger OpenRouter orchestration
    triggerOrchestration(pipelineId, taskId, objective, model, timeout);
  }
}
```
### 3.2 Manual Trigger
API endpoint for manual orchestration trigger:
```
POST /api/pipeline/continue
Body: { pipeline_id, model?, timeout? }
```
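
A handler for this endpoint would typically validate the body before triggering orchestration. The following is a sketch under assumptions: the source does not give field types, so `model` is taken to be a string and `timeout` a positive number, and `parseContinueRequest` is a hypothetical name.

```typescript
interface ContinueRequest {
  pipeline_id: string;
  model?: string;
  timeout?: number;
}

// Validates an untrusted JSON body; throws with a readable message on failure.
function parseContinueRequest(body: unknown): ContinueRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const fields = body as Record<string, unknown>;
  const pipelineId = fields["pipeline_id"];
  const model = fields["model"];
  const timeout = fields["timeout"];

  if (typeof pipelineId !== "string" || pipelineId.length === 0) {
    throw new Error("pipeline_id is required");
  }
  if (model !== undefined && typeof model !== "string") {
    throw new Error("model must be a string");
  }
  if (timeout !== undefined && (typeof timeout !== "number" || !(timeout > 0))) {
    throw new Error("timeout must be a positive number");
  }

  // Copy only the known fields so unexpected keys are dropped.
  const request: ContinueRequest = { pipeline_id: pipelineId };
  if (typeof model === "string") request.model = model;
  if (typeof timeout === "number") request.timeout = timeout;
  return request;
}
```
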
### 3.3 Orchestration Process
1. **Status Update**: Pipeline status → `ORCHESTRATING`
2. **Agent Spawn**: Launch ALPHA and BETA agents in parallel
3. **WebSocket Broadcast**: Real-time status to UI
4. **Monitor Loop**: Check for stuck/conflict conditions
5. **GAMMA Spawn**: If thresholds exceeded, spawn mediator
6. **Consensus**: Drive to final agreement
7. **Completion**: Status → `COMPLETED` or `FAILED`
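
The status changes in this process, together with the phases in Section 1, can be guarded with a small transition table. This sketch is not taken from `server.ts`: the names are hypothetical, and allowing `FAILED` from every non-terminal phase is an assumption.

```typescript
type PipelineStatus =
  | "SPAWN"
  | "RUNNING"
  | "REPORT"
  | "ORCHESTRATING"
  | "COMPLETED"
  | "FAILED";

// Forward edges follow the lifecycle diagram; FAILED edges are assumed.
const ALLOWED: Record<PipelineStatus, PipelineStatus[]> = {
  SPAWN: ["RUNNING", "FAILED"],
  RUNNING: ["REPORT", "FAILED"],
  REPORT: ["ORCHESTRATING", "FAILED"],
  ORCHESTRATING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

// Returns the new status, or throws if the move is not in the table.
function transition(current: PipelineStatus, next: PipelineStatus): PipelineStatus {
  if (ALLOWED[current].indexOf(next) === -1) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  return next;
}
```
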
---
## 4. Agent Multiplication and Handoff
### 4.1 Agent Roles
| Agent | Role | Spawn Condition |
|-------|------|-----------------|
| ALPHA | Research & Analysis | Always (initial) |
| BETA | Implementation & Synthesis | Always (initial) |
| GAMMA | Mediator & Resolver | On error, stuck, conflict, or high complexity; also on success for validation |
### 4.2 Spawn Conditions
```typescript
const SPAWN_CONDITIONS = {
  STUCK: {
    threshold: 30, // seconds of inactivity
    description: "Spawn GAMMA when agents stuck"
  },
  CONFLICT: {
    threshold: 3, // unresolved conflicts
    description: "Spawn GAMMA for mediation"
  },
  COMPLEXITY: {
    threshold: 0.8, // complexity score
    description: "Spawn GAMMA for decomposition"
  },
  SUCCESS: {
    threshold: 1.0, // task completion
    description: "Spawn GAMMA for validation"
  }
};
```
```
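
One way these thresholds might be evaluated in the monitor loop is sketched below. The snapshot fields and function name are hypothetical, and two details are assumptions: a condition fires when its value reaches the threshold (`>=`), and conditions are checked in the order they are listed.

```typescript
type SpawnReason = "STUCK" | "CONFLICT" | "COMPLEXITY" | "SUCCESS";

// Threshold values mirror SPAWN_CONDITIONS above.
const THRESHOLDS: Record<SpawnReason, number> = {
  STUCK: 30,       // seconds of inactivity
  CONFLICT: 3,     // unresolved conflicts
  COMPLEXITY: 0.8, // complexity score
  SUCCESS: 1.0,    // task completion
};

// Hypothetical view of the orchestration state at one monitor tick.
interface OrchestrationSnapshot {
  secondsInactive: number;
  unresolvedConflicts: number;
  complexityScore: number;
  completion: number;
}

// Returns the first condition that calls for GAMMA, or null if none does.
function gammaSpawnReason(s: OrchestrationSnapshot): SpawnReason | null {
  if (s.secondsInactive >= THRESHOLDS.STUCK) return "STUCK";
  if (s.unresolvedConflicts >= THRESHOLDS.CONFLICT) return "CONFLICT";
  if (s.complexityScore >= THRESHOLDS.COMPLEXITY) return "COMPLEXITY";
  if (s.completion >= THRESHOLDS.SUCCESS) return "SUCCESS";
  return null;
}
```
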
### 4.3 Handoff Protocol
When GAMMA spawns, it receives:
- Full blackboard state (problem, solutions, progress)
- Message log from ALPHA/BETA
- Spawn reason and context
- Authority to direct other agents
```typescript
// GAMMA handoff message (the type name is illustrative)
type GammaHandoffMessage = {
  type: "HANDOFF";
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT";
    tasks?: string[];
    diagnosis?: string;
    recommended_actions?: string[];
  };
};
```
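
Agents receiving a handoff may want to check its shape before acting on it. The type guard below is a sketch of that check; `HandoffMessage`, `isStringArray`, and `isHandoffMessage` are hypothetical names, and the field types are inferred from the message layout above.

```typescript
interface HandoffMessage {
  type: "HANDOFF";
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT";
    tasks?: string[];
    diagnosis?: string;
    recommended_actions?: string[];
  };
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Narrowing check for messages read off the bus as untyped JSON.
function isHandoffMessage(msg: unknown): msg is HandoffMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const outer = msg as Record<string, unknown>;
  if (outer["type"] !== "HANDOFF") return false;

  const payload = outer["payload"];
  if (typeof payload !== "object" || payload === null) return false;
  const inner = payload as Record<string, unknown>;

  const kind = inner["type"];
  if (kind !== "NEW_DIRECTION" && kind !== "SUBTASK_ASSIGNMENT") return false;

  const tasks = inner["tasks"];
  const diagnosis = inner["diagnosis"];
  const actions = inner["recommended_actions"];
  if (tasks !== undefined && !isStringArray(tasks)) return false;
  if (diagnosis !== undefined && typeof diagnosis !== "string") return false;
  if (actions !== undefined && !isStringArray(actions)) return false;
  return true;
}
```
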
### 4.4 Agent Lifecycle States
```
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐
│ CREATED │───▶│ BUSY │───▶│ WAITING │───▶│ HANDED-OFF│───▶│ SUCCEEDED │
└──────────┘ └──────────┘ └──────────┘ └───────────┘ └───────────┘
│ │
│ ┌──────────┐ │
└─────────────▶│ ERROR │◀────────────────────────┘
└──────────┘
```
The UI displays each agent with:
- Current state (color-coded)
- Progress percentage
- Current task description
- Message count (sent/received)
- Error count
---
## 5. Observability Integration
### 5.1 Real-Time Metrics
All metrics are stored in DragonflyDB and broadcast over WebSocket:
```typescript
// Stored under the key `metrics:${taskId}` (the interface name is illustrative)
interface TaskMetrics {
  total_messages: number;
  direct_messages: number;
  blackboard_writes: number;
  blackboard_reads: number;
  conflicts_detected: number;
  conflicts_resolved: number;
  gamma_spawned: boolean;
  gamma_spawn_reason: string;
  performance_score: number;
}
```
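
As a small illustration of how these fields might be consumed, the sketch below builds the storage key and derives the conflict figures shown in the status panel. The helper names are hypothetical, and reporting a rate of 1 when no conflicts were detected is an assumption.

```typescript
// Only the two counters needed here; the stored record has more fields.
interface ConflictCounters {
  conflicts_detected: number;
  conflicts_resolved: number;
}

function metricsKey(taskId: string): string {
  return `metrics:${taskId}`;
}

// Fraction of detected conflicts that were resolved (1 if none occurred).
function conflictResolutionRate(m: ConflictCounters): number {
  if (m.conflicts_detected === 0) return 1;
  return m.conflicts_resolved / m.conflicts_detected;
}

// Label in the style of the status panel, e.g. "1/1 resolved".
function conflictSummary(m: ConflictCounters): string {
  return `${m.conflicts_resolved}/${m.conflicts_detected} resolved`;
}
```
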
### 5.2 Error Loop Handling
```
Error Detected
┌─────────────────────┐
│ Log to bug_watcher │
│ (SQLite + Redis) │
└─────────────────────┘
┌─────────────────────┐ ┌─────────────────────┐
│ Check Error Budget │────▶│ Budget Exceeded? │
└─────────────────────┘ └─────────────────────┘
│ YES
┌─────────────────────┐
│ Spawn Diagnostic │
│ Pipeline with │
│ Error Context │
└─────────────────────┘
```
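
The "Check Error Budget" step can be sketched as a sliding-window counter. This is an in-memory illustration, not the `bug_watcher` implementation: the class name is hypothetical, and the default of 5 errors per 60 seconds is borrowed from the revocation table in Section 2.3.

```typescript
class ErrorBudget {
  private timestamps: number[] = [];
  private readonly maxErrors: number;
  private readonly windowMs: number;

  constructor(maxErrors: number = 5, windowMs: number = 60 * 1000) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
  }

  // Records one error at time `now` (epoch ms) and reports whether the
  // number of errors inside the window now exceeds the budget.
  record(now: number): boolean {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.maxErrors;
  }
}
```

A caller would spawn the diagnostic pipeline the first time `record` returns `true`.
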
### 5.3 Status Broadcasting
WebSocket events broadcast to UI:
| Event | Payload | Trigger |
|-------|---------|---------|
| `pipeline_started` | pipeline_id, task_id | Pipeline spawn |
| `agent_status` | agent_id, status | Any status change |
| `agent_message` | agent, message | Agent log output |
| `consensus_event` | proposal_id, votes | Consensus activity |
| `orchestration_started` | model, agents | Orchestration begin |
| `orchestration_complete` | status, metrics | Orchestration end |
| `error_threshold` | pipeline_id, errors | Error budget breach |
| `token_revoked` | pipeline_id, reason | Vault revocation |
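
On the UI side, these events lend themselves to a discriminated union keyed on the event name. The sketch below covers only four of the eight events, and the payload field types (for example, `errors` as a count) are assumptions, since the table lists field names only.

```typescript
type PipelineEvent =
  | { event: "pipeline_started"; pipeline_id: string; task_id: string }
  | { event: "agent_status"; agent_id: string; status: string }
  | { event: "error_threshold"; pipeline_id: string; errors: number }
  | { event: "token_revoked"; pipeline_id: string; reason: string };

// Serialize an event for sending over the WebSocket.
function encodeEvent(e: PipelineEvent): string {
  return JSON.stringify(e);
}

// The switch is exhaustive, so the compiler flags any unhandled event type.
function describeEvent(e: PipelineEvent): string {
  switch (e.event) {
    case "pipeline_started":
      return `pipeline ${e.pipeline_id} started (task ${e.task_id})`;
    case "agent_status":
      return `agent ${e.agent_id} is now ${e.status}`;
    case "error_threshold":
      return `pipeline ${e.pipeline_id} breached its error budget (${e.errors} errors)`;
    case "token_revoked":
      return `token for ${e.pipeline_id} revoked: ${e.reason}`;
  }
}
```
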
### 5.4 Structured Handoff Reports
When the error threshold is breached, the system generates a structured handoff report:
```json
{
  "report_type": "error_handoff",
  "pipeline_id": "pipeline-abc123",
  "timestamp": "2026-01-24T22:30:00Z",
  "summary": {
    "total_errors": 6,
    "error_types": ["api_timeout", "validation_failure"],
    "affected_agents": ["ALPHA"],
    "last_successful_checkpoint": "ckpt-xyz"
  },
  "context": {
    "task_objective": "...",
    "progress_at_failure": 0.45,
    "blackboard_snapshot": {...}
  },
  "recommended_actions": [
    "Reduce API call rate",
    "Split task into smaller subtasks"
  ]
}
```
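
The `summary` block of such a report could be derived from the logged errors roughly as follows. This is a sketch: `ErrorRecord` and `summarizeErrors` are hypothetical, and it assumes each logged error carries a type and the agent it came from.

```typescript
// Hypothetical shape of one logged error.
interface ErrorRecord {
  type: string;
  agent: string;
}

// Order-preserving de-duplication.
function unique(values: string[]): string[] {
  return values.filter((value, index) => values.indexOf(value) === index);
}

// Builds the "summary" section of the handoff report.
function summarizeErrors(errors: ErrorRecord[], lastCheckpoint: string) {
  return {
    total_errors: errors.length,
    error_types: unique(errors.map((e) => e.type)),
    affected_agents: unique(errors.map((e) => e.agent)),
    last_successful_checkpoint: lastCheckpoint,
  };
}
```
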
---
## 6. UI Components
### 6.1 Pipeline Status Panel
```
┌──────────────────────────────────────────────────────────────────┐
│ Pipeline: pipeline-abc123 [ORCHESTRATING]│
├──────────────────────────────────────────────────────────────────┤
│ Objective: Design distributed event-driven architecture... │
│ Model: anthropic/claude-sonnet-4 │
│ Started: 2026-01-24 22:15:00 UTC │
├──────────────────────────────────────────────────────────────────┤
│ AGENTS │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ ALPHA │ │ BETA │ │ GAMMA │ │
│ │ ████░░░ │ │ ██████░ │ │ ░░░░░░░ │ │
│ │ 45% │ │ 75% │ │ PENDING │ │
│ │ WORKING │ │ WAITING │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ METRICS │
│ Messages: 24 │ Conflicts: 1/1 resolved │ Score: 72% │
├──────────────────────────────────────────────────────────────────┤
│ RECENT ACTIVITY │
│ [22:16:32] ALPHA: Generated 3 initial proposals │
│ [22:16:45] BETA: Evaluating proposal prop-a1b2c3 │
│ [22:17:01] BETA: Proposal accepted with score 0.85 │
└──────────────────────────────────────────────────────────────────┘
```
### 6.2 Agent Lifecycle Cards
Each agent displays:
- Role badge (ALPHA/BETA/GAMMA)
- Status indicator with color
- Progress bar
- Current task label
- Message counters
- Error indicator (if any)
---
## 7. Implementation Checklist
### Backend (server.ts)
- [x] Pipeline spawn with auto_continue
- [x] Orchestration trigger after REPORT
- [x] Agent process spawning (Python + Bun)
- [x] WebSocket status broadcasting
- [x] Diagnostic agent (GAMMA) spawning on error
- [x] Vault token issuance per pipeline
- [x] Token renewal loop (every 30 minutes)
- [x] Observability-driven revocation
- [x] Error threshold monitoring
- [x] Structured handoff reports
### Coordination (coordination.ts)
- [x] Blackboard shared memory
- [x] MessageBus point-to-point
- [x] AgentStateManager
- [x] SpawnController conditions
- [x] MetricsCollector
- [x] Token integration via pipeline context
- [x] Error budget tracking
### Orchestrator (orchestrator.ts)
- [x] Multi-agent initialization
- [x] GAMMA spawn on conditions
- [x] Consensus checking
- [x] Performance analysis
- [x] Receive pipeline ID from environment
- [x] Error reporting to observability
### UI/API
- [x] Pipeline list view
- [x] Real-time log streaming
- [x] Agent lifecycle status API
- [x] Pipeline metrics endpoint
- [x] Error budget API
- [x] Token status/revoke/renew APIs
- [x] Handoff report generation
- [x] Diagnostic pipeline spawning
- [x] Consensus failure detection (exit code 2)
- [x] Consensus failure context recording
- [x] Fallback options (rerun, escalate, accept, download)
- [x] Failure report download
- [x] UI consensus failure alert with action buttons
- [x] Failure details modal
- [x] WebSocket notifications for consensus events
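
The consensus-failure items above hinge on how the server interprets the orchestrator's exit code. A minimal sketch, assuming exit code 0 means consensus was reached and any code other than 0 or 2 is an ordinary error; the names are hypothetical, and the completion rule (complete only on consensus or when the user accepts the partial result) is this sketch's reading of the fallback options.

```typescript
type RunOutcome = "CONSENSUS" | "CONSENSUS_FAILED" | "ERROR";
type FallbackChoice = "rerun" | "escalate" | "accept" | "download";

// Exit code 2 is reserved for consensus failure; 0 as success is assumed.
function classifyExit(code: number): RunOutcome {
  if (code === 0) return "CONSENSUS";
  if (code === 2) return "CONSENSUS_FAILED";
  return "ERROR";
}

// A pipeline counts as complete on consensus, or when the user
// explicitly accepts the result after a consensus failure.
function canMarkComplete(outcome: RunOutcome, choice?: FallbackChoice): boolean {
  if (outcome === "CONSENSUS") return true;
  return outcome === "CONSENSUS_FAILED" && choice === "accept";
}
```
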
---
## 8. API Endpoints
### Pipeline Control
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/spawn` | POST | Spawn pipeline with auto_continue |
| `/api/pipeline/continue` | POST | Manually trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/logs` | GET | Get pipeline logs |
| `/api/pipeline/metrics` | GET | Get pipeline metrics |
### Agent Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/agents` | GET | List all agents |
| `/api/agents/:id/status` | GET | Get agent status |
| `/api/agents/:id/messages` | GET | Get agent message log |
### Observability
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/observability/errors` | GET | Get error summary |
| `/api/observability/handoff` | POST | Generate handoff report |
| `/api/observability/revoke` | POST | Trigger token revocation |
---
*Last updated: 2026-01-24*