Multi-Agent Pipeline Architecture
Overview
This document describes the architecture for the production multi-agent pipeline system, including Vault token management, agent lifecycle, error handling, and observability integration.
Document Date: 2026-01-24
Status: IMPLEMENTED
1. Pipeline Flow
┌─────────────────────────────────────────────────────────────────┐
│ PIPELINE LIFECYCLE │
└─────────────────────────────────────────────────────────────────┘
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌───────────────┐ ┌───────────┐
│ SPAWN │────▶│ RUNNING │────▶│ REPORT │────▶│ ORCHESTRATING │────▶│ COMPLETED │
└─────────┘ └─────────┘ └─────────┘ └───────────────┘ └───────────┘
│ │ │ │ │
│ │ │ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Issue │ │ Agent │ │ Report │ │ ALPHA+BETA│ │ Consensus │
│ Vault │ │ Status │ │ Ready │ │ Parallel │ │ Achieved │
│ Token │ │ Updates │ │ │ │ │ │ │
└─────────┘ └─────────┘ └─────────┘ └───────────┘ └───────────┘
│
┌───────▼───────┐
│ Error/Stuck? │
└───────┬───────┘
│ YES
┌───────▼───────┐
│ SPAWN GAMMA │
│ (Diagnostic) │
└───────────────┘
2. Vault Token Management
2.1 Token Lifecycle
Each pipeline receives a dedicated Vault token with a renewable 2-hour TTL; periodic renewal keeps it valid for the entire orchestration:
Pipeline Start
│
▼
┌─────────────────────────────────────┐
│ 1. Request Pipeline Token from Vault │
│ - AppRole: pipeline-orchestrator │
│ - TTL: 2 hours (renewable) │
│ - Policies: pipeline-agent │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 2. Store Token in Redis │
│ Key: pipeline:{id}:vault_token │
│ + Encrypted with transit key │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Pass Token to All Agents │
│ - ALPHA, BETA, GAMMA inherit │
│ - Token renewal every 30 min │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. Observability Monitors Token │
│ - Can revoke for policy violation│
│ - Logs all token usage │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 5. Token Revoked on Completion │
│ - Or on error threshold breach │
└─────────────────────────────────────┘
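The renewal cadence from step 3 can be sketched as a few pure helpers. This is a minimal illustration that assumes the 2-hour TTL and 30-minute renewal interval shown in the diagram; `TokenLease`, `needsRenewal`, `isExpired`, and `renew` are hypothetical names, and no Vault call is made.

```typescript
// Hypothetical lease record; field names are illustrative, not from the codebase.
interface TokenLease {
  issuedAtMs: number; // when the token was issued or last renewed
  ttlMs: number;      // lease duration granted at issue time
}

const TTL_MS = 2 * 60 * 60 * 1000;        // 2 hours, per the lifecycle diagram
const RENEW_INTERVAL_MS = 30 * 60 * 1000; // renewal every 30 minutes

// True once the renewal interval has elapsed since issue or last renewal.
function needsRenewal(lease: TokenLease, nowMs: number): boolean {
  return nowMs - lease.issuedAtMs >= RENEW_INTERVAL_MS;
}

// True once the lease has run out without being renewed.
function isExpired(lease: TokenLease, nowMs: number): boolean {
  return nowMs - lease.issuedAtMs >= lease.ttlMs;
}

// Renewal resets the lease clock; a real implementation would call Vault here.
function renew(lease: TokenLease, nowMs: number): TokenLease {
  return { ...lease, issuedAtMs: nowMs };
}
```

Keeping the time source as a parameter makes the schedule easy to test without waiting on real timers.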
2.2 Token Policies
Pipeline Agent Policy (pipeline-agent.hcl):
# Read API keys for OpenRouter
path "secret/data/api-keys/*" {
  capabilities = ["read"]
}

# Read service credentials (DragonflyDB)
path "secret/data/services/*" {
  capabilities = ["read"]
}

# Agent-specific secrets
path "secret/data/agents/{{identity.entity.aliases.auth_approle.metadata.pipeline_id}}/*" {
  capabilities = ["read", "create", "update"]
}

# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}
2.3 Token Revocation Triggers
Observability can revoke a pipeline token mid-run for:
| Condition | Threshold | Action |
|---|---|---|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL violation | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |
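The table above can be read as a decision function. The sketch below is one possible encoding: the thresholds come from the table, while the `PipelineHealth` shape, the action names, and the order in which conditions are checked (policy violations first) are assumptions made for illustration.

```typescript
// Action labels are hypothetical; they mirror the "Action" column of the table.
type RevocationAction =
  | "NONE"
  | "REVOKE_AND_SPAWN_DIAGNOSTIC"
  | "REVOKE_AGENT_TOKEN"
  | "FULL_REVOCATION"
  | "RATE_LIMIT_THEN_REVOKE";

// Hypothetical health snapshot as the observability layer might report it.
interface PipelineHealth {
  errorsPerMinute: number;
  secondsSinceProgress: number;
  criticalViolations: number;
  apiCallsPerMinute: number;
}

// Checks conditions in an assumed priority order and returns the first match.
function decideRevocation(h: PipelineHealth): RevocationAction {
  if (h.criticalViolations > 0) return "FULL_REVOCATION";
  if (h.errorsPerMinute > 5) return "REVOKE_AND_SPAWN_DIAGNOSTIC";
  if (h.apiCallsPerMinute > 100) return "RATE_LIMIT_THEN_REVOKE";
  if (h.secondsSinceProgress > 60) return "REVOKE_AGENT_TOKEN";
  return "NONE";
}
```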
3. Report → Orchestration Transition
3.1 Automatic Trigger
When a pipeline reaches the REPORT phase with auto_continue=true, the completion check triggers orchestration automatically:
async function checkPipelineCompletion(pipelineId: string) {
  // ... existing completion check ...
  if (autoContinue && anySuccess) {
    // Trigger OpenRouter orchestration
    triggerOrchestration(pipelineId, taskId, objective, model, timeout);
  }
}
3.2 Manual Trigger
API endpoint for manual orchestration trigger:
POST /api/pipeline/continue
Body: { pipeline_id, model?, timeout? }
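A handler for this endpoint would need to validate the body before triggering orchestration. The following is a minimal sketch of that validation; `ContinueRequest` and `parseContinueRequest` are hypothetical names, and the rule that `timeout` must be a positive number is an assumption, since the document does not specify its unit or range.

```typescript
// Shape of the request body as listed above; optional fields may be omitted.
interface ContinueRequest {
  pipeline_id: string;
  model?: string;
  timeout?: number;
}

// Validates an untrusted JSON body and returns a typed request, or throws.
function parseContinueRequest(body: unknown): ContinueRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const fields = body as Record<string, unknown>;
  const pipelineId = fields["pipeline_id"];
  const model = fields["model"];
  const timeout = fields["timeout"];

  if (typeof pipelineId !== "string" || pipelineId.length === 0) {
    throw new Error("pipeline_id is required");
  }
  if (model !== undefined && typeof model !== "string") {
    throw new Error("model must be a string");
  }
  // Assumption: a timeout, when given, must be a positive finite number.
  if (timeout !== undefined && (typeof timeout !== "number" || !Number.isFinite(timeout) || timeout <= 0)) {
    throw new Error("timeout must be a positive number");
  }

  const request: ContinueRequest = { pipeline_id: pipelineId };
  if (model !== undefined) request.model = model;
  if (timeout !== undefined) request.timeout = timeout;
  return request;
}
```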
3.3 Orchestration Process
- Status Update: Pipeline status → ORCHESTRATING
- Agent Spawn: Launch ALPHA and BETA agents in parallel
- WebSocket Broadcast: Real-time status to UI
- Monitor Loop: Check for stuck/conflict conditions
- GAMMA Spawn: If thresholds exceeded, spawn mediator
- Consensus: Drive to final agreement
- Completion: Status → COMPLETED or FAILED
4. Agent Multiplication and Handoff
4.1 Agent Roles
| Agent | Role | Spawn Condition |
|---|---|---|
| ALPHA | Research & Analysis | Always (initial) |
| BETA | Implementation & Synthesis | Always (initial) |
| GAMMA | Mediator & Resolver | On error/stuck/conflict/complexity |
4.2 Spawn Conditions
const SPAWN_CONDITIONS = {
  STUCK: {
    threshold: 30, // seconds of inactivity
    description: "Spawn GAMMA when agents stuck"
  },
  CONFLICT: {
    threshold: 3, // unresolved conflicts
    description: "Spawn GAMMA for mediation"
  },
  COMPLEXITY: {
    threshold: 0.8, // complexity score
    description: "Spawn GAMMA for decomposition"
  },
  SUCCESS: {
    threshold: 1.0, // task completion
    description: "Spawn GAMMA for validation"
  }
};
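To show how these thresholds might be evaluated in the monitor loop, here is a small sketch. The threshold values are copied from the configuration above; the `OrchestrationSnapshot` shape, the `gammaSpawnReason` name, the use of `>=` comparisons, and the order of checks are all assumptions.

```typescript
type SpawnReason = "STUCK" | "CONFLICT" | "COMPLEXITY" | "SUCCESS";

// Hypothetical snapshot of the values the monitor loop would compare against thresholds.
interface OrchestrationSnapshot {
  secondsInactive: number;
  unresolvedConflicts: number;
  complexityScore: number; // 0..1
  completion: number;      // 0..1
}

// Threshold values taken from SPAWN_CONDITIONS.
const THRESHOLDS = { STUCK: 30, CONFLICT: 3, COMPLEXITY: 0.8, SUCCESS: 1.0 } as const;

// Returns the first condition that is met, or null if GAMMA should not be spawned.
function gammaSpawnReason(s: OrchestrationSnapshot): SpawnReason | null {
  if (s.secondsInactive >= THRESHOLDS.STUCK) return "STUCK";
  if (s.unresolvedConflicts >= THRESHOLDS.CONFLICT) return "CONFLICT";
  if (s.complexityScore >= THRESHOLDS.COMPLEXITY) return "COMPLEXITY";
  if (s.completion >= THRESHOLDS.SUCCESS) return "SUCCESS";
  return null;
}
```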
4.3 Handoff Protocol
When GAMMA spawns, it receives:
- Full blackboard state (problem, solutions, progress)
- Message log from ALPHA/BETA
- Spawn reason and context
- Authority to direct other agents
// GAMMA handoff message
{
  type: "HANDOFF",
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT",
    tasks?: string[],
    diagnosis?: string,
    recommended_actions?: string[]
  }
}
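The message shape above could be expressed as a TypeScript type with a small builder. This is a sketch only: `HandoffMessage` and `subtaskHandoff` are hypothetical names, and the rule that a SUBTASK_ASSIGNMENT carries at least one task is an assumption.

```typescript
type HandoffPayloadType = "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT";

// Typed form of the handoff message shown above.
interface HandoffMessage {
  type: "HANDOFF";
  payload: {
    type: HandoffPayloadType;
    tasks?: string[];
    diagnosis?: string;
    recommended_actions?: string[];
  };
}

// Builds a SUBTASK_ASSIGNMENT handoff; rejects an empty task list (assumed rule).
function subtaskHandoff(tasks: string[], diagnosis?: string): HandoffMessage {
  if (tasks.length === 0) {
    throw new Error("SUBTASK_ASSIGNMENT needs at least one task");
  }
  const message: HandoffMessage = {
    type: "HANDOFF",
    payload: { type: "SUBTASK_ASSIGNMENT", tasks: [...tasks] },
  };
  if (diagnosis !== undefined) message.payload.diagnosis = diagnosis;
  return message;
}
```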
4.4 Agent Lifecycle States
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐
│ CREATED │───▶│ BUSY │───▶│ WAITING │───▶│ HANDED-OFF│───▶│ SUCCEEDED │
└──────────┘ └──────────┘ └──────────┘ └───────────┘ └───────────┘
│ │
│ ┌──────────┐ │
└─────────────▶│ ERROR │◀────────────────────────┘
└──────────┘
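The lifecycle above can be sketched as a transition table. The state names come from the diagram; the exact set of allowed transitions is an assumption, in particular that ERROR is reachable from every non-terminal state and that SUCCEEDED and ERROR are terminal. `canTransition` is a hypothetical helper.

```typescript
type AgentState = "CREATED" | "BUSY" | "WAITING" | "HANDED-OFF" | "SUCCEEDED" | "ERROR";

// Assumed transitions: the linear path from the diagram, plus ERROR from any
// non-terminal state. The real system may allow more (for example WAITING back to BUSY).
const NEXT_STATES: Record<AgentState, readonly AgentState[]> = {
  "CREATED": ["BUSY", "ERROR"],
  "BUSY": ["WAITING", "ERROR"],
  "WAITING": ["HANDED-OFF", "ERROR"],
  "HANDED-OFF": ["SUCCEEDED", "ERROR"],
  "SUCCEEDED": [],
  "ERROR": [],
};

// True if the table lists `to` as a direct successor of `from`.
function canTransition(from: AgentState, to: AgentState): boolean {
  return NEXT_STATES[from].indexOf(to) !== -1;
}
```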
UI displays each agent with:
- Current state (color-coded)
- Progress percentage
- Current task description
- Message count (sent/received)
- Error count
5. Observability Integration
5.1 Real-Time Metrics
All metrics are stored in DragonflyDB and broadcast over WebSocket:
// Metrics keys
`metrics:${taskId}` → {
  total_messages: number,
  direct_messages: number,
  blackboard_writes: number,
  blackboard_reads: number,
  conflicts_detected: number,
  conflicts_resolved: number,
  gamma_spawned: boolean,
  gamma_spawn_reason: string,
  performance_score: number
}
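As an illustration of how this record might be consumed, the sketch below types the metrics object and derives the key and a conflict summary string. `TaskMetrics`, `metricsKey`, `conflictSummary`, and `resolutionRate` are hypothetical names; treating zero detected conflicts as a resolution rate of 1 is an assumption.

```typescript
// Typed form of the metrics record listed above.
interface TaskMetrics {
  total_messages: number;
  direct_messages: number;
  blackboard_writes: number;
  blackboard_reads: number;
  conflicts_detected: number;
  conflicts_resolved: number;
  gamma_spawned: boolean;
  gamma_spawn_reason: string;
  performance_score: number;
}

// Builds the storage key in the `metrics:${taskId}` form.
function metricsKey(taskId: string): string {
  return `metrics:${taskId}`;
}

// Formats conflicts as "resolved/detected resolved", e.g. for a status panel.
function conflictSummary(m: TaskMetrics): string {
  return `${m.conflicts_resolved}/${m.conflicts_detected} resolved`;
}

// Fraction of detected conflicts that were resolved (1 when none were detected).
function resolutionRate(m: TaskMetrics): number {
  return m.conflicts_detected === 0 ? 1 : m.conflicts_resolved / m.conflicts_detected;
}
```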
5.2 Error Loop Handling
Error Detected
│
▼
┌─────────────────────┐
│ Log to bug_watcher │
│ (SQLite + Redis) │
└─────────────────────┘
│
▼
┌─────────────────────┐ ┌─────────────────────┐
│ Check Error Budget │────▶│ Budget Exceeded? │
└─────────────────────┘ └─────────────────────┘
│ YES
▼
┌─────────────────────┐
│ Spawn Diagnostic │
│ Pipeline with │
│ Error Context │
└─────────────────────┘
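The "Check Error Budget" step could be implemented as a sliding-window counter. The sketch below assumes the budget matches the revocation threshold of more than 5 errors per minute; the `ErrorBudget` class and its in-memory storage are illustrative only, whereas the flow above logs to bug_watcher.

```typescript
// Sliding-window error budget: exceeded when more than `maxErrors`
// errors fall inside the trailing window.
class ErrorBudget {
  private timestamps: number[] = [];
  private readonly maxErrors: number;
  private readonly windowMs: number;

  constructor(maxErrors: number = 5, windowMs: number = 60 * 1000) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
  }

  // Records one error at `nowMs` and returns true if the budget is now exceeded.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop errors that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.maxErrors;
  }
}
```

When `record` returns true, the caller would spawn the diagnostic pipeline with the error context.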
5.3 Status Broadcasting
WebSocket events broadcast to UI:
| Event | Payload | Trigger |
|---|---|---|
| pipeline_started | pipeline_id, task_id | Pipeline spawn |
| agent_status | agent_id, status | Any status change |
| agent_message | agent, message | Agent log output |
| consensus_event | proposal_id, votes | Consensus activity |
| orchestration_started | model, agents | Orchestration begin |
| orchestration_complete | status, metrics | Orchestration end |
| error_threshold | pipeline_id, errors | Error budget breach |
| token_revoked | pipeline_id, reason | Vault revocation |
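A UI client would need to decode these events and reject unknown ones. The sketch below shows one way to do that. The event names are taken from the table; the `{ event, payload }` envelope and the `decodeEvent` function are assumptions, since the wire format is not specified here.

```typescript
// Event names from the table above.
const EVENT_NAMES = [
  "pipeline_started",
  "agent_status",
  "agent_message",
  "consensus_event",
  "orchestration_started",
  "orchestration_complete",
  "error_threshold",
  "token_revoked",
] as const;

type EventName = typeof EVENT_NAMES[number];

// Assumed envelope: an event name plus a payload object.
interface EventEnvelope {
  event: EventName;
  payload: Record<string, unknown>;
}

// Parses a raw WebSocket text frame and validates the envelope, or throws.
function decodeEvent(raw: string): EventEnvelope {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("event must be a JSON object");
  }
  const fields = parsed as Record<string, unknown>;
  const name = fields["event"];
  const payload = fields["payload"];
  if (typeof name !== "string" || (EVENT_NAMES as readonly string[]).indexOf(name) === -1) {
    throw new Error("unknown event name");
  }
  if (typeof payload !== "object" || payload === null) {
    throw new Error("payload must be an object");
  }
  return { event: name as EventName, payload: payload as Record<string, unknown> };
}
```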
5.4 Structured Handoff Reports
When the error threshold is breached, the system generates a handoff report:
{
  "report_type": "error_handoff",
  "pipeline_id": "pipeline-abc123",
  "timestamp": "2026-01-24T22:30:00Z",
  "summary": {
    "total_errors": 6,
    "error_types": ["api_timeout", "validation_failure"],
    "affected_agents": ["ALPHA"],
    "last_successful_checkpoint": "ckpt-xyz"
  },
  "context": {
    "task_objective": "...",
    "progress_at_failure": 0.45,
    "blackboard_snapshot": {...}
  },
  "recommended_actions": [
    "Reduce API call rate",
    "Split task into smaller subtasks"
  ]
}
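The `summary` block of such a report could be derived from a list of recorded errors. The sketch below shows that aggregation only; `ErrorRecord` and `summarizeErrors` are hypothetical, and the checkpoint, context, and recommended actions are left out because their sources are not described here.

```typescript
// Hypothetical per-error record as it might be logged before aggregation.
interface ErrorRecord {
  type: string;  // e.g. "api_timeout"
  agent: string; // e.g. "ALPHA"
}

// Subset of the report's "summary" object that can be computed from the errors alone.
interface ErrorSummary {
  total_errors: number;
  error_types: string[];
  affected_agents: string[];
}

// Counts errors and collects distinct types and agents in first-seen order.
function summarizeErrors(errors: ErrorRecord[]): ErrorSummary {
  return {
    total_errors: errors.length,
    error_types: Array.from(new Set(errors.map((e) => e.type))),
    affected_agents: Array.from(new Set(errors.map((e) => e.agent))),
  };
}
```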
6. UI Components
6.1 Pipeline Status Panel
┌──────────────────────────────────────────────────────────────────┐
│ Pipeline: pipeline-abc123 [ORCHESTRATING]│
├──────────────────────────────────────────────────────────────────┤
│ Objective: Design distributed event-driven architecture... │
│ Model: anthropic/claude-sonnet-4 │
│ Started: 2026-01-24 22:15:00 UTC │
├──────────────────────────────────────────────────────────────────┤
│ AGENTS │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ ALPHA │ │ BETA │ │ GAMMA │ │
│ │ ████░░░ │ │ ██████░ │ │ ░░░░░░░ │ │
│ │ 45% │ │ 75% │ │ PENDING │ │
│ │ BUSY │ │ WAITING │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ METRICS │
│ Messages: 24 │ Conflicts: 1/1 resolved │ Score: 72% │
├──────────────────────────────────────────────────────────────────┤
│ RECENT ACTIVITY │
│ [22:16:32] ALPHA: Generated 3 initial proposals │
│ [22:16:45] BETA: Evaluating proposal prop-a1b2c3 │
│ [22:17:01] BETA: Proposal accepted with score 0.85 │
└──────────────────────────────────────────────────────────────────┘
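The agent progress bars in the mock-up can be reproduced with a short helper. This sketch assumes a 7-cell bar with the filled count rounded up, which matches the 45% and 75% examples in the panel; the actual rendering rule is not stated, and `progressBar` is a hypothetical name.

```typescript
const BAR_WIDTH = 7; // cells per bar, as drawn in the panel mock-up

// Renders a percentage as filled and empty block characters.
// Assumption: partial cells round up, so any non-zero progress shows at least one cell.
function progressBar(percent: number): string {
  const clamped = Math.min(100, Math.max(0, percent));
  const filled = Math.ceil((clamped / 100) * BAR_WIDTH);
  return "█".repeat(filled) + "░".repeat(BAR_WIDTH - filled);
}
```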
6.2 Agent Lifecycle Cards
Each agent displays:
- Role badge (ALPHA/BETA/GAMMA)
- Status indicator with color
- Progress bar
- Current task label
- Message counters
- Error indicator (if any)
7. Implementation Checklist
Backend (server.ts)
- Pipeline spawn with auto_continue
- Orchestration trigger after REPORT
- Agent process spawning (Python + Bun)
- WebSocket status broadcasting
- Diagnostic agent (GAMMA) spawning on error
- Vault token issuance per pipeline
- Token renewal loop (every 30 minutes)
- Observability-driven revocation
- Error threshold monitoring
- Structured handoff reports
Coordination (coordination.ts)
- Blackboard shared memory
- MessageBus point-to-point
- AgentStateManager
- SpawnController conditions
- MetricsCollector
- Token integration via pipeline context
- Error budget tracking
Orchestrator (orchestrator.ts)
- Multi-agent initialization
- GAMMA spawn on conditions
- Consensus checking
- Performance analysis
- Receive pipeline ID from environment
- Error reporting to observability
UI/API
- Pipeline list view
- Real-time log streaming
- Agent lifecycle status API
- Pipeline metrics endpoint
- Error budget API
- Token status/revoke/renew APIs
- Handoff report generation
- Diagnostic pipeline spawning
- Consensus failure detection (exit code 2)
- Consensus failure context recording
- Fallback options (rerun, escalate, accept, download)
- Failure report download
- UI consensus failure alert with action buttons
- Failure details modal
- WebSocket notifications for consensus events
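Consensus failure detection hinges on the orchestrator's exit code. The sketch below maps an exit code to a pipeline outcome. Exit code 2 for consensus failure comes from the checklist above; treating 0 as success and every other code (or a missing code) as a generic error is an assumption, and `classifyExit` and the outcome labels are hypothetical.

```typescript
type OrchestrationOutcome = "COMPLETED" | "CONSENSUS_FAILED" | "ERROR";

// Maps the orchestrator process exit code to an outcome.
// `null` covers a process that ended without an exit code (e.g. killed by a signal).
function classifyExit(code: number | null): OrchestrationOutcome {
  if (code === 0) return "COMPLETED";
  if (code === 2) return "CONSENSUS_FAILED";
  return "ERROR";
}

// Only a consensus failure should surface the fallback options to the user.
function shouldOfferFallback(code: number | null): boolean {
  return classifyExit(code) === "CONSENSUS_FAILED";
}
```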
8. API Endpoints
Pipeline Control
| Endpoint | Method | Description |
|---|---|---|
| /api/spawn | POST | Spawn pipeline with auto_continue |
| /api/pipeline/continue | POST | Manually trigger orchestration |
| /api/pipeline/orchestration | GET | Get orchestration status |
| /api/pipeline/token | GET | Get pipeline token status |
| /api/pipeline/revoke | POST | Revoke pipeline token |
| /api/active-pipelines | GET | List active pipelines |
| /api/pipeline/logs | GET | Get pipeline logs |
| /api/pipeline/metrics | GET | Get pipeline metrics |
Agent Management
| Endpoint | Method | Description |
|---|---|---|
| /api/agents | GET | List all agents |
| /api/agents/:id/status | GET | Get agent status |
| /api/agents/:id/messages | GET | Get agent message log |
Observability
| Endpoint | Method | Description |
|---|---|---|
| /api/observability/errors | GET | Get error summary |
| /api/observability/handoff | POST | Generate handoff report |
| /api/observability/revoke | POST | Trigger token revocation |
Last updated: 2026-01-24