Multi-Agent Pipeline Architecture

Overview

This document describes the architecture for the production multi-agent pipeline system, including Vault token management, agent lifecycle, error handling, and observability integration.

Document Date: 2026-01-24
Status: IMPLEMENTED


1. Pipeline Flow

                    ┌─────────────────────────────────────────────────────────────────┐
                    │                      PIPELINE LIFECYCLE                          │
                    └─────────────────────────────────────────────────────────────────┘

    ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌───────────────┐     ┌───────────┐
    │  SPAWN  │────▶│ RUNNING │────▶│ REPORT  │────▶│ ORCHESTRATING │────▶│ COMPLETED │
    └─────────┘     └─────────┘     └─────────┘     └───────────────┘     └───────────┘
         │               │               │                  │                   │
         │               │               │                  │                   │
    ┌────▼────┐     ┌────▼────┐     ┌────▼────┐      ┌─────▼─────┐       ┌─────▼─────┐
    │ Issue   │     │ Agent   │     │ Report  │      │ ALPHA+BETA│       │ Consensus │
    │ Vault   │     │ Status  │     │ Ready   │      │ Parallel  │       │ Achieved  │
    │ Token   │     │ Updates │     │         │      │           │       │           │
    └─────────┘     └─────────┘     └─────────┘      └───────────┘       └───────────┘
                                                            │
                                                    ┌───────▼───────┐
                                                    │ Error/Stuck?  │
                                                    └───────┬───────┘
                                                            │ YES
                                                    ┌───────▼───────┐
                                                    │ SPAWN GAMMA   │
                                                    │ (Diagnostic)  │
                                                    └───────────────┘
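The lifecycle in the diagram can be read as a small state machine. The sketch below is one hedged way to encode it: the phase names come from the diagram and from the FAILED status mentioned in section 3.3, while the transition table (in particular, allowing any non-terminal phase to move to FAILED) and the names `NEXT_PHASES` and `advancePhase` are assumptions made for illustration.

```typescript
// Phase names follow the diagram; FAILED comes from section 3.3.
type PipelinePhase =
  | "SPAWN"
  | "RUNNING"
  | "REPORT"
  | "ORCHESTRATING"
  | "COMPLETED"
  | "FAILED";

// Assumed transition table: each phase advances to the next one or fails.
const NEXT_PHASES: Record<PipelinePhase, PipelinePhase[]> = {
  SPAWN: ["RUNNING", "FAILED"],
  RUNNING: ["REPORT", "FAILED"],
  REPORT: ["ORCHESTRATING", "FAILED"],
  ORCHESTRATING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

// Returns the new phase, or throws if the transition is not in the table.
function advancePhase(current: PipelinePhase, next: PipelinePhase): PipelinePhase {
  if (NEXT_PHASES[current].indexOf(next) === -1) {
    throw new Error(`Illegal pipeline transition: ${current} -> ${next}`);
  }
  return next;
}
```

Rejecting unknown transitions early makes it harder for a pipeline to be marked COMPLETED without passing through ORCHESTRATING.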

2. Vault Token Management

2.1 Token Lifecycle

Each pipeline receives a dedicated, renewable Vault token (2-hour TTL) that persists through the entire orchestration:

Pipeline Start
      │
      ▼
┌─────────────────────────────────────┐
│ 1. Request Pipeline Token from Vault │
│    - AppRole: pipeline-orchestrator  │
│    - TTL: 2 hours (renewable)        │
│    - Policies: pipeline-agent        │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 2. Store Token in Redis             │
│    Key: pipeline:{id}:vault_token   │
│    + Encrypted with transit key     │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 3. Pass Token to All Agents         │
│    - ALPHA, BETA, GAMMA inherit     │
│    - Token renewal every 30 min     │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 4. Observability Monitors Token     │
│    - Can revoke for policy violation│
│    - Logs all token usage           │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 5. Token Revoked on Completion      │
│    - Or on error threshold breach   │
└─────────────────────────────────────┘
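The timing rules in the lifecycle above (2-hour TTL, renewal every 30 minutes, revocation on completion) can be sketched as pure functions. This is a minimal illustration only: `PipelineTokenState`, `tokenKey`, `needsRenewal`, and `isUsable` are hypothetical names, no Vault or Redis call is made, and the sketch assumes that each renewal resets the TTL to the full two hours.

```typescript
// Values from the lifecycle diagram: 2-hour TTL, renewal every 30 minutes.
const TOKEN_TTL_MS = 2 * 60 * 60 * 1000;
const RENEW_INTERVAL_MS = 30 * 60 * 1000;

// Hypothetical in-memory record; the real token is stored encrypted in Redis.
interface PipelineTokenState {
  pipelineId: string;
  lastRenewedAt: number; // epoch milliseconds of issuance or latest renewal
  revoked: boolean;
}

// Key layout taken from step 2 of the diagram.
function tokenKey(pipelineId: string): string {
  return `pipeline:${pipelineId}:vault_token`;
}

// True once 30 minutes have passed since the last renewal.
function needsRenewal(token: PipelineTokenState, now: number): boolean {
  return !token.revoked && now - token.lastRenewedAt >= RENEW_INTERVAL_MS;
}

// Assumption: each renewal resets the TTL to the full two hours.
function isUsable(token: PipelineTokenState, now: number): boolean {
  return !token.revoked && now - token.lastRenewedAt < TOKEN_TTL_MS;
}
```

With a 30-minute renewal interval and a 2-hour TTL, a token would survive up to three consecutive missed renewals before it lapses.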

2.2 Token Policies

Pipeline Agent Policy (pipeline-agent.hcl):

# Read API keys for OpenRouter
path "secret/data/api-keys/*" {
  capabilities = ["read"]
}

# Read service credentials (DragonflyDB)
path "secret/data/services/*" {
  capabilities = ["read"]
}

# Agent-specific secrets
path "secret/data/agents/{{identity.entity.aliases.auth_approle.metadata.pipeline_id}}/*" {
  capabilities = ["read", "create", "update"]
}

# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}

2.3 Token Revocation Triggers

Observability can revoke a pipeline token mid-run for:

| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL violation | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |
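The revocation triggers can be expressed as a single decision function. The thresholds below are the ones listed above; the `PipelineHealth` shape, the action names, and the order in which conditions are checked are assumptions for illustration. The "then revoke" escalation after rate limiting is not modeled here.

```typescript
type RevocationAction =
  | "REVOKE_ALL_IMMEDIATELY"
  | "REVOKE_AND_SPAWN_DIAGNOSTIC"
  | "REVOKE_AGENT_TOKEN"
  | "RATE_LIMIT"
  | "NONE";

// Hypothetical snapshot of the signals observability would watch.
interface PipelineHealth {
  errorsPerMinute: number;
  secondsSinceProgress: number;
  criticalViolations: number;
  apiCallsPerMinute: number;
}

// Checks the most severe condition first (ordering is an assumption).
function decideRevocation(health: PipelineHealth): RevocationAction {
  if (health.criticalViolations > 0) return "REVOKE_ALL_IMMEDIATELY";
  if (health.errorsPerMinute > 5) return "REVOKE_AND_SPAWN_DIAGNOSTIC";
  if (health.secondsSinceProgress > 60) return "REVOKE_AGENT_TOKEN";
  if (health.apiCallsPerMinute > 100) return "RATE_LIMIT";
  return "NONE";
}
```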

3. Report → Orchestration Transition

3.1 Automatic Trigger

When a pipeline reaches the REPORT phase with auto_continue=true:

async function checkPipelineCompletion(pipelineId: string) {
  // ... existing completion check ...

  if (autoContinue && anySuccess) {
    // Trigger OpenRouter orchestration
    triggerOrchestration(pipelineId, taskId, objective, model, timeout);
  }
}

3.2 Manual Trigger

API endpoint for manual orchestration trigger:

POST /api/pipeline/continue
Body: { pipeline_id, model?, timeout? }
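A server handling this endpoint would likely validate the body before triggering orchestration. The following is a hedged sketch of such a check: the field names come from the request body above, while `ContinueRequest`, `parseContinueRequest`, the field types, and the error messages are assumptions rather than the actual server.ts code.

```typescript
interface ContinueRequest {
  pipeline_id: string;
  model?: string;
  timeout?: number;
}

// Validates an already JSON-parsed body; throws on missing or mistyped fields.
function parseContinueRequest(body: unknown): ContinueRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("Request body must be a JSON object");
  }
  const raw = body as Record<string, unknown>;
  const pipelineId = raw["pipeline_id"];
  const model = raw["model"];
  const timeout = raw["timeout"];

  if (typeof pipelineId !== "string" || pipelineId.length === 0) {
    throw new Error("pipeline_id is required");
  }
  const request: ContinueRequest = { pipeline_id: pipelineId };

  // Optional fields are copied only when present and well-typed.
  if (typeof model === "string") {
    request.model = model;
  } else if (model !== undefined) {
    throw new Error("model must be a string when provided");
  }
  if (typeof timeout === "number" && timeout > 0) {
    request.timeout = timeout;
  } else if (timeout !== undefined) {
    throw new Error("timeout must be a positive number when provided");
  }
  return request;
}
```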

3.3 Orchestration Process

  1. Status Update: Pipeline status → ORCHESTRATING
  2. Agent Spawn: Launch ALPHA and BETA agents in parallel
  3. WebSocket Broadcast: Real-time status to UI
  4. Monitor Loop: Check for stuck/conflict conditions
  5. GAMMA Spawn: If thresholds exceeded, spawn mediator
  6. Consensus: Drive to final agreement
  7. Completion: Status → COMPLETED or FAILED

4. Agent Multiplication and Handoff

4.1 Agent Roles

| Agent | Role | Spawn Condition |
|-------|------|-----------------|
| ALPHA | Research & Analysis | Always (initial) |
| BETA | Implementation & Synthesis | Always (initial) |
| GAMMA | Mediator & Resolver | On error/stuck/conflict/complexity |

4.2 Spawn Conditions

const SPAWN_CONDITIONS = {
  STUCK: {
    threshold: 30,  // seconds of inactivity
    description: "Spawn GAMMA when agents stuck"
  },
  CONFLICT: {
    threshold: 3,   // unresolved conflicts
    description: "Spawn GAMMA for mediation"
  },
  COMPLEXITY: {
    threshold: 0.8, // complexity score
    description: "Spawn GAMMA for decomposition"
  },
  SUCCESS: {
    threshold: 1.0, // task completion
    description: "Spawn GAMMA for validation"
  }
};
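To show how these thresholds might be consulted in the monitor loop, here is a minimal sketch. The threshold values mirror `SPAWN_CONDITIONS`; the `OrchestrationSnapshot` shape, the function name `gammaSpawnReason`, the use of `>=` comparisons, and the evaluation order are assumptions.

```typescript
type SpawnReason = "STUCK" | "CONFLICT" | "COMPLEXITY" | "SUCCESS";

// Threshold values copied from SPAWN_CONDITIONS.
const THRESHOLDS: Record<SpawnReason, number> = {
  STUCK: 30,       // seconds of inactivity
  CONFLICT: 3,     // unresolved conflicts
  COMPLEXITY: 0.8, // complexity score
  SUCCESS: 1.0,    // task completion
};

// Hypothetical view of the state the monitor loop would sample.
interface OrchestrationSnapshot {
  secondsInactive: number;
  unresolvedConflicts: number;
  complexityScore: number;
  completion: number;
}

// Returns the first condition that is met, or null when GAMMA is not needed.
function gammaSpawnReason(snapshot: OrchestrationSnapshot): SpawnReason | null {
  if (snapshot.secondsInactive >= THRESHOLDS.STUCK) return "STUCK";
  if (snapshot.unresolvedConflicts >= THRESHOLDS.CONFLICT) return "CONFLICT";
  if (snapshot.complexityScore >= THRESHOLDS.COMPLEXITY) return "COMPLEXITY";
  if (snapshot.completion >= THRESHOLDS.SUCCESS) return "SUCCESS";
  return null;
}
```

The returned reason could then be recorded as `gamma_spawn_reason` in the task metrics.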

4.3 Handoff Protocol

When GAMMA spawns, it receives:

  • Full blackboard state (problem, solutions, progress)
  • Message log from ALPHA/BETA
  • Spawn reason and context
  • Authority to direct other agents

// GAMMA handoff message (interface name is illustrative)
interface GammaHandoffMessage {
  type: "HANDOFF";
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT";
    tasks?: string[];
    diagnosis?: string;
    recommended_actions?: string[];
  };
}

4.4 Agent Lifecycle States

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐    ┌───────────┐
│  CREATED │───▶│  BUSY    │───▶│ WAITING  │───▶│ HANDED-OFF│───▶│ SUCCEEDED │
└──────────┘    └──────────┘    └──────────┘    └───────────┘    └───────────┘
                     │                                                   │
                     │              ┌──────────┐                         │
                     └─────────────▶│  ERROR   │◀────────────────────────┘
                                    └──────────┘

UI displays each agent with:

  • Current state (color-coded)
  • Progress percentage
  • Current task description
  • Message count (sent/received)
  • Error count

5. Observability Integration

5.1 Real-Time Metrics

All metrics are stored in DragonflyDB and broadcast over WebSocket:

// Stored under the key `metrics:${taskId}` (interface name is illustrative)
interface TaskMetrics {
  total_messages: number;
  direct_messages: number;
  blackboard_writes: number;
  blackboard_reads: number;
  conflicts_detected: number;
  conflicts_resolved: number;
  gamma_spawned: boolean;
  gamma_spawn_reason: string;
  performance_score: number;
}
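As a small illustration of how these fields could feed the UI, the sketch below derives the metrics key and a conflict summary string of the form used in the status panel. The helper names `metricsKey`, `conflictSummary`, and `unresolvedConflicts`, and the narrow `ConflictCounts` type, are hypothetical.

```typescript
// Only the two counters this sketch needs from the metrics record.
interface ConflictCounts {
  conflicts_detected: number;
  conflicts_resolved: number;
}

// Key layout from the metrics definition above.
function metricsKey(taskId: string): string {
  return `metrics:${taskId}`;
}

// Formats the counters as "<resolved>/<detected> resolved".
function conflictSummary(metrics: ConflictCounts): string {
  return `${metrics.conflicts_resolved}/${metrics.conflicts_detected} resolved`;
}

// Conflicts still open; never negative even if the counters are out of sync.
function unresolvedConflicts(metrics: ConflictCounts): number {
  return Math.max(0, metrics.conflicts_detected - metrics.conflicts_resolved);
}
```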

5.2 Error Loop Handling

Error Detected
      │
      ▼
┌─────────────────────┐
│ Log to bug_watcher  │
│ (SQLite + Redis)    │
└─────────────────────┘
      │
      ▼
┌─────────────────────┐     ┌─────────────────────┐
│ Check Error Budget  │────▶│ Budget Exceeded?    │
└─────────────────────┘     └─────────────────────┘
                                    │ YES
                                    ▼
                            ┌─────────────────────┐
                            │ Spawn Diagnostic    │
                            │ Pipeline with       │
                            │ Error Context       │
                            └─────────────────────┘
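The "Check Error Budget" step can be approximated with a sliding-window counter. The sketch below is an assumption-laden illustration: the `ErrorBudget` class is hypothetical, and its defaults (more than 5 errors within 60 seconds) are borrowed from the error-rate revocation threshold in section 2.3, which may not match the real budget.

```typescript
class ErrorBudget {
  private timestamps: number[] = [];
  private readonly maxErrors: number;
  private readonly windowMs: number;

  // Defaults assume the "> 5 errors/minute" threshold from section 2.3.
  constructor(maxErrors: number = 5, windowMs: number = 60 * 1000) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
  }

  // Records one error at `now` (epoch ms); returns true when the budget is exceeded.
  record(now: number): boolean {
    const cutoff = now - this.windowMs;
    // Drop errors that have fallen out of the window, then add the new one.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    this.timestamps.push(now);
    return this.timestamps.length > this.maxErrors;
  }
}
```

A caller would spawn the diagnostic pipeline the first time `record` returns true, passing along the accumulated error context.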

5.3 Status Broadcasting

WebSocket events broadcast to UI:

| Event | Payload | Trigger |
|-------|---------|---------|
| pipeline_started | pipeline_id, task_id | Pipeline spawn |
| agent_status | agent_id, status | Any status change |
| agent_message | agent, message | Agent log output |
| consensus_event | proposal_id, votes | Consensus activity |
| orchestration_started | model, agents | Orchestration begin |
| orchestration_complete | status, metrics | Orchestration end |
| error_threshold | pipeline_id, errors | Error budget breach |
| token_revoked | pipeline_id, reason | Vault revocation |
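On the client or server side, these events could be modeled as a discriminated union so that each handler sees only the fields its event carries. In this sketch the event and field names come from the list above; the `type` discriminator, the field types, and the helper names are assumptions about the wire format.

```typescript
// Field names follow the event list; field types are assumed.
type PipelineEvent =
  | { type: "pipeline_started"; pipeline_id: string; task_id: string }
  | { type: "agent_status"; agent_id: string; status: string }
  | { type: "agent_message"; agent: string; message: string }
  | { type: "consensus_event"; proposal_id: string; votes: Record<string, string> }
  | { type: "orchestration_started"; model: string; agents: string[] }
  | { type: "orchestration_complete"; status: string; metrics: Record<string, number> }
  | { type: "error_threshold"; pipeline_id: string; errors: number }
  | { type: "token_revoked"; pipeline_id: string; reason: string };

const EVENT_TYPES: ReadonlyArray<string> = [
  "pipeline_started", "agent_status", "agent_message", "consensus_event",
  "orchestration_started", "orchestration_complete", "error_threshold", "token_revoked",
];

// Serializes an event for a WebSocket text frame.
function encodeEvent(event: PipelineEvent): string {
  return JSON.stringify(event);
}

// Guards incoming frames against unknown event names.
function isKnownEventType(value: unknown): boolean {
  return typeof value === "string" && EVENT_TYPES.indexOf(value) !== -1;
}
```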

5.4 Structured Handoff Reports

When the error threshold is breached, the system generates a handoff report:

{
  "report_type": "error_handoff",
  "pipeline_id": "pipeline-abc123",
  "timestamp": "2026-01-24T22:30:00Z",
  "summary": {
    "total_errors": 6,
    "error_types": ["api_timeout", "validation_failure"],
    "affected_agents": ["ALPHA"],
    "last_successful_checkpoint": "ckpt-xyz"
  },
  "context": {
    "task_objective": "...",
    "progress_at_failure": 0.45,
    "blackboard_snapshot": {...}
  },
  "recommended_actions": [
    "Reduce API call rate",
    "Split task into smaller subtasks"
  ]
}
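The `summary` section of such a report could be derived from the recorded errors. The sketch below shows one possible aggregation; `ErrorRecord`, `ErrorSummary`, and `summarizeErrors` are hypothetical names, and the input shape (one record per error with its type and agent) is an assumption.

```typescript
// Hypothetical per-error record as the bug watcher might log it.
interface ErrorRecord {
  type: string;
  agent: string;
}

// Mirrors the "summary" object of the handoff report.
interface ErrorSummary {
  total_errors: number;
  error_types: string[];
  affected_agents: string[];
  last_successful_checkpoint: string | null;
}

// De-duplicates while keeping first-seen order.
function unique(values: string[]): string[] {
  return Array.from(new Set(values));
}

function summarizeErrors(errors: ErrorRecord[], lastCheckpoint: string | null): ErrorSummary {
  return {
    total_errors: errors.length,
    error_types: unique(errors.map((e) => e.type)),
    affected_agents: unique(errors.map((e) => e.agent)),
    last_successful_checkpoint: lastCheckpoint,
  };
}
```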

6. UI Components

6.1 Pipeline Status Panel

┌──────────────────────────────────────────────────────────────────┐
│ Pipeline: pipeline-abc123                          [ORCHESTRATING]│
├──────────────────────────────────────────────────────────────────┤
│ Objective: Design distributed event-driven architecture...       │
│ Model: anthropic/claude-sonnet-4                                 │
│ Started: 2026-01-24 22:15:00 UTC                                │
├──────────────────────────────────────────────────────────────────┤
│ AGENTS                                                           │
│ ┌─────────┐  ┌─────────┐  ┌─────────┐                           │
│ │  ALPHA  │  │  BETA   │  │  GAMMA  │                           │
│ │ ████░░░ │  │ ██████░ │  │ ░░░░░░░ │                           │
│ │  45%    │  │  75%    │  │ PENDING │                           │
│ │ WORKING │  │ WAITING │  │         │                           │
│ └─────────┘  └─────────┘  └─────────┘                           │
├──────────────────────────────────────────────────────────────────┤
│ METRICS                                                          │
│ Messages: 24  │  Conflicts: 1/1 resolved  │  Score: 72%         │
├──────────────────────────────────────────────────────────────────┤
│ RECENT ACTIVITY                                                  │
│ [22:16:32] ALPHA: Generated 3 initial proposals                  │
│ [22:16:45] BETA: Evaluating proposal prop-a1b2c3                │
│ [22:17:01] BETA: Proposal accepted with score 0.85              │
└──────────────────────────────────────────────────────────────────┘

6.2 Agent Lifecycle Cards

Each agent displays:

  • Role badge (ALPHA/BETA/GAMMA)
  • Status indicator with color
  • Progress bar
  • Current task label
  • Message counters
  • Error indicator (if any)

7. Implementation Checklist

Backend (server.ts)

  • Pipeline spawn with auto_continue
  • Orchestration trigger after REPORT
  • Agent process spawning (Python + Bun)
  • WebSocket status broadcasting
  • Diagnostic agent (GAMMA) spawning on error
  • Vault token issuance per pipeline
  • Token renewal loop (every 30 minutes)
  • Observability-driven revocation
  • Error threshold monitoring
  • Structured handoff reports

Coordination (coordination.ts)

  • Blackboard shared memory
  • MessageBus point-to-point
  • AgentStateManager
  • SpawnController conditions
  • MetricsCollector
  • Token integration via pipeline context
  • Error budget tracking

Orchestrator (orchestrator.ts)

  • Multi-agent initialization
  • GAMMA spawn on conditions
  • Consensus checking
  • Performance analysis
  • Receive pipeline ID from environment
  • Error reporting to observability

UI/API

  • Pipeline list view
  • Real-time log streaming
  • Agent lifecycle status API
  • Pipeline metrics endpoint
  • Error budget API
  • Token status/revoke/renew APIs
  • Handoff report generation
  • Diagnostic pipeline spawning
  • Consensus failure detection (exit code 2)
  • Consensus failure context recording
  • Fallback options (rerun, escalate, accept, download)
  • Failure report download
  • UI consensus failure alert with action buttons
  • Failure details modal
  • WebSocket notifications for consensus events
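The consensus failure items above rely on the orchestrator exiting with code 2 on consensus failure, distinct from code 1 for a general error, and on the pipeline being marked complete only when consensus is achieved or the user accepts a fallback. A hedged sketch of that logic follows; the type and function names are hypothetical, and treating exit code 0 as "consensus reached" and any other unexpected code as an error are assumptions.

```typescript
type RunOutcome = "CONSENSUS" | "ERROR" | "CONSENSUS_FAILURE";

// Fallback options offered to the user after a consensus failure.
type FallbackChoice = "RERUN_SAME" | "RERUN_WITH_GAMMA" | "ESCALATE_TIER" | "ACCEPT_PARTIAL";

function classifyExitCode(code: number): RunOutcome {
  if (code === 0) return "CONSENSUS";          // assumption: 0 means consensus reached
  if (code === 2) return "CONSENSUS_FAILURE";  // distinct from a general error
  return "ERROR";                              // 1, or any other code (assumption)
}

// Complete only on consensus, or when the user accepts the partial result.
function isPipelineComplete(outcome: RunOutcome, choice?: FallbackChoice): boolean {
  if (outcome === "CONSENSUS") return true;
  return outcome === "CONSENSUS_FAILURE" && choice === "ACCEPT_PARTIAL";
}
```

Under this reading, the rerun and escalate options start a new run and leave the pipeline incomplete until that run finishes.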

8. API Endpoints

Pipeline Control

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/spawn | POST | Spawn pipeline with auto_continue |
| /api/pipeline/continue | POST | Manually trigger orchestration |
| /api/pipeline/orchestration | GET | Get orchestration status |
| /api/pipeline/token | GET | Get pipeline token status |
| /api/pipeline/revoke | POST | Revoke pipeline token |
| /api/active-pipelines | GET | List active pipelines |
| /api/pipeline/logs | GET | Get pipeline logs |
| /api/pipeline/metrics | GET | Get pipeline metrics |

Agent Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/agents | GET | List all agents |
| /api/agents/:id/status | GET | Get agent status |
| /api/agents/:id/messages | GET | Get agent message log |

Observability

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/observability/errors | GET | Get error summary |
| /api/observability/handoff | POST | Generate handoff report |
| /api/observability/revoke | POST | Trigger token revocation |

Last updated: 2026-01-24