profit 09be7eff4b Add consensus failure handling with fallback options for multi-agent pipelines

Implements detection and recovery for when agents fail to reach consensus:
- Orchestrator exits with code 2 on consensus failure (distinct from error=1)
- Records failed run context (proposals, agent states, conflicts) to Dragonfly
- Provides fallback options: rerun same, rerun with GAMMA, escalate tier, accept partial
- Adds UI alert with action buttons for user-driven recovery
- Adds failure details modal and downloadable failure report
- Only marks pipeline complete when consensus achieved or user accepts fallback

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-24 18:24:19 -05:00

Multi-Agent Pipeline Architecture

Overview

This document describes the architecture for the production multi-agent pipeline system, including Vault token management, agent lifecycle, error handling, and observability integration.

Document Date: 2026-01-24
Status: IMPLEMENTED

1. Pipeline Flow

                    ┌─────────────────────────────────────────────────────────────────┐
                    │                      PIPELINE LIFECYCLE                          │
                    └─────────────────────────────────────────────────────────────────┘

    ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌───────────────┐     ┌───────────┐
    │  SPAWN  │────▶│ RUNNING │────▶│ REPORT  │────▶│ ORCHESTRATING │────▶│ COMPLETED │
    └─────────┘     └─────────┘     └─────────┘     └───────────────┘     └───────────┘
         │               │               │                  │                   │
         │               │               │                  │                   │
    ┌────▼────┐     ┌────▼────┐     ┌────▼────┐      ┌─────▼─────┐       ┌─────▼─────┐
    │ Issue   │     │ Agent   │     │ Report  │      │ ALPHA+BETA│       │ Consensus │
    │ Vault   │     │ Status  │     │ Ready   │      │ Parallel  │       │ Achieved  │
    │ Token   │     │ Updates │     │         │      │           │       │           │
    └─────────┘     └─────────┘     └─────────┘      └───────────┘       └───────────┘
                                                            │
                                                    ┌───────▼───────┐
                                                    │ Error/Stuck?  │
                                                    └───────┬───────┘
                                                            │ YES
                                                    ┌───────▼───────┐
                                                    │ SPAWN GAMMA   │
                                                    │ (Diagnostic)  │
                                                    └───────────────┘
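
The lifecycle above can be read as a small state machine. The sketch below is illustrative only: `PipelineStatus`, `TRANSITIONS`, and `canTransition` are hypothetical names, and the `FAILED` state is taken from section 3.3 rather than from the diagram.

```typescript
// Hypothetical model of the lifecycle diagram; all names are illustrative.
type PipelineStatus =
  | "SPAWN"
  | "RUNNING"
  | "REPORT"
  | "ORCHESTRATING"
  | "COMPLETED"
  | "FAILED";

// Allowed forward transitions. FAILED comes from section 3.3, not the diagram.
const TRANSITIONS: Record<PipelineStatus, PipelineStatus[]> = {
  SPAWN: ["RUNNING"],
  RUNNING: ["REPORT"],
  REPORT: ["ORCHESTRATING"],
  ORCHESTRATING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

// True when the diagram permits moving directly from `from` to `to`.
function canTransition(from: PipelineStatus, to: PipelineStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Rejecting any status update that `canTransition` does not allow is one way to keep a pipeline from jumping straight from SPAWN to COMPLETED.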

2. Vault Token Management

2.1 Token Lifecycle

Each pipeline receives a dedicated, renewable Vault token (2-hour TTL) that persists for the entire orchestration:

Pipeline Start
      │
      ▼
┌─────────────────────────────────────┐
│ 1. Request Pipeline Token from Vault │
│    - AppRole: pipeline-orchestrator  │
│    - TTL: 2 hours (renewable)        │
│    - Policies: pipeline-agent        │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 2. Store Token in Redis             │
│    Key: pipeline:{id}:vault_token   │
│    + Encrypted with transit key     │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 3. Pass Token to All Agents         │
│    - ALPHA, BETA, GAMMA inherit     │
│    - Token renewal every 30 min     │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 4. Observability Monitors Token     │
│    - Can revoke for policy violation│
│    - Logs all token usage           │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 5. Token Revoked on Completion      │
│    - Or on error threshold breach   │
└─────────────────────────────────────┘
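
The renewal timing in steps 1 to 3 can be sketched as a pure decision function. This is a minimal sketch, not the actual renewal loop: `vaultTokenKey`, `TokenState`, and `nextTokenAction` are hypothetical names, and it assumes that each renewal resets the 2-hour TTL.

```typescript
// TTL and renewal interval come from the lifecycle above; names are hypothetical.
const TOKEN_TTL_MS = 2 * 60 * 60 * 1000; // TTL: 2 hours
const RENEW_INTERVAL_MS = 30 * 60 * 1000; // renewal every 30 minutes

// Key layout from step 2 of the lifecycle.
function vaultTokenKey(pipelineId: string): string {
  return `pipeline:${pipelineId}:vault_token`;
}

interface TokenState {
  lastRenewedAt: number; // epoch ms of issuance or of the most recent renewal
  revoked: boolean;
}

type TokenAction = "none" | "renew" | "expired";

// Decide what a renewal loop should do at time `now` (epoch ms).
// Assumption: a renewal resets the full TTL.
function nextTokenAction(state: TokenState, now: number): TokenAction {
  if (state.revoked) return "none";
  const age = now - state.lastRenewedAt;
  if (age >= TOKEN_TTL_MS) return "expired"; // too late to renew
  if (age >= RENEW_INTERVAL_MS) return "renew";
  return "none";
}
```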

2.2 Token Policies

Pipeline Agent Policy (pipeline-agent.hcl):

# Read API keys for OpenRouter
path "secret/data/api-keys/*" {
  capabilities = ["read"]
}

# Read service credentials (DragonflyDB)
path "secret/data/services/*" {
  capabilities = ["read"]
}

# Agent-specific secrets
path "secret/data/agents/{{identity.entity.aliases.auth_approle.metadata.pipeline_id}}/*" {
  capabilities = ["read", "create", "update"]
}

# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}

2.3 Token Revocation Triggers

The observability layer can revoke a pipeline token mid-run under any of the following conditions:

| Condition | Threshold | Action |
|---|---|---|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL violation | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |
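
The revocation triggers can be expressed as a single evaluation function. The sketch below is an assumption-laden illustration: `PipelineHealth`, `RevocationAction`, and `evaluateRevocation` are hypothetical names, the priority order between conditions is assumed, and only the first step of "rate limit, then revoke" is modeled.

```typescript
// Hypothetical health snapshot; field names are illustrative.
interface PipelineHealth {
  errorsPerMinute: number;
  secondsSinceProgress: number;
  criticalViolations: number;
  apiCallsPerMinute: number;
}

type RevocationAction =
  | "REVOKE_PIPELINE_IMMEDIATELY"
  | "REVOKE_AND_SPAWN_DIAGNOSTIC"
  | "REVOKE_AGENT_TOKEN"
  | "RATE_LIMIT"
  | "NONE";

// Thresholds are the documented ones; the order of checks is an assumption.
function evaluateRevocation(h: PipelineHealth): RevocationAction {
  if (h.criticalViolations > 0) return "REVOKE_PIPELINE_IMMEDIATELY";
  if (h.errorsPerMinute > 5) return "REVOKE_AND_SPAWN_DIAGNOSTIC";
  if (h.secondsSinceProgress > 60) return "REVOKE_AGENT_TOKEN";
  if (h.apiCallsPerMinute > 100) return "RATE_LIMIT"; // revocation would follow if abuse continues
  return "NONE";
}
```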

3. Report → Orchestration Transition

3.1 Automatic Trigger

When a pipeline reaches the REPORT phase with auto_continue=true, the completion check triggers orchestration automatically:

async function checkPipelineCompletion(pipelineId: string) {
  // ... existing completion check ...

  if (autoContinue && anySuccess) {
    // Trigger OpenRouter orchestration
    triggerOrchestration(pipelineId, taskId, objective, model, timeout);
  }
}

3.2 Manual Trigger

API endpoint for manual orchestration trigger:

POST /api/pipeline/continue
Body: { pipeline_id, model?, timeout? }
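
A client might assemble the request as follows. This is a sketch under stated assumptions: `buildContinueRequest` and `HttpRequestSpec` are hypothetical helpers, the JSON content type is assumed, and the base URL and the unit of `timeout` are not specified by this document.

```typescript
// Request body as documented: pipeline_id is required, model and timeout are optional.
interface ContinueRequest {
  pipeline_id: string;
  model?: string;
  timeout?: number; // unit not specified in this document
}

interface HttpRequestSpec {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper that builds the arguments for fetch(url, init).
function buildContinueRequest(baseUrl: string, body: ContinueRequest): HttpRequestSpec {
  if (!body.pipeline_id) throw new Error("pipeline_id is required");
  return {
    url: `${baseUrl}/api/pipeline/continue`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" }, // assumed content type
      body: JSON.stringify(body),
    },
  };
}
```

The result can be passed to `fetch(spec.url, spec.init)` once a real base URL is known.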

3.3 Orchestration Process

1. Status Update: Pipeline status → ORCHESTRATING
2. Agent Spawn: Launch ALPHA and BETA agents in parallel
3. WebSocket Broadcast: Send real-time status to the UI
4. Monitor Loop: Check for stuck/conflict conditions
5. GAMMA Spawn: If thresholds are exceeded, spawn the GAMMA mediator
6. Consensus: Drive the agents to final agreement
7. Completion: Status → COMPLETED or FAILED

4. Agent Multiplication and Handoff

4.1 Agent Roles

| Agent | Role | Spawn Condition |
|---|---|---|
| ALPHA | Research & Analysis | Always (initial) |
| BETA | Implementation & Synthesis | Always (initial) |
| GAMMA | Mediator & Resolver | On error/stuck/conflict/complexity |

4.2 Spawn Conditions

const SPAWN_CONDITIONS = {
  STUCK: {
    threshold: 30,  // seconds of inactivity
    description: "Spawn GAMMA when agents stuck"
  },
  CONFLICT: {
    threshold: 3,   // unresolved conflicts
    description: "Spawn GAMMA for mediation"
  },
  COMPLEXITY: {
    threshold: 0.8, // complexity score
    description: "Spawn GAMMA for decomposition"
  },
  SUCCESS: {
    threshold: 1.0, // task completion
    description: "Spawn GAMMA for validation"
  }
};
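
A monitor loop could check these thresholds as sketched below. This is illustrative rather than the actual SpawnController: `CoordinationSnapshot` and `gammaSpawnReason` are hypothetical, the comparisons are assumed to be inclusive (`>=`), and the order of checks is an assumption.

```typescript
type SpawnReason = "STUCK" | "CONFLICT" | "COMPLEXITY" | "SUCCESS";

// Hypothetical snapshot of coordination state; field names are illustrative.
interface CoordinationSnapshot {
  secondsInactive: number;
  unresolvedConflicts: number;
  complexityScore: number; // assumed range 0..1
  taskCompletion: number; // assumed range 0..1
}

// Threshold values copied from SPAWN_CONDITIONS.
const THRESHOLDS: Record<SpawnReason, number> = {
  STUCK: 30,
  CONFLICT: 3,
  COMPLEXITY: 0.8,
  SUCCESS: 1.0,
};

// Returns the first matching spawn reason, or null when GAMMA is not needed.
function gammaSpawnReason(s: CoordinationSnapshot): SpawnReason | null {
  if (s.secondsInactive >= THRESHOLDS.STUCK) return "STUCK";
  if (s.unresolvedConflicts >= THRESHOLDS.CONFLICT) return "CONFLICT";
  if (s.complexityScore >= THRESHOLDS.COMPLEXITY) return "COMPLEXITY";
  if (s.taskCompletion >= THRESHOLDS.SUCCESS) return "SUCCESS";
  return null;
}
```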

4.3 Handoff Protocol

When GAMMA spawns, it receives:

- Full blackboard state (problem, solutions, progress)
- Message log from ALPHA/BETA
- Spawn reason and context
- Authority to direct other agents

// GAMMA handoff message
{
  type: "HANDOFF",
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT",
    tasks?: string[],
    diagnosis?: string,
    recommended_actions?: string[]
  }
}
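
Receiving agents may want to validate a handoff before acting on it. The following type guard is a sketch that mirrors the message shape above; `HandoffMessage`, `isStringArray`, and `isHandoffMessage` are hypothetical names, not part of the documented coordination API.

```typescript
type HandoffKind = "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT";

// Mirrors the handoff message shape shown above.
interface HandoffMessage {
  type: "HANDOFF";
  payload: {
    type: HandoffKind;
    tasks?: string[];
    diagnosis?: string;
    recommended_actions?: string[];
  };
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Runtime check that an untrusted value has the documented handoff shape.
function isHandoffMessage(msg: unknown): msg is HandoffMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as { type?: unknown; payload?: unknown };
  if (m.type !== "HANDOFF") return false;
  if (typeof m.payload !== "object" || m.payload === null) return false;
  const p = m.payload as Record<string, unknown>;
  const kindOk = p.type === "NEW_DIRECTION" || p.type === "SUBTASK_ASSIGNMENT";
  return (
    kindOk &&
    (p.tasks === undefined || isStringArray(p.tasks)) &&
    (p.diagnosis === undefined || typeof p.diagnosis === "string") &&
    (p.recommended_actions === undefined || isStringArray(p.recommended_actions))
  );
}
```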

4.4 Agent Lifecycle States

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐    ┌───────────┐
│  CREATED │───▶│  BUSY    │───▶│ WAITING  │───▶│ HANDED-OFF│───▶│ SUCCEEDED │
└──────────┘    └──────────┘    └──────────┘    └───────────┘    └───────────┘
                     │                                                   │
                     │              ┌──────────┐                         │
                     └─────────────▶│  ERROR   │◀────────────────────────┘
                                    └──────────┘

The UI displays each agent with:

- Current state (color-coded)
- Progress percentage
- Current task description
- Message count (sent/received)
- Error count

5. Observability Integration

5.1 Real-Time Metrics

All metrics are stored in DragonflyDB and broadcast over WebSocket:

// Metrics keys
`metrics:${taskId}` → {
  total_messages: number,
  direct_messages: number,
  blackboard_writes: number,
  blackboard_reads: number,
  conflicts_detected: number,
  conflicts_resolved: number,
  gamma_spawned: boolean,
  gamma_spawn_reason: string,
  performance_score: number
}
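
The metrics record can be typed as shown in this sketch. `TaskMetrics` follows the fields listed above; `metricsKey`, `emptyMetrics`, and `conflictResolutionRate` are hypothetical helpers, and the empty-string default for `gamma_spawn_reason` is an assumption. How `performance_score` is computed is not described here, so the sketch does not derive it.

```typescript
// Field names and types follow the metrics listing above.
interface TaskMetrics {
  total_messages: number;
  direct_messages: number;
  blackboard_writes: number;
  blackboard_reads: number;
  conflicts_detected: number;
  conflicts_resolved: number;
  gamma_spawned: boolean;
  gamma_spawn_reason: string;
  performance_score: number;
}

// Key layout from the listing above.
function metricsKey(taskId: string): string {
  return `metrics:${taskId}`;
}

// Hypothetical zeroed record for a new task.
function emptyMetrics(): TaskMetrics {
  return {
    total_messages: 0,
    direct_messages: 0,
    blackboard_writes: 0,
    blackboard_reads: 0,
    conflicts_detected: 0,
    conflicts_resolved: 0,
    gamma_spawned: false,
    gamma_spawn_reason: "", // assumption: empty until GAMMA is spawned
    performance_score: 0,
  };
}

// Fraction of detected conflicts that were resolved; 1 when none were detected.
function conflictResolutionRate(m: TaskMetrics): number {
  return m.conflicts_detected === 0 ? 1 : m.conflicts_resolved / m.conflicts_detected;
}
```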

5.2 Error Loop Handling

Error Detected
      │
      ▼
┌─────────────────────┐
│ Log to bug_watcher  │
│ (SQLite + Redis)    │
└─────────────────────┘
      │
      ▼
┌─────────────────────┐     ┌─────────────────────┐
│ Check Error Budget  │────▶│ Budget Exceeded?    │
└─────────────────────┘     └─────────────────────┘
                                    │ YES
                                    ▼
                            ┌─────────────────────┐
                            │ Spawn Diagnostic    │
                            │ Pipeline with       │
                            │ Error Context       │
                            └─────────────────────┘
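
The budget check in this flow can be sketched as a sliding-window counter. `ErrorBudget` is a hypothetical class, not the actual bug_watcher implementation, and the example budget of 5 errors per minute is borrowed from the error-rate threshold in section 2.3.

```typescript
// Hypothetical sliding-window error budget; not the actual bug_watcher code.
class ErrorBudget {
  private timestamps: number[] = [];
  private readonly maxErrors: number;
  private readonly windowMs: number;

  constructor(maxErrors: number, windowMs: number) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
  }

  // Record an error at `now` (epoch ms); returns true when the budget is exceeded.
  record(now: number): boolean {
    this.timestamps.push(now);
    const cutoff = now - this.windowMs;
    // Keep only errors that fall inside the window ending at `now`.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.maxErrors;
  }
}

// Example budget: more than 5 errors within 60 seconds counts as a breach.
const budget = new ErrorBudget(5, 60_000);
```

When `record` returns true, the caller would spawn the diagnostic pipeline with the collected error context.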

5.3 Status Broadcasting

WebSocket events broadcast to UI:

| Event | Payload | Trigger |
|---|---|---|
| `pipeline_started` | pipeline_id, task_id | Pipeline spawn |
| `agent_status` | agent_id, status | Any status change |
| `agent_message` | agent, message | Agent log output |
| `consensus_event` | proposal_id, votes | Consensus activity |
| `orchestration_started` | model, agents | Orchestration begin |
| `orchestration_complete` | status, metrics | Orchestration end |
| `error_threshold` | pipeline_id, errors | Error budget breach |
| `token_revoked` | pipeline_id, reason | Vault revocation |
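
On the UI side, these events lend themselves to a discriminated union. The sketch below is hypothetical: the envelope (an `event` field with the payload fields alongside it) and all field types are assumptions, since only event and payload field names are documented. It covers a subset of the events.

```typescript
// Assumed envelope: { event, ...payload }. Field types are guesses.
type PipelineEvent =
  | { event: "pipeline_started"; pipeline_id: string; task_id: string }
  | { event: "agent_status"; agent_id: string; status: string }
  | { event: "agent_message"; agent: string; message: string }
  | { event: "error_threshold"; pipeline_id: string; errors: number }
  | { event: "token_revoked"; pipeline_id: string; reason: string };

// Produce a one-line activity entry; the switch narrows each payload by event name.
function describeEvent(e: PipelineEvent): string {
  switch (e.event) {
    case "pipeline_started":
      return `pipeline ${e.pipeline_id} started for task ${e.task_id}`;
    case "agent_status":
      return `agent ${e.agent_id} is now ${e.status}`;
    case "agent_message":
      return `${e.agent}: ${e.message}`;
    case "error_threshold":
      return `pipeline ${e.pipeline_id} breached its error budget (${e.errors} errors)`;
    case "token_revoked":
      return `token revoked for ${e.pipeline_id}: ${e.reason}`;
  }
}
```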

5.4 Structured Handoff Reports

When the error threshold is breached, the system generates a structured handoff report:

{
  "report_type": "error_handoff",
  "pipeline_id": "pipeline-abc123",
  "timestamp": "2026-01-24T22:30:00Z",
  "summary": {
    "total_errors": 6,
    "error_types": ["api_timeout", "validation_failure"],
    "affected_agents": ["ALPHA"],
    "last_successful_checkpoint": "ckpt-xyz"
  },
  "context": {
    "task_objective": "...",
    "progress_at_failure": 0.45,
    "blackboard_snapshot": {...}
  },
  "recommended_actions": [
    "Reduce API call rate",
    "Split task into smaller subtasks"
  ]
}
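
A report of this shape could be assembled from collected errors as sketched below. `buildErrorHandoffReport`, `PipelineError`, and `HandoffInput` are hypothetical names; only the output field names come from the example above.

```typescript
// Hypothetical input types; only the report field names come from the example above.
interface PipelineError {
  type: string;
  agent: string;
}

interface HandoffInput {
  pipelineId: string;
  timestamp: string; // ISO 8601, supplied by the caller
  errors: PipelineError[];
  lastCheckpoint: string;
  taskObjective: string;
  progress: number;
  blackboard: Record<string, unknown>;
  recommendedActions: string[];
}

// Build an error_handoff report, de-duplicating error types and affected agents.
function buildErrorHandoffReport(input: HandoffInput) {
  return {
    report_type: "error_handoff",
    pipeline_id: input.pipelineId,
    timestamp: input.timestamp,
    summary: {
      total_errors: input.errors.length,
      error_types: Array.from(new Set(input.errors.map((e) => e.type))),
      affected_agents: Array.from(new Set(input.errors.map((e) => e.agent))),
      last_successful_checkpoint: input.lastCheckpoint,
    },
    context: {
      task_objective: input.taskObjective,
      progress_at_failure: input.progress,
      blackboard_snapshot: input.blackboard,
    },
    recommended_actions: input.recommendedActions,
  };
}
```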

6. UI Components

6.1 Pipeline Status Panel

┌──────────────────────────────────────────────────────────────────┐
│ Pipeline: pipeline-abc123                          [ORCHESTRATING]│
├──────────────────────────────────────────────────────────────────┤
│ Objective: Design distributed event-driven architecture...       │
│ Model: anthropic/claude-sonnet-4                                 │
│ Started: 2026-01-24 22:15:00 UTC                                │
├──────────────────────────────────────────────────────────────────┤
│ AGENTS                                                           │
│ ┌─────────┐  ┌─────────┐  ┌─────────┐                           │
│ │  ALPHA  │  │  BETA   │  │  GAMMA  │                           │
│ │ ████░░░ │  │ ██████░ │  │ ░░░░░░░ │                           │
│ │  45%    │  │  75%    │  │ PENDING │                           │
│ │ WORKING │  │ WAITING │  │         │                           │
│ └─────────┘  └─────────┘  └─────────┘                           │
├──────────────────────────────────────────────────────────────────┤
│ METRICS                                                          │
│ Messages: 24  │  Conflicts: 1/1 resolved  │  Score: 72%         │
├──────────────────────────────────────────────────────────────────┤
│ RECENT ACTIVITY                                                  │
│ [22:16:32] ALPHA: Generated 3 initial proposals                  │
│ [22:16:45] BETA: Evaluating proposal prop-a1b2c3                │
│ [22:17:01] BETA: Proposal accepted with score 0.85              │
└──────────────────────────────────────────────────────────────────┘

6.2 Agent Lifecycle Cards

Each agent card displays:

- Role badge (ALPHA/BETA/GAMMA)
- Status indicator with color
- Progress bar
- Current task label
- Message counters
- Error indicator (if any)

7. Implementation Checklist

Backend (server.ts)

- Pipeline spawn with auto_continue
- Orchestration trigger after REPORT
- Agent process spawning (Python + Bun)
- WebSocket status broadcasting
- Diagnostic agent (GAMMA) spawning on error
- Vault token issuance per pipeline
- Token renewal loop (every 30 minutes)
- Observability-driven revocation
- Error threshold monitoring
- Structured handoff reports

Coordination (coordination.ts)

- Blackboard shared memory
- MessageBus point-to-point
- AgentStateManager
- SpawnController conditions
- MetricsCollector
- Token integration via pipeline context
- Error budget tracking

Orchestrator (orchestrator.ts)

- Multi-agent initialization
- GAMMA spawn on conditions
- Consensus checking
- Performance analysis
- Receive pipeline ID from environment
- Error reporting to observability

UI/API

- Pipeline list view
- Real-time log streaming
- Agent lifecycle status API
- Pipeline metrics endpoint
- Error budget API
- Token status/revoke/renew APIs
- Handoff report generation
- Diagnostic pipeline spawning
- Consensus failure detection (exit code 2)
- Consensus failure context recording
- Fallback options (rerun, escalate, accept, download)
- Failure report download
- UI consensus failure alert with action buttons
- Failure details modal
- WebSocket notifications for consensus events
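
The consensus-failure items can be illustrated with a small exit-code classifier. This sketch is hypothetical: `classifyExit`, `FallbackOption`, and `mayMarkComplete` are illustrative names, exit code 0 is assumed to mean success, and only the "accept partial" fallback is assumed to complete the pipeline without a rerun.

```typescript
type RunOutcome = "SUCCESS" | "ERROR" | "CONSENSUS_FAILURE";

// Exit code 2 signals consensus failure; any other nonzero code is treated as an error.
// Assumption: exit code 0 means consensus was achieved.
function classifyExit(code: number): RunOutcome {
  if (code === 0) return "SUCCESS";
  if (code === 2) return "CONSENSUS_FAILURE";
  return "ERROR";
}

// Fallback options offered after a consensus failure; identifiers are illustrative.
type FallbackOption =
  | "rerun_same"
  | "rerun_with_gamma"
  | "escalate_tier"
  | "accept_partial";

// A pipeline is complete only on consensus or when the user accepts the partial result.
function mayMarkComplete(outcome: RunOutcome, userChoice?: FallbackOption): boolean {
  if (outcome === "SUCCESS") return true;
  return outcome === "CONSENSUS_FAILURE" && userChoice === "accept_partial";
}
```

Under these assumptions, the rerun and escalate options start a new run whose own exit code is classified the same way.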

8. API Endpoints

Pipeline Control

| Endpoint | Method | Description |
|---|---|---|
| `/api/spawn` | POST | Spawn pipeline with auto_continue |
| `/api/pipeline/continue` | POST | Manually trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/logs` | GET | Get pipeline logs |
| `/api/pipeline/metrics` | GET | Get pipeline metrics |

Agent Management

| Endpoint | Method | Description |
|---|---|---|
| `/api/agents` | GET | List all agents |
| `/api/agents/:id/status` | GET | Get agent status |
| `/api/agents/:id/messages` | GET | Get agent message log |

Observability

| Endpoint | Method | Description |
|---|---|---|
| `/api/observability/errors` | GET | Get error summary |
| `/api/observability/handoff` | POST | Generate handoff report |
| `/api/observability/revoke` | POST | Trigger token revocation |

Last updated: 2026-01-24
