# Multi-Agent Pipeline Architecture

## Overview

This document describes the architecture for the production multi-agent pipeline system, including Vault token management, agent lifecycle, error handling, and observability integration.

**Document Date:** 2026-01-24
**Status:** IMPLEMENTED

---

## 1. Pipeline Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                       PIPELINE LIFECYCLE                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌───────────────┐     ┌───────────┐
│  SPAWN  │────▶│ RUNNING │────▶│ REPORT  │────▶│ ORCHESTRATING │────▶│ COMPLETED │
└─────────┘     └─────────┘     └─────────┘     └───────────────┘     └───────────┘
     │               │               │                  │                   │
     │               │               │                  │                   │
┌────▼────┐     ┌────▼────┐     ┌────▼────┐       ┌─────▼─────┐       ┌─────▼─────┐
│ Issue   │     │ Agent   │     │ Report  │       │ ALPHA+BETA│       │ Consensus │
│ Vault   │     │ Status  │     │ Ready   │       │ Parallel  │       │ Achieved  │
│ Token   │     │ Updates │     │         │       │           │       │           │
└─────────┘     └─────────┘     └─────────┘       └───────────┘       └───────────┘
                                                        │
                                                ┌───────▼───────┐
                                                │ Error/Stuck?  │
                                                └───────┬───────┘
                                                        │ YES
                                                ┌───────▼───────┐
                                                │  SPAWN GAMMA  │
                                                │  (Diagnostic) │
                                                └───────────────┘
```

---

## 2. Vault Token Management

### 2.1 Token Lifecycle

Each pipeline receives a dedicated, renewable Vault token (2-hour TTL, renewed every 30 minutes) that persists for the entire orchestration and is revoked on completion or when the error threshold is breached:

```
Pipeline Start
       │
       ▼
┌──────────────────────────────────────┐
│ 1. Request Pipeline Token from Vault │
│    - AppRole: pipeline-orchestrator  │
│    - TTL: 2 hours (renewable)        │
│    - Policies: pipeline-agent        │
└──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│ 2. Store Token in Redis              │
│    Key: pipeline:{id}:vault_token    │
│    + Encrypted with transit key      │
└──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│ 3. Pass Token to All Agents          │
│    - ALPHA, BETA, GAMMA inherit      │
│    - Token renewal every 30 min      │
└──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│ 4. Observability Monitors Token      │
│    - Can revoke for policy violation │
│    - Logs all token usage            │
└──────────────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│ 5. Token Revoked on Completion       │
│    - Or on error threshold breach    │
└──────────────────────────────────────┘
```
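
The timing in the lifecycle above (2-hour TTL, renewal every 30 minutes) can be sketched as a pure scheduling check. This is a minimal illustration rather than the server's actual code: the `TokenLease` shape and the `issueLease`, `needsRenewal`, `isExpired`, and `renewLease` helpers are hypothetical names, no Vault API is called, and it assumes that a renewal restarts the TTL from the renewal time.

```typescript
// Hypothetical lease record; field names are assumptions for illustration.
interface TokenLease {
  ttlMs: number;           // lease duration granted at issue or renewal
  lastRenewedAtMs: number; // last successful renewal, or issue time if never renewed
}

const TOKEN_TTL_MS = 2 * 60 * 60 * 1000;  // 2 hours, from the lifecycle diagram
const RENEW_INTERVAL_MS = 30 * 60 * 1000; // 30 minutes, from the lifecycle diagram

function issueLease(nowMs: number): TokenLease {
  return { ttlMs: TOKEN_TTL_MS, lastRenewedAtMs: nowMs };
}

// Assumption: the TTL is counted from the most recent renewal.
function isExpired(lease: TokenLease, nowMs: number): boolean {
  return nowMs >= lease.lastRenewedAtMs + lease.ttlMs;
}

// True once a full renewal interval has elapsed on a still-valid lease.
function needsRenewal(lease: TokenLease, nowMs: number): boolean {
  return !isExpired(lease, nowMs) && nowMs - lease.lastRenewedAtMs >= RENEW_INTERVAL_MS;
}

function renewLease(lease: TokenLease, nowMs: number): TokenLease {
  if (isExpired(lease, nowMs)) {
    throw new Error("lease expired; a new token must be issued");
  }
  return { ttlMs: lease.ttlMs, lastRenewedAtMs: nowMs };
}
```

With these numbers a renewal loop has three further chances to renew before the token lapses, so a single missed renewal does not end the pipeline.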

### 2.2 Token Policies

**Pipeline Agent Policy (`pipeline-agent.hcl`):**
```hcl
# Read API keys for OpenRouter
path "secret/data/api-keys/*" {
  capabilities = ["read"]
}

# Read service credentials (DragonflyDB)
path "secret/data/services/*" {
  capabilities = ["read"]
}

# Agent-specific secrets
# Templated path: `auth_approle` stands in for the AppRole mount accessor
# (listed by `vault auth list`).
path "secret/data/agents/{{identity.entity.aliases.auth_approle.metadata.pipeline_id}}/*" {
  capabilities = ["read", "create", "update"]
}

# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}
```

### 2.3 Token Revocation Triggers

The observability layer can revoke a pipeline token mid-run when any of the following conditions is met:

| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL violation | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |

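
The trigger table can be read as a small decision function. The sketch below is an assumption-laden illustration: the `PipelineSample` shape, the action names, and the precedence order (policy violation first) are hypothetical and not taken from the implementation. The "rate limit, then revoke" escalation is reduced to its first step.

```typescript
// Hypothetical action names; the table only describes them in prose.
type RevocationAction =
  | "NONE"
  | "REVOKE_AND_SPAWN_DIAGNOSTIC"
  | "REVOKE_AGENT_TOKEN"
  | "FULL_REVOCATION"
  | "RATE_LIMIT";

// Hypothetical snapshot of the signals the table refers to.
interface PipelineSample {
  errorsLastMinute: number;
  secondsSinceProgress: number;
  criticalViolations: number;
  apiCallsLastMinute: number;
}

// Thresholds come from the table; the ordering of checks is an assumption.
function evaluateRevocation(s: PipelineSample): RevocationAction {
  if (s.criticalViolations > 0) return "FULL_REVOCATION";
  if (s.errorsLastMinute > 5) return "REVOKE_AND_SPAWN_DIAGNOSTIC";
  if (s.secondsSinceProgress > 60) return "REVOKE_AGENT_TOKEN";
  if (s.apiCallsLastMinute > 100) return "RATE_LIMIT"; // first step of "rate limit, then revoke"
  return "NONE";
}
```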
---

## 3. Report → Orchestration Transition

### 3.1 Automatic Trigger

When a pipeline reaches the REPORT phase with `auto_continue=true`, orchestration is triggered automatically:

```typescript
async function checkPipelineCompletion(pipelineId: string) {
  // ... existing completion check ...

  if (autoContinue && anySuccess) {
    // Trigger OpenRouter orchestration
    triggerOrchestration(pipelineId, taskId, objective, model, timeout);
  }
}
```

### 3.2 Manual Trigger

Orchestration can also be triggered manually through the API:

```
POST /api/pipeline/continue
Body: { pipeline_id, model?, timeout? }
```
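
A handler for this endpoint would need to validate the body before triggering orchestration. The following is a minimal sketch under stated assumptions: `ContinueRequest` and `parseContinueRequest` are hypothetical names, `model` is assumed to be a string, and `timeout` is assumed to be a positive number (the document does not state its unit).

```typescript
// Body shape from the endpoint description; field types are assumptions.
interface ContinueRequest {
  pipeline_id: string;
  model?: string;
  timeout?: number;
}

type ParseResult =
  | { ok: true; value: ContinueRequest }
  | { ok: false; error: string };

function parseContinueRequest(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const fields = body as Record<string, unknown>;
  const pipelineId = fields["pipeline_id"];
  const model = fields["model"];
  const timeout = fields["timeout"];

  // pipeline_id is the only required field.
  if (typeof pipelineId !== "string" || pipelineId.length === 0) {
    return { ok: false, error: "pipeline_id is required" };
  }
  if (model !== undefined && typeof model !== "string") {
    return { ok: false, error: "model must be a string" };
  }
  if (timeout !== undefined && (typeof timeout !== "number" || !Number.isFinite(timeout) || timeout <= 0)) {
    return { ok: false, error: "timeout must be a positive number" };
  }

  // Copy only the known fields so unexpected keys are dropped.
  const value: ContinueRequest = { pipeline_id: pipelineId };
  if (model !== undefined) value.model = model;
  if (timeout !== undefined) value.timeout = timeout;
  return { ok: true, value };
}
```

Returning a tagged result instead of throwing lets the handler map failures to a 400 response without a try/catch.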

### 3.3 Orchestration Process

1. **Status Update**: Pipeline status → `ORCHESTRATING`
2. **Agent Spawn**: Launch ALPHA and BETA agents in parallel
3. **WebSocket Broadcast**: Real-time status to UI
4. **Monitor Loop**: Check for stuck/conflict conditions
5. **GAMMA Spawn**: If thresholds exceeded, spawn mediator
6. **Consensus**: Drive to final agreement
7. **Completion**: Status → `COMPLETED` or `FAILED`

---

## 4. Agent Multiplication and Handoff

### 4.1 Agent Roles

| Agent | Role | Spawn Condition |
|-------|------|-----------------|
| ALPHA | Research & Analysis | Always (initial) |
| BETA | Implementation & Synthesis | Always (initial) |
| GAMMA | Mediator & Resolver | On error/stuck/conflict/complexity |

### 4.2 Spawn Conditions

```typescript
const SPAWN_CONDITIONS = {
  STUCK: {
    threshold: 30, // seconds of inactivity
    description: "Spawn GAMMA when agents stuck"
  },
  CONFLICT: {
    threshold: 3, // unresolved conflicts
    description: "Spawn GAMMA for mediation"
  },
  COMPLEXITY: {
    threshold: 0.8, // complexity score
    description: "Spawn GAMMA for decomposition"
  },
  SUCCESS: {
    threshold: 1.0, // task completion
    description: "Spawn GAMMA for validation"
  }
};
```
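
One way these thresholds might be checked in the monitor loop is sketched below. The `CoordinationSnapshot` shape and `gammaSpawnReason` function are hypothetical; the use of `>=` comparisons and the order in which conditions are tested are assumptions, since the configuration above does not specify them.

```typescript
type SpawnReason = "STUCK" | "CONFLICT" | "COMPLEXITY" | "SUCCESS";

// Threshold values copied from SPAWN_CONDITIONS.
const SPAWN_THRESHOLDS: Record<SpawnReason, number> = {
  STUCK: 30,       // seconds of inactivity
  CONFLICT: 3,     // unresolved conflicts
  COMPLEXITY: 0.8, // complexity score
  SUCCESS: 1.0,    // task completion
};

// Hypothetical view of the coordination state sampled by the monitor loop.
interface CoordinationSnapshot {
  secondsInactive: number;
  unresolvedConflicts: number;
  complexityScore: number;
  completion: number;
}

// Returns the first condition that is met, or null when GAMMA is not needed.
function gammaSpawnReason(s: CoordinationSnapshot): SpawnReason | null {
  if (s.secondsInactive >= SPAWN_THRESHOLDS.STUCK) return "STUCK";
  if (s.unresolvedConflicts >= SPAWN_THRESHOLDS.CONFLICT) return "CONFLICT";
  if (s.complexityScore >= SPAWN_THRESHOLDS.COMPLEXITY) return "COMPLEXITY";
  if (s.completion >= SPAWN_THRESHOLDS.SUCCESS) return "SUCCESS";
  return null;
}
```

The returned reason could be recorded as the spawn reason passed to GAMMA during handoff.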

### 4.3 Handoff Protocol

When GAMMA spawns, it receives:
- Full blackboard state (problem, solutions, progress)
- Message log from ALPHA/BETA
- Spawn reason and context
- Authority to direct other agents

```typescript
// GAMMA handoff message
{
  type: "HANDOFF",
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT",
    tasks?: string[],
    diagnosis?: string,
    recommended_actions?: string[]
  }
}
```
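
The message shape above can be expressed as a type with a runtime guard, which is useful when handoff messages arrive as untyped JSON. This is a sketch: `HandoffMessage`, `isStringArray`, and `isHandoffMessage` are hypothetical names, and only the fields listed above are checked.

```typescript
interface HandoffMessage {
  type: "HANDOFF";
  payload: {
    type: "NEW_DIRECTION" | "SUBTASK_ASSIGNMENT";
    tasks?: string[];
    diagnosis?: string;
    recommended_actions?: string[];
  };
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Runtime check that an incoming value matches the handoff shape.
function isHandoffMessage(msg: unknown): msg is HandoffMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const outer = msg as Record<string, unknown>;
  if (outer["type"] !== "HANDOFF") return false;

  const payload = outer["payload"];
  if (typeof payload !== "object" || payload === null) return false;
  const p = payload as Record<string, unknown>;

  const kind = p["type"];
  if (kind !== "NEW_DIRECTION" && kind !== "SUBTASK_ASSIGNMENT") return false;

  // Optional fields must have the right type when present.
  const tasks = p["tasks"];
  const diagnosis = p["diagnosis"];
  const actions = p["recommended_actions"];
  if (tasks !== undefined && !isStringArray(tasks)) return false;
  if (diagnosis !== undefined && typeof diagnosis !== "string") return false;
  if (actions !== undefined && !isStringArray(actions)) return false;
  return true;
}
```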

### 4.4 Agent Lifecycle States

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐    ┌───────────┐
│ CREATED  │───▶│   BUSY   │───▶│ WAITING  │───▶│ HANDED-OFF│───▶│ SUCCEEDED │
└──────────┘    └──────────┘    └──────────┘    └───────────┘    └───────────┘
     │                                                   │
     │              ┌──────────┐                         │
     └─────────────▶│  ERROR   │◀────────────────────────┘
                    └──────────┘
```
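
The state diagram can be enforced with a small transition table. The sketch below is hypothetical (`AGENT_TRANSITIONS`, `canTransition`, and `transition` are illustrative names) and makes two assumptions the diagram does not settle: every non-terminal state may move to `ERROR`, and `SUCCEEDED` and `ERROR` are terminal.

```typescript
type AgentState = "CREATED" | "BUSY" | "WAITING" | "HANDED-OFF" | "SUCCEEDED" | "ERROR";

// Allowed next states per state; the ERROR edges are an assumption (see lead-in).
const AGENT_TRANSITIONS: Record<AgentState, readonly AgentState[]> = {
  "CREATED": ["BUSY", "ERROR"],
  "BUSY": ["WAITING", "ERROR"],
  "WAITING": ["HANDED-OFF", "ERROR"],
  "HANDED-OFF": ["SUCCEEDED", "ERROR"],
  "SUCCEEDED": [],
  "ERROR": [],
};

function canTransition(from: AgentState, to: AgentState): boolean {
  return AGENT_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: AgentState, to: AgentState): AgentState {
  if (!canTransition(from, to)) {
    throw new Error(`illegal agent transition: ${from} -> ${to}`);
  }
  return to;
}
```

Rejecting illegal moves in one place keeps the status shown in the UI consistent with the diagram.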

The UI displays each agent with:
- Current state (color-coded)
- Progress percentage
- Current task description
- Message count (sent/received)
- Error count

---

## 5. Observability Integration

### 5.1 Real-Time Metrics

All metrics are stored in DragonflyDB and broadcast over WebSocket:

```typescript
// Value stored under the key `metrics:${taskId}`
// (the interface name is illustrative)
interface TaskMetrics {
  total_messages: number;
  direct_messages: number;
  blackboard_writes: number;
  blackboard_reads: number;
  conflicts_detected: number;
  conflicts_resolved: number;
  gamma_spawned: boolean;
  gamma_spawn_reason: string;
  performance_score: number;
}
```

### 5.2 Error Loop Handling

```
Error Detected
           │
           ▼
┌─────────────────────┐
│ Log to bug_watcher  │
│ (SQLite + Redis)    │
└─────────────────────┘
           │
           ▼
┌─────────────────────┐     ┌─────────────────────┐
│ Check Error Budget  │────▶│ Budget Exceeded?    │
└─────────────────────┘     └─────────────────────┘
                                       │ YES
                                       ▼
                            ┌─────────────────────┐
                            │ Spawn Diagnostic    │
                            │ Pipeline with       │
                            │ Error Context       │
                            └─────────────────────┘
```
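
The "Check Error Budget" step can be sketched as a sliding-window counter. This is an illustrative sketch, not the `bug_watcher` implementation: the `ErrorBudget` class is hypothetical, it keeps timestamps in memory instead of SQLite or Redis, and its defaults assume the 5 errors/minute threshold from section 2.3 is the budget.

```typescript
// Sliding-window error budget: exceeded when more than `maxErrors`
// errors fall inside the last `windowMs` milliseconds.
class ErrorBudget {
  private timestamps: number[] = [];
  private readonly maxErrors: number;
  private readonly windowMs: number;

  constructor(maxErrors: number = 5, windowMs: number = 60000) {
    this.maxErrors = maxErrors;
    this.windowMs = windowMs;
  }

  // Records one error at `nowMs` and reports whether the budget is now exceeded.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop errors that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.maxErrors;
  }
}
```

A caller would spawn the diagnostic pipeline the first time `record` returns `true`.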

### 5.3 Status Broadcasting

The following WebSocket events are broadcast to the UI:

| Event | Payload | Trigger |
|-------|---------|---------|
| `pipeline_started` | pipeline_id, task_id | Pipeline spawn |
| `agent_status` | agent_id, status | Any status change |
| `agent_message` | agent, message | Agent log output |
| `consensus_event` | proposal_id, votes | Consensus activity |
| `orchestration_started` | model, agents | Orchestration begin |
| `orchestration_complete` | status, metrics | Orchestration end |
| `error_threshold` | pipeline_id, errors | Error budget breach |
| `token_revoked` | pipeline_id, reason | Vault revocation |

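
The event table lends itself to a lookup that checks payload fields before an event is sent. In this sketch the event names and payload keys come from the table, while `EVENT_PAYLOAD_KEYS`, `encodeEvent`, and the flat `{ event, ...payload }` JSON envelope are assumptions; the actual wire format is not specified here.

```typescript
type EventName =
  | "pipeline_started"
  | "agent_status"
  | "agent_message"
  | "consensus_event"
  | "orchestration_started"
  | "orchestration_complete"
  | "error_threshold"
  | "token_revoked";

// Required payload keys per event, copied from the table.
const EVENT_PAYLOAD_KEYS: Record<EventName, readonly string[]> = {
  pipeline_started: ["pipeline_id", "task_id"],
  agent_status: ["agent_id", "status"],
  agent_message: ["agent", "message"],
  consensus_event: ["proposal_id", "votes"],
  orchestration_started: ["model", "agents"],
  orchestration_complete: ["status", "metrics"],
  error_threshold: ["pipeline_id", "errors"],
  token_revoked: ["pipeline_id", "reason"],
};

// Serializes an event, refusing payloads that lack a documented key.
function encodeEvent(event: EventName, payload: Record<string, unknown>): string {
  const missing = EVENT_PAYLOAD_KEYS[event].filter((key) => !(key in payload));
  if (missing.length > 0) {
    throw new Error(`${event} is missing payload keys: ${missing.join(", ")}`);
  }
  // Hypothetical envelope: the event name is placed last so the payload cannot override it.
  return JSON.stringify({ ...payload, event });
}
```

The resulting string is what a WebSocket `send` call would transmit to each connected UI client.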
### 5.4 Structured Handoff Reports

When the error threshold is breached, a handoff report is generated:

```json
{
  "report_type": "error_handoff",
  "pipeline_id": "pipeline-abc123",
  "timestamp": "2026-01-24T22:30:00Z",
  "summary": {
    "total_errors": 6,
    "error_types": ["api_timeout", "validation_failure"],
    "affected_agents": ["ALPHA"],
    "last_successful_checkpoint": "ckpt-xyz"
  },
  "context": {
    "task_objective": "...",
    "progress_at_failure": 0.45,
    "blackboard_snapshot": {...}
  },
  "recommended_actions": [
    "Reduce API call rate",
    "Split task into smaller subtasks"
  ]
}
```
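
The `summary` block of such a report could be derived from the logged errors roughly as follows. This is a sketch under assumptions: the `ErrorRecord` shape and the `summarizeErrors` helper are hypothetical, and the other report sections (`context`, `recommended_actions`) are left out because their derivation is not described here.

```typescript
// Hypothetical minimal error record as logged per failure.
interface ErrorRecord {
  type: string;  // e.g. "api_timeout"
  agent: string; // e.g. "ALPHA"
}

// Mirrors the "summary" object in the report above.
interface HandoffSummary {
  total_errors: number;
  error_types: string[];
  affected_agents: string[];
  last_successful_checkpoint: string | null;
}

// Keeps the first occurrence of each value, preserving order.
function unique(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

function summarizeErrors(errors: ErrorRecord[], lastCheckpoint: string | null): HandoffSummary {
  return {
    total_errors: errors.length,
    error_types: unique(errors.map((e) => e.type)),
    affected_agents: unique(errors.map((e) => e.agent)),
    last_successful_checkpoint: lastCheckpoint,
  };
}
```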

---

## 6. UI Components

### 6.1 Pipeline Status Panel

```
┌──────────────────────────────────────────────────────────────────┐
│ Pipeline: pipeline-abc123                         [ORCHESTRATING]│
├──────────────────────────────────────────────────────────────────┤
│ Objective: Design distributed event-driven architecture...       │
│ Model: anthropic/claude-sonnet-4                                 │
│ Started: 2026-01-24 22:15:00 UTC                                 │
├──────────────────────────────────────────────────────────────────┤
│ AGENTS                                                           │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐                              │
│ │  ALPHA  │ │  BETA   │ │  GAMMA  │                              │
│ │ ████░░░ │ │ ██████░ │ │ ░░░░░░░ │                              │
│ │   45%   │ │   75%   │ │ PENDING │                              │
│ │ WORKING │ │ WAITING │ │         │                              │
│ └─────────┘ └─────────┘ └─────────┘                              │
├──────────────────────────────────────────────────────────────────┤
│ METRICS                                                          │
│ Messages: 24 │ Conflicts: 1/1 resolved │ Score: 72%              │
├──────────────────────────────────────────────────────────────────┤
│ RECENT ACTIVITY                                                  │
│ [22:16:32] ALPHA: Generated 3 initial proposals                  │
│ [22:16:45] BETA: Evaluating proposal prop-a1b2c3                 │
│ [22:17:01] BETA: Proposal accepted with score 0.85               │
└──────────────────────────────────────────────────────────────────┘
```

### 6.2 Agent Lifecycle Cards

Each agent displays:
- Role badge (ALPHA/BETA/GAMMA)
- Status indicator with color
- Progress bar
- Current task label
- Message counters
- Error indicator (if any)

---

## 7. Implementation Checklist

### Backend (server.ts)

- [x] Pipeline spawn with auto_continue
- [x] Orchestration trigger after REPORT
- [x] Agent process spawning (Python + Bun)
- [x] WebSocket status broadcasting
- [x] Diagnostic agent (GAMMA) spawning on error
- [x] Vault token issuance per pipeline
- [x] Token renewal loop (every 30 minutes)
- [x] Observability-driven revocation
- [x] Error threshold monitoring
- [x] Structured handoff reports

### Coordination (coordination.ts)

- [x] Blackboard shared memory
- [x] MessageBus point-to-point
- [x] AgentStateManager
- [x] SpawnController conditions
- [x] MetricsCollector
- [x] Token integration via pipeline context
- [x] Error budget tracking

### Orchestrator (orchestrator.ts)

- [x] Multi-agent initialization
- [x] GAMMA spawn on conditions
- [x] Consensus checking
- [x] Performance analysis
- [x] Receive pipeline ID from environment
- [x] Error reporting to observability

### UI/API

- [x] Pipeline list view
- [x] Real-time log streaming
- [x] Agent lifecycle status API
- [x] Pipeline metrics endpoint
- [x] Error budget API
- [x] Token status/revoke/renew APIs
- [x] Handoff report generation
- [x] Diagnostic pipeline spawning

---

## 8. API Endpoints

### Pipeline Control

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/spawn` | POST | Spawn pipeline with auto_continue |
| `/api/pipeline/continue` | POST | Manually trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/logs` | GET | Get pipeline logs |
| `/api/pipeline/metrics` | GET | Get pipeline metrics |

### Agent Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/agents` | GET | List all agents |
| `/api/agents/:id/status` | GET | Get agent status |
| `/api/agents/:id/messages` | GET | Get agent message log |

### Observability

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/observability/errors` | GET | Get error summary |
| `/api/observability/handoff` | POST | Generate handoff report |
| `/api/observability/revoke` | POST | Trigger token revocation |

---

*Last updated: 2026-01-24*