agent-governance/docs/PRODUCTION_PIPELINE.md

# Production Pipeline: Report → OpenRouter Orchestration

## Overview

This document describes the automatic transition from the UI "view report" stage into the live multi-agent pipeline, including OpenRouter-driven parallel execution.

**Created:** 2026-01-24

**Status:** Implemented

---

## Architecture Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         UI DASHBOARD                                     │
│                                                                          │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐  │
│  │  SPAWN   │───▶│ RUNNING  │───▶│  REPORT  │───▶│ AUTO-ORCHESTRATE │  │
│  │ Pipeline │    │ Agents   │    │  Stage   │    │   (NEW)          │  │
│  └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘  │
│                                                           │             │
└───────────────────────────────────────────────────────────┼─────────────┘
                                                            │
                                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    OPENROUTER ORCHESTRATION                              │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    MultiAgentOrchestrator                        │   │
│  │                                                                   │   │
│  │  ┌─────────────┐              ┌─────────────┐                   │   │
│  │  │   ALPHA     │◄────────────▶│   BETA      │                   │   │
│  │  │  (Research) │   Messages   │ (Synthesis) │                   │   │
│  │  │   Python    │              │    Bun      │                   │   │
│  │  └──────┬──────┘              └──────┬──────┘                   │   │
│  │         │                            │                           │   │
│  │         └─────────┬──────────────────┘                           │   │
│  │                   │                                               │   │
│  │                   ▼                                               │   │
│  │            ┌─────────────┐                                       │   │
│  │            │   GAMMA     │  (Spawned on STUCK/CONFLICT)          │   │
│  │            │ (Mediator)  │                                       │   │
│  │            └─────────────┘                                       │   │
│  │                                                                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  Shared Infrastructure:                                                  │
│  • Blackboard (DragonflyDB) - Proposals, solutions, consensus           │
│  • MessageBus (Redis PubSub) - Agent coordination                       │
│  • MetricsCollector - Performance tracking                              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
                                                            │
                                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                       COMPLETION & AUDIT                                 │
│                                                                          │
│  • Results written to SQLite ledger                                     │
│  • Checkpoint created with final state                                  │
│  • WebSocket broadcast to UI                                            │
│  • Pipeline status → COMPLETED                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Implementation Components

### 1. Auto-Orchestration Trigger

**Location:** `/opt/agent-governance/ui/server.ts`

**Trigger Conditions:**
- Pipeline reaches REPORT phase
- All agents have completed or timed out
- No critical failures blocking continuation
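
The three trigger conditions can be read as a single predicate over the pipeline state. The following is a minimal sketch, not the code in `server.ts`: the `PipelineSnapshot` shape, its field names, and the `AgentStatus` values are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real pipeline state in server.ts may differ.
type AgentStatus = "RUNNING" | "COMPLETED" | "TIMED_OUT" | "FAILED";

interface PipelineSnapshot {
  phase: string; // e.g. "RUNNING" or "REPORT"
  agents: { id: string; status: AgentStatus }[];
  criticalFailures: number; // failures that block continuation
}

// Returns true only when all three trigger conditions hold.
// Assumption: an agent in FAILED state counts as "not settled".
function shouldAutoContinue(p: PipelineSnapshot): boolean {
  const inReportPhase = p.phase === "REPORT";
  const agentsSettled = p.agents.every(
    (a) => a.status === "COMPLETED" || a.status === "TIMED_OUT",
  );
  return inReportPhase && agentsSettled && p.criticalFailures === 0;
}
```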

**New Endpoint:** `POST /api/pipeline/continue`
```typescript
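// Request body (the interface name is illustrative)
interface ContinuePipelineRequest {
  pipeline_id: string;
  mode: "openrouter" | "local";  // "openrouter" = full LLM, "local" = mock
  model?: string;                // Default: "anthropic/claude-sonnet-4"
  timeout?: number;              // Seconds; default: 120
}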
```
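
One way to validate this request body and apply the documented defaults is sketched below. The function name `parseContinueRequest`, the error messages, and the choice to reject non-positive timeouts are assumptions; the actual handler in `server.ts` may validate differently. The interface is repeated so the block is self-contained.

```typescript
interface ContinuePipelineRequest {
  pipeline_id: string;
  mode: "openrouter" | "local";
  model?: string;
  timeout?: number; // seconds
}

const DEFAULT_MODEL = "anthropic/claude-sonnet-4";
const DEFAULT_TIMEOUT_S = 120;

// Validates an untrusted JSON body and fills in the documented defaults.
function parseContinueRequest(body: unknown): Required<ContinuePipelineRequest> {
  if (typeof body !== "object" || body === null) {
    throw new Error("request body must be a JSON object");
  }
  const { pipeline_id, mode, model, timeout } = body as Record<string, unknown>;
  if (typeof pipeline_id !== "string" || pipeline_id === "") {
    throw new Error("pipeline_id is required");
  }
  if (mode !== "openrouter" && mode !== "local") {
    throw new Error('mode must be "openrouter" or "local"');
  }
  if (model !== undefined && typeof model !== "string") {
    throw new Error("model must be a string");
  }
  if (timeout !== undefined && (typeof timeout !== "number" || !(timeout > 0))) {
    throw new Error("timeout must be a positive number of seconds");
  }
  return {
    pipeline_id,
    mode,
    model: typeof model === "string" ? model : DEFAULT_MODEL,
    timeout: typeof timeout === "number" ? timeout : DEFAULT_TIMEOUT_S,
  };
}
```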

### 2. Parallel Agent Execution

**Python Agent (ALPHA):**
- Path: `/opt/agent-governance/agents/llm-planner/governed_agent.py`
- Role: Research, analysis, proposal generation
- Runtime: Python 3.11 with venv

**Bun Agent (BETA):**
- Path: `/opt/agent-governance/agents/llm-planner-ts/governed-agent.ts`
- Role: Synthesis, evaluation, solution building
- Runtime: Bun (claimed to be up to 4x faster than Node.js; the actual speedup depends on the workload)

**Coordination:**
- Both agents connect to same DragonflyDB instance
- Shared Blackboard for structured data exchange
- MessageBus for real-time communication
- SpawnController monitors for GAMMA trigger conditions

### 3. OpenRouter Integration

**Credential Flow:**
```
Vault (secret/data/api-keys/openrouter)
    │
    ▼
getVaultSecret() in agent code
    │
    ▼
OpenAI client with baseURL: "https://openrouter.ai/api/v1"
    │
    ▼
Model: anthropic/claude-sonnet-4
```
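
To make the last two steps concrete, the sketch below builds the raw HTTP request that an OpenAI-compatible client would send to OpenRouter's chat completions endpoint. It is illustrative only: the agents use the OpenAI client rather than hand-built requests, `buildChatRequest` and `ChatRequest` are hypothetical names, and the API key is assumed to have already been read from Vault (the signature of `getVaultSecret()` is not shown in this document).

```typescript
const OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1";

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a chat completion request.
// apiKey is assumed to come from secret/data/api-keys/openrouter.
function buildChatRequest(
  apiKey: string,
  prompt: string,
  model: string = "anthropic/claude-sonnet-4",
): ChatRequest {
  return {
    url: `${OPENROUTER_BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}
```

The returned object can be passed to `fetch(req.url, req)` in Bun or any runtime with a global `fetch`.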

**Rate Limiting:**
- Request rate limits are enforced by the OpenRouter API
- A circuit breaker in the governance layer opens after 5 failures
- Token budgets are tracked per agent
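
A circuit breaker of this kind can be sketched in a few lines. This is not the governance layer's implementation: it assumes the 5 failures are counted consecutively and reset on success, it wraps synchronous calls only, and it omits any half-open or cool-down behaviour, none of which this document specifies.

```typescript
class CircuitBreaker {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number = 5) {
    this.threshold = threshold;
  }

  // The breaker is open once the failure count reaches the threshold.
  isOpen(): boolean {
    return this.failures >= this.threshold;
  }

  // Runs fn unless the breaker is open; tracks the outcome.
  call<T>(fn: () => T): T {
    if (this.isOpen()) {
      throw new Error("circuit open: call rejected");
    }
    try {
      const result = fn();
      this.failures = 0; // assumption: a success resets the count
      return result;
    } catch (err) {
      this.failures += 1;
      throw err;
    }
  }
}
```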

### 4. Error Handling & Failover

**Level 1: Agent-Level Recovery**
```
Error Budget per agent:
- max_total_errors: 8
- max_same_error_repeats: 2
- max_procedure_violations: 1

On budget exceeded → Agent revoked, handoff created
```
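
The error budget above could be tracked with a small counter class like the one below. It is a sketch under stated assumptions: a budget counts as exceeded when a counter goes strictly above its maximum, "same error" is identified by a caller-supplied string key, and the class and method names are hypothetical. Revoking the agent and creating the handoff are left to the caller.

```typescript
interface ErrorBudgetLimits {
  max_total_errors: number;
  max_same_error_repeats: number;
  max_procedure_violations: number;
}

const DEFAULT_LIMITS: ErrorBudgetLimits = {
  max_total_errors: 8,
  max_same_error_repeats: 2,
  max_procedure_violations: 1,
};

class ErrorBudget {
  private total = 0;
  private violations = 0;
  private readonly perError = new Map<string, number>();
  private readonly limits: ErrorBudgetLimits;

  constructor(limits: ErrorBudgetLimits = DEFAULT_LIMITS) {
    this.limits = limits;
  }

  // Records one error and returns true if any budget is now exceeded.
  record(errorKey: string, isProcedureViolation: boolean = false): boolean {
    this.total += 1;
    const sameCount = (this.perError.get(errorKey) ?? 0) + 1;
    this.perError.set(errorKey, sameCount);
    if (isProcedureViolation) this.violations += 1;
    return (
      this.total > this.limits.max_total_errors ||
      sameCount > this.limits.max_same_error_repeats ||
      this.violations > this.limits.max_procedure_violations
    );
  }
}
```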

**Level 2: Pipeline-Level Recovery**
```
On agent failure:
1. Record failure in DragonflyDB
2. Check if partner agent can continue alone
3. If both fail → Pipeline status = FAILED
4. Create checkpoint with failure details
```

**Level 3: Orchestration-Level Recovery**
```
On orchestration timeout (120s default):
1. Force-stop running agents
2. Collect partial results from Blackboard
3. Generate partial report
4. Pipeline status = TIMEOUT
```

**GAMMA Spawn Conditions:**
| Condition | Threshold | Action |
|-----------|-----------|--------|
| STUCK | 30s no progress | Spawn GAMMA mediator |
| CONFLICT | 3+ unresolved proposals | Spawn GAMMA to arbitrate |
| COMPLEXITY | Score > 0.8 | Spawn GAMMA for decomposition |
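
The table translates directly into a threshold check. The sketch below is an assumption about how SpawnController might evaluate it, not its actual code: the signal and threshold field names are hypothetical, and the order in which conditions are checked (STUCK, then CONFLICT, then COMPLEXITY) is an arbitrary choice.

```typescript
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

interface GammaThresholds {
  stuckSeconds: number;      // seconds without progress
  conflictProposals: number; // unresolved proposals
  complexityScore: number;   // score in [0, 1]
}

interface OrchestrationSignals {
  secondsSinceProgress: number;
  unresolvedProposals: number;
  complexityScore: number;
}

const DEFAULT_THRESHOLDS: GammaThresholds = {
  stuckSeconds: 30,
  conflictProposals: 3,
  complexityScore: 0.8,
};

// Returns the first matching spawn reason, or null if GAMMA is not needed.
function gammaSpawnReason(
  s: OrchestrationSignals,
  t: GammaThresholds = DEFAULT_THRESHOLDS,
): GammaReason | null {
  if (s.secondsSinceProgress >= t.stuckSeconds) return "STUCK";
  if (s.unresolvedProposals >= t.conflictProposals) return "CONFLICT";
  if (s.complexityScore > t.complexityScore) return "COMPLEXITY"; // strictly greater, per the table
  return null;
}
```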

---

## API Endpoints

### Existing (Modified)
- `POST /api/spawn` - Creates a pipeline; the request body now accepts `auto_continue: boolean`
- `GET /api/checkpoint/report` - Returns report with continuation status

### New
- `POST /api/pipeline/continue` - Triggers OpenRouter orchestration
- `GET /api/pipeline/{id}/orchestration` - Gets orchestration status
- `POST /api/pipeline/{id}/stop` - Emergency stop

---

## WebSocket Events

### New Events
```typescript
// Orchestration started
{ type: "orchestration_started", data: { pipeline_id, model, agents: ["ALPHA", "BETA"] } }

// Agent spawned
{ type: "agent_spawned", data: { pipeline_id, agent_id, role, runtime } }

// Agent message
{ type: "agent_message", data: { pipeline_id, from, to, content } }

// GAMMA spawned (conditional)
{ type: "gamma_spawned", data: { pipeline_id, reason: "STUCK" | "CONFLICT" | "COMPLEXITY" } }

// Consensus reached
{ type: "consensus_reached", data: { pipeline_id, proposal_id, votes } }

// Orchestration complete
{ type: "orchestration_complete", data: { pipeline_id, status, results } }
```
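
On the client side, these events can be modelled as a discriminated union so that a `switch` on `type` is checked by the compiler. This is a sketch: the event names and `data` keys come from the list above, but the field types (for example `votes` and `results` as `unknown`) and the `describeEvent` helper are assumptions.

```typescript
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

// Field types are assumed; only the names come from the event list.
type PipelineEvent =
  | { type: "orchestration_started"; data: { pipeline_id: string; model: string; agents: string[] } }
  | { type: "agent_spawned"; data: { pipeline_id: string; agent_id: string; role: string; runtime: string } }
  | { type: "agent_message"; data: { pipeline_id: string; from: string; to: string; content: string } }
  | { type: "gamma_spawned"; data: { pipeline_id: string; reason: GammaReason } }
  | { type: "consensus_reached"; data: { pipeline_id: string; proposal_id: string; votes: unknown } }
  | { type: "orchestration_complete"; data: { pipeline_id: string; status: string; results: unknown } };

// Produces a one-line summary suitable for a log or status bar.
function describeEvent(e: PipelineEvent): string {
  const id = e.data.pipeline_id;
  switch (e.type) {
    case "orchestration_started":
      return `${id}: started with ${e.data.model} (${e.data.agents.join(", ")})`;
    case "agent_spawned":
      return `${id}: ${e.data.agent_id} spawned as ${e.data.role} on ${e.data.runtime}`;
    case "agent_message":
      return `${id}: ${e.data.from} -> ${e.data.to}: ${e.data.content}`;
    case "gamma_spawned":
      return `${id}: GAMMA spawned (${e.data.reason})`;
    case "consensus_reached":
      return `${id}: consensus on ${e.data.proposal_id}`;
    case "orchestration_complete":
      return `${id}: finished with status ${e.data.status}`;
  }
}
```

A WebSocket `onmessage` handler could call `describeEvent(JSON.parse(raw) as PipelineEvent)` after checking that the payload has a known `type`.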

---

## Configuration

### Environment Variables
```bash
# Enable auto-orchestration after report
AUTO_ORCHESTRATE=true

# Default model for OpenRouter
OPENROUTER_MODEL=anthropic/claude-sonnet-4

# Orchestration timeout (seconds)
ORCHESTRATION_TIMEOUT=120

# GAMMA spawn thresholds
GAMMA_STUCK_THRESHOLD=30
GAMMA_CONFLICT_THRESHOLD=3
GAMMA_COMPLEXITY_THRESHOLD=0.8
```
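
These variables could be read into a typed config object as sketched below. The loader, its field names, and the fallback behaviour are assumptions: in particular, it treats an unset `AUTO_ORCHESTRATE` as `false` and falls back to the values shown above when a variable is missing.

```typescript
interface OrchestrationConfig {
  autoOrchestrate: boolean;
  model: string;
  timeoutSeconds: number;
  gamma: {
    stuckThresholdSeconds: number;
    conflictThreshold: number;
    complexityThreshold: number;
  };
}

// Reads configuration from an environment-like map (e.g. process.env).
function loadConfig(env: Record<string, string | undefined>): OrchestrationConfig {
  // Parses a numeric variable, falling back when it is unset or empty.
  const num = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw.trim() === "") return fallback;
    const n = Number(raw);
    if (!Number.isFinite(n)) {
      throw new Error(`${key} must be a number, got "${raw}"`);
    }
    return n;
  };

  return {
    autoOrchestrate: env["AUTO_ORCHESTRATE"] === "true",
    model: env["OPENROUTER_MODEL"] ?? "anthropic/claude-sonnet-4",
    timeoutSeconds: num("ORCHESTRATION_TIMEOUT", 120),
    gamma: {
      stuckThresholdSeconds: num("GAMMA_STUCK_THRESHOLD", 30),
      conflictThreshold: num("GAMMA_CONFLICT_THRESHOLD", 3),
      complexityThreshold: num("GAMMA_COMPLEXITY_THRESHOLD", 0.8),
    },
  };
}
```

Failing fast on a non-numeric value is a design choice here; silently falling back to the default would hide a misconfigured deployment.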

### Vault Secrets Required
```
secret/data/api-keys/openrouter
  └── api_key: "sk-or-..."

secret/data/services/dragonfly
  └── password: "..."
```

---

## Implementation Steps

### Step 1: Add Auto-Continue Logic to UI Server
- [x] Add `triggerOrchestration()` function
- [x] Modify `checkPipelineCompletion()` to check for auto_continue
- [x] Add `/api/pipeline/continue` endpoint

### Step 2: Connect to Multi-Agent Orchestrator
- [x] Spawn orchestrator.ts from UI via Bun.spawn()
- [x] Pass pipeline context (task_id, objective, model, timeout)
- [x] Wire up WebSocket events (orchestration_started, agent_message, consensus_reached, orchestration_complete)

### Step 3: Add Orchestration Status Tracking
- [x] Track orchestration state in Redis (ORCHESTRATING status)
- [x] Add orchestration_started_at timestamp
- [x] Create checkpoint on completion

### Step 4: Implement Error Handling
- [x] Add timeout handling via orchestrator --timeout flag
- [x] Capture exit codes and error messages
- [x] Set ORCHESTRATION_FAILED or ORCHESTRATION_ERROR status on failure

### Step 5: Test End-to-End
- [x] Spawn pipeline with objective
- [x] Verify report generation
- [x] Verify auto-trigger to orchestration
- [x] Verify parallel agent execution
- [x] Verify results collection

### Demonstration Results (2026-01-24)
Successfully tested with `pipeline-mksufe23`:
- Pipeline spawned → ALPHA/BETA ran → Report generated → Auto-orchestration triggered
- GAMMA spawned on the COMPLEXITY condition (score above the 0.8 threshold)
- Total orchestration time: 51.4 seconds
- Final status: COMPLETED

---

## Testing

### Manual Test Command
```bash
# 1. Start UI server
cd /opt/agent-governance/ui && bun run server.ts

# 2. Spawn pipeline via API
curl -X POST http://localhost:3000/api/spawn \
  -H "Content-Type: application/json" \
  -d '{"objective": "Design a caching strategy", "auto_continue": true}'

# 3. Watch WebSocket for events
# Pipeline should: SPAWN → RUNNING → REPORT → ORCHESTRATE → COMPLETE
```
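
When watching a run, it helps to check mechanically that the observed phases follow the expected order. The helper below is a hypothetical test aid, not part of the repository; it assumes the phase names from the comment above have already been collected into an array, and how they are extracted from WebSocket events is left open.

```typescript
const EXPECTED_PHASES = ["SPAWN", "RUNNING", "REPORT", "ORCHESTRATE", "COMPLETE"] as const;

// Returns the index of the first phase that deviates from the expected
// order, or -1 if the observed phases are a prefix of the expected sequence.
function firstPhaseMismatch(observed: string[]): number {
  for (let i = 0; i < observed.length; i++) {
    if (observed[i] !== EXPECTED_PHASES[i]) return i;
  }
  return -1;
}
```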

### Validation Criteria
- [x] Pipeline reaches ORCHESTRATION phase automatically
- [x] Both ALPHA and BETA agents spawn
- [x] Agents communicate via MessageBus
- [x] Results appear in Blackboard
- [x] Final checkpoint created
- [x] Audit trail in SQLite

---

## Rollback Plan

If orchestration fails repeatedly:
1. Set `AUTO_ORCHESTRATE=false`
2. Pipelines will then stop at the REPORT phase
3. Orchestration can still be triggered manually with `POST /api/pipeline/continue`
4. Review logs via `/api/pipeline/logs`

---

*Document Version: 1.0*

*Last Updated: 2026-01-24*