# Production Pipeline: Report → OpenRouter Orchestration
## Overview
This document describes the automatic transition from the UI "view report" stage into the live multi-agent pipeline, including OpenRouter-driven parallel execution.
**Created:** 2026-01-24
**Status:** Implemented
---
## Architecture Flow
```
┌─────────────────────────────────────────────────────────────────────────┐
│                              UI DASHBOARD                               │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │  SPAWN   │───▶│ RUNNING  │───▶│  REPORT  │───▶│ AUTO-ORCHESTRATE │   │
│  │ Pipeline │    │  Agents  │    │  Stage   │    │      (NEW)       │   │
│  └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                                           │             │
└───────────────────────────────────────────────────────────┼─────────────┘
                                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        OPENROUTER ORCHESTRATION                         │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     MultiAgentOrchestrator                      │   │
│   │                                                                 │   │
│   │          ┌─────────────┐              ┌─────────────┐           │   │
│   │          │    ALPHA    │◄────────────▶│    BETA     │           │   │
│   │          │ (Research)  │   Messages   │ (Synthesis) │           │   │
│   │          │   Python    │              │     Bun     │           │   │
│   │          └──────┬──────┘              └──────┬──────┘           │   │
│   │                 │                            │                  │   │
│   │                 └─────────┬──────────────────┘                  │   │
│   │                           │                                     │   │
│   │                           ▼                                     │   │
│   │                    ┌─────────────┐                              │   │
│   │                    │    GAMMA    │  (Spawned on STUCK/CONFLICT) │   │
│   │                    │ (Mediator)  │                              │   │
│   │                    └─────────────┘                              │   │
│   │                                                                 │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  Shared Infrastructure:                                                 │
│  • Blackboard (DragonflyDB) - Proposals, solutions, consensus           │
│  • MessageBus (Redis PubSub) - Agent coordination                       │
│  • MetricsCollector - Performance tracking                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                           COMPLETION & AUDIT                            │
│                                                                         │
│  • Results written to SQLite ledger                                     │
│  • Checkpoint created with final state                                  │
│  • WebSocket broadcast to UI                                            │
│  • Pipeline status → COMPLETED                                          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
---
## Implementation Components
### 1. Auto-Orchestration Trigger
**Location:** `/opt/agent-governance/ui/server.ts`
**Trigger Conditions:**
- Pipeline reaches REPORT phase
- All agents have completed or timed out
- No critical failures blocking continuation
**New Endpoint:** `POST /api/pipeline/continue`
```typescript
{
  pipeline_id: string;
  mode: "openrouter" | "local"; // "openrouter" = full LLM calls, "local" = mock
  model?: string;               // Default: anthropic/claude-sonnet-4
  timeout?: number;             // Seconds; default: 120
}
```
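The request body above can be validated and defaulted before orchestration starts. The following is a minimal sketch, not the server's actual handler: `ContinueRequest`, `parseContinueRequest`, and the error messages are hypothetical names chosen for illustration; only the field names and defaults come from the schema.

```typescript
// Hypothetical helper: validates a POST /api/pipeline/continue body and
// fills in the documented defaults. Names are illustrative, not from server.ts.
type Mode = "openrouter" | "local";

interface ContinueRequest {
  pipeline_id: string;
  mode: Mode;
  model: string;   // defaulted when omitted
  timeout: number; // seconds, defaulted when omitted
}

const DEFAULT_MODEL = "anthropic/claude-sonnet-4";
const DEFAULT_TIMEOUT_SECONDS = 120;

function parseContinueRequest(body: unknown): ContinueRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("request body must be a JSON object");
  }
  const raw = body as Record<string, unknown>;

  const pipelineId = raw["pipeline_id"];
  if (typeof pipelineId !== "string" || pipelineId.length === 0) {
    throw new Error("pipeline_id is required");
  }

  const rawMode = raw["mode"];
  let mode: Mode;
  if (rawMode === "openrouter" || rawMode === "local") {
    mode = rawMode;
  } else {
    throw new Error('mode must be "openrouter" or "local"');
  }

  const rawModel = raw["model"];
  if (rawModel !== undefined && typeof rawModel !== "string") {
    throw new Error("model must be a string");
  }

  const rawTimeout = raw["timeout"];
  if (rawTimeout !== undefined && (typeof rawTimeout !== "number" || !(rawTimeout > 0))) {
    throw new Error("timeout must be a positive number of seconds");
  }

  return {
    pipeline_id: pipelineId,
    mode,
    model: rawModel ?? DEFAULT_MODEL,
    timeout: rawTimeout ?? DEFAULT_TIMEOUT_SECONDS,
  };
}
```

Rejecting malformed bodies at the endpoint keeps the orchestrator from being spawned with an undefined model or a non-positive timeout.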
### 2. Parallel Agent Execution
**Python Agent (ALPHA):**
- Path: `/opt/agent-governance/agents/llm-planner/governed_agent.py`
- Role: Research, analysis, proposal generation
- Runtime: Python 3.11 with venv
**Bun Agent (BETA):**
- Path: `/opt/agent-governance/agents/llm-planner-ts/governed-agent.ts`
- Role: Synthesis, evaluation, solution building
- Runtime: Bun (4x faster than Node.js)
**Coordination:**
- Both agents connect to same DragonflyDB instance
- Shared Blackboard for structured data exchange
- MessageBus for real-time communication
- SpawnController monitors for GAMMA trigger conditions
### 3. OpenRouter Integration
**Credential Flow:**
```
Vault (secret/data/api-keys/openrouter)
  → getVaultSecret() in agent code
  → OpenAI client with baseURL: "https://openrouter.ai/api/v1"
  → Model: anthropic/claude-sonnet-4
```
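To make the last two hops of the credential flow concrete, the sketch below prepares an OpenAI-compatible chat request against the OpenRouter base URL. It deliberately uses a plain request description instead of the OpenAI client the agents use, so that it stays dependency-free. `buildChatRequest`, `ChatMessage`, and `PreparedRequest` are hypothetical names, and the API key is assumed to have already been fetched from Vault by the caller.

```typescript
// Hypothetical sketch: describes the HTTP request an OpenAI-compatible client
// would send to OpenRouter. The key is assumed to come from Vault beforehand.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1";

function buildChatRequest(
  apiKey: string,
  messages: ChatMessage[],
  model: string = "anthropic/claude-sonnet-4",
): PreparedRequest {
  if (apiKey.length === 0) {
    throw new Error("OpenRouter API key is empty");
  }
  return {
    url: `${OPENROUTER_BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}
```

The returned object can be passed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Keeping the key out of module scope means it only lives for the duration of one call.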
**Rate Limiting:**
- Handled by OpenRouter API
- Circuit breaker in governance layer (5 failures → open)
- Per-agent token budget tracking
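The "5 failures → open" rule can be illustrated with a small circuit breaker. This is a sketch under assumptions rather than the governance layer's code: the class name, the consecutive-failure counting, and the cooldown before a trial request are all assumptions; only the threshold of 5 comes from the list above.

```typescript
// Hypothetical circuit breaker. Only the threshold of 5 is documented;
// the cooldown and half-open behavior are assumptions for illustration.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, cooldownMs = 30_000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a request may be attempted (closed, or cooldown elapsed).
  canRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure increments the count and opens the breaker at the threshold.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `canRequest()` before each OpenRouter call and report the outcome with `recordSuccess()` or `recordFailure()`. Injecting the clock keeps the breaker testable without real delays.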
### 4. Error Handling & Failover
**Level 1: Agent-Level Recovery**
```
Error Budget per agent:
  - max_total_errors: 8
  - max_same_error_repeats: 2
  - max_procedure_violations: 1

On budget exceeded → Agent revoked, handoff created
```
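The per-agent error budget can be modeled as three counters checked on every recorded error. The sketch below is illustrative only: `ErrorBudget` and `record` are hypothetical names, and it assumes a limit is exceeded when its counter goes strictly above the configured value and that "same error" is identified by a caller-supplied key.

```typescript
// Hypothetical error budget tracker using the documented limits.
// Assumption: a limit is "exceeded" when its counter is strictly greater.
interface ErrorBudgetLimits {
  max_total_errors: number;
  max_same_error_repeats: number;
  max_procedure_violations: number;
}

const DEFAULT_LIMITS: ErrorBudgetLimits = {
  max_total_errors: 8,
  max_same_error_repeats: 2,
  max_procedure_violations: 1,
};

class ErrorBudget {
  private total = 0;
  private violations = 0;
  private readonly perError = new Map<string, number>();
  private readonly limits: ErrorBudgetLimits;

  constructor(limits: ErrorBudgetLimits = DEFAULT_LIMITS) {
    this.limits = limits;
  }

  // Records one error and returns the name of the exceeded limit, or null.
  record(errorKey: string, isProcedureViolation = false): keyof ErrorBudgetLimits | null {
    this.total += 1;
    const seen = (this.perError.get(errorKey) ?? 0) + 1;
    this.perError.set(errorKey, seen);
    if (isProcedureViolation) this.violations += 1;

    if (this.violations > this.limits.max_procedure_violations) return "max_procedure_violations";
    if (seen > this.limits.max_same_error_repeats) return "max_same_error_repeats";
    if (this.total > this.limits.max_total_errors) return "max_total_errors";
    return null;
  }
}
```

A non-null return value is the point at which the agent would be revoked and a handoff created.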
**Level 2: Pipeline-Level Recovery**
```
On agent failure:
  1. Record failure in DragonflyDB
  2. Check if partner agent can continue alone
  3. If both fail → Pipeline status = FAILED
  4. Create checkpoint with failure details
```
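The four pipeline-level steps can be read as a single decision function. In this sketch an in-memory array stands in for DragonflyDB, and the rule that the partner "can continue alone" whenever it has not itself failed is an assumption; all type and function names are hypothetical.

```typescript
// Hypothetical sketch of pipeline-level recovery. The failure log array
// stands in for DragonflyDB; the partner check is an assumed rule.
type AgentName = "ALPHA" | "BETA";
type AgentState = "RUNNING" | "COMPLETED" | "FAILED";

interface FailureRecord {
  agent: AgentName;
  error: string;
  recordedAt: string;
}

interface RecoveryDecision {
  pipelineStatus: "RUNNING" | "FAILED";
  checkpoint: { failures: FailureRecord[]; continuingAgent: AgentName | null };
}

function handleAgentFailure(
  failed: AgentName,
  error: string,
  states: Record<AgentName, AgentState>,
  failureLog: FailureRecord[],
): RecoveryDecision {
  // Step 1: record the failure.
  states[failed] = "FAILED";
  failureLog.push({ agent: failed, error, recordedAt: new Date().toISOString() });

  // Step 2: assume the partner can continue unless it has also failed.
  const partner: AgentName = failed === "ALPHA" ? "BETA" : "ALPHA";
  const partnerAlive = states[partner] !== "FAILED";

  // Steps 3 and 4: derive the pipeline status and the checkpoint payload.
  return {
    pipelineStatus: partnerAlive ? "RUNNING" : "FAILED",
    checkpoint: {
      failures: [...failureLog],
      continuingAgent: partnerAlive ? partner : null,
    },
  };
}
```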
**Level 3: Orchestration-Level Recovery**
```
On orchestration timeout (120s default):
  1. Force-stop running agents
  2. Collect partial results from Blackboard
  3. Generate partial report
  4. Pipeline status = TIMEOUT
```
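The timeout path can be sketched as a race between the orchestration run and a deadline. This is one possible shape, not the orchestrator's implementation: `runWithTimeout`, `collectPartial`, and the use of an `AbortSignal` to force-stop agents are assumptions made for illustration.

```typescript
// Hypothetical sketch: race the orchestration against a deadline and fall
// back to partial results. Cancellation via AbortSignal is an assumption.
interface OrchestrationOutcome<T> {
  status: "COMPLETED" | "TIMEOUT";
  results: T;
}

type RaceResult<T> = { kind: "done"; value: T } | { kind: "timeout" };

async function runWithTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  collectPartial: () => T,
  timeoutSeconds = 120,
): Promise<OrchestrationOutcome<T>> {
  const controller = new AbortController();

  // The deadline resolves when the controller is aborted by the timer.
  const deadline = new Promise<RaceResult<T>>((resolve) => {
    controller.signal.addEventListener("abort", () => resolve({ kind: "timeout" }), { once: true });
  });
  // Aborting doubles as the force-stop signal for running agents.
  const timer = setTimeout(() => controller.abort(), timeoutSeconds * 1000);

  const work: Promise<RaceResult<T>> = run(controller.signal).then(
    (value): RaceResult<T> => ({ kind: "done", value }),
  );

  try {
    const result = await Promise.race([work, deadline]);
    if (result.kind === "done") {
      return { status: "COMPLETED", results: result.value };
    }
    // Timed out: gather whatever is available and report TIMEOUT.
    return { status: "TIMEOUT", results: collectPartial() };
  } finally {
    clearTimeout(timer);
  }
}
```

Here `collectPartial` would read whatever the agents have written to the Blackboard so far. Clearing the timer in `finally` avoids keeping the process alive after a fast completion.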
**GAMMA Spawn Conditions:**
| Condition | Threshold | Action |
|-----------|-----------|--------|
| STUCK | 30s no progress | Spawn GAMMA mediator |
| CONFLICT | 3+ unresolved proposals | Spawn GAMMA to arbitrate |
| COMPLEXITY | Score > 0.8 | Spawn GAMMA for decomposition |
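The spawn conditions in the table translate directly into a threshold check. The sketch below assumes the SpawnController sees a snapshot with three numeric fields and that conditions are evaluated in table order; the field names, the function name, and the precedence are hypothetical.

```typescript
// Hypothetical evaluation of the GAMMA spawn table. Snapshot field names
// and the STUCK > CONFLICT > COMPLEXITY precedence are assumptions.
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

interface GammaThresholds {
  stuckSeconds: number;      // seconds without progress
  conflictProposals: number; // unresolved proposals
  complexityScore: number;   // score must be strictly above this value
}

interface OrchestrationSnapshot {
  secondsSinceProgress: number;
  unresolvedProposals: number;
  complexityScore: number;
}

const DEFAULT_GAMMA_THRESHOLDS: GammaThresholds = {
  stuckSeconds: 30,
  conflictProposals: 3,
  complexityScore: 0.8,
};

function gammaSpawnReason(
  snapshot: OrchestrationSnapshot,
  thresholds: GammaThresholds = DEFAULT_GAMMA_THRESHOLDS,
): GammaReason | null {
  if (snapshot.secondsSinceProgress >= thresholds.stuckSeconds) return "STUCK";
  if (snapshot.unresolvedProposals >= thresholds.conflictProposals) return "CONFLICT";
  if (snapshot.complexityScore > thresholds.complexityScore) return "COMPLEXITY";
  return null;
}
```

Note the asymmetry taken from the table: "3+" and "30s" are inclusive bounds, while the complexity score must be strictly greater than 0.8.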
---
## API Endpoints
### Existing (Modified)
- `POST /api/spawn` - Creates a pipeline; the request body now accepts `auto_continue: boolean`
- `GET /api/checkpoint/report` - Returns report with continuation status
### New
- `POST /api/pipeline/continue` - Triggers OpenRouter orchestration
- `GET /api/pipeline/{id}/orchestration` - Gets orchestration status
- `POST /api/pipeline/{id}/stop` - Emergency stop
---
## WebSocket Events
### New Events
```typescript
// Orchestration started
{ type: "orchestration_started", data: { pipeline_id, model, agents: ["ALPHA", "BETA"] } }
// Agent spawned
{ type: "agent_spawned", data: { pipeline_id, agent_id, role, runtime } }
// Agent message
{ type: "agent_message", data: { pipeline_id, from, to, content } }
// GAMMA spawned (conditional)
{ type: "gamma_spawned", data: { pipeline_id, reason: "STUCK" | "CONFLICT" | "COMPLEXITY" } }
// Consensus reached
{ type: "consensus_reached", data: { pipeline_id, proposal_id, votes } }
// Orchestration complete
{ type: "orchestration_complete", data: { pipeline_id, status, results } }
```
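On the client side, these events lend themselves to a discriminated union keyed on `type`. The sketch below is an assumption about field types, since the listing above gives only field names; `OrchestrationEvent` and `describeEvent` are hypothetical.

```typescript
// Hypothetical typing of the WebSocket events. Field types are assumed;
// only the event and field names come from the event listing.
type SpawnReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

type OrchestrationEvent =
  | { type: "orchestration_started"; data: { pipeline_id: string; model: string; agents: string[] } }
  | { type: "agent_spawned"; data: { pipeline_id: string; agent_id: string; role: string; runtime: string } }
  | { type: "agent_message"; data: { pipeline_id: string; from: string; to: string; content: string } }
  | { type: "gamma_spawned"; data: { pipeline_id: string; reason: SpawnReason } }
  | { type: "consensus_reached"; data: { pipeline_id: string; proposal_id: string; votes: unknown } }
  | { type: "orchestration_complete"; data: { pipeline_id: string; status: string; results: unknown } };

// Produces a one-line summary; the switch narrows `data` per event type.
function describeEvent(event: OrchestrationEvent): string {
  switch (event.type) {
    case "orchestration_started":
      return `${event.data.pipeline_id}: started with ${event.data.agents.join(", ")} on ${event.data.model}`;
    case "agent_spawned":
      return `${event.data.pipeline_id}: ${event.data.agent_id} spawned (${event.data.role}, ${event.data.runtime})`;
    case "agent_message":
      return `${event.data.pipeline_id}: ${event.data.from} -> ${event.data.to}`;
    case "gamma_spawned":
      return `${event.data.pipeline_id}: GAMMA spawned (${event.data.reason})`;
    case "consensus_reached":
      return `${event.data.pipeline_id}: consensus on ${event.data.proposal_id}`;
    case "orchestration_complete":
      return `${event.data.pipeline_id}: finished with status ${event.data.status}`;
  }
}
```

A UI handler could call `describeEvent(JSON.parse(message.data) as OrchestrationEvent)` for each incoming frame, after checking that the parsed `type` is one of the known values.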
---
## Configuration
### Environment Variables
```bash
# Enable auto-orchestration after report
AUTO_ORCHESTRATE=true
# Default model for OpenRouter
OPENROUTER_MODEL=anthropic/claude-sonnet-4
# Orchestration timeout (seconds)
ORCHESTRATION_TIMEOUT=120
# GAMMA spawn thresholds
GAMMA_STUCK_THRESHOLD=30
GAMMA_CONFLICT_THRESHOLD=3
GAMMA_COMPLEXITY_THRESHOLD=0.8
```
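A loader for these variables might look like the following sketch. The function and type names are hypothetical, the environment is passed in as a plain record so the code does not depend on a particular runtime, and treating an unset `AUTO_ORCHESTRATE` as `false` is an assumption.

```typescript
// Hypothetical config loader for the variables listed in this section.
// Assumption: AUTO_ORCHESTRATE defaults to false when unset.
interface OrchestrationConfig {
  autoOrchestrate: boolean;
  model: string;
  timeoutSeconds: number;
  gammaStuckSeconds: number;
  gammaConflictProposals: number;
  gammaComplexityScore: number;
}

type Env = Record<string, string | undefined>;

// Parses a numeric variable, falling back when it is unset or blank.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) {
    throw new Error(`${name} must be numeric, got "${raw}"`);
  }
  return parsed;
}

function loadOrchestrationConfig(env: Env): OrchestrationConfig {
  const model = env["OPENROUTER_MODEL"];
  return {
    autoOrchestrate: env["AUTO_ORCHESTRATE"] === "true",
    model: model !== undefined && model !== "" ? model : "anthropic/claude-sonnet-4",
    timeoutSeconds: readNumber(env, "ORCHESTRATION_TIMEOUT", 120),
    gammaStuckSeconds: readNumber(env, "GAMMA_STUCK_THRESHOLD", 30),
    gammaConflictProposals: readNumber(env, "GAMMA_CONFLICT_THRESHOLD", 3),
    gammaComplexityScore: readNumber(env, "GAMMA_COMPLEXITY_THRESHOLD", 0.8),
  };
}
```

Under Bun or Node.js this would typically be called as `loadOrchestrationConfig(process.env)`. Failing fast on a non-numeric threshold surfaces configuration mistakes at startup rather than mid-pipeline.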
### Vault Secrets Required
```
secret/data/api-keys/openrouter
  └── api_key: "sk-or-..."
secret/data/services/dragonfly
  └── password: "..."
```
---
## Implementation Steps
### Step 1: Add Auto-Continue Logic to UI Server
- [x] Add `triggerOrchestration()` function
- [x] Modify `checkPipelineCompletion()` to check for auto_continue
- [x] Add `/api/pipeline/continue` endpoint
### Step 2: Connect to Multi-Agent Orchestrator
- [x] Spawn orchestrator.ts from UI via Bun.spawn()
- [x] Pass pipeline context (task_id, objective, model, timeout)
- [x] Wire up WebSocket events (orchestration_started, agent_message, consensus_reached, orchestration_complete)
### Step 3: Add Orchestration Status Tracking
- [x] Track orchestration state in Redis (ORCHESTRATING status)
- [x] Add orchestration_started_at timestamp
- [x] Create checkpoint on completion
### Step 4: Implement Error Handling
- [x] Add timeout handling via orchestrator --timeout flag
- [x] Capture exit codes and error messages
- [x] Set ORCHESTRATION_FAILED or ORCHESTRATION_ERROR status on failure
### Step 5: Test End-to-End
- [x] Spawn pipeline with objective
- [x] Verify report generation
- [x] Verify auto-trigger to orchestration
- [x] Verify parallel agent execution
- [x] Verify results collection
### Demonstration Results (2026-01-24)
Successfully tested with `pipeline-mksufe23`:
- Pipeline spawned → ALPHA/BETA ran → Report generated → Auto-orchestration triggered
- GAMMA spawned because the complexity score exceeded the 0.8 threshold
- Total orchestration time: 51.4 seconds
- Final status: COMPLETED
---
## Testing
### Manual Test Command
```bash
# 1. Start UI server
cd /opt/agent-governance/ui && bun run server.ts

# 2. Spawn pipeline via API
curl -X POST http://localhost:3000/api/spawn \
  -H "Content-Type: application/json" \
  -d '{"objective": "Design a caching strategy", "auto_continue": true}'

# 3. Watch WebSocket for events
# Pipeline should: SPAWN → RUNNING → REPORT → ORCHESTRATE → COMPLETE
```
### Validation Criteria
- [x] Pipeline reaches ORCHESTRATION phase automatically
- [x] Both ALPHA and BETA agents spawn
- [x] Agents communicate via MessageBus
- [x] Results appear in Blackboard
- [x] Final checkpoint created
- [x] Audit trail in SQLite
---
## Rollback Plan
If orchestration fails repeatedly:
1. Set `AUTO_ORCHESTRATE=false`.
2. Confirm that new pipelines stop at the REPORT phase.
3. Trigger orchestration manually with `POST /api/pipeline/continue` where needed.
4. Review the logs at `/api/pipeline/logs`.
---
*Document Version: 1.0*
*Last Updated: 2026-01-24*