# Production Pipeline: Report → OpenRouter Orchestration

## Overview

This document describes the automatic transition from the UI "view report" stage into the live multi-agent pipeline, including OpenRouter-driven parallel execution.

**Created:** 2026-01-24
**Status:** Implemented

---

## Architecture Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                              UI DASHBOARD                               │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │  SPAWN   │───▶│ RUNNING  │───▶│  REPORT  │───▶│ AUTO-ORCHESTRATE │   │
│  │ Pipeline │    │  Agents  │    │  Stage   │    │      (NEW)       │   │
│  └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                                           │             │
└───────────────────────────────────────────────────────────┼─────────────┘
                                                            │
                                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        OPENROUTER ORCHESTRATION                         │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     MultiAgentOrchestrator                      │   │
│   │                                                                 │   │
│   │     ┌─────────────┐              ┌─────────────┐                │   │
│   │     │    ALPHA    │◄────────────▶│    BETA     │                │   │
│   │     │ (Research)  │   Messages   │ (Synthesis) │                │   │
│   │     │   Python    │              │    Bun      │                │   │
│   │     └──────┬──────┘              └──────┬──────┘                │   │
│   │            │                            │                       │   │
│   │            └─────────┬──────────────────┘                       │   │
│   │                      │                                          │   │
│   │                      ▼                                          │   │
│   │               ┌─────────────┐                                   │   │
│   │               │    GAMMA    │  (Spawned on STUCK/CONFLICT)      │   │
│   │               │ (Mediator)  │                                   │   │
│   │               └─────────────┘                                   │   │
│   │                                                                 │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  Shared Infrastructure:                                                 │
│  • Blackboard (DragonflyDB) - Proposals, solutions, consensus           │
│  • MessageBus (Redis PubSub) - Agent coordination                       │
│  • MetricsCollector - Performance tracking                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           COMPLETION & AUDIT                            │
│                                                                         │
│  • Results written to SQLite ledger                                     │
│  • Checkpoint created with final state                                  │
│  • WebSocket broadcast to UI                                            │
│  • Pipeline status → COMPLETED                                          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Implementation Components

### 1. Auto-Orchestration Trigger

**Location:** `/opt/agent-governance/ui/server.ts`

**Trigger Conditions:**
- Pipeline reaches REPORT phase
- All agents have completed or timed out
- No critical failures blocking continuation

**New Endpoint:** `POST /api/pipeline/continue`
```typescript
{
  pipeline_id: string;
  mode: "openrouter" | "local";  // openrouter = full LLM, local = mock
  model?: string;                // Default: anthropic/claude-sonnet-4
  timeout?: number;              // Default: 120s
}
```

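A handler for this endpoint has to validate the body and fill in the documented defaults. The following is a minimal sketch of that step, not the code in `server.ts`: the names `ContinueRequest` and `parseContinueRequest` are hypothetical, and only the field names, the two modes, and the default values are taken from the request shape above.

```typescript
// Hypothetical helper: validates a POST /api/pipeline/continue body and
// applies the documented defaults (model and timeout).
type OrchestrationMode = "openrouter" | "local";

interface ContinueRequest {
  pipeline_id: string;
  mode: OrchestrationMode;
  model: string;
  timeout: number; // seconds
}

function parseContinueRequest(body: unknown): ContinueRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("request body must be a JSON object");
  }
  const { pipeline_id, mode, model, timeout } = body as Record<string, unknown>;

  if (typeof pipeline_id !== "string" || pipeline_id.length === 0) {
    throw new Error("pipeline_id must be a non-empty string");
  }
  if (mode !== "openrouter" && mode !== "local") {
    throw new Error('mode must be "openrouter" or "local"');
  }
  // Optional fields fall back to the defaults listed in the request shape.
  const resolvedModel = model === undefined ? "anthropic/claude-sonnet-4" : model;
  if (typeof resolvedModel !== "string") {
    throw new Error("model must be a string");
  }
  const resolvedTimeout = timeout === undefined ? 120 : timeout;
  if (
    typeof resolvedTimeout !== "number" ||
    !Number.isFinite(resolvedTimeout) ||
    resolvedTimeout <= 0
  ) {
    throw new Error("timeout must be a positive number of seconds");
  }
  return { pipeline_id, mode, model: resolvedModel, timeout: resolvedTimeout };
}
```

Rejecting malformed bodies before any agent is spawned keeps a bad request from consuming the orchestration timeout.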
### 2. Parallel Agent Execution

**Python Agent (ALPHA):**
- Path: `/opt/agent-governance/agents/llm-planner/governed_agent.py`
- Role: Research, analysis, proposal generation
- Runtime: Python 3.11 with venv

**Bun Agent (BETA):**
- Path: `/opt/agent-governance/agents/llm-planner-ts/governed-agent.ts`
- Role: Synthesis, evaluation, solution building
- Runtime: Bun (4x faster than Node.js)

**Coordination:**
- Both agents connect to same DragonflyDB instance
- Shared Blackboard for structured data exchange
- MessageBus for real-time communication
- SpawnController monitors for GAMMA trigger conditions

### 3. OpenRouter Integration

**Credential Flow:**
```
Vault (secret/data/api-keys/openrouter)
    │
    ▼
getVaultSecret() in agent code
    │
    ▼
OpenAI client with baseURL: "https://openrouter.ai/api/v1"
    │
    ▼
Model: anthropic/claude-sonnet-4
```

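The last two steps of this flow can be illustrated without any SDK by building the HTTP request by hand. This is a sketch under assumptions: the key is taken as a plain argument because the signature of `getVaultSecret()` is not shown here, and `buildChatRequest`, `ChatMessage`, and `OpenRouterRequest` are hypothetical names. The `/chat/completions` path is the OpenAI-compatible chat endpoint under the base URL given above.

```typescript
// Hypothetical request builder for OpenRouter's OpenAI-compatible chat API.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface OpenRouterRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1";

function buildChatRequest(
  apiKey: string, // assumed to have been read from Vault by the caller
  messages: ChatMessage[],
  model: string = "anthropic/claude-sonnet-4",
): OpenRouterRequest {
  if (apiKey.length === 0) {
    throw new Error("missing OpenRouter API key");
  }
  return {
    url: `${OPENROUTER_BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}
```

The returned object can be passed to `fetch(req.url, req)`. Keeping the builder free of I/O makes it easy to check that the key never ends up anywhere except the `Authorization` header.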
**Rate Limiting:**
- Handled by OpenRouter API
- Circuit breaker in governance layer (5 failures → open)
- Per-agent token budget tracking

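The circuit breaker rule (5 failures → open) could look roughly like the sketch below. It is not the governance layer's implementation. It assumes that failures are counted consecutively, that a success resets the count, and that the breaker allows a probe request after a cooldown; the 30-second cooldown and the class name `CircuitBreaker` are illustrative choices.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number = 5, // from the rule above
    cooldownMs: number = 30_000, // assumption: not specified in this document
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the cooldown, one probe request is allowed through.
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  canRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    const wasHalfOpen = this.state() === "half-open";
    this.consecutiveFailures += 1;
    // A failed probe, or reaching the threshold, (re)opens the breaker.
    if (wasHalfOpen || this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

Callers would check `canRequest()` before each OpenRouter call and report the outcome with `recordSuccess()` or `recordFailure()`.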
### 4. Error Handling & Failover

**Level 1: Agent-Level Recovery**
```
Error Budget per agent:
  - max_total_errors: 8
  - max_same_error_repeats: 2
  - max_procedure_violations: 1

On budget exceeded → Agent revoked, handoff created
```

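A per-agent tracker for this budget might be sketched as follows. The limits are the ones listed above; everything else is an assumption. In particular, the sketch treats a limit as exceeded when a counter goes strictly above it and counts occurrences of the same error signature, which the budget definition does not spell out. `ErrorBudgetTracker` is a hypothetical name.

```typescript
interface ErrorBudget {
  max_total_errors: number;
  max_same_error_repeats: number;
  max_procedure_violations: number;
}

const DEFAULT_BUDGET: ErrorBudget = {
  max_total_errors: 8,
  max_same_error_repeats: 2,
  max_procedure_violations: 1,
};

class ErrorBudgetTracker {
  private totalErrors = 0;
  private procedureViolations = 0;
  private worstRepeatCount = 0;
  private readonly countsBySignature = new Map<string, number>();
  private readonly budget: ErrorBudget;

  constructor(budget: ErrorBudget = DEFAULT_BUDGET) {
    this.budget = budget;
  }

  /** Records one error and returns true if the agent should now be revoked. */
  recordError(signature: string): boolean {
    this.totalErrors += 1;
    const count = (this.countsBySignature.get(signature) ?? 0) + 1;
    this.countsBySignature.set(signature, count);
    this.worstRepeatCount = Math.max(this.worstRepeatCount, count);
    return this.exceeded();
  }

  /** Records one procedure violation and returns true if the budget is exceeded. */
  recordProcedureViolation(): boolean {
    this.procedureViolations += 1;
    return this.exceeded();
  }

  // Assumption: "exceeded" means strictly greater than the configured limit.
  exceeded(): boolean {
    return (
      this.totalErrors > this.budget.max_total_errors ||
      this.worstRepeatCount > this.budget.max_same_error_repeats ||
      this.procedureViolations > this.budget.max_procedure_violations
    );
  }
}
```

When `exceeded()` turns true, the caller would revoke the agent and create the handoff described above.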
**Level 2: Pipeline-Level Recovery**
```
On agent failure:
  1. Record failure in DragonflyDB
  2. Check if partner agent can continue alone
  3. If both fail → Pipeline status = FAILED
  4. Create checkpoint with failure details
```

**Level 3: Orchestration-Level Recovery**
```
On orchestration timeout (120s default):
  1. Force-stop running agents
  2. Collect partial results from Blackboard
  3. Generate partial report
  4. Pipeline status = TIMEOUT
```

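The four Level 3 steps can be sketched as one function. This is illustrative only: `RunningAgent`, `BlackboardEntry`, `PartialReport`, and `handleOrchestrationTimeout` are hypothetical names, and the Blackboard is represented by a callback rather than a real DragonflyDB read.

```typescript
interface RunningAgent {
  id: string;
  stop: () => void; // assumed to force-stop the agent process
}

interface BlackboardEntry {
  agent_id: string;
  kind: "proposal" | "solution";
  content: string;
}

interface PartialReport {
  status: "TIMEOUT";
  stopped: string[];
  entries: BlackboardEntry[];
  summary: string;
}

function handleOrchestrationTimeout(
  agents: RunningAgent[],
  readBlackboard: () => BlackboardEntry[],
  timeoutSeconds: number = 120,
): PartialReport {
  // 1. Force-stop running agents.
  const stopped: string[] = [];
  for (const agent of agents) {
    agent.stop();
    stopped.push(agent.id);
  }
  // 2. Collect whatever the agents managed to write before the deadline.
  const entries = readBlackboard();
  // 3. Generate a partial report.
  const summary = `Timed out after ${timeoutSeconds}s with ${entries.length} partial result(s)`;
  // 4. The pipeline status becomes TIMEOUT.
  return { status: "TIMEOUT", stopped, entries, summary };
}
```

Stopping the agents before reading the Blackboard avoids collecting a result set that is still being written to.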
**GAMMA Spawn Conditions:**

| Condition | Threshold | Action |
|-----------|-----------|--------|
| STUCK | 30s no progress | Spawn GAMMA mediator |
| CONFLICT | 3+ unresolved proposals | Spawn GAMMA to arbitrate |
| COMPLEXITY | Score > 0.8 | Spawn GAMMA for decomposition |

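The table can be read as a single decision function, sketched below. The thresholds are the documented ones; the snapshot shape, the function name `gammaSpawnReason`, and the evaluation order (top to bottom, first match wins) are assumptions. "30s no progress" is interpreted as 30 seconds or more, while the complexity check is strict, as in the table. A score of exactly 0.8 therefore does not trigger a spawn in this sketch.

```typescript
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

interface OrchestrationSnapshot {
  secondsSinceProgress: number;
  unresolvedProposals: number;
  complexityScore: number;
}

interface GammaThresholds {
  stuckSeconds: number;
  conflictProposals: number;
  complexityScore: number;
}

const DEFAULT_THRESHOLDS: GammaThresholds = {
  stuckSeconds: 30,
  conflictProposals: 3,
  complexityScore: 0.8,
};

// Returns the reason GAMMA should be spawned, or null if no condition holds.
function gammaSpawnReason(
  snapshot: OrchestrationSnapshot,
  thresholds: GammaThresholds = DEFAULT_THRESHOLDS,
): GammaReason | null {
  if (snapshot.secondsSinceProgress >= thresholds.stuckSeconds) return "STUCK";
  if (snapshot.unresolvedProposals >= thresholds.conflictProposals) return "CONFLICT";
  if (snapshot.complexityScore > thresholds.complexityScore) return "COMPLEXITY";
  return null;
}
```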
---

## API Endpoints

### Existing (Modified)
- `POST /api/spawn` - Creates pipeline, now includes `auto_continue: boolean`
- `GET /api/checkpoint/report` - Returns report with continuation status

### New
- `POST /api/pipeline/continue` - Triggers OpenRouter orchestration
- `GET /api/pipeline/{id}/orchestration` - Gets orchestration status
- `POST /api/pipeline/{id}/stop` - Emergency stop

---

## WebSocket Events

### New Events
```typescript
// Orchestration started
{ type: "orchestration_started", data: { pipeline_id, model, agents: ["ALPHA", "BETA"] } }

// Agent spawned
{ type: "agent_spawned", data: { pipeline_id, agent_id, role, runtime } }

// Agent message
{ type: "agent_message", data: { pipeline_id, from, to, content } }

// GAMMA spawned (conditional)
{ type: "gamma_spawned", data: { pipeline_id, reason: "STUCK" | "CONFLICT" | "COMPLEXITY" } }

// Consensus reached
{ type: "consensus_reached", data: { pipeline_id, proposal_id, votes } }

// Orchestration complete
{ type: "orchestration_complete", data: { pipeline_id, status, results } }
```

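On the UI side, these events lend themselves to a discriminated union keyed on `type`, which lets the compiler check that every event is handled. The sketch below is one possible typing, not the dashboard's actual code: the event and field names come from the list above, but the field types (strings, and `unknown` for `votes` and `results`) and the `describeEvent` helper are assumptions.

```typescript
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

// Field types are assumed; only the names appear in the event list.
type OrchestrationEvent =
  | { type: "orchestration_started"; data: { pipeline_id: string; model: string; agents: string[] } }
  | { type: "agent_spawned"; data: { pipeline_id: string; agent_id: string; role: string; runtime: string } }
  | { type: "agent_message"; data: { pipeline_id: string; from: string; to: string; content: string } }
  | { type: "gamma_spawned"; data: { pipeline_id: string; reason: GammaReason } }
  | { type: "consensus_reached"; data: { pipeline_id: string; proposal_id: string; votes: unknown } }
  | { type: "orchestration_complete"; data: { pipeline_id: string; status: string; results: unknown } };

// Hypothetical helper: renders one event as a log line.
function describeEvent(event: OrchestrationEvent): string {
  const prefix = `[${event.data.pipeline_id}]`;
  switch (event.type) {
    case "orchestration_started":
      return `${prefix} orchestration started with ${event.data.model} (${event.data.agents.join(", ")})`;
    case "agent_spawned":
      return `${prefix} agent ${event.data.agent_id} spawned as ${event.data.role} on ${event.data.runtime}`;
    case "agent_message":
      return `${prefix} ${event.data.from} -> ${event.data.to}: ${event.data.content}`;
    case "gamma_spawned":
      return `${prefix} GAMMA spawned (${event.data.reason})`;
    case "consensus_reached":
      return `${prefix} consensus reached on ${event.data.proposal_id}`;
    case "orchestration_complete":
      return `${prefix} orchestration complete: ${event.data.status}`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a case is missing.
      const unreachable: never = event;
      throw new Error(`unhandled event: ${JSON.stringify(unreachable)}`);
    }
  }
}
```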
---

## Configuration

### Environment Variables
```bash
# Enable auto-orchestration after report
AUTO_ORCHESTRATE=true

# Default model for OpenRouter
OPENROUTER_MODEL=anthropic/claude-sonnet-4

# Orchestration timeout (seconds)
ORCHESTRATION_TIMEOUT=120

# GAMMA spawn thresholds
GAMMA_STUCK_THRESHOLD=30
GAMMA_CONFLICT_THRESHOLD=3
GAMMA_COMPLEXITY_THRESHOLD=0.8
```

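Reading these variables into a typed object keeps parsing errors out of the orchestration path. The loader below is a sketch with hypothetical names (`OrchestrationConfig`, `loadConfig`). The fallback values mirror the example values above, and treating an unset `AUTO_ORCHESTRATE` as `false` is an assumption.

```typescript
interface OrchestrationConfig {
  autoOrchestrate: boolean;
  model: string;
  timeoutSeconds: number;
  gamma: {
    stuckSeconds: number;
    conflictProposals: number;
    complexityScore: number;
  };
}

type Env = Record<string, string | undefined>;

// Parses a numeric variable, falling back when it is unset or empty.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${name} must be numeric, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): OrchestrationConfig {
  return {
    // Assumption: anything other than the literal "true" disables the feature.
    autoOrchestrate: env["AUTO_ORCHESTRATE"] === "true",
    model: env["OPENROUTER_MODEL"] ?? "anthropic/claude-sonnet-4",
    timeoutSeconds: readNumber(env, "ORCHESTRATION_TIMEOUT", 120),
    gamma: {
      stuckSeconds: readNumber(env, "GAMMA_STUCK_THRESHOLD", 30),
      conflictProposals: readNumber(env, "GAMMA_CONFLICT_THRESHOLD", 3),
      complexityScore: readNumber(env, "GAMMA_COMPLEXITY_THRESHOLD", 0.8),
    },
  };
}
```

In a Bun or Node.js process this would typically be called once at startup as `loadConfig(process.env)`.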
### Vault Secrets Required
```
secret/data/api-keys/openrouter
  └── api_key: "sk-or-..."

secret/data/services/dragonfly
  └── password: "..."
```

---

## Implementation Steps

### Step 1: Add Auto-Continue Logic to UI Server
- [x] Add `triggerOrchestration()` function
- [x] Modify `checkPipelineCompletion()` to check for auto_continue
- [x] Add `/api/pipeline/continue` endpoint

### Step 2: Connect to Multi-Agent Orchestrator
- [x] Spawn orchestrator.ts from UI via Bun.spawn()
- [x] Pass pipeline context (task_id, objective, model, timeout)
- [x] Wire up WebSocket events (orchestration_started, agent_message, consensus_event, orchestration_complete)

### Step 3: Add Orchestration Status Tracking
- [x] Track orchestration state in Redis (ORCHESTRATING status)
- [x] Add orchestration_started_at timestamp
- [x] Create checkpoint on completion

### Step 4: Implement Error Handling
- [x] Add timeout handling via orchestrator --timeout flag
- [x] Capture exit codes and error messages
- [x] Set ORCHESTRATION_FAILED or ORCHESTRATION_ERROR status on failure

### Step 5: Test End-to-End
- [x] Spawn pipeline with objective
- [x] Verify report generation
- [x] Verify auto-trigger to orchestration
- [x] Verify parallel agent execution
- [x] Verify results collection

### Demonstration Results (2026-01-24)
Successfully tested with `pipeline-mksufe23`:
- Pipeline spawned → ALPHA/BETA ran → Report generated → Auto-orchestration triggered
- GAMMA spawned due to complexity (0.8 threshold)
- Total orchestration time: 51.4 seconds
- Final status: COMPLETED

---

## Testing

### Manual Test Command
```bash
# 1. Start UI server
cd /opt/agent-governance/ui && bun run server.ts

# 2. Spawn pipeline via API
curl -X POST http://localhost:3000/api/spawn \
  -H "Content-Type: application/json" \
  -d '{"objective": "Design a caching strategy", "auto_continue": true}'

# 3. Watch WebSocket for events
# Pipeline should: SPAWN → RUNNING → REPORT → ORCHESTRATE → COMPLETE
```

### Validation Criteria
- [x] Pipeline reaches ORCHESTRATION phase automatically
- [x] Both ALPHA and BETA agents spawn
- [x] Agents communicate via MessageBus
- [x] Results appear in Blackboard
- [x] Final checkpoint created
- [x] Audit trail in SQLite

---

## Rollback Plan

If orchestration fails repeatedly:
1. Set `AUTO_ORCHESTRATE=false`
2. Pipeline will stop at REPORT phase
3. Manual intervention can trigger orchestration
4. Review logs in `/api/pipeline/logs`

---

*Document Version: 1.0*
*Last Updated: 2026-01-24*