Production Pipeline: Report → OpenRouter Orchestration
Overview
This document describes the automatic transition from the UI "view report" stage into the live multi-agent pipeline, including OpenRouter-driven parallel execution.
Created: 2026-01-24
Status: Implemented
Architecture Flow
┌─────────────────────────────────────────────────────────────────────────┐
│                              UI DASHBOARD                               │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │  SPAWN   │───▶│ RUNNING  │───▶│  REPORT  │───▶│ AUTO-ORCHESTRATE │   │
│  │ Pipeline │    │  Agents  │    │  Stage   │    │      (NEW)       │   │
│  └──────────┘    └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                                           │             │
└───────────────────────────────────────────────────────────┼─────────────┘
                                                            │
                                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        OPENROUTER ORCHESTRATION                         │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                     MultiAgentOrchestrator                      │   │
│   │                                                                 │   │
│   │          ┌─────────────┐              ┌─────────────┐           │   │
│   │          │    ALPHA    │◄────────────▶│    BETA     │           │   │
│   │          │ (Research)  │   Messages   │ (Synthesis) │           │   │
│   │          │   Python    │              │     Bun     │           │   │
│   │          └──────┬──────┘              └──────┬──────┘           │   │
│   │                 │                            │                  │   │
│   │                 └─────────┬──────────────────┘                  │   │
│   │                           │                                     │   │
│   │                           ▼                                     │   │
│   │                    ┌─────────────┐                              │   │
│   │                    │    GAMMA    │ (Spawned on STUCK/CONFLICT)  │   │
│   │                    │ (Mediator)  │                              │   │
│   │                    └─────────────┘                              │   │
│   │                                                                 │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   Shared Infrastructure:                                                │
│   • Blackboard (DragonflyDB) - Proposals, solutions, consensus          │
│   • MessageBus (Redis PubSub) - Agent coordination                      │
│   • MetricsCollector - Performance tracking                             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           COMPLETION & AUDIT                            │
│                                                                         │
│   • Results written to SQLite ledger                                    │
│   • Checkpoint created with final state                                 │
│   • WebSocket broadcast to UI                                           │
│   • Pipeline status → COMPLETED                                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Implementation Components
1. Auto-Orchestration Trigger
Location: /opt/agent-governance/ui/server.ts
Trigger Conditions:
- Pipeline reaches REPORT phase
- All agents have completed or timed out
- No critical failures blocking continuation
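The three trigger conditions can be read as a single predicate over the pipeline state. The sketch below is illustrative only: the `PipelineSnapshot` shape, the `AgentState` values, and the function name are assumptions, not the structures used in server.ts.

```typescript
// Hypothetical shapes; the real pipeline state in server.ts may differ.
type AgentState = "RUNNING" | "COMPLETED" | "TIMED_OUT" | "FAILED";

interface PipelineSnapshot {
  phase: string;            // e.g. "SPAWN", "RUNNING", "REPORT"
  agents: AgentState[];     // one entry per agent in the pipeline
  criticalFailures: number; // failures that block continuation
}

// Returns true only when all three documented trigger conditions hold.
function shouldAutoOrchestrate(p: PipelineSnapshot): boolean {
  const inReportPhase = p.phase === "REPORT";
  // Assumption: an agent still RUNNING or marked FAILED does not count as settled.
  const allSettled = p.agents.every(
    (a) => a === "COMPLETED" || a === "TIMED_OUT",
  );
  return inReportPhase && allSettled && p.criticalFailures === 0;
}
```

Keeping the check as a pure function makes it easy to unit test separately from the server's polling loop.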
New Endpoint: POST /api/pipeline/continue
{
  pipeline_id: string;
  mode: "openrouter" | "local";  // openrouter = full LLM, local = mock
  model?: string;                // Default: anthropic/claude-sonnet-4
  timeout?: number;              // Default: 120s
}
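A handler for this endpoint would need to validate the body and apply the two defaults. The following is a minimal sketch of that step, assuming the body arrives as parsed JSON; `parseContinueRequest` and the error messages are hypothetical and not taken from server.ts.

```typescript
interface ContinueRequest {
  pipeline_id: string;
  mode: "openrouter" | "local";
  model: string;   // resolved, never undefined
  timeout: number; // seconds
}

// Validates an untrusted JSON body and fills in the documented defaults.
function parseContinueRequest(body: unknown): ContinueRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { pipeline_id, mode, model, timeout } = body as Record<string, unknown>;

  if (typeof pipeline_id !== "string" || pipeline_id.length === 0) {
    throw new Error("pipeline_id is required");
  }
  if (mode !== "openrouter" && mode !== "local") {
    throw new Error('mode must be "openrouter" or "local"');
  }

  let resolvedModel = "anthropic/claude-sonnet-4"; // documented default
  if (model !== undefined) {
    if (typeof model !== "string") throw new Error("model must be a string");
    resolvedModel = model;
  }

  let resolvedTimeout = 120; // documented default, in seconds
  if (timeout !== undefined) {
    if (typeof timeout !== "number" || !(timeout > 0)) {
      throw new Error("timeout must be a positive number of seconds");
    }
    resolvedTimeout = timeout;
  }

  return { pipeline_id, mode, model: resolvedModel, timeout: resolvedTimeout };
}
```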
2. Parallel Agent Execution
Python Agent (ALPHA):
- Path: /opt/agent-governance/agents/llm-planner/governed_agent.py
- Role: Research, analysis, proposal generation
- Runtime: Python 3.11 with venv
Bun Agent (BETA):
- Path: /opt/agent-governance/agents/llm-planner-ts/governed-agent.ts
- Role: Synthesis, evaluation, solution building
- Runtime: Bun (4x faster than Node.js)
Coordination:
- Both agents connect to same DragonflyDB instance
- Shared Blackboard for structured data exchange
- MessageBus for real-time communication
- SpawnController monitors for GAMMA trigger conditions
3. OpenRouter Integration
Credential Flow:
Vault (secret/data/api-keys/openrouter)
│
▼
getVaultSecret() in agent code
│
▼
OpenAI client with baseURL: "https://openrouter.ai/api/v1"
│
▼
Model: anthropic/claude-sonnet-4
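To make the last two steps of the flow concrete, the sketch below builds (but does not send) an OpenAI-compatible chat completion request against the OpenRouter base URL. It uses plain request data rather than the OpenAI client so that it stays self-contained. `buildOpenRouterRequest` is a hypothetical helper, and the API key is assumed to have been read from Vault already; the signature of `getVaultSecret()` is not shown in this document.

```typescript
// Hypothetical helper: describes the HTTP request without performing it.
interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildOpenRouterRequest(
  apiKey: string, // assumed to come from secret/data/api-keys/openrouter
  prompt: string,
  model: string = "anthropic/claude-sonnet-4",
): ChatRequest {
  if (apiKey.length === 0) throw new Error("missing OpenRouter API key");
  return {
    // Base URL from the credential flow plus the chat completions path.
    url: "https://openrouter.ai/api/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}
```

The result can be passed to `fetch(request.url, request.init)`; keeping construction separate from sending lets the request be inspected in tests without network access.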
Rate Limiting:
- Handled by OpenRouter API
- Circuit breaker in governance layer (5 failures → open)
- Per-agent token budget tracking
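The circuit breaker rule (5 failures → open) might look like the sketch below. It assumes the count refers to consecutive failures and omits any half-open or recovery state, since this document does not describe one; the class and method names are illustrative.

```typescript
type BreakerState = "closed" | "open";

class CircuitBreaker {
  private failures = 0;
  private state: BreakerState = "closed";
  private readonly threshold: number;

  constructor(threshold: number = 5) {
    this.threshold = threshold;
  }

  // A success resets the consecutive-failure count (assumption).
  recordSuccess(): void {
    this.failures = 0;
  }

  // Opens the breaker once the threshold is reached.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.state = "open";
  }

  // Callers should skip the OpenRouter request while the breaker is open.
  allowsRequest(): boolean {
    return this.state === "closed";
  }
}
```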
4. Error Handling & Failover
Level 1: Agent-Level Recovery
Error Budget per agent:
- max_total_errors: 8
- max_same_error_repeats: 2
- max_procedure_violations: 1
On budget exceeded → Agent revoked, handoff created
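One way to track this budget is a small counter object per agent. The sketch assumes "exceeded" means a counter is strictly greater than its limit and that errors are identified by a string signature; both are interpretations, and `ErrorBudgetTracker` is a hypothetical name.

```typescript
interface ErrorBudget {
  max_total_errors: number;
  max_same_error_repeats: number;
  max_procedure_violations: number;
}

const DEFAULT_BUDGET: ErrorBudget = {
  max_total_errors: 8,
  max_same_error_repeats: 2,
  max_procedure_violations: 1,
};

class ErrorBudgetTracker {
  private total = 0;
  private violations = 0;
  private worstRepeat = 0;
  private readonly counts = new Map<string, number>();
  private readonly budget: ErrorBudget;

  constructor(budget: ErrorBudget = DEFAULT_BUDGET) {
    this.budget = budget;
  }

  // signature identifies "the same error", e.g. an error code (assumption).
  recordError(signature: string): void {
    this.total += 1;
    const n = (this.counts.get(signature) ?? 0) + 1;
    this.counts.set(signature, n);
    this.worstRepeat = Math.max(this.worstRepeat, n);
  }

  recordViolation(): void {
    this.violations += 1;
  }

  // True once any counter is strictly above its limit.
  isExceeded(): boolean {
    return (
      this.total > this.budget.max_total_errors ||
      this.worstRepeat > this.budget.max_same_error_repeats ||
      this.violations > this.budget.max_procedure_violations
    );
  }
}
```

When `isExceeded()` returns true, the governance layer would revoke the agent and create the handoff described above.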
Level 2: Pipeline-Level Recovery
On agent failure:
1. Record failure in DragonflyDB
2. Check if partner agent can continue alone
3. If both fail → Pipeline status = FAILED
4. Create checkpoint with failure details
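Steps 2 and 3 amount to a small decision function. In the sketch below, the decision names `CONTINUE` and `CONTINUE_SOLO` are hypothetical, and treating "partner cannot continue alone" as `FAILED` is an assumption, since the steps above only specify the case where both agents fail.

```typescript
type AgentOutcome = "ok" | "failed";
type PipelineDecision = "CONTINUE" | "CONTINUE_SOLO" | "FAILED";

function decideAfterFailure(
  alpha: AgentOutcome,
  beta: AgentOutcome,
  partnerCanContinueAlone: boolean,
): PipelineDecision {
  if (alpha === "ok" && beta === "ok") return "CONTINUE";
  if (alpha === "failed" && beta === "failed") return "FAILED";
  // Exactly one agent failed: continue only if the survivor can work alone.
  return partnerCanContinueAlone ? "CONTINUE_SOLO" : "FAILED";
}
```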
Level 3: Orchestration-Level Recovery
On orchestration timeout (120s default):
1. Force-stop running agents
2. Collect partial results from Blackboard
3. Generate partial report
4. Pipeline status = TIMEOUT
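The timeout path can be sketched with `Promise.race`. Here `stopAgents` and `collectPartial` are hypothetical callbacks standing in for the force-stop and Blackboard collection steps; generating the partial report is left out.

```typescript
type Outcome<T, P> =
  | { status: "COMPLETED"; results: T }
  | { status: "TIMEOUT"; results: P };

async function runWithTimeout<T, P>(
  work: Promise<T>,          // the running orchestration
  timeoutSeconds: number,    // 120 by default in this design
  stopAgents: () => void,    // step 1: force-stop running agents
  collectPartial: () => P,   // step 2: read partial results
): Promise<Outcome<T, P>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<{ done: false }>((resolve) => {
    timer = setTimeout(() => resolve({ done: false }), timeoutSeconds * 1000);
  });
  const finished = work.then((value) => ({ done: true as const, value }));
  try {
    const first = await Promise.race([finished, deadline]);
    if (first.done) return { status: "COMPLETED", results: first.value };
    stopAgents();
    return { status: "TIMEOUT", results: collectPartial() };
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after completion
  }
}
```

Note that `Promise.race` does not cancel the losing promise, which is why an explicit `stopAgents` step is still needed.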
GAMMA Spawn Conditions:
| Condition | Threshold | Action |
|---|---|---|
| STUCK | 30s no progress | Spawn GAMMA mediator |
| CONFLICT | 3+ unresolved proposals | Spawn GAMMA to arbitrate |
| COMPLEXITY | Score > 0.8 | Spawn GAMMA for decomposition |
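The table translates directly into a threshold check. In this sketch the signal names and the evaluation order (STUCK, then CONFLICT, then COMPLEXITY) are assumptions; the document does not say which condition wins when several hold at once, and this is not the SpawnController's actual code.

```typescript
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

interface GammaThresholds {
  stuckSeconds: number;      // seconds without progress
  conflictProposals: number; // unresolved proposals
  complexityScore: number;   // score must be strictly above this
}

const DEFAULT_THRESHOLDS: GammaThresholds = {
  stuckSeconds: 30,
  conflictProposals: 3,
  complexityScore: 0.8,
};

// Hypothetical signals the monitor would sample.
interface OrchestrationSignals {
  secondsSinceProgress: number;
  unresolvedProposals: number;
  complexityScore: number;
}

// Returns the reason to spawn GAMMA, or null if no condition is met.
function gammaSpawnReason(
  s: OrchestrationSignals,
  t: GammaThresholds = DEFAULT_THRESHOLDS,
): GammaReason | null {
  if (s.secondsSinceProgress >= t.stuckSeconds) return "STUCK";
  if (s.unresolvedProposals >= t.conflictProposals) return "CONFLICT"; // "3+"
  if (s.complexityScore > t.complexityScore) return "COMPLEXITY";      // "> 0.8"
  return null;
}
```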
API Endpoints
Existing (Modified)
- POST /api/spawn - Creates pipeline, now includes auto_continue: boolean
- GET /api/checkpoint/report - Returns report with continuation status
New
- POST /api/pipeline/continue - Triggers OpenRouter orchestration
- GET /api/pipeline/{id}/orchestration - Gets orchestration status
- POST /api/pipeline/{id}/stop - Emergency stop
WebSocket Events
New Events
// Orchestration started
{ type: "orchestration_started", data: { pipeline_id, model, agents: ["ALPHA", "BETA"] } }
// Agent spawned
{ type: "agent_spawned", data: { pipeline_id, agent_id, role, runtime } }
// Agent message
{ type: "agent_message", data: { pipeline_id, from, to, content } }
// GAMMA spawned (conditional)
{ type: "gamma_spawned", data: { pipeline_id, reason: "STUCK" | "CONFLICT" | "COMPLEXITY" } }
// Consensus reached
{ type: "consensus_reached", data: { pipeline_id, proposal_id, votes } }
// Orchestration complete
{ type: "orchestration_complete", data: { pipeline_id, status, results } }
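On the client side, these events fit naturally into a discriminated union keyed on `type`. The field types below are assumptions (the list above only names the fields), and `describeEvent` is a hypothetical consumer shown to illustrate exhaustive handling.

```typescript
type GammaReason = "STUCK" | "CONFLICT" | "COMPLEXITY";

// Field types are assumed; only the field names come from the event list.
type OrchestrationEvent =
  | { type: "orchestration_started"; data: { pipeline_id: string; model: string; agents: string[] } }
  | { type: "agent_spawned"; data: { pipeline_id: string; agent_id: string; role: string; runtime: string } }
  | { type: "agent_message"; data: { pipeline_id: string; from: string; to: string; content: string } }
  | { type: "gamma_spawned"; data: { pipeline_id: string; reason: GammaReason } }
  | { type: "consensus_reached"; data: { pipeline_id: string; proposal_id: string; votes: unknown } }
  | { type: "orchestration_complete"; data: { pipeline_id: string; status: string; results: unknown } };

// The switch is exhaustive, so adding a new event type is a compile error here.
function describeEvent(e: OrchestrationEvent): string {
  switch (e.type) {
    case "orchestration_started":
      return `started ${e.data.pipeline_id} with ${e.data.agents.join(", ")}`;
    case "agent_spawned":
      return `spawned ${e.data.agent_id} (${e.data.role}, ${e.data.runtime})`;
    case "agent_message":
      return `${e.data.from} -> ${e.data.to}`;
    case "gamma_spawned":
      return `GAMMA spawned: ${e.data.reason}`;
    case "consensus_reached":
      return `consensus on ${e.data.proposal_id}`;
    case "orchestration_complete":
      return `complete: ${e.data.status}`;
  }
}
```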
Configuration
Environment Variables
# Enable auto-orchestration after report
AUTO_ORCHESTRATE=true
# Default model for OpenRouter
OPENROUTER_MODEL=anthropic/claude-sonnet-4
# Orchestration timeout (seconds)
ORCHESTRATION_TIMEOUT=120
# GAMMA spawn thresholds
GAMMA_STUCK_THRESHOLD=30
GAMMA_CONFLICT_THRESHOLD=3
GAMMA_COMPLEXITY_THRESHOLD=0.8
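A loader for these variables could look like the sketch below. It takes the environment as a plain object so it can be tested without touching the real process environment. The fallback values mirror the examples above, except that treating an unset `AUTO_ORCHESTRATE` as disabled is an assumption; `loadConfig` and `OrchestrationConfig` are hypothetical names.

```typescript
interface OrchestrationConfig {
  autoOrchestrate: boolean;
  model: string;
  timeoutSeconds: number;
  gamma: {
    stuckThreshold: number;
    conflictThreshold: number;
    complexityThreshold: number;
  };
}

// Parses a numeric variable, falling back when it is unset or empty.
function numberOr(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n)) throw new Error(`not a number: ${raw}`);
  return n;
}

function loadConfig(env: Record<string, string | undefined>): OrchestrationConfig {
  return {
    // Assumption: anything other than the literal "true" disables the feature.
    autoOrchestrate: env.AUTO_ORCHESTRATE === "true",
    model: env.OPENROUTER_MODEL ?? "anthropic/claude-sonnet-4",
    timeoutSeconds: numberOr(env.ORCHESTRATION_TIMEOUT, 120),
    gamma: {
      stuckThreshold: numberOr(env.GAMMA_STUCK_THRESHOLD, 30),
      conflictThreshold: numberOr(env.GAMMA_CONFLICT_THRESHOLD, 3),
      complexityThreshold: numberOr(env.GAMMA_COMPLEXITY_THRESHOLD, 0.8),
    },
  };
}
```

Under Bun or Node.js this would typically be called once at startup as `loadConfig(process.env)`.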
Vault Secrets Required
secret/data/api-keys/openrouter
└── api_key: "sk-or-..."
secret/data/services/dragonfly
└── password: "..."
Implementation Steps
Step 1: Add Auto-Continue Logic to UI Server
- Add triggerOrchestration() function
- Modify checkPipelineCompletion() to check for auto_continue
- Add /api/pipeline/continue endpoint
Step 2: Connect to Multi-Agent Orchestrator
- Spawn orchestrator.ts from UI via Bun.spawn()
- Pass pipeline context (task_id, objective, model, timeout)
- Wire up WebSocket events (orchestration_started, agent_message, consensus_reached, orchestration_complete)
Step 3: Add Orchestration Status Tracking
- Track orchestration state in Redis (ORCHESTRATING status)
- Add orchestration_started_at timestamp
- Create checkpoint on completion
Step 4: Implement Error Handling
- Add timeout handling via orchestrator --timeout flag
- Capture exit codes and error messages
- Set ORCHESTRATION_FAILED or ORCHESTRATION_ERROR status on failure
Step 5: Test End-to-End
- Spawn pipeline with objective
- Verify report generation
- Verify auto-trigger to orchestration
- Verify parallel agent execution
- Verify results collection
Demonstration Results (2026-01-24)
Successfully tested with pipeline-mksufe23:
- Pipeline spawned → ALPHA/BETA ran → Report generated → Auto-orchestration triggered
- GAMMA spawned due to complexity (0.8 threshold)
- Total orchestration time: 51.4 seconds
- Final status: COMPLETED
Testing
Manual Test Command
# 1. Start UI server
cd /opt/agent-governance/ui && bun run server.ts
# 2. Spawn pipeline via API
curl -X POST http://localhost:3000/api/spawn \
  -H "Content-Type: application/json" \
  -d '{"objective": "Design a caching strategy", "auto_continue": true}'
# 3. Watch WebSocket for events
# Pipeline should: SPAWN → RUNNING → REPORT → ORCHESTRATE → COMPLETE
Validation Criteria
- Pipeline reaches ORCHESTRATION phase automatically
- Both ALPHA and BETA agents spawn
- Agents communicate via MessageBus
- Results appear in Blackboard
- Final checkpoint created
- Audit trail in SQLite
Rollback Plan
If orchestration fails repeatedly:
- Set AUTO_ORCHESTRATE=false
- Pipeline will stop at REPORT phase
- Manual intervention can trigger orchestration
- Review logs in /api/pipeline/logs
Document Version: 1.0
Last Updated: 2026-01-24