profit ef18567674 Implement real supervisor-driven auto-recovery
Orchestrator changes:
- Force-spawn GAMMA on iteration_limit before abort
- GAMMA.synthesize() creates emergency handoff payload
- loadRecoveryContext() logs "Resuming from {task_id} handoff"
- POST to /api/pipeline/log for resume message visibility

AgentGamma changes:
- Add synthesize() method for emergency abort synthesis
- Merges existing proposals into coherent handoff
- Stores as synthesis_type: "abort_recovery"

Server changes:
- Add POST /api/pipeline/log endpoint for orchestrator logging
- Recovery pipeline properly inherits GAMMA synthesis

Test coverage:
- test_auto_recovery.py: 6 unit tests
- test_e2e_auto_recovery.py: 5 E2E tests
- test_supervisor_recovery.py: 3 supervisor tests
  - Success on attempt 2 (recovery works)
  - Max failures (3 retries then FAILED)
  - Success on attempt 1 (no recovery needed)

Recovery flow:
1. iteration_limit triggers
2. GAMMA force-spawned for emergency synthesis
3. Handoff dumped with GAMMA synthesis
4. Exit code 3 triggers auto-recovery
5. Recovery pipeline loads handoff
6. Logs "Resuming from {prior_pipeline} handoff"
7. Repeat up to 3 times or until success

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:47:56 -05:00

Agent Governance System

Production-grade framework for governing AI agent execution with multi-agent orchestration, Vault-backed security, real-time observability, and consensus-driven workflows.

Status: Phase 12 COMPLETE | Tests: 295/295 passing | Coverage: All 12 phases validated


Quick Start

# Check system health
checkpoint load                           # Load session state
checkpoint report                         # View combined status
validate-phases --verbose                 # Run full validation (295 tests)

# Run the orchestration dashboard
cd /opt/agent-governance/ui && bun run server.ts
# Dashboard: http://localhost:3000

# Bug tracking
bugs list --status open                   # View open bugs
bugs log -m "Description" --severity high # Log new bug

# Pipeline operations
pipeline spawn --plan <plan_id> --tier 1  # Spawn pipeline agents

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              GOVERNANCE LAYER                                    │
│  ┌──────────────────┐  ┌───────────────────┐  ┌─────────────────────────────┐  │
│  │  HashiCorp Vault │  │    DragonflyDB    │  │      SQLite Ledger          │  │
│  │                  │  │                   │  │                             │  │
│  │  - Per-pipeline  │  │  - Blackboard     │  │  - agent_actions            │  │
│  │    token mgmt    │  │  - Metrics        │  │  - agent_metrics            │  │
│  │  - 2hr TTL +     │  │  - Consensus      │  │  - violations               │  │
│  │    auto-renewal  │  │  - Message bus    │  │  - promotions               │  │
│  │  - Observability │  │  - Error budgets  │  │  - tenants/projects         │  │
│  │    revocation    │  │  - WebSocket pub  │  │  - marketplace              │  │
│  └──────────────────┘  └───────────────────┘  └─────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                           ORCHESTRATION LAYER                                    │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │                       Multi-Agent Pipeline                               │    │
│  │                                                                          │    │
│  │   SPAWN ──► RUNNING ──► REPORT ──► ORCHESTRATING ──► COMPLETED          │    │
│  │     │          │          │             │                │               │    │
│  │   Issue     Agent      Report       ALPHA+BETA       Consensus          │    │
│  │   Vault     Status     Ready        Parallel         Achieved           │    │
│  │   Token     Updates                     │                               │    │
│  │                                   Error/Stuck?                          │    │
│  │                                         │ YES                           │    │
│  │                                   SPAWN GAMMA                           │    │
│  │                                   (Mediator)                            │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────────────────────────┤
│                               AGENT LAYER                                        │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐  │
│  │  Agent ALPHA  │  │  Agent BETA   │  │  Agent GAMMA  │  │  Governed LLM   │  │
│  │  (Research)   │◄─┼─► (Synthesis) │◄─┼─► (Mediator)  │  │  (T0/T1/T2)     │  │
│  │               │  │               │  │               │  │                 │  │
│  │  Parallel     │  │  Direct       │  │  Spawned on:  │  │  - llm-planner  │  │
│  │  Execution    │  │  Messages     │  │  - Stuck 30s  │  │  - tier0-agent  │  │
│  │               │  │               │  │  - Conflict 3 │  │  - tier1-agent  │  │
│  │               │  │               │  │  - Complex .8 │  │                 │  │
│  └───────┬───────┘  └───────┬───────┘  └───────────────┘  └─────────────────┘  │
│          └──────────────────┴──────────────────────────────────────────────────│
│                                    │                                            │
│                         ┌──────────▼──────────┐                                 │
│                         │     Blackboard      │                                 │
│                         │  - problem          │                                 │
│                         │  - solutions[]      │                                 │
│                         │  - progress         │                                 │
│                         │  - consensus        │                                 │
│                         └─────────────────────┘                                 │
├─────────────────────────────────────────────────────────────────────────────────┤
│                               UI / API LAYER                                     │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │  Orchestration Dashboard (Bun + WebSocket)                              │    │
│  │  - Real-time pipeline status        - Agent lifecycle cards             │    │
│  │  - Consensus failure alerts         - Fallback action buttons           │    │
│  │  - Log streaming                    - Metrics display                   │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────┘
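
The Blackboard box in the diagram lists four fields (problem, solutions[], progress, consensus). As a minimal sketch of how such a shared structure could be modeled, the Python dataclass below uses those field names; the field types, the `propose` helper, and the sample values are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Blackboard:
    """Shared state for one pipeline; field names follow the diagram."""
    problem: str
    solutions: List[Dict[str, Any]] = field(default_factory=list)
    progress: Dict[str, str] = field(default_factory=dict)
    consensus: Optional[Dict[str, Any]] = None  # None until agents agree

    def propose(self, agent: str, content: str) -> None:
        # Hypothetical helper: append a proposal and note the agent's progress.
        self.solutions.append({"agent": agent, "content": content})
        self.progress[agent] = "proposed"

    def reached_consensus(self) -> bool:
        return self.consensus is not None


board = Blackboard(problem="Summarize open bugs")
board.propose("ALPHA", "Group by severity")
board.propose("BETA", "Group by component")
print(len(board.solutions), board.reached_consensus())  # 2 False
```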

Core Components

Component       Purpose                                                      Status
agents/         ALPHA/BETA/GAMMA multi-agent + T0/T1/T2 governed agents      Complete
ui/             Orchestration dashboard with WebSocket real-time updates     Complete
pipeline/       Pipeline DSL, templates, and execution engine                Complete
orchestrator/   Multi-agent coordination with consensus tracking             Complete
observability/  Prometheus metrics, distributed tracing, structured logging  Complete
marketplace/    Agent template registry with FTS5 search                     Complete
checkpoint/     Session state management and recovery                        Complete
ledger/         SQLite audit trail with multi-tenant support                 Complete
testing/        295 tests across 12 phases + chaos testing                   Complete

Key Workflows

Multi-Agent Pipeline

  1. Spawn: Pipeline is created with an objective and issued a Vault token (2hr TTL, auto-renew)
  2. Running: ALPHA (research) and BETA (synthesis) agents work in parallel
  3. Orchestrating: Agents communicate via blackboard + direct messages
  4. Consensus: Proposals evaluated, votes counted, conflicts resolved
  5. GAMMA Spawn: Mediator is spawned if agents are stuck >30s, conflicts >3, or complexity >0.8
  6. Completion: Final consensus achieved or fallback action taken
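
The GAMMA spawn rule in step 5 can be sketched as a small predicate. The thresholds (30 seconds, 3 conflicts, 0.8 complexity) come from the list above; the `PipelineSignals` type, its field names, and the strict greater-than comparisons are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

STUCK_SECONDS = 30       # "stuck >30s"
MAX_CONFLICTS = 3        # "conflicts >3"
COMPLEXITY_LIMIT = 0.8   # "complexity >0.8"


@dataclass
class PipelineSignals:
    """Hypothetical snapshot of the signals the orchestrator watches."""
    seconds_without_progress: float
    conflict_count: int
    complexity: float


def gamma_spawn_reasons(signals: PipelineSignals) -> List[str]:
    """Return every reason GAMMA should be spawned (empty list means none)."""
    reasons = []
    if signals.seconds_without_progress > STUCK_SECONDS:
        reasons.append("stuck")
    if signals.conflict_count > MAX_CONFLICTS:
        reasons.append("conflicts")
    if signals.complexity > COMPLEXITY_LIMIT:
        reasons.append("complexity")
    return reasons


print(gamma_spawn_reasons(PipelineSignals(45, 1, 0.5)))  # ['stuck']
print(gamma_spawn_reasons(PipelineSignals(30, 3, 0.8)))  # []
```

Returning the list of reasons rather than a bare boolean keeps the trigger visible in logs, which may help when reviewing why a mediator was spawned.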

Consensus Failure Handling

When agents fail to reach consensus:

  • Rerun Same: Spawn fresh ALPHA/BETA with failure context
  • Rerun with GAMMA: Force mediator agent for conflict resolution
  • Escalate Tier: Increase agent permissions and retry
  • Accept Partial: Complete with best available proposal
  • Download Log: Export full context for manual review
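
As a rough sketch, a fallback option could be submitted from Python by building a POST request for the `/api/pipeline/consensus/fallback` endpoint shown in the CLI section. The only action identifier that appears in this README is `rerun_gamma`; the helper name, the base URL constant, and the placeholder pipeline ID are assumptions, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:3000"  # dashboard address from Quick Start


def build_fallback_request(pipeline_id: str, action: str) -> Request:
    """Build (but do not send) a fallback request for a stalled pipeline."""
    body = json.dumps({"pipeline_id": pipeline_id, "action": action})
    return Request(
        f"{BASE_URL}/api/pipeline/consensus/fallback",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "pipeline-123" is a placeholder ID; "rerun_gamma" is the action used in
# the curl example elsewhere in this README.
req = build_fallback_request("pipeline-123", "rerun_gamma")
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it to a running dashboard.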

Vault Token Lifecycle

Pipeline Start
      │
      ▼
┌─────────────────────────────────────┐
│ 1. Request Token (AppRole)          │
│    TTL: 2 hours, renewable          │
│    Policy: pipeline-agent           │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 2. Store in DragonflyDB (encrypted) │
│    Key: pipeline:{id}:vault_token   │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 3. Pass to ALPHA, BETA, GAMMA       │
│    Auto-renewal every 30 min        │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 4. Observability monitors usage     │
│    Revoke on policy violation       │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 5. Revoke on completion/error       │
└─────────────────────────────────────┘
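
The timing rules in the lifecycle above (2 hour TTL, renewal every 30 minutes, revocation at the end) can be modeled without touching Vault. The sketch below is a plain in-memory model, not Vault client code; the class and method names are hypothetical, and it assumes that each renewal resets the TTL to a full 2 hours from the renewal time.

```python
from dataclasses import dataclass

TOKEN_TTL_SECONDS = 2 * 60 * 60    # 2 hour TTL, from the diagram
RENEW_INTERVAL_SECONDS = 30 * 60   # auto-renewal every 30 min, from the diagram


@dataclass
class PipelineToken:
    pipeline_id: str
    expires_at: float
    last_renewed_at: float
    revoked: bool = False

    @classmethod
    def issue(cls, pipeline_id: str, now: float) -> "PipelineToken":
        return cls(pipeline_id, now + TOKEN_TTL_SECONDS, now)

    @property
    def storage_key(self) -> str:
        # Key format from step 2 of the diagram.
        return f"pipeline:{self.pipeline_id}:vault_token"

    def is_usable(self, now: float) -> bool:
        return not self.revoked and now < self.expires_at

    def needs_renewal(self, now: float) -> bool:
        due = now - self.last_renewed_at >= RENEW_INTERVAL_SECONDS
        return self.is_usable(now) and due

    def renew(self, now: float) -> None:
        if not self.is_usable(now):
            raise RuntimeError("cannot renew an expired or revoked token")
        # Assumption: renewal extends the TTL from the renewal time.
        self.expires_at = now + TOKEN_TTL_SECONDS
        self.last_renewed_at = now

    def revoke(self) -> None:
        self.revoked = True


token = PipelineToken.issue("demo", now=0.0)
print(token.storage_key)            # pipeline:demo:vault_token
print(token.needs_renewal(1800.0))  # True
token.renew(1800.0)
print(token.expires_at)             # 9000.0
token.revoke()
print(token.is_usable(1801.0))      # False
```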

CLI Tools

Context Management

# Checkpoints - session state snapshots
checkpoint now --notes "..."       # Create checkpoint
checkpoint load                    # Load latest
checkpoint report                  # Combined status view
checkpoint timeline                # History

# Status - per-directory tracking
status sweep                       # Check all directories
status update <dir> --phase <p>    # Update status
status dashboard                   # Overview

# Memory - large content storage
memory log --stdin                 # Store from pipe
memory fetch <id> -s               # Get summary
memory list                        # Browse entries

Bug Tracking

bugs list                          # List all bugs
bugs list --status open            # Filter by status
bugs list --severity high          # Filter by severity
bugs log -m "Description"          # Log new bug
bugs update <id> resolved          # Update status
bugs get <id>                      # Get details
bugs scan                          # Scan for anomalies
bugs status                        # Summary view

Pipeline Operations

# Validation
validate-phases --verbose          # Full 12-phase validation

# Pipeline management (via dashboard API)
curl -X POST localhost:3000/api/spawn \
  -H 'Content-Type: application/json' \
  -d '{"plan_id":"...", "tier":1}'

# Consensus handling (URL quoted so the shell does not glob the "?")
curl 'localhost:3000/api/pipeline/consensus/status?pipeline_id=...'
curl -X POST localhost:3000/api/pipeline/consensus/fallback \
  -H 'Content-Type: application/json' \
  -d '{"pipeline_id":"...", "action":"rerun_gamma"}'

Phase Completion Status

Phase  Name                        Tests    Status
1      Foundation                  12/12    Complete
2      Secrets Management          14/14    Complete
3      Agent Execution             19/19    Complete
4      Promotion & Revocation      16/16    Complete
5      Bootstrap & Checkpointing   22/22    Complete
6      Multi-Agent Orchestration   56/56    Complete
7      Monitoring & Learning       46/46    Complete
8      Production Hardening        31/31    Complete
9      External Integrations       -        Framework retained, external deprecated
10     Multi-Tenant Support        18/18    Complete
11     Agent Marketplace           16/16    Complete
12     Observability               21/21    Complete
Total                              295/295  Complete

Dependencies

Service          Purpose                    Endpoint
HashiCorp Vault  Secrets, token management  https://127.0.0.1:8200
DragonflyDB      State, metrics, pub/sub    redis://127.0.0.1:6379
SQLite           Audit ledger, marketplace  File-based
Bun              TypeScript runtime         Local
OpenRouter       LLM API gateway            External

Directory Structure

agent-governance/
├── agents/               # Agent implementations
│   ├── multi-agent/      # ALPHA/BETA/GAMMA orchestrator
│   ├── llm-planner/      # Python LLM agent
│   ├── llm-planner-ts/   # TypeScript LLM agent
│   ├── tier0-agent/      # Observer tier (read-only)
│   └── tier1-agent/      # Executor tier (write)
├── bin/                  # CLI tools
├── checkpoint/           # Session state management
├── docs/                 # Documentation
├── evidence/             # Audit evidence packages
├── integrations/         # Integration framework
├── ledger/               # SQLite audit ledger + API
├── marketplace/          # Agent template registry
├── memory/               # External memory layer
├── observability/        # Metrics, tracing, logging
├── orchestrator/         # Pipeline orchestration
├── pipeline/             # Pipeline DSL and templates
├── preflight/            # Pre-execution validation
├── sandbox/              # Terraform/Ansible sandbox
├── testing/              # Test framework + oversight
├── tests/                # Test suites (295 tests)
└── ui/                   # Orchestration dashboard

Documentation

Architecture & Design

Document                              Description
ARCHITECTURE.md                       Full system design
MULTI_AGENT_PIPELINE_ARCHITECTURE.md  Pipeline flow, Vault tokens, agent lifecycle
PHASE_DEPENDENCY_ANALYSIS.md          Phase dependencies and order

Implementation & Operations

Document                Description
PRODUCTION_PIPELINE.md  Implementation plan and production workflows
ENGINEERING_GUIDE.md    Runtime governance spec and quick reference
CREDENTIALS_SETUP.md    Vault and DragonflyDB setup

Context & Memory

Document               Description
CONTEXT_MANAGEMENT.md  Checkpoints, STATUS, Memory
STATUS_PROTOCOL.md     Directory status protocol
MEMORY_LAYER.md        External memory layer details

Agent Documentation

Document          Description
agents/README.md  Agent foundation and tier system
tier0-guide.md    Tier 0 agent guide

External References

Resource         Description
HashiCorp Vault  Secrets management documentation
Bun Runtime      TypeScript runtime documentation
DragonflyDB      Redis-compatible database docs

Production Constraints

Token Revocation Triggers

Condition         Threshold                 Action
Error rate        > 5 errors/minute         Revoke + spawn diagnostic
Stuck agent       > 60 seconds no progress  Revoke agent token only
Policy violation  Any CRITICAL              Immediate full revocation
Resource abuse    > 100 API calls/minute    Rate limit, then revoke
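
The trigger table above maps naturally onto a single decision function. The sketch below uses the thresholds and actions from the table; the `AgentMetrics` type, the field names, and the order in which conditions are checked (critical policy violations first) are assumptions, since the table does not state a precedence.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    FULL_REVOCATION = "immediate full revocation"
    REVOKE_AND_DIAGNOSE = "revoke + spawn diagnostic"
    REVOKE_AGENT_TOKEN = "revoke agent token only"
    RATE_LIMIT_THEN_REVOKE = "rate limit, then revoke"


@dataclass
class AgentMetrics:
    """Hypothetical per-agent measurements."""
    errors_per_minute: float = 0.0
    seconds_without_progress: float = 0.0
    critical_violation: bool = False
    api_calls_per_minute: float = 0.0


def revocation_action(m: AgentMetrics) -> Action:
    # Assumed precedence: the most severe condition wins.
    if m.critical_violation:
        return Action.FULL_REVOCATION
    if m.errors_per_minute > 5:
        return Action.REVOKE_AND_DIAGNOSE
    if m.seconds_without_progress > 60:
        return Action.REVOKE_AGENT_TOKEN
    if m.api_calls_per_minute > 100:
        return Action.RATE_LIMIT_THEN_REVOKE
    return Action.NONE


print(revocation_action(AgentMetrics(errors_per_minute=6)).value)
print(revocation_action(AgentMetrics(api_calls_per_minute=100)).value)
```

With these inputs the first call reports the error-rate action and the second reports no action, because 100 calls per minute does not exceed the threshold.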

Consensus Requirements

  • Pipelines remain in ORCHESTRATING until consensus achieved
  • Exit code 0 = success, 1 = error, 2 = consensus failure, 3 = iteration limit (triggers auto-recovery)
  • Failure context recorded to DragonflyDB for retry attempts
  • User must explicitly accept partial output to complete without consensus
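
A supervisor that reacts to these exit codes might look like the sketch below. Exit codes 0, 1, and 2 are from the list above, and exit code 3 as the auto-recovery trigger is taken from the commit note at the top of this page. The function name, the status strings, the choice not to auto-retry codes 1 and 2, and the reading of "up to 3 times" as three attempts in total are all assumptions.

```python
from typing import Callable

EXIT_SUCCESS = 0
EXIT_ERROR = 1
EXIT_CONSENSUS_FAILURE = 2
EXIT_RECOVERABLE = 3   # commit note: "Exit code 3 triggers auto-recovery"
MAX_ATTEMPTS = 3       # assumption: three attempts in total


def supervise(run_pipeline: Callable[[int], int],
              max_attempts: int = MAX_ATTEMPTS) -> str:
    """Run the pipeline, retrying only while it exits with code 3."""
    for attempt in range(1, max_attempts + 1):
        code = run_pipeline(attempt)
        if code == EXIT_SUCCESS:
            return "COMPLETED"
        if code != EXIT_RECOVERABLE:
            # Assumption: plain errors and consensus failures are not
            # retried automatically; they need operator action.
            return "FAILED"
    return "FAILED"


# Simulated run: first attempt hits the iteration limit, second succeeds.
codes = iter([EXIT_RECOVERABLE, EXIT_SUCCESS])
print(supervise(lambda attempt: next(codes)))  # COMPLETED
```

In a real supervisor, `run_pipeline` would start the recovery pipeline with the prior handoff loaded; here it is a stand-in callable so the control flow can be exercised in isolation.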

Recovery After Reset

# 1. Load checkpoint
checkpoint load

# 2. View combined status
checkpoint report

# 3. Check active bugs
bugs list --status open

# 4. Check consensus status to decide whether a pipeline needs resuming
curl 'localhost:3000/api/pipeline/consensus/status?pipeline_id=...'

API Endpoints

Pipeline Control

Endpoint                          Method  Description
/api/spawn                        POST    Spawn pipeline with plan
/api/pipeline/continue            POST    Trigger orchestration
/api/pipeline/orchestration       GET     Get orchestration status
/api/pipeline/token               GET     Get pipeline token status
/api/pipeline/revoke              POST    Revoke pipeline token
/api/active-pipelines             GET     List active pipelines
/api/pipeline/consensus/status    GET     Consensus status
/api/pipeline/consensus/fallback  POST    Execute fallback action

Observability

Endpoint                    Method  Description
/api/observability/errors   GET     Error summary
/api/observability/handoff  POST    Generate handoff report

Phase 12: Observability - COMPLETE

All 12 phases validated | 295/295 tests passing | Last updated: 2026-01-24
