# Agent Governance System

> Production-grade framework for governing AI agent execution with multi-agent orchestration, Vault-backed security, real-time observability, and consensus-driven workflows.

**Status:** Phase 12 COMPLETE | **Tests:** 295/295 passing | **Coverage:** All 12 phases validated

---

## Quick Start

```bash
# Check system health
checkpoint load                              # Load session state
checkpoint report                            # View combined status
validate-phases --verbose                    # Run full validation (295 tests)

# Run the orchestration dashboard
cd /opt/agent-governance/ui && bun run server.ts
# Dashboard: http://localhost:3000

# Bug tracking
bugs list --status open                      # View open bugs
bugs log -m "Description" --severity high    # Log new bug

# Pipeline operations
pipeline spawn --plan --tier 1               # Spawn pipeline agents
```

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                GOVERNANCE LAYER                                 │
│  ┌──────────────────┐  ┌───────────────────┐  ┌─────────────────────────────┐   │
│  │ HashiCorp Vault  │  │ DragonflyDB       │  │ SQLite Ledger               │   │
│  │                  │  │                   │  │                             │   │
│  │ - Per-pipeline   │  │ - Blackboard      │  │ - agent_actions             │   │
│  │   token mgmt     │  │ - Metrics         │  │ - agent_metrics             │   │
│  │ - 2hr TTL +      │  │ - Consensus       │  │ - violations                │   │
│  │   auto-renewal   │  │ - Message bus     │  │ - promotions                │   │
│  │ - Observability  │  │ - Error budgets   │  │ - tenants/projects          │   │
│  │   revocation     │  │ - WebSocket pub   │  │ - marketplace               │   │
│  └──────────────────┘  └───────────────────┘  └─────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────────────────────┤
│                               ORCHESTRATION LAYER                               │
│  ┌───────────────────────────────────────────────────────────────────────────┐  │
│  │                           Multi-Agent Pipeline                            │  │
│  │                                                                           │  │
│  │   SPAWN ──► RUNNING ──► REPORT ──► ORCHESTRATING ──► COMPLETED            │  │
│  │     │          │          │              │               │                │  │
│  │   Issue      Agent      Report      ALPHA+BETA       Consensus            │  │
│  │   Vault      Status     Ready        Parallel        Achieved             │  │
│  │   Token      Updates                     │                                │  │
│  │                                    Error/Stuck?                           │  │
│  │                                          │ YES                            │  │
│  │                                     SPAWN GAMMA                           │  │
│  │                                     (Mediator)                            │  │
│  └───────────────────────────────────────────────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                   AGENT LAYER                                   │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐   │
│  │  Agent ALPHA  │  │  Agent BETA   │  │  Agent GAMMA  │  │  Governed LLM   │   │
│  │  (Research)   │◄─┼─► (Synthesis) │◄─┼─► (Mediator)  │  │  (T0/T1/T2)     │   │
│  │               │  │               │  │               │  │                 │   │
│  │  Parallel     │  │  Direct       │  │  Spawned on:  │  │ - llm-planner   │   │
│  │  Execution    │  │  Messages     │  │  - Stuck 30s  │  │ - tier0-agent   │   │
│  │               │  │               │  │  - Conflict 3 │  │ - tier1-agent   │   │
│  │               │  │               │  │  - Complex .8 │  │                 │   │
│  └───────┬───────┘  └───────┬───────┘  └───────────────┘  └─────────────────┘   │
│          └──────────────────┴───────────────────────────────────────────────────│
│                             │                                                   │
│                  ┌──────────▼──────────┐                                        │
│                  │     Blackboard      │                                        │
│                  │  - problem          │                                        │
│                  │  - solutions[]      │                                        │
│                  │  - progress         │                                        │
│                  │  - consensus        │                                        │
│                  └─────────────────────┘                                        │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                 UI / API LAYER                                  │
│  ┌───────────────────────────────────────────────────────────────────────────┐  │
│  │                 Orchestration Dashboard (Bun + WebSocket)                 │  │
│  │  - Real-time pipeline status         - Agent lifecycle cards              │  │
│  │  - Consensus failure alerts          - Fallback action buttons            │  │
│  │  - Log streaming                     - Metrics display                    │  │
│  └───────────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────────┘
```

---

## Core Components

| Component | Purpose | Status |
|-----------|---------|--------|
| **agents/** | ALPHA/BETA/GAMMA multi-agent + T0/T1/T2 governed agents | Complete |
| **ui/** | Orchestration dashboard with WebSocket real-time updates | Complete |
| **pipeline/** | Pipeline DSL, templates, and execution engine | Complete |
| **orchestrator/** | Multi-agent coordination with consensus tracking | Complete |
| **observability/** | Prometheus metrics, distributed tracing, structured logging | Complete |
| **marketplace/** | Agent template registry with FTS5 search | Complete |
| **checkpoint/** | Session state management and recovery | Complete |
| **ledger/** | SQLite audit trail with multi-tenant support | Complete |
| **testing/** | 295 tests across 12 phases + chaos testing | Complete |

---

## Key Workflows

### Multi-Agent Pipeline

1. **Spawn**: Pipeline created with objective, issues Vault token (2hr TTL, auto-renew)
2. **Running**: ALPHA (research) and BETA (synthesis) agents work in parallel
3. **Orchestrating**: Agents communicate via blackboard + direct messages
4. **Consensus**: Proposals evaluated, votes counted, conflicts resolved
5. **GAMMA Spawn**: If stuck >30s, conflicts >3, or complexity >0.8
6. **Completion**: Final consensus achieved or fallback action taken

### Consensus Failure Handling

When agents fail to reach consensus:

- **Rerun Same**: Spawn fresh ALPHA/BETA with failure context
- **Rerun with GAMMA**: Force mediator agent for conflict resolution
- **Escalate Tier**: Increase agent permissions and retry
- **Accept Partial**: Complete with best available proposal
- **Download Log**: Export full context for manual review

### Vault Token Lifecycle

```
Pipeline Start
      │
      ▼
┌─────────────────────────────────────┐
│ 1. Request Token (AppRole)          │
│    TTL: 2 hours, renewable          │
│    Policy: pipeline-agent           │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 2. Store in Redis (encrypted)       │
│    Key: pipeline:{id}:vault_token   │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 3. Pass to ALPHA, BETA, GAMMA       │
│    Auto-renewal every 30 min        │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 4. Observability monitors usage     │
│    Revoke on policy violation       │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 5. Revoke on completion/error       │
└─────────────────────────────────────┘
```

---

## CLI Tools

### Context Management

```bash
# Checkpoints - session state snapshots
checkpoint now --notes "..."     # Create checkpoint
checkpoint load                  # Load latest
checkpoint report                # Combined status view
checkpoint timeline              # History

# Status - per-directory tracking
status sweep                     # Check all directories
status update --phase            # Update status
status dashboard                 # Overview

# Memory - large content storage
memory log --stdin               # Store from pipe
memory fetch -s                  # Get summary
memory list                      # Browse entries
```

### Bug Tracking

```bash
bugs list                        # List all bugs
bugs list --status open          # Filter by status
bugs list --severity high        # Filter by severity
bugs log -m "Description"        # Log new bug
bugs update resolved             # Update status
bugs get                         # Get details
bugs scan                        # Scan for anomalies
bugs status                      # Summary view
```

### Pipeline Operations

```bash
# Validation
validate-phases --verbose        # Full 12-phase validation

# Pipeline management (via dashboard API)
curl -X POST localhost:3000/api/spawn \
  -d '{"plan_id":"...", "tier":1}'

# Consensus handling
curl localhost:3000/api/pipeline/consensus/status?pipeline_id=...
curl -X POST localhost:3000/api/pipeline/consensus/fallback \
  -d '{"pipeline_id":"...", "action":"rerun_gamma"}'
```

---

## Phase Completion Status

| Phase | Name | Tests | Status |
|-------|------|-------|--------|
| 1 | Foundation | 12/12 | Complete |
| 2 | Secrets Management | 14/14 | Complete |
| 3 | Agent Execution | 19/19 | Complete |
| 4 | Promotion & Revocation | 16/16 | Complete |
| 5 | Bootstrap & Checkpointing | 22/22 | Complete |
| 6 | Multi-Agent Orchestration | 56/56 | Complete |
| 7 | Monitoring & Learning | 46/46 | Complete |
| 8 | Production Hardening | 31/31 | Complete |
| 9 | External Integrations | - | Framework retained, external deprecated |
| 10 | Multi-Tenant Support | 18/18 | Complete |
| 11 | Agent Marketplace | 16/16 | Complete |
| 12 | Observability | 21/21 | Complete |
| | **Total** | **295/295** | **Complete** |

---

## Dependencies

| Service | Purpose | Endpoint |
|---------|---------|----------|
| HashiCorp Vault | Secrets, token management | https://127.0.0.1:8200 |
| DragonflyDB | State, metrics, pub/sub | redis://127.0.0.1:6379 |
| SQLite | Audit ledger, marketplace | File-based |
| Bun | TypeScript runtime | Local |
| OpenRouter | LLM API gateway | External |

---

## Directory Structure

```
agent-governance/
├── agents/                 # Agent implementations
│   ├── multi-agent/        # ALPHA/BETA/GAMMA orchestrator
│   ├── llm-planner/        # Python LLM agent
│   ├── llm-planner-ts/     # TypeScript LLM agent
│   ├── tier0-agent/        # Observer tier (read-only)
│   └── tier1-agent/        # Executor tier (write)
├── bin/                    # CLI tools
├── checkpoint/             # Session state management
├── docs/                   # Documentation
├── evidence/               # Audit evidence packages
├── integrations/           # Integration framework
├── ledger/                 # SQLite audit ledger + API
├── marketplace/            # Agent template registry
├── memory/                 # External memory layer
├── observability/          # Metrics, tracing, logging
├── orchestrator/           # Pipeline orchestration
├── pipeline/               # Pipeline DSL and templates
├── preflight/              # Pre-execution validation
├── sandbox/                # Terraform/Ansible sandbox
├── testing/                # Test framework + oversight
├── tests/                  # Test suites (295 tests)
└── ui/                     # Orchestration dashboard
```

---

## Documentation

| Document | Description |
|----------|-------------|
| [ARCHITECTURE.md](docs/ARCHITECTURE.md) | Full system design |
| [MULTI_AGENT_PIPELINE_ARCHITECTURE.md](docs/MULTI_AGENT_PIPELINE_ARCHITECTURE.md) | Pipeline flow, Vault tokens, agent lifecycle |
| [PHASE_DEPENDENCY_ANALYSIS.md](docs/PHASE_DEPENDENCY_ANALYSIS.md) | Phase dependencies and order |
| [CONTEXT_MANAGEMENT.md](docs/CONTEXT_MANAGEMENT.md) | Checkpoints, STATUS, Memory |
| [STATUS_PROTOCOL.md](docs/STATUS_PROTOCOL.md) | Directory status protocol |
| [CREDENTIALS_SETUP.md](docs/CREDENTIALS_SETUP.md) | Vault and DragonflyDB setup |

---

## Production Constraints

### Token Revocation Triggers

| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |

### Consensus Requirements

- Pipelines remain in `ORCHESTRATING` until consensus achieved
- Exit code 0 = success, 1 = error, 2 = consensus failure
- Failure context recorded to DragonflyDB for retry attempts
- User must explicitly accept partial output to complete without consensus

---

## Recovery After Reset

```bash
# 1. Load checkpoint
checkpoint load

# 2. View combined status
checkpoint report

# 3. Check active bugs
bugs list --status open

# 4. Resume pipeline if needed
curl localhost:3000/api/pipeline/consensus/status?pipeline_id=...
```

---

## API Endpoints

### Pipeline Control

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/spawn` | POST | Spawn pipeline with plan |
| `/api/pipeline/continue` | POST | Trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/consensus/status` | GET | Consensus status |
| `/api/pipeline/consensus/fallback` | POST | Execute fallback action |

### Observability

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/observability/errors` | GET | Error summary |
| `/api/observability/handoff` | POST | Generate handoff report |

---

*Phase 12: Observability - COMPLETE*

**All 12 phases validated** | 295/295 tests passing | Last updated: 2026-01-24
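
The GAMMA spawn conditions listed under Key Workflows (stuck >30s, conflicts >3, complexity >0.8) can be read as a simple predicate over pipeline state. The sketch below is illustrative only: `PipelineSnapshot`, `GAMMA_THRESHOLDS`, `gammaSpawnReasons`, and `shouldSpawnGamma` are hypothetical names, not identifiers from this repository, and the strict greater-than comparisons are an assumption based on how the thresholds are written above.

```typescript
// Hypothetical snapshot of pipeline state; field names are assumptions for illustration.
interface PipelineSnapshot {
  secondsSinceProgress: number; // seconds since the last recorded progress
  conflictCount: number;        // number of proposal conflicts so far
  complexityScore: number;      // assumed to be a score between 0 and 1
}

// Thresholds as stated in the "GAMMA Spawn" workflow step.
const GAMMA_THRESHOLDS = {
  stuckSeconds: 30,
  maxConflicts: 3,
  maxComplexity: 0.8,
} as const;

type GammaReason = "stuck" | "conflict" | "complexity";

// Returns every condition that would justify spawning the mediator agent.
function gammaSpawnReasons(s: PipelineSnapshot): GammaReason[] {
  const reasons: GammaReason[] = [];
  if (s.secondsSinceProgress > GAMMA_THRESHOLDS.stuckSeconds) reasons.push("stuck");
  if (s.conflictCount > GAMMA_THRESHOLDS.maxConflicts) reasons.push("conflict");
  if (s.complexityScore > GAMMA_THRESHOLDS.maxComplexity) reasons.push("complexity");
  return reasons;
}

// GAMMA is spawned when at least one condition holds.
function shouldSpawnGamma(s: PipelineSnapshot): boolean {
  return gammaSpawnReasons(s).length > 0;
}
```

Returning the list of reasons rather than a bare boolean keeps the trigger explainable, which may be useful when the decision is written to an audit trail.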