# Agent Governance System

> Production-grade framework for governing AI agent execution with multi-agent orchestration, Vault-backed security, real-time observability, and consensus-driven workflows.

**Status:** Phase 12 COMPLETE | **Tests:** 295/295 passing | **Coverage:** All 12 phases validated

---

## Quick Start

```bash
# Check system health
checkpoint load                              # Load session state
checkpoint report                            # View combined status
validate-phases --verbose                    # Run full validation (295 tests)

# Run the orchestration dashboard
cd /opt/agent-governance/ui && bun run server.ts
# Dashboard: http://localhost:3000

# Bug tracking
bugs list --status open                      # View open bugs
bugs log -m "Description" --severity high    # Log new bug

# Pipeline operations
pipeline spawn --plan <plan_id> --tier 1     # Spawn pipeline agents
```

---

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────────────────┐
│                                GOVERNANCE LAYER                                │
│  ┌──────────────────┐  ┌───────────────────┐  ┌─────────────────────────────┐  │
│  │ HashiCorp Vault  │  │ DragonflyDB       │  │ SQLite Ledger               │  │
│  │                  │  │                   │  │                             │  │
│  │ - Per-pipeline   │  │ - Blackboard      │  │ - agent_actions             │  │
│  │   token mgmt     │  │ - Metrics         │  │ - agent_metrics             │  │
│  │ - 2hr TTL +      │  │ - Consensus       │  │ - violations                │  │
│  │   auto-renewal   │  │ - Message bus     │  │ - promotions                │  │
│  │ - Observability  │  │ - Error budgets   │  │ - tenants/projects          │  │
│  │   revocation     │  │ - WebSocket pub   │  │ - marketplace               │  │
│  └──────────────────┘  └───────────────────┘  └─────────────────────────────┘  │
├────────────────────────────────────────────────────────────────────────────────┤
│                              ORCHESTRATION LAYER                               │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │                           Multi-Agent Pipeline                           │  │
│  │                                                                          │  │
│  │  SPAWN ──► RUNNING ──► REPORT ──► ORCHESTRATING ──► COMPLETED            │  │
│  │    │          │          │              │               │                │  │
│  │  Issue      Agent      Report      ALPHA+BETA       Consensus            │  │
│  │  Vault      Status     Ready        Parallel        Achieved             │  │
│  │  Token     Updates                      │                                │  │
│  │                                   Error/Stuck?                           │  │
│  │                                         │ YES                            │  │
│  │                                    SPAWN GAMMA                           │  │
│  │                                    (Mediator)                            │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
├────────────────────────────────────────────────────────────────────────────────┤
│                                  AGENT LAYER                                   │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐  │
│  │ Agent ALPHA   │  │ Agent BETA    │  │ Agent GAMMA   │  │ Governed LLM    │  │
│  │ (Research)    │◄─┼─► (Synthesis) │◄─┼─► (Mediator)  │  │ (T0/T1/T2)      │  │
│  │               │  │               │  │               │  │                 │  │
│  │ Parallel      │  │ Direct        │  │ Spawned on:   │  │ - llm-planner   │  │
│  │ Execution     │  │ Messages      │  │ - Stuck 30s   │  │ - tier0-agent   │  │
│  │               │  │               │  │ - Conflict 3  │  │ - tier1-agent   │  │
│  │               │  │               │  │ - Complex .8  │  │                 │  │
│  └───────┬───────┘  └───────┬───────┘  └───────────────┘  └─────────────────┘  │
│          └──────────────────┴──────────────────────────────────────────────────│
│                                       │                                        │
│                            ┌──────────▼──────────┐                             │
│                            │     Blackboard      │                             │
│                            │ - problem           │                             │
│                            │ - solutions[]       │                             │
│                            │ - progress          │                             │
│                            │ - consensus         │                             │
│                            └─────────────────────┘                             │
├────────────────────────────────────────────────────────────────────────────────┤
│                                 UI / API LAYER                                 │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │                Orchestration Dashboard (Bun + WebSocket)                 │  │
│  │  - Real-time pipeline status         - Agent lifecycle cards             │  │
│  │  - Consensus failure alerts          - Fallback action buttons           │  │
│  │  - Log streaming                     - Metrics display                   │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────────┘
```

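
The pipeline states in the orchestration layer can be read as a small state machine. The following is a minimal sketch, assuming a strictly linear happy path; the `NEXT_STATE` table and both function names are illustrative and are not taken from the orchestrator source. The GAMMA branch is assumed to keep the pipeline in `ORCHESTRATING` rather than add a state.

```typescript
// Pipeline states as named in the orchestration layer of the diagram.
type PipelineState = "SPAWN" | "RUNNING" | "REPORT" | "ORCHESTRATING" | "COMPLETED";

// Assumed forward transitions (hypothetical table, one successor per state).
const NEXT_STATE: Record<PipelineState, PipelineState | null> = {
  SPAWN: "RUNNING",
  RUNNING: "REPORT",
  REPORT: "ORCHESTRATING",
  ORCHESTRATING: "COMPLETED",
  COMPLETED: null,
};

// True only for the single forward transition allowed from `from`.
function canTransition(from: PipelineState, to: PipelineState): boolean {
  return NEXT_STATE[from] === to;
}

// Lists the states still ahead of `from` on the happy path.
function remainingPath(from: PipelineState): PipelineState[] {
  const path: PipelineState[] = [];
  let current: PipelineState | null = NEXT_STATE[from];
  while (current !== null) {
    path.push(current);
    current = NEXT_STATE[current];
  }
  return path;
}
```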
---

## Core Components

| Component | Purpose | Status |
|-----------|---------|--------|
| **agents/** | ALPHA/BETA/GAMMA multi-agent + T0/T1/T2 governed agents | Complete |
| **ui/** | Orchestration dashboard with WebSocket real-time updates | Complete |
| **pipeline/** | Pipeline DSL, templates, and execution engine | Complete |
| **orchestrator/** | Multi-agent coordination with consensus tracking | Complete |
| **observability/** | Prometheus metrics, distributed tracing, structured logging | Complete |
| **marketplace/** | Agent template registry with FTS5 search | Complete |
| **checkpoint/** | Session state management and recovery | Complete |
| **ledger/** | SQLite audit trail with multi-tenant support | Complete |
| **testing/** | 295 tests across 12 phases + chaos testing | Complete |

---

## Key Workflows

### Multi-Agent Pipeline

1. **Spawn**: Pipeline is created with an objective and issued a Vault token (2hr TTL, auto-renew)
2. **Running**: ALPHA (research) and BETA (synthesis) agents work in parallel
3. **Orchestrating**: Agents communicate via blackboard + direct messages
4. **Consensus**: Proposals evaluated, votes counted, conflicts resolved
5. **GAMMA Spawn**: Mediator is spawned if agents are stuck >30s, conflicts >3, or complexity >0.8
6. **Completion**: Final consensus achieved or fallback action taken

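
The GAMMA spawn condition in step 5 can be sketched as a simple threshold check. This is a minimal illustration, assuming the three signals are available as plain numbers; the `OrchestrationSignals` shape, its field names, and both functions are hypothetical and do not come from the orchestrator code. Only the thresholds (30 s, 3 conflicts, 0.8 complexity) come from the list above.

```typescript
// Snapshot of orchestration signals; field names are illustrative assumptions.
interface OrchestrationSignals {
  secondsSinceProgress: number; // time since any agent last reported progress
  conflictCount: number;        // proposal conflicts observed so far
  complexityScore: number;      // estimated task complexity, assumed to lie in [0, 1]
}

// Thresholds taken from step 5 of the workflow.
const GAMMA_THRESHOLDS = { stuckSeconds: 30, conflicts: 3, complexity: 0.8 } as const;

// Returns every reason (possibly none) that would trigger the GAMMA mediator.
function gammaSpawnReasons(s: OrchestrationSignals): string[] {
  const reasons: string[] = [];
  if (s.secondsSinceProgress > GAMMA_THRESHOLDS.stuckSeconds) reasons.push("stuck");
  if (s.conflictCount > GAMMA_THRESHOLDS.conflicts) reasons.push("conflict");
  if (s.complexityScore > GAMMA_THRESHOLDS.complexity) reasons.push("complexity");
  return reasons;
}

// GAMMA is spawned when at least one threshold is strictly exceeded.
function shouldSpawnGamma(s: OrchestrationSignals): boolean {
  return gammaSpawnReasons(s).length > 0;
}
```

All comparisons are strict, matching the `>` in the description, so a pipeline sitting exactly at a threshold does not trigger the mediator in this sketch.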
### Consensus Failure Handling

When agents fail to reach consensus:

- **Rerun Same**: Spawn fresh ALPHA/BETA with failure context
- **Rerun with GAMMA**: Force mediator agent for conflict resolution
- **Escalate Tier**: Increase agent permissions and retry
- **Accept Partial**: Complete with best available proposal
- **Download Log**: Export full context for manual review

### Vault Token Lifecycle

```
            Pipeline Start
                   │
                   ▼
┌─────────────────────────────────────┐
│ 1. Request Token (AppRole)          │
│    TTL: 2 hours, renewable          │
│    Policy: pipeline-agent           │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 2. Store in Redis (encrypted)       │
│    Key: pipeline:{id}:vault_token   │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 3. Pass to ALPHA, BETA, GAMMA       │
│    Auto-renewal every 30 min        │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 4. Observability monitors usage     │
│    Revoke on policy violation       │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 5. Revoke on completion/error       │
└─────────────────────────────────────┘
```

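
The timing in this lifecycle (2-hour TTL, renewal every 30 minutes) leaves room for several missed renewals before a token lapses. The sketch below works through that arithmetic. It assumes each successful renewal resets the TTL to the full 2 hours, which depends on the Vault role configuration and is not stated here; the constant and function names are illustrative only.

```typescript
const TOKEN_TTL_MS = 2 * 60 * 60 * 1000;  // 2-hour TTL from step 1
const RENEW_INTERVAL_MS = 30 * 60 * 1000; // auto-renewal every 30 minutes from step 3

// Renewal times (ms since epoch) for a pipeline expected to run for `runtimeMs`.
function renewalSchedule(issuedAtMs: number, runtimeMs: number): number[] {
  const times: number[] = [];
  const endMs = issuedAtMs + runtimeMs;
  for (let t = issuedAtMs + RENEW_INTERVAL_MS; t < endMs; t += RENEW_INTERVAL_MS) {
    times.push(t);
  }
  return times;
}

// Number of consecutive renewal attempts that can fail before the token expires,
// assuming the last successful renewal granted a full TTL.
function missedRenewalsTolerated(): number {
  return Math.ceil(TOKEN_TTL_MS / RENEW_INTERVAL_MS) - 1;
}
```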
---

## CLI Tools

### Context Management

```bash
# Checkpoints - session state snapshots
checkpoint now --notes "..."           # Create checkpoint
checkpoint load                        # Load latest
checkpoint report                      # Combined status view
checkpoint timeline                    # History

# Status - per-directory tracking
status sweep                           # Check all directories
status update <dir> --phase <p>        # Update status
status dashboard                       # Overview

# Memory - large content storage
memory log --stdin                     # Store from pipe
memory fetch <id> -s                   # Get summary
memory list                            # Browse entries
```

### Bug Tracking

```bash
bugs list                              # List all bugs
bugs list --status open                # Filter by status
bugs list --severity high              # Filter by severity
bugs log -m "Description"              # Log new bug
bugs update <id> resolved              # Update status
bugs get <id>                          # Get details
bugs scan                              # Scan for anomalies
bugs status                            # Summary view
```

### Pipeline Operations

```bash
# Validation
validate-phases --verbose              # Full 12-phase validation

# Pipeline management (via dashboard API)
curl -X POST localhost:3000/api/spawn \
  -H "Content-Type: application/json" \
  -d '{"plan_id":"...", "tier":1}'

# Consensus handling (URL quoted so the shell does not interpret "?")
curl "localhost:3000/api/pipeline/consensus/status?pipeline_id=..."
curl -X POST localhost:3000/api/pipeline/consensus/fallback \
  -H "Content-Type: application/json" \
  -d '{"pipeline_id":"...", "action":"rerun_gamma"}'
```

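
The same dashboard calls can be prepared from TypeScript. The following is a minimal sketch that only builds request descriptors and leaves sending to the caller. The `ApiRequest` shape and both builder functions are hypothetical helpers, not part of the dashboard code; the paths and the `plan_id`, `tier`, and `pipeline_id` fields are the ones shown in the curl examples above.

```typescript
// Hypothetical request descriptor; `url` and `init` can be passed to fetch(url, init).
interface ApiRequest {
  url: string;
  init: { method: "GET" | "POST"; headers?: Record<string, string>; body?: string };
}

const BASE_URL = "http://localhost:3000"; // dashboard address from Quick Start

// POST /api/spawn with the plan_id and tier fields used in the curl example.
function spawnRequest(planId: string, tier: number): ApiRequest {
  return {
    url: `${BASE_URL}/api/spawn`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ plan_id: planId, tier }),
    },
  };
}

// GET /api/pipeline/consensus/status with a URL-encoded pipeline_id query value.
function consensusStatusRequest(pipelineId: string): ApiRequest {
  const query = `pipeline_id=${encodeURIComponent(pipelineId)}`;
  return { url: `${BASE_URL}/api/pipeline/consensus/status?${query}`, init: { method: "GET" } };
}
```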
---

## Phase Completion Status

| Phase | Name | Tests | Status |
|-------|------|-------|--------|
| 1 | Foundation | 12/12 | Complete |
| 2 | Secrets Management | 14/14 | Complete |
| 3 | Agent Execution | 19/19 | Complete |
| 4 | Promotion & Revocation | 16/16 | Complete |
| 5 | Bootstrap & Checkpointing | 22/22 | Complete |
| 6 | Multi-Agent Orchestration | 56/56 | Complete |
| 7 | Monitoring & Learning | 46/46 | Complete |
| 8 | Production Hardening | 31/31 | Complete |
| 9 | External Integrations | - | Framework retained, external deprecated |
| 10 | Multi-Tenant Support | 18/18 | Complete |
| 11 | Agent Marketplace | 16/16 | Complete |
| 12 | Observability | 21/21 | Complete |
| | **Total** | **295/295** | **Complete** |

---

## Dependencies

| Service | Purpose | Endpoint |
|---------|---------|----------|
| HashiCorp Vault | Secrets, token management | https://127.0.0.1:8200 |
| DragonflyDB | State, metrics, pub/sub | redis://127.0.0.1:6379 |
| SQLite | Audit ledger, marketplace | File-based |
| Bun | TypeScript runtime | Local |
| OpenRouter | LLM API gateway | External |

---

## Directory Structure

```
agent-governance/
├── agents/                  # Agent implementations
│   ├── multi-agent/         # ALPHA/BETA/GAMMA orchestrator
│   ├── llm-planner/         # Python LLM agent
│   ├── llm-planner-ts/      # TypeScript LLM agent
│   ├── tier0-agent/         # Observer tier (read-only)
│   └── tier1-agent/         # Executor tier (write)
├── bin/                     # CLI tools
├── checkpoint/              # Session state management
├── docs/                    # Documentation
├── evidence/                # Audit evidence packages
├── integrations/            # Integration framework
├── ledger/                  # SQLite audit ledger + API
├── marketplace/             # Agent template registry
├── memory/                  # External memory layer
├── observability/           # Metrics, tracing, logging
├── orchestrator/            # Pipeline orchestration
├── pipeline/                # Pipeline DSL and templates
├── preflight/               # Pre-execution validation
├── sandbox/                 # Terraform/Ansible sandbox
├── testing/                 # Test framework + oversight
├── tests/                   # Test suites (295 tests)
└── ui/                      # Orchestration dashboard
```

---

## Documentation

### Architecture & Design

| Document | Description |
|----------|-------------|
| [ARCHITECTURE.md](docs/ARCHITECTURE.md) | Full system design |
| [MULTI_AGENT_PIPELINE_ARCHITECTURE.md](docs/MULTI_AGENT_PIPELINE_ARCHITECTURE.md) | Pipeline flow, Vault tokens, agent lifecycle |
| [PHASE_DEPENDENCY_ANALYSIS.md](docs/PHASE_DEPENDENCY_ANALYSIS.md) | Phase dependencies and order |

### Implementation & Operations

| Document | Description |
|----------|-------------|
| [PRODUCTION_PIPELINE.md](docs/PRODUCTION_PIPELINE.md) | Implementation plan and production workflows |
| [ENGINEERING_GUIDE.md](docs/ENGINEERING_GUIDE.md) | Runtime governance spec and quick reference |
| [CREDENTIALS_SETUP.md](docs/CREDENTIALS_SETUP.md) | Vault and DragonflyDB setup |

### Context & Memory

| Document | Description |
|----------|-------------|
| [CONTEXT_MANAGEMENT.md](docs/CONTEXT_MANAGEMENT.md) | Checkpoints, STATUS, Memory |
| [STATUS_PROTOCOL.md](docs/STATUS_PROTOCOL.md) | Directory status protocol |
| [MEMORY_LAYER.md](docs/MEMORY_LAYER.md) | External memory layer details |

### Agent Documentation

| Document | Description |
|----------|-------------|
| [agents/README.md](agents/README.md) | Agent foundation and tier system |
| [tier0-guide.md](docs/tier0-guide.md) | Tier 0 agent guide |

### External References

| Resource | Description |
|----------|-------------|
| [HashiCorp Vault](https://developer.hashicorp.com/vault/docs) | Secrets management documentation |
| [Bun Runtime](https://bun.sh/docs) | TypeScript runtime documentation |
| [DragonflyDB](https://www.dragonflydb.io/docs) | Redis-compatible database docs |

---

## Production Constraints

### Token Revocation Triggers

| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |

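
The trigger table can be expressed as a single decision function. The sketch below is illustrative only: the `AgentWindow` shape, the action identifiers, the severity levels other than `CRITICAL`, and the order in which conditions are checked are all assumptions, since the table does not say which trigger wins when several fire at once.

```typescript
// Only CRITICAL appears in the table; the other levels are assumed for illustration.
type Severity = "LOW" | "MEDIUM" | "HIGH" | "CRITICAL";

// Hypothetical per-agent measurements over the most recent minute.
interface AgentWindow {
  errorsPerMinute: number;
  secondsSinceProgress: number;
  worstViolation: Severity | null;
  apiCallsPerMinute: number;
}

// Hypothetical identifiers for the four actions in the table.
type RevocationAction =
  | "full_revocation"
  | "revoke_and_spawn_diagnostic"
  | "rate_limit_then_revoke"
  | "revoke_agent_token"
  | "none";

// Applies the table's thresholds; CRITICAL violations are checked first (assumed precedence).
function revocationAction(w: AgentWindow): RevocationAction {
  if (w.worstViolation === "CRITICAL") return "full_revocation";
  if (w.errorsPerMinute > 5) return "revoke_and_spawn_diagnostic";
  if (w.apiCallsPerMinute > 100) return "rate_limit_then_revoke";
  if (w.secondsSinceProgress > 60) return "revoke_agent_token";
  return "none";
}
```

Returning a single action keeps the sketch short; a real monitor might instead collect every matching trigger and log them all.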
### Consensus Requirements

- Pipelines remain in `ORCHESTRATING` until consensus is achieved
- Exit code 0 = success, 1 = error, 2 = consensus failure
- Failure context is recorded to DragonflyDB for retry attempts
- User must explicitly accept partial output to complete without consensus

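
A caller that wraps the pipeline process could interpret these exit codes as follows. This is a minimal sketch: the `PipelineOutcome` names and both functions are hypothetical, and treating unknown exit codes as errors is an assumption rather than documented behavior.

```typescript
type PipelineOutcome = "success" | "error" | "consensus_failure";

// Exit codes as listed above: 0 = success, 1 = error, 2 = consensus failure.
const EXIT_CODES: Record<PipelineOutcome, number> = {
  success: 0,
  error: 1,
  consensus_failure: 2,
};

// Maps an exit code back to an outcome; unknown codes fall back to "error" (assumption).
function outcomeFromExitCode(code: number): PipelineOutcome {
  if (code === EXIT_CODES.success) return "success";
  if (code === EXIT_CODES.consensus_failure) return "consensus_failure";
  return "error";
}

// A pipeline may complete without consensus only if the user accepted partial output.
function mayComplete(outcome: PipelineOutcome, userAcceptedPartial: boolean): boolean {
  return outcome === "success" || (outcome === "consensus_failure" && userAcceptedPartial);
}
```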
---

## Recovery After Reset

```bash
# 1. Load checkpoint
checkpoint load

# 2. View combined status
checkpoint report

# 3. Check active bugs
bugs list --status open

# 4. Resume pipeline if needed (URL quoted so the shell does not interpret "?")
curl "localhost:3000/api/pipeline/consensus/status?pipeline_id=..."
```

---

## API Endpoints

### Pipeline Control

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/spawn` | POST | Spawn pipeline with plan |
| `/api/pipeline/continue` | POST | Trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/consensus/status` | GET | Consensus status |
| `/api/pipeline/consensus/fallback` | POST | Execute fallback action |

### Observability

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/observability/errors` | GET | Error summary |
| `/api/observability/handoff` | POST | Generate handoff report |

---

*Phase 12: Observability - COMPLETE*

**All 12 phases validated** | 295/295 tests passing | Last updated: 2026-01-24