# Agent Governance System
> Production-grade framework for governing AI agent execution with multi-agent orchestration, Vault-backed security, real-time observability, and consensus-driven workflows.
**Status:** Phase 12 COMPLETE | **Tests:** 295/295 passing | **Coverage:** All 12 phases validated
---
## Quick Start
```bash
# Check system health
checkpoint load # Load session state
checkpoint report # View combined status
validate-phases --verbose # Run full validation (295 tests)
# Run the orchestration dashboard
cd /opt/agent-governance/ui && bun run server.ts
# Dashboard: http://localhost:3000
# Bug tracking
bugs list --status open # View open bugs
bugs log -m "Description" --severity high # Log new bug
# Pipeline operations
pipeline spawn --plan <plan_id> --tier 1 # Spawn pipeline agents
```
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ GOVERNANCE LAYER │
│ ┌──────────────────┐ ┌───────────────────┐ ┌─────────────────────────────┐ │
│ │ HashiCorp Vault │ │ DragonflyDB │ │ SQLite Ledger │ │
│ │ │ │ │ │ │ │
│ │ - Per-pipeline │ │ - Blackboard │ │ - agent_actions │ │
│ │ token mgmt │ │ - Metrics │ │ - agent_metrics │ │
│ │ - 2hr TTL + │ │ - Consensus │ │ - violations │ │
│ │ auto-renewal │ │ - Message bus │ │ - promotions │ │
│ │ - Observability │ │ - Error budgets │ │ - tenants/projects │ │
│ │ revocation │ │ - WebSocket pub │ │ - marketplace │ │
│ └──────────────────┘ └───────────────────┘ └─────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────────┤
│ ORCHESTRATION LAYER │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Multi-Agent Pipeline │ │
│ │ │ │
│ │ SPAWN ──► RUNNING ──► REPORT ──► ORCHESTRATING ──► COMPLETED │ │
│ │ │ │ │ │ │ │ │
│ │ Issue Agent Report ALPHA+BETA Consensus │ │
│ │ Vault Status Ready Parallel Achieved │ │
│ │ Token Updates │ │ │
│ │ Error/Stuck? │ │
│ │ │ YES │ │
│ │ SPAWN GAMMA │ │
│ │ (Mediator) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────────┤
│ AGENT LAYER │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌─────────────────┐ │
│ │ Agent ALPHA │ │ Agent BETA │ │ Agent GAMMA │ │ Governed LLM │ │
│ │ (Research) │◄─┼─► (Synthesis) │◄─┼─► (Mediator) │ │ (T0/T1/T2) │ │
│ │ │ │ │ │ │ │ │ │
│ │ Parallel │ │ Direct │ │ Spawned on: │ │ - llm-planner │ │
│ │ Execution │ │ Messages │ │ - Stuck 30s │ │ - tier0-agent │ │
│ │ │ │ │ │ - Conflict 3 │ │ - tier1-agent │ │
│ │ │ │ │ │ - Complex .8 │ │ │ │
│ └───────┬───────┘ └───────┬───────┘ └───────────────┘ └─────────────────┘ │
│ └──────────────────┴──────────────────────────────────────────────────│
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Blackboard │ │
│ │ - problem │ │
│ │ - solutions[] │ │
│ │ - progress │ │
│ │ - consensus │ │
│ └─────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────────┤
│ UI / API LAYER │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Orchestration Dashboard (Bun + WebSocket) │ │
│ │ - Real-time pipeline status - Agent lifecycle cards │ │
│ │ - Consensus failure alerts - Fallback action buttons │ │
│ │ - Log streaming - Metrics display │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
```
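The lifecycle row in the orchestration layer can be read as a small state machine. The sketch below is illustrative only: the state names come from the diagram, while the `NEXT_STATE` table and the `advance` helper are hypothetical and assume a strictly linear flow. It does not model the GAMMA branch or the rule that a pipeline waits in `ORCHESTRATING` until consensus.

```typescript
// Pipeline states as named in the architecture diagram.
type PipelineState = "SPAWN" | "RUNNING" | "REPORT" | "ORCHESTRATING" | "COMPLETED";

// Assumed forward transitions: strictly linear, as drawn in the diagram.
const NEXT_STATE: Record<PipelineState, PipelineState | null> = {
  SPAWN: "RUNNING",
  RUNNING: "REPORT",
  REPORT: "ORCHESTRATING",
  ORCHESTRATING: "COMPLETED",
  COMPLETED: null,
};

// Hypothetical helper: move one step forward, refusing to leave the terminal state.
function advance(state: PipelineState): PipelineState {
  const next = NEXT_STATE[state];
  if (next === null) {
    throw new Error(`Pipeline is already in terminal state ${state}`);
  }
  return next;
}
```

Keeping the transitions in a lookup table rather than scattered conditionals makes it easy to reject an illegal jump such as `SPAWN` directly to `COMPLETED`.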
---
## Core Components
| Component | Purpose | Status |
|-----------|---------|--------|
| **agents/** | ALPHA/BETA/GAMMA multi-agent + T0/T1/T2 governed agents | Complete |
| **ui/** | Orchestration dashboard with WebSocket real-time updates | Complete |
| **pipeline/** | Pipeline DSL, templates, and execution engine | Complete |
| **orchestrator/** | Multi-agent coordination with consensus tracking | Complete |
| **observability/** | Prometheus metrics, distributed tracing, structured logging | Complete |
| **marketplace/** | Agent template registry with FTS5 search | Complete |
| **checkpoint/** | Session state management and recovery | Complete |
| **ledger/** | SQLite audit trail with multi-tenant support | Complete |
| **testing/** | 295 tests across 12 phases + chaos testing | Complete |
---
## Key Workflows
### Multi-Agent Pipeline
1. **Spawn**: A pipeline is created with an objective and is issued a Vault token (2hr TTL, auto-renewed)
2. **Running**: ALPHA (research) and BETA (synthesis) agents work in parallel
3. **Orchestrating**: Agents communicate via blackboard + direct messages
4. **Consensus**: Proposals evaluated, votes counted, conflicts resolved
5. **GAMMA Spawn**: The GAMMA mediator is spawned if agents are stuck for >30s, conflicts exceed 3, or complexity exceeds 0.8
6. **Completion**: Final consensus achieved or fallback action taken
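
The GAMMA spawn condition in step 5 can be sketched as a pure function. The thresholds (30 seconds, 3 conflicts, 0.8 complexity) are taken from the list above; the `PipelineSnapshot` shape, the field names, and `gammaSpawnReasons` are hypothetical, and this README does not say how the complexity score is computed.

```typescript
// Hypothetical snapshot of the values the orchestrator would need to inspect.
interface PipelineSnapshot {
  secondsSinceProgress: number; // time since any agent last reported progress
  conflictCount: number;        // unresolved proposal conflicts
  complexityScore: number;      // assumed to be a 0..1 estimate
}

// Thresholds as stated in the workflow description.
const GAMMA_THRESHOLDS = { stuckSeconds: 30, conflicts: 3, complexity: 0.8 };

// Returns every reason that would justify spawning GAMMA; empty means "do not spawn".
function gammaSpawnReasons(s: PipelineSnapshot): string[] {
  const reasons: string[] = [];
  if (s.secondsSinceProgress > GAMMA_THRESHOLDS.stuckSeconds) reasons.push("stuck");
  if (s.conflictCount > GAMMA_THRESHOLDS.conflicts) reasons.push("conflict");
  if (s.complexityScore > GAMMA_THRESHOLDS.complexity) reasons.push("complexity");
  return reasons;
}
```

All three comparisons are strict, so a pipeline sitting exactly on a threshold does not trigger the mediator. Returning the list of reasons instead of a boolean keeps the cause available for logging.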
### Consensus Failure Handling
When agents fail to reach consensus:
- **Rerun Same**: Spawn fresh ALPHA/BETA with failure context
- **Rerun with GAMMA**: Force mediator agent for conflict resolution
- **Escalate Tier**: Increase agent permissions and retry
- **Accept Partial**: Complete with best available proposal
- **Download Log**: Export full context for manual review
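
A client that offers these options might validate the chosen action before calling the fallback endpoint. In the sketch below only `rerun_gamma` appears in this README (in the curl example under Pipeline Operations); the other four identifiers, `FALLBACK_DESCRIPTIONS`, and `isFallbackAction` are assumptions made for illustration.

```typescript
// Only "rerun_gamma" is confirmed by this README; the other identifiers are guesses.
type FallbackAction =
  | "rerun_same"
  | "rerun_gamma"
  | "escalate_tier"
  | "accept_partial"
  | "download_log";

// Descriptions copied from the list of fallback options.
const FALLBACK_DESCRIPTIONS: Record<FallbackAction, string> = {
  rerun_same: "Spawn fresh ALPHA/BETA with failure context",
  rerun_gamma: "Force mediator agent for conflict resolution",
  escalate_tier: "Increase agent permissions and retry",
  accept_partial: "Complete with best available proposal",
  download_log: "Export full context for manual review",
};

// Type guard: true only for keys defined directly on the table above.
function isFallbackAction(value: string): value is FallbackAction {
  return Object.prototype.hasOwnProperty.call(FALLBACK_DESCRIPTIONS, value);
}
```

The guard uses `hasOwnProperty` rather than the `in` operator so that inherited names such as `toString` are not accepted as actions.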
### Vault Token Lifecycle
```
Pipeline Start
┌─────────────────────────────────────┐
│  1. Request Token (AppRole)         │
│     TTL: 2 hours, renewable         │
│     Policy: pipeline-agent          │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  2. Store in Redis (encrypted)      │
│     Key: pipeline:{id}:vault_token  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  3. Pass to ALPHA, BETA, GAMMA      │
│     Auto-renewal every 30 min       │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  4. Observability monitors usage    │
│     Revoke on policy violation      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  5. Revoke on completion/error      │
└─────────────────────────────────────┘
```
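The timing in steps 1 to 3 can be sketched without touching Vault at all. The key format, the 2 hour TTL, and the 30 minute renewal interval come from the diagram; `tokenKey`, `TokenLease`, and `leaseStatus` are hypothetical names, and the sketch assumes that each renewal restarts the 2 hour TTL from the moment of renewal.

```typescript
const TOKEN_TTL_MS = 2 * 60 * 60 * 1000;  // 2 hours, from the diagram
const RENEW_INTERVAL_MS = 30 * 60 * 1000; // 30 minutes, from the diagram

// Key under which the pipeline token is stored (format from step 2).
function tokenKey(pipelineId: string): string {
  return `pipeline:${pipelineId}:vault_token`;
}

// Hypothetical bookkeeping record; issuance counts as the first "renewal".
interface TokenLease {
  lastRenewedAtMs: number;
}

type LeaseStatus = "ok" | "renew_due" | "expired";

// Assumption: a renewal resets the TTL, so expiry is measured from the last renewal.
function leaseStatus(lease: TokenLease, nowMs: number): LeaseStatus {
  const age = nowMs - lease.lastRenewedAtMs;
  if (age >= TOKEN_TTL_MS) return "expired";
  if (age >= RENEW_INTERVAL_MS) return "renew_due";
  return "ok";
}
```

Renewing every 30 minutes against a 2 hour TTL leaves room for three consecutive missed renewals before the token lapses.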
---
## CLI Tools
### Context Management
```bash
# Checkpoints - session state snapshots
checkpoint now --notes "..." # Create checkpoint
checkpoint load # Load latest
checkpoint report # Combined status view
checkpoint timeline # History
# Status - per-directory tracking
status sweep # Check all directories
status update <dir> --phase <p> # Update status
status dashboard # Overview
# Memory - large content storage
memory log --stdin # Store from pipe
memory fetch <id> -s # Get summary
memory list # Browse entries
```
### Bug Tracking
```bash
bugs list # List all bugs
bugs list --status open # Filter by status
bugs list --severity high # Filter by severity
bugs log -m "Description" # Log new bug
bugs update <id> resolved # Update status
bugs get <id> # Get details
bugs scan # Scan for anomalies
bugs status # Summary view
```
### Pipeline Operations
```bash
# Validation
validate-phases --verbose # Full 12-phase validation
# Pipeline management (via dashboard API)
curl -X POST localhost:3000/api/spawn \
  -d '{"plan_id":"...", "tier":1}'
# Consensus handling (URL quoted so the shell does not treat "?" as a glob)
curl "localhost:3000/api/pipeline/consensus/status?pipeline_id=..."
curl -X POST localhost:3000/api/pipeline/consensus/fallback \
  -d '{"pipeline_id":"...", "action":"rerun_gamma"}'
```
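The same consensus calls can be prepared from TypeScript. This sketch only builds the requests and sends nothing; the endpoint paths and JSON field names are taken from the curl examples, while `DASHBOARD_ORIGIN`, `consensusStatusUrl`, `fallbackRequest`, and the `Content-Type` header are assumptions about how a client might be written.

```typescript
// Assumed dashboard origin, matching the Quick Start port.
const DASHBOARD_ORIGIN = "http://localhost:3000";

// Builds the status URL with the pipeline id safely encoded as a query parameter.
function consensusStatusUrl(pipelineId: string): string {
  const url = new URL("/api/pipeline/consensus/status", DASHBOARD_ORIGIN);
  url.searchParams.set("pipeline_id", pipelineId);
  return url.toString();
}

// Hypothetical description of a JSON POST, ready to hand to fetch().
interface JsonPost {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the fallback request; the action string is passed through unchanged.
function fallbackRequest(pipelineId: string, action: string): JsonPost {
  return {
    url: new URL("/api/pipeline/consensus/fallback", DASHBOARD_ORIGIN).toString(),
    method: "POST",
    headers: { "Content-Type": "application/json" }, // assumed; not shown in the curl example
    body: JSON.stringify({ pipeline_id: pipelineId, action }),
  };
}
```

A caller could then send it with `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Building the URL through `URL` and `searchParams` avoids the quoting problems that a raw `?` causes in shell commands.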
---
## Phase Completion Status
| Phase | Name | Tests | Status |
|-------|------|-------|--------|
| 1 | Foundation | 12/12 | Complete |
| 2 | Secrets Management | 14/14 | Complete |
| 3 | Agent Execution | 19/19 | Complete |
| 4 | Promotion & Revocation | 16/16 | Complete |
| 5 | Bootstrap & Checkpointing | 22/22 | Complete |
| 6 | Multi-Agent Orchestration | 56/56 | Complete |
| 7 | Monitoring & Learning | 46/46 | Complete |
| 8 | Production Hardening | 31/31 | Complete |
| 9 | External Integrations | - | Framework retained; external integrations deprecated |
| 10 | Multi-Tenant Support | 18/18 | Complete |
| 11 | Agent Marketplace | 16/16 | Complete |
| 12 | Observability | 21/21 | Complete |
| | **Total** | **295/295** | **Complete** |
---
## Dependencies
| Service | Purpose | Endpoint |
|---------|---------|----------|
| HashiCorp Vault | Secrets, token management | https://127.0.0.1:8200 |
| DragonflyDB | State, metrics, pub/sub | redis://127.0.0.1:6379 |
| SQLite | Audit ledger, marketplace | File-based |
| Bun | TypeScript runtime | Local |
| OpenRouter | LLM API gateway | External |
---
## Directory Structure
```
agent-governance/
├── agents/ # Agent implementations
│ ├── multi-agent/ # ALPHA/BETA/GAMMA orchestrator
│ ├── llm-planner/ # Python LLM agent
│ ├── llm-planner-ts/ # TypeScript LLM agent
│ ├── tier0-agent/ # Observer tier (read-only)
│ └── tier1-agent/ # Executor tier (write)
├── bin/ # CLI tools
├── checkpoint/ # Session state management
├── docs/ # Documentation
├── evidence/ # Audit evidence packages
├── integrations/ # Integration framework
├── ledger/ # SQLite audit ledger + API
├── marketplace/ # Agent template registry
├── memory/ # External memory layer
├── observability/ # Metrics, tracing, logging
├── orchestrator/ # Pipeline orchestration
├── pipeline/ # Pipeline DSL and templates
├── preflight/ # Pre-execution validation
├── sandbox/ # Terraform/Ansible sandbox
├── testing/ # Test framework + oversight
├── tests/ # Test suites (295 tests)
└── ui/ # Orchestration dashboard
```
---
## Documentation
### Architecture & Design
| Document | Description |
|----------|-------------|
| [ARCHITECTURE.md](docs/ARCHITECTURE.md) | Full system design |
| [MULTI_AGENT_PIPELINE_ARCHITECTURE.md](docs/MULTI_AGENT_PIPELINE_ARCHITECTURE.md) | Pipeline flow, Vault tokens, agent lifecycle |
| [PHASE_DEPENDENCY_ANALYSIS.md](docs/PHASE_DEPENDENCY_ANALYSIS.md) | Phase dependencies and order |
### Implementation & Operations
| Document | Description |
|----------|-------------|
| [PRODUCTION_PIPELINE.md](docs/PRODUCTION_PIPELINE.md) | Implementation plan and production workflows |
| [ENGINEERING_GUIDE.md](docs/ENGINEERING_GUIDE.md) | Runtime governance spec and quick reference |
| [CREDENTIALS_SETUP.md](docs/CREDENTIALS_SETUP.md) | Vault and DragonflyDB setup |
### Context & Memory
| Document | Description |
|----------|-------------|
| [CONTEXT_MANAGEMENT.md](docs/CONTEXT_MANAGEMENT.md) | Checkpoints, STATUS, Memory |
| [STATUS_PROTOCOL.md](docs/STATUS_PROTOCOL.md) | Directory status protocol |
| [MEMORY_LAYER.md](docs/MEMORY_LAYER.md) | External memory layer details |
### Agent Documentation
| Document | Description |
|----------|-------------|
| [agents/README.md](agents/README.md) | Agent foundation and tier system |
| [tier0-guide.md](docs/tier0-guide.md) | Tier 0 agent guide |
### External References
| Resource | Description |
|----------|-------------|
| [HashiCorp Vault](https://developer.hashicorp.com/vault/docs) | Secrets management documentation |
| [Bun Runtime](https://bun.sh/docs) | TypeScript runtime documentation |
| [DragonflyDB](https://www.dragonflydb.io/docs) | Redis-compatible database docs |
---
## Production Constraints
### Token Revocation Triggers
| Condition | Threshold | Action |
|-----------|-----------|--------|
| Error rate | > 5 errors/minute | Revoke + spawn diagnostic |
| Stuck agent | > 60 seconds no progress | Revoke agent token only |
| Policy violation | Any CRITICAL | Immediate full revocation |
| Resource abuse | > 100 API calls/minute | Rate limit, then revoke |
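
The table can be read as a decision function over a one-minute window of agent activity. The thresholds and actions below come from the table; the `AgentWindow` fields, the action identifiers, and in particular the order in which conditions are checked are assumptions, since this README does not say which trigger wins when several fire at once.

```typescript
// Hypothetical per-agent measurements over the most recent minute.
interface AgentWindow {
  errorsPerMinute: number;
  secondsSinceProgress: number;
  criticalViolations: number;
  apiCallsPerMinute: number;
}

type RevocationAction =
  | "none"
  | "full_revocation"
  | "revoke_and_spawn_diagnostic"
  | "rate_limit_then_revoke"
  | "revoke_agent_token";

// Assumed precedence: the most severe condition is checked first.
function revocationAction(w: AgentWindow): RevocationAction {
  if (w.criticalViolations > 0) return "full_revocation";         // any CRITICAL violation
  if (w.errorsPerMinute > 5) return "revoke_and_spawn_diagnostic"; // > 5 errors/minute
  if (w.apiCallsPerMinute > 100) return "rate_limit_then_revoke";  // > 100 API calls/minute
  if (w.secondsSinceProgress > 60) return "revoke_agent_token";    // > 60 s without progress
  return "none";
}
```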
### Consensus Requirements
- Pipelines remain in `ORCHESTRATING` until consensus is achieved
- Exit code 0 = success, 1 = error, 2 = consensus failure
- Failure context is recorded in DragonflyDB for retry attempts
- User must explicitly accept partial output to complete without consensus
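
A wrapper script could interpret these rules roughly as follows. The exit codes and the partial-acceptance rule are taken from the list above; `outcomeFromExitCode` and `mayComplete` are hypothetical helper names.

```typescript
type PipelineOutcome = "success" | "error" | "consensus_failure" | "unknown";

// Maps the documented exit codes; anything else is reported as unknown.
function outcomeFromExitCode(code: number): PipelineOutcome {
  switch (code) {
    case 0:
      return "success";
    case 1:
      return "error";
    case 2:
      return "consensus_failure";
    default:
      return "unknown";
  }
}

// A pipeline may leave ORCHESTRATING only with consensus or an explicit user override.
function mayComplete(consensusReached: boolean, userAcceptedPartial: boolean): boolean {
  return consensusReached || userAcceptedPartial;
}
```

Separating exit code 2 from the generic error code lets a caller offer the consensus fallback actions only when they actually apply.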
---
## Recovery After Reset
```bash
# 1. Load checkpoint
checkpoint load
# 2. View combined status
checkpoint report
# 3. Check active bugs
bugs list --status open
# 4. Check pipeline consensus status and resume if needed
curl "localhost:3000/api/pipeline/consensus/status?pipeline_id=..."
```
---
## API Endpoints
### Pipeline Control
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/spawn` | POST | Spawn pipeline with plan |
| `/api/pipeline/continue` | POST | Trigger orchestration |
| `/api/pipeline/orchestration` | GET | Get orchestration status |
| `/api/pipeline/token` | GET | Get pipeline token status |
| `/api/pipeline/revoke` | POST | Revoke pipeline token |
| `/api/active-pipelines` | GET | List active pipelines |
| `/api/pipeline/consensus/status` | GET | Consensus status |
| `/api/pipeline/consensus/fallback` | POST | Execute fallback action |
### Observability
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/observability/errors` | GET | Error summary |
| `/api/observability/handoff` | POST | Generate handoff report |
---
*Phase 12: Observability - COMPLETE*
**All 12 phases validated** | 295/295 tests passing | Last updated: 2026-01-24