profit 8561d13728 Add Vault token management and observability integration for multi-agent pipelines
- Vault token issuance per pipeline with 2-hour TTL
- Automatic token renewal loop every 30 minutes
- Error budget tracking with threshold-based revocation
- Observability-driven token revocation for policy violations
- Diagnostic pipeline spawning on error threshold breach
- Structured handoff reports for error recovery
- Agent lifecycle status API
- New API endpoints: /api/pipeline/token, /api/pipeline/errors,
  /api/observability/handoff, /api/observability/diagnostic

Orchestrator now reports errors to parent pipeline's observability
system via PIPELINE_ID environment variable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:45:20 -05:00

Agent Governance System

A comprehensive framework for governing AI agent execution with security, auditability, and coordination.

Overview

The Agent Governance System provides infrastructure for running AI agents with:

  • Tiered permissions (T0 observer, T1 executor, T2 admin)
  • Audit trails via SQLite ledger
  • Secure credentials via HashiCorp Vault
  • State coordination via DragonflyDB
  • Pipeline orchestration for multi-agent workflows
  • Context management for long-running sessions
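
The tier model above can be read as an ordered privilege scale. The following is a minimal sketch under that assumption, not the project's actual implementation: the action names and the minimum tier assigned to each are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Permission tiers named in the overview; a higher value means more privilege."""
    T0_OBSERVER = 0
    T1_EXECUTOR = 1
    T2_ADMIN = 2


# Hypothetical actions and the minimum tier each one requires (assumption).
REQUIRED_TIER = {
    "read_state": Tier.T0_OBSERVER,
    "run_stage": Tier.T1_EXECUTOR,
    "revoke_agent": Tier.T2_ADMIN,
}


def is_allowed(agent_tier: Tier, action: str) -> bool:
    """Return True if the agent's tier meets the action's minimum tier.

    Actions missing from REQUIRED_TIER are denied by default.
    """
    required = REQUIRED_TIER.get(action)
    return required is not None and agent_tier >= required
```

Denying unknown actions by default keeps a newly added action from being silently available to every tier.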

Quick Start

# Check system status
checkpoint load                    # Load session state
status dashboard                   # View directory progress
memory stats                       # Check memory usage

# Create checkpoint after work
checkpoint now --notes "Description of completed work"

Key Components

Directory     Purpose                            Status
pipeline/     Pipeline DSL and core definitions  Complete
runtime/      Agent lifecycle and governance     Complete
checkpoint/   Session state management           Complete
memory/       External memory layer              Complete
teams/        Hierarchical team framework        Complete
analytics/    Learning and pattern detection     Complete
tests/        Test suites including chaos tests  🚧 In Progress

CLI Tools

Context Management

# Checkpoints - session state snapshots
checkpoint now --notes "..."       # Create checkpoint
checkpoint load                    # Load latest
checkpoint report                  # Combined status view
checkpoint timeline                # History

# Status - per-directory tracking
status sweep                       # Check all directories
status update <dir> --phase <p>    # Update status
status dashboard                   # Overview

# Memory - large content storage
memory log --stdin                 # Store from pipe
memory fetch <id> -s               # Get summary
memory list                        # Browse entries

Agent Operations

# Run chaos tests
python tests/multi-agent-chaos/orchestrator.py

# Validate pipelines
python pipeline/pipeline.py validate <file.yaml>

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Agent Governance                         │
├──────────────┬──────────────┬──────────────┬───────────────┤
│   Agents     │   Pipeline   │   Runtime    │   Context     │
│              │              │              │               │
│ • T0 Observer│ • DSL Parser │ • Lifecycle  │ • Checkpoints │
│ • T1 Executor│ • Stages     │ • Governance │ • STATUS      │
│ • T2 Admin   │ • Templates  │ • Revocation │ • Memory      │
├──────────────┴──────────────┴──────────────┴───────────────┤
│                    Infrastructure                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │  Vault   │  │ Dragonfly│  │  Ledger  │  │  Evidence  │  │
│  │ (secrets)│  │  (state) │  │  (audit) │  │ (artifacts)│  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘  │
└─────────────────────────────────────────────────────────────┘
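
The Ledger box in the diagram is described elsewhere in this README as a SQLite audit ledger. As a rough illustration of what an append-only audit table could look like, here is a sketch using Python's standard `sqlite3` module. The table name, columns, and trigger names are assumptions; the real schema lives in ledger/ and may differ.

```python
import json
import sqlite3
import time


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a ledger database with a hypothetical audit_events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS audit_events (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            ts       REAL NOT NULL,
            agent_id TEXT NOT NULL,
            action   TEXT NOT NULL,
            detail   TEXT NOT NULL
        )
        """
    )
    # Triggers reject UPDATE and DELETE so existing rows cannot be changed.
    for op in ("UPDATE", "DELETE"):
        conn.execute(
            f"""
            CREATE TRIGGER IF NOT EXISTS audit_no_{op.lower()}
            BEFORE {op} ON audit_events
            BEGIN
                SELECT RAISE(ABORT, 'audit ledger is append-only');
            END
            """
        )
    conn.commit()
    return conn


def record(conn: sqlite3.Connection, agent_id: str, action: str, detail: dict) -> int:
    """Append one audit event and return its row id."""
    cur = conn.execute(
        "INSERT INTO audit_events (ts, agent_id, action, detail) VALUES (?, ?, ?, ?)",
        (time.time(), agent_id, action, json.dumps(detail, sort_keys=True)),
    )
    conn.commit()
    return cur.lastrowid
```

Enforcing append-only behavior in the database itself, rather than only in application code, means a buggy or compromised agent process cannot rewrite history through the same connection.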

Documentation

Document               Description
ARCHITECTURE.md        Full system design
CONTEXT_MANAGEMENT.md  Checkpoints, STATUS, Memory
MEMORY_LAYER.md        External memory details
STATUS_PROTOCOL.md     Directory status protocol

Directory Structure

agent-governance/
├── agents/           # Agent implementations (T0, T1, T2)
├── analytics/        # Learning and pattern detection
├── bin/              # CLI tools (checkpoint, status, memory)
├── checkpoint/       # Session state management
├── docs/             # Documentation
├── evidence/         # Audit evidence packages
├── integrations/     # External integrations (GitHub, Slack)
├── ledger/           # SQLite audit ledger
├── memory/           # External memory layer
├── orchestrator/     # Multi-agent orchestration
├── pipeline/         # Pipeline DSL and templates
├── preflight/        # Pre-execution validation
├── runtime/          # Agent lifecycle governance
├── sandbox/          # Sandboxed execution (Terraform, Ansible)
├── schemas/          # JSON schemas
├── teams/            # Hierarchical team framework
├── tests/            # Test suites
└── wrappers/         # Tool wrappers

Current Status

Progress: ███████░░░░░░░░░░░░░░░░░░░░░░░ 23%

✅ Complete:       14 directories
🚧 In Progress:     5 directories

Run the "status dashboard" command for current details.

Recovery After Reset

# 1. Load checkpoint
checkpoint load

# 2. View combined status
checkpoint report

# 3. Check memory
memory list --limit 5

# 4. Resume work
status update ./target-dir --task "Resuming work"
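
The four recovery steps above can also be scripted so that a failed step stops the sequence. The sketch below assumes the `checkpoint`, `memory`, and `status` tools are on PATH and exit with a non-zero code on failure; neither assumption is confirmed by this README. The `recover` and `_run` helpers are hypothetical names.

```python
import subprocess
from typing import Callable, List, Sequence

# The recovery commands, in the order listed above.
RECOVERY_STEPS: List[List[str]] = [
    ["checkpoint", "load"],
    ["checkpoint", "report"],
    ["memory", "list", "--limit", "5"],
    ["status", "update", "./target-dir", "--task", "Resuming work"],
]


def _run(argv: Sequence[str]) -> int:
    """Run one command and return its exit code.

    Raises FileNotFoundError if the tool is not on PATH.
    """
    return subprocess.run(list(argv)).returncode


def recover(run: Callable[[Sequence[str]], int] = _run) -> List[List[str]]:
    """Run the steps in order and stop at the first non-zero exit code.

    Returns the steps that completed successfully.
    """
    done: List[List[str]] = []
    for step in RECOVERY_STEPS:
        if run(step) != 0:
            break
        done.append(step)
    return done
```

Passing the runner in as a parameter makes the sequence easy to dry-run or test without touching real session state.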

Dependencies

Service          Purpose             Port
HashiCorp Vault  Secrets management  8200
DragonflyDB      State coordination  6379
SQLite           Audit ledger        File
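
Before starting agents, it can help to confirm that the network services listed above are reachable. This is a minimal sketch using only the standard library. The ports come from the table; the host `127.0.0.1` is an assumption for a local setup, and a successful TCP connection only shows that something is listening, not that the service is healthy or unsealed.

```python
import socket
from typing import Dict, Tuple

# Ports are taken from the dependency table; the host is an assumption.
SERVICES: Dict[str, Tuple[str, int]] = {
    "vault": ("127.0.0.1", 8200),
    "dragonfly": ("127.0.0.1", 6379),
}


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: Dict[str, Tuple[str, int]] = SERVICES) -> Dict[str, bool]:
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}
```

SQLite is omitted because it is a local file rather than a network service.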

Phase 8: Production Hardening - In Progress

Completed Phases: 1-7 | Foundation, Vault, Pipeline, Promotion/Revocation, Agent Bootstrap, DSL/Templates/Testing, Teams/Learning
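
The commit notes at the top of this page describe per-pipeline Vault tokens with a 2-hour TTL, renewal every 30 minutes, and revocation when an error budget is exceeded. The sketch below models only that timing and threshold logic in plain Python. It does not call Vault, and the class names, the `threshold` parameter, and the choice to extend the lease by a full TTL on renewal are assumptions made for illustration.

```python
from dataclasses import dataclass

TOKEN_TTL_S = 2 * 60 * 60     # 2-hour TTL, from the commit notes
RENEW_INTERVAL_S = 30 * 60    # renewal every 30 minutes, from the commit notes


@dataclass
class PipelineToken:
    """In-memory stand-in for a per-pipeline token lease (hypothetical)."""
    pipeline_id: str
    issued_at: float
    expires_at: float
    revoked: bool = False

    def is_valid(self, now: float) -> bool:
        """A token is valid while it is neither revoked nor expired."""
        return not self.revoked and now < self.expires_at

    def renew(self, now: float) -> bool:
        """Extend the lease by a full TTL; refuse if revoked or already expired."""
        if not self.is_valid(now):
            return False
        self.expires_at = now + TOKEN_TTL_S
        return True


def issue(pipeline_id: str, now: float) -> PipelineToken:
    """Issue a token that expires one TTL after `now`."""
    return PipelineToken(pipeline_id, now, now + TOKEN_TTL_S)


@dataclass
class ErrorBudget:
    """Counts errors and revokes the token at a threshold (assumed semantics)."""
    threshold: int
    errors: int = 0

    def record_error(self, token: PipelineToken) -> bool:
        """Count one error; return True once the token has been revoked."""
        self.errors += 1
        if self.errors >= self.threshold:
            token.revoked = True
        return token.revoked
```

With a 30-minute renewal interval and a 2-hour TTL, several consecutive renewals can fail before the token actually expires, which leaves room for transient outages.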
