profit ef18567674 Implement real supervisor-driven auto-recovery

Orchestrator changes:
- Force-spawn GAMMA on iteration_limit before abort
- GAMMA.synthesize() creates emergency handoff payload
- loadRecoveryContext() logs "Resuming from {task_id} handoff"
- POST to /api/pipeline/log for resume message visibility

AgentGamma changes:
- Add synthesize() method for emergency abort synthesis
- Merges existing proposals into coherent handoff
- Stores as synthesis_type: "abort_recovery"

Server changes:
- Add POST /api/pipeline/log endpoint for orchestrator logging
- Recovery pipeline properly inherits GAMMA synthesis

Test coverage:
- test_auto_recovery.py: 6 unit tests
- test_e2e_auto_recovery.py: 5 E2E tests
- test_supervisor_recovery.py: 3 supervisor tests
  - Success on attempt 2 (recovery works)
  - Max failures (3 retries then FAILED)
  - Success on attempt 1 (no recovery needed)

Recovery flow:
1. iteration_limit triggers
2. GAMMA force-spawned for emergency synthesis
3. Handoff dumped with GAMMA synthesis
4. Exit code 3 triggers auto-recovery
5. Recovery pipeline loads handoff
6. Logs "Resuming from {prior_pipeline} handoff"
7. Repeat up to 3 times or until success

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-24 19:47:56 -05:00

README.md

Agent Governance System

Production-grade framework for governing AI agent execution with multi-agent orchestration, Vault-backed security, real-time observability, and consensus-driven workflows.

Status: Phase 12 COMPLETE | Tests: 295/295 passing | Coverage: All 12 phases validated

Quick Start

# Check system health
checkpoint load                           # Load session state
checkpoint report                         # View combined status
validate-phases --verbose                 # Run full validation (295 tests)

# Run the orchestration dashboard
cd /opt/agent-governance/ui && bun run server.ts
# Dashboard: http://localhost:3000

# Bug tracking
bugs list --status open                   # View open bugs
bugs log -m "Description" --severity high # Log new bug

# Pipeline operations
pipeline spawn --plan <plan_id> --tier 1  # Spawn pipeline agents

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              GOVERNANCE LAYER                                    │
│  ┌──────────────────┐  ┌───────────────────┐  ┌─────────────────────────────┐  │
│  │  HashiCorp Vault │  │    DragonflyDB    │  │      SQLite Ledger          │  │
│  │                  │  │                   │  │                             │  │
│  │  - Per-pipeline  │  │  - Blackboard     │  │  - agent_actions            │  │
│  │    token mgmt    │  │  - Metrics        │  │  - agent_metrics            │  │
│  │  - 2hr TTL +     │  │  - Consensus      │  │  - violations               │  │
│  │    auto-renewal  │  │  - Message bus    │  │  - promotions               │  │
│  │  - Observability │  │  - Error budgets  │  │  - tenants/projects         │  │
│  │    revocation    │  │  - WebSocket pub  │  │  - marketplace              │  │
│  └──────────────────┘  └───────────────────┘  └─────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                           ORCHESTRATION LAYER                                    │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │                       Multi-Agent Pipeline                               │    │
│  │                                                                          │    │
│  │   SPAWN ──► RUNNING ──► REPORT ──► ORCHESTRATING ──► COMPLETED          │    │
│  │     │          │          │             │                │               │    │
│  │   Issue     Agent      Report       ALPHA+BETA       Consensus          │    │
│  │   Vault     Status     Ready        Parallel         Achieved           │    │
│  │   Token     Updates                     │                               │    │
│  │                                   Error/Stuck?                          │    │
│  │                                         │ YES                           │    │
│  │                                   SPAWN GAMMA                           │    │
│  │                                   (Mediator)                            │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────────────────────────┤
│                               AGENT LAYER                                        │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐  │
│  │  Agent ALPHA  │  │  Agent BETA   │  │  Agent GAMMA  │  │  Governed LLM   │  │
│  │  (Research)   │◄─┼─► (Synthesis) │◄─┼─► (Mediator)  │  │  (T0/T1/T2)     │  │
│  │               │  │               │  │               │  │                 │  │
│  │  Parallel     │  │  Direct       │  │  Spawned on:  │  │  - llm-planner  │  │
│  │  Execution    │  │  Messages     │  │  - Stuck 30s  │  │  - tier0-agent  │  │
│  │               │  │               │  │  - Conflict 3 │  │  - tier1-agent  │  │
│  │               │  │               │  │  - Complex .8 │  │                 │  │
│  └───────┬───────┘  └───────┬───────┘  └───────────────┘  └─────────────────┘  │
│          └──────────────────┴──────────────────────────────────────────────────│
│                                    │                                            │
│                         ┌──────────▼──────────┐                                 │
│                         │     Blackboard      │                                 │
│                         │  - problem          │                                 │
│                         │  - solutions[]      │                                 │
│                         │  - progress         │                                 │
│                         │  - consensus        │                                 │
│                         └─────────────────────┘                                 │
├─────────────────────────────────────────────────────────────────────────────────┤
│                               UI / API LAYER                                     │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │  Orchestration Dashboard (Bun + WebSocket)                              │    │
│  │  - Real-time pipeline status        - Agent lifecycle cards             │    │
│  │  - Consensus failure alerts         - Fallback action buttons           │    │
│  │  - Log streaming                    - Metrics display                   │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────┘
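The pipeline states in the orchestration layer can be read as a small state machine. The following is a minimal sketch, assuming a strictly linear happy path and a single `consensus_reached` flag; the names `PipelineState`, `NEXT_STATE`, and `advance` are hypothetical and not taken from the codebase.

```python
from enum import Enum


class PipelineState(Enum):
    SPAWN = "SPAWN"
    RUNNING = "RUNNING"
    REPORT = "REPORT"
    ORCHESTRATING = "ORCHESTRATING"
    COMPLETED = "COMPLETED"


# Linear happy path as drawn in the diagram (assumed to be the only path here).
NEXT_STATE = {
    PipelineState.SPAWN: PipelineState.RUNNING,
    PipelineState.RUNNING: PipelineState.REPORT,
    PipelineState.REPORT: PipelineState.ORCHESTRATING,
    PipelineState.ORCHESTRATING: PipelineState.COMPLETED,
}


def advance(state: PipelineState, consensus_reached: bool = False) -> PipelineState:
    """Return the next state; ORCHESTRATING only completes once consensus is reached."""
    if state is PipelineState.COMPLETED:
        return state  # terminal state
    if state is PipelineState.ORCHESTRATING and not consensus_reached:
        return state  # keep orchestrating until consensus (or a fallback) resolves it
    return NEXT_STATE[state]
```

Fallback actions such as accepting partial output are not modeled in this sketch.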

Core Components

Component	Purpose	Status
agents/	ALPHA/BETA/GAMMA multi-agent + T0/T1/T2 governed agents	Complete
ui/	Orchestration dashboard with WebSocket real-time updates	Complete
pipeline/	Pipeline DSL, templates, and execution engine	Complete
orchestrator/	Multi-agent coordination with consensus tracking	Complete
observability/	Prometheus metrics, distributed tracing, structured logging	Complete
marketplace/	Agent template registry with FTS5 search	Complete
checkpoint/	Session state management and recovery	Complete
ledger/	SQLite audit trail with multi-tenant support	Complete
testing/	295 tests across 12 phases + chaos testing	Complete

Key Workflows

Multi-Agent Pipeline

1. Spawn: Pipeline is created with an objective and issued a Vault token (2hr TTL, auto-renew)
2. Running: ALPHA (research) and BETA (synthesis) agents work in parallel
3. Orchestrating: Agents communicate via blackboard + direct messages
4. Consensus: Proposals evaluated, votes counted, conflicts resolved
5. GAMMA Spawn: Mediator is spawned if agents are stuck >30s, conflicts exceed 3, or complexity exceeds 0.8
6. Completion: Final consensus achieved or fallback action taken
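
The GAMMA spawn conditions above can be sketched as a simple threshold check. This is an illustrative sketch only: the thresholds come from the workflow description, while `PipelineSnapshot`, its fields, and `gamma_spawn_reasons` are hypothetical names, and the real orchestrator may measure these values differently.

```python
from __future__ import annotations

from dataclasses import dataclass

# Thresholds from the workflow description; constant names are hypothetical.
STUCK_SECONDS = 30
MAX_CONFLICTS = 3
COMPLEXITY_THRESHOLD = 0.8


@dataclass
class PipelineSnapshot:
    """Hypothetical view of the state an orchestrator might inspect."""
    seconds_since_progress: float
    conflict_count: int
    complexity: float


def gamma_spawn_reasons(snapshot: PipelineSnapshot) -> list[str]:
    """Return every condition that currently calls for the GAMMA mediator."""
    reasons = []
    if snapshot.seconds_since_progress > STUCK_SECONDS:
        reasons.append("stuck")
    if snapshot.conflict_count > MAX_CONFLICTS:
        reasons.append("conflict")
    if snapshot.complexity > COMPLEXITY_THRESHOLD:
        reasons.append("complexity")
    return reasons
```

An empty list means ALPHA and BETA continue without a mediator; any non-empty list would justify spawning GAMMA.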

Consensus Failure Handling

When agents fail to reach consensus:

Rerun Same: Spawn fresh ALPHA/BETA with failure context
Rerun with GAMMA: Force mediator agent for conflict resolution
Escalate Tier: Increase agent permissions and retry
Accept Partial: Complete with best available proposal
Download Log: Export full context for manual review

Vault Token Lifecycle

Pipeline Start
      │
      ▼
┌─────────────────────────────────────┐
│ 1. Request Token (AppRole)          │
│    TTL: 2 hours, renewable          │
│    Policy: pipeline-agent           │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 2. Store in DragonflyDB (encrypted) │
│    Key: pipeline:{id}:vault_token   │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 3. Pass to ALPHA, BETA, GAMMA       │
│    Auto-renewal every 30 min        │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 4. Observability monitors usage     │
│    Revoke on policy violation       │
└─────────────────────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│ 5. Revoke on completion/error       │
└─────────────────────────────────────┘
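
The renewal timing in the lifecycle above can be illustrated with plain date arithmetic. This is a minimal sketch, assuming each successful renewal resets the token's lifetime to the full 2 hours; `renewal_due` and `token_expired` are hypothetical helpers and make no Vault API calls.

```python
from datetime import datetime, timedelta, timezone

# Values from the lifecycle diagram; constant names are hypothetical.
TOKEN_TTL = timedelta(hours=2)
RENEW_INTERVAL = timedelta(minutes=30)


def renewal_due(last_renewed: datetime, now: datetime) -> bool:
    """True once the 30-minute renewal interval has elapsed."""
    return now - last_renewed >= RENEW_INTERVAL


def token_expired(last_renewed: datetime, now: datetime) -> bool:
    """True once the 2-hour TTL has elapsed without a renewal (assumed TTL reset on renew)."""
    return now - last_renewed >= TOKEN_TTL


issued = datetime(2026, 1, 24, 12, 0, tzinfo=timezone.utc)
check_time = issued + timedelta(minutes=45)
status = (renewal_due(issued, check_time), token_expired(issued, check_time))
```

With these values, a token checked 45 minutes after issue is due for renewal but not yet expired, which leaves room for several missed renewals before the TTL lapses.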

CLI Tools

Context Management

# Checkpoints - session state snapshots
checkpoint now --notes "..."       # Create checkpoint
checkpoint load                    # Load latest
checkpoint report                  # Combined status view
checkpoint timeline                # History

# Status - per-directory tracking
status sweep                       # Check all directories
status update <dir> --phase <p>    # Update status
status dashboard                   # Overview

# Memory - large content storage
memory log --stdin                 # Store from pipe
memory fetch <id> -s               # Get summary
memory list                        # Browse entries

Bug Tracking

bugs list                          # List all bugs
bugs list --status open            # Filter by status
bugs list --severity high          # Filter by severity
bugs log -m "Description"          # Log new bug
bugs update <id> resolved          # Update status
bugs get <id>                      # Get details
bugs scan                          # Scan for anomalies
bugs status                        # Summary view

Pipeline Operations

# Validation
validate-phases --verbose          # Full 12-phase validation

# Pipeline management (via dashboard API)
curl -X POST localhost:3000/api/spawn \
  -H 'Content-Type: application/json' \
  -d '{"plan_id":"...", "tier":1}'

# Consensus handling (quote the URL so the shell does not glob on "?")
curl 'localhost:3000/api/pipeline/consensus/status?pipeline_id=...'
curl -X POST localhost:3000/api/pipeline/consensus/fallback \
  -H 'Content-Type: application/json' \
  -d '{"pipeline_id":"...", "action":"rerun_gamma"}'
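
For scripted use, the fallback call can also be built with Python's standard library. This is a sketch under assumptions: the dashboard is taken to listen on `http://localhost:3000` as in the Quick Start, the server is assumed to accept a JSON body, and `build_fallback_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # dashboard address from the Quick Start


def build_fallback_request(pipeline_id: str, action: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the consensus fallback endpoint."""
    body = json.dumps({"pipeline_id": pipeline_id, "action": action}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/pipeline/consensus/fallback",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "pipeline-123" is a placeholder ID for illustration only.
request = build_fallback_request("pipeline-123", "rerun_gamma")
```

Passing `request` to `urllib.request.urlopen` would send it, which requires the dashboard to be running.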

Phase Completion Status

Phase	Name	Tests	Status
1	Foundation	12/12	Complete
2	Secrets Management	14/14	Complete
3	Agent Execution	19/19	Complete
4	Promotion & Revocation	16/16	Complete
5	Bootstrap & Checkpointing	22/22	Complete
6	Multi-Agent Orchestration	56/56	Complete
7	Monitoring & Learning	46/46	Complete
8	Production Hardening	31/31	Complete
9	External Integrations	-	Framework retained; external integrations deprecated
10	Multi-Tenant Support	18/18	Complete
11	Agent Marketplace	16/16	Complete
12	Observability	21/21	Complete
	Total	295/295	Complete

Dependencies

Service	Purpose	Endpoint
HashiCorp Vault	Secrets, token management	https://127.0.0.1:8200
DragonflyDB	State, metrics, pub/sub	redis://127.0.0.1:6379
SQLite	Audit ledger, marketplace	File-based
Bun	TypeScript runtime	Local
OpenRouter	LLM API gateway	External

Directory Structure

agent-governance/
├── agents/               # Agent implementations
│   ├── multi-agent/      # ALPHA/BETA/GAMMA orchestrator
│   ├── llm-planner/      # Python LLM agent
│   ├── llm-planner-ts/   # TypeScript LLM agent
│   ├── tier0-agent/      # Observer tier (read-only)
│   └── tier1-agent/      # Executor tier (write)
├── bin/                  # CLI tools
├── checkpoint/           # Session state management
├── docs/                 # Documentation
├── evidence/             # Audit evidence packages
├── integrations/         # Integration framework
├── ledger/               # SQLite audit ledger + API
├── marketplace/          # Agent template registry
├── memory/               # External memory layer
├── observability/        # Metrics, tracing, logging
├── orchestrator/         # Pipeline orchestration
├── pipeline/             # Pipeline DSL and templates
├── preflight/            # Pre-execution validation
├── sandbox/              # Terraform/Ansible sandbox
├── testing/              # Test framework + oversight
├── tests/                # Test suites (295 tests)
└── ui/                   # Orchestration dashboard

Documentation

Architecture & Design

Document	Description
ARCHITECTURE.md	Full system design
MULTI_AGENT_PIPELINE_ARCHITECTURE.md	Pipeline flow, Vault tokens, agent lifecycle
PHASE_DEPENDENCY_ANALYSIS.md	Phase dependencies and order

Implementation & Operations

Document	Description
PRODUCTION_PIPELINE.md	Implementation plan and production workflows
ENGINEERING_GUIDE.md	Runtime governance spec and quick reference
CREDENTIALS_SETUP.md	Vault and DragonflyDB setup

Context & Memory

Document	Description
CONTEXT_MANAGEMENT.md	Checkpoints, STATUS, Memory
STATUS_PROTOCOL.md	Directory status protocol
MEMORY_LAYER.md	External memory layer details

Agent Documentation

Document	Description
agents/README.md	Agent foundation and tier system
tier0-guide.md	Tier 0 agent guide

External References

Resource	Description
HashiCorp Vault	Secrets management documentation
Bun Runtime	TypeScript runtime documentation
DragonflyDB	Redis-compatible database docs

Production Constraints

Token Revocation Triggers

Condition	Threshold	Action
Error rate	> 5 errors/minute	Revoke + spawn diagnostic
Stuck agent	> 60 seconds no progress	Revoke agent token only
Policy violation	Any CRITICAL	Immediate full revocation
Resource abuse	> 100 API calls/minute	Rate limit, then revoke

Consensus Requirements

Pipelines remain in ORCHESTRATING until consensus achieved
Exit code 0 = success, 1 = error, 2 = consensus failure, 3 = iteration-limit abort with handoff (triggers auto-recovery)
Failure context recorded to DragonflyDB for retry attempts
User must explicitly accept partial output to complete without consensus
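
The exit codes, together with the auto-recovery flow in the commit notes at the top of this page (exit code 3 triggers a recovery run, repeated up to 3 times or until success), suggest a supervisor loop along these lines. This is a sketch under assumptions: the enum member names and `supervise` are hypothetical, "up to 3 times" is read as at most three attempts in total, and loading the handoff is left to the caller's `run_pipeline`.

```python
from __future__ import annotations

from enum import IntEnum
from typing import Callable


class ExitCode(IntEnum):
    SUCCESS = 0
    ERROR = 1
    CONSENSUS_FAILURE = 2
    RECOVERABLE_ABORT = 3  # from the commit notes; the name is hypothetical


def supervise(run_pipeline: Callable[[int], int], max_attempts: int = 3) -> tuple[ExitCode, int]:
    """Re-run the pipeline while it exits with code 3, up to max_attempts (must be >= 1).

    run_pipeline receives the 1-based attempt number and returns an exit code.
    Returns the final exit code and the number of attempts used.
    """
    code = ExitCode.RECOVERABLE_ABORT
    for attempt in range(1, max_attempts + 1):
        code = ExitCode(run_pipeline(attempt))
        if code != ExitCode.RECOVERABLE_ABORT:
            return code, attempt  # success, error, or consensus failure: stop retrying
    return code, max_attempts  # still aborting after the last attempt


# Example: first run hits the iteration limit, the recovery run succeeds.
outcomes = iter([3, 0])
result = supervise(lambda attempt: next(outcomes))
```

The example mirrors the "success on attempt 2" case named in the supervisor tests; consensus failures are returned to the caller unchanged so a fallback action can be chosen explicitly.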

Recovery After Reset

# 1. Load checkpoint
checkpoint load

# 2. View combined status
checkpoint report

# 3. Check active bugs
bugs list --status open

# 4. Check pipeline consensus status before resuming
curl 'localhost:3000/api/pipeline/consensus/status?pipeline_id=...'

API Endpoints

Pipeline Control

Endpoint	Method	Description
`/api/spawn`	POST	Spawn pipeline with plan
`/api/pipeline/continue`	POST	Trigger orchestration
`/api/pipeline/orchestration`	GET	Get orchestration status
`/api/pipeline/token`	GET	Get pipeline token status
`/api/pipeline/revoke`	POST	Revoke pipeline token
`/api/active-pipelines`	GET	List active pipelines
`/api/pipeline/consensus/status`	GET	Consensus status
`/api/pipeline/consensus/fallback`	POST	Execute fallback action

Observability

Endpoint	Method	Description
`/api/observability/errors`	GET	Error summary
`/api/observability/handoff`	POST	Generate handoff report

Phase 12: Observability - COMPLETE

All 12 phases validated | 295/295 tests passing | Last updated: 2026-01-24