History

profit 77655c298c Initial commit: Agent Governance System Phase 8

Phase 8 Production Hardening with complete governance infrastructure:

- Vault integration with tiered policies (T0-T4)
- DragonflyDB state management
- SQLite audit ledger
- Pipeline DSL and templates
- Promotion/revocation engine
- Checkpoint system for session persistence
- Health manager and circuit breaker for fault tolerance
- GitHub/Slack integrations
- Architectural test pipeline with bug watcher, suggestion engine, council review
- Multi-agent chaos testing framework

Test Results:
- Governance tests: 68/68 passing
- E2E workflow: 16/16 passing
- Phase 2 Vault: 14/14 passing
- Integration tests: 27/27 passing

Coverage: 57.6% average across 12 phases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-23 22:07:06 -05:00

README.md

Context Checkpoint Skill

Phase 5: Agent Bootstrapping

A context preservation system that helps maintain session state across token window resets, CLI restarts, and sub-agent orchestration.

Overview

The checkpoint skill provides:

- Periodic State Capture: captures phase, tasks, dependencies, variables, and outputs
- Token-Aware Summarization: creates minimal context summaries for sub-agent calls
- CLI Integration: manual and automatic checkpoint management
- Extensible Storage: JSON files with DragonflyDB caching (future: remote sync)

Quick Start

# Add to PATH (optional)
export PATH="/opt/agent-governance/bin:$PATH"

# Create a checkpoint
checkpoint now --notes "Starting task X"

# Load latest checkpoint
checkpoint load

# Compare with previous
checkpoint diff

# Get compact context summary
checkpoint summary

# List all checkpoints
checkpoint list

Commands

`checkpoint now`

Create a new checkpoint capturing current state.

checkpoint now                           # Basic checkpoint
checkpoint now --notes "Phase 5 start"   # With notes
checkpoint now --var key value           # With variables
checkpoint now --var a 1 --var b 2       # Multiple variables
checkpoint now --json                    # Output JSON

Captures:

- Current phase (from implementation plan)
- Task states (from governance DB)
- Dependency status (Vault, DragonflyDB, Ledger)
- Custom variables
- Recent evidence outputs
- Agent ID and tier

`checkpoint load`

Load a checkpoint for review or restoration.

checkpoint load                          # Load latest
checkpoint load ckpt-20260123-120000-abc # Load specific
checkpoint load --json                   # Output JSON

`checkpoint diff`

Compare two checkpoints to see what changed.

checkpoint diff                          # Latest vs previous
checkpoint diff --from ckpt-A --to ckpt-B  # Specific comparison
checkpoint diff --json                   # Output JSON

Detects:

- Phase changes
- Task additions/removals/status changes
- Dependency status changes
- Variable additions/changes/removals
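
To make the variable and task-status categories above concrete, here is a minimal sketch of how two checkpoints could be compared. It works on plain dictionaries shaped like the checkpoint JSON shown later in this README; `diff_variables` and `diff_task_status` are hypothetical helper names for illustration, not functions of the `checkpoint` module.

```python
from typing import Any


def diff_variables(old: dict[str, Any], new: dict[str, Any]) -> dict[str, dict]:
    """Classify variable differences between two checkpoints."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: {"from": old[k], "to": new[k]}
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def diff_task_status(old_tasks: list[dict], new_tasks: list[dict]) -> dict[str, tuple]:
    """Report tasks present in both checkpoints whose status differs."""
    old_status = {t["id"]: t["status"] for t in old_tasks}
    new_status = {t["id"]: t["status"] for t in new_tasks}
    return {
        task_id: (old_status[task_id], new_status[task_id])
        for task_id in old_status.keys() & new_status.keys()
        if old_status[task_id] != new_status[task_id]
    }


# Hypothetical sample data for illustration only.
previous = {"variables": {"target_env": "staging", "retries": 3},
            "tasks": [{"id": "1", "status": "in_progress"}]}
latest = {"variables": {"target_env": "production", "rollback_id": "v1.2.3"},
          "tasks": [{"id": "1", "status": "completed"},
                    {"id": "2", "status": "pending"}]}

variable_diff = diff_variables(previous["variables"], latest["variables"])
status_diff = diff_task_status(previous["tasks"], latest["tasks"])
```

Task additions and removals could be derived the same way from the set difference of the task IDs.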

`checkpoint list`

List available checkpoints.

checkpoint list                          # Last 20
checkpoint list --limit 50               # Custom limit
checkpoint list --json                   # Output JSON

`checkpoint summary`

Generate context summary for review or sub-agent injection.

checkpoint summary                       # Compact (default)
checkpoint summary --level minimal       # ~500 tokens
checkpoint summary --level compact       # ~1000 tokens
checkpoint summary --level standard      # ~2000 tokens
checkpoint summary --level full          # ~4000 tokens
checkpoint summary --for terraform       # Task-specific
checkpoint summary --for governance      # Governance context

`checkpoint prune`

Remove old checkpoints.

checkpoint prune                         # Keep default (50)
checkpoint prune --keep 10               # Keep only 10

Token-Aware Sub-Agent Context

When orchestrating sub-agents, use the summarizer to minimize tokens while preserving essential context:

from checkpoint import CheckpointManager, ContextSummarizer

# Get latest checkpoint
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

# Create task-specific summary
summarizer = ContextSummarizer(checkpoint)

# For infrastructure tasks (~1000 tokens)
context = summarizer.for_subagent("terraform", max_tokens=1000)

# For governance tasks with specific variables
context = summarizer.for_subagent(
    "promotion",
    relevant_keys=["agent_id", "current_tier"],
    max_tokens=500
)

# Pass context to sub-agent
# (`subagent` and `task_prompt` are supplied by your orchestration code)
subagent.run(context + "\n\n" + task_prompt)

Summary Levels

| Level | Approx. tokens | Contents |
|-------|----------------|----------|
| `minimal` | ~500 | Phase, active task, agent, available deps |
| `compact` | ~1000 | + In-progress tasks, pending count, key vars |
| `standard` | ~2000 | + All tasks, all deps, recent outputs |
| `full` | ~4000 | + All variables, completed phases, metadata |
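
When a caller only knows its token budget, the level can be chosen from the approximate sizes in the table. The following is a sketch under that assumption; `LEVEL_BUDGETS` and `pick_level` are hypothetical names, and the sizes are the rough figures listed above rather than measured values.

```python
# Approximate sizes copied from the summary-level table (ascending order).
LEVEL_BUDGETS = {"minimal": 500, "compact": 1000, "standard": 2000, "full": 4000}


def pick_level(max_tokens: int) -> str:
    """Return the most detailed level whose approximate size fits the budget."""
    fitting = [name for name, size in LEVEL_BUDGETS.items() if size <= max_tokens]
    # Fall back to the smallest level when even `minimal` exceeds the budget.
    return fitting[-1] if fitting else "minimal"


chosen = pick_level(1500)
```

The result can then be passed to the CLI, for example `checkpoint summary --level compact`.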

Task-Specific Contexts

| Task Type | Included Context |
|-----------|------------------|
| `terraform`, `ansible`, `infrastructure` | Infrastructure dependencies, service status |
| `database`, `query`, `ledger` | Database connections, endpoints |
| `promotion`, `revocation`, `governance` | Agent tier, governance variables |
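
One way to read this table is as a lookup from task type to context group. The sketch below encodes it that way; the group names and the `context_group` function are hypothetical, and the summarizer may resolve task types differently.

```python
from typing import Optional

# Task types from the table, grouped by the kind of context they receive.
TASK_GROUPS = {
    "infrastructure": {"terraform", "ansible", "infrastructure"},
    "database": {"database", "query", "ledger"},
    "governance": {"promotion", "revocation", "governance"},
}


def context_group(task_type: str) -> Optional[str]:
    """Return the context group for a task type, or None if it is not listed."""
    normalized = task_type.strip().lower()
    for group, task_types in TASK_GROUPS.items():
        if normalized in task_types:
            return group
    return None


group_for_ansible = context_group("ansible")
group_for_unknown = context_group("email")
```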

Auto-Checkpoint Events

Checkpoints can be automatically created by integrating with governance hooks:

# In your agent code
from checkpoint import CheckpointManager

manager = CheckpointManager()

# Auto-checkpoint on phase transitions
def on_phase_complete(phase_num):
    manager.create_checkpoint(
        notes=f"Phase {phase_num} complete",
        variables={"completed_phase": phase_num}
    )

# Auto-checkpoint on task completion
def on_task_complete(task_id, result):
    manager.create_checkpoint(
        variables={"last_task": task_id, "result": result}
    )

# Auto-checkpoint on error
def on_error(error_type, message):
    manager.create_checkpoint(
        notes=f"Error: {error_type}",
        variables={"error_type": error_type, "error_msg": message}
    )
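
The error hook can also be packaged as a context manager so that any failing block records a checkpoint before the exception propagates. This is a sketch only: `checkpoint_on_error` is a hypothetical helper, and `RecordingManager` is a stand-in that mimics the `create_checkpoint(notes=..., variables=...)` call used above so the example runs without the real module.

```python
from contextlib import contextmanager


class RecordingManager:
    """Stand-in for CheckpointManager that only records calls (illustration)."""

    def __init__(self):
        self.calls = []

    def create_checkpoint(self, notes=None, variables=None):
        self.calls.append({"notes": notes, "variables": variables or {}})


@contextmanager
def checkpoint_on_error(manager):
    """Create a checkpoint if the wrapped block raises, then re-raise."""
    try:
        yield
    except Exception as exc:
        manager.create_checkpoint(
            notes=f"Error: {type(exc).__name__}",
            variables={"error_type": type(exc).__name__, "error_msg": str(exc)},
        )
        raise


manager = RecordingManager()
try:
    with checkpoint_on_error(manager):
        raise TimeoutError("vault unreachable")
except TimeoutError:
    pass  # The error still propagates; the checkpoint was recorded first.
```

Re-raising keeps normal error handling intact, so the checkpoint acts as a record and does not hide the failure.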

Restoring After Restart

After a CLI restart or token window reset:

# 1. Load the latest checkpoint
checkpoint load

# 2. Review what was happening
checkpoint summary --level full

# 3. If needed, compare with earlier state
checkpoint diff

# 4. Resume work with context
checkpoint summary --level compact > /tmp/context.txt
# Use /tmp/context.txt as the preamble for the new session

Programmatic Restoration

import os

from checkpoint import CheckpointManager, ContextSummarizer

manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

if checkpoint:
    # Restore environment variables
    for key, value in checkpoint.variables.items():
        os.environ[f"CKPT_{key.upper()}"] = str(value)

    # Get context for continuation
    summarizer = ContextSummarizer(checkpoint)
    context = summarizer.standard_summary()

    print(f"Restored from: {checkpoint.checkpoint_id}")
    print(f"Phase: {checkpoint.phase.name if checkpoint.phase else 'Unknown'}")
    print(f"Tasks: {len(checkpoint.tasks)}")

Storage Format

Checkpoint JSON Structure

{
  "checkpoint_id": "ckpt-20260123-120000-abc12345",
  "created_at": "2026-01-23T12:00:00.000000+00:00",
  "session_id": "optional-session-id",

  "phase": {
    "name": "Phase 5: Agent Bootstrapping",
    "number": 5,
    "status": "in_progress",
    "started_at": "2026-01-23T10:00:00+00:00",
    "notes": "Working on checkpoint skill"
  },
  "phases_completed": [1, 2, 3, 4],

  "tasks": [
    {
      "id": "1",
      "subject": "Create checkpoint module",
      "status": "completed",
      "owner": null,
      "blocks": [],
      "blocked_by": []
    }
  ],
  "active_task_id": "2",

  "dependencies": [
    {
      "name": "vault",
      "type": "service",
      "status": "available",
      "endpoint": "https://127.0.0.1:8200",
      "last_checked": "2026-01-23T12:00:00+00:00"
    }
  ],

  "variables": {
    "custom_key": "custom_value"
  },

  "recent_outputs": [
    {
      "type": "evidence",
      "id": "evd-20260123-...",
      "action": "terraform",
      "success": true,
      "timestamp": "2026-01-23T11:30:00+00:00"
    }
  ],

  "agent_id": "my-agent",
  "agent_tier": 1,

  "content_hash": "abc123...",
  "parent_checkpoint_id": "ckpt-20260123-110000-...",
  "estimated_tokens": 450
}
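
Before relying on a stored file, it can help to sanity-check it against the structure above. The following sketch assumes the ID pattern and a small set of required keys inferred from this example; `validate_checkpoint`, `REQUIRED_KEYS`, and `CHECKPOINT_ID_RE` are hypothetical names, and the real format may allow more variation.

```python
import json
import re

# Pattern inferred from the example ID above; the real format may differ.
CHECKPOINT_ID_RE = re.compile(r"^ckpt-\d{8}-\d{6}-[0-9a-z]+$")
# Subset of top-level keys from the example, chosen for illustration.
REQUIRED_KEYS = {"checkpoint_id", "created_at", "tasks", "dependencies", "variables"}


def validate_checkpoint(raw: str) -> list[str]:
    """Return a list of problems found in a checkpoint JSON document."""
    data = json.loads(raw)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - data.keys())]
    checkpoint_id = data.get("checkpoint_id", "")
    if not CHECKPOINT_ID_RE.match(checkpoint_id):
        problems.append(f"unexpected checkpoint_id: {checkpoint_id!r}")
    return problems


good = json.dumps({
    "checkpoint_id": "ckpt-20260123-120000-abc12345",
    "created_at": "2026-01-23T12:00:00.000000+00:00",
    "tasks": [], "dependencies": [], "variables": {},
})
bad = json.dumps({"checkpoint_id": "latest", "tasks": []})

good_problems = validate_checkpoint(good)
bad_problems = validate_checkpoint(bad)
```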

File Locations

/opt/agent-governance/checkpoint/
├── checkpoint.py       # Core module
├── README.md           # This documentation
├── storage/            # Checkpoint JSON files
│   ├── ckpt-20260123-120000-abc.json
│   ├── ckpt-20260123-110000-def.json
│   └── ...
└── templates/          # Future: checkpoint templates

/opt/agent-governance/bin/
└── checkpoint          # CLI wrapper

Extensibility

Adding Custom State Collectors

from checkpoint import CheckpointManager

class MyCheckpointManager(CheckpointManager):

    def collect_my_state(self) -> dict:
        # Custom state collection
        return {"my_data": "..."}

    def create_checkpoint(self, **kwargs):
        # Merge custom state into a copy so the caller's dict is not mutated
        custom_vars = dict(kwargs.get("variables") or {})
        custom_vars.update(self.collect_my_state())
        kwargs["variables"] = custom_vars

        return super().create_checkpoint(**kwargs)

Remote Storage (Future)

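# Planned: S3/remote sync (not implemented yet)
from checkpoint import CheckpointManager

class RemoteCheckpointManager(CheckpointManager):

    def __init__(self, s3_bucket: str):
        super().__init__()
        self.s3_bucket = s3_bucket

    def save_checkpoint(self, checkpoint):
        # Save locally
        local_path = super().save_checkpoint(checkpoint)

        # Sync to S3
        self._upload_to_s3(local_path)

        return local_path

    def _upload_to_s3(self, local_path):
        # Placeholder: upload logic is not implemented yet
        raise NotImplementedError("S3 upload is planned")

    def sync_from_remote(self):
        # Download checkpoints from S3
        raise NotImplementedError("Remote sync is planned")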

Integration with Agent Governance

The checkpoint skill integrates with the existing governance system:

┌─────────────────────────────────────────────────────────────┐
│                    Agent Governance                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Preflight │  │ Wrappers │  │ Evidence │  │Checkpoint│    │
│  │   Gate    │  │tf/ansible│  │ Package  │  │  Skill   │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
│       │             │             │             │           │
│       v             v             v             v           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                    DragonflyDB                        │  │
│  │  agent:* states | checkpoint:* | revocations:ledger   │  │
│  └──────────────────────────────────────────────────────┘  │
│       │             │             │             │           │
│       v             v             v             v           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                   SQLite Ledger                       │  │
│  │  agent_actions | violations | promotions | tasks      │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Best Practices

Checkpoint Before Risky Operations

checkpoint now --notes "Before production deployment"

Include Relevant Variables

checkpoint now --var target_env production --var rollback_id v1.2.3

Use Task-Specific Summaries for Sub-Agents

checkpoint summary --for terraform > context.txt

Review Diffs After Long Operations

checkpoint diff  # What changed?

Prune Regularly in Long-Running Sessions

checkpoint prune --keep 20

Troubleshooting

"No checkpoint found"

Create one first:

checkpoint now

High token estimates

Use more aggressive summarization:

checkpoint summary --level minimal

Missing dependencies

Check services:

docker exec vault vault status
redis-cli -p 6379 -a governance2026 PING
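
If the CLI tools are not at hand, a plain TCP probe gives a rough first answer about whether the services are reachable. This sketch assumes the ports that appear in this README (8200 for Vault, 6379 for DragonflyDB) and a local host; `is_port_open` and `check_services` are hypothetical helpers. An open port only shows that something is listening, not that the service is healthy or unsealed.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the examples in this README; adjust for your deployment.
SERVICES = {"vault": 8200, "dragonfly": 6379}


def check_services(host: str = "127.0.0.1") -> dict[str, bool]:
    """Probe each known service port and report reachability by name."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'unreachable'}")
```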

Stale checkpoints

Prune and recreate:

checkpoint prune --keep 5
checkpoint now --notes "Fresh start"