Phase 8 Production Hardening with complete governance infrastructure: - Vault integration with tiered policies (T0-T4) - DragonflyDB state management - SQLite audit ledger - Pipeline DSL and templates - Promotion/revocation engine - Checkpoint system for session persistence - Health manager and circuit breaker for fault tolerance - GitHub/Slack integrations - Architectural test pipeline with bug watcher, suggestion engine, council review - Multi-agent chaos testing framework Test Results: - Governance tests: 68/68 passing - E2E workflow: 16/16 passing - Phase 2 Vault: 14/14 passing - Integration tests: 27/27 passing Coverage: 57.6% average across 12 phases Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Context Checkpoint Skill
Phase 5: Agent Bootstrapping
A context preservation system that helps maintain session state across token window resets, CLI restarts, and sub-agent orchestration.
Overview
The checkpoint skill provides:
- Periodic State Capture - Captures phase, tasks, dependencies, variables, and outputs
- Token-Aware Summarization - Creates minimal context summaries for sub-agent calls
- CLI Integration - Manual and automatic checkpoint management
- Extensible Storage - JSON files with DragonflyDB caching (future: remote sync)
Quick Start
# Add to PATH (optional)
export PATH="/opt/agent-governance/bin:$PATH"
# Create a checkpoint
checkpoint now --notes "Starting task X"
# Load latest checkpoint
checkpoint load
# Compare with previous
checkpoint diff
# Get compact context summary
checkpoint summary
# List all checkpoints
checkpoint list
Commands
checkpoint now
Create a new checkpoint capturing current state.
checkpoint now # Basic checkpoint
checkpoint now --notes "Phase 5 start" # With notes
checkpoint now --var key value # With variables
checkpoint now --var a 1 --var b 2 # Multiple variables
checkpoint now --json # Output JSON
Captures:
- Current phase (from implementation plan)
- Task states (from governance DB)
- Dependency status (Vault, DragonflyDB, Ledger)
- Custom variables
- Recent evidence outputs
- Agent ID and tier
checkpoint load
Load a checkpoint for review or restoration.
checkpoint load # Load latest
checkpoint load ckpt-20260123-120000-abc # Load specific
checkpoint load --json # Output JSON
checkpoint diff
Compare two checkpoints to see what changed.
checkpoint diff # Latest vs previous
checkpoint diff --from ckpt-A --to ckpt-B # Specific comparison
checkpoint diff --json # Output JSON
Detects:
- Phase changes
- Task additions/removals/status changes
- Dependency status changes
- Variable additions/changes/removals
checkpoint list
List available checkpoints.
checkpoint list # Last 20
checkpoint list --limit 50 # Custom limit
checkpoint list --json # Output JSON
checkpoint summary
Generate context summary for review or sub-agent injection.
checkpoint summary # Compact (default)
checkpoint summary --level minimal # ~500 tokens
checkpoint summary --level compact # ~1000 tokens
checkpoint summary --level standard # ~2000 tokens
checkpoint summary --level full # ~4000 tokens
checkpoint summary --for terraform # Task-specific
checkpoint summary --for governance # Governance context
checkpoint prune
Remove old checkpoints.
checkpoint prune # Keep default (50)
checkpoint prune --keep 10 # Keep only 10
Token-Aware Sub-Agent Context
When orchestrating sub-agents, use the summarizer to minimize tokens while preserving essential context:
from checkpoint import CheckpointManager, ContextSummarizer
# Get latest checkpoint
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()
# Create task-specific summary
summarizer = ContextSummarizer(checkpoint)
# For infrastructure tasks (~1000 tokens)
context = summarizer.for_subagent("terraform", max_tokens=1000)
# For governance tasks with specific variables
context = summarizer.for_subagent(
"promotion",
relevant_keys=["agent_id", "current_tier"],
max_tokens=500
)
# Pass context to sub-agent
subagent.run(context + "\n\n" + task_prompt)
Summary Levels
| Level | Tokens | Contents |
|---|---|---|
minimal |
~500 | Phase, active task, agent, available deps |
compact |
~1000 | + In-progress tasks, pending count, key vars |
standard |
~2000 | + All tasks, all deps, recent outputs |
full |
~4000 | + All variables, completed phases, metadata |
Task-Specific Contexts
| Task Type | Included Context |
|---|---|
terraform, ansible, infrastructure |
Infrastructure dependencies, service status |
database, query, ledger |
Database connections, endpoints |
promotion, revocation, governance |
Agent tier, governance variables |
Auto-Checkpoint Events
Checkpoints can be automatically created by integrating with governance hooks:
# In your agent code
from checkpoint import CheckpointManager
manager = CheckpointManager()
# Auto-checkpoint on phase transitions
def on_phase_complete(phase_num):
manager.create_checkpoint(
notes=f"Phase {phase_num} complete",
variables={"completed_phase": phase_num}
)
# Auto-checkpoint on task completion
def on_task_complete(task_id, result):
manager.create_checkpoint(
variables={"last_task": task_id, "result": result}
)
# Auto-checkpoint on error
def on_error(error_type, message):
manager.create_checkpoint(
notes=f"Error: {error_type}",
variables={"error_type": error_type, "error_msg": message}
)
Restoring After Restart
After a CLI restart or token window reset:
# 1. Load the latest checkpoint
checkpoint load
# 2. Review what was happening
checkpoint summary --level full
# 3. If needed, compare with earlier state
checkpoint diff
# 4. Resume work with context
checkpoint summary --level compact > /tmp/context.txt
# Use context.txt as preamble for new session
Programmatic Restoration
from checkpoint import CheckpointManager, ContextSummarizer
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()
if checkpoint:
# Restore environment variables
for key, value in checkpoint.variables.items():
os.environ[f"CKPT_{key.upper()}"] = str(value)
# Get context for continuation
summarizer = ContextSummarizer(checkpoint)
context = summarizer.standard_summary()
print(f"Restored from: {checkpoint.checkpoint_id}")
print(f"Phase: {checkpoint.phase.name if checkpoint.phase else 'Unknown'}")
print(f"Tasks: {len(checkpoint.tasks)}")
Storage Format
Checkpoint JSON Structure
{
"checkpoint_id": "ckpt-20260123-120000-abc12345",
"created_at": "2026-01-23T12:00:00.000000+00:00",
"session_id": "optional-session-id",
"phase": {
"name": "Phase 5: Agent Bootstrapping",
"number": 5,
"status": "in_progress",
"started_at": "2026-01-23T10:00:00+00:00",
"notes": "Working on checkpoint skill"
},
"phases_completed": [1, 2, 3, 4],
"tasks": [
{
"id": "1",
"subject": "Create checkpoint module",
"status": "completed",
"owner": null,
"blocks": [],
"blocked_by": []
}
],
"active_task_id": "2",
"dependencies": [
{
"name": "vault",
"type": "service",
"status": "available",
"endpoint": "https://127.0.0.1:8200",
"last_checked": "2026-01-23T12:00:00+00:00"
}
],
"variables": {
"custom_key": "custom_value"
},
"recent_outputs": [
{
"type": "evidence",
"id": "evd-20260123-...",
"action": "terraform",
"success": true,
"timestamp": "2026-01-23T11:30:00+00:00"
}
],
"agent_id": "my-agent",
"agent_tier": 1,
"content_hash": "abc123...",
"parent_checkpoint_id": "ckpt-20260123-110000-...",
"estimated_tokens": 450
}
File Locations
/opt/agent-governance/checkpoint/
├── checkpoint.py # Core module
├── README.md # This documentation
├── storage/ # Checkpoint JSON files
│ ├── ckpt-20260123-120000-abc.json
│ ├── ckpt-20260123-110000-def.json
│ └── ...
└── templates/ # Future: checkpoint templates
/opt/agent-governance/bin/
└── checkpoint # CLI wrapper
Extensibility
Adding Custom State Collectors
from checkpoint import CheckpointManager
class MyCheckpointManager(CheckpointManager):
def collect_my_state(self) -> dict:
# Custom state collection
return {"my_data": "..."}
def create_checkpoint(self, **kwargs):
# Add custom state to variables
custom_vars = kwargs.get("variables", {})
custom_vars.update(self.collect_my_state())
kwargs["variables"] = custom_vars
return super().create_checkpoint(**kwargs)
Remote Storage (Future)
# Planned: S3/remote sync
class RemoteCheckpointManager(CheckpointManager):
def __init__(self, s3_bucket: str):
super().__init__()
self.s3_bucket = s3_bucket
def save_checkpoint(self, checkpoint):
# Save locally
local_path = super().save_checkpoint(checkpoint)
# Sync to S3
self._upload_to_s3(local_path)
return local_path
def sync_from_remote(self):
# Download checkpoints from S3
pass
Integration with Agent Governance
The checkpoint skill integrates with the existing governance system:
┌─────────────────────────────────────────────────────────────┐
│ Agent Governance │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Preflight │ │ Wrappers │ │ Evidence │ │Checkpoint│ │
│ │ Gate │ │tf/ansible│ │ Package │ │ Skill │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ v v v v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ DragonflyDB │ │
│ │ agent:* states | checkpoint:* | revocations:ledger │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ v v v v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ SQLite Ledger │ │
│ │ agent_actions | violations | promotions | tasks │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Best Practices
-
Checkpoint Before Risky Operations
checkpoint now --notes "Before production deployment" -
Include Relevant Variables
checkpoint now --var target_env production --var rollback_id v1.2.3 -
Use Task-Specific Summaries for Sub-Agents
checkpoint summary --for terraform > context.txt -
Review Diffs After Long Operations
checkpoint diff # What changed? -
Prune Regularly in Long-Running Sessions
checkpoint prune --keep 20
Troubleshooting
"No checkpoint found"
Create one first:
checkpoint now
High token estimates
Use more aggressive summarization:
checkpoint summary --level minimal
Missing dependencies
Check services:
docker exec vault vault status
redis-cli -p 6379 -a governance2026 PING
Stale checkpoints
Prune and recreate:
checkpoint prune --keep 5
checkpoint now --notes "Fresh start"