# Context Checkpoint Skill

**Phase 5: Agent Bootstrapping**

A context preservation system that helps maintain session state across token window resets, CLI restarts, and sub-agent orchestration.

## Overview

The checkpoint skill provides:

1. **Periodic State Capture** - Captures phase, tasks, dependencies, variables, and outputs
2. **Token-Aware Summarization** - Creates minimal context summaries for sub-agent calls
3. **CLI Integration** - Manual and automatic checkpoint management
4. **Extensible Storage** - JSON files with DragonflyDB caching (future: remote sync)

## Quick Start

```bash
# Add to PATH (optional)
export PATH="/opt/agent-governance/bin:$PATH"

# Create a checkpoint
checkpoint now --notes "Starting task X"

# Load latest checkpoint
checkpoint load

# Compare with previous
checkpoint diff

# Get compact context summary
checkpoint summary

# List all checkpoints
checkpoint list
```

## Commands

### `checkpoint now`

Create a new checkpoint capturing current state.

```bash
checkpoint now                           # Basic checkpoint
checkpoint now --notes "Phase 5 start"   # With notes
checkpoint now --var key value           # With variables
checkpoint now --var a 1 --var b 2       # Multiple variables
checkpoint now --json                    # Output JSON
```

**Captures:**
- Current phase (from implementation plan)
- Task states (from governance DB)
- Dependency status (Vault, DragonflyDB, Ledger)
- Custom variables
- Recent evidence outputs
- Agent ID and tier

### `checkpoint load`

Load a checkpoint for review or restoration.

```bash
checkpoint load                            # Load latest
checkpoint load ckpt-20260123-120000-abc   # Load specific
checkpoint load --json                     # Output JSON
```

### `checkpoint diff`

Compare two checkpoints to see what changed.

```bash
checkpoint diff                             # Latest vs previous
checkpoint diff --from ckpt-A --to ckpt-B   # Specific comparison
checkpoint diff --json                      # Output JSON
```

**Detects:**
- Phase changes
- Task additions/removals/status changes
- Dependency status changes
- Variable additions/changes/removals

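
The variable part of a checkpoint comparison can be pictured as a dictionary diff. The sketch below is illustrative only: `diff_variables` is a hypothetical helper, not a function of the checkpoint module, and it assumes each checkpoint has been loaded as a plain dict with a `variables` mapping.

```python
def diff_variables(old: dict, new: dict) -> dict:
    """Classify variable changes between two checkpoint dicts (hypothetical helper)."""
    old_vars = old.get("variables", {})
    new_vars = new.get("variables", {})
    return {
        "added": sorted(k for k in new_vars if k not in old_vars),
        "removed": sorted(k for k in old_vars if k not in new_vars),
        "changed": sorted(
            k for k in new_vars if k in old_vars and old_vars[k] != new_vars[k]
        ),
    }


# Two made-up checkpoints, reduced to the field being compared.
previous = {"variables": {"target_env": "staging", "rollback_id": "v1.2.3"}}
latest = {"variables": {"target_env": "production", "attempt": 2}}

print(diff_variables(previous, latest))
```

The same added/removed/changed split would apply to tasks and dependencies if they were keyed by `id` or `name` first.
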
### `checkpoint list`

List available checkpoints.

```bash
checkpoint list              # Last 20
checkpoint list --limit 50   # Custom limit
checkpoint list --json       # Output JSON
```

### `checkpoint summary`

Generate context summary for review or sub-agent injection.

```bash
checkpoint summary                    # Compact (default)
checkpoint summary --level minimal    # ~500 tokens
checkpoint summary --level compact    # ~1000 tokens
checkpoint summary --level standard   # ~2000 tokens
checkpoint summary --level full       # ~4000 tokens
checkpoint summary --for terraform    # Task-specific
checkpoint summary --for governance   # Governance context
```

### `checkpoint prune`

Remove old checkpoints.

```bash
checkpoint prune             # Keep default (50)
checkpoint prune --keep 10   # Keep only 10
```

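
To make the retention rule concrete, here is a minimal sketch of "keep the newest N" applied to a directory of checkpoint files. It is an assumption about how pruning could work, not the CLI's actual code: `prune_checkpoints` is a hypothetical name, and ordering by file name relies on the timestamp embedded in the `ckpt-YYYYMMDD-HHMMSS-...` ID.

```python
from pathlib import Path
import tempfile


def prune_checkpoints(storage_dir: Path, keep: int = 50) -> list:
    """Delete all but the `keep` newest ckpt-*.json files; return the removed names."""
    keep = max(keep, 0)
    # IDs embed a YYYYMMDD-HHMMSS timestamp, so name order is chronological.
    files = sorted(storage_dir.glob("ckpt-*.json"), key=lambda p: p.name, reverse=True)
    removed = []
    for stale in files[keep:]:
        stale.unlink()
        removed.append(stale.name)
    return removed


# Demonstrate against a throwaway directory rather than real storage.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)
    for stamp in ("100000", "110000", "120000"):
        (storage / f"ckpt-20260123-{stamp}-demo.json").write_text("{}")
    removed = prune_checkpoints(storage, keep=2)
    remaining = sorted(p.name for p in storage.glob("ckpt-*.json"))

print(removed)
print(remaining)
```
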
## Token-Aware Sub-Agent Context

When orchestrating sub-agents, use the summarizer to minimize tokens while preserving essential context:

```python
from checkpoint import CheckpointManager, ContextSummarizer

# Get latest checkpoint
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

# Create task-specific summary
summarizer = ContextSummarizer(checkpoint)

# For infrastructure tasks (~1000 tokens)
context = summarizer.for_subagent("terraform", max_tokens=1000)

# For governance tasks with specific variables
context = summarizer.for_subagent(
    "promotion",
    relevant_keys=["agent_id", "current_tier"],
    max_tokens=500
)

# Pass context to sub-agent
# (`subagent` and `task_prompt` come from your own orchestration code)
subagent.run(context + "\n\n" + task_prompt)
```

### Summary Levels

| Level | Tokens | Contents |
|-------|--------|----------|
| `minimal` | ~500 | Phase, active task, agent, available deps |
| `compact` | ~1000 | + In-progress tasks, pending count, key vars |
| `standard` | ~2000 | + All tasks, all deps, recent outputs |
| `full` | ~4000 | + All variables, completed phases, metadata |

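
When the remaining token budget is known, a level can be chosen mechanically from the approximate sizes in the table. The following is a sketch under that assumption; `LEVEL_BUDGETS` and `pick_level` are hypothetical names, not part of the checkpoint module, and the numbers are the rough figures listed above rather than measured values.

```python
# Approximate sizes taken from the Summary Levels table, smallest first.
LEVEL_BUDGETS = {"minimal": 500, "compact": 1000, "standard": 2000, "full": 4000}


def pick_level(available_tokens: int) -> str:
    """Return the richest level whose approximate size fits the budget."""
    best = "minimal"  # fall back to the smallest level if nothing fits
    for level, budget in LEVEL_BUDGETS.items():  # dicts preserve insertion order
        if budget <= available_tokens:
            best = level
    return best


print(pick_level(1500))
print(pick_level(300))
```

The result can then be passed to `checkpoint summary --level <name>`.
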
### Task-Specific Contexts

| Task Type | Included Context |
|-----------|-----------------|
| `terraform`, `ansible`, `infrastructure` | Infrastructure dependencies, service status |
| `database`, `query`, `ledger` | Database connections, endpoints |
| `promotion`, `revocation`, `governance` | Agent tier, governance variables |

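
One way to read the table is as a lookup from task type to a context group. The sketch below only mirrors the table; the group labels (`infrastructure`, `database`, `governance`) and the `context_group` function are illustrative assumptions, and how the summarizer handles an unlisted task type is not documented here.

```python
from typing import Optional

# Task types from the table, grouped under illustrative labels.
TASK_CONTEXT_GROUPS = {
    "infrastructure": {"terraform", "ansible", "infrastructure"},
    "database": {"database", "query", "ledger"},
    "governance": {"promotion", "revocation", "governance"},
}


def context_group(task_type: str) -> Optional[str]:
    """Return the group a task type belongs to, or None if it is not listed."""
    normalized = task_type.strip().lower()
    for group, task_types in TASK_CONTEXT_GROUPS.items():
        if normalized in task_types:
            return group
    return None


print(context_group("ansible"))
print(context_group("deploy"))
```
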
## Auto-Checkpoint Events

Checkpoints can be automatically created by integrating with governance hooks:

```python
# In your agent code
from checkpoint import CheckpointManager

manager = CheckpointManager()

# Auto-checkpoint on phase transitions
def on_phase_complete(phase_num):
    manager.create_checkpoint(
        notes=f"Phase {phase_num} complete",
        variables={"completed_phase": phase_num}
    )

# Auto-checkpoint on task completion
def on_task_complete(task_id, result):
    manager.create_checkpoint(
        variables={"last_task": task_id, "result": result}
    )

# Auto-checkpoint on error
def on_error(error_type, message):
    manager.create_checkpoint(
        notes=f"Error: {error_type}",
        variables={"error_type": error_type, "error_msg": message}
    )
```

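
If wiring individual hooks is inconvenient, the error case can also be wrapped around a block of work. This is a sketch, not part of the checkpoint module: `checkpoint_on_error` is a hypothetical helper, and `RecordingManager` is a stand-in that only mimics the `create_checkpoint(notes=..., variables=...)` call used above so the example runs on its own.

```python
from contextlib import contextmanager


class RecordingManager:
    """Stand-in for CheckpointManager so the sketch runs without the module."""

    def __init__(self):
        self.calls = []

    def create_checkpoint(self, notes=None, variables=None):
        self.calls.append({"notes": notes, "variables": variables or {}})


@contextmanager
def checkpoint_on_error(manager, step):
    """Create a checkpoint if the wrapped block raises, then re-raise."""
    try:
        yield
    except Exception as exc:
        manager.create_checkpoint(
            notes=f"Error: {type(exc).__name__}",
            variables={
                "error_type": type(exc).__name__,
                "error_msg": str(exc),
                "step": step,
            },
        )
        raise


manager = RecordingManager()
try:
    with checkpoint_on_error(manager, "apply"):
        raise RuntimeError("plan failed")
except RuntimeError:
    pass  # the error still propagates; the checkpoint is a side effect

print(manager.calls[0]["notes"])
```

In real use, a `CheckpointManager` instance would be passed in place of the stand-in.
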
## Restoring After Restart

After a CLI restart or token window reset:

```bash
# 1. Load the latest checkpoint
checkpoint load

# 2. Review what was happening
checkpoint summary --level full

# 3. If needed, compare with earlier state
checkpoint diff

# 4. Resume work with context
checkpoint summary --level compact > /tmp/context.txt
# Use context.txt as preamble for new session
```

### Programmatic Restoration

```python
import os

from checkpoint import CheckpointManager, ContextSummarizer

manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

if checkpoint:
    # Restore environment variables
    for key, value in checkpoint.variables.items():
        os.environ[f"CKPT_{key.upper()}"] = str(value)

    # Get context for continuation
    summarizer = ContextSummarizer(checkpoint)
    context = summarizer.standard_summary()

    print(f"Restored from: {checkpoint.checkpoint_id}")
    print(f"Phase: {checkpoint.phase.name if checkpoint.phase else 'Unknown'}")
    print(f"Tasks: {len(checkpoint.tasks)}")
```

## Storage Format

### Checkpoint JSON Structure

```json
{
  "checkpoint_id": "ckpt-20260123-120000-abc12345",
  "created_at": "2026-01-23T12:00:00.000000+00:00",
  "session_id": "optional-session-id",

  "phase": {
    "name": "Phase 5: Agent Bootstrapping",
    "number": 5,
    "status": "in_progress",
    "started_at": "2026-01-23T10:00:00+00:00",
    "notes": "Working on checkpoint skill"
  },
  "phases_completed": [1, 2, 3, 4],

  "tasks": [
    {
      "id": "1",
      "subject": "Create checkpoint module",
      "status": "completed",
      "owner": null,
      "blocks": [],
      "blocked_by": []
    }
  ],
  "active_task_id": "2",

  "dependencies": [
    {
      "name": "vault",
      "type": "service",
      "status": "available",
      "endpoint": "https://127.0.0.1:8200",
      "last_checked": "2026-01-23T12:00:00+00:00"
    }
  ],

  "variables": {
    "custom_key": "custom_value"
  },

  "recent_outputs": [
    {
      "type": "evidence",
      "id": "evd-20260123-...",
      "action": "terraform",
      "success": true,
      "timestamp": "2026-01-23T11:30:00+00:00"
    }
  ],

  "agent_id": "my-agent",
  "agent_tier": 1,

  "content_hash": "abc123...",
  "parent_checkpoint_id": "ckpt-20260123-110000-...",
  "estimated_tokens": 450
}
```

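
The example ID appears to follow a `ckpt-YYYYMMDD-HHMMSS-suffix` pattern whose timestamp matches `created_at`. Assuming that pattern holds and that the embedded time is UTC (both are inferences from the example, not documented guarantees), an ID can be split as sketched below. `parse_checkpoint_id` is a hypothetical helper.

```python
from datetime import datetime, timezone


def parse_checkpoint_id(checkpoint_id: str) -> tuple:
    """Split 'ckpt-YYYYMMDD-HHMMSS-suffix' into (UTC datetime, suffix)."""
    parts = checkpoint_id.split("-", 3)
    if len(parts) != 4 or parts[0] != "ckpt":
        raise ValueError(f"unexpected checkpoint ID format: {checkpoint_id!r}")
    _, date_part, time_part, suffix = parts
    # Assumption: the timestamp in the ID is expressed in UTC.
    created = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return created.replace(tzinfo=timezone.utc), suffix


created, suffix = parse_checkpoint_id("ckpt-20260123-120000-abc12345")
print(created.isoformat(), suffix)
```
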
### File Locations

```
/opt/agent-governance/checkpoint/
├── checkpoint.py           # Core module
├── README.md               # This documentation
├── storage/                # Checkpoint JSON files
│   ├── ckpt-20260123-120000-abc.json
│   ├── ckpt-20260123-110000-def.json
│   └── ...
└── templates/              # Future: checkpoint templates

/opt/agent-governance/bin/
└── checkpoint              # CLI wrapper
```

## Extensibility

### Adding Custom State Collectors

```python
from checkpoint import CheckpointManager

class MyCheckpointManager(CheckpointManager):

    def collect_my_state(self) -> dict:
        # Custom state collection
        return {"my_data": "..."}

    def create_checkpoint(self, **kwargs):
        # Add custom state to variables (copy so the caller's dict is not mutated)
        custom_vars = dict(kwargs.get("variables") or {})
        custom_vars.update(self.collect_my_state())
        kwargs["variables"] = custom_vars

        return super().create_checkpoint(**kwargs)
```

### Remote Storage (Future)

```python
# Planned: S3/remote sync
class RemoteCheckpointManager(CheckpointManager):

    def __init__(self, s3_bucket: str):
        super().__init__()
        self.s3_bucket = s3_bucket

    def save_checkpoint(self, checkpoint):
        # Save locally
        local_path = super().save_checkpoint(checkpoint)

        # Sync to S3
        self._upload_to_s3(local_path)

        return local_path

    def sync_from_remote(self):
        # Download checkpoints from S3
        pass
```

## Integration with Agent Governance

The checkpoint skill integrates with the existing governance system:

```
┌────────────────────────────────────────────────────────────┐
│                      Agent Governance                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│   │Preflight │  │ Wrappers │  │ Evidence │  │Checkpoint│   │
│   │   Gate   │  │tf/ansible│  │ Package  │  │  Skill   │   │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│        │             │             │             │         │
│        v             v             v             v         │
│   ┌────────────────────────────────────────────────────┐   │
│   │                    DragonflyDB                     │   │
│   │ agent:* states | checkpoint:* | revocations:ledger │   │
│   └────────────────────────────────────────────────────┘   │
│        │             │             │             │         │
│        v             v             v             v         │
│   ┌────────────────────────────────────────────────────┐   │
│   │                   SQLite Ledger                    │   │
│   │  agent_actions | violations | promotions | tasks   │   │
│   └────────────────────────────────────────────────────┘   │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

## Best Practices

1. **Checkpoint Before Risky Operations**
   ```bash
   checkpoint now --notes "Before production deployment"
   ```

2. **Include Relevant Variables**
   ```bash
   checkpoint now --var target_env production --var rollback_id v1.2.3
   ```

3. **Use Task-Specific Summaries for Sub-Agents**
   ```bash
   checkpoint summary --for terraform > context.txt
   ```

4. **Review Diffs After Long Operations**
   ```bash
   checkpoint diff  # What changed?
   ```

5. **Prune Regularly in Long-Running Sessions**
   ```bash
   checkpoint prune --keep 20
   ```

## Troubleshooting

### "No checkpoint found"

Create one first:
```bash
checkpoint now
```

### High token estimates

Use more aggressive summarization:
```bash
checkpoint summary --level minimal
```

### Missing dependencies

Check services:
```bash
docker exec vault vault status
redis-cli -p 6379 -a governance2026 PING
```

### Stale checkpoints

Prune and recreate:
```bash
checkpoint prune --keep 5
checkpoint now --notes "Fresh start"
```