profit 77655c298c Initial commit: Agent Governance System Phase 8
Phase 8 Production Hardening with complete governance infrastructure:

- Vault integration with tiered policies (T0-T4)
- DragonflyDB state management
- SQLite audit ledger
- Pipeline DSL and templates
- Promotion/revocation engine
- Checkpoint system for session persistence
- Health manager and circuit breaker for fault tolerance
- GitHub/Slack integrations
- Architectural test pipeline with bug watcher, suggestion engine, council review
- Multi-agent chaos testing framework

Test Results:
- Governance tests: 68/68 passing
- E2E workflow: 16/16 passing
- Phase 2 Vault: 14/14 passing
- Integration tests: 27/27 passing

Coverage: 57.6% average across 12 phases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 22:07:06 -05:00

449 lines
12 KiB
Markdown

# Context Checkpoint Skill
**Phase 5: Agent Bootstrapping**
A context preservation system that helps maintain session state across token window resets, CLI restarts, and sub-agent orchestration.
## Overview
The checkpoint skill provides:
1. **Periodic State Capture** - Captures phase, tasks, dependencies, variables, and outputs
2. **Token-Aware Summarization** - Creates minimal context summaries for sub-agent calls
3. **CLI Integration** - Manual and automatic checkpoint management
4. **Extensible Storage** - JSON files with DragonflyDB caching (future: remote sync)
## Quick Start
```bash
# Add to PATH (optional)
export PATH="/opt/agent-governance/bin:$PATH"
# Create a checkpoint
checkpoint now --notes "Starting task X"
# Load latest checkpoint
checkpoint load
# Compare with previous
checkpoint diff
# Get compact context summary
checkpoint summary
# List all checkpoints
checkpoint list
```
## Commands
### `checkpoint now`
Create a new checkpoint capturing current state.
```bash
checkpoint now # Basic checkpoint
checkpoint now --notes "Phase 5 start" # With notes
checkpoint now --var key value # With variables
checkpoint now --var a 1 --var b 2 # Multiple variables
checkpoint now --json # Output JSON
```
**Captures:**
- Current phase (from implementation plan)
- Task states (from governance DB)
- Dependency status (Vault, DragonflyDB, Ledger)
- Custom variables
- Recent evidence outputs
- Agent ID and tier
### `checkpoint load`
Load a checkpoint for review or restoration.
```bash
checkpoint load # Load latest
checkpoint load ckpt-20260123-120000-abc # Load specific
checkpoint load --json # Output JSON
```
### `checkpoint diff`
Compare two checkpoints to see what changed.
```bash
checkpoint diff # Latest vs previous
checkpoint diff --from ckpt-A --to ckpt-B # Specific comparison
checkpoint diff --json # Output JSON
```
**Detects:**
- Phase changes
- Task additions/removals/status changes
- Dependency status changes
- Variable additions/changes/removals
### `checkpoint list`
List available checkpoints.
```bash
checkpoint list # Last 20
checkpoint list --limit 50 # Custom limit
checkpoint list --json # Output JSON
```
### `checkpoint summary`
Generate context summary for review or sub-agent injection.
```bash
checkpoint summary # Compact (default)
checkpoint summary --level minimal # ~500 tokens
checkpoint summary --level compact # ~1000 tokens
checkpoint summary --level standard # ~2000 tokens
checkpoint summary --level full # ~4000 tokens
checkpoint summary --for terraform # Task-specific
checkpoint summary --for governance # Governance context
```
### `checkpoint prune`
Remove old checkpoints.
```bash
checkpoint prune # Keep default (50)
checkpoint prune --keep 10 # Keep only 10
```
## Token-Aware Sub-Agent Context
When orchestrating sub-agents, use the summarizer to minimize tokens while preserving essential context:
```python
from checkpoint import CheckpointManager, ContextSummarizer
# Get latest checkpoint
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()
# Create task-specific summary
summarizer = ContextSummarizer(checkpoint)
# For infrastructure tasks (~1000 tokens)
context = summarizer.for_subagent("terraform", max_tokens=1000)
# For governance tasks with specific variables
context = summarizer.for_subagent(
"promotion",
relevant_keys=["agent_id", "current_tier"],
max_tokens=500
)
# Pass context to sub-agent
subagent.run(context + "\n\n" + task_prompt)
```
### Summary Levels
| Level | Tokens | Contents |
|-------|--------|----------|
| `minimal` | ~500 | Phase, active task, agent, available deps |
| `compact` | ~1000 | + In-progress tasks, pending count, key vars |
| `standard` | ~2000 | + All tasks, all deps, recent outputs |
| `full` | ~4000 | + All variables, completed phases, metadata |
### Task-Specific Contexts
| Task Type | Included Context |
|-----------|-----------------|
| `terraform`, `ansible`, `infrastructure` | Infrastructure dependencies, service status |
| `database`, `query`, `ledger` | Database connections, endpoints |
| `promotion`, `revocation`, `governance` | Agent tier, governance variables |
## Auto-Checkpoint Events
Checkpoints can be automatically created by integrating with governance hooks:
```python
# In your agent code
from checkpoint import CheckpointManager
manager = CheckpointManager()
# Auto-checkpoint on phase transitions
def on_phase_complete(phase_num):
manager.create_checkpoint(
notes=f"Phase {phase_num} complete",
variables={"completed_phase": phase_num}
)
# Auto-checkpoint on task completion
def on_task_complete(task_id, result):
manager.create_checkpoint(
variables={"last_task": task_id, "result": result}
)
# Auto-checkpoint on error
def on_error(error_type, message):
manager.create_checkpoint(
notes=f"Error: {error_type}",
variables={"error_type": error_type, "error_msg": message}
)
```
## Restoring After Restart
After a CLI restart or token window reset:
```bash
# 1. Load the latest checkpoint
checkpoint load
# 2. Review what was happening
checkpoint summary --level full
# 3. If needed, compare with earlier state
checkpoint diff
# 4. Resume work with context
checkpoint summary --level compact > /tmp/context.txt
# Use context.txt as preamble for new session
```
### Programmatic Restoration
```python
from checkpoint import CheckpointManager, ContextSummarizer
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()
if checkpoint:
# Restore environment variables
for key, value in checkpoint.variables.items():
os.environ[f"CKPT_{key.upper()}"] = str(value)
# Get context for continuation
summarizer = ContextSummarizer(checkpoint)
context = summarizer.standard_summary()
print(f"Restored from: {checkpoint.checkpoint_id}")
print(f"Phase: {checkpoint.phase.name if checkpoint.phase else 'Unknown'}")
print(f"Tasks: {len(checkpoint.tasks)}")
```
## Storage Format
### Checkpoint JSON Structure
```json
{
"checkpoint_id": "ckpt-20260123-120000-abc12345",
"created_at": "2026-01-23T12:00:00.000000+00:00",
"session_id": "optional-session-id",
"phase": {
"name": "Phase 5: Agent Bootstrapping",
"number": 5,
"status": "in_progress",
"started_at": "2026-01-23T10:00:00+00:00",
"notes": "Working on checkpoint skill"
},
"phases_completed": [1, 2, 3, 4],
"tasks": [
{
"id": "1",
"subject": "Create checkpoint module",
"status": "completed",
"owner": null,
"blocks": [],
"blocked_by": []
}
],
"active_task_id": "2",
"dependencies": [
{
"name": "vault",
"type": "service",
"status": "available",
"endpoint": "https://127.0.0.1:8200",
"last_checked": "2026-01-23T12:00:00+00:00"
}
],
"variables": {
"custom_key": "custom_value"
},
"recent_outputs": [
{
"type": "evidence",
"id": "evd-20260123-...",
"action": "terraform",
"success": true,
"timestamp": "2026-01-23T11:30:00+00:00"
}
],
"agent_id": "my-agent",
"agent_tier": 1,
"content_hash": "abc123...",
"parent_checkpoint_id": "ckpt-20260123-110000-...",
"estimated_tokens": 450
}
```
### File Locations
```
/opt/agent-governance/checkpoint/
├── checkpoint.py # Core module
├── README.md # This documentation
├── storage/ # Checkpoint JSON files
│ ├── ckpt-20260123-120000-abc.json
│ ├── ckpt-20260123-110000-def.json
│ └── ...
└── templates/ # Future: checkpoint templates
/opt/agent-governance/bin/
└── checkpoint # CLI wrapper
```
## Extensibility
### Adding Custom State Collectors
```python
from checkpoint import CheckpointManager
class MyCheckpointManager(CheckpointManager):
def collect_my_state(self) -> dict:
# Custom state collection
return {"my_data": "..."}
def create_checkpoint(self, **kwargs):
# Add custom state to variables
custom_vars = kwargs.get("variables", {})
custom_vars.update(self.collect_my_state())
kwargs["variables"] = custom_vars
return super().create_checkpoint(**kwargs)
```
### Remote Storage (Future)
```python
# Planned: S3/remote sync
class RemoteCheckpointManager(CheckpointManager):
def __init__(self, s3_bucket: str):
super().__init__()
self.s3_bucket = s3_bucket
def save_checkpoint(self, checkpoint):
# Save locally
local_path = super().save_checkpoint(checkpoint)
# Sync to S3
self._upload_to_s3(local_path)
return local_path
def sync_from_remote(self):
# Download checkpoints from S3
pass
```
## Integration with Agent Governance
The checkpoint skill integrates with the existing governance system:
```
┌─────────────────────────────────────────────────────────────┐
│ Agent Governance │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Preflight │ │ Wrappers │ │ Evidence │ │Checkpoint│ │
│ │ Gate │ │tf/ansible│ │ Package │ │ Skill │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ v v v v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ DragonflyDB │ │
│ │ agent:* states | checkpoint:* | revocations:ledger │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ v v v v │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ SQLite Ledger │ │
│ │ agent_actions | violations | promotions | tasks │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
## Best Practices
1. **Checkpoint Before Risky Operations**
```bash
checkpoint now --notes "Before production deployment"
```
2. **Include Relevant Variables**
```bash
checkpoint now --var target_env production --var rollback_id v1.2.3
```
3. **Use Task-Specific Summaries for Sub-Agents**
```bash
checkpoint summary --for terraform > context.txt
```
4. **Review Diffs After Long Operations**
```bash
checkpoint diff # What changed?
```
5. **Prune Regularly in Long-Running Sessions**
```bash
checkpoint prune --keep 20
```
## Troubleshooting
### "No checkpoint found"
Create one first:
```bash
checkpoint now
```
### High token estimates
Use more aggressive summarization:
```bash
checkpoint summary --level minimal
```
### Missing dependencies
Check services:
```bash
docker exec vault vault status
redis-cli -p 6379 -a governance2026 PING
```
### Stale checkpoints
Prune and recreate:
```bash
checkpoint prune --keep 5
checkpoint now --notes "Fresh start"
```