
# Context Checkpoint Skill

**Phase 5: Agent Bootstrapping**

A context-preservation system that maintains session state across token-window resets, CLI restarts, and sub-agent orchestration.

## Overview

The checkpoint skill provides:

1. **Periodic State Capture** - Captures phase, tasks, dependencies, variables, and outputs
2. **Token-Aware Summarization** - Creates minimal context summaries for sub-agent calls
3. **CLI Integration** - Manual and automatic checkpoint management
4. **Extensible Storage** - JSON files with DragonflyDB caching (future: remote sync)

## Quick Start

```bash
# Add to PATH (optional)
export PATH="/opt/agent-governance/bin:$PATH"

# Create a checkpoint
checkpoint now --notes "Starting task X"

# Load latest checkpoint
checkpoint load

# Compare with previous
checkpoint diff

# Get compact context summary
checkpoint summary

# List all checkpoints
checkpoint list
```

## Commands

### `checkpoint now`

Create a new checkpoint capturing current state.

```bash
checkpoint now                           # Basic checkpoint
checkpoint now --notes "Phase 5 start"   # With notes
checkpoint now --var key value           # With variables
checkpoint now --var a 1 --var b 2       # Multiple variables
checkpoint now --json                    # Output JSON
```

**Captures:**
- Current phase (from implementation plan)
- Task states (from governance DB)
- Dependency status (Vault, DragonflyDB, Ledger)
- Custom variables
- Recent evidence outputs
- Agent ID and tier

### `checkpoint load`

Load a checkpoint for review or restoration.

```bash
checkpoint load                          # Load latest
checkpoint load ckpt-20260123-120000-abc # Load specific
checkpoint load --json                   # Output JSON
```

### `checkpoint diff`

Compare two checkpoints to see what changed.

```bash
checkpoint diff                          # Latest vs previous
checkpoint diff --from ckpt-A --to ckpt-B  # Specific comparison
checkpoint diff --json                   # Output JSON
```

**Detects:**
- Phase changes
- Task additions/removals/status changes
- Dependency status changes
- Variable additions/changes/removals
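
The categories above can be illustrated with a small sketch that compares two checkpoints held as plain dictionaries in the JSON layout shown under Storage Format. The helper names `diff_maps` and `task_statuses` are hypothetical and are not taken from `checkpoint.py`; the real `checkpoint diff` may work differently.

```python
def diff_maps(old: dict, new: dict) -> dict:
    """Classify keys as added, removed, or changed between two mappings."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {
            k: (old[k], new[k])
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }


def task_statuses(checkpoint: dict) -> dict:
    """Index a checkpoint's tasks by id so status changes can be diffed."""
    return {task["id"]: task["status"] for task in checkpoint.get("tasks", [])}


# Two illustrative checkpoints (sample data, not real output).
previous = {
    "variables": {"target_env": "staging", "rollback_id": "v1.2.2"},
    "tasks": [{"id": "1", "status": "in_progress"}],
}
latest = {
    "variables": {"target_env": "production"},
    "tasks": [{"id": "1", "status": "completed"}, {"id": "2", "status": "pending"}],
}

variable_diff = diff_maps(previous["variables"], latest["variables"])
task_diff = diff_maps(task_statuses(previous), task_statuses(latest))
```

The same helper covers both variables and task statuses because each reduces to a key-to-value mapping once tasks are indexed by `id`.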

### `checkpoint list`

List available checkpoints.

```bash
checkpoint list                          # Last 20
checkpoint list --limit 50               # Custom limit
checkpoint list --json                   # Output JSON
```

### `checkpoint summary`

Generate a context summary for review or sub-agent injection.

```bash
checkpoint summary                       # Compact (default)
checkpoint summary --level minimal       # ~500 tokens
checkpoint summary --level compact       # ~1000 tokens
checkpoint summary --level standard      # ~2000 tokens
checkpoint summary --level full          # ~4000 tokens
checkpoint summary --for terraform       # Task-specific
checkpoint summary --for governance      # Governance context
```

### `checkpoint prune`

Remove old checkpoints.

```bash
checkpoint prune                         # Keep default (50)
checkpoint prune --keep 10               # Keep only 10
```
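
As a rough sketch of what pruning could look like, the function below keeps the newest `keep` files in a storage directory. It assumes that checkpoint file names sort chronologically because the ID embeds a `YYYYMMDD-HHMMSS` timestamp; the function name `prune_checkpoints` is hypothetical, and the real command may order by `created_at` instead.

```python
import tempfile
from pathlib import Path


def prune_checkpoints(storage_dir: Path, keep: int = 50) -> list:
    """Delete all but the `keep` newest checkpoint files; return deleted paths."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Assumption: names embed a timestamp, so reverse name order is newest first.
    files = sorted(storage_dir.glob("ckpt-*.json"), reverse=True)
    stale = files[keep:]
    for path in stale:
        path.unlink()
    return stale


# Demonstration against a throwaway directory with three sample files.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)
    for name in (
        "ckpt-20260123-100000-aaa",
        "ckpt-20260123-110000-bbb",
        "ckpt-20260123-120000-ccc",
    ):
        (storage / f"{name}.json").write_text("{}")
    deleted = [p.name for p in prune_checkpoints(storage, keep=2)]
    remaining = sorted(p.name for p in storage.glob("ckpt-*.json"))
```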

## Token-Aware Sub-Agent Context

When orchestrating sub-agents, use the summarizer to minimize tokens while preserving essential context:

```python
from checkpoint import CheckpointManager, ContextSummarizer

# Get latest checkpoint
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

# Create task-specific summary
summarizer = ContextSummarizer(checkpoint)

# For infrastructure tasks (~1000 tokens)
context = summarizer.for_subagent("terraform", max_tokens=1000)

# For governance tasks with specific variables
context = summarizer.for_subagent(
    "promotion",
    relevant_keys=["agent_id", "current_tier"],
    max_tokens=500
)

# Pass context to sub-agent (`subagent` and `task_prompt` are placeholders for your own objects)
subagent.run(context + "\n\n" + task_prompt)
```

### Summary Levels

| Level | Tokens | Contents |
|-------|--------|----------|
| `minimal` | ~500 | Phase, active task, agent, available deps |
| `compact` | ~1000 | + In-progress tasks, pending count, key vars |
| `standard` | ~2000 | + All tasks, all deps, recent outputs |
| `full` | ~4000 | + All variables, completed phases, metadata |
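
When a sub-agent has a fixed token budget, the table can be used to pick the most detailed level that still fits. The following is a minimal sketch using the approximate sizes from the table; `LEVEL_BUDGETS` and `pick_level` are hypothetical names, not part of the checkpoint module.

```python
# Approximate sizes copied from the Summary Levels table, smallest first.
LEVEL_BUDGETS = {"minimal": 500, "compact": 1000, "standard": 2000, "full": 4000}


def pick_level(max_tokens: int) -> str:
    """Return the most detailed level whose approximate size fits the budget."""
    fitting = [name for name, size in LEVEL_BUDGETS.items() if size <= max_tokens]
    # Fall back to the smallest level when even `minimal` exceeds the budget.
    return fitting[-1] if fitting else "minimal"


level_for_1500 = pick_level(1500)
level_for_300 = pick_level(300)
level_for_4000 = pick_level(4000)
```

The result can then be passed to `checkpoint summary --level`. Because the sizes are estimates, leaving some headroom in `max_tokens` is prudent.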

### Task-Specific Contexts

| Task Type | Included Context |
|-----------|-----------------|
| `terraform`, `ansible`, `infrastructure` | Infrastructure dependencies, service status |
| `database`, `query`, `ledger` | Database connections, endpoints |
| `promotion`, `revocation`, `governance` | Agent tier, governance variables |
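
One way to picture the table is as a lookup from task type to a context group. The sketch below is illustrative only: the group labels (`infrastructure`, `database`, `governance`) and the function `context_group` are assumptions, and the behavior for unknown task types is not documented here.

```python
from typing import Optional

# Task types from the table, grouped under hypothetical group labels.
CONTEXT_GROUPS = {
    "infrastructure": {"terraform", "ansible", "infrastructure"},
    "database": {"database", "query", "ledger"},
    "governance": {"promotion", "revocation", "governance"},
}


def context_group(task_type: str) -> Optional[str]:
    """Return the context group for a task type, or None if it is not listed."""
    normalized = task_type.strip().lower()
    for group, task_types in CONTEXT_GROUPS.items():
        if normalized in task_types:
            return group
    return None


terraform_group = context_group("Terraform")
ledger_group = context_group("ledger")
unknown_group = context_group("unknown")
```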

## Auto-Checkpoint Events

Checkpoints can be automatically created by integrating with governance hooks:

```python
# In your agent code
from checkpoint import CheckpointManager

manager = CheckpointManager()

# Auto-checkpoint on phase transitions
def on_phase_complete(phase_num):
    manager.create_checkpoint(
        notes=f"Phase {phase_num} complete",
        variables={"completed_phase": phase_num}
    )

# Auto-checkpoint on task completion
def on_task_complete(task_id, result):
    manager.create_checkpoint(
        variables={"last_task": task_id, "result": result}
    )

# Auto-checkpoint on error
def on_error(error_type, message):
    manager.create_checkpoint(
        notes=f"Error: {error_type}",
        variables={"error_type": error_type, "error_msg": message}
    )
```
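
The error hook above can also be packaged as a decorator, so that any wrapped step records a checkpoint before the exception propagates. This is a sketch under the assumption that the callable passed in accepts `notes` and `variables` keyword arguments, as `create_checkpoint` does in the examples above; `checkpoint_on_error` and `fake_create_checkpoint` are hypothetical names used for illustration.

```python
import functools


def checkpoint_on_error(create_checkpoint):
    """Decorator factory: record a checkpoint when the wrapped call raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error_type = type(exc).__name__
                create_checkpoint(
                    notes=f"Error: {error_type}",
                    variables={"error_type": error_type, "error_msg": str(exc)},
                )
                raise  # Re-raise so callers still see the original failure.
        return wrapper
    return decorator


# Stand-in for manager.create_checkpoint so the sketch runs on its own.
recorded = []


def fake_create_checkpoint(notes=None, variables=None):
    recorded.append({"notes": notes, "variables": variables})


@checkpoint_on_error(fake_create_checkpoint)
def risky_step():
    raise RuntimeError("deploy failed")


try:
    risky_step()
except RuntimeError:
    pass
```

In real use, `manager.create_checkpoint` would be passed in place of the stand-in.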

## Restoring After Restart

After a CLI restart or token window reset:

```bash
# 1. Load the latest checkpoint
checkpoint load

# 2. Review what was happening
checkpoint summary --level full

# 3. If needed, compare with earlier state
checkpoint diff

# 4. Resume work with context
checkpoint summary --level compact > /tmp/context.txt
# Use /tmp/context.txt as the preamble for the new session
```

### Programmatic Restoration

```python
import os

from checkpoint import CheckpointManager, ContextSummarizer

manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

if checkpoint:
    # Restore environment variables
    for key, value in checkpoint.variables.items():
        os.environ[f"CKPT_{key.upper()}"] = str(value)

    # Get context for continuation
    summarizer = ContextSummarizer(checkpoint)
    context = summarizer.standard_summary()

    print(f"Restored from: {checkpoint.checkpoint_id}")
    print(f"Phase: {checkpoint.phase.name if checkpoint.phase else 'Unknown'}")
    print(f"Tasks: {len(checkpoint.tasks)}")
```

## Storage Format

### Checkpoint JSON Structure

```json
{
  "checkpoint_id": "ckpt-20260123-120000-abc12345",
  "created_at": "2026-01-23T12:00:00.000000+00:00",
  "session_id": "optional-session-id",

  "phase": {
    "name": "Phase 5: Agent Bootstrapping",
    "number": 5,
    "status": "in_progress",
    "started_at": "2026-01-23T10:00:00+00:00",
    "notes": "Working on checkpoint skill"
  },
  "phases_completed": [1, 2, 3, 4],

  "tasks": [
    {
      "id": "1",
      "subject": "Create checkpoint module",
      "status": "completed",
      "owner": null,
      "blocks": [],
      "blocked_by": []
    }
  ],
  "active_task_id": "2",

  "dependencies": [
    {
      "name": "vault",
      "type": "service",
      "status": "available",
      "endpoint": "https://127.0.0.1:8200",
      "last_checked": "2026-01-23T12:00:00+00:00"
    }
  ],

  "variables": {
    "custom_key": "custom_value"
  },

  "recent_outputs": [
    {
      "type": "evidence",
      "id": "evd-20260123-...",
      "action": "terraform",
      "success": true,
      "timestamp": "2026-01-23T11:30:00+00:00"
    }
  ],

  "agent_id": "my-agent",
  "agent_tier": 1,

  "content_hash": "abc123...",
  "parent_checkpoint_id": "ckpt-20260123-110000-...",
  "estimated_tokens": 450
}
```
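
This document does not specify how `content_hash` is computed. Purely as an illustration of how such a field can detect modification, the sketch below assumes a SHA-256 digest over the canonical JSON of every field except `content_hash` itself. That scheme, and the names `content_hash` and `is_intact` as functions, are assumptions rather than the module's actual behavior.

```python
import hashlib
import json


def content_hash(checkpoint: dict) -> str:
    """Hash the checkpoint body, ignoring any existing `content_hash` field."""
    body = {k: v for k, v in checkpoint.items() if k != "content_hash"}
    # Sorted keys and fixed separators give a stable byte representation.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_intact(checkpoint: dict) -> bool:
    """Return True when the stored hash matches the recomputed one."""
    return checkpoint.get("content_hash") == content_hash(checkpoint)


# Sample data using a few fields from the structure above.
sample = {
    "checkpoint_id": "ckpt-20260123-120000-abc12345",
    "variables": {"custom_key": "custom_value"},
    "agent_tier": 1,
}
sample["content_hash"] = content_hash(sample)
intact_before = is_intact(sample)

sample["agent_tier"] = 2  # Simulate an edit made after the hash was stored.
intact_after = is_intact(sample)
```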

### File Locations

```
/opt/agent-governance/checkpoint/
├── checkpoint.py       # Core module
├── README.md           # This documentation
├── storage/            # Checkpoint JSON files
│   ├── ckpt-20260123-120000-abc.json
│   ├── ckpt-20260123-110000-def.json
│   └── ...
└── templates/          # Future: checkpoint templates

/opt/agent-governance/bin/
└── checkpoint          # CLI wrapper
```
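
File names in `storage/` follow the checkpoint ID, which appears to have the shape `ckpt-<YYYYMMDD>-<HHMMSS>-<suffix>`. A hedged sketch of parsing that ID follows; the pattern is inferred from the examples in this README, the timestamp is assumed to be UTC (matching the `+00:00` offset in the sample `created_at`), and `parse_checkpoint_id` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Inferred from the sample IDs in this README; the real format may differ.
ID_PATTERN = re.compile(r"^ckpt-(\d{8})-(\d{6})-([0-9a-z]+)$")


def parse_checkpoint_id(checkpoint_id: str):
    """Split a checkpoint ID into its creation time (assumed UTC) and suffix."""
    match = ID_PATTERN.match(checkpoint_id)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint ID: {checkpoint_id!r}")
    date_part, time_part, suffix = match.groups()
    created = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return created.replace(tzinfo=timezone.utc), suffix


created, suffix = parse_checkpoint_id("ckpt-20260123-120000-abc12345")
created_iso = created.isoformat()
```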

## Extensibility

### Adding Custom State Collectors

```python
from checkpoint import CheckpointManager

class MyCheckpointManager(CheckpointManager):

    def collect_my_state(self) -> dict:
        # Custom state collection
        return {"my_data": "..."}

    def create_checkpoint(self, **kwargs):
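        # Add custom state to variables. Copy first so the caller's dict
        # is not mutated, and tolerate an explicit variables=None.
        custom_vars = dict(kwargs.get("variables") or {})
        custom_vars.update(self.collect_my_state())
        kwargs["variables"] = custom_vars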

        return super().create_checkpoint(**kwargs)
```

### Remote Storage (Future)

```python
# Planned: S3/remote sync (sketch only, not yet implemented)
from checkpoint import CheckpointManager

class RemoteCheckpointManager(CheckpointManager):

    def __init__(self, s3_bucket: str):
        super().__init__()
        self.s3_bucket = s3_bucket

    def save_checkpoint(self, checkpoint):
        # Save locally
        local_path = super().save_checkpoint(checkpoint)

        # Sync to S3
        self._upload_to_s3(local_path)

        return local_path

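    def _upload_to_s3(self, local_path):
        # Placeholder: upload local_path to self.s3_bucket
        raise NotImplementedError("S3 upload is planned, not implemented")

    def sync_from_remote(self):
        # Placeholder: download checkpoints from S3
        pass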
```

## Integration with Agent Governance

The checkpoint skill integrates with the existing governance system:

```
┌─────────────────────────────────────────────────────────────┐
│                    Agent Governance                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Preflight │  │ Wrappers │  │ Evidence │  │Checkpoint│    │
│  │   Gate    │  │tf/ansible│  │ Package  │  │  Skill   │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
│       │             │             │             │           │
│       v             v             v             v           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                    DragonflyDB                        │  │
│  │  agent:* states | checkpoint:* | revocations:ledger   │  │
│  └──────────────────────────────────────────────────────┘  │
│       │             │             │             │           │
│       v             v             v             v           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                   SQLite Ledger                       │  │
│  │  agent_actions | violations | promotions | tasks      │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

## Best Practices

1. **Checkpoint Before Risky Operations**
   ```bash
   checkpoint now --notes "Before production deployment"
   ```

2. **Include Relevant Variables**
   ```bash
   checkpoint now --var target_env production --var rollback_id v1.2.3
   ```

3. **Use Task-Specific Summaries for Sub-Agents**
   ```bash
   checkpoint summary --for terraform > context.txt
   ```

4. **Review Diffs After Long Operations**
   ```bash
   checkpoint diff  # What changed?
   ```

5. **Prune Regularly in Long-Running Sessions**
   ```bash
   checkpoint prune --keep 20
   ```

## Troubleshooting

### "No checkpoint found"

Create one first:
```bash
checkpoint now
```

### High token estimates

Use more aggressive summarization:
```bash
checkpoint summary --level minimal
```

### Missing dependencies

Check services:
```bash
docker exec vault vault status
redis-cli -p 6379 -a governance2026 PING
```
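
The same check can be approximated from Python with a plain TCP connection test. This sketch only confirms that something is listening on the port; it does not authenticate or verify service health. The ports come from the examples in this README (Vault on 8200, DragonflyDB on 6379), while the function names and the `"unavailable"` status label are assumptions.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Endpoints taken from the examples in this README.
DEPENDENCIES = {
    "vault": ("127.0.0.1", 8200),
    "dragonflydb": ("127.0.0.1", 6379),
}


def check_dependencies(targets: dict = DEPENDENCIES) -> dict:
    """Map each dependency name to 'available' or 'unavailable' (assumed labels)."""
    return {
        name: "available" if is_reachable(host, port) else "unavailable"
        for name, (host, port) in targets.items()
    }


if __name__ == "__main__":
    for name, status in check_dependencies().items():
        print(f"{name}: {status}")
```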

### Stale checkpoints

Prune and recreate:
```bash
checkpoint prune --keep 5
checkpoint now --notes "Fresh start"
```