
Context Checkpoint Skill

Phase 5: Agent Bootstrapping

A context preservation system that maintains session state across token window resets, CLI restarts, and sub-agent orchestration.

Overview

The checkpoint skill provides:

  1. Periodic State Capture - Captures phase, tasks, dependencies, variables, and outputs
  2. Token-Aware Summarization - Creates minimal context summaries for sub-agent calls
  3. CLI Integration - Manual and automatic checkpoint management
  4. Extensible Storage - JSON files with DragonflyDB caching (future: remote sync)

Quick Start

# Add to PATH (optional)
export PATH="/opt/agent-governance/bin:$PATH"

# Create a checkpoint
checkpoint now --notes "Starting task X"

# Load latest checkpoint
checkpoint load

# Compare with previous
checkpoint diff

# Get compact context summary
checkpoint summary

# List all checkpoints
checkpoint list

Commands

checkpoint now

Create a new checkpoint capturing current state.

checkpoint now                           # Basic checkpoint
checkpoint now --notes "Phase 5 start"   # With notes
checkpoint now --var key value           # With variables
checkpoint now --var a 1 --var b 2       # Multiple variables
checkpoint now --json                    # Output JSON

Captures:

  • Current phase (from implementation plan)
  • Task states (from governance DB)
  • Dependency status (Vault, DragonflyDB, Ledger)
  • Custom variables
  • Recent evidence outputs
  • Agent ID and tier

checkpoint load

Load a checkpoint for review or restoration.

checkpoint load                          # Load latest
checkpoint load ckpt-20260123-120000-abc # Load specific
checkpoint load --json                   # Output JSON

checkpoint diff

Compare two checkpoints to see what changed.

checkpoint diff                          # Latest vs previous
checkpoint diff --from ckpt-A --to ckpt-B  # Specific comparison
checkpoint diff --json                   # Output JSON

Detects:

  • Phase changes
  • Task additions/removals/status changes
  • Dependency status changes
  • Variable additions/changes/removals
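
The comparison listed above can be sketched in plain Python. This is only an illustration, assuming checkpoints are dicts shaped like the JSON under Storage Format; `diff_checkpoints` and `_by_key` are hypothetical helpers, not functions of the checkpoint module, and the real CLI output may be structured differently.

```python
def _by_key(items, key, value):
    """Index a list of dicts as {item[key]: item[value]}."""
    return {item[key]: item[value] for item in items}


def diff_checkpoints(old: dict, new: dict) -> dict:
    """Report phase, variable, task, and dependency changes between two checkpoints."""
    # Phase: compare phase numbers; "phase" may be missing or null
    old_phase = (old.get("phase") or {}).get("number")
    new_phase = (new.get("phase") or {}).get("number")
    phase = {"from": old_phase, "to": new_phase} if old_phase != new_phase else None

    # Variables: set arithmetic on the keys gives additions and removals
    old_vars, new_vars = old.get("variables", {}), new.get("variables", {})
    variables = {
        "added": sorted(new_vars.keys() - old_vars.keys()),
        "removed": sorted(old_vars.keys() - new_vars.keys()),
        "changed": sorted(
            k for k in old_vars.keys() & new_vars.keys() if old_vars[k] != new_vars[k]
        ),
    }

    # Tasks: match by id, then compare status
    old_tasks = _by_key(old.get("tasks", []), "id", "status")
    new_tasks = _by_key(new.get("tasks", []), "id", "status")
    tasks = {
        "added": sorted(new_tasks.keys() - old_tasks.keys()),
        "removed": sorted(old_tasks.keys() - new_tasks.keys()),
        "status_changed": {
            tid: (old_tasks[tid], new_tasks[tid])
            for tid in sorted(old_tasks.keys() & new_tasks.keys())
            if old_tasks[tid] != new_tasks[tid]
        },
    }

    # Dependencies: match by name, report (old_status, new_status) pairs
    old_deps = _by_key(old.get("dependencies", []), "name", "status")
    new_deps = _by_key(new.get("dependencies", []), "name", "status")
    dependencies = {
        name: (old_deps[name], new_deps[name])
        for name in sorted(old_deps.keys() & new_deps.keys())
        if old_deps[name] != new_deps[name]
    }

    return {
        "phase": phase,
        "variables": variables,
        "tasks": tasks,
        "dependencies": dependencies,
    }
```

Matching tasks by `id` and dependencies by `name` keeps the diff stable even if the lists are reordered between checkpoints.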

checkpoint list

List available checkpoints.

checkpoint list                          # Last 20
checkpoint list --limit 50               # Custom limit
checkpoint list --json                   # Output JSON
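
Checkpoint IDs in this document look like `ckpt-<YYYYMMDD>-<HHMMSS>-<suffix>`. Assuming that shape holds for every ID, a listing with a limit could be ordered as in the sketch below; `checkpoint_time` and `newest_first` are hypothetical names used only for illustration.

```python
from datetime import datetime


def checkpoint_time(checkpoint_id: str) -> datetime:
    """Extract the timestamp embedded in a checkpoint ID (assumed ID layout)."""
    # Expected shape: ckpt-<YYYYMMDD>-<HHMMSS>-<suffix>
    prefix, date_part, time_part, _suffix = checkpoint_id.split("-", 3)
    if prefix != "ckpt":
        raise ValueError(f"not a checkpoint ID: {checkpoint_id!r}")
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


def newest_first(checkpoint_ids, limit: int = 20) -> list:
    """Return up to `limit` IDs, most recent first."""
    return sorted(checkpoint_ids, key=checkpoint_time, reverse=True)[:limit]
```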

checkpoint summary

Generate a context summary for review or for injection into a sub-agent prompt.

checkpoint summary                       # Compact (default)
checkpoint summary --level minimal       # ~500 tokens
checkpoint summary --level compact       # ~1000 tokens
checkpoint summary --level standard      # ~2000 tokens
checkpoint summary --level full          # ~4000 tokens
checkpoint summary --for terraform       # Task-specific
checkpoint summary --for governance      # Governance context

checkpoint prune

Remove old checkpoints.

checkpoint prune                         # Keep default (50)
checkpoint prune --keep 10               # Keep only 10
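
A minimal sketch of the keep-newest-N idea behind pruning, assuming checkpoints are stored as `ckpt-*.json` files whose names sort chronologically (as the example file names under File Locations suggest). `prune_checkpoints` is a hypothetical helper; the real command may also clear cache entries or apply other rules.

```python
from pathlib import Path


def prune_checkpoints(storage_dir: Path, keep: int = 50) -> list:
    """Delete all but the `keep` newest checkpoint files; return the deleted paths."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # File names embed a YYYYMMDD-HHMMSS timestamp, so sorting by name
    # in reverse puts the newest checkpoints first.
    files = sorted(storage_dir.glob("ckpt-*.json"), key=lambda p: p.name, reverse=True)
    stale = files[keep:]
    for path in stale:
        path.unlink()
    return stale
```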

Token-Aware Sub-Agent Context

When orchestrating sub-agents, use the summarizer to minimize tokens while preserving essential context:

from checkpoint import CheckpointManager, ContextSummarizer

# Get latest checkpoint
manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

# Create task-specific summary
summarizer = ContextSummarizer(checkpoint)

# For infrastructure tasks (~1000 tokens)
context = summarizer.for_subagent("terraform", max_tokens=1000)

# For governance tasks with specific variables
context = summarizer.for_subagent(
    "promotion",
    relevant_keys=["agent_id", "current_tier"],
    max_tokens=500
)

# Pass context to sub-agent
# (`subagent` and `task_prompt` are supplied by your orchestration code)
subagent.run(context + "\n\n" + task_prompt)

Summary Levels

Level      Tokens   Contents
minimal    ~500     Phase, active task, agent, available deps
compact    ~1000    + In-progress tasks, pending count, key vars
standard   ~2000    + All tasks, all deps, recent outputs
full       ~4000    + All variables, completed phases, metadata
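
One way to relate a `max_tokens` budget to these levels is to pick the most detailed level that still fits. The sketch below uses the approximate budgets from the table; `LEVEL_BUDGETS` and `pick_level` are hypothetical names, and the summarizer may choose levels differently.

```python
# Approximate budgets copied from the Summary Levels table, smallest first
LEVEL_BUDGETS = {"minimal": 500, "compact": 1000, "standard": 2000, "full": 4000}


def pick_level(max_tokens: int) -> str:
    """Return the most detailed level whose budget fits within max_tokens."""
    fitting = [name for name, budget in LEVEL_BUDGETS.items() if budget <= max_tokens]
    # Fall back to the smallest level when even "minimal" exceeds the budget
    return fitting[-1] if fitting else "minimal"
```

For example, a budget of 1500 tokens selects `compact`, since `standard` would exceed it.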

Task-Specific Contexts

Task Type                            Included Context
terraform, ansible, infrastructure   Infrastructure dependencies, service status
database, query, ledger              Database connections, endpoints
promotion, revocation, governance    Agent tier, governance variables
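
The table reads as a lookup from task type to a context group. A small sketch of that lookup follows, with the group names (`infrastructure`, `database`, `governance`) and the helper `context_group` chosen here for illustration only:

```python
from typing import Optional

# Task types grouped as in the Task-Specific Contexts table
TASK_CONTEXTS = {
    "infrastructure": {"terraform", "ansible", "infrastructure"},
    "database": {"database", "query", "ledger"},
    "governance": {"promotion", "revocation", "governance"},
}


def context_group(task_type: str) -> Optional[str]:
    """Return the context group for a task type, or None if it is not listed."""
    key = task_type.strip().lower()
    for group, task_types in TASK_CONTEXTS.items():
        if key in task_types:
            return group
    return None
```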

Auto-Checkpoint Events

Checkpoints can be automatically created by integrating with governance hooks:

# In your agent code
from checkpoint import CheckpointManager

manager = CheckpointManager()

# Auto-checkpoint on phase transitions
def on_phase_complete(phase_num):
    manager.create_checkpoint(
        notes=f"Phase {phase_num} complete",
        variables={"completed_phase": phase_num}
    )

# Auto-checkpoint on task completion
def on_task_complete(task_id, result):
    manager.create_checkpoint(
        variables={"last_task": task_id, "result": result}
    )

# Auto-checkpoint on error
def on_error(error_type, message):
    manager.create_checkpoint(
        notes=f"Error: {error_type}",
        variables={"error_type": error_type, "error_msg": message}
    )

Restoring After Restart

After a CLI restart or token window reset:

# 1. Load the latest checkpoint
checkpoint load

# 2. Review what was happening
checkpoint summary --level full

# 3. If needed, compare with earlier state
checkpoint diff

# 4. Resume work with context
checkpoint summary --level compact > /tmp/context.txt
# Use /tmp/context.txt as the preamble for the new session

Programmatic Restoration

import os

from checkpoint import CheckpointManager, ContextSummarizer

manager = CheckpointManager()
checkpoint = manager.get_latest_checkpoint()

if checkpoint:
    # Restore environment variables
    for key, value in checkpoint.variables.items():
        os.environ[f"CKPT_{key.upper()}"] = str(value)

    # Get context for continuation
    summarizer = ContextSummarizer(checkpoint)
    context = summarizer.standard_summary()

    print(f"Restored from: {checkpoint.checkpoint_id}")
    print(f"Phase: {checkpoint.phase.name if checkpoint.phase else 'Unknown'}")
    print(f"Tasks: {len(checkpoint.tasks)}")

Storage Format

Checkpoint JSON Structure

{
  "checkpoint_id": "ckpt-20260123-120000-abc12345",
  "created_at": "2026-01-23T12:00:00.000000+00:00",
  "session_id": "optional-session-id",

  "phase": {
    "name": "Phase 5: Agent Bootstrapping",
    "number": 5,
    "status": "in_progress",
    "started_at": "2026-01-23T10:00:00+00:00",
    "notes": "Working on checkpoint skill"
  },
  "phases_completed": [1, 2, 3, 4],

  "tasks": [
    {
      "id": "1",
      "subject": "Create checkpoint module",
      "status": "completed",
      "owner": null,
      "blocks": [],
      "blocked_by": []
    }
  ],
  "active_task_id": "2",

  "dependencies": [
    {
      "name": "vault",
      "type": "service",
      "status": "available",
      "endpoint": "https://127.0.0.1:8200",
      "last_checked": "2026-01-23T12:00:00+00:00"
    }
  ],

  "variables": {
    "custom_key": "custom_value"
  },

  "recent_outputs": [
    {
      "type": "evidence",
      "id": "evd-20260123-...",
      "action": "terraform",
      "success": true,
      "timestamp": "2026-01-23T11:30:00+00:00"
    }
  ],

  "agent_id": "my-agent",
  "agent_tier": 1,

  "content_hash": "abc123...",
  "parent_checkpoint_id": "ckpt-20260123-110000-...",
  "estimated_tokens": 450
}
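
A checkpoint file in this format can be read with the standard library alone. The sketch below is an illustration under assumptions: the set of keys treated as required is a guess, and `parse_checkpoint` and `active_task` are hypothetical helpers rather than part of the checkpoint module.

```python
import json
from datetime import datetime

# Assumption: these keys are always present in a saved checkpoint
REQUIRED_KEYS = ("checkpoint_id", "created_at", "tasks", "dependencies", "variables")


def parse_checkpoint(raw: str) -> dict:
    """Parse checkpoint JSON and convert created_at to a datetime."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"checkpoint is missing keys: {', '.join(missing)}")
    # created_at is an ISO 8601 timestamp with a UTC offset
    data["created_at"] = datetime.fromisoformat(data["created_at"])
    return data


def active_task(data: dict):
    """Return the task whose id matches active_task_id, or None."""
    active_id = data.get("active_task_id")
    return next((task for task in data["tasks"] if task["id"] == active_id), None)
```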

File Locations

/opt/agent-governance/checkpoint/
├── checkpoint.py       # Core module
├── README.md           # This documentation
├── storage/            # Checkpoint JSON files
│   ├── ckpt-20260123-120000-abc.json
│   ├── ckpt-20260123-110000-def.json
│   └── ...
└── templates/          # Future: checkpoint templates

/opt/agent-governance/bin/
└── checkpoint          # CLI wrapper

Extensibility

Adding Custom State Collectors

from checkpoint import CheckpointManager

class MyCheckpointManager(CheckpointManager):

    def collect_my_state(self) -> dict:
        # Custom state collection
        return {"my_data": "..."}

    def create_checkpoint(self, **kwargs):
        # Copy the caller's variables (tolerating None) so their dict is not mutated
        custom_vars = dict(kwargs.get("variables") or {})
        custom_vars.update(self.collect_my_state())
        kwargs["variables"] = custom_vars

        return super().create_checkpoint(**kwargs)

Remote Storage (Future)

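# Planned: S3/remote sync (sketch only, not yet implemented)
from checkpoint import CheckpointManager

class RemoteCheckpointManager(CheckpointManager):

    def __init__(self, s3_bucket: str):
        super().__init__()
        self.s3_bucket = s3_bucket

    def save_checkpoint(self, checkpoint):
        # Save locally
        local_path = super().save_checkpoint(checkpoint)

        # Sync to S3
        self._upload_to_s3(local_path)

        return local_path

    def _upload_to_s3(self, local_path):
        # Placeholder: upload local_path to self.s3_bucket
        raise NotImplementedError

    def sync_from_remote(self):
        # Placeholder: download checkpoints from S3
        raise NotImplementedError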

Integration with Agent Governance

The checkpoint skill integrates with the existing governance system:

┌─────────────────────────────────────────────────────────────┐
│                    Agent Governance                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Preflight │  │ Wrappers │  │ Evidence │  │Checkpoint│    │
│  │   Gate    │  │tf/ansible│  │ Package  │  │  Skill   │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
│       │             │             │             │           │
│       v             v             v             v           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                    DragonflyDB                        │  │
│  │  agent:* states | checkpoint:* | revocations:ledger   │  │
│  └──────────────────────────────────────────────────────┘  │
│       │             │             │             │           │
│       v             v             v             v           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                   SQLite Ledger                       │  │
│  │  agent_actions | violations | promotions | tasks      │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Best Practices

  1. Checkpoint Before Risky Operations

    checkpoint now --notes "Before production deployment"
    
  2. Include Relevant Variables

    checkpoint now --var target_env production --var rollback_id v1.2.3
    
  3. Use Task-Specific Summaries for Sub-Agents

    checkpoint summary --for terraform > context.txt
    
  4. Review Diffs After Long Operations

    checkpoint diff  # What changed?
    
  5. Prune Regularly in Long-Running Sessions

    checkpoint prune --keep 20
    

Troubleshooting

"No checkpoint found"

Create one first:

checkpoint now

High token estimates

Use more aggressive summarization:

checkpoint summary --level minimal

Missing dependencies

Check services:

docker exec vault vault status
redis-cli -p 6379 -a governance2026 PING

Stale checkpoints

Prune and recreate:

checkpoint prune --keep 5
checkpoint now --notes "Fresh start"