Observability
Metrics, tracing, and structured logging for the agent governance system
Overview
This module provides the observability infrastructure for the system: Prometheus-format metrics, distributed tracing with span propagation, and structured JSON logging with trace correlation.
Key Files
| File | Description |
|---|---|
| `metrics.py` | Prometheus metrics (Counter, Gauge, Histogram) with FastAPI router |
| `tracing.py` | Distributed tracing with Span/Trace classes and context propagation |
| `logging.py` | Structured JSON logging with SQLite persistence |
| `__init__.py` | Module exports and unified API |
Components
Metrics (metrics.py)
Prometheus-format metrics with automatic collection:
| Metric | Type | Description |
|---|---|---|
| `agent_executions_total` | Counter | Total executions by tier, action, status |
| `agent_execution_duration_seconds` | Histogram | Execution latency distribution |
| `agent_violations_total` | Counter | Violations by type and severity |
| `agent_promotions_total` | Counter | Tier promotions |
| `api_requests_total` | Counter | API requests by method, endpoint, status |
| `api_request_duration_seconds` | Histogram | Request latency |
| `component_health` | Gauge | Health status (1=healthy, 0=unhealthy) |
| `tenant_quota_usage_ratio` | Gauge | Quota usage per tenant |
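To illustrate what a labeled counter looks like in Prometheus text format, here is a minimal standard-library sketch. `LabeledCounter` is a hypothetical class written for this example; it is not the implementation in metrics.py, and it skips details such as escaping quotes in label values.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical minimal counter that renders Prometheus text format."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)  # label-value tuple -> running total

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        # Key by label values in declared order, so call-site order is irrelevant.
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


executions = LabeledCounter(
    "agent_executions_total",
    "Total executions by tier, action, status",
    ["tier", "action", "status"],
)
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=2, action="deploy", status="failure")
print(executions.render())
```

Each distinct combination of label values becomes its own time series, which is why high-cardinality labels (for example raw agent IDs) are usually kept out of metrics like these.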
Tracing (tracing.py)
Distributed tracing with automatic context propagation:
- Span: Individual operation with timing, attributes, events
- Trace: Collection of related spans
- Context Propagation: Thread-local storage + HTTP headers (X-Trace-ID, X-Span-ID)
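The propagation scheme described above can be sketched with the standard library alone. The helper names `start_span` and `outgoing_headers` are hypothetical and the span is a plain dict; tracing.py may structure this differently. Only the header names X-Trace-ID and X-Span-ID come from this document.

```python
import threading
import uuid
from contextlib import contextmanager

_context = threading.local()  # each thread sees its own active trace/span


def _new_id():
    return uuid.uuid4().hex[:16]


@contextmanager
def start_span(name, incoming_headers=None):
    """Open a span, continuing the caller's trace when headers carry one."""
    headers = incoming_headers or {}
    active_trace = getattr(_context, "trace_id", None)
    active_span = getattr(_context, "span_id", None)
    span = {
        "name": name,
        # Prefer the in-process context, then inbound headers, then a new trace.
        "trace_id": active_trace or headers.get("X-Trace-ID") or _new_id(),
        "span_id": _new_id(),
        "parent_id": active_span or headers.get("X-Span-ID"),
    }
    _context.trace_id, _context.span_id = span["trace_id"], span["span_id"]
    try:
        yield span
    finally:
        # Restore the enclosing span so sibling spans get the right parent.
        _context.trace_id, _context.span_id = active_trace, active_span


def outgoing_headers():
    """Headers to attach to a downstream HTTP call made inside a span."""
    trace_id = getattr(_context, "trace_id", None)
    if trace_id is None:
        return {}
    return {"X-Trace-ID": trace_id, "X-Span-ID": _context.span_id}


inbound = {"X-Trace-ID": "abc123", "X-Span-ID": "span-1"}
with start_span("handle_request", inbound) as parent:
    with start_span("sub_operation") as child:
        headers = outgoing_headers()
```

Thread-local storage keeps concurrent requests on different threads from seeing each other's spans. Code running on an asyncio event loop would typically use `contextvars` instead, since many tasks share one thread.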
Logging (logging.py)
Structured JSON logging with:
- Automatic trace/span ID correlation
- SQLite persistence with full-text search
- Multi-tenant support
- Configurable retention (default: 7 days)
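As a rough illustration of the features listed above, the following sketch writes JSON log records to SQLite and applies a retention cutoff. The table layout and the function names (`open_log_store`, `write_log`, `cleanup`) are assumptions made for this example, not the schema or API of logging.py. Full-text search is left out; an FTS index would be the natural addition where the SQLite build supports it.

```python
import json
import sqlite3
import time

RETENTION_DAYS = 7  # matches the documented default


def open_log_store(path=":memory:"):
    """Create the (hypothetical) logs table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs ("
        "ts REAL, level TEXT, logger TEXT, message TEXT, "
        "trace_id TEXT, span_id TEXT, tenant_id TEXT, record TEXT)"
    )
    return conn


def write_log(conn, level, logger, message, trace_id=None, span_id=None,
              tenant_id=None, ts=None, **fields):
    """Serialize one record as JSON and persist it with its correlation IDs."""
    ts = time.time() if ts is None else ts
    record = {"ts": ts, "level": level, "logger": logger, "message": message,
              "trace_id": trace_id, "span_id": span_id, "tenant_id": tenant_id,
              **fields}
    line = json.dumps(record, sort_keys=True)
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (ts, level, logger, message, trace_id, span_id, tenant_id, line),
    )
    conn.commit()
    return line


def cleanup(conn, retention_days=RETENTION_DAYS, now=None):
    """Delete records older than the retention window; return rows removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    cur = conn.execute("DELETE FROM logs WHERE ts < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


conn = open_log_store()
now = time.time()
write_log(conn, "INFO", "my_agent", "old entry", trace_id="t1", ts=now - 8 * 86400)
line = write_log(conn, "ERROR", "my_agent", "Execution failed", trace_id="t2", error="boom")
removed = cleanup(conn, now=now)
remaining = conn.execute("SELECT message, trace_id FROM logs").fetchall()
```

Storing the trace ID in its own column, rather than only inside the JSON blob, is what makes a lookup such as "all logs for one trace" a simple indexed query.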
API Endpoints
Metrics
- GET /metrics - Prometheus format export
Tracing
- GET /traces - List traces with filters
- GET /traces/{trace_id} - Full trace details
Logging
- GET /logs - Search logs with filters
- GET /logs/trace/{trace_id} - Logs for a trace
- GET /logs/stats - Log statistics
- POST /logs/cleanup - Clean old logs
Health
- GET /health/detailed - Component health with details
Usage
from observability import (
    # Metrics
    registry,
    record_agent_execution,
    record_violation,
    record_promotion,
    MetricsMiddleware,
    # Tracing
    get_tracer,
    get_current_trace_id,
    # Logging
    get_logger,
)
# Create a logger
logger = get_logger("my_agent")
# Get the tracer
tracer = get_tracer()
# Trace an operation
with tracer.trace("agent_execution", agent_id="agent-123") as span:
    logger.info("Starting execution", agent_id="agent-123")
    try:
        # Do work...
        with tracer.span("sub_operation") as child:
            # Child span automatically linked
            pass
        record_agent_execution(tier=1, action="update_config", success=True, duration=0.45)
    except Exception as e:
        span.set_error(e)
        record_violation("unauthorized_action", "high")
        logger.error("Execution failed", error=str(e))
FastAPI Integration
from fastapi import FastAPI
from observability import metrics_router, tracing_router, logging_router, MetricsMiddleware
app = FastAPI()
# Add metrics middleware
app.add_middleware(MetricsMiddleware)
# Mount routers
app.include_router(metrics_router)
app.include_router(tracing_router)
app.include_router(logging_router)
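For readers curious what a request-metrics middleware does under the hood, here is a sketch written as a plain ASGI middleware using only the standard library. `TimingMiddleware`, `request_counts`, and `request_durations` are hypothetical names for this example; the real `MetricsMiddleware` may be implemented differently.

```python
import asyncio
import time
from collections import Counter

request_counts = Counter()  # (method, path, status) -> number of requests
request_durations = []      # seconds per request, a stand-in for a histogram


class TimingMiddleware:
    """Hypothetical pure-ASGI middleware that counts and times HTTP requests."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return
        status = {"code": 500}  # assume failure until a response starts
        started = time.perf_counter()

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            request_durations.append(time.perf_counter() - started)
            request_counts[(scope["method"], scope["path"], status["code"])] += 1


async def demo_app(scope, receive, send):
    """Tiny ASGI app standing in for a real route handler."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _drive():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/health/detailed"}
    await TimingMiddleware(demo_app)(scope, receive, send)
    return sent


messages = asyncio.run(_drive())
```

Recording in a `finally` block means requests that raise are still counted, here with the fallback status of 500. A production version would likely label by route template rather than raw path to keep label cardinality bounded.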
Configuration
| Setting | Default | Description |
|---|---|---|
| `DB_PATH` | `/opt/agent-governance/ledger/governance.db` | SQLite database |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_RETENTION_DAYS` | `7` | Days to retain logs |
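This document does not say how these settings are supplied. Assuming they are read from environment variables of the same names, a loader with the documented defaults might look like the sketch below; `ObservabilityConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilityConfig:
    # Defaults copied from the settings table above.
    db_path: str = "/opt/agent-governance/ledger/governance.db"
    log_level: str = "INFO"
    log_retention_days: int = 7


def load_config(env=None):
    """Build a config from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    defaults = ObservabilityConfig()
    return ObservabilityConfig(
        db_path=env.get("DB_PATH", defaults.db_path),
        log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
        log_retention_days=int(env.get("LOG_RETENTION_DAYS", defaults.log_retention_days)),
    )


# Override two settings; DB_PATH falls back to its default.
config = load_config({"LOG_LEVEL": "debug", "LOG_RETENTION_DAYS": "30"})
```

Passing the mapping explicitly, instead of reading `os.environ` inside the function, keeps the loader easy to test.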
Status
Complete
See STATUS.md for detailed progress tracking.
Architecture Reference
Part of the Agent Governance System.
Parent: Project Root
Last updated: 2026-01-24 UTC