# Observability

> Metrics, tracing, and structured logging for the agent governance system

## Overview

This module provides comprehensive observability infrastructure including Prometheus-format metrics, distributed tracing with span propagation, and structured JSON logging with trace correlation.

## Key Files

| File | Description |
|------|-------------|
| `metrics.py` | Prometheus metrics (Counter, Gauge, Histogram) with FastAPI router |
| `tracing.py` | Distributed tracing with Span/Trace classes and context propagation |
| `logging.py` | Structured JSON logging with SQLite persistence |
| `__init__.py` | Module exports and unified API |

## Components

### Metrics (`metrics.py`)

Prometheus-format metrics with automatic collection:

| Metric | Type | Description |
|--------|------|-------------|
| `agent_executions_total` | Counter | Total executions by tier, action, status |
| `agent_execution_duration_seconds` | Histogram | Execution latency distribution |
| `agent_violations_total` | Counter | Violations by type and severity |
| `agent_promotions_total` | Counter | Tier promotions |
| `api_requests_total` | Counter | API requests by method, endpoint, status |
| `api_request_duration_seconds` | Histogram | Request latency |
| `component_health` | Gauge | Health status (1=healthy, 0=unhealthy) |
| `tenant_quota_usage_ratio` | Gauge | Quota usage per tenant |

### Tracing (`tracing.py`)

Distributed tracing with automatic context propagation:

- **Span**: Individual operation with timing, attributes, events
- **Trace**: Collection of related spans
- **Context Propagation**: Thread-local storage + HTTP headers (`X-Trace-ID`, `X-Span-ID`)

### Logging (`logging.py`)

Structured JSON logging with:

- Automatic trace/span ID correlation
- SQLite persistence with full-text search
- Multi-tenant support
- Configurable retention (default: 7 days)

## API Endpoints

### Metrics

- `GET /metrics` - Prometheus format export

### Tracing

- `GET /traces` - List traces with filters
- `GET /traces/{trace_id}` - Full trace details

### Logging

- `GET /logs` - Search logs with filters
- `GET /logs/trace/{trace_id}` - Logs for a trace
- `GET /logs/stats` - Log statistics
- `POST /logs/cleanup` - Clean old logs

### Health

- `GET /health/detailed` - Component health with details

## Usage

```python
from observability import (
    # Metrics
    registry, record_agent_execution, record_violation, record_promotion,
    MetricsMiddleware,
    # Tracing
    get_tracer, get_current_trace_id,
    # Logging
    get_logger
)

# Create a logger
logger = get_logger("my_agent")

# Get the tracer
tracer = get_tracer()

# Trace an operation
with tracer.trace("agent_execution", agent_id="agent-123") as span:
    logger.info("Starting execution", agent_id="agent-123")

    try:
        # Do work...
        with tracer.span("sub_operation") as child:
            # Child span automatically linked
            pass

        record_agent_execution(tier=1, action="update_config", success=True, duration=0.45)

    except Exception as e:
        span.set_error(e)
        record_violation("unauthorized_action", "high")
        logger.error("Execution failed", error=str(e))
```

## FastAPI Integration

```python
from fastapi import FastAPI
from observability import metrics_router, tracing_router, logging_router, MetricsMiddleware

app = FastAPI()

# Add metrics middleware
app.add_middleware(MetricsMiddleware)

# Mount routers
app.include_router(metrics_router)
app.include_router(tracing_router)
app.include_router(logging_router)
```

## Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| `DB_PATH` | `/opt/agent-governance/ledger/governance.db` | SQLite database |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_RETENTION_DAYS` | `7` | Days to retain logs |

## Status

**Complete**

See [STATUS.md](./STATUS.md) for detailed progress tracking.

## Architecture Reference

Part of the [Agent Governance System](/opt/agent-governance/docs/ARCHITECTURE.md).

Parent: [Project Root](/opt/agent-governance)

---
*Last updated: 2026-01-24 UTC*
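
## Example: Context Propagation Sketch

The Tracing section describes context propagation as thread-local storage combined with the `X-Trace-ID` and `X-Span-ID` HTTP headers. The sketch below illustrates that idea using only the standard library. It is not the code in `tracing.py`: the helper names (`set_context`, `current_context`, `inject_headers`, `extract_headers`) and the ID format are assumptions made for illustration, and only the two header names come from this document.

```python
import threading
import uuid

# Header names are taken from the Tracing section above.
TRACE_HEADER = "X-Trace-ID"
SPAN_HEADER = "X-Span-ID"

# Hypothetical per-thread storage for the active trace and span IDs.
_context = threading.local()


def set_context(trace_id, span_id):
    """Bind a trace/span pair to the current thread (hypothetical helper)."""
    _context.trace_id = trace_id
    _context.span_id = span_id


def current_context():
    """Return (trace_id, span_id) for this thread, or (None, None) if unset."""
    return (
        getattr(_context, "trace_id", None),
        getattr(_context, "span_id", None),
    )


def inject_headers(headers):
    """Return a copy of outgoing headers with the active context added."""
    trace_id, span_id = current_context()
    out = dict(headers)
    if trace_id is not None:
        out[TRACE_HEADER] = trace_id
    if span_id is not None:
        out[SPAN_HEADER] = span_id
    return out


def extract_headers(headers):
    """Adopt the caller's trace ID from incoming headers, or start a new trace.

    A fresh span ID is generated for the local work. The caller's span ID is
    returned so it can be recorded as the parent (None for a new trace).
    Note: a plain dict lookup is case-sensitive, unlike real HTTP headers.
    """
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    parent_span_id = headers.get(SPAN_HEADER)
    set_context(trace_id, uuid.uuid4().hex[:16])
    return parent_span_id
```

In this sketch a service would call `extract_headers` when a request arrives and `inject_headers` before making a downstream call, so every hop shares one trace ID. Because the storage is thread-local, a newly started thread sees no context until one is set explicitly.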