# Observability

> Metrics, tracing, and structured logging for the agent governance system

## Overview

This module provides observability for the agent governance system: Prometheus-format metrics, distributed tracing with span propagation, and structured JSON logging correlated with trace and span IDs.

## Key Files

| File | Description |
|------|-------------|
| `metrics.py` | Prometheus metrics (Counter, Gauge, Histogram) with FastAPI router |
| `tracing.py` | Distributed tracing with Span/Trace classes and context propagation |
| `logging.py` | Structured JSON logging with SQLite persistence |
| `__init__.py` | Module exports and unified API |

## Components

### Metrics (`metrics.py`)

Prometheus-format metrics with automatic collection:

| Metric | Type | Description |
|--------|------|-------------|
| `agent_executions_total` | Counter | Total executions by tier, action, status |
| `agent_execution_duration_seconds` | Histogram | Execution latency distribution |
| `agent_violations_total` | Counter | Violations by type and severity |
| `agent_promotions_total` | Counter | Tier promotions |
| `api_requests_total` | Counter | API requests by method, endpoint, status |
| `api_request_duration_seconds` | Histogram | Request latency |
| `component_health` | Gauge | Health status (1=healthy, 0=unhealthy) |
| `tenant_quota_usage_ratio` | Gauge | Quota usage per tenant |
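
To illustrate what "Prometheus-format" means for a labelled counter such as `agent_executions_total`, here is a minimal sketch. The `Counter` class below is hypothetical and is not the implementation in `metrics.py`; the label names `tier`, `action`, and `status` are taken from the table above, and label-value escaping is omitted for brevity.

```python
from collections import defaultdict


class Counter:
    """Hypothetical labelled counter; a sketch, not the class in metrics.py."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Every sample must supply exactly the declared label names
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] += amount

    def render(self):
        # Prometheus text exposition: HELP and TYPE lines, then one line per label set
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


executions = Counter(
    "agent_executions_total",
    "Total executions by tier, action, status",
    ["tier", "action", "status"],
)
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=2, action="deploy", status="failure")
output = executions.render()
print(output)
```

Each distinct combination of label values becomes its own time series, so labels with unbounded values (for example raw agent IDs) are usually best avoided.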

### Tracing (`tracing.py`)

Distributed tracing with automatic context propagation:

- **Span**: Individual operation with timing, attributes, events
- **Trace**: Collection of related spans
- **Context Propagation**: Thread-local storage within a process; HTTP headers (`X-Trace-ID`, `X-Span-ID`) across service boundaries
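
The propagation scheme above can be sketched as follows. This is a simplified illustration, not the code in `tracing.py`: the helper names `start_span`, `end_span`, and `outgoing_headers` are hypothetical, spans are plain dictionaries, and only the header names `X-Trace-ID` and `X-Span-ID` come from this README.

```python
import threading
import uuid

# Each thread sees its own "current span"
_context = threading.local()


def start_span(headers=None):
    """Begin a span, continuing the caller's trace if the headers carry one."""
    headers = headers or {}
    parent = getattr(_context, "span", None)
    if "X-Trace-ID" in headers:
        # Incoming request: join the remote trace (plain dict lookup is case-sensitive)
        trace_id = headers["X-Trace-ID"]
        parent_id = headers.get("X-Span-ID")
    elif parent is not None:
        # Nested operation in the same thread: become a child of the current span
        trace_id = parent["trace_id"]
        parent_id = parent["span_id"]
    else:
        # No context at all: start a new trace
        trace_id = uuid.uuid4().hex
        parent_id = None
    span = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "previous": parent,
    }
    _context.span = span
    return span


def end_span(span):
    """Restore whichever span was current before this one started."""
    _context.span = span["previous"]


def outgoing_headers():
    """Headers to attach to an outbound HTTP call from the current span."""
    span = getattr(_context, "span", None)
    if span is None:
        return {}
    return {"X-Trace-ID": span["trace_id"], "X-Span-ID": span["span_id"]}


# Simulate a caller and a downstream service running in another thread
root = start_span()
headers = outgoing_headers()
result = {}


def handle_request(incoming):
    span = start_span(incoming)
    result["span"] = span
    end_span(span)


worker = threading.Thread(target=handle_request, args=(headers,))
worker.start()
worker.join()
end_span(root)
```

Because thread-local state does not cross threads, the worker only joins the trace through the headers it receives, which mirrors what happens between two services.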

### Logging (`logging.py`)

Structured JSON logging with:

- Automatic trace/span ID correlation
- SQLite persistence with full-text search
- Multi-tenant support
- Configurable retention (default: 7 days)
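
As a rough sketch of how trace-correlated JSON records, SQLite persistence, and retention could fit together, consider the following. The table schema and the functions `write_log`, `logs_for_trace`, and `cleanup` are assumptions made for illustration and may differ from `logging.py`; full-text search is left out, and an in-memory database stands in for the real file.

```python
import json
import sqlite3
import time

RETENTION_DAYS = 7  # mirrors the documented default

# In-memory database for the sketch; the schema is an assumption
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    "ts REAL, level TEXT, tenant_id TEXT, trace_id TEXT, span_id TEXT, payload TEXT)"
)


def write_log(level, message, tenant_id, trace_id=None, span_id=None, ts=None, **fields):
    """Serialize one record as JSON and persist it with its correlation IDs."""
    ts = time.time() if ts is None else ts
    record = {
        "timestamp": ts,
        "level": level,
        "message": message,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
        "span_id": span_id,
        **fields,
    }
    payload = json.dumps(record, sort_keys=True)
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?, ?)",
        (ts, level, tenant_id, trace_id, span_id, payload),
    )
    conn.commit()
    return payload


def logs_for_trace(trace_id):
    """Return every record that belongs to one trace, oldest first."""
    rows = conn.execute(
        "SELECT payload FROM logs WHERE trace_id = ? ORDER BY ts", (trace_id,)
    )
    return [json.loads(row[0]) for row in rows]


def cleanup(now=None, retention_days=RETENTION_DAYS):
    """Delete records older than the retention window; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    cursor = conn.execute("DELETE FROM logs WHERE ts < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


now = time.time()
write_log("INFO", "Starting execution", "tenant-a",
          trace_id="abc123", span_id="s1", agent_id="agent-123")
write_log("INFO", "Old entry", "tenant-a", trace_id="old", ts=now - 8 * 86400)
deleted = cleanup(now=now)
remaining = logs_for_trace("abc123")
```

Storing `trace_id` in its own column, rather than only inside the JSON payload, is what makes a per-trace lookup a simple indexed query.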

## API Endpoints

### Metrics
- `GET /metrics` - Prometheus format export

### Tracing
- `GET /traces` - List traces with filters
- `GET /traces/{trace_id}` - Full trace details

### Logging
- `GET /logs` - Search logs with filters
- `GET /logs/trace/{trace_id}` - Logs for a trace
- `GET /logs/stats` - Log statistics
- `POST /logs/cleanup` - Clean old logs

### Health
- `GET /health/detailed` - Component health with details

## Usage

```python
from observability import (
    # Metrics
    registry,
    record_agent_execution,
    record_violation,
    record_promotion,
    MetricsMiddleware,

    # Tracing
    get_tracer,
    get_current_trace_id,

    # Logging
    get_logger
)

# Create a logger
logger = get_logger("my_agent")

# Get the tracer
tracer = get_tracer()

# Trace an operation
with tracer.trace("agent_execution", agent_id="agent-123") as span:
    logger.info("Starting execution", agent_id="agent-123")

    try:
        # Do work...
        with tracer.span("sub_operation") as child:
            # Child span automatically linked
            pass

        record_agent_execution(tier=1, action="update_config", success=True, duration=0.45)

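    except Exception as e:
        # Mark the span as failed so the trace shows where the error occurred
        span.set_error(e)
        # Violation type and severity here are illustrative; record the one that applies
        record_violation("unauthorized_action", "high")
        logger.error("Execution failed", error=str(e))
        raise  # re-raise so the caller still sees the failure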
```

## FastAPI Integration

```python
from fastapi import FastAPI
from observability import metrics_router, tracing_router, logging_router, MetricsMiddleware

app = FastAPI()

# Add metrics middleware
app.add_middleware(MetricsMiddleware)

# Mount routers
app.include_router(metrics_router)
app.include_router(tracing_router)
app.include_router(logging_router)
```
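
For readers curious about what a request-metrics middleware does, the sketch below shows the general technique as a pure ASGI middleware. It is not the source of `MetricsMiddleware`: the class name `TimingMiddleware` and the `on_request` callback are hypothetical, and the real middleware presumably feeds `api_requests_total` and `api_request_duration_seconds` instead of a callback.

```python
import asyncio
import time


class TimingMiddleware:
    """Hypothetical ASGI middleware that reports method, path, status, and duration."""

    def __init__(self, app, on_request):
        self.app = app
        self.on_request = on_request

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request (lifespan, websocket)
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = {"code": 500}  # reported if the app raises before responding
        start = time.perf_counter()

        async def send_wrapper(message):
            # The status code arrives in the first response message
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed = time.perf_counter() - start
            self.on_request(scope["method"], scope["path"], status["code"], elapsed)


# A stand-in ASGI app and fake transport so the sketch runs without a server
async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def fake_receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def discard_send(message):
    pass


observed = []
wrapped = TimingMiddleware(demo_app, lambda *sample: observed.append(sample))
scope = {"type": "http", "method": "GET", "path": "/metrics"}
asyncio.run(wrapped(scope, fake_receive, discard_send))
```

Recording in a `finally` block means failed requests are still counted. Using the raw path as a label can create many time series when paths contain IDs, so a route template is often the safer label.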

## Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| `DB_PATH` | `/opt/agent-governance/ledger/governance.db` | Path to the SQLite database |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_RETENTION_DAYS` | `7` | Days to retain logs |
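
This README does not state how these settings are supplied. Assuming they are read from environment variables of the same name, loading them with the documented defaults might look like the sketch below; `ObservabilityConfig` and `load_config` are hypothetical names, not part of the module's API.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilityConfig:
    db_path: str
    log_level: str
    log_retention_days: int


def load_config(env=None):
    """Build the config from a mapping (defaults to os.environ), applying documented defaults."""
    env = os.environ if env is None else env
    retention = int(env.get("LOG_RETENTION_DAYS", "7"))
    if retention < 1:
        raise ValueError("LOG_RETENTION_DAYS must be at least 1")
    return ObservabilityConfig(
        db_path=env.get("DB_PATH", "/opt/agent-governance/ledger/governance.db"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        log_retention_days=retention,
    )


# Override only the log level; the other settings fall back to their defaults
config = load_config({"LOG_LEVEL": "debug"})
print(config)
```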

## Status

**Complete**

See [STATUS.md](./STATUS.md) for detailed progress tracking.

## Architecture Reference

Part of the [Agent Governance System](/opt/agent-governance/docs/ARCHITECTURE.md).

Parent: [Project Root](/opt/agent-governance)

---
*Last updated: 2026-01-24 UTC*