# Observability

> Metrics, tracing, and structured logging for the agent governance system

## Overview

This module provides comprehensive observability infrastructure including Prometheus-format metrics, distributed tracing with span propagation, and structured JSON logging with trace correlation.

## Key Files

| File | Description |
|------|-------------|
| `metrics.py` | Prometheus metrics (Counter, Gauge, Histogram) with FastAPI router |
| `tracing.py` | Distributed tracing with Span/Trace classes and context propagation |
| `logging.py` | Structured JSON logging with SQLite persistence |
| `__init__.py` | Module exports and unified API |

## Components

### Metrics (`metrics.py`)

Prometheus-format metrics with automatic collection:

| Metric | Type | Description |
|--------|------|-------------|
| `agent_executions_total` | Counter | Total executions by tier, action, status |
| `agent_execution_duration_seconds` | Histogram | Execution latency distribution |
| `agent_violations_total` | Counter | Violations by type and severity |
| `agent_promotions_total` | Counter | Tier promotions |
| `api_requests_total` | Counter | API requests by method, endpoint, status |
| `api_request_duration_seconds` | Histogram | Request latency |
| `component_health` | Gauge | Health status (1=healthy, 0=unhealthy) |
| `tenant_quota_usage_ratio` | Gauge | Quota usage per tenant |

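To make the exposition format concrete, the following is a minimal, stdlib-only sketch of a labeled counter rendered as Prometheus text. `SketchCounter` is a hypothetical stand-in and is not the class in `metrics.py`; only the metric name and its label names (tier, action, status) are taken from the table above, and label-value escaping is left out for brevity.

```python
from collections import defaultdict


class SketchCounter:
    """Hypothetical labeled counter; not the implementation in metrics.py."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)  # label-value tuple -> running count

    def inc(self, amount=1.0, **labels):
        # Require exactly the declared labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # Text exposition: HELP and TYPE lines, then one sample line per series.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


executions = SketchCounter(
    "agent_executions_total",
    "Total executions by tier, action, status",
    ["tier", "action", "status"],
)
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=2, action="deploy", status="failure")
metrics_text = executions.render()
```

Each distinct combination of label values becomes its own time series, which is why the sketch keys its counts on the full label tuple.
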
### Tracing (`tracing.py`)

Distributed tracing with automatic context propagation:

- **Span**: Individual operation with timing, attributes, events
- **Trace**: Collection of related spans
- **Context Propagation**: Thread-local storage + HTTP headers (`X-Trace-ID`, `X-Span-ID`)

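The propagation scheme in the list above can be sketched with `threading.local` and a pair of header dictionaries. This is an illustration under assumptions, not the code in `tracing.py`: `sketch_span`, `current_ids`, and `outgoing_headers` are hypothetical names, and only the header names `X-Trace-ID` and `X-Span-ID` come from this document.

```python
import threading
import uuid
from contextlib import contextmanager

_context = threading.local()  # holds the current trace/span IDs per thread


def current_ids():
    """Return (trace_id, span_id) for this thread, or (None, None)."""
    return getattr(_context, "trace_id", None), getattr(_context, "span_id", None)


@contextmanager
def sketch_span(name, incoming_headers=None):
    """Hypothetical span helper: continue an existing trace or start a new one."""
    headers = incoming_headers or {}
    parent_trace, parent_span = current_ids()
    # Prefer the in-process parent, then the incoming headers, then a fresh ID.
    trace_id = parent_trace or headers.get("X-Trace-ID") or uuid.uuid4().hex
    parent_id = parent_span or headers.get("X-Span-ID")
    span = {
        "name": name,
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
    }
    _context.trace_id, _context.span_id = trace_id, span["span_id"]
    try:
        yield span
    finally:
        # Restore the parent's context when the span ends.
        _context.trace_id, _context.span_id = parent_trace, parent_span


def outgoing_headers():
    """Headers to attach to a downstream HTTP call."""
    trace_id, span_id = current_ids()
    if trace_id is None:
        return {}
    return {"X-Trace-ID": trace_id, "X-Span-ID": span_id}


with sketch_span("handle_request", {"X-Trace-ID": "abc123", "X-Span-ID": "s0"}) as root:
    with sketch_span("sub_operation") as child:
        child_headers = outgoing_headers()
after = current_ids()
```

Thread-local storage does not follow work handed to other threads or to `asyncio` tasks; in those cases the IDs would have to be passed along explicitly or kept in `contextvars`.
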
### Logging (`logging.py`)

Structured JSON logging with:

- Automatic trace/span ID correlation
- SQLite persistence with full-text search
- Multi-tenant support
- Configurable retention (default: 7 days)

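As a rough picture of what trace-correlated JSON logging looks like, the sketch below uses only the standard library `logging` and `json` modules. `JsonFormatter` and the field names in the payload are assumptions made for illustration; the record layout used by this module's `logging.py` is not documented here, and SQLite persistence is left out.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, with trace IDs."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # IDs arrive through `extra=`; they are None outside a trace.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for a file or a database-backed handler
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

sketch_logger = logging.getLogger("my_agent_sketch")
sketch_logger.setLevel(logging.INFO)
sketch_logger.addHandler(handler)
sketch_logger.propagate = False  # keep the example output out of the root logger

sketch_logger.info("Starting execution", extra={"trace_id": "abc123", "span_id": "s1"})
entry = json.loads(stream.getvalue())
```

Because every entry carries `trace_id`, all log lines for one trace can be retrieved with a single equality filter, which is the idea behind the `GET /logs/trace/{trace_id}` endpoint.
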
## API Endpoints

### Metrics
- `GET /metrics` - Prometheus format export

### Tracing
- `GET /traces` - List traces with filters
- `GET /traces/{trace_id}` - Full trace details

### Logging
- `GET /logs` - Search logs with filters
- `GET /logs/trace/{trace_id}` - Logs for a trace
- `GET /logs/stats` - Log statistics
- `POST /logs/cleanup` - Clean old logs

### Health
- `GET /health/detailed` - Component health with details

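Since `GET /metrics` returns Prometheus text, a client can read it with a small parser. The sketch below handles only simple `name{labels} value` lines (no timestamps or escaped label edge cases). The payload is illustrative rather than captured from a running service, and the `component` label on `component_health` is an assumption.

```python
import re

# metric name, optional {label block}, then a single value token
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse simple Prometheus text samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:  # ignore lines this sketch does not understand
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload; label names on component_health are assumed.
body = """\
# HELP component_health Health status (1=healthy, 0=unhealthy)
# TYPE component_health gauge
component_health{component="ledger"} 1
api_requests_total{method="GET",endpoint="/logs",status="200"} 42
"""
samples = parse_metrics(body)
```

In practice a Prometheus server would scrape this endpoint directly; a hand-written parser like this is mainly useful in tests or ad hoc health checks.
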
## Usage

```python
from observability import (
    # Metrics
    registry,
    record_agent_execution,
    record_violation,
    record_promotion,
    MetricsMiddleware,

    # Tracing
    get_tracer,
    get_current_trace_id,

    # Logging
    get_logger
)

# Create a logger
logger = get_logger("my_agent")

# Get the tracer
tracer = get_tracer()

# Trace an operation
with tracer.trace("agent_execution", agent_id="agent-123") as span:
    logger.info("Starting execution", agent_id="agent-123")

    try:
        # Do work...
        with tracer.span("sub_operation") as child:
            # Child span automatically linked
            pass

        record_agent_execution(tier=1, action="update_config", success=True, duration=0.45)

    except Exception as e:
        span.set_error(e)
        record_violation("unauthorized_action", "high")
        logger.error("Execution failed", error=str(e))
```

## FastAPI Integration

```python
from fastapi import FastAPI
from observability import metrics_router, tracing_router, logging_router, MetricsMiddleware

app = FastAPI()

# Add metrics middleware
app.add_middleware(MetricsMiddleware)

# Mount routers
app.include_router(metrics_router)
app.include_router(tracing_router)
app.include_router(logging_router)
```

## Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| `DB_PATH` | `/opt/agent-governance/ledger/governance.db` | SQLite database |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_RETENTION_DAYS` | `7` | Days to retain logs |

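The sketch below shows one way these settings could be consumed. It assumes they are read from environment variables, which this document does not state. It also uses a hypothetical `logs_sketch` table in an in-memory database, since the real schema is not shown here. The retention step mirrors what `POST /logs/cleanup` is described as doing, without claiming to be its implementation.

```python
import os
import sqlite3
from datetime import datetime, timedelta, timezone


def load_settings(env=None):
    """Read the three settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "db_path": env.get("DB_PATH", "/opt/agent-governance/ledger/governance.db"),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        "retention_days": int(env.get("LOG_RETENTION_DAYS", "7")),
    }


def cleanup_logs(conn, retention_days, now=None):
    """Delete rows older than the retention window; return how many were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    # ISO 8601 UTC strings written in one consistent format compare chronologically.
    cursor = conn.execute("DELETE FROM logs_sketch WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


settings = load_settings({"LOG_RETENTION_DAYS": "7"})

# In-memory database with a hypothetical table, used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs_sketch (timestamp TEXT, message TEXT)")
now = datetime(2026, 1, 24, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO logs_sketch VALUES (?, ?)",
    [
        ((now - timedelta(days=10)).isoformat(), "old entry"),
        ((now - timedelta(days=1)).isoformat(), "recent entry"),
    ],
)
removed = cleanup_logs(conn, settings["retention_days"], now=now)
remaining = [row[0] for row in conn.execute("SELECT message FROM logs_sketch")]
```

Passing `now` explicitly keeps the retention logic deterministic in tests; in normal use it would default to the current UTC time.
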
## Status

**Complete**

See [STATUS.md](./STATUS.md) for detailed progress tracking.

## Architecture Reference

Part of the [Agent Governance System](/opt/agent-governance/docs/ARCHITECTURE.md).

Parent: [Project Root](/opt/agent-governance)

---
*Last updated: 2026-01-24 UTC*