profit 8c6e7831e9 Add Phase 10-12 implementation: multi-tenant, marketplace, observability
Major additions:
- marketplace/: Agent template registry with FTS5 search, ratings, versioning
- observability/: Prometheus metrics, distributed tracing, structured logging
- ledger/migrations/: Database migration scripts for multi-tenant support
- tests/governance/: 15 new test files for phases 6-12 (295 total tests)
- bin/validate-phases: Full 12-phase validation script

New features:
- Multi-tenant support with tenant isolation and quota enforcement
- Agent marketplace with semantic versioning and search
- Observability with metrics, tracing, and log correlation
- Tier-1 agent bootstrap scripts

Updated components:
- ledger/api.py: Extended API for tenants, marketplace, observability
- ledger/schema.sql: Added tenant, project, marketplace tables
- testing/framework.ts: Enhanced test framework
- checkpoint/checkpoint.py: Improved checkpoint management

Archived:
- External integrations (Slack/GitHub/PagerDuty) moved to .archive/
- Old checkpoint files cleaned up

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 18:39:47 -05:00


# Observability
> Metrics, tracing, and structured logging for the agent governance system
## Overview
This module provides the observability infrastructure for the agent governance system: Prometheus-format metrics, distributed tracing with span propagation, and structured JSON logging correlated with trace and span IDs.
## Key Files
| File | Description |
|------|-------------|
| `metrics.py` | Prometheus metrics (Counter, Gauge, Histogram) with FastAPI router |
| `tracing.py` | Distributed tracing with Span/Trace classes and context propagation |
| `logging.py` | Structured JSON logging with SQLite persistence |
| `__init__.py` | Module exports and unified API |
## Components
### Metrics (`metrics.py`)
Prometheus-format metrics with automatic collection:
| Metric | Type | Description |
|--------|------|-------------|
| `agent_executions_total` | Counter | Total executions by tier, action, status |
| `agent_execution_duration_seconds` | Histogram | Execution latency distribution |
| `agent_violations_total` | Counter | Violations by type and severity |
| `agent_promotions_total` | Counter | Tier promotions |
| `api_requests_total` | Counter | API requests by method, endpoint, status |
| `api_request_duration_seconds` | Histogram | Request latency |
| `component_health` | Gauge | Health status (1=healthy, 0=unhealthy) |
| `tenant_quota_usage_ratio` | Gauge | Quota usage per tenant |
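
To make the table above concrete, here is a minimal sketch of how a labelled counter such as `agent_executions_total` could be rendered in the Prometheus text exposition format. It uses only the standard library; the `Counter` class and its `inc`/`render` methods are illustrative assumptions and are not the API of `metrics.py`.

```python
from collections import defaultdict


class Counter:
    """Minimal labelled counter (hypothetical; not the class in metrics.py)."""

    def __init__(self, name, description, label_names):
        self.name = name
        self.description = description
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Key each time series by its label values, in declared order.
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.description}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            labels = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"


executions = Counter(
    "agent_executions_total",
    "Total executions by tier, action, status",
    ["tier", "action", "status"],
)
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=2, action="deploy", status="failure")

metrics_text = executions.render()
print(metrics_text)
```

Each distinct combination of label values becomes its own series, which is why the table describes the counters as broken down "by tier, action, status".
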
### Tracing (`tracing.py`)
Distributed tracing with automatic context propagation:
- **Span**: Individual operation with timing, attributes, events
- **Trace**: Collection of related spans
- **Context Propagation**: Thread-local storage + HTTP headers (`X-Trace-ID`, `X-Span-ID`)
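
The propagation scheme described above can be sketched with thread-local storage and the two headers. This is an illustration under assumptions, not the code in `tracing.py`: the `span`, `current_ids`, and `outgoing_headers` helpers and the ID formats are hypothetical, and only the header names `X-Trace-ID` and `X-Span-ID` come from this README.

```python
import threading
import uuid
from contextlib import contextmanager

_context = threading.local()  # per-thread (trace_id, span_id) state


def current_ids():
    """Return (trace_id, span_id) for this thread, or (None, None)."""
    return getattr(_context, "trace_id", None), getattr(_context, "span_id", None)


@contextmanager
def span(name, incoming_headers=None):
    """Open a span, continuing a trace from incoming headers when present."""
    headers = incoming_headers or {}
    parent_trace, parent_span = current_ids()
    trace_id = headers.get("X-Trace-ID") or parent_trace or uuid.uuid4().hex
    parent_id = headers.get("X-Span-ID") or parent_span
    span_id = uuid.uuid4().hex[:16]

    _context.trace_id, _context.span_id = trace_id, span_id
    try:
        yield {"name": name, "trace_id": trace_id,
               "span_id": span_id, "parent_id": parent_id}
    finally:
        # Restore the enclosing span so later siblings get the right parent.
        _context.trace_id, _context.span_id = parent_trace, parent_span


def outgoing_headers():
    """Headers to attach to a downstream HTTP call."""
    trace_id, span_id = current_ids()
    return {"X-Trace-ID": trace_id, "X-Span-ID": span_id}


with span("agent_execution") as root:
    with span("sub_operation") as child:
        downstream = outgoing_headers()

# A receiving service continues the same trace from the headers.
with span("remote_handler", incoming_headers=downstream) as remote:
    pass
```

Within a thread, nested spans pick up their parent from the thread-local state. Across services, the headers carry the same two IDs so the receiver can attach its spans to the caller's trace.
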
### Logging (`logging.py`)
Structured JSON logging with:
- Automatic trace/span ID correlation
- SQLite persistence with full-text search
- Multi-tenant support
- Configurable retention (default: 7 days)
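
As a rough sketch of the features listed above, the following shows one way to build a JSON log record that carries a trace ID, persist it to SQLite, and apply a retention window. The table layout and the `log` and `cleanup` helpers are assumptions for illustration and do not reflect the schema or API of `logging.py`. Full-text search is omitted to keep the example short.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # in-memory for the example only
conn.execute(
    "CREATE TABLE logs (ts REAL, level TEXT, logger TEXT, message TEXT,"
    " trace_id TEXT, tenant_id TEXT, payload TEXT)"
)


def log(level, logger, message, trace_id=None, tenant_id=None, ts=None, **fields):
    """Build one JSON log record, persist it, and return the JSON string."""
    record = {
        "ts": time.time() if ts is None else ts,
        "level": level,
        "logger": logger,
        "message": message,
        "trace_id": trace_id,
        "tenant_id": tenant_id,
        **fields,
    }
    payload = json.dumps(record, sort_keys=True)
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (record["ts"], level, logger, message, trace_id, tenant_id, payload),
    )
    return payload


def cleanup(retention_days=7, now=None):
    """Delete records older than the retention window; return rows removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    return conn.execute("DELETE FROM logs WHERE ts < ?", (cutoff,)).rowcount


now = time.time()
log("INFO", "my_agent", "Starting execution", trace_id="abc123", agent_id="agent-123")
log("INFO", "my_agent", "Stale entry", trace_id="old", ts=now - 8 * 86400)

removed = cleanup(retention_days=7, now=now)
rows = conn.execute(
    "SELECT payload FROM logs WHERE trace_id = ?", ("abc123",)
).fetchall()
```

Storing the trace ID in its own column is what makes a lookup like `GET /logs/trace/{trace_id}` cheap, while the full JSON payload keeps any extra fields passed by the caller.
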
## API Endpoints
### Metrics
- `GET /metrics` - Prometheus format export
### Tracing
- `GET /traces` - List traces with filters
- `GET /traces/{trace_id}` - Full trace details
### Logging
- `GET /logs` - Search logs with filters
- `GET /logs/trace/{trace_id}` - Logs for a trace
- `GET /logs/stats` - Log statistics
- `POST /logs/cleanup` - Clean old logs
### Health
- `GET /health/detailed` - Component health with details
## Usage
```python
from observability import (
    # Metrics
    registry,
    record_agent_execution,
    record_violation,
    record_promotion,
    MetricsMiddleware,
    # Tracing
    get_tracer,
    get_current_trace_id,
    # Logging
    get_logger,
)

# Create a logger
logger = get_logger("my_agent")

# Get the tracer
tracer = get_tracer()

# Trace an operation
with tracer.trace("agent_execution", agent_id="agent-123") as span:
    logger.info("Starting execution", agent_id="agent-123")

    try:
        # Do work...
        with tracer.span("sub_operation") as child:
            # Child span automatically linked
            pass

        record_agent_execution(tier=1, action="update_config", success=True, duration=0.45)

    except Exception as e:
        span.set_error(e)
        record_violation("unauthorized_action", "high")
        logger.error("Execution failed", error=str(e))
```
## FastAPI Integration
```python
from fastapi import FastAPI
from observability import metrics_router, tracing_router, logging_router, MetricsMiddleware

app = FastAPI()

# Add metrics middleware
app.add_middleware(MetricsMiddleware)

# Mount routers
app.include_router(metrics_router)
app.include_router(tracing_router)
app.include_router(logging_router)
```
## Configuration
| Setting | Default | Description |
|---------|---------|-------------|
| `DB_PATH` | `/opt/agent-governance/ledger/governance.db` | SQLite database |
| `LOG_LEVEL` | `INFO` | Minimum log level |
| `LOG_RETENTION_DAYS` | `7` | Days to retain logs |
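
This README does not say how these settings are supplied. Assuming they are read from environment variables, a loader that applies the documented defaults might look like the sketch below. `ObservabilityConfig` and `load_config` are hypothetical names, not part of the module.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilityConfig:
    db_path: str
    log_level: str
    log_retention_days: int


def load_config(env=None):
    """Read settings from a mapping (os.environ by default) with documented defaults."""
    env = os.environ if env is None else env
    return ObservabilityConfig(
        db_path=env.get("DB_PATH", "/opt/agent-governance/ledger/governance.db"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        # Assumption: the retention value arrives as a string and is an integer.
        log_retention_days=int(env.get("LOG_RETENTION_DAYS", "7")),
    )


defaults = load_config(env={})
custom = load_config(env={"LOG_LEVEL": "debug", "LOG_RETENTION_DAYS": "30"})
```

Passing the mapping explicitly keeps the loader easy to test without touching the process environment.
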
## Status
**Complete**

See [STATUS.md](./STATUS.md) for detailed progress tracking.
## Architecture Reference
Part of the [Agent Governance System](/opt/agent-governance/docs/ARCHITECTURE.md).

Parent: [Project Root](/opt/agent-governance)
---
*Last updated: 2026-01-24 UTC*