
Observability

Metrics, tracing, and structured logging for the agent governance system

Overview

This module provides the observability layer for the agent governance system: Prometheus-format metrics, distributed tracing with span propagation, and structured JSON logging correlated with trace and span IDs.

Key Files

| File | Description |
|------|-------------|
| metrics.py | Prometheus metrics (Counter, Gauge, Histogram) with FastAPI router |
| tracing.py | Distributed tracing with Span/Trace classes and context propagation |
| logging.py | Structured JSON logging with SQLite persistence |
| __init__.py | Module exports and unified API |

Components

Metrics (metrics.py)

Prometheus-format metrics with automatic collection:

| Metric | Type | Description |
|--------|------|-------------|
| agent_executions_total | Counter | Total executions by tier, action, status |
| agent_execution_duration_seconds | Histogram | Execution latency distribution |
| agent_violations_total | Counter | Violations by type and severity |
| agent_promotions_total | Counter | Tier promotions |
| api_requests_total | Counter | API requests by method, endpoint, status |
| api_request_duration_seconds | Histogram | Request latency |
| component_health | Gauge | Health status (1=healthy, 0=unhealthy) |
| tenant_quota_usage_ratio | Gauge | Quota usage per tenant |
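
To make "Prometheus format" concrete, here is a minimal sketch of how a labelled counter such as agent_executions_total could be rendered in the Prometheus text exposition format. SketchCounter is a hypothetical stand-in written for this example; it is not the Counter class in metrics.py, and it skips label-value escaping.

```python
from collections import defaultdict


class SketchCounter:
    """Toy labelled counter for illustration; not the module's Counter class."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Key each time series by its label values, in declaration order.
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # Text exposition: a HELP line, a TYPE line, then one sample per label set.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


executions = SketchCounter(
    "agent_executions_total",
    "Total executions by tier, action, status",
    ["tier", "action", "status"],
)
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=1, action="update_config", status="success")
executions.inc(tier=2, action="deploy", status="failure")
print(executions.render())
```

Each distinct combination of label values becomes its own sample line, which is why high-cardinality labels (for example raw agent IDs) are usually kept out of metrics like these.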

Tracing (tracing.py)

Distributed tracing with automatic context propagation:

  • Span: Individual operation with timing, attributes, events
  • Trace: Collection of related spans
  • Context Propagation: Thread-local storage + HTTP headers (X-Trace-ID, X-Span-ID)

Logging (logging.py)

Structured JSON logging with:

  • Automatic trace/span ID correlation
  • SQLite persistence with full-text search
  • Multi-tenant support
  • Configurable retention (default: 7 days)
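
As a rough sketch of how these pieces could fit together, the example below serializes each event as JSON, stores it in SQLite with its tenant and trace ID, and deletes records older than the retention window. The table layout and the function names (log_event, cleanup) are assumptions made for this example rather than the schema or API of logging.py, and full-text search is left out.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # in-memory for the sketch; the module uses a file
conn.execute(
    "CREATE TABLE logs (ts REAL, level TEXT, tenant_id TEXT, trace_id TEXT, record TEXT)"
)


def log_event(level, message, tenant_id, trace_id, ts=None, **fields):
    """Serialize one event as a JSON record and persist it."""
    ts = time.time() if ts is None else ts
    record = json.dumps(
        {"ts": ts, "level": level, "message": message,
         "tenant_id": tenant_id, "trace_id": trace_id, **fields},
        sort_keys=True,
    )
    conn.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?)",
        (ts, level, tenant_id, trace_id, record),
    )
    return record


def cleanup(retention_days=7, now=None):
    """Delete records older than the retention window; return rows removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    cur = conn.execute("DELETE FROM logs WHERE ts < ?", (cutoff,))
    return cur.rowcount


now = time.time()
log_event("INFO", "Starting execution", "tenant-a", "trace-1", agent_id="agent-123")
log_event("ERROR", "Old failure", "tenant-a", "trace-0", ts=now - 8 * 86400)

removed = cleanup(retention_days=7, now=now)
rows = conn.execute(
    "SELECT record FROM logs WHERE trace_id = ?", ("trace-1",)
).fetchall()
```

Storing trace_id as its own column is what makes a lookup like "all logs for this trace" a simple indexed query instead of a scan over JSON blobs.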

API Endpoints

Metrics

  • GET /metrics - Prometheus format export

Tracing

  • GET /traces - List traces with filters
  • GET /traces/{trace_id} - Full trace details

Logging

  • GET /logs - Search logs with filters
  • GET /logs/trace/{trace_id} - Logs for a trace
  • GET /logs/stats - Log statistics
  • POST /logs/cleanup - Clean old logs

Health

  • GET /health/detailed - Component health with details
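
A small client-side sketch of addressing these endpoints is shown below. The base URL is an assumption, and the query parameter names ("level", "limit") are hypothetical placeholders: this README does not list the filters each endpoint accepts.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumption: wherever the FastAPI app is served


def build_url(path, filters=None):
    """Join the base URL, an endpoint path, and optional query filters."""
    url = BASE_URL + path
    if filters:
        url += "?" + urlencode(filters)
    return url


def trace_logs_url(trace_id):
    """URL for GET /logs/trace/{trace_id}, with the ID percent-encoded."""
    return build_url("/logs/trace/" + quote(trace_id, safe=""))


# "level" and "limit" are hypothetical filter names, used only for illustration.
search_url = build_url("/logs", {"level": "ERROR", "limit": 50})
trace_url = trace_logs_url("abc123")
print(search_url)
print(trace_url)
```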

Usage

from observability import (
    # Metrics
    registry,
    record_agent_execution,
    record_violation,
    record_promotion,
    MetricsMiddleware,

    # Tracing
    get_tracer,
    get_current_trace_id,

    # Logging
    get_logger
)

# Create a logger
logger = get_logger("my_agent")

# Get the tracer
tracer = get_tracer()

# Trace an operation
with tracer.trace("agent_execution", agent_id="agent-123") as span:
    logger.info("Starting execution", agent_id="agent-123")

    try:
        # Do work...
        with tracer.span("sub_operation") as child:
            # Child span automatically linked
            pass

        record_agent_execution(tier=1, action="update_config", success=True, duration=0.45)

    except Exception as e:
        span.set_error(e)
        record_violation("unauthorized_action", "high")
        logger.error("Execution failed", error=str(e))

FastAPI Integration

from fastapi import FastAPI
from observability import metrics_router, tracing_router, logging_router, MetricsMiddleware

app = FastAPI()

# Add metrics middleware
app.add_middleware(MetricsMiddleware)

# Mount routers
app.include_router(metrics_router)
app.include_router(tracing_router)
app.include_router(logging_router)

Configuration

| Setting | Default | Description |
|---------|---------|-------------|
| DB_PATH | /opt/agent-governance/ledger/governance.db | SQLite database |
| LOG_LEVEL | INFO | Minimum log level |
| LOG_RETENTION_DAYS | 7 | Days to retain logs |
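
This README does not say how these settings are supplied. Assuming they are read from environment variables (an assumption, not something confirmed here), loading them with the documented defaults might look like this. ObservabilitySettings and load_settings are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilitySettings:
    db_path: str
    log_level: str
    log_retention_days: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return ObservabilitySettings(
        db_path=env.get("DB_PATH", "/opt/agent-governance/ledger/governance.db"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        # Environment values are strings, so the retention period is converted explicitly.
        log_retention_days=int(env.get("LOG_RETENTION_DAYS", "7")),
    )


defaults = load_settings(env={})
custom = load_settings(env={"LOG_LEVEL": "debug", "LOG_RETENTION_DAYS": "30"})
```

Passing the mapping in as a parameter keeps the loader easy to test without touching the real process environment.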

Status

Complete

See STATUS.md for detailed progress tracking.

Architecture Reference

Part of the Agent Governance System.

Parent: Project Root


Last updated: 2026-01-24 UTC