golangLAKEHOUSE/internal/shared/langfuse_middleware.go
root d6d2fdf81f trace-id propagation through /v1/iterate (multi-call observability)
Closes J's 2026-05-02 multi-call observability gap: a single
/v1/iterate session with N retries used to surface in Langfuse as
N+1 disconnected traces (one per /v1/chat hop + one for the iterate
request itself), with no parent/child linkage. Operators couldn't
scroll the retry chain in one trace tree to spot where grounding
failed.

## Wire-level change

- New header constant `shared.TraceIDHeader = "X-Lakehouse-Trace-Id"`
- `langfuseMiddleware` honors the header on inbound requests: if
  set, reuses that trace id instead of minting a new one. Stashes
  the trace id on the request context so handlers can attach
  application-level child spans.
- `validatord.chatCaller` forwards the header to chatd. Every chat
  hop in an iterate session lands as a child of the parent trace.

## Application-level spans

- `validator.IterateConfig` gains `Tracer` (optional callback).
  When wired, each iteration attempt emits one Langfuse span
  via `validator.AttemptSpan`:
    Name: iterate.attempt[N]
    Input: { iteration, model, provider, prompt }
    Output: { verdict, raw, error }
    Level: WARNING when verdict != accepted
- `validatord.iterTracer` is the production hook — bridges
  `validator.Tracer` → `langfuse.Client.Span`.
- `IterateRequest`/`IterateResponse`/`IterateFailure` gain
  `TraceID`; each `IterateAttempt` gains `SpanID`. The /v1/iterate
  caller can pivot from the JSON response straight into the
  Langfuse trace tree.
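The emission contract above can be sketched in a few lines; the struct field names here illustrate the shape the commit describes and are not the repo's exact definitions:

```go
package main

import "fmt"

// AttemptSpan carries the per-attempt payload described above.
// Field names are an illustration of the shape, not the exact struct.
type AttemptSpan struct {
	Iteration int
	Verdict   string
	Level     string
}

// Tracer is the optional callback IterateConfig gains; a nil
// Tracer means the loop emits no spans.
type Tracer func(AttemptSpan)

// runAttempts sketches the iterate loop's emission contract:
// one span per attempt, Level WARNING whenever the verdict is
// not "accepted".
func runAttempts(verdicts []string, trace Tracer) {
	for i, v := range verdicts {
		if trace == nil {
			continue
		}
		level := ""
		if v != "accepted" {
			level = "WARNING"
		}
		trace(AttemptSpan{Iteration: i, Verdict: v, Level: level})
	}
}

func main() {
	runAttempts([]string{"validation_failed", "accepted"}, func(s AttemptSpan) {
		fmt.Printf("iterate.attempt[%d] verdict=%s level=%q\n", s.Iteration, s.Verdict, s.Level)
	})
}
```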

## What an operator sees post-cutover

  GET /v1/iterate {kind=fill, prompt=...} → Trace TR-1
    ├─ http.request span (from middleware)
    ├─ iterate.attempt[0] span (validator.Iterate emit)
    │     input: prompt+model
    │     output: { verdict: validation_failed, error: ..., raw }
    ├─ chatd /v1/chat call (X-Lakehouse-Trace-Id: TR-1)
    │     ├─ http.request span (chatd middleware)
    │     └─ chatd-internal spans (existing)
    ├─ iterate.attempt[1] span
    └─ ...

All in one Langfuse trace tree, not N+1 separate traces.

## Hallucinated-worker safety net is unchanged

The /v1/iterate flow's hard correctness gate is still
FillValidator + WorkerLookup. Phantom candidate IDs raise
ValidationError::Consistency which 422s and forces the iteration
loop to retry. The trace-id propagation is the OBSERVABILITY layer
on top — it makes the existing safety net's outcomes visible per-call,
not a replacement for it.

## Verification

- internal/validator: 4 new tests
  - TestIterate_TracerEmitsSpanPerAttempt — span/attempt count + SpanID
  - TestIterate_NoTraceIDSkipsTracer — no orphan spans without trace_id
  - TestIterate_ChatCallerReceivesTraceID — propagation contract
  - (existing iterate tests updated for new ChatCaller signature)
- internal/shared: 1 new test
  - TestLangfuseMiddleware_HonorsTraceIDHeader — cross-service linkage
- cmd/validatord: existing HTTP tests still PASS via the dual-shape
  UnmarshalJSON contract.
- validatord_smoke.sh: 5/5 PASS through gateway :3110 (unchanged).
- Full go test ./... green across 33 packages.

## Architecture invariant added

STATE_OF_PLAY "DO NOT RELITIGATE" gains a paragraph documenting
the X-Lakehouse-Trace-Id header contract + the iterate.attempt[N]
span emission. Future-Claude won't re-propose "wire trace-id
propagation" — the header IS the wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:13:18 -05:00


package shared

import (
	"context"
	"net/http"
	"os"
	"time"

	"git.agentview.dev/profit/golangLAKEHOUSE/internal/langfuse"
)
// TraceIDHeader propagates a Langfuse trace id across services. When
// validatord makes a /v1/iterate call that internally calls chatd's
// /v1/chat, validatord sends this header so both daemons' middleware
// emit spans under the SAME trace tree (rather than two unrelated
// traces). Closes the multi-call observability gap J flagged
// 2026-05-02 ("we need to make sure they have the corpus of
// information to complete the process and we want to spot errors").
const TraceIDHeader = "X-Lakehouse-Trace-Id"
// traceIDCtxKey is the context value key for the per-request trace id.
// Handlers downstream of langfuseMiddleware can pull it via TraceIDFromCtx
// to attach child spans (e.g. iteration-attempt spans inside validatord).
type traceIDCtxKey struct{}
// TraceIDFromCtx returns the per-request Langfuse trace id, or "" if
// the middleware didn't set one (Langfuse not configured / /health
// bypass / no Client wired).
func TraceIDFromCtx(ctx context.Context) string {
	if v, ok := ctx.Value(traceIDCtxKey{}).(string); ok {
		return v
	}
	return ""
}
// langfuseMiddleware emits one Langfuse trace per HTTP request, with
// a single span carrying start/end timestamps + status code. Per
// OPEN item #2 (closed by the wave that adds this file): production
// traffic gets free trace visibility without per-handler wiring.
//
// nil client → returns a passthrough no-op middleware so callers
// don't need a nil check in shared.Run. Same fail-open posture as
// Langfuse's queue layer (per ADR-005 Decision 5.1: observability
// is a witness, never a gate).
//
// /health bypasses tracing — operators don't want every LB probe
// or monitor heartbeat polluting traces. Real traffic surfaces
// only via the registered routes.
func langfuseMiddleware(serviceName string, lf *langfuse.Client) func(http.Handler) http.Handler {
	if lf == nil {
		return func(next http.Handler) http.Handler { return next }
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// /health bypasses tracing — same exemption logic as
			// the auth middleware (see RequireAuth).
			if r.URL.Path == "/health" {
				next.ServeHTTP(w, r)
				return
			}
			start := time.Now()
			sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
			// If the caller forwarded a trace id (cross-service parent
			// linkage), reuse it instead of starting a new trace. Spans
			// from this service then attach to the parent trace tree,
			// so a /v1/iterate session shows as one trace with
			// children for each /v1/chat hop.
			traceID := r.Header.Get(TraceIDHeader)
			if traceID == "" {
				traceID = lf.Trace(r.Context(), langfuse.TraceInput{
					Name: serviceName + " " + r.Method + " " + r.URL.Path,
					Tags: []string{serviceName, r.Method},
					Metadata: map[string]any{
						"path":        r.URL.Path,
						"method":      r.Method,
						"remote_addr": r.RemoteAddr,
					},
				})
			}
			// Stash the trace id on the request context so downstream
			// handlers can attach finer-grained spans (e.g. one per
			// iteration attempt inside validator.Iterate).
			r = r.WithContext(context.WithValue(r.Context(), traceIDCtxKey{}, traceID))
			next.ServeHTTP(sw, r)
			level := ""
			if sw.status >= 500 {
				level = "ERROR"
			} else if sw.status >= 400 {
				level = "WARNING"
			}
			lf.Span(r.Context(), langfuse.SpanInput{
				TraceID: traceID,
				Name:    "http.request",
				Input: map[string]any{
					"method":      r.Method,
					"path":        r.URL.Path,
					"remote_addr": r.RemoteAddr,
				},
				Output: map[string]any{
					"status":      sw.status,
					"duration_ms": time.Since(start).Milliseconds(),
				},
				StartTime:  start,
				EndTime:    time.Now(),
				StatusCode: sw.status,
				Level:      level,
			})
		})
	}
}
// statusWriter is the standard "wrap http.ResponseWriter to capture
// the status code" trick. WriteHeader is the only method that
// changes status; any handler that doesn't call WriteHeader gets
// the implicit 200 from our struct's default.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (sw *statusWriter) WriteHeader(code int) {
	sw.status = code
	sw.ResponseWriter.WriteHeader(code)
}
// LoadLangfuseFromEnv builds a langfuse.Client from environment
// variables. Returns nil if any of LANGFUSE_URL / LANGFUSE_PUBLIC_KEY
// / LANGFUSE_SECRET_KEY is unset (best-effort: missing config means
// no tracing, never a startup error). Same env names as the bare
// /etc/lakehouse/langfuse.env file used by the multi_coord_stress
// driver — operators ship one env file across every daemon.
//
// Exported 2026-05-02 so daemons that need to emit application-level
// child spans (validatord's iterate-attempt spans) can hold their own
// reference to the same client `shared.Run` is already wiring into
// the middleware.
func LoadLangfuseFromEnv() *langfuse.Client {
	url := os.Getenv("LANGFUSE_URL")
	pk := os.Getenv("LANGFUSE_PUBLIC_KEY")
	sk := os.Getenv("LANGFUSE_SECRET_KEY")
	if url == "" || pk == "" || sk == "" {
		return nil
	}
	return langfuse.New(url, pk, sk, nil)
}