Closes the second half of J's 2026-05-02 multi-call observability
concern. Trace-id propagation (commit d6d2fdf) gave us the *live*
view in Langfuse; this gives us the *longitudinal* view for ad-hoc
DuckDB queries over thousands of sessions:
- "show me every session where the model produced a real candidate
  without ever needing a retry"
- "find sessions where validation rejected three times in a row"
- "first-shot success rate per model — did we feed it enough corpus?"
## What's in
internal/validator/session_log.go:
- SessionRecord type (schema=session.iterate.v1)
- SessionLogger writer — mutex-guarded append, best-effort posture,
nil-safe (NewSessionLogger("") = nil = no-op on Append)
- BuildSessionRecord helper — assembles a row from any
iterate response/failure/infra-error combination, callable from
other daemons that wrap iterate (cross-daemon shared schema)
- 7 unit tests including concurrent-append safety + the three
code paths (success / max_iter_exhausted / infra_error)
cmd/validatord/main.go:
- handlers.sessionLog field + wiring from cfg.Validatord.SessionLogPath
- Iterate handler: build + append a SessionRecord on every call
- rosterCheckFor("fill") closure stamps grounded_in_roster — the
load-bearing forensic property J flagged ("we can never
hallucinate available staff members to contracts")
internal/shared/config.go + lakehouse.toml:
- [validatord].session_log_path field; empty = disabled
- Production: /var/lib/lakehouse/validator/sessions.jsonl
scripts/validatord_smoke.sh:
- Adds a probe verifying that validatord announces the session log path
  on startup. Smoke is now 6/6 (was 5/5).
docs/SESSION_LOG.md:
- Schema reference + 5 worked DuckDB query examples including the
"alarm" query (sessions where grounded_in_roster=false on an
accepted fill — should always be empty; if not, something is
bypassing FillValidator).
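The writer's posture described above (mutex-guarded, nil-safe, best-effort) can be sketched in a few lines of Go. This is a sketch of the shape only, assuming the names from this commit (`SessionLogger`, `NewSessionLogger`, `Append`); the real internal/validator implementation will differ:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sync"
)

// SessionLogger sketch: append-only JSONL writer, guarded by a mutex.
type SessionLogger struct {
	mu   sync.Mutex
	path string
}

// NewSessionLogger returns nil when path is empty; a nil logger makes
// Append a no-op, so callers never branch on "is logging enabled?".
func NewSessionLogger(path string) *SessionLogger {
	if path == "" {
		return nil // disabled
	}
	return &SessionLogger{path: path}
}

// Append marshals the record and appends one JSONL row. Errors are
// swallowed (best-effort posture): a failed write must never fail iterate.
func (l *SessionLogger) Append(record any) {
	if l == nil {
		return // nil-safe no-op
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	line, err := json.Marshal(record)
	if err != nil {
		return
	}
	f, err := os.OpenFile(l.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return // a missing volume must not block the iterate handler
	}
	defer f.Close()
	_, _ = f.Write(append(line, '\n'))
}

func main() {
	disabled := NewSessionLogger("")
	disabled.Append(map[string]string{"schema": "session.iterate.v1"}) // no-op
	fmt.Println("disabled logger is nil:", disabled == nil)           // prints: disabled logger is nil: true
}
```

Because a nil `*SessionLogger` is a valid receiver, handlers wired from an empty `session_log_path` need no enabled/disabled branching.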
## What this is NOT
This is NOT a duplicate of replay_runs.jsonl. They're siblings:
- replay_runs.jsonl: replay tool's per-task retrieval+model output
- sessions.jsonl: validatord's per-iterate full retry chain +
grounded-in-roster verdict
A single coordinator session can produce rows in both streams; the
session_id (= Langfuse trace_id) is the join key.
## Layered observability now in place
- Live view: Langfuse trace tree (X-Lakehouse-Trace-Id propagation), with
  `iterate.attempt[N]` spans carrying prompt/raw/verdict
- Offline: coordinator_sessions.jsonl (this commit), DuckDB-queryable for
  longitudinal forensics
- Hard gate: FillValidator + WorkerLookup (existing); phantom IDs are
  structurally rejected and never reach the session log's
  grounded_in_roster=true bucket
Per the architecture invariant in STATE_OF_PLAY's DO NOT RELITIGATE
section — these layers are wired; future work targets the data, not
the wiring.
## Verification
- internal/validator: 7 new tests (session_log_test.go) — all PASS
- cmd/validatord: 3 new integration tests covering the success,
failure, and grounded=false paths — all PASS
- validatord_smoke.sh: 6/6 PASS through gateway :3110
- Full go test ./... green across 33 packages
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Coordinator session log — coordinator_sessions.jsonl

Last updated: 2026-05-02 · Schema: session.iterate.v1 · Writer: internal/validator.SessionLogger · Producer: validatord /v1/iterate
## Why
The Langfuse trace tree is the live view: per-session, you can scroll the retry chain and inspect every sub-call. But for longitudinal forensics ("show me every session in the last week where the model guessed a real worker without a retry," or "find sessions where validation rejected three times in a row"), Langfuse's UI doesn't scale — you need a queryable data plane.
This JSONL is that plane. One row per /v1/iterate session, append-only, DuckDB-friendly.
## Where

Configurable via `[validatord].session_log_path` in lakehouse.toml. Empty = disabled (best-effort posture; a missing log never blocks an iterate request). Production:

```toml
[validatord]
session_log_path = "/var/lib/lakehouse/validator/sessions.jsonl"
```
## Schema (v1)

```jsonc
{
  "schema": "session.iterate.v1",
  "session_id": "<Langfuse trace_id>",          // join key to Langfuse
  "timestamp": "2026-05-02T07:30:00.123456Z",
  "daemon": "validatord",
  "kind": "fill | email | playbook",
  "model": "qwen3.5:latest",
  "provider": "ollama",
  "prompt": "produce a fill artifact ...",      // truncated to 4000 chars
  "iterations": 3,                              // attempts spent
  "max_iterations": 3,                          // cap per request
  "final_verdict": "accepted | max_iter_exhausted | infra_error",
  "attempts": [
    { "iteration": 0, "verdict_kind": "validation_failed",
      "error": "consistency: candidate_id W-X not in roster",
      "span_id": "abc..." },
    { "iteration": 1, "verdict_kind": "validation_failed",
      "error": "consistency: city mismatch", "span_id": "def..." },
    { "iteration": 2, "verdict_kind": "accepted", "span_id": "ghi..." }
  ],
  "artifact": { /* final accepted artifact, omitted on failure */ },
  "grounded_in_roster": true,                   // null when N/A (email/playbook)
  "duration_ms": 2840
}
```
## Field semantics

| Field | When set | What it means |
|---|---|---|
| `session_id` | always | Langfuse trace id. Pivot to the live trace tree by URL: `${LANGFUSE_URL}/trace/<session_id>`. |
| `final_verdict=accepted` | success | Loop converged within max_iterations. `artifact` is non-null. |
| `final_verdict=max_iter_exhausted` | failure | Loop hit the cap without passing validation. `artifact` is omitted. |
| `final_verdict=infra_error` | failure | Chat hop or other infra crashed. Single attempt with `verdict_kind=infra_error`. |
| `grounded_in_roster=true` | fill kind, success | Every candidate_id in the artifact exists in WorkerLookup. |
| `grounded_in_roster=false` | fill kind, anomaly | Phantom or otherwise-invalid candidate IDs (shouldn't happen — FillValidator catches these — but the explicit check defends against future validator weakening). |
| `grounded_in_roster=null` (omitted) | non-fill kinds, or failure | The roster check doesn't apply or wasn't run. |
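Conceptually, the `grounded_in_roster` stamp is a set-membership check over the artifact's candidate IDs. A minimal sketch with hypothetical types (`Candidate`, a plain roster map); the real path goes through `WorkerLookup` and validatord's `rosterCheckFor` closure:

```go
package main

import "fmt"

// Candidate is a hypothetical stand-in for the fill artifact's entries.
type Candidate struct{ ID string }

// groundedInRoster reports whether every candidate ID resolves against
// the roster, mirroring the grounded_in_roster stamp described above.
func groundedInRoster(candidates []Candidate, roster map[string]bool) bool {
	for _, c := range candidates {
		if !roster[c.ID] {
			return false // phantom ID: the alarm condition
		}
	}
	return true
}

func main() {
	roster := map[string]bool{"W-1": true, "W-2": true}
	fmt.Println(groundedInRoster([]Candidate{{ID: "W-1"}}, roster)) // true
	fmt.Println(groundedInRoster([]Candidate{{ID: "W-9"}}, roster)) // false: phantom
}
```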
## DuckDB queries

```sql
-- Read the log directly via DuckDB's read_json_auto.
SELECT * FROM read_json_auto(
  '/var/lib/lakehouse/validator/sessions.jsonl', format='newline_delimited'
) LIMIT 10;
```
"Did the validator catch every phantom worker?"

```sql
-- Accepted sessions that took more than one attempt and where some
-- rejected attempt's error mentions 'consistency' or 'phantom': the
-- validator caught the bad candidate and the model recovered.
SELECT session_id, model, iterations, grounded_in_roster, final_verdict
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE final_verdict = 'accepted'
  AND iterations > 1
  AND list_contains(
        list_transform(attempts,
          x -> x.error LIKE '%consistency%' OR x.error LIKE '%phantom%'),
        true
      );
```
"First-shot success rate per model" (the "did the corpus give it enough" gate)

```sql
SELECT model,
       COUNT(*) AS sessions,
       SUM(CASE WHEN iterations = 1 AND final_verdict = 'accepted' THEN 1 ELSE 0 END) AS first_shot,
       ROUND(100.0 * SUM(CASE WHEN iterations = 1 AND final_verdict = 'accepted' THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill'
GROUP BY model
ORDER BY pct DESC;
```
"Sessions that were never grounded" (the alarm query)

```sql
-- Should always be empty. If it isn't, FillValidator has a hole or
-- a different code path is bypassing the roster check.
SELECT session_id, model, iterations, attempts
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill'
  AND final_verdict = 'accepted'
  AND grounded_in_roster = false;
```
"Average retry depth per model"

```sql
SELECT model, AVG(iterations) AS avg_iter, COUNT(*) AS n
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill' AND final_verdict = 'accepted'
GROUP BY model
ORDER BY avg_iter ASC;
```
"What did validation reject?" (failure mode breakdown)

```sql
-- Pull each rejected attempt's error string, classify by prefix.
WITH errors AS (
  SELECT session_id,
         model,
         unnest(attempts) AS att
  FROM read_json_auto('sessions.jsonl', format='newline_delimited')
)
SELECT model,
       split_part(att.error, ':', 1) AS kind,
       COUNT(*) AS n
FROM errors
WHERE att.verdict_kind = 'validation_failed'
GROUP BY model, kind
ORDER BY n DESC;
```
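"Joining to the replay stream"

Since session_id is also the join key to the replay tool's replay_runs.jsonl, the two sibling streams can be combined in one query. A sketch only: it assumes replay_runs.jsonl carries a session_id column; its actual shape is documented with the replay tool, not here.

```sql
-- Sketch: replay_runs.jsonl's column names are assumptions.
SELECT s.session_id, s.model, s.final_verdict, s.grounded_in_roster
FROM read_json_auto('sessions.jsonl', format='newline_delimited') AS s
JOIN read_json_auto('replay_runs.jsonl', format='newline_delimited') AS r
  USING (session_id)
WHERE s.final_verdict = 'accepted';
```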
## Operational notes

- Append-only. No row is ever updated; storage grows linearly with iterate
  calls. Operators rotate via cron when the file gets unwieldy
  (logrotate-style).
- Best-effort posture. Every write failure goes through `slog.Warn` but
  never blocks the iterate handler. A full disk silently drops session
  rows; the iterate response still ships.
- Schema versioning. `schema=session.iterate.v1` is the contract. Future
  incompatible changes bump the version; consumers should branch on the
  field.
- PII consideration. `prompt` is captured truncated to 4000 chars and the
  final `artifact` (when present) is captured verbatim. Operators handling
  PII-bearing prompts should set the path under a restricted-access volume
  or filter before retention.
- Cross-runtime parity. The Rust gateway's `/v1/iterate` does NOT yet write
  this file. If you want a unified longitudinal log across runtimes, port
  the writer to Rust (crates/gateway/src/v1/iterate.rs) and target the same
  JSONL path. ~50 LOC.
## See also

- internal/validator/session_log.go — writer + record types
- internal/validator/iterate.go — `Tracer` callback + Langfuse span emission
- internal/shared/langfuse_middleware.go — `X-Lakehouse-Trace-Id` header propagation (the `session_id` join key)
- data/_kb/replay_runs.jsonl — the replay tool's own JSONL (different shape, different producer); these two streams are siblings, not duplicates