Closes the second half of J's 2026-05-02 multi-call observability
concern. Trace-id propagation (commit d6d2fdf) gave us the *live*
view in Langfuse; this gives us the *longitudinal* view for ad-hoc
DuckDB queries over thousands of sessions:
- "show me every session where the model produced a real candidate
  without ever needing a retry"
- "find sessions where validation rejected three times in a row"
- "first-shot success rate per model — did we feed it enough corpus?"
## What's in
internal/validator/session_log.go:
- SessionRecord type (schema=session.iterate.v1)
- SessionLogger writer — mutex-guarded append, best-effort posture,
nil-safe (NewSessionLogger("") = nil = no-op on Append)
- BuildSessionRecord helper — assembles a row from any
iterate response/failure/infra-error combination, callable from
other daemons that wrap iterate (cross-daemon shared schema)
- 7 unit tests including concurrent-append safety + the three
code paths (success / max_iter_exhausted / infra_error)
cmd/validatord/main.go:
- handlers.sessionLog field + wiring from cfg.Validatord.SessionLogPath
- Iterate handler: build + append a SessionRecord on every call
- rosterCheckFor("fill") closure stamps grounded_in_roster — the
load-bearing forensic property J flagged ("we can never
hallucinate available staff members to contracts")
internal/shared/config.go + lakehouse.toml:
- [validatord].session_log_path field; empty = disabled
- Production: /var/lib/lakehouse/validator/sessions.jsonl
scripts/validatord_smoke.sh:
- Adds a probe verifying that validatord announces the session log path
  on startup. Smoke is now 6/6 (was 5/5).
docs/SESSION_LOG.md:
- Schema reference + 5 worked DuckDB query examples including the
"alarm" query (sessions where grounded_in_roster=false on an
accepted fill — should always be empty; if not, something is
bypassing FillValidator).
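The writer's posture described above (mutex-guarded, nil-safe, best-effort) can be sketched in a few lines of Go. This is a sketch of the shape only, assuming the names from this commit (`SessionLogger`, `NewSessionLogger`, `Append`); the real internal/validator implementation will differ:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sync"
)

// SessionLogger sketch: append-only JSONL writer, guarded by a mutex.
type SessionLogger struct {
	mu   sync.Mutex
	path string
}

// NewSessionLogger returns nil when path is empty; a nil logger makes
// Append a no-op, so callers never branch on "is logging enabled?".
func NewSessionLogger(path string) *SessionLogger {
	if path == "" {
		return nil // disabled
	}
	return &SessionLogger{path: path}
}

// Append marshals the record and appends one JSONL row. Errors are
// swallowed (best-effort posture): a failed write must never fail iterate.
func (l *SessionLogger) Append(record any) {
	if l == nil {
		return // nil-safe no-op
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	line, err := json.Marshal(record)
	if err != nil {
		return
	}
	f, err := os.OpenFile(l.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return // a missing volume must not block the iterate handler
	}
	defer f.Close()
	_, _ = f.Write(append(line, '\n'))
}

func main() {
	disabled := NewSessionLogger("")
	disabled.Append(map[string]string{"schema": "session.iterate.v1"}) // no-op
	fmt.Println("disabled logger is nil:", disabled == nil)           // prints: disabled logger is nil: true
}
```

Because a nil `*SessionLogger` is a valid receiver, handlers wired from an empty `session_log_path` need no enabled/disabled branching.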
## What this is NOT
This is NOT a duplicate of replay_runs.jsonl. They're siblings:
- replay_runs.jsonl: replay tool's per-task retrieval+model output
- sessions.jsonl: validatord's per-iterate full retry chain +
grounded-in-roster verdict
A single coordinator session can produce rows in both streams; the
session_id (= Langfuse trace_id) is the join key.
## Layered observability now in place
- Live view: Langfuse trace tree (X-Lakehouse-Trace-Id propagation), with
  `iterate.attempt[N]` spans carrying prompt/raw/verdict
- Offline: coordinator_sessions.jsonl (this commit), DuckDB-queryable for
  longitudinal forensics
- Hard gate: FillValidator + WorkerLookup (existing); phantom IDs are
  structurally rejected and never reach the session log's
  grounded_in_roster=true bucket
Per the architecture invariant in STATE_OF_PLAY's DO NOT RELITIGATE
section — these layers are wired; future work targets the data, not
the wiring.
## Verification
- internal/validator: 7 new tests (session_log_test.go) — all PASS
- cmd/validatord: 3 new integration tests covering the success,
failure, and grounded=false paths — all PASS
- validatord_smoke.sh: 6/6 PASS through gateway :3110
- Full go test ./... green across 33 packages
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Coordinator session log — coordinator_sessions.jsonl

Last updated: 2026-05-02 · Schema: session.iterate.v1 · Writer: internal/validator.SessionLogger · Producer: validatord /v1/iterate
## Why
The Langfuse trace tree is the live view: per-session, you can scroll the retry chain and inspect every sub-call. But for longitudinal forensics ("show me every session in the last week where the model guessed a real worker without a retry," or "find sessions where validation rejected three times in a row"), Langfuse's UI doesn't scale — you need a queryable data plane.
This JSONL is that plane. One row per /v1/iterate session, append-only, DuckDB-friendly.
## Where

Configurable via `[validatord].session_log_path` in lakehouse.toml. Empty = disabled (best-effort posture; a missing log never blocks an iterate request). Production:

```toml
[validatord]
session_log_path = "/var/lib/lakehouse/validator/sessions.jsonl"
```
## Schema (v1)

```jsonc
{
  "schema": "session.iterate.v1",
  "session_id": "<Langfuse trace_id>",          // join key to Langfuse
  "timestamp": "2026-05-02T07:30:00.123456Z",
  "daemon": "validatord",
  "kind": "fill | email | playbook",
  "model": "qwen3.5:latest",
  "provider": "ollama",
  "prompt": "produce a fill artifact ...",      // truncated to 4000 chars
  "iterations": 3,                              // attempts spent
  "max_iterations": 3,                          // cap per request
  "final_verdict": "accepted | max_iter_exhausted | infra_error",
  "attempts": [
    { "iteration": 0, "verdict_kind": "validation_failed",
      "error": "consistency: candidate_id W-X not in roster",
      "span_id": "abc..." },
    { "iteration": 1, "verdict_kind": "validation_failed",
      "error": "consistency: city mismatch", "span_id": "def..." },
    { "iteration": 2, "verdict_kind": "accepted", "span_id": "ghi..." }
  ],
  "artifact": { /* final accepted artifact, omitted on failure */ },
  "grounded_in_roster": true,                   // null when N/A (email/playbook)
  "duration_ms": 2840
}
```
## Field semantics

| Field | When set | What it means |
|---|---|---|
| `session_id` | always | Langfuse trace id. Pivot to the live trace tree by URL: `${LANGFUSE_URL}/trace/<session_id>`. |
| `final_verdict=accepted` | success | Loop converged within max_iterations. `artifact` is non-null. |
| `final_verdict=max_iter_exhausted` | failure | Loop hit the cap without passing validation. `artifact` is omitted. |
| `final_verdict=infra_error` | failure | Chat hop or other infra crashed. Single attempt with `verdict_kind=infra_error`. |
| `grounded_in_roster=true` | fill kind, success | Every candidate_id in the artifact exists in WorkerLookup. |
| `grounded_in_roster=false` | fill kind, anomaly | Phantom or otherwise-invalid candidate IDs (shouldn't happen — FillValidator catches these — but the explicit check defends against future validator weakening). |
| `grounded_in_roster=null` (omitted) | non-fill kinds, or failure | The roster check doesn't apply or wasn't run. |
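Conceptually, the `grounded_in_roster` stamp is a set-membership check over the artifact's candidate IDs. A minimal sketch with hypothetical types (`Candidate`, a plain roster map); the real path goes through `WorkerLookup` and validatord's `rosterCheckFor` closure:

```go
package main

import "fmt"

// Candidate is a hypothetical stand-in for the fill artifact's entries.
type Candidate struct{ ID string }

// groundedInRoster reports whether every candidate ID resolves against
// the roster, mirroring the grounded_in_roster stamp described above.
func groundedInRoster(candidates []Candidate, roster map[string]bool) bool {
	for _, c := range candidates {
		if !roster[c.ID] {
			return false // phantom ID: the alarm condition
		}
	}
	return true
}

func main() {
	roster := map[string]bool{"W-1": true, "W-2": true}
	fmt.Println(groundedInRoster([]Candidate{{ID: "W-1"}}, roster)) // true
	fmt.Println(groundedInRoster([]Candidate{{ID: "W-9"}}, roster)) // false: phantom
}
```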
## DuckDB queries

```sql
-- Read the log directly via DuckDB's read_json_auto.
SELECT * FROM read_json_auto(
  '/var/lib/lakehouse/validator/sessions.jsonl', format='newline_delimited'
) LIMIT 10;
```
"Did the validator catch every phantom worker?"

```sql
-- Accepted sessions that took more than one attempt and where some
-- rejected attempt's error mentions 'consistency' or 'phantom': the
-- validator caught the bad candidate and the model recovered.
SELECT session_id, model, iterations, grounded_in_roster, final_verdict
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE final_verdict = 'accepted'
  AND iterations > 1
  AND list_contains(
        list_transform(attempts,
          x -> x.error LIKE '%consistency%' OR x.error LIKE '%phantom%'),
        true
      );
```
"First-shot success rate per model" (the "did the corpus give it enough" gate)

```sql
SELECT model,
       COUNT(*) AS sessions,
       SUM(CASE WHEN iterations = 1 AND final_verdict = 'accepted' THEN 1 ELSE 0 END) AS first_shot,
       ROUND(100.0 * SUM(CASE WHEN iterations = 1 AND final_verdict = 'accepted' THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill'
GROUP BY model
ORDER BY pct DESC;
```
"Sessions that were never grounded" (the alarm query)

```sql
-- Should always be empty. If it isn't, FillValidator has a hole or
-- a different code path is bypassing the roster check.
SELECT session_id, model, iterations, attempts
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill'
  AND final_verdict = 'accepted'
  AND grounded_in_roster = false;
```
"Average retry depth per model"

```sql
SELECT model, AVG(iterations) AS avg_iter, COUNT(*) AS n
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill' AND final_verdict = 'accepted'
GROUP BY model
ORDER BY avg_iter ASC;
```
"What did validation reject?" (failure mode breakdown)

```sql
-- Pull each rejected attempt's error string, classify by prefix.
WITH errors AS (
  SELECT session_id,
         model,
         unnest(attempts) AS att
  FROM read_json_auto('sessions.jsonl', format='newline_delimited')
)
SELECT model,
       split_part(att.error, ':', 1) AS kind,
       COUNT(*) AS n
FROM errors
WHERE att.verdict_kind = 'validation_failed'
GROUP BY model, kind
ORDER BY n DESC;
```
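"Joining to the replay stream"

Since session_id is also the join key to the replay tool's replay_runs.jsonl, the two sibling streams can be combined in one query. A sketch only: it assumes replay_runs.jsonl carries a session_id column; its actual shape is documented with the replay tool, not here.

```sql
-- Sketch: replay_runs.jsonl's column names are assumptions.
SELECT s.session_id, s.model, s.final_verdict, s.grounded_in_roster
FROM read_json_auto('sessions.jsonl', format='newline_delimited') AS s
JOIN read_json_auto('replay_runs.jsonl', format='newline_delimited') AS r
  USING (session_id)
WHERE s.final_verdict = 'accepted';
```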
## Operational notes

- Append-only. No row is ever updated; storage grows linearly with iterate
  calls. Operators rotate via cron when the file gets unwieldy
  (logrotate-style).
- Best-effort posture. Every write failure goes through `slog.Warn` but
  never blocks the iterate handler. A full disk silently drops session
  rows; the iterate response still ships.
- Schema versioning. `schema=session.iterate.v1` is the contract. Future
  incompatible changes bump the version; consumers should branch on the
  field.
- PII consideration. `prompt` is captured truncated to 4000 chars and the
  final `artifact` (when present) is captured verbatim. Operators handling
  PII-bearing prompts should set the path under a restricted-access volume
  or filter before retention.
- Cross-runtime parity. The Rust gateway's `/v1/iterate` does NOT yet write
  this file. If you want a unified longitudinal log across runtimes, port
  the writer to Rust (crates/gateway/src/v1/iterate.rs) and target the same
  JSONL path. ~50 LOC.
## See also

- internal/validator/session_log.go — writer + record types
- internal/validator/iterate.go — `Tracer` callback + Langfuse span emission
- internal/shared/langfuse_middleware.go — `X-Lakehouse-Trace-Id` header propagation (the `session_id` join key)
- data/_kb/replay_runs.jsonl — the replay tool's own JSONL (different shape, different producer); these two streams are siblings, not duplicates