Closes the second half of J's 2026-05-02 multi-call observability
concern. Trace-id propagation (commit d6d2fdf) gave us the *live*
view in Langfuse; this gives us the *longitudinal* view for ad-hoc
DuckDB queries over thousands of sessions:

- "show me every session where the model produced a real candidate
  without ever needing a retry"
- "find sessions where validation rejected three times in a row"
- "first-shot success rate per model: did we feed it enough corpus?"

## What's in

internal/validator/session_log.go:
- SessionRecord type (schema=session.iterate.v1)
- SessionLogger writer: mutex-guarded append, best-effort posture,
  nil-safe (NewSessionLogger("") = nil = no-op on Append)
- BuildSessionRecord helper: assembles a row from any
  iterate response/failure/infra-error combination, callable from
  other daemons that wrap iterate (cross-daemon shared schema)
- 7 unit tests, including concurrent-append safety and the three
  code paths (success / max_iter_exhausted / infra_error)

cmd/validatord/main.go:
- handlers.sessionLog field + wiring from cfg.Validatord.SessionLogPath
- Iterate handler: build + append a SessionRecord on every call
- rosterCheckFor("fill") closure stamps grounded_in_roster, the
  load-bearing forensic property J flagged ("we can never
  hallucinate available staff members to contracts")

internal/shared/config.go + lakehouse.toml:
- [validatord].session_log_path field; empty = disabled
- Production: /var/lib/lakehouse/validator/sessions.jsonl

scripts/validatord_smoke.sh:
- Adds a probe verifying validatord announces session log path on
  startup. Smoke is now 6/6 (was 5/5).

docs/SESSION_LOG.md:
- Schema reference + 5 worked DuckDB query examples, including the
  "alarm" query (sessions where grounded_in_roster=false on an
  accepted fill; this should always be empty, and if it is not,
  something is bypassing FillValidator).

## What this is NOT

This is NOT a duplicate of replay_runs.jsonl. They're siblings:
- replay_runs.jsonl: replay tool's per-task retrieval+model output
- sessions.jsonl: validatord's per-iterate full retry chain +
  grounded-in-roster verdict

A single coordinator session can produce rows in both streams; the
session_id (= Langfuse trace_id) is the join key.

## Layered observability now in place

  Live view:  Langfuse trace tree (X-Lakehouse-Trace-Id propagation)
              `iterate.attempt[N]` spans with prompt/raw/verdict
  Offline:    coordinator_sessions.jsonl (this commit)
              DuckDB-queryable; longitudinal forensics
  Hard gate:  FillValidator + WorkerLookup (existing)
              phantom IDs structurally rejected, never reach
              session log's grounded_in_roster=true bucket

Per the architecture invariant in STATE_OF_PLAY's DO NOT RELITIGATE
section, these layers are wired; future work targets the data, not
the wiring.

## Verification

- internal/validator: 7 new tests (session_log_test.go), all PASS
- cmd/validatord: 3 new integration tests covering the success,
  failure, and grounded=false paths, all PASS
- validatord_smoke.sh: 6/6 PASS through gateway :3110
- Full go test ./... green across 33 packages

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Coordinator session log: `coordinator_sessions.jsonl`

**Last updated:** 2026-05-02 · **Schema:** `session.iterate.v1` · **Writer:** `internal/validator.SessionLogger` · **Producer:** validatord `/v1/iterate`

## Why

The Langfuse trace tree is the live view: per-session, you can scroll
the retry chain and inspect every sub-call. But for **longitudinal
forensics** ("show me every session in the last week where the model
guessed a real worker without a retry," or "find sessions where
validation rejected three times in a row"), Langfuse's UI doesn't
scale; you need a queryable data plane.

This JSONL is that plane. One row per `/v1/iterate` session,
append-only, DuckDB-friendly.

## Where

Configurable via `[validatord].session_log_path` in `lakehouse.toml`.
Empty = disabled (best-effort posture; a missing log never blocks an
iterate request). Production:

```toml
[validatord]
session_log_path = "/var/lib/lakehouse/validator/sessions.jsonl"
```

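The empty-path behaviour can be pictured with a small Go sketch. It is an illustration under stated assumptions, not the code in `internal/validator/session_log.go`: the `NewSessionLogger` and `Append` names follow the writer's description, while the cut-down `SessionRecord`, the open-per-append strategy, and the `0600` file mode are hypothetical choices made here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log/slog"
	"os"
	"path/filepath"
	"sync"
)

// SessionRecord is a cut-down stand-in for the real record type (assumption).
type SessionRecord struct {
	Schema    string `json:"schema"`
	SessionID string `json:"session_id"`
}

// SessionLogger appends one JSON line per record. Hypothetical sketch only.
type SessionLogger struct {
	mu   sync.Mutex
	path string
}

// NewSessionLogger returns nil for an empty path, which disables logging.
func NewSessionLogger(path string) *SessionLogger {
	if path == "" {
		return nil
	}
	return &SessionLogger{path: path}
}

// Append writes rec as a single line. It is safe on a nil receiver and
// reports failures via slog.Warn; callers are free to ignore the error.
func (l *SessionLogger) Append(rec SessionRecord) error {
	if l == nil {
		return nil // logging disabled
	}
	line, err := json.Marshal(rec)
	if err != nil {
		slog.Warn("session log: marshal failed", "err", err)
		return err
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	f, err := os.OpenFile(l.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		slog.Warn("session log: open failed", "path", l.path, "err", err)
		return err
	}
	defer f.Close()
	if _, err := f.Write(append(line, '\n')); err != nil {
		slog.Warn("session log: write failed", "path", l.path, "err", err)
		return err
	}
	return nil
}

func main() {
	// Disabled logger: nil pointer, Append is a no-op.
	disabled := NewSessionLogger("")
	fmt.Println(disabled == nil, disabled.Append(SessionRecord{}))

	// Enabled logger writing to a throwaway path in the temp directory.
	path := filepath.Join(os.TempDir(), "sessions-sketch.jsonl")
	logger := NewSessionLogger(path)
	err := logger.Append(SessionRecord{Schema: "session.iterate.v1", SessionID: "demo"})
	fmt.Println(path, err)
}
```

Because a nil `*SessionLogger` is a valid no-op in this sketch, a handler can call `Append` unconditionally without first checking whether logging is configured.
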
## Schema (v1)

```jsonc
{
  "schema": "session.iterate.v1",
  "session_id": "<Langfuse trace_id>",          // join key to Langfuse
  "timestamp": "2026-05-02T07:30:00.123456Z",
  "daemon": "validatord",
  "kind": "fill | email | playbook",
  "model": "qwen3.5:latest",
  "provider": "ollama",
  "prompt": "produce a fill artifact ...",      // truncated to 4000 chars
  "iterations": 3,                              // attempts spent
  "max_iterations": 3,                          // cap per request
  "final_verdict": "accepted | max_iter_exhausted | infra_error",
  "attempts": [
    { "iteration": 0, "verdict_kind": "validation_failed",
      "error": "consistency: candidate_id W-X not in roster",
      "span_id": "abc..." },
    { "iteration": 1, "verdict_kind": "validation_failed",
      "error": "consistency: city mismatch", "span_id": "def..." },
    { "iteration": 2, "verdict_kind": "accepted", "span_id": "ghi..." }
  ],
  "artifact": { /* final accepted artifact, omitted on failure */ },
  "grounded_in_roster": true,                   // null when N/A (email/playbook)
  "duration_ms": 2840
}
```

### Field semantics

| Field | When set | What it means |
|---|---|---|
| `session_id` | always | Langfuse trace id. Pivot to live trace tree by URL: `${LANGFUSE_URL}/trace/<session_id>`. |
| `final_verdict=accepted` | success | Loop converged within `max_iterations`. `artifact` is non-null. |
| `final_verdict=max_iter_exhausted` | failure | Loop hit the cap without passing validation. `artifact` is omitted. |
| `final_verdict=infra_error` | failure | Chat hop or other infra crashed. Single attempt with `verdict_kind=infra_error`. |
| `grounded_in_roster=true` | fill kind, success | Every `candidate_id` in the artifact exists in `WorkerLookup`. |
| `grounded_in_roster=false` | fill kind, anomaly | Phantom or otherwise-invalid candidate IDs. This shouldn't happen, because FillValidator catches these, but the explicit check defends against future validator weakening. |
| `grounded_in_roster=null` (omitted) | non-fill kinds, or failure | The roster check doesn't apply or wasn't run. |

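For Go consumers, the schema above might map onto a struct along these lines. This is a sketch inferred from the JSON example, not a copy of the real `SessionRecord`: the Go field names, the `groundedLabel` and `decodeRecord` helpers, and the use of `*bool` to tell an omitted `grounded_in_roster` apart from `false` are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Attempt mirrors one entry of the "attempts" array (assumed field names).
type Attempt struct {
	Iteration   int    `json:"iteration"`
	VerdictKind string `json:"verdict_kind"`
	Error       string `json:"error,omitempty"`
	SpanID      string `json:"span_id,omitempty"`
}

// SessionRecord mirrors a session.iterate.v1 row (assumed Go shape).
type SessionRecord struct {
	Schema        string          `json:"schema"`
	SessionID     string          `json:"session_id"`
	Timestamp     string          `json:"timestamp"`
	Daemon        string          `json:"daemon"`
	Kind          string          `json:"kind"`
	Model         string          `json:"model"`
	Provider      string          `json:"provider"`
	Prompt        string          `json:"prompt"`
	Iterations    int             `json:"iterations"`
	MaxIterations int             `json:"max_iterations"`
	FinalVerdict  string          `json:"final_verdict"`
	Attempts      []Attempt       `json:"attempts"`
	Artifact      json.RawMessage `json:"artifact,omitempty"`
	// A pointer keeps "omitted or null" distinct from an explicit false.
	GroundedInRoster *bool `json:"grounded_in_roster,omitempty"`
	DurationMS       int64 `json:"duration_ms"`
}

// groundedLabel turns the tri-state roster flag into a readable label.
func groundedLabel(r SessionRecord) string {
	switch {
	case r.GroundedInRoster == nil:
		return "n/a"
	case *r.GroundedInRoster:
		return "grounded"
	default:
		return "ungrounded"
	}
}

// decodeRecord parses one JSONL line into a SessionRecord.
func decodeRecord(line []byte) (SessionRecord, error) {
	var r SessionRecord
	err := json.Unmarshal(line, &r)
	return r, err
}

func main() {
	line := []byte(`{"schema":"session.iterate.v1","session_id":"t-1","kind":"fill",` +
		`"final_verdict":"accepted","iterations":1,"grounded_in_roster":true}`)
	rec, err := decodeRecord(line)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(rec.SessionID, rec.FinalVerdict, groundedLabel(rec))
}
```
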
## DuckDB queries

```sql
-- Read the log directly via DuckDB's read_json_auto.
ATTACH ':memory:' AS sessions;
SELECT * FROM read_json_auto(
  '/var/lib/lakehouse/validator/sessions.jsonl', format='newline_delimited'
) LIMIT 10;
```

### "Did the validator catch every phantom worker?"

```sql
-- Accepted sessions that needed more than one attempt, where at least
-- one rejected attempt's error mentions 'phantom' or 'consistency'.
-- grounded_in_roster = true on such a row means the model recovered.
SELECT session_id, model, iterations, grounded_in_roster, final_verdict
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE final_verdict = 'accepted'
  AND iterations > 1
  AND list_contains(
        list_transform(attempts,
          x -> x.error LIKE '%consistency%' OR x.error LIKE '%phantom%'),
        true
      );
```

### "First-shot success rate per model" (the "did the corpus give it enough" gate)

```sql
SELECT model,
       COUNT(*) AS sessions,
       SUM(CASE WHEN iterations = 1 AND final_verdict = 'accepted' THEN 1 ELSE 0 END) AS first_shot,
       ROUND(100.0 * SUM(CASE WHEN iterations = 1 AND final_verdict = 'accepted' THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill'
GROUP BY model
ORDER BY pct DESC;
```

### "Sessions that were never grounded" (the alarm query)

```sql
-- Should always be empty. If it isn't, FillValidator has a hole or
-- a different code path is bypassing the roster check.
SELECT session_id, model, iterations, attempts
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill'
  AND final_verdict = 'accepted'
  AND grounded_in_roster = false;
```

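Where DuckDB is not at hand (for example inside a Go health check), the same alarm condition could be approximated by scanning the file directly. This is a minimal sketch that assumes the v1 field names shown in the schema; the `row` type and both function names are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// row holds only the fields the alarm condition needs (assumed shape).
type row struct {
	SessionID        string `json:"session_id"`
	Kind             string `json:"kind"`
	FinalVerdict     string `json:"final_verdict"`
	GroundedInRoster *bool  `json:"grounded_in_roster"`
}

// ungroundedAccepted returns the session IDs of accepted fill sessions whose
// grounded_in_roster is explicitly false. Rows with a null or omitted flag
// are skipped, matching the SQL comparison `grounded_in_roster = false`.
func ungroundedAccepted(r io.Reader) ([]string, error) {
	var ids []string
	sc := bufio.NewScanner(r)
	// Rows carry prompts and artifacts, so allow lines well above the 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 8*1024*1024)
	for sc.Scan() {
		if len(sc.Bytes()) == 0 {
			continue // tolerate blank lines
		}
		var rec row
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			return ids, fmt.Errorf("bad session row: %w", err)
		}
		if rec.Kind == "fill" && rec.FinalVerdict == "accepted" &&
			rec.GroundedInRoster != nil && !*rec.GroundedInRoster {
			ids = append(ids, rec.SessionID)
		}
	}
	return ids, sc.Err()
}

// ungroundedAcceptedIn is a convenience wrapper for in-memory JSONL text.
func ungroundedAcceptedIn(jsonl string) ([]string, error) {
	return ungroundedAccepted(strings.NewReader(jsonl))
}

func main() {
	const sample = `{"session_id":"a","kind":"fill","final_verdict":"accepted","grounded_in_roster":true}
{"session_id":"b","kind":"fill","final_verdict":"accepted","grounded_in_roster":false}
{"session_id":"c","kind":"email","final_verdict":"accepted"}`
	ids, err := ungroundedAcceptedIn(sample)
	fmt.Println(ids, err) // any non-empty list here is the alarm
}
```
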
### "Average retry depth per model"

```sql
SELECT model, AVG(iterations) AS avg_iter, COUNT(*) AS n
FROM read_json_auto('sessions.jsonl', format='newline_delimited')
WHERE kind = 'fill' AND final_verdict = 'accepted'
GROUP BY model
ORDER BY avg_iter ASC;
```

### "What did validation reject?" (failure mode breakdown)

```sql
-- Pull each rejected attempt's error string, classify by prefix.
WITH errors AS (
  SELECT session_id,
         model,
         unnest(attempts) AS att
  FROM read_json_auto('sessions.jsonl', format='newline_delimited')
)
SELECT model,
       split_part(att.error, ':', 1) AS kind,
       COUNT(*) AS n
FROM errors
WHERE att.verdict_kind = 'validation_failed'
GROUP BY model, kind
ORDER BY n DESC;
```

## Operational notes

- **Append-only.** No row is ever updated; storage grows linearly with iterate calls. Operators rotate via cron when the file gets unwieldy (logrotate-style).
- **Best-effort posture.** A failed write is reported through `slog.Warn` but never blocks the iterate handler. On a full disk, session rows are dropped with only that warning; the iterate response still ships.
- **Schema versioning.** `schema=session.iterate.v1` is the contract. Future incompatible changes bump the version; consumers should branch on the field.
- **PII consideration.** `prompt` is captured truncated to 4000 chars and the final `artifact` (when present) is captured verbatim. Operators handling PII-bearing prompts should set the path under a restricted-access volume or filter before retention.
- **Cross-runtime parity.** The Rust gateway's `/v1/iterate` does NOT yet write this file. If you want a unified longitudinal log across runtimes, port the writer to Rust (`crates/gateway/src/v1/iterate.rs`) and target the same JSONL path. ~50 LOC.

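The schema-versioning note suggests that consumers inspect `schema` before trusting the rest of a row. One possible shape for that branch is sketched here; `parseRow`, `v1Summary`, `ErrUnknownSchema`, and `isUnknownSchema` are illustrative names, not anything exported by the repository.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrUnknownSchema marks rows written under a version this consumer
// does not understand (hypothetical sentinel error).
var ErrUnknownSchema = errors.New("unknown session log schema")

// v1Summary holds the subset of v1 fields this example consumer reads.
type v1Summary struct {
	SessionID    string `json:"session_id"`
	FinalVerdict string `json:"final_verdict"`
	Iterations   int    `json:"iterations"`
}

// parseRow peeks at the schema field, then decodes with the matching shape.
func parseRow(line []byte) (v1Summary, error) {
	var head struct {
		Schema string `json:"schema"`
	}
	if err := json.Unmarshal(line, &head); err != nil {
		return v1Summary{}, err
	}
	switch head.Schema {
	case "session.iterate.v1":
		var s v1Summary
		err := json.Unmarshal(line, &s)
		return s, err
	default:
		// A future version would get its own case instead of being guessed at.
		return v1Summary{}, fmt.Errorf("%w: %q", ErrUnknownSchema, head.Schema)
	}
}

// isUnknownSchema reports whether err came from an unrecognised version.
func isUnknownSchema(err error) bool {
	return errors.Is(err, ErrUnknownSchema)
}

func main() {
	rows := []string{
		`{"schema":"session.iterate.v1","session_id":"a","final_verdict":"accepted","iterations":1}`,
		`{"schema":"session.iterate.v2","session_id":"b"}`,
	}
	for _, r := range rows {
		s, err := parseRow([]byte(r))
		if isUnknownSchema(err) {
			fmt.Println("skipping row:", err)
			continue
		}
		fmt.Println(s.SessionID, s.FinalVerdict, s.Iterations, err)
	}
}
```

Skipping unknown versions rather than failing the whole scan is a design choice of this sketch; a stricter consumer might prefer to abort on the first unrecognised row.
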
## See also

- `internal/validator/session_log.go`: writer + record types
- `internal/validator/iterate.go`: `Tracer` callback + Langfuse span emission
- `internal/shared/langfuse_middleware.go`: `X-Lakehouse-Trace-Id` header propagation (the `session_id` join key)
- `data/_kb/replay_runs.jsonl`: the *replay* tool's own JSONL (different shape, different producer); these two streams are siblings, not duplicates