ADR-005: observer fail-safe semantics

Closes the OPEN item from STATE_OF_PLAY. Required because observerd is now on the prod-realistic data path via the lift harness boot (b2e45f7), so the next consumer (scrum runner / distillation rebuild / production workflow) needs the fail-safe rationale locked, not implicit. The Rust "verdict:accept on crash" anti-pattern doesn't translate one-to-one to the Go observer (witness, not gate). But four adjacent fail-safe decisions are real and live: 5.1 Persist failure is logged-not-fatal; ring is in-flight source of truth. Persist-required mode deferred to a future opt-in ADR. 5.2 Mode failure → Success=false, no panic-swallow path. The runner catches mode errors and surfaces them via node.Error; downstream consumers see failures explicitly rather than as fake successes (the Rust anti-pattern surface). 5.3 One row per node, recorded post-run. A workflow with N nodes produces N audit rows, never a per-workflow catch-all that survives partial crashes. Known gap: recording happens after runner.Run returns (acceptable for short workflows; streaming callback is the right shape when workflows get longer). 5.4 /observer/event accepts on full ring (oldest evicted). Refusing to write would translate every burst into client errors — wrong direction for an audit witness. Mostly ratifies existing behavior; cross-checked claims against actual code (caught one error in Decision 5.3 draft — recording is post-run-batched, not per-node-as-it-completes — and the ADR now states reality). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 06:32:12 -05:00 · 2026-04-30 06:32:12 -05:00 · 2c71d1c637
commit 2c71d1c637
parent 6c02c905c8
2 changed files with 144 additions and 4 deletions
--- a/STATE_OF_PLAY.md
+++ b/STATE_OF_PLAY.md
@ -179,7 +179,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | **Reality test v2: paraphrase queries** | The 21 verbatim queries in `tests/reality/playbook_lift_queries.txt` exercise verbatim replay only. The interesting case is *similar but not identical* queries hitting a recorded playbook — does the cosine on `query_text` find the playbook hit? Add a paraphrase pass and measure. | After J wants to push the harness past v1 baseline. |
 | **Q15 boost-math edge case** | "Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so not promoted. Documented in caveat #2. Either (a) accept the math limit, or (b) tier scores so judge-best-found-deep gets score>1.0. Open design call. | When a second reality run shows the same edge case persisting. |
 | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
-| **ADR-005 — observer fail-safe semantics** | Observer ported but the upstream "verdict:accept on crash" anti-pattern still has no Go-side decision locked. Doc-only, ~30 min. | Before observer is wired into production paths. |
 | **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
 | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
 | **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. |
@ -199,6 +198,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
 | `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) |
 | `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
 | `e4ee002` | `scripts/scrum_review.sh` — reusable 3-lineage driver |
+| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
+| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
+| (next)    | ADR-005: observer fail-safe semantics (this commit) |

 Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).

--- a/docs/DECISIONS.md
+++ b/docs/DECISIONS.md
@ -359,6 +359,144 @@ in-memory only (matches vectord G1's pattern).

 ---

-(Future ADRs from ADR-005 onward will be added as the Go
-implementation accrues design decisions — e.g. observer fail-safe
-semantics, distillation rebuild, gRPC adapter wire format, etc.)
+## ADR-005: Observer fail-safe semantics
+
+**Date:** 2026-04-30
+**Status:** RATIFIED
+**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`
+
+The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
+when the observer crashed mid-evaluation, the upstream interpreted the
+missing verdict as implicit acceptance. Several silent regressions traced
+to it. The Go observer's role is structurally different — it is a
+**witness** (records what happened) rather than a **gate** (decides
+accept/reject) — but adjacent fail-safe decisions still need locking
+now that observerd is on the prod-realistic stack via the lift harness
+(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior
+and locks the rationale so future consumers don't break the invariant
+by flipping the defaults.
+
+### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
+
+Already implemented (`internal/observer/store.go:60-67`). Locked:
+
+- If `persistor.Append` fails, log a warning and continue. Do NOT
+  return an error to the caller of `Store.Record`.
+- The in-memory ring buffer is the source of truth in flight; the
+  JSONL is a best-effort durability shadow.
+- Operators who need fail-closed audit-grade trails configure that
+  mode through a future opt-in (deferred to a later ADR; not the
+  G0/G1/G2 default).
+
+**Why fail-open here:** the observer's job is to keep recording even
+when the disk hiccups. A `persist-fail-fatal` mode would translate
+every transient I/O blip into an observer-blackout, which is strictly
+worse for the witness role than missing a few persisted entries — the
+ring still has them, and operators can drain it on restart.
+
+**Why this isn't the Rust anti-pattern:** the Go observer doesn't
+emit verdicts. A persist failure here means "we recorded fewer rows
+on disk than in memory," not "we accepted something we shouldn't have."
+
+### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path
+
+Already implemented (`internal/workflow/runner.go`). Locked:
+
+- Mode errors are caught by the runner and surfaced via the node's
+  `Error` field; `Success` is the boolean derived from `Error == ""`.
+- `observerd` records an `ObservedOp` per node with `Success: false`
+  and the error string when a mode fails.
+- Cycles, missing-deps, and unknown modes are aborting errors → 4xx
+  from `/observer/workflow/run` with the failure encoded in the JSON
+  response.
+
+**Why this is the explicit anti-Rust:** allowing a mode to silently
+swallow its panic and report `Success: true` is exactly how the Rust
+"verdict:accept on crash" pattern manifests. Forcing the runner to
+record `Success: false` on error makes the failure observable to
+downstream consumers (observerd queries, scrum review, distillation
+selection) instead of laundering it into a fake success.
+
+### Decision 5.3 — Provenance is one-row-per-node, recorded post-run
+
+Already implemented (`cmd/observerd/main.go:140-154`). Locked:
+
+- `runner.Run` returns the full `RunResult` with per-node Success/Error;
+  `handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an
+  `ObservedOp` per node.
+- One row per node, NOT a single per-workflow catch-all. A workflow with
+  N nodes produces N audit rows.
+- Crash semantics:
+  - Crash *during* `runner.Run` → no provenance recorded; queries see
+    absence, not a false acceptance.
+  - Crash *during* the recording loop → some nodes recorded, some
+    absent; queries see partial provenance, again not a false
+    acceptance.
+- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
+
+**Why one row per node:** debugging a partial workflow is a one-grep
+operation when each node has its own row. A single catch-all row would
+be exactly the Rust anti-pattern surface — "we accepted this workflow"
+records that survive partial crashes look identical to genuine
+acceptances. Per-node-row makes that structurally impossible.
+
+**Known gap, not yet a follow-up ADR:** recording happens after
+`runner.Run` returns, not as each node completes. A long workflow with
+late-stage failure currently records nodes that already finished only
+once the runner returns. For G0/G1/G2 substrate this is fine —
+workflows are short. When workflows get long enough that mid-run
+visibility matters, a streaming-record callback is the right shape.
+
+### Decision 5.4 — `/observer/event` accepts even when the ring is full
+
+Already implemented via `Store.Record`'s shift-left eviction. Locked:
+
+- Ring overflow is normal operation: oldest evicted, newest accepted.
+- 200 OK from `/observer/event` means "we accepted into the ring"; it
+  does NOT promise "we persisted." Persistence remains best-effort
+  per Decision 5.1.
+- 4xx is reserved for malformed `ObservedOp` payloads (validation
+  failures).
+
+**Why accept-on-full:** treating a full ring as a 503 would translate
+every brief activity burst into client errors, which is exactly the
+wrong direction for an audit witness — the witness's job is to never
+refuse to write, only to lose oldest data when capacity binds.
+
+### Alternatives considered
+
+- **Persist-required mode** — caller-configurable fail-closed for
+  audit-grade workloads. The right approach when this lands is an
+  opt-in on `Store` construction, leaving the default fail-open.
+  Deferred to a future ADR.
+- **Distributed ring with WAL** — persist before accept-into-ring,
+  sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight
+  source of truth" property.
+- **Mode-result schema with explicit verdict field** — would force
+  every mode to declare accept/reject. Overengineered for the witness
+  role and reintroduces the gate-vs-witness confusion this ADR is
+  trying to avoid.
+
+### What this ADR does NOT do
+
+- **No retention policy.** "How long do we keep observer entries on
+  disk?" is a separate operations decision.
+- **No mode-level retry.** If a mode fails, the runner records that
+  and moves on. Whether to retry is a workflow-definition concern
+  (Archon-style retry policies in the YAML), not the runner's.
+- **No cross-process recovery.** A crashed observerd loses the ring;
+  the persistor preserves what it managed to write. Operators read the
+  JSONL after restart, not query a dead daemon.
+- **No persist-required opt-in.** Mentioned in alternatives; lands in
+  a separate ADR when an audit-grade consumer requires it.
+
+### How this closes the OPEN list
+
+STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired
+into production paths. The 2026-04-30 lift run wired observerd into the
+prod-realistic harness boot, which means observer is now on the data
+path for every reality test workflow. This ADR locks the fail-safe
+invariants before the next consumer (scrum runner, distillation rebuild,
+or a real production workflow) takes a hard behavioral dependency.
+
+---