ADR-005: observer fail-safe semantics

Closes the OPEN item from STATE_OF_PLAY. Required because observerd is
now on the prod-realistic data path via the lift harness boot (b2e45f7),
so the next consumer (scrum runner / distillation rebuild / production
workflow) needs the fail-safe rationale locked, not implicit.

The Rust "verdict:accept on crash" anti-pattern doesn't translate
one-to-one to the Go observer (witness, not gate). But four adjacent
fail-safe decisions are real and live:

5.1 Persist failure is logged-not-fatal; ring is in-flight source of
    truth. Persist-required mode deferred to a future opt-in ADR.

5.2 Mode failure → Success=false, no panic-swallow path. The runner
    catches mode errors and surfaces them via node.Error; downstream
    consumers see failures explicitly rather than as fake successes
    (the Rust anti-pattern surface).

5.3 One row per node, recorded post-run. A workflow with N nodes
    produces N audit rows, never a per-workflow catch-all that
    survives partial crashes. Known gap: recording happens after
    runner.Run returns (acceptable for short workflows; streaming
    callback is the right shape when workflows get longer).

5.4 /observer/event accepts on full ring (oldest evicted). Refusing
    to write would translate every burst into client errors — wrong
    direction for an audit witness.

Mostly ratifies existing behavior; cross-checked claims against
actual code (caught one error in Decision 5.3 draft — recording is
post-run-batched, not per-node-as-it-completes — and the ADR now
states reality).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
root 2026-04-30 06:32:12 -05:00
parent 6c02c905c8
commit 2c71d1c637
2 changed files with 144 additions and 4 deletions

View File

@ -179,7 +179,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| **Reality test v2: paraphrase queries** | The 21 verbatim queries in `tests/reality/playbook_lift_queries.txt` exercise verbatim replay only. The interesting case is *similar but not identical* queries hitting a recorded playbook — does the cosine on `query_text` find the playbook hit? Add a paraphrase pass and measure. | After J wants to push the harness past v1 baseline. |
| **Q15 boost-math edge case** | "Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so not promoted. Documented in caveat #2. Either (a) accept the math limit, or (b) tier scores so judge-best-found-deep gets score>1.0. Open design call. | When a second reality run shows the same edge case persisting. |
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
| **ADR-005 — observer fail-safe semantics** | Observer ported but the upstream "verdict:accept on crash" anti-pattern still has no Go-side decision locked. Doc-only, ~30 min. | Before observer is wired into production paths. |
| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
| **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. |
@ -199,6 +198,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
| `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) |
| `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
| `e4ee002` | `scripts/scrum_review.sh` — reusable 3-lineage driver |
| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
| (next) | ADR-005: observer fail-safe semantics (this commit) |
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).

View File

@ -359,6 +359,144 @@ in-memory only (matches vectord G1's pattern).
---
(Future ADRs from ADR-005 onward will be added as the Go
implementation accrues design decisions — e.g. observer fail-safe
semantics, distillation rebuild, gRPC adapter wire format, etc.)
## ADR-005: Observer fail-safe semantics
**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`
The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
when the observer crashed mid-evaluation, the upstream interpreted the
missing verdict as implicit acceptance. Several silent regressions traced
to it. The Go observer's role is structurally different — it is a
**witness** (records what happened) rather than a **gate** (decides
accept/reject) — but adjacent fail-safe decisions still need locking
now that observerd is on the prod-realistic stack via the lift harness
(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior
and locks the rationale so future consumers don't break the invariant
by flipping the defaults.
### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
Already implemented (`internal/observer/store.go:60-67`). Locked:
- If `persistor.Append` fails, log a warning and continue. Do NOT
return an error to the caller of `Store.Record`.
- The in-memory ring buffer is the source of truth in flight; the
JSONL is a best-effort durability shadow.
- Operators who need fail-closed audit-grade trails configure that
mode through a future opt-in (deferred to a later ADR; not the
G0/G1/G2 default).
**Why fail-open here:** the observer's job is to keep recording even
when the disk hiccups. A `persist-fail-fatal` mode would translate
every transient I/O blip into an observer-blackout, which is strictly
worse for the witness role than missing a few persisted entries — the
ring still has them, and operators can drain it on restart.
**Why this isn't the Rust anti-pattern:** the Go observer doesn't
emit verdicts. A persist failure here means "we recorded fewer rows
on disk than in memory," not "we accepted something we shouldn't have."
### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path
Already implemented (`internal/workflow/runner.go`). Locked:
- Mode errors are caught by the runner and surfaced via the node's
`Error` field; `Success` is the boolean derived from `Error == ""`.
- `observerd` records an `ObservedOp` per node with `Success: false`
and the error string when a mode fails.
- Cycles, missing-deps, and unknown modes are aborting errors → 4xx
from `/observer/workflow/run` with the failure encoded in the JSON
response.
**Why this is the explicit anti-Rust:** allowing a mode to silently
swallow its panic and report `Success: true` is exactly how the Rust
"verdict:accept on crash" pattern manifests. Forcing the runner to
record `Success: false` on error makes the failure observable to
downstream consumers (observerd queries, scrum review, distillation
selection) instead of laundering it into a fake success.
### Decision 5.3 — Provenance is one-row-per-node, recorded post-run
Already implemented (`cmd/observerd/main.go:140-154`). Locked:
- `runner.Run` returns the full `RunResult` with per-node Success/Error;
`handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an
`ObservedOp` per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with
N nodes produces N audit rows.
- Crash semantics:
- Crash *during* `runner.Run` → no provenance recorded; queries see
absence, not a false acceptance.
- Crash *during* the recording loop → some nodes recorded, some
absent; queries see partial provenance, again not a false
acceptance.
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
**Why one row per node:** debugging a partial workflow is a one-grep
operation when each node has its own row. A single catch-all row would
be exactly the Rust anti-pattern surface — "we accepted this workflow"
records that survive partial crashes look identical to genuine
acceptances. Per-node-row makes that structurally impossible.
**Known gap, not yet a follow-up ADR:** recording happens after
`runner.Run` returns, not as each node completes. A long workflow with
late-stage failure currently records nodes that already finished only
once the runner returns. For G0/G1/G2 substrate this is fine —
workflows are short. When workflows get long enough that mid-run
visibility matters, a streaming-record callback is the right shape.
### Decision 5.4 — `/observer/event` accepts even when the ring is full
Already implemented via `Store.Record`'s shift-left eviction. Locked:
- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from `/observer/event` means "we accepted into the ring"; it
does NOT promise "we persisted." Persistence remains best-effort
per Decision 5.1.
- 4xx is reserved for malformed `ObservedOp` payloads (validation
failures).
**Why accept-on-full:** treating a full ring as a 503 would translate
every brief activity burst into client errors, which is exactly the
wrong direction for an audit witness — the witness's job is to never
refuse to write, only to lose oldest data when capacity binds.
### Alternatives considered
- **Persist-required mode** — caller-configurable fail-closed for
audit-grade workloads. The right approach when this lands is an
opt-in on `Store` construction, leaving the default fail-open.
Deferred to a future ADR.
- **Distributed ring with WAL** — persist before accept-into-ring,
sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight
source of truth" property.
- **Mode-result schema with explicit verdict field** — would force
every mode to declare accept/reject. Overengineered for the witness
role and reintroduces the gate-vs-witness confusion this ADR is
trying to avoid.
### What this ADR does NOT do
- **No retention policy.** "How long do we keep observer entries on
disk?" is a separate operations decision.
- **No mode-level retry.** If a mode fails, the runner records that
and moves on. Whether to retry is a workflow-definition concern
(Archon-style retry policies in the YAML), not the runner's.
- **No cross-process recovery.** A crashed observerd loses the ring;
the persistor preserves what it managed to write. Operators read the
JSONL after restart, not query a dead daemon.
- **No persist-required opt-in.** Mentioned in alternatives; lands in
a separate ADR when an audit-grade consumer requires it.
### How this closes the OPEN list
STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired
into production paths. The 2026-04-30 lift run wired observerd into the
prod-realistic harness boot, which means observer is now on the data
path for every reality test workflow. This ADR locks the fail-safe
invariants before the next consumer (scrum runner, distillation rebuild,
or a real production workflow) takes a hard behavioral dependency.
---