ADR-005: observer fail-safe semantics
Closes the OPEN item from STATE_OF_PLAY. Required because observerd is
now on the prod-realistic data path via the lift harness boot (b2e45f7),
so the next consumer (scrum runner / distillation rebuild / production
workflow) needs the fail-safe rationale locked, not implicit.
The Rust "verdict:accept on crash" anti-pattern doesn't translate
one-to-one to the Go observer (witness, not gate). But four adjacent
fail-safe decisions are real and live:
5.1 Persist failure is logged-not-fatal; ring is in-flight source of
truth. Persist-required mode deferred to a future opt-in ADR.
5.2 Mode failure → Success=false, no panic-swallow path. The runner
catches mode errors and surfaces them via node.Error; downstream
consumers see failures explicitly rather than as fake successes
(the Rust anti-pattern surface).
5.3 One row per node, recorded post-run. A workflow with N nodes
produces N audit rows, never a per-workflow catch-all that
survives partial crashes. Known gap: recording happens after
runner.Run returns (acceptable for short workflows; streaming
callback is the right shape when workflows get longer).
5.4 /observer/event accepts on full ring (oldest evicted). Refusing
to write would translate every burst into client errors — wrong
direction for an audit witness.
Mostly ratifies existing behavior; cross-checked claims against
actual code (caught one error in Decision 5.3 draft — recording is
post-run-batched, not per-node-as-it-completes — and the ADR now
states reality).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
6c02c905c8
commit
2c71d1c637
@ -179,7 +179,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
|
||||
| **Reality test v2: paraphrase queries** | The 21 verbatim queries in `tests/reality/playbook_lift_queries.txt` exercise verbatim replay only. The interesting case is *similar but not identical* queries hitting a recorded playbook — does the cosine on `query_text` find the playbook hit? Add a paraphrase pass and measure. | After J wants to push the harness past v1 baseline. |
|
||||
| **Q15 boost-math edge case** | "Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so not promoted. Documented in caveat #2. Either (a) accept the math limit, or (b) tier scores so judge-best-found-deep gets score>1.0. Open design call. | When a second reality run shows the same edge case persisting. |
|
||||
| **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/<bin>.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. |
|
||||
| **ADR-005 — observer fail-safe semantics** | Observer ported but the upstream "verdict:accept on crash" anti-pattern still has no Go-side decision locked. Doc-only, ~30 min. | Before observer is wired into production paths. |
|
||||
| **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. |
|
||||
| **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. |
|
||||
| **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. |
|
||||
@ -199,6 +198,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition
|
||||
| `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) |
|
||||
| `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review |
|
||||
| `e4ee002` | `scripts/scrum_review.sh` — reusable 3-lineage driver |
|
||||
| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) |
|
||||
| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) |
|
||||
| (next) | ADR-005: observer fail-safe semantics (this commit) |
|
||||
|
||||
Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds).
|
||||
|
||||
|
||||
@ -359,6 +359,144 @@ in-memory only (matches vectord G1's pattern).
|
||||
|
||||
---
|
||||
|
||||
(Future ADRs from ADR-005 onward will be added as the Go
|
||||
implementation accrues design decisions — e.g. observer fail-safe
|
||||
semantics, distillation rebuild, gRPC adapter wire format, etc.)
|
||||
## ADR-005: Observer fail-safe semantics
|
||||
|
||||
**Date:** 2026-04-30
|
||||
**Status:** RATIFIED
|
||||
**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`
|
||||
|
||||
The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
|
||||
when the observer crashed mid-evaluation, the upstream interpreted the
|
||||
missing verdict as implicit acceptance. Several silent regressions traced
|
||||
to it. The Go observer's role is structurally different — it is a
|
||||
**witness** (records what happened) rather than a **gate** (decides
|
||||
accept/reject) — but adjacent fail-safe decisions still need locking
|
||||
now that observerd is on the prod-realistic stack via the lift harness
|
||||
(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior
|
||||
and locks the rationale so future consumers don't break the invariant
|
||||
by flipping the defaults.
|
||||
|
||||
### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
|
||||
|
||||
Already implemented (`internal/observer/store.go:60-67`). Locked:
|
||||
|
||||
- If `persistor.Append` fails, log a warning and continue. Do NOT
|
||||
return an error to the caller of `Store.Record`.
|
||||
- The in-memory ring buffer is the source of truth in flight; the
|
||||
JSONL is a best-effort durability shadow.
|
||||
- Operators who need fail-closed audit-grade trails configure that
|
||||
mode through a future opt-in (deferred to a later ADR; not the
|
||||
G0/G1/G2 default).
|
||||
|
||||
**Why fail-open here:** the observer's job is to keep recording even
|
||||
when the disk hiccups. A `persist-fail-fatal` mode would translate
|
||||
every transient I/O blip into an observer-blackout, which is strictly
|
||||
worse for the witness role than missing a few persisted entries — the
|
||||
ring still has them, and operators can drain it on restart.
|
||||
|
||||
**Why this isn't the Rust anti-pattern:** the Go observer doesn't
|
||||
emit verdicts. A persist failure here means "we recorded fewer rows
|
||||
on disk than in memory," not "we accepted something we shouldn't have."
|
||||
|
||||
### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path
|
||||
|
||||
Already implemented (`internal/workflow/runner.go`). Locked:
|
||||
|
||||
- Mode errors are caught by the runner and surfaced via the node's
|
||||
`Error` field; `Success` is the boolean derived from `Error == ""`.
|
||||
- `observerd` records an `ObservedOp` per node with `Success: false`
|
||||
and the error string when a mode fails.
|
||||
- Cycles, missing-deps, and unknown modes are aborting errors → 4xx
|
||||
from `/observer/workflow/run` with the failure encoded in the JSON
|
||||
response.
|
||||
|
||||
**Why this is the explicit anti-Rust:** allowing a mode to silently
|
||||
swallow its panic and report `Success: true` is exactly how the Rust
|
||||
"verdict:accept on crash" pattern manifests. Forcing the runner to
|
||||
record `Success: false` on error makes the failure observable to
|
||||
downstream consumers (observerd queries, scrum review, distillation
|
||||
selection) instead of laundering it into a fake success.
|
||||
|
||||
### Decision 5.3 — Provenance is one-row-per-node, recorded post-run
|
||||
|
||||
Already implemented (`cmd/observerd/main.go:140-154`). Locked:
|
||||
|
||||
- `runner.Run` returns the full `RunResult` with per-node Success/Error;
|
||||
`handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an
|
||||
`ObservedOp` per node.
|
||||
- One row per node, NOT a single per-workflow catch-all. A workflow with
|
||||
N nodes produces N audit rows.
|
||||
- Crash semantics:
|
||||
- Crash *during* `runner.Run` → no provenance recorded; queries see
|
||||
absence, not a false acceptance.
|
||||
- Crash *during* the recording loop → some nodes recorded, some
|
||||
absent; queries see partial provenance, again not a false
|
||||
acceptance.
|
||||
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
|
||||
|
||||
**Why one row per node:** debugging a partial workflow is a one-grep
|
||||
operation when each node has its own row. A single catch-all row would
|
||||
be exactly the Rust anti-pattern surface — "we accepted this workflow"
|
||||
records that survive partial crashes look identical to genuine
|
||||
acceptances. Per-node-row makes that structurally impossible.
|
||||
|
||||
**Known gap, not yet a follow-up ADR:** recording happens after
|
||||
`runner.Run` returns, not as each node completes. A long workflow with
|
||||
late-stage failure currently records nodes that already finished only
|
||||
once the runner returns. For G0/G1/G2 substrate this is fine —
|
||||
workflows are short. When workflows get long enough that mid-run
|
||||
visibility matters, a streaming-record callback is the right shape.
|
||||
|
||||
### Decision 5.4 — `/observer/event` accepts even when the ring is full
|
||||
|
||||
Already implemented via `Store.Record`'s shift-left eviction. Locked:
|
||||
|
||||
- Ring overflow is normal operation: oldest evicted, newest accepted.
|
||||
- 200 OK from `/observer/event` means "we accepted into the ring"; it
|
||||
does NOT promise "we persisted." Persistence remains best-effort
|
||||
per Decision 5.1.
|
||||
- 4xx is reserved for malformed `ObservedOp` payloads (validation
|
||||
failures).
|
||||
|
||||
**Why accept-on-full:** treating a full ring as a 503 would translate
|
||||
every brief activity burst into client errors, which is exactly the
|
||||
wrong direction for an audit witness — the witness's job is to never
|
||||
refuse to write, only to lose oldest data when capacity binds.
|
||||
|
||||
### Alternatives considered
|
||||
|
||||
- **Persist-required mode** — caller-configurable fail-closed for
|
||||
audit-grade workloads. The right approach when this lands is an
|
||||
opt-in on `Store` construction, leaving the default fail-open.
|
||||
Deferred to a future ADR.
|
||||
- **Distributed ring with WAL** — persist before accept-into-ring,
|
||||
sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight
|
||||
source of truth" property.
|
||||
- **Mode-result schema with explicit verdict field** — would force
|
||||
every mode to declare accept/reject. Overengineered for the witness
|
||||
role and reintroduces the gate-vs-witness confusion this ADR is
|
||||
trying to avoid.
|
||||
|
||||
### What this ADR does NOT do
|
||||
|
||||
- **No retention policy.** "How long do we keep observer entries on
|
||||
disk?" is a separate operations decision.
|
||||
- **No mode-level retry.** If a mode fails, the runner records that
|
||||
and moves on. Whether to retry is a workflow-definition concern
|
||||
(Archon-style retry policies in the YAML), not the runner's.
|
||||
- **No cross-process recovery.** A crashed observerd loses the ring;
|
||||
the persistor preserves what it managed to write. Operators read the
|
||||
JSONL after restart, not query a dead daemon.
|
||||
- **No persist-required opt-in.** Mentioned in alternatives; lands in
|
||||
a separate ADR when an audit-grade consumer requires it.
|
||||
|
||||
### How this closes the OPEN list
|
||||
|
||||
STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired
|
||||
into production paths. The 2026-04-30 lift run wired observerd into the
|
||||
prod-realistic harness boot, which means observer is now on the data
|
||||
path for every reality test workflow. This ADR locks the fail-safe
|
||||
invariants before the next consumer (scrum runner, distillation rebuild,
|
||||
or a real production workflow) takes a hard behavioral dependency.
|
||||
|
||||
---
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user