diff --git a/STATE_OF_PLAY.md b/STATE_OF_PLAY.md index 93c209d..4602d8c 100644 --- a/STATE_OF_PLAY.md +++ b/STATE_OF_PLAY.md @@ -179,7 +179,6 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition | **Reality test v2: paraphrase queries** | The 21 verbatim queries in `tests/reality/playbook_lift_queries.txt` exercise verbatim replay only. The interesting case is *similar but not identical* queries hitting a recorded playbook — does the cosine on `query_text` find the playbook hit? Add a paraphrase pass and measure. | After J wants to push the harness past v1 baseline. | | **Q15 boost-math edge case** | "Engaged warehouse associate with strong safety compliance" — judge picked rank-9 result; score=1.0 boost halves distance but rank-9 was >2× top-1 distance, so not promoted. Documented in caveat #2. Either (a) accept the math limit, or (b) tier scores so judge-best-found-deep gets score>1.0. Open design call. | When a second reality run shows the same edge case persisting. | | **Sprint 4 — deployment** | No `REPLICATION.md`, `secrets-go.toml.example`, `deploy/systemd/.service`, `Dockerfile`. Largest open Sprint. Required input for any G5 cutover plan. | When G5 cutover is on the table. | -| **ADR-005 — observer fail-safe semantics** | Observer ported but the upstream "verdict:accept on crash" anti-pattern still has no Go-side decision locked. Doc-only, ~30 min. | Before observer is wired into production paths. | | **ADR-006 — auth posture for non-loopback deploy** | Locks R-001 + R-007 from "opt-in middleware exists" to "wired-by-default for X, opt-in for Y." Doc-only, ~1 hr. | Required before any Go binary binds non-loopback in prod. | | **chatd fixture-mode storage half** | `g2_smoke_fixtures.sh` closed embed half via fake_ollama; storage half (mock S3) still deferred. Closes R-006 fully. | When CI box without MinIO is needed. | | **Distillation full port** | `57d0df1` shipped scorer + contamination firewall (E partial); SFT export pipeline + audit_baselines lineage not yet ported. | When distillation is needed for production. | @@ -199,6 +198,9 @@ Verbatim verdicts at `reports/scrum/_evidence/2026-04-30/verdicts/`. Disposition | `05273ac` | Phase 4: chatd + 5 providers (1,624 LoC) | | `0efc736` | Scrum: 4 fixes (B-1..B-4) + 2 INFOs from cross-lineage review | | `e4ee002` | `scripts/scrum_review.sh` — reusable 3-lineage driver | +| `b2e45f7` | playbook_lift harness expansion + reality test #001 (7/8 lift, 87.5%) | +| `6c02c90` | scrum lift_001: 4 fixes (sleep→polling SQL probe, JUDGE_SOURCE template, -id-prefix validation, chi.Router cast) | +| (next) | ADR-005: observer fail-safe semantics (this commit) | Plus on Rust side (`8de94eb`, `3d06868`): qwen2.5 → qwen3.5:latest backport in active defaults; distillation acceptance reports regenerated (run_hash refresh, reproducibility property still holds). diff --git a/docs/DECISIONS.md b/docs/DECISIONS.md index 86f3af6..4e237dd 100644 --- a/docs/DECISIONS.md +++ b/docs/DECISIONS.md @@ -359,6 +359,144 @@ in-memory only (matches vectord G1's pattern). --- -(Future ADRs from ADR-005 onward will be added as the Go -implementation accrues design decisions — e.g. observer fail-safe -semantics, distillation rebuild, gRPC adapter wire format, etc.) +## ADR-005: Observer fail-safe semantics + +**Date:** 2026-04-30 +**Status:** RATIFIED +**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd` + +The Rust legacy had a documented "verdict:accept on crash" anti-pattern: +when the observer crashed mid-evaluation, the upstream interpreted the +missing verdict as implicit acceptance. Several silent regressions traced +to it. The Go observer's role is structurally different — it is a +**witness** (records what happened) rather than a **gate** (decides +accept/reject) — but adjacent fail-safe decisions still need locking +now that observerd is on the prod-realistic stack via the lift harness +(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior +and locks the rationale so future consumers don't break the invariant +by flipping the defaults. + +### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth + +Already implemented (`internal/observer/store.go:60-67`). Locked: + +- If `persistor.Append` fails, log a warning and continue. Do NOT + return an error to the caller of `Store.Record`. +- The in-memory ring buffer is the source of truth in flight; the + JSONL is a best-effort durability shadow. +- Operators who need fail-closed audit-grade trails configure that + mode through a future opt-in (deferred to a later ADR; not the + G0/G1/G2 default). + +**Why fail-open here:** the observer's job is to keep recording even +when the disk hiccups. A `persist-fail-fatal` mode would translate +every transient I/O blip into an observer-blackout, which is strictly +worse for the witness role than missing a few persisted entries — the +ring still has them, and operators can drain it on restart. + +**Why this isn't the Rust anti-pattern:** the Go observer doesn't +emit verdicts. A persist failure here means "we recorded fewer rows +on disk than in memory," not "we accepted something we shouldn't have." + +### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path + +Already implemented (`internal/workflow/runner.go`). Locked: + +- Mode errors are caught by the runner and surfaced via the node's + `Error` field; `Success` is the boolean derived from `Error == ""`. +- `observerd` records an `ObservedOp` per node with `Success: false` + and the error string when a mode fails. +- Cycles, missing-deps, and unknown modes are aborting errors → 4xx + from `/observer/workflow/run` with the failure encoded in the JSON + response. + +**Why this is the explicit anti-Rust:** allowing a mode to silently +swallow its panic and report `Success: true` is exactly how the Rust +"verdict:accept on crash" pattern manifests. Forcing the runner to +record `Success: false` on error makes the failure observable to +downstream consumers (observerd queries, scrum review, distillation +selection) instead of laundering it into a fake success. + +### Decision 5.3 — Provenance is one-row-per-node, recorded post-run + +Already implemented (`cmd/observerd/main.go:140-154`). Locked: + +- `runner.Run` returns the full `RunResult` with per-node Success/Error; + `handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an + `ObservedOp` per node. +- One row per node, NOT a single per-workflow catch-all. A workflow with + N nodes produces N audit rows. +- Crash semantics: + - Crash *during* `runner.Run` → no provenance recorded; queries see + absence, not a false acceptance. + - Crash *during* the recording loop → some nodes recorded, some + absent; queries see partial provenance, again not a false + acceptance. +- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2. + +**Why one row per node:** debugging a partial workflow is a one-grep +operation when each node has its own row. A single catch-all row would +be exactly the Rust anti-pattern surface — "we accepted this workflow" +records that survive partial crashes look identical to genuine +acceptances. Per-node-row makes that structurally impossible. + +**Known gap, not yet a follow-up ADR:** recording happens after +`runner.Run` returns, not as each node completes. A long workflow with +late-stage failure currently records nodes that already finished only +once the runner returns. For G0/G1/G2 substrate this is fine — +workflows are short. When workflows get long enough that mid-run +visibility matters, a streaming-record callback is the right shape. + +### Decision 5.4 — `/observer/event` accepts even when the ring is full + +Already implemented via `Store.Record`'s shift-left eviction. Locked: + +- Ring overflow is normal operation: oldest evicted, newest accepted. +- 200 OK from `/observer/event` means "we accepted into the ring"; it + does NOT promise "we persisted." Persistence remains best-effort + per Decision 5.1. +- 4xx is reserved for malformed `ObservedOp` payloads (validation + failures). + +**Why accept-on-full:** treating a full ring as a 503 would translate +every brief activity burst into client errors, which is exactly the +wrong direction for an audit witness — the witness's job is to never +refuse to write, only to lose oldest data when capacity binds. + +### Alternatives considered + +- **Persist-required mode** — caller-configurable fail-closed for + audit-grade workloads. The right approach when this lands is an + opt-in on `Store` construction, leaving the default fail-open. + Deferred to a future ADR. +- **Distributed ring with WAL** — persist before accept-into-ring, + sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight + source of truth" property. +- **Mode-result schema with explicit verdict field** — would force + every mode to declare accept/reject. Overengineered for the witness + role and reintroduces the gate-vs-witness confusion this ADR is + trying to avoid. + +### What this ADR does NOT do + +- **No retention policy.** "How long do we keep observer entries on + disk?" is a separate operations decision. +- **No mode-level retry.** If a mode fails, the runner records that + and moves on. Whether to retry is a workflow-definition concern + (Archon-style retry policies in the YAML), not the runner's. +- **No cross-process recovery.** A crashed observerd loses the ring; + the persistor preserves what it managed to write. Operators read the + JSONL after restart, not query a dead daemon. +- **No persist-required opt-in.** Mentioned in alternatives; lands in + a separate ADR when an audit-grade consumer requires it. + +### How this closes the OPEN list + +STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired +into production paths. The 2026-04-30 lift run wired observerd into the +prod-realistic harness boot, which means observer is now on the data +path for every reality test workflow. This ADR locks the fail-safe +invariants before the next consumer (scrum runner, distillation rebuild, +or a real production workflow) takes a hard behavioral dependency. + +---