Closes the OPEN item from STATE_OF_PLAY. Required because observerd is
now on the prod-realistic data path via the lift harness boot (b2e45f7),
so the next consumer (scrum runner / distillation rebuild / production
workflow) needs the fail-safe rationale locked, not implicit.
The Rust "verdict:accept on crash" anti-pattern doesn't translate
one-to-one to the Go observer (witness, not gate). But four adjacent
fail-safe decisions are real and live:
5.1 Persist failure is logged-not-fatal; ring is in-flight source of
truth. Persist-required mode deferred to a future opt-in ADR.
5.2 Mode failure → Success=false, no panic-swallow path. The runner
catches mode errors and surfaces them via node.Error; downstream
consumers see failures explicitly rather than as fake successes
(the Rust anti-pattern surface).
5.3 One row per node, recorded post-run. A workflow with N nodes
produces N audit rows, never a per-workflow catch-all that
survives partial crashes. Known gap: recording happens after
runner.Run returns (acceptable for short workflows; streaming
callback is the right shape when workflows get longer).
5.4 /observer/event accepts on full ring (oldest evicted). Refusing
to write would translate every burst into client errors — wrong
direction for an audit witness.
Mostly ratifies existing behavior; cross-checked claims against
actual code (caught one error in Decision 5.3 draft — recording is
post-run-batched, not per-node-as-it-completes — and the ADR now
states reality).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architecture Decision Records — Lakehouse-Go
ADRs from the Go era. Numbered fresh from 001 to start a clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's DECISIONS.md)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.
ADR-001: Foundational decisions for the Go rewrite
Date: 2026-04-28
Decided by: J
Status: Ratified — Phase G0 unblocked
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are all answered.
Decision 1.1 — DuckDB via cgo for the query engine
Decision: queryd uses marcboeker/go-duckdb (cgo bindings to
DuckDB). A pure-Go alternative was rejected.
Rationale: DuckDB reads Parquet natively, supports the SQL surface DataFusion exposed in the Rust era (CTEs, window functions, hybrid joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches — multi-engineer-month research project; high risk of correctness bugs.
- Running DuckDB as an external process — adds an operational surface and a network hop to every query.
Cgo build complexity is the accepted cost. Single-binary deploy is preserved (the cgo dependency is embedded at link time).
Does this supersede Rust ADR-001 (object storage as source of truth)? No. That ADR remains in force; the change is the engine over the storage, not the storage model.
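For orientation, a minimal sketch of opening the in-process engine through database/sql. The driver registration and read_parquet call are standard go-duckdb / DuckDB usage; the query and file name are illustrative, not queryd's actual code.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb" // registers the "duckdb" driver (cgo)
)

func main() {
	db, err := sql.Open("duckdb", "") // empty DSN = in-memory; a file path persists the catalog
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// DuckDB reads Parquet natively; the file name here is illustrative.
	var n int64
	row := db.QueryRow(`SELECT count(*) FROM read_parquet('events.parquet')`)
	if err := row.Scan(&n); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows:", n)
}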
Decision 1.2 — HTMX for the UI
Decision: Frontend is html/template + HTMX + Alpine.js,
server-rendered by cmd/gateway. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.
Rationale: The existing Lakehouse UIs (/lakehouse/ demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond go build.
The React fallback is named explicitly so it's not relitigated unless an actual UX requirement triggers it.
Decision 1.3 — Gitea hosts the new repo
Decision: Repo lives at git.agentview.dev/profit/golangLAKEHOUSE
(same Gitea server that hosts the Rust lakehouse).
Rationale: Single source of truth for repo hosting; existing
auditor tooling (lakehouse-auditor systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.
Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
Decision: The distillation v1.0.0 substrate (tag distillation-v1.0.0 at e7636f2 in the Rust repo) is not
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the quality_score enum + SFT_NEVER constant), category mapping rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: tests/fixtures/distillation/acceptance/ is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was measured against the Rust implementation. The Go implementation establishes its own reproducibility baseline.
Rationale: Bit-identical reproducibility was a measured property of a specific implementation, not a portable invariant. Re-establishing it in Go means new fixtures, new gates, new audit-baselines. This is honest about what's transferring (logic) versus what's a Rust-era artifact (the specific bit-identical hashes).
Risk: the contamination firewall is the most consequential distillation safety net. The port must be reviewed line-by-line, and the new Go fixtures must include adversarial cases that prove the firewall works in the new implementation. See SPEC §7 acceptance gates.
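For shape only, a hedged sketch of what the ported firewall check might look like in Go. Every Go type and function name here is hypothetical; only the quality_score / SFT_NEVER concepts come from the Rust substrate.

// QualityScore stands in for the Rust quality_score enum; Go names are hypothetical.
type QualityScore int

const (
	QualityUnrated QualityScore = iota
	QualitySFTNever // counterpart of the Rust SFT_NEVER constant
	QualitySFTEligible
)

// ExportRecord is a placeholder for whatever row type the SFT exporter consumes.
type ExportRecord struct {
	ID      string
	Quality QualityScore
}

// filterForSFT is the firewall: rows marked SFT_NEVER must never reach the export.
func filterForSFT(records []ExportRecord) []ExportRecord {
	var out []ExportRecord
	for _, rec := range records {
		if rec.Quality == QualitySFTNever {
			continue // contaminated row: dropped before export
		}
		out = append(out, rec)
	}
	return out
}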
Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
Decision: Go pathway memory begins with zero traces. The existing
88 Rust traces at
/home/profit/lakehouse/data/_pathway_memory/state.json are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at docs/RUST_PATHWAY_MEMORY_NOTE.md.
Rationale: The Rust pathway memory's value compounded over months of scrum cycles. Loading those traces into a Go implementation that hasn't proven its byte-matching contract risks corrupting the new substrate's signal with semantically-mismatched data. Starting clean keeps the Go pathway memory's lineage clean and lets the byte-match correctness be proven on a known input (per SPEC §3.4 G3.4.B).
The historical note records the 88 traces' value (11/11 successful replays at the time of freeze) so the Go implementation has a reference baseline to outperform.
Decision 1.6 — Auditor longitudinal signal restarts
Decision: The Rust auditor's audit_baselines.jsonl
(longitudinal drift signal accumulated across PRs #6–#13) is not
ported to Go. The Go auditor begins a fresh audit_baselines.jsonl
lineage on its first PR.
Rationale: The drift signal is anchored to specific Rust commits, verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into the Go era would be like grafting Rust-PR audit history onto the first Go PR's prologue — confusing more than informative. Restarting gives the Go auditor a clean baseline to measure drift against.
The existing Rust audit_baselines.jsonl stays in the Rust repo as a
historical record.
ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
Date: 2026-04-29
Decided by: J
Status: Implemented (commit 423a381)
storaged enforces a 256 MiB per-PUT body cap as DoS protection
(MaxBytesReader + Content-Length check). Keys under _vectors/
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.
Rationale: the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. manager.Uploader
already streams on the outbound side, so the cap is a safety gate,
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. Per-prefix preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-Put framed format to eliminate
torn writes — multi-key would reintroduce that failure mode.
Follow-up: if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level maxPutBytesFor(key) in
cmd/storaged/main.go keeps that drop-in clean.
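A sketch of that drop-in under this ADR's numbers (the prefix and caps come from the decision above; the exact code in cmd/storaged/main.go may differ):

import (
	"net/http"
	"strings"
)

const (
	defaultMaxPutBytes int64 = 256 << 20 // 256 MiB: routine traffic
	vectorMaxPutBytes  int64 = 4 << 30   // 4 GiB: _vectors/ LHV1 persistence
)

// maxPutBytesFor returns the per-PUT body cap for a given object key.
func maxPutBytesFor(key string) int64 {
	if strings.HasPrefix(key, "_vectors/") {
		return vectorMaxPutBytes
	}
	return defaultMaxPutBytes
}

// limitPutBody pairs the cap with MaxBytesReader before the handler reads the body.
func limitPutBody(w http.ResponseWriter, r *http.Request, key string) {
	r.Body = http.MaxBytesReader(w, r.Body, maxPutBytesFor(key))
}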
ADR-003: Inter-service auth posture — Bearer token + IP allowlist
Date: 2026-04-29
Decided by: J + Claude
Status: Decided — wiring deferred to Sprint 1
Decision: When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is a Bearer token loaded from
secrets-go.toml plus a configurable IP allowlist. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.
Status today (G0): zero auth middleware. Every binary binds
127.0.0.1 by default; commit 6af0520 (R-001 partial fix) refuses
non-loopback bind unless the per-service LH_<SVC>_ALLOW_NONLOOPBACK=1
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.
What gets implemented when auth lands
- secrets-go.toml adds an [auth] section:

  [auth]
  token = "..."                                  # 32+ random bytes, hex-encoded
  allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]   # CIDR list

- internal/shared/auth.go ships a single chi middleware (sketched after this list):

  func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler

  - Empty cfg.Token → middleware is a no-op (G0 dev mode).
  - Non-empty token → reject 401 unless the request carries Authorization: Bearer <token>, compared constant-time.
  - Non-empty allowed_ips → reject 403 unless r.RemoteAddr (or the first X-Forwarded-For hop, configurable) is in the CIDR set.
  - /health is exempt — load balancers + monitors need it open.

- Every cmd/<svc>/main.go adds one line:

  r.Use(shared.RequireAuth(cfg.Auth))

  Mounted before register(r) so it covers every route the binary exposes after /health.

- shared.Run startup gate: if bind is non-loopback AND cfg.Auth.Token == "", refuse to start. The implicit "localhost is the auth layer" guarantee becomes explicit when crossing the loopback boundary.
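A hedged sketch of the middleware under the assumptions above. AuthConfig's fields are illustrative, the X-Forwarded-For option is omitted, and the error bodies are placeholders; the Sprint 1 implementation may differ.

import (
	"crypto/subtle"
	"net"
	"net/http"
)

type AuthConfig struct {
	Token      string   // empty means the middleware is a no-op (G0 dev mode)
	AllowedIPs []string // CIDR strings from secrets-go.toml
}

func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
	var nets []*net.IPNet // parse the allowlist once at mount time
	for _, c := range cfg.AllowedIPs {
		if _, n, err := net.ParseCIDR(c); err == nil {
			nets = append(nets, n)
		}
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if cfg.Token == "" || r.URL.Path == "/health" {
				next.ServeHTTP(w, r) // dev mode, or the exempt health probe
				return
			}
			want := "Bearer " + cfg.Token
			if subtle.ConstantTimeCompare([]byte(r.Header.Get("Authorization")), []byte(want)) != 1 {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			if len(nets) > 0 {
				host, _, _ := net.SplitHostPort(r.RemoteAddr)
				ip := net.ParseIP(host)
				ok := false
				for _, n := range nets {
					if ip != nil && n.Contains(ip) {
						ok = true
						break
					}
				}
				if !ok {
					http.Error(w, "forbidden", http.StatusForbidden)
					return
				}
			}
			next.ServeHTTP(w, r)
		})
	}
}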
Alternatives considered
| Option | Why rejected |
|---|---|
| mTLS | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| JWT with short TTL | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| No auth, IP-allowlist only | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| OAuth2 via external IdP | Rejected for G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
Constant-time comparison + token hygiene
Token comparison must use crypto/subtle.ConstantTimeCompare —
a naive == is vulnerable to timing attacks from an attacker who
can issue many requests and measure round-trip latency. Token rotation is
operator-driven via secrets-go.toml edit + restart; G0 doesn't
need rotate-without-restart.
What this ADR does NOT do
- Does not implement the middleware. Code lands in Sprint 1.
- Does not require a token in G0 dev. Empty token → no-op. Smokes + proof harness keep working without setting tokens.
- Does not address gateway → end-user auth. Gateway terminates inter-service auth at its inbound; if end-users hit gateway from a browser, that's a different ADR (likely cookie/session, fronted by a reverse proxy that handles user auth).
How this closes audit findings
- R-001 (queryd /sql RCE-equivalent off-loopback): the bind gate prevents accidental exposure today; this ADR specifies the guardrail when intentional exposure is needed.
- R-007 (zero auth middleware): answered by the design above; R-007 stays open until the middleware is implemented but is no longer "design TBD."
- R-010 (no CORS posture): orthogonal to inter-service auth, but the RequireAuth middleware sits at the right layer to add CORS handling later (browsers don't reach inter-service routes in the current design, so CORS is also Sprint 1+ when end-user requests start landing).
ADR-004: Pathway memory data model — Mem0-style versioned traces
Date: 2026-04-29
Decided by: J + Claude
Status: Decided — substrate landing in internal/pathway/
Decision: Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via predecessor_uid so the full history is reconstructible.
Persistence is JSONL append-only with full-replay on load;
corruption recovery skips bad lines without halting startup.
Operations
| Op | Effect |
|---|---|
| Add(content, tags...) | New UID, stored fresh, replay_count=1. |
| AddIdempotent(uid, content, tags...) | If UID exists → replay_count++. Else → Add with that UID. |
| Update(uid, content) | In-place content replacement (same UID). Bumps updated_at_ns. NOT a revision — same trace, new content. |
| Revise(predecessorUID, content, tags...) | New UID with predecessor_uid set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| Retire(uid) | Sets retired=true. Excluded from Search by default; still accessible via Get and History. |
| Get(uid) | Returns the trace (including if retired); error on missing. |
| History(uid) | Walks the predecessor_uid chain backward, returns slice [self, parent, grandparent, ...]. Cycle-detected via visited-set; returns error on cycle (which only happens if the persistence file was hand-edited). |
| Search(filter) | Returns matching traces. Default excludes retired; opt in via IncludeRetired: true. Filters: tag-match, content-substring, time range. |
Why Mem0-style + Why these specific ops
- Mem0 (memory pattern from the OpenAI Memories paper / Mem0 lib) is the canonical "agent memory" interface for the same reason Markdown is the canonical text format: it's the lowest-common-denominator that the entire ecosystem assumes. Adopting it lets agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed a typo in my note." Revise is "I now believe something different than I did when I wrote this; preserve the old belief for audit." Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from surfacing in search but preserves it for history reconstruction. Delete (which we don't expose) would break references.
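A usage sketch of how an agent loop might drive those ops. Only the op names and their semantics come from the table above; the package name, constructor, and return shapes are assumptions.

// Hypothetical driver code against the ops in the table above.
st := pathway.NewStore("") // assumed constructor; empty path = in-memory only

uid, _ := st.Add(json.RawMessage(`{"belief":"retry the lift harness twice"}`), "workflow")

// The agent later changes its mind: Revise preserves the old belief for audit.
rev, _ := st.Revise(uid, json.RawMessage(`{"belief":"retry once, then escalate"}`), "workflow")

// History walks the predecessor chain backward: [rev, uid].
chain, _ := st.History(rev)
for _, tr := range chain {
	fmt.Println(tr.UID, tr.Retired, string(tr.Content))
}

// Retire the superseded root: gone from Search defaults, still reachable via Get/History.
_ = st.Retire(uid)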
Trace data shape
type Trace struct {
UID string // UUID v4 unless caller provides one
Content json.RawMessage // opaque, schema is caller's contract
PredecessorUID string // empty if root revision
CreatedAtNs int64
UpdatedAtNs int64
Retired bool
ReplayCount int // ≥1 for any stored trace
Tags []string // for Search
}
Content is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.
Persistence
JSONL append-only log under _pathway/<store_name>.jsonl. Each
mutation appends one JSON line:
{"op":"add", "trace":{...}}
{"op":"update", "uid":"…", "content":"…"}
{"op":"revise", "trace":{…}} # trace.PredecessorUID is set
{"op":"retire", "uid":"…"}
{"op":"replay", "uid":"…"} # idempotent re-add hit
On startup, replay every line in order, building in-memory state. A malformed line logs a warn and is skipped; load continues. Corruption tolerance is non-optional — partial state is better than no state for an agent substrate.
Compaction is a future concern. A 100K-trace log replays in seconds; below that scale, JSONL append is the simplest correct choice. When compaction lands, the format will be: snapshot file (full state JSON) + tail JSONL since snapshot. Detect snapshot, load it, then replay tail.
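A minimal sketch of the replay-on-load shape, assuming envelope lines like the ones above. The function and its apply callback are illustrative, not the internal/pathway code.

import (
	"bufio"
	"encoding/json"
	"log"
	"os"
)

// replayLog rebuilds in-memory state from the JSONL log, skipping malformed lines.
// The apply callback would dispatch on "op" and mutate the store's maps.
func replayLog(path string, apply func(op string, line json.RawMessage) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err // a missing log is the caller's decision; bad lines below are never fatal
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 16<<20) // traces can be large JSON blobs
	n := 0
	for sc.Scan() {
		n++
		raw := append(json.RawMessage(nil), sc.Bytes()...) // copy: the Scanner reuses its buffer
		var env struct {
			Op string `json:"op"`
		}
		if err := json.Unmarshal(raw, &env); err != nil || env.Op == "" {
			log.Printf("pathway: skipping malformed line %d: %v", n, err)
			continue // corruption tolerance: partial state beats no state
		}
		if err := apply(env.Op, raw); err != nil {
			log.Printf("pathway: skipping line %d (%s): %v", n, env.Op, err)
		}
	}
	return sc.Err()
}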
Cycle safety
UIDs are generated server-side via uuid.New() (existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.
History walks defensively anyway: visited-set tracks UIDs seen this walk; if we encounter a duplicate, return error. Protects against corruption (manual edit, bug in a future op) without constraining the happy path.
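The defensive walk is small enough to sketch against the Trace shape above; the map lookup and the choice to error on a dangling predecessor are this sketch's assumptions, not locked behavior.

// historyOf walks predecessor_uid backward with a visited-set guard.
func historyOf(byUID map[string]*Trace, uid string) ([]*Trace, error) {
	var chain []*Trace
	visited := map[string]bool{}
	for uid != "" {
		if visited[uid] {
			return nil, fmt.Errorf("pathway: revision cycle detected at %s", uid)
		}
		visited[uid] = true
		tr, ok := byUID[uid]
		if !ok {
			return nil, fmt.Errorf("pathway: missing trace %s in history chain", uid)
		}
		chain = append(chain, tr)
		uid = tr.PredecessorUID
	}
	return chain, nil
}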
Storage location
JSONL file path is configurable per store. Default:
/var/lib/lakehouse/pathway/<name>.jsonl for prod; tests use
t.TempDir(). Persistence is OPTIONAL — empty path means
in-memory only (matches vectord G1's pattern).
What this ADR does NOT do
- No HTTP surface decision. Whether cmd/pathwayd is its own binary or routes get added to cmd/vectord is the next ADR's concern. The substrate is a pure library either way.
- No vector index integration. Pathway traces can carry a vector embedding in Content (caller decides), but this ADR doesn't define how the substrate integrates with vectord's HNSW indexes. That's the staffing co-pilot's design problem when those layers compose.
- No agent-loop semantics. "When does an agent ADD vs REVISE?" is a workflow decision, not a substrate decision.
ADR-005: Observer fail-safe semantics
Date: 2026-04-30
Status: RATIFIED
Scope: internal/observer (Store, Persistor) + internal/workflow (Runner) + cmd/observerd
The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
when the observer crashed mid-evaluation, the upstream interpreted the
missing verdict as implicit acceptance. Several silent regressions traced
to it. The Go observer's role is structurally different — it is a
witness (records what happened) rather than a gate (decides
accept/reject) — but adjacent fail-safe decisions still need locking
now that observerd is on the prod-realistic stack via the lift harness
(commit b2e45f7, 2026-04-30). This ADR ratifies the current behavior
and locks the rationale so future consumers don't break the invariant
by flipping the defaults.
Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
Already implemented (internal/observer/store.go:60-67). Locked:
- If persistor.Append fails, log a warning and continue. Do NOT return an error to the caller of Store.Record.
- The in-memory ring buffer is the source of truth in flight; the JSONL is a best-effort durability shadow.
- Operators who need fail-closed, audit-grade trails configure that mode through a future opt-in (deferred to a later ADR; not the G0/G1/G2 default).
Why fail-open here: the observer's job is to keep recording even
when the disk hiccups. A persist-fail-fatal mode would translate
every transient I/O blip into an observer-blackout, which is strictly
worse for the witness role than missing a few persisted entries — the
ring still has them, and operators can drain it on restart.
Why this isn't the Rust anti-pattern: the Go observer doesn't emit verdicts. A persist failure here means "we recorded fewer rows on disk than in memory," not "we accepted something we shouldn't have."
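In code terms, the locked behavior looks roughly like this. Field names are simplified relative to internal/observer/store.go, the ObservedOp type is assumed, and the full-ring eviction step is elided here (see Decision 5.4).

// Simplified Store; the real struct and its imports (sync, log) differ in detail.
type Store struct {
	mu        sync.Mutex
	ring      []ObservedOp
	persistor Persistor // nil when persistence is disabled
}

type Persistor interface {
	Append(ObservedOp) error
}

// Record always lands the op in the ring; persistence is best-effort.
func (s *Store) Record(op ObservedOp) {
	s.mu.Lock()
	s.ring = append(s.ring, op) // eviction on a full ring elided here (see Decision 5.4)
	s.mu.Unlock()

	if s.persistor == nil {
		return
	}
	if err := s.persistor.Append(op); err != nil {
		// Logged-not-fatal: the ring still holds the entry; only the JSONL shadow is behind.
		log.Printf("observer: persist failed, continuing: %v", err)
	}
}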
Decision 5.2 — Mode failure in workflow.Runner: Success = (Error == ""), no panic-swallow path
Already implemented (internal/workflow/runner.go). Locked:
- Mode errors are caught by the runner and surfaced via the node's Error field; Success is the boolean derived from Error == "".
- observerd records an ObservedOp per node with Success: false and the error string when a mode fails.
- Cycles, missing deps, and unknown modes are aborting errors → 4xx from /observer/workflow/run with the failure encoded in the JSON response.
Why this is the explicit anti-Rust: allowing a mode to silently
swallow its panic and report Success: true is exactly how the Rust
"verdict:accept on crash" pattern manifests. Forcing the runner to
record Success: false on error makes the failure observable to
downstream consumers (observerd queries, scrum review, distillation
selection) instead of laundering it into a fake success.
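A sketch of that invariant as a runner might enforce it. Whether the real runner recovers panics exactly this way is an assumption; the locked rule the code encodes is that Success is derived from Error == "" and never set independently.

type NodeResult struct {
	Name    string
	Error   string // empty means the mode ran clean
	Success bool   // always Error == ""
}

// runNode: a mode error or panic always lands in Error; Success is derived at the end.
func runNode(name string, mode func() error) (res NodeResult) {
	res.Name = name
	defer func() {
		if r := recover(); r != nil {
			res.Error = fmt.Sprintf("panic: %v", r) // no panic-swallow path
		}
		res.Success = res.Error == ""
	}()
	if err := mode(); err != nil {
		res.Error = err.Error()
	}
	return
}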
Decision 5.3 — Provenance is one-row-per-node, recorded post-run
Already implemented (cmd/observerd/main.go:140-154). Locked:
- runner.Run returns the full RunResult with per-node Success/Error; handleWorkflowRun then iterates res.Nodes and calls store.Record with an ObservedOp per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with N nodes produces N audit rows.
- Crash semantics:
  - Crash during runner.Run → no provenance recorded; queries see absence, not a false acceptance.
  - Crash during the recording loop → some nodes recorded, some absent; queries see partial provenance, again not a false acceptance.
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
Why one row per node: debugging a partial workflow is a one-grep operation when each node has its own row. A single catch-all row would be exactly the Rust anti-pattern surface — "we accepted this workflow" records that survive partial crashes look identical to genuine acceptances. Per-node rows make that structurally impossible.
Known gap, not yet a follow-up ADR: recording happens after
runner.Run returns, not as each node completes. A long workflow with
late-stage failure currently records nodes that already finished only
once the runner returns. For G0/G1/G2 substrate this is fine —
workflows are short. When workflows get long enough that mid-run
visibility matters, a streaming-record callback is the right shape.
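A sketch of the post-run recording loop this decision describes. The handler fragment, the respondError/respondJSON helpers, and the ObservedOp fields are approximations, not the literal cmd/observerd/main.go.

// Post-run shape: nothing is recorded until runner.Run returns.
res, err := runner.Run(def)
if err != nil {
	// Aborting errors (cycle, missing dep, unknown mode) → 4xx to the caller;
	// no provenance rows are written, so queries see absence, never acceptance.
	respondError(w, http.StatusBadRequest, err) // hypothetical helper
	return
}
for _, node := range res.Nodes {
	store.Record(ObservedOp{ // one row per node, written only after the run finished
		Workflow: def.Name,
		Node:     node.Name,
		Success:  node.Success,
		Error:    node.Error,
	})
}
respondJSON(w, http.StatusOK, res) // hypothetical helper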
Decision 5.4 — /observer/event accepts even when the ring is full
Already implemented via Store.Record's shift-left eviction. Locked:
- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from /observer/event means "we accepted into the ring"; it does NOT promise "we persisted." Persistence remains best-effort per Decision 5.1.
- 4xx is reserved for malformed ObservedOp payloads (validation failures).
Why accept-on-full: treating a full ring as a 503 would translate every brief activity burst into client errors, which is exactly the wrong direction for an audit witness — the witness's job is to never refuse to write, only to lose oldest data when capacity binds.
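From the HTTP side the contract is small. A sketch follows; the specific validation rule (a non-empty node name) is chosen for illustration, and the ObservedOp fields are assumed.

func handleObserverEvent(store *Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var op ObservedOp
		if err := json.NewDecoder(r.Body).Decode(&op); err != nil || op.Node == "" {
			http.Error(w, "malformed ObservedOp", http.StatusBadRequest) // the only 4xx path
			return
		}
		store.Record(op)             // never refuses: a full ring evicts its oldest entry instead
		w.WriteHeader(http.StatusOK) // means "in the ring", not "persisted"
	}
}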
Alternatives considered
- Persist-required mode — caller-configurable fail-closed for audit-grade workloads. The right approach when this lands is an opt-in on Store construction, leaving the default fail-open. Deferred to a future ADR.
- Distributed ring with WAL — persist before accept-into-ring, sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight source of truth" property.
- Mode-result schema with explicit verdict field — would force every mode to declare accept/reject. Overengineered for the witness role and reintroduces the gate-vs-witness confusion this ADR is trying to avoid.
What this ADR does NOT do
- No retention policy. "How long do we keep observer entries on disk?" is a separate operations decision.
- No mode-level retry. If a mode fails, the runner records that and moves on. Whether to retry is a workflow-definition concern (Archon-style retry policies in the YAML), not the runner's.
- No cross-process recovery. A crashed observerd loses the ring; the persistor preserves what it managed to write. Operators read the JSONL after restart, not query a dead daemon.
- No persist-required opt-in. Mentioned in alternatives; lands in a separate ADR when an audit-grade consumer requires it.
How this closes the OPEN list
STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired into production paths. The 2026-04-30 lift run wired observerd into the prod-realistic harness boot, which means observer is now on the data path for every reality test workflow. This ADR locks the fail-safe invariants before the next consumer (scrum runner, distillation rebuild, or a real production workflow) takes a hard behavioral dependency.