
503 lines
23 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Architecture Decision Records — Lakehouse-Go
ADRs from the Go era. Numbered fresh from 001 to start a clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.

---
## ADR-001: Foundational decisions for the Go rewrite
**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
all answered.
### Decision 1.1 — DuckDB via cgo for the query engine
**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
DuckDB). Pure-Go alternative was rejected.
**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches —
multi-engineer-month research project; high risk of correctness
bugs.
- Running DuckDB as an external process — adds an operational surface
and a network hop to every query.
Cgo build complexity is the accepted cost. Single-binary deploy
preserved (the cgo dependency embeds at link time).
**Supersedes Rust ADR-001** (object storage as source of truth)? No.
That ADR remains in force; what changes is the *engine* over the
storage, not the storage model.
### Decision 1.2 — HTMX for the UI
**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.
**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond `go build`.
The React fallback is named explicitly so it's not relitigated unless
an actual UX requirement triggers it.
### Decision 1.3 — Gitea hosts the new repo
**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
(same Gitea server that hosts the Rust lakehouse).
**Rationale:** Single source of truth for repo hosting; existing
auditor tooling (`lakehouse-auditor` systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.
### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
**Decision:** The distillation v1.0.0 substrate (`tag
distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
`quality_score` enum + `SFT_NEVER` constant), category mapping
rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was
measured against the Rust implementation. The Go implementation
establishes its own reproducibility baseline.
**Rationale:** Bit-identical reproducibility was a measured property
of a specific implementation, not a portable invariant. Re-establishing
it in Go means new fixtures, new gates, new audit-baselines. This is
honest about what's transferring (logic) versus what's a Rust-era
artifact (the specific bit-identical hashes).
**Risk:** the contamination firewall is the most consequential
distillation safety net. The port must be reviewed line-by-line, and
the new Go fixtures must include adversarial cases that prove the
firewall works in the new implementation. See SPEC §7 acceptance gates.
### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
**Decision:** Go pathway memory begins with zero traces. The existing
88 Rust traces at
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.
**Rationale:** The Rust pathway memory's value compounded over months
of scrum cycles. Loading those traces into a Go implementation that
hasn't proven its byte-matching contract risks corrupting the new
substrate's signal with semantically-mismatched data. Starting clean
keeps the Go pathway memory's lineage clean and lets the byte-match
correctness be proven on a known input (per SPEC §3.4 G3.4.B).
The historical note records the 88 traces' value (11/11 successful
replays at the time of freeze) so the Go implementation has a
reference baseline to outperform.
### Decision 1.6 — Auditor longitudinal signal restarts
**Decision:** The Rust auditor's `audit_baselines.jsonl`
(longitudinal drift signal accumulated across PRs #6–#13) is **not**
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
lineage on its first PR.
**Rationale:** The drift signal is anchored to specific Rust commits,
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
the Go era would be like grafting Rust-PR audit history onto the first
Go PR's prologue — confusing more than informative. Restarting gives
the Go auditor a clean baseline to measure drift against.
The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
historical record.

---
## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)
`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.
**Rationale:** the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
already streams on the outbound side, so the cap is a safety gate,
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. Per-prefix preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-Put framed format to eliminate
torn writes — multi-key would re-introduce that failure mode.
**Follow-up:** if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level `maxPutBytesFor(key)` in
`cmd/storaged/main.go` keeps that drop-in clean.
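The per-prefix dispatch can be sketched as follows. This is a sketch of the shape described above, not the actual `cmd/storaged` code; only the `maxPutBytesFor` name and the two cap values come from this ADR.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	defaultMaxPutBytes int64 = 256 << 20 // routine 256 MiB DoS-protection cap
	vectorMaxPutBytes  int64 = 4 << 30   // raised 4 GiB cap for single-file LHV1
)

// maxPutBytesFor returns the PUT body cap for a storage key: keys under
// the _vectors/ prefix get the raised LHV1 cap, everything else keeps
// the routine cap.
func maxPutBytesFor(key string) int64 {
	if strings.HasPrefix(key, "_vectors/") {
		return vectorMaxPutBytes
	}
	return defaultMaxPutBytes
}

func main() {
	fmt.Println(maxPutBytesFor("_vectors/staffing.lhv1")) // 4294967296
	fmt.Println(maxPutBytesFor("tables/orders.parquet"))  // 268435456
}
```

Keeping the dispatch in one pure function is what makes the follow-up (operator-driven config) a drop-in: only this function's body changes.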

---
## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1
**Decision:** When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is **a Bearer token loaded from
`secrets-go.toml` plus a configurable IP allowlist**. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.
**Status today (G0):** zero auth middleware. Every binary binds
`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
non-loopback bind unless the per-service `LH_<SVC>_ALLOW_NONLOOPBACK=1`
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.
### What gets implemented when auth lands
1. **`secrets-go.toml` adds a `[auth]` section:**
```toml
[auth]
token = "..." # 32+ random bytes, hex-encoded
allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"] # CIDR list
```
2. **`internal/shared/auth.go`** ships a single chi middleware:
```go
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
```
- Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
- Non-empty token → reject with 401 unless the request carries
`Authorization: Bearer <token>`, compared in constant time.
- Non-empty `allowed_ips` → reject with 403 unless `r.RemoteAddr` (or
the first `X-Forwarded-For` hop, configurable) is in the CIDR set.
- `/health` is exempt — load balancers + monitors need it open.
3. **Every `cmd/<svc>/main.go` adds one line:**
```go
r.Use(shared.RequireAuth(cfg.Auth))
```
Mounted before `register(r)` so it covers every route the binary
exposes except `/health`.
4. **`shared.Run` startup gate:** if bind is non-loopback AND
`cfg.Auth.Token == ""`, refuse to start. The implicit
"localhost is the auth layer" guarantee becomes explicit when
crossing the loopback boundary.
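The startup gate can be sketched like this. The `bindAllowed` helper is invented for illustration; the rule it encodes (non-loopback bind plus empty token refuses to start) is the one this section specifies.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// bindAllowed sketches the shared.Run startup gate: a non-loopback bind
// address with an empty auth token is refused at startup, so the
// "localhost is the auth layer" assumption can never silently leak past
// the loopback boundary.
func bindAllowed(addr, token string) error {
	host := addr
	if h, _, err := net.SplitHostPort(addr); err == nil {
		host = h
	}
	ip := net.ParseIP(host)
	loopback := ip != nil && ip.IsLoopback()
	if !loopback && strings.TrimSpace(token) == "" {
		return fmt.Errorf("refusing non-loopback bind %s with empty auth token", addr)
	}
	return nil
}

func main() {
	fmt.Println(bindAllowed("127.0.0.1:9301", "")) // nil: loopback needs no token
	fmt.Println(bindAllowed("0.0.0.0:9301", ""))   // error: the gate fires
}
```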
### Alternatives considered
| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
### Constant-time comparison + token hygiene
Token comparison must use `crypto/subtle.ConstantTimeCompare` —
naive `==` leaks timing information to an attacker who can issue
many requests and measure round-trip latency. Token rotation is
operator-driven via `secrets-go.toml` edit + restart; G0 doesn't
need rotate-without-restart.
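A minimal sketch of the middleware shape with the constant-time check, under the assumptions above. The `tokenOK` helper is invented for illustration; the IP allowlist and `/health` exemption are omitted for brevity, and only the `subtle.ConstantTimeCompare` call is the point this section makes.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"strings"
)

type AuthConfig struct {
	Token string // empty in G0 dev mode
}

// tokenOK checks an Authorization header value against the configured
// token in constant time. Empty configured token means auth is a no-op.
func tokenOK(cfg AuthConfig, header string) bool {
	if cfg.Token == "" {
		return true // G0 dev mode
	}
	got := strings.TrimPrefix(header, "Bearer ")
	return subtle.ConstantTimeCompare([]byte(got), []byte(cfg.Token)) == 1
}

// RequireAuth sketches the single chi-compatible middleware this ADR
// specifies; chi middleware is just func(http.Handler) http.Handler.
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !tokenOK(cfg, r.Header.Get("Authorization")) {
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

func main() {
	fmt.Println(tokenOK(AuthConfig{Token: "s3cret"}, "Bearer s3cret")) // true
	fmt.Println(tokenOK(AuthConfig{Token: "s3cret"}, "Bearer wrong"))  // false
}
```

Note that `ConstantTimeCompare` returns 0 immediately on length mismatch, which leaks token length but not content — acceptable for a random 32+ byte token.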
### What this ADR does NOT do
- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require token in G0 dev.** Empty token → no-op. Smokes
+ proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** Gateway terminates
inter-service auth at its inbound; if end-users hit gateway from
a browser, that's a different ADR (likely cookie/session, fronted
by a reverse proxy that handles user auth).
### How this closes audit findings
- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
gate prevents accidental exposure today; this ADR specifies the
guardrail when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above;
R-007 stays open until the middleware is implemented but is no
longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth,
but the `RequireAuth` middleware sits at the right layer to add
CORS handling later (browsers don't reach inter-service routes
in the current design, so CORS is also Sprint 1+ when end-user
requests start landing).

---
## ADR-004: Pathway memory data model — Mem0-style versioned traces
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — substrate landing in `internal/pathway/`
**Decision:** Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via `predecessor_uid` so the full history is reconstructible.
Persistence is JSONL append-only with full-replay on load;
corruption recovery skips bad lines without halting startup.
### Operations
| Op | Effect |
|---|---|
| `Add(content, tags...)` | New UID, stored fresh, replay_count=1. |
| `AddIdempotent(uid, content, tags...)` | If UID exists → replay_count++. Else → Add with that UID. |
| `Update(uid, content)` | In-place content replacement (same UID). Bumps `updated_at_ns`. NOT a revision — same trace, new content. |
| `Revise(predecessorUID, content, tags...)` | New UID with `predecessor_uid` set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| `Retire(uid)` | Sets `retired=true`. Excluded from `Search` by default; still accessible via `Get` and `History`. |
| `Get(uid)` | Returns the trace (including if retired); error on missing. |
| `History(uid)` | Walks `predecessor_uid` chain backward, returns slice [self, parent, grandparent, ...]. Cycle-detected via visited-set; returns error on cycle (which only happens if persistence file was hand-edited). |
| `Search(filter)` | Returns matching traces. Default excludes retired; opt in via `IncludeRetired: true`. Filters: tag-match, content-substring, time range. |
### Why Mem0-style + Why these specific ops
- **Mem0** (memory pattern from the OpenAI Memories paper / Mem0 lib)
is the canonical "agent memory" interface for the same reason
Markdown is the canonical text format: it's the lowest-common-denominator
that the entire ecosystem assumes. Adopting it lets
agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed
a typo in my note." Revise is "I now believe something different
than I did when I wrote this; preserve the old belief for audit."
Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from
surfacing in search but preserves it for history reconstruction.
Delete (which we don't expose) would break references.
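The Update-vs-Revise distinction can be made concrete with a toy store. This is an illustration only — a plain map, not the real substrate — and mirrors just the semantics above: Update keeps the UID and replaces content; Revise mints a new UID chained via `PredecessorUID`, leaving the old belief intact.

```go
package main

import "fmt"

// Toy trace and store for illustrating Update vs Revise semantics.
type Trace struct {
	UID, Content, PredecessorUID string
}

type store map[string]*Trace

// Update: same trace, new content — "I noticed a typo in my note."
func (s store) Update(uid, content string) { s[uid].Content = content }

// Revise: new trace pointing back at the old one — "I now believe
// something different; preserve the old belief for audit."
func (s store) Revise(pred, newUID, content string) {
	s[newUID] = &Trace{UID: newUID, Content: content, PredecessorUID: pred}
}

func main() {
	s := store{"t1": {UID: "t1", Content: "belief v1"}}
	s.Update("t1", "belief v1 (typo fixed)") // t1 keeps its UID
	s.Revise("t1", "t2", "belief v2")        // t1 preserved, t2 chains back
	fmt.Println(len(s), s["t2"].PredecessorUID)
}
```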
### Trace data shape
```go
type Trace struct {
UID string // UUID v4 unless caller provides one
Content json.RawMessage // opaque, schema is caller's contract
PredecessorUID string // empty if root revision
CreatedAtNs int64
UpdatedAtNs int64
Retired bool
ReplayCount int // ≥1 for any stored trace
Tags []string // for Search
}
```
`Content` is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.
### Persistence
JSONL append-only log under `_pathway/<store_name>.jsonl`. Each
mutation appends one JSON line:
```
{"op":"add", "trace":{...}}
{"op":"update", "uid":"...", "content":"..."}
{"op":"revise", "trace":{...}} # trace.PredecessorUID is set
{"op":"retire", "uid":"..."}
{"op":"replay", "uid":"..."} # idempotent re-add hit
```
On startup, replay every line in order, building in-memory state.
A malformed line logs a warning and is skipped; load continues.
Corruption tolerance is non-optional — partial state is better
than no state for an agent substrate.
Compaction is a future concern. A 100K-trace log replays in
seconds; below that scale, JSONL append is the simplest correct
choice. When compaction lands, the format will be: snapshot file
(full state JSON) + tail JSONL since snapshot. Detect snapshot,
load it, then replay tail.
### Cycle safety
UIDs are generated server-side via `uuid.New()` (existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.
History walks defensively anyway: visited-set tracks UIDs seen
this walk; if we encounter a duplicate, return error. Protects
against corruption (manual edit, bug in a future op) without
constraining the happy path.
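The defensive History walk can be sketched as follows, with traces in a plain map keyed by UID (the real substrate's storage is not specified here; only the visited-set logic is the point).

```go
package main

import (
	"errors"
	"fmt"
)

// Trace is trimmed to the two fields the walk needs.
type Trace struct {
	UID            string
	PredecessorUID string // empty if root revision
}

// History walks the predecessor chain backward, returning
// [self, parent, grandparent, ...]. A visited-set detects cycles,
// which only occur if the persistence file was hand-edited.
func History(traces map[string]Trace, uid string) ([]Trace, error) {
	var chain []Trace
	visited := map[string]bool{}
	for uid != "" {
		if visited[uid] {
			return nil, errors.New("pathway: cycle detected in predecessor chain")
		}
		visited[uid] = true
		t, ok := traces[uid]
		if !ok {
			return nil, errors.New("pathway: missing trace " + uid)
		}
		chain = append(chain, t)
		uid = t.PredecessorUID
	}
	return chain, nil
}

func main() {
	traces := map[string]Trace{
		"a": {UID: "a"},
		"b": {UID: "b", PredecessorUID: "a"},
	}
	chain, _ := History(traces, "b")
	fmt.Println(len(chain)) // 2
}
```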
### Storage location
JSONL file path is configurable per store. Default:
`/var/lib/lakehouse/pathway/<name>.jsonl` for prod; tests use
`t.TempDir()`. Persistence is OPTIONAL — empty path means
in-memory only (matches vectord G1's pattern).
### What this ADR does NOT do
- **No HTTP surface decision.** Whether `cmd/pathwayd` is its own
binary or routes get added to `cmd/vectord` is the next ADR's
concern. The substrate is a pure library either way.
- **No vector index integration.** Pathway traces can carry a
vector embedding in `Content` (caller decides), but this ADR
doesn't define how the substrate integrates with `vectord`'s
HNSW indexes. That's the staffing co-pilot's design problem
when those layers compose.
- **No agent-loop semantics.** "When does an agent ADD vs
REVISE?" is a workflow decision, not a substrate decision.

---
## ADR-005: Observer fail-safe semantics
**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`
The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
when the observer crashed mid-evaluation, the upstream interpreted the
missing verdict as implicit acceptance. Several silent regressions were
traced to it. The Go observer's role is structurally different — it is a
**witness** (records what happened) rather than a **gate** (decides
accept/reject) — but adjacent fail-safe decisions still need locking
now that observerd is on the prod-realistic stack via the lift harness
(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior
and locks the rationale so future consumers don't break the invariant
by flipping the defaults.
### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth
Already implemented (`internal/observer/store.go:60-67`). Locked:
- If `persistor.Append` fails, log a warning and continue. Do NOT
return an error to the caller of `Store.Record`.
- The in-memory ring buffer is the source of truth in flight; the
JSONL is a best-effort durability shadow.
- Operators who need fail-closed audit-grade trails configure that
mode through a future opt-in (deferred to a later ADR; not the
G0/G1/G2 default).
**Why fail-open here:** the observer's job is to keep recording even
when the disk hiccups. A `persist-fail-fatal` mode would translate
every transient I/O blip into an observer-blackout, which is strictly
worse for the witness role than missing a few persisted entries — the
ring still has them, and operators can drain it on restart.
**Why this isn't the Rust anti-pattern:** the Go observer doesn't
emit verdicts. A persist failure here means "we recorded fewer rows
on disk than in memory," not "we accepted something we shouldn't have."
### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path
Already implemented (`internal/workflow/runner.go`). Locked:
- Mode errors are caught by the runner and surfaced via the node's
`Error` field; `Success` is the boolean derived from `Error == ""`.
- `observerd` records an `ObservedOp` per node with `Success: false`
and the error string when a mode fails.
- Cycles, missing-deps, and unknown modes are aborting errors → 4xx
from `/observer/workflow/run` with the failure encoded in the JSON
response.
**Why this is the explicit anti-Rust:** allowing a mode to silently
swallow its panic and report `Success: true` is exactly how the Rust
"verdict:accept on crash" pattern manifests. Forcing the runner to
record `Success: false` on error makes the failure observable to
downstream consumers (observerd queries, scrum review, distillation
selection) instead of laundering it into a fake success.
### Decision 5.3 — Provenance is one-row-per-node, recorded post-run
Already implemented (`cmd/observerd/main.go:140-154`). Locked:
- `runner.Run` returns the full `RunResult` with per-node Success/Error;
`handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an
`ObservedOp` per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with
N nodes produces N audit rows.
- Crash semantics:
- Crash *during* `runner.Run` → no provenance recorded; queries see
absence, not a false acceptance.
- Crash *during* the recording loop → some nodes recorded, some
absent; queries see partial provenance, again not a false
acceptance.
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.
**Why one row per node:** debugging a partial workflow is a one-grep
operation when each node has its own row. A single catch-all row would
be exactly the Rust anti-pattern surface — "we accepted this workflow"
records that survive partial crashes look identical to genuine
acceptances. Per-node-row makes that structurally impossible.
**Known gap, not yet a follow-up ADR:** recording happens after
`runner.Run` returns, not as each node completes. In a long workflow
with a late-stage failure, nodes that finished early are still recorded
only once the runner returns. For the G0/G1/G2 substrate this is fine —
workflows are short. When workflows get long enough that mid-run
visibility matters, a streaming-record callback is the right shape.
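The post-run recording loop has this shape (types and the `recordProvenance` helper are illustrative, not the actual `cmd/observerd` code): iterate the per-node results and record one audit row each, never a per-workflow catch-all.

```go
package main

import "fmt"

// Illustrative result shapes for the post-run recording loop.
type NodeResult struct {
	Name    string
	Success bool
	Error   string
}

type RunResult struct{ Nodes []NodeResult }

// recordProvenance records one row per node after runner.Run returns:
// N nodes -> N audit rows, per Decision 5.3.
func recordProvenance(res RunResult, record func(row string)) int {
	for _, n := range res.Nodes {
		record(fmt.Sprintf("node=%s success=%t err=%q", n.Name, n.Success, n.Error))
	}
	return len(res.Nodes)
}

func main() {
	res := RunResult{Nodes: []NodeResult{
		{Name: "fetch", Success: true},
		{Name: "distill", Error: "mode exploded"},
	}}
	var rows []string
	recordProvenance(res, func(row string) { rows = append(rows, row) })
	fmt.Println(len(rows)) // 2
}
```

A crash between loop iterations leaves some rows recorded and some absent — partial provenance, which queries see as absence, never as acceptance.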
### Decision 5.4 — `/observer/event` accepts even when the ring is full
Already implemented via `Store.Record`'s shift-left eviction. Locked:
- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from `/observer/event` means "we accepted into the ring"; it
does NOT promise "we persisted." Persistence remains best-effort
per Decision 5.1.
- 4xx is reserved for malformed `ObservedOp` payloads (validation
failures).
**Why accept-on-full:** treating a full ring as a 503 would translate
every brief activity burst into client errors, which is exactly the
wrong direction for an audit witness — the witness's job is to never
refuse a write, only to shed the oldest data when capacity binds.
### Alternatives considered
- **Persist-required mode** — caller-configurable fail-closed for
audit-grade workloads. The right approach when this lands is an
opt-in on `Store` construction, leaving the default fail-open.
Deferred to a future ADR.
- **Distributed ring with WAL** — persist before accept-into-ring,
sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight
source of truth" property.
- **Mode-result schema with explicit verdict field** — would force
every mode to declare accept/reject. Overengineered for the witness
role and reintroduces the gate-vs-witness confusion this ADR is
trying to avoid.
### What this ADR does NOT do
- **No retention policy.** "How long do we keep observer entries on
disk?" is a separate operations decision.
- **No mode-level retry.** If a mode fails, the runner records that
and moves on. Whether to retry is a workflow-definition concern
(Archon-style retry policies in the YAML), not the runner's.
- **No cross-process recovery.** A crashed observerd loses the ring;
the persistor preserves what it managed to write. Operators read the
JSONL after restart, not query a dead daemon.
- **No persist-required opt-in.** Mentioned in alternatives; lands in
a separate ADR when an audit-grade consumer requires it.
### How this closes the OPEN list
STATE_OF_PLAY listed ADR-005 as a doc-only gate before observer wired
into production paths. The 2026-04-30 lift run wired observerd into the
prod-realistic harness boot, which means observer is now on the data
path for every reality test workflow. This ADR locks the fail-safe
invariants before the next consumer (scrum runner, distillation rebuild,
or a real production workflow) takes a hard behavioral dependency.

---