# Architecture Decision Records — Lakehouse-Go

ADRs from the Go era. Numbered fresh from 001 to start a clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.

---

## ADR-001: Foundational decisions for the Go rewrite
**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked

The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
all answered.

### Decision 1.1 — DuckDB via cgo for the query engine

**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
DuckDB). The pure-Go alternative was rejected.

**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
joins), and runs in-process via cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches — a
  multi-engineer-month research project with high risk of correctness
  bugs.
- Running DuckDB as an external process — adds an operational surface
  and a network hop to every query.

Cgo build complexity is the accepted cost. Single-binary deploy is
preserved (the cgo dependency embeds at link time).

**Does not supersede Rust ADR-001** (object storage as source of
truth). That ADR remains in force; the change is the *engine* over the
storage, not the storage model.

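For a sense of the shape this takes (an illustrative fragment, not
queryd's actual code; the Parquet path is hypothetical):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb" // registers the "duckdb" driver
)

func main() {
	// In-process, in-memory DuckDB: no external service, no network hop.
	db, err := sql.Open("duckdb", "")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// DuckDB reads Parquet natively.
	var n int64
	row := db.QueryRow(`SELECT count(*) FROM read_parquet('events.parquet')`)
	if err := row.Scan(&n); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows:", n)
}
```
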
### Decision 1.2 — HTMX for the UI

**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.

**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond `go build`.

The React fallback is named explicitly so it's not relitigated unless
an actual UX requirement triggers it.

### Decision 1.3 — Gitea hosts the new repo

**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
(same Gitea server that hosts the Rust lakehouse).

**Rationale:** Single source of truth for repo hosting; existing
auditor tooling (`lakehouse-auditor` systemd service) already speaks
the Gitea API; existing credentials work; no new ops surface.

### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim

**Decision:** The distillation v1.0.0 substrate (tag
`distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
  `quality_score` enum + `SFT_NEVER` constant), category mapping
  rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
  is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was
  measured against the Rust implementation. The Go implementation
  establishes its own reproducibility baseline.

**Rationale:** Bit-identical reproducibility was a measured property
of a specific implementation, not a portable invariant. Re-establishing
it in Go means new fixtures, new gates, new audit-baselines. This is
honest about what's transferring (logic) versus what's a Rust-era
artifact (the specific bit-identical hashes).

**Risk:** the contamination firewall is the most consequential
distillation safety net. The port must be reviewed line-by-line, and
the new Go fixtures must include adversarial cases that prove the
firewall works in the new implementation. See SPEC §7 acceptance gates.

### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference

**Decision:** Go pathway memory begins with zero traces. The existing
88 Rust traces at
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.

**Rationale:** The Rust pathway memory's value compounded over months
of scrum cycles. Loading those traces into a Go implementation that
hasn't proven its byte-matching contract risks corrupting the new
substrate's signal with semantically mismatched data. Starting clean
keeps the Go pathway memory's lineage clean and lets byte-match
correctness be proven on a known input (per SPEC §3.4 G3.4.B).

The historical note records the 88 traces' value (11/11 successful
replays at the time of freeze) so the Go implementation has a
reference baseline to outperform.

### Decision 1.6 — Auditor longitudinal signal restarts

**Decision:** The Rust auditor's `audit_baselines.jsonl`
(longitudinal drift signal accumulated across PRs #6–#13) is **not**
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
lineage on its first PR.

**Rationale:** The drift signal is anchored to specific Rust commits,
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
the Go era would be like grafting Rust-PR audit history onto the first
Go PR's prologue — confusing more than informative. Restarting gives
the Go auditor a clean baseline to measure drift against.

The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
historical record.

---

## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)

`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.

**Rationale:** the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
already streams on the outbound side, so the cap is a safety gate,
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. A per-prefix cap preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-PUT framed format to eliminate
torn writes — multi-key would reintroduce that failure mode.

**Follow-up:** if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level `maxPutBytesFor(key)` in
`cmd/storaged/main.go` keeps that drop-in clean.

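For illustration, that per-prefix gate reduces to a sketch like the
following (the real function lives in `cmd/storaged/main.go`; this
assumes a `strings` import and illustrative constant names):

```go
const (
	defaultMaxPutBytes = 256 << 20 // 256 MiB: the DoS safety gate
	vectorsMaxPutBytes = 4 << 30   // 4 GiB: the documented LHV1 path
)

// maxPutBytesFor keeps the cap per-prefix, so raising the vector path
// doesn't loosen the gate for routine traffic.
func maxPutBytesFor(key string) int64 {
	if strings.HasPrefix(key, "_vectors/") {
		return vectorsMaxPutBytes
	}
	return defaultMaxPutBytes
}
```
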
---
## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1

**Decision:** When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is **a Bearer token loaded from
`secrets-go.toml` plus a configurable IP allowlist**. Both layers are
required: the token authenticates the caller; the allowlist
narrows the network surface.

**Status today (G0):** zero auth middleware. Every binary binds
`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
non-loopback bind unless the per-service `LH_<SVC>_ALLOW_NONLOOPBACK=1`
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.

### What gets implemented when auth lands

1. **`secrets-go.toml` adds an `[auth]` section:**

   ```toml
   [auth]
   token = "..."                                 # 32+ random bytes, hex-encoded
   allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]  # CIDR list
   ```

2. **`internal/shared/auth.go`** ships a single chi middleware
   (sketched after this list):

   ```go
   func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
   ```

   - Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
   - Non-empty token → reject with 401 unless the request carries
     `Authorization: Bearer <token>` matching in constant time.
   - Non-empty `allowed_ips` → reject with 403 unless `r.RemoteAddr`
     (or the `X-Forwarded-For` first hop, configurable) is in the
     CIDR set.
   - `/health` exempt — load balancers + monitors need it open.

3. **Every `cmd/<svc>/main.go` adds one line:**

   ```go
   r.Use(shared.RequireAuth(cfg.Auth))
   ```

   Mounted before `register(r)` so it covers every route the binary
   exposes beyond `/health`.

4. **`shared.Run` startup gate:** if bind is non-loopback AND
   `cfg.Auth.Token == ""`, refuse to start. The implicit
   "localhost is the auth layer" guarantee becomes explicit when
   crossing the loopback boundary.

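A minimal sketch of that middleware, assuming a hypothetical
`AuthConfig` shape with `Token` and `AllowedIPs` fields (the Sprint 1
implementation may differ):

```go
package shared

import (
	"crypto/subtle"
	"net"
	"net/http"
)

// AuthConfig is the hypothetical shape this sketch assumes.
type AuthConfig struct {
	Token      string   // empty: middleware is a no-op (G0 dev mode)
	AllowedIPs []string // CIDR strings, e.g. "10.0.0.0/8"
}

// RequireAuth sketches the 403 (allowlist) / 401 (token) split.
// /health is expected to be mounted outside this middleware.
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler {
	want := []byte("Bearer " + cfg.Token)
	var nets []*net.IPNet
	for _, c := range cfg.AllowedIPs {
		if _, n, err := net.ParseCIDR(c); err == nil {
			nets = append(nets, n)
		}
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if cfg.Token == "" { // dev mode: no-op
				next.ServeHTTP(w, r)
				return
			}
			if len(nets) > 0 { // allowlist gate
				host, _, _ := net.SplitHostPort(r.RemoteAddr)
				ip := net.ParseIP(host)
				allowed := false
				for _, n := range nets {
					if ip != nil && n.Contains(ip) {
						allowed = true
						break
					}
				}
				if !allowed {
					http.Error(w, "forbidden", http.StatusForbidden)
					return
				}
			}
			got := []byte(r.Header.Get("Authorization"))
			if subtle.ConstantTimeCompare(got, want) != 1 { // token gate
				http.Error(w, "unauthorized", http.StatusUnauthorized)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```
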
### Alternatives considered

| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. A pure token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for the G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot, which has its own session model). |

### Constant-time comparison + token hygiene

Token comparison must use `crypto/subtle.ConstantTimeCompare` —
a naive `==` is vulnerable to timing attacks from an attacker who
can issue many requests and measure round-trips. Token rotation is
operator-driven via a `secrets-go.toml` edit + restart; G0 doesn't
need rotate-without-restart.

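A sketch of minting a token with the hygiene the `[auth]` example
assumes (32 random bytes, hex-encoded; one of several reasonable ways
to do it):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

func main() {
	// 32 random bytes → 64 hex characters for secrets-go.toml.
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	fmt.Println(hex.EncodeToString(buf))
}
```
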
### What this ADR does NOT do

- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require a token in G0 dev.** Empty token → no-op. Smokes
  + proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** The gateway terminates
  inter-service auth at its inbound; if end-users hit the gateway from
  a browser, that's a different ADR (likely cookie/session, fronted
  by a reverse proxy that handles user auth).

### How this closes audit findings

- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
  gate prevents accidental exposure today; this ADR specifies the
  guardrail for when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above;
  R-007 stays open until the middleware is implemented but is no
  longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth,
  but the `RequireAuth` middleware sits at the right layer to add
  CORS handling later (browsers don't reach inter-service routes
  in the current design, so CORS is also Sprint 1+ work, for when
  end-user requests start landing).

---

## ADR-004: Pathway memory data model — Mem0-style versioned traces
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — substrate landing in `internal/pathway/`

**Decision:** Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via `predecessor_uid` so the full history is reconstructible.
Persistence is JSONL append-only with full replay on load;
corruption recovery skips bad lines without halting startup.

### Operations

| Op | Effect |
|---|---|
| `Add(content, tags...)` | New UID, stored fresh, `replay_count`=1. |
| `AddIdempotent(uid, content, tags...)` | If UID exists → `replay_count`++. Else → Add with that UID. |
| `Update(uid, content)` | In-place content replacement (same UID). Bumps `updated_at_ns`. NOT a revision — same trace, new content. |
| `Revise(predecessorUID, content, tags...)` | New UID with `predecessor_uid` set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| `Retire(uid)` | Sets `retired=true`. Excluded from `Search` by default; still accessible via `Get` and `History`. |
| `Get(uid)` | Returns the trace (even if retired); error on missing. |
| `History(uid)` | Walks the `predecessor_uid` chain backward, returns the slice [self, parent, grandparent, ...]. Cycles are detected via a visited set; returns an error on cycle (which only happens if the persistence file was hand-edited). |
| `Search(filter)` | Returns matching traces. Default excludes retired; opt in via `IncludeRetired: true`. Filters: tag match, content substring, time range. |

### Why Mem0-style + Why these specific ops

- **Mem0** (the memory pattern from the OpenAI Memories paper / Mem0 lib)
  is the canonical "agent memory" interface for the same reason
  Markdown is the canonical text format: it's the lowest-common-
  denominator that the entire ecosystem assumes. Adopting it lets
  agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed
  a typo in my note." Revise is "I now believe something different
  than I did when I wrote this; preserve the old belief for audit."
  Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from
  surfacing in search but preserves it for history reconstruction.
  Delete (which we don't expose) would break references. A usage
  sketch of these semantics follows.

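A sketch of how a caller might exercise these ops, assuming a
hypothetical `pathway.NewStore` constructor and method signatures (the
real surface is whatever lands in `internal/pathway/`):

```go
// Hypothetical usage fragment; error handling elided for brevity.
store, _ := pathway.NewStore("") // empty path: in-memory only

// Add: a fresh belief with a new UID.
t1, _ := store.Add(json.RawMessage(`{"note":"batch size 1K"}`), "queryd")

// Revise: a new belief supersedes the old one; t1 stays reachable.
t2, _ := store.Revise(t1.UID, json.RawMessage(`{"note":"batch size 4K"}`), "queryd")

// History walks predecessor_uid backward: [t2, t1].
chain, _ := store.History(t2.UID)
_ = chain

// Retire: t1 leaves Search results but remains in Get/History.
_ = store.Retire(t1.UID)
```
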
### Trace data shape

```go
type Trace struct {
	UID            string          // UUID v4 unless caller provides one
	Content        json.RawMessage // opaque, schema is caller's contract
	PredecessorUID string          // empty if root revision
	CreatedAtNs    int64
	UpdatedAtNs    int64
	Retired        bool
	ReplayCount    int      // ≥1 for any stored trace
	Tags           []string // for Search
}
```

`Content` is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.

### Persistence

JSONL append-only log under `_pathway/<store_name>.jsonl`. Each
mutation appends one JSON line:

```
{"op":"add", "trace":{...}}
{"op":"update", "uid":"…", "content":"…"}
{"op":"revise", "trace":{…}}   # trace.PredecessorUID is set
{"op":"retire", "uid":"…"}
{"op":"replay", "uid":"…"}     # idempotent re-add hit
```

On startup, replay every line in order, building in-memory state.
A malformed line logs a warning and is skipped; the load continues.
Corruption tolerance is non-optional — partial state is better
than no state for an agent substrate.

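A minimal sketch of that replay loop (assumes `bufio`, `encoding/json`,
`log`, and `os` imports; a real loader would also raise the Scanner's
buffer limit for large traces):

```go
// replay applies every well-formed JSONL line and skips the rest.
func replay(path string, apply func(op map[string]json.RawMessage)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for line := 1; sc.Scan(); line++ {
		var op map[string]json.RawMessage
		if err := json.Unmarshal(sc.Bytes(), &op); err != nil {
			log.Printf("pathway: skipping malformed line %d: %v", line, err)
			continue // corruption tolerance: partial state beats no state
		}
		apply(op)
	}
	return sc.Err()
}
```
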
Compaction is a future concern. A 100K-trace log replays in
seconds; below that scale, JSONL append is the simplest correct
choice. When compaction lands, the format will be: snapshot file
(full state JSON) + tail JSONL since snapshot. Detect snapshot,
load it, then replay tail.

### Cycle safety

UIDs are generated server-side via `uuid.New()` (an existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.

History walks defensively anyway: a visited set tracks UIDs seen
during the walk; on encountering a duplicate, return an error. This
protects against corruption (manual edit, bug in a future op) without
constraining the happy path.

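The defensive walk, sketched against the `Trace` shape above (the
`byUID` map and function shape are illustrative; assumes `fmt`):

```go
// history returns [self, parent, grandparent, ...], erroring on
// cycles and on dangling predecessor references.
func history(byUID map[string]Trace, uid string) ([]Trace, error) {
	var chain []Trace
	seen := map[string]bool{}
	for uid != "" {
		if seen[uid] {
			return nil, fmt.Errorf("pathway: cycle detected at %q", uid)
		}
		seen[uid] = true
		tr, ok := byUID[uid]
		if !ok {
			return nil, fmt.Errorf("pathway: missing trace %q", uid)
		}
		chain = append(chain, tr)
		uid = tr.PredecessorUID
	}
	return chain, nil
}
```
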
### Storage location

The JSONL file path is configurable per store. Default:
`/var/lib/lakehouse/pathway/<name>.jsonl` for prod; tests use
`t.TempDir()`. Persistence is OPTIONAL — an empty path means
in-memory only (matches vectord G1's pattern).

### What this ADR does NOT do

- **No HTTP surface decision.** Whether `cmd/pathwayd` is its own
  binary or routes get added to `cmd/vectord` is the next ADR's
  concern. The substrate is a pure library either way.
- **No vector index integration.** Pathway traces can carry a
  vector embedding in `Content` (caller decides), but this ADR
  doesn't define how the substrate integrates with `vectord`'s
  HNSW indexes. That's the staffing co-pilot's design problem
  when those layers compose.
- **No agent-loop semantics.** "When does an agent ADD vs
  REVISE?" is a workflow decision, not a substrate decision.

---

## ADR-005: Observer fail-safe semantics

**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/observer` (Store, Persistor) + `internal/workflow` (Runner) + `cmd/observerd`

The Rust legacy had a documented "verdict:accept on crash" anti-pattern:
when the observer crashed mid-evaluation, the upstream interpreted the
missing verdict as implicit acceptance. Several silent regressions traced
back to it. The Go observer's role is structurally different — it is a
**witness** (records what happened) rather than a **gate** (decides
accept/reject) — but adjacent fail-safe decisions still need locking
now that observerd is on the prod-realistic stack via the lift harness
(commit `b2e45f7`, 2026-04-30). This ADR ratifies the current behavior
and locks the rationale so future consumers don't break the invariant
by flipping the defaults.

### Decision 5.1 — Persist failure is logged-not-fatal; ring is the in-flight source of truth

Already implemented (`internal/observer/store.go:60-67`). Locked:

- If `persistor.Append` fails, log a warning and continue. Do NOT
  return an error to the caller of `Store.Record`.
- The in-memory ring buffer is the source of truth in flight; the
  JSONL is a best-effort durability shadow.
- Operators who need fail-closed, audit-grade trails configure that
  mode through a future opt-in (deferred to a later ADR; not the
  G0/G1/G2 default).

**Why fail-open here:** the observer's job is to keep recording even
when the disk hiccups. A persist-fail-fatal mode would translate
every transient I/O blip into an observer blackout, which is strictly
worse for the witness role than missing a few persisted entries — the
ring still has them, and operators can drain it on restart.

**Why this isn't the Rust anti-pattern:** the Go observer doesn't
emit verdicts. A persist failure here means "we recorded fewer rows
on disk than in memory," not "we accepted something we shouldn't have."

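The logged-not-fatal path, sketched (everything beyond `Store.Record`
and `persistor.Append` is a hypothetical name; assumes `log`):

```go
// Record never surfaces persist errors to the caller.
func (s *Store) Record(op ObservedOp) {
	s.mu.Lock()
	s.appendToRing(op) // the ring stays the in-flight source of truth
	s.mu.Unlock()

	if s.persistor != nil {
		if err := s.persistor.Append(op); err != nil {
			// Logged-not-fatal per Decision 5.1: a disk hiccup must not
			// black out the witness.
			log.Printf("observer: persist failed (best-effort): %v", err)
		}
	}
}
```
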
### Decision 5.2 — Mode failure in workflow.Runner: `Success = (Error == "")`, no panic-swallow path

Already implemented (`internal/workflow/runner.go`). Locked:

- Mode errors are caught by the runner and surfaced via the node's
  `Error` field; `Success` is the boolean derived from `Error == ""`.
- `observerd` records an `ObservedOp` per node with `Success: false`
  and the error string when a mode fails.
- Cycles, missing deps, and unknown modes are aborting errors → 4xx
  from `/observer/workflow/run` with the failure encoded in the JSON
  response.

**Why this is the explicit anti-Rust:** allowing a mode to silently
swallow its panic and report `Success: true` is exactly how the Rust
"verdict:accept on crash" pattern manifests. Forcing the runner to
record `Success: false` on error makes the failure observable to
downstream consumers (observerd queries, scrum review, distillation
selection) instead of laundering it into a fake success.

### Decision 5.3 — Provenance is one-row-per-node, recorded post-run

Already implemented (`cmd/observerd/main.go:140-154`). Locked:

- `runner.Run` returns the full `RunResult` with per-node Success/Error;
  `handleWorkflowRun` then iterates `res.Nodes` and `store.Record`s an
  `ObservedOp` per node.
- One row per node, NOT a single per-workflow catch-all. A workflow with
  N nodes produces N audit rows.
- Crash semantics:
  - Crash *during* `runner.Run` → no provenance recorded; queries see
    absence, not a false acceptance.
  - Crash *during* the recording loop → some nodes recorded, some
    absent; queries see partial provenance, again not a false
    acceptance.
- Recovery: re-run the whole workflow. No incremental resume in G0/G1/G2.

**Why one row per node:** debugging a partial workflow is a one-grep
operation when each node has its own row. A single catch-all row would
be exactly the Rust anti-pattern surface — "we accepted this workflow"
records that survive partial crashes look identical to genuine
acceptances. Per-node rows make that structurally impossible.

**Known gap, not yet a follow-up ADR:** recording happens after
`runner.Run` returns, not as each node completes. A long workflow with
a late-stage failure currently records nodes that already finished only
once the runner returns. For the G0/G1/G2 substrate this is fine —
workflows are short. When workflows get long enough that mid-run
visibility matters, a streaming-record callback is the right shape.

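The per-node recording loop, sketched (field names beyond what this
ADR states are hypothetical, as are `parseWorkflow`, `runner`, and
`store`):

```go
// handleWorkflowRun records one audit row per node after the run returns.
func handleWorkflowRun(w http.ResponseWriter, r *http.Request) {
	res, err := runner.Run(r.Context(), parseWorkflow(r))
	if err != nil {
		// Cycles, missing deps, unknown modes: aborting errors → 4xx.
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, node := range res.Nodes {
		store.Record(ObservedOp{
			Op:      "workflow.node",
			Name:    node.Name,
			Success: node.Error == "", // Decision 5.2: derived from Error
			Error:   node.Error,
		})
	}
	_ = json.NewEncoder(w).Encode(res)
}
```
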
### Decision 5.4 — `/observer/event` accepts even when the ring is full

Already implemented via `Store.Record`'s shift-left eviction. Locked:

- Ring overflow is normal operation: oldest evicted, newest accepted.
- 200 OK from `/observer/event` means "we accepted into the ring"; it
  does NOT promise "we persisted." Persistence remains best-effort
  per Decision 5.1.
- 4xx is reserved for malformed `ObservedOp` payloads (validation
  failures).

**Why accept-on-full:** treating a full ring as a 503 would translate
every brief activity burst into client errors, which is exactly the
wrong direction for an audit witness — the witness's job is to never
refuse to write, only to lose oldest data when capacity binds.

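The eviction itself, sketched (this is the `appendToRing` helper the
Decision 5.1 sketch assumed; slice-backed for illustration):

```go
// appendToRing never refuses a write: when the ring is full, the
// oldest entry is shifted out and the newest always lands.
func (s *Store) appendToRing(op ObservedOp) {
	if len(s.ring) > 0 && len(s.ring) == cap(s.ring) {
		copy(s.ring, s.ring[1:]) // shift-left: drop the oldest
		s.ring[len(s.ring)-1] = op
		return
	}
	s.ring = append(s.ring, op)
}
```
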
### Alternatives considered

- **Persist-required mode** — caller-configurable fail-closed for
  audit-grade workloads. The right approach when this lands is an
  opt-in on `Store` construction, leaving the default fail-open.
  Deferred to a future ADR.
- **Distributed ring with WAL** — persist before accept-into-ring,
  sync semantics. Too heavy for G0/G1 and breaks the ring's "in-flight
  source of truth" property.
- **Mode-result schema with explicit verdict field** — would force
  every mode to declare accept/reject. Overengineered for the witness
  role and reintroduces the gate-vs-witness confusion this ADR is
  trying to avoid.

### What this ADR does NOT do

- **No retention policy.** "How long do we keep observer entries on
  disk?" is a separate operations decision.
- **No mode-level retry.** If a mode fails, the runner records that
  and moves on. Whether to retry is a workflow-definition concern
  (Archon-style retry policies in the YAML), not the runner's.
- **No cross-process recovery.** A crashed observerd loses the ring;
  the persistor preserves what it managed to write. Operators read the
  JSONL after restart rather than querying a dead daemon.
- **No persist-required opt-in.** Mentioned in alternatives; lands in
  a separate ADR when an audit-grade consumer requires it.

### How this closes the OPEN list

STATE_OF_PLAY listed ADR-005 as a doc-only gate before the observer
wired into production paths. The 2026-04-30 lift run wired observerd
into the prod-realistic harness boot, which means the observer is now
on the data path for every reality-test workflow. This ADR locks the
fail-safe invariants before the next consumer (scrum runner,
distillation rebuild, or a real production workflow) takes a hard
behavioral dependency.

---

## ADR-006: Auth posture for non-loopback deploy

**Date:** 2026-04-30
**Status:** RATIFIED
**Scope:** `internal/shared/auth.go` + `internal/shared/bind.go` + every `cmd/<bin>/main.go`'s `shared.Run` call site

ADR-003 locked the substrate (Bearer token + IP allowlist, opt-in via
`cfg.Auth.Token`/`cfg.Auth.AllowedIPs`, `/health` exempt). ADR-006
ratifies the **operator playbook + deploy-time invariants** — what
gets enforced when, what operators set where, and what happens when
keys rotate. Required because Sprint 4 deployment work (REPLICATION.md,
systemd units, Dockerfile) needs a locked auth posture before it
touches production-shaped configs.

### Decision 6.1 — Non-loopback bind requires `auth.token`; the gate is mechanical

Already implemented in `requireAuthOnNonLoopback` (`internal/shared/bind.go:58-67`).
Locked:

- Any binary that binds anything other than `127.0.0.0/8` / `::1` /
  `localhost` MUST have `cfg.Auth.Token != ""`. Empty token +
  non-loopback bind = startup error, not silent insecure mode.
- The check fires in `shared.Run` BEFORE `http.Server.Serve`, so a
  misconfigured binary fails fast at startup rather than serving
  even one request.
- Pairs with `requireLoopbackOrOverride`: that gate refuses any
  non-loopback bind without `LH_<NAME>_ALLOW_NONLOOPBACK=1`. Together
  they make the audit's R-001+R-007 worst case (queryd `/sql` =
  RCE-equivalent off-loopback with no auth) mechanically impossible.

**Why mechanical, not policy:** policy gates rely on operator
discipline. The substrate gates work even when an operator copies a
dev `lakehouse.toml` into prod and forgets to set the token — the
binary refuses to start, and the error message names the env override.

### Decision 6.2 — Token comes from `cfg.Auth.Token` populated by env or secrets file

Locked:

- Operators do NOT put the production token in `lakehouse.toml`
  directly. The TOML field is empty in the committed file; the
  daemon's systemd unit sets `AUTH_TOKEN` (or whatever
  `cfg.Auth.TokenEnv` names) via `EnvironmentFile=` pointing at
  `/etc/lakehouse/auth.env` (mode 0600, root-owned).
- Same pattern as `chatd`'s provider keys (`OPENROUTER_API_KEY` etc.):
  the TOML names the env var, systemd loads the env file.
- Justification: keeps secrets out of git, out of the running
  process's command line, and auditable via filesystem ACLs.

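The env-resolution step, sketched (assumes hypothetical `Token` and
`TokenEnv` fields on `AuthConfig` and an `os` import):

```go
// resolveAuthFromEnv fills Token from the environment when the TOML
// leaves it empty, so committed configs never carry the secret.
func resolveAuthFromEnv(cfg *AuthConfig) {
	if cfg.Token != "" {
		return // an explicitly set Token wins over the environment
	}
	name := cfg.TokenEnv
	if name == "" {
		name = "AUTH_TOKEN" // default: the happy path needs no TOML at all
	}
	cfg.Token = os.Getenv(name)
}
```
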
### Decision 6.3 — `AllowedIPs` is the inter-service gate; `Token` is the cross-trust-boundary gate

Locked:

- Same-box deploys (10 daemons all on one host, all on `127.0.0.0/8`
  or a private subnet) use **`AllowedIPs` only**. Each daemon's
  `cfg.Auth.AllowedIPs` lists the gateway's address (and any other
  daemon that legitimately calls it). No token is shared between
  internal services.
- Gateway-to-external traffic (a coordinator UI in another VPC,
  a user's browser, an external integrator) goes through a
  **Bearer token**. The token is per-tenant; rotation is per-tenant.
- Mixed: a service can require BOTH (allowlist AND token) — the
  middleware logic is `allowed = ip_allowed && token_valid` when
  both are set. Use this for the gateway when binding non-loopback.

**Why the split:** token rotation is operationally expensive (every
caller updates a secret). IP allowlist rotation is free if the
network topology is stable. Splitting them by trust boundary lets
internal services treat allowlist drift as a network change while
external callers handle token rotation as a credential change.

### Decision 6.4 — `/health` is unauthenticated; everything else under `shared.Run` is gated

Already implemented (`internal/shared/server.go:84-92`). Locked:

- Load balancers + monitor probes hit `/health` without a token.
  The route returns `{"status":"ok","service":"<name>"}` and nothing
  about service state — no version, no commit, no internal counts.
- Every other route registered via `shared.Run`'s `register`
  callback lives inside the auth-gated `chi.Group`. New routes
  inherit auth automatically; new daemons inherit it via `shared.Run`.
- A daemon that needs a public route MUST add it to the outer router
  before the `register` group, with a code comment explaining the
  exemption. There are no such routes today.

### Decision 6.5 — Token rotation is operator-staged; old + new accepted during the window

Not yet implemented; locked as a Sprint 4 follow-up:

- Operators stage a rotation by adding a second token to
  `cfg.Auth.SecondaryTokens []string`. Both primary and secondary
  tokens pass auth during the window.
- After every caller is updated to the new token, operators
  promote secondary → primary and clear the secondary. A second
  rotation can then begin.
- A rolling restart is not required; daemons reload `cfg.Auth` on
  SIGHUP (also a Sprint 4 follow-up — currently they re-read on
  restart only).

**Why dual-token instead of single-token rotation:** the caller pool
can be large (gateway + observerd + scrum runner + UI + external
integrators). A single-token rotation forces a flag day. Dual-token
windows let operators rotate gradually and abort on failure.

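The dual-token check, sketched (assumes `crypto/subtle`; the
`SecondaryTokens` field is the one this decision names, the rest is
illustrative):

```go
// bearerMatch accepts the primary token or any staged secondary,
// comparing each candidate in constant time.
func bearerMatch(cfg AuthConfig, header string) bool {
	got := []byte(header)
	for _, tok := range append([]string{cfg.Token}, cfg.SecondaryTokens...) {
		if tok == "" {
			continue
		}
		if subtle.ConstantTimeCompare(got, []byte("Bearer "+tok)) == 1 {
			return true // fast path: the primary is checked first
		}
	}
	return false
}
```
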
### Decision 6.6 — TLS is the network operator's job, not ours

Locked:

- Daemons speak HTTP, not HTTPS. TLS termination happens at the
  network edge (nginx / Caddy / cloud LB), not in the Go process.
- Internal daemon-to-daemon traffic stays on plaintext HTTP because
  it's all on `127.0.0.0/8` or a private subnet (per Decision 6.3).
- Justification: TLS in-process means cert management, rotation,
  reload — all undifferentiated lift that nginx already solves
  better. The Bearer token + allowlist gates are sufficient when
  combined with a TLS-terminating reverse proxy.

### Alternatives considered

- **mTLS for inter-service auth** — every daemon issues + verifies
  certs. Solves token-rotation pain but adds the cert lifecycle as a
  new problem. Allowlist + plaintext on the private network is cheaper
  and gets the same threat-model coverage.
- **JWT-only** — JWTs let callers carry richer claims (tenant id,
  expiry, scopes). Overkill for the current threat model; the
  Bearer token + allowlist split is honest about what each layer
  actually defends against. Revisit when multi-tenant gateway
  features land.
- **No auth, network is the boundary** — works for G0 dev and the
  current single-box deployment. ADR-006 explicitly does NOT
  recommend this for non-loopback prod (the mechanical gate
  refuses it).

### What this ADR does NOT do

- **Does not specify how the gateway authenticates external callers.**
  Token-vs-mTLS-vs-OAuth at the public edge is a separate decision
  driven by who-calls-us. ADR-006 is about the inter-service +
  same-trust-domain posture.
- **Does not implement token rotation hot-reload.** Decision 6.5
  documents the design; the implementation is Sprint 4 work.
- **Does not lock TLS termination details.** Where + how nginx/Caddy
  goes is ops infrastructure, not ADR territory.

### How this closes the OPEN list

STATE_OF_PLAY listed ADR-006 as the gate before any Go binary binds
non-loopback in prod. The substrate gates were already present (R-001
+ R-007 enforced via `requireLoopbackOrOverride` +
`requireAuthOnNonLoopback`); this ADR locks the operator playbook
that turns those gates into a deployable posture. Sprint 4 can now
write systemd units that set `AUTH_TOKEN` from `EnvironmentFile=`
without re-litigating the design.

---