golangLAKEHOUSE/docs/DECISIONS.md

# Architecture Decision Records — Lakehouse-Go

ADRs from the Go era. Numbered fresh from 001 to start clean lineage.
Where a Rust ADR (numbered 001–021 in the Rust repo's `DECISIONS.md`)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.

---

## ADR-001: Foundational decisions for the Go rewrite
**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked

The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
all answered.

### Decision 1.1 — DuckDB via cgo for the query engine

**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
DuckDB). Pure-Go alternative was rejected.

**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches —
  multi-engineer-month research project; high risk of correctness
  bugs.
- Running DuckDB as an external process — adds an operational surface
  and a network hop to every query.

Cgo build complexity is the accepted cost. Single-binary deploy
preserved (the cgo dependency embeds at link time).

**Supersedes Rust ADR-001** (object storage as source of truth) — no.
That ADR remains in force; the change is the *engine* over the
storage, not the storage model.

### Decision 1.2 — HTMX for the UI

**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.

**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond `go build`.

The React fallback is named explicitly so it's not relitigated unless
an actual UX requirement triggers it.

### Decision 1.3 — Gitea hosts the new repo

**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
(same Gitea server that hosts the Rust lakehouse).

**Rationale:** Single source of truth for repo hosting; existing
auditor tooling (`lakehouse-auditor` systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.

### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim

**Decision:** The distillation v1.0.0 substrate (`tag
distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
  `quality_score` enum + `SFT_NEVER` constant), category mapping
  rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
  is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was
  measured against the Rust implementation. The Go implementation
  establishes its own reproducibility baseline.

**Rationale:** Bit-identical reproducibility was a measured property
of a specific implementation, not a portable invariant. Re-establishing
it in Go means new fixtures, new gates, new audit-baselines. This is
honest about what's transferring (logic) versus what's a Rust-era
artifact (the specific bit-identical hashes).

**Risk:** the contamination firewall is the most consequential
distillation safety net. The port must be reviewed line-by-line, and
the new Go fixtures must include adversarial cases that prove the
firewall works in the new implementation. See SPEC §7 acceptance gates.

### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference

**Decision:** Go pathway memory begins with zero traces. The existing
88 Rust traces at
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.

**Rationale:** The Rust pathway memory's value compounded over months
of scrum cycles. Loading those traces into a Go implementation that
hasn't proven its byte-matching contract risks corrupting the new
substrate's signal with semantically-mismatched data. Starting clean
keeps the Go pathway memory's lineage clean and lets the byte-match
correctness be proven on a known input (per SPEC §3.4 G3.4.B).

The historical note records the 88 traces' value (11/11 successful
replays at the time of freeze) so the Go implementation has a
reference baseline to outperform.

### Decision 1.6 — Auditor longitudinal signal restarts

**Decision:** The Rust auditor's `audit_baselines.jsonl`
(longitudinal drift signal accumulated across PRs #6–#13) is **not**
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
lineage on its first PR.

**Rationale:** The drift signal is anchored to specific Rust commits,
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
the Go era would be like grafting Rust-PR audit history onto the first
Go PR's prologue — confusing more than informative. Restarting gives
the Go auditor a clean baseline to measure drift against.

The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
historical record.

---

## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)

`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.

**Rationale:** the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
already streams on the outbound side, so the cap is a safety gate
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. Per-prefix preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-Put framed format to eliminate
torn-write — multi-key would re-introduce that failure mode.

**Follow-up:** if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level `maxPutBytesFor(key)` in
`cmd/storaged/main.go` keeps that drop-in clean.

---

## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1

**Decision:** When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is **a Bearer token loaded from
`secrets-go.toml` plus a configurable IP allowlist**. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.

**Status today (G0):** zero auth middleware. Every binary binds
`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
non-loopback bind unless the per-service `LH_<SVC>_ALLOW_NONLOOPBACK=1`
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.

### What gets implemented when auth lands

1. **`secrets-go.toml` adds a `[auth]` section:**
   ```toml
   [auth]
   token = "..."          # 32+ random bytes, hex-encoded
   allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"]   # CIDR list
   ```

2. **`internal/shared/auth.go`** ships a single chi middleware:
   ```go
   func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
   ```
   - Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
   - Non-empty token → reject 401 unless request has
     `Authorization: Bearer <token>` matching constant-time.
   - Non-empty `allowed_ips` → reject 403 unless `r.RemoteAddr` (or
     `X-Forwarded-For` first hop, configurable) is in CIDR set.
   - `/health` exempt — load balancers + monitors need it open.

3. **Every `cmd/<svc>/main.go` adds one line:**
   ```go
   r.Use(shared.RequireAuth(cfg.Auth))
   ```
   Mounted before `register(r)` so it covers every route the binary
   exposes after `/health`.

4. **`shared.Run` startup gate:** if bind is non-loopback AND
   `cfg.Auth.Token == ""`, refuse to start. The implicit
   "localhost is the auth layer" guarantee becomes explicit when
   crossing the loopback boundary.

### Alternatives considered

| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for G0–G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |

### Constant-time comparison + token hygiene

Token comparison must use `crypto/subtle.ConstantTimeCompare` —
naive `==` is vulnerable to timing attacks against an attacker who
can issue many requests and measure round-trip. Token rotation is
operator-driven via `secrets-go.toml` edit + restart; G0 doesn't
need rotate-without-restart.

### What this ADR does NOT do

- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require token in G0 dev.** Empty token → no-op. Smokes
  + proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** Gateway terminates
  inter-service auth at its inbound; if end-users hit gateway from
  a browser, that's a different ADR (likely cookie/session, fronted
  by a reverse proxy that handles user auth).

### How this closes audit findings

- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
  gate prevents accidental exposure today; this ADR specifies the
  guardrail when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above;
  R-007 stays open until the middleware is implemented but is no
  longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth,
  but the `RequireAuth` middleware sits at the right layer to add
  CORS handling later (browsers don't reach inter-service routes
  in the current design, so CORS is also Sprint 1+ when end-user
  requests start landing).

---

## ADR-004: Pathway memory data model — Mem0-style versioned traces
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — substrate landing in `internal/pathway/`

**Decision:** Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via `predecessor_uid` so the full history is reconstructible.
Persistence is JSONL append-only with full-replay on load;
corruption recovery skips bad lines without halting startup.

### Operations

| Op | Effect |
|---|---|
| `Add(content, tags...)` | New UID, stored fresh, replay_count=1. |
| `AddIdempotent(uid, content, tags...)` | If UID exists → replay_count++. Else → Add with that UID. |
| `Update(uid, content)` | In-place content replacement (same UID). Bumps `updated_at_ns`. NOT a revision — same trace, new content. |
| `Revise(predecessorUID, content, tags...)` | New UID with `predecessor_uid` set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| `Retire(uid)` | Sets `retired=true`. Excluded from `Search` by default; still accessible via `Get` and `History`. |
| `Get(uid)` | Returns the trace (including if retired); error on missing. |
| `History(uid)` | Walks `predecessor_uid` chain backward, returns slice [self, parent, grandparent, ...]. Cycle-detected via visited-set; returns error on cycle (which only happens if persistence file was hand-edited). |
| `Search(filter)` | Returns matching traces. Default excludes retired; opt in via `IncludeRetired: true`. Filters: tag-match, content-substring, time range. |

### Why Mem0-style + Why these specific ops

- **Mem0** (memory pattern from the OpenAI Memories paper / Mem0 lib)
  is the canonical "agent memory" interface for the same reason
  Markdown is the canonical text format: it's the lowest-common-
  denominator that the entire ecosystem assumes. Adopting it lets
  agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed
  a typo in my note." Revise is "I now believe something different
  than I did when I wrote this; preserve the old belief for audit."
  Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from
  surfacing in search but preserves it for history reconstruction.
  Delete (which we don't expose) would break references.

### Trace data shape

```go
type Trace struct {
    UID            string          // UUID v4 unless caller provides one
    Content        json.RawMessage // opaque, schema is caller's contract
    PredecessorUID string          // empty if root revision
    CreatedAtNs    int64
    UpdatedAtNs    int64
    Retired        bool
    ReplayCount    int             // ≥1 for any stored trace
    Tags           []string        // for Search
}
```

`Content` is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.

### Persistence

JSONL append-only log under `_pathway/<store_name>.jsonl`. Each
mutation appends one JSON line:

```
{"op":"add",     "trace":{...}}
{"op":"update",  "uid":"…",   "content":"…"}
{"op":"revise",  "trace":{…}}    # trace.PredecessorUID is set
{"op":"retire",  "uid":"…"}
{"op":"replay",  "uid":"…"}     # idempotent re-add hit
```

On startup, replay every line in order, building in-memory state.
A malformed line logs a warn and is skipped; load continues.
Corruption tolerance is non-optional — partial state is better
than no state for an agent substrate.

Compaction is a future concern. A 100K-trace log replays in
seconds; below that scale, JSONL append is the simplest correct
choice. When compaction lands, the format will be: snapshot file
(full state JSON) + tail JSONL since snapshot. Detect snapshot,
load it, then replay tail.

### Cycle safety

UIDs are generated server-side via `uuid.New()` (existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.

History walks defensively anyway: visited-set tracks UIDs seen
this walk; if we encounter a duplicate, return error. Protects
against corruption (manual edit, bug in a future op) without
constraining the happy path.

### Storage location

JSONL file path is configurable per store. Default:
`/var/lib/lakehouse/pathway/<name>.jsonl` for prod; tests use
`t.TempDir()`. Persistence is OPTIONAL — empty path means
in-memory only (matches vectord G1's pattern).

### What this ADR does NOT do

- **No HTTP surface decision.** Whether `cmd/pathwayd` is its own
  binary or routes get added to `cmd/vectord` is the next ADR's
  concern. The substrate is a pure library either way.
- **No vector index integration.** Pathway traces can carry a
  vector embedding in `Content` (caller decides), but this ADR
  doesn't define how the substrate integrates with `vectord`'s
  HNSW indexes. That's the staffing co-pilot's design problem
  when those layers compose.
- **No agent-loop semantics.** "When does an agent ADD vs
  REVISE?" is a workflow decision, not a substrate decision.

---

(Future ADRs from ADR-005 onward will be added as the Go
implementation accrues design decisions — e.g. observer fail-safe
semantics, distillation rebuild, gRPC adapter wire format, etc.)