golangLAKEHOUSE/docs/DECISIONS.md
root 2a6234ff82 ADR-004 + internal/pathway: Mem0 versioned trace substrate
Closes Sprint 2 design-bar work (audit reports/scrum/sprint-backlog.md):
  S2.1 — ADR-004 documents the pathway-memory data model
  S2.2 — pathway port lands with deterministic fixture corpus
         and full test coverage on day one
  S2.3 — retired traces are excluded from retrieval (test
         passes; would fail without the filter)

Mem0-style operations: Add / AddIdempotent / Update / Revise /
Retire / Get / History / Search. Each operation is a method on
Store; persistence is JSONL append-only with corruption recovery
on Replay.

internal/pathway/types.go     Trace + event + SearchFilter + sentinel errors
internal/pathway/store.go     in-memory state + RWMutex + ops
internal/pathway/persistor.go JSONL append-only log with replay
internal/pathway/store_test.go  20 test funcs covering all 7
                                Sprint 2 claim rows + concurrency
internal/pathway/persistor_test.go  6 test funcs covering missing-
                                file, corruption recovery, long-line
                                handling, parent-dir auto-create,
                                apply-error skip behavior

Sprint 2 claim coverage row-by-row:
  ADD          TestAdd_AssignsUIDAndTimestamps + TestAdd_RejectsInvalidJSON
  UPDATE       TestUpdate_ReplacesContentSameUID + Update_MissingUID_Errors
  REVISE       TestRevise_LinksToPredecessorViaHistory +
               TestRevise_PredecessorMissing_Errors +
               TestRevise_ChainOfThree_BackwardWalk
  RETIRE       TestRetire_ExcludedFromSearch +
               TestRetire_StillAccessibleViaGet +
               TestRetire_StillAccessibleViaHistory
  HISTORY/cycle TestHistory_CycleDetected (injected via internal map),
                TestHistory_PredecessorMissing_TruncatesChain,
                TestHistory_UnknownUID_ErrorsClean
  REPLAY/dup   TestAddIdempotent_IncrementsReplayCount (locks the
               "replay preserves original content" rule per ADR-004)
  CORRUPTION   TestPersistor_CorruptedLines_Skipped +
               TestPersistor_ApplyError_Skipped
  ROUND-TRIP   TestPersistor_RoundTrip locks the full Save → fresh
               Store → Load → Stats-match contract

Two real bugs caught during testing:
  - Add returned the same *Trace stored in the map, so callers
    holding a reference saw later mutations. Fixed: clone before
    return (matches Get's contract). Same fix in AddIdempotent
    + Revise.
  - Test typo: {"v":different} isn't valid JSON; AddIdempotent's
    json.Valid rejected it as ErrInvalidContent. Test fixed to
    use {"v":"different"}; the validation behavior is correct.

Skipped this commit (next):
  - cmd/pathwayd HTTP binary
  - gateway routing for /v1/pathway/*
  - end-to-end smoke
  These add the wire surface; the substrate ships first so the
  wire layer can be a pure proxy in the next commit.

Verified:
  go test -count=1 ./internal/pathway/ — 26 tests green
  just verify                          — vet + test + 9 smokes 34s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 07:23:30 -05:00

365 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Architecture Decision Records — Lakehouse-Go
ADRs from the Go era. Numbered fresh from 001 to start clean lineage.
Where a Rust ADR (numbered 001021 in the Rust repo's `DECISIONS.md`)
remains in force, this file references it explicitly. Where a Rust
ADR is superseded, the new ADR records why.
---
## ADR-001: Foundational decisions for the Go rewrite
**Date:** 2026-04-28
**Decided by:** J
**Status:** Ratified — Phase G0 unblocked
The six questions that gated Phase G0 (per PRD.md / SPEC.md §8) are
all answered.
### Decision 1.1 — DuckDB via cgo for the query engine
**Decision:** `queryd` uses `marcboeker/go-duckdb` (cgo bindings to
DuckDB). Pure-Go alternative was rejected.
**Rationale:** DuckDB reads Parquet natively, supports the SQL surface
DataFusion exposed in the Rust era (CTEs, window functions, hybrid
joins), and runs in-process with cgo. The alternatives were:
- Hand-rolling a query planner over arrow-go RecordBatches —
multi-engineer-month research project; high risk of correctness
bugs.
- Running DuckDB as an external process — adds an operational surface
and a network hop to every query.
Cgo build complexity is the accepted cost. Single-binary deploy
preserved (the cgo dependency embeds at link time).
**Supersedes Rust ADR-001** (object storage as source of truth) — no.
That ADR remains in force; the change is the *engine* over the
storage, not the storage model.
### Decision 1.2 — HTMX for the UI
**Decision:** Frontend is `html/template` + HTMX + Alpine.js,
server-rendered by `cmd/gateway`. React/Vite in a separate repo is the
fallback if UX requirements demand SPA-tier interactivity post-G5.
**Rationale:** The existing Lakehouse UIs (`/lakehouse/` demo + staffer
console) are mostly server-rendered HTML with vanilla JS that already
fits the HTMX style. Single-binary deploy is preserved (gateway serves
templates + static assets). No build chain beyond `go build`.
The React fallback is named explicitly so it's not relitigated unless
an actual UX requirement triggers it.
### Decision 1.3 — Gitea hosts the new repo
**Decision:** Repo lives at `git.agentview.dev/profit/golangLAKEHOUSE`
(same Gitea server that hosts the Rust lakehouse).
**Rationale:** Single source of truth for repo hosting; existing
auditor tooling (`lakehouse-auditor` systemd service) already speaks
Gitea API; existing credentials work; no new ops surface.
### Decision 1.4 — Distillation rebuilt in Go, not ported verbatim
**Decision:** The distillation v1.0.0 substrate (`tag
distillation-v1.0.0` at `e7636f2` in the Rust repo) is **not**
bit-identical-ported. The Go reimplementation:
- Ports the LOGIC: SFT export pipeline, contamination firewall (the
`quality_score` enum + `SFT_NEVER` constant), category mapping
rules, audit-baselines append-only pattern.
- Does NOT port the FIXTURES: `tests/fixtures/distillation/acceptance/`
is rebuilt from scratch in Go with new ground-truth golden files.
- Does NOT port the bit-identical reproducibility PROPERTY: that was
measured against the Rust implementation. The Go implementation
establishes its own reproducibility baseline.
**Rationale:** Bit-identical reproducibility was a measured property
of a specific implementation, not a portable invariant. Re-establishing
it in Go means new fixtures, new gates, new audit-baselines. This is
honest about what's transferring (logic) versus what's a Rust-era
artifact (the specific bit-identical hashes).
**Risk:** the contamination firewall is the most consequential
distillation safety net. The port must be reviewed line-by-line, and
the new Go fixtures must include adversarial cases that prove the
firewall works in the new implementation. See SPEC §7 acceptance gates.
### Decision 1.5 — Pathway memory starts clean; old traces preserved as reference
**Decision:** Go pathway memory begins with zero traces. The existing
88 Rust traces at
`/home/profit/lakehouse/data/_pathway_memory/state.json` are NOT loaded
into the Go implementation. They are preserved as a historical record
in the Rust repo and documented at `docs/RUST_PATHWAY_MEMORY_NOTE.md`.
**Rationale:** The Rust pathway memory's value compounded over months
of scrum cycles. Loading those traces into a Go implementation that
hasn't proven its byte-matching contract risks corrupting the new
substrate's signal with semantically-mismatched data. Starting clean
keeps the Go pathway memory's lineage clean and lets the byte-match
correctness be proven on a known input (per SPEC §3.4 G3.4.B).
The historical note records the 88 traces' value (11/11 successful
replays at the time of freeze) so the Go implementation has a
reference baseline to outperform.
### Decision 1.6 — Auditor longitudinal signal restarts
**Decision:** The Rust auditor's `audit_baselines.jsonl`
(longitudinal drift signal accumulated across PRs #6#13) is **not**
ported to Go. The Go auditor begins a fresh `audit_baselines.jsonl`
lineage on its first PR.
**Rationale:** The drift signal is anchored to specific Rust commits,
verdict shapes, and Kimi/Haiku/Opus rotation traces. Carrying it into
the Go era would be like grafting Rust-PR audit history onto the first
Go PR's prologue — confusing more than informative. Restarting gives
the Go auditor a clean baseline to measure drift against.
The existing Rust `audit_baselines.jsonl` stays in the Rust repo as a
historical record.
---
## ADR-002: storaged per-prefix PUT cap (vectord _vectors/ → 4 GiB)
**Date:** 2026-04-29
**Decided by:** J
**Status:** Implemented (commit `423a381`)
`storaged` enforces a 256 MiB per-PUT body cap as DoS protection
(`MaxBytesReader` + Content-Length check). Keys under `_vectors/`
(vectord LHV1 persistence) get a raised cap of 4 GiB; everything
else stays at 256 MiB.
**Rationale:** the 500K staffing test surfaced that single-file LHV1
above ~150K vectors at d=768 hits the 256 MiB cap. `manager.Uploader`
already streams on the outbound side, so the cap is a safety gate
not a memory bottleneck — raising it for the vector path doesn't
introduce new memory pressure. Per-prefix preserves the safety
gate for routine traffic while opening the documented production
path. Splitting LHV1 across multiple keys was rejected because G1P
specifically shipped the single-Put framed format to eliminate
torn-write — multi-key would re-introduce that failure mode.
**Follow-up:** if production workloads exceed 4 GiB single-file
LHV1, refactor to operator-driven config (env/TOML) rather than
bumping the constant. The function-level `maxPutBytesFor(key)` in
`cmd/storaged/main.go` keeps that drop-in clean.
---
## ADR-003: Inter-service auth posture — Bearer token + IP allowlist
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — wiring deferred to Sprint 1
**Decision:** When inter-service auth is needed (the moment any
binary binds non-loopback or the deployment crosses a trust
boundary), the auth model is **a Bearer token loaded from
`secrets-go.toml` plus a configurable IP allowlist**. Both layers
required: the token authenticates the caller; the allowlist
narrows the network surface.
**Status today (G0):** zero auth middleware. Every binary binds
`127.0.0.1` by default; commit `6af0520` (R-001 partial fix) refuses
non-loopback bind unless the per-service `LH_<SVC>_ALLOW_NONLOOPBACK=1`
env override is set. The override-and-no-auth combination is the
worst case — this ADR locks in what we'll require before any
production override fires.
### What gets implemented when auth lands
1. **`secrets-go.toml` adds a `[auth]` section:**
```toml
[auth]
token = "..." # 32+ random bytes, hex-encoded
allowed_ips = ["10.0.0.0/8", "127.0.0.1/32"] # CIDR list
```
2. **`internal/shared/auth.go`** ships a single chi middleware:
```go
func RequireAuth(cfg AuthConfig) func(http.Handler) http.Handler
```
- Empty `cfg.Token` → middleware is a no-op (G0 dev mode).
- Non-empty token → reject 401 unless request has
`Authorization: Bearer <token>` matching constant-time.
- Non-empty `allowed_ips` → reject 403 unless `r.RemoteAddr` (or
`X-Forwarded-For` first hop, configurable) is in CIDR set.
- `/health` exempt — load balancers + monitors need it open.
3. **Every `cmd/<svc>/main.go` adds one line:**
```go
r.Use(shared.RequireAuth(cfg.Auth))
```
Mounted before `register(r)` so it covers every route the binary
exposes after `/health`.
4. **`shared.Run` startup gate:** if bind is non-loopback AND
`cfg.Auth.Token == ""`, refuse to start. The implicit
"localhost is the auth layer" guarantee becomes explicit when
crossing the loopback boundary.
### Alternatives considered
| Option | Why rejected |
|---|---|
| **mTLS** | Strongest but heaviest — every binary needs cert provisioning, rotation tooling, and cert-aware client wiring. Overkill for inter-service traffic that already passes through a single gateway. Reconsider when Lakehouse-Go runs across machines. |
| **JWT with short TTL** | Buys nothing over Bearer here — there's no third-party identity provider, no claim hierarchy worth modelling. Pure token has the same security properties at half the wire complexity. |
| **No auth, IP-allowlist only** | One stolen IP allowlist entry → full access. Token + IP is defense in depth; either alone is too weak. |
| **OAuth2 via external IdP** | Rejected for G0G3 timeline. No external IdP commitment. Revisit if Lakehouse-Go ever serves end-user requests directly (today everything fronts through the staffing co-pilot which has its own session model). |
### Constant-time comparison + token hygiene
Token comparison must use `crypto/subtle.ConstantTimeCompare` —
naive `==` is vulnerable to timing attacks against an attacker who
can issue many requests and measure round-trip. Token rotation is
operator-driven via `secrets-go.toml` edit + restart; G0 doesn't
need rotate-without-restart.
### What this ADR does NOT do
- **Does not implement the middleware.** Code lands in Sprint 1.
- **Does not require token in G0 dev.** Empty token → no-op. Smokes
+ proof harness keep working without setting tokens.
- **Does not address gateway → end-user auth.** Gateway terminates
inter-service auth at its inbound; if end-users hit gateway from
a browser, that's a different ADR (likely cookie/session, fronted
by a reverse proxy that handles user auth).
### How this closes audit findings
- **R-001 (queryd /sql RCE-equivalent off-loopback):** the bind
gate prevents accidental exposure today; this ADR specifies the
guardrail when intentional exposure is needed.
- **R-007 (zero auth middleware):** answered by the design above;
R-007 stays open until the middleware is implemented but is no
longer "design TBD."
- **R-010 (no CORS posture):** orthogonal to inter-service auth,
but the `RequireAuth` middleware sits at the right layer to add
CORS handling later (browsers don't reach inter-service routes
in the current design, so CORS is also Sprint 1+ when end-user
requests start landing).
---
## ADR-004: Pathway memory data model — Mem0-style versioned traces
**Date:** 2026-04-29
**Decided by:** J + Claude
**Status:** Decided — substrate landing in `internal/pathway/`
**Decision:** Pathway memory is an append-only event log of opaque
traces with Mem0-style semantics: Add / Update / Revise / Retire /
History / Search. Each trace has a UID; revisions chain backward
via `predecessor_uid` so the full history is reconstructible.
Persistence is JSONL append-only with full-replay on load;
corruption recovery skips bad lines without halting startup.
### Operations
| Op | Effect |
|---|---|
| `Add(content, tags...)` | New UID, stored fresh, replay_count=1. |
| `AddIdempotent(uid, content, tags...)` | If UID exists → replay_count++. Else → Add with that UID. |
| `Update(uid, content)` | In-place content replacement (same UID). Bumps `updated_at_ns`. NOT a revision — same trace, new content. |
| `Revise(predecessorUID, content, tags...)` | New UID with `predecessor_uid` set. Old trace stays accessible via History. Failure modes: predecessor missing → error; predecessor retired → still allowed (revisions of retired traces are valid). |
| `Retire(uid)` | Sets `retired=true`. Excluded from `Search` by default; still accessible via `Get` and `History`. |
| `Get(uid)` | Returns the trace (including if retired); error on missing. |
| `History(uid)` | Walks `predecessor_uid` chain backward, returns slice [self, parent, grandparent, ...]. Cycle-detected via visited-set; returns error on cycle (which only happens if persistence file was hand-edited). |
| `Search(filter)` | Returns matching traces. Default excludes retired; opt in via `IncludeRetired: true`. Filters: tag-match, content-substring, time range. |
### Why Mem0-style + Why these specific ops
- **Mem0** (memory pattern from the OpenAI Memories paper / Mem0 lib)
is the canonical "agent memory" interface for the same reason
Markdown is the canonical text format: it's the lowest-common-
denominator that the entire ecosystem assumes. Adopting it lets
agent loops written against any Mem0-aware substrate work here.
- Update vs Revise are deliberately separate. Update is "I noticed
a typo in my note." Revise is "I now believe something different
than I did when I wrote this; preserve the old belief for audit."
Conflating them loses the audit trail.
- Retire vs Delete is deliberate. Retire stops a trace from
surfacing in search but preserves it for history reconstruction.
Delete (which we don't expose) would break references.
### Trace data shape
```go
type Trace struct {
UID string // UUID v4 unless caller provides one
Content json.RawMessage // opaque, schema is caller's contract
PredecessorUID string // empty if root revision
CreatedAtNs int64
UpdatedAtNs int64
Retired bool
ReplayCount int // ≥1 for any stored trace
Tags []string // for Search
}
```
`Content` is opaque JSON (not a struct) so callers can store any
shape — the data model doesn't constrain semantics. Callers add
their own validators on top.
### Persistence
JSONL append-only log under `_pathway/<store_name>.jsonl`. Each
mutation appends one JSON line:
```
{"op":"add", "trace":{...}}
{"op":"update", "uid":"…", "content":"…"}
{"op":"revise", "trace":{…}} # trace.PredecessorUID is set
{"op":"retire", "uid":"…"}
{"op":"replay", "uid":"…"} # idempotent re-add hit
```
On startup, replay every line in order, building in-memory state.
A malformed line logs a warn and is skipped; load continues.
Corruption tolerance is non-optional — partial state is better
than no state for an agent substrate.
Compaction is a future concern. A 100K-trace log replays in
seconds; below that scale, JSONL append is the simplest correct
choice. When compaction lands, the format will be: snapshot file
(full state JSON) + tail JSONL since snapshot. Detect snapshot,
load it, then replay tail.
### Cycle safety
UIDs are generated server-side via `uuid.New()` (existing dep —
catalogd uses it). New UID for every Add and Revise. The data
model itself can't form cycles — every Revise points at an
EXISTING uid, and the new uid didn't exist a moment ago.
History walks defensively anyway: visited-set tracks UIDs seen
this walk; if we encounter a duplicate, return error. Protects
against corruption (manual edit, bug in a future op) without
constraining the happy path.
### Storage location
JSONL file path is configurable per store. Default:
`/var/lib/lakehouse/pathway/<name>.jsonl` for prod; tests use
`t.TempDir()`. Persistence is OPTIONAL — empty path means
in-memory only (matches vectord G1's pattern).
### What this ADR does NOT do
- **No HTTP surface decision.** Whether `cmd/pathwayd` is its own
binary or routes get added to `cmd/vectord` is the next ADR's
concern. The substrate is a pure library either way.
- **No vector index integration.** Pathway traces can carry a
vector embedding in `Content` (caller decides), but this ADR
doesn't define how the substrate integrates with `vectord`'s
HNSW indexes. That's the staffing co-pilot's design problem
when those layers compose.
- **No agent-loop semantics.** "When does an agent ADD vs
REVISE?" is a workflow decision, not a substrate decision.
---
(Future ADRs from ADR-005 onward will be added as the Go
implementation accrues design decisions — e.g. observer fail-safe
semantics, distillation rebuild, gRPC adapter wire format, etc.)