Four shipped features and a PRD realignment, all measured end-to-end:
HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
/hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
80/30 = 1.00 recall in 230s
Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows
Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
/etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
/storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
X-Lakehouse-Rescue-Used observability headers on fallback
Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
(586 rows × 13 cols, 6 batches, 196ms end-to-end)
PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
decision rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ADR-017: Federated Multi-Bucket Storage
Status: Accepted — 2026-04-16
Owner: J
Implements: Phase 15 horizon item "Federated multi-bucket query"
Depends on: ADR-001 (object storage as source of truth), ADR-009 (delta files)
Problem
Today every dataset lives in exactly one object storage backend (/home/profit/lakehouse/data on local disk). Three scenarios break that assumption:
- Multi-tenant hosting — Client A's staffing data must stay in Client A's S3 bucket. Client B's in Client B's. The lakehouse is the compute plane, their bucket is the data plane.
- Data residency — European clients require their data never leaves an EU-region bucket. Same lakehouse instance, different physical storage.
- Cross-dataset analytics — A user asks "average placement margin across all clients we manage." The query must scan Parquet across N buckets as a single logical table.
None of this is possible with a single object_store instance. DataFusion can register multiple object stores under different URL schemes, but nothing in the current stack surfaces that capability.
Non-Goals (for the MVP)
- Not building cross-bucket JOIN optimization. DataFusion pushes predicates down per-partition already; that's sufficient.
- Not handling bucket-level auth policies (who-can-read-which-bucket). That's Phase 13's job.
- Not supporting heterogeneous backends in a single logical dataset (e.g. half in S3, half local). One dataset = one bucket.
- Not automatic replication across buckets. Each bucket stands alone.
- Not scheduled sync/migration. Manual today.
Design
Core invariant
Every ObjectRef belongs to exactly one named bucket. The catalog is the single authority over which bucket holds which dataset. Queries transparently span buckets by virtue of DataFusion's multi-store capability — the catalog tells the query engine where each file lives.
Naming
Three classes of bucket:
- `primary` — system-wide default. Shared catalog, reference datasets, and vectors that aren't tenant-specific. Always present, always the fallback. Corresponds to today's `./data` directory.
- `profile:{user_id}` — per-user workspace bucket. Each user's personal data, their vector indexes, their workspace state. Provisioned on first use. Pre-loaded ("hot loaded") when the user activates their profile.
- `{named}` — tenant/client buckets. Explicit, configured up front. E.g. `client_a`, `client_eu`.
Conventions:
- Bucket URL = `{bucket_id}://{key}` internally (e.g. `primary://datasets/candidates.parquet`, `profile:daisy://vectors/notes.parquet`)
- DataFusion sees each bucket as a separate registered object store; `ListingTable` paths include the bucket scheme
- Profile buckets use the `profile:` prefix as both a namespace marker and a DataFusion URL scheme
Config shape — lakehouse.toml
```toml
[storage]
# Backward compat — the existing [storage] block still creates a "primary"
# bucket if no [[storage.buckets]] entries are defined.
root = "./data"

# Where profile buckets are rooted when not otherwise configured.
# A profile:{user} bucket resolves to {profile_root}/{user}/ unless the user
# config overrides.
profile_root = "./data/_profiles"

[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"
rescue_bucket = "rescue" # single shared fallback for failed reads (see §Failure mode)

[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"

[[storage.buckets]]
name = "client_a"
backend = "s3"
bucket = "client-a-lakehouse"
region = "us-east-1"
endpoint = "https://s3.amazonaws.com"
secret_ref = "client_a_aws" # NOT the literal secret — a handle

[[storage.buckets]]
name = "client_eu"
backend = "s3"
bucket = "client-eu-lakehouse"
region = "eu-west-1"
secret_ref = "client_eu_aws"
```
Rules:
- Credentials never appear in this file. `secret_ref` is a handle resolved through the secrets layer (see §Secrets).
- If `[[storage.buckets]]` is absent, the existing `[storage]` block creates a single `"primary"` bucket. Zero-config upgrade.
- Adding a bucket requires a restart (MVP); runtime addition via API is a polish item.
Secrets
Credentials are never in lakehouse.toml. A pluggable SecretsProvider trait resolves secret_ref handles to credential structs at startup:
```rust
pub trait SecretsProvider: Send + Sync {
    async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String>;
}
```
MVP ships one implementation — FileSecretsProvider reading /etc/lakehouse/secrets.toml (or a path given by LAKEHOUSE_SECRETS env var):
```toml
# /etc/lakehouse/secrets.toml — root:root, mode 0600, NEVER in git
[client_a_aws]
access_key = "AKIA..."
secret_key = "wJalrXUt..."

[client_eu_aws]
access_key = "AKIA..."
secret_key = "..."
```
Future providers plug into the same trait without touching core code:
- `VaultSecretsProvider` — HashiCorp Vault
- `SopsSecretsProvider` — age/gpg-encrypted files in git
- `KeyringSecretsProvider` — OS-level keyring
Startup fails fast if a secret_ref can't resolve — no silent fallback to anonymous S3.
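A minimal sketch of what `FileSecretsProvider` could look like, assuming the `SecretsProvider` trait above and the `access_key`/`secret_key` shape from the example file; the permission check, error strings, and helper names are illustrative, not the shipped code.

```rust
use std::collections::HashMap;
use std::os::unix::fs::PermissionsExt;
use std::path::PathBuf;

// Debug deliberately not derived; see the redaction sketch under §Risks.
#[derive(Clone, serde::Deserialize)]
pub struct BucketCredentials {
    pub access_key: String,
    pub secret_key: String,
}

pub struct FileSecretsProvider {
    path: PathBuf,
}

impl FileSecretsProvider {
    pub fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    /// Load and validate the whole secrets file. Rejects group/world-readable
    /// files, enforcing the 0600 rule above.
    fn load(&self) -> Result<HashMap<String, BucketCredentials>, String> {
        let meta = std::fs::metadata(&self.path)
            .map_err(|e| format!("secrets file {}: {e}", self.path.display()))?;
        if meta.permissions().mode() & 0o077 != 0 {
            return Err(format!("{} must be mode 0600", self.path.display()));
        }
        let raw = std::fs::read_to_string(&self.path).map_err(|e| e.to_string())?;
        toml::from_str(&raw).map_err(|e| format!("secrets parse error: {e}"))
    }
}

impl SecretsProvider for FileSecretsProvider {
    // Native async-fn-in-trait (Rust 1.75+), matching the trait sketch above.
    async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String> {
        self.load()?
            .remove(handle)
            .ok_or_else(|| format!("unknown secret_ref {handle:?}"))
    }
}
```

Honoring the `LAKEHOUSE_SECRETS` override is then a `std::env::var` check at construction time.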
Schema changes
ObjectRef already has a bucket: String field — currently it carries the S3 bucket name or "local" inconsistently. Repurpose it as the catalog bucket name:
```rust
use chrono::{DateTime, Utc};

pub struct ObjectRef {
    pub bucket: String,   // NOW: catalog bucket name, e.g. "primary" or "client_a"
    pub key: String,      // object key within that bucket
    pub size_bytes: u64,
    pub created_at: DateTime<Utc>,
}
```
Migration: a resync-missing-style one-shot sets bucket = "primary" on every existing ObjectRef whose value is empty or ambiguous.
DatasetManifest — no new top-level field. objects: Vec<ObjectRef> is already where bucket info lives. Per the design invariant (one dataset, one bucket), all ObjectRefs inside a manifest share a bucket, so we can add a convenience accessor manifest.bucket_id() → objects[0].bucket.clone().
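A sketch of that accessor, hedged to return `Option` rather than index blindly (the shipped signature may differ):

```rust
impl DatasetManifest {
    /// Per the one-dataset-one-bucket invariant, every ObjectRef in a manifest
    /// shares a bucket, so the first object's bucket is canonical. Option
    /// guards the shouldn't-happen empty-manifest case instead of panicking.
    pub fn bucket_id(&self) -> Option<String> {
        self.objects.first().map(|o| o.bucket.clone())
    }
}
```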
Storage layer changes
storaged::BucketRegistry — new struct, replaces the single Arc<dyn ObjectStore> currently threaded through gateway.
```rust
pub struct BucketRegistry {
    buckets: HashMap<String, Arc<dyn ObjectStore>>,
    default: String, // "primary"
}

impl BucketRegistry {
    pub fn from_config(cfg: &StorageConfig) -> Result<Self, String>;
    pub fn get(&self, bucket_id: &str) -> Result<&Arc<dyn ObjectStore>, String>;
    pub fn default_store(&self) -> &Arc<dyn ObjectStore>;
    pub fn list_buckets(&self) -> Vec<BucketInfo>;
}
```
storaged::ops gets a bucket parameter on every call:
```rust
pub async fn put(registry: &BucketRegistry, bucket: &str, key: &str, data: Bytes) -> Result<(), String>;
pub async fn get(registry: &BucketRegistry, bucket: &str, key: &str) -> Result<Bytes, String>;
// etc.
```
Every existing call site passes "primary" initially; specific call sites (ingestd, vectord) learn to route.
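A sketch of registry construction under the config shape above. The `BucketConfig` struct is an assumed mirror of a `[[storage.buckets]]` entry, only the local backend is shown, and S3 credential resolution through `SecretsProvider` is elided since `resolve` is async:

```rust
use std::{collections::HashMap, sync::Arc};

use object_store::{local::LocalFileSystem, ObjectStore};

// Assumed mirror of a [[storage.buckets]] entry; not the shipped struct.
pub struct BucketConfig {
    pub name: String,
    pub backend: String,      // "local" | "s3"
    pub root: Option<String>, // local backend only
    pub rescue_bucket: Option<String>,
}

impl BucketRegistry {
    pub fn from_config(buckets: &[BucketConfig]) -> Result<Self, String> {
        let mut map: HashMap<String, Arc<dyn ObjectStore>> = HashMap::new();
        for b in buckets {
            let store: Arc<dyn ObjectStore> = match b.backend.as_str() {
                "local" => {
                    let root = b.root.as_deref().ok_or_else(|| {
                        format!("bucket {:?}: local backend needs root", b.name)
                    })?;
                    Arc::new(LocalFileSystem::new_with_prefix(root).map_err(|e| e.to_string())?)
                }
                // "s3" would build an AmazonS3 store here after resolving
                // secret_ref through the SecretsProvider at startup.
                other => return Err(format!("unsupported backend {other:?}")),
            };
            map.insert(b.name.clone(), store);
        }
        // The design says primary always exists; enforce it at build time.
        if !map.contains_key("primary") {
            return Err("config must define a \"primary\" bucket".into());
        }
        Ok(Self { buckets: map, default: "primary".into() })
    }
}
```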
Query layer changes
queryd — on startup, register every bucket as a separate object_store URL with DataFusion:
```rust
for (name, store) in bucket_registry.iter() {
    let url = format!("lakehouse-{}://", name).parse::<Url>()?;
    session_ctx.register_object_store(&url, store.clone());
}
```
ListingTable URLs reference the bucket:
```rust
let url = format!("lakehouse-{}://{}", obj.bucket, obj.key);
```
DataFusion handles cross-bucket scans natively — each partition gets routed to the correct store.
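As a usage sketch: once the stores are registered under their `lakehouse-{bucket}://` schemes, exposing one bucket-scoped Parquet object to DataFusion is just a path in that scheme. The helper and its parameters are illustrative:

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

/// Illustrative helper: expose one bucket-scoped Parquet object as a table.
/// Assumes the bucket's object store was registered as in the loop above.
async fn register_dataset(
    ctx: &SessionContext,
    table_name: &str,
    bucket: &str,
    key: &str,
) -> Result<()> {
    // e.g. lakehouse-client_a://datasets/candidates.parquet
    let url = format!("lakehouse-{bucket}://{key}");
    ctx.register_parquet(table_name, &url, ParquetReadOptions::default())
        .await
}
```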
Gateway API additions
| Endpoint | Purpose |
|---|---|
| `GET /storage/buckets` | List all configured buckets + their backend + reachability status |
| `GET /storage/errors` | Recent bucket operation failures (filtered by bucket/time) |
| `GET /storage/health` | Summary: which buckets have errored in the last 5 minutes |
| `POST /ingest/file` + `X-Lakehouse-Bucket: client_a` | Header selects target bucket; absent → primary |
| `POST /profile/{user}/activate` | Pre-load user's profile bucket: embedding caches warm, HNSW indexes rebuilt, workspace state hydrated |
| `POST /profile/{user}/deactivate` | Evict profile bucket's cached state |
| `POST /catalog/datasets/by-name/{name}/relocate` | Move a dataset to another bucket (polish) |
Routing rule for all existing endpoints:
- `X-Lakehouse-Bucket: {name}` header → use that bucket
- No header → `primary`
- Unknown bucket name → 404
Catalog and query endpoints span all buckets by default — a query sees all buckets the user has access to unless explicitly scoped.
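A sketch of the routing rule itself, assuming an axum-style gateway (the `HeaderMap` and `StatusCode` types are its re-exports of the `http` crate) and the `BucketRegistry::get` lookup above:

```rust
use axum::http::{HeaderMap, StatusCode};

/// Resolve the target bucket for a request, per the routing rule:
/// header → that bucket, no header → "primary", unknown name → 404.
fn resolve_bucket(
    registry: &BucketRegistry,
    headers: &HeaderMap,
) -> Result<String, (StatusCode, String)> {
    let name = headers
        .get("x-lakehouse-bucket")
        .and_then(|v| v.to_str().ok())
        .unwrap_or("primary");
    match registry.get(name) {
        Ok(_) => Ok(name.to_string()),
        Err(_) => Err((StatusCode::NOT_FOUND, format!("unknown bucket {name:?}"))),
    }
}
```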
Profile hot load
When POST /profile/{user}/activate fires:
- Resolve the `profile:{user}` bucket — create if missing (first-time activation)
- Scan the catalog for all datasets whose `owner == user` or which live in that bucket
- For each vector index belonging to the profile: pre-load embeddings into `EmbeddingCache`, pre-build HNSW with the locked-in default config (ec=80 es=30)
- Hydrate any saved workspace state (see Phase 8.5) into memory
- Return a manifest summary — what's hot, what wasn't found, total memory used
Design contract: /profile/{user}/activate should be idempotent. Calling it twice is cheap; the second call is a no-op because the cache is already warm.
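One way to honor that contract is a per-profile warm marker so repeat activations short-circuit. `ProfileCache` and `warm_profiles` are illustrative names, not the shipped structure:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

pub struct ProfileCache {
    warm_profiles: Mutex<HashSet<String>>,
}

impl ProfileCache {
    /// Run the expensive warm-up at most once per profile; returns whether this
    /// call did the work. The lock is held across warm_up so two concurrent
    /// activations of the same profile can't duplicate it.
    pub fn activate(&self, user: &str, warm_up: impl FnOnce()) -> bool {
        let mut warm = self.warm_profiles.lock().unwrap();
        if warm.contains(user) {
            return false; // already warm: the cheap no-op second call
        }
        warm_up();
        warm.insert(user.to_string());
        true
    }
}
```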
Failure mode — rescue bucket + loud errors
The design goal is failures stay findable. When something breaks, you should know within seconds which bucket, which key, which operation, and what happened — without grepping logs.
Writes: always go to the target bucket. If the target bucket is unreachable, the request fails hard with 503 and the failure is recorded in the error journal. No queueing, no silent fallback — writes that vanish silently are the worst possible failure mode.
Reads:
- First attempt: target bucket.
- On failure: fall through to the single configured `rescue_bucket`.
- If the key exists in rescue, serve it with `X-Lakehouse-Rescue-Used: true` and `X-Lakehouse-Original-Bucket: {target}` response headers so the caller knows fallback happened.
- If the key isn't in rescue either: 404 with both buckets listed in the error body.
- Every rescue fallback is appended to the error journal (see below).
Rescue bucket is a single, well-known bucket configured at the storage level. Typical setup: a periodic sync from primary → rescue so that when primary has an outage, recent data is still reachable. Sync is outside this ADR's scope — operator's job.
The rest of the system keeps working: a failed bucket never takes down the gateway, other buckets remain queryable, and cross-bucket queries scan what they can and surface per-partition failures (which bucket, which key) in the error body.
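A sketch of the read path under those rules, layered on the `ops` signatures above. `ReadOutcome` is an illustrative name, and the journal hook is stubbed since the real writer is `storaged::error_journal` (ADR-018):

```rust
use bytes::Bytes;

/// Distinguishes a direct hit from a rescued read so the HTTP layer can set
/// X-Lakehouse-Rescue-Used / X-Lakehouse-Original-Bucket.
pub enum ReadOutcome {
    Direct(Bytes),
    Rescued { data: Bytes, original_bucket: String },
}

// Assumed journal hook; the real writer is storaged::error_journal.
pub struct ErrorJournal;
impl ErrorJournal {
    pub async fn record_read_failure(&self, _bucket: &str, _key: &str, _err: &str, _rescued: bool) {}
}

pub async fn get_with_rescue(
    registry: &BucketRegistry,
    journal: &ErrorJournal,
    bucket: &str,
    rescue: &str,
    key: &str,
) -> Result<ReadOutcome, String> {
    match get(registry, bucket, key).await {
        Ok(data) => Ok(ReadOutcome::Direct(data)),
        Err(primary_err) => {
            // Target bucket failed: try the shared rescue bucket, and journal
            // the fallback either way so the failure stays findable.
            let rescued = get(registry, rescue, key).await;
            journal
                .record_read_failure(bucket, key, &primary_err, rescued.is_ok())
                .await;
            match rescued {
                Ok(data) => Ok(ReadOutcome::Rescued {
                    data,
                    original_bucket: bucket.to_string(),
                }),
                Err(rescue_err) => Err(format!(
                    "read {key:?}: {bucket:?} failed ({primary_err}); rescue {rescue:?} failed ({rescue_err})"
                )),
            }
        }
    }
}
```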
Error journal — "find errors ez"
A simple append-only log of every bucket operation that failed or fell through. Stored at primary://_errors/bucket_errors.jsonl (yes, primary — because we need it findable even if other buckets are down).
{"ts":"2026-04-16T10:30:15Z","op":"read","target":"client_a","key":"datasets/candidates.parquet","error":"connection refused","rescued":true,"rescue_key_found":true}
{"ts":"2026-04-16T10:30:18Z","op":"write","target":"client_a","key":"datasets/new.parquet","error":"connection refused","rescued":false}
Exposed via:
- `GET /storage/errors?limit=50` — recent errors
- `GET /storage/errors?bucket=client_a` — filter by bucket
- `GET /storage/errors?since=2026-04-16T10:00` — filter by time
- `GET /storage/health` — summary: which buckets have errored in the last 5 minutes
The journal doesn't replace logs (tracing still gets everything), but it gives you a single authoritative place to answer the question "has anything been failing?" without tailing systemd journals. If the journal is empty, nothing is broken.
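The entry shape those JSONL lines imply, expressed with serde. Field names match the examples above; the `to_jsonl_line` helper is illustrative, and the actual write path is the append-log pattern from ADR-018:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct BucketErrorEntry {
    pub ts: DateTime<Utc>,
    pub op: String,     // "read" | "write"
    pub target: String, // bucket the operation was aimed at
    pub key: String,
    pub error: String,
    pub rescued: bool,
    // Present only on rescued reads, matching the example lines above.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub rescue_key_found: Option<bool>,
}

/// One serialized entry per line is exactly the JSONL format shown above.
fn to_jsonl_line(entry: &BucketErrorEntry) -> serde_json::Result<String> {
    Ok(format!("{}\n", serde_json::to_string(entry)?))
}
```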
Vectors are bucket-scoped
Vector indexes live in the same bucket as their source data:
- Data in `primary` → vectors in `primary://vectors/{index_name}.parquet`
- Data in `profile:daisy` → vectors in `profile:daisy://vectors/{index_name}.parquet`
- Data in `client_a` → vectors in `client_a://vectors/{index_name}.parquet`
The existing _hnsw_trials/ and _hnsw_evals/ prefixes also live per-bucket. The trial journal for a profile's index stays inside that profile's bucket — so tenants don't see each other's tuning history.
Profile hot load consequence: activating a profile pre-loads ALL vector indexes in that profile's bucket into EmbeddingCache + builds HNSW with the default config. First query after activation is warm.
Implementation phases
MVP (this is what I'd build next session)
- `shared::config` — extend `StorageConfig` with `buckets: Vec<BucketConfig>` and `profile_root`, with backward-compat fallback from the single `[storage]` block
- `shared::secrets` — `SecretsProvider` trait + `FileSecretsProvider` impl reading `/etc/lakehouse/secrets.toml`
- `storaged::BucketRegistry` — multi-backend registry; rescue-bucket tracking; lazy profile bucket creation
- `storaged::ops` — add `bucket: &str` param to every call; read path falls back to the rescue bucket when the target is unreachable; write path hard-fails
- `gateway` — `X-Lakehouse-Bucket` header middleware → routing decision per request
- `gateway` — `GET /storage/buckets` + `POST /profile/{user}/activate|deactivate`
- `catalogd` — one-shot migration: every ObjectRef without an explicit bucket gets `"primary"`
- `queryd::session` — register every bucket as a DataFusion ObjectStore under its own URL scheme
- `vectord` — every vector/trial/eval storage path becomes `{bucket}://vectors/...`; profile activate pre-loads HNSW
- `storaged::error_journal` — append-only JSONL writer at `primary://_errors/bucket_errors.jsonl`, hooked into every ops call-site
- Test harness — configure `primary` + `profile:testuser` + a mock `client_a` pointing at a 2nd local dir + `rescue` as a 3rd local dir; verify cross-bucket join, profile activate, rescue fallback, error journal visibility
Success gate #1 — cross-bucket query: `SELECT SUM(hours_regular) FROM primary.timesheets t1 JOIN "profile:testuser".timesheets t2 USING (placement_id)` returns a sensible number.
Success gate #2 — profile hot load: `POST /profile/daisy/activate` pre-builds HNSW over her personal vector index; the next `/search` call against that index returns in under 1ms with no further warm-up.
Success gate #3 — rescue fallback + error visibility: rename client_a's directory to simulate an outage. `GET` for a key returns the copy from rescue with `X-Lakehouse-Rescue-Used: true` + `X-Lakehouse-Original-Bucket: client_a` headers. `PUT` returns 503. `GET /storage/errors` shows both events with full context. `GET /storage/health` flags client_a as unhealthy.
Polish (follow-up)
- Bucket health check — `GET /storage/buckets` shows reachable/unreachable
- Bucket relocation — `POST /catalog/datasets/{name}/relocate` (copy to new bucket, update manifest, delete from old)
- Runtime bucket addition — `POST /storage/buckets` adds a bucket without restart (stores in a separate `buckets.toml`)
- S3 streaming optimization — for very large Parquet files, stream the footer instead of loading the whole blob
Decisions (2026-04-16)
- Credentials — Pluggable `SecretsProvider` trait. MVP ships `FileSecretsProvider` at `/etc/lakehouse/secrets.toml` (root:root, 0600). Future providers (Vault, SOPS, OS keyring) plug in without touching core.
- Primary default + profile bucket model — `primary` always exists as the system-wide fallback. Per-user `profile:{user}` buckets are provisioned on first activate. Named tenant buckets are configured explicitly.
- Ingest routing — `X-Lakehouse-Bucket` header. Absent → `primary`. Unknown → 404.
- Failure — Writes: hard fail on target, logged to the error journal. Reads: fall through to a single shared `rescue_bucket`; response headers `X-Lakehouse-Rescue-Used` + `X-Lakehouse-Original-Bucket` make the fallback visible. Every failure is appended to an error journal at `primary://_errors/bucket_errors.jsonl`, queryable via `GET /storage/errors`. Failed buckets never take down unrelated paths.
- Vectors in their bucket — Vectors, trial journals, and eval sets all live in the same bucket as their source data. Profile activate pre-loads the full vector stack for that profile using the locked-in HNSW default (ec=80 es=30).
Risks
| Risk | Severity | Mitigation |
|---|---|---|
| DataFusion multi-store predicate pushdown underperforms on cross-bucket JOINs | Medium | Benchmark early. If bad, mark cross-bucket JOINs as a known sharp edge. |
| S3 credentials accidentally logged | High | Never stringify credential fields. Add a test that verifies `secret_key` is redacted in every code path. |
| Existing ObjectRefs have inconsistent bucket values (some "local", some empty) | Low | One-shot migration normalizes to `"primary"`. |
| Bucket config drift between restarts | Medium | Config-driven means restart = canonical. A runtime API would need careful state reconciliation; defer. |
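For the credential-logging risk, a sketch of the mitigation the table asks for: a hand-written `Debug` impl that redacts the secret, plus the test. The struct shape follows the secrets example; everything else is illustrative:

```rust
use std::fmt;

pub struct BucketCredentials {
    pub access_key: String,
    pub secret_key: String,
}

// Hand-written Debug so an accidental {:?} in a log line can never leak the key.
impl fmt::Debug for BucketCredentials {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("BucketCredentials")
            .field("access_key", &self.access_key)
            .field("secret_key", &"<redacted>")
            .finish()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn debug_never_leaks_secret_key() {
        let creds = BucketCredentials {
            access_key: "AKIAEXAMPLE".into(),
            secret_key: "wJalrXUtEXAMPLE".into(),
        };
        let rendered = format!("{creds:?}");
        assert!(!rendered.contains("wJalrXUt"));
        assert!(rendered.contains("<redacted>"));
    }
}
```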
Decision log
- 1 dataset = 1 bucket (rather than datasets spanning buckets) — vastly simpler, matches the "tenant isolation" real use case. Revisit only if someone has an actual multi-bucket dataset requirement.
- Config-first, runtime-later — YAML/TOML for the MVP; runtime bucket add/remove is polish. Deployment restart is cheap in this stack.
- `bucket: String` repurposed, no new field — avoids a schema migration on every manifest.
- DataFusion's multi-store primitive handles cross-bucket queries — don't reinvent. We're wiring it up, not building it.