Federation foundation + HNSW trial system + Postgres streaming + PRD reframe

Four shipped features, each measured end-to-end, plus a PRD realignment:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s
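
For context on the recall numbers above: recall@10 is the overlap between the HNSW top-10 and the brute-force ground-truth top-10, averaged over the eval set. A minimal illustrative sketch of that metric (not the project's harness code; function and variable names are made up):

```rust
use std::collections::HashSet;

/// Fraction of ground-truth top-k ids that the approximate search also
/// returned, averaged over all eval queries.
fn recall_at_k(truth: &[Vec<u64>], approx: &[Vec<u64>], k: usize) -> f64 {
    let (mut hits, mut total) = (0usize, 0usize);
    for (t, a) in truth.iter().zip(approx) {
        let t_top: HashSet<&u64> = t.iter().take(k).collect();
        hits += a.iter().take(k).filter(|id| t_top.contains(id)).count();
        total += k.min(t.len());
    }
    if total == 0 { 0.0 } else { hits as f64 / total as f64 }
}

fn main() {
    let truth = vec![vec![1, 2, 3], vec![4, 5, 6]];
    let approx = vec![vec![1, 3, 9], vec![4, 5, 6]];
    // Query 1 recovers 2 of 3 true neighbors, query 2 recovers all 3: 5/6.
    println!("recall@3 = {:.2}", recall_at_k(&truth, &approx, 3));
}
```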

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the
  read-modify-write anti-pattern llms3.com calls out; the error journal and
  trial journal both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback
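
To make the bucket wiring above concrete, here is a hedged sketch of what a multi-bucket [storage] section could look like under the new StorageConfig/BucketConfig fields; the bucket names, paths, and secret handle are invented, and in lakehouse.toml this would sit under the [storage] table:

```rust
fn main() {
    // Deserializes the section directly with the field names added in this
    // commit (root, rescue_bucket, [[buckets]] with backend/root/bucket/
    // region/secret_ref); assumes the shared crate is on the path.
    let storage_toml = r#"
        root = "./data"
        rescue_bucket = "rescue"

        [[buckets]]
        name = "primary"
        backend = "local"
        root = "./data"

        [[buckets]]
        name = "rescue"
        backend = "local"
        root = "./data_rescue"

        [[buckets]]
        name = "archive"
        backend = "s3"
        bucket = "lakehouse-archive"
        region = "us-east-1"
        secret_ref = "archive_s3"   # handle only; resolved by SecretsProvider
    "#;
    let cfg: shared::config::StorageConfig =
        toml::from_str(storage_toml).expect("storage section parses");
    assert_eq!(cfg.buckets.len(), 3);
    assert_eq!(cfg.rescue_bucket.as_deref(), Some("rescue"));
}
```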

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)
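
For reference, a hedged sketch of the JSON body POST /ingest/db accepts, matching PgStreamRequest (dsn and table required; dataset_name, batch_size, order_by, limit optional). The DSN, credentials, and batch_size here are illustrative, not the values used for the verification run:

```rust
fn main() {
    // Optional fields default server-side: batch_size -> 10_000,
    // order_by -> first column from the schema probe, no row limit.
    let body = serde_json::json!({
        "dsn": "postgresql://postgres:example-password@localhost:5432/knowledge_base",
        "table": "team_runs",
        "dataset_name": "team_runs",
        "batch_size": 100
    });
    // POSTed to the gateway, e.g. http://localhost:3100/ingest/db
    // (3100 is the default gateway port).
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}
```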

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root 2026-04-16 01:50:05 -05:00
parent 84407eeb51
commit dbe00d018f
33 changed files with 3305 additions and 52 deletions

Cargo.lock generated

@ -599,6 +599,7 @@ dependencies = [
name = "catalogd"
version = "0.1.0"
dependencies = [
"arrow",
"axum",
"bytes",
"chrono",
@ -2931,6 +2932,7 @@ dependencies = [
"tokio",
"tokio-postgres",
"tracing",
"uuid",
]
[[package]]
@ -3955,6 +3957,7 @@ dependencies = [
"postgres-protocol",
"serde_core",
"serde_json",
"uuid",
]
[[package]]
@ -4718,6 +4721,7 @@ name = "shared"
version = "0.1.0"
dependencies = [
"arrow",
"async-trait",
"bytes",
"chrono",
"parquet",
@ -4725,6 +4729,7 @@ dependencies = [
"serde_json",
"sha2",
"thiserror 2.0.18",
"tokio",
"toml",
"tracing",
"uuid",
@ -4894,6 +4899,7 @@ version = "0.1.0"
dependencies = [
"axum",
"bytes",
"chrono",
"futures",
"object_store",
"serde",
@ -5638,6 +5644,7 @@ dependencies = [
"storaged",
"tokio",
"tracing",
"uuid",
]
[[package]]


@ -45,4 +45,4 @@ csv = "1"
lopdf = "0.35"
encoding_rs = "0.8"
instant-distance = "0.6"
tokio-postgres = { version = "0.7", features = ["with-serde_json-1", "with-chrono-0_4"] }
tokio-postgres = { version = "0.7", features = ["with-serde_json-1", "with-chrono-0_4", "with-uuid-1"] }


@ -15,5 +15,6 @@ bytes = { workspace = true }
chrono = { workspace = true }
uuid = { workspace = true }
object_store = { workspace = true }
arrow = { workspace = true }
proto = { path = "../proto" }
tonic = { workspace = true }


@ -140,6 +140,108 @@ impl Registry {
datasets.values().cloned().collect()
}
/// Re-read the parquet footer(s) for a dataset and repopulate `row_count`
/// and `columns` from reality. Use this to repair manifests whose
/// metadata was lost (e.g. migrated from a pre-Phase 10 catalog).
///
/// Does NOT touch owner/description/sensitivity/lineage/tags — only
/// the structural facts that parquet can tell us authoritatively.
/// The existing `schema_fingerprint` is updated if the recomputed one
/// differs; a warning is logged so drift is visible.
pub async fn resync_from_parquet(&self, name: &str) -> Result<DatasetManifest, String> {
use shared::arrow_helpers::{fingerprint_schema, parquet_to_record_batches};
// Snapshot the target manifest so we don't hold the write lock during IO.
let (id, objects, old_fp) = {
let datasets = self.datasets.read().await;
let m = datasets
.values()
.find(|d| d.name == name)
.ok_or_else(|| format!("dataset not found: {name}"))?;
(m.id.clone(), m.objects.clone(), m.schema_fingerprint.clone())
};
if objects.is_empty() {
return Err(format!("dataset '{name}' has no object references to resync from"));
}
let mut total_rows: u64 = 0;
let mut first_schema: Option<arrow::datatypes::SchemaRef> = None;
for obj in &objects {
let data = ops::get(&self.store, &obj.key).await
.map_err(|e| format!("read {}: {e}", obj.key))?;
let (schema, batches) = parquet_to_record_batches(&data)
.map_err(|e| format!("parse {}: {e}", obj.key))?;
let rows: u64 = batches.iter().map(|b| b.num_rows() as u64).sum();
total_rows += rows;
if first_schema.is_none() {
first_schema = Some(schema);
}
}
let schema = first_schema.ok_or("no schema recovered")?;
let new_fp = fingerprint_schema(&schema);
if new_fp != old_fp {
tracing::warn!(
"dataset '{}' schema fingerprint drift: {} -> {} (updating to match parquet reality)",
name, old_fp.0, new_fp.0,
);
}
let columns: Vec<ColumnMeta> = schema
.fields()
.iter()
.map(|f| ColumnMeta {
name: f.name().clone(),
data_type: f.data_type().to_string(),
sensitivity: None,
description: String::new(),
is_pii: false,
})
.collect();
// Apply updates.
let mut datasets = self.datasets.write().await;
let manifest = datasets
.get_mut(&id)
.ok_or_else(|| format!("dataset disappeared during resync: {name}"))?;
manifest.row_count = Some(total_rows);
manifest.columns = columns;
manifest.schema_fingerprint = new_fp;
manifest.updated_at = chrono::Utc::now();
// Persist.
let manifest_key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
ops::put(&self.store, &manifest_key, json.into()).await?;
tracing::info!("resynced '{name}': row_count={total_rows}, {} columns", manifest.columns.len());
Ok(manifest.clone())
}
/// Resync every dataset that currently has a null row_count.
/// Returns (successes, failures) where each entry is (name, detail).
pub async fn resync_missing(&self) -> (Vec<(String, u64)>, Vec<(String, String)>) {
let names: Vec<String> = {
let datasets = self.datasets.read().await;
datasets
.values()
.filter(|d| d.row_count.is_none() || d.columns.is_empty())
.map(|d| d.name.clone())
.collect()
};
let mut ok = Vec::new();
let mut err = Vec::new();
for name in names {
match self.resync_from_parquet(&name).await {
Ok(m) => ok.push((name, m.row_count.unwrap_or(0))),
Err(e) => err.push((name, e)),
}
}
(ok, err)
}
/// Add objects to an existing dataset.
pub async fn add_objects(
&self,


@ -19,6 +19,8 @@ pub fn router(registry: Registry) -> Router {
.route("/datasets/{id}", get(get_dataset))
.route("/datasets/by-name/{name}", get(get_dataset_by_name))
.route("/datasets/by-name/{name}/metadata", post(update_metadata))
.route("/datasets/by-name/{name}/resync", post(resync_dataset))
.route("/resync-missing", post(resync_all_missing))
.with_state(registry)
}
@ -152,3 +154,43 @@ async fn update_metadata(
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
/// Re-read parquet footers for a single dataset and repopulate row_count
/// and columns from reality. Useful for repairing manifests whose metadata
/// was lost or never backfilled.
async fn resync_dataset(
State(registry): State<Registry>,
Path(name): Path<String>,
) -> impl IntoResponse {
match registry.resync_from_parquet(&name).await {
Ok(manifest) => Ok(Json(DatasetResponse::from(&manifest))),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
#[derive(Serialize)]
struct ResyncAllResponse {
succeeded: Vec<ResyncOk>,
failed: Vec<ResyncErr>,
}
#[derive(Serialize)]
struct ResyncOk {
name: String,
row_count: u64,
}
#[derive(Serialize)]
struct ResyncErr {
name: String,
error: String,
}
/// Resync every dataset that currently has null row_count or empty columns.
async fn resync_all_missing(State(registry): State<Registry>) -> impl IntoResponse {
let (ok, err) = registry.resync_missing().await;
Json(ResyncAllResponse {
succeeded: ok.into_iter().map(|(name, row_count)| ResyncOk { name, row_count }).collect(),
failed: err.into_iter().map(|(name, error)| ResyncErr { name, error }).collect(),
})
}


@ -24,8 +24,30 @@ async fn main() {
tracing::info!("config loaded: gateway={}:{}, storage={}",
config.gateway.host, config.gateway.port, config.storage.root);
// Storage backend
let store = storaged::backend::init_local(&config.storage.root);
// Secrets provider
let secrets = std::sync::Arc::new(shared::secrets::FileSecretsProvider::from_env());
if let Err(e) = secrets.load().await {
tracing::warn!("secrets load: {e} — bucket credentials may be missing");
}
// Federation: bucket registry (primary + any [[storage.buckets]] entries + rescue)
let bucket_registry = std::sync::Arc::new(
storaged::registry::BucketRegistry::from_config(&config.storage, secrets.clone())
.await
.expect("bucket registry init failed")
);
let buckets = bucket_registry.list().await;
tracing::info!(
"bucket registry: {} buckets [{}], rescue={:?}",
buckets.len(),
buckets.iter().map(|b| format!("{}:{}", b.name, b.backend)).collect::<Vec<_>>().join(", "),
bucket_registry.rescue_name(),
);
// Back-compat: existing components that expect a single Arc<ObjectStore>
// get the primary bucket's store. As we migrate call sites to be bucket-
// aware, they'll call bucket_registry.get(name) directly.
let store = bucket_registry.default_store();
// Catalog
let registry = catalogd::registry::Registry::new(store.clone());
@ -56,7 +78,8 @@ async fn main() {
// HTTP router
let mut app = Router::new()
.route("/health", get(health))
.nest("/storage", storaged::service::router(store.clone()))
.nest("/storage", storaged::service::router(store.clone())
.merge(storaged::federation_service::router(bucket_registry.clone())))
.nest("/catalog", catalogd::service::router(registry.clone()))
.nest("/query", queryd::service::router(engine.clone()))
.nest("/ai", aibridge::service::router(ai_client.clone()))
@ -73,6 +96,8 @@ async fn main() {
job_tracker: vectord::jobs::JobTracker::new(),
index_registry: index_reg,
hnsw_store: vectord::hnsw::HnswStore::new(),
embedding_cache: vectord::embedding_cache::EmbeddingCache::new(store.clone()),
trial_journal: vectord::trial::TrialJournal::new(store.clone()),
}
}))
.nest("/workspaces", queryd::workspace_service::router(workspace_mgr))


@ -21,3 +21,4 @@ csv = { workspace = true }
chrono = { workspace = true }
object_store = { workspace = true }
tokio-postgres = { workspace = true }
uuid = { workspace = true }


@ -1,4 +1,5 @@
pub mod db_ingest;
pub mod pg_stream;
pub mod detect;
pub mod csv_ingest;
pub mod json_ingest;


@ -0,0 +1,308 @@
/// Streaming PostgreSQL ingest.
///
/// The original `db_ingest::import_postgres_table` loads every row into
/// memory before emitting Parquet — fine for small tables, blows up on
/// 1M+ rows. This module paginates via `ORDER BY <pk> LIMIT N OFFSET M`
/// and streams batches into an `ArrowWriter`, closing once exhausted.
///
/// Pagination is OFFSET-based (not keyset) — simpler, works on any table
/// without needing to know the PK type. OFFSET N scans N rows per fetch,
/// which becomes quadratic for very large tables. Upgrade to keyset when
/// a user actually ingests something that hurts.
use arrow::array::{ArrayRef, BooleanArray, Float64Array, Int32Array, Int64Array, RecordBatch, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use std::sync::Arc;
use tokio_postgres::{Client, NoTls, types::Type as PgType};
use crate::db_ingest::DbConfig;
/// Request shape for streaming ingest — takes a DSN string and optional
/// tuning knobs.
#[derive(Debug, Clone, serde::Deserialize)]
pub struct PgStreamRequest {
/// postgresql://user:pass@host:port/db — alternative to DbConfig.
pub dsn: String,
pub table: String,
#[serde(default)]
pub dataset_name: Option<String>,
/// Rows per fetch. Default 10_000.
#[serde(default)]
pub batch_size: Option<usize>,
/// Column to ORDER BY for stable pagination. If omitted, the first
/// column returned by the schema probe is used.
#[serde(default)]
pub order_by: Option<String>,
/// Hard cap on total rows (for sampling / previews).
#[serde(default)]
pub limit: Option<usize>,
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct PgStreamResult {
pub table: String,
pub rows: usize,
pub batches: usize,
pub columns: usize,
pub schema: Vec<String>,
pub parquet_bytes: u64,
pub duration_secs: f32,
}
/// Parse a postgresql:// DSN into a DbConfig.
/// Supports: postgresql://[user[:password]@]host[:port][/db]
/// Does NOT support: query parameters (sslmode etc) — add if needed.
pub fn parse_dsn(dsn: &str) -> Result<DbConfig, String> {
let rest = dsn
.strip_prefix("postgresql://")
.or_else(|| dsn.strip_prefix("postgres://"))
.ok_or_else(|| "DSN must start with postgresql:// or postgres://".to_string())?;
// Split off path (database) first.
let (auth_host, database) = match rest.split_once('/') {
Some((ah, db)) => (ah, db.split('?').next().unwrap_or(db).to_string()),
None => (rest, "postgres".to_string()),
};
// Split user[:password] from host[:port].
let (userpass, hostport) = match auth_host.rsplit_once('@') {
Some((up, hp)) => (Some(up), hp),
None => (None, auth_host),
};
let (user, password) = match userpass {
Some(up) => match up.split_once(':') {
Some((u, p)) => (u.to_string(), p.to_string()),
None => (up.to_string(), String::new()),
},
None => ("postgres".to_string(), String::new()),
};
let (host, port) = match hostport.rsplit_once(':') {
Some((h, p)) => (
h.to_string(),
p.parse::<u16>().map_err(|_| format!("invalid port in DSN: {p}"))?,
),
None => (hostport.to_string(), 5432),
};
if host.is_empty() {
return Err("DSN has no host".into());
}
Ok(DbConfig { host, port, database, user, password })
}
/// Stream a Postgres table as Parquet bytes.
///
/// Returns the full Parquet payload + summary stats. The payload is
/// written to an in-memory buffer via `ArrowWriter` so each batch gets
/// columnar compression as it arrives — memory footprint stays roughly
/// at one batch plus Parquet footer state.
pub async fn stream_table_to_parquet(
req: &PgStreamRequest,
) -> Result<(bytes::Bytes, PgStreamResult), String> {
let t0 = std::time::Instant::now();
let config = parse_dsn(&req.dsn)?;
let batch_size = req.batch_size.unwrap_or(10_000).max(1);
let (client, connection) = tokio_postgres::connect(&config.connection_string(), NoTls)
.await
.map_err(|e| format!("postgres connect: {e}"))?;
tokio::spawn(async move {
if let Err(e) = connection.await {
tracing::error!("pg connection: {e}");
}
});
// Probe schema.
let columns = probe_columns(&client, &req.table).await?;
if columns.is_empty() {
return Err(format!("table '{}' not found or has no columns", req.table));
}
let arrow_fields: Vec<Field> = columns
.iter()
.map(|(name, pg)| Field::new(name, pg_type_to_arrow(pg), true))
.collect();
let schema = Arc::new(Schema::new(arrow_fields));
let schema_report: Vec<String> = columns
.iter()
.map(|(n, t)| format!("{}:{}", n, t))
.collect();
// Pagination key: user-specified or first column.
let order_col = req.order_by.clone().unwrap_or_else(|| columns[0].0.clone());
// Stream batches into ArrowWriter.
let mut buf: Vec<u8> = Vec::with_capacity(1024 * 1024);
let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)
.map_err(|e| format!("ArrowWriter init: {e}"))?;
let mut total_rows: usize = 0;
let mut batch_count: usize = 0;
let row_cap = req.limit.unwrap_or(usize::MAX);
loop {
let remaining = row_cap.saturating_sub(total_rows);
if remaining == 0 { break; }
let fetch = remaining.min(batch_size);
let sql = format!(
"SELECT * FROM \"{}\" ORDER BY \"{}\" LIMIT {} OFFSET {}",
req.table, order_col, fetch, total_rows,
);
let rows = client
.query(&sql, &[])
.await
.map_err(|e| format!("fetch batch at offset {total_rows}: {e}"))?;
if rows.is_empty() { break; }
let n = rows.len();
let arrays: Vec<ArrayRef> = columns
.iter()
.enumerate()
.map(|(idx, (_, pg))| rows_to_column(&rows, idx, pg))
.collect::<Result<_, _>>()?;
let batch = RecordBatch::try_new(schema.clone(), arrays)
.map_err(|e| format!("RecordBatch: {e}"))?;
writer.write(&batch).map_err(|e| format!("ArrowWriter::write: {e}"))?;
total_rows += n;
batch_count += 1;
tracing::info!(
"pg stream '{}': fetched batch {} ({} rows, total {})",
req.table, batch_count, n, total_rows,
);
if n < fetch { break; } // short read = end of table
}
writer.close().map_err(|e| format!("ArrowWriter::close: {e}"))?;
let parquet_bytes = buf.len() as u64;
let duration = t0.elapsed().as_secs_f32();
let result = PgStreamResult {
table: req.table.clone(),
rows: total_rows,
batches: batch_count,
columns: columns.len(),
schema: schema_report,
parquet_bytes,
duration_secs: duration,
};
Ok((bytes::Bytes::from(buf), result))
}
async fn probe_columns(client: &Client, table: &str) -> Result<Vec<(String, PgType)>, String> {
let stmt = client
.prepare(&format!("SELECT * FROM \"{}\" LIMIT 0", table))
.await
.map_err(|e| format!("prepare schema probe: {e}"))?;
Ok(stmt
.columns()
.iter()
.map(|c| (c.name().to_string(), c.type_().clone()))
.collect())
}
fn pg_type_to_arrow(pg: &PgType) -> DataType {
match *pg {
PgType::BOOL => DataType::Boolean,
PgType::INT2 | PgType::INT4 => DataType::Int32,
PgType::INT8 | PgType::OID => DataType::Int64,
PgType::FLOAT4 | PgType::FLOAT8 => DataType::Float64,
// NUMERIC deliberately falls through to Utf8: tokio-postgres has no
// built-in f64 decoding for it, and mapping it to Float64 here while
// rows_to_column emits strings would make RecordBatch::try_new reject
// the batch with a schema mismatch.
_ => DataType::Utf8,
}
}
fn rows_to_column(
rows: &[tokio_postgres::Row],
idx: usize,
pg: &PgType,
) -> Result<ArrayRef, String> {
match *pg {
PgType::BOOL => {
let v: Vec<Option<bool>> = rows.iter().map(|r| r.try_get(idx).ok()).collect();
Ok(Arc::new(BooleanArray::from(v)))
}
PgType::INT2 => {
let v: Vec<Option<i32>> = rows.iter()
.map(|r| r.try_get::<_, i16>(idx).ok().map(|x| x as i32)).collect();
Ok(Arc::new(Int32Array::from(v)))
}
PgType::INT4 => {
let v: Vec<Option<i32>> = rows.iter().map(|r| r.try_get(idx).ok()).collect();
Ok(Arc::new(Int32Array::from(v)))
}
PgType::INT8 | PgType::OID => {
let v: Vec<Option<i64>> = rows.iter().map(|r| r.try_get(idx).ok()).collect();
Ok(Arc::new(Int64Array::from(v)))
}
PgType::FLOAT4 => {
let v: Vec<Option<f64>> = rows.iter()
.map(|r| r.try_get::<_, f32>(idx).ok().map(|x| x as f64)).collect();
Ok(Arc::new(Float64Array::from(v)))
}
PgType::FLOAT8 => {
let v: Vec<Option<f64>> = rows.iter().map(|r| r.try_get(idx).ok()).collect();
Ok(Arc::new(Float64Array::from(v)))
}
_ => {
// Safe-default per ADR-010: everything else becomes string.
let v: Vec<Option<String>> = rows.iter().map(|r| {
r.try_get::<_, String>(idx).ok()
.or_else(|| r.try_get::<_, serde_json::Value>(idx).ok().map(|j| j.to_string()))
.or_else(|| {
// Timestamps + uuid, rendered via Display.
r.try_get::<_, chrono::DateTime<chrono::Utc>>(idx).ok()
.map(|t| t.to_rfc3339())
})
.or_else(|| r.try_get::<_, uuid::Uuid>(idx).ok().map(|u| u.to_string()))
.or(Some(String::new()))
}).collect();
Ok(Arc::new(StringArray::from(v)))
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_dsn_full() {
let c = parse_dsn("postgresql://daisy:secret@db.example.com:6543/my_db").unwrap();
assert_eq!(c.host, "db.example.com");
assert_eq!(c.port, 6543);
assert_eq!(c.user, "daisy");
assert_eq!(c.password, "secret");
assert_eq!(c.database, "my_db");
}
#[test]
fn parse_dsn_minimal() {
let c = parse_dsn("postgresql://localhost/knowledge_base").unwrap();
assert_eq!(c.host, "localhost");
assert_eq!(c.port, 5432);
assert_eq!(c.user, "postgres");
assert_eq!(c.password, "");
assert_eq!(c.database, "knowledge_base");
}
#[test]
fn parse_dsn_no_password() {
let c = parse_dsn("postgres://postgres@127.0.0.1:5432/mydb").unwrap();
assert_eq!(c.user, "postgres");
assert_eq!(c.password, "");
}
#[test]
fn parse_dsn_rejects_non_pg() {
assert!(parse_dsn("http://host/db").is_err());
}
}


@ -11,7 +11,7 @@ use serde::Deserialize;
use std::sync::Arc;
use catalogd::registry::Registry;
use crate::{db_ingest, pipeline};
use crate::{db_ingest, pg_stream, pipeline};
use shared::arrow_helpers::record_batch_to_parquet;
use shared::types::{ObjectRef, SchemaFingerprint};
use storaged::ops;
@ -28,6 +28,7 @@ pub fn router(state: IngestState) -> Router {
.route("/file", post(ingest_file))
.route("/postgres/tables", post(list_pg_tables))
.route("/postgres/import", post(import_pg_table))
.route("/db", post(ingest_db_stream))
.with_state(state)
}
@ -188,3 +189,112 @@ async fn import_pg_table(
"size_bytes": parquet_size,
}))))
}
/// Streaming DSN-based ingest. Paginates via ORDER BY + LIMIT/OFFSET so
/// large tables don't blow up memory. This is the path the task spec calls
/// `POST /ingest/db` with `{dsn, table, dataset_name}`.
async fn ingest_db_stream(
State(state): State<IngestState>,
Json(req): Json<pg_stream::PgStreamRequest>,
) -> Result<(StatusCode, Json<serde_json::Value>), (StatusCode, String)> {
tracing::info!(
"pg stream ingest: table='{}' dataset='{:?}' batch_size={:?}",
req.table, req.dataset_name, req.batch_size,
);
// Stream from Postgres into Parquet bytes.
let (parquet, stream_result) = pg_stream::stream_table_to_parquet(&req)
.await
.map_err(|e| (StatusCode::BAD_GATEWAY, e))?;
if stream_result.rows == 0 {
return Ok((StatusCode::OK, Json(serde_json::json!({
"table": req.table,
"rows": 0,
"message": "table is empty",
}))));
}
// Recover schema from the parquet footer — keeps a single source of truth
// for the schema fingerprint + column metadata.
let (schema, _) = shared::arrow_helpers::parquet_to_record_batches(&parquet)
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("reparse parquet: {e}")))?;
let dataset_name = req.dataset_name.clone().unwrap_or_else(|| req.table.clone());
let storage_key = format!("datasets/{}.parquet", dataset_name);
let size_bytes = parquet.len() as u64;
ops::put(&state.store, &storage_key, parquet)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e))?;
let schema_fp = shared::arrow_helpers::fingerprint_schema(&schema);
let now = chrono::Utc::now();
state.registry.register(
dataset_name.clone(),
SchemaFingerprint(schema_fp.0),
vec![ObjectRef {
bucket: "primary".to_string(),
key: storage_key.clone(),
size_bytes,
created_at: now,
}],
).await.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e))?;
// Rich metadata: PII auto-detect, lineage, row_count, columns.
let col_names: Vec<&str> = schema.fields().iter().map(|f| f.name().as_str()).collect();
let sensitivity = shared::pii::detect_dataset_sensitivity(&col_names);
let columns: Vec<shared::types::ColumnMeta> = schema.fields().iter().map(|f| {
let sens = shared::pii::detect_sensitivity(f.name());
shared::types::ColumnMeta {
name: f.name().clone(),
data_type: f.data_type().to_string(),
sensitivity: sens.clone(),
description: String::new(),
is_pii: matches!(sens, Some(shared::types::Sensitivity::Pii)),
}
}).collect();
let lineage = shared::types::Lineage {
source_system: "postgresql".to_string(),
source_file: format!("dsn: {}", redact_dsn(&req.dsn)),
ingest_job: format!("pg-stream-{}", now.timestamp_millis()),
ingest_timestamp: now,
parent_datasets: vec![],
};
let _ = state.registry.update_metadata(&dataset_name, catalogd::registry::MetadataUpdate {
sensitivity,
columns: Some(columns),
lineage: Some(lineage),
row_count: Some(stream_result.rows as u64),
..Default::default()
}).await;
Ok((StatusCode::CREATED, Json(serde_json::json!({
"dataset_name": dataset_name,
"table": stream_result.table,
"rows": stream_result.rows,
"batches": stream_result.batches,
"columns": stream_result.columns,
"schema": stream_result.schema,
"storage_key": storage_key,
"size_bytes": size_bytes,
"duration_secs": stream_result.duration_secs,
}))))
}
/// Redact the password in a postgresql:// DSN for logging / lineage.
fn redact_dsn(dsn: &str) -> String {
// Simple approach: find @...: and replace password between : and @.
if let Some(at_idx) = dsn.rfind('@') {
let prefix = &dsn[..at_idx];
if let Some(colon_idx) = prefix.rfind(':') {
// Only redact when the colon sits after the scheme separator ("://"),
// so the "postgres://" spelling with a short username is still caught
// and the scheme's own colon is never touched.
let scheme_end = dsn.find("://").map(|i| i + 3).unwrap_or(0);
if colon_idx >= scheme_end {
return format!("{}:***{}", &dsn[..colon_idx], &dsn[at_idx..]);
}
}
}
dsn.to_string()
}


@ -15,3 +15,5 @@ bytes = { workspace = true }
sha2 = { workspace = true }
toml = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"


@ -28,8 +28,39 @@ pub struct GatewayConfig {
#[derive(Debug, Clone, Deserialize)]
pub struct StorageConfig {
/// Legacy single-backend root. If `buckets` is empty, this is used to
/// create an implicit `primary` bucket at this path — preserves the
/// pre-federation config shape.
#[serde(default = "default_storage_root")]
pub root: String,
/// Where profile buckets are rooted when auto-provisioned.
#[serde(default = "default_profile_root")]
pub profile_root: String,
/// Name of the bucket used for read fallback when a target bucket is
/// unreachable. If `None`, no fallback — reads fail hard.
#[serde(default)]
pub rescue_bucket: Option<String>,
/// Explicitly configured buckets. Empty = backward-compat single-bucket
/// mode driven by `root`.
#[serde(default)]
pub buckets: Vec<BucketConfig>,
}
#[derive(Debug, Clone, Deserialize)]
pub struct BucketConfig {
pub name: String,
pub backend: String, // "local" | "s3"
/// Local filesystem root (for backend = "local")
pub root: Option<String>,
/// S3 bucket name (for backend = "s3")
pub bucket: Option<String>,
pub region: Option<String>,
pub endpoint: Option<String>,
/// Handle for the secrets provider — never the literal credential.
pub secret_ref: Option<String>,
}
#[derive(Debug, Clone, Deserialize, Default)]
@ -78,6 +109,7 @@ pub struct ObservabilityConfig {
fn default_host() -> String { "0.0.0.0".to_string() }
fn default_gateway_port() -> u16 { 3100 }
fn default_storage_root() -> String { "./data".to_string() }
fn default_profile_root() -> String { "./data/_profiles".to_string() }
fn default_manifest_prefix() -> String { "_catalog/manifests".to_string() }
fn default_sidecar_url() -> String { "http://localhost:3200".to_string() }
fn default_embed_model() -> String { "nomic-embed-text".to_string() }
@ -115,7 +147,12 @@ impl Default for Config {
fn default() -> Self {
Self {
gateway: GatewayConfig { host: default_host(), port: default_gateway_port() },
storage: StorageConfig { root: default_storage_root() },
storage: StorageConfig {
root: default_storage_root(),
profile_root: default_profile_root(),
rescue_bucket: None,
buckets: Vec::new(),
},
catalog: CatalogConfig::default(),
query: QueryConfig::default(),
sidecar: SidecarConfig { url: default_sidecar_url() },


@ -3,3 +3,4 @@ pub mod errors;
pub mod arrow_helpers;
pub mod config;
pub mod pii;
pub mod secrets;


@ -0,0 +1,144 @@
/// Secrets layer for bucket credentials.
///
/// Credentials never appear in `lakehouse.toml`. Each bucket entry has an
/// opaque `secret_ref` handle, resolved at startup by a `SecretsProvider`.
///
/// MVP ships `FileSecretsProvider` — reads a TOML file at a path given by
/// `LAKEHOUSE_SECRETS` env var, defaulting to `/etc/lakehouse/secrets.toml`.
/// The file should be root-owned, mode 0600, never in git.
///
/// Future providers (Vault, SOPS, OS keyring) implement the same trait.
use async_trait::async_trait;
use serde::Deserialize;
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Arc;
#[derive(Clone)]
pub struct BucketCredentials {
pub access_key: String,
pub secret_key: String,
}
// Redact credentials when debug-printing — NEVER let these end up in logs.
// A derived Debug would dump the secret key verbatim, so `{:?}` goes through
// `redacted()` instead.
impl std::fmt::Debug for BucketCredentials {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "BucketCredentials {{ {} }}", self.redacted())
}
}
impl BucketCredentials {
pub fn redacted(&self) -> String {
let prefix: String = self.access_key.chars().take(4).collect();
format!("access_key={}... (redacted)", prefix)
}
}
#[async_trait]
pub trait SecretsProvider: Send + Sync {
/// Resolve a handle to credentials. Returns Err if the handle is unknown.
async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String>;
/// List known handles (for diagnostics only, never values).
async fn list_handles(&self) -> Vec<String>;
}
#[derive(Debug, Deserialize)]
struct RawSecretEntry {
access_key: String,
secret_key: String,
}
/// File-backed secrets provider.
pub struct FileSecretsProvider {
path: PathBuf,
cache: tokio::sync::RwLock<HashMap<String, BucketCredentials>>,
}
impl FileSecretsProvider {
pub fn new(path: impl Into<PathBuf>) -> Self {
Self {
path: path.into(),
cache: tokio::sync::RwLock::new(HashMap::new()),
}
}
/// Default path: env var `LAKEHOUSE_SECRETS` or `/etc/lakehouse/secrets.toml`.
pub fn from_env() -> Self {
let path = std::env::var("LAKEHOUSE_SECRETS")
.unwrap_or_else(|_| "/etc/lakehouse/secrets.toml".to_string());
Self::new(path)
}
/// Arc wrapper for convenient injection.
pub fn shared(path: impl Into<PathBuf>) -> Arc<dyn SecretsProvider> {
Arc::new(Self::new(path))
}
/// Load and populate the cache. Call once at startup.
pub async fn load(&self) -> Result<usize, String> {
if !self.path.exists() {
tracing::info!(
"secrets file {} not present — running with zero credentials configured",
self.path.display()
);
return Ok(0);
}
check_permissions(&self.path)?;
let content = std::fs::read_to_string(&self.path)
.map_err(|e| format!("read secrets file: {e}"))?;
let raw: HashMap<String, RawSecretEntry> = toml::from_str(&content)
.map_err(|e| format!("parse secrets file: {e}"))?;
let mut cache = self.cache.write().await;
for (handle, entry) in raw {
cache.insert(handle.clone(), BucketCredentials {
access_key: entry.access_key,
secret_key: entry.secret_key,
});
}
let count = cache.len();
tracing::info!("secrets: loaded {count} handles from {}", self.path.display());
Ok(count)
}
}
#[async_trait]
impl SecretsProvider for FileSecretsProvider {
async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String> {
let cache = self.cache.read().await;
cache
.get(handle)
.cloned()
.ok_or_else(|| format!("secret handle not found: {handle}"))
}
async fn list_handles(&self) -> Vec<String> {
self.cache.read().await.keys().cloned().collect()
}
}
/// Reject group- or world-readable secrets files. Fail closed.
fn check_permissions(path: &Path) -> Result<(), String> {
#[cfg(unix)]
{
use std::os::unix::fs::PermissionsExt;
let meta = std::fs::metadata(path)
.map_err(|e| format!("stat secrets file: {e}"))?;
let mode = meta.permissions().mode() & 0o777;
if mode & 0o044 != 0 {
return Err(format!(
"secrets file {} is group/world-readable (mode {:o}); chmod 600 to proceed",
path.display(), mode
));
}
}
let _ = path;
Ok(())
}
/// A zero-handles provider for tests / purely-local setups.
pub struct EmptySecretsProvider;
#[async_trait]
impl SecretsProvider for EmptySecretsProvider {
async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String> {
Err(format!("no secrets provider configured (wanted handle '{handle}')"))
}
async fn list_handles(&self) -> Vec<String> { Vec::new() }
}
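
Not part of the diff: a minimal sketch of the secrets.toml shape FileSecretsProvider parses, one TOML table per handle with access_key/secret_key fields matching RawSecretEntry. The handle name and values are invented:

```rust
fn main() {
    // A bucket whose config carries `secret_ref = "archive_s3"` would be
    // resolved from an entry like this in /etc/lakehouse/secrets.toml.
    let secrets_toml = r#"
        [archive_s3]
        access_key = "AKIAEXAMPLEEXAMPLE"
        secret_key = "not-a-real-secret"
    "#;

    #[derive(serde::Deserialize)]
    struct Entry {
        access_key: String,
        secret_key: String,
    }

    let parsed: std::collections::HashMap<String, Entry> =
        toml::from_str(secrets_toml).expect("secrets file parses");
    assert_eq!(parsed["archive_s3"].access_key, "AKIAEXAMPLEEXAMPLE");
    assert!(!parsed["archive_s3"].secret_key.is_empty());
}
```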


@ -13,3 +13,4 @@ tracing = { workspace = true }
object_store = { workspace = true }
bytes = { workspace = true }
futures = { workspace = true }
chrono = { workspace = true }


@ -0,0 +1,239 @@
/// Write-once batched append log.
///
/// Problem we're fixing: the error journal and HNSW trial journal both
/// previously did read-modify-write of their whole JSONL file on every
/// event. That's O(N²) cumulative work and generates huge churn at scale.
/// It's exactly the pattern llms3.com flags as the "small-file /
/// rewrite-amplification" anti-pattern.
///
/// This helper implements the pattern object storage actually wants:
///
/// - Events accumulate in an **in-memory buffer** (reads see them immediately).
/// - When the buffer hits a threshold, or `flush()` is called, the buffer is
/// written **as one new object** with a timestamp-sorted key.
/// - Existing objects are never rewritten.
/// - Reads enumerate all batch files, sort by key, and concat in order.
/// - An explicit `compact()` reads every batch file, writes one consolidated
/// file, and deletes the originals — the LSM-tree compaction idea applied
/// to small JSONL events.
///
/// Storage layout:
/// ```
/// {prefix}/
/// batch_0001776319628000123.jsonl
/// batch_0001776319745987654.jsonl
/// batch_0001776319800000000.jsonl (consolidated file written by compact())
/// ```
/// Key format: the zero-padded epoch microsecond of the write, so lexical
/// sort == chronological sort.
use bytes::Bytes;
use chrono::Utc;
use object_store::ObjectStore;
use std::sync::Arc;
use tokio::sync::Mutex;
use crate::ops;
const DEFAULT_FLUSH_THRESHOLD: usize = 32;
pub struct AppendLog {
store: Arc<dyn ObjectStore>,
prefix: String,
buffer: Mutex<Vec<Vec<u8>>>,
flush_threshold: usize,
}
impl AppendLog {
/// Create a new append log rooted at `prefix` in the given object store.
/// Events auto-flush when buffer reaches `flush_threshold` (default 32).
pub fn new(store: Arc<dyn ObjectStore>, prefix: impl Into<String>) -> Self {
Self {
store,
prefix: prefix.into(),
buffer: Mutex::new(Vec::new()),
flush_threshold: DEFAULT_FLUSH_THRESHOLD,
}
}
pub fn with_flush_threshold(mut self, threshold: usize) -> Self {
self.flush_threshold = threshold.max(1);
self
}
/// Add an event. The returned future completes either immediately
/// (buffered) or after a flush, depending on whether the buffer hit the
/// threshold. Callers don't need to care either way.
pub async fn append(&self, line: Vec<u8>) -> Result<(), String> {
let should_flush = {
let mut buf = self.buffer.lock().await;
buf.push(line);
buf.len() >= self.flush_threshold
};
if should_flush {
self.flush().await?;
}
Ok(())
}
/// Force-flush the in-memory buffer to object storage as a single new
/// batch file. Safe to call anytime (idempotent no-op when buffer empty).
pub async fn flush(&self) -> Result<(), String> {
let batch = {
let mut buf = self.buffer.lock().await;
if buf.is_empty() { return Ok(()); }
std::mem::take(&mut *buf)
};
let ts_us = Utc::now().timestamp_micros().max(0) as u128;
let key = format!("{}/batch_{:019}.jsonl", self.prefix, ts_us);
let mut body = Vec::with_capacity(batch.iter().map(|b| b.len() + 1).sum());
for line in batch {
body.extend_from_slice(&line);
if !body.ends_with(b"\n") { body.push(b'\n'); }
}
ops::put(&self.store, &key, Bytes::from(body)).await
}
/// Read every event across all batch files + unflushed in-memory buffer.
/// Events are returned in chronological order.
pub async fn read_all(&self) -> Result<Vec<Vec<u8>>, String> {
let mut keys = self.list_batch_keys().await?;
keys.sort();
let mut out = Vec::new();
for key in keys {
let bytes = ops::get(&self.store, &key).await?;
for line in bytes.split(|b| *b == b'\n') {
if !line.is_empty() {
out.push(line.to_vec());
}
}
}
// Include unflushed events so callers see the latest state
// whether or not someone ran flush() recently.
let buf = self.buffer.lock().await;
for line in buf.iter() {
out.push(line.clone());
}
Ok(out)
}
/// Consolidate all current batch files into one compacted file, then
/// delete the originals. Safe to call while appends are in flight:
/// new batches written during compaction get a higher timestamp and
/// survive. Fails closed — if anything goes wrong mid-delete, the
/// compacted file coexists with originals and next read sees duplicates
/// (which the dedup caller must handle) rather than data loss.
pub async fn compact(&self) -> Result<CompactStats, String> {
// Snapshot which files to compact BEFORE we write the new one.
let mut originals = self.list_batch_keys().await?;
originals.sort();
if originals.len() < 2 {
return Ok(CompactStats { merged_files: originals.len(), events: 0, new_key: None });
}
// Gather all existing events.
let mut events = Vec::new();
for key in &originals {
let bytes = ops::get(&self.store, key).await?;
for line in bytes.split(|b| *b == b'\n') {
if !line.is_empty() {
events.push(line.to_vec());
}
}
}
let total_events = events.len();
if total_events == 0 {
// Clean up empty files without writing a new one.
for key in &originals {
let _ = ops::delete(&self.store, key).await;
}
return Ok(CompactStats { merged_files: originals.len(), events: 0, new_key: None });
}
// Name the compacted file with the SAME `batch_{ts}` format so it
// sorts chronologically with future batches. Using a distinct prefix
// ("batch_compacted_") would break lex ordering: later `batch_N`
// files would sort BEFORE the compacted file because 'c' > digits.
// Timestamp = now, so any appends arriving during compaction (which
// get the current wall-clock time) sort AFTER this file.
let ts_us = Utc::now().timestamp_micros().max(0) as u128;
let new_key = format!("{}/batch_{:019}.jsonl", self.prefix, ts_us);
let mut body = Vec::new();
for line in &events {
body.extend_from_slice(line);
body.push(b'\n');
}
ops::put(&self.store, &new_key, Bytes::from(body)).await?;
// Only delete originals once the consolidated file is persisted.
let mut failures = 0;
for key in &originals {
if ops::delete(&self.store, key).await.is_err() {
failures += 1;
}
}
if failures > 0 {
tracing::warn!(
"compact '{}': {} original files failed to delete — consolidated file {} has the data",
self.prefix, failures, new_key,
);
}
Ok(CompactStats {
merged_files: originals.len(),
events: total_events,
new_key: Some(new_key),
})
}
/// How many batch files exist on disk right now.
pub async fn file_count(&self) -> Result<usize, String> {
Ok(self.list_batch_keys().await?.len())
}
async fn list_batch_keys(&self) -> Result<Vec<String>, String> {
let prefix_with_slash = format!("{}/", self.prefix);
// list the prefix then filter for keys that match our naming scheme;
// unrelated files at the same prefix won't be touched.
let raw = ops::list(&self.store, Some(&prefix_with_slash)).await?;
Ok(raw
.into_iter()
.filter(|k| {
let basename = k.rsplit('/').next().unwrap_or(k);
basename.starts_with("batch_") && basename.ends_with(".jsonl")
})
.collect())
}
}
#[derive(Debug, Clone, serde::Serialize)]
pub struct CompactStats {
pub merged_files: usize,
pub events: usize,
pub new_key: Option<String>,
}
// Best-effort visibility on drop. We can't `.await` in Drop, and spawning a
// flush isn't safe on every runtime shape, so unflushed events are only
// logged rather than persisted. Acceptable for these observability journals,
// which are hints, not records.
impl Drop for AppendLog {
fn drop(&mut self) {
let buf_len = self.buffer.try_lock().map(|b| b.len()).unwrap_or(0);
if buf_len == 0 { return; }
// Can't spawn from sync Drop on every runtime shape; log + move on.
tracing::debug!(
"append_log '{}' dropping with {} unflushed events",
self.prefix, buf_len,
);
}
}


@ -0,0 +1,200 @@
/// Bucket operation error journal.
///
/// Every bucket op failure and every rescue fallback lands here. Goal:
/// answering "is anything broken?" with one HTTP call.
///
/// Storage: batched write-once files under `_errors/bucket_errors/` in the
/// primary bucket. Uses the shared `AppendLog` helper so we never rewrite
/// existing files — see `append_log.rs` for the full pattern rationale.
///
/// In-memory ring buffer holds the last N events for fast response; on
/// startup `load_recent` hydrates it from all batch files.
use chrono::{DateTime, Utc};
use object_store::ObjectStore;
use serde::{Deserialize, Serialize};
use std::collections::VecDeque;
use std::sync::Arc;
use tokio::sync::RwLock;
use crate::append_log::{AppendLog, CompactStats};
const JOURNAL_PREFIX: &str = "_errors/bucket_errors";
const RING_CAPACITY: usize = 2000;
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum BucketOp {
Read,
Write,
Delete,
List,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BucketErrorEvent {
pub ts: DateTime<Utc>,
pub op: BucketOp,
pub target: String,
pub key: String,
pub error: String,
#[serde(default)]
pub rescued: bool,
}
impl BucketErrorEvent {
pub fn new_read(target: &str, key: &str, error: &str) -> Self {
Self { ts: Utc::now(), op: BucketOp::Read, target: target.into(), key: key.into(), error: error.into(), rescued: false }
}
pub fn new_write(target: &str, key: &str, error: &str) -> Self {
Self { ts: Utc::now(), op: BucketOp::Write, target: target.into(), key: key.into(), error: error.into(), rescued: false }
}
pub fn new_delete(target: &str, key: &str, error: &str) -> Self {
Self { ts: Utc::now(), op: BucketOp::Delete, target: target.into(), key: key.into(), error: error.into(), rescued: false }
}
pub fn new_list(target: &str, prefix: &str, error: &str) -> Self {
Self { ts: Utc::now(), op: BucketOp::List, target: target.into(), key: prefix.into(), error: error.into(), rescued: false }
}
}
#[derive(Clone)]
pub struct ErrorJournal {
log: Arc<AppendLog>,
ring: Arc<RwLock<VecDeque<BucketErrorEvent>>>,
}
#[derive(Debug, Clone, Serialize)]
pub struct HealthReport {
pub period_minutes: i64,
pub total_errors: usize,
pub per_bucket: std::collections::HashMap<String, usize>,
pub unhealthy_buckets: Vec<String>,
}
impl ErrorJournal {
pub fn new(store: Arc<dyn ObjectStore>) -> Self {
// Keep flush threshold lowish — operators checking /storage/errors
// after a recent incident want to see fresh rows even without an
// explicit flush.
let log = Arc::new(
AppendLog::new(store, JOURNAL_PREFIX).with_flush_threshold(8),
);
Self {
log,
ring: Arc::new(RwLock::new(VecDeque::with_capacity(RING_CAPACITY))),
}
}
/// Hydrate the ring buffer from existing batch files. Call once at
/// startup. Tolerates malformed lines (skipped with a warning) and
/// missing files (returns 0).
pub async fn load_recent(&self) -> Result<usize, String> {
let lines = self.log.read_all().await.unwrap_or_default();
let mut ring = self.ring.write().await;
for line in lines {
match serde_json::from_slice::<BucketErrorEvent>(&line) {
Ok(ev) => {
if ring.len() >= RING_CAPACITY { ring.pop_front(); }
ring.push_back(ev);
}
Err(e) => tracing::warn!("error journal: skip malformed line ({e})"),
}
}
Ok(ring.len())
}
/// Append an event. In-memory ring updated immediately; persistence
/// happens in batches via the underlying AppendLog.
pub async fn append(&self, event: BucketErrorEvent) {
{
let mut ring = self.ring.write().await;
if ring.len() >= RING_CAPACITY { ring.pop_front(); }
ring.push_back(event.clone());
}
match serde_json::to_vec(&event) {
Ok(line) => {
if let Err(e) = self.log.append(line).await {
tracing::error!("error journal persist failed: {e}");
}
}
Err(e) => tracing::error!("error journal serialize failed: {e}"),
}
}
/// Mark the most recent matching in-memory event as rescued.
/// Only updates the ring buffer — the JSONL line already persisted
/// records the failure fact; rescue status travels in response headers.
pub async fn mark_rescued_last(&self, target: &str, key: &str) {
let mut ring = self.ring.write().await;
for ev in ring.iter_mut().rev() {
if matches!(ev.op, BucketOp::Read) && ev.target == target && ev.key == key && !ev.rescued {
ev.rescued = true;
break;
}
}
}
pub async fn recent(&self, limit: usize) -> Vec<BucketErrorEvent> {
let ring = self.ring.read().await;
let start = ring.len().saturating_sub(limit);
ring.iter().skip(start).cloned().collect()
}
pub async fn filter(
&self,
bucket: Option<&str>,
since: Option<DateTime<Utc>>,
limit: usize,
) -> Vec<BucketErrorEvent> {
let ring = self.ring.read().await;
ring.iter()
.rev()
.filter(|ev| bucket.map_or(true, |b| ev.target == b))
.filter(|ev| since.map_or(true, |s| ev.ts >= s))
.take(limit)
.cloned()
.collect::<Vec<_>>()
.into_iter()
.rev()
.collect()
}
/// Summarize errors in the last N minutes.
pub async fn health(&self, period_minutes: i64) -> HealthReport {
use std::collections::HashMap;
let cutoff = Utc::now() - chrono::Duration::minutes(period_minutes);
let ring = self.ring.read().await;
let recent: Vec<_> = ring.iter().filter(|ev| ev.ts >= cutoff).collect();
let mut per_bucket: HashMap<String, usize> = HashMap::new();
for ev in &recent {
*per_bucket.entry(ev.target.clone()).or_insert(0) += 1;
}
let unhealthy_buckets: Vec<String> = per_bucket
.iter()
.filter(|(_, c)| **c >= 3)
.map(|(k, _)| k.clone())
.collect();
HealthReport {
period_minutes,
total_errors: recent.len(),
per_bucket,
unhealthy_buckets,
}
}
/// Force an immediate flush of buffered events to object storage.
pub async fn flush(&self) -> Result<(), String> {
self.log.flush().await
}
/// Consolidate all batch files into one. Operator cleanup.
pub async fn compact(&self) -> Result<CompactStats, String> {
self.log.compact().await
}
/// How many JSONL batch files currently exist.
pub async fn file_count(&self) -> Result<usize, String> {
self.log.file_count().await
}
}


@ -0,0 +1,126 @@
/// HTTP surface for federation operator endpoints.
///
/// - `GET /buckets` — configured buckets with reachability status
/// - `GET /errors` — recent bucket op failures (filterable)
/// - `GET /bucket-health` — aggregated errors-per-bucket in the last 5 minutes
///
/// Mounted under `/storage` by the gateway. `/storage/health` is already
/// claimed by the existing storage router for service liveness, so the
/// aggregated bucket health lives at `/storage/bucket-health`.
use axum::{
Json, Router,
extract::{Path, Query, State},
http::StatusCode,
response::IntoResponse,
routing::{get, put},
};
use chrono::{DateTime, Utc};
use serde::Deserialize;
use std::sync::Arc;
use crate::error_journal::BucketErrorEvent;
use crate::registry::{BucketInfo, BucketRegistry};
pub fn router(registry: Arc<BucketRegistry>) -> Router {
Router::new()
.route("/buckets", get(list_buckets))
.route("/errors", get(list_errors))
.route("/errors/flush", axum::routing::post(flush_errors))
.route("/errors/compact", axum::routing::post(compact_errors))
.route("/bucket-health", get(get_health))
.route("/buckets/{bucket}/objects/{*key}", put(put_bucket_object))
.route("/buckets/{bucket}/objects/{*key}", get(get_bucket_object))
.with_state(registry)
}
async fn list_buckets(State(reg): State<Arc<BucketRegistry>>) -> Json<Vec<BucketInfo>> {
Json(reg.list().await)
}
#[derive(Deserialize)]
struct ErrorQuery {
bucket: Option<String>,
since: Option<DateTime<Utc>>,
#[serde(default = "default_limit")]
limit: usize,
}
fn default_limit() -> usize { 50 }
async fn list_errors(
State(reg): State<Arc<BucketRegistry>>,
Query(q): Query<ErrorQuery>,
) -> Json<Vec<BucketErrorEvent>> {
Json(reg.journal().filter(q.bucket.as_deref(), q.since, q.limit).await)
}
#[derive(Deserialize)]
struct HealthQuery {
#[serde(default = "default_period")]
minutes: i64,
}
fn default_period() -> i64 { 5 }
async fn get_health(
State(reg): State<Arc<BucketRegistry>>,
Query(q): Query<HealthQuery>,
) -> impl IntoResponse {
Json(reg.journal().health(q.minutes).await)
}
/// Bucket-aware write. Hard-fails if the target bucket is unreachable;
/// never falls back to rescue for writes.
async fn put_bucket_object(
State(reg): State<Arc<BucketRegistry>>,
Path((bucket, key)): Path<(String, String)>,
body: bytes::Bytes,
) -> impl IntoResponse {
match reg.write_smart(&bucket, &key, body).await {
Ok(()) => (StatusCode::CREATED, format!("stored: {bucket}/{key}")).into_response(),
Err(e) => (StatusCode::SERVICE_UNAVAILABLE, e).into_response(),
}
}
async fn flush_errors(State(reg): State<Arc<BucketRegistry>>) -> impl IntoResponse {
match reg.journal().flush().await {
Ok(()) => (StatusCode::OK, "flushed").into_response(),
Err(e) => (StatusCode::INTERNAL_SERVER_ERROR, e).into_response(),
}
}
async fn compact_errors(State(reg): State<Arc<BucketRegistry>>) -> impl IntoResponse {
match reg.journal().compact().await {
Ok(stats) => Json(stats).into_response(),
Err(e) => (StatusCode::INTERNAL_SERVER_ERROR, e).into_response(),
}
}
/// Bucket-aware read. Falls through to the rescue bucket if the target
/// is unreachable or the key is missing; emits observability headers so
/// callers can detect the fallback.
async fn get_bucket_object(
State(reg): State<Arc<BucketRegistry>>,
Path((bucket, key)): Path<(String, String)>,
) -> impl IntoResponse {
match reg.read_smart(&bucket, &key).await {
Ok(outcome) => {
let mut resp = outcome.data.into_response();
if outcome.rescued {
let headers = resp.headers_mut();
headers.insert("x-lakehouse-rescue-used", "true".parse().unwrap());
headers.insert(
"x-lakehouse-original-bucket",
outcome.original_bucket.parse().unwrap(),
);
headers.insert(
"x-lakehouse-served-by",
outcome.served_by.parse().unwrap(),
);
}
resp
}
Err(e) => (StatusCode::NOT_FOUND, e).into_response(),
}
}


@ -1,3 +1,7 @@
pub mod append_log;
pub mod backend;
pub mod error_journal;
pub mod federation_service;
pub mod ops;
pub mod registry;
pub mod service;


@ -0,0 +1,283 @@
/// Multi-backend bucket registry — the federation foundation.
///
/// Federation rule: every `ObjectRef` belongs to exactly one named bucket.
/// The registry resolves bucket names to `object_store` backends, handles
/// rescue-bucket fallback on read failure, writes every failure to the
/// error journal, and exposes a health summary for operators.
///
/// Existing call sites can keep using `ops::*` with `registry.get(name)`.
/// New bucket-aware call sites use `registry.read_smart` / `write_smart`
/// which handle fallback + journaling automatically.
use object_store::ObjectStore;
use object_store::local::LocalFileSystem;
use serde::Serialize;
use shared::config::{BucketConfig, StorageConfig};
use shared::secrets::{BucketCredentials, SecretsProvider};
use std::collections::HashMap;
use std::sync::Arc;
use crate::error_journal::{BucketErrorEvent, ErrorJournal};
/// A registered bucket — the store handle + its configuration.
pub struct BucketEntry {
pub name: String,
pub backend: String,
pub store: Arc<dyn ObjectStore>,
pub config: BucketConfig,
}
/// Read outcome — may have been rescued.
#[derive(Debug, Clone)]
pub struct ReadOutcome {
pub data: bytes::Bytes,
pub rescued: bool,
pub original_bucket: String,
pub served_by: String,
}
/// Summary entry for GET /storage/buckets.
#[derive(Debug, Clone, Serialize)]
pub struct BucketInfo {
pub name: String,
pub backend: String,
pub reachable: bool,
pub role: BucketRole,
}
#[derive(Debug, Clone, Serialize, PartialEq)]
#[serde(rename_all = "lowercase")]
pub enum BucketRole {
Primary,
Rescue,
Profile,
Tenant,
}
pub struct BucketRegistry {
buckets: HashMap<String, Arc<BucketEntry>>,
default: String,
rescue: Option<String>,
profile_root: String,
journal: ErrorJournal,
}
impl BucketRegistry {
/// Build the registry from storage config + secrets provider.
/// Back-compat: if `buckets` is empty, synthesize a `primary` bucket from
/// the legacy `root` field so pre-federation configs keep working.
pub async fn from_config(
cfg: &StorageConfig,
secrets: Arc<dyn SecretsProvider>,
) -> Result<Self, String> {
let mut buckets: HashMap<String, Arc<BucketEntry>> = HashMap::new();
let bucket_configs: Vec<BucketConfig> = if cfg.buckets.is_empty() {
vec![BucketConfig {
name: "primary".to_string(),
backend: "local".to_string(),
root: Some(cfg.root.clone()),
bucket: None,
region: None,
endpoint: None,
secret_ref: None,
}]
} else {
cfg.buckets.clone()
};
for bc in bucket_configs {
let store = build_store(&bc, secrets.as_ref()).await?;
let entry = Arc::new(BucketEntry {
name: bc.name.clone(),
backend: bc.backend.clone(),
store,
config: bc.clone(),
});
buckets.insert(bc.name.clone(), entry);
}
// Ensure `primary` always exists — it's where error journals live.
if !buckets.contains_key("primary") {
return Err("no bucket named 'primary' configured — required as error-journal home".into());
}
// Rescue bucket is optional but, if named, must exist.
if let Some(r) = &cfg.rescue_bucket {
if !buckets.contains_key(r) {
return Err(format!("rescue_bucket '{r}' not found among configured buckets"));
}
}
let journal = ErrorJournal::new(buckets.get("primary").unwrap().store.clone());
let _ = journal.load_recent().await;
Ok(Self {
buckets,
default: "primary".to_string(),
rescue: cfg.rescue_bucket.clone(),
profile_root: cfg.profile_root.clone(),
journal,
})
}
pub fn default_name(&self) -> &str { &self.default }
pub fn rescue_name(&self) -> Option<&str> { self.rescue.as_deref() }
pub fn journal(&self) -> &ErrorJournal { &self.journal }
/// Resolve a bucket name to its object store. Existing call sites use
/// this as a drop-in replacement for the old single-store pattern.
pub fn get(&self, bucket: &str) -> Result<Arc<dyn ObjectStore>, String> {
self.buckets
.get(bucket)
.map(|e| e.store.clone())
.ok_or_else(|| format!("unknown bucket: {bucket}"))
}
/// The default bucket's store — use for code paths that don't yet know
/// about buckets.
pub fn default_store(&self) -> Arc<dyn ObjectStore> {
self.buckets.get(&self.default).unwrap().store.clone()
}
/// List all registered buckets. Checks reachability by doing a trivial
/// `list` with limit 1 on each.
pub async fn list(&self) -> Vec<BucketInfo> {
let mut out = Vec::with_capacity(self.buckets.len());
for (name, entry) in &self.buckets {
let reachable = probe(&entry.store).await;
let role = self.classify(name);
out.push(BucketInfo {
name: name.clone(),
backend: entry.backend.clone(),
reachable,
role,
});
}
out.sort_by(|a, b| a.name.cmp(&b.name));
out
}
fn classify(&self, name: &str) -> BucketRole {
if name == self.default { BucketRole::Primary }
else if Some(name) == self.rescue.as_deref() { BucketRole::Rescue }
else if name.starts_with("profile:") { BucketRole::Profile }
else { BucketRole::Tenant }
}
/// Read with rescue-bucket fallback. If the target bucket fails and a
/// rescue is configured, retries against rescue. Records every failure
/// in the error journal.
pub async fn read_smart(&self, bucket: &str, key: &str) -> Result<ReadOutcome, String> {
let target = self.buckets.get(bucket)
.ok_or_else(|| format!("unknown bucket: {bucket}"))?;
match crate::ops::get(&target.store, key).await {
Ok(data) => Ok(ReadOutcome {
data, rescued: false,
original_bucket: bucket.to_string(),
served_by: bucket.to_string(),
}),
Err(err) => {
// Record failure regardless of what happens next.
self.journal.append(BucketErrorEvent::new_read(bucket, key, &err)).await;
// Try rescue, if any.
if let Some(rescue_name) = &self.rescue {
if rescue_name != bucket {
if let Some(rescue) = self.buckets.get(rescue_name) {
match crate::ops::get(&rescue.store, key).await {
Ok(data) => {
self.journal.mark_rescued_last(bucket, key).await;
return Ok(ReadOutcome {
data, rescued: true,
original_bucket: bucket.to_string(),
served_by: rescue_name.clone(),
});
}
Err(rescue_err) => {
return Err(format!(
"read '{key}' failed in '{bucket}' ({err}); rescue '{rescue_name}' also failed ({rescue_err})"
));
}
}
}
}
}
Err(format!("read '{key}' failed in '{bucket}': {err}"))
}
}
}
/// Write always goes to target. No rescue fallback for writes — writes
/// that silently vanish are the worst possible failure.
pub async fn write_smart(
&self,
bucket: &str,
key: &str,
data: bytes::Bytes,
) -> Result<(), String> {
let target = self.buckets.get(bucket)
.ok_or_else(|| format!("unknown bucket: {bucket}"))?;
match crate::ops::put(&target.store, key, data).await {
Ok(()) => Ok(()),
Err(err) => {
self.journal.append(BucketErrorEvent::new_write(bucket, key, &err)).await;
Err(format!("write '{key}' failed in '{bucket}': {err}"))
}
}
}
}
/// Trivial reachability check — list the store and pull at most one entry.
async fn probe(store: &Arc<dyn ObjectStore>) -> bool {
use futures::StreamExt;
let mut stream = store.list(None);
// Pulling the first item confirms the store responds. Empty bucket = ok.
match stream.next().await {
Some(Ok(_)) => true,
None => true, // empty but reachable
Some(Err(_)) => false,
}
}
/// Build a concrete ObjectStore from a BucketConfig.
async fn build_store(
bc: &BucketConfig,
secrets: &dyn SecretsProvider,
) -> Result<Arc<dyn ObjectStore>, String> {
match bc.backend.as_str() {
"local" => {
let root = bc.root.as_deref()
.ok_or_else(|| format!("bucket '{}' is backend=local but has no root", bc.name))?;
std::fs::create_dir_all(root)
.map_err(|e| format!("create bucket dir '{root}': {e}"))?;
let fs = LocalFileSystem::new_with_prefix(root)
.map_err(|e| format!("init local bucket '{}': {e}", bc.name))?;
Ok(Arc::new(fs))
}
"s3" => {
let handle = bc.secret_ref.as_deref()
.ok_or_else(|| format!("s3 bucket '{}' has no secret_ref", bc.name))?;
let creds: BucketCredentials = secrets.resolve(handle).await?;
let s3_bucket = bc.bucket.as_deref()
.ok_or_else(|| format!("s3 bucket '{}' has no `bucket` name", bc.name))?;
let region = bc.region.as_deref().unwrap_or("us-east-1");
let mut builder = object_store::aws::AmazonS3Builder::new()
.with_bucket_name(s3_bucket)
.with_region(region)
.with_access_key_id(&creds.access_key)
.with_secret_access_key(&creds.secret_key);
if let Some(endpoint) = &bc.endpoint {
builder = builder.with_endpoint(endpoint);
}
let s3 = builder.build()
.map_err(|e| format!("init s3 bucket '{}': {e}", bc.name))?;
Ok(Arc::new(s3))
}
other => Err(format!("unknown backend '{other}' for bucket '{}'", bc.name)),
}
}

View File

@ -18,3 +18,4 @@ parquet = { workspace = true }
arrow = { workspace = true }
chrono = { workspace = true }
instant-distance = { workspace = true }
uuid = { workspace = true }

View File

@ -0,0 +1,98 @@
/// In-memory cache for StoredEmbedding vectors.
///
/// Rationale: loading 100K embeddings from Parquet takes ~2-5s. When an AI agent
/// iterates on HNSW parameters, each trial would repeat that cost. The cache
/// pins embeddings in memory so trials reuse them.
///
/// This is a pure performance layer — the Parquet file is still the source of
/// truth (ADR-008). Eviction is safe; worst case is one slow reload.
use object_store::ObjectStore;
use serde::Serialize;
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;
use crate::store::{self, StoredEmbedding};
#[derive(Clone)]
pub struct EmbeddingCache {
store: Arc<dyn ObjectStore>,
cache: Arc<RwLock<HashMap<String, Arc<Vec<StoredEmbedding>>>>>,
}
#[derive(Debug, Clone, Serialize)]
pub struct CacheEntry {
pub index_name: String,
pub vectors: usize,
pub dimensions: usize,
pub memory_bytes: u64,
}
#[derive(Debug, Clone, Serialize)]
pub struct CacheStats {
pub entries: Vec<CacheEntry>,
pub total_memory_bytes: u64,
}
impl EmbeddingCache {
pub fn new(store: Arc<dyn ObjectStore>) -> Self {
Self {
store,
cache: Arc::new(RwLock::new(HashMap::new())),
}
}
/// Return cached embeddings, loading from object storage on first request.
pub async fn get_or_load(
&self,
index_name: &str,
) -> Result<Arc<Vec<StoredEmbedding>>, String> {
if let Some(cached) = self.cache.read().await.get(index_name) {
return Ok(cached.clone());
}
// Load under a write lock so concurrent callers only hit disk once.
let mut guard = self.cache.write().await;
if let Some(cached) = guard.get(index_name) {
return Ok(cached.clone());
}
tracing::info!("embedding_cache: loading '{index_name}' from object storage");
let t0 = std::time::Instant::now();
let loaded = store::load_embeddings(&self.store, index_name).await?;
let n = loaded.len();
let arc = Arc::new(loaded);
guard.insert(index_name.to_string(), arc.clone());
tracing::info!(
"embedding_cache: loaded '{index_name}' — {n} vectors in {:.2}s",
t0.elapsed().as_secs_f32()
);
Ok(arc)
}
pub async fn evict(&self, index_name: &str) -> bool {
self.cache.write().await.remove(index_name).is_some()
}
pub async fn stats(&self) -> CacheStats {
let cache = self.cache.read().await;
let mut entries = Vec::with_capacity(cache.len());
let mut total: u64 = 0;
for (name, embs) in cache.iter() {
let dims = embs.first().map(|e| e.vector.len()).unwrap_or(0);
// Rough estimate: vector data + chunk_text + metadata overhead.
let vec_bytes = (embs.len() * dims * std::mem::size_of::<f32>()) as u64;
let text_bytes: u64 = embs.iter().map(|e| e.chunk_text.len() as u64).sum();
let overhead = (embs.len() * 128) as u64; // strings + struct overhead
let mem = vec_bytes + text_bytes + overhead;
total += mem;
entries.push(CacheEntry {
index_name: name.clone(),
vectors: embs.len(),
dimensions: dims,
memory_bytes: mem,
});
}
CacheStats { entries, total_memory_bytes: total }
}
}

View File

@ -0,0 +1,216 @@
/// Eval harness for HNSW tuning.
///
/// An EvalSet is a named collection of queries against a specific vector index.
/// Each query has a query_text and (optionally pre-computed) ground_truth
/// doc_ids — the "correct" top-k results that brute-force cosine returns.
/// HNSW trials are scored by how well they recreate that top-k (recall@k).
///
/// Storage: `_hnsw_evals/{name}.json` as a single JSON document. Small.
use chrono::{DateTime, Utc};
use object_store::ObjectStore;
use serde::{Deserialize, Serialize};
use std::sync::Arc;
use aibridge::client::{AiClient, EmbedRequest};
use storaged::ops;
use crate::store::StoredEmbedding;
/// A single eval query with optional pre-computed ground truth.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalQuery {
pub id: String,
pub query_text: String,
/// Ordered list of doc_ids that brute-force returns as top-k.
/// `None` means the ground truth hasn't been computed yet; `compute_ground_truth`
/// fills it in against the current embedding set.
#[serde(default)]
pub ground_truth: Option<Vec<String>>,
/// Optional pre-computed query embedding. Filling this avoids re-embedding
/// the same query on every trial.
#[serde(default)]
pub query_embedding: Option<Vec<f32>>,
}
/// A named eval set tied to a specific vector index.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalSet {
pub name: String,
pub index_name: String,
pub k: usize, // top-k used for recall calculation
pub queries: Vec<EvalQuery>,
pub created_at: DateTime<Utc>,
pub ground_truth_built: bool,
}
impl EvalSet {
pub fn new(name: &str, index_name: &str, k: usize) -> Self {
Self {
name: name.to_string(),
index_name: index_name.to_string(),
k,
queries: Vec::new(),
created_at: Utc::now(),
ground_truth_built: false,
}
}
pub fn storage_key(&self) -> String {
format!("_hnsw_evals/{}.json", self.name)
}
pub async fn save(&self, store: &Arc<dyn ObjectStore>) -> Result<(), String> {
let json = serde_json::to_vec_pretty(self).map_err(|e| e.to_string())?;
ops::put(store, &self.storage_key(), json.into()).await
}
pub async fn load(store: &Arc<dyn ObjectStore>, name: &str) -> Result<Self, String> {
let key = format!("_hnsw_evals/{}.json", name);
let data = ops::get(store, &key).await?;
serde_json::from_slice(&data).map_err(|e| format!("parse eval set: {e}"))
}
pub async fn list(store: &Arc<dyn ObjectStore>) -> Result<Vec<String>, String> {
let keys = ops::list(store, Some("_hnsw_evals/")).await?;
Ok(keys
.into_iter()
.filter(|k| k.ends_with(".json"))
.map(|k| {
k.trim_start_matches("_hnsw_evals/")
.trim_end_matches(".json")
.to_string()
})
.collect())
}
}
/// Cosine similarity for two same-length f32 slices.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
let mut dot = 0.0f32;
let mut na = 0.0f32;
let mut nb = 0.0f32;
for i in 0..a.len() {
dot += a[i] * b[i];
na += a[i] * a[i];
nb += b[i] * b[i];
}
if na == 0.0 || nb == 0.0 {
return 0.0;
}
dot / (na.sqrt() * nb.sqrt())
}
/// Brute-force top-k search for a single query against all embeddings.
/// This is the ground-truth oracle that HNSW trials must approximate.
pub fn brute_force_top_k(
query: &[f32],
embeddings: &[StoredEmbedding],
k: usize,
) -> Vec<String> {
let mut scored: Vec<(f32, usize)> = embeddings
.iter()
.enumerate()
.map(|(i, e)| (cosine(query, &e.vector), i))
.collect();
    // Full sort for simplicity; only the top-k entries are consumed below.
scored.sort_unstable_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
scored
.into_iter()
.take(k)
.map(|(_, i)| embeddings[i].doc_id.clone())
.collect()
}
/// Recall@k — fraction of ground-truth doc_ids that appear in predicted.
pub fn recall_at_k(predicted: &[String], ground_truth: &[String], k: usize) -> f32 {
if ground_truth.is_empty() || k == 0 {
return 0.0;
}
let gt_set: std::collections::HashSet<&String> =
ground_truth.iter().take(k).collect();
let hits = predicted
.iter()
.take(k)
.filter(|d| gt_set.contains(d))
.count();
hits as f32 / gt_set.len() as f32
}
/// Populate query_embedding and ground_truth for every query that lacks them.
pub async fn compute_ground_truth(
eval: &mut EvalSet,
embeddings: &[StoredEmbedding],
ai_client: &AiClient,
) -> Result<(), String> {
let need_embed: Vec<(usize, String)> = eval
.queries
.iter()
.enumerate()
.filter(|(_, q)| q.query_embedding.is_none())
.map(|(i, q)| (i, q.query_text.clone()))
.collect();
if !need_embed.is_empty() {
// Embed in one batch to keep things simple; for very large eval sets
// we'd batch this in chunks of 32.
let texts: Vec<String> = need_embed.iter().map(|(_, t)| t.clone()).collect();
let resp = ai_client
.embed(EmbedRequest { texts, model: None })
.await
.map_err(|e| format!("embed queries: {e}"))?;
for ((idx, _), vec) in need_embed.iter().zip(resp.embeddings.iter()) {
let v: Vec<f32> = vec.iter().map(|&x| x as f32).collect();
eval.queries[*idx].query_embedding = Some(v);
}
}
for q in eval.queries.iter_mut() {
if q.ground_truth.is_some() {
continue;
}
let emb = q.query_embedding.as_ref().ok_or("missing embedding")?;
q.ground_truth = Some(brute_force_top_k(emb, embeddings, eval.k));
}
eval.ground_truth_built = true;
Ok(())
}
/// Auto-generate a synthetic eval set by sampling every Nth chunk's text as
/// its own query. Useful for a quick-start eval when the user doesn't have
/// real natural-language queries yet.
pub fn synthetic_from_chunks(
eval_name: &str,
index_name: &str,
embeddings: &[StoredEmbedding],
sample_count: usize,
k: usize,
) -> EvalSet {
let n = embeddings.len();
let sample_count = sample_count.min(n);
let stride = (n / sample_count.max(1)).max(1);
let mut queries = Vec::with_capacity(sample_count);
for i in 0..sample_count {
let idx = (i * stride).min(n - 1);
let chunk = &embeddings[idx];
// Use the first ~200 chars of the chunk as the "query" — it should find
// itself and nearby chunks as top results.
let query_text: String = chunk.chunk_text.chars().take(200).collect();
queries.push(EvalQuery {
id: format!("syn-{}", i),
query_text,
ground_truth: None,
query_embedding: None,
});
}
EvalSet {
name: eval_name.to_string(),
index_name: index_name.to_string(),
k,
queries,
created_at: Utc::now(),
ground_truth_built: false,
}
}

View File

@ -7,6 +7,7 @@ use std::sync::Arc;
use tokio::sync::RwLock;
use crate::store::StoredEmbedding;
use crate::trial::HnswConfig;
/// A vector point for HNSW — wraps f32 slice with cosine distance.
#[derive(Clone)]
@ -62,18 +63,32 @@ impl HnswStore {
}
}
/// Build an HNSW index from stored embeddings.
/// Build an HNSW index from stored embeddings with default config.
pub async fn build_index(
&self,
index_name: &str,
embeddings: Vec<StoredEmbedding>,
) -> Result<BuildStats, String> {
self.build_index_with_config(index_name, embeddings, &HnswConfig::default()).await
}
/// Build an HNSW index from stored embeddings with explicit config.
/// Used by the trial system — each trial calls this with different params.
pub async fn build_index_with_config(
&self,
index_name: &str,
embeddings: Vec<StoredEmbedding>,
config: &HnswConfig,
) -> Result<BuildStats, String> {
let n = embeddings.len();
if n == 0 {
return Err("no embeddings to index".into());
}
tracing::info!("building HNSW index '{}' from {} vectors", index_name, n);
tracing::info!(
"building HNSW '{}': {} vectors, ef_construction={} ef_search={} seed={:?}",
index_name, n, config.ef_construction, config.ef_search, config.seed,
);
let start = std::time::Instant::now();
// Separate points and metadata
@ -92,14 +107,17 @@ impl HnswStore {
values.push(i);
}
// Build HNSW — this is the expensive part
let map = Builder::default()
.ef_construction(40) // balanced for 100K scale
.ef_search(50) // fast search with good recall
.build(points, values);
// Build HNSW — the expensive part
let mut builder = Builder::default()
.ef_construction(config.ef_construction)
.ef_search(config.ef_search);
if let Some(seed) = config.seed {
builder = builder.seed(seed);
}
let map = builder.build(points, values);
let build_time = start.elapsed().as_secs_f32();
tracing::info!("HNSW index '{}' built: {} vectors in {:.1}s", index_name, n, build_time);
tracing::info!("HNSW '{}' built: {} vectors in {:.1}s", index_name, n, build_time);
let index = Arc::new(HnswIndex { map, metadata });
self.indexes.write().await.insert(index_name.to_string(), index);
@ -111,6 +129,43 @@ impl HnswStore {
})
}
/// Run a batch of search queries and return raw per-query latencies in microseconds.
/// Also returns the retrieved doc_ids per query (for recall calculation).
pub async fn bench_search(
&self,
index_name: &str,
query_vectors: &[Vec<f32>],
top_k: usize,
) -> Result<BenchResult, String> {
let indexes = self.indexes.read().await;
let index = indexes
.get(index_name)
.ok_or_else(|| format!("HNSW index not found: {index_name}"))?
.clone();
drop(indexes);
let mut latencies_us = Vec::with_capacity(query_vectors.len());
let mut retrieved: Vec<Vec<String>> = Vec::with_capacity(query_vectors.len());
for qv in query_vectors {
let query_point = VecPoint(qv.clone());
let t0 = std::time::Instant::now();
let mut search = Search::default();
let results = index.map.search(&query_point, &mut search);
let ids: Vec<String> = results
.take(top_k)
.map(|item| {
let meta_idx = *item.value;
index.metadata[meta_idx].doc_id.clone()
})
.collect();
latencies_us.push(t0.elapsed().as_micros() as f32);
retrieved.push(ids);
}
Ok(BenchResult { latencies_us, retrieved })
}
/// Search an HNSW index. Returns approximate nearest neighbors.
pub async fn search(
&self,
@ -166,3 +221,9 @@ pub struct BuildStats {
pub vectors: usize,
pub build_time_secs: f32,
}
#[derive(Debug, Clone)]
pub struct BenchResult {
pub latencies_us: Vec<f32>,
pub retrieved: Vec<Vec<String>>,
}

View File

@ -1,4 +1,6 @@
pub mod chunker;
pub mod embedding_cache;
pub mod harness;
pub mod hnsw;
pub mod index_registry;
pub mod jobs;
@ -7,3 +9,4 @@ pub mod search;
pub mod rag;
pub mod supervisor;
pub mod service;
pub mod trial;

View File

@ -10,7 +10,7 @@ use serde::{Deserialize, Serialize};
use std::sync::Arc;
use aibridge::client::{AiClient, EmbedRequest};
use crate::{chunker, hnsw, index_registry, jobs, rag, search, store, supervisor};
use crate::{chunker, embedding_cache, harness, hnsw, index_registry, jobs, rag, search, store, supervisor, trial};
#[derive(Clone)]
pub struct VectorState {
@ -19,6 +19,8 @@ pub struct VectorState {
pub job_tracker: jobs::JobTracker,
pub index_registry: index_registry::IndexRegistry,
pub hnsw_store: hnsw::HnswStore,
pub embedding_cache: embedding_cache::EmbeddingCache,
pub trial_journal: trial::TrialJournal,
}
pub fn router(state: VectorState) -> Router {
@ -34,6 +36,17 @@ pub fn router(state: VectorState) -> Router {
.route("/hnsw/build", post(build_hnsw))
.route("/hnsw/search", post(search_hnsw))
.route("/hnsw/list", get(list_hnsw))
// Trial system — parameterized tuning loop
.route("/hnsw/trial", post(run_trial))
.route("/hnsw/trials/{index_name}", get(list_trials))
.route("/hnsw/trials/{index_name}/best", get(best_trial))
// Eval sets
.route("/hnsw/evals", get(list_evals))
.route("/hnsw/evals/{name}", get(get_eval).put(put_eval))
.route("/hnsw/evals/{name}/autogen", post(autogen_eval))
// Cache management
.route("/hnsw/cache/stats", get(cache_stats))
.route("/hnsw/cache/{index_name}", axum::routing::delete(cache_evict))
.with_state(state)
}
@ -307,26 +320,35 @@ async fn rag_query(
struct BuildHnswRequest {
/// Name of the stored vector index to build HNSW from
index_name: String,
/// Optional config override. Omit to use the production default
/// (ec=80 es=30 — see HnswConfig::default docs for rationale).
#[serde(default)]
config: Option<trial::HnswConfig>,
}
/// Build an HNSW index from an existing stored vector index.
/// Loads embeddings from Parquet, builds HNSW in memory.
/// Uses the embedding cache so repeated builds don't reload from Parquet.
async fn build_hnsw(
State(state): State<VectorState>,
Json(req): Json<BuildHnswRequest>,
) -> impl IntoResponse {
tracing::info!("building HNSW for '{}'", req.index_name);
let config = req.config.unwrap_or_default();
tracing::info!(
"building HNSW for '{}' ef_construction={} ef_search={}",
req.index_name, config.ef_construction, config.ef_search,
);
// Load embeddings from Parquet
let embeddings = store::load_embeddings(&state.store, &req.index_name)
let embeddings = state
.embedding_cache
.get_or_load(&req.index_name)
.await
.map_err(|e| (StatusCode::NOT_FOUND, format!("index not found: {e}")))?;
let n = embeddings.len();
tracing::info!("loaded {} embeddings, building HNSW...", n);
// Build HNSW
match state.hnsw_store.build_index(&req.index_name, embeddings).await {
match state
.hnsw_store
.build_index_with_config(&req.index_name, (*embeddings).clone(), &config)
.await
{
Ok(stats) => Ok(Json(stats)),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
@ -372,3 +394,275 @@ async fn search_hnsw(
async fn list_hnsw(State(state): State<VectorState>) -> impl IntoResponse {
Json(state.hnsw_store.list().await)
}
// --- Trial System: parameterized HNSW tuning loop ---
//
// Flow:
// 1. Agent picks an HnswConfig
// 2. POST /hnsw/trial builds HNSW with that config against cached embeddings,
// runs every query in the harness, measures latency + recall vs the
// harness's ground truth, appends a Trial record to _hnsw_trials/{idx}.jsonl
// 3. Agent reads GET /hnsw/trials/{index}, sees history, decides next config
// 4. Repeat until converged.
//
// The first trial triggers embedding load (slow). Every subsequent trial reuses
// the cache — so the agent iterates in seconds, not minutes.
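// Illustrative request shape (field names match TrialRequest below; the
// concrete values and the harness name are made up for the example):
//
//   POST /hnsw/trial
//   { "index_name": "resumes_100k_v2", "harness": "resumes-syn",
//     "config": { "ef_construction": 40, "ef_search": 30 }, "note": "try lower ec" }
//
// The response is the full Trial record, including recall@k and the latency
// percentiles measured against the harness's ground truth.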
#[derive(Deserialize)]
struct TrialRequest {
index_name: String,
harness: String,
#[serde(default)]
config: trial::HnswConfig,
#[serde(default)]
note: Option<String>,
}
async fn run_trial(
State(state): State<VectorState>,
Json(req): Json<TrialRequest>,
) -> Result<Json<trial::Trial>, (StatusCode, String)> {
let mut harness_set = harness::EvalSet::load(&state.store, &req.harness)
.await
.map_err(|e| (StatusCode::NOT_FOUND, format!("harness not found: {e}")))?;
if harness_set.index_name != req.index_name {
return Err((
StatusCode::BAD_REQUEST,
format!(
"harness '{}' is for index '{}', not '{}'",
req.harness, harness_set.index_name, req.index_name
),
));
}
if harness_set.queries.is_empty() {
return Err((StatusCode::BAD_REQUEST, "harness has no queries".into()));
}
let embeddings = state
.embedding_cache
.get_or_load(&req.index_name)
.await
.map_err(|e| (StatusCode::NOT_FOUND, format!("load embeddings: {e}")))?;
if !harness_set.ground_truth_built {
tracing::info!("trial: computing ground truth for harness '{}'", harness_set.name);
let t0 = std::time::Instant::now();
harness::compute_ground_truth(&mut harness_set, &embeddings, &state.ai_client)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("ground truth: {e}")))?;
tracing::info!("trial: ground truth built in {:.1}s", t0.elapsed().as_secs_f32());
harness_set
.save(&state.store)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("save harness: {e}")))?;
}
let trial_id = trial::Trial::new_id();
let hnsw_slot = format!("{}__{}", req.index_name, trial_id);
let build_stats = state
.hnsw_store
.build_index_with_config(&hnsw_slot, (*embeddings).clone(), &req.config)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("build: {e}")))?;
let query_vectors: Vec<Vec<f32>> = harness_set
.queries
.iter()
.filter_map(|q| q.query_embedding.clone())
.collect();
let bench = state
.hnsw_store
.bench_search(&hnsw_slot, &query_vectors, harness_set.k)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("search: {e}")))?;
let mut recalls = Vec::with_capacity(harness_set.queries.len());
for (q, hits) in harness_set.queries.iter().zip(bench.retrieved.iter()) {
if let Some(gt) = &q.ground_truth {
recalls.push(harness::recall_at_k(hits, gt, harness_set.k));
}
}
let mean_recall = if recalls.is_empty() {
0.0
} else {
recalls.iter().sum::<f32>() / recalls.len() as f32
};
let mut lats = bench.latencies_us.clone();
lats.sort_unstable_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
let p = |pct: f32| -> f32 {
if lats.is_empty() { return 0.0; }
let idx = ((lats.len() as f32 - 1.0) * pct).round() as usize;
lats[idx.min(lats.len() - 1)]
};
    // Brute-force reference latency, measured on a single query only, so the
    // comparison costs one full scan per trial rather than one per harness query.
let brute_latency_us = if let Some(qv) = query_vectors.first() {
let t0 = std::time::Instant::now();
let _ = harness::brute_force_top_k(qv, &embeddings, harness_set.k);
t0.elapsed().as_micros() as f32
} else {
0.0
};
let dims = embeddings.first().map(|e| e.vector.len()).unwrap_or(0);
let memory_bytes =
(embeddings.len() * dims * std::mem::size_of::<f32>() + embeddings.len() * 128) as u64;
let trial_record = trial::Trial {
id: trial_id.clone(),
index_name: req.index_name.clone(),
eval_set: req.harness.clone(),
config: req.config.clone(),
metrics: trial::TrialMetrics {
build_time_secs: build_stats.build_time_secs,
search_latency_p50_us: p(0.50),
search_latency_p95_us: p(0.95),
search_latency_p99_us: p(0.99),
recall_at_k: mean_recall,
memory_bytes,
vectors: build_stats.vectors,
eval_queries: harness_set.queries.len(),
brute_force_latency_us: brute_latency_us,
},
created_at: chrono::Utc::now(),
note: req.note,
};
state
.trial_journal
.append(&trial_record)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("journal: {e}")))?;
state.hnsw_store.drop(&hnsw_slot).await;
Ok(Json(trial_record))
}
async fn list_trials(
State(state): State<VectorState>,
Path(index_name): Path<String>,
) -> impl IntoResponse {
match state.trial_journal.list(&index_name).await {
Ok(trials) => Ok(Json(trials)),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
#[derive(Deserialize)]
struct BestTrialQuery {
#[serde(default = "default_metric")]
metric: String,
}
fn default_metric() -> String {
"pareto".to_string()
}
async fn best_trial(
State(state): State<VectorState>,
Path(index_name): Path<String>,
Query(q): Query<BestTrialQuery>,
) -> impl IntoResponse {
match state.trial_journal.best(&index_name, &q.metric).await {
Ok(Some(t)) => Ok(Json(t)),
Ok(None) => Err((StatusCode::NOT_FOUND, "no trials yet".to_string())),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
// --- Harness management ---
async fn list_evals(State(state): State<VectorState>) -> impl IntoResponse {
match harness::EvalSet::list(&state.store).await {
Ok(names) => Ok(Json(names)),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
async fn get_eval(
State(state): State<VectorState>,
Path(name): Path<String>,
) -> impl IntoResponse {
match harness::EvalSet::load(&state.store, &name).await {
Ok(e) => Ok(Json(e)),
Err(err) => Err((StatusCode::NOT_FOUND, err)),
}
}
async fn put_eval(
State(state): State<VectorState>,
Path(name): Path<String>,
Json(mut harness_set): Json<harness::EvalSet>,
) -> impl IntoResponse {
harness_set.name = name;
harness_set.ground_truth_built = harness_set
.queries
.iter()
.all(|q| q.ground_truth.is_some());
match harness_set.save(&state.store).await {
Ok(()) => Ok(Json(harness_set)),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
#[derive(Deserialize)]
struct AutogenRequest {
index_name: String,
#[serde(default = "default_sample_count")]
sample_count: usize,
#[serde(default = "default_k")]
k: usize,
}
fn default_sample_count() -> usize { 100 }
fn default_k() -> usize { 10 }
async fn autogen_eval(
State(state): State<VectorState>,
Path(name): Path<String>,
Json(req): Json<AutogenRequest>,
) -> Result<Json<harness::EvalSet>, (StatusCode, String)> {
let embeddings = state
.embedding_cache
.get_or_load(&req.index_name)
.await
.map_err(|e| (StatusCode::NOT_FOUND, format!("load embeddings: {e}")))?;
let mut harness_set = harness::synthetic_from_chunks(
&name,
&req.index_name,
&embeddings,
req.sample_count,
req.k,
);
harness::compute_ground_truth(&mut harness_set, &embeddings, &state.ai_client)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("ground truth: {e}")))?;
harness_set
.save(&state.store)
.await
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("save: {e}")))?;
Ok(Json(harness_set))
}
// --- Embedding cache management ---
async fn cache_stats(State(state): State<VectorState>) -> impl IntoResponse {
Json(state.embedding_cache.stats().await)
}
async fn cache_evict(
State(state): State<VectorState>,
Path(index_name): Path<String>,
) -> impl IntoResponse {
let ok = state.embedding_cache.evict(&index_name).await;
Json(serde_json::json!({ "evicted": ok, "index_name": index_name }))
}

213
crates/vectord/src/trial.rs Normal file
View File

@ -0,0 +1,213 @@
/// Trial journal for HNSW parameter tuning.
///
/// Every HNSW build+eval is recorded as a Trial. The journal is append-only
/// and stored under `_hnsw_trials/{index_name}/` as batched JSONL files —
/// an AI agent iterating on configs reads prior trials to decide what to
/// try next, and writes a new trial on each attempt.
///
/// Storage uses the shared `storaged::append_log::AppendLog` so appends are
/// write-once (new file per batch) rather than rewriting a single growing
/// JSONL on every event. See `append_log.rs` for the full rationale.
use chrono::{DateTime, Utc};
use object_store::ObjectStore;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;
use storaged::append_log::AppendLog;
/// HNSW build/search parameters the agent can tune.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HnswConfig {
pub ef_construction: usize,
pub ef_search: usize,
#[serde(default)]
pub seed: Option<u64>,
}
impl Default for HnswConfig {
/// Production default, locked in 2026-04-16 based on trial grid against
/// resumes_100k_v2 (100K vectors, 20 queries, recall@10):
/// ec=20 es=30 → recall 0.960, p50 509us, build 8s
/// ec=80 es=30 → recall 1.000, p50 873us, build 230s ← sweet spot
/// ec=200 es=30 → recall 1.000, p50 874us, build 106s (no recall gain)
///
/// `ec=80` is the smallest value that reaches 100% recall. Higher values
/// waste build time. `es=30` gives faster search than `es=100` with no
/// recall penalty at this scale.
fn default() -> Self {
Self {
ef_construction: 80,
ef_search: 30,
seed: None,
}
}
}
/// Metrics collected on every trial. All latencies in microseconds.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TrialMetrics {
pub build_time_secs: f32,
pub search_latency_p50_us: f32,
pub search_latency_p95_us: f32,
pub search_latency_p99_us: f32,
pub recall_at_k: f32,
pub memory_bytes: u64,
pub vectors: usize,
pub eval_queries: usize,
/// Brute-force latency for comparison — how much speedup did HNSW buy us?
pub brute_force_latency_us: f32,
}
/// A single tuning attempt.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Trial {
pub id: String,
pub index_name: String,
pub eval_set: String,
pub config: HnswConfig,
pub metrics: TrialMetrics,
pub created_at: DateTime<Utc>,
/// Free-form note — the agent can record why it tried this config.
#[serde(default)]
pub note: Option<String>,
}
impl Trial {
pub fn new_id() -> String {
format!(
"trial-{}-{}",
Utc::now().timestamp_millis(),
&uuid::Uuid::new_v4().to_string()[..8]
)
}
}
/// Per-index append log, lazy-created on first write.
#[derive(Clone)]
pub struct TrialJournal {
store: Arc<dyn ObjectStore>,
/// Cache per-index AppendLog instances so the in-memory buffer persists
/// across calls.
logs: Arc<RwLock<HashMap<String, Arc<AppendLog>>>>,
}
impl TrialJournal {
pub fn new(store: Arc<dyn ObjectStore>) -> Self {
Self {
store,
logs: Arc::new(RwLock::new(HashMap::new())),
}
}
fn prefix(index_name: &str) -> String {
format!("_hnsw_trials/{}", index_name)
}
async fn log_for(&self, index_name: &str) -> Arc<AppendLog> {
if let Some(log) = self.logs.read().await.get(index_name) {
return log.clone();
}
let mut guard = self.logs.write().await;
if let Some(log) = guard.get(index_name) {
return log.clone();
}
// Trials arrive one at a time during human/agent iteration — a low
// threshold gives "hit /trials and see my latest attempt" immediacy
// without creating one file per event.
let log = Arc::new(
AppendLog::new(self.store.clone(), Self::prefix(index_name))
.with_flush_threshold(4),
);
guard.insert(index_name.to_string(), log.clone());
log
}
/// Append a trial record. In-memory buffered; persisted in batches.
pub async fn append(&self, trial: &Trial) -> Result<(), String> {
let line = serde_json::to_vec(trial).map_err(|e| e.to_string())?;
let log = self.log_for(&trial.index_name).await;
log.append(line).await
}
/// Read all trials for an index (flushed batches + unflushed buffer).
pub async fn list(&self, index_name: &str) -> Result<Vec<Trial>, String> {
let log = self.log_for(index_name).await;
let lines = log.read_all().await?;
let mut trials = Vec::with_capacity(lines.len());
for line in lines {
match serde_json::from_slice::<Trial>(&line) {
Ok(t) => trials.push(t),
Err(e) => tracing::warn!("trial journal: skip malformed line: {e}"),
}
}
Ok(trials)
}
/// Explicit flush for callers that want write-through semantics
/// (e.g. an agent that wants to commit a trial before querying stats).
pub async fn flush(&self, index_name: &str) -> Result<(), String> {
let log = self.log_for(index_name).await;
log.flush().await
}
/// Compact all batch files for an index into one.
pub async fn compact(&self, index_name: &str) -> Result<storaged::append_log::CompactStats, String> {
let log = self.log_for(index_name).await;
log.compact().await
}
/// Current champion for an index by the named metric.
/// Valid metrics: `recall`, `latency`, `pareto`.
///
/// The `pareto` strategy is a placeholder — J should tune the scoring
/// function to match what matters in production. Right now it's a simple
/// weighted sum.
pub async fn best(
&self,
index_name: &str,
metric: &str,
) -> Result<Option<Trial>, String> {
let trials = self.list(index_name).await?;
if trials.is_empty() {
return Ok(None);
}
let best = match metric {
"recall" => trials
.into_iter()
.max_by(|a, b| {
a.metrics
.recall_at_k
.partial_cmp(&b.metrics.recall_at_k)
.unwrap_or(std::cmp::Ordering::Equal)
})
.unwrap(),
"latency" => trials
.into_iter()
.min_by(|a, b| {
a.metrics
.search_latency_p95_us
.partial_cmp(&b.metrics.search_latency_p95_us)
.unwrap_or(std::cmp::Ordering::Equal)
})
.unwrap(),
"pareto" | _ => trials
.into_iter()
.max_by(|a, b| pareto_score(a).partial_cmp(&pareto_score(b)).unwrap())
.unwrap(),
};
Ok(Some(best))
}
}
/// Simple Pareto-style score: reward recall, penalize p95 latency.
/// Tunable — J should swap this in production to match what matters.
fn pareto_score(t: &Trial) -> f32 {
// Recall is [0, 1]. Latency is us — assume 100us baseline.
let recall = t.metrics.recall_at_k;
let latency_penalty = (t.metrics.search_latency_p95_us / 1000.0).min(1.0); // cap at 1ms
recall - 0.2 * latency_penalty
}

322
docs/ADR-017-federation.md Normal file
View File

@ -0,0 +1,322 @@
# ADR-017: Federated Multi-Bucket Storage
**Status:** Accepted — 2026-04-16
**Owner:** J
**Implements:** Phase 15 horizon item "Federated multi-bucket query"
**Depends on:** ADR-001 (object storage as source of truth), ADR-009 (delta files)
---
## Problem
Today every dataset lives in exactly one object storage backend (`/home/profit/lakehouse/data` on local disk). Three scenarios break that assumption:
1. **Multi-tenant hosting** — Client A's staffing data must stay in Client A's S3 bucket. Client B's in Client B's. The lakehouse is the compute plane, their bucket is the data plane.
2. **Data residency** — European clients require their data never leaves an EU-region bucket. Same lakehouse instance, different physical storage.
3. **Cross-dataset analytics** — A user asks "average placement margin across all clients we manage." The query must scan Parquet across N buckets as a single logical table.
None of this is possible with a single `object_store` instance. `DataFusion` *can* register multiple object stores under different URL schemes, but nothing in the current stack surfaces that capability.
## Non-Goals (for the MVP)
- **Not** building cross-bucket JOIN optimization. DataFusion pushes predicates down per-partition already; that's sufficient.
- **Not** handling bucket-level auth policies (who-can-read-which-bucket). That's Phase 13's job.
- **Not** supporting heterogeneous backends in a single logical dataset (e.g. half in S3, half local). One dataset = one bucket.
- **Not** automatic replication across buckets. Each bucket stands alone.
- **Not** scheduled sync/migration. Manual today.
---
## Design
### Core invariant
Every `ObjectRef` belongs to exactly one named bucket. The catalog is the single authority over which bucket holds which dataset. Queries transparently span buckets by virtue of DataFusion's multi-store capability — the catalog tells the query engine where each file lives.
### Naming
Three classes of bucket:
- **`primary`** — system-wide default. Shared catalog, reference datasets, and vectors that aren't tenant-specific. Always present, always the fallback. Corresponds to today's `./data` directory.
- **`profile:{user_id}`** — per-user workspace bucket. Each user's personal data, their vector indexes, their workspace state. Provisioned on first use. Pre-loaded ("hot loaded") when the user activates their profile.
- **`{named}`** — tenant/client buckets. Explicit, configured up front. E.g. `client_a`, `client_eu`.
Conventions:
- **Bucket URL** = `{bucket_id}://{key}` internally (e.g. `primary://datasets/candidates.parquet`, `profile:daisy://vectors/notes.parquet`)
- DataFusion sees each bucket as a separate registered object store; `ListingTable` paths include the bucket scheme
- Profile buckets use the `profile:` prefix as both a namespace marker and a DataFusion URL scheme
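A minimal sketch of reading the convention back: splitting an internal bucket URL into its bucket id and key. The helper name and error handling are illustrative, not part of the shipped registry.

```rust
/// Split "{bucket_id}://{key}" into (bucket_id, key).
/// Splitting on the full "://" keeps "profile:daisy" intact as a bucket id.
fn split_bucket_url(url: &str) -> Option<(&str, &str)> {
    let (bucket, key) = url.split_once("://")?;
    if bucket.is_empty() || key.is_empty() {
        return None;
    }
    Some((bucket, key))
}

// split_bucket_url("profile:daisy://vectors/notes.parquet")
//   == Some(("profile:daisy", "vectors/notes.parquet"))
```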
### Config shape — `lakehouse.toml`
```toml
[storage]
# Backward compat — the existing [storage] block still creates a "primary"
# bucket if no [[storage.buckets]] entries are defined.
root = "./data"
# Where profile buckets are rooted when not otherwise configured.
# A profile:{user} bucket resolves to {profile_root}/{user}/ unless the user
# config overrides.
profile_root = "./data/_profiles"
[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"
rescue_bucket = "rescue" # single shared fallback for failed reads (see §Failure mode)
[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"
[[storage.buckets]]
name = "client_a"
backend = "s3"
bucket = "client-a-lakehouse"
region = "us-east-1"
endpoint = "https://s3.amazonaws.com"
secret_ref = "client_a_aws" # NOT the literal secret — a handle
[[storage.buckets]]
name = "client_eu"
backend = "s3"
bucket = "client-eu-lakehouse"
region = "eu-west-1"
secret_ref = "client_eu_aws"
```
Rules:
- Credentials **never appear in this file**. `secret_ref` is a handle resolved through the secrets layer (see §Secrets).
- If `[[storage.buckets]]` is absent, the existing `[storage]` block creates a single `"primary"` bucket. Zero-config upgrade.
- Adding a bucket requires a restart (MVP); runtime addition via API is a polish item.
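For orientation, the TOML above maps onto a per-bucket config struct roughly like the following. This is a sketch inferred from the fields `build_store` consumes; the shipped `BucketConfig` may differ in detail.

```rust
use serde::Deserialize;

/// One [[storage.buckets]] entry. Optional fields apply only to some backends.
#[derive(Debug, Clone, Deserialize)]
pub struct BucketConfig {
    pub name: String,
    pub backend: String,               // "local" | "s3"
    pub root: Option<String>,          // local: directory root
    pub bucket: Option<String>,        // s3: physical bucket name
    pub region: Option<String>,        // s3: defaults to "us-east-1" when absent
    pub endpoint: Option<String>,      // s3: custom endpoint (e.g. MinIO)
    pub secret_ref: Option<String>,    // opaque handle resolved via SecretsProvider
    pub rescue_bucket: Option<String>, // shared read-fallback bucket
}
```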
### Secrets
Credentials are never in `lakehouse.toml`. A pluggable `SecretsProvider` trait resolves `secret_ref` handles to credential structs at startup:
```rust
pub trait SecretsProvider: Send + Sync {
async fn resolve(&self, handle: &str) -> Result<BucketCredentials, String>;
}
```
MVP ships **one** implementation — `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` (or a path given by `LAKEHOUSE_SECRETS` env var):
```toml
# /etc/lakehouse/secrets.toml — root:root, mode 0600, NEVER in git
[client_a_aws]
access_key = "AKIA..."
secret_key = "wJalrXUt..."
[client_eu_aws]
access_key = "AKIA..."
secret_key = "..."
```
Future providers plug into the same trait without touching core code:
- `VaultSecretsProvider` — HashiCorp Vault
- `SopsSecretsProvider` — age/gpg-encrypted files in git
- `KeyringSecretsProvider` — OS-level keyring
Startup fails fast if a `secret_ref` can't resolve — no silent fallback to anonymous S3.
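A minimal sketch of the resolution step, assuming each `secret_ref` handle is a top-level table in that file. It omits the async trait plumbing, the 0600 permission check, and the `LAKEHOUSE_SECRETS` override; the function name is illustrative.

```rust
use std::collections::HashMap;
use serde::Deserialize;

#[derive(Debug, Clone, Deserialize)]
pub struct BucketCredentials {
    pub access_key: String,
    pub secret_key: String,
}

/// Resolve a secret_ref handle against /etc/lakehouse/secrets.toml.
fn resolve_from_file(path: &str, handle: &str) -> Result<BucketCredentials, String> {
    let raw = std::fs::read_to_string(path)
        .map_err(|e| format!("read secrets file '{path}': {e}"))?;
    let table: HashMap<String, BucketCredentials> =
        toml::from_str(&raw).map_err(|e| format!("parse secrets file: {e}"))?;
    table
        .get(handle)
        .cloned()
        .ok_or_else(|| format!("unresolved secret_ref '{handle}'"))
}
```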
### Schema changes
**`ObjectRef`** already has a `bucket: String` field — currently it carries the S3 bucket name or `"local"` inconsistently. Repurpose it as the **catalog bucket name**:
```rust
pub struct ObjectRef {
pub bucket: String, // NOW: catalog bucket name, e.g. "primary" or "client_a"
pub key: String, // object key within that bucket
pub size_bytes: u64,
pub created_at: DateTime<Utc>,
}
```
Migration: a `resync-missing`-style one-shot sets `bucket = "primary"` on every existing ObjectRef whose value is empty or ambiguous.
**`DatasetManifest`** — no new top-level field. `objects: Vec<ObjectRef>` is already where bucket info lives. Per the design invariant (one dataset, one bucket), all ObjectRefs inside a manifest share a bucket, so we can add a convenience accessor `manifest.bucket_id()` → `objects[0].bucket.clone()`.
### Storage layer changes
**`storaged::BucketRegistry`** — new struct, replaces the single `Arc<dyn ObjectStore>` currently threaded through gateway.
```rust
pub struct BucketRegistry {
buckets: HashMap<String, Arc<dyn ObjectStore>>,
default: String, // "primary"
}
impl BucketRegistry {
pub fn from_config(cfg: &StorageConfig) -> Result<Self, String>;
pub fn get(&self, bucket_id: &str) -> Result<&Arc<dyn ObjectStore>, String>;
pub fn default_store(&self) -> &Arc<dyn ObjectStore>;
pub fn list_buckets(&self) -> Vec<BucketInfo>;
}
```
**`storaged::ops`** gets a bucket parameter on every call:
```rust
pub async fn put(registry: &BucketRegistry, bucket: &str, key: &str, data: Bytes) -> Result<(), String>;
pub async fn get(registry: &BucketRegistry, bucket: &str, key: &str) -> Result<Bytes, String>;
// etc.
```
Every existing call site passes `"primary"` initially; specific call sites (ingestd, vectord) learn to route.
### Query layer changes
**`queryd`** — on startup, register every bucket as a separate `object_store` URL with DataFusion:
```rust
for (name, store) in bucket_registry.iter() {
let url = format!("lakehouse-{}://", name).parse::<Url>()?;
session_ctx.register_object_store(&url, store.clone());
}
```
**`ListingTable`** URLs reference the bucket:
```rust
let url = format!("lakehouse-{}://{}", obj.bucket, obj.key);
```
DataFusion handles cross-bucket scans natively — each partition gets routed to the correct store.
### Gateway API additions
| Endpoint | Purpose |
|---|---|
| `GET /storage/buckets` | List all configured buckets + their backend + reachability status |
| `GET /storage/errors` | Recent bucket operation failures (filtered by bucket/time) |
| `GET /storage/health` | Summary: which buckets have errored in the last 5 minutes |
| `POST /ingest/file` + `X-Lakehouse-Bucket: client_a` | Header selects target bucket; absent → `primary` |
| `POST /profile/{user}/activate` | Pre-load user's profile bucket: embedding caches warm, HNSW indexes rebuilt, workspace state hydrated |
| `POST /profile/{user}/deactivate` | Evict profile bucket's cached state |
| `POST /catalog/datasets/by-name/{name}/relocate` | Move a dataset to another bucket (polish) |
**Routing rule for all existing endpoints:**
1. `X-Lakehouse-Bucket: {name}` header → use that bucket
2. No header → `primary`
3. Unknown bucket name → 404
Catalog and query endpoints span all buckets by default — a query sees all buckets the user has access to unless explicitly scoped.
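The routing rule reduces to a small resolver. A sketch, assuming handlers hold a `BucketRegistry` with the `get` signature sketched above; the helper itself is illustrative, not the shipped middleware.

```rust
use axum::http::{HeaderMap, StatusCode};

/// X-Lakehouse-Bucket routing: the header names the bucket, absence means
/// "primary", and an unknown name is a 404.
fn resolve_bucket(
    headers: &HeaderMap,
    registry: &BucketRegistry,
) -> Result<String, (StatusCode, String)> {
    let name = headers
        .get("X-Lakehouse-Bucket")
        .and_then(|v| v.to_str().ok())
        .unwrap_or("primary");
    match registry.get(name) {
        Ok(_) => Ok(name.to_string()),
        Err(e) => Err((StatusCode::NOT_FOUND, e)),
    }
}
```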
### Profile hot load
When `POST /profile/{user}/activate` fires:
1. Resolve `profile:{user}` bucket — create if missing (first-time activation)
2. Scan the catalog for all datasets whose `owner == user` or which live in that bucket
3. For each vector index belonging to the profile: pre-load embeddings into `EmbeddingCache`, pre-build HNSW with the locked-in default config (ec=80 es=30)
4. Hydrate any saved workspace state (see Phase 8.5) into memory
5. Return a manifest summary — what's hot, what wasn't found, total memory used
Design contract: `/profile/{user}/activate` should be idempotent. Calling it twice is cheap; the second call is a no-op because the cache is already warm.
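Step 3 in that flow amounts to a warm-up loop over the profile's vector indexes. A sketch, assuming the handler can reach vectord's `EmbeddingCache` and `HnswStore`; the shipped handler would also skip indexes that are already hot and collect the manifest summary.

```rust
/// Warm every vector index in a profile bucket: pin embeddings in the cache
/// and build HNSW with the locked-in default config (ec=80 es=30).
async fn warm_profile_indexes(
    index_names: &[String],
    cache: &embedding_cache::EmbeddingCache,
    hnsw: &hnsw::HnswStore,
) -> Result<(), String> {
    for name in index_names {
        // Cache hit on re-activation makes this load effectively free.
        let embeddings = cache.get_or_load(name).await?;
        hnsw.build_index_with_config(name, (*embeddings).clone(), &trial::HnswConfig::default())
            .await?;
    }
    Ok(())
}
```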
### Failure mode — rescue bucket + loud errors
The design goal is **failures stay findable**. When something breaks, you should know within seconds which bucket, which key, which operation, and what happened — without grepping logs.
**Writes**: always go to the target bucket. If the target bucket is unreachable, the request fails hard with 503 and the failure is recorded in the error journal. No queueing, no silent fallback — writes that vanish silently are the worst possible failure mode.
**Reads**:
- First attempt: target bucket.
- On failure: fall through to the single configured `rescue_bucket`.
- If the key exists in rescue, serve it with `X-Lakehouse-Rescue-Used: true` and `X-Lakehouse-Original-Bucket: {target}` response headers so the caller knows fallback happened.
- If the key isn't in rescue either: 404 with both buckets listed in the error body.
- Every rescue fallback is appended to the **error journal** (see below).
**Rescue bucket** is a single, well-known bucket configured at the storage level. Typical setup: a periodic sync from `primary` → `rescue` so that when `primary` has an outage, recent data is still reachable. Sync is outside this ADR's scope — operator's job.
**Rest of the system keeps working**: a failed bucket never takes down the gateway. Other buckets remain queryable. Cross-bucket queries scan what they can and surface per-partition failures (which bucket, which key) in the error body.
### Error journal — "find errors ez"
A simple append-only log of every bucket operation that failed or fell through. Stored at `primary://_errors/bucket_errors.jsonl` (yes, primary — because we need it findable even if other buckets are down).
```json
{"ts":"2026-04-16T10:30:15Z","op":"read","target":"client_a","key":"datasets/candidates.parquet","error":"connection refused","rescued":true,"rescue_key_found":true}
{"ts":"2026-04-16T10:30:18Z","op":"write","target":"client_a","key":"datasets/new.parquet","error":"connection refused","rescued":false}
```
Exposed via:
- `GET /storage/errors?limit=50` — recent errors
- `GET /storage/errors?bucket=client_a` — filter by bucket
- `GET /storage/errors?since=2026-04-16T10:00` — filter by time
- `GET /storage/health` — summary: which buckets have errored in the last 5 minutes
The journal doesn't replace logs (tracing still gets everything), but it gives you a **single authoritative place** to answer the question "has anything been failing?" without tailing systemd journals. If the journal is empty, nothing is broken.
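The JSONL lines above imply an event record roughly like the following. The field set matches the examples; the shipped `BucketErrorEvent` may carry more.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// One failed (or rescued) bucket operation, as journaled to the error log.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BucketErrorEvent {
    pub ts: DateTime<Utc>,
    pub op: String,     // "read" | "write"
    pub target: String, // bucket the caller asked for
    pub key: String,
    pub error: String,
    pub rescued: bool,
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub rescue_key_found: Option<bool>, // present only when a rescue read ran
}
```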
### Vectors are bucket-scoped
Vector indexes live in the same bucket as their source data:
- Data in `primary` → vectors in `primary://vectors/{index_name}.parquet`
- Data in `profile:daisy` → vectors in `profile:daisy://vectors/{index_name}.parquet`
- Data in `client_a` → vectors in `client_a://vectors/{index_name}.parquet`
The existing `_hnsw_trials/` and `_hnsw_evals/` prefixes also live per-bucket. The trial journal for a profile's index stays inside that profile's bucket — so tenants don't see each other's tuning history.
**Profile hot load consequence:** activating a profile pre-loads ALL vector indexes in that profile's bucket into `EmbeddingCache` + builds HNSW with the default config. First query after activation is warm.
---
## Implementation phases
### MVP (this is what I'd build next session)
1. **`shared::config`** — extend `StorageConfig` with `buckets: Vec<BucketConfig>` and `profile_root`, with backward-compat fallback from single `[storage]` block
2. **`shared::secrets`** — `SecretsProvider` trait + `FileSecretsProvider` impl reading `/etc/lakehouse/secrets.toml`
3. **`storaged::BucketRegistry`** — multi-backend registry; mirrors-of tracking; lazy profile bucket creation
4. **`storaged::ops`** — add `bucket: &str` param to every call; read path falls back to mirror on unreachable; write path hard-fails
5. **`gateway`** — `X-Lakehouse-Bucket` header middleware → routing decision per request
6. **`gateway`** — `GET /storage/buckets` + `POST /profile/{user}/activate|deactivate`
7. **`catalogd`** — one-shot migration: every ObjectRef without an explicit bucket gets `"primary"`
8. **`queryd::session`** — register every bucket as a DataFusion ObjectStore under its own URL scheme
9. **`vectord`** — every vector/trial/eval storage path becomes `{bucket}://vectors/...`; profile activate pre-loads HNSW
10. **`storaged::error_journal`** — append-only JSONL writer at `primary://_errors/bucket_errors.jsonl`, hooked into every ops call-site
11. **Test harness** — configure `primary` + `profile:testuser` + a mock `client_a` pointing at a 2nd local dir + `rescue` as a 3rd local dir; verify cross-bucket join, profile activate, rescue fallback, error journal visibility
**Success gate #1 — cross-bucket query:** `SELECT SUM(hours_regular) FROM primary.timesheets t1 JOIN "profile:testuser".timesheets t2 USING (placement_id)` returns a sensible number.
**Success gate #2 — profile hot load:** `POST /profile/daisy/activate` pre-builds HNSW over her personal vector index; the next `/search` call against that index returns in <1ms with no further warm-up.
**Success gate #3 — rescue fallback + error visibility:** rename `client_a`'s directory to simulate outage. GET for a key returns the copy from `rescue` with `X-Lakehouse-Rescue-Used: true` + `X-Lakehouse-Original-Bucket: client_a` headers. PUT returns 503. `GET /storage/errors` shows both events with full context. `GET /storage/health` flags `client_a` as unhealthy.
### Polish (follow-up)
12. **Bucket health check** — `GET /storage/buckets` shows reachable/unreachable
13. **Bucket relocation** — `POST /catalog/datasets/{name}/relocate` (copy to new bucket, update manifest, delete from old)
14. **Runtime bucket addition** — `POST /storage/buckets` adds a bucket without restart (stores in a separate `buckets.toml`)
15. **S3 streaming optimization** — for very large Parquet files, stream the footer instead of loading the whole blob
---
## Decisions (2026-04-16)
1. **Credentials** — Pluggable `SecretsProvider` trait. MVP ships `FileSecretsProvider` at `/etc/lakehouse/secrets.toml` (root:root, 0600). Future providers (Vault, SOPS, OS keyring) plug in without touching core.
2. **Primary default + profile bucket model**`primary` always exists as system-wide fallback. Per-user `profile:{user}` buckets are provisioned on first activate. Named tenant buckets configured explicitly.
3. **Ingest routing**`X-Lakehouse-Bucket` header. Absent → `primary`. Unknown → 404.
4. **Failure** — Writes: hard fail on target, logged to error journal. Reads: fall through to a single shared `rescue_bucket`; response headers `X-Lakehouse-Rescue-Used` + `X-Lakehouse-Original-Bucket` make the fallback visible. Every failure appended to an error journal at `primary://_errors/bucket_errors.jsonl`, queryable via `GET /storage/errors`. Failed buckets never take down unrelated paths.
5. **Vectors in their bucket** — Vectors, trial journals, and eval sets all live in the same bucket as their source data. Profile activate pre-loads the full vector stack for that profile using the locked-in HNSW default (ec=80 es=30).
---
## Risks
| Risk | Severity | Mitigation |
|---|---|---|
| DataFusion multi-store predicate pushdown underperforms on cross-bucket JOINs | **Medium** | Benchmark early. If bad, mark cross-bucket JOINs as a known sharp edge. |
| S3 credentials accidentally logged | **High** | Never stringify credential fields. Add a test that redacts `secret_key` in every code path. |
| Existing ObjectRefs have inconsistent `bucket` values (some `"local"`, some empty) | **Low** | One-shot migration normalizes to `"primary"`. |
| Bucket config drift between restarts | **Medium** | Config-driven means restart = canonical. A runtime API would need careful state reconciliation; defer. |
---
## Decision log
- **1 dataset = 1 bucket** (rather than datasets spanning buckets) — vastly simpler, matches the "tenant isolation" real use case. Revisit only if someone has an actual multi-bucket dataset requirement.
- **Config-first, runtime-later** — YAML/TOML for the MVP; runtime bucket add/remove is polish. Deployment restart is cheap in this stack.
- **`bucket: String` repurposed, no new field** — avoids a schema migration on every manifest.
- **DataFusion's multi-store primitive handles cross-bucket queries** — don't reinvent. We're wiring it up, not building it.

View File

@ -79,3 +79,13 @@
**Date:** 2026-03-27
**Decision:** Each contract/search gets a named workspace with saved queries, shortlists, activity logs, and delta layers. Workspaces have daily/weekly/monthly tiers and support instant zero-copy handoff between agents.
**Rationale:** Staffing workflows are inherently agent-centric — a recruiter works a contract, builds context, then may need to hand it off. The workspace captures that context in a structured, queryable, transferable format. Without it, handoff means "read the email thread and figure it out."
## ADR-017: Federated multi-bucket storage
**Date:** 2026-04-16
**Decision:** Every `ObjectRef` belongs to exactly one named bucket. Three bucket classes: `primary` (system default, always present), `profile:{user}` (per-user/per-model workspace bucket), and named tenant buckets (`client_a`, `client_eu`). A single shared `rescue_bucket` handles read fallback on target failure. Writes hard-fail on unreachable target; no silent fallback. Every bucket op failure lands in an error journal at `primary://_errors/bucket_errors/`, queryable via `/storage/errors`. Credentials never live in `lakehouse.toml` — a pluggable `SecretsProvider` trait resolves opaque `secret_ref` handles.
**Rationale:** The single-backend assumption breaks when we want: tenant data isolation, data residency (EU bucket), per-profile workspaces for local models. DataFusion already supports multiple registered object stores; catalogd is the cross-bucket metadata authority. Rescue bucket + visible error journal makes operational failures diagnosable in one HTTP call. See `docs/ADR-017-federation.md` for full design + success gates.
## ADR-018: Write-once batched append pattern for journals
**Date:** 2026-04-16
**Decision:** All append-only journals (error journal, HNSW trial journal, future audit logs) use the `storaged::append_log::AppendLog` helper. Events accumulate in an in-memory buffer; on threshold or explicit `flush()`, the buffer is written as one new timestamped file (`batch_{epoch_us}.jsonl`). Existing files are never rewritten. `compact()` merges all batches into one with a fresh timestamp, preserving chronological sort order.
**Rationale:** Object stores have no append primitive. Naive "read-modify-write the whole JSONL file on every event" is O(N²) cumulative work and creates the classic small-file / rewrite-amplification anti-pattern that llms3.com flags as the top lakehouse pitfall. Write-once batching is the LSM-tree idea applied to small JSONL events — bounded write amplification, append-only semantics, optional compaction for read efficiency. The in-memory ring buffer preserves O(1) recent-event reads for the `/storage/errors` and `/hnsw/trials` query endpoints.
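A usage sketch of the pattern as the trial journal drives it. The constructor, `with_flush_threshold`, `append`, `flush`, `read_all`, and `compact` calls mirror the vectord call sites; the prefix and payload here are illustrative.

```rust
use std::sync::Arc;
use object_store::ObjectStore;
use storaged::append_log::AppendLog;

async fn journal_example(store: Arc<dyn ObjectStore>) -> Result<(), String> {
    // Events buffer in memory and land as a new batch_{epoch_us}.jsonl file
    // once 4 of them accumulate (or on an explicit flush).
    let log = AppendLog::new(store, "_errors/bucket_errors".to_string())
        .with_flush_threshold(4);

    log.append(br#"{"op":"read","target":"client_a"}"#.to_vec()).await?;
    log.flush().await?; // write-through when the caller needs durability now

    let lines = log.read_all().await?; // flushed batches + unflushed buffer
    assert!(!lines.is_empty());

    let _stats = log.compact().await?; // merge all batch files into one
    Ok(())
}
```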

233
docs/EXECUTION_PLAN.md Normal file
View File

@ -0,0 +1,233 @@
# Execution Plan — Phases B through E
**Created:** 2026-04-16
**Status:** Active planning document — update as phases complete or scope shifts
**Owner:** J
This plan sequences the work J and Claw agreed on during the 2026-04-16 reframe session, after stress-testing the "dual-use substrate" vision and aligning it with llms3.com's architectural patterns.
---
## The four phases, at a glance
| Phase | Work | Prereq | Estimated cost | Risk |
|---|---|---|---|---|
| B | Lance pilot on one vector index | None | 1 focused session | Medium — new dep, unfamiliar surface |
| C | Decoupled embedding refresh pipeline | Benefits from B's outcome | 1 focused session | Low — additive, doesn't break existing |
| D | AI-safe views | Phase 13 (done) | 1 focused session | Low — builds on existing catalog + tool registry |
| E | Soft deletes / tombstones | None | 1-2 focused sessions | Medium — touches query path, compaction |
Each "focused session" ≈ 3-4 hours of coding + verification + doc update.
---
## Phase B — Lance pilot (the storage format question)
### Why now
J's LLMS3 knowledge base explicitly positions Lance as `alternative_to` Parquet for vector workloads. We admitted in the 2026-04-16 stress test that our Parquet-portability argument for vectors is weaker than advertised — our vector Parquet blobs aren't readable by DuckDB/Polars anyway. Lance could unlock: random-row access, disk-resident indexes, time-travel, better compression. Or it could disappoint and we lock in Parquet with a written reason why.
**Commit to one answer backed by measurements.** No ambiguity after this phase.
### Scope
1. Add `lance` crate as a dep in `vectord`, behind a `feature = "lance"` flag initially so we can build without forcing users to install it
2. New module `vectord::lance_store` — mirror of `vectord::store` but against Lance format
3. New endpoint `POST /vectors/lance/index` — build a Lance index for a named source (parallel to the existing Parquet path)
4. Benchmark script that runs against `resumes_100k_v2` (the existing 100K reference)
### Measured dimensions
All benchmarks against `resumes_100k_v2` (100K × 768d), cold start:
| Metric | Parquet baseline | Lance | Threshold to migrate |
|---|---|---|---|
| Cold load from disk | ~2.8s (measured) | ? | ≥2× faster |
| Search latency p50 | 873us (ec=80 es=30) | ? | Within 50% |
| Disk size | 330MB | ? | Comparable or better |
| Single-row random access | N/A (full scan) | ? | <10ms |
| Append 10K new rows | Full rewrite (~3s) | ? | Incremental <500ms |
### Decision rules
- **Lance wins cold-load by ≥2× AND matches search latency:** migrate vector storage to Lance. Dataset tables stay Parquet. Update ADR-008 → ADR-019.
- **Lance within 50% across the board:** stay Parquet. Document the ceiling honestly in the PRD (already done). Revisit Lance only when we hit a problem Parquet can't solve.
- **Lance loses:** close the door, don't revisit without new evidence. Write ADR-019 as "Lance evaluated, rejected, here's why with numbers."
### Success gate
Benchmark output table posted to `docs/ADR-019-vector-storage.md` with measured numbers in each cell of the table above. Decision rule applied mechanically. No "let's defer the call."
### Rollback
The `feature = "lance"` flag means if the pilot goes badly, `cargo build` without the flag is unchanged. No production migration happens until ADR-019 commits to the change. Safe experiment.
---
## Phase C — Decoupled embedding refresh
### Why now
llms3.com's lakehouse architecture explicitly separates "transactional data mutations" from "asynchronous vector refresh cycles." Today we couple them — an ingest writes rows AND embeddings in one flow. That means:
- Adding 1K rows to a 100K-row dataset forces re-embedding of ALL rows (or nothing)
- No notion of "embeddings are stale, schedule a refresh tonight"
- The embedding cost (Ollama-bound, the bottleneck) is synchronous with ingest
### Scope
1. Add fields to `DatasetManifest` (see the sketch after this list):
- `last_embedded_at: Option<DateTime>`
- `embedding_stale_since: Option<DateTime>` (set when data written but embeddings not refreshed)
- `embedding_refresh_policy: RefreshPolicy``Manual` | `OnAppend` | `Scheduled(cron)`
2. Decouple ingest from embed: ingest writes data + marks embeddings stale; embedding runs separately
3. New endpoint: `POST /vectors/refresh/{dataset}` — diffs existing vectors vs current rows, only embeds new/changed (keyed by `doc_id`)
4. Background scheduler (or cron trigger) — for `Scheduled` policy, re-runs refresh per schedule
5. `GET /vectors/stale` — lists datasets with stale embeddings and how stale
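A sketch of the additions from item 1, assuming the cron variant carries a plain string expression; exact names and representation get settled during the phase.

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// How embeddings for a dataset are refreshed relative to data writes.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum RefreshPolicy {
    Manual,            // default for every existing dataset (no behavior change)
    OnAppend,          // refresh right after each append
    Scheduled(String), // cron expression evaluated by the background scheduler
}

// Fields Phase C would add to DatasetManifest (grouped here only for the sketch):
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmbeddingFreshness {
    pub last_embedded_at: Option<DateTime<Utc>>,
    pub embedding_stale_since: Option<DateTime<Utc>>,
    pub embedding_refresh_policy: RefreshPolicy,
}
```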
### Measured success
- Ingest a 1K-row append to `kb_team_runs` (currently 586 rows, Postgres-sourced).
- Catalog shows `embedding_stale_since = now`.
- `POST /vectors/refresh/kb_team_runs` embeds only the 1K new rows, not all 1586.
- Result: new rows searchable, old embeddings unchanged, total Ollama time ~5s instead of ~30s.
### Dependencies on Phase B
If Lance wins Phase B, this is dramatically easier — Lance supports native append. If we stay Parquet, we need a "vectors delta" Parquet file that merges at read time (same pattern as Phase 8's data delta files). ~100 extra LOC if we stay Parquet.
### Rollback
The `refresh_policy` field defaults to `Manual` for all existing datasets, so no behavior change for anything already in the system. Opt-in per dataset.
---
## Phase D — AI-safe views
### Why now
llms3.com's framing: "AI-safe views enforcing row/column security + PII tokenization before model exposure." Phase 13 gave us role-based column masking at query time. That's per-query enforcement. "Views" means pre-materialized: create `candidates_safe` once, bind model X to that view, model X can never accidentally see raw `candidates`.
This is also the precondition for Phase 17 (model profiles) to be meaningfully safe. "Bind model to dataset" isn't enough — needs to be "bind model to a safe view of the dataset."
### Scope
1. New catalog entity: `AiView` with fields `name`, `base_dataset`, `columns[]` (whitelist), `row_filter` (optional SQL WHERE clause), `column_redactions[]` (PII tokenization rules); sketched after this list
2. Persistence: `_catalog/views/{name}.json` alongside manifests
3. Query-rewrite layer: when a query references `candidates_safe`, DataFusion sees an equivalent `SELECT (whitelisted cols) FROM candidates WHERE (row_filter)` — with redactions applied as expressions
4. Endpoint: `POST /catalog/views` to create, `GET /catalog/views` to list, `GET /catalog/views/{name}/preview` to see what it looks like
5. Tool registry integration: tools can bind to an AiView instead of a raw table; agent invocations go through the view
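A minimal sketch of the `AiView` entity and its JSON persistence shape, using the field names from item 1; the redaction rule type and serde layout are assumptions:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical PII tokenization rule from item 1; the real rule set
/// would likely reuse Phase 13's sensitivity classifications.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ColumnRedaction {
    pub column: String,
    pub strategy: String, // e.g. "hash", "mask_last_4", "tokenize"
}

/// Persisted as `_catalog/views/{name}.json` per item 2.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AiView {
    pub name: String,
    pub base_dataset: String,
    /// Column whitelist: anything not listed is invisible through the view.
    pub columns: Vec<String>,
    /// Optional SQL WHERE clause applied on every query through the view.
    pub row_filter: Option<String>,
    pub column_redactions: Vec<ColumnRedaction>,
}
```

With a definition like this, the rewrite layer in item 3 can expand any reference to `candidates_safe` into `SELECT <whitelisted columns> FROM candidates WHERE <row_filter>`, wrapping redaction expressions around the flagged columns.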
### Measured success
- Create view `candidates_safe` = `SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'`.
- Agent (tool registry) calls `search_candidates` bound to `candidates_safe`.
- Agent cannot see `email`, `phone`, `ssn`, or `status='blocked'` rows, even if it writes raw SQL.
- Audit log records agent accessed `candidates_safe`, not `candidates`.
### Dependencies
- Phase 13 already provides the sensitivity classification layer
- Phase 12 tool registry already exists
- This phase is the bridge between them for agent access
### Rollback
Views are additive. Dropping the feature = delete view definitions, tool registry falls back to direct table access. No data migration needed.
---
## Phase E — Soft deletes / tombstones
### Why now
GDPR/CCPA compliance for staffing data. Today, `ops::delete` physically deletes a parquet object — fine for whole datasets, useless for "delete one candidate's record." To delete one row we'd have to rewrite the whole `candidates.parquet` which at 100K rows is 10MB of churn per deletion.
llms3.com lists "deletion vectors" as a core lakehouse pattern (Iceberg/Delta/Hudi all implement it). This is the single biggest compliance gap in the current system.
### Scope
1. New sidecar per dataset: `{dataset}_tombstones.parquet` with columns `{row_key, deleted_at, actor, reason}`
2. Delete API: `POST /catalog/datasets/{name}/tombstone` with `{row_keys[], reason, actor}`
3. Query-time filter: `queryd` automatically LEFT JOINs tombstones and filters out deleted rows (sketched after this list)
4. Compaction integration: Phase 8 compaction reads base + delta + tombstones, writes a clean base without tombstoned rows, clears the tombstone sidecar
5. Event journal integration (Phase 9): every tombstone emits a journal event with full context
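A minimal sketch of the tombstone row (item 1) and the read-path rewrite (item 3); names beyond the columns listed above are assumptions, and the SQL is one way to express the anti-join:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// One row in `{dataset}_tombstones.parquet` (item 1).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Tombstone {
    pub row_key: String,
    pub deleted_at: DateTime<Utc>,
    pub actor: String,
    pub reason: String,
}

/// Illustrative read-path rewrite for item 3: queryd wraps the base table
/// so tombstoned rows never reach the caller. Equivalent SQL:
///
///   SELECT c.* FROM candidates c
///   LEFT JOIN candidates_tombstones t ON c.candidate_id = t.row_key
///   WHERE t.row_key IS NULL
fn rewritten_sql(dataset: &str, key_col: &str) -> String {
    format!(
        "SELECT c.* FROM {d} c \
         LEFT JOIN {d}_tombstones t ON c.{k} = t.row_key \
         WHERE t.row_key IS NULL",
        d = dataset,
        k = key_col
    )
}
```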
### Measured success
- `POST /catalog/datasets/candidates/tombstone` with `{row_keys: ["CAND-123"], reason: "GDPR request", actor: "legal@company"}`
- `SELECT COUNT(*) FROM candidates` drops by 1 immediately
- `SELECT * FROM candidates WHERE candidate_id = 'CAND-123'` returns empty
- `GET /journal/history/CAND-123` shows the tombstone event
- After scheduled compaction, the tombstone is materialized — `candidates.parquet` no longer contains CAND-123, tombstone sidecar is emptied for that row key
### Dependencies
- Phase 8 delta/merge-on-read pattern (done) — tombstones are a third layer at read time
- Phase 9 event journal (done) — tombstones emit journal events
### Rollback
If query rewrite becomes too complex, fallback: tombstones stored but applied only during compaction (not at query time). Queries return deleted rows until compaction runs. Less useful but safer.
---
## Cross-phase concerns
### Phases that need federation layer 2 (task #5)
Every phase above assumes the federation foundation (shipped 2026-04-16) but NOT federation layer 2 (cross-bucket SQL, profile activation, `X-Lakehouse-Bucket` header).
**Implication:** Phases B-E can proceed on the `primary` bucket without blocking on federation layer 2. Federation layer 2 becomes valuable when we want multi-profile model scoping (Phase 17). Sequence:
```
A (done) → B → C (+D in parallel) → federation layer 2 → Phase 16 → Phase 17 → E
```
### Phases that need federation layer 2 FIRST
None of B/C/D/E strictly need it. Phase 16 (hot-swap) benefits from it. Phase 17 (model profiles) depends on it heavily.
### What NOT to build in B-E
- Distributed query — wait for a real scale problem
- Replacement of DataFusion — working fine, stay put
- Iceberg/Delta Lake migration — explicitly out of scope per ADR-009
- Live streaming / CDC — explicit non-goal
### Definition of done for each phase
Each phase completes when:
1. Code shipped and building clean
2. Success gate measurably passed
3. Relevant ADR added to `docs/DECISIONS.md` (or updated)
4. `docs/PHASES.md` checkbox flipped with measurement data
5. PRD invariants checked — if a new invariant emerged, add it
6. At least one regression test, either in the crate or as an HTTP integration test
---
## Session plan (what to do in what order)
### Next session
1. **Phase B — Lance pilot.** Single session. Answers the biggest open architectural question.
2. Based on outcome, **write ADR-019** with the decision + data.
3. **Update PHASES.md** with Phase 18 status (Lance evaluated).
### Session after
4. **Phase C — Decoupled embedding refresh.** Implementation shaped by B's outcome (append is easy on Lance, requires delta logic on Parquet).
### Session after that
5. **Federation layer 2** OR **Phase D (AI-safe views)** — J decides based on priority. Federation layer 2 unlocks model profiles (Phase 17); AI-safe views deliver standalone value.
### Final session for this track
6. **Phase E — Soft deletes.** The compliance-driven phase. Fits cleanly after everything else because it touches the query path and wants to be built after query optimizations stabilize.
### Milestone checkpoint
After Phase E, stop and reassess. We'll have:
- Lance decision made and committed
- Decoupled embedding pipeline
- AI-safe view enforcement
- Soft delete semantics
That's a substantial capability increase. Plausible "pause, write a retrospective, decide on Phase 16/17" moment.

View File

@ -121,9 +121,35 @@
- [x] AI migration prompt builder for complex cases
- [x] 5 unit tests
## Phase 15+: Horizon ⬜
- [ ] HNSW vector index (100K search: 4.5s → <50ms)
- [ ] Federated multi-bucket query
## Phase 15+: Horizon
- [x] HNSW vector index with iteration-friendly trial system (2026-04-16)
- `HnswStore.build_index_with_config` — parameterized ef_construction, ef_search, seed
- `EmbeddingCache` — pins 100K vectors in memory, shared across trials
- `harness::EvalSet` — named query sets with brute-force ground truth
- `TrialJournal` — append-only JSONL at `_hnsw_trials/{index}.jsonl`
- Endpoints: `/vectors/hnsw/trial`, `/hnsw/trials/{idx}`, `/hnsw/trials/{idx}/best?metric={recall|latency|pareto}`, `/hnsw/evals`, `/hnsw/evals/{name}/autogen`, `/hnsw/cache/stats`
- Measured on 100K resumes: **brute-force 44-54ms → HNSW 509us-1830us**, recall 0.92-1.00 depending on `ef_construction`. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as `HnswConfig::default()`
- [x] Catalog manifest repair — `POST /catalog/resync-missing` restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query — **foundation complete 2026-04-16**, see ADR-017
- [x] `StorageConfig.buckets` + `rescue_bucket` + `profile_root` config shape
- [x] `SecretsProvider` trait + `FileSecretsProvider` (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
- [x] `storaged::BucketRegistry` — multi-backend, rescue-aware, reachability probes
- [x] `storaged::error_journal::ErrorJournal` — append-only JSONL at `primary://_errors/bucket_errors.jsonl`
- [x] Endpoints: `GET /storage/buckets`, `GET /storage/errors`, `GET /storage/bucket-health`
- [x] Bucket-aware I/O: `PUT/GET /storage/buckets/{bucket}/objects/{*key}` with rescue fallback + `X-Lakehouse-Rescue-Used` observability headers
- [x] Backward compat: empty `[[storage.buckets]]` synthesizes a `primary` from legacy `root`
- [x] Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
- [ ] `X-Lakehouse-Bucket` header middleware on ingest/query/catalog endpoints
- [ ] Catalog migration: set `bucket = "primary"` on every legacy ObjectRef
- [ ] `queryd` registers every bucket with DataFusion for cross-bucket SQL
- [ ] Profile hot-load endpoints: `POST /profile/{user}/activate|deactivate`
- [ ] `vectord` bucket-scoped paths (trial journals, eval sets per-bucket)
- [x] Database connector ingest (Postgres first) — 2026-04-16
- `pg_stream::stream_table_to_parquet` — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
- `parse_dsn` — postgresql:// and postgres:// URL scheme, user/password/host/port/db
- `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
- Existing `POST /ingest/postgres/import` (structured config) preserved alongside
- 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
- [ ] Database connector ingest (Postgres/MySQL)
- [ ] PDF OCR (Tesseract)
- [ ] Scheduled ingest (cron)
@ -133,3 +159,4 @@
---
**30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27**
**HNSW trial system: 2026-04-16**

View File

@ -1,22 +1,41 @@
# PRD: Lakehouse — Rust-First Object Storage System
# PRD: Lakehouse — Rust-First Substrate for Versioned Knowledge Stores
**Status:** Active — Phase 0-5 complete, entering production path
**Status:** Active — Phase 0-14 complete; federation foundation + HNSW trial system shipped 2026-04-16; entering Phase 16 (hot-swap + model profiles)
**Created:** 2026-03-27
**Last reframed:** 2026-04-16 — from "staffing analytics platform" to "dual-use knowledge substrate" (see §Problem below)
**Owner:** J
---
## Problem
### Use case 1 — Staffing analytics (reference implementation)
Legacy data systems silo information across CRMs, databases, spreadsheets, and file shares. Querying across them requires manual ETL, pre-defined schemas, and expensive database licenses. When AI enters the picture, these systems can't handle the dual requirement of fast analytical queries AND semantic retrieval over unstructured text.
A staffing company (our reference case) has candidate records in an ATS, client data in a CRM, timesheets in billing software, call logs from a phone system, and email records from Exchange. Answering "find every Java developer in Chicago who was called 5+ times but never placed" requires querying across all of them — and no single system can do it.
We need a system where:
- Any data source (CSV, DB export, PDF, JSON) can be ingested without pre-defined schemas
### Use case 2 — Local AI knowledge substrate (the second half)
Local LLM workloads need a substrate for ingesting, indexing, and retrieving large knowledge corpora. Each running model (or agent) has its own context — documents it cares about, a vector index tuned to its domain, a scoped view of the catalog. That infrastructure is architecturally identical to the staffing problem: ingest messy data, index it, query it, hand it to an AI. Building one substrate that serves both prevents fragmentation.
Concretely this means a running Ollama model like `qwen2.5:7b` or `claude-code-local` should be able to:
- Bind to a named set of datasets
- Get a scoped vector index pre-warmed for its domain
- Issue searches that only see its bound data
- Have its trial/tuning history isolated from other models
- Swap between knowledge generations (today's, yesterday's) without rebuild
The same infrastructure that lets a recruiter query 2.47M rows of staffing data also lets a local 7B model answer questions grounded in a 500K-chunk documentation corpus. Same substrate, different tenant.
### Shared requirements
- Any data source (CSV, DB export, PDF, JSON, Postgres table) can be ingested without pre-defined schemas
- Structured data is queryable via SQL at scale (millions of rows, sub-second)
- Unstructured data is searchable via AI embeddings (semantic retrieval)
- An LLM can answer natural language questions against all of it
- Unstructured data is searchable via AI embeddings with per-profile indexes
- An LLM can answer natural language questions against scoped data
- Indexes can be hot-swapped between generations without rebuild downtime
- Trials are first-class data — the system remembers how it was tuned
- Everything runs locally — no cloud APIs, total data privacy
- The system is rebuildable from repository + object storage alone
@ -51,15 +70,22 @@ No new frameworks without documented ADR.
| Service | Responsibility |
|---|---|
| **gateway** | HTTP/gRPC ingress, routing, auth, CORS, body limits |
| **catalogd** | Metadata control plane — dataset registry, schema versions, manifests |
| **storaged** | Object I/O — read/write/list/delete via `object_store` |
| **queryd** | SQL execution — DataFusion over Parquet, MemTable hot cache |
| **ingestd** | *NEW* — Ingest pipeline: CSV/JSON/DB → normalize → Parquet → catalog |
| **vectord** | *NEW* — Embedding store + vector index: chunk → embed → index → search |
| **gateway** | HTTP/gRPC ingress, routing, auth, CORS, body limits, `X-Lakehouse-Bucket` header routing |
| **catalogd** | Metadata control plane — dataset registry, schema versions, manifests, per-dataset resync from parquet footers |
| **storaged** | Object I/O — `BucketRegistry` (multi-backend), rescue fallback, error journal, append-log batching pattern |
| **queryd** | SQL execution — DataFusion over Parquet, MemTable hot cache, delta merge-on-read |
| **ingestd** | Ingest pipeline: CSV / JSON / PDF / Postgres-stream → normalize → Parquet → catalog |
| **vectord** | Embedding store + vector indexes + HNSW trial system (EmbeddingCache, trial journal, eval harness) |
| **journald** | Append-only mutation event log (ADR-012) — distinct from storaged error journal |
| **aibridge** | Rust↔Python boundary — HTTP client to FastAPI sidecar |
| **ui** | Dioxus frontend — Ask, Explore, SQL, System tabs |
| **shared** | Types, errors, Arrow helpers, config, protobuf definitions |
| **shared** | Types, errors, Arrow helpers, config, protobuf definitions, **secrets provider trait**, **PII detection** |
**Federation building blocks** (shipped 2026-04-16):
- `shared::secrets::SecretsProvider` trait + `FileSecretsProvider` reading `/etc/lakehouse/secrets.toml` (0600 enforced; trait sketched below)
- `storaged::registry::BucketRegistry` — multi-bucket resolution with `rescue_bucket` read fallback
- `storaged::append_log::AppendLog` — write-once batched append pattern (no RMW, no small-file problem)
- `storaged::error_journal::ErrorJournal` — bucket operation failure log at `primary://_errors/bucket_errors/batch_*.jsonl`
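A hedged sketch of what the `SecretsProvider` boundary could look like, assuming an `async-trait` style interface (the crate is in `shared`'s dependency list); the method name and error variants are illustrative, not the shipped signature:

```rust
use async_trait::async_trait;

/// Hypothetical shape of shared::secrets::SecretsProvider. Only the
/// trait/impl names and the 0600-checked /etc/lakehouse/secrets.toml
/// path come from this document; the rest is illustrative.
#[async_trait]
pub trait SecretsProvider: Send + Sync {
    /// Look up a named secret (e.g. an S3 access key for a bucket).
    async fn get(&self, key: &str) -> Result<Option<String>, SecretsError>;
}

#[derive(Debug, thiserror::Error)]
pub enum SecretsError {
    #[error("secrets file permissions too open (want 0600)")]
    PermissionsTooOpen,
    #[error("io: {0}")]
    Io(#[from] std::io::Error),
}
```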
### Data Flow
@ -92,10 +118,14 @@ User question → gateway
1. Object storage = source of truth for all data
2. catalogd = sole metadata authority
3. No raw data in catalog — only pointers
4. vectord stores embeddings AS Parquet (portable, not a proprietary format)
4. vectord stores embeddings AS Parquet (portable, not a proprietary format) — see Phase 18 / ADR-019 for the Parquet-vs-Lance trade review
5. ingestd is idempotent — re-ingesting the same file is a no-op
6. Hot cache is a performance layer, not a source of truth — eviction is safe
7. All services modular and independently replaceable
8. **Indexes are hot-swappable.** A new index generation can be built in the background while the current one serves queries. Promotion is atomic (pointer swap). Rollback to a prior generation is always possible. (Phase 16)
9. **Every reader gets its own profile.** Human operators, AI agents, and local models are all clients of the same substrate. Each has a named profile with its own bucket, vector indexes, trial history, and dataset bindings. Profiles are a first-class architectural concept, not a tenancy afterthought. (Phase 17)
10. **Trials are data, not logs.** Every index build is a trial with measurable metrics. The trial journal IS the agent's memory for how to tune itself. Stored as write-once batched JSONL per the ADR-018 append-log pattern.
11. **Operational failures are findable in one HTTP call.** The bucket error journal, trial journal, and audit log all expose `/storage/errors`, `/hnsw/trials`, `/access/audit` with structured filter + aggregation. No `grep` archaeology to answer "what broke?"
---
@ -262,19 +292,90 @@ Per-contract overlays with daily/weekly/monthly tiers and instant handoff.
| 14.3 | Migration preview: show how old data maps to new schema before applying | Human approves before data transforms |
| 14.4 | Versioned schemas in catalog: v1, v2, v3 coexist | Queries specify version or use latest |
### Phase 15+: Horizon
### Phase 15: Infrastructure horizon items
- [x] HNSW vector index with trial system (shipped 2026-04-16)
- [x] Federation foundation — ADR-017 (shipped 2026-04-16)
- [x] Database connector ingest — Postgres batch with streaming (shipped 2026-04-16)
- [ ] Federation layer 2 — X-Lakehouse-Bucket middleware, catalog migration, cross-bucket SQL in queryd
- [ ] PDF OCR for scanned documents (Tesseract integration)
- [ ] Scheduled ingest (cron-based file watching, S3 event triggers)
- [ ] Multi-node query distribution (DataFusion supports this architecturally)
### Phase 16: Hot-Swap Index Generations
Make indexes upgrade-in-place without dropping queries (generation-pointer swap sketched after the table).
| Step | Deliverable | Gate |
|---|---|---|
| 16.1 | "Active generation" pointer per logical index name | `/vectors/search` routes to current champion automatically |
| 16.2 | Background trial runner: watches trial journal, proposes configs (random search / Bayesian), fires `/hnsw/trial` | Agent autonomously tunes without human POSTing each config |
| 16.3 | Promotion endpoint: `POST /hnsw/promote/{index}/{trial_id}` atomically swaps active pointer | Next search hits new config, zero downtime |
| 16.4 | Rollback: `POST /hnsw/rollback/{index}` reverts to previous generation | Bad promotion recoverable in milliseconds |
| 16.5 | Dataset-append triggers: when `POST /ingest/file` writes to a dataset with attached vector indexes, schedule automatic re-trial (not full rebuild) | New docs get embedded + indexed without manual intervention |
**Gate:** Run the trial agent for 10 minutes against `resumes_100k_v2` with a fresh eval set. It explores the `ef_construction × ef_search` space, promotes the Pareto winner, continues running. Zero human clicks. All trials and promotions appear in `/hnsw/trials/resumes_100k_v2`.
**Risk:** Agent loops into a bad region (e.g. always proposes ef_construction=1). Mitigation: a hardcoded config space constraint + minimum-quality gate (don't promote anything with recall <0.9).
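A minimal sketch of the 16.1/16.3/16.4 mechanics as a plain pointer table guarded by `RwLock`; the type names and the single-level `previous` slot are assumptions, not the planned design:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// One built HNSW generation (graph, config, metrics elided).
pub struct IndexGeneration {
    pub trial_id: String,
}

/// "Active generation" pointer per logical index name (16.1). Searches
/// clone the Arc and keep serving the old graph even while a promotion
/// swaps the pointer underneath them (16.3).
#[derive(Default)]
pub struct GenerationTable {
    active: RwLock<HashMap<String, Arc<IndexGeneration>>>,
    previous: RwLock<HashMap<String, Arc<IndexGeneration>>>,
}

impl GenerationTable {
    pub fn resolve(&self, index: &str) -> Option<Arc<IndexGeneration>> {
        self.active.read().unwrap().get(index).cloned()
    }

    /// POST /hnsw/promote/{index}/{trial_id}: atomic pointer swap.
    pub fn promote(&self, index: &str, gen: Arc<IndexGeneration>) {
        let old = self.active.write().unwrap().insert(index.to_string(), gen);
        if let Some(old) = old {
            self.previous.write().unwrap().insert(index.to_string(), old);
        }
    }

    /// POST /hnsw/rollback/{index}: revert to the prior generation.
    pub fn rollback(&self, index: &str) -> bool {
        if let Some(prev) = self.previous.write().unwrap().remove(index) {
            self.active.write().unwrap().insert(index.to_string(), prev);
            true
        } else {
            false
        }
    }
}
```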
### Phase 17: Model Profiles + Dataset Bindings
Make "different models see different data" real instead of a config string.
| Step | Deliverable | Gate |
|---|---|---|
| 17.1 | `ModelProfile` manifest: id, ollama_name, bucket, bound_datasets[], hnsw_config, embed_model | `GET /models` lists profiles; `POST /models` creates one |
| 17.2 | Profile activation endpoint: `POST /profile/{id}/activate` — warms EmbeddingCache for bound indexes, builds HNSW with profile's config | Next search against bound indexes is <1ms cold |
| 17.3 | Model-scoped search: `POST /search?model=X` filters to bound datasets only | Model A can't see Model B's datasets unless explicitly shared |
| 17.4 | VRAM-aware activation: only one (or small N) model loaded at a time on 16GB A4000 | Activating model B unloads model A via Ollama's keep_alive=0 |
| 17.5 | Audit: every tool invocation by a model is logged with model identity | `GET /models/{id}/audit` shows exactly what each model touched |
**Gate:** Two model profiles defined: `staffing-recruiter` (bound to candidates/placements/timesheets) and `docs-assistant` (bound to a documentation corpus). Activate staffing-recruiter, search for candidates — works. Switch to docs-assistant, same search — returns zero from staffing (not bound) but finds docs. VRAM shows only one embedding model loaded at a time.
**VRAM reality:** 16GB A4000 realistically holds 1-2 loaded models concurrently. "Multi-model" in practice means sequential swap between profiles, not parallel serving. The profile abstraction makes this swap clean.
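A hedged sketch of the 17.1 manifest; only the field names come from the table, while the types (and the stand-in `HnswConfig`) are illustrative:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical shape of the Phase 17.1 ModelProfile manifest.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ModelProfile {
    pub id: String,                  // e.g. "staffing-recruiter"
    pub ollama_name: String,         // e.g. "qwen2.5:7b"
    pub bucket: String,              // profile-scoped bucket per ADR-017
    pub bound_datasets: Vec<String>, // the only datasets /search?model= sees
    pub hnsw_config: HnswConfig,     // per-profile ef_construction / ef_search
    pub embed_model: String,         // e.g. "nomic-embed-text"
}

/// Stand-in for vectord's HnswConfig (ec=80 es=30 is the locked default).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HnswConfig {
    pub ef_construction: usize,
    pub ef_search: usize,
}
```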
### Phase 18: Storage format decision (Lance evaluation)
The question was raised on 2026-04-16, after J's LLMS3 knowledge base identified Lance as `alternative_to` Parquet for vector workloads. Current stack: Parquet with binary-blob vector columns + in-RAM HNSW sidecar. Evaluated against: Lance native vector format with disk-resident indexes.
| Step | Deliverable | Decision criteria |
|---|---|---|
| 18.1 | Parallel Lance-backed vector index for `resumes_100k_v2` behind feature flag | Both implementations coexist, benchmarkable |
| 18.2 | Head-to-head benchmark: cold-load, search latency, disk size, append cost | See criteria below |
| 18.3 | ADR-019 documenting the decision with measured data | Commit or reject with evidence |
**Decision rules:**
- Lance wins on cold-load by ≥2× AND matches search latency → migrate vector layer to Lance. Dataset Parquet stays.
- Lance is within 50% of current → stay on current stack, document ceiling explicitly.
- Lance loses → close the door, move on.
### Phase 19+: Further horizon
- Federated multi-bucket query (client A's S3 + client B's S3 + yours)
- Database connector ingest (PostgreSQL, MySQL, MSSQL → Parquet via CDC)
- PDF OCR for scanned documents (Tesseract integration)
- Scheduled ingest (cron-based file watching, S3 event triggers)
- Specialized fine-tuned models per domain (staffing matcher, resume parser)
- Multi-node query distribution (DataFusion supports this architecturally)
- Video/audio transcript ingest + multimodal embeddings
- True distributed query (DataFusion multi-node) — only if single-machine ceilings bite
---
## Reference Dataset: Staffing Company
## Known ceilings (honest)
The current stack has measurable limits. Documenting them so future decisions aren't based on wishful thinking.
| Dimension | Current ceiling | Breaks at | Escape hatch |
|---|---|---|---|
| Vector count per index | ~5M vectors on 128GB RAM | 10M+ (serious web crawl) | Phase 18 Lance migration OR mmap'd embeddings |
| Concurrent active indexes | ~50-100 at 100K vectors each | 10M×50 configurations | Lance disk-resident + per-profile activation |
| Rows per dataset | 2.47M proven, probably 100M+ fine | Approaches DataFusion memory limits | DataFusion predicate pushdown + partition pruning (existing) |
| Concurrent loaded models | 1-2 on 16GB VRAM (A4000) | 3+ models simultaneous | Not our problem — architectural, driven by Ollama |
| Trial journal growth per index | Thousands of trials, batched JSONL | High-frequency auto-tuning agent | Compaction via `/hnsw/trials/{idx}/compact` |
| Error journal growth | Bounded by ring buffer (2000 events in-memory) + batched JSONL on disk | Continuous failure scenarios | Compaction + retention policy (TODO) |
---
## Reference Workloads
### Workload 1: Staffing Company
Scale-tested on 128GB RAM server:
@ -315,6 +416,27 @@ Scale-tested on 128GB RAM server:
- Instant handoff between agents — zero data copy
- Full activity timeline preserved across handoffs
### Workload 2: Local LLM Knowledge Base
The second use case this substrate is built for. Reference corpus: the running `knowledge_base` Postgres database (586 team runs, response cache history, pipeline runs, threat intel) + LLMS3.com published corpus (~243 enriched documents).
Target scale on same 128GB server:
- Documents: 10K-100K per model profile
- Chunks after chunking: 500K-5M per profile
- Embedding dimensions: 768 (nomic-embed-text)
- Query latency: <100ms semantic search, <3s end-to-end RAG including LLM generation
- Concurrent model profiles: 2-5 configured, 1-2 active at a time (VRAM-bound)
Measured to date (Phase 7 + Phase 16 prep):
- 100K candidate-resume chunks embedded in 10 min via Ollama nomic-embed-text
- HNSW search at 100% recall, ~1ms p50 on 100K vectors (ec=80 es=30 locked as default)
- Trial journal instrumented and working for parameter tuning
Gaps still to close for this workload:
- Model profiles (Phase 17) — today, "model" is a string, not a first-class entity
- Hot-swap generations (Phase 16) — today, rebuild = downtime
- Scale past 5M vectors — needs Phase 18 Lance evaluation to decide path
---
## Available Local Models
@ -331,12 +453,15 @@ Scale-tested on 128GB RAM server:
## Non-Goals
- Multi-tenancy (single-owner system)
- Cloud deployment (local-first, always)
- Full ACID transactions (single-writer model is sufficient)
- Real-time streaming / CDC (batch ingest is the model)
- Replacing the CRM (this is the analytical layer BEHIND the CRM)
- Custom file formats (Parquet is the format, period)
- Real-time streaming / CDC (batch ingest is the model; scheduled refresh, not transactional replication)
- Replacing the CRM (this is the analytical + AI layer BEHIND the CRM)
- Custom file formats — Parquet for datasets + sidecar indexes for vectors (see Phase 18 / ADR-019 for the Parquet-vs-Lance evaluation and the ceilings the current choice implies)
- Hard multi-tenant isolation (profiles and federation provide soft isolation; this is not a SaaS platform with adversarial tenants — operator is single-trust)
Removed from prior non-goals (2026-04-16):
- ~~Multi-tenancy (single-owner system)~~ — federation + profile buckets are now first-class; soft multi-tenancy is a design goal. Hard multi-tenancy (adversarial tenants on shared infrastructure) remains out of scope.
---

View File

@ -6,7 +6,23 @@ port = 3100
[storage]
root = "./data"
# backend = "local" # local | s3 | rustfs
profile_root = "./data/_profiles"
rescue_bucket = "rescue"
[[storage.buckets]]
name = "primary"
backend = "local"
root = "./data"
[[storage.buckets]]
name = "rescue"
backend = "local"
root = "./data/_rescue"
[[storage.buckets]]
name = "testing"
backend = "local"
root = "./data/_testing"
[catalog]
# Manifests persisted to object storage under this prefix