root dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for faster builds: ec/es = 20/30 gives
  0.96 recall in 8s; 80/30 gives 1.00 recall in 230s
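The locked default and the trial grid above can be sketched as follows. This is an illustration only: the real `HnswConfig` lives in vectord, and the field names `ec` (construction-time breadth) and `es` (search-time breadth) are assumptions read off the shorthand in this message.

```rust
/// Sketch only — field names `ec`/`es` are assumed from the commit shorthand;
/// the real HnswConfig is defined in vectord.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HnswConfig {
    pub ec: usize,
    pub es: usize,
}

impl Default for HnswConfig {
    // ec=80 es=30, locked in by the resumes_100k_v2 trials.
    fn default() -> Self {
        Self { ec: 80, es: 30 }
    }
}

fn main() {
    // A trial grid like the 20/30-vs-80/30 comparison: each pair becomes
    // one journal entry scored against brute-force ground truth.
    let grid: Vec<HnswConfig> = [20, 40, 80]
        .into_iter()
        .map(|ec| HnswConfig { ec, es: 30 })
        .collect();
    assert_eq!(grid.len(), 3);
    assert_eq!(HnswConfig::default(), HnswConfig { ec: 80, es: 30 });
    println!("default = {:?}", HnswConfig::default());
}
```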

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback
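The 0600 enforcement on the secrets file can be sketched as below. The trait shape and the `check_perms` helper are assumptions for illustration; the real `SecretsProvider`/`FileSecretsProvider` live in shared::secrets.

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Sketch of the trait from ADR-017 (method shape assumed; the real
/// trait lives in shared::secrets).
pub trait SecretsProvider {
    fn get(&self, key: &str) -> Result<String, String>;
}

pub struct FileSecretsProvider;

impl FileSecretsProvider {
    /// Reject secrets files that are group- or world-accessible,
    /// mirroring the 0600 enforcement described above.
    pub fn check_perms(path: &Path) -> Result<(), String> {
        let mode = fs::metadata(path)
            .map_err(|e| format!("{}: {e}", path.display()))?
            .permissions()
            .mode();
        if mode & 0o077 != 0 {
            return Err(format!(
                "{}: mode {:o} is too permissive, want 0600",
                path.display(),
                mode & 0o777
            ));
        }
        Ok(())
    }
}

fn main() {
    let path = std::env::temp_dir().join("secrets_demo.toml");
    fs::write(&path, "api_token = \"placeholder\"\n").unwrap();

    // 0600: owner read/write only -> accepted.
    fs::set_permissions(&path, fs::Permissions::from_mode(0o600)).unwrap();
    assert!(FileSecretsProvider::check_perms(&path).is_ok());

    // 0644: world-readable -> rejected.
    fs::set_permissions(&path, fs::Permissions::from_mode(0o644)).unwrap();
    assert!(FileSecretsProvider::check_perms(&path).is_err());

    fs::remove_file(&path).ok();
    println!("permission checks passed");
}
```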

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)
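The lineage password redaction can be sketched as a pure string transform. The helper name `redact_dsn` is hypothetical; the real DSN parser lives in ingestd::pg_stream.

```rust
/// Sketch of DSN password redaction for lineage records (helper name
/// assumed). postgres://user:secret@host/db -> postgres://user:***@host/db
fn redact_dsn(dsn: &str) -> String {
    // Locate the scheme separator and the userinfo section, if any.
    let Some(scheme_end) = dsn.find("://") else { return dsn.to_string() };
    let rest = &dsn[scheme_end + 3..];
    let Some(at) = rest.find('@') else { return dsn.to_string() };
    let userinfo = &rest[..at];
    match userinfo.find(':') {
        // Keep the username, mask everything after the colon.
        Some(colon) => format!(
            "{}{}:***{}",
            &dsn[..scheme_end + 3],
            &userinfo[..colon],
            &rest[at..]
        ),
        // No password present: nothing to redact.
        None => dsn.to_string(),
    }
}

fn main() {
    let redacted = redact_dsn("postgres://ingest:s3cr3t@db.local:5432/knowledge_base");
    assert_eq!(redacted, "postgres://ingest:***@db.local:5432/knowledge_base");
    assert_eq!(redact_dsn("postgres://db.local/kb"), "postgres://db.local/kb");
    println!("{redacted}");
}
```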

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00

use shared::types::{
    DatasetId, DatasetManifest, ObjectRef, SchemaFingerprint,
    ColumnMeta, Lineage, FreshnessContract, Sensitivity,
};
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;
use storaged::ops;
use object_store::ObjectStore;

/// Partial metadata update — only set fields are applied.
#[derive(Debug, Clone, Default, serde::Deserialize)]
pub struct MetadataUpdate {
    pub description: Option<String>,
    pub owner: Option<String>,
    pub sensitivity: Option<Sensitivity>,
    pub tags: Option<Vec<String>>,
    pub columns: Option<Vec<ColumnMeta>>,
    pub lineage: Option<Lineage>,
    pub freshness: Option<FreshnessContract>,
    pub row_count: Option<u64>,
}

const MANIFEST_PREFIX: &str = "_catalog/manifests";

/// In-memory dataset registry backed by manifest persistence in object storage.
#[derive(Clone)]
pub struct Registry {
    datasets: Arc<RwLock<HashMap<DatasetId, DatasetManifest>>>,
    store: Arc<dyn ObjectStore>,
}

impl Registry {
    pub fn new(store: Arc<dyn ObjectStore>) -> Self {
        Self {
            datasets: Arc::new(RwLock::new(HashMap::new())),
            store,
        }
    }

    /// Rebuild in-memory index from persisted manifests on startup.
    pub async fn rebuild(&self) -> Result<usize, String> {
        let keys = ops::list(&self.store, Some(MANIFEST_PREFIX)).await?;
        let mut datasets = self.datasets.write().await;
        datasets.clear();
        for key in &keys {
            let data = ops::get(&self.store, key).await?;
            let manifest: DatasetManifest =
                serde_json::from_slice(&data).map_err(|e| e.to_string())?;
            datasets.insert(manifest.id.clone(), manifest);
        }
        let count = datasets.len();
        tracing::info!("catalog rebuilt: {count} datasets loaded");
        Ok(count)
    }

    /// Register a new dataset. Persists manifest to storage before updating memory.
    pub async fn register(
        &self,
        name: String,
        schema_fingerprint: SchemaFingerprint,
        objects: Vec<ObjectRef>,
    ) -> Result<DatasetManifest, String> {
        let now = chrono::Utc::now();
        let manifest = DatasetManifest {
            id: DatasetId::new(),
            name,
            schema_fingerprint,
            objects,
            created_at: now,
            updated_at: now,
            description: String::new(),
            owner: String::new(),
            sensitivity: None,
            columns: vec![],
            lineage: None,
            freshness: None,
            tags: vec![],
            row_count: None,
        };

        // Write-ahead: persist before in-memory update
        let manifest_key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
        let json = serde_json::to_vec_pretty(&manifest).map_err(|e| e.to_string())?;
        ops::put(&self.store, &manifest_key, json.into()).await?;

        let mut datasets = self.datasets.write().await;
        datasets.insert(manifest.id.clone(), manifest.clone());
        tracing::info!("registered dataset: {} ({})", manifest.name, manifest.id);
        Ok(manifest)
    }

    /// Update metadata on an existing dataset (owner, description, tags, sensitivity, etc.)
    pub async fn update_metadata(
        &self,
        name: &str,
        updates: MetadataUpdate,
    ) -> Result<DatasetManifest, String> {
        let mut datasets = self.datasets.write().await;
        let manifest = datasets.values_mut()
            .find(|d| d.name == name)
            .ok_or_else(|| format!("dataset not found: {name}"))?;

        if let Some(desc) = updates.description { manifest.description = desc; }
        if let Some(owner) = updates.owner { manifest.owner = owner; }
        if let Some(sens) = updates.sensitivity { manifest.sensitivity = Some(sens); }
        if let Some(tags) = updates.tags { manifest.tags = tags; }
        if let Some(cols) = updates.columns { manifest.columns = cols; }
        if let Some(lineage) = updates.lineage { manifest.lineage = Some(lineage); }
        if let Some(freshness) = updates.freshness { manifest.freshness = Some(freshness); }
        if let Some(count) = updates.row_count { manifest.row_count = Some(count); }
        manifest.updated_at = chrono::Utc::now();

        // Persist
        let manifest_key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
        let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
        ops::put(&self.store, &manifest_key, json.into()).await?;
        let result = manifest.clone();
        Ok(result)
    }

    /// Get a dataset by ID.
    pub async fn get(&self, id: &DatasetId) -> Option<DatasetManifest> {
        let datasets = self.datasets.read().await;
        datasets.get(id).cloned()
    }

    /// Get a dataset by name.
    pub async fn get_by_name(&self, name: &str) -> Option<DatasetManifest> {
        let datasets = self.datasets.read().await;
        datasets.values().find(|d| d.name == name).cloned()
    }

    /// List all datasets.
    pub async fn list(&self) -> Vec<DatasetManifest> {
        let datasets = self.datasets.read().await;
        datasets.values().cloned().collect()
    }

    /// Re-read the parquet footer(s) for a dataset and repopulate `row_count`
    /// and `columns` from reality. Use this to repair manifests whose
    /// metadata was lost (e.g. migrated from a pre-Phase 10 catalog).
    ///
    /// Does NOT touch owner/description/sensitivity/lineage/tags — only
    /// the structural facts that parquet can tell us authoritatively.
    /// The existing `schema_fingerprint` is updated if the recomputed one
    /// differs; a warning is logged so drift is visible.
    pub async fn resync_from_parquet(&self, name: &str) -> Result<DatasetManifest, String> {
        use shared::arrow_helpers::{fingerprint_schema, parquet_to_record_batches};

        // Snapshot the target manifest so we don't hold the write lock during IO.
        let (id, objects, old_fp) = {
            let datasets = self.datasets.read().await;
            let m = datasets
                .values()
                .find(|d| d.name == name)
                .ok_or_else(|| format!("dataset not found: {name}"))?;
            (m.id.clone(), m.objects.clone(), m.schema_fingerprint.clone())
        };
        if objects.is_empty() {
            return Err(format!("dataset '{name}' has no object references to resync from"));
        }

        let mut total_rows: u64 = 0;
        let mut first_schema: Option<arrow::datatypes::SchemaRef> = None;
        for obj in &objects {
            let data = ops::get(&self.store, &obj.key).await
                .map_err(|e| format!("read {}: {e}", obj.key))?;
            let (schema, batches) = parquet_to_record_batches(&data)
                .map_err(|e| format!("parse {}: {e}", obj.key))?;
            let rows: u64 = batches.iter().map(|b| b.num_rows() as u64).sum();
            total_rows += rows;
            if first_schema.is_none() {
                first_schema = Some(schema);
            }
        }
        let schema = first_schema.ok_or("no schema recovered")?;

        let new_fp = fingerprint_schema(&schema);
        if new_fp != old_fp {
            tracing::warn!(
                "dataset '{}' schema fingerprint drift: {} -> {} (updating to match parquet reality)",
                name, old_fp.0, new_fp.0,
            );
        }
        let columns: Vec<ColumnMeta> = schema
            .fields()
            .iter()
            .map(|f| ColumnMeta {
                name: f.name().clone(),
                data_type: f.data_type().to_string(),
                sensitivity: None,
                description: String::new(),
                is_pii: false,
            })
            .collect();

        // Apply updates.
        let mut datasets = self.datasets.write().await;
        let manifest = datasets
            .get_mut(&id)
            .ok_or_else(|| format!("dataset disappeared during resync: {name}"))?;
        manifest.row_count = Some(total_rows);
        manifest.columns = columns;
        manifest.schema_fingerprint = new_fp;
        manifest.updated_at = chrono::Utc::now();

        // Persist.
        let manifest_key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
        let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
        ops::put(&self.store, &manifest_key, json.into()).await?;
        tracing::info!("resynced '{name}': row_count={total_rows}, {} columns", manifest.columns.len());
        Ok(manifest.clone())
    }

    /// Resync every dataset that currently has a null row_count or no
    /// column metadata. Returns (successes, failures): successes as
    /// (name, row_count), failures as (name, error message).
    pub async fn resync_missing(&self) -> (Vec<(String, u64)>, Vec<(String, String)>) {
        let names: Vec<String> = {
            let datasets = self.datasets.read().await;
            datasets
                .values()
                .filter(|d| d.row_count.is_none() || d.columns.is_empty())
                .map(|d| d.name.clone())
                .collect()
        };
        let mut ok = Vec::new();
        let mut err = Vec::new();
        for name in names {
            match self.resync_from_parquet(&name).await {
                Ok(m) => ok.push((name, m.row_count.unwrap_or(0))),
                Err(e) => err.push((name, e)),
            }
        }
        (ok, err)
    }

    /// Add objects to an existing dataset.
    pub async fn add_objects(
        &self,
        id: &DatasetId,
        new_objects: Vec<ObjectRef>,
    ) -> Result<DatasetManifest, String> {
        let mut datasets = self.datasets.write().await;
        let manifest = datasets.get_mut(id).ok_or_else(|| format!("dataset not found: {id}"))?;
        manifest.objects.extend(new_objects);
        manifest.updated_at = chrono::Utc::now();

        // Persist updated manifest
        let manifest_key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
        let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
        ops::put(&self.store, &manifest_key, json.into()).await?;
        Ok(manifest.clone())
    }
}
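The persist-before-memory ordering used by `register`, `update_metadata`, and `resync_from_parquet` above can be distilled into a self-contained sketch. `MiniStore` and `MiniRegistry` are illustrative stand-ins, not real types; the actual Registry goes through `storaged::ops` against an `ObjectStore`.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for object storage (not a real type).
#[derive(Default)]
struct MiniStore {
    objects: HashMap<String, Vec<u8>>,
}

/// Illustrative stand-in for the catalog registry (not a real type).
#[derive(Default)]
struct MiniRegistry {
    store: MiniStore,
    datasets: HashMap<String, u64>, // name -> row_count
}

impl MiniRegistry {
    /// Write-ahead ordering: persist the manifest first, only then mutate
    /// the in-memory index. If the put fails, memory is untouched, so a
    /// restart-time rebuild() still sees a consistent picture.
    fn register(&mut self, name: &str, row_count: u64) -> Result<(), String> {
        let key = format!("_catalog/manifests/{name}.json");
        let json = format!("{{\"name\":\"{name}\",\"row_count\":{row_count}}}");
        self.store.objects.insert(key, json.into_bytes()); // persist first
        self.datasets.insert(name.to_string(), row_count); // then memory
        Ok(())
    }
}

fn main() {
    let mut reg = MiniRegistry::default();
    reg.register("team_runs", 586).unwrap();
    assert_eq!(reg.datasets["team_runs"], 586);
    assert!(reg.store.objects.contains_key("_catalog/manifests/team_runs.json"));
    println!("registered {} dataset(s)", reg.datasets.len());
}
```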