Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron itself is
separate
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
76f6fba5de
commit
97a376482c
@ -1,6 +1,6 @@
|
|||||||
use shared::types::{
|
use shared::types::{
|
||||||
DatasetId, DatasetManifest, ObjectRef, SchemaFingerprint,
|
DatasetId, DatasetManifest, ObjectRef, SchemaFingerprint,
|
||||||
ColumnMeta, Lineage, FreshnessContract, Sensitivity,
|
ColumnMeta, Lineage, FreshnessContract, RefreshPolicy, Sensitivity,
|
||||||
};
|
};
|
||||||
use std::collections::HashMap;
|
use std::collections::HashMap;
|
||||||
use std::sync::Arc;
|
use std::sync::Arc;
|
||||||
@ -20,6 +20,8 @@ pub struct MetadataUpdate {
|
|||||||
pub lineage: Option<Lineage>,
|
pub lineage: Option<Lineage>,
|
||||||
pub freshness: Option<FreshnessContract>,
|
pub freshness: Option<FreshnessContract>,
|
||||||
pub row_count: Option<u64>,
|
pub row_count: Option<u64>,
|
||||||
|
// Phase C embedding freshness
|
||||||
|
pub embedding_refresh_policy: Option<RefreshPolicy>,
|
||||||
}
|
}
|
||||||
|
|
||||||
const MANIFEST_PREFIX: &str = "_catalog/manifests";
|
const MANIFEST_PREFIX: &str = "_catalog/manifests";
|
||||||
@ -78,6 +80,9 @@ impl Registry {
|
|||||||
freshness: None,
|
freshness: None,
|
||||||
tags: vec![],
|
tags: vec![],
|
||||||
row_count: None,
|
row_count: None,
|
||||||
|
last_embedded_at: None,
|
||||||
|
embedding_stale_since: None,
|
||||||
|
embedding_refresh_policy: None,
|
||||||
};
|
};
|
||||||
|
|
||||||
// Write-ahead: persist before in-memory update
|
// Write-ahead: persist before in-memory update
|
||||||
@ -111,6 +116,7 @@ impl Registry {
|
|||||||
if let Some(lineage) = updates.lineage { manifest.lineage = Some(lineage); }
|
if let Some(lineage) = updates.lineage { manifest.lineage = Some(lineage); }
|
||||||
if let Some(freshness) = updates.freshness { manifest.freshness = Some(freshness); }
|
if let Some(freshness) = updates.freshness { manifest.freshness = Some(freshness); }
|
||||||
if let Some(count) = updates.row_count { manifest.row_count = Some(count); }
|
if let Some(count) = updates.row_count { manifest.row_count = Some(count); }
|
||||||
|
if let Some(policy) = updates.embedding_refresh_policy { manifest.embedding_refresh_policy = Some(policy); }
|
||||||
manifest.updated_at = chrono::Utc::now();
|
manifest.updated_at = chrono::Utc::now();
|
||||||
|
|
||||||
// Persist
|
// Persist
|
||||||
@ -242,6 +248,63 @@ impl Registry {
|
|||||||
(ok, err)
|
(ok, err)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Mark a dataset's embeddings as stale (row-level data has been written
|
||||||
|
/// since the last embedding refresh). Idempotent — setting stale when
|
||||||
|
/// already stale is a no-op. Only marks stale if the dataset has been
|
||||||
|
/// embedded before — a never-embedded dataset doesn't need a stale flag
|
||||||
|
/// (it just needs an initial index build). Called from the ingest path.
|
||||||
|
pub async fn mark_embeddings_stale(&self, name: &str) -> Result<(), String> {
|
||||||
|
let mut datasets = self.datasets.write().await;
|
||||||
|
let manifest = datasets
|
||||||
|
.values_mut()
|
||||||
|
.find(|d| d.name == name)
|
||||||
|
.ok_or_else(|| format!("dataset not found: {name}"))?;
|
||||||
|
|
||||||
|
if manifest.last_embedded_at.is_none() {
|
||||||
|
return Ok(()); // never embedded -> no stale semantics yet
|
||||||
|
}
|
||||||
|
if manifest.embedding_stale_since.is_none() {
|
||||||
|
manifest.embedding_stale_since = Some(chrono::Utc::now());
|
||||||
|
manifest.updated_at = chrono::Utc::now();
|
||||||
|
|
||||||
|
let key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
|
||||||
|
let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
|
||||||
|
ops::put(&self.store, &key, json.into()).await?;
|
||||||
|
tracing::info!("marked embeddings stale for dataset '{name}'");
|
||||||
|
}
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Clear the stale marker and set `last_embedded_at = now`.
|
||||||
|
/// Called by the embedding refresh pipeline once it finishes.
|
||||||
|
pub async fn clear_embeddings_stale(&self, name: &str) -> Result<(), String> {
|
||||||
|
let mut datasets = self.datasets.write().await;
|
||||||
|
let manifest = datasets
|
||||||
|
.values_mut()
|
||||||
|
.find(|d| d.name == name)
|
||||||
|
.ok_or_else(|| format!("dataset not found: {name}"))?;
|
||||||
|
|
||||||
|
let now = chrono::Utc::now();
|
||||||
|
manifest.embedding_stale_since = None;
|
||||||
|
manifest.last_embedded_at = Some(now);
|
||||||
|
manifest.updated_at = now;
|
||||||
|
|
||||||
|
let key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
|
||||||
|
let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
|
||||||
|
ops::put(&self.store, &key, json.into()).await?;
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// List datasets whose `embedding_stale_since` is set — they need a refresh.
|
||||||
|
pub async fn stale_datasets(&self) -> Vec<DatasetManifest> {
|
||||||
|
let datasets = self.datasets.read().await;
|
||||||
|
datasets
|
||||||
|
.values()
|
||||||
|
.filter(|d| d.embedding_stale_since.is_some())
|
||||||
|
.cloned()
|
||||||
|
.collect()
|
||||||
|
}
|
||||||
|
|
||||||
/// Add objects to an existing dataset.
|
/// Add objects to an existing dataset.
|
||||||
pub async fn add_objects(
|
pub async fn add_objects(
|
||||||
&self,
|
&self,
|
||||||
|
|||||||
@ -98,6 +98,7 @@ async fn main() {
|
|||||||
hnsw_store: vectord::hnsw::HnswStore::new(),
|
hnsw_store: vectord::hnsw::HnswStore::new(),
|
||||||
embedding_cache: vectord::embedding_cache::EmbeddingCache::new(store.clone()),
|
embedding_cache: vectord::embedding_cache::EmbeddingCache::new(store.clone()),
|
||||||
trial_journal: vectord::trial::TrialJournal::new(store.clone()),
|
trial_journal: vectord::trial::TrialJournal::new(store.clone()),
|
||||||
|
catalog: registry.clone(),
|
||||||
}
|
}
|
||||||
}))
|
}))
|
||||||
.nest("/workspaces", queryd::workspace_service::router(workspace_mgr))
|
.nest("/workspaces", queryd::workspace_service::router(workspace_mgr))
|
||||||
|
|||||||
@ -147,6 +147,10 @@ pub async fn ingest_file(
|
|||||||
..Default::default()
|
..Default::default()
|
||||||
}).await;
|
}).await;
|
||||||
|
|
||||||
|
// Phase C: if this dataset already had embeddings, they're now stale.
|
||||||
|
// mark_embeddings_stale is a no-op for never-embedded datasets.
|
||||||
|
let _ = registry.mark_embeddings_stale(&safe_name).await;
|
||||||
|
|
||||||
Ok(IngestResult {
|
Ok(IngestResult {
|
||||||
dataset_name: safe_name,
|
dataset_name: safe_name,
|
||||||
file_type: format!("{:?}", file_type),
|
file_type: format!("{:?}", file_type),
|
||||||
|
|||||||
@ -271,6 +271,10 @@ async fn ingest_db_stream(
|
|||||||
..Default::default()
|
..Default::default()
|
||||||
}).await;
|
}).await;
|
||||||
|
|
||||||
|
// Phase C: mark embeddings stale if the dataset already had a vector
|
||||||
|
// index. No-op for newly-created datasets.
|
||||||
|
let _ = state.registry.mark_embeddings_stale(&dataset_name).await;
|
||||||
|
|
||||||
Ok((StatusCode::CREATED, Json(serde_json::json!({
|
Ok((StatusCode::CREATED, Json(serde_json::json!({
|
||||||
"dataset_name": dataset_name,
|
"dataset_name": dataset_name,
|
||||||
"table": stream_result.table,
|
"table": stream_result.table,
|
||||||
|
|||||||
@ -116,4 +116,44 @@ pub struct DatasetManifest {
|
|||||||
/// Row count (updated on ingest/compact)
|
/// Row count (updated on ingest/compact)
|
||||||
#[serde(default)]
|
#[serde(default)]
|
||||||
pub row_count: Option<u64>,
|
pub row_count: Option<u64>,
|
||||||
|
|
||||||
|
// --- Embedding freshness (Phase C) ---
|
||||||
|
|
||||||
|
/// When the attached vector index was last refreshed. `None` means this
|
||||||
|
/// dataset has no vector index yet, or has never been embedded.
|
||||||
|
#[serde(default)]
|
||||||
|
pub last_embedded_at: Option<chrono::DateTime<chrono::Utc>>,
|
||||||
|
|
||||||
|
/// When data was written that hasn't yet been embedded. `Some(t)` means
|
||||||
|
/// the vector index is out-of-date as of timestamp `t`. Cleared on refresh.
|
||||||
|
#[serde(default)]
|
||||||
|
pub embedding_stale_since: Option<chrono::DateTime<chrono::Utc>>,
|
||||||
|
|
||||||
|
/// How this dataset wants stale embeddings handled.
|
||||||
|
#[serde(default)]
|
||||||
|
pub embedding_refresh_policy: Option<RefreshPolicy>,
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Controls what happens when new data is written to a dataset with an
|
||||||
|
/// attached vector index.
|
||||||
|
///
|
||||||
|
/// - `Manual` (default): data writes set `embedding_stale_since`; nothing
|
||||||
|
/// embeds until an operator or agent calls `/vectors/refresh/{dataset}`.
|
||||||
|
/// - `OnAppend`: ingest fires a background refresh immediately after writing.
|
||||||
|
/// Suitable for datasets where vector freshness matters more than ingest
|
||||||
|
/// latency.
|
||||||
|
/// - `Scheduled(cron)`: a timer or external scheduler triggers refresh at
|
||||||
|
/// the named cadence. The scheduler itself is not in this ADR scope —
|
||||||
|
/// this just declares the intent so operators can see which policy a
|
||||||
|
/// dataset expects.
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
|
||||||
|
#[serde(tag = "kind", rename_all = "snake_case")]
|
||||||
|
pub enum RefreshPolicy {
|
||||||
|
Manual,
|
||||||
|
OnAppend,
|
||||||
|
Scheduled { cron: String },
|
||||||
|
}
|
||||||
|
|
||||||
|
impl Default for RefreshPolicy {
|
||||||
|
fn default() -> Self { Self::Manual }
|
||||||
}
|
}
|
||||||
|
|||||||
@ -7,6 +7,7 @@ edition = "2024"
|
|||||||
shared = { path = "../shared" }
|
shared = { path = "../shared" }
|
||||||
storaged = { path = "../storaged" }
|
storaged = { path = "../storaged" }
|
||||||
aibridge = { path = "../aibridge" }
|
aibridge = { path = "../aibridge" }
|
||||||
|
catalogd = { path = "../catalogd" }
|
||||||
tokio = { workspace = true }
|
tokio = { workspace = true }
|
||||||
axum = { workspace = true }
|
axum = { workspace = true }
|
||||||
serde = { workspace = true }
|
serde = { workspace = true }
|
||||||
|
|||||||
@ -4,6 +4,7 @@ pub mod harness;
|
|||||||
pub mod hnsw;
|
pub mod hnsw;
|
||||||
pub mod index_registry;
|
pub mod index_registry;
|
||||||
pub mod jobs;
|
pub mod jobs;
|
||||||
|
pub mod refresh;
|
||||||
pub mod store;
|
pub mod store;
|
||||||
pub mod search;
|
pub mod search;
|
||||||
pub mod rag;
|
pub mod rag;
|
||||||
|
|||||||
279
crates/vectord/src/refresh.rs
Normal file
279
crates/vectord/src/refresh.rs
Normal file
@ -0,0 +1,279 @@
|
|||||||
|
//! Phase C: Decoupled embedding refresh.
|
||||||
|
//!
|
||||||
|
//! When a dataset's row-level data changes, the vector index it feeds is
|
||||||
|
//! stale. Historically we coupled ingest and embedding — writing new rows
|
||||||
|
//! also re-embedded every row, which is fine at 500 rows but blows up at
|
||||||
|
//! 100K+. The llms3.com architecture calls out "asynchronous vector
|
||||||
|
//! refresh cycles independent from transactional mutations" as the right
|
||||||
|
//! pattern.
|
||||||
|
//!
|
||||||
|
//! This module implements the refresh side. The ingest path marks
|
||||||
|
//! embeddings stale (see catalogd::registry::mark_embeddings_stale); this
|
||||||
|
//! code clears that staleness by embedding only rows whose `doc_id` isn't
|
||||||
|
//! already in the existing vector index.
|
||||||
|
//!
|
||||||
|
//! Scope — MVP:
|
||||||
|
//! - Reads the dataset's Parquet, extracts (doc_id, text) pairs from named
|
||||||
|
//! columns
|
||||||
|
//! - Loads existing embeddings via EmbeddingCache
|
||||||
|
//! - Filters to rows whose doc_id is NOT in the existing set
|
||||||
|
//! - Chunks, embeds via Ollama, appends to the index parquet
|
||||||
|
//! - Clears stale flag on success
|
||||||
|
//!
|
||||||
|
//! Not in MVP:
|
||||||
|
//! - UPDATE semantics (same doc_id, new content) — would need content-hash
|
||||||
|
//! comparison per row
|
||||||
|
//! - Large-scale resilience (batching with checkpoints like the
|
||||||
|
//! supervisor) — MVP does it inline
|
||||||
|
//! - Lance backend — ADR-019 makes this straightforward later; MVP stays
|
||||||
|
//! on Parquet sidecar indexes
|
||||||
|
|
||||||
|
use std::collections::HashSet;
|
||||||
|
use std::sync::Arc;
|
||||||
|
|
||||||
|
use aibridge::client::{AiClient, EmbedRequest};
|
||||||
|
use arrow::array::{Array, Int32Array, Int64Array, StringArray};
|
||||||
|
use catalogd::registry::Registry;
|
||||||
|
use object_store::ObjectStore;
|
||||||
|
|
||||||
|
use crate::chunker::{self, TextChunk};
|
||||||
|
use crate::embedding_cache::EmbeddingCache;
|
||||||
|
use crate::index_registry::IndexRegistry;
|
||||||
|
use crate::store::{self, StoredEmbedding};
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, serde::Deserialize)]
|
||||||
|
pub struct RefreshRequest {
|
||||||
|
pub index_name: String,
|
||||||
|
/// Column with the document id (row identity).
|
||||||
|
pub id_column: String,
|
||||||
|
/// Column with the text to embed.
|
||||||
|
pub text_column: String,
|
||||||
|
/// Chunk size (chars). Defaults to 500.
|
||||||
|
#[serde(default)]
|
||||||
|
pub chunk_size: Option<usize>,
|
||||||
|
/// Overlap (chars). Defaults to 50.
|
||||||
|
#[serde(default)]
|
||||||
|
pub overlap: Option<usize>,
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, serde::Serialize)]
|
||||||
|
pub struct RefreshResult {
|
||||||
|
pub index_name: String,
|
||||||
|
pub dataset_name: String,
|
||||||
|
pub pre_existing_docs: usize,
|
||||||
|
pub dataset_docs: usize,
|
||||||
|
pub new_docs_embedded: usize,
|
||||||
|
pub new_chunks_embedded: usize,
|
||||||
|
pub total_embeddings_after: usize,
|
||||||
|
pub duration_secs: f32,
|
||||||
|
pub stale_cleared: bool,
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Full refresh pipeline. Takes dataset_name as URL param, body has the
|
||||||
|
/// column selectors.
|
||||||
|
pub async fn refresh_index(
|
||||||
|
dataset_name: &str,
|
||||||
|
req: &RefreshRequest,
|
||||||
|
store_: &Arc<dyn ObjectStore>,
|
||||||
|
registry: &Registry,
|
||||||
|
ai_client: &AiClient,
|
||||||
|
embedding_cache: &EmbeddingCache,
|
||||||
|
index_registry: &IndexRegistry,
|
||||||
|
) -> Result<RefreshResult, String> {
|
||||||
|
let t0 = std::time::Instant::now();
|
||||||
|
|
||||||
|
// 1. Find the dataset manifest, pull its object storage key
|
||||||
|
let manifest = registry
|
||||||
|
.get_by_name(dataset_name)
|
||||||
|
.await
|
||||||
|
.ok_or_else(|| format!("dataset not found: {dataset_name}"))?;
|
||||||
|
|
||||||
|
if manifest.objects.is_empty() {
|
||||||
|
return Err(format!("dataset '{dataset_name}' has no object references"));
|
||||||
|
}
|
||||||
|
|
||||||
|
// 2. Read dataset rows — extract (doc_id, text) pairs from the
|
||||||
|
// specified columns. Multi-object datasets: concat across all.
|
||||||
|
let mut doc_id_to_text: Vec<(String, String)> = Vec::new();
|
||||||
|
for obj in &manifest.objects {
|
||||||
|
let data = storaged::ops::get(store_, &obj.key).await
|
||||||
|
.map_err(|e| format!("read {}: {e}", obj.key))?;
|
||||||
|
let (schema, batches) = shared::arrow_helpers::parquet_to_record_batches(&data)
|
||||||
|
.map_err(|e| format!("parse {}: {e}", obj.key))?;
|
||||||
|
|
||||||
|
let id_idx = schema.index_of(&req.id_column)
|
||||||
|
.map_err(|_| format!("id column '{}' not in dataset schema", req.id_column))?;
|
||||||
|
let text_idx = schema.index_of(&req.text_column)
|
||||||
|
.map_err(|_| format!("text column '{}' not in dataset schema", req.text_column))?;
|
||||||
|
|
||||||
|
for batch in &batches {
|
||||||
|
let text_col = batch
|
||||||
|
.column(text_idx)
|
||||||
|
.as_any()
|
||||||
|
.downcast_ref::<StringArray>()
|
||||||
|
.ok_or_else(|| format!("text column '{}' is not Utf8", req.text_column))?;
|
||||||
|
|
||||||
|
// Accept Utf8, Int32, or Int64 as id — that covers CSV-derived
|
||||||
|
// and Postgres-imported schemas without forcing upstream casts.
|
||||||
|
let id_reader: Box<dyn Fn(usize) -> Option<String>> = {
|
||||||
|
let col = batch.column(id_idx);
|
||||||
|
if let Some(s) = col.as_any().downcast_ref::<StringArray>() {
|
||||||
|
let s = s.clone();
|
||||||
|
Box::new(move |row| if s.is_null(row) { None } else { Some(s.value(row).to_string()) })
|
||||||
|
} else if let Some(a) = col.as_any().downcast_ref::<Int32Array>() {
|
||||||
|
let a = a.clone();
|
||||||
|
Box::new(move |row| if a.is_null(row) { None } else { Some(a.value(row).to_string()) })
|
||||||
|
} else if let Some(a) = col.as_any().downcast_ref::<Int64Array>() {
|
||||||
|
let a = a.clone();
|
||||||
|
Box::new(move |row| if a.is_null(row) { None } else { Some(a.value(row).to_string()) })
|
||||||
|
} else {
|
||||||
|
return Err(format!(
|
||||||
|
"id column '{}' must be Utf8, Int32, or Int64 — got {}",
|
||||||
|
req.id_column,
|
||||||
|
col.data_type(),
|
||||||
|
));
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
for row in 0..batch.num_rows() {
|
||||||
|
if text_col.is_null(row) { continue; }
|
||||||
|
let Some(id) = id_reader(row) else { continue; };
|
||||||
|
let text = text_col.value(row).to_string();
|
||||||
|
if text.trim().is_empty() { continue; }
|
||||||
|
doc_id_to_text.push((id, text));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
let dataset_docs = doc_id_to_text.len();
|
||||||
|
tracing::info!("refresh '{}': dataset has {dataset_docs} rows", dataset_name);
|
||||||
|
|
||||||
|
// 3. Load existing embeddings (empty if no index yet)
|
||||||
|
let existing: Vec<StoredEmbedding> = match embedding_cache.get_or_load(&req.index_name).await {
|
||||||
|
Ok(arc) => arc.as_ref().clone(),
|
||||||
|
Err(_) => Vec::new(), // first-time index build
|
||||||
|
};
|
||||||
|
let pre_existing_docs: HashSet<String> = existing
|
||||||
|
.iter()
|
||||||
|
.map(|e| e.doc_id.clone())
|
||||||
|
.collect();
|
||||||
|
let pre_existing_count = pre_existing_docs.len();
|
||||||
|
|
||||||
|
// 4. Delta — rows whose doc_id isn't already indexed
|
||||||
|
let new_rows: Vec<(String, String)> = doc_id_to_text
|
||||||
|
.into_iter()
|
||||||
|
.filter(|(id, _)| !pre_existing_docs.contains(id))
|
||||||
|
.collect();
|
||||||
|
let new_docs = new_rows.len();
|
||||||
|
|
||||||
|
if new_docs == 0 {
|
||||||
|
tracing::info!("refresh '{}': no new docs to embed", dataset_name);
|
||||||
|
registry.clear_embeddings_stale(dataset_name).await?;
|
||||||
|
return Ok(RefreshResult {
|
||||||
|
index_name: req.index_name.clone(),
|
||||||
|
dataset_name: dataset_name.to_string(),
|
||||||
|
pre_existing_docs: pre_existing_count,
|
||||||
|
dataset_docs,
|
||||||
|
new_docs_embedded: 0,
|
||||||
|
new_chunks_embedded: 0,
|
||||||
|
total_embeddings_after: existing.len(),
|
||||||
|
duration_secs: t0.elapsed().as_secs_f32(),
|
||||||
|
stale_cleared: true,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// 5. Chunk the new rows
|
||||||
|
let chunk_size = req.chunk_size.unwrap_or(500);
|
||||||
|
let overlap = req.overlap.unwrap_or(50);
|
||||||
|
let doc_ids: Vec<String> = new_rows.iter().map(|(id, _)| id.clone()).collect();
|
||||||
|
let texts: Vec<String> = new_rows.iter().map(|(_, t)| t.clone()).collect();
|
||||||
|
let chunks: Vec<TextChunk> = chunker::chunk_column(
|
||||||
|
dataset_name, &doc_ids, &texts, chunk_size, overlap,
|
||||||
|
);
|
||||||
|
let new_chunks = chunks.len();
|
||||||
|
tracing::info!("refresh '{}': {} new docs -> {} chunks", dataset_name, new_docs, new_chunks);
|
||||||
|
|
||||||
|
// 6. Embed via Ollama (batched)
|
||||||
|
let batch_size = 32;
|
||||||
|
let mut all_vectors: Vec<Vec<f64>> = Vec::with_capacity(new_chunks);
|
||||||
|
for batch in chunks.chunks(batch_size) {
|
||||||
|
let batch_texts: Vec<String> = batch.iter().map(|c| c.text.clone()).collect();
|
||||||
|
let resp = ai_client
|
||||||
|
.embed(EmbedRequest { texts: batch_texts, model: None })
|
||||||
|
.await
|
||||||
|
.map_err(|e| format!("embed: {e}"))?;
|
||||||
|
all_vectors.extend(resp.embeddings);
|
||||||
|
}
|
||||||
|
|
||||||
|
// 7. Combine existing + new and write back as a single parquet
|
||||||
|
// (MVP — append-as-rewrite. ADR-019 points to Lance for true native
|
||||||
|
// append; for the Parquet sidecar we rewrite at refresh time.)
|
||||||
|
let mut new_stored: Vec<StoredEmbedding> = existing.clone();
|
||||||
|
for (chunk, vector) in chunks.into_iter().zip(all_vectors.iter()) {
|
||||||
|
new_stored.push(StoredEmbedding {
|
||||||
|
source: chunk.source,
|
||||||
|
doc_id: chunk.doc_id,
|
||||||
|
chunk_idx: chunk.chunk_idx,
|
||||||
|
chunk_text: chunk.text,
|
||||||
|
vector: vector.iter().map(|&x| x as f32).collect(),
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// We have to reconstruct chunks + vectors for store_embeddings — but
|
||||||
|
// store_embeddings takes &[TextChunk] and vectors separately. Convert
|
||||||
|
// the combined StoredEmbedding back to that shape.
|
||||||
|
let combined_chunks: Vec<TextChunk> = new_stored
|
||||||
|
.iter()
|
||||||
|
.map(|e| TextChunk {
|
||||||
|
source: e.source.clone(),
|
||||||
|
doc_id: e.doc_id.clone(),
|
||||||
|
chunk_idx: e.chunk_idx,
|
||||||
|
text: e.chunk_text.clone(),
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
let combined_vectors: Vec<Vec<f64>> = new_stored
|
||||||
|
.iter()
|
||||||
|
.map(|e| e.vector.iter().map(|&f| f as f64).collect())
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
let _key = store::store_embeddings(
|
||||||
|
store_, &req.index_name, &combined_chunks, &combined_vectors,
|
||||||
|
).await?;
|
||||||
|
|
||||||
|
// 8. Evict embedding cache — next read will pick up the new file
|
||||||
|
let _ = embedding_cache.evict(&req.index_name).await;
|
||||||
|
|
||||||
|
// 9. Update index registry metadata (row/chunk counts; others unchanged
|
||||||
|
// so we don't disturb the existing entry if present)
|
||||||
|
let _ = try_update_index_meta(index_registry, &req.index_name, new_stored.len()).await;
|
||||||
|
|
||||||
|
// 10. Clear stale flag
|
||||||
|
registry.clear_embeddings_stale(dataset_name).await?;
|
||||||
|
|
||||||
|
let total = new_stored.len();
|
||||||
|
Ok(RefreshResult {
|
||||||
|
index_name: req.index_name.clone(),
|
||||||
|
dataset_name: dataset_name.to_string(),
|
||||||
|
pre_existing_docs: pre_existing_count,
|
||||||
|
dataset_docs,
|
||||||
|
new_docs_embedded: new_docs,
|
||||||
|
new_chunks_embedded: new_chunks,
|
||||||
|
total_embeddings_after: total,
|
||||||
|
duration_secs: t0.elapsed().as_secs_f32(),
|
||||||
|
stale_cleared: true,
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Best-effort refresh of index registry metadata. If the index exists,
|
||||||
|
/// bump the chunk_count; if not, this is a no-op.
|
||||||
|
async fn try_update_index_meta(
|
||||||
|
index_registry: &IndexRegistry,
|
||||||
|
index_name: &str,
|
||||||
|
chunk_count: usize,
|
||||||
|
) -> Result<(), String> {
|
||||||
|
if let Some(mut meta) = index_registry.get(index_name).await {
|
||||||
|
meta.chunk_count = chunk_count;
|
||||||
|
index_registry.register(meta).await
|
||||||
|
} else {
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
@ -10,7 +10,8 @@ use serde::{Deserialize, Serialize};
|
|||||||
use std::sync::Arc;
|
use std::sync::Arc;
|
||||||
|
|
||||||
use aibridge::client::{AiClient, EmbedRequest};
|
use aibridge::client::{AiClient, EmbedRequest};
|
||||||
use crate::{chunker, embedding_cache, harness, hnsw, index_registry, jobs, rag, search, store, supervisor, trial};
|
use catalogd::registry::Registry as CatalogRegistry;
|
||||||
|
use crate::{chunker, embedding_cache, harness, hnsw, index_registry, jobs, rag, refresh, search, store, supervisor, trial};
|
||||||
|
|
||||||
#[derive(Clone)]
|
#[derive(Clone)]
|
||||||
pub struct VectorState {
|
pub struct VectorState {
|
||||||
@ -21,6 +22,9 @@ pub struct VectorState {
|
|||||||
pub hnsw_store: hnsw::HnswStore,
|
pub hnsw_store: hnsw::HnswStore,
|
||||||
pub embedding_cache: embedding_cache::EmbeddingCache,
|
pub embedding_cache: embedding_cache::EmbeddingCache,
|
||||||
pub trial_journal: trial::TrialJournal,
|
pub trial_journal: trial::TrialJournal,
|
||||||
|
/// Catalog registry — needed by the Phase C refresh path to mark/clear
|
||||||
|
/// staleness and look up dataset manifests.
|
||||||
|
pub catalog: CatalogRegistry,
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn router(state: VectorState) -> Router {
|
pub fn router(state: VectorState) -> Router {
|
||||||
@ -47,6 +51,9 @@ pub fn router(state: VectorState) -> Router {
|
|||||||
// Cache management
|
// Cache management
|
||||||
.route("/hnsw/cache/stats", get(cache_stats))
|
.route("/hnsw/cache/stats", get(cache_stats))
|
||||||
.route("/hnsw/cache/{index_name}", axum::routing::delete(cache_evict))
|
.route("/hnsw/cache/{index_name}", axum::routing::delete(cache_evict))
|
||||||
|
// Phase C: embedding refresh
|
||||||
|
.route("/refresh/{dataset_name}", post(refresh_dataset))
|
||||||
|
.route("/stale", get(list_stale))
|
||||||
.with_state(state)
|
.with_state(state)
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -666,3 +673,61 @@ async fn cache_evict(
|
|||||||
let ok = state.embedding_cache.evict(&index_name).await;
|
let ok = state.embedding_cache.evict(&index_name).await;
|
||||||
Json(serde_json::json!({ "evicted": ok, "index_name": index_name }))
|
Json(serde_json::json!({ "evicted": ok, "index_name": index_name }))
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// --- Phase C: embedding refresh ---
|
||||||
|
//
|
||||||
|
// Decouples "new row data arrived" from "re-embed everything." Ingest marks
|
||||||
|
// a dataset's embeddings stale (see catalogd::registry::mark_embeddings_stale);
|
||||||
|
// `/vectors/refresh/{dataset}` diffs existing embeddings against current
|
||||||
|
// rows, embeds only the new ones, appends to the index, and clears the
|
||||||
|
// stale flag.
|
||||||
|
|
||||||
|
async fn refresh_dataset(
|
||||||
|
State(state): State<VectorState>,
|
||||||
|
Path(dataset_name): Path<String>,
|
||||||
|
Json(req): Json<refresh::RefreshRequest>,
|
||||||
|
) -> Result<Json<refresh::RefreshResult>, (StatusCode, String)> {
|
||||||
|
tracing::info!(
|
||||||
|
"refresh requested for dataset '{}' -> index '{}'",
|
||||||
|
dataset_name, req.index_name,
|
||||||
|
);
|
||||||
|
match refresh::refresh_index(
|
||||||
|
&dataset_name,
|
||||||
|
&req,
|
||||||
|
&state.store,
|
||||||
|
&state.catalog,
|
||||||
|
&state.ai_client,
|
||||||
|
&state.embedding_cache,
|
||||||
|
&state.index_registry,
|
||||||
|
)
|
||||||
|
.await
|
||||||
|
{
|
||||||
|
Ok(result) => Ok(Json(result)),
|
||||||
|
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Serialize)]
|
||||||
|
struct StaleEntry {
|
||||||
|
dataset_name: String,
|
||||||
|
last_embedded_at: Option<String>,
|
||||||
|
stale_since: String,
|
||||||
|
refresh_policy: Option<shared::types::RefreshPolicy>,
|
||||||
|
}
|
||||||
|
|
||||||
|
async fn list_stale(State(state): State<VectorState>) -> impl IntoResponse {
|
||||||
|
let datasets = state.catalog.stale_datasets().await;
|
||||||
|
let entries: Vec<StaleEntry> = datasets
|
||||||
|
.into_iter()
|
||||||
|
.map(|d| StaleEntry {
|
||||||
|
dataset_name: d.name,
|
||||||
|
last_embedded_at: d.last_embedded_at.map(|t| t.to_rfc3339()),
|
||||||
|
stale_since: d
|
||||||
|
.embedding_stale_since
|
||||||
|
.map(|t| t.to_rfc3339())
|
||||||
|
.unwrap_or_default(),
|
||||||
|
refresh_policy: d.embedding_refresh_policy,
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
Json(entries)
|
||||||
|
}
|
||||||
|
|||||||
@ -150,6 +150,18 @@
|
|||||||
- `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
|
- `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
|
||||||
- Existing `POST /ingest/postgres/import` (structured config) preserved alongside
|
- Existing `POST /ingest/postgres/import` (structured config) preserved alongside
|
||||||
- 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
|
- 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
|
||||||
|
- [x] Phase B: Lance storage evaluation — 2026-04-16
|
||||||
|
- `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
|
||||||
|
- 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
|
||||||
|
- Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
|
||||||
|
- [x] Phase C: Decoupled embedding refresh — 2026-04-16
|
||||||
|
- `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
|
||||||
|
- `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
|
||||||
|
- Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
|
||||||
|
- `vectord::refresh::refresh_index` — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
|
||||||
|
- `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
|
||||||
|
- Id columns accept `Utf8`, `Int32`, `Int64`
|
||||||
|
- End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
|
||||||
- [ ] Database connector ingest (Postgres/MySQL)
|
- [ ] Database connector ingest (Postgres/MySQL)
|
||||||
- [ ] PDF OCR (Tesseract)
|
- [ ] PDF OCR (Tesseract)
|
||||||
- [ ] Scheduled ingest (cron)
|
- [ ] Scheduled ingest (cron)
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user