Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
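Usage sketch (illustrative, not part of the diff): driving the two endpoints
from a client. The bind address/port, index name, and column names are
invented, and reqwest/tokio are assumed only for the example.

    // Sketch only — port, index name, and column names are assumptions.
    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::Client::new();

        // Which datasets currently have stale embeddings?
        let stale: serde_json::Value = client
            .get("http://127.0.0.1:8080/vectors/stale")
            .send().await?
            .json().await?;
        println!("stale: {stale}");

        // Trigger a delta refresh for one of them.
        let result: serde_json::Value = client
            .post("http://127.0.0.1:8080/vectors/refresh/threat_intel")
            .json(&serde_json::json!({
                "index_name": "threat_intel_idx", // assumed index name
                "id_column": "id",                // assumed column names
                "text_column": "summary"
            }))
            .send().await?
            .json().await?;
        println!("new_docs_embedded = {}", result["new_docs_embedded"]);
        Ok(())
    }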
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing; a possible shape is sketched after this list
- OnAppend policy auto-trigger - the variant only declares intent; the
actual background trigger is deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron runner itself
is separate
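A rough sketch of what per-row content hashing could look like if UPDATE
detection lands later. Nothing below is in this commit; the function and the
stored hash map are hypothetical.

    // Hypothetical: pick doc_ids to (re-)embed by comparing a hash of each
    // row's text against the hash recorded at the last refresh.
    use std::collections::HashMap;
    use std::hash::{Hash, Hasher};

    fn docs_needing_embedding(
        rows: &[(String, String)],           // (doc_id, text) read from the dataset
        last_hashes: &HashMap<String, u64>,  // doc_id -> text hash at last refresh
    ) -> Vec<String> {
        rows.iter()
            .filter_map(|(id, text)| {
                let mut h = std::collections::hash_map::DefaultHasher::new();
                text.hash(&mut h);
                match last_hashes.get(id) {
                    Some(prev) if *prev == h.finish() => None, // unchanged row
                    _ => Some(id.clone()),                     // new or updated row
                }
            })
            .collect()
    }

    fn main() {
        let rows = vec![
            ("d1".to_string(), "old text".to_string()),
            ("d2".to_string(), "brand new".to_string()),
        ];
        let mut last = HashMap::new();
        let mut h = std::collections::hash_map::DefaultHasher::new();
        "old text".hash(&mut h);
        last.insert("d1".to_string(), h.finish());
        // Only d2 needs embedding — d1's text hash is unchanged.
        assert_eq!(docs_needing_embedding(&rows, &last), vec!["d2".to_string()]);
    }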
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
commit 97a376482c (parent 76f6fba5de)
@@ -1,6 +1,6 @@
 use shared::types::{
     DatasetId, DatasetManifest, ObjectRef, SchemaFingerprint,
-    ColumnMeta, Lineage, FreshnessContract, Sensitivity,
+    ColumnMeta, Lineage, FreshnessContract, RefreshPolicy, Sensitivity,
 };
 use std::collections::HashMap;
 use std::sync::Arc;
@@ -20,6 +20,8 @@ pub struct MetadataUpdate {
     pub lineage: Option<Lineage>,
     pub freshness: Option<FreshnessContract>,
     pub row_count: Option<u64>,
+    // Phase C embedding freshness
+    pub embedding_refresh_policy: Option<RefreshPolicy>,
 }
 
 const MANIFEST_PREFIX: &str = "_catalog/manifests";
@@ -78,6 +80,9 @@ impl Registry {
             freshness: None,
             tags: vec![],
             row_count: None,
+            last_embedded_at: None,
+            embedding_stale_since: None,
+            embedding_refresh_policy: None,
         };
 
         // Write-ahead: persist before in-memory update
@@ -111,6 +116,7 @@ impl Registry {
         if let Some(lineage) = updates.lineage { manifest.lineage = Some(lineage); }
         if let Some(freshness) = updates.freshness { manifest.freshness = Some(freshness); }
         if let Some(count) = updates.row_count { manifest.row_count = Some(count); }
+        if let Some(policy) = updates.embedding_refresh_policy { manifest.embedding_refresh_policy = Some(policy); }
         manifest.updated_at = chrono::Utc::now();
 
         // Persist
@@ -242,6 +248,63 @@ impl Registry {
         (ok, err)
     }
 
+    /// Mark a dataset's embeddings as stale (row-level data has been written
+    /// since the last embedding refresh). Idempotent — setting stale when
+    /// already stale is a no-op. Only marks stale if the dataset has been
+    /// embedded before — a never-embedded dataset doesn't need a stale flag
+    /// (it just needs an initial index build). Called from the ingest path.
+    pub async fn mark_embeddings_stale(&self, name: &str) -> Result<(), String> {
+        let mut datasets = self.datasets.write().await;
+        let manifest = datasets
+            .values_mut()
+            .find(|d| d.name == name)
+            .ok_or_else(|| format!("dataset not found: {name}"))?;
+
+        if manifest.last_embedded_at.is_none() {
+            return Ok(()); // never embedded -> no stale semantics yet
+        }
+        if manifest.embedding_stale_since.is_none() {
+            manifest.embedding_stale_since = Some(chrono::Utc::now());
+            manifest.updated_at = chrono::Utc::now();
+
+            let key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
+            let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
+            ops::put(&self.store, &key, json.into()).await?;
+            tracing::info!("marked embeddings stale for dataset '{name}'");
+        }
+        Ok(())
+    }
+
+    /// Clear the stale marker and set `last_embedded_at = now`.
+    /// Called by the embedding refresh pipeline once it finishes.
+    pub async fn clear_embeddings_stale(&self, name: &str) -> Result<(), String> {
+        let mut datasets = self.datasets.write().await;
+        let manifest = datasets
+            .values_mut()
+            .find(|d| d.name == name)
+            .ok_or_else(|| format!("dataset not found: {name}"))?;
+
+        let now = chrono::Utc::now();
+        manifest.embedding_stale_since = None;
+        manifest.last_embedded_at = Some(now);
+        manifest.updated_at = now;
+
+        let key = format!("{MANIFEST_PREFIX}/{}.json", manifest.id);
+        let json = serde_json::to_vec_pretty(manifest).map_err(|e| e.to_string())?;
+        ops::put(&self.store, &key, json.into()).await?;
+        Ok(())
+    }
+
+    /// List datasets whose `embedding_stale_since` is set — they need a refresh.
+    pub async fn stale_datasets(&self) -> Vec<DatasetManifest> {
+        let datasets = self.datasets.read().await;
+        datasets
+            .values()
+            .filter(|d| d.embedding_stale_since.is_some())
+            .cloned()
+            .collect()
+    }
+
     /// Add objects to an existing dataset.
     pub async fn add_objects(
         &self,
@@ -98,6 +98,7 @@ async fn main() {
     hnsw_store: vectord::hnsw::HnswStore::new(),
     embedding_cache: vectord::embedding_cache::EmbeddingCache::new(store.clone()),
     trial_journal: vectord::trial::TrialJournal::new(store.clone()),
+    catalog: registry.clone(),
 }
 }))
 .nest("/workspaces", queryd::workspace_service::router(workspace_mgr))

@@ -147,6 +147,10 @@ pub async fn ingest_file(
         ..Default::default()
     }).await;
 
+    // Phase C: if this dataset already had embeddings, they're now stale.
+    // mark_embeddings_stale is a no-op for never-embedded datasets.
+    let _ = registry.mark_embeddings_stale(&safe_name).await;
+
     Ok(IngestResult {
         dataset_name: safe_name,
         file_type: format!("{:?}", file_type),

@@ -271,6 +271,10 @@ async fn ingest_db_stream(
         ..Default::default()
     }).await;
 
+    // Phase C: mark embeddings stale if the dataset already had a vector
+    // index. No-op for newly-created datasets.
+    let _ = state.registry.mark_embeddings_stale(&dataset_name).await;
+
     Ok((StatusCode::CREATED, Json(serde_json::json!({
         "dataset_name": dataset_name,
         "table": stream_result.table,
@@ -116,4 +116,44 @@ pub struct DatasetManifest {
     /// Row count (updated on ingest/compact)
     #[serde(default)]
     pub row_count: Option<u64>,
+
+    // --- Embedding freshness (Phase C) ---
+
+    /// When the attached vector index was last refreshed. `None` means this
+    /// dataset has no vector index yet, or has never been embedded.
+    #[serde(default)]
+    pub last_embedded_at: Option<chrono::DateTime<chrono::Utc>>,
+
+    /// When data was written that hasn't yet been embedded. `Some(t)` means
+    /// the vector index is out-of-date as of timestamp `t`. Cleared on refresh.
+    #[serde(default)]
+    pub embedding_stale_since: Option<chrono::DateTime<chrono::Utc>>,
+
+    /// How this dataset wants stale embeddings handled.
+    #[serde(default)]
+    pub embedding_refresh_policy: Option<RefreshPolicy>,
 }
+
+/// Controls what happens when new data is written to a dataset with an
+/// attached vector index.
+///
+/// - `Manual` (default): data writes set `embedding_stale_since`; nothing
+///   embeds until an operator or agent calls `/vectors/refresh/{dataset}`.
+/// - `OnAppend`: ingest fires a background refresh immediately after writing.
+///   Suitable for datasets where vector freshness matters more than ingest
+///   latency.
+/// - `Scheduled(cron)`: a timer or external scheduler triggers refresh at
+///   the named cadence. The scheduler itself is not in this ADR scope —
+///   this just declares the intent so operators can see which policy a
+///   dataset expects.
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
+#[serde(tag = "kind", rename_all = "snake_case")]
+pub enum RefreshPolicy {
+    Manual,
+    OnAppend,
+    Scheduled { cron: String },
+}
+
+impl Default for RefreshPolicy {
+    fn default() -> Self { Self::Manual }
+}
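Not part of the diff — a self-contained sketch of the wire shape the serde
attributes above give RefreshPolicy (internally tagged, snake_case variant
names). The enum is redeclared locally only so the example compiles on its own.

    use serde::{Deserialize, Serialize};

    // Mirrors the enum as committed; for illustration only.
    #[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
    #[serde(tag = "kind", rename_all = "snake_case")]
    enum RefreshPolicy {
        Manual,
        OnAppend,
        Scheduled { cron: String },
    }

    fn main() {
        let p = RefreshPolicy::Scheduled { cron: "0 3 * * *".into() };
        // Prints {"kind":"scheduled","cron":"0 3 * * *"}.
        println!("{}", serde_json::to_string(&p).unwrap());
        // Manual / OnAppend serialize as {"kind":"manual"} / {"kind":"on_append"}.
    }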
@@ -7,6 +7,7 @@ edition = "2024"
 shared = { path = "../shared" }
 storaged = { path = "../storaged" }
 aibridge = { path = "../aibridge" }
+catalogd = { path = "../catalogd" }
 tokio = { workspace = true }
 axum = { workspace = true }
 serde = { workspace = true }

@@ -4,6 +4,7 @@ pub mod harness;
 pub mod hnsw;
 pub mod index_registry;
 pub mod jobs;
+pub mod refresh;
 pub mod store;
 pub mod search;
 pub mod rag;
crates/vectord/src/refresh.rs — new file, 279 lines
@@ -0,0 +1,279 @@
//! Phase C: Decoupled embedding refresh.
//!
//! When a dataset's row-level data changes, the vector index it feeds is
//! stale. Historically we coupled ingest and embedding — writing new rows
//! also re-embedded every row, which is fine at 500 rows but blows up at
//! 100K+. The llms3.com architecture calls out "asynchronous vector
//! refresh cycles independent from transactional mutations" as the right
//! pattern.
//!
//! This module implements the refresh side. The ingest path marks
//! embeddings stale (see catalogd::registry::mark_embeddings_stale); this
//! code clears that staleness by embedding only rows whose `doc_id` isn't
//! already in the existing vector index.
//!
//! Scope — MVP:
//! - Reads the dataset's Parquet, extracts (doc_id, text) pairs from named
//!   columns
//! - Loads existing embeddings via EmbeddingCache
//! - Filters to rows whose doc_id is NOT in the existing set
//! - Chunks, embeds via Ollama, appends to the index parquet
//! - Clears stale flag on success
//!
//! Not in MVP:
//! - UPDATE semantics (same doc_id, new content) — would need content-hash
//!   comparison per row
//! - Large-scale resilience (batching with checkpoints like the
//!   supervisor) — MVP does it inline
//! - Lance backend — ADR-019 makes this straightforward later; MVP stays
//!   on Parquet sidecar indexes

use std::collections::HashSet;
use std::sync::Arc;

use aibridge::client::{AiClient, EmbedRequest};
use arrow::array::{Array, Int32Array, Int64Array, StringArray};
use catalogd::registry::Registry;
use object_store::ObjectStore;

use crate::chunker::{self, TextChunk};
use crate::embedding_cache::EmbeddingCache;
use crate::index_registry::IndexRegistry;
use crate::store::{self, StoredEmbedding};

#[derive(Debug, Clone, serde::Deserialize)]
pub struct RefreshRequest {
    pub index_name: String,
    /// Column with the document id (row identity).
    pub id_column: String,
    /// Column with the text to embed.
    pub text_column: String,
    /// Chunk size (chars). Defaults to 500.
    #[serde(default)]
    pub chunk_size: Option<usize>,
    /// Overlap (chars). Defaults to 50.
    #[serde(default)]
    pub overlap: Option<usize>,
}

#[derive(Debug, Clone, serde::Serialize)]
pub struct RefreshResult {
    pub index_name: String,
    pub dataset_name: String,
    pub pre_existing_docs: usize,
    pub dataset_docs: usize,
    pub new_docs_embedded: usize,
    pub new_chunks_embedded: usize,
    pub total_embeddings_after: usize,
    pub duration_secs: f32,
    pub stale_cleared: bool,
}

/// Full refresh pipeline. Takes dataset_name as URL param, body has the
/// column selectors.
pub async fn refresh_index(
    dataset_name: &str,
    req: &RefreshRequest,
    store_: &Arc<dyn ObjectStore>,
    registry: &Registry,
    ai_client: &AiClient,
    embedding_cache: &EmbeddingCache,
    index_registry: &IndexRegistry,
) -> Result<RefreshResult, String> {
    let t0 = std::time::Instant::now();

    // 1. Find the dataset manifest, pull its object storage key
    let manifest = registry
        .get_by_name(dataset_name)
        .await
        .ok_or_else(|| format!("dataset not found: {dataset_name}"))?;

    if manifest.objects.is_empty() {
        return Err(format!("dataset '{dataset_name}' has no object references"));
    }

    // 2. Read dataset rows — extract (doc_id, text) pairs from the
    //    specified columns. Multi-object datasets: concat across all.
    let mut doc_id_to_text: Vec<(String, String)> = Vec::new();
    for obj in &manifest.objects {
        let data = storaged::ops::get(store_, &obj.key).await
            .map_err(|e| format!("read {}: {e}", obj.key))?;
        let (schema, batches) = shared::arrow_helpers::parquet_to_record_batches(&data)
            .map_err(|e| format!("parse {}: {e}", obj.key))?;

        let id_idx = schema.index_of(&req.id_column)
            .map_err(|_| format!("id column '{}' not in dataset schema", req.id_column))?;
        let text_idx = schema.index_of(&req.text_column)
            .map_err(|_| format!("text column '{}' not in dataset schema", req.text_column))?;

        for batch in &batches {
            let text_col = batch
                .column(text_idx)
                .as_any()
                .downcast_ref::<StringArray>()
                .ok_or_else(|| format!("text column '{}' is not Utf8", req.text_column))?;

            // Accept Utf8, Int32, or Int64 as id — that covers CSV-derived
            // and Postgres-imported schemas without forcing upstream casts.
            let id_reader: Box<dyn Fn(usize) -> Option<String>> = {
                let col = batch.column(id_idx);
                if let Some(s) = col.as_any().downcast_ref::<StringArray>() {
                    let s = s.clone();
                    Box::new(move |row| if s.is_null(row) { None } else { Some(s.value(row).to_string()) })
                } else if let Some(a) = col.as_any().downcast_ref::<Int32Array>() {
                    let a = a.clone();
                    Box::new(move |row| if a.is_null(row) { None } else { Some(a.value(row).to_string()) })
                } else if let Some(a) = col.as_any().downcast_ref::<Int64Array>() {
                    let a = a.clone();
                    Box::new(move |row| if a.is_null(row) { None } else { Some(a.value(row).to_string()) })
                } else {
                    return Err(format!(
                        "id column '{}' must be Utf8, Int32, or Int64 — got {}",
                        req.id_column,
                        col.data_type(),
                    ));
                }
            };

            for row in 0..batch.num_rows() {
                if text_col.is_null(row) { continue; }
                let Some(id) = id_reader(row) else { continue; };
                let text = text_col.value(row).to_string();
                if text.trim().is_empty() { continue; }
                doc_id_to_text.push((id, text));
            }
        }
    }
    let dataset_docs = doc_id_to_text.len();
    tracing::info!("refresh '{}': dataset has {dataset_docs} rows", dataset_name);

    // 3. Load existing embeddings (empty if no index yet)
    let existing: Vec<StoredEmbedding> = match embedding_cache.get_or_load(&req.index_name).await {
        Ok(arc) => arc.as_ref().clone(),
        Err(_) => Vec::new(), // first-time index build
    };
    let pre_existing_docs: HashSet<String> = existing
        .iter()
        .map(|e| e.doc_id.clone())
        .collect();
    let pre_existing_count = pre_existing_docs.len();

    // 4. Delta — rows whose doc_id isn't already indexed
    let new_rows: Vec<(String, String)> = doc_id_to_text
        .into_iter()
        .filter(|(id, _)| !pre_existing_docs.contains(id))
        .collect();
    let new_docs = new_rows.len();

    if new_docs == 0 {
        tracing::info!("refresh '{}': no new docs to embed", dataset_name);
        registry.clear_embeddings_stale(dataset_name).await?;
        return Ok(RefreshResult {
            index_name: req.index_name.clone(),
            dataset_name: dataset_name.to_string(),
            pre_existing_docs: pre_existing_count,
            dataset_docs,
            new_docs_embedded: 0,
            new_chunks_embedded: 0,
            total_embeddings_after: existing.len(),
            duration_secs: t0.elapsed().as_secs_f32(),
            stale_cleared: true,
        });
    }

    // 5. Chunk the new rows
    let chunk_size = req.chunk_size.unwrap_or(500);
    let overlap = req.overlap.unwrap_or(50);
    let doc_ids: Vec<String> = new_rows.iter().map(|(id, _)| id.clone()).collect();
    let texts: Vec<String> = new_rows.iter().map(|(_, t)| t.clone()).collect();
    let chunks: Vec<TextChunk> = chunker::chunk_column(
        dataset_name, &doc_ids, &texts, chunk_size, overlap,
    );
    let new_chunks = chunks.len();
    tracing::info!("refresh '{}': {} new docs -> {} chunks", dataset_name, new_docs, new_chunks);

    // 6. Embed via Ollama (batched)
    let batch_size = 32;
    let mut all_vectors: Vec<Vec<f64>> = Vec::with_capacity(new_chunks);
    for batch in chunks.chunks(batch_size) {
        let batch_texts: Vec<String> = batch.iter().map(|c| c.text.clone()).collect();
        let resp = ai_client
            .embed(EmbedRequest { texts: batch_texts, model: None })
            .await
            .map_err(|e| format!("embed: {e}"))?;
        all_vectors.extend(resp.embeddings);
    }

    // 7. Combine existing + new and write back as a single parquet
    //    (MVP — append-as-rewrite. ADR-019 points to Lance for true native
    //    append; for the Parquet sidecar we rewrite at refresh time.)
    let mut new_stored: Vec<StoredEmbedding> = existing.clone();
    for (chunk, vector) in chunks.into_iter().zip(all_vectors.iter()) {
        new_stored.push(StoredEmbedding {
            source: chunk.source,
            doc_id: chunk.doc_id,
            chunk_idx: chunk.chunk_idx,
            chunk_text: chunk.text,
            vector: vector.iter().map(|&x| x as f32).collect(),
        });
    }

    // We have to reconstruct chunks + vectors for store_embeddings — but
    // store_embeddings takes &[TextChunk] and vectors separately. Convert
    // the combined StoredEmbedding back to that shape.
    let combined_chunks: Vec<TextChunk> = new_stored
        .iter()
        .map(|e| TextChunk {
            source: e.source.clone(),
            doc_id: e.doc_id.clone(),
            chunk_idx: e.chunk_idx,
            text: e.chunk_text.clone(),
        })
        .collect();
    let combined_vectors: Vec<Vec<f64>> = new_stored
        .iter()
        .map(|e| e.vector.iter().map(|&f| f as f64).collect())
        .collect();

    let _key = store::store_embeddings(
        store_, &req.index_name, &combined_chunks, &combined_vectors,
    ).await?;

    // 8. Evict embedding cache — next read will pick up the new file
    let _ = embedding_cache.evict(&req.index_name).await;

    // 9. Update index registry metadata (row/chunk counts; others unchanged
    //    so we don't disturb the existing entry if present)
    let _ = try_update_index_meta(index_registry, &req.index_name, new_stored.len()).await;

    // 10. Clear stale flag
    registry.clear_embeddings_stale(dataset_name).await?;

    let total = new_stored.len();
    Ok(RefreshResult {
        index_name: req.index_name.clone(),
        dataset_name: dataset_name.to_string(),
        pre_existing_docs: pre_existing_count,
        dataset_docs,
        new_docs_embedded: new_docs,
        new_chunks_embedded: new_chunks,
        total_embeddings_after: total,
        duration_secs: t0.elapsed().as_secs_f32(),
        stale_cleared: true,
    })
}

/// Best-effort refresh of index registry metadata. If the index exists,
/// bump the chunk_count; if not, this is a no-op.
async fn try_update_index_meta(
    index_registry: &IndexRegistry,
    index_name: &str,
    chunk_count: usize,
) -> Result<(), String> {
    if let Some(mut meta) = index_registry.get(index_name).await {
        meta.chunk_count = chunk_count;
        index_registry.register(meta).await
    } else {
        Ok(())
    }
}
@@ -10,7 +10,8 @@ use serde::{Deserialize, Serialize};
 use std::sync::Arc;
 
 use aibridge::client::{AiClient, EmbedRequest};
-use crate::{chunker, embedding_cache, harness, hnsw, index_registry, jobs, rag, search, store, supervisor, trial};
+use catalogd::registry::Registry as CatalogRegistry;
+use crate::{chunker, embedding_cache, harness, hnsw, index_registry, jobs, rag, refresh, search, store, supervisor, trial};
 
 #[derive(Clone)]
 pub struct VectorState {
@@ -21,6 +22,9 @@ pub struct VectorState {
     pub hnsw_store: hnsw::HnswStore,
     pub embedding_cache: embedding_cache::EmbeddingCache,
     pub trial_journal: trial::TrialJournal,
+    /// Catalog registry — needed by the Phase C refresh path to mark/clear
+    /// staleness and look up dataset manifests.
+    pub catalog: CatalogRegistry,
 }
 
 pub fn router(state: VectorState) -> Router {
@@ -47,6 +51,9 @@ pub fn router(state: VectorState) -> Router {
         // Cache management
         .route("/hnsw/cache/stats", get(cache_stats))
         .route("/hnsw/cache/{index_name}", axum::routing::delete(cache_evict))
+        // Phase C: embedding refresh
+        .route("/refresh/{dataset_name}", post(refresh_dataset))
+        .route("/stale", get(list_stale))
         .with_state(state)
 }
 
@@ -666,3 +673,61 @@ async fn cache_evict(
     let ok = state.embedding_cache.evict(&index_name).await;
     Json(serde_json::json!({ "evicted": ok, "index_name": index_name }))
 }
+
+// --- Phase C: embedding refresh ---
+//
+// Decouples "new row data arrived" from "re-embed everything." Ingest marks
+// a dataset's embeddings stale (see catalogd::registry::mark_embeddings_stale);
+// `/vectors/refresh/{dataset}` diffs existing embeddings against current
+// rows, embeds only the new ones, appends to the index, and clears the
+// stale flag.
+
+async fn refresh_dataset(
+    State(state): State<VectorState>,
+    Path(dataset_name): Path<String>,
+    Json(req): Json<refresh::RefreshRequest>,
+) -> Result<Json<refresh::RefreshResult>, (StatusCode, String)> {
+    tracing::info!(
+        "refresh requested for dataset '{}' -> index '{}'",
+        dataset_name, req.index_name,
+    );
+    match refresh::refresh_index(
+        &dataset_name,
+        &req,
+        &state.store,
+        &state.catalog,
+        &state.ai_client,
+        &state.embedding_cache,
+        &state.index_registry,
+    )
+    .await
+    {
+        Ok(result) => Ok(Json(result)),
+        Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
+    }
+}
+
+#[derive(Serialize)]
+struct StaleEntry {
+    dataset_name: String,
+    last_embedded_at: Option<String>,
+    stale_since: String,
+    refresh_policy: Option<shared::types::RefreshPolicy>,
+}
+
+async fn list_stale(State(state): State<VectorState>) -> impl IntoResponse {
+    let datasets = state.catalog.stale_datasets().await;
+    let entries: Vec<StaleEntry> = datasets
+        .into_iter()
+        .map(|d| StaleEntry {
+            dataset_name: d.name,
+            last_embedded_at: d.last_embedded_at.map(|t| t.to_rfc3339()),
+            stale_since: d
+                .embedding_stale_since
+                .map(|t| t.to_rfc3339())
+                .unwrap_or_default(),
+            refresh_policy: d.embedding_refresh_policy,
+        })
+        .collect();
+    Json(entries)
+}
@@ -150,6 +150,18 @@
   - `POST /ingest/db` endpoint: `{dsn, table, dataset_name?, batch_size?, order_by?, limit?}` → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
   - Existing `POST /ingest/postgres/import` (structured config) preserved alongside
   - 4 DSN-parser unit tests + live end-to-end test against `knowledge_base.team_runs` (586 rows, 13 cols, 6 batches, 196ms)
+- [x] Phase B: Lance storage evaluation — 2026-04-16
+  - `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
+  - 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
+  - Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
+- [x] Phase C: Decoupled embedding refresh — 2026-04-16
+  - `DatasetManifest`: `last_embedded_at`, `embedding_stale_since`, `embedding_refresh_policy` (Manual | OnAppend | Scheduled)
+  - `Registry::mark_embeddings_stale` / `clear_embeddings_stale` / `stale_datasets`
+  - Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
+  - `vectord::refresh::refresh_index` — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
+  - `POST /vectors/refresh/{dataset}` + `GET /vectors/stale`
+  - Id columns accept `Utf8`, `Int32`, `Int64`
+  - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
 - [ ] Database connector ingest (Postgres/MySQL)
 - [ ] PDF OCR (Tesseract)
 - [ ] Scheduled ingest (cron)