Phase E: Soft deletes (tombstones) for compliance-grade row deletion
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.
Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
actor, reason }
- All tombstones for a dataset must share one row_key_column —
enforced at write so the query-time filter remains a single
WHERE NOT IN (...) clause
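  A persisted record is one JSON object per line in the append log. For
  example (values illustrative, drawn from the end-to-end run below):
    {"dataset":"candidates","row_key_column":"candidate_id",
     "row_key_value":"CAND-000001","deleted_at":"2026-04-16T12:00:00Z",
     "actor":"admin","reason":"GDPR request #1234"}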
Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
(POST .../tombstones/compact will work once we expose it)
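  Resulting layout after three appends, assuming the batch-file naming
  sketched in the module docs (exact names depend on the append_log impl):
    _catalog/tombstones/candidates/batch_00000001.jsonl
    _catalog/tombstones/candidates/batch_00000002.jsonl
    _catalog/tombstones/candidates/batch_00000003.jsonl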
Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd
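  Call shape for the registry API (signature as in this diff; values
  illustrative):
    let ts = registry
        .add_tombstone("candidates", "candidate_id", "CAND-000001",
                       "admin", "GDPR request #1234")
        .await?;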
HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
{ row_key_column, row_key_values[], actor, reason }
Returns rows_tombstoned count + per-value failure list (207 on
partial success).
- GET same path lists active tombstones with full audit info.
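  Example exchange (host/port and values illustrative; request/response
  shapes match the handler types in this diff):
    curl -X POST http://localhost:8080/catalog/datasets/by-name/candidates/tombstone \
      -H 'Content-Type: application/json' \
      -d '{"row_key_column":"candidate_id",
           "row_key_values":["CAND-000001","CAND-000002","CAND-000003"],
           "actor":"admin","reason":"GDPR erasure"}'
    -> 201 {"dataset":"candidates","row_key_column":"candidate_id",
            "rows_tombstoned":3,"failures":[]}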
Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: the raw table registers as "__raw__{name}" and the
  public "{name}" becomes a DataFusion view:
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
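  For the candidates run below, the generated view SQL comes out as
  (modulo whitespace):
    SELECT * FROM "__raw__candidates"
    WHERE CAST("candidate_id" AS VARCHAR)
      NOT IN ('CAND-000001', 'CAND-000002', 'CAND-000003')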
End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
(Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk
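  The same checks expressed as queryd SQL (expected results in comments;
  the raw count is inference from the "still on disk" limit below):
    SELECT COUNT(*) FROM candidates;            -- 99997
    SELECT COUNT(*) FROM "__raw__candidates";   -- still 100000 until compaction
    SELECT * FROM candidates WHERE candidate_id
      IN ('CAND-000001','CAND-000002','CAND-000003');  -- 0 rows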
Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.
Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
doesn't read tombstones during merge. Tombstoned rows are still
on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
tombstone records carry their own actor+reason+timestamp so the
audit trail is intact, but cross-referencing with the mutation
event log would help compliance reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,3 +1,4 @@
 pub mod registry;
 pub mod service;
 pub mod grpc;
+pub mod tombstones;
@@ -1,7 +1,9 @@
 use shared::types::{
     AiView, ColumnMeta, DatasetId, DatasetManifest, FreshnessContract, Lineage, ObjectRef,
-    RefreshPolicy, SchemaFingerprint, Sensitivity,
+    RefreshPolicy, SchemaFingerprint, Sensitivity, Tombstone,
 };
+
+use crate::tombstones::TombstoneStore;
 use std::collections::HashMap;
 use std::sync::Arc;
 use tokio::sync::RwLock;
@@ -46,10 +48,12 @@ const VIEW_PREFIX: &str = "_catalog/views";
 
 /// In-memory dataset registry backed by manifest persistence in object storage.
 /// Also tracks AiViews (Phase D) — safe projections over base datasets.
+/// And tombstones (Phase E) — soft-delete markers applied at query time.
 #[derive(Clone)]
 pub struct Registry {
     datasets: Arc<RwLock<HashMap<DatasetId, DatasetManifest>>>,
     views: Arc<RwLock<HashMap<String, AiView>>>,
+    tombstones: TombstoneStore,
     store: Arc<dyn ObjectStore>,
 }
 
@@ -58,10 +62,44 @@ impl Registry {
         Self {
             datasets: Arc::new(RwLock::new(HashMap::new())),
             views: Arc::new(RwLock::new(HashMap::new())),
+            tombstones: TombstoneStore::new(store.clone()),
             store,
         }
     }
 
+    pub fn tombstones(&self) -> &TombstoneStore {
+        &self.tombstones
+    }
+
+    /// Add a tombstone for a dataset row. Validates that the dataset
+    /// exists. Idempotent at the endpoint layer — callers dedup if needed.
+    pub async fn add_tombstone(
+        &self,
+        dataset: &str,
+        row_key_column: &str,
+        row_key_value: &str,
+        actor: &str,
+        reason: &str,
+    ) -> Result<Tombstone, String> {
+        if self.get_by_name(dataset).await.is_none() {
+            return Err(format!("dataset not found: {dataset}"));
+        }
+        let ts = Tombstone {
+            dataset: dataset.to_string(),
+            row_key_column: row_key_column.to_string(),
+            row_key_value: row_key_value.to_string(),
+            deleted_at: chrono::Utc::now(),
+            actor: actor.to_string(),
+            reason: reason.to_string(),
+        };
+        self.tombstones.append(&ts).await?;
+        Ok(ts)
+    }
+
+    pub async fn list_tombstones(&self, dataset: &str) -> Result<Vec<Tombstone>, String> {
+        self.tombstones.list(dataset).await
+    }
+
     /// Rebuild in-memory index from persisted manifests + views on startup.
     pub async fn rebuild(&self) -> Result<usize, String> {
         let keys = ops::list(&self.store, Some(MANIFEST_PREFIX)).await?;
@@ -25,6 +25,8 @@ pub fn router(registry: Registry) -> Router {
         // Phase D: AI-safe views
         .route("/views", post(create_view).get(list_views))
         .route("/views/{name}", get(get_view).delete(delete_view))
+        // Phase E: soft-delete tombstones
+        .route("/datasets/by-name/{name}/tombstone", post(tombstone_rows).get(list_tombstones))
         .with_state(registry)
 }
@@ -269,3 +271,70 @@ async fn delete_view(
         Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
     }
 }
+
+// --- Phase E: soft-delete tombstones ---
+
+#[derive(Deserialize)]
+struct TombstoneRequest {
+    row_key_column: String,
+    row_key_values: Vec<String>,
+    #[serde(default)]
+    actor: String,
+    #[serde(default)]
+    reason: String,
+}
+
+#[derive(Serialize)]
+struct TombstoneResponse {
+    dataset: String,
+    row_key_column: String,
+    rows_tombstoned: usize,
+    failures: Vec<String>,
+}
+
+async fn tombstone_rows(
+    State(registry): State<Registry>,
+    Path(name): Path<String>,
+    Json(req): Json<TombstoneRequest>,
+) -> impl IntoResponse {
+    if req.row_key_values.is_empty() {
+        return Err((StatusCode::BAD_REQUEST, "row_key_values is empty".to_string()));
+    }
+
+    let mut ok = 0;
+    let mut failures = Vec::new();
+    for value in &req.row_key_values {
+        match registry
+            .add_tombstone(&name, &req.row_key_column, value, &req.actor, &req.reason)
+            .await
+        {
+            Ok(_) => ok += 1,
+            Err(e) => failures.push(format!("{value}: {e}")),
+        }
+    }
+
+    let status = if ok > 0 && failures.is_empty() {
+        StatusCode::CREATED
+    } else if ok > 0 {
+        StatusCode::MULTI_STATUS
+    } else {
+        StatusCode::BAD_REQUEST
+    };
+
+    Ok((status, Json(TombstoneResponse {
+        dataset: name,
+        row_key_column: req.row_key_column,
+        rows_tombstoned: ok,
+        failures,
+    })))
+}
+
+async fn list_tombstones(
+    State(registry): State<Registry>,
+    Path(name): Path<String>,
+) -> impl IntoResponse {
+    match registry.list_tombstones(&name).await {
+        Ok(ts) => Ok(Json(ts)),
+        Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
+    }
+}
crates/catalogd/src/tombstones.rs (new file, 124 lines)
@@ -0,0 +1,124 @@
+//! Soft-delete tombstone storage (Phase E).
+//!
+//! One append-log per dataset at `_catalog/tombstones/{dataset}/batch_*.jsonl`.
+//! Uses the shared `storaged::append_log::AppendLog` pattern so appends
+//! are write-once (never rewrites existing files) and can be compacted.
+//!
+//! The store exposes a per-dataset cache of active tombstones for the
+//! hot path (queryd filter construction) — that's why events are pulled
+//! into an in-memory map on every call rather than scanning object
+//! storage repeatedly.
+
+use object_store::ObjectStore;
+use shared::types::Tombstone;
+use std::collections::HashMap;
+use std::sync::Arc;
+use storaged::append_log::{AppendLog, CompactStats};
+use tokio::sync::RwLock;
+
+const TOMBSTONE_PREFIX: &str = "_catalog/tombstones";
+
+#[derive(Clone)]
+pub struct TombstoneStore {
+    store: Arc<dyn ObjectStore>,
+    logs: Arc<RwLock<HashMap<String, Arc<AppendLog>>>>,
+}
+
+impl TombstoneStore {
+    pub fn new(store: Arc<dyn ObjectStore>) -> Self {
+        Self {
+            store,
+            logs: Arc::new(RwLock::new(HashMap::new())),
+        }
+    }
+
+    fn prefix_for(dataset: &str) -> String {
+        // Sanitize dataset name for filesystem safety.
+        let safe: String = dataset
+            .chars()
+            .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '-' { c } else { '_' })
+            .collect();
+        format!("{TOMBSTONE_PREFIX}/{}", safe)
+    }
+
+    async fn log_for(&self, dataset: &str) -> Arc<AppendLog> {
+        if let Some(log) = self.logs.read().await.get(dataset) {
+            return log.clone();
+        }
+        let mut guard = self.logs.write().await;
+        if let Some(log) = guard.get(dataset) {
+            return log.clone();
+        }
+        // Threshold of 1 — every tombstone is high-value. Compliance/audit
+        // doesn't tolerate "lost on restart"; we trade a small file count
+        // for guaranteed durability. Compaction merges later if volume grows.
+        let log = Arc::new(
+            AppendLog::new(self.store.clone(), Self::prefix_for(dataset))
+                .with_flush_threshold(1),
+        );
+        guard.insert(dataset.to_string(), log.clone());
+        log
+    }
+
+    /// Append one tombstone. Validates that the `row_key_column` matches
+    /// the column already used for this dataset (all tombstones for a
+    /// dataset share one key column so the query filter is well-defined).
+    /// Forces a flush so the tombstone is durable before this call returns.
+    pub async fn append(&self, ts: &Tombstone) -> Result<(), String> {
+        let existing = self.list(&ts.dataset).await?;
+        if let Some(prior) = existing.first() {
+            if prior.row_key_column != ts.row_key_column {
+                return Err(format!(
+                    "dataset '{}' already uses '{}' as tombstone key; cannot mix with '{}'",
+                    ts.dataset, prior.row_key_column, ts.row_key_column,
+                ));
+            }
+        }
+        let line = serde_json::to_vec(ts).map_err(|e| e.to_string())?;
+        let log = self.log_for(&ts.dataset).await;
+        log.append(line).await?;
+        // Belt-and-suspenders: explicit flush in case the threshold is
+        // ever raised. Tombstones must be durable on return.
+        log.flush().await
+    }
+
+    /// All tombstones for a dataset (chronological).
+    pub async fn list(&self, dataset: &str) -> Result<Vec<Tombstone>, String> {
+        let log = self.log_for(dataset).await;
+        let lines = log.read_all().await?;
+        let mut out = Vec::with_capacity(lines.len());
+        for line in lines {
+            match serde_json::from_slice::<Tombstone>(&line) {
+                Ok(t) => out.push(t),
+                Err(e) => tracing::warn!("tombstones/{}: skip malformed entry: {e}", dataset),
+            }
+        }
+        Ok(out)
+    }
+
+    /// Per-dataset grouped view used by queryd — returns a map of
+    /// `{dataset -> (row_key_column, set_of_values)}` for every dataset
+    /// that has any tombstones.
+    pub async fn all_grouped(
+        &self,
+        datasets: &[String],
+    ) -> Result<HashMap<String, (String, Vec<String>)>, String> {
+        let mut grouped = HashMap::new();
+        for dataset in datasets {
+            let ts = match self.list(dataset).await {
+                Ok(ts) => ts,
+                Err(_) => continue,
+            };
+            if ts.is_empty() { continue; }
+            let col = ts[0].row_key_column.clone();
+            let values: Vec<String> = ts.iter().map(|t| t.row_key_value.clone()).collect();
+            grouped.insert(dataset.clone(), (col, values));
+        }
+        Ok(grouped)
+    }
+
+    pub async fn compact(&self, dataset: &str) -> Result<CompactStats, String> {
+        let log = self.log_for(dataset).await;
+        log.compact().await
+    }
+}
@@ -117,6 +117,19 @@ impl QueryEngine {
     async fn build_context(&self) -> Result<SessionContext, String> {
         let ctx = SessionContext::new();
 
+        // Phase E: snapshot tombstones by dataset before registering tables
+        // so we can wrap tombstoned tables in a filter view. The underlying
+        // base table is registered under an internal name `__raw__{dataset}`
+        // and the public `{dataset}` name becomes the filtered view.
+        let all_dataset_names: Vec<String> = self.registry.list().await
+            .iter().map(|d| d.name.clone()).collect();
+        let tombstones_by_dataset = self
+            .registry
+            .tombstones()
+            .all_grouped(&all_dataset_names)
+            .await
+            .unwrap_or_default();
+
         // Federation layer 2: register every configured bucket as its own
         // DataFusion ObjectStore under a distinct URL scheme. Each
         // dataset's ObjectRef.bucket determines which store DataFusion
@@ -186,17 +199,57 @@
         let table = ListingTable::try_new(config)
             .map_err(|e| format!("table creation failed for {}: {e}", dataset.name))?;
 
         // Tolerate duplicate manifest entries for the same name —
         // pre-existing pipeline::ingest_file behavior creates a fresh
         // dataset id on every ingest. First registration wins; later
        // ones are skipped with a warning rather than failing the
         // whole context build.
-        if let Err(e) = ctx.register_table(&dataset.name, Arc::new(table)) {
+        // Decide the registration name: if this dataset has any
+        // tombstones, the raw table gets an internal name and the
+        // public name becomes a filtered view.
+        let tombstone_entry = tombstones_by_dataset.get(&dataset.name);
+        let register_name = if tombstone_entry.is_some() {
+            format!("__raw__{}", dataset.name)
+        } else {
+            dataset.name.clone()
+        };
+
+        if let Err(e) = ctx.register_table(register_name.as_str(), Arc::new(table)) {
             let msg = e.to_string();
             if msg.contains("already exists") {
-                tracing::debug!("skip duplicate manifest registration: {}", dataset.name);
+                tracing::debug!("skip duplicate manifest registration: {}", register_name);
                 continue;
             } else {
-                return Err(format!("table registration failed for {}: {}", dataset.name, msg));
+                return Err(format!("table registration failed for {}: {}", register_name, msg));
             }
         }
+
+        // If there are tombstones, register the public name as a
+        // filtered view that excludes tombstoned row_key_values.
+        if let Some((key_col, values)) = tombstone_entry {
+            // Build WHERE NOT IN (...) — quote values to be SQL-safe.
+            // For string keys this is a literal list; for integer keys
+            // CAST(col AS VARCHAR) makes the comparison unambiguous.
+            let quoted: Vec<String> = values.iter()
+                .map(|v| format!("'{}'", v.replace('\'', "''")))
+                .collect();
+            let sql = format!(
+                "SELECT * FROM \"{}\" WHERE CAST(\"{}\" AS VARCHAR) NOT IN ({})",
+                register_name, key_col, quoted.join(", "),
+            );
+            tracing::debug!(
+                "tombstone filter for '{}': {} row_keys excluded",
+                dataset.name, values.len(),
+            );
+            match ctx.sql(&sql).await {
+                Ok(df) => {
+                    if let Err(e) = ctx.register_table(dataset.name.as_str(), df.into_view()) {
+                        let msg = e.to_string();
+                        if msg.contains("already exists") {
+                            tracing::debug!("skip duplicate tombstone view: {}", dataset.name);
+                        } else {
+                            tracing::warn!("tombstone view registration failed for {}: {msg}", dataset.name);
+                        }
+                    }
+                }
+                Err(e) => {
+                    tracing::warn!("tombstone view SQL failed for '{}': {e}", dataset.name);
+                }
+            }
+        }
     }
@@ -196,6 +196,35 @@ pub struct AiView {
     pub description: String,
 }
 
+/// Soft-delete marker (Phase E).
+///
+/// Tombstones live beside the dataset in `_catalog/tombstones/{dataset}/`
+/// as append-log JSONL. Query-time filter in queryd reads all tombstones
+/// for each dataset and wraps the base table in a DataFusion view that
+/// excludes tombstoned rows. Physical deletion happens later (compaction),
+/// so the row count immediately reflects the delete but data is still on
+/// disk until compact runs. That's deliberate — it gives a reversal
+/// window and keeps the event journal audit trail intact.
+///
+/// All tombstones for a given dataset must use the same `row_key_column`
+/// (enforced on write); otherwise the query-time filter can't be built
+/// as a single WHERE clause.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct Tombstone {
+    pub dataset: String,
+    /// Column name that identifies the row (e.g. "candidate_id").
+    pub row_key_column: String,
+    /// Value of that column for the tombstoned row.
+    pub row_key_value: String,
+    pub deleted_at: chrono::DateTime<chrono::Utc>,
+    /// Human / system actor responsible for the delete (audit).
+    #[serde(default)]
+    pub actor: String,
+    /// Why (e.g. "GDPR request #1234", "user-requested erasure").
+    #[serde(default)]
+    pub reason: String,
+}
+
 /// How a column's values should be transformed before being returned.
 /// `Mask` is the most common — keeps a few visible chars, replaces the
 /// rest with `*`. `Hash` returns SHA-256 of the value for join keys you
@@ -154,6 +154,15 @@
 - `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
 - 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
 - Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
+- [x] Phase E: Soft deletes (tombstones) — 2026-04-16
+  - `shared::types::Tombstone` — { dataset, row_key_column, row_key_value, deleted_at, actor, reason }
+  - `catalogd::tombstones::TombstoneStore` per-dataset append-log at `_catalog/tombstones/{dataset}/`, flush_threshold=1 + explicit flush so every tombstone is durable on return (compliance requirement)
+  - All tombstones for a dataset must share the same `row_key_column` (validated at write — query filter is built as a single WHERE NOT IN clause)
+  - `Registry::add_tombstone / list_tombstones`
+  - Endpoint: `POST /catalog/datasets/by-name/{name}/tombstone` accepting `{row_key_column, row_key_values[], actor, reason}`; companion `GET` lists active tombstones
+  - `queryd::context::build_context` wraps tombstoned tables: raw goes to `__raw__{name}`, public name becomes a DataFusion view with `WHERE CAST(col AS VARCHAR) NOT IN (...)` filter
+  - End-to-end on candidates: tombstone 3 IDs, COUNT drops 100,000 → 99,997, specific WHERE returns empty, AiView candidates_safe transitively excludes them too, restart preserves all tombstones
+  - Limits / not in MVP: physical compaction (Phase 8 doesn't yet read tombstones during merge); journal integration (tombstones don't yet emit Phase 9 mutation events — covered by audit fields on the tombstone itself)
 - [x] Phase D: AI-safe views — 2026-04-16
   - `shared::types::AiView` — name, base_dataset, columns whitelist, optional row_filter, column_redactions
   - `shared::types::Redaction` — Null | Hash | Mask { keep_prefix, keep_suffix }