Phase E: Soft deletes (tombstones) for compliance-grade row deletion

Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.

Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
              actor, reason }
- All tombstones for a dataset must share one row_key_column —
  enforced at write so the query-time filter remains a single
  WHERE NOT IN (...) clause
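
The one-key-column invariant can be sketched as a standalone check (hypothetical `check_key_column` helper for illustration; the real validation lives in `TombstoneStore::append` below):

```rust
// Hypothetical sketch of the write-time invariant: every tombstone for a
// dataset must reuse the key column established by the first tombstone,
// so the query-time filter stays a single NOT IN over one column.
fn check_key_column(existing_key: Option<&str>, incoming: &str) -> Result<(), String> {
    match existing_key {
        Some(prior) if prior != incoming => Err(format!(
            "dataset already uses '{prior}' as tombstone key; cannot mix with '{incoming}'"
        )),
        _ => Ok(()),
    }
}

fn main() {
    // First tombstone for a dataset establishes the key column.
    assert!(check_key_column(None, "candidate_id").is_ok());
    // Later tombstones must match it.
    assert!(check_key_column(Some("candidate_id"), "candidate_id").is_ok());
    assert!(check_key_column(Some("candidate_id"), "email").is_err());
}
```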

Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
  are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
  (POST .../tombstones/compact will work once we expose it)
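
For illustration, the per-dataset prefix derivation (lifted from `TombstoneStore::prefix_for` in this commit) sanitizes the dataset name before using it as a path segment:

```rust
const TOMBSTONE_PREFIX: &str = "_catalog/tombstones";

// Mirrors TombstoneStore::prefix_for: any character outside
// [A-Za-z0-9_-] is replaced with '_' for filesystem safety.
fn prefix_for(dataset: &str) -> String {
    let safe: String = dataset
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '-' { c } else { '_' })
        .collect();
    format!("{TOMBSTONE_PREFIX}/{safe}")
}

fn main() {
    assert_eq!(prefix_for("candidates"), "_catalog/tombstones/candidates");
    // Separators and dots are neutralized, so a hostile dataset name
    // cannot escape the tombstone prefix.
    assert_eq!(prefix_for("../etc"), "_catalog/tombstones/___etc");
}
```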

Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd

HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
    { row_key_column, row_key_values[], actor, reason }
  Returns rows_tombstoned count + per-value failure list (207 on
  partial success).
- GET same path lists active tombstones with full audit info.
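
A minimal sketch of the status-code contract (hypothetical `response_status` helper mirroring the branch in `tombstone_rows` below):

```rust
// All values tombstoned -> 201 Created; some -> 207 Multi-Status with a
// per-value failure list; none -> 400 Bad Request.
fn response_status(ok: usize, failures: usize) -> u16 {
    if ok > 0 && failures == 0 {
        201 // Created: every row_key_value accepted
    } else if ok > 0 {
        207 // Multi-Status: partial success, failures reported per value
    } else {
        400 // Bad Request: nothing was tombstoned
    }
}

fn main() {
    assert_eq!(response_status(3, 0), 201);
    assert_eq!(response_status(2, 1), 207);
    assert_eq!(response_status(0, 3), 400);
}
```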

Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
  becomes DataFusion view with
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
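
The generated view SQL can be sketched as a pure function (hypothetical `tombstone_view_sql`; the real construction is inline in `queryd::context::build_context`):

```rust
// Values are single-quote-escaped and the key column is CAST to VARCHAR
// so string and integer key columns compare identically.
fn tombstone_view_sql(raw_table: &str, key_col: &str, values: &[&str]) -> String {
    let quoted: Vec<String> = values
        .iter()
        .map(|v| format!("'{}'", v.replace('\'', "''")))
        .collect();
    format!(
        "SELECT * FROM \"{raw_table}\" WHERE CAST(\"{key_col}\" AS VARCHAR) NOT IN ({})",
        quoted.join(", ")
    )
}

fn main() {
    let sql = tombstone_view_sql("__raw__candidates", "candidate_id", &["CAND-000001", "CAND-000002"]);
    assert_eq!(
        sql,
        "SELECT * FROM \"__raw__candidates\" WHERE CAST(\"candidate_id\" AS VARCHAR) NOT IN ('CAND-000001', 'CAND-000002')"
    );
    // Embedded quotes are doubled rather than left to break the statement.
    assert!(tombstone_view_sql("__raw__t", "k", &["O'Brien"]).contains("'O''Brien'"));
}
```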

End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
  (Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk

Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what will stop non-admins from reaching
__raw__* tables.

Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
  doesn't read tombstones during merge. Tombstoned rows are still
  on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
  tombstone records carry their own actor+reason+timestamp so the
  audit trail is intact, but cross-referencing with the mutation
  event log would help compliance reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root 2026-04-16 09:40:48 -05:00
parent 09fd446c8d
commit d87f2ccac6
7 changed files with 332 additions and 9 deletions


@@ -1,3 +1,4 @@
pub mod registry;
pub mod service;
pub mod grpc;
pub mod tombstones;


@@ -1,7 +1,9 @@
use shared::types::{
AiView, ColumnMeta, DatasetId, DatasetManifest, FreshnessContract, Lineage, ObjectRef,
RefreshPolicy, SchemaFingerprint, Sensitivity,
RefreshPolicy, SchemaFingerprint, Sensitivity, Tombstone,
};
use crate::tombstones::TombstoneStore;
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;
@@ -46,10 +48,12 @@ const VIEW_PREFIX: &str = "_catalog/views";
/// In-memory dataset registry backed by manifest persistence in object storage.
/// Also tracks AiViews (Phase D) — safe projections over base datasets.
/// And tombstones (Phase E) — soft-delete markers applied at query time.
#[derive(Clone)]
pub struct Registry {
datasets: Arc<RwLock<HashMap<DatasetId, DatasetManifest>>>,
views: Arc<RwLock<HashMap<String, AiView>>>,
tombstones: TombstoneStore,
store: Arc<dyn ObjectStore>,
}
@@ -58,10 +62,44 @@ impl Registry {
Self {
datasets: Arc::new(RwLock::new(HashMap::new())),
views: Arc::new(RwLock::new(HashMap::new())),
tombstones: TombstoneStore::new(store.clone()),
store,
}
}
pub fn tombstones(&self) -> &TombstoneStore {
&self.tombstones
}
/// Add a tombstone for a dataset row. Validates that the dataset
/// exists. Idempotent at the endpoint layer — callers dedup if needed.
pub async fn add_tombstone(
&self,
dataset: &str,
row_key_column: &str,
row_key_value: &str,
actor: &str,
reason: &str,
) -> Result<Tombstone, String> {
if self.get_by_name(dataset).await.is_none() {
return Err(format!("dataset not found: {dataset}"));
}
let ts = Tombstone {
dataset: dataset.to_string(),
row_key_column: row_key_column.to_string(),
row_key_value: row_key_value.to_string(),
deleted_at: chrono::Utc::now(),
actor: actor.to_string(),
reason: reason.to_string(),
};
self.tombstones.append(&ts).await?;
Ok(ts)
}
pub async fn list_tombstones(&self, dataset: &str) -> Result<Vec<Tombstone>, String> {
self.tombstones.list(dataset).await
}
/// Rebuild in-memory index from persisted manifests + views on startup.
pub async fn rebuild(&self) -> Result<usize, String> {
let keys = ops::list(&self.store, Some(MANIFEST_PREFIX)).await?;


@@ -25,6 +25,8 @@ pub fn router(registry: Registry) -> Router {
// Phase D: AI-safe views
.route("/views", post(create_view).get(list_views))
.route("/views/{name}", get(get_view).delete(delete_view))
// Phase E: soft-delete tombstones
.route("/datasets/by-name/{name}/tombstone", post(tombstone_rows).get(list_tombstones))
.with_state(registry)
}
@@ -269,3 +271,70 @@ async fn delete_view(
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}
// --- Phase E: soft-delete tombstones ---
#[derive(Deserialize)]
struct TombstoneRequest {
row_key_column: String,
row_key_values: Vec<String>,
#[serde(default)]
actor: String,
#[serde(default)]
reason: String,
}
#[derive(Serialize)]
struct TombstoneResponse {
dataset: String,
row_key_column: String,
rows_tombstoned: usize,
failures: Vec<String>,
}
async fn tombstone_rows(
State(registry): State<Registry>,
Path(name): Path<String>,
Json(req): Json<TombstoneRequest>,
) -> impl IntoResponse {
if req.row_key_values.is_empty() {
return Err((StatusCode::BAD_REQUEST, "row_key_values is empty".to_string()));
}
let mut ok = 0;
let mut failures = Vec::new();
for value in &req.row_key_values {
match registry
.add_tombstone(&name, &req.row_key_column, value, &req.actor, &req.reason)
.await
{
Ok(_) => ok += 1,
Err(e) => failures.push(format!("{value}: {e}")),
}
}
let status = if ok > 0 && failures.is_empty() {
StatusCode::CREATED
} else if ok > 0 {
StatusCode::MULTI_STATUS
} else {
StatusCode::BAD_REQUEST
};
Ok((status, Json(TombstoneResponse {
dataset: name,
row_key_column: req.row_key_column,
rows_tombstoned: ok,
failures,
})))
}
async fn list_tombstones(
State(registry): State<Registry>,
Path(name): Path<String>,
) -> impl IntoResponse {
match registry.list_tombstones(&name).await {
Ok(ts) => Ok(Json(ts)),
Err(e) => Err((StatusCode::INTERNAL_SERVER_ERROR, e)),
}
}


@@ -0,0 +1,124 @@
//! Soft-delete tombstone storage (Phase E).
//!
//! One append-log per dataset at `_catalog/tombstones/{dataset}/batch_*.jsonl`.
//! Uses the shared `storaged::append_log::AppendLog` pattern so appends
//! are write-once (never rewrites existing files) and can be compacted.
//!
//! The store exposes a per-dataset cache of active tombstones for the
//! hot path (queryd filter construction) — that's why events are pulled
//! into an in-memory map on every call rather than scanning object
//! storage repeatedly.
use object_store::ObjectStore;
use shared::types::Tombstone;
use std::collections::HashMap;
use std::sync::Arc;
use storaged::append_log::{AppendLog, CompactStats};
use tokio::sync::RwLock;
const TOMBSTONE_PREFIX: &str = "_catalog/tombstones";
#[derive(Clone)]
pub struct TombstoneStore {
store: Arc<dyn ObjectStore>,
logs: Arc<RwLock<HashMap<String, Arc<AppendLog>>>>,
}
impl TombstoneStore {
pub fn new(store: Arc<dyn ObjectStore>) -> Self {
Self {
store,
logs: Arc::new(RwLock::new(HashMap::new())),
}
}
fn prefix_for(dataset: &str) -> String {
// Sanitize dataset name for filesystem safety.
let safe: String = dataset
.chars()
.map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '-' { c } else { '_' })
.collect();
format!("{TOMBSTONE_PREFIX}/{}", safe)
}
async fn log_for(&self, dataset: &str) -> Arc<AppendLog> {
if let Some(log) = self.logs.read().await.get(dataset) {
return log.clone();
}
let mut guard = self.logs.write().await;
if let Some(log) = guard.get(dataset) {
return log.clone();
}
// Threshold of 1 — every tombstone is high-value. Compliance/audit
// doesn't tolerate "lost on restart"; we trade a small file count
// for guaranteed durability. Compaction merges later if volume grows.
let log = Arc::new(
AppendLog::new(self.store.clone(), Self::prefix_for(dataset))
.with_flush_threshold(1),
);
guard.insert(dataset.to_string(), log.clone());
log
}
/// Append one tombstone. Validates that the `row_key_column` matches
/// the column already used for this dataset (all tombstones for a
/// dataset share one key column so the query filter is well-defined).
/// Forces a flush so the tombstone is durable before this call returns.
pub async fn append(&self, ts: &Tombstone) -> Result<(), String> {
let existing = self.list(&ts.dataset).await?;
if let Some(prior) = existing.first() {
if prior.row_key_column != ts.row_key_column {
return Err(format!(
"dataset '{}' already uses '{}' as tombstone key; cannot mix with '{}'",
ts.dataset, prior.row_key_column, ts.row_key_column,
));
}
}
let line = serde_json::to_vec(ts).map_err(|e| e.to_string())?;
let log = self.log_for(&ts.dataset).await;
log.append(line).await?;
// Belt-and-suspenders: explicit flush in case the threshold is
// ever raised. Tombstones must be durable on return.
log.flush().await
}
/// All tombstones for a dataset (chronological).
pub async fn list(&self, dataset: &str) -> Result<Vec<Tombstone>, String> {
let log = self.log_for(dataset).await;
let lines = log.read_all().await?;
let mut out = Vec::with_capacity(lines.len());
for line in lines {
match serde_json::from_slice::<Tombstone>(&line) {
Ok(t) => out.push(t),
Err(e) => tracing::warn!("tombstones/{}: skip malformed entry: {e}", dataset),
}
}
Ok(out)
}
/// Per-dataset grouped view used by queryd — returns a map of
/// `{dataset -> (row_key_column, set_of_values)}` for every dataset
/// that has any tombstones.
pub async fn all_grouped(
&self,
datasets: &[String],
) -> Result<HashMap<String, (String, Vec<String>)>, String> {
let mut grouped = HashMap::new();
for dataset in datasets {
let ts = match self.list(dataset).await {
Ok(ts) => ts,
Err(_) => continue,
};
if ts.is_empty() { continue; }
let col = ts[0].row_key_column.clone();
let values: Vec<String> = ts.iter().map(|t| t.row_key_value.clone()).collect();
grouped.insert(dataset.clone(), (col, values));
}
Ok(grouped)
}
pub async fn compact(&self, dataset: &str) -> Result<CompactStats, String> {
let log = self.log_for(dataset).await;
log.compact().await
}
}


@@ -117,6 +117,19 @@ impl QueryEngine {
async fn build_context(&self) -> Result<SessionContext, String> {
let ctx = SessionContext::new();
// Phase E: snapshot tombstones by dataset before registering tables
// so we can wrap tombstoned tables in a filter view. The underlying
// base table is registered under an internal name `__raw__{dataset}`
// and the public `{dataset}` name becomes the filtered view.
let all_dataset_names: Vec<String> = self.registry.list().await
.iter().map(|d| d.name.clone()).collect();
let tombstones_by_dataset = self
.registry
.tombstones()
.all_grouped(&all_dataset_names)
.await
.unwrap_or_default();
// Federation layer 2: register every configured bucket as its own
// DataFusion ObjectStore under a distinct URL scheme. Each
// dataset's ObjectRef.bucket determines which store DataFusion
@@ -186,17 +199,57 @@ impl QueryEngine {
let table = ListingTable::try_new(config)
.map_err(|e| format!("table creation failed for {}: {e}", dataset.name))?;
// Tolerate duplicate manifest entries for the same name —
// pre-existing pipeline::ingest_file behavior creates a fresh
// dataset id on every ingest. First registration wins; later
// ones are skipped with a warning rather than failing the
// whole context build.
if let Err(e) = ctx.register_table(&dataset.name, Arc::new(table)) {
// Decide the registration name: if this dataset has any
// tombstones, the raw table gets an internal name and the
// public name becomes a filtered view.
let tombstone_entry = tombstones_by_dataset.get(&dataset.name);
let register_name = if tombstone_entry.is_some() {
format!("__raw__{}", dataset.name)
} else {
dataset.name.clone()
};
if let Err(e) = ctx.register_table(register_name.as_str(), Arc::new(table)) {
let msg = e.to_string();
if msg.contains("already exists") {
tracing::debug!("skip duplicate manifest registration: {}", dataset.name);
tracing::debug!("skip duplicate manifest registration: {}", register_name);
continue;
} else {
return Err(format!("table registration failed for {}: {}", dataset.name, msg));
return Err(format!("table registration failed for {}: {}", register_name, msg));
}
}
// If there are tombstones, register the public name as a
// filtered view that excludes tombstoned row_key_values.
if let Some((key_col, values)) = tombstone_entry {
// Build WHERE NOT IN (...) — quote values to be SQL-safe.
// For string keys this is a literal list; for integer keys
// CAST(col AS VARCHAR) makes the comparison unambiguous.
let quoted: Vec<String> = values.iter()
.map(|v| format!("'{}'", v.replace('\'', "''")))
.collect();
let sql = format!(
"SELECT * FROM \"{}\" WHERE CAST(\"{}\" AS VARCHAR) NOT IN ({})",
register_name, key_col, quoted.join(", "),
);
tracing::debug!(
"tombstone filter for '{}': {} row_keys excluded",
dataset.name, values.len(),
);
match ctx.sql(&sql).await {
Ok(df) => {
if let Err(e) = ctx.register_table(dataset.name.as_str(), df.into_view()) {
let msg = e.to_string();
if msg.contains("already exists") {
tracing::debug!("skip duplicate tombstone view: {}", dataset.name);
} else {
tracing::warn!("tombstone view registration failed for {}: {msg}", dataset.name);
}
}
}
Err(e) => {
tracing::warn!("tombstone view SQL failed for '{}': {e}", dataset.name);
}
}
}
}


@@ -196,6 +196,35 @@ pub struct AiView {
pub description: String,
}
/// Soft-delete marker (Phase E).
///
/// Tombstones live beside the dataset in `_catalog/tombstones/{dataset}/`
/// as append-log JSONL. Query-time filter in queryd reads all tombstones
/// for each dataset and wraps the base table in a DataFusion view that
/// excludes tombstoned rows. Physical deletion happens later (compaction),
/// so the row count immediately reflects the delete but data is still on
/// disk until compact runs. That's deliberate — it gives a reversal
/// window and keeps the event journal audit trail intact.
///
/// All tombstones for a given dataset must use the same `row_key_column`
/// (enforced on write); otherwise the query-time filter can't be built
/// as a single WHERE clause.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Tombstone {
pub dataset: String,
/// Column name that identifies the row (e.g. "candidate_id").
pub row_key_column: String,
/// Value of that column for the tombstoned row.
pub row_key_value: String,
pub deleted_at: chrono::DateTime<chrono::Utc>,
/// Human / system actor responsible for the delete (audit).
#[serde(default)]
pub actor: String,
/// Why (e.g. "GDPR request #1234", "user-requested erasure").
#[serde(default)]
pub reason: String,
}
/// How a column's values should be transformed before being returned.
/// `Mask` is the most common — keeps a few visible chars, replaces the
/// rest with `*`. `Hash` returns SHA-256 of the value for join keys you


@@ -154,6 +154,15 @@
- `crates/lance-bench` standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
- 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
- Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
- [x] Phase E: Soft deletes (tombstones) — 2026-04-16
- `shared::types::Tombstone` — { dataset, row_key_column, row_key_value, deleted_at, actor, reason }
- `catalogd::tombstones::TombstoneStore` per-dataset append-log at `_catalog/tombstones/{dataset}/`, flush_threshold=1 + explicit flush so every tombstone is durable on return (compliance requirement)
- All tombstones for a dataset must share the same `row_key_column` (validated at write — query filter is built as a single WHERE NOT IN clause)
- `Registry::add_tombstone / list_tombstones`
- Endpoint: `POST /catalog/datasets/by-name/{name}/tombstone` accepting `{row_key_column, row_key_values[], actor, reason}`; companion `GET` lists active tombstones
- `queryd::context::build_context` wraps tombstoned tables: raw goes to `__raw__{name}`, public name becomes a DataFusion view with `WHERE CAST(col AS VARCHAR) NOT IN (...)` filter
- End-to-end on candidates: tombstone 3 IDs, COUNT drops 100,000 → 99,997, specific WHERE returns empty, AiView candidates_safe transitively excludes them too, restart preserves all tombstones
- Limits / not in MVP: physical compaction (Phase 8 doesn't yet read tombstones during merge); journal integration (tombstones don't yet emit Phase 9 mutation events — covered by audit fields on the tombstone itself)
- [x] Phase D: AI-safe views — 2026-04-16
- `shared::types::AiView` — name, base_dataset, columns whitelist, optional row_filter, column_redactions
- `shared::types::Redaction` — Null | Hash | Mask { keep_prefix, keep_suffix }