root 4e1c400f5d Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.

delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
  BooleanArray keep-mask per batch (True where row_key_value is NOT
  in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
  for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
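
Call-site shape, roughly (a sketch: `tombstone_store.list` and the
surrounding wiring are assumptions, only compact() and clear() are
confirmed above):

    let tombstones = tombstone_store.list(dataset).await?;  // assumed accessor
    let result = delta::compact(&store, dataset, &base_key, pk_col, &tombstones).await?;
    tracing::info!("dropped {} rows", result.rows_dropped_by_tombstones);
    // Clear only after compact returns Ok; an Err leaves both the
    // base file and the tombstone set untouched.
    tombstone_store.clear(dataset).await?;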

Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone filtering
producing multiple batches. It corrupted candidates.parquet on the
first test run (restored from a UI fixture copy — a good argument for
keeping test data in the repo).
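
The difference in miniature (record_batch_to_parquet is the shared
helper; schema/error plumbing elided, this is a sketch rather than the
shipped code):

    // BROKEN: each helper call emits a COMPLETE Parquet file, footer
    // included. Concatenating N of them glues N files together; a
    // reader resolves one footer and the other row groups are invisible.
    let mut buf = Vec::new();
    for b in &batches {
        buf.extend_from_slice(&record_batch_to_parquet(b)?);
    }

    // FIXED: one ArrowWriter streams every batch and close() writes
    // exactly one footer at the end: a single valid Parquet file.
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    for b in &batches {
        writer.write(b)?;
    }
    writer.close()?;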

Fix:
- Single ArrowWriter per compaction, writes every batch into one
  properly-footered Parquet
- Snappy compression to match ingest defaults (with no compression set,
  the rewrite inflated the file 3×: 10.5 MB → 34 MB)
- Verify-before-swap: parse written buf back to confirm row count
  matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
  temp; only then delete delta files. Any error along the way leaves
  the original base intact.

TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.
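
TombstoneStore internals live elsewhere; a plausible shape of clear(),
assuming batch files sit under a per-dataset prefix and the AppendLog
cache is a keyed map (both assumptions):

    pub async fn clear(&self, dataset: &str) -> Result<(), String> {
        // Drop every tombstone batch file for this dataset...
        let prefix = format!("tombstones/{dataset}/");
        for key in ops::list(&self.store, Some(&prefix)).await? {
            ops::delete(&self.store, &key).await?;
        }
        // ...then evict the cached AppendLog so the next read rebuilds
        // from (now empty) storage instead of serving stale entries.
        self.logs.lock().await.remove(dataset);
        Ok(())
    }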

QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.
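
The accessor itself is a one-liner, something like (field name assumed):

    impl QueryEngine {
        /// Registry handle so queryd handlers can reach the tombstone
        /// store directly instead of routing through gateway state.
        pub fn catalog(&self) -> Arc<Registry> {
            Arc::clone(&self.catalog)
        }
    }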

E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
  99997 (the 3 IDs are physically gone from disk)

PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.

Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
  POST /query/compact)
- Compaction of the tombstone batch files themselves (at
  flush_threshold=1 they grow by one file per tombstone;
  TombstoneStore::compact exists but is not auto-called)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:38:30 -05:00

//! Delta store for incremental updates.
//!
//! Instead of rewriting an entire Parquet file to change one row,
//! we write small delta files. At query time, deltas are merged with
//! the base. Periodic compaction merges deltas into the base file.

use arrow::array::{Array, BooleanArray, Int32Array, Int64Array, RecordBatch, StringArray};
use arrow::compute::filter_record_batch;
use bytes::Bytes;
use object_store::ObjectStore;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use shared::arrow_helpers::{parquet_to_record_batches, record_batch_to_parquet};
use shared::types::Tombstone;
use std::collections::HashSet;
use std::sync::Arc;
use storaged::ops;

/// Write a delta file for a dataset (new/updated rows).
pub async fn write_delta(
    store: &Arc<dyn ObjectStore>,
    dataset_name: &str,
    batch: &RecordBatch,
) -> Result<String, String> {
    let ts = chrono::Utc::now().timestamp_millis();
    let key = format!("deltas/{dataset_name}/{ts}.parquet");
    let parquet = record_batch_to_parquet(batch)?;
    ops::put(store, &key, parquet).await?;
    tracing::info!("wrote delta for '{}': {} rows at {}", dataset_name, batch.num_rows(), key);
    Ok(key)
}

/// List all delta files for a dataset.
pub async fn list_deltas(
    store: &Arc<dyn ObjectStore>,
    dataset_name: &str,
) -> Result<Vec<String>, String> {
    let prefix = format!("deltas/{dataset_name}/");
    ops::list(store, Some(&prefix)).await
}

/// Load all delta batches for a dataset.
pub async fn load_deltas(
    store: &Arc<dyn ObjectStore>,
    dataset_name: &str,
) -> Result<Vec<RecordBatch>, String> {
    let keys = list_deltas(store, dataset_name).await?;
    let mut all_batches = Vec::new();
    for key in &keys {
        let data = ops::get(store, key).await?;
        let (_, batches) = parquet_to_record_batches(&data)?;
        all_batches.extend(batches);
    }
    if !all_batches.is_empty() {
        let total_rows: usize = all_batches.iter().map(|b| b.num_rows()).sum();
        tracing::debug!("loaded {} delta files ({} rows) for '{}'", keys.len(), total_rows, dataset_name);
    }
    Ok(all_batches)
}

/// Compact: merge base Parquet + all deltas into a single new base file.
/// Optionally filters out tombstoned rows — this is the physical-deletion
/// half of Phase E (the query-time filter handles the "immediate hide",
/// compaction handles the "actually remove from disk" requirement).
///
/// Tombstones must share one `row_key_column` (enforced at tombstone write).
/// If non-empty, every row in the merged output whose row_key_column value
/// matches a tombstone is dropped before the new base is written.
pub async fn compact(
    store: &Arc<dyn ObjectStore>,
    dataset_name: &str,
    base_key: &str,
    primary_key_col: Option<&str>,
    tombstones: &[Tombstone],
) -> Result<CompactResult, String> {
    // Load base
    let base_data = ops::get(store, base_key).await?;
    let (_schema, mut base_batches) = parquet_to_record_batches(&base_data)?;
    // Load deltas
    let delta_batches = load_deltas(store, dataset_name).await?;
    let delta_count = delta_batches.len();
    let delta_rows: usize = delta_batches.iter().map(|b| b.num_rows()).sum();
    let has_tombstones = !tombstones.is_empty();
    if delta_batches.is_empty() && !has_tombstones {
        // Nothing to do: no deltas to merge, no rows to drop.
        let base_rows: usize = base_batches.iter().map(|b| b.num_rows()).sum();
        return Ok(CompactResult {
            base_rows,
            delta_rows: 0,
            final_rows: base_rows,
            deltas_merged: 0,
            tombstones_applied: 0,
            rows_dropped_by_tombstones: 0,
        });
    }
    base_batches.extend(delta_batches);
    let pre_filter_rows: usize = base_batches.iter().map(|b| b.num_rows()).sum();
    // Dedup by primary key (keep last occurrence) is a current
    // simplification — full dedup requires a sort by PK; leave that
    // work for a follow-up. The tombstone filter still runs below.
    let _ = primary_key_col;
    let merged_batches = base_batches;

    // Tombstone filter: drop rows whose row_key matches.
    let (final_batches, dropped) = if has_tombstones {
        apply_tombstone_filter(merged_batches, tombstones)?
    } else {
        (merged_batches, 0)
    };
    let final_rows: usize = final_batches.iter().map(|b| b.num_rows()).sum();

    // Write merged base atomically: build ONE Parquet file (single
    // footer!) via ArrowWriter, stage it under a temporary key, read
    // back to verify row count matches what we intended, THEN swap
    // into base_key. If anything goes wrong between build and swap,
    // the original base_key is untouched.
    //
    // Why this matters: concatenating N record_batch_to_parquet()
    // outputs produces N independent Parquet files glued together —
    // readers see only the first footer. That silently corrupted the
    // candidates table before this code was fixed.
    let schema_ref = final_batches
        .first()
        .map(|b| b.schema())
        .ok_or_else(|| "no batches to write".to_string())?;
    // Use Snappy to match the compression of files written by ingest —
    // otherwise compaction would inflate file size 3× on each rewrite.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();
    let mut buf: Vec<u8> = Vec::with_capacity(16 * 1024 * 1024);
    {
        let mut writer = ArrowWriter::try_new(&mut buf, schema_ref.clone(), Some(props))
            .map_err(|e| format!("ArrowWriter init: {e}"))?;
        for batch in &final_batches {
            writer.write(batch).map_err(|e| format!("write batch: {e}"))?;
        }
        writer.close().map_err(|e| format!("close writer: {e}"))?;
    }

    // Verify before committing — if the written bytes don't parse as
    // the row count we expect, DON'T overwrite the base file.
    let (_, verify_batches) = parquet_to_record_batches(&buf)
        .map_err(|e| format!("verify parse: {e}"))?;
    let verify_rows: usize = verify_batches.iter().map(|b| b.num_rows()).sum();
    if verify_rows != final_rows {
        return Err(format!(
            "compact verification failed: wrote {} bytes that parse as {} rows (expected {}); base_key '{}' untouched",
            buf.len(), verify_rows, final_rows, base_key,
        ));
    }

    // Stage under a temp key first, then promote via a second put
    // over base_key. Object stores don't have atomic rename across
    // arbitrary backends; this is "write twice + delete temp" which
    // is the widely-adopted pattern.
    let temp_key = format!("{base_key}.compact-{}.tmp", chrono::Utc::now().timestamp_millis());
    ops::put(store, &temp_key, Bytes::from(buf.clone())).await?;
    ops::put(store, base_key, Bytes::from(buf)).await?;
    let _ = ops::delete(store, &temp_key).await;

    // Only now that base is durably updated, delete delta files.
    let delta_keys = list_deltas(store, dataset_name).await?;
    for key in &delta_keys {
        let _ = ops::delete(store, key).await;
    }
    tracing::info!(
        "compacted '{}': {} deltas merged, {} tombstones applied ({} rows dropped), {} → {} rows",
        dataset_name, delta_count, tombstones.len(), dropped, pre_filter_rows, final_rows,
    );

    Ok(CompactResult {
        // pre_filter_rows counts base + delta rows, so subtracting the
        // delta rows recovers the base row count exactly.
        base_rows: pre_filter_rows - delta_rows,
        delta_rows,
        final_rows,
        deltas_merged: delta_count,
        tombstones_applied: tombstones.len(),
        rows_dropped_by_tombstones: dropped,
    })
}

/// Filter every batch in `batches` to exclude rows whose `row_key_column`
/// value appears in the tombstone set. All tombstones must share one
/// row_key_column — enforced at tombstone write time.
fn apply_tombstone_filter(
    batches: Vec<RecordBatch>,
    tombstones: &[Tombstone],
) -> Result<(Vec<RecordBatch>, usize), String> {
    // Defensive: the caller only invokes this with a non-empty set,
    // but indexing tombstones[0] must not panic if that changes.
    if tombstones.is_empty() {
        return Ok((batches, 0));
    }
    let col_name = &tombstones[0].row_key_column;
    let bad_values: HashSet<String> = tombstones
        .iter()
        .map(|t| t.row_key_value.clone())
        .collect();
    let mut out = Vec::with_capacity(batches.len());
    let mut total_dropped = 0usize;
    for batch in batches {
        let col_idx = match batch.schema().index_of(col_name) {
            Ok(i) => i,
            Err(_) => {
                // Column not in this batch — nothing to drop, pass through.
                out.push(batch);
                continue;
            }
        };
        let col = batch.column(col_idx);
        // Build a "keep" mask: true where row is NOT tombstoned.
        let mask: Vec<bool> = if let Some(s) = col.as_any().downcast_ref::<StringArray>() {
            (0..s.len())
                .map(|i| s.is_null(i) || !bad_values.contains(s.value(i)))
                .collect()
        } else if let Some(a) = col.as_any().downcast_ref::<Int32Array>() {
            (0..a.len())
                .map(|i| a.is_null(i) || !bad_values.contains(&a.value(i).to_string()))
                .collect()
        } else if let Some(a) = col.as_any().downcast_ref::<Int64Array>() {
            (0..a.len())
                .map(|i| a.is_null(i) || !bad_values.contains(&a.value(i).to_string()))
                .collect()
        } else {
            // Key column of an unsupported type — log and pass through
            // rather than panic. Operator has to handle this manually.
            tracing::warn!(
                "tombstone filter: column '{}' has unsupported type {:?}; passing rows through",
                col_name, col.data_type(),
            );
            out.push(batch);
            continue;
        };
        let kept = mask.iter().filter(|b| **b).count();
        let dropped = mask.len() - kept;
        total_dropped += dropped;
        if dropped == 0 {
            out.push(batch);
            continue;
        }
        let bool_arr = BooleanArray::from(mask);
        let filtered = filter_record_batch(&batch, &bool_arr)
            .map_err(|e| format!("filter batch: {e}"))?;
        out.push(filtered);
    }
    Ok((out, total_dropped))
}

#[derive(Debug, Clone, serde::Serialize)]
pub struct CompactResult {
    pub base_rows: usize,
    pub delta_rows: usize,
    pub final_rows: usize,
    pub deltas_merged: usize,
    pub tombstones_applied: usize,
    pub rows_dropped_by_tombstones: usize,
}