profit/lakehouse

Commit Graph

Author	SHA1	Message	Date

root

4e1c400f5d

Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop

Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.

delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
  BooleanArray keep-mask per batch (True where row_key_value is NOT
  in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
  for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
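The keep-mask step above can be sketched without the Arrow types. This is a minimal illustration, not the code in delta::compact: KeyColumn, keep_mask and filter_rows are hypothetical names, a Vec<bool> stands in for the BooleanArray, and comparing integer keys by their text form is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the key column of one record batch.
/// The variants mirror the key types the commit lists as supported.
pub enum KeyColumn {
    Utf8(Vec<String>),
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

/// Build a keep-mask: true where the row's key is NOT in the tombstone set.
/// Assumption: tombstone values are stored as strings, so integer keys are
/// rendered to text before the lookup.
pub fn keep_mask(keys: &KeyColumn, tombstoned: &HashSet<String>) -> Vec<bool> {
    match keys {
        KeyColumn::Utf8(v) => v.iter().map(|k| !tombstoned.contains(k.as_str())).collect(),
        KeyColumn::Int32(v) => v.iter().map(|k| !tombstoned.contains(&k.to_string())).collect(),
        KeyColumn::Int64(v) => v.iter().map(|k| !tombstoned.contains(&k.to_string())).collect(),
    }
}

/// Apply the mask to a slice of rows. Returns the kept rows and the number
/// of rows dropped (the analogue of rows_dropped_by_tombstones).
pub fn filter_rows<T: Clone>(rows: &[T], mask: &[bool]) -> (Vec<T>, usize) {
    let kept: Vec<T> = rows
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(row, _)| row.clone())
        .collect();
    let dropped = rows.len() - kept.len();
    (kept, dropped)
}
```

In the real path the mask would be a BooleanArray passed to arrow::compute::filter_record_batch, which filters every column of the batch at once.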

Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
A Parquet reader honors only one footer per file (located at the end),
so the data from the other streams is invisible.
Latent since Phase 8 shipped; triggered by tombstone filtering
producing multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).

Fix:
- Single ArrowWriter per compaction, writes every batch into one
  properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
  inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
  matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
  temp; only then delete delta files. Any error along the way leaves
  the original base intact.
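The write ordering in the last two bullets can be sketched as follows. This is an illustration under stated assumptions: it uses the local filesystem in place of the object store, swap_in_compacted and count_rows are hypothetical names, and count_rows counts newline-terminated lines where the real verifier would parse the buffer as Parquet.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical verifier: counts rows in the written buffer. Rows here are
/// newline-terminated lines; the real check parses Parquet.
fn count_rows(buf: &[u8]) -> usize {
    buf.iter().filter(|b| **b == b'\n').count()
}

/// Verify first, then write temp, then base, then delete temp, and only
/// then delete the delta files. An early error leaves the base untouched.
pub fn swap_in_compacted(
    base: &Path,
    deltas: &[PathBuf],
    buf: &[u8],
    expected_rows: usize,
    ts: u64,
) -> io::Result<()> {
    // 1. Verify before swap: refuse to overwrite on a row-count mismatch.
    let actual = count_rows(buf);
    if actual != expected_rows {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("expected {} rows, wrote {}", expected_rows, actual),
        ));
    }
    // 2. Write the temp copy next to the base key.
    let tmp = PathBuf::from(format!("{}.compact-{}.tmp", base.display(), ts));
    fs::write(&tmp, buf)?;
    // 3. Write the base itself.
    fs::write(base, buf)?;
    // 4. Delete the temp copy.
    fs::remove_file(&tmp)?;
    // 5. Only now delete the delta files that were merged.
    for delta in deltas {
        fs::remove_file(delta)?;
    }
    Ok(())
}
```

Deleting the deltas last means a failure at any earlier step leaves them available for a retry.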

TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.

QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.

E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
  99997 (the 3 IDs are physically gone from disk)

PRD invariant closed: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.

Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
  POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
  flush_threshold=1 per tombstone; TombstoneStore::compact exists
  but not auto-called)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 10:38:30 -05:00

root

d87f2ccac6

Phase E: Soft deletes (tombstones) for compliance-grade row deletion

Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.

Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
              actor, reason }
- All tombstones for a dataset must share one row_key_column —
  enforced at write so the query-time filter remains a single
  WHERE NOT IN (...) clause
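A minimal sketch of the record and the write-time check described above. The field names come from this commit; the field types (String, and unix seconds for deleted_at) and the check_key_column helper are assumptions for illustration, not the shared::types definitions.

```rust
/// Tombstone record. Field names follow the commit message; the types
/// are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Tombstone {
    pub dataset: String,
    pub row_key_column: String,
    pub row_key_value: String,
    pub deleted_at: u64, // assumption: unix seconds
    pub actor: String,
    pub reason: String,
}

/// Hypothetical write-time check: every tombstone for a dataset must use
/// the key column that dataset's earlier tombstones established.
pub fn check_key_column(existing: &[Tombstone], incoming: &Tombstone) -> Result<(), String> {
    let established = existing
        .iter()
        .find(|t| t.dataset == incoming.dataset)
        .map(|t| t.row_key_column.as_str());
    match established {
        Some(col) if col != incoming.row_key_column.as_str() => Err(format!(
            "dataset {} already uses key column {}, got {}",
            incoming.dataset, col, incoming.row_key_column
        )),
        // No earlier tombstone for this dataset, or the column matches.
        _ => Ok(()),
    }
}
```

Rejecting a mismatched column at write time is what lets the query layer emit one NOT IN filter per dataset.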

Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
  are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
  (POST .../tombstones/compact will work once we expose it)

Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd

HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
    { row_key_column, row_key_values[], actor, reason }
  Returns rows_tombstoned count + per-value failure list (207 on
  partial success).
- GET same path lists active tombstones with full audit info.

Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
  becomes DataFusion view with
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
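The view definition above can be generated along these lines. A sketch only: tombstone_view_sql, sql_literal and sql_ident are hypothetical helpers, and quoting the key column as an identifier is an assumption (the commit shows it unquoted).

```rust
/// Quote a SQL string literal by doubling embedded single quotes.
fn sql_literal(v: &str) -> String {
    format!("'{}'", v.replace('\'', "''"))
}

/// Quote a SQL identifier by doubling embedded double quotes.
fn sql_ident(v: &str) -> String {
    format!("\"{}\"", v.replace('"', "\"\""))
}

/// Build the SELECT for the public view over "__raw__{name}".
/// Returns None when there are no tombstones, in which case the table
/// would be registered directly with no view.
pub fn tombstone_view_sql(name: &str, key_col: &str, values: &[&str]) -> Option<String> {
    if values.is_empty() {
        return None;
    }
    let list = values
        .iter()
        .map(|v| sql_literal(v))
        .collect::<Vec<_>>()
        .join(", ");
    Some(format!(
        "SELECT * FROM {} WHERE CAST({} AS VARCHAR) NOT IN ({})",
        sql_ident(&format!("__raw__{}", name)),
        sql_ident(key_col),
        list
    ))
}
```

Escaping the values matters because row_key_values arrive over HTTP and are interpolated into SQL text.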

End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
  (Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk

Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.

Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
  doesn't read tombstones during merge. Tombstoned rows are still
  on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
  tombstone records carry their own actor+reason+timestamp so the
  audit trail is intact, but cross-referencing with the mutation
  event log would help compliance reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 09:40:48 -05:00

2 Commits