lakehouse/PHASES.md at d87f2ccac62b968594826df9e404912455df8a43

root d87f2ccac6 Phase E: Soft deletes (tombstones) for compliance-grade row deletion

Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.

Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
              actor, reason }
- All tombstones for a dataset must share one row_key_column —
  enforced at write so the query-time filter remains a single
  WHERE NOT IN (...) clause
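
A minimal sketch of the record and the one-key-column rule described above. The field names come from the list; the field types, the TombstoneIndex store, and the TombstoneError enum are assumptions made for illustration, not the actual shared::types or catalogd code.

```rust
use std::collections::HashMap;

/// Fields as listed in the commit message; the concrete types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Tombstone {
    pub dataset: String,
    pub row_key_column: String,
    pub row_key_value: String,
    pub deleted_at: u64, // assumed: seconds since the Unix epoch
    pub actor: String,
    pub reason: String,
}

/// Hypothetical error type for the write-time check.
#[derive(Debug, PartialEq)]
pub enum TombstoneError {
    KeyColumnMismatch { expected: String, got: String },
}

/// Hypothetical in-memory index: dataset name -> its tombstones.
#[derive(Default)]
pub struct TombstoneIndex {
    by_dataset: HashMap<String, Vec<Tombstone>>,
}

impl TombstoneIndex {
    /// Reject a tombstone whose key column differs from the one already
    /// in use for that dataset, so one NOT IN filter covers all of them.
    pub fn add(&mut self, t: Tombstone) -> Result<(), TombstoneError> {
        let existing = self.by_dataset.entry(t.dataset.clone()).or_default();
        if let Some(first) = existing.first() {
            if first.row_key_column != t.row_key_column {
                return Err(TombstoneError::KeyColumnMismatch {
                    expected: first.row_key_column.clone(),
                    got: t.row_key_column,
                });
            }
        }
        existing.push(t);
        Ok(())
    }
}
```

Checking at write time keeps the query path simple: the reader never has to merge filters over several columns.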

Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
  are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
  (POST .../tombstones/compact will work once we expose it)
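
The "durability on return" contract can be illustrated with plain std file I/O. This is only a sketch of the idea, assuming one newline-terminated record per append; it is not storaged::append_log, and the function names append_durable and reload are hypothetical.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Append one record and force it to stable storage before returning,
/// which is the effect of flush_threshold=1 plus an explicit flush.
pub fn append_durable(log_path: &Path, record: &str) -> io::Result<()> {
    if let Some(dir) = log_path.parent() {
        fs::create_dir_all(dir)?;
    }
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(log_path)?;
    file.write_all(record.as_bytes())?;
    file.write_all(b"\n")?;
    // sync_all asks the OS to flush file data and metadata to disk.
    file.sync_all()
}

/// Read every record back, as a restart would.
pub fn reload(log_path: &Path) -> io::Result<Vec<String>> {
    let contents = fs::read_to_string(log_path)?;
    Ok(contents.lines().map(str::to_owned).collect())
}
```

Syncing on every append is expensive, which is acceptable here only because tombstones are described as low-frequency writes.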

Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd

HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
    { row_key_column, row_key_values[], actor, reason }
  Returns rows_tombstoned count + per-value failure list (207 on
  partial success).
- GET on the same path lists active tombstones with full audit info
  (actor, reason, deleted_at).
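
One way the per-value failure list and the 207 response could fit together is sketched below. Only rows_tombstoned and "207 on partial success" come from the text above; the BatchOutcome type, the function names, and the 200 and 400 status codes are assumptions.

```rust
/// Hypothetical result of one batch request.
#[derive(Debug, PartialEq)]
pub struct BatchOutcome {
    pub rows_tombstoned: usize,
    /// (row_key_value, error message) for each value that failed.
    pub failures: Vec<(String, String)>,
}

/// Apply `add` to each value independently and collect the failures,
/// so one bad value does not abort the rest of the batch.
pub fn tombstone_batch<F>(values: &[&str], mut add: F) -> BatchOutcome
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut outcome = BatchOutcome {
        rows_tombstoned: 0,
        failures: Vec::new(),
    };
    for &value in values {
        match add(value) {
            Ok(()) => outcome.rows_tombstoned += 1,
            Err(e) => outcome.failures.push((value.to_string(), e)),
        }
    }
    outcome
}

/// Map an outcome to an HTTP status code.
pub fn status_for(outcome: &BatchOutcome) -> u16 {
    match (outcome.rows_tombstoned, outcome.failures.is_empty()) {
        (_, true) => 200,  // assumed: no failures at all
        (0, false) => 400, // assumed: nothing succeeded
        (_, false) => 207, // partial success, as stated above
    }
}
```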

Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
  becomes DataFusion view with
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
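
The view body described above can be generated from the tombstone snapshot roughly as follows. This is a sketch under assumptions: tombstone_view_sql and the two quoting helpers are hypothetical names, and the real queryd code may build the filter differently (for example through DataFusion's expression API instead of a SQL string).

```rust
/// Quote an SQL identifier by doubling embedded double quotes.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Quote an SQL string literal by doubling embedded single quotes.
fn quote_literal(value: &str) -> String {
    format!("'{}'", value.replace('\'', "''"))
}

/// Build the SELECT that backs the public view for a tombstoned table.
/// Returns None when there is nothing to exclude, so the caller can
/// register the table directly with no view in between.
pub fn tombstone_view_sql(table: &str, key_column: &str, deleted: &[&str]) -> Option<String> {
    if deleted.is_empty() {
        return None;
    }
    let raw = quote_ident(&format!("__raw__{table}"));
    let list = deleted
        .iter()
        .map(|v| quote_literal(v))
        .collect::<Vec<_>>()
        .join(", ");
    Some(format!(
        "SELECT * FROM {raw} WHERE CAST({col} AS VARCHAR) NOT IN ({list})",
        col = quote_ident(key_column)
    ))
}
```

Escaping the literals matters because row_key_values arrive from the HTTP request body. Note also that under standard SQL three-valued logic a row whose key column is NULL does not satisfy NOT IN, so it would be filtered out as well; whether that matters depends on whether key columns can be NULL here.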

End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
  (Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk

Reversibility: tombstones hide rows at query time; they do not destroy
data. The deleted rows stay visible through "__raw__{name}", so Phase 13
access control is what stops a non-admin from querying __raw__* tables.

Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
  doesn't read tombstones during merge. Tombstoned rows are still
  on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
  tombstone records carry their own actor+reason+timestamp so the
  audit trail is intact, but cross-referencing with the mutation
  event log would help compliance reporting.
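
For the first limit above, the missing integration amounts to dropping tombstoned keys while files are merged. The sketch below shows only that filtering step on in-memory rows; the Row type and drop_tombstoned are hypothetical and do not reflect the signature of Phase 8's compact_files.

```rust
use std::collections::HashSet;

/// Hypothetical row: the key column value plus the rest of the record.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub key: String,
    pub payload: String,
}

/// Keep only rows whose key has no tombstone. Applied during a merge,
/// this is what would physically remove the deleted rows from disk.
pub fn drop_tombstoned(rows: Vec<Row>, tombstoned: &HashSet<String>) -> Vec<Row> {
    rows.into_iter()
        .filter(|r| !tombstoned.contains(&r.key))
        .collect()
}
```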

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-16 09:40:48 -05:00


Phase Tracker

Phase 0: Bootstrap ✅

Phase 1: Storage + Catalog ✅

Phase 2: Query Engine ✅

Phase 3: AI Integration ✅

Phase 4: Frontend ✅

Phase 5: Hardening ✅

Phase 6: Ingest Pipeline ✅

Phase 7: Vector Index + RAG ✅

Phase 8: Hot Cache + Incremental Updates ✅

Phase 8.5: Agent Workspaces ✅

Phase 9: Event Journal ✅

Phase 10: Rich Catalog v2 ✅

Phase 11: Embedding Versioning ✅

Phase 12: Tool Registry ✅

Phase 13: Security & Access Control ✅

Phase 14: Schema Evolution ✅

Phase 15+: Horizon
