lakehouse

Author	SHA1	Message	Date
root	9cc0ceb894	P42-002: wire truth gate into queryd /sql + /paged SQL paths The scrum master flagged crates/queryd/src/service.rs across iters 3-5 with the same finding: "raw SQL forwarded to DataFusion without schema or policy gate; violates PRD §42-002 truth enforcement." Confidence 79-95%, gradient tier auto/dry_run. Applier couldn't touch it: the fix is larger than 6 lines and crosses crate boundaries. Hand-fix lands the missing enforcement point: - truth: new RuleCondition::FieldContainsAny { field, needles } with case-insensitive substring matching. 4 new unit tests cover the positive, negative, missing-field, and empty-needles paths. - truth: sql_query_guard_store() helper returns a baseline store that rejects destructive verbs (DROP/TRUNCATE/DELETE FROM) and empty SQL. - queryd: QueryState grows an Arc<TruthStore>; default router() loads sql_query_guard_store; new router_with_truth(engine, store) lets tests inject a custom store. - queryd: sql_policy_check() runs truth.evaluate("sql_query", ctx) before hitting DataFusion. Reject/Block actions on matched conditions short-circuit to HTTP 403 with the rule's message. Both /sql and /paged gated. - queryd: 7 new tests cover block/allow/case-insensitive/false-positive scenarios. "SELECT deleted_at FROM t" must NOT be rejected (substring match is narrow: "delete from", not "delete"). Total: 28 truth tests green (was 24), 7 new queryd policy tests green. Workspace baseline warnings unchanged at 11. This is a signal-driven fix the mechanical pipeline couldn't produce but the scrum master kept asking for. Closes one of four LOOPING files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 04:38:52 -05:00
root	4e1c400f5d	Phase E.2: Compaction integrates tombstones; physical deletion closes GDPR loop Phase E gave us soft-delete at query time (tombstones hide rows via a DataFusion filter view). This completes the invariant: after compact, tombstoned rows are PHYSICALLY absent from the parquet on disk. delta::compact changes: - Signature adds tombstones: &[Tombstone] - After merging base + deltas, apply_tombstone_filter builds a BooleanArray keep-mask per batch (True where row_key_value is NOT in the tombstone set) and applies arrow::compute::filter_record_batch - Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage for pg- and csv-derived schemas) - CompactResult gains tombstones_applied + rows_dropped_by_tombstones - Caller clears tombstone store on success Critical correctness fix surfaced during E2E testing: The original Phase 8 compact concatenated N independent Parquet byte streams from record_batch_to_parquet(), each with its own footer. Parquet readers only see the FIRST footer's data; the rest is invisible. Latent since Phase 8 shipped; triggered by tombstone-filtering producing multiple batches. Corrupted candidates.parquet on first test run (restored from UI fixture copy; good argument for test data in repo). Fix: - Single ArrowWriter per compaction, writes every batch into one properly-footered Parquet - Snappy compression to match ingest defaults (otherwise rewrite inflated file 3×, 10.5MB → 34MB, because no compression was set) - Verify-before-swap: parse written buf back to confirm row count matches expected; refuses to overwrite base_key if verification fails - Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete temp; only then delete delta files. Any error along the way leaves the original base intact. TombstoneStore::clear(dataset) drops all tombstone batch files and evicts the per-dataset AppendLog from cache. Called after successful compact. QueryEngine::catalog() accessor exposes the Registry so queryd handlers can reach the tombstone store without routing through gateway state. E2E on candidates (100K rows, 15 cols): - Baseline: 10.59 MB, 100000 rows - Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw - Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997 - Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows - Restart: persists, tombstones list empty, __raw__candidates also 99997 (the 3 IDs are physically gone from disk) PRD invariant close: deletion is now actually deletion, not just masking. GDPR erasure request → tombstone + schedule compact → data gone. Deferred: - Compact-all-datasets cron (currently manual per-dataset via POST /query/compact) - Compaction of tombstone batch files themselves (they grow at flush_threshold=1 per tombstone; TombstoneStore::compact exists but not auto-called) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 10:38:30 -05:00
root	6df904a03c	Phase 8: Hot cache + incremental delta updates - MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB) Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats - Delta store: append-only delta Parquet files for row-level updates Write deltas without rewriting base files, merge at query time - Compaction: POST /query/compact merges deltas into base Parquet - Query engine: checks cache first, falls back to Parquet, merges deltas - Benchmarked on 2.47M rows: 1M row JOIN: 854ms cold → 96ms hot (8.9x speedup) 100K filter: 62ms cold → 21ms hot (3x speedup) 1.1M rows cached in 408MB RAM Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 08:37:28 -05:00
root	19bdfab227	Phase 2: DataFusion query engine over Parquet - queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem - queryd: ListingTable registration from catalog ObjectRefs with schema inference - queryd: POST /query/sql returns JSON {columns, rows, row_count} - queryd→catalogd wiring: reads all datasets, registers as named tables - gateway: wires QueryEngine with shared store + registry - e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 05:48:20 -05:00
root	a52ca841c6	Phase 0: bootstrap Rust workspace - Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway - shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum - gateway: Axum HTTP entrypoint with nested service routers + tracing - All services expose /health stubs - justfile with build/test/run recipes - PRD, phase tracker, and ADR docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 04:59:05 -05:00

5 Commits
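
The case-insensitive substring check that commit 9cc0ceb894 describes for RuleCondition::FieldContainsAny might look roughly like the sketch below. It is not the code from the truth crate. The function name `field_contains_any`, the `sql_ctx` helper, the HashMap-based context, the "sql" field name, and the exact needle strings are all assumptions made for illustration. The commit only says that the missing-field and empty-needles paths are tested, not what they return; this sketch assumes both evaluate to "no match".

```rust
use std::collections::HashMap;

// Assumed needle list, modelled on the destructive verbs named in the commit
// message (DROP / TRUNCATE / DELETE FROM). The real store may differ.
const NEEDLES: [&str; 3] = ["drop", "truncate", "delete from"];

// Hypothetical stand-in for evaluating `FieldContainsAny { field, needles }`.
// Returns true when the named field exists and contains at least one
// non-empty needle, compared case-insensitively.
fn field_contains_any(ctx: &HashMap<String, String>, field: &str, needles: &[&str]) -> bool {
    let value = match ctx.get(field) {
        Some(v) => v,
        None => return false, // assumption: a missing field never matches
    };
    let haystack = value.to_lowercase();
    needles
        .iter()
        .filter(|n| !n.is_empty()) // assumption: empty needles are ignored
        .any(|n| haystack.contains(&n.to_lowercase()))
}

// Hypothetical helper: wrap a SQL string in a one-field context map.
fn sql_ctx(sql: &str) -> HashMap<String, String> {
    let mut ctx = HashMap::new();
    ctx.insert("sql".to_string(), sql.to_string());
    ctx
}

fn main() {
    for sql in ["Drop TABLE t", "delete FROM t", "SELECT deleted_at FROM t"] {
        let blocked = field_contains_any(&sql_ctx(sql), "sql", &NEEDLES);
        println!("{sql:?} blocked={blocked}");
    }
}
```

Matching on "delete from" rather than "delete" is what keeps a column name such as deleted_at from tripping the guard, as the commit message notes.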