Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.
delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
BooleanArray keep-mask per batch (True where row_key_value is NOT
in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
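A minimal sketch of the keep-mask approach described in the list above, for the Utf8 key case (Int32/Int64 follow the same pattern with their array types); the exact signature of apply_tombstone_filter and the HashSet of tombstoned keys are assumptions, not the actual code:

```rust
use std::collections::HashSet;

use arrow::array::{Array, BooleanArray, StringArray};
use arrow::compute::filter_record_batch;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Keep-mask filter for a Utf8 key column: true where the row's key is NOT in
/// the tombstone set. Rows with a null key are kept (they can't match a key);
/// that null policy is an assumption of this sketch.
fn apply_tombstone_filter(
    batch: &RecordBatch,
    key_column: &str,
    tombstoned: &HashSet<String>,
) -> Result<RecordBatch, ArrowError> {
    let idx = batch.schema().index_of(key_column)?;
    let keys = batch
        .column(idx)
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| ArrowError::SchemaError("expected Utf8 key column".into()))?;

    // Build the per-batch keep-mask, then let Arrow do the row filtering.
    let keep: BooleanArray = (0..keys.len())
        .map(|i| Some(keys.is_null(i) || !tombstoned.contains(keys.value(i))))
        .collect();

    filter_record_batch(batch, &keep)
}
```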
Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering producing
multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).
Fix:
- Single ArrowWriter per compaction, writes every batch into one
properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
temp; only then delete delta files. Any error along the way leaves
the original base intact.
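A sketch of the single-writer plus verify-before-swap flow described in the fix list; the function name, use of anyhow/bytes, and the caller doing the temp-key staging are illustrative assumptions rather than the actual compact code:

```rust
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

/// Write every merged batch through ONE ArrowWriter (single footer), then
/// re-parse the buffer and check the row count before the caller swaps it in
/// over base_key.
fn write_and_verify(batches: &[RecordBatch], expected_rows: usize) -> anyhow::Result<Vec<u8>> {
    let schema = batches
        .first()
        .ok_or_else(|| anyhow::anyhow!("no batches to compact"))?
        .schema();

    // Snappy to match ingest defaults; without it the rewrite balloons ~3x.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?; // writes the one and only footer

    // Verify-before-swap: parse what was just written and count rows.
    let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf.clone()))?.build()?;
    let mut written_rows = 0usize;
    for batch in reader {
        written_rows += batch?.num_rows();
    }
    if written_rows != expected_rows {
        anyhow::bail!("refusing to overwrite base: wrote {written_rows} rows, expected {expected_rows}");
    }
    Ok(buf)
}
```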
TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.
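What clear() amounts to against the object_store API (assuming a recent object_store where list() returns a stream directly); the function name is illustrative, and the AppendLog cache eviction is noted but omitted:

```rust
use std::sync::Arc;

use futures::TryStreamExt;
use object_store::{path::Path, ObjectStore};

/// Delete every tombstone batch file under the dataset's prefix. The real
/// clear() also evicts the per-dataset AppendLog from its in-memory cache,
/// which is omitted in this sketch.
async fn clear_tombstones(store: Arc<dyn ObjectStore>, dataset: &str) -> object_store::Result<()> {
    let prefix = Path::from(format!("_catalog/tombstones/{dataset}"));
    let batch_files: Vec<_> = store.list(Some(&prefix)).try_collect().await?;
    for meta in batch_files {
        store.delete(&meta.location).await?;
    }
    Ok(())
}
```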
QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.
E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
99997 (the 3 IDs are physically gone from disk)
PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.
Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
flush_threshold=1 per tombstone; TombstoneStore::compact exists
but not auto-called)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase Tracker
Phase 0: Bootstrap ✅
- Cargo workspace with all crate stubs compiling
- shared crate: error types, ObjectRef, DatasetId
- gateway with Axum: GET /health → 200
- tracing + tracing-subscriber wired in gateway
- justfile with build, test, run recipes
- docs committed to git
Phase 1: Storage + Catalog ✅
- storaged: object_store backend init (LocalFileSystem)
- storaged: Axum endpoints (PUT/GET/DELETE/LIST)
- shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
- catalogd/registry.rs: in-memory index + manifest persistence
- catalogd service: POST/GET /datasets + by-name
- gateway routes wired
Phase 2: Query Engine ✅
- queryd: SessionContext + object_store config
- queryd: ListingTable from catalog ObjectRefs
- queryd service: POST /query/sql → JSON
- queryd → catalogd wiring
- gateway routes /query
Phase 3: AI Integration ✅
- Python sidecar: FastAPI + Ollama (embed/generate/rerank)
- Dockerfile for sidecar
- aibridge/client.rs: HTTP client
- aibridge service: Axum proxy endpoints
- Model config via env vars
Phase 4: Frontend ✅
- Dioxus scaffold, WASM build
- Ask tab: natural language → AI SQL → results
- Explore tab: dataset browser + AI summary
- SQL tab: raw DataFusion editor
- System tab: health checks for all services
Phase 5: Hardening ✅
- Proto definitions (lakehouse.proto)
- Internal gRPC: CatalogService on :3101
- OpenTelemetry tracing: stdout exporter
- Auth middleware: X-API-Key (toggleable)
- Config-driven startup: lakehouse.toml
Phase 6: Ingest Pipeline ✅
- CSV ingest with auto schema detection
- JSON ingest (array + newline-delimited, nested flattening)
- PDF text extraction (lopdf)
- Text/SMS file ingest
- Content hash dedup (SHA-256)
- POST /ingest/file multipart upload
- 12 unit tests
Phase 7: Vector Index + RAG ✅
- chunker: configurable size + overlap, sentence-boundary aware
- store: embeddings as Parquet (binary f32 vectors)
- search: brute-force cosine similarity
- rag: embed → search → retrieve → LLM answer with citations
- POST /vectors/index, /search, /rag
- Background job system with progress tracking
- Dual-pipeline supervisor with checkpointing + retry
- 100K embeddings: 177/sec on A4000, zero failures
- 6 unit tests
Phase 8: Hot Cache + Incremental Updates ✅
- MemTable hot cache: LRU, configurable max (16GB)
- POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files
- Merge-on-read: queries combine base + deltas
- Compaction: POST /query/compact
- Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
Phase 8.5: Agent Workspaces ✅
- WorkspaceManager with daily/weekly/monthly/pinned tiers
- Saved searches, shortlists, activity logs per workspace
- Instant zero-copy handoff between agents
- Persistence to object storage, rebuild on startup
Phase 9: Event Journal ✅
- journald crate: append-only mutation log
- Event schema: entity, field, old/new value, actor, source, workspace
- In-memory buffer with auto-flush to Parquet
- GET /journal/history/{entity_id}, /recent, /stats
- POST /journal/event, /update, /flush
Phase 10: Rich Catalog v2 ✅
- DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
- PII auto-detection: email, phone, SSN, salary, address, medical
- Column-level metadata with sensitivity flags
- Lineage tracking: source_system → ingest_job → dataset
- PATCH /catalog/datasets/by-name/{name}/metadata
- Backward compatible (serde default)
Phase 11: Embedding Versioning ✅
- IndexRegistry: model_name, model_version, dimensions per index
- Index metadata persisted as JSON, rebuilt on startup
- GET /vectors/indexes — list all (filter by source/model)
- GET /vectors/indexes/{name} — metadata
- Background jobs auto-register metadata on completion
Phase 12: Tool Registry ✅
- 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
- Parameter validation + SQL template substitution
- Permission levels: read / write / admin
- Full audit trail per invocation
- GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
Phase 13: Security & Access Control ✅
- Role-based access: admin, recruiter, analyst, agent
- Field-level sensitivity enforcement
- Column masking determination per agent
- Query audit logging
- GET/POST /access/roles, GET /access/audit, POST /access/check
Phase 14: Schema Evolution ✅
- Schema diff detection (added, removed, type changed, renamed)
- Fuzzy rename detection (shared word parts)
- Auto-generated migration rules with confidence scores
- AI migration prompt builder for complex cases
- 5 unit tests
Phase 15+: Horizon
- HNSW vector index with iteration-friendly trial system (2026-04-16)
  - HnswStore.build_index_with_config — parameterized ef_construction, ef_search, seed
  - EmbeddingCache — pins 100K vectors in memory, shared across trials
  - harness::EvalSet — named query sets with brute-force ground truth
  - TrialJournal — append-only JSONL at _hnsw_trials/{index}.jsonl
  - Endpoints: /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best?metric={recall|latency|pareto}, /hnsw/evals, /hnsw/evals/{name}/autogen, /hnsw/cache/stats
  - Measured on 100K resumes: brute-force 44-54ms → HNSW 509us-1830us, recall 0.92-1.00 depending on ef_construction. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as HnswConfig::default()
- Catalog manifest repair — POST /catalog/resync-missing restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
- [~] Federated multi-bucket query — foundation complete 2026-04-16, see ADR-017
  - StorageConfig.buckets + rescue_bucket + profile_root config shape
  - SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
  - storaged::BucketRegistry — multi-backend, rescue-aware, reachability probes
  - storaged::error_journal::ErrorJournal — append-only JSONL at primary://_errors/bucket_errors.jsonl
  - Endpoints: GET /storage/buckets, GET /storage/errors, GET /storage/bucket-health
  - Bucket-aware I/O: PUT/GET /storage/buckets/{bucket}/objects/{*key} with rescue fallback + X-Lakehouse-Rescue-Used observability headers
  - Backward compat: empty [[storage.buckets]] synthesizes a primary from legacy root
  - Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail on missing, write to unknown bucket 503, error journal + health summary
  - X-Lakehouse-Bucket header middleware on ingest endpoints (2026-04-16)
  - Catalog migration: POST /catalog/migrate-buckets stamps bucket = "primary" on legacy refs (12 renamed, 14 total now canonical)
  - queryd registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
  - Profile hot-load endpoints: POST /profile/{user}/activate|deactivate (deferred to Phase 17)
  - vectord bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
- Database connector ingest (Postgres first) — 2026-04-16
  - pg_stream::stream_table_to_parquet — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
  - parse_dsn — postgresql:// and postgres:// URL schemes, user/password/host/port/db (see the sketch after this entry)
  - POST /ingest/db endpoint: {dsn, table, dataset_name?, batch_size?, order_by?, limit?} → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
  - Existing POST /ingest/postgres/import (structured config) preserved alongside
  - 4 DSN-parser unit tests + live end-to-end test against knowledge_base.team_runs (586 rows, 13 cols, 6 batches, 196ms)
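A minimal sketch of the DSN parsing this entry describes, written against the url crate; the PgConn struct, field names, and defaults are illustrative assumptions, and the real parse_dsn may not use url at all:

```rust
use url::Url;

/// Connection parameters pulled out of a postgres:// / postgresql:// DSN.
/// Illustrative shape only.
#[derive(Debug)]
struct PgConn {
    user: String,
    password: Option<String>,
    host: String,
    port: u16,
    database: String,
}

fn parse_dsn(dsn: &str) -> Result<PgConn, String> {
    let url = Url::parse(dsn).map_err(|e| e.to_string())?;
    if !matches!(url.scheme(), "postgres" | "postgresql") {
        return Err(format!("unsupported scheme: {}", url.scheme()));
    }
    Ok(PgConn {
        user: url.username().to_string(),
        password: url.password().map(str::to_string),
        host: url.host_str().unwrap_or("localhost").to_string(),
        port: url.port().unwrap_or(5432),
        database: url.path().trim_start_matches('/').to_string(),
    })
}
```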
- Phase B: Lance storage evaluation — 2026-04-16
  - crates/lance-bench standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with the main stack
  - 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for the scorecard
  - Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as a per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past the 5M RAM ceiling.
- Phase E.2 — Compaction integrates tombstones (physical deletion) — 2026-04-16
  - delta::compact accepts a tombstones: &[Tombstone] param, filters rows at merge time via Arrow filter_record_batch
  - CompactResult gains tombstones_applied + rows_dropped_by_tombstones
  - Atomic write: ArrowWriter → single Parquet file (fixes latent bug where concatenated Parquet byte streams produced garbage — only the first segment's footer was visible), verify-parse before overwrite, temp_key staging, delete delta files AFTER base write succeeds
  - Snappy compression on output matches ingest defaults (avoids 3× size inflation on every compact)
  - TombstoneStore::clear drops all batch files for a dataset; called by queryd after successful compact
  - Query engine exposes a catalog() accessor so the service handler can reach the tombstone store
  - E2E verified on candidates (100K rows): tombstone 3 IDs → compact → 99,997 rows physically in parquet, tombstones empty, IDs gone from __raw__candidates too; file size 10.59 MB → 10.72 MB (proportional to data, not inflated)
- Phase 16: Hot-swap generations + autotune agent — 2026-04-16
  - vectord::promotion::PromotionRegistry — per-index current config + history at _hnsw_promotions/{index}.json, capped at 50 history entries
  - Endpoints: POST /vectors/hnsw/promote/{index}/{trial_id}, POST /vectors/hnsw/rollback/{index}, GET /vectors/hnsw/promoted/{index}
  - vectord::autotune::run_autotune — grid of trials (configurable or default 5 configs), Pareto winner selection (max recall, then min p50), min_recall safety gate (default 0.9), config bounds (ec ∈ [10,400], es ∈ [10,200])
  - POST /vectors/hnsw/autotune — runs the full loop synchronously, journals every trial, auto-promotes the winner
  - activate_profile uses promotion_registry.config_or(..., profile_default) so newly-promoted configs flow automatically into the next activation
  - End-to-end: autogen harness for threat_intel_v1 (10 queries), autotune ran 5 trials (all recall=1.00, p50 64-68us), promoted ec=20 es=30 at recall=1.0 p50=64us as winner. Manual promote of ec=80 es=30 pushed the autotune pick onto history. Rollback restored the autotune winner. Second rollback cleared to None. Re-promote + restart verified persistence. Activation after promotion logged "building HNSW ef_construction=80 ef_search=30 seed=42" — config flowed through correctly.
- Phase 17: Model profiles + scoped search — 2026-04-16
  - shared::types::ModelProfile — { id, ollama_name, description, bound_datasets, hnsw_config, embed_model, created_at, created_by }
  - shared::types::ProfileHnswConfig — mirror of vectord's HnswConfig to avoid a cross-crate dep cycle (defaults ec=80 es=30, matching the Phase 15 winner)
  - Registry::{put_profile, get_profile, list_profiles, delete_profile} persisted at _catalog/profiles/{id}.json, validates bindings exist (raw dataset OR AiView)
  - Endpoints: POST/GET /catalog/profiles, GET/DELETE /catalog/profiles/{id}
  - POST /vectors/profile/{id}/activate — warms EmbeddingCache + builds HNSW with the profile's config for every bound dataset's vector index; reports warmed indexes + failures + duration
  - POST /vectors/profile/{id}/search — rejects with 403 if the requested index's source isn't in profile.bound_datasets; falls through to HNSW if warm, brute-force otherwise
  - Fixed refresh to register new index metadata (was a silent no-op for first-time indexes)
  - End-to-end: security-analyst profile bound to threat_intel → activate warms 54 vectors in 156ms → within-scope HNSW search works (0.625 score); out-of-scope search for candidates returns 403 with allowed bindings listed
- Phase E: Soft deletes (tombstones) — 2026-04-16
  - shared::types::Tombstone — { dataset, row_key_column, row_key_value, deleted_at, actor, reason }
  - catalogd::tombstones::TombstoneStore — per-dataset append-log at _catalog/tombstones/{dataset}/, flush_threshold=1 + explicit flush so every tombstone is durable on return (compliance requirement)
  - All tombstones for a dataset must share the same row_key_column (validated at write — the query filter is built as a single WHERE NOT IN clause)
  - Registry::add_tombstone / list_tombstones
  - Endpoint: POST /catalog/datasets/by-name/{name}/tombstone accepting {row_key_column, row_key_values[], actor, reason}; a companion GET lists active tombstones
  - queryd::context::build_context wraps tombstoned tables: raw goes to __raw__{name}, the public name becomes a DataFusion view with a WHERE CAST(col AS VARCHAR) NOT IN (...) filter (see the sketch after this list)
  - End-to-end on candidates: tombstone 3 IDs, COUNT drops 100,000 → 99,997, specific WHERE returns empty, AiView candidates_safe transitively excludes them too, restart preserves all tombstones
  - Limits / not in MVP: physical compaction (Phase 8 doesn't yet read tombstones during merge); journal integration (tombstones don't yet emit Phase 9 mutation events — covered by audit fields on the tombstone itself)
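A sketch of how the tombstone filter view could be expressed as DataFusion SQL, assuming the raw table is already registered as __raw__{name}; the helper name, empty-set handling, and string-literal quoting are illustrative, not the actual build_context code:

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Register `{name}` as a view over `__raw__{name}` that hides tombstoned keys.
async fn register_tombstone_view(
    ctx: &SessionContext,
    name: &str,
    key_column: &str,
    tombstoned_keys: &[String],
) -> Result<()> {
    if tombstoned_keys.is_empty() {
        // No tombstones: plain pass-through view.
        ctx.sql(&format!(
            "CREATE VIEW \"{name}\" AS SELECT * FROM \"__raw__{name}\""
        ))
        .await?;
        return Ok(());
    }

    // One NOT IN list; CAST keeps Utf8/Int32/Int64 key columns comparable as text.
    let in_list = tombstoned_keys
        .iter()
        .map(|k| format!("'{}'", k.replace('\'', "''")))
        .collect::<Vec<_>>()
        .join(", ");
    ctx.sql(&format!(
        "CREATE VIEW \"{name}\" AS SELECT * FROM \"__raw__{name}\" \
         WHERE CAST(\"{key_column}\" AS VARCHAR) NOT IN ({in_list})"
    ))
    .await?;
    Ok(())
}
```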
- Phase D: AI-safe views — 2026-04-16
  - shared::types::AiView — name, base_dataset, columns whitelist, optional row_filter, column_redactions
  - shared::types::Redaction — Null | Hash | Mask { keep_prefix, keep_suffix }
  - Registry::put_view / get_view / list_views / delete_view persisted to _catalog/views/{name}.json
  - queryd::context registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
  - Endpoints: POST/GET /catalog/views, GET/DELETE /catalog/views/{name}
  - End-to-end on candidates: candidates_safe view exposes 8 of 15 columns, masks candidate_id (CAN******01), filters out status='blocked'. SELECT * FROM candidates_safe returns the whitelist only; SELECT email FROM candidates_safe fails. View survives restart.
  - Capability surface — raw candidates still accessible by name; Phase 13 access control is the layer that enforces who can query what
- Phase C: Decoupled embedding refresh — 2026-04-16
  - DatasetManifest: last_embedded_at, embedding_stale_since, embedding_refresh_policy (Manual | OnAppend | Scheduled)
  - Registry::mark_embeddings_stale / clear_embeddings_stale / stale_datasets
  - Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
  - vectord::refresh::refresh_index — reads the dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes a combined index, clears stale
  - POST /vectors/refresh/{dataset} + GET /vectors/stale
  - Id columns accept Utf8, Int32, Int64
  - End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
- Database connector ingest (Postgres/MySQL)
- PDF OCR (Tesseract)
- Scheduled ingest (cron)
- Fine-tuned domain models
- Multi-node query distribution
30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27 | HNSW trial system: 2026-04-16