lakehouse/docs/PHASES.md
root 09fd446c8d Phase D: AI-safe views — capability-surface projections over base data
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
through it they can never accidentally see PII, even if they write raw
SQL against it.

Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
           column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }
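
In Rust terms the two types are roughly the following — a sketch
reconstructed from the description above; the serde derives and doc
comments are assumptions, and the "..." fields stay elided:

    use std::collections::HashMap;
    use serde::{Deserialize, Serialize};

    /// Sketch of the view definition; the "..." fields are elided here.
    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub struct AiView {
        pub name: String,
        pub base_dataset: String,
        /// Whitelist: only these columns exist as far as the view is concerned.
        pub columns: Vec<String>,
        /// Optional SQL predicate applied as the view's WHERE clause.
        pub row_filter: Option<String>,
        /// Redactions applied on top of the projection, keyed by column.
        pub column_redactions: HashMap<String, Redaction>,
    }

    #[derive(Debug, Clone, Serialize, Deserialize)]
    pub enum Redaction {
        Null,                                            // CAST(NULL AS VARCHAR)
        Hash,                                            // sha256 hex digest
        Mask { keep_prefix: usize, keep_suffix: usize }, // keep edges, star the middle
    }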

Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup

Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
  redaction expressions per column
  - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
  - Hash: digest(value, 'sha256')
  - Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
  real view, not a query rewrite
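
A minimal sketch of how that SELECT can be assembled from an AiView.
The SQL functions (substr, repeat, char_length, digest, CAST NULL) are
the ones named above; the builder itself is illustrative, not the
actual queryd::context code:

    // Illustrative: assembles the SELECT that backs one AI-safe view.
    fn view_sql(v: &AiView) -> String {
        let cols: Vec<String> = v.columns.iter().map(|c| {
            match v.column_redactions.get(c) {
                Some(Redaction::Null) => format!("CAST(NULL AS VARCHAR) AS {c}"),
                Some(Redaction::Hash) => format!("digest({c}, 'sha256') AS {c}"),
                Some(Redaction::Mask { keep_prefix: p, keep_suffix: s }) => {
                    let cast = format!("CAST({c} AS VARCHAR)");
                    // prefix + '*' padding for the middle + suffix
                    format!(
                        "concat(substr({cast}, 1, {p}), \
                         repeat('*', char_length({cast}) - {p} - {s}), \
                         substr({cast}, char_length({cast}) - {s} + 1)) AS {c}"
                    )
                }
                None => c.clone(),
            }
        }).collect();
        let mut sql = format!("SELECT {} FROM {}", cols.join(", "), v.base_dataset);
        if let Some(f) = &v.row_filter {
            sql.push_str(&format!(" WHERE {f}"));
        }
        sql
    }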

HTTP (catalogd::service):
- POST /catalog/views (create)
- GET  /catalog/views (list)
- GET  /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}

End-to-end test on candidates (100K rows, 15 columns):

  candidates_safe view:
    columns: candidate_id, first_name, city, state, vertical,
             skills, years_experience, status
    row_filter: status != 'blocked'
    redaction: candidate_id mask(prefix=3, suffix=2)

  SELECT * FROM candidates_safe LIMIT 5
    -> 8 columns only, candidate_id shown as "CAN******01"
       (PII fields email/phone/last_name absent from result)

  SELECT email FROM candidates_safe
    -> fails (column not in projection)

  SELECT email FROM candidates
    -> succeeds (raw table still accessible by name —
       Phase 13 access control is the gate, not the view itself)

Survives restart — view definitions reload from object storage.

Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
  separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate it
  before persisting, and only the authenticated admin path should call
  put_view
- Redaction expressions assume the column is castable to VARCHAR; numeric
  redactions can be misleading (a Hash on an Int64 column digests its
  VARCHAR cast, and the resulting hex string won't equi-join with a hash
  of the same value stored under a different type or formatting)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase Tracker

Phase 0: Bootstrap

  • Cargo workspace with all crate stubs compiling
  • shared crate: error types, ObjectRef, DatasetId
  • gateway with Axum: GET /health → 200
  • tracing + tracing-subscriber wired in gateway
  • justfile with build, test, run recipes
  • docs committed to git

Phase 1: Storage + Catalog

  • storaged: object_store backend init (LocalFileSystem)
  • storaged: Axum endpoints (PUT/GET/DELETE/LIST)
  • shared/arrow_helpers.rs: RecordBatch ↔ Parquet + schema fingerprinting
  • catalogd/registry.rs: in-memory index + manifest persistence
  • catalogd service: POST/GET /datasets + by-name
  • gateway routes wired
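
For reference, the RecordBatch → Parquet direction with the arrow and
parquet crates looks roughly like this (a generic sketch, not the
actual arrow_helpers.rs; the path and schema are made up):

    use std::sync::Arc;
    use arrow::array::{ArrayRef, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    fn write_batch() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef],
        )?;
        let file = std::fs::File::create("batch.parquet")?;
        let mut writer = ArrowWriter::try_new(file, schema, None)?;
        writer.write(&batch)?;
        writer.close()?; // finalizes row groups and writes the footer
        Ok(())
    }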

Phase 2: Query Engine

  • queryd: SessionContext + object_store config
  • queryd: ListingTable from catalog ObjectRefs
  • queryd service: POST /query/sql → JSON
  • queryd → catalogd wiring
  • gateway routes /query
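
The DataFusion side reduces to a few lines. A minimal sketch, with a
local path standing in for a catalog ObjectRef:

    use datafusion::prelude::*;

    async fn query_dataset() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // queryd resolves the path from a catalog ObjectRef; a local file here.
        ctx.register_parquet("candidates", "data/candidates.parquet",
                             ParquetReadOptions::default()).await?;
        let df = ctx.sql("SELECT count(*) FROM candidates").await?;
        df.show().await?; // prints the result batches to stdout
        Ok(())
    }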

Phase 3: AI Integration

  • Python sidecar: FastAPI + Ollama (embed/generate/rerank)
  • Dockerfile for sidecar
  • aibridge/client.rs: HTTP client
  • aibridge service: Axum proxy endpoints
  • Model config via env vars

Phase 4: Frontend

  • Dioxus scaffold, WASM build
  • Ask tab: natural language → AI SQL → results
  • Explore tab: dataset browser + AI summary
  • SQL tab: raw DataFusion editor
  • System tab: health checks for all services

Phase 5: Hardening

  • Proto definitions (lakehouse.proto)
  • Internal gRPC: CatalogService on :3101
  • OpenTelemetry tracing: stdout exporter
  • Auth middleware: X-API-Key (toggleable)
  • Config-driven startup: lakehouse.toml
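
The X-API-Key check above is a natural fit for Axum's from_fn
middleware. A sketch against axum 0.7 with a hard-coded key — the real
middleware reads its key from config and can be toggled off:

    use axum::{
        extract::Request,
        http::StatusCode,
        middleware::{self, Next},
        response::Response,
        routing::get,
        Router,
    };

    // Hypothetical fixed key; the real check is config-driven + toggleable.
    async fn require_api_key(req: Request, next: Next) -> Result<Response, StatusCode> {
        match req.headers().get("x-api-key").and_then(|v| v.to_str().ok()) {
            Some("secret") => Ok(next.run(req).await),
            _ => Err(StatusCode::UNAUTHORIZED),
        }
    }

    fn app() -> Router {
        Router::new()
            .route("/health", get(|| async { "ok" }))
            .layer(middleware::from_fn(require_api_key))
    }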

Phase 6: Ingest Pipeline

  • CSV ingest with auto schema detection
  • JSON ingest (array + newline-delimited, nested flattening)
  • PDF text extraction (lopdf)
  • Text/SMS file ingest
  • Content hash dedup (SHA-256)
  • POST /ingest/file multipart upload
  • 12 unit tests
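
The content-hash dedup is small enough to sketch in full (sha2 crate
assumed; in-memory only — an illustrative sketch, not the shipped
ingest code):

    use sha2::{Digest, Sha256};
    use std::collections::HashSet;

    /// Identical bytes ingest once: SHA-256 the content, remember the hash.
    #[derive(Default)]
    struct Dedup {
        seen: HashSet<[u8; 32]>,
    }

    impl Dedup {
        /// Returns true the first time these exact bytes are seen.
        fn first_time(&mut self, bytes: &[u8]) -> bool {
            let hash: [u8; 32] = Sha256::digest(bytes).into();
            self.seen.insert(hash)
        }
    }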

Phase 7: Vector Index + RAG

  • chunker: configurable size + overlap, sentence-boundary aware
  • store: embeddings as Parquet (binary f32 vectors)
  • search: brute-force cosine similarity
  • rag: embed → search → retrieve → LLM answer with citations
  • POST /vectors/index, /search, /rag
  • Background job system with progress tracking
  • Dual-pipeline supervisor with checkpointing + retry
  • 100K embeddings: 177/sec on A4000, zero failures
  • 6 unit tests
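
The brute-force scorer is the baseline the Phase 15+ HNSW work is
measured against. A minimal sketch:

    /// Cosine similarity of two f32 vectors (assumes equal length).
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb: f32 = b.iter().map(|y| y * y).sum::<f32>().sqrt();
        dot / (na * nb)
    }

    /// Brute force: score every stored vector, keep the best k.
    fn top_k<'a>(query: &[f32], store: &'a [(String, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
        let mut scored: Vec<(&str, f32)> = store
            .iter()
            .map(|(id, vec)| (id.as_str(), cosine(query, vec)))
            .collect();
        scored.sort_by(|x, y| y.1.total_cmp(&x.1));
        scored.truncate(k);
        scored
    }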

Phase 8: Hot Cache + Incremental Updates

  • MemTable hot cache: LRU, configurable max (16GB)
  • POST /query/cache/pin, /cache/evict, GET /cache/stats
  • Delta store: append-only delta Parquet files
  • Merge-on-read: queries combine base + deltas
  • Compaction: POST /query/compact
  • Benchmarked: 9.8x speedup (1M rows: 942ms → 96ms)
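
A sketch of the merge-on-read shape — base and deltas registered
separately, unified behind a view. Illustrative only; how queryd
actually combines them at scan time may differ:

    use datafusion::prelude::*;

    /// Merge-on-read sketch: queries see base + deltas as one table;
    /// compaction later folds the deltas back into a single base file.
    async fn register_merged(ctx: &SessionContext) -> datafusion::error::Result<()> {
        ctx.register_parquet("events_base", "data/events/base.parquet",
                             ParquetReadOptions::default()).await?;
        // Append-only delta files written since the last compaction
        // (same schema as the base, so UNION ALL lines up).
        ctx.register_parquet("events_delta", "data/events/deltas/",
                             ParquetReadOptions::default()).await?;
        ctx.sql("CREATE VIEW events AS \
                 SELECT * FROM events_base UNION ALL SELECT * FROM events_delta")
            .await?;
        Ok(())
    }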

Phase 8.5: Agent Workspaces

  • WorkspaceManager with daily/weekly/monthly/pinned tiers
  • Saved searches, shortlists, activity logs per workspace
  • Instant zero-copy handoff between agents
  • Persistence to object storage, rebuild on startup

Phase 9: Event Journal

  • journald crate: append-only mutation log
  • Event schema: entity, field, old/new value, actor, source, workspace
  • In-memory buffer with auto-flush to Parquet
  • GET /journal/history/{entity_id}, /recent, /stats
  • POST /journal/event, /update, /flush
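
One journal entry as a struct, mirroring the field list above (the
timestamp field and the concrete types are assumptions):

    use serde::{Deserialize, Serialize};

    /// One mutation record in the append-only log.
    #[derive(Debug, Serialize, Deserialize)]
    pub struct JournalEvent {
        pub entity_id: String,
        pub field: String,
        pub old_value: Option<String>,
        pub new_value: Option<String>,
        pub actor: String,             // user or agent that made the change
        pub source: String,            // service or pipeline that emitted it
        pub workspace: Option<String>,
        pub at: i64,                   // assumed epoch-millis timestamp
    }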

Phase 10: Rich Catalog v2

  • DatasetManifest: description, owner, sensitivity, columns, lineage, freshness, tags
  • PII auto-detection: email, phone, SSN, salary, address, medical
  • Column-level metadata with sensitivity flags
  • Lineage tracking: source_system → ingest_job → dataset
  • PATCH /catalog/datasets/by-name/{name}/metadata
  • Backward compatible (serde default)
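
PII auto-detection combines column-name and sample-value signals. A
minimal sketch of the email case — the regex and threshold are
illustrative, not the shipped detector:

    use regex::Regex;

    /// Email case only; the real detector also covers phone, SSN,
    /// salary, address, and medical fields.
    fn looks_like_email_column(name: &str, samples: &[&str]) -> bool {
        let email = Regex::new(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").unwrap();
        let name_hit = name.to_lowercase().contains("email");
        let value_hits = samples.iter().filter(|s| email.is_match(s)).count();
        // Illustrative threshold: flag if the name matches or if a
        // majority of sampled values look like addresses.
        name_hit || (!samples.is_empty() && value_hits * 2 > samples.len())
    }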

Phase 11: Embedding Versioning

  • IndexRegistry: model_name, model_version, dimensions per index
  • Index metadata persisted as JSON, rebuilt on startup
  • GET /vectors/indexes — list all (filter by source/model)
  • GET /vectors/indexes/{name} — metadata
  • Background jobs auto-register metadata on completion

Phase 12: Tool Registry

  • 6 built-in staffing tools (search_candidates, get_candidate, revenue_by_client, recruiter_performance, cold_leads, open_jobs)
  • Parameter validation + SQL template substitution
  • Permission levels: read / write / admin
  • Full audit trail per invocation
  • GET /tools, GET /tools/{name}, POST /tools/{name}/call, GET /tools/audit
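
A sketch of validated template substitution. Illustrative only: the
shipped registry's validation is richer, and real code should prefer
bound parameters over string substitution where the engine allows it:

    use std::collections::HashMap;

    /// Substitute only declared, named parameters into a read-only SQL
    /// template; reject values that could break out of the literal.
    fn render_tool_sql(
        template: &str, // e.g. "SELECT * FROM candidates WHERE state = '{state}'"
        declared: &[&str],
        args: &HashMap<String, String>,
    ) -> Result<String, String> {
        let mut sql = template.to_string();
        for name in declared {
            let value = args.get(*name).ok_or(format!("missing param {name}"))?;
            if value.contains('\'') || value.contains(';') {
                return Err(format!("unsafe value for {name}"));
            }
            sql = sql.replace(&format!("{{{name}}}"), value);
        }
        Ok(sql)
    }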

Phase 13: Security & Access Control

  • Role-based access: admin, recruiter, analyst, agent
  • Field-level sensitivity enforcement
  • Column masking determination per agent
  • Query audit logging
  • GET/POST /access/roles, GET /access/audit, POST /access/check

Phase 14: Schema Evolution

  • Schema diff detection (added, removed, type changed, renamed)
  • Fuzzy rename detection (shared word parts)
  • Auto-generated migration rules with confidence scores
  • AI migration prompt builder for complex cases
  • 5 unit tests
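
The added/removed/type-changed core of the diff fits in a few lines
(column types as plain strings for brevity; fuzzy rename detection
layers on top by pairing an Added with a Removed that share word parts):

    use std::collections::HashMap;

    #[derive(Debug, PartialEq)]
    enum Change {
        Added(String),
        Removed(String),
        TypeChanged { column: String, from: String, to: String },
    }

    /// Diff two schemas given as column-name → type-name maps.
    fn diff(old: &HashMap<String, String>, new: &HashMap<String, String>) -> Vec<Change> {
        let mut out = Vec::new();
        for (col, ty) in new {
            match old.get(col) {
                None => out.push(Change::Added(col.clone())),
                Some(old_ty) if old_ty != ty => out.push(Change::TypeChanged {
                    column: col.clone(),
                    from: old_ty.clone(),
                    to: ty.clone(),
                }),
                _ => {}
            }
        }
        for col in old.keys() {
            if !new.contains_key(col) {
                out.push(Change::Removed(col.clone()));
            }
        }
        out
    }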

Phase 15+: Horizon

  • HNSW vector index with iteration-friendly trial system (2026-04-16)
    • HnswStore.build_index_with_config — parameterized ef_construction, ef_search, seed
    • EmbeddingCache — pins 100K vectors in memory, shared across trials
    • harness::EvalSet — named query sets with brute-force ground truth
    • TrialJournal — append-only JSONL at _hnsw_trials/{index}.jsonl
    • Endpoints: /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best?metric={recall|latency|pareto}, /hnsw/evals, /hnsw/evals/{name}/autogen, /hnsw/cache/stats
    • Measured on 100K resumes: brute-force 44-54ms → HNSW 509us-1830us, recall 0.92-1.00 depending on ef_construction. Sweet spot: ec=80 es=30 → p50=873us recall=1.00 — locked in as HnswConfig::default()
  • Catalog manifest repair — POST /catalog/resync-missing restores row_count and columns from parquet footers (2026-04-16). All 7 staffing tables recovered to PRD-matching 2.47M rows.
  • [~] Federated multi-bucket query — foundation complete 2026-04-16, see ADR-017
    • StorageConfig.buckets + rescue_bucket + profile_root config shape
    • SecretsProvider trait + FileSecretsProvider (reads /etc/lakehouse/secrets.toml, checks 0600 perms)
    • storaged::BucketRegistry — multi-backend, rescue-aware, reachability probes
    • storaged::error_journal::ErrorJournal — append-only JSONL at primary://_errors/bucket_errors.jsonl
    • Endpoints: GET /storage/buckets, GET /storage/errors, GET /storage/bucket-health
    • Bucket-aware I/O: PUT/GET /storage/buckets/{bucket}/objects/{*key} with rescue fallback + X-Lakehouse-Rescue-Used observability headers
    • Backward compat: empty [[storage.buckets]] synthesizes a primary from legacy root
    • Three-bucket test (primary + rescue + testing) verified: normal reads, rescue fallback with headers, hard-fail missing, write to unknown bucket 503, error journal + health summary
    • X-Lakehouse-Bucket header middleware on ingest endpoints (2026-04-16)
    • Catalog migration: POST /catalog/migrate-buckets stamps bucket = "primary" on legacy refs (12 renamed, 14 total now canonical)
    • queryd registers every bucket with DataFusion for cross-bucket SQL — verified with people_test (testing) × animals (primary) CROSS JOIN
    • Profile hot-load endpoints: POST /profile/{user}/activate|deactivate (deferred to Phase 17)
    • vectord bucket-scoped paths (trial journals, eval sets per-bucket) (deferred to Phase 17)
  • Database connector ingest (Postgres first) — 2026-04-16
    • pg_stream::stream_table_to_parquet — ORDER BY + LIMIT/OFFSET pagination, configurable batch_size
    • parse_dsn — postgresql:// and postgres:// URL schemes, user/password/host/port/db (see the sketch after this list)
    • POST /ingest/db endpoint: {dsn, table, dataset_name?, batch_size?, order_by?, limit?} → streams to Parquet, registers in catalog with PII detection + redacted-password lineage
    • Existing POST /ingest/postgres/import (structured config) preserved alongside
    • 4 DSN-parser unit tests + live end-to-end test against knowledge_base.team_runs (586 rows, 13 cols, 6 batches, 196ms)
  • Phase B: Lance storage evaluation — 2026-04-16
    • crates/lance-bench standalone pilot (Lance 4.0) avoids DataFusion/Arrow version conflict with main stack
    • 8-dimension benchmark on resumes_100k_v2 — see docs/ADR-019-vector-storage.md for scorecard
    • Decision: hybrid architecture. Parquet+HNSW stays primary (2.55× faster search at 100K in-RAM). Lance added as per-profile second backend for random access (112× faster), append (0.08s vs full rewrite), hot-swap (14× faster index builds), and scale past 5M RAM ceiling.
  • Phase D: AI-safe views — 2026-04-16
    • shared::types::AiView — name, base_dataset, columns whitelist, optional row_filter, column_redactions
    • shared::types::Redaction — Null | Hash | Mask { keep_prefix, keep_suffix }
    • Registry::put_view / get_view / list_views / delete_view persisted to _catalog/views/{name}.json
    • queryd::context registers each view as a DataFusion view with the safe projection + filter + redactions baked into the SELECT
    • Endpoints: POST/GET /catalog/views, GET/DELETE /catalog/views/{name}
    • End-to-end on candidates: candidates_safe view exposes 8 of 15 columns, masks candidate_id (CAN******01), filters out status='blocked'. SELECT * FROM candidates_safe returns whitelist only; SELECT email FROM candidates_safe fails. View survives restart.
    • Capability surface — raw candidates still accessible by name; Phase 13 access control is the layer that enforces who can query what
  • Phase C: Decoupled embedding refresh — 2026-04-16
    • DatasetManifest: last_embedded_at, embedding_stale_since, embedding_refresh_policy (Manual | OnAppend | Scheduled)
    • Registry::mark_embeddings_stale / clear_embeddings_stale / stale_datasets
    • Ingest paths (CSV pipeline + Postgres streaming) auto-mark-stale when writing to an already-embedded dataset
    • vectord::refresh::refresh_index — reads dataset, diffs doc_ids vs existing embeddings, embeds only new rows, writes combined index, clears stale
    • POST /vectors/refresh/{dataset} + GET /vectors/stale
    • Id columns accept Utf8, Int32, Int64
    • End-to-end on threat_intel: initial 20-row embed 2.1s; re-ingest to 54 rows auto-marks stale; delta refresh embeds only 34 new in 970ms (6× faster than full re-embed); stale cleared
  • Database connector ingest (MySQL; Postgres shipped above)
  • PDF OCR (Tesseract)
  • Scheduled ingest (cron)
  • Fine-tuned domain models
  • Multi-node query distribution
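
A sketch of the shape parse_dsn can take, referenced from the Postgres
connector item above (illustrative; the real parser and its error
reporting live in the ingest crate):

    /// Parsed connection pieces for postgresql:// and postgres:// DSNs.
    #[derive(Debug, PartialEq)]
    struct Dsn {
        user: String,
        password: Option<String>,
        host: String,
        port: u16,
        db: String,
    }

    fn parse_dsn(dsn: &str) -> Option<Dsn> {
        // postgresql://user:pass@host:port/db  (postgres:// accepted too)
        let rest = dsn
            .strip_prefix("postgresql://")
            .or_else(|| dsn.strip_prefix("postgres://"))?;
        let (creds, tail) = rest.split_once('@')?;
        let (hostport, db) = tail.split_once('/')?;
        let (user, password) = match creds.split_once(':') {
            Some((u, p)) => (u.to_string(), Some(p.to_string())),
            None => (creds.to_string(), None),
        };
        let (host, port) = match hostport.split_once(':') {
            Some((h, p)) => (h.to_string(), p.parse().ok()?),
            None => (hostport.to_string(), 5432), // default Postgres port
        };
        Some(Dsn { user, password, host, port, db: db.to_string() })
    }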

30 unit tests | 11 crates | 16 ADRs | 2.47M rows | 100K vectors | All built 2026-03-27; HNSW trial system 2026-04-16