Four shipped features and a PRD realignment, all measured end-to-end:
HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
/hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms → HNSW 873us
at 100% recall@10. ec=80 es=30 locked in as HnswConfig::default()
- Lower ec values trade recall for build time: ec/es = 20/30 gives 0.96 recall in 8s;
80/30 gives 1.00 recall in 230s
Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows
Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
/etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
/storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
X-Lakehouse-Rescue-Used observability headers on fallback
Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
(586 rows × 13 cols, 6 batches, 196ms end-to-end)
PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
decision rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Execution Plan — Phases B through E

**Created:** 2026-04-16

**Status:** Active planning document — update as phases complete or scope shifts

**Owner:** J

This plan sequences the work J and Claw agreed on during the 2026-04-16 reframe session, after stress-testing the "dual-use substrate" vision and aligning it with llms3.com's architectural patterns.

---

## The four phases, at a glance

| Phase | Work | Prereq | Estimated cost | Risk |
|---|---|---|---|---|
| B | Lance pilot on one vector index | None | 1 focused session | Medium — new dep, unfamiliar surface |
| C | Decoupled embedding refresh pipeline | Benefits from B's outcome | 1 focused session | Low — additive, doesn't break existing |
| D | AI-safe views | Phase 13 (done) | 1 focused session | Low — builds on existing catalog + tool registry |
| E | Soft deletes / tombstones | None | 1-2 focused sessions | Medium — touches query path, compaction |

Each "focused session" ≈ 3-4 hours of coding + verification + doc update.

---

## Phase B — Lance pilot (the storage format question)

### Why now

J's LLMS3 knowledge base explicitly positions Lance as `alternative_to` Parquet for vector workloads. We admitted in the 2026-04-16 stress test that our Parquet-portability argument for vectors is weaker than advertised — our vector Parquet blobs aren't readable by DuckDB/Polars anyway. Lance could unlock: random-row access, disk-resident indexes, time-travel, better compression. Or it could disappoint and we lock in Parquet with a written reason why.

**Commit to one answer backed by measurements.** No ambiguity after this phase.

### Scope

1. Add `lance` crate as a dep in `vectord`, behind a `feature = "lance"` flag initially so we can build without forcing users to install it
2. New module `vectord::lance_store` — mirror of `vectord::store` but against Lance format
3. New endpoint `POST /vectors/lance/index` — build a Lance index for a named source (parallel to the existing Parquet path)
4. Benchmark script that runs against `resumes_100k_v2` (the existing 100K reference)

### Measured dimensions

All benchmarks against `resumes_100k_v2` (100K × 768d), cold start:
| Metric | Parquet baseline | Lance | Threshold to migrate |
|---|---|---|---|
| Cold load from disk | ~2.8s (measured) | ? | ≥2× faster |
| Search latency p50 | 873us (ec=80 es=30) | ? | Within 50% |
| Disk size | 330MB | ? | Comparable or better |
| Single-row random access | N/A (full scan) | ? | <10ms |
| Append 10K new rows | Full rewrite (~3s) | ? | Incremental <500ms |
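
As a sketch of how the benchmark script might fill these cells, a dependency-free timing harness (the helpers are hypothetical; the real script would wrap the actual Parquet and Lance loaders and the HNSW search call):

```rust
use std::time::Instant;

// Hypothetical timing helpers for the Phase B benchmark script.
// The workloads wrapped below are stand-ins, not the real loaders.
fn time_us<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_micros())
}

// Report p50 (median), matching the "Search latency p50" row: a median
// is robust to one slow outlier in a way a mean is not.
fn p50_us(mut samples: Vec<u128>) -> u128 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn main() {
    // Stand-in for a cold load; the real script would open the 330MB file.
    let (_, cold_us) = time_us(|| (0..1_000_000u64).sum::<u64>());
    let search_p50 = p50_us(vec![900, 873, 1_200, 850, 910]);
    println!("cold: {cold_us}us, search p50: {search_p50}us");
}
```

Each cell in the table above would be one `time_us` measurement (or a `p50_us` over repeated runs) against the real loader or search path.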
### Decision rules
- **Lance wins cold-load by ≥2× AND matches search latency:** migrate vector storage to Lance. Dataset tables stay Parquet. Update ADR-008 → ADR-019.
- **Lance within 50% across the board:** stay Parquet. Document the ceiling honestly in the PRD (already done). Revisit Lance when we have a problem Parquet can't solve.
- **Lance loses:** close the door, don't revisit without new evidence. Write ADR-019 as "Lance evaluated, rejected, here's why with numbers."
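
Applied mechanically, the rules above reduce to a small function. A sketch (the enum names and threshold encoding are ours; both ratios are Lance/Parquet, so lower is better):

```rust
#[derive(Debug, PartialEq)]
enum StorageDecision {
    MigrateVectorsToLance, // dataset tables stay Parquet either way
    StayParquet,
    RejectLance,
}

/// `cold_ratio` = Lance cold-load / Parquet cold-load (≤ 0.5 means "≥2× faster").
/// `latency_ratio` = Lance search p50 / Parquet search p50 (≤ 1.5 means "within 50%").
fn decide(cold_ratio: f64, latency_ratio: f64) -> StorageDecision {
    if cold_ratio <= 0.5 && latency_ratio <= 1.5 {
        // Lance wins cold-load by ≥2× AND matches search latency.
        StorageDecision::MigrateVectorsToLance
    } else if cold_ratio <= 1.5 && latency_ratio <= 1.5 {
        // Within 50% across the board: no compelling reason to move.
        StorageDecision::StayParquet
    } else {
        StorageDecision::RejectLance
    }
}

fn main() {
    println!("{:?}", decide(0.4, 1.1));
}
```

The point is that ADR-019 can quote the measured ratios and the rule, and the decision follows with no judgment call left.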

### Success gate

Benchmark output table posted to `docs/ADR-019-vector-storage.md` with measured numbers in each cell of the table above. Decision rule applied mechanically. No "let's defer the call."

### Rollback

The `feature = "lance"` flag means if the pilot goes badly, `cargo build` without the flag is unchanged. No production migration happens until ADR-019 commits to the change. Safe experiment.

---

## Phase C — Decoupled embedding refresh

### Why now

llms3.com's lakehouse architecture explicitly separates "transactional data mutations" from "asynchronous vector refresh cycles." Today we couple them — an ingest writes rows AND embeddings in one flow. That means:
- Adding 1K rows to a 100K-row dataset forces re-embedding of ALL rows (or nothing)
- No notion of "embeddings are stale, schedule a refresh tonight"
- The embedding cost (Ollama-bound, the bottleneck) is synchronous with ingest

### Scope

1. Add fields to `DatasetManifest`:
   - `last_embedded_at: Option<DateTime>`
   - `embedding_stale_since: Option<DateTime>` (set when data written but embeddings not refreshed)
   - `embedding_refresh_policy: RefreshPolicy` — `Manual` | `OnAppend` | `Scheduled(cron)`
2. Decouple ingest from embed: ingest writes data + marks embeddings stale; embedding runs separately
3. New endpoint: `POST /vectors/refresh/{dataset}` — diffs existing vectors vs current rows, only embeds new/changed (keyed by `doc_id`)
4. Background scheduler (or cron trigger) — for `Scheduled` policy, re-runs refresh per schedule
5. `GET /vectors/stale` — lists datasets with stale embeddings and how stale
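
A minimal sketch of the manifest additions and the refresh diff. Types are simplified for illustration: timestamps are plain strings rather than `DateTime`, and `ids_to_embed` stands in for the `doc_id`-keyed diff the refresh endpoint would run:

```rust
use std::collections::HashSet;

// Sketch of the Phase C manifest additions (simplified types).
#[derive(Debug)]
enum RefreshPolicy {
    Manual,            // default for existing datasets — no behavior change
    OnAppend,
    Scheduled(String), // cron expression
}

#[derive(Debug)]
struct EmbeddingState {
    last_embedded_at: Option<String>,
    embedding_stale_since: Option<String>,
    refresh_policy: RefreshPolicy,
}

/// Core of the hypothetical /vectors/refresh diff: given the doc_ids that
/// already have vectors and the doc_ids currently in the dataset, return
/// only the ids that still need embedding.
fn ids_to_embed<'a>(embedded: &HashSet<&'a str>, current: &[&'a str]) -> Vec<&'a str> {
    current.iter().copied().filter(|id| !embedded.contains(id)).collect()
}

fn main() {
    let state = EmbeddingState {
        last_embedded_at: Some("2026-04-16T10:00:00Z".into()),
        embedding_stale_since: None,
        refresh_policy: RefreshPolicy::Manual,
    };
    let embedded: HashSet<&str> = ["d1", "d2"].into_iter().collect();
    let todo = ids_to_embed(&embedded, &["d1", "d2", "d3", "d4"]);
    println!("{:?}, to embed: {todo:?}", state.refresh_policy);
}
```

This is why the measured-success scenario below embeds only the appended rows: the diff never touches ids that already have vectors.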

### Measured success

- Ingest a 1K-row append to `kb_team_runs` (currently 586 rows, Postgres-sourced).
- Catalog shows `embedding_stale_since = now`.
- `POST /vectors/refresh/kb_team_runs` embeds only the 1K new rows, not all 1586.
- Result: new rows searchable, old embeddings unchanged, total Ollama time ~5s instead of ~30s.

### Dependencies on Phase B

If Lance wins Phase B, this is dramatically easier — Lance supports native append. If we stay Parquet, we need a "vectors delta" Parquet file that merges at read time (same pattern as Phase 8's data delta files). ~100 extra LOC if we stay Parquet.
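
If we stay on Parquet, the delta file behaves like Phase 8's data deltas: rows keyed by `doc_id`, delta winning on conflict at read time. A dependency-free sketch of that merge, with rows reduced to key/payload pairs for illustration:

```rust
use std::collections::HashMap;

/// Merge-on-read sketch for the Parquet fallback: delta rows override
/// base rows by key. A "row" here is just (key, payload); the real code
/// would merge Arrow record batches.
fn merge_on_read(base: &[(&str, &str)], delta: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut merged: HashMap<&str, &str> = base.iter().copied().collect();
    for &(k, v) in delta {
        merged.insert(k, v); // last write wins
    }
    let mut rows: Vec<(String, String)> = merged
        .into_iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    rows.sort(); // deterministic order for display
    rows
}

fn main() {
    let merged = merge_on_read(&[("a", "1"), ("b", "2")], &[("b", "9"), ("c", "3")]);
    println!("{merged:?}");
}
```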

### Rollback

The `refresh_policy` field defaults to `Manual` for all existing datasets, so no behavior change for anything already in the system. Opt-in per dataset.

---

## Phase D — AI-safe views

### Why now

llms3.com's framing: "AI-safe views enforcing row/column security + PII tokenization before model exposure." Phase 13 gave us role-based column masking at query time. That's per-query enforcement. "Views" means pre-materialized: create `candidates_safe` once, bind model X to that view, model X can never accidentally see raw `candidates`.
This is also the precondition for Phase 17 (model profiles) to be meaningfully safe. "Bind model to dataset" isn't enough — needs to be "bind model to a safe view of the dataset."

### Scope

1. New catalog entity: `AiView` with fields `name`, `base_dataset`, `columns[]` (whitelist), `row_filter` (optional SQL WHERE clause), `column_redactions[]` (PII tokenization rules)
2. Persistence: `_catalog/views/{name}.json` alongside manifests
3. Query-rewrite layer: when a query references `candidates_safe`, DataFusion sees an equivalent `SELECT (whitelisted cols) FROM candidates WHERE (row_filter)` — with redactions applied as expressions
4. Endpoints: `POST /catalog/views` to create, `GET /catalog/views` to list, `GET /catalog/views/{name}/preview` to see what the view exposes
5. Tool registry integration: tools can bind to an AiView instead of a raw table; agent invocations go through the view
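
A sketch of the `AiView` shape and the rewrite it implies. The struct fields follow the scope list; emitting a SQL string is illustrative only — the real layer would rewrite DataFusion plans, and `column_redactions` is omitted here:

```rust
/// Sketch of the AiView catalog entity (column_redactions omitted).
struct AiView {
    name: String,
    base_dataset: String,
    columns: Vec<String>,       // whitelist
    row_filter: Option<String>, // optional SQL WHERE clause
}

impl AiView {
    /// Illustrative string form of the rewrite the query layer performs.
    fn to_sql(&self) -> String {
        let cols = self.columns.join(", ");
        match &self.row_filter {
            Some(f) => format!("SELECT {cols} FROM {} WHERE {f}", self.base_dataset),
            None => format!("SELECT {cols} FROM {}", self.base_dataset),
        }
    }
}

fn main() {
    let v = AiView {
        name: "candidates_safe".into(),
        base_dataset: "candidates".into(),
        columns: vec!["candidate_id".into(), "skills".into(), "city".into()],
        row_filter: Some("status != 'blocked'".into()),
    };
    println!("{}", v.to_sql());
}
```

Because the agent's query is rewritten against this definition, columns outside the whitelist and rows failing the filter never reach the model, regardless of what SQL it writes.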

### Measured success

- Create view `candidates_safe` = `SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'`.
- Agent (tool registry) calls `search_candidates` bound to `candidates_safe`.
- Agent cannot see `email`, `phone`, `ssn`, or `status='blocked'` rows, even if it writes raw SQL.
- Audit log records agent accessed `candidates_safe`, not `candidates`.
### Dependencies
- Phase 13 already provides the sensitivity classification layer
- Phase 12 tool registry already exists
- This phase is the bridge between them for agent access

### Rollback

Views are additive. Dropping the feature = delete view definitions, tool registry falls back to direct table access. No data migration needed.

---

## Phase E — Soft deletes / tombstones

### Why now

GDPR/CCPA compliance for staffing data. Today, `ops::delete` physically deletes a Parquet object — fine for whole datasets, useless for "delete one candidate's record." To delete one row we'd have to rewrite the whole `candidates.parquet`, which at 100K rows means ~10MB of churn per deletion.
llms3.com lists "deletion vectors" as a core lakehouse pattern (Iceberg/Delta/Hudi all implement it). This is the single biggest compliance gap in the current system.

### Scope

1. New sidecar per dataset: `{dataset}_tombstones.parquet` with columns `{row_key, deleted_at, actor, reason}`
2. Delete API: `POST /catalog/datasets/{name}/tombstone` with `{row_keys[], reason, actor}`
3. Query-time filter: `queryd` automatically LEFT JOINs tombstones and filters out deleted rows
4. Compaction integration: Phase 8 compaction reads base + delta + tombstones, writes a clean base without tombstoned rows, clears the tombstone sidecar
5. Event journal integration (Phase 9): every tombstone emits a journal event with full context
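
The query-time filter in step 3 is, at heart, an anti-join on `row_key`. A dependency-free sketch — an in-memory stand-in for what `queryd` would express as a join against the tombstone sidecar:

```rust
use std::collections::HashSet;

/// Query-time tombstone filtering, sketched: drop any row whose key
/// appears in the tombstone sidecar. The real path would express this
/// as a join in the query plan rather than an in-memory HashSet.
fn filter_tombstoned<'a>(
    rows: &[(&'a str, &'a str)], // (row_key, payload)
    tombstoned: &HashSet<&'a str>,
) -> Vec<(&'a str, &'a str)> {
    rows.iter().copied().filter(|(k, _)| !tombstoned.contains(k)).collect()
}

fn main() {
    let tombstoned: HashSet<&str> = ["CAND-123"].into_iter().collect();
    let live = filter_tombstoned(&[("CAND-122", "ok"), ("CAND-123", "gone")], &tombstoned);
    println!("{live:?}");
}
```

Compaction then makes the filter's result permanent: rewrite the base without tombstoned keys and clear the sidecar entries.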

### Measured success

- `POST /catalog/datasets/candidates/tombstone` with `{row_keys: ["CAND-123"], reason: "GDPR request", actor: "legal@company"}`
- `SELECT COUNT(*) FROM candidates` drops by 1 immediately
- `SELECT * FROM candidates WHERE candidate_id = 'CAND-123'` returns empty
- `GET /journal/history/CAND-123` shows the tombstone event
- After scheduled compaction, the tombstone is materialized — `candidates.parquet` no longer contains CAND-123, and the tombstone sidecar entry for that row key is cleared

### Dependencies

- Phase 8 delta/merge-on-read pattern (done) — tombstones are a third layer at read time
- Phase 9 event journal (done) — tombstones emit journal events

### Rollback
If query rewrite becomes too complex, fallback: tombstones stored but applied only during compaction (not at query time). Queries return deleted rows until compaction runs. Less useful but safer.

---

## Cross-phase concerns

### Phases that need federation layer 2 (task #5)

Every phase above assumes the federation foundation (shipped 2026-04-16) but NOT federation layer 2 (cross-bucket SQL, profile activation, `X-Lakehouse-Bucket` header).
**Implication:** Phases B-E can proceed on the `primary` bucket without blocking on federation layer 2. Federation layer 2 becomes valuable when we want multi-profile model scoping (Phase 17). Sequence:

```
A (done) → B → C (+D in parallel) → federation layer 2 → Phase 16 → Phase 17 → E
```

### Phases that need federation layer 2 FIRST

None of B/C/D/E strictly need it. Phase 16 (hot-swap) benefits from it. Phase 17 (model profiles) depends on it heavily.

### What NOT to build in B-E

- Distributed query — wait for a real scale problem
- Replacement of DataFusion — working fine, stay put
- Iceberg/Delta Lake migration — explicitly out of scope per ADR-009
- Live streaming / CDC — explicit non-goal

### Definition of done for each phase

Each phase completes when:
1. Code shipped and building clean
2. Success gate measurably passed
3. Relevant ADR added to `docs/DECISIONS.md` (or updated)
4. `docs/PHASES.md` checkbox flipped with measurement data
5. PRD invariants checked — if a new invariant emerged, add it
6. One regression test in the crate or HTTP integration test

---

## Session plan (what to do in what order)

### Next session

1. **Phase B — Lance pilot.** Single session. Answers the biggest open architectural question.
2. Based on outcome, **write ADR-019** with the decision + data.
3. **Update PHASES.md** with Phase 18 status (Lance evaluated).

### Session after

4. **Phase C — Decoupled embedding refresh.** Implementation shaped by B's outcome (append is easy on Lance, requires delta logic on Parquet).

### Session after that

5. **Federation layer 2** OR **Phase D (AI-safe views)** — J decides based on priority. Federation layer 2 unlocks model profiles (Phase 17); AI-safe views deliver standalone value.

### Final session for this track

6. **Phase E — Soft deletes.** The compliance-driven phase. Fits cleanly after everything else because it touches the query path and wants to be built after query optimizations stabilize.

### Milestone checkpoint

After Phase E, stop and reassess. We'll have:
- Lance decision made and committed
- Decoupled embedding pipeline
- AI-safe view enforcement
- Soft delete semantics
That's a substantial capability increase, and a plausible "pause, write a retrospective, decide on Phase 16/17" moment.