# Execution Plan — Phases B through E

**Created:** 2026-04-16
**Status:** Active planning document — update as phases complete or scope shifts
**Owner:** J

This plan sequences the work J and Claw agreed on during the 2026-04-16 reframe session, after stress-testing the "dual-use substrate" vision and aligning it with llms3.com's architectural patterns.

---

## The four phases, at a glance

| Phase | Work | Prereq | Estimated cost | Risk |
|---|---|---|---|---|
| B | Lance pilot on one vector index | None | 1 focused session | Medium — new dep, unfamiliar surface |
| C | Decoupled embedding refresh pipeline | Benefits from B's outcome | 1 focused session | Low — additive, doesn't break existing |
| D | AI-safe views | Phase 13 (done) | 1 focused session | Low — builds on existing catalog + tool registry |
| E | Soft deletes / tombstones | None | 1-2 focused sessions | Medium — touches query path, compaction |

Each "focused session" ≈ 3-4 hours of coding + verification + doc update.

---

## Phase B — Lance pilot (the storage format question)

### Why now

J's LLMS3 knowledge base explicitly positions Lance as `alternative_to` Parquet for vector workloads. We admitted in the 2026-04-16 stress test that our Parquet-portability argument for vectors is weaker than advertised — our vector Parquet blobs aren't readable by DuckDB/Polars anyway. Lance could unlock: random-row access, disk-resident indexes, time-travel, better compression. Or it could disappoint and we lock in Parquet with a written reason why. **Commit to one answer backed by measurements.** No ambiguity after this phase.

### Scope

1. Add the `lance` crate as a dep in `vectord`, behind a `feature = "lance"` flag initially so we can build without forcing users to install it (the gating is sketched at the end of this phase)
2. New module `vectord::lance_store` — mirror of `vectord::store` but against the Lance format
3. New endpoint `POST /vectors/lance/index` — build a Lance index for a named source (parallel to the existing Parquet path)
4. Benchmark script that runs against `resumes_100k_v2` (the existing 100K reference)

### Measured dimensions

All benchmarks against `resumes_100k_v2` (100K × 768d), cold start:

| Metric | Parquet baseline | Lance | Threshold to migrate |
|---|---|---|---|
| Cold load from disk | ~2.8s (measured) | ? | ≥2× faster |
| Search latency p50 | 873us (ec=80 es=30) | ? | Within 50% |
| Disk size | 330MB | ? | Comparable or better |
| Single-row random access | N/A (full scan) | ? | <10ms |
| Append 10K new rows | Full rewrite (~3s) | ? | Incremental <500ms |

### Decision rules

- **Lance wins cold-load by ≥2× AND matches search latency:** migrate vector storage to Lance. Dataset tables stay Parquet. Update ADR-008 → ADR-019.
- **Lance within 50% across the board:** stay Parquet. Document the ceiling honestly in the PRD (already done). Revisit Lance when we have a problem Parquet can't solve.
- **Lance loses:** close the door, don't revisit without new evidence. Write ADR-019 as "Lance evaluated, rejected, here's why with numbers."

### Success gate

Benchmark output table posted to `docs/ADR-019-vector-storage.md` with measured numbers in each cell of the table above. Decision rule applied mechanically. No "let's defer the call."

### Rollback

The `feature = "lance"` flag means that if the pilot goes badly, `cargo build` without the flag is unchanged. No production migration happens until ADR-019 commits to the change. Safe experiment.
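To make the flag-based isolation concrete, here is a minimal sketch of how the pilot could be structured: one small backend trait that the existing Parquet store and the Lance pilot both implement, with the Lance side compiled only when the feature is enabled. Only the `feature = "lance"` flag, the `lance_store` module name, and the `resumes_100k_v2` source name come from the plan above; the trait name, `SearchHit`, the method signatures, and the error alias are assumptions for illustration, not the shipped design.

```rust
// Hypothetical sketch, not the shipped code. Both stores implement one trait,
// so the benchmark harness and the new endpoint can swap backends without
// touching the query path.

pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

pub struct SearchHit {
    pub doc_id: String,
    pub score: f32,
}

pub trait VectorStore: Sized {
    /// Open the index for a named source, e.g. "resumes_100k_v2".
    fn open(source: &str) -> Result<Self>;

    /// Nearest-neighbour search over the loaded index.
    fn search(&self, query: &[f32], k: usize) -> Result<Vec<SearchHit>>;

    /// Append new rows: a full rewrite on Parquet, ideally incremental on Lance.
    fn append(&mut self, ids: &[String], vectors: &[Vec<f32>]) -> Result<()>;
}

// Compiled only with `--features lance`, so a plain `cargo build` is unchanged.
#[cfg(feature = "lance")]
pub mod lance_store {
    // The Lance-backed implementation of the trait lives here during the pilot.
}
```

Writing the benchmark harness against the trait keeps the Phase B numbers like-for-like, and the flag can be removed in either direction once ADR-019 lands.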
---

## Phase C — Decoupled embedding refresh

### Why now

llms3.com's lakehouse architecture explicitly separates "transactional data mutations" from "asynchronous vector refresh cycles." Today we couple them — an ingest writes rows AND embeddings in one flow. That means:

- Adding 1K rows to a 100K-row dataset forces re-embedding of ALL rows (or nothing)
- No notion of "embeddings are stale, schedule a refresh tonight"
- The embedding cost (Ollama-bound, the bottleneck) is synchronous with ingest

### Scope

1. Add fields to `DatasetManifest`:
   - `last_embedded_at: Option`
   - `embedding_stale_since: Option` (set when data is written but embeddings are not refreshed)
   - `embedding_refresh_policy: RefreshPolicy` — `Manual` | `OnAppend` | `Scheduled(cron)`
2. Decouple ingest from embed: ingest writes data + marks embeddings stale; embedding runs separately
3. New endpoint: `POST /vectors/refresh/{dataset}` — diffs existing vectors vs current rows, only embeds new/changed rows (keyed by `doc_id`)
4. Background scheduler (or cron trigger) — for the `Scheduled` policy, re-runs refresh per schedule
5. `GET /vectors/stale` — lists datasets with stale embeddings and how stale they are

### Measured success

- Ingest a 1K-row append to `kb_team_runs` (currently 586 rows, Postgres-sourced).
- Catalog shows `embedding_stale_since = now`.
- `POST /vectors/refresh/kb_team_runs` embeds only the 1K new rows, not all 1586.
- Result: new rows searchable, old embeddings unchanged, total Ollama time ~5s instead of ~30s.

### Dependencies on Phase B

If Lance wins Phase B, this is dramatically easier — Lance supports native append. If we stay Parquet, we need a "vectors delta" Parquet file that merges at read time (the same pattern as Phase 8's data delta files). ~100 extra LOC if we stay Parquet.

### Rollback

The `refresh_policy` field defaults to `Manual` for all existing datasets, so there is no behavior change for anything already in the system. Opt-in per dataset.

---

## Phase D — AI-safe views

### Why now

llms3.com's framing: "AI-safe views enforcing row/column security + PII tokenization before model exposure." Phase 13 gave us role-based column masking at query time. That's per-query enforcement. "Views" means pre-materialized: create `candidates_safe` once, bind model X to that view, and model X can never accidentally see raw `candidates`.

This is also the precondition for Phase 17 (model profiles) to be meaningfully safe. "Bind model to dataset" isn't enough — it needs to be "bind model to a safe view of the dataset."

### Scope

1. New catalog entity: `AiView` with fields `name`, `base_dataset`, `columns[]` (whitelist), `row_filter` (optional SQL WHERE clause), `column_redactions[]` (PII tokenization rules) (sketched after the success criteria below)
2. Persistence: `_catalog/views/{name}.json` alongside manifests
3. Query-rewrite layer: when a query references `candidates_safe`, DataFusion sees an equivalent `SELECT (whitelisted cols) FROM candidates WHERE (row_filter)` — with redactions applied as expressions
4. Endpoints: `POST /catalog/views` to create, `GET /catalog/views` to list, `GET /catalog/views/{name}/preview` to see what the view looks like
5. Tool registry integration: tools can bind to an AiView instead of a raw table; agent invocations go through the view

### Measured success

- Create view `candidates_safe` = `SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'`.
- Agent (tool registry) calls `search_candidates` bound to `candidates_safe`.
- Agent cannot see `email`, `phone`, `ssn`, or `status='blocked'` rows, even if it writes raw SQL.
- Audit log records that the agent accessed `candidates_safe`, not `candidates`.
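As referenced in scope item 1, here is a minimal sketch of the `AiView` entity and the rewrite from scope item 3, worked through on the `candidates_safe` example above. The field names listed in the scope are taken as-is; the redaction struct, the concrete field types, and the `to_sql` helper are assumptions for illustration, not the shipped design.

```rust
// Hypothetical sketch of the AiView catalog entity and its expansion into the
// SELECT that DataFusion actually plans. Persistence/serde is omitted.

pub struct ColumnRedaction {
    pub column: String,
    /// SQL expression that replaces the raw column, e.g. a tokenization UDF call.
    pub expression: String,
}

pub struct AiView {
    pub name: String,                         // "candidates_safe"
    pub base_dataset: String,                 // "candidates"
    pub columns: Vec<String>,                 // whitelist
    pub row_filter: Option<String>,           // optional SQL WHERE clause
    pub column_redactions: Vec<ColumnRedaction>,
}

impl AiView {
    /// Expand the view into the equivalent query over the base dataset.
    pub fn to_sql(&self) -> String {
        // Whitelisted columns pass through; redacted ones are replaced by their
        // redaction expression, aliased back to the original column name.
        let cols: Vec<String> = self
            .columns
            .iter()
            .map(|c| match self.column_redactions.iter().find(|r| &r.column == c) {
                Some(r) => format!("{} AS {}", r.expression, c),
                None => c.clone(),
            })
            .collect();
        let mut sql = format!("SELECT {} FROM {}", cols.join(", "), self.base_dataset);
        if let Some(filter) = &self.row_filter {
            sql.push_str(&format!(" WHERE {}", filter));
        }
        sql
    }
}

// candidates_safe from the success criteria above would expand to:
//   "SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'"
```

Because the expansion happens before planning, an agent that writes raw SQL against the view still only ever touches the whitelisted, filtered relation.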
### Dependencies

- Phase 13 already provides the sensitivity classification layer
- Phase 12's tool registry already exists
- This phase is the bridge between them for agent access

### Rollback

Views are additive. Dropping the feature = delete the view definitions; the tool registry falls back to direct table access. No data migration needed.

---

## Phase E — Soft deletes / tombstones

### Why now

GDPR/CCPA compliance for staffing data. Today, `ops::delete` physically deletes a Parquet object — fine for whole datasets, useless for "delete one candidate's record." To delete one row we'd have to rewrite the whole `candidates.parquet`, which at 100K rows is 10MB of churn per deletion. llms3.com lists "deletion vectors" as a core lakehouse pattern (Iceberg/Delta/Hudi all implement it). This is the single biggest compliance gap in the current system.

### Scope

1. New sidecar per dataset: `{dataset}_tombstones.parquet` with columns `{row_key, deleted_at, actor, reason}`
2. Delete API: `POST /catalog/datasets/{name}/tombstone` with `{row_keys[], reason, actor}`
3. Query-time filter: `queryd` automatically LEFT JOINs tombstones and filters out deleted rows (a sketch follows at the end of this phase)
4. Compaction integration: Phase 8 compaction reads base + delta + tombstones, writes a clean base without tombstoned rows, clears the tombstone sidecar
5. Event journal integration (Phase 9): every tombstone emits a journal event with full context

### Measured success

- `POST /catalog/datasets/candidates/tombstone` with `{row_keys: ["CAND-123"], reason: "GDPR request", actor: "legal@company"}`
- `SELECT COUNT(*) FROM candidates` drops by 1 immediately
- `SELECT * FROM candidates WHERE candidate_id = 'CAND-123'` returns empty
- `GET /journal/history/CAND-123` shows the tombstone event
- After scheduled compaction, the tombstone is materialized — `candidates.parquet` no longer contains CAND-123, and the tombstone sidecar is emptied for that row key

### Dependencies

- Phase 8 delta/merge-on-read pattern (done) — tombstones are a third layer at read time
- Phase 9 event journal (done) — tombstones emit journal events

### Rollback

If the query rewrite becomes too complex, the fallback is: tombstones are stored but applied only during compaction (not at query time). Queries return deleted rows until compaction runs. Less useful but safer.
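As referenced in scope item 3, a sketch of the query-time filter. The scope describes it as a LEFT JOIN plus filter; the same effect can be written as an anti-join subquery, which is what the hypothetical helper below produces. The sidecar schema and its `row_key` column come from scope item 1; the helper itself, its naming, and the assumption that the sidecar is registered as a `{dataset}_tombstones` SQL table are illustrations, not the shipped design.

```rust
// Hypothetical sketch of the rewrite queryd could apply wherever a base table
// is referenced, so tombstoned rows never reach the caller.

/// Rewrites a scan of `dataset`, excluding any row whose key appears in the
/// tombstone sidecar. `row_key` is the column the sidecar is keyed on.
fn tombstone_filtered_scan(dataset: &str, row_key: &str) -> String {
    format!(
        "SELECT * FROM {d} \
         WHERE {k} NOT IN (SELECT row_key FROM {d}_tombstones)",
        d = dataset,
        k = row_key
    )
}

// Example expansion for `FROM candidates`:
//   SELECT * FROM candidates
//   WHERE candidate_id NOT IN (SELECT row_key FROM candidates_tombstones)
// so `SELECT COUNT(*) FROM candidates` drops immediately after CAND-123 is
// tombstoned, before any compaction has run.
```

The anti-join form and the LEFT JOIN ... IS NULL form from the scope are equivalent as long as `row_key` is never null, so whichever DataFusion plans better can be used.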
---

## Cross-phase concerns

### Relationship to federation layer 2 (task #5)

Every phase above assumes the federation foundation (shipped 2026-04-16) but NOT federation layer 2 (cross-bucket SQL, profile activation, `X-Lakehouse-Bucket` header).

**Implication:** Phases B-E can proceed on the `primary` bucket without blocking on federation layer 2. Federation layer 2 becomes valuable when we want multi-profile model scoping (Phase 17). Sequence:

```
A (done) → B → C (+D in parallel) → federation layer 2 → Phase 16 → Phase 17 → E
```

### Phases that need federation layer 2 FIRST

None of B/C/D/E strictly need it. Phase 16 (hot-swap) benefits from it. Phase 17 (model profiles) depends on it heavily.

### What NOT to build in B-E

- Distributed query — wait for a real scale problem
- Replacement of DataFusion — working fine, stay put
- Iceberg/Delta Lake migration — explicitly out of scope per ADR-009
- Live streaming / CDC — explicit non-goal

### Definition of done for each phase

Each phase completes when:

1. Code shipped and building clean
2. Success gate measurably passed
3. Relevant ADR added to `docs/DECISIONS.md` (or updated)
4. `docs/PHASES.md` checkbox flipped with measurement data
5. PRD invariants checked — if a new invariant emerged, add it
6. One regression test in the crate or as an HTTP integration test

---

## Session plan (what to do in what order)

### Next session

1. **Phase B — Lance pilot.** Single session. Answers the biggest open architectural question.
2. Based on the outcome, **write ADR-019** with the decision + data.
3. **Update PHASES.md** with Phase 18 status (Lance evaluated).

### Session after

4. **Phase C — Decoupled embedding refresh.** Implementation shaped by B's outcome (append is easy on Lance, requires delta logic on Parquet).

### Session after that

5. **Federation layer 2** OR **Phase D (AI-safe views)** — J decides based on priority. Federation layer 2 unlocks model profiles (Phase 17); AI-safe views deliver standalone value.

### Final session for this track

6. **Phase E — Soft deletes.** The compliance-driven phase. It fits cleanly after everything else because it touches the query path and wants to be built after query optimizations stabilize.

### Milestone checkpoint

After Phase E, stop and reassess. We'll have:

- Lance decision made and committed
- Decoupled embedding pipeline
- AI-safe view enforcement
- Soft delete semantics

That's a substantial capability increase. A plausible "pause, write a retrospective, decide on Phase 16/17" moment.