Four shipped features and a PRD realignment, all measured end-to-end:
HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
/hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms → HNSW 873us
at 100% recall@10. ec=80 es=30 locked in as HnswConfig::default()
- Lower ec values trade recall for build time: ec/es = 20/30 gives 0.96 recall in 8s;
80/30 gives 1.00 recall in 230s
Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows
Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
/etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
/storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
X-Lakehouse-Rescue-Used observability headers on fallback
Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
(586 rows × 13 cols, 6 batches, 196ms end-to-end)
PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
decision rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Execution Plan — Phases B through E

**Created:** 2026-04-16

**Status:** Active planning document — update as phases complete or scope shifts

**Owner:** J

This plan sequences the work J and Claw agreed on during the 2026-04-16 reframe session, after stress-testing the "dual-use substrate" vision and aligning it with llms3.com's architectural patterns.

---

## The four phases, at a glance

| Phase | Work | Prereq | Estimated cost | Risk |
|---|---|---|---|---|
| B | Lance pilot on one vector index | None | 1 focused session | Medium — new dep, unfamiliar surface |
| C | Decoupled embedding refresh pipeline | Benefits from B's outcome | 1 focused session | Low — additive, doesn't break existing |
| D | AI-safe views | Phase 13 (done) | 1 focused session | Low — builds on existing catalog + tool registry |
| E | Soft deletes / tombstones | None | 1-2 focused sessions | Medium — touches query path, compaction |

Each "focused session" ≈ 3-4 hours of coding + verification + doc update.

---

## Phase B — Lance pilot (the storage format question)

### Why now

J's LLMS3 knowledge base explicitly positions Lance as `alternative_to` Parquet for vector workloads. We admitted in the 2026-04-16 stress test that our Parquet-portability argument for vectors is weaker than advertised — our vector Parquet blobs aren't readable by DuckDB/Polars anyway. Lance could unlock: random-row access, disk-resident indexes, time-travel, better compression. Or it could disappoint and we lock in Parquet with a written reason why.

**Commit to one answer backed by measurements.** No ambiguity after this phase.

### Scope

1. Add `lance` crate as a dep in `vectord`, behind a `feature = "lance"` flag initially so we can build without forcing users to install it
2. New module `vectord::lance_store` — mirror of `vectord::store` but against Lance format
3. New endpoint `POST /vectors/lance/index` — build a Lance index for a named source (parallel to the existing Parquet path)
4. Benchmark script that runs against `resumes_100k_v2` (the existing 100K reference)

### Measured dimensions

All benchmarks against `resumes_100k_v2` (100K × 768d), cold start:
| Metric | Parquet baseline | Lance | Threshold to migrate |
|---|---|---|---|
| Cold load from disk | ~2.8s (measured) | ? | ≥2× faster |
| Search latency p50 | 873us (ec=80 es=30) | ? | Within 50% |
| Disk size | 330MB | ? | Comparable or better |
| Single-row random access | N/A (full scan) | ? | <10ms |
| Append 10K new rows | Full rewrite (~3s) | ? | Incremental <500ms |
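
As a sketch of how the benchmark script might fill these cells, a dependency-free timing harness (the helpers are hypothetical; the real script would wrap the actual Parquet and Lance loaders and the HNSW search call):

```rust
use std::time::Instant;

// Hypothetical timing helpers for the Phase B benchmark script.
// The workloads wrapped below are stand-ins, not the real loaders.
fn time_us<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_micros())
}

// Report p50 (median), matching the "Search latency p50" row: a median
// is robust to one slow outlier in a way a mean is not.
fn p50_us(mut samples: Vec<u128>) -> u128 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn main() {
    // Stand-in for a cold load; the real script would open the 330MB file.
    let (_, cold_us) = time_us(|| (0..1_000_000u64).sum::<u64>());
    let search_p50 = p50_us(vec![900, 873, 1_200, 850, 910]);
    println!("cold: {cold_us}us, search p50: {search_p50}us");
}
```

Each cell in the table above would be one `time_us` measurement (or a `p50_us` over repeated runs) against the real loader or search path.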
### Decision rules
- **Lance wins cold-load by ≥2× AND matches search latency:** migrate vector storage to Lance. Dataset tables stay Parquet. Update ADR-008 → ADR-019.
- **Lance within 50% across the board:** stay Parquet. Document the ceiling honestly in the PRD (already done). Revisit Lance when we have a problem Parquet can't solve.
- **Lance loses:** close the door, don't revisit without new evidence. Write ADR-019 as "Lance evaluated, rejected, here's why with numbers."
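
Applied mechanically, the rules above reduce to a small function. A sketch (the enum names and threshold encoding are ours; both ratios are Lance/Parquet, so lower is better):

```rust
#[derive(Debug, PartialEq)]
enum StorageDecision {
    MigrateVectorsToLance, // dataset tables stay Parquet either way
    StayParquet,
    RejectLance,
}

/// `cold_ratio` = Lance cold-load / Parquet cold-load (≤ 0.5 means "≥2× faster").
/// `latency_ratio` = Lance search p50 / Parquet search p50 (≤ 1.5 means "within 50%").
fn decide(cold_ratio: f64, latency_ratio: f64) -> StorageDecision {
    if cold_ratio <= 0.5 && latency_ratio <= 1.5 {
        // Lance wins cold-load by ≥2× AND matches search latency.
        StorageDecision::MigrateVectorsToLance
    } else if cold_ratio <= 1.5 && latency_ratio <= 1.5 {
        // Within 50% across the board: no compelling reason to move.
        StorageDecision::StayParquet
    } else {
        StorageDecision::RejectLance
    }
}

fn main() {
    println!("{:?}", decide(0.4, 1.1));
}
```

The point is that ADR-019 can quote the measured ratios and the rule, and the decision follows with no judgment call left.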

### Success gate

Benchmark output table posted to `docs/ADR-019-vector-storage.md` with measured numbers in each cell of the table above. Decision rule applied mechanically. No "let's defer the call."

### Rollback

The `feature = "lance"` flag means if the pilot goes badly, `cargo build` without the flag is unchanged. No production migration happens until ADR-019 commits to the change. Safe experiment.

---

## Phase C — Decoupled embedding refresh

### Why now

llms3.com's lakehouse architecture explicitly separates "transactional data mutations" from "asynchronous vector refresh cycles." Today we couple them — an ingest writes rows AND embeddings in one flow. That means:
- Adding 1K rows to a 100K-row dataset forces re-embedding of ALL rows (or nothing)
- No notion of "embeddings are stale, schedule a refresh tonight"
- The embedding cost (Ollama-bound, the bottleneck) is synchronous with ingest

### Scope

1. Add fields to `DatasetManifest`:
   - `last_embedded_at: Option<DateTime>`
   - `embedding_stale_since: Option<DateTime>` (set when data written but embeddings not refreshed)
   - `embedding_refresh_policy: RefreshPolicy` — `Manual` | `OnAppend` | `Scheduled(cron)`
2. Decouple ingest from embed: ingest writes data + marks embeddings stale; embedding runs separately
3. New endpoint: `POST /vectors/refresh/{dataset}` — diffs existing vectors vs current rows, only embeds new/changed (keyed by `doc_id`)
4. Background scheduler (or cron trigger) — for `Scheduled` policy, re-runs refresh per schedule
5. `GET /vectors/stale` — lists datasets with stale embeddings and how stale
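
A minimal sketch of the manifest additions and the refresh diff. Types are simplified for illustration: timestamps are plain strings rather than `DateTime`, and `ids_to_embed` stands in for the `doc_id`-keyed diff the refresh endpoint would run:

```rust
use std::collections::HashSet;

// Sketch of the Phase C manifest additions (simplified types).
#[derive(Debug)]
enum RefreshPolicy {
    Manual,            // default for existing datasets — no behavior change
    OnAppend,
    Scheduled(String), // cron expression
}

#[derive(Debug)]
struct EmbeddingState {
    last_embedded_at: Option<String>,
    embedding_stale_since: Option<String>,
    refresh_policy: RefreshPolicy,
}

/// Core of the hypothetical /vectors/refresh diff: given the doc_ids that
/// already have vectors and the doc_ids currently in the dataset, return
/// only the ids that still need embedding.
fn ids_to_embed<'a>(embedded: &HashSet<&'a str>, current: &[&'a str]) -> Vec<&'a str> {
    current.iter().copied().filter(|id| !embedded.contains(id)).collect()
}

fn main() {
    let state = EmbeddingState {
        last_embedded_at: Some("2026-04-16T10:00:00Z".into()),
        embedding_stale_since: None,
        refresh_policy: RefreshPolicy::Manual,
    };
    let embedded: HashSet<&str> = ["d1", "d2"].into_iter().collect();
    let todo = ids_to_embed(&embedded, &["d1", "d2", "d3", "d4"]);
    println!("{:?}, to embed: {todo:?}", state.refresh_policy);
}
```

This is why the measured-success scenario below embeds only the appended rows: the diff never touches ids that already have vectors.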

### Measured success

- Ingest a 1K-row append to `kb_team_runs` (currently 586 rows, Postgres-sourced).
- Catalog shows `embedding_stale_since = now`.
- `POST /vectors/refresh/kb_team_runs` embeds only the 1K new rows, not all 1586.
- Result: new rows searchable, old embeddings unchanged, total Ollama time ~5s instead of ~30s.

### Dependencies on Phase B

If Lance wins Phase B, this is dramatically easier — Lance supports native append. If we stay Parquet, we need a "vectors delta" Parquet file that merges at read time (same pattern as Phase 8's data delta files). ~100 extra LOC if we stay Parquet.
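
If we stay on Parquet, the delta file behaves like Phase 8's data deltas: rows keyed by `doc_id`, delta winning on conflict at read time. A dependency-free sketch of that merge, with rows reduced to key/payload pairs for illustration:

```rust
use std::collections::HashMap;

/// Merge-on-read sketch for the Parquet fallback: delta rows override
/// base rows by key. A "row" here is just (key, payload); the real code
/// would merge Arrow record batches.
fn merge_on_read(base: &[(&str, &str)], delta: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut merged: HashMap<&str, &str> = base.iter().copied().collect();
    for &(k, v) in delta {
        merged.insert(k, v); // last write wins
    }
    let mut rows: Vec<(String, String)> = merged
        .into_iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    rows.sort(); // deterministic order for display
    rows
}

fn main() {
    let merged = merge_on_read(&[("a", "1"), ("b", "2")], &[("b", "9"), ("c", "3")]);
    println!("{merged:?}");
}
```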

### Rollback

The `refresh_policy` field defaults to `Manual` for all existing datasets, so no behavior change for anything already in the system. Opt-in per dataset.

---

## Phase D — AI-safe views

### Why now

llms3.com's framing: "AI-safe views enforcing row/column security + PII tokenization before model exposure." Phase 13 gave us role-based column masking at query time. That's per-query enforcement. "Views" means pre-materialized: create `candidates_safe` once, bind model X to that view, model X can never accidentally see raw `candidates`.
This is also the precondition for Phase 17 (model profiles) to be meaningfully safe. "Bind model to dataset" isn't enough — needs to be "bind model to a safe view of the dataset."

### Scope

1. New catalog entity: `AiView` with fields `name`, `base_dataset`, `columns[]` (whitelist), `row_filter` (optional SQL WHERE clause), `column_redactions[]` (PII tokenization rules)
2. Persistence: `_catalog/views/{name}.json` alongside manifests
3. Query-rewrite layer: when a query references `candidates_safe`, DataFusion sees an equivalent `SELECT (whitelisted cols) FROM candidates WHERE (row_filter)` — with redactions applied as expressions
4. Endpoints: `POST /catalog/views` to create, `GET /catalog/views` to list, `GET /catalog/views/{name}/preview` to see what the view exposes
5. Tool registry integration: tools can bind to an AiView instead of a raw table; agent invocations go through the view
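
A sketch of the `AiView` shape and the rewrite it implies. The struct fields follow the scope list; emitting a SQL string is illustrative only — the real layer would rewrite DataFusion plans, and `column_redactions` is omitted here:

```rust
/// Sketch of the AiView catalog entity (column_redactions omitted).
struct AiView {
    name: String,
    base_dataset: String,
    columns: Vec<String>,       // whitelist
    row_filter: Option<String>, // optional SQL WHERE clause
}

impl AiView {
    /// Illustrative string form of the rewrite the query layer performs.
    fn to_sql(&self) -> String {
        let cols = self.columns.join(", ");
        match &self.row_filter {
            Some(f) => format!("SELECT {cols} FROM {} WHERE {f}", self.base_dataset),
            None => format!("SELECT {cols} FROM {}", self.base_dataset),
        }
    }
}

fn main() {
    let v = AiView {
        name: "candidates_safe".into(),
        base_dataset: "candidates".into(),
        columns: vec!["candidate_id".into(), "skills".into(), "city".into()],
        row_filter: Some("status != 'blocked'".into()),
    };
    println!("{}", v.to_sql());
}
```

Because the agent's query is rewritten against this definition, columns outside the whitelist and rows failing the filter never reach the model, regardless of what SQL it writes.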

### Measured success

- Create view `candidates_safe` = `SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'`.
- Agent (tool registry) calls `search_candidates` bound to `candidates_safe`.
- Agent cannot see `email`, `phone`, `ssn`, or `status='blocked'` rows, even if it writes raw SQL.
- Audit log records agent accessed `candidates_safe`, not `candidates`.
### Dependencies
- Phase 13 already provides the sensitivity classification layer
- Phase 12 tool registry already exists
- This phase is the bridge between them for agent access

### Rollback

Views are additive. Dropping the feature = delete view definitions, tool registry falls back to direct table access. No data migration needed.

---

## Phase E — Soft deletes / tombstones

### Why now

GDPR/CCPA compliance for staffing data. Today, `ops::delete` physically deletes a Parquet object — fine for whole datasets, useless for "delete one candidate's record." To delete one row we'd have to rewrite the whole `candidates.parquet`, which at 100K rows means ~10MB of churn per deletion.
llms3.com lists "deletion vectors" as a core lakehouse pattern (Iceberg/Delta/Hudi all implement it). This is the single biggest compliance gap in the current system.

### Scope

1. New sidecar per dataset: `{dataset}_tombstones.parquet` with columns `{row_key, deleted_at, actor, reason}`
2. Delete API: `POST /catalog/datasets/{name}/tombstone` with `{row_keys[], reason, actor}`
3. Query-time filter: `queryd` automatically LEFT JOINs tombstones and filters out deleted rows
4. Compaction integration: Phase 8 compaction reads base + delta + tombstones, writes a clean base without tombstoned rows, clears the tombstone sidecar
5. Event journal integration (Phase 9): every tombstone emits a journal event with full context
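
The query-time filter in step 3 is, at heart, an anti-join on `row_key`. A dependency-free sketch — an in-memory stand-in for what `queryd` would express as a join against the tombstone sidecar:

```rust
use std::collections::HashSet;

/// Query-time tombstone filtering, sketched: drop any row whose key
/// appears in the tombstone sidecar. The real path would express this
/// as a join in the query plan rather than an in-memory HashSet.
fn filter_tombstoned<'a>(
    rows: &[(&'a str, &'a str)], // (row_key, payload)
    tombstoned: &HashSet<&'a str>,
) -> Vec<(&'a str, &'a str)> {
    rows.iter().copied().filter(|(k, _)| !tombstoned.contains(k)).collect()
}

fn main() {
    let tombstoned: HashSet<&str> = ["CAND-123"].into_iter().collect();
    let live = filter_tombstoned(&[("CAND-122", "ok"), ("CAND-123", "gone")], &tombstoned);
    println!("{live:?}");
}
```

Compaction then makes the filter's result permanent: rewrite the base without tombstoned keys and clear the sidecar entries.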

### Measured success

- `POST /catalog/datasets/candidates/tombstone` with `{row_keys: ["CAND-123"], reason: "GDPR request", actor: "legal@company"}`
- `SELECT COUNT(*) FROM candidates` drops by 1 immediately
- `SELECT * FROM candidates WHERE candidate_id = 'CAND-123'` returns empty
- `GET /journal/history/CAND-123` shows the tombstone event
- After scheduled compaction, the tombstone is materialized — `candidates.parquet` no longer contains CAND-123, and the tombstone sidecar entry for that row key is cleared

### Dependencies

- Phase 8 delta/merge-on-read pattern (done) — tombstones are a third layer at read time
- Phase 9 event journal (done) — tombstones emit journal events

### Rollback
If query rewrite becomes too complex, fallback: tombstones stored but applied only during compaction (not at query time). Queries return deleted rows until compaction runs. Less useful but safer.

---

## Cross-phase concerns

### Phases that need federation layer 2 (task #5)

Every phase above assumes the federation foundation (shipped 2026-04-16) but NOT federation layer 2 (cross-bucket SQL, profile activation, `X-Lakehouse-Bucket` header).
**Implication:** Phases B-E can proceed on the `primary` bucket without blocking on federation layer 2. Federation layer 2 becomes valuable when we want multi-profile model scoping (Phase 17). Sequence:

```
A (done) → B → C (+D in parallel) → federation layer 2 → Phase 16 → Phase 17 → E
```

### Phases that need federation layer 2 FIRST

None of B/C/D/E strictly need it. Phase 16 (hot-swap) benefits from it. Phase 17 (model profiles) depends on it heavily.

### What NOT to build in B-E

- Distributed query — wait for a real scale problem
- Replacement of DataFusion — working fine, stay put
- Iceberg/Delta Lake migration — explicitly out of scope per ADR-009
- Live streaming / CDC — explicit non-goal

### Definition of done for each phase

Each phase completes when:
1. Code shipped and building clean
2. Success gate measurably passed
3. Relevant ADR added to `docs/DECISIONS.md` (or updated)
4. `docs/PHASES.md` checkbox flipped with measurement data
5. PRD invariants checked — if a new invariant emerged, add it
6. One regression test in the crate or HTTP integration test

---

## Session plan (what to do in what order)

### Next session

1. **Phase B — Lance pilot.** Single session. Answers the biggest open architectural question.
2. Based on outcome, **write ADR-019** with the decision + data.
3. **Update PHASES.md** with Phase 18 status (Lance evaluated).

### Session after

4. **Phase C — Decoupled embedding refresh.** Implementation shaped by B's outcome (append is easy on Lance, requires delta logic on Parquet).

### Session after that

5. **Federation layer 2** OR **Phase D (AI-safe views)** — J decides based on priority. Federation layer 2 unlocks model profiles (Phase 17); AI-safe views deliver standalone value.

### Final session for this track

6. **Phase E — Soft deletes.** The compliance-driven phase. Fits cleanly after everything else because it touches the query path and wants to be built after query optimizations stabilize.

### Milestone checkpoint

After Phase E, stop and reassess. We'll have:
- Lance decision made and committed
- Decoupled embedding pipeline
- AI-safe view enforcement
- Soft delete semantics
That's a substantial capability increase, and a plausible "pause, write a retrospective, decide on Phase 16/17" moment.