Four shipped features and a PRD realignment, all measured end-to-end:
HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
/hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
80/30 = 1.00 recall in 230s
Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows
Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
/etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
/storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
X-Lakehouse-Rescue-Used observability headers on fallback
Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
(586 rows × 13 cols, 6 batches, 196ms end-to-end)
PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
decision rules
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Execution Plan — Phases B through E
Created: 2026-04-16
Status: Active planning document — update as phases complete or scope shifts
Owner: J
This plan sequences the work J and Claw agreed on during the 2026-04-16 reframe session, after stress-testing the "dual-use substrate" vision and aligning it with llms3.com's architectural patterns.
The four phases, at a glance
| Phase | Work | Prereq | Estimated cost | Risk |
|---|---|---|---|---|
| B | Lance pilot on one vector index | None | 1 focused session | Medium — new dep, unfamiliar surface |
| C | Decoupled embedding refresh pipeline | Benefits from B's outcome | 1 focused session | Low — additive, doesn't break existing |
| D | AI-safe views | Phase 13 (done) | 1 focused session | Low — builds on existing catalog + tool registry |
| E | Soft deletes / tombstones | None | 1-2 focused sessions | Medium — touches query path, compaction |
Each "focused session" ≈ 3-4 hours of coding + verification + doc update.
Phase B — Lance pilot (the storage format question)
Why now
J's LLMS3 knowledge base explicitly positions Lance as alternative_to Parquet for vector workloads. We admitted in the 2026-04-16 stress test that our Parquet-portability argument for vectors is weaker than advertised — our vector Parquet blobs aren't readable by DuckDB/Polars anyway. Lance could unlock: random-row access, disk-resident indexes, time-travel, better compression. Or it could disappoint and we lock in Parquet with a written reason why.
Commit to one answer backed by measurements. No ambiguity after this phase.
Scope
- Add the `lance` crate as a dep in `vectord`, behind a `feature = "lance"` flag initially so we can build without forcing users to install it
- New module `vectord::lance_store` — mirror of `vectord::store` but against the Lance format
- New endpoint `POST /vectors/lance/index` — build a Lance index for a named source (parallel to the existing Parquet path)
- Benchmark script that runs against `resumes_100k_v2` (the existing 100K reference)
Measured dimensions
All benchmarks against resumes_100k_v2 (100K × 768d), cold start:
| Metric | Parquet baseline | Lance | Threshold to migrate |
|---|---|---|---|
| Cold load from disk | ~2.8s (measured) | ? | ≥2× faster |
| Search latency p50 | 873us (ec=80 es=30) | ? | Within 50% |
| Disk size | 330MB | ? | Comparable or better |
| Single-row random access | N/A (full scan) | ? | <10ms |
| Append 10K new rows | Full rewrite (~3s) | ? | Incremental <500ms |
Decision rules
- Lance wins cold-load by ≥2× AND matches search latency: migrate vector storage to Lance. Dataset tables stay Parquet. Update ADR-008 → ADR-019.
- Lance within 50% across the board: stay Parquet. Document the ceiling honestly in the PRD (already done). Revisit Lance only when we have a problem Parquet can't solve.
- Lance loses: close the door, don't revisit without new evidence. Write ADR-019 as "Lance evaluated, rejected, here's why with numbers."
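To make "decision rule applied mechanically" concrete, here is a minimal sketch of the three rules as a function over benchmark results. The struct, thresholds, and the reading of "matches search latency" as "within 50% of the Parquet p50" are illustrative assumptions, not the real harness:

```rust
// Hypothetical encoding of the Phase B decision rules. BenchResult and
// decide() are illustrative; the real benchmark output feeds ADR-019.

#[derive(Debug, PartialEq)]
enum StorageDecision {
    MigrateToLance, // Lance wins cold-load by >=2x AND matches search latency
    StayParquet,    // Lance within 50% across the board
    RejectLance,    // Lance loses; close the door, write ADR-019 with numbers
}

struct BenchResult {
    cold_load_secs: f64, // Parquet baseline: ~2.8s measured
    search_p50_us: f64,  // Parquet baseline: 873us (ec=80 es=30)
}

fn decide(parquet: &BenchResult, lance: &BenchResult) -> StorageDecision {
    let cold_speedup = parquet.cold_load_secs / lance.cold_load_secs;
    // "matches search latency" read as: within 50% of the Parquet p50
    let search_ok = lance.search_p50_us <= parquet.search_p50_us * 1.5;

    if cold_speedup >= 2.0 && search_ok {
        StorageDecision::MigrateToLance
    } else if search_ok && cold_speedup >= 1.0 / 1.5 {
        // within 50% on both axes: keep Parquet, revisit with new evidence
        StorageDecision::StayParquet
    } else {
        StorageDecision::RejectLance
    }
}

fn main() {
    let parquet = BenchResult { cold_load_secs: 2.8, search_p50_us: 873.0 };
    let lance = BenchResult { cold_load_secs: 1.1, search_p50_us: 950.0 };
    println!("{:?}", decide(&parquet, &lance));
}
```

The point of encoding it as a function is that the success gate leaves no room for "let's defer the call" — the measured numbers go in, one of three decisions comes out.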
Success gate
Benchmark output table posted to docs/ADR-019-vector-storage.md with measured numbers in each cell of the table above. Decision rule applied mechanically. No "let's defer the call."
Rollback
The feature = "lance" flag means if the pilot goes badly, cargo build without the flag is unchanged. No production migration happens until ADR-019 commits to the change. Safe experiment.
Phase C — Decoupled embedding refresh
Why now
llms3.com's lakehouse architecture explicitly separates "transactional data mutations" from "asynchronous vector refresh cycles." Today we couple them — an ingest writes rows AND embeddings in one flow. That means:
- Adding 1K rows to a 100K-row dataset forces re-embedding of ALL rows (or nothing)
- No notion of "embeddings are stale, schedule a refresh tonight"
- The embedding cost (Ollama-bound, the bottleneck) is synchronous with ingest
Scope
- Add fields to `DatasetManifest`:
  - `last_embedded_at: Option<DateTime>`
  - `embedding_stale_since: Option<DateTime>` (set when data is written but embeddings are not refreshed)
  - `embedding_refresh_policy: RefreshPolicy` — `Manual` | `OnAppend` | `Scheduled(cron)`
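A minimal sketch of those manifest fields and the staleness transitions, assuming a simplified integer timestamp in place of the real `DateTime` type (the method names are illustrative, not the actual catalogd API):

```rust
// Hypothetical shape of the new manifest fields. Unix-seconds u64 stands in
// for the real DateTime type to keep the sketch dependency-free.

#[derive(Debug, Clone, PartialEq)]
enum RefreshPolicy {
    Manual,
    OnAppend,
    Scheduled(String), // cron expression
}

#[derive(Debug)]
struct DatasetManifest {
    name: String,
    last_embedded_at: Option<u64>,
    embedding_stale_since: Option<u64>,
    embedding_refresh_policy: RefreshPolicy,
}

impl DatasetManifest {
    // Called by ingest: data was written, embeddings were not refreshed.
    fn mark_stale(&mut self, now: u64) {
        // Keep the earliest stale timestamp so "how stale" stays honest.
        if self.embedding_stale_since.is_none() {
            self.embedding_stale_since = Some(now);
        }
    }

    // Called by the refresh job after embedding completes.
    fn mark_refreshed(&mut self, now: u64) {
        self.last_embedded_at = Some(now);
        self.embedding_stale_since = None;
    }

    fn is_stale(&self) -> bool {
        self.embedding_stale_since.is_some()
    }
}

fn main() {
    let mut m = DatasetManifest {
        name: "kb_team_runs".into(),
        last_embedded_at: None,
        embedding_stale_since: None,
        embedding_refresh_policy: RefreshPolicy::Manual, // safe default
    };
    m.mark_stale(1_000);
    m.mark_stale(2_000); // second write does not reset the stale clock
    println!("stale since {:?}", m.embedding_stale_since);
    m.mark_refreshed(3_000);
    println!("stale now? {}", m.is_stale());
}
```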
- Decouple ingest from embed: ingest writes data + marks embeddings stale; embedding runs separately
- New endpoint: `POST /vectors/refresh/{dataset}` — diffs existing vectors vs current rows, only embeds new/changed rows (keyed by `doc_id`)
- Background scheduler (or cron trigger) — for the `Scheduled` policy, re-runs refresh per schedule
- `GET /vectors/stale` — lists datasets with stale embeddings and how stale they are
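The core of the refresh endpoint is a set difference keyed by `doc_id`. A sketch under the assumption that the dataset exposes its current `doc_id`s and vectord knows which `doc_id`s already have embeddings (the helper name is hypothetical):

```rust
use std::collections::HashSet;

// Illustrative diff for the refresh path: embed only doc_ids that are in the
// dataset but have no existing embedding. Unchanged rows are never re-sent
// to Ollama, which is the whole point of decoupling.
fn ids_to_embed<'a>(
    embedded: &HashSet<&'a str>, // doc_ids with existing vectors
    current: &[&'a str],         // doc_ids in the dataset right now
) -> Vec<&'a str> {
    current
        .iter()
        .copied()
        .filter(|id| !embedded.contains(id))
        .collect()
}

fn main() {
    let embedded: HashSet<&str> = ["doc-1", "doc-2"].into_iter().collect();
    let todo = ids_to_embed(&embedded, &["doc-1", "doc-2", "doc-3"]);
    println!("to embed: {:?}", todo); // only the new row
}
```

Detecting *changed* rows (same `doc_id`, new content) would additionally need a content hash per row; that is an open design detail, not shown here.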
Measured success
- Ingest a 1K-row append to `kb_team_runs` (currently 586 rows, Postgres-sourced).
- Catalog shows `embedding_stale_since = now`.
- `POST /vectors/refresh/kb_team_runs` embeds only the 1K new rows, not all 1,586.
- Result: new rows searchable, old embeddings unchanged, total Ollama time ~5s instead of ~30s.
Dependencies on Phase B
If Lance wins Phase B, this is dramatically easier — Lance supports native append. If we stay Parquet, we need a "vectors delta" Parquet file that merges at read time (same pattern as Phase 8's data delta files). ~100 extra LOC if we stay Parquet.
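If we stay on Parquet, the delta merge semantics would look roughly like this — base vectors plus a delta file combined at read time, delta winning on `doc_id` collisions, mirroring Phase 8's data deltas. Types and the function itself are placeholders, not the real store code:

```rust
use std::collections::HashMap;

// Sketch of merge-on-read for a Parquet "vectors delta" sidecar. Each entry
// is (doc_id, embedding); delta entries overwrite base entries with the same
// doc_id because they are chained in last (later insert wins).
fn merge_on_read(
    base: &[(String, Vec<f32>)],
    delta: &[(String, Vec<f32>)],
) -> Vec<(String, Vec<f32>)> {
    let mut merged: HashMap<String, Vec<f32>> = HashMap::new();
    for (id, v) in base.iter().chain(delta.iter()) {
        merged.insert(id.clone(), v.clone());
    }
    let mut out: Vec<_> = merged.into_iter().collect();
    out.sort_by(|a, b| a.0.cmp(&b.0)); // deterministic order for the caller
    out
}

fn main() {
    let base = vec![("a".to_string(), vec![1.0]), ("b".to_string(), vec![2.0])];
    let delta = vec![("b".to_string(), vec![9.0]), ("c".to_string(), vec![3.0])];
    println!("{} rows after merge", merge_on_read(&base, &delta).len());
}
```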
Rollback
The refresh_policy field defaults to Manual for all existing datasets, so no behavior change for anything already in the system. Opt-in per dataset.
Phase D — AI-safe views
Why now
llms3.com's framing: "AI-safe views enforcing row/column security + PII tokenization before model exposure." Phase 13 gave us role-based column masking at query time. That's per-query enforcement. "Views" means pre-materialized: create candidates_safe once, bind model X to that view, model X can never accidentally see raw candidates.
This is also the precondition for Phase 17 (model profiles) to be meaningfully safe. "Bind model to dataset" isn't enough — needs to be "bind model to a safe view of the dataset."
Scope
- New catalog entity: `AiView` with fields `name`, `base_dataset`, `columns[]` (whitelist), `row_filter` (optional SQL WHERE clause), `column_redactions[]` (PII tokenization rules)
- Persistence: `_catalog/views/{name}.json` alongside manifests
- Query-rewrite layer: when a query references `candidates_safe`, DataFusion sees an equivalent `SELECT (whitelisted cols) FROM candidates WHERE (row_filter)` — with redactions applied as expressions
- Endpoints: `POST /catalog/views` to create, `GET /catalog/views` to list, `GET /catalog/views/{name}/preview` to see what a view resolves to
- Tool registry integration: tools can bind to an `AiView` instead of a raw table; agent invocations go through the view
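A minimal sketch of the rewrite idea: an `AiView` definition expanded into the equivalent SELECT the engine would see. Column redactions are omitted for brevity, and the struct/method are illustrative, not the real catalog entity:

```rust
// Hypothetical AiView expansion. In the real system the rewrite happens at
// the DataFusion plan level; string SQL here just shows the semantics.
#[allow(dead_code)]
struct AiView {
    name: String,
    base_dataset: String,
    columns: Vec<String>,       // whitelist — everything else is invisible
    row_filter: Option<String>, // optional SQL WHERE clause
}

impl AiView {
    fn to_sql(&self) -> String {
        let cols = self.columns.join(", ");
        match &self.row_filter {
            Some(f) => format!("SELECT {} FROM {} WHERE {}", cols, self.base_dataset, f),
            None => format!("SELECT {} FROM {}", cols, self.base_dataset),
        }
    }
}

fn main() {
    let v = AiView {
        name: "candidates_safe".into(),
        base_dataset: "candidates".into(),
        columns: vec!["candidate_id".into(), "skills".into(), "city".into()],
        row_filter: Some("status != 'blocked'".into()),
    };
    println!("{}", v.to_sql());
}
```

Because the agent's SQL is rewritten against this expansion rather than the raw table, non-whitelisted columns and filtered rows are unreachable even from handwritten SQL.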
Measured success
- Create view `candidates_safe` = `SELECT candidate_id, skills, city FROM candidates WHERE status != 'blocked'`.
- Agent (tool registry) calls `search_candidates` bound to `candidates_safe`.
- Agent cannot see `email`, `phone`, `ssn`, or `status = 'blocked'` rows, even if it writes raw SQL.
- Audit log records that the agent accessed `candidates_safe`, not `candidates`.
Dependencies
- Phase 13 already provides the sensitivity classification layer
- Phase 12 tool registry already exists
- This phase is the bridge between them for agent access
Rollback
Views are additive. Dropping the feature = delete view definitions, tool registry falls back to direct table access. No data migration needed.
Phase E — Soft deletes / tombstones
Why now
GDPR/CCPA compliance for staffing data. Today, ops::delete physically deletes a parquet object — fine for whole datasets, useless for "delete one candidate's record." To delete one row we'd have to rewrite the whole candidates.parquet, which at 100K rows means ~10MB of churn per deletion.
llms3.com lists "deletion vectors" as a core lakehouse pattern (Iceberg/Delta/Hudi all implement it). This is the single biggest compliance gap in the current system.
Scope
- New sidecar per dataset: `{dataset}_tombstones.parquet` with columns `{row_key, deleted_at, actor, reason}`
- Delete API: `POST /catalog/datasets/{name}/tombstone` with body `{row_keys[], reason, actor}`
- Query-time filter: `queryd` automatically LEFT JOINs tombstones and filters out deleted rows
querydautomatically LEFT JOINs tombstones and filters out deleted rows - Compaction integration: Phase 8 compaction reads base + delta + tombstones, writes a clean base without tombstoned rows, clears the tombstone sidecar
- Event journal integration (Phase 9): every tombstone emits a journal event with full context
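The query-time filter is semantically an anti-join on `row_key`. A sketch of that semantics (the real implementation would live inside the DataFusion plan, and these row/key types are placeholders):

```rust
use std::collections::HashSet;

// Illustrative tombstone filter: rows whose row_key appears in the tombstone
// sidecar never reach the caller, even though the base parquet still holds
// them until compaction materializes the delete.
fn filter_tombstoned(
    rows: Vec<(String, String)>,  // (row_key, payload) stand-in for a record
    tombstoned: &HashSet<String>, // row_keys from {dataset}_tombstones.parquet
) -> Vec<(String, String)> {
    rows.into_iter()
        .filter(|(key, _)| !tombstoned.contains(key))
        .collect()
}

fn main() {
    let dead: HashSet<String> = ["CAND-123".to_string()].into_iter().collect();
    let rows = vec![
        ("CAND-123".to_string(), "tombstoned candidate".to_string()),
        ("CAND-999".to_string(), "live candidate".to_string()),
    ];
    println!("{} visible rows", filter_tombstoned(rows, &dead).len());
}
```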
Measured success
- `POST /catalog/datasets/candidates/tombstone` with `{row_keys: ["CAND-123"], reason: "GDPR request", actor: "legal@company"}`
- `SELECT COUNT(*) FROM candidates` drops by 1 immediately
- `SELECT * FROM candidates WHERE candidate_id = 'CAND-123'` returns empty
- `GET /journal/history/CAND-123` shows the tombstone event
- After scheduled compaction, the tombstone is materialized — `candidates.parquet` no longer contains CAND-123, and the tombstone sidecar is emptied for that row key
Dependencies
- Phase 8 delta/merge-on-read pattern (done) — tombstones are a third layer at read time
- Phase 9 event journal (done) — tombstones emit journal events
Rollback
If query rewrite becomes too complex, fallback: tombstones stored but applied only during compaction (not at query time). Queries return deleted rows until compaction runs. Less useful but safer.
Cross-phase concerns
Phases that need federation layer 2 (task #5)
Every phase above assumes the federation foundation (shipped 2026-04-16) but NOT federation layer 2 (cross-bucket SQL, profile activation, X-Lakehouse-Bucket header).
Implication: Phases B-E can proceed on the primary bucket without blocking on federation layer 2. Federation layer 2 becomes valuable when we want multi-profile model scoping (Phase 17). Sequence:
A (done) → B → C (+D in parallel) → federation layer 2 → Phase 16 → Phase 17 → E
Phases that need federation layer 2 FIRST
None of B/C/D/E strictly need it. Phase 16 (hot-swap) benefits from it. Phase 17 (model profiles) depends on it heavily.
What NOT to build in B-E
- Distributed query — wait for a real scale problem
- Replacement of DataFusion — working fine, stay put
- Iceberg/Delta Lake migration — explicitly out of scope per ADR-009
- Live streaming / CDC — explicit non-goal
Definition of done for each phase
Each phase completes when:
- Code shipped and building clean
- Success gate measurably passed
- Relevant ADR added to `docs/DECISIONS.md` (or updated)
- `docs/PHASES.md` checkbox flipped with measurement data
- PRD invariants checked — if a new invariant emerged, add it
- One regression test in the crate or HTTP integration test
Session plan (what to do in what order)
Next session
- Phase B — Lance pilot. Single session. Answers the biggest open architectural question.
- Based on outcome, write ADR-019 with the decision + data.
- Update PHASES.md with Phase 18 status (Lance evaluated).
Session after
- Phase C — Decoupled embedding refresh. Implementation shaped by B's outcome (append is easy on Lance, requires delta logic on Parquet).
Session after that
- Federation layer 2 OR Phase D (AI-safe views) — J decides based on priority. Federation layer 2 unlocks model profiles (Phase 17); AI-safe views is standalone value.
Final session for this track
- Phase E — Soft deletes. The compliance-driven phase. Fits cleanly after everything else because it touches the query path and wants to be built after query optimizations stabilize.
Milestone checkpoint
After Phase E, stop and reassess. We'll have:
- Lance decision made and committed
- Decoupled embedding pipeline
- AI-safe view enforcement
- Soft delete semantics
That's a substantial capability increase. Plausible "pause, write a retrospective, decide on Phase 16/17" moment.