POST /vectors/lance/recall/{index} runs an existing harness through
Lance IVF_PQ search and measures recall@k against brute-force ground
truth. Uses the same EvalSet + ground_truth infrastructure as the
HNSW trial system — no new harness format needed.
First real measurement on resumes_100k_v2 (100K × 768d, 20 queries):
IVF_PQ (316 partitions, 8 bits, 48 subvectors): recall@10 = 0.805
For comparison — HNSW ec=80 es=30: recall@10 = 1.000
ADR-019 predicted "likely 0.85-0.95" — actual is 0.805. Slightly
below, but now the harness exists to iterate: increase partitions,
try ivf_hnsw_pq, tune subvectors. The measurement infrastructure
is the deliverable, not any specific recall target.
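For reference, the recall@k the harness reports is plain set overlap against
the brute-force top-k. A minimal sketch of that math (illustrative, not the
harness code; names are made up):

```rust
use std::collections::HashSet;

/// recall@k = |approx ∩ truth| / k, averaged over queries.
/// `truth` and `approx` each hold per-query top-k doc ids.
fn recall_at_k(truth: &[Vec<u32>], approx: &[Vec<u32>], k: usize) -> f64 {
    let mut total = 0.0;
    for (t, a) in truth.iter().zip(approx) {
        let t_set: HashSet<_> = t.iter().take(k).collect();
        let hits = a.iter().take(k).filter(|id| t_set.contains(id)).count();
        total += hits as f64 / k as f64;
    }
    total / truth.len() as f64
}

fn main() {
    let truth = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8]];
    let approx = vec![vec![1, 2, 9, 4], vec![5, 6, 7, 8]];
    // query 1: 3/4 hits, query 2: 4/4 hits -> mean 0.875
    assert!((recall_at_k(&truth, &approx, 4) - 0.875).abs() < 1e-9);
}
```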
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LanceVectorStore gains build_scalar_index(column) and
has_scalar_index(column). Exposed as POST /vectors/lance/scalar-index/
{index}/{column}. activate_profile auto-builds the doc_id btree
alongside the IVF_PQ vector index when activating a Lance-backed
profile — operators get both indexes without extra API calls.
stats() now reports has_doc_id_index alongside has_vector_index.
Measured on resumes_100k_v2 (100K × 768d): random doc_id fetch
improved from ~5.4ms to ~3.5ms (35% faster). Btree build: 19ms,
+2.7 MB on disk. The remaining ~3ms is vector column materialization,
not index lookup — to close further would need a projection-only
fetch that skips the 768-float vector for text-only RAG retrieval.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier PDF extraction: lopdf text layer first (fast, digital PDFs),
Tesseract OCR fallback when text extraction yields zero pages (scanned
documents, image-only PDFs). Fails gracefully when Tesseract isn't
installed — returns an actionable error directing the operator to
`apt install tesseract-ocr tesseract-ocr-eng`.
OCR path: extract embedded XObject /Image streams from each page via
lopdf, detect format from magic bytes (JPEG/PNG/TIFF), write to temp
file, shell out to tesseract with --oem 3 --psm 6 (LSTM + uniform
text block), read output, clean up. Temp files cleaned even on error.
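The magic-byte detection is a standard prefix test. A self-contained sketch
(the enum and function names are illustrative, not the shipped identifiers):

```rust
#[derive(Debug, PartialEq)]
enum ImgFormat { Jpeg, Png, Tiff, Unknown }

/// Detect image format from the first bytes of an extracted XObject stream.
fn detect_format(bytes: &[u8]) -> ImgFormat {
    match bytes {
        // JPEG: FF D8 FF
        [0xFF, 0xD8, 0xFF, ..] => ImgFormat::Jpeg,
        // PNG: 89 'P' 'N' 'G'
        [0x89, b'P', b'N', b'G', ..] => ImgFormat::Png,
        // TIFF: little-endian "II*\0" or big-endian "MM\0*"
        [0x49, 0x49, 0x2A, 0x00, ..] | [0x4D, 0x4D, 0x00, 0x2A, ..] => ImgFormat::Tiff,
        _ => ImgFormat::Unknown,
    }
}

fn main() {
    assert_eq!(detect_format(&[0xFF, 0xD8, 0xFF, 0xE0]), ImgFormat::Jpeg);
    assert_eq!(detect_format(&[0x89, b'P', b'N', b'G', 0x0D]), ImgFormat::Png);
    assert_eq!(detect_format(&[0x49, 0x49, 0x2A, 0x00]), ImgFormat::Tiff);
    assert_eq!(detect_format(b"hello"), ImgFormat::Unknown);
}
```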
Schema unchanged — both paths produce (source_file, page_number,
text_content) so downstream consumers (chunker, vectord, queryd) work
identically regardless of how text was produced.
Verified: created a synthetic scanned PDF (PIL → image → PDF with no
text layer), ingested via POST /ingest/file. Tesseract recovered the
text with expected OCR artifacts. Queryable via DataFusion SQL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
activate_profile: when profile.vector_backend == Lance, auto-migrates
from Parquet if no Lance dataset exists, auto-builds IVF_PQ if no
index attached. Reuses existing Lance dataset on subsequent activations.
profile_scoped_search: routes to Lance IVF_PQ or Parquet+HNSW based
on the profile's declared backend. Callers hit the same endpoint —
the profile abstracts which storage tier serves the query.
Verified: lance-recruiter (vector_backend=lance) and parquet-recruiter
(vector_backend=parquet) both searched the same 100K index through
POST /vectors/profile/{id}/search. Lance returned lance_ivf_pq at
25ms; Parquet returned hnsw at <1ms. Same API surface, different
backends, transparent routing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Background Scheduler task fires due ingests on interval, records
outcomes, reschedules. Single-flight per schedule_id so a slow run
can't pile up. The scheduler ticks every 10s; each schedule's own
interval is independent of the tick cadence.
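The single-flight guarantee can be sketched as a mutex-guarded set of
in-flight schedule ids (a simplification; names here are illustrative,
not the actual task's types):

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// At most one in-flight run per schedule id.
struct SingleFlight {
    running: Mutex<HashSet<String>>,
}

impl SingleFlight {
    fn new() -> Arc<Self> {
        Arc::new(Self { running: Mutex::new(HashSet::new()) })
    }
    /// Returns true only if the caller acquired the slot for `id`.
    fn try_begin(&self, id: &str) -> bool {
        self.running.lock().unwrap().insert(id.to_string())
    }
    fn end(&self, id: &str) {
        self.running.lock().unwrap().remove(id);
    }
}

fn main() {
    let sf = SingleFlight::new();
    assert!(sf.try_begin("sched-1"));  // first tick runs it
    assert!(!sf.try_begin("sched-1")); // later ticks skip while it runs
    assert!(sf.try_begin("sched-2"));  // other schedules unaffected
    sf.end("sched-1");
    assert!(sf.try_begin("sched-1")); // slot free again after the run
}
```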
ScheduleDef persisted as JSON at primary://_schedules/{id}.json,
rebuilt on startup. ScheduleKind supports Mysql and Postgres (both
through existing streaming paths). ScheduleTrigger::Interval is
live; Cron variant defined in the enum but parsing stubbed with a
safe 1h fallback.
next_run_at set to "now" on creation so operators see success or
failure within one tick — no waiting for the first full interval.
run-now endpoint fires even when schedule is disabled (manual
override for testing). Full catalog integration: PII detection,
lineage with redacted DSN, mark-stale + autotune agent trigger.
Verified live: 20s MySQL schedule against MariaDB lh_demo.customers.
Source mutated between runs (added row + updated value). Second
auto-fire picked up both changes (10→11 rows). DataFusion SQL
confirmed mutations in the lakehouse. 6 unit tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Five threads of work landing as one milestone — all individually
verified end-to-end against real data, full release build clean,
46 unit tests pass.
## Phase 16.2 / 16.5 — autotune agent + ingest triggers
`vectord::agent` is a long-running tokio task that watches the trial
journal and autonomously proposes + runs new HNSW configs. Distinct
from `autotune::run_autotune` (synchronous one-shot grid). Triggered
on POST /vectors/agent/enqueue/{idx} or by the periodic wake; ingest
paths now push DatasetAppended events when an index's source dataset
gets re-ingested. Rate-limited (max_trials_per_hour) and cooldown-
gated so it can't saturate Ollama under live load.
The proposer is ε-greedy around the current champion: with probability
0.25 it samples randomly from the full bounds; otherwise it perturbs the
champion by a small delta on both axes. Proposals are deduped against
history. Deterministic — the RNG is seeded from history.len(), so the
same journal state proposes the same next config (helps offline replay
debugging).
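A compact sketch of that proposer under stated assumptions: the ±10 delta,
the ε=0.25 split, and the tiny LCG are illustrative stand-ins, not the
shipped constants or RNG.

```rust
/// Tiny deterministic LCG so the sketch needs no external crates.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        self.0
    }
    fn gen_range(&mut self, lo: u32, hi: u32) -> u32 {
        lo + (self.next() % u64::from(hi - lo + 1)) as u32
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Config { ec: u32, es: u32 }

/// ε-greedy: explore the full bounds with probability eps_pct/100,
/// otherwise perturb the champion. Seeding from history length makes
/// the same journal state propose the same next config.
fn propose(champion: &Config, history_len: usize, eps_pct: u64) -> Config {
    let mut rng = Lcg(history_len as u64 + 1);
    if rng.next() % 100 < eps_pct {
        // explore: sample anywhere in the legal bounds
        Config { ec: rng.gen_range(10, 400), es: rng.gen_range(10, 200) }
    } else {
        // exploit: small ±10 perturbation of the champion, clamped to bounds
        let mut delta = |v: u32, max: i64| {
            let d = rng.gen_range(0, 20) as i64 - 10;
            (v as i64 + d).clamp(10, max) as u32
        };
        Config { ec: delta(champion.ec, 400), es: delta(champion.es, 200) }
    }
}

fn main() {
    let champ = Config { ec: 80, es: 30 };
    // deterministic: same history length, same proposal
    assert_eq!(propose(&champ, 7, 25), propose(&champ, 7, 25));
    // bounds always respected
    let c = propose(&champ, 3, 25);
    assert!((10..=400).contains(&c.ec) && (10..=200).contains(&c.es));
}
```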
`[agent]` config section in lakehouse.toml; opt-in via enabled=true.
## Federation Layer 2 — runtime bucket lifecycle + per-index scoping
`BucketRegistry.buckets` moved to `std::sync::RwLock<HashMap>` so
buckets can be added/removed after startup. POST /storage/buckets
provisions at runtime; DELETE /storage/buckets/{name} unregisters
(refuses primary/rescue with 403). Local-backend buckets get their
root directory auto-created.
`IndexMeta.bucket` (default "primary" via serde) records each index's
home bucket. `TrialJournal` and `PromotionRegistry` now hold
Arc<BucketRegistry> + IndexRegistry; they resolve target store per-
index via IndexMeta.bucket. PromotionRegistry::list_all scans every
bucket and dedups by index_name. Pre-federation indexes keep working
unchanged — they just default to primary.
`ModelProfile.bucket: Option<String>` declares per-profile artifact
home. POST /vectors/profile/{id}/activate auto-provisions the
profile's bucket under storage.profile_root if not yet registered.
EvalSets stay primary-only for now — noted gap, low-risk to extend
later with the same resolver pattern.
## Phase 17 — VRAM-aware two-profile gate
Sidecar gains POST /admin/unload (Ollama keep_alive=0 trick — forces
immediate VRAM release), POST /admin/preload (keep_alive=5m with
empty prompt, takes the slot warm), and GET /admin/vram (combines
nvidia-smi snapshot with Ollama /api/ps). Exposed via aibridge as
unload_model / preload_model / vram_snapshot.
`VectorState.active_profile` is the GPU-slot singleton —
Arc<RwLock<Option<ActiveProfileSlot>>>. activate_profile checks for
a previous profile with a different ollama_name and unloads it
before preloading the new one; same-model reactivations skip the
unload (Ollama no-ops). New routes: POST /vectors/profile/{id}/
deactivate (unload + clear slot), GET /vectors/profile/active.
Verified live: staffing-recruiter (qwen2.5) → docs-assistant
(mistral) swap freed qwen2.5 from VRAM and loaded mistral. nomic-
embed-text persists across swaps because both profiles use it —
free optimization that fell out of the design. Scoped search
correctly 403s cross-profile in both directions.
## MySQL streaming connector
`crates/ingestd/src/my_stream.rs` mirrors pg_stream.rs for MySQL.
Pure-Rust `mysql_async` driver (default-features=false to avoid C
deps). Same OFFSET pagination, same Parquet-streaming write shape.
Type mapping per ADR-010: int/bigint → Int32/Int64, decimal/float
→ Float64, tinyint(1)/bool → Boolean, everything else → Utf8 with
fallback parsers for date/time/json/uuid via Display.
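The mapping reads as a single match. A sketch keyed on type-name strings
(the real code matches mysql_async column types, and the string names
here stand in for arrow::datatypes::DataType variants):

```rust
/// Arrow target type for a MySQL column, per the ADR-010 mapping above.
fn map_mysql_type(ty: &str, display_width: Option<u8>) -> &'static str {
    match (ty, display_width) {
        // tinyint(1) is MySQL's boolean convention
        ("tinyint", Some(1)) | ("bool", _) | ("boolean", _) => "Boolean",
        ("smallint", _) | ("mediumint", _) | ("int", _) => "Int32",
        ("bigint", _) => "Int64",
        ("decimal", _) | ("float", _) | ("double", _) => "Float64",
        // date/time/json/uuid fall through: formatted via Display
        _ => "Utf8",
    }
}

fn main() {
    assert_eq!(map_mysql_type("tinyint", Some(1)), "Boolean");
    assert_eq!(map_mysql_type("bigint", None), "Int64");
    assert_eq!(map_mysql_type("decimal", None), "Float64");
    assert_eq!(map_mysql_type("datetime", None), "Utf8");
}
```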
POST /ingest/mysql parallel to /ingest/db. Same PII auto-detection,
same lineage capture (source_system="mysql"), same agent-trigger
hook. `redact_dsn` generalized — was hardcoded to "postgresql://"
length, now works for any scheme://user:pass@host/path URL (latent
PII leak fix for MySQL DSNs).
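A sketch of the generalized redaction, assuming the userinfo is everything
between `scheme://` and the first `@` (illustrative; the shipped helper may
differ in edge-case handling):

```rust
/// Redact the password in any scheme://user:pass@host/path DSN,
/// regardless of scheme length.
fn redact_dsn(dsn: &str) -> String {
    let Some(scheme_end) = dsn.find("://") else { return dsn.to_string() };
    let rest = &dsn[scheme_end + 3..];
    // only touch the userinfo, i.e. text before the first '@'
    let Some(at) = rest.find('@') else { return dsn.to_string() };
    let userinfo = &rest[..at];
    match userinfo.find(':') {
        Some(colon) => format!(
            "{}://{}:***@{}",
            &dsn[..scheme_end],
            &userinfo[..colon],
            &rest[at + 1..]
        ),
        None => dsn.to_string(), // no password present
    }
}

fn main() {
    assert_eq!(
        redact_dsn("mysql://root:hunter2@localhost:3306/lh_demo"),
        "mysql://root:***@localhost:3306/lh_demo"
    );
    assert_eq!(
        redact_dsn("postgresql://app:s3cret@db/prod"),
        "postgresql://app:***@db/prod"
    );
    assert_eq!(redact_dsn("mysql://localhost/db"), "mysql://localhost/db");
}
```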
Verified live against MariaDB on localhost: 10 rows × 9 columns of
test data round-tripped through datatypes int/varchar/decimal/
tinyint/datetime/text. PII detection auto-flagged name + email.
Aggregation queries through DataFusion match the source values
exactly.
## Phase 18 — Hybrid Parquet+HNSW ⊕ Lance backend (ADR-019)
`vectord-lance` is a new firewall crate. Lance pulls Arrow 57 and
DataFusion 52 — incompatible with the rest of the workspace's
Arrow 55 / DataFusion 47. The firewall isolates that dep tree:
public API uses only std types (Vec<f32>, Vec<String>, Hit, Row,
*Stats), so no Arrow types cross the crate boundary and nothing
propagates to vectord. This is the ADR-019 path that didn't ship until now.
`vectord::lance_backend::LanceRegistry` lazy-creates a
LanceVectorStore per index, resolving bucket → URI via the
conventional local-bucket layout. `IndexMeta.vector_backend` and
`ModelProfile.vector_backend` carry the choice (default Parquet so
existing indexes unchanged).
Six routes under /vectors/lance/*:
- migrate/{idx}: convert binary-blob Parquet → Lance FixedSizeList
- index/{idx}: build IVF_PQ
- search/{idx}: vector search (embed via sidecar)
- doc/{idx}/{doc_id}: random row fetch
- append/{idx}: native fragment append
- stats/{idx}: row count + index presence
Verified live on the real resumes_100k_v2 corpus (100K × 768d):
- Migrate: 0.57s
- Build IVF_PQ index: 16.2s (matches ADR-019 bench; 14× faster than
HNSW's 230s for the same data)
- Search end-to-end (Ollama embed + Lance scan): 23-53ms
- Random doc_id fetch: 5-7ms (filter scan; faster than Parquet's
~35ms full-file scan, slower than the bench's 311us positional
take — would close that gap with a scalar btree on doc_id)
- Append 100 rows: 3.3ms / +320KB on disk vs Parquet's required
full ~330MB rewrite — the structural win
- Index survives append; both backends coexist cleanly
## Known follow-ups not in this milestone
- ModelProfile.vector_backend doesn't yet auto-route /vectors/profile/
{id}/search to Lance; callers go through /vectors/lance/* directly
- Scalar btree on doc_id (closes the 5-7ms → ~300us gap)
- vectord-lance built default-features=false → no S3 yet
- IVF_PQ recall not measured (ADR-019 caveat) — needs a Lance-aware
variant of the eval harness
- Watcher-path ingest doesn't push agent triggers (HTTP paths do)
- EvalSets still primary-only (federation gap)
- No PATCH endpoint to move an existing index between buckets
- The storaged::append_log doctest fails to compile (a malformed
  `{prefix}/` parses as a code fence) — pre-existing bug, left for a
  focused fix
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.
delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
BooleanArray keep-mask per batch (True where row_key_value is NOT
in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
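The keep-mask logic, minus Arrow (the real code materializes this as a
BooleanArray and calls arrow::compute::filter_record_batch; this sketch
shows only the per-batch mask over the key column):

```rust
use std::collections::HashSet;

/// Keep-mask over one batch's key column:
/// true where the row key is NOT in the tombstone set.
fn keep_mask(keys: &[&str], tombstoned: &HashSet<&str>) -> Vec<bool> {
    keys.iter().map(|k| !tombstoned.contains(k)).collect()
}

/// Stand-in for filter_record_batch: keep rows where mask is true.
fn apply_mask<'a>(keys: &[&'a str], mask: &[bool]) -> Vec<&'a str> {
    keys.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(k, _)| *k)
        .collect()
}

fn main() {
    let tomb: HashSet<_> = ["CAND-000002"].into_iter().collect();
    let keys = ["CAND-000001", "CAND-000002", "CAND-000003"];
    let mask = keep_mask(&keys, &tomb);
    assert_eq!(mask, vec![true, false, true]);
    assert_eq!(apply_mask(&keys, &mask), vec!["CAND-000001", "CAND-000003"]);
}
```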
Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering
producing multiple batches. It corrupted candidates.parquet on the first
test run (restored from a UI fixture copy — a good argument for keeping
test data in the repo).
Fix:
- Single ArrowWriter per compaction, writes every batch into one
properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
temp; only then delete delta files. Any error along the way leaves
the original base intact.
TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.
QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.
E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
99997 (the 3 IDs are physically gone from disk)
PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.
Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
flush_threshold=1 per tombstone; TombstoneStore::compact exists
but not auto-called)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the self-iteration loop from the PRD reframe: an agent can
tune HNSW configs autonomously and the winner flows through to the
next profile activation without human intervention.
Three primitives:
1. PromotionRegistry (vectord::promotion)
- Per-index current + history at _hnsw_promotions/{index}.json
- promote(index, entry) atomically swaps current, pushes prior
onto history (capped at 50)
- rollback() pops history back onto current; clears current if
history exhausted
- config_or(index, default) — the read side used at build time,
returns promoted config if set else caller's default
- Full cache + persistence; writes are durable on return
2. Autotune (vectord::autotune)
- run_autotune(request, ...) — synchronous agent loop
- Default grid: 5 configs covering the practical range
(ec=20/40/80/80/160, es=30/30/30/60/30) with seed=42 for
reproducibility
- Every trial goes through the existing trial-journal pipeline
so autotune runs land alongside manual trials in the
"trials are data" log
- Winner: max recall first, then min p50 latency; must clear
min_recall gate (default 0.9) or no promotion happens
- Config bounds (ec ∈ [10,400], es ∈ [10,200]) reject absurd
values from the request's optional custom grid
- On winner: promote with note "autotune winner: recall=X p50=Y"
3. Wiring
- VectorState gains promotion_registry
- activate_profile now calls promotion_registry.config_or(...)
so newly-promoted configs are picked up on next activation —
the "hot-swap" is: autotune promotes -> profile activates ->
HNSW rebuilt with new config
- New endpoints:
POST /vectors/hnsw/promote/{index}/{trial_id}
      ?promoted_by=...&note=...
POST /vectors/hnsw/rollback/{index}
GET /vectors/hnsw/promoted/{index}
POST /vectors/hnsw/autotune { index_name, harness,
min_recall?, grid? }
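The winner rule from primitive 2 (max recall first, min p50 as tiebreak,
gated on min_recall) can be sketched as follows; the struct and field
names are illustrative, not the trial-journal types:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Trial { id: u32, recall: f64, p50_us: u64 }

/// Max recall first, then min p50 latency; no winner at all
/// unless the min_recall gate is cleared.
fn pick_winner(trials: &[Trial], min_recall: f64) -> Option<&Trial> {
    trials
        .iter()
        .filter(|t| t.recall >= min_recall)
        .max_by(|a, b| {
            a.recall
                .partial_cmp(&b.recall)
                .unwrap()
                .then(b.p50_us.cmp(&a.p50_us)) // lower latency wins ties
        })
}

fn main() {
    let trials = vec![
        Trial { id: 1, recall: 1.0, p50_us: 90 },
        Trial { id: 2, recall: 1.0, p50_us: 64 },
        Trial { id: 3, recall: 0.85, p50_us: 20 },
    ];
    // equal recall -> faster trial wins
    assert_eq!(pick_winner(&trials, 0.9).unwrap().id, 2);
    // gate not cleared -> no promotion
    assert!(pick_winner(&trials, 1.1).is_none());
}
```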
End-to-end verified on threat_intel_v1 (54 vectors):
- autogen harness 'threat_intel_smoke' (10 queries)
- POST /autotune -> 5 trials in 620ms, winner ec=20 es=30
recall=1.00 p50=64us auto-promoted
- Manual promote of ec=80 es=30 -> history depth 1
- Rollback -> back to ec=20 es=30 autotune winner
- Second rollback -> current cleared
- Re-promote + restart -> persistence verified
- Profile activation after promotion logged:
"building HNSW ef_construction=80 ef_search=30 seed=Some(42)"
proving the hot-swap loop is closed.
Deferred:
- Bayesian optimization (random-grid is fine at this config-space size)
- Append-triggered autotune (Phase 17.5 — refresh OnAppend policy
can schedule autotune after appending sufficient new rows)
- Concurrent autotune per index guard (JobTracker integration)
PRD invariants satisfied: invariant 8 (hot-swappable indexes) is now
real code — promote is atomic, rollback is always available, the
active generation is a persistent pointer not a runtime convention.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.
Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
trial winner.
- bound_datasets can reference raw dataset names OR AiView names
(both register as DataFusion tables with the same name, so mixing
raw tables and PII-redacted views composes naturally)
Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile
HTTP endpoints:
- POST /catalog/profiles (create/update)
- GET /catalog/profiles (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate (HNSW hot-load)
- POST /vectors/profile/{id}/search (scope-enforced)
Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
working" per ADR-017
Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
degradation — no "must activate first" friction)
Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.
End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
→ 403 "profile 'security-analyst' is not bound to 'candidates' —
allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
search without activation fell through to brute_force (graceful)
Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0)
— Ollama already handles this; we don't need to reinvent
- Model-identity in audit trail — Phase 13 has role-based audit;
adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.
Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
actor, reason }
- All tombstones for a dataset must share one row_key_column —
enforced at write so the query-time filter remains a single
WHERE NOT IN (...) clause
Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
(POST .../tombstones/compact will work once we expose it)
Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd
HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
{ row_key_column, row_key_values[], actor, reason }
Returns rows_tombstoned count + per-value failure list (207 on
partial success).
- GET same path lists active tombstones with full audit info.
Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
becomes DataFusion view with
SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
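The view construction amounts to string-building the filtered SELECT. A
sketch (assumes key values contain no embedded quotes; the function name
is illustrative):

```rust
/// Build the query-time filter view SQL described above. The real code
/// registers this SQL as a DataFusion view over "__raw__{name}".
fn tombstone_view_sql(name: &str, key_col: &str, values: &[&str]) -> String {
    let quoted: Vec<String> = values.iter().map(|v| format!("'{v}'")).collect();
    format!(
        "SELECT * FROM \"__raw__{name}\" WHERE CAST({key_col} AS VARCHAR) NOT IN ({})",
        quoted.join(", ")
    )
}

fn main() {
    let sql = tombstone_view_sql(
        "candidates",
        "candidate_id",
        &["CAND-000001", "CAND-000002"],
    );
    assert_eq!(
        sql,
        "SELECT * FROM \"__raw__candidates\" WHERE CAST(candidate_id AS VARCHAR) \
         NOT IN ('CAND-000001', 'CAND-000002')"
    );
}
```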
End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
(Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk
Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.
Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
doesn't read tombstones during merge. Tombstoned rows are still
on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
tombstone records carry their own actor+reason+timestamp so the
audit trail is intact, but cross-referencing with the mutation
event log would help compliance reporting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.
Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }
Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup
Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
redaction expressions per column
- Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
- Hash: digest(value, 'sha256')
- Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
real view, not a query rewrite
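A sketch of the per-column expression builder, using the SQL spellings
the list above describes (`char_length` as the length function is an
assumption; the enum mirrors the schema, the function name is illustrative):

```rust
/// Redaction strategies from the AiView schema.
enum Redaction {
    Null,
    Hash,
    Mask { keep_prefix: usize, keep_suffix: usize },
}

/// SQL expression for one projected column of the constructed SELECT.
fn redaction_expr(col: &str, r: &Redaction) -> String {
    match r {
        Redaction::Null => format!("CAST(NULL AS VARCHAR) AS {col}"),
        Redaction::Hash => format!("digest({col}, 'sha256') AS {col}"),
        Redaction::Mask { keep_prefix: p, keep_suffix: s } => format!(
            "substr({col}, 1, {p}) || repeat('*', char_length({col}) - {p} - {s}) || \
             substr({col}, char_length({col}) - {s} + 1) AS {col}"
        ),
    }
}

fn main() {
    assert_eq!(
        redaction_expr("email", &Redaction::Null),
        "CAST(NULL AS VARCHAR) AS email"
    );
    let mask = redaction_expr(
        "candidate_id",
        &Redaction::Mask { keep_prefix: 3, keep_suffix: 2 },
    );
    // "CAND-000001" through this expression yields "CAN******01"
    assert!(mask.starts_with("substr(candidate_id, 1, 3)"));
}
```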
HTTP (catalogd::service):
- POST /catalog/views (create)
- GET /catalog/views (list)
- GET /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}
End-to-end test on candidates (100K rows, 15 columns):
candidates_safe view:
columns: candidate_id, first_name, city, state, vertical,
skills, years_experience, status
row_filter: status != 'blocked'
redaction: candidate_id mask(prefix=3, suffix=2)
SELECT * FROM candidates_safe LIMIT 5
-> 8 columns only, candidate_id shown as "CAN******01"
(PII fields email/phone/last_name absent from result)
SELECT email FROM candidates_safe
-> fails (column not in projection)
SELECT email FROM candidates
-> succeeds (raw table still accessible by name —
Phase 13 access control is the gate, not the view itself)
Survives restart — view definitions reload from object storage.
Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
redactions could be misleading (a Hash on Int64 returns a hex string
that won't equi-join with another hash on the same value type)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pieces of the multi-bucket federation made real:
1. Catalog migration (POST /catalog/migrate-buckets)
- One-shot normalizer for ObjectRef.bucket field
- Empty -> "primary"; legacy "data"/"local" -> "primary"
- Idempotent; re-running on canonical state is no-op
- Ran on existing catalog: 12 refs renamed from "data", 2 already
"primary", all 14 now canonical
2. X-Lakehouse-Bucket header middleware on ingest
- resolve_bucket() helper extracts header, returns
(bucket_name, store) or 404 with valid bucket list
- ingest_file and ingest_db_stream now route writes per-request
- Defaults to "primary" when header absent
- pipeline::ingest_file_to_bucket records the actual bucket on the
ObjectRef so catalog stays the source of truth for "where does this
data live"
- Verified: ingest with X-Lakehouse-Bucket: testing lands in
data/_testing/, ingest without header lands in data/, bad header
returns 404 with hint
3. queryd registers every bucket with DataFusion
- QueryEngine now holds Arc<BucketRegistry> instead of single store
- build_context iterates all buckets, registers each as a separate
ObjectStore under URL scheme "lakehouse-{bucket}://"
- ListingTable URLs include the per-object bucket scheme so
DataFusion routes scans automatically based on ObjectRef.bucket
- Profile bucket names like "profile:user" sanitized to
"lakehouse-profile-user" since URL host segments can't contain ":"
- Tolerant of duplicate manifest entries (pre-existing
pipeline::ingest_file behavior creates a fresh dataset id per
ingest); duplicates skipped with debug log
- Backward compat: legacy "lakehouse://data/" URL still registered
pointing at primary
Success gate: cross-bucket CROSS JOIN
SELECT p.name, p.role, a.species
FROM people_test p (bucket: testing)
CROSS JOIN animals a (bucket: primary)
LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.
No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.
Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
eval sets per-profile)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The chunker's &text[start..end] slice could land inside a multi-byte
UTF-8 character (e.g. narrow no-break space \u{202f}, em-dashes, smart
quotes — universal in pg-imported editorial data). Rust panics on
non-boundary string slicing. In the refresh path that panic is caught
by tokio's task machinery but somehow causes linear memory growth at
~540MB/sec until OOM at 120GB+.
Root cause: chunk boundaries computed by byte arithmetic without
checking is_char_boundary(). The existing "look for last sentence / \n
/ space" logic finds ASCII-safe positions, but the *primary* `end`
calculation `(start + chunk_size).min(text.len())` can land anywhere,
including mid-character.
Fix:
- ceil_char_boundary(s, idx) — forward-scan to the nearest valid
UTF-8 char boundary. Used at end, actual_end, and next_start.
- Iteration cap — break if iterations exceed text.len(). Any
non-progressing loop dies safely instead of burning memory.
- Forced forward advance — if overlap + boundary math produce a
next_start <= start, force +1 char to guarantee termination.
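The boundary snap is a forward scan that terminates within at most three
steps, since UTF-8 chars are at most 4 bytes. A sketch (std has a
`str::ceil_char_boundary`, but it is unstable at the time of writing,
hence the hand-rolled version):

```rust
/// Forward-scan to the nearest valid UTF-8 char boundary at or after idx.
fn ceil_char_boundary(s: &str, idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    let mut i = idx;
    while !s.is_char_boundary(i) {
        i += 1; // at most 3 iterations: UTF-8 chars are <= 4 bytes
    }
    i
}

fn main() {
    // "a\u{202f}b": the narrow no-break space occupies bytes 1..4
    let s = "a\u{202f}b";
    assert_eq!(ceil_char_boundary(s, 2), 4); // inside the NNBSP: snap forward
    assert_eq!(ceil_char_boundary(s, 1), 1); // already a boundary: unchanged
    assert_eq!(ceil_char_boundary(s, 99), s.len());
    // naive &s[..2] would panic; the snapped index is always safe
    assert_eq!(&s[..ceil_char_boundary(s, 2)], "a\u{202f}");
}
```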
Reproduced on kb_team_runs (585 pg-imported prompts with editorial
unicode): previous run grew memory linearly to 124GB over 240s then
OOM-killed. Same request after fix: peaks at <100MB, completes in
~4m42s to produce 12,693 embeddings. /vectors/search returns
relevant results.
Regression tests added:
- handles_multibyte_utf8_at_chunk_boundary — exact \u{202f} repro
- no_infinite_loop_on_no_spaces — 5KB text, no whitespace
- no_infinite_loop_on_degenerate_params — chunk_size == overlap
Surfaced by Phase C, but pre-existed as a latent bug since Phase 7.
Any Ollama-targeted RAG corpus with non-ASCII content would have hit
this once it grew past ~13KB per document.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).
Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled
Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.
Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
writes combined index, clears stale flag
Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set
End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
re-embed); stale_cleared = true
Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
operators can see which datasets expect what, but the cron itself is
separate
Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standalone benchmark crate `crates/lance-bench` running Lance 4.0 against
our Parquet+HNSW at 100K × 768d (resumes_100k_v2) measured 8 dimensions.
Results (see docs/ADR-019-vector-storage.md for full scorecard):
Cold load: Parquet 0.17s vs Lance 0.13s (tie — not ≥2× threshold)
Disk size: 330.3 MB vs 330.4 MB (tie)
Search p50: 873us vs 2229us (Parquet 2.55× faster)
Search p95: 1413us vs 4998us (Parquet 3.54× faster)
Index build: 230s (ec=80) vs 16s (IVF_PQ) (Lance 14× faster)
Random access: 35ms (scan) vs 311us (Lance 112× faster)
Append 10K rows: full rewrite vs 0.08s/+31MB (Lance structural win)
Decision (ADR-019): hybrid, not migrate-or-reject.
- Parquet+HNSW stays primary — our HNSW at ec=80 es=30 recall=1.00 is
2.55× faster than Lance IVF_PQ at 100K in-RAM scale
- Lance joins as second backend per-profile for workloads where it wins
architecturally: random row access (RAG text fetch), append-heavy
pipelines (Phase C), hot-swap generations (Phase 16, 14× faster
builds), and indexes past the ~5M RAM ceiling
- Phase 17 ModelProfile gets vector_backend: Parquet | Lance field
- Ceiling table in PRD updated — the 5M ceiling now says "switch to Lance"
instead of "migrate", since Lance runs alongside Parquet rather than
replacing it
Isolation: lance-bench is a standalone workspace crate with its own dep
tree (Lance pulls DataFusion 52 + Arrow 57, incompatible with the main
stack's DataFusion 47 + Arrow 55). It stays off the critical path until
the API is stable enough to promote into vectord::lance_store.
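For context on how the p50/p95 latency figures are derived: a nearest-rank percentile over sorted per-query samples is the usual approach. A sketch under that assumption, not lance-bench's actual code:

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// Sorts in place; `p` is a fraction in [0.0, 1.0].
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    samples.sort_unstable();
    let idx = ((samples.len() as f64 - 1.0) * p).round() as usize;
    samples[idx]
}
```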
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lakehouse.service: release gateway on :3100, auto-restart
- lakehouse-sidecar.service: Python FastAPI on :3200, auto-restart
- lakehouse-ui.service: WASM file server on :3300, auto-restart
- All enabled at boot (multi-user.target)
- scripts/serve_ui.py for systemd-compatible file serving
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: Dashboard auto-fired 6+ API calls on load, then Ask tab fired
7 DESCRIBE queries per question — 15+ concurrent requests from WASM.
Fixes:
- Schema context cached after first build (7 DESCRIBE → 0 on subsequent questions)
- Dashboard lazy-loads only when tab clicked (not on app mount)
- Default tab changed back to Ask (no background API storm)
- std::sync::Mutex for WASM compat (no tokio in browser)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ResultStore: execute query, store batches server-side, serve pages on demand
- POST /query/paged → returns query_id + total_rows + page count (no rows)
- GET /query/page/{id}/{page}?size=100 → returns one page of rows
- RecordBatch slicing for efficient page extraction from Arrow batches
- LRU eviction: keeps 50 most recent query results in memory
- Tested: 100K rows → 1,000 pages of 100, any page fetchable by number
- Supervisor pattern: chunk results, serve on demand, retry-safe (idempotent GET)
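The page arithmetic behind /query/paged and /query/page/{id}/{page} reduces to ceiling division plus clamped offsets; RecordBatch slicing then extracts rows [start, end) from the stored batches. A sketch with illustrative names:

```rust
/// Number of pages needed to cover `total_rows` (ceiling division).
fn page_count(total_rows: usize, page_size: usize) -> usize {
    (total_rows + page_size - 1) / page_size
}

/// Row range [start, end) for a zero-indexed page, clamped to the result size.
fn page_bounds(page: usize, page_size: usize, total_rows: usize) -> (usize, usize) {
    let start = page * page_size;
    let end = (start + page_size).min(total_rows);
    (start, end)
}
```

With 100K rows and size=100 this yields the 1,000 pages noted above, and a final partial page clamps cleanly.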
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Schema context limited to 7 core staffing tables (was all 12+)
- Results table capped at 200 rows to prevent DOM explosion
- Shows "first 200 of N rows" when truncated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clean_sql now uses 3 strategies in priority order:
1. Extract from ```sql...``` markdown blocks
2. Find first SELECT/WITH/INSERT statement in text
3. Fallback: strip a leading "sql" keyword
Tested against 5 real model output patterns:
- Clean SQL ✓
- "sql" prefixed ✓
- Markdown fenced ✓
- Explanation before ```sql block ✓
- Explanation with SELECT buried in text ✓
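A condensed sketch of the three strategies in priority order (simplified: byte offsets assume ASCII model output, and the real clean_sql handles more quirks):

```rust
/// Extract SQL from raw model output using three strategies in priority order.
/// Illustrative sketch of the approach, not the production function.
fn clean_sql(raw: &str) -> String {
    // 1. Prefer a ```sql fenced block if one is present.
    if let Some(start) = raw.find("```sql") {
        let body = &raw[start + 6..];
        if let Some(end) = body.find("```") {
            return body[..end].trim().to_string();
        }
    }
    // 2. Otherwise start at the earliest SELECT/WITH/INSERT keyword.
    let upper = raw.to_uppercase();
    if let Some(pos) = ["SELECT", "WITH", "INSERT"]
        .iter()
        .filter_map(|kw| upper.find(kw))
        .min()
    {
        return raw[pos..].trim().to_string();
    }
    // 3. Fallback: strip a leading "sql" keyword.
    raw.trim().trim_start_matches("sql").trim().to_string()
}
```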
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous: only retried on "Schema error" or "No field named"
Now: retries on any error (type mismatches, execution errors, etc.)
Model gets full error message + schema to write corrected SQL.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- clean_sql() strips markdown fences, leading "sql" keyword, trailing explanations
- Schema context now includes table relationships (JOIN paths)
- Explicit note: "vertical only in candidates/clients/job_orders, JOIN for others"
- Full column paths (table.column) in schema to reduce ambiguity
- Auto-retry on schema errors feeds error + schema back to model
- TESTED: 4 questions all return correct results:
"highest avg salary" → IT $2,213 ✓
"top 5 earning over $50/hr" → correct candidates ✓
"most placements by vertical" → Industrial 10,096 ✓
"revenue by client" → 1,996 clients ✓
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Prompt now says "CRITICAL: ONLY use columns from schema, do NOT invent"
- Strips markdown backticks from model output
- Auto-retry: if SQL fails with "Schema error" or "No field named",
feeds the error + schema back to the model for a corrected query
- Both button click and Enter key paths have retry logic
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- POST /ingest/postgres/tables — list all tables in a database
- POST /ingest/postgres/import — import table → Parquet → catalog → queryable
- Auto type mapping: int2/4/8 → Int, float4/8 → Float64, bool → Boolean,
text/varchar/jsonb/timestamp → Utf8 (safe default per ADR-010)
- Auto PII detection + lineage on import
- Empty password support for trust auth
- Tested: imported lab_trials (40 rows, 10 cols) and threat_intel (20 rows, 30 cols)
from local knowledge_base Postgres database — immediately queryable
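The type mapping can be sketched as a plain match. Note the Int16/32/64 widths here are an assumption (the commit records only "int2/4/8 → Int"), and the real import inspects catalog types rather than strings:

```rust
/// Map a Postgres type name to an Arrow type name.
/// Sketch only: widths for int2/4/8 are assumed, and the real code
/// works from pg catalog metadata, not bare strings.
fn map_pg_type(pg: &str) -> &'static str {
    match pg {
        "int2" => "Int16",
        "int4" => "Int32",
        "int8" => "Int64",
        "float4" | "float8" => "Float64",
        "bool" => "Boolean",
        // text, varchar, jsonb, timestamp, and anything unrecognized:
        // Utf8 is the safe default per ADR-010.
        _ => "Utf8",
    }
}
```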
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable
- Polls every 10 seconds (configurable)
- Processed files moved to ./inbox/processed/
- Failed files moved to ./inbox/failed/
- Dedup: same file dropped twice = no-op
- Watcher starts automatically on gateway boot
- Tested: CSV dropped → queryable in <15s
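Format detection for inbox drops can be sketched as extension matching; the exact extension list here is an assumption, and the real watcher may also sniff file content:

```rust
use std::path::Path;

/// Guess an ingest format from the file extension.
/// Sketch only: the extension list is assumed, not the watcher's actual table.
fn detect_format(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "csv" => Some("csv"),
        "json" | "jsonl" => Some("json"),
        "pdf" => Some("pdf"),
        "txt" | "md" => Some("text"),
        _ => None, // unknown files would land in ./inbox/failed/
    }
}
```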
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PHASES.md and project memory updated to reflect actual build state.
Phases 11-14 were built but trackers weren't updated.
Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Schema diff detection: compare old vs new schema, identify changes
(added, removed, type changed, renamed columns)
- Fuzzy rename detection: "first_name" → "full_name" detected by shared word parts
- Auto-generated migration rules: direct map, cast, concat, split, drop
Each rule has confidence score (0.0-1.0)
- AI migration prompt builder: generates LLM prompt for complex schema changes
LLM suggests JSON migration rules when heuristics aren't enough
- 5 new unit tests (detect added, removed, type change, rename, rule generation)
- 30 total unit tests passing
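The fuzzy rename heuristic can be sketched as overlap of snake_case word parts, which is enough to relate "first_name" to "full_name" (illustrative, not the actual rule generator, which also scores confidence):

```rust
use std::collections::HashSet;

/// True when two column names share at least one snake_case word part.
/// Sketch of the rename heuristic; confidence scoring is elided.
fn shares_word_part(old: &str, new: &str) -> bool {
    let a: HashSet<&str> = old.split('_').collect();
    let b: HashSet<&str> = new.split('_').collect();
    a.intersection(&b).next().is_some()
}
```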
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- AccessControl: agent roles with allowed sensitivity levels
- 4 default roles: admin (all), recruiter (PII ok), analyst (financial ok), agent (internal only)
- Field-level masking: determines which columns to mask per agent based on sensitivity
- Query audit log: tracks every query with agent, datasets, PII fields accessed
- Endpoints: GET/POST /access/roles, GET /access/audit, POST /access/check
- Toggleable via config (auth.enabled)
- 100K embedding: supervisor now sustained 125/sec (2.9x vs single pipeline)
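Field-level masking reduces to a membership test: mask any column whose sensitivity falls outside the role's allowed set. A sketch with illustrative type and variant names:

```rust
/// Illustrative sensitivity levels; the real enum likely differs.
#[derive(PartialEq, Clone, Copy)]
enum Sensitivity { Internal, Pii, Financial }

/// Columns an agent must see masked: those whose sensitivity
/// is not in the role's allowed set.
fn columns_to_mask<'a>(
    cols: &[(&'a str, Sensitivity)],
    allowed: &[Sensitivity],
) -> Vec<&'a str> {
    cols.iter()
        .filter(|(_, s)| !allowed.contains(s))
        .map(|(name, _)| *name)
        .collect()
}
```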
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 4 parallel pipelines on i9 + A4000 via Ollama
- Previous single-pipeline: 43/sec, 39min for 100K
- Supervisor: 67.6/sec, 22min for 100K
- Previous 100K attempt failed at 97K (no retry) — supervisor handles this
- Checkpointing every 1000 chunks for crash recovery
- Round-robin retry on batch failure (3 attempts)
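The round-robin retry schedule can be sketched as modular arithmetic over pipeline indices (illustrative, not the supervisor's actual code):

```rust
/// Pipeline indices a failed batch is sent to, in order: the original
/// pipeline, then the next ones round-robin, up to `attempts` tries.
fn retry_pipelines(start: usize, n_pipelines: usize, attempts: usize) -> Vec<usize> {
    (0..attempts).map(|i| (start + i) % n_pipelines).collect()
}
```

A batch that fails on pipeline 3 of 4 would retry on pipelines 0 and 1 before giving up.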
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ToolRegistry: named tools with parameter validation and audit logging
- 6 built-in staffing tools:
search_candidates (skills, city, state, experience, availability)
get_candidate (by ID)
revenue_by_client (top N by billed revenue)
recruiter_performance (placements, revenue per recruiter)
cold_leads (called N+ times, never placed)
open_jobs (by vertical, city)
- Each tool: name, description, params, permission level (read/write/admin)
- SQL template with validated parameter substitution
- Full audit trail: every invocation logged with agent, params, result
- Endpoints: GET /tools (list), GET /tools/{name} (schema),
POST /tools/{name}/call (execute), GET /tools/audit (log)
- Per ADR-015: governed interface before raw SQL for agents
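Validated parameter substitution can be sketched as below. The brace placeholder syntax and the quote/semicolon check are assumptions; the real registry validates against each tool's declared param schema:

```rust
/// Substitute {key} placeholders into a SQL template, rejecting values
/// that could break out of a quoted literal. Sketch only: placeholder
/// syntax and validation rules are assumed, not the registry's actual ones.
fn render_tool_sql(template: &str, params: &[(&str, &str)]) -> Result<String, String> {
    let mut sql = template.to_string();
    for (key, value) in params {
        if value.contains('\'') || value.contains(';') {
            return Err(format!("invalid value for {key}"));
        }
        sql = sql.replace(&format!("{{{key}}}"), value);
    }
    Ok(sql)
}
```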
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- IndexRegistry: tracks all vector indexes with model metadata
(model_name, model_version, dimensions, build stats)
- Index metadata persisted as JSON in vectors/meta/
- Rebuilt on startup for crash recovery
- GET /vectors/indexes — list all indexes (filter by source/model)
- GET /vectors/indexes/{name} — get index metadata
- Background jobs auto-register metadata on completion
- Multi-version support: same data, different models, coexist
- Per ADR-014: enables incremental re-embed on model upgrade
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DatasetManifest expanded: description, owner, sensitivity, columns,
lineage, freshness contract, tags, row_count
- All new fields use #[serde(default)] for backward compatibility
- PII auto-detection: scans column names for email, phone, SSN, salary,
address, DOB, medical terms — flags as PII/PHI/Financial
- Column-level metadata: name, type, sensitivity, is_pii flag
- Lineage tracking: source_system, source_file, ingest_job, timestamp
- Ingest pipeline auto-populates: PII scan, column meta, lineage, row count
- PATCH /catalog/datasets/by-name/{name}/metadata — update metadata
- Catalog responses now include all rich fields
- 25 unit tests passing (5 new PII detection tests)
Per ADR-013: datasets without metadata become mystery files.
This makes every ingested file self-describing from day one.
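The name-based scan can be sketched as substring matching against term lists. The lists below are abridged from the commit (medical/PHI terms omitted) and the label strings are illustrative:

```rust
/// Classify a column's sensitivity from its name alone.
/// Sketch: term lists abridged, labels illustrative.
fn detect_sensitivity(column: &str) -> &'static str {
    let c = column.to_lowercase();
    let pii = ["email", "phone", "ssn", "dob", "address"];
    let financial = ["salary", "rate", "revenue"];
    if pii.iter().any(|t| c.contains(*t)) {
        "PII"
    } else if financial.iter().any(|t| c.contains(*t)) {
        "Financial"
    } else {
        "Internal"
    }
}
```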
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- journald crate: immutable event log for every data mutation
- Events: entity_type, entity_id, field, action, old_value, new_value,
actor, source, workspace_id, timestamp
- In-memory buffer with configurable flush threshold (default 100 events)
- Flush writes events as Parquet to journal/ directory
- Query: GET /journal/history/{entity_id} — full history of any record
- Query: GET /journal/recent?limit=50 — latest events across all entities
- Convenience methods: record_insert, record_update, record_ingest
- Stats: GET /journal/stats — buffer size, persisted file count
- Manual flush: POST /journal/flush
- Per ADR-012: events are never modified or deleted
This is the single most important future-proofing decision.
Once history is lost, it's gone forever.
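The buffer-then-flush behavior can be sketched as below; the Parquet write is elided and the struct/field names are illustrative:

```rust
/// Append-only event buffer that flushes at a threshold.
/// Sketch: the real flush writes the buffer as Parquet to journal/.
struct Journal {
    buffer: Vec<String>,
    flush_threshold: usize,
    flushed_total: usize,
}

impl Journal {
    /// Record one event; returns true when this call triggered a flush.
    fn record(&mut self, event: String) -> bool {
        self.buffer.push(event);
        if self.buffer.len() >= self.flush_threshold {
            // Persist, then clear the in-memory buffer. Persisted events
            // are never modified or deleted (ADR-012).
            self.flushed_total += self.buffer.len();
            self.buffer.clear();
            return true;
        }
        false
    }
}
```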
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- JobTracker: create/update/complete/fail jobs with progress tracking
- POST /vectors/index now returns immediately with job_id (HTTP 202)
- Embedding runs in tokio::spawn background task
- GET /vectors/jobs/{id} returns live progress (chunks embedded, rate, ETA)
- GET /vectors/jobs lists all jobs
- Progress logged every 100 batches with chunks/sec and ETA
- 100K embedding job running successfully at 44 chunks/sec
- System stays responsive during embedding (queries in 23ms)
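The rate and ETA reported by GET /vectors/jobs/{id} follow from elapsed time and progress counts. A sketch of the arithmetic (names illustrative):

```rust
/// Estimated seconds remaining given chunks done, total chunks, and
/// elapsed wall time. None until at least one chunk has completed.
fn eta_secs(done: u64, total: u64, elapsed_secs: f64) -> Option<f64> {
    if done == 0 {
        return None;
    }
    let rate = done as f64 / elapsed_secs; // chunks per second
    Some(total.saturating_sub(done) as f64 / rate)
}
```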
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WorkspaceManager: create/get/list workspaces with daily/weekly/monthly/pinned tiers
- Saved searches: agent stores SQL queries in workspace context
- Shortlist: tag candidates/records to a workspace with notes
- Activity log: track calls, emails, updates per workspace per agent
- Instant handoff: transfer workspace ownership with full history
Zero data copy — just a pointer swap; the receiving agent sees everything
- Persistence: workspaces stored as JSON in object storage, rebuilt on startup
- Endpoints: /workspaces/create, /{id}, /{id}/handoff, /{id}/search,
/{id}/shortlist, /{id}/activity
- Tested: Sarah creates workspace, saves searches, shortlists 3 candidates,
logs activity, hands off to Mike who continues seamlessly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MemCache: LRU in-memory cache for hot datasets (configurable max, default 16GB)
Pin/evict/stats endpoints: POST /query/cache/pin, /cache/evict, GET /cache/stats
- Delta store: append-only delta Parquet files for row-level updates
Write deltas without rewriting base files, merge at query time
- Compaction: POST /query/compact merges deltas into base Parquet
- Query engine: checks cache first, falls back to Parquet, merges deltas
- Benchmarked on 2.47M rows:
1M row JOIN: 854ms cold → 96ms hot (8.9x speedup)
100K filter: 62ms cold → 21ms hot (3x speedup)
1.1M rows cached in 408MB RAM
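Size-bounded LRU eviction can be sketched as below: walk the least-recently-used entries and evict until the cache fits the byte budget. Illustrative only, not the MemCache API (pinned datasets, which eviction must skip, are elided):

```rust
/// Evict least-recently-used datasets until total size fits max_bytes.
/// `lru` is ordered least-recently-used first. Sketch: pinning elided.
fn evict_until_fits<'a>(mut lru: Vec<(&'a str, u64)>, max_bytes: u64) -> Vec<&'a str> {
    let mut total: u64 = lru.iter().map(|(_, b)| *b).sum();
    let mut evicted = Vec::new();
    while total > max_bytes && !lru.is_empty() {
        let (name, bytes) = lru.remove(0);
        total -= bytes;
        evicted.push(name);
    }
    evicted
}
```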
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous regex routes for /catalog, /storage, /health intercepted main site.
Now all lakehouse API calls go through /lakehouse/api/ prefix, stripped by nginx rewrite.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>