24 Commits

Author SHA1 Message Date
root
2cac64636c docs: PHASES tracker — mark Phases 42/43/44/45 complete
Today's work shipped four Phase closures (Truth Layer, Validation
Pipeline, Caller Migration, Doc-Drift Detection); the canonical
tracker now reflects them. This lays the foundation for the production
switchover (real Chicago data replaces synthetic test data soon).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 08:03:40 -05:00
root
4251e94531 Update PHASES.md: Phase 41 + Guard fixes
- Phase 41: ProfileType enum, per-type endpoints
- Guard: scrumaudit.py, fixed watcher.sh + pr-reviewer.md
2026-04-23 03:09:05 -05:00
root
55f8e0fe6e Phase 40: Routing Engine + Policy
- RoutingEngine with RouteDecision (model_pattern → provider)
- config/routing.toml: rules, fallback chain, cost gating
- Per-provider Usage tracking in /v1/usage response
- 12 gateway tests green
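
For illustration, a rough sketch of the model_pattern → provider
resolution; the rule/decision shapes and the prefix-wildcard match below
are assumptions, not the actual gateway types:

  // Hypothetical shapes for illustration only, not the real gateway types.
  struct RouteRule { model_pattern: String, provider: String }
  struct RouteDecision { provider: String, fallback: Vec<String> }

  fn route(rules: &[RouteRule], fallback_chain: &[String], model: &str) -> RouteDecision {
      // First matching rule wins; otherwise fall back to the head of the chain.
      let provider = rules.iter()
          .find(|r| pattern_matches(&r.model_pattern, model))
          .map(|r| r.provider.clone())
          .unwrap_or_else(|| fallback_chain.first().cloned().unwrap_or_default());
      RouteDecision { provider, fallback: fallback_chain.to_vec() }
  }

  fn pattern_matches(pattern: &str, model: &str) -> bool {
      match pattern.strip_suffix('*') {
          Some(prefix) => model.starts_with(prefix), // "openrouter/*" style
          None => pattern == model,                  // exact model name
      }
  }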
2026-04-23 02:36:45 -05:00
root
e27a17e950 Phase 39: Provider Adapter Refactor
- ProviderAdapter trait with chat(), embed(), unload(), health()
- OllamaAdapter wrapping existing AiClient
- OpenRouterAdapter for openrouter.ai API integration
- provider_key() routing by model prefix (openrouter/*, etc)
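
Roughly what that seam looks like as a sketch; only the method names come
from this commit, the request/response types and error handling are
placeholders:

  // Placeholder types; the real request/response structs live in aibridge.
  type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;
  struct ChatRequest { model: String, prompt: String }
  struct ChatResponse { text: String }

  // Assumed shape of the ProviderAdapter trait named above (native
  // async-fn-in-trait); the actual signatures may differ.
  trait ProviderAdapter {
      async fn chat(&self, req: ChatRequest) -> Result<ChatResponse>;
      async fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>>;
      async fn unload(&self, model: &str) -> Result<()>;
      async fn health(&self) -> Result<bool>;
  }

  // provider_key() routing by model prefix: "openrouter/..." goes to the
  // OpenRouter adapter, everything else stays on the local Ollama adapter.
  fn provider_key(model: &str) -> &'static str {
      if model.starts_with("openrouter/") { "openrouter" } else { "ollama" }
  }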
2026-04-23 02:24:15 -05:00
root
21e8015b60 Phase 37: Hot-swap async + Phase 38: Universal API skeleton
- JobTracker extended with JobType::ProfileActivation + Embed
- activate_profile returns job_id immediately, work spawns in background
- /v1/chat, /v1/usage, /v1/sessions endpoints (OpenAI-compatible)
- Langfuse trace integration (Phase 40 early deliverable)
- 12 gateway unit tests green, curl gates pass
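
The hot-swap pattern in miniature, as a sketch with a stand-in JobTracker
(assumes tokio; the real job types and ID scheme differ):

  use std::collections::HashMap;
  use std::sync::{Arc, Mutex};

  #[derive(Clone, Debug)]
  enum JobStatus { Running, Done }

  #[derive(Clone, Default)]
  struct JobTracker(Arc<Mutex<HashMap<u64, JobStatus>>>);

  impl JobTracker {
      fn set(&self, id: u64, status: JobStatus) {
          self.0.lock().unwrap().insert(id, status);
      }
  }

  // Return the job_id immediately; the expensive activation work runs on a
  // background task and flips the status when it finishes.
  async fn activate_profile(tracker: JobTracker, profile: String) -> u64 {
      let job_id = next_job_id();
      tracker.set(job_id, JobStatus::Running);
      tokio::spawn(async move {
          // ... load embeddings and rebuild HNSW for `profile` here ...
          let _ = &profile;
          tracker.set(job_id, JobStatus::Done);
      });
      job_id
  }

  fn next_job_id() -> u64 {
      use std::time::{SystemTime, UNIX_EPOCH};
      SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64
  }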
2026-04-23 01:56:17 -05:00
profit
a6f12e2609 Phase 21 Rust port + Phase 27 playbook versioning + doc-sync
Phase 21 — Rust port of scratchpad + tree-split primitives (companion to
the 2026-04-21 TS shipment). New crates/aibridge modules:

  context.rs       — estimate_tokens (chars/4 ceil), context_window_for,
                     assert_context_budget returning a BudgetCheck with
                     numeric diagnostics on both success and overflow.
                     Windows table mirrors config/models.json.
  continuation.rs  — generate_continuable<G: TextGenerator>. Handles the
                     two failure modes: empty-response from thinking
                     models (geometric 2x budget backoff up to budget_cap)
                     and truncated-non-empty (continuation with partial
                     as scratchpad). is_structurally_complete balances
                     braces, then JSON-parse-checks. Guards the degenerate
                     case "all retries empty, don't loop on empty partial".
  tree_split.rs    — generate_tree_split map->reduce with running
                     scratchpad. Per-shard + reduce-prompt go through
                     assert_context_budget first; loud-fails rather than
                     silently truncating. Oldest-digest-first scratchpad
                     truncation at scratchpad_budget (default 6000 tokens).
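
In sketch form, the chars/4 estimator and budget check from the
context.rs entry above (the exact BudgetCheck fields are assumptions):

  // chars/4, rounded up: the cheap token estimate described above.
  fn estimate_tokens(text: &str) -> usize {
      (text.chars().count() + 3) / 4
  }

  // Assumed field names; the real BudgetCheck may differ.
  struct BudgetCheck {
      estimated_tokens: usize,
      context_window: usize,
      reserved_for_output: usize,
      fits: bool,
  }

  // Numeric diagnostics are returned on success as well as overflow, so
  // callers can log headroom instead of a bare pass/fail.
  fn assert_context_budget(prompt: &str, context_window: usize, reserve: usize) -> BudgetCheck {
      let estimated_tokens = estimate_tokens(prompt);
      BudgetCheck {
          estimated_tokens,
          context_window,
          reserved_for_output: reserve,
          fits: estimated_tokens + reserve <= context_window,
      }
  }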

TextGenerator trait (native async-fn-in-trait, edition 2024). AiClient
implements it; ScriptedGenerator test double lets tests inject canned
sequences without a live Ollama.

GenerateRequest gained think: Option<bool> — forwards to sidecar for
per-call hidden-reasoning opt-out on hot-path JSON emitters. Three
existing callsites updated (rag.rs x2, service.rs hybrid answer).

Phase 27 — Playbook versioning. PlaybookEntry gained four optional
fields (all #[serde(default)] so pre-Phase-27 state loads as roots):

  version           u32, default 1
  parent_id         Option<String>, previous version's playbook_id
  superseded_at     Option<String>, set when a newer version replaces it
  superseded_by     Option<String>, the playbook_id that replaced it

New methods:

  revise_entry(parent_id, new_entry) — appends the new version, stamps
    superseded_at+superseded_by on the parent, and sets parent_id plus
    version = parent's version + 1 on the new entry. Rejects revising
    a retired or already-superseded parent (tip-of-chain is the only
    valid revise target).
  history(playbook_id) — returns full chain root->tip from any node.
    Walks parent_id back to root, then superseded_by forward to tip.
    Cycle-safe.
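
As a sketch, the chain-stamping rule the two methods above enforce,
trimmed to just the versioning fields (timestamps and storage elided):

  // Only the Phase 27 chain fields; the real PlaybookEntry is much larger.
  #[derive(Clone, Default)]
  struct PlaybookEntry {
      playbook_id: String,
      version: u32,
      parent_id: Option<String>,
      superseded_at: Option<String>,
      superseded_by: Option<String>,
      retired: bool,
  }

  fn revise_entry(
      entries: &mut Vec<PlaybookEntry>,
      parent_id: &str,
      mut new_entry: PlaybookEntry,
  ) -> Result<(), String> {
      let parent = entries
          .iter_mut()
          .find(|e| e.playbook_id == parent_id)
          .ok_or("parent not found")?;
      if parent.retired || parent.superseded_by.is_some() {
          return Err("only a live tip-of-chain entry can be revised".into());
      }
      new_entry.parent_id = Some(parent.playbook_id.clone());
      new_entry.version = parent.version + 1;
      parent.superseded_at = Some("now".into()); // real code stamps an RFC 3339 timestamp
      parent.superseded_by = Some(new_entry.playbook_id.clone());
      entries.push(new_entry);
      Ok(())
  }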

Superseded entries excluded from boost (same rule as retired): filter
in compute_boost_for_filtered_with_role (both active-entries prefilter
and geo-filtered path), rebuild_geo_index, and upsert_entry's existing-idx
search. status_counts returns (total, retired, superseded, failures);
/status JSON reports active = total - retired - superseded.

Endpoints:
  POST /vectors/playbook_memory/revise
  GET  /vectors/playbook_memory/history/{id}

Doc-sync — PHASES.md + PRD.md drifted from git after Phases 24-26
shipped. Fixes applied:

  - Phase 24 marked shipped (commit b95dd86) with details of observer
    HTTP ingest + scenario outcome streaming. The PRD's "NOT YET WIRED"
    note rewritten to reflect the shipped state.
  - Phase 25 (validity windows, commit e0a843d) added to PHASES +
    PRD.
  - Phase 26 (Mem0 upsert + Letta hot cache, commit 640db8c) added.
  - Phase 27 entry added to both docs.
  - Phase 19.6 time decay corrected: was documented as "deferred",
    actually wired via BOOST_HALF_LIFE_DAYS = 30.0 in playbook_memory.rs.
  - Phase E/Phase 8 tombstone-at-compaction limit note updated —
    Phase E.2 closed it.

Tests: 8 new version_tests in vectord (chain-metadata stamping,
retired/superseded parent rejection, boost exclusion, history from
root/tip/middle, legacy default round-trip, status counts). 25 new
aibridge tests (context/continuation/tree_split). Workspace total
145 green (was 120).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 17:40:49 -05:00
root
137aed64fb Coherence pass — PRD/PHASES updates, config snapshot wired, unit tests
J flagged the audit: "make sure everything flows coherently, no
pseudocode or unnecessary patches or ignoring any particular part of
what we built." This is that pass.

PRD.md updates:
- Phase 19 refinement block — geo-filter + role-prefilter WIRED with
  citation density numbers (0.32 → 1.38, and 2 → 28 on the same scenario).
- Phase 20 rewrite — mistral dropped, qwen3.5 + qwen3 local hot path,
  think:false as the key mechanical finding, kimi-k2.6 upgrade path.
- Phase 21 status block — think plumbing + cloud executor routing
  added after original commit.
- Phase 22 item B (cloud rescue) — pivot sanitizer, rescue verified
  1/3 on stress_01.
- Phase 23 NEW — staffer identity + tool_level + competence-weighted
  retrieval + kb_staffer_report. Auto-discovered worker labels called
  out with real numbers (Rachel Lewis 12× across 4 staffers).
- Phase 24 NEW — Observer/Autotune integration gap DOCUMENTED, not
  fixed. Observer has been idle at 0 ops for 3600+ cycles because
  scenarios hit gateway:3100 directly, bypassing MCP:3700 which the
  observer wraps. This is the honest "we're not using it in these
  tests" signal J surfaced. Fix deferred; gap visible now.

PHASES.md:
- Appended Phases 20-23 as checked, Phase 24 as unchecked gap.
- Updated footer count: 102 unit tests across all layers.
- Latest line updated with 14× citation lift + 46.4pt tool-asymmetry
  finding.

scenario.ts:
- snapshotConfig() was defined but never called. Now fires at every
  scenario start with a stable sha256 hash over the active model set +
  tool_level + cloud flags. config_snapshots.jsonl finally populates,
  which the error_corrections diff path needs in order to work correctly.

kb.test.ts (new): 4 signature invariant tests — stability across
unrelated fields (date, contract, staffer), sensitivity to role/city/
count changes, digest shape. All pass under `bun test`.

service.rs: 6 Rust extractor tests for extract_target_geo +
extract_target_role — basic, missing-state-returns-none, word
boundary (civilian != city), multi-word role, absent role, quoted
value parse. All pass under `cargo test -p vectord --lib extractor_tests`.

Dangling items now honestly documented rather than silently pending:
- Chunking cache (config/models.json SPEC, not wired) — flagged
- Playbook versioning (SPEC, not wired) — flagged
- Observer integration (WIRED but disconnected) — new Phase 24
2026-04-20 23:29:13 -05:00
root
3bc82833ac Update PRD + PHASES.md — reflect 8-commit 2026-04-17 push
PRD status line: "Phases 0-18 shipped; hybrid operational; scheduled
ingest live; PDF OCR live; entering horizon items."

PHASES.md: federation L2 items marked complete, Phase 16.2 (autotune
agent), Phase 17 VRAM gate, MySQL connector, Phase 18 (hybrid Lance),
scheduled ingest, PDF OCR all documented with dates and measurements.

Stats updated: 52+ unit tests, 13 crates, 19 ADRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 20:54:05 -05:00
root
4e1c400f5d Phase E.2: Compaction integrates tombstones — physical deletion closes GDPR loop
Phase E gave us soft-delete at query time (tombstones hide rows via a
DataFusion filter view). This completes the invariant: after compact,
tombstoned rows are PHYSICALLY absent from the parquet on disk.

delta::compact changes:
- Signature adds tombstones: &[Tombstone]
- After merging base + deltas, apply_tombstone_filter builds a
  BooleanArray keep-mask per batch (True where row_key_value is NOT
  in the tombstone set) and applies arrow::compute::filter_record_batch
- Supports Utf8, Int32, Int64 key columns (matches refresh.rs coverage
  for pg- and csv-derived schemas)
- CompactResult gains tombstones_applied + rows_dropped_by_tombstones
- Caller clears tombstone store on success
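
The keep-mask step in sketch form, for the Utf8 key case only (the
Tombstone plumbing is reduced to a HashSet of deleted keys here):

  use std::collections::HashSet;

  use arrow::array::{Array, BooleanArray, StringArray};
  use arrow::compute::filter_record_batch;
  use arrow::error::ArrowError;
  use arrow::record_batch::RecordBatch;

  fn apply_tombstone_filter(
      batch: &RecordBatch,
      key_column: &str,
      tombstoned: &HashSet<String>,
  ) -> Result<RecordBatch, ArrowError> {
      let idx = batch.schema().index_of(key_column)?;
      let keys = batch
          .column(idx)
          .as_any()
          .downcast_ref::<StringArray>()
          .ok_or_else(|| ArrowError::CastError("expected a Utf8 key column".into()))?;

      // True where the row's key is NOT in the tombstone set, i.e. keep the row.
      let keep: BooleanArray = (0..keys.len())
          .map(|i| Some(keys.is_null(i) || !tombstoned.contains(keys.value(i))))
          .collect();

      filter_record_batch(batch, &keep)
  }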

Critical correctness fix surfaced during E2E testing:
The original Phase 8 compact concatenated N independent Parquet byte
streams from record_batch_to_parquet() — each with its own footer.
Parquet readers only see the FIRST footer's data; the rest is invisible.
Latent since Phase 8 shipped; triggered by tombstone-filtering producing
multiple batches. Corrupted candidates.parquet on first test run
(restored from UI fixture copy — good argument for test data in repo).

Fix:
- Single ArrowWriter per compaction, writes every batch into one
  properly-footered Parquet
- Snappy compression to match ingest defaults (otherwise rewrite
  inflated file 3× — 10.5MB → 34MB — because no compression was set)
- Verify-before-swap: parse written buf back to confirm row count
  matches expected; refuses to overwrite base_key if verification fails
- Write to {base_key}.compact-{ts}.tmp first, then to base_key; delete
  temp; only then delete delta files. Any error along the way leaves
  the original base intact.
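
The single-writer shape, sketched (assumes the arrow/parquet/bytes/anyhow
crates; buffered in memory here, whereas the real code streams to the
object store):

  use std::sync::Arc;

  use arrow::record_batch::RecordBatch;
  use bytes::Bytes;
  use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
  use parquet::arrow::ArrowWriter;
  use parquet::basic::Compression;
  use parquet::file::properties::WriterProperties;

  // One writer, one footer: every surviving batch goes through a single
  // ArrowWriter, and the result is parsed back before anything is swapped.
  fn write_and_verify(batches: &[RecordBatch], expected_rows: usize) -> anyhow::Result<Vec<u8>> {
      anyhow::ensure!(!batches.is_empty(), "nothing to write");
      let schema = batches[0].schema();
      let props = WriterProperties::builder()
          .set_compression(Compression::SNAPPY) // match ingest defaults
          .build();

      let mut buf = Vec::new();
      let mut writer = ArrowWriter::try_new(&mut buf, Arc::clone(&schema), Some(props))?;
      for batch in batches {
          writer.write(batch)?;
      }
      writer.close()?; // single, correct footer

      // Verify-before-swap: refuse to overwrite the base file if the fresh
      // buffer's row count doesn't match what we expected to keep.
      let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf.clone()))?;
      let written = reader.metadata().file_metadata().num_rows() as usize;
      anyhow::ensure!(written == expected_rows, "row count mismatch after compaction");
      Ok(buf)
  }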

TombstoneStore::clear(dataset) drops all tombstone batch files and
evicts the per-dataset AppendLog from cache. Called after successful
compact.

QueryEngine::catalog() accessor exposes the Registry so queryd
handlers can reach the tombstone store without routing through gateway
state.

E2E on candidates (100K rows, 15 cols):
- Baseline: 10.59 MB, 100000 rows
- Tombstone CAND-000001/2/3 (soft-delete): 99997 visible, 100000 raw
- Compact: tombstones_applied=3, rows_dropped=3, final_rows=99997
- Post: 10.72 MB (Snappy), valid parquet (1 row_group), 99997 rows
- Restart: persists, tombstones list empty, __raw__candidates also
  99997 (the 3 IDs are physically gone from disk)

PRD invariant close: deletion is now actually deletion, not just
masking. GDPR erasure request → tombstone + schedule compact → data
gone.

Deferred:
- Compact-all-datasets cron (currently manual per-dataset via
  POST /query/compact)
- Compaction of tombstone batch files themselves (they grow at
  flush_threshold=1 per tombstone; TombstoneStore::compact exists
  but not auto-called)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:38:30 -05:00
root
4d5c49090c Phase 16: Hot-swap generations + autotune agent loop
Closes the self-iteration loop from the PRD reframe: an agent can
tune HNSW configs autonomously and the winner flows through to the
next profile activation without human intervention.

Three primitives:

1. PromotionRegistry (vectord::promotion)
   - Per-index current + history at _hnsw_promotions/{index}.json
   - promote(index, entry) atomically swaps current, pushes prior
     onto history (capped at 50)
   - rollback() pops history back onto current; clears current if
     history exhausted
   - config_or(index, default) — the read side used at build time,
     returns promoted config if set else caller's default
   - Full cache + persistence; writes are durable on return

2. Autotune (vectord::autotune)
   - run_autotune(request, ...) — synchronous agent loop
   - Default grid: 5 configs covering the practical range
     (ec=20/40/80/80/160, es=30/30/30/60/30) with seed=42 for
     reproducibility
   - Every trial goes through the existing trial-journal pipeline
     so autotune runs land alongside manual trials in the
     "trials are data" log
   - Winner: max recall first, then min p50 latency; must clear
     min_recall gate (default 0.9) or no promotion happens
   - Config bounds (ec ∈ [10,400], es ∈ [10,200]) reject absurd
     values from the request's optional custom grid
   - On winner: promote with note "autotune winner: recall=X p50=Y"

3. Wiring
   - VectorState gains promotion_registry
   - activate_profile now calls promotion_registry.config_or(...)
     so newly-promoted configs are picked up on next activation —
     the "hot-swap" is: autotune promotes -> profile activates ->
     HNSW rebuilt with new config
   - New endpoints:
       POST /vectors/hnsw/promote/{index}/{trial_id}
             ?promoted_by=...&note=...
       POST /vectors/hnsw/rollback/{index}
       GET  /vectors/hnsw/promoted/{index}
       POST /vectors/hnsw/autotune  { index_name, harness,
                                      min_recall?, grid? }
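
The promote/rollback/config_or semantics in miniature (in-memory only;
the real registry persists to _hnsw_promotions/{index}.json and carries
trial metadata):

  #[derive(Clone)]
  struct HnswConfig { ef_construction: usize, ef_search: usize, seed: Option<u64> }

  #[derive(Default)]
  struct IndexPromotions {
      current: Option<HnswConfig>,
      history: Vec<HnswConfig>, // most recent last, capped at 50
  }

  impl IndexPromotions {
      fn promote(&mut self, entry: HnswConfig) {
          // Swap current, pushing the prior config onto history.
          if let Some(prev) = self.current.replace(entry) {
              self.history.push(prev);
              if self.history.len() > 50 {
                  self.history.remove(0);
              }
          }
      }

      fn rollback(&mut self) {
          // Pop history back onto current; clears current once history is empty.
          self.current = self.history.pop();
      }

      // The read side used at activation/build time: promoted config if set,
      // otherwise the caller's default.
      fn config_or(&self, default: HnswConfig) -> HnswConfig {
          self.current.clone().unwrap_or(default)
      }
  }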

End-to-end verified on threat_intel_v1 (54 vectors):
- autogen harness 'threat_intel_smoke' (10 queries)
- POST /autotune -> 5 trials in 620ms, winner ec=20 es=30
  recall=1.00 p50=64us auto-promoted
- Manual promote of ec=80 es=30 -> history depth 1
- Rollback -> back to ec=20 es=30 autotune winner
- Second rollback -> current cleared
- Re-promote + restart -> persistence verified
- Profile activation after promotion logged:
  "building HNSW ef_construction=80 ef_search=30 seed=Some(42)"
  proving the hot-swap loop is closed.

Deferred:
- Bayesian optimization (random-grid is fine at this config-space size)
- Append-triggered autotune (Phase 17.5 — refresh OnAppend policy
  can schedule autotune after appending sufficient new rows)
- Concurrent autotune per index guard (JobTracker integration)

PRD invariants satisfied: invariant 8 (hot-swappable indexes) is now
real code — promote is atomic, rollback is always available, the
active generation is a persistent pointer not a runtime convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:26:21 -05:00
root
a293502265 Phase 17: Model profiles + scoped search — the LLM-brain keystone
Implements PRD invariant 9 ("every reader gets its own profile") and
completes the multi-model substrate vision. Local models (or agents)
bind to a named set of datasets; activation pre-loads their vector
indexes into memory; search enforces scope.

Schema (shared::types):
- ModelProfile { id, ollama_name, description, bound_datasets,
                 hnsw_config, embed_model, created_at, created_by }
- ProfileHnswConfig mirrors vectord::trial::HnswConfig to avoid a
  cross-crate dep cycle. Default (ec=80, es=30) matches the Phase 15
  trial winner.
- bound_datasets can reference raw dataset names OR AiView names
  (both register as DataFusion tables with the same name, so mixing
  raw tables and PII-redacted views composes naturally)

Catalog (catalogd::registry):
- put_profile validates id is a slug (alphanumeric + -_ only) and
  every binding resolves to an existing dataset or view
- Persistence at _catalog/profiles/{id}.json, loaded on rebuild
- get_profile / list_profiles / delete_profile

HTTP endpoints:
- POST /catalog/profiles  (create/update)
- GET  /catalog/profiles  (list)
- GET/DELETE /catalog/profiles/{id}
- POST /vectors/profile/{id}/activate  (HNSW hot-load)
- POST /vectors/profile/{id}/search    (scope-enforced)

Activation (vectord::service::activate_profile):
- For each bound dataset, find vector indexes with matching source
- Pre-load embeddings into EmbeddingCache
- Build HNSW with profile's config
- Report warmed indexes + per-binding failures + duration
- Failures on individual bindings don't abort — "substrate keeps
  working" per ADR-017

Scoped search (vectord::service::profile_scoped_search):
- Look up profile, verify index.source ∈ profile.bound_datasets
- Returns 403 with allowed bindings list if out-of-scope
- Uses HNSW if index is warm, brute-force cosine otherwise (graceful
  degradation — no "must activate first" friction)
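
The scope check itself is small; a sketch with illustrative shapes (the
real types live in shared::types and vectord):

  use std::collections::HashSet;

  struct ModelProfile { id: String, bound_datasets: HashSet<String> }
  struct VectorIndex { name: String, source: String }

  enum ScopeError { OutOfScope { allowed: Vec<String> } }

  // The index's source dataset must be one of the profile's bindings;
  // otherwise the handler turns this into a 403 listing the allowed bindings.
  fn check_scope(profile: &ModelProfile, index: &VectorIndex) -> Result<(), ScopeError> {
      if profile.bound_datasets.contains(&index.source) {
          Ok(())
      } else {
          Err(ScopeError::OutOfScope {
              allowed: profile.bound_datasets.iter().cloned().collect(),
          })
      }
  }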

Bug fix surfaced during testing: vectord::refresh::try_update_index_meta
was a no-op for first-time indexes, so threat_intel_v1 and
kb_team_runs_v1 (both built via refresh after Phase C shipped) didn't
show up in the index registry. Now it auto-infers the source from the
index name convention (`{source}_vN`) and registers new metadata with
reasonable defaults.

End-to-end verified:
- Created security-analyst profile bound to [threat_intel]
- POST /vectors/profile/security-analyst/activate → warmed
  threat_intel_v1 (54 vectors) in 156ms, HNSW built
- Within-scope search: method=hnsw, returned relevant IP indicators
- Out-of-scope: tried to search resumes_100k_v2 (source=candidates)
  → 403 "profile 'security-analyst' is not bound to 'candidates' —
    allowed bindings: [\"threat_intel\"]"
- staffing-recruiter profile created bound to candidates + placements;
  search without activation fell through to brute_force (graceful)

Deferred (Phase 17 followups):
- VRAM-aware activation (unload-then-load via Ollama keep_alive=0)
  — Ollama already handles this; we don't need to reinvent
- Model-identity in audit trail — Phase 13 has role-based audit;
  adding model_id is ~20 LOC when we want it
- Profile bucket pre-load (profile:user bucket mount) — Phase 17.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:09:43 -05:00
root
d87f2ccac6 Phase E: Soft deletes (tombstones) for compliance-grade row deletion
Implements GDPR/CCPA-compatible row-level deletion without rewriting
the underlying Parquet. Tombstone markers live beside each dataset and
are applied at query time via a DataFusion view that excludes the
deleted row_key_values.

Schema (shared::types):
- Tombstone { dataset, row_key_column, row_key_value, deleted_at,
              actor, reason }
- All tombstones for a dataset must share one row_key_column —
  enforced at write so the query-time filter remains a single
  WHERE NOT IN (...) clause

Storage (catalogd::tombstones):
- Per-dataset AppendLog at _catalog/tombstones/{dataset}/
- flush_threshold=1 + explicit flush after every append — tombstones
  are high-value, low-frequency; durability on return is the contract
- Reuses storaged::append_log infra so compaction is already wired
  (POST .../tombstones/compact will work once we expose it)

Catalog (catalogd::registry):
- add_tombstone validates dataset exists + key column compatibility
- list_tombstones for the GET endpoint
- TombstoneStore exposed via Registry::tombstones() for queryd

HTTP (catalogd::service):
- POST /catalog/datasets/by-name/{name}/tombstone
    { row_key_column, row_key_values[], actor, reason }
  Returns rows_tombstoned count + per-value failure list (207 on
  partial success).
- GET same path lists active tombstones with full audit info.

Query layer (queryd::context):
- Snapshot tombstones-by-dataset before registering tables
- Tombstoned tables: raw goes to "__raw__{name}", public "{name}"
  becomes DataFusion view with
  SELECT * FROM "__raw__{name}" WHERE CAST(col AS VARCHAR) NOT IN (...)
- CAST AS VARCHAR handles both string and integer key columns
- Untombstoned tables register as before — zero overhead
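
The constructed view SQL described above, sketched as a plain string
builder (quoting is simplified; keys come from our own tombstone store,
not user input):

  fn tombstone_view_sql(name: &str, key_column: &str, deleted_keys: &[String]) -> String {
      let not_in = deleted_keys
          .iter()
          .map(|k| format!("'{}'", k.replace('\'', "''"))) // escape embedded quotes
          .collect::<Vec<_>>()
          .join(", ");
      format!(
          "SELECT * FROM \"__raw__{name}\" \
           WHERE CAST(\"{key_column}\" AS VARCHAR) NOT IN ({not_in})"
      )
  }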

End-to-end on candidates (100K rows):
- Pick CAND-000001/2/3 (Linda/Charles/Kimberly)
- POST tombstone -> rows_tombstoned: 3
- COUNT(*) drops 100000 -> 99997
- WHERE candidate_id IN (those 3) -> 0 rows
- candidates_safe view transitively excludes them
  (Linda+Denver: __raw__candidates=159, candidates_safe=158)
- Restart: COUNT still 99997, 3 tombstones reload from disk

Reversibility: tombstones are reversible deletes, not destruction.
Power users can still query "__raw__{name}" to see deleted rows.
Phase 13 access control is what stops a non-admin from accessing
__raw__* tables.

Limits / follow-up:
- Physical compaction not yet integrated — Phase 8's compact_files
  doesn't read tombstones during merge. Tombstoned rows are still
  on disk until that integration ships.
- Phase 9 journald event emission for tombstones not wired —
  tombstone records carry their own actor+reason+timestamp so the
  audit trail is intact, but cross-referencing with the mutation
  event log would help compliance reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:40:48 -05:00
root
09fd446c8d Phase D: AI-safe views — capability-surface projections over base data
Implements the llms3.com "AI-safe views" pattern: a named projection
that exposes only whitelisted columns, with optional row filter and
per-column redactions. AI agents (or Phase 13 roles) bind to the view;
they can never accidentally see PII even if they write raw SQL.

Schema (shared::types):
- AiView { name, base_dataset, columns: Vec<String>, row_filter,
           column_redactions: HashMap<String, Redaction>, ... }
- Redaction enum: Null | Hash | Mask { keep_prefix, keep_suffix }

Catalog (catalogd::registry):
- put_view validates base dataset exists + columns non-empty
- Persists JSON at _catalog/views/{name}.json (sanitized name)
- rebuild() loads views alongside dataset manifests on startup

Query layer (queryd::context):
- build_context registers every AiView as a DataFusion view object
- Constructed SELECT applies whitelist projection, WHERE filter, and
  redaction expressions per column
  - Mask: substr(prefix) + repeat('*', mid_len) + substr(suffix)
  - Hash: digest(value, 'sha256')
  - Null: CAST(NULL AS VARCHAR) AS col
- DataFusion handles JOINs/aggregates over the view natively — it's a
  real view, not a query rewrite
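
A sketch of how those redaction expressions expand per column; the
function names substr/repeat/digest come from the bullets above, while the
concatenation operator and quoting are illustrative:

  use std::collections::HashMap;

  enum Redaction {
      Null,
      Hash,
      Mask { keep_prefix: usize, keep_suffix: usize },
  }

  fn redaction_expr(col: &str, redaction: &Redaction) -> String {
      match redaction {
          Redaction::Null => format!("CAST(NULL AS VARCHAR) AS \"{col}\""),
          Redaction::Hash => format!("digest(\"{col}\", 'sha256') AS \"{col}\""),
          Redaction::Mask { keep_prefix, keep_suffix } => format!(
              "substr(\"{col}\", 1, {keep_prefix}) || \
               repeat('*', length(\"{col}\") - {keep_prefix} - {keep_suffix}) || \
               substr(\"{col}\", length(\"{col}\") - {keep_suffix} + 1, {keep_suffix}) \
               AS \"{col}\""
          ),
      }
  }

  // Whitelisted columns pass through untouched; redacted ones are rewritten.
  fn view_select(columns: &[String], redactions: &HashMap<String, Redaction>) -> String {
      columns
          .iter()
          .map(|c| match redactions.get(c) {
              Some(r) => redaction_expr(c, r),
              None => format!("\"{c}\""),
          })
          .collect::<Vec<_>>()
          .join(", ")
  }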

HTTP (catalogd::service):
- POST /catalog/views (create)
- GET  /catalog/views (list)
- GET  /catalog/views/{name} (full def)
- DELETE /catalog/views/{name}

End-to-end test on candidates (100K rows, 15 columns):

  candidates_safe view:
    columns: candidate_id, first_name, city, state, vertical,
             skills, years_experience, status
    row_filter: status != 'blocked'
    redaction: candidate_id mask(prefix=3, suffix=2)

  SELECT * FROM candidates_safe LIMIT 5
    -> 8 columns only, candidate_id shown as "CAN******01"
       (PII fields email/phone/last_name absent from result)

  SELECT email FROM candidates_safe
    -> fails (column not in projection)

  SELECT email FROM candidates
    -> succeeds (raw table still accessible by name —
       Phase 13 access control is the gate, not the view itself)

Survives restart — view definitions reload from object storage.

Limits / not in MVP:
- View CANNOT shadow base table by name (DataFusion treats them as
  separate identifiers; access control must restrict raw-table access)
- row_filter is treated as trusted SQL — operators must validate
  before persisting; only authenticated admin path should call put_view
- Redaction expressions assume column is castable to VARCHAR; numeric
  redactions could be misleading (a Hash on Int64 returns a hex string
  that won't equi-join with another hash on the same value type)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 09:16:44 -05:00
root
24f1249a62 Federation layer 2: header routing + cross-bucket SQL
Three pieces of the multi-bucket federation made real:

1. Catalog migration (POST /catalog/migrate-buckets)
   - One-shot normalizer for ObjectRef.bucket field
   - Empty -> "primary"; legacy "data"/"local" -> "primary"
   - Idempotent; re-running on canonical state is no-op
   - Ran on existing catalog: 12 refs renamed from "data", 2 already
     "primary", all 14 now canonical

2. X-Lakehouse-Bucket header middleware on ingest
   - resolve_bucket() helper extracts header, returns
     (bucket_name, store) or 404 with valid bucket list
   - ingest_file and ingest_db_stream now route writes per-request
   - Defaults to "primary" when header absent
   - pipeline::ingest_file_to_bucket records the actual bucket on the
     ObjectRef so catalog stays the source of truth for "where does this
     data live"
   - Verified: ingest with X-Lakehouse-Bucket: testing lands in
     data/_testing/, ingest without header lands in data/, bad header
     returns 404 with hint

3. queryd registers every bucket with DataFusion
   - QueryEngine now holds Arc<BucketRegistry> instead of single store
   - build_context iterates all buckets, registers each as a separate
     ObjectStore under URL scheme "lakehouse-{bucket}://"
   - ListingTable URLs include the per-object bucket scheme so
     DataFusion routes scans automatically based on ObjectRef.bucket
   - Profile bucket names like "profile:user" sanitized to
     "lakehouse-profile-user" since URL host segments can't contain ":"
   - Tolerant of duplicate manifest entries (pre-existing
     pipeline::ingest_file behavior creates a fresh dataset id per
     ingest); duplicates skipped with debug log
   - Backward compat: legacy "lakehouse://data/" URL still registered
     pointing at primary
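
The header-routing helper, sketched against axum's HeaderMap (the real
resolve_bucket also returns the bucket's ObjectStore handle, elided here):

  use std::collections::HashMap;

  use axum::http::{HeaderMap, StatusCode};

  struct BucketRegistry { buckets: HashMap<String, ()> } // store handles elided

  // X-Lakehouse-Bucket picks the bucket; absence means "primary"; an unknown
  // name yields a 404 with the list of valid buckets as a hint.
  fn resolve_bucket<'a>(
      registry: &'a BucketRegistry,
      headers: &HeaderMap,
  ) -> Result<&'a str, (StatusCode, String)> {
      let requested = headers
          .get("x-lakehouse-bucket")
          .and_then(|v| v.to_str().ok())
          .unwrap_or("primary");

      match registry.buckets.get_key_value(requested) {
          Some((name, _)) => Ok(name.as_str()),
          None => {
              let valid: Vec<&str> = registry.buckets.keys().map(String::as_str).collect();
              Err((
                  StatusCode::NOT_FOUND,
                  format!("unknown bucket '{requested}'; valid buckets: {valid:?}"),
              ))
          }
      }
  }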

Success gate: cross-bucket CROSS JOIN
  SELECT p.name, p.role, a.species
  FROM people_test p          (bucket: testing)
  CROSS JOIN animals a        (bucket: primary)
  LIMIT 5
returns rows correctly. DataFusion routed each scan to its bucket's
ObjectStore based on the URL scheme.

No regressions: SELECT COUNT(*) FROM candidates still returns 100000
from the primary bucket.

Deferred to Phase 17:
- POST /profile/{user}/activate (HNSW hot-load on profile switch)
- vectord storage paths becoming bucket-scoped (trial journals,
  eval sets per-profile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:52:32 -05:00
root
97a376482c Phase C: Decoupled embedding refresh
Implements the llms3.com-inspired pattern: embeddings refresh
asynchronously, decoupled from transactional row writes. New rows arrive,
ingest marks the vector index stale, a later refresh embeds only the
delta (doc_ids not already in the index).

Schema additions (DatasetManifest):
- last_embedded_at: Option<DateTime> - when the index was last refreshed
- embedding_stale_since: Option<DateTime> - set when data written, cleared on refresh
- embedding_refresh_policy: Option<RefreshPolicy> - Manual | OnAppend | Scheduled

Ingest paths (pipeline::ingest_file + pg_stream) call
registry.mark_embeddings_stale after writing. No-op if the dataset has
never been embedded — stale semantics only kick in once last_embedded_at
is set.

Refresh pipeline (vectord::refresh::refresh_index):
- Reads the dataset Parquet, extracts (doc_id, text) pairs
- Accepts Utf8 / Int32 / Int64 id columns (covers both CSV and pg schemas)
- Loads existing embeddings via EmbeddingCache (empty on first-time build)
- Filters to rows whose doc_id is NOT in the existing set
- Chunks (chunker::chunk_column), embeds via Ollama (batches of 32),
  writes combined index, clears stale flag
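
The delta selection at the heart of the refresh, sketched; the embed step
is a placeholder closure standing in for the Ollama round-trip:

  use std::collections::HashSet;

  // Only rows whose doc_id is not already in the index get re-embedded.
  fn select_delta<'a>(
      rows: &'a [(String, String)],       // (doc_id, text) read from the Parquet
      already_embedded: &HashSet<String>, // doc_ids present in the existing index
  ) -> Vec<&'a (String, String)> {
      rows.iter()
          .filter(|(doc_id, _)| !already_embedded.contains(doc_id))
          .collect()
  }

  fn refresh_delta(
      rows: &[(String, String)],
      already_embedded: &HashSet<String>,
      embed_batch: impl Fn(&[&str]) -> Vec<Vec<f32>>, // batches of 32 in the real path
  ) -> Vec<(String, Vec<f32>)> {
      let delta = select_delta(rows, already_embedded);
      let mut out = Vec::with_capacity(delta.len());
      for chunk in delta.chunks(32) {
          let texts: Vec<&str> = chunk.iter().map(|(_, text)| text.as_str()).collect();
          for ((doc_id, _), vector) in chunk.iter().zip(embed_batch(&texts)) {
              out.push((doc_id.clone(), vector));
          }
      }
      out
  }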

Endpoints:
- POST /vectors/refresh/{dataset_name} - body {index_name, id_column,
  text_column, chunk_size?, overlap?}
- GET /vectors/stale - lists datasets whose embedding_stale_since is set

End-to-end verified on threat_intel (knowledge_base.threat_intel):
- Initial refresh: 20 rows -> 20 chunks -> embedded in 2.1s,
  last_embedded_at set
- Idempotent second refresh: 0 new docs -> 1.8ms (pure delta check)
- Re-ingest to 54 rows: mark_embeddings_stale fires -> stale_since set
- /vectors/stale surfaces threat_intel with timestamps + policy
- Delta refresh: 34 new docs embedded in 970ms (6x faster than full
  re-embed); stale_cleared = true

Not in MVP scope:
- UPDATE semantics (same doc_id, different content) - would need
  per-row content hashing
- OnAppend policy auto-trigger - just declares intent; actual scheduler
  deferred
- Scheduler runtime - the Scheduled(cron) variant declares the intent so
  operators can see which datasets expect what, but the cron itself is
  separate

Per ADR-019: when a profile switches to vector_backend=Lance, this
refresh path benefits — Lance's native append replaces our "read all +
rewrite" Parquet rebuild pattern. Current MVP works well enough at
~500-5K rows to validate the architecture; Lance unblocks the 5M+ case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 03:00:43 -05:00
root
dbe00d018f Federation foundation + HNSW trial system + Postgres streaming + PRD reframe
Four shipped features and a PRD realignment, all measured end-to-end:

HNSW trial system (Phase 15 horizon item → complete)
- vectord: EmbeddingCache, harness (eval sets + brute-force ground truth),
  TrialJournal, parameterized HnswConfig on build_index_with_config
- /vectors/hnsw/trial, /hnsw/trials/{idx}, /hnsw/trials/{idx}/best,
  /hnsw/evals/{name}/autogen, /hnsw/cache/stats
- Measured on resumes_100k_v2 (100K × 768d): brute-force 44ms -> HNSW 873us
  at 100% recall@10. ec=80 es=30 locked as HnswConfig::default()
- Lower ec values trade recall for build time: 20/30 = 0.96 recall in 8s,
  80/30 = 1.00 recall in 230s

Catalog manifest repair
- catalogd: resync_from_parquet reads parquet footers to restore row_count
  and columns on drifted manifests
- POST /catalog/datasets/{name}/resync + POST /catalog/resync-missing
- All 7 staffing tables recovered to PRD-matching 2,469,278 rows

Federation foundation (ADR-017)
- shared::secrets: SecretsProvider trait + FileSecretsProvider (reads
  /etc/lakehouse/secrets.toml, enforces 0600 perms)
- storaged::registry::BucketRegistry — multi-bucket resolution with
  rescue_bucket read fallback and reachability probing
- storaged::error_journal — bucket op failures visible in one HTTP call
- storaged::append_log — write-once batched append pattern (fixes the RMW
  anti-pattern llms3.com calls out; errors and trial journals both use it)
- /storage/buckets, /storage/errors, /storage/bucket-health,
  /storage/errors/{flush,compact}
- Bucket-aware I/O at /storage/buckets/{bucket}/objects/{*key} with
  X-Lakehouse-Rescue-Used observability headers on fallback

Postgres streaming ingest
- ingestd::pg_stream: DSN parser, batched ORDER BY + LIMIT/OFFSET pagination
  into ArrowWriter, lineage redacts password
- POST /ingest/db — verified against live knowledge_base.team_runs
  (586 rows × 13 cols, 6 batches, 196ms end-to-end)

PRD realignment (2026-04-16)
- Dual use case: staffing analytics + local LLM knowledge substrate
- Removed "multi-tenancy (single-owner system)" from non-goals
- Added invariants 8-11: indexes hot-swappable, per-reader profiles,
  trials-as-data, operational failures findable in one HTTP call
- New phases 16 (hot-swap generations), 17 (model profiles + dataset
  bindings), 18 (Lance vs Parquet+sidecar evaluation)
- Known ceilings table documents the 5M vector wall and escape hatches
- ADR-017 (federation), ADR-018 (append-log pattern) added
- EXECUTION_PLAN.md sequences phases B-E with success gates and
  decision rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 01:50:05 -05:00
root
8282842eaf Sync memory + phases: all 15 phases marked complete
PHASES.md and project memory updated to reflect actual build state.
Phases 11-14 were built but trackers weren't updated.

Final stats: 11 crates, 30 tests, 16 ADRs, 2.47M rows, 100K vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:34:15 -05:00
root
6d49f81ebf Add read-mem skill + comprehensive project memory
- /read-mem skill: reads PRD, phases, decisions, checks live services
- Updated PHASES.md with all 15 phases tracked
- Updated project_lakehouse.md memory with full context
- Updated CLAUDE.md with project reference
- Skill at ~/.claude/skills/read-mem/ and project level
- Triggers on: "read mem", "project status", "where were we", "catch me up"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 09:23:01 -05:00
root
01373c0e45 Phase 5: hardening — gRPC, observability, auth, config
- proto: lakehouse.proto with CatalogService, QueryService, StorageService, AiService
- proto crate: tonic-build codegen from proto definitions
- catalogd: gRPC CatalogService implementation
- gateway: dual HTTP (:3100) + gRPC (:3101) servers
- gateway: OpenTelemetry tracing with stdout exporter
- gateway: API key auth middleware (toggleable)
- shared: TOML config system with typed structs and defaults
- lakehouse.toml config file
- ADR-006 and ADR-007 documented

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:37:07 -05:00
root
50a8c8013f Phase 4: Dioxus frontend with dataset browser and SQL query editor
- ui: Dioxus WASM app with dataset sidebar, SQL editor (Ctrl+Enter), results table
- ui: dynamic API base URL (same-origin for nginx, port-based for local dev)
- gateway: CORS enabled for cross-origin requests
- nginx: lakehouse.devop.live proxies UI (:3300) + API (:3100) on same origin
- justfile: ui-build, ui-serve, sidecar, up commands added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 06:24:15 -05:00
root
239e471223 Phase 3: AI integration with Ollama via Python sidecar
- sidecar: FastAPI app with /embed, /generate, /rerank hitting Ollama
- sidecar: Dockerfile, env var config (EMBED_MODEL, GEN_MODEL, RERANK_MODEL)
- aibridge: reqwest HTTP client with typed request/response structs
- aibridge: Axum proxy endpoints (POST /ai/embed, /ai/generate, /ai/rerank)
- gateway: wires AiClient with SIDECAR_URL env var
- e2e verified: nomic-embed-text returns 768d vectors, qwen2.5 generates text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:53:56 -05:00
root
19bdfab227 Phase 2: DataFusion query engine over Parquet
- queryd: SessionContext with custom URL scheme to avoid path doubling with LocalFileSystem
- queryd: ListingTable registration from catalog ObjectRefs with schema inference
- queryd: POST /query/sql returns JSON {columns, rows, row_count}
- queryd→catalogd wiring: reads all datasets, registers as named tables
- gateway: wires QueryEngine with shared store + registry
- e2e verified: SELECT *, WHERE/ORDER BY, COUNT/AVG all correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:48:20 -05:00
root
655b6c0b37 Phase 1: storage + catalog layer
- storaged: object_store backend (LocalFileSystem), PUT/GET/DELETE/LIST endpoints
- shared: arrow_helpers with Parquet roundtrip + schema fingerprinting (2 tests)
- catalogd: in-memory registry with write-ahead manifest persistence to object storage
- catalogd: POST/GET /datasets, GET /datasets/by-name/{name}
- gateway: wires storaged + catalogd with shared object_store state
- Phase tracker updated: Phase 0 + Phase 1 gates passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 05:15:27 -05:00
root
a52ca841c6 Phase 0: bootstrap Rust workspace
- Cargo workspace with 6 crates: shared, storaged, catalogd, queryd, aibridge, gateway
- shared: types (DatasetId, ObjectRef, SchemaFingerprint, DatasetManifest) + error enum
- gateway: Axum HTTP entrypoint with nested service routers + tracing
- All services expose /health stubs
- justfile with build/test/run recipes
- PRD, phase tracker, and ADR docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 04:59:05 -05:00