lakehouse

root 937569d188 ADR-020: Universal ID mapping — fix the flat embedding identity problem

THE REAL PROBLEM: Every new data source produces different doc_id
prefixes in vector indexes (W-, W500K-, W5K-, CAND-). Hybrid search
had to hardcode strip_prefix for each one. New datasets broke hybrid
until someone added another prefix. This violates the goal of supporting
"any data source without pre-defined schemas."

THE FIX: IndexMeta.id_prefix — the catalog records what prefix each
index uses. Hybrid search reads it and strips automatically. Legacy
indexes fall back to heuristic stripping. New indexes can set
id_prefix=None to use raw IDs (no prefix, no stripping needed).
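
A minimal sketch of how the stripping described above might look, not the actual code in this commit. Only `IndexMeta` and `id_prefix` come from the commit message. The three-way `IdPrefix` enum, the `LEGACY_PREFIXES` list, and `normalize_doc_id` are hypothetical names introduced for illustration. The commit does not say how the catalog tells a legacy index apart from a new raw-ID index, so this sketch assumes an explicit `Legacy` variant.

```rust
/// Hypothetical model of the prefix recorded in the catalog.
/// The commit only names `IndexMeta.id_prefix`; this enum is an assumption.
#[derive(Debug, Clone)]
pub enum IdPrefix {
    /// The catalog records an explicit prefix, e.g. "W500K-".
    Known(String),
    /// New index storing raw IDs (the commit's `id_prefix=None`): nothing to strip.
    Raw,
    /// Legacy index with no recorded prefix: fall back to heuristic stripping.
    Legacy,
}

/// Hypothetical catalog entry, reduced to the one field discussed here.
#[derive(Debug, Clone)]
pub struct IndexMeta {
    pub id_prefix: IdPrefix,
}

/// Prefixes named in the commit message, used only by the legacy heuristic.
const LEGACY_PREFIXES: [&str; 4] = ["W500K-", "W5K-", "CAND-", "W-"];

/// Map a vector-index doc_id back to the source ID that hybrid search joins on.
pub fn normalize_doc_id<'a>(meta: &IndexMeta, doc_id: &'a str) -> &'a str {
    match &meta.id_prefix {
        // Strip the recorded prefix; leave the ID unchanged if it is absent.
        IdPrefix::Known(prefix) => doc_id.strip_prefix(prefix.as_str()).unwrap_or(doc_id),
        // Raw IDs pass through untouched.
        IdPrefix::Raw => doc_id,
        // Try each known legacy prefix in turn; the first match wins.
        IdPrefix::Legacy => LEGACY_PREFIXES
            .iter()
            .find_map(|prefix| doc_id.strip_prefix(*prefix))
            .unwrap_or(doc_id),
    }
}
```

Under these assumptions, a new dataset would only need its prefix (or `Raw`) recorded at index creation. The hardcoded list is consulted only for indexes that predate the catalog field.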

This means: ingest a new dataset, embed it, and hybrid search works
immediately without code changes. The system is truly source-agnostic.

Also: full ADR document at docs/ADR-020-universal-id-mapping.md
with the three options considered and rationale for the chosen approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-17 11:58:18 -05:00

crates

ADR-020: Universal ID mapping — fix the flat embedding identity problem

2026-04-17 11:58:18 -05:00

data

Ingest Ethereal 10K worker profiles — domain data in the substrate

2026-04-16 22:26:19 -05:00

docs

ADR-020: Universal ID mapping — fix the flat embedding identity problem

2026-04-17 11:58:18 -05:00

inbox/processed

Scheduled ingest: file watcher auto-ingests from ./inbox