profit/lakehouse

Commit Graph

Author: root
SHA1: f9f92706f3

RAG reranker + manifest bucket fix — quality improvements from eval

RAG pipeline now includes a cross-encoder rerank step between retrieval
and generation. The LLM re-sorts top-K results by relevance before
they become context. Falls back to original order if model output is
unparseable (~5% with 7B models). Also improved the generation prompt
to be domain-aware ("staffing database") and request specific citations.
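
The rerank-with-fallback behaviour described above could look roughly like the sketch below. It is not the code from this commit: `parse_ranking`, `rerank`, the `ask_model` callable, and the prompt wording are all hypothetical names chosen for illustration. The only behaviour taken from the message is that the original retrieval order is kept when the model's reply cannot be parsed.

```python
import re
from typing import Callable, List, Optional, Sequence


def parse_ranking(raw: str, k: int) -> Optional[List[int]]:
    """Turn a reply such as "3, 1, 2" into zero-based indices.

    Returns None unless the reply is exactly a permutation of 1..k.
    """
    numbers = [int(tok) for tok in re.findall(r"\d+", raw)]
    if sorted(numbers) != list(range(1, k + 1)):
        return None
    return [n - 1 for n in numbers]


def rerank(query: str, passages: Sequence[str],
           ask_model: Callable[[str], str]) -> List[str]:
    """Ask the model to reorder passages; keep the original order on failure."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    # Hypothetical prompt; the real prompt is not shown in the commit message.
    prompt = (
        f"Query: {query}\n{numbered}\n"
        "Reply with the passage numbers, most relevant first, comma-separated."
    )
    order = parse_ranking(ask_model(prompt), len(passages))
    if order is None:
        return list(passages)  # fallback: unparseable output, keep retrieval order
    return [passages[i] for i in order]
```

Requiring a full permutation is a deliberately strict choice: a reply that skips or repeats a passage is treated the same as garbage, so no retrieved passage is silently dropped from the context.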

Fixed 4 catalog manifests with bucket="data" (pre-federation leftover)
that poisoned the entire DataFusion query context on startup. The
"users", "lab_trials", "meta_runs", and "new_candidates" datasets
now correctly reference bucket="primary". This bug was surfaced by
the quality evaluation pipeline; structural tests alone would not
have found it.
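
A startup check of the kind that would catch this class of bug might be sketched as follows. The manifest layout is an assumption: the commit message does not show the catalog format, so each manifest is modelled here as a plain dict with hypothetical "name" and "bucket" keys, and `find_stale_manifests` is an illustrative helper, not a function from the repository.

```python
from typing import Dict, Iterable, List, Set


def find_stale_manifests(manifests: Iterable[Dict[str, str]],
                         known_buckets: Set[str]) -> List[str]:
    """Return the names of manifests whose bucket is not a configured bucket."""
    return [
        m["name"]
        for m in manifests
        if m.get("bucket") not in known_buckets
    ]


# Example input: one manifest still points at the pre-federation bucket.
# The "orders" dataset is made up for the example.
catalog = [
    {"name": "users", "bucket": "data"},
    {"name": "orders", "bucket": "primary"},
]
stale = find_stale_manifests(catalog, {"primary"})
```

Reporting the stale names, rather than failing on the first one, lets a single bad manifest be skipped or fixed without taking down every other dataset in the query context.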

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Date: 2026-04-16 22:19:11 -05:00

Author: root
SHA1: 294f3f6a49

Scheduled ingest: file watcher auto-ingests from ./inbox

- Drop CSV/JSON/PDF/text into ./inbox → auto-detected → Parquet → queryable
- Polls every 10 seconds (configurable)
- Processed files moved to ./inbox/processed/
- Failed files moved to ./inbox/failed/
- Dedup: same file dropped twice = no-op
- Watcher starts automatically on gateway boot
- Tested: CSV dropped → queryable in <15s
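
The polling loop listed above can be sketched as follows. This is an illustration under stated assumptions, not the watcher from this commit: dedup is assumed to be keyed on a SHA-256 hash of the file contents (the message does not say how duplicates are detected), the format detection and Parquet conversion are stood in for by a caller-supplied `ingest` callable, and `poll_once` and `watch` are hypothetical names.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Set


def poll_once(inbox: Path, seen: Set[str],
              ingest: Callable[[Path], None]) -> None:
    """Scan the inbox once and move each file to processed/ or failed/."""
    processed = inbox / "processed"
    failed = inbox / "failed"
    processed.mkdir(parents=True, exist_ok=True)
    failed.mkdir(parents=True, exist_ok=True)

    for path in sorted(p for p in inbox.iterdir() if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        target = processed
        if digest not in seen:  # assumed dedup rule: same content means no-op
            try:
                ingest(path)
                seen.add(digest)
            except Exception:
                target = failed
        # Path.replace overwrites an existing file with the same name.
        path.replace(target / path.name)


def watch(inbox: Path, ingest: Callable[[Path], None],
          interval: float = 10.0) -> None:
    """Poll forever; the 10-second default matches the interval in the message."""
    seen: Set[str] = set()
    while True:
        poll_once(inbox, seen, ingest)
        time.sleep(interval)
```

In this sketch the `seen` set lives in memory, so dedup would not survive a restart; persisting the hashes is left out to keep the example short.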

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Date: 2026-03-27 20:04:40 -05:00

2 Commits