golangLAKEHOUSE

profit/golangLAKEHOUSE

Commit Graph

Author	SHA1	Message	Date

root

56844c3f31

embed cache — LRU at /v1/embed for repeat-query elimination

Adds CachedProvider wrapping the embedding Provider with a thread-safe
LRU keyed on (effective_model, sha256(text)) → []float32. Repeat
queries return the stored vector without round-tripping to Ollama.

Why this matters: the staffing 500K test (memory project_golang_lakehouse)
documented that the staffing co-pilot replays many of the same query
texts ("forklift driver IL", "welder Chicago", "warehouse safety", etc).
Each repeat paid the ~50ms Ollama round-trip. Cached repeats now serve
in <1µs (LRU lookup + sha256 of input).

Memory budget: ~3 KiB per entry at d=768. Default 10K entries ≈ 30 MiB.
Configurable via [embedd].cache_size; 0 disables (pass-through mode).
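
The budget figures above can be checked with a few lines of arithmetic. This is a minimal sketch that counts only the float32 payload (4 bytes per dimension); key strings and LRU bookkeeping overhead are ignored, and the helper names entryBytes and cacheBytes are hypothetical.

```go
package main

import "fmt"

// entryBytes returns the vector payload of one cache entry:
// dim float32 values at 4 bytes each. Key and LRU bookkeeping
// overhead are deliberately left out (assumption).
func entryBytes(dim int) int {
	return dim * 4
}

// cacheBytes estimates the vector payload of a completely full cache.
func cacheBytes(dim, entries int) int {
	return entryBytes(dim) * entries
}

func main() {
	const dim, entries = 768, 10_000
	perEntry := entryBytes(dim)
	total := cacheBytes(dim, entries)
	fmt.Printf("per entry: %d bytes (%.1f KiB)\n", perEntry, float64(perEntry)/1024)
	fmt.Printf("full cache: %.1f MiB\n", float64(total)/(1024*1024))
}
```

At d=768 the payload is exactly 3072 bytes per entry, so 10K entries land just under 30 MiB before overhead.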

Per-text caching, not per-batch — a batch with mixed hits/misses only
fetches the misses upstream, then merges the result preserving caller
input order. A three-text batch with one miss makes a single upstream
call carrying only the missed text, not all three.
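
The partial-fetch behavior described above can be sketched roughly as follows. This is not the code in cached.go: a plain map stands in for the LRU, the key is the raw text rather than the model-plus-hash key, and fetchFunc and embedBatch are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// fetchFunc stands in for the upstream embedding call (hypothetical).
type fetchFunc func(texts []string) ([][]float32, error)

// embedBatch returns one vector per input text in caller order and
// sends only the cache misses upstream, in a single call.
func embedBatch(cache map[string][]float32, texts []string, fetch fetchFunc) ([][]float32, error) {
	out := make([][]float32, len(texts))
	var missTexts []string
	var missIdx []int
	for i, t := range texts {
		if v, ok := cache[t]; ok {
			out[i] = v // hit: fill the caller's slot directly
			continue
		}
		missTexts = append(missTexts, t)
		missIdx = append(missIdx, i)
	}
	if len(missTexts) == 0 {
		return out, nil // all hits: no upstream call at all
	}
	vecs, err := fetch(missTexts)
	if err != nil {
		return nil, err
	}
	if len(vecs) != len(missTexts) {
		return nil, errors.New("upstream returned wrong number of vectors")
	}
	// Merge fetched vectors back into their original positions.
	for j, i := range missIdx {
		cache[missTexts[j]] = vecs[j]
		out[i] = vecs[j]
	}
	return out, nil
}

func main() {
	cache := map[string][]float32{
		"forklift driver IL": {0.1},
		"warehouse safety":   {0.3},
	}
	var sent []string // records what actually went upstream
	fetch := func(texts []string) ([][]float32, error) {
		sent = append(sent, texts...)
		vecs := make([][]float32, len(texts))
		for i := range texts {
			vecs[i] = []float32{0.2}
		}
		return vecs, nil
	}
	batch := []string{"forklift driver IL", "welder Chicago", "warehouse safety"}
	out, err := embedBatch(cache, batch, fetch)
	fmt.Println(out, sent, err)
}
```

Tracking the original index of each miss is what lets the merge step preserve input order without a second lookup pass.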

Implementation:
  internal/embed/cached.go (NEW, 150 LoC)
    CachedProvider implements Provider; uses hashicorp/golang-lru/v2.
    Key shape: "<model>:<sha256-hex>". Empty model resolves to
    defaultModel (request-derived) for the key — NOT res.Model
    (upstream-derived), so future requests with same input shape
    hit the same key. Caught by TestCachedProvider_EmptyModelResolvesToDefault.
    Atomic hit/miss counters + Stats() + HitRate() + Len().

  internal/embed/cached_test.go (NEW, 12 test funcs)
    Pass-through-when-zero, hit-on-repeat, mixed-batch only fetches
    misses, model-key isolation, empty-model resolves to default,
    LRU eviction at cap, error propagation, all-hits synthesized
    without upstream call, hit-rate accumulation, empty-texts
    rejected, concurrent-safe (50 goroutines × 100 calls), key
    stability + distinctness.

  internal/shared/config.go
    EmbeddConfig.CacheSize (toml: cache_size). Default 10000.

  cmd/embedd/main.go
    Wraps Ollama Provider with CachedProvider on startup. Adds
    /embed/stats endpoint exposing hits / misses / hit_rate / size.
    Operators check the rate to confirm the cache is working
    (high rate = good) or sized wrong (low rate + many misses on a
    workload that should have repeats).

  cmd/embedd/main_test.go
    Stats endpoint tests — disabled mode shape, enabled mode tracks
    hits + misses across repeat calls.
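
For readers who want a picture of the stats endpoint listed above, here is a rough sketch of a handler that reports the four values. The counters type, the statsPayload struct, and the JSON field names are assumptions chosen to match the names in this message; the real handler in cmd/embedd/main.go may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// counters is a hypothetical stand-in for the cache's atomic hit/miss tracking.
type counters struct {
	hits, misses atomic.Int64
	size         atomic.Int64
}

// statsPayload carries the four values named in the commit message;
// the JSON field names are assumed.
type statsPayload struct {
	Hits    int64   `json:"hits"`
	Misses  int64   `json:"misses"`
	HitRate float64 `json:"hit_rate"`
	Size    int64   `json:"size"`
}

// snapshot reads the counters and derives the hit rate,
// reporting 0 when no lookups have happened yet.
func (c *counters) snapshot() statsPayload {
	h, m := c.hits.Load(), c.misses.Load()
	p := statsPayload{Hits: h, Misses: m, Size: c.size.Load()}
	if total := h + m; total > 0 {
		p.HitRate = float64(h) / float64(total)
	}
	return p
}

// statsHandler serves the snapshot as JSON.
func statsHandler(c *counters) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(c.snapshot())
	}
}

func main() {
	c := &counters{}
	c.hits.Add(3)
	c.misses.Add(1)
	c.size.Add(1)

	// Exercise the handler in-process instead of binding a port.
	rec := httptest.NewRecorder()
	statsHandler(c)(rec, httptest.NewRequest(http.MethodGet, "/embed/stats", nil))
	fmt.Print(rec.Body.String())
}
```

Guarding the division keeps a freshly started or disabled cache from reporting NaN as its hit rate.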

One real bug caught by my own test:
  Initial implementation cached under res.Model (upstream-resolved)
  rather than effectiveModel (request-resolved). A request with
  model="" was cached under "test-model" (Ollama's default), so a
  later request with model="the-default" (our config default) missed
  the cache. Fix: always use the request-derived effectiveModel
  for keys; that's the predictable side. Locked by
  TestCachedProvider_EmptyModelResolvesToDefault.
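
The fix can be illustrated with a small key-building sketch. It follows the "<model>:<sha256-hex>" key shape stated earlier, but the function names and signatures here are hypothetical and not copied from cached.go.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// effectiveModel resolves an empty request model to the configured
// default, so the key depends only on request-side information.
func effectiveModel(requested, defaultModel string) string {
	if requested == "" {
		return defaultModel
	}
	return requested
}

// cacheKey builds "<model>:<sha256-hex>" from the request-derived model.
// It never consults the model name reported by the upstream response.
func cacheKey(requested, defaultModel, text string) string {
	sum := sha256.Sum256([]byte(text))
	return effectiveModel(requested, defaultModel) + ":" + hex.EncodeToString(sum[:])
}

func main() {
	const def = "the-default"
	implicit := cacheKey("", def, "welder Chicago")
	explicit := cacheKey("the-default", def, "welder Chicago")
	other := cacheKey("other-model", def, "welder Chicago")

	fmt.Println(implicit == explicit) // same key: empty model resolves to the default
	fmt.Println(implicit == other)    // different key: models are isolated
}
```

Because both the empty and the explicit default model produce the same key, the second request hits the entry stored by the first, which is the behavior the regression test locks in.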

Verified:
  go test -count=1 ./internal/embed/  — all 12 cached tests + 6 ollama tests green
  go test -count=1 ./cmd/embedd/      — stats endpoint tests green
  just verify                          — vet + test + 9 smokes 33s

Production benefit:
  ~50ms Ollama round-trip → <1µs cache lookup for cached entries.
  At 10K-entry default + ~30% repeat rate (typical staffing co-pilot
  workload), saves several seconds per staffer-query session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 06:54:30 -05:00

1 Commits