2 Commits

root · 56844c3f31 · embed cache — LRU at /v1/embed for repeat-query elimination
Adds CachedProvider wrapping the embedding Provider with a thread-safe
LRU keyed on (effective_model, sha256(text)) → []float32. Repeat
queries return the stored vector without round-tripping to Ollama.
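
The wrapper's shape, as a minimal sketch (struct fields, names, and
the Provider details here are illustrative assumptions, not verbatim
from cached.go):

  package embed

  import (
      "crypto/sha256"
      "encoding/hex"
      "sync/atomic"

      lru "github.com/hashicorp/golang-lru/v2"
  )

  // CachedProvider wraps an upstream Provider with a bounded,
  // thread-safe LRU of previously computed vectors.
  type CachedProvider struct {
      upstream     Provider
      cache        *lru.Cache[string, []float32]
      defaultModel string
      hits, misses atomic.Uint64
  }

  // cacheKey builds "<model>:<sha256-hex>", resolving an empty model
  // to the request-side default so repeated inputs map to one key.
  func (p *CachedProvider) cacheKey(model, text string) string {
      if model == "" {
          model = p.defaultModel
      }
      sum := sha256.Sum256([]byte(text))
      return model + ":" + hex.EncodeToString(sum[:])
  }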

Why this matters: the staffing 500K test (memory project_golang_lakehouse)
documented that the staffing co-pilot replays many of the same query
texts ("forklift driver IL", "welder Chicago", "warehouse safety", etc).
Each repeat paid the ~50ms Ollama round-trip. Cached repeats now serve
in <1µs (LRU lookup + sha256 of input).

Memory budget: ~3 KiB per entry at d=768. Default 10K entries ≈ 30 MiB.
Configurable via [embedd].cache_size; 0 disables (pass-through mode).

Per-text caching, not per-batch — a batch with mixed hits and misses
only fetches the misses upstream, then merges the results while
preserving the caller's input order. A three-text batch with one miss
makes one upstream call for that single text instead of three.
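
Continuing the sketch above, and assuming a simplified
Provider.Embed(ctx, model, texts) that returns one vector per input
(the real interface also carries the resolved model), the merge path
looks roughly like:

  func (p *CachedProvider) Embed(ctx context.Context, model string,
      texts []string) ([][]float32, error) {
      out := make([][]float32, len(texts))
      var missIdx []int      // caller positions still unfilled
      var missTexts []string // texts to fetch upstream, in order
      for i, t := range texts {
          if v, ok := p.cache.Get(p.cacheKey(model, t)); ok {
              p.hits.Add(1)
              out[i] = v
              continue
          }
          p.misses.Add(1)
          missIdx = append(missIdx, i)
          missTexts = append(missTexts, t)
      }
      if len(missTexts) == 0 {
          return out, nil // all hits: synthesized, no upstream call
      }
      vecs, err := p.upstream.Embed(ctx, model, missTexts)
      if err != nil {
          return nil, err // propagate; nothing partial is cached
      }
      for j, i := range missIdx {
          p.cache.Add(p.cacheKey(model, texts[i]), vecs[j])
          out[i] = vecs[j]
      }
      return out, nil
  }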

Implementation:
  internal/embed/cached.go (NEW, 150 LoC)
    CachedProvider implements Provider; uses hashicorp/golang-lru/v2.
    Key shape: "<model>:<sha256-hex>". Empty model resolves to
    defaultModel (request-derived) for the key — NOT res.Model
    (upstream-derived), so future requests with the same inputs
    hit the same key. Caught by TestCachedProvider_EmptyModelResolvesToDefault.
    Atomic hit/miss counters + Stats() + HitRate() + Len().
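
    The read side of those counters is small; sketched here (method
    bodies are assumptions consistent with the names above):

      func (p *CachedProvider) Stats() (hits, misses uint64) {
          return p.hits.Load(), p.misses.Load()
      }

      // HitRate returns hits/(hits+misses), or 0 before any traffic.
      func (p *CachedProvider) HitRate() float64 {
          h, m := p.hits.Load(), p.misses.Load()
          if h+m == 0 {
              return 0
          }
          return float64(h) / float64(h+m)
      }

      func (p *CachedProvider) Len() int { return p.cache.Len() }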

  internal/embed/cached_test.go (NEW, 12 test funcs)
    Pass-through-when-zero, hit-on-repeat, mixed-batch only fetches
    misses, model-key isolation, empty-model resolves to default,
    LRU eviction at cap, error propagation, all-hits synthesized
    without upstream call, hit-rate accumulation, empty-texts
    rejected, concurrent-safe (50 goroutines × 100 calls), key
    stability + distinctness.

  internal/shared/config.go
    EmbeddConfig.CacheSize (toml: cache_size). Default 10000.
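
    As it would appear in the TOML config (key name from the field
    tag; the value shown is the default):

      [embedd]
      cache_size = 10000  # LRU entries; 0 disables the cache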

  cmd/embedd/main.go
    Wraps Ollama Provider with CachedProvider on startup. Adds
    /embed/stats endpoint exposing hits / misses / hit_rate / size.
    Operators check the rate to confirm the cache is working (a high
    rate means repeats are being served locally) or mis-sized (a low
    rate plus many misses on a workload that should have repeats).
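
    A sketch of the endpoint (handler wiring and variable names are
    assumptions; the JSON field names follow the text above):

      mux.HandleFunc("/embed/stats", func(w http.ResponseWriter, r *http.Request) {
          hits, misses := cached.Stats()
          w.Header().Set("Content-Type", "application/json")
          _ = json.NewEncoder(w).Encode(map[string]any{
              "hits":     hits,
              "misses":   misses,
              "hit_rate": cached.HitRate(),
              "size":     cached.Len(),
          })
      })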

  cmd/embedd/main_test.go
    Stats endpoint tests — disabled mode shape, enabled mode tracks
    hits + misses across repeat calls.

One real bug caught by my own test:
  The initial implementation cached under res.Model (upstream-resolved)
  rather than effectiveModel (request-resolved). A request with
  model="" was cached under "test-model" (Ollama's default), and a
  later request with model="the-default" (our config default) then
  missed the cache. Fix: always use the request-derived effectiveModel
  for keys; that's the predictable side. Locked by
  TestCachedProvider_EmptyModelResolvesToDefault.

Verified:
  go test -count=1 ./internal/embed/  — all 12 cached tests + 6 ollama tests green
  go test -count=1 ./cmd/embedd/      — stats endpoint tests green
  just verify                          — vet + test + 9 smokes in 33s

Production benefit:
  ~50ms Ollama round-trip → <1µs cache lookup for cached entries.
  At the 10K-entry default and ~30% repeat rate (typical staffing
  co-pilot workload), each avoided call saves ~50ms, which adds up to
  several seconds over a few-hundred-query staffer session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 06:54:30 -05:00
root · 9ee7fc5550 · G2: embedd — text → vector via Ollama · 2 scrum fixes
Bridges the missing piece for the staffing co-pilot: text inputs to
vectord-shaped vectors. Standalone cmd/embedd on :3216 fronted by
gateway at /v1/embed. Pluggable embed.Provider interface (G2 ships
Ollama; OpenAI/Voyage swap in via the same interface in G3+).

Wire format:
  POST /v1/embed {"texts":[...], "model":"..."}  // model optional
  → 200 {"model","dimension","vectors":[[...]]}

Default model: nomic-embed-text (768-d). Ollama returns float64;
provider converts to float32 at the boundary so vectors flow through
vectord/HNSW without re-conversion.
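
The narrowing itself is mechanical; a minimal sketch of the boundary
conversion (helper name is illustrative):

  // toFloat32 narrows Ollama's decoded float64 values once, so
  // downstream vectord/HNSW code never re-converts.
  func toFloat32(in []float64) []float32 {
      out := make([]float32, len(in))
      for i, v := range in {
          out[i] = float32(v)
      }
      return out
  }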

Acceptance smoke 5/5 PASS — including the architectural payoff:
end-to-end embed → vectord add → search by re-embedded text returns
recall=1 at distance 5.96e-8 (float32 precision noise on identical
unit vectors). The staffing co-pilot pipeline (text → vector →
similarity search) is now functional end-to-end.

All 9 smokes (D1-D6 + G1 + G1P + G2) PASS deterministically.

Cross-lineage scrum on shipped code:
  - Opus 4.7 (opencode):                    0 BLOCK + 4 WARN + 3 INFO
  - Kimi K2-0905 (openrouter):              0 BLOCK + 2 WARN + 1 INFO
  - Qwen3-coder (openrouter):               "No BLOCKs" (3 tokens)

Fixed (2 — 1 convergent + 1 single-reviewer):
  C1 (Opus + Kimi convergent WARN): the per-text 60s timeout meant an
    N-text batch could run for up to N×60s with no batch-level cap, so
    one stuck Ollama call could stall the whole handler for minutes.
    Fix: context.WithTimeout(r.Context(), 60s) wraps the entire batch
    (see the sketch after this list).
  O-W3 (Opus WARN): empty strings in texts went to Ollama unchecked,
    producing version-dependent garbage. Fix: reject "" with 400 at
    the handler boundary so callers get a deterministic answer
    instead of an upstream-conditional 502.
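
A sketch of both fixes at the handler boundary (the request type,
handler name, and Provider signature are assumptions for illustration):

  type embedRequest struct {
      Texts []string `json:"texts"`
      Model string   `json:"model"` // optional
  }

  func handleEmbed(p Provider) http.HandlerFunc {
      return func(w http.ResponseWriter, r *http.Request) {
          var req embedRequest
          if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
              http.Error(w, "bad request", http.StatusBadRequest)
              return
          }
          // O-W3: reject empty strings up front for a deterministic 400.
          for _, t := range req.Texts {
              if t == "" {
                  http.Error(w, "texts must be non-empty", http.StatusBadRequest)
                  return
              }
          }
          // C1: one 60s cap over the whole batch, not 60s per text.
          ctx, cancel := context.WithTimeout(r.Context(), 60*time.Second)
          defer cancel()
          vecs, err := p.Embed(ctx, req.Model, req.Texts)
          if err != nil {
              http.Error(w, "upstream embed failed", http.StatusBadGateway)
              return
          }
          w.Header().Set("Content-Type", "application/json")
          _ = json.NewEncoder(w).Encode(map[string]any{"vectors": vecs})
      }
  }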

Deferred (4): drainAndClose 64KiB cap (matches G0 pattern), no
concurrency limit on /embed (single-tenant G2), missing Accept
header (exotic-proxy concern), MaxBytesError string-match
redundancy (paranoia layer kept consistent across codebase).

Zero false positives this round — Qwen returned 3 tokens "No BLOCKs"
and the other two reviewers' findings were all real.

Setup confirmed: Ollama 0.21.0 on :11434 with nomic-embed-text loaded.
Per-text /api/embeddings used (forward-compat with 0.21+); newer
0.4+ /api/embed batch endpoint can swap in via the Provider interface.
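
For reference, the per-text call looks roughly like this (request and
response shapes follow Ollama's documented /api/embeddings API; the
helper name and surrounding types are assumptions):

  type ollamaEmbedReq struct {
      Model  string `json:"model"`
      Prompt string `json:"prompt"`
  }

  type ollamaEmbedRes struct {
      Embedding []float64 `json:"embedding"`
  }

  func embedOne(ctx context.Context, base, model, text string) ([]float64, error) {
      body, err := json.Marshal(ollamaEmbedReq{Model: model, Prompt: text})
      if err != nil {
          return nil, err
      }
      req, err := http.NewRequestWithContext(ctx, http.MethodPost,
          base+"/api/embeddings", bytes.NewReader(body))
      if err != nil {
          return nil, err
      }
      req.Header.Set("Content-Type", "application/json")
      res, err := http.DefaultClient.Do(req)
      if err != nil {
          return nil, err
      }
      defer res.Body.Close()
      if res.StatusCode != http.StatusOK {
          return nil, fmt.Errorf("ollama embeddings: %s", res.Status)
      }
      var out ollamaEmbedRes
      if err := json.NewDecoder(res.Body).Decode(&out); err != nil {
          return nil, err
      }
      return out.Embedding, nil
  }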

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:42:27 -05:00