3-lineage scrum (Opus 4.7 / Kimi K2.6 / Qwen3-coder) on today's wave landed 4 real findings (2 BLOCK + 2 WARN) and 2 INFO touch-ups. Verbatim verdicts + disposition table at: reports/scrum/_evidence/2026-04-30/

B-1 (BLOCK Opus + INFO Kimi convergent) — ResolveKey API: collapse from 3-arg (envVar, envFileName, envFilePath) to 2-arg (envVar, envFilePath); sketch below. Pre-fix, every chatd caller passed the env var name twice, so if an operator renamed *_key_env in lakehouse.toml while keeping the canonical KEY= line in the .env file, the fallback silently missed it.

B-2 (WARN Opus + WARN Kimi convergent) — handleProviders probe: drop the synthesize-then-Resolve probe and look providers up by name directly via Registry.Available(name); sketch below. The prior probe synthesized "<name>/probe" model strings and routed them through Resolve, which was fragile to any future routing rule (e.g. the cloud-suffix special case).

B-3 (BLOCK Opus single — verified by trace + end-to-end probe) — OllamaCloud.Chat StripPrefix used "cloud", but the registry routes "ollama_cloud/<m>", so upstream got the prefixed model name and 400'd; diff below. Smoke missed it because chatd_smoke runs without ollama_cloud registered. Now strips the right prefix, and the new TestOllamaCloud_StripsCorrectPrefix locks both the prefix and suffix cases. Verified live: ollama_cloud/deepseek-v3.2 round-trips cleanly through the real ollama.com endpoint.

B-4 (WARN Opus single) — Ollama finishReason: read the done_reason field instead of inferring from the done bool alone. Newer Ollama reports done=true with done_reason="length" on truncation; the prior code mapped that to "stop" and lost the truncation signal the playbook_lift judge needs to retry. The new TestFinishReasonFromOllama_PrefersDoneReason covers the fallback ladder.

INFOs:
- B-5: replace the hand-rolled insertion sort in Registry.Names with sort.Strings (Opus called the "avoid sort import" comment a false economy — correct).
- A-1: clarify the playbook_lift.sh comment around -judge "" arg passing (Opus noted the comment said "env priority" but didn't reflect that the empty arg also passes through the Go driver's resolution chain).

False positives dismissed (3, documented in disposition.md):
- Kimi: TestMaybeDowngrade_WithConfigList wrong assertion (the test IS correct per design — a model excluded from the weak list counts as strong, hence gets downgraded)
- Qwen: nil-deref claim (defensive code already handles nil)
- Opus: qwen3.5:latest doesn't exist on the Ollama hub (true for the public hub, but the local install has it)

just verify: PASS. chatd_smoke 6/6 PASS. New regression tests: 3 (B-2, B-3, B-4 each get a focused test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
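B-1 sketch: the collapsed call shape. The two-arg signature is straight from the finding; the body (process env first, then the same variable's KEY= line in the file at envFilePath) is assumed semantics, not the verbatim chatd code (os/strings/fmt imports elided):

	func ResolveKey(envVar, envFilePath string) (string, error) {
		if v := os.Getenv(envVar); v != "" {
			return v, nil // process env wins
		}
		data, err := os.ReadFile(envFilePath)
		if err != nil {
			return "", err
		}
		// Fallback: look for envVar's KEY=... line in the .env file.
		for _, line := range strings.Split(string(data), "\n") {
			if v, ok := strings.CutPrefix(strings.TrimSpace(line), envVar+"="); ok {
				return v, nil
			}
		}
		return "", fmt.Errorf("%s: not set and not found in %s", envVar, envFilePath)
	}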
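B-2 sketch: before/after of the handleProviders probe. Method names are from the finding; the surrounding handler and return shapes are assumed:

	// Before: fabricate a model string and round-trip it through routing
	_, err := registry.Resolve(name + "/probe")
	ok := err == nil
	// After: ask the registry for the provider by name
	ok := registry.Available(name)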
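B-3 diff: the one-line fix in OllamaCloud.Chat (both prefix strings verbatim from the finding):

	// Before: prefix never matched, so upstream saw "ollama_cloud/<m>" and 400'd
	model := StripPrefix(req.Model, "cloud")
	// After: matches how the registry actually routes cloud models
	model := StripPrefix(req.Model, "ollama_cloud")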
package chat

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// Ollama (local) provider — calls /api/chat on the local Ollama
// server. No auth needed; default URL http://localhost:11434.
//
// Bare model names route here by default (registry.defaultName=ollama),
// so "qwen3.5:latest" → ollama. Explicit "ollama/qwen3.5:latest" also
// works (prefix stripped).
type Ollama struct {
	baseURL    string
	httpClient *http.Client
}

// NewOllama returns a local Ollama provider. baseURL defaults to
// http://localhost:11434 when empty. timeout 0 → 180s.
func NewOllama(baseURL string, timeout time.Duration) *Ollama {
	if baseURL == "" {
		baseURL = "http://localhost:11434"
	}
	if timeout == 0 {
		timeout = 180 * time.Second
	}
	return &Ollama{
		baseURL:    strings.TrimRight(baseURL, "/"),
		httpClient: &http.Client{Timeout: timeout},
	}
}
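
// Typical construction; zero values select the defaults documented above:
//
//	_ = NewOllama("", 0) // → http://localhost:11434, 180s timeout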

func (o *Ollama) Name() string { return "ollama" }

// Available pings /api/tags. Cached negative result would be a
// premature optimization for G0 — Ollama is typically up. If down,
// next call gets ErrUpstream which is the right signal anyway.
func (o *Ollama) Available() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	req, _ := http.NewRequestWithContext(ctx, "GET", o.baseURL+"/api/tags", nil)
	resp, err := o.httpClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode/100 == 2
}

// Chat translates Request to Ollama's /api/chat shape and back.
// Strips the optional "ollama/" prefix from req.Model.
func (o *Ollama) Chat(ctx context.Context, req Request) (*Response, error) {
	model := StripPrefix(req.Model, "ollama")

	options := map[string]any{}
	// Pointer-valued temperature so "not set" (nil) doesn't overwrite
	// Ollama's default. Only forward when caller set it explicitly.
	if req.Temperature != nil {
		options["temperature"] = *req.Temperature
	}
	if req.MaxTokens > 0 {
		options["num_predict"] = req.MaxTokens
	}
	body := map[string]any{
		"model":    model,
		"messages": req.Messages,
		"stream":   false,
		// Local hot path: skip reasoning by default. qwen3 / qwen3.5 are
		// thinking-capable but the inner-loop use case wants direct
		// answers, not reasoning traces. Without this, low max_tokens
		// budgets get consumed by thinking before any content is
		// produced. Cloud tier (Ollama Cloud) inherits the same default
		// — see ollama_cloud.go.
		"think":   false,
		"options": options,
	}
	if req.Format == "json" {
		body["format"] = "json"
	}

	bs, _ := json.Marshal(body)
	httpReq, err := http.NewRequestWithContext(ctx, "POST", o.baseURL+"/api/chat", bytes.NewReader(bs))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/json")

	resp, err := o.httpClient.Do(httpReq)
	if err != nil {
		if errors.Is(ctx.Err(), context.DeadlineExceeded) {
			return nil, fmt.Errorf("%w: %s", ErrTimeout, "ollama")
		}
		return nil, fmt.Errorf("ollama: %w", err)
	}
	defer resp.Body.Close()

	rb, _ := io.ReadAll(resp.Body)
	if resp.StatusCode/100 != 2 {
		return nil, fmt.Errorf("%w: ollama %d: %s", ErrUpstream, resp.StatusCode, abbrev(string(rb), 200))
	}

	var ollamaResp struct {
		Model   string `json:"model"`
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
		Done            bool   `json:"done"`
		DoneReason      string `json:"done_reason"` // "stop", "length", ...
		PromptEvalCount int    `json:"prompt_eval_count"`
		EvalCount       int    `json:"eval_count"`
	}
	if err := json.Unmarshal(rb, &ollamaResp); err != nil {
		return nil, fmt.Errorf("ollama decode: %w (body=%s)", err, abbrev(string(rb), 200))
	}

	return &Response{
		Model:        model,
		Content:      ollamaResp.Message.Content,
		InputTokens:  ollamaResp.PromptEvalCount,
		OutputTokens: ollamaResp.EvalCount,
		FinishReason: finishReasonFromOllama(ollamaResp.Done, ollamaResp.DoneReason),
	}, nil
}
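
// Example call shape (sketch only; the Message field layout shown here
// is assumed, not necessarily this package's exact type):
//
//	temp := 0.0
//	resp, err := o.Chat(ctx, Request{
//		Model:       "ollama/qwen3.5:latest", // "ollama/" prefix optional
//		Messages:    []Message{{Role: "user", Content: "ping"}},
//		Temperature: &temp, // nil would keep Ollama's default
//		MaxTokens:   64,
//	})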

// finishReasonFromOllama prefers Ollama's done_reason when present
// (newer Ollama 0.4+ exposes this on /api/chat). Falls back to the
// done bool for older versions. Phase 4 scrum fix B-4 (Opus WARN):
// previous logic mapped `done==true` → "stop" unconditionally, which
// hid truncations that Ollama reports as `done=true, done_reason="length"`.
// The playbook_lift judge needs this signal to detect when max_tokens
// budget ran out before the answer completed.
func finishReasonFromOllama(done bool, doneReason string) string {
	if doneReason != "" {
		return doneReason
	}
	if done {
		return "stop"
	}
	return "length"
}
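
// The resulting fallback ladder (B-4 fix):
//
//	done_reason != ""             → returned verbatim ("length" preserved)
//	done_reason == "", done=true  → "stop"   (older Ollama)
//	done_reason == "", done=false → "length" (generation cut off early)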

// abbrev shortens long error bodies for log/error messages without
// pulling fmt's truncation flags everywhere.
func abbrev(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n] + "…"
}