diff --git a/docs/PRD.md b/docs/PRD.md
index 571c329..6e16dcb 100644
--- a/docs/PRD.md
+++ b/docs/PRD.md
@@ -39,7 +39,7 @@ is an explicit re-platforming, not a refactor.
 | HTTP gateway | Axum + Tokio | `net/http` + `chi` (or `gin`) | High — Go's bread and butter |
 | gRPC | tonic | `google.golang.org/grpc` | High — Go is the reference impl |
 | Object store | Apache Arrow `object_store` | `aws-sdk-go-v2/service/s3` + thin wrapper | High |
-| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v15/parquet` | Medium — arrow-go lags arrow-rs but covers our needs |
+| Parquet I/O | parquet-rs (arrow-rs) | `apache/arrow-go/v18/parquet` | Medium — arrow-go lags arrow-rs but v18 covers our needs |
 | Query engine | DataFusion | **Hard problem** (see §Hard problems) | Low — no like-for-like Go equivalent |
 | Vector index (HNSW) | `hora` / hand-rolled | `coder/hnsw` or `Bithack/go-hnsw` (in-process) | High — HNSW is a self-contained algorithm |
 | Vector backend (Lance) | `lance` (Rust) | **Hard problem** — likely dropped, Parquet-only | Medium |
@@ -76,7 +76,7 @@ binaries built from one workspace.
 | gRPC | `google.golang.org/grpc` | Reference implementation |
 | Protobuf | `protoc-gen-go` + `buf` | Standard tooling |
 | Object store | `aws-sdk-go-v2` | Mature, covers S3 + MinIO + RustFS |
-| Parquet | `apache/arrow-go/v15` | Columnar I/O + Arrow interop |
+| Parquet | `apache/arrow-go/v18` | Columnar I/O + Arrow interop (v18.5.2 — March 2026) |
 | SQL engine | **Open** — see §Hard problems §1 | Biggest open decision |
 | Vector index | `coder/hnsw` | Pure-Go HNSW, in-process, no external service |
 | TOML config | `pelletier/go-toml/v2` | Standard |
diff --git a/docs/RESEARCH_LOG_2026-04-28.md b/docs/RESEARCH_LOG_2026-04-28.md
new file mode 100644
index 0000000..0fe1c42
--- /dev/null
+++ b/docs/RESEARCH_LOG_2026-04-28.md
@@ -0,0 +1,133 @@
+# Research Log — 2026-04-28
+
+Survey of upstream Go ecosystem state before Phase G0 begins. Goal:
+verify the SPEC's library choices reflect the *currently maintained*
+upstream and surface anything that has shifted since the SPEC was
+drafted earlier today.
+
+## Sources used
+
+- Context7 — `resolve-library-id` for `go-duckdb` and `arrow-go`
+- WebSearch — six targeted queries on Go HNSW, MCP SDK, DuckDB
+  bindings, Anthropic SDK, HTTP frameworks, parquet maturity
+- WebFetch — direct GitHub READMEs for `coder/hnsw`,
+  `modelcontextprotocol/go-sdk`, `duckdb/duckdb-go`
+
+## Findings — three SPEC-changing pivots
+
+### 1. DuckDB Go binding ownership transferred
+
+The original SPEC named `marcboeker/go-duckdb` as the query-engine
+library. The package has been formally **deprecated** in favor of the
+official maintainer transfer.
+
+| | Before (SPEC v1) | After |
+|---|---|---|
+| Repo | `github.com/marcboeker/go-duckdb` | `github.com/duckdb/duckdb-go/v2` |
+| Maintainer | Marc Boeker (community) | DuckDB team + Marc Boeker (joint, official) |
+| Latest | (varies) | v2.10502.0 (April 2026) |
+| DuckDB engine | varies | v1.5.2 |
+| Migration | n/a | `gofmt -w -r '"github.com/marcboeker/go-duckdb/v2" -> "github.com/duckdb/duckdb-go/v2"' .` |
+
+**Why this matters:** the upstream migration happened at v2.5.0; using
+the old import means we'd be on a deprecated branch from day one.
+SPEC §3.1 now names the official path.
+
+### 2. Official Go MCP SDK exists (Google collaboration)
+
+The original SPEC named `mark3labs/mcp-go` (community implementation).
+An official SDK now exists.
+
+| | Before (SPEC v1) | After |
+|---|---|---|
+| Repo | `mark3labs/mcp-go` | `github.com/modelcontextprotocol/go-sdk` |
+| Maintainer | mark3labs (community) | MCP org + Google (official) |
+| Stars | (smaller) | 4.4k |
+| Version | n/a | v1.5.0 stable |
+| Spec compat | various | targets MCP 2025-11-25, backward-compat to 2024-11-05 |
+| OSSF Scorecard | n/a | yes |
+
+**Why this matters:** the MCP server (`cmd/mcp`) is one of the most
+visible Go binaries we'll ship — it's what AI agents talk to. Using
+the official SDK aligns with the MCP spec's evolution and gets us the
+Google-tested code path.
+
+### 3. arrow-go is on v18, not v15
+
+The original SPEC referenced `arrow-go/v15`. Apache Arrow Go has
+shipped through **v18.5.2** (March 2026).
+
+| | Before (SPEC v1) | After |
+|---|---|---|
+| Module path | `apache/arrow-go/v15` | `apache/arrow-go/v18` |
+| Latest | n/a | v18.5.2 |
+| Recent fixes | n/a | parquet decryption, large string handling, complex type read |
+
+**Why this matters:** v18 has parquet encryption fixes that are
+relevant for our PII-masked safe views (per Rust ADR-017's federation
+and the production cutover work). Skipping three major versions is
+unnecessary risk.
+
+## Findings — validations (SPEC unchanged)
+
+### `coder/hnsw` — keep
+
+220 stars, 45 commits, active CI, recent PRs/issues. Documented
+import speed of 796.85 MB/s. No deprecation signals. In-memory
+alternative to Pinecone/Weaviate, fits the no-external-vector-DB
+constraint.
+
+### `chi` for HTTP routing — keep
+
+Confirmed as the "clean architecture, stdlib `net/http`, zero deps"
+pick. Fiber is faster (36k req/s vs ~34k for chi/gin/echo) but uses
+fasthttp, which is off the standard library path — wrong fit for our
+"boring is good" Go ethos. Fiber stays as a documented alternative if
+hot-path performance ever proves chi insufficient.
+
+### `marcboeker/go-duckdb` → `duckdb/duckdb-go/v2`
+
+Already covered above as pivot #1.
+
+## New things worth noting (not SPEC-changing yet)
+
+### Anthropic Go SDK is official
+
+`github.com/anthropics/anthropic-sdk-go` is the official Anthropic Go
+client library. We currently route Claude calls through OpenCode/Zen,
+so this isn't on the Phase G0 critical path. **Worth knowing for**:
+direct Claude API calls in `aibridge` if we ever want to bypass
+opencode (e.g., for the overseer correction loop in dev mode without
+a Zen subscription).
+
+Added as a noted option in the SPEC `aibridge` row.
+
+### DuckDB-Wasm exists
+
+`github.com/duckdb/duckdb-wasm` brings DuckDB to the browser via
+WebAssembly with Arrow / Parquet / CSV / JSON support. **Not in scope**
+for Phase G0 but a future option if the UI ever needs client-side
+analytical queries against fetched parquet (offline analytics over a
+permit cache, etc.).
+
+### IVF as an HNSW alternative
+
+If during Phase G3 the HNSW recall validation gate (G3.2.C) shows
+problems, IVF (Inverted File Index) is the next-best alternative:
+faster index builds, lower memory, better filtered-search performance
+than HNSW. No first-class Go IVF library was found — would need to
+wrap FAISS via cgo or hand-roll. Documented as a fallback only.
+
+## What I checked but found nothing actionable
+
+- HTMX-specific Go libraries (`htmx-go`, `gotempl`, etc.) — none has
+  emerged as a clear standard. Sticking with `html/template` + raw
+  HTMX as the SPEC plans.
+- Langfuse Go client — OSS support varies. SPEC's "hand-roll if
+  needed" stays.
+
+## Outcome
+
+SPEC §1 + §2 + §3.1 updated to reflect the three pivots. PRD's locked
+stack table updated to match. No phase changes, no acceptance gate
+changes, no new hard problems.
diff --git a/docs/SPEC.md b/docs/SPEC.md
index ac69c3d..1d56020 100644
--- a/docs/SPEC.md
+++ b/docs/SPEC.md
@@ -23,15 +23,15 @@ Effort scale (one engineer-week = ~40h focused work):
 | Crate | Rust deps that mattered | Go target | Library | Effort | Risk |
 |---|---|---|---|---|---|
 | `gateway` | axum, tokio, tonic, tower | `cmd/gateway` | `chi` + stdlib `net/http` + `google.golang.org/grpc` | **L** | low — Go's strongest domain |
-| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `arrow-go/v15`, `mattn/go-sqlite3` | **L** | low |
+| `catalogd` | parquet-rs, arrow, sqlite | `cmd/catalogd` | `apache/arrow-go/v18`, `mattn/go-sqlite3` | **L** | low |
 | `storaged` | object_store, aws-sdk | `cmd/storaged` | `aws-sdk-go-v2`, `minio-go` for MinIO-specific paths | **M** | low |
-| `queryd` | datafusion, arrow | `cmd/queryd` | `marcboeker/go-duckdb` (cgo) | **HARD** | high — see §3 |
+| `queryd` | datafusion, arrow | `cmd/queryd` | **`duckdb/duckdb-go/v2`** (cgo, official) | **HARD** | high — see §3 |
 | `ingestd` | csv, json, lopdf, postgres | `cmd/ingestd` | stdlib `encoding/csv`, `encoding/json`, `pdfcpu/pdfcpu`, `jackc/pgx/v5` | **L** | low |
-| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `arrow-go/v15` | **L** | medium — re-validate HNSW recall |
+| `vectord` | hora, arrow, hnsw | `cmd/vectord` | `coder/hnsw`, `apache/arrow-go/v18` | **L** | medium — re-validate HNSW recall |
 | `vectord-lance` | lance | **DROPPED** | n/a | n/a | n/a — Parquet+HNSW only |
-| `journald` | parquet, arrow | `cmd/journald` | `arrow-go/v15` | **M** | low |
-| `aibridge` | reqwest | library | `net/http` + connection pool | **S** | low |
-| `validator` | parquet, custom | library | `arrow-go/v15` parquet reader | **M** | low — port the 24 unit tests as gates |
+| `journald` | parquet, arrow | `cmd/journald` | `apache/arrow-go/v18` | **M** | low |
+| `aibridge` | reqwest | library | `net/http` + connection pool · `anthropics/anthropic-sdk-go` available for direct Claude calls (currently routed via opencode) | **S** | low |
+| `validator` | parquet, custom | library | `apache/arrow-go/v18` parquet reader | **M** | low — port the 24 unit tests as gates |
 | `truth` | tomli, custom DSL | library | `pelletier/go-toml/v2` | **M** | low |
 | `proto` | tonic-build | `proto/` + `protoc-gen-go` | `buf` + `protoc-gen-go-grpc` | **S** | low |
 | `shared` | serde, anyhow | library | stdlib `encoding/json`, `errors` | **S** | low |
@@ -47,7 +47,7 @@ one engineer; 6–8 weeks for two).
 
 | TS surface | Current location | Go target | Library | Effort | Risk |
 |---|---|---|---|---|---|
-| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | `mark3labs/mcp-go` (Go MCP SDK) | **L** | medium — MCP semantics |
+| `mcp-server/index.ts` | Bun, :3700 | `cmd/mcp` | **`modelcontextprotocol/go-sdk`** (official Go SDK, v1.5.0, Google-collab) | **L** | medium — MCP semantics |
 | `mcp-server/observer.ts` | Bun, :3800 | `cmd/observer` | stdlib `net/http`, `slog` | **M** | low |
 | `mcp-server/tracing.ts` | Bun, Langfuse client | library | `go.opentelemetry.io/otel` + Langfuse Go client (or hand-roll) | **M** | low — Langfuse Go OSS support varies |
 | `auditor/*.ts` | TS, runs as systemd | `cmd/auditor` | stdlib + `gitea API client` | **L** | medium — auditor cross-lineage logic is intricate |
@@ -64,7 +64,13 @@ one engineer; 6–8 weeks for two).
 
 ### §3.1 — Query engine (DuckDB via cgo)
 
-**Library:** `marcboeker/go-duckdb` — Go bindings via cgo.
+**Library:** `github.com/duckdb/duckdb-go/v2` — official Go bindings via
+cgo. (Replaces the legacy `marcboeker/go-duckdb`, which was deprecated
+when the DuckDB team and Marc Boeker jointly relocated maintenance to
+the DuckDB org at v2.5.0. Migration is a one-line `gofmt -r` rewrite of
+import paths.) Current version v2.10502.0 (April 2026), DuckDB v1.5.2
+compat. Statically links default extensions: ICU, JSON, Parquet,
+Autocomplete.
 
 **API shape** (replaces the DataFusion `SessionContext` pattern):
 ```go