The Eval Gap

Standardization without measurement is agreeing on ritual, not efficacy

⚠️ What we're doing

✅ What we're missing

We can have perfect naming conventions and still ship tools that agents can't use.

Academic evidence

MCPToolBench++ (2025) — LLM context windows limit available tools per run; tool descriptions consume significant tokens

ScaleMCP (2025) — Static tool loading doesn't scale; proposes dynamic lazy loading

CaveAgent (2026) — Advocates metadata-driven skill activation over static tool definitions