The Eval Gap

Standardization without measurement is agreeing on ritual, not efficacy

⚠️ What we're doing
✅ What we're missing

We can have perfect naming conventions and still ship tools that agents can't use.

Academic evidence
MCPToolBench++ (2025) — LLM context windows limit available tools per run; tool descriptions consume significant tokens
ScaleMCP (2025) — Static tool loading doesn't scale; proposes dynamic lazy loading
CaveAgent (2026) — Advocates metadata-driven skill activation over static tool definitions