Engineering

Server-side tool search for a 62-tool MCP platform

What happens when the AI vendor's fix doesn't reach you

April 24, 2026

In April 2026, Anthropic shipped tool search in the Claude Agent SDK. It defers tool schema loading so agents can operate over catalogs of hundreds of tools without losing selection accuracy. NovoMCP crossed the threshold where that matters this year: the platform now exposes 62 tools to the AI assistants customers use, spanning target discovery, quantum chemistry, molecular dynamics, clinical outcome prediction, and materials science.

The SDK feature was designed for exactly this problem. It did not reach NovoMCP.

NovoMCP serves two surfaces beyond Claude.ai. One is NovoWorkbench, a desktop
application written in Rust with a custom HTTP router that speaks to Claude, GPT-5.2, Gemini, and Ollama on equal
footing. No SDK in the hot path. No Anthropic tool search. The other surface is Claude.ai itself. But there, remote MCP servers load all tool schemas upfront. The deferred-loading pattern only activates for SDK-based applications, not for servers reached over the protocol.

The problem the SDK solves exists everywhere. The SDK's solution reaches one slice of the problem.

So I built the same pattern on the server. This is what that looked like, what I learned, and why I think this is the correct posture for any enterprise MCP server over roughly 30 tools.


The architecture

The core insight of tool search is simple. If the LLM only needs a few tools per turn, do not send it all 62 every turn. Send the summary, let the agent request specific tool schemas on demand.

Implementing that in a protocol-compliant way takes about 100 lines of Python.

One embedding call at server startup. On container boot I concatenate each tool's name, description, parameter names, and enum values into a short text blob. All 62 blobs go to an embedding model in a single batched HTTP request. The returned vectors are truncated to 1536 dimensions, L2-normalized so cosine similarity becomes a plain dot product, and held in a numpy array at module scope. Total cost: one second, one network round-trip, no persistent storage.
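A minimal sketch of that startup build. The tool schemas and the deterministic stand-in for the batched embedding call are invented for illustration; the truncate-then-normalize pipeline is the part described above:

```python
import numpy as np

RAW_DIM, EMBED_DIM = 3072, 1536  # provider returns wide vectors; keep the first 1536

def tool_blob(tool: dict) -> str:
    """One short text blob per tool: name, description, parameter names, enum values."""
    parts = [tool["name"], tool.get("description", "")]
    for pname, spec in tool.get("parameters", {}).items():
        parts.append(pname)
        parts.extend(map(str, spec.get("enum", [])))
    return " ".join(parts)

def embed_batch(texts: list[str]) -> np.ndarray:
    # Deterministic stand-in for the single batched HTTP request (an assumption).
    rows = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        rows.append(rng.standard_normal(RAW_DIM))
    return np.stack(rows)

def build_index(tools: list[dict]) -> np.ndarray:
    vecs = embed_batch([tool_blob(t) for t in tools])[:, :EMBED_DIM]  # truncate
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # L2-normalize: cosine == dot
    return vecs  # held at module scope; 62 * 1536 floats is a few hundred kilobytes

# Hypothetical schemas standing in for the real catalog.
TOOLS = [
    {"name": "dock_ligand", "description": "Dock a ligand into a protein pocket",
     "parameters": {"method": {"enum": ["vina", "smina"]}}},
    {"name": "run_md", "description": "Run a short molecular dynamics trajectory",
     "parameters": {}},
]
TOOL_VECS = build_index(TOOLS)
```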

One embedding call per query. A new endpoint, POST /mcp/tool-search, takes a user query string. I embed the query, compute the dot product against all 62 tool vectors, return the top-K with similarity scores. Typical round trip: 25 milliseconds end to end.
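The handler behind that endpoint reduces to one embedding call and one matrix-vector product. Tool names and the embedding stub here are illustrative, not the production catalog:

```python
import numpy as np

DIM = 1536

def embed_one(text: str) -> np.ndarray:
    # Deterministic stand-in for the real embedding API call (an assumption).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

TOOL_NAMES = ["dock_ligand", "run_md", "predict_admet"]  # hypothetical subset
TOOL_VECS = np.stack([embed_one(n) for n in TOOL_NAMES])  # built once at startup

def tool_search(query: str, k: int = 10) -> list[dict]:
    """Embed the query, dot against all tool vectors, return top-k with scores."""
    q = embed_one(query)
    scores = TOOL_VECS @ q  # normalized vectors: cosine similarity is a dot product
    top = np.argsort(scores)[::-1][:k]
    return [{"tool": TOOL_NAMES[i], "score": round(float(scores[i]), 4)} for i in top]
```

Because both sides are unit vectors, a query identical to a tool's indexed text scores exactly 1.0, which makes the ranking easy to sanity-check.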

A core whitelist of eight tools that always surface, among them platform info, credit usage, funnel logging, the autonomous-mode trigger, and job polling. This ensures a caller can orient itself even when retrieval misses.
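The merge itself is a de-duplicated union with the core set pinned first. The names below are invented stand-ins for five of the eight always-on tools:

```python
CORE_TOOLS = [
    # Hypothetical names for the always-surfaced tools named in the text.
    "platform_info", "credit_usage", "log_funnel", "autonomous_mode", "poll_job",
]

def with_core(retrieved: list[str]) -> list[str]:
    """Core whitelist first, then retrieval hits, with duplicates dropped."""
    seen = set(CORE_TOOLS)
    merged = list(CORE_TOOLS)
    for name in retrieved:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```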

Template manifests for known workflows. When a caller names a prompt template (the discovery funnel, an OLED screening pipeline, an electrolyte stability screen), the endpoint skips retrieval and returns that template's full tool set. Templates encode their flow; encoding their tool set alongside is a small extension that prevents retrieval from missing a tool the workflow depends on.

A keyword-match fallback. If the embedding provider is unreachable at startup or during a query, the endpoint falls back to substring matching on tool names and descriptions. Not as good as embeddings, but functional. The endpoint stays up; a diagnostic field surfaces the embedding failure to callers.
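A sketch of that fallback, ranking tools by how many query terms appear as substrings of the name or description (tool entries are invented):

```python
TOOLS = [
    {"name": "dock_ligand", "description": "Dock a ligand into a protein pocket"},
    {"name": "run_md", "description": "Run a molecular dynamics trajectory"},
    {"name": "plot_nmr", "description": "Predict and plot an NMR spectrum"},
]

def keyword_fallback(query: str, tools: list[dict], k: int = 10) -> list[str]:
    """Substring match on names and descriptions when embeddings are unavailable."""
    terms = [t for t in query.lower().split() if len(t) > 2]  # skip trivial tokens
    scored = []
    for tool in tools:
        haystack = (tool["name"] + " " + tool.get("description", "")).lower()
        hits = sum(term in haystack for term in terms)
        if hits:
            scored.append((hits, tool["name"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most hits first, then name
    return [name for _, name in scored[:k]]
```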

That is the entire retrieval layer. For 62 tools, it is about 380 kilobytes of RAM.


Why in-memory, not a vector database

NovoMCP already runs managed vector infrastructure for two workloads. Literature search across millions of peer-reviewed papers: correct substrate, millions of vectors, cross-user persistence required. And the funnel memory index, which persists terminal summaries of past discovery runs across sessions and grows per-user over time: correct substrate, continuous growth, persistence required.

Tool search is neither of those. The catalog is small, static, and the same for every container replica. The codebase is the source of truth for tool descriptions. Nothing needs to persist. Nothing needs to survive a restart. Rebuilding 62 embeddings in one second is faster and simpler than any disk-persistence scheme.

The pattern is this. Vector-database infrastructure earns its keep at tens of thousands of items and up, where the cost of network round-trips to the index is amortized across selectivity wins. At 62 items, a numpy dot product runs in half a millisecond. A managed vector query, however fast the service is, adds 50 to 100 milliseconds of network round-trip to every LLM turn. For retrieval that runs per-message, that latency is visible.

The first lesson of the build: the right substrate depends on the corpus size, not the architecture's sophistication.


The silent-failure lesson

I shipped the endpoint. The first production probe of /mcp/tool-search/status returned:

{
  "ready": false,
  "size": 0,
  "built_at": null
}

The index had not built. The container had started, the route was registered, queries returned empty. No errors in logs. No exceptions raised. Nothing to investigate except the absence of success.

The cause was a credential-lookup failure during the background index build. The fix was straightforward. The lesson was harder: the diagnostic surface of a new component matters more than its happy-path code.

I now start new MCP components with observability, not with the feature. Build the status endpoint first. Surface the last error. Expose the configuration the component thinks it is using.


The latent-bug lesson

I built an evaluation set alongside the endpoint. Fifty prompts across six categories. The first real run against production returned 95.8 percent recall. The failures clustered on queries that should have surfaced a specific tool whose description was sound and whose index vector had been built correctly.

The cause was upstream of retrieval entirely. A comparison path in the visibility layer had an edge case that silently excluded certain tools. Name-based tool listings had never surfaced the gap. Retrieval asks a different question and fails visibly when the relevant thing is absent.

I fixed the edge case. Recall moved from 95.8 percent to 100 percent.

Latent gaps in discovery surface immediately under retrieval workloads. Any MCP server with tier-gated access should assume similar gaps exist and that a retrieval workload will find them.
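A minimal recall@K harness in the spirit of that evaluation set. The prompts, tool names, and token-overlap search stand-in are all invented; the real endpoint does embedding retrieval:

```python
def recall_at_k(eval_set, search_fn, k: int = 10) -> float:
    """eval_set: (prompt, expected_tool) pairs; fraction found in the top k."""
    hits = sum(expected in search_fn(prompt)[:k] for prompt, expected in eval_set)
    return hits / len(eval_set)

TOOL_NAMES = ["dock_ligand", "run_md", "plot_nmr", "literature_search"]

def toy_search(prompt: str) -> list[str]:
    # Token-overlap stand-in for the real retrieval endpoint (an assumption).
    return [n for n in TOOL_NAMES if any(tok in prompt for tok in n.split("_"))]

EVAL_SET = [
    ("dock this ligand into the pocket", "dock_ligand"),
    ("run md on the solvated complex", "run_md"),
    ("plot the predicted nmr spectrum", "plot_nmr"),
    ("find papers on OLED emitters", "literature_search"),  # a deliberate miss
]
```

A per-prompt failure list, not just the aggregate number, is what makes misses like the visibility-layer bug traceable.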


The numbers

62 tools indexed in 1.4 seconds at container startup. 380 kilobytes of memory. 25 milliseconds per query end to end. 100 percent recall at ten on a 50-prompt evaluation set: 48 expected tools, 48 found. Keyword fallback on embedding failure, diagnostic status endpoint, manual rebuild for operators. Zero new infrastructure.


For other MCP platforms

If you are building an MCP server and approaching 30 tools, consider:

  1. Do not wait for client-side tool search to arrive. Build server-side retrieval in an afternoon.
  2. Use in-memory retrieval until your catalog exceeds roughly ten thousand tools.
  3. Build your status endpoint and your eval set before your retrieval logic.
  4. Assume you have latent discovery gaps. Retrieval workloads will find them.
  5. Treat tool descriptions as versioned code. Write the eval set that catches their regressions.

Capability and capability-that-the-agent-can-find are different properties. The first is the work. The second is the infrastructure that makes the first visible. Both have to ship.


NovoMCP exposes 62 tools across drug discovery, quantum chemistry, molecular dynamics, and materials science. The tool-search endpoint is live at /mcp/tool-search on both ai.novomcp.com and compute.novomcp.com.