You’ve decided to use a community MCP server instead of building custom. Smart move for commodity integrations. But the decision doesn’t end there. Not all MCP servers are equal, and a bad one creates problems that a bad npm package never would. A buggy npm dependency breaks your build. A buggy MCP server gives an AI agent malformed instructions for interacting with your production CRM.
The stakes are higher because the blast radius is larger. MCP servers sit between AI agents and your business systems. They translate natural language intent into API calls that create records, modify data, and trigger workflows. An insecure or poorly built server doesn’t just fail; it fails in ways that can corrupt data, leak credentials, or give agents access to systems they shouldn’t touch.
Here’s a structured evaluation framework. Five dimensions. Specific checks for each. Red flags that should kill a candidate immediately.
Dimension 1: Provenance
Who built this server, and can you verify their identity?
What to check:
- Source verification. Is the source code available? Is the repository owned by an identifiable individual or organization? Servers published under anonymous accounts with no contribution history are unknowns. Servers from established organizations or maintainers with public track records are accountable.
- Development history. How long has the repository existed? A server created last week with a single commit is different from one with a year of development history, multiple contributors, and documented design decisions. History indicates stability and real-world usage.
- Supply chain integrity. Are releases tagged and reproducible? Can you build the server from source and verify it matches the published artifact? Supply chain attacks on MCP servers are an emerging risk; the server you install should be the server the maintainer published, not a tampered version.
Red flags: No source code available. Anonymous publisher with no history. Binary-only distribution. No tagged releases.
Scoring: For critical systems, require verifiable provenance: a known maintainer, public source, and reproducible builds. For low-risk integrations, a verified publisher on mpak.dev with a trust score is sufficient.
Dimension 2: Functionality
Does the server do what you actually need?
What to check:
- Capability coverage. List the specific operations your agents need. Compare against the server’s tool manifest. A GitHub server that supports file reads but not pull request operations is useless if your agents review PRs. Partial coverage means supplementing with custom code, and managing two servers instead of one.
- API completeness. For servers wrapping well-documented APIs (Slack, Salesforce, Stripe), what percentage of the API surface does the server expose? Full coverage is rare and often unnecessary. But if the server only wraps 3 of the 15 endpoints you need, the evaluation is over.
- Data fidelity. Does the server transform data accurately? Some servers simplify API responses, stripping fields, flattening nested structures, converting types. If your agents need the full response, simplified output is a problem. If they need a clean summary, simplification is a feature. Know which you need.
- Edge case handling. Pagination, rate limiting, timeout behavior, empty result sets, malformed input. These aren’t exotic scenarios; they happen in every production deployment. How the server handles them determines whether your agents work reliably or fail intermittently.
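These edge cases are easy to probe before adoption. A minimal sketch of what correct pagination and rate-limit handling looks like, assuming a hypothetical `list_page(cursor)` tool wrapper that returns `(items, next_cursor)` and raises on throttling (all names here are illustrative, not part of any real server's API):

```python
import time

class RateLimitError(Exception):
    """Raised by the hypothetical tool wrapper when the API throttles us."""

def fetch_all(list_page, max_retries=3, backoff=1.0):
    """Drain a paginated endpoint, retrying on rate limits.

    `list_page(cursor)` is a hypothetical tool call returning
    (items, next_cursor), where next_cursor is None on the last page.
    """
    items, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page, cursor = list_page(cursor)
                break
            except RateLimitError:
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError("rate-limit retries exhausted")
        items.extend(page)
        if cursor is None:
            return items
```

If a candidate server's source shows nothing resembling this (no cursor handling, no retry on 429s), expect intermittent failures under load.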
Red flags: No tool manifest or capability documentation. Claims “full API coverage” without specifics. No error codes or error message documentation. Undocumented data transformations.
Scoring: Weight functionality based on your specific use case. A server covering 80% of your needs is often sufficient; you can extend the remaining 20% with custom logic. Below 60% coverage, the integration overhead exceeds the value.
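The coverage check is mechanical enough to script. A sketch, using hypothetical tool names for a GitHub-style server (the manifest format and operation names are assumptions for illustration):

```python
def coverage(required, manifest_tools):
    """Return (fraction of required ops covered, sorted missing ops)."""
    required = set(required)
    covered = required & set(manifest_tools)
    return len(covered) / len(required), sorted(required - covered)

# Hypothetical GitHub server: supports reads but not PR review/merge.
needed = ["read_file", "list_prs", "review_pr", "merge_pr", "create_issue"]
manifest = ["read_file", "list_prs", "create_issue", "search_code"]
ratio, missing = coverage(needed, manifest)
# ratio == 0.6, missing == ['merge_pr', 'review_pr']: right at the
# threshold where integration overhead starts to exceed the value.
```

Listing `missing` explicitly tells you what custom code you would have to write to close the gap, which is the real cost of partial coverage.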
Dimension 3: Security
Can you trust this server with access to your systems?
This is the dimension that separates MCP server evaluation from general dependency evaluation. A compromised npm package might mine cryptocurrency. A compromised MCP server has direct access to your CRM, your databases, your internal APIs, whatever systems the server connects to.
What to check:
- Permission scope. What does the server request access to? The declared permissions should match the stated capabilities, nothing more. A Slack server needs network access to the Slack API. It does not need filesystem access, access to other APIs, or the ability to spawn child processes. Overly broad permissions are either lazy engineering or something worse.
- MTF trust score. The MCP Trust Framework provides automated security assessment for servers in the mpak.dev registry. An MTF score evaluates permission scope, dependency vulnerabilities, secret handling, and manifest quality. High trust levels compress the security evaluation: you still need to understand what the server does, but the automated scan catches the mechanical failures.
- Credential handling. How does the server manage authentication credentials? Hardcoded API keys in environment variables work for personal use. Production deployments need OAuth with token rotation, managed identity integration, or a secrets manager. If the server only supports static API keys, it’s not production-ready for multi-tenant or enterprise deployments.
- Input validation. Does the server validate inputs before passing them to the target API? Unvalidated inputs create injection risks, an agent could be prompted to include malicious payloads that the server passes through to the target system.
- Data handling. Does the server log request/response data? If so, where? Does it strip sensitive fields from logs? A server that logs full API responses containing customer PII is a compliance liability.
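The permission-scope check above reduces to a set difference: declared permissions minus expected permissions should be empty. A sketch, using hypothetical permission strings (real MCP manifests declare capabilities differently; the shape here is illustrative):

```python
# Hypothetical permission vocabulary; real manifests differ.
ALLOWED = {
    "slack-server": {"network:slack.com"},
}

def excess_permissions(server, declared):
    """Permissions declared beyond the server's expected scope."""
    return sorted(set(declared) - ALLOWED.get(server, set()))

declared = ["network:slack.com", "fs:read", "process:spawn"]
flags = excess_permissions("slack-server", declared)
# Non-empty flags mean lazy engineering or something worse:
# here ['fs:read', 'process:spawn'] should fail the evaluation.
```

The point is that the allowlist is yours, derived from what the server claims to do, not copied from what the server requests.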
Red flags: Requests filesystem or network access beyond the target API. No input validation on tool parameters. Logs full request bodies including credentials. Hardcoded secrets in source code. No documented security model.
Scoring: For servers touching sensitive systems (customer data, financial records, internal databases), require high MTF scores and manual code review. For servers connecting to public APIs with read-only access, the MTF scan is sufficient. Never deploy a server with zero security assessment to production.
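When reviewing a server's code for the input-validation check, this is the pattern to look for at each tool boundary. A minimal sketch for a hypothetical `post_message` tool (the parameter names and limits are assumptions, not any real server's schema):

```python
import re

# Constrain identifiers rather than passing agent output through verbatim.
SAFE_CHANNEL = re.compile(r"^[A-Za-z0-9_-]{1,80}$")

def validate_post_message(params):
    """Reject malformed tool inputs before they reach the target API."""
    channel = params.get("channel", "")
    text = params.get("text", "")
    if not SAFE_CHANNEL.match(channel):
        raise ValueError(f"invalid channel: {channel!r}")
    if len(text) > 4000:
        raise ValueError("message too long")
    return {"channel": channel, "text": text}
```

A server that forwards `params` straight to the target API with no layer like this fails the input-validation check.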
Dimension 4: Maintenance
Will this server still work in six months?
What to check:
- Last update. When was the most recent commit? The most recent release? MCP is evolving rapidly: protocol changes, SDK updates, new security requirements. A server last updated eight months ago may not support current protocol versions.
- Issue responsiveness. How fast does the maintainer respond to bug reports? Are issues triaged and closed, or do they accumulate? A growing backlog of unanswered issues signals abandonment. Consistent triage signals active maintenance.
- Release cadence. Regular releases (even minor ones) indicate ongoing attention. Long gaps between releases suggest the maintainer has moved on. Look for a pattern: monthly patches, quarterly feature updates, or at minimum timely security fixes.
- Contributor diversity. A server maintained by a single person is a bus factor of one. If they stop maintaining it, you’re forking or rebuilding. A server with multiple contributors distributes the maintenance burden and is more likely to survive individual departures.
- CI/CD pipeline. Automated tests, linting, type checking, and release automation indicate engineering maturity. A server with no CI pipeline ships untested code directly to users.
Red flags: No commits in 6+ months. More than 20 open issues with no maintainer response. Single contributor who hasn’t been active recently. No automated tests.
Scoring: For critical integrations, require active maintenance: commits within the last 30 days, responsive maintainers, and automated CI. For stable, low-change integrations (connecting to APIs that rarely update), a well-built server with less frequent updates is acceptable.
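The staleness thresholds above can be encoded directly, so every evaluation applies the same cutoffs. A sketch (the tier names and the idea of feeding it a last-commit date are assumptions; where that date comes from, e.g. a repository API, is up to you):

```python
from datetime import date

def maintenance_risk(last_commit, today, critical=False):
    """Classify staleness by days since the last commit.

    Thresholds mirror the guidance above: critical integrations want
    activity within ~30 days; 6+ months idle is a red flag in any tier.
    """
    days = (today - last_commit).days
    if days > 180:
        return "red-flag"
    if critical and days > 30:
        return "stale-for-critical"
    return "ok"

maintenance_risk(date(2025, 1, 1), date(2025, 10, 1))         # "red-flag"
maintenance_risk(date(2025, 9, 15), date(2025, 10, 1), True)  # "ok"
```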
Dimension 5: Documentation
Can your team understand, deploy, and troubleshoot this server without reading every line of code?
What to check:
- Installation guide. Clear, tested steps from zero to running. Not “clone the repo and figure it out.” Specific prerequisites, environment setup, configuration steps, and a verification command that confirms the server is working.
- Configuration reference. Every environment variable, every config option, documented with descriptions, types, defaults, and examples. Configuration is where most deployment failures happen; incomplete docs here mean wasted hours.
- Tool descriptions. Each tool the server exposes should have a clear description of what it does, what parameters it accepts, what it returns, and what errors it throws. These descriptions aren’t just for humans; AI agents read them to determine how to use the tools.
- Known limitations. What doesn’t the server do? What edge cases are unhandled? What API features are not supported? Honest documentation of limitations saves more time than exhaustive documentation of features.
- Troubleshooting guidance. Common errors, their causes, and their solutions. Connection failures, auth errors, rate limiting, timeout issues. If you can troubleshoot the top 5 failure modes from the docs, the documentation is sufficient.
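What a well-documented tool entry looks like in practice: the sketch below shows the shape to expect (the field layout is illustrative only, not the exact MCP tool schema, and `search_contacts` is a hypothetical tool):

```python
# Illustrative shape only, not the exact MCP tool schema.
tool = {
    "name": "search_contacts",
    "description": (
        "Search CRM contacts by name or email. Returns at most "
        "`limit` matches; raises NOT_FOUND when nothing matches."
    ),
    "parameters": {
        "query": {"type": "string", "description": "Name or email fragment"},
        "limit": {"type": "integer", "default": 20, "maximum": 100},
    },
    "errors": ["NOT_FOUND", "RATE_LIMITED", "AUTH_EXPIRED"],
}
```

Every element here earns its keep: the description tells an agent when to call the tool, the defaults and maximums tell it how, and the error list tells it what recovery to attempt.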
Red flags: README that says “see code for usage.” No configuration reference. No tool descriptions. No documented limitations or known issues.
Scoring: Weight documentation quality higher for servers that will be deployed by teams, not just the engineer who evaluated it. A server that one person can deploy but nobody else can troubleshoot is a single point of failure.
The Weighted Evaluation
Not every dimension carries equal weight. The sensitivity of the connected system determines the weighting.
| Dimension | High-Risk Systems | Standard Systems | Low-Risk / Read-Only |
|---|---|---|---|
| Provenance | Critical | Important | Nice to have |
| Functionality | Must meet requirements | Must meet requirements | Must meet requirements |
| Security | Critical (manual review required) | Important (MTF score sufficient) | Check basic permissions |
| Maintenance | Active maintenance required | Recent activity required | Stable is acceptable |
| Documentation | Full coverage required | Installation + config required | README sufficient |
- High-risk systems: Customer databases, financial records, PII, internal APIs, regulated systems.
- Standard systems: CRMs, project management, productivity tools, standard SaaS platforms.
- Low-risk / read-only: Public APIs, weather, reference data, read-only integrations.
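If you want numbers rather than a qualitative table, the weighting can be made explicit. A sketch with illustrative weights per tier (the weights are assumptions, not a standard; functionality remains a hard gate regardless of its weight):

```python
# Illustrative weights per risk tier (assumptions, not a standard);
# each row sums to 1.0.
WEIGHTS = {
    "high-risk": {"provenance": 0.30, "functionality": 0.20,
                  "security": 0.30, "maintenance": 0.10, "documentation": 0.10},
    "standard":  {"provenance": 0.15, "functionality": 0.35,
                  "security": 0.25, "maintenance": 0.15, "documentation": 0.10},
    "low-risk":  {"provenance": 0.05, "functionality": 0.50,
                  "security": 0.20, "maintenance": 0.15, "documentation": 0.10},
}

def weighted_score(scores, tier):
    """Combine per-dimension scores (0.0-1.0) under a tier's weights."""
    weights = WEIGHTS[tier]
    return sum(weights[dim] * scores.get(dim, 0.0) for dim in weights)
```

Tune the weights to your environment once, then apply them consistently so two engineers evaluating the same server reach the same verdict.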
The mpak Shortcut
Running the full five-dimension evaluation on every server candidate is thorough but expensive. mpak.dev compresses the process.
Every server in the mpak registry gets an automated MTF assessment covering security posture, manifest quality, and operational characteristics. Trust levels provide a composite score across the security, maintenance, and documentation dimensions. The automated scan handles the mechanical checks (permissions analysis, dependency scanning, secret detection, manifest validation) so your team can focus manual evaluation time on functionality fit and provenance verification.
The workflow: search mpak.dev for servers matching your requirements. Filter by trust level to eliminate low-quality candidates. Evaluate the top candidates against your specific functionality requirements. For critical systems, add manual code review on top of the automated score. For standard systems, the trust level and a functionality check are sufficient.
This is Business-as-Code applied to tooling decisions. You don’t evaluate from scratch every time. You build a structured evaluation process (the five dimensions, the weighting matrix, the trust score shortcut) and apply it consistently. The process becomes a team asset. New engineers evaluate servers the same way senior engineers do. The quality bar doesn’t depend on individual judgment.
For the definitive technical checklist to run on any server before production deployment, see the MCP Server Quality Checklist: 15 Things to Verify. For deciding whether to evaluate at all vs. building custom, see When to Build Custom MCP Servers.
Frequently Asked Questions
Where do I find MCP servers to evaluate?
mpak.dev is the primary registry with security-scanned servers and trust scores. GitHub is the wild west: servers exist, but without consistent quality signals. Vendor marketplaces (Anthropic, OpenAI) have curated selections. Start with mpak.dev for the trust scoring, then expand to GitHub for niche use cases.
What's the single most important thing to check?
Permission scope. What does the server request access to, and does it match what it claims to do? A Slack server that requests filesystem access is a red flag. A CRM server that requests network access beyond the CRM API is suspicious. The declared permissions should match the stated capabilities, nothing more.
Should I review the source code of every MCP server?
For servers accessing sensitive systems (CRM, financial data, customer records), yes. For lower-risk servers (weather, public APIs, development tools), trust the MTF score and automated scanning. The risk level of the connected system determines the review depth.