Methodology

How MCPowered scans + scores MCP servers

Every claim on every scan report deep-links to a rule documented here. The page is long because the methodology has to be inspectable. If a finding doesn't trace back to a documented rule, we're doing it wrong.

Our posture: we describe, we do not certify

MCPowered surfaces observations about MCP servers. We do not certify, vouch for, or warrant any server's behavior. This is the foundational distinction; every other rule on this page derives from it.

A scan finding zero issues is the absence of evidence of malice. It is not evidence of integrity. The two are different and we keep them different in every report. A scanned-clean server may still be compromised by:

  • future code changes that haven't been scanned
  • server-controlled dynamic responses that differ at runtime from what static analysis observed
  • supply-chain compromise of dependencies of dependencies that we couldn't follow transitively
  • the publisher acting in bad faith in a way no static signal would catch
  • an attack vector that didn't exist at the time of our scan

The reader is the evaluator. Our reports inform; they do not replace judgment.

Vocabulary discipline

These are the words MCPowered uses (and does not use) when talking about any server. The discipline is consistent across every surface: reports, methodology page, marketing copy, social, footer text, schema markup.

Term | What it means | Where it's allowed
Scanned clean | Absence of flagged patterns in our static checks | Per-server reports where 0 rules fired. Not used in marketing copy.
No findings | Neutral statement of fact at scan time | Report body. Not used anywhere implying safety in general.
Verified publisher | Cryptographic signature on specific publisher-attested claims | Reports on publishers in the verified-publisher tier, with a link to the underlying signed statement. Standalone "verified" is never used as a server descriptor.
Audited | Examined and certified by a named third party (not MCPowered) | Citing a specific third-party audit with link. Never used to refer to MCPowered's own work.

The words trusted, safe, secure, approved, and high-quality are not used as descriptions of any server, anywhere on MCPowered. Recommended is not used for individual servers (it is fine as a label on suggested mitigation actions inside a scan finding). The post-build voice gate enforces the lock; any drift fails the deploy.
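A gate of this kind can be approximated with a whole-word search over built pages. The sketch below is illustrative only and is not the production gate: the function name find_voice_violations and the matching rules are assumptions. It flags every occurrence without knowing whether the word describes a server, so a real gate would also need an allowlist for places that legitimately mention the words (this section, for one). "Recommended" is left out of the list because it is allowed on mitigation labels.

```python
import re

# Words the vocabulary rules above bar as server descriptors.
BANNED_DESCRIPTORS = ["trusted", "safe", "secure", "approved", "high-quality"]

# Whole-word, case-insensitive match, so "safety" or "insecure" do not trip the gate.
_BANNED_RE = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(word) for word in BANNED_DESCRIPTORS)
    + r")(?![\w-])",
    re.IGNORECASE,
)


def find_voice_violations(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_word) for every banned descriptor in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _BANNED_RE.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits
```

A deploy step could fail whenever the returned list is non-empty for any page outside the allowlist.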

What the scanner checks (v0)

Five static checks plus three publisher signal groups. Each runs against the source code or metadata of a published MCP server.

Check 1 · Tool description scan (hardcoded)

What it looks for. Regex + AST patterns against tool description strings in source code:

  • Imperative language in descriptions ("execute the following," "before X, read Y," "include the contents of")
  • File-path tokens in descriptions (~/.ssh/, ~/.aws/credentials, /etc/passwd)
  • Network-call-shaped strings (curl , wget , https://attacker-domain)
  • Token-extraction patterns (process.env., import os; os.environ)

What it catches. Tool Poisoning. The attack pattern where a malicious server hides natural-language instructions inside a tool description that the LLM treats as authoritative context at session start. See the Tool Poisoning explainer.

What it misses. Server-generated descriptions: if the tools/list handler computes descriptions at runtime, the source-code scan sees no hardcoded payload, but the runtime payload can still be malicious. We capture descriptions at scan time via the live tools/list response when available; we cannot catch session-by-session swaps. Adversarial obfuscation also defeats this check: a publisher who reads this methodology can write descriptions that evade the regex patterns. We document the limit; we do not pretend to have closed it.

Severity mapping.

  • Matched patterns in low-confidence positions (description body without context): informational
  • Matched patterns in high-confidence positions (filepath token + imperative language in the same description): warning
  • Multiple high-confidence patterns + the description maps to a high-side-effect tool name: critical
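As a rough illustration of how the pattern families and the severity mapping above could fit together, here is a minimal sketch. It is not the scanner's rule set: the regexes are simplified stand-ins for the examples listed, the HIGH_SIDE_EFFECT_HINTS list is hypothetical, and reading "multiple high-confidence patterns" as "imperative + file path + at least one more family" is an assumption.

```python
import re

# Simplified stand-ins for the four pattern families listed above.
PATTERNS = {
    "imperative": re.compile(
        r"execute the following|before\b.{1,40}?,\s*read\b|include the contents of",
        re.IGNORECASE,
    ),
    "file_path": re.compile(r"~/\.ssh/|~/\.aws/credentials|/etc/passwd"),
    "network": re.compile(r"\b(?:curl|wget)\s|https?://", re.IGNORECASE),
    "env_token": re.compile(r"process\.env\.|os\.environ"),
}

# Hypothetical hints that a tool name implies side effects (assumption).
HIGH_SIDE_EFFECT_HINTS = ("exec", "delete", "write", "send")


def scan_description(tool_name: str, description: str):
    """Return (severity or None, sorted list of matched pattern families)."""
    hits = {name for name, rx in PATTERNS.items() if rx.search(description)}
    if not hits:
        return None, []
    # High-confidence position: file-path token + imperative language together.
    high_confidence = {"imperative", "file_path"} <= hits
    risky_name = any(hint in tool_name.lower() for hint in HIGH_SIDE_EFFECT_HINTS)
    if high_confidence and len(hits) >= 3 and risky_name:
        return "critical", sorted(hits)
    if high_confidence:
        return "warning", sorted(hits)
    return "informational", sorted(hits)
```

A benign description that merely contains a URL lands at informational here, which shows why low-confidence matches are reported as observations rather than verdicts.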

Check 2 · Tool output literal scan (hardcoded)

What it looks for. The same pattern library as Check 1, applied to return-value literals and templates in tool implementations.

What it catches. Return Value Injection. The attack pattern where payloads hidden in tool outputs feed back into the LLM context as authoritative input. A Docker label read by docker_inspect is the canonical worked example.

What it misses. Computed outputs. A tool that fetches data from an external API and returns it cannot be checked statically: the response could carry injected instructions on any later call.

Check 3 · Dependency CVE check

What it looks for. Declared dependencies (from package.json, pyproject.toml, Cargo.toml, etc.) cross-referenced against:

  • GitHub Advisory Database
  • npm audit
  • PyPI advisory data (pypa/advisory-database)
  • Crates.io advisory data

What it catches. Known CVEs in direct dependencies with public CVSS scores.

What it misses. Transitive dependencies: v0 checks direct dependencies only, so a vulnerability deeper in the dependency graph goes unreported unless the package manager surfaces it in its standard advisory channels. Zero-days that don't have CVE numbers yet are also missed.

Severity mapping.

  • CVSS 0.0-3.9: informational (low-severity, often non-exploitable in the MCP context)
  • CVSS 4.0-6.9: warning
  • CVSS 7.0-10.0: critical
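The mapping can be written as a small function. This is a sketch, treating 4.0 as the start of the warning band and 7.0 as the start of the critical band; the function name is hypothetical.

```python
def cvss_to_severity(score: float) -> str:
    """Map a CVSS base score (0.0-10.0) to a report severity label."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score >= 7.0:
        return "critical"
    if score >= 4.0:
        return "warning"
    return "informational"
```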

Check 4 · Permission audit (subprocess, filesystem, network)

What it looks for. Static analysis of subprocess calls, filesystem access, and network requests in the server source code. Categorized into: read-only filesystem access; write filesystem access; subprocess execution (any spawn call); network outbound (any fetch/http/socket call); shell execution (exec, Runtime.exec, etc.).

What it catches. A server that requests permissions disproportionate to its stated purpose. A "calendar reader" that has subprocess execution + network outbound is suspicious.

What it doesn't conclude. This check produces a fact surface. It does not adjudicate. The methodology reports what permissions are requested; the reader judges whether they fit the server's stated purpose.

Severity mapping.

  • Permissions in alignment with stated purpose: informational
  • Permissions exceed plausible purpose: warning
  • Shell exec + network outbound + filesystem write: critical (the trifecta that enables exfiltration via shell)
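A minimal sketch of the severity mapping above, under one loud assumption: the set of permissions that fit the stated purpose (expected) is supplied by whoever reads the server's description, because the check itself does not adjudicate fit. The enum and function names are hypothetical.

```python
from enum import Enum


class Permission(Enum):
    FS_READ = "read-only filesystem access"
    FS_WRITE = "write filesystem access"
    SUBPROCESS = "subprocess execution"
    NETWORK_OUT = "network outbound"
    SHELL_EXEC = "shell execution"


# The combination called out above as enabling exfiltration via shell.
EXFIL_TRIFECTA = frozenset(
    {Permission.SHELL_EXEC, Permission.NETWORK_OUT, Permission.FS_WRITE}
)


def permission_severity(observed: set, expected: set) -> str:
    """Classify observed permissions against the caller-supplied expected set."""
    if EXFIL_TRIFECTA <= observed:
        return "critical"
    if observed - expected:
        # At least one permission goes beyond what the stated purpose needs.
        return "warning"
    return "informational"
```

For the "calendar reader" example, an expected set of {Permission.NETWORK_OUT} and an observed set that also contains Permission.SUBPROCESS yields a warning.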

Check 5 · Tool description diff (cross-session, runs on re-scan only)

What it looks for. Diff between the current tools/list response and the prior scan's snapshot.

What it catches. Rug Pulls. The attack pattern where a server returns a benign description at install review and silently swaps in a malicious one later.

When it runs. Only on scheduled re-scans (cadence listed under Freshness commitments). On first scan there is no prior snapshot to diff against; the diff signal becomes available from the second scan onward.

Severity mapping.

  • Description changed; new description doesn't match any Check 1 patterns: informational (could be a legitimate update)
  • Description changed AND new description matches Check 1 patterns: warning
  • Description changed AND new description matches Check 1 critical patterns: critical
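The three cases can be read in order of precedence. The sketch below assumes the new description has already been run through Check 1 and that its result is passed in as check1_severity (None when nothing matched); the function name is hypothetical.

```python
from typing import Optional


def rescan_diff_severity(
    prior: str, current: str, check1_severity: Optional[str]
) -> Optional[str]:
    """Severity of a description change between two scans, or None if unchanged."""
    if prior == current:
        return None  # nothing to report
    if check1_severity is None:
        return "informational"  # changed, but no Check 1 pattern matched
    if check1_severity == "critical":
        return "critical"
    return "warning"  # changed and matched a non-critical Check 1 pattern
```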

Publisher signals (read from registry metadata; not source-code analysis)

Three signal groups that surface publisher facts. None contribute to a safety evaluation; they inform the publisher, maintenance, and popularity axes only.

Account + commit signals (GitHub API):

  • Account age (years)
  • Public repository count
  • Commits in the past 12 months
  • Contributor count
  • Signed-commit ratio
  • 2FA enabled on account (where the API exposes it)

Package signals (npm / PyPI / etc.):

  • Package age (months since first publish)
  • Version count
  • Publish cadence (commits per month, releases per month)
  • Most recent publish recency

Cross-server publisher signals (computed across the publisher's MCP catalog):

  • Number of MCP servers published by the same account
  • Aggregate scan scores across their catalog
  • Cross-catalog publish pattern (regular cadence vs sporadic burst)

How findings combine into axis scores

Four axes, each 0-100, conceptually independent.

Code axis

Combines Check 1 + Check 2 + Check 3 + Check 4 + Check 5.

  • Each critical-severity finding: -25 points (capped at -50 from any single check)
  • Each warning-severity finding: -8 points
  • Each informational finding: -1 point
  • Floor: 0. Ceiling: 100 (no findings of any severity).

A code axis score reads as: how many points of penalty did this server's source code accumulate from our static checks. It does not read as: how trustworthy is this server.

v0 partial-coverage rule. Until all four required static checks (1-4) are running against every server, the code axis is left empty rather than published as a number. A score of 100/100 computed from one check would misrepresent what we've actually inspected. While the scanner is incomplete, the per-server page shows the per-check status (which checks ran, what they found) instead of a composite axis. The axis lights up server-by-server as the missing checks ship.
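The penalty arithmetic and the partial-coverage rule can be sketched together. This is an illustration under stated assumptions, not the production scorer: findings are modeled as (check number, severity) pairs, the -50 cap is applied to critical penalties per check as the list above words it, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Optional

PENALTY = {"critical": 25, "warning": 8, "informational": 1}
CRITICAL_CAP_PER_CHECK = 50
REQUIRED_CHECKS = frozenset({1, 2, 3, 4})


def code_axis_score(
    findings: Iterable[tuple[int, str]], checks_run: set[int]
) -> Optional[int]:
    """Return the 0-100 code axis, or None while required checks are missing."""
    if not REQUIRED_CHECKS <= set(checks_run):
        return None  # partial coverage: leave the axis empty
    critical_by_check = defaultdict(int)
    other_penalty = 0
    for check_id, severity in findings:
        if severity == "critical":
            critical_by_check[check_id] += PENALTY["critical"]
        else:
            other_penalty += PENALTY[severity]
    critical_penalty = sum(
        min(points, CRITICAL_CAP_PER_CHECK) for points in critical_by_check.values()
    )
    return max(0, 100 - critical_penalty - other_penalty)
```

Three critical findings from Check 1, one warning from Check 3, and two informational findings from Check 4 give 100 - 50 - 8 - 2 = 40.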

Publisher axis

Combines the account + commit signals.

  • Account age > 2 years: +20 points (a flat bonus; it does not grow beyond +20 for older accounts)
  • Signed-commit ratio > 50%: +15
  • 2FA enabled: +10
  • Account has > 5 public repos: +5
  • Baseline: 50 points. Positive signals add; no negative subtractions.

A publisher axis score reads as: how much positive evidence of established account presence does this publisher show. It does not read as: is this publisher trustworthy.
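The additive rule is small enough to show directly. In this sketch the parameter names are hypothetical, and an unknown 2FA state (None, where the API does not expose it) is assumed to add nothing.

```python
from typing import Optional


def publisher_axis_score(
    account_age_years: float,
    signed_commit_ratio: float,
    two_factor_enabled: Optional[bool],
    public_repos: int,
) -> int:
    """Baseline 50 plus positive signals; nothing is ever subtracted."""
    score = 50
    if account_age_years > 2:
        score += 20
    if signed_commit_ratio > 0.5:
        score += 15
    if two_factor_enabled:  # None (not exposed) is treated as not observed
        score += 10
    if public_repos > 5:
        score += 5
    return score
```

All four signals present gives 100; none present leaves the baseline of 50.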

Maintenance axis

Combines package age + commit recency + version cadence + CVE response time (if observed).

  • Commit in past 90 days: +30
  • Average commits per month > 1: +20
  • Most recent publish < 6 months ago: +15
  • Observed CVE patch within 14 days of CVE publication (if applicable): +20
  • Baseline: 0. Positive signals additive. If no CVEs observed historically, the patch-response-time signal is N/A and the remaining weights redistribute.

A maintenance axis score reads as: how actively is this server being maintained. Maintenance correlates weakly with security (an active maintainer is more likely to patch fast); it does not correlate with publisher integrity.
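This page does not spell out how the remaining weights redistribute when the CVE signal is N/A. One plausible reading, assumed here and not confirmed by the methodology, is proportional rescaling: the applicable signals are scaled up so that together they carry the same total weight as the full set. The signal keys and function name are hypothetical.

```python
from typing import Optional

WEIGHTS = {
    "commit_in_past_90_days": 30,
    "avg_commits_per_month_gt_1": 20,
    "publish_within_6_months": 15,
    "cve_patched_within_14_days": 20,
}


def maintenance_axis_score(signals: dict[str, Optional[bool]]) -> float:
    """Sum earned weights; signals set to None (N/A) are rescaled away."""
    applicable = {
        name: weight
        for name, weight in WEIGHTS.items()
        if signals.get(name) is not None
    }
    if not applicable:
        return 0.0
    # Assumption: redistribute proportionally so the applicable weights
    # add up to the same total as the full weight table.
    scale = sum(WEIGHTS.values()) / sum(applicable.values())
    earned = sum(weight for name, weight in applicable.items() if signals[name])
    return round(earned * scale, 1)
```

Under this reading, a server with no CVE history that meets the other three signals scores the same as one that meets all four.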

Popularity axis

Combines install volume + GitHub stars + contributor count.

  • Install volume (where available): logarithmically scaled. 10 installs = 10 points, 1,000 = 50, 100K = 90.
  • GitHub stars: logarithmically scaled. 100 stars = 30, 10K = 80.
  • Contributor count > 5: +10

A popularity axis score reads as: how widely adopted is this server. Popularity correlates weakly with maintenance; it does not correlate at all with the safety of installing the server. We restate this on every per-server report.
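The anchor points quoted above are consistent with straight lines in log10: 20 points per tenfold increase in installs and 25 per tenfold increase in stars. The sketch below fits those lines to the stated points; clamping to 0-100 and the behavior between and beyond the anchors are assumptions, and how the three components combine into one axis is not specified here, so it is not attempted.

```python
import math


def install_points(installs: int) -> float:
    """Fit to the stated anchors: 10 -> 10, 1,000 -> 50, 100K -> 90."""
    if installs < 1:
        return 0.0
    return min(100.0, max(0.0, 20 * math.log10(installs) - 10))


def star_points(stars: int) -> float:
    """Fit to the stated anchors: 100 -> 30, 10K -> 80."""
    if stars < 1:
        return 0.0
    return min(100.0, max(0.0, 25 * math.log10(stars) - 20))
```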

What every per-server report contains

Header. Server name + version + primary transport (stdio / HTTPS / both). Publisher name + link. The 4-axis profile chart with relative-percentile color coding within the catalog. A "scanned N days ago" timestamp with visual decay (full color when under 30 days, dimmed 30-90, marked "stale" beyond 90). A descriptive "N findings worth attention" count (not an adjudication).

Specific observations section. For each rule that fired: rule name (for example tool_description_instruction_injection), source location (file + line, where applicable), severity (informational / warning / critical), the literal text or pattern matched, a suggested mitigation action ("review the description text for legitimate context"; "update the dependency to version N.M.O"), and a link to the methodology section that describes the rule.

What we couldn't check section. For each irreducible unknown relevant to this server: a description of the limit (for example: "The server generates tool descriptions at runtime; static analysis cannot observe runtime-only descriptions") and the evidence-grounded rationale.

Disclaimer + methodology link. The standard per-report disclaimer (see below), a link to this methodology page, and a link to the raw scan output for readers who want to dig deeper.
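One entry in the specific observations section could be modeled as a small record. The field names below are hypothetical and chosen only to mirror the items listed for each fired rule; the file path and anchor in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITIES = ("informational", "warning", "critical")


@dataclass(frozen=True)
class Observation:
    rule_name: str
    severity: str
    matched_text: str
    mitigation: str
    methodology_anchor: str  # which section of this page documents the rule
    source_file: Optional[str] = None  # location is optional ("where applicable")
    source_line: Optional[int] = None

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


example = Observation(
    rule_name="tool_description_instruction_injection",
    severity="warning",
    matched_text="before X, read Y",
    mitigation="review the description text for legitimate context",
    methodology_anchor="check-1",  # placeholder anchor
    source_file="src/tools.py",  # placeholder path
    source_line=42,
)
```

Making the methodology anchor a required field reflects the rule that every finding traces back to a documented rule.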

Freshness commitments

  • Top 100 servers by install volume: scheduled re-scan every 14 days
  • Next 400 in catalog: every 30 days
  • Tail of catalog: every 90 days
  • User-requested re-scan: always free, rate-limited to 1 per server per IP per 24 hours
  • Every per-server page shows the last-scanned timestamp + the visual decay state
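The re-scan tiers and the visual decay states reduce to two lookups. This sketch assumes "next 400" means ranks 101-500 by the same install-volume ordering, and that a scan exactly 90 days old is still "dimmed"; the function names are hypothetical.

```python
def rescan_interval_days(rank_by_installs: int) -> int:
    """Scheduled re-scan cadence for a server at a given catalog rank (1 = top)."""
    if rank_by_installs < 1:
        raise ValueError("rank starts at 1")
    if rank_by_installs <= 100:
        return 14
    if rank_by_installs <= 500:
        return 30
    return 90


def decay_state(days_since_scan: int) -> str:
    """Visual decay state shown next to the last-scanned timestamp."""
    if days_since_scan < 30:
        return "full"
    if days_since_scan <= 90:
        return "dimmed"
    return "stale"
```

With these numbers a tail-of-catalog server on a 90-day cadence never reaches "stale" as long as its re-scan runs on schedule.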

Environment assumption

Default-case scoring assumes the server runs over stdio with full host process privileges. This is the worst case for most servers in a typical local installation. Servers that support HTTPS-only deployment have a lower default exposure, but the score is still computed against the stdio worst case for catalog comparability. A v1 environment-modifier toggle will let readers surface only the transport-relevant subset of findings.

What we explicitly do NOT cover at v0

(Documented as out-of-scope so reader expectations are calibrated.)

  • Cross-Server Shadowing detection. Requires multi-server runtime simulation. v2+ scope.
  • Dynamic Tool Poisoning detection. Server-generated descriptions can only be observed at runtime. Partial coverage via Check 5 diff on re-scans, but session-level swaps are unobservable. v2+ adds runtime sandboxing.
  • Author intent classification. Unknowable from outside.
  • Future code changes. Unknowable until they happen.
  • Deep transitive dependency vulnerabilities. v0 stops at direct dependencies. v1 extends to one level deeper.
  • Runtime sandbox behavior analysis. No execution of server code on our infrastructure at v0. v2+ scope.
  • Cryptographic verification of publisher claims. Verified-publisher tier is post-v0.

Appeals + corrections

Publishers can dispute factual errors in scan output (a rule that misfired), misattribution in incident pages, or stale data the publisher has remediated. Any party named in an incident page also has standing to appeal.

  • How to submit: email methodology@mcpowered.com with the URL of the disputed claim, the specific dispute, and any supporting evidence. (Contact-form mechanism lands with the appeals MVP.)
  • Acknowledgment: within 5 business days.
  • Resolution: within 30 calendar days. Outcome is one of: correction with an audit-log entry; explanation of why the methodology stands; for incident pages, addition of the publisher's response inline.
  • What we don't do: retract factually accurate findings because the publisher disliked the result. If a rule fired, the rule fired, and the methodology page explains the rule. Disagreement with the methodology is a separate (legitimate) conversation, but it doesn't trigger retroactive deletion of the scan record.

Per-report disclaimer (verbatim)

Scan results describe what our static checks found and didn't find at the time of scan. They are not a recommendation, certification, or guarantee. A scanned-clean result is the absence of evidence of malice, not the presence of evidence of integrity. Servers can be compromised after scan. You are responsible for evaluating whether to install any MCP server. See our methodology for what we check, what we can't check, and the limits of static analysis.

MCPowered

This block appears below every per-server report, is linked from every "Scan a server" CTA, and is quoted verbatim in the Terms of Service. The text is identical everywhere; any drift across surfaces is catchable via grep.

Version history

  • 2026-05-27 (v0): initial methodology. Checks 1-4 run on first scan; Check 5 enables on the first scheduled re-scan cycle. Verified-publisher tier and runtime sandbox observations are out of scope at v0.

Subsequent revisions ship with both a version bump and a changelog entry citing which methodology decisions the change touches.