Methodology v1 Launch

AgentCrush Labs · Findings · 2026-05-16

What the agent economy looks like across 5+ signals

On May 16, 2026, AgentCrush shipped four category-specific scoring methodologies. Different agent categories leave different evidence trails, so we measure them differently. This page collects the most striking findings from that launch — the cases where multi-signal scoring produced a different answer than any single source would have given.

By the numbers

Total tracked: 1,338 agents indexed
Evidence-ranked: 25 agents across 4 categories
Categories live: 4 (model_family · tokenized · service · developer)
MCP tools: 7 (machine-readable methodology)

Finding #1 — headline

Single-source rankings invert under multi-signal scoring

The HuggingFace leader isn't the LMArena leader. The LMArena leader isn't the citation leader. The citation leader isn't the deployment leader. Each signal answers a different question — and when we combine them with documented weights, the resulting ranking is different from any of them taken alone.

HuggingFace #1: Qwen (score 100)
LMArena #1: Gemini (BT 1484)
Derivatives #1: Qwen (1,046)
Citations #1: Llama (51,449)
Deployments #1: Gemini (144)

Composite #1 (the agent with the highest weighted combination of all 5 signals): Qwen, at score 85.
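The published weights live in the methodology; as a rough illustration of how five normalized signals fold into one composite number, here is a minimal Python sketch. The weight values and signal keys are placeholders, not AgentCrush's documented ones.

```python
# Illustrative multi-signal composite. The signal names mirror the five
# signals above; the weights are placeholders, not the published values.
WEIGHTS = {
    "huggingface": 0.25,   # download / like footprint
    "lmarena": 0.20,       # head-to-head Bradley-Terry rating
    "derivatives": 0.20,   # fine-tunes and quantizations of the family
    "citations": 0.20,     # paper citations
    "deployment": 0.15,    # cross-protocol deployments
}

def composite(signals: dict[str, float]) -> float:
    """Combine per-signal scores (each normalized to 0-100) into one 0-100 composite."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

# An agent that tops one signal but is thin elsewhere can still lose the
# composite to a more balanced one -- which is the inversion this finding describes.
print(composite({"huggingface": 100, "lmarena": 60, "derivatives": 90,
                 "citations": 70, "deployment": 55}))
```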

Finding #2 — the admission test

Hermes admitted at #5, ranks last — by design

NousResearch Hermes is a beloved community model. It would top a vibe-based ranking. Our methodology admitted it — and ranked it last among model families.

The rule is: 3 of 5 signals must be present, AND at least one must be a capability signal (derivatives, LMArena, citations, or cross-protocol deployment). For weeks Hermes had only 2 signals (HuggingFace + a thin derivatives footprint) — not evidence-ready. When we added paper citations and deployment scanning to the methodology, Hermes earned its third signal:

HuggingFace: 69
LMArena: no coverage
Derivatives: 33
Citations: 24
Deployment: 27

Composite: 34. Hermes earned admission to the ranking via citations + deployment, but its raw footprint (HF downloads, no LMArena coverage, modest derivatives) keeps it at the back. This is the methodology working as designed: admit on evidence, rank on weight. No manual override.
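The admission rule itself is mechanical enough to write down. A minimal sketch, assuming signals are stored as a name-to-value map (the field names here are illustrative, not the production schema):

```python
# Admission rule from the text: at least 3 of the 5 signals present,
# and at least one of them a capability signal.
CAPABILITY_SIGNALS = {"derivatives", "lmarena", "citations", "deployment"}
ALL_SIGNALS = CAPABILITY_SIGNALS | {"huggingface"}

def is_evidence_ready(signals: dict[str, float]) -> bool:
    """signals maps signal name -> score; absent signals are simply missing keys."""
    present = set(signals) & ALL_SIGNALS
    return len(present) >= 3 and bool(present & CAPABILITY_SIGNALS)

# Hermes before citation and deployment scanning: 2 signals, not admitted.
print(is_evidence_ready({"huggingface": 69, "derivatives": 33}))   # False
# Hermes after: citations and deployment push it past the threshold.
print(is_evidence_ready({"huggingface": 69, "derivatives": 33,
                         "citations": 24, "deployment": 27}))      # True
```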

Finding #3 — the honeypot test

Market cap alone is not a ranking

$TIBBIR has the largest USD market cap in the tokenized index ($108.2M). It does not rank #1.

AIXBT does, at composite 83. Why: the methodology weights on-chain liquidity (anti-honeypot), capital locked in token contracts (TVL = real commitment), and holder distribution. A high market cap with thin liquidity gets penalized, not rewarded. AgentCrush surfaced one Virtuals token at $380M market cap with $5K liquidity — exactly the pattern we built the methodology to demote.

Agent           | Market cap | Liquidity | TVL     | Score
AIXBT $AIXBT    | $33.1M     | $1,190K   | $588K   | 83
Ribbita $TIBBIR | $108.2M    | $2,274K   | $1,118K | 71
G.A.M.E $GAME   | $8.1M      | $2,976K   | $1,463K | 67
Luna $LUNA      | $6.2M      | $1,891K   | $947K   | 65
Ethy AI $ETHY   | $2.2M      | $329K     | $162K   | 65

Finding #4 — forks beat stars

Active engagement beats passive interest

For service agents (callable endpoints — A2A protocol, Agentverse, x402, ERC-8004), we use forks as a stronger adoption signal than stars. Anyone can star a repo. Forking means you're going to use or modify it.

A2A leads at composite 77 on 23,798 stars and 2,400 forks, a roughly one-in-ten fork-to-star ratio that signals real-world deployment, not just bookmarking.
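A minimal sketch of the forks-over-stars weighting. The 10x fork multiplier and the log scaling are assumptions for illustration, not the published formula.

```python
import math

def adoption_signal(stars: int, forks: int, fork_weight: float = 10.0) -> float:
    """Log-scaled adoption signal in which each fork counts as `fork_weight` stars."""
    return math.log1p(stars + fork_weight * forks)

# A2A's figures from the text: with a 10x fork weight, its 2,400 forks
# contribute about as much to the signal as its 23,798 stars.
print(adoption_signal(23_798, 2_400))
```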

Why this matters

Read the full methodology: every weight, every formula, every limitation is published.
