AI Benchmark Leaderboard

Every score, every source.

Source-cited benchmark data for frontier AI models. Every number links to its primary or credible-secondary source. No estimates, no unverified figures.

Last updated: 2026-04-18
Models tracked: 12
Benchmarks tracked: 8
Benchmark columns: SWE-bench Verified, SWE-bench Pro, MCP-Atlas, OSWorld-Verified, BrowseComp, GPQA Diamond, Finance Agent v1.1, MMMLU.

Model            | Provider    | Input $/M | Output $/M
Claude Opus 4.7  | Anthropic   | $5.00     | $25.00
Claude Opus 4.6  | Anthropic   | $5.00     | $25.00
GPT-5.4          | OpenAI      | $2.50     | $15.00
Gemini 3.1 Pro   | Google      | $2.00     | $12.00
Grok 4           | xAI         | $3.00     | $15.00
Grok 4 Fast      | xAI         | $0.20     | $0.50
Grok 4.20        | xAI         | $2.00     | $6.00
DeepSeek V3.2    | DeepSeek    | $0.26     | $0.42
Qwen 3.5 397B    | Alibaba     | $0.39     | $2.34
Llama 4 Maverick | Meta        | $0.15     | $0.60
Kimi K2 Thinking | Moonshot AI | $0.60     | $2.50
GLM 5            | Z.ai        | $0.72     | $2.30
Legend:
Verified (sourced to a primary or credible-secondary source)
Self-reported (provider marketing, not independently replicated)
Empty cells mean the score is undisclosed or not yet verified. We do not estimate.
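
The Input $/M and Output $/M columns are per-million-token rates. As a rough illustration, here is a minimal TypeScript sketch of turning those rates into a per-request cost; the function name and token counts are hypothetical, and the rates are simply taken from the Claude Opus 4.7 row above.

```ts
// Hypothetical helper: per-request cost from per-million-token rates.
function requestCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPerMillionUSD: number,
  outputPerMillionUSD: number,
): number {
  return (
    (inputTokens / 1_000_000) * inputPerMillionUSD +
    (outputTokens / 1_000_000) * outputPerMillionUSD
  );
}

// Illustrative call: 12,000 input and 2,000 output tokens at $5.00 / $25.00 per million
// => 0.012 * 5.00 + 0.002 * 25.00 = $0.11
console.log(requestCostUSD(12_000, 2_000, 5.0, 25.0).toFixed(2)); // "0.11"
```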

Editorial policy

Every score must cite a primary-source or credible-secondary URL. Self-reported benchmarks are flagged as verified=false. When a score is disputed or its source is missing, the cell is left empty rather than filled with an estimate.
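
A minimal sketch of the record shape this policy implies, assuming a TypeScript data model; the interface, field names, and helper below are our own illustration, not the site's actual schema.

```ts
// Hypothetical shape of one leaderboard cell under the editorial policy.
interface BenchmarkScore {
  model: string;       // e.g. "Claude Opus 4.7"
  benchmark: string;   // e.g. "SWE-bench Verified"
  score: number;       // metric exactly as reported by the cited source
  sourceUrl: string;   // primary or credible-secondary citation (required)
  verified: boolean;   // false when the number is only self-reported
}

// The "no estimates" rule: publish a cell only when it carries both a score
// and a citation URL; anything else stays an empty cell.
function publishable(cell: Partial<BenchmarkScore>): boolean {
  return (
    typeof cell.score === "number" &&
    typeof cell.sourceUrl === "string" &&
    cell.sourceUrl.length > 0
  );
}
```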

Related deep dives

Full comparison posts built on this data.