AI Benchmark Leaderboard
Every score, every source.
Source-cited benchmark data for frontier AI models. Every number links to its primary or credible-secondary source. No estimates, no unsourced figures.
Last updated: 2026-04-18
Models tracked: 12
Benchmarks tracked: 8
| Model | Provider | Input $/M | Output $/M | SWE-bench Verified | SWE-bench Pro | MCP-Atlas | OSWorld-Verified | BrowseComp | GPQA Diamond | Finance Agent v1.1 | MMMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | ||||||||
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | ||||||||
| GPT-5.4 | OpenAI | $2.50 | $15.00 | — | — | ||||||
| Gemini 3.1 Pro | Google | $2.00 | $12.00 | — |||||||
| Grok 4 | xAI | $3.00 | $15.00 | — | — | — | — | — | — | — | — |
| Grok 4 Fast | xAI | $0.20 | $0.50 | — | — | — | — | — | — | — | — |
| Grok 4.20 | xAI | $2.00 | $6.00 | — | — | — | — | — | — | — | — |
| DeepSeek V3.2 | DeepSeek | $0.26 | $0.42 | — | — | — | — | — | — | — | — |
| Qwen 3.5 397B | Alibaba | $0.39 | $2.34 | — | — | — | — | — | — | — | — |
| Llama 4 Maverick | Meta | $0.15 | $0.60 | — | — | — | — | — | — | — | — |
| Kimi K2 Thinking | Moonshot AI | $0.60 | $2.50 | — | — | — | — | — | — | — | — |
| GLM 5 | Z.ai | $0.72 | $2.30 | — | — | — | — | — | — | — | — |
Verified: sourced to a primary or credible-secondary source.
Self-reported: provider marketing, not independently replicated.
Empty cells and dashes (—) mean the score is undisclosed or not yet verified. We do not estimate.
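The Input $/M and Output $/M columns price tokens per million. A minimal sketch of the per-request cost arithmetic, using the Claude Opus 4.7 rates from the table above (`request_cost` is a hypothetical helper, not part of any provider's SDK):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    return (input_tokens / 1_000_000) * input_per_m \
         + (output_tokens / 1_000_000) * output_per_m

# Claude Opus 4.7 at $5.00 input / $25.00 output per million tokens,
# for a request with 20,000 input tokens and 2,000 output tokens:
cost = request_cost(20_000, 2_000, 5.00, 25.00)
print(f"${cost:.2f}")  # → $0.15
```

The same arithmetic applies to every row; only the two per-million rates change.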
Editorial policy
Every score must cite a primary-source or credible-secondary URL. Self-reported benchmarks are flagged `verified=false`. When a score is disputed or the source is missing, the cell is omitted rather than filled with an estimate.
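The policy above implies a per-score record with a required citation and a verification flag. One possible shape, sketched as a Python dataclass (the field and function names here are illustrative assumptions, not the site's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkScore:
    model: str
    benchmark: str
    score: float
    source_url: str   # primary or credible-secondary citation; required
    verified: bool    # False when the score is provider self-reported

def render_cell(entry: Optional[BenchmarkScore]) -> str:
    """Render one table cell: a dash when no sourced score exists,
    and an explicit self-reported marker when verified is False."""
    if entry is None:
        return "—"  # undisclosed or not yet verified; never estimated
    suffix = "" if entry.verified else " (self-reported)"
    return f"{entry.score:.1f}{suffix}"
```

Keeping `source_url` a required field makes the "every number links to a source" rule structural: a score without a citation simply cannot be represented, so the cell stays a dash.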
Related deep dives
Full comparison posts built on this data.