Three models. Three different bets. No overlap in where they win. Claude Opus 4.7 leads every coding and agentic benchmark (64.3% SWE-bench Pro, 77.3% MCP-Atlas, 78.0% OSWorld). GPT-5.4 dominates web research at 89.3% BrowseComp, ten points ahead of Opus 4.7. Gemini 3.1 Pro costs 60% less than Opus 4.7 at $2 input versus $5. On graduate-level reasoning (GPQA Diamond) they're identical to within 0.2 points. Pick the model that matches the task. Don't pick the brand.
- Claude Opus 4.7 leads SWE-bench Pro at 64.3% vs 57.7% for GPT-5.4 and 54.2% for Gemini 3.1 Pro (per Vellum's Opus 4.7 benchmark breakdown).
- GPT-5.4 leads web research with 89.3% on BrowseComp vs 85.9% for Gemini 3.1 Pro and 79.3% for Opus 4.7.
- All three are statistically tied on GPQA Diamond: Opus 4.7 at 94.2%, GPT-5.4 at 94.4%, Gemini 3.1 Pro at 94.3%.
- Gemini 3.1 Pro is the cheapest: $2 input / $12 output per 1M tokens (under 200k context). GPT-5.4 is $2.50/$15. Opus 4.7 is $5/$25.
- Opus 4.7 leads on MCP-Atlas (tool orchestration) at 77.3% vs 73.9% for Gemini 3.1 Pro and 68.1% for GPT-5.4.
- Opus 4.7 keeps the same $5/$25 pricing as Opus 4.6 but introduces a new 'xhigh' effort level and task budgets in public beta.
- GPT-5.4 context window is roughly 1.05M tokens with 128k max output; past 272k tokens, input pricing doubles.
- Opus 4.7 tripled image-input resolution to 2,576 pixels on the long edge (~3.75MP), the first Claude with genuine high-res vision.
Three frontier labs, three different bets. Anthropic bet on coding and agents and charges a premium for it. Google bet on price and undercut Opus 4.7 by 60% with Gemini 3.1 Pro. OpenAI bet on web research and landed it.
I pulled the verified benchmark numbers from Anthropic's Opus 4.7 announcement, Vellum's Opus 4.7 benchmark breakdown, and the official pricing pages for each provider. Exact numbers only, no marketing language. For the earlier generation see our GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro comparison.
The Benchmark Showdown
Verified numbers only, all from primary or credible secondary sources.
The benchmark picture is cleaner than it's been in a while. Each model actually wins its chosen battleground. None of the three is pretending to be first at everything.
Every number below is also in our live benchmark leaderboard, where you can click any cell to see the primary source.
Verified benchmark comparison
| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 87.6% | not disclosed | 80.6% |
| SWE-bench Pro | 64.3% | 57.7% | 54.2% |
| MCP-Atlas (tool use) | 77.3% | 68.1% | 73.9% |
| OSWorld-Verified | 78.0% | 75.0% | not disclosed |
| BrowseComp (research) | 79.3% | 89.3% | 85.9% |
| GPQA Diamond | 94.2% | 94.4% | 94.3% |
| Finance Agent v1.1 | 64.4% | 61.5% | 59.7% |
| MMMLU | 91.5% | not disclosed | 92.6% |
The coding verdict is now decisive
Opus 4.7 doesn't just win SWE-bench Pro. It wins by 6.6 points over GPT-5.4 and 10.1 points over Gemini 3.1 Pro. That's a wider gap than Opus 4.6 ever had. If you're shipping code, the Opus premium is now actually earned.
The picture changes completely on BrowseComp. Opus 4.7 scored 79.3%, down 4.4 points from Opus 4.6's 83.7%. GPT-5.4 sits at 89.3%. If your workflow involves research across the web, Opus 4.7 is now the wrong tool. That's the honest reading.
GPQA Diamond is effectively saturated. Opus 4.7 at 94.2%, GPT-5.4 at 94.4%, Gemini 3.1 Pro at 94.3%. The 0.2-point spread is within run-to-run variance. Don't pick a model based on GPQA anymore.
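To see why a 0.2-point spread is noise: GPQA Diamond is a 198-question set (the widely cited figure for the Diamond split; confirm against the eval harness you use), so a single question is worth about half a point. A quick sanity check:

```python
# GPQA Diamond has 198 questions, so one question is worth ~0.51 points.
# A 0.2-point spread is less than a single answer flipping.
n_questions = 198
points_per_question = 100 / n_questions
print(f"{points_per_question:.2f} points per question")  # 0.51
print(94.4 - 94.2 < points_per_question)                 # True: within noise
```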
Pricing: Where Gemini Quietly Wins
All three providers list per-million-token rates, and Google's pricing is structured to punish Anthropic directly. Here's the breakdown for the flagship tier of each.
Per-million-token pricing (standard tier)
| Model | Input (short) | Output (short) | Input (long) | Output (long) |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $5.00 | $25.00 |
| GPT-5.4 | $2.50 | $15.00 | $5.00 | $22.50 |
| Gemini 3.1 Pro | $2.00 | $12.00 | $4.00 | $18.00 |
Short-context thresholds differ by provider. GPT-5.4 doubles its input price past 272k tokens. Gemini 3.1 Pro does the same past 200k. Opus 4.7 has no step-up: $5/$25 flat. If you're running enormous prompts regularly, the gap narrows sharply: past the thresholds, GPT-5.4's input rate matches Opus 4.7's flat $5 exactly, and Gemini's input discount shrinks from 60% to 20%.
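A minimal sketch of the step-up logic, using the rates and thresholds from the table above. The model keys are illustrative labels, not official API identifiers:

```python
def input_cost_usd(model: str, prompt_tokens: int) -> float:
    """Input cost for one prompt, using the rates in the table above."""
    # model: (short rate $/1M, long rate $/1M, step-up threshold in tokens)
    rates = {
        "opus-4.7":       (5.00, 5.00, None),     # flat, no step-up
        "gpt-5.4":        (2.50, 5.00, 272_000),
        "gemini-3.1-pro": (2.00, 4.00, 200_000),
    }
    short, long_, threshold = rates[model]
    rate = short if threshold is None or prompt_tokens <= threshold else long_
    return prompt_tokens / 1_000_000 * rate

# At 300k tokens, GPT-5.4's stepped-up input rate matches Opus 4.7's flat $5:
for m in ("opus-4.7", "gpt-5.4", "gemini-3.1-pro"):
    print(f"{m}: ${input_cost_usd(m, 300_000):.2f}")
# opus-4.7: $1.50, gpt-5.4: $1.50, gemini-3.1-pro: $1.20
```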
The hidden wildcard: tokenizer changes
Opus 4.7 shipped a new tokenizer that uses 1.0x to 1.35x more tokens than Opus 4.6 depending on content type. That's a stealth 0-35% price increase on a model whose sticker price "didn't change." If you're budgeting, factor in a 10-15% real-world cost bump versus Opus 4.6, not zero.
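If you want to budget with the tokenizer change included, the adjustment is one multiplication. The 1.15 multiplier below is an assumed midpoint of the quoted 1.0x to 1.35x range, not a measured figure:

```python
# Budgeting adjustment for the tokenizer change. The 1.15 multiplier is
# an assumed midpoint of the quoted 1.0x-1.35x range; measure it on your
# own corpus before committing to a number.
sticker = 5.00                      # $ per 1M input tokens, unchanged on paper
token_inflation = 1.15              # assumption, not a measured figure
print(f"${sticker * token_inflation:.2f} per 1M Opus-4.6-equivalent tokens")
# -> $5.75: a 15% real increase behind a flat sticker price
```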
Real Cost Math: What 1M Tokens Actually Costs
Benchmarks are abstract. Money is not. Here's what a realistic agentic workload costs on each model, assuming a round 1M input tokens and 1M output tokens per day at short-context rates.
Daily cost at 1M input + 1M output tokens
| Model | Input cost | Output cost | Total per day | Per month |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | $900 |
| GPT-5.4 (short) | $2.50 | $15.00 | $17.50 | $525 |
| Gemini 3.1 Pro (short) | $2.00 | $12.00 | $14.00 | $420 |
Over a 30-day month, Opus 4.7 costs $480 more than Gemini 3.1 Pro for the same volume. If your workload is SWE-bench-Pro-shaped (resolving real GitHub issues, running tool-heavy agents), Opus 4.7's 10-point lead is likely worth that $480. If your workload is writing, summarization, or research, Gemini 3.1 Pro delivers comparable quality and is the obvious pick.
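The table reduces to a few lines of Python if you want to plug in your own volumes; the prices are the short-context rates from the pricing section:

```python
# Reproduces the table above: 1M input + 1M output tokens per day at
# short-context rates, over a 30-day month. Swap in your own volumes.
PRICES = {  # model: (input $/1M, output $/1M)
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.4":         (2.50, 15.00),
    "Gemini 3.1 Pro":  (2.00, 12.00),
}
for model, (inp, out) in PRICES.items():
    daily = 1.0 * inp + 1.0 * out   # 1M tokens of each
    print(f"{model}: ${daily:.2f}/day, ${daily * 30:,.0f}/month")
# Opus 4.7: $30.00/day, $900/month ... Gemini 3.1 Pro: $14.00/day, $420/month
```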
Caching shifts the math further. GPT-5.4 cached input is $1.25 per million tokens, a 50% discount applied automatically to repeated context. If you're iterating on the same long system prompt across many requests, GPT-5.4's cache aggressively undercuts both competitors on effective input price.
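The effective cached price depends on your hit rate. A sketch, assuming an illustrative 80% of input tokens hit the cache (a plausible figure for a long, stable system prompt, not a measured one):

```python
# Effective GPT-5.4 input price under prompt caching. The $1.25 cached
# rate is from the article; the 80% hit rate is an illustration only.
full_rate, cached_rate, hit_rate = 2.50, 1.25, 0.80
effective = hit_rate * cached_rate + (1 - hit_rate) * full_rate
print(f"${effective:.2f} per 1M input tokens")  # $1.50, below Gemini's $2.00
```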
Who Actually Wins Where
Winner by workload
| Workload | Winner | Why |
|---|---|---|
| Shipping production code | Claude Opus 4.7 | 64.3% SWE-bench Pro, leads on every coding benchmark |
| Tool-heavy agents / MCP | Claude Opus 4.7 | 77.3% MCP-Atlas, 9.2 points over GPT-5.4 |
| Computer-use / desktop automation | Claude Opus 4.7 | 78.0% OSWorld-Verified, 3 points over GPT-5.4 |
| Web research / deep research | GPT-5.4 | 89.3% BrowseComp, 10 points over Opus 4.7 |
| Financial analysis | Claude Opus 4.7 | 64.4% Finance Agent v1.1 vs 61.5% GPT-5.4 |
| Cheap, high-volume throughput | Gemini 3.1 Pro | $2/$12 pricing plus strong 80.6% SWE-bench Verified |
| Multilingual knowledge | Gemini 3.1 Pro | 92.6% MMMLU vs 91.5% Opus 4.7 |
| Graduate-level reasoning | Tie (within 0.2 points) | 94.2% / 94.4% / 94.3% GPQA Diamond |
The pattern nobody mentions
Opus 4.7 wins most categories. GPT-5.4 owns research. Gemini owns price. If your bottleneck is coding or agents and you can afford it, Opus 4.7 is the default. For research, GPT-5.4 is strictly better. For throughput at scale, Gemini wins on price without a meaningful quality gap for most workloads.
Honest Limitations for All Three
Opus 4.7: The BrowseComp regression (83.7% to 79.3%) is real. If your workflow depends on web search, Opus 4.7 is a downgrade from Opus 4.6. The new tokenizer also quietly raises effective cost 10-35% depending on content.
GPT-5.4: Context pricing doubles past 272k tokens. The Terminal-Bench 2.0 "win" uses a self-reported harness that isn't directly comparable to the Opus 4.7 and Gemini 3.1 Pro runs. Treat that one as unverified.
Gemini 3.1 Pro: Trails on SWE-bench Pro by 10 points. No published OSWorld number, which suggests Google isn't confident about its computer-use story versus Opus 4.7 and GPT-5.4. MCP support is catching up but still behind Anthropic's native integration.
Choose One Based on the Task
Decision framework
1. Shipping code full-time? Opus 4.7. The 6-10 point SWE-bench lead is worth the premium, and agentic coding is its decisive advantage.
2. Running tool-heavy agents? Opus 4.7. The MCP-Atlas lead and OSWorld score make it the agentic default.
3. Deep web research or competitive intelligence? GPT-5.4. The 10-point BrowseComp gap is the single largest spread between any two models in this comparison.
4. High-volume generation (summaries, drafts, translations)? Gemini 3.1 Pro. 60% cheaper than Opus 4.7 with no meaningful quality gap on most common workloads.
5. Financial or analytical work? Opus 4.7 narrowly, but GPT-5.4 is close enough that the $2.50-vs-$5 price difference usually wins.
6. Iterating on the same long system prompt across many requests? GPT-5.4, with aggressive prompt caching at $1.25 per 1M cached tokens.
7. Running enormous single prompts (300k+ tokens) regularly? The price gap mostly evaporates here: Opus 4.7's flat $5 input matches GPT-5.4's long-context rate, and Gemini's discount narrows to 20-28%. Pick on quality, not price, at this scale.
The broader truth: nobody uses just one of these anymore. The cost-conscious pattern is Gemini 3.1 Pro for bulk, Opus 4.7 for code, GPT-5.4 for research, all routed from the same orchestration layer. If you're not thinking about model routing yet, you're overpaying.
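A minimal sketch of that routing layer using LiteLLM's `completion()` entry point. The model identifiers below are placeholders, since the real IDs come from each provider's docs; the routing keys mirror the workload table above:

```python
from litellm import completion  # pip install litellm

# Model IDs are placeholders; substitute the identifiers from each
# provider's documentation.
ROUTES = {
    "code":     "anthropic/claude-opus-4-7",   # placeholder ID
    "agents":   "anthropic/claude-opus-4-7",   # placeholder ID
    "research": "openai/gpt-5.4",              # placeholder ID
    "bulk":     "gemini/gemini-3.1-pro",       # placeholder ID
}

def route(task: str, prompt: str) -> str:
    """Send the prompt to the model this comparison recommends for the task."""
    model = ROUTES.get(task, ROUTES["bulk"])   # cheap default for unknown tasks
    response = completion(model=model,
                          messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

# Usage: route("research", "What changed in the EU AI Act this quarter?")
```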
FAQ
Which AI model is genuinely the best in 2026?
It depends entirely on the task. Claude Opus 4.7 leads every coding and agentic benchmark. GPT-5.4 leads web research by a wide margin. Gemini 3.1 Pro is the cheapest by 60% with competitive quality on most general workloads. Graduate-level reasoning (GPQA Diamond) is saturated: all three sit at roughly 94% and are statistically tied.
How much does Claude Opus 4.7 cost compared to GPT-5.4?
Opus 4.7 is $5 input / $25 output per million tokens, flat. GPT-5.4 is $2.50/$15 up to 272k tokens, then $5/$22.50 for longer context. For short-to-medium prompts, GPT-5.4 is roughly half the price of Opus 4.7. For very long prompts (300k+), the difference mostly disappears: GPT-5.4's long-context input price matches Opus 4.7's flat $5, and its output stays slightly cheaper at $22.50 versus $25.
What's the biggest upgrade from Opus 4.6 to Opus 4.7?
SWE-bench Pro jumped from 53.4% to 64.3%, a 10.9-point improvement and the largest single-generation coding gain Anthropic has shipped. OSWorld also moved from 72.7% to 78.0%. Price stayed flat at $5/$25, though the new tokenizer quietly increases effective cost by 10-35% depending on content.
Should I switch from Opus 4.6 to Opus 4.7?
For coding and agentic work, yes. The SWE-bench Pro and MCP-Atlas gains are significant. For anything research-heavy, no. BrowseComp regressed from 83.7% to 79.3%, so Opus 4.6 is actually better for web research. Opus 4.7 is a targeted coding upgrade, not a universal one.
Is Gemini 3.1 Pro really 60% cheaper than Claude Opus 4.7?
On input tokens under 200k, yes: $2 vs $5. On output tokens, Gemini is $12 vs $25, or 52% cheaper. Past 200k context Gemini steps up to $4/$18, which is still 20-28% cheaper than Opus 4.7's flat $5/$25. The cost gap is real and consistent.
Can I use all three through the same API?
Not directly, but orchestration layers (LiteLLM, OpenRouter, or your own router) normalize the three APIs so you can route per request. That's the pattern serious users adopt: Gemini for bulk, Opus 4.7 for code, GPT-5.4 for research. Pick-one-model thinking is leaving money on the table.