Three models. Three different bets. No overlap in where they win. Claude Opus 4.7 leads every coding and agentic benchmark (64.3% SWE-bench Pro, 77.3% MCP-Atlas, 78.0% OSWorld). GPT-5.4 dominates web research at 89.3% BrowseComp, ten points ahead of Opus 4.7. Gemini 3.1 Pro costs 60% less than Opus 4.7 at $2 input versus $5. On graduate-level reasoning (GPQA Diamond) they're identical to within 0.2 points. Pick the model that matches the task. Don't pick the brand.
- Claude Opus 4.7 leads SWE-bench Pro at 64.3% vs 57.7% for GPT-5.4 and 54.2% for Gemini 3.1 Pro (per Vellum's Opus 4.7 benchmark breakdown).
- GPT-5.4 leads web research with 89.3% on BrowseComp vs 85.9% for Gemini 3.1 Pro and 79.3% for Opus 4.7.
- All three are statistically tied on GPQA Diamond: Opus 4.7 at 94.2%, GPT-5.4 at 94.4%, Gemini 3.1 Pro at 94.3%.
- Gemini 3.1 Pro is the cheapest: $2 input / $12 output per 1M tokens (under 200k context). GPT-5.4 is $2.50/$15. Opus 4.7 is $5/$25.
- Opus 4.7 leads on MCP-Atlas (tool orchestration) at 77.3% vs 73.9% for Gemini 3.1 Pro and 68.1% for GPT-5.4.
- Opus 4.7 keeps the same $5/$25 pricing as Opus 4.6 but introduces a new 'xhigh' effort level and task budgets in public beta.
- GPT-5.4 context window is roughly 1.05M tokens with 128k max output; past 272k tokens, input pricing doubles.
- Opus 4.7 tripled image-input resolution to 2,576 pixels on the long edge (~3.75MP), the first Claude with genuine high-res vision.
Three frontier labs, three different bets. Anthropic bet on coding and agents and charges a premium for it. Google bet on price and undercut Opus 4.7 by 60% with Gemini 3.1 Pro. OpenAI bet on web research and landed it.
I pulled the verified benchmark numbers from Anthropic's Opus 4.7 announcement, Vellum's Opus 4.7 benchmark breakdown, and the official pricing pages for each provider. Exact numbers only, no marketing language. For the earlier generation see our GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro comparison.
The Benchmark Showdown
Verified numbers only, all from primary or credible secondary sources.
The benchmark picture is cleaner than it's been in a while. Each model actually wins its chosen battleground. None of the three is pretending to be first at everything.
Every number below is also in our live benchmark leaderboard, where you can click any cell to see the primary source.
Verified benchmark comparison
| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 87.6% | not disclosed | 80.6% |
| SWE-bench Pro | 64.3% | 57.7% | 54.2% |
| MCP-Atlas (tool use) | 77.3% | 68.1% | 73.9% |
| OSWorld-Verified | 78.0% | 75.0% | not disclosed |
| BrowseComp (research) | 79.3% | 89.3% | 85.9% |
| GPQA Diamond | 94.2% | 94.4% | 94.3% |
| Finance Agent v1.1 | 64.4% | 61.5% | 59.7% |
| MMMLU | 91.5% | not disclosed | 92.6% |
The coding verdict is now decisive
Opus 4.7 doesn't just win SWE-bench Pro. It wins by 6.6 points over GPT-5.4 and 10.1 points over Gemini 3.1 Pro. That's a wider gap than Opus 4.6 ever had. If you're shipping code, the Opus premium is now actually earned.
The picture changes completely on BrowseComp. Opus 4.7 scored 79.3%, down 4.4 points from Opus 4.6's 83.7%. GPT-5.4 sits at 89.3%. If your workflow involves research across the web, Opus 4.7 is now the wrong tool. That's the honest reading.
GPQA Diamond is effectively saturated. Opus 4.7 at 94.2%, GPT-5.4 at 94.4%, Gemini 3.1 Pro at 94.3%. The 0.2-point spread is within run-to-run variance. Don't pick a model based on GPQA anymore.
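To see why a 0.2-point spread is noise: GPQA Diamond is a 198-question set (the widely cited figure for the Diamond split; confirm against the eval harness you use), so a single question is worth about half a point. A quick sanity check:

```python
# GPQA Diamond has 198 questions, so one question is worth ~0.51 points.
# A 0.2-point spread is less than a single answer flipping.
n_questions = 198
points_per_question = 100 / n_questions
print(f"{points_per_question:.2f} points per question")  # 0.51
print(94.4 - 94.2 < points_per_question)                 # True: within noise
```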
Pricing: Where Gemini Quietly Wins
All three providers list per-million-token rates, and Google's pricing is structured to punish Anthropic directly. Here's the breakdown for the flagship tier of each.
Per-million-token pricing (standard tier)
| Model | Input (short) | Output (short) | Input (long) | Output (long) |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $5.00 | $25.00 |
| GPT-5.4 | $2.50 | $15.00 | $5.00 | $22.50 |
| Gemini 3.1 Pro | $2.00 | $12.00 | $4.00 | $18.00 |
Short-context thresholds differ by provider. GPT-5.4 doubles its input price past 272k tokens. Gemini 3.1 Pro does the same past 200k. Opus 4.7 has no step-up: $5/$25 flat. If you're running enormous prompts regularly, the gap narrows sharply: past the thresholds, GPT-5.4's input rate matches Opus 4.7's flat $5 exactly, and Gemini's input discount shrinks from 60% to 20%.
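A minimal sketch of the step-up logic, using the rates and thresholds from the table above. The model keys are illustrative labels, not official API identifiers:

```python
def input_cost_usd(model: str, prompt_tokens: int) -> float:
    """Input cost for one prompt, using the rates in the table above."""
    # model: (short rate $/1M, long rate $/1M, step-up threshold in tokens)
    rates = {
        "opus-4.7":       (5.00, 5.00, None),     # flat, no step-up
        "gpt-5.4":        (2.50, 5.00, 272_000),
        "gemini-3.1-pro": (2.00, 4.00, 200_000),
    }
    short, long_, threshold = rates[model]
    rate = short if threshold is None or prompt_tokens <= threshold else long_
    return prompt_tokens / 1_000_000 * rate

# At 300k tokens, GPT-5.4's stepped-up input rate matches Opus 4.7's flat $5:
for m in ("opus-4.7", "gpt-5.4", "gemini-3.1-pro"):
    print(f"{m}: ${input_cost_usd(m, 300_000):.2f}")
# opus-4.7: $1.50, gpt-5.4: $1.50, gemini-3.1-pro: $1.20
```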
The hidden wildcard: tokenizer changes
Opus 4.7 shipped a new tokenizer that uses 1.0x to 1.35x more tokens than Opus 4.6 depending on content type. That's a stealth 0-35% price increase on a model whose sticker price "didn't change." If you're budgeting, factor in a 10-15% real-world cost bump versus Opus 4.6, not zero.
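If you want to budget with the tokenizer change included, the adjustment is one multiplication. The 1.15 multiplier below is an assumed midpoint of the quoted 1.0x to 1.35x range, not a measured figure:

```python
# Budgeting adjustment for the tokenizer change. The 1.15 multiplier is
# an assumed midpoint of the quoted 1.0x-1.35x range; measure it on your
# own corpus before committing to a number.
sticker = 5.00                      # $ per 1M input tokens, unchanged on paper
token_inflation = 1.15              # assumption, not a measured figure
print(f"${sticker * token_inflation:.2f} per 1M Opus-4.6-equivalent tokens")
# -> $5.75: a 15% real increase behind a flat sticker price
```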
Real Cost Math: What 1M Tokens Actually Costs
Benchmarks are abstract. Money is not. Here's what a realistic agentic workload costs on each model, assuming a round 1M input tokens and 1M output tokens per day at short-context rates.
Daily cost at 1M input + 1M output tokens
| Model | Input cost | Output cost | Total per day | Per month |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | $900 |
| GPT-5.4 (short) | $2.50 | $15.00 | $17.50 | $525 |
| Gemini 3.1 Pro (short) | $2.00 | $12.00 | $14.00 | $420 |
Over a 30-day month, Opus 4.7 costs $480 more than Gemini 3.1 Pro for the same volume. If your workload is SWE-bench-Pro-shaped (resolving real GitHub issues, running tool-heavy agents), Opus 4.7's 10-point lead is likely worth that $480. If your workload is writing, summarization, or research, Gemini 3.1 Pro delivers comparable quality and is the obvious pick.
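The table reduces to a few lines of Python if you want to plug in your own volumes; the prices are the short-context rates from the pricing section:

```python
# Reproduces the table above: 1M input + 1M output tokens per day at
# short-context rates, over a 30-day month. Swap in your own volumes.
PRICES = {  # model: (input $/1M, output $/1M)
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.4":         (2.50, 15.00),
    "Gemini 3.1 Pro":  (2.00, 12.00),
}
for model, (inp, out) in PRICES.items():
    daily = 1.0 * inp + 1.0 * out   # 1M tokens of each
    print(f"{model}: ${daily:.2f}/day, ${daily * 30:,.0f}/month")
# Opus 4.7: $30.00/day, $900/month ... Gemini 3.1 Pro: $14.00/day, $420/month
```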
Caching shifts the math further. GPT-5.4 cached input is $1.25 per million tokens, a 50% discount applied automatically to repeated context. If you're iterating on the same long system prompt across many requests, GPT-5.4's cache aggressively undercuts both competitors on effective input price.
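The effective cached price depends on your hit rate. A sketch, assuming an illustrative 80% of input tokens hit the cache (a plausible figure for a long, stable system prompt, not a measured one):

```python
# Effective GPT-5.4 input price under prompt caching. The $1.25 cached
# rate is from the article; the 80% hit rate is an illustration only.
full_rate, cached_rate, hit_rate = 2.50, 1.25, 0.80
effective = hit_rate * cached_rate + (1 - hit_rate) * full_rate
print(f"${effective:.2f} per 1M input tokens")  # $1.50, below Gemini's $2.00
```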
Who Actually Wins Where
Winner by workload
| Workload | Winner | Why |
|---|---|---|
| Shipping production code | Claude Opus 4.7 | 64.3% SWE-bench Pro, leads on every coding benchmark |
| Tool-heavy agents / MCP | Claude Opus 4.7 | 77.3% MCP-Atlas, 9.2 points over GPT-5.4 |
| Computer-use / desktop automation | Claude Opus 4.7 | 78.0% OSWorld-Verified, 3 points over GPT-5.4 |
| Web research / deep research | GPT-5.4 | 89.3% BrowseComp, 10 points over Opus 4.7 |
| Financial analysis | Claude Opus 4.7 | 64.4% Finance Agent v1.1 vs 61.5% GPT-5.4 |
| Cheap, high-volume throughput | Gemini 3.1 Pro | $2/$12 pricing plus strong 80.6% SWE-bench Verified |
| Multilingual knowledge | Gemini 3.1 Pro | 92.6% MMMLU vs 91.5% Opus 4.7 |
| Graduate-level reasoning | Tie (within 0.2 points) | 94.2% / 94.4% / 94.3% GPQA Diamond |
The pattern nobody mentions
Opus 4.7 wins most categories. GPT-5.4 owns research. Gemini owns price. If your bottleneck is coding or agents and you can afford it, Opus 4.7 is the default. For research, GPT-5.4 is strictly better. For throughput at scale, Gemini wins on price without a meaningful quality gap for most workloads.
Honest Limitations for All Three
Opus 4.7: The BrowseComp regression (83.7% to 79.3%) is real. If your workflow depends on web search, Opus 4.7 is a downgrade from Opus 4.6. The new tokenizer also quietly raises effective cost 10-35% depending on content.
GPT-5.4: Context pricing doubles past 272k tokens. The Terminal-Bench 2.0 "win" uses a self-reported harness that isn't directly comparable to the Opus 4.7 and Gemini 3.1 Pro runs. Treat that one as unverified.
Gemini 3.1 Pro: Trails on SWE-bench Pro by 10 points. No published OSWorld number, which suggests Google isn't confident about its computer-use story versus Opus 4.7 and GPT-5.4. MCP support is catching up but still behind Anthropic's native integration.
Choose One Based on the Task
Decision framework
1. Shipping code full-time? Opus 4.7. The 6-10 point SWE-bench lead is worth the premium, and agentic coding is its decisive advantage.
2. Running tool-heavy agents? Opus 4.7. The MCP-Atlas lead and OSWorld score make it the agentic default.
3. Deep web research or competitive intelligence? GPT-5.4. The 10-point BrowseComp gap is the single largest spread between any two models in this comparison.
4. High-volume generation (summaries, drafts, translations)? Gemini 3.1 Pro. 60% cheaper than Opus 4.7 with no meaningful quality gap on most common workloads.
5. Financial or analytical work? Opus 4.7 narrowly, but GPT-5.4 is close enough that the $2.50-vs-$5 price difference usually wins.
6. Iterating on the same long system prompt across many requests? GPT-5.4, with aggressive prompt caching at $1.25 per 1M cached tokens.
7. Running enormous single prompts (300k+ tokens) regularly? The price gap mostly evaporates here: Opus 4.7's flat $5 input matches GPT-5.4's long-context rate, and Gemini's discount narrows to 20-28%. Pick on quality, not price, at this scale.
The broader truth: nobody uses just one of these anymore. The cost-conscious pattern is Gemini 3.1 Pro for bulk, Opus 4.7 for code, GPT-5.4 for research, all routed from the same orchestration layer. If you're not thinking about model routing yet, you're overpaying.
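A minimal sketch of that routing layer using LiteLLM's `completion()` entry point. The model identifiers below are placeholders, since the real IDs come from each provider's docs; the routing keys mirror the workload table above:

```python
from litellm import completion  # pip install litellm

# Model IDs are placeholders; substitute the identifiers from each
# provider's documentation.
ROUTES = {
    "code":     "anthropic/claude-opus-4-7",   # placeholder ID
    "agents":   "anthropic/claude-opus-4-7",   # placeholder ID
    "research": "openai/gpt-5.4",              # placeholder ID
    "bulk":     "gemini/gemini-3.1-pro",       # placeholder ID
}

def route(task: str, prompt: str) -> str:
    """Send the prompt to the model this comparison recommends for the task."""
    model = ROUTES.get(task, ROUTES["bulk"])   # cheap default for unknown tasks
    response = completion(model=model,
                          messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

# Usage: route("research", "What changed in the EU AI Act this quarter?")
```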
FAQ
Which AI model is genuinely the best in 2026?
It depends entirely on the task. Claude Opus 4.7 leads every coding and agentic benchmark. GPT-5.4 leads web research by a wide margin. Gemini 3.1 Pro is the cheapest by 60% with competitive quality on most general workloads. Graduate-level reasoning (GPQA Diamond) is saturated: all three sit at roughly 94% and are statistically tied.
How much does Claude Opus 4.7 cost compared to GPT-5.4?
Opus 4.7 is $5 input / $25 output per million tokens, flat. GPT-5.4 is $2.50/$15 up to 272k tokens, then $5/$22.50 for longer context. For short-to-medium prompts, GPT-5.4 is roughly half the price of Opus 4.7. For very long prompts (300k+), the difference mostly disappears: GPT-5.4's long-context input price matches Opus 4.7's flat $5, and its output stays slightly cheaper at $22.50 versus $25.
What's the biggest upgrade from Opus 4.6 to Opus 4.7?
SWE-bench Pro jumped from 53.4% to 64.3%, a 10.9-point improvement and the largest single-generation coding gain Anthropic has shipped. OSWorld also moved from 72.7% to 78.0%. Price stayed flat at $5/$25, though the new tokenizer quietly increases effective cost by 10-35% depending on content.
Should I switch from Opus 4.6 to Opus 4.7?
For coding and agentic work, yes. The SWE-bench Pro and MCP-Atlas gains are significant. For anything research-heavy, no. BrowseComp regressed from 83.7% to 79.3%, so Opus 4.6 is actually better for web research. Opus 4.7 is a targeted coding upgrade, not a universal one.
Is Gemini 3.1 Pro really 60% cheaper than Claude Opus 4.7?
On input tokens under 200k, yes: $2 vs $5. On output tokens, Gemini is $12 vs $25, or 52% cheaper. Past 200k context Gemini steps up to $4/$18, which is still 20-28% cheaper than Opus 4.7's flat $5/$25. The cost gap is real and consistent.
Can I use all three through the same API?
Not directly, but orchestration layers (LiteLLM, OpenRouter, or your own router) normalize the three APIs so you can route per request. That's the pattern serious users adopt: Gemini for bulk, Opus 4.7 for code, GPT-5.4 for research. Pick-one-model thinking is leaving money on the table.