The three-way race between OpenAI, Anthropic, and Google has produced the most capable generation of language models the world has ever seen. But “most capable” is doing a lot of work in that sentence. These models are not interchangeable. Each has measurable strengths and weaknesses, and the gap between them on specific tasks can be enormous — the difference between a model that solves 72% of graduate-level physics problems and one that solves 49% is not a rounding error.
We tested the current flagship models — OpenAI’s GPT-4.1 and o3, Anthropic’s Claude Opus 4, and Google’s Gemini 2.5 Pro — across the benchmarks that matter, then cross-referenced with real-world usage. Here is where they actually stand.
Let’s name names. Each provider now ships multiple model tiers, but the ones that matter for serious work are:
OpenAI: GPT-4.1 (their workhorse model, 1M token context), o3 (their reasoning-specialized model, slower but stronger on math and logic), and o4-mini (reasoning on a budget).
Anthropic: Claude Opus 4 (the new flagship, 200K context), Claude Sonnet 4 (the balanced middle tier, also 200K), and Claude Haiku 3.5 (fast and cheap).
Google: Gemini 2.5 Pro (their best model, 1M token context) and Gemini 2.5 Flash (fast and cost-effective, also 1M context).
| Dimension | OpenAI (GPT-4.1 / o3) | Anthropic (Opus 4) | Google (Gemini 2.5 Pro) |
|---|---|---|---|
| Context Window | 1M tokens (GPT-4.1), 200K (o3) | 200K tokens | 1M tokens |
| Coding (SWE-Bench) | GPT-4.1: 54.6%, o3: 69.1% | Opus 4: 72.5% | Gemini 2.5 Pro: 63.8% |
| Reasoning (GPQA) | o3: 83.3%, GPT-4.1: 62.5% | Opus 4: 74.8% | Gemini 2.5 Pro: 72.0% |
| Math (AIME '25) | o3: 88.9% | Opus 4: 77.8% | Gemini 2.5 Pro: 86.7% |
| Multimodal | Strong image, no native video | Strong image + PDF | Best-in-class: image, video, audio |
| Price (USD per 1M tokens, input / output) | $2 / $8 (GPT-4.1), $10 / $40 (o3) | $15 / $75 (Opus 4) | $1.25 / $10 (Gemini 2.5 Pro) |
A few things jump out of that table. First, pricing: Gemini 2.5 Pro is dramatically cheaper than the competition at the frontier tier — $1.25 per million input tokens versus $15 for Claude Opus 4. Second, no single model wins everything. Third, the gap between “reasoning-specialized” models (o3) and general-purpose flagships is real and large on math-heavy benchmarks.
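To make the "no single model wins everything" point concrete, here is a minimal Python sketch that picks the leader on each benchmark and its margin over the runner-up. The scores are copied from the table above; the `SCORES` dictionary and the `leaders` helper are names made up for this illustration.

```python
# Benchmark scores in percent, copied from the comparison table above.
SCORES = {
    "SWE-Bench": {"GPT-4.1": 54.6, "o3": 69.1, "Claude Opus 4": 72.5, "Gemini 2.5 Pro": 63.8},
    "GPQA": {"GPT-4.1": 62.5, "o3": 83.3, "Claude Opus 4": 74.8, "Gemini 2.5 Pro": 72.0},
    "AIME '25": {"o3": 88.9, "Claude Opus 4": 77.8, "Gemini 2.5 Pro": 86.7},
}


def leaders(scores):
    """Return {benchmark: (top model, top score, margin over runner-up in points)}."""
    result = {}
    for bench, by_model in scores.items():
        ranked = sorted(by_model.items(), key=lambda kv: kv[1], reverse=True)
        (top, top_score), (_, second_score) = ranked[0], ranked[1]
        result[bench] = (top, top_score, round(top_score - second_score, 1))
    return result


for bench, (model, score, margin) in leaders(SCORES).items():
    print(f"{bench}: {model} at {score}% (+{margin} points over the runner-up)")
```

On these numbers, Opus 4 leads SWE-Bench while o3 leads GPQA and AIME, and the margins differ a lot from one benchmark to the next.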
Numbers without context are just decoration. Here is what the key benchmarks actually tell us.
SWE-Bench tests whether a model can solve real GitHub issues from popular open-source projects. It is the closest thing we have to a measure of practical software engineering ability. The model reads an issue description and the repository, then must produce a working patch.
Claude Opus 4 leads here, and the margin is not trivial. In practice, this translates to Opus 4 being noticeably better at multi-file changes, understanding existing codebases, and producing patches that actually pass the test suite on the first try. For agentic coding workflows — where the model operates semi-autonomously — that reliability gap compounds.
OpenAI’s o3 is strong but slower, and its token prices are 5x those of GPT-4.1 ($10 / $40 versus $2 / $8 per million tokens). The practical choice for many teams is GPT-4.1 for routine coding and o3 only when they need the extra reasoning muscle.
GPQA (Graduate-Level Google-Proof Q&A) tests questions that require genuine expert-level reasoning — biology, physics, chemistry problems that even PhD students find difficult. “Google-proof” means the answer cannot be found by simple web search.
o3 leads this category at 83.3%, which is legitimately impressive. It represents a qualitative shift: the model is not just pattern-matching from training data, it is doing something closer to step-by-step scientific reasoning. Claude Opus 4 at 74.8% and Gemini 2.5 Pro at 72.0% are strong but meaningfully behind.
The American Invitational Mathematics Examination is a competition-level math test. o3 scores 88.9%, Gemini 2.5 Pro hits 86.7%, and Claude Opus 4 comes in at 77.8%. This is one area where o3’s specialized reasoning architecture genuinely shines — competition math is exactly the kind of multi-step, formally verifiable problem it was built for.
Google wins the multimodal category, and it isn’t close. Gemini 2.5 Pro natively processes images, video, and audio in a single context. It can watch a 30-minute video and answer questions about specific moments. Claude and GPT handle images well, but neither has native video understanding at the same level.
For many teams, benchmark scores are secondary to cost. If you are processing millions of tokens per day, the difference between $1.25 and $15 per million input tokens is the difference between a viable product and a money pit.
| Tier | OpenAI | Anthropic | Google |
|---|---|---|---|
| Flagship | GPT-4.1: $2 / $8 | Opus 4: $15 / $75 | Gemini 2.5 Pro: $1.25 / $10 |
| Reasoning | o3: $10 / $40 | Opus 4 (extended): $15 / $75 | Gemini 2.5 Pro (thinking): $1.25 / $10 |
| Mid-tier | GPT-4.1 mini: $0.40 / $1.60 | Sonnet 4: $3 / $15 | Gemini 2.5 Flash: $0.15 / $0.60 |
| Budget | GPT-4.1 nano: $0.10 / $0.40 | Haiku 3.5: $0.80 / $4 | Gemini 2.0 Flash-Lite: $0.075 / $0.30 |
| Free tier | ChatGPT Free (limited) | claude.ai Free (limited) | Gemini Free (limited) |
| Pro subscription | ChatGPT Plus: $20/mo | Claude Pro: $20/mo | Gemini Advanced: $20/mo |
Google’s pricing is aggressive across the board. Gemini 2.5 Flash at $0.15 per million input tokens is 20x cheaper than Claude Sonnet 4 at $3, and 25x cheaper on output ($0.60 versus $15). For high-volume applications such as RAG pipelines, document processing, and classification at scale, this cost advantage is decisive.
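A quick way to see what these price gaps mean for a budget is to multiply them out. The sketch below uses the API prices from the table above; the workload of 50M input and 5M output tokens per day is a hypothetical figure chosen for illustration, and `monthly_cost` is a helper name invented here.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "GPT-4.1": (2.00, 8.00),
    "o3": (10.00, 40.00),
    "Claude Opus 4": (15.00, 75.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
}


def monthly_cost(model, input_tokens_per_day, output_tokens_per_day, days=30):
    """Estimated spend in USD for a steady daily token volume."""
    in_price, out_price = PRICES[model]
    daily = (input_tokens_per_day * in_price + output_tokens_per_day * out_price) / 1_000_000
    return daily * days


# Hypothetical workload: 50M input tokens and 5M output tokens per day.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 50_000_000, 5_000_000):,.2f} per month")
```

The ratio between models shifts with the input/output mix, so it is worth plugging in your own traffic rather than comparing input prices alone.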
Anthropic is the most expensive, and unapologetically so. Their argument: Opus 4’s reliability on complex tasks means you spend less on retries and human review. For agentic coding and high-stakes analysis, the higher per-token cost can yield lower total cost of ownership. Whether that math works out depends entirely on your use case.
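The total-cost argument can be sketched with a simple model: if attempts are independent and you retry until one is accepted, the expected number of attempts is 1 / success rate. Everything numeric below is a made-up assumption for illustration (attempt costs, success rates, and review cost are not measurements of any model), and `cost_per_success` is a hypothetical helper.

```python
def cost_per_success(attempt_cost, success_rate, review_cost=0.0):
    """Expected USD spent to get one accepted result, retrying until success.

    Assumes independent attempts, so the expected attempt count is
    1 / success_rate, and that every attempt incurs the review cost.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (attempt_cost + review_cost) / success_rate


# Hypothetical numbers only: a pricier, more reliable model versus a cheaper one.
pricey = cost_per_success(attempt_cost=1.50, success_rate=0.90, review_cost=5.00)
cheap = cost_per_success(attempt_cost=0.15, success_rate=0.60, review_cost=5.00)
print(f"with human review:    pricey ${pricey:.2f}, cheap ${cheap:.2f}")

pricey_unreviewed = cost_per_success(attempt_cost=1.50, success_rate=0.90)
cheap_unreviewed = cost_per_success(attempt_cost=0.15, success_rate=0.60)
print(f"without human review: pricey ${pricey_unreviewed:.2f}, cheap ${cheap_unreviewed:.2f}")
```

Under these assumed inputs the pricier model comes out ahead only when each attempt carries a human review cost, which is the sense in which the answer depends on the use case.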
We have been using all three extensively. Here is what we actually reach for:
Best for coding: Claude Opus 4. The SWE-Bench lead translates directly to real-world coding. It produces cleaner diffs, makes fewer mistakes on multi-file refactors, and follows instructions more precisely. Claude Code (Anthropic’s terminal agent) is the best agentic coding tool we have used. If you write software for a living, this is the model to beat.
Best for math and formal reasoning: o3. When you need a model to grind through a multi-step proof, optimize an algorithm, or solve a quantitative problem, o3’s extended thinking is genuinely superior. It is expensive and slow, but for problems where correctness matters more than speed, it earns its premium.
Best value for general use: Gemini 2.5 Pro. The combination of a 1M token context, strong-but-not-leading benchmark scores, native multimodal support, and the lowest input price among the three flagships makes it the default choice for teams watching their budget. For most everyday tasks, it is good enough, at roughly a tenth of Opus 4’s per-token price ($1.25 versus $15 input, $10 versus $75 output).
Best for long documents and research: A toss-up between GPT-4.1 (1M context) and Gemini 2.5 Pro (1M context). Claude’s 200K limit is a real constraint here. If you need to ingest an entire codebase or a 500-page legal filing, you need the million-token models.
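A rough pre-flight check for whether a document fits a model's window might look like the sketch below. The context sizes come from the table above; the 4-characters-per-token ratio is a crude rule of thumb for English text (real tokenizers vary), and the 3,000-characters-per-page figure and the output reserve are assumptions for illustration.

```python
import math

# Context windows in tokens, from the comparison table above.
CONTEXT_WINDOWS = {
    "GPT-4.1": 1_000_000,
    "o3": 200_000,
    "Claude Opus 4": 200_000,
    "Gemini 2.5 Pro": 1_000_000,
}


def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate; use the provider's own tokenizer for real budgeting."""
    return math.ceil(len(text) / chars_per_token)


def models_that_fit(text, reserve_for_output=8_000):
    """Models whose window holds the text plus room for the reply."""
    needed = estimate_tokens(text) + reserve_for_output
    return [model for model, window in CONTEXT_WINDOWS.items() if needed <= window]


# Stand-in for a 500-page filing at an assumed 3,000 characters per page.
filing = "x" * (500 * 3_000)
print(estimate_tokens(filing), models_that_fit(filing))
```

Under these assumptions the filing needs about 375K tokens, so only the two million-token models remain candidates.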
Best for multimodal: Gemini 2.5 Pro, no contest. If your workflow involves video, audio, or complex visual understanding, Google is the only serious option.
Best for writing: This is subjective, but Claude Opus 4 produces the most natural, least AI-sounding prose. GPT-4.1 is versatile but tends toward filler. Gemini is competent but generic.
Most people interact with these models through consumer subscriptions, not APIs. All three providers charge $20/month for their Pro tier, which gives access to the flagship model with generous but not unlimited usage. At that price point, the choice is simpler:
If you primarily want a research and writing assistant, Claude Pro is the strongest choice. If you want an all-in-one tool with web browsing, image generation (DALL-E), and the broadest feature set, ChatGPT Plus has the most mature consumer experience. If you are a Google Workspace user and want AI integrated into Gmail, Docs, and Search, Gemini Advanced is the path of least resistance.
For developers, the calculus shifts. ChatGPT Plus does not include API access — you need a separate API account. Claude Pro includes substantial Claude Code usage (Anthropic’s terminal coding agent), which makes it uniquely valuable for engineers. Gemini Advanced includes access via Google AI Studio, which has the most generous free-tier API access of any provider.
Beyond the models themselves, the developer experience of each API matters. OpenAI’s API is the most mature, with the best documentation, the largest ecosystem of libraries and frameworks (LangChain, LlamaIndex, and virtually every AI framework defaults to OpenAI), and the most predictable behavior. If you are building a production application, OpenAI’s API is the path of least resistance.
Anthropic’s API is clean and well-designed, with excellent streaming support and the Messages API format that many developers prefer over OpenAI’s chat completions format. The developer documentation is thorough. The ecosystem is smaller but growing rapidly — most frameworks now support Anthropic as a first-class provider.
Google’s Vertex AI platform is powerful but complex. The authentication model, quota management, and regional endpoint configuration add overhead that OpenAI and Anthropic do not require. For teams already on Google Cloud, this is not an issue. For everyone else, it is friction.
The most important thing about this comparison is that it will be outdated within months. OpenAI is expected to release GPT-5 in the summer of 2026. Anthropic has hinted at improvements to context length. Google is iterating on Gemini at a blistering pace.
The winning strategy is not to pick one provider and lock in. It is to build abstractions that let you swap models, use routers that send different tasks to different models based on their strengths, and re-evaluate quarterly. Most serious AI applications in 2026 already use multiple providers — routing simple tasks to cheap, fast models and reserving the expensive flagships for work that demands their capabilities.
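As a minimal sketch of what such a router could look like, the Python below maps task types to models using the strengths described in this comparison. The task labels, the `Route` type, and the `route` function are all hypothetical, the model strings are display names rather than real API identifiers, and the long-context choice could just as well be GPT-4.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # display name, not an API model identifier
    reason: str


# Routing table based on the picks in this article; adjust to your own evaluations.
ROUTES = {
    "coding": Route("Claude Opus 4", "highest SWE-Bench score in the table"),
    "math": Route("o3", "highest GPQA and AIME scores in the table"),
    "multimodal": Route("Gemini 2.5 Pro", "native image, video, and audio input"),
    "long_context": Route("Gemini 2.5 Pro", "1M token context window"),
}
DEFAULT_ROUTE = Route("Gemini 2.5 Flash", "cheap default for simple tasks")


def route(task_type, needs_video=False, prompt_tokens=0):
    """Pick a route; hard constraints (modality, context size) win over task type."""
    if needs_video:
        return ROUTES["multimodal"]
    if prompt_tokens > 200_000:
        return ROUTES["long_context"]
    return ROUTES.get(task_type, DEFAULT_ROUTE)


print(route("coding"))
print(route("classification"))
print(route("coding", prompt_tokens=500_000))
```

Keeping the table as plain data makes the quarterly re-evaluation a one-line change rather than a code rewrite.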
Practically, this means using an abstraction layer like LiteLLM, Portkey, or the OpenAI-compatible API format that Anthropic and Google now also support. It means writing your prompts to be model-agnostic where possible. And it means tracking your costs and quality metrics so you can make data-driven routing decisions.
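To illustrate the idea without tying it to any one library, here is a small hand-rolled sketch of a provider-agnostic gateway that logs cost per call. It is not how LiteLLM or Portkey work internally; `Gateway`, `Completion`, and `fake_backend` are hypothetical names, and the backend is a stub that makes no network calls. In practice each backend would wrap a real SDK call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


# A backend is any callable that turns a prompt into a Completion.
Backend = Callable[[str], Completion]


@dataclass
class Gateway:
    backends: Dict[str, Backend]
    prices: Dict[str, Tuple[float, float]]  # USD per 1M tokens: (input, output)
    log: List[dict] = field(default_factory=list)

    def complete(self, model: str, prompt: str) -> Completion:
        """Call the chosen backend and record what the call cost."""
        result = self.backends[model](prompt)
        in_price, out_price = self.prices[model]
        cost = (result.input_tokens * in_price + result.output_tokens * out_price) / 1_000_000
        self.log.append({"model": model, "cost_usd": cost})
        return result

    def spend_by_model(self) -> Dict[str, float]:
        """Total logged spend in USD, grouped by model."""
        totals: Dict[str, float] = {}
        for entry in self.log:
            totals[entry["model"]] = totals.get(entry["model"], 0.0) + entry["cost_usd"]
        return totals


def fake_backend(prompt: str) -> Completion:
    """Stub for illustration: pretends every call used 1,000 input and 500 output tokens."""
    return Completion(text="ok", input_tokens=1_000, output_tokens=500)


gateway = Gateway(
    backends={"Gemini 2.5 Pro": fake_backend, "Claude Opus 4": fake_backend},
    prices={"Gemini 2.5 Pro": (1.25, 10.00), "Claude Opus 4": (15.00, 75.00)},
)
gateway.complete("Gemini 2.5 Pro", "Summarize this report.")
gateway.complete("Claude Opus 4", "Refactor this module.")
print(gateway.spend_by_model())
```

Because callers only see `complete(model, prompt)`, swapping a provider means registering a different backend, and the cost log gives the data needed for routing decisions.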
The era of “one model to rule them all” is over. The era of model orchestration has begun.