
Claude & GPT API pricing in 2026: real cost per million tokens

Current Claude, GPT and Gemini API prices per million tokens, what caching and batch discounts actually save, a worked example — and why hard budget caps beat surprise invoices.

8 min read · Cloudios team

Per-token prices keep falling, and AI bills keep rising. That is not a paradox — it is the defining pattern of 2026: cheaper tokens unlock agents, longer contexts and more features, and volume outruns the discounts (the "Jevons paradox of inference" that practitioners keep rediscovering). Per the State of FinOps 2026, 98% of organizations now manage AI spend, and the two top pains are not knowing the full perimeter of that spend (53.4%) and not being able to quantify its ROI (40.1%).

So here is the part everyone needs first: what the major APIs actually cost per million tokens (MTok), what the three universal discounts are worth, and what that means for a real workload.

Anthropic (Claude) pricing

| Model | Input ($/MTok) | Output ($/MTok) | Context |
|---|---|---|---|
| Claude Fable 5 | $10.00 | $50.00 | 1M |
| Claude Opus 4.8 | $5.00 | $25.00 | 1M |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K |

Anthropic API, per million tokens, June 2026.

Two structural observations. First, the output multiple: Claude output tokens cost 5× input. Any workload that generates long answers (reports, code, agent reasoning) is output-dominated, and "average price per token" hides that. Second, the tier spread: Haiku to Fable is a 10× price range on input for the same API shape — which is exactly why model routing (sending easy requests to cheap models) is the single highest-leverage optimization, before any infrastructure work.
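To make the routing argument concrete, here is a minimal Python sketch that prices a single request at the list prices from the table and then averages over a traffic split. The model keys are illustrative labels (not API model IDs), and the request shape (1,500 input / 400 output tokens) and the 70/30 split are assumptions chosen for the example.

```python
# List prices from the Anthropic table, in $ per million tokens: (input, output).
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "fable-5": (10.00, 50.00),
    "opus-4.8": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at list price (no caching, no batch)."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def routed_cost(shares: dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when traffic is split across models by share."""
    return sum(
        share * request_cost(model, input_tokens, output_tokens)
        for model, share in shares.items()
    )


if __name__ == "__main__":
    # Hypothetical request shape: 1,500 input tokens, 400 output tokens.
    all_sonnet = request_cost("sonnet-4.6", 1_500, 400)
    # Hypothetical split: 70% of requests are easy enough for the small model.
    routed = routed_cost({"haiku-4.5": 0.7, "sonnet-4.6": 0.3}, 1_500, 400)
    print(f"all Sonnet: ${all_sonnet:.4f} per request")
    print(f"70/30 routed: ${routed:.4f} per request")
```

With these assumed numbers the routed average is a little over half the all-Sonnet cost, and output tokens account for more than half of the Sonnet request even though they are barely a fifth of its tokens.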

OpenAI and Google

OpenAI: GPT-5.5 lists at roughly $5.00/MTok input, with cached input at $1.25/MTok (a 75% cache discount). Sources disagreed on the published output price at the time of writing, so check openai.com/api/pricing for the current output rate rather than trusting a third-party table.

Google: the Gemini line spans from $0.075/$0.30 per MTok (Gemini 2.0 Flash-Lite, input/output) up to $2.00/$12.00 for Gemini 3 Pro. In batch mode, Gemini 2.5 Flash-Lite has been documented as low as $0.05/$0.20 per MTok — the cheapest serious-model tokens on the market.

The cross-provider picture: frontier-tier models cluster at $2–10 input / $12–50 output per MTok, while each provider's small model sits one order of magnitude below. The spread between providers at the same tier is smaller than the spread between tiers at the same provider — route by tier first, by provider second.

The three discounts that matter more than the sticker price

Sticker prices are the ceiling. Three mechanisms, available on every major provider, set the real floor:

  • Prompt caching (−75% to −90% on repeated input). Anthropic charges ~0.1× the input price for cache reads, with a write premium of 1.25× (5-minute TTL) or 2× (1-hour TTL), so caching pays for itself on the second request with the 5-minute TTL and on the third with the 1-hour TTL, provided the requests share a prefix and arrive within the TTL. OpenAI prices cached input at −75% ($1.25 vs $5 on GPT-5.5). Google context caching reaches −90% on large repeated prompts. If your system prompt plus tool definitions exceed ~1K tokens and you serve more than a few requests a minute, not caching is simply donating money.
  • Batch APIs (−50%, everywhere). OpenAI Batch, Anthropic Message Batches and Google Batch all take 50% off for asynchronous processing with a ≤24h window — and most Anthropic batches complete in under an hour (limits: 100,000 requests or 256 MB per batch). Everything that is not user-facing — evals, enrichment, classification backfills, nightly summaries — belongs here.
  • Provisioned / committed capacity (−50% to −70% per token at high utilization). Azure OpenAI PTU starts around $2,448/month with monthly or annual reservations; AWS Bedrock sells provisioned throughput in model units, and since the OpenAI-AWS deal, AWS EDP discounts (typically 5–25%) reportedly apply to OpenAI consumption on Bedrock — worth verifying with your AWS account team. This is the RI/Savings Plan layer of AI spend, and almost nobody manages it like one yet.
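The caching break-even can be checked with a few lines of Python. This is a sketch under stated assumptions: costs are expressed in units of "one uncached send of the prefix", the first request pays the write premium, and every later request arrives within the TTL and is a cache hit. The function names are made up for the illustration.

```python
def cached_cost(n_requests: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cost of sending the same prefix n times with caching.

    Unit: one uncached send of the prefix. The first request pays the
    write premium; the remaining n - 1 requests pay the cache-read rate.
    Assumes every later request lands inside the cache TTL.
    """
    return write_multiplier + read_multiplier * (n_requests - 1)


def break_even_requests(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest n for which caching is strictly cheaper than n uncached sends."""
    if read_multiplier >= 1.0:
        raise ValueError("cache reads must be cheaper than uncached input")
    n = 1
    while cached_cost(n, write_multiplier, read_multiplier) >= n:
        n += 1
    return n


if __name__ == "__main__":
    print(break_even_requests(1.25))  # 5-minute TTL write premium
    print(break_even_requests(2.0))   # 1-hour TTL write premium
```

With a 1.25× write premium the loop stops at 2 requests, and with a 2× premium at 3, which is where the "2–3 requests" figure comes from.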

A worked example: 1M requests/month

A support assistant on Claude Sonnet 4.6: 1M requests/month, each with ~1,500 input tokens (1,000 of which are a stable system prompt + tool definitions) and ~400 output tokens.

| Line | Volume | Rate | Monthly cost |
|---|---|---|---|
| Input, no caching | 1,500 MTok | $3.00/MTok | $4,500 |
| Output | 400 MTok | $15.00/MTok | $6,000 |
| Naive total | | | $10,500 |
| Cached prefix (1,000 MTok at ~0.1×) | 1,000 MTok | ~$0.30/MTok | ~$300 |
| Uncached input remainder | 500 MTok | $3.00/MTok | $1,500 |
| Total with caching | | | ~$7,800 (−26%) |

Output dominates (it usually does), caching claws back a quarter of the bill for one engineering afternoon, and if a third of those requests could run as overnight batches, another ~$1,300 comes off, leaving about $6,500. The point of the exercise: the same workload costs anywhere from $10,500 to about $6,500 a month, a spread of roughly 1.6×, depending purely on how it is called, before changing models at all.
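The arithmetic of this worked example can be reproduced in a short Python sketch. Like the table, it ignores the cache-write premium and treats every request as a cache hit. It also assumes the 50% batch discount applies on top of the cached total, which is what the ~$1,300 figure implies. The constant and function names are illustrative.

```python
# Workload from the worked example (Claude Sonnet 4.6 list prices).
REQUESTS = 1_000_000
INPUT_TOKENS = 1_500          # per request
CACHED_PREFIX_TOKENS = 1_000  # stable system prompt + tool definitions
OUTPUT_TOKENS = 400           # per request
INPUT_PRICE = 3.00            # $ per MTok
OUTPUT_PRICE = 15.00          # $ per MTok
CACHE_READ_MULTIPLIER = 0.1   # cache reads at ~0.1x the input price
BATCH_DISCOUNT = 0.5          # 50% off for asynchronous batches


def monthly_cost(cache: bool, batch_share: float) -> float:
    """Monthly dollar cost; batch_share is the fraction of requests sent as batches."""
    millions_of_requests = REQUESTS / 1_000_000  # tokens/request then read as MTok
    if cache:
        uncached_tokens = INPUT_TOKENS - CACHED_PREFIX_TOKENS
        input_per_million = (
            CACHED_PREFIX_TOKENS * INPUT_PRICE * CACHE_READ_MULTIPLIER
            + uncached_tokens * INPUT_PRICE
        )
    else:
        input_per_million = INPUT_TOKENS * INPUT_PRICE
    output_per_million = OUTPUT_TOKENS * OUTPUT_PRICE
    total = millions_of_requests * (input_per_million + output_per_million)
    # Assumption: the batch discount applies to the whole cost of batched requests.
    return total * (1 - batch_share * BATCH_DISCOUNT)


if __name__ == "__main__":
    print(round(monthly_cost(cache=False, batch_share=0.0)))  # naive total
    print(round(monthly_cost(cache=True, batch_share=0.0)))   # with caching
    print(round(monthly_cost(cache=True, batch_share=1 / 3))) # caching + a third batched
```

The three printed totals are 10500, 7800 and 6500 dollars, matching the table and the batch estimate.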

Agents multiply everything

The per-request arithmetic above breaks down once agents enter. An agentic task replays system prompts, tool definitions and accumulated context on every step: industry analyses put agent token consumption at 5–30× a simple chatbot exchange for the same user intent, with a fully orchestrated agentic interaction around $1.20 — roughly 30× the 2023 cost of a chat turn (EY, 2026). And agent costs move violently: a prompt tweak, a model upgrade or a new tool can shift a team's spend by an order of magnitude week over week.
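A rough way to see where the multiplier comes from is to count the input tokens billed when every step re-sends the fixed prefix plus everything accumulated so far. The sketch below is a simplified model, not a measurement: the step counts, the 1,000-token prefix and the 500 tokens added per step are all hypothetical, and caching and output tokens are ignored.

```python
def agent_input_tokens(steps: int, fixed_prefix: int, tokens_added_per_step: int) -> int:
    """Total input tokens billed across an agent run.

    Every step re-sends the fixed prefix (system prompt + tool definitions)
    plus all context accumulated by earlier steps (tool results, reasoning).
    """
    total = 0
    context = 0
    for _ in range(steps):
        total += fixed_prefix + context
        context += tokens_added_per_step
    return total


if __name__ == "__main__":
    single_turn = agent_input_tokens(1, fixed_prefix=1_000, tokens_added_per_step=500)
    for steps in (4, 8):
        run = agent_input_tokens(steps, fixed_prefix=1_000, tokens_added_per_step=500)
        print(f"{steps} steps: {run} input tokens, {run / single_turn:.0f}x a single turn")
```

Under these assumptions a 4-step run bills 7,000 input tokens (7× a single turn) and an 8-step run bills 22,000 (22×). The growth is quadratic in the number of steps because the accumulated context is replayed each time.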

This is why "we will review the bill monthly" fails for AI spend specifically. By the time the invoice lands, the experiment that caused the spike has been running for three weeks.

Why hard caps beat surprise invoices

The FinOps Foundation's FinOps for AI guidance is explicit: usage limits and quotas per team and per project, paired with anomaly detection, are the baseline controls against runaway costs. The catch is *where* those controls live. FinOps platforms see the bill after the fact. AI gateways (LiteLLM, Kong, Portkey) can enforce budgets per key in real time — but they have no financial context: no view of the cloud bill, the GPU spend, the margin or the monthly budget those tokens draw from.

Cloudios closes that gap with a metering proxy that sits on the request path. Hard budget caps per project and per agent are enforced before the request is sent: when the cap is hit, the call is denied, not reported. Token quotas, cost-per-inference and carbon-per-inference are reconciled against the same platform that watches the rest of your cloud bill (on flat, self-serve pricing, not a percentage of the spend it protects). A denied request costs $0; a dashboard alert about last week's overrun costs whatever the overrun was.
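The general idea of a pre-request hard cap can be sketched in a few lines of Python. This is a generic illustration, not the Cloudios proxy or any gateway's API: `BudgetGuard`, `BudgetExceeded` and the dollar figures are hypothetical, the guard lives in a single process and is not thread-safe, and the caller is assumed to supply a worst-case cost estimate (for example, from the prompt size and the maximum output tokens).

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push spend past the cap; the call is never sent."""


@dataclass
class BudgetGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def reserve(self, estimated_cost_usd: float) -> None:
        """Check the cap BEFORE the request is sent and book the estimate."""
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"cap ${self.cap_usd:.2f} would be exceeded "
                f"(spent ${self.spent_usd:.2f}, request ~${estimated_cost_usd:.2f})"
            )
        self.spent_usd += estimated_cost_usd

    def reconcile(self, estimated_cost_usd: float, actual_cost_usd: float) -> None:
        """Replace the booked estimate with the actual cost once usage is known."""
        self.spent_usd += actual_cost_usd - estimated_cost_usd


if __name__ == "__main__":
    guard = BudgetGuard(cap_usd=1.00)
    allowed = 0
    # A hypothetical agent loop that would otherwise run 10 steps at ~$0.30 each.
    for _ in range(10):
        try:
            guard.reserve(0.30)
        except BudgetExceeded as err:
            print(f"denied: {err}")
            break
        allowed += 1  # the real LLM call would go here
    print(f"{allowed} calls allowed")
```

In this run three calls go through and the fourth is denied before it is sent. A production version would need shared, atomic state across workers; that is outside the scope of this sketch.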

If you only do three things after reading this: route requests by difficulty, cache your prompts, and put a hard cap on anything that can call an LLM in a loop. The prices in the tables will change; those three will not stop paying.