The cost of running large language models has dropped by roughly 90% since early 2024, but pricing structures have also become more complex. Input tokens, output tokens, cached tokens, batch discounts, and context window surcharges all affect your actual bill. A model that looks cheap on paper can turn expensive depending on how you use it.
This comparison breaks down real-world pricing for the most widely used LLMs in 2026, so you can estimate costs before committing to an API provider. Use the [LLM Pricing Calculator](/tools/llm-pricing-calculator) to run your own numbers based on your expected usage.
## LLM Pricing Table: Cost Per Million Tokens (April 2026)
The table below compares input and output pricing for the major LLM providers. All prices are in USD per million tokens.
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Notes |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | Most popular commercial model |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K | Best budget option from OpenAI |
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | 200K | Highest reasoning capability |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | 200K | Best price/performance ratio |
| Claude Haiku 3.5 | Anthropic | $0.80 | $4.00 | 200K | Fast, affordable |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1M | Largest context window |
| Gemini 2.5 Flash | Google | $0.15 | $0.60 | 1M | Budget model with huge context |
| Llama 3.3 70B | Meta (self-hosted) | ~$0.40 | ~$0.40 | 128K | Open-weight, hosting costs vary |
| Mistral Large | Mistral | $2.00 | $6.00 | 128K | Strong European alternative |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 128K | Aggressive pricing from China |
Prices reflect standard API rates as of April 2026. Batch processing, prompt caching, and volume commitments can reduce costs by 25-50%. Self-hosted model costs depend on your GPU infrastructure.
A few patterns stand out. OpenAI and Anthropic charge significantly more for output tokens than input tokens, typically at a 3-5x ratio. Google follows a similar pattern. Self-hosted open models like Llama charge roughly the same for both directions because the cost is your GPU time, not per-token billing.
## Understanding the Real Cost: Input vs Output Tokens
The headline per-million-token price is only part of the story. What actually determines your monthly bill is the _ratio_ of input to output tokens in your application.
A chatbot that sends long system prompts with every request but gets short answers is input-heavy. A code generation tool that sends a brief instruction and gets back hundreds of lines of code is output-heavy. Since output tokens cost 3-5x more than input tokens on most providers, the code generation tool will be dramatically more expensive per request, even if the total token count is similar.
Here is a concrete example. Suppose you process 10 million tokens per month, split 70/30 between input and output:
- GPT-4o: (7M x $2.50 + 3M x $10.00) / 1M = $17.50 + $30.00 = $47.50/month
- Claude Sonnet 4: (7M x $3.00 + 3M x $15.00) / 1M = $21.00 + $45.00 = $66.00/month
- Gemini 2.5 Pro: (7M x $1.25 + 3M x $10.00) / 1M = $8.75 + $30.00 = $38.75/month
- DeepSeek V3: (7M x $0.27 + 3M x $1.10) / 1M = $1.89 + $3.30 = $5.19/month
The price difference between the cheapest and most expensive option is over 12x. That gap widens as your volume increases.
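The arithmetic above generalizes to any model in the table. Here is a minimal Python sketch; the prices are copied from the comparison table, and the 70/30 split is the same assumption used in the examples:

```python
def monthly_cost(input_m, output_m, input_price, output_price):
    """Estimated monthly spend in USD, given token volumes in millions
    and per-million-token prices."""
    return input_m * input_price + output_m * output_price

# Prices (USD per 1M tokens) from the table above: (input, output).
PRICING = {
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "DeepSeek V3": (0.27, 1.10),
}

# 10M tokens/month, split 70/30 between input and output.
for model, (inp, out) in PRICING.items():
    print(f"{model}: ${monthly_cost(7, 3, inp, out):.2f}/month")
```

Swap in your own input/output split to see how output-heavy workloads shift the ranking.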
Use the [AI Token Counter](/tools/ai-token-counter) to measure exactly how many tokens your prompts and responses consume before estimating costs.
## Prompt Caching and Batch Discounts
Most providers now offer prompt caching, which can cut input costs by 50-90% for repetitive system prompts. If your application sends the same system prompt with every request (as most chatbots and agents do), cached tokens are billed at a fraction of the standard rate.
Anthropic's prompt caching is opt-in: you mark stable prefixes, such as a long system prompt, with cache breakpoints. Cached input tokens are then read at a 90% discount, with a small surcharge on cache writes. For applications with long, stable system prompts, this makes Claude significantly cheaper than the headline price suggests.
OpenAI offers a similar caching mechanism that applies automatically to repeated prefixes in your prompts. Cached tokens are billed at 50% of the standard input price.
Google provides context caching for Gemini models, where you can explicitly cache a large context (like a codebase or document set) and reference it across multiple requests.
Batch processing is the other major discount lever. Both OpenAI and Anthropic offer batch APIs that process requests asynchronously (typically within 24 hours) at 50% of the standard price. If your workload is not time-sensitive, batch processing cuts your bill in half for little more effort than routing requests through the batch endpoint.
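Caching and batching compound, so it helps to compute a blended input price. The sketch below assumes an 80% cache hit rate and a flat 90% cache discount; both are illustrative assumptions, and your actual hit rate depends on how stable your prompts are:

```python
def effective_input_price(base_price, cache_hit_rate, cache_discount, batch=False):
    """Blended per-1M-token input price, given the share of input tokens
    served from cache, the provider's cache discount, and an optional
    50% batch discount."""
    cached_price = base_price * (1 - cache_discount)
    blended = cache_hit_rate * cached_price + (1 - cache_hit_rate) * base_price
    return blended * (0.5 if batch else 1.0)

# Claude Sonnet 4 input at $3.00, 90% cache discount, 80% of tokens cached.
print(f"${effective_input_price(3.00, 0.80, 0.90):.2f}")              # real-time
print(f"${effective_input_price(3.00, 0.80, 0.90, batch=True):.2f}")  # batched
```

Under these assumptions, a nominal $3.00 input price drops below $1.00 before batching is even applied.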
## Open Models: Llama, Mistral, and the Self-Hosting Equation
Open-weight models like Llama 3.3 and Mistral Large are "free" to use but not free to run. The cost shifts from per-token API fees to GPU infrastructure.
Self-hosting makes financial sense when you process enough volume that API costs exceed infrastructure costs. The crossover point depends on your hardware, but a rough guideline:
- Under 50M tokens/month: API providers are cheaper. The overhead of managing GPU infrastructure (provisioning, scaling, monitoring, updates) is not worth it.
- 50-500M tokens/month: Run the numbers carefully. A dedicated A100 GPU costs roughly $1.50-2.00/hour on major cloud providers. At high utilization, self-hosting can be 3-5x cheaper than API pricing.
- Over 500M tokens/month: Self-hosting almost certainly saves money. At this scale, many organizations buy or lease dedicated GPU hardware rather than using cloud instances.
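A quick way to estimate your own crossover point is to divide the fixed monthly GPU cost by the blended API price you would otherwise pay. The numbers below are illustrative, and the formula ignores the GPU's throughput ceiling — the hardware still has to be able to serve that volume:

```python
def breakeven_volume_m(gpu_cost_per_hour, api_price_per_1m, hours_per_month=730):
    """Monthly volume (millions of tokens) at which a dedicated GPU's
    fixed cost equals what the same tokens would cost via an API."""
    return gpu_cost_per_hour * hours_per_month / api_price_per_1m

# A100 at $1.75/hour vs a blended API price of $2.00 per 1M tokens.
print(f"{breakeven_volume_m(1.75, 2.00):.0f}M tokens/month")
```

With these assumptions the breakeven lands just above the 500M tokens/month threshold in the guideline above, which is why the 50-500M band calls for running the numbers carefully.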
There are non-financial reasons to self-host as well. Data privacy (your data never leaves your infrastructure), latency control (no network round-trip to an API), and independence from provider rate limits or policy changes. For regulated industries like healthcare and finance, self-hosting may be a compliance requirement regardless of cost.
The performance gap between open and closed models has narrowed substantially. Llama 3.3 70B matches or exceeds GPT-4o on many benchmarks, and Mistral Large competes directly with Claude Sonnet on coding and reasoning tasks.
## Which Model Should You Choose? Decision Framework
Cost is one factor, but not the only one. Here is a practical decision framework:
Choose GPT-4o if you need the broadest ecosystem support. OpenAI has the most third-party integrations, the largest community, and the most battle-tested API. The pricing is middle-of-the-road.
Choose Claude Sonnet 4 if your application requires long-context reasoning (200K tokens), strong instruction following, or nuanced writing. Anthropic models tend to produce fewer hallucinations on factual queries. The 200K context window is genuinely useful for document analysis and code review.
Choose Gemini 2.5 Pro if you need the largest context window (1M tokens) or if you are already in the Google Cloud ecosystem. The pricing is competitive, and the 1M context window enables use cases that other models cannot handle without chunking.
Choose DeepSeek V3 if cost is your primary constraint. At roughly 1/10th the price of GPT-4o for output tokens, DeepSeek makes high-volume applications economically viable. The trade-off is less extensive documentation and a smaller ecosystem.
Choose Llama 3.3 or Mistral if you need full control over your infrastructure, have strict data privacy requirements, or process enough volume to make self-hosting cost-effective.
For most teams starting out, GPT-4o mini or Gemini 2.5 Flash is the best choice: capable enough for production use cases, cheap enough that cost is not a concern during development.
## Practical Tips for Reducing LLM Costs
Regardless of which provider you choose, these strategies reduce your token spend without sacrificing quality:
- Measure before optimizing. Use the [AI Token Counter](/tools/ai-token-counter) to understand where your tokens go. Many teams discover that 60-80% of their token usage comes from system prompts, not user messages.
- Compress your system prompts. Remove verbose instructions and examples that do not measurably improve output quality. A/B test shorter prompts against longer ones.
- Use the smallest model that works. Route simple tasks (classification, extraction, formatting) to cheap models like GPT-4o mini or Haiku, and reserve expensive models for complex reasoning. This "model routing" pattern can reduce costs by 70%+.
- Cache aggressively. If your system prompt is stable across requests, make sure you are benefiting from prompt caching. Check your provider's documentation for how caching is triggered.
- Batch when possible. Non-urgent workloads (analytics, content generation, data processing) should always use batch APIs at a 50% discount.
- Set max_tokens limits. Prevent runaway output by setting explicit token limits on responses. A model generating 2,000 tokens when you needed 200 is a 10x cost multiplier on the output side.
- Monitor and alert. Set up billing alerts with your provider. A bug in your application that creates an infinite loop of API calls can generate a four-figure bill overnight.
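The model-routing pattern from the list above can start out as a simple lookup table. This is a hypothetical sketch — the task categories and model names are placeholders for your own:

```python
# Tasks simple enough for a budget model (hypothetical categories).
CHEAP_TASKS = {"classification", "extraction", "formatting"}

def route_model(task_type: str) -> str:
    """Send simple, well-defined tasks to a cheap model and reserve
    the expensive model for open-ended reasoning."""
    return "gpt-4o-mini" if task_type in CHEAP_TASKS else "gpt-4o"

print(route_model("extraction"))       # budget model
print(route_model("code_generation"))  # full model
```

In production, routing is usually driven by a lightweight classifier or explicit request metadata rather than a hardcoded set, but the cost logic is the same.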
## FAQ
### How much does it cost to run a chatbot on GPT-4o?
A typical chatbot interaction uses 500-2,000 input tokens (system prompt + conversation history + user message) and 200-500 output tokens (response). At GPT-4o pricing, that is roughly $0.003-0.010 per interaction. For 10,000 interactions per month, expect $30-100/month. Prompt caching can reduce this by 50% if your system prompt is consistent across conversations.
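As a sanity check on those numbers, the per-interaction arithmetic looks like this; the 1,500 input and 350 output tokens below are a mid-range assumption within the stated ranges:

```python
def cost_per_interaction(input_tokens, output_tokens,
                         input_price=2.50, output_price=10.00):
    """Cost in USD of a single chatbot turn at GPT-4o list prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A mid-range turn: 1,500 input tokens, 350 output tokens.
print(cost_per_interaction(1500, 350))  # roughly $0.007 per turn
```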
### Is DeepSeek safe to use for production applications?
DeepSeek's models are technically capable, but there are legitimate considerations. The company is based in China, which matters for data sovereignty in some jurisdictions. The API has experienced more downtime than established providers. And the ecosystem (SDKs, integrations, community support) is smaller. For cost-sensitive applications without strict data residency requirements, DeepSeek is a viable option.
### Why are output tokens so much more expensive than input tokens?
Generating output tokens requires sequential computation: each token depends on all previous tokens. Input tokens can be processed in parallel. This fundamental architectural difference means output generation uses more GPU time per token, which providers pass through as higher pricing.
### Can I switch between providers without rewriting my application?
Most LLM providers support the OpenAI-compatible API format, which means switching often requires only changing the base URL and API key. However, prompt behavior varies between models: a prompt optimized for GPT-4o may produce different results on Claude or Gemini. Budget time for prompt testing when switching providers.
### What is the cheapest way to use AI for a startup?
Start with GPT-4o mini or Gemini Flash for development and testing. These models cost under $1 per million input tokens and handle most use cases adequately. Only upgrade to more expensive models when you have specific quality requirements that cheaper models cannot meet. Use batch APIs for any non-real-time processing, and implement model routing to send simple tasks to cheap models.