LLM leaderboard

Compare large language models on performance, price, and more to find the best match for your needs.

Leaderboard
Largest context
  1. Gemini 2.5 Pro
  2. Gemini 2.0 Flash
  3. Gemini 2.0 Flash-Lite
Highest output tokens
  1. o1-pro
  2. o1
  3. o3-mini
Least expensive
  1. R1 Distill LLama 8B
  2. Ministral 3B
  3. Gemini 1.5 Flash-8B
Model comparison

| Model | Input price / 1M tokens | Output price / 1M tokens | Context window | Output token limit |
|---|---|---|---|---|
| Gemini 1.5 Flash-8B | $0.04 | $0.15 | 1,000,000 | 8,192 |
| Ministral 3B | $0.04 | $0.04 | 128,000 | 4,096 |
| R1 Distill Llama 8B | $0.04 | $0.04 | 128,000 | 8,000 |
| Qwen Turbo | $0.05 | $0.20 | 1,000,000 | 8,192 |
| GPT-5 Nano | $0.05 | $0.40 | 128,000 | 16,384 |
| Coder V2 Lite | $0.06 | $0.18 | 128,000 | 8,000 |
| Gemini 2.0 Flash-Lite | $0.07 | $0.30 | 1,000,000 | 8,192 |
| Gemini 1.5 Flash | $0.07 | $0.30 | 1,000,000 | 8,192 |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1,000,000 | 8,192 |
| Llama 3.1 8B | $0.10 | $0.10 | 128,000 | 2,048 |
| Ministral 8B | $0.10 | $0.10 | 128,000 | 4,096 |
| GPT-4.1 Nano | $0.10 | $0.40 | 128,000 | 16,384 |
| Gemma 2 9B | $0.12 | $0.15 | 8,000 | 8,192 |
| Coder V2 | $0.14 | $0.28 | 128,000 | 8,000 |
| GPT-4o mini | $0.15 | $0.60 | 128,000 | 16,384 |
| GPT-4o mini Audio | $0.15 | $0.60 | 128,000 | 16,384 |
| Gemma 2 27B | $0.17 | $0.51 | 8,000 | 8,192 |
| Mistral Saba | $0.20 | $0.60 | 32,000 | 4,096 |
| Grok 4 Fast | $0.20 | $0.50 | 128,000 | 8,192 |
| Claude 3 Haiku | $0.25 | $1.25 | 200,000 | 4,096 |
| GPT-5 Mini | $0.25 | $2.00 | 128,000 | 16,384 |
| V3 | $0.27 | $1.10 | 128,000 | 8,000 |
| Codestral | $0.30 | $0.90 | 128,000 | 4,096 |
| R1 Distill Qwen 32B | $0.30 | $0.30 | 128,000 | 8,000 |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1,000,000 | 64,000 |
| Grok 3 Mini | $0.30 | $0.50 | 128,000 | 8,192 |
| GPT-4.1 Mini | $0.40 | $1.60 | 128,000 | 16,384 |
| GPT-3.5 Turbo | $0.50 | $1.50 | 16,385 | 4,096 |
| Llama 2 Chat | $0.50 | $0.25 | 4,096 | 2,048 |
| QwQ 32B | $0.55 | $0.75 | 131,000 | 8,192 |
| DeepSeek Reasoner | $0.55 | $2.19 | 64,000 | 8,000 |
| Llama 3.3 70B | $0.59 | $0.70 | 128,000 | 2,048 |
| GPT-4o mini Realtime | $0.60 | $2.40 | 128,000 | 4,096 |
| Llama 3.2 | $0.60 | $0.60 | 128,000 | 2,048 |
| R1 Distill Llama 70B | $0.72 | $0.99 | 128,000 | 8,000 |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200,000 | 8,192 |
| Qwen 2.5 Coder 32B | $0.80 | $0.80 | 131,000 | 8,192 |
| R1 Distill Qwen 14B | $0.88 | $0.88 | 128,000 | 8,000 |
| Sonar Reasoning | $1.00 | $5.00 | 127,000 | N/A |
| Sonar | $1.00 | $1.00 | 127,000 | N/A |
| Claude 4.5 Haiku | $1.00 | $5.00 | 200,000 | 8,192 |
| o3-mini | $1.10 | $4.40 | 200,000 | 100,000 |
| o1-mini | $1.10 | $4.40 | 128,000 | 65,536 |
| o4-mini | $1.10 | $4.40 | 200,000 | 65,536 |
| Gemini 2.5 Pro | $1.25 | $10.00 | 2,000,000 | 64,000 |
| GPT-5 | $1.25 | $10.00 | 128,000 | 16,384 |
| GPT-5.1 | $1.25 | $10.00 | 128,000 | 16,384 |
| GPT-5.1 Codex | $1.25 | $10.00 | 128,000 | 16,384 |
| Qwen 2.5 Max | $1.60 | $6.40 | 32,000 | 8,192 |
| Mistral Large | $2.00 | $6.00 | 128,000 | 4,096 |
| Pixtral Large | $2.00 | $6.00 | 128,000 | 4,096 |
| Sonar Reasoning Pro | $2.00 | $8.00 | 128,000 | N/A |
| Sonar Deep Research | $2.00 | $8.00 | 200,000 | N/A |
| Gemini 3 Pro | $2.00 | $12.00 | 1,000,000 | 64,000 |
| GPT-4.1 | $2.00 | $8.00 | 128,000 | 16,384 |
| GPT-4o | $2.50 | $10.00 | 128,000 | 16,384 |
| GPT-4o Audio | $2.50 | $10.00 | 128,000 | 16,384 |
| Claude 3.7 Sonnet | $3.00 | $15.00 | 200,000 | 8,192 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200,000 | 8,192 |
| Sonar Pro | $3.00 | $15.00 | 200,000 | N/A |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 200,000 | 8,192 |
| Grok 3 | $3.00 | $15.00 | 128,000 | 8,192 |
| Grok 4 | $3.00 | $15.00 | 128,000 | 8,192 |
| Llama 3.1 405B | $3.50 | $3.50 | 128,000 | 2,048 |
| GPT-4o Realtime | $5.00 | $20.00 | 128,000 | 4,096 |
| GPT-4 Turbo | $10.00 | $30.00 | 128,000 | 4,096 |
| o3 | $10.00 | $40.00 | 200,000 | 100,000 |
| o3 Deep Research | $10.00 | $40.00 | 200,000 | 100,000 |
| o1 | $15.00 | $60.00 | 200,000 | 100,000 |
| Claude 3 Opus | $15.00 | $75.00 | 200,000 | 4,096 |
| Claude Opus 4 | $15.00 | $75.00 | 200,000 | 4,096 |
| GPT-5 Pro | $15.00 | $120.00 | 128,000 | 16,384 |
| o3 Pro | $20.00 | $80.00 | 200,000 | 100,000 |
| GPT-4 | $30.00 | $60.00 | 8,192 | 8,192 |
| GPT-4.5 | $75.00 | $150.00 | 128,000 | 16,384 |
| o1-pro | $150.00 | $600.00 | 200,000 | 100,000 |
| Gemma 3 1B | N/A | N/A | 32,000 | 8,192 |
| Gemma 3 27B | N/A | N/A | 128,000 | 8,192 |
| Qwen 2.5 72B | N/A | N/A | 131,000 | 8,192 |
Key definitions
Price: Input price per 1M tokens is the cost of processing each token in the prompt sent to the model; output price per 1M tokens is the cost of each token the model generates in response. The price shown in the leaderboard section is a blended price, assuming a typical 3:1 ratio of input to output tokens. A price of 0 can indicate a limited free trial.
Context window: The maximum amount of text (tokens) the model can process at once, including both the input and the generated output. It determines how much prior conversation or document history the model can "remember" within a single interaction.
Output token limit: The maximum number of tokens an LLM can generate in a single response. This limit is constrained by the model's context window and by provider policies, and it caps the length of the output.
Reasoning model: A reasoning LLM goes beyond pattern recognition to perform logical inference and problem-solving. This covers tasks such as complex mathematics, planning, and generating "chain of thought" explanations that mimic human-like cognitive processes. Essentially, it aims to understand and solve problems, not just reproduce text.
Open source: Some LLMs are published under an open-source license, allowing developers to access and modify the code and to host the models themselves, on premises or in the cloud. Others, such as Mistral's, are available for self-hosting under a commercial license.

LLM Leaderboard FAQ

What are Large Language Models (LLMs)?

Large Language Models (LLMs) are AI systems trained on massive amounts of text data. They learn patterns, relationships, and structures in language, allowing them to generate human-like text, translate languages, answer questions, and perform many other language-based tasks. They typically use neural networks, most often transformer architectures. Modern LLMs can handle a wide range of tasks, including writing, translation, summarization, question answering, code generation, and reasoning.

How do LLMs work?

LLMs are large neural networks, typically transformer architectures, that process and generate text. During training, the network learns to identify patterns and relationships within massive text datasets by accomplishing tasks such as predicting the next word in a sequence. When prompted, it uses its learned parameters to generate coherent, relevant text for the provided context.
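
At its core, generation is repeated next-token prediction: the network assigns a score (logit) to every vocabulary token, the scores are normalized into probabilities, and a token is emitted. A toy sketch, where the four-word vocabulary and the logit values are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend the model just read "The cat sat on the" and scored each candidate.
vocab = ["mat", "dog", "moon", "chair"]
logits = [3.1, 0.2, -1.0, 1.5]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding picks "mat"
print(next_token)
```

Real models repeat this step token by token, appending each emitted token to the context before predicting the next one.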

How are LLMs trained?

LLM training is a two-stage process: pre-training and fine-tuning. During pre-training, the model learns general language patterns from raw text. During fine-tuning, it learns to accomplish a variety of tasks, such as answering questions, summarizing text, generating code, and identifying entities, and, importantly, to follow human instructions.

What can LLMs do?

LLMs can solve a wide range of natural language understanding and generation tasks. They can generate content, from business emails to children's stories, extract information from the provided context, summarize and analyze text, and translate between languages.

How can you access LLMs?

LLMs can be accessed in several ways: directly through a provider's console or API, by self-hosting, or through cloud providers' platforms.

Using the provider console: ChatGPT, Gemini AI Studio, and Claude let you chat with their state-of-the-art models under free or paid plans.

APIs: Almost all providers (OpenAI, Gemini, Claude, etc.) offer their LLMs as a paid service via APIs. This is the easiest way to integrate AI capabilities into your own applications.

Self-hosting: Research teams have released open-source LLMs (DeepSeek-R1, Gemma 3, Llama 3.3, Mistral, Phi-4) that you can download and run on a local machine or on cloud instances. This gives more control over the model, but may require significant computational resources and technical expertise. That said, almost all open-source families include smaller variants (with fewer parameters) distilled from the larger models, such as DeepSeek-R1 1.5B and Llama 3 8B. These can run on local machines with decent specifications, sometimes even without GPUs. You can interact with them on the command line using tools like llama.cpp and Ollama, or through a web interface such as Open WebUI or text-generation-webui.

Cloud-based platforms: Cloud providers (e.g., AWS, Azure, GCP) offer managed services that let you deploy and run LLMs on their infrastructure, balancing control and ease of use.
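
Whichever route you choose, most APIs accept a similar chat-style request shape. A minimal sketch that only assembles the JSON payload (the model name here is a placeholder; a real call would POST this to the provider's endpoint with your API key):

```python
import json

def build_chat_request(model, user_message, max_tokens=256):
    """Assemble an OpenAI-style chat-completions payload. No network call is made."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,  # caps the number of output tokens billed
    }

payload = build_chat_request("example-model", "Summarize this leaderboard.")
print(json.dumps(payload, indent=2))
```

Self-hosted servers such as Ollama and many cloud endpoints expose compatible request shapes, which is why switching providers often only means changing the URL and model name.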

How much do LLMs cost?

It depends on the deployment method and expected usage.

Using LLMs through provider APIs incurs usage-based costs, billed by token consumption.

Self-hosting open-source LLMs may require substantial computational resources, particularly GPUs. This approach offers greater control and privacy, but the GPUs need to be highly utilized for it to be cheaper than using provider APIs.

Cloud providers offer managed LLM services that balance cost and control. Pricing models vary, but often involve a combination of compute and usage-based charges.
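
A back-of-the-envelope comparison makes the API-versus-self-hosting trade-off concrete. All numbers below are illustrative assumptions, not real quotes:

```python
def api_monthly_cost(tokens_in, tokens_out, in_price, out_price):
    """Monthly API cost, with prices given per 1M tokens."""
    return (tokens_in * in_price + tokens_out * out_price) / 1_000_000

def self_host_monthly_cost(gpu_hourly_rate, hours=730):
    """Monthly cost of renting a GPU instance around the clock (~730 hours)."""
    return gpu_hourly_rate * hours

# Assumed workload: 200M input and 50M output tokens per month
# at $0.50 / $1.50 per 1M tokens, versus an assumed $1.20/hour GPU instance.
api = api_monthly_cost(200_000_000, 50_000_000, 0.50, 1.50)
gpu = self_host_monthly_cost(1.20)
print(f"API: ${api:,.2f}/month, self-hosted GPU: ${gpu:,.2f}/month")
```

Under these assumptions the API is cheaper; self-hosting only wins when the GPU is kept busy with a much larger token volume, which is the utilization point made above.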

How do LLMs compare to traditional NLP techniques?

LLMs represent a significant advance over traditional Natural Language Processing (NLP) techniques. They generalize better, which lets them solve very diverse tasks effectively. Being trained on large text corpora, they embed strong priors about natural language and produce human-like text. Finally, their computations map well onto GPUs, which are highly parallel processors.

What is the difference between GPT and an LLM?

GPT (Generative Pre-trained Transformer) is a specific family of large language models developed by OpenAI, whereas LLM is the generic term for any large language model. GPT designates a particular implementation and architecture, so every GPT model is an LLM, but not every LLM is a GPT.

What do model sizes like 7B or 70B mean?

These figures are the number of parameters in the LLM, in billions. LLMs are neural networks built from matrices; the parameter count is the total number of elements in those matrices. Parameters are learned during training and then used to make predictions. The more parameters, the more complex the patterns the model can learn, and in most cases parameter count correlates with performance. However, more parameters also require more computational resources to run the model, essentially GPU memory and bandwidth.
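
As a rough rule of thumb, the weights alone need about parameters times bytes-per-parameter of memory. The sketch below ignores activation and KV-cache memory, so treat it as a lower bound:

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory for the model weights alone.
    bytes_per_param: 2 for fp16/bf16, 1 for 8-bit, 0.5 for 4-bit quantization."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(8))       # an 8B model in fp16: 16.0 GB
print(weight_memory_gb(8, 0.5))  # the same model 4-bit quantized: 4.0 GB
```

This is why quantized small models fit on consumer hardware while frontier-scale models need multi-GPU servers.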


Are LLMs multimodal?

While earlier LLMs were trained primarily on vast amounts of text, newer models are also trained on non-textual data such as images, audio, and video, thanks to specialized encoders that act as translators. These encoders convert images or audio into numerical representations (vector embeddings) that the LLM can process. On the output side, corresponding decoders generate text, and may also generate images and audio. Modern models can therefore support multiple modalities and combine them to solve complex tasks.