Comparing Large Language Models

This comprehensive analysis examines various Large Language Models (LLMs) across multiple performance dimensions. Each model is evaluated based on technical specifications, performance benchmarks, cost efficiency, and practical usability metrics. The data provides insights into the trade-offs between speed, quality, and cost across different AI providers.

| Model | Provider | Context Window | Speed (tokens/sec) | Latency (sec) | Benchmark (MMLU) | Benchmark (Chatbot Arena) | Open-Source | Price / Million Tokens | Training Dataset Size | Compute Power | Energy Efficiency | Quality Rating | Speed Rating | Price Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-9 | Anthropic | 128000 | 153 | 2.73 | 88 | 1425 | 0 | 16.45 | 516393975 | 24 | 0.15 | 3 | 2 | 3 |
| Claude-9 | Anthropic | 300000 | 33 | 10.59 | 69 | 1325 | 1 | 16.68 | 196323711 | 68 | 4.24 | 1 | 1 | 3 |
| Claude-9 | Anthropic | 2000000 | 278 | 8.08 | 82 | 1069 | 1 | 29.25 | 63094533 | 2 | 1.69 | 2 | 3 | 3 |
| Claude-9 | Anthropic | 200000 | 70 | 7.97 | 76 | 953 | 0 | 15.62 | 714488813 | 22 | 3.26 | 2 | 2 | 3 |
| Claude-9 | Anthropic | 1000000 | 53 | 9.36 | 61 | 919 | 1 | 7.5 | 149877971 | 33 | 4.01 | 1 | 2 | 3 |
| Command-9 | Cohere | 200000 | 200 | 1.96 | 73 | 924 | 0 | 8.11 | 79640080 | 49 | 2.56 | 1 | 3 | 3 |
| Command-9 | Cohere | 200000 | 152 | 10.17 | 80 | 1007 | 1 | 2.22 | 814111181 | 50 | 0.23 | 2 | 2 | 3 |
| Command-9 | Cohere | 2000000 | 213 | 9.36 | 93 | 1384 | 0 | 0.34 | 292636894 | 33 | 0.39 | 3 | 3 | 2 |
| Command-9 | Cohere | 300000 | 164 | 17.93 | 66 | 988 | 0 | 1.04 | 10527148 | 37 | 1.27 | 1 | 2 | 2 |
| Command-9 | Cohere | 256000 | 39 | 3.63 | 62 | 1060 | 1 | 23.7 | 208404732 | 97 | 0.64 | 1 | 1 | 3 |
| Command-9 | Cohere | 300000 | 162 | 8.33 | 78 | 1007 | 0 | 21.59 | 529462416 | 13 | 0.58 | 2 | 2 | 3 |
| Command-9 | Cohere | 1000000 | 243 | 8.35 | 77 | 1339 | 1 | 29.14 | 925626379 | 80 | 3.71 | 2 | 3 | 3 |
| Command-9 | Cohere | 200000 | 271 | 5.12 | 65 | 939 | 0 | 28.02 | 777995262 | 82 | 1.5 | 1 | 3 | 3 |
| DeepSeek-9 | Deepseek | 200000 | 117 | 18.78 | 62 | 1052 | 0 | 15.47 | 32582294 | 16 | 0.95 | 1 | 2 | 3 |
| Gemini-9 | Google | 300000 | 234 | 0.98 | 82 | 1334 | 0 | 14.42 | 758219200 | 19 | 4.82 | 2 | 3 | 3 |
| Gemini-9 | Google | 300000 | 38 | 6.58 | 76 | 1149 | 0 | 19.11 | 330954893 | 77 | 3.11 | 2 | 1 | 3 |
| Gemini-9 | Google | 128000 | 248 | 5.07 | 63 | 1396 | 0 | 6.62 | 272568110 | 36 | 4.88 | 1 | 3 | 3 |
| Gemini-9 | Google | 2000000 | 259 | 3.8 | 92 | 1467 | 1 | 2.86 | 113913164 | 80 | 2.58 | 3 | 3 | 3 |
| GPT-9 | OpenAI | 300000 | 252 | 1.79 | 69 | 1392 | 1 | 14.97 | 414501952 | 41 | 2.33 | 1 | 3 | 3 |
| GPT-9 | OpenAI | 1000000 | 82 | 19.31 | 79 | 910 | 1 | 7.45 | 311027380 | 16 | 3.9 | 2 | 2 | 3 |
| GPT-9 | OpenAI | 2000000 | 152 | 16.87 | 89 | 1209 | 0 | 29.08 | 879108142 | 16 | 0.2 | 3 | 2 | 3 |
| Llama-9 | Meta AI | 300000 | 76 | 3.54 | 90 | 1116 | 0 | 14.77 | 510792999 | 72 | 3.07 | 3 | 2 | 3 |
| Llama-9 | Meta AI | 128000 | 242 | 4.29 | 82 | 1397 | 0 | 13.68 | 777268711 | 37 | 1.9 | 2 | 3 | 3 |
| Nova-9 | AWS | 200000 | 155 | 14.32 | 84 | 1365 | 1 | 10.55 | 518732111 | 48 | 4.92 | 2 | 2 | 3 |
| Nova-9 | AWS | 200000 | 20 | 17.57 | 91 | 1416 | 0 | 5.13 | 435699137 | 70 | 3.9 | 3 | 1 | 3 |
Table 1: Comprehensive LLM performance metrics for the most recent models, sorted by model name
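
For readers who prefer to explore the data programmatically, here is a minimal sketch that loads a representative subset of Table 1 into a pandas DataFrame. The column names are abbreviated for readability and are my own choice, not an official schema; the rows shown are copied from the table above but are only a sample. The walkthrough sketches later on this page reuse this `df`.

```python
import pandas as pd

# A representative subset of Table 1; column names abbreviated.
# mmlu = Benchmark (MMLU), arena = Benchmark (Chatbot Arena),
# price = Price / Million Tokens, *_rating = the 1-3 categorical ratings.
df = pd.DataFrame(
    [
        ("Claude-9",  "Anthropic", 128000,  153,  2.73, 88, 1425, 16.45, 3, 2, 3),
        ("Command-9", "Cohere",    2000000, 213,  9.36, 93, 1384,  0.34, 3, 3, 2),
        ("Command-9", "Cohere",    200000,  200,  1.96, 73,  924,  8.11, 1, 3, 3),
        ("Gemini-9",  "Google",    2000000, 259,  3.80, 92, 1467,  2.86, 3, 3, 3),
        ("GPT-9",     "OpenAI",    2000000, 152, 16.87, 89, 1209, 29.08, 3, 2, 3),
        ("Llama-9",   "Meta AI",   300000,   76,  3.54, 90, 1116, 14.77, 3, 2, 3),
        ("Nova-9",    "AWS",       200000,   20, 17.57, 91, 1416,  5.13, 3, 1, 3),
    ],
    columns=["model", "provider", "context", "speed", "latency",
             "mmlu", "arena", "price",
             "quality_rating", "speed_rating", "price_rating"],
)
print(df.head())
```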

Glossary

Context Window
The maximum number of tokens (roughly, words or word pieces) that the model can process in a single request. A larger context window allows the model to understand and respond to longer conversations or documents.
Speed (tokens/sec)
The rate at which the model generates output, measured in tokens per second. Higher speeds mean faster response times, which is crucial for real-time applications.
Latency (sec)
The time delay between sending a request and receiving the first token of the response. Lower latency indicates quicker initial response times.
Benchmark (MMLU)
Massive Multitask Language Understanding - a comprehensive test measuring the model's knowledge across 57 subjects including mathematics, history, computer science, and more. Higher scores indicate better general knowledge.
Benchmark (Chatbot Arena)
A crowdsourced evaluation platform where real users compare model responses. Higher scores indicate better performance in real-world conversation scenarios.
Open-Source
Indicates whether the model's code and weights are publicly available (1) or proprietary (0). Open-source models can be modified and self-hosted.
Price / Million Tokens
The cost in dollars to process one million tokens. This metric helps evaluate the economic feasibility of using the model at scale; a worked cost example follows this glossary.
Training Dataset Size
The number of tokens used to train the model. Larger datasets generally lead to better performance but require more computational resources.
Compute Power
A relative measure of the computational resources required to run the model. Higher values indicate more intensive processing requirements.
Energy Efficiency
A measure of how much energy the model consumes relative to its output. Lower values indicate better energy efficiency, which is important for environmental sustainability.
Quality Rating
An overall assessment of the model's output quality on a scale of 1-3, where 3 represents the highest quality.
Speed Rating
A categorical rating (1-3) of the model's response speed, where 3 is fastest. This helps users quickly identify models suitable for time-sensitive applications.
Price Rating
A categorical rating (1-3) of the model's cost-effectiveness, where lower numbers indicate better value for money.
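
To make the Price / Million Tokens metric concrete, here is a small sketch that estimates the dollar cost of a single request as cost = (tokens / 1,000,000) x price. The request size of 2,000 tokens is a made-up example; the two prices come from Table 1.

```python
def request_cost(tokens: int, price_per_million: float) -> float:
    """Estimate the dollar cost of processing `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million

# Example: a 2,000-token request against two prices from Table 1.
print(request_cost(2_000, 0.34))   # cheapest Command-9 variant: $0.00068
print(request_cost(2_000, 29.25))  # priciest Claude-9 variant:  $0.0585
```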

Walkthroughs: How to use this table

Walkthrough #1: Finding the Best High-Quality Model

Looking for a top-tier model? Start with the Quality Rating column. Models with a rating of 3 produce the highest-quality outputs, so scan that column first to identify premium options. Then check the MMLU benchmark scores to verify academic performance: scores above 85 indicate exceptional knowledge breadth.
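
Expressed against the `df` sketch built after Table 1 (an illustrative subset, not an official API), this walkthrough reduces to a single filter:

```python
# Highest-quality models: Quality Rating of 3, verified by MMLU > 85.
top_quality = df[(df["quality_rating"] == 3) & (df["mmlu"] > 85)]
print(top_quality.sort_values("mmlu", ascending=False)
                 [["model", "provider", "mmlu", "arena"]])
```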

Walkthrough #2: Balancing Speed and Cost

Start with the Speed Rating column, where a rating of 3 indicates rapid token generation. Then cross-reference the Price / Million Tokens column. The best value comes from models that combine a speed rating of 3 with prices under $10 per million tokens; for example, some Command-9 variants offer excellent speed at competitive prices.
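
The same filter in code, again reusing the illustrative `df` from the sketch after Table 1 (the $10 threshold is the one suggested in the walkthrough above):

```python
# Fast and affordable: Speed Rating of 3 with price under $10 / M tokens.
speed_demons = df[(df["speed_rating"] == 3) & (df["price"] < 10)]
print(speed_demons[["model", "provider", "speed", "price"]])
```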