This comprehensive analysis examines various Large Language Models (LLMs) across multiple performance dimensions. Each model is evaluated based on technical specifications, performance benchmarks, cost efficiency, and practical usability metrics. The data provides insights into the trade-offs between speed, quality, and cost across different AI providers.
Click on any model name to see detailed information!
| Model | Provider | Context Window | Speed (tokens/sec) | Latency (sec) | Benchmark (MMLU) | Benchmark (Chatbot Arena) | Open-Source | Price / Million Tokens | Training Dataset Size | Compute Power | Energy Efficiency | Quality Rating | Speed Rating | Price Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-9 | Anthropic | 128000 | 153 | 2.73 | 88 | 1425 | 0 | 16.45 | 516393975 | 24 | 0.15 | 3 | 2 | 3 |
| Claude-9 | Anthropic | 300000 | 33 | 10.59 | 69 | 1325 | 1 | 16.68 | 196323711 | 68 | 4.24 | 1 | 1 | 3 |
| Claude-9 | Anthropic | 2000000 | 278 | 8.08 | 82 | 1069 | 1 | 29.25 | 63094533 | 2 | 1.69 | 2 | 3 | 3 |
| Claude-9 | Anthropic | 200000 | 70 | 7.97 | 76 | 953 | 0 | 15.62 | 714488813 | 22 | 3.26 | 2 | 2 | 3 |
| Claude-9 | Anthropic | 1000000 | 53 | 9.36 | 61 | 919 | 1 | 7.5 | 149877971 | 33 | 4.01 | 1 | 2 | 3 |
| Command-9 | Cohere | 200000 | 200 | 1.96 | 73 | 924 | 0 | 8.11 | 79640080 | 49 | 2.56 | 1 | 3 | 3 |
| Command-9 | Cohere | 200000 | 152 | 10.17 | 80 | 1007 | 1 | 2.22 | 814111181 | 50 | 0.23 | 2 | 2 | 3 |
| Command-9 | Cohere | 2000000 | 213 | 9.36 | 93 | 1384 | 0 | 0.34 | 292636894 | 33 | 0.39 | 3 | 3 | 2 |
| Command-9 | Cohere | 300000 | 164 | 17.93 | 66 | 988 | 0 | 1.04 | 10527148 | 37 | 1.27 | 1 | 2 | 2 |
| Command-9 | Cohere | 256000 | 39 | 3.63 | 62 | 1060 | 1 | 23.7 | 208404732 | 97 | 0.64 | 1 | 1 | 3 |
| Command-9 | Cohere | 300000 | 162 | 8.33 | 78 | 1007 | 0 | 21.59 | 529462416 | 13 | 0.58 | 2 | 2 | 3 |
| Command-9 | Cohere | 1000000 | 243 | 8.35 | 77 | 1339 | 1 | 29.14 | 925626379 | 80 | 3.71 | 2 | 3 | 3 |
| Command-9 | Cohere | 200000 | 271 | 5.12 | 65 | 939 | 0 | 28.02 | 777995262 | 82 | 1.5 | 1 | 3 | 3 |
| DeepSeek-9 | DeepSeek | 200000 | 117 | 18.78 | 62 | 1052 | 0 | 15.47 | 32582294 | 16 | 0.95 | 1 | 2 | 3 |
| Gemini-9 | Google | 300000 | 234 | 0.98 | 82 | 1334 | 0 | 14.42 | 758219200 | 19 | 4.82 | 2 | 3 | 3 |
| Gemini-9 | Google | 300000 | 38 | 6.58 | 76 | 1149 | 0 | 19.11 | 330954893 | 77 | 3.11 | 2 | 1 | 3 |
| Gemini-9 | Google | 128000 | 248 | 5.07 | 63 | 1396 | 0 | 6.62 | 272568110 | 36 | 4.88 | 1 | 3 | 3 |
| Gemini-9 | Google | 2000000 | 259 | 3.8 | 92 | 1467 | 1 | 2.86 | 113913164 | 80 | 2.58 | 3 | 3 | 3 |
| GPT-9 | OpenAI | 300000 | 252 | 1.79 | 69 | 1392 | 1 | 14.97 | 414501952 | 41 | 2.33 | 1 | 3 | 3 |
| GPT-9 | OpenAI | 1000000 | 82 | 19.31 | 79 | 910 | 1 | 7.45 | 311027380 | 16 | 3.9 | 2 | 2 | 3 |
| GPT-9 | OpenAI | 2000000 | 152 | 16.87 | 89 | 1209 | 0 | 29.08 | 879108142 | 16 | 0.2 | 3 | 2 | 3 |
| Llama-9 | Meta AI | 300000 | 76 | 3.54 | 90 | 1116 | 0 | 14.77 | 510792999 | 72 | 3.07 | 3 | 2 | 3 |
| Llama-9 | Meta AI | 128000 | 242 | 4.29 | 82 | 1397 | 0 | 13.68 | 777268711 | 37 | 1.9 | 2 | 3 | 3 |
| Nova-9 | AWS | 200000 | 155 | 14.32 | 84 | 1365 | 1 | 10.55 | 518732111 | 48 | 4.92 | 2 | 2 | 3 |
| Nova-9 | AWS | 200000 | 20 | 17.57 | 91 | 1416 | 0 | 5.13 | 435699137 | 70 | 3.9 | 3 | 1 | 3 |
Table 1: Comprehensive LLM performance metrics for the most recent models, sorted by model name
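For readers who want to analyze these numbers programmatically, here is a minimal sketch that loads the table with pandas and reproduces its ordering. It assumes the table has been exported to a CSV file whose column names match the headers above; the filename `llm_models.csv` is hypothetical.

```python
import pandas as pd

# Load Table 1, assuming it has been exported as CSV with the same
# column headers shown above ("llm_models.csv" is a hypothetical name).
df = pd.read_csv("llm_models.csv")

# Reproduce the table's ordering: alphabetical by model name.
df = df.sort_values("Model")

print(df[["Model", "Provider", "Quality Rating"]].head())
print("Median price per million tokens:",
      df["Price / Million Tokens"].median())
```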
Glossary
Context Window
The maximum number of tokens (word or sub-word units of text) that the model can process in a single request. A larger context window allows the model to understand and respond to longer conversations or documents.
Speed (tokens/sec)
The rate at which the model generates output, measured in tokens per second. Higher speeds mean faster response times, which is crucial for real-time applications.
Latency
The time delay between sending a request and receiving the first token of the response. Lower latency indicates quicker initial response times.
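Speed and latency combine into a rough end-to-end estimate: total response time ≈ latency + output tokens ÷ speed. As a back-of-the-envelope illustration using the first Claude-9 row (the 500-token response length is an assumed workload, not part of the table):

```python
# Rough response time: time to first token plus generation time.
# Latency and speed come from the first Claude-9 row in Table 1;
# the 500-token response length is an assumed workload.
latency_s = 2.73        # Latency (sec)
speed_tps = 153         # Speed (tokens/sec)
output_tokens = 500

total_s = latency_s + output_tokens / speed_tps
print(f"Estimated response time: {total_s:.1f} s")  # ~6.0 s
```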
Benchmark (MMLU)
Massive Multitask Language Understanding - a comprehensive test measuring the model's knowledge across 57 subjects including mathematics, history, computer science, and more. Higher scores indicate better general knowledge.
Benchmark (Chatbot Arena)
A crowdsourced evaluation platform where real users compare model responses head to head; the resulting scores are Elo-style ratings. Higher scores indicate better performance in real-world conversation scenarios.
Open-Source
Indicates whether the model's code and weights are publicly available (1) or proprietary (0). Open-source models can be modified and self-hosted.
Price / Million Tokens
The cost in dollars to process one million tokens. This metric helps evaluate the economic feasibility of using the model at scale.
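As a quick worked example of this rate, using the first Claude-9 row (the 10-million-token monthly volume is an assumption for illustration):

```python
# Monthly cost = (tokens processed / 1,000,000) * price per million tokens.
# The $16.45 rate is the first Claude-9 row in Table 1; the volume is assumed.
price_per_million = 16.45
monthly_tokens = 10_000_000

monthly_cost = monthly_tokens / 1_000_000 * price_per_million
print(f"Estimated monthly cost: ${monthly_cost:.2f}")  # $164.50
```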
Training Dataset Size
The number of tokens used to train the model. Larger datasets generally lead to better performance but require more computational resources.
Compute Power
A relative measure of the computational resources required to run the model. Higher values indicate more intensive processing requirements.
Energy Efficiency
A measure of how much energy the model consumes relative to its output. Lower values indicate better energy efficiency, which is important for environmental sustainability.
Quality Rating
An overall assessment of the model's output quality on a scale of 1-3, where 3 represents the highest quality. Sorting by this column is the quickest way to surface the strongest models.
Speed Rating
A categorical rating (1-3) of the model's response speed, where 3 is fastest. This helps users quickly identify models suitable for time-sensitive applications.
Price Rating
A categorical rating (1-3) of the model's cost-effectiveness, where lower numbers indicate better value for money.
Walkthrough: How to use this site
Walkthrough #1: Finding the Best High-Quality Model
Looking for a top-tier model? Start by examining the Quality Rating column (highlighted in blue). Models with a rating of 3 represent the highest-quality outputs, and sorting the table by this metric makes it easy to identify premium options. Then check the MMLU benchmark scores to verify academic performance: scores above 85 indicate exceptional knowledge breadth. Click this walkthrough to see the high-quality models highlighted!
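The same walkthrough can be expressed as a query. This minimal sketch reuses the `df` from the loading example after Table 1 (hypothetical CSV export) and applies both criteria at once:

```python
# Walkthrough #1 as a filter: top quality rating, verified by MMLU > 85.
top_quality = df[(df["Quality Rating"] == 3) & (df["Benchmark (MMLU)"] > 85)]
print(top_quality[["Model", "Provider", "Benchmark (MMLU)"]])
```

Against the data in Table 1, this surfaces the rating-3 rows for Claude-9, Command-9, Gemini-9, GPT-9, Llama-9, and Nova-9, all of which score above 85 on MMLU.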
Walkthrough #2: Balancing Speed and Cost
Look at the Speed Rating column first, where a rating of 3 means rapid token generation. Then cross-reference with the Price / Million Tokens column. The best-value models combine a top speed rating with prices under $10. For example, some Command-9 variants offer excellent speed at competitive prices. Click this walkthrough to highlight budget-friendly speed demons!
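Expressed as a query against the same `df` (hypothetical CSV export), the filter looks like this:

```python
# Walkthrough #2 as a filter: fast models under $10 per million tokens.
fast_and_cheap = df[(df["Speed Rating"] == 3) &
                    (df["Price / Million Tokens"] < 10)]
print(fast_and_cheap.sort_values("Price / Million Tokens")[
    ["Model", "Speed (tokens/sec)", "Price / Million Tokens"]])
```

In Table 1 this matches two Command-9 rows ($0.34 and $8.11) and two Gemini-9 rows ($2.86 and $6.62).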