Comparing Large Language Models

This comprehensive analysis examines various Large Language Models (LLMs) across multiple performance dimensions. Each model is evaluated based on technical specifications, performance benchmarks, cost efficiency, and practical usability metrics. The data provides insights into the trade-offs between speed, quality, and cost across different AI providers.

| Model | Provider | Context Window | Speed (tokens/sec) | Latency (sec) | Benchmark (MMLU) | Benchmark (Chatbot Arena) | Open-Source | Price / Million Tokens | Training Dataset Size | Compute Power | Energy Efficiency | Quality Rating | Speed Rating | Price Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-9 | Anthropic | 128000 | 153 | 2.73 | 88 | 1425 | 0 | 16.45 | 516393975 | 24 | 0.15 | 3 | 2 | 3 |
| Claude-9 | Anthropic | 300000 | 33 | 10.59 | 69 | 1325 | 1 | 16.68 | 196323711 | 68 | 4.24 | 1 | 1 | 3 |
| Claude-9 | Anthropic | 2000000 | 278 | 8.08 | 82 | 1069 | 1 | 29.25 | 63094533 | 2 | 1.69 | 2 | 3 | 3 |
| Claude-9 | Anthropic | 200000 | 70 | 7.97 | 76 | 953 | 0 | 15.62 | 714488813 | 22 | 3.26 | 2 | 2 | 3 |
| Claude-9 | Anthropic | 1000000 | 53 | 9.36 | 61 | 919 | 1 | 7.5 | 149877971 | 33 | 4.01 | 1 | 2 | 3 |
| Command-9 | Cohere | 200000 | 200 | 1.96 | 73 | 924 | 0 | 8.11 | 79640080 | 49 | 2.56 | 1 | 3 | 3 |
| Command-9 | Cohere | 200000 | 152 | 10.17 | 80 | 1007 | 1 | 2.22 | 814111181 | 50 | 0.23 | 2 | 2 | 3 |
| Command-9 | Cohere | 2000000 | 213 | 9.36 | 93 | 1384 | 0 | 0.34 | 292636894 | 33 | 0.39 | 3 | 3 | 2 |
| Command-9 | Cohere | 300000 | 164 | 17.93 | 66 | 988 | 0 | 1.04 | 10527148 | 37 | 1.27 | 1 | 2 | 2 |
| Command-9 | Cohere | 256000 | 39 | 3.63 | 62 | 1060 | 1 | 23.7 | 208404732 | 97 | 0.64 | 1 | 1 | 3 |
| Command-9 | Cohere | 300000 | 162 | 8.33 | 78 | 1007 | 0 | 21.59 | 529462416 | 13 | 0.58 | 2 | 2 | 3 |
| Command-9 | Cohere | 1000000 | 243 | 8.35 | 77 | 1339 | 1 | 29.14 | 925626379 | 80 | 3.71 | 2 | 3 | 3 |
| Command-9 | Cohere | 200000 | 271 | 5.12 | 65 | 939 | 0 | 28.02 | 777995262 | 82 | 1.5 | 1 | 3 | 3 |
| DeepSeek-9 | Deepseek | 200000 | 117 | 18.78 | 62 | 1052 | 0 | 15.47 | 32582294 | 16 | 0.95 | 1 | 2 | 3 |
| Gemini-9 | Google | 300000 | 234 | 0.98 | 82 | 1334 | 0 | 14.42 | 758219200 | 19 | 4.82 | 2 | 3 | 3 |
| Gemini-9 | Google | 300000 | 38 | 6.58 | 76 | 1149 | 0 | 19.11 | 330954893 | 77 | 3.11 | 2 | 1 | 3 |
| Gemini-9 | Google | 128000 | 248 | 5.07 | 63 | 1396 | 0 | 6.62 | 272568110 | 36 | 4.88 | 1 | 3 | 3 |
| Gemini-9 | Google | 2000000 | 259 | 3.8 | 92 | 1467 | 1 | 2.86 | 113913164 | 80 | 2.58 | 3 | 3 | 3 |
| GPT-9 | OpenAI | 300000 | 252 | 1.79 | 69 | 1392 | 1 | 14.97 | 414501952 | 41 | 2.33 | 1 | 3 | 3 |
| GPT-9 | OpenAI | 1000000 | 82 | 19.31 | 79 | 910 | 1 | 7.45 | 311027380 | 16 | 3.9 | 2 | 2 | 3 |
| GPT-9 | OpenAI | 2000000 | 152 | 16.87 | 89 | 1209 | 0 | 29.08 | 879108142 | 16 | 0.2 | 3 | 2 | 3 |
| Llama-9 | Meta AI | 300000 | 76 | 3.54 | 90 | 1116 | 0 | 14.77 | 510792999 | 72 | 3.07 | 3 | 2 | 3 |
| Llama-9 | Meta AI | 128000 | 242 | 4.29 | 82 | 1397 | 0 | 13.68 | 777268711 | 37 | 1.9 | 2 | 3 | 3 |
| Nova-9 | AWS | 200000 | 155 | 14.32 | 84 | 1365 | 1 | 10.55 | 518732111 | 48 | 4.92 | 2 | 2 | 3 |
| Nova-9 | AWS | 200000 | 20 | 17.57 | 91 | 1416 | 0 | 5.13 | 435699137 | 70 | 3.9 | 3 | 1 | 3 |
Table 1: Comprehensive LLM performance metrics for the most recent models, sorted by model name
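
For readers who prefer to explore the data programmatically, here is a minimal sketch that loads a representative subset of Table 1 into a pandas DataFrame. The column names are abbreviated for readability and are my own choice, not an official schema; the rows shown are copied from the table above but are only a sample. The walkthrough sketches later on this page reuse this `df`.

```python
import pandas as pd

# A representative subset of Table 1; column names abbreviated.
# mmlu = Benchmark (MMLU), arena = Benchmark (Chatbot Arena),
# price = Price / Million Tokens, *_rating = the 1-3 categorical ratings.
df = pd.DataFrame(
    [
        ("Claude-9",  "Anthropic", 128000,  153,  2.73, 88, 1425, 16.45, 3, 2, 3),
        ("Command-9", "Cohere",    2000000, 213,  9.36, 93, 1384,  0.34, 3, 3, 2),
        ("Command-9", "Cohere",    200000,  200,  1.96, 73,  924,  8.11, 1, 3, 3),
        ("Gemini-9",  "Google",    2000000, 259,  3.80, 92, 1467,  2.86, 3, 3, 3),
        ("GPT-9",     "OpenAI",    2000000, 152, 16.87, 89, 1209, 29.08, 3, 2, 3),
        ("Llama-9",   "Meta AI",   300000,   76,  3.54, 90, 1116, 14.77, 3, 2, 3),
        ("Nova-9",    "AWS",       200000,   20, 17.57, 91, 1416,  5.13, 3, 1, 3),
    ],
    columns=["model", "provider", "context", "speed", "latency",
             "mmlu", "arena", "price",
             "quality_rating", "speed_rating", "price_rating"],
)
print(df.head())
```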

Glossary

Context Window
The maximum number of tokens (roughly, words or word pieces) that the model can process in a single request. A larger context window allows the model to understand and respond to longer conversations or documents.
Speed (tokens/sec)
The rate at which the model generates output, measured in tokens per second. Higher speeds mean faster response times, which is crucial for real-time applications.
Latency (sec)
The time delay between sending a request and receiving the first token of the response. Lower latency indicates quicker initial response times.
Benchmark (MMLU)
Massive Multitask Language Understanding - a comprehensive test measuring the model's knowledge across 57 subjects including mathematics, history, computer science, and more. Higher scores indicate better general knowledge.
Benchmark (Chatbot Arena)
A crowdsourced evaluation platform where real users compare model responses. Higher scores indicate better performance in real-world conversation scenarios.
Open-Source
Indicates whether the model's code and weights are publicly available (1) or proprietary (0). Open-source models can be modified and self-hosted.
Price / Million Tokens
The cost in dollars to process one million tokens. This metric helps evaluate the economic feasibility of using the model at scale; a worked cost example follows this glossary.
Training Dataset Size
The number of tokens used to train the model. Larger datasets generally lead to better performance but require more computational resources.
Compute Power
A relative measure of the computational resources required to run the model. Higher values indicate more intensive processing requirements.
Energy Efficiency
A measure of how much energy the model consumes relative to its output. Lower values indicate better energy efficiency, which is important for environmental sustainability.
Quality Rating
An overall assessment of the model's output quality on a scale of 1-3, where 3 represents the highest quality.
Speed Rating
A categorical rating (1-3) of the model's response speed, where 3 is fastest. This helps users quickly identify models suitable for time-sensitive applications.
Price Rating
A categorical rating (1-3) of the model's cost-effectiveness, where lower numbers indicate better value for money.
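
To make the Price / Million Tokens metric concrete, here is a small sketch that estimates the dollar cost of a single request as cost = (tokens / 1,000,000) x price. The request size of 2,000 tokens is a made-up example; the two prices come from Table 1.

```python
def request_cost(tokens: int, price_per_million: float) -> float:
    """Estimate the dollar cost of processing `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million

# Example: a 2,000-token request against two prices from Table 1.
print(request_cost(2_000, 0.34))   # cheapest Command-9 variant: $0.00068
print(request_cost(2_000, 29.25))  # priciest Claude-9 variant:  $0.0585
```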

Walkthroughs: How to use this table

Walkthrough #1: Finding the Best High-Quality Model

Looking for a top-tier model? Start with the Quality Rating column. Models with a rating of 3 produce the highest-quality outputs, so scan that column first to identify premium options. Then check the MMLU benchmark scores to verify academic performance: scores above 85 indicate exceptional knowledge breadth.
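
Expressed against the `df` sketch built after Table 1 (an illustrative subset, not an official API), this walkthrough reduces to a single filter:

```python
# Highest-quality models: Quality Rating of 3, verified by MMLU > 85.
top_quality = df[(df["quality_rating"] == 3) & (df["mmlu"] > 85)]
print(top_quality.sort_values("mmlu", ascending=False)
                 [["model", "provider", "mmlu", "arena"]])
```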

Walkthrough #2: Balancing Speed and Cost

Start with the Speed Rating column, where a rating of 3 indicates rapid token generation. Then cross-reference the Price / Million Tokens column. The best value comes from models that combine a speed rating of 3 with prices under $10 per million tokens; for example, some Command-9 variants offer excellent speed at competitive prices.
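
The same filter in code, again reusing the illustrative `df` from the sketch after Table 1 (the $10 threshold is the one suggested in the walkthrough above):

```python
# Fast and affordable: Speed Rating of 3 with price under $10 / M tokens.
speed_demons = df[(df["speed_rating"] == 3) & (df["price"] < 10)]
print(speed_demons[["model", "provider", "speed", "price"]])
```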