GPT-4o: The flagship model across audio, vision, and text by OpenAI	Gemini: Gemini by Google	Claude 3.5: Claude by Anthropic
Grok-2: Grok-2 by xAI	Nova: Nova by Amazon	Qwen Max: The Frontier Qwen Model by Alibaba
Llama 3.1: Open foundation and chat models by Meta	Mistral: Mistral Large 2	Yi-Large: State-of-the-art model by 01 AI
GLM-4: Next-Gen Foundation Model by Zhipu AI	Jamba 1.5: Jamba by AI21 Labs	Gemma 2: Gemma 2 by Google
Claude: Claude by Anthropic	Nemotron-4 340B: Cutting-edge Open model by Nvidia	Llama 3: Open foundation and chat models by Meta
GPT-3.5: GPT-3.5-Turbo by OpenAI	Reka Core: Frontier Multimodal Language Model by Reka	Reka Flash: Multimodal model by Reka
Command-R-Plus: Command R+ by Cohere	Command R: Command R by Cohere	Mixtral of experts: A Mixture-of-Experts model by Mistral AI




Total #models: 212. Total #votes: 2,768,389. Last updated: 2025-03-10.

Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!

Overall Questions

#models: 212 (100%) #votes: 2,768,389 (100%)

Rank* (UB)	Rank (StyleCtrl)	Model	Arena Score	95% CI	Votes	Organization	License
1	1	GPT-4.5-Preview	1404	+7/-9	6024	OpenAI	Proprietary
1	2	Grok-3-Preview-02-24	1407	+7/-7	7580	xAI	Proprietary
3	2	ChatGPT-4o-latest (2025-01-29)	1375	+4/-5	19587	OpenAI	Proprietary
3	3	Gemini-2.0-Pro-Exp-02-05	1380	+4/-4	17695	Google	Proprietary
6	3	o1-2024-12-17	1353	+4/-4	22010	OpenAI	Proprietary
6	4	DeepSeek-R1	1361	+5/-6	10474	DeepSeek	MIT
3	6	Gemini-2.0-Flash-Thinking-Exp-01-21	1384	+5/-5	19837	Google	Proprietary
14	6	Claude 3.7 Sonnet	1304	+8/-8	6952	Anthropic	Proprietary
9	7	o1-preview	1335	+4/-4	33195	OpenAI	Proprietary
6	10	Gemini-2.0-Flash-001	1355	+4/-5	15416	Google	Proprietary
9	10	Gemma-3-27B-it	1339	+9/-11	3870	Google	Gemma
9	10	Qwen2.5-Max	1338	+5/-5	14258	Alibaba	Proprietary
9	10	o3-mini-high	1328	+6/-5	11409	OpenAI	Proprietary
22	10	Claude 3.5 Sonnet (20241022)	1283	+3/-3	61187	Anthropic	Proprietary
13	12	DeepSeek-V3	1319	+4/-4	23079	DeepSeek	DeepSeek
14	14	o3-mini	1305	+3/-5	17849	OpenAI	Proprietary
13	15	Qwen-Plus-0125	1310	+7/-8	6058	Alibaba	Proprietary
14	15	Gemini-2.0-Flash-Lite	1308	+6/-6	15126	Google	Proprietary
14	15	Gemini-1.5-Pro-002	1302	+3/-3	58887	Google	Proprietary
22	16	GPT-4o-2024-05-13	1285	+3/-3	117785	OpenAI	Proprietary
13	18	GLM-4-Plus-0111	1310	+7/-8	6037	Zhipu	Proprietary

*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.
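
The Rank (UB) definition above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual code: the function name `rank_upper_bound` and the input layout (a mapping from model name to its lower and upper 95% CI bounds) are assumptions, and the example scores are made up.

```python
from typing import Dict, Tuple


def rank_upper_bound(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, int]:
    """Return 1 + the number of statistically better models, per model.

    `intervals` maps a model name to (lower_bound, upper_bound) of its score.
    Model A is statistically better than model B when A's lower bound
    is greater than B's upper bound.
    """
    ranks = {}
    for model, (_, upper) in intervals.items():
        better = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != model and other_lower > upper
        )
        ranks[model] = 1 + better
    return ranks


# Hypothetical intervals for illustration only (not leaderboard data).
example = {
    "model-a": (1390.0, 1410.0),
    "model-b": (1385.0, 1400.0),
    "model-c": (1340.0, 1360.0),
}
ranks = rank_upper_bound(example)
print(ranks)  # {'model-a': 1, 'model-b': 1, 'model-c': 3}
```

Because the intervals of model-a and model-b overlap, neither is statistically better than the other and both receive rank 1. Model-c is beaten by both, so its rank is 3. This is why the tables can show tied ranks followed by a gap.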

Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.

Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.

More Statistics for Chatbot Arena - Overall

Figure 1: Confidence Intervals on Model Strength (via Bootstrapping)

Figure 2: Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)

Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles

Figure 4: Battle Count for Each Combination of Models (without Ties)
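
As a rough sketch of the quantities behind Figures 2 and 3, the code below computes the fraction of non-tied battles each model wins against each opponent, then averages those fractions with equal weight per opponent (one reading of "uniform sampling"). The battle record layout `(model_a, model_b, winner)` and the function names are assumptions made for illustration. They are not taken from the leaderboard notebook.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed record layout: (model_a, model_b, winner), where winner is
# "model_a", "model_b", or anything else for a tie.
Battle = Tuple[str, str, str]


def pairwise_win_fraction(battles: Iterable[Battle]) -> Dict[Tuple[str, str], float]:
    """Fraction of non-tied battles that the first model of each pair wins."""
    wins = defaultdict(int)
    counts = defaultdict(int)
    for a, b, winner in battles:
        if winner not in ("model_a", "model_b"):
            continue  # ties are excluded
        counts[(a, b)] += 1
        counts[(b, a)] += 1
        wins[(a, b) if winner == "model_a" else (b, a)] += 1
    return {pair: wins[pair] / n for pair, n in counts.items()}


def average_win_rate(fractions: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Unweighted mean of a model's win fractions over the opponents it faced."""
    per_model = defaultdict(list)
    for (model, _), fraction in fractions.items():
        per_model[model].append(fraction)
    return {model: sum(v) / len(v) for model, v in per_model.items()}


# Hypothetical battles for illustration only.
battles = [
    ("x", "y", "model_a"),
    ("x", "y", "model_b"),
    ("y", "x", "model_b"),
    ("x", "z", "model_a"),
    ("y", "z", "tie"),
    ("z", "y", "model_a"),
]
fractions = pairwise_win_fraction(battles)
averages = average_win_rate(fractions)
```

Averaging per opponent rather than per battle keeps a model's average from being dominated by whichever pairing happened to be sampled most often.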

Chatbot Arena Overview (Task)

Model	Overall	Overall w/ Style Control	Hard Prompts	Hard Prompts w/ Style Control	Coding	Math	Creative Writing	Instruction Following	Longer Query	Multi-Turn
grok-3-preview-02-24	1	2	1	1	1	1	1	1	1	2
gpt-4.5-preview-2025-02-27	1	1	1	1	1	1	1	1	1	1
gemini-2.0-flash-thinking-exp-01-21	3	6	2	3	3	1	1	2	1	2
gemini-2.0-pro-exp-02-05	3	3	2	2	3	1	1	3	1	2
chatgpt-4o-latest-20250129	3	2	6	3	3	11	1	3	1	2
deepseek-r1	6	4	4	1	3	1	6	5	6	2
gemini-2.0-flash-001	6	10	5	11	4	5	6	7	6	5
o1-2024-12-17	6	3	3	1	3	1	6	3	3	7
gemma-3-27b-it	9	10	14	15	15	13	6	12	10	7
qwen2.5-max	9	10	6	7	7	6	7	8	6	6
o1-preview	9	7	6	3	4	1	9	7	7	6
o3-mini-high	9	10	4	2	3	1	17	7	6	11
deepseek-v3	13	12	14	15	15	14	7	13	7	7
glm-4-plus-0111	13	18	15	19	18	17	10	15	12	11
qwen-plus-0125	13	15	13	14	13	13	15	15	7	9
gemini-2.0-flash-lite-preview-02-05	14	15	13	15	15	14	9	14	13	17
o3-mini	14	14	5	3	3	1	23	10	7	13
claude-3-7-sonnet-20250219	14	6	14	4	4	11	6	8	3	6
step-2-16k-exp-202412	14	20	15	16	16	14	6	16	14	17
o1-mini	14	20	13	13	4	7	31	15	14	13
gemini-1.5-pro-002	14	15	16	16	19	14	7	16	15	18
grok-2-2024-08-13	22	22	27	22	21	21	17	24	25	20
yi-lightning	22	24	15	17	16	14	17	21	17	13
gpt-4o-2024-05-13	22	16	23	17	19	21	17	21	17	18
claude-3-5-sonnet-20241022	22	10	14	6	15	14	17	15	15	13
qwen2.5-plus-1127	22	33	14	19	16	14	21	21	17	17
deepseek-v2.5-1210	22	27	16	18	16	17	17	21	15	17
athene-v2-chat	26	34	18	19	19	15	36	22	17	21
glm-4-plus	26	32	24	27	20	25	23	25	22	22
hunyuan-large-2025-02-10	26	23	16	17	16	15	17	17	5	18
gpt-4o-mini-2024-07-18	27	33	29	37	21	31	21	30	20	22
gemini-1.5-flash-002	27	34	35	43	40	26	17	31	22	38
llama-3.1-nemotron-70b-instruct	27	47	29	27	24	25	17	31	40	22
claude-3-5-sonnet-20240620	28	18	23	15	19	14	36	22	23	17

Chatbot Arena Overview (Language)

Model	English	Chinese	German	French	Spanish	Russian	Japanese	Korean
grok-3-preview-02-24	1	1	1	1	1	1	1	1
gpt-4.5-preview-2025-02-27	1	1	1	1	1	1	1	1
gemini-2.0-flash-thinking-exp-01-21	2	1	2	1	1	1	2	1
gemini-2.0-pro-exp-02-05	3	4	2	1	1	1	1	1
chatgpt-4o-latest-20250129	3	1	2	1	1	3	1	1
deepseek-r1	5	2	2	1	1	5	2	1
gemini-2.0-flash-001	6	4	2	1	1	5	6	1
o1-2024-12-17	6	3	3	2	1	5	2	1
gemma-3-27b-it	6	4	2	1	1	5	2	5
o1-preview	6	13	8	2	4	9	2	6
qwen2.5-max	9	4	2	2	1	6	6	2
o3-mini-high	10	2	7	5	1	9	6	1
deepseek-v3	12	11	5	4	1	7	7	8
qwen-plus-0125	12	10	8	2	2	6	7	6
step-2-16k-exp-202412	12	11	6	6	4	6	7	7
glm-4-plus-0111	13	3	2	5	4	9	9	6
o1-mini	13	15	10	6	4	18	10	15
o3-mini	14	13	8	4	1	11	7	6
claude-3-7-sonnet-20250219	14	14	8	6	7	7	6	3
gemini-2.0-flash-lite-preview-02-05	15	11	8	4	4	7	5	3
yi-lightning	15	13	12	6	4	31	11	16
qwen2.5-plus-1127	16	13	15	6	2	21	13	28
gemini-1.5-pro-002	17	13	10	6	4	9	6	6
grok-2-2024-08-13	20	22	10	6	10	18	10	13
llama-3.1-nemotron-70b-instruct	22	29	5	2	5	44	11	44
gpt-4o-2024-05-13	23	26	11	6	5	18	9	13
deepseek-v2.5-1210	23	13	12	6	4	14	13	16
hunyuan-large-2025-02-10	23	7	10	6	4	21	12	16
llama-3.1-405b-instruct-bf16	23	40	18	17	11	34	24	25
claude-3-5-sonnet-20241022	25	29	9	6	4	10	10	16
athene-v2-chat	25	17	12	6	10	18	15	15
llama-3.1-405b-instruct-fp8	25	41	20	13	14	29	31	23
gpt-4o-mini-2024-07-18	27	31	12	6	10	21	15	16
llama-3.3-70b-instruct	27	49	20	6	4	36	46	47

*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.

Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.

The Arena-Price Plot presents the cost vs. performance trade-offs for LLMs. Only models with publicly available pricing are listed. You may find pricing information here.

Total #models: 62. Total #votes: 177,258. Last updated: 2025-03-11.

Overall Questions

#models: 62 (100%) #votes: 177,258 (100%)

Rank* (UB)	Rank (StyleCtrl)	Model	Arena Score	95% CI	Votes	Organization	License
1	2	Gemini-2.0-Flash-Thinking-Exp-01-21	1279	+11/-12	2714	Google	Proprietary
1	1	ChatGPT-4o-latest (2025-01-29)	1277	+9/-9	2705	OpenAI	Proprietary
3	2	Gemini-2.0-Pro-Exp-02-05	1238	+9/-8	2911	Google	Proprietary
3	2	o1-2024-12-17	1227	+14/-15	1693	OpenAI	Proprietary
3	2	Gemini-2.0-Flash-001	1227	+9/-12	2338	Google	Proprietary
3	2	Claude 3.7 Sonnet	1225	+25/-22	554	Anthropic	Proprietary
4	2	Gemini-1.5-Pro-002	1220	+7/-7	8549	Google	Proprietary
7	5	GPT-4o-2024-05-13	1206	+4/-5	23806	OpenAI	Proprietary
7	11	Gemini-1.5-Flash-002	1205	+6/-9	7349	Google	Proprietary
10	10	Claude 3.5 Sonnet (20240620)	1189	+5/-5	22246	Anthropic	Proprietary
10	11	Step-1o-Vision-32k (highres)	1186	+10/-8	2891	StepFun	Proprietary
10	5	Claude 3.5 Sonnet (20241022)	1183	+6/-7	7534	Anthropic	Proprietary
11	11	Qwen2.5-VL-72B-Instruct	1168	+12/-10	2770	Alibaba	Qwen
13	18	Pixtral-Large-2411	1153	+8/-9	4239	Mistral	MRL
13	11	Gemini-2.0-Flash-Lite	1152	+12/-10	2776	Google	Proprietary
14	12	Gemini-1.5-Pro-001	1151	+6/-6	17169	Google	Proprietary
14	14	GPT-4-Turbo-2024-04-09	1151	+6/-6	13725	OpenAI	Proprietary
16	17	Qwen-VL-Max-1119	1127	+18/-13	1449	Alibaba	Proprietary
18	11	GPT-4o-2024-08-06	1125	+8/-10	3407	OpenAI	Proprietary
18	18	GPT-4o-mini-2024-07-18	1124	+5/-5	16562	OpenAI	Proprietary
18	22	Step-1V-32K	1110	+11/-14	1553	StepFun	Proprietary
19	18	Qwen2-VL-72b-Instruct	1110	+8/-7	6028	Alibaba	Qwen
21	22	Gemini-1.5-Flash-8B-001	1106	+7/-9	6355	Google	Proprietary
24	24	Claude 3 Opus	1076	+5/-6	15943	Anthropic	Proprietary
24	24	Molmo-72B-0924	1075	+9/-10	3092	AI2	Apache 2.0
24	22	Gemini-1.5-Flash-001	1072	+6/-6	13585	Google	Proprietary
24	30	Pixtral-12B-2409	1071	+6/-6	7623	Mistral	Apache 2.0
24	25	Llama-3.2-90B-Vision-Instruct	1069	+6/-6	8829	Meta	Llama 3.2
24	29	InternVL2-26B	1067	+7/-9	5265	OpenGVLab	MIT
24	22	Hunyuan-Standard-Vision-2024-12-31	1064	+21/-20	811	Tencent	Proprietary
24	26	Amazon Nova Lite 1.0	1060	+12/-14	1877	Amazon	Proprietary
29	26	Qwen2-VL-7B-Instruct	1053	+7/-8	5854	Alibaba	Apache 2.0
29	30	Yi-Vision	1045	+15/-11	1237	01 AI	Proprietary
30	29	Claude 3 Sonnet	1048	+6/-5	12679	Anthropic	Proprietary

*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.

Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.

Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.

More Statistics for Chatbot Arena (Overall)

Figure 1: Confidence Intervals on Model Strength (via Bootstrapping)

Figure 2: Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)

Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles

Figure 4: Battle Count for Each Combination of Models (without Ties)

Total #models: 8. Total #votes: 158,199. Last updated: 2025-03-11.

Overall Questions

#models: 8 (100%) #votes: 158,199 (100%)

Rank* (UB)	Model	Arena Score	95% CI	Votes	Organization	License
1	Imagen-3.0-generate-002	1089	+2/-2	56720	Google	Proprietary
2	Recraft V3	1027	+2/-2	43929	Recraft	Proprietary
3	Luma Photon	1022	+3/-3	21351	Luma AI	Proprietary
3	FLUX1.1 [pro]	1021	+2/-3	41705	Black Forest Labs	Proprietary
3	Ideogram 2.0	1021	+2/-2	43312	Ideogram	Proprietary
6	DALL·E 3	978	+3/-2	41616	OpenAI	Proprietary
6	FLUX.1 [dev] (fp8)	975	+2/-2	43715	Black Forest Labs	Open
8	Stable Diffusion 3.5 Large	945	+4/-3	24050	Stability AI	Open

*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.

Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.

More Statistics for Chatbot Arena (Overall)

Figure 1: Confidence Intervals on Model Strength (via Bootstrapping)

Figure 2: Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)

Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles

Figure 4: Battle Count for Each Combination of Models (without Ties)

Copilot Arena is a free AI coding assistant that provides paired responses from different state-of-the-art LLMs.

Code Completion

#models: 13 #votes: 22,093

Rank* (UB)	Model	Arena Score	Confidence Interval	Votes	Organization
1	Deepseek V2.5 (FIM)	1029	+13 / -15	2292	Deepseek AI
1	Claude 3.5 Sonnet (06/20)	1018	+11 / -12	3085	Anthropic
1	Claude 3.5 Sonnet (10/22)	1006	+13 / -12	3162	Anthropic
1	Codestral (25.01)	999	+18 / -16	1423	Mistral
2	Codestral (05/24)	1000	+8 / -8	5130	Mistral
2	Mercury Coder Mini	994	+17 / -19	1429	Inception AI
3	Meta-Llama-3.1-405B-Instruct	984	+11 / -11	3432	Meta
3	GPT-4o (08/06)	983	+11 / -11	3953	OpenAI
4	Gemini-1.5-Pro-002	983	+9 / -13	3004	Google
5	Gemini-1.5-Flash-002	979	+12 / -11	4528	Google
6	Meta-Llama-3.1-70B-Instruct	969	+11 / -8	4085	Meta
12	Qwen2.5-Coder-32B-Instruct	950	+10 / -10	4400	Alibaba
12	GPT-4o-mini (07/18)	942	+13 / -12	4264	OpenAI

*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval).

Confidence Interval: represents the range of uncertainty around the Arena Score. It's displayed as +X / -Y, where X is the difference between the upper bound and the score, and Y is the difference between the score and the lower bound.
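
To make the +X / -Y notation concrete, the sketch below converts a score and its interval string into explicit lower and upper bounds. The helper name `ci_bounds` and the regular expression are illustrative assumptions. The first input is the Deepseek V2.5 (FIM) row from the table above.

```python
import re
from typing import Tuple

# Accepts both "+13 / -15" and "+7/-9" spellings.
_CI_PATTERN = re.compile(r"^\s*\+\s*([0-9.]+)\s*/\s*-\s*([0-9.]+)\s*$")


def ci_bounds(score: float, ci: str) -> Tuple[float, float]:
    """Return (lower_bound, upper_bound) for a score and a '+X / -Y' string."""
    match = _CI_PATTERN.match(ci)
    if match is None:
        raise ValueError(f"unrecognized confidence interval: {ci!r}")
    plus, minus = float(match.group(1)), float(match.group(2))
    return score - minus, score + plus


lower, upper = ci_bounds(1029, "+13 / -15")
print(lower, upper)  # 1014.0 1042.0
```

The resulting bounds are what the Rank (UB) comparison operates on: one model's lower bound is checked against another model's upper bound.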

Last Updated: 2024-07-31

Arena-Hard-Auto v0.1 - an automatic evaluation tool for instruction-tuned LLMs with 500 challenging user queries curated from Chatbot Arena.

We prompt GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314). If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out our paper for more details about how Arena-Hard-Auto works as a fully automated data pipeline that converts crowdsourced data into high-quality benchmarks: [Paper | Repo]

Rank* (UB)	Model	Win-rate	95% CI	Average Tokens	Organization
1	GPT-4-Turbo-2024-04-09	82.63	+2.0/-1.9	662	OpenAI
2	Claude 3.5 Sonnet (20240620)	79.35	+1.3/-2.1	567	Anthropic
2	GPT-4o-2024-05-13	79.21	+1.5/-1.8	696	OpenAI
2	GPT-4-0125-preview	77.96	+1.9/-2.0	619	OpenAI
2	Athene-70B	76.83	+1.9/-2.0	683	NexusFlow
4	GPT-4o-mini-2024-07-18	74.94	+2.1/-2.3	668	OpenAI
6	Gemini-1.5-Pro-001	71.96	+2.7/-2.3	676	Google
6	Yi-Large-preview	71.48	+1.9/-2.5	720	01 AI
7	Mistral-Large-2407	70.42	+2.0/-2.3	623	Mistral
10	Meta-Llama-3.1-405B-Instruct-fp8	64.09	+2.5/-2.7	633	Meta
10	GLM-4-0520	63.84	+2.3/-2.6	636	Zhipu AI
10	Yi-Large	63.7	+2.2/-1.9	626	01 AI
10	DeepSeek-Coder-V2-Instruct	62.3	+2.4/-2.5	578	DeepSeek AI
10	Claude 3 Opus	60.36	+2.0/-2.8	541	Anthropic
13	Gemma-2-27B-it	57.51	+2.6/-2.4	577	Google
14	Meta-Llama-3.1-70B-Instruct	55.73	+2.5/-2.9	628	Meta
14	GLM-4-0116	55.72	+2.4/-1.9	622	Zhipu AI
15	Gemini-1.5-Pro-Preview-0409	53.37	+3.3/-2.2	478	Google
17	GLM-4-AIR	50.88	+2.3/-2.3	619	Zhipu AI
19	GPT-4-0314	50	+0.0/-0.0	423	OpenAI
18	Gemini-1.5-Flash-001	49.61	+2.6/-2.1	642	Google
20	Qwen2-72B-Instruct	46.86	+2.4/-2.3	515	Alibaba
20	Claude 3 Sonnet	46.8	+2.2/-2.7	552	Anthropic
20	Llama-3-70B-Instruct	46.57	+2.6/-2.7	591	Meta
24	Claude 3 Haiku	41.47	+2.6/-1.9	505	Anthropic
25	GPT-4-0613	37.9	+2.5/-2.3	354	OpenAI
25	Mistral-Large-2402	37.71	+2.1/-2.9	400	Mistral
26	Mixtral-8x22b-Instruct-v0.1	36.36	+2.2/-2.1	430	Mistral
26	Qwen1.5-72B-Chat	36.12	+2.0/-2.2	474	Alibaba
27	Phi-3-Medium-4k-Instruct	33.37	+1.8/-2.1	517	Microsoft
27	Command R+ (04-2024)	33.07	+2.0/-2.2	541	Cohere
28	Mistral Medium	31.9	+2.4/-2.2	485	Mistral
30	Phi-3-Small-8k-Instruct	29.77	+2.2/-1.8	568	Microsoft
33	Mistral-Next	27.37	+1.7/-2.0	297	Mistral

⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots


Please report any bug or issue to our Discord/arena-feedback.
