⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots
Twitter | Discord | Blog | GitHub | Paper | Dataset | Kaggle Competition
📜 How It Works
- Blind Test: Ask any question to two anonymous AI chatbots (ChatGPT, Gemini, Claude, Llama, and more).
- Vote for the Best: Choose the best response. You can keep chatting until you find a winner.
- Play Fair: If a model's identity is revealed, your vote won't count.
- NEW: Click the 🎨 Text-to-Image tab below to generate images with DALL-E 3, Flux, and more! Use the 🐙 RepoChat tab to chat with GitHub repos.
🏆 Chatbot Arena LLM Leaderboard
- Backed by over 1,000,000 community votes, our platform ranks the best LLMs and AI chatbots. Explore the top AI models on our LLM leaderboard!
👇 Chat now!
GPT-4o: The flagship model across audio, vision, and text by OpenAI | Gemini: Gemini by Google | Claude 3.5: Claude by Anthropic |
Nova: Nova by Amazon | Grok-2: Grok-2 by xAI | Llama 3.1: Open foundation and chat models by Meta |
Mistral: Mistral Large 2 | Yi-Large: State-of-the-art model by 01 AI | GLM-4: Next-Gen Foundation Model by Zhipu AI |
Molmo: Molmo by AI2 | GPT-4-Turbo: GPT-4-Turbo by OpenAI | Jamba 1.5: Jamba by AI21 Labs |
Gemma 2: Gemma 2 by Google | Claude: Claude by Anthropic | Nemotron-4 340B: Cutting-edge Open model by Nvidia |
Llama 3: Open foundation and chat models by Meta | Athene-70B: A large language model by NexusFlow | Qwen Max: The Frontier Qwen Model by Alibaba |
GPT-3.5: GPT-3.5-Turbo by OpenAI | Phi-3: A capable and cost-effective small language model (SLM) by Microsoft | Reka Core: Frontier Multimodal Language Model by Reka |
Reka Flash: Multimodal model by Reka | Command-R-Plus: Command R+ by Cohere | Command R: Command R by Cohere |
Mixtral of experts: A Mixture-of-Experts model by Mistral AI | InternVL 2: Multimodal Model developed by OpenGVLab |
GitHub link/issue/PR | Type your query |
---|---|
https://github.com/meta-llama/llama-recipes | fun use cases in llama recipes |
https://github.com/scikit-learn/scikit-learn | show me how to run k-means clustering |
https://github.com/sindresorhus/awesome | Make me a project structure for my food blog React webapp based entirely on awesome libraries. |
Terms of Service
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Please report any bug or issue to our Discord/arena-feedback.
Acknowledgment
We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Fireworks AI, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship.
🤖 Choose two models to compare
- Ask any question to two chosen models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
- You can chat for multiple turns until you identify a winner.
Note: You can chat with only one image per conversation, and each uploaded image must be smaller than 15 MB. Click the "Random Example" button to chat with a random image.
❗️ For research purposes, we log user prompts and images and may release this data to the public in the future. Please do not upload any confidential or personal information.
Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. For technical details, check out our paper.
Chatbot Arena thrives on community engagement — cast your vote to help improve AI evaluation!
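For intuition, Bradley-Terry-style ratings can be fit to pairwise vote data with a logistic regression. Below is a minimal sketch, assuming a battles table with model_a, model_b, and winner columns; the column names, the regularized regression, and the Elo-like rescaling are illustrative assumptions, not the exact leaderboard pipeline (see the linked notebook and paper for the real code).

```python
# Minimal sketch: fit Bradley-Terry-style ratings to pairwise battle outcomes.
# Column names and rescaling constants are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_bradley_terry(battles: pd.DataFrame, scale=400, base=10, init=1000):
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}

    # One row per battle: +1 for model_a, -1 for model_b; label is 1 if model_a won.
    # (Ties are omitted here for brevity.)
    X = np.zeros((len(battles), len(models)))
    X[np.arange(len(battles)), battles["model_a"].map(idx).to_numpy()] = 1.0
    X[np.arange(len(battles)), battles["model_b"].map(idx).to_numpy()] = -1.0
    y = (battles["winner"] == "model_a").astype(int)

    # Default L2 penalty keeps the toy example well-behaved.
    lr = LogisticRegression(fit_intercept=False).fit(X, y)
    # Convert log-odds coefficients to an Elo-like scale for readability.
    ratings = scale / np.log(base) * lr.coef_[0] + init
    return pd.Series(ratings, index=models).sort_values(ascending=False)

# Toy example with made-up battles.
battles = pd.DataFrame({
    "model_a": ["gpt-4o", "gpt-4o", "claude-3-5", "llama-3-70b"],
    "model_b": ["llama-3-70b", "claude-3-5", "llama-3-70b", "gemini-pro"],
    "winner":  ["model_a", "model_a", "model_a", "model_b"],
})
print(fit_bradley_terry(battles))
```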
Total #models: 184. Total #votes: 2,465,686. Last updated: 2024-12-22.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 184 (100%) #votes: 2,465,686 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
100 | 115 | | 1372 | +11/-11 | 117916 | Cognitive Computations | Falcon-180B TII License |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
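To make the Rank (UB) rule above concrete, here is an illustrative sketch; the scores and intervals are made up, not taken from the table.

```python
# Illustrative sketch of the upper-bound rank rule: a model's rank is 1 plus the
# number of models whose lower confidence bound exceeds its upper confidence
# bound. The example intervals below are made up.
def rank_upper_bound(intervals):
    """intervals: dict mapping model name -> (lower_bound, upper_bound) of its Arena Score."""
    ranks = {}
    for model, (_, upper) in intervals.items():
        better = sum(
            1 for other, (lower_o, _) in intervals.items()
            if other != model and lower_o > upper
        )
        ranks[model] = 1 + better
    return ranks

example = {
    "model_x": (1355, 1370),  # hypothetical 95% confidence intervals
    "model_y": (1340, 1352),
    "model_z": (1338, 1350),
}
print(rank_upper_bound(example))  # {'model_x': 1, 'model_y': 2, 'model_z': 2}
```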
More Statistics for Chatbot Arena (Overall)
Task Leaderboard
Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
---|---|---|---|---|---|---|---|---|---|---|
gemini-2.0-flash-thinking-exp-1219 | 100 | 115 | 102 | 100 | 106 | 110 | 100 | 106 | 101 | 104 |
Language Leaderboard
Model | English | Chinese | German | French | Spanish | Russian | Japanese | Korean |
---|---|---|---|---|---|---|---|---|
gemini-2.0-flash-thinking-exp-1219 | 100 | 106 | 101 | 101 | 104 | 100 | 103 | 113 |
Total #models: 49. Total #votes: 154,532. Last updated: 2024-12-22.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 49 (100%) #votes: 154,532 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
10 | 12 | | 1281 | +22/-22 | 22862 | Anthropic | Proprietary |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
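As a rough illustration of the style control mentioned in the note above, one approach is to add style-difference features to the Bradley-Terry regression so that model coefficients are estimated while holding style constant. The feature set below (length, headers, bullets, bold) is an assumption for illustration, not the exact set used by the leaderboard.

```python
# Hedged sketch: style-difference features that could be appended to the battle
# design matrix of a Bradley-Terry logistic regression (see the earlier sketch),
# so model coefficients are estimated while controlling for style.
# The specific features are illustrative assumptions.
import numpy as np

def style_feature_diff(resp_a: str, resp_b: str) -> np.ndarray:
    def feats(text: str) -> np.ndarray:
        return np.array([
            len(text),                            # response length
            text.count("#"),                      # markdown header marks
            text.count("- ") + text.count("* "),  # bullet-list items
            text.count("**"),                     # bold markers
        ], dtype=float)

    fa, fb = feats(resp_a), feats(resp_b)
    # Normalized difference: positive when response A is longer / more formatted.
    return (fa - fb) / (fa + fb + 1e-8)

print(style_feature_diff("## Title\n- point one\n- point two", "short answer"))
```

Appending such columns lets the regression attribute part of each win to style rather than to the model itself; see the blog post for the actual method.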
Copilot Arena is a free AI coding assistant that provides paired responses from different state-of-the-art LLMs. This leaderboard contains the relative performance and ranking of 11 models over 15,635 battles. Download Copilot Arena or learn more in our blog post!
Rank* (UB) | Model | Arena Score | Confidence Interval | Votes | Organization |
---|---|---|---|---|---|
10 | Meta-Llama-3.1-405B-Instruct | 1028 | +17 / -16 | 2294 | Deepseek AI |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level).
Confidence Interval: the range of uncertainty around the Arena Score, displayed as +X / -Y, where X is the difference between the upper bound and the score and Y is the difference between the score and the lower bound. For example, a score of 1028 with +17 / -16 corresponds to an interval of roughly 1012 to 1045.
Last Updated: 2024-07-31
Arena-Hard-Auto v0.1 - an automatic evaluation tool for instruction-tuned LLMs with 500 challenging user queries curated from Chatbot Arena.
We prompt GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out our paper for more details on how Arena-Hard-Auto works as a fully automated pipeline that converts crowdsourced data into high-quality benchmarks -> [Paper | Repo]
Rank* (UB) | Model | Win-rate | 95% CI | Average Tokens | Organization |
---|---|---|---|---|---|
10 | | 82.63 | +2.0/-1.9 | 662 | DeepSeek AI |
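To illustrate the judging pipeline described above, here is a hedged sketch of how per-query verdicts from an LLM judge could be aggregated into a win rate. The prompt wording and the call_judge placeholder are assumptions, not the official Arena-Hard-Auto prompt or code.

```python
# Hedged sketch of an LLM-as-judge win-rate computation. `call_judge` is a
# hypothetical stand-in for a real GPT-4-Turbo API call, and the prompt below
# is illustrative, not the official Arena-Hard-Auto judge prompt.
JUDGE_TEMPLATE = """You are comparing two answers to the same user query.
Query: {query}

Answer A (baseline): {baseline}
Answer B (candidate): {candidate}

Which answer is better? Reply with exactly one of: A, B, or tie."""

def call_judge(prompt: str) -> str:
    raise NotImplementedError("replace with a real judge-model API call")

def win_rate(queries, baseline_answers, candidate_answers, judge=call_judge):
    wins = ties = 0
    for query, base, cand in zip(queries, baseline_answers, candidate_answers):
        verdict = judge(
            JUDGE_TEMPLATE.format(query=query, baseline=base, candidate=cand)
        ).strip().lower()
        if verdict == "b":
            wins += 1
        elif verdict == "tie":
            ties += 1
    # Count ties as half a win, a common convention for pairwise win rates.
    return 100.0 * (wins + 0.5 * ties) / len(queries)
```

The actual pipeline adds refinements such as judging each pair in both answer orders to reduce position bias; see the paper and repo for details.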
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
About Us
Chatbot Arena is an open platform for crowdsourced AI benchmarking, hosted by researchers at UC Berkeley SkyLab and LMArena. We open-source the FastChat project on GitHub and release open datasets. We always welcome contributions from the community. If you're interested in collaboration, we'd love to hear from you!
Open-source contributors
- Leads: Wei-Lin Chiang, Anastasios Angelopoulos
- Contributors: Lianmin Zheng, Ying Sheng, Lisa Dunlap, Christopher Chou, Tianle Li, Evan Frick, Aryan Vichare, Naman Jain, Dacheng Li, Siyuan Zhuang
- Advisors: Ion Stoica, Joseph E. Gonzalez, Hao Zhang, Trevor Darrell
Learn more
Contact Us
- Follow us on X and Discord, or email us at lmarena.ai@gmail.com
- File issues on GitHub
- Download our datasets and models on HuggingFace
Acknowledgment
We thank the SkyPilot and Gradio teams for their system support. We also thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Fireworks AI, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship. Learn more about partnership here.