⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots
Twitter | Discord | Blog | GitHub | Paper | Dataset | Kaggle Competition
📜 How It Works
- Blind Test: Ask any question to two anonymous AI chatbots (ChatGPT, Gemini, Claude, Llama, and more).
- Vote for the Best: Choose the best response. You can keep chatting until you find a winner.
- Play Fair: If a model's identity is revealed, your vote won't count.
- NEW: Click the 🎨 Text-to-Image tab below to generate images with DALL-E 3, Flux, and more! Use the 🐙 RepoChat tab to chat with GitHub repos.
🏆 Chatbot Arena LLM Leaderboard
- Backed by over 1,000,000 community votes, our platform ranks the best LLMs and AI chatbots. Explore the top AI models on our LLM leaderboard!
👇 Chat now!
GPT-4o: The flagship model across audio, vision, and text by OpenAI | Gemini: Gemini by Google | Claude 3.5: Claude by Anthropic |
Nova: Nova by Amazon | Grok-2: Grok-2 by xAI | Llama 3.1: Open foundation and chat models by Meta |
Mistral: Mistral Large 2 | Yi-Large: State-of-the-art model by 01 AI | GLM-4: Next-Gen Foundation Model by Zhipu AI |
Molmo: Molmo by AI2 | GPT-4-Turbo: GPT-4-Turbo by OpenAI | Jamba 1.5: Jamba by AI21 Labs |
Gemma 2: Gemma 2 by Google | Claude: Claude by Anthropic | Nemotron-4 340B: Cutting-edge Open model by Nvidia |
Llama 3: Open foundation and chat models by Meta | Athene-70B: A large language model by NexusFlow | Qwen Max: The Frontier Qwen Model by Alibaba |
GPT-3.5: GPT-3.5-Turbo by OpenAI | Phi-3: A capable and cost-effective small language model (SLM) by Microsoft | Reka Core: Frontier Multimodal Language Model by Reka |
Reka Flash: Multimodal model by Reka | Command-R-Plus: Command R+ by Cohere | Command R: Command R by Cohere |
Mixtral of experts: A Mixture-of-Experts model by Mistral AI | InternVL 2: Multimodal Model developed by OpenGVLab |
GitHub link/issue/PR | Type your query |
---|---|
https://github.com/meta-llama/llama-recipes | fun use cases in llama recipes |
https://github.com/scikit-learn/scikit-learn | show me how to run k-means clustering |
https://github.com/sindresorhus/awesome | Make me a project structure for my food blog React webapp based entirely on awesome libraries. |
Terms of Service
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Please report any bug or issue to our Discord/arena-feedback.
Acknowledgment
We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Fireworks AI, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship.
🤖 Choose two models to compare
- Ask any question to two chosen models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
- You can chat for multiple turns until you identify a winner.
Note: You can chat with only one image per conversation, and each uploaded image must be smaller than 15 MB. Click the "Random Example" button to chat with a random image.
❗️ For research purposes, we log user prompts and images and may release this data to the public in the future. Please do not upload any confidential or personal information.
Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. For technical details, check out our paper.
Chatbot Arena thrives on community engagement — cast your vote to help improve AI evaluation!
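For intuition, Bradley-Terry-style ratings can be fit to pairwise vote data with a logistic regression. Below is a minimal sketch, assuming a battles table with model_a, model_b, and winner columns; the column names, the regularized regression, and the Elo-like rescaling are illustrative assumptions, not the exact leaderboard pipeline (see the linked notebook and paper for the real code).

```python
# Minimal sketch: fit Bradley-Terry-style ratings to pairwise battle outcomes.
# Column names and rescaling constants are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_bradley_terry(battles: pd.DataFrame, scale=400, base=10, init=1000):
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}

    # One row per battle: +1 for model_a, -1 for model_b; label is 1 if model_a won.
    # (Ties are omitted here for brevity.)
    X = np.zeros((len(battles), len(models)))
    X[np.arange(len(battles)), battles["model_a"].map(idx).to_numpy()] = 1.0
    X[np.arange(len(battles)), battles["model_b"].map(idx).to_numpy()] = -1.0
    y = (battles["winner"] == "model_a").astype(int)

    # Default L2 penalty keeps the toy example well-behaved.
    lr = LogisticRegression(fit_intercept=False).fit(X, y)
    # Convert log-odds coefficients to an Elo-like scale for readability.
    ratings = scale / np.log(base) * lr.coef_[0] + init
    return pd.Series(ratings, index=models).sort_values(ascending=False)

# Toy example with made-up battles.
battles = pd.DataFrame({
    "model_a": ["gpt-4o", "gpt-4o", "claude-3-5", "llama-3-70b"],
    "model_b": ["llama-3-70b", "claude-3-5", "llama-3-70b", "gemini-pro"],
    "winner":  ["model_a", "model_a", "model_a", "model_b"],
})
print(fit_bradley_terry(battles))
```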
Total #models: 184. Total #votes: 2,465,686. Last updated: 2024-12-22.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 184 (100%) #votes: 2,465,686 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
100 | 115 | | 1372 | +11/-11 | 117916 | Cognitive Computations | Falcon-180B TII License |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
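To make the Rank (UB) rule above concrete, here is an illustrative sketch; the scores and intervals are made up, not taken from the table.

```python
# Illustrative sketch of the upper-bound rank rule: a model's rank is 1 plus the
# number of models whose lower confidence bound exceeds its upper confidence
# bound. The example intervals below are made up.
def rank_upper_bound(intervals):
    """intervals: dict mapping model name -> (lower_bound, upper_bound) of its Arena Score."""
    ranks = {}
    for model, (_, upper) in intervals.items():
        better = sum(
            1 for other, (lower_o, _) in intervals.items()
            if other != model and lower_o > upper
        )
        ranks[model] = 1 + better
    return ranks

example = {
    "model_x": (1355, 1370),  # hypothetical 95% confidence intervals
    "model_y": (1340, 1352),
    "model_z": (1338, 1350),
}
print(rank_upper_bound(example))  # {'model_x': 1, 'model_y': 2, 'model_z': 2}
```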
More Statistics for Chatbot Arena (Overall)
Task Leaderboard
Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
---|---|---|---|---|---|---|---|---|---|---|
gemini-2.0-flash-thinking-exp-1219 | 100 | 115 | 102 | 100 | 106 | 110 | 100 | 106 | 101 | 104 |
Language Leaderboard
Model | English | Chinese | German | French | Spanish | Russian | Japanese | Korean |
---|---|---|---|---|---|---|---|---|
gemini-2.0-flash-thinking-exp-1219 | 100 | 106 | 101 | 101 | 104 | 100 | 103 | 113 |
Total #models: 49. Total #votes: 154,532. Last updated: 2024-12-22.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 49 (100%) #votes: 154,532 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
10 | 12 | | 1281 | +22/-22 | 22862 | Anthropic | Proprietary |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
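As a rough illustration of the style control mentioned in the note above, one approach is to add style-difference features to the Bradley-Terry regression so that model coefficients are estimated while holding style constant. The feature set below (length, headers, bullets, bold) is an assumption for illustration, not the exact set used by the leaderboard.

```python
# Hedged sketch: style-difference features that could be appended to the battle
# design matrix of a Bradley-Terry logistic regression (see the earlier sketch),
# so model coefficients are estimated while controlling for style.
# The specific features are illustrative assumptions.
import numpy as np

def style_feature_diff(resp_a: str, resp_b: str) -> np.ndarray:
    def feats(text: str) -> np.ndarray:
        return np.array([
            len(text),                            # response length
            text.count("#"),                      # markdown header marks
            text.count("- ") + text.count("* "),  # bullet-list items
            text.count("**"),                     # bold markers
        ], dtype=float)

    fa, fb = feats(resp_a), feats(resp_b)
    # Normalized difference: positive when response A is longer / more formatted.
    return (fa - fb) / (fa + fb + 1e-8)

print(style_feature_diff("## Title\n- point one\n- point two", "short answer"))
```

Appending such columns lets the regression attribute part of each win to style rather than to the model itself; see the blog post for the actual method.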
Copilot Arena is a free AI coding assistant that provides paired responses from different state-of-the-art LLMs. This leaderboard contains the relative performance and ranking of 11 models over 15,635 battles. Download Copilot Arena or learn more in our blog post!
Rank* (UB) | Model | Arena Score | Confidence Interval | Votes | Organization |
---|---|---|---|---|---|
10 | Meta-Llama-3.1-405B-Instruct | 1028 | +17 / -16 | 2294 | Deepseek AI |
*Rank (UB): the model's upper-bound ranking, defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level).
Confidence Interval: the range of uncertainty around the Arena Score, displayed as +X / -Y, where X is the difference between the upper bound and the score and Y is the difference between the score and the lower bound. For example, a score of 1028 with +17 / -16 corresponds to an interval of roughly 1012 to 1045.
Last Updated: 2024-07-31
Arena-Hard-Auto v0.1 - an automatic evaluation tool for instruction-tuned LLMs with 500 challenging user queries curated from Chatbot Arena.
We prompt GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out our paper for more details on how Arena-Hard-Auto works as a fully automated pipeline that converts crowdsourced data into high-quality benchmarks -> [Paper | Repo]
Rank* (UB) | Model | Win-rate | 95% CI | Average Tokens | Organization |
---|---|---|---|---|---|
10 | | 82.63 | +2.0/-1.9 | 662 | DeepSeek AI |
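To illustrate the judging pipeline described above, here is a hedged sketch of how per-query verdicts from an LLM judge could be aggregated into a win rate. The prompt wording and the call_judge placeholder are assumptions, not the official Arena-Hard-Auto prompt or code.

```python
# Hedged sketch of an LLM-as-judge win-rate computation. `call_judge` is a
# hypothetical stand-in for a real GPT-4-Turbo API call, and the prompt below
# is illustrative, not the official Arena-Hard-Auto judge prompt.
JUDGE_TEMPLATE = """You are comparing two answers to the same user query.
Query: {query}

Answer A (baseline): {baseline}
Answer B (candidate): {candidate}

Which answer is better? Reply with exactly one of: A, B, or tie."""

def call_judge(prompt: str) -> str:
    raise NotImplementedError("replace with a real judge-model API call")

def win_rate(queries, baseline_answers, candidate_answers, judge=call_judge):
    wins = ties = 0
    for query, base, cand in zip(queries, baseline_answers, candidate_answers):
        verdict = judge(
            JUDGE_TEMPLATE.format(query=query, baseline=base, candidate=cand)
        ).strip().lower()
        if verdict == "b":
            wins += 1
        elif verdict == "tie":
            ties += 1
    # Count ties as half a win, a common convention for pairwise win rates.
    return 100.0 * (wins + 0.5 * ties) / len(queries)
```

The actual pipeline adds refinements such as judging each pair in both answer orders to reduce position bias; see the paper and repo for details.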
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
About Us
Chatbot Arena is an open platform for crowdsourced AI benchmarking, hosted by researchers at UC Berkeley SkyLab and LMArena. We open-source the FastChat project on GitHub and release open datasets. We always welcome contributions from the community. If you're interested in collaboration, we'd love to hear from you!
Open-source contributors
- Leads: Wei-Lin Chiang, Anastasios Angelopoulos
- Contributors: Lianmin Zheng, Ying Sheng, Lisa Dunlap, Christopher Chou, Tianle Li, Evan Frick, Aryan Vichare, Naman Jain, Dacheng Li, Siyuan Zhuang
- Advisors: Ion Stoica, Joseph E. Gonzalez, Hao Zhang, Trevor Darrell
Learn more
Contact Us
- Follow us on X and Discord, or email us at lmarena.ai@gmail.com
- File issues on GitHub
- Download our datasets and models on HuggingFace
Acknowledgment
We thank the SkyPilot and Gradio teams for their system support. We also thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Fireworks AI, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship. Learn more about partnership here.