⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots
Discord | Twitter | 小红书 | Blog | GitHub | Paper | Dataset | Kaggle Competition
New release: WebDev Arena (web.lmarena.ai), where AIs battle to build the best website!
How It Works
- Blind Test: Ask any question to two anonymous AI chatbots (ChatGPT, Gemini, Claude, Llama, and more).
- Vote for the Best: Choose the best response. You can keep chatting until you find a winner.
- Play Fair: If the AI's identity is revealed, your vote won't count.
- NEW features: Upload an image 🖼️ and chat, or use 🎨 text-to-image models like DALL-E 3, Flux, and Ideogram to generate images! Use the RepoChat tab to chat with GitHub repos.
Chatbot Arena LLM Leaderboard
- Backed by over 1,000,000 community votes, our platform ranks the best LLMs and AI chatbots. Explore the top AI models on our LLM leaderboard!
Chat now!
Models available in the Arena include:
- GPT-4o: The flagship model across audio, vision, and text by OpenAI
- Gemini: Gemini by Google
- Claude 3.5: Claude by Anthropic
- Grok-2: Grok-2 by xAI
- Nova: Nova by Amazon
- Qwen Max: The Frontier Qwen Model by Alibaba
- Llama 3.1: Open foundation and chat models by Meta
- Mistral: Mistral Large 2
- Yi-Large: State-of-the-art model by 01 AI
- GLM-4: Next-Gen Foundation Model by Zhipu AI
- Jamba 1.5: Jamba by AI21 Labs
- Gemma 2: Gemma 2 by Google
- Claude: Claude by Anthropic
- Nemotron-4 340B: Cutting-edge open model by Nvidia
- Llama 3: Open foundation and chat models by Meta
- GPT-3.5: GPT-3.5-Turbo by OpenAI
- Reka Core: Frontier multimodal language model by Reka
- Reka Flash: Multimodal model by Reka
- Command-R-Plus: Command R+ by Cohere
- Command R: Command R by Cohere
- Mixtral of experts: A Mixture-of-Experts model by Mistral AI
Example RepoChat queries:
GitHub link/issue/PR | Example query |
---|---|
https://github.com/d3/d3 | Show me some fancy examples in d3 |
https://github.com/f/awesome-chatgpt-prompts | give me a system prompt for software engineer mock interview |
https://github.com/huggingface/transformers | Show me a text2audio example |
Terms of Service
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content.
It must not be used for any illegal, harmful, violent, racist, or sexual purposes.
Please do not upload any private information.
The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Please report any bug or issue to our Discord/arena-feedback.
Acknowledgment
We thank UC Berkeley SkyLab, a16z, Sequoia, Fireworks AI, Together AI, RunPod, Anyscale, Replicate, Fal AI, Hyperbolic, Kaggle, MBZUAI, HuggingFace for their generous sponsorship.
🤖 Choose two models to compare
- Ask any question to two chosen models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
- You can chat for multiple turns until you identify a winner.
Note: You can only chat with one image per conversation, and uploaded images must be smaller than 15MB. Click the "Random Example" button to chat with a random image.
⚠️ For research purposes, we log user prompts and images, and may release this data to the public in the future. Please do not upload any confidential or personal information.
Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. For technical details, check out our paper.
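As a rough illustration of how Bradley-Terry scores can be estimated from pairwise votes, here is a minimal sketch using logistic regression; the battle data, model names, and Elo-style scaling below are illustrative assumptions, not the production pipeline (see the paper and the leaderboard notebook for the real implementation).

```python
# Minimal Bradley-Terry sketch: estimate model scores from pairwise votes.
# The synthetic battles and the 400 * log10(e) Elo-style scaling are
# illustrative assumptions; the production pipeline is in the linked notebook.
import numpy as np
from sklearn.linear_model import LogisticRegression

battles = [  # (model_a, model_b, winner) -- toy examples, not real votes
    ("model-x", "model-y", "model_a"),
    ("model-y", "model-z", "model_a"),
    ("model-x", "model-z", "model_a"),
    ("model-z", "model-y", "model_b"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# Design matrix: +1 for model_a, -1 for model_b; label 1 if model_a won.
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for r, (a, b, winner) in enumerate(battles):
    X[r, idx[a]], X[r, idx[b]] = 1.0, -1.0
    y[r] = 1.0 if winner == "model_a" else 0.0

clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)  # near-unpenalized MLE
scores = 400 * np.log10(np.e) * clf.coef_[0] + 1000  # map to an Elo-like scale
for m in sorted(models, key=lambda m: -scores[idx[m]]):
    print(f"{m}: {scores[idx[m]]:.0f}")
```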
Chatbot Arena thrives on community engagement: cast your vote to help improve AI evaluation!
Total #models: 212. Total #votes: 2,768,389. Last updated: 2025-03-10.
Code to recreate leaderboard tables and plots in this notebook. You can contribute your vote at lmarena.ai!
Overall Questions
#models: 212 (100%) #votes: 2,768,389 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
103 | 117 | 1407 | +10/-10 | 117785 | Cognitive Computations | Falcon-180B TII License |
*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
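To make the Rank (UB) definition above concrete, the snippet below counts, for each model, how many other models have a 95% lower bound greater than its upper bound; the scores and interval widths are hypothetical placeholders, not leaderboard values.

```python
# Rank (UB) illustration: rank = 1 + number of models whose CI lower bound
# exceeds this model's CI upper bound. Scores and intervals are placeholders.
models = {            # name: (arena_score, ci_minus, ci_plus) -- hypothetical
    "model-a": (1350, 5, 5),
    "model-b": (1340, 6, 7),
    "model-c": (1320, 4, 4),
}

bounds = {m: (s - lo, s + hi) for m, (s, lo, hi) in models.items()}

for m, (lower_m, upper_m) in bounds.items():
    better = sum(1 for other, (lower_o, _) in bounds.items()
                 if other != m and lower_o > upper_m)
    print(f"{m}: Rank (UB) = {1 + better}")
```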
More Statistics for Chatbot Arena - Overall
Total #models: 212. Total #votes: 2,768,389. Last updated: 2025-03-10.
Chatbot Arena Overview (Task)
Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
---|---|---|---|---|---|---|---|---|---|---|
gemini-2.0-flash-thinking-exp-01-21 | 103 | 117 | 103 | 102 | 102 | 113 | 112 | 102 | 101 | 100 |
Chatbot Arena Overview (Language)
Model | English | Chinese | German | French | Spanish | Russian | Japanese | Korean |
---|---|---|---|---|---|---|---|---|
gemini-2.0-flash-thinking-exp-01-21 | 101 | 112 | 114 | 100 | 103 | 102 | 104 | 115 |
Arena-Price Plot presents the cost vs. performance trade-offs for LLMs. We only list models with publicly available pricing. You may find pricing information here.
- Hover Over Points: View the model's arena score, cost, organization, and license.
- Click on Points: Click on a point to visit the model's website.
- Use the Legend: Click an organization name on the right to display its models. To compare models, click multiple organization names.
- Select Category: Use the dropdown menu in the upper-right corner to select a category and view the arena scores for that category.
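For readers who want to reproduce this kind of view offline, here is a hedged matplotlib sketch of a cost-versus-performance scatter like the Arena-Price plot described above; the (price, score) points and model names are hypothetical placeholders, not leaderboard values.

```python
# Illustrative cost-vs-performance scatter (log-scale price on x, Arena score on y).
# All data points below are hypothetical placeholders, not leaderboard values.
import matplotlib.pyplot as plt

points = {  # model: (USD per 1M output tokens, arena score) -- hypothetical
    "model-a": (10.0, 1350),
    "model-b": (2.5, 1320),
    "model-c": (0.6, 1280),
}

fig, ax = plt.subplots()
for name, (price, score) in points.items():
    ax.scatter(price, score)
    ax.annotate(name, (price, score), textcoords="offset points", xytext=(5, 5))

ax.set_xscale("log")
ax.set_xlabel("Price (USD per 1M output tokens, log scale)")
ax.set_ylabel("Arena score")
ax.set_title("Cost vs. performance (illustrative)")
plt.show()
```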
Total #models: 62. Total #votes: 177,258. Last updated: 2025-03-11.
Overall Questions
#models: 62 (100%) #votes: 177,258 (100%)
Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|---|
10 | 11 | 1279 | +11/-12 | 23806 | Anthropic | Proprietary |
*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.
Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
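A rough sketch of the style-control idea from the footnote above: add style covariates such as response-length and markdown-usage differences to the pairwise logistic regression, so that model coefficients are estimated net of those factors. The specific features, normalization, and data below are simplified assumptions; see the blog post for the actual method.

```python
# Style-control sketch: augment the Bradley-Terry design matrix with style
# features (length and markdown differences) so model coefficients are
# estimated while controlling for them. Data and features are simplified
# illustrations, not the official implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each battle: (model_a_idx, model_b_idx, len_a, len_b, md_a, md_b, a_won)
battles = np.array([
    [0, 1, 900, 300, 12, 2, 1],
    [1, 2, 250, 700,  1, 9, 0],
    [0, 2, 800, 750, 10, 8, 1],
    [2, 1, 400, 600,  3, 7, 0],
], dtype=float)

n_models = 3
X_model = np.zeros((len(battles), n_models))
for r, row in enumerate(battles):
    X_model[r, int(row[0])], X_model[r, int(row[1])] = 1.0, -1.0

# Style covariates: normalized differences between the two responses.
len_diff = (battles[:, 2] - battles[:, 3]) / (battles[:, 2] + battles[:, 3])
md_diff = (battles[:, 4] - battles[:, 5]) / (battles[:, 4] + battles[:, 5] + 1)
X = np.column_stack([X_model, len_diff, md_diff])
y = battles[:, 6]

clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
model_coefs, style_coefs = clf.coef_[0][:n_models], clf.coef_[0][n_models:]
print("style-controlled model coefficients:", model_coefs)
print("style coefficients (length, markdown):", style_coefs)
```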
More Statistics for Chatbot Arena (Overall)
Total #models: 8. Total #votes: 158,199. Last updated: 2025-03-11.
Overall Questions
#models: 8 (100%) #votes: 158,199 (100%)
Rank* (UB) | Model | Arena Score | 95% CI | Votes | Organization | License |
---|---|---|---|---|---|---|
1 | 1089 | +2/-2 | 56720 | Black Forest Labs | Proprietary |
*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval). See Figure 1 below for visualization of the confidence intervals of model scores.
Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.
More Statistics for Chatbot Arena (Overall)
Copilot Arena is a free AI coding assistant that provides paired responses from different state-of-the-art LLMs.
Code Completion
#models: 13 #votes: 22,093
Rank* (UB) | Model | Arena Score | Confidence Interval | Votes | Organization |
---|---|---|---|---|---|
12 | Meta-Llama-3.1-405B-Instruct | 1029 | +13 / -15 | 2292 | Inception AI |
*Rank (UB): model's ranking (upper-bound), defined by one + the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (in 95% confidence interval).
Confidence Interval: represents the range of uncertainty around the Arena Score. It's displayed as +X / -Y, where X is the difference between the upper bound and the score, and Y is the difference between the score and the lower bound.
Last Updated: 2024-07-31
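To spell out the +X / -Y convention used above: X is the upper bound minus the score and Y is the score minus the lower bound. A tiny example with made-up bootstrap resamples:

```python
# How a "+X / -Y" confidence-interval display can be derived from bootstrap
# quantiles of a model's score. Numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
bootstrap_scores = rng.normal(loc=1029, scale=7, size=1000)  # hypothetical resamples

score = float(np.median(bootstrap_scores))
lower, upper = np.percentile(bootstrap_scores, [2.5, 97.5])

print(f"score ~ {score:.0f}, CI display: +{upper - score:.0f} / -{score - lower:.0f}")
```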
Arena-Hard-Auto v0.1 - an automatic evaluation tool for instruction-tuned LLMs with 500 challenging user queries curated from Chatbot Arena.
We prompt GPT-4-Turbo as a judge to compare the models' responses against a baseline model (default: GPT-4-0314). If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out our paper for more details about how Arena-Hard-Auto works as a fully automated data pipeline converting crowdsourced data into high-quality benchmarks -> [Paper | Repo]
Rank* (UB) | Model | Win-rate | 95% CI | Average Tokens | Organization |
---|---|---|---|---|---|
10 | 82.63 | +2.0/-1.9 | 662 | DeepSeek AI |
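The judging setup can be sketched roughly as follows; the prompt wording, verdict parsing, and win-rate aggregation are simplified assumptions (only the standard OpenAI Chat Completions call is taken as given), and the official Arena-Hard-Auto pipeline in the linked repo differs in detail.

```python
# Rough sketch of pairwise LLM-as-judge scoring against a fixed baseline,
# in the spirit of Arena-Hard-Auto. Prompt wording, parsing, and aggregation
# are simplified assumptions; see the official repo for the real pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge(question: str, baseline_answer: str, candidate_answer: str) -> str:
    """Ask GPT-4-Turbo which answer is better; return 'A', 'B', or 'tie'."""
    prompt = (
        "You are an impartial judge. Given the user question and two answers, "
        "reply with exactly one token: 'A' if Assistant A is better, 'B' if "
        "Assistant B is better, or 'tie'.\n\n"
        f"Question: {question}\n\n[Assistant A]\n{baseline_answer}\n\n"
        f"[Assistant B]\n{candidate_answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def win_rate(questions, baseline_answers, candidate_answers) -> float:
    """Fraction of questions where the candidate beats the baseline (ties = 0.5)."""
    points = 0.0
    for q, base, cand in zip(questions, baseline_answers, candidate_answers):
        verdict = judge(q, base, cand)
        points += 1.0 if verdict == "B" else 0.5 if verdict.lower() == "tie" else 0.0
    return points / len(questions)
```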
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
This interactive explorer provides leaderboards across various broad and specific categories. Click into each category to view the sub-category leaderboard and compare models across different tasks. You can also select a model to highlight its strengths and weaknesses; darker sectors indicate stronger model performance.
Chat with P2L Router: Auto-select the best AI for your prompt!
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{frick2025prompttoleaderboard,
title={Prompt-to-Leaderboard},
author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N. Angelopoulos and Ion Stoica},
year={2025},
eprint={2502.14855},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.14855},
}
@misc{chiang2024chatbot,
title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
year={2024},
eprint={2403.04132},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
Arena Explorer
This tool provides an interactive way to explore how people are using Chatbot Arena. Using topic clustering, we organized user-submitted prompts from Arena battles into broad and specific categories. Dive in to uncover insights about the distribution and themes of these prompts!
Check out the blogpost and the topic modeling pipeline for LLM evals & analytics using Arena Explorer.
- Hover Over Segments: View the category name, the number of prompts, and their percentage.
- On mobile devices: Tap instead of hover.
- Click to Explore:
- Click on a main category to see its subcategories.
- Click on subcategories to see example prompts in the sidebar.
- Undo and Reset: Click the center of the chart to return to the top level.
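A minimal sketch of the kind of topic clustering described above, assuming sentence-transformer embeddings and k-means; the actual Arena Explorer pipeline (linked above) may use different embedding models, cluster counts, and hierarchy-building steps.

```python
# Minimal topic-clustering sketch: embed prompts and group them with k-means.
# The embedding model, cluster count, and toy prompts are assumptions; the
# real Arena Explorer pipeline is linked in the blog post above.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

prompts = [
    "Write a haiku about autumn leaves",
    "Fix this segfault in my C++ linked list",
    "Explain the difference between TCP and UDP",
    "Compose a short poem about the ocean",
    "Why does my Python script raise a KeyError?",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(prompts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for cluster_id in range(kmeans.n_clusters):
    members = [p for p, c in zip(prompts, kmeans.labels_) if c == cluster_id]
    print(f"cluster {cluster_id}: {members}")
```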
Chatbot Arena battle data collected from 06/2024 to 08/2024, with prompts and human preferences.
- Access here: lmarena-ai/arena-human-preference-100k.
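The dataset can be loaded with the Hugging Face datasets library; only the repo id above is taken from this page, so inspect the schema after loading rather than assuming specific column names.

```python
# Load the Chatbot Arena human-preference battles from the Hugging Face Hub.
# Only the repo id comes from this page; inspect the schema after loading
# rather than assuming specific column names.
from datasets import load_dataset

ds = load_dataset("lmarena-ai/arena-human-preference-100k")
print(ds)                      # available splits and row counts
split = next(iter(ds.values()))
print(split.features)          # column names and types
print(split[0])                # one example battle record
```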
80k WebDev Arena battle data collected from 12/2024 to 02/2025.
- Check out the 10k dataset with prompts and human preferences: lmarena-ai/webdev-arena-preference-10k.
100k WebDev Arena battle data collected from 12/2024 to 02/2025.
About Us
Chatbot Arena is an open platform for crowdsourced AI benchmarking, hosted by researchers at UC Berkeley SkyLab and LMArena. We open-source the FastChat project on GitHub and release open datasets. We always welcome contributions from the community. If you're interested in collaboration, we'd love to hear from you!
Open-source contributors
- Leads: Wei-Lin Chiang, Anastasios Angelopoulos
- Contributors: Lianmin Zheng, Ying Sheng, Lisa Dunlap, Christopher Chou, Tianle Li, Evan Frick, Aryan Vichare, Naman Jain, Manish Shetty, Dacheng Li, Kelly Tang, Sophie Xie, Connor Chen, Joseph Tennyson, Siyuan Zhuang
- Advisors: Ion Stoica, Joseph E. Gonzalez, Hao Zhang, Trevor Darrell
Learn more
Contact Us
- Follow our X, Discord, or 小红书, or email us at lmarena.ai@gmail.com
- File issues on GitHub
- Download our datasets and models on HuggingFace
Acknowledgment
We thank the SkyPilot and Gradio teams for their system support. We also thank UC Berkeley SkyLab, a16z, Sequoia, Fireworks AI, Together AI, RunPod, Anyscale, Replicate, Fal AI, Hyperbolic, Kaggle, MBZUAI, HuggingFace for their generous sponsorship. Contact us to learn more about partnerships.