This is a bilingual snapshot of https://lmarena.ai/?leaderboard saved by a user on 2025-3-13 23:38, with bilingual support provided by Immersive Translate.

🏆 Chatbot Arena LLM Leaderboard: Community-driven Evaluation for Best LLM and AI chatbots

Discord | Twitter | 小红书 | Blog | GitHub | Paper | Dataset | Kaggle Competition

Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. For technical details, check out our paper.
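As a rough illustration of how a Bradley-Terry model can turn pairwise votes into a ranking, here is a minimal sketch. It is not the leaderboard's own code. The function names, the (winner, loser) input format, and the 400-point scale with a 1000-point offset are all assumptions made for this example. The fit uses the classic minorization-maximization update for Bradley-Terry strengths.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Ties are assumed to have been dropped beforehand, and every model is
    assumed to have at least one win so that all strengths stay positive.
    """
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # battles per unordered pair of models
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games.get(frozenset((i, j)), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so renormalize.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Bradley-Terry probability that model a beats model b."""
    return strength[a] / (strength[a] + strength[b])


def to_rating(strength, scale=400.0, offset=1000.0):
    """Map strengths to an Elo-like scale (scale and offset are assumptions)."""
    return {m: offset + scale * math.log10(s) for m, s in strength.items()}
```

With three wins for one model and one for the other, the fitted win probability is 0.75, matching the observed win fraction. With more than two models the estimate also accounts for the strength of each opponent.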

Chatbot Arena thrives on community engagement: cast your vote to help improve AI evaluation!

New Launch! WebDev Arena: web.lmarena.ai - AI Battle to build the best website!

Total #models: 212.    Total #votes: 2,768,389.    Last updated: 2025-03-10.

The code to recreate the leaderboard tables and plots is in this notebook. You can contribute your vote at lmarena.ai!


Overall Questions

#models: 212 (100%).    #votes: 2,768,389 (100%).

Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License
103 | 117 | (missing) | 1407 | +10/-10 | 117785 | Cognitive Computations | Falcon-180B TII License

*Rank (UB): the model's ranking (upper bound), defined as one plus the number of models that are statistically better than the target model. Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence level). See Figure 1 below for a visualization of the confidence intervals of model scores.
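The Rank (UB) definition can be sketched directly in code. This is a minimal illustration, assuming the confidence intervals are already available as (lower, upper) score bounds per model. The function name and the example numbers are hypothetical.

```python
def rank_upper_bound(intervals):
    """Compute Rank (UB) from {model: (lower, upper)} 95% CI score bounds.

    A model's rank is one plus the number of other models whose lower bound
    lies strictly above this model's upper bound.
    """
    ranks = {}
    for model, (_, upper) in intervals.items():
        better = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != model and other_lower > upper
        )
        ranks[model] = 1 + better
    return ranks


# Hypothetical intervals: "a" and "b" overlap, so both share rank 1.
example = {"a": (1400, 1420), "b": (1395, 1410), "c": (1300, 1320)}
print(rank_upper_bound(example))
```

Because overlapping intervals do not count as "statistically better", several models can share the same rank, and the next rank after a tie skips ahead accordingly.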

Rank (StyleCtrl): model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See blog post for further details.

Note: in each category, we exclude models with fewer than 300 votes as their confidence intervals can be large.

More Statistics for Chatbot Arena - Overall

Figure 1: Confidence Intervals on Model Strength (via Bootstrapping)

[Plot; x-axis: Model, y-axis: Rating (ticks from 1300 to 1420). Models shown (25): grok-3-preview-02-24, gpt-4.5-preview-2025-02-27, early-grok-3, gemini-2.0-flash-thinking-exp-01-21, gemini-2.0-pro-exp-02-05, chatgpt-4o-latest-20250129, gemini-exp-1206, chatgpt-4o-latest-20241120, gemini-exp-1121, gemini-2.0-flash-thinking-exp-1219, deepseek-r1, gemini-2.0-flash-exp, gemini-2.0-flash-001, o1-2024-12-17, gemini-exp-1114, chatgpt-4o-latest-20240903, gemma-3-27b-it, qwen2.5-max, o1-preview, o3-mini-high, deepseek-v3, chatgpt-4o-latest-20240808, glm-4-plus-0111, qwen-plus-0125, gemini-2.0-flash-lite-preview-02-05.]
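A bootstrap confidence interval of the kind named in the Figure 1 caption can be sketched as follows. This is a generic percentile bootstrap, not the leaderboard's own procedure. The leaderboard applies the idea to model scores, while this example uses a simple win rate as the statistic to stay self-contained. The function names, the (winner, loser) input format, and the default of 1000 rounds are assumptions.

```python
import random


def bootstrap_interval(battles, statistic, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(battles).

    Each round resamples the battles with replacement and recomputes the
    statistic; the interval is read off the sorted estimates.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(rounds):
        sample = rng.choices(battles, k=len(battles))
        estimates.append(statistic(sample))
    estimates.sort()
    lo_idx = int((alpha / 2) * rounds)
    hi_idx = min(rounds - 1, int((1 - alpha / 2) * rounds))
    return estimates[lo_idx], estimates[hi_idx]


def win_rate_of_a(sample):
    """Fraction of (winner, loser) battles in the sample won by model "A"."""
    return sum(1 for winner, _ in sample if winner == "A") / len(sample)


# Hypothetical data: model "A" wins 60 of 100 battles against "B".
battles = [("A", "B")] * 60 + [("B", "A")] * 40
print(bootstrap_interval(battles, win_rate_of_a))
```

To get intervals on ratings instead, the statistic would refit the rating model on each resample and return the score of the model of interest.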

Figure 2: Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)

[Bar chart; x-axis: Model, y-axis: Average Win Rate (ticks from 0 to 0.6). Values in the order captured: gemini-1.5-pro-api-0409-preview 0.57, grok-3-preview-02-24 0.56, gemini-exp-1206 0.56, chatgpt-4o-latest-20240903 0.55, chatgpt-4o-latest-20240808 0.55, gemini-2.0-flash-thinking-exp-1219 0.55, gemini-exp-1121 0.53, gemini-2.0-flash-thinking-exp-01-21 0.53, gemini-1.5-pro-exp-0801 0.52, gemini-1.5-pro-exp-0827 0.52, bard-jan-24-gemini-pro 0.52, chatgpt-4o-latest-20241120 0.52, gpt-4.5-preview-2025-02-27 0.52, o1-preview 0.51, gemini-advanced-0514 0.51, gemini-2.0-pro-exp-02-05 0.50, chatgpt-4o-latest-20250129 0.50, early-grok-3 0.50, gemini-2.0-flash-exp 0.48, gemini-exp-1114 0.47, gpt-4-1106-preview 0.45, gpt-4-0314 0.44, claude-1 0.43, gpt-4o-2024-05-13 0.40, gpt-3.5-turbo-0314 0.31.]
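One reading of the Figure 2 caption is that each model's win fraction is computed separately against every opponent and then averaged with equal weight per opponent, so that heavily sampled pairings do not dominate. The sketch below follows that reading. It is an assumption about the method, not a confirmed description, and the function name and (winner, loser) input format are hypothetical.

```python
from collections import defaultdict


def average_win_rate(battles):
    """Average per-opponent win fraction from (winner, loser) pairs.

    Ties are assumed to have been removed. Each opponent a model has faced
    contributes equally, regardless of how many battles were played.
    """
    wins = defaultdict(int)       # wins[(x, y)]: number of times x beat y
    opponents = defaultdict(set)  # models each model has faced
    for winner, loser in battles:
        wins[(winner, loser)] += 1
        opponents[winner].add(loser)
        opponents[loser].add(winner)

    rates = {}
    for model, faced in opponents.items():
        fractions = [
            wins.get((model, other), 0)
            / (wins.get((model, other), 0) + wins.get((other, model), 0))
            for other in faced
        ]
        rates[model] = sum(fractions) / len(fractions)
    return rates


# Hypothetical data: "b" beats "c" ten times, but that pairing still counts once.
battles = [("a", "b")] * 3 + [("b", "a")] + [("a", "c"), ("c", "a")] + [("b", "c")] * 10
print(average_win_rate(battles))
```

In this example "b" wins 11 of its 14 battles overall, yet its average win rate is 0.625, because the lopsided pairing against "c" is weighted the same as the pairing against "a".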

Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles

[Heatmap; x-axis: Model B, y-axis: Model A, color scale from 0 to 1. Both axes list the same 25 models, from grok-3-preview-02-24 to gemini-2.0-flash-lite-preview-02-05, in the same order as Figure 1. The individual cell values are not reproduced here.]
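The quantity in the Figure 3 caption can be sketched as a pairwise win fraction that ignores ties and ignores which side of the battle a model was shown on. The record layout (model_a, model_b, outcome) with outcomes "a", "b", or "tie" is a hypothetical schema chosen for this example, as is the function name.

```python
from collections import defaultdict


def pairwise_win_fraction(battles):
    """Fraction of non-tied battles between x and y that x won.

    battles: iterable of (model_a, model_b, outcome) where outcome is "a",
    "b", or anything else for a tie. Models in a battle are assumed distinct.
    """
    wins = defaultdict(int)  # wins[(x, y)]: number of times x beat y
    for model_a, model_b, outcome in battles:
        if outcome == "a":
            wins[(model_a, model_b)] += 1
        elif outcome == "b":
            wins[(model_b, model_a)] += 1
        # Any other outcome is treated as a tie and skipped.

    fraction = {}
    for pair in {frozenset(key) for key in wins}:
        x, y = sorted(pair)
        total = wins.get((x, y), 0) + wins.get((y, x), 0)
        fraction[(x, y)] = wins.get((x, y), 0) / total
        fraction[(y, x)] = wins.get((y, x), 0) / total
    return fraction


# Hypothetical records; the tie between m1 and m2 is ignored.
records = [
    ("m1", "m2", "a"),
    ("m2", "m1", "a"),
    ("m1", "m2", "a"),
    ("m1", "m2", "tie"),
    ("m1", "m3", "b"),
]
print(pairwise_win_fraction(records))
```

Pairs that never met in a non-tied battle simply have no entry, which corresponds to an empty cell in a heatmap of these fractions.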

Figure 4: Battle Count for Each Combination of Models (without Ties)

[Heatmap; x-axis: Model B, y-axis: Model A, color scale from 0 to 700 battles. Axis labels run from grok-3-preview-02-24 to gemini-2.0-flash-lite-preview-02-05. The individual cell values are not reproduced here.]
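Counting battles per model pair, with ties excluded as in the Figure 4 caption, can be sketched in a few lines. The record layout (model_a, model_b, outcome) with the outcome "tie" marking a tie is a hypothetical schema for this example, and the count is stored symmetrically so that either ordering of a pair gives the same number.

```python
from collections import Counter


def battle_counts(battles):
    """Count non-tied battles for each ordered pair of models.

    battles: iterable of (model_a, model_b, outcome); records whose outcome
    is "tie" are skipped. Counts are symmetric in the two models.
    """
    counts = Counter()
    for model_a, model_b, outcome in battles:
        if outcome == "tie":
            continue
        counts[(model_a, model_b)] += 1
        counts[(model_b, model_a)] += 1
    return counts


# Hypothetical records: three decisive m1 vs. m2 battles and one tie.
records = [
    ("m1", "m2", "a"),
    ("m2", "m1", "a"),
    ("m1", "m2", "b"),
    ("m1", "m2", "tie"),
    ("m1", "m3", "b"),
]
print(battle_counts(records))
```

Low counts for a pair are a useful warning sign when reading the corresponding win fraction, since a fraction based on a handful of battles is noisy.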

Citation

Please cite the following paper if you find our leaderboard or dataset helpful.

@misc{chiang2024chatbot,
    title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
    author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
    year={2024},
    eprint={2403.04132},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}

Terms of Service

Users are required to agree to the following terms before using the service:

The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.

Please report any bug or issue to our Discord/arena-feedback.

Acknowledgment

We thank UC Berkeley SkyLab, a16z, Sequoia, Fireworks AI, Together AI, RunPod, Anyscale, Replicate, Fal AI, Hyperbolic, Kaggle, MBZUAI, and HuggingFace for their generous sponsorship.