Qwen1.5-110B

coder543 · 2024-04-26T13:22:23

Firstly, I'll say that it's always exciting to see more weight-available models.
首先，我要说的是，看到更多可用重量的模型总是令人兴奋的。

However, I don't particularly like that benchmark table. I saw the HumanEval score for Llama 3 70B and immediately said "nope, that's not right". It claims Llama 3 70B scored only 45.7. Llama 3 70B Instruct[0] scored 81.7, not even in the same ballpark.
然而，我并不特别喜欢那个基准表。我看到了 Llama 3 70B 的 HumanEval 分数，立刻就说“不，那不对”。它声称 Llama 3 70B 只得了 45.7 分。而 Llama 3 70B Instruct[0]得分是 81.7，甚至不在同一个数量级。

It turns out that the Qwen team didn't benchmark the chat/instruct versions of the model on virtually any of the benchmarks. Why did they only do those benchmarks for the base models?
结果显示，Qwen 团队几乎没有在任何基准测试中对聊天/指导版本的模型进行基准测试。他们为什么只对基础模型进行这些基准测试呢？

It makes it very hard to draw any useful conclusions from this release, since most people would be using the chat-tuned model for the things those base model benchmarks are measuring.
从这个版本中得出任何有用的结论非常困难，因为大多数人会使用聊天调优模型来衡量那些基础模型基准所衡量的事物。

My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.
我以前使用 Qwen 版本的经验是，这些模型也有随机切换到中文说几句话的习惯。我想知道这个模型是否更擅长用英语回答英语问题？也许我们需要一个基准，来衡量一个LLM在回答问题时能否坚持使用与问题相同的语言，这需要涵盖一系列不同的语言。

[0]: https://scontent-atl3-1.xx.fbcdn.net/v/t39.2365-6/438037375_...

reply 回复

lhl · 2024-04-26T15:14:37

I'd recommend those looking for local coding models to go for code-specific tunes. See the EvalPlus leaderboard (HumanEval+ and MBPP+): https://evalplus.github.io/leaderboard.html
我建议那些寻找本地编码模型的人选择特定于代码的曲调。请参阅 EvalPlus 排行榜（HumanEval+和 MBPP+）：https://evalplus.github.io/leaderboard.html

For those looking for less contamination, the LiveCodeBench leaderboard is also good: https://livecodebench.github.io/leaderboard.html
对于那些寻找污染较少的人，LiveCodeBench 排行榜也很好：https://livecodebench.github.io/leaderboard.html

I did my own testing on the 110B demo and didn't notice any cross-lingual issues (which I've seen with the smaller and past Qwen models), but for my personal testing, while the 110B is significantly better than the 72B, it doesn't punch above its weight (and doesn't perform close to Llama 3 70B Instruct from my testing). https://docs.google.com/spreadsheets/d/e/2PACX-1vRxvmb6227Au...
我在 110B 演示上进行了自己的测试，没有注意到任何跨语言问题（我在较小和过去的 Qwen 模型中看到过），但对于我的个人测试，虽然 110B 明显优于 72B，但它并没有超出其重量级（并且在我的测试中，它的表现并不接近 Llama 3 70B Instruct）。https://docs.google.com/spreadsheets/d/e/2PACX-1vRxvmb6227Au...