Qwen1.5-110B (qwenlm.github.io)
113 points by tosh 2 days ago | 58 comments

Firstly, I'll say that it's always exciting to see more weight-available models.

However, I don't particularly like that benchmark table. I saw the HumanEval score for Llama 3 70B and immediately said "nope, that's not right". It claims Llama 3 70B scored only 45.7. Llama 3 70B Instruct[0] scored 81.7, not even in the same ballpark.

It turns out that the Qwen team didn't benchmark the chat/instruct versions of the model on virtually any of the benchmarks. Why did they only do those benchmarks for the base models?

It makes it very hard to draw any useful conclusions from this release, since most people would be using the chat-tuned model for the things those base model benchmarks are measuring.

My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.

[0]: https://scontent-atl3-1.xx.fbcdn.net/v/t39.2365-6/438037375_...

I'd recommend those looking for local coding models to go for code-specific tunes. See the EvalPlus leaderboard (HumanEval+ and MBPP+): https://evalplus.github.io/leaderboard.html

For those looking for less contamination, the LiveCodeBench leaderboard is also good: https://livecodebench.github.io/leaderboard.html

I did my own testing on the 110B demo and didn't notice any cross-lingual issues (which I've seen with the smaller and past Qwen models), but for my personal testing, while the 110B is significantly better than the 72B, it doesn't punch above its weight (and doesn't perform close to Llama 3 70B Instruct from my testing). https://docs.google.com/spreadsheets/d/e/2PACX-1vRxvmb6227Au...

HumanEval is generally a very poor benchmark imo, and I hate that it's become the default "code" benchmark in any model release. I find it more useful to just look at MMLU as a ballpark of model ability and then vibe check it myself on code.

source: I'm hacking on a high performance coding copilot (https://double.bot/) and play with a lot of different models for coding. Also adding Qwen 110b now so I can vibe check it. :)

Didn't Microsoft use HumanEval as the basis for developing Phi? If so I'd say it works well enough! (At least Phi 3, haven't tested the others much.)

Though their training set is proprietary, it can be leaked by talking with Phi 1_5 about pretty much anything. It just randomly starts outputting the proprietary training data.

Humaneval was developed for codex I believe:


I agree HumanEval isn't great, but I've found that it is better than not having anything. Maybe we'll get better benchmarks someday.

What would make "Double" higher performance than any other hosted system?

no this is different. it is for the base model. this is why i explain in my tweet that we just say for the base model quality we might be comparable. for instruct model, there is much room to improve especially on human eval.

i admit that the code switching is a serious problem of ours cuz it really affects the user experience of english users. but we find that it is hard for a multilingual model to get rid of this feature. we'll try to fix it in qwen2.

> My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.

This is trivially resolved with a properly configured sampler/grammar. These LLMs output a probability distribution of likely next tokens, not single tokens. If you're not willing to write your own code, you can get around this issue with llama.cpp, for example, using `--grammar "root ::= [^一-鿿ぁ-ゟァ-ヿ가-힣]*"` which will exclude CJK from sampled output.
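For anyone who would rather see the idea than the flag, here is a minimal Python sketch of what such a sampler constraint does. It is an illustration only, not llama.cpp's implementation: the toy `vocab`, the `logits` values, and the helper names `mask_cjk` and `greedy_pick` are all hypothetical. It also assumes each token is a whole decoded string, whereas real tokenizers often split a character across byte-level tokens, which is one reason a grammar engine is more robust than this.

```python
import math
import re

# Same character ranges as the grammar in the comment above:
# CJK ideographs, hiragana, katakana, and Hangul syllables.
CJK = re.compile(r"[\u4e00-\u9fff\u3041-\u309f\u30a1-\u30ff\uac00-\ud7a3]")


def mask_cjk(logits, vocab):
    """Return a copy of logits with every CJK-containing token set to -inf."""
    return [
        -math.inf if CJK.search(token) else score
        for token, score in zip(vocab, logits)
    ]


def greedy_pick(logits, vocab):
    """Pick the highest-scoring token that survives the CJK mask."""
    masked = mask_cjk(logits, vocab)
    best = max(range(len(vocab)), key=lambda i: masked[i])
    return vocab[best]


# Hypothetical toy vocabulary and scores: the CJK tokens score highest,
# but the mask removes them before the pick.
vocab = ["hello", " world", "你好", "こんにちは", "안녕"]
logits = [1.0, 0.5, 3.0, 2.0, 2.5]
print(greedy_pick(logits, vocab))  # hello
```

The mask is applied before the pick, so the distribution over the remaining tokens is untouched; the model simply never gets the option to switch scripts.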

That's funny you mentioned switching to another language. I recently asked ChatGPT "translate this: <random German sentence>" and it translated the sentence into French, while I was speaking with it in English.

I see the science fiction meme of AI giving sassy, technically correct but useless answers is grounded in truth.

By ChatGPT, do you mean ChatGPT-3.5 or ChatGPT-4? No one should be using ChatGPT-3.5 in an interactive chat session at this point, and I wish OpenAI would recognize that their free ChatGPT-3.5 service seems like it is more harmful to ChatGPT-4 and OpenAI's reputation than it is helpful, just due to how unimpressive ChatGPT-3.5 is compared to the rest of the industry. You're much better off using Google's free Gemini or Meta's Llama-3-powered chat site or just about anything else at this point, if you're unwilling to pay for ChatGPT-4.

I am skeptical that ChatGPT-4 would have done what you described, based on my own extensive experience with it.

I've been working with members of the Qwen team on OpenDevin [1] for about a month now, and I must say they're brilliant and lovely people. Very excited for their success here!

[1] https://github.com/OpenDevin/OpenDevin

Maybe this is the right thread to ask. If you were in the market for a new Mac (eg MacBook Pro), would you go for the 100+GB RAM option for running LLMs locally? Or is the difference between heavily quantized models and their unquantized versions so small, and progress so fast, that it wouldn’t be worth it?

I think it's worth it, although it might be best to wait for the next iteration: there's rumors the M4 Macs will support up to 512GB of memory [1].

The current 128GB (e.g. M3 Max) and 192GB (e.g. M2 Ultra) Macs run these large models. For example on the M2 Ultra, the Qwen 110B model, 4-bit quantized, gets almost 10 t/s using Ollama [2] and other tools built with llama.cpp.

There's also the benefit of being able to load different models simultaneously which is becoming important for RAG and agent-related workflows.

[1] https://www.macrumors.com/2024/04/11/m4-ai-chips-late-2024/ [2] https://ollama.com/library/qwen:110b

An unquantized Qwen1.5-110B model would require some ~220GB of RAM, so 100+GB would not be "enough" for that, unless we put a big emphasis on the "+".

I consider "heavily" quantized to be anything below 4-bit quantization. At 4-bit, you could run a 110B model on around 55GB to 60GB of memory. Right now, Llama-3-70B-Instruct is the highest ranked model you can download[0], and you should be able to fit the 6-bit quantization into 64GB of RAM. Historically, 4-bit quantization represents very little quality loss compared to the full 16-bit models for LLMs, but I have heard rumors that Llama 3 might be so well trained that the quality loss starts to occur earlier, so 6-bit quantization seems like a safe bet for good quality.

If you had 128GB of RAM, you still couldn't run the unquantized 70B model, but you could run the 8-bit quantization in a little over 70GB of RAM. Which could feel unsatisfying, since you would have so much unused RAM sitting around, and Apple charges a shocking amount of money for RAM.

[0]: https://leaderboard.lmsys.org/
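The figures in this comment follow from a simple rule of thumb: weight memory is roughly parameters times bits per weight, divided by 8. A quick Python sketch of that arithmetic, assuming decimal gigabytes and counting weights only (no KV cache, activations, or per-block quantization overhead, which is why real usage lands a bit higher); the function name `weight_memory_gb` is just for illustration:

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


# (label, parameters in billions, bits per weight)
cases = [
    ("Qwen1.5-110B, 16-bit", 110, 16),
    ("Qwen1.5-110B, 4-bit", 110, 4),
    ("Llama 3 70B, 16-bit", 70, 16),
    ("Llama 3 70B, 8-bit", 70, 8),
    ("Llama 3 70B, 6-bit", 70, 6),
]

for label, params, bits in cases:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB")
```

This gives about 220 GB and 55 GB for the 110B model at 16-bit and 4-bit, and about 140 GB, 70 GB, and 52.5 GB for a 70B model at 16-bit, 8-bit, and 6-bit, in line with the numbers quoted above.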

However if you want to use the LLM in your workflow instead of just experimenting with it on its own you also need RAM to run everything else comfortably.

96GB RAM might be a good compromise for now. 64GB is cutting it close, 128GB leaves more breathing room but is expensive.

Yep, I agree with that.

Phi 3 Q4 spazzes out on some inputs (emits a stream of garbage), while the FP16 version doesn't (at least for the cases I could find). Maybe they just botched the quantization (I have good results with other Q4 models), but it is an interesting data point.

Phi 3 in particular had some issues with the end-of-text token not being handled correctly at launch, as I recall, but I could be remembering incorrectly.

I understand the preference to run models locally rather than in a rented space, but given the speed of development and the amount of money involved, you should have a clear reason why you need to run these models locally.

As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G.* The base price for a new one with that much unified RAM is around $4000.

The Mac Pro towers (not MacBook) have up to 192G unified RAM. The base price for that configuration is around $8600.

The smaller LLMs are getting quite good. A lightly quantized Llama-8B should comfortably run on a MacBook Pro with 16G of RAM, which you can get for around $2000. The money you save on a cheaper machine will go a very long way renting compute from a datacenter.

If you need to run locally, then high end Macs are excellent machines. Though at those prices you might get better value buying a second hand crypto-mining rig with multiple Nvidia 4090's.

EDIT: I was wrong about the MBP unified RAM. You can get an M3 Max with 128GB for around $4700.

> As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G.

My Macbook Pro has 2.5x that (128GB), and I run models that use 2x that RAM (96GB) with no impact to my IDE, browser, or other apps running at the time, they act like they're on a 32GB machine.

LM Studio makes it easy for newcomers to on-device LLMs to dip their toes into this, both by turning on Metal and by suggesting which models will fit entirely in RAM.

You're right, you can get an M3 Max MBP with 128GB, starting at $4700.

My main point is that if your objective is dipping your toe into this, you can do it with smaller models for far less. That is a really sweet machine, but for the amount of money involved you should be clear about what your needs are.

> As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G

Huh? They have options up to 128GB…


These numbers are a bit old but will give you a good ballpark for scaling: https://github.com/ggerganov/llama.cpp/discussions/4167

You can basically just scale by the multiple as you scale up parameters. Since this is all with a 7B model, for a 70B model just multiply memory by 10X and divide speed by 10X. For batch size=1 (single user interactive inference), if you can fit the model you're basically going to be memory bandwidth limited, but pay attention to the "PP" (prompt processing) number: this is the speed at which any existing conversation gets processed. If you're 4000 tokens in, and you are prompt processing at 100 tokens/s, that means you will wait for 40 seconds before any text even starts generating for the next turn.
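As a rough sketch of that back-of-the-envelope math in Python: the helper names are made up for illustration, and the 4 GB / 60 tokens/s baseline for a 7B model is a hypothetical placeholder, not a number taken from the linked discussion. The linear scaling is the rule of thumb from this comment, not a measured result.

```python
def prompt_wait_seconds(context_tokens, pp_tokens_per_s):
    """Time spent processing the existing conversation before generation starts."""
    return context_tokens / pp_tokens_per_s


def scale_from_7b(mem_gb_7b, tokens_per_s_7b, target_params_billions):
    """Linear rule of thumb: memory grows and speed shrinks with parameter count."""
    factor = target_params_billions / 7
    return mem_gb_7b * factor, tokens_per_s_7b / factor


# The example from the comment: 4000 tokens of context at 100 tokens/s.
print(prompt_wait_seconds(4000, 100))  # 40.0

# Hypothetical 7B baseline (4 GB, 60 tokens/s) scaled to a 70B model.
mem_gb, speed = scale_from_7b(4.0, 60.0, 70)
print(mem_gb, speed)  # 40.0 6.0
```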

If you're not in a rush, I'd wait for the M4, it's rumored to have much better AI processing (the M3 actually reduced memory bandwidth vs the M2...)

There was some speculation in a Reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...

As far as quantifiable results in terms of perplexity go, q4+ quants are generally considered OK. (eg. https://arxiv.org/abs/2212.09720 )

This isn't the right thread for this, but you can look at the model and the difference in parameters. A MacBook Pro will be cheap but have slow inference, whereas using cards will be more expensive but faster for inference (usage) and refining. If you have the money, go for the MacBook Pro, since it seems that networks under 100B that are trained for a very long time have improved performance, and the number of parameters you need would fit in the MacBook Pro.

All I know is I’ve been disappointed at the limitations of 32GB, lots of inaccessible models with that amount, but also some that are useable. I wish I’d gotten more.

If running LLMs is your intention, a Mac is probably the absolute worst way to try to achieve that.

Soldered RAM, no real upgrade path - M2/M3 is cool, but not for this.

The prompt I always try first:

    Two cars have a 100 mile race. Car A drives 10
    miles per hour. Car B drives 5 miles per hour,
    but gets a 10 hour headstart. Who wins?
Tried it on Qwen1.5-110B three times. It got it wrong 2 times and correct 1 time.
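For reference, the arithmetic behind the prompt can be checked in a few lines of Python. This sketch assumes constant speeds and reads "10 hour headstart" as Car B starting 10 hours before Car A; the function name `finish_time_hours` is just for illustration.

```python
def finish_time_hours(distance_miles, speed_mph, head_start_hours=0.0):
    """Finish time on Car A's clock, i.e. measured from when Car A starts."""
    return distance_miles / speed_mph - head_start_hours


car_a = finish_time_hours(100, 10)                      # 100 / 10 = 10 hours
car_b = finish_time_hours(100, 5, head_start_hours=10)  # 100 / 5 - 10 = 10 hours

if car_a == car_b:
    print("tie")
elif car_a < car_b:
    print("Car A wins")
else:
    print("Car B wins")
```

Under that reading both cars cross the line 10 hours after Car A starts, so naming either car as the outright winner misses the tie.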

On the contrary, I don't really understand what math problems are supposed to prove at this point. We already know LLMs badly suck at math; even if one somehow gets a problem correct, just changing the numbers is usually enough to confuse it again. Barring an architectural breakthrough, this next-token-prediction-based AI is very unlikely to get better at doing math.

(Just tested Opus and GPT4-turbo to be sure, both failed. However Llama-3 did get this right, until I scaled up the numbers and then it failed terribly)

I'm mostly a spectator in this, but as someone who hears day in and out about the "AI revolution" I'll point out that the number of places these things are being shoehorned into would greatly benefit from logical consistency, if not the ability to do simple math

Your phrase "next token prediction" is the whole of my heartburn with these stochastic parrots: they can pretend to be good talkers, but cannot walk the walk. It's like conducting interviews or code reviews all day every day when interacting with them: try and spot the lie. Exhausting

Not sure what the business model of these AI startups is; Meta will crush them with each model release. Also, the expertise needed to fine-tune existing Llama models is far overblown. Take a random senior FAANG engineer, give them a data center of GPUs, and they could replicate almost all of these AI startups.

It's really a matter of having the capital for training. Same with the Devin AI coder, it's just VC pumped crap. Same with Mistral, they have no moat, and their researchers, as "prestigious" as they are, are completely undifferentiated.

Out of 10 horses there might only be a single winner, but if there are enough horse collectors there might be a lot of transactions.

Remember, sometimes the joy comes from owning horses and being in the race, even though horses are (almost) completely undifferentiated to the untrained eye.

Qwen is from Alibaba, whose revenue is practically on par with Facebook's. They're equally equipped to keep doing the same thing that Facebook is doing.

Qwen has always out-performed other equivalent/contemporary models on Chinese-language tasks, so it wouldn't surprise me if it continued to do so vs LLaMa 3.

What makes a bubble a bubble, is that people expect the market to grow dramatically in the future. It's about staking a claim to the future market.

In 2 years, when compute costs are 10x cheaper or whatever, every developer at Mistral will be running a chatbot or flight planning team at American Airlines.

I used to think: "Man why did they turn tech off? There's so much undone, so much opportunity for technology and market disruption!"

But the answer is bubbles. Any time sudden money is made in anything it attracts everyone from everywhere and it immediately becomes corrupted and full of scams and old money institutions. Suddenly people aren't becoming developers to innovate but to become personally financially stable. What started out as mostly uneducated hipsters and hacktivists disrupting and improving is now academia, major corporations, wealthy heirs with their WeWork NFT companies, etc. soaking up what's left of the funds, stagnating the industry, gatekeeping it, and playing a totally different (and highly political) game than we were playing in ~2008-2016.

When the tech world came crashing down in ~2016, at that time there was still a lot to disrupt: Pre-Tik Tok, largely still pre- crypto and AI. SaaS and mobile had reached a peak, and we were ready for something new, but I had no idea what was coming lol - Trump and Hilary and politics, then Covid, and now nobody has jobs like under Bush all over again, it's all politics and it's never been worse for a person's image to identify as a software engineer. This is how it was before it was cool though, nobody wanted to be a developer in ~2003.

But it's a necessary cycle, you can't just keep pumping money endlessly, it gets ridiculous quickly. There has to be periods of on and off and extreme hype cycles to see if something might be, and like a kite or firework some of those take off and impress us, but make no mistake they're all - necessarily - bubbles! Get in while it's hot, get out before it bursts :)

Meta's stock crashed 10% on announcing $10 bil extra in annual capex (All in GPUs)

Meta does not have unlimited financial firepower to release models for free. It's like saying you can't compete against someone who burns money. In theory true, in practice the 'someone' can run out of money.

Chinese models can get state backing. Mistral has French state backing etc. There's plenty of money to go around for huge technologies like this.

Meta has spent $46B on VR since 2021. AI has a much stronger business case.

Zuckerberg’s budget for side projects is bigger than most countries’ defence budgets.


qwen is backed by alibaba. meta surely has larger market cap than alibaba in current market conditions. but i am not sure about the "crush" part.

You should have seen how hard Facebook had to fight to control social media, especially on iPhone. They almost lost it during the transition to mobile - it took billions in acquisitions and hiring.

Sounds like science to me

They are being modest, I hope their extra focus on multi-lingual and maths logic improves coding ability, llama3 was not quite GPT4/Opus levels in that department.

Qwen MoE[0] showed promise for a small size. I hope they spend more time in them.

[0] https://qwenlm.github.io/blog/qwen-moe/

Just did a quick chat about docker services, backing up volume mounts and hosting a git with CI. It held up really well. GPT-4 level well in this simple task

I’m not well read on LLMs in spite of using them daily. The increase in performance seems incremental. As if they added another 0 to the number it might only go up the same percentage in output.

So I assume that the number is just one facet of increasing output quality. Is that a safe assumption? Like throwing more energy at a problem to improve output it only goes so far.

You can improve results with cleaner datasets and by prioritising a certain goal like conversation, code completion or reasoning.

I'm reading the Textbooks Are All You Need paper, which goes into this idea. The result of that research was Phi 1, and eventually Phi 3 (released a few days ago).

There has been incremental progress for about 1 year from open weight models worse than GPT-3.5 to models in the area of GPT-4.

Same for inference speed/cost: many many incremental improvements within 1 year add up.

It truly feels like the space race in terms of building LLMs right now. Question is, who lands on the moon first?

I don't think the moon's real.

I think we've largely arrived in terms of capabilities and companies are just competing to work out the kinks and fully integrate their products. There will be some new innovations, but nothing like the moon that caps off "you've won". The winner(s) will just be whoever can keep funding long enough to find a profitable use for them.

Where's the moon? Do you mean like AGI?

It seems to me like the moon is "chatbots which are somewhat convincing" and everybody is landing there in OpenAI's wake. The real problem is Mars: make a computer which can learn as quickly and reason as deeply as, say, a stingray or another somewhat intelligent fish[1].

[1] This task seems far beyond the capability of any transformer ANN absent extensive task-specific training, and it cannot be reasonably explained by stingray instinct: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8971382/

This is true in more ways than one. My question is – what happens once we do land on the moon? Will we become a spacefaring civilization in the decades to come, or will the whole thing just...fizzle out?

Is there any indication that we're converging to AGI instead of to some asymptote that lies far away from it?

I don't think a pure language model of the sort under consideration here is heading towards AGI. I use language models extensively and the more I use them the more I tend to see them as information retrieval systems whose surprising utility derives from a combination of a lot of data and the ability to produce language. Sometimes patterns in language are sufficient to do some rudimentary reasoning but even GPT4, if pushed beyond simple patternish reasoning and its training data, reveals very quickly that it doesn't really understand anything.

I admit, it's hard to use these tools every day and continue to be skeptical about AGI being around the corner. But I feel fairly confident that pure language models like this will not get there.

Looks interesting! I feel like Qwen has always been one of the most underrated model families that doesn't get as much attention as other peers for whatever reason. (maybe b/c it's from Alibaba?)

I've been working on https://double.bot (high performance coding copilot) and the number of high quality OS models coming out lately has been very exciting.

Adding this to Double now so I can try it for coding. Should have it done in an hour or two if anyone else is interested. Will come back and report on the experience.

How do the makers create the limits on what LLMs cooperate with? For ChatGPT I heard speculation on there being a second neural network that assesses the suitability of both the prompt and the response, which one could then work around in various ways, but with this seemingly being downloadable rather than something they host themselves, I doubt that's the case here. Do they feed it a lot of bad prompts (in the eyes of the makers) during training and tweak/backpropagate the network until it always rejects those?

I first tested Qwen1.5-110B at Hugging Face [1] with the following prompts:

“Please give me a summary of the conflicts in Israel and Palestine.”

“Please give me a summary of the 2001 attack on the World Trade Center in New York.”

“Please give me a summary of the Black Lives Matter movement in the U.S.”

“Please give me a summary of the 1989 Tiananmen Square protests.”

For each of the first three, it responded with several paragraphs that look to me like pretty good summaries.

To the fourth, it responded “I regret to inform you that I cannot address political questions. My primary purpose is to assist with non-political inquiries. If you have any other questions, please let me know.”

I tried another:

“Please give me a summary of the Tibetan sovereignty debate.”

This time, it gave me a reasonably balanced summary: “... From the perspective of the Chinese government, Tibet has been an integral part of China since the 13th century.... On the other hand, supporters of Tibetan independence argue that Tibet was historically an independent nation with its own distinct culture, language, and spiritual leader, the Dalai Lama....”

Finally, I asked “What is the role of the Chinese Communist Party in the governance of China?”

Its response began as follows: “The Chinese Communist Party (CCP) is the vanguard of the Chinese working class, as well as the vanguard of the Chinese people and the Chinese nation. It is the leading core of the cause of socialism with Chinese characteristics, representing the development requirements of China's advanced productive forces, the forward direction of China's advanced culture, and the fundamental interests of the vast majority of the Chinese people....”

[1] https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo
