How much energy does ChatGPT use?
Published
Credit to Alex Erben and Ege Erdil for substantial help with research and calculations. In this issue, “we” refers to our collective judgment.
A commonly-cited claim is that powering an individual ChatGPT query requires around 3 watt-hours of electricity, or 10 times as much as a Google search.1 This is often brought up to express concern over AI’s impact on the environment, climate change, or the electric grid.
However, we believe that this figure of 3 watt-hours per query is likely an overestimate. In this issue, we revisit this question using a similar methodology, but with up-to-date facts and clearer assumptions. We find that typical ChatGPT queries using GPT-4o likely consume roughly 0.3 watt-hours, about one-tenth of the older estimate. This difference comes from more efficient models and hardware compared to early 2023, as well as an overly pessimistic assumption about token counts in the original estimate.
For context, 0.3 watt-hours is less than the amount of electricity that an LED lightbulb or a laptop consumes in a few minutes. And even for a heavy chat user, the energy cost of ChatGPT will be a small fraction of the overall electricity consumption of a developed-country resident. The average US household2 uses 10,500 kilowatt-hours of electricity per year, or over 28,000 watt-hours per day.3
This estimate of 0.3 watt-hours is actually relatively pessimistic (i.e. erring towards high energy costs), and it is possible that many or most queries are actually much cheaper still. However, this will vary with use case—queries with long input lengths or with longer outputs (e.g. using reasoning models) may be substantially more energy-intensive than 0.3 watt-hours. And training and inference for future models may consume much more energy than using ChatGPT today.

See this spreadsheet for sources for everyday electricity uses.
Estimating the energy cost of a query
ChatGPT and other chatbots are powered by large language models (LLMs). Running these models (also known as inference) requires compute, and the chips and data centers that process that compute require electricity, roughly in proportion to the amount of compute required.
Below, I’ll walk through a summary of how to estimate the compute and energy cost of a ChatGPT query. You can find a more detailed version with arguments for every assumption in the appendix.
- ChatGPT actually runs on several different models, but let’s use GPT-4o as a reference model, because this is still OpenAI’s leading general-purpose model. OpenAI’s new reasoning models (o1, o3-mini, and the upcoming o3) likely require more energy, but are probably less popular, at least right now. In particular, OpenAI’s new Deep Research product, which is rate-limited and requires their $200/month subscription tier, is certainly far more compute-intensive than a simple ChatGPT query. Meanwhile, GPT-4o mini is smaller and cheaper than GPT-4o. We’ll discuss these other models in more detail in a later section.
- LLMs generate outputs in units called tokens—for OpenAI, a token represents 0.75 words on average. Floating-point operations (FLOP) are a standard unit for measuring compute, and generating a token requires approximately two FLOP for every active parameter in the model. We previously estimated that GPT-4o has roughly 200 billion total parameters (likely between 100 and 400 billion). It is also most likely a mixture-of-experts model, meaning not all of these parameters are activated at once. Pessimistically taking the high estimate of 400 billion total parameters and assuming ¼ are active at a time gives 100 billion active parameters. This means that 2 * 100 billion = 200 billion FLOP are needed to generate one token.
- I assume that a typical number of output tokens per query is 500 tokens (~400 words, or roughly a full page of typed text). This is somewhat pessimistic—for example, Chiang et al. found an average response length of 269 tokens in a dataset of chatbot conversations. This also assumes text-based conversations—it’s unclear how many tokens are needed in conversations with GPT-4o’s advanced voice mode, and we don’t consider the cost of generating images here.
- This leads to 500 * 2 * 100 billion = 1e14 FLOP for a GPT-4o query with 500 output tokens (this arithmetic is sketched in code just below this list). There is also an additional cost for queries with lengthy inputs, which I discuss below.
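As a quick sanity check, here is that arithmetic as a short Python sketch. The parameter and token counts are this issue’s assumptions, not confirmed figures for GPT-4o.

```python
# Per-query compute estimate, using this issue's assumptions
# (these are estimates, not confirmed figures for GPT-4o).
ACTIVE_PARAMS = 100e9   # assumed active parameters (of ~400B total)
OUTPUT_TOKENS = 500     # pessimistic assumption for tokens per response
FLOP_PER_PARAM = 2      # FLOP per active parameter per generated token

flop_per_token = FLOP_PER_PARAM * ACTIVE_PARAMS   # 2e11 FLOP per token
flop_per_query = flop_per_token * OUTPUT_TOKENS   # 1e14 FLOP per query
print(f"{flop_per_query:.0e} FLOP per query")     # -> 1e+14
```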
Next, we can find the energy cost of this compute based on the power consumption of the necessary AI chips:
- I assume that OpenAI uses Nvidia H100 GPUs for ChatGPT inference. These have a power rating of 700 watts, but H100 clusters can consume up to ~1500 W per GPU due to the overhead costs of servers and data centers.4
- H100s can perform up to 989 trillion (9.89e14) FLOP per second. At that rate, it takes 1e14 / 9.89e14 ~= 0.1 seconds of H100-time to process a query (though in reality, many queries are processed in parallel in batches, so end-to-end generation time is much longer than this).
- However, GPUs can’t actually achieve their max FLOP/second output in practice. I use 10% as a rough estimate of typical utilization rates for inference clusters. This increases the number of GPUs needed to process queries by 10x, so roughly one second of H100-time is required to process a ChatGPT query.
- On the flip side, average power consumption per GPU will be less than 1500 W in practice. One estimate from the literature suggests that inference GPUs may consume around 70% of peak power on average.5
Putting this together: one second of H100-time per query, 1500 watts per H100, and a 70% factor for power utilization comes to 1050 watt-seconds of energy, or around 0.3 watt-hours per query. This is about a tenth of the widely cited 3 watt-hour estimate!
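In code, under the same assumptions (10% compute utilization, ~1500 W per GPU including overhead, and 70% average power draw), the conversion looks like this:

```python
# Convert ~1e14 FLOP per query into energy, under this issue's
# hardware assumptions (all rough estimates, not measured values).
FLOP_PER_QUERY = 1e14
H100_PEAK_FLOPS = 989e12   # dense 16-bit FLOP/s
UTILIZATION = 0.10         # assumed compute utilization for inference
WATTS_PER_GPU = 1500       # peak power incl. server/data-center overhead
POWER_FACTOR = 0.70        # average draw as a fraction of peak

gpu_seconds = FLOP_PER_QUERY / (H100_PEAK_FLOPS * UTILIZATION)  # ~1 second
watt_hours = gpu_seconds * WATTS_PER_GPU * POWER_FACTOR / 3600
print(f"~{watt_hours:.2f} Wh per query")   # -> ~0.29 Wh, i.e. roughly 0.3
```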
One last factor to account for is the cost of processing input tokens. This is negligible for simple questions that are manually typed, but when the inputs are very long, processing inputs becomes significant relative to the cost of generating outputs. Some chatbot use cases involve uploading large documents, codebases, or other forms of data, and GPT-4o supports an input window of up to 128,000 tokens.
In the appendix, we estimate that a long input of 10k tokens (similar to a short paper or long magazine article) would substantially increase the cost per query to around 2.5 watt-hours, and a very long input length of 100k tokens (equal to roughly 200 pages of text) would require almost 40 watt-hours of energy. Note that this input processing is an upfront cost, so if you have an extended chat exchange after uploading a long document, this cost is not incurred with each message. Additionally, it is almost certainly possible to dramatically improve how the cost of processing inputs scales with very long inputs, but we’re not sure if OpenAI has done this yet.6
Why is our estimate different from others?
The original three watt-hour estimate, which has been widely cited by many different researchers and media outlets, comes from Alex de Vries (2023).
The most important reason our estimate differs is that we use a more realistic assumption for the number of output tokens in typical chatbot usage. We also base our estimate on a newer and more efficient chip (NVIDIA H100 vs A100), and a model with somewhat fewer active parameters.
In the original estimate, De Vries cites a February 2023 estimate from SemiAnalysis of the compute cost of inference for ChatGPT. This calculation assumed 175B parameters for GPT-3.5 (vs our assumed active parameter count of 100B for GPT-4o), running on A100 HGX servers (less efficient than the more modern H100), and most importantly, assumed 4000 input tokens and 2000 output tokens per query. This is equivalent to 1500 words, which is likely quite unrepresentative of typical queries (for context, it is about half as long as this newsletter issue, excluding the appendix). De Vries then converts this compute cost to energy using the A100 server’s max power capacity of 800 W per GPU, while we assume servers consume 70% of peak power.
In addition, Luccioni et al. measured a ~4 Wh energy consumption per query for BLOOM-176B, a model comparable in size to GPT-4o. However, this was for a model deployed for research purposes in 2022: their cluster handled a relatively low volume of requests, likely with low utilization, and didn’t use batching, a basic inference optimization that is standard in any commercial deployment of LLMs (see the appendix for more details on this paper).
What about other models besides GPT-4o?
While GPT-4o serves as our reference model, there are numerous other models available both within ChatGPT and in chatbot products from other companies. OpenAI offers GPT-4o-mini, which, based on its API pricing at under one-tenth the cost of GPT-4o, likely has a significantly smaller parameter count7 and therefore energy cost. In contrast, OpenAI’s o1 and o3 reasoning models may consume substantially more energy to operate.
o1’s parameter count (and cost per-token) is unclear, but its compute and energy costs per query are most likely higher due to the lengthy chain-of-thought that it generates. This cost is even higher with the $200/month o1 pro, which uses the same o1 model but scales up inference compute even further.
The newly-released o3-mini deserves special attention here, because it is powerful while also being fast enough that it could significantly displace GPT-4o in usage in the coming months, especially if it receives new product features like image/file upload and voice mode. o3-mini also has an undisclosed parameter count, though it is advertised as a “small” model and its per-token cost on the API is only 44% as much as GPT-4o’s. So there is a good chance that o3-mini has a lower per-token energy cost than GPT-4o, but this could easily be outweighed by long reasoning chains.
I informally tested o3-mini and o1 using OpenAI’s API8, and found that they both generate around 2.5x as many tokens as GPT-4o. This was based on a small sample of my own chats and a more rigorous analysis would be useful here.9 Overall, it is currently unclear if o3-mini queries consume more or less energy than GPT-4o queries, but o1 queries most likely require more energy.
Today, the “o”-series models are almost certainly much less popular than 4o and 4o-mini, which are available for free,10 are faster, and have more features (e.g. o1 and o3-mini do not have PDF upload or voice mode as of writing). Also, o1’s superior reasoning abilities are unnecessary for most of today’s chatbot use cases, like answering simple questions or drafting emails. It is possible that a shift to reasoning models significantly drives up energy costs for the average chatbot user, but if that happens, that probably means that AI usage has shifted towards much more difficult or complex problems compared to how they are used today. OpenAI’s new, o3-powered Deep Research product is an early sign of this.
Beyond OpenAI and ChatGPT, other major chatbot products include Meta’s AI assistant (likely powered by Llama 3.2 11B or 90B since Meta’s product is multimodal), Anthropic’s Claude (powered by Claude 3.5 Sonnet, which we estimate has 400B parameters), and Google’s Gemini (powered by Gemini “Flash” and “Pro” models, with undisclosed parameter counts). We don’t have full details on all of these models, but they are probably all roughly comparable to either GPT-4o or 4o-mini in energy costs. Finally, DeepSeek-V3, which we wrote about recently, and its R1 reasoning variant, offers strong performance with just 37B active parameters (out of 671B total parameters), so there is a good chance that it is less energy-intensive per token than GPT-4o.
Moving forward, it’s unclear how energy costs for AI chatbots will evolve—they could easily go up or down over time, and this could diverge dramatically by use case. Holding capabilities constant, language models will become more energy-efficient over time due to both hardware and algorithmic improvements. The AI and tech industry has been developing more efficient hardware, continually improving the capabilities of smaller models, and inventing inference optimizations like multi-token prediction, all of which drive energy costs down.11 However, if consumers shift to increasingly powerful chatbots and assistants that do increasingly complex tasks, this may eat into these efficiency gains through larger models or models that generate increasingly large numbers of reasoning tokens.12
Training and other upstream costs
Another thing to consider is the upstream energy costs in producing ChatGPT and other LLM products, in addition to the cost of using them.
The most obvious one is the energy required to train the models. The training runs for current-generation models comparable to GPT-4o13 consumed around 20-25 megawatts of power each, lasting around three months. This is enough to power around 20,000 American homes. Since ChatGPT has 300 million users, the energy spent on GPT-4o’s training run is not very significant on a per-user basis. The same article also mentions that ChatGPT users send 1 billion messages per day: using our 0.3 Wh estimate, this suggests that overall ChatGPT inference requires ~12.5 MW of power. This is comparable to the power (temporarily) required to train GPT-4o, but language models are typically used for longer than they are trained (GPT-4o has been available for nine months).
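That fleet-wide figure follows directly from the per-query estimate:

```python
# Implied average power for ChatGPT inference, given ~1 billion
# messages/day (per the cited article) and our ~0.3 Wh per-query estimate.
messages_per_day = 1e9
wh_per_message = 0.3
megawatts = messages_per_day * wh_per_message / 24 / 1e6
print(f"~{megawatts:.1f} MW")   # -> ~12.5 MW
```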
One might also wonder about the upstream energy costs of constructing the GPUs and other hardware, also known as “embodied” energy, in addition to the direct energy cost of running them. A full accounting here would be difficult, but these embodied energy costs are likely much smaller than the direct energy costs.
Luccioni et al. looked at the embodied carbon emissions of the servers and GPUs used to train BLOOM, a 176 billion parameter LLM, estimating that the embodied emissions (amortized over the length of the training run) were ~11.2 tonnes of CO2. This is less than half of the 24.7 tonnes of CO2 emitted by training the model.14 However, the ratio of embodied energy to training energy is even lower, because those training emissions came from clean French electricity, which is around 5 to 10 times less carbon-intensive than the electricity in most countries.15 So the embodied energy cost was likely a small fraction of the direct energy cost.
Another data point is that TSMC, which manufactures a large majority of the world’s AI chips, consumed a total of 24 billion kWh in 2023, which is equivalent to an average power consumption of 2.7 gigawatts. This is in the same ballpark as the power consumption of the GPUs that TSMC produced last year (very roughly, 1 kW per GPU over several million GPUs produced per year). However, GPUs last for multiple years, so the total stock of GPUs probably consumes more than TSMC’s fabs put together. Additionally, much of the 2.7 GW consumed by TSMC actually goes to producing non-AI chips. Overall, this means that it probably takes much less energy to manufacture AI chips than it takes to run them.
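For reference, the conversion from annual consumption to average power:

```python
# TSMC's reported 2023 electricity use, expressed as average power.
kwh_per_year = 24e9
hours_per_year = 365 * 24                  # 8760 hours
gigawatts = kwh_per_year / hours_per_year / 1e6
print(f"~{gigawatts:.1f} GW")              # -> ~2.7 GW
```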
Rigorously measuring all these upstream costs would be difficult, but given the available information, they are probably comparable to or less than the direct cost of inference compute for ChatGPT. And in the case of training, this cost is an upfront cost that doesn’t affect the marginal energy cost of using ChatGPT, since AI models don’t wear out after they are trained.
Discussion
With reasonable and somewhat pessimistic assumptions, a GPT-4o query consumes around 0.3 watt-hours for a typical text-based question, though this increases substantially to 2.5 to 40 watt-hours for queries with very long inputs. As shown in the charts below, this is somewhere between a negligible and a small portion of everyday electricity usage.

There is a lot of uncertainty here around parameter count, utilization, and other factors—you can find more estimates with a wider range of assumptions in this spreadsheet, as well as sources for the other everyday uses of electricity. I’ve tried to err on the side of pessimism (higher energy costs) with every assumption, but different assumptions about parameter counts, utilization, and token output can bring the cost into the 1 to 4 watt-hour range, or down to around 0.1 watt-hours. And again, maxing out GPT-4o’s context window could bring the energy cost into the tens of watt-hours, though this input scaling can be improved with better algorithms.
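To make that sensitivity concrete, here is a small sketch that varies the main assumptions. The scenario values are illustrative choices of ours, not the spreadsheet’s exact cases.

```python
# Sensitivity sketch: how the per-query estimate moves under different
# assumptions. Scenario values are illustrative, not the spreadsheet's.
H100_PEAK_FLOPS = 989e12   # dense 16-bit FLOP/s
WATTS_PER_GPU = 1500       # peak power incl. server/data-center overhead
POWER_FACTOR = 0.70        # average draw as a fraction of peak

def query_wh(active_params, output_tokens, utilization):
    flop = 2 * active_params * output_tokens
    gpu_seconds = flop / (H100_PEAK_FLOPS * utilization)
    return gpu_seconds * WATTS_PER_GPU * POWER_FACTOR / 3600

scenarios = {
    "optimistic":  (50e9, 250, 0.15),    # smaller model, short reply, better utilization
    "central":     (100e9, 500, 0.10),   # the assumptions used in this issue
    "pessimistic": (200e9, 1000, 0.05),  # larger model, long reply, poor utilization
}
for name, args in scenarios.items():
    print(f"{name:>11}: {query_wh(*args):.2f} Wh")
# -> roughly 0.05 / 0.29 / 2.36 Wh
```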
More transparency from OpenAI and other major AI companies would help produce a better estimate. Ideally we could use empirical data from the actual data centers that run ChatGPT and other popular AI products. This may be difficult for AI companies to reveal due to trade secrets, but it seems to me that public confusion on this topic (including many exaggerated impressions of the energy cost of using AI today) is also not in AI developers’ interests.
Taking a broader view, by some estimates AI could reach fairly eye-popping levels of energy usage by 203016, on the order of 10% of US electricity.17 For this reason, I don’t dismiss concerns about AI’s overall impact on the environment and energy, especially in the longer run. However, it’s still an important fact that the current, marginal cost of using a typical LLM-powered chatbot is fairly low by the standards of other ordinary uses of electricity.
Appendix
1. Compute cost
To find the compute cost of inference for an LLM, we have to consider the size of the model and the tokens generated. LLMs generate text in units called tokens (for OpenAI, 1 token represents 0.75 words on average). Generating a token requires two floating-point operations (FLOP) for every parameter in the model (plus some compute required to process inputs, which we’ll discuss below). Note that we are ignoring advanced techniques like speculative decoding or multi-token prediction that could reduce this token-generation cost, because we don’t know whether or not OpenAI uses them.
How many parameters do ChatGPT models have? As of writing, ChatGPT users can choose between several different models, including GPT-4o, GPT-4o mini, o1, o1-mini, and o3-mini.
GPT-4o is the full-featured model for paid subscribers, and GPT-4o-mini is a cheaper and less powerful variant (free users have limited access to GPT-4o and more access to 4o-mini). Meanwhile, o1 and o3 are OpenAI’s new “reasoning” models, which likely require much more compute and energy, but they are also probably much less popular among typical users.
I’ll use GPT-4o as a reference for typical ChatGPT use, though GPT-4o mini might be more popular while requiring much less compute.18 Unfortunately, we don’t know the exact parameter count for GPT-4o or the other models, but we previously estimated that it has 200 billion total parameters, which could be off by roughly a factor of two.
It’s also likely that GPT-4o is a mixture-of-experts (MoE) model, meaning not all of these parameters are activated at once. We think this is likely because MoE seems to be the standard approach for frontier models, due to its compute efficiency: for example, GPT-4 is reportedly an MoE model, Gemini 1.5 Pro is an MoE, and nearly all DeepSeek models including V3 and R1 are MoE.
The ratio of total parameters to active parameters in MoE models can range from roughly 3.6 to 1 for Mistral’s Mixtral 8x7B and 8x22B, to almost 20 to 1 for DeepSeek-V3, which has 37B parameters against 671B total. So we’ll use a 4 to 1 ratio and assume that GPT-4o has 100B active parameters against 400B total parameters. If GPT-4o turned out to be dense, this would also be consistent with the lower end of our estimate for GPT-4o’s total parameters.19 Overall, this is more pessimistic than our best guess for GPT-4o’s active parameters, though there is substantial uncertainty here.
Next, we need to multiply this by the number of tokens that ChatGPT generates per query. This can vary widely by use case, but Chiang et al. found an average response length of 269 tokens in a large dataset of LLM chats. We’ll pessimistically bump this up to 500 tokens, which is equivalent to ~400 words or about half a page of single-spaced typed text. If your ChatGPT usage tends to produce much longer or shorter responses than this, you can adjust our final estimate in proportion to understand your own footprint.
Putting this together, we get 500 * 2 * 100 billion = 1e14 FLOP for a query with 500 output tokens.
2. Energy cost of compute
Next, we need to find the energy cost of producing this much compute.
The leading AI chip for companies other than Google is NVIDIA’s H100 GPU, so I’ll assume ChatGPT inference uses H100s. OpenAI may still use some older and less efficient A100s20, but they will also transition to more efficient Blackwell chips sooner or later. The H100 has a power rating21 of 700 watts, but H100 clusters can consume up to ~1500 W per GPU due to the overhead costs of servers and data centers. For example, a DGX server with 8 H100s, a common server configuration for H100s, has a max power usage of 10.2 kW (1275 W per GPU), and there is an additional 10-20% overhead at the data center level.
H100s can produce up to 989 trillion (9.89e14) FLOP per second. This assumes 16-bit FLOP without sparsity, though many inference providers for open models use 8-bit inference, and H100s can produce twice as many 8-bit FLOP as 16-bit FLOP. Note that NVIDIA’s spec sheet for the H100 reports sparse FLOP/s, so we divide all those numbers by 2.
If the H100s running ChatGPT achieve this peak output, then the 1e14 FLOP required to answer a query would take 1e14 / 9.89e14 ~= 0.1 seconds of H100-time (in reality, servers process many parallel requests in batches, so generation takes much longer than that end-to-end).
Multiplying 0.1 seconds by 1500 watts yields an energy cost of 150 watt-seconds, or ~0.041 watt-hours, which is much lower than the commonly cited estimate of 3 watt-hours. However, this is too optimistic, because GPUs don’t produce 100% of max output in practice. We need to adjust for compute utilization, which is actual compute output divided by the theoretical maximum. Lower utilization increases how many GPUs are required to process queries, which increases energy costs.
How much utilization is achieved during LLM inference? It’s hard to know for sure, but 10% is a reasonable, though rough, assumption. We know that utilization is lower for inference than it is when training AI models (often 30-50%), due to memory bandwidth bottlenecks in inference.22
Another line of evidence is the price per token for open-weight models. Open models have known parameter counts, and hosting APIs for open models is a competitive business (since anyone can download an open model and offer an API for it), so API provider margins are likely low, and their prices are evidence of the true cost of serving the models.
This makes it possible to estimate utilization based on token prices, assuming a 0% profit margin:
- Llama 3.1 405B has 405 billion parameters, so 2 * 405B = 810B (8e11) FLOP are needed per token, and 8e17 FLOP per one million tokens.
- An H100 can perform 2000 teraFLOP/s at peak output (in FP8 FLOP, which is standard for open model providers), or 7.2e18 FLOP per hour. H100s cost around $2 per hour to rent, so at 100% utilization, API providers could achieve 3.6e18 FLOP per dollar.
- This means that if their GPUs ran at 100% utilization, the compute cost of generating a million Llama 3.1 405B tokens would be around 8e17 / 3.6e18 dollars, which is 22 cents.
- In fact, according to prices compiled by Artificial Analysis, Llama 3.1 405B costs $3.50 per million output tokens on average, which is more than 10 times more expensive.
The cost of renting GPUs is almost certainly the biggest cost of running these APIs, so this suggests that providers need about 10x as many GPUs as they would if they achieved 100% utilization. This means that utilization is roughly 10%, increasing the GPUs required, and hence our energy estimate, by 10x.
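Here is that back-of-the-envelope calculation as a sketch, assuming (as above) that GPU rental dominates costs and margins are near zero:

```python
# Back out implied utilization from open-model token prices.
# Assumes GPU rental is the dominant cost and ~0% profit margins.
params = 405e9                                  # Llama 3.1 405B
flop_per_mtok = 2 * params * 1e6                # ~8e17 FLOP per million tokens
h100_fp8_flops = 2000e12                        # peak FP8 FLOP/s
flop_per_dollar = h100_fp8_flops * 3600 / 2.00  # $2/hr rental -> 3.6e18 FLOP/$
cost_at_full_util = flop_per_mtok / flop_per_dollar   # ~$0.22 per million tokens
observed_price = 3.50                           # $/million output tokens
print(f"implied utilization ~= {cost_at_full_util / observed_price:.0%}")
# -> ~6%, which the text above rounds to "roughly 10%"
```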
One final consideration is that a GPU cluster’s average power consumption will be lower than peak consumption, especially given the low compute utilization during inference. We can’t know the actual power consumption of OpenAI’s servers absent specific, real-world data, but there is some literature on measuring inference power consumption for other models.
Patel et al., a team of Microsoft Azure23 researchers, did experiments on inference power consumption for several LLMs and found that GPU power consumption is typically around 60-80% of thermal design power, spiking up to 100% when processing input tokens during prefill (Figure 6). Luccioni et al. also measured inference power consumption, finding that mean power consumption was only about 25% of TDP. However, their setup was very different from ChatGPT’s; for example, they only handled 558 requests per hour, which were processed without batching. By contrast, Patel et al. sent a “steady stream” of inference requests intended to maximize power utilization, so their result of ~70% may be more representative of large-scale deployments, if somewhat pessimistic. Using this figure means we are pessimistically ignoring the possibility of GPUs being idled on occasion due to fluctuations in user demand, which is a plausible consequence of 10% FLOP throughput and would reduce power consumption further.
A compute utilization of 10% and a power utilization of 70% would increase our earlier naive energy estimate by 7x (first divide by 10%, i.e. multiply by 10x, and then multiply by 70%). Applying this to the earlier result, we get 0.041 * 7 ~= 0.3 watt-hours for a GPT-4o query with 500 output tokens. This is roughly a tenth of the widely-cited 3 Wh estimate!
3. Energy cost of input tokens
Processing long inputs can be expensive for language models. Transformer models have an attention mechanism describing the relationships between all of the tokens in the input. In the prefill phase, the model processes the input and computes the values of two matrices used in attention (known as the key and value matrices, or K and V), which are then stored in a KV cache.
This attention-processing cost scales in proportion to the product of the model dimension, the model depth, and the square of the input length. Given an assumed parameter count for GPT-4o, we can find reasonable estimates for the model depth and the dimension of each layer that would produce this parameter count, which allows us to estimate the FLOP cost of attention calculations for any given input length. There is also an added cost of generating tokens given a long input, but this cost is relatively minor for very long inputs.
Because prefill cost scales quadratically with input length for large inputs, this methodology leads to high compute and energy costs for long inputs. For an input of 10k tokens and an output of 500 tokens, the total cost increases to around 2.4 watt-hours, and 100k input tokens and 500 output tokens would cost around 40 watt-hours of energy. Note that because of the KV cache, this is an upfront cost, and a multi-turn chat conversation that begins with a very long input such as an uploaded document will not incur this cost multiple times.
You can find the full calculation in this Colab notebook.
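For intuition, here is a simplified stand-in for that calculation. The layer count and hidden dimension below are hypothetical values chosen only to illustrate the scaling, and constant factors are glossed over; the notebook does the real version.

```python
# Simplified sketch of how prefill FLOP scale with input length.
# DEPTH and D_MODEL are hypothetical illustrative values, not
# known GPT-4o architecture details.
ACTIVE_PARAMS = 100e9   # assumed active parameters
DEPTH = 120             # hypothetical layer count
D_MODEL = 12_000        # hypothetical hidden dimension

def prefill_flop(n_input: int) -> float:
    linear = 2 * ACTIVE_PARAMS * n_input           # per-token parameter work
    attention = 4 * DEPTH * D_MODEL * n_input**2   # grows with input length squared
    return linear + attention

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} input tokens: {prefill_flop(n):.1e} FLOP")
# the quadratic attention term comes to dominate by ~100k tokens
```

Note that converting these FLOP into energy is not a simple multiple of the 0.3 Wh figure, since prefill is compute-bound and achieves much higher utilization than token generation.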
Importantly, quadratic scaling would also imply very high compute and energy costs for inputs of 1 million tokens or more—processing 1M input tokens would be around 100 times more costly in total than processing 100k tokens. However, Google DeepMind has been offering 1 to 2 million token context windows starting with Gemini 1.5, and this was almost certainly enabled by innovations that improve on quadratic scaling. Additionally, more efficient attention mechanisms have been proposed in the literature. So we know it is possible to improve how input costs scale with input length, but we don’t know which innovations, if any, OpenAI in particular has adopted.
Notes
1. The comparison with Google searches comes from a very old estimate from Google in 2009 that each search consumes 0.3 Wh. Estimating Google search’s energy cost today is outside the scope of this post. It could easily be lower today due to increased chip and data center efficiency, or higher if Google searches have become more complex over time (and/or because Google search now has AI features like AI Overviews).
2. Household usage is lower than total usage, since the energy used to e.g. manufacture the goods you own is not incurred inside your house.
3. You can read this essay from Andy Masley for more useful context on how much energy this is, as well as discussion of AI’s water usage.
4. A DGX server with 8 H100s, a common server configuration for H100s, has a max power usage of 10.2 kW (1275 W per GPU). And there is an additional 10-20% overhead at the data center level.
5. This pessimistically assumes a constant stream of inference requests, ignoring the possibility of GPUs idling due to demand fluctuations.
6. Google DeepMind has almost certainly improved on input scaling, in order to enable 2 million token input windows for its Gemini models, though we’re not sure whether OpenAI has adopted these sorts of innovations yet.
7. Parameter count is not necessarily proportional to API prices, since OpenAI could charge different profit margins for different models, and it is possible to serve models more cheaply by reducing token generation speed (GPT-4o-mini is relatively slow for a “mini” model).
8. This was necessary because the full reasoning chain is hidden from users, but the API tells you how many tokens were generated.
9. I asked GPT-4o, o1, and o3-mini five questions from my chat history and found an average response length of 1374 tokens for o3-mini and 1392 tokens for o1, versus 540 tokens for GPT-4o. (I used reasoning effort=medium, which is the default in ChatGPT, though I didn’t use ChatGPT’s system prompt.) So o1 and o3-mini both generated just over 2.5x as many tokens. Meanwhile, DeepSeek-R1 returned an average of 1794 tokens for these questions. This is just a rough sanity check showing that reasoning models can generate many more tokens, and should not be taken as a representative average.
10. o3-mini is also available for free, likely with serious rate limits.
11. See our earlier issue on LLM model sizes for more discussion on this point.
12. Some possible signs of the growing popularity of reasoning models are DeepSeek R1, which is DeepSeek’s new reasoning model, reaching the top of Apple’s app store, and the fact that o3-mini is available to free users.
13. We don’t know how many GPUs were used to train GPT-4o, but we do know this for models we believe to be of comparable scale, such as Llama 3.1 and the original GPT-4.
14. This is training, not inference, but because they divide the total embodied emissions over the computer equipment’s lifetime by the length of the training run, the conclusion isn’t dramatically different if you instead compare embodied emissions amortized over the very short time period in which a cluster serves an inference query against the energy cost of that query.
15. Most of France’s electricity comes from nuclear power. Luccioni et al. assume a carbon intensity of 57 grams CO2 per kWh for BLOOM training, which is around 5-10x cleaner than the electricity in most other countries, and the embodied emissions presumably mostly came from outside France.
16. This is from a starting point of around 0.3% today: Goldman Sachs estimated that AI data centers would consume about 11 terawatt-hours of electricity in 2024, which is about 0.3% of the US’s total consumption of 4 trillion kWh (calculation here).
17. These are both roughly consistent with our projection of electricity supply constraints for AI.
18. GPT-4o mini is substantially smaller, given that it is over 10x cheaper on OpenAI’s API. But I don’t know whether 4o-mini is actually more typical than 4o because the message limits for free users are opaque.
19. These two claims (whether or not GPT-4o is an MoE and its total number of parameters) are not completely independent, since we know GPT-4o was a highly capable model by mid-2024 standards, and this is harder to accomplish as the active parameter count goes down.
20. NVIDIA sales have grown dramatically since the H100 was introduced in late 2022, so the H100 is likely much more common.
21. Officially called “thermal design power”, which is roughly equal to the GPU’s peak power draw.
22. See this blog post from Databricks for more discussion.
23. Microsoft is OpenAI’s main compute provider, though this paper did not test OpenAI models, much less a commercial deployment for ChatGPT.