February 24, 2025 · 20 minute read
'I paid for the whole GPU, I am going to use the whole GPU': A high-level guide to GPU utilization

A t-shirt that says 'I paid for the whole speedometer, I am going to use the whole speedometer'

Typical attire of a GPU utilization maximizer.

Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s.

Like all co-processors, they are chosen when the performance of more flexible commodity hardware, like an x86 Central Processing Unit (CPU), is insufficient. GPUs are in particular designed for problems where CPUs cannot achieve the desired throughput of mathematical operations (in particular, matrix multiplications).

But GPUs are not cheap: high performance can command a high price.

Combined together, the high price, performance sensitivity, and throughput-orientation of GPU applications mean that a large number of engineers and technical leaders find themselves concerned with GPU utilization of some form or another — “we’re paying a lot, so we’d better be using what we’re paying for”.

At Modal, we have our own GPU utilization challenges to solve and we help our users solve theirs. We’ve noticed that the term “GPU utilization” gets used to mean very different things by people solving problems at different parts of the stack. So we put together this article to share our framework for thinking about GPU utilization across the stack and the tips and tricks we’ve learned along the way.

In particular, we’ll talk about three very different things that all get called “GPU utilization”:

  • GPU Allocation Utilization, the fraction of your GPUs that are running application code,
  • GPU Kernel Utilization, the fraction of time your application is running code on GPUs, and
  • Model FLOP/s Utilization, the fraction of the GPUs’ theoretical arithmetic bandwidth your application is using to run models.

We’ll specifically focus on neural network inference workloads — neural networks because they are the workloads generating the most demand right now, and inference because, unlike training, it is a revenue center rather than a cost center. We’re betting on the revenue center.

What is utilization?

Utilization = Output achieved ÷ Capacity paid for

Utilization relates the available capacity of a system to that system’s output.

In throughput-oriented systems like GPU applications, the capacity paid for is often a bandwidth (e.g. the arithmetic bandwidth) and the output achieved is then a throughput (e.g. floating point operations per second, FLOP/s).

Because it is a ratio, utilization is unitless. That means there are actually many GPU-related quantities you might call “GPU utilization”, leaving off the implicit units of the capacity and output. These different quantities range across orders of magnitude of time and across different organizational capacities (e.g. procurement, DevOps, and low-level performance engineering).

What is GPU Allocation Utilization?

GPU Allocation Utilization = GPU-seconds running application code ÷ GPU-seconds paid for

First, consider the number of GPUs that you have allocated — whether that is fixed GPU capacity on-premise in your basement (or data center) or it is rented capacity in a cloud data center (or many people’s basements) — across a period of time.

We use the term GPU Allocation Utilization for the fraction of those GPU-seconds during which you were running application code. This is the highest-level notion of “GPU utilization”.
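To make the ratio concrete, here is a minimal sketch of how you might compute it over a fleet, assuming a hypothetical record of how many GPU-seconds each allocation was billed for and how many it spent running application code:

```python
# Minimal sketch: GPU Allocation Utilization from hypothetical lease records.
from dataclasses import dataclass

@dataclass
class GpuLease:
    seconds_paid_for: float      # wall-clock GPU-seconds allocated (and billed)
    seconds_running_app: float   # GPU-seconds spent running application code

def allocation_utilization(leases: list[GpuLease]) -> float:
    """Fraction of paid-for GPU-seconds that ran application code."""
    paid = sum(lease.seconds_paid_for for lease in leases)
    busy = sum(lease.seconds_running_app for lease in leases)
    return busy / paid if paid else 0.0

# Example: two GPUs rented for an hour each; one busy for 54 minutes, one for 18.
fleet = [GpuLease(3600, 3240), GpuLease(3600, 1080)]
print(f"{allocation_utilization(fleet):.0%}")  # -> 60%
```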

There are two key limits on GPU Allocation Utilization: economic and developer-operational.

The economic limits on GPU Allocation Utilization arise from combined technical and market constraints. Purchasing, commissioning, decommissioning, and selling GPUs cannot be done as quickly as the output demanded by the application changes (on the scale of seconds or minutes).

Of course, as for other hardware we are blessed with highly-virtualized data center platforms (“clouds”) where we can virtually allocate and de-allocate GPU capacity. Even there, however, existing pricing models and demand that exceeds supply leave providers dictating terms, like multi-month or multi-year commitments, which limit achievable utilization for a given quality-of-service.

With a fixed, over-provisioned GPU allocation, utilization is low

[Chart: allocated GPUs vs. demand over one day, y-axis 0 to 100 GPUs]

Modal helps organizations solve this problem. We aggregate GPU demand across consumers and GPU supply across providers to improve GPU allocation efficiency.

But GPU Allocation Utilization isn’t just about the GPU-seconds paid for, it’s about the GPU-seconds spent running application code.

That’s where the DevOps limits on GPU Allocation Utilization come in. Even in a fully liquid GPU market, there is latency between the time at which a GPU is purchased or rented and the time at which the GPU is running useful work — time to configure operating systems, perform health checks, copy over application code, etc. Absent the ability to precisely predict future demand at timescales greater than that latency, this leads to reduced GPU Allocation Utilization, reduced quality-of-service, or both!

If allocation is slow, utilization and QoS suffer

[Chart: allocated GPUs vs. demand over one day, y-axis 0 to 100 GPUs]

To achieve high GPU Allocation Utilization and meet quality-of-service goals, allocation and spin-up to application code needs to be fast enough to respond to increases in demand.

With fast, automatic allocation, utilization and QoS can both be high

[Chart: allocated GPUs vs. demand over one day, y-axis 0 to 100 GPUs]

This is one of the core problems solved by Modal. We manage a large multi-cloud GPU fleet, benefitting from economies of scale to unlock better engineering solutions and concentration of measure to improve predictability of demand. We built a custom container stack (in Rust btw) to reduce the latency from non-application code and system configuration. And users’ workloads spin up faster because the serverless runtime for that container execution system frames user workloads in terms of application code, not virtual machine maintenance. That allows us to skip the repetitive, undifferentiated work required to create virtual machines. That unlocks novel engineering optimizations for us, like memory snapshotting and restoration, and it just-so-happens to make application engineering easier for our users.

What level of GPU Allocation Utilization can I expect to achieve?

The existing numbers are sobering. According to the State of AI Infrastructure at Scale 2024 report, the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%. 

With Modal, users can achieve GPU Allocation Utilization in excess of 90% — in aggregate, not just at peak. 

If that interests you, check out our docs and our pricing page. 

If it doesn’t, read on for more about the software engineering required to get the most out of your GPUs — on Modal or elsewhere. 

What is GPU Kernel Utilization? 

GPU Kernel Utilization = GPU-seconds running kernels ÷ GPU-seconds paid for 

Just because an allocated GPU is running application code doesn’t mean it is running code on the GPU. The term of art for “code that runs on the GPU” in the popular CUDA programming model for GPUs is “kernel”, and so we call the fraction of time we spend running code on the GPU the GPU Kernel Utilization. 

This utilization metric is reported by, among others, the beloved nvidia-smi command line tool wrapping NVIDIA’s Management Library for their GPU hardware, and so it is commonly checked and cited. We expose it to our users under the name that library uses, “GPU utilization”. Note that this name can be slightly misleading, since this metric does not care whether the code we’re running on the GPU is exercising the hardware’s actual capacity.
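To sample that metric programmatically rather than watching a terminal, here is a minimal sketch using the pynvml bindings (assuming the nvidia-ml-py package is installed and at least one NVIDIA GPU is visible):

```python
# Sketch: sampling NVML's "GPU utilization", i.e. GPU Kernel Utilization.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu is the percent of the sample period during which at least one
    # kernel was running. It says nothing about how hard those kernels
    # actually worked the hardware.
    print(f"kernel utilization: {util.gpu}%  memory activity: {util.memory}%")
    time.sleep(1)

pynvml.nvmlShutdown()
```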

An application that is achieving low GPU Allocation Utilization is necessarily going to achieve low GPU Kernel Utilization, so long as you consider all GPU-seconds being paid for: a unit not running application code can’t run kernels.

Why else might you achieve low GPU Kernel Utilization? In particular, what patterns will show up as low kernel utilization per GPU?

First, there might be lots of work to do that supports your application but doesn’t use the GPU, like moving input or output data via network or disk, downloading the many gigabytes of weights of a foundation model, or writing logs.

These tasks can be sped up by the usual means — judicious application of lazy and eager loading, parallelization, increased bandwidth for non-GPU components like networks, and deleting more code (YAGNI).
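For example, sequential weight downloads are a common culprit. A small sketch of the parallelization pattern, with made-up shard URLs and no error handling:

```python
# Sketch: fetch model weight shards concurrently instead of one at a time,
# so the GPU spends less of its allocation waiting on the network.
# The URLs and download helper here are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

SHARD_URLS = [f"https://example.com/weights/shard-{i:05d}.bin" for i in range(8)]

def download(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        return response.read()

# Network-bound work parallelizes well across threads: pay one round trip
# of latency instead of eight sequential ones.
with ThreadPoolExecutor(max_workers=8) as pool:
    shards = list(pool.map(download, SHARD_URLS))
```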

Second, the CPU might not be providing work to the GPU quickly enough. A typical GPU-accelerated program is, like a high-performance network application, a dance of concurrency between the CPU executing logic about what work must be done and specialized, but dumb, hardware that can actually do the work. For example, when multiplying two matrices, the popular PyTorch library needs to determine the shapes and types of those two matrices and then look up the appropriate kernel — somewhat akin to a JIT database query optimizer selecting a physical operator mid-execution. If you are unable to complete this work before the GPU finishes its previous task, the GPU will idle. We’ll call this class of issue “host overhead”.

Often, resolving host overhead is a matter of re-writing the host logic — preventing slow host work (like logging in Python) from blocking the host work that drives the GPU. But at the scale of milliseconds per task step, Python starts to become incapable of keeping up, and at the scale of microseconds per task step, the latency required to schedule kernels onto the GPU via the CUDA C++ APIs and driver begins to bottleneck.

In both cases, there are two basic optimizations. First, multiple kernels can be launched at once using CUDA Graphs, which essentially convert a sequence of kernel launches into a DAG that only needs to be launched once. Second, the application can aggregate more work for the GPU to complete for a given unit of host work — for example by batching requests together — to improve utilization with a possible penalty to latency.
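As a concrete illustration of the first optimization, here is a minimal sketch following the capture-and-replay pattern from PyTorch's CUDA Graphs documentation (static shapes, a toy model, and inference only):

```python
# Sketch: replace many per-kernel launches with a single CUDA Graph replay.
import torch

device = "cuda"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).to(device).eval()
static_input = torch.randn(8, 1024, device=device)

# Warm up on a side stream so capture sees steady-state allocations.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture the whole forward pass as one graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Steady state: copy fresh data into the captured buffer and replay;
# one host-side launch covers the entire DAG of kernels.
static_input.copy_(torch.randn(8, 1024, device=device))
graph.replay()
result = static_output.clone()
```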

Code regions with low GPU Kernel Utilization can be identified from application traces, like those produced by the PyTorch Profiler. Specifically, any period of time where all CUDA streams are empty is a period of zero GPU Kernel Utilization, and so applications with low GPU Kernel Utilization have largely empty CUDA streams in their traces, like the one below. These periods of quiescence need to be correlated to activity on the host to determine which parts of the application code are leading to the bottleneck. GPU application profilers and trace viewers generally support this, e.g. by showing kernel launch dependencies, like the arrow in the trace below.

A trace of a PyTorch application with low GPU Kernel Utilization

In traces of GPU applications, periods where no kernels are running appear as empty strips in the timelines of CUDA streams (e.g. Stream 7 in the trace above). For details, see our documentation.
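A trace like this takes only a few lines to capture. Here is a minimal sketch with the PyTorch Profiler, using a toy model in place of a real workload:

```python
# Sketch: capture a Chrome-format trace and look for gaps in the CUDA streams.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(20):
            model(x)
    torch.cuda.synchronize()

# Open trace.json in Perfetto or chrome://tracing; empty stretches in the
# CUDA stream rows are periods of zero GPU Kernel Utilization.
prof.export_chrome_trace("trace.json")
```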

What level of GPU Kernel Utilization can I hope to achieve?

GPU Kernel Utilization is the closest metric in this article to the better-known CPU utilization. CPU utilization tracks the fraction of CPU cycles during which instructions were being executed on behalf of your program (as opposed to the CPU idling or running other programs).

However, for CPU utilization, hitting 90%+ is often bad, even a trigger for alerts. But we want to and can achieve that level of GPU Kernel Utilization!

Fundamentally, this is downstream of the greater predictability of many GPU applications. Running a transactional database replica at 90% CPU utilization baseline risks degraded quality-of-service if query patterns or quantity change. Typical GPU applications have much less variability — for a database analogue, imagine repeatedly running only one basic sequential scan aggregation query, but with slightly different parameters each time — and so have more controllable quality-of-service.

What is Model FLOP/s Utilization (MFU)?

Model FLOP/s Utilization = Model FLOP/s throughput achieved ÷ FLOP/s bandwidth paid for

At some galaxy-brained, CEO-math level, expenditures on GPUs are really expenditures on floating point operation bandwidth, and so the deepest and most fundamental utilization metric to measure is the ratio of that bandwidth to the throughput achieved.

This metric is known as MFU, which either means “Maximum” or “Model” FLOP/s Utilization, depending on who you ask. We go with “Model”, since it’s more common.

Instances that aren’t running application code or that aren’t running GPU kernels cannot achieve a high MFU, so low GPU Allocation Utilization or low GPU Kernel Utilization imply low Model FLOP/s Utilization.

However, high utilization at these more abstract levels does not imply high MFU.

First, as an implementation detail, communication between GPUs is frequently implemented via GPU kernels. This communication, like most communication in distributed systems, is subject to faults (hardware fault, programmer fault, shark attack fault), which frequently manifest as deadlock. From the perspective of GPU Kernel Utilization, a system that is deadlocked in the middle of running a communication kernel is fully utilized (!), but it is completing no useful work. We like to catch this particular issue by monitoring GPU power draw and heat. More generally, optimizing communication is critical for achieving high MFU, especially for workloads that spread a single task across multiple nodes.
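As a rough illustration of that monitoring trick, here is a sketch of a crude detector, with a made-up power threshold; a GPU that reports busy kernels while drawing idle-level power for minutes on end deserves a closer look:

```python
# Sketch: flag GPUs that look "busy" to the kernel-utilization metric but
# are drawing suspiciously little power, a telltale sign of a stuck collective.
import time
import pynvml

SUSPECT_WATTS = 120       # assumption: far below this GPU's draw under real load
ALERT_AFTER_SECONDS = 300

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

suspect_for = 0
while True:
    busy = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu       # percent
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000         # mW -> W
    suspect_for = suspect_for + 5 if busy > 95 and watts < SUSPECT_WATTS else 0
    if suspect_for >= ALERT_AFTER_SECONDS:
        print("possible deadlocked collective: kernels busy but GPU is cold")
    time.sleep(5)
```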

Second, floating point computation is just one of the things a GPU must do to complete a task. The most important other task is moving data. Computation can only occur on data stored inside of the register file of the GPU’s streaming multiprocessors, which each store less than a megabyte, while foundation models are measured in gigabytes. The data to which a computation applies must generally be moved from a slower, larger area of the memory hierarchy. The bandwidth of this memory is generally many times lower than the device’s FLOP/s bandwidth, especially in recent generations. The ratio of an algorithm’s FLOP/s throughput to its byte/s throughput is called the arithmetic intensity.

Bottlenecking on memory is a particular challenge in latency-sensitive foundation model inference workloads, where the arithmetic intensity is low (perhaps a few FLOPs per byte). Besides algorithmic rewrites to increase arithmetic intensity, like the online softmax in Flash Attention, the primary generic strategy is batching more work together, which increases FLOPs executed more than memory bytes moved for most neural network inference workloads, but generally adds per-task latency.
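Some napkin math makes the effect of batching visible. Assuming a single fp16 linear layer whose weight matrix dominates memory traffic, the arithmetic intensity grows roughly linearly with batch size:

```python
# Napkin math: arithmetic intensity (FLOPs per byte) of an fp16 linear layer
# as a function of batch size. Illustrative assumptions only.
def arithmetic_intensity(n: int, batch: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * n * n * batch                       # one multiply-accumulate per weight per input
    bytes_moved = bytes_per_elem * (n * n           # weight matrix
                                    + n * batch     # activations in
                                    + n * batch)    # activations out
    return flops / bytes_moved

for batch in (1, 8, 64, 256):
    print(batch, round(arithmetic_intensity(4096, batch), 1))
# batch 1   -> ~1 FLOP/byte: hopelessly memory-bound
# batch 256 -> ~230 FLOP/byte: much closer to the compute-bound regime
```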

Finally, GPU kernels must be carefully written to achieve high MFU. This public worklog by Si Boehm gives a flavor for the effort required to reach state-of-the-art for a single kernel. Even that worklog stops short of truly maximizing MFU, since it tackles a problem that can’t make use of the fastest elements of contemporary GPUs, the Tensor Cores, and writing kernels that can saturate Tensor Cores is even more challenging — see this worklog from Pranjal Shankhdhar. For this reason, most teams use high-quality open source kernels through libraries like cuBLAS or frameworks like PyTorch and vLLM.

The achieved FLOP/s and memory throughput of a GPU application can be monitored using the NVIDIA Data Center GPU Manager, DCGM. The metrics prefixed with DCGM_FI_PROF are generally relevant. In particular, the DCGM_FI_PROF_DRAM_ACTIVE metric measures the utilization of the DRAM-to-SRAM memory bandwidth, while the DCGM_FI_PROF_PIPE_TENSOR_ACTIVE metric measures the utilization of the Tensor Cores that provide the maximum FLOP/s bandwidth. Neither is identical to MFU, for subtle reasons covered well in Stas Bekman’s guide here.
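For a quick look at those two metrics without standing up a full metrics pipeline, here is a sketch that shells out to the dcgmi CLI (assuming DCGM is installed and that the numeric field IDs below still map to the tensor-pipe and DRAM activity metrics):

```python
# Sketch: poll DCGM profiling metrics from the command line via dcgmi.
# 1004 = DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, 1005 = DCGM_FI_PROF_DRAM_ACTIVE
# (field IDs as documented by DCGM at the time of writing).
import subprocess

subprocess.run(
    ["dcgmi", "dmon", "-e", "1004,1005", "-c", "10"],  # ten samples of each field
    check=True,
)
```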

What level of Model FLOP/s Utilization can I hope to achieve?

First, let’s note that measuring Model FLOP/s Utilization is tricky. The theoretical bandwidth can be read from manufacturer datasheets — but watch for asterisks like “with sparsity”. The achieved model throughput, on the other hand, can be hard to measure, in particular since some FLOPs might be spent on other computations, like activation recomputation in training. For that reason, it is often done based on pen-and-paper analysis of the algorithm and with approximate, “napkin” math.
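The napkin math usually looks something like the sketch below, in which every number is an assumption chosen for illustration (and the 2 FLOPs-per-parameter-per-token rule of thumb ignores attention FLOPs):

```python
# Napkin math: MFU of a hypothetical decoder-only LLM serving deployment.
params = 70e9                  # 70B-parameter model
flop_per_token = 2 * params    # ~2 FLOPs per parameter per generated token
tokens_per_second = 2400       # aggregate decode throughput across all requests

achieved = flop_per_token * tokens_per_second   # ~3.4e14 FLOP/s
peak_per_gpu = 312e12          # e.g. dense BF16 peak of an A100, per the datasheet
n_gpus = 4                     # model sharded across four GPUs

mfu = achieved / (peak_per_gpu * n_gpus)
print(f"MFU ≈ {mfu:.0%}")      # ≈ 27%
```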

The state-of-the-art for MFU in training is achieved by the foundation model teams at leading organizations like OpenAI, Google, and Meta. Of these, Meta is the most open and reports an MFU of 38 - 41% when training the LLaMA 3 405B model. The more recent DeepSeek-v3 training run by DeepSeek achieved around 20-30% MFU (there’s no official number) using GPUs with tighter communication bottlenecks.

Much of the shortfall is due to the need for inter-node communication in large training jobs, which creates bandwidth constraints that aren’t present in inference applications. For inference workloads, MFU might reach higher, closer to the 70% - 80% MFU achieved by raw matrix multiplications, but we aren’t aware of any published results from large-scale deployments. Let us know if we missed them!

For context, it’s also helpful to consider the equivalent of MFU for a job running on a CPU. For concreteness, consider the One Billion Row Challenge, which led teams around the world to competitively optimize a large-scale aggregation problem on CPUs. This problem requires three floating point operations per row on one billion rows, and so has a total FLOP count of 3 billion. The leading results finished in about one second, and so achieved a FLOP/s throughput of about 3 billion. If we assume that the hardware used for the challenge, eight cores out of a 32 core AMD EPYC 7502P machine which can run at 3.35 GHz, is capable of issuing one FLOP per cycle, then the FLOP/s bandwidth is ~26 billion, for an MFU of ~10%. However, that CPU has AVX2 SIMD vector instructions with a lane width of 256 bits and so, assuming it can issue 16 FLOPs/cycle per core, the FLOP/s bandwidth is actually ~420 billion, leading to an MFU of under 1%.
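Spelled out as a quick calculation, with the same assumptions as above:

```python
# The One Billion Row Challenge napkin math, spelled out.
rows = 1e9
flops_done = 3 * rows                      # ~3 FLOPs per row
seconds = 1.0                              # winning entries take about a second

cores, clock_hz = 8, 3.35e9
scalar_bandwidth = cores * clock_hz * 1    # 1 FLOP per core per cycle
simd_bandwidth = cores * clock_hz * 16     # assuming 16 FLOPs per core per cycle with AVX2

print(f"scalar MFU: {flops_done / seconds / scalar_bandwidth:.1%}")  # ~11%
print(f"SIMD MFU:   {flops_done / seconds / simd_bandwidth:.2%}")    # ~0.70%
```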

How can I improve my GPU utilization?

If you’re not using Modal, that’s a great place to start! Especially for GPU Allocation Utilization.

Besides that, we recommend that if you want to improve your GPU utilization, you dive deeper into GPU-based computing.

We wrote a GPU Glossary to collect together our definitions of the most important terms in one place, complete with links to some of our favorite resources for learning more. Try starting there!

Among those resources, a few stand out, like this talk by Horace He, of the PyTorch team, and this dense blog post by Abhinav Upadhyay of Coding Confessions. We also highly recommend the ML Engineering Open Book by Stas Bekman for deep dives and useful snippets all across the stack.

We’d like to thank Mark Saroufim of PyTorch & the GPU_MODE Discord (join it!) and Erik Dunteman of Pig for comments on a draft of this post.
