Hello,

We found that some standard NVIDIA tests fail on a dual NVIDIA RTX 4090 system.

The system runs CUDA 11.8 with driver 520.61.05 on Linux dl 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux.

TEST-1:
executables_v2/bin/x86_64/linux/release/simpleP2P

[executables_v2/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access…
Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.19GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!

TEST-2:
executables_v2/bin/x86_64/linux/release/OrderedAllocationIPC

Step 0 done
Step 1 done
Process 0: verifying…
Process 0: Verification mismatch at 0: 0 != 1
Process 0: Verification mismatch at 1: 0 != 1
Process 0: Verification mismatch at 2: 0 != 1
Process 0: Verification mismatch at 3: 0 != 1
Process 0: Verification mismatch at 4: 0 != 1
Process 0: Verification mismatch at 5: 0 != 1
Process 0: Verification mismatch at 6: 0 != 1

.
.
.

Here is the system info:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 1
Core(s) per socket: 32

3 months later

Hi Vasilii,

Apologies for the delay. Can you please capture an NVIDIA bug report from your system?

Please run nvidia-bug-report.sh as root or via sudo and attach the generated nvidia-bug-report.log.gz.

Can you also please provide the make/model of the motherboard and the system?

Thank you

Hi @vasilii.shelkov,

Can you please check if you see the same failures with the IOMMU disabled?

On Ubuntu, please edit /etc/default/grub and append amd_iommu=off to the options in the line starting with GRUB_CMDLINE_LINUX=.... Save, run update-grub2 as root, and reboot.

Thank you

With amd_iommu off we get the same errors, except that “nan” is replaced by 0s:

bizon@dl:~$ ~/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P
[/home/bizon/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access…
Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.14GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access…
Shutting down…
Test failed!

I noticed that your system is from Bizon. Aren’t they able to help, since they configured your system? Or rather, aren’t they testing the system for parallel training/GPU computing before shipping it?

It looks like an NVIDIA bug, and it’s unlikely that Bizon or their testing could help here. This is the ticket number:

“NVIDIA PSIRT” PSIRT@nvidia.com; Bug report: standard nVidia P2P tests: 3902559

Hi vasilii.shelkov,

Thank you for the additional information. We can reproduce this issue on our systems. This is under investigation.

12 days later

Hello @abchauhan – How do I receive updates on this issue? I tried sending an email to PSIRT@nvidia.com about issue 3902559, but haven’t heard back yet. Is there a portal where I can sign in to see the update history?

Thanks

@abchauhan This is a very serious issue and has already been reproduced by a number of people: Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems

It seems the new A6000 Ada cards are also affected on AMD CPUs. I have ordered 8 such cards, so I hope this will be fixed soon.

I was able to reproduce this with my 4090s too, on AMD EPYC. The only thing that prevents the hang is NCCL_P2P_DISABLE=1, but the performance is subpar.

We cannot confirm that the RTX 6000 Ada GPUs have this problem on AMD EPYC or WRX80-based CPUs. P2P copy does not have to be disabled when using RTX 6000 Ada GPUs.

More importantly, the transferred data is correct. Multi RTX 6000 Ada setups seem to work without problems.

Some findings for the multi RTX 4090 setups:

  • When P2P copy is disabled with NCCL_P2P_DISABLE on AMD EPYC/WRX80, the locking problem can be bypassed, but then the data transferred between the GPUs is not copied correctly (the destination data is all 0 or all NaN). This can be tested, for example, with a small CUDA program like the sketch after this list.
  • The multi-GPU RTX 4090 problem is not specific to AMD CPUs: on the Intel CPUs we tested (for example, Xeon Silver 4309Y), the transfer is not blocked (NCCL_P2P_DISABLE has no effect), but the data is also not copied correctly (destination all 0 or NaN). This is independent of whether NCCL_P2P_DISABLE is set, which of course should have no effect, as the example uses CUDA directly rather than the higher-level NCCL library.
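
For reference, here is a minimal sketch of the kind of direct CUDA check described above. The device indices, buffer size, and kernel are illustrative assumptions, not the exact test program used here: a kernel on GPU1 reads a buffer resident on GPU0 through UVA, and the result is verified on the host.

// Minimal P2P/UVA check: a kernel on GPU1 reads a buffer resident on GPU0.
// Illustrative sketch only; error checking trimmed for brevity.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];   // src lives on the peer GPU (UVA pointer)
}

int main() {
    const int n = 1 << 20;
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("canAccessPeer 0->1: %d, 1->0: %d\n", can01, can10);

    std::vector<float> h(n, 1.0f), out(n, 0.0f);
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(float));
    cudaMemcpy(d0, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);  // let GPU1 map GPU0's memory
    cudaMalloc(&d1, n * sizeof(float));

    // Kernel on GPU1 dereferences GPU0 memory directly through UVA.
    copyKernel<<<(n + 255) / 256, 256>>>(d0, d1, n);
    cudaDeviceSynchronize();

    cudaMemcpy(out.data(), d1, n * sizeof(float), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < n; ++i) if (out[i] != 1.0f) ++bad;
    printf("%s (%d mismatches)\n", bad ? "FAILED" : "OK", bad);
    return bad != 0;
}

On an affected configuration, the expectation from the reports above is that the destination buffer comes back as all 0 or all NaN even though canAccessPeer is reported as 1.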

The RTX 4090 is currently not usable for multi-GPU work, neither on Intel nor on AMD. From our analysis, the reason seems to be a broken(?) CUDA UVA implementation.

I have just run into this issue as well, using a 2x 4090 setup with an i9-10980XE CPU. I too have been using the simpleP2P test, which is broken when P2P is enabled. After a bit of hacking, it does appear that cudaMemcpyPeer works as expected. In my case, cudaMemcpyPeer gave 10.54 GB/s, versus 12.5 GB/s when peer access is enabled.
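
For context, here is a rough sketch of that kind of bandwidth comparison (buffer size, iteration count, and device indices are illustrative assumptions, not the poster’s exact code). It times cudaMemcpyPeer between the two cards before and after enabling peer access:

// Time cudaMemcpyPeer GPU0 -> GPU1 with and without peer access enabled.
// Rough illustrative sketch; proper benchmarking needs warm-up and more iterations.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static double bandwidthGBs(float *dst, float *src, size_t bytes, int iters) {
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeer(dst, 1, src, 0, bytes);   // dst on device 1, src on device 0
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * iters / s / 1e9;
}

int main() {
    const size_t bytes = 64ull << 20;   // 64 MB, as in simpleP2P
    const int iters = 100;
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // Pass 1: peer access not enabled (the copy is typically staged through the host).
    printf("peer access off: %.2f GB/s\n", bandwidthGBs(d1, d0, bytes, iters));

    // Pass 2: peer access enabled in both directions.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    printf("peer access on:  %.2f GB/s\n", bandwidthGBs(d1, d0, bytes, iters));
    return 0;
}

Note that the explicit cudaMemcpyPeer path does not require the kernel-side peer dereference that fails in simpleP2P, which is consistent with the observation above that the copies themselves work.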

Hi all,

Apologies for the delay. Feedback from Engineering is that peer-to-peer is not supported on the 4090. The applications/driver should not report this configuration as peer-to-peer capable. The reporting is being fixed, and future drivers will report the following instead:

I. # ./simpleP2P
[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access…
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

II. ./streamOrderedAllocationIPC
Device 1 is not peer capable with some other selected peers, skipping
Step 0 done
Process 0: verifying…
Process 0 complete!

Thank you

@abchauhan Did the Engineering team provide any reasoning as to why P2P is not supported on the 4090?

Are there any plans to add support for P2P in the future?

The 3090 also does not support P2P, which is not a problem. The 4090 giving the same simpleP2P results as the 3090 is, I think, not a problem either.

cuda-samples/Samples/0_Introduction/simpleP2P at master · NVIDIA/cuda-samples · GitHub

That is just one of the typical test cases.

The key problem is that using PyTorch DataParallel or DistributedDataParallel causes the program to freeze and can even crash the server.

For example, see example 2 here: Multi-GPU Computing with Pytorch (Draft) [2. DataParallel: MNIST on multiple GPUs]

The above program will hang.

Or, for example, with GitHub - pytorch/benchmark: TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.

python test.py -k "test_BERT_pytorch_train_cuda"

The above program will cause the server to crash. But it runs normally in NGC on the Intel Xeon platform without the server crashing.

Indeed, the more important problem is that UVA on the RTX 4090 is not working correctly, which is actually the part where the simpleP2P test fails.
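
If it helps anyone checking their own system, here is a small illustrative sketch (not code from this thread) that prints what the CUDA runtime reports for UVA and peer access on each device; on the fixed drivers discussed later in the thread, canAccessPeer should come back 0 for GeForce Ada cards.

// Print what the CUDA runtime reports for UVA and peer access per device.
// Illustrative sketch for inspecting a local system; not code from the thread.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU%d: %s, unifiedAddressing=%d\n", d, prop.name, prop.unifiedAddressing);
        for (int p = 0; p < count; ++p) {
            if (p == d) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, d, p);
            printf("  canAccessPeer GPU%d -> GPU%d: %d\n", d, p, can);
        }
    }
    return 0;
}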

The UVA not working is a big problem for an application I’m working on too. After debugging it for several days, I posted about it just a few days ago, when I tracked it down to UVA failing in a simple test program.

I tested it and it works on Windows 11 with the 528 driver, but I’m running into other problems with that. I’m keeping my fingers crossed that the next Linux driver release will resolve the problem. It would be great if we could get some feedback from NVIDIA about what to expect on that, though.

I have requested additional feedback. (Bug 3931150 for internal tracking.)

I am setting up a system to reproduce this. I’ll file a new bug if this remains an open issue. Thanks.

I can confirm that dual 4090s also work on Windows 10 with PyTorch (latest driver 528.49, CUDA 12.0); however, it freezes on Ubuntu 22.04 (latest driver 525.89.02, CUDA 12.0). So this is not a hardware limitation but a driver/software issue. Please fix it ASAP. Thank you!

Hello @abchauhan – Do you have a target date for a fix for this issue? There are many modules and packages out there that leverage DDP and are currently broken for multi-GPU use. Thank you!

Hi @abchauhan,
I’ve upgraded to the latest CUDA 12.1 with driver 530.30.02.

simpleP2P still fails, and the peer access check still reports Yes.

Hi,

Peer-to-peer is disabled for all GeForce Ada Desktop cards. There were driver issues related to peer-to-peer which were recently fixed for these cards.

The release candidate is being finalized. I will share the target dates and the driver versions that include the fix as soon as that information is available. Unfortunately, the latest beta version, 530.30.02, does not include the fix.

I can reproduce the hang with this application as well as with the examples at https://github.com/NVIDIA/nccl-tests/issues/117.

I have verified that there are no hangs on drivers with the fix.

Thanks

Hi all,

To clarify, peer-to-peer is disabled on all GeForce Ada Desktop cards, and the fix will resolve the application hangs, crashes, and incorrect results.

The fix doesn’t enable peer-to-peer on these cards. Peer-to-peer on these cards is an unsupported configuration.

Thank you

Hi @suprnerd

The driver release is scheduled for the end of this month (March 2023). I will share the release version information when it’s available.

Thank you

Thanks for the update @abchauhan

I understand that P2P is not and won’t be supported on Ada GeForce cards. Do you know if the problems with unified memory will be fixed with this release? Or, I guess, if that is even supported? It works for me with two 4080s on Windows, and I don’t see anything in the documentation to suggest it shouldn’t work. I’m assuming it should be supported, but someone correct me if I’m wrong.

Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.

This "we have introduced a mysterious bug which makes GPUs stall or calculate wrong values, and we don’t know if it is solvable" situation is unprofessional.

Do you know if the problems with unified memory will be fixed with this release?

Can you please share the forum post’s link for this? Is there an NVIDIA bug number?

Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.

I’ll share this feedback with the teams.

Thank you

From the tone of the discussion, it seems the 4090 will have P2P locked. In my case, that renders the 4090 worthless.

I ended up having an internal debate between an A100 40G and 2x A5500 24G, and opted for the 2x A5500. It is in the ballpark of a pair of 4090s, and it packs the performance of the A100 40G that I have a feeling we both may have been aiming for.

Something to consider if you are able to make the hardware switch.

For HPC we’re stuck with using appropriate hardware. Not necessarily a bad thing. Tools tend to be more functional than toys.

An A5500 is nearly as fast as an RTX 3090… for 4090 performance one has to go for an RTX 6000 Ada.

That’s nice and all, but it’s useless. 24 GB of RAM is nothing. The 5500 has P2P/SLI; the 3090 has neither.

With the poor capabilities of the 4090, it’s worthless as anything but a toy. I returned two of them. At least I can use 100% of an A5500.

Sure, it may be faster, but if it’s all blocked from use, what good does its speed at playing games do for training and inference?

It did nothing for me. It gave me gibberish and unreliable results. It seems a few of us in this thread have had similar experiences.

Hi all,

Driver version 525.105.17 has the changes that resolve the application hangs and crashes. CUDA sample tests will report that P2P is not supported.

Please let us know if you continue to see any failures.

The current 530.xx driver, 530.41.03, does not include the changes. The next releases from the 530 branch should pick up the changes.

Thank you

We can confirm that with driver 525.105.17 the locking problem is fixed, and UVA (Unified Virtual Addressing) also seems to work with 2x RTX 4090 without error on AMD EPYC and Threadripper. With working UVA, NCCL also works → multi-GPU training with PyTorch and TensorFlow works.

Finally, 2x RTX 4090 can be used for deep learning training. The missing P2P performance probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance… but at least now they are usable.

2 months later

I said that it is usable (it does not crash or hang anymore) and that it calculates correct results (which was also not the case in older versions). I did not say that it is good or fast, and I already doubted that it makes sense to use more than two RTX 4090s. What are your experiences with the performance?

The missing P2P performance probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance.

The missing functionality makes it as slow as a CPU.

If you have the $ for 2x 4090, you should pick up 2x A5500 instead. The memory matters more than the processing for inference.

[Screenshots: Before / After]

The “After” pic destroys the performance of the 2x 4090. Not even close.

You need a completely un-hobbled GPU, including P2P, to do anything. Unless you are using multiple systems and building your own P2P, the 4090 is a total waste of time.

  • Note on the bracket: I changed cases later, and the bracket is no longer necessary. The old case needed it. The LianLi O11XL fits everything without issue.
28 days later

Hi,

This is really interesting. I can confirm that, running different training jobs, we are having the same issues described here.

One thing I noticed is that when I train or fine-tune, for example, LLM models, it scales fine across multiple RTX 4090 GPUs (we have tested with 4x GPUs). But when running a vision model like ResNet-50, the system crashes or loses performance. For example, it starts the ResNet-50 training, and after 2 minutes performance drops by about 80%: the GPUs are at 100% utilization at the beginning, using all the power of the GPU, and then after 2 minutes power consumption drops to 120 W or less. This is not related to hardware, since we tested multiple systems and scenarios; it seems it is all related to P2P being locked on GeForce cards.

@abchauhan, can you please let us know if this is permanent, and what it means for multi-GPU workstations? Does it mean we cannot train DL models using multiple 4090s and must buy Quadro GPUs?

Also, on Windows 11, will this P2P be enabled for 3D rendering and unaffected, or how does it work in that case?

Please confirm, @abchauhan; this is important for us to make the right business decision.

Thanks

By running the following tests:

They are not comprehensive, but they did not work at all when this thread was started.

7 months later

This is still affecting me on driver 545.29.06, kernel 6.7.0. I’m getting the same simpleP2P output as the OP.

3 months later

Have there been any updates on this?

On my multi-4090 AMD system, I can train with multiple GPUs. But when deploying an LLM across multiple GPUs, it raises a device-side index assertion error. If I turn off all virtualization, the system hangs when attempting to run inference across multiple GPUs.

I believe this is related to the issue in this thread, as another system of mine with multiple 3090s has no issues at all running the exact same code. I’m on Ubuntu 22.04, Python 3.10, CUDA 12.1. The P2P test seems to run without errors.

12 days later

@kj140717 FWIW, it seems the latest drivers do correctly report P2P as disabled for the 4090, though I’ve since switched to tinygrad’s fork since it implements P2P.