Hello,

We found that some standard NVIDIA tests fail on a dual NVIDIA RTX 4090 system.

The system runs CUDA 11.8 with driver 520.61.05 on Linux dl 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux.

TEST-1:
executables_v2/bin/x86_64/linux/release/simpleP2P

[executables_v2/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access…
Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.19GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!

TEST-2:
executables_v2/bin/x86_64/linux/release/OrderedAllocationIPC

Step 0 done
Step 1 done
Process 0: verifying…
Process 0: Verification mismatch at 0: 0 != 1
Process 0: Verification mismatch at 1: 0 != 1
Process 0: Verification mismatch at 2: 0 != 1
Process 0: Verification mismatch at 3: 0 != 1
Process 0: Verification mismatch at 4: 0 != 1
Process 0: Verification mismatch at 5: 0 != 1
Process 0: Verification mismatch at 6: 0 != 1

.
.
.

Here is the system info:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 1
Core(s) per socket: 32

3 months later

Hi Vasilii,

Apologies for the delay. Can you please capture an NVIDIA bug report from your system?

Please run nvidia-bug-report.sh as root or via sudo and attach the generated nvidia-bug-report.log.gz.

Can you also please provide the make/model of the motherboard and the system?

Thank you

Hi @vasilii.shelkov,

Can you please check if you see the same failures with the IOMMU disabled?

On Ubuntu, please edit /etc/default/grub and append amd_iommu=off to the options in the line starting with GRUB_CMDLINE_LINUX=.... Save, run update-grub2 as root, and reboot.

Thank you

With amd_iommu off we get the same errors, except that “nan” is replaced by 0s:

bizon@dl:~$ ~/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P
[/home/bizon/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access…
Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.14GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access…
Shutting down…
Test failed!

I noticed that your system is from Bizon. Aren’t they able to help, since they configured your system? Or rather, aren’t they testing the system for parallel training/GPU computing before shipping it?

It looks like an NVIDIA bug, and it’s unlikely that Bizon or their testing could help here. This is the ticket number:

“NVIDIA PSIRT” PSIRT@nvidia.com; Bug report: standard nVidia P2P tests: 3902559

Hi vasilii.shelkov,

Thank you for the additional information. We can reproduce this issue on our systems. This is under investigation.

12 days later

Hello @abchauhan – How do I receive updates on this issue? I tried sending an email to PSIRT@nvidia.com about issue 3902559, but haven’t heard back yet. Is there a portal where I can sign in to see the update history?

Thanks

@abchauhan This is a very serious issue and has already been reproduced by a number of people: Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems

It seems the new A6000 Ada cards are also affected on AMD CPUs. I have ordered 8 such cards, so I hope this will be fixed soon.

I was able to reproduce this with my 4090s too, on AMD EPYC. The only thing that prevents the hang is NCCL_P2P_DISABLE=1, but the performance is subpar.

We cannot confirm that the RTX 6000 Ada GPUs have this problem on AMD EPYC or WRX80-based CPUs. P2P copy does not have to be disabled when using RTX 6000 Ada GPUs.

More importantly, the transferred data is correct. Multi RTX 6000 Ada setups seem to work without problems.

Some findings for the multi RTX 4090 setups:

  • When P2P copy is disabled with NCCL_P2P_DISABLE on AMD EPYC/WRX80, the locking problem can be bypassed, but then the data transferred between the GPUs is not copied correctly (the destination data is all 0 or all NaN). This can be tested, for example, with a small CUDA program like the sketch after this list.
  • The multi-GPU RTX 4090 problem is not specific to AMD CPUs: on the Intel CPUs we tested (for example, Xeon Silver 4309Y), the transfer is not blocked (NCCL_P2P_DISABLE has no effect), but the data is also not copied correctly (destination all 0 or NaN). This is independent of whether NCCL_P2P_DISABLE is set, which of course should have no effect, as the example uses CUDA directly rather than the higher-level NCCL library.
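
For reference, here is a minimal sketch of the kind of direct CUDA check described above. The device indices, buffer size, and kernel are illustrative assumptions, not the exact test program used here: a kernel on GPU1 reads a buffer resident on GPU0 through UVA, and the result is verified on the host.

// Minimal P2P/UVA check: a kernel on GPU1 reads a buffer resident on GPU0.
// Illustrative sketch only; error checking trimmed for brevity.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void copyKernel(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];   // src lives on the peer GPU (UVA pointer)
}

int main() {
    const int n = 1 << 20;
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("canAccessPeer 0->1: %d, 1->0: %d\n", can01, can10);

    std::vector<float> h(n, 1.0f), out(n, 0.0f);
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(float));
    cudaMemcpy(d0, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);  // let GPU1 map GPU0's memory
    cudaMalloc(&d1, n * sizeof(float));

    // Kernel on GPU1 dereferences GPU0 memory directly through UVA.
    copyKernel<<<(n + 255) / 256, 256>>>(d0, d1, n);
    cudaDeviceSynchronize();

    cudaMemcpy(out.data(), d1, n * sizeof(float), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < n; ++i) if (out[i] != 1.0f) ++bad;
    printf("%s (%d mismatches)\n", bad ? "FAILED" : "OK", bad);
    return bad != 0;
}

On an affected configuration, the expectation from the reports above is that the destination buffer comes back as all 0 or all NaN even though canAccessPeer is reported as 1.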

The RTX 4090 is currently not usable for multi-GPU work, neither on Intel nor on AMD. From our analysis, the reason seems to be a broken(?) CUDA UVA implementation.

I have just run into this issue as well, using a 2x 4090 setup with an i9-10980XE CPU. I too have been using the simpleP2P test, which is broken when P2P is enabled. After a bit of hacking, it does appear that cudaMemcpyPeer works as expected. In my case, cudaMemcpyPeer gave 10.54 GB/s, versus 12.5 GB/s when peer access is enabled.
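
For context, here is a rough sketch of that kind of bandwidth comparison (buffer size, iteration count, and device indices are illustrative assumptions, not the poster’s exact code). It times cudaMemcpyPeer between the two cards before and after enabling peer access:

// Time cudaMemcpyPeer GPU0 -> GPU1 with and without peer access enabled.
// Rough illustrative sketch; proper benchmarking needs warm-up and more iterations.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static double bandwidthGBs(float *dst, float *src, size_t bytes, int iters) {
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeer(dst, 1, src, 0, bytes);   // dst on device 1, src on device 0
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * iters / s / 1e9;
}

int main() {
    const size_t bytes = 64ull << 20;   // 64 MB, as in simpleP2P
    const int iters = 100;
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // Pass 1: peer access not enabled (the copy is typically staged through the host).
    printf("peer access off: %.2f GB/s\n", bandwidthGBs(d1, d0, bytes, iters));

    // Pass 2: peer access enabled in both directions.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    printf("peer access on:  %.2f GB/s\n", bandwidthGBs(d1, d0, bytes, iters));
    return 0;
}

Note that the explicit cudaMemcpyPeer path does not require the kernel-side peer dereference that fails in simpleP2P, which is consistent with the observation above that the copies themselves work.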

Hi all,

Apologies for the delay. Feedback from Engineering is that peer-to-peer is not supported on the 4090. The applications/driver should not report this configuration as peer-to-peer capable. The reporting is being fixed, and future drivers will report the following instead:

I. # ./simpleP2P
[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access…
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

II. ./streamOrderedAllocationIPC
Device 1 is not peer capable with some other selected peers, skipping
Step 0 done
Process 0: verifying…
Process 0 complete!

Thank you

@abchauhan Did the Engineering team provide any reasoning as to why P2P is not supported on the 4090?

Are there any plans to add support for P2P in the future?

The 3090 also does not support P2P, which is not a problem. The 4090 giving the same simpleP2P results as the 3090 is, I think, not a problem either.

cuda-samples/Samples/0_Introduction/simpleP2P at master · NVIDIA/cuda-samples · GitHub

That is just one of the typical test cases.

The key problem is that using PyTorch DataParallel or DistributedDataParallel causes the program to freeze and can even crash the server.

For example, see example 2 here: Multi-GPU Computing with Pytorch (Draft) [2. DataParallel: MNIST on multiple GPUs]

The above program will hang.

Or, for example, with GitHub - pytorch/benchmark: TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.

python test.py -k "test_BERT_pytorch_train_cuda"

The above program will cause the server to crash. But it runs normally in NGC on the Intel Xeon platform without the server crashing.

Indeed, the more important problem is that UVA on the RTX 4090 is not working correctly, which is actually the part where the simpleP2P test fails.
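
If it helps anyone checking their own system, here is a small illustrative sketch (not code from this thread) that prints what the CUDA runtime reports for UVA and peer access on each device; on the fixed drivers discussed later in the thread, canAccessPeer should come back 0 for GeForce Ada cards.

// Print what the CUDA runtime reports for UVA and peer access per device.
// Illustrative sketch for inspecting a local system; not code from the thread.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU%d: %s, unifiedAddressing=%d\n", d, prop.name, prop.unifiedAddressing);
        for (int p = 0; p < count; ++p) {
            if (p == d) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, d, p);
            printf("  canAccessPeer GPU%d -> GPU%d: %d\n", d, p, can);
        }
    }
    return 0;
}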

The UVA not working is a big problem for an application I’m working on too. After debugging it for several days, I posted about it just a few days ago, when I tracked it down to UVA failing in a simple test program.

I tested it and it works on Windows 11 with the 528 driver, but I’m running into other problems with that. I’m keeping my fingers crossed that the next Linux driver release will resolve the problem. It would be great if we could get some feedback from NVIDIA about what to expect on that, though.

I have requested additional feedback. (Bug 3931150 for internal tracking.)

I am setting up a system to reproduce this. I’ll file a new bug if this remains an open issue. Thanks.

I can confirm that dual 4090s also work on Windows 10 with PyTorch (latest driver 528.49, CUDA 12.0); however, it freezes on Ubuntu 22.04 (latest driver 525.89.02, CUDA 12.0). So this is not a hardware limitation but a driver/software issue. Please fix it ASAP. Thank you!

Hello @abchauhan – Do you have a target date for a fix for this issue? There are many modules and packages out there that leverage DDP and are currently broken for multi-GPU use. Thank you!

Hi @abchauhan,
I’ve upgraded to the latest CUDA 12.1 with driver 530.30.02.

simpleP2P still fails, and the peer access check still reports Yes.

Hi,

Peer-to-peer is disabled for all GeForce Ada Desktop cards. There were driver issues related to peer-to-peer which were recently fixed for these cards.

The release candidate is being finalized. I will share the target dates and the driver versions that include the fix as soon as that information is available. Unfortunately, the latest beta version, 530.30.02, does not include the fix.

I can reproduce the hang with this application as well as with the examples at https://github.com/NVIDIA/nccl-tests/issues/117.

I have verified that there are no hangs on drivers with the fix.

Thanks

Hi all,

To clarify, peer-to-peer is disabled on all GeForce Ada Desktop cards, and the fix will resolve the application hangs, crashes, and incorrect results.

The fix doesn’t enable peer-to-peer on these cards. Peer-to-peer on these cards is an unsupported configuration.

Thank you

Hi @suprnerd

The driver release is scheduled for the end of this month (March 2023). I will share the release version information when it’s available.

Thank you

Thanks for the update @abchauhan

I understand that P2P is not and won’t be supported on Ada GeForce cards. Do you know if the problems with unified memory will be fixed with this release? Or, I guess, if that is even supported? It works for me with two 4080s on Windows, and I don’t see anything in the documentation to suggest it shouldn’t work. I’m assuming it should be supported, but someone correct me if I’m wrong.

Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.

This "we have introduced a mysterious bug which makes GPUs stall or calculate wrong values, and we don’t know if it is solvable" situation is unprofessional.

Do you know if the problems with unified memory will be fixed with this release?

Can you please share the forum post’s link for this? Is there an NVIDIA bug number?

Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.

I’ll share this feedback with the teams.

Thank you

From the tone of the discussion, it seems the 4090 will have P2P locked. In my case, that renders the 4090 worthless.

I ended up having an internal debate between an A100 40G and 2x A5500 24G, and opted for the 2x A5500. It is in the ballpark of a pair of 4090s, and it packs the performance of the A100 40G that I have a feeling we both may have been aiming for.

Something to consider if you are able to make the hardware switch.

For HPC we’re stuck with using appropriate hardware. Not necessarily a bad thing. Tools tend to be more functional than toys.

An A5500 is nearly as fast as an RTX 3090… for 4090 performance one has to go for an RTX 6000 Ada.

That’s nice and all, but it’s useless. 24 GB of RAM is nothing. The 5500 has P2P/SLI; the 3090 has neither.

With the poor capabilities of the 4090, it’s worthless as anything but a toy. I returned two of them. At least I can use 100% of an A5500.

Sure, it may be faster, but if it’s all blocked from use, what good does its speed at playing games do for training and inference?

It did nothing for me. It gave me gibberish and unreliable results. It seems a few of us in this thread have had similar experiences.

Hi all,

Driver version 525.105.17 has the changes that resolve the application hangs and crashes. CUDA sample tests will report that P2P is not supported.

Please let us know if you continue to see any failures.

The current 530.xx driver, 530.41.03, does not include the changes. The next releases from the 530 branch should pick up the changes.

Thank you

We can confirm that with driver 525.105.17 the locking problem is fixed, and UVA (Unified Virtual Addressing) also seems to work with 2x RTX 4090 without error on AMD EPYC and Threadripper. With working UVA, NCCL also works → multi-GPU training with PyTorch and TensorFlow works.

Finally, 2x RTX 4090 can be used for deep learning training. The missing P2P performance probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance… but at least now they are usable.

2 months later

I said that it is usable (it does not crash or hang anymore) and that it calculates correct results (which was also not the case in older versions). I did not say that it is good or fast, and I already doubted that it makes sense to use more than two RTX 4090s. What are your experiences with the performance?

The missing P2P performance probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance.

The missing functionality makes it as slow as a CPU.

If you have the $ for 2x 4090, you should pick up 2x A5500 instead. The memory matters more than the processing for inference.

[Screenshots: Before / After]

The “After” pic destroys the performance of the 2x 4090. Not even close.

You need a completely un-hobbled GPU, including P2P, to do anything. Unless you are using multiple systems and building your own P2P, the 4090 is a total waste of time.

  • Note on the bracket: I changed cases later, and the bracket is no longer necessary. The old case needed it. The LianLi O11XL fits everything without issue.
28 days later

Hi,

This is really interesting. I can confirm that, running different training jobs, we are having the same issues described here.

One thing I noticed is that when I train or fine-tune, for example, LLM models, it scales fine across multiple RTX 4090 GPUs (we have tested with 4x GPUs). But when running a vision model like ResNet-50, the system crashes or loses performance. For example, it starts the ResNet-50 training, and after 2 minutes performance drops by about 80%: the GPUs are at 100% utilization at the beginning, using all the power of the GPU, and then after 2 minutes power consumption drops to 120 W or less. This is not related to hardware, since we tested multiple systems and scenarios; it seems it is all related to P2P being locked on GeForce cards.

@abchauhan, can you please let us know if this is permanent, and what it means for multi-GPU workstations? Does it mean we cannot train DL models using multiple 4090s and must buy Quadro GPUs?

Also, on Windows 11, will this P2P be enabled for 3D rendering and unaffected, or how does it work in that case?

Please confirm, @abchauhan; this is important for us to make the right business decision.

Thanks

By running the following tests:

They are not comprehensive, but they did not work at all when this thread was started.

7 months later

This is still affecting me on driver 545.29.06, kernel 6.7.0. I’m getting the same simpleP2P output as the OP.

3 months later

Have there been any updates on this?

On my multi-4090 AMD system, I can train with multiple GPUs. But when deploying an LLM across multiple GPUs, it raises a device-side index assertion error. If I turn off all virtualization, the system hangs when attempting to run inference across multiple GPUs.

I believe this is related to the issue in this thread, as another system of mine with multiple 3090s has no issues at all running the exact same code. I’m on Ubuntu 22.04, Python 3.10, CUDA 12.1. The P2P test seems to run without errors.

12 days later

@kj140717 FWIW, it seems the latest drivers do correctly report P2P as disabled for the 4090, though I’ve since switched to tinygrad’s fork since it implements P2P.