Hey team! Could you fix the NVLink issue on H100 SXM Community pods? I frequently hit the error below. Affected pod ID: 4a5acwxj2kene6
P2P is disabled between NVLINK connected GPUs 1 and 0. This should not be the case given their connectivity, and is probably due to a hardware issue. If you still want to proceed, you can set NCCL_IGNORE_DISABLED_P2P=1.
I can proceed by setting NCCL_IGNORE_DISABLED_P2P=1, but that drops performance by about 10%.
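In case it helps with triage, here is a rough Python sketch of how I check a pod before launching a job. It assumes the usual `nvidia-smi topo -m` matrix layout, where each `GPU<n>` row lists one cell per GPU column in index order and NVLink links show up as `NV<count>`. The helper names (`read_topology`, `nvlink_pairs`, `workaround_env`) are my own, not part of any library.

```python
import os
import re
import subprocess


def read_topology():
    """Return the raw text of `nvidia-smi topo -m` (needs the NVIDIA driver on the pod)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout


def nvlink_pairs(topo_text):
    """Return sorted (i, j) GPU index pairs that the matrix marks as NVLink-connected.

    Assumption: the cells after the row label follow GPU index order, so the
    cell at position j describes the link between the row's GPU and GPU j.
    """
    pairs = set()
    for line in topo_text.splitlines():
        cells = line.split()
        if not cells:
            continue
        row = re.fullmatch(r"GPU(\d+)", cells[0])
        if row is None:
            continue  # skip NIC rows, the legend, and blank lines
        i = int(row.group(1))
        for j, cell in enumerate(cells[1:]):
            # Header cells (GPU1, CPU, Affinity, ...) never match NV<count>.
            if re.fullmatch(r"NV\d+", cell):
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def workaround_env(base=None):
    """Copy an environment and set the flag that the NCCL error message suggests."""
    env = dict(os.environ if base is None else base)
    env["NCCL_IGNORE_DISABLED_P2P"] = "1"
    return env
```

If `nvlink_pairs(read_topology())` lists GPUs 0 and 1 as NVLink-connected while NCCL still reports P2P as disabled between them, that matches the mismatch the error describes. Passing `env=workaround_env()` to `subprocess.run` when starting the training script applies the workaround to that job only.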