Low Bandwidth Issue with NCCL Communication Across Pods
I rented two A100 pods on the Instance Clusters page and used PyTorch's NCCL backend to test the bandwidth between them. The result was well below expectations: I measured only 3.4 GB/s. The output of ibstat shows a rate of up to 200 Gbps for a single InfiniBand device, so the measured throughput is far below what the link should support. Is this normal, or have I misconfigured something?
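To put the gap in numbers, here is a quick sketch of the unit conversion. It is separate from the benchmark script linked below, and it assumes the ibstat rate is the raw link rate in gigabits per second, ignoring encoding and protocol overhead. The function and variable names are only for illustration.

```python
def gbps_to_gb_per_s(rate_gbps: float) -> float:
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return rate_gbps / 8.0


# Values taken from the post above.
link_rate_gbps = 200.0    # rate reported by ibstat for one InfiniBand device
measured_gb_per_s = 3.4   # bandwidth measured with the NCCL backend

# Assumption: the raw link rate is treated as the theoretical ceiling.
peak_gb_per_s = gbps_to_gb_per_s(link_rate_gbps)
utilization = measured_gb_per_s / peak_gb_per_s

print(f"Theoretical peak: {peak_gb_per_s:.1f} GB/s")
print(f"Measured:         {measured_gb_per_s:.1f} GB/s")
print(f"Utilization:      {utilization:.1%}")
```

So a single 200 Gbps device should allow roughly 25 GB/s, and 3.4 GB/s is under 15% of that.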
Here is the script I used: https://gist.github.com/york-droid/cd88b0ab1ffabfa7e6f3ea9355eaceae
The NCCL log, captured with the NCCL_DEBUG=INFO environment variable set, is also attached.
