Quality Control Issues?
I've noticed this a lot and it's becoming very problematic in my workflows. Some of my production GPUs are just objectively worse, and by a lot: roughly 3x worse in performance. Same GPU type, same number of CPUs, same template, same storage, same codebase, same region, yet a pod runs 3x slower per iteration, and I can't simply swap to a new one because some of the new ones have the same problem. I don't want to play casino every time I boot up a GPU, hoping it matches the quality I'm getting on other instances. Is there a way around this problem? This is in the US-NC-1 region, but I've noticed the same issue in US-IL-1 too.
1 Reply
Logs from a healthy pod:
1. PCIe Configuration Check
echo "1. PCIe CONFIGURATION"
# pull the "Current" value under each header in nvidia-smi -q output
PCIE_GEN=$(nvidia-smi -q | grep -A 2 "PCIe Generation" | grep "Current" | head -1 | awk '{print $NF}')
PCIE_WIDTH=$(nvidia-smi -q | grep -A 2 "Link Width" | grep "Current" | head -1 | awk '{print $NF}')
echo "PCIe Generation: Gen $PCIE_GEN (should be 3 or 4)"
echo "PCIe Width: $PCIE_WIDTH (should be 16x)"
echo ""
2. Storage Speed Test
echo "2. STORAGE SPEED TEST"
echo "Testing write speed..."
WRITE_SPEED=$(dd if=/dev/zero of=/workspace/testfile bs=1M count=1000 oflag=direct 2>&1 | grep copied | awk '{print $(NF-1), $NF}')
echo "Write Speed: $WRITE_SPEED"
echo "Testing read speed..."
READ_SPEED=$(dd if=/workspace/testfile of=/dev/null bs=1M 2>&1 | grep copied | awk '{print $(NF-1), $NF}')
rm -f /workspace/testfile
echo "Read Speed: $READ_SPEED"
echo ""
3. CPU-to-GPU Transfer Speed (PCIe bandwidth test)
echo "3. CPU-GPU TRANSFER TEST (PCIe Bandwidth)"
python3 << 'PYTHON'
import time, torch
# Reconstructed body (the original paste was cut off): time pinned host->GPU copies.
speeds = []
for mb in (100, 500, 1000):
    x = torch.empty(mb * 262144, dtype=torch.float32).pin_memory()  # mb megabytes of fp32
    torch.cuda.synchronize(); t0 = time.time()
    x.cuda(); torch.cuda.synchronize(); dt = time.time() - t0
    speeds.append(mb / dt)
    print(f"{mb}MB transfer: {dt:.3f}s = {mb / dt:.0f} MB/s")
print(f"Average PCIe Bandwidth: {sum(speeds) / len(speeds):.0f} MB/s")
PYTHON
/workspace/benchmarkpod.sh | tee /workspace/benchmark$(hostname).log
1. PCIe CONFIGURATION
PCIe Generation: Gen (should be 3 or 4)
PCIe Width: (should be 16x)
2. STORAGE SPEED TEST
Testing write speed...
Write Speed: 882 MB/s
Testing read speed...
Read Speed: 582 MB/s
3. CPU-GPU TRANSFER TEST (PCIe Bandwidth)
100MB transfer: 0.006s = 17421 MB/s
500MB transfer: 0.028s = 18150 MB/s
1000MB transfer: 0.054s = 18358 MB/s
Average PCIe Bandwidth: 17976 MB/s
Expected PCIe Bandwidths (theoretical):
PCIe Gen1 x16: ~4,000 MB/s
PCIe Gen3 x16: ~15,000 MB/s
PCIe Gen4 x16: ~30,000 MB/s
4. GPU COMPUTE TEST
Testing 8192x8192 matrix multiplication...
Time: 0.20s
Performance: 54.12 TFLOPS
Expected 4090 Performance: ~80-90 TFLOPS
5. GPU MEMORY BANDWIDTH TEST
GPU Memory Bandwidth: 471 GB/s
Expected 4090 Bandwidth: ~1000 GB/s
6. SIMULATED WORKLOAD
Data load to GPU: 0.218s
GPU processing (20 steps): 0.148s
Data download from GPU: 0.011s
Total pipeline time: 0.405s
Logs from an unhealthy GPU:
/workspace/benchmarkpod.sh | tee /workspace/benchmark$(hostname).log
1. PCIe CONFIGURATION
PCIe Generation: Gen (should be 3 or 4)
PCIe Width: (should be 16x)
2. STORAGE SPEED TEST
Testing write speed...
Write Speed: 226 MB/s
Testing read speed...
Read Speed: 176 MB/s
3. CPU-GPU TRANSFER TEST (PCIe Bandwidth)
100MB transfer: 0.023s = 4328 MB/s
500MB transfer: 0.123s = 4063 MB/s
1000MB transfer: 0.229s = 4371 MB/s
Average PCIe Bandwidth: 4254 MB/s
Expected PCIe Bandwidths (theoretical):
PCIe Gen1 x16: ~4,000 MB/s
PCIe Gen3 x16: ~15,000 MB/s
PCIe Gen4 x16: ~30,000 MB/s
4. GPU COMPUTE TEST
Testing 8192x8192 matrix multiplication...
Time: 0.22s
Performance: 50.90 TFLOPS
Expected 4090 Performance: ~80-90 TFLOPS
5. GPU MEMORY BANDWIDTH TEST
GPU Memory Bandwidth: 469 GB/s
Expected 4090 Bandwidth: ~1000 GB/s
6. SIMULATED WORKLOAD
Simulating image generation pipeline...
Data load to GPU: 0.898s
GPU processing (20 steps): 1.157s
Data download from GPU: 0.078s
Total pipeline time: 2.368s
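The script for sections 4–6 isn't shown above. Here is a standalone sketch of what a matmul TFLOPS check like section 4 could look like — my reconstruction, not the actual script; the iteration count is a guess, chosen so the formula lines up with ~54 TFLOPS at ~0.2 s:

```python
import time

def matmul_tflops(n: int, iters: int, seconds: float) -> float:
    """TFLOPS for `iters` n x n float32 matmuls (2*n^3 FLOPs each)."""
    return iters * 2 * n**3 / seconds / 1e12

try:
    import torch
    has_gpu = torch.cuda.is_available()
except ImportError:  # torch not installed; only the formula above is usable
    has_gpu = False

if has_gpu:
    N, ITERS = 8192, 10
    a = torch.randn(N, N, device="cuda")
    b = torch.randn(N, N, device="cuda")
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(ITERS):
        a @ b
    torch.cuda.synchronize()
    dt = time.time() - t0
    print(f"Time: {dt:.2f}s")
    print(f"Performance: {matmul_tflops(N, ITERS, dt):.2f} TFLOPS")
```

With 10 iterations and the logged 0.20 s, the formula gives roughly 55 TFLOPS, consistent with the 54.12 TFLOPS in the healthy log.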
These logs are from idle pods with nothing else running on them, but ones that had already been identified as "bad GPUs".
This is a huge bottleneck for using Runpod in production. If it doesn't get fixed, we're just going to have to move to another service, because it's a serious reliability problem.
The main differences appear to be in storage speed and PCIe bandwidth, and they're enough to account for the 3x slowdown in performance.
If this is a hardware problem, it needs to be something we can configure, filter by, or at least be aware of. As it stands, defective setups are being sold at the same price as healthy ones.
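One caveat on the storage numbers: the read test in the script rereads a file that was just written, and without O_DIRECT part of that read can be served from the page cache. A variant that bypasses the cache on both sides (a sketch; O_DIRECT needs filesystem support, and BENCH_DIR is this sketch's own knob):

```shell
# Storage speed test with O_DIRECT on both write and read, so neither
# number is inflated by the page cache.
BENCH_DIR=${BENCH_DIR:-/workspace}
[ -d "$BENCH_DIR" ] || BENCH_DIR=/tmp    # fall back if /workspace doesn't exist
TESTFILE="$BENCH_DIR/ddtest.$$"
dd if=/dev/zero of="$TESTFILE" bs=1M count=1000 oflag=direct 2>&1 | grep copied \
    || echo "write test failed (O_DIRECT unsupported here?)"
dd if="$TESTFILE" of=/dev/null bs=1M iflag=direct 2>&1 | grep copied \
    || echo "read test failed (O_DIRECT unsupported here?)"
rm -f "$TESTFILE"
```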
You can do a quick check with nvidia-smi -q | grep -A 2 "PCIe Generation" to see whether bottlenecked hardware is attached to the Runpod instance.
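That quick check can be wrapped into a small pre-flight script that warns before you commit to a pod (a sketch; the grep patterns assume the field layout of recent nvidia-smi -q output):

```shell
# Pre-flight check: a degraded PCIe link (e.g. Gen1 or x4 instead of
# Gen4 x16) is one hardware difference that can explain a 3x-slower pod.
if command -v nvidia-smi >/dev/null 2>&1; then
    GEN=$(nvidia-smi -q | grep -A 2 "PCIe Generation" | grep "Current" | head -1 | awk '{print $NF}')
    WIDTH=$(nvidia-smi -q | grep -A 2 "Link Width" | grep "Current" | head -1 | awk '{print $NF}')
    echo "PCIe link: Gen ${GEN:-unknown}, width ${WIDTH:-unknown}"
    case "$GEN" in
        3|4|5) echo "PCIe generation looks OK" ;;
        *)     echo "WARNING: degraded or unreported PCIe link" ;;
    esac
else
    echo "nvidia-smi not found; skipping PCIe check"
fi
```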