Artur · Runpod · 9h ago

Quality Control Issues?

I've noticed this a lot and it's becoming very problematic in my workflows. Some of my production GPUs are just objectively worse, and by a lot: around 3x slower per iteration. Same GPU type, same number of CPUs, same template, same storage, same codebase, same region, yet one pod runs 3x slower, and I can't just swap to a new one because some of the new ones have the same problem. I don't want to play casino every time I boot up a GPU and hope it matches the quality I'm getting on other instances. Is there a way around this problem? This is in the US-NC-1 region, but I've noticed the issue in US-IL-1 too.
1 Reply
Artur (OP) · 7h ago
Logs from a healthy pod. First, the benchmark script (sections 1-3 shown; the Python heredoc for the transfer test is cut off in the paste):

```bash
# 1. PCIe Configuration Check
echo "1. PCIe CONFIGURATION"
# Note: these two pipelines come out empty (see the blank values in the
# output below) because they filter for "Current" before narrowing to the
# section, so the second grep never matches.
PCIE_GEN=$(nvidia-smi -q | grep "Current" | grep "PCIe Generation" -A 1 | tail -1 | awk '{print $NF}')
PCIE_WIDTH=$(nvidia-smi -q | grep "Current" | grep "Link Width" -A 1 | tail -1 | awk '{print $NF}')
echo "PCIe Generation: Gen $PCIE_GEN (should be 3 or 4)"
echo "PCIe Width: $PCIE_WIDTH (should be 16x)"
echo ""

# 2. Storage Speed Test
echo "2. STORAGE SPEED TEST"
echo "Testing write speed..."
WRITE_SPEED=$(dd if=/dev/zero of=/workspace/testfile bs=1M count=1000 oflag=direct 2>&1 | grep copied | awk '{print $(NF-1), $NF}')
echo "Write Speed: $WRITE_SPEED"
echo "Testing read speed..."
READ_SPEED=$(dd if=/workspace/testfile of=/dev/null bs=1M 2>&1 | grep copied | awk '{print $(NF-1), $NF}')
rm -f /workspace/testfile
echo "Read Speed: $READ_SPEED"
echo ""

# 3. CPU-to-GPU Transfer Speed (PCIe bandwidth test)
echo "3. CPU-GPU TRANSFER TEST (PCIe Bandwidth)"
python3 << 'PYTHON'
import torch
# ... (rest of the heredoc is truncated in the original paste)
PYTHON
```

Run with:

```bash
/workspace/benchmarkpod.sh | tee /workspace/benchmark$(hostname).log
```

Output on the healthy pod:

```
1. PCIe CONFIGURATION
PCIe Generation: Gen (should be 3 or 4)
PCIe Width: (should be 16x)

2. STORAGE SPEED TEST
Testing write speed...
Write Speed: 882 MB/s
Testing read speed...
Read Speed: 582 MB/s

3. CPU-GPU TRANSFER TEST (PCIe Bandwidth)
100MB transfer: 0.006s = 17421 MB/s
500MB transfer: 0.028s = 18150 MB/s
1000MB transfer: 0.054s = 18358 MB/s
Average PCIe Bandwidth: 17976 MB/s
Expected PCIe Bandwidths (theoretical):
PCIe Gen1 x16: ~4,000 MB/s
PCIe Gen3 x16: ~15,000 MB/s
PCIe Gen4 x16: ~30,000 MB/s

4. GPU COMPUTE TEST
Testing 8192x8192 matrix multiplication...
Time: 0.20s
Performance: 54.12 TFLOPS
Expected 4090 Performance: ~80-90 TFLOPS

5. GPU MEMORY BANDWIDTH TEST
GPU Memory Bandwidth: 471 GB/s
Expected 4090 Bandwidth: ~1000 GB/s

6. SIMULATED WORKLOAD
Data load to GPU: 0.218s
GPU processing (20 steps): 0.148s
Data download from GPU: 0.011s
Total pipeline time: 0.405s
```
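The Python heredoc that produces the transfer numbers is truncated in the paste, so here is a minimal sketch of how such a host-to-GPU bandwidth measurement could be structured. The `measure_bandwidth` helper and the stand-in transfer callable are my assumptions, not the original script:

```python
import time

def measure_bandwidth(transfer, payload_mb):
    """Time transfer(buf) over payload_mb MiB of data; return MB/s."""
    buf = bytearray(payload_mb * 1024 * 1024)
    start = time.perf_counter()
    transfer(buf)
    elapsed = time.perf_counter() - start
    return payload_mb / elapsed

# Stand-in transfer: a plain host-side copy, so this runs anywhere.
# On a pod you would instead move the buffer to the GPU, e.g.
# torch.frombuffer(b, dtype=torch.uint8).cuda() followed by
# torch.cuda.synchronize() before stopping the clock, since CUDA
# copies are asynchronous.
for size_mb in (10, 50, 100):  # the original logs use 100/500/1000 MB
    rate = measure_bandwidth(lambda b: bytes(b), size_mb)
    print(f"{size_mb}MB transfer: {rate:.0f} MB/s")
```

On a healthy Gen4 x16 link the GPU variant of this should land well above 10,000 MB/s; the ~4,000 MB/s figures below are roughly Gen1 x16 territory.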
Logs from an unhealthy GPU (same script, omitted here since it was pasted twice):

```
1. PCIe CONFIGURATION
PCIe Generation: Gen (should be 3 or 4)
PCIe Width: (should be 16x)

2. STORAGE SPEED TEST
Testing write speed...
Write Speed: 226 MB/s
Testing read speed...
Read Speed: 176 MB/s

3. CPU-GPU TRANSFER TEST (PCIe Bandwidth)
100MB transfer: 0.023s = 4328 MB/s
500MB transfer: 0.123s = 4063 MB/s
1000MB transfer: 0.229s = 4371 MB/s
Average PCIe Bandwidth: 4254 MB/s
Expected PCIe Bandwidths (theoretical):
PCIe Gen1 x16: ~4,000 MB/s
PCIe Gen3 x16: ~15,000 MB/s
PCIe Gen4 x16: ~30,000 MB/s

4. GPU COMPUTE TEST
Testing 8192x8192 matrix multiplication...
Time: 0.22s
Performance: 50.90 TFLOPS
Expected 4090 Performance: ~80-90 TFLOPS

5. GPU MEMORY BANDWIDTH TEST
GPU Memory Bandwidth: 469 GB/s
Expected 4090 Bandwidth: ~1000 GB/s

6. SIMULATED WORKLOAD
Simulating image generation pipeline...
Data load to GPU: 0.898s
GPU processing (20 steps): 1.157s
Data download from GPU: 0.078s
Total pipeline time: 2.368s
```

These are logs from idle pods with nothing else running on them, taken from instances already identified as "bad GPUs".

This is a huge bottleneck for using Runpod in production. If it doesn't get fixed, we're going to have to move to another service, because it's a serious reliability problem.

The main differences are in the storage speed tests and the PCIe bandwidth, and they're enough to cause a 3x slowdown in performance. If this is a hardware problem, it needs to be something we can configure, sort by, or at least be aware of. As it stands, defective setups are being sold at the same price as healthy ones.

You can do a quick check for bottlenecked hardware on a Runpod instance with `nvidia-smi -q | grep -A 2 "PCIe Generation"`.
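As an aside, the grep pipeline in the benchmark script returns blank values because it filters for "Current" before narrowing to the section. A small sketch of the same extraction done section-first; the `pcie_current` helper and the sample `nvidia-smi -q` fragment are illustrative assumptions, not Runpod tooling:

```python
def pcie_current(nvidia_smi_q: str, section: str) -> str:
    """Return the 'Current' value listed under a section of `nvidia-smi -q` output."""
    lines = nvidia_smi_q.splitlines()
    for i, line in enumerate(lines):
        if section in line:
            # The 'Current' row sits within two lines of the section header.
            for follow in lines[i + 1 : i + 3]:
                if "Current" in follow:
                    return follow.split(":")[-1].strip()
    return ""

# Sample fragment in the layout `nvidia-smi -q` uses for link info
SAMPLE = """\
        PCIe Generation
            Max                 : 4
            Current             : 4
        Link Width
            Max                 : 16x
            Current             : 16x
"""

print("PCIe Generation: Gen", pcie_current(SAMPLE, "PCIe Generation"))  # Gen 4
print("PCIe Width:", pcie_current(SAMPLE, "Link Width"))                # 16x
```

A pod stuck negotiating Gen1 or a narrow link width would show up immediately here, which matches the ~4,000 MB/s transfer numbers on the bad pod.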
