L40 Thermal throttling
We're seeing an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds takes 90-120 seconds.
Test run on pod: 8hh03rby46hd8s
- when power draw goes to ~300 W and SM usage to ~100%, the GPU clock drops from 2490 MHz to 1650 MHz
- as soon as power draw drops back to its base of ~80-90 W, the GPU clock goes back to full speed
While throttled, we're getting about 65% of the GPU's expected performance.
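In case it helps with reproducing this, here's a rough sketch of how the clock drop could be logged alongside power draw and the throttle flags. It assumes `nvidia-smi` is on the PATH inside the pod and that the driver still accepts the `clocks_throttle_reasons.*` query fields (some driver versions name them differently; `nvidia-smi --help-query-gpu` lists what's available). The helper names `parse_sample`, `read_sample`, `log_samples` and `clock_ratio` are made up for this example.

```python
import csv
import io
import subprocess
import time

# Fields requested from `nvidia-smi --query-gpu`.
# Assumption: these names are accepted by the installed driver.
FIELDS = [
    "timestamp",
    "clocks.sm",
    "power.draw",
    "temperature.gpu",
    "utilization.gpu",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    return dict(zip(FIELDS, values))


def clock_ratio(observed_mhz, full_mhz):
    """Fraction of the full-speed clock that the GPU is actually running at."""
    return observed_mhz / full_mhz


def read_sample(gpu_index=0):
    """Query one GPU once and return the parsed sample (values are strings)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            f"--query-gpu={','.join(FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.strip())


def log_samples(count, interval_s=1.0, gpu_index=0):
    """Print `count` samples, one every `interval_s` seconds."""
    for _ in range(count):
        print(read_sample(gpu_index))
        time.sleep(interval_s)
```

Running something like `log_samples(120)` in a second shell while the model runs should show whether the thermal or power-cap flags turn "Active" at the moment the clock falls. For the numbers above, `clock_ratio(1650, 2490)` comes out to roughly 0.66, which lines up with the ~65% figure.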


