Runpod • 10mo ago
jphipps

L40 Thermal throttling

We are seeing an occasional large slowdown when running our models: a calculation that normally takes 10-15 seconds takes 90-120 seconds.
Test run on pod: 8hh03rby46hd8s
  • when power draw goes to ~300W and SM usage to ~100%, the GPU clock drops from 2490 MHz to 1650 MHz
  • as soon as power draw drops back to its base of ~80-90W, the GPU clock returns to full speed
    We're getting roughly 65% of the GPU's expected performance
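One way to check whether the clock drop is really thermal or is caused by the power cap is to ask the driver which throttle reasons are active while the slowdown is happening. The following is only a sketch: the field names are the ones listed by `nvidia-smi --help-query-gpu` on drivers I am familiar with (newer drivers may also expose them under `clocks_event_reasons.*`), and the helper names `build_command`, `parse_row`, `active_reasons` and `query` are made up for illustration.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu` (assumed to be
# available on the driver installed in the pod).
FIELDS = [
    "clocks.sm",
    "power.draw",
    "power.limit",
    "temperature.gpu",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
]


def build_command(gpu_index=0):
    """Build the nvidia-smi command line for a single GPU."""
    return [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]


def parse_row(csv_line):
    """Map one CSV output line back onto the requested field names."""
    values = [v.strip() for v in csv_line.split(",")]
    return dict(zip(FIELDS, values))


def active_reasons(row):
    """Return the throttle reasons that nvidia-smi reports as 'Active'."""
    prefix = "clocks_throttle_reasons."
    return [
        name[len(prefix):]
        for name, value in row.items()
        if name.startswith(prefix) and value == "Active"
    ]


def query(gpu_index=0):
    """Run nvidia-smi once and return the parsed row (needs a GPU and driver)."""
    result = subprocess.run(
        build_command(gpu_index), capture_output=True, text=True, check=True
    )
    return parse_row(result.stdout.strip().splitlines()[0])


# Usage while the model is running and the clock has dropped:
#   row = query(0)
#   print(row, active_reasons(row))
```

If `sw_power_cap` shows up as active while the temperature stays moderate, that would point to the power limit rather than heat, but that is something to confirm on the pod itself.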
Example:
# nvidia-smi dmon
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
    0     80     44      -      0      0      0      0      0      0   9001   2490
    0     92     44      -     86      4      0      0      0      0   9001   2490
    0    272     53      -    100     42      0      0      0      0   9001   2145
    0    297     55      -    100     44      0      0      0      0   9001   1770
    0    295     55      -    100     44      0      0      0      0   9001   1680
    0    300     56      -    100     41      0      0      0      0   9001   1635
    0    299     56      -    100     40      0      0      0      0   9001   1755
    0    299     57      -    100     40      0      0      0      0   9001   1725
    0    301     57      -    100     43      0      0      0      0   9001   1740
    0    304     58      -    100     42      0      0      0      0   9001   1680
    0     98     49      -     86      4      0      0      0      0   9001   2490
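To put a number on the slowdown, the `dmon` samples above can be parsed and the clock under load compared with the peak clock. This is a minimal sketch: `parse_dmon`, `summarize` and the 95% "busy" threshold are my own choices for illustration, not part of any NVIDIA tool.

```python
# Rows copied from the dmon output above; lines starting with '#' are headers.
DMON_OUTPUT = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
    0     80     44      -      0      0      0      0      0      0   9001   2490
    0     92     44      -     86      4      0      0      0      0   9001   2490
    0    272     53      -    100     42      0      0      0      0   9001   2145
    0    297     55      -    100     44      0      0      0      0   9001   1770
    0    295     55      -    100     44      0      0      0      0   9001   1680
    0    300     56      -    100     41      0      0      0      0   9001   1635
    0    299     56      -    100     40      0      0      0      0   9001   1755
    0    299     57      -    100     40      0      0      0      0   9001   1725
    0    301     57      -    100     43      0      0      0      0   9001   1740
    0    304     58      -    100     42      0      0      0      0   9001   1680
    0     98     49      -     86      4      0      0      0      0   9001   2490
"""


def parse_dmon(text):
    """Parse `nvidia-smi dmon` text into a list of dicts keyed by column name."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # Only the header that starts with 'gpu' names the columns;
            # other '#' lines (command echo, units) are skipped.
            if fields and fields[0] == "gpu":
                columns = fields
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue
        # '-' means the metric is not reported for this GPU.
        rows.append(
            {c: (None if v == "-" else float(v)) for c, v in zip(columns, values)}
        )
    return rows


def summarize(rows, busy_threshold=95):
    """Compare the SM clock in busy samples (sm >= threshold) with the peak clock."""
    peak = max(r["pclk"] for r in rows)
    busy = [r["pclk"] for r in rows if r["sm"] >= busy_threshold]
    if not busy:
        return {"peak_mhz": peak, "mean_busy_mhz": None,
                "min_busy_mhz": None, "mean_ratio": None}
    mean_busy = sum(busy) / len(busy)
    return {
        "peak_mhz": peak,
        "mean_busy_mhz": mean_busy,
        "min_busy_mhz": min(busy),
        "mean_ratio": mean_busy / peak,
    }


print(summarize(parse_dmon(DMON_OUTPUT)))
```

The same parser can be pointed at a longer capture (for example `nvidia-smi dmon` redirected to a file during a slow run) to see how long the clock stays reduced.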