Runpod · 4d ago
Hugo

Failing requests

Hey, all of my serverless endpoint requests are failing. I’ve tried every available GPU config (24GB, 24GB PRO, 32GB PRO, 80GB PRO, 141GB) and they all show “High Supply”, but nothing is processing, the entire service is effectively down. This has been going on for a while now and there’s zero communication. If there’s an outage or scaling issue, please just say it, so we can stop waiting and plan accordingly. Can someone from the team confirm what’s happening and whether there’s an ETA on a fix? edit: endpoint ID = uv7fieonipxw1q
(image attached)
19 Replies
Hugo (OP) · 4d ago
logs
(image attached)
riverfog7 · 3d ago
it doesn't look like a RunPod problem. Try it with 32GB PRO unchecked
Hugo (OP) · 3d ago
I'll try removing 32GB PRO! But I'm almost certain this isn't an issue on my end, because it's the same container that I've been running for a month or two with no major issues until this week. The logs show a CUDA kernel mismatch (sm_120), which points to (according to GPT-5) RunPod rotating their serverless pool to newer GPUs that current PyTorch builds don't support yet. Looks like an infrastructure-side change and not a container config problem.
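(As an aside, a minimal sketch of how to confirm this kind of mismatch from inside the worker, assuming PyTorch is importable in the container: compare the device's compute capability against the architectures the installed torch build ships kernels for.)

```python
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    # compute capability of the GPU this worker actually landed on
    major, minor = torch.cuda.get_device_capability(0)
    sm = f"sm_{major}{minor}"
    print("device:", torch.cuda.get_device_name(0), f"({sm})")

    # architectures this torch build was compiled for, e.g. ['sm_80', 'sm_86', 'sm_90']
    arch_list = torch.cuda.get_arch_list()
    print("kernels compiled for:", arch_list)

    # simple check (ignores possible PTX forward-compatibility)
    if sm not in arch_list:
        print(f"-> this build has no kernels for {sm}; a newer, Blackwell-capable torch build is needed")
else:
    print("CUDA is not available in this environment")
```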
riverfog7 · 3d ago
uhh, it's complicated to explain. As far as I know, the only GPU with 32GB VRAM is the RTX 5090, which uses NVIDIA's Blackwell architecture
riverfog7 · 3d ago
and it has a CUDA compute capability of 12.0
(image attached)
riverfog7 · 3d ago
(image attached)
riverfog7 · 3d ago
This log is saying that CUDA compute capability sm_120 (so Blackwell) is not supported by this installation. That's why I told you to disable the 32GB PRO option, to avoid Blackwell GPUs: 141GB = Hopper, 80GB PRO is probably Ampere, and 24GB / 24GB PRO are probably Ada Lovelace or older, so nothing there has Blackwell.
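(Illustrative sketch only: the same torch introspection can be turned into a fail-fast guard at handler startup, so a worker that lands on an unsupported GPU raises a readable error instead of failing mid-request with a cryptic CUDA kernel message.)

```python
import torch

def assert_supported_gpu() -> None:
    """Fail fast if the worker's GPU has no kernels in this PyTorch build."""
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device is visible to this worker")
    major, minor = torch.cuda.get_device_capability(0)
    sm = f"sm_{major}{minor}"
    if sm not in torch.cuda.get_arch_list():
        raise RuntimeError(
            f"{torch.cuda.get_device_name(0)} ({sm}) is not covered by this "
            f"PyTorch build ({torch.__version__}, CUDA {torch.version.cuda}); "
            "rebuild the image with a Blackwell-capable torch or exclude this GPU tier"
        )

# call once at import/startup, before any requests are handled
assert_supported_gpu()
```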
Hugo (OP) · 3d ago
Yeah, rolling this out now. But as you said, the RTX 5090s use the new Blackwell architecture with a new CUDA compute capability (sm_120) that older PyTorch builds don't fully support yet, which explains why the same container that was previously working is now running into issues. Isn't it their responsibility to maintain backward compatibility, or at the very least communicate GPU rotations that break existing builds?
riverfog7 · 3d ago
No no, they maintain the hardware; you are supposed to maintain the software that runs on top of that. Did RunPod automatically enable 32GB PRO on an existing serverless endpoint? If that's the case, this is their fault.
Hugo (OP) · 3d ago
Yes, true, I'm responsible for my container environment, and I'm fine updating the image if needed, but if a hardware rotation breaks existing compatibility there should at the very least be an announcement. That said, I don't think this is the only issue, because I only added 32GB two days ago, but even before that, starting Monday this week, 60% of my requests were taking over 3 minutes instead of the usual ~20 seconds. So something deeper seems off on the serverless side. Not to mention the initializing being broken, e.g. I have 2 workers that have been initializing since yesterday.
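(For anyone wanting to watch this from outside the console, a rough sketch; it assumes the serverless /health route as described in RunPod's docs, an API key in a RUNPOD_API_KEY environment variable, and the endpoint ID from this thread.)

```python
import os
import time
import requests

ENDPOINT_ID = "uv7fieonipxw1q"  # endpoint ID from this thread
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"  # assumed route, per RunPod docs
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

# poll a few times to see how the worker and job counts move over time
for _ in range(10):
    resp = requests.get(URL, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    print(time.strftime("%H:%M:%S"), resp.json())
    time.sleep(30)
```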
riverfog7 · 3d ago
It would be nice to have a warning message beside the 32GB and 180GB options, since it doesn't list the specific GPU types there, but this is clearly user error in my opinion. You should have checked what GPUs they were using for that option.
Hugo (OP) · 3d ago
ok, but I added it 2 days ago; my problem started 5 days ago
riverfog7 · 3d ago
Serverless being weird recently is another problem, though. It did that 5 days ago? Isn't it a different error? Are you sure it's the same error?
Hugo (OP) · 3d ago
let me check the exact logs, but it was timing out / taking 3+ minutes
riverfog7 · 3d ago
oh, also uncheck the 180GB option if you have it on; it uses the B200 GPU, so Blackwell
Hugo (OP) · 3d ago
You're right, the 32GB was definitely the one causing the actual errors, thanks for this. The other ones were succeeding, but they were taking 3 minutes (normal time 7s-30s), which is beyond my timeout; that's when I added the 32GB (for context). Thoughts on this? This was my initial issue: the container just ran for 8 minutes like this (normal response 7s-30s). The image is the same as when I had no issues; I just pulled it locally and tested it with no issues.
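(A rough way to put numbers on the serverless latency, sketched under assumptions: the standard /run and /status routes, RUNPOD_API_KEY set, and a hypothetical {"prompt": "ping"} payload standing in for whatever the handler actually expects.)

```python
import os
import time
import requests

BASE = "https://api.runpod.ai/v2/uv7fieonipxw1q"  # assumed base URL for this endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}
payload = {"input": {"prompt": "ping"}}  # hypothetical input; replace with the handler's real schema

start = time.time()
job = requests.post(f"{BASE}/run", headers=HEADERS, json=payload, timeout=30).json()

# poll until the job reaches a terminal state, then report wall-clock time
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS, timeout=30).json()
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(5)

print(f"{status.get('status')} after {time.time() - start:.1f}s")  # normal was ~7-30s per the thread
```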
riverfog7 · 3d ago
Something is wrong with the pod
Hugo (OP) · 3d ago
Also, workers constantly switch between throttled and initializing for 12+ hours
(image attached)
riverfog7 · 3d ago
That's strange