Severe performance disparity on RunPod serverless (5090 GPUs)

I’ve deployed workflows on RunPod serverless with 5090 GPUs, and the performance differences I’m seeing are concerning. Same endpoint, same model, same operation, yet the results vary a lot:
- Sometimes the workflow finishes in around 44 seconds
- Other times it takes over 3 minutes
That’s more than 3x slower for the exact same task. The main bottleneck seems to be model loading: on some cards it loads in just a few seconds, while on others it takes much longer. This kind of inconsistency makes it difficult to rely on serverless for predictable performance. Running on the same hardware should not feel like a lottery...
The task failed because it hit a timeout I set, a timeout that should NEVER be hit under normal conditions.
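For anyone trying to narrow this down, here is a minimal sketch (not the actual endpoint code) of timing the model-load and inference stages separately inside a RunPod serverless handler, so cold-start cost can be told apart from pure execution time; `load_models` and `run_workflow` are hypothetical stand-ins for the real workflow:

```python
import time
import runpod

_models = None  # module-level cache; survives warm invocations on the same worker

def handler(job):
    global _models
    t0 = time.perf_counter()
    if _models is None:
        _models = load_models()  # hypothetical: load checkpoints, e.g. from the network volume
    t_load = time.perf_counter() - t0

    t1 = time.perf_counter()
    result = run_workflow(_models, job["input"])  # hypothetical: the actual workflow
    t_run = time.perf_counter() - t1

    # Logging both numbers shows whether the 44s-vs-3min gap comes from loading or from inference.
    print(f"load={t_load:.1f}s run={t_run:.1f}s")
    return result

runpod.serverless.start({"handler": handler})
```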
17 Replies
WeamonZ
WeamonZOP2d ago
We can add to this the following problem: https://discord.com/channels/912829806415085598/1422881307758694451 I thought this came from the GPU, so I deleted the worker. Not at all, others have the same problem, and I still have this error. Depending on the worker, it can take from 1 to 3 minutes for the same operation, same GPU.
Milad
Milad2d ago
The problem is that, as far as I can tell, you can't pick the CPU or the available memory, so you might get a decent CPU or an outdated one, which will heavily affect model loading.
WeamonZ
WeamonZOP2d ago
@Milad that's kind of insane to be honest... for the memory I rarely need too much, but for the CPU, it's a huge bottleneck...
Ethan
Ethan2d ago
How are your delay times @WeamonZ ? I'm also noticing insane delay times... execution time seems consistent for me though
WeamonZ
WeamonZOP2d ago
Still shitty. Sometimes I have 30s of delay even when I have workers available.
Ethan
Ethan2d ago
Yea, it's insane. This isn't usual, right? I only noticed this recently.
WeamonZ
WeamonZOP18h ago
@Ethan I never had such an issue, same for the performance disparity between workers.
@Milad Funnily enough, the 16 vCPU worker is about 30% faster. Yep, I ran more tests: depending on the vCPU I see about a 30% performance hit, so I guess I had an even worse CPU causing the 3 minutes. Those CPUs should be put in the trash, or reserved for lower-end GPUs. We can't work with those CPUs on a 5090.
Update: I ran a pod with the same configuration as serverless, AND using network volumes. I get up to 60% lower speed in serverless than with the pod. This is the first time this has happened across all of my 12 workflows. I usually see a 10-20% raw performance boost using serverless (and up to 400% on model loading speed), and here it's totally the opposite.
Milad
Milad16h ago
Yeah, it is not the number of vCPUs but the model: some servers, for example, are running a Ryzen 9 7950X and some an EPYC 7B13, and there is a big performance difference between the two. I have found that certain data centers run better CPUs, but then you run into capacity issues at scale if you focus on just one data center. It's either that, or some of these machines are being overloaded and the worker is starving for CPU time.
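A minimal sketch (standard library only, assuming a Linux worker image) of logging which host CPU a worker actually landed on, so slow runs can be matched against the CPU model:

```python
import os

def cpu_model() -> str:
    # Read the CPU model string from /proc/cpuinfo (e.g. "AMD Ryzen 9 7950X" or "AMD EPYC 7B13").
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return "unknown"

print(f"worker CPU: {cpu_model()}, visible cores: {os.cpu_count()}")
```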
WeamonZ
WeamonZOP15h ago
@Milad This answers the 30% difference between workers. But I'll still investigate the GPU not caching the models between two runs: I should go from 1 min on the first run to 20s on the second, and it doesn't, even though my GPU is only loaded at 70% (28GB).
Milad
Milad15h ago
I have noticed the same thing; I see the model being reloaded even on subsequent runs.
WeamonZ
WeamonZOP14h ago
@Milad I don't have this issue on lighter workflows using the same models on the same worker. It seems to happen only when too many models are loaded, even though the GPU VRAM is not full (it can't be full, since it shows 70%, AND in a pod environment everything is fine).
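A minimal sketch of one way to check whether weights actually stay resident in VRAM between two runs on the same warm worker, assuming a CUDA build of PyTorch running inside the same process that loaded the models (for example from a small ComfyUI custom node):

```python
import torch

def vram_report(tag: str) -> None:
    # free/total bytes for device 0, plus bytes held by live tensors in this process.
    free, total = torch.cuda.mem_get_info()
    allocated = torch.cuda.memory_allocated()
    gib = 1024 ** 3
    print(f"[{tag}] allocated={allocated / gib:.1f} GiB, "
          f"free={free / gib:.1f} GiB of {total / gib:.1f} GiB")

# Call vram_report("after run 1") and vram_report("before run 2"):
# if "allocated" collapses between the two, the models were unloaded
# and the next run pays the full reload cost.
```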
Ethan
Ethan13h ago
Interesting. I have also noticed worse performance in serverless, not sure what it is.
WeamonZ
WeamonZOP13h ago
Text encoders and CLIP etc. are loaded on the CPU, not the GPU. That might be the issue.
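A minimal sketch (plain PyTorch, not ComfyUI internals) of confirming which device an encoder's weights actually live on; `text_encoder` is a hypothetical stand-in for the loaded CLIP / Qwen text encoder:

```python
import torch

def where_is(model: torch.nn.Module) -> set[str]:
    # Collect the set of devices the module's parameters sit on, e.g. {"cpu"} or {"cuda:0"}.
    return {str(p.device) for p in model.parameters()}

# devices = where_is(text_encoder)  # hypothetical module
# if devices == {"cpu"}:
#     text_encoder.to("cuda")       # move it to the GPU, if VRAM allows
```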
Ethan
Ethan13h ago
Interesting, what workflows are you running? ComfyUI? I noticed these flags helped a lot:

OMP_NUM_THREADS=32 \
TOKENIZERS_PARALLELISM=true \
python main.py \
  --listen 0.0.0.0 \
  --port 8188 \
  --highvram

The models load much quicker, like a 5x boost: ~300s before vs a consistent ~40s now for a first-time run.
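As a quick sanity check, a minimal sketch to verify from inside the worker that the thread settings were actually picked up (PyTorch reads OMP_NUM_THREADS at import time for its intra-op thread pool):

```python
import os
import torch

print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
print("torch intra-op threads =", torch.get_num_threads())
```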
WeamonZ
WeamonZOP12h ago
Wow, really @Ethan? On what GPU? Are the models stored in network storage? @Ethan I'm indeed using ComfyUI. This problem only affects a big Qwen workflow (Qwen Image + Qwen Edit); it doesn't seem to affect Wan, Flux, etc. And the text encoders might be using a custom architecture that is well supported on Ryzen but not on EPYC... I don't know.
Ethan
Ethan9h ago
Yea, I think the slow model loading is from a Comfy change they made. No idea what happened. Or maybe it was RunPod, because I noticed it too. But I used these flags and it's back to normal. That wasn't my most recent issue, though. The delay times are how long it takes the GPUs to cold start. Execution time for me was normal on 5090 and H100.
WeamonZ
WeamonZOP8h ago
Mh, interesting. I'm trying your settings right now and I see not much impact. OK, here are my reports: I'm working with a 5090 too, loading 2 models (about 26GB in total). With your --highvram setting it loads the CLIP on the GPU instead of the CPU, as far as I understand. The problem is that I get a "Not enough VRAM" error, because 8 more GB are trying to be loaded onto the 32GB of VRAM but only 6GB are available.
Now with a lighter workflow: everything loads fine, but no faster results. I'm also using network storage for these tests. So far I've seen no performance boost.
Yep, I confirm that I see no performance boost using --highvram on a 5090 with a Qwen workflow (text encoder & CLIP are loaded on the GPU thanks to --highvram), and the model loading time seems to be about the same.
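A minimal sketch of the VRAM arithmetic here, assuming a CUDA build of PyTorch; the ~26 GB and ~8 GB figures are taken from the report above and are approximate:

```python
import torch

gib = 1024 ** 3
free, total = torch.cuda.mem_get_info()  # bytes free / total on device 0
encoders_gib = 8  # approx. size of the text encoder + CLIP that --highvram would move to the GPU
print(f"free: {free / gib:.1f} GiB of {total / gib:.1f} GiB")
if free / gib < encoders_gib:
    # Matches the "Not enough VRAM" error above: ~26 GB of weights leave only ~6 GB free on a 32 GB 5090.
    print("--highvram would likely OOM here; keep the encoders on the CPU or use a lighter workflow")
```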
