Severe performance disparity on RunPod serverless (5090 GPUs)
I’ve deployed workflows on RunPod serverless with 5090 GPUs, and the performance differences I’m seeing are concerning.
Same endpoint, same model, same operation — yet the results vary a lot:
Sometimes the workflow finishes in around 44 seconds
Other times it takes over 3 minutes
That’s more than 4x slower for the exact same task.
The main bottleneck seems to be model loading. On some cards it loads in just a few seconds, while on others it takes much longer.
This kind of inconsistency makes it difficult to rely on serverless for predictable performance. Running on the same hardware should not feel like a lottery...
The task even failed, because I had set a timeout that should NEVER be hit.
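For what it's worth, a rough way to see where the time goes is to log the two phases separately in the handler. This is only a sketch; load_models() and run_workflow() are placeholders, not the actual handler calls:

import time

def load_models():
    # Placeholder for the real model-loading step in the handler.
    time.sleep(0.1)

def run_workflow():
    # Placeholder for the real workflow execution.
    time.sleep(0.1)

def timed(label, fn):
    # Wall-clock timer so a slow worker can be attributed to loading vs. inference.
    start = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return out

timed("model load", load_models)
timed("workflow execution", run_workflow)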

17 Replies
We can add the following problem to this: https://discord.com/channels/912829806415085598/1422881307758694451
This came from the GPU I guess, I deleted the worker.
Not at all, others have the same problem.
I still have this error. Depending on the worker, it can take from 1 to 3 minutes for the same operation, same GPU
The problem is that, as far as I can tell, you can't pick the CPU or the available memory, so you might get a decent CPU or an outdated one, which heavily affects model loading
@Milad that's kind of insane to be honest...
For the memory I rarely need much, but the CPU is a huge bottleneck...
How are your delay times @WeamonZ ?
I'm also noticing insane delay times... execution time seems consistent for me though
Still shitty
Sometimes I have 30s of delay even when I have workers available
Yea it's insane, this isn't usual right? I only noticed it recently
@Ethan I never had such an issue, same for the performance disparity between workers
@Milad Even funnier: the 16 vCPU worker is about 30% faster
Yep, I ran more tests. Depending on the vCPU, I see about a 30% performance hit
So I guess I had an even worse CPU causing those 3 minutes. Those CPUs should be put in the trash, or assigned to lower-end GPUs. We can't work with those CPUs on a 5090
Update: I ran a pod with the same configuration as serverless, AND using network volumes. I get up to 60% lower speed on serverless than on the pod. This is the first time this has happened, across all 12 of my workflows
I usually get a 10-20% raw performance boost using serverless (and up to 400% faster model loading), and here it's totally the opposite.
Yeah, it's not the number of vCPUs but the CPU model: some servers run a Ryzen 9 7950X and some an EPYC 7B13, and there is a big performance difference between the two. I have found that certain data centers run better CPUs, but then you run into capacity issues at scale if you focus on just one data center
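A quick way to see which CPU a given worker landed on is to log the model string at startup. A small sketch (the /proc/cpuinfo path is Linux-only):

import os
import platform

def cpu_model() -> str:
    # On Linux workers the CPU model string is listed in /proc/cpuinfo;
    # fall back to platform.processor() anywhere else.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or "unknown"

print("CPU:", cpu_model())              # e.g. Ryzen 9 7950X vs EPYC 7B13
print("Logical cores:", os.cpu_count())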
It's either that, or some of these machines are being overloaded and the worker is starving for CPU time.
@Milad That explains the 30% difference between workers
But I'll still investigate why the GPU isn't caching the models between two runs
I should go from 1 min on the first run to 20s on the second, and it doesn't
and my GPU is only loaded at 70% (28GB)
I have noticed the same thing, I see the model being reloaded even on the subsequent runs
@Milad I don't have this issue on lighter workflows using the same models on the same worker. It seems to happen only when too many models are loaded, even though the GPU VRAM is not full (it can't be full, since it shows 70%, AND in a pod environment everything is fine)
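One way to check whether the models are really evicted between runs is to log PyTorch's VRAM counters at the start of each invocation; a sketch, with log_vram and the tag being purely illustrative:

import torch

def log_vram(tag: str) -> None:
    # memory_allocated() only counts this process's live PyTorch tensors,
    # which is what should stay high between runs if the models stay cached.
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated={alloc:.1f}GB reserved={reserved:.1f}GB")

# Call at the top of every handler invocation; if "allocated" drops back to
# ~0 between runs, the models really are being evicted and reloaded.
if torch.cuda.is_available():
    log_vram("request start")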
interesting
I have also noticed worse performance in serverless, not sure what it is
Text encoders, CLIP, etc. are loaded on the CPU, not the GPU
That might be the issue
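One way to confirm that is to check which device the text encoder / CLIP weights actually end up on. A rough sketch with a toy module standing in for the real text encoder (with ComfyUI you'd inspect the loaded model object instead):

import torch
from torch import nn

# Toy stand-in for the text encoder; in practice, inspect the real loaded module.
model = nn.Linear(8, 8)

def weight_device(m: nn.Module) -> torch.device:
    # The device of the first parameter shows whether the weights sit in
    # system RAM (cpu) or in VRAM (cuda:N).
    return next(m.parameters()).device

print("weights on:", weight_device(model))               # cpu by default
if torch.cuda.is_available():
    print("after .cuda():", weight_device(model.cuda()))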
interesting
What workflows are you running? ComfyUI?
I noticed these flags helped a lot
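# OMP_NUM_THREADS sets how many CPU threads OpenMP-backed ops may use,
# TOKENIZERS_PARALLELISM=true enables parallelism in the Hugging Face tokenizers,
# and --highvram makes ComfyUI keep models in GPU memory instead of unloading them to CPU after use.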
OMP_NUM_THREADS=32 \
TOKENIZERS_PARALLELISM=true \
python main.py \
--listen 0.0.0.0 \
--port 8188 \
--highvram
for loading the models much quicker
like a 5x boost
It was ~300s before; now I get a consistent ~40s for a first-time run
Wow, really @Ethan? On what GPU?
Are the models stored in network storage?
@Ethan I'm indeed using ComfyUI; this problem only affects a big Qwen workflow (Qwen Image + Qwen Edit). It doesn't seem to affect Wan, Flux, etc.
And the text encoders might be using a custom architecture that is well supported on Ryzen but not on EPYC... I don't know.
Yea
I think the slow model loading is due to a ComfyUI change they made
No idea what happened
Or maybe it was RunPod, because I noticed it too
But I used these flags and it’s back to normal
That wasn’t my most recent issue though. The delay times are how long the GPUs take to cold start
Execution time for me was normal
5090 and H100
Hmm interesting, I'm trying your settings right now and I don't see much impact
OK, here are my results:
I'm working with a 5090 too, loading 2 models (about 26GB total). With your --highvram settings it loads the CLIP on the GPU instead of the CPU, as far as I understand. The problem is that I get a "Not enough VRAM" error, because 8 more GB are being pushed into the 32GB of VRAM when only 6GB are free
Now, with a lighter workflow:
Everything loads fine, but the results are no faster
I'm also using network storage for those tests. So far I've seen no performance boost
Yep, I confirm that I see no performance boost using --highvram on a 5090 with a Qwen workflow (the text encoder & CLIP are loaded on the GPU thanks to --highvram)
And the model loading time seems to be around the same
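If it helps, a rough way to rule out the network volume itself is to measure raw read throughput on the checkpoint file. A sketch under assumptions: the path below is only an example, and a second run can be skewed by the OS page cache:

import os
import sys
import time

def read_throughput_gbs(path: str, chunk_mb: int = 64) -> float:
    # Stream the file once and report effective read speed; a slow network
    # volume shows up here before the model even reaches RAM.
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_mb * 1024 * 1024):
            pass
    return size / (time.perf_counter() - start) / 1e9

if __name__ == "__main__":
    # e.g. python read_bench.py /runpod-volume/models/checkpoints/model.safetensors
    print(f"{read_throughput_gbs(sys.argv[1]):.2f} GB/s")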