Need more RAM but not more VRAM in serverless endpoints
What should I do if I need endpoints with more system RAM than what serverless endpoints currently provide? This is for GPU endpoints.
I have a similar question.
can you share more? which gpu type and how much more ram?
He says he doesn't need more VRAM. He needs more RAM allocated to the worker.
Here RAM refers to system RAM and not the GPU RAM
yep updated, still same question
Let's say the GPU type is RTX A5000, and I need a fixed system RAM of 80 GB. But currently the allocation happens randomly. Is it possible to fix the RAM?
currently it's not possible. we have been contemplating for a while a feature that would let you define a specific amount of RAM; that's why i was asking, to understand if your RAM requirements are within the realm of what we can offer
typically 1.5x-2x RAM compared to VRAM is possible, but currently it's random; that's something we can optimize as we handle workloads. anything more than 2x is likely to run into capacity issues, we'd have to explore that more
Nope, they aren't. An RTX A5000 has 24 GB of VRAM, so even 2x only comes to 48 GB, well short of the 80 GB I need. Yesterday I created an endpoint, and because one worker had a low RAM allocation, I got an OOM error. Then I terminated it and luckily got a higher allocation on another worker.
yes, that's what i meant. currently serverless doesn't have a feature to give you workers with a specific amount of RAM; it's something we'd need to enable with some additional cost attached to it
@CodingNinja - Curious, can you give us a bit more insight into the workload that's consuming so much RAM? What types of tasks are you running?
Take a simple ComfyUI workflow for WAN 2.2 Animate: the entire system RAM gets exhausted and the Pod becomes unresponsive, even though the video resolution was only 720x720 and GPU RAM wasn't an issue in that case. System RAM plays a very significant role with ComfyUI. ComfyUI keeps a lot of stuff on the CPU: model parts get loaded/serialized there before moving to the GPU, VAE decode and image IO happen on the CPU, and ComfyUI caches node outputs in RAM. PyTorch also uses pinned host buffers for GPU transfers. All of that stacks up and spikes host memory even when the GPU looks fine. That's why a worker with more RAM ran fine, but a lower-RAM one died.
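To make that concrete, here's a minimal sketch (assuming torch and psutil are installed on the worker) of how CPU-side staging plus pinned host buffers eat system RAM while VRAM is barely touched. The ~1 GB buffer just stands in for a model shard; the sizes and names are illustrative, not anything ComfyUI-specific:

```python
# Minimal sketch: CPU staging and pinned host buffers consume system RAM
# even when VRAM usage stays low. Sizes are illustrative only.
import psutil
import torch

def host_ram_used_gb():
    vm = psutil.virtual_memory()
    return (vm.total - vm.available) / 1024**3

print(f"host RAM used at start: {host_ram_used_gb():.1f} GB")

# Loaders typically materialize weights on the CPU before moving them to GPU.
cpu_weights = torch.ones(512 * 1024**2, dtype=torch.float16)  # ~1 GB on host
print(f"after CPU staging:      {host_ram_used_gb():.1f} GB")

# pin_memory() copies into page-locked host memory for faster transfers,
# so the host-side footprint roughly doubles while both tensors are alive.
pinned = cpu_weights.pin_memory()
print(f"after pinning:          {host_ram_used_gb():.1f} GB")

if torch.cuda.is_available():
    gpu_weights = pinned.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    print(f"VRAM allocated:         {torch.cuda.memory_allocated() / 1024**3:.1f} GB")
```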
Also, since Serverless is costlier than Pods, clients want to minimize cost and can't always go with Pro GPUs, so more offloading to the CPU happens in those cases. And since the system RAM allocation is random per worker, the workloads feel like luck: sometimes it fits, sometimes it OOMs.
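Until fixed RAM allocation is supported, one possible stopgap (just a sketch, assuming the standard runpod Python SDK handler pattern; MIN_HOST_RAM_GB and the error payload shape are made up for illustration) is to check total host RAM at the top of the handler and fail fast with a clear error instead of OOMing halfway through a workflow:

```python
# Sketch of a fail-fast host RAM check in a RunPod serverless handler.
# MIN_HOST_RAM_GB and the error payload shape are assumptions for illustration.
import psutil
import runpod

MIN_HOST_RAM_GB = 60  # hypothetical floor for this ComfyUI workload

def handler(job):
    total_gb = psutil.virtual_memory().total / 1024**3
    if total_gb < MIN_HOST_RAM_GB:
        # Return a clear error so the request can be retried on another worker
        # instead of dying with an opaque OOM mid-run.
        return {"error": f"worker has {total_gb:.0f} GB system RAM, "
                         f"need at least {MIN_HOST_RAM_GB} GB"}
    # ... run the actual ComfyUI workflow here ...
    return {"status": "ok"}

runpod.serverless.start({"handler": handler})
```

It doesn't fix the allocation lottery, but it turns a hang or OOM into a clean, retryable error.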

Yes, this was my use case and issue as well.