AdhishT · 5mo ago

Using the same GPU for multiple requests?

Hello @here, I am using ComfyUI plus my own custom scripts to generate images. I have set this up on RunPod Serverless (A100 GPUs) as follows: the request contains an image URL; the image is downloaded and processed, and the output image is uploaded to S3. Each task takes around 30 seconds, but a single request uses at most about 10% of the GPU's memory. I want multiple requests to share the same GPU so that overall throughput is higher. Is there a way to do this? Is there an existing template that handles this scenario?
3 Replies
AdhishT · 5mo ago
@flash-singh @Justin
flash-singh · 5mo ago
Yes, take a look at our vLLM worker and how it uses the SDK to handle multiple requests in parallel with concurrency. @Justin can provide further details. FYI, it's a holiday weekend, so support will be limited.
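For reference, here is a minimal sketch of that pattern using the Python SDK's `concurrency_modifier` hook, adapted to the image pipeline described above. The async helpers (`download_image`, `run_pipeline`, `upload_to_s3`) and the cap of 4 concurrent jobs are illustrative placeholders, not code from this thread or from worker-vllm:

```python
import asyncio

import runpod


async def download_image(url):
    # Stand-in for the real download step.
    await asyncio.sleep(0)
    return b"input-image-bytes"


async def run_pipeline(image):
    # Stand-in for the ComfyUI / custom-script processing step.
    await asyncio.sleep(0)
    return b"output-image-bytes"


async def upload_to_s3(output):
    # Stand-in for the S3 upload step.
    await asyncio.sleep(0)
    return "s3://my-bucket/output.png"


async def handler(job):
    # An async handler lets one worker interleave several jobs:
    # while this job awaits network I/O, the SDK can start others.
    image = await download_image(job["input"]["image_url"])
    output = await run_pipeline(image)
    return {"output_url": await upload_to_s3(output)}


def concurrency_modifier(current_concurrency):
    # Called by the SDK to decide how many jobs this worker may hold
    # at once. A fixed cap of 4 is illustrative; with ~10% GPU memory
    # per request, a higher cap may fit, so tune against real usage.
    return 4


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

Note that concurrency only helps while requests are waiting on I/O or the GPU has headroom; CPU- or GPU-bound steps still serialize unless the underlying pipeline can batch them.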
justin · 5mo ago
GitHub: worker-vllm/src/handler.py at main · runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.