AdhishT · 5mo ago

Using the same GPU for multiple requests?

Hello @here, I am using ComfyUI plus my own custom scripts to generate images. I have set this up on RunPod Serverless (A100 GPUs) as follows: the request contains an image URL; the image is downloaded and processed, and the output image is uploaded to S3. Each task takes around 30 seconds, but a single request uses at most about 10% of the GPU's memory. I want multiple requests to share the same GPU so that overall throughput is higher. Is there a way to do this? Is there an existing template that handles this scenario?
3 Replies
AdhishT · 5mo ago
@flash-singh @Justin
flash-singh · 5mo ago
Yes, take a look at our vLLM worker and how it uses the SDK to handle multiple requests in parallel with concurrency. @Justin can provide further details. FYI, it's a holiday weekend, so support will be limited.
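For reference, here is a minimal sketch of that pattern using the Python SDK's `concurrency_modifier` hook, adapted to the image pipeline described above. The async helpers (`download_image`, `run_pipeline`, `upload_to_s3`) and the cap of 4 concurrent jobs are illustrative placeholders, not code from this thread or from worker-vllm:

```python
import asyncio

import runpod


async def download_image(url):
    # Stand-in for the real download step.
    await asyncio.sleep(0)
    return b"input-image-bytes"


async def run_pipeline(image):
    # Stand-in for the ComfyUI / custom-script processing step.
    await asyncio.sleep(0)
    return b"output-image-bytes"


async def upload_to_s3(output):
    # Stand-in for the S3 upload step.
    await asyncio.sleep(0)
    return "s3://my-bucket/output.png"


async def handler(job):
    # An async handler lets one worker interleave several jobs:
    # while this job awaits network I/O, the SDK can start others.
    image = await download_image(job["input"]["image_url"])
    output = await run_pipeline(image)
    return {"output_url": await upload_to_s3(output)}


def concurrency_modifier(current_concurrency):
    # Called by the SDK to decide how many jobs this worker may hold
    # at once. A fixed cap of 4 is illustrative; with ~10% GPU memory
    # per request, a higher cap may fit, so tune against real usage.
    return 4


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

Note that concurrency only helps while requests are waiting on I/O or the GPU has headroom; CPU- or GPU-bound steps still serialize unless the underlying pipeline can batch them.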
justin · 5mo ago
GitHub: worker-vllm/src/handler.py at main · runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.