Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

using compression encoding for serverless requests

Just wondering: is the serverless endpoint capable of receiving and processing compressed requests (e.g., zstd, gzip)?
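
For reference, a minimal sketch of what a compressed request would look like from the client side, assuming the endpoint (or Runpod's proxy in front of it) honors the Content-Encoding header, which is exactly the open question here; the endpoint ID and API key are placeholders:

```python
import gzip
import json

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-api-key"          # placeholder

payload = json.dumps({"input": {"prompt": "hello"}}).encode("utf-8")

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    data=gzip.compress(payload),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
        # Only effective if Runpod's ingress actually decompresses gzip;
        # that support is what this thread is asking about.
        "Content-Encoding": "gzip",
    },
)
print(resp.json())
```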

Throttled ECR Download?

We have a serverless endpoint that uses an ECR registry to back the image. When initializing a new worker, the download of a changed layer (which is 3 GB) can sometimes take more than 20 minutes. Is this download speed typical? Is there another pattern we should be using? It's surprising that a pull from ECR is such a large bottleneck on our cold-start time....

Need some help to troubleshoot a configuration of a Serverless

I created my account and subscribed so I could create a serverless endpoint, and I set it up using the web interface, but it doesn't seem to work. I need some help ASAP.

Do Webhook Request Responses have a retry mechanism?

If a response webhook fails, is there a retry mechanism in place for resending it? If so, what does it look like, i.e., how many retries and over how long?...
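
Whatever the retry policy turns out to be, a receiver that deduplicates deliveries by job ID is safe to call more than once. A minimal sketch, assuming the webhook payload carries the job's id field (Flask and the /runpod-webhook route are illustrative choices):

```python
from flask import Flask, request

app = Flask(__name__)
seen_jobs = set()  # in production, use a persistent store such as Redis

@app.route("/runpod-webhook", methods=["POST"])
def runpod_webhook():
    body = request.get_json(force=True)
    job_id = body.get("id")  # field name assumed from the job status payload
    if job_id in seen_jobs:
        return "", 200       # duplicate delivery: acknowledge and ignore
    seen_jobs.add(job_id)
    # ... process body.get("output") here ...
    return "", 200           # a non-2xx response is what could trigger a retry
```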

Incorrect billing

The billing for the last 4 weeks seems to be wrong; can someone help me understand it? I am using only two serverless endpoints and no other services. Endpoint IDs: ed0rivbjvv0x0u and pzfz3xhwa86raj

Request getting stuck

Hey, I am using a Runpod endpoint and all my requests are stuck. It's mission critical, and I have raised a ticket. I'm using a network volume in EU-SE-1.

Serverless endpoint status and runsync not returning data anymore in response body (request not found)

Hey team, I have a custom serverless endpoint worker. It has always worked: the logs show that everything went as planned, and the requests are marked as completed after the time I expect. However, on my API the requests error out, and on the UI they show as completed but have no output. When I inspect the status in Thunder Client, Runpod says that the request does not exist. I would like to understand what is going on and how I can make my API more resilient to these issues. Attached are screenshots of the behavior:...
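
One way to soften transient "request does not exist" responses, whatever their root cause, is to retry the status check with backoff instead of failing on the first 404. A minimal sketch against the documented /status route; the endpoint ID and API key are placeholders:

```python
import time

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-api-key"          # placeholder

def fetch_status(job_id: str, attempts: int = 5, delay: float = 2.0) -> dict:
    """Poll the job status, retrying instead of trusting a first 'not found'."""
    last = None
    for attempt in range(attempts):
        last = requests.get(
            f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        if last.status_code == 200:
            return last.json()
        time.sleep(delay * (attempt + 1))  # linear backoff between attempts
    last.raise_for_status()  # give up: surface the final error
```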

I want to increase/decrease workers by code or script, can you help? (GraphQL)

I have a serverless setup already. Generally we keep 1 active worker during the hours when we expect traffic throughout the day, and at night, when no one is using the application, we set active workers to 0 to avoid charges. The next day, we manually set active workers back to 1 from the Runpod dashboard. We would like to do that automatically. I know there is a GraphQL API, but I am not able to find relevant code for it. Can anyone please help?...
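
A sketch of how this could be scripted, e.g. from a cron job. Runpod's GraphQL endpoint is https://api.runpod.io/graphql; the saveEndpoint mutation and its fields below come from community examples rather than official docs, so verify them against the current schema (the mutation may also require fields such as name, templateId, and gpuIds):

```python
import requests

API_KEY = "your-api-key"          # placeholder
ENDPOINT_ID = "your-endpoint-id"  # placeholder

def set_active_workers(count: int) -> dict:
    # workersMin corresponds to "active workers" in the dashboard (assumption
    # based on community examples -- verify against the current schema).
    query = """
    mutation {
      saveEndpoint(input: { id: "%s", workersMin: %d, workersMax: 3 }) {
        id
        workersMin
        workersMax
      }
    }
    """ % (ENDPOINT_ID, count)
    resp = requests.post(
        "https://api.runpod.io/graphql",
        params={"api_key": API_KEY},
        json={"query": query},
    )
    resp.raise_for_status()
    return resp.json()

set_active_workers(1)  # morning: keep one always-on worker
set_active_workers(0)  # night: scale active workers to zero
```

Two cron entries, one for the morning and one for the night, would then cover the daily toggle.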

Support for https://huggingface.co/deepseek-ai/DeepSeek-V3?

Would it be possible to get support for https://huggingface.co/deepseek-ai/DeepSeek-V3? It is currently the best open-source model for coding.

Serverless Idle Timeout is not working

One of my serverless endpoints is not respecting the idle timeout setting. Instead of staying active for 300 seconds, it turns idle after 5. I have redeployed the endpoint and it works for a while, but today, again without any changes, the endpoint turns idle after 5 seconds even though it's set to 300....

Flashboot meaning?

Is there any documentation on what it does under the hood? I am asking because of this: "FlashBoot reduces majority cold-starts down to 2s, even for LLMs. Make sure to test output quality before enabling." ...

Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving

Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The setup would be disaggregated: something like 1 worker with 8 total GPUs, where 4 GPUs handle one prefill task and 4 GPUs handle one decode task. Can experts help me set this up using vLLM on Runpod serverless? I am going for this approach because I want super low latency, and I think sharding the model for prefill and decode separately with tensor parallelism will help me achieve this....
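
As far as I know, disaggregated prefill/decode serving is still an experimental vLLM feature and is not exposed by the stock Runpod vLLM worker, so the sketch below only covers the tensor-parallel half of the setup; the model ID is assumed to be the Hugging Face repo for Llama 3.2 3B Instruct:

```python
from vllm import LLM, SamplingParams

# Tensor parallelism across 4 GPUs; the prefill/decode split itself would come
# from vLLM's experimental disaggregated serving, which is not shown here.
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed HF repo id
    tensor_parallel_size=4,
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```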

job timed out after 1 retries

Hello! I'm getting this on every job now on the 31py4h4d9ytybu serverless endpoint. My logs have zero messages or any indication of where this is happening; from the outside it looks as if the workers are totally paused or non-responsive. This silently hung work for over an hour. I'm on runpod 1.7.4. This is currently having significant impacts on production work, without any clear remediation (see screenshots: no logs for many, many minutes despite work happening constantly, and errors on every job). Wou...

Can't see Billing beyond July

Hi, I'm trying to get my billing invoices, but I don't see anything beyond six months. Can someone help?...

Linking runpod-volume subfolder doesn't work

Hey, I've been trying to create a serverless Runpod worker with a network volume attached to it. I want to link specific folders from the network volume into the worker. To do so, I'm running the following bash file. ```bash...
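
Since the bash file above is cut off, here is a hypothetical illustration of the usual approach: create the links at runtime, because the network volume is only mounted at /runpod-volume once the worker starts, so links created at image-build time point at nothing. The /runpod-volume/models and /workspace/models paths are made up for the example:

```python
import os
from pathlib import Path

# Hypothetical layout: expose /runpod-volume/models inside the worker
# as /workspace/models. Run this at container start (or from the handler),
# because /runpod-volume is only mounted at runtime on serverless.
src = Path("/runpod-volume/models")
dst = Path("/workspace/models")

dst.parent.mkdir(parents=True, exist_ok=True)
if not dst.exists():
    os.symlink(src, dst, target_is_directory=True)
```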

ComfyUI Image quantity / batch size issue when sending request to serverless endpoint

I'm not able to generate multiple images from a prompt/request to the endpoint using a ComfyUI workflow. We have added a variable for the "batch_size" value in our workflow, but it only seems to generate one image regardless of the batch_size we give it. This is our GitHub repo for the Runpod worker: https://github.com/sozanski1988/runpod-worker-comfyui/ ...
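
In stock ComfyUI workflows exported in API format, batch_size usually lives on the EmptyLatentImage node, and a common failure mode is substituting it as a string ("4") rather than an integer, which can silently fall back to one image. A hypothetical payload fragment (the node ID and dimensions are made up):

```python
# Hypothetical fragment of a ComfyUI API-format workflow payload.
workflow = {
    "5": {  # node id is made up; match it to your exported workflow
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 4},  # int, not "4"
    },
    # ... the remaining nodes of the workflow ...
}
payload = {"input": {"workflow": workflow}}
```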

Some basic confusion about the `handlers`

Hi everyone! 👋 I'm currently using RunPod's serverless option to deploy an LLM. Here's my setup: - I've deployed vLLM with a serverless endpoint (runpod.io/v2/<endpoint>/run)....
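
For context on what a handler is: the prebuilt vLLM worker ships its own, so you only write one for a custom worker. The shape below follows the runpod Python SDK; the echo logic is a placeholder:

```python
import runpod

def handler(job):
    """Receives the JSON posted to /run; job["input"] is the request payload."""
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}  # the return value becomes the job's "output"

runpod.serverless.start({"handler": handler})
```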

Next js app deploy on Runpod

Dear Runpod community, I need to deploy our Next.js app on Runpod, similar to how it works on Vercel. In our Next.js app, I handle the frontend and also create backend APIs for MongoDB interactions. Additionally, I need to run scheduled jobs. Can we do that with Runpod, and if not, which hosting provider would you recommend for this setup?

Optimizing VLLM for serverless

Hello. I am trying to optimize vLLM for a serverless endpoint. The default vLLM settings are blazing fast for cached workers (~1 s) but unusable with cold-start initialization (40-60+ seconds). Forcing eager mode removes CUDA graph capture and pushes cold-start initialization down to ~20 s, at the price of slower generation. Beyond that, I feel stuck on what could be improved, since the longest tasks are currently creating the LLM engine and vLLM's memory-profiling stage, each taking up to 6 seconds. I am attaching the complete log file with time comments from such a job.

I am wondering if anyone has found the settings sweet spot for the fastest cold starts with acceptable generation speed, or if there is a way to remove the initialization step for newly spawned workers. I have already researched many things, from automatic caching on a network volume (which didn't work at all; with bitsandbytes models no cache is saved) to snapshotting and trying to share the initialized state between workers (which is probably not possible)....
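
For anyone landing here, a sketch of the engine arguments discussed above, using vLLM's Python API; the values are illustrative starting points rather than tuned recommendations, and the model ID is a placeholder:

```python
from vllm import LLM

# Trading generation speed for faster initialization: enforce_eager=True skips
# CUDA graph capture, and an explicit load_format avoids auto-detection on load.
llm = LLM(
    model="your-org/your-model",  # placeholder model id
    enforce_eager=True,
    gpu_memory_utilization=0.90,
    load_format="safetensors",
)
```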