RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

ComfyUI Job Failed

Based on the logs, it seems like it tries to connect before it's actually launched. Let me know if there's anything I should do on my end to fix it. It worked well before....

Why are my delay times so high, and why am I bearing all the costs?

Yesterday everything was working fine and delay times were a couple of seconds, but now the delay times are getting ridiculous, and I'm being charged for the delay on top of the execution??
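
For what it's worth, the job status response reports the queue wait and the handler run time separately, which makes it easy to see which one is actually growing. A minimal sketch in Python with requests; the delayTime/executionTime field names are what I've seen in status responses, and the endpoint ID and API key are placeholders, so treat all of them as assumptions:

```python
import os
import requests

ENDPOINT_ID = "your_endpoint_id"            # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]      # placeholder env var

def job_timings(job_id: str) -> dict:
    """Fetch a job's status and separate queue delay from execution time."""
    resp = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # delayTime: time the job waited in the queue / for a cold start (ms, assumed field name)
    # executionTime: time the handler actually ran (ms, assumed field name)
    return {
        "status": data.get("status"),
        "delay_s": data.get("delayTime", 0) / 1000,
        "execution_s": data.get("executionTime", 0) / 1000,
    }
```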

Serverless endpoints disappeared

All my prod requests are failing too 😭

Intermittent error No space left on device

We have been getting this error on some workers, and it seems to be totally random. One request gets this error and fails, and the next request to the same worker goes through.
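
One possible mitigation while this gets diagnosed, sketched under the assumption that the handler can safely clear its own scratch files: check free space at the top of the handler and clean up or fail fast instead of dying mid-job. The threshold and temp-file pattern below are illustrative, not anything Runpod prescribes.

```python
import shutil
import tempfile
from pathlib import Path

MIN_FREE_GB = 2  # illustrative threshold

def ensure_disk_space(path: str = "/") -> float:
    """Return free space in GB, clearing stale scratch dirs first if we're low."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < MIN_FREE_GB:
        # Best-effort cleanup of our own scratch directories from earlier jobs.
        for p in Path(tempfile.gettempdir()).glob("job_*"):
            if p.is_dir():
                shutil.rmtree(p, ignore_errors=True)
        free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < MIN_FREE_GB:
        # Fail fast with a clear message instead of hitting ENOSPC mid-inference.
        raise RuntimeError(f"Only {free_gb:.1f} GB free on {path}")
    return free_gb
```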

Workers throttled while processing request

Hi, in the past couple of days I've been getting cases where my jobs stop while processing, without any logs. I can see the job started in my DB, but they never finish, and I don't see anything regarding those specific jobs in the Runpod logs, not even that they started. Could it be a case of workers being throttled while processing? If so, how can I prevent or catch these events? ...
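
One client-side way to catch silently-lost jobs is a watchdog around submission, assuming the run/status/cancel helpers in the runpod Python SDK behave as described in its README; the endpoint ID, API key, deadline, and retry count below are placeholders:

```python
import time
import runpod  # pip install runpod

runpod.api_key = "YOUR_API_KEY"            # placeholder
endpoint = runpod.Endpoint("ENDPOINT_ID")  # placeholder

def run_with_watchdog(payload: dict, deadline_s: int = 900, poll_s: int = 10):
    """Submit a job and poll its status; resubmit once if it silently stalls."""
    for attempt in range(2):
        job = endpoint.run(payload)  # payload is the handler input (or a full {"input": ...} request)
        start = time.time()
        while time.time() - start < deadline_s:
            status = job.status()  # e.g. IN_QUEUE, IN_PROGRESS, COMPLETED, FAILED
            if status == "COMPLETED":
                return job.output()
            if status in ("FAILED", "CANCELLED", "TIMED_OUT"):
                break  # retry on the next attempt
            time.sleep(poll_s)
        # Stalled past the deadline or failed: cancel (best effort) and retry once.
        try:
            job.cancel()
        except Exception:
            pass
    raise RuntimeError("Job did not complete after retry")
```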

How to return bytes from a serverless endpoint?

Hey all - I'm trying to return bytes from a serverless endpoint; however, it appears that Runpod is automatically attempting to serialize them as JSON. Is there any way to return raw bytes from an endpoint, without this occurring?
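
Handler return values go through JSON serialization, so the usual workaround is to base64-encode the bytes (or upload them to object storage and return a URL). A minimal sketch of the standard runpod handler pattern; generate_bytes and the output field name are hypothetical stand-ins:

```python
import base64
import runpod

def generate_bytes(params: dict) -> bytes:
    # Hypothetical stand-in for the real model call that produces raw bytes.
    return b"\x89PNG..."  # e.g. an encoded image

def handler(job):
    raw: bytes = generate_bytes(job["input"])
    # Handler outputs must be JSON-serializable, so encode the bytes as base64.
    return {"data_b64": base64.b64encode(raw).decode("ascii")}

runpod.serverless.start({"handler": handler})
```

The client then decodes with base64.b64decode. For large payloads, uploading to S3-compatible storage and returning a presigned URL keeps the response itself small.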

Response code 502/520/400

When using the synchronous (runsync) version of the Runpod API, the system works as expected. However, when switching to the asynchronous version (run), the service consistently returns HTTP error statuses such as 400, 502, and 520. The logs of my backend, which runs outside of Runpod, show the following sequence of events. I do 10 retries: Raranker URL: https://api.runpod.ai/v2/serverless_id/run...
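
When debugging this, it helps to log the body of each failed /run response and to back off on 5xx: a 400 usually points at the request shape, while 502/520 tend to be transient. A retry-and-poll sketch in Python with requests; the URL shape comes from the post, and the header format, payload wrapper, and status strings are what I believe the v2 API uses, so treat them as assumptions:

```python
import os
import time
import requests

ENDPOINT_ID = "serverless_id"  # placeholder from the post
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

def run_async(payload: dict, retries: int = 10) -> dict:
    """Submit with /run, retrying transient 5xx, then poll /status for the result."""
    for attempt in range(retries):
        resp = requests.post(f"{BASE}/run", json={"input": payload},
                             headers=HEADERS, timeout=30)
        if resp.status_code == 200:
            job_id = resp.json()["id"]
            break
        # Log the body: 400 is usually a malformed request, 502/520 are often transient.
        print(f"/run attempt {attempt + 1}: {resp.status_code} {resp.text[:200]}")
        if resp.status_code < 500:
            resp.raise_for_status()  # client errors won't fix themselves
        time.sleep(2 ** attempt)
    else:
        raise RuntimeError("Exhausted retries on /run")

    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return status
        time.sleep(2)
```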

Serverless endpoint fails with CUDA error

Hi! I have a WhisperX image; here is the repo that I used for the image: https://github.com/kodxana/whisperx-worker. One of the Runpod workers for that image started throwing "CUDA failed with the error: CUDA-capable device is busy or unavailable" in response to every request. Once this worker was restarted, everything was fixed. The problem is that it failed 18 requests before we found the error and fixed it. This is the first time this error has happened. Is there a way to properly set up notifications if a worker fails a lot of requests, or maybe restart the worker if a few requests fail?...
Solution:
Set the minimum CUDA version to 12.4.

The serverless endpoint times out after 600 seconds, even though the timeout is set to 3600 seconds

Hello! I've set the execution timeout to 3600 seconds on my endpoint, but the request failed at exactly 10 minutes (which is the default timeout). I have also tried entirely disabling the execution timeout (unchecked the box), but requests still fail at exactly 600 seconds. Does it even work?...
Solution:
I mean the code (probably the handler) has a timeout.
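
Besides a timeout inside the handler code, the execution timeout can, as far as I know, also be set per request through the job's execution policy (in milliseconds), and a hard-coded 600000 there or in the handler would explain jobs dying at exactly 600 seconds regardless of the endpoint setting. A sketch for ruling that out; the policy field name and payload shape are assumptions to verify against the docs:

```python
import os
import requests

ENDPOINT_ID = "ENDPOINT_ID"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

# Per-request execution timeout override, in milliseconds (assumed field name).
# If a value like 600000 is hard-coded here or in the handler, it would keep
# killing jobs at 600 s no matter what the endpoint UI says.
body = {
    "input": {"prompt": "..."},               # whatever the handler expects
    "policy": {"executionTimeout": 3600 * 1000},
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=body,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
print(resp.status_code, resp.json())
```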

Can I look at all workers in serverless endpoint, review latest completed request & delete them

I have an urgent issue that requires a patch: I need to look at all currently online workers for a serverless endpoint, review each worker's latest completed task, and then optionally kill/delete that worker immediately based on how long the task took. Please help. Urgent.

How to configure a one-to-one mapping of client connection to worker/GPU instance

I am building an application where a client connects to a worker and the worker streams some content to the client over a websocket. I want to configure this setup to force a one-to-one mapping of client to worker. In other words, I would like precise control over how individual client requests are allocated to workers. I tried setting the request count to 1 to force the endpoint to spin up one worker per client connection, but that didn't work because, while the endpoint does spin up one worker pe...

Serverless SGLang spent credits on phantom requests

I deployed a serverless endpoint (id ua6ui6kfksdocn). I tried sending a sample request from the web dashboard; that one still seems to be in the queue, 20 hours later. However, looking at the logs, there are lots of requests like this: ```...

Serverless Requests Queuing Forever

Title says it all - I send a request to my serverless endpoint (just a test through the runpod website UI), and even though all of my workers are healthy, the request has just been sitting in the queue for over a minute. Am I being charged for time spent in queue as well as time spent on actual inference? If that's the case, then I'm burning a lot of money very fast lol. Am I doing something wrong?...

Serverless endpoint fails with Out Of Memory despite no changes

For several months I have been using the same endpoint code to generate Stable Diffusion 1.5 images at 512x512 with Auto1111 (in other words, quite low specs). I have a serverless endpoint with 16 GB (the logs show more memory available, but the setup was 16 GB). There are very few requests to the endpoint; that's how I know the worker was booting up from a fresh start in the two test cases that failed. Practically right after booting, when I try to begin inference, I get the following error:...

Getting 401 during image push for serverless, when built from gitrepo

I am getting the following error when pushing the image for a serverless endpoint that is built from GitHub. The image build works correctly, but it crash-loops with the following error during the push. I do not have Container Registry Auth set up in Runpod, as I am using public images only, and the build itself worked correctly. Do I need any other kind of auth on Runpod to be able to push the image that is built from the GitHub repo?
2025-04-06 13:23:27 [INFO] Pushing image to registry. This takes some time - please be patient.
2025-04-06 13:23:57 [ERROR] 272 | });
2025-04-06 13:23:57 [ERROR] 273 |...
Solution:
I think I figured it out. It seemed to work when I changed the base image to a runpod* image. Earlier, I was using a python/slim image, assuming that it is a public image on Docker, but it looks like Runpod only supports using a Runpod base image.

I am trying to connect Facefusion

I'm trying to run FaceFusion on RunPod, but for some reason, it's not working. Do you have any tips on how to set it up properly?

How long does it take to build?

Hey guys, I'm new to Runpod and I just deployed a serverless instance via GitHub, and even after almost 40 minutes it just says "pending" and "waiting for build". Is there a long wait time for the build to occur, or am I doing something wrong?

GPU not detected on RunPod serverless - HELP!!

Hey everyone, I'm running into an issue on a RunPod serverless endpoint. Despite having CUDA 12.4.1 set up in my Docker container, my models are initializing on the CPU instead of the GPU. My logs show: "Initializing pipeline on cpu"...
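
A quick sanity check worth adding at handler start-up, assuming PyTorch: log what the container actually sees and fail loudly instead of silently falling back to CPU. If torch.cuda.is_available() is False inside the worker, the usual culprits are a CPU-only torch wheel in the image or a CUDA-version filter mismatch on the endpoint. Sketch only; the suggested fixes in the message are illustrative:

```python
import torch

def pick_device() -> torch.device:
    """Log the CUDA view from inside the container and choose a device explicitly."""
    print("torch:", torch.__version__, "| cuda build:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available(),
          "| device count:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("gpu:", torch.cuda.get_device_name(0))
        return torch.device("cuda:0")
    # Silently falling back to CPU is what produces "Initializing pipeline on cpu";
    # raising here surfaces the misconfiguration on the first request instead.
    raise RuntimeError("No CUDA device visible; check that the image ships a CUDA torch "
                       "wheel (e.g. cu124) and that the endpoint's CUDA version filter matches.")
```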

worker exited with exit code 0

Hello team, I'm trying to host my Remotion video rendering on Runpod serverless, built with Node.js via Docker. The build completes, but when I send a request it never moves out of the job queue. The worker starts, gives the error "worker exited with exit code 0", and never shuts down, and the video doesn't get rendered. Every time, I have to terminate the worker and purge the queue....