Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

Job Dispatching Issue - Jobs Not Sent to Running Workers

CPU HeadPod running a Gradio frontend for a ComfyUI serverless backend. The serverless workers start from a custom image and run ComfyUI directly from the network drive, using the network drive's Python 3.12.3 venv. My settings are configured for one worker per job. When I send two jobs in parallel from different PCs, the platform correctly scales to two running workers, but the job queue assigns both jobs to the same worker sequentially. The second worker remains running without receiving a job; its log says that ComfyUI and the handler are ready. Configuration:...
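
If the goal is strictly one job per worker, it may be worth confirming the worker itself advertises a concurrency of 1. A minimal sketch using the runpod Python SDK's concurrency_modifier (the handler body here is a placeholder, not the poster's actual code):
```python
import runpod

def handler(job):
    # Placeholder: forward the job input to the local ComfyUI instance.
    return {"echo": job["input"]}

def concurrency_modifier(current_concurrency: int) -> int:
    # Always report a capacity of 1 so each worker takes a single job
    # and additional queued jobs force the endpoint to scale out.
    return 1

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```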

Stuck at initializing

I changed to a bigger GPU because of the usual VRAM error, but then this happened (I used an L40S before and it wasn't like this).

So serverless death

Not sure what you guys did tonight, but the endpoint stopped passing jobs to my vLLM workers at about 3 pm my time. The backup was fine. I trashed all the workers, and still they would sit there ready, with jobs in the queue, and the jobs would just run until timeout. I had to trash the endpoint, redeploy, and add the new endpoint into rotation. So I figure you owe me at least $30 in credit... not to mention my time (2 hrs to deploy and quality-check)...

How to bake models checkpoints in docker images in ComfyUI

I've seen in an earlier discussion that it is faster to bake models into the image thanks to RunPod FlashBoot, in https://discordapp.com/channels/912829806415085598/1364592893867724910. Thanks to @gokuvonlange for explaining it! Does that mean I have to bake the ComfyUI git repo and all the other custom nodes and requirements too? And how does that make things faster than just using a network volume?...
Solution:
Yes, either bake those two into the image, or copy them to /workspace and then run from there.

Is it possible to set serverless endpoints to run for more than 24 hours?

I’m trying to configure my serverless endpoints so they can run for more than 24 hours. I set a policy for the job as described here: https://docs.runpod.io/serverless/endpoints/send-requests#execution-policies and also set the executionTimeout to a value higher than 24 hours when creating the endpoints. However, the jobs still exit exactly at the 24-hour mark. Is it possible to increase this limit, and if so, how?
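
For reference, the request-level execution policy from the linked docs looks roughly like this (the endpoint ID and timeout value are placeholders). Note that executionTimeout is given in milliseconds, and a platform-side ceiling may still apply on top of whatever you set:
```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {"prompt": "..."},  # your job input
    "policy": {
        # executionTimeout is in milliseconds; 30 hours shown here.
        # A platform-level cap may still end the job earlier.
        "executionTimeout": 30 * 60 * 60 * 1000,
    },
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(resp.json())
```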

New load balancer serverless endpoint type questions

Hey team! In the past, I tried to use RunPod's queue-based serverless for my voice AI project, but the added job-queue latency made it impossible: voice AI requires sub-200 ms inference latency, and the queue overhead was both large and unpredictable. That's fine for long-running jobs, but not for high-frequency, low-latency work. This new load balancer serverless endpoint type looks amazing and seems to solve a real feature gap in the GPU provider game. ...
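
With a load balancer endpoint, requests go straight to the HTTP server running in your worker rather than through the job queue. A minimal sketch of calling one (the hostname pattern and the /generate route are assumptions; the route must match whatever your own server exposes):
```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

# Load balancer endpoints proxy directly to the HTTP server inside the
# worker, so there is no job-queue hop on the request path.
resp = requests.post(
    f"https://{ENDPOINT_ID}.api.runpod.ai/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "hello"},
    timeout=5,
)
print(resp.status_code, resp.json())
```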

Mounting a network storage on comfyui serverless endpoint

I have a network storage volume where I have downloaded all the models I need to generate images with the ComfyUI interface. All the models and custom nodes have been verified by running workflows on a Pod instance, and images are generated as intended. To avoid manual setup, I used the ComfyUI image on a serverless endpoint, and with the default flux1-dev-fp8 model images were generated perfectly. Then I tried to generate images using my own workflow and, as expected, got a missing-custom-nodes error. So I edited the endpoint and attached the network storage from the advanced settings, but I still get the same error about missing custom nodes. Can anyone guide me to solve this issue?...
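
One possible cause: attaching the volume only makes it available at /runpod-volume, while the ComfyUI instance inside the worker still scans just its own custom_nodes directory. A hedged sketch of linking the volume's nodes in at startup (both paths are assumptions about this particular image's layout and would need adjusting):
```python
import os
from pathlib import Path

# Assumed layouts: adjust to where ComfyUI actually lives in your image.
VOLUME_NODES = Path("/runpod-volume/ComfyUI/custom_nodes")
COMFY_NODES = Path("/ComfyUI/custom_nodes")

def link_custom_nodes() -> None:
    """Symlink each custom node package from the network volume into the
    ComfyUI install, so the worker sees the same nodes as the Pod did."""
    COMFY_NODES.mkdir(parents=True, exist_ok=True)
    for node_dir in VOLUME_NODES.iterdir():
        target = COMFY_NODES / node_dir.name
        if node_dir.is_dir() and not target.exists():
            os.symlink(node_dir, target)

if __name__ == "__main__":
    link_custom_nodes()  # run before launching ComfyUI
```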

Testing default "hello world" post with no response after 10 minutes

Attached a few pics of what I tried to do. I eventually cancelled it after a little under 10 minutes and never got a reply; it just stayed in the queue. I assume I'm doing something wrong. I left all endpoint settings at their defaults and set the Hugging Face URL to openai/gpt-oss-20b.
Solution:
Oh, change your image tag to
runpod/worker-v1-vllm:v2.8.0gptoss-cuda12.8.1
...
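
Once the worker is on the right image, a quick way to sanity-check it is the OpenAI-compatible route the vLLM worker exposes (the endpoint ID below is a placeholder; the /openai/v1 base path is how the vLLM worker is typically addressed):
```python
import os
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"  # placeholder

# The vLLM worker serves an OpenAI-compatible API under /openai/v1.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
print(resp.choices[0].message.content)
```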

do the public endpoints support webhooks?

I'm not seeing anything in the documentation about webhooks for the public endpoints.
Solution:
Update: They do support webhooks :)
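
For anyone landing here later: the webhook goes in the top-level request body alongside input, the same as for regular queue-based endpoints (the endpoint ID and callback URL are placeholders):
```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder

payload = {
    "input": {"prompt": "a photo of a cat"},
    # RunPod POSTs the job result to this URL when the job finishes.
    "webhook": "https://example.com/runpod-callback",
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    json=payload,
)
print(resp.json())
```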

Serverless timeout issue

Hi guys, I need help with a serverless timeout issue. I have a serverless endpoint that keeps timing out after 60 seconds. I tried setting the timeout to 1200, tried disabling the timeout, and tried sending a timeout with the request: { "input": {...

Generic RunPod worker launched, ignoring my container's ENTRYPOINT

Hello, I'm experiencing an issue with my serverless endpoint. Despite my endpoint being configured to use a 'Custom' worker with my own Docker image (ovyrlord/comfyui-runpod:v1.27), the logs show that a generic RunPod worker is being launched, which ignores my container's ENTRYPOINT. I have verified all my settings and pushed multiple new image tags, but the issue persists. Can you please investigate and clear any stuck configurations on your end for my endpoint?

Load balancing Serverless Issues

Hello everyone, I was trying to switch from queue-based to load balancing. I tried it with the default template provided, and even tried hitting the HTTP port directly, but requests keep running indefinitely; unless I force the workers to stop, they keep running and incurring charges. Any recommendations on how to properly hit the endpoint and actually get a response? It just hangs and doesn't really start the worker....
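
Load balancer endpoints route traffic to an HTTP server inside your worker and use a health route to decide readiness, so a server that never reports healthy can look like it hangs forever. A minimal sketch, assuming the endpoint is configured with port 8000 and health check path /ping (both are settings on the endpoint, and /generate is a placeholder route):
```python
# A minimal FastAPI server shaped for a load balancer endpoint.
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/ping")
def ping():
    # The load balancer polls this; return 200 only when ready to serve.
    return {"status": "healthy"}

@app.post("/generate")
def generate(payload: dict):
    # Placeholder for your actual inference logic.
    return {"output": f"echo: {payload}"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```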

Access to remote storage from vLLM

I want to make an API call with a file that is on my RunPod remote storage, but vLLM tells me: Cannot load local files without --allowed-local-media-path ...
Solution:
It works with this (in case it helps others): add the line "allowed_local_media_path": os.getenv('ALLOWED_LOCAL_MEDIA_PATH', '/runpod-volume') in /worker-vllm/src/engine_args.py. That way you can set an ENV variable with the paths you want (or it will default to /runpod-volume)....

"In Progress" after completion

I have a serverless endpoint that trains LoRAs, but for some reason, after it finishes, it's still "In Progress". The container has been removed and I am not being charged, yet the status does not update to completed.

Load Balancer Endpoint - "No Workers Available"

I tried using a Load Balancer Endpoint today and got it to work successfully. After several successful uses, though, I noticed some annoying behavior.
1. When testing, you'll have lots of test runs, so you hit the endpoints multiple times. But around, say, the 12th time of sending a task to the worker, it returns "no workers available" despite a worker existing in the "Idle" state.
2. When doing inference (for context, I use LatentSync, so it takes a good 2-5 minutes), I had to manually hit /ping to prevent the worker from becoming "Idle", which is kind of annoying....

a

2025-08-09 13:32:40 [INFO] #10 0.357 update-alternatives: using /usr/bin/python3.10 to provide /usr/bin/python (python) in auto mode
2025-08-09 13:32:41 [INFO] #10 0.359 update-alternatives: using /usr/bin/pip3 to provide /usr/bin/pip (pip) in auto mode
2025-08-09 13:32:41 [INFO] #10 0.359 update-alternatives: warning: not replacing /usr/bin/pip with a link
2025-08-09 13:32:41 [INFO] #10 DONE 0.4s
2025-08-09 13:32:41 [INFO]...

Serverless Logs Inconsistent

Hello, We are currently testing various Docker files to ensure the stability and reliability of the systems we’ve built. However, we’ve encountered significant challenges with the logging system. At this time, logs only appear to function properly about 10% of the time. Additionally, telemetry data tends to reset whenever we open the details for individual workers, and the log output is blank in approximately 90% of cases....

long build messages don't wrap

Long build messages don't wrap in the Builds section, preventing you from accessing the ellipsis menu.

Failed to return job results

My serverless worker logs these errors throughout the process:
2025-08-07T09:05:53.399265616Z {"requestId": "7d8f9b4a-9caf-48cb-a798-e4047fe62a9b-e1", "message": "Failed to return job results. | 404, message='Not Found', url='https://api.runpod.ai/v2/mwbt52if15qdt0/job-done/nvyv3441xhr52v?gpu=NVIDIA+H100+80GB+HBM3&isStream=false'", "level": "ERROR"}
There are no progress updates on /status and no completed status either (the process completes successfully, though)...

How to set max concurrency per worker for a load balancing endpoint?

I'm trying to configure the maximum concurrency for each worker on my serverless load balancing endpoint, but I can't seem to find the setting in the new UI.
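
Since load balancer endpoints hand requests straight to the HTTP server in your worker, per-worker concurrency is typically enforced in the server itself rather than in the endpoint settings. A hedged sketch of capping in-flight requests with a semaphore (the limit, routes, and 503 behavior are illustrative choices, not a documented RunPod mechanism):
```python
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()

MAX_CONCURRENCY = 2  # placeholder: max in-flight requests per worker
slots = asyncio.Semaphore(MAX_CONCURRENCY)

@app.get("/ping")
def ping():
    return {"status": "healthy"}

@app.post("/generate")
async def generate(payload: dict):
    if slots.locked():
        # All slots busy: refuse so the caller can retry another worker.
        raise HTTPException(status_code=503, detail="worker at capacity")
    async with slots:
        await asyncio.sleep(1)  # placeholder for real inference work
        return {"output": f"echo: {payload}"}
```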