Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


New build stuck in endless "Pending" state

Usually the worker picks up a new build after a few minutes. This time it has been stuck at "pending" for hours and counting... Also, builds that worked earlier don't start building anymore. Is anybody else having this issue?...

Skip build on Github Commit

Hi there, is there a way to skip kicking off a build on commit to GitHub? I have a serverless endpoint set up with direct GitHub integration with Runpod (not GitHub Actions), and I am wondering if there is a way to skip kicking off the build using the commit message, like "[skip ci]" or something similar.

Model initialization failed: CUDA driver initialization failed, you might not have a CUDA gpu.

I'm getting this error a few times while loading a model that runs on GPU/torch; the model then proceeds to get loaded on CPU, even though most of the time the model loads and runs fine on GPU....
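
A pattern that might help here, assuming a torch-based loader: probe CUDA with a short retry and fail loudly instead of silently falling back to CPU. A minimal sketch; the helper name, retry count, and wait are illustrative:

```python
import time

import torch

def get_device(max_checks: int = 3, wait_s: float = 2.0) -> torch.device:
    """Return a CUDA device, retrying briefly; raise instead of silently using CPU."""
    for _ in range(max_checks):
        if torch.cuda.is_available():
            return torch.device("cuda")
        time.sleep(wait_s)  # the driver can be briefly unready right after a cold start
    raise RuntimeError("CUDA driver never became available; refusing CPU fallback")

# usage inside your loader (the model object is your own):
# model = model.to(get_device())
```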

getting occasional OOM errors in serverless

I'm running a small service using Runpod serverless + ComfyUI, and once in a while I get this error. `"error": "Traceback (most recent call last):\n  File \"/handler.py\", line 708, in handler\n    raise RuntimeError(f'{node_type}: {exception_message}')\nRuntimeError: WanVideoSampler: Allocation on device\nThis error means you ran ...`
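
One mitigation for intermittent OOMs like this is to catch the allocation failure in the handler and ask the platform to recycle the worker, so the next job starts on a clean GPU. A rough sketch, assuming the Runpod Python SDK's `refresh_worker` return flag; `run_workflow` is a placeholder for the actual ComfyUI invocation:

```python
import runpod
import torch

def run_workflow(job_input):
    raise NotImplementedError  # placeholder for the actual ComfyUI call

def handler(job):
    try:
        return run_workflow(job["input"])
    except torch.cuda.OutOfMemoryError as exc:
        # Report the failure and flag this worker for replacement so any
        # leaked VRAM from this job cannot poison the next one.
        return {"error": f"CUDA OOM: {exc}", "refresh_worker": True}

runpod.serverless.start({"handler": handler})
```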

ComfyUI + custom models & nodes

I've read this here and tried it: https://github.com/runpod-workers/worker-comfyui But I'm still not sure if I did it correctly. So I made a Dockerfile based on one of the versions and added the things I need: ```Dockerfile...
Solution:
You'll stop seeing the error you had, where a worker was spawned to try to handle that job but it was throwing: requirement error: unsatisfied condition: cuda>=12.6, please update your driver to a newer version, or use an earlier cuda container: unknown...

bug in creating endpoints

I'm trying to create a ComfyUI 5.4.0 endpoint from a new Gmail account. From the Serverless page I go through New Endpoint under Serverless, but when I press Deploy, a Pod is created instead of a serverless endpoint...

16 GB GPU availability almost always low

Hence workers are very frequently throttled and the Docker image gets pulled again and again.

Endpoint specific API Key for Runpod serverless endpoints

I am looking for a way to create a Runpod API Key that is specific to a Serverless endpoint. Is this possible?

generation-config vllm

Hey! I need help with a vLLM Quick Deploy setup. I'm getting this warning and can't override sampling parameters in API requests: WARNING 08-18 15:40:11 [config.py:1528] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm. How do I add the --generation-config vllm parameter when using Quick Deploy? I want to be able to set custom top_k, top_p, and temperature in my requests instead of being stuck with the model defaults. Thanks!...
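
On the request side, per-request sampling parameters can usually be sent alongside the prompt; whether they beat the Hugging Face generation config depends on that server-side flag. A hedged sketch, with the endpoint ID, API key, and the `sampling_params` field shape assumed from the runpod/worker-vllm input format:

```python
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",  # placeholder endpoint ID
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},  # placeholder key
    json={
        "input": {
            "prompt": "Hello!",
            # Assumed field name; these overrides may still lose to the model's
            # generation config unless the server runs with --generation-config vllm.
            "sampling_params": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
        }
    },
    timeout=120,
)
print(resp.json())
```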

New UI New Issue again lol

I'm the admin + owner on the GitHub side, but I get this in the new version of the UI... a bit frustrating

ComfyUI looks for checkpoint files in /workspace instead of /runpod-volume

I had a ComfyUI on-demand GPU Pod and now need to switch to a serverless endpoint. After setting up the endpoint, I can run some requests, but my ComfyUI workflow says there are missing checkpoints and LoRAs. My serverless workers are correctly connected to my 100 GB volume, so it seems the path is different in the two instances. How can I either: - move the files from /workspace/comfyUi/checkpoints to /runpod-volume/comfyUI/checkpoints? or...
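
One workaround, based on the fact that the same network volume mounts at /workspace on Pods but at /runpod-volume on serverless workers: create a symlink at container start so Pod-era absolute paths keep resolving. A sketch; adjust if /workspace already exists in your image:

```python
import os

# The volume's contents are identical; only the mount point differs between
# Pods (/workspace) and serverless workers (/runpod-volume).
if not os.path.exists("/workspace"):
    os.symlink("/runpod-volume", "/workspace")  # run before ComfyUI starts
```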

Unhealthy worker state in serverless endpoint: remote error: tls: bad record MAC

I'm using a Runpod serverless endpoint with a worker limit of 6. The endpoint performs well, except for one error: sometimes a worker becomes "unhealthy" and HTTP requests fail with: request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": remote error: tls: bad record MAC OR "request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync\": write tcp [2001:1c02:2c09:9100:7bab:2fba:21cc:6df1]:53732->[2606:4700::6812:9dd]:443: use of closed network connection" ...
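
Client-side, transient connection drops like these can be retried. This doesn't cure the unhealthy worker, but it smooths over one-off failures. A sketch using requests with urllib3 retries; the counts and backoff are arbitrary:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1.0,                # waits ~1 s, 2 s, 4 s between attempts
    status_forcelist=[502, 503, 504],
    allowed_methods=["POST"],          # opt in: POST is not retried by default
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# resp = session.post("https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync", json={...})
```

Note that retrying a POST to /runsync can double-submit a job if the first request actually reached the server, so idempotency on the handler side is worth considering.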

Job Dispatching Issue - Jobs Not Sent to Running Workers

A CPU HeadPod runs a Gradio frontend for a ComfyUI serverless backend. Serverless nodes start with a custom image; they run ComfyUI directly from the NetworkDrive, using the NetworkDrive venv (Python 3.12.3). My settings are configured for one worker per job. When I send two jobs in parallel from different PCs, the platform correctly scales to two running workers, but the job queue assigns both jobs to the same worker sequentially. The second worker remains running without receiving a job. The log on the second worker says that Comfy and the handler are ready. Configuration:...

Stuck at initializing

I changed to a bigger one because of the usual VRAM error, but then this happened (I used an L40S before and it wasn't like this).

So serverless death

Not sure what you guys did tonight, but the endpoint stopped passing jobs to my vLLM workers at about 3 pm my time. The backup was fine. I trashed all the workers, and still they would sit there ready, with jobs in the queue, and the jobs would just run until timeout. I had to trash the endpoint, redeploy, and add the new endpoint into rotation. So I figure you owe me at least $30 in credit ... not to mention my time ... (2 hrs to deploy and qual check)...

How to bake model checkpoints into Docker images for ComfyUI

I've seen in an earlier discussion that it is faster to bake models into the image thanks to Runpod FlashBoot: https://discordapp.com/channels/912829806415085598/1364592893867724910. Thanks to @gokuvonlange for explaining it! Does that mean I have to bake the ComfyUI git repo and all the other custom nodes and requirements too? And how does that make it faster than just using a network volume?...
Solution:
Yes, either bake those two into the image, or copy them to /workspace and then run from there.
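
If weights are baked into the image, the download can happen at build time so FlashBoot caches them with the container. A minimal sketch assuming huggingface_hub is installed during the build; the repo ID and target directory are placeholders:

```python
# download_models.py — invoked during the image build, e.g.:
#   RUN pip install huggingface_hub && python download_models.py
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="some-org/some-model",             # placeholder model repo
    local_dir="/comfyui/models/checkpoints",   # placeholder path inside the image
)
```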

Is it possible to set serverless endpoints to run for more than 24 hours?

I’m trying to configure my serverless endpoints so they can run for more than 24 hours. I set a policy for the job as described here: https://docs.runpod.io/serverless/endpoints/send-requests#execution-policies and also set the executionTimeout to a value higher than 24 hours when creating the endpoints. However, the jobs still exit exactly at the 24-hour mark. Is it possible to increase this limit, and if so, how?
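
For reference, the execution policy from the linked docs travels in the request body, with executionTimeout in milliseconds. A sketch of a request asking for 36 hours; whether the platform honors values past the observed 24-hour ceiling is exactly the open question:

```python
import requests

payload = {
    "input": {"prompt": "long-running job"},  # placeholder job input
    # executionTimeout is given in milliseconds per the execution-policies docs.
    "policy": {"executionTimeout": 36 * 60 * 60 * 1000},  # request 36 h
}
resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/run",  # placeholder endpoint ID
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json=payload,
    timeout=30,
)
print(resp.json())
```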

New load balancer serverless endpoint type questions

Hey team! In the past, I tried to use Runpod's queue-based serverless for my voice AI project, but the added job-queue latency made it impossible. Voice AI requires sub-200 ms inference latency, and the overhead was huge and unpredictable. That's fine for long-running jobs, but not for high-frequency / low-latency ones. This new load-balancer serverless endpoint type looks amazing and seems to solve a real feature gap in the GPU provider game. ...

Mounting network storage on a ComfyUI serverless endpoint

I have a network storage volume where I have downloaded all the models I need to generate images using the ComfyUI interface. All the models and custom models have been verified by running some workflows in a Pod instance, and images are generated as I intended. To avoid manual setup, I used the ComfyUI image on a serverless endpoint and generated an image with the default flux1-dev-fp8 model. Images were generated perfectly. Then I tried to generate images using my own workflow and, as expected, got a missing-custom-node issue. So I edited the endpoint and added the network storage from the advanced settings, but I'm still getting the same error about missing custom nodes. Can anyone guide me to solve this issue?...

Testing default "hello world" post with no response after 10 minutes

Attached a few pics of what I tried to do. I eventually cancelled it after a little under 10 minutes and never got a reply; it just stayed in the queue. I assume I'm doing something wrong. I left all endpoint settings at their defaults and set the Hugging Face URL to openai/gpt-oss-20b.
Solution:
oh change your image tag to
runpod/worker-v1-vllm:v2.8.0gptoss-cuda12.8.1
...