RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Kicked Worker

Is there a webhook for the event that a worker is kicked? Or is there only the /health call, where we need to track the change in requests since the last /health call (tracking the change in failed requests)?
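
For reference, a minimal Python sketch of the /health-polling approach described above. It assumes the endpoint health route at api.runpod.ai/v2/{endpoint_id}/health returns per-job counters; the field names ("jobs", "failed") are assumptions to verify, and the environment variable names are just placeholders for this sketch.

import os
import time

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # placeholder env vars for this sketch
API_KEY = os.environ["RUNPOD_API_KEY"]
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"

last_failed = None
while True:
    resp = requests.get(HEALTH_URL, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10)
    resp.raise_for_status()
    jobs = resp.json().get("jobs", {})  # assumed shape: {"completed": ..., "failed": ..., ...}
    failed = jobs.get("failed", 0)
    if last_failed is not None and failed > last_failed:
        print(f"failed jobs increased by {failed - last_failed} since the last /health check")
    last_failed = failed
    time.sleep(30)  # poll interval; there is no push/webhook in this approach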

Possible to access the ComfyUI interface in serverless to fix custom node requirements?

Hi RunPod addicts! I have a functional ComfyUI install running in a Pod that I want to replicate in serverless. My ComfyUI install is built for a specific workflow requiring 18 custom nodes. ...
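
There is no official recipe implied here, but one common pattern is to install the custom nodes at image build time instead of trying to open the ComfyUI web UI inside a serverless worker. A rough Python sketch of such a build step, assuming the custom nodes are git repos cloned into ComfyUI/custom_nodes and each ships a requirements.txt; the repo list and install path are placeholders:

import subprocess
import sys
from pathlib import Path

# Placeholder list -- replace with the 18 repos the workflow actually needs.
CUSTOM_NODES = [
    "https://github.com/ltdrdata/ComfyUI-Manager",
]
NODES_DIR = Path("/ComfyUI/custom_nodes")  # assumed ComfyUI install location inside the image

for repo in CUSTOM_NODES:
    target = NODES_DIR / repo.rstrip("/").split("/")[-1]
    if not target.exists():
        subprocess.run(["git", "clone", "--depth", "1", repo, str(target)], check=True)
    reqs = target / "requirements.txt"
    if reqs.exists():
        subprocess.run([sys.executable, "-m", "pip", "install", "-r", str(reqs)], check=True)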

How to truly see the status of an endpoint worker?

I'm trying out the vLLM serverless endpoints and am running into a lot of trouble. I was able to get responses from a running worker for a little while, then the worker went idle (as expected) and I tried sending a fresh request. That request has been stuck for minutes now and there's no sign that the worker is even starting up. The RunPod UI says the worker is "running", but there's nothing in the logs for the past 9 minutes (the last log line was from the previous worker exiting). My latest requests have been stuck for about 7 minutes each. How do I see the status of an endpoint worker if there's nothing in the logs and nothing in the telemetry? What does "running" mean if there are no logs or telemetry?...
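
Until a worker produces logs, one way to at least see what the job itself is doing is to poll the job status route of the endpoint API. A minimal sketch, assuming the usual /status/{job_id} route and a job id returned by a previous /run call; the response fields are assumptions:

import os

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # placeholder env vars for this sketch
API_KEY = os.environ["RUNPOD_API_KEY"]
JOB_ID = "your-job-id"  # returned by the /run call

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}"
resp = requests.get(url, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10)
resp.raise_for_status()
data = resp.json()
# Fields like status, delayTime and executionTime are assumed -- print everything to be safe.
print(data)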

How do I calculate the cost of my last execution on a serverless GPU?

For example, if one GPU type costs $0.00016 and the others cost $0.00019, how do I know which GPU the serverless endpoint actually picked after the request has completed? Also, is there an easy way to just get the cost of the last runsync request instead of calculating it manually?
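
For the manual calculation, a back-of-the-envelope sketch in Python. It assumes per-second billing, that delayTime and executionTime in the runsync/status response are in milliseconds, and that both count toward the bill; all three assumptions should be checked against the actual invoice.

# Rough cost estimate from a /runsync or /status response.
PRICE_PER_SECOND = 0.00016  # per-second price of the GPU tier that ran the job

response = {                # example payload shape (assumed)
    "delayTime": 1200,      # assumed milliseconds
    "executionTime": 4300,  # assumed milliseconds
}

billed_seconds = (response["delayTime"] + response["executionTime"]) / 1000.0
print(f"approx. cost: ${billed_seconds * PRICE_PER_SECOND:.6f}")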

Serverless deepseek-ai/DeepSeek-R1 setup?

How can I configure a serverless endpoint for deepseek-ai/DeepSeek-R1?

What is the best way to access more A100 and H100 GPUs?

Flux is about 25 GB. If I download the model to a network volume, I can only access GPUs in that region, and every time I check, A100 and H100 availability is LOW in all regions. If I instead bake Flux into the container while building the Docker image, rather than using a network volume, every new pod has to pull a ~25 GB image. Could anyone please help me with this...
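
If you do go the bake-it-into-the-image route, the download can happen once at build time so it never sits in the cold-start path (at the cost of a large image that each new worker has to pull). A sketch using huggingface_hub, with the model id and target directory as placeholders:

# Run during `docker build` (e.g. RUN python download_model.py) so the weights
# ship inside the image instead of being fetched when a worker starts.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="black-forest-labs/FLUX.1-dev",  # placeholder model id
    local_dir="/models/flux",                # placeholder path baked into the image
)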

Guidance on Mitigating Cold Start Delays in Serverless Inference

We are experiencing delays during cold starts of our serverless endpoint used for inference of a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model trained by us), which are fetched via the Hugging Face package within the Python code. We are exploring possible solutions and need guidance on feasibility and best practices.
Additional context:
- The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays.
- The serverless platform is used for inference as part of a production system requiring low latency...
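
Two common mitigations are baking the weights into the image at build time (see the Flux sketch above) or caching them on a network volume so that only the very first cold start pays the download. A sketch of the caching variant, assuming the network volume is mounted at /runpod-volume and the custom Whisper weights live in a private Hugging Face repo; the paths and repo name are placeholders:

import os
from pathlib import Path

from huggingface_hub import snapshot_download

CACHE_DIR = Path("/runpod-volume/models/whisper-custom")  # assumed volume mount path

def get_model_path() -> str:
    # Download once; later cold starts find the files already on the volume.
    if not any(CACHE_DIR.glob("*")):
        snapshot_download(
            repo_id="your-org/whisper-custom",  # placeholder private repo
            local_dir=str(CACHE_DIR),
            token=os.environ.get("HF_TOKEN"),
        )
    return str(CACHE_DIR)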

A40 Throttled very regularly!

I have a serverless endpoint with 3 GPUs that gets fully throttled very regularly. It is completely unusable for minutes at a time (see screenshot); requests are queued forever. It was like this yesterday and again today; it's far too unreliable...

SSH info via cli

SSH access info is missing from the CLI (it's only available when the pod has an exposed TCP port). ‘runpodctl get pod’ doesn't include the SSH connection URL.

Cannot get a single endpoint to start

New to RunPod, but not new to LLMs and running our own inference. So far, every single vLLM template or vLLM worker that I have set up has failed. I use only the most basic settings and have tried across a wide range of GPU types, with a variety of models (including the 'Quickstart' templates). Not a single worker has created an endpoint that works or serves the OpenAI API endpoint. I get 'Initializing' and 'Running', but then no response at all to any request. The logs don't seem to have any information that helps me diagnose the issue. It might well be that I am missing something silly, or that there is something amiss; I'm just not sure. I could do with some assistance (and some better documentation) if there is someone from RunPod who can help?...
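
As a sanity check that the endpoint plumbing works independently of the OpenAI-compatible route, a minimal /runsync request against the native endpoint API can help. The input payload shape is whatever the worker's handler expects, so the prompt field below is only an assumption:

import os

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # placeholder env vars for this sketch
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Say hello."}},  # assumed handler input shape
    timeout=600,
)
print(resp.status_code, resp.json())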

All 16GB VRAM workers are throttled in EU-RO-1

I have a problem in EU-RO-1: all workers are constantly in a throttled state (xz94qta313qvxe, gu1belntnqrflq and so on)...

worker-vllm: Always stops after 60 seconds of streaming

Serverless is giving me this weird issue where the OpenAI stream stops after 60 seconds, but the request keeps running in the deployed vLLM worker. This results in not getting all the output and wasting compute. The reason I want it to run longer than 60 seconds is that I have a use case for generating very long outputs. I have had to resort to querying api.runpod.ai/v2 directly. That has the benefit of getting the job_id and being able to do more things, but I would like to do this with the OpenAI API...
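
For reference, the native-API workaround mentioned above looks roughly like this: submit with /run to get a job id, then poll /stream/{job_id} for incremental output. The chunk response shape ("status", "stream", "output") is an assumption to verify:

import os
import time

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # placeholder env vars for this sketch
API_KEY = os.environ["RUNPOD_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "Write a very long story."}}, timeout=30).json()
job_id = job["id"]

while True:
    chunk = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS, timeout=120).json()
    for item in chunk.get("stream", []):  # assumed field name for incremental output
        print(item.get("output", ""), end="", flush=True)
    if chunk.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.5)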

I want to deploy a serverless endpoint using Unsloth

Unsloth does bnb quantization, and I think it's better to load the model their way. I did training using Unsloth on a Pod; I want to deploy it on a serverless endpoint and access it through the OpenAI client API.
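
Assuming the merged model is served with the vLLM worker's OpenAI-compatible route, the client side would look roughly like this; the /openai/v1 base URL pattern and the model name are assumptions to check against the worker's documentation:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['RUNPOD_ENDPOINT_ID']}/openai/v1",
)

resp = client.chat.completions.create(
    model="your-org/your-unsloth-merged-model",  # placeholder HF repo id
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)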

--trust-remote-code

I tried to run DeepSeek V3 on serverless vLLM and it shows this: "Uncaught exception | <class 'RuntimeError'>; Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.; <traceback object at 0x7fecd5a12700>;"...
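
The error message itself names the fix when you control the vLLM invocation directly: pass trust_remote_code=True. On a prebuilt serverless vLLM worker this is usually surfaced as an environment variable rather than a CLI flag, so check the worker's README for the exact name. In plain vLLM Python it would look like:

# Minimal vLLM sketch showing the setting the error refers to; whether the
# hardware can actually hold DeepSeek V3 is a separate question entirely.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    trust_remote_code=True,  # lets transformers load the model's custom code
)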

Are there any "reserve for long" and "get it cheaper" payment options?

Hey, until now we have been testing the serverless endpoint with the vLLM configuration internally for development. Now we are looking to move it into production. We believe it would be beneficial to have a "reserve for long" option, such as a monthly reservation. Currently, the service charges on a per-second basis with a 30% discount on active workers, but we need to constantly monitor our balance to ensure it doesn't run out...

llvmpipe is being used instead of GPU

I am a bit lost. I am planning on running waifu2x or Real-ESRGAN, but the output says it's using llvmpipe and the process is very slow. How can I make my container use the GPU?...
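
llvmpipe is Mesa's CPU software renderer, so the first thing to rule out is whether the container can see the GPU at all. A quick diagnostic sketch (assumes nvidia-smi is present in the image and, for the second check, a CUDA build of PyTorch):

import shutil
import subprocess

# 1) Is the NVIDIA driver visible inside the container?
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found -- the container likely has no GPU attached")

# 2) Optional: is CUDA usable from Python? (only meaningful if torch is installed)
try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    pass

If the GPU is visible but the tool still reports llvmpipe, the likely culprit is a missing NVIDIA Vulkan ICD in the image, since the ncnn builds of waifu2x and Real-ESRGAN render via Vulkan rather than CUDA.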

1s delay between execution done and Finished message

I get almost one second of delay between a console message at the end of my handler and the "Finished" message. I am wondering why, and how to reduce this....
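
To narrow down where that second goes, it can help to timestamp the very last statement of the handler and compare it with the job's reported timings. A minimal sketch using the runpod Python SDK's standard handler entrypoint; the inline work is a placeholder for the actual inference code:

import time

import runpod

def handler(job):
    started = time.time()
    result = {"echo": job["input"]}  # placeholder for the actual inference code
    # Log right before returning so the gap to the "Finished" message is measurable.
    print(f"handler returning after {time.time() - started:.3f}s (epoch {time.time():.3f})")
    return result

runpod.serverless.start({"handler": handler})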

Serverless is Broken

Something is clearly broken. Delay times are around 2 minutes; even when the same worker gets back-to-back requests, it still takes 2 minutes. It's not a cold-start issue, because even my normal cold starts don't take longer than 15 seconds.