Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

A40 Throttled very regularly!

I have a serverless endpoint with 3 GPUs that is being fully throttled very regularly. It is completely unusable for minutes at a time (see screenshot); requests are queued forever. This was the case yesterday and today, and it's far too unreliable...

SSH info via CLI

SSH access info is missing via the CLI (only in the case where the server does have an exposed TCP port). 'runpodctl get pod' doesn't show the SSH URL.

Cannot get a single endpoint to start

New to Runpod, but not new to LLMs and running our own inference. So far, every single vLLM template or vLLM worker that I have set up has failed. I use only the most basic settings, and have tried a wide range of GPU types with a variety of models (including the 'Quickstart' templates). Not a single worker has created an endpoint that works or serves the OpenAI API. I get 'Initializing' and 'Running', but then no response at all to any request. The logs don't seem to contain any information that helps me diagnose the issue. It might well be that I am missing something silly, or that something is amiss; I'm just not sure. I could do with some assistance (and some better documentation) if someone from Runpod can help?...
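A quick way to rule out client-side issues is to hit the worker's OpenAI-compatible route directly with the `openai` client. A minimal sketch, assuming the documented base-URL pattern for the vLLM worker; the endpoint ID, API key, and model name are placeholders you would swap for your own:

```python
# Minimal smoke test for a Runpod vLLM worker's OpenAI-compatible route.
# ENDPOINT_ID and RUNPOD_API_KEY are placeholders; the model must match the
# MODEL_NAME the worker was deployed with.
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"        # placeholder
RUNPOD_API_KEY = "your-runpod-api-key"  # placeholder

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
    api_key=RUNPOD_API_KEY,
)

resp = client.chat.completions.create(
    model="openchat/openchat-3.5-0106",  # replace with the model your worker serves
    messages=[{"role": "user", "content": "Say hello in one word."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
```

If this hangs while the worker shows "Running", the logs on the worker itself (not just the endpoint overview) are usually where vLLM prints the actual load error.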

All 16GB VRAM workers are throttled in EU-RO-1

I have a problem in EU-RO-1: all workers are constantly in a throttled state (xz94qta313qvxe, gu1belntnqrflq, and so on)...

worker-vllm: Always stops after 60 seconds of streaming

Serverless is giving me a weird issue where the OpenAI stream stops after 60 seconds, but the request keeps running in the deployed vLLM worker. This means I don't get all the output, which wastes compute. The reason I want it to go longer than 60 seconds is that I have a use case for generating very long outputs. I have had to resort to querying api.runpod.ai/v2 directly. That has the benefit of giving me the job_id and more control, but I would like to do this with the OpenAI API...
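For reference, the fallback the poster mentions looks roughly like this: submit the job with /run, then poll the job ID, so no client-side streaming timeout applies. A sketch assuming the standard api.runpod.ai/v2 job endpoints; the "input" payload shape depends on the worker and is only illustrative:

```python
# Sketch of the raw job API fallback: submit with /run, then poll /status/{id}.
# The payload under "input" is illustrative; adjust it to your worker's schema.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"        # placeholder
RUNPOD_API_KEY = "your-runpod-api-key"  # placeholder
headers = {"Authorization": f"Bearer {RUNPOD_API_KEY}"}

run = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=headers,
    json={"input": {"prompt": "Write a very long story.", "max_tokens": 4096}},
    timeout=30,
)
job_id = run.json()["id"]

# Poll until the job leaves the queued/in-progress states; no 60 s stream limit here.
while True:
    status = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers=headers,
        timeout=30,
    ).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)

print(status.get("output"))
```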

I want to deploy a serverless endpoint using Unsloth

Unsloth does bnb quantization, and I think it's better to load their model. I did training using Unsloth on a pod; I want to deploy it on a serverless endpoint and use the OpenAI client API.
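One common route is to merge the LoRA adapters into full weights and push them to Hugging Face, so a standard serverless vLLM endpoint can serve them through its OpenAI-compatible API. A hedged sketch, assuming Unsloth's merge/push helpers (check the Unsloth docs for the exact names and options); the checkpoint path and repo ID are placeholders:

```python
# Sketch: after training with Unsloth on a pod, merge the LoRA adapters into
# full-precision weights and push them to Hugging Face, so a vLLM serverless
# worker can serve them via its OpenAI-compatible API.
# Helper names assume recent Unsloth versions; verify against the Unsloth docs.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",  # placeholder path to your trained adapter
    max_seq_length=4096,
    load_in_4bit=True,
)

# Merge adapters and upload; the resulting repo can then be set as MODEL_NAME
# on the serverless vLLM endpoint.
model.push_to_hub_merged(
    "your-username/your-merged-model",  # placeholder repo id
    tokenizer,
    save_method="merged_16bit",
    token="hf_...",                      # your Hugging Face write token
)
```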

--trust-remote-code

I tried to deploy DeepSeek V3 on serverless vLLM and it shows this: "Uncaught exception | <class 'RuntimeError'>; Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.; <traceback object at 0x7fecd5a12700>;"...
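The error itself points at the fix: vLLM refuses to load a custom model architecture unless remote code is allowed. With the vLLM Python API it is a single flag, shown below; on the prebuilt Runpod vLLM worker the same setting is usually exposed as an environment variable on the endpoint template (check the worker-vllm README for the exact name). Note that DeepSeek V3 also needs far more VRAM than a single worker GPU, which is a separate issue.

```python
# The RuntimeError above is vLLM refusing to load a custom model config.
# With the vLLM Python API, the fix it suggests looks like this:
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # custom architecture not yet in transformers
    trust_remote_code=True,           # allow the repo's own modeling code to run
)
```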

Are there any "reserve for long" and "get it cheaper" payment options?

Hey, until now we have been testing the serverless endpoint with the vLLM configuration internally for development. Now we are looking to move it into production. We believe it would be beneficial to have a "reserve for long" option, such as a monthly reservation. Currently, the service charges on a per-second basis with a 30% discount on active workers, but we need to constantly monitor our balance to ensure it doesn't run out...

llvmpipe is being used instead of GPU

I am a bit lost. I am planning to run waifu2x or Real-ESRGAN, but the output says it's using llvmpipe and the process is very slow. How can I make my container use the GPU?...
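llvmpipe is Mesa's CPU software rasterizer, so seeing it means the work is not reaching the GPU at all. A quick check from inside the container can confirm whether CUDA is actually visible; if you are running the ncnn/Vulkan builds of these tools, the NVIDIA Vulkan ICD also has to be present in the image, which is a separate concern from the CUDA check below.

```python
# Quick check from inside the container: is the GPU actually visible?
# If torch.cuda.is_available() is False or nvidia-smi fails, the workload is
# falling back to CPU (llvmpipe for anything rendering through Mesa).
import subprocess

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

# nvidia-smi should list the GPU if the container runtime passed it through.
subprocess.run(["nvidia-smi"], check=False)
```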

1s delay between execution done and Finished message

I get almost one second of delay between a console message at the end of my handler and the "Finished" message. I am wondering why, and how to reduce this....
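One way to narrow this down is to timestamp the very last statement in the handler and compare it with the time on the "Finished" line in the worker logs, which separates time spent in your code from time spent in the SDK's result upload and job bookkeeping. A minimal sketch using the Runpod Python SDK's handler pattern; the handler body is a placeholder:

```python
# Minimal handler that logs a timestamp as its last action, to compare against
# the "Finished" line in the worker logs and see where the ~1 s goes
# (handler body vs. the SDK's result upload / job bookkeeping afterwards).
import time

import runpod


def handler(job):
    result = {"echo": job["input"]}  # placeholder workload
    print(f"handler returning at {time.time():.3f}")
    return result


runpod.serverless.start({"handler": handler})
```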

Serverless is Broken

Something is clearly broken. Delay times are around 2 minutes; even when the same worker gets back-to-back requests, it still takes 2 minutes. It's not a cold-start issue, because even my normal cold starts don't take longer than 15 seconds.

EU-RO-1 region serverless H100 GPU not available...

I use serverless in the EU-RO-1 region because I store my data in EU-RO-1. The problem is that there are no H100 GPUs in EU-RO-1. I created the job on the EU-RO-1 serverless API and have been waiting 6 hours, but the job status is always in queue. How can I solve this? I can't use another region because my data is saved in EU-RO-1...

Workers wrongfully reported as "idle"

When I call my serverless API endpoint, instead of serving my request, it keeps building the image while the worker is reported as "idle" and then "running" when called. So I cancel the request, but then the only way to make it stop (so it doesn't keep billing me) is to delete the worker...

"Throttled" and re-"Initializing" workers everywhere today

Is there some incident going on with serverless today? I have 30 workers that are all "Throttled"; other workers just disappear, and new ones initialize in their place all the time. Every request that normally takes 10 seconds is taking minutes... This is true in multiple locations too: most of my workers ended up in CA-MTL-1, but others in EU-* are showing the same problems...

How to run Flux+LoRA on a 24 GB GPU through code

Hi there, could anyone help me with how to run inference for Flux+LoRA using 24 GB GPUs? Thanks...
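A hedged diffusers sketch for this: FLUX.1-dev in bfloat16 is roughly at the edge of 24 GB, so CPU offload (or a quantized transformer) is usually needed alongside the LoRA. The LoRA repo name below is a placeholder:

```python
# Sketch: FLUX.1-dev + a LoRA on a single 24 GB GPU with diffusers.
# bf16 FLUX barely fits in 24 GB, so enable_model_cpu_offload() keeps only the
# active submodule on the GPU. The LoRA repo name is a placeholder.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.load_lora_weights("your-username/your-flux-lora")  # placeholder LoRA repo
pipe.enable_model_cpu_offload()  # trade speed for VRAM on a 24 GB card

image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("out.png")
```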

Queue waiting 5+ minutes with dozens of idle workers

Lately I often find the queue sitting there with items that have been queued for over 5 minutes, while there are dozens of idle workers. Why are the workers not picking up the queued items immediately? My application is in production, and this delay on requests for seemingly no reason is not really acceptable. Thanks...

Serverless H200?

Hi, when can we expect H200s to become available on serverless? My application could use the higher GPU memory.

Using compression encoding for serverless requests

Just wondering, is the serverless endpoint capable of receiving and processing compressed requests (e.g. zstd, gzip)?
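On the client side this is just a Content-Encoding header on a compressed body; whether the serverless gateway transparently decompresses it is exactly the open question here, but it is easy to test empirically. A sketch using gzip and requests; the endpoint ID and API key are placeholders:

```python
# Test whether the endpoint accepts a gzip-compressed request body.
# Client-side this is just Content-Encoding: gzip; whether the serverless
# gateway decompresses it before handing it to the worker is what's being asked.
import gzip
import json

import requests

ENDPOINT_ID = "your-endpoint-id"        # placeholder
RUNPOD_API_KEY = "your-runpod-api-key"  # placeholder

payload = {"input": {"prompt": "hello " * 1000}}  # something worth compressing
body = gzip.compress(json.dumps(payload).encode("utf-8"))

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    data=body,
    headers={
        "Authorization": f"Bearer {RUNPOD_API_KEY}",
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
    },
    timeout=60,
)
print(resp.status_code, resp.text[:500])
```

A 400-class error complaining about malformed JSON would suggest the gateway passes the compressed bytes through untouched.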