RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Support for https://huggingface.co/deepseek-ai/DeepSeek-V3?

Would it be possible to get support for https://huggingface.co/deepseek-ai/DeepSeek-V3? It's currently the best open-source model for coding.

Serverless Idle Timeout is not working

One of my serverless endpoints is not respecting the idle timeout setting. Instead of staying active for 300 seconds, it turns idle after 5. I redeployed the endpoint and it worked for a while, but today, again without any changes, the endpoint turns idle after 5 seconds even though it's set to 300....

Flashboot meaning?

Is there any documentation on what it does under the hood? I am asking because of this: "FlashBoot reduces majority cold-starts down to 2s, even for LLMs. Make sure to test output quality before enabling." ...

Distributed inference with Llama 3.2 3B on 8 GPUs with tensor parallelism + Disaggregated serving

Hi. I need help setting up a vLLM serverless pod with disaggregated serving and distributed inference for a Llama 3.2 3B model. The setup would be something like 1 worker with 8 total GPUs, where 4 GPUs handle 1 prefill task and 4 GPUs handle 1 decode task. Can experts help me set this up using vLLM on RunPod serverless? I am going for this approach because I want super low latency, and I think sharding the model for prefill and decode separately with tensor parallelism will help me achieve this....
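For context, a minimal sketch of just the tensor-parallel half of this (sharding the model across 4 GPUs inside a RunPod serverless handler) could look like the snippet below. The model ID and `tensor_parallel_size=4` are illustrative assumptions, and this does not cover the disaggregated prefill/decode wiring, which needs vLLM's experimental KV-transfer features on top.

```python
# Sketch: tensor-parallel vLLM engine inside a RunPod serverless handler.
# Assumes the worker image ships vLLM and the runpod SDK; the model ID is a
# placeholder. This shards the weights across 4 GPUs but does NOT implement
# disaggregated prefill/decode serving.
import runpod
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model ID
    tensor_parallel_size=4,                    # shard weights across 4 GPUs
)

def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=job["input"].get("max_tokens", 256))
    outputs = llm.generate([prompt], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```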

job timed out after 1 retries

Hello! Getting this on every job now on the 31py4h4d9ytybu serverless endpoint. My logs have zero messages or any indication of where this is happening; from the outside it looks as if they are totally paused or non-responsive. This silently hung work for over an hour. I'm on runpod 1.7.4. This is currently having significant impacts on production work, without any clear remediation (see screenshots for no logs for many, many minutes despite work happening constantly, and errors on every job). Wou...

Can't see Billing beyond July

Hi, I'm trying to get my billing invoices, but I don't see anything beyond six months. Can someone help?...

Linking runpod-volume subfolder doesn't work

Hey, I've been trying to create a serverless RunPod worker that has a network volume attached to it. I want to link specific folders from the network volume into the worker. To do so, I'm running the following bash file. ```bash...
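The original script is truncated above, but a rough sketch of the same idea (creating links from the attached volume into local paths at worker start-up) might look like this in Python; the folder names here are hypothetical examples, not the poster's actual layout.

```python
# Sketch: link specific subfolders of the attached network volume
# (mounted at /runpod-volume on serverless) into local paths before the
# handler starts. Folder names are hypothetical.
import os

LINKS = {
    "/runpod-volume/models": "/workspace/models",
    "/runpod-volume/loras": "/workspace/loras",
}

for src, dst in LINKS.items():
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if not os.path.islink(dst) and not os.path.exists(dst):
        os.symlink(src, dst)  # create the link only if the target path is free
```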

ComfyUI Image quantity / batch size issue when sending request to serverless endpoint

I'm not able to generate multiple images from a prompt/request to the endpoint using a ComfyUI workflow. We have added a variable for the "batch_size" value in our workflow, but it only seems to generate one image regardless of the batch_size we give it. This is our GitHub repo for the RunPod worker: https://github.com/sozanski1988/runpod-worker-comfyui/ ...
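For reference, a request along these lines is where batch_size usually matters: in a ComfyUI workflow it lives on the EmptyLatentImage node's inputs. The node ID, the "workflow" input key, and the endpoint ID below are assumptions; the exact schema depends on the worker in that repo. If only one image still comes back, it's worth checking that the worker returns every file the save node writes, not just the first.

```python
# Sketch: a /run request whose workflow asks for 4 latents via the
# EmptyLatentImage node. Node ID "5", the "workflow" key, and the endpoint ID
# are hypothetical and depend on the specific ComfyUI worker's input schema.
import requests

ENDPOINT = "https://api.runpod.ai/v2/<endpoint_id>/run"   # placeholder
headers = {"Authorization": "Bearer <RUNPOD_API_KEY>"}     # placeholder

workflow = {
    "5": {  # hypothetical EmptyLatentImage node
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 4},
    },
    # ... remaining nodes of the workflow ...
}

resp = requests.post(ENDPOINT, headers=headers, json={"input": {"workflow": workflow}})
print(resp.json())
```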

Some basic confusion about the `handlers`

Hi everyone! 👋 I'm currently using RunPod's serverless option to deploy an LLM. Here's my setup:
- I've deployed vLLM with a serverless endpoint (runpod.io/v2/<endpoint>/run)....
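For anyone else confused by the term: on a custom serverless image, the handler is just the Python function the runpod SDK calls for each queued job, roughly as sketched below. With the prebuilt vLLM worker the handler already exists inside the image, so you only send requests to /run or /runsync.

```python
# Minimal illustration of a queue-based serverless handler. A request to
# /run or /runsync queues a job; the SDK passes it to `handler` as
# job["input"], and whatever the function returns becomes the job's output.
import runpod

def handler(job):
    prompt = job["input"].get("prompt", "")
    # ... call the model here ...
    return {"echo": prompt}

runpod.serverless.start({"handler": handler})
```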

Next.js app deploy on RunPod

Dear RunPod community, I need to deploy our Next.js app on RunPod, similar to how it works on Vercel. In our Next.js app, I handle the frontend and also create backend APIs for MongoDB interactions. Additionally, I need to run scheduled jobs. Which hosting provider would you recommend for this setup? Also, can we do that with RunPod?

Optimizing VLLM for serverless

Hello. I am trying to optimize vLLM for a serverless endpoint. The default vLLM settings are blazing fast for cached workers (~1s) but unusable with cold-start initialization (40-60 or more seconds). Forcing eager mode removes the CUDA graph capture and helps push the initialization cold starts down to ~20s, at the price of slower generation. Beyond that, I feel stuck on what could be improved, since currently the longest tasks are creating the LLM engine and vLLM's memory-profiling stage; each takes up to 6 seconds. I am attaching the complete log file with time comments from such a job. I am wondering if anyone has found the settings sweet spot for the fastest cold starts and acceptable generation speed, or if there's a way to remove the initialization part for newly spawned workers. I have already researched many things, from automatic caching on a network volume (which didn't work at all, and when using bitsandbytes models no cache is saved) to snapshotting and trying to share the initialized state between workers (which is probably not possible)....
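For reference, these are the engine arguments most often tried for this trade-off; the values below are only examples, and the actual impact varies by model and GPU, so treat this as a sketch rather than a recommended configuration.

```python
# Sketch of vLLM engine arguments commonly tuned for faster cold starts.
# Values are illustrative; the model ID is a placeholder.
from vllm import LLM

llm = LLM(
    model="<your-model>",          # placeholder
    enforce_eager=True,            # skip CUDA graph capture (faster init, slower decode)
    max_model_len=4096,            # a smaller context shortens memory profiling
    gpu_memory_utilization=0.90,
)
```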

no compatible serverless GPUs found while following tutorial steps

Hi, I'm trying to run orca-mini on serverless by following this tutorial [https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference]. Whenever the download finishes, I get the error message below and then the ckpt download restarts.
```
2025-01-07 22:02:53.719[1vt59v6j5ku3yh][info][GIN] 2025/01/07 - 22:02:45 | 200 | 4.060412ms | 127.0.0.1 | HEAD "/"
2025-01-07 22:02:53.719[1vt59v6j5ku3yh][info]time=2025-01-07T22:02:45.001Z level=INFO source=types.go:105 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="4.4 GiB" available="4.2 GiB"
2025-01-07 22:02:53.719[1vt59v6j5ku3yh][info]time=2025-01-07T22:02:45.001Z level=INFO source=gpu.go:346 msg="no compatible GPUs were discovered"
```
...

How to monitor the LLM inference speed (generation token/s) with vLLM serverless endpoint?

I have gotten started with vLLM deployment; the configuration with my application was straightforward and it worked as well. My main concern is how to monitor the speed of inference on the dashboard or in the "metrics" tab, because currently I have to look manually in the logs and find the average token generation speed reported by vLLM. Any neat solution to this?...
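One possible workaround, sketched below under the assumption of a custom handler rather than the prebuilt vLLM worker image, is to measure generation speed in the handler itself and return it with the result, so the number can be read from the job output instead of scraped from logs.

```python
# Sketch: compute tokens/s inside the handler and return it with the result.
# Assumes a custom vLLM handler; the model ID is a placeholder.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="<your-model>")  # placeholder

def generate_with_speed(prompt: str) -> dict:
    start = time.perf_counter()
    out = llm.generate([prompt], SamplingParams(max_tokens=256))[0]
    elapsed = time.perf_counter() - start
    n_tokens = len(out.outputs[0].token_ids)
    return {
        "text": out.outputs[0].text,
        "generation_tokens": n_tokens,
        "tokens_per_second": round(n_tokens / elapsed, 2),
    }
```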

When a worker is idle, do I pay for it?

I'm trying to understand how I am billed for the Serverless usage. Thanks!

Error starting container on serverless endpoint

Hello, I'm having an issue on my serverless endpoint when it starts up. When our endpoint tries to initialize the container, we get an 'error response from daemon' saying it failed to create the task for the container, citing an 'out of space' issue. I believe this is coming from RunPod's infra and not something we can resolve. Can you please advise how we can fix this error? It's causing delays for our customers....

How to Deploy vLLM Serverless Using a Programming Language

Hello, how can we deploy serverless vLLM instances using an API rather than going to the UI?...
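A rough sketch of one programmatic route is below. It assumes the runpod Python SDK's template/endpoint creation helpers; the exact parameter names, image tag, and GPU pool identifier are assumptions to verify against the SDK reference before use.

```python
# Sketch (not verified against current SDK docs): creating a vLLM worker
# template and a serverless endpoint from code instead of the UI.
# All parameter values are placeholders/assumptions.
import runpod

runpod.api_key = "<RUNPOD_API_KEY>"  # placeholder

template = runpod.create_template(
    name="vllm-worker",
    image_name="runpod/worker-v1-vllm:stable-cuda12.1.0",  # assumed image tag
    is_serverless=True,
)

endpoint = runpod.create_endpoint(
    name="my-vllm-endpoint",
    template_id=template["id"],
    gpu_ids="AMPERE_24",      # assumed GPU pool identifier
    workers_min=0,
    workers_max=2,
)
print(endpoint)
```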

Recommended DC and Container Size Limits/Costs

Hello, I'm new to deploying web apps and currently using a persistent network drive along with serverless containers to generate images. My app requires at least 24GB of RAM, and I've encountered some challenges in my current region (EU-RO-1): there aren't many A100 or H100 GPUs available, and most of the 4090 GPUs are throttled.
Recommended data centers: Are there specific geographic data centers you'd recommend for better GPU availability and performance?
Performance and costs: Since my usage isn't constant, the containers often 'wake up' from idle or after being used by someone else. When this happens, the models (ComfyUI) have to load, leading to generation times ranging from 20 seconds to 3-4 minutes. I assume this delay occurs because the models are loading from a network-mounted drive rather than locally....

How is the architecture set up in the serverless (please give me a minute to explain myself)

We have been looking at LLM hosting services with autoscaling functionality to make sure we meet demand, but our main concern is the authentication architecture design. The basic setup: based on my understanding, there are the following layers:
1. Application on the user's device (sends request)....
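The list is truncated above, but the usual pattern for the auth layer is sketched below: the client app authenticates to your own backend, and only that backend holds the RunPod API key and calls the endpoint. The endpoint ID, route name, and token check are placeholders, not a prescribed design.

```python
# Sketch: a thin backend proxy that keeps the RunPod API key server-side.
# Endpoint ID, app-token check, and route are hypothetical placeholders.
import os
import requests
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
RUNPOD_URL = "https://api.runpod.ai/v2/<endpoint_id>/runsync"  # placeholder
RUNPOD_KEY = os.environ["RUNPOD_API_KEY"]                      # never shipped to clients

@app.post("/generate")
def generate(payload: dict, authorization: str = Header(default="")):
    if authorization != "Bearer <your-app-token>":  # stand-in for real user auth
        raise HTTPException(status_code=401)
    r = requests.post(
        RUNPOD_URL,
        headers={"Authorization": f"Bearer {RUNPOD_KEY}"},
        json={"input": payload},
        timeout=120,
    )
    return r.json()
```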

Best way to cache models with serverless?

Hello, I'm using a serverless endpoint to do image generation with Flux dev. The model is 22 GB, which takes quite a long time to download, especially since some workers seem to be faster than others. I've been using a network volume as a cache, which greatly improves start-up time. However, doing this locks me into a particular region, which I believe makes some GPUs, like the A100, very rarely available....
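One commonly suggested alternative, sketched below, is to bake the weights into the Docker image at build time (e.g. calling a script like this from the Dockerfile), so workers in any region start with the model on local disk; the trade-off is a much larger image and a longer first pull per worker, but no region lock. The repo ID and local path are placeholders.

```python
# Sketch: pre-download model weights into the image at build time so the
# worker does not depend on a region-locked network volume. Repo ID and
# destination path are placeholders; gated repos may need an HF token.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="black-forest-labs/FLUX.1-dev",      # assumed model repo
    local_dir="/models/flux-dev",
    token=os.environ.get("HF_TOKEN"),            # only needed for gated repos
)
```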