Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

no compatible serverless GPUs found while following tutorial steps

Hi, I'm trying to run orca-mini on serverless by following this tutorial: https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference. Whenever the download finishes, I get the error message below and then the ckpt download restarts.
```
2025-01-07 22:02:53.719[1vt59v6j5ku3yh][info][GIN] 2025/01/07 - 22:02:45 | 200 | 4.060412ms | 127.0.0.1 | HEAD "/"
2025-01-07 22:02:53.719[1vt59v6j5ku3yh][info]time=2025-01-07T22:02:45.001Z level=INFO source=types.go:105 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="4.4 GiB" available="4.2 GiB"
2025-01-07 22:02:53.719[1vt59v6j5ku3yh][info]time=2025-01-07T22:02:45.001Z level=INFO source=gpu.go:346 msg="no compatible GPUs were discovered"
...
```

How to monitor the LLM inference speed (generation token/s) with vLLM serverless endpoint?

I've gotten started with vLLM deployment; the configuration with my application was straightforward and worked as well. My main concern is how to monitor the inference speed on the dashboard or on the "Metrics" tab, because currently I have to look through the logs manually and find the average token-generation speed printed by vLLM. Any neat solution to this?...
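
In the absence of a dashboard metric, here is a client-side sketch that derives tokens/s from the OpenAI-compatible route the vLLM worker exposes. ENDPOINT_ID, the model name, and the prompt are placeholders, and the measured time includes queueing and network latency, so treat the number as a lower bound on raw generation speed.

```python
# Minimal sketch: measure generation speed from the OpenAI-compatible route of a
# Runpod vLLM endpoint. ENDPOINT_ID and the model are placeholders; the token
# count comes from the response's `usage` field.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # whatever model the endpoint serves
    messages=[{"role": "user", "content": "Explain serverless GPUs in one paragraph."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"~= {completion_tokens / elapsed:.1f} tokens/s (includes queue + network time)")
```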

When a worker is idle, do I pay for it?

I'm trying to understand how I am billed for Serverless usage. Thanks!

Error starting container on serverless endpoint

Hello, I'm having an issue with my serverless endpoint when it starts up. When the endpoint tries to initialize the container, we get an 'error response from daemon: failed to create task for container' error that cites an 'out of space' issue. I believe this is coming from Runpod's infra and not something we can resolve. Can you please advise how we can fix this error? It's causing delays for our customers....

How to Deploy vLLM Serverless Programmatically

Hello, how can we deploy serverless vLLM instances using an API rather than going through the UI?...
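
One option is the runpod Python SDK (pip install runpod), which wraps the platform's GraphQL API. The sketch below reflects one SDK/worker version; the helper names, parameters, GPU pool ID, and image tag are assumptions to verify against the current docs rather than a definitive recipe.

```python
# Rough sketch: create a serverless vLLM endpoint from code with the runpod SDK.
# Function names, defaults, and the worker image tag may differ in your version.
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Register the vLLM worker image as a serverless template.
template = runpod.create_template(
    name="vllm-qwen-14b",
    image_name="runpod/worker-v1-vllm:stable-cuda12.1.0",  # check the current tag
    is_serverless=True,
    env={"MODEL_NAME": "Qwen/Qwen2.5-14B-Instruct"},
    container_disk_in_gb=40,
)

# Create the endpoint on top of that template.
endpoint = runpod.create_endpoint(
    name="qwen-14b-endpoint",
    template_id=template["id"],
    gpu_ids="AMPERE_48",   # GPU pool identifier; adjust to what you need
    workers_min=0,
    workers_max=2,
    idle_timeout=5,
)
print(endpoint["id"])
```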

Recommended DC and Container Size Limits/Costs

Hello, I'm new to deploying web apps and currently using a persistent network drive along with serverless containers to generate images. My app requires at least 24 GB of RAM, and I've encountered some challenges in my current region (EU-RO-1): there aren't many A100 or H100 GPUs available, and most of the 4090 GPUs are throttled.

Recommended Data Centers: Are there specific geographic data centers you'd recommend for better GPU availability and performance?

Performance and Costs: Since my usage isn't constant, the containers often 'wake up' from idle or after being used by someone else. When this happens, the models (ComfyUI) have to load, leading to generation times ranging from 20 seconds to 3-4 minutes. I assume this delay occurs because the models are loading from a network-mounted drive rather than locally....

How is the architecture set up in the serverless (please give me a minute to explain myself)

We have been looking at LLM hosting services with autoscaling functionality to make sure we can meet demand -- but our main concern is the authentication architecture design.

The basic setup: based on my understanding, there are the following layers:
1. Application on the user's device (sends the request)...
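
One common pattern for the authentication layer, sketched here under assumptions (a FastAPI backend, the endpoint's /runsync route, and a placeholder verify_user_token): the client app never holds the Runpod API key; it authenticates to your own backend, and only that backend talks to the serverless endpoint.

```python
# Hypothetical middle layer: clients authenticate to *your* backend, and only the
# backend holds the Runpod API key and forwards jobs to the serverless endpoint.
# Names (verify_user_token, ENDPOINT_ID, DEMO_USER_TOKEN) are placeholders.
import os

import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
RUNPOD_URL = f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/runsync"


def verify_user_token(token: str) -> bool:
    # Replace with your real auth (JWT validation, session lookup, etc.).
    return token == os.environ.get("DEMO_USER_TOKEN")


@app.post("/generate")
async def generate(payload: dict, authorization: str = Header(...)):
    if not verify_user_token(authorization.removeprefix("Bearer ")):
        raise HTTPException(status_code=401, detail="invalid user token")

    async with httpx.AsyncClient(timeout=120) as client:
        r = await client.post(
            RUNPOD_URL,
            headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
            json={"input": payload},
        )
    return r.json()
```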

Best way to cache models with serverless?

Hello, I'm using a serverless endpoint to do image generation with FLUX dev. The model is 22 GB, which takes quite a long time to download, especially since some workers seem to be faster than others. I've been using a network volume as a cache, which greatly improves start-up time. However, doing this locks me into a particular region, which I believe makes some GPUs, like the A100, very rarely available....
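
One pattern worth considering, sketched with illustrative paths: keep the weights on the network volume, but copy them to the worker's local disk the first time a worker starts, so later loads read from local NVMe instead of the network mount. This does not remove the region lock; baking the 22 GB file into the Docker image is the option that does, at the cost of a much larger image.

```python
# Sketch: copy the checkpoint from the network volume to local disk once per
# worker. Paths are illustrative; on serverless the network volume is typically
# mounted at /runpod-volume.
import shutil
from pathlib import Path

NETWORK_COPY = Path("/runpod-volume/models/flux1-dev.safetensors")
LOCAL_COPY = Path("/models/flux1-dev.safetensors")


def ensure_local_model() -> Path:
    """Copy the checkpoint from the network volume to local disk if it isn't there yet."""
    if not LOCAL_COPY.exists():
        LOCAL_COPY.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(NETWORK_COPY, LOCAL_COPY)  # ~22 GB copy, paid once per cold worker
    return LOCAL_COPY


model_path = ensure_local_model()
# ...load the model from model_path in your handler...
```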

Job response not loading

Hi guys, it seems like it's been stuck loading for at least 1-2 minutes. Does anyone have any idea what's going on?

All of a Sudden, Error Logs

My serverless endpoint had been working fine up until yesterday. I woke up today and I'm getting these error logs. I'm pretty sure I didn't change anything in my code. Even when I send the request from the Runpod interface, I get the same error logs. Please, I need this fixed ASAP because I have people depending on the endpoint.
{ "delayTime": 6470, "error": "Error queuing workflow: <urlopen error [Errno 111] Connection refused>",...
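
Errno 111 usually means the handler tried to reach a local server (ComfyUI, judging by the "Error queuing workflow" message) before it was listening, or that server crashed on startup. A sketch of a readiness check the handler could run before queuing anything, assuming ComfyUI's default address of 127.0.0.1:8188; the URL and timeout are assumptions to adapt to your image.

```python
# Sketch of a readiness check before queuing work, assuming the worker runs a
# local ComfyUI server on its default port (127.0.0.1:8188).
import time
import urllib.error
import urllib.request


def wait_for_comfyui(url: str = "http://127.0.0.1:8188/", timeout_s: int = 180) -> None:
    """Poll the local ComfyUI HTTP server until it answers or we give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return  # server is up
        except (urllib.error.URLError, ConnectionError):
            time.sleep(1)  # not listening yet (Errno 111), keep waiting
    raise RuntimeError("ComfyUI never came up; check the container start command/logs")
```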

Serverless upscale workflow is resulting in black frames.

Hi, I am attempting to run a simple FLUX upscale request using the serverless service. I built my image with Docker, but it gives me black frames in the output. I am using FLUX dev with FP8 and the standard VAE. Any ideas?

Failed to load docker package.

```
2025-01-02T05:04:09Z error pulling image: Error response from daemon: Head "https://ghcr.io/v2/ammarft-ai/img-inpaint/manifests/1.31": denied: denied
```
It was working before...

Serverless SGLang - 128 max token limit problem.

I'm trying to use the subject template. I always have the same problem: the number of tokens in the answer is limited to 128, and I don't know how to change the configuration. I've tried Llama 3.2 3B and Mistral 7B, and the same problem happens with both. I've tried to set the following environment variables to values higher than 128, with no luck...
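
In case the cap is coming from the request side rather than the server config, one thing to try, sketched here under the assumption that the SGLang template exposes an OpenAI-compatible route the way the vLLM worker does: set max_tokens explicitly on every request instead of relying on defaults or environment variables.

```python
# Sketch: pass an explicit per-request token cap through an OpenAI-compatible
# route. ENDPOINT_ID and the model are placeholders; if your template does not
# expose this route, the equivalent knob lives in the request's sampling params.
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Write a 500-word story."}],
    max_tokens=1024,  # explicit per-request cap instead of relying on the default
)
print(resp.usage.completion_tokens)
```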

Too big requests for serverless infinity vector embedding cause errors

I keep running into "400 Bad Request" server errors for this service, and finally discovered that it was because my requests were too large and running into this constraint: https://github.com/runpod-workers/worker-infinity-embedding/blob/acd1a2a81714a14d77eedfe177231e27b18a48bd/src/utils.py#L14
```python
INPUT_STRING = StringConstraints(max_length=8192 * 15, strip_whitespace=True)
ITEMS_LIMIT = {
    "min_length": 1,
    ...
```
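
Until the worker relaxes those limits, one workaround is to chunk requests client-side so each batch stays under the constraints. A rough sketch: the per-string limit comes from the snippet above, while the maximum item count is truncated there, so it is left as a parameter to fill in from the worker's utils.py.

```python
# Rough client-side workaround: split the input list into batches that respect
# both the item-count and per-string length limits from the worker's utils.py.
from typing import Iterable, Iterator

MAX_CHARS = 8192 * 15  # per-string limit quoted in the snippet above


def batched(texts: Iterable[str], max_items: int) -> Iterator[list[str]]:
    """Yield batches that stay under the worker's request constraints."""
    batch: list[str] = []
    for text in texts:
        if len(text) > MAX_CHARS:
            raise ValueError("single input exceeds the worker's per-string limit")
        batch.append(text)
        if len(batch) == max_items:
            yield batch
            batch = []
    if batch:
        yield batch


# for chunk in batched(all_texts, max_items=...):  # one embedding request per chunk
#     send_embedding_request(chunk)
```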

Cannot send request to one endpoint

I have deployed 4 endpoints on Runpod, each doing different work. I can send requests to three of my endpoints, but for one of them I can't: I get a timeout error and the job status doesn't even change in the Runpod UI. I have tried deleting the endpoint and deploying it again, but the problem remains the same.

Settings to reduce delay time using sglang for 4bit quantized models?

I'm deploying the 4-bit AWQ quantized model casperhansen/llama-3.3-70b-instruct-awq. The delay time for parallel requests increases exponentially when using the SGLang template. What settings do I need to use to make sure the delay time stays manageable?...

How to make api calls to the endpoints with a System Prompt?

Hi everyone, I'm new to using Runpod's serverless endpoints for LLM calls. So far, I've only worked with OpenAI APIs, and we've built a product around GPT-4 models. Now we're planning to transition to open-source alternatives. I've successfully created serverless endpoints on Runpod for models like Qwen 14B Instruct and Llama 8B Instruct. I can get outputs from these models using both the Runpod SDK and the UI with JSON input like this:
```
{
...
```
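
For what it's worth, a sketch of where a system prompt can go when calling a vLLM-worker endpoint through the plain /runsync route; as far as I understand the worker's input schema, it accepts an OpenAI-style messages list, and ENDPOINT_ID, the model, and the sampling params here are placeholders.

```python
# Sketch: system prompt as the first entry of an OpenAI-style "messages" list in
# the job input sent to a vLLM-worker endpoint. Values are placeholders.
import os

import requests

url = f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/runsync"
payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that answers concisely."},
            {"role": "user", "content": "What is Runpod serverless?"},
        ],
        "sampling_params": {"max_tokens": 512, "temperature": 0.7},
    }
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    json=payload,
    timeout=120,
)
print(resp.json())
```

The vLLM worker also exposes an OpenAI-compatible route (base URL ending in /openai/v1), so existing OpenAI-client code can often be reused by changing only the base URL and API key.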

Serverless GPUs unavailable

The serverless GPUs that I'm using are always unavailable. Are there any plans to make them more available in the near future, or is there any other solution?

Where to find gateway level URL for serverless app

Hi folks, I have an app running on serverless infra and I want to use its HTTP endpoint. I am not able to find a static host that I can use to access the app. When I go into the Workers tab under the endpoint, I can see an option to open an HTTP-based session, but that seems to be associated with the worker and not with the endpoint itself. I tried accessing it by endpoint ID as well, but it did not work. Could anyone point me in the right direction, please?...
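
The worker-level HTTP session is indeed tied to a single worker; the stable, endpoint-level entry point is the gateway at api.runpod.ai/v2/<endpoint_id>, which speaks the job API rather than arbitrary HTTP. A sketch of the two common routes, with a placeholder input payload:

```python
# Sketch of the endpoint-level URL pattern: requests go to the queue-based
# gateway using the endpoint ID, not to an individual worker. ENDPOINT_ID and
# the {"prompt": "hello"} input are placeholders.
import os

import requests

BASE = f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

# Synchronous call: blocks until the handler returns (subject to a time limit).
sync = requests.post(f"{BASE}/runsync", headers=HEADERS, json={"input": {"prompt": "hello"}})
print(sync.json())

# Asynchronous call: submit a job, then poll for the result by job ID.
job = requests.post(f"{BASE}/run", headers=HEADERS, json={"input": {"prompt": "hello"}}).json()
status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS).json()
print(status)
```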

Attaching network volume with path inside pod

Hey guys, I have an app running inside a container and I want a path from my network drive to be mounted as a path inside the container. For instance, I have the path /app/models inside my container. I want to keep some models on my network drive and have the pod use them as /app/models. I'm not finding any solid documentation around this...
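
A sketch of one workaround, assuming the network volume shows up at its usual mount point (/runpod-volume on serverless workers, /workspace on pods): create a symlink at container start so the volume's models directory appears at the path the app expects.

```python
# Startup sketch: point /app/models at the network volume's models directory.
# Mount paths are assumptions; adjust to where your volume actually appears.
import os
from pathlib import Path

VOLUME_MODELS = Path("/runpod-volume/models")   # where the models live on the network drive
APP_MODELS = Path("/app/models")                # where the app expects to find them


def link_models() -> None:
    """Symlink the app's model path to the network volume's models directory."""
    if APP_MODELS.is_symlink() or APP_MODELS.exists():
        return  # already wired up (or baked into the image)
    APP_MODELS.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(VOLUME_MODELS, APP_MODELS, target_is_directory=True)


link_models()
```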