Runpod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

🔧|api-opensource

📡|instant-clusters

🗂|hub

What are the recommended system requirements for building the worker base image?

I was trying to build a custom runpod/worker-vllm:base-0.3.1-cuda${WORKER_CUDA_VERSION} image, but my 16 vCPU, 64 GB RAM server crashed. What are the recommended system specs for this purpose?

Is there documentation on how to architect runpod serverless?

Wondering if there are do's/don'ts for integrating RunPod serverless into a larger architecture. I assume it's not as snappy as Lambda, so I'd need to plan more aggressively around warm/cold starts? Also, is RunPod serverless ready for prod deployments, or is it more of a "use at your own risk" service?

Docker image cache

Hi there, I am quite new to RunPod so I could be wrong, but my Docker image is quite large, and before my serverless endpoint actually runs, it sits in the 'Initializing' state for quite a long time. Is there a way to cache this image across endpoints, or does this already happen? This is the first request I am making, so it might already be cached for this endpoint, but I'm not quite sure. I'd appreciate any help! I am not using a network volume/storage, so maybe that's also why....

What port do requests get sent on?

I want to do something a little custom: I don't want to use the serverless package, I want to use my own code, i.e. a Flask app running on gunicorn in my container... I need a flexible container that's decoupled from RunPod. Is this possible? (Presumably it is?) I'd imagine I'd need to define the /run, /runsync, etc. endpoints in my Flask app, right? And how is the port mapping between the host and the container handled? Do I define the env var RUNPOD_REALTIME_PORT in the template, so the host uses that as the host port, which is then the internal port used by the gunicorn server? ...
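
For reference, a minimal sketch of the kind of standalone Flask app being described; the /run and /runsync routes and the use of RUNPOD_REALTIME_PORT are assumptions taken from the question itself, not confirmed RunPod behaviour.

```python
# Hypothetical standalone Flask app exposing /run and /runsync, as the
# question proposes. Whether RunPod forwards traffic to these paths and to
# the port named by RUNPOD_REALTIME_PORT is an assumption, not confirmed.
import os
from flask import Flask, jsonify, request

app = Flask(__name__)

# Port the container listens on; RUNPOD_REALTIME_PORT is the variable the
# question assumes the template would use for the host/container mapping.
PORT = int(os.environ.get("RUNPOD_REALTIME_PORT", 8000))


@app.route("/run", methods=["POST"])
def run_async():
    payload = request.get_json(force=True)
    # ... enqueue `payload` for background processing here ...
    return jsonify({"id": "job-placeholder", "status": "IN_QUEUE"})


@app.route("/runsync", methods=["POST"])
def run_sync():
    payload = request.get_json(force=True)
    # ... process the job inline and return its result ...
    return jsonify({"status": "COMPLETED", "output": payload.get("input")})


if __name__ == "__main__":
    # In the container this would be served by gunicorn instead, e.g.
    #   gunicorn -b 0.0.0.0:$RUNPOD_REALTIME_PORT app:app
    app.run(host="0.0.0.0", port=PORT)
```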

Serverless: calculating capacity & ideal request count vs. queue delay values

How do you calculate whether a serverless worker is reaching its capacity, and what values should be set for request count? One of my serverless workers in production, which runs regular Oobabooga (not vLLM, so no concurrency), reached 110k requests per day yesterday without starting a new worker. From my observations, my context length is usually 1000 input tokens and 10-70 output tokens, which usually takes between 2-5 seconds per request. Even if we assume 1 second of execution time per request, it should only have been able to handle 86,400 requests per day. How is it able to handle more without increasing the worker count, especially when it takes 2-5 seconds per request?...
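
For reference, the back-of-the-envelope math behind the 86,400 figure (plain arithmetic, nothing RunPod-specific assumed):

```python
# Rough single-worker capacity for a worker that handles one request at a
# time (no concurrency), as described above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

for exec_seconds in (1, 2, 5):
    max_requests = SECONDS_PER_DAY / exec_seconds
    print(f"{exec_seconds}s per request -> ~{max_requests:,.0f} requests/day")

# 1 s -> ~86,400/day, 2 s -> ~43,200/day, 5 s -> ~17,280/day.
# 110k/day on one worker therefore only adds up if requests overlap
# (extra workers or some form of concurrency), which is the puzzle here.
```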

RunPod worker automatic1111 just responds COMPLETED and doesn't return anything

I'm using the worker from https://github.com/ashleykleynhans/runpod-worker-a1111/tree/main, latest version, so it should fix the "error" dict problem. For some requests, it just returns the status COMPLETED, and the RunPod logs show something like the image below. I have tried creating a Pod mounted on that volume and running the local request with test_input.json, and everything works normally. Can you help me with this, @ashleyk?
Solution:
Hi @Merrell, I think the problem is related to the size of the response. If I set the batch size smaller or reduce the image size, everything works fine.

Serverless GPU low capacity

I'm finding it almost impossible to use the serverless endpoints, as there are no GPUs available. I have a network volume in Romania and therefore need GPUs in the same region. It spends ages throttled ("throttled: Waiting for GPU to become available."), and when one eventually comes online it goes offline again soon after, even with 'Idle timeout' set to an hour. Is this a common state, or is it just unusually busy right now? Does RunPod have plans to increase capacity, considering it's in such high demand?...

Runpod queue not processing

Hey, I deployed a serverless application using Kandinsky 2.1. I hit the run endpoint and the request was queued; checking the status by ID still shows in_queue. Can anyone help resolve this issue?...
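
For reference, a minimal polling sketch against the serverless HTTP API; the endpoint ID, API key handling, and payload below are placeholders:

```python
# Submit a job and poll its status via the /run and /status endpoints.
# ENDPOINT_ID and the input payload are placeholders.
import os
import time

import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

job = requests.post(f"{BASE_URL}/run",
                    json={"input": {"prompt": "test"}},
                    headers=HEADERS).json()

# Poll until the job leaves the queued/in-progress states.
while True:
    status = requests.get(f"{BASE_URL}/status/{job['id']}", headers=HEADERS).json()
    if status.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
        break
    time.sleep(2)

print(status)
```

A job that never leaves IN_QUEUE generally means no worker is picking it up (for example, workers throttled or max workers set to 0) rather than a problem with the request itself.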

cudaGetDeviceCount() Error

When importing the exllamav2 library I got this error, which left the serverless worker stuck, repeatedly printing the error stack trace. The error is:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
What is this error about? Is it the library, or is there something wrong with the worker hardware I've chosen? And why doesn't the error stop the worker? It kept running for 5 minutes without me even realizing....
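
For reference, one way to surface this kind of CUDA driver/runtime mismatch at container start instead of letting the worker loop; a sketch that assumes PyTorch is available in the image:

```python
# Fail fast if CUDA isn't actually usable, rather than letting the handler
# keep restarting on import errors.
import sys

import torch

try:
    torch.cuda.init()  # force CUDA initialization so problems show up here
    count = torch.cuda.device_count()
    if count == 0:
        raise RuntimeError("no CUDA devices visible")
    print(f"CUDA OK: {count} device(s)")
except RuntimeError as exc:
    # Error 804 ("forward compatibility was attempted on non supported HW")
    # usually points at a host driver / CUDA runtime mismatch rather than
    # at the application code.
    print(f"CUDA check failed: {exc}", file=sys.stderr)
    sys.exit(1)
```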

VLLM Error

```
2024-02-28T21:49:45.485567449Z The above exception was the direct cause of the following exception:
2024-02-28T21:49:45.485572406Z
2024-02-28T21:49:45.485576486Z Traceback (most recent call last):
2024-02-28T21:49:45.485580679Z   File "/handler.py", line 8, in <module>
2024-02-28T21:49:45.485636156Z     vllm_engine = vLLMEngine()
```
...

Getting docker error

Random error; no changes to the image, and it was working just a minute ago.

worker-vllm build fails

I am getting the following error when building the new worker-vllm image with my model.
```
 => ERROR [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false if [ -f /run/secrets/HF_TOKEN ]; then export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); fi && if [ -n "Pate  10.5s
------
```
...

Serverless not returning error

The following code:
```
def handler(event):
    try:
        logger.info('validating input')
```
...
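For reference, a sketch of a handler that surfaces errors rather than completing silently; it assumes the convention that returning a dict with an "error" key (or letting the exception propagate) is what marks the job as failed, so check this against the runpod SDK version in use:

```python
# Handler sketch that reports failures back to the caller instead of
# swallowing them. The {"error": ...} convention is an assumption to verify
# against the runpod SDK docs for the installed version.
import logging
import traceback

import runpod

logger = logging.getLogger(__name__)


def handler(event):
    try:
        logger.info("validating input")
        job_input = event["input"]
        # ... actual work here ...
        return {"output": job_input}
    except Exception:
        # Returning an "error" key (instead of a bare None or an empty dict)
        # is what lets the caller see that the job failed and why.
        return {"error": traceback.format_exc()}


runpod.serverless.start({"handler": handler})
```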

Getting 404 error when making request to serverless endpoint

I'm using the Python SDK and pasting the endpoint ID into the provided example code. Here is the full response:
```
ClientResponseError: Status: 404, Message: Not Found, Headers: <CIMultiDictProxy('Date': 'Tue, 27 Feb 2024 17:15:53 GMT', 'Content-Type': 'text/plain', 'Content-Length': '18', 'Connection': 'keep-alive', 'CF-Cache-Status': 'DYNAMIC', 'Set-Cookie': '__cflb=02DiuEDmJ1gNRaog7Bucmr44gWmZj9b8U2YPJr23J6Q9a; SameSite=None; Secure; path=/; expires=Wed, 28-Feb-24 16:15:53 GMT; HttpOnly', 'Server': 'cloudflare', 'CF-RAY': '85c2128cc898429e-EWR')>
```
...
Solution:
Is your API key correct?
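
For reference, roughly the shape of the documented Python SDK call, with the endpoint ID and payload as placeholders; it's worth double-checking both the endpoint ID and the API key against the console, and the exact method names against the SDK version installed:

```python
# Sketch of a basic SDK call. "ENDPOINT_ID" and the input payload are
# placeholders; the api_key must be a valid key from the RunPod console.
import os

import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]

endpoint = runpod.Endpoint("ENDPOINT_ID")
run_request = endpoint.run({"prompt": "test"})

print(run_request.status())
```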

out of memory error

GPU out of memory error

Out of memory errors on 48 GB GPU which didn't happen before

Some requests fail due to OOM, but the endpoint uses a 48 GB GPU and is definitely capable of processing these requests.

Is it possible to run fully on sync?

All the async functions and webhooks are such a pain; can we just run fully sync?
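
For reference, a blocking-call sketch using the /runsync endpoint instead of /run plus webhooks or polling; the endpoint ID and payload are placeholders, and /runsync is meant for relatively short-running jobs, so long jobs still need the async flow:

```python
# Single blocking request via /runsync: the HTTP call holds until the job
# finishes (or times out). Endpoint ID and payload are placeholders.
import os

import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json={"input": {"prompt": "test"}},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
print(resp.json())
```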

How to keep worker memory after completing request?

Hi! I'm running serverless for a GAN model. I want to preload the model into memory on the first request and reuse it for subsequent requests without loading the model again (as long as the container/pod is still alive). When I sent the 2nd request, the idle worker showed "clean up worker" and loaded the model again. How can I prevent the "clean up worker" and keep the model in memory (when the container has not been removed)?...
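
For reference, the usual pattern here is to load the model once at module import time (outside the handler), so a warm worker reuses it across requests; a rough sketch, with the GAN loader left as a placeholder:

```python
# Load the model once when the worker process starts, not inside the handler.
# A warm worker keeps this process alive between requests, so MODEL stays in
# memory; only a full scale-down / cold start reloads it.
import runpod


def load_gan_model():
    # Placeholder: replace with the actual GAN loading code.
    ...


MODEL = load_gan_model()  # runs once per worker process


def handler(event):
    job_input = event["input"]
    # ... run inference with MODEL on job_input ...
    return {"output": "placeholder"}


runpod.serverless.start({"handler": handler})
```

If RunPod scales the worker down entirely, the next cold start will reload the model; that part can't be prevented from inside the handler, only mitigated with a longer idle timeout or active (always-on) workers.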

Failed to get job. | Error Type: ClientConnectorError

Hey all, I'm starting to receive this kind of error:
```
2024-02-26T21:49:02.442274586Z connectionpool.py :872 2024-02-26 21:49:02,441 Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd718d52aa0>: Failed to resolve 'api.runpod.ai' ([Errno -3] Temporary failure in name resolution)")': /v2/d7n1ceeuq4swlp/ping/xkqvldjqlccihw?gpu=NVIDIA+A40&runpod_version=1.6.0
2024-02-26T21:49:12.459986454Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientConnectorError | Error Message: Cannot connect to host api.runpod.ai:443 ssl:default [Temporary failure in name resolution]", "level": "ERROR"}
```
It seems like the system keeps retrying to get the job for 40s, and this time interval is included in the serverless billing time. What is going on? Thanks!...

Help: Serverless Mixtral OutOfMemory Error

I can't get Mixtral-8x7B-Instruct to run on serverless using the vLLM RunPod worker, neither with the model from Mistral nor with any of the quantized models. Settings I'm using: GPU: 48 GB (also tried 80 GB); Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0...