Runpod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


How to get "system log" in serverless

For regular GPU pods I can see both the "pod log" and the "system log", but for serverless I can only see the "pod log", if I'm not mistaken. I can't tell whether the image pull is taking too long, so I can't really debug a worker that fails to start.

Default Execution Timeout for Faster-Whisper API

I'm using the Faster-Whisper API for real-time transcription. The first API call is slow, so I'm thinking of sending a test file beforehand to speed up processing for my actual files. Is this a good idea? Also, how long does the active endpoint stay available?

runpod serverless start.sh issue

Hello, I'm having an issue with my Dockerfile and start.sh script. At the end of my Dockerfile I have ENTRYPOINT ["./start.sh"], which runs the start.sh script. The script starts the ComfyUI server and then runs rp_handler at the end. When the API is called for the first time, the Docker image initializes and start.sh successfully starts the ComfyUI server. On the second API call, however, it seems to skip the contents of start.sh and runs rp_handler.py directly, causing issues. Here's the content of my start.sh file: ...
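A likely explanation is that serverless workers stay warm between jobs: the container (and therefore start.sh) only runs on a cold start, while later jobs are dispatched straight to the already-running rp_handler. One way around this is to make the handler module itself responsible for (re)starting the long-lived server. This is a minimal sketch, not the poster's actual setup; the launch command is a placeholder standing in for the real ComfyUI invocation:

```python
import subprocess

_proc = None  # module-level state survives across jobs on a warm worker


def ensure_server(cmd):
    """Start the long-lived server once; relaunch only if it has died.

    On a warm serverless worker, start.sh does not run again between
    jobs, so any setup that must be re-checked per job belongs here,
    inside the handler module.
    """
    global _proc
    if _proc is None or _proc.poll() is not None:
        _proc = subprocess.Popen(cmd)
    return _proc


def handler(job):
    # Placeholder command; in practice this would launch ComfyUI.
    ensure_server(["python3", "-c", "import time; time.sleep(60)"])
    return {"status": "ok"}
```

With this pattern, start.sh can shrink to just launching rp_handler, and the server stays managed no matter how many jobs hit the same warm worker.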

Production emergency

We use serverless to run our ChatGPT and Llama 3 bots, but they all stopped working after our account inadvertently ran out of credits. We have 14 workers deployed, but only 4 are currently running. Does anyone know how to restart/reset them all to full steam?...

Unable to register a new account using a Google Groups email

Hello team, I need assistance figuring out why I am unable to register using my Gmail address created through Google Groups. I have verified the email's validity by sending a message to it from another account, but the verification link never appears when registering on Runpod.

Delay Time

Hello, I'm wondering if these delay times are normal? If not, what should I do?...

Can't set up a1111 on serverless: "Service not ready" error

Hi guys, I am wondering if anyone has managed to set up a1111 on serverless RunPod without network volumes. I am following the blog post https://blog.runpod.io/custom-models-with-serverless-and-automatic-stable-diffusion/ with https://github.com/runpod-workers/worker-a1111, but I can't seem to get it to work. I managed to build the image and create the template and endpoint, but I get a "Service not ready yet. Retrying..." error from a Request Exception. I also mounted the image on a container locally and tr...

Warming up workers

Hi. I've been noticing some substantial delay times and I'd like to know if there's a built-in tool in RunPod that lets me "warm up" my workers before users need them. I'm aware of the idle timeout, which can help in some situations, but if possible I'd like to keep costs to a minimum. If there's no built-in solution, I can implement the warm-up logic myself; I just wanted to check first, since I'm running faster-whisper models and can sometimes see more than 10 s of delay time, which is too much....
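As far as I know, beyond active workers and the idle timeout there is no dedicated pre-warm trigger, so the DIY version is just a cheap request sent ahead of real traffic. A minimal sketch, assuming a {"warmup": true} input field that your own handler must recognize and return early from after loading the model (the field name is an assumption, not a RunPod convention):

```python
import json
import urllib.request


def build_warmup_request(endpoint_id, api_key):
    """Build a /runsync call whose only purpose is to force a cold
    worker to spin up and load the model before real traffic arrives.

    The {"warmup": True} input is a hypothetical convention: the
    handler must detect it and return immediately after model load.
    """
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = json.dumps({"input": {"warmup": True}}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Sending it requires a live endpoint and a real API key:
# urllib.request.urlopen(build_warmup_request("your-endpoint-id", "YOUR_RUNPOD_API_KEY"))
```

Fired a minute or two before expected traffic, this keeps a worker warm without paying for a permanently active one.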

container create: signal: killed?

Hey all, I have a task stuck in the booting state, and this is the error message I got: 2024-05-16T18:41:04Z create container stevenynie/dreamweaver:20240516111003 2024-05-16T18:42:04Z error creating container: container: create: container create: signal: killed 2024-05-16T18:42:24Z create container stevenynie/dreamweaver:20240516111003...

Serverless GPU Pricing

Hello. I chose a 24 GiB configuration with the following GPUs: L4, RTX A5000, and RTX 3090. I ran some benchmarks and noticed that using only the RTX 3090 is better for my use case (faster execution times, and so on). Is the base pricing the same for all three of these GPUs? That is, supposing for a moment that delay times and execution times were identical across all GPUs, would the billing come out the same regardless of which one I choose?...

Model load time affected when pods are running on the same server

I was trying to debug the latency on my test pods, and I found that pods running on the same physical machine lag badly on I/O access. After profiling, I got these results. Example:...

How to expose my own HTTP port and keep a custom HTTP response?

I want to use my own function and image directly, but I cannot find any guides on defining a function without the runpod-python SDK. For example, how do I let RunPod know which port to access? How do I get RunPod to return my HTTP response directly?
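To my understanding, serverless does not route arbitrary HTTP ports to workers at all: jobs arrive through the platform queue, and the API response body is built from whatever the handler returns. So the closest thing to a custom HTTP response is shaping the handler's return value. A minimal sketch (the greeting logic is purely illustrative):

```python
def handler(job):
    """The dict returned here becomes the "output" field of the
    endpoint's API response; there is no port to expose yourself."""
    name = job.get("input", {}).get("name", "world")
    return {"greeting": f"hello {name}"}


if __name__ == "__main__":
    # The SDK polls RunPod's job queue and invokes handler() per job.
    import runpod
    runpod.serverless.start({"handler": handler})
```

For a service that genuinely needs its own HTTP server on an open port, a pod (rather than serverless) is the fit, since pods let you expose ports directly.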

confusing serverless endpoint issue

After a successful call through run or runsync, I get my handler's success JSON. After about 5 seconds, the successful response JSON turns into this: Status Code 404, "error": "request does not exist" ...

"Error saving template. Please contact support or try again later." when using vLLM Quick Deploy

I managed to launch two endpoints successfully. The third endpoint displays the error above when I click "Deploy".
Solution:
Hey Gabriel! I believe I've found the issue. We use the model name as a prefix for the template name, but that causes problems when the model name is too long (for example llama-3-8b-instruct). A fix will be rolling out in the next few days. Sorry for the inconvenience!...

Why does my serverless task stay stuck in the queue state and never execute?

The task has been stuck in the queue. The serverless Endpoint id is 9ufpu7wjug1mqc and the task id is a73ccb31-4ad7-4b2a-bed6-bfe3c7a16c06-e1.

Serverless vLLM doesn't work and gives no error message

I've spent a few hours trying to deploy a serverless vLLM endpoint according to the instructions at https://docs.runpod.io/serverless/workers/vllm/get-started. The endpoint doesn't work, and it gives no error message or any other indication of what's wrong. All the requests I send just stay "in queue" and their status never changes. The logs show an initialization message and some warnings, but no errors, and the requests aren't shown in the logs. The endpoint id is o13ejihy2p9hi8....

Hey all, why does this worker stay alive after the task is completed?

Here is the request ID: d7aa74ec-5b1f-4af9-8bf4-d8211740019b-u1. According to the logs there is no crash at all, but the worker status is still green on the RP dashboard. Thanks!...

Failed to return job results

My serverless endpoint is timing out after the client-configured timeout of 30 seconds, even though the request is processed in under 10 seconds. I am using the Python client (runpod==1.4.2). This happens only on non-active workers. Below is one sample request from the logs. I have submitted more details in support request 3922.

```
- sync-c4927049-99df-480e-89d5-c95d599653bd-u1
- 2024-05-13T04:43:46.246143796Z {"requestId": "sync-c4927049-99df-480e-89d5-c95d599653bd-u1", "message": "Started.", "level": "INFO"}
- 2024-05-13T04:43:54.355899018Z {"requestId": "sync-c4927049-99df-480e-89d5-c95d599653bd-u1", "message": "Failed to return job results. | 404, message='Not Found', url=URL('https://api.runpod.ai/v2/[REDACTED]/job-done/w481rezhgny06k/sync-c4927049-99df-480e-89d5-c95d599653bd-u1?gpu=NVIDIA+RTX+6000+Ada+Generation')", "level": "ERROR"}
```
...
Solution:
This is solved. I incorrectly assumed from the docs that TTL sets the maximum delayTime, but it looks like it actually covers delayTime + executionTime....
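The takeaway from that solution can be sketched as a small payload builder. This assumes the policy field names (executionTimeout, ttl, both in milliseconds) as I recall them from RunPod's execution-policy docs; treat the exact names as an assumption to verify:

```python
def build_job_payload(job_input, execution_timeout_ms, ttl_ms):
    """Attach an execution policy to a /run request body.

    ttl bounds the job's whole lifetime (delayTime + executionTime),
    not the delay alone, so it must be at least executionTimeout plus
    whatever queue delay you are willing to tolerate.
    """
    if ttl_ms < execution_timeout_ms:
        raise ValueError("ttl must cover executionTimeout plus queue delay")
    return {
        "input": job_input,
        "policy": {"executionTimeout": execution_timeout_ms, "ttl": ttl_ms},
    }
```

Setting ttl below executionTimeout (the misreading described above) makes jobs expire before they are even allowed to finish.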

dockerless issue

I am using RunPod serverless. To cut down on redeployments, I store the server executables in a network volume, as described in https://blog.runpod.io/runpod-serverless-no-docker-stress/. I have been using this method for several months, but recently, even when I modify the rp_handler.py file on the network volume, the changes are not reflected immediately and appear to be cached. As a result, I am currently unable to use it properly. Have there been any recent changes related to this?...

Errors while downloading images from S3 using presigned URLs

Hey! I am having occasional errors when I try to download images from an S3 bucket using presigned URLs inside my serverless worker. The worker is processing around 100K requests per hour, and only around 10K of them (or rather fewer) fail due to an S3 download error. The error message is (I replaced keys and bucket names here): urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='bucketname.s3-accelerate.amazonaws.com', port=443): Max retries exceeded with url: /bucket/f0bb45e2.png?AWSAccessKeyId=AAWSAccessKeyId&Signature=Signature%3D&Expires=1715540566 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)'))) I am not sure this is the right place to ask, but maybe some of you have faced something similar. I also have an additional API layer between the RunPod worker and the client application which also uses S3 buckets for upload and download, and it has never had any issues downloading from presigned URLs, which is why I'm starting to think the issue may be within my RunPod workers...
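An UNEXPECTED_EOF_WHILE_READING at that request volume usually looks like a transient TLS/connection failure rather than a bad URL, and the common mitigation is to retry the download with backoff on a fresh connection. A minimal sketch (not a fix for the underlying cause); note that ssl.SSLError subclasses OSError, so one except clause covers it:

```python
import ssl  # ssl.SSLError (the UNEXPECTED_EOF case) is an OSError subclass
import time
import urllib.request


def retry(fetch, attempts=4, backoff=0.5, retriable=(OSError,)):
    """Call fetch(), retrying transient failures with exponential backoff.

    The default retriable tuple (OSError,) covers ssl.SSLError and
    ordinary connection errors alike.
    """
    for i in range(attempts):
        try:
            return fetch()
        except retriable:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * 2 ** i)


def download(presigned_url):
    # Each attempt opens a fresh connection, avoiding reuse of a
    # half-closed socket from a previous request.
    return retry(lambda: urllib.request.urlopen(presigned_url, timeout=30).read())
```

If the failures cluster on the s3-accelerate endpoint, it may also be worth comparing the error rate against the regular regional S3 hostname.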