RunPod · Captain Barbossa

2 active workers on serverless endpoint keep rebooting

We have 2 active workers on a serverless endpoint. Sometimes the workers reboot at the same time for some reason, which causes major problems in our system.
2024-04-03T14:37:16Z create pod network
2024-04-03T14:37:16Z create container endpoint-image:1.2
2024-04-03T14:37:17Z start container
2024-04-03T15:27:23Z stop container
2024-04-03T15:27:24Z remove container
2024-04-03T15:27:24Z remove network
2024-04-03T15:27:30Z create pod network
2024-04-03T15:27:30Z create container endpoint-image:1.2
2024-04-03T15:27:30Z start container
2024-04-03T17:34:51Z stop container
2024-04-03T17:34:51Z remove container
2024-04-03T17:34:51Z remove network
Has anyone ever had this problem? How to fix it?
RunPod version: 1.3.0
Docker base image: python:3.11-slim
Our image version: 1.2
Madiator2011 · 17d ago
Your serverless worker needs a startup command that launches the handler; right now it looks like you're just running a plain Python Docker image.
Captain Barbossa · 17d ago
Our Docker image already has a start command. Should I add one anyway in our RunPod template?
justin · 17d ago
Not sure if you're saying you had this API working before and these two workers suddenly started doing this, or if you're saying you're trying to deploy serverless for the first time and running into this issue. If it's the latter, then as Madiator said, make sure you're specifically running your handler.py, which needs a runpod.serverless.start() call in the file to be triggered.
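Something like this is the minimal shape of a handler.py (just a sketch; the handler name and the echoed payload are placeholders):

import runpod

def handler(job):
    # The worker receives a job dict; the request payload is under "input".
    job_input = job["input"]
    # Replace this echo with your real workload.
    return {"echo": job_input}

runpod.serverless.start({"handler": handler})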
justin · 17d ago
GitHub: runpodWhisperx/Dockerfile at master · justinwlin/runpodWhisperx (RunPod WhisperX Docker container repo)
justin · 17d ago
Are you doing so?
justin · 17d ago
https://blog.runpod.io/serverless-create-a-basic-api/ (a RunPod blog post walking through the setup)
RunPod Blog: Serverless | Create a Custom Basic API (a tutorial on creating a basic worker and turning it into an API endpoint on the RunPod serverless platform).
Captain Barbossa · 15d ago
Thanks for the answer. Yes, I have a handler.py file with:
runpod.serverless.start({
    "handler": do_something,
    "return_aggregate_stream": True,
})
And in my Dockerfile, I have this command:
CMD ["python", "-u", "handler.py"]
CMD ["python", "-u", "handler.py"]
Everything normally works fine, but now every X hours the active worker reboots for no apparent reason.
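If it helps to see the reboots from inside the worker, one option (a sketch only; do_something here is a stand-in, not the real handler) is to log the process start time and return the worker's uptime with each job, so a fresh worker is easy to spot in the responses:

import logging
import time

import runpod

logging.basicConfig(level=logging.INFO)
WORKER_STARTED_AT = time.time()
logging.info("worker process started")

def do_something(job):
    # Returning the uptime makes it easy to correlate restarts with the endpoint logs.
    return {
        "worker_uptime_s": round(time.time() - WORKER_STARTED_AT, 1),
        "input": job["input"],
    }

runpod.serverless.start({
    "handler": do_something,
    "return_aggregate_stream": True,
})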
flash-singh · 15d ago
Active workers can shuffle; that's normal. There is no single dedicated active worker. It's a last-man-standing algorithm, meant to optimize for cost.
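Since no single worker is dedicated, one way to make the client side tolerant of shuffles is to treat the endpoint, not the worker, as the stable unit: submit with /run and poll /status until the job finishes. A rough sketch against the public serverless HTTP API (ENDPOINT_ID, the env var name, and the exact set of terminal status strings are assumptions to check against the docs):

import os
import time

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]             # placeholder: your RunPod API key
ENDPOINT = "https://api.runpod.ai/v2/ENDPOINT_ID"  # placeholder: your endpoint ID
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def run_job(payload, poll_s=2.0, timeout_s=600):
    # Submit asynchronously so the job is queued on the endpoint,
    # not tied to whichever worker happens to be up right now.
    job = requests.post(f"{ENDPOINT}/run", json={"input": payload}, headers=HEADERS).json()
    job_id = job["id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{ENDPOINT}/status/{job_id}", headers=HEADERS).json()
        if status["status"] == "COMPLETED":
            return status["output"]
        if status["status"] in ("FAILED", "CANCELLED", "TIMED_OUT"):
            raise RuntimeError(f"job {job_id} ended with {status['status']}")
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")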