RunPod 4mo ago
JorgeG

Worker is very frequently killed and replaced

I have an endpoint configured with 1 active worker and 2 max workers (24GB PRO). The requests are handled by an asynchronous handler. For some unknown reason (I can't see any errors or other failures in the logs), the worker restarts every 30 min to 2 h (sometimes less, sometimes more). It is the same worker according to the ID, but the container is restarted. What could be the reason for this? The system logs look like this:

2024-03-01T17:08:05Z start container
2024-03-01T17:18:07Z stop container
2024-03-01T17:18:08Z remove container
2024-03-01T17:18:08Z remove network
2024-03-01T18:20:22Z create pod network
2024-03-01T18:20:22Z create container XXXXX
2024-03-01T18:20:22Z start container
2024-03-01T18:30:17Z stop container
2024-03-01T18:30:17Z remove container
2024-03-01T18:30:17Z remove network
2024-03-01T18:38:56Z create pod network
2024-03-01T18:38:56Z create container XXXXX
2024-03-01T18:38:56Z start container
2024-03-01T18:57:44Z stop container
2024-03-01T18:57:45Z remove container
2024-03-01T18:57:45Z remove network
2024-03-01T19:04:17Z create pod network
2024-03-01T19:04:17Z create container XXXXXX
2024-03-01T19:04:17Z start container
2024-03-01T19:19:58Z stop container
2024-03-01T19:20:00Z remove container
2024-03-01T19:20:00Z remove network
2024-03-01T19:20:24Z create pod network
2024-03-01T19:20:24Z create container XXXXXXXX
2024-03-01T19:20:26Z start container
2024-03-01T19:21:05Z stop container
2024-03-01T19:21:07Z remove container
2024-03-01T19:21:07Z remove network
2024-03-01T19:21:34Z create pod network
2024-03-01T19:21:34Z create container XXXXXXXX
2024-03-01T19:21:35Z start container
3 Replies
flash-singh 4mo ago
What's the endpoint ID?
JorgeG 4mo ago
1hdfqkkbw41swp. Thanks for looking into it.
flash-singh 4mo ago
Those logs are normal; this happens when your workers scale up and down.
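For anyone who wants to check the scale up/down explanation against their own system logs, one option is to measure how long each container actually ran between a "start container" and the following "stop container" event. The snippet below is only a rough sketch: it uses the start/stop timestamps from the original post, and the helper name `container_uptimes` is made up for illustration (it is not part of any RunPod tooling).

```python
from datetime import datetime, timedelta

# Start/stop events copied from the system logs in the original post.
LOG = """\
2024-03-01T17:08:05Z start container
2024-03-01T17:18:07Z stop container
2024-03-01T18:20:22Z start container
2024-03-01T18:30:17Z stop container
2024-03-01T18:38:56Z start container
2024-03-01T18:57:44Z stop container
2024-03-01T19:04:17Z start container
2024-03-01T19:19:58Z stop container
2024-03-01T19:20:26Z start container
2024-03-01T19:21:05Z stop container
2024-03-01T19:21:35Z start container
"""


def container_uptimes(log_text):
    """Pair each 'start container' with the next 'stop container'.

    Hypothetical helper: returns a list of timedelta objects, one per
    completed start/stop cycle. A trailing start with no stop is ignored.
    """
    uptimes = []
    started_at = None
    for line in log_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        stamp, event = parts
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        if event == "start container":
            started_at = when
        elif event == "stop container" and started_at is not None:
            uptimes.append(when - started_at)
            started_at = None
    return uptimes


uptimes = container_uptimes(LOG)
for uptime in uptimes:
    print(uptime)
```

If the uptimes line up with periods of low or no traffic on the endpoint, that is consistent with workers being scaled down and later brought back, rather than with a crash inside the handler.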