GPU Pod was down all night

Hi, we just woke up to a production issue: all of our APIs were down because our pod shut down and apparently restarted for some reason. When we looked at it, we saw a "Maintenance Scheduled" notice for next week. Can someone help us understand what the issue was and why the pod went down by itself? Pod ID: clxu7lem3ph9xu
13 Replies
xPaghkman (4mo ago)
@Madiator2011 Could you help us with this issue?
Madiator2011 (4mo ago)
Usually, even when a pod restarts, it should start the last running app automatically. Make sure to check the pod logs.
xPaghkman (4mo ago)
We have 3 different services running in the pod, and when it restarted, all of them had to be restarted. Also, we cannot see what happened in the pod or why it restarted; the only thing we see now is "Maintenance Scheduled".
Madiator2011 (4mo ago)
Maintenance means the pod is going to be down for upgrades or fixes.
xPaghkman (4mo ago)
What do you suggest we do in cases where the pod restarts for some reason or the machine has problems? When it restarts, how can we automate bringing all the services back up? Is there any API available on RunPod where we can see whether the pod is down or active, or where we can trigger something?
Madiator2011 (4mo ago)
Make a bash script to run all services on pod start. I'm not sure what you are running, so I can't tell.
xPaghkman (4mo ago)
We are running sd-web-ui (for its API), text generation web ui for the LLM, and our custom FastAPI service on another port, etc.
Madiator2011 (4mo ago)
All in a single pod?
xPaghkman (4mo ago)
2x4090
Madiator2011 (4mo ago)
I mean a single pod with 2x4090, or two pods with a single 4090 each?
xPaghkman (4mo ago)
Single pod with 2x4090
Madiator2011 (4mo ago)
You will probably need to make your own custom startup script like this: https://github.com/runpod/containers/blob/main/container-template/start.sh
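A minimal sketch of the "start everything on pod start" idea, written in Python rather than bash like the linked start.sh. It starts each service as a child process and restarts any that exit. Everything in `SERVICES` is a placeholder and not taken from the thread: the paths, module names, flags, and ports are assumptions to be replaced with the real launch commands for the three services.

```python
import subprocess
import sys
import time

# Placeholder commands (assumptions, not from the thread). Replace each list
# with the real launch command, and swap sys.executable for the service's own
# interpreter or venv if they differ.
SERVICES = {
    "sd-web-ui": [sys.executable, "/workspace/stable-diffusion-webui/launch.py", "--api"],
    "text-generation-webui": [sys.executable, "/workspace/text-generation-webui/server.py", "--api"],
    "fastapi": [sys.executable, "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"],
}


def start_all(services):
    """Start every service and return a dict of name -> Popen handle."""
    return {name: subprocess.Popen(cmd) for name, cmd in services.items()}


def restart_exited(services, procs):
    """Restart any service whose process has exited; return the restarted names."""
    restarted = []
    for name, proc in procs.items():
        # poll() returns None while the process is still running.
        if proc.poll() is not None:
            procs[name] = subprocess.Popen(services[name])
            restarted.append(name)
    return restarted


def supervise(services, poll_seconds=5.0):
    """Start all services, then loop forever restarting any that exit."""
    procs = start_all(services)
    while True:
        time.sleep(poll_seconds)
        for name in restart_exited(services, procs):
            print(f"restarted {name}", flush=True)
```

Calling `supervise(SERVICES)` from the script that the pod template runs as its container start command would bring all three services up after a restart. A service that crashes immediately will be restarted on every poll, so a real setup would probably want a backoff and log files per service.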
xPaghkman (4mo ago)
I see, makes sense, I will have a look
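The thread does not name a RunPod API for checking pod status, so the monitoring part of the question stays open. One workaround that needs no platform API is to probe each service's port and alert when it stops answering. The sketch below uses only the Python standard library. The port numbers are assumptions (common defaults, not confirmed in the thread), and `host` would be the pod's reachable address when run from outside the pod.

```python
import socket

# Assumed ports (not confirmed in the thread); adjust to the real ones.
PORTS = {"sd-web-ui": 7860, "text-generation-webui": 5000, "fastapi": 8000}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(ports, host="127.0.0.1"):
    """Return a dict of service name -> True/False for each port probed."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Running `check_services(PORTS)` on a schedule from a machine outside the pod would show whether each service is reachable. An open port only shows that something is listening, not that the service is healthy, so an HTTP request to a known endpoint is a stronger check where one exists.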