RunPod•16mo ago

GPU Pod was down all night

Hi, we just woke up to a production issue: all of our APIs were down because our pod shut down and apparently restarted for some reason. When we looked at it, we saw a "Maintenance Scheduled" notice for next week. Can someone help us understand what the issue was and why the pod went down by itself? Pod ID: clxu7lem3ph9xu

13 Replies

ErcanOP•16mo ago

@Madiator2011 Could you help us with this issue?

Madiator2011•16mo ago

Usually, even when a pod restarts, it should start the last running app automatically. Make sure to check the pod logs.

ErcanOP•16mo ago

We have 3 different services running in the pod, and when it restarted, all of them had to be restarted. Also, we cannot see what happened in the pod or why it restarted. The only thing we see now is "Maintenance Scheduled".

Madiator2011•16mo ago

Maintenance means the pod is going to be down for upgrades or fixes

ErcanOP•16mo ago

What do you suggest we do when the pod restarts for some reason or the machine has problems? When it restarts, how can we automatically bring all the services back up? Is there any API available on RunPod where we can see whether a pod is down or active, or can we trigger something?
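The thread does not settle whether RunPod offers a status API for this. Independently of that, one way to notice an outage is to probe each service's HTTP port from a machine outside the pod. The following is only a minimal sketch under assumptions: the `SERVICES` names and URLs are hypothetical placeholders, and "up" is taken to mean "the port returns any HTTP response".

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical endpoints (assumption): replace with the real URLs of your services.
SERVICES = {
    "service-a": "http://127.0.0.1:8001/",
    "service-b": "http://127.0.0.1:8002/",
}


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (e.g. 404), so it is running.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, or a broken reply: treat as down.
        return False


def check_all(services: dict[str, str]) -> dict[str, bool]:
    """Probe every service once and map each name to its up/down state."""
    return {name: is_up(url) for name, url in services.items()}
```

Running `check_all(SERVICES)` on a schedule (cron, for example) and alerting when any value is `False` would have flagged the overnight outage, though it does not explain why the pod restarted.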

Madiator2011•16mo ago

Make a bash script to run all services on pod start. I'm not sure what you are running, so I can't tell.

ErcanOP•16mo ago

We are running sd-web-ui (for its API), text generation web ui for the LLM, and our custom FastAPI service on another port, etc.

Madiator2011•16mo ago

All in a single pod?

ErcanOP•16mo ago

2x4090

Madiator2011•16mo ago

I mean a single pod with 2x4090, or two pods with a single 4090 each?

ErcanOP•16mo ago

single pod with 2x4090

Madiator2011•16mo ago

You will probably need to make your own custom startup script, like this: https://github.com/runpod/containers/blob/main/container-template/start.sh
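The linked template is a bash script. As an alternative illustration of the same idea, here is a minimal Python sketch of a launcher that starts several services and restarts any that exit. It rests on assumptions: the entries in `COMMANDS` are stand-ins that only sleep, not the real launch commands for the three services, and the script would still have to be run from the pod's own start command.

```python
import subprocess
import sys
import time

# Stand-in commands (assumption): replace each list with the real launch command.
COMMANDS = {
    "sd-web-ui": [sys.executable, "-c", "import time; time.sleep(3600)"],
    "text-generation-webui": [sys.executable, "-c", "import time; time.sleep(3600)"],
    "fastapi-service": [sys.executable, "-c", "import time; time.sleep(3600)"],
}


def start_all(commands):
    """Launch every command and return a name -> Popen mapping."""
    return {name: subprocess.Popen(cmd) for name, cmd in commands.items()}


def restart_exited(procs, commands):
    """Relaunch any process that has exited; return the names restarted."""
    restarted = []
    for name, proc in list(procs.items()):
        if proc.poll() is not None:  # poll() is None while still running
            procs[name] = subprocess.Popen(commands[name])
            restarted.append(name)
    return restarted


def supervise(commands, interval=5.0):
    """Start all services, then check on them forever."""
    procs = start_all(commands)
    while True:
        time.sleep(interval)
        for name in restart_exited(procs, commands):
            print(f"restarted {name}", flush=True)
```

Calling `supervise(COMMANDS)` as the last step of the container's startup would bring all three services back after a restart. A dedicated process manager would be more robust, but this keeps the idea visible.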


ErcanOP•16mo ago

I see, that makes sense. I will have a look.
