How to Estimate the Survival Time of Spot Instances?
I need some advice on estimating the survival time of RunPod Spot instances. I've noticed that sometimes my Spot instances run for several hours without interruption, while other times they get terminated within minutes. This variability makes it challenging to choose between SPOT and ON-DEMAND.
9 Replies
Yeah, agreed that demand from other users is unpredictable. I was hoping there were some statistical algorithms that could predict the survival time.
SkyPilot will be a great tool if I am going to run batch jobs. Thanks for the suggestion. Meanwhile, I sometimes use a RunPod machine as my workstation as well.
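On the statistics side, one option is to log your own Spot runs and estimate a survival curve from them. Below is a minimal sketch of a Kaplan-Meier estimator in plain Python. It is not a RunPod feature, and it assumes you record each run yourself as a hypothetical `(duration_hours, was_preempted)` pair, where `was_preempted` is False when you stopped the pod yourself. The numbers in the usage note are made up for illustration only.

```python
from typing import List, Tuple

Run = Tuple[float, bool]  # (duration_hours, was_preempted), a hypothetical log format


def kaplan_meier(runs: List[Run]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each observed preemption time t.

    Runs that ended because the user stopped the pod are treated as
    censored: they count as "still alive" up to their duration but
    never as a preemption event.
    """
    event_times = sorted({d for d, preempted in runs if preempted})
    curve = []
    survival = 1.0
    for t in event_times:
        # Pods that had lasted at least t hours were still at risk at time t.
        at_risk = sum(1 for d, _ in runs if d >= t)
        events = sum(1 for d, preempted in runs if preempted and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Estimated probability that a pod is still running after t hours."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s
```

For example, with made-up logs `[(0.5, True), (1.0, False), (2.0, True), (3.0, True)]`, `survival_at(kaplan_meier(logs), 1.5)` gives the estimated chance of lasting past 1.5 hours. The estimate only reflects the bid price, GPU type, and time of day you actually logged, so it is a rough guide rather than a prediction.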
Hah, yeah. That might be a solution.
I really love the network volume provided by RunPod. It makes using RunPod as a daily workstation possible. I usually run a pod for several hours. If I could pick a feasible Spot bid price that lets a pod survive for about 2 hours on average, that would be perfect.
Can I at least receive a signal inside the container when the Spot instance is being killed, giving me a second to write the necessary state to the volume?
Spot Pods use spare compute capacity, allowing you to bid for those compute resources. Resources are dedicated to your Pod, but someone else can bid higher or start an On-Demand Pod, which will stop your Pod. When this happens, your Pod receives SIGTERM as a 5-second warning, followed by the kill signal SIGKILL once those 5 seconds have elapsed. You can use volumes to save any data to disk in that 5-second window, or push data to the cloud periodically.
https://docs.runpod.io/references/faq/#on-demand-vs-spot-pod
FAQ | RunPod Documentation
RunPod offers two cloud computing services: Secure Cloud and Community Cloud. Secure Cloud provides high-reliability, while Community Cloud offers peer-to-peer GPU computing. On-Demand Pods run continuously, while Spot Pods use spare compute capacity.
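To make use of that 5-second window, something along these lines might work: a minimal Python sketch that traps SIGTERM and writes a small state file before exiting. The `/workspace/state.json` path, the `{"step": ...}` state layout, and the helper names are all assumptions for illustration, not anything RunPod prescribes.

```python
import json
import os
import signal
import sys
import tempfile

# Assumed location on the persistent volume; adjust to your own mount path.
STATE_PATH = os.environ.get("STATE_PATH", "/workspace/state.json")


def save_state(state: dict, path: str) -> None:
    """Write state atomically: temp file first, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_path, path)


def load_state(path: str) -> dict:
    """Load the last saved state, or start fresh if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0}


def install_sigterm_checkpoint(state: dict, path: str) -> None:
    """On SIGTERM, save the current contents of `state` and exit cleanly."""
    def _handler(signum, frame):
        save_state(state, path)
        sys.exit(0)
    signal.signal(signal.SIGTERM, _handler)


# Typical use inside a long-running script:
#   state = load_state(STATE_PATH)
#   install_sigterm_checkpoint(state, STATE_PATH)
#   while True:
#       do_some_work(state)   # hypothetical; mutates state in place
#       state["step"] += 1
```

Two caveats. The handler only fires if your Python process actually receives the signal, so if the script is started from a shell wrapper, check that the wrapper forwards SIGTERM. And 5 seconds is short, so keep the saved state small and push large checkpoints periodically instead of at shutdown.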
Time is money 😆
😉 And I am planning to use the 5 seconds to log the status so that I can resume on another machine without pain.