Unhealthy machines
We recently noticed that we occasionally get machines with bad performance: worker startup time is very long, and runtime performance is then really bad. We've seen it with and without FlashBoot. We are going to do two things to address it:
1. Crash the worker before giving control back to the Runpod library if we detect bad performance.
2. Remove bad workers via the control plane.
Is it expected for the tenant (us) to handle machine health issues? What would be the recommendation from the Runpod team?
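For anyone reading later, option 1 could look roughly like the sketch below. It is only an illustration under assumptions: `startup_is_healthy`, `gate_or_exit`, `sample_benchmark`, and the 5-second threshold are hypothetical names and values, not part of the Runpod library, and whether a crashed worker gets rescheduled onto a different host is something to confirm with support.

```python
import sys
import time
from typing import Callable


def startup_is_healthy(benchmark: Callable[[], None],
                       max_seconds: float,
                       runs: int = 3) -> bool:
    """Return True if the fastest of `runs` benchmark executions is within max_seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        benchmark()
        best = min(best, time.perf_counter() - start)
    return best <= max_seconds


def gate_or_exit(benchmark: Callable[[], None], max_seconds: float) -> None:
    """Exit with a non-zero status if the machine fails the startup benchmark."""
    if not startup_is_healthy(benchmark, max_seconds):
        print("startup benchmark too slow; exiting before accepting jobs",
              file=sys.stderr)
        sys.exit(1)


def sample_benchmark() -> None:
    # Hypothetical stand-in for a short, fixed inference workload
    # (for example, generating one second of audio with the loaded model).
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    # Assumed threshold; in practice it would come from healthy-worker baselines.
    gate_or_exit(sample_benchmark, max_seconds=5.0)
    # Only after the gate passes would the worker hand control to the
    # serverless library (for Runpod's Python SDK, the runpod.serverless.start call).
```

Taking the best of several runs rather than a single measurement is meant to avoid crashing a healthy worker because of one noisy sample.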
10 Replies
@hotsnr
Escalated To Zendesk
The thread has been escalated to Zendesk!
Support may tell you that you can just trigger a refresh of the worker with its ID, which I recommend. If that doesn't fix your issue, then you have something else going on.
What does crashing accomplish?
Our system generates audio in real time, and users complained about lag. Logs indicate that the Real-time ratio (RTM) was way higher than usual.
This was a freshly started worker. My understanding is that a refresh basically restarts the container and doesn't recycle it or move it to another host.
Jobs won't be able to land on this worker, so users won't suffer.
Sorry, it's Real-time factor (RTF). If you generate X seconds of audio and it takes your system Y seconds to do it, your RTF is Y/X. The target is to keep it below 1 so that generation keeps up with playback.
We also noticed that models took much longer to load on that worker.
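To make the RTF definition above concrete, here is a minimal sketch. The function names and the sample numbers are illustrative assumptions, not values from our logs.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = Y / X: time spent generating divided by duration of audio generated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_playback(rtf: float, target: float = 1.0) -> bool:
    """Generation keeps up with playback only while RTF stays below the target."""
    return rtf < target


# 10 s of audio generated in 5 s: RTF 0.5, fast enough for real time.
healthy = real_time_factor(processing_seconds=5.0, audio_seconds=10.0)
# 10 s of audio generated in 15 s: RTF 1.5, playback will stall.
unhealthy = real_time_factor(processing_seconds=15.0, audio_seconds=10.0)

print(healthy, keeps_up_with_playback(healthy))      # 0.5 True
print(unhealthy, keeps_up_with_playback(unhealthy))  # 1.5 False
```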
Yes, by comparing to other workers - all of them were of the same type (4090) in the same DC (IL-1).
I'm going to file a support request once I get more data from our logging system. I wanted to understand whether I can do something better than options 1 and 2 (we can't allow jobs to land on hardware that is clearly bad).
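On the control-plane side, the comparison against other workers of the same type could be sketched roughly as follows. This assumes we already collect a recent RTF per worker; `find_outlier_workers`, the worker IDs, and the 2x ratio are hypothetical, and the actual removal call is left out because it depends on the platform API.

```python
from statistics import median
from typing import Dict, List


def find_outlier_workers(rtf_by_worker: Dict[str, float],
                         ratio: float = 2.0,
                         min_workers: int = 3) -> List[str]:
    """Return IDs of workers whose RTF exceeds `ratio` times the fleet median."""
    if len(rtf_by_worker) < min_workers:
        # Too few peers to say what "normal" looks like.
        return []
    baseline = median(rtf_by_worker.values())
    return sorted(
        worker_id
        for worker_id, rtf in rtf_by_worker.items()
        if rtf > ratio * baseline
    )


# Illustrative numbers only: same GPU type, same data center.
recent_rtf = {"w1": 0.40, "w2": 0.50, "w3": 0.45, "w4": 1.80}
print(find_outlier_workers(recent_rtf))  # ['w4']
```

Using the median as the baseline keeps one slow worker from dragging the reference value up the way a mean would.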