Runpod•5mo ago

Unhealthy machines

We recently noticed that we occasionally get machines with bad performance: worker startup time is very long, and runtime performance is then really bad. We've seen it with and without FlashBoot. We are going to do two things to address it:

1. Crash the worker before giving control back to the Runpod library if we detect bad performance.
2. Remove bad workers with the control plane.

Is the tenant (us) expected to handle machine health issues? What would the Runpod team recommend?
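For illustration only, here is a minimal sketch of what option 1 could look like: time a small probe workload at startup and exit with a non-zero status if it exceeds a budget, before the worker ever accepts a job. The names `probe_seconds` and `gate_startup`, the probe workload, and the budget are all hypothetical; nothing here is part of the Runpod library, and whether a non-zero exit causes the worker to be replaced is an assumption to confirm with support.

```python
import sys
import time


def probe_seconds(workload, repeats=3):
    """Return the best wall-clock time of `workload` over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def gate_startup(workload, max_seconds, repeats=3):
    """Exit with status 1 if the probe is slower than the budget.

    `workload` is a hypothetical callable, e.g. a short fixed inference
    that should take a known amount of time on healthy hardware.
    """
    elapsed = probe_seconds(workload, repeats)
    if elapsed > max_seconds:
        print(
            f"health probe took {elapsed:.3f}s "
            f"(budget {max_seconds:.3f}s), exiting",
            file=sys.stderr,
        )
        sys.exit(1)
    return elapsed
```

The gate would be called once, after the model is loaded and before control is handed to the serverless handler loop. Taking the best of several runs keeps a single noisy measurement from killing a healthy worker.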

10 Replies

Poddy•5mo ago

@hotsnr

Escalated To Zendesk

The thread has been escalated to Zendesk!

Unknown User•5mo ago

Message Not Public

Dj•5mo ago

Support may tell you that you can just trigger a refresh of the worker by its ID, which I recommend. If that doesn't fix your issue, then you have something else going on.

Henky!!•5mo ago

What does crashing accomplish?

hotsnrOP•5mo ago

Our system generates audio in real time, and users complained about lag. Logs indicate that the real-time ratio (RTM) was way higher than usual. This was a freshly started worker; my understanding is that a refresh basically restarts the container and doesn't recycle it or move it to another host.

Unknown User•5mo ago

Message Not Public

hotsnrOP•5mo ago

A job won't be able to land on this worker, and the user wouldn't suffer. Sorry, it's real-time factor (RTF). If you generate X seconds of audio and it takes your system Y seconds to do it, your RTF is Y/X. The target is to keep it <1 so that you can keep up with playback. We also noticed that models took much longer to load on that worker.
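The RTF definition above can be written out as a small sketch. The function names and the default target of 1.0 are illustrative assumptions, not taken from any library.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """RTF = Y / X: processing time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up(audio_seconds, processing_seconds, target=1.0):
    """True when generation is faster than playback (RTF below target)."""
    return real_time_factor(audio_seconds, processing_seconds) < target
```

For example, generating 10 seconds of audio in 5 seconds gives an RTF of 0.5, which keeps up with playback, while taking 12 seconds gives 1.2, which does not.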

Unknown User•5mo ago

Message Not Public

hotsnrOP•5mo ago

Yes, by comparing to other workers: all of them were of the same type (4090) in the same DC (IL-1). I'm going to file a support request once I get more data from our logging system. I wanted to understand whether I can do something better than options 1 and 2 above (we can't allow jobs to land on hardware that is clearly bad).
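One possible way to make the "compare to other workers" check concrete is to flag any worker whose RTF is well above the fleet median for the same GPU type and data center. This is only a sketch: the `slow_workers` name, the worker IDs, and the 1.5x tolerance are made-up assumptions, and the right threshold would have to come from real logs.

```python
from statistics import median


def slow_workers(rtf_by_worker, tolerance=1.5):
    """Return IDs of workers whose RTF exceeds tolerance x the fleet median.

    `rtf_by_worker` maps a worker ID to its measured RTF; all workers are
    assumed to share the same GPU type and data center.
    """
    if not rtf_by_worker:
        return []
    baseline = median(rtf_by_worker.values())
    return sorted(
        worker_id
        for worker_id, rtf in rtf_by_worker.items()
        if rtf > baseline * tolerance
    )
```

Using the median rather than the mean keeps one very slow machine from dragging the baseline up and hiding itself.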

Unknown User•5mo ago

Message Not Public
