RunPod2mo ago

Cancelling job resets flashboot

For some reason, whenever we cancel a job, the next time the serverless worker cold boots it doesn't use FlashBoot and instead reloads the LLM model weights onto the GPU from scratch. Any idea why cancelling jobs might be causing this? Is there a more graceful way to stop jobs early than the /cancel/{job_id} endpoint?
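For reference, this is roughly how we cancel jobs. A minimal sketch, assuming the standard RunPod serverless REST API shape (`https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}` with a Bearer token); the endpoint and job IDs here are placeholders:

```python
import json
import urllib.request


def build_cancel_request(endpoint_id: str, job_id: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for RunPod's /cancel/{job_id} endpoint."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def cancel_job(endpoint_id: str, job_id: str, api_key: str) -> dict:
    """Send the cancel request and return the parsed JSON response."""
    req = build_cancel_request(endpoint_id, job_id, api_key)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```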
3 Replies
nerdylive2mo ago
I'm not sure, maybe cancelling refreshes the worker, which would unload the model from VRAM.
digigoblin2mo ago
You can't stop jobs other than through the /cancel API endpoint. I am also not sure whether /cancel causes the worker to be refreshed. My understanding is that the worker is only refreshed if you specifically set refresh_worker to true in the handler response. As far as I am aware that isn't triggered by cancelling a job, but someone from RunPod would need to confirm.
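To illustrate the refresh_worker flag mentioned above: a minimal handler sketch, where `run_inference` is a hypothetical stand-in for the actual model call. Per RunPod's documented behavior, returning `"refresh_worker": True` tells the platform to stop and refresh the worker after the job; leaving it out (or False) should keep the warm worker and its loaded weights:

```python
def run_inference(prompt: str) -> str:
    # Placeholder for the real LLM call (assumption, not the OP's code).
    return f"echo: {prompt}"


def handler(event: dict) -> dict:
    """Serverless handler. Setting refresh_worker=True in the returned
    dict asks RunPod to refresh this worker after the job completes;
    False (or omitting the key) keeps the worker warm."""
    prompt = event["input"].get("prompt", "")
    result = run_inference(prompt)
    return {"output": result, "refresh_worker": False}
```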
Ardgon2mo ago
I can even observe this when cancelling a job through the web UI. While the worker is still active it keeps taking jobs from the queue without refreshing, but as soon as it stops, the next boot is a full refresh.