RunPod2mo ago

Cancelling job resets flashboot

For some reason, whenever we cancel a job, the next time the serverless worker cold boots it doesn't use FlashBoot and instead reloads the LLM model weights onto the GPU from scratch. Any idea why cancelling jobs might be causing this? Is there a more graceful way to stop jobs early than the /cancel/{job_id} endpoint?
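For reference, this is roughly how we cancel jobs. A minimal sketch, assuming the standard RunPod serverless REST API shape (`https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}` with a Bearer token); the endpoint and job IDs here are placeholders:

```python
import json
import urllib.request


def build_cancel_request(endpoint_id: str, job_id: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for RunPod's /cancel/{job_id} endpoint."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def cancel_job(endpoint_id: str, job_id: str, api_key: str) -> dict:
    """Send the cancel request and return the parsed JSON response."""
    req = build_cancel_request(endpoint_id, job_id, api_key)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```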
3 Replies
nerdylive2mo ago
I'm not sure, maybe cancelling refreshes the worker, which would unload the model from VRAM.
digigoblin2mo ago
You can't stop jobs other than through the /cancel API endpoint. I am also not sure whether /cancel causes the worker to be refreshed. My understanding is that the worker is only refreshed if you specifically set refresh_worker to true in the handler response. As far as I am aware that isn't triggered by cancelling a job, but someone from RunPod would need to confirm.
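To illustrate the refresh_worker flag mentioned above: a minimal handler sketch, where `run_inference` is a hypothetical stand-in for the actual model call. Per RunPod's documented behavior, returning `"refresh_worker": True` tells the platform to stop and refresh the worker after the job; leaving it out (or False) should keep the warm worker and its loaded weights:

```python
def run_inference(prompt: str) -> str:
    # Placeholder for the real LLM call (assumption, not the OP's code).
    return f"echo: {prompt}"


def handler(event: dict) -> dict:
    """Serverless handler. Setting refresh_worker=True in the returned
    dict asks RunPod to refresh this worker after the job completes;
    False (or omitting the key) keeps the worker warm."""
    prompt = event["input"].get("prompt", "")
    result = run_inference(prompt)
    return {"output": result, "refresh_worker": False}
```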
Ardgon2mo ago
I can even observe this when cancelling a job through the web UI. While the worker is still active it keeps taking jobs from the queue without refreshing, but as soon as it stops, the next boot is a full refresh.