Flashboot not working
I have Flashboot enabled on my workers, but it appears all of them are running off a cold boot every time, for some reason.

40 Replies

delay times: [screenshot]

Unknown User•14mo ago (message not public)
I'm having the same problem. This seems to have started yesterday.

Very inconsistent, and these are all sequential requests to the same worker
@1AndOnlyPika
Escalated To Zendesk
The thread has been escalated to Zendesk!
I just rolled back to RunPod 1.6.2 (from 1.7.1, since I updated it yesterday) in my Docker image and it seems to have fixed it. I'll run some more tests to confirm.
It did.
Unknown User•14mo ago (message not public)
No, it did not time out
no, I rolled back to 1.6.2 as well, but that did not fix the issue
after some more testing, it appears the delay time has decreased a bit
~3s, compared to the 6-7s cold boots before
downgraded further to 1.6.0 and it looks like that made it a little bit better, weird
Has it always been at a 1-second idle timeout? There’s a bug in 1.7.1 that affects tasks running longer than the idle timeout. That’s getting fixed in 1.7.2, which is releasing soon. See PR https://github.com/runpod/runpod-python/pull/362
GitHub
fix: pings were missing requestIds since the last big refactor by d...
- Distinguish JobsQueue(asyncio.Queue) and JobsProgress(set) and reference them appropriately
- Cleaned up JobScaler.process_job --> rp_job.handle_job
- Graceful cleanup when worker is killed
- More tests
yep, my tasks take 40 seconds and come in bursty batches, so I have the idle timeout set to 1s so that as soon as a task is done the worker shuts off
Alright. That should be fixed with the 1.7.2 release. I’ll let you know when it’s out.
thanks. is it safe to install directly from the git repo?
Yes. You can install from the main branch if you’d like to test it out.
Override the Container Start Command with something like
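(sketch only: assuming your image has bash and your handler lives at /handler.py; adjust the path to whatever your image actually runs)

bash -c "pip install git+https://github.com/runpod/runpod-python.git@main && python -u /handler.py"

That pulls the library straight from the main branch before the worker starts.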
Thankfully I found this thread. I was about to invest a whole day optimizing my container image because I thought a change I made broke Flashboot! Will wait for the new release
@deanQ Out of curiosity, do you have a sense of when 1.7.2 will be released?
Today. I’m just running some final testing.
Awesome! Thanks @deanQ
FYI: v1.7.2 is on pre-release while I do some final tests https://github.com/runpod/runpod-python/releases/tag/1.7.2
GitHub
Release 1.7.2 · runpod/runpod-python
What's Changed
- Corrected job_take_url by @deanq in #359
- Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353
- fix: pings were missing requestIds since the last b...
does not appear to have fixed flashboot


downgraded all the way to 1.5.3 and flashboot is a bit more consistent now
@1AndOnlyPika What view is that? What are the columns?
Unknown User•14mo ago (message not public)
Oh I see, mine don't seem to be showing up, I guess because they're older than 30 min?
Anyway, the metric I've been using to evaluate is the Cold start time P70.
You can see that prior to deploying a new image (presumably with a newer version of the runpod pip library) our P70 cold start time was under 150ms. After deploying, it went up to over 5,000ms; redeploying with runpod 1.6.2 brought it back down to under 700ms, but that's still higher than before.
I'm not using delay time as that seems to also factor in queue times.
I'm assuming Flashboot impacts the Cold start time and that is the correct metric to evaluate, yeah?



Unknown User•14mo ago (message not public)
Well, downgrading to 1.6.2 did improve things quite a bit.
Not as good as before, I think, but maybe I just need to wait and see if things get faster with more usage.
Is Flashboot performance related at all to image size or container disk size? For example, should the image fit in the Container Disk Size specified?
Not sure how Flashboot works, so hard to know what's happening.
Unknown User•14mo ago (message not public)
With a setup like this, you will face cold start issues. For example, if you have a burst of consecutive jobs coming in, workers will stay alive and take those jobs. The moment there's a gap of a second or two without a job, your workers go to sleep. Any job that comes in after that has to wait in the queue until a worker is ready, and by ready I mean flash-booted or fully booted as a new worker. A few extra seconds of idle timeout won't cost you much, and will guarantee quick job pickup between the gaps. Incurring cold start and boot times will end up costing you more time in total.
My gaps are 10 minutes long, so I only want a worker to boot up, take one job, and then be done
The jobs must complete within one minute, including the delay/cold start time
Which is why the longer delay times are a problem for me
This is exactly where flash-boot should help. I’ll investigate what I can about this.
Thank you, my endpoint id is 8ba6bkaiosbww6
most of the time, half of the max workers work with flashboot and start in 2s, but lots of them take 15s+
Sometimes I cannot get all of them even in 45s
Unknown User•14mo ago (message not public)
it's for bittensor, just running out of a python file
Unknown User•14mo ago (message not public)
nope it just directly starts the worker and waits for requests
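it's basically just a handler passed to runpod.serverless.start, roughly like this (simplified sketch; the real handler does the actual work):

import runpod

def handler(job):
    # job["input"] holds the payload sent to the endpoint
    return {"echo": job["input"]}  # placeholder for the real task

runpod.serverless.start({"handler": handler})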
Unknown User•14mo ago (message not public)
What might be the downsides of using an earlier version of the library (e.g. 1.5.3)? I'm finding that this version yields much quicker startups.
no downsides that I've noticed so far
you'd probably lose a few features from the newer versions, but I don't know if that'd matter much
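if you do want to stay on it, just pin the version wherever you install the library in your image, e.g.

pip install runpod==1.5.3

so a rebuild doesn't silently pull a newer release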