Is there a way to store the state and continue?
Hello there. I am looking for a way to store the crawler's current state (i.e. where in the crawl it currently is), so that if an error occurs and the crawler crashes, we can fix the problem and then continue from where it left off.
For example, I wrote a program that crawls Google's search pages, and I want to crawl 1000+ pages, so the run takes a long time.
While crawling, an error occurred because of a bug in our program, e.g. we missed a special button on Google's page.
We have fixed it now, but we have no way to continue from where the crawl stopped, even though we want to: the run already took about 5 hours, and restarting from scratch would waste another 5 hours.
8 Replies
sensitive-blue•3y ago
I too am looking for a similar solution
eastern-cyanOP•3y ago
This must happen to anyone writing a long-running scraper, so.......
extended-salmon•3y ago
I guess you can try to use useState():
https://crawlee.dev/api/core/function/useState
https://crawlee.dev/docs/upgrading/upgrading-to-v3#auto-saved-crawler-state
wise-white•3y ago
One of the ways is to set an environment variable: both
process.env.CRAWLEE_PURGE_ON_START = 'false';
and process.env.APIFY_PURGE_ON_START = 'false';
would work
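For example (a minimal sketch, assuming the variables are set before the crawler is constructed; CheerioCrawler and the start URL are just placeholders - any crawler class behaves the same way):

```ts
import { CheerioCrawler } from 'crawlee';

// Keep the default storages (request queue, key-value stores, dataset) from
// being purged on start, so a restarted run reuses the persisted request queue.
process.env.CRAWLEE_PURGE_ON_START = 'false';
process.env.APIFY_PURGE_ON_START = 'false';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks();
    },
});

// After a crash and a fix, re-running this script should pick up the un-purged
// request queue: requests already marked as handled are not processed again.
await crawler.run(['https://example.com']);
```

The crawlee.json / Configuration route mentioned next flips the same purgeOnStart switch, just through configuration instead of environment variables.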
Also it's doable through configuration: https://crawlee.dev/docs/guides/configuration - with crawlee.json
or a Configuration instance - the parameters are described here: https://crawlee.dev/api/core/interface/ConfigurationOptions#purgeOnStart
eastern-cyanOP•3y ago
Thank you.
As you said, we can use useState to manage progress state
wise-white•3y ago
@rikusen0335 Keep in mind that useState is a method where you need to explicitly provide the data that should be saved. And even if you use the state - by default, the next crawler start will purge the storages (including the request queue).
sensitive-blue•3y ago
Can you go to a saved state on the existing crawler, or can you only use that method on a new crawler? We don't want to open a new crawler due to cookies and reauthentication complications.
wise-white•3y ago
useState() is used to keep certain values in memory - the values should be serializable and are saved to the key-value store periodically. You could go back to an existing state if you had been using and saving that state before. But if you need to continue the scraping where you left off, the easiest way is probably to set the env variable as mentioned above.
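To make that concrete, here is a minimal sketch of the useState() approach (the 'crawl-progress' key name and the state shape are made-up examples, and it assumes useState is re-exported by the crawlee package; combined with the purge-on-start setting above, the saved value should survive a restart):

```ts
import { CheerioCrawler, useState } from 'crawlee';

// Example state shape - anything serializable works.
type CrawlState = { processedUrls: string[] };

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        // Returns the previously saved value for this key (or the default on the
        // first run); Crawlee persists it to the key-value store periodically.
        const state = await useState<CrawlState>('crawl-progress', { processedUrls: [] });
        state.processedUrls.push(request.url);
        log.info(`Processed ${state.processedUrls.length} URLs so far`);
    },
});

await crawler.run(['https://example.com']);
```

Note that this only persists whatever you put into the state object yourself; resuming the request queue itself still relies on not purging the storages, as discussed earlier.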