Is there a way to store the state and continue?

Hello there. I am looking for a way to store the current state of where the crawler is, so that if an error occurs and it crashes, we can fix the problem and continue from there. For example, I wrote a program that crawls Google's search pages, and I want to crawl 1000+ pages, so the run takes a long time. While crawling, an error occurred due to a problem in our own program - we missed handling a special button on Google's page. We have fixed it now, but we have no way to continue from where it stopped, even though we want to, because the run already took about 5 hours and now we have to waste another 5 hours.
8 Replies
sensitive-blue · 3y ago
I too am looking for a similar solution
eastern-cyan (OP) · 3y ago
This must have occurred to anyone writing a long-running scraper, so...
wise-white · 3y ago
One of the ways is to set an env var: both process.env.CRAWLEE_PURGE_ON_START = 'false'; and process.env.APIFY_PURGE_ON_START = 'false'; would work. It's also doable through configuration: https://crawlee.dev/docs/guides/configuration - either with crawlee.json or with a Configuration instance. The parameters are described here: https://crawlee.dev/api/core/interface/ConfigurationOptions#purgeOnStart
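For illustration, a minimal sketch of the Configuration route - CheerioCrawler, the handler body, and the start URL are just placeholders for whatever you are actually running:

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

// Option 1: set the env var before the crawler starts.
// process.env.CRAWLEE_PURGE_ON_START = 'false';

// Option 2: pass a Configuration instance with purgeOnStart disabled,
// so the request queue and key-value store are not wiped on the next run.
const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, enqueueLinks, log }) {
            log.info(`Processing ${request.url}`);
            await enqueueLinks();
        },
    },
    new Configuration({ purgeOnStart: false }),
);

await crawler.run(['https://example.com']);
```

The same effect should be achievable by putting { "purgeOnStart": false } into crawlee.json at the project root, as described in the configuration guide linked above.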
eastern-cyan (OP) · 3y ago
Thank you. So we can use useState to manage the progress state.
wise-white · 3y ago
@rikusen0335 Keep in mind that useState is a method where you need to explicitly provide the data that should be saved. And even if you use state, by default the next crawler start will purge the storages (including the request queue).
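As a rough illustration of that point (this assumes Crawlee v3+, where crawlers expose useState(); the state shape here is made up):

```ts
import { CheerioCrawler } from 'crawlee';

// You decide what goes into the state - only this data gets persisted.
interface CrawlState {
    pagesDone: number;
    lastUrl?: string;
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, crawler, log }) {
        // The state object is saved to the key-value store periodically.
        const state = await crawler.useState<CrawlState>({ pagesDone: 0 });
        state.pagesDone += 1;
        state.lastUrl = request.url;
        log.info(`Pages done so far: ${state.pagesDone}`);
    },
});

await crawler.run(['https://example.com']);
```

Note that this state only survives a restart if the storages are not purged on the next start, hence the purgeOnStart / env var setting above.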
sensitive-blue · 3y ago
Can you go back to a saved state on the existing crawler, or can you only use that method on a new crawler? We don't want to open a new crawler because of cookies and re-authentication complications.
wise-white · 3y ago
useState() is used to keep certain values in memory - the values should be serializable and are saved to the key-value store periodically. You can go back to an existing state if you were using state before and it was saved. But if you need to continue the scraping where you left off, then the easiest way is probably to set the env variable as mentioned above.