Continue scraping on the page where the last scrape failed
Let's say we're going through a page that has a list of ads, with pagination at the end of each page. If for some reason our scraper can't open a page and fails, I'd like to know the location of the failure and start the next scrape from it immediately. What are the best practices for tackling this issue?
Hello @FlowGravity, just want to be sure: do you want to scrape all the ads, or just the ads on the page that failed to open the next page?
wise-whiteOP•3y ago
I would like to continue scraping until the pagination ends. So, all the ads.
Maybe this formulation works better: does Crawlee have an out-of-the-box mechanism that deals with crawls that ended too soon, especially when pagination is involved?
Let me use this space as a rubber duck until someone chimes in.
I start a crawl with a crawlUUID. I create a table called crawl_sessions and insert this crawl with its UUID and a status (in_progress, failed, completed).
Whenever I come to a route that has pagination, I update this crawl_session with the page we are currently getting ads from (page=227, for instance).
When I reach the page that displays the "no more ads" message, I update the crawl_session status to "completed".
If I never reach the page with the "no more ads" message but the crawl has finished for some reason, I update the crawl_session with the status "failed" and the last known page number.
At this point I start a new crawl from the last page where it successfully loaded ads.
For my purposes, perfect coverage of the ads is not needed.
What do you think? Is this a proper way to go, or does something better come to mind?
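A minimal sketch of that crawl_sessions bookkeeping, assuming better-sqlite3 for persistence; the table, column, and function names are just illustrative, not something Crawlee provides:

```ts
// Sketch of the crawl_sessions bookkeeping described above (assumed schema).
import Database from 'better-sqlite3';
import { randomUUID } from 'node:crypto';

const db = new Database('crawls.db');
db.exec(`CREATE TABLE IF NOT EXISTS crawl_sessions (
  crawl_uuid TEXT PRIMARY KEY,
  status     TEXT NOT NULL,            -- in_progress | failed | completed
  last_page  INTEGER NOT NULL DEFAULT 1
)`);

export function startCrawlSession(): string {
  const crawlUuid = randomUUID();
  db.prepare('INSERT INTO crawl_sessions (crawl_uuid, status) VALUES (?, ?)')
    .run(crawlUuid, 'in_progress');
  return crawlUuid;
}

// Called whenever a listing page starts being processed.
export function recordPage(crawlUuid: string, page: number): void {
  db.prepare('UPDATE crawl_sessions SET last_page = ? WHERE crawl_uuid = ?')
    .run(page, crawlUuid);
}

export function finishCrawlSession(crawlUuid: string, status: 'completed' | 'failed'): void {
  db.prepare('UPDATE crawl_sessions SET status = ? WHERE crawl_uuid = ?')
    .run(status, crawlUuid);
}

// Where should a new crawl start? Last known page of the most recent failed session, else page 1.
export function resumePage(): number {
  const row = db
    .prepare("SELECT last_page FROM crawl_sessions WHERE status = 'failed' ORDER BY rowid DESC LIMIT 1")
    .get() as { last_page: number } | undefined;
  return row?.last_page ?? 1;
}
```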
In my mind the implementation should be pretty simple, but it depends on whether the website is SSG or SPA, etc.
This is pseudocode, but you might get the idea from it:
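(The original pseudocode isn't shown here; the following is just one possible shape for an SSG-style listing using Crawlee's CheerioCrawler. The URL pattern, selectors, and the session helpers from the earlier sketch are all assumptions.)

```ts
import { CheerioCrawler } from 'crawlee';
// startCrawlSession, recordPage, finishCrawlSession, resumePage come from the sketch above.

const crawlUuid = startCrawlSession();
const startPage = resumePage(); // 1 on a fresh run, last known page after a failed crawl

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        const page = request.userData.page as number;
        recordPage(crawlUuid, page); // remember how far we got

        // Hypothetical selector; adjust to the real listing markup.
        const ads = $('.ad-item').toArray();
        if (ads.length === 0) {
            // "No more ads" page reached: the pagination is exhausted.
            finishCrawlSession(crawlUuid, 'completed');
            return;
        }
        // ... extract and save the ads here ...

        // Enqueue the next page explicitly so the page number travels in userData.
        await crawler.addRequests([{
            url: `https://example.com/ads?page=${page + 1}`, // hypothetical URL pattern
            userData: { page: page + 1 },
        }]);
    },
    async failedRequestHandler({ request }) {
        // A request exhausted its retries: mark the session so the next run resumes here.
        finishCrawlSession(crawlUuid, 'failed');
    },
});

await crawler.run([{
    url: `https://example.com/ads?page=${startPage}`,
    userData: { page: startPage },
}]);
```

For an SPA you would swap CheerioCrawler for PlaywrightCrawler and wait for the listing to render, but the bookkeeping stays the same.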