Continue scraping on the page where the last scrape failed

Let's say we're scraping a page with a list of ads, with pagination at the end of each page. If for some reason the scraper can't open a page and fails, I'd like to record where the failure happened and start the next scrape from that location immediately. What are the best practices for tackling this issue?
Pepa J
Pepa J•3y ago
Hello @FlowGravity, just want to be sure: do you want to scrape all the ads, or just the ads on the page that failed to open the next page?
wise-white
wise-whiteOP•3y ago
I would like to continue scraping until the pagination ends, so all the ads.

Maybe this formulation works better: does Crawlee have an out-of-the-box mechanism for dealing with crawls that ended too soon, especially when pagination is involved?

Let me use this space as a rubber duck until someone gets in. My plan (sketched below):

- I start a crawl with a crawlUUID. I create a table called crawl_sessions and insert this crawl with its UUID and a status (in_progress, failed, completed).
- Whenever I reach a route that has pagination, I update the crawl_session with the page we are currently getting ads from (page=227, for instance).
- When I reach the page that displays the "no more ads" message, I update the crawl_session status to completed.
- If I never reach the page with the "no more ads" message but the crawl has finished for some reason, I update the crawl_session with the status failed and the last known page number. At that point I start a new crawl from the last page where ads successfully loaded.

For my purposes perfect coverage of the ads is not needed. What do you think? Is this a proper way to go, or does something better come to mind?
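A minimal sketch of that bookkeeping, assuming SQLite via the better-sqlite3 package; the table and column names (crawl_sessions, crawl_uuid, status, last_page) follow the scheme described above and are otherwise hypothetical:

import Database from 'better-sqlite3';
import { randomUUID } from 'node:crypto';

const db = new Database('crawls.db');
db.exec(`
    CREATE TABLE IF NOT EXISTS crawl_sessions (
        crawl_uuid TEXT PRIMARY KEY,
        status     TEXT NOT NULL,    -- in_progress | failed | completed
        last_page  INTEGER NOT NULL DEFAULT 1
    )
`);

// Start a new session, or resume from the most recent failed one.
export function startOrResumeSession() {
    const failed = db
        .prepare(`SELECT * FROM crawl_sessions WHERE status = 'failed' ORDER BY rowid DESC`)
        .get();
    if (failed) return failed; // caller resumes from failed.last_page
    const session = { crawl_uuid: randomUUID(), status: 'in_progress', last_page: 1 };
    db.prepare(`INSERT INTO crawl_sessions (crawl_uuid, status, last_page) VALUES (?, ?, ?)`)
        .run(session.crawl_uuid, session.status, session.last_page);
    return session;
}

// Call from the pagination handler after each page's ads have loaded.
export function recordPage(crawlUuid, page) {
    db.prepare(`UPDATE crawl_sessions SET last_page = ? WHERE crawl_uuid = ?`)
        .run(page, crawlUuid);
}

// Call with 'completed' at the "no more ads" page, or 'failed' if the crawl dies early.
export function finishSession(crawlUuid, status) {
    db.prepare(`UPDATE crawl_sessions SET status = ? WHERE crawl_uuid = ?`)
        .run(status, crawlUuid);
}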
Pepa J
Pepa J•3y ago
In my mind the implementation should be pretty simple, but it depends on the website, whether it is SSG or SPA, etc. This is pseudocode, but you might get the idea from it:
router.addDefaultHandler(async ({ page, crawler }) => {
    // Stop paginating once the "No more Ads" message appears (selector is illustrative).
    if (await page.$('text=No more Ads')) {
        return;
    }

    // Process the ads on the current page.
    const ads = await page.$$('.ad');
    for (const ad of ads) {
        await processAd(ad);
    }

    // Add the next page to the RequestQueue.
    const moreLink = await page.$('.more');
    if (moreLink) {
        const href = await moreLink.getAttribute('href');
        await crawler.addRequests([{ url: new URL(href, page.url()).href }]);
    }
});
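To restart from the last page that loaded ads, one option (a sketch, not a built-in Crawlee feature) is to persist the last successful page number in a KeyValueStore and build the start URL from it on the next run. The listing URL pattern https://example.com/ads?page=N and the 'last-page' key are assumptions:

import { PlaywrightCrawler, createPlaywrightRouter, KeyValueStore } from 'crawlee';

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ page, request, crawler }) => {
    // Remember which page we are on, so a failed crawl can resume here.
    const store = await KeyValueStore.open();
    const pageNo = Number(new URL(request.url).searchParams.get('page') ?? 1);
    await store.setValue('last-page', pageNo);

    // ...process ads and enqueue the next page, as in the handler above...
});

// On startup, resume from the last recorded page (or page 1 on a fresh run).
const store = await KeyValueStore.open();
const lastPage = (await store.getValue('last-page')) ?? 1;

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run([`https://example.com/ads?page=${lastPage}`]);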
