How to stop following delayed JavaScript redirects?
I'm using the AdaptivePlaywrightCrawler with the same-domain strategy in enqueueLinks. The page I'm trying to crawl has delayed JavaScript redirects to other pages, such as Instagram. Sometimes the crawler mistakenly thinks it's still on the same domain after a redirect and starts enqueueing Instagram paths under the main domain, like example.com/account/... and example.com/member/..., which don't actually exist. How can I stop following these delayed JavaScript redirects?
6 Replies
Hi @Nth, can you send us an example of how you call enqueueLinks?
passive-yellowOP•3mo ago
Hey @Pepa J, here it's:
router.addDefaultHandler(async (ctx) => {
    const { request, enqueueLinks, parseWithCheerio, querySelector, log, pushData, page } = ctx;
    log.info(`Running request handler for ${request.url}`);
    await enqueueLinks({
        strategy: 'same-domain',
        globs: ['http?(s)://example.com/**', 'http?(s)://**.example.com/**'],
        transformRequestFunction: (req) => {
            // Skip PDF files (check the candidate URL, not the current request)
            if (req.url.endsWith('.pdf')) {
                log.warning(`* Skipping (${req.url}) - PDF`);
                return false;
            }
            return req;
        },
    });
});
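One possible workaround is to validate each candidate URL inside transformRequestFunction before enqueueing it, rejecting hostnames off the crawled domain and paths that only appear after the JavaScript redirect. A minimal sketch (the domain and the path patterns below are hypothetical examples, not taken from the thread):

```javascript
// Paths that only exist after the delayed JavaScript redirect
// (hypothetical examples based on the question).
const REDIRECT_ONLY_PATHS = [/^\/account\//, /^\/member\//];

// Returns true when the URL should NOT be enqueued.
function shouldSkip(rawUrl) {
    let url;
    try {
        url = new URL(rawUrl);
    } catch {
        return true; // unparsable URL: skip it
    }
    // Only keep URLs on the crawled domain (example.com and subdomains)...
    if (!/(^|\.)example\.com$/.test(url.hostname)) return true;
    // ...and drop paths that are artifacts of the redirect.
    return REDIRECT_ONLY_PATHS.some((re) => re.test(url.pathname));
}

// Wiring it in (inside the request handler):
// await enqueueLinks({
//     strategy: 'same-domain',
//     transformRequestFunction: (req) => (shouldSkip(req.url) ? false : req),
// });
```

Returning false from transformRequestFunction discards the candidate, so bogus example.com/account/... URLs never reach the queue even if the link extraction ran after the redirect.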
@Nth just advanced to level 1! Thanks for your contributions! 🎉
passive-yellowOP•3mo ago
No issues with PlaywrightCrawler, but it sometimes happens with AdaptivePlaywrightCrawler
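Another guard worth trying is to compare the browser's current URL with the originally requested one before calling enqueueLinks, and bail out if a delayed redirect has already fired. A sketch of the hostname check (naive comparison that only strips a leading www. and does not allow other subdomains; the handler wiring in the comments assumes the Playwright page object is available, which in the adaptive crawler is only the case for browser-rendered requests):

```javascript
// Returns true when the loaded URL is still on the same host as the
// requested URL. Naive: strips only a leading "www." and treats any
// other subdomain change as off-domain.
function stillOnSameDomain(requestedUrl, loadedUrl) {
    const stripWww = (h) => h.replace(/^www\./, '');
    return stripWww(new URL(requestedUrl).hostname)
        === stripWww(new URL(loadedUrl).hostname);
}

// In the request handler (browser-rendered requests only):
// if (page && !stillOnSameDomain(request.url, page.url())) {
//     log.warning(`Redirected off-domain to ${page.url()}, skipping enqueue`);
//     return;
// }
```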
Thank you @Nth, I believe there might be an issue/bug that shows up on a specific website. Would it be possible to put together a minimal reproducible example with "real urls"?