How to stop following delayed javascript redirects?

I'm using the AdaptivePlaywrightCrawler with the same-domain strategy in enqueueLinks. The page I'm trying to crawl has delayed JavaScript redirects to other pages, such as Instagram. Sometimes, after a redirect, the crawler mistakenly thinks it's still on the same domain and starts enqueueing Instagram URLs under the main domain, like example.com/account/... and example.com/member/..., which don't actually exist. How can I stop following these delayed JavaScript redirects?
6 Replies
Pepa J
Pepa J•3mo ago
Hi @Nth , Can you send us an example of how you call enqueueLinks?
passive-yellow
passive-yellowOP•3mo ago
Hey @Pepa J, here it is:
```ts
router.addDefaultHandler(async (ctx) => {
    const { request, enqueueLinks, parseWithCheerio, querySelector, log, pushData, page } = ctx;
    log.info(`Running request handler for ${request.url}`);

    await enqueueLinks({
        strategy: 'same-domain',
        globs: ['http?(s)://example.com/**', 'http?(s)://**.example.com/**'],
        transformRequestFunction: (req) => {
            // Skip PDF files — check the candidate link (req.url),
            // not the current page (request.url)
            if (req.url.endsWith('.pdf')) {
                log.warning(`Skipping (${req.url}) - PDF`);
                return false;
            }
            return req;
        },
    });
});
```
(Note: the original snippet checked `request.url.endsWith('.pdf')`, i.e. the current page URL rather than the candidate link; corrected to `req.url` above.)
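As a possible workaround while the redirect issue is investigated, the candidate links could be filtered by their parsed hostname inside transformRequestFunction, so that anything enqueued after a cross-domain JavaScript redirect is dropped. A minimal sketch, assuming the target site is example.com; `keepSameDomain` and `ALLOWED_DOMAIN` are hypothetical names, not part of Crawlee:

```typescript
// Hedged sketch: drop candidate links whose hostname is not the target
// domain or one of its subdomains. Uses the standard WHATWG URL parser
// available in Node.js.
const ALLOWED_DOMAIN = 'example.com'; // assumption: the site being crawled

function keepSameDomain(url: string): boolean {
  try {
    const { hostname } = new URL(url);
    // Keep example.com itself and any subdomain of it
    return hostname === ALLOWED_DOMAIN || hostname.endsWith('.' + ALLOWED_DOMAIN);
  } catch {
    // Unparseable URL: safer to drop it
    return false;
  }
}

// Inside transformRequestFunction one could then write:
// if (!keepSameDomain(req.url)) return false;
```

This guards against the globs matching a path that only looks local, since the check is on the actual parsed hostname rather than a glob pattern.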
passive-yellow
passive-yellowOP•3mo ago
There are no issues with PlaywrightCrawler, but it sometimes happens with AdaptivePlaywrightCrawler.
Pepa J
Pepa J•3mo ago
Thank you @Nth. I believe there might be an issue/bug that shows up on a specific website. Would it be possible to put together a minimal reproducible example with real URLs?
