Structure Crawlers to scrape multiple sites

Hey everyone, what is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here and it seems the recommended approach is a single crawler with multiple routers.

The issue I'm facing with this: when you enqueue links, you add site-1 and site-2 initially before starting the crawler, and the crawler then dynamically adds links as it goes. But this messes up the logs, because the request queue is FIFO: it crawls the first link, adds its extracted links to the queue, then crawls the second link and adds its links, and so on, constantly switching context between the two sites, which makes the logs a mess.

Also, routers don't seem to have a URL parameter, just a label and the request, so we'd basically have to define handlers for every site in a single router, right? That will just bloat up a single file.

Is there a better way I can structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel, while keeping the logging sane.
Pepa J•2y ago
Not sure if I fully understand your question. You may use one requestHandler for everything and decide which function you want to use based on the request URL:
requestHandler: async (context) => {
    const { request } = context;

    if (/mydomain1\.com/.test(request.url)) {
        await processSite1(context);
    } else if (/mydomain2\.com/.test(request.url)) {
        await processSite2(context);
    }
}
This will not solve the logs "issue", but you might run the scraper as two different instances with different inputs, or implement your own logger to write the logs to different files.
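(A minimal sketch of the "own logger" idea, assuming Crawlee's log.child() API from @apify/log; the site names and prefixes are illustrative. Each site gets a prefixed child logger, so even interleaved output stays attributable to a site:)

import { log } from 'crawlee';

// One prefixed child logger per site; the prefix appears on every line.
const site1Log = log.child({ prefix: 'Site1' });
const site2Log = log.child({ prefix: 'Site2' });

site1Log.info('enqueued 12 product links'); // prints something like: INFO  Site1: enqueued 12 product links
site2Log.info('handling listing page');     // prints something like: INFO  Site2: handling listing page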
wise-whiteOP•2y ago
ah got it, and yeah the logs will still be an issue. Would you recommend running them serially? I want concurrent scraping within a site, but not all sites running at once, because of the logging. What I was thinking of doing was to run the scraper for one site, close it, then re-initialize it for the second site, and so forth. Or would it be better to put each site as a script in package.json and have npm handle it?

Just one last thing: can we do nested routers? E.g. in the example above, since we dispatch to different functions based on the URL, could I swap out the processSite* functions with a router and have it handle things case by case? Or does it have to be the if/else-based syntax from the basic tutorial? I would really love it if nested routers somehow worked, since the router syntax is much more palatable than a huge 400-500 LOC if/else.
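(For reference, a minimal sketch of the "one crawler per site, run sequentially" idea described above, assuming Crawlee's RequestQueue.open() for named per-site queues; the site names and URLs are placeholders:)

import { PuppeteerCrawler, RequestQueue } from 'crawlee';

// Illustrative site list; in practice each entry would carry its own router.
const sites = [
    { name: 'site1', startUrls: ['https://site1.example.com'] },
    { name: 'site2', startUrls: ['https://site2.example.com'] },
];

for (const site of sites) {
    // A named request queue per site keeps request state and ordering separate.
    const requestQueue = await RequestQueue.open(site.name);
    const crawler = new PuppeteerCrawler({
        requestQueue,
        requestHandler: async ({ request, log }) => {
            log.info(`[${site.name}] handling ${request.url}`);
        },
    });
    await crawler.run(site.startUrls); // site N finishes before site N+1 starts
}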
Pepa J•2y ago
Do you mean using the create***Router() function? You should be able to create several of these. Calling create***Router() returns a function, and you need to pass the context to it. Then in the requestHandler you decide which of these routers you want to use; I don't know what your exact requirements are for this.
wise-whiteOP•2y ago
requirements are just as you stated: have different route handlers for different sites, and in the requestHandler call the relevant route handler according to the site. In the snippet you linked, for await processSite1(context); I wanted to know if I could instead import the route from somewhere else and then pass the context to it, because in the doc example
it's like this:
requestHandler: router
and I didn't find any functions in the docs that do the same. And btw, what do you think about explicitly closing the crawler and then re-initializing it for the next site? Will this kind of work?
await crawler.run('site:1')
await crawler.run('site:2')
or do I have to explicitly close the crawler or anything?
Pepa J•2y ago
I think you are currently trying to create your own pattern that doesn't follow the recommendations for scrapers made with Crawlee - so the only way for you is to test it and find out. 🙂
wise-whiteOP•2y ago
gotcha, thanks. Can you clarify just one last thing? https://discord.com/channels/801163717915574323/1176976837528797245/1177561697288982608 I think I pretty much got it and just need this one thing, and it should work pretty well for my use case
Pepa J•2y ago
Not sure if I understand, but you should be able to export and import these functions just like anything else:
export const myRoutes = createPuppeteerRouter();

myRoutes.addHandler("LABEL", async ({ log }) => {
    log.info('Handling a LABEL request');
});
and import it in another file:
import { myRoutes } from './routes/my-routes.js';
wise-whiteOP•2y ago
Let's say I have two sites, site1 and site2, and they both have separate routes like site1_routes and site2_routes. Now in the crawler requestHandler I want to do something like this:
requestHandler: async (context) => {
    const { request } = context;

    if (/mydomain1\.com/.test(request.url)) {
        await site1_routes(context);
    } else if (/mydomain2\.com/.test(request.url)) {
        await site2_routes(context);
    }
}
Basically, don't define all of the possible permutations for both sites in a single route, but rather have them in separate routers which I can switch on the fly depending on the URL. In the docs, createPlaywrightRouter does take a context option, but I haven't seen it used with context passed in explicitly anywhere.
Pepa J•2y ago
I just tested it and it worked fine:
import { Actor } from 'apify';
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

await Actor.init();

const startUrls = ['https://apify.com', 'https://google.com'];

export const apifyRouter = createPuppeteerRouter();

apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Apify!`);
});

export const googleRouter = createPuppeteerRouter();

googleRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Google!`);
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        if (/apify\.com/.test(context.request.url)) {
            await apifyRouter(context);
        } else if (/google\.com/.test(context.request.url)) {
            await googleRouter(context);
        }
    },
});

await crawler.run(startUrls);

await Actor.exit();
Of course, you may export googleRouter and apifyRouter from different files.
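(A variant of the example above for anyone scaling this to 10+ sites: instead of a growing if/else chain, keep a hostname-to-router map. This uses the same createPuppeteerRouter API as above; the hostname normalization is a sketch:)

import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

const apifyRouter = createPuppeteerRouter();
apifyRouter.addDefaultHandler(async ({ log }) => log.info('Hello from Apify!'));

const googleRouter = createPuppeteerRouter();
googleRouter.addDefaultHandler(async ({ log }) => log.info('Hello from Google!'));

// Look up the router by hostname instead of chaining if/else per site.
const routers = {
    'apify.com': apifyRouter,
    'google.com': googleRouter,
};

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        const host = new URL(context.request.url).hostname.replace(/^www\./, '');
        const router = routers[host];
        if (!router) throw new Error(`No router registered for ${host}`);
        await router(context);
    },
});

await crawler.run(['https://apify.com', 'https://google.com']);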
wise-whiteOP•2y ago
thanks for the code snippet, I was trying out something similar but it errored out so I just dropped it. Really appreciate the help, thanks!
rare-sapphire•3mo ago
Not sure why, but when using this structure it causes Puppeteer to crash with "Target closed".
extended-salmon•3mo ago
It's probably not about this particular structure—it could be related to proxies or some unusual behavior of your target site.
Check out more details here:
🔗 Apify Academy - How to Fix Target Closed
🔗 StackOverflow - Puppeteer Uncatchable Target Closed Error
rare-sapphire•2mo ago
Interesting. I was more curious why it was working the old way. I will test more and see what the root cause is.
rare-sapphire•2mo ago
I forgot to add await. facepalm
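(For future readers, a plausible reconstruction of this bug: if the router call in the requestHandler isn't awaited, the handler resolves immediately, the crawler closes the page while the router's handler is still using it, and Puppeteer reports "Target closed". The router name follows the example above:)

import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

const apifyRouter = createPuppeteerRouter();
apifyRouter.addDefaultHandler(async ({ page, log }) => {
    log.info(await page.title()); // uses the page, so a premature close crashes here
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        // Bug: not awaited, so requestHandler resolves right away and the
        // crawler tears the page down under the still-running handler.
        apifyRouter(context);

        // Fix: wait for the router (and its handler) to finish.
        // await apifyRouter(context);
    },
});

await crawler.run(['https://apify.com']);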
