Structure Crawlers to scrape multiple sites

Hey everyone, what is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here and it seems the recommended approach is a single crawler with multiple routers.

The issue I'm facing with this: when you enqueue links, you add site-1 and site-2 initially before starting the crawler, and the crawler then dynamically adds links as it goes. But this messes up the logs, because the request queue is FIFO: it crawls the first link, adds its extracted links to the queue, then crawls the second link and adds its links, and so on, constantly switching context between the two sites, which makes the logs a mess.

Also, routers don't seem to have a URL parameter, just a label and the request, so we'd basically have to define handlers for every site in a single router, right? That will just bloat up a single file.

Is there a better way I can structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel, while keeping the logging sane.
Pepa J•2y ago
Not sure if I fully understand your question. You may use one requestHandler for everything and decide which function you want to use based on the request URL:
requestHandler: async (context) => {
    const { request } = context;

    if (/mydomain1\.com/.test(request.url)) {
        await processSite1(context);
    } else if (/mydomain2\.com/.test(request.url)) {
        await processSite2(context);
    }
}
This will not solve the logs "issue", but you might run the scraper as two different instances with different inputs, or implement your own logger to write the logs to different files.
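(A minimal sketch of the "own logger" idea, assuming Crawlee's log.child() API from @apify/log; the site names and prefixes are illustrative. Each site gets a prefixed child logger, so even interleaved output stays attributable to a site:)

import { log } from 'crawlee';

// One prefixed child logger per site; the prefix appears on every line.
const site1Log = log.child({ prefix: 'Site1' });
const site2Log = log.child({ prefix: 'Site2' });

site1Log.info('enqueued 12 product links'); // prints something like: INFO  Site1: enqueued 12 product links
site2Log.info('handling listing page');     // prints something like: INFO  Site2: handling listing page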
wise-whiteOP•2y ago
ah got it, and yeah the logs will still be an issue. Would you recommend running them serially? I want concurrent scraping within a site, but not all sites running at once, because of the logging. What I was thinking of doing was to run the scraper for one site, close it, then re-initialize it for the second site, and so forth. Or would it be better to put each site as a script in package.json and have npm handle it?

Just one last thing: can we do nested routers? E.g. in the example above, since we dispatch to different functions based on the URL, could I swap out the processSite* functions with a router and have it handle things case by case? Or does it have to be the if/else-based syntax from the basic tutorial? I would really love it if nested routers somehow worked, since the router syntax is much more palatable than a huge 400-500 LOC if/else.
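(For reference, a minimal sketch of the "one crawler per site, run sequentially" idea described above, assuming Crawlee's RequestQueue.open() for named per-site queues; the site names and URLs are placeholders:)

import { PuppeteerCrawler, RequestQueue } from 'crawlee';

// Illustrative site list; in practice each entry would carry its own router.
const sites = [
    { name: 'site1', startUrls: ['https://site1.example.com'] },
    { name: 'site2', startUrls: ['https://site2.example.com'] },
];

for (const site of sites) {
    // A named request queue per site keeps request state and ordering separate.
    const requestQueue = await RequestQueue.open(site.name);
    const crawler = new PuppeteerCrawler({
        requestQueue,
        requestHandler: async ({ request, log }) => {
            log.info(`[${site.name}] handling ${request.url}`);
        },
    });
    await crawler.run(site.startUrls); // site N finishes before site N+1 starts
}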
Pepa J•2y ago
Do you mean using the create***Router() function? You should be able to create several of these. Calling create***Router() returns a function, and you need to pass the context to it. Then in the requestHandler you decide which of these routers you want to use; I don't know what your exact requirements are for this.
wise-whiteOP•2y ago
requirements are just as you stated: have different route handlers for different sites, and in the requestHandler call the relevant route handler according to the site. In the snippet you linked, for await processSite1(context); I wanted to know if I could instead import the route from somewhere else and then pass the context to it, because in the doc example
it's like this:
requestHandler: router
and I didn't find any functions in the docs that do the same. And btw, what do you think about explicitly closing the crawler and then re-initializing it for the next site? Will this kind of work?
await crawler.run('site:1')
await crawler.run('site:2')
or do I have to explicitly close the crawler or anything?
Pepa J•2y ago
I think you are currently trying to create your own pattern that doesn't follow the recommendations for scrapers made with Crawlee - so the only way for you is to test it and find out. 🙂
wise-whiteOP•2y ago
gotcha, thanks. Can you clarify just one last thing? https://discord.com/channels/801163717915574323/1176976837528797245/1177561697288982608 I think I pretty much got it and just need this one thing, and it should work pretty well for my use case
Pepa J•2y ago
Not sure if I understand, but you should be able to export and import these functions just like anything else:
export const myRoutes = createPuppeteerRouter();

myRoutes.addHandler("LABEL", async ({ log }) => {
    log.info('Handling a LABEL request');
});
and import it in another file:
import { myRoutes } from './routes/my-routes.js';
wise-whiteOP•2y ago
Let's say I have two sites, site1 and site2, and they both have separate routes like site1_routes and site2_routes. Now in the crawler requestHandler I want to do something like this:
requestHandler: async (context) => {
    const { request } = context;

    if (/mydomain1\.com/.test(request.url)) {
        await site1_routes(context);
    } else if (/mydomain2\.com/.test(request.url)) {
        await site2_routes(context);
    }
}
Basically, don't define all of the possible permutations for both sites in a single route, but rather have them in separate routers which I can switch on the fly depending on the URL. In the docs, createPlaywrightRouter does take a context option, but I haven't seen it used with context passed in explicitly anywhere.
Pepa J•2y ago
I just tested it and it worked fine:
import { Actor } from 'apify';
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

await Actor.init();

const startUrls = ['https://apify.com', 'https://google.com'];

export const apifyRouter = createPuppeteerRouter();

apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Apify!`);
});

export const googleRouter = createPuppeteerRouter();

googleRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Google!`);
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        if (/apify\.com/.test(context.request.url)) {
            await apifyRouter(context);
        } else if (/google\.com/.test(context.request.url)) {
            await googleRouter(context);
        }
    },
});

await crawler.run(startUrls);

await Actor.exit();
Of course, you may export googleRouter and apifyRouter from different files.
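(A variant of the example above for anyone scaling this to 10+ sites: instead of a growing if/else chain, keep a hostname-to-router map. This uses the same createPuppeteerRouter API as above; the hostname normalization is a sketch:)

import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

const apifyRouter = createPuppeteerRouter();
apifyRouter.addDefaultHandler(async ({ log }) => log.info('Hello from Apify!'));

const googleRouter = createPuppeteerRouter();
googleRouter.addDefaultHandler(async ({ log }) => log.info('Hello from Google!'));

// Look up the router by hostname instead of chaining if/else per site.
const routers = {
    'apify.com': apifyRouter,
    'google.com': googleRouter,
};

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        const host = new URL(context.request.url).hostname.replace(/^www\./, '');
        const router = routers[host];
        if (!router) throw new Error(`No router registered for ${host}`);
        await router(context);
    },
});

await crawler.run(['https://apify.com', 'https://google.com']);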
wise-whiteOP•2y ago
thanks for the code snippet, I was trying out something similar but it errored out so I just dropped it. Really appreciate the help, thanks!
rare-sapphire•3mo ago
Not sure why, but when using this structure it causes Puppeteer to crash with "Target closed".
extended-salmon•3mo ago
It's probably not about this particular structure—it could be related to proxies or some unusual behavior of your target site.
Check out more details here:
🔗 Apify Academy - How to Fix Target Closed
🔗 StackOverflow - Puppeteer Uncatchable Target Closed Error
rare-sapphire•2mo ago
Interesting. I was more curious why it was working the old way. I will test more and see what the root cause is.
rare-sapphire•2mo ago
I forgot to add await. facepalm
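(For future readers, a plausible reconstruction of this bug: if the router call in the requestHandler isn't awaited, the handler resolves immediately, the crawler closes the page while the router's handler is still using it, and Puppeteer reports "Target closed". The router name follows the example above:)

import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

const apifyRouter = createPuppeteerRouter();
apifyRouter.addDefaultHandler(async ({ page, log }) => {
    log.info(await page.title()); // uses the page, so a premature close crashes here
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        // Bug: not awaited, so requestHandler resolves right away and the
        // crawler tears the page down under the still-running handler.
        apifyRouter(context);

        // Fix: wait for the router (and its handler) to finish.
        // await apifyRouter(context);
    },
});

await crawler.run(['https://apify.com']);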
