Crawlee & Apify•3y ago
ambitious-aqua

Router for what? How it works?

I created a new crawler using the npx crawlee create project command. It creates some folders and files, including a router.js file that contains an instance of createPlaywrightRouter:
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });

    await Dataset.pushData({
        url: request.loadedUrl,
        title,
    });
});
As I understand it, you create the default handler, which is kind of the "main" listener, and later you call/invoke your "detail" route via the enqueueLinks function. This could be interesting for splitting the process into more routes/steps, so it stays cleaner and more decoupled later. My question is: how do I call or invoke a route without the enqueueLinks function? I was expecting something like:
router.addDefaultHandler(async (ctx) => {
    await ctx.invoke('extract-meta-data');
    await ctx.invoke('extract-detail');
    await ctx.invoke('download-files');
});
Where can I see which functions this ctx accepts, or maybe I have understood the router totally differently? Thanks 🙂
3 Replies
vicious-gold•3y ago
Hi @smasher, the idea behind the router is to have different routes for different types of requests, depending on the label you enqueue the request with. It is pretty much equivalent to this code:
const crawler = new BasicCrawler({
    requestHandler: ({ request }) => {
        // The label travels with the request in its userData.
        const { label } = request.userData;

        switch (label) {
            case 'route1':
                doSomething1();
                break;
            case 'route2':
                doSomething2();
                break;
            default:
                doSomethingDefault();
                break;
        }
    },
});
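The router version of that dispatch would look roughly like this (just a sketch with made-up route names; the router itself is passed to the crawler as its requestHandler):

import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Requests enqueued with label: 'route1' end up here.
router.addHandler('route1', async ({ request, log }) => {
    log.info(`route1: ${request.url}`);
});

// Requests enqueued with label: 'route2' end up here.
router.addHandler('route2', async ({ request, log }) => {
    log.info(`route2: ${request.url}`);
});

// Requests without a label fall through to the default handler.
router.addDefaultHandler(async ({ request, log }) => {
    log.info(`default: ${request.url}`);
});

// The router is just a requestHandler that does the switch for you.
const crawler = new PlaywrightCrawler({ requestHandler: router });

So a route is picked once per request, based on its label, rather than being invoked like a function from inside another handler.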
If you want to create a flow like the one in your code snippet, you should simply have each route enqueue the next request with the next route's label.
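For example, the flow from your snippet could look roughly like this (just a sketch; the step names are the ones from your snippet, and each handler adds the same URL back to the queue under the next label, with a distinct uniqueKey so the queue doesn't deduplicate it):

router.addDefaultHandler(async ({ request, crawler }) => {
    // Hand the start URL over to the first step.
    await crawler.addRequests([{
        url: request.url,
        label: 'extract-meta-data',
        uniqueKey: `meta:${request.url}`,
    }]);
});

router.addHandler('extract-meta-data', async ({ request, crawler }) => {
    // ...extract the metadata here, then queue the next step.
    await crawler.addRequests([{
        url: request.url,
        label: 'extract-detail',
        uniqueKey: `detail:${request.url}`,
    }]);
});

router.addHandler('extract-detail', async ({ request, crawler }) => {
    // ...extract the detail data here, then queue the last step.
    await crawler.addRequests([{
        url: request.url,
        label: 'download-files',
        uniqueKey: `files:${request.url}`,
    }]);
});

router.addHandler('download-files', async ({ request }) => {
    // ...download files from request.url here; nothing else is enqueued,
    // so the crawl ends after this step.
});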
ambitious-aquaOP•3y ago
Do you have a real example on GitHub using this router feature with Crawlee? @vojtechmaslan
Pepa J•3y ago
@smasher To understand the crawler's Router, you also need to understand the RequestQueue. The RequestQueue is just a queue of Request information. One piece of that information is the URL, but another is the name of the RequestHandler that is going to be used to process the request; that name is provided via the label attribute.

It is not really related to the event-oriented approach you suggest. Processing of a Request always starts and ends in one single RequestHandler. You may of course enqueue the same request again with a different label (and uniqueKey).

So the Router is only processing Requests from the RequestQueue. Once a Request is successfully processed, its lifetime ends. Once there are no Requests left in the RequestQueue, the Crawler ends. You may of course build your own state machine implementation inside the RequestHandler, similar to what @vojtechmaslan suggested, but I am not sure if that is what the original question was about.
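For illustration, re-enqueueing the same URL under a different route could look roughly like this (a sketch; the label and uniqueKey values here are made up):

router.addHandler('detail', async ({ request, crawler }) => {
    // Queue the same page again, but for a different handler.
    // The uniqueKey must differ, otherwise the RequestQueue treats it as a
    // duplicate of the request that was already processed and drops it.
    await crawler.addRequests([{
        url: request.url,
        label: 'download-files',
        uniqueKey: `download-files:${request.url}`,
    }]);
});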
