How to launch PlaywrightCrawler inside BasicCrawler?

So I have this code:
import { BasicCrawler } from 'crawlee';
import { CookieJar } from 'tough-cookie';
import destr from 'destr';

const cookieJar = new CookieJar();

export const basicCrawler = new BasicCrawler({
    async requestHandler({ sendRequest, request, log }) {
        try {
            const res = await sendRequest({
                url: request.url,
                method: 'GET',
                cookieJar,
            });
            const json = destr(res.body);
            const urls = json.map((v) => v.url);
            await playCrawler.run(urls);
        } catch (error) {
            console.log(error);
        }
    },
});

//code for playwright crawler here
I start the crawler by calling basicCrawler.run(['url']). The problem is that the URLs I pass to playCrawler seem to be queued to basicCrawler as well. How is that possible?
11 Replies
metropolitan-bronzeOP•3y ago
Also, the try/catch inside basicCrawler is triggered by errors from playCrawler.
harsh-harlequin•3y ago
So you are trying to run PlaywrightCrawler inside the handler of BasicCrawler? What is the use case for this? This is quite a wild construction. Maybe you are using the same default RequestQueue for both crawlers.
metropolitan-bronzeOP•3y ago
The use case would be calling an HTTP API and running PlaywrightCrawler on its results.
metropolitan-bronzeOP•3y ago
What I don't understand is that the URLs I pass to PlaywrightCrawler get queued to basicCrawler as well. How is that possible?
Pepa J•3y ago
That is because there is only one default RequestQueue related to the run. Since you didn't specify any requestQueue in the constructors for the crawlers, they are both using the same default one. You may need to create another, named RequestQueue for one of those crawlers.
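A minimal sketch of the fix, assuming the Playwright crawler is the playCrawler referenced in the question (the queue name 'playwright-queue' is arbitrary):

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Open a separate, named queue so playCrawler no longer shares
// the default RequestQueue with basicCrawler.
const playwrightQueue = await RequestQueue.open('playwright-queue');

export const playCrawler = new PlaywrightCrawler({
    requestQueue: playwrightQueue,
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}`);
        // ...scraping logic here
    },
});

With this, playCrawler.run(urls) enqueues into its own named queue, and basicCrawler's default queue is left untouched.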
metropolitan-bronzeOP•3y ago
Is there a way to limit the number of tabs in a window? Like using separate windows with one tab each?
Pepa J•3y ago
maxConcurrency: 4,
useSessionPool: true,
browserPoolOptions: {
    maxOpenPagesPerBrowser: 2,
},
In the PlaywrightCrawler constructor, this is probably what you are looking for. It should use two browsers, each with two tabs.
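A sketch of those options in context, reusing the playCrawler constructor from above:

export const playCrawler = new PlaywrightCrawler({
    maxConcurrency: 4,      // at most 4 pages processed in parallel
    useSessionPool: true,
    browserPoolOptions: {
        // 4 concurrent pages / 2 pages per browser = 2 browser windows
        maxOpenPagesPerBrowser: 2,
    },
    async requestHandler({ page }) {
        // ...scraping logic here
    },
});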
metropolitan-bronzeOP•3y ago
Thanks a lot. Is there a way to put a delay between two requests? Currently Crawlee opens the URLs almost at the same time.
Pepa J•3y ago
You would probably need to implement your own logic in a pre-navigation hook: https://docs.apify.com/sdk/js/docs/2.3/typedefs/puppeteer-crawler-options#prenavigationhooks This is a very poor implementation, but you may get the idea:
// Module-level counter that increases with each request.
let requestNumber = 0;

function increment() {
    requestNumber += 1;
    return requestNumber;
}
and then in the PlaywrightCrawler constructor define something like:
postNavigationHooks: [
    async ({ page }) => {
        // 1_000 ms = 1 s; the delay grows with each request.
        await page.waitForTimeout(increment() * 1_000);
    },
],
metropolitan-bronzeOP•3y ago
Can you take a look at this? https://discord.com/channels/801163717915574323/1076083814817869854 What's the use case of storing results in separate files?
