PlaywrightCrawler New Instance unexpected result

Hi guys, I'm new to Crawlee. I wrapped the sample code in a function: each time getAvailableURLs is called, a new PlaywrightCrawler instance is created and used to crawl the provided URL. Source code:
```ts
import { PlaywrightCrawler } from 'crawlee';

const getAvailableURLs = async (
    url: string,
    maxRequestsPerCrawl: number,
    maxRequestRetries: number,
    strategy: "all" | "same-hostname" | "same-domain" | "same-origin",
) => {
    const availableUrls: string[] = [];

    const crawler = new PlaywrightCrawler({
        maxRequestsPerCrawl,
        maxRequestRetries,
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, enqueueLinks }) {
            // Array.prototype.push is synchronous, so no await is needed.
            availableUrls.push(request.loadedUrl ?? "");

            await enqueueLinks({ strategy });
        },
    });

    await crawler.run([url]);

    return availableUrls;
};

let availableUrls = await getAvailableURLs('https://crawlee.dev/', 2, 1, "same-hostname");
console.info(`availableUrls length: ${availableUrls.length} result: ${availableUrls}`);

availableUrls = await getAvailableURLs('https://cheerio.js.org/', 2, 1, "same-hostname");
console.info(`availableUrls length: ${availableUrls.length} result: ${availableUrls}`);
```
Result:

1st Crawl: INFO PlaywrightCrawler: Terminal status message: Finished! Total 3 requests: 3 succeeded, 0 failed.
2nd Crawl: INFO PlaywrightCrawler: Terminal status message: Finished! Total 0 requests: 0 succeeded, 0 failed.

Expected Result:

1st Crawl: INFO PlaywrightCrawler: Terminal status message: Finished! Total 3 requests: 3 succeeded, 0 failed.
2nd Crawl: INFO PlaywrightCrawler: Terminal status message: Finished! Total 3 requests: 3 succeeded, 0 failed.

Question: How do I achieve the expected result while still being able to customize strategy, maxRequestsPerCrawl, and maxRequestRetries by passing them in as parameters?
xenial-black · 3y ago
I'd say the problem is that you're using the same request queue: both crawlers share the same default queue. The first call processes the requests, and by the time the second call runs, the queue already contains those processed requests, so it just shuts the crawler down. You could open the queue explicitly before creating the crawler instance and drop it after the crawler run; in that case the second call should go through.
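
Here's a minimal sketch of that suggestion, using Crawlee's RequestQueue.open() and drop(); the queue name 'get-available-urls' is just an arbitrary example:

```ts
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const getAvailableURLs = async (
    url: string,
    maxRequestsPerCrawl: number,
    maxRequestRetries: number,
    strategy: "all" | "same-hostname" | "same-domain" | "same-origin",
) => {
    const availableUrls: string[] = [];

    // Open a dedicated queue instead of relying on the shared default one.
    // The name 'get-available-urls' is an arbitrary choice for this sketch.
    const requestQueue = await RequestQueue.open('get-available-urls');

    const crawler = new PlaywrightCrawler({
        requestQueue,
        maxRequestsPerCrawl,
        maxRequestRetries,
        async requestHandler({ request, enqueueLinks }) {
            availableUrls.push(request.loadedUrl ?? "");
            await enqueueLinks({ strategy });
        },
    });

    await crawler.run([url]);

    // Drop the queue so the next call starts from a clean state.
    await requestQueue.drop();

    return availableUrls;
};
```

With the queue dropped after each run, the second call should start from an empty queue and report the same totals as the first.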
