other-emerald

AdaptivePlaywrightCrawler starts crawling the whole web at some point.

I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web. Here is my code.

const crawler = new AdaptivePlaywrightCrawler({
      renderingTypeDetectionRatio: 0.1,
      maxRequestsPerCrawl: 50,
      async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
        console.log(request.url, request.uniqueKey);
        await enqueueLinks();
      }
});

crawler.run(['https://crawlee.dev']);

4 Replies

Hall•5mo ago

Someone will reply to you shortly. In the meantime, this might help:

other-emeraldOP•5mo ago

By the way, I've tried with strategy: 'same-domain', 'same-origin' but the result was the same.

continuing-cyan•5mo ago

Hi you have to restrict crawling to specific domains or patterns


import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
        
        await enqueueLinks({
            pseudoUrls: ['https://crawlee.dev[.*]'], });
    },
});

await crawler.run(['https://crawlee.dev']);

other-emeraldOP•5mo ago

Hey, thanks! That kinda works, but I thought the default behavior of enqueueLinks was to stay on the same hostname. It's clearly mentioned in the documentation: https://crawlee.dev/docs/introduction/adding-urls

Adding more URLs | Crawlee · Build reliable crawlers. Fast.

Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.

Gaming

Programming

AdaptivePlaywrightCrawler starts crawling the whole web at some point.

Did you find this page helpful?