CA
Crawlee & Apify5mo ago
other-emerald

AdaptivePlaywrightCrawler starts crawling the whole web at some point.

I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web. Here is my code. const crawler = new AdaptivePlaywrightCrawler({ renderingTypeDetectionRatio: 0.1, maxRequestsPerCrawl: 50, async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) { console.log(request.url, request.uniqueKey); await enqueueLinks(); } }); crawler.run(['https://crawlee.dev']);
4 Replies
Hall
Hall5mo ago
Someone will reply to you shortly. In the meantime, this might help:
other-emerald
other-emeraldOP5mo ago
By the way, I've tried with strategy: 'same-domain', 'same-origin' but the result was the same.
continuing-cyan
continuing-cyan5mo ago
Hi you have to restrict crawling to specific domains or patterns import { AdaptivePlaywrightCrawler } from 'crawlee'; const crawler = new AdaptivePlaywrightCrawler({ renderingTypeDetectionRatio: 0.1, maxRequestsPerCrawl: 50, async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) { await enqueueLinks({ pseudoUrls: ['https://crawlee.dev[.*]'], }); }, }); await crawler.run(['https://crawlee.dev']);
other-emerald
other-emeraldOP5mo ago
Hey, thanks! That kinda works, but I thought the default behavior of enqueueLinks was to stay on the same hostname. It's clearly mentioned in the documentation: https://crawlee.dev/docs/introduction/adding-urls
Adding more URLs | Crawlee · Build reliable crawlers. Fast.
Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.

Did you find this page helpful?