other-emerald
AdaptivePlaywrightCrawler starts crawling the whole web at some point.
I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web.
Here is my code.
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
        console.log(request.url, request.uniqueKey);
        // No options passed, so the default enqueueing strategy applies.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
4 Replies
other-emeraldOP•5mo ago
By the way, I've also tried strategy: 'same-domain' and strategy: 'same-origin', but the result was the same.
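For readers unsure what these strategies are supposed to restrict, here is a minimal sketch of the difference between a hostname check and an origin check, using only the standard URL class. It is not Crawlee's implementation; `staysInScope` and the `Strategy` type are hypothetical names made up for illustration, and 'same-domain' is left out because it needs public-suffix data.

```typescript
// Illustrative only: approximates what two enqueueLinks strategies are meant
// to allow. This is NOT Crawlee's code; staysInScope is a hypothetical helper.
type Strategy = 'same-hostname' | 'same-origin';

function staysInScope(pageUrl: string, linkUrl: string, strategy: Strategy): boolean {
    const page = new URL(pageUrl);
    // Resolve relative hrefs against the page they were found on.
    const link = new URL(linkUrl, pageUrl);
    if (strategy === 'same-origin') {
        // Origin = scheme + hostname + port.
        return link.origin === page.origin;
    }
    // Hostname only: scheme and port may differ.
    return link.hostname === page.hostname;
}

const currentPage = 'https://crawlee.dev/docs/introduction';
console.log(staysInScope(currentPage, '/api/core', 'same-hostname')); // true
console.log(staysInScope(currentPage, 'http://crawlee.dev/', 'same-origin')); // false (scheme differs)
console.log(staysInScope(currentPage, 'https://github.com/apify/crawlee', 'same-hostname')); // false
```

If a crawler with one of these strategies still leaves the site, logging each enqueued URL next to a check like this one can help show which links escape the expected scope.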
continuing-cyan•5mo ago
Hi, you have to restrict crawling to specific domains or URL patterns, for example:
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
        await enqueueLinks({
            // Only enqueue links whose URL starts with https://crawlee.dev
            pseudoUrls: ['https://crawlee.dev[.*]'],
        });
    },
});

await crawler.run(['https://crawlee.dev']);
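To see which URLs a pattern like 'https://crawlee.dev[.*]' lets through, here is a rough sketch. It assumes the usual pseudo-URL reading, where text inside square brackets is a regular expression and everything else is matched literally. `pseudoUrlToRegExp` is a hypothetical helper written for this post, not part of Crawlee, and it ignores nested brackets and case handling.

```typescript
// Simplified, hypothetical conversion of a pseudo-URL into a RegExp.
// Assumption: [ ... ] holds a regular expression, the rest is literal text.
function pseudoUrlToRegExp(pseudoUrl: string): RegExp {
    let pattern = '';
    let i = 0;
    while (i < pseudoUrl.length) {
        if (pseudoUrl.charAt(i) === '[') {
            const end = pseudoUrl.indexOf(']', i);
            if (end === -1) throw new Error('Unclosed [ in pseudo-URL');
            // Keep the bracketed part as a regex group.
            pattern += `(${pseudoUrl.slice(i + 1, end)})`;
            i = end + 1;
        } else {
            // Escape regex metacharacters in literal text.
            pattern += pseudoUrl.charAt(i).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
            i += 1;
        }
    }
    return new RegExp(`^${pattern}$`);
}

const matcher = pseudoUrlToRegExp('https://crawlee.dev[.*]');
console.log(matcher.test('https://crawlee.dev/docs/quick-start')); // true
console.log(matcher.test('https://github.com/apify/crawlee')); // false
console.log(matcher.test('https://crawlee.dev.example.com/')); // true
```

The last case shows that, in this sketch at least, a bare prefix pattern also accepts hosts that merely begin with crawlee.dev. Ending the literal part with a slash, as in 'https://crawlee.dev/[.*]', closes that gap.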
other-emeraldOP•5mo ago
Hey, thanks! That kinda works, but I thought the default behavior of enqueueLinks was to stay on the same hostname. It's clearly mentioned in the documentation: https://crawlee.dev/docs/introduction/adding-urls