like-gold

How to stop the crawl after a certain time?

I use Crawlee to warm up my caches, and I have set it to stop after 5000 pages. But crawling them sometimes takes more time and sometimes less, so I would prefer to tell it to stop the crawl after x minutes. Is that possible via the configuration?
2 Replies
like-gold (OP), 3y ago
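For the moment I did this:

// Resolves after `ms` milliseconds and logs that the time limit was reached.
const timeout = (ms: number) =>
    new Promise<void>((resolve) => {
        setTimeout(() => {
            console.log(`End after ${configuration.CRAWL_TIMEOUT} seconds`);
            resolve();
        }, ms);
    });

const run = async () => {
    const promesseCrawler = crawler.run([configuration.START_URL]);
    const promesseTimeout = timeout(configuration.CRAWL_TIMEOUT * 1000);
    // Whichever settles first wins: the crawl finishing or the time limit.
    await Promise.race([promesseCrawler, promesseTimeout]);
    // Exit the process so a crawl that is still running is cut off.
    process.exit(0);
};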
But I think you have a configuration for that?
probable-pink, 3y ago
You can try:
1. Use the abort() method of the AutoscaledPool class: https://crawlee.dev/api/next/core/class/AutoscaledPool#abort
2. Or set maxRequestsPerCrawl: https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#maxRequestsPerCrawl
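A rough sketch of how the first suggestion could be combined with a timer, in case it helps. The helper name `runWithTimeLimit` and the `CrawlerLike` / `AbortablePool` types are made up for illustration; they only describe the two things the sketch relies on, namely a `run()` method and an optional `autoscaledPool` with `abort()`. It assumes the crawler creates its pool once `run()` has started, so the pool should exist by the time the timer fires. Treat it as a starting point, not as an official Crawlee recipe.

```typescript
// Structural types covering only the parts of a crawler this sketch uses.
// They are assumptions for illustration, not Crawlee's own type definitions.
interface AbortablePool {
    abort(): Promise<void>;
}

interface CrawlerLike {
    autoscaledPool?: AbortablePool;
    run(requests?: string[]): Promise<unknown>;
}

// Hypothetical helper: starts the crawl and aborts the pool after `limitMs`.
// Resolves to true if the time limit fired, false if the crawl ended first.
async function runWithTimeLimit(
    crawler: CrawlerLike,
    startUrls: string[],
    limitMs: number,
): Promise<boolean> {
    let timedOut = false;
    const timer = setTimeout(() => {
        timedOut = true;
        // If the pool does not exist yet, the optional chain makes this a no-op.
        crawler.autoscaledPool?.abort().catch(() => {});
    }, limitMs);
    try {
        // run() is expected to settle once the pool has been aborted.
        await crawler.run(startUrls);
    } finally {
        // Avoid a dangling timer when the crawl finishes before the limit.
        clearTimeout(timer);
    }
    return timedOut;
}
```

With a real crawler you would call something like `await runWithTimeLimit(crawler, [configuration.START_URL], configuration.CRAWL_TIMEOUT * 1000)`. Compared with `Promise.race` plus `process.exit(0)`, this lets `run()` return normally, so any code placed after it still gets a chance to execute.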
