xenial-black

Apify not queueing my apify links

import { Actor } from 'apify';
import { CheerioCrawler, downloadListOfUrls, EnqueueStrategy } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    // Function called for each URL
    async requestHandler({ request, enqueueLinks }) {
        console.log(request.url);
        await enqueueLinks({
            globs: ['https://www.something/produto/*'],
        });
    },
});

const listOfUrls = await downloadListOfUrls({ url: 'https://www.something/sitemap_index.xml' });

await crawler.addRequests(listOfUrls);
await crawler.run();

await Actor.exit();
Why does this not enqueue my links?
3 Replies
conscious-sapphire (3y ago)
It's hard to say without seeing the actual URLs, but looking at this snippet, https://www.something/produto is not a valid URL, and neither is https://www.something/sitemap_index.xml. Also keep in mind that by default enqueueLinks only enqueues links with the same hostname as the current page/request. You could try changing it to strategy: 'all', see here: https://crawlee.dev/api/core/interface/EnqueueLinksOptions#strategy
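A minimal sketch of the strategy suggestion above, reusing the placeholder glob from the question. The options are plain data, so they are shown on their own; the enqueueOptions name is only illustrative, and the commented call shows where the object would be passed.

```javascript
// Options for enqueueLinks(); `enqueueOptions` is a hypothetical name.
const enqueueOptions = {
    // Only URLs matching this glob are enqueued (placeholder host from the question).
    globs: ['https://www.something/produto/*'],
    // 'all' lifts the default same-hostname restriction.
    strategy: 'all',
};

// Inside requestHandler it would be passed like this:
// await enqueueLinks(enqueueOptions);
```

With strategy: 'all', the glob is what keeps the crawl on the intended site, so it is worth making it as specific as possible.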
xenial-black (OP, 3y ago)
strategy appears to work fine, but how can I apply it to my URL? same-domain appears to crawl more than my domain. @Andrey Bykov Does enqueueLinks only query the a selector by default? What I want is this: the main sitemap_index.xml points to other sitemaps, and I basically want a recursive crawler that follows them automatically.
conscious-sapphire (3y ago)
By default it uses the a selector, yes. You would not be able to use enqueueLinks with a sitemap and cheerio because, well, there are no links: the URLs are only text inside loc elements. If you used a browser instead, the sitemap should be rendered into a elements with proper hrefs, and enqueueLinks would then work. If you still want to use cheerio, grab the URLs from the response manually and then call crawler.addRequests([<your_urls_here>])
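A rough sketch of the manual approach, assuming the sitemap is plain XML with the URLs inside loc elements. The helper names extractLocUrls and splitSitemapUrls are hypothetical, and the regex is a simplification that does not handle CDATA sections.

```javascript
// Hypothetical helper: collect the text of every <loc> element in a sitemap.
function extractLocUrls(xml) {
    const urls = [];
    const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
    for (const match of xml.matchAll(locPattern)) {
        // Sitemaps escape "&" as "&amp;", so undo that before using the URL.
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Hypothetical helper: separate nested sitemaps from ordinary pages.
// Assumes nested sitemaps end in ".xml", which may not hold for every site.
function splitSitemapUrls(urls) {
    return {
        sitemaps: urls.filter((url) => url.endsWith('.xml')),
        pages: urls.filter((url) => !url.endsWith('.xml')),
    };
}

// Possible use inside requestHandler({ request, body }):
// const { sitemaps, pages } = splitSitemapUrls(extractLocUrls(body.toString()));
// await crawler.addRequests([...sitemaps, ...pages]);
```

Because the nested sitemap URLs go back into the same queue, the handler sees them on a later request and extracts their loc entries in turn, which gives the recursive behaviour asked for.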
