Cannot EnqueueLinks with Globs

The crawler starts with the sitemap.xml of a website, and I'm trying to enqueue all the links inside the XML with globs.
await enqueueLinks({
    globs: ["https://website.com/product/*"],
});
I have tested the glob pattern and it seems to match the URLs inside the XML I'm reading, but enqueueLinks doesn't add any links. What am I doing wrong?
4 Replies
correct-apricot (OP) • 3y ago
Actually, it doesn't enqueue any links even without the globs. It just can't find links inside an XML file, although the content of the XML is loaded properly.
exotic-emerald • 3y ago
Are there <a> tags with href attributes in the XML? I think the enqueueLinks function searches for those. A normal sitemap looks like this: https://www.root.cz/sitemap/sitemap.xml. If you inspect the elements, you can see there are just divs and spans, so you will need to implement your own logic to parse the XML and get the links that way.
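A minimal sketch of that "own logic" idea, assuming the sitemap body is already available as a string and follows the usual <url><loc>...</loc></url> layout. The helper name extractSitemapUrls, the sample XML, and the regular expression standing in for the glob are all made up for illustration. It does not handle CDATA sections or nested sitemap index files.

```javascript
// Hypothetical helper (not part of Crawlee): collect the text of every
// <loc> element in a sitemap XML string.
function extractSitemapUrls(xml) {
    const urls = [];
    // Match <loc>URL</loc>, allowing whitespace around the URL.
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
    for (const match of xml.matchAll(locPattern)) {
        // Sitemaps escape "&" as "&amp;", so undo that before using the URL.
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Made-up sample data standing in for the downloaded sitemap.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://website.com/product/red-shoes</loc></url>
  <url><loc>https://website.com/about</loc></url>
  <url><loc>https://website.com/product/blue-hat?ref=a&amp;b=1</loc></url>
</urlset>`;

// Rough stand-in for the glob "https://website.com/product/*":
// anything directly under /product/ with no further slash.
const productPattern = /^https:\/\/website\.com\/product\/[^/]*$/;

const productUrls = extractSitemapUrls(sampleXml).filter((url) => productPattern.test(url));
console.log(productUrls);
```

Inside a crawler, a list like productUrls could then be passed to crawler.addRequests(productUrls) instead of relying on enqueueLinks to find <a> tags.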
correct-apricot (OP) • 3y ago
OK, I get it. Crawling the sitemap is such an obvious, easy way to find all URLs that I don't get why it's not more directly supported by Crawlee.
Pepa J • 3y ago
How to scrape from sitemaps | Apify Documentation
The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.
