Cannot EnqueueLinks with Globs
The crawler starts with the sitemap.xml of a website, and I'm trying to enqueue all the links inside the XML with globs.
I have tested the glob pattern and it seems to match the URLs in the XML I'm reading, but enqueueLinks doesn't add any links.
What am I doing wrong?
4 Replies
correct-apricotOP•3y ago
Actually it doesn't enqueue any links even without the globs.
It just can't find links inside an XML file, although the content of the XML is loaded properly.
exotic-emerald•3y ago
Are there <a> tags with href attributes in the XML? I think the enqueueLinks function searches for those.
A normal sitemap looks like this: https://www.root.cz/sitemap/sitemap.xml
If you inspect the elements, you can see there are no <a> tags, just divs and spans, so you will need to implement your own logic to parse the XML and get the links that way.
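A minimal sketch of that kind of custom parsing, assuming the sitemap follows the standard format where each URL sits inside a <loc> element. The names extractSitemapUrls and decodeXmlEntities are hypothetical helpers, not part of Crawlee, and the optional RegExp filter is only a stand-in for a glob pattern:

```typescript
// XML entities that sitemaps use to escape special characters in URLs.
const XML_ENTITIES: Record<string, string> = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'",
};

// Hypothetical helper: turn "&amp;" and friends back into plain characters.
function decodeXmlEntities(text: string): string {
  return text.replace(
    /&(amp|lt|gt|quot|apos);/g,
    (whole: string, name: string) => XML_ENTITIES[name] ?? whole,
  );
}

// Hypothetical helper (not part of Crawlee): collect the text of every
// <loc> element in a sitemap XML string, optionally keeping only the
// URLs that match `filter`.
function extractSitemapUrls(xml: string, filter?: RegExp): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const url = decodeXmlEntities(match[1]);
    // String.prototype.search ignores the regex's lastIndex, so a filter
    // with the "g" flag still behaves the same on every call.
    if (!filter || url.search(filter) >= 0) {
      urls.push(url);
    }
  }
  return urls;
}

// Example with a made-up sitemap: keeps only the URL under /blog/.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post?x=1&amp;y=2</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;
console.log(extractSitemapUrls(sampleXml, /\/blog\//));
```

The resulting array of strings could then be handed to the crawler instead of calling enqueueLinks, for example with crawler.addRequests(urls), assuming a Crawlee version that exposes that method. A regex is used here only to keep the sketch dependency-free; a proper XML parser would be more robust for unusual sitemaps (CDATA sections, sitemap index files).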
correct-apricotOP•3y ago
Ok. I get it.
Crawling the sitemap is such an obvious, easy way to find all URLs that I don't get why it's not easily supported by Crawlee.
How to scrape from sitemaps | Apify Documentation
The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.