3 Replies
conscious-sapphire•3y ago
It's hard to say without seeing the actual URLs, but looking at this snippet, https://www.something/produto is not a valid URL, and neither is https://www.something/sitemap_index.xml. Also keep in mind that by default enqueueLinks only enqueues links with the same hostname as the current page/request. You could try changing that with strategy: 'all' - see here: https://crawlee.dev/api/core/interface/EnqueueLinksOptions#strategy
xenial-blackOP•3y ago
strategy appears to work fine, how can I apply it to my URL?
same-domain appears to crawl more than my domain?
@Andrey Bykov Does enqueueLinks only query the a selector by default?
What I want to do is this: the main sitemap_index.html points to other sitemaps, and I basically want a recursive crawler that follows them automatically.
conscious-sapphire•3y ago
By default it's using the a selector, yes. You would not be able to use enqueueLinks with a sitemap and cheerio because, well, there are no links - the URLs are only text inside the loc elements. If you use the browser, though, it should be rendered into a elements with proper hrefs, and thus enqueueLinks will work. If you still want to use cheerio, grab the URLs from the response body manually and then use crawler.addRequests([<your_urls_here>]).
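
For the manual route, here is a rough sketch of what "grab the URLs manually" could look like. The helpers extractLocUrls and isSitemapUrl are hypothetical names, not part of Crawlee, and the example.com URLs are made up. It assumes a plain, uncompressed sitemap without CDATA sections, and it uses a regex rather than a real XML parser to stay dependency-free.

```typescript
// Hypothetical helper (not a Crawlee API): collect the text of every <loc> element.
function extractLocUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps XML-escape "&" as "&amp;", so undo that before queueing the URL.
    urls.push(match[1].replace(/&amp;/g, '&'));
  }
  return urls;
}

// Hypothetical helper: decide whether a URL is itself a sitemap, so the same
// handler can process it again (this is what makes the crawl recursive).
function isSitemapUrl(url: string): boolean {
  return /\.xml($|\?)/i.test(url);
}

// Made-up sitemap index, for illustration only.
const sampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>`;

const nestedSitemaps = extractLocUrls(sampleIndex).filter(isSitemapUrl);
console.log(nestedSitemaps);

// Inside a CheerioCrawler requestHandler this would be used roughly like so
// (assumption: `body` and `crawler` are taken from the handler's context):
//
//   if (isSitemapUrl(request.url)) {
//     await crawler.addRequests(extractLocUrls(body.toString()));
//   }
```

Since nested sitemaps go back through the same handler, the index, the child sitemaps, and the final pages can all be handled by one crawler; only the non-sitemap URLs would then fall through to the normal page-scraping branch.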