Download list of url options

How do I crawl nested XML sitemaps (an XML file that links to further XML files) and collect the links from them? Any suggestions on doing this with CheerioCrawler would help me move on. #crawlee
3 Replies
complex-teal · 3y ago
XML is parsed by CheerioCrawler out of the box, so you can work with it the same way you work with HTML.
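To make that concrete, here is a minimal sketch of extracting the `<loc>` entries from a sitemap-style XML string. Inside a real CheerioCrawler request handler you would use the provided cheerio `$` instead (e.g. `$('loc').each(...)`); the regex below is a simplified, dependency-free stand-in for illustration only.

```javascript
// Simplified stand-in for cheerio's $('loc') selection: pull the text of
// every <loc> element out of a sitemap-style XML string with a regex.
function extractLocUrls(xml) {
  const urls = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = re.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sitemap = `
<urlset>
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/sub-sitemap.xml</loc></url>
</urlset>`;

console.log(extractLocUrls(sitemap));
// [ 'https://example.com/page-1', 'https://example.com/sub-sitemap.xml' ]
```

Note that the second entry is itself a `.xml` URL, which is exactly the nested-sitemap case the question asks about.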
rising-crimson (OP) · 3y ago
Is there any documentation available? Thanks.
wise-white · 3y ago
I just did a quick workaround for this (only two levels at the moment): get the list of URLs, loop through them, and if we know a URL is XML, run downloadListOfUrls on it as well. Then enqueue all of those links into the same requestQueue:
const parsedUrl = new URL(url);
const urls = await downloadListOfUrls({
    url,
});
for (const childUrl of urls) {
    // Check whether the entry is itself a sitemap (.xml) and, if so,
    // download its links too. This only handles two levels of nesting -
    // it might be worth making a recursive function.
    if (childUrl.search(/\.xml/gi) !== -1) {
        const nestedUrls = await downloadListOfUrls({
            url: childUrl,
        });
        await enqueueLinks({
            urls: nestedUrls,
            requestQueue,
            baseUrl: parsedUrl.origin,
            strategy: "same-domain",
        });
    }
}
// Filter the nested sitemap URLs out before enqueueing, in case the whole
// sitemap was nothing but nested sitemaps.
await enqueueLinks({
    urls: urls.filter((u) => u.search(/\.xml/gi) === -1),
    requestQueue,
    baseUrl: parsedUrl.origin,
    strategy: "same-domain",
});