Download list of url options

How do I crawl nested XML sitemaps (an XML file that links to further XML files) and collect the links from them? Any suggestions on doing this with CheerioCrawler would help me move on. #crawlee
3 Replies
complex-teal · 3y ago
XML is parsed by CheerioCrawler out of the box, so you can work with it the same way you work with HTML.
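To make that concrete, here is a minimal sketch of extracting the `<loc>` entries from a sitemap-style XML string. Inside a real CheerioCrawler request handler you would use the provided cheerio `$` instead (e.g. `$('loc').each(...)`); the regex below is a simplified, dependency-free stand-in for illustration only.

```javascript
// Simplified stand-in for cheerio's $('loc') selection: pull the text of
// every <loc> element out of a sitemap-style XML string with a regex.
function extractLocUrls(xml) {
  const urls = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = re.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sitemap = `
<urlset>
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/sub-sitemap.xml</loc></url>
</urlset>`;

console.log(extractLocUrls(sitemap));
// [ 'https://example.com/page-1', 'https://example.com/sub-sitemap.xml' ]
```

Note that the second entry is itself a `.xml` URL, which is exactly the nested-sitemap case the question asks about.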
rising-crimson (OP) · 3y ago
Is there any documentation available? Thanks.
wise-white · 3y ago
I just did a quick workaround for this (only two levels at the moment): get the list of URLs, loop through them, and if we know a URL is XML, run downloadListOfUrls on it as well. Then enqueue all of those links into the same requestQueue:
const parsedUrl = new URL(url);
const urls = await downloadListOfUrls({
    url,
});
for (const childUrl of urls) {
    // Check whether the entry is itself a sitemap (.xml) and, if so,
    // download its links too. This only handles two levels of nesting -
    // it might be worth making a recursive function.
    if (childUrl.search(/\.xml/gi) !== -1) {
        const nestedUrls = await downloadListOfUrls({
            url: childUrl,
        });
        await enqueueLinks({
            urls: nestedUrls,
            requestQueue,
            baseUrl: parsedUrl.origin,
            strategy: "same-domain",
        });
    }
}
// Filter the nested sitemap URLs out before enqueueing, in case the whole
// sitemap was nothing but nested sitemaps.
await enqueueLinks({
    urls: urls.filter((u) => u.search(/\.xml/gi) === -1),
    requestQueue,
    baseUrl: parsedUrl.origin,
    strategy: "same-domain",
});