Href inside of a data-href attribute

As the title says, I have a case where the link to the next page is not in the href attribute of an anchor tag but in a (custom?) data-href attribute of a button element. Is there a way to enqueue this URL using the selector parameter of enqueueLinks, or is the only way to pass it in the urls array of enqueueLinks?
8 Replies
NeoNomade, 3y ago
Cheerio? Puppeteer? Playwright?
quickest-silver (OP), 3y ago
Playwright. Sorry for omitting that piece of information.
NeoNomade, 3y ago
You can use parseWithCheerio for ease of use, read the attribute with $('selector').attr('data-href'), and pass those values in an array to your crawler.
conscious-sapphire, 3y ago
Hi @FlowGravity, you can't do this with the selector option of enqueueLinks, as that function only reads the href attribute. You can parse the links as NeoNomade says and pass them to the addRequests method on the crawler or request queue.
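To make that suggestion concrete, here is a minimal sketch of the step that enqueueLinks would otherwise handle for href: turning raw data-href values into absolute, de-duplicated URLs before handing them to addRequests. The helper name resolveDataHrefs and the selector 'button[data-href]' are made up for illustration and are not part of Crawlee or the site in question.

```typescript
// Hypothetical helper (not a Crawlee API): converts raw data-href attribute
// values into absolute, de-duplicated URLs that can be passed to addRequests.
function resolveDataHrefs(
    rawValues: Array<string | null | undefined>,
    baseUrl: string,
): string[] {
    const resolved = new Set<string>();
    for (const raw of rawValues) {
        const value = raw?.trim();
        if (!value) continue; // attribute missing or empty
        try {
            // new URL() resolves relative paths against the page URL.
            resolved.add(new URL(value, baseUrl).href);
        } catch {
            // Skip values that cannot be parsed as a URL.
        }
    }
    return Array.from(resolved);
}

// Possible use inside a PlaywrightCrawler requestHandler, assuming the
// next-page button matches the made-up selector 'button[data-href]':
//
//   const raw = await page.$$eval('button[data-href]', (els) =>
//       els.map((el) => el.getAttribute('data-href')));
//   await crawler.addRequests(
//       resolveDataHrefs(raw, request.loadedUrl ?? request.url));
```

The same array could also go into the urls option of enqueueLinks, as mentioned in the original question. Either way the attribute has to be read manually, since the selector option alone will not pick it up.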
quickest-silver (OP), 3y ago
Thanks. The proposed solutions work, but they don't solve the issue I actually have. My guess is that it is due to the bad proxies I'm using.
NeoNomade, 3y ago
How are they not solving the issue? I don't understand how proxies are related to selectors.
quickest-silver (OP), 3y ago
Sorry, it's been a busy day. Let me try to give you the complete context. I'm scraping a site that lists ads (real estate, instruments, vehicles, etc.). I have managed to extract all the data I need, but now I'm trying to scale up the number of ads scraped. I'm afraid the site might block my IP address (or my server's address), so I'm trying free proxies to see how far I can go. When I run the crawler with these free proxies, requests often time out while searching for the next button (the one I described in my original post), so the crawl ends without reaching the desired number of requests.
NeoNomade, 3y ago
Oh, OK. Free proxies are not the best option; most of the time they are already blocked from the start.
