Href inside of a data-href attribute
So, as the title says: I have a specific case where the link to the next page is not in the href attribute of an anchor tag but in a (custom?) data-href attribute of a button element. Is there a way to enqueue this URL with the selector parameter of enqueueLinks, or is the only way to pass it to the urls array of enqueueLinks?
Cheerio? Puppeteer? Playwright?
quickest-silverOP•3y ago
Playwright. Sorry for omitting that piece of information.
You can use parseWithCheerio for ease of use, and read the attribute with: $('selector').attr('data-href')
And pass those in an array to your crawler
conscious-sapphire•3y ago
Hi @FlowGravity, you can't do this using enqueueLinks, as this function only works with the href attribute. You can parse the links as NeoNomade says and pass them to the addRequests method on the crawler or the request queue.
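A minimal sketch of that approach, in case it helps. The helper name toAbsoluteUrls and the selector 'button[data-href]' are assumptions for illustration, not something from the site in question. The helper resolves each data-href value against the page URL (in case the attribute holds a relative path) and drops empty values and duplicates; the commented lines show how the result might be handed to addRequests.

```typescript
// Hypothetical helper (not part of Crawlee): turn raw data-href attribute
// values into absolute, de-duplicated URLs.
function toAbsoluteUrls(
    rawValues: Array<string | null | undefined>,
    baseUrl: string,
): string[] {
    const seen = new Set<string>();
    for (const raw of rawValues) {
        const value = raw?.trim();
        if (!value) continue; // attribute missing or empty
        try {
            // Relative values are resolved against the page URL.
            seen.add(new URL(value, baseUrl).href);
        } catch (_error) {
            // Skip values that cannot be parsed as a URL.
        }
    }
    return Array.from(seen);
}

// Possible use inside a PlaywrightCrawler requestHandler (not executed here;
// the selector 'button[data-href]' is an assumption about the target page):
//
//   const rawValues = await page.$$eval('button[data-href]', (buttons) =>
//       buttons.map((button) => button.getAttribute('data-href')));
//   await crawler.addRequests(
//       toAbsoluteUrls(rawValues, request.loadedUrl ?? request.url));
```

The same array could also be passed as the urls option of enqueueLinks, as mentioned in the original question.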
quickest-silverOP•3y ago
Thanks. The proposed solutions work but they just aren't solving the issue I have. My guess is that it is probably due to bad proxies I'm using.
How are they not solving the issue? I don't understand how proxies are related to selectors.
quickest-silverOP•3y ago
Sorry, it's been a busy day. Let me try to give you the complete context. I'm scraping a site that lists ads (real estate, instruments, vehicles, etc.). I have managed to extract all the data I need, but now I'm trying to scale up the number of ads scraped. I'm afraid the site might block my IP address (or my server's address), so I'm trying free proxies to see how far I can go. While running the crawler with these free proxies I often get timed out, and the crawl ends without reaching the desired number of requests because it times out when searching for the next button (which I explained in my original post).
oh ok.
Free proxies are not the best option; most of the time they are blocked from the beginning.