23 Replies
sensitive-blueOP•3y ago
@Helper Even though it was working before, when I add a new link it doesn't work
sensitive-blueOP•3y ago

sensitive-blueOP•3y ago
What could possibly be going wrong?
I also tried the other method, which is to pass an array of URLs into
crawler.run()
directly, but I got the same error.
deep-jade•3y ago
And which url worked for you?
provincial-silver•3y ago
I think some addresses don't allow themselves to be crawled.
Try different URLs; if it works for one, then it can work for others too.
deep-jade•3y ago
I guess that target URL might be a JSON API endpoint. Try adding
application/json
to https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#additionalMimeTypes. (This won't fix your error, but you could then possibly read the data from the response object.) If it is JSON, I would suggest using https://crawlee.dev/api/http-crawler instead.
sensitive-blueOP•3y ago
I tried two URLs:
crawlee.dev
+ github.com
Plus, now I face a new problem.
I want to crawl through search engines
like Google and Bing
and crawl all of the links that appear in the search results.
https://google.com/search?q=restaurants
When feeding in this URL and setting maxRequestsPerCrawl
to any number, it just sends only one request.
deep-jade•3y ago
Seems like the docs are a bit outdated; you can read JSON data from the context object
({ json })
without passing the JSON MIME type when using CheerioCrawler.
That option is not what you think it is. It sends one request because the URL itself is the unique key. maxRequestsPerCrawl
is a safeguard that stops the crawler once it has processed more requests than this option allows.
sensitive-blueOP•3y ago
I know that
await enqueueLinks()
is what lets it crawl more than one request
right?
Setting that, I would expect to get 20 links from the Google search results.
But what does it stop, then?
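deep-jade's point above, that one URL yields only one request because the URL itself is the unique key, can be sketched with a toy stand-in for the request queue's deduplication. This is not Crawlee's actual implementation (which also normalizes URLs before comparing), just an illustration of the behavior:

```javascript
// Toy stand-in for request-queue deduplication (NOT Crawlee's actual
// code): each request's uniqueKey defaults to its URL, so adding the
// same URL twice yields a single queued request.
function dedupeByUniqueKey(requests) {
    const seen = new Set();
    return requests.filter(({ url, uniqueKey }) => {
        const key = uniqueKey ?? url;
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
}

// One start URL → one request, no matter what maxRequestsPerCrawl is;
// that option only caps how many queued requests get processed.
const queued = dedupeByUniqueKey([
    { url: 'https://google.com/search?q=restaurants' },
    { url: 'https://google.com/search?q=restaurants' },
]);
console.log(queued.length); // → 1
```

This is why the crawler only grows its queue when something (like enqueueLinks) adds requests with distinct URLs or distinct uniqueKeys.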
deep-jade•3y ago
Sorry, you didn't specify that you were using enqueueLinks; honestly, I have no idea. I've never parsed Google myself, since there is an Apify scraper for that.
Most likely it detects Cheerio immediately; try using browser-based crawlers if you want to implement it yourself.
sensitive-blueOP•3y ago
I am trying a browser-based crawler
deep-jade•3y ago
It has something to do with https://crawlee.dev/docs/examples/crawl-relative-links, but with the default
EnqueueStrategy
it should have crawled at least the Google links. If you want to scrape the Google search result URLs (and not crawl them), you need to collect them from the page using selectors.
deep-jade•3y ago
I see. You need to start with
'https://www.google.com/search?q=restaurants'
since Google redirects to that page from 'https://google.com/search?q=restaurants'.
Or use the SameDomain
strategy to enqueue all links on the Google domain. But I don't think that is what you want to achieve.
A naive implementation of a crawler that walks through the search result pages and also enqueues URLs from each result page might look like this:
sensitive-blueOP•3y ago
Ah, now I received this error:
CheerioCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 429 status code.
{"id":"lbvAGmHKVGPGH6n","url":"https://google.com/search?q=restaurants","retryCount":2}
I think Google is showing a captcha.
deep-jade•3y ago
You need to use SERP proxies
sensitive-blueOP•3y ago
ok, let me research that
deep-jade•3y ago
Btw, it is required to start the Google URL with
www.
when using SERP proxies.
sensitive-blueOP•3y ago
why is it like that?
deep-jade•3y ago
Google SERP proxy | Apify Documentation
Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany.
mute-gold•3y ago
To crawl the same URL again, the recommended approach is to add the request as
{ url, uniqueKey: [GENERATE_RANDOM_KEY_OR_USE_COUNTER] }
since when you add an anchor like #COUNTER
it is actually in-page navigation (for a browser it means the same page is opened and the content scrolled to the #anchor).
Regarding the Google search: save a snapshot if you are opening the page(s) with a browser-based crawler, or save the body
under Cheerio, then check the actual content available to the scraper at runtime. If you are not getting links, it means the bot is blocked in one way or another.
eastern-cyan•3y ago
btw: for debugging, just store the HTML to the KV store to see what was loaded; then you can tell whether it was HTML, JSON, or text