Need help with Crawlee

I am getting the following error when crawling
23 Replies
sensitive-blue
sensitive-blueOP•3y ago
@Helper It was working before, but when I add a new link, it doesn't work.
sensitive-blue
sensitive-blueOP•3y ago
What could possibly go wrong? I also tried the other method, which is to pass an array of URLs to crawler.run() directly, but got the same error.
deep-jade
deep-jade•3y ago
And which url worked for you?
provincial-silver
provincial-silver•3y ago
I think some addresses don't allow themselves to be crawled. Try it with different URLs; if it works for one, it can work for others too.
deep-jade
deep-jade•3y ago
I guess that target URL might be a JSON API endpoint. Try adding application/json to https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#additionalMimeTypes. (This won't fix your error, but you could possibly read the data from the response object.) If it is JSON, I would suggest using https://crawlee.dev/api/http-crawler instead.
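For reference, a minimal sketch of that option on CheerioCrawler (the endpoint URL is just a placeholder, not from the thread):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Accept JSON responses in addition to HTML.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, body, contentType, log }) {
        // `body` holds the raw response; parse it yourself when it is JSON.
        log.info(`Got ${contentType.type} from ${request.loadedUrl}`);
    },
});

// Placeholder URL for a JSON endpoint.
await crawler.run(['https://example.com/api/data']);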
sensitive-blue
sensitive-blueOP•3y ago
I tried two URLs: crawlee.dev and github.com. Plus, now I face a new problem. I want to crawl through search engines like Google and Bing and crawl all of the links that appear in the search results, e.g. https://google.com/search?q=restaurants. When feeding this URL and setting maxRequestsPerCrawl to any number, it sends only one request.
deep-jade
deep-jade•3y ago
Seems like the docs are a bit outdated; you can read JSON data from the context object ({ json }) without passing the JSON MIME type when using the Cheerio crawler. That option is not what you think it is. It sends one request because the URL itself is the unique key; maxRequestsPerCrawl is a safeguard that will stop the crawler if it finds more URLs than are set in this option.
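A minimal sketch of reading the parsed JSON from the context, as described above (placeholder URL):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, json, log }) {
        // For JSON responses, the parsed body is exposed as `json` on the context.
        log.info(`Fetched ${request.loadedUrl}`, { data: json });
    },
});

// Placeholder URL for a JSON endpoint.
await crawler.run(['https://example.com/api/items']);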
sensitive-blue
sensitive-blueOP•3y ago
I know that await enqueueLinks() is what lets it crawl more than one request, right? Setting that, I would expect to get 20 links from the Google search result, but what does it stop? @yellott
deep-jade
deep-jade•3y ago
Sorry, you didn't specify you were using enqueueLinks; I honestly have no idea. I never parsed Google myself, since there is an Apify scraper for that. Most likely it detects Cheerio immediately; try using browser-based crawlers if you want to implement it yourself.
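A bare-bones browser-based crawler, for comparison (a sketch assuming Playwright is installed; the start URL is arbitrary):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, enqueueLinks, log }) {
        // `page` is a full Playwright page, so the site sees a real browser.
        log.info(`Title of ${request.loadedUrl}: ${await page.title()}`);
        // Enqueue links found on the page with the default strategy.
        await enqueueLinks();
    },
    // Safety limit for an example run.
    maxRequestsPerCrawl: 10,
});

await crawler.run(['https://crawlee.dev']);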
sensitive-blue
sensitive-blueOP•3y ago
I am trying a browser-based crawler
deep-jade
deep-jade•3y ago
There's something to do with https://crawlee.dev/docs/examples/crawl-relative-links, but with the default EnqueueStrategy it should have crawled at least the Google links. If you want to scrape the Google search result URLs (and not crawl them), you need to collect them from the page using selectors.
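Collecting the result URLs with selectors instead of enqueueing them could look roughly like this (a sketch; the selector is a placeholder and would need to match Google's current markup):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, pushData }) {
        // Grab outbound link hrefs from the page instead of enqueueing them.
        const urls = $('a[href^="http"]')
            .map((_, el) => $(el).attr('href'))
            .get();
        await pushData({ searchUrl: request.loadedUrl, urls });
    },
});

await crawler.run(['https://www.google.com/search?q=restaurants']);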
deep-jade
deep-jade•3y ago
I see. You need to start with 'https://www.google.com/search?q=restaurants', since Google redirects to that page from 'https://google.com/search?q=restaurants'. Or use the SameDomain strategy to enqueue all links on the Google domain, but I don't think that is what you want to achieve. A naive implementation of a crawler that walks through search result pages and also enqueues URLs from the search result pages might look like this:
import { CheerioCrawler, createCheerioRouter, EnqueueStrategy } from 'crawlee';

const startUrls = ['https://www.google.com/search?q=restaurants'];
const searchPageNavUrlSelector = 'div[role="navigation"] table a';
const searchResultsUrlSelector = 'div[id="search"] div[data-sokoban-container] a[data-ved]';

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, request }) => {
    log.info(`Search page`, { url: request.loadedUrl });

    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        selector: searchPageNavUrlSelector,
    });

    await enqueueLinks({
        strategy: EnqueueStrategy.All,
        selector: searchResultsUrlSelector,
        label: 'SEARCH_RESULT_URL',
    });
});

router.addHandler('SEARCH_RESULT_URL', async ({ request, log }) => {
    log.info(`Search result url:`, { url: request.loadedUrl });
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // This still is a safeguard only in this implementation.
    maxRequestsPerCrawl: 30,
});

await crawler.run(startUrls);
sensitive-blue
sensitive-blueOP•3y ago
Ah, now I received this error: CheerioCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 429 status code. {"id":"lbvAGmHKVGPGH6n","url":"https://google.com/search?q=restaurants","retryCount":2} I think Google is showing a captcha.
deep-jade
deep-jade•3y ago
You need to use SERP proxies.
sensitive-blue
sensitive-blueOP•3y ago
OK, let me research that.
deep-jade
deep-jade•3y ago
Btw, it is required to start the Google URL with www. when using SERP proxies.
sensitive-blue
sensitive-blueOP•3y ago
why is it like that?
deep-jade
deep-jade•3y ago
From the docs https://docs.apify.com/platform/proxy/google-serp-proxy
Requests made through the proxy are automatically routed through a proxy server from the selected country and pure HTML code of the search result page is returned.

Important: Only HTTP requests are allowed, and the Google hostname needs to start with the www. prefix.

For code examples on how to connect to Google SERP proxies, see the examples page.
Google SERP proxy | Apify Documentation
Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany.
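Wiring the SERP proxy into the crawler might look roughly like this (a sketch assuming the apify SDK and access to Apify's GOOGLE_SERP proxy group, per the docs linked above):

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Route requests through Apify's Google SERP proxy group.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['GOOGLE_SERP'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, log }) {
        log.info(`Got search page: ${request.loadedUrl}`);
    },
});

// Per the docs above: plain HTTP and the www. prefix are required.
await crawler.run(['http://www.google.com/search?q=restaurants']);

await Actor.exit();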
mute-gold
mute-gold•3y ago
To crawl the same URL again, the recommended approach is to add the request as { url, uniqueKey: [GENERATE_RANDOM_KEY_OR_USE_COUNTER] }, since when you add an anchor like #COUNTER it is actually in-page navigation (for a browser it means the same page is opened and the content scrolled to the #anchor). Regarding the Google search: save a snapshot if you are opening the page(s) with a browser-based crawler, or save the body under Cheerio, then check the actual content available to the scraper at run time. If you are not getting links, it means the bot is blocked in one way or another.
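A minimal sketch of enqueueing the same URL more than once via distinct uniqueKeys (here a simple counter suffix stands in for the generated key):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url} (uniqueKey: ${request.uniqueKey})`);
    },
});

const url = 'https://www.google.com/search?q=restaurants';

// Distinct uniqueKeys let the request queue accept the same URL twice.
await crawler.addRequests([
    { url, uniqueKey: `${url}#1` },
    { url, uniqueKey: `${url}#2` },
]);

await crawler.run();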
eastern-cyan
eastern-cyan•3y ago
Btw: for debugging, just store the HTML to the key-value store to see what was loaded; then you can check whether it was HTML, JSON, or text.
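Saving such a debugging snapshot could look roughly like this (a sketch; the key naming is arbitrary):

import { CheerioCrawler, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, body, contentType }) {
        // Store the raw response so you can inspect later whether it was HTML, JSON, or text.
        await KeyValueStore.setValue(`SNAPSHOT-${request.id}`, body, {
            contentType: contentType.type,
        });
    },
});

await crawler.run(['https://www.google.com/search?q=restaurants']);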
