Need help with Crawlee

I am getting the following error when crawling
23 Replies
sensitive-blue
sensitive-blueOP•3y ago
@Helper It was working before, but when I add a new link, it doesn't work.
sensitive-blue
sensitive-blueOP•3y ago
What could possibly go wrong? I also tried the other method, which is to pass an array of URLs to crawler.run() directly, but got the same error.
deep-jade
deep-jade•3y ago
And which url worked for you?
provincial-silver
provincial-silver•3y ago
I think some addresses don't allow themselves to be crawled. Try it with different URLs; if it works for one, it can work for others too.
deep-jade
deep-jade•3y ago
I guess that target URL might be a JSON API endpoint. Try adding application/json to https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#additionalMimeTypes. (This won't fix your error, but you could possibly read the data from the response object.) If it is JSON, I would suggest using https://crawlee.dev/api/http-crawler instead.
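For reference, a minimal sketch of that option on CheerioCrawler (the endpoint URL is just a placeholder, not from the thread):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Accept JSON responses in addition to HTML.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, body, contentType, log }) {
        // `body` holds the raw response; parse it yourself when it is JSON.
        log.info(`Got ${contentType.type} from ${request.loadedUrl}`);
    },
});

// Placeholder URL for a JSON endpoint.
await crawler.run(['https://example.com/api/data']);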
sensitive-blue
sensitive-blueOP•3y ago
I tried two URLs: crawlee.dev and github.com. Plus, now I face a new problem. I want to crawl through search engines like Google and Bing and crawl all of the links that appear in the search results, e.g. https://google.com/search?q=restaurants. When feeding this URL and setting maxRequestsPerCrawl to any number, it sends only one request.
deep-jade
deep-jade•3y ago
Seems like the docs are a bit outdated; you can read JSON data from the context object ({ json }) without passing the JSON MIME type when using the Cheerio crawler. That option is not what you think it is. It sends one request because the URL itself is the unique key; maxRequestsPerCrawl is a safeguard that will stop the crawler if it finds more URLs than are set in this option.
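A minimal sketch of reading the parsed JSON from the context, as described above (placeholder URL):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, json, log }) {
        // For JSON responses, the parsed body is exposed as `json` on the context.
        log.info(`Fetched ${request.loadedUrl}`, { data: json });
    },
});

// Placeholder URL for a JSON endpoint.
await crawler.run(['https://example.com/api/items']);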
sensitive-blue
sensitive-blueOP•3y ago
I know that await enqueueLinks() is what lets it crawl more than one request, right? Setting that, I would expect to get 20 links from the Google search result, but what does it stop? @yellott
deep-jade
deep-jade•3y ago
Sorry, you didn't specify you were using enqueueLinks; I honestly have no idea. I never parsed Google myself, since there is an Apify scraper for that. Most likely it detects Cheerio immediately; try using browser-based crawlers if you want to implement it yourself.
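A bare-bones browser-based crawler, for comparison (a sketch assuming Playwright is installed; the start URL is arbitrary):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, enqueueLinks, log }) {
        // `page` is a full Playwright page, so the site sees a real browser.
        log.info(`Title of ${request.loadedUrl}: ${await page.title()}`);
        // Enqueue links found on the page with the default strategy.
        await enqueueLinks();
    },
    // Safety limit for an example run.
    maxRequestsPerCrawl: 10,
});

await crawler.run(['https://crawlee.dev']);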
sensitive-blue
sensitive-blueOP•3y ago
I am trying a browser-based crawler
deep-jade
deep-jade•3y ago
There's something to do with https://crawlee.dev/docs/examples/crawl-relative-links, but with the default EnqueueStrategy it should have crawled at least the Google links. If you want to scrape the Google search result URLs (and not crawl them), you need to collect them from the page using selectors.
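Collecting the result URLs with selectors instead of enqueueing them could look roughly like this (a sketch; the selector is a placeholder and would need to match Google's current markup):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, pushData }) {
        // Grab outbound link hrefs from the page instead of enqueueing them.
        const urls = $('a[href^="http"]')
            .map((_, el) => $(el).attr('href'))
            .get();
        await pushData({ searchUrl: request.loadedUrl, urls });
    },
});

await crawler.run(['https://www.google.com/search?q=restaurants']);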
deep-jade
deep-jade•3y ago
I see. You need to start with 'https://www.google.com/search?q=restaurants', since Google redirects to that page from 'https://google.com/search?q=restaurants'. Or use the SameDomain strategy to enqueue all links on the Google domain, but I don't think that is what you want to achieve. A naive implementation of a crawler that walks through search result pages and also enqueues URLs from the search result pages might look like this:
import { CheerioCrawler, createCheerioRouter, EnqueueStrategy } from 'crawlee';

const startUrls = ['https://www.google.com/search?q=restaurants'];
const searchPageNavUrlSelector = 'div[role="navigation"] table a';
const searchResultsUrlSelector = 'div[id="search"] div[data-sokoban-container] a[data-ved]';

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, log, request }) => {
    log.info(`Search page`, { url: request.loadedUrl });

    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        selector: searchPageNavUrlSelector,
    });

    await enqueueLinks({
        strategy: EnqueueStrategy.All,
        selector: searchResultsUrlSelector,
        label: 'SEARCH_RESULT_URL',
    });
});

router.addHandler('SEARCH_RESULT_URL', async ({ request, log }) => {
    log.info(`Search result url:`, { url: request.loadedUrl });
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // This still is a safeguard only in this implementation.
    maxRequestsPerCrawl: 30,
});

await crawler.run(startUrls);
sensitive-blue
sensitive-blueOP•3y ago
Ah, now I received this error: CheerioCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 429 status code. {"id":"lbvAGmHKVGPGH6n","url":"https://google.com/search?q=restaurants","retryCount":2} I think Google is showing a captcha.
deep-jade
deep-jade•3y ago
You need to use SERP proxies.
sensitive-blue
sensitive-blueOP•3y ago
OK, let me research that.
deep-jade
deep-jade•3y ago
Btw, it is required to start the Google URL with www. when using SERP proxies.
sensitive-blue
sensitive-blueOP•3y ago
why is it like that?
deep-jade
deep-jade•3y ago
From the docs https://docs.apify.com/platform/proxy/google-serp-proxy
Requests made through the proxy are automatically routed through a proxy server from the selected country and pure HTML code of the search result page is returned.

Important: Only HTTP requests are allowed, and the Google hostname needs to start with the www. prefix.

For code examples on how to connect to Google SERP proxies, see the examples page.
Google SERP proxy | Apify Documentation
Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany.
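Wiring the SERP proxy into the crawler might look roughly like this (a sketch assuming the apify SDK and access to Apify's GOOGLE_SERP proxy group, per the docs linked above):

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Route requests through Apify's Google SERP proxy group.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['GOOGLE_SERP'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, log }) {
        log.info(`Got search page: ${request.loadedUrl}`);
    },
});

// Per the docs above: plain HTTP and the www. prefix are required.
await crawler.run(['http://www.google.com/search?q=restaurants']);

await Actor.exit();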
mute-gold
mute-gold•3y ago
To crawl the same URL again, the recommended approach is to add the request as { url, uniqueKey: [GENERATE_RANDOM_KEY_OR_USE_COUNTER] }, since when you add an anchor like #COUNTER it is actually in-page navigation (for a browser it means the same page is opened and the content scrolled to the #anchor). Regarding the Google search: save a snapshot if you are opening the page(s) with a browser-based crawler, or save the body under Cheerio, then check the actual content available to the scraper at run time. If you are not getting links, it means the bot is blocked in one way or another.
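A minimal sketch of enqueueing the same URL more than once via distinct uniqueKeys (here a simple counter suffix stands in for the generated key):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url} (uniqueKey: ${request.uniqueKey})`);
    },
});

const url = 'https://www.google.com/search?q=restaurants';

// Distinct uniqueKeys let the request queue accept the same URL twice.
await crawler.addRequests([
    { url, uniqueKey: `${url}#1` },
    { url, uniqueKey: `${url}#2` },
]);

await crawler.run();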
eastern-cyan
eastern-cyan•3y ago
Btw: for debugging, just store the HTML to the key-value store to see what was loaded; then you can check whether it was HTML, JSON, or text.
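Saving such a debugging snapshot could look roughly like this (a sketch; the key naming is arbitrary):

import { CheerioCrawler, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, body, contentType }) {
        // Store the raw response so you can inspect later whether it was HTML, JSON, or text.
        await KeyValueStore.setValue(`SNAPSHOT-${request.id}`, body, {
            contentType: contentType.type,
        });
    },
});

await crawler.run(['https://www.google.com/search?q=restaurants']);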
