Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Concurrent crawlers or maxRequests per Queue?

I'm crawling many websites every day. Ideally, I'd set a maxRequestsPerMinute per website, so the crawler runs at full speed overall while interleaving pages from different websites and never exceeding any single site's request limit. I don't think that's possible with a single crawler, though. So how could I achieve this? ...
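
One workable pattern, sketched below under the assumption that the start URLs are known up front: run one crawler per site, each with its own `maxRequestsPerMinute` and its own named queue, and let them run concurrently. The site URLs and limits here are hypothetical.

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical per-site limits; tune each to what the site tolerates.
const sites = [
    { startUrl: 'https://site-a.example.com', maxRequestsPerMinute: 120 },
    { startUrl: 'https://site-b.example.com', maxRequestsPerMinute: 20 },
];

await Promise.all(sites.map(async ({ startUrl, maxRequestsPerMinute }) => {
    // A named queue per site keeps the crawlers from sharing state.
    const requestQueue = await RequestQueue.open(new URL(startUrl).hostname);
    const crawler = new CheerioCrawler({
        requestQueue,
        maxRequestsPerMinute,
        async requestHandler({ enqueueLinks }) {
            // ... extract data here ...
            await enqueueLinks(); // defaults to same-hostname links
        },
    });
    await crawler.run([startUrl]);
}));
```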

accessing RequestQueue/RequestList for scraper

I have a CheerioCrawler that successfully crawls an Amazon results page for product links. I then want to add those links to a RequestQueue/RequestList (enqueueing each request from the RequestList into the RequestQueue), access the queue in a different route, and crawl that list of product links with the CheerioCrawler for the data I need. How can I do this? This is what my code looks like...
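
A common way to structure this, as a sketch (the selectors and start URL below are hypothetical): use a labeled router, so the results-page handler enqueues product links under a label, and a second handler scrapes each product page.

```typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Default handler: the results page. Selector here is hypothetical.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        selector: 'a.product-link', // replace with your real selector
        label: 'PRODUCT',           // routes matches to the handler below
    });
});

// Handler for each enqueued product page.
router.addHandler('PRODUCT', async ({ request, $, pushData }) => {
    await pushData({
        url: request.loadedUrl,
        title: $('#productTitle').text().trim(), // hypothetical selector
    });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://www.amazon.com/s?k=example']);
```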

taking list of scraped urls and conducting multiple new scrapes

I have code that scrapes product URLs from an Amazon results page. I'm able to scrape the product URLs successfully, but I can't take each link and scrape the needed info in another crawler. Do I need another Cheerio router? Also, how can I take each scraped link, add it to a RequestList and RequestQueue, and then take the URLs from that RequestQueue and scrape that information...
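
If you specifically want to drive this through an explicit RequestQueue rather than the router's enqueueLinks, here is a sketch of that variant (selectors are hypothetical):

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Explicit-queue variant of the router example above: push each scraped
// product URL into a RequestQueue with a label, and let the same
// crawler consume the queue.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $, pushData }) {
        if (request.label === 'DETAIL') {
            await pushData({ url: request.url, title: $('title').text() });
            return;
        }
        // Results page: extract product URLs (selector is hypothetical).
        const urls = $('a.product-link')
            .map((_, el) => $(el).attr('href'))
            .get();
        for (const url of urls) {
            await requestQueue.addRequest({ url, label: 'DETAIL' });
        }
    },
});

await crawler.run(['https://www.amazon.com/s?k=example']);
```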

PlaywrightCrawler New Instance unexpected result

Hi guys, I'm new to Crawlee. I wrapped the sample code in a function. Each time the getAvailableURLs function is called, a new instance of the PlaywrightCrawler class is created and used to crawl the provided URL. ...
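
One likely explanation is that every new instance shares the default RequestQueue, so a second run sees the first run's requests as already handled. A sketch of one workaround, giving each call its own named queue (the helper below is hypothetical):

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';
import { randomUUID } from 'node:crypto';

// Hypothetical helper: each call gets its own queue, so runs don't
// see each other's already-handled requests.
async function getAvailableURLs(startUrl: string): Promise<string[]> {
    const requestQueue = await RequestQueue.open(randomUUID());
    const urls: string[] = [];

    const crawler = new PlaywrightCrawler({
        requestQueue,
        async requestHandler({ page }) {
            const links = await page.$$eval('a', (anchors) =>
                anchors.map((a) => (a as HTMLAnchorElement).href));
            urls.push(...links);
        },
    });

    await crawler.run([startUrl]);
    await requestQueue.drop(); // clean up the named queue
    return urls;
}
```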

push Dataset but got nothing

Hi, I'm new. I'm trying to follow https://crawlee.dev/docs/examples/playwright-crawler, but I get no data in storage :/ ```typescript...
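
For comparison, a minimal shape that should write to ./storage/datasets/default. Two common gotchas: pushData must be awaited, and the default storage folder is purged at the start of each run, so check it after the crawl finishes.

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        const title = await page.title();
        // Each call appends one item under ./storage/datasets/default
        await pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(['https://crawlee.dev']);
```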

browserType.launchPersistentContext: Browser closed

I'm getting the below error when running Playwright. The problem likely lies with the Chromium executable, but I'm not sure why. I have my executable path set: executablePath: '/tmp/chromium/chrome-linux/chrome'. This Chromium build was downloaded from Playwright's hosted files, so I didn't think there would be a compatibility issue: https://playwright.azureedge.net/builds/chromium/1060/chromium-linux.zip Extra context: I'm running this in an AWS Lambda (x86_64). ```{...
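
Plain Chromium builds can be missing shared libraries on Amazon Lambda's base image, and Chromium generally needs extra flags to survive the Lambda sandbox. A hedged sketch; the flags below are commonly suggested for restricted environments rather than a known fix:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            executablePath: '/tmp/chromium/chrome-linux/chrome',
            // Flags commonly needed in sandboxed environments like Lambda;
            // trim to taste, these are suggestions, not a guaranteed fix.
            args: [
                '--no-sandbox',
                '--disable-dev-shm-usage',
                '--disable-gpu',
                '--single-process',
            ],
        },
    },
    async requestHandler({ page, pushData }) {
        await pushData({ title: await page.title() });
    },
});
```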

change proxies while running

Hello, I have a question regarding Puppeteer: I want to change proxies at one point during the process. Is this achievable? For example, I have proxy1 and proxy2; I start with proxy1 and at some point switch to proxy2....
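
One approach is to let ProxyConfiguration's newUrlFunction decide which proxy to hand out, and flip a flag when you want to switch (the proxy URLs below are hypothetical). Note that with browser crawlers the proxy is fixed per browser instance, so the new proxy should only take effect for browsers launched after the switch.

```typescript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Hypothetical flag you flip when it's time to switch proxies.
let useSecondProxy = false;

const proxyConfiguration = new ProxyConfiguration({
    // Consulted whenever a new proxy URL is needed, so flipping the
    // flag changes which proxy later requests go through.
    newUrlFunction: () => (useSecondProxy
        ? 'http://proxy2.example.com:8000'
        : 'http://proxy1.example.com:8000'),
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    useSessionPool: true,
    async requestHandler({ request }) {
        // ... set useSecondProxy = true when your condition is met ...
    },
});
```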

PlaywrightCrawler in AWS Lambda

Hi guys, trying to run PlaywrightCrawler in an AWS Lambda but having some issues. ```browserType.launchPersistentContext: Executable doesn't exist at /home/sbx_user1051/.cache/ms-playwright/chromium-1060/chrome-linux/chrome ╔═════════════════════════════════════════════════════════════════════════╗ β•‘ Looks like Playwright Test or Playwright was just installed or updated. β•‘...
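
This error usually means the browsers were installed into a cache directory that doesn't exist in the Lambda runtime. Two commonly suggested options, sketched below: install browsers into node_modules at build time with `PLAYWRIGHT_BROWSERS_PATH=0 npx playwright install chromium` so they ship inside your bundle, or point Crawlee at a Chromium you package yourself (the path and env var below are hypothetical).

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Hypothetical: points at a Chromium bundled with the Lambda.
            executablePath: process.env.CHROMIUM_PATH ?? '/opt/chromium/chrome',
        },
    },
    async requestHandler({ page, pushData }) {
        await pushData({ url: page.url(), title: await page.title() });
    },
});
```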

Is the Playwright Firefox Docker image usable with PlaywrightCrawler?

I understand that the template for PlaywrightCrawler uses the Chrome Docker image. Is it possible to modify that Dockerfile to use apify/actor-node-playwright-firefox:16, and if so, are there any other modifications that would need to be made?
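Switching the base image should work, but the crawler also needs to be told to launch Firefox instead of the default Chromium, roughly like this:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// With apify/actor-node-playwright-firefox as the base image, only
// Firefox is preinstalled, so point the crawler at the firefox launcher.
const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    async requestHandler({ page, pushData }) {
        await pushData({ url: page.url(), title: await page.title() });
    },
});

await crawler.run(['https://crawlee.dev']);
```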

What optimizations work for you?

I'm attempting to use Crawlee and Puppeteer to crawl between 15 and 30 million URLs. I'm not rich, but I also can't wait forever for the crawl to finish, so I've spent some time over the last few days hunting for different optimizations that might make my crawler faster. This is more challenging than usual when you're crawling a laundry list of unknown sites. First, here's some of the code I'm working with at this point. To get this running you just: ``...
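
For reference, a few levers that tend to matter at this scale, sketched with hypothetical numbers: blocking heavy resources, raising concurrency, and tightening timeouts.

```typescript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 50,            // tune against your hardware and proxies
    requestHandlerTimeoutSecs: 30, // fail slow sites fast
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Skip images/fonts/styles; a big win on unknown sites.
            await blockRequests({
                urlPatterns: ['.jpg', '.jpeg', '.png', '.gif', '.webp',
                    '.svg', '.woff', '.woff2', '.css'],
            });
        },
    ],
    async requestHandler({ request, page }) {
        // ... extract only what you need ...
    },
});
```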

Cheerio's innerText sometimes returns corrupted content

Hi folks, I've encountered an issue when using $('body').prop('innerText'). Namely, the returned content is not always the same. I've opened a GitHub issue for this and created a separate repository for easy reproduction. I wanted to mention the issue here on Discord as well; maybe we can discuss possible solutions more easily in an informal setting. Link to the issue: https://github.com/apify/crawlee/issues/1898 Link to the repo with reproduction steps: https://github.com/tsopeh/crawlee-innertext-repro...

Failed to parse URL from [object Object]

This is the request that I'm trying to add: ``` let popReportRequest = new Request({ url: 'https://www.beckett.com/grading/pop-report/', method: 'POST',...
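That error usually means a constructed Request instance ended up somewhere a URL string or plain options object was expected. One way to sidestep it is to pass a plain options object and let the crawler build the Request internally; a sketch (the payload below is a placeholder):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, body }) {
        // ... handle the POST response ...
    },
});

// Plain request options instead of a constructed Request instance;
// crawler.run() / addRequests() create the Request internally.
await crawler.run([{
    url: 'https://www.beckett.com/grading/pop-report/',
    method: 'POST',
    payload: JSON.stringify({ /* placeholder: your form fields */ }),
    headers: { 'content-type': 'application/json' },
}]);
```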

getting ERR_CERT_AUTHORITY_INVALID with Playwright

Hi folks, I'm getting this error when using a proxy: ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_CERT_AUTHORITY_INVALID at 'MY_URL' ...
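
This often shows up with MITM-style proxies that re-sign TLS with their own certificate. One commonly suggested workaround, under the assumption that Crawlee forwards these options to Playwright's launchPersistentContext (which accepts ignoreHTTPSErrors):

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Assumption: forwarded to launchPersistentContext, where
            // Playwright accepts this context option. Only do this if
            // you trust the proxy that is breaking the cert chain.
            ignoreHTTPSErrors: true,
        },
    },
    async requestHandler({ page }) {
        // ...
    },
});
```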

map maximum size exceeded

I get the following error:
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Map maximum size exceeded
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Map maximum size exceeded
The script at this point is using 11 GB of RAM (I've allowed 40 GB of max heap size)...

Crawlee doesn't process newly enqueued links via enqueueLinks

Hi folks, I'm trying to build a crawler that retrieves a body (Buffer), and later enqueues the next "page" to be crawled, if it exists (has_next === true ). The problem is that ?page=1 gets processed but the enqueued page (via enqueueLinks) doesn't; Crawlee states that it has processed all links (1 of 1). I have confirmed that has_next is indeed true and that enqueueLinks gets called. Am I missing something obvious?...
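
Worth checking first that maxRequestsPerCrawl isn't capping the run at 1. Beyond that, enqueueLinks applies a strategy filter to extracted links; building the next-page URL yourself and passing it to addRequests bypasses that filtering entirely. A sketch, assuming the pagination lives in a ?page= query parameter:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, addRequests, log }) {
        // ... parse the body (Buffer) and derive has_next from it ...
        const hasNext = true; // placeholder for your real check

        if (hasNext) {
            const next = new URL(request.url);
            const page = Number(next.searchParams.get('page') ?? '1');
            next.searchParams.set('page', String(page + 1));
            // addRequests skips enqueueLinks' extraction and strategy
            // filtering, so the URL lands in the queue verbatim.
            await addRequests([next.href]);
            log.info(`Enqueued ${next.href}`);
        }
    },
});

await crawler.run(['https://example.com/items?page=1']);
```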

Getting the parent URL while executing inside the requestHandler for Crawlee

Hey folks, I'm saving the hierarchy of the crawl tree in my database as part of the crawling process, which means in the requestHandler, I need to save the parent URL that enqueued the link that is currently executing in the requestHandler. Is there an easy way to get that or is it something I need to implement myself? Thanks!...
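
There is no built-in parent pointer on the request, but userData passed to enqueueLinks is copied onto every enqueued request, which gives you exactly this. A sketch:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // Set by whichever request enqueued this one; undefined for
        // the start URLs at the root of the tree.
        const parentUrl = request.userData.parentUrl;
        // ... save { url: request.url, parentUrl } to your database ...

        // Stamp the current URL onto every child this request enqueues.
        await enqueueLinks({
            userData: { parentUrl: request.url },
        });
    },
});
```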

networkidle2 option

Hello, in Puppeteer you can pass the option {waitUntil: 'networkidle2'} to page.reload or page.goto. Using Puppeteer in Crawlee, the only way I've found to use it is by reloading each page. Is there another way to configure navigation so that {waitUntil: 'networkidle2'} applies from the beginning?...
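
Yes: preNavigationHooks receive the gotoOptions that Crawlee passes to page.goto(), so you can set waitUntil once and it applies to every navigation, not just reloads.

```typescript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            // These are the options Crawlee hands to page.goto().
            if (gotoOptions) gotoOptions.waitUntil = 'networkidle2';
        },
    ],
    async requestHandler({ page }) {
        // ... page has loaded with networkidle2 semantics ...
    },
});
```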

I am looking for a Python & data processing expert (long term)

Candidates must have experience in Python, image processing, NLP, and machine learning. Thanks....

Got captcha and HTTP 403 using PlaywrightCrawler

Got a captcha and HTTP 403 when accessing wellfound.com. I get a captcha every time I access links like these (basically, when accessing any job ad on wellfound): https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer...
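
Hard to solve generically, but the usual levers are rotating proxies plus the session pool, retiring a session whenever a block slips through. A sketch, assuming you have residential proxies available (the proxy URL and captcha check below are hypothetical):

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Assumption: a rotating residential proxy endpoint.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://my-residential-proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,            // rotate identities on blocks
    persistCookiesPerSession: true,  // keep cookies tied to a session
    async requestHandler({ page, session }) {
        // Hypothetical block detection: retire the session so the next
        // attempt gets a fresh fingerprint/proxy pairing.
        if ((await page.title()).toLowerCase().includes('captcha')) {
            session?.retire();
            throw new Error('Blocked by captcha, retrying with new session');
        }
        // ... scrape the job ad ...
    },
});
```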