Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Concurrent crawlers or maxRequests per Queue?

I'm crawling many websites every day. Ideally, I'd set a maxRequestsPerMinute per website, so the crawler runs at full speed overall while interleaving pages from different websites and never exceeding any single site's request limit. I don't think that's possible with a single crawler, though. So how could I achieve this? ...
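
One workable pattern, sketched below under the assumption that the start URLs are known up front: run one crawler per site, each with its own `maxRequestsPerMinute` and its own named queue, and let them run concurrently. The site URLs and limits here are hypothetical.

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical per-site limits; tune each to what the site tolerates.
const sites = [
    { startUrl: 'https://site-a.example.com', maxRequestsPerMinute: 120 },
    { startUrl: 'https://site-b.example.com', maxRequestsPerMinute: 20 },
];

await Promise.all(sites.map(async ({ startUrl, maxRequestsPerMinute }) => {
    // A named queue per site keeps the crawlers from sharing state.
    const requestQueue = await RequestQueue.open(new URL(startUrl).hostname);
    const crawler = new CheerioCrawler({
        requestQueue,
        maxRequestsPerMinute,
        async requestHandler({ enqueueLinks }) {
            // ... extract data here ...
            await enqueueLinks(); // defaults to same-hostname links
        },
    });
    await crawler.run([startUrl]);
}));
```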

accessing RequestQueue/RequestList for scraper

I have a CheerioCrawler that successfully crawls an Amazon results page for product links. I then want to add those links to a RequestQueue/RequestList (enqueueing each request from the RequestList into the RequestQueue), access the queue in a different route, and crawl that list of product links with the CheerioCrawler for the data I need. How can I do this? This is what my code looks like...
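
A common way to structure this, as a sketch (the selectors and start URL below are hypothetical): use a labeled router, so the results-page handler enqueues product links under a label, and a second handler scrapes each product page.

```typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Default handler: the results page. Selector here is hypothetical.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        selector: 'a.product-link', // replace with your real selector
        label: 'PRODUCT',           // routes matches to the handler below
    });
});

// Handler for each enqueued product page.
router.addHandler('PRODUCT', async ({ request, $, pushData }) => {
    await pushData({
        url: request.loadedUrl,
        title: $('#productTitle').text().trim(), // hypothetical selector
    });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://www.amazon.com/s?k=example']);
```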

taking list of scraped urls and conducting multiple new scrapes

I have code that scrapes product URLs from an Amazon results page. I'm able to scrape the product URLs successfully, but I can't take each link and scrape the needed info in another crawler. Do I need another Cheerio router? Also, how can I take each scraped link, add it to a RequestList and RequestQueue, and then take the URLs from that RequestQueue and scrape that information...
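
If you specifically want to drive this through an explicit RequestQueue rather than the router's enqueueLinks, here is a sketch of that variant (selectors are hypothetical):

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Explicit-queue variant of the router example above: push each scraped
// product URL into a RequestQueue with a label, and let the same
// crawler consume the queue.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $, pushData }) {
        if (request.label === 'DETAIL') {
            await pushData({ url: request.url, title: $('title').text() });
            return;
        }
        // Results page: extract product URLs (selector is hypothetical).
        const urls = $('a.product-link')
            .map((_, el) => $(el).attr('href'))
            .get();
        for (const url of urls) {
            await requestQueue.addRequest({ url, label: 'DETAIL' });
        }
    },
});

await crawler.run(['https://www.amazon.com/s?k=example']);
```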

PlaywrightCrawler New Instance unexpected result

Hi guys, I'm new to Crawlee. I wrapped the sample code in a function. Each time the getAvailableURLs function is called, a new instance of the PlaywrightCrawler class is created and used to crawl the provided URL. ...
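
One likely explanation is that every new instance shares the default RequestQueue, so a second run sees the first run's requests as already handled. A sketch of one workaround, giving each call its own named queue (the helper below is hypothetical):

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';
import { randomUUID } from 'node:crypto';

// Hypothetical helper: each call gets its own queue, so runs don't
// see each other's already-handled requests.
async function getAvailableURLs(startUrl: string): Promise<string[]> {
    const requestQueue = await RequestQueue.open(randomUUID());
    const urls: string[] = [];

    const crawler = new PlaywrightCrawler({
        requestQueue,
        async requestHandler({ page }) {
            const links = await page.$$eval('a', (anchors) =>
                anchors.map((a) => (a as HTMLAnchorElement).href));
            urls.push(...links);
        },
    });

    await crawler.run([startUrl]);
    await requestQueue.drop(); // clean up the named queue
    return urls;
}
```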

push Dataset but got nothing

Hi, I'm new. I'm trying to follow https://crawlee.dev/docs/examples/playwright-crawler, but I get no data in storage :/ ```typescript...
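
For comparison, a minimal shape that should write to ./storage/datasets/default. Two common gotchas: pushData must be awaited, and the default storage folder is purged at the start of each run, so check it after the crawl finishes.

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        const title = await page.title();
        // Each call appends one item under ./storage/datasets/default
        await pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(['https://crawlee.dev']);
```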

browserType.launchPersistentContext: Browser closed

I'm getting the below error when running Playwright. The problem likely lies with the Chromium executable, but I'm not sure why. I have my executable path set: executablePath: '/tmp/chromium/chrome-linux/chrome'. This Chromium build was downloaded from Playwright's hosted files, so I didn't think there would be a compatibility issue: https://playwright.azureedge.net/builds/chromium/1060/chromium-linux.zip Extra context: I'm running this in an AWS Lambda (x86_64). ```{...
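
Plain Chromium builds can be missing shared libraries on Amazon Lambda's base image, and Chromium generally needs extra flags to survive the Lambda sandbox. A hedged sketch; the flags below are commonly suggested for restricted environments rather than a known fix:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            executablePath: '/tmp/chromium/chrome-linux/chrome',
            // Flags commonly needed in sandboxed environments like Lambda;
            // trim to taste, these are suggestions, not a guaranteed fix.
            args: [
                '--no-sandbox',
                '--disable-dev-shm-usage',
                '--disable-gpu',
                '--single-process',
            ],
        },
    },
    async requestHandler({ page, pushData }) {
        await pushData({ title: await page.title() });
    },
});
```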

change proxies while running

Hello, I have a question regarding Puppeteer: I want to change proxies at one point during the process. Is this achievable? For example, I have proxy1 and proxy2; I start with proxy1 and at some point switch to proxy2....
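
One approach is to let ProxyConfiguration's newUrlFunction decide which proxy to hand out, and flip a flag when you want to switch (the proxy URLs below are hypothetical). Note that with browser crawlers the proxy is fixed per browser instance, so the new proxy should only take effect for browsers launched after the switch.

```typescript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Hypothetical flag you flip when it's time to switch proxies.
let useSecondProxy = false;

const proxyConfiguration = new ProxyConfiguration({
    // Consulted whenever a new proxy URL is needed, so flipping the
    // flag changes which proxy later requests go through.
    newUrlFunction: () => (useSecondProxy
        ? 'http://proxy2.example.com:8000'
        : 'http://proxy1.example.com:8000'),
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    useSessionPool: true,
    async requestHandler({ request }) {
        // ... set useSecondProxy = true when your condition is met ...
    },
});
```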

PlaywrightCrawler in AWS Lambda

Hi guys, trying to run PlaywrightCrawler in an AWS Lambda but having some issues. ```browserType.launchPersistentContext: Executable doesn't exist at /home/sbx_user1051/.cache/ms-playwright/chromium-1060/chrome-linux/chrome ╔═════════════════════════════════════════════════════════════════════════╗ β•‘ Looks like Playwright Test or Playwright was just installed or updated. β•‘...
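
This error usually means the browsers were installed into a cache directory that doesn't exist in the Lambda runtime. Two commonly suggested options, sketched below: install browsers into node_modules at build time with `PLAYWRIGHT_BROWSERS_PATH=0 npx playwright install chromium` so they ship inside your bundle, or point Crawlee at a Chromium you package yourself (the path and env var below are hypothetical).

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Hypothetical: points at a Chromium bundled with the Lambda.
            executablePath: process.env.CHROMIUM_PATH ?? '/opt/chromium/chrome',
        },
    },
    async requestHandler({ page, pushData }) {
        await pushData({ url: page.url(), title: await page.title() });
    },
});
```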

Is the Playwright Firefox Docker image usable with PlaywrightCrawler?

I understand that the template for PlaywrightCrawler uses the Chrome Docker image. Is it possible to modify that Dockerfile to use apify/actor-node-playwright-firefox:16, and if so, are there any other modifications that would need to be made?
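Switching the base image should work, but the crawler also needs to be told to launch Firefox instead of the default Chromium, roughly like this:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

// With apify/actor-node-playwright-firefox as the base image, only
// Firefox is preinstalled, so point the crawler at the firefox launcher.
const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    async requestHandler({ page, pushData }) {
        await pushData({ url: page.url(), title: await page.title() });
    },
});

await crawler.run(['https://crawlee.dev']);
```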

What optimizations work for you?

I'm attempting to use Crawlee and Puppeteer to crawl between 15 and 30 million URLs. I'm not rich, but I also can't wait forever for the crawl to finish, so I've spent some time over the last few days hunting for different optimizations that might make my crawler faster. This is more challenging than usual when you're crawling a laundry list of unknown sites. First, here's some of the code I'm working with at this point. To get this running you just: ``...
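
For reference, a few levers that tend to matter at this scale, sketched with hypothetical numbers: blocking heavy resources, raising concurrency, and tightening timeouts.

```typescript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 50,            // tune against your hardware and proxies
    requestHandlerTimeoutSecs: 30, // fail slow sites fast
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Skip images/fonts/styles; a big win on unknown sites.
            await blockRequests({
                urlPatterns: ['.jpg', '.jpeg', '.png', '.gif', '.webp',
                    '.svg', '.woff', '.woff2', '.css'],
            });
        },
    ],
    async requestHandler({ request, page }) {
        // ... extract only what you need ...
    },
});
```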

Cheerio's innerText sometimes returns corrupted content

Hi folks, I've encountered an issue when using $('body').prop('innerText'). Namely, the returned content is not always the same. I've opened a GitHub issue for this and created a separate repository for easy reproduction. I wanted to mention the issue here on Discord as well; maybe we can discuss possible solutions more easily in an informal setting. Link to the issue: https://github.com/apify/crawlee/issues/1898 Link to the repo with reproduction steps: https://github.com/tsopeh/crawlee-innertext-repro...

Failed to parse URL from [object Object]

This is the request that I'm trying to add: ``` let popReportRequest = new Request({ url: 'https://www.beckett.com/grading/pop-report/', method: 'POST',...
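That error usually means a constructed Request instance ended up somewhere a URL string or plain options object was expected. One way to sidestep it is to pass a plain options object and let the crawler build the Request internally; a sketch (the payload below is a placeholder):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, body }) {
        // ... handle the POST response ...
    },
});

// Plain request options instead of a constructed Request instance;
// crawler.run() / addRequests() create the Request internally.
await crawler.run([{
    url: 'https://www.beckett.com/grading/pop-report/',
    method: 'POST',
    payload: JSON.stringify({ /* placeholder: your form fields */ }),
    headers: { 'content-type': 'application/json' },
}]);
```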

getting ERR_CERT_AUTHORITY_INVALID with Playwright

Hi folks, I'm getting this error when using a proxy: ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_CERT_AUTHORITY_INVALID at 'MY_URL' ...
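
This often shows up with MITM-style proxies that re-sign TLS with their own certificate. One commonly suggested workaround, under the assumption that Crawlee forwards these options to Playwright's launchPersistentContext (which accepts ignoreHTTPSErrors):

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Assumption: forwarded to launchPersistentContext, where
            // Playwright accepts this context option. Only do this if
            // you trust the proxy that is breaking the cert chain.
            ignoreHTTPSErrors: true,
        },
    },
    async requestHandler({ page }) {
        // ...
    },
});
```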

map maximum size exceeded

I get the following error:
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Map maximum size exceeded
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Map maximum size exceeded
The script at this point is using 11 GB of RAM (I've allowed 40 GB of max heap size)...

Crawlee doesn't process newly enqueued links via enqueueLinks

Hi folks, I'm trying to build a crawler that retrieves a body (Buffer), and later enqueues the next "page" to be crawled, if it exists (has_next === true ). The problem is that ?page=1 gets processed but the enqueued page (via enqueueLinks) doesn't; Crawlee states that it has processed all links (1 of 1). I have confirmed that has_next is indeed true and that enqueueLinks gets called. Am I missing something obvious?...
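
Worth checking first that maxRequestsPerCrawl isn't capping the run at 1. Beyond that, enqueueLinks applies a strategy filter to extracted links; building the next-page URL yourself and passing it to addRequests bypasses that filtering entirely. A sketch, assuming the pagination lives in a ?page= query parameter:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, addRequests, log }) {
        // ... parse the body (Buffer) and derive has_next from it ...
        const hasNext = true; // placeholder for your real check

        if (hasNext) {
            const next = new URL(request.url);
            const page = Number(next.searchParams.get('page') ?? '1');
            next.searchParams.set('page', String(page + 1));
            // addRequests skips enqueueLinks' extraction and strategy
            // filtering, so the URL lands in the queue verbatim.
            await addRequests([next.href]);
            log.info(`Enqueued ${next.href}`);
        }
    },
});

await crawler.run(['https://example.com/items?page=1']);
```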

Getting the parent URL while executing inside the requestHandler for Crawlee

Hey folks, I'm saving the hierarchy of the crawl tree in my database as part of the crawling process, which means in the requestHandler, I need to save the parent URL that enqueued the link that is currently executing in the requestHandler. Is there an easy way to get that or is it something I need to implement myself? Thanks!...
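
There is no built-in parent pointer on the request, but userData passed to enqueueLinks is copied onto every enqueued request, which gives you exactly this. A sketch:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // Set by whichever request enqueued this one; undefined for
        // the start URLs at the root of the tree.
        const parentUrl = request.userData.parentUrl;
        // ... save { url: request.url, parentUrl } to your database ...

        // Stamp the current URL onto every child this request enqueues.
        await enqueueLinks({
            userData: { parentUrl: request.url },
        });
    },
});
```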

networkidle2 option

Hello, in Puppeteer you can pass the option {waitUntil: 'networkidle2'} to page.reload or page.goto. Using Puppeteer in Crawlee, the only way I've found to use it is by reloading each page. Is there another way to configure navigation so that {waitUntil: 'networkidle2'} applies from the beginning?...
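
Yes: preNavigationHooks receive the gotoOptions that Crawlee passes to page.goto(), so you can set waitUntil once and it applies to every navigation, not just reloads.

```typescript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            // These are the options Crawlee hands to page.goto().
            if (gotoOptions) gotoOptions.waitUntil = 'networkidle2';
        },
    ],
    async requestHandler({ page }) {
        // ... page has loaded with networkidle2 semantics ...
    },
});
```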

I am looking for a Python & data processing expert (long term)

Candidates must have experience in Python, image processing, NLP, and machine learning. Thanks....

Got captcha and HTTP 403 using PlaywrightCrawler

Got a captcha and HTTP 403 when accessing wellfound.com. I get a captcha every time I access links like these (basically, when accessing any job ad on wellfound): https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer...
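
Hard to solve generically, but the usual levers are rotating proxies plus the session pool, retiring a session whenever a block slips through. A sketch, assuming you have residential proxies available (the proxy URL and captcha check below are hypothetical):

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Assumption: a rotating residential proxy endpoint.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://my-residential-proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,            // rotate identities on blocks
    persistCookiesPerSession: true,  // keep cookies tied to a session
    async requestHandler({ page, session }) {
        // Hypothetical block detection: retire the session so the next
        // attempt gets a fresh fingerprint/proxy pairing.
        if ((await page.title()).toLowerCase().includes('captcha')) {
            session?.retire();
            throw new Error('Blocked by captcha, retrying with new session');
        }
        // ... scrape the job ad ...
    },
});
```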