Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Scrape/crawl transactionally rather than in batches

Hi, I'm looking to introduce website crawling into an existing workflow that doesn't suit batch processing, i.e. I want to scrape each website, get the result, and do some further processing downstream. I do have this working with the code attached; however, I imagine there's a better way to achieve this, given that I'll be processing up to 500 websites concurrently, and my concern is memory allocation.

```javascript
export async function crawlWebsiteForAddresses(url: string) {
    const ukPostcodeRegex = /\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s?([0-9][A-Z]{2})\b/;
    ...
```
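One way the transactional wrapper could look (a minimal sketch; the regex comes from the question, everything else is an assumption):

```typescript
import { CheerioCrawler } from 'crawlee';

// Sketch of a per-call wrapper: one crawler instance per website.
export async function crawlWebsiteForAddresses(url: string): Promise<string[]> {
    const ukPostcodeRegex = /\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s?([0-9][A-Z]{2})\b/g;
    const found: string[] = [];

    const crawler = new CheerioCrawler({
        maxRequestsPerCrawl: 10, // keep each website's crawl bounded
        async requestHandler({ body, enqueueLinks }) {
            found.push(...(body.toString().match(ukPostcodeRegex) ?? []));
            await enqueueLinks({ strategy: 'same-domain' });
        },
    });

    await crawler.run([url]);
    return [...new Set(found)];
}
```

Note that concurrently running instances would otherwise share the default request queue, so for ~500 parallel calls each instance should probably get its own queue (e.g. `await RequestQueue.open(someUniqueId)`), or a single long-lived crawler should be shared across calls so memory stays in one managed pool.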

How to close Puppeteer browser mid-run while continuing actor execution in crawlee?

Hi everyone, I'm using PuppeteerCrawler for scraping because it unblocks websites effectively and allows JavaScript execution. However, I'm facing an issue: after accessing a website, I extract the required data from network requests (e.g., the HTML) and parse it later with cheerio....
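One pattern that might fit (a sketch; whether `retireBrowserByPage` is available depends on your Crawlee/browser-pool version): grab the raw HTML in the handler, retire the browser right away, and do the cheerio parsing outside the browser's lifecycle.

```typescript
import { PuppeteerCrawler } from 'crawlee';
import * as cheerio from 'cheerio';

const rawPages: { url: string; html: string }[] = [];

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, crawler }) {
        // Capture the rendered HTML, then let go of the browser early.
        rawPages.push({ url: request.url, html: await page.content() });

        // Retiring the browser closes it once its open pages finish;
        // the crawler itself keeps running.
        crawler.browserPool.retireBrowserByPage(page);
    },
});

await crawler.run(['https://example.com']);

// Parse with cheerio after the browser work is done.
for (const { url, html } of rawPages) {
    const $ = cheerio.load(html);
    console.log(url, $('title').text());
}
```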

What is the headless shell?

I noticed that `npx playwright install chromium` installs a Chromium headless shell, and crawls now run in those processes instead of the Chromium app. I think they use less CPU, but I couldn't find any information about them in the Crawlee docs...

Downloading JSON and YAML files while crawling with Playwright

Hi there. Is it possible to detect the Content-Type header of responses and download JSON or YAML files? I'm using Playwright to crawl my sites and have some JSON and YAML content I would like to capture, as well.
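It should be possible via the main navigation response in the handler (a sketch; the store name and key sanitization are made up):

```typescript
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('downloads'); // hypothetical store name

const crawler = new PlaywrightCrawler({
    async requestHandler({ response, request, enqueueLinks }) {
        const contentType = response?.headers()['content-type'] ?? '';

        if (response && (contentType.includes('application/json') || contentType.includes('yaml'))) {
            // Save the raw body keyed by a sanitized URL.
            const body = await response.body();
            const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');
            await store.setValue(key, body, { contentType });
            return;
        }

        await enqueueLinks();
    },
});
```

This assumes the JSON/YAML URLs are navigated to directly; files fetched as subresources (or URLs that trigger downloads) would need a `page.on('response')` listener instead.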

Digital Ocean

Is there any documentation around using Digital Ocean for a Crawlee scraper? I see options for GCP and AWS, but I'm looking more for just setting something up on a droplet.

`maxRequestsPerMinute`, but per session

Hey! Firstly, I just want to thank you for creating such an amazing product ❤️! The question itself:...
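The question is cut off above, but going by the title: as far as I know there is no per-session rate option, though the global limit can be spread across a session pool (a sketch; the numbers are arbitrary):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Global rate limit for the whole crawler.
    maxRequestsPerMinute: 120,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 20, // rotate across many sessions
        sessionOptions: {
            // Retire each session after this many uses, which spreads the
            // global rate across sessions rather than capping each per minute.
            maxUsageCount: 30,
        },
    },
});
```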

Massive Scraper

Hi, I have a (noob) question. I want to crawl many different URLs from different pages, so they need their own crawler implementations; some can also share one. How can I achieve this in Crawlee so that they run in parallel and can all be executed with a single command, or also in isolation? Input and example repos etc. would be highly appreciated...
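One shape that could work: one factory per site family, each with its own named request queue, run together or alone (a sketch; the names and URLs are made up):

```typescript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Separate request queues keep the crawlers independent.
async function makeSiteACrawler() {
    const requestQueue = await RequestQueue.open('site-a');
    return new CheerioCrawler({
        requestQueue,
        async requestHandler({ $, request }) {
            console.log('site A', request.url, $('title').text());
        },
    });
}

async function makeSiteBCrawler() {
    const requestQueue = await RequestQueue.open('site-b');
    return new CheerioCrawler({
        requestQueue,
        async requestHandler({ $, request }) {
            console.log('site B', request.url, $('title').text());
        },
    });
}

// Single entry point: run both in parallel, or call one factory in isolation.
const [a, b] = await Promise.all([makeSiteACrawler(), makeSiteBCrawler()]);
await Promise.all([
    a.run(['https://site-a.example']),
    b.run(['https://site-b.example']),
]);
```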

Await a promise set in a preNavigationHook

Hi all, I have a preNavigationHook that listens for requests and, if they return images, saves them to the cloud.

```typescript
...
```
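One way to make the handler wait for uploads started in the hook (a sketch; `saveToCloud` is hypothetical, and the promises are kept out of `userData` since it must stay JSON-serializable):

```typescript
import { PlaywrightCrawler } from 'crawlee';

// Hypothetical upload helper.
async function saveToCloud(url: string, body: Buffer): Promise<void> { /* ... */ }

// Pending uploads per request, keyed by uniqueKey.
const pendingUploads = new Map<string, Promise<void>[]>();

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            const uploads: Promise<void>[] = [];
            pendingUploads.set(request.uniqueKey, uploads);
            page.on('response', (response) => {
                if ((response.headers()['content-type'] ?? '').startsWith('image/')) {
                    uploads.push(response.body().then((body) => saveToCloud(response.url(), body)));
                }
            });
        },
    ],
    async requestHandler({ request }) {
        // ...scrape the page...
        // Wait for every upload the hook kicked off before finishing.
        await Promise.all(pendingUploads.get(request.uniqueKey) ?? []);
        pendingUploads.delete(request.uniqueKey);
    },
});
```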

Generative Bayesian Network Docs

I'm looking at the generative-bayesian-network package, part of the fingerprint suite: https://www.npmjs.com/package/generative-bayesian-network. However, I can't find any documentation whatsoever for this package. It looks interesting and I want to figure out how to use it. Are there docs anywhere for this?...

Does Crawlee support SOCKS5 proxies with authentication?

Does Crawlee support SOCKS5 proxies with authentication? I am building a crawler based on Crawlee with Playwright, and it needs to use SOCKS5 proxies with authentication, but I can't find anything about that in the Crawlee documentation....
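Proxy URLs with credentials can be passed to `ProxyConfiguration`, but note that Chromium itself cannot authenticate against SOCKS proxies, so a common workaround is a local unauthenticated forwarder via proxy-chain (a sketch; assumes your proxy-chain version supports socks5 upstreams, and the proxy URL is made up):

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { anonymizeProxy } from 'proxy-chain';

// Run a local, unauthenticated proxy that forwards to the SOCKS5 upstream.
const localProxyUrl = await anonymizeProxy('socks5://user:password@proxy.example.com:1080');

const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({ proxyUrls: [localProxyUrl] }),
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});
```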

ERROR: We've encountered an unexpected system error. If the issue persists, please contact support.

Hi people, I am having this problem with Docker on the platform: it runs recursively and fails. I can't find an error, and every single file of the project seems to be fine. Any idea?
- Pulling Docker image of build XXXXX from repository
- Creating Docker container
- Starting Docker container
...

retryOnBlocked with HttpCrawler

Hi, I'm using the HttpCrawler to scrape a static list of URLs. However, when I get a 403 response as a result of a Cloudflare challenge, the request is not retried with `retryOnBlocked: true`. If I remove retryOnBlocked instead, I see my errorHandler getting invoked and the request is retried. Do I understand retryOnBlocked wrong?
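A manual fallback, in case the built-in detection does not fire on this challenge (a sketch; option defaults may differ between Crawlee versions):

```typescript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    maxRequestRetries: 5,
    // Let 403s reach the requestHandler instead of failing earlier.
    ignoreHttpErrorStatusCodes: [403],
    sessionPoolOptions: { blockedStatusCodes: [] },
    async requestHandler({ response, session, request }) {
        if (response.statusCode === 403) {
            // Treat the Cloudflare challenge as blocking: drop the session
            // and throw so the request is retried with a fresh one.
            session?.retire();
            throw new Error(`Blocked with 403 on ${request.url}, retrying`);
        }
        // ...normal parsing...
    },
});
```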

Goodbye Crawlee (migrated to Hero)

I migrated my scraping code from Crawlee to Hero (see https://github.com/ulixee/hero). It works: everything that worked with Crawlee works with Hero. Why I migrated: I could not handle the over-engineered Crawlee API any more (and the bugs related to it). It was just too many APIs (different APIs!) for my simple case. Hero's API is about five times simpler. ...

PlaywrightCrawler proxy issue

My crawler works just fine with PlaywrightCrawler, but I have an issue when adding a proxy! This is the code:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";
...
```
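For comparison, a minimal working shape (a sketch; the proxy URL is made up, and the credentials belong inside it):

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://username:password@proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```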

Stop Crawlee When Condition Met

I am trying to scrape an e-commerce site and would like to scrape only 20 items. How can I stop the process once that many items have been scraped?
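`maxRequestsPerCrawl` caps requests rather than items, so one option is counting items and aborting the pool (a sketch; assumes one item per page):

```typescript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const MAX_ITEMS = 20;
let itemCount = 0;

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, enqueueLinks, crawler }) {
        await Dataset.pushData({ title: await page.title() });
        itemCount += 1;

        if (itemCount >= MAX_ITEMS) {
            // Stop gracefully: in-flight requests finish, nothing new starts.
            await crawler.autoscaledPool?.abort();
            return;
        }
        await enqueueLinks();
    },
});
```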

Crawlee stops after about 30 items are pushed to the dataset, and repeats the same data on the next run.

I'm writing my first Actor, using Crawlee and the Playwright crawler to scrape the website https://sreality.cz. I wrote the crawler using as much as possible from the examples in the documentation. It works like this: 1. Start on the first page of the search, for example this one....

Autoscaled pool trying to scale up without sufficient memory

Hi all, I'm running a Playwright crawler and am running into a bit of an issue with crawler stability. Have a look at these two log messages...
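If the pool keeps scaling into memory it does not have, capping it explicitly might help (a sketch; the numbers are placeholders):

```typescript
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Tell Crawlee how much memory it may assume (also settable via the
// CRAWLEE_MEMORY_MBYTES environment variable) and bound concurrency.
Configuration.getGlobalConfig().set('memoryMbytes', 4096);

const crawler = new PlaywrightCrawler({
    minConcurrency: 2,
    maxConcurrency: 10,
    async requestHandler({ page }) { /* ... */ },
});
```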

Max redirects

I am getting this error message; how should I best deal with it? "Reclaiming failed request back to the list or queue. Redirected 10 times. Aborting." Can I increase the maximum number of redirects for my CheerioCrawler?...
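The cap lives in the underlying got-scraping options, which CheerioCrawler exposes in its pre-navigation hooks (a sketch; 20 is arbitrary, the default is 10):

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        // The second hook argument is the got-scraping options object.
        async (_crawlingContext, gotOptions) => {
            gotOptions.maxRedirects = 20;
        },
    ],
    async requestHandler({ $, request }) {
        console.log(request.url, $('title').text());
    },
});
```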

Anyone have an example of scraping multiple different websites?

The structure I'm using does not look like the best. I am basically creating several routers and then doing something like:

```ts
...
```
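An alternative to several routers is a single router with per-site labels (a sketch; the labels and URLs are made up):

```typescript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();
router.addHandler('site-a', async ({ $ }) => { /* site A selectors */ });
router.addHandler('site-b', async ({ $ }) => { /* site B selectors */ });
router.addDefaultHandler(async ({ request }) => {
    console.log('no handler for', request.url);
});

const crawler = new CheerioCrawler({ requestHandler: router });

await crawler.run([
    { url: 'https://site-a.example', label: 'site-a' },
    { url: 'https://site-b.example', label: 'site-b' },
]);
```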

How to override `maxRequestRetries` error log

There is a function:

```typescript
protected async _handleFailedRequestHandler(crawlingContext: Context, error: Error): Promise<void> {
    // Always log the last error regardless if the user provided a failedRequestHandler
    const { id, url, method, uniqueKey } = crawlingContext.request;
    ...
```
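Since the method is protected, one option is subclassing the crawler (a sketch; overriding internals like this is version-fragile, and skipping the super call means both the built-in log and any user-provided failedRequestHandler are bypassed):

```typescript
import { CheerioCrawler, type CheerioCrawlingContext } from 'crawlee';

class QuietCrawler extends CheerioCrawler {
    protected override async _handleFailedRequestHandler(
        crawlingContext: CheerioCrawlingContext,
        error: Error,
    ): Promise<void> {
        // Replace the built-in "max retries" error log with a one-liner.
        this.log.warning(`Giving up on ${crawlingContext.request.url}: ${error.message}`);
    }
}
```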