Crawlee & Apify

This is the official developer community of Apify and Crawlee.


xenial-black · 2/10/2025

More meaningful error than ERR_TUNNEL_CONNECTION_FAILED

Hi there. I am using a proxy to crawl some sites and am encountering an ERR_TUNNEL_CONNECTION_FAILED error. I am using Bright Data as my proxy service. If I curl my proxy endpoint I get a meaningful error. For example...
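One possible way to surface the proxy's real message: Chromium collapses any failed tunnel into ERR_TUNNEL_CONNECTION_FAILED, but the proxy's own status line is readable if you issue a raw CONNECT yourself. A minimal sketch (endpoint, port, and credentials are placeholders, not real Bright Data values):

```typescript
// Sketch: probe the proxy with a raw CONNECT to read the status line that
// the browser hides behind ERR_TUNNEL_CONNECTION_FAILED.
import http from 'node:http';

function probeProxy(proxyHost: string, proxyPort: number, auth: string, target: string): void {
    const req = http.request({
        host: proxyHost,
        port: proxyPort,
        method: 'CONNECT',
        path: `${target}:443`, // ask the proxy to open a tunnel to the target
        headers: { 'proxy-authorization': `Basic ${Buffer.from(auth).toString('base64')}` },
    });
    // Node emits 'connect' with the proxy's response, whatever its status code.
    req.on('connect', (res, socket) => {
        console.log(`Proxy answered: ${res.statusCode} ${res.statusMessage}`);
        socket.end();
    });
    req.on('error', (err) => console.error('Socket-level failure:', err.message));
    req.end();
}

probeProxy('proxy.example.com', 22225, 'username:password', 'example.com');
```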
deep-jade · 2/5/2025

crawlee not respecting cgroup resource limits

Crawlee doesn't seem to respect resource limits imposed by cgroups. This poses problems for containerised environments, where Crawlee either gets OOM-killed or silently slows to a crawl because it thinks it has much more resource available than it actually does. Reading the maximum RAM is pretty easy: ```typescript function getMaxMemoryMB(): number | null { const cgroupPath = '/sys/fs/cgroup/memory.max';...
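A minimal sketch completing that idea, assuming cgroup v2: read the container's memory limit and hand it to Crawlee via the CRAWLEE_MEMORY_MBYTES environment variable before the crawler starts (cgroup v1 uses a different path):

```typescript
// Sketch: derive Crawlee's memory budget from the cgroup v2 limit.
// getMaxMemoryMB is the helper from the post, completed.
import { readFileSync } from 'node:fs';

function getMaxMemoryMB(): number | null {
    const cgroupPath = '/sys/fs/cgroup/memory.max';
    try {
        const raw = readFileSync(cgroupPath, 'utf8').trim();
        if (raw === 'max') return null; // cgroup present but no limit configured
        return Math.floor(Number(raw) / (1024 * 1024));
    } catch {
        return null; // not running under cgroup v2
    }
}

const maxMemoryMB = getMaxMemoryMB();
if (maxMemoryMB !== null) {
    // Must be set before the crawler is constructed.
    process.env.CRAWLEE_MEMORY_MBYTES = String(maxMemoryMB);
}
```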
like-gold · 2/4/2025

Looking for feedback/review of my scraper

It's already working, but I'm fairly new to scraping and just want to learn the best possible practices. The script is 300-400 lines (TypeScript) total, contains a login routine with session retention, network listeners, and DOM querying, and runs on a Fastify backend. DM me if you are down ♥️...
dependent-tan · 2/2/2025

Trying out Crawlee, Etsy not working..

Hi Apify,
Thank you for this fine auto-scraping tool Crawlee! I wanted to try it out along with the tutorial, but with a different URL, e.g. https://www.etsy.com/search?q=wooden%20box, and it failed with PlaywrightCrawler. ``` // For more information, see https://crawlee.dev/...
national-gold · 1/31/2025

Only the first crawler runs in function

When running the example below, only the first crawler (crawler1) runs, and the second crawler (crawler2) does not work as intended. Running either crawler individually works fine, and changing the URL to something completely different also works fine. Here is an example. ``` import { PlaywrightCrawler } from 'crawlee'; ...
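This pattern is usually caused by both crawlers sharing the default RequestQueue: after crawler1 finishes, its URLs are marked as handled, so crawler2 sees nothing left to do (which also explains why a completely different URL works). A minimal sketch of one fix, giving each crawler its own named queue (queue names are arbitrary):

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Separate named queues keep the two crawlers from deduplicating each other.
const queue1 = await RequestQueue.open('crawler-1');
const queue2 = await RequestQueue.open('crawler-2');

const crawler1 = new PlaywrightCrawler({
    requestQueue: queue1,
    requestHandler: async ({ request, log }) => log.info(`crawler1: ${request.url}`),
});
const crawler2 = new PlaywrightCrawler({
    requestQueue: queue2,
    requestHandler: async ({ request, log }) => log.info(`crawler2: ${request.url}`),
});

await crawler1.run(['https://example.com']); // placeholder URL
await crawler2.run(['https://example.com']); // now runs too, in its own queue
```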
genetic-orange · 1/30/2025

How to retry only failed requests after the crawler has finished?

I finished a crawl of around 1.7M requests and got around 100k failed requests. Is there a way to retry just the failed ones?
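As far as I know there is no built-in switch for this; one workable pattern is to record the permanently failed requests during the run and replay them afterwards. A sketch (the 'failed-requests' dataset name is arbitrary):

```typescript
// Sketch: collect requests that exhausted their retries, then replay them.
import { CheerioCrawler, Dataset } from 'crawlee';

const failed = await Dataset.open('failed-requests');

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => { /* ... normal scraping ... */ },
    // Runs once per request after all retries are exhausted.
    failedRequestHandler: async ({ request }) => {
        await failed.pushData({ url: request.url, userData: request.userData });
    },
});

await crawler.run(['https://example.com']); // placeholder

// Second pass: fresh uniqueKeys are needed, otherwise the queue deduplicates
// the retries against the already-handled originals.
const { items } = await failed.getData();
await crawler.run(items.map((item, i) => ({
    url: item.url as string,
    userData: item.userData,
    uniqueKey: `retry-${i}-${item.url}`,
})));
```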
dependent-tan · 1/30/2025

Does Crawlee support ESM?

I'm trying to integrate with Nuxt 3; when I run in production mode it doesn't work: [nuxt] [request error] [unhandled] [500] Cannot find module '/app/server/node_modules/puppeteer/lib/cjs/puppeteer/puppeteer.js' ...
metropolitan-bronze · 1/30/2025

Max Depth option

Hello! Just wondering whether it is possible to set a max depth for the crawl? Previous posts (2023) seem to make use of 'userData' to track the depth. Thank you....
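Tracking depth through userData still appears to be the standard approach. A minimal sketch (the depth limit and crawler class are arbitrary choices):

```typescript
import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 3; // hypothetical limit

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        const depth = (request.userData.depth as number) ?? 0;
        if (depth >= MAX_DEPTH) return; // stop descending past the limit
        await enqueueLinks({
            // Stamp each child request with its depth before it is enqueued.
            transformRequestFunction: (req) => {
                req.userData = { ...req.userData, depth: depth + 1 };
                return req;
            },
        });
    },
});

await crawler.run([{ url: 'https://example.com', userData: { depth: 0 } }]);
```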
genetic-orange · 1/29/2025

How can I pass context to createNewSession ?

I want to use the existing crawler settings (JSON/Cheerio) when creating a new session, signing the user in or up there while associating cookies and a token with the session. Currently I do this new-session creation conditionally inside a preNavigation hook (the context is passed as an arg there), but not in createNewSession...
genetic-orange · 1/27/2025

How do I organize one auth per session, IP, and user agent?

I want to create a bunch of authenticated users, each with a consistent browser, proxy, user agent, fingerprint, schedule, browsing pattern, etc.
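One way to model this in Crawlee is to pin each identity to a session: browser crawlers generate a fingerprint per session, cookies can persist per session, and the proxy sticks to the session id. A sketch under those assumptions (proxy URLs, pool size, and account naming are placeholders):

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'], // placeholders
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true, // each session keeps its own cookie jar
    sessionPoolOptions: {
        maxPoolSize: 2, // roughly: one session per account/identity
        sessionOptions: { maxUsageCount: 1000 }, // keep identities long-lived
    },
    requestHandler: async ({ page, session }) => {
        if (!session) return;
        // session.userData can record which account this identity represents.
        session.userData.account ??= `account-${session.id}`;
        // ... log in with that account if needed, then browse ...
    },
});
```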
robust-apricot · 1/26/2025

Is there a way to get the number of enqueued links?

I have the following code for AdaptivePlaywrightCrawler and I want to log the number of enqueued links after calling enqueueLinks. ` router.addDefaultHandler(async ({ request, enqueueLinks, parseWithCheerio, querySelector, log, page }) => {
await enqueueLinks({...
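enqueueLinks resolves with a result object describing what was actually added, so the count can be logged right after the call. A sketch continuing the handler above (field names are from Crawlee's batch-add result, to the best of my knowledge):

```typescript
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    const result = await enqueueLinks({ /* ...same options as before... */ });
    // processedRequests also contains requests that were already in the queue;
    // wasAlreadyPresent filters those out to count genuinely new links.
    const fresh = result.processedRequests.filter((r) => !r.wasAlreadyPresent);
    log.info(`Enqueued ${fresh.length} new links, ${result.unprocessedRequests.length} not enqueued.`);
});
```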
extended-salmon · 1/21/2025

One or multiple instances of CheerioCrawler?

Hi community! I'm new to Crawlee, and I'm building a script that scrapes a lot of specific, different domains. These domains each have a different number of pages to scrape; some have two to three thousand pages, while others might have just a few hundred (or even fewer). The thing I have doubts about is: if I put all starting URLs in the same crawler instance, it might finish scraping one domain way before another. I thought about separating the domains and creating a crawler instance for each one, just so that I can run each crawler separately and let it run its own course. Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy? TIA...
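A sketch of the per-domain idea, assuming one named queue per domain and sequential runs (running several crawlers concurrently is also possible, but each live instance adds its own memory and CPU overhead):

```typescript
// Sketch: one crawler run per domain, each with its own named RequestQueue,
// so a fast domain finishing early never interferes with a slow one.
import { CheerioCrawler, RequestQueue } from 'crawlee';

const startUrls = ['https://example.com', 'https://example.org']; // placeholders

for (const startUrl of startUrls) {
    const queue = await RequestQueue.open(new URL(startUrl).hostname);
    const crawler = new CheerioCrawler({
        requestQueue: queue,
        requestHandler: async ({ enqueueLinks }) => {
            await enqueueLinks({ strategy: 'same-hostname' }); // stay on this domain
            // ... extract data ...
        },
    });
    await crawler.run([startUrl]);
}
```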
correct-apricot · 1/14/2025

Handling Dynamic Links with Crawlee PlaywrightCrawler

I’m working on a project using PlaywrightCrawler to scrape links from a dynamic JavaScript-rendered website. The challenge is that the <a> tags don’t have href attributes, so I need to click on them and capture the resulting URLs.
- Delayed Link Rendering: Links are dynamically rendered with JavaScript, often taking time due to a loader. How can I ensure all links are loaded before clicking?
- Navigation Issues: Some links don’t navigate as expected or fail when trying to open in a new context.
- Memory Overload: I get the warning "Memory is critically overloaded" during crawls...
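A rough sketch of one approach to the href-less links, assuming Playwright locators: wait for the loader to detach, click each candidate, record the landing URL, and go back. All selectors are placeholders; lowering concurrency is also a common response to the memory warnings:

```typescript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 2, // fewer parallel pages eases "Memory is critically overloaded"
    requestHandler: async ({ page, request }) => {
        // Wait until the loader element has been removed from the DOM.
        await page.waitForSelector('.loader', { state: 'detached' });
        const links = page.locator('a.card'); // placeholder selector for href-less links
        const count = await links.count();
        for (let i = 0; i < count; i++) {
            await links.nth(i).click();
            await page.waitForLoadState('domcontentloaded');
            await Dataset.pushData({ from: request.url, to: page.url() });
            await page.goBack();
            await page.waitForLoadState('domcontentloaded');
        }
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```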
robust-apricot · 1/10/2025

AdaptivePlaywrightCrawler starts crawling the whole web at some point.

I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web. Here is my code. `const crawler = new AdaptivePlaywrightCrawler({ renderingTypeDetectionRatio: 0.1,...
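Two things usually rein this in: an explicit enqueueLinks strategy pinned to the current hostname, and a hard request cap as a safety net. A sketch (the cap value is arbitrary):

```typescript
import { AdaptivePlaywrightCrawler } from 'crawlee';

const crawler = new AdaptivePlaywrightCrawler({
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 500, // hard stop even if link discovery explodes
    requestHandler: async ({ enqueueLinks }) => {
        // Follow only links on the same hostname as the page being crawled.
        await enqueueLinks({ strategy: 'same-hostname' });
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```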
like-gold · 1/7/2025

Moving from Playwright to Crawlee/Playwright for Scraping

Are there actually any resources on building a scraper with Crawlee besides the ones in the docs? Where do I set all the browser context options, for example? ```javascript const launchPlaywright = async () => {...
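When moving from a hand-rolled launchPlaywright, the browser and context options typically migrate into the crawler configuration. A sketch of the usual places they go (all values are examples):

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: { headless: true }, // passed straight to Playwright's launch()
        userAgent: 'Mozilla/5.0 ...',      // example value
    },
    preNavigationHooks: [
        async ({ page }) => {
            // Per-page tweaks that would otherwise sit on a manual BrowserContext.
            await page.setViewportSize({ width: 1366, height: 768 });
        },
    ],
    requestHandler: async ({ page, log }) => log.info(await page.title()),
});

await crawler.run(['https://example.com']); // placeholder start URL
```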
foreign-sapphire · 1/4/2025

How to scrape emails from LinkedIn

I am building a LinkedIn email scraper Actor and having some issues; could anyone help me with these? Scraped data: { name: 'Join LinkedIn', title: 'Not found', email: 'Not found', location: 'Not found'...
afraid-scarlet · 1/4/2025

How to implement persistent login with crawlee-js/playwright?

I need to scrape content from multiple pages on one social network (x.com) that requires auth. Where should I implement the login mechanism so that it happens before the URLs are followed, and so that it is persisted and reused for as long as it remains valid?
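One pattern that seems to fit: keep cookies on a single long-lived Crawlee session and log in lazily whenever a page turns out to be logged out, then re-queue that page. A sketch; the selectors, env vars, and logged-out check are all placeholders:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true, // auth cookies survive across requests
    sessionPoolOptions: { maxPoolSize: 1 }, // one identity -> one login
    requestHandler: async ({ page, request, crawler }) => {
        // Placeholder logged-out check: a login form is visible.
        if (await page.locator('form#login').isVisible()) {
            await page.fill('input[name=username]', process.env.X_USER!);
            await page.fill('input[name=password]', process.env.X_PASS!);
            await page.click('button[type=submit]');
            await page.waitForLoadState('networkidle');
            // Re-queue the original target now that the session holds auth cookies.
            await crawler.addRequests([{ url: request.url, uniqueKey: `${request.url}#retry` }]);
            return;
        }
        // ... scrape the authenticated page ...
    },
});

await crawler.run(['https://x.com/some-profile']); // placeholder
```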
eastern-cyan · 1/3/2025

Incremental Web scraping using Crawlee

Hey everyone. :perfecto: :crawlee: Currently I am working on scraping one website where new content (pages) is added frequently (say, a blog). When I run my scraper it scrapes all pages successfully, but when I run it again tomorrow (after new pages have been added), it starts scraping everything again. I would be thankful if you could give me some advice, ideas, solutions, or examples of efficiently re-scraping without crawling the entire site again. ...
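A minimal sketch of one incremental approach: persist the set of already-scraped URLs in a named KeyValueStore between runs and filter them out in enqueueLinks. The store name is arbitrary, and this assumes transformRequestFunction may return false to skip a request:

```typescript
import { CheerioCrawler, KeyValueStore } from 'crawlee';

// Load the URLs scraped on previous runs.
const seenStore = await KeyValueStore.open('seen-urls');
const seen = new Set<string>((await seenStore.getValue<string[]>('urls')) ?? []);

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        await enqueueLinks({
            // Never enqueue a page that a previous run already scraped.
            transformRequestFunction: (req) => (seen.has(req.url) ? false : req),
        });
        seen.add(request.url);
        // ... extract data ...
    },
});

await crawler.run(['https://example.com/blog']); // placeholder start URL
await seenStore.setValue('urls', [...seen]);     // persist for the next run
```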
deep-jade · 12/30/2024

Managing the queue using Redis or something similar, with worker nodes listening on the queue

I'm trying to run Crawlee for production use and want to scale to a cluster of worker nodes that are ready to crawl pages on request. How can I achieve this? The RequestQueue basically writes requests to files and doesn't utilize any queueing system. I couldn't find docs on how I can utilize a Redis queue or something similar....
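As far as I can tell, the RequestQueue backend is not swappable for Redis out of the box. One workaround sketch: keep a shared Redis list as the cluster-level queue and have each worker node drain it into a local keep-alive crawler. The list name and connection details are placeholders; this uses ioredis:

```typescript
// Sketch: each worker node pulls URLs from a shared Redis list and feeds
// them into its local Crawlee queue.
import Redis from 'ioredis';
import { CheerioCrawler } from 'crawlee';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const crawler = new CheerioCrawler({
    keepAlive: true, // do not shut down when the local queue momentarily empties
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});

void crawler.run([]); // start idle; work arrives from Redis

// Drain loop: BLPOP blocks for up to 5 s, so the loop is cheap when idle.
for (;;) {
    const popped = await redis.blpop('crawl:pending', 5);
    if (popped) await crawler.addRequests([popped[1]]); // popped = [listName, url]
}
```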