Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Routing issue

I have a listing website as INPUT and I enqueueLinks from it. These links (case studies) themselves have multiple pages. When the crawler adds the links with the new label attached, nothing happens. When I use only a case-study page as input, it scrapes the data and works. I'm not sure what to do next or how to test it further. Does the queue system wait for all links to be added before it starts scraping?
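
For what it's worth, the request queue is processed continuously; the crawler does not wait for all links to be enqueued before it starts scraping. Below is a minimal sketch of how label-based routing is usually wired up, assuming PlaywrightCrawler and hypothetical selectors, labels and URLs; adjust to your actual pages:

```typescript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Default handler: the listing page given as INPUT.
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Enqueueing case studies from the listing page');
    await enqueueLinks({
        selector: 'a.case-study', // hypothetical selector
        label: 'CASE_STUDY',
    });
});

// Labelled handler: individual case-study pages, which may paginate further.
router.addHandler('CASE_STUDY', async ({ request, page, enqueueLinks, pushData, log }) => {
    log.info(`Scraping ${request.loadedUrl}`);
    await pushData({ url: request.loadedUrl, title: await page.title() });

    // Keep the same label so paginated case-study pages land in this handler too.
    await enqueueLinks({
        selector: 'a.pagination-next', // hypothetical selector
        label: 'CASE_STUDY',
    });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com/case-studies']); // hypothetical start URL
```

If the labelled handler never fires, it is worth checking that the selector actually matches links on the listing page and that the handler name matches the label exactly.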

Using BrightData's socks5h proxies

BrightData's datacenter proxies can be used with SOCKS5, but only with remote DNS resolution, so the protocol has to be given as socks5h://... Testing it with curl works, but using it in Crawlee doesn't; it just keeps hanging. ```...
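
A rough sketch of how a SOCKS proxy URL is plugged into ProxyConfiguration, assuming a Crawlee version with SOCKS support; the credentials, host and port below are placeholders. Whether the socks5h:// scheme (remote DNS) is accepted depends on the installed Crawlee / got-scraping version, so it may be worth testing plain socks5:// as well:

```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder BrightData-style credentials; socks5h:// requests remote DNS resolution.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['socks5h://user:pass@brd.superproxy.io:22228'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $, log }) {
        log.info(`${request.loadedUrl}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);
```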

Load time

Hello, is there a way to get the load time of a site from Crawlee in headless mode? I'm using PlaywrightCrawler. Thanks!...
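
One way to approximate this, sketched with the browser's own Navigation Timing API read from inside the request handler; the metrics returned are whatever the browser reports for the loaded document:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        // Read the Navigation Timing entry for the document that was just loaded.
        const timing = await page.evaluate(() => {
            const nav = performance.getEntriesByType('navigation')[0] as any;
            if (!nav) return null;
            return {
                domContentLoadedMs: nav.domContentLoadedEventEnd,
                loadEventEndMs: nav.loadEventEnd,
                totalMs: nav.duration,
            };
        });
        await pushData({ url: request.loadedUrl, timing });
    },
});

await crawler.run(['https://example.com']);
```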

How to stop following delayed JavaScript redirects?

I'm using the AdaptivePlaywrightCrawler with the same-domain strategy in enqueueLinks. The page I'm trying to crawl has delayed JavaScript redirects to other pages, such as Instagram. Sometimes the crawler mistakenly thinks it's still on the same domain after a redirect and starts adding Instagram URLs under the main domain, like example.com/account/... and example.com/member/..., which don't actually exist. How can I stop following these delayed JavaScript redirects?
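
One defensive option is to check, right before enqueueing, whether the page has already been redirected off the original host. Sketched here with PlaywrightCrawler; the same guard should adapt to the browser branch of AdaptivePlaywrightCrawler:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const loadedHost = new URL(request.loadedUrl ?? request.url).hostname;
        const currentHost = new URL(page.url()).hostname;

        // A delayed JS redirect has already moved us to another domain; do not
        // enqueue, otherwise foreign links get resolved against the original host.
        if (currentHost !== loadedHost) {
            log.info(`Skipping enqueueLinks: redirected from ${loadedHost} to ${currentHost}`);
            return;
        }

        await enqueueLinks({ strategy: 'same-domain' });
    },
});
```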

Replace default logger

Hello, did anybody manage to completely replace Crawlee's logs with console logs? If yes, can you please share your implementation?...
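
Not a full replacement, but one workaround is to silence Crawlee's built-in logger and log from the handlers with console directly; log and LogLevel are exported from crawlee:

```typescript
import { CheerioCrawler, log, LogLevel } from 'crawlee';

// Silence Crawlee's own output globally.
log.setLevel(LogLevel.OFF);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(`[crawler] ${request.loadedUrl}: ${$('title').text()}`);
    },
    failedRequestHandler({ request }, error) {
        console.error(`[crawler] ${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
```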

Shared external queue between multiple crawlers

Hello folks! Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead "enqueue links" to another queue service such as Redis? I would like to achieve this so I can run multiple crawlers on a single website, and they need to share the same queue so they won't process duplicate links. Thanks in advance!...
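
Rather than replacing the internal queue, one hedged pattern is to deduplicate across crawler instances through a shared Redis set before adding requests: each crawler keeps its own queue, but Redis decides which instance gets a URL first. The key name and Redis setup below are hypothetical:

```typescript
import { CheerioCrawler } from 'crawlee';
import Redis from 'ioredis';

const redis = new Redis(); // hypothetical shared Redis instance

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler, log }) {
        log.info(`Processing ${request.loadedUrl}`);

        // Collect candidate links ourselves instead of calling enqueueLinks.
        const hrefs = $('a[href]')
            .map((_, el) => $(el).attr('href'))
            .get()
            .filter((href): href is string => typeof href === 'string');

        for (const href of hrefs) {
            let url: string;
            try {
                url = new URL(href, request.loadedUrl ?? request.url).href;
            } catch {
                continue; // skip malformed hrefs
            }
            if (!url.startsWith('http')) continue;

            // SADD returns 1 only if no crawler instance has seen this URL yet.
            const isNew = await redis.sadd('crawl:seen-urls', url);
            if (isNew) await crawler.addRequests([url]);
        }
    },
});

await crawler.run(['https://example.com']);
await redis.quit();
```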

Reclaiming failed request back to the list or queue

Hello. I am facing this issue regularly. I am using Crawlee with Cheerio. How can I resolve this? ...
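
"Reclaiming failed request back to the list or queue" is the message Crawlee logs each time a request fails and is scheduled for a retry, so the underlying error is the thing to chase. A small sketch for surfacing it and controlling the retries:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // How many times a request is reclaimed (retried) before being marked as failed.
    maxRequestRetries: 5,
    // Runs on every failed attempt that will still be retried; handy for seeing
    // why requests keep getting reclaimed (timeouts, blocking, proxy errors, ...).
    errorHandler({ request, log }, error) {
        log.warning(`Attempt ${request.retryCount} for ${request.url} failed: ${error.message}`);
    },
    // Runs once the retries are exhausted.
    failedRequestHandler({ request, log }, error) {
        log.error(`Giving up on ${request.url}: ${error.message}`);
    },
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.loadedUrl, title: $('title').text() });
    },
});
```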

Disable write to disk

By default, data is written to ./storage. Is there a way to turn this off and use memory instead?
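
A sketch using the persistStorage configuration flag, which keeps datasets, queues and key-value stores in memory only; setting the CRAWLEE_PERSIST_STORAGE environment variable to false should have the same effect:

```typescript
import { CheerioCrawler, Configuration } from 'crawlee';

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, $, pushData }) {
            await pushData({ url: request.loadedUrl, title: $('title').text() });
        },
    },
    // Second constructor argument: a Configuration that disables writes to ./storage.
    new Configuration({ persistStorage: false }),
);

await crawler.run(['https://example.com']);
```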

CheerioCrawler headerGenerator help

Hello! I kept reading the docs but couldn't find clear information about this. When we use Puppeteer or Playwright, we can tweak the fingerprintGenerator in browserPool. For Cheerio we have the headerGenerator from got; how can we adjust it inside the CheerioCrawler?...
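
As far as I can tell, for the HTTP-based crawlers the got-scraping options are exposed as the second argument of preNavigationHooks, and that is where headerGeneratorOptions can be tweaked; the values below are illustrative:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotOptions) => {
            // Tune the header generator used by got-scraping for each request.
            gotOptions.headerGeneratorOptions = {
                browsers: [{ name: 'firefox', minVersion: 115 }],
                devices: ['desktop'],
                operatingSystems: ['linux'],
                locales: ['en-US'],
            };
        },
    ],
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.loadedUrl, title: $('title').text() });
    },
});
```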

Is it possible to bypass proxies for specific requests?

I have a use case where I want a crawler running permanently. This crawler has a tieredProxyList set up that it will iterate over in case some of the proxies don't work. For some pages I don't want to use proxies, to reduce the amount of money I spend on them (when I scrape my own page I don't want a proxy, but I do want to use the same logic/handlers). Is it possible to specify the proxy that should be used for specific requests, or maybe even the proxy tier? Basic setup: const proxyConfiguration = new ProxyConfiguration({ tieredProxyUrls: [['proxyTier1'], ['proxyTier2']] });...
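
One possible direction, sketched under the assumption that the installed Crawlee version passes the request to newUrlFunction and accepts null as "no proxy": swap the tiered list for a per-request function. The hostname and proxy URL below are placeholders:

```typescript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Decide per request whether to use a proxy at all.
    newUrlFunction: (sessionId, options) => {
        const url = options?.request?.url;
        // Hit our own site directly, everything else through the paid proxy.
        if (url && new URL(url).hostname === 'my-own-site.example') {
            return null; // assumption: null means "no proxy" for this request
        }
        return 'http://user:pass@proxy.example:8000'; // placeholder proxy URL
    },
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo, log }) {
        log.info(`${request.url} via ${proxyInfo?.url ?? 'no proxy'}`);
    },
});
```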

Issue: Playwright-chrome Docker with pnpm

Hello! I'm trying to run the actor using pnpm instead of npm. Locally, running pnpm run start:dev, pnpm run start:prod and apify run works as expected. apify push is also successful. ...

More meaningful error than ERR_TUNNEL_CONNECTION_FAILED

Hi there. I am using a proxy to crawl some sites and encounter an ERR_TUNNEL_CONNECTION_FAILED error. I am using BrightData as my proxy service. If I curl my proxy endpoint, I get a meaningful error. For example...

Crawlee not respecting cgroup resource limits

Crawlee doesn't seem to respect resource limits imposed by cgroups. This poses problems for containerised environments, where Crawlee either gets OOM-killed or silently slows to a crawl because it thinks it has much more resource available than it actually does. Reading and setting the maximum RAM is pretty easy: ```typescript function getMaxMemoryMB(): number | null { const cgroupPath = '/sys/fs/cgroup/memory.max';...
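
Continuing that idea, a sketch that reads the cgroup v2 limit and hands it to Crawlee via the memoryMbytes configuration option (the same value Crawlee reads from CRAWLEE_MEMORY_MBYTES), so the autoscaled pool sizes itself against the container limit rather than host memory:

```typescript
import { readFileSync } from 'node:fs';
import { CheerioCrawler, Configuration } from 'crawlee';

// cgroup v2 path; cgroup v1 exposes /sys/fs/cgroup/memory/memory.limit_in_bytes instead.
function getMaxMemoryMB(): number | null {
    try {
        const raw = readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
        if (raw === 'max') return null; // no limit configured
        return Math.floor(Number(raw) / 1024 / 1024);
    } catch {
        return null; // not running under cgroup v2
    }
}

const limitMB = getMaxMemoryMB();
if (limitMB !== null) {
    // Tell the autoscaling logic how much memory it may actually use.
    Configuration.getGlobalConfig().set('memoryMbytes', limitMB);
}

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```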

Looking for feedback/review of my scraper

It's already working, but I'm fairly new to scraping and just want to learn the best possible practices. The script is 300-400 lines of TypeScript in total, contains a login routine with session retention, network listeners as well as DOM querying, and runs on a Fastify backend. DM me if you are down ♥️...

Trying out Crawlee, Etsy not working...

Hi Apify,
Thank you for this fine auto-scraping tool, Crawlee! I wanted to try it out along with the tutorial but with a different URL, e.g. https://www.etsy.com/search?q=wooden%20box, but it failed with PlaywrightCrawler. ``` // For more information, see https://crawlee.dev/...

Only the first crawler runs in function

When running the example below, only the first crawler (crawler1) runs; the second crawler (crawler2) does not work as intended. Running either crawler individually works fine, and changing the URL to something completely different also works fine. Here is an example. ``` import { PlaywrightCrawler } from 'crawlee'; ...
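
A likely cause is that both crawlers share the default request queue, so the second crawler sees the start URL as already handled. A sketch that gives each crawler its own named queue (the queue names are arbitrary):

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Separate named queues so the second crawler does not inherit the first
// crawler's already-handled requests from the shared default queue.
const queue1 = await RequestQueue.open('crawler-1');
const queue2 = await RequestQueue.open('crawler-2');

const crawler1 = new PlaywrightCrawler({
    requestQueue: queue1,
    async requestHandler({ request, log }) {
        log.info(`crawler1: ${request.url}`);
    },
});

const crawler2 = new PlaywrightCrawler({
    requestQueue: queue2,
    async requestHandler({ request, log }) {
        log.info(`crawler2: ${request.url}`);
    },
});

await crawler1.run(['https://example.com']);
await crawler2.run(['https://example.com']);
```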

How to retry only failed requests after the crawler has finished?

I finished a crawl of around 1.7M requests and got around 100k failed requests. Is there a way to retry just the failed requests?
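
There is no single built-in switch for this that I know of, but one hedged pattern is to record failures in a named dataset during the run and feed them back with fresh uniqueKeys afterwards; the dataset name is arbitrary:

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

// Collect every request that exhausts its retries during the main run.
const failedStore = await Dataset.open('failed-requests');

const crawler = new CheerioCrawler({
    failedRequestHandler: async ({ request }, error) => {
        await failedStore.pushData({ url: request.url, error: error.message });
    },
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.loadedUrl, title: $('title').text() });
    },
});
await crawler.run(['https://example.com']);

// Second pass: re-run only the failed URLs, with fresh uniqueKeys so the
// default queue does not treat them as already handled.
const failedUrls = await failedStore.map((item) => item.url as string);
const retryCrawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.loadedUrl, title: $('title').text() });
    },
});
await retryCrawler.run(failedUrls.map((url) => ({ url, uniqueKey: `retry:${url}` })));
```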

Does Crawlee support ESM?

I'm trying to integrate with Nuxt 3; when I run in production mode it doesn't work: [nuxt] [request error] [unhandled] [500] Cannot find module '/app/server/node_modules/puppeteer/lib/cjs/puppeteer/puppeteer.js' ...

Max Depth option

Hello! Just wondering whether it is possible to set a maximum depth for the crawl? Previous posts (2023) seem to make use of 'userData' to track the depth. Thank you....
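
The userData approach still seems to be the usual way; a sketch that threads a depth counter through enqueueLinks (MAX_DEPTH is arbitrary):

```typescript
import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 3; // hypothetical limit

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        const depth = (request.userData.depth as number | undefined) ?? 0;
        log.info(`Depth ${depth}: ${request.url}`);

        // Stop enqueueing once the limit is reached; each enqueued link
        // carries its own depth in userData.
        if (depth < MAX_DEPTH) {
            await enqueueLinks({ userData: { depth: depth + 1 } });
        }
    },
});

await crawler.run([{ url: 'https://example.com', userData: { depth: 0 } }]);
```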