Crawlee & Apify

This is the official developer community of Apify and Crawlee.

crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻devs-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

Wiping session between inputs

Hello! I'm crawling/scraping a site, which involves doing the following steps for each input: 1. entering some data, 2. doing a bunch of "Load more" clicks, 3. collecting the output...
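A pattern that tends to fit this flow is retiring the session after each input, so the next input starts with clean cookies and storage. A minimal sketch; the handler body and the retire-per-input policy are assumptions, not the asker's code:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    async requestHandler({ page, session, pushData }) {
        // 1. enter the data, 2. click "Load more" until done, 3. collect
        // ... site-specific steps elided ...
        await pushData({ url: page.url() });
        // Retire the session so its cookies/state are not reused for the next input.
        session?.retire();
    },
});
```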

preNavigationHooks not followed

I'm using the Camoufox JS integration. If I log something before the `await page.route(...)` call it works, but inside `page.route` it doesn't. ```typescript preNavigationHooks: [...
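For comparison, a hook that registers a route handler generally looks like the sketch below; note that the `page.route` callback only fires when a matching network request actually happens during navigation, not at hook time, which may explain why logs inside it appear skipped. A generic sketch, not the asker's exact code:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, log }) => {
            log.info('Hook runs once, before navigation'); // fires per request
            await page.route('**/*', async (route) => {
                // Fires only when a matching network request occurs,
                // i.e. during/after navigation starts.
                log.info(`Intercepted: ${route.request().url()}`);
                await route.continue();
            });
        },
    ],
    async requestHandler({ page }) {
        // ...
    },
});
```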

Proxy settings appear to be cached

Hi, I'm trying to use residential proxies with a Playwright crawler, but it appears that even when I comment out the proxyConfiguration, there is still an attempt to use a proxy. I created a fresh project with a minimal test to debug, and it worked fine until I had a proxy failure; then it happened again. The error is: WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session... ...
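If the phantom proxy attempt comes from persisted state (for example a session pool snapshot restored from an earlier run), purging the default storages before building the crawler is one way to rule that out. A sketch under that assumption:

```typescript
import { PlaywrightCrawler, purgeDefaultStorages } from 'crawlee';

// Drop persisted state from previous runs, including any saved
// session pool snapshot that might still reference old proxy sessions.
await purgeDefaultStorages();

const crawler = new PlaywrightCrawler({
    // proxyConfiguration intentionally omitted: no proxy should be used at all
    async requestHandler({ page }) {
        // ...
    },
});
```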

Caching requests for development and testing

Hi, I'm wondering what people are doing (if anything) to record and replay requests while building scrapers. A lot of building scrapers is trial and error, making sure you have the right selectors, JSON paths, etc., so I end up running my code a fair few times. Ideally I'd cache the initial request to each endpoint and replay it when it's requested again, just for development, so I'm not continually hitting the website (both for politeness and to reduce the chances of triggering any antibot provisions). Thinking back to my Ruby days, there was a package called VCR which would do this if you instantiated it before HTTP requests, with ways to invalidate the cache. In JS there's Netflix's Polly, which I'm going to try out shortly, but I'm interested to hear what other people are doing/using, if anything....
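For the record, wiring up Polly in Node looks roughly like this; whether its node-http adapter actually intercepts got-scraping's requests is something to verify, so treat it as a sketch:

```typescript
import { Polly } from '@pollyjs/core';
import NodeHttpAdapter from '@pollyjs/adapter-node-http';
import FSPersister from '@pollyjs/persister-fs';

Polly.register(NodeHttpAdapter);
Polly.register(FSPersister);

// Record each HTTP request once, replay it from disk on subsequent runs.
const polly = new Polly('scraper-dev', {
    adapters: ['node-http'],
    persister: 'fs',
    recordIfMissing: true,
});

// ... run the scraper here ...

await polly.stop(); // flushes recordings to disk
```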

Customising logging

Is there a recommended way to customise logging? I want to be able to log which specific crawler and which handler a log is coming from. I have tried to override the logger in the crawler using ```import defaultLog, { Log } from '@apify/log'; ... const crawler = new BasicCrawler({ requestHandler: router,...
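One approach that uses only documented pieces is giving each crawler a child logger with its own prefix, so every line identifies the crawler; a minimal sketch:

```typescript
import { BasicCrawler, log } from 'crawlee';

// Each crawler gets a child logger; the prefix appears on every log line.
const crawler = new BasicCrawler({
    log: log.child({ prefix: 'ProductCrawler' }),
    async requestHandler({ request, log }) {
        // The context log inherits the crawler's prefix;
        // identify the handler per message.
        log.info('detail handler', { url: request.url });
    },
});
```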

How to clear cookies?

I need to clear the cookies for a website before requesting it using the CheerioCrawler, how do I do it? TIA
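If the goal is simply that requests go out without previously collected cookies, the session options cover it; a sketch:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Don't carry collected cookies over between requests of the same session...
    persistCookiesPerSession: false,
    // ...or disable sessions entirely so every request starts cookie-less:
    // useSessionPool: false,
    async requestHandler({ $, request }) {
        // ...
    },
});
```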

Browserless + Crawlee

Hello, Is there any way to run Crawlee on Browserless?...
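Outside of Crawlee, connecting Puppeteer to a remote Browserless instance is just puppeteer.connect; plugging such a connection into Crawlee's browser pool is the open question here. A sketch of the raw connection (the endpoint URL and token are placeholders):

```typescript
import puppeteer from 'puppeteer';

// Connect to a remote Browserless instance instead of launching a local browser.
const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.browserless.io?token=YOUR_TOKEN',
});

const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.title());

// Disconnect rather than close, leaving the remote browser running.
await browser.disconnect();
```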

How to handle 403 error response using Puppeteer and JS when click on the button which hit an API

We are building a scraper for a site that uses client-side pagination; when we click "Next page" it calls an API, but the API returns 403 because they detect the traffic is coming from a bot. How can we bypass that while opening the browser or while doing the scraping? Any suggestion will be helpful....

Request works in Postman but doesn't work in crawler, even with a full browser

Hello, I'm trying to handle an AJAX call via got-scraping. I prepared the call in Postman, where it works fine. But when I try it in an Actor, I get a 403 every time. Even if I try it via Puppeteer or Playwright and click the button that triggers the request, I get a response with a geo.captcha-delivery.com/captcha URL to solve. Can anybody give me advice on how to handle this issue?...

about RESIDENTIAL proxies

Hi all, what is your experience with RESIDENTIAL proxies? Let us share: - provider URL - price per GB of residential traffic...

served with unsupported charset/encoding: ISO-88509-1

Reclaiming failed request back to the list or queue. Resource http://www.etmoc.com/look/Looklist?Id=47463 served with unsupported charset/encoding: ISO-88509-1
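When a server declares a bogus charset like this, the crawler can be told which encoding to use instead of trusting the header. A sketch, assuming the page really is ISO-8859-1 or similar:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Used only when the server's declared encoding is invalid or unsupported.
    suggestResponseEncoding: 'iso-8859-1',
    // Or override unconditionally, ignoring the server's header entirely:
    // forceResponseEncoding: 'utf-8',
    async requestHandler({ $ }) {
        // ...
    },
});
```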

Cannot detect CDP client for Puppeteer

Hi, how do I fix this? `Failed to compile...

error in loader module

Hi! I'm getting an error with Lodash in Crawlee, please help. I ran the Actor and got this error. I tried changing to different versions of Crawlee, but the error still persists. node:internal/modules/cjs/loader:1140...

Saving the working configurations & Sessions for each sites

Hi! I'm new to Crawlee and super excited to migrate my scraping architecture to it, but I can't figure out how to achieve this. My use case: ...

Request queue with id: [id] does not exist

I created an API with Express that runs Crawlee when an endpoint is called. Weirdly, it works completely fine on the first request I make to the API but fails on the next ones. I get the error: Request queue with id: [id] does not exist....
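A common workaround, assuming the default queue is dropped or purged after the first run in the same process, is opening a fresh, uniquely named queue per API call; a sketch (the route and URL are placeholders):

```typescript
import express from 'express';
import { CheerioCrawler, RequestQueue } from 'crawlee';

const app = express();

app.get('/scrape', async (req, res) => {
    // A uniquely named queue per call avoids reusing (and re-dropping)
    // the default queue across runs in the same process.
    const requestQueue = await RequestQueue.open(`run-${Date.now()}`);
    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ request }) {
            // ...
        },
    });
    await crawler.run(['https://example.com']);
    await requestQueue.drop(); // clean up the per-run queue
    res.json({ ok: true });
});

app.listen(3000);
```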

Only-once storage

Hello all, I'm looking to understand how Crawlee uses storage a little better and have a question regarding that: Crawlee truncates the storage of all indexed pages every time I run. Is there a way to not have it do that? Almost like using it as an append-only log for new items found....
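Crawlee purges the default storages on start by default; turning that off should give the append-only behaviour described. A sketch:

```typescript
import { CheerioCrawler, Configuration } from 'crawlee';

// Keep previous runs' storage instead of truncating it on startup.
// The CRAWLEE_PURGE_ON_START=0 environment variable has the same effect.
const config = new Configuration({ purgeOnStart: false });

const crawler = new CheerioCrawler({
    async requestHandler({ request, pushData }) {
        await pushData({ url: request.url }); // appends to the existing dataset
    },
}, config);
```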

Camoufox failing

I have a project that uses the PlaywrightCrawler from Crawlee. If I create the Camoufox template, it runs perfectly; but when I take the same commands from the template's package.json and basically follow the same example in my project, I get the following error: ``` 2025-03-13T11:58:38.513Z [Crawler] [INFO ℹ️] Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}...

Redirect Control

I'm trying to make a simple crawler; how do I properly control redirects? Some bad proxies sometimes redirect to an auth page, and in that case I want to mark the request as failed if the redirect target URL contains something like /auth/login. What's the best way to handle this scenario and abort the request early?
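One way that stays within documented APIs is checking the final URL after redirects in the handler and failing the request without retries; a sketch:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // loadedUrl is the final URL after any redirects.
        if (request.loadedUrl?.includes('/auth/login')) {
            request.noRetry = true; // fail immediately, don't reclaim to the queue
            throw new Error(`Redirected to login page from ${request.url}`);
        }
        // ... normal extraction ...
    },
});
```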

TypeError: Invalid URL

Adding requests with crawler.run(["https://website.com/1234"]); works locally, while in the Apify cloud it breaks with the following error: Reclaiming failed request back to the list or queue. TypeError: Invalid URL. It appears that while running in the cloud, the URL is split by character and each character creates a request in the queue, as can be seen in the screenshot. The bug happens whether the URL is hardcoded in the code or added dynamically via input....
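For what it's worth, per-character requests are exactly what you get when a plain string reaches code that expects an iterable of URLs, since strings are themselves iterable; it's worth auditing how the input travels through the Actor. A generic illustration, not a diagnosis of this specific run:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // ...
    },
});

// A string is iterable, so code that spreads it yields one "request" per char:
// [...'https://website.com/1234'] -> ['h', 't', 't', 'p', ...]
// Always pass an array of URLs, never a bare string:
await crawler.run(['https://website.com/1234']); // one request
```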

How to ensure dataset is created before pushing data to it?

I have a public Actor, and some of my users experience that either the default and/or named datasets don't seem to exist and somehow aren't created when data is pushed to them. This is the error message I can see affecting a handful of user runs: ```bash ...
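Opening the dataset explicitly before the first push creates it if it doesn't exist yet, which sidesteps racing on the first write. A minimal sketch using the Apify SDK; 'my-results' is a placeholder name:

```typescript
import { Actor } from 'apify';

await Actor.init();

// openDataset() creates the dataset if it doesn't exist,
// so the subsequent push always has a target.
const dataset = await Actor.openDataset('my-results'); // omit the name for the default dataset
await dataset.pushData({ hello: 'world' });

await Actor.exit();
```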