Crawlee & Apify

This is the official developer community of Apify and Crawlee.

How to add headers to `addRequests`

Hello, I'm using the `addRequests` function to crawl an array of URLs. How do I add a header to this crawl request? Thanks!
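For context, `addRequests()` accepts full request objects rather than just URL strings, and a request object can carry a `headers` map that HTTP-based crawlers such as `CheerioCrawler` send with the request (browser crawlers would set headers in a pre-navigation hook instead). A minimal sketch; the URL and header values are placeholders:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // The headers attached below were sent with this request.
        console.log(request.url, $('title').text());
    },
});

// Pass request objects instead of plain URL strings to attach headers.
await crawler.addRequests([
    {
        url: 'https://example.com/page',
        headers: {
            Authorization: 'Bearer <token>',
            'X-Custom-Header': 'value',
        },
    },
]);

await crawler.run();
```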

File download causes: waiting until "load" error

If the link on the website is `<a href='page.html'>link</a>` everything works fine, but if it's `<a href='image.png'>link</a>` I get this error: ``` ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_ABORTED at https://mysite.com/?attachment_id=24365 =========================== logs ===========================...
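Not from the thread, just a sketch of one common workaround: the browser aborts navigation to direct file downloads, so such links can be kept out of the Playwright queue entirely, e.g. by filtering them in `enqueueLinks`. The extension list is illustrative:

```ts
import { PlaywrightCrawler } from 'crawlee';

// URLs the browser would download instead of rendering.
const BINARY_EXTENSIONS = /\.(png|jpe?g|gif|pdf|zip|docx?)(\?.*)?$/i;

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, enqueueLinks }) {
        // ... scrape the rendered page ...
        await enqueueLinks({
            // Skip file links so page.goto never hits net::ERR_ABORTED;
            // returning a falsy value drops the request.
            transformRequestFunction: (request) =>
                BINARY_EXTENSIONS.test(request.url) ? false : request,
        });
    },
});
```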

isTaskReadyFunction failing randomly

I've built a CheerioCrawler that doesn't do anything super fancy: it takes a start URL, then it has two enqueueLinks functions, and another handler that saves the URL and the body of the page to the dataset. I've exposed the GC and run it after both request handlers, and where I save the body I assign it to null after saving. But I get this error randomly, sometimes at the beginning of the script, sometimes after 20k items scraped, sometimes after 50k, but I could never get past 50-55k items. macOS Ventura 13.1...

adding other libraries

I don't know if it works or not, but would it work to add other libraries from npm, like fs or readline, in a Crawlee crawler?
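For what it's worth, a Crawlee crawler is plain Node.js code, so Node built-ins such as `fs` (and any installed npm package) can be imported and used inside handlers. A tiny sketch:

```ts
import { writeFile } from 'node:fs/promises';
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Any Node.js module works here, e.g. writing the page title to disk.
        await writeFile('last-title.txt', $('title').text());
        console.log('Scraped', request.url);
    },
});

await crawler.run(['https://example.com']);
```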

scraping at scale

How should I structure my crawler when scraping possibly hundreds of different sites with different structures, handling multiple requests at once in Crawlee?
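One common shape for this, sketched under the assumption that each site gets its own handler: label the start requests per site and register handlers on Crawlee's router, letting the autoscaled pool deal with concurrency. Site names and selectors below are placeholders:

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// One handler per site structure, selected by the request's label.
router.addHandler('SITE_A', async ({ $, pushData }) => {
    await pushData({ site: 'A', title: $('h1').text() });
});
router.addHandler('SITE_B', async ({ $, pushData }) => {
    await pushData({ site: 'B', title: $('.product-name').text() });
});
router.addDefaultHandler(async ({ request, log }) => {
    log.warning(`No handler for ${request.url}`);
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    maxConcurrency: 20, // upper cap; Crawlee autoscales below it
});

await crawler.addRequests([
    { url: 'https://site-a.example.com', label: 'SITE_A' },
    { url: 'https://site-b.example.com', label: 'SITE_B' },
]);
await crawler.run();
```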

How to store array of objects in the same json file?

I don't really understand datasets. I want to store an array of objects in the same JSON file, so I can connect this JSON file to table APIs or convert it to CSV....
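For reference, a dataset stores each pushed object as a separate record, but the whole dataset can be exported as one JSON (or CSV) file, which yields the array-of-objects shape. A sketch using the default dataset; the key name `OUTPUT` is arbitrary:

```ts
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Each call appends one object (one "row") to the dataset.
await dataset.pushData({ url: 'https://example.com/1', price: 10 });
await dataset.pushData({ url: 'https://example.com/2', price: 20 });

// Export the whole dataset as a single JSON array into the
// default key-value store (exportToCSV works the same way).
await dataset.exportToJSON('OUTPUT');

// Or read everything back as one array in memory.
const { items } = await dataset.getData();
console.log(items.length, 'records');
```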

How to launch PlaywrightCrawler inside BasicCrawler?

So I have this code: ```ts const cookieJar = new CookieJar(); export const basicCrawler = new BasicCrawler({...

chromium crashes on Docker on Mac M1

Hi, has anyone tried to run Crawlee in Docker on a MacBook Pro M1? When I run the container I get this warning: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s...

Playwright Crawler fails on undefined page

Hello there! I just built my first actor using the Apify CLI. I chose to use a TypeScript Playwright crawler. By default it uses the createPlaywrightRouter() function to create a router and passes it to the requestHandler of the PlaywrightCrawler. All seems well, and according to TypeScript, I should be able to access a page object in the handler (I'm only using addDefaultHandler). However, when I run the actor on the Apify platform, it fails with the following exception: ```2023-02-09T15:16:45.925Z INFO PlaywrightCrawler: Start of default handler ...
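For comparison, a minimal version of that scaffolded wiring in which `page` is available in the default handler (actor-specific details omitted):

```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// `page` comes from the PlaywrightCrawlingContext; it should be defined
// whenever the router is registered on a PlaywrightCrawler.
router.addDefaultHandler(async ({ page, request, log }) => {
    log.info('Start of default handler');
    const title = await page.title();
    log.info(`${request.url}: ${title}`);
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(['https://crawlee.dev']);
```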

Running default Playwright example on Docker (arm64)

Hi there, thanks for the awesome work on Crawlee! I'm struggling at... the very beginning. I'm trying to execute the Playwright example on the actor-node-playwright-chrome:16 image, but it fails. I'm on an M1/arm64 machine but have tried to force amd64 with the same result. ...

How to manually pass datasets, sessions, cookies, proxies between Requests?

It might be obvious, but I have not been able to figure this out, either in the documentation or in the forums. I want to manually manage my datasets and sessions, but I want to make a Request use a session I have created and to pass the dataset on to the handler of the request. I know I could pass them via userData, or I could create them in a different file and simply import them, but these seem like the wrong approaches....
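One possible pattern, sketched rather than prescribed: named storages can be re-opened anywhere by name, so a handler can look up which dataset to use from a small identifier carried in `userData` instead of passing the object itself. The dataset names are placeholders:

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Re-open the named dataset chosen when the request was enqueued.
        const dataset = await Dataset.open(request.userData.datasetName);
        await dataset.pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.addRequests([
    { url: 'https://example.com/a', userData: { datasetName: 'site-a-results' } },
    { url: 'https://example.com/b', userData: { datasetName: 'site-b-results' } },
]);
await crawler.run();
```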

Handle a 401 in errorHandler by detecting login form and gracefully continuing if present

Hello there! I'm working on a page crawler that can handle logging into sites and then crawling around as that user. We've had a lot of success so far with Crawlee (PuppeteerCrawler) by detecting the login form in the requestHandler, logging in, and then continuing with the crawl. Recently we were asked to support "logging in" to a simple password protection screen on a Netlify site....
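A sketch of the flow described above with `PuppeteerCrawler`: detect a password field in the `requestHandler`, log in, and only then continue the crawl. The selectors and the `SITE_PASSWORD` environment variable are assumptions, not part of the original post:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, enqueueLinks, log }) {
        // A simple password-protection screen exposes a password input.
        const passwordField = await page.$('input[type="password"]');
        if (passwordField) {
            log.info(`Login form detected on ${request.url}, logging in`);
            await passwordField.type(process.env.SITE_PASSWORD ?? '');
            await Promise.all([
                page.waitForNavigation(),
                page.click('button[type="submit"]'),
            ]);
        }
        // Continue the normal crawl as the authenticated user.
        await enqueueLinks();
    },
});
```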

'undefined' in Dataset is keeping me from exporting data

When I run Dataset.getData() I keep finding one of the values is 'undefined', and this is preventing me from calling Dataset.exportToCSV() because it complains about the undefined value. Is there a way to clean the Dataset so that the undefined value does not appear, or to find out why this is happening?...
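A sketch of one way to avoid this at the source, assuming the offending field can simply be dropped: strip `undefined` values before pushing, so the export never sees them (field names and selectors are illustrative):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        const item = {
            url: request.url,
            title: $('title').text(),
            price: $('.price').text() || undefined, // may be missing on some pages
        };
        // Drop undefined fields before storing so the later CSV export
        // does not trip over them.
        const cleaned = Object.fromEntries(
            Object.entries(item).filter(([, value]) => value !== undefined),
        );
        await pushData(cleaned);
    },
});
```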

Persist Puppeteer tab with page.goto

Whenever I use page.goto in a PuppeteerCrawler handler, a new browser is opened and the previous one is closed, which prevents me from preserving the session. How can I make sure that the same tab is being used when I call page.goto?

How to increase max memory?

I am running a script that needs concurrency. I have 64 GB of RAM available and I want to use it to the max. I am running my script on a server, so there is not much else running. The problem is, at around 15 GB I always get a memory overloaded error. I have tried: ``` config.set('memoryMbytes', 50_000)...
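Possibly relevant, as a sketch only: Crawlee treats only part of the system memory as available by default, and the limits take effect on the `Configuration` instance the crawler actually uses, so passing it explicitly (or setting the `CRAWLEE_MEMORY_MBYTES` environment variable) may behave differently from a stray `config.set(...)` call:

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

// By default Crawlee only considers a fraction of system memory available,
// which on a 64 GB machine can throttle scaling around the mid-teens of GB.
const config = new Configuration({
    memoryMbytes: 50_000,          // cap on memory Crawlee may use
    // availableMemoryRatio: 0.8,  // alternative: use 80% of system RAM
});

const crawler = new CheerioCrawler(
    {
        maxConcurrency: 200,
        async requestHandler({ request }) {
            console.log(request.url);
        },
    },
    config, // pass the configuration explicitly so it is the one in effect
);

await crawler.run(['https://example.com']);
```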

Stop `keepAlive` crawler after all requests are finished

Hello, I have two crawlers, one running with keepAlive: true (call it A) and the second running normally (call it B), which adds requests to crawler A. After crawler B finishes I'd like to keep crawler A running until it finishes all the requests and then stop the script. I tried the teardown method, but it stops the crawler without finishing the queue....
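One workaround to sketch, not an official API for this: poll the request queue that crawler A consumes and call `teardown()` only once the queue reports it is finished (all requests handled). The queue name and polling interval are assumptions:

```ts
import { setTimeout as sleep } from 'node:timers/promises';
import { CheerioCrawler, RequestQueue } from 'crawlee';

const queueA = await RequestQueue.open('crawler-a');
const crawlerA = new CheerioCrawler({
    keepAlive: true,
    requestQueue: queueA,
    async requestHandler({ request }) {
        console.log('A handled', request.url);
    },
});

const runA = crawlerA.run(); // keeps running because of keepAlive

// ... run crawler B here; it adds requests to queueA ...

// Once B is done, wait until A has drained its queue, then stop it.
while (!(await queueA.isFinished())) {
    await sleep(1000);
}
await crawlerA.teardown();
await runA;
```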

Need help bypassing CF 403 Blocked

Hi guys, I'm new to this community and I'm trying to scrape allpeople.com, which has Cloudflare protection. After reading the docs I came up with two solutions: puppeteer-stealth and playwright/firefox combinations. Both are getting 403 Blocked by CF (I will share code snippets inside the thread). Am I doing something wrong? Or if not, what else can I try to bypass the CF 403?...

Can I use modules inside the `evaluate`?

As the title says: specifically, can I use other modules, like an enum-like object, in the evaluateAll function? For example: ```ts const Foo = {...
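For illustration: the callback passed to `evaluate`/`evaluateAll` runs inside the browser page, so it cannot see Node-side module scope, but plain serializable values such as an enum-like object can be passed in as an explicit argument. The `.item` selector and `data-kind` attribute are made up for the example:

```ts
import { PlaywrightCrawler } from 'crawlee';

// Enum-like object living in Node.js code.
const Foo = {
    BAR: 'bar',
    BAZ: 'baz',
} as const;

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // The callback executes in the browser, so `Foo` is not in scope there.
        // Serializable values must be passed explicitly as the second argument.
        const texts = await page
            .locator('.item')
            .evaluateAll((elements, foo) => {
                return elements
                    .filter((el) => el.getAttribute('data-kind') === foo.BAR)
                    .map((el) => el.textContent ?? '');
            }, Foo);
        console.log(texts);
    },
});
```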

Scrape Monthly Listeners data from Spotify page

Hello community. I wish to scrape the "Monthly Listener" number from this page: https://open.spotify.com/artist/6FBDaR13swtiWwGhX1WQsP I have already spent several hours following along with video tutorials but have not been able to figure it out. Is it possible to scrape that information? Or is it somehow disabled?...

Trying to extend the Dockerfile, can't install using apt-get, getting permission denied

Hey, I'm trying to install stuff in the Dockerfile and it won't let me. Either 1) I get permission denied when running apt-get install, or 2) using sudo says there is no sudo (so shouldn't I already be root?)...