Crawlee & Apify


This is the official developer community of Apify and Crawlee.

crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻devs-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

PuppeteerCrawler proxy rotation

I'm using PuppeteerCrawler from the Crawlee library. I want to rotate proxies; how do I apply proxy rotation to a PuppeteerCrawler? Thanks!...
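A minimal sketch of one way to do this, assuming a list of your own proxy URLs (the URLs below are placeholders): pass a ProxyConfiguration to the crawler and Crawlee rotates through the list for you.

```ts
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs; Crawlee rotates through the list,
// assigning proxies per session/request.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    // tying proxies to sessions lets Crawlee retire blocked ones
    useSessionPool: true,
    async requestHandler({ request, page, proxyInfo }) {
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);
```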

How to use Apify Proxy with PlaywrightCrawler?

Is there any working demo? I couldn't get it working using the documentation provided on crawlee.dev.
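A minimal sketch for Apify Proxy specifically, assuming the code runs as an Apify Actor (or locally with APIFY_PROXY_PASSWORD set); the RESIDENTIAL group is just an example:

```ts
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Builds proxy URLs for Apify Proxy from your account credentials.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // example group
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```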

Scraping auth-protected pages with CheerioCrawler, should I use Session?

I am trying to scrape some pages that only have certain information available when the user is logged in (as a personal project, I understand the risks). At first, I tried to add a request to the queue that executes a POST request to perform a login, then save those cookies into the route handler's session using session.setCookiesFromResponse, and afterwards add the starting point for my scraping. However, for some reason the session is always empty (since the session was destroyed) and the next handler always gets a new session, even though I set the following configuration on my crawler: ```ts...
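One pattern that may help, sketched under assumptions (a hypothetical login endpoint, a LOGIN label on the first request): keep a single session for the whole crawl so the login cookies survive between handlers.

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    // cookies stored on a session are re-sent with its later requests
    persistCookiesPerSession: true,
    // a pool of one keeps the same (logged-in) session for every request
    sessionPoolOptions: { maxPoolSize: 1 },
    async requestHandler({ request, session, sendRequest, crawler }) {
        if (request.label === 'LOGIN') {
            // hypothetical login endpoint; sendRequest shares the
            // session's cookie jar with the rest of the crawl
            const response = await sendRequest({
                url: 'https://example.com/api/login',
                method: 'POST',
                json: { user: 'me', password: 'secret' },
            });
            session?.setCookiesFromResponse(response);
            await crawler.addRequests(['https://example.com/account']);
            return;
        }
        // requests handled here arrive with the login cookies attached
    },
});

await crawler.run([{ url: 'https://example.com/login', label: 'LOGIN' }]);
```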

New page launch options

How can I pass options to a new page whenever it is created or opened?
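A sketch of the two relevant knobs, assuming PuppeteerCrawler: browser-level options go into launchContext.launchOptions, and per-page setup can run in a postPageCreateHooks hook under browserPoolOptions, which fires for every new page before navigation.

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // options for launching the browser itself
    launchContext: {
        launchOptions: { headless: true, args: ['--no-sandbox'] },
    },
    browserPoolOptions: {
        // runs for every freshly created page, before Crawlee navigates
        postPageCreateHooks: [
            async (page) => {
                await page.setViewport({ width: 1366, height: 768 });
                await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US' });
            },
        ],
    },
    async requestHandler({ request, page }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```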

New page middleware

I need to use the adblocker from https://www.npmjs.com/package/@cliqz/adblocker-playwright, but I'm not quite sure what the best way is to integrate it with Crawlee. For now, I create a page with a dummy URL, enable the adblocker on the page, and then navigate to the target website. I am curious whether there is an API to add some kind of middleware to the page creation process, so the adblocker is added to every newly created page before navigating to the target URL. I couldn't find anything suitable in either the Crawlee or Playwright docs 😦 Thank you in advance!...
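For reference, a sketch of exactly that middleware slot, assuming PlaywrightCrawler: browserPoolOptions.postPageCreateHooks runs for each new page before Crawlee navigates to the target URL, which is where the adblocker can be enabled.

```ts
import { PlaywrightCrawler } from 'crawlee';
import { PlaywrightBlocker } from '@cliqz/adblocker-playwright';
import fetch from 'cross-fetch';

// Build the blocker once and reuse it on every page.
const blocker = await PlaywrightBlocker.fromPrebuiltAdsAndTracking(fetch);

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // fires after page creation, before navigation to the target URL
        postPageCreateHooks: [
            async (page) => {
                await blocker.enableBlockingInPage(page);
            },
        ],
    },
    async requestHandler({ request, page }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```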

Split crawler and scraper into two containers

I want to split the crawler and scraper into separate containers; the scraper container calls the crawler container to get HTML, with the two containers talking over gRPC or JSON-RPC. The purpose is to make the two processes independent (the crawler container can cache or store its own data). Thank you!...
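A minimal sketch of the crawler container's RPC surface, assuming plain JSON over HTTP rather than gRPC (the /render endpoint and port are made up for illustration):

```ts
import express from 'express';
import puppeteer from 'puppeteer';

const app = express();
app.use(express.json());

// The scraper container POSTs { url } and gets { html } back.
app.post('/render', async (req, res) => {
    const { url } = req.body;
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle2' });
        res.json({ url, html: await page.content() });
    } catch (err) {
        res.status(500).json({ error: String(err) });
    } finally {
        await browser.close();
    }
});

app.listen(4000);
```

Launching a fresh browser per call keeps the sketch simple; a real service would reuse a browser or a pool.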

Json2csv overwrites columns

Hello, I have an issue with json2csv: each line overwrites the previous one when I create the Excel file. I can see it in VS Code -> dataset; while creating the xlsx file, instead of producing multiple rows, it writes each new line over the previous one and I end up with a file containing a single line. Please help. Thank you. This is my code: import { PlaywrightCrawler, Dataset } from 'crawlee'; import { writeFileSync } from 'fs';...
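One likely cause of this symptom is writing the file once per item, since writeFileSync replaces the whole file each time. A sketch of the fix, assuming the data sits in the default dataset: read everything after the crawl and convert it in a single pass.

```ts
import { Dataset } from 'crawlee';
import { Parser } from 'json2csv';
import { writeFileSync } from 'fs';

// After crawler.run() finishes: fetch all dataset items at once...
const { items } = await Dataset.getData();

// ...and parse the whole array in one call, writing the file once.
const csv = new Parser().parse(items);
writeFileSync('output.csv', csv);
```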

Has anyone found a solution to run Crawlee inside a REST API on demand?

I have managed to get some parts of it working, such that I have a Node.js API that starts my crawler. I have yet to get the request queue to handle additional and concurrent API calls, so I would just like to know if someone has had any luck implementing such a solution. My particular use case for this API requires running in my own cloud instead of on Apify...
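One sketch of the pattern, under assumptions (Express, CheerioCrawler, a GET /crawl?url=... endpoint): give every API call its own named RequestQueue so concurrent crawls don't share or deduplicate against the default queue, and drop it afterwards.

```ts
import express from 'express';
import { randomUUID } from 'crypto';
import { CheerioCrawler, RequestQueue } from 'crawlee';

const app = express();

app.get('/crawl', async (req, res) => {
    // a uniquely named queue isolates this call from concurrent ones
    const requestQueue = await RequestQueue.open(randomUUID());
    const results: Array<{ url: string; title: string }> = [];

    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ request, $ }) {
            results.push({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run([String(req.query.url)]);
    await requestQueue.drop(); // clean up this call's storage
    res.json(results);
});

app.listen(3000);
```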

Download PDF file from URL?

Does someone know of a simple npm library for downloading files from a URL in JavaScript/TypeScript?
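On Node 18+ no library is strictly needed; a sketch with the built-in fetch:

```ts
import { writeFile } from 'fs/promises';

async function downloadFile(url: string, path: string): Promise<void> {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`Download failed: ${response.status}`);
    // buffer the body and write it out in one go
    await writeFile(path, Buffer.from(await response.arrayBuffer()));
}

await downloadFile('https://example.com/report.pdf', './report.pdf');
```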

puppeteer_dev_chrome_profile

Running the Puppeteer crawler on Amazon Linux 2 rapidly fills the tmp directory with profile files, using up all the disk space. Can I do something about this?
The files look like this: puppeteer_dev_chrome_profile-igd7yw puppeteer_dev_chrome_profile-wwkJvK puppeteer_dev_chrome_profile-IWxYXH puppeteer_dev_chrome_profile-WX4JH2 ......
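Those directories are Chrome profiles that stick around when the browser process dies without a clean browser.close(). One mitigation to sketch, assuming a Crawlee PuppeteerCrawler running a single browser at a time: point userDataDir at a directory you control and can wipe; otherwise a periodic cleanup job that deletes old puppeteer_dev_chrome_profile-* directories from /tmp does the trick.

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // hypothetical path; keeps profile data out of /tmp so it can
        // be cleaned up (or mounted on a bigger volume) deliberately
        userDataDir: '/home/app/chrome-profile',
    },
    async requestHandler({ request, page }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
```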

Puppeteer download CSV file using JavaScript

Hello, I am trying to use Puppeteer with JavaScript to scrape a website and download a CSV file using await page.click(CSS selector). Unfortunately, no file is ever downloaded. On my local machine it works; screenshots are saved to the Apify storage, but the CSV file is not... does anybody know how to do that? Thank you...
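Headless Chrome silently drops downloads unless told where to put them. A sketch of the usual workaround via the DevTools protocol (the selector and paths are hypothetical); on Apify you would then read the downloaded file and save it to the key-value store yourself:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        // allow downloads and route them to a known directory
        const client = await page.target().createCDPSession();
        await client.send('Page.setDownloadBehavior', {
            behavior: 'allow',
            downloadPath: '/tmp/downloads', // must exist and be writable
        });

        await page.click('a.export-csv'); // hypothetical selector
        // crude wait so the download can finish before the page closes
        await new Promise((resolve) => setTimeout(resolve, 5000));
    },
});

await crawler.run(['https://example.com/reports']);
```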

New Puppeteer version issue

I want to update to the new Puppeteer version, but I get an error.

Run Crawlee using Puppeteer in Docker

How can I use the Puppeteer crawler in Docker? My Dockerfile: FROM node:16 WORKDIR /zserver/app...
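A sketch of an alternative Dockerfile, assuming Apify's prebuilt Puppeteer image (a plain node:16 base usually fails because Chrome's system libraries are missing):

```dockerfile
# Ships Chrome plus the shared libraries Puppeteer needs (Node 16 tag assumed).
FROM apify/actor-node-puppeteer-chrome:16

COPY package*.json ./
RUN npm install --omit=dev
COPY . ./

CMD ["node", "main.js"]
```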

How to execute JavaScript code with Playwright?

How can I execute document.execCommand("insertText", false, "25810") from Playwright?
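Since execCommand runs in the browser, the usual route is page.evaluate; the element that should receive the text needs focus first (the selector below is hypothetical):

```ts
await page.focus('#amount'); // hypothetical selector
await page.evaluate(() => {
    // runs inside the page, so document is available here
    document.execCommand('insertText', false, '25810');
});
```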

PlaywrightCrawler new request results are bleeding into old requests. RequestQueue issue?

Hello, first some code: crawl function ```javascript async function crawl (jobId, websiteURL, cb) {...
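A common cause of this symptom is every crawl() call sharing the default request queue, so URLs and deduplication state leak between jobs. A sketch of one fix, reusing the jobId from the question: open a queue named per job and drop it when done.

```ts
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

async function crawl(jobId: string, websiteURL: string) {
    // a queue named after the job isolates this run from all others
    const requestQueue = await RequestQueue.open(`job-${jobId}`);
    const results: string[] = [];

    const crawler = new PlaywrightCrawler({
        requestQueue,
        async requestHandler({ request, page }) {
            results.push(`${request.url}: ${await page.title()}`);
        },
    });

    await crawler.run([websiteURL]);
    await requestQueue.drop(); // remove the per-job storage
    return results;
}
```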

Scraping a single page with a load-more button

Hi, I just discovered Crawlee and it seems like a great project. I'm scraping a single URL (https://jobs.workable.com/search) that contains a list of items with a load-more button. Each time an item is clicked, a floating modal shows the item's information. In this scenario, all the power of Crawlee to remember visited URLs, handle retries, etc. doesn't help....
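For a single page like this, one request is enough and the clicking can live inside the handler; a sketch (the selectors are guesses, check the real DOM):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandlerTimeoutSecs: 300, // clicking through many batches takes time
    async requestHandler({ page, pushData }) {
        const loadMore = page.locator('button:has-text("Load more")');

        // keep clicking until the button disappears
        while (await loadMore.isVisible()) {
            await loadMore.click();
            await page.waitForLoadState('networkidle');
        }

        // hypothetical selector for the job cards
        const jobs = await page.locator('[data-ui="job"]').allTextContents();
        await pushData(jobs.map((text) => ({ text })));
    },
});

await crawler.run(['https://jobs.workable.com/search']);
```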

The best way to scale a browser pool across multiple machines

As I understand it, there are no problems running Crawlee in a Docker container where the browsers run. But what if you need to create a cluster of machines? Is there built-in functionality for managing a browser pool running across different hosts, or do you have any ideas on how to do this?
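As far as I know there is no built-in cross-host browser pool; the usual approach is to share the work queue rather than the browsers. A sketch of one worker process, assuming the Apify SDK with an APIFY_TOKEN set on every machine so all workers consume the same cloud-hosted queue:

```ts
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// every machine opens the same named queue on the Apify platform;
// forceCloud makes local runs use the platform storage too
const requestQueue = await Actor.openRequestQueue('shared-crawl', { forceCloud: true });

// safe for every worker to seed: the queue deduplicates by uniqueKey
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, page }) {
        // each host runs its own local browser pool; coordination
        // happens only through the shared queue
        await Actor.pushData({ url: request.url, title: await page.title() });
    },
});

await crawler.run();
await Actor.exit();
```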

Scraping multiple items on a page

Hello, I haven't used the Apify SDK for many months and I see some things have changed. Please help me by providing a snippet based on this: https://sdk.apify.com/docs/examples/basic-crawler that will visit a URL, create an array from all elements with the class .branch, extract the text under the class .branch-name, and create a JSON file for each branch, named after its branch name. In the past I built things like that and much more complicated ones, but I have totally forgotten how. And I can't find some of the articles with examples, such as the one that scraped Alexa's sites with their rankings....
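A sketch against the current Crawlee API (the SDK examples moved there), writing one JSON record per branch, keyed by its name; .branch and .branch-name come straight from the question, the URL is hypothetical:

```ts
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        // collect every element with class .branch into a plain array
        const branches = $('.branch')
            .map((_, el) => ({
                name: $(el).find('.branch-name').text().trim(),
                text: $(el).text().trim(),
            }))
            .get();

        for (const branch of branches) {
            // one JSON record per branch, named after the branch
            // (keys may only contain a-zA-Z0-9!-_.'() characters)
            const key = branch.name.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
            await KeyValueStore.setValue(key, branch);
        }
    },
});

await crawler.run(['https://example.com/branches']);
```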

Add a cookie to a specific request

I want to add a cookie for a specific AJAX URL in Puppeteer: page.on('request', async (req) => { await req.continue(); // if you don't call this, it will hang indefinitely }); ...
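A sketch of the interception pattern, assuming a stand-in endpoint and cookie value: interception must be switched on first, and only the matching request gets the extra cookie header.

```ts
// without this, the request handler's continue() is never honored
await page.setRequestInterception(true);

page.on('request', (req) => {
    if (req.url().includes('/api/special')) { // stand-in for the AJAX URL
        req.continue({
            headers: { ...req.headers(), cookie: 'session=abc123' },
        });
    } else {
        req.continue(); // every request must be continued or the page hangs
    }
});
```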

Dataset.pushData()

I am trying to push two separate datasets into two separate folders; the first is just IDs, the second is the whole object with all the data. After running, it creates a _crawlee_temporary_0_ folder with only the second set inside, and I also get an error saying
...
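For two separate folders, a sketch with two explicitly named datasets (the names "ids" and "items" are examples); each named dataset gets its own directory under storage/datasets/:

```ts
import { Dataset } from 'crawlee';

const idsDataset = await Dataset.open('ids');
const itemsDataset = await Dataset.open('items');

// inside the request handler, push to each dataset separately
const item = { id: '42', title: 'Example', price: 19.99 };
await idsDataset.pushData({ id: item.id });
await itemsDataset.pushData(item);
```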