Crawlee & Apify


This is the official developer community of Apify and Crawlee.


Hello! Need to scrape multiple links within the same page using Puppeteer

The title of my post is not very explicit. I am scraping a website with multiple products, but the category tree of the menu is a little complex: there are product categories at the top level and product families at the sub-level. I need to retrieve a count of the products in each category, but the only way to do that is to go through each product family. I am stuck at this point: how do I enqueue links within the category products page (there is more than one family within a category)? In addition, I need to push the name of the category into the dataset, along with the name of each family, like this: { "category": { "family": "family_name", "detail": 10 } }. Here is my code: ...
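A minimal sketch of one way this could be done with Crawlee's enqueueLinks and userData; the selectors, labels, and page structure below are assumptions, not taken from the original post:

```javascript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        if (request.label === 'CATEGORY') {
            // Grab the category name and pass it along to every family page
            // via userData, so it is still available in the next handler.
            const category = await page.title(); // assumption: category name is in the page title
            await enqueueLinks({
                selector: 'a.family-link', // hypothetical selector for family links
                label: 'FAMILY',
                userData: { category },
            });
        } else if (request.label === 'FAMILY') {
            const { category } = request.userData;
            const familyName = await page.$eval('h1', (el) => el.textContent.trim()); // hypothetical selector
            const productCount = (await page.$$('.product')).length; // hypothetical selector
            await Dataset.pushData({
                category,
                family: familyName,
                detail: productCount,
            });
        }
    },
});

await crawler.run([{ url: 'https://example.com/categories', label: 'CATEGORY' }]);
```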

userAgent in different crawlers

How to set the userAgent in different crawlers?
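A sketch of two approaches that should cover both crawler families: for browser crawlers, override the UA on the page before navigation; for HTTP crawlers, set it on the outgoing request headers. The UA strings are placeholders:

```javascript
import { PuppeteerCrawler, CheerioCrawler } from 'crawlee';

// Browser crawler: set the UA in a pre-navigation hook using Puppeteer's own API.
const puppeteerCrawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) MyCustomUA/1.0');
        },
    ],
    async requestHandler({ page }) { /* ... */ },
});

// HTTP crawler: set the User-Agent header on the outgoing got request.
const cheerioCrawler = new CheerioCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotOptions) => {
            gotOptions.headers = { ...gotOptions.headers, 'user-agent': 'MyCustomUA/1.0' };
        },
    ],
    async requestHandler({ $ }) { /* ... */ },
});
```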

HTTP response status

Is there any way to handle a response status explicitly (Puppeteer/Cheerio)? There seems to be a response parameter in the requestHandler, but I don't know how to use it. For example, I'm looking to handle a 400 response code...
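A minimal sketch of checking that parameter, assuming the response exposed to the handler is the underlying got response for CheerioCrawler and the main navigation response for PuppeteerCrawler:

```javascript
import { CheerioCrawler, PuppeteerCrawler } from 'crawlee';

const cheerioCrawler = new CheerioCrawler({
    async requestHandler({ request, response, $ }) {
        // Note: HTTP-based crawlers may fail 4xx/5xx requests before this
        // handler runs, in which case the error handling hooks apply instead.
        if (response.statusCode === 400) {
            // handle the bad request explicitly, e.g. skip or log it
            return;
        }
    },
});

const puppeteerCrawler = new PuppeteerCrawler({
    async requestHandler({ request, response, page }) {
        // For browser crawlers the response is the main navigation response.
        if (response?.status() === 400) {
            return;
        }
    },
});
```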

page.setRequestInterception(true)

How can I use page.setRequestInterception(true) in PuppeteerCrawler (not in raw Puppeteer)?
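A sketch of one way to do this from a pre-navigation hook, so interception is enabled on the crawler's own page before it navigates; the image-blocking logic is just an illustrative example:

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Enable interception before the crawler navigates to the URL.
            await page.setRequestInterception(true);
            page.on('request', (interceptedRequest) => {
                // Example: drop images, let everything else through.
                if (interceptedRequest.resourceType() === 'image') {
                    interceptedRequest.abort();
                } else {
                    interceptedRequest.continue();
                }
            });
        },
    ],
    async requestHandler({ page, request }) { /* ... */ },
});
```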

Storage of data or returning of results

Hello, this shouldn't take long.
Am I reading correctly (and have tested) that returning results via a promise or a callback isn't an option with this SDK (Crawlee with new PlaywrightCrawler(), for example)? We can only write to Datasets and retrieve them later for use?...
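A short sketch of the pattern that effectively gives you a return value: push to the default dataset in the handler, then read it back once crawler.run() has resolved. This assumes results are small enough to read back in one call:

```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        await Dataset.pushData({ title: await page.title() });
    },
});

await crawler.run(['https://example.com']);

// crawler.run() resolves once crawling finishes, so the results can be read
// back from the default dataset right away and used like a return value.
const { items } = await Dataset.getData();
console.log(items);
```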

Accessing browser.newPage() inside PuppeteerCrawler

Hi, I'm trying to integrate the puppeteer-extra-plugin-recaptcha into my crawling, and I've gotten everything working except for one bit: in the documentation it says I need to create a new page with
const page = await browser.newPage()
However, I can't figure out where to hook in that call to get the captcha integration working properly. My thought was that it would need to be done in the preNavigationHooks - maybe through crawlingContext?...
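A sketch of one way to wire the plugin in: pass puppeteer-extra to the crawler as the launcher, so every page the crawler opens (including the one handed to requestHandler) already has the plugin applied and no manual browser.newPage() is needed. The provider credentials are placeholders:

```javascript
import { PuppeteerCrawler } from 'crawlee';
import puppeteerExtra from 'puppeteer-extra';
import RecaptchaPlugin from 'puppeteer-extra-plugin-recaptcha';

puppeteerExtra.use(RecaptchaPlugin({
    // Assumption: a 2captcha account; use your own provider and token.
    provider: { id: '2captcha', token: process.env.TWOCAPTCHA_TOKEN },
}));

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Use puppeteer-extra instead of plain puppeteer, so pages created by
        // the crawler get the recaptcha plugin behaviour automatically.
        launcher: puppeteerExtra,
    },
    async requestHandler({ page }) {
        // The plugin augments the page object with solveRecaptchas().
        await page.solveRecaptchas();
        // ... continue scraping
    },
});
```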

Parallel crawling

How to do parallel crawling in PuppeteerCrawler?
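PuppeteerCrawler already processes requests in parallel through its autoscaled pool; a sketch of bounding that parallelism with the concurrency options (the numbers are arbitrary):

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // The crawler scales between these bounds based on system load.
    minConcurrency: 5,
    maxConcurrency: 20,
    async requestHandler({ page, request }) {
        // Each request runs in its own page, many of them at the same time.
    },
});

await crawler.run(['https://example.com/page-1', 'https://example.com/page-2']);
```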

JSON logging with Crawlee?

Hi! Instead of the (nicely formatted) default log lines that are intended for human consumption, I'm trying to make Crawlee output structured JSON logs for analytics. Is there any convenient way to do that?
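One possible approach, assuming the @apify/log package (which Crawlee's log object comes from) exposes a LoggerJson implementation that can be swapped in via setOptions - worth verifying against the current @apify/log API before relying on it:

```javascript
import { log } from 'crawlee';
import { LoggerJson } from '@apify/log';

// Replace the default human-readable text logger with a JSON logger,
// so every log line is emitted as a single structured JSON object.
log.setOptions({ logger: new LoggerJson() });

log.info('Crawler started', { startUrls: 2 });
```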

How to disable the duplicates check

```javascript
import { Dataset, HttpCrawler, log, LogLevel } from 'crawlee';

log.setLevel(LogLevel.DEBUG);

const crawler = new HttpCrawler({
    useSessionPool: false,
    // ...
```
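Crawlee deduplicates requests by their uniqueKey, which is derived from the URL by default, so the usual way to bypass the check is to give each request its own uniqueKey. A sketch; the key values are arbitrary:

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) { /* ... */ },
});

// Same URL enqueued twice: distinct uniqueKeys stop the request queue from
// treating the second one as a duplicate and dropping it.
await crawler.run([
    { url: 'https://example.com/data', uniqueKey: 'run-1' },
    { url: 'https://example.com/data', uniqueKey: 'run-2' },
]);
```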

Resume crawler based on request queues from previous run locally and in apify

Is it possible to stop a crawler and resume it from the previous run's request queues? I have a crawler that has run for a couple of hours locally and I would like to add proxies to it to speed up processing, because I am getting throttled by using a single IP, but without starting from scratch, since that would be unnecessary and a waste of time. I want to use my existing request queues. Is this possible? Also, is this possible on Apify?...
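A sketch of keeping the local request queue between runs, assuming the purge-on-start behaviour is what normally wipes it (it can also be disabled with the CRAWLEE_PURGE_ON_START environment variable); on the Apify platform, resurrecting the same run reuses its default request queue. The proxy URL is a placeholder:

```javascript
import { CheerioCrawler, Configuration, ProxyConfiguration } from 'crawlee';

// Keep storage from the previous run instead of purging it when the process starts.
const config = new Configuration({ purgeOnStart: false });

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'], // hypothetical proxy
});

const crawler = new CheerioCrawler({
    proxyConfiguration, // add proxies for the resumed run
    async requestHandler({ request, $ }) { /* ... */ },
}, config);

// Calling run() with no start URLs continues from whatever is still
// pending in the existing default request queue.
await crawler.run();
```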

Saving fingerprints and cookies in database

Hello there! Is it possible to store fingerprints, cookies, etc. in a database - to save them automatically and load them when needed?
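Out of the box, the session pool (which holds cookies and session usage state) persists itself to a key-value store rather than a database. A sketch of pointing that persistence at a named store, from which the state record could then be exported to a database by your own code; the store name and record key are assumptions:

```javascript
import { PuppeteerCrawler, KeyValueStore } from 'crawlee';

const crawler = new PuppeteerCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        // Where the pool snapshots its state (sessions, cookies, error counts).
        persistStateKeyValueStoreId: 'session-state', // hypothetical store name
        persistStateKey: 'MY_SESSION_POOL',           // hypothetical record key
    },
    async requestHandler({ page }) { /* ... */ },
});

await crawler.run(['https://example.com']);

// Read the persisted state back; from here it could be written to a database.
const store = await KeyValueStore.open('session-state');
const sessionPoolState = await store.getValue('MY_SESSION_POOL');
```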

How can you find HTML element that can be clicked on to use dropdown?

I have managed to find the element for some dropdowns, but for one specific dropdown I can't find it. Is there any HTML attribute I can look at to determine which element can be clicked? There does not seem to be an onclick handler or similar...

Puppeteer crawler: loop over elements list

```javascript
router.addDefaultHandler(async ({ page, request, enqueueLinks, log }) => {
    log.info('enqueueing new URLs');
    await enqueueLinks({
        selector: "div[role='article'] > a",
        label: 'detail', // corresponding to the handler for processing
        // ...
```
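The post title suggests iterating over the list of card elements inside the handler; a minimal sketch of one way to do that (the selectors and field names are assumptions):

```javascript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        // Extract one record per card in the list, all inside the browser context.
        const items = await page.$$eval("div[role='article']", (cards) =>
            cards.map((card) => ({
                title: card.querySelector('a')?.textContent?.trim(),
                href: card.querySelector('a')?.href,
            })),
        );
        await Dataset.pushData(items);
    },
});
```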

How to add images to readme from Github repo?

Currently I have ![Sample reviews](…), but this does not work locally or on Apify; it does work when viewing the README inside GitHub.

Injecting Axe a11y tester

I would like to use Crawlee to crawl a bunch of internal sites and run the Axe accessibility scanner on each page. I figured out how to inject the script they reference in their getting-started docs (https://github.com/dequelabs/axe-core#getting-started) using page.addInitScript.
```
import { PlaywrightCrawler, Dataset } from 'crawlee';
import axe from 'axe-core';
// ...
```
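A sketch of the whole flow under those assumptions: inject axe-core's bundled source before every page loads, run it inside the page, and push the violations to the dataset (axe.source and axe.run() are the standard axe-core entry points):

```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';
import axe from 'axe-core';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // axe.source is the bundled axe-core script; addInitScript makes it
            // run in every document before the page's own scripts execute.
            await page.addInitScript({ content: axe.source });
        },
    ],
    async requestHandler({ page, request }) {
        // axe is now available on window inside the page.
        const results = await page.evaluate(async () => window.axe.run());
        await Dataset.pushData({
            url: request.loadedUrl,
            violations: results.violations,
        });
    },
});

await crawler.run(['https://example.com']);
```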

Ability to change scraping speed and concurrency while the script is running

It would be nice to be able to change the scraping speed while the crawler is running, if you determine it is running too fast for the site to keep up, without having to stop and start the crawler again. It could be done from the Crawlee CLI, for example....

How can I write a CSS selector that matches the start of a class name?

For example, "styles_verificationIcon___X7KO": I want to find an element by a class that starts with "styles_verificationIcon". This does not work: document.querySelector("[class^=styles_verificationIcon]");...
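The ^= operator matches against the whole class attribute, so it fails whenever the element has another class listed before the generated one; a small sketch of selectors that tend to be more robust (quoting the value is also worth doing):

```javascript
// Matches if the class attribute *contains* the prefix anywhere,
// which also covers elements that have several classes.
document.querySelector('[class*="styles_verificationIcon"]');

// Matches only if the attribute value starts with the prefix,
// i.e. the generated class is the first (or only) class on the element.
document.querySelector('[class^="styles_verificationIcon"]');
```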

Disable image in playwright

How can I disable downloading images, videos, and other media globally for my scraper?
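A sketch using Playwright's own request routing from a pre-navigation hook, aborting requests by resource type; the set of blocked types is an assumption about what should count as media here:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort any request whose resource type we do not want to download.
            await page.route('**/*', (route) => {
                if (BLOCKED_TYPES.has(route.request().resourceType())) {
                    return route.abort();
                }
                return route.continue();
            });
        },
    ],
    async requestHandler({ page }) { /* ... */ },
});
```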

enqueueLinks with pagination

How can I use pagination with a route? I have a route that I call to get a list of cards with links, which I add to the request queue, and then I need to paginate to the next page using the same route. My guess is to use router.call(), but I am not sure what to pass. I also tried // https://dk.trustpilot.com/categories/*?page=*, but this does not work either. page=0 is a 404, so I need to start from 1 and go up....
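A sketch of one common pattern: instead of calling the router again, enqueue the next page URL back onto the queue with the same label so the same handler processes it. The selector is an assumption and the URL shape follows the trustpilot-style example above:

```javascript
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addHandler('LIST', async ({ request, page, enqueueLinks, crawler, log }) => {
    // 1) Enqueue the detail pages from the cards on the current listing page.
    await enqueueLinks({
        selector: "div[class*='card'] > a", // hypothetical card selector
        label: 'DETAIL',
    });

    // 2) Enqueue the next page of the same listing with the same label,
    //    starting from page=1 and incrementing.
    const currentPage = Number(new URL(request.url).searchParams.get('page') ?? '1');
    const nextUrl = new URL(request.url);
    nextUrl.searchParams.set('page', String(currentPage + 1));

    // The stop condition is site-specific; here we rely on the next page
    // eventually containing no cards or failing.
    await crawler.addRequests([{ url: nextUrl.toString(), label: 'LIST' }]);
    log.info(`Enqueued page ${currentPage + 1}`);
});
```

The start request would need the same 'LIST' label (for example .../categories/some-category?page=1) so the handler above picks up the first page.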

Mark session as bad when request times out or proxy responds with 502

I'm using CheerioCrawler and I'd like to mark sessions as bad when the request either times out or there's a proxy error. Those cases trigger an error before reaching requestHandler and the request is added back to the queue without me having the opportunity to mark the session. Is there a hook somewhere that I can use? Or should I override _requestFunctionErrorHandler?
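A sketch using the crawler-level errorHandler, which in Crawlee v3 runs on each failed attempt before the request is retried and still has the session on the context; matching the error message text is an assumption about how the timeout and proxy errors surface:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    async requestHandler({ $, session }) { /* ... */ },

    // Called whenever a request fails (timeouts, proxy errors, ...),
    // before Crawlee re-enqueues it for a retry.
    async errorHandler({ session, request }, error) {
        const message = error?.message ?? '';
        if (message.includes('timed out') || message.includes('502')) {
            // Retire the session so the retry picks a fresh one.
            session?.markBad();
        }
    },
});
```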