Crawlee & Apify


This is the official developer community of Apify and Crawlee.


New Tab

When I start a crawl with PlaywrightCrawler, it always opens the URL in a new tab, leaving the original empty tab open, which affects performance. Is there a way to turn this off and reuse the empty start tab instead of opening a new one?

Efficient css selectors

Hey, I’m looking for some help picking more efficient CSS selectors. I’ve looked into a few tools but never had much luck speeding anything up. Currently some of my textContent calls are timing out at 10 seconds, and a request takes anywhere from 20-30 seconds. The data is being stored and written to the dataset; there are 17 selectors using textContent() and 6 using count(). I am using a proxy, currently set in the launchContext for the Chromium launcher, so that accounts for some of the latency, but I wasn’t expecting 20-30 seconds 😅...
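One pattern that may help more than the selectors themselves: give each textContent() call a short explicit timeout and run them concurrently, so the total wait is roughly the slowest selector rather than the sum of 17 sequential 10-second waits. A sketch with placeholder selectors (none of these names are from the original post):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // Fail fast (2 s) instead of waiting the 10 s default,
        // and resolve to null instead of throwing on a miss.
        const text = (selector: string) =>
            page.locator(selector)
                .textContent({ timeout: 2_000 })
                .catch(() => null);

        // All reads run concurrently; the total wait is roughly
        // the slowest selector, not the sum of all of them.
        const [title, price, description] = await Promise.all([
            text('h1.product-title'),
            text('span.price'),
            text('div.description'),
        ]);

        await pushData({ title, price, description });
    },
});
```

The same Promise.all wrapping works for the count() calls, which do not wait for elements at all.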

How to transfer data between playwrightcrawler and cheeriocrawler?

I want to scrape JavaScript-generated HTML with PlaywrightCrawler and then parse it with CheerioCrawler:

```ts
const selector = 'div[class="space-y-4 max-w-2xl mx-auto"]';
const htmlCode = await page.innerHTML(selector);
```
...
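You may not need a second crawler here at all: once Playwright has rendered the page, the HTML string can be handed straight to cheerio's load() inside the same request handler. A sketch under that assumption (the selector is from the post; the h2 extraction is a made-up example):

```ts
import * as cheerio from 'cheerio';
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        const selector = 'div[class="space-y-4 max-w-2xl mx-auto"]';
        const htmlCode = await page.innerHTML(selector);

        // Parse the rendered fragment with cheerio directly;
        // no CheerioCrawler (or second pass) needed.
        const $ = cheerio.load(htmlCode);
        const headings = $('h2').map((_, el) => $(el).text()).get();

        await pushData({ headings });
    },
});
```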

Ignore previously crawled URLs

Is there a simple way to ignore previously crawled URLs? Or should I implement logic to detect whether a URL has been crawled before and skip it? My current approach is to store the items in a separate database and then use transformRequestFunction (https://crawlee.dev/docs/introduction/adding-urls#transform-requests) to decide whether or not to crawl each link.
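That transformRequestFunction approach can stay quite small: load the already-crawled URLs from your database into a Set and return false for known ones, which skips the request. A sketch; the Set contents and handler wiring are illustrative assumptions:

```typescript
// URLs previously stored in your database (placeholder data).
const alreadyCrawled = new Set<string>([
    'https://example.com/old-page',
]);

// Returning `false` from transformRequestFunction drops the request;
// returning the request (possibly modified) enqueues it as usual.
const skipSeen = <T extends { url: string }>(request: T): T | false =>
    alreadyCrawled.has(request.url) ? false : request;

// Inside a Crawlee request handler:
// await enqueueLinks({ transformRequestFunction: skipSeen });
```

Note that Crawlee's request queue already deduplicates within a single run; the Set is only needed to carry state across runs.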

How to make Puppeteer crawler ignore errors on page?

I receive some 401 errors while loading the page, but these requests are not important to me. The problem is that the crawl crashes because of that error, and I cannot continue to interact with the page.
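If the 401s come from sub-requests the page itself makes (analytics, a broken API endpoint, etc.), one option is to abort those requests before they fire, using Puppeteer request interception in a preNavigationHook. A sketch; the URL pattern is a placeholder for whatever endpoint is failing:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            await page.setRequestInterception(true);
            page.on('request', (req) => {
                // Placeholder pattern; match the endpoint that returns 401.
                if (req.url().includes('/unimportant-api/')) {
                    void req.abort();
                } else {
                    void req.continue();
                }
            });
        },
    ],
    async requestHandler({ page }) {
        // ... interact with the page as usual
    },
});
```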

chromium.launchPersistentContext with Crawlee

Hi everyone, this doc (https://docs.apify.com/academy/puppeteer-playwright/browser-contexts) shows how to use a persistent context when working with pure Playwright. But how can I combine this with Crawlee? Is there a configuration for this when calling PlaywrightCrawler(...), or a way to get similar behaviour?...
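As far as I know, PlaywrightCrawler's launchContext exposes the knobs that map to a persistent context: a userDataDir plus useIncognitoPages: false. A sketch under that assumption:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Persist cookies and storage across runs in this directory.
        userDataDir: './my-user-data',
        // Reuse the shared (persistent) context instead of opening
        // a fresh incognito page for every request.
        useIncognitoPages: false,
    },
    async requestHandler({ page }) {
        // ...
    },
});
```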

Page.goto never resolves in headful mode (using Xvfb) with the `apify/actor-node-puppeteer-chrome` Docker image

We are able to successfully launch the Chromium browser, but when navigating to certain pages, Puppeteer's page.goto never resolves with a page load event (load or any of the other events). We don't see this behavior when we run the same script (using Chromium 116, Puppeteer 21, and the latest version of Crawlee) outside of a Docker container. We also don't see it on the Apify platform using Actors, but we currently can't use the service due to our security requirements. Happy to share more detail, but we'd appreciate any ideas on where to look. Thanks!...

Throw error that respects maxRequestRetries

Hello, with RetryRequestError the request gets retried an infinite number of times until it succeeds. What error should I throw to respect maxRequestRetries?...
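If I read the docs right, RetryRequestError is the special case that retries regardless of the limit; a plain Error is retried only up to maxRequestRetries, and NonRetryableError skips retries entirely. A sketch (the status-code logic is illustrative):

```ts
import { CheerioCrawler, NonRetryableError } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    async requestHandler({ response }) {
        if (response.statusCode === 503) {
            // A plain Error is retried, but only up to maxRequestRetries,
            // after which the request is marked failed.
            throw new Error('Server busy, retry within the normal limit');
        }
        if (response.statusCode === 404) {
            // NonRetryableError fails the request immediately, no retries.
            throw new NonRetryableError('Gone for good, do not retry');
        }
    },
});
```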

Basic Crawlee how do I use my own proxies?

'Did not expect property proxyConfiguration to exist, got [object Object] in object BasicCrawlerOptions'
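That error suggests BasicCrawler simply has no proxyConfiguration option, since it doesn't make HTTP requests for you. One workaround sketch: build a ProxyConfiguration yourself and pass a fresh proxy URL to sendRequest in the handler (the proxy URL below is a placeholder):

```ts
import { BasicCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy1.example.com:8000'],
});

const crawler = new BasicCrawler({
    async requestHandler({ request, sendRequest }) {
        // Rotate through your own proxies manually.
        const proxyUrl = await proxyConfiguration.newUrl();
        const { body } = await sendRequest({ proxyUrl });
        // ... process body
    },
});
```

If you use CheerioCrawler or one of the browser crawlers instead, proxyConfiguration can be passed directly in the crawler options.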

How to run cheerio crawler with Bun?

Hi, I'm trying to run a CheerioCrawler with Bun, but I get an error as soon as I try to import something. Here's the import error message. I'm not sure what's happening. Any ideas? It's a fresh Crawlee crawler with TypeScript.

Webscraper.io

Does anyone use the webscraper.io Chrome extension to collect web data? I haven't had time to learn how to use Beautiful Soup and thought I would try to take a shortcut.

Playwright crawler failing when an element is not found

I have written a crawler using Playwright. I have a bunch of page.locator calls to find elements and scrape text from them. Most of the elements are always on the page, but a few, like reviews, are not always there, since the product may be new and not have any reviews yet. That would be no problem at all if Playwright / Crawlee didn't fail because of it. What I saw is that when page.locator can't find a given element, it throws an error; that's okay. But Crawlee picks this up as a whole-page error and marks the request to the page as failed. Even though the other locators are working and a lot of data has been found with them, I'm getting messages that the request to url someshop/product-55 failed. How can I tell Crawlee / Playwright not to fail the request when a single page.locator fails? I'm okay with getting an empty string when there are no reviews, but I'm not okay with losing the other data because of one page.locator failure. Example code:

```ts
const a = await page
    .locator(a_locator)
    .textContent(); // element found
```
...
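One way to make optional elements fail soft is to catch the timeout per locator and fall back to an empty string, so a missing reviews block cannot fail the whole request. A sketch with placeholder selectors:

```ts
import { PlaywrightCrawler } from 'crawlee';
import type { Locator } from 'playwright';

// Resolve to '' instead of throwing when the element is absent.
const textOrEmpty = (locator: Locator): Promise<string> =>
    locator.textContent({ timeout: 3_000 }).then((t) => t ?? '').catch(() => '');

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // Placeholder selectors.
        const title = await textOrEmpty(page.locator('h1.title'));   // usually present
        const reviews = await textOrEmpty(page.locator('.reviews')); // may be missing
        await pushData({ title, reviews });
    },
});
```

Only errors that escape the request handler mark the request as failed, so catching per locator keeps the rest of the data.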

Sorting Quora's questions by number of answers and views

Hi there! I have a question you might be able to crack: is there a way to sort/filter search results by SERP features? For example, I'm interested in finding questions with high views/votes/followers on Quora BUT with no answers (or just a couple). Is there a way to sort/filter results starting from the ones that have the highest votes or the lowest number of answers?...

Multiple queues

Is there a way to have a single crawler read from multiple queues?

How to open multiple browsers?

I basically want the program to open multiple browsers instead of one. How can I configure it to do so?
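The browser pool controls this: if you cap how many pages each browser may open while allowing higher concurrency, the pool has to launch extra browser instances to keep up. A sketch, assuming PlaywrightCrawler (the numbers are arbitrary):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Enough parallel requests to require several browsers.
    minConcurrency: 4,
    maxConcurrency: 8,
    browserPoolOptions: {
        // With only 2 pages allowed per browser, reaching 8 concurrent
        // requests forces the pool to open up to 4 browser instances.
        maxOpenPagesPerBrowser: 2,
    },
    async requestHandler({ page }) {
        // ...
    },
});
```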

TSConfig in Crawlee projects.

Cannot find module 'crawlee'. Did you mean to set the 'moduleResolution' option to 'nodenext', or to add aliases to the 'paths' option? ts(2792)
The linter gives this error even on the template project. Does this need attention, or can I leave it like this?...
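ts(2792) usually means the configured moduleResolution cannot read the package's exports map; switching to NodeNext (which is what the Crawlee TypeScript templates use) typically clears it. A minimal sketch of the relevant tsconfig.json fields:

```json
{
  "compilerOptions": {
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "target": "ES2022"
  }
}
```

After changing it, restarting the TS server in your editor may be needed for the error to disappear.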

how deep can website-content-crawler go?

Hello guys, I have a website composed of 3 million pages. The homepage is like Google, so the crawler has to enter all the search results one by one and scrape the content inside each of them. Can website-content-crawler do all that automatically, or do I have to give it the links to those 3 million pages? Also, can I customize what to scrape inside each of those links, e.g. give it the div id of the container?...

Crawlee request handler no access to class functions in NestJS

Hey there! I am working on a NestJS app with Crawlee. When I create a request handler function using NestJS classes, the handler cannot access other methods of the class via the `this` keyword. Does anyone know how to make it work? Could it be a problem resulting from NestJS dependency injection?
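This looks like plain JavaScript `this` binding rather than DI: passing a class method around as a bare callback detaches it from its instance. Declaring the handler as an arrow-function property (or calling .bind(this)) fixes it. A minimal sketch outside NestJS, with made-up names:

```typescript
class ScraperService {
    private prefix = 'scraped:';

    label(url: string): string {
        return this.prefix + url;
    }

    // An arrow-function property captures `this` lexically, so it keeps
    // working even when handed over as a bare callback (for example, as
    // a Crawlee requestHandler).
    handle = (url: string): string => this.label(url);
}

const service = new ScraperService();

// The detached arrow property still sees the instance:
const detached = service.handle;
const result = detached('https://example.com');
// Equivalent fix for an ordinary method: service.label.bind(service)
```

In a NestJS provider, the same idea means defining requestHandler as an arrow-function class property, or wrapping the call: `(ctx) => this.myHandler(ctx)`.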

Set debug breakpoint in VS Code

Is it possible to set a debug breakpoint in VS Code when writing TypeScript code for Crawlee, to inspect the values of variables and the flow of code execution?
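Yes. One approach is a .vscode/launch.json entry that launches your existing npm script with the Node debugger attached; with source maps enabled in tsconfig, breakpoints bind to the TypeScript sources. A sketch, assuming your project has a start:dev script (as the Crawlee templates do):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Debug crawler",
      "runtimeExecutable": "npm",
      "runtimeArgs": ["run", "start:dev"],
      "skipFiles": ["<node_internals>/**"]
    }
  ]
}
```

Set breakpoints in your .ts files, then start this configuration from the Run and Debug panel.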