Crawlee & Apify


This is the official developer community of Apify and Crawlee.

Crawlee stops scanning for links with different anchors (#xyz) but the same base URL

I am trying to crawl a domain where all subpages share the same index.html base URL and differ only in the anchor. For example: 'https://myDomain.com/index.html#/welcome', 'https://myDomain.com/index.html#/documents', 'https://myDomain.com/index.html#/test'. I am using Crawlee with Playwright. The first URL is crawled correctly, but Crawlee just stops afterwards and does not scan the other URLs, even though they were actively added to the queue. ...
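
The likely cause: Crawlee computes each request's uniqueKey from a normalized URL, and normalization strips the #fragment, so all three URLs collapse into a single request. A minimal sketch of a workaround using the documented keepUrlFragment option, both for manually added requests and for enqueueLinks:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ enqueueLinks }) => {
        await enqueueLinks({
            // Keep the #fragment so each anchor counts as a distinct request.
            transformRequestFunction: (req) => {
                req.keepUrlFragment = true;
                return req;
            },
        });
    },
});

await crawler.run([
    { url: 'https://myDomain.com/index.html#/welcome', keepUrlFragment: true },
    { url: 'https://myDomain.com/index.html#/documents', keepUrlFragment: true },
]);
```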

How to access Actor input inside route handlers?

Hello, I am having trouble figuring out how to do the following and cannot find anything in the docs: suppose I want to use variables provided in my input_schema, e.g.: ```...
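
One common pattern (a sketch, not the only way): read the input once with Actor.getInput() in your main script and pass it to handlers through each request's userData, since router handlers only receive the crawling context. The maxItems field below is a hypothetical input_schema property for illustration:

```typescript
import { Actor } from 'apify';
import { createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';

await Actor.init();
// `maxItems` is a hypothetical input_schema field, used only for illustration.
const input = (await Actor.getInput<{ maxItems: number }>()) ?? { maxItems: 10 };

const router = createPlaywrightRouter();
router.addDefaultHandler(async ({ request, log }) => {
    const { maxItems } = request.userData.input;
    log.info(`maxItems from input: ${maxItems}`);
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run([{ url: 'https://example.com', userData: { input } }]);
await Actor.exit();
```

Alternatively, since handlers are ordinary functions, you can export the loaded input from a shared module and import it wherever the handlers live.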

Connecting to a remote browser instance?

Is there a way to specify a WebSocket endpoint in the PlaywrightCrawler config (or somewhere else) so we can connect to a remote browser?
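
As far as I know there is no first-class wsEndpoint option, but launchContext.launcher accepts anything with a Playwright BrowserType-shaped interface, so one untested sketch is to wrap the stock launcher and make launch() connect instead (REMOTE_WS is a placeholder; depending on your remote browser you may need connectOverCDP instead of connect):

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { chromium } from 'playwright';

const REMOTE_WS = 'ws://browser-host:3000'; // placeholder endpoint

// Inherit everything from the stock chromium launcher, but have
// launch() connect to the remote browser instead of spawning one.
const remoteLauncher = Object.assign(Object.create(chromium), {
    launch: async () => chromium.connect(REMOTE_WS),
}) as typeof chromium;

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: remoteLauncher },
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});
```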

Crawlee Router as a folder with different files for each Handler

Hey, everyone! 👋 Is it possible to create a folder named routes in my project and then construct export const router in an index.ts file by adding all my handlers from different files? Like this: ```typescript...
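
Yes, this works: the router is a plain object, so handlers can live in separate files and be registered centrally. A sketch (file and label names are just examples):

```typescript
// routes/product.ts
import type { PlaywrightCrawlingContext } from 'crawlee';

export async function productHandler({ request, log }: PlaywrightCrawlingContext) {
    log.info(`Product page: ${request.url}`);
}

// routes/index.ts
import { createPlaywrightRouter } from 'crawlee';
import { productHandler } from './product.js';

export const router = createPlaywrightRouter();
router.addHandler('PRODUCT', productHandler);
router.addDefaultHandler(async ({ log }) => log.info('default route'));
```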

Added "playwright-extra" with "stealthPlugin" and got error "Cannot read properties of undefined"

I have some code using PlaywrightCrawler. I added "playwright-extra" with "stealthPlugin" to it, exactly as in the documentation [1]. The only thing I added to my code was this: ``` import { firefox } from 'playwright-extra';...
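
For reference, the pattern from the Crawlee docs passes the patched browser type through launchContext.launcher; a plugin registered on a different browser type than the one handed to the crawler is one plausible source of "Cannot read properties of undefined" (that diagnosis is an assumption). A sketch:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

// Register the plugin on the same browser type you pass to the crawler.
firefox.use(stealthPlugin());

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});
```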

Can I deploy to Azure?

It looks like the deployment guides only cover Apify, AWS, and GCP.

Module not found in NextJs projects

I am trying to run a Playwright crawler with Crawlee, and for some reason I'm getting an error saying the puppeteer module cannot be found. I've run npm install crawlee playwright, but not puppeteer. Do I also need puppeteer to run Playwright? I don't think so... ```...
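
No, Puppeteer should not be needed for a Playwright crawler. The crawlee meta-package re-exports every crawler type, and some bundlers (Next.js included) may try to resolve the optional puppeteer import anyway; that bundler behavior is my assumption here. One workaround is to depend on the granular package instead:

```typescript
// npm install @crawlee/playwright playwright
import { PlaywrightCrawler } from '@crawlee/playwright';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        log.info(await page.title());
    },
});
```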

Structure Crawlers to scrape multiple sites

Hey everyone, what is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here, and the recommended approach seems to be a single crawler with multiple routers. The issue I am facing with this: when you enqueue links, you add site-1 and then site-2 initially before starting the crawler, and the crawler then dynamically adds links as needed. But this messes up the logs: since the queue is FIFO, it crawls the first link, adds the extracted links to the queue, then crawls the second link and adds its links, and so it keeps switching context between the two sites, which makes the logs a mess. Also, routers don't seem to have a URL parameter, just a label and the request, so we would basically have to define handlers for every site in a single router, which bloats up one file. Is there a better way to structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel while keeping sane logging for them....
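
One way to keep the logs sane (a sketch, assuming the sites can run as separate crawls): give each site its own named RequestQueue, router, and crawler instance, then run them sequentially or in parallel. The route modules below are hypothetical:

```typescript
import { PlaywrightCrawler, RequestQueue, type PlaywrightCrawlerOptions } from 'crawlee';
import { siteOneRouter, siteTwoRouter } from './routes/index.js'; // hypothetical modules

async function crawlSite(
    name: string,
    startUrl: string,
    requestHandler: PlaywrightCrawlerOptions['requestHandler'],
) {
    const requestQueue = await RequestQueue.open(name); // isolated per-site queue
    const crawler = new PlaywrightCrawler({ requestQueue, requestHandler });
    await crawler.run([startUrl]);
}

// Sequential: log lines stay grouped per site.
await crawlSite('site-1', 'https://site-1.example.com', siteOneRouter);
await crawlSite('site-2', 'https://site-2.example.com', siteTwoRouter);
```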

Instagram followers scraper

Hello everyone, I have been using Apify services for a long time, and I remember that several months ago there was an Actor that scraped the followers of a page. I don't mean only the number of followers, but the usernames of all of a page's followers. Now I can't find it anymore. Was it removed? Is it possible to create it without programming? Can anyone help me? Thank you all...

Blocking requests after click

I am using preNavigationHooks to block images, which works for the initial page load but does not block images loaded after a click on the page (i.e., XHR requests). How can these be blocked? ```typescript preNavigationHooks: [ async ({ page, blockRequests }) => {...
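
A hedged alternative: register a page.route() interceptor in the same preNavigationHook. Unlike one-shot blocking, the route stays active for the page's whole lifetime, so images requested after clicks are aborted too (the click selector below is hypothetical):

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort every image request for the lifetime of this page,
            // including those triggered by clicks after the initial load.
            await page.route('**/*', (route) =>
                route.request().resourceType() === 'image'
                    ? route.abort()
                    : route.continue(),
            );
        },
    ],
    requestHandler: async ({ page }) => {
        await page.click('button#load-more'); // hypothetical selector
    },
});
```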

What is the relationship between memoryMbytes and availableMemoryRatio?

Can someone tell me what the relationship between those two configuration options is? Is one just an absolute value in MB and the other a ratio of total memory? I thought availableMemoryRatio limited the total memory used, but the docs say "When the memory usage is more than the provided ratio, the memory is considered overloaded," so it's more like an indicator? I'm currently working on a heavily overloaded server and would rather Crawlee/Playwright had a strong upper limit and ran slower....
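
As I read the docs: availableMemoryRatio tells Crawlee what fraction of total system memory the AutoscaledPool may treat as its budget, while memoryMbytes replaces that computed budget with an absolute number. Neither is a hard OS-level cap; when usage crosses the threshold, Crawlee marks memory as overloaded and scales concurrency down. A sketch of pinning the absolute budget:

```typescript
import { Configuration, PlaywrightCrawler } from 'crawlee';

// The AutoscaledPool treats 2048 MB as the total it may use,
// regardless of how much RAM the machine actually has.
const config = new Configuration({ memoryMbytes: 2048 });

const crawler = new PlaywrightCrawler({ maxConcurrency: 4 }, config);
```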

Help me!

I am trying to send a key event to a wxPython dialog to type into the focused field, but I can't get it to work. Please help me...

How can I configure PuppeteerCrawler to not save request information to disk?

I see I can set the location of the storage directory with an environment variable (CRAWLEE_STORAGE_DIR), but I am not seeing anything to disable storage. I don't want Crawlee to save anything to disk. How can I configure this? Edit: answering my own post: in the crawlee.json configuration, storageClientOptions.persistStorage can be set to false to disable Crawlee storing on disk...
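
The same switch is also available programmatically, if you would rather not add a crawlee.json (a sketch):

```typescript
import { Configuration, PuppeteerCrawler } from 'crawlee';

// Keep request queues and datasets purely in memory; nothing hits the disk.
const config = new Configuration({ persistStorage: false });

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, log }) => {
        log.info(await page.title());
    },
}, config);
```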

Hi, I'm new to Crawlee & Apify, having discovered them yesterday

Just setting up a couple of Actors for LinkedIn, Google Careers, and Monster. As noted, I'm new to the platform, and none of the scrapers work as expected. * The Monster search just fails with a fatal exception...

Can I get a 403 status?

Hi. I guess this question might be a bit dumb, but I wanted to ask how Crawlee works with requests. If I try to access a particular website using a plain request or axios, I get a 403 error, but with Crawlee's CheerioCrawler I get the result I want. I figured the retry mechanism and the session rotation have something to do with it, since it happened a few times in my use case. I know it's a lot to ask, but I'm wondering: does it go through some proxies, how are user agents handled, the TLS handshake, etc.? ...
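
Not a dumb question. CheerioCrawler sends its traffic through got-scraping, which generates consistent browser-like headers with a matching TLS fingerprint, and the SessionPool rotates sessions that get blocked; that combination is usually why plain axios sees a 403 while CheerioCrawler gets through. You can use the same HTTP stack standalone (a sketch):

```typescript
import { gotScraping } from 'got-scraping';

const { statusCode, body } = await gotScraping({
    url: 'https://example.com',
    // Ask the header generator for desktop-Chrome-looking headers.
    headerGeneratorOptions: {
        browsers: ['chrome'],
        devices: ['desktop'],
    },
});
console.log(statusCode, body.length);
```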

Not skipping over URLs for unfound elements

When I am scraping product data from product URLs, I sometimes want to check whether a tag is available and fall back to a different tag if not. If a tag simply isn't found, I don't want the crawler to throw a full error for that missing element and skip saving the rest of the data. How do I avoid this "skipping" by overriding or changing the crawler's default behavior? I have even tried try/catch statements and if/else statements, and nothing works...
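
The usual trick is to treat optional elements as optional in code: query first, branch on the result, and never let a missing node throw. A Playwright-flavoured sketch (the selectors are hypothetical):

```typescript
import { Dataset, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        // locator.count() never throws; it just reports how many nodes matched.
        const sale = page.locator('.price--sale');
        const regular = page.locator('.price');
        const priceLocator = (await sale.count()) > 0 ? sale : regular;
        const price = (await priceLocator.count()) > 0
            ? await priceLocator.first().textContent()
            : null; // record the miss instead of failing the whole request

        await Dataset.pushData({ url: request.url, price });
    },
});
```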

Crawlee scraper invoking the same handler multiple times

Hey all! I've built a Crawlee scraper, but for some reason it invokes the same handler multiple times, creating a lot of duplicate requests and entries in my dataset. Also:
- I've already tried manually setting uniqueKeys for all my requests.
- I've also tried setting maxConcurrency: 1 for the crawler.
- As you can see from the logs below, the issue is not that I'm adding the same requests multiple times; it's Crawlee that's invoking handlers multiple times with the same request....
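
One thing worth ruling out first: re-invocations that are actually retries after a handler error (the underlying error can be hidden by a catch somewhere). Logging the retry count makes that visible; CheerioCrawler below is just a placeholder for whatever crawler class you use:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => {
        // retryCount > 0 means Crawlee is re-running a failed request,
        // not processing a fresh duplicate.
        log.info(`${request.url} uniqueKey=${request.uniqueKey} retry=${request.retryCount}`);
        // ...rest of the handler...
    },
});
```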

@crawlee/browser-pool useFingerprints + constant browser size

How can I use BrowserPool to auto-randomize fingerprints while keeping constant browser width and height? I'm using it with Playwright...
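
The fingerprint generator accepts screen constraints, so pinning the minimum and maximum to the same value should give randomized fingerprints with a constant window size (a sketch; option names follow the fingerprint-generator docs):

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // Pin the screen so every generated fingerprint is 1280x720.
                screen: { minWidth: 1280, maxWidth: 1280, minHeight: 720, maxHeight: 720 },
            },
        },
    },
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});
```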

Scrape private website?

Hello friends! I'm new to Apify and pretty excited about what I've learned so far. One use case I'm not sure of: can Apify be used to scrape a website that's not on the public internet? Specifically, I want to scrape knowledge bases inside corporations (with their permission). Is there, for example, some sort of proxy that could be put in place inside the private network that connects with Apify and then scrapes at Apify's direction? Or something similar?