Crawlee & Apify


This is the official developer community of Apify and Crawlee.

crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻devs-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

Xvfb fails on server

I've deployed a Playwright-with-Chromium crawler on AWS Batch, with the default Docker image. This is the error I'm getting; it's mandatory for this crawler to run headful, because otherwise some buttons that I need to click don't load. (Error log attached.) I've also tried to create a custom, slimmer image, but I run into the same issue with Xvfb....

Anything special about .php websites?

When I try to make a request to a website URL that ends with .php, it appears that the request is skipped. Is there anything peculiar I need to know about .php sites and how to reach them via Crawlee?...
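One common cause is that the link-discovery filters simply drop such URLs rather than anything .php-specific. A minimal sketch, assuming a hypothetical example.com site, that adds the .php URL explicitly and widens the enqueueLinks globs so it is not filtered out:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Widen the patterns so links ending in .php are not filtered out.
        await enqueueLinks({
            globs: ['https://example.com/**/*.php', 'https://example.com/**'],
        });
    },
});

// Adding the .php URL directly rules out filtering during link discovery.
await crawler.run(['https://example.com/listing.php']);
```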

Handle browser failure

I have a Puppeteer scraper that performs lots of actions on a page, and at one point the browser fails. It's a page with infinite scroll where I have to click a button and scroll down. After 70-80 interactions the browser crashes, and the request is retried as usual. The idea is that with those actions I'm collecting URLs that I want to navigate to. I want to handle the browser crash somehow, so that when it happens I can continue from the URLs collected so far....
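One way to make the crash survivable is to persist the URLs into the request queue as soon as they are discovered, instead of collecting them in memory until the end. A sketch under that assumption (the `a.item` selector and feed URL are hypothetical):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxRequestRetries: 3,
    async requestHandler({ page, request, crawler, log }) {
        if (request.label === 'DETAIL') {
            // Detail pages are independent requests, so a later crash of the
            // list page does not affect requests that were already enqueued.
            // ... extract and pushData ...
            return;
        }
        // LIST page with infinite scroll: persist URLs as soon as they appear.
        for (let i = 0; i < 80; i++) {
            // ... click the button, scroll down ...
            const urls = await page.$$eval('a.item', (els) =>
                els.map((el) => (el as HTMLAnchorElement).href));
            await crawler.addRequests(urls.map((url) => ({ url, label: 'DETAIL' })));
        }
    },
    failedRequestHandler({ request, log }) {
        log.error(`Gave up on ${request.url}; detail requests enqueued so far still get processed.`);
    },
});

await crawler.run([{ url: 'https://example.com/feed', label: 'LIST' }]);
```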

Best practices to avoid crawling links that were already crawled when the Actor runs on a CRON schedule

Hi, I'm building an Actor that goes through a list and then visits each individual item's page to extract information. The items themselves don't really change; new items can appear in the list and old ones can be removed, but once an item's details have been extracted, there's no need to extract them again on subsequent Actor runs (e.g. the Actor runs twice a day). I'm planning to use PostgreSQL and Prisma to store the extracted item details. Is it a reasonable decision to access the target database from within the Actor's crawls (e.g. to check whether a URL was already scraped)? Or is there a better solution, possibly with Apify's built-in tools? Thanks...
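Querying your own Postgres from inside the handler works, but a named key-value store on the platform persists across runs and can serve as a lightweight "already scraped" index without external calls. A sketch, assuming a hypothetical store name and item-link selector:

```ts
import { createHash } from 'node:crypto';
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Named stores persist across Actor runs, unlike the default run-scoped storages.
const seen = await Actor.openKeyValueStore('seen-item-urls');

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, pushData }) {
        if (request.label === 'DETAIL') {
            // Hash the URL to get a short, store-safe key.
            const key = createHash('sha256').update(request.url).digest('hex');
            if (await seen.getValue(key)) return; // details already extracted in a past run
            // ... extract details, await pushData(...), write to Postgres ...
            await seen.setValue(key, true);
            return;
        }
        // LIST page: enqueue item detail pages (selector is hypothetical).
        await enqueueLinks({ selector: 'a.item-link', label: 'DETAIL' });
    },
});

await crawler.run([{ url: 'https://example.com/list', label: 'LIST' }]);
await Actor.exit();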

Stopping the crawler when it's done scraping

Good day everyone. How can I make the crawler stop when it's done scraping/requesting a certain URL? I want to set up my Crawlee project to keep running even when it has no URL to request, waiting on a queue of URLs (Redis)...
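`crawler.run()` already resolves once the request queue is drained, so "stopping" comes for free per batch; the continuous part can be a loop that blocks on Redis between batches. A sketch, assuming ioredis as the Redis client and a hypothetical `url-queue` list key:

```ts
import { CheerioCrawler } from 'crawlee';
import Redis from 'ioredis'; // assumption: ioredis is the Redis client in use

const redis = new Redis();

async function handleBatch(urls: string[]) {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, log }) {
            log.info(`Scraping ${request.url}`);
            // ... extract data ...
        },
    });
    // run() resolves once the request queue is empty: the "stop when done" behaviour.
    await crawler.run(urls);
}

// Block on the Redis list between batches instead of exiting the process.
while (true) {
    const popped = await redis.blpop('url-queue', 0); // blocks until a URL arrives
    if (popped) await handleBatch([popped[1]]);
}
```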

How to run some code in each session

Essentially, I want to make sure that I'm logged in within every session that I run. Even better: that I log in with one user per session. How can I make sure that a new session won't open without running a login?
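One pattern is a pre-navigation hook that checks a flag on `session.userData` and performs the login the first time a session is used. A sketch, assuming a hypothetical login page, form selectors, and account rotation:

```ts
import { PlaywrightCrawler } from 'crawlee';

// Hypothetical account rotation: one account per session, picked by session id.
const accounts = [{ user: 'user1', pass: process.env.PASS1 ?? '' }];
const pickAccount = (sessionId: string) => accounts[sessionId.length % accounts.length];

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    preNavigationHooks: [
        async ({ page, session, log }) => {
            // session.userData lives as long as the session, so the login
            // runs only once per session.
            if (!session || session.userData.loggedIn) return;
            const { user, pass } = pickAccount(session.id);
            await page.goto('https://example.com/login'); // hypothetical login page
            await page.fill('#username', user);           // hypothetical selectors
            await page.fill('#password', pass);
            await page.click('button[type=submit]');
            session.userData.loggedIn = true;
            log.info(`Session ${session.id} logged in as ${user}`);
        },
    ],
    async requestHandler({ page }) {
        // ... scraping that assumes an authenticated session ...
    },
});
```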

Crawler skipping Jobs after processing 5,000-6,000 Requests

For the past few days I have been running the crawler with a high number of jobs, and I have run into a problem. I have found that not all jobs are processed by the CheerioCrawler, despite these jobs being added to the queue through addRequests([job]). I can't really reproduce it; it happens approximately after 5,000-6,000 jobs....
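One frequent cause of "missing" jobs is silent deduplication: the queue de-duplicates by uniqueKey, which defaults to the normalized URL, so two jobs pointing at the same URL collapse into one request. A sketch, assuming a hypothetical job shape, that gives every job its own uniqueKey:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        const { jobId } = request.userData;
        // ... process the job for request.url ...
    },
});

// Hypothetical job shape: { id: string; url: string }
async function addJob(job: { id: string; url: string }) {
    await crawler.addRequests([{
        url: job.url,
        // A per-job uniqueKey keeps every job visible in the queue,
        // even when several jobs share the same URL.
        uniqueKey: `${job.url}#${job.id}`,
        userData: { jobId: job.id },
    }]);
}
```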

Crawlee does not work with cron job

I'm running a cron job on a Node server, but it doesn't execute after the first run.
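Outside the Apify platform the default storages are not purged between runs, so a second invocation can find the request queue already marked as handled and finish immediately. A sketch, assuming node-cron as the scheduler, that purges the default storages before each run:

```ts
import { CheerioCrawler, purgeDefaultStorages } from 'crawlee';
import cron from 'node-cron'; // assumption: node-cron is the scheduler in use

cron.schedule('0 * * * *', async () => {
    // The default request queue persists on disk between runs; if the start
    // URLs are already "handled" there, a new run has nothing left to do.
    await purgeDefaultStorages();

    const crawler = new CheerioCrawler({
        async requestHandler({ request, log }) {
            log.info(`Scraping ${request.url}`);
        },
    });
    await crawler.run(['https://example.com']);
});
```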

TikTok scraper following list

Hi there, we are trying to build our own TikTok scraper Actor using the Playwright wrapper. The mobile view of TikTok provides more details than the desktop view, so we are scraping the mobile site. One thing we noticed is that opening the following-list modal via Playwright's click event is not working....

Running Crawlee multiple times with the same URL

Hi! I am trying to build a crawler using PuppeteerCrawler. The crawler is started by sending a POST to an API endpoint; the API is implemented using Azure Durable Functions. The first time I call the API it works as expected, but the next time I call it I get no output. This is the log output of the second run:...
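A likely explanation is that the second call reuses the same request queue, where the URL is already marked as handled. One workaround is to open a uniquely named queue per API call. A sketch under that assumption:

```ts
import { PuppeteerCrawler, RequestQueue } from 'crawlee';
import { randomUUID } from 'node:crypto';

export async function runCrawl(startUrl: string) {
    // A fresh, uniquely named queue per invocation avoids the previous run's
    // queue, where the start URL is already handled.
    const requestQueue = await RequestQueue.open(`run-${randomUUID()}`);
    await requestQueue.addRequest({ url: startUrl });

    const crawler = new PuppeteerCrawler({
        requestQueue,
        async requestHandler({ page, pushData }) {
            await pushData({ title: await page.title() });
        },
    });
    await crawler.run();
    await requestQueue.drop(); // clean up the per-run queue afterwards
}
```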

AI Product Matcher - manual upload of JSONs

Hi, I wanted to test AI Product Matcher without uploading my custom crawlers and setting everything up. Is there a way to manually create datasets, for example dataset1 and dataset2, and upload JSONs to those datasets? I would like to upload an Amazon JSON with product data to dataset1 and a Walmart JSON with product data to dataset2, and then just compare them. Can I upload those JSONs manually? I couldn't find any info in the docs 😕
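Datasets can be created and filled through the Apify API client without running any crawler. A sketch using apify-client, assuming you have the product data as local JSON files (file names are hypothetical):

```ts
import { ApifyClient } from 'apify-client';
import { readFile } from 'node:fs/promises';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Creates (or opens) a named dataset and pushes locally prepared JSON items into it.
async function uploadProducts(datasetName: string, file: string) {
    const { id } = await client.datasets().getOrCreate(datasetName);
    const items = JSON.parse(await readFile(file, 'utf8')); // expects an array of product objects
    await client.dataset(id).pushItems(items);
}

await uploadProducts('dataset1', './amazon-products.json');
await uploadProducts('dataset2', './walmart-products.json');
```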

Passing data to a router/handler

I'm trying to pass a username and password to the async function in the default handler, since I'm using the default handler to log in to the website. I've seen different guides use all kinds of input parameters (request, page, enqueueLinks, log, pushData), but these all seem to be specific prebuilt parameters of the module? I'm not sure. So, how could I pass my own data through?
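Those parameters are indeed the prebuilt crawling context; custom values travel on the request's `userData` and come back out as `request.userData` in the handler. A sketch, with hypothetical login-form selectors:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // request, page, enqueueLinks, log, pushData, ... are the crawling context.
        // Your own values ride along on request.userData.
        const { username, password } = request.userData as { username: string; password: string };
        await page.fill('#username', username); // hypothetical selectors
        await page.fill('#password', password);
        await page.click('button[type=submit]');
        log.info(`Logged in as ${username}`);
    },
});

await crawler.run([{
    url: 'https://example.com/login',
    userData: { username: 'my-user', password: process.env.SITE_PASSWORD },
}]);
```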

Trouble installing Crawlee through npx

I'm trying to install Crawlee through npx crawlee create my-crawler, but it seems like npx can't find npm (no such file or directory, lstat 'C:\users\ethan\AppData\Roaming\npm'). npm -v works fine, v18.17.1. I'm running on a Windows 11 machine, from Visual Studio Code with PowerShell....

AI Product Matcher - Support other languages

What is the outlook for AI Product Matcher supporting other languages? I would like to use it for Portuguese! Until that happens, what is the recommended workaround? Would I have to translate my data from Portuguese to English and submit it to AI Product Matcher?...

Re-using the crawler instead of re-initializing it for each URL?

My scraper uses BullMQ, which retrieves jobs (URLs) from the job queue and runs them with CheerioCrawler. Is there any way to initialize the crawler once and keep using it? I assume this would also consume fewer resources and improve performance. If there are any best practices that I have not implemented, I would love to hear about them....
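Recent Crawlee versions have a `keepAlive` option that keeps a single crawler instance running and waiting for work, so the BullMQ worker only needs to forward URLs into it. A sketch under that assumption (the queue name, job payload shape, and Redis connection are hypothetical):

```ts
import { CheerioCrawler } from 'crawlee';
import { Worker } from 'bullmq';

// One long-lived crawler; keepAlive keeps it waiting for requests instead of
// finishing as soon as the request queue is drained.
const crawler = new CheerioCrawler({
    keepAlive: true,
    async requestHandler({ request, $ }) {
        // ... extract data from $ for request.url ...
    },
});
void crawler.run(); // resolves only when the crawler is torn down

// The BullMQ worker just forwards URLs into the already-running crawler.
const worker = new Worker('scrape-jobs', async (job) => {
    await crawler.addRequests([{ url: job.data.url }]); // job.data.url is an assumption
}, { connection: { host: 'localhost', port: 6379 } });
```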

Ignore URLs that match an already crawled URL but have different query params

I do not want to crawl a URL that has already been crawled but has different query params. How can I do this?
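One approach is to derive the deduplicating uniqueKey from the URL without its query string inside `transformRequestFunction`, so variants that differ only in query params count as the same page. A sketch, assuming query params do not change the content you care about:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({
            transformRequestFunction: (req) => {
                // Requests are de-duplicated by uniqueKey; stripping the query
                // string makes ?page=2, ?utm_source=... etc. count as the same
                // page, so only the first variant gets crawled.
                const url = new URL(req.url);
                url.search = '';
                req.uniqueKey = url.href;
                return req;
            },
        });
    },
});

await crawler.run(['https://example.com/']);
```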

Using SQLite as a datastore

Looking for instructions on how to hook up @apify/storage-local. I believe it uses an SQLite datastore instead of the file system....
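With the Apify SDK v3, the storage-local client can be passed to Actor.init as the storage implementation; a minimal sketch, assuming the @apify/storage-local package is installed:

```ts
import { Actor } from 'apify';
import { ApifyStorageLocal } from '@apify/storage-local';
import { CheerioCrawler } from 'crawlee';

// ApifyStorageLocal keeps request queues, key-value stores, and datasets in
// SQLite instead of the default JSON files on disk.
const storage = new ApifyStorageLocal();
await Actor.init({ storage });

const crawler = new CheerioCrawler({
    async requestHandler({ request, body, pushData }) {
        await pushData({ url: request.url, length: body.length });
    },
});
await crawler.run(['https://example.com']);

await Actor.exit();
```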

How do I enqueue the same link with BasicCrawler?

I want to scrape an API that has pagination, using got-scraping, which leads me to use BasicCrawler (as written in the guide). Now I'm confused about how to enqueue the same API URL to the crawler.
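Since the URL stays the same, the trick is to vary the request's uniqueKey (e.g. per page number) and carry the page in userData; otherwise the queue drops the "duplicate". A sketch, assuming a hypothetical paginated JSON API with `items` and `hasMore` fields:

```ts
import { BasicCrawler } from 'crawlee';

const API_URL = 'https://example.com/api/items'; // hypothetical paginated API

const crawler = new BasicCrawler({
    async requestHandler({ request, sendRequest, crawler, pushData }) {
        const { page } = request.userData as { page: number };
        // sendRequest wraps got-scraping; the page number goes in the query string.
        const res = await sendRequest({
            url: `${API_URL}?page=${page}`,
            responseType: 'json',
        });
        const body = res.body as { items?: unknown[]; hasMore?: boolean }; // assumed response shape
        await pushData(body.items ?? []);

        // Same URL, different uniqueKey: without it, the queue would treat the
        // next page as a duplicate request and skip it.
        if (body.hasMore) {
            await crawler.addRequests([{
                url: API_URL,
                uniqueKey: `${API_URL}#page=${page + 1}`,
                userData: { page: page + 1 },
            }]);
        }
    },
});

await crawler.run([{ url: API_URL, uniqueKey: `${API_URL}#page=1`, userData: { page: 1 } }]);
```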