Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Am I able to scrape a private Facebook group?

Either by modifying Icebergg's Facebook groups scraper or otherwise? I want to capture all posts and comments, i.e. everything....

Integrating a normal JS scraper into Crawlee

If I already have a fully functional scraper built with Cheerio, Puppeteer, or Playwright, can I integrate it into the Crawlee framework to get all the benefits Crawlee offers while keeping most of the code fundamentally the same? If so, how would I do so?...

Crawlee/Apify usage with Chrome extensions

Is it possible to upload my own extension along with Puppeteer/Playwright? Thanks...

Help with www subdomain

How do I allow www. subdomains in "same-hostname"? Some links are prefixed with www. but point to the same website. I think www subdomains should be treated as "same-hostname" by enqueueLinks: www.xyz.com is always the same as xyz.com, and vice versa....
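One way to get this behavior is to compare hostnames after stripping a leading `www.`. The following is a minimal sketch in plain JavaScript; `stripWww` and `isSameHostnameIgnoringWww` are hypothetical helper names, not part of Crawlee. In a crawler, a check like this could plausibly be applied in the `transformRequestFunction` option of `enqueueLinks` together with a broader strategy, but that wiring is an assumption and is not shown here. The `same-domain` strategy may also already cover this case.

```javascript
// Hypothetical helper: drop a single leading "www." label from a hostname.
function stripWww(hostname) {
    return hostname.toLowerCase().replace(/^www\./, '');
}

// Hypothetical helper: true when both URLs share a hostname once "www." is ignored.
function isSameHostnameIgnoringWww(urlA, urlB) {
    const hostA = stripWww(new URL(urlA).hostname);
    const hostB = stripWww(new URL(urlB).hostname);
    return hostA === hostB;
}

// www.xyz.com and xyz.com are treated as the same host.
console.log(isSameHostnameIgnoringWww('https://www.xyz.com/about', 'https://xyz.com/'));
// Other subdomains are still treated as different hosts.
console.log(isSameHostnameIgnoringWww('https://blog.xyz.com/', 'https://xyz.com/'));
```

Only the `www.` prefix is ignored, so other subdomains such as `blog.xyz.com` remain distinct from `xyz.com`.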

Is there a way to reset timeout?

I want to reset requestHandlerTimeoutSecs because I don't know in advance how long a request may take. How can I reset it? Should we just set it to something like infinity?

Sending requests to an API causes a timeout after 60 seconds

Using sendRequest causes "Reclaiming failed request back to the list or queue. requestHandler timed out after 60 seconds." I am also using PlaywrightCrawler with a proxy. I am using the following code (not the same, but similar):...

Image diagram search: how to filter for "best designed" or "highest quality" results?

I'm looking to do an image search and return the "highest-quality" ~5-10 results out of the first ~100. Specifically: I'm looking to search for diagrams and find ones that are most clear / well-designed / have the best explanations. Any tips for how to do this? ...

puppeteer/replay

Hi all, nice to see you. I have a question about how to use puppeteer/replay in PuppeteerCrawler to execute a scenario generated by Chrome Recorder. What I want is for a function to be called on each scenario step before the crawl runs....

disable request queue storage

Hey team, my request queue is too long and it is critically affecting my local storage. Is there any way I can disable it manually by overriding some configuration?

How to set a cookie in Crawlee?

I want to set a cookie in PlaywrightCrawler, but I can't find a tutorial in the documentation.

Solve Sliding Captcha

Hello all, I want to scrape the website seloger.com, which is protected by DataDome. When using Selenium, I immediately get a slider captcha. I found this library to implement the solving: https://anti-captcha.com/apidoc/task-types/GeeTestTaskProxyless. The problem is that I am trying to find the gt and challenge keys and can't find them; I searched the JS files and the elements....

Config logs: 1/ add timestamp 2/ save to file

Hello. Currently, logs from Crawlee look like: `INFO PuppeteerCrawler: Starting the crawl DEBUG PuppeteerCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to ......

Saving bandwidth using PlaywrightCrawler: how to block googletagmanager, google-analytics, etc...

I already block images as described in [1], and this helps to save some bandwidth. Next step: looking at the statistics in my proxy service, I see a significant number of requests like these: ``` https://www.googletagmanager.com/gtag/js?id=......
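A common approach is to abort requests whose hostname belongs to a known tracker. The sketch below covers only the matching step in plain JavaScript; `BLOCKED_HOSTS` and `shouldBlock` are hypothetical names, and the host list is an assumption based on the two services named in the title. With Playwright, a predicate like this could be used inside a `page.route` handler that calls `route.abort()` for matching requests and `route.continue()` otherwise.

```javascript
// Assumed blocklist: hostnames of the trackers named in the question.
const BLOCKED_HOSTS = ['googletagmanager.com', 'google-analytics.com'];

// Hypothetical predicate: true when the URL's hostname is a blocked host
// or one of its subdomains.
function shouldBlock(url) {
    let hostname;
    try {
        hostname = new URL(url).hostname;
    } catch {
        return false; // not a parseable absolute URL, so let it through
    }
    return BLOCKED_HOSTS.some(
        (blocked) => hostname === blocked || hostname.endsWith(`.${blocked}`),
    );
}

console.log(shouldBlock('https://www.googletagmanager.com/gtag/js'));
console.log(shouldBlock('https://example.com/app.js'));
```

Matching on the parsed hostname rather than on a substring of the full URL avoids blocking unrelated pages that merely mention a tracker domain in their path or query string.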

Need help with puppeteer-extra

How do I use puppeteer-extra-plugin-stealth with Crawlee's PuppeteerCrawler?

Help with crawling Instagram

Hi guys, I need guidance on crawling Instagram businesses. I need a lead list of newly opened restaurants....

Is there a way to store the state and continue?

Hello there, I am looking for a way to store the current state of where the crawler is, so that if an error occurs and it crashes, we can fix the problem and continue from there. For example, I wrote a program that crawls Google's search pages. I want to crawl 1000+ more pages, which will take a long time. While crawling, an error occurred because of a problem in our program, such as missing a special button on Google's page....

Extract URL from HTML code

Hello all, I would like to extract a URL from HTML code with an Apify scraper. Here is the HTML code and the URL to extract:...

Dataset.open(..) doesn't init dataset - when called outside of handler

Hi. Due to performance issues, I want to move all possible awaits out of the handler. For example, here: ...

click on a specific word of a span

Hi everyone, I'm new to the world of scraping and crawling. For the time being I'm using Playwright, and I just learned of Crawlee. I'm stuck in Playwright, and perhaps Crawlee will help. I need to click on the starting and ending words of a span element. Is there any way to select a specific word of a span? Alternatively, is there any way to CREATE a DOM element with Playwright? Perhaps I could create DIVs around these two words, then click on those DIVs. Or, as another alternative, I could get the coordinates of a given word and click there. Is any of this possible?...

How can I dynamically add a JS function given as a string to `preNavigationHooks`?

I would like to dynamically add a string (which describes a JS function) to preNavigationHooks array in CheerioCrawlerOptions [1] ```javascript const crawlerOptions = { ... preNavigationHooks: [],...
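One possible approach, assuming the string comes from a trusted source, is to turn it into a function with the `Function` constructor and push the result into the array before the options are passed to the crawler. The sketch is plain JavaScript; `hookFromString` and the sample hook source are hypothetical, and evaluating strings this way executes arbitrary code, so it should not be used with untrusted input.

```javascript
// Hypothetical helper: evaluate a string holding a function expression
// and return the resulting function. Only use with trusted input.
function hookFromString(source) {
    const fn = new Function(`"use strict"; return (${source});`)();
    if (typeof fn !== 'function') {
        throw new TypeError('Source string did not evaluate to a function');
    }
    return fn;
}

const crawlerOptions = {
    preNavigationHooks: [],
};

// Sample hook source (an assumption for illustration): sets a custom request header.
const hookSource = `async ({ request }) => {
    request.headers = { ...request.headers, 'x-demo': '1' };
}`;

crawlerOptions.preNavigationHooks.push(hookFromString(hookSource));
console.log(crawlerOptions.preNavigationHooks.length);
```

Functions created with the `Function` constructor only see the global scope, so a hook built this way cannot reference local variables or imports from the surrounding module; anything it needs has to arrive through its arguments.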