Crawlee & Apify

This is the official developer community of Apify and Crawlee.

How to send a URL with a label to the main file?

I am trying to send a URL with a label and user data to the main file, in order to run this URL directly from a specific handler within the routes file. Is that possible? I am using Playwright.
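A minimal sketch of one way this can work in Crawlee for Python, assuming a recent version where crawlers are imported from `crawlee.crawlers`: a `Request` created with a `label` and `user_data` is dispatched straight to the matching router handler (the URL and data below are placeholders).

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()

@router.handler('DETAIL')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    # The user_data attached in main() travels with the request.
    context.log.info(f'{context.request.url} -> {context.request.user_data}')

async def main() -> None:
    crawler = PlaywrightCrawler(request_handler=router)
    # The label routes this URL directly to the DETAIL handler.
    await crawler.run([
        Request.from_url(
            'https://example.com/item/1',  # hypothetical URL
            label='DETAIL',
            user_data={'category': 'books'},
        ),
    ])

if __name__ == '__main__':
    asyncio.run(main())
```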

structlog support?

Could I see an example of how structlog would be officially implemented?
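Crawlee's Python logging goes through the standard `logging` module, so one hedged approach (not an official integration) is to route those records through structlog's `ProcessorFormatter`; you may also need to stop Crawlee from installing its own handlers, e.g. via a `configure_logging=False` crawler option if your version supports it.

```python
import logging

import structlog

# Render Crawlee's stdlib log records as structured JSON events.
handler = logging.StreamHandler()
handler.setFormatter(
    structlog.stdlib.ProcessorFormatter(
        processor=structlog.processors.JSONRenderer(),
        foreign_pre_chain=[structlog.processors.TimeStamper(fmt='iso')],
    )
)
logging.basicConfig(level=logging.INFO, handlers=[handler])
```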

Memory is critically overloaded

I have an AWS EC2 instance with 64 GB of memory. My crawler runs in a Docker container, and CRAWLEE_MEMORY_MBYTES is set to 61440. My Docker config: ```...
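For reference, a hedged sketch of the in-code equivalent, assuming `Configuration.memory_mbytes` mirrors that environment variable; capping concurrency is often what actually keeps the autoscaled pool inside the container's limit.

```python
from crawlee import ConcurrencySettings
from crawlee.configuration import Configuration
from crawlee.crawlers import PlaywrightCrawler

# Assumption: memory_mbytes maps to CRAWLEE_MEMORY_MBYTES.
config = Configuration(memory_mbytes=61440)
crawler = PlaywrightCrawler(
    configuration=config,
    # Keep the browser pool from outgrowing the container's memory.
    concurrency_settings=ConcurrencySettings(max_concurrency=20),
)
```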

Routers not working as expected

Hello everyone! First of all, thanks for this project — it looks really good and promising! I'm considering using Crawlee as an alternative to Scrapy....
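For comparison, a minimal sketch of router wiring in Crawlee for Python (assuming current imports): unlabeled requests hit the default handler, which enqueues labeled follow-ups.

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()

@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # Requests without a label land here; forward category links onward.
    await context.enqueue_links(selector='a.category', label='CATEGORY')  # hypothetical selector

@router.handler('CATEGORY')
async def category_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f'Category page: {context.request.url}')

crawler = PlaywrightCrawler(request_handler=router)
```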

Dynamically change dataset id based on root_domain

Hey folks. I've attached an example of my code as a snippet. Is it possible to dynamically change the dataset ID so that each link has its own dataset?...
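One hedged way to do this in Crawlee for Python: open a named dataset per root domain inside the handler. The naming below is illustrative; named storages typically only allow alphanumerics and dashes, hence the dot replacement.

```python
from urllib.parse import urlparse

from crawlee.storages import Dataset

async def push_per_domain(context) -> None:
    # Derive a storage-safe dataset name from the request's root domain.
    domain = urlparse(context.request.url).netloc.replace('.', '-')
    dataset = await Dataset.open(name=domain)
    await dataset.push_data({'url': context.request.url})
```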

Handling of 4xx and 5xx in default handler (Python)

I built a crawler for crawling websites and am now trying to add functionality to also handle error pages/links like 4xx and 5xx. I was not able to find any documentation regarding that. So the question is: is it supported, and if so, in what direction should I look?...
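As far as I can tell, non-OK responses raise and go through the retry cycle; a hedged sketch of catching them after retries are exhausted via the failed-request handler, assuming the decorator form exists on your version:

```python
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

crawler = HttpCrawler(max_request_retries=2)

@crawler.router.default_handler
async def handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'OK: {context.request.url}')

@crawler.failed_request_handler
async def failed(context: HttpCrawlingContext, error: Exception) -> None:
    # Runs once retries are exhausted, e.g. for persistent 4xx/5xx pages.
    context.log.warning(f'{context.request.url} failed: {error}')
```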

Camoufox and adaptive playwright

Hello great friends of Crawlee, I was wondering if there is any way to use Camoufox with the adaptive Playwright crawler? It seems to throw an error when I try to add the browser pool....

Hey, why do I get the scraped content of the first URL when I have passed another URL?

I implemented a Playwright crawler to parse a URL. I made a single request to the crawler with the first URL, and while that request was still processing, I passed another URL to the crawler and sent the request. Both times the crawler processed content from the first URL instead of the second one. Can you please help? async def run_crawler(url, domain_name, save_path=None): print("doc url inside crawler file====================================>", url)...
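Hard to be sure without the full code, but a common cause is a single long-lived crawler whose default request queue deduplicates and carries state across calls. A hedged sketch that builds a fresh crawler per invocation:

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def run_crawler(url: str) -> None:
    # A new crawler for every call, so the second URL is not
    # deduplicated against requests from the first run.
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    await crawler.run([url])
```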

proxy_config.new_url() does not return a new proxy

Here is my Selenium Python script, where I try to rotate proxies using proxy_config.new_url(): ```python # Standard libraries import asyncio import logging...
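For reference, a minimal rotation check in plain Crawlee for Python (the proxy URLs are placeholders). Note that `new_url` is async, and that passing the same `session_id` deliberately pins the same proxy, so rotation requires omitting or varying it.

```python
import asyncio

from crawlee.proxy_configuration import ProxyConfiguration

async def main() -> None:
    proxy_config = ProxyConfiguration(
        proxy_urls=[
            'http://proxy-a.example.com:8000',  # placeholder
            'http://proxy-b.example.com:8000',  # placeholder
        ],
    )
    # Without a session_id each call advances the rotation;
    # a fixed session_id would keep returning the same proxy.
    for _ in range(4):
        print(await proxy_config.new_url())

asyncio.run(main())
```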

Proxy example with PlaywrightCrawler

This is probably a simple fix, but I cannot find an example of Crawlee using a simple proxy link with Playwright. If anyone has a working example or knows what is wrong in the code, I would really appreciate your help. Here is the code I have been working with (I wish I could copy and paste all of the code here, but the post goes over the character limit). I get the following error from the code: ...
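A minimal sketch of how I understand proxy wiring with PlaywrightCrawler in Crawlee for Python (credentials and host are placeholders):

```python
from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://user:password@proxy.example.com:8000'],  # placeholder
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)
```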

Input schema is not valid (Field schema.properties.files.enum is required)

input_schema.json ''' { "title": "Base64 Image Processor", "type": "object",...
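That validation error usually means a property uses an editor that requires an enum (for example `"editor": "select"`) without providing one. A hedged sketch of a shape that should validate, with placeholder values:

```json
{
    "title": "Base64 Image Processor",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "files": {
            "title": "Files",
            "type": "string",
            "description": "Which file set to process.",
            "editor": "select",
            "enum": ["set_a", "set_b"]
        }
    }
}
```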

Issues Creating an Intelligent Crawler & Constant Memory Overload

Hey there! I am creating an intelligent crawler using Crawlee. I was previously using crawl4ai but switched, since Crawlee seems much better at anti-blocking. The main issue I am facing is that I want to filter the URLs to crawl for a given page using LLMs. Is there a clean way to do this? So far I implemented a transformer for enqueue_links which saves the links to a dict, and then I process those dicts at a later point in time using another crawler object. Any other suggestions to solve this problem? I don't want to make the LLM call in the transform function, since that would mean one LLM call per URL found, which is quite expensive. Also, when I run this on my EC2 instance with 8 GB of RAM, it constantly runs into memory overload and just gets stuck, i.e. doesn't even continue scraping pages. Any idea how I can resolve this? This is my code currently...
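One hedged pattern that keeps it to a single LLM call per page rather than per link: collect all hrefs in the handler, filter the batch, then enqueue the survivors (`llm_filter` is a hypothetical helper). A `max_requests_per_crawl` cap can also help with the memory pressure.

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(max_requests_per_crawl=100)

async def llm_filter(urls: list[str]) -> list[str]:
    # Hypothetical helper: one LLM call that scores the whole batch.
    raise NotImplementedError

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Gather every link on the page, filter once, enqueue survivors.
    links = await context.page.eval_on_selector_all(
        'a[href]', 'els => els.map(e => e.href)'
    )
    await context.add_requests(await llm_filter(links))
```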

Selenium + Chrome Instagram scraper cannot find the Search button when I run it on Apify

Hey everyone, I have built an Instagram scraper using Selenium and Chrome that works perfectly until I deploy it as an actor here on Apify. It signs in fine, but when it gets to the Search button it fails every time, no matter what I do or try....

Error on cleanup PlaywrightCrawler

I use PlaywrightCrawler with headless=True. The package that I use is crawlee[playwright]==0.6.1. When running the crawler, I noticed that while waiting for remaining tasks to finish it sometimes receives an error like the one you can see in the screenshot. Is this something that can be resolved easily? ...

Google Gemini Applet - Google Module Not Found (even though it is there)

Hey all, I have a question about whether I can actually use Apify to access Google Gemini for video analysis. I've built my own Python version of the Gemini Video Analyzer applet that analyzes social media videos for content style, structure, and aesthetic qualities, and it works. I have installed all the required Google dependencies, but when I try to run it as an actor using "apify run --purge", no matter what I do, it says no module named google was found. Is this a bug with Apify? ...

"apify run" no longer able to detect python

Hey all, I successfully deployed one actor yesterday and followed all the same steps to deploy my next actor, but now the Apify CLI can no longer detect Python when I run "apify run", which is crazy because it had to detect it in order to build the first actor. This is the output in my terminal, which shows that it can't detect Python even though I can check the version without a problem: PS C:\Users\Ken\New PATH py\testing-it> apify run --purge...

Django Google Maps Reviews: Pulling Data into a Local Django App

Hi hi, I am looking for guidance on how I could interact with the Google Maps Scraper in my Django application. I already have a model and a view to which I would like to add the individual reviews from a particular listing. NB: I have numerous listings that I will also need to get the reviews from, and I need to present them based on their own URL/details...
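A hedged sketch using the official apify-client package from a Django view or management command; the token and the actor ID are placeholders you would look up on the platform.

```python
from apify_client import ApifyClient

client = ApifyClient('MY_APIFY_TOKEN')  # placeholder token

# Placeholder actor ID: substitute the actual Google Maps reviews actor.
run = client.actor('someuser/google-maps-reviews').call(
    run_input={'startUrls': [{'url': 'https://www.google.com/maps/place/...'}]},
)

# Iterate the run's default dataset and save items into your Django model.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item)  # e.g. Review.objects.create(**mapped_fields)
```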

Is it recommended to use Crawlee without the Apify CLI?

Is it recommended to use Crawlee without the Apify CLI? I am using the library because of how practical it makes creating crawlers, and I want to know about the experience of other devs using it the same way I am.

How can I change/save the logger that the context provides

The handler context provides a context.log, but I want to change/save the logger that is used, because I want to persist the logs. I am using Crawlee without the Apify CLI.
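A hedged sketch, assuming your Crawlee version accepts a custom stdlib logger via the crawler's `_logger` parameter; `context.log` then delegates to it, so an attached FileHandler persists the output.

```python
import logging

from crawlee.crawlers import PlaywrightCrawler

logger = logging.getLogger('my_crawler')
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler('crawler.log'))

# Assumption: _logger is the supported hook for swapping the crawler logger.
crawler = PlaywrightCrawler(_logger=logger)
```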

How can I add my own cookies to the crawler

Hi, I'm using Crawlee to fetch some data, but I don't know how to add my own cookies to my crawler. I'm using Playwright to fetch cookies, and after that I want to pass them (in a session, if possible) to my BeautifulSoupCrawler.
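One hedged approach, assuming a Crawlee for Python version where HTTP crawlers support pre-navigation hooks and the session exposes a dict-like cookie store: harvest the cookies with Playwright first, then inject them into the session before each request.

```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

# Cookies harvested beforehand, e.g. via Playwright's browser_context.cookies().
HARVESTED_COOKIES = {'sessionid': 'abc123'}  # placeholder values

crawler = BeautifulSoupCrawler()

@crawler.pre_navigation_hook
async def inject_cookies(context) -> None:
    # Assumption: session.cookies is a dict-like store keyed by cookie name.
    if context.session:
        context.session.cookies.update(HARVESTED_COOKIES)

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    title = context.soup.title.get_text() if context.soup.title else 'no title'
    context.log.info(title)
```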