Issues Creating an Intelligent Crawler & Constant Memory Overload
Hey there! I'm building an intelligent crawler using crawlee. I was previously using crawl4ai but switched, since crawlee seems much better at anti-blocking.
The main issue I'm facing: I want to filter the URLs to crawl from a given page using LLMs. Is there a clean way to do this? So far I've implemented a transformer for enqueue_links that saves the links to a dict, and then I process those dicts later with a second crawler object. Any other suggestions for this problem? I don't want to make the LLM call inside the transform function, since that would mean one LLM call per URL found, which is quite expensive. A sketch of the batching idea I'm leaning toward is below.
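To make that concrete, here's a minimal sketch of what I mean, assuming a recent crawlee-for-Python API (`BeautifulSoupCrawler`, `context.add_requests`). `filter_urls_with_llm` is a hypothetical placeholder for the actual LLM call; the point is that it receives all links from a page in one batch, so the cost is one LLM call per page instead of one per URL:

```python
import asyncio
from urllib.parse import urljoin

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def filter_urls_with_llm(urls: list[str]) -> list[str]:
    # Hypothetical placeholder: send the whole batch of URLs in ONE prompt
    # and return only the ones the model considers worth crawling.
    # Stubbed as a pass-through here.
    return urls


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=100)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Collect every link on the page first instead of enqueueing one by one.
        found = [
            urljoin(context.request.url, a['href'])
            for a in context.soup.find_all('a', href=True)
        ]
        # One LLM call per page (whole batch of URLs), not one per URL.
        accepted = await filter_urls_with_llm(found)
        # Enqueue only the URLs that passed the filter; crawlee dedupes
        # already-seen URLs on its own.
        await context.add_requests(accepted)

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

The handler does block on the LLM once per page, but that felt simpler than my current two-crawler setup where I stash links in a dict and re-crawl them later.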
Also, when I run this on my EC2 instance with 8 GB of RAM, it constantly runs into memory overload and just gets stuck, i.e. it doesn't even continue scraping pages. Any idea how I can resolve this? This is my code currently:
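On the memory side, the knob I'd guess matters is concurrency, since crawlee's autoscaled pool keeps adding parallel requests until it hits resource limits. Here's a sketch of how I'd try capping it; the `ConcurrencySettings` values are guesses, and the `CRAWLEE_MEMORY_MBYTES` environment variable is an assumption carried over from the JS version, not something I've verified for the Python port:

```python
from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler

# Cap parallelism so the autoscaled pool can't outgrow an 8 GB box.
crawler = BeautifulSoupCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=10,        # hard ceiling on concurrent requests
        max_tasks_per_minute=120,  # extra throttle on request rate
    ),
    max_requests_per_crawl=1_000,  # bound the total crawl size
)

# Assumption: like crawlee JS, the Python port may also honor
#   export CRAWLEE_MEMORY_MBYTES=6144
# to tell the autoscaler how much memory it is allowed to use.
```

Would capping things like this be the right fix, or is the real problem the dicts of links I keep around between the two crawlers?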