Issues Creating an Intelligent Crawler & Constant Memory Overload

Hey there! I'm building an intelligent crawler with crawlee. I was previously using crawl4ai but switched, since crawlee seems much better at anti-blocking.

The main issue I'm facing: I want to use LLMs to filter which URLs get crawled from a given page. Is there a clean way to do this? So far I've implemented a transform function for `enqueue_links` that saves the links into a dict, then processes those dicts later with a second crawler object. Any other suggestions for solving this? I don't want to make the LLM call inside the transform function, since that would mean one LLM call per URL found, which is quite expensive.

Also, when I run this on my EC2 instance with 8 GB of RAM, it constantly runs into memory overload and just gets stuck, i.e. it doesn't even continue scraping pages. Any idea how I can resolve this?
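Here's the shape of what I'm doing currently — a minimal sketch rather than my full file, assuming a recent crawlee-python release (imports from `crawlee.crawlers`); `llm_filter` is a hypothetical placeholder for the actual LLM call, stubbed with a same-domain filter so the sketch runs on its own:

```python
import asyncio
from urllib.parse import urljoin, urlparse

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def llm_filter(page_url: str, candidate_urls: list[str]) -> list[str]:
    # Hypothetical helper: send ALL links found on one page in a single
    # prompt and return the subset worth crawling -- one LLM call per
    # page instead of one per URL. Stubbed here with a trivial
    # same-domain filter so the sketch runs without an API key.
    host = urlparse(page_url).netloc
    return [url for url in candidate_urls if urlparse(url).netloc == host]


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Keep concurrency modest on an 8 GB box so the autoscaled
        # pool doesn't thrash against the memory ceiling.
        concurrency_settings=ConcurrencySettings(max_concurrency=5),
    )

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Collect the page's links instead of enqueueing them directly.
        candidates = [
            urljoin(context.request.url, a['href'])
            for a in context.soup.find_all('a', href=True)
        ]

        # One batched LLM call for the whole page.
        keep = await llm_filter(context.request.url, candidates)

        # Enqueue only the approved URLs; the request queue dedupes
        # anything it has already seen.
        await context.add_requests([Request.from_url(url) for url in keep])

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```

For the memory problem: would setting an explicit memory cap (the `memory_mbytes` setting, i.e. the `CRAWLEE_MEMORY_MBYTES` env var, if I'm reading the Configuration docs right) together with a lower `max_concurrency` be the right way to keep the autoscaled pool from stalling?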
Hall · 2mo ago
Someone will reply to you shortly. In the meantime, this might help:
quickest-silver (OP) · 2mo ago
