Crawlee & Apify•10mo ago

Hi! I'm building something using the

Hi! I'm building something using the Apify SDK for crawling. I'm currently trying to figure out how to tell the actor which URLs to skip during recrawls. Is excludeUrlGlobs the right input setting for this? Is there a limit on the number of exclusions? The plan is to regularly crawl news sites, but I would like to process something only if new data (URLs not previously visited) is found. Can somebody point me in the right direction?

4 Replies

sensitive-blue•10mo ago

Just use the same requestQueue across runs; the crawler will then skip URLs that were already handled.
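To illustrate why reusing one queue skips handled URLs, here is a minimal in-memory sketch of the idea. Everything in it (PersistentQueueModel, runCrawl, the example URLs) is hypothetical and is not the Crawlee or Apify API; in Crawlee the closest equivalent is probably a named queue opened with RequestQueue.open and passed to the crawler. The sketch also assumes that deduplication works on a per-request unique key, which is how the start URL can be let through again on every run.

```typescript
// Simplified model of a request queue that remembers handled requests.
// All names here are hypothetical; this is not the Crawlee or Apify API.
type QueuedRequest = { url: string; uniqueKey: string };

class PersistentQueueModel {
  // Unique keys of every request ever added; survives across runs.
  private seen = new Set<string>();
  private pending: QueuedRequest[] = [];

  // Returns true if the request was enqueued, false if its key was already known.
  add(url: string, uniqueKey: string = url): boolean {
    if (this.seen.has(uniqueKey)) return false;
    this.seen.add(uniqueKey);
    this.pending.push({ url, uniqueKey });
    return true;
  }

  // Hands out all pending requests and empties the pending list.
  drain(): QueuedRequest[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}

// One crawl run: always revisit the start URL, enqueue only unseen links.
// Returns the URLs that are new in this run (the start page excluded).
function runCrawl(
  queue: PersistentQueueModel,
  startUrl: string,
  runId: string,
  linksOnStartPage: string[],
): string[] {
  // A per-run unique key lets the start page through even though its URL was seen before.
  queue.add(startUrl, `${startUrl}#run-${runId}`);
  for (const link of linksOnStartPage) queue.add(link);
  return queue
    .drain()
    .map((request) => request.url)
    .filter((url) => url !== startUrl);
}
```

Calling runCrawl twice with the same queue object returns only the links that were not present in the first run, which is the behaviour you want for recurring news crawls. With a fresh queue on every run, nothing would be skipped.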

MEE6•10mo ago

@HonzaS just advanced to level 16! Thanks for your contributions! 🎉

sensitive-blue•10mo ago

How would I use the same request queue? I want the crawler to crawl the startUrl again, so I'd call client.actor(actorId).start() on it? I'm guessing the actorId will always stay the same. Does it keep the same request queue internally, or is this something I would need to handle? Also, do you know if I can control how much RAM the actor uses? It's currently set to 4 GB, but I would like to experiment with this, since I don't seem to need that much RAM per actor instance: I consume roughly 400-600 MB on average at the moment.

sensitive-blue•10mo ago

You need to implement it yourself. If you are using a premade actor, you are out of luck unless you can modify its code.
