Hi! I'm building something using the Apify SDK for crawling.
I'm currently trying to figure out how I can tell the actor which URLs to skip during recrawls.
Is excludeUrlGlobs the right input setting for this?
Is there a limit on the number of exclusions? The plan is to crawl news sites regularly, but I would like to process a page only if it is new (a URL that was not visited previously).
Can somebody point me in the right direction?
4 Replies
sensitive-blue•10mo ago
Just use the same requestQueue across runs; then the crawler will not visit URLs that were already handled.
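A minimal sketch of what reusing the queue could look like, assuming Apify SDK v3 with Crawlee. The queue name 'news-recrawl', the example URL, the isStart flag, and the helper freshStartRequests are all made up for illustration. A named queue persists between runs, so already-handled URLs are skipped; the start URL gets a per-run uniqueKey so it is fetched again each time.

```javascript
// Hypothetical helper: give every start URL a per-run uniqueKey so that a
// shared (named) request queue does not treat it as already handled.
function freshStartRequests(startUrls, runId) {
  return startUrls.map((url) => ({
    url,
    uniqueKey: `${url}#run-${runId}`,
    userData: { isStart: true },
  }));
}

// Sketch of the Actor body; requires the `apify` and `crawlee` packages.
async function main() {
  const { Actor } = await import('apify');
  const { CheerioCrawler } = await import('crawlee');

  await Actor.init();

  // A named queue is kept between runs (the name is an assumption).
  const requestQueue = await Actor.openRequestQueue('news-recrawl');

  const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
      if (request.userData.isStart) {
        // Links that were handled in an earlier run are deduplicated by the queue.
        await enqueueLinks();
        return;
      }
      // Only URLs that are new to the queue reach this point.
      await Actor.pushData({ url: request.url });
    },
  });

  await crawler.run(freshStartRequests(['https://example.com/news'], Date.now()));
  await Actor.exit();
}

// In the Actor's entry file: call main().
```

This only works when you control the Actor's code, since the queue has to be opened by name inside it.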
sensitive-blue•10mo ago
How would I use the same request queue? I want the crawler to crawl the startUrl again, so would I call client.actor(actorId).start() on it?
I'm guessing the actorId will always stay the same.
Does it keep the same request queue internally, or is this something I would need to handle?
Also, do you know if I can control how much RAM the actor uses? It's currently set to 4 GB, but I would like to play around with this, as I don't seem to need that much RAM per actor instance: I consume roughly 400-600 MB on average at the moment.
sensitive-blue•10mo ago
You need to implement it yourself. If you are using a premade actor, then you are out of luck unless you can modify its code.
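On the RAM question raised earlier in the thread: as far as I know, the apify-client start() call accepts a memory option in megabytes, so something like the sketch below may be enough. The helper name startWithMemory is made up, and the power-of-two check (minimum 128 MB) reflects my understanding of the platform's rule rather than anything confirmed in this thread.

```javascript
// Hypothetical helper. `client` is expected to be an ApifyClient instance:
//   const { ApifyClient } = require('apify-client');
//   const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
async function startWithMemory(client, actorId, input, memoryMbytes) {
  // Assumption: the platform accepts powers of two, starting at 128 MB.
  const isPowerOfTwo =
    Number.isInteger(memoryMbytes) &&
    memoryMbytes >= 128 &&
    (memoryMbytes & (memoryMbytes - 1)) === 0;
  if (!isPowerOfTwo) {
    throw new RangeError(`memory must be a power of two >= 128, got ${memoryMbytes}`);
  }
  // Start the run without waiting for it to finish, overriding the default memory.
  return client.actor(actorId).start(input, { memory: memoryMbytes });
}
```

With average usage around 400-600 MB, calling startWithMemory(client, actorId, input, 1024) would be a reasonable first value to try.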