Currently I'm running the Google Crawler
Currently I'm running the Google Crawler. I'm at 60,000 requests and noticed there is a search term in the list that I want to skip. As far as I can tell, there is no way to stop the run, edit the settings, and resurrect the run. I also can't stop the run, edit the settings, and start a new run with a setting like 'Don't crawl pages already crawled in run #x'. That leaves me with only two options: stop the run and start again (costly), or let it keep running with the unwanted term (also costly).
Adding an option to save all crawled URLs of an Actor in a central place, together with a setting like 'don't crawl those URLs again', would be a huge improvement in cases like this.
The same feature would also help in cases where an Actor can't crawl a whole country at once and has to go city by city: the city crawls unavoidably overlap, so every run re-crawls duplicate URLs, which is costly in both money and time.
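In the meantime, something like this can probably be approximated by reading the dataset of an earlier run and filtering new start URLs against it. A minimal sketch with the `apify-client` JS package; the run ID, the `url` field in the dataset items, and the candidate URL list are all assumptions about the specific Actor:

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Collect the URLs already crawled in a previous run and drop them from
// the candidate start URLs of the next run. The `url` field name is an
// assumption about the Actor's output schema; very large datasets may
// need pagination via offset/limit.
async function filterAlreadyCrawled(previousRunId: string, candidates: string[]) {
  const { items } = await client.run(previousRunId).dataset().listItems({ clean: true });
  const seen = new Set(items.map((item: any) => item.url));
  return candidates.filter((url) => !seen.has(url));
}
```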
Cheers.
equal-aqua•2y ago
This can technically be done by aborting the run, removing the unwanted requests from the queue via the API, and then resurrecting the run. But it requires some knowledge of the API and a bit about the specific Actor.
If you hit me up privately, I might be able to help.
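Roughly, the flow looks like the sketch below, using the `apify-client` JS package. The run ID, matching the unwanted term against the request URL, and the `listHead` limit are assumptions; a very large queue may need several passes:

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const runClient = client.run('RUN_ID'); // assumption: ID of the running crawl

// 1. Abort the run so the queue stops being consumed.
await runClient.abort();

// 2. Remove queued requests that match the unwanted search term.
//    listHead only returns requests from the front of the queue,
//    so a large queue may need repeated calls.
const queueClient = runClient.requestQueue();
const { items } = await queueClient.listHead({ limit: 1000 });
for (const request of items) {
  if (request.url.includes('unwanted-term')) { // assumption: the term appears in the URL
    await queueClient.deleteRequest(request.id);
  }
}

// 3. Resurrect the run; it continues with the pruned queue.
await runClient.resurrect();
```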