Currently I'm running the Google Crawler

I'm currently running the Google Crawler. I'm at 60,000 requests and noticed there is a search term in the list I want to skip. As far as I can tell, there is currently no way to abort the run, edit its settings, and then resurrect it. I also can't abort the run, edit the settings, and start a new run with an option like "don't crawl pages already crawled in run #x". That leaves me with only two options: stop the run and start again (costly), or let it run with the unwanted term (costly as well).

Adding an option to save all crawled URLs of an Actor in a central place, together with a setting "don't crawl those URLs again", would be a huge improvement in cases like this. It would also help when an Actor can't crawl a whole country at once and has to go per city: you unavoidably crawl duplicate URLs (overlap between cities) in each run, which is costly in both money and time. The same feature would be a great improvement for those cases too.

Cheers.
1 Reply
equal-aqua
2y ago
This can technically be done by aborting the run, then removing the unwanted requests from the queue via the API, and then resurrecting the run. But it requires some knowledge of the API and a bit of knowledge about the specific Actor. If you message me privately, I might be able to help.
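The abort → prune queue → resurrect workflow described above can be sketched against Apify's public v2 REST API. This is a minimal sketch, not a tested implementation: it assumes the standard endpoints (`POST /actor-runs/{runId}/abort`, `GET /request-queues/{queueId}/head`, `DELETE /request-queues/{queueId}/requests/{requestId}`, `POST /actor-runs/{runId}/resurrect`), and the token, run ID, queue ID, and banned-term matcher are all placeholders you would need to adapt to the specific Actor.

```python
import json
import os
import urllib.parse
import urllib.request

APIFY_API = "https://api.apify.com/v2"


def matches_banned_term(url: str, banned_terms: list[str]) -> bool:
    """Return True if the request URL contains any unwanted search term."""
    lowered = url.lower()
    return any(term.lower() in lowered for term in banned_terms)


def _call(method: str, path: str, params: dict) -> dict:
    """Small helper around urllib for authenticated Apify API calls."""
    query = urllib.parse.urlencode(params)
    req = urllib.request.Request(f"{APIFY_API}{path}?{query}", method=method)
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # DELETE returns an empty body; other calls return JSON.
        return json.loads(body) if body else {}


def prune_run(token: str, run_id: str, queue_id: str, banned_terms: list[str]) -> None:
    """Abort a run, delete queued requests matching banned terms, resurrect it."""
    auth = {"token": token}

    # 1. Abort the run so the queue stops being consumed.
    _call("POST", f"/actor-runs/{run_id}/abort", auth)

    # 2. Inspect the head of the request queue and delete unwanted requests.
    #    The head endpoint returns a limited batch, so a real cleanup may
    #    need to repeat this step until no matches remain.
    head = _call("GET", f"/request-queues/{queue_id}/head", {**auth, "limit": 1000})
    for item in head.get("data", {}).get("items", []):
        if matches_banned_term(item["url"], banned_terms):
            _call("DELETE", f"/request-queues/{queue_id}/requests/{item['id']}", auth)

    # 3. Resurrect the run; it continues from the pruned queue.
    _call("POST", f"/actor-runs/{run_id}/resurrect", auth)


if __name__ == "__main__":
    prune_run(
        token=os.environ["APIFY_TOKEN"],  # hypothetical env var
        run_id="YOUR_RUN_ID",             # placeholder
        queue_id="YOUR_QUEUE_ID",         # placeholder
        banned_terms=["unwanted term"],
    )
```

Note that deleting from the queue head only covers requests not yet locked by a crawler; aborting first (step 1) is what makes the pruning reliable before the run is resurrected.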