exotic-emerald
Clear URL queue at end of run?
I'm a data reporter at CBS News using crawlee to archive web pages. Currently, when I finish a crawl, the next crawl continues to crawl pages enqueued in the previous crawl.
Is there an easy fix to this? I've looked at the docs, specifically the persist_storage and purge_on_start parameters, but it's unclear from the documentation what exactly those do.
Happy to provide a code sample if helpful.
exotic-emeraldOP•7mo ago
If anyone comes across this post, I think I understand what's happening now: if crawlee hits the maximum number of requests defined in max_requests_per_crawl, it stops making requests but doesn't clear the request queue, so if you're still calling enqueue_links you'll end up with leftover pages in the queue.
Hi, are you by any chance using Jupyter Notebook when working with crawlee?
The behavior you describe corresponds to purge_on_start=False: the crawler reaches the max_requests_per_crawl limit and aborts, but on the next start it continues where it left off, since the queue is not cleared.
But if you are working in a Jupyter Notebook, the queue and cache stored in memory are not cleared until the session terminates.
exotic-emeraldOP•7mo ago
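If the goal is a fresh queue on every run, one hedged option is to make sure purging stays enabled. This is a sketch that assumes a recent crawlee for Python, where Configuration fields can be set via CRAWLEE_-prefixed environment variables; check your installed version's docs for the exact names.

```python
import os

# Sketch, assuming crawlee for Python reads its Configuration from
# CRAWLEE_-prefixed environment variables (pydantic-settings style).
# purge_on_start controls whether the default request queue and
# dataset are dropped at startup; persist_storage only controls
# whether storages are written to disk at all.
os.environ['CRAWLEE_PURGE_ON_START'] = 'true'

# The in-code equivalent would be constructing
# Configuration(purge_on_start=True) and handing it to the crawler,
# if your crawlee version exposes that parameter.
```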
Nope, this is running inside a Django app I'm building, not a notebook.
I managed to get it to crawl the correct pages by not setting a request limit but limiting the crawl depth instead.
However, I'm now seeing an issue where the crawler refuses to crawl a page that it has previously crawled, and it's not clear why.
If you want the crawler to crawl the same page again, you must pass a unique_key, for example:
exotic-emeraldOP•7mo ago
Boom, that worked for me. Thanks so much for the help!
exotic-emeraldOP•7mo ago
For posterity, if anyone comes across this thread: I had to provide a unique_key (I used a uuid because I want the pages to be crawled every time) to the Request object AND in the user_data argument to enqueue_links.
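The pattern described above can be sketched like this; make_request is a hypothetical stand-in, not a crawlee API, and the real call would be something along the lines of Request.from_url(url, unique_key=...) with the same key carried in user_data, per the summary above.

```python
import uuid

def make_request(url: str) -> dict:
    # Hypothetical stand-in for building a crawlee Request: a fresh
    # uuid per request means the dedup check never matches, so the
    # page gets crawled on every run.
    key = str(uuid.uuid4())
    return {'url': url, 'unique_key': key, 'user_data': {'unique_key': key}}

a = make_request('https://example.com/article')
b = make_request('https://example.com/article')
# Same URL, but distinct unique keys:
assert a['unique_key'] != b['unique_key']
```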