Crawlee & Apify•7mo ago
exotic-emerald

Clear URL queue at end of run?

I'm a data reporter at CBS News using crawlee to archive web pages. Currently, when I finish a crawl, the next crawl continues to crawl pages enqueued in the previous crawl. Is there an easy fix to this? I've looked at the docs, specifically the persist_storage and purge_on_start parameters, but it's unclear from the documentation what exactly those do. Happy to provide a code sample if helpful
8 Replies
Hall
Hall•7mo ago
Someone will reply to you shortly. In the meantime, this might help:
exotic-emerald
exotic-emeraldOP•7mo ago
If anyone comes across this post: I think I understand what's happening now. If crawlee hits the maximum number of requests defined in max_requests_per_crawl, it stops making requests but doesn't clear the request queue, so if you're calling enqueue_links you end up with leftover pages in the queue that carry over into the next run.
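For anyone looking for a direct answer to the thread title, here is a minimal sketch of clearing the default request queue at the end of a run, assuming your crawlee version exposes RequestQueue.open() and drop():

import asyncio

from crawlee.storages import RequestQueue


async def clear_default_queue() -> None:
    # Open the default request queue and drop it so leftover,
    # never-processed requests don't carry over into the next run.
    # A named queue would need RequestQueue.open(name='...').
    request_queue = await RequestQueue.open()
    await request_queue.drop()


asyncio.run(clear_default_queue())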
Mantisus
Mantisus•7mo ago
Hi, are you by any chance using a Jupyter Notebook when working with crawlee? The behavior you describe corresponds to purge_on_start=False: the crawler reaches the max_requests_per_crawl limit and stops, but on the next start it continues where it left off, since the queue is not cleared. If you are working in a Jupyter Notebook, the queue and cache stored in memory are not cleared until the session is terminated.
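For reference, a minimal sketch of setting the purge explicitly, assuming a crawlee version whose Configuration exposes purge_on_start and whose crawlers accept a configuration argument (import paths vary between versions):

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # purge_on_start=True (the default) wipes the default storages,
    # including the request queue, when the crawler starts; False
    # keeps them, so a new run resumes the old queue.
    config = Configuration(purge_on_start=True)
    crawler = BeautifulSoupCrawler(configuration=config)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())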
exotic-emerald
exotic-emeraldOP•7mo ago
Nope, this is running inside a Django app I'm building, not a notebook. I managed to get it to crawl the correct pages by not setting a request limit but rather limiting the crawl depth. However, I'm now seeing an issue where the crawler refuses to crawl a page that it has previously crawled, and it's not clear why.
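A minimal sketch of the depth-limiting approach mentioned above, assuming your crawlee version supports the max_crawl_depth option:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Limit how many links deep the crawler follows from the start
    # URLs, instead of capping the total number of requests.
    crawler = BeautifulSoupCrawler(max_crawl_depth=2)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())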
Mantisus
Mantisus•7mo ago
If you want the crawler to crawl the same page more than once, you must pass a unique_key. Example:
import asyncio

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
# (older crawlee versions import from crawlee.beautifulsoup_crawler instead)


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Same URL three times: distinct unique_key values prevent the
    # request queue from deduplicating them.
    request_1 = Request.from_url("https://httpbin.org/get", unique_key="1")
    request_2 = Request.from_url("https://httpbin.org/get", unique_key="2")
    request_3 = Request.from_url("https://httpbin.org/get", unique_key="3")

    await crawler.run(
        [
            request_1,
            request_2,
            request_3,
        ]
    )


asyncio.run(main())
exotic-emerald
exotic-emeraldOP•7mo ago
Boom that worked for me, thanks so much for the help
exotic-emerald
exotic-emeraldOP•7mo ago
For posterity, if anyone comes across this thread: I had to provide a unique_key (I used a UUID, because I want the pages to be crawled every time) both on the Request object AND in the user_data argument to enqueue_links.
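A minimal sketch of that setup, assuming enqueue_links accepts a user_data argument as described above; the example.com URL and the 'unique_key' key name inside user_data are illustrative assumptions to check against your crawlee version:

import asyncio
import uuid

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Archiving {context.request.url} ...')
        # Pass a fresh UUID through user_data when enqueueing, as
        # described above; the 'unique_key' key name is an assumption,
        # so check how your crawlee version maps user_data onto the
        # enqueued Request objects.
        await context.enqueue_links(user_data={'unique_key': str(uuid.uuid4())})

    # A fresh UUID as unique_key means the seed request is never
    # deduplicated against requests from earlier runs.
    start = Request.from_url('https://example.com', unique_key=str(uuid.uuid4()))
    await crawler.run([start])


asyncio.run(main())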
