Crawlee & Apify•7mo ago
exotic-emerald

Clear URL queue at end of run?

I'm a data reporter at CBS News using crawlee to archive web pages. Currently, when I finish a crawl, the next crawl continues to crawl pages enqueued in the previous crawl. Is there an easy fix to this? I've looked at the docs, specifically the persist_storage and purge_on_start parameters, but it's unclear from the documentation what exactly those do. Happy to provide a code sample if helpful
8 Replies
Hall
Hall•7mo ago
Someone will reply to you shortly. In the meantime, this might help:
exotic-emerald
exotic-emeraldOP•7mo ago
If anyone comes across this post: I think I understand what's happening now. If crawlee hits the maximum number of requests defined in max_requests_per_crawl, it stops making requests but doesn't clear the request queue, so if you're calling enqueue_links you end up with leftover pages in the queue that carry over into the next run.
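For anyone looking for a direct answer to the thread title, here is a minimal sketch of clearing the default request queue at the end of a run, assuming your crawlee version exposes RequestQueue.open() and drop():

import asyncio

from crawlee.storages import RequestQueue


async def clear_default_queue() -> None:
    # Open the default request queue and drop it so leftover,
    # never-processed requests don't carry over into the next run.
    # A named queue would need RequestQueue.open(name='...').
    request_queue = await RequestQueue.open()
    await request_queue.drop()


asyncio.run(clear_default_queue())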
Mantisus
Mantisus•7mo ago
Hi, are you by any chance using a Jupyter Notebook when working with crawlee? The behavior you describe corresponds to purge_on_start=False: the crawler reaches the max_requests_per_crawl limit and stops, but on the next start it continues where it left off, since the queue is not cleared. If you are working in a Jupyter Notebook, the queue and cache stored in memory are not cleared until the session is terminated.
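For reference, a minimal sketch of setting the purge explicitly, assuming a crawlee version whose Configuration exposes purge_on_start and whose crawlers accept a configuration argument (import paths vary between versions):

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # purge_on_start=True (the default) wipes the default storages,
    # including the request queue, when the crawler starts; False
    # keeps them, so a new run resumes the old queue.
    config = Configuration(purge_on_start=True)
    crawler = BeautifulSoupCrawler(configuration=config)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())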
exotic-emerald
exotic-emeraldOP•7mo ago
Nope, this is running inside a Django app I'm building, not a notebook. I managed to get it to crawl the correct pages by not setting a request limit but rather limiting the crawl depth. However, I'm now seeing an issue where the crawler refuses to crawl a page that it has previously crawled, and it's not clear why.
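A minimal sketch of the depth-limiting approach mentioned above, assuming your crawlee version supports the max_crawl_depth option:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Limit how many links deep the crawler follows from the start
    # URLs, instead of capping the total number of requests.
    crawler = BeautifulSoupCrawler(max_crawl_depth=2)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())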
Mantisus
Mantisus•7mo ago
If you want the crawler to crawl the same page more than once, you must pass a unique_key. Example:
import asyncio

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
# (older crawlee versions import from crawlee.beautifulsoup_crawler instead)


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Same URL three times: distinct unique_key values prevent the
    # request queue from deduplicating them.
    request_1 = Request.from_url("https://httpbin.org/get", unique_key="1")
    request_2 = Request.from_url("https://httpbin.org/get", unique_key="2")
    request_3 = Request.from_url("https://httpbin.org/get", unique_key="3")

    await crawler.run(
        [
            request_1,
            request_2,
            request_3,
        ]
    )


asyncio.run(main())
exotic-emerald
exotic-emeraldOP•7mo ago
Boom that worked for me, thanks so much for the help
exotic-emerald
exotic-emeraldOP•7mo ago
For posterity, if anyone comes across this thread: I had to provide a unique_key (I used a UUID, because I want the pages to be crawled every time) both on the Request object AND in the user_data argument to enqueue_links.
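A minimal sketch of that setup, assuming enqueue_links accepts a user_data argument as described above; the example.com URL and the 'unique_key' key name inside user_data are illustrative assumptions to check against your crawlee version:

import asyncio
import uuid

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Archiving {context.request.url} ...')
        # Pass a fresh UUID through user_data when enqueueing, as
        # described above; the 'unique_key' key name is an assumption,
        # so check how your crawlee version maps user_data onto the
        # enqueued Request objects.
        await context.enqueue_links(user_data={'unique_key': str(uuid.uuid4())})

    # A fresh UUID as unique_key means the seed request is never
    # deduplicated against requests from earlier runs.
    start = Request.from_url('https://example.com', unique_key=str(uuid.uuid4()))
    await crawler.run([start])


asyncio.run(main())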
