Error during PlaywrightCrawler cleanup

I use PlaywrightCrawler with headless=True. The package I use is crawlee[playwright]==0.6.1. When running the crawler, I noticed that while it is waiting for remaining tasks to finish it sometimes raises an error like the one in the screenshot. Is this something that can be resolved easily? I ask because I think this error is also related to another issue I have.

In my code I have my own batching system in place, and I noticed that memory slowly increases on each batch. After some investigation, ps -fC headless_shell showed a lot of headless_shell processes marked <defunct> (zombie processes), so I assume this is related to the cleanup failing on each crawl. Below is my code for the batching system:
# Project-specific helpers (prepare_requests_from_mongo, create_playwright_crawler,
# PageTag, LOGGER) are defined elsewhere in the project.
from typing import List

from crawlee import Request
from crawlee.storages import KeyValueStore

# Create key-value stores for batches
scheduled_batches = await prepare_requests_from_mongo(crawler_name)
processed_batches = await KeyValueStore.open(
    name=f'{crawler_name}-processed_batches'
)

# Create crawler
crawler = await create_playwright_crawler(crawler_name)

# Iterate over the batches
async for key_info in scheduled_batches.iterate_keys():
    urls: List[str] = await scheduled_batches.get_value(key_info.key)
    requests = [
        Request.from_url(
            url,
            user_data={
                'page_tags': [PageTag.HOME.value],
                'chosen_page_tag': PageTag.HOME.value,
                'label': PageTag.HOME.value,
            },
        )
        for url in urls
    ]
    LOGGER.info(f'Processing batch {key_info.key}')
    await crawler.run(requests)
    # Mark the batch as done: clear it from the scheduled store and record it as processed.
    await scheduled_batches.set_value(key_info.key, None)
    await processed_batches.set_value(key_info.key, urls)
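For context, the create_playwright_crawler helper is not shown in the post. Below is a minimal, hypothetical sketch of what such a helper might look like, assuming crawlee 0.6.x import paths and a router keyed on the 'label' user_data field; PageTag and LOGGER here are stand-ins for the poster's own definitions, and the real implementation may differ.

import logging
from enum import Enum

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.router import Router

LOGGER = logging.getLogger(__name__)


class PageTag(Enum):
    # Hypothetical stand-in for the project's own PageTag enum.
    HOME = 'HOME'


async def create_playwright_crawler(crawler_name: str) -> PlaywrightCrawler:
    """Sketch of the helper referenced above; not the poster's actual code."""
    router = Router[PlaywrightCrawlingContext]()

    @router.handler(PageTag.HOME.value)
    async def home_handler(context: PlaywrightCrawlingContext) -> None:
        # The actual scraping logic lives here; context.page is the Playwright page.
        LOGGER.info(f'Handling {context.request.url}')

    return PlaywrightCrawler(
        request_handler=router,
        headless=True,  # matches the setup described in the post
    )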
generous-apricot (OP) · 3mo ago
🤦‍♂️ Forgot to upload the screenshot.
generous-apricot (OP) · 3mo ago
UPDATE: I noticed this PR: https://github.com/apify/crawlee-python/pull/1046. It should fix my initial issue. Hopefully it will also fix the zombie processes on each batch 🙏
correct-apricot · 3mo ago
Yes, unfortunately this bug did not show up in tests during development, and I only discovered it while testing the release on one of my projects 😢 I think this should help with the zombie processes, since the error during file closing prevents the browser shutdown from completing correctly. But if it persists after the PR is released, feel free to create an Issue in the repository.
@ROYOSTI This should already be available in the beta release crawlee==0.6.3b3. If you decide to try it, please let me know if you observe any problems.
generous-apricot (OP) · 3mo ago
@Mantisus, I did a small rerun with crawlee==0.6.3b4. The issue with removing the tmp folder for PlaywrightCrawler is solved, but each batch still leaves a lot of zombie processes behind. Could I fix something in my code to prevent this, or is this something I should report as an Issue in the repository?
correct-apricot · 3mo ago
Got it. Yes, please report it as an Issue in the repository. You can try using use_incognito_pages=True; it may improve the situation with zombie processes (but it will slow down your crawler, since the browser cache will no longer be shared between requests). I am not sure, though, because if the zombie processes are not caused by the crash due to the file-closing error, we need to study the situation in detail.
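A minimal sketch of how that flag could be applied, assuming use_incognito_pages is accepted by the PlaywrightCrawler constructor; if a helper like create_playwright_crawler builds the crawler, the flag would go inside that helper instead.

from crawlee.crawlers import PlaywrightCrawler

# Each request gets its own fresh incognito browser context instead of a shared one,
# at the cost of losing the shared browser cache between requests.
crawler = PlaywrightCrawler(
    headless=True,
    use_incognito_pages=True,
)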
