Mantisus
Crawlee & Apify
Created by Andrew on 5/12/2025 in #crawlee-python
Scraped tweets are all mock tweets
Hi @Andrew. Questions regarding a specific Actor should be asked on the Actor's page, since its developers may not be in the Discord community: https://console.apify.com/actors/CJdippxWmn9uRfooo/issues
4 replies
Crawlee & Apify
Created by optimistic-gold on 5/2/2025 in #crawlee-python
How to send a URL with a label to the main file?
Thanks for your example. You can use Request with the from_url constructor for this (the id value below is a placeholder):
await crawler.run([Request.from_url('https://example.org/', label='PRODUCT', user_data={'id': 123})])
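The label then routes the request to the matching handler, where user_data is available on the request. A minimal sketch (the handler name is hypothetical):

@crawler.router.handler('PRODUCT')
async def product_handler(context) -> None:
    # user_data passed to Request.from_url is available on context.request
    product_id = context.request.user_data['id']
    context.log.info(f'Product {product_id}: {context.request.url}')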
6 replies
Crawlee & Apify
Created by extended-salmon on 5/2/2025 in #crawlee-python
How to send a URL with a label to the main file?
Sorry, could you please give an example of the code? I don't quite understand your question.
6 replies
Crawlee & Apify
Created by sunny-green on 8/14/2024 in #crawlee-python
How to save network requests made by the webpage I am scraping?
Hey @uandsaeed. To capture network traffic, you can use Playwright with the record_har_path parameter.
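For example, a minimal sketch with plain Playwright (the file name is arbitrary):

import asyncio
from playwright.async_api import async_playwright

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # record_har_path makes the context record all network traffic to a HAR file
        context = await browser.new_context(record_har_path='traffic.har')
        page = await context.new_page()
        await page.goto('https://example.org/')
        # The HAR file is written when the context is closed
        await context.close()
        await browser.close()

asyncio.run(main())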
8 replies
Crawlee & Apify
Created by exotic-emerald on 4/29/2025 in #crawlee-python
structlog support?
But stderr is also one of the standard outputs for the standard logger. I haven't used structlog, but given that it can wrap the standard logger, I don't see any problems with that, or any need for dirty tricks.
5 replies
Crawlee & Apify
Created by frail-apricot on 4/29/2025 in #crawlee-python
structlog support?
Hey @Rykari. Crawlee uses the standard logger, so you can plug in structlog by following its official documentation - https://www.structlog.org/en/stable/standard-library.html. The Crawlee documentation has an example of connecting loguru which you can use as a reference - https://crawlee.dev/python/docs/examples/configure-json-logging
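Roughly, following structlog's standard-library guide, the wiring could look like this (a sketch, not tested against Crawlee):

import logging
import structlog

structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt='iso'),
        # Hand the event dict over to the stdlib formatter below
        structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
)

# Render records from both structlog and stdlib loggers (such as Crawlee's) as JSON
handler = logging.StreamHandler()
handler.setFormatter(structlog.stdlib.ProcessorFormatter(processor=structlog.processors.JSONRenderer()))
logging.getLogger().addHandler(handler)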
5 replies
Crawlee & Apify
Created by exotic-emerald on 4/23/2025 in #crawlee-python
Memory is critically overloaded
Hi @ROYOSTI. I recommend you report this situation in the repository, as it could be either another AWS-specific case that needs custom handling or a bug in the resource counter.
5 replies
Crawlee & Apify
Created by fascinating-indigo on 4/17/2025 in #crawlee-python
Routers not working as expected
The default enqueue strategy is same-hostname. However, the links to the PDFs in your case are on a different host.
5 replies
Crawlee & Apify
Created by conscious-sapphire on 4/17/2025 in #crawlee-python
Routers not working as expected
Hey @Matheus Rossi. Thank you for your interest in the framework! Try using:
await context.enqueue_links(transform_request_function=transform_request, strategy='all')
5 replies
Crawlee & Apify
Created by conscious-sapphire on 4/15/2025 in #crawlee-python
Dynamically change dataset id based on root_domain
Hey @Rykari. Note the Dataset class - https://crawlee.dev/python/api/class/Dataset. You can open different Datasets in handlers and write data to them.
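A sketch of what that could look like; the name-derivation scheme here is just an illustration, and dataset names must stay within the allowed characters:

from urllib.parse import urlparse
from crawlee.storages import Dataset

@crawler.router.default_handler
async def default_handler(context) -> None:
    # Derive a dataset name from the root domain, e.g. 'example.com' -> 'example-com'
    name = urlparse(context.request.url).hostname.replace('.', '-')
    dataset = await Dataset.open(name=name)
    await dataset.push_data({'url': context.request.url})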
5 replies
Crawlee & Apify
Created by quickest-silver on 4/9/2025 in #crawlee-python
Handling of 4xx and 5xx in default handler (Python)
Could you give examples of the kind of behavior you want to achieve? Perhaps error_handler is a better fit for your case - https://crawlee.dev/python/api/class/BasicCrawler#error_handler
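For illustration, an error_handler is registered as a decorator and runs before a failed request is retried (a sketch):

@crawler.error_handler
async def error_handler(context, error) -> None:
    # Inspect the error before the request is retried
    context.log.warning(f'{context.request.url} failed: {error}')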
7 replies
Crawlee & Apify
Created by foreign-sapphire on 4/9/2025 in #crawlee-python
Handling of 4xx and 5xx in default handler (Python)
You need to include them all. Something like:
list(range(400, 600))
7 replies
Crawlee & Apify
Created by ratty-blush on 4/9/2025 in #crawlee-python
Handling of 4xx and 5xx in default handler (Python)
Hey @rast42. Standard Crawlee has its own behavior for error status handling:
5xx - triggers a retry
401, 403, 429 - trigger session rotation, if sessions are used
Other 4xx - marked as failed without retrying
If you want to handle any statuses yourself, you can use ignore_http_error_status_codes.
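Combined with the range from the earlier reply, a sketch (assuming a recent crawlee version where crawlers are imported from crawlee.crawlers):

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

# With all 4xx/5xx ignored as errors, such responses reach the default handler
crawler = HttpCrawler(ignore_http_error_status_codes=list(range(400, 600)))

@crawler.router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'{context.request.url} -> {context.http_response.status_code}')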
7 replies
Crawlee & Apify
Created by generous-apricot on 4/6/2025 in #crawlee-python
Camoufox and adaptive playwright
The browser_pool is set via playwright_crawler_specific_kwargs, but I don't have a way to test running it with Camoufox right now. However, if it is not supported, that would be a bug.
import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,
        playwright_crawler_specific_kwargs={
            'browser_pool': BrowserPool(plugins=[PlaywrightBrowserPlugin(browser_type='chromium')])
        },
    )

    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


asyncio.run(main())
7 replies
Crawlee & Apify
Created by equal-aqua on 4/6/2025 in #crawlee-python
Camoufox and adaptive playwright
Hey @Doigus. Could you create an issue with an example of the error you're getting and more context? https://github.com/apify/crawlee-python/issues
7 replies
Crawlee & Apify
Created by sensitive-blue on 3/22/2025 in #crawlee-python
Proxy example with PlaywrightCrawler
Hey. The documentation has examples of using PlaywrightCrawler with a proxy - https://crawlee.dev/python/docs/guides/proxy-management#crawler-integration. Try changing the proxies you are using. Judging by your error, I think your proxies have some kind of certificate conflict with Instagram.
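Along the lines of the linked guide (the proxy URL is a placeholder):

from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Requests will be routed through the configured proxy
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://user:password@proxy.example.com:8000'],
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)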
3 replies
Crawlee & Apify
Created by stormy-gold on 3/13/2025 in #apify-platform
Uncaught exception during the run of the Actor
Hey @Arindam. A new release was made today that should fix this. Try setting crawlee==0.6.5 - https://github.com/apify/crawlee-python/releases/tag/v0.6.5
5 replies
Crawlee & Apify
Created by automatic-azure on 3/7/2025 in #crawlee-python
Selenium + Chrome Instagram Scraper cannot find the Search button when I run it in Apify...
Hey. In such a context, I would recommend starting to debug with screenshots taken during the run on the Apify platform and saved to the key-value store. That way you can get a better understanding of what the problem is. Also test the crawler locally with the same proxy configuration. Any crawler may work differently locally and in the cloud, for example because of the proxy (if you didn't use a proxy locally). For example, YouTube doesn't show some popup windows for me because Ukraine doesn't have GDPR, but when the crawler runs through European proxies those popups will appear.
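A sketch of that debug pattern with Selenium and the Apify SDK (the key name is arbitrary, and driver is your existing Selenium driver):

from apify import Actor

# Inside the Actor run, after the page has (supposedly) loaded:
screenshot = driver.get_screenshot_as_png()
await Actor.set_value('debug-screenshot.png', screenshot, content_type='image/png')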
4 replies
Crawlee & Apify
Created by vicious-gold on 3/5/2025 in #crawlee-python
Error on cleanup PlaywrightCrawler
You can try using use_incognito_pages=True; maybe it will improve the situation with zombie processes (but it will reduce the speed of your crawler, as there will be no browser cache sharing between different requests). I am not sure, though: if it is not related to the crash caused by the file-closing error, we need to study the situation in detail.
10 replies
Crawlee & Apify
Created by conscious-sapphire on 3/5/2025 in #crawlee-python
Error on cleanup PlaywrightCrawler
Got it. Yes, please report it as an issue in the repository.
10 replies