Mantisus
Crawlee & Apify
Created by Andrew on 5/12/2025 in #crawlee-python
Scraped tweets are all mock tweets
Hi @Andrew. Questions regarding a specific Actor should be asked on the Actor's page, since its developers may not be in the Discord community: https://console.apify.com/actors/CJdippxWmn9uRfooo/issues
4 replies
Crawlee & Apify
Created by optimistic-gold on 5/2/2025 in #crawlee-python
How to send a URL with a label to the main file?
Thanks for your example. You can use Request with the from_url constructor for this (the id value below is a placeholder):
await crawler.run([Request.from_url('https://example.org/', label='PRODUCT', user_data={'id': 123})])
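The label then routes the request to the matching handler, where user_data is available on the request. A minimal sketch (the handler name is hypothetical):

@crawler.router.handler('PRODUCT')
async def product_handler(context) -> None:
    # user_data passed to Request.from_url is available on context.request
    product_id = context.request.user_data['id']
    context.log.info(f'Product {product_id}: {context.request.url}')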
6 replies
Crawlee & Apify
Created by extended-salmon on 5/2/2025 in #crawlee-python
How to send a URL with a label to the main file?
Sorry, could you please give an example of the code? I don't quite understand your question.
6 replies
Crawlee & Apify
Created by sunny-green on 8/14/2024 in #crawlee-python
How to save network requests made by the webpage I am scraping?
Hey @uandsaeed. To capture network traffic, you can use Playwright with the record_har_path parameter.
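For example, a minimal sketch with plain Playwright (the file name is arbitrary):

import asyncio
from playwright.async_api import async_playwright

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # record_har_path makes the context record all network traffic to a HAR file
        context = await browser.new_context(record_har_path='traffic.har')
        page = await context.new_page()
        await page.goto('https://example.org/')
        # The HAR file is written when the context is closed
        await context.close()
        await browser.close()

asyncio.run(main())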
8 replies
Crawlee & Apify
Created by exotic-emerald on 4/29/2025 in #crawlee-python
structlog support?
But stderr is also one of the standard outputs for the standard logger. I haven't used structlog, but given that it can wrap the standard logger, I don't see any problems with that, or any need for dirty tricks.
5 replies
Crawlee & Apify
Created by frail-apricot on 4/29/2025 in #crawlee-python
structlog support?
Hey @Rykari. Crawlee uses the standard logger, so you can plug in structlog by following its official documentation - https://www.structlog.org/en/stable/standard-library.html. The Crawlee documentation has an example of connecting loguru which you can use as a reference - https://crawlee.dev/python/docs/examples/configure-json-logging
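Roughly, following structlog's standard-library guide, the wiring could look like this (a sketch, not tested against Crawlee):

import logging
import structlog

structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt='iso'),
        # Hand the event dict over to the stdlib formatter below
        structlog.stdlib.ProcessorFormatter.wrap_for_formatter,
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
)

# Render records from both structlog and stdlib loggers (such as Crawlee's) as JSON
handler = logging.StreamHandler()
handler.setFormatter(structlog.stdlib.ProcessorFormatter(processor=structlog.processors.JSONRenderer()))
logging.getLogger().addHandler(handler)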
5 replies
Crawlee & Apify
Created by exotic-emerald on 4/23/2025 in #crawlee-python
Memory is critically overloaded
Hi @ROYOSTI. I recommend you report this situation in the repository, as it could be either another AWS-specific case that needs custom handling or a bug in the resource counter.
5 replies
Crawlee & Apify
Created by fascinating-indigo on 4/17/2025 in #crawlee-python
Routers not working as expected
The default enqueue strategy is same-hostname. However, the links to the PDFs in your case are on a different host.
5 replies
Crawlee & Apify
Created by conscious-sapphire on 4/17/2025 in #crawlee-python
Routers not working as expected
Hey @Matheus Rossi. Thank you for your interest in the framework! Try using:
await context.enqueue_links(transform_request_function=transform_request, strategy='all')
5 replies
Crawlee & Apify
Created by conscious-sapphire on 4/15/2025 in #crawlee-python
Dynamically change dataset id based on root_domain
Hey @Rykari. Note the Dataset class - https://crawlee.dev/python/api/class/Dataset. You can open different Datasets in handlers and write data to them.
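A sketch of what that could look like; the name-derivation scheme here is just an illustration, and dataset names must stay within the allowed characters:

from urllib.parse import urlparse
from crawlee.storages import Dataset

@crawler.router.default_handler
async def default_handler(context) -> None:
    # Derive a dataset name from the root domain, e.g. 'example.com' -> 'example-com'
    name = urlparse(context.request.url).hostname.replace('.', '-')
    dataset = await Dataset.open(name=name)
    await dataset.push_data({'url': context.request.url})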
5 replies
Crawlee & Apify
Created by quickest-silver on 4/9/2025 in #crawlee-python
Handling of 4xx and 5xx in default handler (Python)
Could you give examples of the kind of behavior you want to achieve? Perhaps error_handler is a better fit for your case - https://crawlee.dev/python/api/class/BasicCrawler#error_handler
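For illustration, an error_handler is registered as a decorator and runs before a failed request is retried (a sketch):

@crawler.error_handler
async def error_handler(context, error) -> None:
    # Inspect the error before the request is retried
    context.log.warning(f'{context.request.url} failed: {error}')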
7 replies
Crawlee & Apify
Created by foreign-sapphire on 4/9/2025 in #crawlee-python
Handling of 4xx and 5xx in default handler (Python)
You need to include them all. Something like:
list(range(400, 600))
7 replies
Crawlee & Apify
Created by ratty-blush on 4/9/2025 in #crawlee-python
Handling of 4xx and 5xx in default handler (Python)
Hey @rast42. Standard Crawlee has its own behavior for error status handling:
5xx - triggers a retry
401, 403, 429 - trigger session rotation, if sessions are used
Other 4xx - marked as failed without retrying
If you want to handle any statuses yourself, you can use ignore_http_error_status_codes.
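Combined with the range from the earlier reply, a sketch (assuming a recent crawlee version where crawlers are imported from crawlee.crawlers):

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

# With all 4xx/5xx ignored as errors, such responses reach the default handler
crawler = HttpCrawler(ignore_http_error_status_codes=list(range(400, 600)))

@crawler.router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    context.log.info(f'{context.request.url} -> {context.http_response.status_code}')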
7 replies
Crawlee & Apify
Created by generous-apricot on 4/6/2025 in #crawlee-python
Camoufox and adaptive playwright
The browser_pool is set via playwright_crawler_specific_kwargs, but I don't have a way to test running it with Camoufox right now. However, if it is not supported, that would be a bug.
import asyncio

from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,
        playwright_crawler_specific_kwargs={
            'browser_pool': BrowserPool(plugins=[PlaywrightBrowserPlugin(browser_type='chromium')])
        },
    )

    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])


asyncio.run(main())
7 replies
Crawlee & Apify
Created by equal-aqua on 4/6/2025 in #crawlee-python
Camoufox and adaptive playwright
Hey @Doigus. Could you create an issue with an example of the error you're getting and more context? https://github.com/apify/crawlee-python/issues
7 replies
Crawlee & Apify
Created by sensitive-blue on 3/22/2025 in #crawlee-python
Proxy example with PlaywrightCrawler
Hey. The documentation has examples of using PlaywrightCrawler with a proxy - https://crawlee.dev/python/docs/guides/proxy-management#crawler-integration. Try changing the proxies you are using. Judging by your error, I think your proxies have some kind of certificate conflict with Instagram.
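Along the lines of the linked guide (the proxy URL is a placeholder):

from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Requests will be routed through the configured proxy
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://user:password@proxy.example.com:8000'],
)
crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)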
3 replies
Crawlee & Apify
Created by stormy-gold on 3/13/2025 in #apify-platform
Uncaught exception during the run of the Actor
Hey @Arindam. A new release was made today that should fix this. Try setting crawlee==0.6.5 - https://github.com/apify/crawlee-python/releases/tag/v0.6.5
5 replies
Crawlee & Apify
Created by automatic-azure on 3/7/2025 in #crawlee-python
Selenium + Chrome Instagram Scraper cannot find the Search button when I run it in Apify...
Hey. In such a context, I would recommend starting to debug with screenshots taken during the run on the Apify platform and saved to the key-value store. That way you can get a better understanding of what the problem is. Also test the crawler locally with the same proxy configuration. Any crawler may work differently locally and in the cloud, for example because of the proxy (if you didn't use a proxy locally). For example, YouTube doesn't show some popup windows for me because Ukraine doesn't have GDPR, but when the crawler runs through European proxies those popups will appear.
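A sketch of that debug pattern with Selenium and the Apify SDK (the key name is arbitrary, and driver is your existing Selenium driver):

from apify import Actor

# Inside the Actor run, after the page has (supposedly) loaded:
screenshot = driver.get_screenshot_as_png()
await Actor.set_value('debug-screenshot.png', screenshot, content_type='image/png')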
4 replies
Crawlee & Apify
Created by vicious-gold on 3/5/2025 in #crawlee-python
Error on cleanup PlaywrightCrawler
You can try using use_incognito_pages=True; maybe it will improve the situation with zombie processes (but it will reduce the speed of your crawler, as there will be no browser cache sharing between different requests). I am not sure, though: if it is not related to the crash caused by the file-closing error, we need to study the situation in detail.
10 replies
Crawlee & Apify
Created by conscious-sapphire on 3/5/2025 in #crawlee-python
Error on cleanup PlaywrightCrawler
Got it. Yes, please report it as an issue in the repository.
10 replies