Crawlee & Apify


This is the official developer community of Apify and Crawlee.


other-emerald · 2/18/2025

Is it recommended to use Crawlee without the Apify CLI?

Is it recommended to use Crawlee without the Apify CLI? I am using the library because of how practical it makes building crawlers, and I would like to hear from other devs who use it the same way I do.
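
Crawlee is a standalone library and does not require the Apify CLI, which mainly scaffolds and deploys projects. A minimal standalone sketch; import paths assume a recent crawlee-python release:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Plain library usage: no Apify CLI, no Actor wrapper.
        context.log.info(f'Visited {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```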
other-emerald · 2/18/2025

How can I change/save the logger that the context provides?

The handler context provides a context.log, but I want to change the logger that is used, or save its output to a file. I am using Crawlee without the Apify CLI.
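
A possible answer, assuming context.log is a standard logging.Logger that propagates to the root logger (which matches recent crawlee-python versions): attach your own handler at the root, and everything the crawler and your handlers log gets persisted:

```python
import asyncio
import logging

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Persist everything (including context.log output) to a file of your choice.
    file_handler = logging.FileHandler('crawler.log')
    file_handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))
    logging.getLogger().addHandler(file_handler)

    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Visited {context.request.url}')  # lands in crawler.log too

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```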
unwilling-turquoise · 2/3/2025

How can I add my own cookies to the crawler

Hi, I'm using Crawlee to fetch some data, but I don't know how to add my own cookies to my crawler. I'm using Playwright to fetch cookies, and afterwards I want to pass them (in a session, if possible) to my BeautifulSoupCrawler.
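
One hedged approach: serialize the Playwright cookies into a Cookie header on the start requests. This assumes Request.from_url accepts a headers mapping, as in recent crawlee-python versions; a Session-based variant would also be possible:

```python
import asyncio

from playwright.async_api import async_playwright

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Step 1: collect cookies with Playwright (login flow omitted).
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://example.com/login')
        cookies = await page.context.cookies()
        await browser.close()

    # Step 2: serialize them into a Cookie header for the crawler's requests.
    cookie_header = '; '.join(f'{c["name"]}={c["value"]}' for c in cookies)

    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Fetched {context.request.url} with pre-set cookies')

    await crawler.run([
        # Assumption: from_url() accepts a headers mapping in your crawlee version.
        Request.from_url('https://example.com/account', headers={'Cookie': cookie_header}),
    ])


asyncio.run(main())
```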
like-gold · 2/3/2025

SAME_HOSTNAME not working on non-www URLs

When using EnqueueStrategy.SAME_HOSTNAME, I noticed it does not work properly on non-www URLs. In the debugger I saw that the origin is passed to _check_enqueue_strategy, but it uses context.request.loaded_url if available. So every checked URL mismatches because of the hostname difference ...
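
Until this is fixed upstream, a stdlib-only workaround is to normalize hostnames yourself and filter links with a custom check instead of the built-in strategy:

```python
from urllib.parse import urlparse


def same_site(url_a: str, url_b: str) -> bool:
    """Treat 'www.example.com' and 'example.com' as the same host."""
    host_a = (urlparse(url_a).hostname or '').removeprefix('www.')
    host_b = (urlparse(url_b).hostname or '').removeprefix('www.')
    return host_a == host_b


assert same_site('https://example.com/a', 'https://www.example.com/b')
assert not same_site('https://example.com/a', 'https://example.org/b')
```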
stormy-gold · 1/27/2025

Testing my first actor

Hi there. I am coming from scraperPAI solutions and I am having issues with them, so I just want to try Apify. I am trying to build my first actor, without any success so far. The test actor sample offers a full example, which sounds great, but I get an error whenever I use a URL other than the default one (https://www.apify.com). For example, with https://fr.indeed.com I get an error. Any idea?...
like-gold · 1/27/2025

Chromium sandboxing failed

I run Crawlee in a Docker container, which is used in a Jenkins task. When starting the crawler I receive the following error: ``` Browser logs: Chromium sandboxing failed!...
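
A common fix in containers is launching Chromium without its sandbox, since the container is already the isolation boundary. A sketch, assuming PlaywrightCrawler forwards browser_launch_options to Playwright, as in recent crawlee-python versions:

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        # WARNING: --no-sandbox disables Chromium's own sandbox; rely on it only
        # when the container/CI runner is itself the isolation boundary.
        browser_launch_options={'args': ['--no-sandbox', '--disable-setuid-sandbox']},
    )

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(context.request.url)

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```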
absent-sapphire · 1/26/2025

Not scheduling new tasks - system is overloaded - GCP Cloud Run

I am getting this system-overload message while just trying to scrape two URLs; the check has kept looping for almost 10 minutes now. I set the CPU to 4 and memory to 4 GB but still get this message. I know Cloud Run doesn't like threads and background tasks; is that the real issue? Not sure. Wondering if anyone has run crawlers on Cloud Run. ``` [crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task... [crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task... [crawlee._autoscaling.autoscaled_pool] DEBUG Not scheduling new tasks - system is overloaded...
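
The autoscaled pool throttles when the measured CPU/memory load stays high. One hedged mitigation is to cap concurrency explicitly so the pool never tries to outgrow a small Cloud Run instance (parameter names follow recent crawlee-python releases):

```python
from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        min_concurrency=1,
        max_concurrency=2,  # keep pressure low on a small Cloud Run instance
    ),
)
```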
genetic-orange · 1/26/2025

Error

Hi, why do I always get this error: raise ApifyApiError(response, attempt) apify_client._errors.ApifyApiError: You must rent a paid Actor in order to run it. I have Apify pro...
like-gold · 1/22/2025

enqueue_links only on match in URL path? Cancel request in pre_navigation_hook?

I have set up my handler so that it only enqueues links matching certain keywords. The problem is that I want the check to apply only to the URL path, not the full URL. For example: say I only want to enqueue links where the keyword "team" or "about" appears in the URL path. When crawling www.example.com and finding www.example.com/team, I want that URL to be queued....
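
One way to get path-only matching is to skip enqueue_links and filter extracted hrefs yourself; a sketch assuming a BeautifulSoup-based crawler:

```python
from urllib.parse import urljoin, urlparse

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

KEYWORDS = ('team', 'about')

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    for anchor in context.soup.select('a[href]'):
        url = urljoin(context.request.url, anchor['href'])
        # Match against the path only, so a keyword hit in the domain
        # (e.g. 'teamwork.com') does not enqueue an unrelated site.
        if any(kw in urlparse(url).path.lower() for kw in KEYWORDS):
            await context.add_requests([url])
```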
optimistic-gold · 1/20/2025

Does Crawlee crawl both root-relative and base-relative URLs?

Root-relative - prefixed with '/', i.e. href=/ASDF brings you to example.com/ASDF. Base-relative - no prefix, i.e. href=ASDF from example.com/test/ brings you to example.com/test/ASDF. If someone could point me to where in the library this logic occurs, I would be forever grateful...
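
Both cases follow standard RFC 3986 reference resolution, the same semantics as Python's urllib.parse.urljoin, which link extraction ultimately relies on when converting href values to absolute URLs:

```python
from urllib.parse import urljoin

# Root-relative: resolved against the origin.
assert urljoin('https://example.com/test/', '/ASDF') == 'https://example.com/ASDF'

# Base-relative: resolved against the current "directory".
assert urljoin('https://example.com/test/', 'ASDF') == 'https://example.com/test/ASDF'
```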
national-gold · 1/19/2025

Double log output

In main.py logging works as expected; however, in routes.py logging is printed twice for some reason. I did not set up any custom logging, I just use Actor.log.info("STARTING A NEW CRAWL JOB"). Example:...
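
Double output usually means the same record is handled twice, e.g. by a handler on the named logger and again by one on the root logger. A minimal sketch of the usual fixes (the logger name here is illustrative; inspect your actual loggers):

```python
import logging

log = logging.getLogger('apify')  # illustrative name

# Option 1: stop records from bubbling up to a root handler that prints them again.
log.propagate = False

# Option 2: drop duplicate handlers accumulated by repeated setup calls.
for handler in log.handlers[1:]:
    log.removeHandler(handler)
```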
national-gold · 1/19/2025

Clean way to stop "request queue seems to be stuck for 300.0"

A scraper that I am developing scrapes an SPA with infinite scrolling. This works fine, but after 300 seconds I get a WARN, which spawns another Playwright instance. This probably happens because I only handle one request (I do not add anything to the RequestQueue), inside which I just loop until a finished condition is met. ``` [crawlee.storages._request_queue] WARN The request queue seems to be stuck for 300.0s, resetting internal state. ({"queue_head_ids_pending": 0, "in_progress": ["tEyKIytjmqjtRvA"]})...
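
One pattern that avoids the stuck-queue heuristic is splitting the infinite scroll into many short requests: each handler processes one batch and re-enqueues the same URL under a fresh unique key. A sketch, assuming Request.from_url accepts unique_key and user_data as in recent crawlee-python versions; the stop condition is hypothetical:

```python
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    batch = context.request.user_data.get('batch', 1)
    # ... scroll and extract one batch here ...
    finished = batch >= 50  # hypothetical stop condition
    if not finished:
        await context.add_requests([
            Request.from_url(
                context.request.url,
                unique_key=f'scroll-{batch + 1}',  # distinct key, same URL
                user_data={'batch': batch + 1},
            ),
        ])
```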
national-gold · 1/18/2025

How to pass data to routes.py

If I use multiple files, what is the best way to pass data (user input, which contains 'max_results' or something) to my routes.py? Example snippet from main.py: ```py max_results = 5 # example...
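
A common pattern is to attach the input to the start request's user_data, which travels with the request into any handler in routes.py. A sketch (the module layout and handler signature are illustrative):

```python
# main.py - attach the input to the start request:
from crawlee import Request

max_results = 5  # example user input
start_request = Request.from_url(
    'https://example.com',
    user_data={'max_results': max_results},
)
# await crawler.run([start_request])

# routes.py - read it back inside any handler:
# @router.default_handler
# async def default_handler(context) -> None:
#     max_results = context.request.user_data.get('max_results', 10)
```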
metropolitan-bronze · 1/17/2025

Crawlee with multiple Crawlers?

Does the Python Crawlee allow multiple crawlers to be run using one router?
router = Router[BeautifulSoupCrawlingContext]()
Just asking because a colleague wondered whether it would be possible, since curl requests are a lot faster than Playwright; if we could use curl for half the requests and only load the browser for the portion where it's needed, it could significantly speed up some processes...
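
Routers are typed per crawling context, so the usual approach is one router per crawler; you can still run two crawlers side by side and hand only the browser-dependent URLs to the Playwright one. A sketch (the '#content' check is a hypothetical needs-JS heuristic):

```python
import asyncio

from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)


async def main() -> None:
    js_urls: list[str] = []  # URLs that turn out to need a real browser

    http_crawler = BeautifulSoupCrawler()

    @http_crawler.router.default_handler
    async def http_handler(context: BeautifulSoupCrawlingContext) -> None:
        if not context.soup.select('#content'):  # hypothetical "needs JS" check
            js_urls.append(context.request.url)

    browser_crawler = PlaywrightCrawler()

    @browser_crawler.router.default_handler
    async def browser_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Rendered {context.request.url} in a browser')

    await http_crawler.run(['https://example.com'])
    if js_urls:
        await browser_crawler.run(js_urls)


asyncio.run(main())
```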
rising-crimson · 1/13/2025

Extracting a websites URLs, prioritizing URLs in the footer

Hi, I need help finding an actor, or configuring the Website Content Crawler, to extract all the URLs from a site but not their content. I want to filter the URLs by keywords to find the one I'm looking for, but I don't need the content of the URLs. Thanks for your help...
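
If a ready-made Actor doesn't fit, a small custom extraction step can do this; a sketch with BeautifulSoup that prefers footer anchors and then filters by keyword:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_urls(html: str, base_url: str, keywords: tuple[str, ...]) -> list[str]:
    soup = BeautifulSoup(html, 'html.parser')
    # Prefer footer anchors; fall back to the whole page if there is no footer.
    anchors = soup.select('footer a[href]') or soup.select('a[href]')
    urls = [urljoin(base_url, a['href']) for a in anchors]
    return [u for u in urls if any(kw in u.lower() for kw in keywords)]
```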
evident-indigo · 1/11/2025

ImportError: cannot import name 'service_container' from 'crawlee'

When I build the actor and run it, I get the following error: 2025-01-10T18:47:43.475Z Traceback (most recent call last): 2025-01-10T18:47:43.476Z File "<frozen runpy>", line 198, in _run_module_as_main 2025-01-10T18:47:43.477Z File "<frozen runpy>", line 88, in _run_code 2025-01-10T18:47:43.478Z File "/usr/src/app/src/main.py", line 3, in <module>...
harsh-harlequin · 1/9/2025

Parsel crawler way too aggressive with request speed

Hi everyone! I am creating a crawler using Crawlee for Python. I noticed the Parsel crawler makes requests at a much higher frequency than the BeautifulSoup crawler. Is there a way to make the Parsel crawler slower, so we avoid getting blocked? Thanks!
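
A hedged sketch of throttling: cap concurrency and requests per minute via ConcurrencySettings (the max_tasks_per_minute knob exists in recent crawlee-python releases; verify against your version):

```python
from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler

crawler = ParselCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,
        max_tasks_per_minute=30,  # hard ceiling on the request rate
    ),
)
```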
optimistic-gold · 1/6/2025

Playwright Crawler on Windows?

Hello, I'm seeing (https://playwright.dev/python/docs/library#incompatible-with-selectoreventloop-of-asyncio-on-windows) that there is an incompatibility between Playwright and Windows SelectorEventLoop -- which Crawlee seems to require? Can you confirm whether it is possible to use a PlaywrightCrawlingContext in a Windows environment? I'm running into an asyncio NotImplementedError when trying to run the crawler, which suggests to me that there might be an issue. Thanks for the help.
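
Playwright needs subprocess support, which Windows only provides in the proactor event loop; Python defaults to it since 3.8, but if something in your stack switches to SelectorEventLoop, pin the policy back before any loop is created:

```python
import asyncio
import sys

# Run this at the very top of main.py, before any event loop exists.
if sys.platform == 'win32':
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
```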
optimistic-gold · 1/2/2025

Python Session Tracking

Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session-continuity requirements, so I need to ensure that for main page A, all requests to sub-pages linked from there (A-1, A-2, A-3, etc., as well as A-1-1, A-1-2, etc.) are made within the same session as the original request. Thanks as always....
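
One hedged way to force session continuity is a session pool capped at one session, so every request reuses the same session and cookie jar (parameter names follow recent crawlee-python versions):

```python
from crawlee.crawlers import BeautifulSoupCrawler
from crawlee.sessions import SessionPool

# A pool of exactly one session means every request shares the same
# session, and therefore the same cookies, for the whole crawl.
crawler = BeautifulSoupCrawler(
    use_session_pool=True,
    session_pool=SessionPool(max_pool_size=1),
)
```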
optimistic-gold · 1/2/2025

Issue with Residential Proxies

Hi there. Whenever I try to use residential proxies ('HTTP://groups-RESIDENTIAL:/...') I run into this error: httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129) The 'auto' group seems to work fine. Can anyone tell me what I'm doing wrong here?...
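
That error usually means the proxy intercepts TLS with its own certificate. A debugging sketch in plain httpx to confirm (the proxy URL is a placeholder; older httpx spells the argument proxies=); the proper fix is trusting the proxy's CA certificate, not shipping with verification disabled:

```python
import httpx

response = httpx.get(
    'https://example.com',
    proxy='http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000',  # placeholder
    verify=False,  # debugging only - never ship with verification disabled
)
print(response.status_code)
```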