Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Log in to Instagram using Facebook

Hello, I'm trying to log into Instagram using Facebook with Playwright. I'm struggling with a pop-up: I keep missing the right timing to click the "Allow all cookies" button. https://www.loom.com/share/a50934922679402cb46ecf59b80d88f7...
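
This doesn't solve the Facebook flow itself, but for the timing problem specifically: Playwright locators auto-wait, so instead of racing the pop-up with a fixed delay you can target the button by its accessible name and tolerate its absence. A minimal sketch (the button text and the 10 s timeout are assumptions):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // The cookie dialog may or may not appear; don't fail the login if it doesn't.
        const allowCookies = page.getByRole('button', { name: 'Allow all cookies' });
        try {
            // Locators auto-wait for the element to become visible and enabled.
            await allowCookies.click({ timeout: 10_000 });
            log.info('Dismissed the cookie dialog.');
        } catch {
            log.info('Cookie dialog did not show up within 10s, continuing.');
        }
        // ...continue with the "Log in with Facebook" flow here.
    },
});

await crawler.run(['https://www.instagram.com/']);
```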

enqueue urls / request queue not being unique

I'm seeing a lot of the exact same URLs being run twice. Any ideas?
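
For context, the request queue deduplicates on each request's uniqueKey, which Crawlee derives from a normalized URL; if the "duplicates" differ in query parameters or fragments they get different keys and both run. A sketch of normalizing the key yourself when enqueueing (the stripped parameters are only examples):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({
            transformRequestFunction: (req) => {
                // Collapse URLs that differ only in tracking params / fragment
                // into a single uniqueKey, so the queue treats them as one request.
                const url = new URL(req.url);
                url.hash = '';
                url.searchParams.delete('utm_source'); // example params, adjust to your site
                url.searchParams.delete('ref');
                req.uniqueKey = url.href;
                return req;
            },
        });
    },
});
```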

Issue with RequestQueue2

I am having an issue with queues. Here is the scenario: I am rotating sessions and getting the following error: "Error: Detected a session error, rotating session..." and after 10 retries I eventually got: ...
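
For reference, the "rotating session" message and the cap of roughly ten attempts come from the crawler's session-rotation handling, and both limits are configurable. A sketch, assuming a PlaywrightCrawler and default settings otherwise:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // How many times a single request may trigger a session rotation before it
    // is marked as failed (the default of 10 matches the behaviour above).
    maxSessionRotations: 20,
    // Ordinary retries for non-blocking errors are counted separately.
    maxRequestRetries: 5,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 50, // more sessions to rotate through
    },
    async requestHandler({ page }) {
        // ...
    },
});
```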

Error: PlaywrightCrawler:SessionPool:Session "Cookie not in this host's domain"

I am using PlaywrightCrawler with Firefox. When accessing wellfound.com I see this error:
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
...
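
That message is logged at DEBUG level when the session tries to store cookies set for a third-party domain (prod.website-files.com) against the page host (wellfound.com), and it is usually harmless. If it seems to interfere, one thing to try, sketched below and not a confirmed fix, is turning off per-session cookie persistence:

```ts
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    // Stops the crawler from copying browser/response cookies back into the
    // session store after navigation, which is where the mismatch is reported.
    persistCookiesPerSession: false,
    async requestHandler({ page }) {
        // ...
    },
});
```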

Sec-CH-UA header includes 'HeadlessChrome' when using @sparticuz/chromium

I've been playing around with deploying PlaywrightCrawler to AWS Lambda and it's working well. I've used @sparticuz/chromium for the Chrome executable, as per this doc: https://crawlee.dev/docs/deployment/aws-browsers However, upon examining the request headers it generates, I've discovered the sec-ch-ua hint header is always as follows: "HeadlessChrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129" ...
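
The fingerprint suite spoofs the User-Agent string, but the sec-ch-ua client hints are generated by the browser itself, and the old-style headless build that @sparticuz/chromium ships brands itself HeadlessChrome. One workaround worth trying, sketched below with an assumed brand string and no guarantee it covers every request type, is rewriting the header through request interception:

```ts
import { PlaywrightCrawler } from 'crawlee';

// Example replacement value; align the version with the Chromium build you ship.
const SEC_CH_UA = '"Chromium";v="129", "Google Chrome";v="129", "Not=A?Brand";v="8"';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Intercept every request the page makes and rewrite the client-hint
            // header so "HeadlessChrome" never appears on the wire.
            await page.route('**/*', (route) => route.continue({
                headers: { ...route.request().headers(), 'sec-ch-ua': SEC_CH_UA },
            }));
        },
    ],
    async requestHandler({ page }) {
        // ...
    },
});
```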

A site that shows a Cloudflare captcha ALWAYS

I immediately get a captcha on every URL. Accessing it in a normal GUI browser by typing the site's homepage URL: captcha. Searching for the site in Google and clicking the link in the results: the browser shows the site address and then... captcha. (By the way, they changed it; a few months ago this site was not that restrictive.)...

Bot detection (captcha) changed; Playwright + Crawlee + Firefox + rotating proxies no longer helps

I have a program (Playwright + Crawlee + Firefox + rotating proxies) used to scrape jobs from wellfound.com. In May 2024 (and earlier) it worked quite well for many months, despite the captcha protection on the site. Today I get HTTP 403 and a captcha (from ct.captcha-delivery.com). My code has not changed! Proxies: iproyal.com "residential proxies", session time 1 min ("sticky session"). What I did: in the same session I accessed URL1 and then URL2. URL1 has no captcha; URL2 contains the info I need and is/was protected with a captcha. In the past the trick of visiting URL1 and then URL2 in the same session worked well. Today I get a captcha when accessing URL2....
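
If the goal is simply to guarantee that URL1 and URL2 share the same browser, cookies and proxy exit, one option (a sketch; DataDome may still block it, and the proxy URL and job path are placeholders) is to do both navigations inside a single request handler rather than enqueueing them separately:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:12345'], // placeholder sticky-session proxy
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request, log }) {
        // First hit the unprotected page (URL1) to pick up cookies for this session...
        log.info(`Warm-up visit: ${request.url}`);

        // ...then navigate to the protected page (URL2) in the same page/context,
        // so the proxy, cookies and fingerprint all stay identical.
        const protectedUrl = request.userData.protectedUrl;
        await page.goto(protectedUrl, { waitUntil: 'domcontentloaded' });
        // scrape here
    },
});

await crawler.run([{
    url: 'https://wellfound.com/',                              // URL1 (no captcha)
    userData: { protectedUrl: 'https://wellfound.com/jobs' },   // URL2 (example path)
}]);
```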

chromium version error in path

Hey Playwright creators! 👋 I'm running into a frustrating issue with Playwright and Chromium, and I could really use some help. Here's what's going on: The Error:...

Scrape JSON and HTML responses in different handlers

I do not know how to scrape a website that returns both JSON and HTML responses. My scraper needs to: 1. Send a request and parse a JSON response containing a list of URLs, which I will enqueue. 2. Scrape those URLs as HTML using Cheerio or whatever is required to do so....
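
One way to structure this in Crawlee, sketched below with an assumed JSON shape and selectors, is a single CheerioCrawler with a router: the list handler parses the JSON body and enqueues the detail URLs under a different label, and the detail handler gets the usual Cheerio $:

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Handler for the JSON endpoint: parse the body and enqueue the detail pages.
router.addHandler('LIST', async ({ body, enqueueLinks }) => {
    const data = JSON.parse(body.toString());            // assumed shape: { items: [{ url }] }
    await enqueueLinks({
        urls: data.items.map((item: { url: string }) => item.url),
        label: 'DETAIL',                                  // routes them to the HTML handler below
    });
});

// Handler for the HTML detail pages: Cheerio is already loaded for you.
router.addHandler('DETAIL', async ({ $, request, pushData }) => {
    await pushData({
        url: request.url,
        title: $('h1').first().text().trim(),             // example selector
    });
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // By default CheerioCrawler only accepts HTML/XML; allow the JSON endpoint too.
    additionalMimeTypes: ['application/json'],
});

await crawler.run([{ url: 'https://example.com/api/list', label: 'LIST' }]);
```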

Playwright with Firefox: New Windows vs Tabs and Chromium-specific Features

Hey Playwright community! I've been using Firefox with Playwright because it uses less CPU, but I've run into a couple of issues I'd love some help with: 1. New windows instead of tabs. I'm running Firefox in headless: false mode to check how things look, and I've noticed it opens a new window for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer new tabs to separate windows. ...

crawler.run only scrapes the first URL

Hi, my problem is that crawler.run(['https://keepa.com/#!product/4-B07GS6ZB7T', 'https://keepa.com/#!product/4-B0BZSWWK48']) only scrapes the first URL. I think this is because Crawlee considers them the same URL; if I replace the "#" with a "?" it works. Is there any way to make it work with URLs like this?
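
Crawlee derives a request's uniqueKey from a normalized URL, and the fragment ('#...') is dropped during normalization, so both Keepa links collapse into the same key. The usual workaround is to pass an explicit uniqueKey per request, for example the full URL, as in this sketch:

```ts
import { PlaywrightCrawler } from 'crawlee';

const urls = [
    'https://keepa.com/#!product/4-B07GS6ZB7T',
    'https://keepa.com/#!product/4-B0BZSWWK48',
];

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // ...
    },
});

// Give each request its own uniqueKey so the fragment-only difference
// is not normalized away by the request queue.
await crawler.run(urls.map((url) => ({ url, uniqueKey: url })));
```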

Router Class

I recently read a blog post about Playwright web scraping (https://blog.apify.com/playwright-web-scraping/#bonus-routing) and implemented its routing concept in my project. However, I'm encountering an issue with handling failed requests. Currently, when a request fails, the application stalls instead of proceeding to the next request. Do you have any suggestions for implementing a failedRequestHandler to address this problem?
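
For reference, failedRequestHandler is a crawler-level option that fires once a request has exhausted its retries, after which the crawler moves on to the next request. A minimal sketch combining it with a router (handler contents are placeholders):

```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();
router.addDefaultHandler(async ({ page, log }) => {
    log.info(`Scraping ${page.url()}`);
    // ...your routed handlers
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestRetries: 2,
    // Called after a request has failed all its retries; the crawler then
    // simply proceeds to the next request instead of stalling.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Giving up on ${request.url}: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
```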

WebRTC IP leak?

Hi, for the last couple of days I have been on a quest to evade detection for a project, which has proved quite challenging. As I researched the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need any special setup?
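
Independent of the fingerprint-suite question, you can also tell Chromium itself not to expose non-proxied WebRTC candidates via a launch argument. A sketch for the default Playwright template (treat it as a mitigation to verify against a WebRTC leak test, not a guaranteed fix):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            args: [
                // Restrict WebRTC to the proxied connection so local/public IPs
                // are not exposed through ICE candidates.
                '--force-webrtc-ip-handling-policy=disable_non_proxied_udp',
            ],
        },
    },
    async requestHandler({ page }) {
        // ...
    },
});
```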

Crawlee Playwright is detected as a bot

Checking on this page, Crawlee Playwright is detected as a bot due to CDP: https://www.browserscan.net/bot-detection This is a known issue, also discussed on:...

Puppeteer browser page stuck on redirections

When I use Puppeteer and the fingerprint injector with the generator, some redirects make the Puppeteer page (Firefox/Chromium) get stuck. After these redirections the page stops logging my interceptors (they just write the URL) and stops responding to resizing. If I create a new page manually in the same browser and follow the link with redirections, it's fine. Without the injector and generator everything works fine too...

Saving scraped data from dynamic URLs using Crawlee in an Express Server?

Hello all. I've been trying to build an app that triggers a scraping job when the API is hit. The initial endpoint hits a Crawlee router which has two handlers: one for scraping the URL list and the other for scraping the detail from each detail page. (The URL-list handler also enqueues the next URL-list page back to the URL-list handler, btw.) ...
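
One pattern that works for this, sketched below with placeholder route, selector and storage names: give each API call its own named RequestQueue and Dataset so concurrent jobs don't share state, run the crawler to completion, then read the dataset back. The two-handler router would slot into requestHandler the same way; it is collapsed into one handler here for brevity.

```ts
import express from 'express';
import { CheerioCrawler, Dataset, RequestQueue } from 'crawlee';
import { randomUUID } from 'node:crypto';

const app = express();

app.get('/scrape', async (req, res) => {
    const startUrl = String(req.query.url);      // dynamic URL from the caller
    const jobId = randomUUID();

    // Give every job its own storage so concurrent API calls don't share state.
    const requestQueue = await RequestQueue.open(jobId);
    const dataset = await Dataset.open(jobId);

    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ $, request, enqueueLinks }) {
            // Follow links to detail pages (selector is a placeholder).
            await enqueueLinks({ selector: 'a.detail-link' });
            // Store extracted data into this job's dataset, not the default one.
            await dataset.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run([startUrl]);               // resolves when the queue is drained
    const { items } = await dataset.getData();
    res.json({ jobId, items });
});

app.listen(3000);
```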

All requests from the queue have been processed, the crawler will shut down.

I'm working on a news web crawler and I set purgeOnStart=false so that I don't scrape duplicate news. However, in some cases I get the message "All requests from the queue have been processed, the crawler will shut down." and the crawler doesn't run. Any suggestions to fix this issue?
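
For context, that message means every request in the un-purged queue is already marked as handled, including your start URLs from previous runs, so there is nothing left to crawl. One way around it, sketched below with assumed names and a date-based uniqueKey scheme, is to keep a persistent named queue for deduplication but force the listing page to re-run each time:

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// A named queue is persisted across runs and is not purged on start,
// so previously scraped article URLs stay marked as handled.
const requestQueue = await RequestQueue.open('news');

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ $, request, enqueueLinks, pushData }) {
        if (request.label === 'LIST') {
            // Article links keep their natural uniqueKey, so duplicates
            // from earlier runs are silently skipped by the queue.
            await enqueueLinks({ selector: 'a.article', label: 'ARTICLE' });
            return;
        }
        await pushData({ url: request.url, title: $('h1').text() });
    },
});

// Give the listing page a fresh uniqueKey every run (here: the date),
// otherwise it is also considered "already handled" and the crawler
// shuts down immediately with the message above.
await requestQueue.addRequest({
    url: 'https://example-news-site.com/latest',
    label: 'LIST',
    uniqueKey: `list-${new Date().toISOString().slice(0, 10)}`,
});

await crawler.run();
```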

Crawlee not working with Cloudflare

It keeps returning 403 even with a rotating proxy pool. Source code:
```
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
...
```
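
For comparison, a minimal PlaywrightCrawler + ProxyConfiguration setup of the kind the truncated snippet implies looks roughly like the sketch below (the proxy URL and target are placeholders). Against Cloudflare, residential proxies combined with Firefox, the default fingerprinting, and retiring sessions that get a 403 are usually the first things to try:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { firefox } from 'playwright';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@residential-proxy.example.com:8000'], // placeholder
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    launchContext: { launcher: firefox },   // Firefox often fares better than headless Chromium
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ page, session, response, pushData }) {
        if (response?.status() === 403) {
            // Mark this proxy/session pair as blocked so it is not reused,
            // then let the request be retried with a fresh session.
            session?.retire();
            throw new Error('Blocked by Cloudflare (403), retrying with a new session');
        }
        await pushData({ url: page.url(), title: await page.title() });
    },
});

await crawler.run(['https://example.com']);
```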