Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Log in to Instagram using Facebook

Hello, I'm trying to log into Instagram using Facebook with Playwright. I'm struggling with a pop-up: I keep missing the right timing to click the "Allow all cookies" button. https://www.loom.com/share/a50934922679402cb46ecf59b80d88f7...
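
This doesn't solve the Facebook flow itself, but for the timing problem specifically: Playwright locators auto-wait, so instead of racing the pop-up with a fixed delay you can target the button by its accessible name and tolerate its absence. A minimal sketch (the button text and the 10 s timeout are assumptions):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // The cookie dialog may or may not appear; don't fail the login if it doesn't.
        const allowCookies = page.getByRole('button', { name: 'Allow all cookies' });
        try {
            // Locators auto-wait for the element to become visible and enabled.
            await allowCookies.click({ timeout: 10_000 });
            log.info('Dismissed the cookie dialog.');
        } catch {
            log.info('Cookie dialog did not show up within 10s, continuing.');
        }
        // ...continue with the "Log in with Facebook" flow here.
    },
});

await crawler.run(['https://www.instagram.com/']);
```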

enqueue urls / request queue not being unique

I'm seeing a lot of the exact same URLs being run twice. Any ideas?
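
For context, the request queue deduplicates on each request's uniqueKey, which Crawlee derives from a normalized URL; if the "duplicates" differ in query parameters or fragments they get different keys and both run. A sketch of normalizing the key yourself when enqueueing (the stripped parameters are only examples):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({
            transformRequestFunction: (req) => {
                // Collapse URLs that differ only in tracking params / fragment
                // into a single uniqueKey, so the queue treats them as one request.
                const url = new URL(req.url);
                url.hash = '';
                url.searchParams.delete('utm_source'); // example params, adjust to your site
                url.searchParams.delete('ref');
                req.uniqueKey = url.href;
                return req;
            },
        });
    },
});
```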

Issue with RequestQueue2

I am having an issue with queues. Here is the scenario: I am rotating sessions and getting the following error: "Error: Detected a session error, rotating session..." and after 10 retries I eventually got: ...
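
For reference, the "rotating session" message and the cap of roughly ten attempts come from the crawler's session-rotation handling, and both limits are configurable. A sketch, assuming a PlaywrightCrawler and default settings otherwise:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // How many times a single request may trigger a session rotation before it
    // is marked as failed (the default of 10 matches the behaviour above).
    maxSessionRotations: 20,
    // Ordinary retries for non-blocking errors are counted separately.
    maxRequestRetries: 5,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 50, // more sessions to rotate through
    },
    async requestHandler({ page }) {
        // ...
    },
});
```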

Error: PlaywrightCrawler:SessionPool:Session "Cookie not in this host's domain"

I am using PlaywrightCrawler with Firefox. When accessing wellfound.com I see this error:
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
...
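
That message is logged at DEBUG level when the session tries to store cookies set for a third-party domain (prod.website-files.com) against the page host (wellfound.com), and it is usually harmless. If it seems to interfere, one thing to try, sketched below and not a confirmed fix, is turning off per-session cookie persistence:

```ts
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    // Stops the crawler from copying browser/response cookies back into the
    // session store after navigation, which is where the mismatch is reported.
    persistCookiesPerSession: false,
    async requestHandler({ page }) {
        // ...
    },
});
```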

Sec-CH-UA header includes 'HeadlessChrome' when using @sparticuz/chromium

I've been playing around with deploying PlaywrightCrawler to AWS Lambda and it's working well. I've used @sparticuz/chromium for the Chrome executable, as per this doc: https://crawlee.dev/docs/deployment/aws-browsers However, upon examining the request headers it generates, I've discovered the sec-ch-ua hint header is always as follows: "HeadlessChrome";v="129", "Not=A?Brand";v="8", "Chromium";v="129" ...
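
The fingerprint suite spoofs the User-Agent string, but the sec-ch-ua client hints are generated by the browser itself, and the old-style headless build that @sparticuz/chromium ships brands itself HeadlessChrome. One workaround worth trying, sketched below with an assumed brand string and no guarantee it covers every request type, is rewriting the header through request interception:

```ts
import { PlaywrightCrawler } from 'crawlee';

// Example replacement value; align the version with the Chromium build you ship.
const SEC_CH_UA = '"Chromium";v="129", "Google Chrome";v="129", "Not=A?Brand";v="8"';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Intercept every request the page makes and rewrite the client-hint
            // header so "HeadlessChrome" never appears on the wire.
            await page.route('**/*', (route) => route.continue({
                headers: { ...route.request().headers(), 'sec-ch-ua': SEC_CH_UA },
            }));
        },
    ],
    async requestHandler({ page }) {
        // ...
    },
});
```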

A site that shows a Cloudflare captcha ALWAYS

I immediately get a captcha on every URL. Accessing it in a normal GUI browser by typing the site's homepage URL: captcha. Searching for the site in Google and clicking the link in the results: the browser shows the site address and then... captcha. (By the way, they changed it; a few months ago this site was not that restrictive.)...

Bot detection (captcha) changed; Playwright + Crawlee + Firefox + rotating proxies no longer helps

I have a program (Playwright + Crawlee + Firefox + rotating proxies) used to scrape jobs from wellfound.com. In May 2024 (and earlier) it worked quite well for many months, despite the captcha protection on the site. Today I get HTTP 403 and a captcha (from ct.captcha-delivery.com). My code has not changed! Proxies: iproyal.com "residential proxies", session time 1 min ("sticky session"). What I did: in the same session I accessed URL1 and then URL2. URL1 has no captcha; URL2 contains the info I need and is/was protected with a captcha. In the past the trick of visiting URL1 and then URL2 in the same session worked well. Today I get a captcha when accessing URL2....
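
If the goal is simply to guarantee that URL1 and URL2 share the same browser, cookies and proxy exit, one option (a sketch; DataDome may still block it, and the proxy URL and job path are placeholders) is to do both navigations inside a single request handler rather than enqueueing them separately:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:12345'], // placeholder sticky-session proxy
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request, log }) {
        // First hit the unprotected page (URL1) to pick up cookies for this session...
        log.info(`Warm-up visit: ${request.url}`);

        // ...then navigate to the protected page (URL2) in the same page/context,
        // so the proxy, cookies and fingerprint all stay identical.
        const protectedUrl = request.userData.protectedUrl;
        await page.goto(protectedUrl, { waitUntil: 'domcontentloaded' });
        // scrape here
    },
});

await crawler.run([{
    url: 'https://wellfound.com/',                              // URL1 (no captcha)
    userData: { protectedUrl: 'https://wellfound.com/jobs' },   // URL2 (example path)
}]);
```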

chromium version error in path

Hey Playwright creators! 👋 I'm running into a frustrating issue with Playwright and Chromium, and I could really use some help. Here's what's going on: The Error:...

Scrape JSON and HTML responses in different handlers

I do not know how to scrape a website that returns both JSON and HTML responses. My scraper needs to: 1. Send a request and parse a JSON response containing a list of URLs, which I will enqueue. 2. Scrape those URLs as HTML using Cheerio or whatever is required to do so....
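
One way to structure this in Crawlee, sketched below with an assumed JSON shape and selectors, is a single CheerioCrawler with a router: the list handler parses the JSON body and enqueues the detail URLs under a different label, and the detail handler gets the usual Cheerio $:

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Handler for the JSON endpoint: parse the body and enqueue the detail pages.
router.addHandler('LIST', async ({ body, enqueueLinks }) => {
    const data = JSON.parse(body.toString());            // assumed shape: { items: [{ url }] }
    await enqueueLinks({
        urls: data.items.map((item: { url: string }) => item.url),
        label: 'DETAIL',                                  // routes them to the HTML handler below
    });
});

// Handler for the HTML detail pages: Cheerio is already loaded for you.
router.addHandler('DETAIL', async ({ $, request, pushData }) => {
    await pushData({
        url: request.url,
        title: $('h1').first().text().trim(),             // example selector
    });
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    // By default CheerioCrawler only accepts HTML/XML; allow the JSON endpoint too.
    additionalMimeTypes: ['application/json'],
});

await crawler.run([{ url: 'https://example.com/api/list', label: 'LIST' }]);
```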

Playwright with Firefox: New Windows vs Tabs and Chromium-specific Features

Hey Playwright community! I've been using Firefox with Playwright because it uses less CPU, but I've run into a couple of issues I'd love some help with: 1. New windows instead of tabs. I'm running Firefox in headless: false mode to check how things look, and I've noticed it opens a new window for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer new tabs to separate windows. ...

crawler.run only scrapes the first URL

Hi, my problem is that crawler.run(['https://keepa.com/#!product/4-B07GS6ZB7T', 'https://keepa.com/#!product/4-B0BZSWWK48']) only scrapes the first URL. I think this is because Crawlee considers them the same URL; if I replace the "#" with a "?" it works. Is there any way to make it work with URLs like this?
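
Crawlee derives a request's uniqueKey from a normalized URL, and the fragment ('#...') is dropped during normalization, so both Keepa links collapse into the same key. The usual workaround is to pass an explicit uniqueKey per request, for example the full URL, as in this sketch:

```ts
import { PlaywrightCrawler } from 'crawlee';

const urls = [
    'https://keepa.com/#!product/4-B07GS6ZB7T',
    'https://keepa.com/#!product/4-B0BZSWWK48',
];

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // ...
    },
});

// Give each request its own uniqueKey so the fragment-only difference
// is not normalized away by the request queue.
await crawler.run(urls.map((url) => ({ url, uniqueKey: url })));
```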

Router Class

I recently read a blog post about Playwright web scraping (https://blog.apify.com/playwright-web-scraping/#bonus-routing) and implemented its routing concept in my project. However, I'm encountering an issue with handling failed requests. Currently, when a request fails, the application stalls instead of proceeding to the next request. Do you have any suggestions for implementing a failedRequestHandler to address this problem?
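
For reference, failedRequestHandler is a crawler-level option that fires once a request has exhausted its retries, after which the crawler moves on to the next request. A minimal sketch combining it with a router (handler contents are placeholders):

```ts
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();
router.addDefaultHandler(async ({ page, log }) => {
    log.info(`Scraping ${page.url()}`);
    // ...your routed handlers
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestRetries: 2,
    // Called after a request has failed all its retries; the crawler then
    // simply proceeds to the next request instead of stalling.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Giving up on ${request.url}: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
```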

WebRTC IP leak?

Hi, for the last couple of days I have been on a quest to evade detection for a project, which has proved quite challenging. As I researched the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need any special setup?
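
Independent of the fingerprint-suite question, you can also tell Chromium itself not to expose non-proxied WebRTC candidates via a launch argument. A sketch for the default Playwright template (treat it as a mitigation to verify against a WebRTC leak test, not a guaranteed fix):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            args: [
                // Restrict WebRTC to the proxied connection so local/public IPs
                // are not exposed through ICE candidates.
                '--force-webrtc-ip-handling-policy=disable_non_proxied_udp',
            ],
        },
    },
    async requestHandler({ page }) {
        // ...
    },
});
```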

Crawlee Playwright is detected as a bot

Checking on this page, Crawlee Playwright is detected as a bot due to CDP: https://www.browserscan.net/bot-detection This is a known issue, also discussed on:...

Puppeteer browser page stuck on redirections

When I use Puppeteer and the fingerprint injector with the generator, some redirects make the Puppeteer page (Firefox/Chromium) get stuck. After these redirections the page stops logging my interceptors (they just write the URL) and stops responding to resizing. If I create a new page manually in the same browser and follow the link with redirections, it's fine. Without the injector and generator everything works fine too...

Saving scraped data from dynamic URLs using Crawlee in an Express Server?

Hello all. I've been trying to build an app that triggers a scraping job when the API is hit. The initial endpoint hits a Crawlee router which has two handlers: one for scraping the URL list and the other for scraping the detail from each detail page. (The URL-list handler also enqueues the next URL-list page back to the URL-list handler, btw.) ...
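
One pattern that works for this, sketched below with placeholder route, selector and storage names: give each API call its own named RequestQueue and Dataset so concurrent jobs don't share state, run the crawler to completion, then read the dataset back. The two-handler router would slot into requestHandler the same way; it is collapsed into one handler here for brevity.

```ts
import express from 'express';
import { CheerioCrawler, Dataset, RequestQueue } from 'crawlee';
import { randomUUID } from 'node:crypto';

const app = express();

app.get('/scrape', async (req, res) => {
    const startUrl = String(req.query.url);      // dynamic URL from the caller
    const jobId = randomUUID();

    // Give every job its own storage so concurrent API calls don't share state.
    const requestQueue = await RequestQueue.open(jobId);
    const dataset = await Dataset.open(jobId);

    const crawler = new CheerioCrawler({
        requestQueue,
        async requestHandler({ $, request, enqueueLinks }) {
            // Follow links to detail pages (selector is a placeholder).
            await enqueueLinks({ selector: 'a.detail-link' });
            // Store extracted data into this job's dataset, not the default one.
            await dataset.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run([startUrl]);               // resolves when the queue is drained
    const { items } = await dataset.getData();
    res.json({ jobId, items });
});

app.listen(3000);
```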

All requests from the queue have been processed, the crawler will shut down.

I'm working on a news web crawler and I set purgeOnStart=false so that I don't scrape duplicate news. However, in some cases I get the message "All requests from the queue have been processed, the crawler will shut down." and the crawler doesn't run. Any suggestions to fix this issue?
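
For context, that message means every request in the un-purged queue is already marked as handled, including your start URLs from previous runs, so there is nothing left to crawl. One way around it, sketched below with assumed names and a date-based uniqueKey scheme, is to keep a persistent named queue for deduplication but force the listing page to re-run each time:

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// A named queue is persisted across runs and is not purged on start,
// so previously scraped article URLs stay marked as handled.
const requestQueue = await RequestQueue.open('news');

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ $, request, enqueueLinks, pushData }) {
        if (request.label === 'LIST') {
            // Article links keep their natural uniqueKey, so duplicates
            // from earlier runs are silently skipped by the queue.
            await enqueueLinks({ selector: 'a.article', label: 'ARTICLE' });
            return;
        }
        await pushData({ url: request.url, title: $('h1').text() });
    },
});

// Give the listing page a fresh uniqueKey every run (here: the date),
// otherwise it is also considered "already handled" and the crawler
// shuts down immediately with the message above.
await requestQueue.addRequest({
    url: 'https://example-news-site.com/latest',
    label: 'LIST',
    uniqueKey: `list-${new Date().toISOString().slice(0, 10)}`,
});

await crawler.run();
```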

Crawlee not working with Cloudflare

It keeps returning 403 even with a rotating proxy pool. Source code:
```
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
...
```
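
For comparison, a minimal PlaywrightCrawler + ProxyConfiguration setup of the kind the truncated snippet implies looks roughly like the sketch below (the proxy URL and target are placeholders). Against Cloudflare, residential proxies combined with Firefox, the default fingerprinting, and retiring sessions that get a 403 are usually the first things to try:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { firefox } from 'playwright';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@residential-proxy.example.com:8000'], // placeholder
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    launchContext: { launcher: firefox },   // Firefox often fares better than headless Chromium
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ page, session, response, pushData }) {
        if (response?.status() === 403) {
            // Mark this proxy/session pair as blocked so it is not reused,
            // then let the request be retried with a fresh session.
            session?.retire();
            throw new Error('Blocked by Cloudflare (403), retrying with a new session');
        }
        await pushData({ url: page.url(), title: await page.title() });
    },
});

await crawler.run(['https://example.com']);
```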