Crawlee & Apify


This is the official developer community of Apify and Crawlee.

Got captcha and HTTP 403 using PlaywrightCrawler

Got captcha and HTTP 403 when accessing wellfound.com. I get a captcha every time I access links like these (basically, any job ad on wellfound): https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer...
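
A common mitigation, sketched below under the assumption that the blocking is IP- and fingerprint-based: route the crawler through residential proxies (the proxy URL here is a placeholder) and keep Crawlee's fingerprint generation enabled.

```js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URL – substitute a real residential proxy.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://username:password@my-residential-proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Fingerprint generation is on by default; spelled out here for clarity.
    browserPoolOptions: { useFingerprints: true },
    async requestHandler({ request, page, log }) {
        log.info(`Opened ${request.url}`);
        // ... extract the job-ad details here.
    },
});

await crawler.run([
    'https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe',
]);
```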

enqueueLinksByClickingElements help

I have written this code for Puppeteer:
await puppeteerClickElements.enqueueLinksByClickingElements({ forefront: true, selector: 'a.js-color-change' })
But it generates this error:...
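
For reference, a minimal sketch of the usual call pattern, assuming a Crawlee version that exposes the helper directly on the PuppeteerCrawler request-handler context; the forefront option is omitted here since it may not be accepted by this particular helper.

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, enqueueLinksByClickingElements, log }) {
        log.info(`Processing ${request.url}`);
        // Clicks every matching element and enqueues whatever
        // navigations or new pages those clicks trigger.
        await enqueueLinksByClickingElements({
            selector: 'a.js-color-change',
        });
    },
});
```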

Continue scraping on the page where the last scrape failed

Let's say that we're going through a page which has a list of ads, with pagination at the end of each page. If for some reason our scraper can't open a page and fails, I'd like to have the information on the location of the failure and start the next scrape from it immediately. What are the best practices for tackling this issue?
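
One hedged pattern for this: record the last failing URL in a named key-value store from failedRequestHandler and seed the next run with it. A minimal sketch; the store name, key, selector, and start URL are placeholders.

```js
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const state = await KeyValueStore.open('scrape-state');

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // ... extract the ads on this page, then follow the pagination link.
        await enqueueLinks({ selector: 'a.next-page' });
    },
    // Called once a request has exhausted all of its retries.
    async failedRequestHandler({ request }) {
        await state.setValue('LAST_FAILED_URL', request.url);
    },
});

// On the next run, resume from the recorded page if there is one.
const lastFailed = await state.getValue('LAST_FAILED_URL');
await crawler.run([lastFailed ?? 'https://example.com/ads?page=1']);
```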

Blocking certain requests

I'm trying to block some requests in Puppeteer, but it doesn't seem to work if I run the script headed: ``` const blockedResourceTypes = ['webp', 'svg', 'mp4', 'jpeg', 'gif', 'avif', 'font'] const crawler = new PuppeteerCrawler({ launchContext: {...
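
As an alternative that behaves the same whether headless or headed, plain Puppeteer request interception in a pre-navigation hook can filter by resource type. Note that Puppeteer's resourceType() values are categories like 'image', 'media', or 'font', not file extensions such as 'webp' or 'svg'. A minimal sketch:

```js
import { PuppeteerCrawler } from 'crawlee';

const blockedResourceTypes = ['image', 'media', 'font'];

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            await page.setRequestInterception(true);
            page.on('request', (req) => {
                if (blockedResourceTypes.includes(req.resourceType())) {
                    return req.abort();
                }
                return req.continue();
            });
        },
    ],
    async requestHandler({ page }) {
        // ... scrape the page without the heavy assets.
    },
});
```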

Navigation timed out after 60 seconds.

I'm scraping a website. If I run it in headless mode I get this error; if I run it headed I see the webpage completely loaded (though the loading wheel still spins somehow). These are my routes: ``` router.addDefaultHandler(async ({ enqueueLinks, log, page, request }) => { if (request.url.includes('/ayuda/')){...
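
If a spinner keeps the page from ever firing the 'load' event, a hedged workaround is to raise the navigation timeout, relax the wait condition through gotoOptions, and then wait explicitly for the element you actually need (the selector below is a placeholder):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 120,
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            // Consider navigation done on DOMContentLoaded instead of full 'load'.
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],
    async requestHandler({ page }) {
        // Wait only for the content that matters to the scrape.
        await page.waitForSelector('.listing', { timeout: 30_000 });
        // ... continue extracting.
    },
});
```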

push data to S3

Is there any already-built solution to push data straight to online storage like S3 from Crawlee?
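
One option is to write each record to S3 yourself with the AWS SDK straight from the request handler; a minimal sketch, with the bucket name and key scheme as placeholders:

```js
import { CheerioCrawler } from 'crawlee';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const record = { url: request.url, title: $('title').text() };
        // One JSON object per scraped page; the key scheme is just an example.
        await s3.send(new PutObjectCommand({
            Bucket: 'my-scrape-bucket',
            Key: `results/${encodeURIComponent(request.url)}.json`,
            Body: JSON.stringify(record),
            ContentType: 'application/json',
        }));
    },
});
```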

JSDOMCrawler access features of JSDOM

I have set runScripts, but I would also like to set resources: "usable", pretendToBeVisual, and maybe canvas. I have not been successful accessing those JSDOM options. Any ideas would be greatly appreciated.
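
If JSDOMCrawler doesn't expose those constructor options, one fallback (sketched here, not the crawler's own API) is to fetch the HTML with HttpCrawler and build the JSDOM instance yourself, which gives full control over resources, pretendToBeVisual, and so on:

```js
import { HttpCrawler } from 'crawlee';
import { JSDOM } from 'jsdom';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        const dom = new JSDOM(body.toString(), {
            url: request.url,
            runScripts: 'dangerously',
            resources: 'usable',       // load subresources such as scripts
            pretendToBeVisual: true,   // provides requestAnimationFrame etc.
        });
        const { document } = dom.window;
        // ... query `document` as usual.
    },
});
```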

--disable-dev-shm-usage

How can I run Puppeteer with this flag? (Inside Crawlee, obviously.)
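
Chromium flags are passed through launchContext.launchOptions.args; a minimal sketch:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            // Forwarded straight to puppeteer.launch().
            args: ['--disable-dev-shm-usage'],
        },
    },
    async requestHandler({ page }) {
        // ...
    },
});
```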

Custom headers

I have a super secure website that I'm trying to scrape, and now I want to try using the sitemaps with google.com as the referer. How can I set this header for all requests?...
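
With a browser crawler, one way is to set the header in a pre-navigation hook so every navigation carries it; a minimal sketch for Puppeteer (Playwright's page.setExtraHTTPHeaders works the same way):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Applied to every request the page makes from now on.
            await page.setExtraHTTPHeaders({ referer: 'https://www.google.com/' });
        },
    ],
    async requestHandler({ page }) {
        // ...
    },
});
```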

Dataset import problem

When using Crawlee in a Node.js project (npm i crawlee), I keep getting this error (CheerioCrawler, btw): TypeError: Dataset is not a constructor. It comes from this section of my scraper code:...
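
One possible cause is instantiating it with new Dataset(...) (or an import that resolves to something other than the class); in current Crawlee you normally open a dataset or use the static pushData helper. A minimal sketch of both forms:

```js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Shortcut: pushes to the default dataset.
        await Dataset.pushData({ url: request.url, title: $('title').text() });

        // Or open a named dataset explicitly instead of `new Dataset()`.
        const results = await Dataset.open('my-results');
        await results.pushData({ url: request.url });
    },
});
```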

Override browser permission on PuppeteerCrawler

Hi, a quick question: how do I override a certain permission on a page when using PuppeteerCrawler? Something like so: ``` ..., preNavigationHooks: [...
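
A hedged sketch using Puppeteer's BrowserContext.overridePermissions from a pre-navigation hook; the permission names are examples:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            const origin = new URL(request.url).origin;
            // Grant the permissions before the page is navigated.
            await page.browserContext().overridePermissions(origin, [
                'geolocation',
                'clipboard-read',
            ]);
        },
    ],
    async requestHandler({ page }) {
        // ...
    },
});
```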

Keeping track of the parent page with PlaywrightCrawler

Hi! I'm using Crawlee as an e2e test for broken links and generated diagrams in our documentation website. So far it's been successful and the only thing I'm missing is figuring out what page actually contained the broken link. For example, this is the snippet I use to find pages that display the 404 message: ```...
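
One pattern for this: stamp every enqueued request with the URL of the page it was found on via userData, then read it back when the 404 is detected. A minimal sketch; the 404 check below is a stand-in for the snippet above.

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const is404 = (await page.getByText('Page not found').count()) > 0;
        if (is404) {
            log.error(`Broken link ${request.url} found on ${request.userData.parentUrl}`);
            return;
        }
        // Record the current page as the parent of everything enqueued from it.
        await enqueueLinks({
            userData: { parentUrl: request.loadedUrl ?? request.url },
        });
    },
});
```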

How to utilize memory in an Apify actor that runs a Crawlee program with CheerioCrawler?

I have attached a screenshot from my last run. During this run about 700 MB of memory sat idle while the program executed, even though I allocated 1024 MB to the actor. CPU usage also looks very low, and I have seen at most 15 concurrent requests running in the pool. Is there any way to make better use of these resources?
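
Concurrency in CheerioCrawler is driven by the AutoscaledPool, which ramps up gradually; if memory and CPU stay mostly idle, you can raise the floor and ceiling explicitly. A minimal sketch with illustrative numbers:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Never drop below 20 parallel requests, allow up to 100.
    minConcurrency: 20,
    maxConcurrency: 100,
    autoscaledPoolOptions: {
        // Start from a higher baseline instead of ramping up slowly.
        desiredConcurrency: 40,
    },
    async requestHandler({ request, $ }) {
        // ...
    },
});
```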

POST request with JSON data to get cookies and use these cookies to scrape further URLs

Hello all, I have a special situation: the website's response depends on the location of the IP address, but there is a way to change the location by calling an endpoint that returns cookies. I want to scrape the URLs once I have the cookies. How can I do that with Crawlee, and how will those cookies be managed with sessions? It's a bit complicated to explain, but I hope you get the idea of what I want. Thank you for reading this long post.
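
One hedged way to wire this up: send the location call as a POST from the same crawler, let the session keep the returned cookies, and only then enqueue the real URLs. The endpoint, payload, label, and target URLs below are placeholders, and a single-session pool is used so the cookies are guaranteed to be reused.

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,            // keep Set-Cookie headers on the session
    sessionPoolOptions: { maxPoolSize: 1 },    // one shared session for all requests
    additionalMimeTypes: ['application/json'], // the cookie endpoint returns JSON
    async requestHandler({ request, crawler, $ }) {
        if (request.label === 'SET_LOCATION') {
            // The session now holds the location cookies; enqueue the targets.
            await crawler.addRequests([
                { url: 'https://example.com/listing/1' },
                { url: 'https://example.com/listing/2' },
            ]);
            return;
        }
        // These requests reuse the same session and therefore send the cookies.
        // ... scrape with $ here.
    },
});

await crawler.run([{
    url: 'https://example.com/api/set-location',
    method: 'POST',
    payload: JSON.stringify({ country: 'DE' }),
    headers: { 'content-type': 'application/json' },
    label: 'SET_LOCATION',
}]);
```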

YouTube Scraper stops working well at 50 videos

Does anyone find the YouTube scraper starts OK but then comes to a halt at around 50 videos? This is when scraping a channel's videos. Any way to get around this?

How can I bypass the CSP in PlaywrightCrawler?

Bypassing the CSP in PlaywrightCrawler is not working! I'm receiving the following error: "page.waitForFunction: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src *.facebook.com *.fbcdn.net *.facebook.net *.google-analytics.com .google.com 127.0.0.1: 'unsafe-inline' blob: data: 'self' connect.facebook.net 'wasm-unsafe-eval'".",...

Keep scraping if element not found

This might be a question more closely related to JS and Playwright than to Crawlee, but let me give it a try. Depending on the existence of an element on a page, I want to decide whether to proceed with scraping or stop the process. When I look for the element with: const element = await page.getByText("some text") the crawler times out if the element doesn't show up. Any ideas how to implement this logic so the scrape proceeds even when the element is not found?...
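
Playwright locators only wait when an action is performed on them; count() resolves immediately with the current number of matches, so you can branch on it without hitting the handler timeout. A minimal sketch:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // Let the DOM settle before checking for the marker element.
        await page.waitForLoadState('domcontentloaded');

        // count() does not wait; it reports how many matches exist right now.
        const matches = await page.getByText('some text').count();
        if (matches === 0) {
            log.info('Element not found, skipping this page.');
            return;
        }
        // ... proceed with the scrape.
    },
});
```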

Scraper Time Zone

Am I correct in thinking that timestamps in Apify scraped data are in UTC?

HTTPCrawler proxy not working

I am trying to add a proxy to my HttpCrawler and it doesn't seem to work. I am following the same code structure as in the docs, but I keep getting an error saying: ERROR HttpCrawler: Request failed and reached maximum retries. RequestError: Client network socket disconnected before secure TLS connection was established. It works when I do it with my PuppeteerCrawler, and I have them set up the same way....
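
For comparison, the minimal wiring that should be all HttpCrawler needs; if this exact shape still fails with the TLS error, the proxy URL itself (protocol, port, or credentials) is the usual suspect. The URL below is a placeholder.

```js
import { HttpCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Placeholder – the same URL format that works for the PuppeteerCrawler.
    proxyUrls: ['http://username:password@proxy.example.com:8000'],
});

const crawler = new HttpCrawler({
    proxyConfiguration,
    async requestHandler({ request, body, log }) {
        log.info(`Fetched ${request.url} (${body.length} bytes)`);
    },
});
```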

Geonode Proxies

Hey, having some trouble trying to use the proxies provided by Geonode: `const proxyConfiguration = new ProxyConfiguration({ proxyUrls: [ "http://{username}:{password}@rotating-residential.geonode.com:9010"...