Crawlee & Apify


This is the official developer community of Apify and Crawlee.

Channels:
- crawlee-js
- apify-platform
- crawlee-python
- 💻hire-freelancers
- 🚀actor-promotion
- 💫feature-request
- 💻devs-and-apify
- 🗣general-chat
- 🎁giveaways
- programming-memes
- 🌐apify-announcements
- 🕷crawlee-announcements
- 👥community

Easiest way to scrape a JSON URL?

I would like to scrape the URL below, which returns a JSON dataset, and then create a dataset by extracting selected JSON objects from it. I tried the Vanilla JS scraper and the Cheerio scraper but did not manage to get a valid scrape. How would you approach this? Which (free) scraper would you choose? Does somebody have example code that would accomplish this? This is the URL: https://hypomat.glkb.ch/online-hypothek/api/hypomat/zinssatz...
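Since the endpoint already returns JSON, no HTML scraper (Cheerio, Vanilla JS, etc.) is needed: fetch the body, parse it, and keep only the fields you want. A minimal sketch of the extraction step, with made-up field names (`zinssatz`, `laufzeit`) and inline sample data standing in for the live response:

```javascript
// The endpoint returns JSON, so a plain HTTP request is enough -- no
// HTML parser required. The field names below ("zinssatz", "laufzeit")
// are made up; adjust them to the keys the API actually returns.
function pickFields(items, keys) {
  return items.map((item) =>
    Object.fromEntries(keys.map((k) => [k, item[k]]))
  );
}

// With Crawlee you would fetch the URL (e.g. with fetch or got) inside a
// requestHandler and push the rows to a Dataset; here we only demonstrate
// the field-selection step on sample data.
const sample = [
  { zinssatz: 2.1, laufzeit: 10, extra: 'ignored' },
  { zinssatz: 2.4, laufzeit: 15, extra: 'ignored' },
];
const rows = pickFields(sample, ['zinssatz', 'laufzeit']);
console.log(rows);
```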

General way to scrape blogs, articles and content?

Hi all! Is there a general way to scrape blogs of various types? I want to create a program that: ...

How to disable storage directory creation?

Hi! I tried to run multiple Playwright crawlers in parallel, and the crawlers were conflicting with each other. I want to store all data (request queue, datasets, key-value stores) in memory. How can I do that? I tried setting the persistStorage configuration option to false (it was mentioned in a discussion on GitHub), but it has no effect. I also tried setting defaultKeyValueStoreId, defaultRequestQueueId, and defaultDatasetId for each crawler. I thought I would get a separate directory for each crawler, but Crawlee creates the storage/key_value_stores/default and storage/request_queues/default directories. Node.js version: 18.13.0. Crawlee version: 3.1.4...
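One approach worth trying is to give each crawler its own `Configuration` with an in-memory storage client, rather than setting the option globally. A configuration sketch, assuming Crawlee 3.x and the `@crawlee/memory-storage` package (exact option plumbing may differ between 3.x releases):

```javascript
import { PlaywrightCrawler, Configuration } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Each crawler gets its own Configuration with a memory-only storage
// client, so nothing is written under ./storage and parallel crawlers
// cannot collide on the same default directories.
const crawlerA = new PlaywrightCrawler(
  {
    requestHandler: async ({ page, request }) => {
      /* ... */
    },
  },
  new Configuration({
    storageClient: new MemoryStorage({ persistStorage: false }),
  }),
);
```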

File format issue

I am trying to upload a file that has a list of URLs, and I wanted to ask what delimiter I should use. I have tried a couple of them, but nothing seems to work.
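For URL-list files, the safe format is plain text with one URL per line and no delimiter at all (the URL extractor looks for things that start with `http`). A quick local sanity check of such a file's contents:

```javascript
// One URL per line, no commas or semicolons. This mimics how a URL-list
// file can be split and filtered locally before uploading it.
const fileContents = 'https://example.com/a\nhttps://example.com/b\n';
const urls = fileContents
  .split('\n')
  .map((line) => line.trim())
  .filter((line) => line.startsWith('http'));
console.log(urls);
```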

Captcha detection?

How to detect captcha? I see this in the response HTML: ``` <head>...
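There is no universal captcha signal, but checking the response HTML for well-known captcha markers catches most cases. A simple heuristic sketch (the marker list is illustrative, not exhaustive):

```javascript
// Substrings that commonly appear in captcha/challenge pages.
const CAPTCHA_MARKERS = ['g-recaptcha', 'h-captcha', 'cf-challenge', 'captcha'];

// Returns true if the HTML contains any known captcha marker.
function looksLikeCaptcha(html) {
  const lower = html.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => lower.includes(marker));
}

console.log(looksLikeCaptcha('<div class="g-recaptcha"></div>')); // -> true
console.log(looksLikeCaptcha('<p>plain page</p>')); // -> false
```

In a crawler you would run this on the response body and, on a hit, retire the session or retry the request through a different proxy.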

Where to hook a Puppeteer request interceptor

Hi! Where in Crawlee is the best place to hook a Puppeteer request interceptor? I mean a mechanism like this: `// pure puppeteer code...
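In Crawlee v3 the usual place for this is a `preNavigationHooks` entry, which runs with the live Puppeteer `page` before each navigation. A configuration sketch (the hook body, dropping image requests, is just an example):

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  // Each hook runs before page.goto(); the context exposes the raw page,
  // so standard Puppeteer interception can be wired up here.
  preNavigationHooks: [
    async ({ page }) => {
      await page.setRequestInterception(true);
      page.on('request', (req) => {
        // Example: abort image requests to save bandwidth.
        if (req.resourceType() === 'image') req.abort();
        else req.continue();
      });
    },
  ],
  requestHandler: async ({ page, request }) => {
    /* ... */
  },
});
```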

Querying a dataset on the filesystem - like SQL

Hi! Let's assume I scrape data to folders: ./storage/datasets/ products categories...

failedRequestHandler, error argument, detailed error message lost

I am using PlaywrightCrawler and the failedRequestHandler to handle errors. Something like this: ``` const crawler = new PlaywrightCrawler({ ......
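One thing to check: the handler's second argument is the original `Error` object, and its `.stack` usually still carries the detail even when the summary message looks truncated; `request.errorMessages` additionally accumulates the message from every failed retry. A sketch of logging both (assuming Crawlee 3.x):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request }) => {
    /* ... */
  },
  // `error` is the Error from the final failed attempt;
  // request.errorMessages holds the message from each retry.
  failedRequestHandler: async ({ request }, error) => {
    console.error(`Failed: ${request.url}`);
    console.error(error.stack ?? error.message);
    console.error(request.errorMessages);
  },
});
```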

Using requestsFromUrl is throwing an Error

When I tried fetching the URLs from a text file, Apify threw an error with the latest crawlee/PlaywrightCrawler setup. I have attached a screenshot of the same error with Crawlee alone running on my local machine....

Firefox, PlaywrightCrawler, SSL_ERROR_BAD_CERT_DOMAIN error

One of the pages I want to scrape with PlaywrightCrawler returns an SSL_ERROR_BAD_CERT_DOMAIN error. I can reproduce it when I open the URL in Firefox/Chrome: the browser shows a warning prompt asking "...do you want to proceed?" So the error comes from the browser, not from Crawlee/Playwright...
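In plain Playwright the fix is the `ignoreHTTPSErrors` browser-context option; how to thread it through PlaywrightCrawler depends on your Crawlee version, but this is the setting that ultimately needs to reach the context. The bare Playwright form:

```javascript
import { firefox } from 'playwright';

// ignoreHTTPSErrors is a browser-context option; with it set, the
// SSL_ERROR_BAD_CERT_DOMAIN interstitial is skipped and navigation
// proceeds as if you had clicked "proceed" in the warning dialog.
const browser = await firefox.launch();
const context = await browser.newContext({ ignoreHTTPSErrors: true });
const page = await context.newPage();
await page.goto('https://example.com'); // replace with the failing URL
await browser.close();
```

Use this only for sites you trust, since it disables certificate validation for the whole context.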

Example of manually adding requests to the requestQueue

Hi! I have HTML like: `.. <a href="subpage.php?id=1">Title 1</a> <a href="subpage.php?id=2">Title 2</a>...
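Relative hrefs like `subpage.php?id=1` must be resolved against the page URL before being enqueued; the built-in `URL` class handles that. A sketch of the resolution step (the `crawler.addRequests` line in the comment is how you would then hand them to Crawlee):

```javascript
// Resolve relative hrefs against the page they were found on.
const base = 'https://example.com/list.php';
const hrefs = ['subpage.php?id=1', 'subpage.php?id=2'];
const absolute = hrefs.map((href) => new URL(href, base).href);
console.log(absolute);

// Inside a Crawlee requestHandler you would then enqueue them, e.g.:
//   await crawler.addRequests(absolute.map((url) => ({ url, label: 'DETAIL' })));
```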

Scraping different website structures

Can Crawlee be used to automate scraping of different website structures? Can you write code that scrapes a website without having to spend time inspecting each page for the specific HTML elements that hold the data you need?...
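There is no fully structure-free scraper, but generic signals (the `<title>` tag, meta description, `<article>` text) cover many sites without per-site selectors. A tiny regex-based sketch of the idea; a real crawler would use Cheerio or the DOM instead of regexes:

```javascript
// Pull structure-independent fields out of arbitrary HTML.
// Regexes are fragile on real-world HTML; this only illustrates the idea.
function extractGeneric(html) {
  const title = (html.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1] ?? null;
  const description =
    (html.match(/<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i) || [])[1] ?? null;
  return { title, description };
}

const result = extractGeneric(
  '<html><head><title>Hello</title><meta name="description" content="World"></head></html>'
);
console.log(result);
```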

PlaywrightCrawler.requestHandler: Error: mouse.move: Target page, context or browser has been closed

In the PlaywrightCrawler.requestHandler I am calling page.mouse.move, and sometimes I get this error: mouse.move: Target page, context or browser has been closed. Here is the sequence of calls: ```...
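The usual cause is that the page was torn down (handler timeout, retry, or navigation away) while the handler was still running. A hypothetical guard helper, demonstrated with a stub page object in place of a real Playwright `Page`:

```javascript
// Hypothetical helper for use inside a PlaywrightCrawler requestHandler;
// `page` is the Playwright Page from the crawling context.
async function safeMouseMove(page, x, y) {
  // The page can be closed mid-handler; isClosed() lets us skip the
  // mouse action instead of crashing with "Target page ... has been closed".
  if (page.isClosed()) return false;
  await page.mouse.move(x, y);
  return true;
}

// Quick check with a stub page that reports itself closed:
safeMouseMove({ isClosed: () => true }, 0, 0).then((moved) => console.log(moved)); // -> false
```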

Pagination from a start page to an end page

How can I implement pagination starting from, e.g., page 10 to page 20 or to the last page? Do I need to implement my own code for this, or does Crawlee provide something? I am able to see the last page number on the first page of the website I am scraping....
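Crawlee has no built-in "from page X to page Y" helper; the usual approach is to generate the page URLs yourself (after reading the last page number off the first page) and enqueue them all at once. A sketch, where the `?page=` parameter name is an assumption to adapt to the target site:

```javascript
// Build the URL for every page in [from, to], inclusive.
function pageUrls(baseUrl, from, to) {
  const urls = [];
  for (let p = from; p <= to; p++) {
    urls.push(`${baseUrl}?page=${p}`);
  }
  return urls;
}

const pages = pageUrls('https://example.com/products', 10, 20);
console.log(pages.length); // -> 11

// With a crawler you would then enqueue them, e.g.:
//   await crawler.addRequests(pages);
```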

Recover endless pagination items by clicking on showMore button

Hey guys! I'm trying to scrape all the products from a listing with endless pagination. In order to load the other items, I have to click a showMore button. I looked it up in the docs and tried several syntaxes, but I couldn't get it to work... ```js await page.$eval('main', (main) => { main.querySelector('.c-endless-paginator__button').click();
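One likely issue: a raw DOM `.click()` inside `$eval` dispatches an untrusted event that some frontends ignore, whereas Playwright's `page.click()` scrolls the element into view and performs a real click. A hypothetical click-until-gone helper, demonstrated with a stub in place of a real Playwright page (with a real page you would also wait for the new items after each click, e.g. with `waitForLoadState`):

```javascript
// Keep clicking the "show more" button until it is no longer visible;
// returns the number of clicks performed. `page` can be a Playwright Page.
async function clickUntilGone(page, selector) {
  let clicks = 0;
  while (await page.isVisible(selector)) {
    await page.click(selector);
    clicks += 1;
  }
  return clicks;
}

// Stub page that hides the button after two clicks, just to show the flow:
let remaining = 2;
const stub = {
  isVisible: async () => remaining > 0,
  click: async () => { remaining -= 1; },
};
clickUntilGone(stub, '.c-endless-paginator__button').then((n) => console.log(n)); // -> 2
```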

Which EC2 instance type is best suited for crawling?

Just tried a t3.small instance without much luck (running out of memory). Tried an r3.large, which looks better but seems to be weak on CPU. Any hints?

Can't input in Google

I am following this guide to learn Apify and Crawlee: https://developers.apify.com/academy/puppeteer-playwright/page/page-methods#screenshotting I am trying the code on that page; it should visit google.com, search, click the first result, and get the title. But when I run it, it just opens google.com and doesn't type anything into the input. Why is that?...
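A common cause is Google's cookie-consent dialog sitting on top of the page, so the typed text never reaches the search box; another is typing before the input exists. A hedged Puppeteer-style sketch, where the consent-button selector is an assumption that varies by region and changes often:

```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://google.com');

// Dismiss the cookie-consent dialog if it appears; until it is gone,
// typing goes nowhere. The selector here is an assumption.
const consent = await page.$('button[aria-label="Accept all"]');
if (consent) await consent.click();

// Wait for the search box before typing (Google has used both
// <input name="q"> and <textarea name="q">).
await page.waitForSelector('textarea[name="q"], input[name="q"]');
await page.type('textarea[name="q"], input[name="q"]', 'hello world');
await page.keyboard.press('Enter');
```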

How can I get my data scraped faster?

Good day guys! Thanks again for stopping by. I have an e-commerce scraper designed to extract over 200k records. So far, I have been testing some sellers' front pages with only about 7k records. For those small sellers, the average time to get the data is about 20 minutes using proxies or Apify. I am wondering if there is a way to get the data faster. Using Celery? A local server? Any help will be appreciated! Thanks...
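Throughput usually comes from raising concurrency rather than adding an external task queue like Celery: Crawlee already runs requests in parallel and autoscales within the limits you set. A configuration sketch (the numbers are illustrative and bounded in practice by memory, CPU, and proxy-pool size; a non-browser crawler like CheerioCrawler is also much faster than Playwright where the site doesn't need JS):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  // Crawlee autoscales between these bounds based on system load.
  minConcurrency: 10,
  maxConcurrency: 50,
  // Optional politeness/rate cap toward the target site.
  maxRequestsPerMinute: 300,
  requestHandler: async ({ $, request }) => {
    /* extract fields from the parsed page */
  },
});
```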

Crawlee not working(?) on a page with shadow dom

Hey, I've encountered a website using shadow DOM, where Crawlee isn't able to find elements (for a good reason): https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM Since there is no mention of shadow DOM in the docs, does anyone know what to look at to make it work?...
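Cheerio-based crawlers can't see shadow DOM at all, because the shadow tree only exists after the page's JavaScript runs. With PlaywrightCrawler it usually works out of the box, since Playwright's CSS selectors pierce open shadow roots automatically. A sketch (`my-widget` and `.price` are hypothetical selectors):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, pushData }) => {
    // Playwright's selectors pierce open shadow roots, so a plain
    // locator often finds shadow-DOM content directly:
    const price = await page.locator('.price').first().textContent();

    // Manual fallback: walk the shadowRoot inside the page context.
    const manual = await page.evaluate(() => {
      const host = document.querySelector('my-widget');
      return host?.shadowRoot?.querySelector('.price')?.textContent ?? null;
    });

    await pushData({ price, manual });
  },
});
```

Note that closed shadow roots (and shadow DOM inside cross-origin iframes) still can't be reached this way.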