Crawlee & Apify


This is the official developer community of Apify and Crawlee.

Channels:
- crawlee-js
- apify-platform
- crawlee-python
- 💻hire-freelancers
- 🚀actor-promotion
- 💫feature-request
- 💻devs-and-apify
- 🗣general-chat
- 🎁giveaways
- programming-memes
- 🌐apify-announcements
- 🕷crawlee-announcements
- 👥community

Easiest way to scrape a JSON URL?

I would like to scrape the URL below, which returns a JSON dataset, and then create a dataset by extracting selected JSON objects from it. I tried the Vanilla JS scraper and the Cheerio scraper but did not manage to get a valid scrape. How would you approach this? Which (free) scraper would you choose? Does somebody have example code that would accomplish this? This is the URL: https://hypomat.glkb.ch/online-hypothek/api/hypomat/zinssatz...
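Since the endpoint already returns JSON, no HTML scraper (Cheerio, Vanilla JS, etc.) is needed: fetch the body, parse it, and keep only the fields you want. A minimal sketch of the extraction step, with made-up field names (`zinssatz`, `laufzeit`) and inline sample data standing in for the live response:

```javascript
// The endpoint returns JSON, so a plain HTTP request is enough -- no
// HTML parser required. The field names below ("zinssatz", "laufzeit")
// are made up; adjust them to the keys the API actually returns.
function pickFields(items, keys) {
  return items.map((item) =>
    Object.fromEntries(keys.map((k) => [k, item[k]]))
  );
}

// With Crawlee you would fetch the URL (e.g. with fetch or got) inside a
// requestHandler and push the rows to a Dataset; here we only demonstrate
// the field-selection step on sample data.
const sample = [
  { zinssatz: 2.1, laufzeit: 10, extra: 'ignored' },
  { zinssatz: 2.4, laufzeit: 15, extra: 'ignored' },
];
const rows = pickFields(sample, ['zinssatz', 'laufzeit']);
console.log(rows);
```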

General way to scrape blogs, articles and content?

Hi all! Is there a general way to scrape blogs of various types? I want to create a program that: ...

How to disable storage directory creation?

Hi! I tried to run multiple Playwright crawlers in parallel, and the crawlers were conflicting with each other. I want to store all data (request queue, datasets, key-value stores) in memory. How can I do that? I tried setting the persistStorage configuration option to false (it was mentioned in a discussion on GitHub), but it has no effect. I also tried setting defaultKeyValueStoreId, defaultRequestQueueId, and defaultDatasetId for each crawler. I thought I would get a separate directory for each crawler, but Crawlee creates the storage/key_value_stores/default and storage/request_queues/default directories. Node.js version: 18.13.0. Crawlee version: 3.1.4...
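One approach worth trying is to give each crawler its own `Configuration` with an in-memory storage client, rather than setting the option globally. A configuration sketch, assuming Crawlee 3.x and the `@crawlee/memory-storage` package (exact option plumbing may differ between 3.x releases):

```javascript
import { PlaywrightCrawler, Configuration } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Each crawler gets its own Configuration with a memory-only storage
// client, so nothing is written under ./storage and parallel crawlers
// cannot collide on the same default directories.
const crawlerA = new PlaywrightCrawler(
  {
    requestHandler: async ({ page, request }) => {
      /* ... */
    },
  },
  new Configuration({
    storageClient: new MemoryStorage({ persistStorage: false }),
  }),
);
```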

File format issue

I am trying to upload a file that has a list of URLs, and I wanted to ask what delimiter I should use. I have tried a couple of them, but nothing seems to work.
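For URL-list files, the safe format is plain text with one URL per line and no delimiter at all (the URL extractor looks for things that start with `http`). A quick local sanity check of such a file's contents:

```javascript
// One URL per line, no commas or semicolons. This mimics how a URL-list
// file can be split and filtered locally before uploading it.
const fileContents = 'https://example.com/a\nhttps://example.com/b\n';
const urls = fileContents
  .split('\n')
  .map((line) => line.trim())
  .filter((line) => line.startsWith('http'));
console.log(urls);
```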

Captcha detection?

How to detect captcha? I see this in the response HTML: ``` <head>...
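There is no universal captcha signal, but checking the response HTML for well-known captcha markers catches most cases. A simple heuristic sketch (the marker list is illustrative, not exhaustive):

```javascript
// Substrings that commonly appear in captcha/challenge pages.
const CAPTCHA_MARKERS = ['g-recaptcha', 'h-captcha', 'cf-challenge', 'captcha'];

// Returns true if the HTML contains any known captcha marker.
function looksLikeCaptcha(html) {
  const lower = html.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => lower.includes(marker));
}

console.log(looksLikeCaptcha('<div class="g-recaptcha"></div>')); // -> true
console.log(looksLikeCaptcha('<p>plain page</p>')); // -> false
```

In a crawler you would run this on the response body and, on a hit, retire the session or retry the request through a different proxy.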

Where to hook a Puppeteer request interceptor

Hi! Where in Crawlee is the best place to hook a Puppeteer request interceptor? I mean a mechanism like this: `// pure puppeteer code...
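In Crawlee v3 the usual place for this is a `preNavigationHooks` entry, which runs with the live Puppeteer `page` before each navigation. A configuration sketch (the hook body, dropping image requests, is just an example):

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  // Each hook runs before page.goto(); the context exposes the raw page,
  // so standard Puppeteer interception can be wired up here.
  preNavigationHooks: [
    async ({ page }) => {
      await page.setRequestInterception(true);
      page.on('request', (req) => {
        // Example: abort image requests to save bandwidth.
        if (req.resourceType() === 'image') req.abort();
        else req.continue();
      });
    },
  ],
  requestHandler: async ({ page, request }) => {
    /* ... */
  },
});
```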

Querying a dataset on the filesystem - like SQL

Hi! Let's assume I scrape data to folders: ./storage/datasets/ products categories...

failedRequestHandler, error argument, detailed error message lost

I am using PlaywrightCrawler and the failedRequestHandler to handle errors. Something like this: ``` const crawler = new PlaywrightCrawler({ ......
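One thing to check: the handler's second argument is the original `Error` object, and its `.stack` usually still carries the detail even when the summary message looks truncated; `request.errorMessages` additionally accumulates the message from every failed retry. A sketch of logging both (assuming Crawlee 3.x):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request }) => {
    /* ... */
  },
  // `error` is the Error from the final failed attempt;
  // request.errorMessages holds the message from each retry.
  failedRequestHandler: async ({ request }, error) => {
    console.error(`Failed: ${request.url}`);
    console.error(error.stack ?? error.message);
    console.error(request.errorMessages);
  },
});
```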

Using requestsFromUrl is throwing an Error

When I tried fetching the URLs from a text file, Apify threw an error with the latest crawlee/PlaywrightCrawler setup. I have attached a screenshot of the same error with Crawlee alone running on my local machine....

Firefox, PlaywrightCrawler, SSL_ERROR_BAD_CERT_DOMAIN error

One of the pages I want to scrape with PlaywrightCrawler returns an SSL_ERROR_BAD_CERT_DOMAIN error. I can reproduce it when I open the URL in Firefox/Chrome: the browser shows a warning prompt asking "...do you want to proceed?" So the error comes from the browser, not from Crawlee/Playwright...
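In plain Playwright the fix is the `ignoreHTTPSErrors` browser-context option; how to thread it through PlaywrightCrawler depends on your Crawlee version, but this is the setting that ultimately needs to reach the context. The bare Playwright form:

```javascript
import { firefox } from 'playwright';

// ignoreHTTPSErrors is a browser-context option; with it set, the
// SSL_ERROR_BAD_CERT_DOMAIN interstitial is skipped and navigation
// proceeds as if you had clicked "proceed" in the warning dialog.
const browser = await firefox.launch();
const context = await browser.newContext({ ignoreHTTPSErrors: true });
const page = await context.newPage();
await page.goto('https://example.com'); // replace with the failing URL
await browser.close();
```

Use this only for sites you trust, since it disables certificate validation for the whole context.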

Example of manually adding requests to the requestQueue

Hi! I have HTML like: `.. <a href="subpage.php?id=1">Title 1</a> <a href="subpage.php?id=2">Title 2</a>...
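Relative hrefs like `subpage.php?id=1` must be resolved against the page URL before being enqueued; the built-in `URL` class handles that. A sketch of the resolution step (the `crawler.addRequests` line in the comment is how you would then hand them to Crawlee):

```javascript
// Resolve relative hrefs against the page they were found on.
const base = 'https://example.com/list.php';
const hrefs = ['subpage.php?id=1', 'subpage.php?id=2'];
const absolute = hrefs.map((href) => new URL(href, base).href);
console.log(absolute);

// Inside a Crawlee requestHandler you would then enqueue them, e.g.:
//   await crawler.addRequests(absolute.map((url) => ({ url, label: 'DETAIL' })));
```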

Scraping different website structures

Can Crawlee be used to automate scraping of different website structures? Can you write code that scrapes a website without having to spend time inspecting each page for the specific HTML elements that hold the data you need?...
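There is no fully structure-free scraper, but generic signals (the `<title>` tag, meta description, `<article>` text) cover many sites without per-site selectors. A tiny regex-based sketch of the idea; a real crawler would use Cheerio or the DOM instead of regexes:

```javascript
// Pull structure-independent fields out of arbitrary HTML.
// Regexes are fragile on real-world HTML; this only illustrates the idea.
function extractGeneric(html) {
  const title = (html.match(/<title[^>]*>([^<]*)<\/title>/i) || [])[1] ?? null;
  const description =
    (html.match(/<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i) || [])[1] ?? null;
  return { title, description };
}

const result = extractGeneric(
  '<html><head><title>Hello</title><meta name="description" content="World"></head></html>'
);
console.log(result);
```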

PlaywrightCrawler.requestHandler: Error: mouse.move: Target page, context or browser has been closed

In the PlaywrightCrawler.requestHandler I am calling page.mouse.move, and sometimes I get this error: mouse.move: Target page, context or browser has been closed. Here is the sequence of calls: ```...
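The usual cause is that the page was torn down (handler timeout, retry, or navigation away) while the handler was still running. A hypothetical guard helper, demonstrated with a stub page object in place of a real Playwright `Page`:

```javascript
// Hypothetical helper for use inside a PlaywrightCrawler requestHandler;
// `page` is the Playwright Page from the crawling context.
async function safeMouseMove(page, x, y) {
  // The page can be closed mid-handler; isClosed() lets us skip the
  // mouse action instead of crashing with "Target page ... has been closed".
  if (page.isClosed()) return false;
  await page.mouse.move(x, y);
  return true;
}

// Quick check with a stub page that reports itself closed:
safeMouseMove({ isClosed: () => true }, 0, 0).then((moved) => console.log(moved)); // -> false
```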

Pagination from a start page to an end page

How can I implement pagination starting from, e.g., page 10 to page 20 or to the last page? Do I need to implement my own code for this, or does Crawlee provide something? I am able to see the last page number on the first page of the website I am scraping....
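Crawlee has no built-in "from page X to page Y" helper; the usual approach is to generate the page URLs yourself (after reading the last page number off the first page) and enqueue them all at once. A sketch, where the `?page=` parameter name is an assumption to adapt to the target site:

```javascript
// Build the URL for every page in [from, to], inclusive.
function pageUrls(baseUrl, from, to) {
  const urls = [];
  for (let p = from; p <= to; p++) {
    urls.push(`${baseUrl}?page=${p}`);
  }
  return urls;
}

const pages = pageUrls('https://example.com/products', 10, 20);
console.log(pages.length); // -> 11

// With a crawler you would then enqueue them, e.g.:
//   await crawler.addRequests(pages);
```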

Recover endless pagination items by clicking on showMore button

Hey guys! I'm trying to scrape all the products from a listing with endless pagination. In order to load the other items, I have to click a showMore button. I looked it up in the docs and tried several syntaxes, but I couldn't get it to work... ```js await page.$eval('main', (main) => { main.querySelector('.c-endless-paginator__button').click();
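One likely issue: a raw DOM `.click()` inside `$eval` dispatches an untrusted event that some frontends ignore, whereas Playwright's `page.click()` scrolls the element into view and performs a real click. A hypothetical click-until-gone helper, demonstrated with a stub in place of a real Playwright page (with a real page you would also wait for the new items after each click, e.g. with `waitForLoadState`):

```javascript
// Keep clicking the "show more" button until it is no longer visible;
// returns the number of clicks performed. `page` can be a Playwright Page.
async function clickUntilGone(page, selector) {
  let clicks = 0;
  while (await page.isVisible(selector)) {
    await page.click(selector);
    clicks += 1;
  }
  return clicks;
}

// Stub page that hides the button after two clicks, just to show the flow:
let remaining = 2;
const stub = {
  isVisible: async () => remaining > 0,
  click: async () => { remaining -= 1; },
};
clickUntilGone(stub, '.c-endless-paginator__button').then((n) => console.log(n)); // -> 2
```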

Which EC2 instance type is best suited for crawling?

Just tried a t3.small instance without much luck (running out of memory). Tried an r3.large, which looks better but seems to be weak on CPU. Any hints?

Can't input in Google

I am following this guide to learn Apify and Crawlee: https://developers.apify.com/academy/puppeteer-playwright/page/page-methods#screenshotting I am trying the code on that page; it should visit google.com, search, click the first result, and get the title. But when I run it, it just opens google.com and doesn't type anything into the input. Why is that?...
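A common cause is Google's cookie-consent dialog sitting on top of the page, so the typed text never reaches the search box; another is typing before the input exists. A hedged Puppeteer-style sketch, where the consent-button selector is an assumption that varies by region and changes often:

```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://google.com');

// Dismiss the cookie-consent dialog if it appears; until it is gone,
// typing goes nowhere. The selector here is an assumption.
const consent = await page.$('button[aria-label="Accept all"]');
if (consent) await consent.click();

// Wait for the search box before typing (Google has used both
// <input name="q"> and <textarea name="q">).
await page.waitForSelector('textarea[name="q"], input[name="q"]');
await page.type('textarea[name="q"], input[name="q"]', 'hello world');
await page.keyboard.press('Enter');
```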

How can I get my data scraped faster?

Good day guys! Thanks again for stopping by. I have an e-commerce scraper designed to extract over 200k records. So far, I have been testing some sellers' front pages with only about 7k records. For those small sellers, the average time to get the data is about 20 minutes using proxies or Apify. I am wondering if there is a way to get the data faster. Using Celery? A local server? Any help will be appreciated! Thanks...
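Throughput usually comes from raising concurrency rather than adding an external task queue like Celery: Crawlee already runs requests in parallel and autoscales within the limits you set. A configuration sketch (the numbers are illustrative and bounded in practice by memory, CPU, and proxy-pool size; a non-browser crawler like CheerioCrawler is also much faster than Playwright where the site doesn't need JS):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  // Crawlee autoscales between these bounds based on system load.
  minConcurrency: 10,
  maxConcurrency: 50,
  // Optional politeness/rate cap toward the target site.
  maxRequestsPerMinute: 300,
  requestHandler: async ({ $, request }) => {
    /* extract fields from the parsed page */
  },
});
```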

Crawlee not working(?) on a page with shadow dom

Hey, I've encountered a website using shadow DOM, where Crawlee isn't able to find elements (for a good reason): https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM Since there is no mention of shadow DOM in the docs, does anyone know what to look at to make it work?...
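Cheerio-based crawlers can't see shadow DOM at all, because the shadow tree only exists after the page's JavaScript runs. With PlaywrightCrawler it usually works out of the box, since Playwright's CSS selectors pierce open shadow roots automatically. A sketch (`my-widget` and `.price` are hypothetical selectors):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, pushData }) => {
    // Playwright's selectors pierce open shadow roots, so a plain
    // locator often finds shadow-DOM content directly:
    const price = await page.locator('.price').first().textContent();

    // Manual fallback: walk the shadowRoot inside the page context.
    const manual = await page.evaluate(() => {
      const host = document.querySelector('my-widget');
      return host?.shadowRoot?.querySelector('.price')?.textContent ?? null;
    });

    await pushData({ price, manual });
  },
});
```

Note that closed shadow roots (and shadow DOM inside cross-origin iframes) still can't be reached this way.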