Crawlee & Apify

This is the official developer community of Apify and Crawlee.


Scrape data from TikTok for research

Hey there! I am doing research on influencers' success on TikTok and how users interact with them on the platform. For that purpose, I want to scrape the comments and other data from ~120 video posts. I do not know how to proceed. Moreover, I do not know whether it is against TikTok's Terms and Conditions....
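One common way to handle a bounded research scrape like this is to call an existing Apify Actor through the apify-client package instead of writing a crawler from scratch. A rough sketch; the Actor ID and input fields below are placeholders, so check the real Actor's input schema (and TikTok's Terms of Service) before running anything:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder Actor ID and input fields; use a real TikTok scraper Actor
// from the Apify Store and its documented input schema.
const run = await client.actor('someuser/tiktok-comments-scraper').call({
    postURLs: ['https://www.tiktok.com/@creator/video/1234567890'],
    commentsPerPost: 100,
});

// The scraped comments end up in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Downloaded ${items.length} comments`);
```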

I want webhooks to send the input URL in the POST request

I want webhooks to trigger an API with a POST request when the job succeeds. I found that I can pass this information to the POST request: https://docs.apify.com/platform/integrations/webhooks/actions But I want to send the URLs that I scraped. For example, when I scrape the Bidan Instagram profile, I want to pass that information in the POST request....
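The default ACTOR.RUN.SUCCEEDED payload does not contain the scraped data itself, but its `resource` object includes the run's `defaultDatasetId`, so the endpoint receiving the webhook can fetch the items (and the URLs stored in them) from that dataset. A sketch with Express and apify-client; the `/apify-webhook` route and the `url` field on the dataset items are assumptions about your setup:

```
import express from 'express';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const app = express();

app.post('/apify-webhook', express.json(), async (req, res) => {
    // `resource` is the finished run object from the webhook payload.
    const { defaultDatasetId } = req.body.resource;
    const { items } = await client.dataset(defaultDatasetId).listItems();
    const urls = items.map((item) => item.url); // assumes the Actor pushes a `url` field
    console.log('Scraped URLs:', urls);
    res.sendStatus(200);
});

app.listen(3000);
```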

Is there a way to initiate crawlee crawl + scraping jobs from a server?

Context: - I'm currently using Playwright in my Next.js API routes and persist some data in my database (Postgres) - since I need IP rotation with session management though, I'd love to offload the scraping to Crawlee - I'm also considering Apify as the platform to deploy this Crawlee scraper to (as that seems to be the recommended setup?) ...
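If the scraper is deployed as an Actor on Apify, a server (for example a Next.js API route) can kick off runs through apify-client instead of running Playwright in-process. A sketch; the Actor ID and input shape are placeholders for whatever your deployed scraper expects:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start an Actor run without waiting for it to finish; use .call() instead
// of .start() if you want to block until the run completes.
export async function startScrape(startUrls) {
    const run = await client.actor('your-username/my-crawlee-scraper').start({
        startUrls: startUrls.map((url) => ({ url })),
    });
    return run.id; // poll the run status or attach a webhook for completion
}
```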

application/octet-stream in cheerio

I'm trying to scrape a second page in an otherwise working scraper, but this page returns the response as "application/octet-stream". Is there something I could do to fix this, or should I swap to Puppeteer/Playwright? It looks much the same, since the page is fully static. Here is the error message:
ERROR CheerioCrawler: Request failed and reached maximum retries. Error: Resource http://127.0.0.1/website.web/part_to_scrape served Content-Type application/octet-stream, but only text/html, text/xml, application/xhtml+xml, application/xml, application/json are allowed. Skipping resource.
...
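CheerioCrawler skips responses whose Content-Type is not in its allow-list, which is exactly what the error above reports. One option, assuming the payload really is static HTML served with the wrong header, is to whitelist the extra type via `additionalMimeTypes`, roughly like this:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Accept the extra Content-Type instead of skipping the resource.
    additionalMimeTypes: ['application/octet-stream'],
    async requestHandler({ request, body }) {
        // For non-HTML types the raw payload is in `body`; parse it as needed.
        console.log(request.url, body.length);
    },
});

await crawler.run(['http://127.0.0.1/website.web/part_to_scrape']);
```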

Cheerio memory error

Hello, I have deployed a CheerioCrawler on AWS; the machine has 2 vCPUs and 4 GB of RAM, but I get the following error:
WARN CheerioCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 1174 MB of 750 MB (157%). Consider increasing available memory.
What could it be?...
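By default the Snapshotter only budgets a fraction of the machine's memory (the 750 MB ceiling in the log), so on a dedicated 4 GB box you can raise the limit explicitly. A sketch; the same value can be supplied via the CRAWLEE_MEMORY_MBYTES environment variable, and 3072 MB is just an example figure:

```
import { CheerioCrawler, Configuration } from 'crawlee';

// Tell the autoscaled pool how much memory it is allowed to use
// (equivalent to setting CRAWLEE_MEMORY_MBYTES=3072 in the environment).
Configuration.getGlobalConfig().set('memoryMbytes', 3072);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // ... scraping logic ...
    },
});
```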

mixed headful and headless in a PlaywrightCrawler

I want to check the content of some requests in headful mode, approve it, and then let the crawler scrape it in headless mode. I've tried @crawlee/browser-pool, but it doesn't seem to have an autoscaledPool....
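A single PlaywrightCrawler (and its browser pool) runs with one headless setting, so one workaround is to run two crawlers: a headful one for the pages you want to inspect and approve, and a headless one for the actual scraping. A rough sketch of that split; the approval logic itself is up to you:

```
import { PlaywrightCrawler } from 'crawlee';

// Visible browser for manual inspection/approval of selected requests.
const reviewCrawler = new PlaywrightCrawler({
    headless: false,
    async requestHandler({ page, request }) {
        // inspect the page, then hand approved URLs to the headless crawler
    },
});

// Headless crawler that does the bulk scraping (with its own autoscaled pool).
const scrapeCrawler = new PlaywrightCrawler({
    headless: true,
    async requestHandler({ page, request }) {
        // scraping logic
    },
});

await reviewCrawler.run(['https://example.com/to-review']);
await scrapeCrawler.run(['https://example.com/approved']);
```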

I found that Apify took about $45 from my credit for nothing

Two days ago, I had consumed just $5 using the Apify APIs; now I find that the account usage has reached $50. I don't know why, as I didn't use it. There is a 1 GB video in storage; would storage charge me $45 in two days? Can you help me understand the details of what happened?...

Facebook events by page

Is it possible to pass a URL for a Facebook page and scrape event information for all of the events on the page? The URL would be of the form "https://www.facebook.com/thepiperstavern/events/". Passing this URL to the Facebook Events scraper logs an error.

Firefox Error in PlaywrightCrawler

We are receiving an intermittent error using Firefox with PlaywrightCrawler (an example is run w9I8udSOta4b0kEw8). The error is:
2023-06-10T19:44:32.733Z /home/myuser/node_modules/playwright-core/lib/utils/index.js:100
2023-06-10T19:44:32.735Z if (!value) throw new Error(message || 'Assertion error');
2023-06-10T19:44:32.737Z ^...

Is it possible to add request in middle of queue?

Hey, I am trying to use addRequest to add a new request in the middle of the queue, but it always adds that request at the end. I know that's the expected behaviour of a queue, but is there a way around it?
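There is no API for inserting at an arbitrary position, but `forefront: true` pushes a request to the front of the queue instead of the back, which is usually the behaviour people are after. A minimal sketch with a placeholder URL:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) { /* ... */ },
});

// forefront: true puts the request at the front of the queue instead of the end.
await crawler.addRequests(
    [{ url: 'https://example.com/priority-page' }],
    { forefront: true },
);

await crawler.run(['https://example.com/start']);
```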

CheerioCrawler Timeout after 320 Seconds Error/Exception

In some of our CheerioCrawler Actors, we keep getting random timeout errors after 320 seconds that cause them to crash. This is an example of the error:
2023-06-08T07:28:54.464Z ERROR CheerioCrawler: An exception occurred during handling of failed request. This places the crawler and its underlying storages into an unknown state and crawling will be terminated. This may have happened due to an internal error of Apify's API or due to a misconfigured crawler.
...
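The internal-state error itself usually needs Apify support to dig into, but if the 320-second timeouts come from slow pages or long-running handlers, giving the crawler more headroom is one thing to try. A sketch with assumed values:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Allow slow responses and handlers more time before they count as failed.
    navigationTimeoutSecs: 120,
    requestHandlerTimeoutSecs: 180,
    maxRequestRetries: 5,
    async requestHandler({ request, $ }) {
        // ... scraping logic ...
    },
});
```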

Can't mark request as failed

Hello, I am trying to mark a request as failed using Crawlee + Playwright. I have tried multiple things, from throwing exceptions to using the request.pushErrorMessage() method. session.retire() works for marking the session as bad, but not for marking the request as failed: ...
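One pattern that works in Crawlee is to set `request.noRetry = true` and then throw: the request is not retried and is handed to `failedRequestHandler` as a failure. A minimal sketch alongside the session retirement; the detection logic is a placeholder:

```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, session, page }) {
        const looksBlocked = false; // placeholder: your own detection logic
        if (looksBlocked) {
            session.retire();        // mark the session as bad
            request.noRetry = true;  // skip further retries of this request
            throw new Error('Blocked page, marking request as failed');
        }
    },
    async failedRequestHandler({ request }) {
        console.log(`Failed: ${request.url}`);
    },
});
```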

KeyValueStore file extensions

Hi, how do I configure the key-value store to use the .mhtml file extension? Using the code below, it always seems to be set to the .bin extension ```...
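Locally, the extension of the stored file is derived from the `contentType` passed to `setValue()` (for example, `application/octet-stream` maps to `.bin`). A sketch; whether a given content type maps exactly to `.mhtml` depends on the MIME table Crawlee uses, so treat the type below as an assumption to verify:

```
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

const mhtmlSnapshot = '<captured MHTML payload>'; // placeholder for your snapshot

// The stored file's extension follows the contentType given here; without an
// extension mapping, the value falls back to a generic `.bin` file.
await store.setValue('page-snapshot', mhtmlSnapshot, {
    contentType: 'multipart/related', // MIME type commonly used for MHTML
});
```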

Pause concurrent requests?

Hello, I have the following issue: on the website I'm scraping, I need to log in every 100-150 items. The problem is that with more than one concurrent request, by the time the login is needed there are already requests in progress, and those will go wrong. I extract a marker that tells me when I need to log in again. I want to run with more than one concurrent request, stop everything when that marker is found, do the login, and then resume....
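The crawler's AutoscaledPool has `pause()` and `resume()` methods that stop new requests from being picked up, which fits this login-marker flow. A rough sketch; `pause()` resolves only after running tasks finish, so it should not be awaited from inside the handler that triggered it, and `needsLogin()`/`doLogin()` are placeholders for your own logic:

```
import { CheerioCrawler } from 'crawlee';

// Placeholders for your own logic:
const needsLogin = ($) => $('#login-marker').length > 0;
const doLogin = async () => { /* perform the login flow */ };

const crawler = new CheerioCrawler({
    maxConcurrency: 5,
    async requestHandler({ request, $, crawler }) {
        if (needsLogin($)) {
            // pause() resolves once in-flight requests (including this one) finish,
            // so don't await it here; log in and resume once it settles.
            crawler.autoscaledPool.pause().then(async () => {
                await doLogin();
                crawler.autoscaledPool.resume();
            });
        }
        // ... normal item scraping ...
    },
});
```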

web scraper: create 1 file instead of multiple outputs for pagination

I used this article https://docs.apify.com/academy/advanced-web-scraping/scraping-paginated-sites#define-and-enqueue-pivot-ranges to scrape data from multiple pages. When I run apify run, I get 20 different JSON files. How can I combine all the data into 1 JSON file for all the pages?
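`apify run` writes each pushed dataset item as its own JSON file under `storage/datasets/default`; to get a single file, read the dataset back and write it out yourself, or export it as one record into the key-value store. A sketch of both options:

```
import fs from 'node:fs/promises';
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Option 1: collect all items and write a single combined JSON file.
const { items } = await dataset.getData();
await fs.writeFile('combined.json', JSON.stringify(items, null, 2));

// Option 2: export the whole dataset as one JSON record in the key-value store.
await dataset.exportToJSON('OUTPUT');
```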

Error when running in Docker Container

I'm deploying a Crawlee (Cheerio) project in an amazonlinux:2023-based Docker container. I get the following error: ```
node src/main.js
...

conducting faster scrapes with pagination and individual product scraping

Hey, I was curious: when I'm scraping Amazon, what's a reasonable time frame for the scraping duration, considering that I scrape each product link from the results page, then scrape each individual product page for the information, and also paginate through each results page until there are no more pages left? I did previously just scrape product info straight off the product cards on the results page, but it would sometimes give dummy links that led to an unrelated Amazon page, and the product info would be more inaccurate. How can I increase the speed of my scrapes, especially considering I want to add more and more scrapers in the future that should all run concurrently to save time? I'm aiming for quite a low scrape time of 10-15 seconds or lower, and it's currently taking upwards of 1 minute. This is a Cheerio crawler...
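With a CheerioCrawler the biggest lever is usually concurrency: detail pages can be fetched in parallel while the autoscaled pool still backs off if the machine gets overloaded. A sketch with example values; tune them against how much the target site tolerates before blocking:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Fetch many product pages in parallel; the autoscaled pool still
    // scales down if CPU or memory become overloaded.
    minConcurrency: 10,
    maxConcurrency: 50,
    maxRequestsPerMinute: 300, // optional throttle to reduce the chance of blocking
    async requestHandler({ request, $, enqueueLinks }) {
        // enqueue product links + the next results page, scrape detail pages here
    },
});
```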

Crawlee + Proxy = Blocked, My laptop + Proxy = unblocked

I have a weird situation: whenever I try to access a website via Crawlee with a proxy, the request is blocked, but with the same proxy I can access the website without any problem on my own system, with many other browsers, and also in incognito mode. It's really puzzling me. Any help would be highly appreciated. Thank you....
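When the same proxy works from a real browser but not from the crawler, the difference is usually in headers, cookies, and session consistency rather than the IP itself. A sketch of wiring the proxy through Crawlee's ProxyConfiguration with a session pool and persistent cookies; the proxy URL is a placeholder:

```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'], // placeholder proxy URL
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,           // ties retries and cookies to rotating sessions
    persistCookiesPerSession: true, // keeps cookies consistent per session/IP
    async requestHandler({ request, $ }) {
        // ... scraping logic ...
    },
});
```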

double import problem

I have my crawler set up to crawl a couple of sites and scrape them, but I get this import problem when importing the router (which is the same for both sites but uses a different route per site) from both of the sites. If I only import it from one site, it only runs that one site. How do I import it so it runs multiple sites, and so it can scale up to more sites in the near future? It can successfully scrape Amazon and eBay (the eBay tags are kind of inaccurate), but only if I use the router from eBay or Amazon and remove the other URL from startUrls; otherwise it gives an error for not having the AMAZON label or EBAY label anywhere...
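Rather than importing a separate router per site, the usual Crawlee pattern is one shared router with a handler per label, and labels assigned on the start URLs (and on `enqueueLinks` calls). A sketch of that layout; the file names and start URLs are examples:

```
// routes.js — one router shared by every site, one handler per label
import { createCheerioRouter } from 'crawlee';

export const router = createCheerioRouter();

router.addHandler('AMAZON', async ({ request, $ }) => {
    // Amazon-specific scraping
});

router.addHandler('EBAY', async ({ request, $ }) => {
    // eBay-specific scraping
});

// main.js — import the single router and label each start URL
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new CheerioCrawler({ requestHandler: router });

await crawler.run([
    { url: 'https://www.amazon.com/s?k=laptops', label: 'AMAZON' },
    { url: 'https://www.ebay.com/sch/i.html?_nkw=laptops', label: 'EBAY' },
]);
```

Adding another site later is then just another `addHandler('NEWSITE', ...)` in routes.js plus another labelled start URL, with no extra imports.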