Crawlee & Apify

This is the official developer community of Apify and Crawlee.

crawlee-js

apify-platform

crawlee-python

💻hire-freelancers

🚀actor-promotion

💫feature-request

💻devs-and-apify

🗣general-chat

🎁giveaways

programming-memes

🌐apify-announcements

🕷crawlee-announcements

👥community

Scrape private website?

Hello friends! I'm new to Apify and pretty excited about what I've learned so far. One use case I'm not sure about: can Apify be used to scrape a website that's not on the public internet? Specifically, I want to scrape knowledge bases inside corporations (with their permission). Is there, for example, some sort of proxy that could be put in place inside the private network, connecting with Apify and scraping at Apify's direction? Or something else?

Difficulties in finding my data for my project

Hello, I'm having difficulty collecting tweets in Arabic. I am a linguist, not a software engineer. I would be so grateful if anyone could help me.

Cheerio Crawler Output

I want to know whether there is a way to preview the result of CheerioCrawler. For example, in PuppeteerCrawler we can set headless to false so that we can actually see whether the page is loading and our logic is working. Is there any way to check that with CheerioCrawler? Thanks...
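
Since there is no browser window to watch with CheerioCrawler, one hedged way to check that the page loaded and the logic works is to log or save the raw HTML from inside the request handler. A minimal sketch (the key-value store key and start URL are assumptions):

import { CheerioCrawler, KeyValueStore } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, body }) {
        // Log a quick sanity check that the page actually parsed.
        console.log(request.url, '->', $('title').text());
        // Save the raw HTML so it can be opened in a browser afterwards.
        await KeyValueStore.setValue('last-page', body.toString(), { contentType: 'text/html' });
    },
});

await crawler.run(['https://example.com']);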

Failed to Launch Browser error on the latest version of apify/actor-node-playwright-chrome image

Hello guys, I have a server where I handle all my scraping. Apify pushed a new update today for their Docker image, and after building a new image on top of the latest version, the browser fails to launch. What can I do to fix this issue?...
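
Not a confirmed fix, but a common way to avoid breakage from upstream image updates is to pin the Docker tag instead of tracking the latest version, so rebuilds stay reproducible. A sketch (the tag below is an example, not necessarily the one you need):

# Pin a specific image version so that a new upstream release
# cannot silently break the browser launch (tag is an example).
FROM apify/actor-node-playwright-chrome:20
# ... rest of the Dockerfile unchanged ...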

Purging storage using npx crawlee run does not work

I am trying to develop a Crawlee scraper locally, and in that regard I need to easily purge all data from the default and named datasets, as well as the request queues, to test my changes. However, it does not purge storage. I intend to use the Crawlee code in an Apify Actor. Do you have any suggestions as to what might be the issue?...
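
For what it's worth, Crawlee only purges the default storages on start; named datasets and request queues persist by design and have to be dropped explicitly. A minimal sketch (the dataset name is an assumption):

import { Dataset, purgeDefaultStorages } from 'crawlee';

// Clears the default dataset, key-value store, and request queue.
await purgeDefaultStorages();

// Named storages are not covered by the purge; drop them individually.
const named = await Dataset.open('my-named-dataset');
await named.drop();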

Using Apify with unstructured.io

Hi, I've experimented with Unstructured for RAG on a PDF, and I'm hoping Apify would be a good fit for scraping web data and then processing the result with Unstructured to chunk the data for RAG. But I didn't find any mention of Unstructured here, and I wonder whether these tools play well together or whether I should take another approach. Thanks for any pointers!

Add label to pages via `crawler.addRequests()`?

I am adding a page as the initial crawl target, but would like to add a label to ensure it routes to the correct processor. Is there a way to do this?
await crawler.addRequests([
    "https://www.foo.bar/page",
])
...
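
A hedged sketch of what seems to work: addRequests() also accepts request objects rather than bare URL strings, so a label (a shortcut for userData.label) can be attached directly. The URL and label name here are assumptions:

// Assumes a crawler instance and a router handler registered
// for the 'DETAIL' label elsewhere in the project.
await crawler.addRequests([
    { url: 'https://www.foo.bar/page', label: 'DETAIL' },
]);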

Source Code

When I published my Actor on the Apify Store, the Source code tab was also visible. It should be hidden, as I checked on other Actors. Is that because I didn't set up billing details before publishing the Actor, or is there some other reason?

How to delay

When scraping a particular news site, I would like to include a delay to prevent the site from becoming too busy. Is such an option already provided? If not, please let me know where in the code I should sleep.
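
Crawlee does ship a global throttle, and an explicit sleep helper is also available. A minimal sketch combining both (the numbers and start URL are assumptions):

import { CheerioCrawler, sleep } from 'crawlee';

const crawler = new CheerioCrawler({
    // Built-in option: caps the crawl rate across all requests.
    maxRequestsPerMinute: 20,
    async requestHandler({ request, $ }) {
        // ...process the page...
        // Or pause explicitly between pages (2 seconds is an assumption).
        await sleep(2000);
    },
});

await crawler.run(['https://example-news-site.com']);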

error handling

Can we somehow throw errors that close the page and do not retry the request?...
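
Crawlee exports a NonRetryableError for exactly this case: throwing it fails the request immediately without further retries. A minimal sketch (the blocked-page check is an assumption):

import { CheerioCrawler, NonRetryableError } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        if ($('title').text().includes('Access denied')) {
            // Marks the request as failed; Crawlee will not retry it.
            throw new NonRetryableError('Blocked, skipping this URL');
        }
    },
});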

Rate limit based on key

Is it possible to rate limit based on a key? So basically it would only process 1 URL at a time per key.
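
As far as I know this is not built into Crawlee, but a small per-key lock can approximate it: tasks sharing a key run one at a time, while different keys proceed in parallel. A generic sketch (the choice of key is an assumption):

// Chain each task onto the previous promise for its key.
const locks = new Map();

async function withKeyLock(key, task) {
    const prev = locks.get(key) ?? Promise.resolve();
    const next = prev.catch(() => {}).then(task);
    locks.set(key, next);
    return next;
}

// Hypothetical usage inside a requestHandler, keyed by hostname:
// await withKeyLock(new URL(request.url).hostname, () => processPage(request));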

playwright and proxy problem

When I try to access a page through a proxy with Playwright, I get a captcha. Without a proxy it works with no problem. What is weird is that if I use the same proxy in a regular browser via the SwitchyOmega extension, the page also loads without a problem. So I think the page somehow detects that an automated browser is using the proxy. Has anyone encountered this problem?...

infinite scrolling of pages

I have a crawler that goes through the collection pages of stores, scrapes their product links, and then visits those product pages to get the product data...
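
For the infinite-scroll part, Crawlee bundles a helper that keeps scrolling until no new content appears. A minimal sketch (the selector, label, and timeout are assumptions):

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, enqueueLinks }) {
        // Scroll until the page stops loading new items.
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });
        // Then enqueue the product links revealed by the scrolling.
        await enqueueLinks({ selector: 'a.product-link', label: 'DETAIL' });
    },
});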

How to enqueue links until a certain "depth"

I want to crawl only down to a certain depth of the website. Is there an option on enqueueLinks, or somewhere else, for this?
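
There is no depth option on enqueueLinks itself as far as I know, but depth can be carried in userData and checked before enqueueing more links. A minimal sketch (the limit of 3 is an assumption):

import { CheerioCrawler } from 'crawlee';

const MAX_DEPTH = 3; // assumption: crawl three levels deep

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        const depth = request.userData.depth ?? 0;
        if (depth < MAX_DEPTH) {
            // Children inherit depth + 1 via userData.
            await enqueueLinks({ userData: { depth: depth + 1 } });
        }
    },
});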

Array output not displaying in table

I am trying to display an array field in the table view, but it shows undefined. I have specified it in dataset_schema:

"data.links": { "label": "Links", "format": "array" }

...

Help me!

I am scraping a website. It is similar to a job site and has 3066 job posts. However, since it cannot load them all at once, it loads 20 more at a time using data-page="10" in the HTML; I can scroll down to keep receiving data. With my script I can only read those first 20. How can I handle this problem? print(len(searchItems)) should be 3066, but it stays at 20 because the rest has not been loaded....
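
If those extra batches come from a paged endpoint, which the data-page attribute suggests but does not prove, one approach is to enqueue every page directly instead of scrolling. The URL pattern below is hypothetical:

// Hypothetical paged endpoint inferred from the data-page attribute;
// assumes a crawler instance is already set up.
const PAGE_SIZE = 20;
const TOTAL_POSTS = 3066;
const pages = Math.ceil(TOTAL_POSTS / PAGE_SIZE);

const requests = [];
for (let page = 1; page <= pages; page++) {
    requests.push({ url: `https://example-jobsite.com/jobs?page=${page}`, label: 'LIST' });
}
await crawler.addRequests(requests);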

BasicCrawler request information

Hi, when using BasicCrawler, how do you get info about the hostname, IP address, user agent, and AS number for the request? I need to provide them to our client so they can give us access for scraping.
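
The user agent is whatever you configure, but the egress IP has to be observed from outside. One hedged sketch is to call an IP echo service through the same crawler (the service URL is an assumption; the AS number would need a separate lookup):

import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    async requestHandler({ request, sendRequest }) {
        // sendRequest goes through the crawler's session/proxy setup,
        // so the echoed address is the IP the target site would see.
        const { body } = await sendRequest({ url: 'https://api.ipify.org?format=json' });
        console.log(`Egress IP while handling ${request.url}: ${body}`);
    },
});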

Unable to crawl

Hi, I'm trying to crawl a few pages with this script: Now, I don't know if I'm doing something wrong, but I'm getting timeouts and sometimes not crawling at all. The script will run, but not log any PDF or link; it just says it crawled the two pages, and that's it. Someone, please help....

Crawlee Not scraping when provided with the same link twice

I'm using a REST API to access my scraper, which is provided with a URL; it scrapes the URL and returns the information. When the server is first started and the request is made, it works correctly: it scrapes the data and responds with the results. But when another request is made with the same URL as the parameter, it doesn't scrape the URL and instead just says All requests from the queue have been processed, the crawler will shut down, and shuts down. It needs to re-scrape the data and return it, as my app needs that to work. I'm using the PlaywrightCrawler; the code is attached below. I think it has something to do with the request queue, but I'm not sure. Can anyone help? Thanks!...
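
One likely culprit: the request queue deduplicates by uniqueKey, which defaults to the normalized URL, so a second request for the same URL is silently skipped. Overriding uniqueKey is a hedged way to force a re-scrape (requestedUrl is hypothetical; the crawler still has to be run for each API call):

import { randomUUID } from 'node:crypto';

// A random uniqueKey makes every enqueue count as a fresh request
// instead of being dropped as a duplicate of the earlier URL.
await crawler.run([
    { url: requestedUrl, uniqueKey: randomUUID() },
]);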

How to push to same dataset from 2 different urls?

I have a site I'm scraping, but I'm facing a problem: there is information about the same thing on two different pages of the website, and I want to store that information in the same JSON dataset....
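
A hedged pattern for this: carry the partial data forward in userData when enqueueing the second page, and push a single merged item at the end. The labels, selectors, and URL derivation below are assumptions:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        if (request.label === 'DETAILS') {
            // First page: scrape part of the record, then enqueue the
            // second page with the partial data attached.
            const name = $('h1').text().trim();
            await crawler.addRequests([{
                url: `${request.url}/specs`,
                label: 'SPECS',
                userData: { name },
            }]);
        } else if (request.label === 'SPECS') {
            // Second page: merge both parts into one dataset item.
            const specs = $('.specs').text().trim();
            await Dataset.pushData({ ...request.userData, specs });
        }
    },
});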