Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Href inside of a data-href attribute

So, as the title says - I have a specific case where the link to the next page is not inside the href attribute of an anchor tag but inside a (custom?) data-href attribute of a button element. Is there a way to enqueue this URL with the selector parameter of enqueueLinks, or is the only way to pass it to the urls array of enqueueLinks?
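
As far as I know, enqueueLinks only extracts href attributes, so the selector option alone won't pick these up. A minimal sketch of one way around it (assuming a PlaywrightCrawler request handler; the button selector is taken from the question): read the data-href values yourself and feed them through the urls option.
```
// Inside the requestHandler, with { page, enqueueLinks } destructured from the context.
const urls = await page.$$eval('button[data-href]', (buttons) =>
    buttons.map((b) => b.getAttribute('data-href')).filter((u): u is string => u !== null),
);
await enqueueLinks({ urls }); // the urls option bypasses href extraction entirely
```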

ENOSPC: no space left on device, mkdtemp '/tmp/puppeteer_dev_chrome_profile-*

I have two Puppeteer scripts running on Fedora 37. They've filled my /tmp with these files, and the scripts obviously crashed. They do run with a lot of concurrency; both run at a median of 40-50 instances. How can I avoid this?
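
Those profile folders are temporary Chrome user-data directories that leak whenever a browser crashes or the process is killed before cleanup. Besides lowering concurrency, a minimal cleanup sketch you could run before the scripts start (not while browsers are live), assuming the default /tmp location from the error:
```
import { readdirSync, rmSync } from 'node:fs';
import { join } from 'node:path';

// Remove leftover Puppeteer Chrome profile dirs that were never cleaned up.
for (const entry of readdirSync('/tmp')) {
    if (entry.startsWith('puppeteer_dev_chrome_profile-')) {
        rmSync(join('/tmp', entry), { recursive: true, force: true });
    }
}
```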

Help for an Instagram data collection

Dear all, I’m trying to download Instagram posts with Apify using the “Instagram Post Scraper” (https://apify.com/apify/instagram-post-scraper). I wrote a Python script to automate the picture collection. Unfortunately, many of the picture URLs expire well before the 7 days of data retention. Could it be failing because I tried to download more posts than the free trial allows, or because my storage is overflowing? My second question is: if I pay for an upgraded subscription, can I be sure that my storage will grow with my data size? I don’t want to pay for a subscription only to have the downloaded URLs expire too soon anyway. Have a great day, Tiago...
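
The image URLs are signed Instagram CDN links that expire on Instagram's side regardless of Apify's 7-day retention, so upgrading storage won't keep them alive. One workaround is to download the binaries immediately after each run. A hypothetical sketch using the apify-client package (the displayUrl field name is an assumption; check your dataset's actual schema):
```
import { writeFile } from 'node:fs/promises';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { items } = await client.dataset('YOUR_DATASET_ID').listItems();

for (const [i, item] of items.entries()) {
    // Download right away; the signed CDN URL is short-lived.
    const res = await fetch(item.displayUrl as string);
    await writeFile(`post-${i}.jpg`, Buffer.from(await res.arrayBuffer()));
}
```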

Concurrency: How to use multiple proxies / session pool IDs?

Hi, I'm using the proxy configuration with 100 proxies. The goal is to let the scraper run with, say, 4 sessions concurrently, each using a different proxy. In each run, I see it pick one session ID (one proxy) and run through all requests with that same one....
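
A sketch of pinning the pool to a few sessions that each retire after a fixed number of uses, so several proxies stay active at once (the option names are real Crawlee options; the numbers are illustrative):
```
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy1.example.com:8000' /* ...the other 99... */],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 4, // keep ~4 live sessions, each bound to its own proxy
        sessionOptions: { maxUsageCount: 50 }, // retire a session (and its proxy) after 50 uses
    },
    minConcurrency: 4, // actually run 4 requests in parallel
    async requestHandler({ request, session, log }) {
        log.info(`${request.url} via session ${session?.id}`);
    },
});
```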

I have 99 URLs in the queue, but the scraper finishes the crawl after a few URLs. Why?

The scraper finishes the crawl after just a few URLs every time. I have 99 URLs added to the queue. This is my config:...
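
Hard to say without seeing the config, but a frequent culprit (an assumption, not a diagnosis) is maxRequestsPerCrawl, which many templates set to a small number and which silently ends the crawl once reached:
```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // If this is present and lower than your queue size, the crawl stops early.
    // Raise it above 99 or drop the option entirely.
    maxRequestsPerCrawl: 1000,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```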

Downloading an image using puppeteer example

Hi, I found this simple example using Puppeteer that downloads images as you visit a page, and I'm wondering how I can incorporate it into my Crawlee scraper.
```
this.page.on('response', async (response) => {
    const matches = /.*\.(jpg|png|svg|gif)$/.exec(response.url());
...
```
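
One way to wire that listener into Crawlee (a sketch, assuming PuppeteerCrawler) is a preNavigationHook, which runs before page.goto(), so the response events fire during navigation; the key naming here is illustrative:
```
import { PuppeteerCrawler, KeyValueStore } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            page.on('response', async (response) => {
                if (!/\.(jpg|png|svg|gif)$/.test(response.url())) return;
                const buffer = await response.buffer();
                const store = await KeyValueStore.open();
                // Illustrative key; derive something stable from response.url() in practice.
                await store.setValue(`img-${Date.now()}`, buffer, { contentType: 'image/jpeg' });
            });
        },
    ],
    async requestHandler() {
        // Images are captured by the hook above while the page loads.
    },
});
```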

socks5 password-protected proxies

Dear all, how can I use SOCKS5 proxies with Crawlee? Also, in general, if a proxy is password-protected, how do I put that into proxyUrl? I didn't find any example of using password-protected proxies, and SOCKS5 proxies are not supported by default. Any way to get around this? ...
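
For password-protected HTTP(S) proxies, the credentials go straight into the proxy URL. Whether socks5:// URLs work depends on your Crawlee version and the underlying HTTP client, so treat that line as an assumption to verify:
```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        // user:pass embedded in the URL; URL-encode any special characters
        'http://myuser:myp%40ss@proxy.example.com:8000',
        // 'socks5://myuser:mypass@proxy.example.com:1080', // verify support in your version
    ],
});
```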

How do I log the fingerprint that's generated for the current browser?

I've looked all over the documentation, and it doesn't explain how to log/show the fingerprint generated for the current session. The website I'm trying to scrape is setting cookies. I'm using a premium residential proxy from a different country and have already set the launchOptions geolocation for that country, but the website still detects my timezone as my home country's and sets it as a cookie, which is probably how it identifies the request as a bot. I'd like to check exactly what the fingerprint generator is setting so I can debug this. Does anyone know how?...
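
I'm not aware of a documented switch that dumps the generated fingerprint, but you can log what the browser actually reports to the site, including the timezone, by evaluating in the page. A minimal sketch for a Playwright or Puppeteer request handler:
```
// Inside the requestHandler, with { page, log } from the crawling context.
const reported = await page.evaluate(() => ({
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    platform: navigator.platform,
    // This is the value the site's cookie logic would see.
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
}));
log.info(JSON.stringify(reported, null, 2));
```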

playwright response is missing status code.

This is the code, but the status is always empty:
```
crawler.router.use(async ({ request, response, page, enqueueLinks, log, proxyInfo, session, parseWithCheerio }) => {
    log.info("middleware fired")
...
```
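
One thing worth ruling out (an assumption about the cause, not a confirmed bug): on Playwright's Response, status() is a method, so logging response.status without the call prints nothing useful. A sketch:
```
crawler.router.use(async ({ request, response, log }) => {
    log.info('middleware fired');
    // status() is a method; response can also be undefined if navigation
    // failed, hence the optional chaining.
    log.info(`Status for ${request.url}: ${response?.status()}`);
});
```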

Replicate XHR requests to wait for cheerio page to load further

Dear all, after trying out browser-based data extraction, I need to enhance the crawler and make it lightweight. But CheerioCrawler doesn't work for sites behind Cloudflare security, because it just grabs the very first HTML response (which comes from CF) and that's it. Is there any way to wait for the real page to finish loading? Also, any suggestions for finding the actual endpoint that loads the data? I'm using dev tools, but it looks like the JS and the data are quite strongly coupled. Any help would be highly appreciated. ...
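
CheerioCrawler never executes JavaScript, so there is nothing for it to "wait" for; the usual approach is to locate the XHR/JSON endpoint in the Network tab (filter by Fetch/XHR) and crawl that directly. A sketch with a hypothetical endpoint (if Cloudflare challenges even the XHR, you may still need a browser crawler or better proxies):
```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    additionalMimeTypes: ['application/json'], // accept JSON responses
    async requestHandler({ request, json }) {
        // json holds the parsed body when the response is application/json.
        console.log(request.url, json);
    },
});

await crawler.run(['https://example.com/api/items?page=1']); // hypothetical endpoint
```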

New to Crawlee and after reading the docs, I'm not sure how to use it to crawl links in a website

So I'm quite new to Crawlee and I'm not sure how it really works 😦 I've read the docs and checked some examples but couldn't find anything really useful. I have a case where I need to log in to a website and then go to a page with a list of links I'd like to crawl; within each of those pages there are more links to crawl, and finally, within each of those pages, I'd like to perform some actions. One of them is getting the URL of a video and downloading the video to Google Drive. I've read about enqueueLinks and RequestQueue but I really don't know how they work. I've checked the example on the home page, but that's not really what I want. I'd like to log in, then go to a page like https://www.my-site.com/categories and from there grab all links that match the glob...
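
A rough shape for that flow, as a sketch with hypothetical selectors and globs (assuming PlaywrightCrawler): label the requests, log in first, then enqueue the category links, then the detail pages:
```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    persistCookiesPerSession: true, // keep the login cookies for later requests
    async requestHandler({ request, page, enqueueLinks }) {
        if (request.label === 'LOGIN') {
            await page.fill('#email', process.env.SITE_USER!); // hypothetical selectors
            await page.fill('#password', process.env.SITE_PASS!);
            await page.click('button[type="submit"]');
            await enqueueLinks({ urls: ['https://www.my-site.com/categories'], label: 'LIST' });
        } else if (request.label === 'LIST') {
            // Hypothetical glob for the links you want from the categories page.
            await enqueueLinks({ globs: ['https://www.my-site.com/videos/**'], label: 'DETAIL' });
        } else {
            // DETAIL: grab the video URL and hand it to your Google Drive upload.
            const videoUrl = await page.getAttribute('video > source', 'src');
            console.log(request.url, videoUrl);
        }
    },
});

await crawler.run([{ url: 'https://www.my-site.com/login', label: 'LOGIN' }]);
```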

Passing user data to the crawler?

Hello, I am trying to find the best way to handle the output of my scraper with Datasets. I have a main request handler dispatching to sub-handlers based on labels, and I would like a Dataset for each label/sub-handler, with data following a specific format (basically a database table). I could open and close named Datasets every time I process a request, but since (as far as I understand) Datasets are stored on disk, that seems quite wasteful in terms of disk I/O. Is there a way to pass my datasets to the crawler so that any request can access them?...
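
Dataset.open() caches instances by name, so repeated opens are cheap and there is no close step; you can also just open them once and capture them in the handler's closure. A sketch:
```
import { CheerioCrawler, Dataset } from 'crawlee';

// Opened once up front; Crawlee caches these, so this is not repeated disk I/O.
const products = await Dataset.open('products');
const reviews = await Dataset.open('reviews');

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        if (request.label === 'PRODUCT') {
            await products.pushData({ url: request.loadedUrl, title: $('h1').text() });
        } else if (request.label === 'REVIEW') {
            await reviews.pushData({ url: request.loadedUrl, body: $('.review').text() });
        }
    },
});
```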

Crawl using the same tab and session

Hey guys, I'm using Crawlee to crawl a site, but I need it to visit consecutive pages in the same browser tab, using the same session. Right now it opens a new tab for each request and generates a new session for it, then closes that tab and uses a new tab and a new session for the next request. That's terrible for anti-bot detection, since the retiring session takes the cookies it received from the first request with it. Any ideas what I'm doing wrong?...
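
Crawlee still opens a page per request, but you can force everything through a single long-lived session and one shared browser context so the cookies carry over. A sketch of the relevant options (real Crawlee options; the numbers are illustrative):
```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 1, // process requests one at a time, in order
    useSessionPool: true,
    persistCookiesPerSession: true, // cookies survive across requests in a session
    sessionPoolOptions: {
        maxPoolSize: 1, // a single session for the whole crawl
        sessionOptions: { maxUsageCount: 1000 }, // don't retire it after a few pages
    },
    browserPoolOptions: {
        useIncognitoPages: false, // share one context instead of a fresh one per page
    },
    async requestHandler({ page }) {
        // Consecutive pages now reuse the same cookies and browser context.
    },
});
```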

crawlee storage path? And calling via API

Hello, where do I change the storage path for the HTML and JSON output from Crawlee? And how can I call a Crawlee instance via an API?...
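
The storage location is controlled by the CRAWLEE_STORAGE_DIR environment variable (it defaults to ./storage), and there is no built-in HTTP API: you wrap the crawler in your own server. A sketch, assuming Express is installed:
```
import express from 'express';
import { CheerioCrawler, Dataset } from 'crawlee';

process.env.CRAWLEE_STORAGE_DIR = '/data/crawlee-storage'; // set before storages are opened

const app = express();
app.use(express.json());

app.post('/crawl', async (req, res) => {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $, pushData }) {
            await pushData({ url: request.loadedUrl, title: $('title').text() });
        },
    });
    await crawler.run(req.body.urls); // e.g. { "urls": ["https://example.com"] }
    res.json(await Dataset.getData());
});

app.listen(3000);
```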

How to reset a queue?

While developing a scraper I often face this issue:
1) I add the initial page to the queue.
2) I run the scraper, which marks the URL as done.
3) I want to re-run the scraper on the same page.
...
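
Two options, both real Crawlee APIs; note that by default Crawlee already purges the default storages on each start unless purging has been disabled:
```
import { RequestQueue, purgeDefaultStorages } from 'crawlee';

// Option 1: wipe all default storages (queue, key-value store, dataset).
await purgeDefaultStorages();

// Option 2: drop only the request queue, including its handled-request markers.
const queue = await RequestQueue.open();
await queue.drop();
```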

The Scrapy crawler gets a different amount of data every time. Is that okay?

There are 20 pages of data on the website, 14 items per page, 280 in total. I have tried several times and get inconsistent amounts of data each time, and there is no error in the log. To get the multi-page data and the detail-page data, the code is as follows:
```
class GzDfjrjdSpider(scrapy.Spider):
    name = 'gz_dfjrjd'
    allowed_domains = ['jrjgj.gz.gov.cn']
```

ERR_INVALID_ARGUMENT Help!

I have this code. I'm using Puppeteer to scrape an API, and I have to make a first request to an endpoint. Here is the code:
```
var firstRequest = new Request({
    url: 'https://www.example.com/api/commerce/v1/categories',
...
```
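
Hard to pin down from the fragment, but two common causes worth checking (assumptions, not a diagnosis): Request must be imported from crawlee rather than resolving to Node's global fetch Request, and payload must be a string. A sketch of a valid construction:
```
import { Request } from 'crawlee'; // not the global fetch Request

const firstRequest = new Request({
    url: 'https://www.example.com/api/commerce/v1/categories',
    method: 'POST',
    payload: JSON.stringify({ locale: 'en-US' }), // must be a string, not an object
    headers: { 'content-type': 'application/json' },
    label: 'CATEGORIES',
});
```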

using Playwright's recordVideo option when using newContext

Hi everyone, thanks for the awesome platform. I am stuck on something and could use a hand. I'm looking for a way to record videos of crawls in order to evaluate a playbook that executes during a visit, but I can't figure out how to pass the options that enable video recording to Crawlee, as specified in Playwright's docs:
```
const context = await browser.newContext({
    recordVideo: { dir: 'videos/' },
});
```
...
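
One possible route, an assumption to verify against your Crawlee version: with useIncognitoPages enabled, browser-pool passes the page options on to browser.newContext(), so a prePageCreateHook can inject recordVideo there:
```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useIncognitoPages: true, // each page gets its own context built from pageOptions
        prePageCreateHooks: [
            (pageId, browserController, pageOptions) => {
                // With incognito pages these options should reach browser.newContext().
                if (pageOptions) {
                    pageOptions.recordVideo = { dir: 'videos/' };
                }
            },
        ],
    },
    async requestHandler({ page }) {
        // ...
    },
});
```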

Addressing playwright memory limitations in crawlee

Hello, I am currently using Crawlee on a medium-sized project and I am generally happy with it. I am targeting e-commerce websites and I am interested in how various products are presented on the site, so I opted for a browser automation solution to be able to "see" the page, using Playwright as the automation tool. Recently I noticed some of my scraping instances failing with the following error: ...
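
A few knobs that usually help with browser memory (real Crawlee/browser-pool options; the values are illustrative), plus the CRAWLEE_MEMORY_MBYTES environment variable, which caps how much memory the autoscaled pool will try to use:
```
import { PlaywrightCrawler } from 'crawlee';

// Run with e.g.: CRAWLEE_MEMORY_MBYTES=4096 node crawler.js

const crawler = new PlaywrightCrawler({
    maxConcurrency: 4, // hard cap on parallel pages
    browserPoolOptions: {
        retireBrowserAfterPageCount: 50, // recycle browsers so leaked memory is released
        maxOpenPagesPerBrowser: 5,
    },
    async requestHandler({ page }) {
        // ...
    },
});
```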

How to tell PlaywrightCrawler to wait

In playwrightUtils I have the option to tell Playwright to wait until content is loaded (by calling the gotoExtended function and providing DirectNavigationOptions), or to wait a certain number of seconds for the content to load before exiting (in the infiniteScroll function, by providing InfiniteScrollOptions). My question is: can this also be done using the main PlaywrightCrawler class? Just calling gotoExtended or infiniteScroll directly does not seem to give me the option to use all the other features that PlaywrightCrawler provides, such as using a proxy server and so on....
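
Both are reachable from PlaywrightCrawler itself: a preNavigationHook receives the gotoOptions that the crawler forwards to page.goto(), and infiniteScroll is exposed directly on the crawling context, so proxies, sessions and autoscaling all still apply. A sketch:
```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 120,
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            // Forwarded to page.goto() by the crawler.
            if (gotoOptions) gotoOptions.waitUntil = 'networkidle';
        },
    ],
    async requestHandler({ infiniteScroll }) {
        // Same helper as playwrightUtils.infiniteScroll, pre-bound to this page.
        await infiniteScroll({ timeoutSecs: 30 });
    },
});
```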