Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Website language filtering (/en)

Hi, I am new to Crawlee. Is there a way to crawl only the English version of a website when one exists and, when it doesn't, fall back to the regular version in its home language? The issue with only using URLs like https://example/en/.... is that some websites don't have such paths, so those requests return an error. In those cases I'd still want to scrape the site even if it's in another language; it's just that wherever possible I'd prefer the English...
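One possible approach, sketched under assumptions: for each site, try an `/en` variant first and fall back to the original URL if that request fails. The helper name `englishFirstCandidates` and the `/en` path convention are assumptions for illustration, not part of Crawlee; wiring the fallback into a crawler is left out.

```typescript
// Hypothetical helper: returns the URLs to try in order, English variant first.
// Assumes sites expose English content under an "/en" path prefix, which
// many do not; the original URL is always kept as the fallback.
function englishFirstCandidates(rawUrl: string): string[] {
    const original = new URL(rawUrl);
    const segments = original.pathname.split('/').filter((s) => s.length > 0);

    // Already an English URL: there is nothing to add.
    if (segments[0] === 'en') {
        return [original.href];
    }

    // Prepend "en" to the existing path segments.
    const english = new URL(original.href);
    english.pathname = '/' + ['en', ...segments].join('/');
    return [english.href, original.href];
}
```

The crawler would request the first candidate and move on to the second only when the first one errors out, so sites without an English version still get scraped.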

What Are The Advantages of Crawlee Over AsyncIO/Scrapy

Is there any advantage to using Crawlee over, say, AsyncIO, HTTPX, Scrapy, etc. when scraping pages that don't need dynamic content? I am beginning to learn about Crawlee (so I am fairly new to understanding all of the features), and I know there are some basic things that give Crawlee a leg up, such as the TLS fingerprints, but what other features do you find make Crawlee excel over the other options? Are there any speed improvements?

User agent isn't randomly created by crawlee

I'm using the PuppeteerCrawler class, and my expectation is that the user agent should be randomly generated whenever I create a new instance (or start a new run). This is not what happens; instead, the same value is consistently used across runs:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9
How do I rotate the user agent? Sometimes the user agent just looks like my normal laptop user agent as well. Not sure why. Note that I am also using Puppeteer Extra. For reference, this is a simplified look at what my instantiation looks like. ...

make request for cookies inside createSessionFunction

Dear all, I am trying to use createSessionFunction to create a session and set some basic cookies from a response. The problem is: how can I make a request to an endpoint to get cookies inside createSessionFunction? My basic code is this, and I am wondering what the best way is to get cookies without breaking the flow of the crawler. ` createSessionFunction: async(sessionPool,options) => {...

Cookies failure Playwright

DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:cdn.grupoelcorteingles.es Request:www.elcorteingles.es","Cookie not in this host's domain. Cookie:cuenta.elcorteingles.es Request:www.elcorteingles.es"]}
DEBUG PlaywrightCrawler: Page opened. {"url":"https://www.elcorteingles.es/"}
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:cdn.grupoelcorteingles.es Request:www.elcorteingles.es","Cookie not in this host's domain. Cookie:cuenta.elcorteingles.es Request:www.elcorteingles.es"]}
DEBUG PlaywrightCrawler: Page opened. {"url":"https://www.elcorteingles.es/"}
I'm just opening this URL, with no hooks or anything else; it just has to open the page and print the title....

Remove single item from Dataset

Hi! Is it possible to delete a single item from a local Dataset?...

Having an issue running the Crunchbase scraper. I'm on trial. Why can't I run the bot?

change session storage in preNavigationHooks

Hello, I'm trying to change session storage before navigating to a page. The hook is attached; here is the error: ``` WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Evaluation failed: DOMException: Failed to read the 'sessionStorage' property from 'Window': Access is denied for this document....

In a request handler, can I open a new browser tab & run code concurrently?

I'm currently using playwright and my request is on the page /results. My defaultHandler is running and I'd like to: 1. open a new tab and navigate: page.goto(/docs) 2. Concurrently, perform some browser navigation with playwright in both tabs /results & /docs ...

How to increase (playwright) max Data size?

I'm getting an error after a run completed that is over 9MB. How do I increase this limit so I can save larger datasets?

how can I mount the `createPlaywrightRouter()` properly?

I don't understand why Crawlee is throwing an error about adding a router for the label undefined. Here's my code: ```ts // routes.ts...

I want to define the memory allocated for the run when I trigger it using the API

My account has 32 GB of memory, but each run allocates 4 GB. I want to cap it at 1 GB so I can run up to 32 jobs in parallel. I can do that from the console; how can I do it from the API? This is an example of the data I use when I trigger the API:...
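As far as I know, the Apify API's run-Actor endpoint accepts a `memory` query parameter in megabytes; treat that, and the endpoint path below, as assumptions to verify against the API reference. `buildRunUrl`, `ACTOR_ID`, and `MY_TOKEN` are placeholders for illustration.

```typescript
// Hypothetical helper that builds the "run Actor" URL with a memory limit.
// Assumption: the endpoint accepts a `memory` query parameter in megabytes.
function buildRunUrl(actorId: string, token: string, memoryMbytes: number): string {
    const url = new URL(`https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/runs`);
    url.searchParams.set('token', token);
    url.searchParams.set('memory', String(memoryMbytes));
    return url.href;
}

// 1024 MB per run would allow up to 32 parallel runs on a 32 GB account.
const runUrl = buildRunUrl('ACTOR_ID', 'MY_TOKEN', 1024);
```

The resulting URL would be the target of the POST request that starts the run, with the Actor input sent as the request body as before.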

How to use network mocking?

I use Playwright, and I'd like to mock some network requests so as not to hit my CDN too hard. With Playwright, we can do something like this (documentation): ``` test.beforeEach(async ({ context }) => {...

How to find usernames in bulk instead of manually for the Instagram profile scraper

Is there any way to find Instagram brand usernames by category (such as Nike, Reebok, Paragon, etc.) in bulk instead of manually, and then bulk-upload them to the Instagram profile scraper? Any help would be appreciated...

Crawler only working in headed mode.

I have a Puppeteer crawler that works almost flawlessly in headed mode, but if I go headless, all the requests get 403 errors. I was thinking that xvfb should fix this, but unfortunately it doesn't. Any other ideas?...

Puppeteer actor dockerfile via nixpacks?

Hello, I'm trying to build a TypeScript Puppeteer crawler with nixpacks, but I can't seem to get the Puppeteer dependencies to work as they should. The build completes just fine. Here's my nixpacks.toml file:...

Program exit is delayed after all requests are crawled

I'm currently running a PuppeteerCrawler with Crawlee. After my code prints "INFO PuppeteerCrawler: Crawl finished. Final request statistics", my program does not exit instantly; there is a delay of about 5 seconds. Can you help me understand and fix it? Thank you! Below is the simple code `await crawler.run([url]);...

How can I open a new browser tab from within a router handler?

Hi there, I'm currently running a PlaywrightCrawler with crawlee and I would like to open a new tab from within a route handler, so that I can run the following: ```ts router.addHandler('*/ergebnisse.xhtml', async ({ request, page, log, session }) => {...

Scrape the subpages of a website: depth variable possible?

Hi guys, I searched for this but could not find an answer: I am building a web crawler, which at the moment I have implemented using Puppeteer only. There I can use custom JS to control the depth of a query, basically how often the recursive function is called. I tried Crawlee and it is so good! But unfortunately it collects way too many links, even some I don't need. Hence my question: is it possible to set the depth of the crawler? E.g., it should only scrape the links of the website and not explore the links further....
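A common way to cap depth, sketched here as an assumption rather than a built-in Crawlee option: carry a depth counter in each request's `userData` and stop enqueuing once it reaches a limit. `childRequests` and `maxDepth` are hypothetical names; the returned objects use the `url`/`userData` shape that Crawlee requests accept.

```typescript
interface DepthRequest {
    url: string;
    userData: { depth: number };
}

// Hypothetical helper: given links found on a page at `currentDepth`,
// return the requests to enqueue, or nothing once `maxDepth` is reached.
function childRequests(links: string[], currentDepth: number, maxDepth: number): DepthRequest[] {
    if (currentDepth >= maxDepth) {
        return [];
    }
    // Drop duplicate links found on the same page.
    const unique = Array.from(new Set(links));
    return unique.map((url) => ({ url, userData: { depth: currentDepth + 1 } }));
}
```

With `maxDepth` set to 1, links from the start page (depth 0) are enqueued but their own links are not. In a request handler, the result could be passed to `crawler.addRequests(...)`, reading the current depth from `request.userData`.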

Scrape data from TikTok for research

Hey there! I am doing research on influencers' success on TikTok and how users interact with them on this platform. For this purpose I want to scrape the comments and other data from ~120 video posts. I do not know how to proceed. Moreover, I do not know whether it is against TikTok's Terms and Conditions....