Crawlee & Apify


This is the official developer community of Apify and Crawlee.


I want to use a created dataset

I was following this video: https://www.youtube.com/watch?v=8uvHH-ocSes to create a dataset, and I created it. The problem is that I am using Python, and I want to import the created dataset to train with LlamaIndex. The documentation is here: https://llamahub.ai/l/apify-actor. It only covers creating a new dataset by scraping a URL, without giving the option to load an existing dataset by its ID....
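
Assuming you just need the existing dataset's items rather than a fresh scrape, the Apify API client can fetch them by ID. Here is a minimal sketch with the JavaScript client (the Python apify-client mirrors it: client.dataset('ID').list_items().items); the token and dataset ID are placeholders:

```javascript
import { ApifyClient } from 'apify-client';

// Placeholders: your API token and the dataset ID shown in the Console.
const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

// Fetch the items of an existing dataset by its ID.
const { items } = await client.dataset('MY_DATASET_ID').listItems();

// Each item is a plain object; feed it to your training pipeline.
for (const item of items) {
    console.log(JSON.stringify(item));
}
```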

Is it possible to increase the data item size when using `pushData` locally?

I see that the current limit of 9MB is defined by the MAX_PAYLOAD_SIZE_BYTES constant from the @apify/consts package. I completely understand the need for this limit when running on the Apify platform. Is it possible to customize this value when running locally?...
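
I'm not aware of a documented switch for raising MAX_PAYLOAD_SIZE_BYTES locally, so treat this as a workaround sketch: keep the oversized payload in the key-value store (which accepts much larger values) and push only a lightweight reference to the dataset.

```javascript
import { Dataset, KeyValueStore } from 'crawlee';

// Hypothetical oversized record that pushData() would reject at ~9 MB.
const bigRecord = { url: 'https://example.com', html: '<huge markup>' };

// Park the heavy part in the key-value store under a unique key...
const key = `payload-${Date.now()}`;
await KeyValueStore.setValue(key, bigRecord);

// ...and push only a lightweight reference into the dataset.
await Dataset.pushData({ url: bigRecord.url, payloadKey: key });
```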

JSDOMCrawler, website breaks crawlee

Hey, after I get a warning the whole process stops. Is it possible to catch it? WARN JSDOMCrawler: Reclaiming failed request back to the list or queue. ReferenceError: request is not defined at JSDOMCrawler.requestHandler (/home/vue/repo/test/fofo.js:14:31) at /home/vue/repo/test/node_modules/@crawlee/http/internals/http-crawler.js:336:81 at wrap (/home/vue/repo/test/node_modules/@apify/timeout/index.js:52:27) at /home/vue/repo/test/node_modules/@apify/timeout/index.js:66:7...
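
The ReferenceError comes from your own handler (fofo.js:14): request is not destructured from the crawling context. Separately, a request that exhausts its retries can be caught in failedRequestHandler so the crawler keeps going instead of dying. A minimal sketch of both:

```javascript
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Destructure `request` (and `window`) from the context, or the
    // identifier won't exist inside the handler.
    async requestHandler({ request, window }) {
        console.log(`Title of ${request.url}: ${window.document.title}`);
    },
    // Called after all retries are exhausted; the crawler keeps running.
    failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
```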

High Volume Scraping

I'm evaluating Crawlee for my startup, which will require us to scrape several hundred websites. These sites are neither e-commerce nor social media, and they require interaction with the page (feeding in a list of search parameters, clicking submit buttons, etc.). The documentation seems to imply that I need to use a headless browser in order to interact with the site, but headless browsers consume tons of memory compared to their non-browser counterparts and are overkill for sites that do not render Ja...
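
Often those "interactive" flows are just an HTTP form submission underneath, which CheerioCrawler can reproduce without a browser; whether that works for your sites needs checking in DevTools. A sketch with a hypothetical search endpoint and field names:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// Reproduce the "fill in search parameters and click submit" flow as a
// plain form POST. The URL and field names here are hypothetical.
await crawler.run([{
    url: 'https://example.com/search',
    method: 'POST',
    payload: new URLSearchParams({ query: 'foo', region: 'us' }).toString(),
    headers: { 'content-type': 'application/x-www-form-urlencoded' },
    uniqueKey: 'search-foo-us', // POSTs to the same URL need distinct keys
}]);
```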

enqueueLinks doesn't work.

```
at ArrayValidator.handle (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
at ArrayValidator.parse (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
at RequestQueueClient.batchAddRequests (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/resource-clients/request-queue.ts:338:36)
at RequestQueue.addRequests (/home/chirag/Desktop/crawlee_test/my-crawler/node_modules/@crawlee/src/storages/request_queue.ts:376:46)
...
```
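
That stack usually means addRequests ended up with an entry the validator rejects (e.g. an undefined URL or a malformed options object). For comparison, a minimal enqueueLinks call that validates cleanly:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        console.log(`Crawled ${request.url}`);
        // Collect <a href> links from the current page and enqueue the
        // ones matching the glob; each entry must resolve to a valid URL.
        await enqueueLinks({
            selector: 'a',
            globs: ['https://crawlee.dev/**'],
        });
    },
});

await crawler.run(['https://crawlee.dev']);
```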

Proxy authentication bug?

Was trying to connect residential proxies to my crawler after using some other proxies, and after changing the ProxyConfiguration URLs I got the error
net::ERR_TUNNEL_CONNECTION_FAILED
After checking credentials and all possible whitelists and blacklists, I realized that the problem is not on my side. So I tried to implement proxy authentication the old way, and it worked....
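
Not necessarily a bug: one frequent cause of net::ERR_TUNNEL_CONNECTION_FAILED after switching providers is special characters in the new credentials, which must be URL-encoded when embedded in the proxy URL. A sketch under that assumption (host and credentials are placeholders):

```javascript
import { ProxyConfiguration } from 'crawlee';

// Placeholder credentials; encode them in case they contain ':', '@', etc.
const user = encodeURIComponent('my-user');
const pass = encodeURIComponent('p@ss:word!');

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [`http://${user}:${pass}@proxy.example.com:8000`],
});
```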

How to pass Apify proxy as argument to puppeteer.launch()?

Is there a simple way to get new Apify proxy data to pass as an argument to puppeteer.launch()? // Launch Puppeteer with proxy configuration const browser = await puppeteer.launch({ headless: true, args: [`--proxy-server=${proxyServer}`]...
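
With the Apify SDK you can build a ProxyConfiguration and ask it for a fresh URL. Since --proxy-server cannot carry credentials, pass those separately via page.authenticate(). A minimal sketch:

```javascript
import { Actor } from 'apify';
import puppeteer from 'puppeteer';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl();
const { username, password, host } = new URL(proxyUrl);

const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${host}`], // host only; flags cannot carry credentials
});

const page = await browser.newPage();
// Answer Chromium's proxy-auth challenge with the parsed credentials.
await page.authenticate({
    username: decodeURIComponent(username),
    password: decodeURIComponent(password),
});
```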

Getting many 429 status codes when crawling the target site, even with proxies. How to optimise my code?

Hi, guys. I am a Python coder but not good at Node.js. I made a crawler with Crawlee to bulk-check information. These are my options: useSessionPool: true, useIncognitoPages: true,...
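
429 means the site is rate-limiting you, so besides rotating proxies you usually need to slow down and retire flagged sessions. A hedged sketch of the first knobs I would try (the numbers are illustrative):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    // Slow down: fewer parallel pages, a hard cap on request rate.
    maxConcurrency: 5,
    maxRequestsPerMinute: 60,
    sessionPoolOptions: {
        maxPoolSize: 100,
        // Throw a session (and its IP pairing) away after a few uses.
        sessionOptions: { maxUsageCount: 5 },
    },
    async requestHandler({ session, response }) {
        // 429 is treated as "blocked" by default, but retiring explicitly
        // makes sure the session is never reused.
        if (response?.status() === 429) session?.retire();
        // ... scraping logic ...
    },
});
```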

Download Delay

Does Crawlee support a download delay, like Scrapy? I want to crawl a website, but this website has a delay before it loads its content, so my current Crawlee project doesn't get the content of the website.
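
Crawlee covers both meanings of "delay": maxRequestsPerMinute gives a Scrapy-style politeness delay, and waiting for a selector handles content that loads late. A sketch assuming PlaywrightCrawler and a hypothetical .content selector:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Scrapy DOWNLOAD_DELAY equivalent: cap the request rate.
    maxRequestsPerMinute: 30,
    async requestHandler({ page }) {
        // Wait until the late-loading content appears ('.content' is a
        // hypothetical selector; use one that matches your site).
        await page.waitForSelector('.content', { timeout: 30_000 });
        console.log(await page.locator('.content').first().textContent());
    },
});
```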

Mark request as handled inside response interceptor

All the data I need is in the response to a specific request, which occurs before the page is loaded. What I'm trying to achieve is to close the page and go to the next request as soon as I've got what I need, so I tried doing it in preNavigationHooks: `preNavigationHooks: [ async (crawlingContext, gotoOptions) => { const { page, request, log } = crawlingContext;...
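
One pattern that may fit: listen for the response in a preNavigationHook, stash the parsed body on request.userData, and keep the requestHandler trivial; the request is marked handled the moment the handler returns. A sketch with a hypothetical /api/data endpoint:

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            page.on('response', async (response) => {
                // Hypothetical endpoint; match whatever request carries your data.
                if (response.url().includes('/api/data')) {
                    request.userData.payload = await response.json().catch(() => null);
                }
            });
        },
    ],
    async requestHandler({ request, pushData }) {
        // By the time navigation finished, the hook has (hopefully) fired.
        if (request.userData.payload) await pushData(request.userData.payload);
        // Returning marks the request as handled; the page is closed for us.
    },
});
```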

Expire requests from request queue

Hello, I have a use case where I need to handle request expiration in the RequestQueue after a specified time (e.g., 30 minutes). Is this achievable in the current scenario? One possible approach is to set an epoch time in the userData when enqueuing a request. Then, when it reaches the preNavigationHooks phase, you can check the elapsed time against the specified limit and throw a NonRetryableError to prevent further processing of the request....
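
Your userData idea works; for the record, a minimal sketch of it with a 30-minute budget (enqueuedAt is a field we set ourselves when enqueuing):

```javascript
import { PlaywrightCrawler, NonRetryableError } from 'crawlee';

const MAX_AGE_MS = 30 * 60 * 1000; // 30 minutes

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ request }) => {
            // `enqueuedAt` is set by us below when the request is added.
            const age = Date.now() - (request.userData.enqueuedAt ?? Date.now());
            if (age > MAX_AGE_MS) {
                throw new NonRetryableError('Request expired in the queue.');
            }
        },
    ],
    async requestHandler() { /* ... */ },
});

await crawler.addRequests([
    { url: 'https://example.com', userData: { enqueuedAt: Date.now() } },
]);
```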

Log Proxy IP

Hey, I am using Crawlee's Playwright crawler with Oxylabs residential proxies. How can I print the actual IP address that Crawlee used for a specific request? I know the crawling context gives you access to a proxyInfo object, but it only shows the proxy URL and not the IP chosen for the request, which is not very useful to me. I have configured the proxy like this:...
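
proxyInfo indeed only knows the proxy URL, not the exit node. One workaround is to ask an IP-echo service through the same proxy and session using the context's sendRequest helper; it costs one extra request per page, and api.ipify.org here is just an example service:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, sendRequest, log }) {
        // sendRequest() reuses the crawler's proxy and session, so the
        // echoed address is the exit IP used for this request.
        const { body } = await sendRequest({ url: 'https://api.ipify.org' });
        log.info(`Exit IP for ${request.url}: ${body}`);
    },
});
```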

Error: Failed to launch the browser process with Puppeteer

Hello there. I'm getting the following error:
file:///home/myuser/node_modules/@puppeteer/browsers/lib/esm/launch.js:268 reject(new Error([ ^...

How to manage sensitive data in database?

I am really impressed by the entire platform, and I really enjoy creating crawlers with Crawlee and Apify. I am working on a single-page app where users can give (sensitive) input and run my private Actors to display the results. I already found that I can make an input invisible via the "isSecret" boolean in the input_schema. The only issue I currently have is that the results shouldn't be visible to me (Apify Console => database) because they could contain sensitive data. Do you have any ideas or tips?...
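
One option, sketched below, is client-side encryption: the user supplies a public key in the Actor input, the Actor pushes only ciphertext, and only the key holder can decrypt the dataset. This is an illustrative pattern, not a built-in Apify feature, and raw RSA only fits small payloads; larger records would need hybrid encryption (encrypt a symmetric key instead).

```javascript
import { publicEncrypt, constants } from 'node:crypto';
import { Actor } from 'apify';

await Actor.init();

// Hypothetical input field: the user's RSA public key in PEM format.
const { userPublicKey } = await Actor.getInput();

const result = { email: 'jane@example.com' }; // sensitive scrape result

// Encrypt before it ever touches the dataset; you only ever see ciphertext.
const ciphertext = publicEncrypt(
    { key: userPublicKey, padding: constants.RSA_PKCS1_OAEP_PADDING },
    Buffer.from(JSON.stringify(result)),
).toString('base64');

await Actor.pushData({ ciphertext });
```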

How to set different requestHandlerTimeoutSecs for specific handlers?

I have different handlers in my PuppeteerRouter:
```javascript
export const router = createPuppeteerRouter();
...
```
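
requestHandlerTimeoutSecs is crawler-wide, so one approach is to set it to the largest value any handler needs and give the stricter handlers their own inner timeout, e.g. with addTimeoutToPromise from @apify/timeout (the same helper crawlee uses internally; I'm assuming its (fn, millis, message) signature):

```javascript
import { createPuppeteerRouter } from 'crawlee';
import { addTimeoutToPromise } from '@apify/timeout';

export const router = createPuppeteerRouter();

// Hypothetical detail-scraping helper.
const scrapeDetail = async ({ page }) => console.log(await page.title());

// Strict handler: 30s of its own, regardless of the crawler-wide limit.
router.addHandler('detail', async (ctx) => {
    await addTimeoutToPromise(
        () => scrapeDetail(ctx),
        30_000,
        'detail handler timed out',
    );
});

// Relaxed handler: relies on the crawler-wide requestHandlerTimeoutSecs.
router.addHandler('listing', async () => { /* ... */ });
```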

Bayesian Network

I don't know if it's okay to ask this, but how is a Bayesian network implemented/used to generate browser fingerprints in Crawlee? I hope it's okay to ask this.
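
Perfectly fine to ask. In short, Crawlee's fingerprint-suite ships a Bayesian network whose structure and conditional probabilities were learned from real-browser data; sampling it yields fingerprints whose attributes (user agent, OS, screen, WebGL, and so on) are mutually consistent instead of independently random. You can sample it yourself through fingerprint-generator; the option names below are my best recollection of its API:

```javascript
import { FingerprintGenerator } from 'fingerprint-generator';

const generator = new FingerprintGenerator();

// Each call samples the underlying Bayesian network, conditioned on the
// constraints below, so the attributes come out mutually consistent.
const { fingerprint, headers } = generator.getFingerprint({
    browsers: ['chrome'],
    operatingSystems: ['windows'],
});

console.log(headers['user-agent']);
console.log(fingerprint.screen);
```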

Best practice to stop/crash the actor/crawler on high ratio of errors?

The following snippet works well for me, but it smells... does somebody have a cleaner approach?
```
// Every 3s, check the ratio of finished (=success) and failed requests and stop the process if it's too bad
setInterval(() => {
...
```
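
A slightly cleaner variant of the same idea is to read the crawler's built-in statistics instead of keeping your own counters, and to stop via teardown() rather than killing the process; still a periodic check (the threshold numbers are illustrative):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({ /* ... */ });

const watchdog = setInterval(async () => {
    const { requestsFinished, requestsFailed } = crawler.stats.state;
    const total = requestsFinished + requestsFailed;
    // Same 3s check as above, but from the built-in stats, and only
    // judged once there is a meaningful sample.
    if (total >= 20 && requestsFailed / total > 0.5) {
        clearInterval(watchdog);
        await crawler.teardown(); // stop gracefully instead of process.exit
    }
}, 3000);

await crawler.run(['https://example.com']);
clearInterval(watchdog);
```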

Interception error in Puppeteer

I'm getting this error in Puppeteer, but I'm not doing any interception in my script. I just create a request and add it to the crawler using crawler.addRequests; the request is a GET where I just provide the URL and headers.
DEBUG Error while disabling request interception {"error":{"name":"TargetCloseError","message":"Protocol error (Network.setCacheDisabled): Target closed","stack":"TargetCloseError: Protocol error (Network.setCacheDisabled): Target closed\n at CallbackRegistry.clear (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:138:36)\n at CDPSessionImpl._onClosed (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:451:25)\n at Connection.onMessage (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:248:25)\n at WebSocket.<anonymous> (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/NodeWebSocketTransport.js:52:32)\n at callListener (project/node_modules/ws/lib/event-target.js:290:14)\n at WebSocket.onMessage (project/node_modules/ws/lib/event-target.js:209:9)\n at WebSocket.emit (node:events:365:28)\n at Receiver.receiverOnMessage (project/node_modules/ws/lib/websocket.js:1184:20)\n at Receiver.emit (node:events:365:28)\n at Receiver.dataMessage (project/node_modules/ws/lib/receiver.js:541:14)"}}
...

Web scraping the follower list of a TikTok account.

Hello, my name is Leo and I have a TikTok account by the name of @_revenite.se. For a video project, I want to extract the names of all my followers into a list, to be able to print them all and put them on a wall. Can anyone help me with this? 🙂

How does createSessionFunction create sessions when parallel requests are being made?

I have a custom function which opens a browser to get cookies. The problem is that my machine is very small, and when multiple sessions are being created it will try to open many browsers at the same time. Can I somehow make the creation of sessions sequential? So even though I need thousands of sessions, at any point in time only one session is created and none in parallel; then only one browser instance will be running at any time. `createSessionFunction: async(sessionPool,options) => { var new_session = new Session({ sessionPool });...
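
createSessionFunction can indeed be invoked concurrently, so serialize it yourself. A sketch with a tiny promise-chain mutex, so at most one cookie-fetching browser is ever open:

```javascript
import { PlaywrightCrawler, Session } from 'crawlee';

// Promise-chain mutex: each creation waits for the previous one to settle.
let last = Promise.resolve();
const serialize = (fn) => {
    const run = last.then(fn, fn);
    last = run.catch(() => {}); // keep the chain alive after failures
    return run;
};

const crawler = new PlaywrightCrawler({
    sessionPoolOptions: {
        createSessionFunction: (sessionPool) =>
            serialize(async () => {
                const session = new Session({ sessionPool });
                // ...open a single browser here, grab the cookies, call
                // session.setCookies(cookies, url), then close the browser...
                return session;
            }),
    },
    async requestHandler() { /* ... */ },
});
```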