Crawlee & Apify

This is the official developer community of Apify and Crawlee.

Puppeteer Module not found on Vercel

Do you have a working example of PuppeteerCrawler running in a serverless function deployed on Vercel? I get the following error: `"MODULE_NOT_FOUND","path":"/var/task/node_modules/puppeteer/package.json`
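A possible workaround (an untested sketch, not an official answer): full puppeteer bundles its own Chromium, which serverless bundlers often exclude. One common pattern is to use puppeteer-core with @sparticuz/chromium and point PuppeteerCrawler's launchContext at it; the package choices below are assumptions about the setup.

```
// Sketch: puppeteer-core + @sparticuz/chromium on a serverless platform.
// Assumes both packages are installed; adapt to your Vercel function entry point.
import { PuppeteerCrawler } from 'crawlee';
import puppeteer from 'puppeteer-core';
import chromium from '@sparticuz/chromium';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launcher: puppeteer, // use puppeteer-core instead of full puppeteer
        launchOptions: {
            args: chromium.args,
            executablePath: await chromium.executablePath(),
            headless: chromium.headless,
        },
    },
    async requestHandler({ page, request, log }) {
        log.info(`Title of ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```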

To eliminate duplicate results from request retries, do I need to set a timeout between them?

The issue is that when the "job" fails, it gets restarted as many times as specified in maxRequestRetries. However, if the restarted "jobs" succeed, I end up with multiple identical results in the output, whereas I only need one. For example: the first job fails and gets restarted (which is intended), but because it restarts successfully, say, two times, I receive two identical results when I actually need only one. ``` import { Dataset, PuppeteerCrawler, log, } from 'crawlee';...
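One way this can happen is if the handler pushes results and then throws, so the retried attempt pushes the same item again. A minimal sketch (assuming items can be keyed by request.uniqueKey) that pushes data only once per request and only after all processing has succeeded:

```
// Sketch: avoid duplicate dataset items across retries by keying on request.uniqueKey.
import { Dataset, PuppeteerCrawler } from 'crawlee';

const pushedKeys = new Set(); // in-memory; use a KeyValueStore if the run can migrate

const crawler = new PuppeteerCrawler({
    maxRequestRetries: 3,
    async requestHandler({ page, request }) {
        // Do all the scraping first; if anything throws here, nothing has been pushed yet.
        const title = await page.title();

        if (pushedKeys.has(request.uniqueKey)) return; // already stored by an earlier attempt
        await Dataset.pushData({ url: request.url, title });
        pushedKeys.add(request.uniqueKey);
    },
});

await crawler.run(['https://crawlee.dev']);
```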

Hello, can anyone help me with a Selenium Python script? I am getting an error:

2023-07-20T06:21:57.447Z ACTOR: Pulling Docker image from repository.
2023-07-20T06:21:57.588Z ACTOR: Creating Docker container.
2023-07-20T06:21:57.732Z ACTOR: Starting Docker container.
2023-07-20T06:21:58.502Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-07-20T06:21:58.505Z Executing main command...

SameDomainDelay And RetryDelay in Puppeteer

Is it possible to replicate the sameDomainDelay and retryDelay options of puppeteer-cluster with Crawlee's PuppeteerCrawler? https://www.npmjs.com/package/puppeteer-cluster...
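Not a definitive answer, but recent Crawlee releases have options that roughly map to these: sameDomainDelaySecs for per-domain delays, and maxRequestRetries plus an errorHandler for retry behaviour. Treat the option names below as assumptions to verify against your Crawlee version.

```
// Sketch: approximating puppeteer-cluster's sameDomainDelay / retryDelay in Crawlee.
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    sameDomainDelaySecs: 5,   // wait ~5s between requests to the same domain (newer Crawlee versions)
    maxRequestRetries: 3,
    // There is no direct retryDelay option; a delay can be simulated in errorHandler,
    // which runs after a failed attempt and before the request is retried.
    async errorHandler({ request, log }) {
        log.info(`Retry #${request.retryCount} for ${request.url}, waiting before retrying...`);
        await new Promise((resolve) => setTimeout(resolve, 10_000));
    },
    async requestHandler({ page }) {
        // scraping logic here
    },
});
```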

infiniteScroll and enqueueLinksByClickingElements

Hello Team, I'm trying to crawl a page that has lazy-loaded images (on scroll) and an element on the first page that is a JS-event 'button' which expands the pages of "posts" on the page. I'm trying to use the code below; however, it seems like the request queue never gets filled, and the stats show 'requestsTotal: 0'....
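For reference, a minimal sketch combining the two context helpers; the selector and glob pattern are hypothetical placeholders for the actual page.

```
// Sketch: scroll to trigger lazy loading, then enqueue links hidden behind a JS "load more" button.
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, infiniteScroll, enqueueLinksByClickingElements, log }) {
        // Trigger lazy-loaded images/posts by scrolling the page.
        await infiniteScroll({ timeoutSecs: 30 });

        // Click elements that add links to the DOM and enqueue whatever URLs appear.
        await enqueueLinksByClickingElements({
            selector: '.load-more',                      // placeholder selector
            globs: ['https://example.com/posts/**'],     // placeholder pattern
        });
        log.info(`Enqueued links from ${page.url()}`);
    },
});
```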

saving data and accessing it in an Apify Actor

I've tried saving the data I scrape from my actors to a rawdata.json file, but I don't get a JSON output even though the scraping works. How would I save the data to the Apify console so that I can then use MongoDB to take that data and put it in my database? I already have my MongoDB schema set up, so how would I save the data to the Apify console and access it? Here's what I have for saving the JSON file so far:...
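A rough sketch of the usual pattern (not the poster's exact code): push items to the run's default dataset with Actor.pushData() instead of writing a local file, then read the dataset back and insert it into MongoDB. The connection string and collection names are placeholders.

```
// Sketch: store scraped items in the Actor's dataset, then copy them into MongoDB.
import { Actor } from 'apify';
import { MongoClient } from 'mongodb';

await Actor.init();

// 1. During scraping, push each item to the default dataset (visible in the Apify console).
await Actor.pushData({ title: 'Example item', url: 'https://example.com' });

// 2. After scraping, read all items back and insert them into MongoDB.
const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

if (items.length > 0) {
    const client = new MongoClient(process.env.MONGODB_URI); // placeholder connection string
    await client.connect();
    await client.db('scraper').collection('rawdata').insertMany(items);
    await client.close();
}

await Actor.exit();
```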

Cookies are not saving to KeyValue pairs

Dear all, I am trying to test out SessionPool and prepare each session with some default cookies. I create a SessionPool like this: `const sessionPool_de = await SessionPool.open({ maxPoolSize: 25, sessionOptions:{...
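For comparison, a hedged sketch of pre-seeding sessions with cookies via createSessionFunction and persisting the pool state to a key-value store; the option names and cookie shape are assumptions to check against the SessionPool docs.

```
// Sketch: a SessionPool whose sessions are created with default cookies and persisted.
import { SessionPool, Session } from 'crawlee';

const defaultCookies = [
    { name: 'locale', value: 'de-DE', domain: '.example.com', path: '/' }, // placeholder cookie
];

const sessionPool_de = await SessionPool.open({
    maxPoolSize: 25,
    persistStateKeyValueStoreId: 'my-session-store', // where the pool state is saved (assumption)
    persistStateKey: 'SESSION_POOL_DE',
    async createSessionFunction(pool, options) {
        const session = new Session({ sessionPool: pool, ...options?.sessionOptions });
        session.setCookies(defaultCookies, 'https://www.example.com'); // placeholder target URL
        return session;
    },
});
```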

system design of concurrent crawlers

I have multiple crawlers (primarily Playwright), one per site, and each works completely fine on its own when I use only one crawler per site. I have tried running these crawlers concurrently through a scrape event emitted from the server, which emits an individual scrape event for each site to run each crawler. I face a lot of memory overloads, timed-out navigations, skipping of many products, and early termination of the crawlers. Each crawler essentially takes base URLs, or scrapes those base URLs to get product URLs, which are then individually scraped to get product page info...
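Without seeing the setup it's hard to say, but these symptoms (memory overloads, timeouts, early exits) often come from launching many browser crawlers at once with no shared limits. A rough sketch of two common mitigations, capping concurrency per crawler and running sites one after another rather than all at once; the numbers and URLs are placeholders.

```
// Sketch: run one Playwright crawler per site sequentially, each with bounded concurrency.
import { PlaywrightCrawler } from 'crawlee';

const siteConfigs = [
    { name: 'siteA', startUrls: ['https://a.example.com'] }, // placeholders
    { name: 'siteB', startUrls: ['https://b.example.com'] },
];

for (const site of siteConfigs) {
    const crawler = new PlaywrightCrawler({
        maxConcurrency: 5,          // cap parallel pages so memory stays bounded
        navigationTimeoutSecs: 60,  // fail slow navigations instead of hanging
        maxRequestRetries: 2,
        async requestHandler({ request, log }) {
            log.info(`[${site.name}] ${request.url}`);
            // per-site scraping logic here
        },
    });
    // Awaiting each run keeps only one browser pool alive at a time.
    await crawler.run(site.startUrls);
}
```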

setCookies only works for name and value keys

Dear all, I am trying to set cookies on a session, but the cookies are only set if they contain only name and value keys. If any other key is present, the cookie is not set. Can you please guide me on how to debug this further? The cookies come from Playwright crawlers as a list of objects. I then pass that list of objects to Session.setCookies(), but it doesn't work. ...
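A hedged guess at a debugging direction: Playwright's context.cookies() returns fields (e.g. sameSite values, expires: -1 for session cookies) that the session's underlying cookie handling may reject, so it can help to map cookies into a minimal shape first and add other fields back one at a time. A sketch, with the mapping being an assumption rather than a confirmed fix; the function and URL are placeholders.

```
// Sketch: normalize Playwright cookies before handing them to a Crawlee Session.
// `browserContext` is a Playwright BrowserContext and `session` a Crawlee Session,
// both assumed to exist in the surrounding crawler code.
async function applyCookiesToSession(browserContext, session, url) {
    const playwrightCookies = await browserContext.cookies();

    const normalized = playwrightCookies.map((c) => ({
        name: c.name,
        value: c.value,
        domain: c.domain,
        path: c.path,
        secure: c.secure,
        httpOnly: c.httpOnly,
        // Playwright uses expires === -1 for session cookies; omitting that field is one way
        // to test whether it is what makes setCookies() drop the cookie.
        ...(c.expires > 0 ? { expires: c.expires } : {}),
    }));

    session.setCookies(normalized, url);
}
```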

delete actor

How do you delete Actors that you don't want or that you accidentally made?

Location of Apify.utils.social class library in Apify SDK 3.0?

We are upgrading our custom version of the contact-info-scraper to use the Apify SDK 3.0. Unfortunately, we cannot seem to find where the Apify.utils.social class library has gone. Any insight would be greatly appreciated. Thank you!
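If I'm not mistaken, the social helpers were moved out of the SDK and into the Crawlee utilities during the v3 split; a sketch of the new import path, worth verifying against the current docs:

```
// Sketch: the old Apify.utils.social helpers, imported from @crawlee/utils in SDK 3.x.
import { social } from '@crawlee/utils';

const html = '<a href="mailto:info@example.com">Contact</a>';
console.log(social.emailsFromText(html));        // e.g. ['info@example.com']
console.log(social.parseHandlesFromHtml(html));  // e.g. { emails: [...], twitters: [...], ... }
```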

error handling w/ playwright

I've been experiencing the same error pattern with my scrapers of completely different sites, which I thought was an individual site problem, but now it's a repeating pattern. My scraper scrapes results-page URLs, then product URLs, which it has no problem with, but when it goes through those product URLs with a Playwright crawler, it always scrapes around 30-40 URLs successfully, then suddenly experiences some crash error and randomly re-scrapes a couple of old product URLs before crashing...

SOLVED: enqueueLinks not working properly

Hey, I'm trying to crawl a website, and I use the enqueueLinks() function, which is supposedly context-aware. However, when I call await enqueueLinks() at the bottom of my request handler (I can confirm it is actually IN the request handler), it gives me this error:
enqueueLinks() was called without the required options. You can only do that when you use the `crawlingContext.enqueueLinks()` method in request handlers.
...
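For anyone hitting the same error: it usually means the module-level enqueueLinks exported from 'crawlee' was imported and called instead of the context-bound helper passed to the request handler. A sketch of the distinction (my reconstruction, not the original poster's code):

```
// Sketch: use the enqueueLinks provided in the crawling context, not the module-level export.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Destructure enqueueLinks from the handler's context; this version already knows
    // about the current request, the parsed page, and the request queue.
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Crawling ${request.url}`);
        await enqueueLinks(); // context-aware: no options required
    },
});

await crawler.run(['https://crawlee.dev']);
```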

infinite scrolling

Trying to get infinite scrolling to render all products while scraping them as the page is being scrolled down. I looked at the documentation but didn't understand how to do this: `` kotnRouter.addHandler('KOTN_DETAIL', async ({ log, page, parseWithCheerio }) => { log.info(`Scraping product URLs`);... ``
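A minimal sketch of one way to do it with the context's infiniteScroll helper plus parseWithCheerio; the router name mirrors the snippet above, but the CSS selectors are guesses for the actual page.

```
// Sketch: scroll until no more products load, then parse the fully rendered page.
import { createPuppeteerRouter } from 'crawlee';

const kotnRouter = createPuppeteerRouter();

kotnRouter.addHandler('KOTN_DETAIL', async ({ log, infiniteScroll, parseWithCheerio, pushData }) => {
    log.info('Scraping product URLs');

    // Keep scrolling until the page stops growing (or the timeout hits),
    // so lazy-loaded products end up in the DOM.
    await infiniteScroll({ timeoutSecs: 60, waitForSecs: 2 });

    const $ = await parseWithCheerio();
    const products = $('.product-card') // hypothetical selector
        .map((_, el) => ({
            title: $(el).find('.product-title').text().trim(),
            url: $(el).find('a').attr('href'),
        }))
        .get();

    await pushData(products);
});
```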

Issues with Keboola and Apify Amazon Scraper job

Hi all, I have an Apify job that will not pass any data to my Keboola integration. I've been completely tossed around by support on this issue from both Keboola and Apify, and nobody can seem to help. Keboola keeps blaming it on Apify, and Apify keeps telling me it will "be another week." It's been a month now and I haven't gotten an answer. The problem is that my destination table is not being updated by the extractor when my job is run in Keboola. It never loads new data because the Apify component is not working properly. It constantly tries to read the "actor" but doesn't actually retrieve the data once it's run. I see the job gets triggered via the API, but it appears the request to transmit the data back to my component isn't working. Any help you can provide would be greatly appreciated....

What's the right way to catch SIGTERM?

How do I catch the SIGTERM signal and persist URLs that are still in the queue or haven't been scraped yet?
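What I'd try (a sketch based on my understanding, not an authoritative answer): on the Apify platform the request queue is already persisted, so unfinished URLs survive a restart as long as the same queue is reused; for a graceful shutdown you can also listen for the platform events (or SIGTERM locally) and stop the crawler so state gets flushed.

```
// Sketch: stop the crawler gracefully so the persisted request queue keeps unfinished URLs.
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        // scraping logic...
        await enqueueLinks();
    },
});

// The platform emits 'migrating' / 'aborting' events before shutting the container down.
Actor.on('migrating', () => crawler.autoscaledPool?.abort());
Actor.on('aborting', () => crawler.autoscaledPool?.abort());

// Locally, SIGTERM can be handled the same way.
process.on('SIGTERM', () => crawler.autoscaledPool?.abort());

await crawler.run(['https://crawlee.dev']);
await Actor.exit();
```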

error handling unfound elements w/ puppeteer

In my Puppeteer crawler, I'm searching for some elements that sometimes might not be there. When they're missing, since the crawler is awaiting that element, it causes an error that can often crash the crawler. I've tried wrapping the await element statements in try/catch statements to handle the errors and return, but I've seen that it still throws errors, because when it awaits the element it needs to see that element to move on. I want it to be able to skip over unfound elements, scrape the OTHER elements on the page, and move on. A small snippet of the code:...
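A small sketch of the usual pattern: give only the optional selector a short waitForSelector timeout with its own catch (or use page.$, which resolves to null instead of throwing), so a missing element doesn't take down the whole handler. The selectors are placeholders.

```
// Sketch: treat optional elements as nullable instead of letting them throw.
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // Required element: let this throw so the request is retried if the page is broken.
        const title = await page.$eval('h1', (el) => el.textContent?.trim());

        // Optional element: short timeout + catch, so absence just yields null.
        const price = await page
            .waitForSelector('.price', { timeout: 5000 })
            .then((el) => el.evaluate((node) => node.textContent?.trim()))
            .catch(() => null);

        // Another option: page.$() returns null instead of throwing when nothing matches.
        const badge = await page.$('.sale-badge');

        await Dataset.pushData({
            url: request.url,
            title,
            price,
            onSale: badge !== null,
        });
    },
});
```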

crawlee misses links #depth #missing-urls

Happy Fourth everyone! Hoping someone can suggest how to address the following. I copied the simple example from the docs in an attempt to scrape all links to pages below https://weaviate.io/developers/weaviate. It runs and reports 32 links found, but it misses many links, particularly those 3 or more levels down. For instance, it misses all the pages below https://weaviate.io/developers/weaviate/api/graphql/, like https://weaviate.io/developers/weaviate/api/graphql/get. My code is ``` const startUrls = ['https://weaviate.io/developers/weaviate']; const storageDir = path.join(__dirname, '../storage/datasets/default'); ...
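A couple of things I'd check (guesses, since the full config isn't shown): a maxRequestsPerCrawl limit that silently caps the run, and the enqueueLinks strategy or globs only matching the section you expect. A sketch of a depth-unlimited variant:

```
// Sketch: crawl everything under /developers/weaviate with no request cap.
import { CheerioCrawler, Dataset } from 'crawlee';

const startUrls = ['https://weaviate.io/developers/weaviate'];

const crawler = new CheerioCrawler({
    // Leave maxRequestsPerCrawl unset (or very high); a low value silently stops the crawl
    // after that many pages, which looks like "missing" deep links.
    async requestHandler({ request, enqueueLinks }) {
        await Dataset.pushData({ url: request.url });
        await enqueueLinks({
            // Follow only links below the docs root, at any depth.
            globs: ['https://weaviate.io/developers/weaviate/**'],
        });
    },
});

await crawler.run(startUrls);
```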

Broken Links Checker only returns 20 requests

Hi! When running Broken Links Checker actor by Jan Čurn in Apify, it only returns 20 results instead of all links on the website. Is there possibly a setting that I'm missing? The run ID is WpXfO0LrXg8GestHv. Thank you in advance for any help you could offer!...