Crawlee & Apify


This is the official developer community of Apify and Crawlee.


Custom storage provider for RequestQueue?

It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database rather than Crawlee's native KVS and Dataset. Are there any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible, I'd love to consolidate the storage in one place, particularly since it would let me query and manage the request pool more easily…
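One possible direction, sketched under the assumption that you implement the StorageClient interface from @crawlee/types against your database (the PostgresStorageClient class here is hypothetical), is to hand a custom storage client to the crawler's Configuration:

```javascript
import { CheerioCrawler, Configuration } from 'crawlee';
// Hypothetical class implementing the StorageClient interface from
// @crawlee/types: it must hand out dataset, key-value store, and
// request queue clients backed by your database.
import { PostgresStorageClient } from './postgres-storage-client.js';

const crawler = new CheerioCrawler(
    { /* crawler options */ },
    new Configuration({ storageClient: new PostgresStorageClient() }),
);
```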

How to scroll a page

Hi, I am using PuppeteerCrawler. How do I scroll to load more content in the request handler?
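If the page lazy-loads items on scroll, Crawlee's puppeteerUtils.infiniteScroll helper is one way to do this inside the handler; a minimal sketch:

```javascript
import { PuppeteerCrawler, puppeteerUtils } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        // Keeps scrolling until no new content appears or the timeout is hit.
        await puppeteerUtils.infiniteScroll(page, { timeoutSecs: 30 });
        // ...extract the lazily loaded items here...
    },
});
```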

Exclude query parameter URLs from crawl jobs

Hello, I'm currently researching methods to exclude URLs with query parameters, for example: https://domain[.]com/path?query1=test&query2=test2. I've tried hooking into the enqueueLinks options like:...
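One approach that should work inside the request handler is enqueueLinks's transformRequestFunction, which can drop a candidate link by returning false; a sketch:

```javascript
await enqueueLinks({
    transformRequestFunction: (request) => {
        // Skip any URL that carries a query string.
        const url = new URL(request.url);
        if (url.search) return false;
        return request;
    },
});
```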

Custom configuration is not working

I am trying to use a custom configuration, but no luck so far.
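For reference, a minimal sketch of the two usual ways to apply custom configuration in Crawlee (the defaultDatasetId value is just an example):

```javascript
import { CheerioCrawler, Configuration } from 'crawlee';

// Option 1: set values on the global configuration before creating crawlers.
Configuration.getGlobalConfig().set('defaultDatasetId', 'my-dataset');

// Option 2: pass a dedicated Configuration instance as the second
// constructor argument, which overrides the global one for this crawler.
const crawler = new CheerioCrawler(
    { /* crawler options */ },
    new Configuration({ defaultDatasetId: 'my-dataset' }),
);
```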

Best practice for rendering JavaScript, then doing a deep or structuredClone of the window object?

Hello, I am looking for general high-level advice on the best approach to crawl a site, save the *.js resources, and log the window object. Does anyone have an idea? I'm a little unsure whether I should lean more on the Playwright API or whether there is a built-in utility or helper function for downloading resources (and analyzing the window object at a depth of 3 or 4) from the site. Thanks in advance for any help.
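I'm not aware of a built-in helper for either task, so here is a rough sketch with plain Playwright, assuming the response listener is registered before navigation (e.g. in a preNavigationHook) and that a depth-limited copy of window is acceptable:

```javascript
// Save each *.js response body as it arrives (naive URL check).
page.on('response', async (response) => {
    if (response.url().endsWith('.js')) {
        const body = await response.body();
        // ...persist `body` wherever you like...
    }
});

// After navigation: take a depth-limited snapshot of the window object.
const snapshot = await page.evaluate(() => {
    const clone = (value, depth) => {
        if (value === null || typeof value !== 'object') {
            return typeof value === 'function' ? '[function]' : value;
        }
        if (depth === 0) return '[object]'; // stop here; window is self-referential
        const out = {};
        for (const key of Object.keys(value)) {
            try { out[key] = clone(value[key], depth - 1); } catch { out[key] = '[unreadable]'; }
        }
        return out;
    };
    return clone(window, 3); // depth of 3, as in the question
});
```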

How to rotate proxies in CheerioCrawler?

Will it rotate the proxy URL for each request URL?
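As a reference for how rotation is usually wired up: when ProxyConfiguration gets a list of proxyUrls, Crawlee rotates through them (tied to sessions when the session pool is enabled). A sketch with placeholder proxy URLs:

```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    async requestHandler({ request, proxyInfo }) {
        // proxyInfo shows which proxy was actually used for this request.
        console.log(`${request.url} fetched via ${proxyInfo?.url}`);
    },
});
```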

About defining routes

I'm using Crawlee for a crawler platform. I have a question: can I use routing for preNavigationHooks and failedRequestHandler the same way as for requestHandler? If so, how do I use it for preNavigationHooks? Thanks!...
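As far as I know, the Router only plugs into requestHandler; for hooks, one workaround is to branch on request.label inside a plain hook function. A sketch (the DETAIL label and header are just examples):

```javascript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();
router.addDefaultHandler(async ({ log }) => log.info('default route'));

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    // preNavigationHooks take plain functions, not a router;
    // emulate routing by branching on the request's label.
    preNavigationHooks: [
        async ({ request, page }) => {
            if (request.label === 'DETAIL') {
                await page.setExtraHTTPHeaders({ referer: 'https://example.com' });
            }
        },
    ],
});
```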

Extracting text from list elements

I want to extract the text from all <li> elements inside an unordered list <ul>. Trying await page.locator("div.my_class > ul > li").textContent(); causes an error: strict mode violation: locator('div.my_class > ul > li') resolved to x elements. Multiple elements are expected, since this is a list. Playwright itself doesn't seem to have an issue with selectors that return multiple elements, and I did find the strictSelectors parameter in the Crawlee docs, but didn't manage to set it to false (if that is even the solution). In Scrapy, item.add_css("list", "div.my_class > ul > li::text") returns a list of the text for each list item, which is what I'm looking for. Does anyone know how to solve this?...
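Playwright's strict mode only complains about single-element actions; the multi-element reads avoid it. For example:

```javascript
// Returns string[], one entry per matched <li>; no strict mode violation.
const texts = await page.locator('div.my_class > ul > li').allTextContents();

// Alternatively, iterate the matches individually:
for (const li of await page.locator('div.my_class > ul > li').all()) {
    console.log(await li.textContent());
}
```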

Playwright in Docker image doesn't work

I just started a new, clean project with npx crawlee create my-crawler, built the Docker image, and deployed it on a server. When the image runs, I get the log in the image below. I didn't change a single line of code. Tried with Crawlee versions 3.0.0 and 3.1.2. ...

Disable statistics

Hi, is there a way to disable statistics, which are saved to the storage dir by default?
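I don't know of a switch for statistics alone, but if the goal is just to keep them off disk, disabling storage persistence entirely is one option on recent Crawlee versions; note this affects all local storage, not only the statistics records:

```javascript
import { CheerioCrawler, Configuration } from 'crawlee';

// Equivalent to setting the CRAWLEE_PERSIST_STORAGE=false env var.
const crawler = new CheerioCrawler(
    { /* crawler options */ },
    new Configuration({ persistStorage: false }),
);
```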

requestQueue doesn't delete requests after visiting and saving data

Hi, I'm working with Crawlee and Playwright. I've noticed that requests aren't being popped out of the queue even though the links have already been visited and scraped. Am I missing a configuration or something?

Run the Puppeteer Docker image locally (actor-node-puppeteer-chrome)

I am trying to run & debug my crawler locally, but keep getting the following error:

```
Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1280x720x16 -nolisten tcp
2022-11-17T12:30:00.521459905Z Executing main command
2022-11-17T12:30:01.435704710Z INFO System info {"apifyVersion":"3.1.0","apifyClientVersion":"2.6.1","osType":"Linux","nodeVersion":"v18.7.0"}
```
...

How do we assign a session to a request without having to use a proxy?

Can we include the session or the sessionId when running addRequest?

How to handle sequential steps (like a login flow or a wizard) in a headless browser?

Context: We need to log in to establish a session, then visit a 'content page' to scrape the data we want.
Goal: We're trying to understand the correct way to set up Crawlee for this scenario. Do we do it serially with page.goto, as is done in the forms example[1]? Should we set up handlers for each page type (loginHandler and contentPageHandler) and just add the pages to the RequestQueue? Or do we do something else entirely?...
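A sketch of the label-based handler approach, under the assumption that a single session can carry the login cookies (the URLs, selectors, and env vars are placeholders):

```javascript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addHandler('LOGIN', async ({ page, crawler }) => {
    await page.fill('#username', process.env.SITE_USER);
    await page.fill('#password', process.env.SITE_PASS);
    await page.click('button[type=submit]');
    // Session established; now enqueue the pages that require it.
    await crawler.addRequests([{ url: 'https://example.com/content', label: 'CONTENT' }]);
});

router.addHandler('CONTENT', async ({ page }) => {
    // ...scrape the logged-in content here...
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    // Keep a single session so cookies from LOGIN carry over to CONTENT.
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: 1 },
});

await crawler.run([{ url: 'https://example.com/login', label: 'LOGIN' }]);
```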

How to set the payload in CheerioCrawler preNavigationHooks

Doing it like this:

```javascript
preNavigationHooks: [async (crawlingContext, gotOptions) => {
    const { request } = crawlingContext;
    request.payload = .......
```
...
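An alternative that may be more direct for CheerioCrawler: the hook's second argument is the underlying got request options, so the body can be set there. A sketch (the JSON body is an example):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async ({ request }, gotOptions) => {
            // Mutate the got options for this navigation only.
            gotOptions.method = 'POST';
            gotOptions.body = JSON.stringify({ query: 'example' });
            gotOptions.headers = { ...gotOptions.headers, 'content-type': 'application/json' };
        },
    ],
});
```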

Collecting URLs from nested XML

XML > XML (has to collect URLs by matching a URL index). I need help crawling URLs from nested XML using downloadListOfUrls in Crawlee. Below is my sample code:

```javascript
const urls = await downloadListOfUrls({ url: sitemapUrl });
for (let url of urls) {
```
...
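For a two-level sitemap (an index whose entries are themselves XML sitemaps), one sketch is to call downloadListOfUrls twice; sitemapIndexUrl and crawler are assumed to exist in your code:

```javascript
import { downloadListOfUrls } from 'crawlee';

// First level: the sitemap index yields the nested sitemap URLs.
const sitemapUrls = await downloadListOfUrls({ url: sitemapIndexUrl });

// Second level: each nested sitemap yields the actual page URLs.
const pageUrls = [];
for (const sitemapUrl of sitemapUrls) {
    pageUrls.push(...(await downloadListOfUrls({ url: sitemapUrl })));
}

await crawler.addRequests(pageUrls);
```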

Get previously crawled data for a link

When I use PuppeteerCrawler and run the same URL again, I want to get the data previously crawled for this link.
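If the earlier results were pushed to a Dataset with the source URL stored on each item, one way to look them up later (e.g. inside a request handler, where request.url is available) is to filter the dataset; the `url` field here is an assumption about how the items were saved:

```javascript
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const { items } = await dataset.getData();
// Assumes each pushed item recorded the URL it came from.
const previous = items.filter((item) => item.url === request.url);
```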

How to handle a huge JSON file?

I have a huge JSON file with many objects from Apify.com; each object is a business. The file is too big, and when I request it in the browser in order to save the results to MySQL, it takes hours to load. What should I do? I need to process the file before inserting it into MySQL.
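If the data lives in an Apify dataset, one way to avoid loading the whole file is to page through the items with apify-client and insert batch by batch (DATASET_ID and the insert step are placeholders):

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const datasetClient = client.dataset('DATASET_ID');

let offset = 0;
const limit = 1000;
for (;;) {
    const { items } = await datasetClient.listItems({ offset, limit });
    if (items.length === 0) break;
    // ...insert this batch into MySQL here...
    offset += items.length;
}
```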

Canada411 site failing after 4 hours

I am using a CheerioCrawler actor to process input files of 500,000 records against this dynamically populated URL: https://www.canada411.ca/search/?stype=re&what= The actor has been mysteriously failing after 4 to 4.5 hours, and we have not observed such behavior before. I have included below the log from toward the end of the failed run (#KcMSz5QQp8qIQnbYF). Any insight into this error message would be greatly appreciated. Thank you!...

How do I delay requests with HttpCrawler?

I am working with an API that has rate limiting in place. The API gives me a timestamp (in seconds) of when the current rate limit will expire. I need to delay my next request by that many seconds, which is usually around 15 minutes. I tried adding a delay with setTimeout and a Promise like this, and awaiting it:

```ts
export function delay(seconds: number): Promise<void> {
```
...
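A sketch of an alternative using Crawlee's exported sleep inside the handler; note that by default a 429 may be treated as a blocked status before your handler runs, so this assumes the response reaches the handler, and the x-ratelimit-reset header name is hypothetical:

```javascript
import { HttpCrawler, sleep } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, response, crawler }) {
        if (response.statusCode === 429) {
            // Hypothetical header carrying the reset time in epoch seconds.
            const resetAt = Number(response.headers['x-ratelimit-reset']);
            await sleep(Math.max(0, resetAt * 1000 - Date.now()));
            // Re-enqueue with a fresh uniqueKey so the queue accepts it again.
            await crawler.addRequests([
                { url: request.url, uniqueKey: `${request.url}#retry-${Date.now()}` },
            ]);
            return;
        }
        // ...normal processing...
    },
});
```

For coarser throttling, the maxRequestsPerMinute crawler option may also be worth a look.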