Crawlee & Apify

This is the official developer community of Apify and Crawlee.

PlaywrightCrawler hangs up after some time

We have an issue with the PlaywrightCrawler. It seems that after some time it can't open a page properly, yet it throws neither a network nor a handler timeout and stays stuck forever. Sadly, we can't reproduce the error in a short run, and it seems to happen "randomly". Crawlee version: 3.1.0. Docker image: apify/actor-node-playwright-chrome:18-1.27.1-next (also happens with other versions)...

How to click JavaScript-driven pagination links?

The pagination links don't change the URL and have no href value. I think the crawler must click the first link, scrape the page, and then move on to the next link, all in the same tab; there is no need for new tabs. You can see the pagination here: https://www.paz.co.il/he-IL/paz-stations Please help, thank you...

Custom CheerioCrawler User-Agent?

Hi, could anyone show me an example of setting a custom User-Agent in CheerioCrawler? I tried preNavigationHooks but got a strange error ```js preNavigationHooks: [ function customUserAgent(_ctx, opts = {}) {...

Examine headers before loading full page?

I'm working on a project that requires quite a few "blind" requests: hitting URLs that might be full-fledged pages, or might be (say) PDFs to download and archive, but that provide no real clues in their URLs alone. Unfortunately, all of the examples of intercepting requests and downloading files, rather than requesting the URLs as a browser would, do their work in preNavigationHooks by examining the URL itself. Aside from simply using a stub BasicCrawler to check headers first, canceling the full navigation attempt if it's unnecessary, and accepting that there will be unnecessary double-visits, does Crawlee's architecture offer any way to handle this scenario?...
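
A minimal sketch of the "check headers first" idea from this question, in plain JavaScript without Crawlee itself. It assumes the response headers are already available (for example from a HEAD request, which not every server answers reliably) as an object with lowercase keys. The helper name `classifyResponse` and its return values are hypothetical, not part of any library.

```javascript
// Hypothetical helper: decide from response headers whether a URL
// looks like an HTML page to crawl or a file to download and archive.
function classifyResponse(headers) {
  // Content-Type may carry parameters, e.g. "text/html; charset=utf-8".
  const raw = headers['content-type'] || '';
  const mime = raw.split(';')[0].trim().toLowerCase();

  if (mime === '') return 'unknown'; // no header: fall back to a full visit
  if (mime === 'text/html' || mime === 'application/xhtml+xml') return 'page';
  return 'file'; // PDFs, images, archives, and so on
}
```

A caller could route `'page'` results to a browser crawler and `'file'` results to a plain download, treating `'unknown'` like a page to stay on the safe side.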

How to share object between requests with Crawlee on Apify

Hello. While scraping a website, I need access to an object that is shared between all requests. I keep some data in this object, and every request can read from and write to it. When all requests are handled, I do some validation and calculations on the data and write the result to the Dataset. It was easy in Apify SDK v2: I created an instance of the object and passed it as a parameter of the handleXY methods. Like this: ```javascript const myData = new MyData();...
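
One way to picture the shared-object pattern described here, as a plain JavaScript sketch rather than real Crawlee code: the object lives in module scope, and the handler closes over it instead of receiving it as a parameter. The `MyData` methods, the `requestHandler` stand-in, and the `price` field are all assumptions made for illustration.

```javascript
// Hypothetical shared state: one instance in module scope,
// visible to every handler invocation through the closure.
class MyData {
  constructor() {
    this.pricesByUrl = new Map();
  }
  record(url, price) {
    this.pricesByUrl.set(url, price);
  }
  summary() {
    let total = 0;
    for (const price of this.pricesByUrl.values()) total += price;
    return { count: this.pricesByUrl.size, total };
  }
}

const myData = new MyData();

// Stand-in for a crawler's request handler (not the Crawlee API).
async function requestHandler({ request }) {
  myData.record(request.url, request.userData.price);
}

// Stand-in for a crawl run: handle every request, then do the
// final calculation once all handlers have finished.
async function run(requests) {
  await Promise.all(requests.map((request) => requestHandler({ request })));
  return myData.summary();
}
```

Because the state is only in memory, it is lost if the process restarts mid-run, so anything that must survive a restart would need to be persisted separately.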

How to set a depth-first or breadth-first crawl?

I suspect that Crawlee performs a breadth-first crawl by default with enqueueLinks(...). Is there an option to perform a depth-first crawl?...
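
To make the difference concrete, here is a small plain JavaScript sketch that does not use Crawlee at all. It walks a toy link graph and shows that appending newly found links to the back of the queue gives breadth-first order, while putting them at the front gives depth-first order. The graph, the `crawlOrder` function, and its `forefront` flag are hypothetical; the flag is named after the `forefront` option that Crawlee's `RequestQueue.addRequest` accepts, but this sketch does not claim to reproduce that behavior exactly.

```javascript
// Toy link graph standing in for a website (hypothetical data).
const links = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/a2'],
  '/b': ['/b1'],
  '/a1': [],
  '/a2': [],
  '/b1': [],
};

// Visit every page reachable from `start` and return the visit order.
// `forefront` decides where newly discovered links enter the queue.
function crawlOrder(start, forefront) {
  const queue = [start];
  const seen = new Set([start]);
  const visited = [];

  while (queue.length > 0) {
    const url = queue.shift();
    visited.push(url);

    const fresh = links[url].filter((u) => !seen.has(u));
    fresh.forEach((u) => seen.add(u));

    if (forefront) {
      queue.unshift(...fresh); // front of the queue: depth-first
    } else {
      queue.push(...fresh); // back of the queue: breadth-first
    }
  }
  return visited;
}
```

With this graph, `crawlOrder('/', false)` visits both top-level pages before any of their children, while `crawlOrder('/', true)` finishes everything under `/a` before touching `/b`.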

How to disable the Crawlee log?

If I run the Example Usage (https://crawlee.dev/api/playwright-crawler#example-usage) with this URL: https://httpbin.org/status/404, I get this output: ` 2022-10-11 08:52:58.931 WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_HTTP_RESPONSE_CODE_FAILURE at https://httpbin.org/status/404 =========================== logs =========================== navigating to "https://httpbin.org/status/404", waiting until "load"...

How to set the timeout?

With Playwright, it is possible to set the timeout for every method that accepts the timeout setting using: browserContext.setDefaultTimeout(timeout) If you want a different timeout for navigations than for other methods, perhaps when simulating slow connection speeds, you can also set: browserContext.setDefaultNavigationTimeout(timeout) How can I do this with Crawlee/Playwright?...

Sending request in XML

Hi, with the Apify SDK we could export a JSON dataset to XML format, as shown in the following format: ```javascript { "address": [{ "@": { "type": "home",...

How to get only what's after an HTML tag?

How to scrape only what's after the HTML tag <label>? (e.g. text after label 1, text after label 2) <div> <label>some label 1</label>...

How to get the image source?

How to get the image source of this element? page.locator('img'); This is not working: const image = await page.locator('img').src;...

How to scrape videos?

I'm trying to scrape dancing videos from a website (Steezy) for an academic research project. When I open up devtools, I see that there are requests for .mp4 files, but when I copy them as cURL and check the files locally, they seem malformed and I can't play them. Does anyone have any advice? (I know this doesn't have to do with Crawlee/Apify, but maybe someone here knows about this)...

Crawlee does not scrape a second time

I am scraping the same Amazon products at a fixed time interval, but when I run the program, Crawlee scrapes the first time and after that it does not make any requests

Approach to store scraped data in a database (Postgres)

(Apologies for the cross-post: https://github.com/apify/crawlee/discussions/1577) Hi, I recently discovered Crawlee and I'm trying to figure out how I can store the scraped data in a database instead of in the local directory storage. Is there any plugin for that? How should I proceed to implement one? Must I code my own class that implements the StorageClient interface? If so, how do I inject it later so that it is used?...

How to solve "example.com needs to review the security of your connection before proceeding"?

I am trying to scrape a site that has a page that checks whether the connection is secure, and I get this error: WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code....

How to extend log messages?

Are there any plans to extend the Crawlee logger? Ref: https://crawlee.dev/api/core/class/Logger I found this to set the skipTime option ```js...

Do you know a dashboard for Crawlee?

I want to monitor all Crawlee crawlers. To do this I looked for a dashboard to control the crawlers. I only found this distributed web crawler management platform: Crawlab (https://github.com/crawlab-team/crawlab). Do you know of any others?...

How to resize Playwright browser window?

There is the method page.setViewportSize() (https://playwright.dev/docs/api/class-page#page-set-viewport-size) to resize the Playwright browser window. With Crawlee/PlaywrightCrawler, how can I set the size of the browser window?...

Making API request with crawlee?

I need to make an API POST request to retrieve information in the body. I tried using the BasicCrawler, since it uses got-scraping under the hood. I've had no success, so I opted for installing the axios package instead. The code ```javascript import { BasicCrawler} from 'crawlee'; ...

PuppeteerCrawler proxy rotation

I'm using PuppeteerCrawler from the Crawlee library. I want to rotate proxies. How do I apply proxy rotation to PuppeteerCrawler? Thanks!...
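
For what rotation means in its simplest form, here is a plain JavaScript sketch of round-robin selection over a list of proxy URLs. It does not use Crawlee: the library ships its own `ProxyConfiguration` class that accepts a `proxyUrls` list, and that is probably the better starting point in a real crawler. The `createProxyRotator` helper and the proxy URLs below are hypothetical.

```javascript
// Hypothetical helper: hand out proxy URLs in round-robin order.
function createProxyRotator(proxyUrls) {
  if (proxyUrls.length === 0) {
    throw new Error('At least one proxy URL is required');
  }
  let next = 0;
  return function nextProxyUrl() {
    const url = proxyUrls[next];
    next = (next + 1) % proxyUrls.length; // wrap around at the end
    return url;
  };
}

// Usage with placeholder proxy URLs (assumed, not real servers).
const nextProxyUrl = createProxyRotator([
  'http://proxy-1.example.com:8000',
  'http://proxy-2.example.com:8000',
]);
```

Each call to `nextProxyUrl()` returns the next URL in the list and starts over after the last one.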