Crawlee & Apify



This is the official developer community of Apify and Crawlee.


Channels: crawlee-js, apify-platform, crawlee-python, 💻hire-freelancers, 🚀actor-promotion, 💫feature-request, 💻devs-and-apify, 🗣general-chat, 🎁giveaways, programming-memes, 🌐apify-announcements, 🕷crawlee-announcements, 👥community

How to wait for the browser to close, like Playwright's await browser.close()?

Cannot EnqueueLinks with Globs

The crawler starts with the sitemap.xml of a website, and I'm trying to enqueue all the links inside the XML with globs: ```await enqueueLinks({ globs: ["https://website.com/product/*"], });```...
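A possible starting point for this one (a plain-JS sketch, not a confirmed fix; the `https://website.com/product/` prefix and the `PRODUCT` label are assumptions taken from the question): a sitemap contains no `<a>` tags, so the default link extraction in `enqueueLinks()` has nothing to match the globs against. Extracting the `<loc>` entries yourself and enqueuing them via `crawler.addRequests()` sidesteps that.

```js
// Pull <loc> URLs out of sitemap XML and keep only those under a prefix.
const extractSitemapUrls = (xml, prefix) =>
    [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)]
        .map((m) => m[1])
        .filter((url) => url.startsWith(prefix));

// Usage sketch inside a Crawlee requestHandler:
// const urls = extractSitemapUrls(body.toString(), 'https://website.com/product/');
// await crawler.addRequests(urls.map((url) => ({ url, label: 'PRODUCT' })));
```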

How to prevent following redirects to other domains?

I see there is a way to prevent this once the page loads, with something like this: ```js await page.setRequestInterception(true); page.on('request', async (request) => {...```
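One way to extend that interception idea so redirects cannot leave the starting domain (a sketch against plain Puppeteer; `website.com` is a stand-in host): abort any navigation request whose hostname differs from the allowed one.

```js
import puppeteer from 'puppeteer';

const ALLOWED_HOST = 'website.com'; // stand-in for the starting domain

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', (request) => {
    const { hostname } = new URL(request.url());
    // Abort top-level navigations that would land on another domain; a
    // server-side redirect surfaces here as a new navigation request too.
    if (request.isNavigationRequest() && hostname !== ALLOWED_HOST) {
        return request.abort();
    }
    return request.continue();
});

await page.goto('https://website.com/');
await browser.close();
```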

ERROR BrowserPool: Failed to close context.

Hi, I noticed that I get this error, here is my configuration: const crawler = new PuppeteerCrawler({...

Setting cookies is failing

Error: ``` /node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:330 error: new Errors_js_1.ProtocolError(), ^...

How to retry failed requests after the queue has "ended"?

I just depleted my proxy quota and all the remaining requests in the queue failed. Something similar happens often; how do I retry/re-enqueue the failed requests? I've been googling it for a while now, and there's hardly any up-to-date info, only bits and pieces from older versions, closed GitHub issues, etc....
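One self-contained pattern worth sketching (the requeue cap and `userData` bookkeeping are my own additions, not a documented recipe): re-enqueue from `failedRequestHandler` with a fresh `uniqueKey`, since the queue deduplicates by `uniqueKey` and would otherwise silently skip the re-added request.

```js
import { PuppeteerCrawler } from 'crawlee';

const MAX_REQUEUES = 1; // hypothetical cap so a dead proxy can't loop forever

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // ...scrape as usual...
    },
    // Runs once a request has exhausted maxRequestRetries.
    async failedRequestHandler({ request }) {
        const round = (request.userData.requeueRound ?? 0) + 1;
        if (round > MAX_REQUEUES) return;
        await crawler.addRequests([{
            url: request.url,
            label: request.label,
            userData: { ...request.userData, requeueRound: round },
            uniqueKey: `${request.uniqueKey}#requeue-${round}`, // bypass deduplication
        }]);
    },
});

await crawler.run(['https://website.com/']);
```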

requestHandlerTimeout and navigationTimeout not respected

In my main.js, I've set navigationTimeoutSecs: 10 and requestHandlerTimeoutSecs: 11 in the PuppeteerCrawler options. In the logs I still see the 30-second default timeout... Am I doing something wrong? I don't get why they aren't overridden...
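For comparison, this is where those two options are expected to live: at the top level of the crawler options, not inside launchContext or launchOptions. If an older apify-SDK version is also in play, the pre-Crawlee option names were different (e.g. handlePageTimeoutSecs), which can make the new names appear to be ignored. A minimal sketch:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    navigationTimeoutSecs: 10,     // budget for page.goto() / navigation
    requestHandlerTimeoutSecs: 11, // budget for the whole requestHandler
    async requestHandler({ page }) {
        // ...
    },
});
```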

Need help compiling crawlee in react

Hi everyone! I'm trying to integrate the Crawlee library inside my React app for a social-scraping project, but as soon as I import the PlaywrightCrawler module I get the following compile error: ``` ERROR in ./node_modules/@crawlee/browser/internals/browser-crawler.js Module build failed (from ./node_modules/react-scripts/node_modules/babel-loader/lib/index.js):...

Crawlee seems to be getting a cached version of a xml file

I'm starting my crawler with the first request being https://site.com/sitemap.xml. Then I read all the URLs in the sitemap and check the modified date (the website does update the modified date in the sitemap), and only crawl the pages that were modified. The problem is that the crawler in production does this once every hour, and it always gets the same version of the sitemap.xml. If I run it after a while on my PC, it finds modified URLs, crawls the pages, and gets the updates. I'm enqueuing the XML with await crawler.run([{url: "sitemap.xml", "label": "SITEMAP"}]); Is there a way to add headers and prevent caching here?...
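One pragmatic workaround to sketch (not a confirmed fix for this deployment): make each run's sitemap URL unique with a cache-busting query parameter, so intermediate caches, and the request queue's URL-based deduplication, see a fresh URL every hour. Request headers such as Cache-Control can additionally be set in a pre-navigation hook (e.g. via page.setExtraHTTPHeaders() in Puppeteer).

```js
// Pure helper: append a timestamped query parameter to the sitemap URL.
const bustCache = (url, now = Date.now()) => {
    const u = new URL(url);
    u.searchParams.set('_ts', String(now));
    return u.toString();
};

// Usage sketch inside a Crawlee project:
// await crawler.run([{ url: bustCache('https://site.com/sitemap.xml'), label: 'SITEMAP' }]);
```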

Puppeteer - Intercept request, modify its response body and respond() with the modified body.

Has anyone done this in Puppeteer? With Playwright it's quite straightforward, but I'm not able to get the response in Puppeteer's request interception.
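A sketch of one way to do it in plain Puppeteer (the target path and the string replacement are placeholders): interception fires before any response exists, so fetch the resource yourself and answer with request.respond().

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', async (request) => {
    if (!request.url().endsWith('/app.js')) return request.continue(); // placeholder target
    // Fetch the original resource ourselves (Node 18+ global fetch)...
    const upstream = await fetch(request.url(), { headers: request.headers() });
    // ...modify the body...
    const body = (await upstream.text()).replace('foo', 'bar'); // placeholder edit
    // ...and fulfill the intercepted request with the modified payload.
    await request.respond({
        status: upstream.status,
        contentType: upstream.headers.get('content-type') ?? 'application/javascript',
        body,
    });
});

await page.goto('https://website.com/');
await browser.close();
```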

Overriding request response for images

Hey, I want to override an image request's response; how could I do that with Puppeteer? Playwright has a neat field that lets me include files from my local machine, but with Puppeteer I can't make it work....
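Puppeteer has no direct counterpart to Playwright's fulfill-from-file option, but request.respond() accepts a Buffer, so reading the local file yourself gets the same effect (a sketch; the file path is hypothetical):

```js
import { readFile } from 'node:fs/promises';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', async (request) => {
    if (request.resourceType() !== 'image') return request.continue();
    // Serve every image from a local placeholder instead of the network.
    await request.respond({
        contentType: 'image/png',
        body: await readFile('./placeholder.png'), // hypothetical local file
    });
});

await page.goto('https://website.com/');
await browser.close();
```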

Need help with Crawlee

I am getting the following error when crawling

Set 'ignoreHTTPSErrors' on a PlaywrightCrawler

Hi everyone, I need to set the ignoreHTTPSErrors flag for a Playwright crawler. `const crawler = new PlaywrightCrawler({ launchContext: {...
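In Playwright itself, ignoreHTTPSErrors is a browser-context option rather than a launch option, which may be why setting it in launchContext has no effect. One avenue worth trying (a sketch; verify the hook behavior against your Crawlee version) is a browser-pool prePageCreateHook, which can amend the options each new page/context is created with:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        prePageCreateHooks: [
            (pageId, browserController, pageOptions) => {
                // pageOptions feed into the new page's context creation
                if (pageOptions) pageOptions.ignoreHTTPSErrors = true;
            },
        ],
    },
    async requestHandler({ page }) {
        // ...
    },
});
```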

Docusaurus crawler

Hi all! New here, Crawlee looks like an awesome tool. I'm currently building a Docusaurus site crawler but wanted to ask around and see if anyone knows of an existing implementation before I go and reinvent something. If an existing implementation doesn't exist, I'd be happy to open source my own!

Tracking time

Is there a convenient way to track the time spent on different functionalities in one crawl? I'm trying to use the winston logger's profile functionality, but I'm just not sure if it does it correctly. Apart from crawling and scraping, my app also does data transformation after scraping and inserts the results into a DB. I'm trying to track the time spent on each segment of this process.
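If winston's profiler feels opaque, a logger-agnostic alternative is a few lines of plain JS (a sketch; the segment names in the usage note are placeholders): accumulate wall-clock time per named segment and report the totals at the end.

```js
// Minimal per-segment timer: wrap each phase in measure() and read the
// accumulated milliseconds from report() after the run.
const createSegmentTimer = () => {
    const totals = new Map();
    return {
        async measure(name, fn) {
            const start = performance.now();
            try {
                return await fn();
            } finally {
                totals.set(name, (totals.get(name) ?? 0) + performance.now() - start);
            }
        },
        report: () => Object.fromEntries(totals),
    };
};

// Usage sketch:
// const timer = createSegmentTimer();
// await timer.measure('scrape', () => crawler.run(startUrls));
// await timer.measure('transform', () => transform(rows));
// console.table(timer.report());
```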

bind launch-context(timezone,locale) with proxy

Dear all, I have a bunch of proxies. Cloudflare and many other anti-bot protections check the IP address against the timezone, and I could see that because of this discrepancy my crawler is being detected as a bot. How can I bind a launch context that has the correct timezone and locale to a proxy + browser? I have seen the proxy configuration, but how can I tell the browser being launched to use a specific context? Thanks in advance....
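A sketch of the binding with plain Playwright (the proxy servers, locales, and timezones below are made-up placeholders; mapping each proxy's exit IP to its real timezone is up to you): pick the proxy and create the browser context with the matching locale/timezoneId in one place, so the IP and the fingerprint always agree.

```js
import { chromium } from 'playwright';

// Each proxy entry carries the locale/timezone of its exit IP (placeholders).
const proxies = [
    { server: 'http://us-proxy.example.com:8000', locale: 'en-US', timezoneId: 'America/New_York' },
    { server: 'http://de-proxy.example.com:8000', locale: 'de-DE', timezoneId: 'Europe/Berlin' },
];

// Pick one entry; its proxy and fingerprint settings travel together.
const { server, locale, timezoneId } = proxies[Math.floor(Math.random() * proxies.length)];
const browser = await chromium.launch({ proxy: { server } });
const context = await browser.newContext({ locale, timezoneId });
const page = await context.newPage();
await page.goto('https://website.com/');
await browser.close();
```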

Handling HTML structures of different websites

Hey, I want to scrape multiple e-commerce web shops that have different HTML structures. I was thinking about making a handler for each shop, letting each shop's HTML be scraped on its own. Eventually all sites should produce roughly similar data, such as price, title, in-stock sizes, etc. This is necessary because the data must then be processed, requiring each product to meet the schema....
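One Crawlee-native shape for this (a sketch; the labels, selectors, and schema fields are placeholders): a labeled router with one handler per shop, each mapping its own markup into the shared product schema.

```js
import { createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Shop A's markup (placeholder selectors) -> shared schema
router.addHandler('SHOP_A', async ({ page, pushData }) => {
    await pushData({
        title: await page.locator('h1.product-name').textContent(),
        price: await page.locator('.price').textContent(),
    });
});

// Shop B's markup (placeholder selectors) -> the same shared schema
router.addHandler('SHOP_B', async ({ page, pushData }) => {
    await pushData({
        title: await page.locator('[data-test="title"]').textContent(),
        price: await page.locator('[data-test="price"]').textContent(),
    });
});

// Wire it up and label each start request with its shop:
// const crawler = new PlaywrightCrawler({ requestHandler: router });
// await crawler.run([{ url: 'https://shop-a.example.com/p/1', label: 'SHOP_A' }]);
```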

Dockerize in new container

I'm trying to Dockerize a Crawlee Puppeteer project, but I'm stuck with it not finding Chromium. This is the current Dockerfile I'm using: ``` # Specify the base Docker image. You can read more about the available images at https://crawlee.dev/docs/guides/docker-images...
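The usual way around the missing-Chromium problem is to start from an image that already bundles a matching browser. A minimal sketch based on Apify's public Puppeteer image (the tag is an assumption; pick one matching your Node version):

```dockerfile
# Base image ships Node plus a Chromium build matched to Puppeteer.
# See https://crawlee.dev/docs/guides/docker-images for the available tags.
FROM apify/actor-node-puppeteer-chrome:20

# Install production dependencies first to keep layer caching effective.
COPY package*.json ./
RUN npm --quiet set progress=false && npm install --omit=dev

# Copy the rest of the source and run.
COPY . ./
CMD npm start --silent
```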

Python SDK for Crawlee?

I can see that a Python SDK for Apify has been released. Is a Python SDK also planned for Crawlee, with the same functionality as the JavaScript/TypeScript version (Cheerio and Playwright)?