Crawlee & Apify



This is the official developer community of Apify and Crawlee.


Channels: crawlee-js, apify-platform, crawlee-python, 💻hire-freelancers, 🚀actor-promotion, 💫feature-request, 💻devs-and-apify, 🗣general-chat, 🎁giveaways, programming-memes, 🌐apify-announcements, 🕷crawlee-announcements, 👥community

How to wait for the browser to close, like Playwright's await browser.close()?

Cannot EnqueueLinks with Globs

The crawler starts with the sitemap.xml of a website, and I'm trying to enqueue all the links inside the XML with globs: ```await enqueueLinks({ globs: ["https://website.com/product/*"], });```...
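A possible starting point for this one (a plain-JS sketch, not a confirmed fix; the `https://website.com/product/` prefix and the `PRODUCT` label are assumptions taken from the question): a sitemap contains no `<a>` tags, so the default link extraction in `enqueueLinks()` has nothing to match the globs against. Extracting the `<loc>` entries yourself and enqueuing them via `crawler.addRequests()` sidesteps that.

```js
// Pull <loc> URLs out of sitemap XML and keep only those under a prefix.
const extractSitemapUrls = (xml, prefix) =>
    [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)]
        .map((m) => m[1])
        .filter((url) => url.startsWith(prefix));

// Usage sketch inside a Crawlee requestHandler:
// const urls = extractSitemapUrls(body.toString(), 'https://website.com/product/');
// await crawler.addRequests(urls.map((url) => ({ url, label: 'PRODUCT' })));
```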

How to prevent following redirects to other domains?

I see there is a way to prevent this once the page loads, with something like this: ```js await page.setRequestInterception(true); page.on('request', async (request) => {...```
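One way to extend that interception idea so redirects cannot leave the starting domain (a sketch against plain Puppeteer; `website.com` is a stand-in host): abort any navigation request whose hostname differs from the allowed one.

```js
import puppeteer from 'puppeteer';

const ALLOWED_HOST = 'website.com'; // stand-in for the starting domain

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', (request) => {
    const { hostname } = new URL(request.url());
    // Abort top-level navigations that would land on another domain; a
    // server-side redirect surfaces here as a new navigation request too.
    if (request.isNavigationRequest() && hostname !== ALLOWED_HOST) {
        return request.abort();
    }
    return request.continue();
});

await page.goto('https://website.com/');
await browser.close();
```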

ERROR BrowserPool: Failed to close context.

Hi, I noticed that I get this error, here is my configuration: const crawler = new PuppeteerCrawler({...

Setting cookies is failing

Error: ``` /node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:330 error: new Errors_js_1.ProtocolError(), ^...

How to retry failed requests after the queue has "ended"?

I just depleted my proxy quota and all the remaining requests in the queue failed. Something similar happens often; how do I retry/re-enqueue the failed requests? I've been googling it for a while now, and there's hardly any up-to-date info, only bits and pieces from older versions, closed GitHub issues, etc....
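One self-contained pattern worth sketching (the requeue cap and `userData` bookkeeping are my own additions, not a documented recipe): re-enqueue from `failedRequestHandler` with a fresh `uniqueKey`, since the queue deduplicates by `uniqueKey` and would otherwise silently skip the re-added request.

```js
import { PuppeteerCrawler } from 'crawlee';

const MAX_REQUEUES = 1; // hypothetical cap so a dead proxy can't loop forever

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // ...scrape as usual...
    },
    // Runs once a request has exhausted maxRequestRetries.
    async failedRequestHandler({ request }) {
        const round = (request.userData.requeueRound ?? 0) + 1;
        if (round > MAX_REQUEUES) return;
        await crawler.addRequests([{
            url: request.url,
            label: request.label,
            userData: { ...request.userData, requeueRound: round },
            uniqueKey: `${request.uniqueKey}#requeue-${round}`, // bypass deduplication
        }]);
    },
});

await crawler.run(['https://website.com/']);
```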

requestHandlerTimeout and navigationTimeout not respected

In my main.js, I've set navigationTimeoutSecs: 10 and requestHandlerTimeoutSecs: 11 in the PuppeteerCrawler options. In the logs I still see the 30-second default timeout... Am I doing something wrong? I don't get why they aren't overridden...
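For comparison, this is where those two options are expected to live: at the top level of the crawler options, not inside launchContext or launchOptions. If an older apify-SDK version is also in play, the pre-Crawlee option names were different (e.g. handlePageTimeoutSecs), which can make the new names appear to be ignored. A minimal sketch:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    navigationTimeoutSecs: 10,     // budget for page.goto() / navigation
    requestHandlerTimeoutSecs: 11, // budget for the whole requestHandler
    async requestHandler({ page }) {
        // ...
    },
});
```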

Need help compiling crawlee in react

Hi everyone! I'm trying to integrate the Crawlee library inside my React app for a social-scraping project, but as soon as I import the PlaywrightCrawler module I get the following compile error: ``` ERROR in ./node_modules/@crawlee/browser/internals/browser-crawler.js Module build failed (from ./node_modules/react-scripts/node_modules/babel-loader/lib/index.js):...

Crawlee seems to be getting a cached version of a xml file

I'm starting my crawler with the first request being https://site.com/sitemap.xml. Then I read all the URLs in the sitemap and check the modified date (the website does update the modified date in the sitemap), and only crawl the pages that were modified. The problem is that the crawler in production does this once every hour, and it always gets the same version of the sitemap.xml. If I run it after a while on my PC, it finds modified URLs, crawls the pages, and gets the updates. I'm enqueuing the XML with await crawler.run([{url: "sitemap.xml", "label": "SITEMAP"}]); Is there a way to add headers and prevent caching here?...
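One pragmatic workaround to sketch (not a confirmed fix for this deployment): make each run's sitemap URL unique with a cache-busting query parameter, so intermediate caches, and the request queue's URL-based deduplication, see a fresh URL every hour. Request headers such as Cache-Control can additionally be set in a pre-navigation hook (e.g. via page.setExtraHTTPHeaders() in Puppeteer).

```js
// Pure helper: append a timestamped query parameter to the sitemap URL.
const bustCache = (url, now = Date.now()) => {
    const u = new URL(url);
    u.searchParams.set('_ts', String(now));
    return u.toString();
};

// Usage sketch inside a Crawlee project:
// await crawler.run([{ url: bustCache('https://site.com/sitemap.xml'), label: 'SITEMAP' }]);
```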

Puppeteer - Intercept request, modify its response body and respond() with the modified body.

Has anyone done this in Puppeteer? With Playwright it's quite straightforward, but I'm not able to get the response in Puppeteer's request interception.
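A sketch of one way to do it in plain Puppeteer (the target path and the string replacement are placeholders): interception fires before any response exists, so fetch the resource yourself and answer with request.respond().

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', async (request) => {
    if (!request.url().endsWith('/app.js')) return request.continue(); // placeholder target
    // Fetch the original resource ourselves (Node 18+ global fetch)...
    const upstream = await fetch(request.url(), { headers: request.headers() });
    // ...modify the body...
    const body = (await upstream.text()).replace('foo', 'bar'); // placeholder edit
    // ...and fulfill the intercepted request with the modified payload.
    await request.respond({
        status: upstream.status,
        contentType: upstream.headers.get('content-type') ?? 'application/javascript',
        body,
    });
});

await page.goto('https://website.com/');
await browser.close();
```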

Overriding request response for images

Hey, I want to override an image request's response; how could I do that with Puppeteer? Playwright has a neat field that lets me include files from my local machine, but with Puppeteer I can't make it work....
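Puppeteer has no direct counterpart to Playwright's fulfill-from-file option, but request.respond() accepts a Buffer, so reading the local file yourself gets the same effect (a sketch; the file path is hypothetical):

```js
import { readFile } from 'node:fs/promises';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', async (request) => {
    if (request.resourceType() !== 'image') return request.continue();
    // Serve every image from a local placeholder instead of the network.
    await request.respond({
        contentType: 'image/png',
        body: await readFile('./placeholder.png'), // hypothetical local file
    });
});

await page.goto('https://website.com/');
await browser.close();
```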

Need help with Crawlee

I am getting the following error when crawling

Set 'ignoreHTTPSErrors' on a PlaywrightCrawler

Hi everyone, I need to set the ignoreHTTPSErrors flag for a Playwright crawler. `const crawler = new PlaywrightCrawler({ launchContext: {...
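In Playwright itself, ignoreHTTPSErrors is a browser-context option rather than a launch option, which may be why setting it in launchContext has no effect. One avenue worth trying (a sketch; verify the hook behavior against your Crawlee version) is a browser-pool prePageCreateHook, which can amend the options each new page/context is created with:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        prePageCreateHooks: [
            (pageId, browserController, pageOptions) => {
                // pageOptions feed into the new page's context creation
                if (pageOptions) pageOptions.ignoreHTTPSErrors = true;
            },
        ],
    },
    async requestHandler({ page }) {
        // ...
    },
});
```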

Docusaurus crawler

Hi all! New here, Crawlee looks like an awesome tool. I'm currently building a Docusaurus site crawler but wanted to ask around and see if anyone knows of an existing implementation before I go and reinvent something. If an existing implementation doesn't exist, I'd be happy to open source my own!

Tracking time

Is there a convenient way to track the time spent on different functionalities in one crawl? I'm trying to use the winston logger's profile functionality, but I'm just not sure if it does it correctly. Apart from crawling and scraping, my app also does data transformation after scraping and inserts the results into a DB. I'm trying to track the time spent on each segment of this process.
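If winston's profiler feels opaque, a logger-agnostic alternative is a few lines of plain JS (a sketch; the segment names in the usage note are placeholders): accumulate wall-clock time per named segment and report the totals at the end.

```js
// Minimal per-segment timer: wrap each phase in measure() and read the
// accumulated milliseconds from report() after the run.
const createSegmentTimer = () => {
    const totals = new Map();
    return {
        async measure(name, fn) {
            const start = performance.now();
            try {
                return await fn();
            } finally {
                totals.set(name, (totals.get(name) ?? 0) + performance.now() - start);
            }
        },
        report: () => Object.fromEntries(totals),
    };
};

// Usage sketch:
// const timer = createSegmentTimer();
// await timer.measure('scrape', () => crawler.run(startUrls));
// await timer.measure('transform', () => transform(rows));
// console.table(timer.report());
```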

bind launch-context(timezone,locale) with proxy

Dear all, I have a bunch of proxies. Cloudflare and many other anti-bot protections check the IP address against the timezone, and I could see that because of this discrepancy my crawler is being detected as a bot. How can I bind a launch context that has the correct timezone and locale to a proxy + browser? I have seen the proxy configuration, but how can I tell the browser being launched to use a specific context? Thanks in advance....
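A sketch of the binding with plain Playwright (the proxy servers, locales, and timezones below are made-up placeholders; mapping each proxy's exit IP to its real timezone is up to you): pick the proxy and create the browser context with the matching locale/timezoneId in one place, so the IP and the fingerprint always agree.

```js
import { chromium } from 'playwright';

// Each proxy entry carries the locale/timezone of its exit IP (placeholders).
const proxies = [
    { server: 'http://us-proxy.example.com:8000', locale: 'en-US', timezoneId: 'America/New_York' },
    { server: 'http://de-proxy.example.com:8000', locale: 'de-DE', timezoneId: 'Europe/Berlin' },
];

// Pick one entry; its proxy and fingerprint settings travel together.
const { server, locale, timezoneId } = proxies[Math.floor(Math.random() * proxies.length)];
const browser = await chromium.launch({ proxy: { server } });
const context = await browser.newContext({ locale, timezoneId });
const page = await context.newPage();
await page.goto('https://website.com/');
await browser.close();
```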

Handling HTML structures of different websites

Hey, I want to scrape multiple e-commerce web shops that have different HTML structures. I was thinking about making a handler for each shop, letting each shop's HTML be scraped on its own. Eventually all sites should produce roughly similar data, such as price, title, in-stock sizes, etc. This is necessary because the data must then be processed, requiring each product to meet the schema....
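One Crawlee-native shape for this (a sketch; the labels, selectors, and schema fields are placeholders): a labeled router with one handler per shop, each mapping its own markup into the shared product schema.

```js
import { createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Shop A's markup (placeholder selectors) -> shared schema
router.addHandler('SHOP_A', async ({ page, pushData }) => {
    await pushData({
        title: await page.locator('h1.product-name').textContent(),
        price: await page.locator('.price').textContent(),
    });
});

// Shop B's markup (placeholder selectors) -> the same shared schema
router.addHandler('SHOP_B', async ({ page, pushData }) => {
    await pushData({
        title: await page.locator('[data-test="title"]').textContent(),
        price: await page.locator('[data-test="price"]').textContent(),
    });
});

// Wire it up and label each start request with its shop:
// const crawler = new PlaywrightCrawler({ requestHandler: router });
// await crawler.run([{ url: 'https://shop-a.example.com/p/1', label: 'SHOP_A' }]);
```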

Dockerize in new container

I'm trying to Dockerize a Crawlee Puppeteer project, but I'm stuck with it not finding Chromium. This is the current Dockerfile I'm using: ``` # Specify the base Docker image. You can read more about the available images at https://crawlee.dev/docs/guides/docker-images...
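The usual way around the missing-Chromium problem is to start from an image that already bundles a matching browser. A minimal sketch based on Apify's public Puppeteer image (the tag is an assumption; pick one matching your Node version):

```dockerfile
# Base image ships Node plus a Chromium build matched to Puppeteer.
# See https://crawlee.dev/docs/guides/docker-images for the available tags.
FROM apify/actor-node-puppeteer-chrome:20

# Install production dependencies first to keep layer caching effective.
COPY package*.json ./
RUN npm --quiet set progress=false && npm install --omit=dev

# Copy the rest of the source and run.
COPY . ./
CMD npm start --silent
```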

Python SDK for Crawlee?

I can see that a Python SDK for Apify has been released. Is a Python SDK also planned for Crawlee, with the same functionality as the JavaScript/TypeScript version (Cheerio and Playwright)?