R2 errors

Interesting 🤔 Not sure if anyone else monitors a lot of data like I do when it comes to R2, but if anyone does, I'm curious whether you've also noticed that, starting 12/13, the error rate has risen a bit from what it used to be.
Sid•3y ago
Interesting, what kind of operations do you do? Are there certain ops that fail more than others? Do you know what these errors are? We have an SLO graph internally that isn’t going crazy so I’m curious what’s going on here
UnsmartOP•3y ago
So I do every operation type (get/put/delete) with files of varying size (0MB, 1MB, 5MB, 25MB). I have a load balancer health check set to all regions that calls an API, which does 2 operations: one to a US bucket and one to an EU bucket. After each operation completes or errors, I send latency/error data to AE for tracking purposes. The most common error I get is "Client Disconnect (10054)", which happens pretty equally across all operations. But I assume the client is still connected just fine, otherwise the data wouldn't be in AE, since it's only in AE if the error is handled by the request. A much less common error is "We encountered an internal error. Please try again. (10001)", which happens mostly on put operations, but not that often.
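A minimal sketch of the kind of probe described above, written in Python purely for illustration; the actual code behind this setup is not shown in the thread. The names `probe`, `ProbeResult`, `run`, and `record` are hypothetical. In the setup described, `run` would perform the R2 get/put/delete and `record` would write a data point to AE.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    bucket: str            # e.g. "us" or "eu" (labels are assumptions)
    operation: str         # "get", "put", or "delete"
    size_mb: int           # payload size used for this probe
    latency_ms: float      # wall-clock duration of the operation
    error: Optional[str]   # None on success, otherwise the error message


def probe(bucket: str, operation: str, size_mb: int,
          run: Callable[[], None],
          record: Callable[[ProbeResult], None]) -> ProbeResult:
    """Time one storage operation and record its outcome, success or failure."""
    start = time.perf_counter()
    error = None
    try:
        run()  # hypothetical callable that performs the real operation
    except Exception as exc:
        # The error is caught inside the request, so it can still be recorded.
        error = str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    result = ProbeResult(bucket, operation, size_mb, latency_ms, error)
    record(result)
    return result
```

Because the error is caught and recorded inside the same request, a data point only exists when the calling side was still alive to write it, which matches the reasoning above about the client still being connected.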
andrew•3y ago
This kind of thing fascinates me, since I do worry about things like increased error rates if I ever do an S3->R2 switchover... please do report any more findings back to the channel 😄 Thanks for doing that monitoring and reporting it
Sid•3y ago
Yeah I'm going to take a look at this (or at least get someone else to take a look 😛 ). Can you give me the account ID where you make these requests from? The client disconnect is what I want to look into a little more. The internal error you're seeing for PUTs might be due to concurrency, but with your account ID, hopefully I'll be able to see what the exact reason is
UnsmartOP•3y ago
Yeah sure, the account is dc941e8156f4a1336ca08481cb6d4222.
@sdnts just curious if this ever got looked at? I noticed a user complaining about elevated 500 error rates: https://discord.com/channels/595317990191398933/940663374377783388/1069076313517858888
I just wanted to say I also see another spike in error rates starting at around 2023-01-27 17:00:00 UTC.
Sid•3y ago
Hey, sorry, yeah, I just looked and I think I know what the problem is. I have a PR up, but we'll likely push it out on Monday since it's the weekend and this seems to be a small fraction of your requests. I will let you know when we do, though, so you can check if it helped.
UnsmartOP•3y ago
Sounds good and yeah definitely a small % 🙂
andrew•3y ago
@sdnts Just curious, did this end up getting pushed?
Sid•3y ago
It did actually, let me double check real quick if the errors I saw on Sunday are down too.
Okay, yeah, so the errors I was seeing earlier are down. I'll let Unsmart confirm if their error rate is down as well.
UnsmartOP•3y ago
So my error rate in the last 24 hours has dropped (image 1), but the overall error rate is at an even higher peak now than the jump that happened on 12/13, having jumped up again on 1/27 (image 2). Pre-12/13, the average was about 100 errors per 12 hours. From 12/13 to 1/26 it was about 900 per 12 hours. From 1/27 to 1/30 it's 1500-3000 per 12 hours.
UnsmartOP•3y ago
It looks like the error rate should go down to around 600-800 every 12 hours with the release that happened today, but that's still pretty far from the pre-12/13 average of 100 every 12 hours. I will say each 12-hour point represents about 300,000 operations, so the error rate is still extraordinarily low even with the recent jumps I'm seeing.
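To put those counts in proportion, here is a small worked calculation using only the figures quoted in this thread: errors per 12-hour window against roughly 300,000 operations per window. The function name and window labels are made up for the example.

```python
OPS_PER_WINDOW = 300_000  # approximate operations per 12-hour point, per the thread


def error_rate_percent(errors: int, operations: int = OPS_PER_WINDOW) -> float:
    """Return errors as a percentage of operations."""
    return 100.0 * errors / operations


# Error counts per 12 hours as quoted in the thread (labels are illustrative).
windows = {
    "before 12/13": 100,
    "12/13 to 1/26": 900,
    "1/27 to 1/30 (low)": 1500,
    "1/27 to 1/30 (high)": 3000,
    "expected after release (low)": 600,
    "expected after release (high)": 800,
}

for label, errors in windows.items():
    print(f"{label}: {error_rate_percent(errors):.3f}%")
```

That works out to roughly 0.03% before 12/13, 0.3% from 12/13 to 1/26, 0.5% to 1% from 1/27 to 1/30, and about 0.2% to 0.27% if the release brings errors down to 600-800 per window.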
UnsmartOP•3y ago
Over the last 6 hours, these are the top errors by operation type and error message: mostly client disconnects, followed by internal errors. (The top one, about the network connection being lost, can be ignored; that's a DO error that isn't included in the R2 graphs.)
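A breakdown like this can be sketched as a simple count over (operation, error message) pairs. This is only an illustration in Python, not the query actually used here; the function name `top_errors` is hypothetical and the sample events are invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

ErrorKey = Tuple[str, str]  # (operation type, error message)


def top_errors(events: Iterable[ErrorKey], limit: int = 5) -> List[Tuple[ErrorKey, int]]:
    """Count events per (operation, message) pair, most frequent first."""
    return Counter(events).most_common(limit)


# Invented sample events, for illustration only (not data from the thread).
sample = [
    ("put", "Client Disconnect (10054)"),
    ("get", "Client Disconnect (10054)"),
    ("put", "Client Disconnect (10054)"),
    ("put", "We encountered an internal error. Please try again. (10001)"),
]

for (operation, message), count in top_errors(sample):
    print(f"{count:>4}  {operation:<6}  {message}")
```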
Sid•3y ago
It's sorta curious to me that client disconnects are so common for you too. We see a lot of client disconnects on our end as well, but it's hard to track them down because they really could just be a dropped connection. In your case I suppose it means that we (R2) closed the connection, right?
UnsmartOP•3y ago
Yes that's correct. I only save data in AE if I actually get a response back during the request.
Unknown User•3y ago
Message Not Public
Sid•3y ago
Yeah sorry about that, it's all actively being investigated