Hey I think someone is attacking our DOs

D
DanTheGoodman244d ago
anyone have ideas on how to verify this? we haven't deployed any code for months on DOs or things that connect to DOs IIRC, certainly not in the last ~40 min. I can't even tell what DO it's from
D
DanTheGoodman244d ago
the log isn't particularly helpful either...
D
DanTheGoodman244d ago
there isn't even an exception in the exception log
D
DanTheGoodman244d ago
oh yeah that's weird... had to be an attack
[image attached]
D
DanTheGoodman244d ago
happened again in a burst, does not seem to be a single object so I might guess it's an infra issue on CF's side?
D
DanTheGoodman244d ago
added the ID of the object, not all the same, very high cardinality
[image attached]
D
DanTheGoodman244d ago
we are getting hundreds of thousands of these, tens per second
D
DanTheGoodman244d ago
[image attached]
M
milan244d ago
Can you DM me your account ID, I can take a look to see if something is weird
D
DanTheGoodman244d ago
yep dm'd, discord might have suppressed the DM notification @milan_cf
D
DanTheGoodman244d ago
had another burst a bit ago too
[image attached]
M
milan244d ago
I got the dm, looking. This started about an hour ago? 18:30 utc or so? Ah no I see, from 17:50 onwards, definitely see an increase in # invocations
D
DanTheGoodman244d ago
Yeah just before then. And there’s no reason any individual durable object should be getting even within 20 or 30% of that connection limit. Or rather 20% of that limit. Largest we know of would be 7k peak
D
DanTheGoodman244d ago
Alerts Cloudflare log push to a Grafana stack
D
DanTheGoodman244d ago
another one
[image attached]
M
milan244d ago
Yeah, I see a linear increase in the number of websocket connections to some of your DOs
D
DanTheGoodman244d ago
how high is the cardinality? Seemed pretty high from the logs when I added the IDs to the urls in the workers (query param). I also checked twitch and none of the 4 people above that viewer count are using our tools
M
milan244d ago
Monitoring of ws hibernation isn't really good enough to know how many DO instances were hitting that limit. I can see there were about 900 instances between 17:50 UTC and 20:00 UTC (so the last 2 hours), most of which received very few messages: a bit over 700 of them received 10 or fewer
D
DanTheGoodman244d ago
that's expected
M
milan244d ago
How many connections are you generally expecting to get to a single DO instance? How are you connecting to the DO?
D
DanTheGoodman244d ago
@milan_cf typically <100, but some can be in the thousands. We are connecting through browser websockets
D
DanTheGoodman244d ago
ah more
[image attached]
D
DanTheGoodman244d ago
@milan_cf any luck? It's pretty constant right now
M
milan244d ago
We don't think the issue is with our infrastructure. There is a significant increase in invocations to your durable object namespace starting at 17:50 UTC, and it's been sustained for a while. It's likely that someone is opening a lot of websocket connections to your DOs and forcing you to hit the connection limit.
D
DanTheGoodman244d ago
Gotcha, so perhaps an attack then, because we pushed no code that modifies how we connect to clients. In my checking of the pretty unhelpful exception logs I do see they are quite spread out among US and EU, but a large number are coming from eastern EU. Is there any way to check that? Logpush does not give us that info, and this exception happens before our code runs it seems. Unless maybe we can get that info in the worker before connecting to the DO?
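On reading origin information in the Worker before the request reaches the DO: a minimal sketch, assuming the Worker can read the `request.cf` object (which carries fields such as `country`, `colo`, and `asn`) and log them next to the object ID. The `CfInfo` and `OriginFields` interfaces and the `originLogFields` helper are hypothetical names used for illustration, not code from this thread.

```typescript
// Subset of the `request.cf` fields used here; declared locally so the
// sketch type-checks without the Workers runtime types.
interface CfInfo {
  country?: string;
  colo?: string;
  asn?: number;
}

// Flat, string-only fields that are easy to filter on in a log pipeline.
interface OriginFields {
  objectId: string;
  country: string;
  colo: string;
  asn: string;
}

// Build the fields to log before the request is forwarded to the DO.
// `cf` may be undefined (for example in local development), so every
// field falls back to "unknown".
function originLogFields(cf: CfInfo | undefined, objectId: string): OriginFields {
  const asn = cf?.asn;
  return {
    objectId,
    country: cf?.country ?? "unknown",
    colo: cf?.colo ?? "unknown",
    asn: asn === undefined ? "unknown" : String(asn),
  };
}
```

In a Worker this might be called with `request.cf` and the object ID as a string just before the upgrade request is forwarded, with the result written to the log.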
M
milan244d ago
"this exception happens before our code runs it seems"
You mean logpush or the DO code? This exception should be from acceptWebSocket() throwing in the DO
D
DanTheGoodman244d ago
see this: there's no exception in the exceptions array. maybe that's not the right exception, but then what is that exception lol
M
milan244d ago
Where did you get the 32k connection limit exception then?
D
DanTheGoodman244d ago
logpush. that message from the screenshot is literally all logpush was sending, so I looked into the function call logs for more info
D
DanTheGoodman244d ago
[image attached]
D
DanTheGoodman244d ago
it also comes in waves completely uncharacteristic of any behavior from normal users of our app
D
DanTheGoodman244d ago
and it seems a bit too constant to be our users
[image attached]
M
milan244d ago
not sure why there's no exception there... I can ask around tomorrow (everyone is currently out for the day). I still think this is some sort of attack, but we definitely need to improve our hibernatable ws monitoring. It's probably worth wrapping acceptWebSocket in a try/catch, or tracking how many ws you have connected and refusing to allow more to connect if you're near the limit (to avoid errors).
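A minimal sketch of both suggestions above, assuming the object's state exposes `getWebSockets()` and `acceptWebSocket()` as in the hibernation API. The `HibernationState` interface is a local stand-in for the runtime type, and `tryAcceptWebSocket`, `AcceptResult`, and the soft limit of 30000 are hypothetical choices for illustration (the thread only mentions a "32k" limit).

```typescript
// Local stand-in for the parts of the Durable Object state used here,
// so the sketch type-checks without the Workers runtime types.
interface HibernationState<S> {
  getWebSockets(): S[];
  acceptWebSocket(ws: S, tags?: string[]): void;
}

type AcceptResult =
  | { ok: true }
  | { ok: false; status: number; reason: string };

// Assumption: refuse a little below the "32k" limit mentioned in the thread.
const SOFT_LIMIT = 30000;

function tryAcceptWebSocket<S>(
  state: HibernationState<S>,
  ws: S,
  tags: string[],
  softLimit: number = SOFT_LIMIT,
): AcceptResult {
  // Refuse early when this object is already near the limit.
  const open = state.getWebSockets().length;
  if (open >= softLimit) {
    return { ok: false, status: 503, reason: `too many connections (${open})` };
  }
  try {
    state.acceptWebSocket(ws, tags);
    return { ok: true };
  } catch (err) {
    // Turn the throw into a result the caller can log and answer with.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, status: 503, reason };
  }
}
```

The caller can then log `reason` itself and return a response with `status` instead of letting the exception escape the object.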
D
DanTheGoodman244d ago
@milan_cf that's what I just pinged my team we are going to do (try/catch that and log our own error). @.hades32 fyi (my team). just added some try/catch and additional logging
D
DanTheGoodman244d ago
ugh perfect timing to stop...
[image attached]
D
DanTheGoodman244d ago
nothing in the logs still... trying to return a valid response to see if the exceptions go away
D
DanTheGoodman244d ago
does not seem to be that @milan_cf, as these errors are still blank
[image attached]
D
DanTheGoodman244d ago
idk what these exceptions even are...
try {
  this.state.acceptWebSocket(pair[1], [t]);
} catch (error) {
  console.error("error accepting websocket:", JSON.stringify(error), JSON.stringify(request.headers));
  throw error;
}
this doesn't seem to be firing. I think those exceptions are unrelated: we aren't getting the connection limit log right now, so that seems to be a second issue we've found @milan_cf
M
milan243d ago
probably stopped because of reload of all your DOs?
D
DanTheGoodman243d ago
No, it's still happening, been seeing it all night
M
milan243d ago
are you throwing an exception in your code somewhere?
D
DanTheGoodman243d ago
I added that code snippet above but it's never being reached, and there are no places that we are throwing an exception ourselves
D
DanTheGoodman243d ago
ok for some reason as of a few hours ago that error started throwing
[image attached]
D
DanTheGoodman243d ago
idk why that didn't show last night though
M
milan243d ago
It looks like almost all your DOs are returning only 201s, a small percentage returning 400s
D
DanTheGoodman243d ago
but it doesn't have any request headers? I'm fixing the log to get the headers, and the query
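One likely reason the log line shows no headers or error details: `JSON.stringify` on an `Error` yields `{}` because `message` and `stack` are not enumerable, and a `Headers` object typically serializes to `{}` as well. A minimal sketch of helpers that copy the values into plain objects first. `HeaderLike`, `headersToObject`, and `describeError` are hypothetical names, not the team's actual fix.

```typescript
// Anything with a Headers-style forEach (a Headers object or a Map).
interface HeaderLike {
  forEach(callback: (value: string, key: string) => void): void;
}

// Copy header entries into a plain object that JSON.stringify can print.
function headersToObject(headers: HeaderLike): Record<string, string> {
  const out: Record<string, string> = {};
  headers.forEach((value, key) => {
    out[key] = value;
  });
  return out;
}

// Pull the non-enumerable fields off an Error so they survive serialization.
function describeError(err: unknown): { name: string; message: string; stack: string } {
  if (err instanceof Error) {
    return { name: err.name, message: err.message, stack: err.stack ?? "" };
  }
  return { name: "NonError", message: String(err), stack: "" };
}
```

With these, the log call could pass `JSON.stringify(describeError(error))` and `JSON.stringify(headersToObject(request.headers))` instead of stringifying the raw objects.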
M
milan243d ago
Yeah so I went back a couple days and that namespace has only been responding with 201s and 400s, mostly 201s though. also be back in 10, I'm getting coffee
D
DanTheGoodman243d ago
no worries, it stopped a few min before I pushed out the update. I'll sanity check our connection code, but we have nobody above 2k live viewers right now so nobody should be even remotely near that limit.
I wonder if it's actually a side-effect of an attack on twitch, because it is through our twitch extension which gets loaded when the twitch page loads. I can see these error logs have our JWT from twitch.
I think we might have figured it out: it seems that when we navigate we open a new socket but do not close the old ones... for some reason the browser is keeping them around for 1-2 minutes now.
the rate of logs could be logpush throttling how fast they are sent, it looked like about 100/s which is the same limit that exists in the CF dashboard for viewing function logs
M
milan243d ago
Mind expanding on this? I'm not familiar with what the DOs are doing or how the client works and I'm curious
D
DanTheGoodman243d ago
@milan_cf Sure, basically we are using them as coordination. We use HTMX, so a navigation is replacing the component that connects via websockets. However for some reason that's not disconnecting, so we deployed what we think is a temporary fix. Basically I think every time our users did something they opened another socket
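A minimal client-side sketch of one way to avoid stacking sockets across navigations, assuming a single manager owns the connection and closes the previous socket before opening a new one. `SocketLike` and `SingleSocket` are hypothetical names. The thread does not show the temporary fix that was deployed, so this is not that code.

```typescript
// The only part of a WebSocket this sketch needs.
interface SocketLike {
  close(code?: number, reason?: string): void;
}

// Keeps at most one socket open. Opening a new one closes the old one first.
class SingleSocket<S extends SocketLike> {
  private current: S | null = null;
  private readonly open: (url: string) => S;

  constructor(open: (url: string) => S) {
    this.open = open;
  }

  connect(url: string): S {
    this.disconnect();
    this.current = this.open(url);
    return this.current;
  }

  disconnect(): void {
    if (this.current !== null) {
      // 1000 is the normal-closure code.
      this.current.close(1000, "replaced");
      this.current = null;
    }
  }
}
```

In a browser the factory passed to the constructor could simply create a `WebSocket` for the URL, and `disconnect()` could be called from whatever hook runs when the component is swapped out.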
M
milan243d ago
Did it fix the issue?
D
DanTheGoodman242d ago
@milan_cf The connection limit one yeah, but not entirely. I think that was one issue but I think there still is an attack. We added code to verify that the Twitch token was passed and is valid, and this error happens when no token is passed in
[image attached]
D
DanTheGoodman242d ago
Now we are rejecting it before accepting the socket, so that removes the connection limit issue, but it still seems like there are similar patterns. And our users are passing tokens (this is the last 12 hours). Perhaps these are viewers not logged in that are viewing though lol, but yeah, connection issue solved. But the pattern is just so constant, it doesn't feel like our users. No, twitch gives a token regardless, this doesn't seem to be our users
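A minimal sketch of rejecting before the socket is accepted, assuming the token has already been pulled out of the request and that some verifier for it exists. `gateUpgrade`, `GateResult`, and the `verifyToken` callback are hypothetical. How the token is transported and how the Twitch JWT is actually verified are not shown in this thread.

```typescript
type GateResult =
  | { allow: true }
  | { allow: false; status: 401 | 403; reason: string };

// Decide whether a websocket upgrade may proceed. Meant to run before any
// call that accepts the socket, so rejected requests never count toward
// the connection limit.
async function gateUpgrade(
  token: string | null,
  verifyToken: (token: string) => Promise<boolean>,
): Promise<GateResult> {
  if (token === null || token.trim() === "") {
    return { allow: false, status: 401, reason: "missing token" };
  }
  if (!(await verifyToken(token))) {
    return { allow: false, status: 403, reason: "invalid token" };
  }
  return { allow: true };
}
```

Logging `reason` for every rejection also gives a count of token-less attempts, which is the pattern described in the message above.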
M
milan240d ago
@danthegoodman we found a regression in the hibernation code regarding dispatching the close handler (+ dropping the websocket) upon client disconnects and we're investigating further. Not certain it's affecting you but I suspect it probably is, will keep you updated as I find more
D
DanTheGoodman240d ago
Appreciate the update! I wonder if the browser was not getting a close and thus kept reconnecting, as we never had this issue before hibernation
M
milan240d ago
GitHub
🐛 Bug Report — Runtime APIs: Hibernating WebSockets remain open whe...
Problem We started observing the following strange behaviour some time yesterday. when using Durable Objects Hibernating WebSockets (calling state.acceptWebSocket(socket), when connecting via web b...
M
milan239d ago
I think the release went out, have the issues been resolved?
D
DanTheGoodman239d ago
@milan_cf not sure, we added our own code to prevent this in the meantime that has worked for us so far
M
milan239d ago
Oh I thought you were still seeing problems regardless of that fix, my bad
D
DanTheGoodman239d ago
no worries lol, we had to do 2 fixes. were other users hitting this? or are we the largest websocket hibernation user 🤔
M
milan239d ago
Yeah, it seems like it hit a couple other folks as well. Unfortunately our test in CI didn't verify if the disconnect handler ran, it only confirmed that when it ran everything worked as expected. That coupled with lack of hibernatable ws monitoring made this tricky to confirm w/o reports from users. We fixed the test case + are working on making this class of bugs discoverable at compile time. Will also need to think about some monitoring and metrics for hibernatable websockets
D
DanTheGoodman239d ago
awesome
M
milan239d ago
Not the largest but definitely up there 🙂
D
DanTheGoodman239d ago
😎 ill take it
M
milan239d ago
Sorry for the trouble, and thanks for the detailed report. We haven't had a larger scale issue w/ hibernation so this will help us with our tooling going forward
D
DanTheGoodman239d ago
glad it's all sorted!