Random drops / 3 sec response time

This issue has been going on for weeks now. At random intervals, the response time for pages goes from around 0.15 seconds to 3 seconds, causing huge slowdowns and drops in traffic. My own nginx request logs show exactly the same thing as Cloudflare's panel. The issue appears to be at L7 on Cloudflare's side. I have spent days debugging, and in short this is what I found out:
1) It happens to all my sites on Cloudflare, across all datacenters in the Netherlands.
2) I tried various things, such as disabling tiered caching and moving firewall rules, with no change. There is also nothing unusual in the WAF logs, and no DDoS or similar going on.
3) Bypassing Cloudflare and going directly to the server solves all issues during the drops.
4) This is not an issue with origin.
5) I opened ticket #3193135, but it's been EIGHT DAYS with NO RESPONSE.
[image attachment]
68 Replies
CosmosisT • 3mo ago
You're using Cloudflare as a proxy, and they have a set number of IPs that act as "shields" for your domain. It's possible one is being hit and/or not operating as fast due to high traffic or other issues. 3 seconds is not long if the average response is faster than that. Your direct host could serve it in fewer milliseconds for sure, but understand you're going through a proxy that's designed to firewall and protect you, and some of the nodes Cloudflare sends your way could have added latency for many reasons.
slayerduck • 3mo ago
An average page load going from 0.08 seconds to 3.22 seconds, causing traffic to drop by about 50% while it's going on, and this happening up to 3 times a day, is a very big deal.
CosmosisT • 3mo ago
Well, how is the server looking? High CPU? High network?
slayerduck • 3mo ago
read OP
CosmosisT • 3mo ago
Yes, I mean during these slowdowns, is the 3s lasting long? As in, is this 3s latency for full loads happening for minutes or hours?
slayerduck • 3mo ago
Yes, it's like 5 to 15 min at a time where page loads are 3 seconds.
CosmosisT • 3mo ago
During these stretches, are you monitoring your traffic, CPU and all sorts to make sure it's not CPU overhead?
slayerduck • 3mo ago
It's not origin.
CosmosisT • 3mo ago
Have you logged it with htop and other metrics? Origin may be getting smoked if it's all Cloudflare entries.
A single Cloudflare entry can be delayed, but by how you describe it, it sounds like it's 3s on ALL Cloudflare nodes users come through.
slayerduck • 3mo ago
I have smokeping, uptime monitoring, and constant curl requests going both directly to the server and through Cloudflare.
CosmosisT • 3mo ago
Do you know the results? Could you post them? Stuff like this:
[image attachment]
slayerduck • 3mo ago
No, because it will expose my IPs and infrastructure.
CosmosisT • 3mo ago
Well, there's mine, it's public facing anyway. You want to determine if the origin server is having issues, if say all the users CF points to your network are having 3s delays. A single user coming through CF could have that 3s delay, but if all users are getting it, your server is busy processing or something... I'd say monitor your server if you've got root access, run htop and maybe CBM.
slayerduck • 3mo ago
I think you misunderstand the scale here. I've got 14 servers, most of them running 64-core EPYCs, and it's happening to every datacenter I run in my country at the same time. The drop causes all CPU and network traffic to decrease, as it's being choked by Cloudflare. Like I said in the OP, bypassing CF shows no increase in page load times.
CosmosisT • 3mo ago
MTR for traces.
Well damn, if it's that saucy, you may have a clearer idea of what's happening.
slayerduck • 3mo ago
This isn't a home site. If you look at the OP, the number of requests is in the billions.
CosmosisT • 3mo ago
Oh, I can dig it. So is the 3 sec drop there for every Cloudflare IP? Have you logged/analysed that?
slayerduck • 3mo ago
But I also have smaller sites with CF, on the free plan, and they have the exact same issue at the same time. Even servers that don't run in the same datacenter.
CosmosisT • 3mo ago
No need to flex here, I'm helpful regardless. 🙂
slayerduck • 3mo ago
Yes, I'm running a curl loop with a 5 second delay: one going CF > my site and one going directly to my site, and only the one through Cloudflare is having issues.
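A rough sketch of the kind of side-by-side probe described here, written in Python rather than as the actual curl loop (which was not posted). Both URLs, the 1 second threshold, and the function names are assumptions for illustration; the second URL is assumed to be a DNS-only hostname that reaches the origin without going through the proxy.

```python
import http.client
import time
import urllib.request

# Hypothetical URLs: the first resolves through Cloudflare, the second is
# assumed to be a DNS-only hostname that reaches the origin directly.
PROXIED_URL = "https://www.example.com/"
DIRECT_URL = "https://origin.example.com/"
SLOW_SECONDS = 1.0  # assumed threshold; normal responses are around 0.1 s


def timed_get(url, timeout=10.0):
    """Return the seconds taken to fetch url, or None if the request failed."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except (OSError, http.client.HTTPException):
        return None
    return time.monotonic() - start


def classify(proxied, direct, slow=SLOW_SECONDS):
    """Label which path looks slow for one pair of samples (None = failed)."""
    proxied_slow = proxied is None or proxied >= slow
    direct_slow = direct is None or direct >= slow
    if proxied_slow and direct_slow:
        return "both"      # points at origin or a shared network problem
    if proxied_slow:
        return "proxied"   # only the path through the proxy is slow
    if direct_slow:
        return "direct"
    return "ok"


def probe_loop(rounds, delay=5.0):
    """Probe both paths `rounds` times, printing one timestamped line each."""
    for _ in range(rounds):
        p, d = timed_get(PROXIED_URL), timed_get(DIRECT_URL)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), p, d, classify(p, d))
        time.sleep(delay)  # the 5 second delay mentioned above
```

Calling `probe_loop(720)` would cover about an hour. A run of "proxied" lines with no "both" lines in the same window is the pattern reported in this thread.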
CosmosisT • 3mo ago
Try logging the results when you're getting these 3s delays. If it's some Cloudflare IPs, that's fine; if it's every Cloudflare connection, you may need to worry.
That doesn't help, you want a more dynamic approach.
slayerduck • 3mo ago
I can't log per Cloudflare IP to see which nodes at CF have the issue. I don't have that data, or a way to route traffic that way.
CosmosisT • 3mo ago
No, but you do know whether all users are having the 3s delay. If it's a single node or a few nodes delayed, that's A-OK! If all nodes have the 3s delay, you've got a server problem and need to audit.
slayerduck • 3mo ago
I could probe from outside my country to see, as that might take a different route, but that's as far as I can take it. But I have already debugged the issue to the point where I can say 100% that it's not an origin issue.
CosmosisT • 3mo ago
Well, of course you never want it to be an origin issue, or everyone using that server goes offline... That's why I say if it's 3s for every CF client, then you've got server issues; if it's the odd connection, that's fine. You need to confirm with audits: run htop and track the server when it's under load, see if the CPU is cranking at 100%, and check CBM to see if the network is cranked.
slayerduck • 3mo ago
Well, CF could say that, for example, the server has issues, or the datacenter ISPs have issues, and that this is why it fails, but none of that is the case. Because the connection doesn't flat out fail but just takes really long, there is also nothing in CF error analytics, or in any logs on origin, that tells me anything.
CosmosisT • 3mo ago
CF is not Jesus/God now. They won't turn coke into rum.
slayerduck • 3mo ago
If origin had issues, it would show in the logs. Or munin would report high CPU, or smokeping (which runs from a different datacenter) would report high latency.
CosmosisT • 3mo ago
You need to do better audits of your origin server if you have root access, compare them to these outages, and check whether the outage affects all users.
slayerduck • 3mo ago
But in case the primary server fails or blows up, it would automatically switch over to failover systems.
CosmosisT • 3mo ago
CF will reflect, but their proxy is designed to take the load off your server by caching and serving tons of data for you.
How do you know, though? I haven't seen a result. I have these any time I'm having outages or delays.
slayerduck • 3mo ago
How do I know what?
CosmosisT • 3mo ago
You have no auditing to prove anything, just your word. You puzzle us. I'm still unsure if everyone has 3s delays or if it's specific users (nodes). With no audit/analytics it's kind of a who-knows situation. With crazy services, no monitoring?
slayerduck • 3mo ago
I literally cannot test it, because it's happening somewhere on Cloudflare's servers and CF analytics do not report per-node issues.
CosmosisT • 3mo ago
You can test traceroutes. Your server is a sending point; if it's in maintenance mode it can send pings towards every safe point to tell you what's up.
slayerduck • 3mo ago
What are you talking about? Everything is monitored: CPU, mem, interrupts, SQL queries, disk stats, etc.
CosmosisT • 3mo ago
Test any IP to determine, AKA get you host compensation. So when there was an outage, did it affect all users, or a range?
slayerduck • 3mo ago
There are timestamps on all CF analytics data, because the drops are visible on all of them.
CosmosisT • 3mo ago
Traffic drop 50%
slayerduck • 3mo ago
And it's happening up to 3 times a day, at random intervals.
CosmosisT • 3mo ago
with the add of
You're just fluffing me, right? Yeah, I think you need to do a bit more auditing and analytics work with your servers to know better. This topic won't help you.
slayerduck • 3mo ago
please stop trolling
CosmosisT • 3mo ago
Afraid I've been trolled. Anyway, good luck, I hope you sort it out.
rootalien • 3mo ago
This problem exists in my country too, so it is not specific to your country. Some cf-ray IDs can cause this problem. If you have 100 users on your site, the problem occurs for 10 or 20 of them, but the others do not have it because they come in with a different ray ID. I've been trying to explain this for days, but everyone kept saying that the ISPs in the country I live in were blocking Cloudflare servers. No such thing. There is a problem with the Cloudflare Ray ID, and loading times are increasing. Cloudflare should take this as an issue and work to resolve it. There is no troll here, my friend, this is completely real.
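One way to check whether slow responses cluster on particular Cloudflare locations: the `CF-Ray` response header ends with a short code for the data center that handled the request (for example `230b030023ae2822-SJC`), so probe timings can be grouped by that suffix. A minimal sketch, assuming the probe already records the header value next to each timing; the function names and sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def colo_from_ray(cf_ray):
    """Return the data center suffix of a CF-Ray value, or None if absent."""
    ray_id, sep, colo = cf_ray.strip().rpartition("-")
    return colo.upper() if sep and ray_id and colo else None


def median_by_colo(samples):
    """samples: iterable of (cf_ray_header, seconds). Returns {colo: median}."""
    grouped = defaultdict(list)
    for cf_ray, seconds in samples:
        grouped[colo_from_ray(cf_ray) or "unknown"].append(seconds)
    return {colo: median(times) for colo, times in grouped.items()}


# Made-up sample data: two slow responses from one location, one fast elsewhere.
samples = [("aaaa-AMS", 3.0), ("bbbb-AMS", 4.0), ("cccc-FRA", 0.5)]
print(median_by_colo(samples))
```

If only some suffixes show multi-second medians, that supports the "some nodes" theory; if every suffix is slow at the same time, the cause is more likely something applied to the whole zone or account.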
slayerduck • 2mo ago
The reason I said you were trolling is your condescending tone. The only reason I'm actually able to accurately debug when and how it's happening is that I DO have analytics, provided by my munin and monit stats. They keep track of every single thing happening on all servers, including graphs of web requests, bandwidth, SQL requests, etc. Telling me to use 'htop' to check my servers, when it's clearly happening for only 15 min per day, is not a viable solution.
I talked to some of my IT people and they told me they don't think you're intentionally trolling, it's just your lack of knowledge. Knowing a little bit, but not enough, can be really dangerous. I think in the end this is a good learning experience for both of us: you learning that not everybody hosts their site from their garage, and me learning from the mistake of not checking the level of technical knowledge on offer for this type of issue. I was just really desperate because of how long this had been going on.
I'm happy to report, though, that Cloudflare fixed it. I will paste the response, @rootalien, in case this is also related to your issue.
On our end we had deployed some rules to mitigate large amounts of high bandwidth traffic from individual customers (targeting abuse of the platform, e.g. streaming video) which we have been confident in for quite some time now, but we had forgotten a condition to only include certain requests with responses in our analysis of traffic that were above a certain size in a recent change. This caused the system to look at more than the target abuse and some mitigation spilled over to your site. This has now been fixed with some adjustments to the system.
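To make the described mistake concrete, here is a purely illustrative sketch of how dropping a response-size condition widens the set of sites that enter an abuse analysis. Nothing here reflects Cloudflare's real system: the `Request` type, the 50 MiB threshold, and the site names are all hypothetical.

```python
from dataclasses import dataclass

# Purely illustrative: every name and number here is hypothetical and says
# nothing about how Cloudflare's real system is built.
LARGE_RESPONSE_BYTES = 50 * 1024 * 1024


@dataclass
class Request:
    site: str
    response_bytes: int


def sites_in_analysis(requests, require_large_response):
    """Return the set of sites whose traffic enters the abuse analysis."""
    return {
        r.site
        for r in requests
        if not require_large_response or r.response_bytes >= LARGE_RESPONSE_BYTES
    }


reqs = [
    Request("video.example", 200 * 1024 * 1024),  # large streaming-style response
    Request("pages.example", 30_000),             # ordinary small page
]
print(sites_in_analysis(reqs, require_large_response=True))   # intended scope
print(sites_in_analysis(reqs, require_large_response=False))  # condition forgotten
```

With the condition in place only the large-response site is analysed; without it, the ordinary site is swept in as well, which matches the "spilled over" wording in the reply.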
rezabet • 2mo ago
Well, this was an interesting read 😅
rootalien • 2mo ago
Thank you very much for your message. Where did you read this?
slayerduck • 2mo ago
This was sent to me on my ticket.
rootalien • 2mo ago
Can you send me the link to your ticket?
slayerduck • 2mo ago
You can't see it, it's not public.
rootalien • 2mo ago
Hmm OK, how about a screenshot? It's not that I don't believe you; this problem was really a headache. I was on vacation for a few days and couldn't deal with the problem. Even though I used Cloudflare Workers, I was having problems. And unfortunately no one took me seriously here 😄
slayerduck • 2mo ago
I know how you feel... as you most likely saw in this topic :facepalm:
rootalien • 2mo ago
ahahah 😄
slayerduck • 2mo ago
[image attachment]
rootalien • 2mo ago
I hope the problem has been solved. I will observe this when I return from holiday on Friday and let you know here.
slayerduck • 2mo ago
Yeah, I hope this was the cause of your issue too 🙏
rootalien • 2mo ago
I hope so, thank you very much. 🫂
slayerduck • 2mo ago
I can't express in words how frustrating this issue was. You'd think that with multiple sites in multiple datacenters all having the issue at the same time, it'd be easier to get away from blaming origin, but it turns out that wasn't true. Then I upgraded my account to Business in the hope of getting faster support, as this issue had been going on for over a month. But then the billing issue prevented my account from being upgraded, even though I paid. Actually, it's still not upgraded to this day, as the billing issue is STILL ongoing. So I'm stuck in a loop of not being able to get support, because Pro support takes forever and my account isn't on Business because of the billing issue. It's truly a catch-22.
rootalien • 2mo ago
I use workers, maybe I pay 100 dollars a month, but I was still limited. As far as I know, Workers is a rival system to cloudfront.net and azureedge.net. However, as we saw, ToS was coming and uploads were increasing significantly. I have said many times that this increases when there are Champions League matches.
slayerduck • 2mo ago
Yeah, I don't touch Workers, I just use my own hardware for anything processing related.
rootalien • 2mo ago
Unfortunately the problem persists. I think Cloudflare only solved it for your website.
Akama Aka @ DoKomi • 2mo ago
Still an issue?
rootalien • 2mo ago
Yes, when demand increases, CF loading times reach 20 to 40 seconds. Unfortunately, this happens even for files served with cf-cache-status: HIT. Cloudflare said they did this to reduce abuse, but it was also happening on non-abusive sites. Even though they said they fixed this, unfortunately the situation is still the same. I even saw a 429 error recently 😄 It's ridiculous.
rootalien • 2mo ago
😄
[image attachment]
slayerduck • 2mo ago
All resolved for me, no more gaps or timeouts.
[image attachment]
Akama Aka @ DoKomi • 2mo ago
nice