US traffic is being served by London and France datacenters

Around 13:00 UTC, most of the US traffic is being pointed at Cloudflare's France and London datacenters. This causes our Europe region to sustain traffic it was not designed for, and that region goes down or slows down a lot. This repeats daily until about 01:00 UTC, when US traffic is again served by US datacenters. As the traffic goes overseas, latency for end users increases. The only way we found to control this was to shut down the EU load balancer pool, but that does not resolve the increased latency. This can be seen in the HTTP Traffic chart by filtering to US traffic and grouping by datacenter. The chart directly correlates with what we see in our internal traffic analysis.
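For reference, the per-datacenter view described above can also be pulled programmatically from Cloudflare's GraphQL Analytics API instead of the dashboard chart. Below is a minimal sketch, assuming the `httpRequestsAdaptiveGroups` dataset with `coloCode` and `clientCountryName` fields; the zone tag and API token are placeholders.

```ts
// Sketch: count requests from US clients per Cloudflare datacenter (colo)
// over the last 24h via the GraphQL Analytics API. Dataset and field names
// are assumptions based on the public schema; zoneTag/apiToken are placeholders.
const QUERY = `
  query UsTrafficByColo($zoneTag: string!, $since: Time!) {
    viewer {
      zones(filter: { zoneTag: $zoneTag }) {
        httpRequestsAdaptiveGroups(
          limit: 100
          filter: { clientCountryName: "US", datetime_geq: $since }
        ) {
          count
          dimensions { coloCode }
        }
      }
    }
  }`;

async function usTrafficByColo(zoneTag: string, apiToken: string) {
  const since = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString();
  const res = await fetch("https://api.cloudflare.com/client/v4/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query: QUERY, variables: { zoneTag, since } }),
  });
  return res.json(); // groups of { count, dimensions: { coloCode } }
}
```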
Chaika
Chaika2mo ago
Known: https://discord.com/channels/595317990191398933/1408149529202786325/1408151056877097010 Upgrading plans can help, but it's not guaranteed and only incremental. I have some testing on it, and for this issue specifically: Free -> Pro is maybe 20% less rerouting, Biz another 20% or so less, and with Argo on Free or on Ent it's mostly gone but still slightly there. There are no guarantees though; they just need to expand capacity in the region, and they haven't shared many other details.
asuffield
asuffield2mo ago
If you just want traffic served from the lowest available latency, it's doing that at present. It's surprising but true that if your user is on the US east coast, London is closer than California.
asuffield
asuffield2mo ago
If you specifically care about always being served in-region, that's what https://developers.cloudflare.com/data-localization/regional-services/ is for. If you just want low latency, this is "working as intended", but sometimes there isn't enough for everybody and there is a priority order.
Cloudflare Docs
Regional Services
Regional Services gives you the ability to accommodate regional restrictions by choosing which subset of data centers decrypt and service HTTPS traffic.
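If always-in-region processing is what's needed, Regional Services is applied per hostname. Here is a minimal sketch under the assumption that the Data Localization API exposes a `/addressing/regional_hostnames` endpoint taking `hostname` and `region_key`; the zone ID, token, and hostname are placeholders.

```ts
// Sketch: pin a hostname to a region so only that region's datacenters
// decrypt and serve its HTTPS traffic. Endpoint and body shape are assumptions
// from the Data Localization docs; all identifiers below are placeholders.
async function createRegionalHostname(zoneId: string, apiToken: string) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/addressing/regional_hostnames`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        hostname: "app.example.com", // hostname to restrict
        region_key: "us",            // serve/decrypt only from US datacenters
      }),
    },
  );
  if (!res.ok) throw new Error(`Regional hostname request failed: ${res.status}`);
  return res.json();
}
```

Note that Regional Services restricts where traffic is processed; it does not add capacity, so pinning a hostname to an already constrained region trades a rerouting problem for a capacity one.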
asuffield
asuffield2mo ago
And for east-coast users, you're competing with everybody who runs massive workloads in us-east-1, the world's most overloaded cloud region. Capacity in that part of the world is a recurring challenge (it happens to be a bit worse than usual in the past week or so, for reasons that have taken up most of my past couple of weeks, but what you personally are observing is likely just that you used to fit into US east coast capacity and you've been pushed out).
asuffield
asuffield2mo ago
If the actual problem that's hurting you is that this has moved too much of your origin-pull traffic to origins in Europe, you might want to insert https://developers.cloudflare.com/load-balancing/load-balancers/ to force some of it back to the US, instead of relying on whatever is closest to the Cloudflare location the user is served from.
Cloudflare Docs
Load balancers
A load balancer distributes traffic among pools according to pool health and traffic steering policies. Each load balancer is identified by its DNS hostname (lb.example.com, dev.example.com, etc.) or IP address.
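If the goal is to force origin pulls back to US pools regardless of which edge location serves the request, a simple failover-ordered load balancer can do that. Below is a minimal sketch via the API, assuming `steering_policy: "off"` means "try `default_pools` in order" and that the pools already exist; all IDs and tokens are placeholders.

```ts
// Sketch: create a load balancer whose default pools are the US pools, with
// the EU pool only as the pool of last resort, so origin pulls go to the US
// even when the request is served from a European colo. Field names are
// assumptions from the Load Balancing docs; IDs are placeholders.
async function createUsFirstLoadBalancer(zoneId: string, apiToken: string) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/load_balancers`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        name: "lb.example.com",
        steering_policy: "off",                              // plain failover order
        default_pools: ["US_EAST_POOL_ID", "US_WEST_POOL_ID"],
        fallback_pool: "EU_POOL_ID",                         // last resort only
        proxied: true,
      }),
    },
  );
  return res.json();
}
```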
Dawnward
DawnwardOP2mo ago
US east coast traffic is routed to London and France, which is further than our load balancer pool located in Columbus. Forcing Cloudflare's Europe edge servers to route US traffic back to the US seems latency costly. It does not matter whether the Europe pool is on; requests still get routed to Europe edge servers.
Jeff
Jeff2mo ago
I'm having the same issue. I wouldn't normally mind if traffic went to London or France, but the performance of those data centers is very poor. I'm using Workers to get data from R2 (just simple uploads, usually less than 200 KB) and those take over 30 seconds to load, even when cached and not going to the origin (R2). I understand that Cloudflare can't handle the increased traffic to US data centers, so they are routing less important customers to EU data centers. But they can't handle the increased traffic to EU data centers either.
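For context on the setup being described, the usual shape of a Worker serving small R2 objects looks roughly like the sketch below: the colo-local Cache API sits in front of the R2 binding so repeat hits avoid R2 entirely. This is illustrative, not Jeff's actual code; the `BUCKET` binding name is a placeholder, and the types come from `@cloudflare/workers-types`. One relevant detail is that `caches.default` is per datacenter, so when requests get rerouted to a different colo, even "cached" objects start as cache misses there.

```ts
// Sketch of a Worker serving small objects from R2 with the Cache API in front.
interface Env {
  BUCKET: R2Bucket; // placeholder binding name
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const cache = caches.default;
    const cached = await cache.match(request);
    if (cached) return cached; // served from this datacenter's cache

    const key = new URL(request.url).pathname.slice(1);
    const object = await env.BUCKET.get(key);
    if (!object) return new Response("Not found", { status: 404 });

    const response = new Response(object.body, {
      headers: {
        "Content-Type": object.httpMetadata?.contentType ?? "application/octet-stream",
        "Cache-Control": "public, max-age=3600",
        ETag: object.httpEtag,
      },
    });
    ctx.waitUntil(cache.put(request, response.clone())); // populate colo-local cache
    return response;
  },
};
```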
asuffield
asuffield2mo ago
Yeah, it's a multidimensional problem. I've been very busy this past couple of weeks. I can't give any forward-looking statements, but there's plenty of attention on what's happening, and I am cautiously optimistic about the situation improving. It's never going to be ideal, but the number of users served from further away should be lower than it is right now.
Jeff
Jeff2mo ago
Glad to hear that it's being looked at and fixed. Is this issue why I'd be seeing increased 520x errors? This only started happening on August 21st, with no changes to my nginx config (or my code) on my end. I can't seem to figure it out. It seems to be an issue only when Cloudflare starts pushing people to EU data centers, but the issue is intermittent, which leads me to believe it's just about how overloaded the Cloudflare data centers are. I've tried switching servers and I see the same issue, although with varying severity. But even that depends on how overloaded the Cloudflare data centers are, assuming that's the issue... The times I'm seeing this are around 2:00 - 2:40 PM UTC.
Frerduro
Frerduro2mo ago
It's still happening.
Frerduro
Frerduro2mo ago
The IP is on the CDN77 network in their Ashburn DC.
Jeff
Jeff2mo ago
Are you seeing any 520x errors on your end @Frerduro?
Frerduro
Frerduro2mo ago
no
asuffield
asuffield2mo ago
Things moving around inside the US is "normal", especially at this time of year. Don't expect this to be an overnight change; it's slower moving.
Frerduro
Frerduro2mo ago
I mean, connecting to Paris and Melbourne, AUS is outside the US.
Chaika
Chaika2mo ago
I was playing around with this and built out https://delay.chaika.me/routing/ if it helps anyone to see the plan differences. This is done by testing from a bunch of locations in NA (SEA, PDX, SJC, LAS, SLC, MCI, DFW, ATL, MIA, ORD, DTW, YYZ, EWR, IAD), using datacenter connections, usually over an IX or direct peering, so it should be entirely CF shuffling requests between DCs. It's not worth upgrading to Pro to get away from it; Business is mostly unaffected, and Argo & Ent are mostly unaffected.
Cloudflare Routing Monitoring
See Cloudflare Routing, using Workers running on each plan returning static content.
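For anyone who wants to check which datacenter is serving them without building out full monitoring, the colo code is exposed in the `cf-ray` response header (its suffix is the datacenter's IATA code). Below is a minimal probe sketch; the target URL and interval are placeholders, and this is not Chaika's actual setup.

```ts
// Sketch: fetch a URL and read the Cloudflare datacenter that handled it
// from the cf-ray response header, e.g. "8c9d1234abcd5678-IAD" -> "IAD".
async function probeColo(url: string): Promise<string | null> {
  const res = await fetch(url);
  const ray = res.headers.get("cf-ray");
  if (!ray) return null;
  const parts = ray.split("-");
  return parts.length > 1 ? parts[parts.length - 1] : null;
}

// Example: log the serving colo from this vantage point once a minute.
setInterval(async () => {
  const colo = await probeColo("https://example.com/");
  console.log(new Date().toISOString(), "served by", colo ?? "unknown");
}, 60_000);
```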
Frerduro
Frerduro2mo ago
Idk, it's just weird. I fully accept getting redirected to other DCs in the US, but Australia is about as far as you can go.
Chaika
Chaika2mo ago
The rerouting seems pretty across the board; some PoPs are affected less than others, but they all shift, and I don't imagine sending a request to another PoP near capacity would help. From what I've seen, most requests which are rerouted get sent to entirely different regions for processing. A while ago the Oceania region (Australia and New Zealand) had capacity issues and most requests got flung all the way to Europe. I'm not sure if it's because other DCs are also just close enough, or if the logic is being safe and wants to ensure the forwarding location has capacity, but that's the observed behavior.
Frerduro
Frerduro2mo ago
I have been seeing this behavior with stuff we host ourselves. We have an app hosted in an Ashburn DC with these upstreams, and even when the request to the app's public domain comes from the same server rack (just a different machine), I have seen Ashburn server #2 > Cloudflare AUS > Ashburn server #1 > Cloudflare AUS > Ashburn server #2. The same behavior has been seen with residential ISPs connecting to the same public URL, and also for our static HTML CF Pages website ¯\_(ツ)_/¯
Chaika
Chaika2mo ago
It's not your routing, if that's what you're saying; it's CF internally flinging the request from one DC to another for capacity reasons: https://blog.cloudflare.com/meet-traffic-manager/ (probably Plurimog).
If a request goes into Philadelphia and Philadelphia is unable to take the request, Plurimog will forward to another data center that can take the request, like Ashburn, where the request is decrypted and processed. Because Plurimog operates at layer 4, it can send individual TCP or UDP requests to other places which allows it to be very fine-grained: it can send percentages of traffic to other data centers very easily, meaning that we only need to send away enough traffic to ensure that everyone can be served as fast as possible
Frerduro
Frerduro2mo ago
So you're telling me that zero other Cloudflare PoPs in the US have had capacity for the past week+? Hell, I'd prefer EU over AUS. I've even seen Singapore and New Zealand.
Chaika
Chaika2mo ago
That goes back to https://discord.com/channels/595317990191398933/1409539854747963523/1410375823994912888. Anyway, not much you can do other than wait it out or upgrade.
Chaika
Chaika2mo ago
I have data going back over a year and it was never this widespread until now, besides small bumps, at least from my simple testing against Workers.
Frerduro
Frerduro2mo ago
How do you have data for all these plans btw? You must be spending a ton of money just to collect data right?
Chaika
Chaika2mo ago
CF is very nice and gives Community Champs & MVPs Enterprise and all the other plan levels. All the monitoring endpoints are just separate small VPSes feeding back. The primary purpose of my monitoring stuff was more like monitoring Worker script deployment, https://delay.chaika.me/job/worker, but I also log which CF location deals with the requests, so I've always just kind of had this data.
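The serving-location logging described here only needs the Worker to echo back where it ran, which is available on the request as `request.cf.colo`. A minimal sketch of such an endpoint, illustrative only and not Chaika's actual code:

```ts
// Sketch: Worker returning static content plus the datacenter (colo) that
// processed the request, so a probe can record where requests landed.
export default {
  async fetch(request: Request): Promise<Response> {
    const colo =
      (request as unknown as { cf?: { colo?: string } }).cf?.colo ?? "unknown";
    return new Response(JSON.stringify({ ok: true, colo }), {
      headers: { "Content-Type": "application/json" },
    });
  },
};
```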
Frerduro
Frerduro2mo ago
full enterprise plan?
Chaika
Chaika2mo ago
There's no such thing as a "full enterprise plan"; it's all piecemeal/requested in bits, but we can ask for most features and get them. It's all non-commercial personal/testing usage, but the upside is that if someone asks whether API Shield can do x or y (or a feature which requires it), we can test it and see that it can, or find problems and escalate them, etc. There have been a decent number of incidents that Champ monitoring data has helped raise or find; Workers deployments used to be way more unstable, for example.
asuffield
asuffield2mo ago
Looks consistent with what I've been glaring at. It's supposed to look more like it did before the start of August: some, but not as much. Today might have been a bit better than earlier in the week, and there should be a bit more improvement over the next couple of days.
Dawnward
DawnwardOP2mo ago
Nope, well, at least not in my case. Yesterday was much better. We have 3 regions: us-east, us-west, eu. The day before it was balanced in periods, roughly 1/3 per region; nowhere near perfect, but much better. Tonight it came in bursts: 1 hour in EU, then EU almost taken out, then EU again taking most of the US traffic. Attaching a screenshot just for the US traffic and the balance over the regions mentioned.
Dawnward
DawnwardOP2mo ago
time in chart GMT+3
xCROv
xCROv2mo ago
I mainly use the services that I've got running through tunnels during the daytime, which is when I see the most impact. I think it appearing better during the night is just a symptom of less usage for the service or something. It looks like it's been pretty consistent based on the data that Chaika has so kindly been providing.
xCROv
xCROv2mo ago
The only plan that seems to have zero impact is Ent Spectrum HTTP. Someone needs to let me in on the secret for getting a trial for that. :SAD:
asuffield
asuffield2mo ago
Yeah, Free will be the last thing to stop spilling out of region. It's hard to predict when that will be; it always does it at least a little bit.
Dawnward
DawnwardOP2mo ago
I'm not sure if being on Pro for 8-9 years counts as paying enough, but apparently we are still getting spilled.
asuffield
asuffield2mo ago
enterprise goes first, and that's a lot of traffic
Frerduro
Frerduro2mo ago
One thing I don't get is why I never see US west or US south. I see AUS, Singapore, Paris, etc. first. It's either US east or across an ocean for me, nothing in between.
Jeff
Jeff2mo ago
That's a good question. I'm also not sure why they don't route things to Canada or the Caribbean instead. Both of those are still faster than the EU, Australia, or Japan datacenters...
asuffield
asuffield2mo ago
You'd expect so, but network paths take surprisingly strange routes. I looked into this because it seemed weird to me, but they are actually closer by latency; not by very much, there's only a couple of ms difference. I'm hoping for another chunk to move back later today. It's going to be an ongoing process though.
Chaika
Chaika2mo ago
At least for me it does look way better on Free today. There's some amount of in-region rerouting (like ORD/EWR to MIA), but it's staying in region at least.
asuffield
asuffield2mo ago
it'll affect different zones and plans at different times, because it still doesn't all fit. it does appear somewhat better today though
Frerduro
Frerduro2mo ago
Yeah, didn't even think of Canada.
Chaika
Chaika2mo ago
Obviously I don't know all the internal numbers, but if they're shifting due to capacity limits in the US, there's no way the Caribbean is going to have enough capacity, and I don't think CF is very big in Canada either; no DO hosts there or anything. It makes sense to shift to other big regions. Different regions have different peaks too, so it makes sense to do kind of the opposite of follow-the-sun with capacity shifting.
Frerduro
Frerduro2mo ago
so far seems better but not perfect
Frerduro
Frerduro2mo ago
I am just glad this issue isn't affecting CF Magic Transit at all, it seems. I've got a question: does the Pro plan get any kind of priority, or is it treated like the Free plan?
asuffield
asuffield2mo ago
yes, it's between the two
Jeff
Jeff2mo ago
Seems better today on my end too; almost no timeouts. Tomorrow will be a good test since there was downtime on the 23rd (last Saturday).
Chaika
Chaika2mo ago
From what I saw, when it was at its peak for a few hours earlier this week, it was something like 50% of Free connections being rerouted, Pro 40%, Biz 20%, and Ent/Argo a few percent. Noticeably less for Pro, but still painful.
Frerduro
Frerduro2mo ago
Our host recently had to switch our IP subnet from DataPacket to Magic Transit temporarily again because of an ongoing 3+ Tbps DDoS, so I'm glad to see Magic isn't being re-routed across the world like HTTP is.
Chaika
Chaika2mo ago
At least last weekend there was only a minimal amount of rerouting during the weekend as well; it's mostly on weekdays.
asuffield
asuffield2mo ago
things should be significantly better now. still keeping an eye on it though
Frerduro
Frerduro2mo ago
mostly yes.
asuffield
asuffield2mo ago
That's pretty close to expectations (although something else is still wrong here; I don't think it's affecting you based on those numbers, but others might have different experiences).
Frerduro
Frerduro2mo ago
What I'm curious about is how much extra capacity Paris has to be eating so much traffic. It's been a consistent #2 through the weeks.
asuffield
asuffield2mo ago
peak time is at different times in different timezones
Jeff
Jeff2mo ago
Seeing timeouts and users being served from EU data centers again today, mostly at CPH. Time on the graph is UTC-4.
Jeff
Jeff4w ago
Seeing timeouts again today
asuffield
asuffield4w ago
We're still poking at it. Things are generally much better now, but it still might take a while to nail it all back down. Keep in mind that for free accounts, we don't try to make this go to zero. It should be lower on higher plan tiers, ending up at zero for enterprise customers in most of the world (South America and Africa will always do some of it; there are limits to what is achievable).
Jeff
Jeff4w ago
Fair enough -- I don't expect it to ever go to zero since I'm on the Pro plan. I am noticing that the timeouts are happening less frequently, which is how it worked before. Even prior to all of these issues I saw re-routing happening occasionally, but there weren't any timeout issues that I saw, or if there were, they were so infrequent that they weren't a concern. I don't mind as much if people get re-routed to the EU or wherever, as long as it works.
