US traffic is being served by London and France datacenters

Around 13:00 UTC, most of the US traffic is being pointed at Cloudflare's France and London datacenters. This causes our Europe region to sustain traffic it was not designed for, and that region goes down or slows down a lot. This repeats daily until about 01:00 UTC, when US traffic is again served by US datacenters. As the traffic goes overseas, latency for end users increases. The only way we found to control this was to shut down the EU load balancer pool, but that does not resolve the increased latency. This can be seen in the HTTP Traffic chart by filtering to US traffic and grouping by datacenter. The chart directly correlates with what we see in our internal traffic analysis.
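For reference, the per-datacenter view described above can also be pulled programmatically from Cloudflare's GraphQL Analytics API instead of the dashboard chart. Below is a minimal sketch, assuming the `httpRequestsAdaptiveGroups` dataset with `coloCode` and `clientCountryName` fields; the zone tag and API token are placeholders.

```ts
// Sketch: count requests from US clients per Cloudflare datacenter (colo)
// over the last 24h via the GraphQL Analytics API. Dataset and field names
// are assumptions based on the public schema; zoneTag/apiToken are placeholders.
const QUERY = `
  query UsTrafficByColo($zoneTag: string!, $since: Time!) {
    viewer {
      zones(filter: { zoneTag: $zoneTag }) {
        httpRequestsAdaptiveGroups(
          limit: 100
          filter: { clientCountryName: "US", datetime_geq: $since }
        ) {
          count
          dimensions { coloCode }
        }
      }
    }
  }`;

async function usTrafficByColo(zoneTag: string, apiToken: string) {
  const since = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString();
  const res = await fetch("https://api.cloudflare.com/client/v4/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query: QUERY, variables: { zoneTag, since } }),
  });
  return res.json(); // groups of { count, dimensions: { coloCode } }
}
```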
Chaika
Chaika2mo ago
Known: https://discord.com/channels/595317990191398933/1408149529202786325/1408151056877097010 Upgrading plans can help, but it's not guaranteed and only incremental. I have some testing on it, and for this issue specifically: Free -> Pro is maybe 20% less rerouting, Biz another 20% or so less, and with Argo on Free or on Ent it's mostly gone but still slightly there. There are no guarantees though; they just need to expand capacity in the region, and they haven't shared many other details.
asuffield
asuffield2mo ago
If you just want traffic served from the lowest available latency, it's doing that at present. It's surprising but true that if your user is on the US east coast, London is closer than California.
asuffield
asuffield2mo ago
If you specifically care about always being served in-region, that's what https://developers.cloudflare.com/data-localization/regional-services/ is for. If you just want low latency, this is "working as intended", but sometimes there isn't enough for everybody and there is a priority order.
Cloudflare Docs
Regional Services
Regional Services gives you the ability to accommodate regional restrictions by choosing which subset of data centers decrypt and service HTTPS traffic.
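If always-in-region processing is what's needed, Regional Services is applied per hostname. Here is a minimal sketch under the assumption that the Data Localization API exposes a `/addressing/regional_hostnames` endpoint taking `hostname` and `region_key`; the zone ID, token, and hostname are placeholders.

```ts
// Sketch: pin a hostname to a region so only that region's datacenters
// decrypt and serve its HTTPS traffic. Endpoint and body shape are assumptions
// from the Data Localization docs; all identifiers below are placeholders.
async function createRegionalHostname(zoneId: string, apiToken: string) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/addressing/regional_hostnames`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        hostname: "app.example.com", // hostname to restrict
        region_key: "us",            // serve/decrypt only from US datacenters
      }),
    },
  );
  if (!res.ok) throw new Error(`Regional hostname request failed: ${res.status}`);
  return res.json();
}
```

Note that Regional Services restricts where traffic is processed; it does not add capacity, so pinning a hostname to an already constrained region trades a rerouting problem for a capacity one.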
asuffield
asuffield2mo ago
And for east-coast users, you're competing with everybody who runs massive workloads in us-east-1, the world's most overloaded cloud region. Capacity in that part of the world is a recurring challenge (it happens to be a bit worse than usual in the past week or so, for reasons that have taken up most of my past couple of weeks, but what you personally are observing is likely just that you used to fit into US east coast capacity and you've been pushed out).
asuffield
asuffield2mo ago
If the actual problem that's hurting you is that this has moved too much of your origin-pull traffic to origins in Europe, you might want to insert https://developers.cloudflare.com/load-balancing/load-balancers/ to force some of it back to the US, instead of relying on whatever is closest to the Cloudflare location the user is served from.
Cloudflare Docs
Load balancers
A load balancer distributes traffic among pools according to pool health and traffic steering policies. Each load balancer is identified by its DNS hostname (lb.example.com, dev.example.com, etc.) or IP address.
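If the goal is to force origin pulls back to US pools regardless of which edge location serves the request, a simple failover-ordered load balancer can do that. Below is a minimal sketch via the API, assuming `steering_policy: "off"` means "try `default_pools` in order" and that the pools already exist; all IDs and tokens are placeholders.

```ts
// Sketch: create a load balancer whose default pools are the US pools, with
// the EU pool only as the pool of last resort, so origin pulls go to the US
// even when the request is served from a European colo. Field names are
// assumptions from the Load Balancing docs; IDs are placeholders.
async function createUsFirstLoadBalancer(zoneId: string, apiToken: string) {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/load_balancers`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        name: "lb.example.com",
        steering_policy: "off",                              // plain failover order
        default_pools: ["US_EAST_POOL_ID", "US_WEST_POOL_ID"],
        fallback_pool: "EU_POOL_ID",                         // last resort only
        proxied: true,
      }),
    },
  );
  return res.json();
}
```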
Dawnward
DawnwardOP2mo ago
US east coast traffic is routed to London and France, which is further than our load balancer pool located in Columbus. Forcing Cloudflare's Europe edge servers to route US traffic back to the US seems latency costly. It does not matter whether the Europe pool is on; requests still get routed to Europe edge servers.
Jeff
Jeff2mo ago
I'm having the same issue. I wouldn't normally mind if traffic went to London or France, but the performance of those data centers is very poor. I'm using Workers to get data from R2 (just simple uploads, usually less than 200 KB) and those take over 30 seconds to load, even when cached and not going to the origin (R2). I understand that Cloudflare can't handle the increased traffic to US data centers, so they are routing less important customers to EU data centers. But they can't handle the increased traffic to EU data centers either.
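For context on the setup being described, the usual shape of a Worker serving small R2 objects looks roughly like the sketch below: the colo-local Cache API sits in front of the R2 binding so repeat hits avoid R2 entirely. This is illustrative, not Jeff's actual code; the `BUCKET` binding name is a placeholder, and the types come from `@cloudflare/workers-types`. One relevant detail is that `caches.default` is per datacenter, so when requests get rerouted to a different colo, even "cached" objects start as cache misses there.

```ts
// Sketch of a Worker serving small objects from R2 with the Cache API in front.
interface Env {
  BUCKET: R2Bucket; // placeholder binding name
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const cache = caches.default;
    const cached = await cache.match(request);
    if (cached) return cached; // served from this datacenter's cache

    const key = new URL(request.url).pathname.slice(1);
    const object = await env.BUCKET.get(key);
    if (!object) return new Response("Not found", { status: 404 });

    const response = new Response(object.body, {
      headers: {
        "Content-Type": object.httpMetadata?.contentType ?? "application/octet-stream",
        "Cache-Control": "public, max-age=3600",
        ETag: object.httpEtag,
      },
    });
    ctx.waitUntil(cache.put(request, response.clone())); // populate colo-local cache
    return response;
  },
};
```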
asuffield
asuffield2mo ago
Yeah, it's a multidimensional problem. I've been very busy this past couple of weeks. I can't give any forward-looking statements, but there's plenty of attention on what's happening, and I am cautiously optimistic about the situation improving. It's never going to be ideal, but the number of users served from further away should be lower than it is right now.
Jeff
Jeff2mo ago
Glad to hear that it's being looked at and fixed. Is this issue why I'd be seeing increased 520x errors? This only started happening on August 21st, with no changes to my nginx config (or my code) on my end. I can't seem to figure it out. It seems to be an issue only when Cloudflare starts pushing people to EU data centers, but the issue is intermittent, which leads me to believe it's just about how overloaded the Cloudflare data centers are. I've tried switching servers and I see the same issue, although with varying severity. But even that depends on how overloaded the Cloudflare data centers are, assuming that's the issue... The times I'm seeing this are around 2:00 - 2:40 PM UTC.
Frerduro
Frerduro2mo ago
It's still happening.
Frerduro
Frerduro2mo ago
The IP is on the CDN77 network in their Ashburn DC.
Jeff
Jeff2mo ago
Are you seeing any 520x errors on your end @Frerduro?
Frerduro
Frerduro2mo ago
no
asuffield
asuffield2mo ago
Things moving around inside the US is "normal", especially at this time of year. Don't expect this to be an overnight change; it's slower moving.
Frerduro
Frerduro2mo ago
I mean, connecting to Paris and Melbourne, AUS is outside the US.
Chaika
Chaika2mo ago
I was playing around with this and built out https://delay.chaika.me/routing/ if it helps anyone to see the plan differences. This is done by testing from a bunch of locations in NA (SEA, PDX, SJC, LAS, SLC, MCI, DFW, ATL, MIA, ORD, DTW, YYZ, EWR, IAD), using datacenter connections, usually over an IX or direct peering, so it should be entirely CF shuffling requests between DCs. It's not worth upgrading to Pro to get away from it; Business is mostly unaffected, and Argo & Ent are mostly unaffected.
Cloudflare Routing Monitoring
See Cloudflare Routing, using Workers running on each plan returning static content.
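For anyone who wants to check which datacenter is serving them without building out full monitoring, the colo code is exposed in the `cf-ray` response header (its suffix is the datacenter's IATA code). Below is a minimal probe sketch; the target URL and interval are placeholders, and this is not Chaika's actual setup.

```ts
// Sketch: fetch a URL and read the Cloudflare datacenter that handled it
// from the cf-ray response header, e.g. "8c9d1234abcd5678-IAD" -> "IAD".
async function probeColo(url: string): Promise<string | null> {
  const res = await fetch(url);
  const ray = res.headers.get("cf-ray");
  if (!ray) return null;
  const parts = ray.split("-");
  return parts.length > 1 ? parts[parts.length - 1] : null;
}

// Example: log the serving colo from this vantage point once a minute.
setInterval(async () => {
  const colo = await probeColo("https://example.com/");
  console.log(new Date().toISOString(), "served by", colo ?? "unknown");
}, 60_000);
```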
Frerduro
Frerduro2mo ago
Idk, it's just weird. I fully accept getting redirected to other DCs in the US, but Australia is about as far as you can go.
Chaika
Chaika2mo ago
The rerouting seems pretty across the board; some PoPs are affected less than others, but they all shift, and I don't imagine sending a request to another PoP near capacity would help. From what I've seen, most requests which are rerouted get sent to entirely different regions for processing. A while ago the Oceania region (Australia and New Zealand) had capacity issues and most requests got flung all the way to Europe. I'm not sure if it's because other DCs are also just close enough, or if the logic is being safe and wants to ensure the forwarding location has capacity, but that's the observed behavior.
Frerduro
Frerduro2mo ago
I have been seeing this behavior with stuff we host ourselves. We have an app hosted in an Ashburn DC with these upstreams, and even when the request to the app's public domain comes from the same server rack (just a different machine), I have seen Ashburn server #2 > Cloudflare AUS > Ashburn server #1 > Cloudflare AUS > Ashburn server #2. The same behavior has been seen with residential ISPs connecting to the same public URL, and also for our static HTML CF Pages website ¯\_(ツ)_/¯
Chaika
Chaika2mo ago
It's not your routing, if that's what you're saying; it's CF internally flinging the request from one DC to another for capacity reasons: https://blog.cloudflare.com/meet-traffic-manager/ (probably Plurimog).
If a request goes into Philadelphia and Philadelphia is unable to take the request, Plurimog will forward to another data center that can take the request, like Ashburn, where the request is decrypted and processed. Because Plurimog operates at layer 4, it can send individual TCP or UDP requests to other places which allows it to be very fine-grained: it can send percentages of traffic to other data centers very easily, meaning that we only need to send away enough traffic to ensure that everyone can be served as fast as possible
Frerduro
Frerduro2mo ago
So you're telling me that zero other Cloudflare PoPs in the US have had capacity for the past week+? Hell, I'd prefer EU over AUS. I've even seen Singapore and New Zealand.
Chaika
Chaika2mo ago
That goes back to https://discord.com/channels/595317990191398933/1409539854747963523/1410375823994912888. Anyway, not much you can do other than wait it out or upgrade.
Chaika
Chaika2mo ago
I have data going back over a year and it was never this widespread until now, besides small bumps, at least from my simple testing against Workers.
Frerduro
Frerduro2mo ago
How do you have data for all these plans btw? You must be spending a ton of money just to collect data right?
Chaika
Chaika2mo ago
CF is very nice and gives Community Champs & MVPs Enterprise and all the other plan levels. All the monitoring endpoints are just separate small VPSes feeding back. The primary purpose of my monitoring stuff was more like monitoring Worker script deployment, https://delay.chaika.me/job/worker, but I also log which CF location deals with the requests, so I've always just kind of had this data.
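The serving-location logging described here only needs the Worker to echo back where it ran, which is available on the request as `request.cf.colo`. A minimal sketch of such an endpoint, illustrative only and not Chaika's actual code:

```ts
// Sketch: Worker returning static content plus the datacenter (colo) that
// processed the request, so a probe can record where requests landed.
export default {
  async fetch(request: Request): Promise<Response> {
    const colo =
      (request as unknown as { cf?: { colo?: string } }).cf?.colo ?? "unknown";
    return new Response(JSON.stringify({ ok: true, colo }), {
      headers: { "Content-Type": "application/json" },
    });
  },
};
```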
Frerduro
Frerduro2mo ago
full enterprise plan?
Chaika
Chaika2mo ago
There's no such thing as a "full enterprise plan"; it's all piecemeal/requested in bits, but we can ask for most features and get them. It's all non-commercial personal/testing usage, but the upside is that if someone asks whether API Shield can do x or y (or a feature which requires it), we can test it and see that it can, or find problems and escalate them, etc. There have been a decent number of incidents that Champ monitoring data has helped raise or find; Workers deployments used to be way more unstable, for example.
asuffield
asuffield2mo ago
Looks consistent with what I've been glaring at. It's supposed to look more like it did before the start of August: some, but not as much. Today might have been a bit better than earlier in the week, and there should be a bit more improvement over the next couple of days.
Dawnward
DawnwardOP2mo ago
Nope, well, at least not in my case. Yesterday was much better. We have 3 regions: us-east, us-west, eu. The day before it was balanced in periods, roughly 1/3 per region; nowhere near perfect, but much better. Tonight it came in bursts: 1 hour in EU, then EU almost taken out, then EU again taking most of the US traffic. Attaching a screenshot just for the US traffic and the balance over the regions mentioned.
Dawnward
DawnwardOP2mo ago
time in chart GMT+3
xCROv
xCROv2mo ago
I mainly use the services that I've got running through tunnels during the daytime, which is when I see the most impact. I think it appearing better during the night is just a symptom of less usage for the service or something. It looks like it's been pretty consistent based on the data that Chaika has so kindly been providing.
xCROv
xCROv2mo ago
The only plan that seems to have zero impact is Ent Spectrum HTTP. Someone needs to let me in on the secret for getting a trial for that. :SAD:
asuffield
asuffield2mo ago
Yeah, Free will be the last thing to stop spilling out of region. It's hard to predict when that will be; it always does it at least a little bit.
Dawnward
DawnwardOP2mo ago
I'm not sure if being on Pro for 8-9 years counts as paying enough, but apparently we are still getting spilled.
asuffield
asuffield2mo ago
enterprise goes first, and that's a lot of traffic
Frerduro
Frerduro2mo ago
One thing I don't get is why I never see US west or US south. I see AUS, Singapore, Paris, etc. first. It's either US east or across an ocean for me, nothing in between.
Jeff
Jeff2mo ago
That's a good question. I'm also not sure why they don't route things to Canada or the Caribbean instead. Both of those are still faster than the EU, Australia, or Japan datacenters...
asuffield
asuffield2mo ago
You'd expect so, but network paths take surprisingly strange routes. I looked into this because it seemed weird to me, but they are actually closer by latency; not by very much, there's only a couple of ms difference. I'm hoping for another chunk to move back later today. It's going to be an ongoing process though.
Chaika
Chaika2mo ago
At least for me it does look way better on Free today. There's some amount of in-region rerouting (like ORD/EWR to MIA), but it's staying in region at least.
asuffield
asuffield2mo ago
it'll affect different zones and plans at different times, because it still doesn't all fit. it does appear somewhat better today though
Frerduro
Frerduro2mo ago
Yeah, didn't even think of Canada.
Chaika
Chaika2mo ago
Obviously I don't know all the internal numbers, but if they're shifting due to capacity limits in the US, there's no way the Caribbean is going to have enough capacity, and I don't think CF is very big in Canada either; no DO hosts there or anything. It makes sense to shift to other big regions. Different regions have different peaks too, so it makes sense to do kind of the opposite of follow-the-sun with capacity shifting.
Frerduro
Frerduro2mo ago
so far seems better but not perfect
Frerduro
Frerduro2mo ago
I am just glad this issue isn't affecting CF Magic Transit at all, it seems. I've got a question: does the Pro plan get any kind of priority, or is it treated like the Free plan?
asuffield
asuffield2mo ago
yes, it's between the two
Jeff
Jeff2mo ago
Seems better today on my end too; almost no timeouts. Tomorrow will be a good test since there was downtime on the 23rd (last Saturday).
Chaika
Chaika2mo ago
From what I saw, when it was at its peak for a few hours earlier this week, it was something like 50% of Free connections being rerouted, Pro 40%, Biz 20%, and Ent/Argo a few percent. Noticeably less for Pro, but still painful.
Frerduro
Frerduro2mo ago
Our host recently had to switch our IP subnet from DataPacket to Magic Transit temporarily again because of an ongoing 3+ Tbps DDoS, so I'm glad to see Magic isn't being re-routed across the world like HTTP is.
Chaika
Chaika2mo ago
At least last weekend there was only a minimal amount of rerouting during the weekend as well; it's mostly on weekdays.
asuffield
asuffield2mo ago
things should be significantly better now. still keeping an eye on it though
Frerduro
Frerduro2mo ago
mostly yes.
asuffield
asuffield2mo ago
That's pretty close to expectations (although something else is still wrong here; I don't think it's affecting you based on those numbers, but others might have different experiences).
Frerduro
Frerduro2mo ago
What I'm curious about is how much extra capacity Paris has to be eating so much traffic. It's been a consistent #2 through the weeks.
asuffield
asuffield2mo ago
peak time is at different times in different timezones
Jeff
Jeff2mo ago
Seeing timeouts and users being served from EU data centers again today, mostly at CPH. Time on the graph is UTC-4.
Jeff
Jeff4w ago
Seeing timeouts again today
asuffield
asuffield4w ago
We're still poking at it. Things are generally much better now, but it still might take a while to nail it all back down. Keep in mind that for free accounts, we don't try to make this go to zero. It should be lower on higher plan tiers, ending up at zero for enterprise customers in most of the world (South America and Africa will always do some of it; there are limits to what is achievable).
Jeff
Jeff4w ago
Fair enough -- I don't expect it to ever go to zero since I'm on the Pro plan. I am noticing that the timeouts are happening less frequently, which is how it worked before. Even prior to all of these issues I saw re-routing happening occasionally, but there weren't any timeout issues that I saw, or if there were, they were so infrequent that they weren't a concern. I don't mind as much if people get re-routed to the EU or wherever, as long as it works.
