DO Placement Debugging
anyone have ideas why my DOs are being created in England (LHR) when I'm 6000 miles away in california?
Actually, let's move to a thread, @Avi
What's the colo your Worker is in?
wouldn't that vary per request?
and i should also note i have a 'router' worker in front of the DO worker
worker (router) -> worker (for DO) -> DO
Sorry, I meant for a specific case this is happening in, where is the DO Worker at?
ok yeah i'm like 80% sure LHR DOs are broken for me
I assume you aren't using a location hint, right?
or rather, for all our users
correct
what i'm observing is that DOs created in LHR essentially cannot call Discord's APIs
but DOs created (so far) anywhere else can
But where is the DO Worker at?
ohh i see what you're asking
i'm not sure, i'll have to redeploy to add another cdn-cgi/trace proxy
Because where the DO spawns depends on where the original Worker is at
So for example, even if you are in California, if the DO Worker was in LHR, it would probably spawn a DO there too
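One way to answer "where is the DO Worker at?" is to have that Worker return its own /cdn-cgi/trace output and read the colo field from it. A minimal sketch, assuming the usual one-key=value-per-line body; the function name and the sample values are made up for illustration.

```typescript
// Parse the key=value lines of a /cdn-cgi/trace body into a lookup table.
function parseTrace(body: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of body.split("\n")) {
    const eq = line.indexOf("=");
    if (eq > 0) {
      fields[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
    }
  }
  return fields;
}

// Sample body with made-up values; only the shape matters here.
const sampleTrace = "h=example.com\nip=203.0.113.7\ncolo=LHR\nloc=US\n";
const workerColo = parseTrace(sampleTrace)["colo"];
console.log(`DO Worker ran in: ${workerColo}`);
```

Inside a Worker, request.cf.colo carries the same information without the extra subrequest.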
that makes sense
perhaps, also, i SHOULD be using a location hint
And also, how are you getting the DO ID?
because now that i think about it, even if the router worker runs near me, that doesn't mean the router will cause the DO worker to be created in the same place
idFromName
IIRC generally Workers invoked via Service Bindings are on the same metal, but if you are using Smart Placement on the DO Worker, it may not be
What is the name in this case? Could it be that the DO was originally created by a request from LHR, and then you reused that name somewhere else?
to clarify, the router worker talks to the DO worker over plain old HTTP
There's been a ton of rerouting for North America for any plan besides Ent/Argo mid-day US for the past few days. For LAS/Los Angeles, there's like a 60-70% chance for some hours on free plan that your request gets thrown for processing somewhere in Europe, and the Durable Object is created from wherever it's thrown to. Just my 2c
that certainly matches what i'm experiencing
it doesn't explain why my DOs in LHR would be nonfunctional
happens with brand-new names though
Gotcha
AMS is ok, MAD is ok, LHR -> dead
What is Discord responding with?
Discord's backend is known to be just in us-east, but obviously shouldn't fail for requests from other places. All you said so far is "essentially cannot call", is there anything more?
to be clear, 'dead' doesn't mean i can't talk to the DO
just that these external API calls are seemingly going nowhere
what i've been observing is that the requests either never return or return after like, a COMICAL amount of time
check this graph out

importantly: this is NOT average/p50 times
this is 'max' times we measure, essentially like this:
then = now()
await fetch("https://discord.com/api")
fetch_time = now() - then
those are like, comical numbers
3,300,000 milliseconds is nearly an hour
how does a fetch even return after that long?
those requests are being made from DOs. i'm not sure from which locations.
i should also note we have a 3000 millisecond timeout to abort our fetch calls to discord anyway
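The measurement and the 3000 ms abort described above could look roughly like this. It is a sketch under assumptions, not the poster's actual code: the Fetcher type, timedFetch, and the injectable clock are illustrative names so the example can run against a stub instead of the real API.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a stub (assumption).
type Fetcher = (url: string, init: { signal: AbortSignal }) => Promise<unknown>;

interface TimedResult {
  ok: boolean; // false when the call threw, including when it was aborted
  elapsedMs: number; // elapsed time as reported by `now`
}

// Time a single call and abort it once `timeoutMs` has passed.
async function timedFetch(
  fetcher: Fetcher,
  url: string,
  timeoutMs: number,
  now: () => number = Date.now,
): Promise<TimedResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = now();
  try {
    await fetcher(url, { signal: controller.signal });
    return { ok: true, elapsedMs: now() - started };
  } catch {
    return { ok: false, elapsedMs: now() - started };
  } finally {
    clearTimeout(timer); // do not leave the timer pending after the call settles
  }
}

// Usage inside a Worker or DO would be something like:
//   const result = await timedFetch(fetch, "https://discord.com/api", 3000);
```

If the abort works as intended, elapsedMs should stay close to the timeout, which is why the hour-long maximums are so surprising.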
so this has to be something really weird
To confirm, is this only happening against Discord, or against all APIs?
i think only discord, because the fact i can even get logs from these DOs suggests that other http requests can get out - because those logs are going to BetterStack
Might want to ask Discord then? The routing issue is weird, but if it is only Discord erroring, then it makes me think the issue is on their end
their eng team has been swearing up and down it's on my end 😵‍💫
i'm like, dude, we have the same infra provider xD
as a stopgap, i wonder if i can use some location hints here
this is in fact causing an outage for our product
It wouldn't move existing DOs, you could either just hint everything to enam or try to be more specific and just hint away from weur
we don't really use DO storage (just for temporary caching to be resilient to infra-caused DO restarts), so i have no concern about existing DOs
i've just added some queryparam support to:
1) pass the DO worker trace
2) allow me to specify a locationhint
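A debug query parameter like the one in point 2 is safer if it is checked against the known hint values before being passed on. A minimal sketch; the parameter name "hint" and the function name are hypothetical, and the list reflects the location hints documented for Durable Objects at the time of writing.

```typescript
// Location hints documented for Durable Objects (check the docs for changes).
const LOCATION_HINTS = [
  "wnam", "enam", "sam", "weur", "eeur", "apac", "oc", "afr", "me",
] as const;
type LocationHint = (typeof LOCATION_HINTS)[number];

// Return a valid hint, or undefined when the parameter is missing or unknown.
function parseLocationHint(raw: string | null): LocationHint | undefined {
  if (raw === null) return undefined;
  const candidate = raw.trim().toLowerCase();
  return LOCATION_HINTS.find((hint) => hint === candidate);
}

// Usage in a Worker (the "hint" parameter name is an assumption):
//   const hint = parseLocationHint(new URL(request.url).searchParams.get("hint"));
```

Unknown values fall through to undefined, so a typo in the URL results in default placement rather than an error.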
ok yeah looks like my DO workers are in LHR
that in and of itself makes no sense to me
surely cf has much closer colos to serve me with
how are you observing that routing pattern?
btw i'm not on the free plan
Sending requests from a ton of locations for all the plan levels and looking at the colo which runs it. I'm using transform rules colo.id but cf-ray works just as well, or request.cf.colo. The thing that throws requests if there's not enough capacity works on layer 4
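For the cf-ray variant of this, the colo is the suffix after the last dash of the header value. A rough sketch of pulling it out and tallying probe results; both function names are illustrative, and the ray ids in the usage comment are made up.

```typescript
// Extract the colo code from a cf-ray value of the form "<ray id>-<COLO>".
function coloFromRay(ray: string | null): string | undefined {
  if (!ray) return undefined;
  const dash = ray.lastIndexOf("-");
  if (dash < 0 || dash === ray.length - 1) return undefined;
  return ray.slice(dash + 1).toUpperCase();
}

// Count how many probe responses were served from each colo.
function tallyColos(rays: Array<string | null>): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ray of rays) {
    const colo = coloFromRay(ray) ?? "unknown";
    counts.set(colo, (counts.get(colo) ?? 0) + 1);
  }
  return counts;
}

// Usage: collect response.headers.get("cf-ray") from each probe, then
//   tallyColos(["0123456789abcdef-LHR", "fedcba9876543210-LAX"]);
```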
do you have this in a dashboard somewhere?
also, i'm not using smart placement, wonder if we should be?
not yet, one day I'll make a proper dash for it
It happens around peak times US, requests get flung. This is free plan for example

Someone from CF did ack it and say they're working on it https://discord.com/channels/595317990191398933/1408149529202786325/1408151174623920128
If it's just making a request to spawn a Durable Object it wouldn't help
i seeee
this is very much what i'm seeing too
ok, fine, so cf is routing a bunch of traffic to europe
that's fine with me, don't see why it should break my app specifically in LHR
MAD and AMS seem to be fine
like the LHR thing is what makes no sense to me
what's so special about that datacenter?
it's huge but so is ams nowadays
@Chaika do you know if i pass an enam locationHint, since CF is already routing me to LHR when presumably it knows i'm in the US, that it would do anything?
like i basically just want a locationhint that avoids LHR
As long as the Durable Object does not already exist, your hint will be respected regardless of where you are connecting to
even if there's no capacity?
Durable Object Capacity and normal request Routing capacity are two diff things
Also it's more just free/pro/biz plans are being shifted to Europe. Enterprise & Argo are mostly unaffected, and Spectrum/other stuff is entirely unaffected
(although in theory a hint is just a hint, it could be ignored as per docs, but I don't think I've ever heard of it being ignored in reality)
meaning you've never heard of a situation where a location hint is ignored due to capacity issues?
Ignored for any reason
R2, D1, etc, operate using the same hints, and they're also perfectly consistent
and for good measure I'd throw on top that specific locations within North America are rerouting more than others
ok as an aside i can now confirm that by passing enam i'm creating "non-broken" DOs
i continue to be absolutely perplexed by the mystery of LHR seemingly being so broken
I don't think Cloudflare has directly said anything but I've observed a number of instances of rerouting over the years.
Sao Paulo in Brazil had capacity issues for a while, and it threw to the East Coast.
Australian Locations had issues for a day due to Fortnite (true story), and it threw all the way to Europe
I think their "throw due to capacity" only picks locations in other regions
rerouting in violation of a provided locationHint? or just general rerouting like what we're seeing the past few days
general request rerouting
yea idk that one's weird, it just doesn't make sense that some outbound fetches like for metrics would be fine but not Discord, if it was something with the Durable Object.
Without any more info, I would almost assume it's something like Discord's load balancing stuff (which I guess is prob something GCP or CF) is messed from the UK
ok i've just updated our DO worker to only spawn in enam and the issue is fixed
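Pinning new objects to enam comes down to passing a locationHint option when getting the stub. The sketch below uses trimmed-down stand-in interfaces instead of the real Workers runtime types so that it is self-contained; getHintedStub is an illustrative name, not something from the poster's code.

```typescript
// Stand-ins for the runtime's DurableObjectId / DurableObjectNamespace types (assumption).
interface DurableObjectIdLike {
  toString(): string;
}
interface DurableObjectNamespaceLike<Stub> {
  idFromName(name: string): DurableObjectIdLike;
  get(id: DurableObjectIdLike, options?: { locationHint?: string }): Stub;
}

// Resolve a stub by name and ask for placement in the given region.
// The hint only matters when the object is first created; it does not move an existing one.
function getHintedStub<Stub>(
  ns: DurableObjectNamespaceLike<Stub>,
  name: string,
  locationHint: string = "enam",
): Stub {
  const id = ns.idFromName(name);
  return ns.get(id, { locationHint });
}

// Usage in a Worker, with a hypothetical binding name:
//   const stub = getHintedStub(env.MY_DO, "some-name");
```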
all i can say is: wtf
@Chaika question for you - what if I read the location information of the incoming request in my DO worker, and used that to inform the locationHint?
this is what i expected CF to do automatically, but i guess not
yea that's what I was suggesting above, to just hint away from weur
for now i'm just hardcoding enam which seems to work fine for avoiding LHR
request.cf has country/continent data, it's from GeoIP/ipinfo.io though
i see, and is there an easy way to map that to a locationHint?
If you wanted to go purely by GeoIP data (which is by far the easiest), you could just look at latitude/longitude and then map to the closest region, excluding some
probably the easiest way to do that most accurately. Otherwise with just country/continent, you wouldn't be able to tell East and West US apart without further parsing of states (region field)
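The latitude/longitude idea could be sketched as a nearest-point lookup with an exclusion list. The reference coordinates below are illustrative guesses (one well-known city per hint), not official Cloudflare data, and all names are made up for the example.

```typescript
type Hint = "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";
type LatLon = [number, number];

// One rough reference point per hint. ASSUMPTION: picked by hand for illustration.
const REGION_POINTS: Record<Hint, LatLon> = {
  wnam: [37.8, -122.4], // San Francisco area
  enam: [39.0, -77.5], // Northern Virginia area
  sam: [-23.5, -46.6], // São Paulo
  weur: [51.5, -0.1], // London
  eeur: [52.2, 21.0], // Warsaw
  apac: [1.35, 103.8], // Singapore
  oc: [-33.9, 151.2], // Sydney
  afr: [-26.2, 28.0], // Johannesburg
  me: [25.2, 55.3], // Dubai
};

const toRad = (deg: number): number => (deg * Math.PI) / 180;

// Great-circle distance in kilometres (haversine formula).
function haversineKm([lat1, lon1]: LatLon, [lat2, lon2]: LatLon): number {
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// Pick the closest hint that is not in `exclude`; undefined if everything is excluded.
function nearestHint(lat: number, lon: number, exclude: Hint[] = []): Hint | undefined {
  let best: Hint | undefined;
  let bestKm = Infinity;
  for (const hint of Object.keys(REGION_POINTS) as Hint[]) {
    if (exclude.includes(hint)) continue;
    const km = haversineKm([lat, lon], REGION_POINTS[hint]);
    if (km < bestKm) {
      best = hint;
      bestKm = km;
    }
  }
  return best;
}

// request.cf exposes latitude/longitude as strings, so convert first, e.g.:
//   nearestHint(Number(request.cf.latitude), Number(request.cf.longitude), ["weur"]);
```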
if you didn't care about west/east US, you could just force all of NA to ENAM, Europe to EEUR, and the rest are much broader (ex: OC -> OC, AS -> APAC)
Full list to handle is
"AF": Africa
"AN": Antarctica
"AS": Asia
"EU": Europe
"NA": North America
"OC": Oceania
"SA": South America
"T1": Tor network
from https://developers.cloudflare.com/workers/runtime-apis/request/#incomingrequestcfproperties
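The coarser continent-based mapping could be a small lookup table over the codes listed above. NA, EU, OC and AS follow the suggestion in the thread; the AF and SA entries and the enam fallback for AN, T1 and missing data are assumptions added for the sketch.

```typescript
type LocationHint = "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";

const CONTINENT_TO_HINT: Partial<Record<string, LocationHint>> = {
  NA: "enam", // from the thread: force all of North America to ENAM
  EU: "eeur", // from the thread: hints away from weur
  OC: "oc",
  AS: "apac",
  AF: "afr", // assumption: not spelled out in the thread
  SA: "sam", // assumption: not spelled out in the thread
};

// Map a request.cf.continent value to a hint; AN, T1 and unknown values use the fallback.
function hintForContinent(
  continent: string | undefined,
  fallback: LocationHint = "enam",
): LocationHint {
  if (continent === undefined) return fallback;
  return CONTINENT_TO_HINT[continent] ?? fallback;
}
```

This cannot tell East and West US apart, which is the trade-off mentioned above.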
oh thats perfect
Yes it's partially crazy because you need to map colo ids or names back to their locations, if you went with the colo name idea lol
again, it's surprising i even need to do this
why doesn't cloudflare automatically put the DO in the "logical place"
instead of LHR...
It does tho
LHR is closest to the running Worker
oh, right
well, why are my workers in LHR then? xD
lol that's back to the earlier routing fun
wouldn't that result in explicit ratelimit responses from the API though?
right, what i'm seeing is just that the requests.... never return

also, that ratelimit is just for failed requests right?
and yet, that's what i observe
those are ms
that's the actual code
let me go read the discord.js source code
hmm, it does seem they do some fancy internal queueing which uses Date.now()
that makes me a bit nervous, maybe something about the way workers fuzzes the precision of things is causing a deadlock
still wouldn't explain what's so special about london
hmm, didn't realize you were using a lib, could it be that it internally handles bad responses/rate limits/etc and retries/suppresses?
could just be that it's more used/rate limited
it does, but i do have those explicitly configured:
timeout after 3000 ms, max 3 retries
apparently they have a pretty elaborate queueing system
GitHub
discord.js/packages/rest/src/lib/handlers/SequentialHandler.ts at 5...
A powerful JavaScript library for interacting with the Discord API - discordjs/discord.js
i'm going to experiment with just calling the API with plain old fetch
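A plain-fetch experiment with the same limits (3000 ms timeout, at most 3 retries) should give up after roughly 4 × 3000 ms = 12 s in the worst case, which makes hour-long calls easy to rule in or out. A minimal sketch, assuming rate limits show up as HTTP 429 with a Retry-After header; the types and names are illustrative, and there is deliberately no internal queueing or backoff.

```typescript
// Minimal response/fetch shapes so the sketch can run against a stub (assumption).
interface MinimalResponse {
  status: number;
  headers: { get(name: string): string | null };
}
type Fetcher = (url: string, init: { signal: AbortSignal }) => Promise<MinimalResponse>;

type Outcome =
  | { kind: "ok"; response: MinimalResponse; attempts: number }
  | { kind: "rate_limited"; retryAfterSeconds: number | undefined; attempts: number }
  | { kind: "failed"; error: unknown; attempts: number };

// Try up to 1 + maxRetries times, aborting each attempt after timeoutMs.
// A 429 is reported to the caller right away instead of being retried silently.
async function fetchWithRetries(
  fetcher: Fetcher,
  url: string,
  timeoutMs: number = 3000,
  maxRetries: number = 3,
): Promise<Outcome> {
  let lastError: unknown;
  let attempts = 0;
  while (attempts <= maxRetries) {
    attempts++;
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetcher(url, { signal: controller.signal });
      if (response.status === 429) {
        const header = response.headers.get("retry-after");
        const retryAfterSeconds = header === null ? undefined : Number(header);
        return { kind: "rate_limited", retryAfterSeconds, attempts };
      }
      return { kind: "ok", response, attempts };
    } catch (error) {
      lastError = error; // timeout or network error: fall through to the next attempt
    } finally {
      clearTimeout(timer);
    }
  }
  return { kind: "failed", error: lastError, attempts };
}
```

Logging the outcome kind per colo would show whether LHR requests are timing out, failing outright, or being rate limited.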
It looks like there's some internal tools there too like a ratelimit event you can subscribe to and "rejectOnRateLimit"
another interesting clue: LHR seems to be fine now
this comes about an hour after i stopped spawning any DOs there
which could be consistent with the idea of a ratelimit of some kind
to my eyes, i don't see a code path where we would retry forever. but i'll at least turn on logging for when we're getting 'rateLimited'
i was setting up worker for saas 2 days ago and i ran into an error page and to my surprise the error page said lhr london
i'm in us east
refreshing the page returned me to my regular colo
seems like there are weird routing issues happening ?