DO Placement Debugging

anyone have ideas why my DOs are being created in England (LHR) when I'm 6000 miles away in california?
70 Replies
Hard@Work
Hard@Work•15h ago
Actually, let's move to a thread, @Avi What's the colo your Worker is in?
Avi
AviOP•15h ago
wouldn't that vary per request?
and i should also note i have a 'router' worker in front of the DO worker:
worker (router) -> worker (for DO) -> DO
Hard@Work
Hard@Work•15h ago
Sorry, I meant for a specific case this is happening in, where is the DO Worker at?
Avi
AviOP•15h ago
ok yeah i'm like 80% sure LHR DOs are broken for me
Hard@Work
Hard@Work•15h ago
I assume you aren't using a location hint, right?
Avi
AviOP•15h ago
or rather, for all our users
correct
what i'm observing is that DOs created in LHR essentially cannot call Discord's APIs but DOs created (so far) anywhere else can
Hard@Work
Hard@Work•15h ago
But where is the DO Worker at?
Avi
AviOP•15h ago
ohh i see what you're asking
i'm not sure, i'll have to redeploy to add another cdn-cgi/trace proxy
Hard@Work
Hard@Work•15h ago
Because where the DO spawns depends on where the original Worker is at
So for example, even if you are in California, if the DO Worker was in LHR, it would probably spawn a DO there too
Avi
AviOP•15h ago
that makes sense perhaps, also, i SHOULD be using a location hint
Hard@Work
Hard@Work•15h ago
And also, how are you getting the DO ID?
Avi
AviOP•15h ago
because now that i think about it, even if the router worker runs near me, that doesn't mean the router will cause the DO worker to be created in the same place
idFromName
Hard@Work
Hard@Work•15h ago
IIRC generally Workers invoked via Service Bindings are on the same metal, but if you are using Smart Placement on the DO Worker, it may not be
What is the name in this case? Could it be that the DO was originally created by a request from LHR, and then you reused that name somewhere else?
Avi
AviOP•15h ago
to clarify, the router worker talks to the DO worker over plain old HTTP
Chaika
Chaika•15h ago
There's been a ton of rerouting for North America for any plan besides Ent/Argo mid-day US for the past few days. For LAX/Los Angeles, there's like a 60-70% chance for some hours on free plan that your request gets thrown for processing somewhere in Europe, and the Durable Object is created from wherever it's thrown to. Just my 2c
Avi
AviOP•15h ago
that certainly matches what i'm experiencing
it doesn't explain why my DOs in LHR would be nonfunctional
happens with brand-new names though
Hard@Work
Hard@Work•15h ago
Gotcha
Avi
AviOP•15h ago
AMS is ok, MAD is ok
LHR -> dead
Hard@Work
Hard@Work•15h ago
What is Discord responding with?
Chaika
Chaika•15h ago
Discord's backend is known to be just in us-east, but obviously shouldn't fail for requests from other places. All you said so far is "essentially cannot call", is there anything more?
Avi
AviOP•15h ago
to be clear, 'dead' doesn't mean i can't talk to the DO
just that these external API calls are seemingly going nowhere
what i've been observing is that the requests either never return or return after like, a COMICAL amount of time
check this graph out
Avi
AviOP•15h ago
[image: graph of max fetch times to Discord]
Avi
AviOP•15h ago
importantly: this is NOT average/p50 times
this is 'max' times
we measure, essentially like this:
then = now()
fetch(discord.com/api)
fetch_time = now() - then
those are like, comical numbers
3,300,000 milliseconds is nearly an hour
how does a fetch even return after that long?
those requests are being made from DOs. i'm not sure from which locations.
i should also note we have a 3000 millisecond timeout to abort our fetch calls to discord anyway
so this has to be something really weird
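A minimal sketch of the measurement described above, combined with the 3000 ms abort. The names `timedWithTimeout` and `TimedResult` are hypothetical and not from the thread's code; the callback is assumed to wrap the real call and pass the signal through to `fetch`.

```typescript
// Outcome of one timed call: elapsed wall time, plus either a value or a timeout flag.
interface TimedResult<T> {
  elapsedMs: number;
  value?: T;
  timedOut: boolean;
}

// Runs `op`, aborting its signal after `timeoutMs`, and reports how long it took.
// Note: this only bounds the call if `op` actually honors the signal it is given.
async function timedWithTimeout<T>(
  op: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<TimedResult<T>> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const start = performance.now();
  try {
    const value = await op(controller.signal);
    return { elapsedMs: performance.now() - start, value, timedOut: false };
  } catch (err) {
    // Treat a failure after our own abort as a timeout; rethrow anything else.
    if (controller.signal.aborted) {
      return { elapsedMs: performance.now() - start, timedOut: true };
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}

// Usage (assumed call shape):
//   const r = await timedWithTimeout((signal) => fetch(url, { signal }), 3000);
```

Logging `timedOut` next to `elapsedMs` would show whether a very large max comes from a call that ignored the abort or from time spent queued before the request was ever sent.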
Hard@Work
Hard@Work•15h ago
To confirm, is this only happening against Discord, or against all APIs?
Avi
AviOP•15h ago
i think only discord, because the fact i can even get logs from these DOs suggests that other http requests can get out - because those logs are going to BetterStack
Hard@Work
Hard@Work•15h ago
Might want to ask Discord then? The routing issue is weird, but if it is only Discord erroring, then it makes me think the issue is on their end
Avi
AviOP•15h ago
their eng team has been swearing up and down it's on my end 😵‍💫
i'm like, dude, we have the same infra provider xD
as a stopgap, i wonder if i can use some location hints here
this is in fact causing an outage for our product
Chaika
Chaika•15h ago
It wouldn't move existing DOs, you could either just hint everything to enam or try to be more specific and just hint away from weur
Avi
AviOP•15h ago
we don't really use DO storage (just for temporary caching to be resilient to infra-caused DO restarts), so i have no concern about existing DOs
i've just added some queryparam support to:
1) pass the DO worker trace
2) allow me to specify a locationhint
ok yeah looks like my DO workers are in LHR
that in and of itself makes no sense to me
surely cf has much closer colos to serve me with
Avi
AviOP•15h ago
how are you observing that routing pattern? btw i'm not on the free plan
Chaika
Chaika•15h ago
Sending requests from a ton of locations for all the plan levels and looking at the colo which runs it. I'm using transform rules colo.id but cf-ray works just as well, or request.cf.colo. The thing that throws requests if there's not enough capacity works on layer 4
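A small sketch of reading the colo inside a Worker, as described above. It prefers `request.cf.colo` and falls back to the suffix of the `cf-ray` header. `RequestLike`, `coloFromRay` and `observedColo` are hypothetical names, and the ray values in the test are made up.

```typescript
// Minimal structural view of the incoming request; a real Worker would pass its Request.
interface RequestLike {
  headers: { get(name: string): string | null };
  cf?: { colo?: string };
}

// cf-ray looks like "<ray id>-<colo>"; return the colo suffix, or null if absent.
function coloFromRay(cfRay: string | null): string | null {
  if (!cfRay) return null;
  const dash = cfRay.lastIndexOf("-");
  if (dash === -1 || dash === cfRay.length - 1) return null;
  return cfRay.slice(dash + 1).toUpperCase();
}

// Prefer request.cf.colo when present, otherwise parse the cf-ray header.
function observedColo(request: RequestLike): string | null {
  return request.cf?.colo ?? coloFromRay(request.headers.get("cf-ray"));
}
```

Returning this value in a debug header or log line from both the router Worker and the DO Worker would show where each hop actually ran.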
Avi
AviOP•15h ago
do you have this in a dashboard somewhere? also, i'm not using smart placement, wonder if we should be?
Chaika
Chaika•14h ago
not yet, one day I'll make a proper dash for it
Chaika
Chaika•14h ago
It happens around peak times US, requests get flung. This is free plan for example
[image: free-plan request routing]
Chaika
Chaika•14h ago
Someone from CF did ack it and say they're working on it https://discord.com/channels/595317990191398933/1408149529202786325/1408151174623920128
If it's just making a request to spawn a Durable Object it wouldn't help
Avi
AviOP•14h ago
i seeee
this is very much what i'm seeing too
ok, fine, so cf is routing a bunch of traffic to europe
that's fine with me, don't see why it should break my app
specifically in LHR
MAD and AMS seem to be fine
like the LHR thing is what makes no sense to me
what's so special about that datacenter?
Chaika
Chaika•14h ago
it's huge but so is ams nowadays
Avi
AviOP•14h ago
@Chaika do you know if i pass an enam locationHint, since CF is already routing me to LHR when presumably it knows i'm in the US, that it would do anything? like i basically just want a locationhint that avoids LHR
Chaika
Chaika•14h ago
As long as the Durable Object does not already exist, your hint will be respected regardless of where you are connecting to
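A sketch of passing a hint when getting the stub, assuming the `idFromName` flow mentioned earlier in the thread. `NamespaceLike` is a hypothetical stand-in for the real Durable Object namespace binding so the example is self-contained, and `getStubWithHint` is an illustrative helper name.

```typescript
// Location hint values accepted by Durable Objects.
type LocationHint =
  | "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";

// Stand-in for the namespace binding: only the two methods used here.
interface NamespaceLike<Id, Stub> {
  idFromName(name: string): Id;
  get(id: Id, options?: { locationHint?: LocationHint }): Stub;
}

// Derive the ID from a name and request the stub with a hint.
// The hint only matters the first time the object is created; it will not
// move an object that already exists.
function getStubWithHint<Id, Stub>(
  ns: NamespaceLike<Id, Stub>,
  name: string,
  hint: LocationHint = "enam",
): Stub {
  const id = ns.idFromName(name);
  return ns.get(id, { locationHint: hint });
}
```

Because existing objects stay where they are, testing a new hint needs a name that has never been used before.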
Avi
AviOP•14h ago
even if there's no capacity?
Chaika
Chaika•14h ago
Durable Object Capacity and normal request Routing capacity are two diff things
Also it's more just free/pro/biz plans are being shifted to Europe. Enterprise & Argo are mostly unaffected, and Spectrum/other stuff is entirely unaffected
(although in theory a hint is just a hint, it could be ignored as per docs, but I don't think I've ever heard of it being ignored in reality)
Avi
AviOP•14h ago
meaning you've never heard of a situation where a location hint is ignored due to capacity issues?
Chaika
Chaika•14h ago
Ignored for any reason
R2, D1, etc, operate using the same hints, and they're also perfectly consistent
and for good measure I'd throw on top too it's more specific locations within North America which are rerouting more than others
Avi
AviOP•14h ago
ok as an aside i can now confirm that by passing enam i'm creating "non-broken" DOs
i continue to be absolutely perplexed by the mystery of LHR seemingly being so broken
Chaika
Chaika•14h ago
I don't think Cloudflare has directly said anything but I've observed a number of instances of rerouting over the years. Sao Paulo in Brazil had capacity issues for a while, and it threw to the East Coast. Australian Locations had issues for a day due to Fortnite (true story), and it threw all the way to Europe
I think their "throw due to capacity" only picks locations in other regions
Avi
AviOP•14h ago
rerouting in violation of a provided locationHint? or just general rerouting like what we're seeing the past few days
Chaika
Chaika•14h ago
general request rerouting
yea idk that one's weird, it just doesn't make sense that some outbound fetches like for metrics would be fine but not Discord, if it was something with the Durable Object. Without any more info, I would almost assume it's something like Discord's load balancing stuff (which I guess is prob something GCP or CF) is messed up from the UK
Avi
AviOP•14h ago
ok i've just updated our DO worker to only spawn in enam and the issue is fixed
all i can say is: wtf
@Chaika question for you - what if I read the location information of the incoming request in my DO worker, and used that to inform the locationHint?
this is what i expected CF to do automatically, but i guess not
Chaika
Chaika•14h ago
yea that's what I was suggesting above, to just hint away from weur
Avi
AviOP•14h ago
for now i'm just hardcoding enam which seems to work fine for avoiding LHR
Chaika
Chaika•14h ago
request.cf has country/continent data, it's from GeoIP/ipinfo.io though
Avi
AviOP•14h ago
i see, and is there an easy way to map that to a locationHint?
Chaika
Chaika•14h ago
If you wanted to go purely by GeoIP data (which is by far the easiest), you could just look at latitude/longitude and then map to the closest region, excluding some. That's probably the easiest way to do it most accurately. Otherwise, with just country/continent, you wouldn't be able to tell East and West US apart without further parsing of states (region field).
If you didn't care about west/east US, you could just force all of NA to ENAM, Europe to EEUR, and the rest are much broader (ex: OC -> OC, AS -> APAC)
Full list to handle is:
"AF": Africa
"AN": Antarctica
"AS": Asia
"EU": Europe
"NA": North America
"OC": Oceania
"SA": South America
"T1": Tor network
from https://developers.cloudflare.com/workers/runtime-apis/request/#incomingrequestcfproperties
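A sketch of the continent-only mapping suggested above, ignoring the east/west US split. The NA, EU, OC and AS entries follow the message; the SA -> sam and AF -> afr entries and the enam fallback for "AN", "T1" or a missing value are assumptions added for illustration. `hintForContinent` is a hypothetical helper name.

```typescript
// Location hint values accepted by Durable Objects.
type LocationHint =
  | "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";

// request.cf.continent code -> hint. Europe goes to eeur to hint away from weur.
const CONTINENT_TO_HINT: Record<string, LocationHint> = {
  NA: "enam",
  EU: "eeur",
  OC: "oc",
  AS: "apac",
  SA: "sam", // assumption: not stated in the thread
  AF: "afr", // assumption: not stated in the thread
};

// Map a continent code to a hint; unknown or missing codes (e.g. "AN", "T1")
// use the fallback, which defaults to enam as an assumption.
function hintForContinent(
  continent: string | undefined,
  fallback: LocationHint = "enam",
): LocationHint {
  if (continent === undefined) return fallback;
  return CONTINENT_TO_HINT[continent] ?? fallback;
}
```

In a Worker this would be called with `request.cf?.continent` and the result passed as the `locationHint` when getting the stub.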
Avi
AviOP•14h ago
oh thats perfect
Chaika
Chaika•14h ago
Yes it's partially crazy because you need to map colo ids or names back to their locations, if you went with the colo name idea lol
Avi
AviOP•14h ago
again, its surprising i even need to do this why doesn't cloudflare automatically put the DO in the "logical place" instead of LHR...
Chaika
Chaika•14h ago
It does tho
LHR is closest to the running Worker
Avi
AviOP•14h ago
oh, right well, why are my workers in LHR then? xD
Chaika
Chaika•14h ago
lol that's back to the earlier routing fun
Avi
AviOP•13h ago
wouldn't that result in explicit ratelimit responses from the API though?
right, what i'm seeing is just that the requests.... never return
Avi
AviOP•13h ago
[image]
Avi
AviOP•13h ago
also, that ratelimit is just for failed requests right?
and yet, that's what i observe
those are ms
const start = performance.now();
const authInfo = await this.oauth2.getCurrentAuthorizationInformation();
const end = performance.now();
Log.debug(
`[PERF] DiscordAPI.getCurrentAuthorizationInformation took ${end - start}ms`
);
that's the actual code
let me go read the discord.js source code
hmm, it does seem they do some fancy internal queueing which uses Date.now()
that makes me a bit nervous, maybe something about the way workers fuzzes the precision of things is causing a deadlock
still wouldn't explain what's so special about london
Chaika
Chaika•13h ago
hmm, didn't realize you were using a lib, could it be that it internally handles bad responses/rate limits/etc and retries/suppresses? could just be that it's more used/rate limited
Avi
AviOP•13h ago
it does, but i do have those explicitly configured: timeout after 3000 ms, max 3 retries
Avi
AviOP•13h ago
GitHub
discord.js/packages/rest/src/lib/handlers/SequentialHandler.ts at 5...
A powerful JavaScript library for interacting with the Discord API - discordjs/discord.js
Avi
AviOP•13h ago
i'm going to experiment with just calling the API with plain old fetch
Chaika
Chaika•13h ago
It looks like there's some internal tools there too like a ratelimit event you can subscribe to and "rejectOnRateLimit"
Avi
AviOP•13h ago
another interesting clue: LHR seems to be fine now
this comes about an hour after i stopped spawning any DOs there
which could be consistent with the idea of a ratelimit of some kind
to my eyes, i don't see a code path where we would retry forever. but i'll at least turn on logging for when we're getting 'rateLimited'
o7
o7•12h ago
i was setting up worker for saas 2 days ago and i ran into an error page, and to my surprise the error page said lhr london
i'm in us east
refreshing the page returned me to my regular colo
seems like there are weird routing issues happening?
