DO Placement Debugging

anyone have ideas why my DOs are being created in England (LHR) when I'm 6000 miles away in california?
Hard@Work
Hard@Work2mo ago
Actually, let's move to a thread, @Avi. What's the colo your Worker is in?
Avi
AviOP2mo ago
wouldn't that vary per request? and i should also note i have a 'router' worker in front of the DO worker: worker (router) -> worker (for DO) -> DO
Hard@Work
Hard@Work2mo ago
Sorry, I meant for a specific case this is happening in, where is the DO Worker at?
Avi
AviOP2mo ago
ok yeah i'm like 80% sure LHR DOs are broken for me
Hard@Work
Hard@Work2mo ago
I assume you aren't using a location hint, right?
Avi
AviOP2mo ago
or rather, for all our users. correct. what i'm observing is that DOs created in LHR essentially cannot call Discord's APIs, but DOs created (so far) anywhere else can
Hard@Work
Hard@Work2mo ago
But where is the DO Worker at?
Avi
AviOP2mo ago
ohh i see what you're asking. i'm not sure, i'll have to redeploy to add another cdn-cgi/trace proxy
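For reference, one common way to answer "which colo is the DO Worker running in" is to read the colo from Cloudflare's trace endpoint inside that Worker. A rough sketch, assuming the trace's colo reflects where the outbound fetch originates:

```ts
// Sketch of the "cdn-cgi/trace proxy" idea: from inside the (DO) Worker,
// fetch Cloudflare's trace endpoint and read its colo= line. Assumes the
// outbound fetch is handled by the same colo the Worker is running in.
async function whereAmI(): Promise<string> {
  const res = await fetch("https://www.cloudflare.com/cdn-cgi/trace");
  const text = await res.text();
  const match = text.match(/^colo=(.+)$/m);
  return match?.[1] ?? "unknown";
}
```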
Hard@Work
Hard@Work2mo ago
Because where the DO spawns depends on where the original Worker is at. So for example, even if you are in California, if the DO Worker was in LHR, it would probably spawn a DO there too
Avi
AviOP2mo ago
that makes sense, perhaps. also, i SHOULD be using a location hint
Hard@Work
Hard@Work2mo ago
And also, how are you getting the DO ID?
Avi
AviOP2mo ago
because now that i think about it, even if the router worker runs near me, that doesn't mean the router will cause the DO worker to be created in the same place. (and to answer your question: idFromName)
Hard@Work
Hard@Work2mo ago
IIRC generally Workers invoked via Service Bindings are on the same metal, but if you are using Smart Placement on the DO Worker, it may not be. What is the name in this case? Could it be that the DO was originally created by a request from LHR, and then you reused that name somewhere else?
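For context, a minimal sketch of the idFromName flow being discussed; the binding name DISCORD_DO and the "name" query parameter are assumptions for illustration:

```ts
// DO Worker sketch: derive a stable ID from a name and call the object.
export default {
  async fetch(request: Request, env: { DISCORD_DO: DurableObjectNamespace }): Promise<Response> {
    const name = new URL(request.url).searchParams.get("name") ?? "default";

    // idFromName is deterministic: the same name always maps to the same object,
    // so a name first hit by a request handled in LHR stays with that object.
    const id = env.DISCORD_DO.idFromName(name);
    const stub = env.DISCORD_DO.get(id);
    return stub.fetch(request);
  },
};
```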
Avi
AviOP2mo ago
to clarify, the router worker talks to the DO worker over plain old HTTP
Chaika
Chaika2mo ago
There's been a ton of rerouting for North America mid-day (US time) for the past few days, for any plan besides Ent/Argo. For LAS/Los Angeles, there's like a 60-70% chance for some hours on free plan that your request gets thrown for processing somewhere in Europe, and the Durable Object is created from wherever it's thrown to. Just my 2c
Avi
AviOP2mo ago
that certainly matches what i'm experiencing. it doesn't explain why my DOs in LHR would be nonfunctional. happens with brand-new names though
Hard@Work
Hard@Work2mo ago
Gotcha
Avi
AviOP2mo ago
AMS is ok, MAD is ok. LHR -> dead
Hard@Work
Hard@Work2mo ago
What is Discord responding with?
Chaika
Chaika2mo ago
Discord's backend is known to be just in us-east, but obviously shouldn't fail for requests from other places. All you said so far is "essentially cannot call", is there anything more?
Avi
AviOP2mo ago
to be clear, 'dead' doesn't mean i can't talk to the DO, just that these external API calls are seemingly going nowhere. what i've been observing is that the requests either never return or return after, like, a COMICAL amount of time. check this graph out
Avi
AviOP2mo ago
(image attached: graph of fetch times)
Avi
AviOP2mo ago
importantly: this is NOT average/p50 times, this is 'max' times. we measure, essentially, like this: then = now(); fetch(discord.com/api); fetch_time = now() - then. those are like, comical numbers. 3,300,000 milliseconds is nearly an hour. how does a fetch even return after that long? those requests are being made from DOs; i'm not sure from which locations. i should also note we have a 3000 millisecond timeout to abort our fetch calls to discord anyway, so this has to be something really weird
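A rough sketch of the measurement described above, including the 3000 ms abort; the endpoint path is illustrative, and AbortSignal.timeout is assumed to be available in the Workers runtime:

```ts
// Time an outbound fetch to Discord, aborting after 3000 ms.
async function timedDiscordFetch(): Promise<void> {
  const start = performance.now();
  try {
    const res = await fetch("https://discord.com/api/v10/gateway", {
      signal: AbortSignal.timeout(3_000),
    });
    console.log(`discord fetch ${res.status} took ${performance.now() - start}ms`);
  } catch (err) {
    // An abort here means the 3 s timeout fired; anything far slower than that
    // in the metrics would have to come from retries or queueing above this call.
    console.log(`discord fetch failed after ${performance.now() - start}ms`, err);
  }
}
```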
Hard@Work
Hard@Work2mo ago
To confirm, is this only happening against Discord, or against all APIs?
Avi
AviOP2mo ago
i think only discord; the fact i can even get logs from these DOs suggests that other http requests can get out, because those logs are going to BetterStack
Hard@Work
Hard@Work2mo ago
Might want to ask Discord then? The routing issue is weird, but if it is only Discord erroring, then it makes me think the issue is on their end
Avi
AviOP2mo ago
their eng team has been swearing up and down it's on my end 😵‍💫 i'm like, dude, we have the same infra provider xD. as a stopgap, i wonder if i can use some location hints here. this is in fact causing an outage for our product
Chaika
Chaika2mo ago
It wouldn't move existing DOs; you could either just hint everything to enam or try to be more specific and just hint away from weur
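For reference, the hint is passed when getting the stub; a minimal sketch assuming a binding named DISCORD_DO:

```ts
// Pin newly created Durable Objects to eastern North America. The hint only
// applies when the object does not exist yet; existing objects keep their
// current location.
function getPinnedStub(env: { DISCORD_DO: DurableObjectNamespace }, name: string): DurableObjectStub {
  const id = env.DISCORD_DO.idFromName(name);
  return env.DISCORD_DO.get(id, { locationHint: "enam" });
}
```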
Avi
AviOP2mo ago
we don't really use DO storage (just for temporary caching to be resilient to infra-caused DO restarts), so i have no concern about existing DOs. i've just added some queryparam support to: 1) pass the DO worker trace, 2) allow me to specify a locationHint. ok yeah, looks like my DO workers are in LHR. that in and of itself makes no sense to me, surely cf has much closer colos to serve me with
Avi
AviOP2mo ago
how are you observing that routing pattern? btw i'm not on the free plan
Chaika
Chaika2mo ago
Sending requests from a ton of locations for all the plan levels and looking at the colo which runs it. I'm using transform rules (colo.id), but cf-ray works just as well, or request.cf.colo. The thing that throws requests if there's not enough capacity works on layer 4
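A tiny sketch of the request.cf.colo approach mentioned here (the cf-ray response header carries the same colo code):

```ts
// Worker that reports which colo handled the request, via request.cf.colo.
export default {
  async fetch(request: Request): Promise<Response> {
    const colo = (request.cf as { colo?: string } | undefined)?.colo ?? "unknown";
    return new Response(JSON.stringify({ colo }), {
      headers: { "content-type": "application/json" },
    });
  },
};
```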
Avi
AviOP2mo ago
do you have this in a dashboard somewhere? also, i'm not using smart placement, wonder if we should be?
Chaika
Chaika2mo ago
not yet, one day I'll make a proper dash for it
Chaika
Chaika2mo ago
It happens around peak times US, requests get flung. This is free plan for example
(image attached: free-plan routing example)
Chaika
Chaika2mo ago
Someone from CF did ack it and say they're working on it: https://discord.com/channels/595317990191398933/1408149529202786325/1408151174623920128 As for Smart Placement: if it's just making a request to spawn a Durable Object, it wouldn't help
Avi
AviOP2mo ago
i seeee, this is very much what i'm seeing too. ok, fine, so cf is routing a bunch of traffic to europe. that's fine with me, don't see why it should break my app specifically in LHR. MAD and AMS seem to be fine. like, the LHR thing is what makes no sense to me. what's so special about that datacenter?
Chaika
Chaika2mo ago
it's huge but so is ams nowadays
Avi
AviOP2mo ago
@Chaika do you know whether, if i pass an enam locationHint, it would actually do anything, given CF is already routing me to LHR when presumably it knows i'm in the US? like, i basically just want a locationHint that avoids LHR
Chaika
Chaika2mo ago
As long as the Durable Object does not already exist, your hint will be respected regardless of where you are connecting to
Avi
AviOP2mo ago
even if there's no capacity?
Chaika
Chaika2mo ago
Durable Object capacity and normal request routing capacity are two diff things. Also, it's more just that free/pro/biz plans are being shifted to Europe. Enterprise & Argo are mostly unaffected, and Spectrum/other stuff is entirely unaffected (although in theory a hint is just a hint, it could be ignored as per docs, but I don't think I've ever heard of it being ignored in reality)
Avi
AviOP2mo ago
meaning you've never heard of a situation where a location hint is ignored due to capacity issues?
Chaika
Chaika2mo ago
Ignored for any reason. R2, D1, etc. operate using the same hints, and they're also perfectly consistent. and for good measure I'd throw on top: it's more that specific locations within North America are rerouting more than others
Avi
AviOP2mo ago
ok, as an aside, i can now confirm that by passing enam i'm creating "non-broken" DOs. i continue to be absolutely perplexed by the mystery of LHR seemingly being so broken
Chaika
Chaika2mo ago
I don't think Cloudflare has directly said anything, but I've observed a number of instances of rerouting over the years. Sao Paulo in Brazil had capacity issues for a while, and it threw to the East Coast. Australian locations had issues for a day due to Fortnite (true story), and it threw all the way to Europe. I think their "throw due to capacity" only picks locations in other regions
Avi
AviOP2mo ago
rerouting in violation of a provided locationHint? or just general rerouting like what we're seeing the past few days
Chaika
Chaika2mo ago
general request rerouting, yea. idk, that one's weird, it just doesn't make sense that some outbound fetches (like for metrics) would be fine but not Discord, if it was something with the Durable Object. Without any more info, I would almost assume it's something like Discord's load balancing stuff (which I guess is prob something GCP or CF) being messed up from the UK
Avi
AviOP2mo ago
ok, i've just updated our DO worker to only spawn in enam, and the issue is fixed. all i can say is: wtf. @Chaika question for you - what if I read the location information of the incoming request in my DO worker, and used that to inform the locationHint? this is what i expected CF to do automatically, but i guess not
Chaika
Chaika2mo ago
yea that's what I was suggesting above, to just hint away from weur
Avi
AviOP2mo ago
for now i'm just hardcoding enam which seems to work fine for avoiding LHR
Chaika
Chaika2mo ago
request.cf has country/continent data, it's from GeoIP/ipinfo.io though
Avi
AviOP2mo ago
i see, and is there an easy way to map that to a locationHint?
Chaika
Chaika2mo ago
If you wanted to go purely by GeoIP data (which is by far the easiest), you could just look at latitude/longitude and then map to the closest region, excluding some; that's probably the easiest way to do it most accurately. Otherwise, with just country/continent, you wouldn't be able to tell East and West US apart without further parsing of states (the region field). if you didn't care about west/east US, you could just force all of NA to ENAM, Europe to EEUR, and the rest are much broader (ex: OC -> OC, AS -> APAC). Full list to handle is:
"AF": Africa
"AN": Antarctica
"AS": Asia
"EU": Europe
"NA": North America
"OC": Oceania
"SA": South America
"T1": Tor network
from https://developers.cloudflare.com/workers/runtime-apis/request/#incomingrequestcfproperties
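A sketch of the continent-to-hint mapping outlined above; the fallback to "enam" for AN/T1/missing data is an assumption, not from the thread, and the binding name in the usage comment is hypothetical:

```ts
// Map request.cf.continent to a Durable Object location hint.
type LocationHint = "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";

function hintFromContinent(continent: string | undefined): LocationHint {
  switch (continent) {
    case "NA": return "enam"; // could split wnam/enam using the region field
    case "SA": return "sam";
    case "EU": return "eeur"; // deliberately away from weur
    case "AS": return "apac";
    case "OC": return "oc";
    case "AF": return "afr";
    default:   return "enam"; // "AN", "T1", or no GeoIP data
  }
}

// Usage inside the DO Worker (binding name DISCORD_DO is an assumption):
// const continent = (request.cf as { continent?: string } | undefined)?.continent;
// const stub = env.DISCORD_DO.get(
//   env.DISCORD_DO.idFromName(name),
//   { locationHint: hintFromContinent(continent) },
// );
```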
Avi
AviOP2mo ago
oh thats perfect
Chaika
Chaika2mo ago
Yes it's partially crazy because you need to map colo ids or names back to their locations, if you went with the colo name idea lol
Avi
AviOP2mo ago
again, it's surprising i even need to do this. why doesn't cloudflare automatically put the DO in the "logical place" instead of LHR...
Chaika
Chaika2mo ago
It does tho. LHR is closest to the running Worker
Avi
AviOP2mo ago
oh, right. well, why are my workers in LHR then? xD
Chaika
Chaika2mo ago
lol that's back to the earlier routing fun
Avi
AviOP2mo ago
wouldn't that result in explicit ratelimit responses from the API though? right, what i'm seeing is just that the requests.... never return
Avi
AviOP2mo ago
(image attached)
Avi
AviOP2mo ago
also, that ratelimit is just for failed requests right? and yet, that's what i observe. those are ms:
const start = performance.now();
const authInfo = await this.oauth2.getCurrentAuthorizationInformation();
const end = performance.now();
Log.debug(
`[PERF] DiscordAPI.getCurrentAuthorizationInformation took ${end - start}ms`
);
that's the actual code. let me go read the discord.js source code. hmm, it does seem they do some fancy internal queueing which uses Date.now(). that makes me a bit nervous, maybe something about the way workers fuzzes the precision of things is causing a deadlock. still wouldn't explain what's so special about london
Chaika
Chaika2mo ago
hmm, didn't realize you were using a lib, could it be that it internally handles bad responses/rate limits/etc and retries/suppresses? could just be that it's more used/rate limited
Avi
AviOP2mo ago
it does, but i do have those explicitly configured: timeout after 3000 ms, max 3 retries
Avi
AviOP2mo ago
(GitHub link: discord.js/packages/rest/src/lib/handlers/SequentialHandler.ts - discordjs/discord.js)
Avi
AviOP2mo ago
i'm going to experiment with just calling the API with plain old fetch
Chaika
Chaika2mo ago
It looks like there's some internal tools there too like a ratelimit event you can subscribe to and "rejectOnRateLimit"
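A hedged sketch of wiring those up with @discordjs/rest; option and event names follow its documented RESTOptions, but exact names and defaults may vary by version, and the token placeholder is illustrative. The 3000 ms timeout and 3 retries mirror the values mentioned above.

```ts
import { REST, RESTEvents } from "@discordjs/rest";

// Make rate limiting visible instead of letting the handler queue silently.
const rest = new REST({
  version: "10",
  timeout: 3_000,
  retries: 3,
  // Throw a RateLimitError instead of sleeping whenever any route is limited.
  rejectOnRateLimit: () => true,
}).setToken("<token from a Worker secret>");

// Log every rate limit the handler observes (route, limit, retry-after, global).
rest.on(RESTEvents.RateLimited, (info) => {
  console.log("[discord] rate limited", info);
});
```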
Avi
AviOP2mo ago
another interesting clue: LHR seems to be fine now. this comes about an hour after i stopped spawning any DOs there, which could be consistent with the idea of a ratelimit of some kind. to my eyes, i don't see a code path where we would retry forever, but i'll at least turn on logging for when we're getting 'rateLimited'. my best guess is that discord has IP-range-based rate limiting and either we or maybe some bad actors are triggering it. you mean here? https://discord.com/developers/docs/topics/rate-limits#invalid-request-limit-aka-cloudflare-bans the "guess" is about whether we are actually hitting that limit; there is no explicit indication given
Chaika
Chaika2mo ago
threw something together because I figure this'll get worse before it gets better (no insider knowledge, just guessing) https://delay.chaika.me/routing/ Can flip through the plans and see the difference. Basically not worth upgrading to Pro to get away from it
Cloudflare Routing Monitoring
See Cloudflare Routing, using Workers running on each plan returning static content.
Avi
AviOP2mo ago
super cool. it seems like for us, the LHR thing was a red herring of sorts - it's likely discord doing IP-based ratelimiting of cloudflare; unclear to me if that's us or other noisy tenants triggering it. i'm currently setting up a VPS with a dedicated IP to act as a proxy for discord's API, so all our worker traffic to discord will come from a static IP we control
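If the library is kept, one way to route everything through that static-IP proxy is @discordjs/rest's api base-URL option; the proxy hostname below is a placeholder, not the thread author's real proxy:

```ts
import { REST } from "@discordjs/rest";

// Point all Discord REST traffic at a self-hosted proxy with a static egress IP
// by overriding the API base URL.
const rest = new REST({
  version: "10",
  api: "https://discord-proxy.example.com/api",
}).setToken("<token from a Worker secret>");
```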
Chaika
Chaika2mo ago
Probably wasn't you, but yea Discord has always been iffy about traffic from Workers, a proxy is a good idea
Avi
AviOP2mo ago
it's hard to know. our scale is such that we might be right at that limit. we can easily spike well beyond 50 requests per second to discord, likely hundreds of RPS to discord during peak moments
Avi
AviOP2mo ago
(image attached)
Chaika
Chaika2mo ago
Discord's docs just say there's a 50 rps limit per bot token (falling back to ip) and an error limit of 10k per 10 minutes, no other mentions of global limits
Avi
AviOP2mo ago
yeah, but most of our requests don't come from bot tokens, they're from Bearer tokens. the concept of "Bot" has become very murky for discord. we're principally a "Discord Activity", but there are bot components to it
Chaika
Chaika2mo ago
specifically here it says:
"If no authorization header is provided, then the limit is applied to the IP address"
I'm guessing this is just a CF Ratelimit rule using the Auth header, but still, those would return errors (and very quickly, since they're CF level)
Avi
AviOP2mo ago
exactly. the fact we're getting essentially "hung requests" is what throws me off. i mean, if i were trying to ratelimit what i thought was malicious traffic, i too would hang the requests; it slows things down more because the caller doesn't know how long to wait
Chaika
Chaika2mo ago
afaik you typically do the opposite, end as quickly as possible. Cloudflare scales their protections that way, going up to ip jails. open connections are expensive. but anyway, if not something silly like that, the only thing that comes to mind is if it was retrying internally a ton of requests and then tripping over the concurrent requests limitation due to other requests piling up and being retried
Avi
AviOP2mo ago
hmm its possible
Avi
AviOP2mo ago
@Chaika
(image attached)
Avi
AviOP2mo ago
gross, but i'm doing it. (this is unrelated to the ratelimiting of course - for that i'm proxying everything through a VPS i control with a static IP.) but it's tragic to serve a request from california with a DO in london
Chaika
Chaika2mo ago
well if they're going to London anyway, going back just adds more. it's not a you thing but a north american free plan thing in general, with peak times rerouting requests: https://discord.com/channels/595317990191398933/1409539854747963523/1409695340910874657 it's a cdn level thing that just doesn't care about the origin behind it at all, afaik compute is compute. but yea it def sucks, I wish there was a bit more transparency with it
Avi
AviOP2mo ago
even for a websocket?
Chaika
Chaika2mo ago
If you pass a WS to a Durable Object from a Worker and just pass it through without messing with it, the worker on the machine dies but the machine keeps proxying it. so yea, you'd end up with LAS -> LON (machine proxying) -> DO
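For context, the pass-through being described is just forwarding the upgrade request to the DO stub; a minimal sketch with an assumed binding name:

```ts
// The Worker hands the WebSocket upgrade straight to the Durable Object stub.
// After the upgrade, the colo that ran this Worker (LON in the example) keeps
// proxying frames for the life of the connection.
export default {
  async fetch(request: Request, env: { DISCORD_DO: DurableObjectNamespace }): Promise<Response> {
    if (request.headers.get("Upgrade")?.toLowerCase() !== "websocket") {
      return new Response("expected a websocket upgrade", { status: 426 });
    }
    const id = env.DISCORD_DO.idFromName(new URL(request.url).pathname);
    return env.DISCORD_DO.get(id).fetch(request);
  },
};
```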
Avi
AviOP2mo ago
which is good, no? meaning the london hop goes away
Chaika
Chaika2mo ago
no. The london machine still proxies the connection, can't like upgrade out of band or something
Avi
AviOP2mo ago
hmm but once the connection is made, why would it need to go through the stub worker?
Chaika
Chaika2mo ago
The connection is established using the middle machine in London to proxy it. The protocol is just a simple http connection which upgrades to a tcp connection. There's no mechanism for like "ok, reconnect on this address"
Avi
AviOP2mo ago
hm, i see. so really i gain nothing with the location hint
Chaika
Chaika2mo ago
besides maybe avoiding whatever issues you were having in London before
Avi
AviOP2mo ago
only evidence i have suggests that those issues were simply discord or cloudflare blocking the LHR IP range: after i routed traffic away from london, other regions started getting blocked and london recovered. now i'm proxying all discord traffic via my own proxy running on DigitalOcean, with dedicated static IPs, and so far so good. @Chaika any idea how to debug a DO stub throwing a RangeError: Max call stack size exceeded? since I don't get a stacktrace across the worker<>DO boundary
Chaika
Chaika2mo ago
If it's from a Durable Object, the Durable Object should be able to report the error in live tail/obs logs/whatever else you use for logs within it
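One way to get a stack for that kind of error is to catch and log inside the DO before rethrowing; a sketch in which the class name and handler are placeholders:

```ts
// Log the full stack inside the Durable Object before rethrowing, so a
// RangeError raised in DO code shows up in live tail/observability logs
// instead of surfacing only as an opaque error on the calling Worker's stub.
export class DiscordSessionDO implements DurableObject {
  async fetch(request: Request): Promise<Response> {
    try {
      return await this.handle(request);
    } catch (err) {
      console.error("DO error:", err instanceof Error ? err.stack : err);
      throw err;
    }
  }

  private async handle(_request: Request): Promise<Response> {
    // ... the object's real logic would go here
    return new Response("ok");
  }
}
```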
Avi
AviOP2mo ago
how would I be able to catch an infinite recursion within the DO though? you know, i wonder if the error is actually with my observability code because it isn't showing up there and most of the codebase is wrapped in a try/catch
Chaika
Chaika2mo ago
true for your own obs code, but live tailing/obs logs see all exceptions thrown
Avi
AviOP2mo ago
@Chaika as a follow up here, making the discord API calls from digital ocean instead of from cloudflare completely eliminated the issue. i still don’t know why exactly, but presumably it has to be something to do with discord rate limiting traffic from cloudflare…
