DO Placement Debugging
anyone have ideas why my DOs are being created in England (LHR) when I'm 6000 miles away in california?
96 Replies
Actually, let's move to a thread, @Avi
What's the colo your Worker is in?
wouldn't that vary per request?
and i should also note i have a 'router' worker in front of the DO worker
worker (router) -> worker (for DO) -> DO
Sorry, I meant for a specific case this is happening in, where is the DO Worker at?
ok yeah i'm like 80% sure LHR DOs are broken for me
I assume you aren't using a location hint, right?
or rather, for all our users
correct
what i'm observing is that DOs created in LHR essentially cannot call Discord's APIs
but DOs created (so far) anywhere else can
But where is the DO Worker at?
ohh i see what you're asking
i'm not sure, i'll have to redeploy to add another cdn-cgi/trace proxy
Because where the DO spawns depends on where the original Worker is at
So for example, even if you are in California, if the DO Worker was in LHR, it would probably spawn a DO there too
that makes sense
perhaps, also, i SHOULD be using a location hint
And also, how are you getting the DO ID?
because now that i think about it, even if the router worker runs near me, that doesn't mean the router will cause the DO worker to be created in the same place
idFromName
IIRC generally Workers invoked via Service Bindings are on the same metal, but if you are using Smart Placement on the DO Worker, it may not be
What is the name in this case? Could it be that the DO was originally created by a request from LHR, and then you reused that name somewhere else?
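for reference, our creation path is basically just name -> id -> stub, roughly like this (USER_DO and the "room" param are placeholder names for the example, not our real code):
```ts
// minimal sketch of idFromName-based access: the DO gets instantiated near
// whichever colo runs this Worker the first time a given name is used
interface Env { USER_DO: DurableObjectNamespace } // placeholder binding name

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const name = new URL(request.url).searchParams.get("room") ?? "default";
    const id = env.USER_DO.idFromName(name);   // same name always maps to the same DO
    return env.USER_DO.get(id).fetch(request); // forward the request to the DO
  },
};
```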
to clarify, the router worker talks to the DO worker over plain old HTTP
There's been a ton of rerouting for North America for any plan besides Ent/Argo mid-day US for the past few days. For LAS/Los Angeles, there's like a 60-70% chance for some hours on free plan that your request gets thrown for processing somewhere in Europe, and the Durable Object is created from wherever it's thrown to. Just my 2c
that certainly matches what i'm experiencing
it doesn't explain why my DOs in LHR would be nonfunctional
happens with brand-new names though
Gotcha
AMS is ok, MAD is ok
LHR -> dead
What is Discord responding with?
Discord's backend is known to be just in us-east, but obviously shouldn't fail for requests from other places. All you said so far is "essentially cannot call", is there anything more?
to be clear, 'dead' doesn't mean i can't talk to the DO
just that these external API calls are seemingly going nowhere
what i've been observing is that the requests either never return or return after like, a COMICAL amount of time
check this graph out

importantly: these are NOT average/p50 times
these are the 'max' times we measure, essentially like this:
then = now()
fetch(discord.com/api)
fetch_time = now() - then
those are like, comical numbers
3,300,000 milliseconds is nearly an hour
how does a fetch even return after that long?
those requests are being made from DOs. i'm not sure from which locations.
i should also note we have a 3000 millisecond timeout to abort our fetch calls to discord anyway
so this has to be something really weird
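for reference, the call is roughly this (path shortened, but the 3s abort is real config, which is exactly why an hour-long "max" shouldn't even be possible):
```ts
// sketch of the timed + capped Discord call; AbortSignal.timeout should abort
// the fetch after 3000 ms, so multi-minute durations make no sense
async function timedDiscordFetch(path: string): Promise<number> {
  const started = Date.now();
  try {
    await fetch(`https://discord.com/api${path}`, {
      signal: AbortSignal.timeout(3_000), // hard 3s cap
    });
  } catch {
    // timeout or network error: we still record how long we waited
  }
  return Date.now() - started; // this is the value feeding the "max" metric
}
```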
To confirm, is this only happening against Discord, or against all APIs?
i think only discord, because the fact i can even get logs from these DOs suggests that other http requests can get out - because those logs are going to BetterStack
Might want to ask Discord then? The routing issue is weird, but if it is only Discord erroring, then it makes me think the issue is on their end
their eng team has been swearing up and down it's on my end 😵💫
i'm like, dude, we have the same infra provider xD
as as stopgap, i wonder if i can use some location hints here
this is in fact causing an outage for our product
It wouldn't move existing DOs, you could either just hint everything to enam or try to be more specific and just hint away from weur
we don't really use DO storage (just for temporary caching to be resilient to infra-caused DO restarts), so i have no concern about existing DOs
i've just added some queryparam support to:
1) pass the DO worker trace
2) allow me to specify a locationhint
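the locationhint part is roughly this (param and binding names here are made up for the example, not our exact code):
```ts
// sketch: let a ?hint= queryparam override where brand-new DOs get created
const HINTS = ["wnam", "enam", "sam", "weur", "eeur", "apac", "oc", "afr", "me"] as const;
type Hint = (typeof HINTS)[number];

interface Env { DISCORD_DO: DurableObjectNamespace } // placeholder binding

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const raw = url.searchParams.get("hint") ?? "";
    const hint = (HINTS as readonly string[]).includes(raw) ? (raw as Hint) : undefined;
    const id = env.DISCORD_DO.idFromName(url.searchParams.get("name") ?? "default");
    // locationHint only affects DOs that don't exist yet; existing ones stay put
    const stub = env.DISCORD_DO.get(id, hint ? { locationHint: hint } : undefined);
    return stub.fetch(request);
  },
};
```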
ok yeah looks like my DO workers are in LHR
that in and of itself makes no sense to me
surely cf has much closer colos to serve me with
how are you observing that routing pattern?
btw i'm not on the free plan
Sending requests from a ton of locations for all the plan levels and looking at the colo which runs it. I'm using transform rules colo.id but cf-ray works just as well, or request.cf.colo. The thing that throws requests if there's not enough capacity works on layer 4
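the Worker behind it is tiny, basically just echoing request.cf.colo, something like:
```ts
// minimal sketch: a Worker that reports which colo actually ran the request
export default {
  async fetch(request: Request): Promise<Response> {
    const colo = (request.cf as { colo?: string } | undefined)?.colo ?? "unknown";
    return new Response(colo, { headers: { "cache-control": "no-store" } });
  },
};
```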
do you have this in a dashboard somewhere?
also, i'm not using smart placement, wonder if we should be?
not yet, one day I'll make a proper dash for it
It happens around peak times US, requests get flung. This is free plan for example

Someone from CF did ack it and say they're working on it https://discord.com/channels/595317990191398933/1408149529202786325/1408151174623920128
If it's just making a request to spawn a Durable Object it wouldn't help
i seeee
this is very much what i'm seeing too
ok, fine, so cf is routing a bunch of traffic to europe
that's fine with me, don't see why it should break my app specifically in LHR
MAD and AMS seem to be fine
like the LHR thing is what makes no sense to me
what's so special about that datacenter?
it's huge but so is ams nowadays
@Chaika do you know if i pass an enam locationHint, since CF is already routing me to LHR when presumably it knows i'm in the US, that it would do anything?
like i basically just want a locationhint that avoids LHR
As long as the Durable Object does not already exist, your hint will be respected regardless of where you are connecting to
even if there's no capacity?
Durable Object Capacity and normal request Routing capacity are two diff things
Also it's more just free/pro/biz plans are being shifted to Europe. Enterprise & Argo are mostly unaffected, and Spectrum/other stuff is entirely unaffected
(although in theory a hint is just a hint, it could be ignored as per docs, but I don't think I've ever heard of it being ignored in reality)
meaning you've never heard of a situation where a location hint is ignored due to capacity issues?
Ignored for any reason
R2, D1, etc, operate using the same hints, and they're also perfectly consistent
and for good measure I'd throw on top that it's more specific locations within North America which are rerouting more than others
ok as an aside i can now confirm that by passing enam i'm creating "non-broken" DOs
i continue to be absolutely perplexed by the mystery of LHR seemingly being so broken
I don't think Cloudflare has directly said anything but I've observed a number of instances of rerouting over the years.
Sao Paulo in Brazil had capacity issues for a while, and it threw to the East Coast.
Australian Locations had issues for a day due to Fortnite (true story), and it threw all the way to Europe
I think their "throw due to capacity" only picks locations in other regions
rerouting in violation of a provided locationHint? or just general rerouting like what we're seeing the past few days
general request rerouting
yea idk that one's weird, it just doesn't make sense that some outbound fetches like for metrics would be fine but not Discord, if it was something with the Durable Object.
Without any more info, I would almost assume it's something like Discord's load balancing stuff (which I guess is prob something GCP or CF) is messed from the UK
ok i've just updated our DO worker to only spawn in enam and the issue is fixed
all i can say is: wtf
@Chaika question for you - what if I read the location information of the incoming request in my DO worker, and used that to inform the locationHint?
this is what i expected CF to do automatically, but i guess not
yea that's what I was suggesting above, to just hint away from weur
for now i'm just hardcoding enam which seems to work fine for avoiding LHR
request.cf has country/continent data, it's from GeoIP/ipinfo.io though
i see, and is there an easy way to map that to a locationHint?
If you wanted to go purely by GeoIP data (which is by far the easiest), you could just look at latitude/longitude and then map to the closest region excluding some
probably the easiest way to do that most accurately. Otherwise with just country/continent, you wouldn't be able to tell East and West US apart without further parsing of states (region field)
if you didn't care about west/east US, you could just force all of NA to ENAM, Europe to EEUR, and the rest are much broader (ex: OC -> OC, AS -> APAC)
Full list to handle is
"AF": Africa
"AN": Antarctica
"AS": Asia
"EU": Europe
"NA": North America
"OC": Oceania
"SA": South America
"T1": Tor network
from https://developers.cloudflare.com/workers/runtime-apis/request/#incomingrequestcfproperties
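so something like this would cover the coarse version (the hint values are the standard DO location hints; the mapping choices are just my guess at what you'd want, e.g. forcing NA to enam to stay away from weur):
```ts
// rough sketch: continent code from request.cf -> DO location hint
// (coarse on purpose; splitting wnam vs enam would need region/lat-long instead)
type Hint = "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";

function hintForContinent(continent?: string): Hint | undefined {
  switch (continent) {
    case "NA": return "enam";
    case "SA": return "sam";
    case "EU": return "eeur";
    case "AF": return "afr";
    case "AS": return "apac";
    case "OC": return "oc";
    default:   return undefined; // AN, T1, unknown -> let Cloudflare pick
  }
}

// usage in the DO Worker (binding name is a placeholder):
// const hint = hintForContinent((request.cf as any)?.continent);
// const stub = env.MY_DO.get(id, hint ? { locationHint: hint } : undefined);
```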
oh thats perfect
Yes it's partially crazy because you need to map colo ids or names back to their locations, if you went with the colo name idea lol
again, its surprising i even need to do this
why doesn't cloudflare automatically put the DO in the "logical place"
instead of LHR...
It does tho
LHR is closest to the running Worker
oh, right
well, why are my workers in LHR then? xD
lol that's back to the earlier routing fun
wouldn't that result in explicit ratelimit responses from the API though?
right, what i'm seeing is just that the requests.... never return

also, that ratelimit is just for failed requests right?
and yet, that's what i observe
those are ms
that's the actual code
let me go read the discord.js source code
hmm, it does seem they do some fancy internal queueing which uses Date.now()
that makes me a bit nervous, maybe something about the way workers fuzzes the precision of things is causing a deadlock
still wouldn't explain what's so special about london
hmm, didn't realize you were using a lib, could it be that it internally handles bad responses/rate limits/etc and retries/suppresses?
could just be that it's more used/rate limited
it does, but i do have those explicitly configured:
timeout after 3000 ms, max 3 retries
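i.e. roughly this (the gist, not our exact setup - i believe timeout/retries are the relevant @discordjs/rest options):
```ts
import { REST } from "@discordjs/rest";

declare const DISCORD_TOKEN: string; // placeholder: really comes from a Workers secret/binding

// sketch of the client config: 3s per-request cap and at most 3 retries,
// so nothing should be able to hang for close to an hour
const rest = new REST({
  timeout: 3_000,
  retries: 3,
}).setToken(DISCORD_TOKEN);
```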
apparently they have a pretty elaborate queueing system
discord.js/packages/rest/src/lib/handlers/SequentialHandler.ts at 5...
i'm going to experiment with just calling the API with plain old fetch
It looks like there's some internal tools there too like a ratelimit event you can subscribe to and "rejectOnRateLimit"
another interesting clue: LHR seems to be fine now
this comes about an hour after i stopped spawning any DOs there
which could be consistent with the idea of a ratelimit of some kind
to my eyes, i don't see a code path where we would retry forever. but i'll at least turn on logging for when we're getting 'rateLimited'
my best guess is that discord has IP range based rate limiting
and either we or maybe some bad actors are triggering it
you mean here? https://discord.com/developers/docs/topics/rate-limits#invalid-request-limit-aka-cloudflare-bans
the "guess" is about whether we are actually hitting that limit
there is no explicit indication given
threw something together because I figure this'll get worse before it gets better (no insider knowledge, just guessing)
https://delay.chaika.me/routing/
Can flip through the plans and see the difference. Basically not worth upgrading to Pro to get away from it
Cloudflare Routing Monitoring
See Cloudflare Routing, using Workers running on each plan returning static content.
super cool
it seems like for us, the LHR thing was a red herring of sorts - it's likely discord doing IP-based ratelimiting to cloudflare
unclear to me if that's us or other noisy tenants triggering it
im currently setting up a VPS with a dedicated IP to act as a proxy for discord's API
so all our worker traffic to discord will come from a static IP we control
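the worker-side change is tiny, basically just swapping the base URL (the proxy hostname here is made up):
```ts
// sketch: route all Discord REST calls through our own static-IP proxy instead
// of hitting discord.com directly from Workers egress IPs
const DISCORD_API = "https://discord.com/api";
const PROXY_BASE = "https://discord-proxy.example.com/api"; // placeholder hostname

const USE_PROXY = true; // flip off to call discord.com directly again
const base = USE_PROXY ? PROXY_BASE : DISCORD_API;

function discordFetch(path: string, init?: RequestInit): Promise<Response> {
  return fetch(`${base}${path}`, init);
}
```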
Probably wasn't you, but yea Discord has always been iffy about traffic from Workers, a proxy is a good idea
it's hard to know. our scale is such that we might be right at that limit. we can easily spike well beyond 50 requests per second to discord, likely hundreds of RPS during peak moments

Discord's docs just say there's a 50 rps limit per bot token (falling back to ip) and an error limit of 10k per 10 minutes, no other mentions of global limits
yeah, but most of our requests don't come from bot tokens, they're from Bearer tokens
the concept of "Bot" has become very murky for discord
we're principally a "Discord Activity", but there are bot components to it
specifically here it says
"If no authorization header is provided, then the limit is applied to the IP address"
I'm guessing this is just a CF Ratelimit rule using the Auth header but still, those would return errors (and very quickly, since they're CF Level)
exactly
the fact we're getting essentially "hung requests"
is what throws me off
i mean if i were trying to ratelimit what i thought was malicious traffic, i too would hang the requests
slows things down more because the caller doesn't know how long to wait
afaik you typically do the opposite, end as quickly as possible. Cloudflare scales their protections that way, going up to ip jails
open connections are expensive
but anyway if not something silly like that, the only thing that comes to mind is if it was retrying internally a ton of requests and then tripping over the concurrent requests limitation due to other requests piling up and being retried
hmm its possible
@Chaika

gross, but i'm doing it
(this is unrelated to the ratelimiting of course - for that i'm proxying everything through a VPS i control with a static IP)
but it's tragic to serve a request from california with a DO in london
well if they're going to London anyway going back just adds more
it's not a you thing but a north american free plan thing in general with peak times rerouting requests https://discord.com/channels/595317990191398933/1409539854747963523/1409695340910874657
it's a cdn level thing that just doesn't care about the origin behind it at all afaik
compute is compute
but yea it def sucks, I wish there was a bit more transparency with it
even for a websocket?
If you pass a WS to a Durable Object from a Worker and just pass it through without messing with it, the worker on the machine dies but the machine keeps proxying it
so yea, you'd end up with LAS -> LON (Machine proxying) -> DO
which is good, no?
meaning the london hop goes away
no
The london machine still proxies the connection, can't like upgrade out of band or something
hmm but once the connection is made, why would it need to go through the stub worker?
The connection is established using the middle machine in London to proxy it
The protocol is just a simple http connection which upgrades to a tcp connection. There's no mechanism for like "ok reconnect on this address"
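i.e. the pass-through is literally just handing the upgrade request to the stub, something like this (binding name made up):
```ts
// sketch: the Worker forwards the WebSocket upgrade to the DO; whatever colo ran
// this Worker keeps proxying the raw connection for the socket's lifetime
interface Env { SESSION_DO: DurableObjectNamespace } // placeholder binding

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.headers.get("Upgrade")?.toLowerCase() !== "websocket") {
      return new Response("expected a websocket", { status: 426 });
    }
    const id = env.SESSION_DO.idFromName(new URL(request.url).pathname);
    return env.SESSION_DO.get(id).fetch(request);
  },
};
```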
hm i see
so really i gain nothing with the location hint
besides maybe avoiding whatever issues you were having in London before
only evidence i have suggests that those issues were simply discord or cloudflare blocking the LHR IP range
after i routed traffic away from london, other regions started getting blocked and london recovered
now i'm proxying all discord traffic via my own proxy running on DigitalOcean, with dedicated static IPs
and so far so good
@Chaika any idea how to debug a DO stub throwing a RangeError: Max call stack size exceeded? since I don't get a stacktrace across the worker<>DO boundary?
If it's from a Durable Object, the Durable Object should be able to report the error in live tail/obs logs/whatever else you use for logs within it
how would I be able to catch an infinite recursion within the DO though?
you know, i wonder if the error is actually with my observability code
because it isn't showing up there
and most of the codebase is wrapped in a try/catch
true for your own obs code, but live tailing/obs logs sees all exceptions thrown
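fair - i'll wrap the DO entrypoint itself so anything thrown gets logged inside the DO before it crosses the boundary, roughly (class name / handler are placeholders):
```ts
// sketch: catch everything at the DO's fetch entrypoint so even a RangeError from
// runaway recursion shows up in the DO's own logs with a stack trace
export class MyDurableObject {
  async fetch(request: Request): Promise<Response> {
    try {
      return await this.handle(request);
    } catch (err) {
      console.error("DO error:", err instanceof Error ? err.stack : err);
      return new Response("internal error", { status: 500 });
    }
  }

  private async handle(_request: Request): Promise<Response> {
    return new Response("ok"); // placeholder for the real logic
  }
}
```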
@Chaika as a follow up here, making the discord API calls from digital ocean instead of from cloudflare completely eliminated the issue
i still don’t know why exactly, but presumably it has to be something to do with discord rate limiting traffic coming from cloudflare…