DO Placement Debugging
anyone have ideas why my DOs are being created in England (LHR) when I'm 6000 miles away in california?
96 Replies
Actually, let's move to a thread, @Avi
What's the colo your Worker is in?
wouldn't that vary per request?
and i should also note i have a 'router' worker in front of the DO worker
worker (router) -> worker (for DO) -> DO
Sorry, I meant for a specific case this is happening in, where is the DO Worker at?
ok yeah i'm like 80% sure LHR DOs are broken for me
I assume you aren't using a location hint, right?
or rather, for all our users
correct
what i'm observing is that DOs created in LHR essentially cannot call Discord's APIs
but DOs created (so far) anywhere else can
But where is the DO Worker at?
ohh i see what you're asking
i'm not sure, i'll have to redeploy to add another cdn-cgi/trace proxy
Because where the DO spawns depends on where the original Worker is at
So for example, even if you are in California, if the DO Worker was in LHR, it would probably spawn a DO there too
that makes sense
perhaps, also, i SHOULD be using a location hint
And also, how are you getting the DO ID?
because now that i think about it, even if the router worker runs near me, that doesn't mean the router will cause the DO worker to be created in the same place
idFromName
IIRC generally Workers invoked via Service Bindings are on the same metal, but if you are using Smart Placement on the DO Worker, it may not be
What is the name in this case? Could it be that the DO was originally created by a request from LHR, and then you reused that name somewhere else?
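for reference, our creation path is basically just name -> id -> stub, roughly like this (USER_DO and the "room" param are placeholder names for the example, not our real code):
```ts
// minimal sketch of idFromName-based access: the DO gets instantiated near
// whichever colo runs this Worker the first time a given name is used
interface Env { USER_DO: DurableObjectNamespace } // placeholder binding name

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const name = new URL(request.url).searchParams.get("room") ?? "default";
    const id = env.USER_DO.idFromName(name);   // same name always maps to the same DO
    return env.USER_DO.get(id).fetch(request); // forward the request to the DO
  },
};
```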
to clarify, the router worker talks to the DO worker over plain old HTTP
There's been a ton of rerouting for North America for any plan besides Ent/Argo mid-day US for the past few days. For LAS/Los Angeles, there's like a 60-70% chance for some hours on free plan that your request gets thrown for processing somewhere in Europe, and the Durable Object is created from wherever it's thrown to. Just my 2c
that certainly matches what i'm experiencing
it doesn't explain why my DOs in LHR would be nonfunctional
happens with brand-new names though
Gotcha
AMS is ok, MAD is ok
LHR -> dead
What is Discord responding with?
Discord's backend is known to be just in us-east, but obviously shouldn't fail for requests from other places. All you said so far is "essentially cannot call", is there anything more?
to be clear, 'dead' doesn't mean i can't talk to the DO
just that these external API calls are seemingly going nowhere
what i've been observing is that the requests either never return or return after like, a COMICAL amount of time
check this graph out

importantly: these are NOT average/p50 times
these are the 'max' times we measure, essentially like this:
then = now()
fetch(discord.com/api)
fetch_time = now() - then
those are like, comical numbers
3,300,000 milliseconds is nearly an hour
how does a fetch even return after that long?
those requests are being made from DOs. i'm not sure from which locations.
i should also note we have a 3000 millisecond timeout to abort our fetch calls to discord anyway
so this has to be something really weird
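for reference, the call is roughly this (path shortened, but the 3s abort is real config, which is exactly why an hour-long "max" shouldn't even be possible):
```ts
// sketch of the timed + capped Discord call; AbortSignal.timeout should abort
// the fetch after 3000 ms, so multi-minute durations make no sense
async function timedDiscordFetch(path: string): Promise<number> {
  const started = Date.now();
  try {
    await fetch(`https://discord.com/api${path}`, {
      signal: AbortSignal.timeout(3_000), // hard 3s cap
    });
  } catch {
    // timeout or network error: we still record how long we waited
  }
  return Date.now() - started; // this is the value feeding the "max" metric
}
```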
To confirm, is this only happening against Discord, or against all APIs?
i think only discord, because the fact i can even get logs from these DOs suggests that other http requests can get out - because those logs are going to BetterStack
Might want to ask Discord then? The routing issue is weird, but if it is only Discord erroring, then it makes me think the issue is on their end
their eng team has been swearing up and down it's on my end 😵💫
i'm like, dude, we have the same infra provider xD
as as stopgap, i wonder if i can use some location hints here
this is in fact causing an outage for our product
It wouldn't move existing DOs, you could either just hint everything to enam or try to be more specific and just hint away from weur
we don't really use DO storage (just for temporary caching to be resilient to infra-caused DO restarts), so i have no concern about existing DOs
i've just added some queryparam support to:
1) pass the DO worker trace
2) allow me to specify a locationhint
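the locationhint part is roughly this (param and binding names here are made up for the example, not our exact code):
```ts
// sketch: let a ?hint= queryparam override where brand-new DOs get created
const HINTS = ["wnam", "enam", "sam", "weur", "eeur", "apac", "oc", "afr", "me"] as const;
type Hint = (typeof HINTS)[number];

interface Env { DISCORD_DO: DurableObjectNamespace } // placeholder binding

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const raw = url.searchParams.get("hint") ?? "";
    const hint = (HINTS as readonly string[]).includes(raw) ? (raw as Hint) : undefined;
    const id = env.DISCORD_DO.idFromName(url.searchParams.get("name") ?? "default");
    // locationHint only affects DOs that don't exist yet; existing ones stay put
    const stub = env.DISCORD_DO.get(id, hint ? { locationHint: hint } : undefined);
    return stub.fetch(request);
  },
};
```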
ok yeah looks like my DO workers are in LHR
that in and of itself makes no sense to me
surely cf has much closer colos to serve me with
how are you observing that routing pattern?
btw i'm not on the free plan
Sending requests from a ton of locations for all the plan levels and looking at the colo which runs it. I'm using transform rules colo.id but cf-ray works just as well, or request.cf.colo. The thing that throws requests if there's not enough capacity works on layer 4
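the Worker behind it is tiny, basically just echoing request.cf.colo, something like:
```ts
// minimal sketch: a Worker that reports which colo actually ran the request
export default {
  async fetch(request: Request): Promise<Response> {
    const colo = (request.cf as { colo?: string } | undefined)?.colo ?? "unknown";
    return new Response(colo, { headers: { "cache-control": "no-store" } });
  },
};
```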
do you have this in a dashboard somewhere?
also, i'm not using smart placement, wonder if we should be?
not yet, one day I'll make a proper dash for it
It happens around peak times US, requests get flung. This is free plan for example

Someone from CF did ack it and say they're working on it https://discord.com/channels/595317990191398933/1408149529202786325/1408151174623920128
If it's just making a request to spawn a Durable Object it wouldn't help
i seeee
this is very much what i'm seeing too
ok, fine, so cf is routing a bunch of traffic to europe
that's fine with me, don't see why it should break my app specifically in LHR
MAD and AMS seem to be fine
like the LHR thing is what makes no sense to me
what's so special about that datacenter?
it's huge but so is ams nowadays
@Chaika do you know if i pass an enam locationHint, since CF is already routing me to LHR when presumably it knows i'm in the US, that it would do anything?
like i basically just want a locationhint that avoids LHR
As long as the Durable Object does not already exist, your hint will be respected regardless of where you are connecting to
even if there's no capacity?
Durable Object Capacity and normal request Routing capacity are two diff things
Also it's more just free/pro/biz plans are being shifted to Europe. Enterprise & Argo are mostly unaffected, and Spectrum/other stuff is entirely unaffected
(although in theory a hint is just a hint, it could be ignored as per docs, but I don't think I've ever heard of it being ignored in reality)
meaning you've never heard of a situation where a location hint is ignored due to capacity issues?
Ignored for any reason
R2, D1, etc, operate using the same hints, and they're also perfectly consistent
and for good measure I'd throw on top that it's more specific locations within North America which are rerouting more than others
ok as an aside i can now confirm that by passing enam i'm creating "non-broken" DOs
i continue to be absolutely perplexed by the mystery of LHR seemingly being so broken
I don't think Cloudflare has directly said anything but I've observed a number of instances of rerouting over the years.
Sao Paulo in Brazil had capacity issues for a while, and it threw to the East Coast.
Australian Locations had issues for a day due to Fortnite (true story), and it threw all the way to Europe
I think their "throw due to capacity" only picks locations in other regions
rerouting in violation of a provided locationHint? or just general rerouting like what we're seeing the past few days
general request rerouting
yea idk that one's weird, it just doesn't make sense that some outbound fetches like for metrics would be fine but not Discord, if it was something with the Durable Object.
Without any more info, I would almost assume it's something like Discord's load balancing stuff (which I guess is prob something GCP or CF) is messed from the UK
ok i've just updated our DO worker to only spawn in enam and the issue is fixed
all i can say is: wtf
@Chaika question for you - what if I read the location information of the incoming request in my DO worker, and used that to inform the locationHint?
this is what i expected CF to do automatically, but i guess not
yea that's what I was suggesting above, to just hint away from weur
for now i'm just hardcoding enam which seems to work fine for avoiding LHR
request.cf has country/continent data, it's from GeoIP/ipinfo.io though
i see, and is there an easy way to map that to a locationHint?
If you wanted to go purely by GeoIP data (which is by far the easiest), you could just look at latitude/longitude and then map to the closest region excluding some
probably the easiest way to do that most accurately. Otherwise with just country/continent, you wouldn't be able to tell East and West US apart without further parsing of states (region field)
if you didn't care about west/east US, you could just force all of NA to ENAM, Europe to EEUR, and the rest are much broader (ex: OC -> OC, AS -> APAC)
Full list to handle is
"AF": Africa
"AN": Antarctica
"AS": Asia
"EU": Europe
"NA": North America
"OC": Oceania
"SA": South America
"T1": Tor network
from https://developers.cloudflare.com/workers/runtime-apis/request/#incomingrequestcfproperties
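so something like this would cover the coarse version (the hint values are the standard DO location hints; the mapping choices are just my guess at what you'd want, e.g. forcing NA to enam to stay away from weur):
```ts
// rough sketch: continent code from request.cf -> DO location hint
// (coarse on purpose; splitting wnam vs enam would need region/lat-long instead)
type Hint = "wnam" | "enam" | "sam" | "weur" | "eeur" | "apac" | "oc" | "afr" | "me";

function hintForContinent(continent?: string): Hint | undefined {
  switch (continent) {
    case "NA": return "enam";
    case "SA": return "sam";
    case "EU": return "eeur";
    case "AF": return "afr";
    case "AS": return "apac";
    case "OC": return "oc";
    default:   return undefined; // AN, T1, unknown -> let Cloudflare pick
  }
}

// usage in the DO Worker (binding name is a placeholder):
// const hint = hintForContinent((request.cf as any)?.continent);
// const stub = env.MY_DO.get(id, hint ? { locationHint: hint } : undefined);
```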
oh thats perfect
Yes it's partially crazy because you need to map colo ids or names back to their locations, if you went with the colo name idea lol
again, its surprising i even need to do this
why doesn't cloudflare automatically put the DO in the "logical place"
instead of LHR...
It does tho
LHR is closest to the running Worker
oh, right
well, why are my workers in LHR then? xD
lol that's back to the earlier routing fun
wouldn't that result in explicit ratelimit responses from the API though?
right, what i'm seeing is just that the requests.... never return

also, that ratelimit is just for failed requests right?
and yet, that's what i observe
those are ms
that's the actual code
let me go read the discord.js source code
hmm, it does seem they do some fancy internal queueing which uses Date.now()
that makes me a bit nervous, maybe something about the way workers fuzzes the precision of things is causing a deadlock
still wouldn't explain what's so special about london
hmm, didn't realize you were using a lib, could it be that it internally handles bad responses/rate limits/etc and retries/suppresses?
could just be that it's more used/rate limited
it does, but i do have those explicitly configured:
timeout after 3000 ms, max 3 retries
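i.e. roughly this (the gist, not our exact setup - i believe timeout/retries are the relevant @discordjs/rest options):
```ts
import { REST } from "@discordjs/rest";

declare const DISCORD_TOKEN: string; // placeholder: really comes from a Workers secret/binding

// sketch of the client config: 3s per-request cap and at most 3 retries,
// so nothing should be able to hang for close to an hour
const rest = new REST({
  timeout: 3_000,
  retries: 3,
}).setToken(DISCORD_TOKEN);
```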
apparently they have a pretty elaborate queueing system
discord.js/packages/rest/src/lib/handlers/SequentialHandler.ts at 5...
i'm going to experiment with just calling the API with plain old fetch
It looks like there's some internal tools there too like a ratelimit event you can subscribe to and "rejectOnRateLimit"
another interesting clue: LHR seems to be fine now
this comes about an hour after i stopped spawning any DOs there
which could be consistent with the idea of a ratelimit of some kind
to my eyes, i don't see a code path where we would retry forever. but i'll at least turn on logging for when we're getting 'rateLimited'
my best guess is that discord has IP range based rate limiting
and either we or maybe some bad actors are triggering it
you mean here? https://discord.com/developers/docs/topics/rate-limits#invalid-request-limit-aka-cloudflare-bans
the "guess" is about whether we are actually hitting that limit
there is no explicit indication given
threw something together because I figure this'll get worse before it gets better (no insider knowledge, just guessing)
https://delay.chaika.me/routing/
Can flip through the plans and see the difference. Basically not worth upgrading to Pro to get away from it
Cloudflare Routing Monitoring
See Cloudflare Routing, using Workers running on each plan returning static content.
super cool
it seems like for us, the LHR thing was a red herring of sorts - it's likely discord doing IP-based ratelimiting to cloudflare
unclear to me if that's us or other noisy tenants triggering it
im currently setting up a VPS with a dedicated IP to act as a proxy for discord's API
so all our worker traffic to discord will come from a static IP we control
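the worker-side change is tiny, basically just swapping the base URL (the proxy hostname here is made up):
```ts
// sketch: route all Discord REST calls through our own static-IP proxy instead
// of hitting discord.com directly from Workers egress IPs
const DISCORD_API = "https://discord.com/api";
const PROXY_BASE = "https://discord-proxy.example.com/api"; // placeholder hostname

const USE_PROXY = true; // flip off to call discord.com directly again
const base = USE_PROXY ? PROXY_BASE : DISCORD_API;

function discordFetch(path: string, init?: RequestInit): Promise<Response> {
  return fetch(`${base}${path}`, init);
}
```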
Probably wasn't you, but yea Discord has always been iffy about traffic from Workers, a proxy is a good idea
it's hard to know. our scale is such that we might be right at that limit. we can easily spike well beyond 50 requests per second to discord, likely hundreds of RPS during peak moments

Discord's docs just say there's a 50 rps limit per bot token (falling back to ip) and an error limit of 10k per 10 minutes, no other mentions of global limits
yeah, but most of our requests don't come from bot tokens, they're from Bearer tokens
the concept of "Bot" has become very murky for discord
we're principally a "Discord Activity", but there are bot components to it
specifically here it says
"If no authorization header is provided, then the limit is applied to the IP address"
I'm guessing this is just a CF Ratelimit rule using the Auth header but still, those would return errors (and very quickly, since they're CF Level)
exactly
the fact we're getting essentially "hung requests"
is what throws me off
i mean if i were trying to ratelimit what i thought was malicious traffic, i too would hang the requests
slows things down more because the caller doesn't know how long to wait
afaik you typically do the opposite, end as quickly as possible. Cloudflare scales their protections that way, going up to ip jails
open connections are expensive
but anyway if not something silly like that, the only thing that comes to mind is if it was retrying internally a ton of requests and then tripping over the concurrent requests limitation due to other requests piling up and being retried
hmm its possible
@Chaika

gross, but i'm doing it
(this is unrelated to the ratelimiting of course - for that i'm proxying everything through a VPS i control with a static IP)
but it's tragic to serve a request from california with a DO in london
well if they're going to London anyway going back just adds more
it's not a you thing but a north american free plan thing in general with peak times rerouting requests https://discord.com/channels/595317990191398933/1409539854747963523/1409695340910874657
it's a cdn level thing that just doesn't care about the origin behind it at all afaik
compute is compute
but yea it def sucks, I wish there was a bit more transparency with it
even for a websocket?
If you pass a WS to a Durable Object from a Worker and just pass it through without messing with it, the worker on the machine dies but the machine keeps proxying it
so yea, you'd end up with LAS -> LON (Machine proxying) -> DO
which is good, no?
meaning the london hop goes away
no
The london machine still proxies the connection, can't like upgrade out of band or something
hmm but once the connection is made, why would it need to go through the stub worker?
The connection is established using the middle machine in London to proxy it
The protocol is just a simple http connection which upgrades to a tcp connection. There's no mechanism for like "ok reconnect on this address"
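i.e. the pass-through is literally just handing the upgrade request to the stub, something like this (binding name made up):
```ts
// sketch: the Worker forwards the WebSocket upgrade to the DO; whatever colo ran
// this Worker keeps proxying the raw connection for the socket's lifetime
interface Env { SESSION_DO: DurableObjectNamespace } // placeholder binding

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.headers.get("Upgrade")?.toLowerCase() !== "websocket") {
      return new Response("expected a websocket", { status: 426 });
    }
    const id = env.SESSION_DO.idFromName(new URL(request.url).pathname);
    return env.SESSION_DO.get(id).fetch(request);
  },
};
```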
hm i see
so really i gain nothing with the location hint
besides maybe avoiding whatever issues you were having in London before
only evidence i have suggests that those issues were simply discord or cloudflare blocking the LHR IP range
after i routed traffic away from london, other regions started getting blocked and london recovered
now i'm proxying all discord traffic via my own proxy running on DigitalOcean, with dedicated static IPs
and so far so good
@Chaika any idea how to debug a DO stub throwing a RangeError: Max call stack size exceeded? since I don't get a stacktrace across the worker<>DO boundary?
If it's from a Durable Object, the Durable Object should be able to report the error in live tail/obs logs/whatever else you use for logs within it
how would I be able to catch an infinite recursion within the DO though?
you know, i wonder if the error is actually with my observability code
because it isn't showing up there
and most of the codebase is wrapped in a try/catch
true for your own obs code, but live tailing/obs logs sees all exceptions thrown
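fair - i'll wrap the DO entrypoint itself so anything thrown gets logged inside the DO before it crosses the boundary, roughly (class name / handler are placeholders):
```ts
// sketch: catch everything at the DO's fetch entrypoint so even a RangeError from
// runaway recursion shows up in the DO's own logs with a stack trace
export class MyDurableObject {
  async fetch(request: Request): Promise<Response> {
    try {
      return await this.handle(request);
    } catch (err) {
      console.error("DO error:", err instanceof Error ? err.stack : err);
      return new Response("internal error", { status: 500 });
    }
  }

  private async handle(_request: Request): Promise<Response> {
    return new Response("ok"); // placeholder for the real logic
  }
}
```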
@Chaika as a follow up here, making the discord API calls from digital ocean instead of from cloudflare completely eliminated the issue
i still don’t know why exactly, but presumably it has to be something to do with discord rate limiting traffic coming from cloudflare…