Is high p99 latency from Railway overhead expected?

I have a very simple http service (it doesn't do any IO and immediately returns a cached value from memory). It usually takes <15ms (over the private network), but I occasionally see 100 or even 200+ ms. Somewhat related to this, I also see 502s from time to time. I'm curious if anyone has had a similar experience. I don't see any sign in my logs that it is an application error, so I am wondering if these hiccups are from Railway.
39 Replies
Percy
Percy12mo ago
Project ID: e9b4f99b-4d10-4dae-9884-a2be0a1637eb
ChubbyAvocado
ChubbyAvocado12mo ago
e9b4f99b-4d10-4dae-9884-a2be0a1637eb
Brody
Brody12mo ago
the uptime kuma template didn't support testing services on the private network, are you sure you're testing the private network?
ChubbyAvocado
ChubbyAvocado12mo ago
I'm not using the public template. I'm pretty sure I'm using the private network (http://foobar.railway.internal:PORT/...). I'm actually testing both public and private endpoints, and I see that latency is noticeably lower on the private endpoint. My question is more around these unexpected spikes 🥲
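(For reference, a rough sketch of how these spikes could be quantified outside Uptime Kuma: sample both endpoints from another service on the same private network and print latency percentiles. The hostnames, port, and path below are placeholders, not the actual service from this thread.)

```python
# Rough sketch (not from the thread): sample both endpoints and print latency
# percentiles. Hostnames, port, and path are placeholders for the real service.
import statistics
import time

import requests

ENDPOINTS = {
    "private": "http://foobar.railway.internal:8080/cached-value",  # placeholder
    "public": "https://foobar.up.railway.app/cached-value",         # placeholder
}

def sample(url: str, n: int = 300) -> list[float]:
    """Return per-request latencies in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        try:
            requests.get(url, timeout=5)
        except requests.RequestException as exc:
            print(f"{url}: request failed: {exc}")  # failures show up here instead of as latency
            continue
        latencies.append((time.perf_counter() - start) * 1000)
        time.sleep(0.2)  # roughly the cadence of an uptime check; don't hammer the service
    return latencies

for name, url in ENDPOINTS.items():
    lat = sorted(sample(url))
    if not lat:
        continue
    p50 = lat[len(lat) // 2]
    p99 = lat[int(len(lat) * 0.99) - 1]
    print(f"{name}: p50={p50:.1f}ms  p99={p99:.1f}ms  max={lat[-1]:.1f}ms")
```

Note that each bare requests.get opens a fresh connection and does its own DNS lookup, so the private-endpoint numbers include resolver latency, which is exactly what the rest of the thread narrows in on.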
Brody
Brody12mo ago
those are from the internal dns server
Brody
Brody12mo ago
with this test i see spikes of up to 160ms
[image attachment]
Brody
Brody12mo ago
with just tcp, the biggest spike was 7ms
[image attachment]
ChubbyAvocado
ChubbyAvocado12mo ago
interesting... so the internal DNS lookup could be pretty slow
Brody
Brody12mo ago
yep, but as for the 502, that's not Railway, there is no gateway involved in the private network, it's a wireguard tunnel
ChubbyAvocado
ChubbyAvocado12mo ago
I see. I will look more into what's causing the 502s. As for the DNS lookup latency, I guess there is no way around it?
Brody
Brody12mo ago
use a caching dns resolver? though I don't know how well that would work, pretty sure the service would get a new ipv6 address on every deployment
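(A minimal sketch of the caching idea, assuming the dnspython package; the hostname is a placeholder. A short TTL limits how long a stale IPv6 address keeps being served after a redeploy, which is the drawback being pointed out.)

```python
# Sketch of an in-process DNS cache with a short TTL (assumes dnspython,
# i.e. `pip install dnspython`). Hostname below is a placeholder.
import time

import dns.resolver

_cache: dict[str, tuple[float, list[str]]] = {}
CACHE_TTL = 10.0  # seconds; kept short because a redeploy likely changes the IPv6 address

def resolve_cached(hostname: str) -> list[str]:
    now = time.monotonic()
    hit = _cache.get(hostname)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]  # cache hit: skip the internal DNS lookup entirely
    answer = dns.resolver.resolve(hostname, "AAAA")  # uses the container's configured resolver
    addrs = [rdata.address for rdata in answer]
    _cache[hostname] = (now, addrs)
    return addrs

# usage: connect to resolve_cached("foobar.railway.internal")[0] instead of the raw hostname
```

With a 10-second TTL the worst case after a redeploy is roughly 10 seconds of connections going to the old address, which is the trade-off under discussion here.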
ChubbyAvocado
ChubbyAvocado12mo ago
right 😦
Brody
Brody12mo ago
I'll ping char next time I see him online, maybe he will have some ideas
ChubbyAvocado
ChubbyAvocado12mo ago
thanks for looking into it! just for my own info, how did you get these numbers?
Brody
Brody12mo ago
uptime kuma, same as you
ChubbyAvocado
ChubbyAvocado12mo ago
oh, I didn't know that kuma also supports DNS checks lol, good to know!
Brody
Brody12mo ago
you'd need to specify the internal dns resolver fd12::10
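(For anyone reproducing this measurement outside Uptime Kuma, a rough dnspython sketch that times AAAA lookups directly against fd12::10; the hostname is a placeholder.)

```python
# Sketch: time AAAA lookups against the internal resolver, roughly what the
# Uptime Kuma DNS monitor does. Assumes dnspython; hostname is a placeholder.
import time

import dns.exception
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["fd12::10"]  # Railway's internal DNS resolver
resolver.lifetime = 2.0              # fail fast instead of hanging

samples: list[float] = []
for _ in range(100):
    start = time.perf_counter()
    try:
        resolver.resolve("foobar.railway.internal", "AAAA")
    except dns.exception.DNSException as exc:
        print(f"lookup failed: {exc}")  # the occasional failures discussed below
        continue
    samples.append((time.perf_counter() - start) * 1000)
    time.sleep(1)

if samples:
    samples.sort()
    p99 = samples[int(len(samples) * 0.99) - 1]
    print(f"min={samples[0]:.1f}ms  p99={p99:.1f}ms  max={samples[-1]:.1f}ms")
```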
ChubbyAvocado
ChubbyAvocado12mo ago
About the 502s: I added some logging to the reverse proxy that I'm using, and I think they could be due to DNS lookup failures.
Brody
Brody12mo ago
oh, did you say you were using a proxy before?
ChubbyAvocado
ChubbyAvocado12mo ago
I don't think I mentioned it earlier 🤔
Brody
Brody12mo ago
but yes, that would for sure cause a 502 from a proxy, in fact I've gotten dns lookup errors too
[image attachment]
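(One way a hand-rolled proxy hop could paper over a transient lookup failure, sketched under the same dnspython assumption as above: remember the last address that resolved successfully and reuse it when the resolver errors out, instead of surfacing a 502. Right after a redeploy the remembered address may be stale, so this is a trade-off rather than a fix.)

```python
# Sketch (same dnspython assumption as above): fall back to the last address
# that resolved successfully when the internal lookup fails transiently.
import dns.exception
import dns.resolver

_last_good: dict[str, str] = {}

_resolver = dns.resolver.Resolver(configure=False)
_resolver.nameservers = ["fd12::10"]
_resolver.lifetime = 1.0

def resolve_with_fallback(hostname: str) -> str:
    try:
        addr = _resolver.resolve(hostname, "AAAA")[0].address
        _last_good[hostname] = addr
        return addr
    except dns.exception.DNSException:
        if hostname in _last_good:
            return _last_good[hostname]  # transient resolver blip: reuse the old address
        raise                            # nothing remembered: this is what surfaces as a 502
```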
ChubbyAvocado
ChubbyAvocado12mo ago
yeah 😦 I have been running uptime kuma against internal DNS (in the way you described above) and uptime is 99.84% since I started running it (20+ hours). Maybe I will leave feedback in #🤗|feedback so that Railway folks can take a look if they think it's important haha.
Brody
Brody12mo ago
I'll ping the appropriate team member here, don't worry, I just haven't seen them online
ChubbyAvocado
ChubbyAvocado12mo ago
awesome thanks!
Brody
Brody12mo ago
@char8 - internal dns resolver, random lag spikes and occasional DNS lookup failures
char8
char812mo ago
thanks for flagging that, definitely odd, will look to see what's happening
Brody
Brody12mo ago
here are more accurate timestamps, they're in EST though
[image attachment]
char8
char812mo ago
the 08:03:59 timestamp matches a whole bunch of errors in our logs where that node couldn't talk to the control plane between 13:03:13 and 13:03:45 UTC; it's weirdly not reflected on other hosts. Seems to coincide with some of the backends cycling, but that process shouldn't drop traffic. Gonna add three items to my list to address:
1) tune the caching a bit so we flatten those P99s
2) serve cached entries during control plane downtime instead of the SERVFAIL
3) add some more monitoring to figure out why the control plane cycling stuff isn't as clean as it should be
(2) should mean the worst case for a control plane blip on a node with warm caches is slightly delayed DNS propagation for new deploys. Shouldn't be huge, will see if I can squeeze some of that into prod this week and update here.
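(For readers following along, a toy illustration of what item 2 means: serving an expired cache entry when the upstream lookup fails rather than returning SERVFAIL, similar in spirit to RFC 8767 "serve stale". This is not Railway's implementation, just the general shape of the idea.)

```python
# Toy serve-stale cache, NOT Railway's implementation: on upstream failure,
# return the expired cached entry instead of propagating an error.
import time
from typing import Callable

class ServeStaleCache:
    def __init__(self, lookup: Callable[[str], list[str]], ttl: float = 10.0):
        self._lookup = lookup  # stand-in for the control-plane query
        self._ttl = ttl
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def resolve(self, name: str) -> list[str]:
        now = time.monotonic()
        cached = self._entries.get(name)
        if cached and now - cached[0] < self._ttl:
            return cached[1]          # fresh hit, no upstream call
        try:
            addrs = self._lookup(name)
        except Exception:
            if cached:
                return cached[1]      # upstream unreachable: serve the stale entry
            raise                     # cold cache: nothing better than an error
        self._entries[name] = (now, addrs)
        return addrs
```

The worst case then matches what char8 describes: during a control-plane blip a warm cache serves slightly stale records, so a brand-new deploy sees delayed DNS propagation instead of failed lookups.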
Brody
Brody12mo ago
awesome, thanks for this!
char8
char812mo ago
np! thanks again for flagging
ChubbyAvocado
ChubbyAvocado12mo ago
Thanks! Any update on this?
Brody
Brody12mo ago
haha bad time to ask for an update, it's the weekend
ChubbyAvocado
ChubbyAvocado12mo ago
haha sorry -- i didn't mean for an immediate response 😆
char8
char812mo ago
sorry - I should've updated here. We're currently in the process of merging a whole bunch of networking updates, and I ended up pushing this patch into that series. I'm landing parts in batches so we don't change too many things at once. Will update once it's in. Can't give an exact ETA since when we get to it depends on how the stuff before it lands.
ChubbyAvocado
ChubbyAvocado12mo ago
Thanks for the update!
char8
char812mo ago
we pushed some patches about 24hrs ago. You should see a big decrease in those dropouts, and depending on how often the lookups are happening, a drop in latency. Overall latency can range from a few ms to a few hundred ms on average; we'll ship some more improvements to this over the coming months (can't give exact timelines on the latency side).
Brody
Brody12mo ago
can confirm, the internal dns resolution has been rock solid, zero errors
ChubbyAvocado
ChubbyAvocado12mo ago
Also confirming that internal DNS uptime has been great (past 24h is 100%) ❤️ For another status check against a reverse proxy, I am also observing that uptime has improved (past 24h 99.93% vs the last ~2 weeks at 99.85%).
Brody
Brody12mo ago
yeah, because even though it gave a different error, it was still the dns lookup that was failing, not the http connection