Is high p99 latency from Railway overhead expected?

I have a very simple http service (it doesn't do any IO and immediately returns a cached value from memory). It usually takes <15ms (over the private network), but I occasionally see 100 or even 200+ ms. Somewhat related to this, I also see 502s from time to time. I'm curious if anyone has had a similar experience. I don't see any sign in my logs that it is an application error, so I am wondering if these hiccups are from Railway.
39 Replies
Percy
Percy12mo ago
Project ID: e9b4f99b-4d10-4dae-9884-a2be0a1637eb
ChubbyAvocado
ChubbyAvocado12mo ago
e9b4f99b-4d10-4dae-9884-a2be0a1637eb
Brody
Brody12mo ago
the uptime kuma template didn't support testing services on the private network, are you sure you're testing the private network?
ChubbyAvocado
ChubbyAvocado12mo ago
I'm not using the public template. I'm pretty sure I'm using the private network (http://foobar.railway.internal:PORT/...). I'm actually testing both public and private endpoints, and I see that latency is noticeably lower on the private endpoint. My question is more around these unexpected spikes 🥲
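(For reference, a rough sketch of how these spikes could be quantified outside Uptime Kuma: sample both endpoints from another service on the same private network and print latency percentiles. The hostnames, port, and path below are placeholders, not the actual service from this thread.)

```python
# Rough sketch (not from the thread): sample both endpoints and print latency
# percentiles. Hostnames, port, and path are placeholders for the real service.
import statistics
import time

import requests

ENDPOINTS = {
    "private": "http://foobar.railway.internal:8080/cached-value",  # placeholder
    "public": "https://foobar.up.railway.app/cached-value",         # placeholder
}

def sample(url: str, n: int = 300) -> list[float]:
    """Return per-request latencies in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        try:
            requests.get(url, timeout=5)
        except requests.RequestException as exc:
            print(f"{url}: request failed: {exc}")  # failures show up here instead of as latency
            continue
        latencies.append((time.perf_counter() - start) * 1000)
        time.sleep(0.2)  # roughly the cadence of an uptime check; don't hammer the service
    return latencies

for name, url in ENDPOINTS.items():
    lat = sorted(sample(url))
    if not lat:
        continue
    p50 = lat[len(lat) // 2]
    p99 = lat[int(len(lat) * 0.99) - 1]
    print(f"{name}: p50={p50:.1f}ms  p99={p99:.1f}ms  max={lat[-1]:.1f}ms")
```

Note that each bare requests.get opens a fresh connection and does its own DNS lookup, so the private-endpoint numbers include resolver latency, which is exactly what the rest of the thread narrows in on.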
Brody
Brody12mo ago
those are from the internal dns server
Brody
Brody12mo ago
with this test i see spikes of up to 160ms
[image attachment]
Brody
Brody12mo ago
with just tcp, the biggest spike was 7ms
[image attachment]
ChubbyAvocado
ChubbyAvocado12mo ago
interesting... so the internal DNS lookup could be pretty slow
Brody
Brody12mo ago
yep, but as for the 502, that's not Railway, there is no gateway involved in the private network, it's a wireguard tunnel
ChubbyAvocado
ChubbyAvocado12mo ago
I see. I will look more into what's causing the 502s. As for the DNS lookup latency, I guess there is no way around it?
Brody
Brody12mo ago
use a caching dns resolver? though I don't know how well that would work, pretty sure the service would get a new ipv6 address on every deployment
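(A minimal sketch of the caching idea, assuming the dnspython package; the hostname is a placeholder. A short TTL limits how long a stale IPv6 address keeps being served after a redeploy, which is the drawback being pointed out.)

```python
# Sketch of an in-process DNS cache with a short TTL (assumes dnspython,
# i.e. `pip install dnspython`). Hostname below is a placeholder.
import time

import dns.resolver

_cache: dict[str, tuple[float, list[str]]] = {}
CACHE_TTL = 10.0  # seconds; kept short because a redeploy likely changes the IPv6 address

def resolve_cached(hostname: str) -> list[str]:
    now = time.monotonic()
    hit = _cache.get(hostname)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]  # cache hit: skip the internal DNS lookup entirely
    answer = dns.resolver.resolve(hostname, "AAAA")  # uses the container's configured resolver
    addrs = [rdata.address for rdata in answer]
    _cache[hostname] = (now, addrs)
    return addrs

# usage: connect to resolve_cached("foobar.railway.internal")[0] instead of the raw hostname
```

With a 10-second TTL the worst case after a redeploy is roughly 10 seconds of connections going to the old address, which is the trade-off under discussion here.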
ChubbyAvocado
ChubbyAvocado12mo ago
right 😦
Brody
Brody12mo ago
I'll ping char next time I see him online, maybe he will have some ideas
ChubbyAvocado
ChubbyAvocado12mo ago
thanks for looking into it! just for my own info, how did you get these numbers?
Brody
Brody12mo ago
uptime kuma, same as you
ChubbyAvocado
ChubbyAvocado12mo ago
oh, I didn't know that kuma also supports DNS checks lol, good to know!
Brody
Brody12mo ago
you'd need to specify the internal dns resolver fd12::10
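(For anyone reproducing this measurement outside Uptime Kuma, a rough dnspython sketch that times AAAA lookups directly against fd12::10; the hostname is a placeholder.)

```python
# Sketch: time AAAA lookups against the internal resolver, roughly what the
# Uptime Kuma DNS monitor does. Assumes dnspython; hostname is a placeholder.
import time

import dns.exception
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["fd12::10"]  # Railway's internal DNS resolver
resolver.lifetime = 2.0              # fail fast instead of hanging

samples: list[float] = []
for _ in range(100):
    start = time.perf_counter()
    try:
        resolver.resolve("foobar.railway.internal", "AAAA")
    except dns.exception.DNSException as exc:
        print(f"lookup failed: {exc}")  # the occasional failures discussed below
        continue
    samples.append((time.perf_counter() - start) * 1000)
    time.sleep(1)

if samples:
    samples.sort()
    p99 = samples[int(len(samples) * 0.99) - 1]
    print(f"min={samples[0]:.1f}ms  p99={p99:.1f}ms  max={samples[-1]:.1f}ms")
```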
ChubbyAvocado
ChubbyAvocado12mo ago
About the 502s: I added some logging to the reverse proxy that I'm using, and I think they could be due to DNS lookup failures.
Brody
Brody12mo ago
oh, did you say you were using a proxy before?
ChubbyAvocado
ChubbyAvocado12mo ago
I don't think I mentioned it earlier 🤔
Brody
Brody12mo ago
but yes, that would for sure cause a 502 from a proxy, in fact I've gotten dns lookup errors too
[image attachment]
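(One way a hand-rolled proxy hop could paper over a transient lookup failure, sketched under the same dnspython assumption as above: remember the last address that resolved successfully and reuse it when the resolver errors out, instead of surfacing a 502. Right after a redeploy the remembered address may be stale, so this is a trade-off rather than a fix.)

```python
# Sketch (same dnspython assumption as above): fall back to the last address
# that resolved successfully when the internal lookup fails transiently.
import dns.exception
import dns.resolver

_last_good: dict[str, str] = {}

_resolver = dns.resolver.Resolver(configure=False)
_resolver.nameservers = ["fd12::10"]
_resolver.lifetime = 1.0

def resolve_with_fallback(hostname: str) -> str:
    try:
        addr = _resolver.resolve(hostname, "AAAA")[0].address
        _last_good[hostname] = addr
        return addr
    except dns.exception.DNSException:
        if hostname in _last_good:
            return _last_good[hostname]  # transient resolver blip: reuse the old address
        raise                            # nothing remembered: this is what surfaces as a 502
```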
ChubbyAvocado
ChubbyAvocado12mo ago
yeah 😦 I have been running uptime kuma against internal DNS (in the way you described above) and uptime is 99.84% since I started running it (20+ hours). Maybe I will leave feedback in #🤗|feedback so that Railway folks can take a look if they think it's important haha.
Brody
Brody12mo ago
I'll ping the appropriate team member here, don't worry, I just haven't seen them online
ChubbyAvocado
ChubbyAvocado12mo ago
awesome thanks!
Brody
Brody12mo ago
@char8 - internal dns resolver, random lag spikes and occasional DNS lookup failures
char8
char812mo ago
thanks for flagging that, definitely odd, will look to see what's happening
Brody
Brody12mo ago
here are more accurate timestamps, they're in EST though
[image attachment]
char8
char812mo ago
the 08:03:59 timestamp matches a whole bunch of errors in our logs where that node couldn't talk to the control plane between 13:03:13 and 13:03:45 UTC; it's weirdly not reflected on other hosts. Seems to coincide with some of the backends cycling, but that process shouldn't drop traffic. Gonna add three items to my list to address:
1) tune the caching a bit so we flatten those P99s
2) serve cached entries during control plane downtime instead of the SERVFAIL
3) add some more monitoring to figure out why the control plane cycling stuff isn't as clean as it should be
(2) should mean the worst case for a control plane blip on a node with warm caches is slightly delayed DNS propagation for new deploys. Shouldn't be huge, will see if I can squeeze some of that into prod this week and update here.
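(For readers following along, a toy illustration of what item 2 means: serving an expired cache entry when the upstream lookup fails rather than returning SERVFAIL, similar in spirit to RFC 8767 "serve stale". This is not Railway's implementation, just the general shape of the idea.)

```python
# Toy serve-stale cache, NOT Railway's implementation: on upstream failure,
# return the expired cached entry instead of propagating an error.
import time
from typing import Callable

class ServeStaleCache:
    def __init__(self, lookup: Callable[[str], list[str]], ttl: float = 10.0):
        self._lookup = lookup  # stand-in for the control-plane query
        self._ttl = ttl
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def resolve(self, name: str) -> list[str]:
        now = time.monotonic()
        cached = self._entries.get(name)
        if cached and now - cached[0] < self._ttl:
            return cached[1]          # fresh hit, no upstream call
        try:
            addrs = self._lookup(name)
        except Exception:
            if cached:
                return cached[1]      # upstream unreachable: serve the stale entry
            raise                     # cold cache: nothing better than an error
        self._entries[name] = (now, addrs)
        return addrs
```

The worst case then matches what char8 describes: during a control-plane blip a warm cache serves slightly stale records, so a brand-new deploy sees delayed DNS propagation instead of failed lookups.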
Brody
Brody12mo ago
awesome, thanks for this!
char8
char812mo ago
np! thanks again for flagging
ChubbyAvocado
ChubbyAvocado12mo ago
Thanks! Any update on this?
Brody
Brody12mo ago
haha bad time to ask for an update, it's the weekend
ChubbyAvocado
ChubbyAvocado12mo ago
haha sorry -- i didn't mean for an immediate response 😆
char8
char812mo ago
sorry - I should've updated here. We're currently in the process of merging a whole bunch of networking updates, and I ended up pushing this patch into that series. I'm landing parts in batches so we don't change too many things at once. Will update once it's in. Can't give an exact ETA since when we get to it depends on how the stuff before it lands.
ChubbyAvocado
ChubbyAvocado12mo ago
Thanks for the update!
char8
char812mo ago
we pushed some patches about 24hrs ago. You should see a big decrease in those dropouts, and depending on how often the lookups are happening, a drop in latency. Overall latency can range from a few ms to a few hundred ms on average; we'll ship some more improvements to this over the coming months (can't give exact timelines on the latency side).
Brody
Brody12mo ago
can confirm, the internal dns resolution has been rock solid, zero errors
ChubbyAvocado
ChubbyAvocado12mo ago
Also confirming that internal DNS uptime has been great (past 24h is 100%) ❤️ For another status check against a reverse proxy, I am also observing that uptime has improved (past 24h 99.93% vs the last ~2 weeks at 99.85%).
Brody
Brody12mo ago
yeah, because even though it gave a different error, it was still the dns lookup that was failing, not the http connection