Is high p99 latency from Railway overhead expected?
I have a very simple HTTP service (it doesn't do any I/O and immediately returns a cached value from memory). It usually takes <15ms (through the private network), but I occasionally see 100ms or even 200+ms.
Somewhat related to this, I also see 502s from time to time. I'm curious if anyone has had a similar experience. I don't see any sign in my logs that it's an application error, so I'm wondering if these hiccups are from Railway.
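For context, a minimal sketch of how these p99 spikes can be measured from another service on the private network. The URL, port, and path are placeholders (not from this thread), and the timing loop is just an assumption about a reasonable probe interval:

```python
# Minimal p50/p99 latency probe against a private-network endpoint.
# URL/port/path are hypothetical placeholders; adjust to your own service.
import time
import urllib.request

URL = "http://foobar.railway.internal:8080/cached"  # placeholder endpoint
SAMPLES = 200

latencies_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    try:
        urllib.request.urlopen(URL, timeout=5).read()
    except Exception as exc:
        print("request failed:", exc)  # 502s and DNS failures surface here
        continue
    latencies_ms.append((time.perf_counter() - start) * 1000)
    time.sleep(0.5)

if latencies_ms:
    latencies_ms.sort()
    p50 = latencies_ms[len(latencies_ms) // 2]
    p99 = latencies_ms[min(len(latencies_ms) - 1, int(len(latencies_ms) * 0.99))]
    print(f"p50={p50:.1f}ms p99={p99:.1f}ms max={latencies_ms[-1]:.1f}ms")
```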
Project ID:
e9b4f99b-4d10-4dae-9884-a2be0a1637eb
the uptime kuma template didn't support testing services on the private network, are you sure you're testing the private network?
I'm not using the public template. I'm pretty sure I'm using the private network (http://foobar.railway.internal:PORT/...).
I'm actually testing both the public and private endpoints, and I see that latency is noticeably lower on the private endpoint.
My question is more around these unexpected spikes 🥲
those are from the internal DNS server
with this test i see spikes of up to 160ms
with just tcp, the biggest spike was 7ms
interesting.. so internal DNS lookup could be pretty slow
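A quick sketch of how to reproduce that DNS-vs-TCP comparison outside Uptime Kuma, timing the lookup and the connect separately. The hostname and port below are placeholders, and using an AAAA/IPv6-only lookup is an assumption based on the private network being IPv6:

```python
# Rough sketch: separate DNS-resolution time from TCP-connect time.
import socket
import time

HOST = "foobar.railway.internal"  # hypothetical internal hostname
PORT = 8080                       # placeholder port

t0 = time.perf_counter()
# AAAA lookup via the system resolver (the internal DNS on the private network)
addrinfo = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
t1 = time.perf_counter()

family, socktype, proto, _, sockaddr = addrinfo[0]
with socket.socket(family, socktype, proto) as s:
    s.settimeout(5)
    s.connect(sockaddr)
t2 = time.perf_counter()

print(f"dns={(t1 - t0) * 1000:.1f}ms tcp_connect={(t2 - t1) * 1000:.1f}ms")
```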
yep, but as for the 502, that's not Railway, there is no gateway involved in the private network, it's a WireGuard tunnel
I see. I will look more into what's causing the 502s.
For the DNS lookup latency, I guess there is no way around it?
use a caching dns resolver?
though i don't know how well that would work, pretty sure the service would get a new IPv6 address on every deployment
right 😦
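One way the caching-resolver idea could look in application code: a tiny in-process cache with a short TTL, so most requests skip the lookup while a redeploy (and its new IPv6 address) is still picked up quickly. This is a hedged sketch, not a Railway feature; the TTL value is a guess:

```python
# Toy in-process getaddrinfo cache with a short TTL to flatten lookup spikes.
import socket
import time

_CACHE: dict[tuple[str, int], tuple[float, list]] = {}
TTL_SECONDS = 30  # short enough that a redeploy's new address is picked up quickly

def cached_getaddrinfo(host: str, port: int):
    now = time.monotonic()
    hit = _CACHE.get((host, port))
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                      # fresh cache hit, no DNS round trip
    result = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    _CACHE[(host, port)] = (now, result)
    return result
```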
i'll ping char the next time i see him online, maybe he'll have some ideas
thanks for looking into it!
for my info, how did you get these numbers?
uptime kuma, same as you
oh didn't know that kuma also supports DNS lol
good to know!
you'd need to specify the internal dns resolver
fd12::10
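For anyone checking the resolver outside Uptime Kuma, a small sketch of querying fd12::10 directly with dnspython; the hostname is a placeholder and the timeouts are arbitrary:

```python
# Hedged sketch: time an AAAA lookup against the internal resolver directly.
import time
import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["fd12::10"]  # internal DNS resolver from the thread
resolver.timeout = 2
resolver.lifetime = 2

start = time.perf_counter()
answer = resolver.resolve("foobar.railway.internal", "AAAA")  # placeholder name
elapsed_ms = (time.perf_counter() - start) * 1000

for record in answer:
    print(record.address)
print(f"lookup took {elapsed_ms:.1f}ms")
```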
About the 502s: I added some logging to the reverse proxy I'm using, and I think they could be due to DNS lookup failures.
oh did you say you were using a proxy before?
I don't think i mentioned it earlier 🤔
but yes, that would for sure cause a 502 from a proxy, in fact i've gotten DNS lookup errors too
yeah 😦
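The kind of logging that separates the two failure modes can be quite small. This is an illustrative sketch, not the proxy's real hooks; the function name is made up:

```python
# Distinguish DNS lookup failures from other upstream errors when a proxy
# would return 502, so the cause can be attributed correctly in logs.
import socket
import urllib.error
import urllib.request

def fetch_upstream(url: str) -> bytes:
    try:
        return urllib.request.urlopen(url, timeout=5).read()
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.gaierror):
            print(f"502 cause: DNS lookup failed for {url}: {exc.reason}")
        else:
            print(f"502 cause: upstream error for {url}: {exc.reason}")
        raise
```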
I have been running Uptime Kuma against the internal DNS (in the way you described above) and uptime is 99.84% since I started running it (20+ hours).
Maybe I will leave feedback in #🤗|feedback so that Railway folks can take a look if they think it's important haha.
i'll ping the appropriate team member here, don't worry, i just haven't seen them online
awesome thanks!
@char8 - internal dns resolver, random lag spikes and occasional DNS lookup failures
thanks for flagging that, definitely odd, will look to see what's happening
here are more accurate timestamps, they're in EST though
the 08:03:59 timestamp matches with a whole bunch of errors in our logs where that node couldn't talk to the control plane between 13:03:13 and 13:03:45 UTC, it's weirdly not reflected on other hosts. Seems to coincide with some of the backends cycling - but that process shouldn't drop traffic. Gonna add three items to my list to address.
1) tune the caching a bit so we flatten those P99
2) serve cached entries during control plane downtime instead of the SERVFAIL
3) add some more monitoring to figure out why the control plane cycling stuff isn't as clean as it should be
(2) should mean the worst case for a control plane blip on a node with warm caches is slightly delayed DNS propagation for new deploys. Shouldn't be huge, will see if I can squeeze some of that into prod this week and update here.
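To make the serve-stale idea in item (2) concrete, here's a toy model of a resolver falling back to its last known answer when the control plane is unreachable instead of returning SERVFAIL. This is purely illustrative and not Railway's actual resolver code; all names are made up:

```python
# Toy serve-stale cache: prefer a fresh answer, but return the last known
# answer if the upstream (control plane) lookup fails.
import time

class StaleCache:
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.entries: dict[str, tuple[float, list[str]]] = {}

    def resolve(self, name: str, upstream_lookup) -> list[str]:
        now = time.monotonic()
        cached = self.entries.get(name)
        if cached and now - cached[0] < self.ttl:
            return cached[1]                  # fresh hit
        try:
            answer = upstream_lookup(name)    # ask the control plane
        except Exception:
            if cached:
                return cached[1]              # serve stale instead of SERVFAIL
            raise                             # nothing cached: fail as before
        self.entries[name] = (now, answer)
        return answer
```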
awesome, thanks for this!
np! thanks again for flagging
Thanks!
Any update on this?
haha bad time to ask for an update, it's the weekend
haha sorry -- i didn't mean for an immediate response 😆
sorry - I should've updated here. We're currently in the process of merging a whole bunch of networking updates, and I ended up pushing this patch into that series. I'm landing parts in batches so we don't change too many things at once. Will update once it's in.
Can't give an exact ETA since when we get to it depends on how the stuff before it lands.
Thanks for the update!
we pushed some patches about 24hrs ago. Should see a big decrease in those dropouts, and depending on how often the lookups are happening, a drop in latency. Overall latency can range from a few ms to a few hundred ms on average; we'll ship some more improvements to this over the coming months (can't give exact timelines on the latency work).
can confirm, no errors
the internal DNS resolution has been rock solid, zero errors
Also confirming that internal DNS uptime has been great! (24h is 100%) ❤️
For another status check against a reverse proxy, I am also observing that uptime has improved (past 24h: 99.93% vs last ~2 weeks: 99.85%).
yeah, because even though it gave a different error, it was still the DNS lookup that was failing, not the HTTP connection