Is Railway down?

I can't open my Railway project page, and I get error 401 and "database isn't ready"...
79 Replies
Percy 7mo ago
Project ID: N/A
Brody 7mo ago
a picture is worth 1000 words
Ryan! 7mo ago
(different customer) Our app looks healthy in the dashboard, but we're getting the error page too
ENT3I <3 7mo ago
nothing loads, but the app looks healthy in my dashboard too
Ryan! 7mo ago
My app appears to be back
ENT3I <3 7mo ago
same here, my app is loading again
Ryan! 7mo ago
I came here after getting an alert that the URL for my app was down. When I checked the logs, my app showed no issues.
ENT3I <3 7mo ago
exactly the same thing
Brody 7mo ago
did you get a 503 too?
ENT3I <3 7mo ago
I got URL errors coming up in my Slack, and when I checked my Railway project URL it didn't load at all. I got errors 401 and 503 and a "database isn't ready" message
Brody 7mo ago
might have been a little blip with the proxy, maybe related? https://discord.com/channels/713503345364697088/1171292427118706769/1176503689150730342 specifically:
We're currently in the process of merging a whole bunch of networking updates
ENT3I <3 7mo ago
looking at some error logs I can see this:
ERROR: The DNS server returned an error, perhaps the server is offline (item 0) getaddrinfo EAI_AGAIN postgres.railway.internal
in one of the errors while my app was "down"
Brody 7mo ago
i saw that too, though this is not the first time ive gotten that error
ENT3I <3 7mo ago
But how can that affect the private network between my services and my postgres database? Does that migration affect all networks?
Brody 7mo ago
its a dns lookup error for a private domain. i dont know what it affects, i know as much as what char said in that message
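The `getaddrinfo EAI_AGAIN` error quoted above is a temporary name-resolution failure, so one possible client-side mitigation is to retry the lookup with a short backoff. A minimal sketch in Python, assuming a Unix-like system. The function name `resolve_with_retry`, the attempt count, and the delays are illustrative choices, not anything Railway prescribes. The `resolver` and `sleep` parameters are there only so the logic can be exercised without real DNS, and port 5432 in the usage line is simply the Postgres default.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=4, base_delay=0.2,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying only on temporary DNS failures (EAI_AGAIN)."""
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # Any other DNS error will not fix itself, so re-raise it.
            # Also give up once the last attempt has been used.
            if exc.errno != socket.EAI_AGAIN or attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

Usage would look like `resolve_with_retry("postgres.railway.internal", 5432)`. This only papers over short blips on the client side; it does nothing for a longer outage.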
Ryan! 7mo ago
Thanks for the quick response on this. 👋
ENT3I <3 7mo ago
Yeah thanks. And yes Brody, I think the issue was on the proxy or smth. I'm seeing the same error message in a tool totally external to Railway
Brody 7mo ago
going to tag in @char8 here too, just for visibility
char8 7mo ago
this wasn't anything to do with the updates, though we're looking into some weird alarms between 17:14 - 17:30 UTC
Brody 7mo ago
had a feeling it wasnt related, but good to know!
char8 7mo ago
gonna dig deeper into what happened and what it affected (what looks to be DNS resolutions mainly) - will create a retro incident with the times. but yeah - a proper fix for more robust DNS is in the pipe
ENT3I <3 7mo ago
char8, where is Railway based, in what timezone? That way I can tell you exactly what time it started to fail and when it worked again
char8 7mo ago
UTC always works 🙏 , we had a routing propagation alarm fire 17:14-17:22 UTC [it only alerted us to it at 17:20 shortly before it resolved, so I gotta tweak something there]
ENT3I <3 7mo ago
Ok. For me it failed from 17:14:50 to 17:21:39 UTC. Does that make sense?
char8 7mo ago
yep that matches perfectly, thanks! that confirms what we're seeing. Looks like a network cut of some form.
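The match between the two windows can be checked mechanically. A small sketch, assuming both sets of times are on the same day (the thread gives no date, so the parser below uses an arbitrary fixed one) and treating the alarm times as exact to the minute:

```python
from datetime import datetime, timezone


def utc(hms):
    """Parse an HH:MM:SS string as a UTC time on an arbitrary fixed date."""
    return datetime.strptime(hms, "%H:%M:%S").replace(tzinfo=timezone.utc)


# Alarm window reported by char8 and the failure window the customer saw.
alarm_start, alarm_end = utc("17:14:00"), utc("17:22:00")
seen_start, seen_end = utc("17:14:50"), utc("17:21:39")

# The observation matches if it lies entirely inside the alarm window.
inside = alarm_start <= seen_start and seen_end <= alarm_end
duration = (seen_end - seen_start).total_seconds()

print(inside, duration)  # True 409.0
```

So the customer-side failure lasted 6 minutes 49 seconds and sits entirely within the 17:14-17:22 UTC alarm.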
ENT3I <3 7mo ago
Ok great. Thanks
char8 7mo ago
looks like this is recurring, gonna create an incident
ENT3I <3 7mo ago
yep, it is happening again
Brody 7mo ago
it sure is
char8 7mo ago
we isolated it to the host that ran the control plane 😞 , it just went 100% on I/O and locked up the server. Gonna fast track some of the patches that Brody spoke about that will mitigate these issues. https://railway.instatus.com/clp8myu281083bhohpt28odbp
Brody 7mo ago
yippee
ENT3I <3 7mo ago
it happened again a few minutes ago
Brody 7mo ago
can confirm
char8 7mo ago
6 mins ago? now recovered, right? for about 20 secs
Brody 7mo ago
yep
ENT3I <3 7mo ago
yes, for me at xx:27 (your time)
char8 7mo ago
I've got a patch that just got approved that I want to land in the morning (it's like 1am local for me). Should put an end to these blips and de-risk anything like the 5 min outage we had earlier - will update the threads when that's out.
it looks like we're getting a disproportionate number of lookup requests from a small selection of apps, and that's causing these resource spikes as the infra gets stressed. Fixes should mitigate that.
ENT3I <3 7mo ago
so the issue is caused by some small selection of apps? I hope mine is not one of them lol salute
Brody 7mo ago
um
char8 7mo ago
nah - whatever it is we're talking 1k+ rps of lookups 😅
Brody 7mo ago
that would be me ...im not joking
char8 7mo ago
1k rps? 😅 since nov 6? I mean props for load testing us
Brody 7mo ago
let me do some maths
nah my bad its not 1k rps, but i do make a batch of requests every minute
char8 7mo ago
yep that's not gonna make a dent. Also the 1krps thing should be fine, just a creaky v1 implementation on our side which needed to be more defensive. I'll update here once I test tomorrow. Might elect a subset of hosts to test on first and then roll out throughout the day.
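For a sense of scale, a batch sent once a minute averages out to a very small per-second rate. A rough sketch of the arithmetic; the batch size of 100 is a made-up number for illustration, since the actual batch size isn't given in the thread:

```python
def average_rps(requests_per_batch, batch_interval_seconds=60):
    """Average requests per second for a batch sent at a fixed interval."""
    return requests_per_batch / batch_interval_seconds


hypothetical_batch = 100  # assumption, not a figure from the thread
print(round(average_rps(hypothetical_batch), 2))  # 1.67

# Requests per one-minute batch needed to average 1k rps.
print(1000 * 60)  # 60000
```

This is an average only. A batch does arrive as a short burst, but it would take around 60,000 requests per minute to sustain 1k rps.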
Brody 7mo ago
sounds good!
ENT3I <3 7mo ago
thanks
char8 7mo ago
currently testing the patched dnsserver on a small set of hosts - wider rollout tomorrow if it looks good overnight. It might not help with the P99s as much, but it'll hopefully eliminate those occasional blips you see
ENT3I <3 7mo ago
hi. Not sure if it's related, but I experienced some timeouts in my Railway project
ENT3I <3 7mo ago
also I tried to re-deploy a running image for a service and it took 10 minutes, so I had to abort it and re-deploy. It freezes in this window sometimes too
ENT3I <3 7mo ago
I have to refresh a couple of times
char8 7mo ago
this would be something different
ENT3I <3 7mo ago
and then it loads
Yeti 7mo ago
funnily enough im running into this as well
char8 7mo ago
we'll take a look
ENT3I <3 7mo ago
thanks!
char8 7mo ago
yep incident created - cooper had just spotted it when I landed on that channel - we're on it!
ENT3I <3 7mo ago
Awesome. Thanks!!
I guess you guys are still working on it, but now the dashboard seems to be working ok while my service URL doesn't load
Brody 7mo ago
careful with the ping replies please
ENT3I <3 7mo ago
what do you mean?
Brody 7mo ago
by default you ping when you reply to a message, just something to keep in mind
ENT3I <3 7mo ago
seems up again now
char8 7mo ago
this is a different issue, the dashboard one is fixed. did you have a project ID?
ENT3I <3 7mo ago
53d90c0e-0d69-400d-8c78-aaa211f288a1
char8 7mo ago
we show the application failed to respond page when the app times out on the request. I see a bunch of CPU spikes on the process, wondering if it's hitting some timeout or a large request or smth.
ENT3I <3 7mo ago
Ok, maybe it was a coincidence and it hit a large request... or smth... I'll keep an eye on it, thanks for responding and solving the issues!
char8 7mo ago
yeah let us know if you see it again or see a pattern 🙏
ENT3I <3 7mo ago
Sure. I will. Have a good night! 🌜
Hi. I got two logs this morning:
One at 6:45 AM UTC (getaddrinfo EAI_AGAIN postgres.railway.internal)
Another one at 7:00 AM UTC (getaddrinfo EAI_AGAIN postgres.railway.internal)
I only got these two logs, so it probably got fixed after a couple of seconds. (Posting it here just in case you want to look into something)
Brody 7mo ago
you likely weren't on a host with the fixes so it would be expected that you'd still see this
char8 7mo ago
yep I've been running a test on a patched machine and an unpatched one - the patched one had no drops thus far, the unpatched one did around 7am UTC. so all good for wider rollout later today [mostly so I'm around to watch it once it's live]
wider rollout done ✅ , we've been running with it for about 24h. Should hopefully see far far fewer dropouts
Brody 7mo ago
can confirm there have been zero dropouts, rock solid now!