Is railway down?
I can't open my Railway project page, I get a 401 error, and my database isn't ready...
79 Replies
Project ID:
N/A
a picture is worth 1000 words
(different customer) Our app looks healthy in the dashboard, but we're getting the error page too
nothing loads
but my app looks healthy in my dashboard too
My app appears to be back
same here
my app is loading again
I came here after getting an alert that the URL for my app was down; when I checked the logs, my app showed no issues.
exactly the same thing
did you get a 503 too?
I got URL errors coming up in my Slack
and when I checked my Railway project URL it didn't load at all
got 401 and 503 errors and a "database isn't ready" message
might have been a little blip with the proxy
maybe related? https://discord.com/channels/713503345364697088/1171292427118706769/1176503689150730342
specifically
We're currently in the process of merging a whole bunch of networking updates
looking at some error logs
I can see this:
ERROR: The DNS server returned an error, perhaps the server is offline (item 0)
getaddrinfo EAI_AGAIN postgres.railway.internal
In one of the errors while my app was "down"
i saw that too, though this is not the first time ive gotten that error
But how can that affect the private network between my services and my Postgres database?
Does that migration affect all networks?
its a dns lookup error for a private domain
i dont know what it affects, i know as much as what char said in that message
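(side note for anyone hit by this: since getaddrinfo / EAI_AGAIN is a Node error code, here's a rough sketch of retrying the lookup a few times before giving up - just a mitigation you can do on your side, not an official fix)
```ts
// Rough sketch: retry transient DNS failures (EAI_AGAIN) with a short backoff
// before treating the private hostname as unreachable.
import { promises as dns } from "node:dns";

async function resolveWithRetry(hostname: string, attempts = 3): Promise<string> {
  for (let i = 0; i < attempts; i++) {
    try {
      const { address } = await dns.lookup(hostname);
      return address;
    } catch (err: any) {
      // EAI_AGAIN means the resolver couldn't answer right now (transient);
      // anything else, or the last attempt, is rethrown.
      if (err?.code !== "EAI_AGAIN" || i === attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 500 * (i + 1)));
    }
  }
  throw new Error("unreachable");
}

// e.g. resolve the private Postgres hostname before connecting
resolveWithRetry("postgres.railway.internal")
  .then((addr) => console.log("resolved to", addr))
  .catch((err) => console.error("still failing:", err.code));
```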
Thanks for the quick response on this. 👋
Yeah thanks. And yes Brody I think the issue was on the proxy or smth
I'm seeing the same error message
in a totally external tool
from railway
going to tag in @char8 here too, just for visibility
this wasn't anything to do with the updates, though we're looking into some weird alarms between 17:14 - 17:30 UTC
had a feeling it wasnt related, but good to know!
gonna dig deeper into what happened and how it affected (what looks to be DNS resolutions mainly) - will create a retro incident with the times.
but yeah - a proper fix for more robust DNS is in the pipe
char8, what timezone is Railway based in? So I can give you exactly what time it started to fail
and when it worked again
UTC always works 🙏 , we had a routing propagation alarm fire 17:14-17:22 UTC [it only alerted us to it at 17:20 shortly before it resolved, so I gotta tweak something there]
Ok. For me it failed from 17:14:50 to 17:21:39 UTC
does it make sense?
yep that matches perfectly thanks! that confirms what we're seeing. Looks like a network cut of some form.
Ok great. Thanks
looks like this is recurring
gonna create an incident
yep, it is happening again
it sure is
we isolated it to the host that ran the control plane 😞 , just went 100% on I/O and locked up the server. Gonna fast track some of the patches that Brody spoke about that will mitigate these issues.
https://railway.instatus.com/clp8myu281083bhohpt28odbp
yippee
it happened again a few minutes ago
can confirm
6 mins ago? now recovered right? for about 20 secs
yep
yes, for me at xx:27 (your time)
I've got a patch that just got approved that I want to land in the morning (it's like 1am local for me). Should put an end to these blips and de-risk anything like the 5 min outage we had earlier - will update the threads when that's out.
it looks like we're getting a disproportionate number of lookup requests from a small selection of apps, and that's causing these resource spikes as the infra gets stressed. Fixes should mitigate that.
so the issue is caused by some small selection of apps? I hope mine is not one of them lol
um
nah - whatever it is we're talking 1k+ rps of lookups 😅
1k rps? 😅 since nov 6? I mean props for load testing us
let me do some maths
nah my bad its not 1k rps, but i do make a batch of requests every minute
yep that's not gonna make a dent. Also the 1krps thing should be fine, just creaky v1 implementation on our side which needed to be more defensive. I'll update here once I test tomorrow. Might elect a subset of hosts to test on first and then rollout throughout the day.
sounds good!
thanks
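(another rough sketch, purely illustrative: if your own app resolves the internal hostname on every request, caching the resolved address for a short TTL keeps your lookup volume down - the hostname and TTL here are just placeholders)
```ts
// Sketch: tiny in-memory cache so repeated requests reuse a resolved address
// for a short TTL instead of hitting the DNS server every time.
import { promises as dns } from "node:dns";

const TTL_MS = 30_000; // assumption: 30s is fresh enough for internal hostnames
const cache = new Map<string, { address: string; expires: number }>();

async function cachedLookup(hostname: string): Promise<string> {
  const hit = cache.get(hostname);
  if (hit && hit.expires > Date.now()) return hit.address;

  const { address } = await dns.lookup(hostname);
  cache.set(hostname, { address, expires: Date.now() + TTL_MS });
  return address;
}
```
reusing a pooled database connection instead of reconnecting for each batch has the same effect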
currently testing the patched dnsserver on a small set of hosts - wider rollout tomorrow if it looks good overnight. It might not help with the P99s as much, but it'll hopefully eliminate those occasional blips you see
hi. Not sure if it has anything to do with this
but I experienced some timeouts in my Railway project
also I tried to re-deploy a running image for a service and it took 10 minutes and I had to abort it and re-deploy
it freezes in this window sometimes too
I have to refresh couple of times
this would be something different
and then it loads
funnily enough im running into this as well
we'll take a look
thanks!
yep incident created - cooper had just spotted it when I landed on that channel - we're on it!
Awesome. Thanks!!
I guess you guys are still working on it, but now the dashboard seems to be working ok, my service URL doesn't load though
careful with the ping replies please
what do you mean?
by default you ping when you reply to a message
just something to keep in mind
seems up again now
this is a different issue
the dashboard one is fixed
did you have a project ID?
53d90c0e-0d69-400d-8c78-aaa211f288a1
we show the application failed to respond page when the app times out on the request. I see a bunch of CPU spikes on the process, wondering if its hitting some timeout or a large request or smth.
Ok, maybe it was a coincidence and hit a large request... or smth... I'll keep an eye on it, thanks for responding and solving the issues!
yeah let us know if you see it again or see a pattern 🙏
Sure. I will. Have a good night! 🌜
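(if it helps spot a pattern later: a quick sketch, assuming an Express app, that logs any request slower than a few seconds so you can see whether the "failed to respond" page lines up with a slow endpoint or a large request)
```ts
// Sketch: log requests that take suspiciously long, to help spot whether
// timeouts line up with a particular endpoint or payload.
import express from "express";

const app = express();
const SLOW_MS = 5_000; // assumption: anything over 5s is worth a look

app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    const elapsed = Date.now() - start;
    if (elapsed > SLOW_MS) {
      console.warn(`slow request: ${req.method} ${req.originalUrl} took ${elapsed}ms`);
    }
  });
  next();
});

app.get("/", (_req, res) => res.send("ok"));

app.listen(Number(process.env.PORT) || 3000);
```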
Hi. I got two logs this morning.
One at UTC 6:45 AM (getaddrinfo EAI_AGAIN postgres.railway.internal)
Another one at UTC 7:00 AM (getaddrinfo EAI_AGAIN postgres.railway.internal)
I only got these two logs so probably it got fixed after a couple of seconds.
(Posting it here just in case you want to look something)
you likely weren't on a host with the fixes so it would be expected that you'd still see this
yep I've been running a test on a patched machine and an unpatched one - the patched one has had no drops thus far, the unpatched one did around 7am UTC
so all good for wider rollout later today [mostly so I'm around to watch it once its live]
wider rollout done ✅ , we've been running with it for about 24h. Should hopefully see far far fewer dropouts
can confirm there have been zero dropouts, rock solid now!