Restart doesn't actually restart

Ssdan5/23/2023
Seems like a service failed after it couldn't connect to a DB... i tried to restart but it never restarted. This has been an ongoing issue for a few weeks
Ssdan5/23/2023
97046871-517d-4af1-adfa-6b493cccebc3
Ssdan5/23/2023
usually just get around this issue by redeploying but my project takes 10-15min to build so sometimes an annoyance
ADA Dumb5/23/2023
I'm seeing a deployment above your crashed deployment. Looks to me like your restart was successful
Ssdan5/24/2023
the new deployment was successful yes, but i dont believe that failed container was ever restarted. i can try again sometime later and show if necessary
Ssdan5/24/2023
but from the screenshots you can see it says "restart successful" but on the ui it still shows a red box. no new blue box saying its restarting ever popped up -- had to manually deploy since restart didnt work
Ssdan5/27/2023
running into the same issue again
ADA Dumb5/27/2023
Hm very odd. Is your app active? Another user reported a similar issue where their app was in the crashed status visually but was still logging
Ssdan5/28/2023
yes -- i guess this now comes to semantics on what does restart/redeploy mean... i feel like i should be able to restart a running container and not have to redeploy(build and push that image) just to restart that service
Ssdan5/31/2023
hey guys this is a pretty serious issue, our build times are unfortunately very long (20 min tops) and it takes up to 20 minutes just to get back "online"
ADA Dumb5/31/2023
Why are you restarting your service that often? On code updates you should have a deployment running with previous code that’s shut down when your new code’s healthcheck is complete
ADA Dumb5/31/2023
this seems like user error
Ssdan5/31/2023
I have 100k+ users a day so it crashes our database almost every 12-18 hours. this crashes this particular instance so it shows up as "crashed"

it could be user error but i would like to just simply restart the container. meaning: delete it, run the same exact image w/ same config, and have it back up
ADA Dumb5/31/2023
this definitely sounds like user error. There’s got to be better ways to get around that. Also, with 100k+ users you should be on the teams plan
ADA Dumb5/31/2023
this is not a hobby project as the dev plan is meant for
Ssdan5/31/2023
not to mention I have other services on railway that simply hang and show up as "application not responding". would be nice to have healthchecks running hourly if thats possible?
Ssdan5/31/2023
alright sounds good. i use "we" too often, sorry its just me self funding.
ADA Dumb6/1/2023
Unfortunately that all sounds like user/code error. Afaik there’s no way to set up scheduled healthchecks, but if you join the teams plan you can discuss that with the team
Nngeloxyz6/1/2023
Hey @sdan - this is a bug on our end.
Nngeloxyz6/1/2023
With that said- is your app crashing or the DB crashing?
Ssdan6/1/2023
db running on google cloud, i found railway cant handle some stuff so moved most of my infra elsewhere
Nngeloxyz6/1/2023
Like vector or?
Nngeloxyz6/1/2023
Just a scale issue
Ssdan6/1/2023
yea
Nngeloxyz6/1/2023
yea to what
Nngeloxyz6/1/2023
😛
Ssdan6/1/2023
yea vector db and yea scale issue 🙂
Nngeloxyz6/1/2023
L
Ssdan6/1/2023
also have google cloud credits
Nngeloxyz6/1/2023
ok- so on your app, how many connections to the DB are you keeping open?
Ssdan6/1/2023
8 at a time probably
Nngeloxyz6/1/2023
What happens when you bump that up?
Ssdan6/1/2023
no clue honestly i just restart stuff whenever it goes down
Nngeloxyz6/1/2023
;-;
Ssdan6/1/2023
there are more issues because the vector db i am using is in beta and runs into race issues all the time
Nngeloxyz6/1/2023
so, you may wanna increase the number of connections
Nngeloxyz6/1/2023
actually wait
Nngeloxyz6/1/2023
can you decrease it?
Nngeloxyz6/1/2023
it will slow your app but might help with race
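(A rough sketch of what "decrease it" could look like in a Python app: cap how many DB connections are in use at once with a semaphore. `ConnectionLimiter` and the `connect` factory are made-up names for illustration, not part of any driver or of Railway. Most drivers ship their own pool with a max-size setting, which would be the first thing to try.)

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Caps how many DB connections may be in use at the same time."""

    def __init__(self, max_connections, connect):
        # One semaphore slot per allowed concurrent connection.
        self._slots = threading.BoundedSemaphore(max_connections)
        # `connect` is a hypothetical zero-argument factory that returns a
        # connection object from whatever DB client the app uses.
        self._connect = connect

    @contextmanager
    def connection(self, timeout=None):
        # Wait for a free slot; give up after `timeout` seconds if one is set.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free DB connection slot")
        try:
            conn = self._connect()
            try:
                yield conn
            finally:
                # Close the connection if the client exposes close().
                close = getattr(conn, "close", None)
                if callable(close):
                    close()
        finally:
            # Always hand the slot back, even if connect() or the caller failed.
            self._slots.release()
```

Usage would be something like `limiter = ConnectionLimiter(4, connect=my_connect)` and then `with limiter.connection() as conn: ...` around each query. Lower caps trade throughput for fewer concurrent writers hitting the DB.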
Nngeloxyz6/1/2023
also do you have a link to that vector DB?
Ssdan6/1/2023
yeah i have tried multiple things but ultimately i dont run most of my heavy workloads on railway. i just purely do reading on railway
Nngeloxyz6/1/2023
I know a guy there, we can chat
Ssdan6/1/2023
and i have probably already chatted with that guy haha. theyre rolling out a refactor next week so hoping that will solve it
Nngeloxyz6/1/2023
curious, why are you still on Railway then (aside from you being an ex-employee)
Nngeloxyz6/1/2023
what are we doing so right even when we seem to get things wrong
Ssdan6/1/2023
no easy way to run flask servers honestly
Ssdan6/1/2023
i do vercel for 99% of stuff but now need to interact with python and vercel is pretty bad at it
Nngeloxyz6/1/2023
you mean that Google Cloud Run's 99 steps isn't easy 😉
Nngeloxyz6/1/2023
anyway, gotcha- can you dump crash logs when the DB connections reset?
Nngeloxyz6/1/2023
I would have a service that uses the Railway API, monitors when the DB crashes, and just performs a restart ngl
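(A minimal sketch of that kind of watchdog, under some assumptions: `is_healthy` and `restart` are hypothetical callables you supply. `is_healthy` could ping the DB or the app, and `restart` would wrap whichever Railway API call triggers a restart. That call is deliberately not shown, since the thread doesn't name it.)

```python
import time


def watch(is_healthy, restart, failures_before_restart=3, interval=30.0,
          max_checks=None, sleep=time.sleep):
    """Poll is_healthy() and call restart() after N consecutive failures.

    Runs forever when max_checks is None; returns the number of restarts
    triggered otherwise. `sleep` is injectable so the loop can be tested.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = bool(is_healthy())
        except Exception:
            # A check that blows up counts as a failed check.
            ok = False
        if ok:
            failures = 0
        else:
            failures += 1
            if failures >= failures_before_restart:
                restart()
                restarts += 1
                # Start counting again so one outage doesn't restart every tick.
                failures = 0
        sleep(interval)
    return restarts
```

Requiring a few consecutive failures before restarting avoids bouncing the service on a single slow response.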
Nngeloxyz6/1/2023
in the long term, I am going to flag the UI bug to the team
Ssdan6/1/2023
google cloud is a mess for sure but its containable mess :). just docker up, docker down, docker remove, docker ps -a. and tailscale for networking and cloudflare for proxying.

i have reliable logs, stuff never hangs, and if it does i know exactly whats up. i can check htop, etc.

railway hangs and logs stop and stuff gets silently shut off. more often than not i wake up to a text from someone saying my stuff is down and railway still shows a green box which is frustrating.
Ssdan6/1/2023
railway api monitoring a db that is not running on railway is def. not railway's fault. its just reliable logging and making sure that if something crashes, it fully crashes. i think i turned off notifs for crashes which i will turn back on
Ssdan6/1/2023
also as prev. mentioned, having continuous health checks would be nice
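(Until something like that exists on the platform, a continuous check can be approximated from outside. A minimal sketch using only the Python standard library, assuming the service exposes some HTTP endpoint that returns a 2xx when it is alive. The `/health` path in the usage note is an assumption, not something from this thread.)

```python
import http.client
import urllib.error
import urllib.request


def check_health(url, timeout=5.0):
    """Return True if a GET to `url` answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts,
        # malformed responses, and malformed URLs.
        return False
```

Calling `check_health("https://your-app.example/health")` from a cron job or a small loop every few minutes would catch the "green box but not responding" case, since a hung app fails the timeout rather than reporting a crash.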
Ssdan6/1/2023
some logs
Ssdan6/1/2023
again this is entirely my error -- the db crashing should be handled on my end.
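(One way "handled on my end" could look: retry the DB connection with exponential backoff instead of letting the process die on the first failed connect. This is a sketch, not a drop-in fix. `connect` is a hypothetical zero-argument factory from whatever DB client is in use, and the broad `except Exception` should be narrowed to that client's connection error type.)

```python
import random
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # assumption: narrow this to the driver's error
            last_error = exc
            if attempt == attempts - 1:
                break  # out of attempts, no point sleeping again
            # Double the wait each time, capped at max_delay, plus some jitter
            # so several instances don't all reconnect at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ConnectionError(
        f"DB still unreachable after {attempts} attempts"
    ) from last_error
```

With the defaults this waits roughly 0.5s, 1s, 2s, and 4s (plus jitter) before giving up, which rides out a short DB blip without a restart or redeploy.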