Railway•15mo ago

Restart doesn't actually restart

Seems like a service failed after it couldn't connect to a DB... i tried to restart but it never restarted. This has been an ongoing issue for a few weeks

37 Replies

Percy•15mo ago

Project ID: 97046871-517d-4af1-adfa-6b493cccebc3

sdan•15mo ago

97046871-517d-4af1-adfa-6b493cccebc3

i usually just get around this issue by redeploying, but my project takes 10-15 min to build, so it's sometimes an annoyance

Adam•15mo ago

I'm seeing a deployment above your crashed deployment. Looks to me like your restart was successful

sdan•15mo ago

the new deployment was successful, yes, but i don't believe that failed container was ever restarted. i can try again sometime later and show if necessary, but from the screenshots you can see it says "restart successful" while the ui still shows a red box. no new blue box saying it's restarting ever popped up -- had to manually deploy since restart didn't work

sdan•15mo ago

running into the same issue again

Adam•15mo ago

Hm very odd. Is your app active? Another user reported a similar issue where their app was in the crashed status visually but was still logging

sdan•15mo ago

yes -- i guess this now comes down to semantics on what restart/redeploy means... i feel like i should be able to restart a running container and not have to redeploy (build and push that image) just to restart that service

hey guys, this is a pretty serious issue. our build times are unfortunately very long (20 min tops) and it takes up to 20 minutes just to get back "online"

Adam•15mo ago

Why are you restarting your service that often? On code updates you should have a deployment running with previous code that’s shut down when your new code’s healthcheck is complete

this seems like user error

sdan•15mo ago

I have 100k+ users a day, so it crashes our database almost every 12-18 hours. this crashes this particular instance, so it shows up as "crashed"

it could be user error, but i would like to just simply restart the container. meaning: delete it, run the same exact image w/ the same config, and have it back up

Adam•15mo ago

this definitely sounds like user error. There’s got to be better ways to get around that.

Also, with 100k+ users you should be on the teams plan. This is not a hobby project, which is what the dev plan is meant for

sdan•15mo ago

not to mention I have other services on railway that simply hang and show up as "application not responding". would be nice to have healthchecks running hourly if that's possible?

alright sounds good. i use "we" too often, sorry, it's just me self funding.

Adam•15mo ago

Unfortunately that all sounds like user/code error. Afaik there’s no way to set up scheduled healthchecks, but if you join the teams plan you can discuss that with the team

angelo•15mo ago

Hey @sdan - this is a bug on our end. With that said, is your app crashing or the DB crashing?

sdan•15mo ago

db running on google cloud, i found railway cant handle some stuff so moved most of my infra elsewhere

angelo•15mo ago

Like a vector DB, or just a scale issue?

sdan•15mo ago

yea

angelo•15mo ago

yea to what 😛

sdan•15mo ago

yea vector db and yea scale issue 🙂

sdan•15mo ago

also have google cloud credits

angelo•15mo ago

ok- so on your app, how many connections to the DB are you keeping open?

sdan•15mo ago

8 at a time probably

angelo•15mo ago

What happens when you bump that up?

sdan•15mo ago

no clue honestly i just restart stuff whenever it goes down

angelo•15mo ago

;-;

sdan•15mo ago

there are more issues because the vector db i am using is in beta and runs into race issues all the time

angelo•15mo ago

so, you may wanna increase the number of connections

actually wait, can you decrease it? it will slow your app but might help with the race

also do you have a link to that vector DB?
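A minimal sketch of what lowering the connection count could look like on the app side, assuming a threaded Python server. `ConnectionLimiter` and `slot` are hypothetical names for illustration and are not part of any vector DB client; the cap of 8 only echoes the figure mentioned above.

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Caps how many callers may talk to the DB at the same time (hypothetical helper)."""

    def __init__(self, max_connections: int = 8):
        if max_connections < 1:
            raise ValueError("max_connections must be >= 1")
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def slot(self, timeout: float = 5.0):
        # Wait for a free slot, but fail loudly instead of hanging forever.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free DB connection slot")
        try:
            yield
        finally:
            # Always hand the slot back, even if the query raised.
            self._slots.release()
```

Wrapping each query in `with limiter.slot():` trades some throughput for fewer simultaneous requests hitting the DB, which is the trade-off described above.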

sdan•15mo ago

yeah i have tried multiple things but ultimately i dont run most of my heavy workloads on railway. i just purely do reading on railway

sdan•15mo ago

https://trychroma.com

the AI-native open-source embedding database

angelo•15mo ago

I know a guy there, we can chat

sdan•15mo ago

and i have probably already chatted with that guy haha. theyre rolling out a refactor next week so hoping that will solve it

angelo•15mo ago

curious, why are you still on Railway then (aside from you being an ex-employee)? what are we doing so right, even when we seem to get things wrong?

sdan•15mo ago

no easy way to run flask servers honestly

i do vercel for 99% of stuff but now need to interact with python, and vercel is pretty bad at it

angelo•15mo ago

you mean that Google Cloud Run's 99 steps isn't easy 😉

anyway, gotcha- can you dump crash logs when the DB connections reset?

I would have a service that uses the Railway API, monitors when the DB crashes, and just performs a restart ngl

in the long term, I am going to flag the UI bug to the team
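A rough sketch of the monitoring service suggested above. The two callables are assumptions supplied by the caller: `check_db` would probe the database and `restart_service` would issue the restart (for example through the Railway API, whose actual calls are not shown in this thread). `watch` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


def watch(
    check_db: Callable[[], bool],
    restart_service: Callable[[], None],
    failures_before_restart: int = 3,
    interval_s: float = 30.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check_db and restart after N consecutive failures. Returns the restart count."""
    restarts = 0
    consecutive_failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            healthy = check_db()
        except Exception:
            # A probe that blows up counts as an unhealthy DB.
            healthy = False
        if healthy:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_restart:
                restart_service()
                restarts += 1
                consecutive_failures = 0
        sleep(interval_s)
    return restarts
```

Requiring several consecutive failures before restarting keeps one slow probe from triggering a restart loop; `max_checks` and `sleep` exist mainly so the loop can be exercised in tests.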

sdan•15mo ago

google cloud is a mess for sure but it's a containable mess :). just docker up, docker down, docker remove, docker ps -a. and tailscale for networking and cloudflare for proxying. i have reliable logs, stuff never hangs, and if it does i know exactly what's up. i can check htop, etc. railway hangs and logs stop and stuff gets silently shut off. more often than not i wake up to a text from someone saying my stuff is down while railway still shows a green box, which is frustrating.

railway api monitoring a db that is not running on railway is def. not railway's job. what i need is just reliable logging, and to make sure that if something crashes it fully crashes.

i think i turned off notifs for crashes, which i will turn back on

also, as prev. mentioned, having continuous health checks would be nice

sdan•15mo ago

some logs

message.txt

sdan•15mo ago

again this is entirely my error -- the db crashing should be handled on my end.
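One way to handle a flaky DB on the app side is to retry the call with exponential backoff instead of letting the first dropped connection take the service down. This is a minimal sketch; `with_retry` and its parameters are hypothetical, and it assumes the DB client raises `ConnectionError` when the connection fails, which may not match the real client.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                # Out of retries: let the caller see the real error.
                raise
            # Delays grow 0.5s, 1s, 2s, ... capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    raise AssertionError("unreachable")
```

A call such as `with_retry(lambda: run_query(q))`, where `run_query` stands in for the app's own query function, would then survive short DB outages and only fail after the last attempt.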
