Services weren't reachable when the deployment failed
(id: 4b49be5c-8a2a-44f2-90e8-28a9de6c457f)
Yesterday, we noticed that some of our services weren't reachable when the deployment failed. Doesn't Railway make sure that the old container is killed only after the new one is live?
Also, the service was returning gateway timeouts for a few minutes once the deployment succeeded.
We are using Railway in production and this really affects our users. Please look into this issue.
for zero-downtime deployments you would want to implement a healthcheck in your app that returns a 200 status code once the app has determined that it is healthy
https://docs.railway.app/deploy/healthchecks
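To make the suggestion above concrete, here is a minimal sketch of such an endpoint using only Python's standard library. The `/health` path matches the one in the log later in this thread; the `ready` flag, `HealthHandler`, and `make_server` are hypothetical names for illustration, not part of Railway or of the poster's app.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical readiness flag: the app sets it once its startup work
# (migrations, connections, cache warm-up) has finished.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # 200 only once the app considers itself healthy, 503 until then.
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port: int = 0) -> ThreadingHTTPServer:
    # Port 0 asks the OS for any free port (handy for local testing).
    # Assumption: a deployed service would instead bind to the host and
    # port that its platform tells it to use.
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `ready.set()` only after startup has really finished is the important design choice here: a healthcheck that answers 200 too early tells the platform to switch traffic to a container that cannot serve it yet.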
Thanks for the response, Brody. We implemented that a long time ago, and the healthcheck is in place.
do you have a volume on your service?
We use a volume in only one service. I understand that a volume currently results in a small downtime. This is OK for us for now. But none of the other services have a volume.
so how long after the deployment did the deployment fail?
The deployment failed because one of the scripts was broken. Our expectation was that the old container would stay up in case the deployment failed. Later, after 10-10:30 CET, we found the gateway timeout errors.
I hope you will have access to some logs to see what went wrong.
railway will run the healthcheck; once that's successful it will switch over to the new deployment and remove the old deployment. if your application crashes after that, then you will see downtime. crashes happen, so the best thing you can do is make sure your code fully exits as fast as possible with an error code so that railway can restart it. there is also the option of running replicas, so that in the event your app locks up railway will route traffic to the other replica
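As an illustration of the "exit fast with an error code" advice, here is a minimal Python sketch. `run_service` and the `main` callable passed to it are hypothetical names; the only assumption about the platform is the one stated above, that a process exiting with a non-zero code gets restarted.

```python
import sys
import traceback


def run_service(main) -> int:
    """Run `main` and turn an unrecoverable error into a non-zero exit code."""
    try:
        main()
        return 0
    except Exception:
        # Log the failure, then give up instead of limping along in a
        # half-broken state that still looks "running" from the outside.
        traceback.print_exc()
        return 1


# Typical use at the bottom of the entry-point script (hypothetical `main`):
#     sys.exit(run_service(main))
```

The trade-off is between this and catching every error to keep the process alive: a process that swallows a fatal error never exits, so nothing outside it knows that it should be restarted.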
Please understand that we have healthchecks and the deployment failed: This means the new service didn't start. The healthchecks failed with message: 1/1 replicas never became healthy!
Healthcheck failed!
did the build fail, or did the build succeed and the healthcheck fail?
what region do you have your app deployed to?
The UI was not able to access this service. My expectation was that the old (healthy) deployment would remain available, but that was not the case.
The US-West region (I think that is the default one)
^
"did the build fail, or did the build succeed and the healthcheck fail?" The build was successful; the healthcheck failed.
Path: /health
Retry window: 5m0s
Attempt #1 failed with service unavailable. Continuing to retry for 4m59s
Attempt #2 failed with service unavailable. Continuing to retry for 4m58s
Attempt #3 failed with service unavailable. Continuing to retry for 4m56s
Attempt #4 failed with service unavailable. Continuing to retry for 4m52s
Attempt #5 failed with service unavailable. Continuing to retry for 4m44s
Attempt #6 failed with service unavailable. Continuing to retry for 4m28s
Attempt #7 failed with service unavailable. Continuing to retry for 3m58s
Attempt #8 failed with service unavailable. Continuing to retry for 3m28s
Attempt #9 failed with service unavailable. Continuing to retry for 2m58s
Attempt #10 failed with service unavailable. Continuing to retry for 2m28s
Attempt #11 failed with service unavailable. Continuing to retry for 1m58s
Attempt #12 failed with service unavailable. Continuing to retry for 1m28s
Attempt #13 failed with service unavailable. Continuing to retry for 58s
Attempt #14 failed with service unavailable. Continuing to retry for 28s
1/1 replicas never became healthy!
Healthcheck failed!
This is the complete healthcheck log
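For readers trying to interpret the log above: the remaining-time values are consistent with a first attempt about 1s in, followed by waits that double from 1s up to a 30s cap. That pattern is only inferred from this one log and is not documented Railway behavior; the sketch below, with the hypothetical function `retry_schedule`, simply reproduces the numbers under that assumption.

```python
def retry_schedule(window_s=300, first_at_s=1, initial_delay_s=1, max_delay_s=30):
    """Return (attempt number, seconds left in the window) for each attempt.

    Assumption inferred from the log in this thread only: the wait between
    attempts doubles each time and is capped at `max_delay_s`.
    """
    elapsed, delay, attempt = first_at_s, initial_delay_s, 1
    schedule = []
    while elapsed < window_s:
        schedule.append((attempt, window_s - elapsed))
        elapsed += delay
        delay = min(delay * 2, max_delay_s)
        attempt += 1
    return schedule


for attempt, remaining in retry_schedule():
    minutes, seconds = divmod(remaining, 60)
    print(f"Attempt #{attempt}: {minutes}m{seconds}s left in the retry window")
```

Under these assumptions the 5m0s window allows 14 attempts, which matches the log: the new container had the whole window to start answering on `/health` and never did.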
then railway would never have swapped it in, your old running deployment would not have been affected
if a deployment passes a health check but then later fails there is no fallback deployment in that scenario (talking about the deployment that was running before this deployment failed its health check)
Sorry. Let's also not forget that there can be issues in the platform. I posted this message because we faced "503" errors. The errors went away after redeployment. I still got 503 errors for a few more minutes after successfully deploying (healthcheck passed).
I am definitely taking platform issues into consideration, that's why I asked what region, but there have been no issues with the us-west1 region
Explaining this again: the old service never crashed. We catch all the errors to make sure that a running service won't stop.
I'm sorry but there were no reported issues with the routing layer for us-west1 during the time of your app's outage, if this happens again please report back
Thanks for your help
the past issues have been
- routing layer in us-east1 failing
- builds failing all regions
- dashboard 404
throughout all of this, already deployed apps in us-west1 went unaffected
OK. We faced this issue, and I cannot share more details about my services here. Is there anything I can do to get more information (would emailing the support address help)?
at this time it looks to me like this is an issue with your app itself, as there were no issues reported with the routing layer for us-west1, I'm sorry I can't be of more help here