Render Error on new deployment
On a new deployment, there's always a brief period of time (5-25 seconds) where the website intermittently returns an error page with the Railway theme and some generic error text (I forget what it is). Is that behavior expected?
Project ID:
87f6d50b-7bab-488e-b802-02f9edc442e3
Render's theme?
Sorry!
Fixed the text
Technically it's some sort of "render error"
so I hope the title can stay
can you get a screenshot of this?
The issue is on Railway though
sure
Next time it happens
maybe like the not found error?
does it have a railway logo?
yes
application not responding?
this is what I meant by "theme"
I doubt that, Railway promotes the new deployment long after it has succeeded I'd assume
do you have a volume on your service?
There it is
Well on two of them: Postgres and Redis
but do you have a volume on that service though
on my web service?
no
are you using a health check?
no
then theres your problem, without a health check railway doesnt know exactly when your app is ready to handle requests
what kind of web service?
it's a Django app.
Well, I thought that since the deployment only turns green after the Gunicorn worker is up, everything should be properly set up.
Sometimes it happens for over 10 seconds, which is really making me think this isn't a healthcheck/no healthcheck issue.
turns green when the container is run, not necessarily when your app can handle a request, go ahead and add a health check just to get that out of the way
will do sir
and you are sure the django service doesnt have a volume?
Is this proof?
yes haha
Is the health check path relative?
never heard someone use that term for a url lol
point taken
should I use my domain then, or Railway's? or either?
it only accepts a path, like
/api/v1/healthz
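For reference, a bare-bones Django endpoint for that can live straight in urls.py. This is only a sketch; the /up/ path and view name are examples, not taken from this project:
```python
# urls.py: minimal health endpoint sketch (path name is only an example)
from django.http import HttpResponse
from django.urls import path

def up(request):
    # no database or cache checks yet, just proves the app is serving HTTP
    return HttpResponse(status=200)

urlpatterns = [
    path("up/", up),  # the health check path in Railway would then be /up/
]
```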
huh, failing for 5 straight minutes for some reason...Even though all it does is return an empty 200 response (not even checking databases or anything)
guess you got the path wrong or something?
I don't understand.
My healthcheck path is up; I tried /up/, up, and up/
None worked
If I access mydomain.com/up/ it returns 200
On the other hand, I do see gunicorn returning 400 for the healthcheck tests.
So weird..
you likely have some middleware interfering, like checking allowed hosts on it
just added ...railway.internal there
the health checks are done from local ipv4 addresses
Aha
I guess the solution isn't to whitelist every host right?
how do I know what to whitelist?
dont have any middleware run for the health check path
huh, so basically for that path, have ALLOWED_HOSTS = ["*"]?
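One way to do that without ALLOWED_HOSTS = ["*"] is a small middleware placed at the very top of MIDDLEWARE that answers the health path before anything calls request.get_host(). A rough sketch; the class name and path here are hypothetical:
```python
# hypothetical HealthCheckMiddleware: answers the health path before host validation runs.
# Place it first in MIDDLEWARE so no earlier middleware calls request.get_host().
from django.http import HttpResponse

class HealthCheckMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.path == "/up/":      # health check path, example value
            return HttpResponse("ok")   # skips ALLOWED_HOSTS, sessions, CSRF for this request only
        return self.get_response(request)
```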
This looks like it should be simpler than this
thats django for you
Hah that's a bit cheeky
Thanks a lot for your help!
you got a health check working?
nope
my website gets like 20 visits a day right now, and 20 of those are by me
so I'm just gonna put that on the TODO list
have to find a proper way to do this
sounds good
@Brody this is after setting up the healthcheck successfully:
Yeah there's definitely something happening once there are two green deployments, and the older one is removed. I can reproduce it quite consistently.
Btw, that's the 3rd report of the same issue in the last couple days. Same thing I posted in the other thread https://discord.com/channels/713503345364697088/1202585121677643836
Might be some global problem?
Exactly the symptoms I’m experiencing (after adding a proper health check)
you definitely could get into a scenario where your app responds to a health check but not to actual traffic
@Brody can you please elaborate? If I temporarily move my health-check to be the very same path that I receive "actual" traffic on, will that be enough to investigate on your side?
haha I don't have any other side than the community side, I don't work for Railway
I thought you did honestly. But by side I meant “end”. It’s not about sides, it’s about whether issues are investigated properly. Where can I raise an issue regarding this? As @sergey said, this is not an isolated incident.
I'll try to reproduce
Total guess, but what I think is happening is that right after the switch, Railway sends some of the requests to the terminated (or about-to-be-terminated) deployment.
As Sergey said, a period of 5-20 seconds where we see this error.
This is after the new deployment has responded successfully to the health check.
my guess is that django is doing some unwanted behaviour, same with the app Sergey is running, I'll try to reproduce with a simple http server with no middleware stacks or anything of the sort
So two tried frameworks are doing unwanted behavior? Maybe it’s the third 😉
but for transparency, if you have a volume (you don't but Sergey might) there will be downtime as two services can't connect to the same volume at the same time
I have a node server, btw, not Django
just did a few back to back tests for a basic http server with health check, no volume.
during the period of switch over, at a refresh rate of 250ms and cache disabled, i only saw a singular flash of the railway page
Well that is definitely not my experience. Which web server were you using? If this is something common to Node.js and Gunicorn/Django, then it must be an obvious config step
i am running a golang stdlib http server, with the chi router
but heres the update on sergey's issue https://discord.com/channels/713503345364697088/1202585121677643836/1203077349432758289
keep in mind, they are using a volume, so their issue does not apply to you, since you are not using a volume
Very interesting. Believe it or not I’m relieved to know it’s probably a misconfig on my part.
are you using a readiness type health check? aka a health check that confirms your app is talking to your database
of course im not, but my test app doesnt talk to a database
No. I’m just returning a 200 from the Django middleware since Railway is using a random IP each time, but I’ll add those checks in a bit
your health check should be made not to care who or what is making the health check
I don't necessarily agree with that. The ALLOWED_HOSTS setting is an important one, and "exempting" the health-check endpoint from enforcement is a workaround to make it work with Railway.
then it would be a work around to work with any similar hosting service, railway isnt going to make the request with a masked host header, the health check should be relatively dumb
At least on the good news front, the latest deployment didn't show this kind of behavior when I ensured Redis + DB connections before returning 200:
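For the record, a readiness-style check along those lines might look roughly like this. It assumes the default Django database alias and a Redis-backed cache; the view name is illustrative:
```python
# hypothetical readiness view: touch the DB and the cache before answering 200
from django.core.cache import cache
from django.db import connections
from django.http import HttpResponse

def healthz(request):
    try:
        connections["default"].cursor()  # raises if the database is unreachable
        cache.set("healthz", "ok", 5)    # round-trips to Redis if it backs the cache
    except Exception as exc:
        return HttpResponse(f"not ready: {exc}", status=503)
    return HttpResponse("ok")
```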
I'll keep an eye on this for the next few days for sure
sounds good!
Thanks again Mr. @Brody
always happy to help
Yeah, spoke too soon:
is your previous deployment getting shut down before the new deployment is live? check its state during the transition
But now at least I see this:
So it's definitely on me
It's only getting shut down after the new deployment is green and shows active, so I assume something is going wrong on my side.
but whats its state after the failed health check timeout limit? active, complete?
The health check is succeeding.
oh my bad, that log line is not an access log for the health check request
And I actually think we still have an issue here, I just happened to deploy a bug now at the same time 🙂
No, that's gunicorn
bad wording, fixed
Yeah, something is definitely up.
I think I'll just record a video to prove it, because at times (this isn't the first time it happened), the response goes like this:
Railway "app failed" screen
Railway "app failed" screen
200
Railway "app failed" screen
Railway "app failed" screen
200
And from there it sorts itself out
i saw that, but only during my tests with a volume
We've already established I'm not using those for the deployed service.
May I ask if you performed the test more than once? I see this happening for 50-70% of deployments, it's not 100% of the time.
yes I've done it multiple times
Well, maybe the same issue in https://discord.com/channels/713503345364697088/1202585121677643836/1203077349432758289 is also affecting me here.
i just cant reproduce it
How can I raise a support ticket?
When Gunicorn/Django return 500, a “classic” 500 Server Error page is displayed. Black text on white background.
This isn’t what’s happening here. Railway is routing traffic to a deployment it shouldn’t route to.
I deployed a bug earlier that caused the app to consistently return 500. When that happens, a regular server error is returned, without Railway’s theme.
as a hobby user you get community support
why is django returning 500 anyway
I intentionally (OK let’s pretend that it was intentional) inserted a bug.
So Django throws an exception, and Gunicorn returns 500
It’s different from the error page displayed by Railway
Which tells me it’s a routing issue
if you see the railway page during normal operation that means your app didnt answer railway's proxy request, therefore the error lies with your app, heres proof of that https://utilities.up.railway.app/status-code/500
I don’t see it during normal operation, that’s the point.
I see it during Railway’s deployment
This is what I’m trying to say
during the transition period?
Yes
do you see a build log that the health check succeeded?
Yes
And 200 from Gunicorn logs for the health check path
how many tries until first success?
2-3 times with 503, then succeeds
I don’t think it’s the new instance that’s not “answering” the proxy request.
I think some traffic for a short period of time is directed to the old instance.
during these health checks of the new deployment, what is the status of the previous deployment
Active
and it stays active until the new deployment is switched in?
can you triple check this for me
I will right now
But what does “switched in” mean here: becomes “green” colored? Turns itself to “Active”?
correct
i shall try my tests again but with an artificial health delay where i return 503 for the first 5 seconds
sound like a more appropriate test?
Let me document what’s happening:
So once the new deployment kicks in (building), the other is green but it doesn’t say Active
Here's this state
Let me capture the state when they're both green
In this picture, the upper one is the new deployment, which just succeeded its health-check.
Build logs for the new deployment:
Deploy logs for the new deployment:
Deploy logs for the old deployment:
The issues definitely start AFTER the old deployment is removed. While both deployments are green, everything works fine. But for a (not so) short period of time once the old deployment is moved to "HISTORY", the Railway screen of death appears.
Browser console, don't know what that :1 means... hope it's not the port
it means the first line of that file lol
yeah, so this is what I have.
I'll upgrade to Pro temporarily to get this looked at if that's needed.
I don't see how a tried and tested server like Gunicorn returns 503 or doesn't respond after it has booted up and returned 200 already.
going to test an artificial health check delay, will get back to you
I'm not optimistic about this, as you can see it took Gunicorn 2 seconds
I'd test two things here:
Either Gunicorn itself, with its default graceful shutdown configuration.
Or a server that sleeps for 30 seconds on SIGTERM
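That second test could be a toy server along these lines, just to see whether traffic still lands on the old deployment after it gets SIGTERM. The port handling and the 30-second window are placeholders:
```python
# toy test server sketch: keep serving for ~30s after SIGTERM instead of exiting immediately
import os
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = HTTPServer(("0.0.0.0", int(os.environ.get("PORT", 8000))), Handler)

def delayed_shutdown(signum, frame):
    def stop():
        time.sleep(30)     # keep answering requests during this window
        server.shutdown()  # then let serve_forever() return
    threading.Thread(target=stop, daemon=True).start()

signal.signal(signal.SIGTERM, delayed_shutdown)
server.serve_forever()
```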
I'm close to convinced the terminating instance is receiving traffic still, but it has already hung up.
Or something along those lines.
you may be able to get railway to wait 30 seconds after sending sigterm before force killing the old container, but gunicorn isnt going to answer requests after sigterm anyway
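On the Gunicorn side, the relevant knob is its graceful timeout: after SIGTERM it stops taking new connections but gives workers up to that long to finish in-flight requests. A sketch, with example values only:
```python
# gunicorn.conf.py: settings relevant to the shutdown window (example values)
workers = 2
graceful_timeout = 30  # after SIGTERM, workers get this long to finish in-flight requests
timeout = 30           # silent-worker restart timeout; unrelated to deploys, easy to confuse with the above
```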
I don't really care, I can configure it to go dead immediately.
My question is: is this behavior tripping up Railway
One more thing to note:
I am pre-tty sure that while both deployments are green (so just as the new deployment becomes healthy), Railway is still directing traffic exclusively to the previous (still green) deployment. I guess this is desired behavior, but thought I'd mention that because I don't know what's right and wrong anymore.
thats correct from my understanding, railway routes traffic to the previous deployment for a default of 20 seconds, then kills the old deployment and switches over after those 20 seconds or when the new deployment's health check succeeds, whichever comes last
Got it.
Well, I’m still hopeful the support investigation in @sergey’s thread will bring results for services without volume mounts as well. He also reported that the issue occurs without a volume attached.
i can reproduce your issue when the health check does not succeed right away, doing some more tests
Oh this is interesting
i set RAILWAY_DEPLOYMENT_OVERLAP_SECONDS to 35
and did a bunch more test runs, with that set to 35, i didnt even get a single flicker of the railway page
I’ll try that early tomorrow. What does this env var mean exactly? Why do you think it solves the issue?
it's explained a bit in railways docs, but dinner time now so I can't link
Bon appétit
wrote and deployed much less today. But it seems that 31 seconds makes this go away as well.
@Brody Should this be raised for support anyway in your opinion? I mean, the health check returned 200, why do I need to change an ENV VAR to make zero downtime deployments zero downtime?
yeah i set to 35 since i like jumping by 5 😆
yes i will bring this to the teams attention this week, i have some theories on why this is happening, but i would need the team to confirm.
theres also the likelihood that even though this is brought to the teams attention, the fix would still be you setting that variable to 31, on account of them replacing their current proxy with an entirely new proxy built in house that would very likely fix this issue anyway, no point in patching their current proxy (when there is a workaround) if its going to be ripped out anyway
I would just like confirmation that something is amiss, and some way to track it, so I know when to remove the patch.
if I can reproduce it, something is amiss 😆
but if I hear anything I'll be sure to tell you
@Brody weren’t you able to repro without the ENV variable, when the first (few) health check(s) fail?
yeah, why?
^
I shouldn't read anything after 23:00. I totally misinterpreted this message.
haha no worries