Railway10mo ago
SHxKM

Render Error on new deployment

On a new deployment, there's always a brief period of time (5-25 seconds) where the website intermittently returns an error with Railway theme and some generic error text (I forget what it is). Is that behavior expected?
129 Replies
Percy
Percy10mo ago
Project ID: 87f6d50b-7bab-488e-b802-02f9edc442e3
Brody
Brody10mo ago
Render's theme?
SHxKM
SHxKM10mo ago
87f6d50b-7bab-488e-b802-02f9edc442e3. Sorry! Fixed the text. Technically it's some sort of "render error", so I hope the title can stay.
Brody
Brody10mo ago
can you get a screenshot of this?
SHxKM
SHxKM10mo ago
The issue is on Railway, though. Sure, next time it happens.
Zazh
Zazh10mo ago
maybe the like not found error?
Brody
Brody10mo ago
does it have a railway logo?
SHxKM
SHxKM10mo ago
yes
Brody
Brody10mo ago
application not responding?
SHxKM
SHxKM10mo ago
This is what I meant by "theme". I doubt that; I'd assume Railway promotes the new deployment long after it has succeeded.
Brody
Brody10mo ago
do you have a volume on your service?
SHxKM
SHxKM10mo ago
There it is
No description
SHxKM
SHxKM10mo ago
Well on two of them: Postgres and Redis
Brody
Brody10mo ago
but do you have a volume on that service though
SHxKM
SHxKM10mo ago
on my web service? no
Brody
Brody10mo ago
are you using a health check?
SHxKM
SHxKM10mo ago
no
Brody
Brody10mo ago
then there's your problem: without a health check, railway doesn't know exactly when your app is ready to handle requests. what kind of web service?
SHxKM
SHxKM10mo ago
it's a Django app. Well I thought since the deployment is only turning green after the Gunicorn worker is up, then everything should be properly set up. Sometimes it happens for over 10 seconds, which is really making me think this isn't a healthcheck/no healthcheck issue.
Brody
Brody10mo ago
it turns green when the container is run, not necessarily when your app can handle a request. go ahead and add a health check just to get that out of the way
SHxKM
SHxKM10mo ago
will do sir
Brody
Brody10mo ago
and you are sure the django service doesnt have a volume?
SHxKM
SHxKM10mo ago
Is this proof?
No description
Brody
Brody10mo ago
yes haha
SHxKM
SHxKM10mo ago
Is the health check path relative?
Brody
Brody10mo ago
never heard someone use that term for a url lol
SHxKM
SHxKM10mo ago
Point taken. Should I use my domain then, or Railway's? Or either?
Brody
Brody10mo ago
it only accepts a path, like /api/v1/healthz
SHxKM
SHxKM10mo ago
huh, failing for 5 straight minutes for some reason...Even though all it does is return an empty 200 response (not even checking databases or anything)
Brody
Brody10mo ago
guess you got the path wrong or something?
SHxKM
SHxKM10mo ago
I don't understand. My healthcheck path is up; I tried /up/, up and up/. None worked. If I access mydomain.com/up/, it returns 200. On the other hand, I do see gunicorn returning 400 for the healthcheck tests. So weird...
Brody
Brody10mo ago
you likely have some middleware interfering, like checking allowed hosts on it
SHxKM
SHxKM10mo ago
just added ...railway.internal there
Brody
Brody10mo ago
the health checks are done from local ipv4 addresses
SHxKM
SHxKM10mo ago
Aha. I guess the solution isn't to whitelist every host, right? How do I know what to whitelist?
Brody
Brody10mo ago
dont have any middleware run for the health check path
SHxKM
SHxKM10mo ago
huh, so basically for that path, have ALLOWED_HOSTS = ["*"]? It seems like it should be simpler than this.
Brody
Brody10mo ago
thats django for you
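One way to read "dont have any middleware run for the health check path" is to answer the probe before Django's middleware stack (and its ALLOWED_HOSTS validation) ever sees the request. A minimal sketch, assuming a WSGI deployment under Gunicorn. The class name `HealthCheckShortCircuit` is hypothetical, and the `/up/` default is simply the path tried earlier in the thread:

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical names throughout; "/up/" is the path tried earlier in the thread.
StartResponse = Callable[[str, List[Tuple[str, str]]], object]


class HealthCheckShortCircuit:
    """WSGI wrapper that answers the health check path itself."""

    def __init__(self, app, path: str = "/up/") -> None:
        self.app = app
        self.path = path

    def __call__(self, environ: dict, start_response: StartResponse) -> Iterable[bytes]:
        # Answer the probe here, without looking at the Host header at all.
        if environ.get("PATH_INFO") == self.path:
            body = b"ok"
            start_response(
                "200 OK",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Every other request goes through the wrapped application unchanged.
        return self.app(environ, start_response)


# In a Django project's wsgi.py this would wrap the usual application, e.g.:
#   application = HealthCheckShortCircuit(get_wsgi_application())
```

ALLOWED_HOSTS stays strict for every real request; only the probe path skips it. The trade-off is that this endpoint says nothing about whether Django itself can serve traffic.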
SHxKM
SHxKM10mo ago
Hah, that's a bit cheeky. Thanks a lot for your help!
Brody
Brody10mo ago
you got a health check working?
SHxKM
SHxKM10mo ago
Nope. My website gets like 20 visits a day right now, 20 of which are by me, so I'm just gonna put that on the TODO list. Have to find a proper way to do this.
Brody
Brody10mo ago
sounds good
SHxKM
SHxKM10mo ago
@Brody this is after setting up the healthcheck successfully:
No description
SHxKM
SHxKM10mo ago
Yeah there's definitely something happening once there are two green deployments, and the older one is removed. I can reproduce it quite consistently.
sergey
sergey10mo ago
Btw, that's the 3rd report of the same issue in the last couple days. Same thing I posted in the other thread https://discord.com/channels/713503345364697088/1202585121677643836 Might be some global problem?
SHxKM
SHxKM10mo ago
Exactly the symptoms I’m experiencing (after adding a proper health check)
Brody
Brody10mo ago
you definitely could get into a scenario where your app responds to a health check but not to actual traffic
SHxKM
SHxKM10mo ago
@Brody can you please elaborate? If I temporarily move my health-check to be the very same path that I receive "actual" traffic on, will that be enough to investigate on your side?
Brody
Brody10mo ago
haha I don't have any other side than the community side, I don't work for Railway
SHxKM
SHxKM10mo ago
I thought you did honestly. But by side I meant “end”. It’s not about sides it’s about whether issues are investigated properly. Where can I raise an issue regarding this? As @sergey said, this is not an isolated incident.
Brody
Brody10mo ago
I'll try to reproduce
SHxKM
SHxKM10mo ago
Total guess but what I think is happening is Railway right after the switch sends some of the requests to the terminated/to be terminated deployment. As Sergey said, a period of 5-20 seconds where we see this error. This is after the new deployment has responded successfully to the health check.
Brody
Brody10mo ago
my guess is that django is doing some unwanted behaviour, same with the app Sergey is running, I'll try to reproduce with a simple http server with no middleware stacks or anything of the sort
SHxKM
SHxKM10mo ago
So two tried frameworks are doing unwanted behavior? Maybe it’s the third 😉
Brody
Brody10mo ago
but for transparency, if you have a volume (you don't but Sergey might) there will be downtime, as two services can't connect to the same volume at the same time
sergey
sergey10mo ago
I have a Node server, btw, not Django
Brody
Brody10mo ago
just did a few back to back tests for a basic http server with health check, no volume. during the period of switch over, at a refresh rate of 250ms and cache disabled, i only saw a singular flash of the railway page
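A test like the one described (hitting the site every 250ms during the switch-over and watching for anything other than a 200) can be scripted instead of done by refreshing a browser. A rough sketch using only the standard library; `fetch_status` and `poll` are hypothetical helper names, and the URL in the usage note is a placeholder:

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrived."""
    request = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses are raised as HTTPError; the code is still useful.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout and similar.
        return 0


def poll(
    fetch: Callable[[], int],
    attempts: int,
    interval: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Call fetch repeatedly, pausing interval seconds after each call."""
    statuses = []
    for _ in range(attempts):
        statuses.append(fetch())
        sleep(interval)
    return statuses
```

For example, `poll(lambda: fetch_status("https://example.com/"), attempts=240)` samples for roughly a minute plus request time; any entry that is not 200 marks a moment where the deployment did not answer normally.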
SHxKM
SHxKM10mo ago
Well that is definitely not my experience. Which web server were you using? If this is something common to Node.js and Gunicorn/Django, then it must be an obvious config step
Brody
Brody10mo ago
i am running a golang stdlib http server, with the chi router. but here's the update on sergey's issue: https://discord.com/channels/713503345364697088/1202585121677643836/1203077349432758289 keep in mind, they are using a volume, so their issue does not apply to you, since you are not using a volume
SHxKM
SHxKM10mo ago
Very interesting. Believe it or not I’m relieved to know it’s probably a misconfig on my part.
Brody
Brody10mo ago
are you using a readiness type health check? aka a health check that confirms your app is talking to your database. of course i'm not, but my test app doesn't talk to a database
SHxKM
SHxKM10mo ago
No. I’m just returning a 200 from the Django middleware since Railway is using a random IP each time, but I’ll add those checks in a bit
Brody
Brody10mo ago
your health check should be made not to care who or what is making the health check
SHxKM
SHxKM10mo ago
I don't necessarily agree with that. The ALLOWED_HOSTS setting is an important one, and "exempting" the health-check endpoint from enforcement is a workaround to make it work with Railway.
Brody
Brody10mo ago
then it would be a work around to work with any similar hosting service, railway isnt going to make the request with a masked host header, the health check should be relatively dumb
SHxKM
SHxKM10mo ago
At least on the good news front, the latest deployment didn't show this kind of behavior when I ensured Redis + DB connections before returning 200:
if request.path in self.EXEMPT_PATHS:
    redis.ping()
    connection.ensure_connection()
I'll keep an eye on this for the next few days for sure
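The fragment above is part of a middleware whose surrounding code isn't shown. The same readiness idea can be sketched without any framework: run each dependency probe and report 200 only if all of them succeed. The function name `readiness` and the probe names are hypothetical; in the thread the probes were `redis.ping()` and `connection.ensure_connection()`.

```python
from typing import Callable, Dict, Tuple


def readiness(probes: Dict[str, Callable[[], object]]) -> Tuple[int, Dict[str, str]]:
    """Run every probe; return (200, report) if all pass, else (503, report)."""
    report: Dict[str, str] = {}
    status = 200
    for name, probe in probes.items():
        try:
            probe()
            report[name] = "ok"
        except Exception as exc:
            # Any exception from a probe marks the whole service as not ready.
            report[name] = f"failed: {type(exc).__name__}"
            status = 503
    return status, report
```

In a Django view this might be called as `readiness({"redis": redis.ping, "db": connection.ensure_connection})`, assuming those objects are already imported and configured in the project.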
Brody
Brody10mo ago
sounds good!
SHxKM
SHxKM10mo ago
Thanks again Mr. @Brody
Brody
Brody10mo ago
always happy to help
SHxKM
SHxKM10mo ago
Yeah, spoke too soon:
No description
Brody
Brody10mo ago
is your previous deployment getting shut down before the new deployment is live? check its state during the transition
SHxKM
SHxKM10mo ago
But now at least I see this:
192.168.0.2 - - [02/Feb/2024:21:16:07 +0000] 'GET / HTTP/1.1' 500 145 'https://xxxxxxxx.com/'; 'Mozilla/5.0 (XXXX) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36' in 296216µs
So it's definitely on me. It's only getting shut down after the new deployment is green and shows active, so I assume something is going wrong on my side.
Brody
Brody10mo ago
but whats its state after the failed health check timeout limit? active, complete?
SHxKM
SHxKM10mo ago
The health check is succeeding.
Brody
Brody10mo ago
oh my bad, that log line is not an access log for the health check request
SHxKM
SHxKM10mo ago
And I actually think we still have an issue here, I just happened to deploy a bug now at the same time 🙂 No, that's gunicorn.
Brody
Brody10mo ago
bad wording, fixed
SHxKM
SHxKM10mo ago
Yeah, something is definitely up. I think I'll just record a video to prove it, because at times (this isn't the first time it happened), the response goes like this: Railway "app failed" screen, Railway "app failed" screen, 200, Railway "app failed" screen, Railway "app failed" screen, 200. And from there it sorts itself out.
Brody
Brody10mo ago
i saw that, but only during my tests with a volume
SHxKM
SHxKM10mo ago
We've already established I'm not using those for the deployed service. May I ask if you performed the test more than once? I see this happening for 50-70% of deployments, it's not 100% of the time.
Brody
Brody10mo ago
yes I've done it multiple times
Brody
Brody10mo ago
i just cant reproduce it
SHxKM
SHxKM10mo ago
How can I raise a support ticket? When Gunicorn/Django return 500, a “classic” 500 Server Error is displayed: black text on a white background. This isn’t what’s happening here. Railway is routing traffic to a deployment it shouldn’t route to. I deployed a bug earlier that caused the app to consistently return 500. When that happens, a regular server error is returned, without Railway’s theme.
Brody
Brody10mo ago
as a hobby user you get community support. why is django returning 500 anyway?
SHxKM
SHxKM10mo ago
I intentionally (OK, let’s pretend that it was intentional) inserted a bug. So Django throws an exception, and Gunicorn returns 500. It’s different from the error page displayed by Railway, which tells me it’s a routing issue.
Brody
Brody10mo ago
if you see the railway page during normal operation that means your app didnt answer railway's proxy request, therefore the error lies with your app, heres proof of that https://utilities.up.railway.app/status-code/500
SHxKM
SHxKM10mo ago
I don’t see it during normal operation, that’s the point. I see it during Railway’s deployment. This is what I’m trying to say.
Brody
Brody10mo ago
during the transition period?
SHxKM
SHxKM10mo ago
Yes
Brody
Brody10mo ago
do you see a build log that the health check succeeded?
SHxKM
SHxKM10mo ago
Yes. And 200 from the Gunicorn logs for the health check path.
Brody
Brody10mo ago
how many tries until first success?
SHxKM
SHxKM10mo ago
2-3 times with 503, then it succeeds. I don’t think it’s the new instance that’s not “answering” the proxy request. I think some traffic for a short period of time is directed to the old instance.
Brody
Brody10mo ago
during these health checks of the new deployment, what is the status of the previous deployment
SHxKM
SHxKM10mo ago
Active
Brody
Brody10mo ago
and it stays active until the new deployment is switched in? can you triple check this for me
SHxKM
SHxKM10mo ago
I will right now. But what does “switched in” mean here: becomes “green” colored? Turns itself to “Active”?
Brody
Brody10mo ago
correct. i shall try my tests again, but with an artificial health delay where i return 503 for the first 5 seconds. sound like a more appropriate test?
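An artificial health delay of this kind could look roughly like the sketch below. It uses Python's standard library rather than the Go server from the earlier test; the `/healthz` path, the names `health_status` and `WARMUP_SECONDS`, and the fallback port are all assumptions made for illustration.

```python
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

STARTED_AT = time.monotonic()
WARMUP_SECONDS = 5.0  # hypothetical: how long the health check reports 503


def health_status(
    now: float, started_at: float = STARTED_AT, warmup: float = WARMUP_SECONDS
) -> int:
    """503 while the process is younger than warmup seconds, 200 afterwards."""
    return 503 if now - started_at < warmup else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only the health path is delayed; every other path answers 200 at once.
        status = health_status(time.monotonic()) if self.path == "/healthz" else 200
        body = b"ok" if status == 200 else b"warming up"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    # Assumes the platform provides the listening port in the PORT variable.
    port = int(os.environ.get("PORT", "8000"))
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `main()` starts the server. With the health check path set to `/healthz`, the first probes should fail for about five seconds before one succeeds, which mimics the "2-3 times with 503, then succeeds" pattern reported above.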
SHxKM
SHxKM10mo ago
Let me document what’s happening: So once the new deployment kicks in (building), the other is green but it doesn’t say Active
SHxKM
SHxKM10mo ago
Here's this state
No description
SHxKM
SHxKM10mo ago
Let me capture the state when they're both green
SHxKM
SHxKM10mo ago
In this picture, the upper one is the new deployment, which just succeeded its health-check.
No description
SHxKM
SHxKM10mo ago
Build logs for the new deployment:
No description
SHxKM
SHxKM10mo ago
Deploy logs for the new deployment:
No description
SHxKM
SHxKM10mo ago
Deploy logs for the old deployment:
No description
SHxKM
SHxKM10mo ago
The issues definitely start AFTER the old deployment is removed. While both deployments are green, everything works fine. But for a (not so) short period of time, once the old deployment is moved to "HISTORY", the Railway screen of death appears.
SHxKM
SHxKM10mo ago
Browser console, don't know what that :1 means...hope it's not the port:
No description
Brody
Brody10mo ago
it means the first line of that file lol
SHxKM
SHxKM10mo ago
yeah, so this is what I have. I'll upgrade to Pro temporarily to get this looked at if that's needed. I don't see how a tried and tested server like Gunicorn returns 503 or doesn't respond after it has booted up and returned 200 already.
Brody
Brody10mo ago
going to test an artificial health check delay, will get back to you
SHxKM
SHxKM10mo ago
I'm not optimistic about this; as you can see, it took Gunicorn 2 seconds. I'd test two things here: either Gunicorn itself, with its default graceful shutdown configuration, or a server that sleeps for 30 seconds on SIGTERM. I'm close to convinced the terminating instance is still receiving traffic, but it has already hung up. Or something along those lines.
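The second suggested test, a server that keeps running for a while after SIGTERM, can be sketched as a small helper that records when the signal arrived and only reports "time to exit" once a drain period has passed. `GracefulShutdown` and `drain_seconds` are hypothetical names, and this is not Gunicorn's own shutdown logic.

```python
import signal
import time
from typing import Optional


class GracefulShutdown:
    """Tracks a drain period that starts when SIGTERM is received."""

    def __init__(self, drain_seconds: float = 30.0) -> None:
        self.drain_seconds = drain_seconds
        self.deadline: Optional[float] = None

    def handle_sigterm(self, signum: int, frame: object) -> None:
        # Keep the first deadline if the signal is delivered more than once.
        if self.deadline is None:
            self.deadline = time.monotonic() + self.drain_seconds

    def should_exit(self, now: Optional[float] = None) -> bool:
        """True only after SIGTERM was seen and the drain period has elapsed."""
        if self.deadline is None:
            return False
        current = time.monotonic() if now is None else now
        return current >= self.deadline

    def install(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)
```

A serving loop would call `install()` once at startup and keep handling requests until `should_exit()` returns true. If the error page still appears while the old instance is draining like this, the old process hanging up early is a less likely explanation.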
Brody
Brody10mo ago
you may be able to get railway to wait 30 seconds after sending sigterm before force killing the old container, but gunicorn isnt going to answer requests after sigterm anyway
SHxKM
SHxKM10mo ago
I don't really care, I can configure it to go dead immediately. My question is: is this behavior tripping up Railway? One more thing to note: I am pretty sure that while both deployments are green (so just as the new deployment becomes healthy), Railway is still directing traffic exclusively to the previous (still green) deployment. I guess this is desired behavior, but thought I'd mention that because I don't know what's right and wrong anymore.
Brody
Brody10mo ago
that's correct from my understanding: railway routes traffic to the previous deployment for a default of 20 seconds, then kills the old deployment and switches over after 20 seconds or when the new deployment's health check succeeds, whichever comes last
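Read literally, that rule is a maximum of two times. The toy model below only restates the description in this message; it is not confirmed Railway behavior. The function name, the parameter names, and the assumption that both times are measured from the same starting point are all hypothetical.

```python
def switch_over_at(health_ok_at: float, overlap_seconds: float = 20.0) -> float:
    """Seconds until traffic moves to the new deployment, per the rule as described.

    health_ok_at: seconds until the new deployment's health check first succeeds.
    overlap_seconds: how long the previous deployment keeps receiving traffic.
    """
    # "whichever comes last" is simply the larger of the two times.
    return max(overlap_seconds, health_ok_at)
```

Under this reading, a health check that passes after 3 seconds still leaves the old deployment serving until the 20 second mark, while one that passes after 28 seconds delays the switch to 28 seconds.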
SHxKM
SHxKM10mo ago
Got it. Well, I’m still hopeful @sergey’s thread investigation by support will bring results for services without volume mounts as well. He also reported the issue occurs without a volume attached.
Brody
Brody10mo ago
i can reproduce your issue when the health check does not succeed right away, doing some more tests
SHxKM
SHxKM10mo ago
Oh this is interesting
Brody
Brody10mo ago
i set RAILWAY_DEPLOYMENT_OVERLAP_SECONDS to 35 and did a bunch more test runs. with that set to 35, i didn't get even a single flicker of the railway page
SHxKM
SHxKM10mo ago
I’ll try that early tomorrow. What does this env var mean exactly? Why do you think it solves the issue?
Brody
Brody10mo ago
it's explained a bit in railways docs, but dinner time now so I can't link
SHxKM
SHxKM10mo ago
Bon appétit! Wrote and deployed much less today, but it seems that 31 seconds makes this go away as well. @Brody Should this be raised for support anyway, in your opinion? I mean, the health check returned 200, so why do I need to change an ENV VAR to make zero downtime deployments zero downtime?
Brody
Brody10mo ago
yeah i set it to 35 since i like jumping by 5 😆 yes, i will bring this to the team's attention this week. i have some theories on why this is happening, but i would need the team to confirm. there's also the likelihood that even though this is brought to the team's attention, the fix would still be you setting that variable to 31, on account of them replacing their current proxy with an entirely new proxy built in house that would very likely fix this issue anyway. no point in patching their current proxy (when there is a workaround) if it's going to be ripped out anyway
SHxKM
SHxKM10mo ago
I would just like confirmation that something is amiss, and some way to track it, so I know when to remove the patch.
Brody
Brody10mo ago
if I can reproduce it, something is amiss 😆 but if I hear anything I'll be sure to tell you
SHxKM
SHxKM9mo ago
@Brody weren’t you able to repro without the ENV variable, when the first (few) health check(s) fail?
Brody
Brody9mo ago
yeah, why? ^
SHxKM
SHxKM9mo ago
I shouldn't read anything after 23:00. I totally misinterpreted this message.
Brody
Brody9mo ago
haha no worries