Losing connection to redis after migration
Hello, I recently migrated to the new Redis instance and have been getting errors randomly about once a day, and I need to restart my container for it to work again. Has anyone had similar issues after migrating?
@matt - connection reset redis
This happens with the new postgres container as well.
@arus can you share more details? And is there a reliable way to reproduce the issue? ty!
Could you find any reason why it happens in my project @matt ?
Hey there @Eddy - we had an issue with an #🚨|incidents, the Railway team is working on a post-mortem.
Only way to reproduce it is to wait ~3-8 hours without anyone calling one of the endpoints. Then I get an EOF detected error. I have to restart the bot container to reconnect to pg again. Hang on, I'll grab the full error.
Lol and that isn't even fully true, it's been up 8 hours and I can't reproduce it yet.
Command raised an exception: OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Already followed the guidance from sqlalchemy, it's still happening. It would sometimes happen after a month or two with the old postgres container, but now it's like 1-5 times a day.
are you making sure to close idle connections?
seen this a lot where postgres would mark the connection as closed but the client doesn't know the connection was closed
Yep
well 8 hours so far is good, if there are errors again let us know
Seems to have been stable overnight, hopefully whatever happened yesterday fixed it (I'm in the West region mentioned in incidents)
Yep it was likely the outage
Yeah same here, happened 2 nights in a row and tonight it was stable
Sounds about right. Also lol that av Angelo. I haven't seen that frog in years. All seems fine now yeah.
Down again @Angelo
Hmm- are you closing your connections?
Yes.
It happens after approx. 20 hours now rather than 1-6 hours. Only started happening after I migrated to the new container.
And only if it's idle the entire time.
The bot container is not set to sleep but sometimes seems to anyway. Resource allocation in my region maybe?
This is the specific error I get on the client side. https://docs.sqlalchemy.org/en/14/core/pooling.html#pool-disconnects I am using the pessimistic method to recover. Looking at my logs, the main loop sometimes seems to restart while the container is running. The other behavior I notice, if I try to pull before the EOF message, is extremely delayed server responses in the region, especially when building or restarting. I'm going to give null pools a try again though and I'll let you know.
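Since restarting the container is the only recovery so far, here is a minimal sketch of the optimistic counterpart to the pessimistic approach: catch the disconnect, reset the pool, and retry once. The names `run_with_reconnect`, `operation`, and `reset` are hypothetical. With SQLAlchemy, the retriable exception would presumably be `sqlalchemy.exc.OperationalError` and `reset` could call `engine.dispose()`; whether that fits this bot is an assumption.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    operation: Callable[[], T],
    reset: Callable[[], None],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 2,
    delay: float = 0.0,
) -> T:
    """Run operation(); on a retriable error, call reset() and try again.

    Hypothetical helper: 'reset' should discard stale connections
    (for SQLAlchemy, an assumption would be engine.dispose()).
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts, surface the original error
            reset()
            time.sleep(delay)
    raise RuntimeError("attempts must be >= 1")
```

The idea is that a dropped connection costs one failed call and one retry instead of a container restart.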
I'm still thinking that your problem is related to keeping stale connections around; this problem is mentioned in the knexjs docs.
It's a JavaScript package, but the same can apply to any pooled postgres client within a docker environment.
https://knexjs.org/guide/#pool
It can result in problems with stale connections
I'll take a look.
Yeah, I'm following this guidance, which won't use a connection without checking it first. https://stackoverflow.com/a/66360789
Given this other article though, it does align with the lag theory. The lag could be exceeding the keep alive.
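If the lag theory holds, one thing that might be worth trying is tightening the TCP keepalive settings on the client. The sketch below appends the libpq parameters `keepalives`, `keepalives_idle`, `keepalives_interval`, and `keepalives_count` to a connection URL. The helper name `with_keepalives` and the values (30 s idle, 10 s interval, 3 probes) are assumptions, not tested recommendations for Railway.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_keepalives(url: str, idle: int = 30, interval: int = 10, count: int = 3) -> str:
    """Return the URL with libpq TCP keepalive parameters added to its query.

    Hypothetical helper; the default values are illustrative assumptions.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # keep existing params such as sslmode
    params.update(
        {
            "keepalives": "1",                     # enable TCP keepalives
            "keepalives_idle": str(idle),          # seconds idle before first probe
            "keepalives_interval": str(interval),  # seconds between probes
            "keepalives_count": str(count),        # lost probes before giving up
        }
    )
    return urlunsplit(parts._replace(query=urlencode(params)))
```

The resulting URL can be passed wherever the original database URL was used.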
sorry do you mean you're going to use pool_pre_ping=True going forward, or have you already been using it?
I have been already.
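For anyone following along, a minimal sketch of what that setup might look like. `pool_pre_ping` and `pool_recycle` are real `create_engine` options; the helper names and the 300-second recycle value are assumptions, not something confirmed in this thread.

```python
from typing import Any, Dict


def engine_options(recycle_seconds: int = 300) -> Dict[str, Any]:
    """Pool settings aimed at stale connections (values are assumptions)."""
    return {
        "pool_pre_ping": True,            # test each connection on checkout
        "pool_recycle": recycle_seconds,  # replace connections older than this
    }


def make_engine(url: str):
    """Build an engine with the options above (hypothetical helper)."""
    from sqlalchemy import create_engine  # imported lazily; requires SQLAlchemy

    return create_engine(url, **engine_options())
```

Pre-ping only checks a connection at checkout, so a connection that drops in the middle of a query can still raise.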
are you using the private url?
https://stackoverflow.com/a/66515677
I don't recall. Let me check.
roundhouse.proxy.rlwy.net
Same variable as before, but the migration populated a lot of it
I can try the private url
can you try using the DATABASE_PRIVATE_URL variable?
Yeah, let me switch off my phone.
Alright, deploying. I'll let you know if it stays connected.
sounds good
Alright, it won't connect to the private url. Says the hostname can't be found.
building with nixpacks?
could not translate host name "postgres.railway.internal" to address: Name or service not known
yes
can you try adding a 3 second sleep to the beginning of your start command?
Yeah one sec.
nope
postgres is in the same project right?
Yes
I'm going to try adding sleep to my cog setup functions.
Didn't work either
my nixbuild runs this. docker run -it us-west1.registry.rlwy.net/
let me check if postgres is in the same region.
Bleh, yes, I'm hobby plan too
So I couldn't change it if I wanted to
does the dns lookup that SQLAlchemy does support ipv6?
Pretty sure it does. Let me check this version real quick.
yes, it does.
does the start command in the build table at the top of the build logs confirm that there is a sleep 3?
No, but that code is wrapped in a script.
can you change your start command to
sleep 3 && <your current start command>
yeah one sec.
okay, looks like it didn't explode this time.
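The `sleep 3` workaround suggests the private hostname isn't resolvable right at container start. As an alternative to a fixed sleep, here is a rough sketch that polls DNS until the name resolves. `wait_for_host` and its defaults are hypothetical, and the `resolve` parameter exists only so the function can be exercised without a network. `socket.AF_UNSPEC` asks for both IPv4 and IPv6 results.

```python
import socket
import time
from typing import Callable


def wait_for_host(
    host: str,
    port: int = 5432,
    timeout: float = 30.0,
    interval: float = 1.0,
    resolve: Callable[..., list] = socket.getaddrinfo,
) -> bool:
    """Poll DNS until host resolves; return False if timeout elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # AF_UNSPEC accepts either IPv4 or IPv6 addresses
            resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
            return True
        except socket.gaierror:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling something like `wait_for_host("postgres.railway.internal")` before creating the engine would do the same job as the sleep, but only waits as long as needed.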
make sure you are using a healthcheck now though https://docs.railway.app/guides/healthchecks-and-restarts
Not entirely sure how I'm going to do that just yet, but I'll look into it.
do you already have a web framework in place or is this a bot app?
Bot
ah then don't worry about the health check
I'm thinking of adding some short lived auth endpoints to connect users to their own content though, so I'll probably throw it in when I do that.
Alright, gonna let this sucker idle for a couple days and see if the errors are done.
yep that would be the time to add a healthcheck
sounds good
Thanks!
Down again.
I checked, the client reported connecting but not receiving a response.
Same here
This is still happening for me, it crashes roughly every 4 days. Did anyone figure out a clear resolution? Otherwise I'm going to need to migrate away from Railway entirely, as this is not stable for production.
Note I have healthchecks and everything.