Railway5mo ago
johns

Did something related to private networking fundamentally change in the past couple of days?

We have a simple Django app and use Railway's Redis. Over the past couple of days, we've been seeing the following happen: deploy the server -> within a couple of hours, we start getting ERROR:django_datadog_logger.middleware.error_log:Error -2 connecting to redis.railway.internal:6379. Name or service not known, where all connection attempts to Redis fail. Our code hadn't changed, and we just wanted to double-check whether anything has fundamentally changed with the network layer.
Solution:
@unstoppablerizz The fix for this race condition has been merged and deployed. This should not happen to you again. If you want to remove the 3 second delay (assuming you added one), you can, but as Brody showed us, that will generally just result in errors. However, if Django already takes long enough to start, you might not need an additional delay.
176 Replies
Percy
Percy5mo ago
Project ID: 120b5ec5-59d8-4087-84ae-4e0b3d934aa7
johns
johns5mo ago
120b5ec5-59d8-4087-84ae-4e0b3d934aa7
macwilko
macwilko5mo ago
Does it connect immediately to the service over Private Networking? I've found it's necessary to add a short 3 second delay before connecting over Private Networking.
johns
johns5mo ago
Yeah, the initial couple of hours isn't the issue, and the delay we need at startup isn't the issue either.
macwilko
macwilko5mo ago
I don't believe anything has fundamentally changed with Private Networking. If it was previously working, it could have been good luck, maybe.
theodor
theodor5mo ago
Didn't you guys have an outage with k8s 503s? @macwilko We've been running this with quite a lot of users for the past 9 months, so I think good luck is a stretch. We haven't touched the code there for over a week, which is quite suspicious. It only started happening after the 503 issue. Can we escalate this somehow and get a bit better treatment? We're paying quite a bit to Railway, after all.
Brody
Brody5mo ago
@thomas - would you know, or could you check, whether there were any changes to the private network or any of the surrounding infrastructure? my testing indicates zero behavior changes, but who knows
theodor
theodor5mo ago
Thanks @Brody! We're moving over to Azure Redis for now. Do you guys use Azure or AWS? We can use either, but we'd prefer to be on the same cloud provider. cc @Brody
Brody
Brody5mo ago
railway runs on GCP at the moment
macwilko
macwilko5mo ago
do you see similar problems if you connect over the public url?
thomas
thomas5mo ago
I am looking at this now. But the next level of support is direct support.
johns
johns5mo ago
Well we just use the REDIS_URL so it is using the public url
Brody
Brody5mo ago
connecting to redis.railway.internal:6379. Name or service not known
you are using the private url
theodor
theodor5mo ago
@thomas is this not the way to get that? or how should we tap into it
johns
johns5mo ago
Right we're using the REDIS_PRIVATE_URL. My fault
thomas
thomas5mo ago
So at the moment we have 3 tiers of support: community, team, and direct. They correspond to Hobby, Team, and what I would call enterprise. With direct support you get a private channel with the team on Slack or Discord.
johns
johns5mo ago
I remember - when we used the public URL, the latency went up by 5~10x when we migrated from v1 to v2 in December, which is why I specifically changed to the private URL. Using the private URL brought the latency back down.
thomas
thomas5mo ago
@unstoppablerizz Which service is throwing these logs? And yes, using the private URL is recommended for links between all services.
johns
johns5mo ago
Service ID d89b8d57-7c05-4bab-bf8a-27bc11f78cbc
johns
johns5mo ago
Basically, after those spikes happen, the connection to redis doesn't seem to work at all
No description
johns
johns5mo ago
I was trying to look through the logs but I keep deploying over the old instance so they get truncated
thomas
thomas5mo ago
that screenshot is from the redis instance?
johns
johns5mo ago
No, it's the screenshot from Service ID d89b8d57-7c05-4bab-bf8a-27bc11f78cbc, which is our Django server instance. Maybe we're exhausting the max simultaneous connection limit of Redis somehow? Do you guys monitor this?
Brody
Brody5mo ago
i dont think you are opening 10k conns
No description
thomas
thomas5mo ago
That's not a bad path to consider. Do all connections drop, or just some? You said that all connection attempts to Redis fail in the first part of this. Considering the replicas you have running and the scale of the CPU you are using, it's possible
Brody
Brody5mo ago
side question, are you deploying your app with nixpacks, or a dockerfile?
thomas
thomas5mo ago
I am bringing this up internally. Brody is also one of our most helpful community members.
Brody
Brody5mo ago
i'm pretty sure this is char's wheelhouse, he would instantly know if there were any changes to the internal dns resolver
thomas
thomas5mo ago
@unstoppablerizz Is there a reason all your logs for this service are piped to stderr? the red wall is not what I would consider normal logging practices and honestly sent a shiver down my spine. (this is unrelated to your issue, but I am honestly curious)
johns
johns5mo ago
Dockerfile
thomas
thomas5mo ago
I have been made to understand that is standard for some libs
Brody
Brody5mo ago
is it based on an alpine image?
thomas
thomas5mo ago
Are you establishing a new Redis connection for each API request, or using a pool of connections?
johns
johns5mo ago
We have Datadog, so that's how we set up monitoring / tracing / alerts; we don't really read the server logs on Railway because they're not really usable. Let me double check. We're using a pool
thomas
thomas5mo ago
So you shouldn't be running out of connections unless your pool is over 10000? Yeah, at your scale, we understand that, and this is one of the things that we will be improving.
johns
johns5mo ago
When we started seeing the issues, we were using the pool + Django's default cache backend module, which I believe does pool connections. Though even if we didn't use a pool, I don't think our usage is high enough to warrant this issue. And besides, our load didn't drastically change from last week. The most recent failure (the one at 1:40am) could be because I explicitly defined a connection pool of 2000, which may have exhausted the # of connections, but this wasn't the case before last night
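(For reference, a minimal sketch of the kind of module-level pool being described, using redis-py; the module name, max_connections value, and timeouts here are illustrative assumptions, not the actual configuration.)

# Hypothetical module, e.g. core/redis_client.py (name is illustrative)
import os
import redis

# One pool per process, created at import time (global scope) and shared by all
# request threads; redis-py's ConnectionPool is thread-safe.
pool = redis.ConnectionPool.from_url(
    os.environ["REDIS_URL"],   # private URL: redis://<user>:<password>@redis.railway.internal:6379
    max_connections=50,        # illustrative; well under Redis' default 10k maxclients
    socket_connect_timeout=5,  # fail fast instead of hanging on DNS/network trouble
    socket_timeout=5,
)

def get_redis() -> redis.Redis:
    # Clients are cheap wrappers; they borrow connections from the shared pool.
    return redis.Redis(connection_pool=pool)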
Brody
Brody5mo ago
were you able to check?
johns
johns5mo ago
We're using python:3.11 base image. Checking if that's alpine
Brody
Brody5mo ago
that's debian based, there goes that idea lol
johns
johns5mo ago
Oh I think I saw the spike again a couple tens of minutes ago
johns
johns5mo ago
No description
thomas
thomas5mo ago
Do you have a graph of the frequency of the error from Datadog?
johns
johns5mo ago
Yeah, the tricky thing with this is that it doesn't get traced properly. We unfortunately only have the wall of red. But since it's pretty fresh, you should be able to see the socket errors happen from that point, starting 10:06am (first bump). If not, then after 10:34am
thomas
thomas5mo ago
I have paged infrastructure on-call, and they are looking into this as well now. We are attempting to determine if this is an incident or not.
johns
johns5mo ago
We're also not logging everything to stderr - we're only logging errors to stderr (have log level set to ERROR)
thomas
thomas5mo ago
Understood
johns
johns5mo ago
But 400s are also "logged" as errors which can be safely ignored most of the time
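(For context, a minimal sketch of a Django LOGGING setting that behaves as described, sending only ERROR-level records to stderr; the handler name is illustrative and the django_datadog_logger / Datadog setup is omitted.)

# settings.py (sketch only)
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "stderr": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stderr",  # explicit, though StreamHandler defaults to stderr
            "level": "ERROR",
        },
    },
    "root": {
        "handlers": ["stderr"],
        "level": "ERROR",  # only ERROR and above ever reach stderr
    },
}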
thomas
thomas5mo ago
No description
johns
johns5mo ago
No description
thomas
thomas5mo ago
I see what you mean by intermittent
johns
johns5mo ago
Trying to look at what happens at 10:08:20, because I think that's when it started again. Just some quick feedback - I don't think our multiline logs show up as actual individual logs; it's just newlines on a trace. That's probably what makes me never want to use the logs, because it's impossible to trace. I didn't bother setting a JSON formatter
johns
johns5mo ago
I do notice something odd - our server seems to restart right before that happens. Not sure why
No description
johns
johns5mo ago
?: (staticfiles.W004) The directory '/code/static' in the STATICFILES_DIRS setting does not exist. <- this only happens on server start. Also, pagination on search -> load more is wrong; likely some issue with cursor-based pagination
Brody
Brody5mo ago
at what point in your app's lifecycle is the redis pool created? at start?
johns
johns5mo ago
No description
johns
johns5mo ago
Defined in the global scope, yes
thomas
thomas5mo ago
ok, that's a good clue
Brody
Brody5mo ago
at startup?
johns
johns5mo ago
The Python Redis pool is thread-safe, so it should be fine, yeah... perhaps the restarts somehow eat up the connections
thomas
thomas5mo ago
The private network takes time to initialize after a server restart. But the thing is, the error should go away after that
johns
johns5mo ago
yeah I don't think that's the issue
Brody
Brody5mo ago
my current thinking is that their app is crashing, then re-creating the pool on startup before the network initialization has finished, thus resulting in dns errors
johns
johns5mo ago
The thing is, our code hadn't changed before this started happening, afaik. I'll double check my commits
Brody
Brody5mo ago
the spikes are a good sign that the app crashed
thomas
thomas5mo ago
That's a counterpoint, but is this load level reaching new heights?
johns
johns5mo ago
If you search for "System check identified some issues" in the logs, you can see that it's found only sparingly (it only happens on deploy). Starting Jan 16th, it happens a lot more, even when we're not deploying. No, not at all. Barely more, and we've had much more before
johns
johns5mo ago
No description
johns
johns5mo ago
Though double-checking
Brody
Brody5mo ago
do you have a 3 second sleep before you are creating the redis pool on startup?
johns
johns5mo ago
We don't, but startup already takes ~10 seconds anyway for Django since we run migrations on server start before we initialize the server, so there's no way it's a connection-timing issue. Besides, it was never an issue before. Basically, all the random restarts here could be the source of the issue, I think. We hadn't done any deploys at that time, so my guess is something about Railway's infra changed at that time
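(For context, a rough sketch of the kind of start sequence being described: run migrations, then boot the app server. The file name, WSGI path, and gunicorn usage are assumptions, not the actual setup.)

# start.py -- hypothetical container entrypoint
import os
import subprocess
import sys

# Run migrations first; this is the ~10s of startup work mentioned above.
subprocess.run([sys.executable, "manage.py", "migrate", "--noinput"], check=True)

# Hand the process over to the app server (keeps signal handling sane as PID 1).
# Binding on [::] matters on Railway: private networking is IPv6-only.
os.execvp("gunicorn", ["gunicorn", "config.wsgi:application", "--bind", "[::]:8000"])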
Brody
Brody5mo ago
i have historical data and the network initialization times have not changed, still sitting at an average of 1.57 seconds. definitely worth looking into why your app is crashing
johns
johns5mo ago
It could be something about a k8s setting that you guys are using internally, or how hardware is allocated for the pods. Though I have no idea about the internals
thomas
thomas5mo ago
Do you have a timestamp for the time you're referencing?
johns
johns5mo ago
Around Jan 16, 18:00; not sure if this is UTC. The only new deployment I see from our side after Jan 13th is on the 17th, and I think it was a reaction to the errors we were seeing
thomas
thomas5mo ago
We do ask that initial private network attempts be retried during the first 100ms: docs
Railway Docs
Private Networking | Railway Docs
Documentation for Railway
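(A minimal sketch of that retry in Python; the attempt count and backoff values are illustrative, not from the docs.)

import time
import redis

def connect_with_retry(url: str, attempts: int = 5, delay: float = 0.1) -> redis.Redis:
    # Early attempts right after a (re)start can fail while the private network initializes.
    last_exc: Exception = RuntimeError("no attempts made")
    for _ in range(attempts):
        try:
            client = redis.Redis.from_url(url, socket_connect_timeout=2)
            client.ping()  # force a real connection; from_url alone is lazy
            return client
        except (redis.exceptions.ConnectionError, OSError) as exc:
            last_exc = exc
            time.sleep(delay)
            delay *= 2  # back off: 0.1s, 0.2s, 0.4s, ...
    raise last_exc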
Brody
Brody5mo ago
side note, railway doesn’t use k8s for any user workloads, and as far as i know they haven't changed the runtime yet
thomas
thomas5mo ago
I am looking at our commit PR log to check
johns
johns5mo ago
I think if you look back and forth 2~3 hours from there, maybe something can be found? But I'll also continue diagnosing
thomas
thomas5mo ago
We recommend you change your restart policy to restart always for your API service. We ran into a strange condition where one of your replicas was in an exited state.
johns
johns5mo ago
ok changed it
thomas
thomas5mo ago
Still looking into other things on our side; that was just something that was noticed. Where did you get this date, and what would your local time be if this isn't UTC?
johns
johns5mo ago
The date is from the Railway log itself, so whatever that uses. OK, it seems like it's PST-based, given that the most recent log matches my timezone, so Jan 16, 18:00 PST
macwilko
macwilko5mo ago
i believe logs should be in UTC on Railway
johns
johns5mo ago
I meant just from my screenshot
johns
johns5mo ago
I see this in the most recent screenshot, which means the display is based on my current timezone. So, given that the screenshot I was referring to for when the errors started shows ~Jan 16, 18:00, thomas should look at the commits using PST. I'm sure on the backend they're saved as UTC
No description
thomas
thomas5mo ago
Yes, but we do convert it to local time in the UI! So that time is in PST. One of our infra engineers asks if you can modify the code to log the IP of redis.railway.internal when you init a connection, so we can confirm whether or not it's DNS-related. If you put this where the connections are started in the pool, that should avoid flooding the logs.
johns
johns5mo ago
😓 I've done this before but it was a lot of trial and error - do you happen to have the python code for it
thomas
thomas5mo ago
You can share your connection code and Brody or I might be able to help
johns
johns5mo ago
Just double checking - am I supposed to look up for redis.railway.internal:6379?
Brody
Brody5mo ago
just the hostname
thomas
thomas5mo ago
no, that's an internal log sorry
johns
johns5mo ago
We're only setting the REDIS_URL, which contains everything: redis://<user>:<password>@<host>:<port>, so
thomas
thomas5mo ago
If that works sometimes, it should always work
johns
johns5mo ago
@Brody .query is deprecated
import dns.resolver

def print_aaaa_record(domain: str):
    print("print_aaaa_record")
    try:
        answers = dns.resolver.resolve(domain, "AAAA", search=True)
        for rdata in answers:
            print(domain, "has AAAA record", rdata.to_text())
    except dns.resolver.NoAnswer:
        print("No AAAA record found.")
    except Exception as e:
        print(f"An error occurred: {e}")
perhaps this works? Though I'm pretty skeptical that it's going to work on the entire redis connection string
Brody
Brody5mo ago
please only lookup the host
johns
johns5mo ago
monorail.proxy.rlwy.net just this right?
Brody
Brody5mo ago
thats the public host
johns
johns5mo ago
I know, just the form of it
Brody
Brody5mo ago
ah then yes, thats a hostname
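(Putting those pieces together, a sketch that pulls just the hostname out of the connection string and resolves its AAAA record with the standard library; dnspython, as in the snippet above, works just as well. The env var name is whichever one holds the private URL.)

import os
import socket
from urllib.parse import urlparse

def log_redis_host_ip() -> None:
    # REDIS_URL looks like redis://<user>:<password>@<host>:<port>; only <host> is resolved.
    host = urlparse(os.environ["REDIS_URL"]).hostname
    try:
        # AF_INET6 restricts the lookup to AAAA records (the private network is IPv6-only).
        for info in socket.getaddrinfo(host, 6379, socket.AF_INET6):
            print(host, "resolved to", info[4][0])
    except socket.gaierror as exc:
        print("DNS lookup for", host, "failed:", exc)

# Call this right where the pool is created, so it runs at the same moment a connection would.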
thomas
thomas5mo ago
@unstoppablerizz can you tag me when this happens again? we have an idea we would like to try, but we need it to be happening
johns
johns5mo ago
sure
thomas
thomas5mo ago
I have a larger update coming; we did find something. Just confirming best next steps with the team. @Conductor This user can override rule 5 today and tag me directly.
Brody
Brody5mo ago
for complete transparency Thomas, in my testing, over the span of 7 days and 600k successful internal dns lookups, I get about 6 to 7 failed dns lookups. though this doesn't explain their issues, and my test doesn't cover all the deployed resolvers, just thought I'd mention it
thomas
thomas5mo ago
To reproduce this you would have to crash your container and then use the internal network immediately. It is likely that your test is so stable that it doesn't reproduce the rare race condition we are looking at
Brody
Brody5mo ago
my thoughts exactly, but 6 or 7 errors is something, though that's only, what, a 0.0010% error rate I'm seeing
johns
johns5mo ago
@Brody redis.railway.internal has AAAA record fd12:f325:1eed::32:ffc6:b2dc
Brody
Brody5mo ago
that's the correct value. you've made that lookup in code at the exact same time the pool is created, right?
thomas
thomas5mo ago
Yes, this confirms the DNS is fine and our update is more relevant. We are still working on it, sorry
johns
johns5mo ago
yes @thomas. Would love to hear what the postmortem was, just out of curiosity, once it's settled
thomas
thomas5mo ago
we are hammering it out currently
johns
johns5mo ago
Though I'm still not 100% sure that there's nothing I can improve on our side - looking to see how we can manage things better
Brody
Brody5mo ago
super long shot: in node there's a redis package called ioredis, and it has a setting called family where you set it to use either ipv6 or ipv4. maybe your redis client in python has something similar, where you could default it to ipv6?
johns
johns5mo ago
I've had the ipv4 ipv6 issue with datadog before but I don't think that's the root cause here
thomas
thomas5mo ago
We have checked the DNS logs for the host that was running the application which exhibited the Name or service not known errors and found no issues there. We've asked you to log the DNS resolution in your app to also help confirm whether it is DNS-related. However, given that the issue occurred only after a restart, we suspect this might be related to a very rare race condition in private networking that we have recently discovered, and you are one of the first customers to also mention it. A patch was in the works already, but we have our engineers prioritizing this to get it out right away. As the patch is being deployed we also have a platform engineer on stand-by throughout this process to troubleshoot more and provide support to myself and you.

We have some mitigations you can try in the interim: we recommend you delay connecting to the private network for 3 seconds after your service starts, as the race condition comes from the private networking interfaces being modified by the kernel when they receive traffic before they have finished initializing their routing tables. We'd like this change to be removed afterwards, since it's not ideal to wait 3 seconds for your app to start, but it should mitigate the issue entirely. If you have an application that gets into this state, please let us know, as we have some tooling to go in and check whether the issue is a result of the race condition we suspect.

Just adding that if you don't implement the 3 second delay, that's your call, because this fix should be rolling out today. We just wanted to put that there to give you options.

Probably not this, but I honestly couldn't tell you if it affects this race condition or not.
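(A minimal sketch of that interim mitigation, assuming the pool is created at import time; gating on the RAILWAY_ENVIRONMENT variable is an assumption so local development isn't slowed down.)

import os
import time
import redis

# Interim mitigation: give the private networking interface time to finish
# initializing before the first packet hits it. Remove once the platform fix lands.
if os.environ.get("RAILWAY_ENVIRONMENT"):
    time.sleep(3)

pool = redis.ConnectionPool.from_url(os.environ["REDIS_URL"])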
macwilko
macwilko5mo ago
smol question: will this fix mean that I won't need the 3 second delay before connecting to private Redis in future? I have a Go api and I'd very much like to remove the 3 second delay. I poggers
thomas
thomas5mo ago
Wait, I didn't read this. You should never have needed that long of a delay. We try to bring the private network up in 100ms. We recommend that you just retry your connections a little more after the server starts, generally. This race condition is specifically for services that restart under a lot of load and then flood the private network, which then breaks in a way that doesn't resolve until the service restarts. Also, Brody said this, which means we might be missing our 100ms mark, but still not 3s. It's going to depend on a bunch of things
Brody
Brody5mo ago
3 seconds is a good bet (data from 2 days)
No description
macwilko
macwilko5mo ago
brody's figures are more accurate, I wish it was 100ms...
Brody
Brody5mo ago
it will be 0, but thats off topic
johns
johns5mo ago
Just to double check - was it a regression introduced around the time range we found (Jan 16th, 18:00 PST)? Or did we just walk into it somehow, @thomas? Out of curiosity
thomas
thomas5mo ago
We are still unsure; it's hard to rule out any change to the kernel or our host OS in general. But we didn't find a smoking gun
char8
char85mo ago
Heya - so the issue here seems to be a very narrow race between when there's a high volume of traffic while we are initialising privnets. There was a gap between two syscalls in that init process where a packet of traffic could get the IPv6 neighbour table into a bad state.

We only spotted this yesterday cos' we built a new chaos monkey system to stress test with frequent deploys and monitor network connectivity. The race has existed since we launched the feature, and no users previously reported it. It's unlikely it was a regression, and more probable that you ran into it by accident - most likely because the containers crashed and restarted into a high volume of traffic; redeploys etc. get a gradual traffic ramp, so they're very unlikely to trigger it.

We're working on a privnets v2 - a rewrite of the BPF programs underneath, more control over injection, and a bunch of other improvements to the overall runtime. It's been a slow process to roll out because it's complex and has a wide impact, and it's not available to users atm - we're slowly landing it in parts. I'll ping back here once that's launched (don't want to commit to timelines until I get it out to beta), but I'm actively on that 100% of my time.
Solution
thomas
thomas5mo ago
@unstoppablerizz The fix for this race condition has been merged and deployed. This should not happen to you again. If you want to remove the 3 second delay (assuming you added one), you can, but as Brody showed us, that will generally just result in errors. However, if Django already takes long enough to start, you might not need an additional delay.
thomas
thomas5mo ago
If this happens again, please reopen this thread.
johns
johns5mo ago
@thomas The error is actually ongoing right now since 16:14
Adam
Adam5mo ago
what time zone?
thomas
thomas5mo ago
Good Morning my time.
No description
thomas
thomas5mo ago
I did a quick check of the logs and noticed that it did happen again, starting around 1600 and ending around 2000. It doesn't seem to have happened again after that. Is my filter not showing this properly, or do you observe that as well? I also want to note there were some memory spikes the day after, on the 20th, around redeploys, but overall your memory usage has smoothed out. I can't tell if this is from replicas not restarting or if you just experience a smoother load over weekends
Mig
Mig5mo ago
We are still interested to hear about how this issue might still be impacting your services, so please let us know if the issue is persisting. I know you mentioned the error is still ongoing, but Thomas mentioned he couldn't verify that. We know for sure that the race condition we suspected you were hitting is in fact resolved, and now it's just a matter of confirming whether that was the issue you ran into.
johns
johns5mo ago
The issue is still persisting, and we ended up just adding a sleep. There's a separate service for which we're experiencing the same issue of the internal host not being found
theodor
theodor5mo ago
Yeah we're trying to connect to the private URL of the new redis server for the migration and that's not being found: 2024-01-26 20:16:35 [line-worker] error: Line worker error: Error.getaddrinfo ENOTFOUND redis.railway.internal. Error: getaddrinfo ENOTFOUND redis.railway.internal cc @Brody @Mig @thomas
Mig
Mig5mo ago
do you have a project with that error right now? if so I can quickly confirm if it's the issue we expect
theodor
theodor5mo ago
yeah i had to roll back though
Mig
Mig5mo ago
understandable
theodor
theodor5mo ago
Deployment ID: b2888f70-317a-404c-8910-8702eb63bc36
Mig
Mig5mo ago
Do you have a delay in the start of your app to avoid the race condition?
theodor
theodor5mo ago
I thought that was solved? Should I still add it back in the code? Maybe it was a miscommunication - we thought that the 3s delay was no longer necessary
Mig
Mig5mo ago
no, you understood correctly. We did deploy a patch for that, but if you're still experiencing the issue it might be something else, or the patch was not right (I highly doubt this - we saw the bug, deployed the patch, and could no longer reproduce it)
theodor
theodor5mo ago
I can try to quickly redeploy one of our dummy services for testing with a 3s delay
Mig
Mig5mo ago
if there isn't too much impact by having the 3 second delay I would add it for now to try and mitigate the error.
Brody
Brody5mo ago
from my data, a 3 second delay is still necessary as the internal dns server will not respond to any dns lookups until after about 2-3 seconds
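(An alternative way to spend that startup window, sketched under the same assumptions: poll the resolver until it answers instead of sleeping a fixed 3 seconds. Timings are illustrative.)

import socket
import time

def wait_for_private_dns(host: str = "redis.railway.internal", timeout: float = 10.0) -> bool:
    # Returns True as soon as `host` resolves over IPv6, False if it never does within `timeout`.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            socket.getaddrinfo(host, None, socket.AF_INET6)
            return True
        except socket.gaierror:
            time.sleep(0.25)  # the resolver usually comes up within ~1.5-3s per the data above
    return False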
theodor
theodor5mo ago
ah ack! thanks brody
Mig
Mig5mo ago
Brody - the patch prevents an unrecoverable state, but I believe you're right that DNS can (and it seems often does) fail for the first 3 seconds.
theodor
theodor5mo ago
but actually, let me check this - our worker will attempt to reconnect forever even if it fails at first. shouldn't it eventually connect? it was failing for like a minute
Brody
Brody5mo ago
over 7 days the average internal dns resolver ready time is 1.56s with some very intermittent spikes to 5+ seconds
theodor
theodor5mo ago
but if it's ready after 2s, shouldn't the reconnection work after 2s or 5s?
Mig
Mig5mo ago
yeah, 100% it should
theodor
theodor5mo ago
is this because it's using the original new Redis(REDIS_URL) client?
Mig
Mig5mo ago
if it does not, that's a significant bug. The 3 second startup is a caveat that can be worked around, but DNS resolution failing after that is not acceptable.
theodor
theodor5mo ago
i'm wondering if it's a detail of how ioredis on node handles this
Brody
Brody5mo ago
oh wait you're using ioredis? I thought this was python
Mig
Mig5mo ago
if the retry mechanism re-uses the broken config, I could see how future attempts just continue to error
theodor
theodor5mo ago
ya @Brody should i specify ?family=6
Brody
Brody5mo ago
do you have family set to 6?
theodor
theodor5mo ago
? ahhhhh i see
Brody
Brody5mo ago
absolutely otherwise ioredis will only use ipv4
theodor
theodor5mo ago
i see, this explains everything. i'm sorry then, that's not your fault, it's ioredis
Brody
Brody5mo ago
I could have sworn this was a python app we were talking about, so ioredis didn't cross my mind
theodor
theodor5mo ago
haha that is our other service my bad
Brody
Brody5mo ago
is that service still experiencing problems?
Mig
Mig5mo ago
check out the caveats on private networking: https://docs.railway.app/reference/private-networking#caveats - setting ioredis to use ipv6 is not obvious from our caveats page, but it might be helpful to know all the known issues
Railway Docs
Private Networking | Railway Docs
Documentation for Railway
theodor
theodor5mo ago
@Brody i think we had to add the sleep for the rizz service @Mig thank you!
Brody
Brody5mo ago
Melissa has added a thing about ioredis elsewhere in the docs
Mig
Mig5mo ago
ohhh you're rizzgpt. I saw that project and thought it was an interesting name. :p
Brody
Brody5mo ago
it's a very cool product too!
theodor
theodor5mo ago
haha thank you! the node issue is for heycami.com which is even higher usage
Mig
Mig5mo ago
try the ipv6 setting to ensure the correct dns resolution is used and report back
theodor
theodor5mo ago
on it!
Mig
Mig5mo ago
I'm really trying to fix this issue for y'all. Clarification: the patch we deployed prevents an unrecoverable state for the private networking routes. There is still currently a ~3 second startup delay during which DNS resolution may fail. If the 3 second delay is not too impactful, I'd recommend leaving it. We are working on removing that startup delay this quarter, because it is definitely unreasonable to have you work around it. Just so I understand the impact on you: has the 3 second delay mitigated the issue? DNS resolution should work after the first few seconds.
johns
johns5mo ago
For rizzgpt, yes. We'll see if the ipv6 and the delay fixes things for cami as well
Brody
Brody5mo ago
the 3 second delay won't have any impact as long as you are using a readiness healthcheck!
Mig
Mig5mo ago
ok ty @ me if y'all want. I'm going back to working on our new edge proxy 🙂
theodor
theodor5mo ago
using family 6 did the job! (forgot to mention earlier)
Brody
Brody5mo ago
awesome
Mig
Mig5mo ago
thanks Brody 🐐