EMERGENCY: Production database keeps going down
Our production database keeps going down with an error message. This is a major issue.
The only resolution is to manually restart the compute endpoint each time.
Other context:
- us-east-2 endpoint
- Autoscaling is off
- Happened for the first time yesterday at 10:43 am PST
- Recently enabled logical replication
- Happened for the first time minutes after triggering an airbyte sync, but has happened 3+ times since then without us doing anything.
- Same day it started happening, neon had this status incident (https://neonstatus.com/aws-us-east-ohio/incidents/01HYDN1JG1MKS6HK9DS75HT358#01HYDN1JG16JHX8NHMNZ3D1FSC).
- Connection pooling is turned on and has been turned on for about a week. A Google search of this error message brings up pgBouncer, so it could be a connection pooling issue, but this only started happening yesterday.
We've tried contacting Neon support but have gotten no response, even with priority support and the ticket severity marked as "Critical: Production down".

AWS - US East (Ohio) - us-east-2 Status
10 minute downtime for some projects (269) in us-east-2 - AWS - US ...
The incident has been successfully resolved, and normal operations have been restored to all affected systems and services. Our team conducted a thorough investigation, implemented necessary actions, and analyzed the underlying cause to prevent similar incidents in the future.
8 Replies
robust-apricotOP•2y ago
Updated the post with this context:
- Connection pooling is turned on and has been turned on for about a week. A Google search of this error message brings up pgBouncer, so it could be a connection pooling issue, but this only started happening yesterday.
fascinating-indigo•2y ago
Hey! Super sorry to hear that you're running into an issue. Do you mind sharing your project ID?
robust-apricotOP•2y ago
wandering-sun-51218810
fascinating-indigo•2y ago
I'll get back to you shortly
robust-apricotOP•2y ago
thank you 🙏
rising-crimson•2y ago
I replied to Alex in the Support case that she raised last night
Long story short, this issue is caused by a malfunctioning replication: Airbyte is not consuming the WAL files that are preserved,
so the disk of your endpoint got saturated twice.
Here is an extract of the email I sent to Alex two hours ago:
The issue you experienced is caused by an exhaustion of the local disk space on your endpoint.
Projects with Logical Replication enabled must preserve a copy of the WAL files generated to support replication to an external database or ETL software.
The WAL files are progressively deleted once consumed by the subscriber.
In this specific case, it seems that your Airbyte replication is not consuming the WAL files generated, which leads to an accumulation of WAL files that eventually saturated your endpoint's local disk.
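The WAL accumulation described above can be spotted directly from the database. A minimal sketch, using the standard `pg_replication_slots` catalog view (the `wal_status` column assumes PostgreSQL 13 or newer):

```sql
-- Show how much WAL each replication slot is holding back on disk.
-- An inactive slot with a large retained_wal is the failure mode
-- described above; wal_status = 'lost' means required WAL is gone.
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

If the subscriber (here, Airbyte) is healthy, `active` should be true and `retained_wal` should stay small between syncs.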
<LOG EXTRACTS REMOVED>
When the disk got saturated, PostgreSQL panicked, which caused a restart of the database and all associated services, including PGBouncer.
To prevent this issue from arising, it's critically important to ensure that your subscriber (Airbyte) is actively consuming the WAL files generated.
Alternatively, we can set the parameter max_slot_wal_keep_size, which will restrict the maximum size of WAL files that replication slots are allowed to retain in the pg_wal directory at checkpoint time.
Once this maximum size is reached, the oldest WAL files will be purged, avoiding filling up the disk with WAL and thus preventing a crash of your endpoint.
But this also implies that Airbyte will no longer be able to continue replication due to the removal of required WAL files.
When the required WAL files are removed, you will have to drop the existing replication slot and recreate it.
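The mitigation above can be sketched in SQL. This is illustrative only: the `4GB` cap and the slot name `airbyte_slot` are assumptions, and on a managed service like Neon this parameter is typically set via support or the console rather than `ALTER SYSTEM`:

```sql
-- Cap the WAL that replication slots may retain (4GB is an example value).
ALTER SYSTEM SET max_slot_wal_keep_size = '4GB';
SELECT pg_reload_conf();

-- If a slot later reports wal_status = 'lost', replication cannot resume;
-- drop and recreate the slot ('airbyte_slot' is a hypothetical name),
-- then trigger a full resync from the subscriber side.
SELECT pg_drop_replication_slot('airbyte_slot');
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
```

The trade-off is exactly as described: the cap protects the disk, but once WAL is purged the subscriber must start over from a fresh slot.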
Can you please have a look at your Airbyte instance and confirm if everything works as expected?
Thank you in advance for your kind feedback.
Best regards,
Yanic
Same day it started happening, neon had this status incident (https://neonstatus.com/aws-us-east-ohio/incidents/01HYDN1JG1MKS6HK9DS75HT358#01HYDN1JG16JHX8NHMNZ3D1FSC).
The incident mentioned is unrelated to the problem you experienced. Between 11:47 UTC and 11:58 UTC, so 4 hours before your endpoint crashed, one of the Pageservers in us-east-2 became unavailable due to a slow deployment. The 269 projects attached to this Pageserver experienced 11 minutes of downtime, but your project is not one of the affected projects. Your project resides on Pageserver-5, which didn't experience any downtime.
@pfcodes ⬆️ If this helps, we can jump on a remote session to have a look at your Airbyte instance together.
robust-apricotOP•2y ago
In this specific case, it seems that your Airbyte replication is not consuming the WAL files generated
hey @Yanic -- thanks for your help on this. the reason why the Airbyte replication wasn't consuming the WAL files is that shortly after we started our first sync, the database went down. we assumed it was the sync that caused it and completely shut it off. it seems like what happened was a coincidence in timing; the database would probably have gone down regardless of whether we initiated the sync or not. we just re-enabled airbyte and the sync was successful, so hopefully the issue is now resolved. will know by the end of today if it doesn't go down again. thanks!
rising-crimson•2y ago
Thank you, I just saw your email!
Based on the logs I reviewed, the DB only went down due to a full disk; I didn't notice anything else abnormal with your database in the 2 hours before the crash.
Please, keep an eye on your DB and if you experience anything unexpected, drop me an email and I would be happy to help!
(I don't always get discord notifications on my work laptop...)