Neon · 2y ago
painful-plum

More transparent communications when issues occur and how customers might be impacted.

It's cool that y'all are moving fast. Love to see it. But you can easily erode trust when you don't tell your customers that you've disabled a feature. In this specific case you disabled persistence of logical replication slots and related data in all regions. A toast/banner on the dashboard to all customers would suffice. The fact that logical replication was disabled left me trying to figure out why my Airbyte connector was failing over the weekend and why my replication slots weren't sticking even after I recreated them, which in turn hurt the freshness of analytics reports. Transparency/communication can easily make bugs/issues like this easier to take as a customer!
5 Replies
fair-rose · 2y ago
@Drew I have an instance with LR enabled that's streaming changes to Confluent Cloud and it has been working fine since last week. I believe some compute instances with LR enabled were restarted at some point last week, but I'm following up with the team to get more info
painful-plum (OP) · 2y ago
This is what I got from the team this evening:
Indeed, as of last Friday we disabled persistence of logical replication slots and related data in all regions.
With the next storage release (currently planned for tomorrow, pending any issues we discover before that) this will be fixed, i.e. persistence enabled back again.
In summary: We had an incident on Thursday causing projects to be stuck and unable to start due to pageservers being overloaded. https://neonstatus.com/incidents/01HQ7NV6XQ70M71FGRNDYQGHZK We had to disable support for logical replication at the pageserver level to mitigate the situation. More details from engineering below. We are going to update our documentation to indicate if any changes will be required on the user end going forward.

Since Friday we have disabled persistence of logical replication slots and related data in all regions; this means that if compute restarts, the logical subscription (replication slot) must be recreated, as slots disappear. Otherwise it works as previously. With the next storage release, which will happen tomorrow, this will be fixed, i.e. persistence will be enabled again.

However, at the same time we will enable automatic drop of inactive logical replication slots; inactive means compute is being used, but the logical subscription doesn't acknowledge progress (flush_lsn) for several hours. The reason is that such an inactive slot bloats storage; we might raise the limit in the future and can do that on demand, but right now (as of the release) it is quite strict.
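For anyone wanting to catch this on their own side, here is a minimal sketch of a slot health check. It relies only on PostgreSQL's standard `pg_replication_slots` view and `pg_current_wal_lsn()`. The helper names, the example slot names, and the byte-lag threshold are my own assumptions. Neon's actual criterion is time-based (no acknowledged progress for several hours), so the byte lag here is only a rough proxy.

```python
# Query to run with whatever PostgreSQL client you already use.
# Column names come from the standard pg_replication_slots view.
SLOT_QUERY = """
SELECT slot_name, active, confirmed_flush_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical'
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def check_slots(expected, rows, max_lag_bytes):
    """Return (missing, lagging) slot names.

    expected      -- slot names your pipelines rely on (hypothetical names)
    rows          -- tuples shaped like the SLOT_QUERY result
    max_lag_bytes -- assumed threshold; a proxy, not Neon's real limit
    """
    seen = {row[0] for row in rows}
    # Slots that vanished, e.g. after a compute restart without persistence.
    missing = sorted(set(expected) - seen)

    lagging = []
    for name, _active, flush_lsn, current_lsn in rows:
        if flush_lsn is None:
            # The consumer has never acknowledged any progress.
            lagging.append(name)
            continue
        lag = lsn_to_int(current_lsn) - lsn_to_int(flush_lsn)
        if lag > max_lag_bytes:
            lagging.append(name)
    return missing, sorted(lagging)
```

Run on a schedule, a non-empty `missing` list would have flagged the disappearing slots right after a compute restart, and `lagging` gives early warning before an inactive slot is dropped automatically.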
This should def be amplified to more than just my inbox. tyty Evan
fair-rose · 2y ago
@Drew are you on a paid plan BTW?
painful-plum (OP) · 2y ago
yes on the pro plan - or what used to be the "pro" plan! i run an agency - will be moving over to launch plan
deep-jade · 2y ago
@nb, what is the best way to communicate these kinds of changes for you?
