Outage or technical issue not reflected on status page? Are others having problems?
Our production Supabase instance has been offline for hours. There's nothing on the status page about an incident. Rebooting the project took a very long time (~20 mins) and restored service for about an hour, after which it went down again. Despite being on the Pro plan for a long time, we have yet to receive any response from support to our urgent request.
We got a notification that we're consuming a lot of our disk IOPS budget. Our project has never consumed more than 1-5% of its IOPS budget in years, and we are still nowhere near our CPU/memory capacity. There are no errors in the logs that point directly to the issue. The dashboard reports all services unhealthy. The API gateway logs show only 522 and 544 responses. Sentry reports failures for direct connections to the db.
Checking the Grafana metrics, there is nothing out of the ordinary. Our instance just goes totally offline with no indication of an error.
We made no changes to our app other than doing the upgrade on the infrastructure page to the latest postgres 17.4.1.074. The outage started happening ~8hrs after the upgrade.
There are a few discussions about it on github https://github.com/orgs/supabase/discussions/37963 https://github.com/orgs/supabase/discussions/16747
Anybody else encountering outages or these symptoms?
From your link it looks like one other user may have a similar issue. I'm still not seeing anything that reflects a mass outage, as there are just the two reports and it's not clear they are related.
It would be interesting though if the other user also upgraded their Postgres version as you claim this started after that.
This is a user-helping-user forum, as is GitHub Discussions for the most part (at least in any short-term situation).
Support is really the only option if this is an instance problem.
Did you get the confirmation email with a ticket number at least? If not, check spam.
I wonder if doing the upgrade somehow used up your disk I/O budget and it is doing ongoing cleanup or something. I'm not really familiar with all that Postgres does on an upgrade. Was this a 15 to 17 upgrade or just a 17 to 17 "minor" upgrade (which may not involve Postgres at all, as Supabase upgrades extensions and features that way)?
It was a 17 to 17 minor upgrade. Last upgrade was only a few weeks ago. Unfortunately support is not being very supportive. One ticket went unanswered. Another one was answered but it appears the person didn't thoroughly read the ticket. Just quoted some documentation I'd already read which per the metrics and data is almost certainly irrelevant, after which I haven't heard back. Our backend is going down at random without coming back until a manual reboot. Customers are mad. Customers are leaving. Negative reviews are happening. My team is angry. Supabase support does not seem to be taking it seriously. It's intensely frustrating.
Especially because we've been paying customers for a long time (since near the beginning) and evangelize supabase to others.
This isn't something we can diagnose and fix on our own. All metrics and logging appear to cease and we cannot connect to the DB. The only logging we get is the API gateway showing 522, 521, 544, and other random 5xx codes. All the logs for other services just stop. Grafana metrics just stop. There's no data with which we can diagnose anything. I think it requires Supabase support to look at the actual instance, but it seems nobody is paying attention.
Frustrating. Can you upgrade your instance out of it? It's a bit of a pain, as there is a bit of downtime to bump the compute size, which also raises the disk I/O budget. Then the same again to back it off if it does not help.
Disk io isn't the problem though. That appears to be a spurious error. Our disk io is very low, 1-5% of budget.
I already tried downgrading to a smaller compute then upgrading back to the current size. Unfortunately that didn't help.
Per all of the supabase reports and grafana metrics we're nowhere near any of our resource limits. ~4-10% avg cpu. ~20-40% avg mem. 1-5% disk iops. Plenty of free disk space. Everything looks completely normal.
Oh. You had mentioned disk warnings and commented on a GitHub disk I/O issue.
Yeah, we did get warnings in the console, but there's no indication that we were ever anywhere close to exhausting our IOPS budget. Even if we somehow did, AFAIK that would just make things slower, not bring down the whole instance and all services.
I don’t know enough at this level of detail to help. But it seems specific to your code or to a very small number (1 or 2) of instances, as there are still no other reports on the public forums.
I checked connections and pool size too. Pool size is 40 but we've never gone above 20-25 per grafana stats.
Unfortunately I doubt it's our code. We haven't deployed anything for weeks. This started happening only after we did that minor upgrade yesterday.
Yeah, and at the same time, if it is not a large outage, their infra team would not be looking at it until support passes it on.
I've tried to check pg_stat_activity for hanging active connections as well, nothing there 🙁
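For anyone following along, a check like the one just described might look roughly like this. It's only a sketch: the SQL uses standard `pg_stat_activity` columns, but the `find_stuck_sessions` helper, its 5-minute threshold, and the row shape (a list of dicts from whatever driver you use) are assumptions for illustration. It can also only run while the database still accepts connections.

```python
from datetime import timedelta

# Query against the standard pg_stat_activity view.
# Run it with any client while the database is still reachable.
STUCK_SESSIONS_SQL = """
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()
ORDER BY running_for DESC NULLS LAST;
"""


def find_stuck_sessions(rows, threshold=timedelta(minutes=5)):
    """Return the rows whose running_for exceeds the threshold.

    Hypothetical helper: `rows` is assumed to be a list of dicts keyed by
    the column names in STUCK_SESSIONS_SQL, with `running_for` as a
    timedelta (or None when query_start is NULL).
    """
    stuck = []
    for row in rows:
        running_for = row.get("running_for")
        if running_for is not None and running_for > threshold:
            stuck.append(row)
    return stuck
```

An empty result, as reported here, suggests the hang is not caused by a long-running or blocked query visible from inside Postgres.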
Somebody from support did reply to me today (on my 2nd request) but didn't offer much help and seems to have disappeared, no longer responding. I asked them to please escalate but never got a reply.
I've made a third request in the hopes that it might get seen by somebody and acted on sooner. Our product is burning, we can't do anything about it, and it seems that Supabase support isn't really interested unless we pony up $600/mo for the team plan. The one or two times we've had urgent issues in the past (maybe a year ago) support was very quick to respond and escalate appropriately to an urgent request on the pro plan. I guess something has changed?
I believe first-level support works within their own time zones. I have no insight into how they hand off issues. Support has gotten slower, and they are trying to hire amid very rapid growth. Also, the infra team has another odd issue with restores that spiked, and I know that at least last week they were overwhelmed.
But that is not your concern right now.
I know they don't technically have to get back to us for ~2 days per the pro plan, but it's very frustrating and disappointing to have such an urgent problem (for us and our customers) seemingly take minimum priority.
I hate to leave you hanging, but I have no way to escalate an individual case, and it is outside my skill set to help debug.
I know. Thank you. I appreciate you trying to help and listening to my frustration.
I'm just going to increase instance size from L to XL temporarily and cross fingers that somehow helps until support actually looks at the problem.
Didn't help 🙁
Instance went down at 03:45 EDT when there were literally zero clients connected. All logs just stop. No errors at all. Just dead. Only logs are 522/544 when later requests hit the api gateway and time out. Nothing can connect to the db. It just went down with zero load all on its own. Still no support 🙁
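Since the gateway responses are the only signal left once the instance dies, one way to give support a concrete time window is to bracket the outage between the last successful response and the first 5xx after it. The sketch below assumes the gateway logs have already been exported into `(timestamp, status_code)` pairs sorted by time. That shape and the `outage_window` and `status_counts` names are hypothetical, not a Supabase log format or API.

```python
from collections import Counter
from datetime import datetime


def outage_window(events):
    """Bracket the current outage from gateway responses.

    `events` is assumed to be a time-sorted list of (datetime, status_code).
    Returns (last_ok, first_fail): the last non-5xx response and the first
    5xx response after it. first_fail is None if the log ends on a success.
    """
    last_ok = None
    first_fail = None
    for timestamp, status in events:
        if status >= 500:
            if first_fail is None:
                first_fail = timestamp
        else:
            # A success ends any earlier failure run.
            last_ok = timestamp
            first_fail = None
    return last_ok, first_fail


def status_counts(events):
    """Tally responses per status code, e.g. how many 522s versus 544s."""
    return Counter(status for _, status in events)
```

With no clients connected, the first failure only appears when the next request arrives, so the instance may have gone down at any point inside the returned window.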
@Tomás P. reaching out to you from your comment on my reddit thread
@imagio Can you provide your Support ticket number here as @Terry has asked on the reddit post about that, but it might be quicker here.
Ah, I just saw that you replied on the Reddit post with it.
Thanks @imagio! this has been escalated by Terry
Thank you!