i'm suddenly seeing what i believe to be increased connections from hyperdrive to my postgres server (there are many pg processes each using very little cpu), which is now pegging my server to 100% cpu usage, and that manifests in numerous "PostgresError: Timed out while waiting for an open slot in the pool." as well as CONNECTION_CLOSED errors in my logs. i don't think my traffic has increased proportionally to cause this--i've never noticed sustained CPU usage like this before and my daily requests are about what they've been for the past few weeks. what could be causing this? do i just need to scale up my server?
while writing this i'm noticing a few dips down into the 30-70% range which is more reasonable but the sustained highs are still concerning to me
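for context, this is roughly how i've been counting connections on the db side -- just a throwaway node script using the same postgres driver the workers use, pointed straight at the server rather than through hyperdrive (the connection string is a placeholder):
```ts
import postgres from "postgres";

// direct connection to the origin DB, not through hyperdrive (placeholder credentials)
const sql = postgres("postgres://user:pass@db-host:5432/mydb");

// count server-side connections grouped by state
const rows = await sql`
  select state, count(*)::int as connections
  from pg_stat_activity
  where datname = current_database()
  group by state
  order by connections desc
`;
console.log(rows);

await sql.end();
```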
Do you happen to know the connection limit on your postgres server, and can you share your Hyperdrive ID?
hyperdrive ID is
9568cd870bee47f3801c862de747ca94
postgresql.conf says the connection limit is 100, i've never changed it so that seems right
Alright, thanks. Is it looking ok for now, or still knocking your server over?
things seem to be about the same unfortunately, though there was a period where it wasn't hitting 100% for about 40s
Do you have any middleware between Hyperdrive and your DB? PgBouncer/pooler/proxy/etc?
Also, is there any information you can provide on your origin DB? Version, hosting info, etc?
Feel free to DM if you prefer
no, hyperdrive should be connecting straight to the server. nginx might be involved? but it should only be set up on port 80, not postgres's port of course
i'm using postgresql 16 on a hetzner server in ashburn
and i'm using the postgres driver with drizzle orm if it matters
Ah. Hetzner with a raw IP address regularly seems to have issues. We've had folks have a lot of success with slapping a CNAME onto that. It's not clear to me why.
The TLDR of what's happening here is that the traffic to your server has gotten out of sync, and Hyperdrive is reading that as the connection has been corrupted, so it drops the connection and tries to open a new one. On your side the DB is not cleaning up those connections after we drop them, so they're piling up.
If possible, a restart of your DB should resolve this CPU usage issue, and a CNAME on your IP address might help prevent this sync issue from recurring.
I've only ever seen this happen on Hetzner, and you're only customer #4 or so that's brought this scenario to me, so it's not incredibly common.
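If you want to confirm and clear the pile-up without a full restart, something along these lines run directly against the DB (or the equivalent from psql) should show the orphaned sessions and let you terminate them. The 15 minute cutoff and credentials are just placeholders, and the role needs permission to terminate other backends:
```ts
import postgres from "postgres";

// straight to the origin DB, not through Hyperdrive (placeholder credentials)
const sql = postgres("postgres://admin:password@db-host:5432/mydb");

// sessions that have been sitting idle for a while -- likely the orphaned ones
const stale = await sql`
  select pid, usename, client_addr, state_change
  from pg_stat_activity
  where state = 'idle'
    and state_change < now() - interval '15 minutes'
`;
console.log(stale);

// reap them instead of restarting postgres entirely
// (postgres 14+ can also do this automatically via idle_session_timeout)
await sql`
  select pg_terminate_backend(pid)
  from pg_stat_activity
  where state = 'idle'
    and state_change < now() - interval '15 minutes'
`;

await sql.end();
```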
oh wow! do you have an example on how to set that up? same zone, proxy, port specification, etc.? i'm using cloudflare for my DNS as well
Let me see if I can find a good guide real quick
I'm not finding a ton. Most of what I see suggests that it's pretty vanilla, e.g. https://community.cloudflare.com/t/setup-cloudflare-on-hetzner/59595
The customers I've spoken with on this issue in the past had used Hetzner's own DNS, IIUC, so I haven't had cause to go digging into that particular setup before.
Everything should be the same, yes, just a cname to resolve it instead of the direct IP address.
i've set up an A and AAAA record to the raw ipv4 and ipv6 (it won't let me create a cname for the raw IP, they are hostname only, no? does it need to involve a CNAME?)
not sure how to test this, i guess i'll create a temporary worker and hyperdrive configuration so i don't mess up the main workers accidentally. i remember reading that cloudflare only supports certain ports(?), i'm hoping i don't need to set up a proxy and can just use example.com:5432
oh actually i guess cloudflare will just tell me if it's bad when i try to create it
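the throwaway worker i have in mind is just a connectivity check against the temporary hyperdrive config, roughly this (binding name is a placeholder, types assume @cloudflare/workers-types):
```ts
import postgres from "postgres";

// throwaway worker: can the temporary hyperdrive config (hostname, not raw IP) reach postgres?
export default {
  async fetch(request: Request, env: { HYPERDRIVE: Hyperdrive }, ctx: ExecutionContext) {
    const sql = postgres(env.HYPERDRIVE.connectionString);
    try {
      const [row] = await sql`select version() as version`;
      return new Response(`ok: ${row.version}`);
    } catch (err) {
      return new Response(`failed: ${err}`, { status: 500 });
    } finally {
      // close the client once the response is done
      ctx.waitUntil(sql.end());
    }
  },
};
```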
A and AAAA should be plenty I believe
ok, i tried several solutions and so far only one has been able to connect to the database. these didn't work:
- A and AAAA to the raw IPs
- CNAME to hetzner's public network reverse proxy (which i only found after poking around to create a private network)
- Zero trust tunnel to pg's unix socket
finally i tried plugging in hetzner's reverse proxy hostname without running it through cloudflare first, and that worked (for creating the hyperdrive configuration), but i have yet to see whether it will actually fix the issue i'm facing (about to test). for future reference: this can be found in the hetzner cloud panel -> project name -> servers -> server name -> networking -> public network

Understood.
Please let me know if it helps.
I think a restart of the database itself is likely necessary as well; this change is more to prevent recurrence
well, it's only been a few minutes, but unfortunately it's still sustaining 100% utilization for longer than i would like after updating the configuration and restarting with systemctl.
however, things appear to be more responsive from the front-end side--i'm hoping that's not just a temporary effect of the restart
i had restarted earlier today and i noticed that postgres's ram usage had steadily climbed to something like 2gb by the time i restarted just now, so obviously that would indicate something not being cleaned up. can't say i'm noticing that again but it's only been a few minutes
about 20m later: it's still climbing, but under 1gb
The DB not dropping connections when we disconnect is particularly strange. I wonder what's causing that.
We'll be adding some more logging around this. Also more explicit disconnect behavior, I think.
This is going through a DO or anything like that?
Do you open/close a client connection on each request?
most queries are not going through DOs but some are. i don't explicitly open or close connections because that appeared to be handled for me; i thought hyperdrive handled the pooling and connections (but maybe it's obvious now that it's not)
It is (or should be), but for the origin connections. Opening/closing connections to Hyperdrive is somewhat handled by your ORM, but the way it does so may differ from one stack to the next
I'm just running through what might be holding connections open in this way, trying to collect some additional info as I go
Drizzle with postgres.js is an incredibly popular setup so I doubt it's that, in this case. Just asking to check
ah ok. i have two workers that access this hyperdrive config but they both share drizzle initialization code so hopefully they are theoretically identical. one is a remix app and the other is a more bare server-only worker
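for reference, the shared init is more or less this shape -- i'm paraphrasing and the names are placeholders, but both workers create the client from env.HYPERDRIVE.connectionString the same way; i've also sketched in an explicit end() since the open/close question came up:
```ts
// shared db init used by both workers (paraphrased; names are placeholders)
import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";
import { sql } from "drizzle-orm";

export function createDb(connectionString: string) {
  // one postgres.js client per request, pointed at hyperdrive
  const client = postgres(connectionString);
  const db = drizzle(client);
  return { db, client };
}

// usage inside either worker's fetch handler (types assume @cloudflare/workers-types)
export default {
  async fetch(request: Request, env: { HYPERDRIVE: Hyperdrive }, ctx: ExecutionContext) {
    const { db, client } = createDb(env.HYPERDRIVE.connectionString);
    const result = await db.execute(sql`select now() as ts`);
    // explicitly close the client once the response is done
    ctx.waitUntil(client.end());
    return Response.json(result);
  },
};
```
the real queries go through the drizzle schema of course, this is just the wiring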