Postgres stuck rebuilding (whole website down!)

Hello,
I’m in a code red: my whole web service is down. I sent a query that was too big, I guess, and it froze my whole instance, eating up the entire CPU. I restarted the service (powered it down and back up), and now it’s just been stuck in “Rebuilding” for like 20 minutes. My users are gonna be pissed! Please help, I’m not sure what to do.

Thanks
Mitch

Hey there, Mitch! So sorry to hear you hit an issue! :frowning:

From digging around, it looks like you’re on our Free plan for PostgreSQL, which caps disk at 1 GB; that cap may be where you ran into the problem.
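
If it helps for next time, you can keep an eye on how close you are to that cap yourself from psql. Here’s a rough sketch (pg_database_size covers the database’s own files; WAL lives outside it and uses disk too, so treat this as a lower bound):

```sql
-- Approximate on-disk size of the current database (tables, indexes, TOAST).
SELECT pg_size_pretty(pg_database_size(current_database()));

-- The ten largest tables, to see what's using the space.
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```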

Restarting the service when there’s trouble is a good call; did it eventually come back online? The website attached to your account appears to be up, so I hope so! :crossed_fingers: (As a Community team member I don’t have access to customer platform services myself; we keep those under strict lock and key, even for free plan users.)

Please note that free plans don’t have an uptime SLA attached to them, nor do they offer formal support. But I’ll try to help however I can!


Hi Mitch,

I can see that your service is now up and running after you upgraded your plan to startup-4.

Let us know if you have further questions.


Hey! Yeah, on the free tier it just said “rebuilding” for like an hour, with no progress and no indication of an error. Would appreciate any insight on that if you’ve got it. Would it eventually have taken care of itself? Could it take longer than an hour to restart a free database? I figured that if I upgraded, I could fork a backup and point my server at that new forked database, but when I upgraded it actually just rebuilt fine in a few minutes.

Thanks
Mitch


Based on some amazing sleuthing by one of our SREs, it looks like the problem was that the disk ran out of space. Most likely, PostgreSQL was trying to write a temp file, which it does when it can’t hold intermediate query results (big sorts or hashes) in memory. That tends to happen with particularly large or complex queries, which sounds like the “origin story” here. :\
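
For future reference, you can spot this kind of spill from psql yourself. A rough sketch, with placeholder table/column names; note that changing temp_file_limit may require superuser rights depending on your PostgreSQL version:

```sql
-- Cancel any query whose temp files would exceed this limit,
-- instead of letting it fill the disk (may need superuser rights to set).
SET temp_file_limit = '512MB';

-- See whether a query spills to disk: look for "Sort Method: external merge"
-- and "Disk:" figures in the output. The names below are placeholders.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table ORDER BY some_column;

-- Cumulative temp-file usage per database since the last stats reset.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_total
FROM pg_stat_database;
```

With temp_file_limit set, a runaway query fails with an error instead of eating the whole disk, which is usually the better failure mode.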

On a paid plan, our operators would’ve been alerted and would have fixed this, but our free plan comes without formal support or monitoring, just “best effort” help through this very forum.

However, you raise fair points about what you’d expect to happen in this situation, so I’ve filed it as a feature request on our Aiven Ideas Platform: Allow for graceful recovery of "queries gone wild" on free tier plans – Aiven Ideas. Feel free to give that an upvote and/or add additional context.

Sorry about the trouble, hopefully it’s smooth sailing from here! :sunglasses:


I’m late to the party here, but I see the idea has been shelved without any comment. I think that’s a shame.

If the free tier is supposed to be anything other than a trial of undefined length, there must be a way to recover the service when something goes wrong.

It’s perfectly understandable that problems with free tier services don’t trigger on-call alerts, but at some point (within a reasonable time frame) someone should take a look and recover the service if it’s impossible for the user to recover it themselves. This helps Aiven too: services that break and never recover will be abandoned by users, who might not even be able to switch them off, leaving Aiven with a bill for no reason.

This is an actual problem I have observed elsewhere. I know of a large American database vendor with a generous free tier in their cloud infrastructure offering, but with a policy of never touching free tier services. Since their cloud offering is rife with bugs, several users have complained that their free tier projects are stuck in some way and that it is impossible for them to recover or turn them off. By my count, there were at least 100 VMs running in said cloud infrastructure last time I checked, plus an unknown number of ancillary services, none of which are in use. The people who set them up and were using them can’t recover them, can’t shut them off, and have simply abandoned them.