Reliability concerns for cloud databases

More and more businesses are moving their infrastructure to the cloud. It makes a lot of sense: on-premises infrastructure is often underutilized, maintenance is a hassle, and there are many ways for things to go wrong. Typically, even if backups are made regularly, companies almost never have a regularly tested restore process. You really want a quick and painless restore process when things are on fire.
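To make "regularly tested restore process" concrete, here is a minimal sketch of a restore drill. It assumes SQLite from the Python standard library as a stand-in for a real database server, and the names make_backup and restore_and_verify, as well as the orders table, are made up for illustration. The idea carries over: take the backup, restore it somewhere disposable, and check that the data you expect is actually there.

```python
import sqlite3


def make_backup(conn: sqlite3.Connection) -> str:
    """Return a plain SQL dump of the whole database."""
    return "\n".join(conn.iterdump())


def restore_and_verify(dump: str, expected_counts: dict) -> bool:
    """Restore the dump into a scratch database and compare row counts.

    expected_counts maps table name -> number of rows we expect to find.
    """
    restored = sqlite3.connect(":memory:")
    try:
        restored.executescript(dump)
        for table, expected in expected_counts.items():
            query = f'SELECT COUNT(*) FROM "{table}"'
            (actual,) = restored.execute(query).fetchone()
            if actual != expected:
                return False
        return True
    finally:
        restored.close()


# Hypothetical "production" database with a few rows in it.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
production.executemany(
    "INSERT INTO orders (item) VALUES (?)", [("disk",), ("cable",), ("fan",)]
)
production.commit()

dump = make_backup(production)
restore_ok = restore_and_verify(dump, {"orders": 3})
```

A row count is a weak check, but even this much, run on a schedule, tells you whether the backup can be restored at all.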

Having everything in the cloud is often more expensive - or at least it looks that way. Yes, we could buy a bunch of servers for the money we shell out to Amazon, Google or Microsoft every month. The thing is, we are comparing apples and oranges. The on-premises side of the equation is likely a cobbled-together solution that works "good enough". The cloud side is, almost without exception, engineered for high availability and reliability.

So, everything is fine while our on-premises database is running as expected. We tend to forget about it because "it works". But if we want it to keep working, we must monitor both hardware and software, and alert when things get fishy. Now, if SMART starts reporting bad sectors, we just add a new drive to our RAID array and we're good, right? Wrong.

I had a RAID5 array of 5 disks, which tolerates a single drive failure. There were 3 old drives and 2 I added a few years later. Sure enough, those two drives started having bad sectors somewhat later. So I purchased two new drives and planned to replace one, wait for the array to resync, then replace the other. As soon as I popped one out and the sync started, the array crashed under the increased activity of syncing. Being old and wise (and burned in the past), I placed the array in read-only mode and backed it up as soon as I noticed the issues. Rebuilding the array took a while, and you do not want that kind of downtime in production.
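The "alert when things get fishy" part can be sketched in a few lines. This is only an illustration: it assumes the SMART raw values have already been collected into a dictionary (for example by parsing smartctl output, which is not shown), and the function name, the device names and the readings are made up. The three attribute names are the ones smartctl commonly reports for sector trouble.

```python
# Attributes that commonly indicate failing sectors in smartctl output.
SECTOR_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def drives_needing_attention(readings, threshold=0):
    """Return {device: [attribute names whose raw value exceeds threshold]}.

    readings is assumed to look like {device: {attribute: raw_value}}.
    Missing attributes are treated as zero.
    """
    alerts = {}
    for device, attributes in readings.items():
        bad = [
            name
            for name in SECTOR_ATTRIBUTES
            if attributes.get(name, 0) > threshold
        ]
        if bad:
            alerts[device] = bad
    return alerts


# Hypothetical readings for a two-drive machine.
sample_readings = {
    "/dev/sda": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0},
    "/dev/sdb": {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 2},
}
alerts = drives_needing_attention(sample_readings)
```

As the story above shows, the alert is only the start. What you do next (back up first, then touch the array) matters more than the check itself.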

Then, there is patching. Easy to do on a small scale, a hassle on a larger scale.

Then, there is scaling. Scaling relational databases is a non-trivial endeavor. While we'd like to have a unit of performance and "just add servers", relational databases actually scale far better vertically because of the atomicity requirement. It's easy to add read replicas if your workload is read-intensive, but scaling writes is hard.
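Here is a rough sketch of why reads are the easy half. It assumes a single primary and any number of read replicas. The ReadWriteRouter class and the server names are hypothetical, and deciding by the first keyword of the statement is a deliberate simplification. Reads can be spread round-robin across replicas, while every write still lands on the one primary, so adding servers does nothing for write throughput.

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to replicas, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Cycle through replicas round-robin; with none, reads hit the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
targets = [
    router.route("SELECT * FROM orders"),
    router.route("select count(*) from orders"),
    router.route("INSERT INTO orders (item) VALUES ('disk')"),
    router.route("SELECT 1"),
]
```

A real setup also has to deal with replication lag and with reads that take locks (SELECT ... FOR UPDATE, for example), both of which this sketch ignores.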

Then, there are upgrades. We'd like to upgrade our database servers without taking the database down. Well, even upgrading the database server can be risky. You need to read the documentation and make sure you have anticipated everything that can go wrong. Then you need to weigh the risk you're willing to take against the performance hit you're willing to accept. It may be possible to point the new binaries at the existing data directory - or not. You may need to do a full backup followed by a full restore, which will take a lot longer.

So much for on-premises database housekeeping. It's a chore, and a lot of things can and will go wrong. The cloud... actually makes our lives much easier. Yes, a cloud provider can screw up and have an outage - and they do, occasionally. The important part is that the chance they lose our data is practically zero. Our cobbled-together on-premises solution is usually just good enough, not good.