Coping with Regional Failure
10/12/2011 07:15 pm
Sure the cloud is great for managing failure, but what if a whole data center goes down?
Failure in the cloud usually means that an instance disappears. As I've described in the past, this can be due to anything from a physical failure to a problem with the cloud infrastructure itself. Dealing with failure means you have to be able to bring back the entire environment on a different infrastructure, and do it easily. You'll see this written all over the place but how can you do it?
First, let's look at what kinds of failure can occur. After a recent cloud outage, cloud architects filled twitter with the the wise observation that failure can occur at the instance, data-center, region, or cloud level. You need to plan for failure at all these levels.
The instance that's running at the moment might fail, and you should be OK with this, but planning for this kind of failure should be a matter of launching the same kind of instance and installing the required OS, apps, and code to get your site running again.
A failure at this level is annoying but isn't a huge problem. All the network connections, data, and API capabilities still exist where they were before the instance failed, so getting the same environment up and running again doesn't require any change in your plan.
The data center level is also dealt with for you by the cloud provider. They should be running multiple data-centers, and failure of anything in a single location shouldn't matter; as for the instance failure, a failure at the data-center level shouldn't often concern you. The environment to re-launch a similar instance is the same.
That said, some cloud providers are smaller, and if a data center fails for them, it's going to have an impact on how you can recover... which leads us to the nastier levels of failure.
A failure at the geographical region is significantly nastier. This has happened twice for Amazon, and I have experienced problems with other providers.
The first failure AWS experienced was caused by a networking issue
(see http://aws.amazon.com/message/65648/)
exacerbated by what was termed a re-mirroring storm, but the effect was simple: an entire geographical region was unavailable to AWS customers wanting to launch new instances.
The second occurred in the EU region and was caused by a power failure, but the effect was the same: an entire geographical region was unavailable for some important operations for a time.
Expert cloud architects must plan for this level of failure, but how do they do it? How do you build on top of things that might fail to the point that the system never fails?
Let's pretend an entire region goes down. This is a failure that many people don't plan for, nor is it one that many are able to plan for when they start out.
Amazon offers Elastic Block Store (EBS) storage volumes that are are separate from the instance and, according to Amazon “persist independently from the life of the instance.” You can use EBS to back up a snapshot of the instance.
Suppose you have an instance running on AWS in the EU. It is EBS-backed, so you can make a snapshot of it to move it.
First, set the availability zone in which the new volume must be created:
ec2-create-volume --snapshot snap-abc123ab --availability-zone us-east-1
But this solution wouldn't have helped earlier this year, when a failure occurred in the US East region. To prevent a regional failure, also always back your data up to a region where it doesn't run normally. Backing up into the same region where your instance are currently running is like keeping your backups on the same disk.
When your region fails, and your sites are instantly down, what can you do?
You need to launch a new instance with all your data on it in a different region. On AWS you simply switch to another region and start a similarly size instance, as long as you have made sure that everything you need to run your app exists somewhere else.
The checklist is:
- Make sure the code and data exist outside of the region you're running in. Also, be sure the Amazon Machine Image (AMI) you're using is available in other regions.
- Setting a low TTL on your DNS can help you make changes in the event of all out failure, but you might also want to get a third-party load balancer to insulate you from the failure of an entire cloud.
- Work through a complete list of failure paths, providing solutions for every failure scenario. What happens when an instance fails? Where do you recover it? What happens when the region fails? How do you get the recent data out of it? What happens when the cloud provider fails -- even if it's just their API? How do you get the data out to start somewhere else?
I've gone from calm, through cautious, to all out paranoid, and I've used the word "failure" too many times, but if you want your site up all the time, design it paranoid.