Sure, the cloud is great for managing failure, but what if a whole data center goes down?

Failure in the cloud usually means that an instance disappears. As I've described in the past, this can be due to anything from a physical failure to a problem with the cloud infrastructure itself. Dealing with failure means being able to bring back the entire environment on a different infrastructure, and to do it easily. You'll see this advice written all over the place, but how do you actually do it?

First, let's look at what kinds of failure can occur. After a recent cloud outage, cloud architects filled Twitter with the wise observation that failure can occur at the instance, data center, region, or cloud level. You need to plan for failure at all of these levels.

The instance that's running at the moment might fail, and you should be OK with this: planning for this kind of failure is simply a matter of launching the same kind of instance and installing the required OS, apps, and code to get your site running again.

A failure at this level is annoying but isn't a huge problem. All the network connections, data, and API capabilities still exist where they were before the instance failed, so getting the same environment up and running again doesn't require any change in your plan.
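As a minimal sketch of what that relaunch might look like with the old ec2-api-tools (the AMI ID, key pair, and bootstrap script are all hypothetical), a user-data script can reinstall the apps and code on first boot:

# Relaunch the same kind of instance; the user-data script (-f)
# reinstalls the required packages, apps, and code on boot.
ec2-run-instances ami-12345678 -t m1.small -k my-keypair -f bootstrap.sh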

Failure at the data center level is also dealt with for you by the cloud provider. A provider should be running multiple data centers, so the failure of anything in a single location shouldn't matter. As with an instance failure, a failure at the data center level shouldn't often concern you; the environment in which to re-launch a similar instance is the same.

That said, some cloud providers are smaller, and if a data center fails for them, it's going to have an impact on how you can recover... which leads us to the nastier levels of failure.

A failure at the geographical region level is significantly nastier. This has happened twice for Amazon, and I have experienced problems with other providers.

The first failure AWS experienced was caused by a networking issue (see http://aws.amazon.com/message/65648/), exacerbated by what was termed a re-mirroring storm, but the effect was simple: an entire geographical region was unavailable to AWS customers wanting to launch new instances.

The second occurred in the EU region and was caused by a power failure, but the effect was the same: an entire geographical region was unavailable for some important operations for a time.

Expert cloud architects must plan for this level of failure, but how do they do it? How do you build on top of components that can fail in such a way that the system as a whole never does?

Let's pretend an entire region goes down. This is a failure that many people don't plan for, nor is it one that many are able to plan for when they start out.

Amazon offers Elastic Block Store (EBS) storage volumes that are separate from the instance and, according to Amazon, “persist independently from the life of the instance.” You can use EBS to back up a snapshot of the instance.

Suppose you have an instance running on AWS in the EU. It is EBS-backed, so you can make a snapshot of it to move it.
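Taking the snapshot with the ec2-api-tools looks something like this (the volume ID is hypothetical):

# Snapshot the EBS volume that backs the instance
ec2-create-snapshot vol-1a2b3c4d -d "pre-failover backup"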

Then, to restore, create a new volume from the snapshot, setting the availability zone in which it should be created (note that availability zones carry a letter suffix, and a snapshot can be restored to any zone within its region):

ec2-create-volume --snapshot snap-abc123ab --availability-zone us-east-1a

But this solution wouldn't have helped earlier this year, when a failure occurred in the US East region. To survive a regional failure, always back your data up to a region where it doesn't normally run. Backing up into the same region where your instances are currently running is like keeping your backups on the same disk.
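One hedged sketch of doing this with the aws command-line tools (bucket and file names are hypothetical): create a bucket in a region other than the one you run in and ship your dumps to it.

# Create the backup bucket in a different region
aws s3 mb s3://example-app-backups --region us-west-2
# Push a database dump to it
aws s3 cp backup.sql.gz s3://example-app-backups/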

When your region fails, and your sites are instantly down, what can you do?

You need to launch a new instance, with all your data on it, in a different region. On AWS, you simply switch to another region and start a similarly sized instance, as long as you have made sure that everything you need to run your app exists somewhere else.
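With the ec2-api-tools, switching regions is just a flag. A sketch, assuming your AMI is already available in the target region (all IDs are hypothetical):

# Launch a similarly sized instance in another region
ec2-run-instances ami-87654321 -t m1.large -k my-keypair --region us-west-1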

The checklist is:

  • Make sure the code and data exist outside of the region you're running in. Also, be sure the Amazon Machine Image (AMI) you're using is available in other regions.
  • Setting a low TTL on your DNS can help you make changes in the event of all-out failure (see the sketch after this list), but you might also want to get a third-party load balancer to insulate you from the failure of an entire cloud.
  • Work through a complete list of failure paths, providing solutions for every failure scenario. What happens when an instance fails? Where do you recover it? What happens when the region fails? How do you get the recent data out of it? What happens when the cloud provider fails -- even if it's just their API? How do you get the data out to start somewhere else?
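For the DNS point above, a quick way to sanity-check your TTL is dig; the record here is a hypothetical example of a 60-second TTL:

$ dig +noall +answer www.example.com
www.example.com.   60   IN   A   192.0.2.10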

I've gone from calm, through cautious, to all-out paranoid, and I've used the word "failure" too many times, but if you want your site up all the time, design it paranoid.

Heroku is a platform as a service (PaaS). Instead of all that infrastructure, servers, and complications, you have a platform that is managed for you.

With the move to cloud computing, particularly cloud hosting, developers often opt for infrastructure as a service (IaaS) rather than PaaS. Why would you go PaaS?

The first reason is that you have less to do. The infrastructure and coordination of resources are managed for you, and on Heroku, they’re managed in a particularly cool way. Everything – from the moment you push to their Git repo – is managed for you, removing the need to configure servers, databases, and all that junk.

Heroku started off hosting just Rails, but now it also hosts Java and Node.js. To get started, you sign up on their site and simply – as is always the way with anything Ruby-ish – install a gem:

gem install heroku

Heroku’s CLI tool is where a lot of the magic happens. Start the magic by creating a new project:

heroku create

Log in at https://api.heroku.com/myapps, and you’ll see some insanely named app: vivid-robot-9999. You can look at the URL for the app, but there isn’t anything there to see yet, so you need to push up a Rails app.

Copy the Git URL from the app’s detail page (e.g., git@heroku.com:vivid-robot-9999.git) and clone it to a local repo:

$ git clone git@heroku.com:vivid-robot-9999.git
Cloning into vivid-robot-9999...
warning: You appear to have cloned an empty repository.

Create a Rails app, add it to Git, and push:

cd name-of-project/
rails new .
git add .
git commit -m "Empty rails app"
git push origin master

As you push, you’ll see a number of things that Git doesn’t normally say:

-----> Heroku receiving push
-----> Ruby/Rails app detected
-----> WARNING: Detected Rails is not declared in either .gems or Gemfile
       Scheduling the install of Rails 2.3.11.
       See http://devcenter.heroku.com/articles/gems for details on specifying gems.
-----> Configure Rails to log to stdout
       Installing rails_log_stdout... done

-----> Installing gem rails 2.3.11 from http://rubygems.org
       Successfully installed activesupport-2.3.11
       Successfully installed activerecord-2.3.11
       Successfully installed actionpack-2.3.11
       Successfully installed actionmailer-2.3.11
       Successfully installed activeresource-2.3.11
       Successfully installed rails-2.3.11
       6 gems installed
-----> Compiled slug size is 4.7MB
-----> Launching... done, v4
       http://vivid-robot-9999.heroku.com deployed to Heroku

Rails is a great fit for PaaS because Rails apps are inherently self-contained: dependencies are declared in the Gemfile, from which everything can be installed. Once you push to Heroku, all required gems are installed, and the app is deployed.
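For example, a minimal Gemfile might look like this (the version is illustrative):

$ cat Gemfile
source 'http://rubygems.org'
gem 'rails', '3.0.9'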

Heroku creates a pipeline for the code from the moment you push, so when it finishes, your app is deployed. If you have ever had to deal with pulling code onto other servers, configuring them, and wasting all that time, this is going to look like magic.

Hit the URL of your app,

http://vivid-robot-9999.heroku.com/

and, bam, you've deployed your app.

Because Heroku integrates so nicely with Rails – or, more accurately, with the Git process – some Rails projects integrate with it directly. The Rails CMS Refinery, for example, has hooks to push straight to Heroku.

Finally, some nice add-ons provide services for Heroku, such as MongoDB, Redis, on-demand video encoding, and cloud DNS. Just like the underlying web servers, proxies, and all that, you don’t manage these yourself. The working platform is what comes as a service.
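Add-ons are attached from the same CLI. As a sketch – the add-on name is one example from the catalog at the time of writing:

# Attach a small Redis instance to the app
heroku addons:add redistogo:nano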

If you have a Rails project, push it to Heroku and see if it’ll work there, because it could save you a load of sysadmin work.

But what do you lose? Why wouldn’t you use PaaS?

PaaS can be very limiting. The problem lots of Java and Python developers had with Google App Engine (and this is from bar conversations rather than from trawling the web) is that simple things like the filesystem aren’t there anymore. You can’t reconfigure the web server. All those niceties are gone.

But this is a good thing. Even if you don’t use a PaaS, such as Heroku or App Engine, in your day job, experiencing the efficiency of deployment is worth the investment. All code pipelines should be a matter of push-and-deploy, but they often aren’t, simply because you have access to the config, the filesystem, and all those old reliable bits and pieces.
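Even on a plain server you can get close to push-and-deploy. A minimal sketch of a Git post-receive hook (paths are hypothetical, and the restart line assumes Passenger):

#!/bin/sh
# hooks/post-receive on the server: check out whatever was pushed
GIT_WORK_TREE=/var/www/myapp git checkout -f master
# Tell Passenger to restart the app
touch /var/www/myapp/tmp/restart.txt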

How does PaaS fit in with your stack? Could it replace those big servers you love so much?

Dan Frost is Technical Director of 3ev.com, cloud hosting consultants and web developers based in London and Brighton, UK.

Dan has been building cloud hosting, writing, and talking about the cloud since before it was trendy. Since he spun up his first AWS instance, he's been trying out new services and finding ways of getting more out of hardware without actually owning any of it.

http://www.3ev.com/expertise/dan-frost/
