Killing Instances with Chaos Monkey

06/28/2011 08:53 pm

To kick off this series, I thought I'd look at how you go about breaking your cloud setup once everything seems to be running nicely. Confidence grows with testing, and in cloud hosting -- and any service-based architecture -- confidence in the service grows every time you see servers come back from the dead.

If you're deploying your first cloud setup, how do you know that your cloud architecture actually works, and how do you know when you've finished? How do you know that your setup is truly resilient rather than just a bit scalable?

If you're considering cloud for dev, test, or production, the first thing you need to think about is: what happens when something fails? It's a common misconception that resilience and scalability come built in simply because you're using the cloud. You don't get a hugely scalable site just by pushing your code to CloudServers or EC2.

Resilience is built in at the hardware level in most cloud offerings. Disks fail, network connections drop out, and through the magic of your cloud provider, you don't notice. But it's your job to assume failure at the levels of device, data-center, region, and internet and to use the cloud APIs to build a system that self-heals.

Challenge every part of your architecture and know what happens if it fails. That's one of the great benefits of anything cloud-based: you can trial new configurations and test scenarios very cheaply.

There are some tools and techniques to help with this.

Chaos Monkey tests your cloud by randomly taking down servers. Run it on your AWS account at your own risk. Read the disclaimer. But do run it so you can confidently say "if everything crashes, it'll be back in 2 minutes with no help from me."

Chaos Monkey is a surprisingly good simulation of quite a lot of normal behaviour: instances run slowly and become unresponsive, some unusual event chews up all your inodes, or someone deploys an insane, badly written script that takes down the database.

Once installed, you run Chaos Monkey like: 

> ChaosMonkey -l=output.txt -e=US-East -a=YOUR-ACCESS-KEY -s=YOUR-SECRET-KEY -t=chaos -v=1

What follows is the random killing of instances, and the unearthing of every assumption baked into your architecture.

Chaos Monkey only kills instances that have the tag chaos=1, which means you can limit the damage it can cause. If you wanted to script this yourself and do something specific -- say, kill each database server every time it comes back (there's a sketch of that further down) -- then it only takes a few lines to throw something together in Ruby. The following script does some of what Chaos Monkey does:

First install the fog gem with

gem install fog

Then set your credentials in the script and run it... remember, it is designed to break your servers.

require "fog"
  connection = Fog::Compute.new(:provider => 'AWS',
    :aws_access_key_id => 'XXXXXXXXX',
    :aws_secret_access_key => 'XXXXXXXXXXXXXXXXXXXXXXXXXXX',
    :region => 'us-east-1'
  )
 candidates = []
 connection.servers.all().each do |i|
   if i.tags["chaos"].to_i == 1
     candidates.push i
   end
end
puts "We have #{candidates.size} candidates. We're going to KILL these instances. If stuff breaks, it is not my fault. Ok?.. type 'ok'."
  input=gets
if input.strip != 'ok'
  puts "Maybe next time"
  exit
end
# Pick off just one instance and kill it. You can run this as many times as you like to kill more.
kill_instance = candidates.shuffle.first
puts "Killing #{kill_instance.id}"
kill_instance.destroy
puts "Check your app... is it still there?"

Suppose you want to kill only the smaller instances, because you think the new version of your app uses too much memory. Just change the selection loop to:

connection.servers.all.each do |i|
  if i.tags["chaos"].to_i == 1 && i.flavor_id == 't1.micro' # Only kill micros
    candidates.push i
  end
end
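
And if you want the "kill each database server every time it comes back" behaviour I mentioned earlier, a loop like this is one way to sketch it. The role=database tag and the 60-second interval are my own assumptions here -- use whatever tags your instances actually carry.

# Keep killing anything tagged as a database whenever it reappears.
loop do
  victims = connection.servers.all.select do |i|
    i.tags["role"] == "database" && i.state == "running" # role tag is hypothetical
  end
  victims.each do |i|
    puts "Killing #{i.id}"
    i.destroy
  end
  sleep 60 # check again in a minute for replacements coming back up
end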

You know your architecture is good if you can kill instances, wait a moment, and everything is back as if by magic. New instances must start up and configure themselves, and the app must be there as if nothing happened.

If you don't feel comfortable unleashing the monkey on your setup, you should at least kill servers on a regular basis, even if you do this out of hours when no-one is looking.
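
If you do schedule it, a small guard keeps the carnage to a quiet window. The 1am-5am window here is just an assumption; pick your own.

# Only run the kill script during an assumed quiet window.
hour = Time.now.hour
unless (1..5).include?(hour)
  puts "Outside the quiet window -- not killing anything today."
  exit
end
# ...then fall through to the kill script above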

Let's walk through a fairly basic AWS setup. Suppose RDS (Relational Database Service) provides MySQL, a bank of EC2 instances runs your app behind an Elastic Load Balancer (ELB), and that in turn sits behind another EC2 instance acting as a caching server.

Can you spot the points of failure? All the EC2 instances, the RDS instance, and the load balancer. The EBS virtual disks can fail too, as anyone who experienced the AWS outage will know.
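
fog makes that kind of audit easy as well. A quick sketch like this lists every running instance with its size, availability zone, and tags, so you can see at a glance whether everything is piled into one zone:

# List running instances so you can spot where the single points of failure are.
connection.servers.all.each do |i|
  next unless i.state == "running"
  puts "#{i.id}  #{i.flavor_id}  #{i.availability_zone}  #{i.tags.inspect}"
end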

When any EC2 instance goes down, you want the rest to pick up the load and the cloud config to repair itself automatically. By killing instances in your cloud setup, and by shutting down services on the instances, you can simulate lots of things going wrong. You'll find some nasty condition that causes the servers to lock up, and so you fix it... with APIs.
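
Terminating isn't the only weapon: stopping a service in place is often a nastier test, because the instance stays up while the app on it dies. A minimal sketch, assuming you've configured SSH access on the fog server object and that the service in question is MySQL (both assumptions):

# Stop a service on a tagged instance instead of terminating it.
victim = connection.servers.all.find { |i| i.tags["chaos"].to_i == 1 && i.state == "running" }
abort "No tagged instances running" unless victim
victim.username         = "ubuntu"              # assumption: an Ubuntu AMI
victim.private_key_path = "~/.ssh/your-key.pem" # assumption: your SSH key
victim.ssh("sudo service mysql stop")           # assumption: MySQL runs on this box
puts "Stopped mysql on #{victim.id}... does the app notice?"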

The first rule of cloud architecture is to allow for failure, and you allow for it by having every part of the architecture programmatically managed: there's an API for everything. When something fails, which APIs can you use to fix your cloud? If you can't fix it with APIs, you need to rework your architecture.
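
For example, the same fog connection that kills an instance can launch its replacement. A sketch only -- the AMI is a placeholder, and I'm assuming fog applies the tags for you on create:

# One call to the API starts a replacement; in real life your config tooling would do this.
replacement = connection.servers.create(
  :image_id  => 'ami-xxxxxxxx',      # placeholder: your baked app image
  :flavor_id => 't1.micro',
  :tags      => { 'chaos' => '1' }   # assumption: tags are applied on create
)
replacement.wait_for { ready? }
puts "Replacement #{replacement.id} is up"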

Next time we'll look at some of the magic configuration tools that make this kind of self-repair possible.
