Netflix and their “Chaos Monkey”
Netflix has been consistently leading the pack when it comes to running a massive application on the cloud. Being out in front, they’ve run into issues people hadn’t really considered yet. One effort that got them a lot of attention was getting every developer to think about what changes when you start building something big on a cloud platform. To keep their engineers on their toes, they created (and blogged about) the “Chaos Monkey”. At its heart it was a script with admin privileges that would intentionally break their system by randomly killing off members of auto-scaling groups. Knowing that the Chaos Monkey might kill off a key component of their application at any time, their architects had to build for it by adding smart redundancies. It paid off through numerous significant AWS outages that left other public cloud users counting their lost dollars while manically refreshing the AWS status page.
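To make the idea concrete, here is a minimal sketch of the "randomly kill a member of each auto-scaling group" behavior described above. It is not Netflix's code: the function names, the `probability` parameter, and the plain dictionary standing in for auto-scaling groups are all assumptions made for illustration, and nothing here talks to a real cloud API.

```python
import random


def pick_victims(groups, probability, rng):
    """For each non-empty group, decide with the given probability
    whether to pick one random member for termination.

    `groups` maps a (hypothetical) group name to a list of instance ids.
    """
    victims = {}
    for name, instances in groups.items():
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def unleash(groups, probability=0.5, rng=None):
    """Simulate one pass of the monkey and return (victims, survivors).

    A real tool would call the cloud provider's terminate API here;
    this sketch only removes the victim from an in-memory copy.
    Instance ids are assumed to be unique within a group.
    """
    rng = rng or random.Random()
    victims = pick_victims(groups, probability, rng)
    survivors = {
        name: [i for i in instances if i != victims.get(name)]
        for name, instances in groups.items()
    }
    return victims, survivors


if __name__ == "__main__":
    demo = {"web": ["i-1", "i-2", "i-3"], "api": ["i-4", "i-5"]}
    killed, remaining = unleash(demo, probability=1.0)
    print("terminated:", killed)
    print("still running:", remaining)
```

The point of the exercise is what happens next: if the application cannot cope with the `survivors` left in any group, the redundancy is not good enough yet.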
One of the greatest things about moving your application into the cloud is how much friction you remove by giving developers direct access to the resources. Unfortunately, you can end up with a whole lot of those resources spoken for but no longer in active use. Netflix came up with a great solution: the “Janitor Monkey”. And the best part is that they’re blogging about it and sharing the code!
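The cleanup idea can be sketched in a few lines as well. This is an illustration only, not the Janitor Monkey's actual rules: the `Resource` fields, the `find_cleanup_candidates` name, and the 30-day idle threshold are all assumptions chosen to show the general shape of "find what is still allocated but no longer used".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Resource:
    """Hypothetical inventory record for a cloud resource."""
    resource_id: str
    last_used: datetime
    attached: bool  # True if something is still actively using it


def find_cleanup_candidates(resources, now, max_idle=timedelta(days=30)):
    """Return ids of resources that are unattached and have been idle
    for longer than `max_idle` (an assumed retention policy)."""
    return [
        r.resource_id
        for r in resources
        if not r.attached and now - r.last_used > max_idle
    ]


if __name__ == "__main__":
    now = datetime(2013, 1, 31)
    inventory = [
        Resource("vol-old", datetime(2012, 11, 1), attached=False),
        Resource("vol-busy", datetime(2012, 11, 1), attached=True),
        Resource("vol-new", datetime(2013, 1, 20), attached=False),
    ]
    print(find_cleanup_candidates(inventory, now))
```

In practice you would probably want to notify the owner and wait a grace period before deleting anything on such a list.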
It’s nice to see these engineers coming up with really smart ways to keep a massive cloud deployment efficient by cleaning up the detritus that can build up over time. It’s especially great to see how much they’re giving back to the community at large – thanks Netflix!