The Amazon EC2 Failure in perspective

Tuesday, April 26, 2011 » Amazon, Cloud, EC2, IaaS

The recent failure of Amazon's EC2 service has caused widespread naysaying and "told you so's" from many of Cloud Computing's most vocal critics.  The failure, which first appeared on the evening of April 21st and continued into part of the next day, took down many customers, including a number of high-profile Internet companies such as FourSquare, Hotpads, Reddit, and Quora.  The outage was apparently caused by a failure of the Elastic Block Store (EBS) system in the Virginia datacenter.  Although CNET initially reported that the failure was caused by trojan horses and power outages, it appears that EBS itself failed by performing runaway replications that consumed both storage and network bandwidth.  The failure also appears to have spilled over into multiple "Availability Zones" within the Amazon EC2 system.

While many people are screaming "We told you so", "The Cloud isn't ready for Primetime", and "Let's rethink Cloud Strategies" in their blogs and articles, I believe the hysteria is misguided.  One well-respected blogger even went as far as to say "Amazon's EC2 Outage Proves Cloud Failure Recovery is a Myth."  With all due respect (he's normally brilliant; I just happen to disagree with him this one time), he and many others are flat-out wrong.  The error in their thinking lies in their expectations of Amazon EC2.

<!--break-->

Amazon EC2 provides a fairly high degree of reliability, BUT that doesn't mean they'll never suffer any outages or unavailability (read your contract!).  Amazon's SLA (Service Level Agreement) provides for a 10% service credit if uptime falls below 3-nines (99.9%), which, on an annual basis, allows 8.76 hours of downtime.  If uptime falls below 2-nines (99%, or 3.65 days of outage annually), they will provide a 25% service credit.  Amazon calculates uptime monthly, bases its SLA on monthly reliability, and provides service credits for the applicable billing cycle.  Amazon's goal is to reach or exceed 3-nines, and while they use some fault-tolerant components to get there, they never claim that applications or individual virtual machines running on their platform are highly reliable or fault-tolerant.
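
To make the arithmetic concrete, here is a small Python sketch that converts an uptime percentage into a downtime allowance.  The function name and the 30-day month are my own illustrative assumptions; nothing here comes from Amazon's SLA text.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted at a given uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for uptime in (99.9, 99.0):
        yearly = allowed_downtime_hours(uptime)
        # Assumption: a 30-day month (720 hours) as the billing period.
        monthly = allowed_downtime_hours(uptime, period_hours=30 * 24)
        print(f"{uptime}% uptime: {yearly:.2f} h/year, {monthly:.2f} h per 30-day month")
```

At 99.9% that works out to about 8.76 hours a year, or roughly 43 minutes in a 30-day month; at 99% it is 87.6 hours (3.65 days) a year.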

It's worth noting that every data center I've ever known has had some degree of failure.  Whether it's routing issues, or power (even with generators: I know a datacenter that was taken down by a "transfer switch" failure while trying to switch over to generators), or heating and cooling, once every few years (in the better ones; more often in others) something goes wrong.  The key to ensuring that your application doesn't go down is... architecture!

Within the Amazon EC2 environment it's possible to create fault-tolerant systems.  In fact, Amazon provides some great tools to support doing so (such as load-balancing, multiple data centers, dynamic DNS, etc.).  Creating fault-tolerance requires forethought, architectural savvy, and yes, spending money on the resources.
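
The core idea behind those tools can be sketched in a few lines of Python: keep copies of the application in more than one location, check each one's health, and send traffic to one that is still up.  This is only an illustration of the principle, not how Amazon's load balancers or DNS actually work; the function name, the endpoint hostnames, and the stand-in health check are all hypothetical.

```python
def pick_healthy_endpoint(endpoints, is_healthy):
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # Treat a health check that raises as "unhealthy" and move on.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical endpoints, one per data center; the names are placeholders.
    endpoints = ["app.zone-a.example.com", "app.zone-b.example.com"]

    # Stand-in health check: pretend the first location is down.
    down = {"app.zone-a.example.com"}
    chosen = pick_healthy_endpoint(endpoints, lambda e: e not in down)
    print(chosen)
```

Note that this only helps if the second location is already provisioned and running, which is exactly where the "spending money on the resources" part comes in.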

While many customers were "taken down" by the failure of Amazon EC2 (including, I might add, this blog, which is running on a machine within EC2), there were some who had the forethought to create fault-tolerant systems.  These sites continued to run just fine.  Two of my own customers (and I've heard of several others) survived the failure with no apparent interruption in their operations.

Another point worth noting is that Amazon runs a pretty tight ship when it comes to their datacenters and Cloud infrastructure.  I'm sure there will be a very thorough post-mortem examination of the circumstances that led to the failure.  I would strongly suspect Amazon will make changes, either in architecture or operating procedures, to ensure that this type of failure doesn't occur again.

The real lesson to be learned here is that even in the Cloud, fault-tolerance needs to be architected into the equation.  Sites that failed "because of the failure of Amazon EC2" probably could have kept running, if only they were architected correctly.
