Amazon Web Services (AWS) suffered one of the biggest outages in cloud computing to date, with data housed in a single availability zone knocked out for more than 24 hours in some cases. Many people are pointing to this as proof that cloud computing has failed. Some have suggested that the traditional servers-in-a-rack model is safer. These people apparently don’t understand the flexibility of cloud computing.
The way AWS works, you can easily replicate your environment across multiple regions and availability zones to prevent the kind of catastrophic downtime we saw this week. If you opt not to deploy across regions and availability zones, you are effectively gambling that Amazon won’t have an outage like this one. In this particular outage, merely spanning multiple availability zones was not enough, because AWS had issues across the entire US East region. For more complete redundancy, it’s important to look at an application architecture that spans both availability zones and regions, so that when US East has a problem, your app can continue to run from California, Ireland, or one of the other regions. Those who experienced the full extent of the downtime took the same risk as IT shops running servers in a single data center. In both models (cloud and servers-in-a-rack), if you don’t have a distributed infrastructure, you’re safe only until your data center becomes unavailable for some reason. In the case of AWS, the data center is the set of availability zones that went unresponsive.
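To make the zone-versus-region distinction concrete, here is a minimal sketch in plain Python. It does not call any AWS API; the `deployment` list and both helper functions are hypothetical names made up for illustration, and the region and zone labels are just sample values. It simply asks whether a given footprint would still have something running after losing one availability zone or one whole region.

```python
# Hypothetical inventory for illustration only: one (region, zone) pair
# per running instance. Nothing here talks to AWS.
deployment = [
    ("us-east-1", "us-east-1a"),
    ("us-east-1", "us-east-1b"),
]

def survives_zone_outage(instances):
    """True if some instance is left after losing any single availability zone."""
    zones = {zone for _, zone in instances}
    return len(zones) > 1

def survives_region_outage(instances):
    """True if some instance is left after losing any single region."""
    regions = {region for region, _ in instances}
    return len(regions) > 1

print(survives_zone_outage(deployment))    # two zones, so a zone outage is survivable
print(survives_region_outage(deployment))  # one region, so a regional outage is not

# Adding a single instance in a second region changes the second answer.
wider = deployment + [("us-west-1", "us-west-1a")]
print(survives_region_outage(wider))
```

The first deployment is the one that got caught this week: redundant across zones, but with everything still inside one region.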
Anyone with redundancy in place in another region’s availability zone merely needed to fail over to that zone and continue on with minimal downtime. With even more fault tolerance in place, you could have an instance pool that spans availability zones, meaning an outage might cause existing sessions to fail for some users, but on the whole almost no one would know you went down at all.
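The failover idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular tool does it: the endpoint hostnames are invented, `pick_endpoint` is a hypothetical helper, and the health check is passed in as a function so the example runs without touching a network. In a real setup that check would be an HTTP probe or a monitoring signal.

```python
# Hypothetical endpoints in priority order: primary region first,
# standby regions after it. The hostnames are made up for illustration.
ENDPOINTS = [
    "app.us-east.example.com",
    "app.us-west.example.com",
    "app.eu-west.example.com",
]

def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint, in priority order, that passes the health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint in any region")

# Simulate the primary region being unreachable.
down = {"app.us-east.example.com"}
active = pick_endpoint(ENDPOINTS, lambda endpoint: endpoint not in down)
print(active)  # falls through to the first standby region
```

With the primary marked as down, traffic moves to the next region in the list; when every region fails the check, the function raises rather than silently returning a dead endpoint.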
This is the beauty of cloud computing. You can easily have as much infrastructure running as you need, or keep plenty on standby if your budget doesn’t allow for all those instances to run continuously. Cloud computing in general, and the AWS infrastructure in particular, makes uptime and meeting an SLA a software problem instead of a hardware problem. Developers control how much or how little availability they build for within the AWS construct. That’s a paradigm shift.
Writing the code to span availability zones might seem like a headache, which is where companies like RightScale come in: they make the architecture piece of configuring AWS a breeze. With RightScale, you can pool your resources across regions and zones and make them talk to each other without needing to start from scratch each time.
For those companies that got hit by the AWS outage, I feel your pain. I’ve lived through downtime and would like to avoid ever experiencing it again. AWS is currently the best solution for avoiding that downtime, but you have to make use of all the tools Amazon provides for it to work. The AWS outage wasn’t a failure of the cloud; it was a failure of development teams to build in the redundancy available to them.