The major Amazon Web Services outage that began this past Thursday morning was unlike anything before it. Countless AWS customers, big and small, went down, many for days. Surprisingly, some big names like Netflix, SmugMug, and Twilio had little or no disruption. One hungers to know why...
Over the weekend, George Reese, a cloud expert and author (and CTO of cloud-management tools company enStratus), wrote a fascinating post on O'Reilly about what some would call a cloud disaster -- titling it, ironically enough, "The Cloud's Shining Moment." George has a unique perspective on the cloud, and a large following. His post got huge play, and that continues -- so I messaged him on Twitter and set up a coffee so I could interview him Monday morning. I was eager for him to elaborate on his post and share more of his thoughts, now that the outage is (mostly) behind us.
Click on the link below to hear the whole chat. What follows here are some snippets from that 30-minute conversation (it was recorded in a busy coffee shop, so there's background noise, but you can hear us fine):
• Thursday at 3:00 am: "We knew something significant was going down."
• What happened, who was affected, and why.
• What about SLAs? "They're not an insurance policy, they're a refund policy... SLAs are a joke."
• The "Design for Failure" approach vs. traditional application architecture gives you "control over your own destiny."
• Why the AWS outage was a shining moment: it's about learning what you can do in the face of an event like this. "So many survived."
• The "cloud haters" came out after the O'Reilly post. Flame wars erupted in the comments. George pre-empted what they thought was, ahem, their shining moment... :-)
• In large corporations, the "Department of No" is the real problem.
• George guarantees that CIOs who say their companies are not in the cloud actually are, and just don't know it. Many others realize the cloud "genie is out of the bottle," and are now coming to his firm to be their window into what's really going on in the cloud.
• George's company now makes it possible to do "cross-cloud" backup and disaster recovery. Customers can do not only automated DR, but automated DR testing, too.
• He says his company is at "the most important point" in its life and in the evolution of the cloud. In the last six months, "enterprise has gotten it." He noted that he has never spoken to as many Fortune 100 companies as he has in the past week.
Two other excellent blog posts we touched on that came out over the weekend:
• "How SmugMug survived the Amazonpocalypse," by Don MacAskill, Cofounder & Chief Geek
• "Seven lessons to learn from Amazon's outage," by Phil Wainewright, ZDNet
UPDATE: Here's another good one:
• "An unofficial EC2 outage postmortem - the sky is not falling," on the CloudHarmony Blog (caution: you have to really want to take a deep dive into cloud storage)
(Here's more about my interview subject: George Reese has been delivering software as a service since 2003, when he founded Valtira, a suite of web-based marketing tools. Prior to Valtira, George held a variety of technology leadership roles with J. Walter Thompson, Carlson Marketing Group, and startups Ancept and Imaginet. George is the author of several O'Reilly books on Internet and enterprise technologies, including "Java Database Best Practices," "Managing and Using MySQL," and the recently released "Cloud Application Architectures." He has an MBA from the Kellogg School of Management at Northwestern University and a B.A. in Philosophy from Bates College in Lewiston, ME. Follow him on Twitter @georgereese.)
Full Disclosure: As mentioned during the recorded interview, the writer had a consulting relationship with enStratus in 2009.