Nipping the AWS Outage Outrage in the Bud

In our Twitter-fueled, Facebook-fed society, it’s very easy to get caught up in hype.

So after AWS’s outage last week, it was amazing to see all of the anti-cloud pundits coming out of the woodwork to hold AWS responsible for taking down a majority of the internet.

Pundits said there were too many monopolies. Some global equities firms immediately predicted a 2 percent negative impact on 1Q 2017 AWS revenues.

What? Let’s put this in context.

Who’s to Blame?

The AWS S3 outage happened in one of its availability zones, US East Region. While this is a large zone, there are 18 availability zones and all others were operating normally during the outage.

AWS S3 is designed to deliver 99.99 percent availability, and scale past trillions of objects worldwide. Last week’s outage illustrated that one-in-ten-thousand chance of non-availability. The world is not built for 100 percent availability nor should any company believe that the public cloud automatically provides this.
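To make the 99.99 percent figure concrete, here is a minimal Python sketch that converts an availability target into a downtime allowance. The function name, the 365-day year and the 30-day month are illustrative assumptions, not anything AWS publishes.

```python
def downtime_budget_minutes(availability_percent: float, period_hours: float) -> float:
    """Return the minutes of downtime a given availability target allows in a period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_hours * 60.0


HOURS_PER_YEAR = 365 * 24   # assumption: a non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: a 30-day month

yearly = downtime_budget_minutes(99.99, HOURS_PER_YEAR)
monthly = downtime_budget_minutes(99.99, HOURS_PER_MONTH)

print(f"99.99% availability allows about {yearly:.1f} minutes of downtime per year")
print(f"99.99% availability allows about {monthly:.1f} minutes of downtime per month")
```

Under these assumptions the allowance works out to a little under an hour per year, which is a useful reminder that "four nines" is a design target with a built-in failure budget, not a promise of zero downtime.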

There is no denying that many businesses rely on AWS S3. According to market research firms, S3 is used by nearly 150,000 sites, 120,000 unique domains and has almost 4 trillion pieces of data stored in it. It powers big brand sites like Netflix, Adobe, Spotify, Pinterest, Trello, IFTTT and Buzzfeed, as well as tens of thousands of smaller sites.

But who’s to blame here? Did anyone consider why Amazon’s ecommerce site didn’t go down? Was it because they get preferential treatment or was it because they’ve built their site with redundancy in mind?

The reason is relatively simple: its sites are spread out across a number of geographic zones so an outage in one area doesn’t mean the whole site goes down. If your SaaS provider or application architecture does not provide for redundancy, it’s not really Amazon’s fault. This is common sense.
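As a rough illustration of what building for redundancy can look like in application code, the Python sketch below tries a list of regions in order and serves from the first one that responds. It is not Amazon’s actual design: the region labels, the fetcher functions and the `AllRegionsFailed` exception are all hypothetical, and it assumes the data has already been replicated to a second region.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when none of the configured regions could serve the request."""


def fetch_with_failover(
    key: str,
    region_fetchers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each region in order and return (region, data) from the first success."""
    errors: List[str] = []
    for region, fetch in region_fetchers:
        try:
            return region, fetch(key)
        except Exception as exc:  # deliberately broad: this is only a sketch
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical stand-ins for real storage clients in two regions.
def fetch_from_primary(key: str) -> bytes:
    raise ConnectionError("region unavailable")  # simulate the outage


def fetch_from_secondary(key: str) -> bytes:
    return b"contents of " + key.encode()


served_by, body = fetch_with_failover(
    "logo.png",
    [("primary-region", fetch_from_primary), ("secondary-region", fetch_from_secondary)],
)
print(f"served by {served_by}: {body!r}")
```

The point of the sketch is the shape of the decision, not the mechanics: the application, not the cloud provider, chooses whether a second region exists and whether requests can fall back to it.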

This outage is an indictment, not of AWS, but of business and IT decision makers. As many companies moved their IT applications to the public cloud, cost was a major driver, and unfortunately it led them to abandon traditional architectural thinking about providing redundant data services.

AWS Is Not Alone

Let’s put the 4-hour AWS S3 outage in perspective by looking at other major outages we have seen over the past year.

In September 2016, Microsoft Azure suffered a serious outage due to a spike in network traffic that caused DNS issues resulting in several regions being unavailable. Despite that being the second multi-region outage in a week (Europe was hit the week before), it barely made headlines.

Google’s cloud services were down in April, affecting Compute Engine instances and VPN services in all of its regions. For damage control, Google offered customers a 10 percent discount on their monthly compute charges.

Last March, some Salesforce customers in Europe had to cope with a CRM disruption for up to 10 hours caused by a storage problem across an instance on that continent. And in May, a Salesforce outage wiped out four hours of customer data that took days to fully remediate.

In January 2016, a power outage at a Verizon data center impacted JetBlue Airways operations, delaying flights and sending many passengers scrambling to rebook. The Verizon data center outage impacted customer support systems, including mobile apps, a toll-free phone number, and check-in and airport counter/gate systems.

Our beloved Twitter was down for eight hours in the same month after faulty code was deployed, taking down the Twitter website and mobile apps. Security services expert Symantec experienced a 24-hour outage in April 2016, caused by a database update error, that prevented Symantec clients from administering email and web security services.

And Apple’s outage in June 2016 took some of the tech giant’s popular services offline, including iCloud backup services, the App Store and iTunes.

The Cost of Downtime

Downtime is no joking matter. The estimated cost of downtime to US businesses in 2016 reached $700 billion. Fortune 1000 companies reported between $1.25 billion and $2.5 billion in estimated business losses. The cost of downtime has increased 38 percent since 2010, according to a recent study by the Ponemon Institute.

But outages are not caused by cloud service providers like AWS being unprepared for enterprise availability. Cyber crime is the fastest-growing cause of data center outages, rising from 2 percent of outages in 2010 to 22 percent. Uninterruptible power supply (UPS) failure continues to be the number one cause of unplanned data center outages, accounting for one-quarter of all such events.

IT equipment malfunction accounted for only 4 percent. Water, heat or air conditioning failure accounted for 11 percent of outages, followed by weather at 10 percent and generator failure at 6 percent.

Many of the above scenarios actually occur in private, on-premises data centers, not at cloud service providers.

But in the end, human error still accounts for a majority of all outages. The Feb. 28 AWS outage has been attributed to one of its employees debugging an issue with the billing system and accidentally taking more servers offline than intended. That error started a domino effect that took down many other server subsystems.

Don’t Put Your Head in the Clouds

The public cloud is a fantastic platform offering very compelling cost, elasticity and availability benefits.

But like any infrastructure, it’s not immune to failure. You need to leverage it the right way.

Building resiliency into application architectures is not something that disappears when you use the public cloud. If you don’t prepare for that, you’re likely to lose control of your business when the next outage happens.


The article was originally published on CMS WiRE and is re-posted here by permission.

Frank Palermo

Executive Vice President - Global Digital Solutions, Virtusa. Frank Palermo brings more than 24 years of experience in technology leadership across a wide variety of technical products and platforms. Frank has a wealth of experience in leading global teams in large scale, transformational application and product development programs. In his current role at Virtusa, Frank heads the Global Technical Solutions Group which contains many of Virtusa’s specialized technical competency areas such as Business Process Management (BPM), Enterprise Content Management (ECM) and Data Warehousing and Business Intelligence (DWBI). The group is responsible for creating an overall go-to-market strategy, developing technical competencies and standards, and delivering IP based Solutions for each of these practice areas. Frank also leads an emerging technology group that is responsible for incubating new solutions in areas such as mobile computing, social solutions and cloud computing. Frank is also responsible for overseeing all of the Partner Channels as well as Analyst Relations for the firm. Prior to joining Virtusa, Frank was Chief Technology Officer (CTO) for Decorwalla, an emerging B2B marketplace in the interior design industry, where he was responsible for the overall technology strategy, creative direction, and site development and deployment. Prior to that, Frank was CTO and VP of Engineering for INSCI Corporation, a supplier of digital document repositories and integrated output management products and services. Prior to INSCI, Frank worked at IBM in the Advanced Workstations Division, and took part in the PowerPC consortium with IBM, Motorola and Apple. He was also involved in the design of the PowerPC family of microprocessors as well as architecting and developing a massive distributed client/server design automation and simulation system involving thousands of high-end clustered servers. 
Frank received several patents for his work in the area of microprocessor design and distributed client/server computing. Frank holds a BSEE degree from Northeastern University and completed advanced studies at the University of Texas.
