The Amazon S3 Outage: It’s Not About a Typo

by Tony Martin-Vegue, March 4th, 2017

It happens.

The big tech news story this week was, of course, the Amazon S3 outage, in which portions of Amazon’s cloud were down for the better part of three hours. Companies and consumers are increasingly reliant on S3, so the outage was felt far and wide.

The S3 outage itself didn’t bother me — it was a minor inconvenience. However, I realize that, for some companies, this was a revenue-impacting event. Nevertheless, we should all file this under the “shit happens” category and move on. I’m actually surprised this doesn’t happen more often, given the relative fragility of the Internet.

I was bothered by the way the media picked up on the incident as “one fat-finger that took down S3,” and how the tech and infosec twitterverse piled on and perpetuated this narrative with equal parts schadenfreude and nervous empathy, because we’ve all done something like this before. (Remind me to blog about the time I destroyed three Exchange server databases with no backups.)

Even if the story is true and a typo triggered the outage, the framing is wrong: a typo didn’t bring down S3. What brought it down was a cascade of failures that goes up the chain of management, a series of failed controls that allowed this situation to exist in the first place.

There’s a fundamental design issue if a single command can take down such a significant portion of a company’s customer-facing infrastructure. Separation of duties, among other security principles, comes to mind: critical, revenue-impacting changes should require two or more people.
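
To make that concrete, here is a minimal sketch of what a two-person rule around a destructive command could look like. Every name in it (the `remove_capacity` function, the host threshold) is hypothetical and purely illustrative; it says nothing about how Amazon’s actual tooling works.

```python
# Hypothetical sketch of a two-person rule for destructive commands.
# None of these names come from Amazon's tooling; they are illustrative only.

DESTRUCTIVE_THRESHOLD = 10  # refuse single-operator changes above N hosts


def remove_capacity(hosts: list[str], operator: str, approver: str | None = None) -> None:
    """Remove hosts from a pool, enforcing a second approver for large changes."""
    if len(hosts) > DESTRUCTIVE_THRESHOLD:
        if approver is None or approver == operator:
            raise PermissionError(
                f"Removing {len(hosts)} hosts requires sign-off from a second, "
                "distinct person. Refusing to proceed."
            )
    for host in hosts:
        print(f"[{operator}] decommissioning {host}")  # stand-in for the real work


# A fat-fingered argument that targets far more hosts than intended
# now fails closed instead of executing:
# remove_capacity(all_production_hosts, operator="alice")  # raises PermissionError
```

The point of the design isn’t that a second person catches every typo; it’s that a single mistyped command can no longer reach production at scale on its own.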

A good change control process will have a record of the requested change, including the exact commands that will be used, the risk if a command fails, a rollback procedure, and a backup plan. The “what could go wrong?” portion of the assessment would also identify the possibility of a significant availability event on the S3 infrastructure.
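
As a sketch of what such a record might capture (the field names are my own invention, not from any particular ticketing system):

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Illustrative change-control record; field names are invented for this post."""
    description: str          # what is being changed and why
    commands: list[str]       # the exact commands to be run
    failure_risk: str         # what happens if a command fails or is mistyped
    rollback_procedure: str   # how to undo the change
    backup_plan: str          # what to do if rollback itself fails
    approvers: list[str] = field(default_factory=list)

    def ready_to_execute(self) -> bool:
        # Refuse to run without a rollback path and at least two distinct approvers.
        return bool(self.rollback_procedure) and len(set(self.approvers)) >= 2
```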

If all else fails, risk assessments should identify availability events and quantify their impact. Even if that particular command or associated function isn’t called out, the critical business process itself should be, along with the need for change control, rollback plans, system architecture review, and appropriate management approval of the risk.
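
Quantifying that impact doesn’t need to be elaborate. As a back-of-the-envelope sketch using the classic annualized loss expectancy calculation (all figures below are made up for illustration), it produces a number management can actually approve or reject:

```python
# Back-of-the-envelope annualized loss expectancy (ALE) for an availability event.
# All figures are illustrative; plug in your own estimates.

outage_hours = 3                 # duration of a plausible S3-style event
revenue_per_hour = 250_000       # hypothetical revenue tied to the affected service
single_loss_expectancy = outage_hours * revenue_per_hour      # SLE

annual_rate_of_occurrence = 0.5  # expected events per year (one every two years)
annualized_loss_expectancy = single_loss_expectancy * annual_rate_of_occurrence

print(f"SLE: ${single_loss_expectancy:,}")          # SLE: $750,000
print(f"ALE: ${annualized_loss_expectancy:,.0f}")   # ALE: $375,000
```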

When we think about Information Security, we like to think about the cool, sexy jobs: car hackers, pen testers, people who make ATMs spit money — the fun stuff. Behind the scenes is where we live. We are an army of Information Security governance, compliance and risk professionals, fighting the good fight.

To the person who hit <enter> on that fateful command: it’s OK. You’ll get through this. There are other factors at play here, the least of which is your typo.