
How Amazon broke the internet: a no-nonsense guide for enthusiasts without an engineering background.

by Rohit Sharma, March 5th, 2017


Well, it all started on the morning of 28th February, 2017. It was a normal day for engineers at Amazon.

The S3 billing system had become very slow, and an engineering team started debugging the issue as a routine maintenance activity, to give users a better experience.

Amazon always strives for a customer-oriented experience.

PS: I am really very sad for whatever happened to such a flawless company.

While debugging the issue, they had to take some servers (the ones responsible for S3 billing) offline, so that they no longer served live traffic.

Amazon S3 is a highly scalable, reliable, fast and inexpensive data storage infrastructure that Amazon uses to run its own global network of websites.

For example, Instagram uses Amazon services to function, so most likely all of the data Instagram has is stored with the Amazon S3 service.

What does taking servers offline mean?

Say the S3 service is served by 100 servers and Amazon had to take some of them offline (say 5 servers). They wanted the load to be distributed across the remaining 95 servers.

For all the non-tech readers, please have a look at the video that explains how load is distributed across multiple servers to keep a site up and running.

As in the above video, Amazon may have hundreds of servers balanced by a load balancer. They had to take some of those servers offline.
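To make the 100-versus-95 example concrete, here is a minimal sketch in Python of a round-robin load balancer. It is only an illustration and not how Amazon's load balancers actually work: the server names, the request count of 9,500 and the `distribute` function are all made up for this example.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, servers):
    """Hand out requests to servers one by one, in turn, and count them."""
    counts = Counter({server: 0 for server in servers})
    rotation = cycle(servers)  # endlessly loops over the server list
    for _ in range(requests):
        counts[next(rotation)] += 1
    return counts


# Hypothetical fleet of 100 servers (names are made up).
all_servers = [f"server-{i}" for i in range(1, 101)]

# Take the first 5 servers offline; 95 stay online.
online = all_servers[5:]

before = distribute(9500, all_servers)  # 100 servers share the work
after = distribute(9500, online)        # 95 servers share the same work

print("requests per server with 100 online:", max(before.values()))
print("requests per server with 95 online:", max(after.values()))
```

With the same traffic, each remaining server simply gets a slightly bigger share. That is fine when only a few servers go offline; it becomes a problem when far too many are removed at once.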

But due to a typo, the person executing the command mistakenly took far more servers offline than intended.

Just for information (you can skip this):

In Unix, if you want to remove all files from the current directory, the command looks like

rm -rf .

( . means the current directory)

However, if you do something like

rm -rf /

( / means the root directory, the top of the whole file system)

This would remove all the files on the entire machine, not just in one directory, if you have root permissions.

Something similar, with different commands, may have happened while removing servers from the online set.

It was just a small typo, but it had a devastating effect on the internet.

This large unintentional removal took two S3 subsystems down.


Very soon this turned into a disaster, as the unavailability of the storage service brought down many other Amazon services.

Just for information:

Amazon Web Services is used by Instagram, Vine, IMDb, Trello, Quora, IFTTT, Medium, Splitwise, websites built with wix.com, and many others.

Alexa was struggling to stay online, too.

Amazon S3 is used by around 148,213 websites, and 121,761 unique domains, according to data tracked by SimilarTech.

Nest’s app was unable to connect to thermostats and other devices for a period of time as well, which effectively broke many people’s home appliances.

The blow was so bad that Amazon’s own status dashboard was also down, so estimating the extent of the issue and evaluating the current status was impossible. As a result, Amazon couldn’t say when the service would be back up.


Amazon S3 is the backbone for all of its services as shown above.

Story Continues …

So, to get the Amazon service up and running again, they had to restart all the servers that were taken offline.

Amazon's services have done really well in the past few years and have gained a lot of subscribers. Their infrastructure has grown significantly.

Amazon had not restarted these servers for many years, and hence restarting them took longer than expected.

Official apology from Amazon

“We want to apologize for the impact this event caused for our customers. We will do everything we can to learn from this event and use it to improve our availability even further.”

What did Amazon do to prevent this in the future?

They have a tool to take servers offline. Amazon modified this tool to remove capacity more slowly and added checks that prevent capacity from being removed below the minimum level required by any subsystem.
They are adding the same check to all other places where a similar mistake could cause an outage.
Over the past few years, the S3 team has been refactoring the service into multiple independent parts so that it can recover faster in such situations. Even so, bringing the affected parts back up took a long time here.
The S3 team had planned further partitioning of the index subsystem and is now prioritizing that work so that such restarts are faster in the future.
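The first safeguard in the list above can be sketched in a few lines of Python. This is not Amazon's real tool, which is not public: the function `remove_capacity`, the `CapacityError` exception and the limits used here (a minimum of 80 servers, at most 10 removed per command) are assumptions chosen purely for illustration.

```python
class CapacityError(Exception):
    """Raised when a removal request would be unsafe."""


def remove_capacity(online, to_remove, min_required, max_per_step):
    """Return the servers left online after a guarded removal.

    online       -- names of servers currently serving traffic
    to_remove    -- names of servers the operator asked to remove
    min_required -- fewest servers the subsystem needs to stay healthy
    max_per_step -- most servers a single command is allowed to remove
    """
    # Check 1: remove capacity slowly, only a few servers per command.
    if len(to_remove) > max_per_step:
        raise CapacityError(
            f"asked to remove {len(to_remove)} servers, "
            f"but only {max_per_step} are allowed per command"
        )

    removal_set = set(to_remove)
    remaining = [server for server in online if server not in removal_set]

    # Check 2: never drop below the minimum required capacity.
    if len(remaining) < min_required:
        raise CapacityError(
            f"only {len(remaining)} servers would remain, "
            f"but {min_required} are required"
        )
    return remaining


# Hypothetical fleet of 100 servers (names and limits are made up).
fleet = [f"server-{i}" for i in range(1, 101)]

# Intended command: remove 5 servers. This passes both checks.
fleet_after = remove_capacity(fleet, fleet[:5], min_required=80, max_per_step=10)
print(len(fleet_after), "servers still online")

# Mistyped command: remove 60 servers. The tool refuses.
try:
    remove_capacity(fleet, fleet[:60], min_required=80, max_per_step=10)
    typo_blocked = False
except CapacityError as error:
    typo_blocked = True
    print("refused:", error)
```

With checks like these, a typo that asks for too many servers produces an error message instead of an outage.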

What did this outage cost Amazon?


That’s a huge loss! I pray for the guy who executed this command ;-)


Indeed, Amazon is flawless when it comes to the cloud services it offers. It has given tough competition to Google Cloud and Microsoft Azure.

I feel really bad about what happened, but perhaps this will help Amazon build a more robust system, and so it may emerge as the number one player in cloud for years to come.