Critical Amazon cloud services went down earlier this week, disrupting a number of web sites, including Quora, Spotify, Netflix, Slack, Pinterest, Buzzfeed, Trello and IFTTT. isitdownrightnow, the web site used to check whether other web sites are down, also went down because of the disruption. Amazon's own e-commerce site was not affected by the outage.
A post-mortem conducted by Amazon has revealed the root cause of the issue. A single wrong command executed by an employee caused a cascading effect that took down a number of other services. An employee of the Amazon Simple Storage Service (S3) team was conducting a routine debugging operation, investigating an issue where the S3 billing system was running at a snail's pace. The employee meant to execute a command that would take down a small number of servers that handled the S3 billing process.
One of the inputs to the command had a typographical error, and as a result a larger number of servers were taken down than intended. An index subsystem that contained the metadata and tracked all objects in S3 went down. The placement subsystem, which depended on the index subsystem to work properly, also went down. Together, the disruptions meant that Amazon could no longer serve API requests from clients in the Northern Virginia region, designated as US-EAST-1.
After the initial disruption, Amazon took four hours to get all the systems back up and running. The S3 subsystems had not been restarted in years, and had expanded considerably since the last time they were restarted. The checks to make sure everything was working properly, as well as the work of catching up on the backlog of requests received during the outage, meant that the systems took longer to recover than anticipated. Amazon has announced that it is taking steps to make sure such a disruption does not happen again.
Automatic checks are now in place so that capacity is removed slowly and cannot be taken below a minimum threshold. An incorrect command entered in the future will not be able to disrupt Amazon services in the same way. Amazon is auditing its other systems to ensure that similar checks are in place for all services. Amazon also partitions its services into small cells to limit the impact that disruptive events can have, and it intends to break these cells down into smaller pieces later this year.
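The kind of safeguard described above, where capacity comes out in small steps and never drops below a floor, can be illustrated with a short sketch. This is not Amazon's tooling, whose details have not been published. The `Fleet` class, its fields, the `CapacityError` exception and all the numbers are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal request would breach the minimum capacity."""


@dataclass
class Fleet:
    # All three fields are illustrative assumptions, not real S3 settings.
    active: int      # servers currently in service
    min_active: int  # floor that capacity must never drop below
    max_step: int    # most servers that may be removed in a single step

    def plan_removal(self, requested: int) -> List[int]:
        """Split a removal request into small steps, or refuse it outright."""
        if requested < 0:
            raise ValueError("requested must be non-negative")
        if self.max_step <= 0:
            raise ValueError("max_step must be positive")
        # Reject the whole request up front if it would go below the floor,
        # so a mistyped number cannot take out most of the fleet.
        if self.active - requested < self.min_active:
            raise CapacityError(
                f"removing {requested} of {self.active} servers would leave "
                f"fewer than the minimum of {self.min_active}"
            )
        steps: List[int] = []
        remaining = requested
        while remaining > 0:
            step = min(self.max_step, remaining)
            steps.append(step)
            remaining -= step
        return steps
```

With a hypothetical fleet of 100 servers, a floor of 80 and a step size of 5, a request to remove 12 servers is split into steps of 5, 5 and 2, while a mistyped request to remove 30 is rejected before any server is touched.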
After explaining the event, Amazon has posted, "Finally, we want to apologise for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further."
Published Date: Mar 03, 2017 11:29 am | Updated Date: Mar 03, 2017 11:29 am