How a typo took down S3

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
Earlier this week, much of the internet ground to a halt when the servers that power it suddenly vanished. The servers were part of S3, Amazon’s popular cloud storage service, and when they went down they took several big services with them. Quora, Trello, and IFTTT were among the sites affected by the disruption. The servers came back online more than four hours later, but not before totally ruining the UK celebration of AWSome Day.

Now we know how it happened. In a note posted to customers today, Amazon revealed the cause of the problem: a typo.

On Tuesday morning, members of the S3 team were debugging the billing system. As part of that, the team needed to take a small number of servers offline. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” Amazon said. “The servers that were inadvertently removed supported two other S3 subsystems.”

The subsystems were important. One of them “manages the metadata and location information of all S3 objects in the region,” Amazon said. Without it, services that depend on it couldn’t perform basic data retrieval and storage tasks.

After the servers were accidentally taken offline, the affected systems had to do “a full restart,” which apparently takes longer than it does on your laptop. While S3 was down, a variety of other Amazon Web Services stopped functioning, including Amazon’s Elastic Compute Cloud (EC2), which is also popular with internet companies that need to rapidly expand their computing capacity.

Amazon said S3 was designed to be able to handle losing a few servers. What it had more trouble handling was the massive restart. “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” the company said.

As a result, Amazon said it is making changes to S3 to enable its systems to recover more quickly. It’s also declaring war on typos. In the future, the company said, engineers will no longer be able to remove capacity from S3 if it would take subsystems below a certain threshold of server capacity.
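The safeguard Amazon describes amounts to a minimum-capacity check in front of any removal command. A minimal sketch of that idea, with the threshold and function names purely hypothetical (Amazon has not published its actual tooling):

```python
# Hypothetical minimum-capacity guard: refuse any removal that would
# drop a fleet below a safety threshold, however the input was typed.
MIN_CAPACITY_FRACTION = 0.9  # assumed threshold, for illustration only

def remove_capacity(fleet_size: int, to_remove: int) -> int:
    """Return the new fleet size, or raise if the removal is unsafe."""
    remaining = fleet_size - to_remove
    if remaining < fleet_size * MIN_CAPACITY_FRACTION:
        raise ValueError(
            f"refusing to remove {to_remove} of {fleet_size} servers: "
            f"{remaining} would fall below the safety threshold"
        )
    return remaining

print(remove_capacity(1000, 50))   # within threshold, prints 950
try:
    remove_capacity(1000, 500)     # a typo'd larger input gets rejected
except ValueError as e:
    print("blocked:", e)
```

The point is that a fat-fingered argument becomes a refused command instead of an outage, regardless of how plausible the typo looks.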

It’s also making a change to the AWS Service Health Dashboard. During the outage, the dashboard embarrassingly showed all services running green, because the dashboard itself was dependent on S3. The next time S3 goes down, the dashboard should function properly, the company said.

“We want to apologize for the impact this event caused for our customers,” the company said. “We will do everything we can to learn from this event and use it to improve our availability even further.”
http://www.theverge.com/2017/3/2/14792442/amazon-s3-outage-cause-typo-internet-server
 

PhireSide

Honorary Master
Joined
Dec 31, 2006
Messages
14,236
This just goes to show that you will never be able to completely eliminate human error as long as humans are involved.

The other thread where some person spoke about a company losing $200m per hour due to a mistake or something comes to mind :p
 

Solarion

Honorary Master
Joined
Nov 14, 2012
Messages
21,885
The subsystems were important. One of them “manages the metadata and location information of all S3 objects in the region,” Amazon said. Without it, services that depend on it couldn’t perform basic data retrieval and storage tasks.

That's what happens when you put all your eggs in one basket.

"I don't need redundancy I'm keeping it real."
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
I wonder if that guy will face consequences. I mean, damn, we've all made mistakes like this... they just have scale.
 

Solarion

Honorary Master
Joined
Nov 14, 2012
Messages
21,885
I wonder if that guy will face consequences. I mean, damn, we've all made mistakes like this... they just have scale.

You could leave an external hard drive filled with all your country's top secrets on a train.

That's happened.
 

rward

Senior Member
Joined
Oct 26, 2007
Messages
865
The best thing you can ever do when something goes wrong is have a solution.

1 - "I dropped the users table in the database by mistake.."
2 - "I dropped the users table in the database by mistake but I'm unzipping the backup from last night and will restore from that. Then I'm checking the log file for new accounts and will manually add those back, email them an apology and credit them with 5 credits."

1 is a problem, 2 is a solution.

Give your boss solutions.
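Option 2 above is basically "restore from backup, then replay what the backup missed." A minimal sketch of that flow using sqlite3; the table schema and log format are hypothetical, just to make the idea concrete:

```python
# Sketch of the "restore then replay" recovery: rebuild the users table
# from last night's backup, then re-add accounts created since, as
# recovered from an application log. All names here are hypothetical.
import sqlite3

def restore_users(live, backup, new_accounts_from_log):
    """Rebuild users from the backup, then replay logged new accounts."""
    live.execute("DROP TABLE IF EXISTS users")
    live.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")
    # Copy everything that existed as of the backup.
    rows = backup.execute("SELECT email, name FROM users").fetchall()
    live.executemany("INSERT INTO users VALUES (?, ?)", rows)
    # Replay accounts created after the backup was taken.
    live.executemany("INSERT OR IGNORE INTO users VALUES (?, ?)",
                     new_accounts_from_log)
    live.commit()
    return live.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The emailed apology and the 5 credits are left as an exercise, but the principle stands: by the time you tell the boss, the fix is already running.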
 

Necropolis

Executive Member
Joined
Feb 26, 2007
Messages
8,401
I can just imagine the sinking feeling that the person who issued that command felt.

Glad I'm not him.
 

MrGray

Executive Member
Joined
Aug 2, 2004
Messages
9,391
This just goes to show that you will never be able to completely eliminate human error as long as humans are involved.

The other thread where some person spoke about a company losing $200m per hour due to a mistake or something comes to mind :p

Sure, but to my mind it's a massive system design failure if developers debugging something have direct access to a command environment where critical live production servers can be controlled by a typed command. That's just asking for disaster. The bigger error is that it was even possible for someone just debugging something to do this.
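One common guardrail for exactly this: production-affecting commands that dry-run by default and only execute after an explicit typed confirmation. A minimal sketch of that pattern (the function and confirmation string are illustrative, not Amazon's actual tooling):

```python
# Sketch of a guarded production command: dry-run by default, and a
# destructive action only proceeds when the operator types an exact
# confirmation that restates what is about to happen.
def remove_servers(pool, count, confirm=""):
    """Remove `count` servers; without the exact confirmation, only report."""
    victims = pool[:count]
    expected = f"remove {count}"
    if confirm != expected:
        print(f"DRY RUN: would remove {victims}; "
              f"pass confirm={expected!r} to execute")
        return pool  # nothing actually removed
    return pool[count:]

pool = ["s1", "s2", "s3", "s4"]
pool = remove_servers(pool, 2)                      # dry run, pool unchanged
pool = remove_servers(pool, 2, confirm="remove 2")  # explicit, removes s1, s2
```

Forcing the operator to restate the blast radius catches exactly the case in the article: a typo'd argument produces a dry-run report that looks obviously wrong, instead of quietly taking down half the fleet.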
 

bridgeburner

Well-Known Member
Joined
Feb 17, 2017
Messages
316
Wonder if this was done from the Cape Town office.

Slightly OT: but does anyone know what it is like to work at the Amazon office in Cape Town?

Pretty awesome. I have a couple of friends who work there. I have visited their offices myself. Recruitment process is quite hectic though.
 

flippakitten

Expert Member
Joined
Aug 5, 2015
Messages
2,486
As services become more reliant on each other things really start to break in a spectacular fashion.

I just looked at it and thought "well not much I can do about that" and went to bed.
 

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
As services become more reliant on each other things really start to break in a spectacular fashion.

I just looked at it and thought "well not much I can do about that" and went to bed.
As much as the Internet's design was intended to be resilient, arguably today it's far easier than ever to bring down most of it.
 