NIGHTMARE!

The_Unbeliever

Think you have it hard with your difficult boss? Here's what happens when you've got millions of clients and an upgrade borks the entire system.

Shark Tank said:
This pilot fish works at a telco that provides DSL hardware access to ISPs, who in turn sell access to the network. Total number of users: in the millions.

"The network staff is an interesting team to deal with, and often refuse to believe something can fail in their network without setting off an alarm," says fish.

"From my point of view, when you've got three dozen users on a single piece of hardware as their sole point of commonality and they're all screaming that they are off the air, then the problem has been located -- network alarms or not."

So when clusters of problem reports start coming in from users on the morning after a software upgrade, fish is pretty sure it's a real problem. The network guys haven't gotten any alarms, but they reluctantly send out a tech who confirms it.

Analysis soon shows that, although the upgrade tested fine in a sandbox environment, the tests didn't include a certain older variant of hardware that's at more than 100 sites. Result: The upgrade has trashed the firmware on the hardware blades, putting more than 100 sites partially offline.

All those blades need replacement -- but usually there are only 80 spares in inventory. And a physical check of the inventory soon shows that, due to human error in keeping track, only 40 blades are actually available.

Oops.

"While the network team managers panic and harass the vendor for spares, the network team members frantically begin doing anything they can to try to restore services," fish says.

"On the recommendation of the vendor, the on-site techs at multiple sites are told to physically reset the main site network controllers, which will necessitate a full restoration of the site from remote backup afterward."

To everyone's growing horror, the site restorations begin to fail one by one. Apparently, the software upgrade has also rendered the backups unusable.

Fish's company is now looking at flying blades in from international sources, after which the support team will start recreating individual services one at a time at sites where the network controllers were hard-reset, thus irretrievably dumping the configuration files.

That's about 5,000 controllers -- and the number is climbing.

"The final kicker is that someone has unearthed an e-mail from the vendor from nearly three years ago," sighs fish. "It warns that the sandbox environment doesn't properly simulate the behavior of the older blades.

"So the vendor is off the hook. And none of the current network team was here when the warning was sent, so no one was officially aware of the problem before it occurred.

"It's going to be a long week."


Discuss - should they have gone with a staged roll-out?
 
Omf, absolutely! But inevitably, there is always SOMETHING that rains on the party. Luckily it's not always this bad, though!
 
Think you have it hard with your difficult boss? Here's what happens when you've got millions of clients and an upgrade borks the entire system.

Discuss - should they have gone with a staged roll-out?

Somebody once said: "Whatever can go wrong will go wrong, and at the worst possible time, in the worst possible way." Kinda sounds true of this situation! :D

But there's another law that says we need these situations once a decade to keep us on our toes the other 9 years.
 