Hi OMB
I have a team looking into the area this evening and will get back to you on the forum.
And as promised, here are the details of what we found and what we are doing:
The root cause of the problem turned out to be elusive for the NOC team. After much investigation, we found that one type of IP microwave hardware that our vendor used to connect the cell site base stations back to the main hub had a configuration setting that was not stable on the platform. When the Node B (the cell site hardware) was reset, and sometimes just on its own, the microwave hardware would drop the Ethernet port connected to the Node B to half-duplex mode. What this means comes down to how the link handles IP packets: the normal mode is full duplex, in which packets travel in both directions at once. When only one port, or one side of the link, is set to half duplex, it causes all kinds of problems: packet loss and the delays that follow from it, limited speeds, dropped connections when lost packets cause the far-end system to time out, and so on.
The reason it was hard to find was that this type of fault did not raise any alarms on our monitoring systems, and the metrics we monitored did not capture the error either. What we are doing now is visiting each site with this type of link, checking the configuration, and changing the method of configuration. We will have this task completed by Friday evening.
The longer-term fix, over the next month, is to replace all of those links in the network with a new hardware type. We have also added threshold alarms for packet loss on very specific parameters that monitor the Node B data and voice VLANs, so that an alarm is raised for any site showing any packet loss on those services.
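To give a rough idea of what a per-VLAN packet-loss alarm like this looks like, here is a minimal Python sketch. It is only an illustration and not the actual monitoring configuration: the `VlanCounters` record, the function names, and the sent/received counters are all hypothetical, and it assumes the counters have already been collected for each site.

```python
from dataclasses import dataclass


@dataclass
class VlanCounters:
    """Hypothetical per-site, per-VLAN packet counters for one polling interval."""
    site: str
    vlan: str        # e.g. "data" or "voice"
    sent: int        # packets sent towards the far end
    received: int    # packets the far end reported receiving


def packet_loss_pct(c: VlanCounters) -> float:
    """Return packet loss as a percentage of packets sent."""
    if c.sent <= 0:
        return 0.0  # nothing was sent, so no loss can be measured
    lost = max(c.sent - c.received, 0)
    return 100.0 * lost / c.sent


def sites_in_alarm(samples, threshold_pct=0.0):
    """List (site, vlan, loss %) for every sample above the threshold.

    The default threshold of 0.0 flags any packet loss at all.
    """
    alarms = []
    for c in samples:
        loss = packet_loss_pct(c)
        if loss > threshold_pct:
            alarms.append((c.site, c.vlan, round(loss, 2)))
    return alarms
```

With a threshold of zero, a site that delivers every packet stays quiet, while a site that loses even a handful of voice packets is flagged.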
In the near term, we are doing some more work to develop a reset script to improve notification on the old-style links and make it easier to find any sites that revert to half duplex after an unplanned reset of the Node B.
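The detection half of that idea can be sketched in a few lines of Python. This is a hedged illustration only, not the script itself: the function name and the input format are hypothetical, and it assumes the duplex state of each Node B facing port has already been read from the link hardware by whatever means the equipment supports.

```python
def find_half_duplex_sites(port_status):
    """Return the sites whose Node B facing port reports half duplex.

    port_status is a hypothetical dict mapping a site name to the duplex
    state reported for its port, e.g. {"site-a": "full", "site-b": "half"}.
    """
    half_values = {"half", "half-duplex", "half duplex"}
    return sorted(
        site
        for site, duplex in port_status.items()
        if duplex.strip().lower() in half_values
    )
```

Running a check like this after every Node B reset, and on a regular schedule, would catch a port that reverts on its own as well as one that reverts after a reset.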
Again, I apologize for the issues, and we are working to give you the best network possible.
For the Cape Town users: we have an MPLS fault on the fiber segment from Cape Town to Joburg that is being worked on, and the expected time to resolution is around 2 AM.
In the future, we will be using Twitter status updates, our website, and this forum to keep our customers updated quickly on service-affecting outages. I am working with my NOC team to finalize the procedures for these near-real-time updates, and we hope to have them operational by the middle of next week.
Regards
Ron
Ronald Reddick
CTO
Cell C