Status
Not open for further replies.

Tinuva

The Magician
Joined
Feb 10, 2005
Messages
12,478
The whole country wasn't down at 8 - 9pm, only selected areas
Correct, but all the areas that were down came online at the exact same time. It's still a lot more than what would have come online at the same time at 2pm.
 

John Tempus

Executive Member
Joined
Aug 8, 2017
Messages
6,121
Looking at my smokeping graphs https://netmon.heaven.za.net/smokeping/smokeping.cgi?displaymode=n;start=2019-02-13 06:06;end=now;target=CISP.CISP-Vumatel-1st-hop-new

I can see there were actual outages on the network at midnight, 2am and slightly 4am. Chances are that is what you were seeing at the time of your post. I suspect Vumatel doing network maintenance at the time as mentioned earlier in the thread.

Totally plausible, and it matches my experience, but what grinds my teeth is Vumatel's last official notice on Twitter, which said that the maintenance on Tuesday night/Wednesday morning resolved all the issues. If their public-facing agents don't have a clue about what is scheduled and give misleading information, then either they don't really care or even their own people have no clue about their schedule.

It is really shameful that these companies can be so out of touch that they don't lay out for us when they will be performing maintenance that leads to the kind of downtime we are experiencing.
 

John Tempus

Executive Member
Joined
Aug 8, 2017
Messages
6,121
Correct, but all the areas that were down came online at the exact same time. It's still a lot more than what would have come online at the same time at 2pm.

This does not explain anything relating to the DHCP storm "theory" at this point.

At the time, even after assigning a static IP, which requires no DHCP broadcasting, I was unable to get online until 12:30am.

If all these issues after load shedding are in fact DHCP related, I should not have had this issue. The line sync light also only cleared up around 4am; however, I was able to get online at 2am using the static IP, even with the sync light still blinking.

One thing I noticed for 3 hours was massive ping spikes all across the routers. DHCP broadcasting should not generate that much traffic on the single segment where each area needs to be allocated an IP. If that is actually what is happening, then Vumatel has far greater design issues than I initially suspected, because it looked more like a straight-up terabit DDoS than mere DHCP broadcast traffic.

DHCP broadcasting could in theory have a complete DDoS effect if every single Vumatel user coming online needs to broadcast via the local PoP all the way across the network to one central point. But SURELY Vumatel did not design their network in this way? Right?

If they designed the network to operate in that way, then we will need to pray to ever get online after any power/network outage if Vumatel's clientele grows any bigger.

This could be remedied immediately by extending the DHCP lease time, even to just 6 hours. That would cover any likely outage and let the client extend its existing lease when it is back online. I don't even think it could be this, though. I am always back on my original DHCP IP no matter how long I have had fiber off before, so it's not negotiating new IPs due to an expired lease in any case. What exactly is causing so much talkback on the network when users come back online is the real mystery. It sure as sht is not DHCP leasing / renewal / assignments.
 

Tinuva

The Magician
Joined
Feb 10, 2005
Messages
12,478
DDoS is not _only_ when a lot of bandwidth is pushed. Denial of service is the technical term for a service being made unavailable. Many DDoS attacks rely on killing a service with as little bandwidth as possible. So if too many DHCP requests cause a DHCP server to become unavailable, that's at the least a DoS (Denial of Service). The extra D means it's distributed, which in this case it very much is. Another one where little bandwidth is needed is the SYN flood attack on HTTP servers. Nowadays that's easily blocked/prevented.
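
To put rough numbers on the "little bandwidth" point, here is a toy Python model. It is only a sketch under stated assumptions and says nothing about how Vumatel's DHCP servers actually behave: every client is assumed to come online at the same instant and retry in lockstep, the server is assumed to answer a fixed number of requests per burst and drop the rest, and each request is assumed to be about 350 bytes on the wire. The function name and all the figures are hypothetical.

```python
def simulate_dhcp_rush(clients, served_per_burst, retry_interval_s=4, request_bytes=350):
    """Toy model of many clients asking for a lease at the same moment.

    Assumptions (illustrative only): all clients retry in lockstep every
    retry_interval_s seconds, and the server answers at most served_per_burst
    requests per burst, silently dropping the others.

    Returns (seconds until the last client is served, size of the first
    and largest burst in megabits).
    """
    if served_per_burst <= 0:
        raise ValueError("served_per_burst must be positive")

    waiting = clients
    elapsed_s = 0
    # The very first burst is the biggest one: every client asks at once.
    peak_burst_mbit = waiting * request_bytes * 8 / 1e6

    # Each round the server clears served_per_burst clients; the rest retry.
    while waiting > served_per_burst:
        waiting -= served_per_burst
        elapsed_s += retry_interval_s
    return elapsed_s, peak_burst_mbit


elapsed_s, peak_burst_mbit = simulate_dhcp_rush(clients=10_000, served_per_burst=500)
print(f"last client served after {elapsed_s}s, largest burst {peak_burst_mbit} Mbit")
```

In this made-up scenario the largest burst is only a few tens of megabits, yet some clients wait well over a minute for a lease. The service degrades because of the request count, not because of the traffic volume.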

Anyways, I do believe the DHCP broadcast storm story. That said, I also think there is something else on top of that issue. The DHCP broadcast storm is just the most visible one at the moment.

What I find most interesting is that I have no issues on the same Vumatel PoP as you and @DJZassie (who I know in person). But it is my understanding that there are multiple line cards and also multiple switches at a single PoP, and I must be lucky not to be on the affected side there.

Now if your static IP didn't resolve your issue, I suspect you are on a switch that needs to boot up after load shedding. Some of the big switches that I worked on in the past take 15 minutes to boot up if they are already provisioned. Depending on how Vumatel did theirs, it may need automated provisioning at each boot up (I don't know the real facts), and I think that can also cause issues with the repeated load shedding.

Sadly, the more complex networks and systems are, the more complex the issues are. Vumatel has multiple interesting issues to sort out. I imagine them running around like mad chickens this last week. I feel really sorry for the guys that need to do the actual troubleshooting/fixing at the moment.
 

John Tempus

Executive Member
Joined
Aug 8, 2017
Messages
6,121
So the interesting part of the static IP test, done while the dynamic IP would not assign at all, is the following.

With the static IP during the shtstorm, I could get the following baseline results from 9pm, when the power came back on. The sync light was still just flashing. No usable traffic was available aside from the measly MTR output.

|------------------------------------------------------------------------------------------|
|                                    WinMTR statistics                                     |
|                                    Host -    % | Sent | Recv | Best | Avrg | Wrst | Last |
|------------------------------------------------|------|------|------|------|------|------|
|                             192.168.0.1 -    0 |   20 |   20 |    0 |    0 |    0 |    0 |
|                              mypublicip -    0 |   20 |   20 |    2 |   12 |   58 |    5 |
|            c3h-backbone.coolideas.co.za -   25 |    8 |    6 |  554 |  579 |  602 |  588 |
|             cd-backbone.coolideas.co.za -   19 |   11 |    9 |  534 |  569 |  601 |  534 |
|                 cloudflare.ixp.capetown -   50 |    6 |    3 |    0 |  613 |  628 |  609 |
|                             154.0.5.146 -   50 |    6 |    3 |    0 |  613 |  628 |  609 |
|________________________________________________|______|______|______|______|______|______|

At 2am, still with a flickering sync light, traffic on the static IP dropped back to normal with no issues.


Meanwhile, the dynamic IP would not assign at all from 9pm to 4am, and the sync light only synced at 4am.


So, from earlier tests with @TheRoDent, the static IP is seemingly assigned to an entirely different "rack" than the dynamic IPs. Even though it is on a different "rack", it was unusable for anything beyond an MTR test until the wee hours of the morning, but it recovered before the DHCP IP did.


So on the one side I can believe that the entire Vumatel network is effectively killed for hours on end by leasing traffic alone. That would explain the horrible results I saw on the static IP for hours on end, as posted above, while I was still unable to get a DHCP IP returned.

What I do not get is this: handing a previously stored lease back to the same MAC needs nearly zero verification compared with assigning a fresh IP lease.

So my conclusion is that at each load shedding the active memory of every single leasing server is wiped clean (they don't seem to use any stored DB), and that causes every single returning user to be assigned a fresh lease by each router, making stored DHCP leases completely useless.

If I am on the right track: Vumatel, figure out a way to actually keep a backup of the router data if you are unable to keep your systems up when load shedding occurs, because at this time I don't think even 50% of their network has any backup power to handle this.
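
To make the "stored DB" idea concrete, here is a minimal Python sketch of a lease table that survives a restart. It assumes nothing about Vumatel's real DHCP servers: the `LeaseStore` class, the JSON file and the 192.0.2.0/29 documentation range are all hypothetical, and a real DHCP server would also track lease expiry, which is left out here.

```python
import ipaddress
import json
import os
import tempfile


class LeaseStore:
    """Hypothetical MAC -> IP lease table that is written to disk on every change."""

    def __init__(self, path, network="192.0.2.0/29"):
        self.path = path
        self.pool = [str(ip) for ip in ipaddress.ip_network(network).hosts()]
        self.leases = {}
        # After a power cut, reload whatever was saved before the outage.
        if os.path.exists(path):
            with open(path) as fh:
                self.leases = json.load(fh)

    def allocate(self, mac):
        # A returning MAC gets its old address back with a single lookup.
        if mac in self.leases:
            return self.leases[mac]
        used = set(self.leases.values())
        for ip in self.pool:
            if ip not in used:
                self.leases[mac] = ip
                self._save()
                return ip
        raise RuntimeError("address pool exhausted")

    def _save(self):
        # Write to a temp file first so a power cut cannot leave a half-written DB.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(self.leases, fh)
        os.replace(tmp_path, self.path)


with tempfile.TemporaryDirectory() as tmpdir:
    db_path = os.path.join(tmpdir, "leases.json")
    before_outage = LeaseStore(db_path)
    first_ip = before_outage.allocate("aa:bb:cc:00:00:01")
    second_ip = before_outage.allocate("aa:bb:cc:00:00:02")

    # Simulate a restart: a brand-new object that only has the file to go on.
    after_outage = LeaseStore(db_path)
    ip_after_restart = after_outage.allocate("aa:bb:cc:00:00:02")

print(second_ip, ip_after_restart)
```

With the file in place, the returning client takes the cheap lookup path and gets the same address. Without it, every client would go through a fresh allocation after each outage.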
 

Tinuva

The Magician
Joined
Feb 10, 2005
Messages
12,478
I am not surprised that your MTR looks the way it does. Once a connection comes online, you need to restart the MTR to get accurate results. What it shows is that you were able to ping the gateway IP and everything after that was dead, as if the backhaul from your Vumatel PoP to the CISP interconnect wasn't working.

What is the IP address you have hidden as "mypublicip"? PS: it's not your IP address; it is the gateway you connect through, and it should end in .1. I want to know if it is the same as mine.

Code:
~ % mtr 154.0.5.146 --report-wide --show-ips
Start: 2019-02-14T07:31:54+0200
HOST: turbot                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- router.lan (192.168.241.1)                  0.0%    10    0.4   0.6   0.4   0.7   0.1
  2.|-- 155.93.246.1                                0.0%    10    2.9  10.9   2.9  53.0  15.1
  3.|-- c3h-backbone.coolideas.co.za (154.0.1.125)  0.0%    10    1.6   1.5   1.1   1.6   0.1
  4.|-- cd-backbone.coolideas.co.za (154.0.1.13)    0.0%    10    1.7   1.7   1.6   1.8   0.1
  5.|-- 154.0.1.61                                  0.0%    10   19.7  19.0  18.8  19.7   0.3
  6.|-- u13m-cust.coolideas.co.za (154.0.5.146)     0.0%    10   19.3  19.2  18.7  19.4   0.2
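
For anyone who wants to compare reports like this automatically, here is a minimal Python sketch that parses `mtr --report-wide` output and returns the first hop showing packet loss. The regular expression is written against the column layout shown above (Loss%, Snt, Last, Avg, Best, Wrst, StDev) and is an assumption; other mtr versions or options may print different columns. The function names are hypothetical.

```python
import re

# One hop line of the report, e.g.
#   2.|-- 155.93.246.1    0.0%    10    2.9  10.9   2.9  53.0  15.1
HOP_RE = re.compile(
    r"^\s*(\d+)\.\|--\s+(.+?)\s+([\d.]+)%\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_mtr_report(text):
    """Return one dict per hop line; header and other lines are skipped."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append({
                "hop": int(match.group(1)),
                "host": match.group(2),
                "loss_pct": float(match.group(3)),
                "sent": int(match.group(4)),
                "avg_ms": float(match.group(6)),
                "worst_ms": float(match.group(8)),
            })
    return hops


def first_lossy_hop(hops, threshold_pct=1.0):
    """Return the first hop whose loss reaches the threshold, or None."""
    for hop in hops:
        if hop["loss_pct"] >= threshold_pct:
            return hop
    return None


REPORT = """\
HOST: turbot                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- router.lan (192.168.241.1)                  0.0%    10    0.4   0.6   0.4   0.7   0.1
  2.|-- 155.93.246.1                                0.0%    10    2.9  10.9   2.9  53.0  15.1
  3.|-- c3h-backbone.coolideas.co.za (154.0.1.125)  0.0%    10    1.6   1.5   1.1   1.6   0.1
  4.|-- cd-backbone.coolideas.co.za (154.0.1.13)    0.0%    10    1.7   1.7   1.6   1.8   0.1
  5.|-- 154.0.1.61                                  0.0%    10   19.7  19.0  18.8  19.7   0.3
  6.|-- u13m-cust.coolideas.co.za (154.0.5.146)     0.0%    10   19.3  19.2  18.7  19.4   0.2
"""

hops = parse_mtr_report(REPORT)
print(len(hops), first_lossy_hop(hops))
```

Keep in mind that loss reported at a middle hop which does not carry through to the final hop is often just that router rate-limiting its ICMP replies, so the last hop is the one to trust.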
 

Concentric

Expert Member
Joined
Feb 16, 2017
Messages
1,028
Are all the guys here on Vuma trenched?
We are on Vuma aerial in JHB with no issues whatsoever. We reconnect maybe 30s after the power comes back here.
 

John Tempus

Executive Member
Joined
Aug 8, 2017
Messages
6,121
I am not surprised that your MTR looks the way it does. Once a connection comes online, you need to restart the MTR to get accurate results. What it shows is that you were able to ping the gateway IP and everything after that was dead, as if the backhaul from your Vumatel PoP to the CISP interconnect wasn't working.

What is the IP address you have hidden as "mypublicip"? PS: it's not your IP address; it is the gateway you connect through, and it should end in .1. I want to know if it is the same as mine.

Code:
~ % mtr 154.0.5.146 --report-wide --show-ips
Start: 2019-02-14T07:31:54+0200
HOST: turbot                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- router.lan (192.168.241.1)                  0.0%    10    0.4   0.6   0.4   0.7   0.1
  2.|-- 155.93.246.1                                0.0%    10    2.9  10.9   2.9  53.0  15.1
  3.|-- c3h-backbone.coolideas.co.za (154.0.1.125)  0.0%    10    1.6   1.5   1.1   1.6   0.1
  4.|-- cd-backbone.coolideas.co.za (154.0.1.13)    0.0%    10    1.7   1.7   1.6   1.8   0.1
  5.|-- 154.0.1.61                                  0.0%    10   19.7  19.0  18.8  19.7   0.3
  6.|-- u13m-cust.coolideas.co.za (154.0.5.146)     0.0%    10   19.3  19.2  18.7  19.4   0.2


Not the same one, and yes, it's the gateway.

The static IP goes through 155.93.246.1
The DHCP IP always ends up going through 155.93.252.1

The point I was making is that these are FRESH MTRs, even after restarting it.

The static IP was able to pass traffic, yet it was unusable, while DHCP was never able to assign any IP. This was the only point I was trying to make. So yes, DHCP went haywire and kept on failing for hours, and the side effect of this could be seen when I assigned my static IP, since it showed the network being congested to hell and back.

So to sum it up:

I could assign a static IP and get trace info, while DHCP was unable to assign any IP and so I could not get any trace info that way.

In case you were curious, the static IP and the DHCP IP assignment are on different gateways/segments.

I am pretty sure we connect to the same Stellenberg PoP, unless I am mistaking you for someone else on here.
 

Tinuva

The Magician
Joined
Feb 10, 2005
Messages
12,478
I actually think both 155.93.246.1 and 155.93.252.1 are on the same router interface. I actually get 155.93.246.1 as my gateway dynamically from DHCP.

I am also on the Stellenberg PoP, but a PoP is more than a single router and a single switch. I bet you will find there is more than one router and one switch, and we are not on the same switches at the very least.

Looks like some faulty hardware at the PoP to me, on top of the DHCP issue.

Edit: The only other thing that could potentially affect the switches is the mass learning of new MAC addresses coming online... but I can't imagine that this would cause issues. Switches are made to learn these as fast as possible when they receive traffic from a new MAC address. ARP tables could be slow, but that shouldn't be the issue when the routers can't even get IP addresses from the DHCP servers.
 

John Tempus

Executive Member
Joined
Aug 8, 2017
Messages
6,121
That's weird. For the last 2 years I have only ever gotten the same gateway and the same public IP via DHCP.

I don't actually think there is any faulty router.

This is what I observed during the whole issue around the DHCP storm, with no one getting an IP assigned for hours: the whole Vumatel network is multicasting or doing something equally stupid, trashing the whole network just to attempt to serve all the IP requests. To that extent they have a critical design flaw that is now suddenly rearing its head. I don't know why it has only shown up since Monday.

The reason I don't think there are faulty routers is that after everything cleared up, both the DHCP-assigned IP and the static IP had a clean network. The noisy, congested network was just visible earlier for me on the static IP, since I was able to force myself onto the network and could then see its state of congestion.

It was pretty clear to me that the network was DDoSing itself at this point. For over an hour on the static IP I observed ping stuck at 700ms with 90% packet loss. Then for 60 minutes the ping and the packet loss dropped slowly, minute by minute. After that the ping fluctuated wildly between 2ms and 500ms for another 30 minutes, and then it suddenly cleared up.

Right at the point the static IP cleared up, I was able to get my DHCP IP assigned. To me that suggests the network congestion had cleared up enough by then for the DHCP service to function properly, even though this took 7 hours.
 

Tinuva

The Magician
Joined
Feb 10, 2005
Messages
12,478
I don't see any congestion at all during or after load shedding. I am on the same PoP as you. I am 100% certain you are on a switch that is causing that packet loss, whereas I am on one where I don't see it at all.

It's not the whole network like you think it is.
 

irBosOtter

Expert Member
Joined
Feb 14, 2014
Messages
2,872
Correct, but all the areas that were down came online at the exact same time. It's still a lot more than what would have come online at the same time at 2pm.

Not according to the load shedding schedules. The same areas were down earlier in the day, and there were no issues when they came back up.
 