Switches and Hell

KemoSabi

Ok, so I am in a predicament. I simply can't figure out what is wrong on my network. Maybe someone on this forum can help or has had similar problems.

My setup currently :

1 x 24-Port D-Link Gigabit Switch (Un-managed in the Server Room)
1 x 24-Port Cnet 10/100 Switch (Un-managed in the Server Room)
1 x 8-Port Netgear Switch (In Workshop 1) - 50 Meters away from the main switches
1 x 8-Port "Noname Brand" (In Workshop 2) - 50 Meters away from the Netgear switch in Workshop 1.

Connected to these switches :


Office Switches:
23 x Computers (Variety of 100mbps and Gig network adapters)
3 x Network printers (100mbps)
3 x CNC Machines (100mbps)
4 x Netgear Routers (with separate ADSL lines) - *Note only one of these acts as a DHCP Server. (Has been working in this setup for 7 years - Age maybe?)

Workshop 1 Switch:
4 x Wireless Access Points
1 x Connection from Workshop 2
1 x Connection to Server Room

Workshop 2 Switch:
2 x Wireless Access Points
1 x Computer (100mbps)
3 x CNC Machines (100mbps)
1 x Connection to Workshop 1

The reason for the unmanaged switches is simple. We were a small'ish company "IT-wise" and never needed more than that. However, we have slowly grown and needed an additional switch, so we added one, even if it only had two lines into it. Then, as time went by, Workshop 1 needed to be on the network, and the easiest way was a switch in the workshop. Then Workshop 2, etc. Add to that new CNC machines and new PCs in the workshop.

This setup has worked for the past 7 years. However, 6 months ago an appy wired up a faulty network cable and plugged it into the switch in Workshop 2. He had bunched up the wires inside the connector and squashed them all into each other. Normally I would write this off as a lost connector and call it quits. However, the whole network went ape-****. I could not get to the bottom of it. For 30 minutes everything works fine, then suddenly the network is insane, then after 30 minutes of running around unplugging, restarting and generally praying, everything works fine again. This went on for about 4 hours, until it dawned on me that I had had the appy make up and install the cable earlier that morning.

So I ran down and unplugged the cable. Lo and behold, the network worked fine, immediately. I wrote it off as a freak occurrence with the network cable. Why not, everything worked fine now. And so it did for 6 months...until yesterday. Then suddenly the problems started again. I thought it couldn't possibly be that switch, but after an hour of struggling I went out to the switch and noticed that the maintenance team had added a new CNC machine to it. The likelihood that this was coincidence, with the network problems starting on the same day they added a new device to the very switch involved in the previous problems, was small.

So I immediately unplugged that switch, and once again it worked. But this morning, the network still seems odd. Certain warnings keep flashing on my SQL DBs and I notice connectivity issues. A 100-count ping comes back with 4% packet loss, where in the past that would have been 0%.

Is it possible that the switch has damaged the main switches? Is it even possible that the switch is the culprit? Am I just imagining that unplugging that switch worked, and within an hour it's all gonna happen again? Will I have another flashback from my youth and experimental drug usage? Who knows....all I know is that this thing is a pain in the ass.
 
Does that switch have a console port or management interface of sorts?

Do you have any loops in your network?
 
Tried running a packet sniffer to see what exactly is going on?
 
Tried running a packet sniffer to see what exactly is going on?

Bit hard to run a sniffer on a switch if you can't span ports as you will only see the connected ports traffic.
 
Bit hard to run a sniffer on a switch if you can't span ports as you will only see the connected ports traffic.


You'll still pick up any viruses that might be flooding the network

Also, why would he have a management interface on unmanaged switches? :D
 
Also, why would he have a management interface on unmanaged switches? :D

He does not specify the switches in the workshops as unmanaged; only the ones in the server room are specified as unmanaged.
 
So I ran down and unplugged the cable. Lo-and Behold the network worked fine, immediately. I wrote it off as a freak occurrence with the network cable.

If this cable hasn't already been destroyed, replace it.
 
Can you string temporary cables between the main switch and the slave switches?

Do it after disconnecting the existing feeder cables. Then see if the problem persists.

It might be a faulty/bad cable somewhere.

BUT it can also be a jabbering network card, flooding the network with useless garbage, hence the network going down.

Your best bet would be to get a managed switch in to collect some statistics on which port(s) get used the most, then narrow it down from there.

One last question : How old are the server(s)? Age does matter. If the NICs are quite old, chances are they'll be flaky.


If it's Windows, run netstat -e

You'll get an output similar to this

Code:
U:\>netstat -e
Interface Statistics

                           Received            Sent

Bytes                    3408110432      2877415177
Unicast packets             4285325         4219457
Non-unicast packets         2278393           37486
Discards                          0               0
Errors                            0               0
Unknown protocols             55320

If you have a lot of errors, then that specific NIC's giving you grief.
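
If you want to check more than one box, the same counters can be pulled out with a short script. This is only a rough sketch: it assumes the English column layout shown above, the function names are made up for illustration, and the sample text is simply the output pasted earlier in this post.

```python
import re

def parse_netstat_e(text):
    """Return {counter_name: (received, sent)} from `netstat -e` text.

    Assumes the English layout shown in the post; `sent` is None for rows
    that only have one number (e.g. "Unknown protocols").
    """
    stats = {}
    for line in text.splitlines():
        # Counter rows look like: "Errors      0      0"
        m = re.match(r"^\s*([A-Za-z][A-Za-z\- ]*?)\s+(\d+)(?:\s+(\d+))?\s*$", line)
        if m:
            name, rx, tx = m.group(1), int(m.group(2)), m.group(3)
            stats[name] = (rx, int(tx) if tx is not None else None)
    return stats

def error_ratio(stats):
    """Received errors as a fraction of all received packets."""
    rx_packets = stats["Unicast packets"][0] + stats["Non-unicast packets"][0]
    return stats["Errors"][0] / rx_packets if rx_packets else 0.0

SAMPLE = """\
Interface Statistics

                           Received            Sent

Bytes                    3408110432      2877415177
Unicast packets             4285325         4219457
Non-unicast packets         2278393           37486
Discards                          0               0
Errors                            0               0
Unknown protocols             55320
"""

stats = parse_netstat_e(SAMPLE)
print(stats["Errors"], error_ratio(stats))
```

On a live machine you would feed it the real command output instead of SAMPLE. The "Non-unicast packets" row is also worth watching, since that counter covers broadcast and multicast traffic.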
 
+1 : Great post !:D

Follow the great advice above!
If no success, with unmanaged switches you may have to start by unplugging all other switches from your main switch, then see if the problem remains. If it is gone, start adding the switches back, one by one, monitoring results all the time. If you can identify the problematic switch, then unplug all its cables except the uplink and add one cable at a time to locate the faulty device / cable.
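
The one-at-a-time procedure above can be written down as a loop, if that helps keep the process disciplined. This is just a sketch: `is_healthy` is a hypothetical stand-in for whatever check you run after each change (a long ping test, say), and the switch names in the example are only labels.

```python
def find_culprit(items, is_healthy):
    """Reconnect items one at a time; return the first one that breaks the network.

    `is_healthy(connected)` is a hypothetical probe supplied by the caller.
    It receives the list of currently connected items and returns True or False.
    """
    connected = []
    for item in items:
        connected.append(item)
        if not is_healthy(connected):
            return item
    return None  # everything reconnected without trouble

# Simulated example: pretend the Workshop 2 uplink is the faulty one.
switches = ["Cnet 10/100", "Workshop 1", "Workshop 2"]
probe = lambda connected: "Workshop 2" not in connected
print(find_culprit(switches, probe))
```

The same loop works a level down, with the cables of the suspect switch as the items. With an intermittent fault, each check needs to run long enough to catch it.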

Start planning to get a decent managed switch as your core switch at least - HP ProCurve switches are great for cost/performance. It is worth the investment! Then try to connect the other switches directly to the core, so in future it will be easier to trace problems....
 
You kinda glossed over all the details. "network went ape-****" isn't a particularly good description of what is happening.

After you removed the switch in the last paragraph, what did you do? Replace it with a different one? Connected the affected users to a different switch? Leave them hanging high & dry?

The time patterns suggest that a specific PC/machine being switched on might be the cause. In that case either that NIC, that cable or that port on the switch is damaged. If you have a specific time it started and a vague idea which area it came from, then check the sys logs for that area for PCs that were switched on at that time.

Also, invest in a cable tester. They aren't particularly expensive but very useful in cases like this. I worked on a similarly organically grown network a while back and tested every single cable in the place. The number of damaged, flaky or just plain wrong (crossover) cables which that place sported was impressive. This is especially true if the cables were hand crimped. One cheap crimping tool in the wrong hands can ruin an entire network.

One switch damaging another is unlikely, unless there was a serious power surge through the cables. Those *can* jump devices.

Also, +1 on Electron1 and *systematically* troubleshooting this.
 
This sounds to me like a broadcast storm.
Since you're using un-managed switches, they probably don't have any loop prevention capabilities, such as STP.

You need to trace every cable and make sure you don't have any cables looping back into the switch. You should also ensure that you don't have more than one 'uplink' between each of the switches.
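
If you write down every switch-to-switch cable as you trace it, a few lines of code can flag the ones that close a loop. A rough sketch only: the function name and the switch labels are made up, and it assumes you have listed the cables by hand.

```python
def find_loop_links(links):
    """Return the cables that close a loop in a switch topology.

    `links` is a list of (device_a, device_b) cables. A cable whose two ends
    are already connected through earlier cables creates a loop; this
    includes a second uplink between the same pair of switches.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    loops = []
    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            loops.append((a, b))  # both ends already reachable: loop
        else:
            parent[ra] = rb       # join the two groups of switches
    return loops

# Hypothetical cabling: the third cable closes a ring.
cables = [
    ("server-room", "workshop-1"),
    ("workshop-1", "workshop-2"),
    ("workshop-2", "server-room"),
]
print(find_loop_links(cables))
```

An empty result means the cabling you listed is loop-free. It says nothing about cables you missed, or about switches hidden inside other equipment.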
 
Ok well Daffy hit it on the head.

More details for those that wanted....

This network has been a dream to date. Honestly no problems. What happened 6 months ago was explained up top. It made me doubt the switch, but after unplugging the faulty cable, all went well...for 6 months.

Then last week Friday, a new machine was installed on the workshop floor. The machine has an installer that accompanies it from Italy...yes, a person. This guy installed the whole machine along with our support personnel on the workshop floor. I was told to acquire a 90m network cable, 2 x 2 meter cables and a single switch for this new machine. Being busy with Desktop Review meetings, I just ordered and didn't think about what I was asked to order.

So sure as hell, the machine is set up. Cables are connected, and they ask me to connect the machine to our network as well as the web through our routers. Done.

Everything worked fine for a day. Then weird stuff started happening. Computers would flash that a network cable was unplugged and then immediately plugged in again. Mapped drives disappeared even though general connections were still running and available. SQL DBs dropped left, right and centre, and worked again immediately after. IPs got scrambled, the works. General ape-****.

I approached it by starting with the router which auto-assigns most of the PC IPs (except the servers). Did a general restart of it. Everything worked after that for 30 mins, and I thought this had solved it. But alas.
Then I went to the switches and gave them a good ol' unplugging. Once again about an hour went by without any problems. Then hell broke loose again.
What got me super confused was that it kept dropping our main SQL DB, which led me to believe it was that server's network cable. So I remade a network cable and plugged it in. Voila, 2 hours of bliss. Only to be destroyed again.
Then I lost it a tad. I ran into the workshop and cut the link that connects half the workshop to our network. This worked. So now I had the culprit trapped in that half of the workshop.

So it dawned on me that the switch that was previously involved with something this crazy might have finally decided to cave. So I drove to Incredible Connection (it was after 5pm and I couldn't wait for the next morning), got a switch and installed it. What do you know. Everything works like a charm. Man, I'm so smart....
Next day everything is running, I'm so smart....till 4:20pm, when all hell breaks loose again.
This is when I wept a little and sat silently behind my closed office door contemplating suicide. Then it dawned on me...."What were that switch and the 2 x 2 meter cables needed for, that the Italian asked for?"
So I ran down to the machine and opened her up. Inside, lo and behold, was a built-in switch with 4 cables running into it: 2 from the machine and 2 running to our new switch. That's 2 cables running into the new switch, which then connects to the "faulty" switch via the 90m cable. I was shocked...seems Europeans aren't as bright as we would like to believe.

So basically the problem was a broadcast storm: two cables between the same two switches form a loop. But because I never even anticipated a built-in switch inside the machine, it didn't occur to me that it was that.
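
For anyone wondering why two cables between the same two switches is so bad, here is a toy simulation. It is only a sketch with made-up switch names: it assumes every switch floods a broadcast out of every link except the one it arrived on (no STP) and it ignores end hosts and timing entirely.

```python
def frames_in_flight(links, origin, steps):
    """Count broadcast copies on the wire after each forwarding step.

    Each switch floods a broadcast out of every link except the one it
    arrived on. `links` is a list of (switch_a, switch_b) cables;
    duplicate cables between the same pair are allowed.
    """
    ports = {}  # switch -> list of (link_id, neighbour)
    for link_id, (a, b) in enumerate(links):
        ports.setdefault(a, []).append((link_id, b))
        ports.setdefault(b, []).append((link_id, a))

    # The origin switch sends the first copy out of every link it has.
    frames = [(link_id, nbr) for link_id, nbr in ports.get(origin, [])]
    counts = [len(frames)]
    for _ in range(steps):
        frames = [
            (out_id, nbr)
            for in_id, switch in frames
            for out_id, nbr in ports[switch]
            if out_id != in_id
        ]
        counts.append(len(frames))
    return counts

# One cable between the new switch and the machine's built-in switch.
single = [("workshop", "new"), ("new", "machine")]
# Two cables between them, as found inside the machine.
double = [("workshop", "new"), ("new", "machine"), ("new", "machine")]

print(frames_in_flight(single, "workshop", 4))
print(frames_in_flight(double, "workshop", 4))
```

With the single cable the count drops to zero after a couple of hops. With the double cable a single broadcast never dies out in this model, and every pass through the new switch also sends a fresh copy back towards the rest of the network.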

Thanks for all the replies though.
 
Hectic story.

My current company moved to new offices about a year and a half ago. Before then, in the previous building, we ran a flat network over 3 floors, much like you have. The company grew really fast back then, so the network was never planned by anyone; more switches were just thrown onto the network as more ports were needed. At one point we started having the same broadcast storm problems, bringing the whole network to its knees at random times.

So when we moved buildings, they gave us EoL Cisco equipment: slow, but very sweet stuff. I proposed we break up the big flat VLAN into smaller ones at the new office. We were about 150 staff at the time, with each department consisting of about 4-10 people. Every department now has its own VLAN, plus a VLAN for the servers and a whole separate VLAN for the workshop room.

The idea being that we could shape VLANs differently and block stuff differently based on VLAN, making life much easier, with the side effect that if a broadcast storm happens now, it's limited to that VLAN. So now the new network runs 100mbit ports instead of 1000mbit ports, BUT we have never had a single problem.

I think it is something you should consider as a long-haul project, not a go-ape-and-splash project. I think for you, even just splitting the workshop off onto its own VLAN will go miles, because it sounds like that is where your experiments are done and where the network breakage occurs. Try to confine it to that room in future. If you can, put your servers and execs/top management on another VLAN too, so that they can do their important work unaffected while certain parts of the network break down. Basically, limit the breakage to certain portions of the network.
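
To put the "limit the breakage" idea in concrete terms, here is a tiny sketch of which hosts a storm can reach before and after a split. The host names and VLAN numbers are invented for illustration, and it assumes broadcasts do not cross VLAN boundaries.

```python
from collections import defaultdict

def broadcast_domains(port_vlans):
    """Group hosts by VLAN; a broadcast only reaches hosts in the same group."""
    domains = defaultdict(set)
    for host, vlan in port_vlans.items():
        domains[vlan].add(host)
    return dict(domains)

def affected_by_storm(port_vlans, source):
    """Hosts that would see a broadcast storm started by `source`."""
    return broadcast_domains(port_vlans)[port_vlans[source]]

# Hypothetical flat network: everything sits in one VLAN.
flat = {"cnc-1": 1, "workshop-pc": 1, "sql-server": 1, "office-pc": 1}
# Same hosts with the workshop and the servers split off.
split = {"cnc-1": 30, "workshop-pc": 30, "sql-server": 10, "office-pc": 20}

print(sorted(affected_by_storm(flat, "cnc-1")))
print(sorted(affected_by_storm(split, "cnc-1")))
```

In the flat layout a storm from the CNC machine reaches the SQL server. In the split layout it stays with the workshop hosts.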
 