I've read everything here about WACS and upstream providers and whatnot, and to be brutally honest I couldn't care less. Are there no redundancy measures in place to protect against this, without the entire connection going to %$#@?
These past two weeks have probably been the worst I've had on a fixed-line connection in 5+ years. Frequent drops in connection, high latency, packet loss. No one else I know has had any issues on their ISPs, and I've asked around.
You guys came through for me in a hurry so I'm not keen to jump ship, but this is getting super frustrating.
So what's the plan here? When can we expect stability, and are you working to improve redundancy in providers?
Let me start off by assuring you, we do have redundancy - this is why you're still up despite our primary WACS route being down. I know it's little consolation, and I too am very frustrated at the turn of events. I've been putting on all the pressure I can to get this issue resolved, as well as looking at alternative solutions going forward.
A little technical explanation if you want:
Unfortunately, when a transit service loses a route, the routes need to re-converge (i.e. all the dead routes need to be pulled out of routing tables on a global scale and new, optimal routes need to be calculated). When it happens on a smaller table (like a NAP peer), the impact is negligible - a packet or two drops and you're back up and running. On a global table (about 800k prefixes for IPv4 and 90k+ for IPv6), this can take a little longer. But there is definitely redundancy on our network, and there is redundancy on our upstream's networks.
Generally, international outages are simple: the service goes down, there is one failover episode, and it's done. What's happened here is what we call flapping. The circuit is up, then down, then up again. That means we're constantly needing to adjust our tables, our upstreams need to adjust theirs, and so on - which is a worst-case scenario. Think of it as flipping the switch on a fluorescent light fitting, continuously. At some point the starter takes strain and the light doesn't come on as quickly. Routers aren't designed to recalculate routes constantly. Even best-of-breed routers take 15-45s to recalculate routing tables and ensure traffic goes where it needs to.
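To put rough numbers on the difference between a single failover and a flapping circuit, here is a minimal Python sketch. It is only an illustration, not how any router actually works: the function name `disrupted_seconds`, the 30-second convergence time (picked from inside the 15-45s range mentioned above) and the event timings are all assumptions, and it treats every up or down transition as restarting one fixed-length re-convergence window.

```python
def disrupted_seconds(event_times, convergence_s=30.0):
    """Estimate how long routing spends re-converging.

    event_times: times in seconds at which the circuit changes state
    (up -> down or down -> up). Hypothetical input, not real data.
    convergence_s: assumed time to re-converge after one change.
    Overlapping re-convergence windows are merged into one.
    """
    total = 0.0
    window_start = None
    window_end = None
    for t in sorted(event_times):
        if window_end is None or t > window_end:
            # The previous window finished before this event: close it out.
            if window_end is not None:
                total += window_end - window_start
            window_start = t
        # Every state change restarts the convergence clock.
        window_end = t + convergence_s
    if window_end is not None:
        total += window_end - window_start
    return total


# One clean failover: the circuit drops once and stays down.
single_failover = [0]

# Flapping: the circuit changes state every 20 seconds for 10 minutes.
flapping = range(0, 600, 20)

print(disrupted_seconds(single_failover))  # 30.0
print(disrupted_seconds(flapping))         # 610.0
```

Under these assumptions a clean outage costs about 30 seconds of disruption, while a circuit that flaps faster than routing can settle never finishes converging for the whole ten minutes.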
If you look around on this forum, you'll see that other networks were affected too. This particular upstream has about 21 downstream networks including us, so we're certainly not alone. In addition, they've had a pretty rock-solid network (until now) and are extremely responsive in assisting us with routing issues and route improvements (which is a rarity in the wholesale space, and allows us to be more flexible) - so we've been patient with these issues until now. We've got other transit providers in play, and we immediately see a spike in traffic on their networks when this happens. Unfortunately, the failover is not instant - but we're doing everything we can to make it smoother. That said, we do have an active order for yet another provider, but their lead times are long (once again, I've been on their case today).
Of course, we'll be requesting an RFO (reason for outage) for this afternoon's outage and stressing that this is simply unacceptable, as these outages have been too frequent. From our side, you have my assurance that we are working on this, and if this provider can't sort it out, we'll move to another.



