1-Grid offline?

mutsu · Feb 1, 2022

Is anyone hosting DNS records with 1-Grid?
They have had an outage since yesterday, but there have been no notifications from the provider and the ETR is not known.
We have websites that are not resolving and e-mail that is not working.

WAslayer · Feb 1, 2022

mutsu said:
Is anyone hosting DNS records with 1-Grid?
They have had an outage since yesterday, but there have been no notifications from the provider and the ETR is not known.
We have websites that are not resolving and e-mail that is not working.

They did send an email to the email address registered on the account.

ld13 · Feb 1, 2022

How are you guys even logging in? I am simply redirected to the status page when I click on https://1-grid.com/client/clientarea.php

snobee · Feb 1, 2022

mutsu said:
Is anyone hosting DNS records with 1-Grid?
They have had an outage since yesterday, but there have been no notifications from the provider and the ETR is not known.
We have websites that are not resolving and e-mail that is not working.

Yes, we had the same problem. The world lost our email and site location data as their nameservers were dead to the world. Our service is up and running now, but it appears quite a few are not.

Thor · Feb 1, 2022

Dear Customers

As many of you are aware, 1-grid experienced a large-scale outage on the evening of Monday, 31 January 2022, affecting the vast majority of our customers. We wanted to provide a summary of what happened, why it took so long to resolve the issue, and what we are doing to avoid further risks in the future.

Firstly, I would like to personally apologise to all affected customers. This is unacceptable and we let you down. This shouldn’t have happened and whilst I would like to say it was outside our control, our choice of suppliers and the extent to which we audit their setup is something we should have done better on. Saying that it affected hundreds of other companies doesn’t make it better. All we can do at this point is be brutally honest about the causes with our customers.

Current State

We were in the process of migrating from our current datacentre in Pinelands (owned by Old Mutual and run by Africa Datacentres) to a new site at the Africa Datacentres Diep River facility. The incident last night was not related to the migration; however, its impact affected both sites, and this requires a bit of explanation.

We have now moved all physical servers to our new site in Diep River, where they are in the process of being brought back onto our network.

What happened last night?

Last night, we saw every single link from Pinelands to our other datacentres go down at the same time, both transit (Internet connectivity) as well as two links to Teraco and the link to Diep River. This is a bit like a plane with four engines having all of them stop mid-flight at the same time. This caused an outage for most customers who were routed in Pinelands, even if their server had been moved to Diep River. The root cause was a Liquid Networks issue with a major failure in Pinelands.

We proceeded to carry out an emergency migration of routing from Pinelands to Diep River (something we were not scheduled to do quite yet). Because we are moving vendors for the equipment doing this routing, the process is a bit more complicated than it would otherwise be. Nonetheless we are doing this as we believe it will help get some customers back before Liquid address their own equipment failure. We have also made some temporary changes to bring customers up quicker in some cases. We’re working hard to get everyone up by the morning.

We have tonight also been physically moving all the remaining servers in Pinelands to Diep River, as Liquid have indicated to us that they cannot fix the issue overnight. This is clearly not acceptable; however, we would much rather take control of the situation and look after our customers. This is a significant undertaking, as the move was supposed to happen over the next couple of weeks, spread over several nights. Staff from our directors to our technical team are on hand to help with the move.

The technically minded among you will wonder ‘why didn’t you just move routing as you went along?’ – that’s a very good question. We have a lot of legacy setup from years of acquisitions and we’ve made iterative improvements to increase capacity and resilience and to remove some legacy issues like large broadcast domains. Nevertheless, servers on the same VLAN may be in different sites during the migration, so this process would have been impossible to do perfectly, and we believed our three separate links between the sites would have been sufficient protection against any incidents.

We also found that our out-of-band access wasn’t working as expected; we have a setup that allows us to get into our routers even if our network is down. We have used this over the past few weeks; however, the setup had a glitch at the same time. This wasn’t the cause of the problem or a result of it, but it delayed us starting the diagnostic process. We will learn from that too and improve our monitoring.

The Future

We don’t propose to explain all the changes we will make in this post. We need to do a full incident analysis for that. However, in the meantime we would like to add a few comments.

Once again, I would like to personally apologise to all affected customers.

s898 · Feb 1, 2022

Are there any new updates from 1-grid? The last update on the status page is from 13 hours ago! Unbelievable.

Thor · Feb 1, 2022

Thor said:
Dear Customers

Secondly, as part of the datacentre move we already have plans to add another transit link in Teraco with a third party to work around Internet routing issues that sometimes arise (outside our control but which still affect our customers, and which we can ‘work around’). In addition to this, we will be re-examining the links Liquid provide to us to fully understand their setup and see whether we need another carrier for another link.

Smaller companies always suffer the most in the event of an incident, whilst larger ones (multinationals especially so) are assumed to simply suffer bad luck when things go badly wrong. We want to be open with our customers as to why this was so unexpected and that whilst we fully accept responsibility, it was a result of a failure which was difficult to expect. Nonetheless we should have questioned our suppliers’ assurances more. For that, we apologise unreservedly.

Yours sincerely,

Thomas Vollrath, Morne Patterson and the entire 1-grid team (who have worked tirelessly overnight to resolve this issue).

PurSpyk!! · Feb 1, 2022

Still cannot get any of my emails. This is really poor service; so much for companies' guarantees of uptime.

09:23: seems to be up and running again, emails are working. Clientzone opens but I cannot log on.

lowriderza · Feb 1, 2022

PurSpyk!! said:
Still cannot get any of my emails. This is really poor service; so much for companies' guarantees of uptime.

09:23: seems to be up and running again, emails are working. Clientzone opens but I cannot log on.

Webmail seems to be working for some domains. Luckily I only had two clients with issues today. Up to now at least...

Greg Christos · Feb 1, 2022

It took 3.4 mins for fetch/xhr to preload a 3.9 Mb video on one of my websites. There is something else going on here. I think there was a major change that no one has been notified about.

Greg Christos · Feb 1, 2022

Thor said:
Dear Customers

The problem is you are gaining a reputation for trouble whenever you perform these "migrations". When you moved international domains you failed to correctly update all domain zones. I am still having hangovers from that move. Also, many of the personnel at your support center are technically inept and fail to properly read tickets. It makes me concerned that the same technical deficiencies exist in your tech department.

bokka1 · Feb 1, 2022

And mine is gone again.

Greg Christos · Feb 1, 2022

Greg Christos said:
The problem is you are gaining a reputation for whenever you're performing these "migrations". When you moved international domains you failed to correctly update all domain zones. I am still having hangovers from that move. Also, many of your personnel at your support center are technically inept and fail to properly read tickets. It makes me concerned that the same technical deficiencies exist in your tech department.

Where did you find this message?

Rickster · Feb 1, 2022

People still using 1-grid?

Rickster · Feb 1, 2022

dualmeister said:
Please suggest an alternative to 1-Grip to host business email account for a small business?

Literally anyone else but these are some of the best:

domains.co.za
Microsoft 365
absolutehosting.co.za

Thor · Feb 1, 2022

dualmeister said:
Please suggest an alternative to 1-Grip to host business email account for a small business?

Home

For Professional Web Hosting, Domain Registrations, SSL Certificates and EPP Solutions trust an ICANN Accredited Registrar - Domains.co.za!

www.domains.co.za

W377!M · Feb 1, 2022

My server is still offline. More than 24h without service

MrTendai · Feb 1, 2022

It's 8pm and still no emails. #1-grip

Turing · Feb 1, 2022

W377!M said:
My server is still offline. More than 24h without service

Those poor techies and sysadmins must be shitting bricks by now.

Packet-Kollector · Feb 2, 2022

Wait, they seriously leaked customer data to other people via the client portal, and no one lost their minds?

Did I miss something? That seems like a mega huge deal.
