Hackers breached South Africa's top supercomputer and used it to mine crypto

CHPC was just one of hundreds globally that were compromised.

A scientist's SSH key was compromised and then drifted between institutions, and that was just one of the cases. Since all of us in HPC run older versions of Ubuntu, like Ubuntu 22.04 LTS, our systems are still susceptible to a large number of CVEs that nobody is willing to patch. A crazy number of 0-days and exploitable CVEs have also been made public over the last couple of weeks.

Packages on the scientist's jumpbox were also exploited: existing applications were removed and replaced with actively exploitable versions of the same software.

Remember also that institutions have something like 100 Gbps to 10 Tbps of internet breakout; half of the country's bandwidth runs through them. What do you think a firewall of this magnitude would cost? More than the HPC itself, hence pretty much all of them run without firewalls.
One of HUNDREDS compromised? This sounds incredibly sensationalist. An incident of that magnitude would trigger GDPR declarations. Think back to Europe 2020.

Your comment about "us" running older Ubuntu is also not accurate. My group runs numerous systems and they're Alma 9, Rocky 9, EL 9. Not a single Ubuntu and definitely not "old".

A scientist user with a rooted system is still not a risk to a hardened remote system. Root on your machine is not root on my machine.

You're also completely wrong about the bandwidth you're talking about. SANREN is 10 Gbps at most universities, not 100 Gbps to 10 Tbps.

SCinet reached just under 14 Tbps in St. Louis and that's a record.

I really wish you would stop exaggerating. You're spreading such bad information and I'm sure you think you know what you're talking about, but you plainly do not. Sorry to be so blunt.
 
Great insight, thanks!

Feels like a firewall is not required, just a simple packet filter on the existing router would do, but still, applying that to Tbps of traffic would probably significantly increase the load on the router. And of course, if the ssh key for a legit user is compromised, all bets are off anyway.
No. Please don't believe the OP.

https://www.sanren.ac.za/ supports significantly lower bandwidth than what OP said.

Also, we DO run firewalls if we are serious about security. There is no excuse for bypassing best practice. An HPC will have diverse landing points and data transfer nodes all of which are monitored, managed, and controlled.

A user's compromised SSH key will only give access to a remote system, not root-level permissions. Users are significantly locked down on clusters because they are such complex, user-intensive systems. Not to mention the NIST standards and health-data requirements that apply.

OP is being sensationalist and unfortunately is not accurate.
 
You know the CHPC was running Ubuntu? Thought it was CentOS.
What can be done better re. using SSH keys? The knowledge and technical ability of users vary widely.
CentOS.

Passphrase-protected SSH keys, with periodic expiry and rotation (painful but necessary).
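
As a rough illustration of the two checks mentioned above, here is a minimal user-side sketch in Python. It assumes the key is in the current OpenSSH private key format (older PEM-format keys are not handled), uses the file's modification time as a stand-in for key age, and uses a made-up 90-day policy. The function names are hypothetical, not part of any SSH tooling.

```python
import base64
import struct
import time
from pathlib import Path

MAGIC = b"openssh-key-v1\x00"   # header of the OpenSSH private key format
MAX_KEY_AGE_DAYS = 90           # hypothetical rotation policy


def openssh_key_cipher(pem_text: str) -> str:
    """Return the cipher name stored in an OpenSSH-format private key."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    body = "".join(l for l in lines if l and not l.startswith("-----"))
    blob = base64.b64decode(body)
    if not blob.startswith(MAGIC):
        raise ValueError("not an OpenSSH-format private key")
    offset = len(MAGIC)
    # The cipher name is a length-prefixed string right after the magic.
    (length,) = struct.unpack(">I", blob[offset:offset + 4])
    return blob[offset + 4:offset + 4 + length].decode("ascii")


def is_passphrase_protected(pem_text: str) -> bool:
    """An unencrypted key records the cipher name 'none'."""
    return openssh_key_cipher(pem_text) != "none"


def is_due_for_rotation(path, max_age_days=MAX_KEY_AGE_DAYS, now=None) -> bool:
    """Treat the file mtime as the key's age (an approximation only)."""
    now = time.time() if now is None else now
    age_days = (now - Path(path).stat().st_mtime) / 86400
    return age_days > max_age_days
```

A check like this only helps on the machine that holds the private key; the cluster itself cannot see whether a user's key has a passphrase, which is part of why enforcement is painful.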
 
Nobody said it was the same thing; there were multiple attack vectors.

This was a global-scale attack, not just CHPC.
Please provide proof or stop spreading speculation. You've already shown you don't know what you're talking about with HPC configuration, or even the bandwidth interlinking SA sites.
 
Understood that, in general, a user-level account doesn't give you root, but there have also been several local privilege escalation vulns in Linux. Not sure if you were able to patch in time, but that would certainly have been one way to compromise other user-level accounts.
 
Agreed, yes.

Leading HPC centers shut down all their systems when there is a vulnerability and rebuild from trusted sources. Mileage varies between sites, but best practice is generally applied to everything (HPC is no different; if anything, it is more secure). A local root exploit is a risk anywhere, and it's up to the admins to apply common sense.

Totally agree with you.
 
OT, but what grates me is that the latest privilege esc vulns have to be manually mitigated in Azure. CopyFail still hasn’t received a kernel update from them. My older Digital Ocean boxes have had updates.
 
How amenable are HPC servers/algorithms to "suspend and resume" their jobs? Or is it more a "take one box at a time out of service, migrate their workload to a different server, patch it, and bring it back online, repeat 5000 times, then do the same for the management server(s)"?
 
There are options for pre-empting or interrupting workloads. In some cases a workload's runtime can't be adequately predicted, so users risk running out of their allocated time window before completion. In those cases, users rely on snapshots ("checkpoints"), where data is periodically saved. If a job of "how long is a piece of string" length runs past the user's time window, they simply resubmit from their most recent successful checkpoint. It's a bit of a trick to decide how frequently to checkpoint, because writing to disk is a significant bottleneck.

However, checkpointing has to be set up at submission and cannot be applied retroactively. Most sites will typically set their nodes to DRAIN, which means a node is taken offline once its current jobs complete.
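
To make the checkpoint-and-resubmit idea concrete, here is a minimal Python sketch. It is not how any particular site or application does it: the JSON state, the `checkpoint_every` interval, and the `stop_after` argument (which only simulates the time window running out) are all assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write atomically so a job killed mid-write keeps the old checkpoint."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if there is none."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every, stop_after=None):
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulates the allocation window ending mid-job
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    if state["step"] >= n_steps:
        save_checkpoint(path, state)  # persist the finished result
    return state
```

A larger `checkpoint_every` means fewer disk writes but more recomputed work after an interruption, which is the trade-off described above.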

With epilog scripts you can force a node to reboot on job completion.

Since most HPC systems nowadays are stateless, they just reboot the compute nodes via PXE to an updated image.

Some sites, like CHPC, use stateful nodes, so these redeploy to local disk.

The standard best practice is:

Update image
Drain nodes
Epilog reboot
New image on PXE boot
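
The drain and reboot steps above could be scripted roughly as follows. This is a sketch that assumes Slurm as the scheduler (the thread does not name one) and only builds the commands rather than running them. The node list and reason string are placeholders, and `scontrol reboot` is shown as one possible alternative to a hand-written epilog reboot; it relies on a `RebootProgram` being configured in `slurm.conf`.

```python
import shlex


def rolling_update_commands(nodes, reason="kernel-update"):
    """Build (not run) Slurm commands for a drain-then-reboot update.

    `nodes` is a Slurm node list expression, e.g. "node[001-004]".
    """
    # Stop new jobs landing on the nodes; running jobs finish normally.
    drain = ["scontrol", "update", f"NodeName={nodes}",
             "State=DRAIN", f"Reason={reason}"]
    # Reboot each node once it is idle, then return it to service.
    # On a stateless cluster the node PXE-boots into the updated image.
    reboot = ["scontrol", "reboot", "ASAP", "nextstate=RESUME",
              f"reason={reason}", nodes]
    return [drain, reboot]


if __name__ == "__main__":
    for cmd in rolling_update_commands("node[001-004]"):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to review them before handing them to `subprocess.run` on a real management node.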
 
I'd take that pain if it means less downtime.

In instances like this (HPC breach, investigate, patch etc) what would be an unacceptable amount of downtime?
Trying to figure out the time bounds here.
 
One may argue any downtime is unacceptable but it depends on the nature of the downtime. If it is an avoidable and trivial exploit then even a millisecond should be considered unacceptable.

National systems will have a mandated SLA and they are usually treated as a sunk cost the moment procurement starts. Every minute contributes to a clock tracking obsolescence, whether the system is used or idling in a store room. Every minute is also calculated as an opportunity cost. If a system is used to mine crypto, there's the value of the stolen commodity, the cost of running the system, and the opportunity cost of the system compute hours for science. In a well governed environment this would mean all-hands-on-deck sleepless nights until it's back up.
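
The three cost components listed above can be added up per node-hour. The Python sketch below is only an illustration of that arithmetic: every rate is a placeholder to be replaced with real figures, and none of the numbers come from the CHPC incident.

```python
def misuse_cost(hours, node_count, stolen_value_per_node_hour,
                running_cost_per_node_hour, science_value_per_node_hour):
    """Break down the cost of a cluster being misused for `hours` hours."""
    node_hours = hours * node_count
    costs = {
        # Value of what the attacker mined with the hardware.
        "stolen_value": node_hours * stolen_value_per_node_hour,
        # Power, cooling, and staff cost of keeping the nodes running.
        "running_cost": node_hours * running_cost_per_node_hour,
        # Science that could have used the same compute hours.
        "opportunity_cost": node_hours * science_value_per_node_hour,
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # Placeholder rates, in any one currency, purely for illustration.
    print(misuse_cost(hours=10, node_count=100,
                      stolen_value_per_node_hour=0.5,
                      running_cost_per_node_hour=2.0,
                      science_value_per_node_hour=3.0))
```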

In the case of the 2020 incident, ARCHER in the UK was documented as offline for 2-3 weeks for forensics, cleaning, restoration. Other systems were supposedly shorter.

The difference is that ARCHER was hit first and was dealing with discovery with no leads. Subsequent systems, and any hit nowadays, have a wealth of data on what to look for and what to patch. If 2-3 weeks is still reasonable, then the next question is how much more time a system should get over and above the baseline ARCHER established.
 
Not sure how familiar you are with deploying HPC environments, but a basic management server with the necessary base images and configuration files can take one or two days. Re-using old automation scripts and digging up IaC makes it even easier.

The hardening is trivial (especially with automation) so the main bottlenecks are likely user management and file transfer for restoring user data.

If it was ONLY restoring the cluster, it's not unreasonable to expect it done within a week since that's how most of us do it anyway. Unless you're living in a bubble, you'll have access to automation scripts (yours or the community) and can get most things done pretty quickly.

It's the forensics and maybe remedial work that will take time, depending on the extent of the issue and the level of skill of the parties.
 
Thanks that is a wealth of context and the baseline helps.
In this case I am sure there's some sort of non-technical procedure/protocol that needs to be followed and this is adding to downtime. We are now in the fourth week.
 
Could be. Well, there SHOULD be non-technical protocols in place! Even the most robust of these would boil down to fairly quick timelines under normal operating conditions (but they heavily assume that best practices and standard protocols were implemented from the start).

To isolate and clone isn't complicated. Call it a week to preserve the various systems (but you don't need 1,000 compute node images).

To redeploy a cluster, call it a week.

These are very conservative numbers, since imaging isn't a strictly sequential "start, image everything, stop" process.

Imaging a management node takes hours, but to be incredibly generous let's call it two days for five copies. The node can then be reprovisioned immediately while other images are snapshotted elsewhere (everything runs concurrently, not sequentially, so allowing a week to clone an image per machine is laughably generous). But let's say it all takes a week, because reasons.

The investigation should start at day zero, so it can run in parallel with getting the cluster operational again.

The longer it takes past a couple of weeks to get operational, the more likely it is that things weren't implemented in the standard way to begin with, and that this isn't just a case of restarting things but of rethinking them.

That could involve hardening, internal investigations, or, heaven forbid, upskilling.
 