Comments like this are just not on.
lol. African excellence
One of HUNDREDS compromised? This sounds incredibly sensationalist. An incident of that magnitude would trigger GDPR declarations. Think back to Europe 2020.
CHPC was just one of hundreds globally that was compromised.
A scientist's SSH key was compromised and then it drifted between institutions, which was just one of the cases. Since all of us HPC sites run older versions of Ubuntu, like Ubuntu 22 LTS, we are still susceptible to a large number of CVEs that nobody is willing to patch. There has also been a crazy number of 0-days and exploitable CVEs pushed into the public over the last couple of weeks.
There were also packages exploited on the scientist jumpbox, with the existing applications removed and replaced with the same software in actively exploitable versions.
Remember also that institutions have something like 100 Gbps - 10 Tbps of internet breakout; half of the country's bandwidth runs through them. What do you think a firewall of this magnitude will cost? More than the HPC itself, hence pretty much all of them run without firewalls.
No. Please don't believe the OP.
Great insight, thanks!
Feels like a firewall is not required, just a simple packet filter on the existing router would do, but still, applying that to Tbps of traffic would probably significantly increase the load on the router. And of course, if the ssh key for a legit user is compromised, all bets are off anyway.
You know the CHPC was running Ubuntu? Thought it was CentOS.
What can be done better re. using SSH keys? The knowledge and technical ability of users vary widely.
You are definitely exposing yourself here. Show some respect, because you were commenting like a noob earlier. Restrain yourself.
AnOthEr noob trying to comment
Please provide proof or stop spreading speculation. You've already shown you don't know what you're talking about with HPC configuration, or even the bandwidth interlinking SA sites.
Nobody said it was the same thing; there were multiple attack vectors.
This was a global-scale attack, not just CHPC.
Understanding that in general, a user-level account doesn't give you root, but also that there have been several Local Privilege Escalation vulns in Linux. Not sure if you were able to patch in time, but it would certainly have been one way to compromise other user-level accounts.
No. Please don't believe the OP.
https://www.sanren.ac.za/ supports significantly lower bandwidth than what OP said.
Also, we DO run firewalls if we are serious about security. There is no excuse for bypassing best practice. An HPC will have diverse landing points and data transfer nodes all of which are monitored, managed, and controlled.
A user's compromised ssh key will only give access to a remote system, not root level permissions. Users are significantly locked down on clusters because they are such complex user-intensive systems. And not to mention the NIST standards and health data.
OP is being sensationalist and unfortunately is not accurate.
Agreed, yes.
How amenable are HPC servers/algorithms to "suspend and resume" their jobs? Or is it more a "take one box at a time out of service, migrate their workload to a different server, patch it, and bring it back online, repeat 5000 times, then do the same for the management server(s)"?
Leading HPC centers shut down all their systems when there is a vulnerability and rebuild from trusted sources. Mileage varies between sites, but best practice is generally applied to everything (HPC is no different; if anything it is potentially more secure). A local root exploit is a risk anywhere, and it's up to the admins to apply common sense etc.
Totally agree with you.
There are options for pre-empting or interrupting workloads. In some cases, a workload can't be adequately predicted, so users can risk running out of their allocation time window before completion. In those cases, users would use snapshots ("checkpoints") where data is periodically saved. If a job of unpredictable length runs past the user's time window, they would simply resubmit from their most recent successful checkpoint. It's a bit of a trick to decide how frequently to checkpoint, because writing to disk is a significant bottleneck.
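The checkpoint-and-resubmit pattern described above can be sketched roughly like this. It is only a toy illustration, not how any particular scheduler or application does it: `run_job`, `save_checkpoint`, the JSON state file, and the `max_steps_this_run` parameter (standing in for the allocation time window) are all made-up names for the example.

```python
import json
import os
import tempfile


def save_checkpoint(state, checkpoint_path):
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, checkpoint_path)


def run_job(total_steps, checkpoint_path, checkpoint_every, max_steps_this_run=None):
    """Toy workload: sum the integers 0..total_steps-1, checkpointing as it goes.

    Returns the result, or None if the (hypothetical) time window ran out first.
    """
    state = {"step": 0, "acc": 0}
    # Resume from the most recent successful checkpoint, if there is one.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    steps_done = 0
    while state["step"] < total_steps:
        if max_steps_this_run is not None and steps_done >= max_steps_this_run:
            return None  # window expired; work since the last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        steps_done += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state["acc"]
```

A smaller `checkpoint_every` means less work is redone after an interruption but more time is spent writing to disk, which is the trade-off mentioned above.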
I'd take that pain if it means less downtime.
CentOS.
Password-protected SSH keys, plus periodic expiry and rotation (painful but necessary).
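For the periodic expiry part, a rough sketch of how an admin might flag keys that are due for rotation. Everything here is an assumption for illustration: the `keys_due_for_rotation` helper, the idea that issue dates are tracked per user somewhere, and the 90-day maximum age are not from any specific site's policy.

```python
from datetime import date, timedelta


def keys_due_for_rotation(issued, today, max_age_days=90):
    """Return the users whose key is at least max_age_days old.

    issued: dict mapping username -> date the key was issued (hypothetical record).
    max_age_days: assumed rotation policy; 90 is just an example value.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(user for user, issued_on in issued.items() if issued_on <= cutoff)


# Example with made-up users and dates.
issued = {"alice": date(2024, 1, 1), "bob": date(2024, 5, 1)}
due = keys_due_for_rotation(issued, today=date(2024, 6, 1))
```

In this example only `alice` is flagged, since her key is well past 90 days old while `bob`'s is about a month old.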
One may argue any downtime is unacceptable but it depends on the nature of the downtime. If it is an avoidable and trivial exploit then even a millisecond should be considered unacceptable.
In instances like this (HPC breach, investigate, patch etc) what would be an unacceptable amount of downtime?
Trying to figure out the time bounds here.
Not sure how familiar you are with deploying HPC environments, but a basic management server with the necessary base images and configuration files can take one or two days. Re-using old automation scripts and digging up IaC makes it even easier.
Thanks, that is a wealth of context and the baseline helps.
The hardening is trivial (especially with automation) so the main bottlenecks are likely user management and file transfer for restoring user data.
If it was ONLY restoring the cluster, it's not unreasonable to expect it done within a week since that's how most of us do it anyway. Unless you're living in a bubble, you'll have access to automation scripts (yours or the community) and can get most things done pretty quickly.
It's the forensics and maybe remedial work that will take time, depending on the extent of the issue and the level of skill of the parties.
Could be. Well, there SHOULD be non-technical protocols in place! And even the most robust of these would yield timelines that are fairly quick under normal operating conditions (but they heavily assume best practices and standard protocols are being implemented from the start).
In this case I am sure there's some sort of non-technical procedure/protocol that needs to be followed and this is adding to downtime. We are now in the fourth week.
Trust everybody but trust nobody.
What if the compromise was done in the dumbest way possible? Somebody leaked the private keys and admin credentials.