AWS HPC Blog
October was busy for HPC in the cloud
It’s been a busy month in the world of HPC on AWS: we’ve seen new data sets, refinements to cluster operations, and deeper thinking about how workloads map to infrastructure. For our customers driving R&D with HPC, those changes matter (and yes, the nerd in me is quietly excited).
In today’s post, we’ll tell you about a data-driven method for choosing the right compute instances for your specific workloads. We’ll cover the new Slurm custom settings feature in AWS Parallel Computing Service (AWS PCS). And we’ll even give you some advice on when to choose PCS versus AWS ParallelCluster. We’ll give you the scoop on our new podcast, too.
Don’t forget to come and find us in St Louis at SC’25 in a few weeks (details later in this post).
On the blog channel
A scientific approach to workload‑aware computing presented a data-driven methodology for selecting compute instances based on workload patterns. The post walks through a benchmark-based, multi-dimensional scoring framework and then shows how it plays out in real-world HPC applications.
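The post’s framework is richer than we can capture here, but the core idea is simple: score each instance type across several benchmark dimensions, weighted by what your workload actually cares about. Here’s a minimal, illustrative sketch in Python; the instance types, dimensions, weights, and numbers below are made up for the example and aren’t taken from the post.

```python
# Illustrative only: a toy version of benchmark-based, multi-dimensional
# instance scoring. All values below are placeholders for the sketch.

# Normalized benchmark results per instance type (higher is better).
benchmarks = {
    "hpc7a.96xlarge": {"compute": 0.95, "memory_bw": 0.90, "network": 0.85, "cost": 0.70},
    "c7i.48xlarge":   {"compute": 0.80, "memory_bw": 0.65, "network": 0.60, "cost": 0.85},
    "r7iz.32xlarge":  {"compute": 0.75, "memory_bw": 0.80, "network": 0.55, "cost": 0.60},
}

# Weights describe what your workload actually cares about
# (here: a hypothetical tightly-coupled CFD-style profile).
workload_weights = {"compute": 0.4, "memory_bw": 0.3, "network": 0.2, "cost": 0.1}

def score(instance_metrics: dict, weights: dict) -> float:
    """Weighted sum across benchmark dimensions."""
    return sum(instance_metrics[dim] * w for dim, w in weights.items())

# Rank instance types for this workload profile.
ranked = sorted(benchmarks, key=lambda i: score(benchmarks[i], workload_weights), reverse=True)
for instance in ranked:
    print(f"{instance}: {score(benchmarks[instance], workload_weights):.3f}")
```

Changing the weights to match a different workload profile (memory-bound, I/O-bound, cost-sensitive) changes the ranking, which is the whole point of making the selection data-driven rather than habitual.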
A new dataset of protein‑ligand complexes is now available in the Registry of Open Data on AWS (RoDA). This is the culmination of a lot of work between our customers at IIT Hyderabad and our own teams. They’ve released a comprehensive dataset of over 16,000 protein-ligand complexes, including molecular dynamics trajectories. It’s expected to be popular with the machine learning and structural-biology communities (like those in drug design). RoDA itself is a trove of datasets used by the world’s R&D community every day, and well worth a look if you’ve never seen it before.
We announced expanded support for custom Slurm settings in AWS Parallel Computing Service (PCS). The post details how PCS now supports more than 65 configurable Slurm parameters, including queue-specific settings. This addresses customer requests for more control over scheduling and fair-share policies, among other settings.
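To make that a bit more concrete, here’s a minimal sketch of what custom Slurm settings look like from the AWS SDK for Python (boto3), assuming the slurmCustomSettings name/value shape from the PCS API. The subnet and security group IDs are placeholders, and the two parameters shown are just examples of scheduling and fair-share knobs; check the PCS documentation for the current list of supported settings (queue-level settings take the same shape).

```python
# Minimal sketch: creating a PCS cluster with custom Slurm settings.
# IDs and parameter choices below are placeholders, not recommendations.
import boto3

pcs = boto3.client("pcs", region_name="us-east-1")

response = pcs.create_cluster(
    clusterName="demo-cluster",
    scheduler={"type": "SLURM", "version": "25.05"},
    size="SMALL",
    networking={
        "subnetIds": ["subnet-0123456789abcdef0"],     # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
    },
    slurmConfiguration={
        "slurmCustomSettings": [
            # Example scheduling/fair-share knobs; verify each against the
            # PCS supported-parameters list before relying on them.
            {"parameterName": "PriorityType", "parameterValue": "priority/multifactor"},
            {"parameterName": "PriorityWeightFairshare", "parameterValue": "1000"},
        ]
    },
)
print(response)
```

The dynamic cluster updates feature (covered in the service changes below) means you can apply the same kind of settings to an existing cluster without rebuilding it.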
Since PCS and ParallelCluster both use Slurm to serve similar customers, we also wrote up some of our thoughts about what makes them different. The obvious conclusion to draw (and it’s a fair one) is that since PCS was created long after ParallelCluster, PCS benefitted from what we learned developing and supporting its predecessors: CfnCluster and ParallelCluster. It’s newer, and it contains features and functionality that address the things we just didn’t know about until customers started using those tools in production. If you don’t want to read the blog post, you can hear about it from me in video form on the HPC Tech Shorts channel.
Re-introducing HPC Tech Shorts on YouTube
In October we started shifting the HPC Tech Shorts channel on YouTube to a more podcast-style format. As we ramp up, expect stories from HPC leaders around the world who are using the things we build to solve hard problems.
Our first episode focused on the work we’re doing with Seqera, Arm, and the NF-Core community to port bioinformatics applications in BioConda to the arm64 architecture, to give the bioinformatics community more choices. We’re well over the 90% mark for applications that build and run on Graviton. Angel Pizarro explained how achieving this could potentially knock a third or more off the cost of computing in the genomics field, putting critical discoveries within reach of more researchers.
In our second episode, Matt Vaughn walked us through the work the PCS Engineering team has done to turn Slurm Accounting into a feature you can enable with a single check-box.
Matt came back to Tech Shorts the following week to talk about the work we’ve been doing under the hood in Research and Engineering Studio on AWS. This is our portal for R&D workers, which abstracts away much of the effort involved in getting safe, secure, and powerful desktops. But because portal products are mostly recognized for their graphical user interfaces, we often overlook the work that goes on under the covers to enable more powerful things, hidden behind a button.
There’s more to come in the next month or two as we expand on the podcast format.
Service changes you might have missed
Delivering services in the cloud is a story of constant improvement, building on what came before. Now that Slurm Accounting is a check-box feature in PCS, no one ever needs to build the machinery to support it again. That means we can deliver new things at a faster pace.
Here are some of the minor, but important, things we delivered for you this month.
[PCS] Slurm instance reboot via scontrol – allows customers to reboot compute nodes for resource cleanup and bootstrap actions through the Slurm CLI. That means you can restart an instance without replacing it, while Slurm waits for the node to recycle (there’s a short sketch of this after the list).
[PCS] Dynamic cluster updates – lets you modify key Slurm settings like accounting and workload parameters on existing clusters without rebuilds, eliminating downtime for configuration changes and addressing major operational pain points.
[PCS] IPv6 – lets you deploy PCS clusters on IPv6 networking infrastructure, which is really useful for large-scale HPC deployments.
[PCS] Slurm 25.05 – this was quite a packed release for Slurm. It allows you to target multiple PCS clusters from a single login node. You can leverage granular retry strategies to react to capacity failures, and it improves the flow for multi-step jobs by requiring less controller coordination.
[PCS] Slurm secret key rotation – helps you meet compliance requirements for FedRAMP and HIPAA eligibility by automatically rotating Slurm cluster authentication keys.
[PCS] Distro expansion – PCS now supports Ubuntu 24, Rocky 8, RHEL 8, and Amazon Linux 2023.
ParallelCluster 3.14 – this release includes support for P6e-GB200 and P6-B200 instance types, prioritized allocation strategies for optimized instance placement, and Amazon DCV support for Amazon Linux 2023. It also surfaces chef-client logs in the instance’s system log (visible from the instance console), and adds Amazon Linux 2023 with kernel 6.12. There’s more in the ParallelCluster 3.14.0 release notes.
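And here’s the short sketch promised above for the scontrol reboot item: a tiny Python wrapper around the Slurm CLI that asks Slurm to reboot a node. The node name is a placeholder, and this assumes you’re running it from a host where the Slurm CLI is already configured against your PCS cluster.

```python
# Minimal sketch: requesting a compute-node reboot through Slurm's CLI.
# The node name below is a placeholder; substitute a real node from sinfo.
import subprocess

def reboot_node(node_name: str, reason: str = "resource_cleanup") -> None:
    """Ask Slurm to reboot the node; Slurm handles scheduling the reboot
    and returns the node to service once it comes back."""
    subprocess.run(
        ["scontrol", "reboot", f"reason={reason}", node_name],
        check=True,
    )

reboot_node("compute-1")  # placeholder node name
```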
Supercomputing 2025 (SC’25) in St Louis
Once again, we’ll be at SC’25 in St Louis with our partners and our amazing technical team. In this era of change, we’re reminding everyone that the cloud is your human progress catalyst (HPC). Our goal remains to put the most powerful and scalable tools in the hands of the world’s rarest people: the scientists and engineers who are trying to solve the toughest problems.
The complete catalog of what we’re up to is on our community site. If you come to the Arm HPC gathering on Sunday evening, I’ll be super happy to see you. But if you can’t make that, don’t forget to stop by and talk to our nerds (and pick up some nerd swag) at our booth (#2207) at the front of the expo.
Conclusion
This is just a small snapshot of a month’s activity for us and our customers. HPC has always been a dynamic place to work, but with the injection of velocity the cloud provides, things move quickly.
If you’re running or planning HPC workflows on AWS, there’s a lot here to act on. Whether it’s refining your instance selection, leveraging rich open data for ML + simulation, or tuning your scheduler policies, you’ve got momentum.
If you want help getting started, reach out to your AWS account team. There’s also a trove of material to help you understand AWS in the context of HPC at our community site (day1hpc.com). If you’re more experienced with AWS, you can explore the HPC Recipe Library on GitHub to find whole clusters ready to deploy (or modify).