AWS Public Sector Blog
How NIH scientists unlocked cardiovascular disease insights using AWS
Scientists at the National Institutes of Health (NIH) recently uncovered how a structure known as low-density lipoprotein (LDL), which transports “bad” cholesterol through the bloodstream, interacts with its receptor molecule to enter cells—information that has eluded researchers for decades. The findings could lead to more personalized treatments for cardiovascular disease and were enabled by cutting-edge high performance computing (HPC) infrastructure from Amazon Web Services (AWS).
The challenge
Scientists use cryogenic electron microscopy (cryo-EM) to determine the 3D structure of biomolecules at near-atomic resolution. In cryo-EM, protein samples are flash-frozen in vitreous ice and imaged with a transmission electron microscope, yielding 3D digital representations of microscopic structures in near-native states. These experiments produce terabytes (TB) of data per sample, and the data resolution increases with each new generation of instruments. The data needs to be processed on HPC resources using a complex pipeline accelerated by one or more graphics processing units (GPUs). Because cryo-EM requires iterative processing of large datasets, reducing costs and operational overhead while increasing processing speeds is critical to enhancing the quality of structural biology research. Understanding how LDL interacts with its receptor (LDLR) has also been a major challenge in cardiovascular research for decades. Previous attempts to visualize this interaction were limited by LDL’s large size and structural complexity, as well as by inadequate computing power and data storage.
The process
The workflow of data collection is shown in the following figure and is discussed in detail in Cryo-electron microscopy: A primer for the non-microscopist by Milne et al.
- During the cryo-EM process, a homogeneous, highly pure protein sample is applied to cryo-EM grids. The sample is rapidly frozen in liquid ethane in a thin layer of vitreous ice.
- Images are recorded as movies on a transmission electron microscope.
- Movie frames are then aligned to reduce the effects of drift or inability of a microscope to maintain the selected focal plane over an extended period of time.
- Particles are picked from each micrograph.
- Particles representing the same view are then grouped together to create 2D images.
- 2D images are then computationally aligned to generate a 3D map.
- Using sophisticated modeling, 3D classification can identify different conformational states of the protein, that is, changes in the shape of the macromolecule.
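To make the frame-alignment step more concrete, here is a toy sketch of drift correction by cross-correlation. It is not the algorithm used by any particular cryo-EM package: real motion correction works on 2D frames with sub-pixel, often patch-based shifts. The function names `best_shift` and `align_and_average` are hypothetical, and frames are simplified to 1D lists of intensities.

```python
def best_shift(reference, frame, max_shift):
    """Return the integer shift s that maximizes sum(reference[i] * frame[i + s])."""
    n = len(reference)
    best, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = 0.0
        for i in range(n):
            j = i + s
            if 0 <= j < n:
                score += reference[i] * frame[j]
        if score > best_score:
            best, best_score = s, score
    return best


def align_and_average(frames, max_shift=3):
    """Shift every frame onto the first one, then average the aligned frames."""
    reference = frames[0]
    n = len(reference)
    total = [0.0] * n
    for frame in frames:
        s = best_shift(reference, frame, max_shift)
        for i in range(n):
            j = i + s
            if 0 <= j < n:
                total[i] += frame[j]  # undo the estimated drift
    return [value / len(frames) for value in total]


if __name__ == "__main__":
    first = [0, 0, 1, 5, 1, 0, 0, 0]
    drifted = [0, 0, 0, 0, 1, 5, 1, 0]  # same signal, moved two pixels
    print(best_shift(first, drifted, 3))
    print(align_and_average([first, drifted]))
```

Averaging without the shift would smear the peak across several pixels, which is the blurring that alignment is meant to reduce.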
The solution
NIH researchers needed powerful computational resources to process massive amounts of imaging data for their research. The cryo-EM dataset used to determine the LDL structure contained over 35,000 movies. A typical movie is about 0.5 gigabytes (GB) in size, resulting in approximately 17.5 TB per dataset. Data processing can increase the size of the data by up to 5 times. In addition to data storage, cryo-EM data processing is computationally intensive. For example, determining the structure of a yeast spliceosomal complex required more than half a million CPU hours of classification and high-resolution refinement, as described in eLife. The adoption of GPUs to alleviate this computational bottleneck has transformed the cryo-EM field. Many of the common cryo-EM software packages have been redesigned to take advantage of recent advances in GPU technology and can now run many independent tasks simultaneously.
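The storage figures above follow from simple arithmetic, sketched below. The sketch assumes decimal units (1 TB = 1,000 GB) and treats the processing expansion as a multiplier on the raw dataset size. Both are assumptions, as are the constant and function names.

```python
MOVIES_PER_DATASET = 35_000   # from the LDL dataset described above
GB_PER_MOVIE = 0.5            # typical movie size
GB_PER_TB = 1_000             # assumption: decimal units


def raw_dataset_tb(movies=MOVIES_PER_DATASET, gb_per_movie=GB_PER_MOVIE):
    """Raw dataset size in TB."""
    return movies * gb_per_movie / GB_PER_TB


def processed_dataset_tb(raw_tb, expansion_factor):
    """Footprint after processing, assuming a simple multiplier on raw size."""
    return raw_tb * expansion_factor


if __name__ == "__main__":
    raw = raw_dataset_tb()
    print(f"raw: {raw} TB")
    print(f"after 5x expansion: {processed_dataset_tb(raw, 5)} TB")
```

Under these assumptions, a single dataset at the 5x end of the range occupies 87.5 TB, which is why storage tiering matters as much as compute.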
As an outcome of this research, NIH scientists were able to show for the first time how ApoB100, the main structural protein of LDL, binds to its receptor (the process that starts the clearance of LDL from the blood) and what happens when that process is impaired in familial hypercholesterolemia, a disease that often leads to early heart disease. To complete the research, the NIH research team leveraged several AWS services and capabilities. The following figure shows an AWS HPC environment used by the NIH research team.
The impact to cardiovascular research
The AWS infrastructure has revolutionized NIH’s research capabilities, enabling the processing of more than 35,000 molecular movies per dataset while efficiently managing 17.5 TB of raw data per experiment, which typically expands by 3 to 5 times during processing. The move to AWS has dramatically accelerated project completion rates, with 20 new structures determined in just 12 months, in contrast to the 2 to 3 years typically required with on-premises implementations. AWS compute has delivered up to 12 times faster processing than traditional on-premises systems. Additionally, the cloud infrastructure has significantly enhanced collaboration among researchers, making it easier for multiple teams to work simultaneously on complex datasets and share findings in real time, ultimately accelerating the pace of scientific discovery.
The results of this groundbreaking research are published in the journal Nature and might lead to more highly targeted drugs for reducing blood cholesterol. The computational infrastructure established on AWS continues to support research initiatives at NIH, thereby demonstrating the power of cloud computing in advancing biomedical research.
Additional benefits of the AWS HPC solution
The AWS environment allows for both burst computing needs and sustained HPC workloads while maintaining security and performance requirements. This architecture is utilized for the cryo-EM research to accelerate:
- Data collection and processing:
  - High-speed data ingestion from electron microscopes through AWS Direct Connect
  - Raw image data can be initially stored in Amazon FSx for Lustre for immediate processing
- GPU queues (G4/G5/G6), which are crucial for:
  - Image alignment
  - 2D classification
  - 3D reconstruction
  - Particle picking
  - Contrast transfer function (CTF) estimation
- Storage management:
  - Active datasets stored in Amazon S3 Standard for frequent access
  - Completed projects moved to S3 Intelligent-Tiering for cost optimization
  - FSx for high-performance shared storage for processing pipelines
  - DRA ensures seamless data movement between storage tiers
- Computational workflows:
  - Multiple processing queues support different computational needs:
    - CPU Queue for basic preprocessing
    - GPU Queue for intensive image processing
    - Multi-GPU Queue for complex 3D reconstructions
  - Parallel processing capabilities for handling large datasets
- Research collaboration:
  - Secure VPC environment for data protection
  - Shared volumes enable team collaboration
- Cost management:
  - Scalable resources based on processing demands
  - Storage tiering optimizes costs for long-term data retention
  - Pay-per-use model for compute resources
The solution can also be integrated with common cryo-EM software packages and scaled according to research requirements.
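As an illustration of how processing steps might be routed to the CPU, GPU, and multi-GPU queues listed above, the following sketch builds a Slurm `sbatch` command line for each step. The partition names (`cpu`, `gpu`, `multi-gpu`), the GPU counts, the step names, and the script names are all hypothetical. Only the `sbatch` options `--partition`, `--job-name`, and `--gres` are standard Slurm.

```python
# Hypothetical mapping from pipeline step to Slurm partition (queue).
QUEUE_FOR_STEP = {
    "preprocessing": "cpu",
    "image_alignment": "gpu",
    "ctf_estimation": "gpu",
    "particle_picking": "gpu",
    "2d_classification": "gpu",
    "3d_reconstruction": "multi-gpu",
}

# Assumed number of GPUs requested per job in each partition.
GPUS_PER_QUEUE = {"cpu": 0, "gpu": 1, "multi-gpu": 4}


def sbatch_command(step, script):
    """Return the sbatch argument list for running `script` as `step`."""
    partition = QUEUE_FOR_STEP[step]
    command = ["sbatch", f"--partition={partition}", f"--job-name={step}"]
    gpus = GPUS_PER_QUEUE[partition]
    if gpus:
        command.append(f"--gres=gpu:{gpus}")  # request GPUs only where needed
    command.append(script)
    return command


if __name__ == "__main__":
    print(" ".join(sbatch_command("preprocessing", "prep.sh")))
    print(" ".join(sbatch_command("3d_reconstruction", "refine.sh")))
```

Keeping the step-to-queue mapping in one table makes it straightforward to send only the heaviest jobs to the more expensive multi-GPU instances.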
The key components of the solution include:
- Network connectivity:
  - AWS Direct Connect provided dedicated network connectivity from NIH on-premises to AWS
  - The environment is in a VPC (virtual private cloud) with private subnets for security
- Compute resources:
  - Different types of compute queues are available:
    - CPU Queue for standard compute tasks
    - GPU Queue utilizing Nvidia GPU instances for GPU-accelerated workloads
    - Multi-GPU Queue with a variety of GPU (G4/G5/G6) instances for parallel GPU processing
  - AWS ParallelCluster to create the HPC cluster
- Storage layer:
  - Amazon FSx for Lustre for high-performance shared file system
  - Shared volumes (gp3) for temporary storage
  - Tiered storage strategy:
    - Amazon Simple Storage Service (S3) Standard Storage Class for active data
    - Amazon S3 Intelligent-Tiering Storage Class for archived data
  - Data repository association (DRA) between Amazon FSx for Lustre and S3 for automated data movement, enabling long-term retention and a smaller high-performance file system
  - Data repository task (DRT) to tier data from FSx for Lustre to S3 for objects not actively used, with atime (access time) greater than seven days
- Management and monitoring:
  - Amazon CloudWatch for monitoring and metrics
  - AWS Identity and Access Management (IAM) for access control and security
  - Amazon EventBridge for scheduling and automation
  - Slurm scheduler for managing HPC workloads and job queues
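The seven-day access-time rule used for tiering can be illustrated with a small standalone sketch. It only shows the selection criterion (files whose atime is more than seven days old) on an ordinary directory tree. In the architecture described here, the actual movement of data is handled by FSx for Lustre data repository tasks, not by a script like this, and the function name `idle_files` is hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 86_400


def idle_files(root, min_idle_days=7, now=None):
    """Return sorted paths under `root` whose last access is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * SECONDS_PER_DAY
    idle = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:  # not read for more than N days
                idle.append(path)
    return sorted(idle)


if __name__ == "__main__":
    for path in idle_files("."):
        print(path)
```

Note that some file systems are mounted with options such as `noatime` or `relatime`, in which case access times may not reflect every read.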
 
 
Curious to learn more?
Learn more about building GPU-enabled cryo-EM workflows on AWS. Check out how AWS services provide an agile, modular, and scalable architecture to optimize cryo-EM workflows, providing a pathway to wider adoption of cryo-EM as a standard tool for structural biology.
Learn more about AWS solutions for healthcare and life sciences.


