Amazon FSx for Lustre customers
DatologyAI
DatologyAI builds tools that automatically select the best data on which to train deep learning models.
“We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing GPU resources. The pre-built Grafana dashboards will give us exactly what we need, with immediate visibility into critical metrics—from task-specific GPU utilization to file system (FSx for Lustre) performance—without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems.”
Josh Wills, Member of Technical Staff at DatologyAI
Apoidea Group
Apoidea develops AI-powered solutions for multinational banks using cutting-edge generative AI and deep learning technologies. Their flagship product, SuperAcc, is a sophisticated document processing service that employs proprietary models to handle diverse financial documents, including bank statements and KYC forms. This technology has dramatically improved efficiency in the banking sector, reducing financial spreading processing time from 4-6 hours to just 10 minutes.
To support this development, Apoidea utilizes Amazon SageMaker HyperPod, which provides a scalable and flexible environment for large-scale model training. SageMaker HyperPod features distributed training management, seamless data synchronization with FSx for Lustre, and customizable environments, all of which enhance ML workflow efficiency.
Adobe
Founded 40 years ago on the simple idea of creating innovative products that change the world, Adobe offers groundbreaking technology that empowers everyone, everywhere to imagine, create, and bring any digital experience to life. Rather than rely on open-source models, Adobe decided to train its own foundational generative AI models tailored for creative use cases. Adobe created an AI superhighway on AWS to build an AI training platform and data pipelines to rapidly iterate models. Adobe used Amazon FSx for Lustre high-performance file storage for fast access to data and to make sure GPU resources are never left idle.
"It's easy to think I'll create my own AI cloud, but the partnership with AWS lets us focus on our differentiators."
Alexandru Costin - Vice President, Generative AI and Sensei at Adobe
LG AI Research
LG AI Research, the artificial intelligence (AI) research hub of South Korean conglomerate LG Group, was founded to promote AI as part of the group's digital transformation strategy to drive future growth. The research institute developed its foundation model, EXAONE, within one year using Amazon SageMaker and Amazon FSx for Lustre. The foundation model mimics humans as it thinks, learns, and takes actions on its own through large-scale data training. The multi-purpose foundation model can be employed in various industries to carry out a range of tasks.
Paige
Paige, a leading digital pathology provider, sought to enhance its AI and ML models for cancer diagnosis but faced limitations with on-premises solutions. To overcome this, Paige adopted Amazon EC2 P4d Instances and Amazon FSx for Lustre, integrating the latter with Amazon S3 buckets for efficient handling of petabytes of ML input data. This AWS infrastructure enabled Paige to process data without manual prestaging on high-performance file systems. As a result, Paige achieved a tenfold increase in data training capacity and 72% faster internal workflows.
"By connecting Amazon FSx for Lustre to Amazon S3, we can train on 10 times the amount of data that we have ever tried in the on-premises infrastructure without any trouble."
Alexander van Eck, staff AI engineer - Paige
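The S3 link Paige describes is configured when the file system is created. Below is a minimal sketch using the AWS SDK for Python (boto3); the bucket name, subnet ID, and capacity are hypothetical placeholders, not Paige's actual configuration.

```python
def build_lustre_request(bucket: str, subnet_id: str, capacity_gib: int = 1200) -> dict:
    """Build a CreateFileSystem request linking FSx for Lustre to an S3 bucket.

    With ImportPath set, the file system lazily loads objects from S3 on first
    access, so training jobs can start without manually prestaging data onto
    the high-performance file system.
    """
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": capacity_gib,          # size in GiB
        "SubnetIds": [subnet_id],
        "LustreConfiguration": {
            "DeploymentType": "SCRATCH_2",
            "ImportPath": f"s3://{bucket}",          # lazy-load training data from S3
            "ExportPath": f"s3://{bucket}/results",  # write results back to S3
        },
    }

# Issuing the call requires AWS credentials and a real VPC subnet:
#   import boto3
#   fsx = boto3.client("fsx")
#   fsx.create_file_system(**build_lustre_request("my-ml-data", "subnet-0abc1234"))
```

A scratch deployment type suits transient training runs; persistent deployment types trade some cost for durability.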
Toyota
Toyota Research Institute (TRI) collects and processes large amounts of sensor data from its autonomous vehicle (AV) test drives. Each training data set is staged on an on-premises NAS device and transferred to Amazon Simple Storage Service (Amazon S3) before processing on a powerful GPU compute cluster. TRI needed a high-performance file system to pair with its compute resources, speed up its ML model training, and accelerate insights for its data scientists. TRI chose FSx for Lustre to reduce object recognition machine learning training times.
"We needed a parallel file system for our ML training data sets and chose Amazon FSx for Lustre for its higher availability and durability, compared to our legacy file system offering. The integration with AWS services, including S3, also made it the preferred option for our high performance file storage."
David Fluck, Software Engineer - Toyota Research Institute
Shell
Shell offers a dynamic portfolio of energy options – from oil, gas, and petrochemicals to wind, solar, and hydrogen – and is proud to supply the energy its customers need to power their lives. Shell relies on HPC for model building, testing, and validation. From 2020 to 2022, GPU utilization averaged less than 90%, resulting in project delays and limitations on new algorithm experimentation. Shell augments its on-premises compute capacity by bursting to the cloud with Amazon EC2 clusters and Amazon FSx for Lustre. This solution gives Shell the capability to quickly scale up and down and to purchase additional compute capacity only when needed. Shell’s GPUs are now fully utilized, reducing the cost of compute and accelerating machine learning model testing.
Netflix
Netflix uses large-scale distributed training for media ML models that generate post-production thumbnails, VFX, and trailers for thousands of videos and millions of clips. Netflix was experiencing long waits due to cross-node replication and 40% GPU idle time.
Netflix re-architected its data loading pipeline and improved its efficiency by pre-computing all video and audio clips. Amazon FSx for Lustre performance enables Netflix to saturate its GPUs and virtually eliminate GPU idle time. Netflix now sees a 3-4x improvement using pre-computation and FSx for Lustre, reducing model training time from a week to 1-2 days.
Production of the fourth season of Netflix’s episodic drama “The Crown” faced unexpected challenges, as the world went into lockdown for the COVID-19 pandemic just as post-production VFX work was slated to begin. By adopting a cloud-based workflow on AWS, including Amazon FSx for Lustre for enhanced throughput, Netflix’s in-house VFX team of 10 artists was able to seamlessly complete more than 600 VFX shots for the season’s 10-episode run in just 8 months, all while working remotely.
Storengy
Storengy, a subsidiary of the ENGIE Group, is a leading supplier of natural gas. The company offers gas storage, geothermal solutions, carbon-free energy production, and storage technologies to enterprises worldwide.
To ensure its products are properly stored, Storengy uses high-tech simulators to evaluate underground gas storage, a process that requires extensive use of high-performance computing (HPC) workloads. The company also uses HPC technology to run natural gas discovery and exploration jobs.
"Because of AWS, we have the scalability and high availability to perform hundreds of simulations at a time. Additionally, the solution scales automatically up or down to support our peak workload periods, which means we don’t have any surprises with our HPC environment."
Jean-Frederic Thebault – Engineer, Storengy
Smartronix
Smartronix leverages FSx for Lustre to deliver reliable high performance for their SAS Grid deployments.
Smartronix provides cloud solutions, cyber security, systems integration, worldwide C5ISR and data analytics, and mission-focused engineering for many of the world's leading commercial and federal organizations. Smartronix relied on SAS Grid to analyze and deliver daily statewide COVID statistics, and found its self-managed parallel file system difficult to administer and protect.
"Collaborating with AWS and leveraging their managed solutions like FSx for Lustre has allowed us to serve our customers better – with higher availability and 29% lower cost than self-managed file systems."
Rob Mounier – Senior Solutions Architect, Smartronix
Hyundai
Hyundai Motor Company, a global automotive manufacturer exporting to over 200 countries, uses semantic segmentation for autonomous driving to classify image pixels into categories like roads, people, and buildings.
To improve model accuracy and meet deadlines, Hyundai implemented Amazon SageMaker for automated training and data parallelism across multiple GPUs, along with Amazon FSx for Lustre and S3 for efficient data storage and processing. These solutions helped Hyundai achieve 93% scaling efficiency with 64 GPUs while eliminating data wait times.
Rivian
Amazon FSx for Lustre played a crucial role in Rivian's cloud transformation, providing the fast shared storage access needed for their computer-aided engineering and design workloads. Using FSx for Lustre as part of their AWS solution, Rivian dramatically improved their performance metrics, including a 66% increase in product lifecycle management interaction speed and reducing backup synchronization time from one day to less than an hour.
The fully managed storage service was implemented alongside other AWS services like Amazon EC2 and Auto Scaling, helping Rivian overcome their on-premises infrastructure limitations and achieve scalable, high-performance computing capabilities in just three weeks compared to their expected six-month timeline.
Denso
Denso develops image sensors for advanced driver-assistance systems (ADAS), which help drivers with functions such as parking and changing lanes. To develop the necessary ML models for ADAS image recognition, Denso had built GPU clusters in its on-premises environment. However, multiple ML engineers shared limited GPU resources, which impacted productivity—especially during the busy period before a new product release.
By adopting Amazon SageMaker and Amazon FSx for Lustre, Denso was able to accelerate the creation of ADAS image recognition models by reducing the data acquisition, model development, learning, and evaluation time.
"The practice of shifting to the cloud will keep accelerating in the artificial intelligence and ML field. I’m confident that AWS will continue to give us support as we continue adding functions."
Kensuke Yokoi, general manager - DENSO
T-Mobile
T-Mobile transformed their SAS Grid infrastructure by implementing Amazon FSx for Lustre to address performance issues and high management overhead with their self-managed system.
The deployment of FSx for Lustre, along with its integration with Amazon S3, enabled T-Mobile to double their SAS Grid workload speeds while achieving $1.5M in annual savings and an 83% reduction in Total Cost of Ownership.
The solution eliminated operational burdens and allowed T-Mobile to focus on their core business of developing innovative customer products while leveraging AWS's advanced storage capabilities.
Maxar
Maxar Technologies, a trusted partner and innovator in Earth intelligence and Space infrastructure, needed to deliver weather forecasts faster compared to its on-premises supercomputer. Maxar worked with AWS to create an HPC solution with key technologies including Amazon EC2 for secure, highly reliable compute resources, Amazon FSx for Lustre to accelerate the read/write throughput of its application, and AWS ParallelCluster to quickly build HPC compute environments on AWS.
"Maxar used Amazon FSx for Lustre in our AWS HPC solution for running NOAA's numerical weather forecasting model. This allowed us to reduce compute time by 58%, generating the forecast in about 45 minutes for a much more cost-effective price point. Maximizing our AWS compute resources was an incredible performance boost for us."
Stefan Cecelski, PhD, Senior Data Scientist & Engineer - Maxar Technologies
BlackThorn Therapeutics (Neumora)
Processing magnetic resonance imaging (MRI) data using standard do-it-yourself (DIY) cloud file systems was resource- and time-intensive. BlackThorn needed high-performance shared file storage to simplify its compute-intensive data science and machine learning workflows. Amazon FSx for Lustre integrates with Amazon S3 and Amazon SageMaker, providing fast processing for BlackThorn's ML training data sets as well as seamless access to compute on Amazon EC2 instances.
"FSx for Lustre has enabled us to create a high-performance MRI data processing pipeline. Data processing time for our ML-based workflows was cut down to minutes compared to days and weeks."
Oscar Rodriguez, Senior Director, Innovation & Technology - BlackThorn Therapeutics
Qubole
Qubole was seeking a high-performance storage solution to process analytical and AI/ML workloads for their customers. They needed to easily store and process the intermediate data held in their EC2 Spot Fleet. Qubole used Amazon FSx for Lustre to store and process intermediate data through its parallel, high-speed file system.
"Our users’ two biggest problems, high costs and intermediate data loss, stemmed from using idle EC2 instances and EC2 Spot instances to process and store intermediate data generated by distributed processing frameworks like Hive and Spark. We were able to solve this problem by using Amazon FSx for Lustre, a highly performant file system, to offload intermediate data. Now our users do not have to pay to maintain idle instances and are not affected by interrupted EC2 Spot nodes. Amazon FSx helped our users reduce total costs by 30%."
Joydeep Sen Sarma, CTO - Qubole