Tutorial

Deploy a P4d EC2 UltraCluster

Introduction

Overview

Amazon EC2 P4d instances deliver the highest performance for machine learning (ML) training and high performance computing (HPC) applications in the cloud. They are deployed in clusters called EC2 UltraClusters, which combine high performance compute, networking, and storage. Each EC2 UltraCluster of P4d instances comprises more than 4,000 of the latest NVIDIA A100 GPUs, petabit-scale non-blocking networking infrastructure, and high-throughput, low-latency storage with FSx for Lustre.

This step-by-step tutorial will help you launch a high performance HPC cluster in the cloud using EC2 UltraClusters of P4d instances. You will set up the underlying networking for the cluster, deploy FSx for Lustre and the P4d cluster, and finally delete your AWS resources.

Before launching an EC2 UltraCluster, we recommend that you first launch a single P4d instance and get familiar with the instance type. Also note the Availability Zone in your account and Region in which you launched the P4d instance. You will need this information later in the tutorial.

You complete the following steps in this tutorial:

  • Log in to the AWS Console

  • Create a private subnet with a NAT Gateway

  • Create 2 security groups for access to the EC2 UltraCluster

  • Launch an FSx for Lustre file system

  • Launch a cluster of EC2 P4d instances with 4 EFA ENIs

  • Launch a jumphost

  • Deprovision resources in the EC2 UltraCluster

Implementation

Intermediate

10 minutes

December 8, 2020

1. Log in to the AWS Console

When you click here, the AWS Management Console opens in a new browser window, so you can keep this step-by-step guide open. When the screen loads, enter your user name and password to get started. Then type VPC in the search bar and select VPC to open the console.

Create a private subnet with a NAT Gateway

The EC2 UltraCluster will have multiple elastic network interfaces per instance. We need to create the instances in a private subnet and route their outbound traffic through a NAT Gateway in a public subnet that has an internet gateway (IGW) attached.

1. Create a private subnet

Create a subnet in your VPC with an available CIDR range. The range must be large enough to hold four IP addresses for every instance you want to launch (number of instances × 4), one for each network interface.
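The sizing rule above can be checked with a little arithmetic. The following is a minimal sketch, assuming four private addresses per instance as stated in this step plus the five addresses that AWS reserves in every subnet. The function name smallest_subnet_prefix is hypothetical and only used for illustration.

```python
# Sketch: find the smallest IPv4 subnet that can hold a P4d cluster.
AWS_RESERVED_ADDRESSES = 5       # AWS reserves 5 addresses in every subnet
INTERFACES_PER_INSTANCE = 4      # one private IP per network interface (tutorial value)
MIN_PREFIX, MAX_PREFIX = 16, 28  # AWS allows IPv4 subnets from /16 to /28


def smallest_subnet_prefix(instance_count: int) -> int:
    """Return the longest prefix length whose subnet still fits the cluster."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    needed = instance_count * INTERFACES_PER_INSTANCE + AWS_RESERVED_ADDRESSES
    host_bits = (needed - 1).bit_length()  # smallest n with 2**n >= needed
    prefix = min(32 - host_bits, MAX_PREFIX)
    if prefix < MIN_PREFIX:
        raise ValueError("cluster does not fit in a single /16 subnet")
    return prefix


if __name__ == "__main__":
    for count in (1, 16, 64):
        print(f"{count} instances -> /{smallest_subnet_prefix(count)}")
```

For example, 16 instances need 16 × 4 + 5 = 69 addresses, so a /25 (128 addresses) is the smallest subnet that fits.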

Screenshot showing an example configuration for subnet settings in AWS EC2, including subnet name, availability zone, IPv4 CIDR block, and tagging options.

2. Create a NAT Gateway

Create a NAT Gateway by going to NAT Gateways in the side menu and launching a gateway in a public subnet of the VPC. Provisioning takes a few minutes.

Screenshot showing the NAT gateway settings page, including fields for the NAT gateway name, subnet selection, and Elastic IP allocation, as demonstrated in an EC2 tutorial.

3. Create a route table

After provisioning is complete, go to Route Tables and create a new route table, selecting the VPC in which your NAT Gateway was created. In Routes for the route table, add a route for the destination 0.0.0.0/0 whose target is the ID of the NAT Gateway you created earlier.

Screenshot showing a sample AWS route tables configuration with destinations, targets, statuses, and propagation columns as part of a getting started tutorial.

4. Associate the route table with the subnet

Associate this route table with the private subnet you created earlier: right-click the route table ID and choose Edit subnet associations.
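If you prefer to script steps 3 and 4 rather than use the console, the two API requests involved can be sketched as follows. This is an illustration, not the tutorial's own method: the helper name nat_route_requests and the placeholder IDs are hypothetical, and the dictionaries are meant to be passed to the boto3 EC2 client calls create_route and associate_route_table.

```python
# Sketch: keyword arguments for the two EC2 API calls behind steps 3 and 4.
def nat_route_requests(route_table_id, nat_gateway_id, private_subnet_id):
    """Build the default-route and subnet-association requests."""
    create_route = {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",  # all outbound IPv4 traffic
        "NatGatewayId": nat_gateway_id,       # target is the NAT Gateway
    }
    associate_route_table = {
        "RouteTableId": route_table_id,
        "SubnetId": private_subnet_id,        # the private subnet from step 1
    }
    return create_route, associate_route_table


if __name__ == "__main__":
    # Placeholder IDs; substitute the IDs of your own resources.
    route, association = nat_route_requests(
        "rtb-EXAMPLE", "nat-EXAMPLE", "subnet-EXAMPLE"
    )
    print(route)
    print(association)
```

With a boto3 client created as ec2 = boto3.client("ec2"), the requests would be sent with ec2.create_route(**route) and ec2.associate_route_table(**association).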

Create security groups for access to the EC2 UltraCluster

We will create 2 security groups, each with its own access policy, for:

  • external SSH access

  • EFA networking

1. Configure the security groups

In the EC2 console, navigate to Security Groups and choose Create security group.

  • Choose the VPC used earlier to associate with this security group.

  • For EFA: for inbound rules, add All traffic on all ports, with the security group being created as the source.

  • For EFA: for outbound rules, add All traffic on all ports, with the security group being created as the destination.

Screenshot showing the Inbound rules tab of an AWS EC2 security group, with all traffic, all protocols, all port ranges allowed from a specific source security group.

2. Verify the TCP port settings

For the new SSH security group, ensure that TCP port 22 is open inbound and that outbound traffic is allowed to 0.0.0.0/0.
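The rules described in this section can also be expressed in the IpPermissions structure that the EC2 API uses. The sketch below is illustrative only. The helper names are hypothetical, and the source CIDR for SSH is an assumption, since the tutorial does not say from where SSH should be allowed.

```python
# Sketch: security group rules in the EC2 API's IpPermissions format.
import ipaddress


def efa_self_rule(security_group_id):
    """All traffic, all ports, scoped to the security group itself.

    The same structure serves for the inbound and the outbound EFA rule.
    """
    return [{
        "IpProtocol": "-1",  # -1 means all protocols and all ports
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }]


def ssh_ingress_rule(source_cidr):
    """Inbound TCP port 22 from source_cidr (assumed: your own address range)."""
    network = ipaddress.ip_network(source_cidr)  # raises ValueError if malformed
    return [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": str(network)}],
    }]


if __name__ == "__main__":
    print(efa_self_rule("sg-EXAMPLE"))
    print(ssh_ingress_rule("203.0.113.0/24"))  # documentation-only address range
```

With boto3, the EFA rule would be applied twice, once with authorize_security_group_ingress and once with authorize_security_group_egress, passing GroupId and IpPermissions each time.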

Launch an FSx for Lustre file system

As part of the EC2 UltraCluster, you will need to launch an FSx for Lustre file system. You can use any process to launch the file system, but it must be launched in the private subnet you created earlier.

1. Create a file system

In the FSx for Lustre console click on Create file system.

Choose Amazon FSx for Lustre and click Next.

2. Configure the file system

Complete the form with the following parameters:

  • For Deployment & storage type, choose Scratch, SSD

  • For Throughput per unit of storage, choose 200 MB/s/TiB

  • For Storage Capacity, choose 2.4 TiB

  • For Virtual Private Cloud, choose the VPC of the private subnet created earlier

  • For VPC Security Groups, choose the EFA security group you created earlier

  • For Subnet, choose the private subnet you created earlier

Screenshot of the AWS FSx for Lustre 'Create file system' settings page, showing options for file system details, storage type, throughput, capacity, and network and security configuration.

3. Choose an S3 bucket

Choose an S3 bucket for data ingestion. The dataset for this tutorial is the BERT dataset. If you don’t have it, you can use synthetic benchmarks instead.
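Steps 2 and 3 correspond roughly to a single create-file-system request. The sketch below shows how the console choices might map onto the parameters of the boto3 FSx client call create_file_system. The mapping is an assumption on our part: "Scratch, SSD" at 200 MB/s/TiB is taken to mean the SCRATCH_2 deployment type, and 2.4 TiB is written as 2400 because the API expresses capacity in GiB. The helper name and the bucket path are hypothetical.

```python
# Sketch: request parameters for an FSx for Lustre scratch file system.
def lustre_file_system_request(subnet_id, security_group_id, import_path=None):
    """Build keyword arguments for the FSx create_file_system call."""
    lustre_configuration = {
        "DeploymentType": "SCRATCH_2",  # assumed equivalent of "Scratch, SSD"
    }
    if import_path is not None:
        # Optional S3 location to import from, for example "s3://my-bucket/bert"
        lustre_configuration["ImportPath"] = import_path
    return {
        "FileSystemType": "LUSTRE",
        "StorageType": "SSD",
        "StorageCapacity": 2400,                  # GiB, shown as 2.4 TiB in the console
        "SubnetIds": [subnet_id],                 # the private subnet
        "SecurityGroupIds": [security_group_id],  # the EFA security group
        "LustreConfiguration": lustre_configuration,
    }


if __name__ == "__main__":
    request = lustre_file_system_request(
        "subnet-EXAMPLE", "sg-EXAMPLE", import_path="s3://my-bucket/bert"
    )
    print(request)
```

With a client created as fsx = boto3.client("fsx"), the request would be sent with fsx.create_file_system(**request). Leave import_path unset if you plan to use synthetic benchmarks.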

Screenshot of a Data Repository Import/Export configuration interface, showing options to import data from and export data to an S3 bucket, including file and directory listing updates, import bucket and prefix entry, and export prefix selection.

4. Verify the file system was created

Wait until the FSx for Lustre file system is in the Available state.

Note the DNS name and mount name of the file system. You will need both to mount it on the cluster nodes.
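The DNS name and mount name are the two values a Lustre client needs to mount the file system. As a sketch of how they fit together, the helper below assembles a mount command. The helper name, the /fsx mount point, and the mount options are assumptions for illustration; check the FSx console's attach instructions for the exact command for your file system.

```python
# Sketch: assemble a Lustre mount command from the two noted values.
import shlex


def lustre_mount_command(dns_name, mount_name, mount_point="/fsx"):
    """Return a shell command that mounts the file system at mount_point."""
    source = f"{dns_name}@tcp:/{mount_name}"
    parts = [
        "sudo", "mount", "-t", "lustre",
        "-o", "relatime,flock",  # assumed options; adjust as needed
        source, mount_point,
    ]
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    # Placeholder values; use the DNS name and mount name you noted.
    print(lustre_mount_command(
        "fs-EXAMPLE.fsx.us-east-1.amazonaws.com", "abcdefgh"
    ))
```

The mount point directory must exist on the instance before the command is run.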

Launch a cluster of EC2 P4d instances with 4 EFA ENIs

We can now launch the compute layer of the EC2 UltraCluster. You can use the Deep Learning AMI v36, which supports P4d, or create your own AMI. You will need to install the FSx client drivers.

In the EC2 management console, open the EC2 Dashboard and choose to launch an instance.

1. Launch an instance

  1. Select an AMI that supports A100 GPUs and has the FSx client driver installed.

  2. For Instance Type, choose p4d.24xlarge.

  3. In the instance details, enter the number of instances you want in the count field.

  4. Choose the VPC and private subnet created earlier.

  5. Select a placement group that was created with the cluster strategy.

  6. For network interfaces, add 3 more network interfaces with Elastic Fabric Adapter selected, for 4 EFA interfaces in total.

  7. Set the NetworkCardIndex of the four EFA interfaces to 0, 1, 2, and 3 respectively.

  8. Add any relevant tags on the next screen. In the Security Group section, choose the security groups created earlier for SSH and EFA access.

  9. Launch the instances and confirm that each node has 4 private IP addresses.
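The network interface layout from steps 6 and 7 is the part that is easiest to get wrong, so here is a sketch of how it might look as a run_instances request for the boto3 EC2 client. The helper names and placeholder IDs are hypothetical. The DeviceIndex values (0 for the primary interface, 1 for the others) are an assumption based on our reading of the EFA documentation for multi-card instances, so verify them before use.

```python
# Sketch: a run_instances request with 4 EFA interfaces, one per network card.
EFA_INTERFACE_COUNT = 4  # p4d.24xlarge value used in this tutorial


def efa_network_interfaces(subnet_id, security_group_ids):
    """One EFA interface on each of network cards 0, 1, 2, and 3."""
    interfaces = []
    for card in range(EFA_INTERFACE_COUNT):
        interfaces.append({
            "NetworkCardIndex": card,
            "DeviceIndex": 0 if card == 0 else 1,  # assumption, see lead-in
            "InterfaceType": "efa",
            "SubnetId": subnet_id,
            "Groups": list(security_group_ids),
            "DeleteOnTermination": True,
        })
    return interfaces


def run_instances_request(image_id, count, placement_group,
                          subnet_id, security_group_ids):
    """Build keyword arguments for the EC2 run_instances call."""
    return {
        "ImageId": image_id,
        "InstanceType": "p4d.24xlarge",
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": placement_group},  # cluster placement group
        # Subnet and security groups go on the interfaces, not the top level.
        "NetworkInterfaces": efa_network_interfaces(subnet_id, security_group_ids),
    }


if __name__ == "__main__":
    request = run_instances_request(
        "ami-EXAMPLE", 2, "my-cluster-pg", "subnet-EXAMPLE",
        ["sg-SSH-EXAMPLE", "sg-EFA-EXAMPLE"],
    )
    for interface in request["NetworkInterfaces"]:
        print(interface["NetworkCardIndex"], interface["DeviceIndex"])
```

With a client created as ec2 = boto3.client("ec2"), the request would be sent with ec2.run_instances(**request).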

Screenshot showing the AWS EC2 console interface for configuring multiple network interfaces with Elastic Fabric Adapter (EFA) attachment options.

Launch a jumphost

Because the cluster is in a private subnet, we need to launch a jumphost in the public subnet to be able to access the P4d instances in the EC2 UltraCluster.

  • In the EC2 Console launch an EC2 instance, for example t3a.xlarge, in a public subnet of the VPC.

  • Attach the security groups created earlier.

  • Once the instance has launched, you can SSH into it and from there SSH into one of the p4d.24xlarge nodes in the cluster.
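The two SSH hops can also be combined into one command with the OpenSSH -J (ProxyJump) option. The sketch below builds such a command. It assumes that your key is already loaded in ssh-agent and that both hosts use the login name ec2-user, which depends on the AMI you chose. The helper name and the addresses are hypothetical.

```python
# Sketch: one SSH command that reaches a private node through the jumphost.
import shlex


def jump_ssh_command(jumphost_public_ip, node_private_ip, user="ec2-user"):
    """Return an ssh command that uses the jumphost as a ProxyJump hop."""
    parts = [
        "ssh",
        "-J", f"{user}@{jumphost_public_ip}",  # first hop: the jumphost
        f"{user}@{node_private_ip}",           # final hop: a P4d node
    ]
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    # Placeholder addresses; use your jumphost's public IP and a node's private IP.
    print(jump_ssh_command("203.0.113.10", "10.0.1.25"))
```

The -J option requires OpenSSH 7.3 or later on the machine you connect from.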

Delete resources in the EC2 UltraCluster

You can easily delete the EC2 P4d cluster from the EC2 console and the FSx for Lustre file system from the FSx console. In fact, it is a best practice to delete resources you are no longer using so you don’t keep getting charged for them.

Congratulations!

You have just launched a cluster of P4d instances in an EC2 UltraCluster. With this cluster you can run large scale distributed deep learning workflows with the best practices for compute and storage.

EC2 UltraClusters provide an optimized placement strategy for EC2 P4d instances and the FSx for Lustre file system. EC2 UltraClusters are supported in managed services such as Amazon Elastic Kubernetes Service (EKS). Follow the examples on GitHub to launch an EC2 UltraCluster with containers using Amazon EKS.