Deploy a P4d EC2 UltraCluster
Introduction
Implementation
1. Log in to the AWS Console
Open the AWS Management Console in a new browser window so you can keep this step-by-step guide open. When the screen loads, enter your user name and password to get started. Then type VPC in the search bar and select VPC to open the console.
Create a private subnet with a NAT Gateway
Each instance in the EC2 UltraCluster will have multiple elastic network interfaces. We need to create the instances in a private subnet and route their outbound traffic through a NAT Gateway that sits in a public subnet with an internet gateway (IGW) attached.
1. Create a private subnet
Create a subnet in your VPC using a free CIDR range. The range must be large enough to hold four IP addresses for every instance you want to launch (number of instances * 4).
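The sizing rule above can be checked with a little arithmetic. The following is a minimal sketch, not an official tool: the helper names are hypothetical, and it assumes four private IP addresses per instance plus the five addresses that AWS reserves in every subnet. Other resources placed in the same subnet also consume addresses, so leave some headroom.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5   # AWS reserves the first four addresses and the last one
IPS_PER_INSTANCE = 4          # assumption: one private IP per network interface


def smallest_prefix_for_cluster(instance_count):
    """Return the longest IPv4 prefix length whose subnet fits the cluster."""
    needed = instance_count * IPS_PER_INSTANCE + AWS_RESERVED_PER_SUBNET
    # Walk from the smallest subnet a VPC allows (/28) to the largest (/16).
    for prefix in range(28, 15, -1):
        if 2 ** (32 - prefix) >= needed:
            return prefix
    raise ValueError("cluster does not fit in a single /16 subnet")


def subnet_fits(cidr, instance_count):
    """Check whether an existing CIDR block is large enough for the cluster."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - AWS_RESERVED_PER_SUBNET
    return usable >= instance_count * IPS_PER_INSTANCE


if __name__ == "__main__":
    # 16 instances need 64 addresses plus 5 reserved, so a /26 (64) is too small.
    print(smallest_prefix_for_cluster(16))
    print(subnet_fits("10.0.1.0/26", 16))
```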

2. Create a NAT Gateway
Create a NAT Gateway by going to NAT Gateways in the side menu and launching a gateway in a public subnet in the VPC. This will take a few minutes to provision.

3. Create a routing table
After provisioning is complete, go to Route Tables and create a new route table, selecting the VPC that your NAT Gateway was created in. In Routes for the route table, add a route for the destination 0.0.0.0/0 where the target is the ID of the NAT Gateway you created earlier.

4. Associate the route table with the subnet
Associate this route table with the private subnet you created earlier. To do so, right-click the route table ID and choose Edit subnet associations.
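A quick way to sanity-check the result of the routing steps is to confirm that the route table contains a default route through a NAT Gateway. The sketch below is illustrative only: the function name is hypothetical, and it assumes route entries shaped like the dictionaries the EC2 API returns when describing route tables (`DestinationCidrBlock`, `NatGatewayId`). The gateway IDs are placeholders.

```python
def has_nat_default_route(routes):
    """Return True if any route sends 0.0.0.0/0 to a NAT Gateway."""
    for route in routes:
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        # NAT Gateway IDs start with "nat-"; internet gateways use "igw-".
        via_nat = str(route.get("NatGatewayId", "")).startswith("nat-")
        if is_default and via_nat:
            return True
    return False


if __name__ == "__main__":
    private_routes = [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
    ]
    print(has_nat_default_route(private_routes))
```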
Create security groups for access to the EC2 UltraCluster
We will create two security groups with different access policies, one for each of the following:
external SSH access
EFA networking
1. Configure the security groups
In the EC2 Console, navigate to Security Groups and choose Create security group.
Choose the VPC used earlier to associate this security group with.
For EFA: For inbound rules, add All traffic on all ports, scoped to the security group that is being created.
For EFA: For outbound rules, add All traffic on all ports, scoped to the security group that is being created.
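The EFA rules above are self-referencing: the group allows all traffic to and from members of the same group. As a rough sketch, assuming the `IpPermissions` structure accepted by the EC2 API for security group ingress and egress rules, the rule could be expressed as follows. The function name and the group ID are hypothetical placeholders.

```python
def efa_self_referencing_rule(security_group_id):
    """Build one rule allowing all traffic from members of the same group."""
    return {
        "IpProtocol": "-1",  # "-1" means all protocols, which implies all ports
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


if __name__ == "__main__":
    # Placeholder ID; the same rule is applied once inbound and once outbound.
    rule = efa_self_referencing_rule("sg-0123456789abcdef0")
    print(rule)
```

The same dictionary would be used for both directions, since the inbound and outbound rules described above are identical.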

2. Verify the TCP port settings
For the new SSH security group ensure that TCP port 22 is open inbound with outbound set to 0.0.0.0/0.
Launch an FSx for Lustre file system
As part of the EC2 UltraCluster, you will need to launch an FSx for Lustre file system. You can use any process to launch the file system, but it must be launched in the private subnet you created earlier.
1. Create a file system
In the FSx for Lustre console click on Create file system.
Choose Amazon FSx for Lustre and click Next.
2. Configure the file system
Complete the form with the following parameters:
For Deployment & storage type choose Scratch, SSD
For Throughput per unit of storage choose 200 MB/s/TiB
For Storage Capacity choose 2.4 TiB
For Virtual Private Cloud choose the VPC of the private subnet created earlier
For VPC Security Groups choose the EFA security group you created earlier
For Subnet choose the private subnet you created earlier
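For readers who prefer to script this step, the form values above map roughly onto a file system creation request. The sketch below only builds the request parameters as a dictionary. It rests on several assumptions that are not stated in this guide: that "Scratch, SSD" at 200 MB/s/TiB corresponds to the `SCRATCH_2` deployment type, that the API expresses storage capacity in GiB so 2.4 TiB becomes 2400, and that the function name and resource IDs are hypothetical placeholders.

```python
def lustre_file_system_request(subnet_id, efa_security_group_id):
    """Build request parameters mirroring the console form values."""
    return {
        "FileSystemType": "LUSTRE",
        "StorageType": "SSD",
        "StorageCapacity": 2400,  # assumption: GiB, matching 2.4 TiB in the console
        "SubnetIds": [subnet_id],  # the private subnet created earlier
        "SecurityGroupIds": [efa_security_group_id],
        # assumption: SCRATCH_2 is the scratch type with 200 MB/s/TiB throughput
        "LustreConfiguration": {"DeploymentType": "SCRATCH_2"},
    }


if __name__ == "__main__":
    params = lustre_file_system_request(
        "subnet-0123456789abcdef0", "sg-0123456789abcdef0"
    )
    print(params)
```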

3. Choose an S3 bucket
Choose an S3 bucket for data ingestion. The dataset for this tutorial is the BERT dataset. If you don’t have it, we can use synthetic benchmarks.

4. Verify the file system was created
Wait until the FSx file system is in the Available state.
Note the DNS name (dnsname) and mount name (mountname) of the file system. You will need both to mount it on the instances.
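The two values noted above are what a Lustre client needs in order to mount the file system. As an illustration, the sketch below assembles a mount command from them. The function name, the `/fsx` mount point, the mount options, and the example values are assumptions chosen for the example, and the command requires the Lustre client to be installed on the instance.

```python
def lustre_mount_command(dns_name, mount_name, mount_point="/fsx"):
    """Build the argument list for mounting an FSx for Lustre file system."""
    # Lustre sources take the form <dns name>@tcp:/<mount name>.
    source = f"{dns_name}@tcp:/{mount_name}"
    return [
        "sudo", "mount", "-t", "lustre",
        "-o", "relatime,flock",  # assumption: commonly used client options
        source, mount_point,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the dnsname and mountname you noted.
    command = lustre_mount_command(
        "fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com", "abcdefgh"
    )
    print(" ".join(command))
```

The mount point directory must exist on the instance before the command is run.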
Launch a cluster of EC2 P4d instances with 4 EFA ENIs
We can now launch the compute layer of the EC2 UltraCluster. You can use the Deep Learning AMI v36, which supports P4d, or create your own AMI. You will need to install the FSx client drivers.
In the EC2 management console, choose Launch instance from the EC2 Dashboard.
1. Launch an instance
Select an AMI that supports A100 GPUs and has the FSx client driver installed.
For Instance Type choose p4d.24xlarge
In the instance details, enter the number of instances you want as the instance count.
Choose the VPC and private subnet created earlier.
Select a placement group created as a cluster.
For network interfaces, add 3 more network interfaces with Elastic Fabric Adapter selected, for a total of 4.
Set the NetworkCardIndex of the four EFA interfaces to 0, 1, 2, and 3 respectively.
Add any relevant tags on the next screen. In the Security Group section, choose the security groups created earlier for SSH and EFA access.
Launch the instances and confirm that each node has 4 private IP addresses.
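The network interface settings above can also be written down as a launch specification. The sketch below builds a list in the shape of the `NetworkInterfaces` parameter of the EC2 instance launch API, with one EFA interface per network card. The function name and resource IDs are hypothetical, and the `DeviceIndex` values (0 on the first card, 1 on the others) are an assumption that should be checked against the EFA documentation for your instance type.

```python
def efa_network_interfaces(subnet_id, security_group_ids, card_count=4):
    """Build one EFA interface specification per network card."""
    interfaces = []
    for card in range(card_count):
        interfaces.append({
            "NetworkCardIndex": card,  # 0, 1, 2, 3 as described in the steps above
            # assumption: the primary interface uses device index 0, the rest use 1
            "DeviceIndex": 0 if card == 0 else 1,
            "InterfaceType": "efa",
            "SubnetId": subnet_id,
            "Groups": list(security_group_ids),
        })
    return interfaces


if __name__ == "__main__":
    specs = efa_network_interfaces(
        "subnet-0123456789abcdef0",
        ["sg-0aaaaaaaaaaaaaaaa", "sg-0bbbbbbbbbbbbbbbb"],  # SSH and EFA groups
    )
    for spec in specs:
        print(spec["NetworkCardIndex"], spec["DeviceIndex"], spec["InterfaceType"])
```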

Launch a jumphost
Since the cluster is in a private subnet, we need to launch a jumphost in the public subnet to be able to access the P4d instances in the EC2 UltraCluster.
In the EC2 Console launch an EC2 instance, for example t3a.xlarge, in a public subnet of the VPC.
Attach the security groups created earlier.
Once the instance has launched, you can SSH into it and from there SSH into one of the p4d.24xlarge nodes in the cluster.
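The two SSH hops can be combined into a single command with the OpenSSH `-J` (ProxyJump) option. The sketch below only assembles the command line. The function name, the `ec2-user` login name, and the IP addresses are assumptions for illustration; the correct user depends on the AMI, and your key must be available to SSH, for example through an SSH agent.

```python
def jump_ssh_command(jumphost, target_private_ip, user="ec2-user"):
    """Build an ssh command that reaches a private node through the jumphost."""
    return [
        "ssh",
        "-J", f"{user}@{jumphost}",    # first hop: jumphost in the public subnet
        f"{user}@{target_private_ip}",  # second hop: P4d node in the private subnet
    ]


if __name__ == "__main__":
    # Placeholder addresses from documentation and private ranges.
    print(" ".join(jump_ssh_command("203.0.113.10", "10.0.1.25")))
```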
Delete resources in the EC2 UltraCluster
You can easily delete the EC2 P4d cluster from the EC2 console and the FSx for Lustre file system from the FSx console. In fact, it is a best practice to delete resources you are no longer using so you don’t keep getting charged for them.
Congratulations!
You have just launched a P4d instance in the EC2 UltraCluster. With this cluster you can run large scale distributed deep learning workflows with the best practices for compute and storage.
An EC2 UltraCluster is an optimized placement strategy for EC2 P4d instances and the FSx for Lustre file system. EC2 UltraClusters are supported in managed services such as Amazon Elastic Kubernetes Service (EKS). Follow the examples on GitHub to launch an EC2 UltraCluster with containers using Amazon EKS.