AWS Cloud Operations Blog

Maximizing Multi-Region Resilience with AWS Resilience Hub

In today’s fast-paced digital world, business continuity isn’t just a goal — it’s an achievable reality. As organizations continue to innovate and grow, their cloud-based applications have become the beating heart of modern business operations, delivering value to customers around the clock.

Companies are taking their cloud strategy to the next level by embracing multi-Region deployments on AWS. This powerful approach isn’t just about keeping the lights on; it’s about building a foundation for unparalleled service reliability and customer satisfaction. Whether driven by regulatory requirements or a commitment to excellence, organizations are discovering the compelling advantages of operating across multiple AWS Regions.

A multi-Region deployment lets a business keep operating through unexpected disruptions, which matters more as enterprises move their essential applications to the cloud.

AWS Resilience Hub protects applications through continuous resilience validation. It evaluates Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets and identifies infrastructure issues pre-emptively. This optimizes business continuity and reduces costs. For multi-Region setups, it recommends resource groupings to ensure accurate RTO/RPO estimates. We recommend that you perform regular testing, documentation, and continuous improvement to be prepared for a disruption.

Resilience Hub concepts

Resilience Hub performs read-only resilience assessments across application components (AppComponents). It automatically includes associated AWS resources when applications are defined. Resilience Hub groups resources into AppComponents based on confidence levels that they either operate together for multi-Region resilience or fail and recover together in single-Region deployments. The service uses these AppComponents to estimate Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) and provides targeted resilience recommendations.

Resources

Resilience Hub reads the configuration of resources such as Elastic Load Balancers, Amazon EC2 Auto Scaling Groups, and Lambda functions that are defined in AWS CloudFormation stacks, AWS Resource Groups, myApplications applications, Terraform state files, and Amazon Elastic Kubernetes Service (EKS) clusters. During import, Resilience Hub scans resources and groups them into AppComponents when there is a high level of confidence that they are related, such as DNS Record Sets in multi-Region workloads.

Application Component

An AppComponent is a group of related AWS resources that work and fail as a single unit. Resilience Hub supports AppComponents across six domains – compute, database, networking, notifications, queues and storage. AWS resources are automatically grouped together into their respective domain and function. For example, a primary database and its replica form one AppComponent since they operate as a single unit – if the primary database fails, the replica takes over as primary. Resilience Hub has rules that determine which AWS resource can belong to which AppComponent type, then uses the AppComponents to estimate workload RTO and workload RPO to generate recommendations.

For example, primary and secondary Region Application Load Balancers (ALBs) should be grouped in one AppComponent. This tells Resilience Hub that failover uses the existing secondary ALB instead of creating a new one, providing more accurate RTO/RPO estimates. Without proper grouping, the service would assume single-Region deployment and calculate longer recovery times.
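To make the grouping idea concrete, here is a minimal sketch in plain Python. It does not reflect how Resilience Hub works internally. The `Resource` type, the `role` label, the resource names, and the Region names are all hypothetical and exist only for illustration. The sketch groups resources that share a type and role, and keeps the groups that span more than one Region as candidates for a single AppComponent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str           # hypothetical logical name
    resource_type: str  # CloudFormation-style resource type
    role: str           # hypothetical label for the function the resource serves
    region: str         # hypothetical Region; the post does not name its Regions


def group_candidates(resources):
    """Group resources by (type, role) and keep groups spanning 2+ Regions."""
    groups = defaultdict(list)
    for res in resources:
        groups[(res.resource_type, res.role)].append(res)
    return {
        key: members
        for key, members in groups.items()
        if len({m.region for m in members}) > 1
    }


resources = [
    Resource("alb-primary", "AWS::ElasticLoadBalancingV2::LoadBalancer", "web", "us-west-2"),
    Resource("alb-secondary", "AWS::ElasticLoadBalancingV2::LoadBalancer", "web", "us-east-1"),
    Resource("build-artifacts", "AWS::S3::Bucket", "ci", "us-west-2"),
]

candidates = group_candidates(resources)
for (resource_type, role), members in candidates.items():
    print(role, resource_type, sorted(m.name for m in members))
```

Only the two load balancers are reported, because they are the only resources of the same type and role that exist in both Regions.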

Application architecture

In this blog post, we’ll assess this example three-tier application architecture (Figure 1) that uses global, regional, and zonal services.

Presentation tier:

  • Amazon Route 53 for DNS resolution
  • Amazon CloudFront for content delivery
  • Amazon S3 for the UI

Application tier:

  • Elastic Load Balancer for load balancing
  • Amazon Elastic Compute Cloud (EC2) instances for application servers

Data tier:

  • Amazon Aurora Global Database for the database
  • Amazon ElastiCache Global Datastore for caching
  • Amazon S3 data lake for reporting

The diagram depicts a three-tier web application architecture that is deployed in multiple AWS Regions.

Figure 1: Three-tier multi-Region architecture

Multi-Region application assessment

Adding an application

Start by adding the application to Resilience Hub for assessment. We’re using a warm standby strategy for disaster recovery with an RTO of 90 minutes and an RPO of 60 minutes. We import infrastructure resources from multiple CloudFormation stacks deployed across primary and secondary Regions. After publishing the application, we execute an initial assessment.
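As a rough illustration, the recovery targets above can be expressed as a resiliency policy request. The sketch below only builds a dictionary. Its field names follow the request shape of the boto3 `create_resiliency_policy` call as we understand it, so verify them against the current API reference before use. The policy name and tier are hypothetical, and because this post states a single RTO/RPO pair, the sketch assumes the same targets apply to every disruption type.

```python
def minutes_to_seconds(minutes):
    """Resiliency policy targets are expressed in seconds."""
    return minutes * 60


def build_policy_request(name, rto_minutes, rpo_minutes):
    """Build a request body with the same targets for each disruption type."""
    target = {
        "rtoInSecs": minutes_to_seconds(rto_minutes),
        "rpoInSecs": minutes_to_seconds(rpo_minutes),
    }
    return {
        "policyName": name,   # hypothetical name
        "tier": "Critical",   # assumption: the post does not state a tier
        "policy": {
            disruption: dict(target)
            for disruption in ("Software", "Hardware", "AZ", "Region")
        },
    }


policy_request = build_policy_request("warm-standby-policy", rto_minutes=90, rpo_minutes=60)
print(policy_request["policy"]["Region"])  # {'rtoInSecs': 5400, 'rpoInSecs': 3600}
```

With AWS credentials configured, the dictionary could be passed as keyword arguments to the Resilience Hub client, for example `client.create_resiliency_policy(**policy_request)`.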

The Resilience Hub analysis of our application structure automatically groups together redundant resources (Figure 2). For example, Route 53 DNS RecordSets are automatically grouped (Figure 3). The next sections examine these groupings and their effects on resilience assessment accuracy.

Image showing initial Resilience Hub groupings and recommendations

Figure 2: Resilience Hub automatic groupings

Image showing Record Set resources from both primary and secondary Regions automatically grouped together

Figure 3: DNS Record Set resources from both primary and secondary Regions automatically grouped together

Assessment before grouping recommendations

The Region assessment shows an Unrecoverable status, meaning that in the event of a Region disruption we cannot meet the recovery targets defined in our resiliency policy (Figure 4). Focus on the three components that are currently not meeting our RTO/RPO objectives: the Elastic Load Balancer, the Auto Scaling group, and the S3 bucket representing our data lake. The following sections explain how to resolve these issues.

Image showing assessment results for regional RTO/RPO marked as Unrecoverable

Figure 4: Assessment results showing regional RTO/RPO marked as Unrecoverable

Grouping resources

As previously mentioned, resources are automatically grouped together on a best-effort basis. Because the overall assessment accuracy and operational recommendations are focused on AppComponents, it’s important that you review each AppComponent for accuracy. Some resources are blocked from manual grouping and are grouped automatically when applicable, because they have strict dependencies that require specific grouping configurations. This is discussed further in the following sections.

Grouping recommendations

You can deploy resources across multiple AWS Regions, such as Application Load Balancers in primary and secondary Regions. Resilience Hub treats cross-Region resources with a lower level of confidence and will not group them together automatically, because it cannot determine whether cross-Region resources were deployed for resilience purposes. In this scenario, Resilience Hub makes a best-effort resource grouping recommendation that you can either accept or reject. Resources that Resilience Hub previously grouped together into an AppComponent may not receive additional grouping recommendations. If an existing grouping is incorrect, you’ll need to ungroup the resources into their own AppComponents, as covered in the following sections, before generating grouping recommendations.

Image showing Auto Scaling Group RTO policy met

Figure 5: Auto Scaling Group RTO policy met

The Auto Scaling Group (ASG) that handles the scaling of our EC2 instances will meet our resilience objectives. However, notice that the RTO shown for the resource is the time it would take to deploy a new ASG in a different Region (Figure 5). This is because the ASGs were not grouped into the same AppComponent, so Resilience Hub recommended creating a new ASG to recover in the secondary Region. To resolve this issue, generate and accept the grouping recommendations to associate the ASG in the primary Region with the ASG in the secondary Region.

Generate grouping recommendations

1. Open the application in Resilience Hub, and select Application structure and then the Resources tab.

2. Choose Actions and select Get grouping recommendations (Figure 6).

Image showing UI to select get group recommendations

Figure 6: Get grouping recommendations

3. Resilience Hub generates resource grouping recommendations in the background. This may take a few minutes. Upon completion, an info box is displayed within the Application structure tab. Choose Review recommendations (Figure 7).

Image showing UI to review grouping recommendations

Figure 7: Review recommendations dialog box

4. Select an AppComponent in the list of grouping recommendations to view the resources that will be grouped together (Figure 8).

Image showing UI to review resource grouping recommendations

Figure 8: Review resource grouping recommendations

5. With the grouping recommendation(s) selected, choose Accept to group the resources into a new AppComponent.

Note: The AppComponent name can be edited to make it more descriptive.

Image showing UI to accept grouping recommendations

Figure 9: Accept grouping recommendations
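The console steps above can also be approximated programmatically. The following is a minimal sketch, not a definitive implementation. It assumes a client object that behaves like `boto3.client("resiliencehub")`, and the operation names, parameters, and status values reflect our understanding of that API, so verify them against the current reference before relying on them. For brevity the sketch accepts every pending recommendation and ignores pagination; in practice you would review each recommendation first, as in step 4.

```python
import time


def accept_grouping_recommendations(client, app_arn, poll_seconds=5, max_polls=60):
    """Generate grouping recommendations and accept every pending one.

    `client` is assumed to behave like boto3.client("resiliencehub").
    Returns the list of accepted recommendation IDs.
    """
    task = client.start_resource_grouping_recommendation_task(appArn=app_arn)
    grouping_id = task["groupingId"]

    # The recommendations are generated in the background, so poll for completion.
    for _ in range(max_polls):
        status = client.describe_resource_grouping_recommendation_task(
            appArn=app_arn, groupingId=grouping_id
        )["status"]
        if status == "Success":
            break
        if status == "Failed":
            raise RuntimeError("grouping recommendation task failed")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("grouping recommendation task did not finish in time")

    # Only the first page of results is read here (no nextToken handling).
    listing = client.list_resource_grouping_recommendations(appArn=app_arn)
    pending = [
        rec["groupingRecommendationId"]
        for rec in listing["groupingRecommendations"]
        if rec["status"] == "PendingDecision"
    ]
    if pending:
        client.accept_resource_grouping_recommendations(
            appArn=app_arn,
            entries=[{"groupingRecommendationId": rec_id} for rec_id in pending],
        )
    return pending
```

With AWS credentials configured, a call might look like `accept_grouping_recommendations(boto3.client("resiliencehub"), app_arn)`, where `app_arn` is the ARN of your Resilience Hub application. Passing the client in as a parameter also makes the function easy to exercise with a stub.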

Assessment after grouping recommendation

After grouping the ASGs into one AppComponent, Resilience Hub now knows they’re related resilience resources. This results in near-zero RTO/RPO since failover uses existing secondary Region resources.

Image showing the new assessment after grouping recommendations

Figure 10: New assessment after resource grouping recommendations have been applied

Manually grouping components

Sometimes recommendations require adjustment and you will need to manually group resources. In addition, there may be occasions where you want to remove groupings and get an updated set of grouping recommendations from Resilience Hub. Resilience Hub will only provide grouping recommendations for resources that are not already grouped into an AppComponent.

Resilience Hub provided correct grouping recommendations for our sample workload. The steps below demonstrate manual resource grouping to show greater control over cross-Region configurations. Consider a scenario with multiple load balancers deployed across Regions, routing traffic to backend services. In such cases, Resilience Hub may not correctly pair cross-Region load balancers. Manual grouping allows for more precise control in complex architectures.

Currently, our Elastic Load Balancer (ELB) has an estimated RTO of 1 hour and 40 minutes, based on the time to deploy a new ELB in a different Region. If you refer back to our architecture (Figure 1), there is already an ELB deployed to our secondary Region. Without grouping, Resilience Hub can’t determine that the secondary Region ELB corresponds to the primary Region ELB for failover scenarios. Resolve this by manually grouping these resources together.

Image showing Elastic Load Balancer RTO policy breach

Figure 11: Elastic Load Balancer RTO policy breach
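The breach is simple arithmetic: an estimated RTO of 1 hour and 40 minutes is 100 minutes, which exceeds the 90-minute RTO target in our policy by 10 minutes. The short sketch below works through that comparison. The `meets_target` helper is hypothetical, and treating an estimate equal to the target as compliant is an assumption.

```python
from datetime import timedelta

RTO_TARGET = timedelta(minutes=90)                   # RTO target from the policy in this post
elb_estimated_rto = timedelta(hours=1, minutes=40)   # estimate before grouping


def meets_target(estimate, target):
    """Assumption: an estimate is compliant when it does not exceed the target."""
    return estimate <= target


overage = elb_estimated_rto - RTO_TARGET
print(meets_target(elb_estimated_rto, RTO_TARGET))  # False
print(overage)                                      # 0:10:00
```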

Manually group resources

1. Open the application in Resilience Hub, and select Application structure and then the Resources tab.

2. Select the resources that you would like to group together into a single AppComponent. Choose Actions and select Group resources (Figure 12).

Image showing the UI to select resources to manually group

Figure 12: Select resources to manually group

3. In the Group resources dialog, select the AppComponent that the resources should be grouped under. Type Group in the confirmation field and choose Save.

Image showing UI to assign resources to an AppComponent

Figure 13: Choose which AppComponent the resources should be manually grouped under

Now that the ELBs have been grouped together into a single AppComponent, Resilience Hub knows they are related. After republishing and assessing the application, we can see that our RTO/RPO is now near zero: a resource already exists in the secondary Region, so there is no need to create a new one. In a failover scenario, traffic is rerouted to our secondary Region, where the ELB routes it to resources in that Region.

Image showing Elastic Load Balancer policy is now met

Figure 14: Elastic Load Balancer policy is now met after manually grouping resources

Blocked services for manual grouping

Resilience Hub blocks some services from manual grouping to maintain assessment accuracy. The system automatically groups these services based on their dependencies and configurations. Through analysis of service relationships, dependencies, and resilience requirements, Resilience Hub creates groupings with high confidence that support accurate resilience assessments for your application. One example is a primary database and its replica, which are grouped into one AppComponent because they operate as a single unit. For Amazon S3, instead of grouping, Resilience Hub evaluates the amount of stored data, the time to replicate, backup plans, and cross-Region replication to produce RTO/RPO estimates.

The buckets that support our static web content in both Regions meet our recovery objectives as versioning is enabled and a backup plan is defined (Figure 15).

Image showing S3 policy met

Figure 15: S3 policy met for static web content

The S3 buckets supporting our data lake are currently unrecoverable (Figure 16). There are no cross-Region backup plans and Amazon S3 Cross-Region Replication is not configured. These buckets were not initially considered candidates for Amazon S3 Cross-Region Replication because their contents are Region-specific.

To meet our RTO/RPO objectives, we can set up a cross-Region backup plan to support disaster recovery and implement Amazon S3 Cross-Region Replication (CRR) to copy the required data across Regions. This ensures that the data is available and that, in the event of a Regional failover, our application continues to operate normally.

Image showing S3 policy breach

Figure 16: S3 policy breach for data lake
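As an illustration of the replication part of that remediation, the sketch below builds a replication configuration in the shape accepted by the Amazon S3 `put_bucket_replication` operation. The IAM role ARN, bucket ARN, and rule ID are hypothetical placeholders, and the empty prefix filter (replicate every object) is an assumption; a Region-specific data lake may need a narrower filter. Note that replication requires versioning to be enabled on both the source and destination buckets.

```python
def build_replication_config(role_arn, destination_bucket_arn, rule_id="data-lake-crr"):
    """Build a single-rule S3 replication configuration."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # assumption: replicate all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Hypothetical ARNs, for illustration only.
replication_config = build_replication_config(
    role_arn="arn:aws:iam::111122223333:role/example-replication-role",
    destination_bucket_arn="arn:aws:s3:::example-data-lake-secondary",
)
print(replication_config["Rules"][0]["Destination"])
```

With AWS credentials configured, the result could be applied with the boto3 S3 client, for example `s3.put_bucket_replication(Bucket=source_bucket_name, ReplicationConfiguration=replication_config)`, where `source_bucket_name` is the bucket in the primary Region.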

Excluding components

Applications should accurately reflect your AWS resources for meaningful resilience assessments. While importing resources, exclude non-critical components (like build pipeline S3 buckets or EC2 instances) that won’t impact production workloads if they fail. Excluding these resources improves your resilience score and prevents unnecessary recommendations. Excluded resources won’t be re-imported or trigger drift detection.
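One way to decide what to exclude is to tag non-critical resources ahead of time and split the inventory on that tag before import. The sketch below is plain Python over an in-memory list. The `resilience-scope` tag, its `non-critical` value, and the resource names are hypothetical conventions, not something Resilience Hub defines.

```python
def split_by_scope(resources, tag_key="resilience-scope", excluded_value="non-critical"):
    """Partition resource names into (included, excluded) by a hypothetical tag."""
    included, excluded = [], []
    for res in resources:
        if res.get("tags", {}).get(tag_key) == excluded_value:
            excluded.append(res["name"])
        else:
            included.append(res["name"])
    return included, excluded


# Hypothetical inventory, for illustration only.
inventory = [
    {"name": "app-alb", "tags": {}},
    {"name": "build-pipeline-artifacts", "tags": {"resilience-scope": "non-critical"}},
    {"name": "data-lake", "tags": {}},
]

included, excluded = split_by_scope(inventory)
print(excluded)  # ['build-pipeline-artifacts']
```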

Unsupported resources

Resilience Hub performs assessments and provides recommendations only for supported AWS services, and it excludes any unsupported resources from your input sources. For unsupported services, architects should refer to the AWS Well-Architected Framework to implement the proper fault isolation boundaries, observability, resilience testing, and runbooks where necessary.

Conclusion

Resilience Hub optimizes multi-Region application resilience through automated grouping and comprehensive assessments. It provides RTO/RPO estimates and tailored recommendations to strengthen resilience posture. While the service offers valuable insights, effective business continuity requires regular testing, documentation, and continuous improvement.

Visit the Resilience Hub console to import applications, run assessments, and enhance your resilience strategy. For hands-on experience and deeper insights, check out Monitoring Resilient Architectures With AWS Resilience Hub and training sessions at AWS Skill Builder. Don’t wait for an impairment to test your resilience – take proactive steps now to ensure your applications can withstand any disruption.

Daniel Cil

Daniel Cil is a Senior Resilience Specialist Solutions Architect based out of Southern California. He helps AWS Industries and Strategic customers design fault-tolerant architectures and implement resilience best practices for their workloads on the AWS Cloud.

Tyler Huehmer

Tyler Huehmer serves as a Senior Solutions Architect at AWS, where he partners with large-scale ecommerce customers to optimize their cloud infrastructure. He specializes in serverless computing, event-driven architecture, and building resilient systems that withstand the demands of modern commerce. Tyler’s passion lies in unifying distributed teams to tackle complex challenges.