AWS Public Sector Blog
Proactive strategies for cyber resilience and business continuity on AWS
Amazon Web Services (AWS) recommends that organizations prepare to recover workloads in case of cybersecurity incidents or business continuity events such as technical or natural disasters. In this post, we offer guidance and strategies for public sector organizations to use AWS infrastructure to operate resilient systems in the cloud. AWS recommends that its customers:
- Use frameworks for cybersecurity and AWS architecture best practices
- Implement a multi-account environment
- Use infrastructure as code (IaC) to deploy AWS environments and workloads
- Prepare a recovery account in a different Availability Zone or Region than the primary workload
- Populate all application code, IaC code, configuration files, and other dependencies in the recovery account
- Define a strategy to back up data to the recovery account
- Implement automated unit and full workload testing in the recovery account
Use an established framework
As cybersecurity incidents like ransomware continue to increase, public sector organizations look to frameworks such as the NIST Cybersecurity Framework (CSF) 2.0 from the National Institute of Standards and Technology (NIST) for guidance to better manage cybersecurity risk. Although the CSF arranges cybersecurity outcomes by functions (govern, identify, protect, detect, respond, and recover), NIST doesn’t prescribe how outcomes should be achieved.
For more specific guidance, refer to resources such as the AWS Blueprint for Ransomware Defense, which maps specific AWS services to CSF functions. For best practices on architecting on AWS, the AWS Well-Architected Framework covers six pillars in depth to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads, in accordance with the AWS Shared Responsibility Model. These frameworks are complemented by the AWS Security Reference Architecture (AWS SRA), which helps you design, implement, and manage AWS security services so that they align with AWS recommended practices. Bridging CSF and AWS Well-Architected, this post explores essential patterns you can use to prepare to recover from a cybersecurity or disaster event.
Implement a multi-account strategy on AWS
AWS recommends implementing your cloud environment using a multi-account strategy, also known as a landing zone. Using AWS Organizations along with AWS Control Tower or Landing Zone Accelerator on AWS creates an environment suited for ease of management and automation. Using multiple AWS accounts helps isolate and manage your applications and data, and it helps limit unintended lateral movement in your environment.
As a part of the multi-account strategy, AWS recommends centrally managing identities using AWS IAM Identity Center. Identity providers external to AWS can be integrated with IAM Identity Center. But in a recovery situation, you might have to use root credentials. AWS recommends using root access to your AWS accounts only when required. Root credentials should be properly secured to prevent account takeovers and unauthorized access. AWS recently announced new capabilities for centrally managing root credentials. With the centralized management of root credentials, you can strengthen your overall AWS security posture by removing long-term root credentials and preventing unintended or unauthorized credential recovery.
Build your AWS infrastructure with automation
As a key principle of DevSecOps, AWS recommends deploying all AWS infrastructure using IaC. Using IaC, AWS customers can quickly iterate on their infrastructure, improve consistency and reproducibility, reduce configuration errors, and provide version control. Important from a recovery standpoint, IaC also functions as documentation for your infrastructure. IaC speeds up recovery and testing from a business continuity event, which means you can quickly deploy infrastructure in a new AWS account or AWS Region.
AWS services are designed with IaC in mind, and you can choose from a variety of IaC tools to meet the needs of your organization. AWS provides options such as AWS CloudFormation and AWS Cloud Development Kit (AWS CDK). You can use AWS CloudFormation to define your infrastructure using templates, and with AWS CDK, you can define cloud resources using familiar programming languages. You can also use AWS CloudFormation to create a stack from existing infrastructure, thus easing your migration to IaC. At an application level, AWS Serverless Application Model (AWS SAM) is an open source framework for building serverless applications using IaC. AWS also supports HashiCorp Terraform, which is an IaC tool you can use to define both resources across multiple cloud service providers and on-premises resources in human-readable configuration files that you can version, reuse, and share.
In all cases, IaC can be developed using variables such as resource naming, AWS Region, IP ranges, and more, making it straightforward to reuse code across AWS accounts and Regions, speeding up a recovery process. IaC code should be stored in a source code repository, such as GitLab. To prepare for faster recovery, you can set up your own DevSecOps pipelines or standardize using AWS prescriptive guidance. DevOps Pipeline Accelerator (DPA) is a solution composed of templates that help you construct a complete continuous integration and continuous delivery (CI/CD) pipeline for application or infrastructure deployment using the previously described options. Refer to the Infrastructure as code whitepaper for a deeper dive on IaC.
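As a minimal sketch of the variable-driven approach described above, the following Python builds a small CloudFormation template as a dictionary and derives a parameter list for each environment. The environment names, name prefixes, and CIDR ranges are hypothetical values chosen for illustration, not settings from this post; a real deployment would pass the template and parameters to your IaC tool of choice.

```python
import json


def build_template() -> dict:
    """Return a minimal CloudFormation template whose environment-specific
    values (name prefix, IP range) are parameters rather than hard-coded."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "NamePrefix": {"Type": "String"},
            "VpcCidr": {"Type": "String"},
        },
        "Resources": {
            "Vpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {
                    "CidrBlock": {"Ref": "VpcCidr"},
                    "Tags": [
                        {"Key": "Name", "Value": {"Fn::Sub": "${NamePrefix}-vpc"}}
                    ],
                },
            }
        },
    }


# Hypothetical per-environment settings; only these values differ between
# the primary deployment and the recovery deployment.
ENVIRONMENTS = {
    "primary": {"region": "us-east-1", "NamePrefix": "app-primary", "VpcCidr": "10.0.0.0/16"},
    "recovery": {"region": "us-east-2", "NamePrefix": "app-recovery", "VpcCidr": "10.1.0.0/16"},
}


def stack_parameters(env_name: str) -> list:
    """Convert one environment's settings into the ParameterKey/ParameterValue
    list shape that CloudFormation stack operations accept."""
    settings = ENVIRONMENTS[env_name]
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in settings.items()
        if key != "region"  # the Region selects where to deploy; it is not a template parameter
    ]


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
    print(stack_parameters("recovery"))
```

Because the template itself never changes, the same file in the source code repository can be deployed to the recovery account or Region by swapping only the parameter set.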
Prepare a recovery location
Using the multi-account strategy as discussed previously, AWS recommends establishing separate accounts for recovery to address cybersecurity incidents such as the compromise of identities. Although identities in an AWS account should be aligned to roles, if a bad actor has gained root access or can escalate their own privileges, backups in the same account can be compromised as well.
To address technical or natural disasters, you can protect a workload from the unlikely scenario of an AWS Availability Zone being unavailable. If you use Availability Zone recovery within a Region, make sure that the recovery Availability Zone is distinct from the zones where workloads are deployed. AWS maps Availability Zone names to physical zones independently for each account, so the same name (for example, us-east-1a) can refer to different physical zones in different accounts. Availability Zone IDs (AZ IDs), by contrast, identify the same physical zone in every account. To effectively use an Availability Zone in a separate account for recovery, plan your deployments by AZ ID across all your accounts in a Region. Using this method, you can use two or more Availability Zones for resilient deployment of your workloads and a separate Availability Zone for recovery in the same Region, and you can verify that your recovery Availability Zone is different from the deployment Availability Zones for all workloads.
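The AZ ID selection described above can be sketched in Python as follows. The account labels and the name-to-ID mappings are hypothetical sample data, shaped like the zone name and zone ID pairs that Amazon EC2 reports for each account; in practice you would look up the real mappings for your own accounts.

```python
# Hypothetical AZ name -> AZ ID mappings for two accounts in the same Region.
# Note that the same name points at a different AZ ID in each account.
ACCOUNT_AZ_MAP = {
    "workload": {"us-east-1a": "use1-az2", "us-east-1b": "use1-az4", "us-east-1c": "use1-az6"},
    "recovery": {"us-east-1a": "use1-az6", "us-east-1b": "use1-az2", "us-east-1c": "use1-az4"},
}


def used_az_ids(account: str, zone_names: list) -> set:
    """Resolve the AZ names a workload uses in one account to AZ IDs."""
    return {ACCOUNT_AZ_MAP[account][name] for name in zone_names}


def recovery_zone_name(workload_ids: set, recovery_account: str) -> str:
    """Pick an AZ name in the recovery account whose AZ ID is not used by
    any workload, so recovery lands in a physically separate zone."""
    candidates = sorted(
        name
        for name, az_id in ACCOUNT_AZ_MAP[recovery_account].items()
        if az_id not in workload_ids
    )
    if not candidates:
        raise ValueError("no Availability Zone is free of workload deployments")
    return candidates[0]


if __name__ == "__main__":
    in_use = used_az_ids("workload", ["us-east-1a", "us-east-1b"])
    zone = recovery_zone_name(in_use, "recovery")
    print(zone, ACCOUNT_AZ_MAP["recovery"][zone])
```

In this sample data the recovery account's zone is also named us-east-1a, yet its AZ ID differs from both workload zones, which is why comparing names alone would be misleading.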
The following diagram shows the architecture for workload and recovery accounts using AZ IDs.
For resiliency against more widespread technical or natural disasters, you can also choose multi-Region recovery. AWS recommends that you consider using a different Region for recovery accounts if possible, such as having your main Region in US East (N. Virginia) us-east-1 and a backup location in US East (Ohio) us-east-2. At a minimum, AWS recommends backing up data to another Region. If using cross-Region backup locations, organizations should evaluate their encryption key strategy using AWS Key Management Service (AWS KMS) to be able to decrypt backup data in a different account.
Note that data transfer charges between Regions may be higher than those between Availability Zones, which should be considered as a part of the overall backup location strategy.
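To illustrate one part of an encryption key strategy, the sketch below builds a key policy statement that lets principals in a recovery account use a KMS key to decrypt. It is an assumption-laden example rather than a complete policy: the function name and statement ID are hypothetical, the account ID is a placeholder, and the exact set of actions required depends on the services involved, so check it against the AWS KMS and AWS Backup documentation before use.

```python
import json


def cross_account_decrypt_statement(recovery_account_id: str) -> dict:
    """Build a key policy statement granting a recovery account permission
    to decrypt with the key. Illustrative only; the actions listed here are
    an assumed minimum, not a verified requirement for any given service."""
    if not (recovery_account_id.isdigit() and len(recovery_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "Sid": "AllowRecoveryAccountDecrypt",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{recovery_account_id}:root"},
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        "Resource": "*",  # in a key policy, this refers to the key itself
    }


if __name__ == "__main__":
    # 111122223333 is a placeholder account ID, not a real account.
    print(json.dumps(cross_account_decrypt_statement("111122223333"), indent=2))
```

Granting access in the key policy is only half of cross-account use; identities in the recovery account typically also need IAM permissions that allow the same actions on the key.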
Define backup strategy and implement automated testing
AWS recommends aligning a backup strategy with cloud-based solutions. After preparing a recovery location as noted above, you can use AWS Backup to back up workload data in the recovery account. AWS Backup can also be used to store IaC, infrastructure configuration data, application code, and application configuration settings in the recovery location. In most scenarios, all those data types will be required to restore a workload. If using a cloud-based code repository, sufficient backup capabilities might already be included in the solution.
To validate the viability of backed up data, AWS recommends implementing automated testing in the recovery account. For example, implementing an automated process that runs after backup completion to ensure the backup can be read correctly is an effective mitigation measure against threats such as ransomware. To further mitigate such attacks, organizations can maintain multiple versions of all backed up components so that an older version of a backup can be restored if required. You can also consider immutable backups, using features such as AWS Backup Vault Lock, as a part of your backup strategy.
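One simple form of the automated check described above is to record a checksum when each backup is taken and verify it afterward, falling back to an older version if the newest one fails. The following is a minimal file-level sketch under that assumption; the function names are hypothetical, and it does not use or represent AWS Backup restore testing.

```python
import hashlib
from pathlib import Path
from typing import List, Optional, Tuple


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Read a file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def newest_valid_backup(versions: List[Tuple[Path, str]]) -> Optional[Path]:
    """Return the first backup whose current checksum matches the checksum
    recorded at backup time, or None if no version passes.

    `versions` must be ordered newest first; each entry is a pair of
    (backup file path, expected SHA-256 hex digest)."""
    for path, expected in versions:
        if path.is_file() and sha256_of(path) == expected:
            return path
    return None
```

A scheduled job in the recovery account could call `newest_valid_backup` after each backup completes and raise an alert whenever the result is not the most recent version.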
As a part of your backup and IaC strategy, you should make sure all necessary components of workloads are stored in the recovery location, including IaC code, infrastructure configuration data, Amazon Machine Images (AMIs), application code, and application configuration settings. This process should be incorporated into your overall DevSecOps and IaC strategies. Implementing automated processes that trigger backups of these components when changes are made in a DevSecOps pipeline makes sure that backups are up to date in the recovery account. For systems with frequent pipeline changes, you can opt for automated backup processes that execute on a periodic basis, such as daily.
After recovery accounts have been established, IaC has been implemented, and a backup strategy has been defined, AWS recommends testing the entire workload in recovery accounts. You should periodically build the complete workload in the recovery account using the IaC code and backups in that account. In many cases, existing component, integration, or user acceptance testing scripts can be used for periodic business continuity exercises. Such testing validates the recovery capabilities and gives confidence that the workload can be recovered in the case of a business continuity event.
Conclusion
By following the best practices in this post, public sector organizations can use AWS capabilities to prepare for recovery from a business continuity event. Please reach out to your AWS account team and solutions architect for additional guidance.