Fluent Commerce’s approach to near-zero downtime Amazon Aurora PostgreSQL upgrade at 32 TB scale using snapshots and AWS DMS ongoing replication
This is a guest post by Adrian Cook, DevOps Engineer at Fluent Commerce, in partnership with AWS.
Fluent Commerce, an omnichannel commerce platform, offers order management solutions that enable businesses to deliver seamless shopping experiences across various channels. Fluent uses Amazon Aurora PostgreSQL-Compatible Edition as its high-performance online transaction processing (OLTP) database engine to process their customers’ intricate search queries efficiently. The system handles multiple complex database queries while maintaining consistent, reliable, and scalable operations. Aurora PostgreSQL-Compatible helps Fluent deliver the desired database performance, so they can meet the demanding needs of their clientele with ease.
As the business embarked on its journey of global expansion and the continual onboarding of new customers, it became evident that a strategic shift was in order to support the ever-expanding geographic footprint. The success of this ambitious endeavor hinged on a critical factor: improving cost-efficiency across the organization.
Amazon Relational Database Service (Amazon RDS) and Amazon Aurora played a pivotal role in this transformation. The catalyst for change arrived in the form of the AWS Graviton instance type, promising up to 20% performance improvement. Fluent Commerce made the strategic decision to migrate all their instances to Graviton, recognizing the potential it held for streamlining operations on a global scale with improved right-sizing. This approach allowed for better resource allocation and enhanced performance while maintaining efficient tenant density across instances.
Migration to the Graviton2 instance family was a straightforward process, but it required a major version upgrade from Aurora PostgreSQL-Compatible 10.14 to 12.4 and higher. Upgrading databases can be highly disruptive, especially for Fluent Commerce, which supports some of the world’s largest ecommerce platforms. Managing upgrades easily, with near-zero downtime, was challenging, given the immense size of their databases, which can reach up to 32 TB. Fortunately, AWS offers several technical approaches for database upgrades, including in-place upgrades, native replication, Amazon RDS blue/green deployment, and AWS Database Migration Service (AWS DMS). Fluent Commerce strategically combined AWS-based upgrade approaches—including snapshot restores and AWS DMS ongoing replication—to upgrade their 32 TB Aurora PostgreSQL databases with minimal downtime.
In this post, we explore a pragmatic and cost-effective approach to achieving near-zero downtime during database upgrades: using snapshot and restore, followed by continuous replication with AWS DMS. By the end of this post, you will be equipped with insights on how to navigate database upgrades while maintaining uninterrupted service and optimized performance as your organization scales and evolves.
Migration overview
Aurora PostgreSQL-Compatible offers several upgrade paths to its customers, including in-place upgrades and blue/green deployments. Given Fluent's specific requirements and constraints, they implemented a combined approach using native logical replication and AWS DMS. Fluent Commerce used this approach to upgrade over 350 production Aurora PostgreSQL databases, some as large as 32 TB, serving some of the top global ecommerce clients with strict downtime requirements.

AWS DMS has been a preferred choice for both database upgrade and Graviton migration in production due to the following benefits:
- It minimizes data transfer costs, making it a cost-effective alternative compared to third-party services, which can be prohibitively expensive.
- AWS DMS integrates with native AWS services, aligning with existing DevOps and data pipelines. This integration encompasses AWS IAM Identity Center, AWS CloudFormation, and continuous integration and continuous deployment (CI/CD) processes.
- AWS DMS provides change data capture (CDC) capabilities, enabling ongoing data replication during cutover periods with minimal or near-zero downtime. AWS DMS also includes comprehensive monitoring features, using Amazon CloudWatch and AWS CloudTrail logs for real-time metrics and notifications, which allows for proactive management and troubleshooting of the upgrade and migration tasks.
Solution overview
Fluent Commerce took the following approach to achieve a near-zero downtime migration:
- Configure the source database parameter to enable replication slots.
- Create replication slots in the source database to create a checkpoint.
- Take a snapshot of the source database. (Fluent Commerce didn't use the Aurora clone feature because they were migrating the database clusters across accounts; the original clusters used the default AWS Key Management Service (AWS KMS) keys, and Fluent used CI/CD with AWS CloudFormation to deploy the newly restored clusters in their own distinct accounts.)
- Copy the snapshot using the new KMS key and restore the snapshot to the target database.
- Drop the replication slots on the target database.
- Perform the necessary prerequisites for the major version upgrade (for example, upgrade PostgreSQL extensions such as PostGIS and pg_repack).
- Use AWS DMS to replicate the data from the source to the target.
- Perform data validation between the source and target databases using AWS DMS.
- Cut over the application from the source to the target.
- Perform post-cutover tasks such as decommissioning the old environment, backing up the new environment, and end-to-end validation and testing.
The following diagram illustrates the solution architecture.
The following section outlines the steps taken to upgrade the database and update the instance class with minimal downtime.
Configure the source database
Complete the following steps to configure the source database (a consolidated code sketch of these steps follows the replication slot diagram below):
- Configure the parameters for ongoing replication and CDC.
- Set up the pglogical extension.
- Configure the node.
- Create a replication slot. Note the replication slot name; you will need it later when you configure the AWS DMS source endpoint.
- Get a confirmed flushed logical sequence number (LSN).
- Create two replication sets using the pglogical.create_replication_set function:
  - The first replication set tracks updates and deletes for tables that have primary keys.
  - The second replication set tracks only inserts, and has the same name as the first replication set, with the added prefix i.
Replication slots are a marker in the PostgreSQL write-ahead log (WAL) so that AWS DMS can identify the correct starting point for ingesting data from the source database. Due to the extremely high DML transactions on the database, Fluent Commerce used two replication sets for each replication slot to improve granular control and overall migration performance. The first replication set tracks updates and deletes for tables that have primary keys. The second replication set tracks only inserts. It is essential for all tables involved in the migration to have primary keys and be included in these two sets to provide data integrity and consistency during the migration process. For more information, refer to Using a PostgreSQL database as an AWS DMS source.
The following diagram illustrates the replication slot architecture.
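The following is a consolidated sketch of the source-side setup, assuming a psql session to the source writer endpoint. The node, slot, and replication set names (source_node, fluent_slot, migration_set, imigration_set) are illustrative, not the names Fluent Commerce used, and the connecting user needs the rds_replication role.

```bash
# Prerequisite (parameter groups): set rds.logical_replication to 1 and include
# pglogical in shared_preload_libraries, then reboot the writer instance.

psql -h <source-endpoint> -U <user> -d <dbname> <<'SQL'
-- Set up the pglogical extension and configure the node.
CREATE EXTENSION IF NOT EXISTS pglogical;

SELECT pglogical.create_node(
  node_name := 'source_node',
  dsn       := 'host=<source-endpoint> port=5432 dbname=<dbname>'
);

-- Create a replication slot with the pglogical output plugin.
-- Note the slot name; it is needed for the AWS DMS source endpoint.
SELECT * FROM pg_create_logical_replication_slot('fluent_slot', 'pglogical');

-- Capture the confirmed flushed LSN (the CDC starting point for AWS DMS).
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'fluent_slot';

-- Replication set that tracks updates and deletes for tables with primary keys.
SELECT pglogical.create_replication_set(
  set_name           := 'migration_set',
  replicate_insert   := false,
  replicate_update   := true,
  replicate_delete   := true,
  replicate_truncate := false
);

-- Replication set that tracks inserts only (same name, prefixed with "i").
SELECT pglogical.create_replication_set(
  set_name           := 'imigration_set',
  replicate_insert   := true,
  replicate_update   := false,
  replicate_delete   := false,
  replicate_truncate := false
);
SQL
```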
- Add tables to the replication set that tracks updates and deletes. You can add one table at a time, or query the catalog tables to generate the SQL that adds all tables to the replication set in one step.
- Add tables to the replication set that tracks inserts only, again either one table at a time or for all tables at once with a generated SQL query. Both approaches are shown in the sketch that follows.
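The following is a minimal sketch of both approaches, assuming the replication set names from the previous step (migration_set and imigration_set) and an application schema named app; adjust the schema filter and table names to match your environment.

```bash
psql -h <source-endpoint> -U <user> -d <dbname> <<'SQL'
-- Add a single table to the update/delete replication set.
SELECT pglogical.replication_set_add_table(
  set_name         := 'migration_set',
  relation         := 'app.orders',
  synchronize_data := false
);

-- Generate the statements for every table with a primary key in one step.
SELECT format(
  'SELECT pglogical.replication_set_add_table(%L, %L, false);',
  'migration_set',
  n.nspname || '.' || c.relname
)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'app'
  AND EXISTS (SELECT 1 FROM pg_constraint p WHERE p.conrelid = c.oid AND p.contype = 'p');

-- Add a single table to the insert-only replication set.
SELECT pglogical.replication_set_add_table(
  set_name         := 'imigration_set',
  relation         := 'app.orders',
  synchronize_data := false
);

-- Generate the statements for all tables in one step.
SELECT format(
  'SELECT pglogical.replication_set_add_table(%L, %L, false);',
  'imigration_set',
  n.nspname || '.' || c.relname
)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'app';
SQL
```

Run the generated statements after reviewing them so that every table is a member of the appropriate set.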
Create a snapshot
Use the following code to create a snapshot of the source database. Note the name of the snapshot and make sure you share it with the target AWS account, because you will migrate this database to a new AWS account.
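A minimal AWS CLI sketch of this step follows; because the source is an Aurora cluster, it assumes the cluster-level snapshot command.

```bash
# Create a manual snapshot of the source Aurora cluster.
aws rds create-db-cluster-snapshot \
  --db-cluster-snapshot-identifier <snapshot-identifier> \
  --db-cluster-identifier <db-cluster-identifier>

# Wait until the snapshot is available before sharing it.
aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier <snapshot-identifier>
```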
This code uses the following placeholders:
- <snapshot-identifier> – The identifier for the DB snapshot that you want to create.
- <db-cluster-identifier> – The identifier of the DB cluster to create the snapshot of. It must match the identifier of an existing DB cluster.
Share snapshots
If you need to migrate into another account, you can use the following code to share your snapshots across accounts:
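A minimal sketch, again assuming a cluster-level snapshot; sharing works by adding the target account ID to the snapshot's restore attribute.

```bash
# Share the manual cluster snapshot with the target AWS account.
aws rds modify-db-cluster-snapshot-attribute \
  --db-cluster-snapshot-identifier <snapshot-identifier> \
  --attribute-name restore \
  --values-to-add <account-id-to-share-with> \
  --region <region> \
  --profile <profile-name>
```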
This code uses the following placeholders:
- <snapshot-identifier> – The identifier for the DB snapshot that you want to share.
- <account-id-to-share-with> – The AWS account ID with which you want to share the snapshot. You can specify multiple account IDs separated by spaces.
- <region> – The AWS Region where the snapshot is located (for example, us-west-2).
- <profile-name> – The name of the AWS Command Line Interface (AWS CLI) profile to use for the command. This is optional if you're using the default profile.
If you use a custom KMS key, you must allow cross-account access. After the snapshot is shared, you can copy the snapshot using a new KMS key in the target account:
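The following sketch assumes the copy is run from the target account against the shared snapshot ARN, with a KMS key that already exists in the target account; the <source-account-id>, <target-kms-key-id>, and <target-profile-name> placeholders are illustrative.

```bash
# In the target account: copy the shared snapshot, re-encrypting it with the
# target account's KMS key.
aws rds copy-db-cluster-snapshot \
  --source-db-cluster-snapshot-identifier arn:aws:rds:<region>:<source-account-id>:cluster-snapshot:<snapshot-identifier> \
  --target-db-cluster-snapshot-identifier <snapshot-identifier>-copy \
  --kms-key-id <target-kms-key-id> \
  --region <region> \
  --profile <target-profile-name>
```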
Restore the target database
Use the following code to restore the target database:
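A minimal sketch follows; it assumes the snapshot copy from the previous step and adds a DB instance afterward, because restoring a cluster snapshot creates the cluster only. The <new-db-instance-identifier> and <db-instance-class> placeholders are illustrative.

```bash
# Restore a new Aurora cluster in the target account from the copied snapshot.
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier <new-db-cluster-identifier> \
  --snapshot-identifier <snapshot-identifier> \
  --engine <engine> \
  --region <region> \
  --profile <profile-name>

# Add a writer instance so the restored cluster can accept connections.
aws rds create-db-instance \
  --db-instance-identifier <new-db-instance-identifier> \
  --db-cluster-identifier <new-db-cluster-identifier> \
  --db-instance-class <db-instance-class> \
  --engine <engine>
```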
This code uses the following placeholders:
- <new-db-cluster-identifier> – The identifier for the new DB cluster that you want to create from the snapshot.
- <snapshot-identifier> – The identifier for the DB snapshot that you want to restore from.
- <engine> – The name of the database engine to be used for the new DB cluster (for example, aurora, aurora-mysql, or aurora-postgresql).
- <region> – The Region where the snapshot is located (for example, us-west-2).
- <profile-name> – The name of the AWS CLI profile to use for the command. This is optional if you're using the default profile.
Upgrade the target database
Because there is no direct upgrade route available from version 10.20 to 14.3, our initial step is to upgrade the target database to a minor version that does offer a direct upgrade path to 14.3. For a comprehensive list of compatible versions, refer to Upgrading Amazon Aurora PostgreSQL DB clusters.
In this case, we upgrade from 10.20 to 10.21, and from 10.21 to 14.3.
- To upgrade from 10.20 to 10.21, update the CloudFormation template with the following:
RDSEngineVersion: 10.21
- To upgrade from 10.21 to 14.3, update the CloudFormation template with the following:
RDSEngineVersion: 14.3
- After you've upgraded Aurora PostgreSQL-Compatible to version 14.3, change the instance type (to a supported Graviton instance) and set the custom parameters to true.
- Update the target database cluster and instance parameters and make sure max_worker_processes is configured. This value must be greater than or equal to the number of databases you intend to migrate to this cluster. For example, if you're migrating 10 databases from the source to the target, set this value to a minimum of 10. A sketch of these changes follows this list.
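Fluent Commerce drove these changes through their CloudFormation templates; as an illustration only, the equivalent AWS CLI calls look like the following, where the identifiers, the db.r6g.4xlarge class, and the value of 16 are assumptions.

```bash
# Move the restored writer instance to a Graviton instance class.
aws rds modify-db-instance \
  --db-instance-identifier <new-db-instance-identifier> \
  --db-instance-class db.r6g.4xlarge \
  --apply-immediately

# Make sure max_worker_processes covers the number of databases being migrated.
aws rds modify-db-parameter-group \
  --db-parameter-group-name <target-instance-parameter-group> \
  --parameters "ParameterName=max_worker_processes,ParameterValue=16,ApplyMethod=pending-reboot"
```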
Configure AWS DMS
In this section, we focus on the configuration of the AWS DMS environment, including the replication instance and replication tasks. For more information, refer to Set up replication for AWS Database Migration Service.
When configuring the source endpoint for the source PostgreSQL cluster, you must specify the replication slot name that you created earlier.
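For example, the slot name and the pglogical plugin can be passed in the endpoint's PostgreSQL settings. The following AWS CLI sketch uses illustrative identifiers and passes credentials directly; AWS Secrets Manager is another option.

```bash
# Create the AWS DMS source endpoint, pointing it at the existing replication slot.
aws dms create-endpoint \
  --endpoint-identifier fluent-source-pg \
  --endpoint-type source \
  --engine-name aurora-postgresql \
  --server-name <source-endpoint> \
  --port 5432 \
  --database-name <dbname> \
  --username <user> \
  --password <password> \
  --postgre-sql-settings '{"SlotName":"fluent_slot","PluginName":"pglogical"}'
```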
The following are key considerations for configuring your AWS DMS infrastructure for large-scale database migration with minimal downtime:
- Separate the validation and AWS DMS main migration tasks to minimize risk and overhead of migration failure. This also helps improve the throughput of the data migration task.
- If you have LOB data types in the database, consider setting the max LOB size when migration latency is high. This helps increase throughput.
- Place the AWS DMS instance in the same Availability Zone as the source database to reduce latency and cross-Availability Zone data transfer costs.
- The validation task uses memory on the replication instance; the more memory you have, the quicker the validations can occur. Choosing an instance size with adequate memory makes sure the validation process is fast and transitions efficiently out of PENDING status.
- To achieve rapid data transfer from the source to the target, choose a replication instance with higher CPU and network throughput, preferably an r6i instance for its higher memory, and oversize the replication instance for faster migration.
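As an illustration of separating data movement from validation, the sketch below creates a CDC-only task that starts from the confirmed flushed LSN captured earlier, plus a separate validation-only task; the ARNs, table mappings file, and LSN value are assumptions.

```bash
# CDC-only migration task, starting from the LSN captured on the source.
aws dms create-replication-task \
  --replication-task-identifier fluent-cdc-task \
  --source-endpoint-arn <source-endpoint-arn> \
  --target-endpoint-arn <target-endpoint-arn> \
  --replication-instance-arn <replication-instance-arn> \
  --migration-type cdc \
  --cdc-start-position "<confirmed-flush-lsn>" \
  --table-mappings file://table-mappings.json \
  --replication-task-settings '{"ValidationSettings":{"EnableValidation":false}}'

# Separate validation-only task over the same table mappings.
aws dms create-replication-task \
  --replication-task-identifier fluent-validation-task \
  --source-endpoint-arn <source-endpoint-arn> \
  --target-endpoint-arn <target-endpoint-arn> \
  --replication-instance-arn <replication-instance-arn> \
  --migration-type full-load \
  --table-mappings file://table-mappings.json \
  --replication-task-settings '{"ValidationSettings":{"EnableValidation":true,"ValidationOnly":true}}'
```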
Track the migration progress
In this section, we discuss different methods to track the migration progress.
Data ingest start time
If you want to track what time the ingest starts, use the following script. The script loops over the AWS DMS describe-table-statistics output for the replication task and tracks when data is inserted into any of the tables. The last timestamp is the time that the data started ingesting into the target database.
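A minimal bash sketch of such a loop follows; the task ARN and polling interval are assumptions.

```bash
#!/usr/bin/env bash
# Poll the DMS task's table statistics until inserts appear on any table.
TASK_ARN="<replication-task-arn>"

while true; do
  # List tables that have received at least one insert, with their last update time.
  aws dms describe-table-statistics \
    --replication-task-arn "$TASK_ARN" \
    --query 'TableStatistics[?Inserts>`0`].[SchemaName,TableName,LastUpdateTime]' \
    --output text
  echo "checked at $(date -u +%FT%TZ)"
  sleep 60
done
```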
Monitor validation tasks
The following script performs data verification between a source and target database for specified tables. It reads SQL queries from a file (verify-${DBTYPE}-sql.txt), executes them on both the source and target databases, compares the results, and logs any differences:
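A simplified sketch of such a script follows; the connection variables, query file format (one query per line), and log location are assumptions.

```bash
#!/usr/bin/env bash
# Run each verification query against the source and target and log any mismatches.
DBTYPE="${DBTYPE:-postgres}"
QUERY_FILE="verify-${DBTYPE}-sql.txt"
LOG_FILE="verify-differences.log"

while IFS= read -r query; do
  [ -z "$query" ] && continue
  src=$(psql "host=$SOURCE_HOST dbname=$DBNAME user=$DBUSER" -At -c "$query")
  tgt=$(psql "host=$TARGET_HOST dbname=$DBNAME user=$DBUSER" -At -c "$query")
  if [ "$src" != "$tgt" ]; then
    echo "MISMATCH: $query | source=$src | target=$tgt" | tee -a "$LOG_FILE"
  fi
done < "$QUERY_FILE"
```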
Source ingestion rate
Using a custom script to track ingestion rates at the source and target databases offers a more granular view of data flow in near real time and provides specific insights into row counts and time intervals. The following command monitors the data ingestion rate for a table (source_table) based on records created in the last 5 minutes:
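A sketch of the source-side check; it assumes the table has a created_on timestamp column and repeats the count every minute.

```bash
# Count rows created on the source table in the last 5 minutes, repeating every minute.
while true; do
  psql "host=$SOURCE_HOST dbname=$DBNAME user=$DBUSER" -At -c \
    "SELECT now(), count(*) FROM source_table WHERE created_on > now() - interval '5 minutes';"
  sleep 60
done
```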
Target ingestion rate
This command is similar to the preceding example for the source database, but runs against the target database. You can then compare the insert rates between the source and target to determine how quickly AWS DMS is ingesting data relative to the application's API ingest on the source. Note that this is not foolproof; it's merely a means of tracking the ingest rates at a high level.
CDC latency
To troubleshoot latency issues in AWS DMS, you can also monitor the CDCLatencySource and CDCLatencyTarget metrics in CloudWatch. For further details, refer to Types of CDC latency.
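For example, the metrics can be pulled with the AWS CLI; the identifiers below are assumptions.

```bash
# Fetch the last hour of CDC source latency for a replication task, at 5-minute resolution.
aws cloudwatch get-metric-statistics \
  --namespace "AWS/DMS" \
  --metric-name CDCLatencySource \
  --dimensions Name=ReplicationInstanceIdentifier,Value=<replication-instance-id> \
               Name=ReplicationTaskIdentifier,Value=<replication-task-id> \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 \
  --statistics Average
```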
Perform the cutover
After the target database is caught up with the source and you can confirm that the data has been migrated successfully and validation completed, you can cut over the application. The following are some key considerations before cutting over:
- Configuration changes – The application retrieves part of its database configuration from an internal table, specifically the dbhost value, which controls the database connection. By updating the dbhost in the application's configuration, you can repoint it to a new database (such as AURORA02). This avoids confusion about the details of where and how the application table is stored, simplifying the process to just modifying the configuration.
- Bumping the sequence – Begin by identifying all sequences in the target database (the one being cut over to). Next, generate a query to increment the sequence values. For this scenario, we've chosen an arbitrary offset of 500,000, though the exact number can vary. The important factor is to make sure the source database doesn't catch up to the target during the cutover process. The larger the offset, the safer the operation; opt for a higher value to provide a buffer. A sketch of this step follows the list.
- Vacuum and analyze tables – We run VACUUM and ANALYZE on the tables on the target DB instance to ensure optimal performance.
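A sketch of the sequence bump, assuming PostgreSQL 10 or later (for the pg_sequences view) and the arbitrary 500,000 offset mentioned above:

```bash
psql "host=$TARGET_HOST dbname=$DBNAME user=$DBUSER" -At <<'SQL'
-- Generate one ALTER SEQUENCE statement per sequence, offset by 500,000.
SELECT format(
  'ALTER SEQUENCE %I.%I RESTART WITH %s;',
  schemaname,
  sequencename,
  COALESCE(last_value, 0) + 500000
)
FROM pg_sequences;
SQL
```

Review the generated statements and run them on the target before repointing the application.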
Conclusion
In this post, we showed you how Fluent Commerce used AWS DMS with ongoing replication and data validation to minimize disruptions during peak business hours, delivering a seamless experience for their global ecommerce customers. By using snapshot and restore as a precursor to CDC, they enabled a more cost-effective and efficient migration, which is especially beneficial for large-scale databases.

Fluent Commerce's results and performance metrics highlighted the advantages of this approach, demonstrating significantly reduced migration times compared to traditional full load methods. The combination of snapshot and CDC not only accelerated the process but also enhanced reliability, enabling continuous replication with minimal impact on source database performance.
What are your experiences with database migrations? Share your thoughts in the comments section.