AWS Database Blog

Fluent Commerce’s approach to near-zero downtime Amazon Aurora PostgreSQL upgrade at 32 TB scale using snapshots and AWS DMS ongoing replication

This is a guest post by Adrian Cook, DevOps Engineer at Fluent Commerce, in partnership with AWS.

Fluent Commerce, an omnichannel commerce platform, offers order management solutions that enable businesses to deliver seamless shopping experiences across various channels. Fluent uses Amazon Aurora PostgreSQL-Compatible Edition as its high-performance online transaction processing (OLTP) database engine to process their customers’ intricate search queries efficiently. The system handles multiple complex database queries while maintaining consistent, reliable, and scalable operations. Aurora PostgreSQL-Compatible helps Fluent deliver the desired database performance, so they can meet the demanding needs of their clientele with ease.

As the business embarked on its journey of global expansion and the continual onboarding of new customers, it became evident that a strategic shift was in order to support the ever-expanding geographic footprint. The success of this ambitious endeavor hinged on a critical factor: improving cost-efficiency across the organization.

Amazon Relational Database Service (Amazon RDS) and Amazon Aurora played a pivotal role in this transformation. The catalyst for change arrived in the form of the AWS Graviton instance type, promising up to 20% performance improvement. Fluent Commerce made the strategic decision to migrate all their instances to Graviton, recognizing the potential it held for streamlining operations on a global scale with improved right-sizing. This approach allowed for better resource allocation and enhanced performance while maintaining efficient tenant density across instances.

Migration to the Graviton2 instance family was a straightforward process, but it required a major version upgrade from Aurora PostgreSQL-Compatible 10.14 to 12.4 and higher. Upgrading databases can be highly disruptive, especially for Fluent Commerce, which supports some of the world’s largest ecommerce platforms. Managing upgrades easily, with near-zero downtime, was challenging, given the immense size of their databases, which can reach up to 32 TB. Fortunately, AWS offers several technical approaches for database upgrades, including in-place upgrades, native replication, Amazon RDS blue/green deployment, and AWS Database Migration Service (AWS DMS). Fluent Commerce strategically combined AWS-based upgrade approaches—including snapshot restores and AWS DMS ongoing replication—to upgrade their 32 TB Aurora PostgreSQL databases with minimal downtime.

In this post, we explore a pragmatic and cost-effective approach to achieve near-zero downtime during database upgrades. We walk through the snapshot and restore method followed by continuous replication using AWS DMS. By the end of this post, you will be equipped with insights on how to navigate database upgrades, maintaining uninterrupted service and optimized performance as your organization scales and evolves.

Migration overview

Aurora PostgreSQL-Compatible offers several upgrade paths to its customers, including in-place upgrades and blue/green deployments. Given Fluent’s specific requirements and constraints, they implemented a combined approach using native logical replication and AWS DMS.

Fluent Commerce used this approach to upgrade over 350 production Aurora PostgreSQL databases, some as large as 32 TB, serving some of the top global ecommerce clients with strict downtime requirements.

AWS DMS has been a preferred choice for both the database upgrade and the Graviton migration in production due to the following benefits:

  • It minimizes data transfer costs, making it a cost-effective alternative compared to third-party services, which can be prohibitively expensive.
  • AWS DMS integrates with native AWS services, aligning with existing DevOps and data pipelines. This integration encompasses AWS IAM Identity Center, AWS CloudFormation, and continuous integration and continuous deployment (CI/CD) processes.
  • AWS DMS provides change data capture (CDC) capabilities, enabling ongoing data replication during cutover periods with minimal or near-zero downtime.
  • AWS DMS includes comprehensive monitoring features, using Amazon CloudWatch and AWS CloudTrail logs for real-time metrics and notifications, which allows for proactive management and troubleshooting of the upgrade and migration tasks.

Solution overview

Fluent Commerce took the following approach to achieve a near-zero downtime migration:

  1. Configure the source database parameter to enable replication slots.
  2. Create replication slots in the source database to create a checkpoint.
  3. Take a snapshot of the source database. (Fluent Commerce didn’t use the Aurora clone feature because they were migrating the database clusters across accounts; the original clusters used the default AWS Key Management Service (AWS KMS) keys, and CI/CD with AWS CloudFormation was used to deploy the newly restored clusters into their own distinct accounts.)
  4. Copy the snapshot using the new KMS key and restore the snapshot to the target database.
  5. Drop the replication slots on the target database.
  6. Perform necessary prerequisites for major version upgrade (for example, upgrade PostgreSQL extensions like PostGIS and pg_repack).
  7. Use AWS DMS to replicate the data from the source to the target.
  8. Perform data validation between the source and target databases using AWS DMS.
  9. Cut over the application from the source to the target.
  10. Perform post-cutover tasks such as decommissioning the old environment, backing up the new environment, and end-to-end validation and testing.

The following diagram illustrates the solution architecture.

The following section outlines the steps taken to upgrade the database and update the instance class with minimal downtime.

Configure the source database

Complete the following steps to configure the source database:

  1. Configure the parameters for ongoing replication and CDC (a CLI sketch for applying these settings through the cluster parameter group follows this list):
    max_worker_processes = <value>             # Source: 1 per database being replicated; target: 1 per node
    max_logical_replication_workers = <value>
    max_replication_slots = <value>            # Equal to the number of replication tasks you plan to run
    max_parallel_workers = <value>
    max_wal_senders = <value>                  # Maximum number of concurrent connections allowed from all kinds of streaming replicas
    wal_sender_timeout = 0
    rds.logical_replication = 1
    shared_preload_libraries = pglogical
  2. Use the following code to set up the pglogical extension:
    # Install the pglogical extension
    database_name => CREATE EXTENSION pglogical;
  3. Configure the node:
    # Create pglogical node
    database_name=> SELECT pglogical.create_node(node_name := 'database_name_node',
    	dsn := 'host=dmt-test-core-cluster.cluster-ID.ap-southeast-2.rds.amazonaws.com
    	dbname=database_name port=5432 user=dmt_user password=<PASSWORD>');
  4. Create a replication slot:
    database_name=> SELECT * FROM pg_create_logical_replication_slot('database_name_slot', 'pglogical');

    Note this replication slot name; you will need it later when you configure the AWS DMS source endpoint.

  5. Get the confirmed flushed log sequence number (LSN) for the replication slot:
    database_name=> SELECT slot_name, slot_type, plugin, database, active, confirmed_flush_lsn FROM pg_replication_slots;
  6. Create two replication sets using the pglogical.create_replication_set function:
    1. The first replication set tracks updates and deletes for tables that have primary keys.
    2. The second replication set tracks only inserts, and has the same name as the first replication set, with the added prefix i.
    database_name=> SELECT pglogical.create_replication_set('database_name_slot', false, true, true, false);
    database_name=> SELECT pglogical.create_replication_set('idatabase_name_slot', true, false, false, true);

    A replication slot marks a position in the PostgreSQL write-ahead log (WAL) so that AWS DMS can identify the correct starting point for ingesting data from the source database. Due to the extremely high volume of DML transactions on the database, Fluent Commerce used two replication sets for each replication slot to improve granular control and overall migration performance. The first replication set tracks updates and deletes for tables that have primary keys. The second replication set tracks only inserts. It is essential for all tables involved in the migration to have primary keys and be included in these two sets to maintain data integrity and consistency during the migration process. For more information, refer to Using a PostgreSQL database as an AWS DMS source.

    The following diagram illustrates the replication slot architecture.

  7. Add tables to the replication set that tracks updates and deletes. The following query shows how to add one table to the replication set:
    SELECT pglogical.replication_set_add_table('database_name_slot', 'schema_name.my_table');

    You can also query the catalog table to generate SQL for adding tables to the replication set in one step:

    database_name=> SELECT E'SELECT pglogical.replication_set_add_table(\'database_name_slot\', \''
    ||tab.table_schema||'.'||tab.table_name||E'\');'
    FROM information_schema.tables tab
    LEFT JOIN information_schema.table_constraints tco
    ON tco.table_schema = tab.table_schema
    AND tco.table_name = tab.table_name
    AND tco.constraint_type = 'PRIMARY KEY'
    LEFT JOIN information_schema.key_column_usage kcu
    ON kcu.constraint_name = tco.constraint_name
    AND kcu.constraint_schema = tco.constraint_schema
    AND kcu.constraint_name = tco.constraint_name
    WHERE tab.table_schema NOT IN ('pg_catalog', 'information_schema')
    AND tab.table_type = 'BASE TABLE' AND tco.constraint_name IS NOT NULL
    GROUP BY tab.table_schema,
    tab.table_name,
    tco.constraint_name
    ORDER BY tab.table_schema,
    tab.table_name;
  8. Add tables to the replication set that tracks inserts only:
    SELECT pglogical.replication_set_add_table('idatabase_name_slot', 'schema_name.my_table');

    Similarly, you can add all tables to replication sets in one step using the following SQL query:

    database_name=> SELECT E'SELECT pglogical.replication_set_add_table(\'idatabase_name_slot\', \''
      ||tab.table_schema||'.'||tab.table_name||E'\');'
    FROM information_schema.tables tab
    LEFT JOIN information_schema.table_constraints tco
              ON tco.table_schema = tab.table_schema
              AND tco.table_name = tab.table_name
              AND tco.constraint_type = 'PRIMARY KEY'
    LEFT JOIN information_schema.key_column_usage kcu 
              ON kcu.constraint_name = tco.constraint_name
              AND kcu.constraint_schema = tco.constraint_schema
              AND kcu.constraint_name = tco.constraint_name
    WHERE tab.table_schema NOT IN ('pg_catalog', 'information_schema')
          AND tab.table_type = 'BASE TABLE' AND tco.constraint_name IS NOT NULL
    GROUP BY tab.table_schema,
             tab.table_name,
             tco.constraint_name
    ORDER BY tab.table_schema,
             tab.table_name;
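
The parameters in step 1 are applied through the DB cluster parameter group (or the instance parameter group, depending on the parameter) rather than per session. The following is a minimal sketch of applying a subset of them with the AWS CLI; the parameter group name and values are placeholders, and the static parameters take effect only after the writer instance is rebooted.

# Assumes a custom cluster parameter group named "source-cluster-params" is already
# attached to the source cluster; values are illustrative only.
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name source-cluster-params \
    --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot" \
                 "ParameterName=shared_preload_libraries,ParameterValue=pglogical,ApplyMethod=pending-reboot" \
                 "ParameterName=max_replication_slots,ParameterValue=10,ApplyMethod=pending-reboot" \
                 "ParameterName=max_wal_senders,ParameterValue=20,ApplyMethod=pending-reboot" \
                 "ParameterName=wal_sender_timeout,ParameterValue=0,ApplyMethod=pending-reboot"

# Reboot the writer instance so the static parameters take effect.
aws rds reboot-db-instance --db-instance-identifier <source-writer-instance-identifier>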

Create a snapshot

Use the following code to create a snapshot of the source database cluster. Note the name of the snapshot and make sure you share it with the target AWS account, because you will migrate the database to a new AWS account.

aws rds create-db-cluster-snapshot --db-cluster-snapshot-identifier <snapshot-identifier> --db-cluster-identifier <db-cluster-identifier>

This code uses the following placeholders:

  • <snapshot-identifier> – The identifier for the DB cluster snapshot that you want to create.
  • <db-cluster-identifier> – The identifier of the DB cluster to create the snapshot of. It must match the identifier of an existing DB cluster.
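
Creating a snapshot of a multi-terabyte cluster can take some time. The following is a minimal sketch that blocks until the snapshot is available before you share it; the identifier is a placeholder, and the standard AWS CLI waiter gives up after roughly 30 minutes, so for very large snapshots you may need to re-run it or poll describe-db-cluster-snapshots instead.

aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier <snapshot-identifier>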

Share snapshots

If you need to migrate into another account, you can use the following code to share your snapshots across accounts:

aws rds modify-db-cluster-snapshot-attribute \
    --db-cluster-snapshot-identifier <snapshot-identifier> \
    --attribute-name restore \
    --values-to-add <account-id-to-share-with> \
    --region <region> \
    --profile <profile-name>

This code uses the following placeholders:

  • <snapshot-identifier> – The identifier for the DB cluster snapshot that you want to share.
  • <account-id-to-share-with> – The AWS account ID with which you want to share the snapshot. You can specify multiple account IDs separated by spaces.
  • <region> – The AWS Region where the snapshot is located (for example, us-west-2).
  • <profile-name> – The name of the AWS Command Line Interface (AWS CLI) profile to use for the command. This is optional if you’re using the default profile.

If you use a custom KMS key, you must grant the target account access to that key. After the snapshot is shared, you can copy the snapshot using a new KMS key in the target account:

aws rds copy-db-cluster-snapshot \
  --source-db-cluster-snapshot-identifier <original-snapshot-identifier> \
  --target-db-cluster-snapshot-identifier <new-snapshot-identifier> \
  --kms-key-id <target-kms-key-id>
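
Before copying, you can confirm that the shared snapshot is visible from the target account. A small sketch, run with an AWS CLI profile for the target account (the profile name is a placeholder):

aws rds describe-db-cluster-snapshots \
    --include-shared \
    --snapshot-type shared \
    --profile <target-account-profile>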

Restore the target database

Use the following code to restore the target database:

aws rds restore-db-cluster-from-snapshot \
--db-cluster-identifier <new-db-cluster-identifier> \
--snapshot-identifier <snapshot-identifier> \
--engine <engine> \
--region <region> \
--profile <profile-name> \
[--optional-parameters]

This code uses the following placeholders:

  • <new-db-cluster-identifier> – The identifier for the new DB cluster that you want to create from the snapshot.
  • <snapshot-identifier> – The identifier for the DB snapshot that you want to restore from.
  • <engine> – The name of the database engine to be used for the new DB cluster (for example, aurora, aurora-mysql, or aurora-postgresql).
  • <region> – The Region where the snapshot is located (for example, us-west-2).
  • <profile-name> – The name of the AWS CLI profile to use for the command. This is optional if you’re using the default profile.
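
Note that restoring an Aurora cluster from a snapshot creates only the cluster; you still need to add at least one DB instance before applications can connect. A minimal sketch, with placeholder identifiers and an instance class chosen purely for illustration:

aws rds create-db-instance \
    --db-instance-identifier <new-db-instance-identifier> \
    --db-cluster-identifier <new-db-cluster-identifier> \
    --db-instance-class db.r6g.4xlarge \
    --engine aurora-postgresql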

Upgrade the target database

Because there is no direct upgrade route available from version 10.20 to 14.3, our initial step is to upgrade the target database to a minor version that does offer a direct upgrade path to 14.3. For a comprehensive list of compatible versions, refer to Upgrading Amazon Aurora PostgreSQL DB clusters.

In this case, we upgrade from 10.20 to 10.21, and from 10.21 to 14.3.

  1. To upgrade from 10.20 to 10.21, update the CloudFormation template with the following:
    RDSEngineVersion: 10.21
  2. To upgrade from 10.21 to 14.3, update the CloudFormation template with the following:
    RDSEngineVersion: 14.3
  3. After you’ve upgraded Aurora PostgreSQL-Compatible to version 14.3, change the instance class to a supported Graviton instance and set the custom parameters flag to true:
    RDSInstanceClass: db.r6g.4xlarge
    RDSEngineVersion: 14.3
    RDSEnableCustomParameters: true
  4. Update the target database cluster and instance parameters and make sure MAX_WORKER_PROCESSES is configured. This must be greater than or equal to the number of databases you intend to migrate to this cluster. For example, if you’re migrating 10 databases from the source to the target, then set this value to a minimum of 10.
    # On the TARGET database
    database_name=> show MAX_WORKER_PROCESSES; 
    
    max_worker_processes
    ----------------------
    32
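
Fluent Commerce drove these upgrades through their CloudFormation-based CI/CD pipeline. If you manage clusters directly, a roughly equivalent AWS CLI call is a modify-db-cluster with the major version upgrade flag, repeated for each hop (10.20 to 10.21, then to 14.3); the identifier is a placeholder:

aws rds modify-db-cluster \
    --db-cluster-identifier <new-db-cluster-identifier> \
    --engine-version 14.3 \
    --allow-major-version-upgrade \
    --apply-immediately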

Configure AWS DMS

In this section, we focus on the configuration of the AWS DMS environment, including the replication instance and replication tasks. For more information, refer to Set up replication for AWS Database Migration Service.

When configuring the source endpoint for the source PostgreSQL cluster, you must specify the replication slot name that you created earlier.
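
The following is a sketch of how the slot name might be supplied on the AWS DMS source endpoint through the PostgreSQL endpoint settings; the endpoint identifier, server name, and credentials are placeholders, and secrets handling is simplified for brevity:

aws dms create-endpoint \
    --endpoint-identifier source-aurora-pg \
    --endpoint-type source \
    --engine-name aurora-postgresql \
    --server-name <source-cluster-endpoint> \
    --port 5432 \
    --database-name database_name \
    --username dmt_user \
    --password '<PASSWORD>' \
    --postgre-sql-settings '{"SlotName":"database_name_slot","PluginName":"pglogical"}'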

The following are key considerations for configuring your AWS DMS infrastructure for large-scale database migration with minimal downtime:

  • Separate the validation and AWS DMS main migration tasks to minimize risk and overhead of migration failure. This also helps improve the throughput of the data migration task.
  • If you have LOB data types in the database, consider using limited LOB mode and tuning the maximum LOB size when migration latency is high. This helps increase throughput.
  • Place the AWS DMS instance in the same Availability Zone as the source database to reduce latency and cross-Availability Zone data transfer costs.
  • The validation task uses memory on the replication instance; the more memory you have, the quicker the validations can occur. Choosing an instance size with adequate memory makes sure the validation process is fast and efficiently transitions from PENDING status.
  • To achieve rapid data transfer from the source to the target, it’s recommended to choose a replication instance with higher CPU and network speeds, preferably using r6i instances for higher memory, and oversize the replication instance for faster migration.
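
The following is a sketch of how the main CDC task might be created from the replication slot and LSN captured earlier; the ARNs, LSN, and table mapping file are placeholders, and the task settings shown are a minimal subset:

# Main CDC-only task, started from the native LSN captured on the source
aws dms create-replication-task \
    --replication-task-identifier database-name-cdc \
    --source-endpoint-arn <source-endpoint-arn> \
    --target-endpoint-arn <target-endpoint-arn> \
    --replication-instance-arn <replication-instance-arn> \
    --migration-type cdc \
    --cdc-start-position <confirmed_flush_lsn> \
    --table-mappings file://table-mappings.json \
    --replication-task-settings '{"ValidationSettings":{"EnableValidation":false}}'

A separate validation-only task over the same table mappings can be created the same way with ValidationSettings set to {"EnableValidation": true, "ValidationOnly": true}, which keeps validation overhead off the main migration task.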

Track the migration progress

In this section, we discuss different methods to track the migration progress.

Data ingest start time

If you want to track what time the ingest starts, use the following script. The script repeatedly calls describe-table-statistics for the AWS DMS task and detects when data is first inserted into any of the tables; the timestamp it records is the time that data started ingesting into the target database.

#!/bin/bash
# Check if task ARN is provided
if [ -z "$1" ]; then
    echo "Usage: $0 <task-arn>"
    exit 1
fi
TASK_ARN="$1"
REGION="us-east-1"
PREVIOUS_TOTAL=0
START_TIME=""
echo "Monitoring DMS task: $TASK_ARN"
echo "Waiting for ingestion to begin..."
while true; do
    # Get total inserts across all tables
    CURRENT_TOTAL=$(aws dms describe-table-statistics \
        --replication-task-arn "$TASK_ARN" \
        --region "$REGION" \
        --query 'sum(TableStatistics[].Inserts)' \
        --output text)
    # Treat missing statistics as zero to avoid integer comparison errors
    if [ -z "$CURRENT_TOTAL" ] || [ "$CURRENT_TOTAL" = "None" ]; then
        CURRENT_TOTAL=0
    fi
    # Check if ingestion has started
    if [ "$CURRENT_TOTAL" -gt "$PREVIOUS_TOTAL" ]; then
        if [ -z "$START_TIME" ]; then
            START_TIME=$(date '+%Y-%m-%d %H:%M:%S')
            echo "Ingestion started at: $START_TIME"
        fi
        echo "Total records inserted: $CURRENT_TOTAL"
    fi
    PREVIOUS_TOTAL=$CURRENT_TOTAL
    sleep 30
done

Monitor validation tasks

The following script performs data verification between a source and target database for specified tables. It reads SQL queries from a file (verify-${DBTYPE}-sql.txt), executes them on both the source and target databases, compares the results, and logs any differences:

#!/bin/bash
SOURCE_SQL_RESULT_DUMP_FILE=./temp-source-sql-data-dump.txt
TARGET_SQL_RESULT_DUMP_FILE=./temp-target-sql-data-dump.txt
DATA_VERIFY_RESULTS_FILE=./temp-data-verify-results.txt
# Print usage and exit when an unrecognized option is passed
usage() {
  echo "Usage: $0 -t <time> -s <small> -m <medium> -l <large> -d <database> -h <source-host> -p <source-port> -u <source-user> -c <source-password> -H <target-host> -P <target-port> -U <target-user> -C <target-password> -D <dbtype>"
  exit 1
}
while getopts "t:s:m:l:d:h:p:u:c:H:P:U:C:D:?" optKey; do
  case "${optKey}" in
    t) TIME=${OPTARG} ;;
    s) SMALL=${OPTARG} ;;
    m) MEDIUM=${OPTARG} ;;
    l) LARGE=${OPTARG} ;;
    d) DATABASE=${OPTARG} ;;
    h) SOURCE_HOST=${OPTARG} ;;
    p) SOURCE_PORT=${OPTARG} ;;
    u) SOURCE_USERNAME=${OPTARG} ;;
    c) SOURCE_PASSWORD=${OPTARG} ;;
    H) TARGET_HOST=${OPTARG} ;;
    P) TARGET_PORT=${OPTARG} ;;
    U) TARGET_USERNAME=${OPTARG} ;;
    C) TARGET_PASSWORD=${OPTARG} ;;
    D) DBTYPE=${OPTARG} ;;
    ?) usage ;;
  esac
done
rm -f ${SOURCE_SQL_RESULT_DUMP_FILE} ${TARGET_SQL_RESULT_DUMP_FILE} ${DATA_VERIFY_RESULTS_FILE}
created_on_offset="'$TIME'"
sql_replace() { echo "$1" | sed -e "s/\$CREATED_DATE/$created_on_offset/g" -e "s/\$COUNT1/$SMALL/g" -e "s/\$COUNT2/$MEDIUM/g" -e "s/\$COUNT3/$LARGE/g"; }
declare -a TABLE_NAMES=() SOURCE_DIFF_COUNT=() TARGET_DIFF_COUNT=() ZERO_DATA_SOURCE=() ZERO_DATA_TARGET=()
i=0
while IFS='' read -r line; do
  table=$(echo "$line" | cut -d"@" -f1)
  sql=$(sql_replace "$(echo "$line" | cut -d"@" -f2)")
  echo "Validating table: $table"
  TABLE_NAMES[$i]="$table"
  echo "$sql" | PGPASSWORD="${SOURCE_PASSWORD}" psql -h "${SOURCE_HOST}" -p "${SOURCE_PORT}" -U "${SOURCE_USERNAME}" -d "${DATABASE}" > "${SOURCE_SQL_RESULT_DUMP_FILE}"
  echo "$sql" | PGPASSWORD="${TARGET_PASSWORD}" psql -h "${TARGET_HOST}" -p "${TARGET_PORT}" -U "${TARGET_USERNAME}" -d "${DATABASE}" > "${TARGET_SQL_RESULT_DUMP_FILE}"
  SOURCE_DIFF_COUNT[$i]=$(diff -b -w $TARGET_SQL_RESULT_DUMP_FILE $SOURCE_SQL_RESULT_DUMP_FILE --new-line-format='Source missing %L' | wc -l)
  TARGET_DIFF_COUNT[$i]=$(diff -b -w $SOURCE_SQL_RESULT_DUMP_FILE $TARGET_SQL_RESULT_DUMP_FILE --new-line-format='Target missing %L' | wc -l)
  [[ $(tail -n2 "${SOURCE_SQL_RESULT_DUMP_FILE}") == "(0 rows)" ]] && ZERO_DATA_SOURCE[$i]=1 || ZERO_DATA_SOURCE[$i]=0
  [[ $(tail -n2 $TARGET_SQL_RESULT_DUMP_FILE) == "(0 rows)" ]] && ZERO_DATA_TARGET[$i]=1 || ZERO_DATA_TARGET[$i]=0
  if [[ "${SOURCE_DIFF_COUNT[$i]}" -ne 0 || "${TARGET_DIFF_COUNT[$i]}" -ne 0 ]]; then
    echo "Failure in table=$table" >> $DATA_VERIFY_RESULTS_FILE
  fi
  rm $SOURCE_SQL_RESULT_DUMP_FILE $TARGET_SQL_RESULT_DUMP_FILE
  ((i++))
done < "/var/task/bin/verify-${DBTYPE}-sql.txt"
echo "=================================================================="
printf "%-20s %-20s %-15s %-15s\n" "Table" "Result" "Source Miss" "Target Miss"
for ((idx=0; idx<${#TABLE_NAMES[@]}; ++idx)); do
  result="PASS"
  [[ "${ZERO_DATA_SOURCE[idx]}" -eq 1 && "${ZERO_DATA_TARGET[idx]}" -eq 1 ]] && result="PASS (0 rows)"
  [[ "${ZERO_DATA_SOURCE[idx]}" -eq 0 && "${ZERO_DATA_TARGET[idx]}" -eq 1 ]] && result="FAIL (Target 0 rows)"
  [[ "${SOURCE_DIFF_COUNT[idx]}" -ne 0 || "${TARGET_DIFF_COUNT[idx]}" -ne 0 ]] && result="FAIL"
  printf "%-20s %-20s %-15s %-15s\n" "${TABLE_NAMES[idx]}" "$result" "${SOURCE_DIFF_COUNT[idx]}" "${TARGET_DIFF_COUNT[idx]}"
done
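
An example invocation, assuming the script is saved as verify-data.sh; the lowercase connection flags point at the source cluster, the uppercase flags at the target, -t sets the created_on offset substituted into the queries, -s/-m/-l set the row-count limits, and -D selects which verify-<dbtype>-sql.txt file to load (all values shown are illustrative):

./verify-data.sh \
    -t '2024-01-01 00:00:00' -s 100 -m 1000 -l 10000 \
    -d database_name -h <source-endpoint> -p 5432 -u dmt_user -c '<source-password>' \
    -H <target-endpoint> -P 5432 -U dmt_user -C '<target-password>' -D <dbtype>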

Source ingestion rate

Using a custom script to track ingestion rates at the source and target databases offers a more granular view of data flow in near real time and provides specific insights into row counts and time intervals. The following command monitors the data ingestion rate for a table (source_table) based on records created in the last 5 minutes:

SELECT COUNT(id), date_trunc('minute', created_on) AS truncated_time FROM source_table WHERE created_on > NOW() - interval '5 minutes' GROUP BY truncated_time ORDER BY truncated_time;

Target ingestion rate

Run the same query against the target database. You can then compare the insert rates between the source and target to gauge how quickly AWS DMS replication is keeping up with the application’s ingest on the source. Note that this is not foolproof; it’s merely a means of tracking the ingest rates at a high level.
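
A small sketch that runs the same 5-minute ingest query against both clusters side by side; the endpoints, credentials, and table name are placeholders:

SQL="SELECT count(id), date_trunc('minute', created_on) AS truncated_time
     FROM source_table
     WHERE created_on > now() - interval '5 minutes'
     GROUP BY truncated_time ORDER BY truncated_time;"

echo "--- source ---"
PGPASSWORD='<source-password>' psql -h <source-endpoint> -U dmt_user -d database_name -c "$SQL"
echo "--- target ---"
PGPASSWORD='<target-password>' psql -h <target-endpoint> -U dmt_user -d database_name -c "$SQL"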

CDC latency

To troubleshoot latency issues in AWS DMS, you can also monitor the CDCLatencySource and CDCLatencyTarget metrics in CloudWatch. For further details, refer to Types of CDC latency.
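
A sketch for pulling recent CDCLatencyTarget datapoints from CloudWatch; the task and instance identifiers are placeholders (use the values CloudWatch shows as dimensions for your task), and the same call with CDCLatencySource covers the source side:

# Note: the -d '30 minutes ago' syntax assumes GNU date.
aws cloudwatch get-metric-statistics \
    --namespace AWS/DMS \
    --metric-name CDCLatencyTarget \
    --dimensions Name=ReplicationTaskIdentifier,Value=<replication-task-identifier> \
                 Name=ReplicationInstanceIdentifier,Value=<replication-instance-identifier> \
    --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 \
    --statistics Average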

Perform the cutover

After the target database has caught up with the source and you have confirmed that the data migrated successfully and validation is complete, you can cut over the application. The following are some key considerations before cutting over:

  • Configuration changes – The application retrieves part of its database configuration from an internal table, specifically the dbhost value, which controls the database connection. By updating the dbhost in the application’s configuration, you can repoint it to a new database (such as AURORA02). This avoids confusion about the details of where and how the application table is stored, simplifying the process to just modifying the configuration.
  • Bumping the sequence – Begin by identifying all sequence numbers in the target database (the one being cut over to). Next, generate a query to increment the sequence IDs, as sketched after this list. For this scenario, we’ve chosen an arbitrary offset of 500,000, though the exact number can vary. The important factor is to make sure the source database doesn’t catch up to the target during the cutover process. The larger the offset, the safer the operation, so opt for a higher value to provide a buffer.
  • Vacuum and analyze tables – We run VACUUM and ANALYZE on the tables on the target DB instance to help maintain optimal performance.
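
The following is a sketch of the sequence bump, run against the target database; the 500,000 offset, endpoint, and credentials are placeholders, and the pg_sequences view it relies on is available in PostgreSQL 10 and later:

# Generate setval() statements that push every user sequence forward by 500,000,
# then execute them against the target.
PGPASSWORD='<target-password>' psql -h <target-endpoint> -U dmt_user -d database_name -At -c \
  "SELECT format('SELECT setval(%L, %s);',
                 quote_ident(schemaname) || '.' || quote_ident(sequencename),
                 coalesce(last_value, 0) + 500000)
   FROM pg_sequences
   WHERE schemaname NOT IN ('pg_catalog', 'information_schema');" \
  | PGPASSWORD='<target-password>' psql -h <target-endpoint> -U dmt_user -d database_name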

Conclusion

In this post, we showed you how Fluent Commerce used AWS DMS with ongoing replication and data validation to minimize disruptions during peak business hours, delivering a seamless experience for their global ecommerce customers. By using snapshot and restore as a precursor to CDC, they enabled a more cost-effective and efficient migration—especially beneficial for large-scale databases.

Fluent Commerce’s results and performance metrics highlighted the advantages of this approach, demonstrating significantly reduced migration times compared to traditional full load methods. The combination of snapshot and CDC not only accelerated the process but also enhanced reliability, enabling continuous replication with minimal impact on source database performance.

What are your experiences with database migrations? Share your thoughts in the comments section.


About the authors

Adrian Cook

Adrian is a DevOps Engineer at Fluent Commerce, specializing in AWS services and cloud infrastructure. He played a key role in the migration and upgrade of databases, leveraging CI/CD pipelines to streamline and automate deployment processes. With a strong focus on scalability and reliability, Adrian is passionate about enabling seamless, cloud-based operations across global environments.

Roneel Kumar

Roneel is a Senior Database Specialist Solutions Architect at AWS, with deep expertise in relational database engines. He partners with customers to deliver technical guidance, optimize database operations, and implement best practices for performance, scalability, and resilience. Roneel is passionate about helping organizations modernize their data infrastructure and accelerate innovation through cloud-based architectures.

Warren Wei

Warren is a Senior Solutions Architect at AWS, where he helps customers design and build scalable, secure, and cost-effective solutions in the cloud. With a strong background in cloud architecture and application modernization, Warren partners with organizations to accelerate their digital transformation and drive long-term success. He is passionate about solving complex technical challenges and enabling innovation through cloud-based technologies.