AWS Database Blog
Monitoring multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Aurora MySQL
In our previous post, we discussed how MySQL replication and multithreaded replication (MTR) work, along with key configuration options and best practices. In this post, we discuss methods to effectively monitor parallel replication performance and tune its related parameters for Amazon Aurora MySQL, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon RDS for MariaDB.
In the following sections, we dive into several methods that MySQL offers to monitor MTR.
SHOW REPLICA STATUS
SHOW REPLICA STATUS is a good starting place to monitor and troubleshoot MTR, especially when replication fails and stops. You run this command on the replica instance. To understand each field’s meaning and interpretation, refer to How do I troubleshoot high replica lag with Amazon RDS for MySQL? in the AWS Knowledge Center and SHOW REPLICA STATUS Statement in the MySQL documentation:
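The following is a minimal example of running the command on the replica. The field names shown in the comment are the MySQL 8.0.22+ names; older releases use SHOW SLAVE STATUS with the equivalent Slave_* field names:

```sql
-- Run on the replica instance (MySQL 8.0.22+ syntax; older releases use
-- SHOW SLAVE STATUS with the equivalent Slave_* field names).
-- Fields of particular interest for MTR troubleshooting:
--   Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source,
--   Exec_Source_Log_Pos, Last_SQL_Errno, Last_SQL_Error, Replica_SQL_Running_State
SHOW REPLICA STATUS\G
```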
Some of the fields have slightly different meanings in MTR compared to single-threaded replication:
- Seconds_Behind_Source is still valid, accurate, and useful with multithreaded replication, but keep in mind that this value is based on Exec_Source_Log_Pos and might not reflect the position of the most recently committed transaction.
- When performing operations that require a cutover to the target database, you should not rely on Seconds_Behind_Source or the Amazon CloudWatch metric ReplicaLag in RDS for MySQL and RDS for MariaDB or AuroraBinlogReplicaLag in Aurora MySQL. We recommend using Amazon RDS Blue/Green Deployments, which not only automates the switchover process with minimal downtime but also provides built-in safeguards so that the replica (green) environment is fully synchronized with the source (blue) before the cutover occurs.
- Last_SQL_Errno and Last_SQL_Error only show the error of the coordinator thread, not the worker threads. That means there might be additional failures in the worker threads, which can be found in the replication_applier_status_by_worker table that shows each worker thread’s status. If that table isn’t available, the replica error log can be used. The log or the replication_applier_status_by_worker table should also be used to learn more about a failure reported by SHOW REPLICA STATUS or the coordinator table.
- Common states for the Replica_SQL_Running_State field in MTR include 'Waiting for dependent transaction to commit' or 'Waiting for preceding transaction to be committed'. These states are normal and indicate that a worker thread is waiting for a dependent transaction to complete before proceeding. However, if these states appear frequently, consider tuning the workload on the source, such as breaking large transactions into smaller ones, to improve replication parallelism and overall replication performance. Setting binlog_transaction_dependency_tracking to WRITESET can also significantly reduce these dependencies. Refer to Overview and best practices of multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Amazon Aurora MySQL for best practices.
Performance Schema tables
MySQL provides a set of Performance Schema tables to monitor replication at a deeper level than SHOW REPLICA STATUS; hence, they are recommended. Three tables that are useful for MTR monitoring are:
- replication_connection_status
- replication_applier_status_by_coordinator
- replication_applier_status_by_worker
We recommend having the Performance Schema enabled so that these tables are populated accordingly.
The replication_connection_status table shows the current status of the I/O thread that handles the replica’s connection to the source, information on the last transaction queued in the relay log, and information on the transaction currently being queued in the relay log.
The replication_applier_status_by_coordinator table shows the status of the coordinator thread, specifically the last transaction that was buffered by the coordinator thread to a worker’s queue, as well as the transaction it’s currently buffering. The start timestamp refers to when the coordinator thread read the first event of the transaction from the relay log to buffer it to a worker’s queue, and the end timestamp refers to when the last event finished buffering to the worker’s queue.
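As a quick illustration, the following sketch (it assumes the MySQL 8.0 table and column names; the Performance Schema is enabled by default in MySQL 8.0) confirms that the Performance Schema is on and summarizes the state of the I/O and coordinator threads:

```sql
-- Confirm the Performance Schema is enabled (returns 1 when ON).
SELECT @@performance_schema;

-- Summarize the I/O (connection) thread and the coordinator thread per channel.
SELECT 'io_thread' AS thread, CHANNEL_NAME, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status
UNION ALL
SELECT 'coordinator' AS thread, CHANNEL_NAME, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_coordinator;
```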
The replication_applier_status_by_worker table is the most important table. It shows the status of the worker threads. When replication fails and stops, the LAST_ERROR_NUMBER and LAST_ERROR_MESSAGE columns are important for understanding the root cause. An error number of 0 and an empty error message mean “no error.” If the LAST_ERROR_MESSAGE value isn’t empty, the error values also appear in the replica’s error log.
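The following is an example that surfaces failed worker threads. It’s a minimal sketch that assumes the MySQL 8.0 column names in performance_schema.replication_applier_status_by_worker:

```sql
-- List worker threads that have recorded an apply error.
SELECT CHANNEL_NAME, WORKER_ID, THREAD_ID, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER <> 0;
```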
Custom views
Performance Schema tables contain raw data that can be challenging to translate into actionable insights. Currently, there are no industry standards or established best practices for querying these tables to measure MTR lag or evaluate MTR utilization and effectiveness. To make this data more accessible, we recommend creating custom views. The following is an example view that shows detailed statistics about MySQL replication worker threads, including their current activity, timing information, and error status. It can help to identify long-running replica transactions, underutilized workers, and replication errors. Test thoroughly before implementing it in your production environment:
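The sketch below is one way to build such a view. It uses only columns that exist in the MySQL 8.0 performance_schema.replication_applier_status_by_worker table; the view name and the derived expressions (for example, how active and the timing columns are computed) are illustrative assumptions, not a definitive implementation:

```sql
-- Illustrative sketch of a worker-activity view (MySQL 8.0 column names).
-- Create it in a schema of your choice; the view name is an example only.
CREATE OR REPLACE VIEW mtr_worker_activity AS
SELECT
  IF(CHANNEL_NAME = '', 'default', CHANNEL_NAME)             AS channel,
  WORKER_ID                                                  AS worker_num,
  THREAD_ID                                                  AS thread_id,
  IF(APPLYING_TRANSACTION <> '', 1, 0)                       AS active,
  IF(APPLYING_TRANSACTION <> '',
     TIMEDIFF(NOW(6), APPLYING_TRANSACTION_START_APPLY_TIMESTAMP),
     NULL)                                                   AS time_applying_current_trx,
  TIMEDIFF(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
           LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP)   AS time_applying_last_trx,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP               AS last_active,
  SERVICE_STATE                                              AS worker_state,
  LAST_ERROR_NUMBER                                          AS last_error_code,
  LAST_ERROR_MESSAGE                                         AS last_error_message
FROM performance_schema.replication_applier_status_by_worker;
```

Query it with SELECT * FROM mtr_worker_activity. If the replica is read-only, one option is to create the view on the source so that the DDL replicates, or simply run the view’s SELECT statement directly on the replica.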
The following table explains what each column represents in the view:
| Column name | Description |
| --- | --- |
| channel | Replication channel name (default for an unnamed channel). |
| worker_num | Worker thread number (worker_id from performance_schema). |
| thread_id | MySQL thread ID of the worker. |
| active | Whether the worker is currently applying a transaction (1 = yes, 0 = no). |
| time_applying_current_trx | How long the current transaction has been applying (if active). |
| time_applying_last_trx | How long the last transaction took to apply. |
| last_active | Timestamp of when the last transaction ended. |
| worker_state | Current state of the worker thread (ON/OFF). |
| last_error_code | Last error number (0 if no error). |
| last_error_message | Last error message (empty if no error). |
The following is sample output from querying the view:
Regarding the active column: ideally, you want to confirm that all threads are active, because this suggests good parallel processing. If you notice significant differences in worker activity, it might indicate one of the following problems:
- Events are being processed one after another due to dependency conflicts
- Transactions are taking too long because of missing indexes
- Transactions are delayed because multiple processes are waiting for the same locks
- DDL operations are being applied
The time_applying_current_trx (and time_applying_last_trx) columns help identify long-running transactions, which might cause replication to serialize on the replica, as explained in the replica_pending_jobs_size_max section of Overview and best practices of multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Amazon Aurora MySQL. When this happens, you might notice only one active thread with a high time_applying_current_trx value.
If, after resolving all issues, you still frequently find that many worker threads stay inactive for a while, as indicated by the last_active column, consider lowering the replica_parallel_workers parameter. In contrast, if all worker threads are busy or active all the time, increasing replica_parallel_workers might help.
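To inform that decision, a quick check along these lines (a sketch that assumes MySQL 8.0.26+, where the APPLYING_TRANSACTION column and the replica_parallel_workers variable are available) compares how many workers are configured with how many are busy at the moment you sample:

```sql
-- Compare configured workers with the number currently applying a transaction.
-- On versions before MySQL 8.0.26, use @@slave_parallel_workers instead.
SELECT @@replica_parallel_workers AS configured_workers,
       SUM(APPLYING_TRANSACTION <> '') AS busy_workers
FROM performance_schema.replication_applier_status_by_worker;
```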
Engine error log
MTR log messages are available when log_error_verbosity is set to 3. Although Aurora MySQL has this setting enabled by default, RDS for MySQL and RDS for MariaDB require you to modify the parameter group value. With this setting enabled, the replica’s coordinator thread periodically writes statistical information to the error log, showing how events are distributed among worker threads. The frequency of these log entries depends on the volume of events being processed, with logs appearing no more often than once every 120 seconds. The following is sample output from the error logs.
You can find the explanation of each field in Monitoring Replication Applier Worker Threads in the MySQL documentation. We recommend that you publish the log to CloudWatch Logs for long-term retention and then use CloudWatch Logs Insights to interpret the log and compare changes over time. Refer to CloudWatch Logs Insights language query syntax for more details.
Ideally, waited due the total size should be zero. If it’s nonzero and increasing between samples, especially when replica lag is present, check whether you have large transactions. If there are no large transactions, consider increasing the replica_pending_jobs_size_max parameter. Waited (count) when workers occupied should be as low as possible. Consider increasing replica_parallel_workers if you notice significant changes between samples.
Waited at clock conflicts indicates transaction/event dependence, meaning a transaction or event had to wait for another transaction before being applied. It is normal to see a high and increasing value for this counter in most cases. To help reduce transaction dependence and therefore increase replication parallelism, follow the best practices mentioned in part 1 of this two-part blog post series. These best practices include setting binlog_transaction_dependency_tracking to WRITESET and replica_parallel_type to LOGICAL_CLOCK. We don’t recommend turning off replica_preserve_commit_order, due to data consistency considerations, especially for applications that depend on commit order.
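As a quick way to verify the settings discussed in this section, you can check the relevant variables on each instance. This is a sketch using the MySQL 8.0.26+ variable names; earlier 8.0 releases use the slave_* equivalents:

```sql
-- On the source: dependency tracking should be WRITESET for better parallelism.
SHOW GLOBAL VARIABLES LIKE 'binlog_transaction_dependency_tracking';

-- On the replica: parallel apply configuration.
SHOW GLOBAL VARIABLES LIKE 'replica_parallel%';   -- replica_parallel_type, replica_parallel_workers
SHOW GLOBAL VARIABLES LIKE 'replica_preserve_commit_order';
SHOW GLOBAL VARIABLES LIKE 'replica_pending_jobs_size_max';
```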
Conclusion
In this post, you learned about monitoring and tuning MySQL multithreaded replication using tools such as SHOW REPLICA STATUS, Performance Schema tables, error logs, and the custom views provided in this post. Effective monitoring of multithreaded replication is important for optimal replication performance. Remember that tuning is an iterative process: start with conservative parameter adjustments and monitor the effects closely. Regular performance audits and proactive optimization ensure robust replication, improving data consistency and minimizing latency.

For more information on Aurora MySQL, RDS for MySQL, and RDS for MariaDB replication, refer to the following resources:
- Replication with Amazon Aurora MySQL
- Working with MySQL replication in Amazon RDS
- Working with MariaDB replication in Amazon RDS
If you have any additional thoughts, leave a comment.