AWS Database Blog
Monitoring multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Aurora MySQL
In our previous post, we discussed how MySQL replication and multithreaded replication (MTR) work, along with key configuration options and best practices. In this post, we discuss methods to effectively monitor parallel replication performance and tune its related parameters for Amazon Aurora MySQL, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon RDS for MariaDB.
In the following sections, we dive into several methods that MySQL offers to monitor MTR.
SHOW REPLICA STATUS
SHOW REPLICA STATUS is a good starting place to monitor and troubleshoot MTR, especially when replication fails and stops. You run this command on the replica instance. To understand each field’s meaning and interpretation, refer to How do I troubleshoot high replica lag with Amazon RDS for MySQL? in the AWS Knowledge Center and SHOW REPLICA STATUS Statement in the MySQL documentation:
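The following is a minimal example of running the command on the replica. The field names shown in the comment are the MySQL 8.0.22+ names; older releases use SHOW SLAVE STATUS with the equivalent Slave_* field names:

```sql
-- Run on the replica instance (MySQL 8.0.22+ syntax; older releases use
-- SHOW SLAVE STATUS with the equivalent Slave_* field names).
-- Fields of particular interest for MTR troubleshooting:
--   Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source,
--   Exec_Source_Log_Pos, Last_SQL_Errno, Last_SQL_Error, Replica_SQL_Running_State
SHOW REPLICA STATUS\G
```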
Some of the fields have slightly different meanings in MTR compared to single-threaded replication:
- Seconds_Behind_Source is still valid, accurate, and useful with multithreaded replication, but keep in mind that this value is based on Exec_Source_Log_Pos and might not reflect the position of the most recently committed transaction.
- When performing operations that require a cutover to the target database, you should not rely on Seconds_Behind_Source or the Amazon CloudWatch metric ReplicaLag in RDS for MySQL and RDS for MariaDB or AuroraBinlogReplicaLag in Aurora MySQL. We recommend using Amazon RDS Blue/Green Deployments, which not only automates the switchover process with minimal downtime but also provides built-in safeguards so that the replica (green) environment is fully synchronized with the source (blue) before the cutover occurs.
- Last_SQL_Errno and Last_SQL_Error only show the error of the coordinator thread, not the worker threads. That means there might be additional failures in the worker threads, which can be found in the replication_applier_status_by_worker table that shows each worker thread’s status. If that table isn’t available, the replica error log can be used. The log or the replication_applier_status_by_worker table should also be used to learn more about a failure reported by SHOW REPLICA STATUS or the coordinator table.
- Common states for the Replica_SQL_Running_State field in MTR include 'Waiting for dependent transaction to commit' or 'Waiting for preceding transaction to be committed'. These states are normal and indicate that a worker thread is waiting for a dependent transaction to complete before proceeding. However, if these states appear frequently, consider tuning the workload on the source, such as breaking large transactions into smaller ones, to improve replication parallelism and overall replication performance. Setting binlog_transaction_dependency_tracking to WRITESET can also significantly reduce these dependencies. Refer to Overview and best practices of multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Amazon Aurora MySQL for best practices.
Performance Schema tables
MySQL provides a set of Performance Schema tables to monitor replication at a deeper level than SHOW REPLICA STATUS; hence, they are recommended. Three tables that are useful for MTR monitoring are:
- replication_connection_status
- replication_applier_status_by_coordinator
- replication_applier_status_by_worker
We recommend having the Performance Schema enabled so that these tables are populated accordingly.
The replication_connection_status table shows the current status of the I/O thread that handles the replica’s connection to the source, information on the last transaction queued in the relay log, and information on the transaction currently being queued in the relay log.
The replication_applier_status_by_coordinator table shows the status of the coordinator thread, specifically the last transaction that was buffered by the coordinator thread to a worker’s queue, as well as the transaction it’s currently buffering. The start timestamp refers to when the coordinator thread read the first event of the transaction from the relay log to buffer it to a worker’s queue, and the end timestamp refers to when the last event finished buffering to the worker’s queue.
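As a quick illustration, the following sketch (it assumes the MySQL 8.0 table and column names; the Performance Schema is enabled by default in MySQL 8.0) confirms that the Performance Schema is on and summarizes the state of the I/O and coordinator threads:

```sql
-- Confirm the Performance Schema is enabled (returns 1 when ON).
SELECT @@performance_schema;

-- Summarize the I/O (connection) thread and the coordinator thread per channel.
SELECT 'io_thread' AS thread, CHANNEL_NAME, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status
UNION ALL
SELECT 'coordinator' AS thread, CHANNEL_NAME, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_coordinator;
```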
The replication_applier_status_by_worker table is the most important table. It shows the status of the worker threads. When replication fails and stops, the LAST_ERROR_NUMBER and LAST_ERROR_MESSAGE columns are important for understanding the root cause. An error number of 0 and an empty error message mean “no error.” If the LAST_ERROR_MESSAGE value isn’t empty, the error values also appear in the replica’s error log.
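The following is an example that surfaces failed worker threads. It’s a minimal sketch that assumes the MySQL 8.0 column names in performance_schema.replication_applier_status_by_worker:

```sql
-- List worker threads that have recorded an apply error.
SELECT CHANNEL_NAME, WORKER_ID, THREAD_ID, SERVICE_STATE,
       LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER <> 0;
```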
Custom views
Performance Schema tables contain raw data that can be challenging to translate into actionable insights. Currently, there are no industry standards or established best practices for querying these tables to measure MTR lag or evaluate MTR utilization and effectiveness. To make this data more accessible, we recommend creating custom views. The following is an example view that shows detailed statistics about MySQL replication worker threads, including their current activity, timing information, and error status. It can help to identify long-running replica transactions, underutilized workers, and replication errors. Test thoroughly before implementing it in your production environment:
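The sketch below is one way to build such a view. It uses only columns that exist in the MySQL 8.0 performance_schema.replication_applier_status_by_worker table; the view name and the derived expressions (for example, how active and the timing columns are computed) are illustrative assumptions, not a definitive implementation:

```sql
-- Illustrative sketch of a worker-activity view (MySQL 8.0 column names).
-- Create it in a schema of your choice; the view name is an example only.
CREATE OR REPLACE VIEW mtr_worker_activity AS
SELECT
  IF(CHANNEL_NAME = '', 'default', CHANNEL_NAME)             AS channel,
  WORKER_ID                                                  AS worker_num,
  THREAD_ID                                                  AS thread_id,
  IF(APPLYING_TRANSACTION <> '', 1, 0)                       AS active,
  IF(APPLYING_TRANSACTION <> '',
     TIMEDIFF(NOW(6), APPLYING_TRANSACTION_START_APPLY_TIMESTAMP),
     NULL)                                                   AS time_applying_current_trx,
  TIMEDIFF(LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
           LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP)   AS time_applying_last_trx,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP               AS last_active,
  SERVICE_STATE                                              AS worker_state,
  LAST_ERROR_NUMBER                                          AS last_error_code,
  LAST_ERROR_MESSAGE                                         AS last_error_message
FROM performance_schema.replication_applier_status_by_worker;
```

Query it with SELECT * FROM mtr_worker_activity. If the replica is read-only, one option is to create the view on the source so that the DDL replicates, or simply run the view’s SELECT statement directly on the replica.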
The following table explains what each column represents in the view:
| Column name | Description |
| --- | --- |
| channel | Replication channel name (default for an unnamed channel). |
| worker_num | Worker thread number (worker_id from performance_schema). |
| thread_id | MySQL thread ID of the worker. |
| active | Whether the worker is currently applying a transaction (1 = yes, 0 = no). |
| time_applying_current_trx | How long the current transaction has been applying (if active). |
| time_applying_last_trx | How long the last transaction took to apply. |
| last_active | Timestamp of when the last transaction ended. |
| worker_state | Current state of the worker thread (ON/OFF). |
| last_error_code | Last error number (0 if no error). |
| last_error_message | Last error message (empty if no error). |
The following is sample output from querying the view:
Regarding the active column: ideally, you want to confirm that all threads are active, because this suggests good parallel processing. If you notice significant differences in worker activity, it might indicate one of the following problems:
- Events are being processed one after another due to dependency conflicts
- Transactions are taking too long because of missing indexes
- Transactions are delayed because multiple processes are waiting for the same locks
- DDL operations are being applied
The time_applying_current_trx (and time_applying_last_trx) columns help identify long-running transactions, which might cause replication to serialize on the replica, as explained in the replica_pending_jobs_size_max section of Overview and best practices of multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Amazon Aurora MySQL. When this happens, you might notice only one active thread with a high time_applying_current_trx value.
If, after resolving all issues, you still frequently find that many worker threads stay inactive for a while, as indicated by the last_active column, consider lowering the replica_parallel_workers parameter. In contrast, if all worker threads are busy or active all the time, increasing replica_parallel_workers might help.
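To inform that decision, a quick check along these lines (a sketch that assumes MySQL 8.0.26+, where the APPLYING_TRANSACTION column and the replica_parallel_workers variable are available) compares how many workers are configured with how many are busy at the moment you sample:

```sql
-- Compare configured workers with the number currently applying a transaction.
-- On versions before MySQL 8.0.26, use @@slave_parallel_workers instead.
SELECT @@replica_parallel_workers AS configured_workers,
       SUM(APPLYING_TRANSACTION <> '') AS busy_workers
FROM performance_schema.replication_applier_status_by_worker;
```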
Engine error log
MTR log messages are available when log_error_verbosity is set to 3. Although Aurora MySQL has this setting enabled by default, RDS for MySQL and RDS for MariaDB require you to modify the parameter group value. With this setting enabled, the replica’s coordinator thread periodically writes statistical information to the error log, showing how events are distributed among worker threads. The frequency of these log entries depends on the volume of events being processed, with logs appearing no more often than once every 120 seconds. The following is sample output from the error logs.
You can find the explanation of each field in Monitoring Replication Applier Worker Threads in the MySQL documentation. We recommend that you publish the log to CloudWatch Logs for long-term retention and then use CloudWatch Logs Insights to interpret the log and compare changes over time. Refer to CloudWatch Logs Insights language query syntax for more details.
Ideally, waited due the total size should be zero. If it’s nonzero and increasing between samples, especially when replica lag is present, check whether you have large transactions. If there are no large transactions, consider increasing the replica_pending_jobs_size_max parameter. Waited (count) when workers occupied should be as low as possible. Consider increasing replica_parallel_workers if you notice significant changes between samples.
Waited at clock conflicts indicates transaction/event dependence, meaning a transaction or event had to wait for another transaction before being applied. It is normal to see a high and increasing value for this counter in most cases. To help reduce transaction dependence and therefore increase replication parallelism, follow the best practices mentioned in part 1 of this two-part blog post series. These best practices include setting binlog_transaction_dependency_tracking to WRITESET and replica_parallel_type to LOGICAL_CLOCK. We don’t recommend turning off replica_preserve_commit_order, due to data consistency considerations, especially for applications that depend on commit order.
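As a quick way to verify the settings discussed in this section, you can check the relevant variables on each instance. This is a sketch using the MySQL 8.0.26+ variable names; earlier 8.0 releases use the slave_* equivalents:

```sql
-- On the source: dependency tracking should be WRITESET for better parallelism.
SHOW GLOBAL VARIABLES LIKE 'binlog_transaction_dependency_tracking';

-- On the replica: parallel apply configuration.
SHOW GLOBAL VARIABLES LIKE 'replica_parallel%';   -- replica_parallel_type, replica_parallel_workers
SHOW GLOBAL VARIABLES LIKE 'replica_preserve_commit_order';
SHOW GLOBAL VARIABLES LIKE 'replica_pending_jobs_size_max';
```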
Conclusion
In this post, you learned about monitoring and tuning MySQL multithreaded replication using tools such as SHOW REPLICA STATUS, Performance Schema tables, error logs, and the custom views provided in this post. Effective monitoring of multithreaded replication is important for optimal replication performance. Remember that tuning is an iterative process: start with conservative parameter adjustments and monitor the effects closely. Regular performance audits and proactive optimization ensure robust replication, improving data consistency and minimizing latency.

For more information on Aurora MySQL, RDS for MySQL, and RDS for MariaDB replication, refer to the following resources:
- Replication with Amazon Aurora MySQL
- Working with MySQL replication in Amazon RDS
- Working with MariaDB replication in Amazon RDS
If you have any additional thoughts, leave a comment.