Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

Recently I worked on a customer issue that I would describe as “slave performance too slow”. During a quick analysis, I found that the replication slave’s SQL thread could not keep up while processing row-based events from the master’s binary log.

For example:

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                          ...
              Master_Log_File: binlog.0000185
          Read_Master_Log_Pos: 86698585
                          ...
        Relay_Master_Log_File: binlog.0000185
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                          ...
          Exec_Master_Log_Pos: 380
              Relay_Log_Space: 85699128
                          ...
                  Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c
                          ...
           Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057
            Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2,
98974e7f-2fbc-18e9-72cd-07003817585c:1-1056
                          ...

The processlist state for the SQL thread can be “Reading event from the relay log”, “System lock”, or potentially some other state. In my case:

mysql> SHOW PROCESSLIST;
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+
| Id | User            | Host            | db   | Command | Time | State                            | Info             |
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+
...
|  4 | system user     |                 | NULL | Connect |  268 | Reading event from the relay log | NULL             |
...
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

What causes that?

Let’s take a look at what can cause this behavior and what we need to pay attention to. When the SQL thread applies a change from a row-based event, it has to locate the exact row that was updated. With a primary key, this is trivial, as only one row can have a given value for the primary key.

However, if there is no primary key on the table on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete, and it repeats that search for each changed row. This search is both very resource intensive (CPU usage can reach 100%) and slow, causing the slave to fall behind.

For InnoDB tables, the “hidden” key used for the clustered index of tables without a primary key cannot be used to avoid searching the entire table for the rows to update or delete. Keep in mind that the “hidden” key is unique only within each MySQL instance, so the replication master and the replication slave generally do not have the same “hidden” key values for the same row.
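
As a side note, tables that rely on this generated clustered index can be spotted in the InnoDB data dictionary, where the index is named GEN_CLUST_INDEX. A minimal check, assuming MySQL 5.7 (in 8.0 the corresponding views are INNODB_TABLES and INNODB_INDEXES):

SELECT t.NAME AS table_name
      FROM information_schema.INNODB_SYS_TABLES t
      JOIN information_schema.INNODB_SYS_INDEXES i
        ON t.TABLE_ID = i.TABLE_ID
      WHERE i.NAME = 'GEN_CLUST_INDEX';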

What can we do to solve that?

The best solution is to ensure that all tables have a primary key. This not only ensures the SQL thread can easily locate rows to update or delete, but it is also considered a best practice, since it ensures all rows are unique.

If there is no way to logically add a natural primary key for the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.
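
For example, a minimal sketch of adding such a surrogate key (the schema, table, and column names here are placeholders):

ALTER TABLE mydb.mytable
      ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

On large or busy tables, this is typically done with an online schema change tool rather than a plain ALTER.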

The query below helps you locate InnoDB tables that have neither a primary key nor any other unique NOT NULL key:

SELECT tables.table_schema, tables.table_name, tables.table_rows
      FROM information_schema.tables
      LEFT JOIN (
        SELECT table_schema, table_name
        FROM information_schema.statistics
        GROUP BY table_schema, table_name, index_name
        HAVING
          SUM(
            CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
          ) = COUNT(*)
      ) puks
      ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
      WHERE puks.table_name IS NULL
        AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
        AND tables.table_type = 'BASE TABLE' AND engine='InnoDB';

Please note that for InnoDB, there must always be a unique NOT NULL key for all tables. It is required for the clustered index. So adding an explicit “dummy” column as suggested above will not add to the overall storage requirements as it will merely replace the hidden key.

It’s not always possible to add a primary key immediately: for example, there may be many dependencies on the application side or in a legacy system, a lack of resources, or unknown application behavior after the change that requires testing first.

In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.

The search algorithm is set using the slave_rows_search_algorithms option, which is available in MySQL 5.6 and later. The default value in MySQL 5.6 and 5.7 (TABLE_SCAN,INDEX_SCAN) means an index scan is used if possible, otherwise a table scan.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#option_mysqld_slave-rows-search-algorithms

However, for tables without a primary key, a hash scan can be used instead, which causes the SQL thread to temporarily cache hashes to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically using:

mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';

Just to note, INDEX_SCAN,HASH_SCAN is the default value in MySQL 8.0.
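
To keep the setting across restarts, it can also be placed in the slave’s configuration file; the file path and section below are the usual assumptions rather than anything specific to this case (in MySQL 8.0, SET PERSIST achieves the same from the client):

# /etc/my.cnf on the replication slave
[mysqld]
slave_rows_search_algorithms = INDEX_SCAN,HASH_SCAN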

One thing to be aware of when using hash scans is that the hashes are only reused within one row-based event. (Each row-based event may contain changes to several rows in the same table, originating from the same SQL statement.)
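
To get a feel for how many rows end up in a single event, you can decode the binary (or relay) log with mysqlbinlog; the file name below is only an example, and each Write_rows/Update_rows/Delete_rows header in the decoded output marks one event:

mysqlbinlog --base64-output=DECODE-ROWS --verbose binlog.0000185 | less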

The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event. The default max event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when:

  1. Several rows fit into one row-based event. It may help to increase the value of binlog_row_event_max_size on the replication master if you perform updates or deletes on large rows (e.g., with BLOB or TEXT data). You can only set binlog_row_event_max_size in the MySQL configuration file, and changing it requires a restart (see the configuration sketch after the reference link below).
  2. One statement changes several rows.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-binary-log.html#option_mysqld_binlog-row-event-max-size
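
A minimal configuration sketch for the replication master, assuming the default 8KB event size is too small for your rows; the 64KB value and the file path are only illustrations:

# /etc/my.cnf on the replication master (restart required)
[mysqld]
# default is 8192 bytes; the value must be a multiple of 256
binlog_row_event_max_size = 65536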

Conclusion

Even if enabling hash scans improves the performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be a general rule of thumb in schema design, in order to avoid or minimize issues like the slow slave performance described in this post.

Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.
