Handling long-duration SST (timeout) in PXC with systemd

In this blog post, we explain the SST timeouts in the systemd implementation that we recently ran into with Percona XtraDB Cluster while consulting for a client. A State Snapshot Transfer (SST) is a complete data sync from one of the nodes in the cluster to the joining node.

An SST happens for one or more of the reasons listed below.

  1. Initial sync to join a node to the cluster.
  2. A node has dropped out of the cluster and cannot rejoin incrementally, either because of data corruption or inconsistencies, or because it has fallen so far behind that the starting point for recovery has already been purged or rotated out of the gcache (where the recovery logs are written).

It is very important to understand the timeouts related to SST in a large cluster implementation, where the SST can take hours to complete. If it fails on a timeout midway, it can ruin your day.

We will look at SST timeouts in two large-scale Galera cluster implementations, Percona XtraDB Cluster and MariaDB Cluster, with the systemd startup process.

Percona XtraDB Cluster (PXC):

PXC Version: 5.6.38

Systemd Service Script: /usr/lib/systemd/system/mysql.service

[Service]
ExecStartPre=/usr/bin/mysql-systemd start-pre
# Pre-check script: checks whether an instance of mysql is already running and exits if it finds one.

ExecStart=/usr/bin/mysqld_safe --basedir=/usr
# Once the pre-check passes, this starts MySQL.

ExecStartPost=/usr/bin/mysql-systemd start-post $MAINPID
# Post-check script: verifies that the pid file is created and the server is running.
# Startup completes only when the post-check script returns success.

When a node goes through SST, the startup script waits for ExecStartPost to return OK.

  • As we can see, the post-check script calls /usr/bin/mysql-systemd with the argument start-post, which goes through the switch case below.
"start-post") 
      wait_for_pid created  "$pid_path"; ret=$?
      if [[ $ret -eq 1 ]];then 
          log_failure_msg "MySQL (Percona XtraDB Cluster) server startup failed!"
      elif [[ $ret -eq 2 ]];then
          log_info_msg "MySQL (Percona XtraDB Cluster) server startup failed! State transfer still in progress"
      fi
      exit $ret
    ;;
  • Inside start-post, the wait_for_pid function is invoked with the argument created and the pid path. The script then loops inside wait_for_pid until the SST completes.
  • Here is only the code from wait_for_pid that is relevant to this discussion.
i=0
while [[ $i -lt $service_startup_timeout ]]; do
    if [[ $verb = 'created' ]];then 
        if ([[ -e $sst_progress_file ]] || grep -q -- '--wsrep-new-cluster' <<< "$env_args" ) \
        && [[ $startup_sleep -ne 10 ]]; then
            echo "State transfer in progress, setting sleep higher"
            startup_sleep=10
        fi
    fi
    i=$(( i+1 ))
    sleep $startup_sleep
done

This while loop runs up to service_startup_timeout times, and during SST each iteration sleeps for a startup_sleep of 10 seconds. The value of service_startup_timeout is hardcoded in the script as 900.

service_startup_timeout=900
  • So SST gets only 900 * 10 = 9000 seconds = 2 hrs 30 min to complete in the systemd implementation, and it times out after that.
  • For a huge cluster this is a bottleneck. With a bigger data set SST can take longer, and failing in the middle is about the worst thing that can happen. The error thrown in that case is misleading and unclear.
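The arithmetic behind that limit can be sketched in a few lines of shell. The variable names mirror the ones in /usr/bin/mysql-systemd, but the helper function effective_timeout_seconds is hypothetical and only illustrates the calculation, assuming every iteration sleeps for the full startup_sleep.

```shell
# Hypothetical helper (not part of mysql-systemd): upper bound on how long
# start-post waits. Assumes every loop iteration sleeps for startup_sleep.
effective_timeout_seconds() {
    local service_startup_timeout="$1"   # loop iterations (900 by default)
    local startup_sleep="$2"             # seconds per iteration (10 during SST)
    echo $(( service_startup_timeout * startup_sleep ))
}

secs=$(effective_timeout_seconds 900 10)
printf '%d seconds = %d hrs %d min\n' "$secs" $(( secs / 3600 )) $(( (secs % 3600) / 60 ))
# prints: 9000 seconds = 2 hrs 30 min
```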

Testing:

In our testing with PXC version 5.6.38 on CentOS 7 with a data set of 1.5 TB, the SST timed out midway, after almost 700G had been copied in approx. 2 hrs 30 min.
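A rough projection from these numbers shows why the default limit cannot work for a data set of this size. The sketch below assumes the copy rate stays constant and treats 1.5 TB as 1536 GB. The function name estimate_sst_seconds is made up for illustration.

```shell
# Hypothetical helper: project total SST time from observed progress.
# Assumes a constant copy rate, which a real transfer will not match exactly.
estimate_sst_seconds() {
    local total_gb="$1"     # full data set size in GB
    local copied_gb="$2"    # amount copied so far in GB
    local elapsed_s="$3"    # seconds spent copying that amount
    echo $(( total_gb * elapsed_s / copied_gb ))
}

# 700 GB copied in 9000 s (2 hrs 30 min), data set of 1.5 TB (~1536 GB)
secs=$(estimate_sst_seconds 1536 700 9000)
echo "estimated SST time: $(( secs / 3600 )) hrs $(( (secs % 3600) / 60 )) min"
# iterations of the 10-second wait loop needed to cover that time (rounded up)
echo "loop iterations needed: $(( (secs + 9) / 10 ))"
```

Under these assumptions the full transfer needs roughly five and a half hours, more than twice the default limit.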

Error Logs:

Joiner:

2018-03-14 19:13:04 16392 [Note] WSREP: Member 2.0 (node4) requested state transfer from 'node3'. Selected 1.0 (node3)(SYNCED) as donor.
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20180314 19:13:05.350)
WSREP_SST: [ERROR] Removing /data/mysql//.sst/xtrabackup_galera_info file due to signal (20180314 21:42:44.532)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20180314 21:42:44.535)
2018-03-14 21:42:44 16392 [ERROR] WSREP: SST failed: 2 (No such file or directory)

Donor:

WSREP_SST: [INFO] Streaming the backup to joiner at xxx.xx.xx.xx 4444 (20180314 19:13:13.446)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:xxx.xx.xx.xx:4444; RC=( ${PIPESTATUS[@]} ) (20180314 19:13:13.449)
2018/03/14 21:42:42 socat[13974] E write(6, 0x55d29d806d00, 8192): Broken pipe

SST duration: 19:13:04 to 21:42:44, i.e. a timeout after ~2 hrs 30 min
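The duration can be checked directly from the two joiner log timestamps. This is a small sketch that relies on GNU date (the -d option), and the function name log_duration_seconds is hypothetical.

```shell
# Hypothetical helper: seconds between two log timestamps.
# Relies on GNU date (-d); BSD/macOS date uses different flags.
log_duration_seconds() {
    local start end
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( end - start ))
}

# Timestamps taken from the joiner log above
secs=$(log_duration_seconds "2018-03-14 19:13:04" "2018-03-14 21:42:44")
echo "SST ran for $secs seconds before it was interrupted"
```

The result, 8980 seconds, is within seconds of the 9000-second limit. The exit status 143 in the joiner log is 128 + 15, the conventional status for a process ended by SIGTERM, which is consistent with the transfer being killed rather than failing on its own.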

Solutions:

Method 1:

  • Edit the /usr/bin/mysql-systemd file and raise service_startup_timeout from 900 to a much higher value. In our case, we set it to 2880, which is 8 hours.
    (2880*10)/60/60 = 8 hrs

# sed -i 's/service_startup_timeout=900/service_startup_timeout=2880/g' /usr/bin/mysql-systemd
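For a different target, the loop count can be derived from the desired wait time and applied the same way. The following is only a sketch: both helper functions are hypothetical, it assumes the 10-second startup_sleep used during SST, and the example runs against a scratch file rather than the live /usr/bin/mysql-systemd.

```shell
# Hypothetical helpers, not part of the PXC packages.

# Convert a target wait time in hours into loop iterations,
# assuming the 10-second startup_sleep used during SST.
hours_to_iterations() {
    echo $(( $1 * 3600 / 10 ))
}

# Rewrite the hardcoded numeric default in the given script, keeping a .bak copy.
# Lines where the value is not a plain number are left untouched.
set_startup_timeout() {
    local script="$1" iterations="$2"
    sed -i.bak "s/service_startup_timeout=[0-9][0-9]*\$/service_startup_timeout=${iterations}/" "$script"
    grep -n 'service_startup_timeout=[0-9]' "$script"
}

# Example against a scratch file standing in for /usr/bin/mysql-systemd
scratch=$(mktemp)
echo 'service_startup_timeout=900' > "$scratch"
set_startup_timeout "$scratch" "$(hours_to_iterations 8)"
```

Keeping the .bak copy makes it easy to diff the change or roll it back.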

Method 2:

  • In /usr/bin/mysql-systemd, we can see that it also reads this variable from the [mysqld_safe] section of my.cnf:

service_startup_timeout=$(parse_cnf service-startup-timeout $service_startup_timeout mysqld_safe)

  • This is mentioned in the mysql.service script, but it is not clear.
    # Timeout is handled elsewhere
    # service-startup-timeout in my.cnf 
    # Default is 900 seconds -> This value BTW is not seconds
  • So we can also define service-startup-timeout (the option name the script looks up) in /etc/my.cnf under the [mysqld_safe] section.
  • The value set in /etc/my.cnf takes precedence over the one hardcoded in the script.
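As a sketch of what that override could look like, the snippet below appends the setting to an option file. The CNF variable is an assumption of this example: it defaults to a scratch file, so point it at /etc/my.cnf on a real node. The value 2880 reuses the 8-hour figure from Method 1.

```shell
# Sketch only: append the override to an option file.
# CNF defaults to a scratch file here; set CNF=/etc/my.cnf on a real node.
CNF=${CNF:-$(mktemp)}

cat >> "$CNF" <<'EOF'
[mysqld_safe]
# loop iterations, not seconds: 2880 * 10 s = 8 hrs
service-startup-timeout = 2880
EOF

# Show the line that was written
grep -n 'service-startup-timeout' "$CNF"
```

If the file already has a [mysqld_safe] section, add the line there instead of appending a second section. On a node with the MySQL tools installed, running my_print_defaults mysqld_safe should list the option once it is picked up.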

This behaviour has been reported to the Percona team: https://jira.percona.com/browse/PXC-2080

MariaDB Cluster:

MariaDB Version: 10.1.31

MariaDB has provided clear information on how to increase the timeout for SST in their documentation for upgrading.

https://mariadb.com/kb/en/library/upgrading-from-mariadb-galera-cluster-100-to-mariadb-101/

On Linux distributions that use systemd you may need to increase the service startup timeout, as the default timeout of 20 minutes may not be sufficient if an SST becomes necessary.

  • Create a file /etc/systemd/system/mariadb.service.d/timeout.conf with the following content.
    [Service]
    TimeoutSec=infinity
  • If you are using a systemd version older than 229, you have to replace infinity with 0.
  • Execute # systemctl daemon-reload after the change for the new timeout setting to take effect.
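These steps can be scripted. The sketch below uses two hypothetical helper functions and writes the drop-in into a scratch directory for review. On a real node the directory would be /etc/systemd/system/mariadb.service.d, and the version number can be read from the first line of systemctl --version.

```shell
# Hypothetical helpers; the file name and contents follow the steps above.

# Pick the TimeoutSec value for a given systemd version number.
timeout_value_for() {
    if [ "$1" -lt 229 ]; then echo 0; else echo infinity; fi
}

# Write the drop-in file into the given directory.
write_timeout_dropin() {
    local dir="$1" value="$2"
    mkdir -p "$dir"
    printf '[Service]\nTimeoutSec=%s\n' "$value" > "$dir/timeout.conf"
}

# Example: build the file in a scratch directory (219 is a sample version)
scratch=$(mktemp -d)
write_timeout_dropin "$scratch" "$(timeout_value_for 219)"
cat "$scratch/timeout.conf"
```

After copying the file into place, systemctl daemon-reload is still needed for the new setting to take effect.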

It is also worth noting that MariaDB provides very good documentation on the systemd startup script and its variables. You can read it at the following link.

https://mariadb.com/kb/en/library/systemd/

The mariadb-service-convert script, which generates the systemd startup script variables from /etc/my.cnf, is just fascinating. I am not going into much detail on it, as it is out of scope for this blog. I really admire how clear the documentation is.

Key Takeaways:

SST in the systemd implementation has timeouts.

For Percona XtraDB Cluster it is 2 hours 30 minutes.
For MariaDB Cluster it is 20 minutes.

If your data copy during SST is going to take longer than that, use the solutions provided above to avoid surprises in production.
