Author: Yves Trudeau

Compression options in MySQL (part 1)

Over the last year, I have been pursuing a part time hobby project exploring ways to squeeze as much data as possible in MySQL. As you will see, there are quite a few different ways. Of course things like compression ratio matters a lot but, other items like performance of inserts, selects and updates, along […]

How To Best Use Percona Server Column Compression With Dictionary

Very often, database performance is affected by the inability to cache all the required data in memory. Disk IO, even when using the fastest devices, takes much more time than a memory access. With MySQL/InnoDB, the main memory cache is the InnoDB buffer pool. There are many strategies we can try to fit as much […]

Lock Down: Enforcing AppArmor with Percona XtraDB Cluster


Recently, I wrote a blog post showing how to enforce SELinux with Percona XtraDB Cluster (PXC). The Linux distributions derived from RedHat use SELinux. There is another major mandatory access control (MAC) system, AppArmor. Ubuntu, for example, installs AppArmor by default. If you are concerned about computer security and use PXC on Ubuntu, you […]


Tuning InnoDB Primary Keys

The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload. As a principal architect at Percona, one of my main duties is to tune customer databases. There are many aspects related to performance tuning […]


Lock Down: Enforcing SELinux with Percona XtraDB Cluster

Why do I spend time blogging about security frameworks? Because, although there are some resources available on the Web, none apply to Percona XtraDB Cluster (PXC) directly. Actually, I rarely encounter a MySQL setup where SELinux is enforced, and never when Percona XtraDB Cluster (PXC) or another Galera replication implementation is used. As we’ll see, there are good reasons for that. I originally thought this post would be a simple “how to” but it ended up with a pull request to modify the SST script and a few other surprises.
Some context
These days, with all the major security breaches of the last few years, the importance of security in IT cannot be highlighted enough. For that reason, security in MySQL has been progressively tightened from version to version, and the default parameters are much more restrictive than they used to be. That’s all good, but it is only at the MySQL level: if there is still a breach allowing access to MySQL, someone could in theory do everything the mysql user is allowed to do. To prevent such a situation, the operations that mysqld can do should be limited to only what it really needs to do. SELinux’s purpose is exactly that. You’ll find SELinux on RedHat/CentOS and their derived distributions. Debian, Ubuntu and OpenSUSE use another framework, AppArmor, which is functionally similar to SELinux. I’ll talk about AppArmor in a future post; let’s focus for now on SELinux.
The default behavior of many DBAs and Sysadmins appears to be: “if it doesn’t work, disable SELinux”. Sure enough, it often solves the issue but it also removes an important security layer. I believe disabling SELinux is the wrong cure so let’s walk through the steps of configuring a PXC cluster with SELinux enforced.
Starting point
As a starting point, I’ll assume you have a running PXC cluster operating with SELinux in permissive mode. That likely means the file “/etc/sysconfig/selinux” looks like this:
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=permissive
# SELINUXTYPE= can take one of three values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
For the purpose of writing this article, I created a three-node PXC cluster with the hosts BlogSELinux1, BlogSELinux2 and BlogSELinux3. On BlogSELinux1, I set SELinux in permissive mode and truncated the audit.log file, where the SELinux violations are logged.
[root@BlogSELinux1 ~]# getenforce
Permissive
[root@BlogSELinux1 ~]# echo '' > /var/log/audit/audit.log
Let’s begin by covering the regular PXC operation items: start, stop, SST Donor, SST Joiner, IST Donor and IST Joiner. As we execute the steps in the list, the audit.log file will record the SELinux-related events.
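If you want to watch the violations as they are recorded, you can simply tail the audit log or use the ausearch tool that ships with the audit package:
tail -f /var/log/audit/audit.log | grep denied
# or, for a formatted view of the recent AVC events:
ausearch -m avc -ts recent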
Stop and start
Those are easy:
[root@BlogSELinux1 ~]# systemctl stop mysql
[root@BlogSELinux1 ~]# systemctl start mysql

SST Donor
On BlogSELinux3:
[root@BlogSELinux3 ~]# systemctl stop mysql
then on BlogSELinux2:
[root@BlogSELinux2 ~]# systemctl stop mysql
[root@BlogSELinux2 ~]# rm -f /var/lib/mysql/grastate.dat
[root@BlogSELinux2 ~]# systemctl start mysql

SST Joiner
With BlogSELinux1 and BlogSELinux2 up and running, we just do:
[root@BlogSELinux1 ~]# systemctl stop mysql
[root@BlogSELinux1 ~]# rm -f /var/lib/mysql/grastate.dat
[root@BlogSELinux1 ~]# systemctl start mysql

IST Donor
With BlogSELinux1 and BlogSELinux2 up and running, we just do:
[root@BlogSELinux2 ~]# systemctl stop mysql
Then on the first node:
[root@BlogSELinux1 ~]# mysql -e 'create database test;'
[root@BlogSELinux1 ~]# mysql -e 'create table test.testtable (id int not null, primary key (id)) engine=innodb;'
[root@BlogSELinux1 ~]# mysql -e 'insert into test.testtable (id) values (1);'
Those statements put some data in the gcache; now we just restart the second node:
[root@BlogSELinux2 ~]# systemctl start mysql

IST Joiner
With BlogSELinux1 and BlogSELinux2 up and running, we just do:
[root@BlogSELinux1 ~]# systemctl stop mysql
Then on the second node:
[root@BlogSELinux2 ~]# mysql -e 'insert into test.testtable (id) values (2);'
to insert some data in the gcache and we restart the first node:
[root@BlogSELinux1 ~]# systemctl start mysql

First run
Now that we have performed the basic operations of a cluster while recording the security violations in permissive mode, we can look at the audit.log file and start building the SELinux policy. Let’s begin by installing the tools needed to manipulate the SELinux audit log and policy files:
[root@BlogSELinux1 ~]# yum install policycoreutils-python.x86_64
Then, we’ll use the audit2allow tool to analyze the audit.log file:
[root@BlogSELinux1 ~]# grep -i denied /var/log/audit/audit.log | grep mysqld_t | audit2allow -M PXC
******************** IMPORTANT ***********************
To make this policy package active, execute:
semodule -i PXC.pp
We end up with two files: PXC.te and PXC.pp. The .pp file is a compiled version of the human-readable .te file. If we examine the content of the PXC.te file, at the beginning we have the require section, listing all the involved SELinux types and classes:
module PXC 1.0;
require {
type unconfined_t;
type init_t;
type auditd_t;
type mysqld_t;
type syslogd_t;
type NetworkManager_t;
type unconfined_service_t;
type system_dbusd_t;
type tuned_t;
type tmp_t;
type dhcpc_t;
type sysctl_net_t;
type kerberos_port_t;
type kernel_t;
type unreserved_port_t;
type firewalld_t;
type systemd_logind_t;
type chronyd_t;
type policykit_t;
type udev_t;
type mysqld_safe_t;
type postfix_pickup_t;
type sshd_t;
type crond_t;
type getty_t;
type lvm_t;
type postfix_qmgr_t;
type postfix_master_t;
class process { getattr setpgid };
class unix_stream_socket connectto;
class system module_request;
class netlink_tcpdiag_socket { bind create getattr nlmsg_read setopt };
class tcp_socket { name_bind name_connect };
class file { getattr open read write };
class dir search;
}
Then, using these types and classes, the policy file adds a series of generic allow rules matching the denied operations found in the audit.log file. Here’s what I got:
#============= mysqld_t ==============
allow mysqld_t NetworkManager_t:process getattr;
allow mysqld_t auditd_t:process getattr;
allow mysqld_t chronyd_t:process getattr;
allow mysqld_t crond_t:process getattr;
allow mysqld_t dhcpc_t:process getattr;
allow mysqld_t firewalld_t:process getattr;
allow mysqld_t getty_t:process getattr;
allow mysqld_t init_t:process getattr;
#!!!! This avc can be allowed using the boolean 'nis_enabled'
allow mysqld_t kerberos_port_t:tcp_socket name_bind;
allow mysqld_t kernel_t:process getattr;
#!!!! This avc can be allowed using the boolean 'domain_kernel_load_modules'
allow mysqld_t kernel_t:system module_request;
allow mysqld_t lvm_t:process getattr;
allow mysqld_t mysqld_safe_t:process getattr;
allow mysqld_t policykit_t:process getattr;
allow mysqld_t postfix_master_t:process getattr;
allow mysqld_t postfix_pickup_t:process getattr;
allow mysqld_t postfix_qmgr_t:process getattr;
allow mysqld_t sysctl_net_t:file { getattr open read };
allow mysqld_t syslogd_t:process getattr;
allow mysqld_t system_dbusd_t:process getattr;
allow mysqld_t systemd_logind_t:process getattr;
allow mysqld_t tuned_t:process getattr;
allow mysqld_t udev_t:process getattr;
allow mysqld_t unconfined_service_t:process getattr;
allow mysqld_t unconfined_t:process getattr;
allow mysqld_t tuned_t:process getattr;
allow mysqld_t udev_t:process getattr;
allow mysqld_t sshd_t:process getattr;
allow mysqld_t self:netlink_tcpdiag_socket { bind create getattr nlmsg_read setopt };
allow mysqld_t self:process { getattr setpgid };
#!!!! The file '/var/lib/mysql/mysql.sock' is mislabeled on your system.
#!!!! Fix with $ restorecon -R -v /var/lib/mysql/mysql.sock
#!!!! This avc can be allowed using the boolean 'daemons_enable_cluster_mode'
allow mysqld_t self:unix_stream_socket connectto;
allow mysqld_t sshd_t:process getattr;
allow mysqld_t sysctl_net_t:dir search;
allow mysqld_t sysctl_net_t:file { getattr open read };
allow mysqld_t syslogd_t:process getattr;
allow mysqld_t system_dbusd_t:process getattr;
allow mysqld_t systemd_logind_t:process getattr;
#!!!! WARNING 'mysqld_t' is not allowed to write or create to tmp_t. Change the label to mysqld_tmp_t.
allow mysqld_t tmp_t:file write;
allow mysqld_t tuned_t:process getattr;
allow mysqld_t udev_t:process getattr;
allow mysqld_t unconfined_service_t:process getattr;
allow mysqld_t unconfined_t:process getattr;
#!!!! This avc can be allowed using one of these booleans:
# nis_enabled, mysql_connect_any
allow mysqld_t unreserved_port_t:tcp_socket { name_bind name_connect };
I can understand some of these rules. For example, one of the TCP ports used by Kerberos is 4444, and it is also used by PXC for the SST transfer. Similarly, MySQL needs to write to /tmp. But what about all the other rules?
Troubleshooting
We could load the PXC.pp module we got in the previous section and consider our job done. It would likely allow the PXC node to start and operate normally, but what exactly is happening? Why did MySQL, or one of its subprocesses, ask for the process attributes (getattr) of all the running processes like sshd, syslogd and crond? Looking directly in the audit.log file, I found many entries like these:
type=AVC msg=audit(1527792830.989:136): avc: denied { getattr } for pid=3683 comm="ss"
scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:init_t:s0 tclass=process
type=AVC msg=audit(1527792830.990:137): avc: denied { getattr } for pid=3683 comm="ss"
scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process
type=AVC msg=audit(1527792830.991:138): avc: denied { getattr } for pid=3683 comm="ss"
scontext=system_u:system_r:mysqld_t:s0 tcontext=system_u:system_r:syslogd_t:s0 tclass=process
So ss, a network utility tool, scans all the processes. That rang a bell… I knew where to look: the SST script. Here’s the source of the problem, in the wsrep_sst_xtrabackup-v2 file:
wait_for_listen()
{
    local HOST=$1
    local PORT=$2
    local MODULE=$3
    for i in {1..300}
    do
        ss -p state listening "( sport = :$PORT )" | grep -qE 'socat|nc' && break
        sleep 0.2
    done
    echo "ready ${HOST}:${PORT}/${MODULE}//$sst_ver"
}
This bash function is used when the node is a joiner: it uses ss to check whether the TCP port used by socat or nc is open. The check is needed to avoid replying too early with the "ready" message. The code is functionally correct but wrong, security-wise. Instead of looking for a socat or nc command among the processes owned by the mysql user, it checks whether any process has opened the SST port, and only then checks if the name of the command is socat or nc. Since we don't know which processes will be running on the server, we can't write a good security profile. For example, in the future, one could add the ntpd daemon, causing PXC to fail to start yet again. To avoid that, the function needs to be modified like this:
wait_for_listen()
{
    local HOST=$1
    local PORT=$2
    local MODULE=$3
    for i in {1..300}
    do
        sleep 0.2
        # List only our (mysql user) processes to avoid triggering SELinux
        for cmd in $(ps -u $(id -u) -o pid,comm | sed 's/^\s*//g' | tr ' ' '|' | grep -E 'socat|nc')
        do
            pid=$(echo $cmd | cut -d'|' -f1)
            # List the sockets of the pid
            sockets=$(ls -l /proc/$pid/fd | grep socket | cut -d'[' -f2 | cut -d ']' -f1 | tr '\n' '|')
            if [[ -n $sockets ]]; then
                # Is one of these sockets listening on the SST port?
                # If so, we need to break from 2 loops
                grep -E "${sockets:0:-1}" /proc/$pid/net/tcp | \
                    grep "00000000:$(printf '%X' $PORT)" > /dev/null \
                    && break 2
            fi
        done
    done
    echo "ready ${HOST}:${PORT}/${MODULE}//$sst_ver"
}
The modified function removes many of the denied messages in the audit log file and greatly simplifies the content of PXC.te. I tested the above modification and made a pull request to PXC. Among the remaining items, we have:
allow mysqld_t self:process { getattr setpgid };
setpgid is often called after a fork to set the process group, usually through the setsid call. MySQL calls fork when it starts with the daemonize option, but our installation of Percona XtraDB Cluster uses mysqld_safe and does not directly run as a daemon. Another fork call is part of the wsrep source files: it is used to launch processes like the SST script, and it happens when mysqld is already running with reduced privileges. This latter invocation is certainly our culprit.
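To see the mechanism in isolation, the setsid utility from util-linux does essentially what such a forked child does: it creates a new session and process group. A quick, illustrative check:
# The current shell's pid, process group and session:
ps -o pid,pgid,sid,comm -p $$
# The same query from a process started with setsid shows a new PGID and SID:
setsid bash -c 'ps -o pid,pgid,sid,comm -p $$'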
TCP ports
What about TCP ports? PXC uses quite a few. Of course, there is the 3306/tcp port used to access MySQL. Galera also uses port 4567/tcp for replication, 4568/tcp for IST and 4444/tcp for SST. Let’s look at which ports SELinux allows PXC to use:
[root@BlogSELinux1 audit]# semanage port -l | grep mysql
mysqld_port_t tcp 1186, 3306, 63132-63164
No surprise, port 3306/tcp is authorized, but if you are new to MySQL you may wonder what uses 1186/tcp. It is the port used by NDB Cluster for inter-node communication (NDB API). Now, if we try to add the missing ports:
[root@BlogSELinux1 audit]# semanage port -a -t mysqld_port_t -p tcp 4567
ValueError: Port tcp/4567 already defined
[root@BlogSELinux1 audit]# semanage port -a -t mysqld_port_t -p tcp 4568
[root@BlogSELinux1 audit]# semanage port -a -t mysqld_port_t -p tcp 4444
ValueError: Port tcp/4444 already defined
4568/tcp was successfully added, but 4444/tcp and 4567/tcp failed because they are already assigned to another security context. For example, 4444/tcp belongs to the kerberos security context:
[root@BlogSELinux1 audit]# semanage port -l | grep kerberos_port
kerberos_port_t tcp 88, 750, 4444
kerberos_port_t udp 88, 750, 4444
A TCP port is not allowed by SELinux to belong to more than one security context. We have no other choice than to move the two missing ports to the mysqld_port_t security context:
[root@BlogSELinux1 audit]# semanage port -m -t mysqld_port_t -p tcp 4444
[root@BlogSELinux1 audit]# semanage port -m -t mysqld_port_t -p tcp 4567
[root@BlogSELinux1 audit]# semanage port -l | grep mysqld
mysqld_port_t tcp 4567, 4444, 4568, 1186, 3306, 63132-63164
If you happen to be planning to deploy a Kerberos server on the same servers, you may have to run PXC using a different port for the SST transfer. In that case, and in the case where you want to run MySQL on a port other than 3306/tcp, you’ll need to add the port to the mysqld_port_t context as we just did above. Do not worry too much about port 4567/tcp: it is reserved for tram which, from what I found, is a remote access protocol for routers.
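For example, moving Galera replication to a hypothetical alternate port 14567 (the port number and the configuration line below are illustrations, not values used in this post) would look like:
semanage port -a -t mysqld_port_t -p tcp 14567
# and, in my.cnf, point Galera at the new port:
# wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:14567"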
Non-default paths
It is very common to run MySQL with non-standard paths/directories. With SELinux, you don’t list the authorized paths in the security context; you add the security context labels to the paths. Adding a context label is a two-step process: basically, change and apply. For example, if you are using /data as the MySQL datadir, you need to do:
semanage fcontext -a -t mysqld_db_t "/data(/.*)?"
restorecon -R -v /data
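A quick sanity check, to confirm the label was applied and matches the policy:
ls -Zd /data
matchpathcon /data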
On a RedHat/Centos 7 server, the MySQL file contexts and their associated paths are:
[root@BlogSELinux1 ~]# bzcat /etc/selinux/targeted/active/modules/100/mysql/cil | grep filecon
(filecon "HOME_DIR/\.my\.cnf" file (system_u object_r mysqld_home_t ((s0) (s0))))
(filecon "/root/\.my\.cnf" file (system_u object_r mysqld_home_t ((s0) (s0))))
(filecon "/usr/lib/systemd/system/mysqld.*" file (system_u object_r mysqld_unit_file_t ((s0) (s0))))
(filecon "/usr/lib/systemd/system/mariadb.*" file (system_u object_r mysqld_unit_file_t ((s0) (s0))))
(filecon "/etc/my\.cnf" file (system_u object_r mysqld_etc_t ((s0) (s0))))
(filecon "/etc/mysql(/.*)?" any (system_u object_r mysqld_etc_t ((s0) (s0))))
(filecon "/etc/my\.cnf\.d(/.*)?" any (system_u object_r mysqld_etc_t ((s0) (s0))))
(filecon "/etc/rc\.d/init\.d/mysqld" file (system_u object_r mysqld_initrc_exec_t ((s0) (s0))))
(filecon "/etc/rc\.d/init\.d/mysqlmanager" file (system_u object_r mysqlmanagerd_initrc_exec_t ((s0) (s0))))
(filecon "/usr/bin/mysqld_safe" file (system_u object_r mysqld_safe_exec_t ((s0) (s0))))
(filecon "/usr/bin/mysql_upgrade" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/usr/libexec/mysqld" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/usr/libexec/mysqld_safe-scl-helper" file (system_u object_r mysqld_safe_exec_t ((s0) (s0))))
(filecon "/usr/sbin/mysqld(-max)?" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/usr/sbin/mysqlmanager" file (system_u object_r mysqlmanagerd_exec_t ((s0) (s0))))
(filecon "/usr/sbin/ndbd" file (system_u object_r mysqld_exec_t ((s0) (s0))))
(filecon "/var/lib/mysql(-files|-keyring)?(/.*)?" any (system_u object_r mysqld_db_t ((s0) (s0))))
(filecon "/var/lib/mysql/mysql\.sock" socket (system_u object_r mysqld_var_run_t ((s0) (s0))))
(filecon "/var/log/mariadb(/.*)?" any (system_u object_r mysqld_log_t ((s0) (s0))))
(filecon "/var/log/mysql.*" file (system_u object_r mysqld_log_t ((s0) (s0))))
(filecon "/var/run/mariadb(/.*)?" any (system_u object_r mysqld_var_run_t ((s0) (s0))))
(filecon "/var/run/mysqld(/.*)?" any (system_u object_r mysqld_var_run_t ((s0) (s0))))
(filecon "/var/run/mysqld/mysqlmanager.*" file (system_u object_r mysqlmanagerd_var_run_t ((s0) (s0))))
If you want to avoid security issues with SELinux, you should stay within those paths. A good example of an offending path is the PXC configuration file and directory, which are now located in their own directory. These are not labeled correctly for SELinux:
[root@BlogSELinux1 ~]# ls -Z /etc/per*
-rw-r--r--. root root system_u:object_r:etc_t:s0 /etc/percona-xtradb-cluster.cnf
/etc/percona-xtradb-cluster.conf.d:
-rw-r--r--. root root system_u:object_r:etc_t:s0 mysqld.cnf
-rw-r--r--. root root system_u:object_r:etc_t:s0 mysqld_safe.cnf
-rw-r--r--. root root system_u:object_r:etc_t:s0 wsrep.cnf
I must admit that even though the security context labels on those files were not set, I got no audit messages and everything worked normally. Nevertheless, adding the labels is straightforward:
[root@BlogSELinux1 ~]# semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.cnf"
[root@BlogSELinux1 ~]# semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.conf\.d(/.*)?"
[root@BlogSELinux1 ~]# restorecon -v /etc/percona-xtradb-cluster.cnf
restorecon reset /etc/percona-xtradb-cluster.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
[root@BlogSELinux1 ~]# restorecon -R -v /etc/percona-xtradb-cluster.conf.d/
restorecon reset /etc/percona-xtradb-cluster.conf.d context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
restorecon reset /etc/percona-xtradb-cluster.conf.d/wsrep.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
restorecon reset /etc/percona-xtradb-cluster.conf.d/mysqld.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0
restorecon reset /etc/percona-xtradb-cluster.conf.d/mysqld_safe.cnf context system_u:object_r:etc_t:s0->system_u:object_r:mysqld_etc_t:s0

Variables checklist
Here is a list of all the variables you should check for paths used by MySQL:

datadir, default is /var/lib/mysql, where MySQL stores its data
basedir, default is /usr, where binaries and libraries can be found
character_sets_dir, default is basedir/share/mysql/charsets, charsets used by MySQL
general_log_file, default is the datadir, where the general log is written
init_file, no default, sql file read and executed when the server starts
innodb_undo_directory, default is datadir, where InnoDB stores the undo files
innodb_tmpdir, default is tmpdir, where InnoDB creates temporary files
innodb_temp_data_file_path, default is in the datadir, where InnoDB creates the temporary tablespace
innodb_parallel_doublewrite_path, default is in the datadir, where InnoDB creates the parallel doublewrite buffer
innodb_log_group_home_dir, default is the datadir, where InnoDB writes its transactional log files
innodb_data_home_dir, default is the datadir, used as the default location for the InnoDB files
innodb_data_file_path, default is in the datadir, path of the system tablespace
innodb_buffer_pool_filename, default is in the datadir, where InnoDB writes the buffer pool dump information
lc_messages_dir, basedir/share/mysql
log_bin_basename, default is the datadir, where the binlogs are stored
log_bin_index, default is the datadir, where the binlog index file is stored
log_error, no default value, where the MySQL error log is stored
pid-file, no default value, where the MySQL pid file is stored
plugin_dir, default is basedir/lib/mysql/plugin, where the MySQL plugins are stored
relay_log_basename, default is the datadir, where the relay logs are stored
relay_log_info_file, default is the datadir, may include a path
slave_load_tmpdir, default is tmpdir, where the slave stores files coming from LOAD DATA INFILE statements
slow_query_log_file, default is in the datadir, where the slow queries are logged
socket, no default, where the Unix socket file is created
ssl_*, SSL/TLS related files
tmpdir, default is /tmp, where temporary files are stored
wsrep_data_home_dir, default is the datadir, where Galera stores its files
wsrep_provider->base_dir, default is wsrep_data_home_dir
wsrep_provider->gcache_dir, default is wsrep_data_home_dir, where the gcache file is stored
wsrep_provider->socket.ssl_*, no defaults, where the SSL/TLS related files for the Galera protocol are stored

That’s quite a long list and I may have missed some. If you use a non-standard path for any of these variables, you’ll need to adjust the context labels as we just did above; see the example below.
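For example, if the binary logs were moved to a hypothetical /binlogs directory (log_bin=/binlogs/mysql-bin), the labeling would follow the same pattern as for the datadir:
semanage fcontext -a -t mysqld_db_t "/binlogs(/.*)?"
restorecon -R -v /binlogs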
All together
I would understand if you feel a bit lost; I am not an SELinux guru and it took me some time to understand decently how it works. Let’s recap how to enable SELinux for PXC, from what we learned in the previous sections.
1. Install the SELinux utilities

yum install policycoreutils-python.x86_64

2. Allow the TCP ports used by PXC

semanage port -a -t mysqld_port_t -p tcp 4568
semanage port -m -t mysqld_port_t -p tcp 4444
semanage port -m -t mysqld_port_t -p tcp 4567

3. Modify the SST script
Replace the wait_for_listen function in the /usr/bin/wsrep_sst_xtrabackup-v2 file with the version above. Hopefully, the next PXC release will include an SELinux-friendly wait_for_listen function.
4. Set the security context labels for the configuration files
These steps seem optional, but for completeness:
semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.cnf"
semanage fcontext -a -t mysqld_etc_t "/etc/percona-xtradb-cluster\.conf\.d(/.*)?"
restorecon -v /etc/percona-xtradb-cluster.cnf
restorecon -R -v /etc/percona-xtradb-cluster.conf.d/

5. Create the policy file PXC.te
Create the file PXC.te with this content:
module PXC 1.0;
require {
type unconfined_t;
type mysqld_t;
type unconfined_service_t;
type tmp_t;
type sysctl_net_t;
type kernel_t;
type mysqld_safe_t;
class process { getattr setpgid };
class unix_stream_socket connectto;
class system module_request;
class file { getattr open read write };
class dir search;
}
#============= mysqld_t ==============
allow mysqld_t kernel_t:system module_request;
allow mysqld_t self:process { getattr setpgid };
allow mysqld_t self:unix_stream_socket connectto;
allow mysqld_t sysctl_net_t:dir search;
allow mysqld_t sysctl_net_t:file { getattr open read };
allow mysqld_t tmp_t:file write;

6. Compile and load the policy module

checkmodule -M -m -o PXC.mod PXC.te
semodule_package -o PXC.pp -m PXC.mod
semodule -i PXC.pp
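You can confirm the module is loaded with:
semodule -l | grep PXC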

7. Run for a while in Permissive mode
Set SELinux to permissive mode in /etc/sysconfig/selinux and reboot. Validate that everything works fine in permissive mode and check the audit.log for any denied messages. If there are denied messages, address them before enforcing.
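The same filter we used to build the policy works well for this check; an empty output means you are ready to enforce:
grep -i denied /var/log/audit/audit.log | grep mysqld_t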
8. Enforce SELINUX
Last step, enforce SELinux:
setenforce 1
perl -pi -e 's/SELINUX=permissive/SELINUX=enforcing/g' /etc/sysconfig/selinux

Conclusion
As we can see, enabling SELinux with PXC is not straightforward but, once the process is understood, it is not that hard either. In an IT world where security is more than ever a major concern, enabling SELinux with PXC is a nice step forward. In an upcoming post, we’ll look at the other security framework, AppArmor.

About ZFS Performance

If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.
To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.
I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.

In both cases, the load is IO-bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset was created is extremely favorable to XFS, since there is absolutely no file fragmentation. Once all the files are open, a read IOP is just a single fseek call to an offset; XFS doesn’t need to access any intermediate inode. The above result is about as fair as saying MyISAM is faster than InnoDB based only on the table scan performance of unfragmented tables with a default configuration. ZFS is much less affected by file-level fragmentation, especially for point accesses.
More on ZFS metadata
ZFS stores files in B-trees, in a fashion very similar to how InnoDB stores data. To access a piece of data in a B-tree, you need to access the top-level page (often called the root node) and then one block per level down to the leaf node containing the data. With no cache, reading something from a three-level B-tree thus requires three IOPS.
Simple three-level B-tree

The extra IOPS performed by ZFS are needed to access the internal blocks of the files’ B-trees. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files’ B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 2.5GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.
To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.
You’ll often read or hear the ratio of 1GB of ARC for 1TB of data, which is about the same 0.1% ratio as for InnoDB. I wrote about that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than one using a recordsize of 16KB, because it has 8x fewer leaf pages. Fewer leaf pages require fewer B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access, as it maximizes compression and reduces the IOPS, but it is poor for the small random accesses MySQL/InnoDB performs.
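To make these ratios concrete, here is a back-of-the-envelope estimate for a 1TB dataset, assuming 16KB internal pages and a fanout of 128 (the L1 level dominates, so the higher levels are ignored):
# leaves = dataset size / recordsize; internal pages ~= leaves / 128
for rs_kb in 128 64 32 16; do
    leaves=$(( 1024 * 1024 * 1024 / rs_kb ))  # 1TB expressed in KB, divided by the recordsize in KB
    echo "recordsize=${rs_kb}KB => ~$(( leaves / 128 * 16 / 1024 ))MB of ARC for metadata"
done
This yields roughly 1GB, 2GB, 4GB and 8GB respectively, the same 0.1% to 0.8% progression summarized later in this post.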
To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:
# echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
# arcstat -f time,arcsz,mm%,mhit,mread,dread,pread 10
time arcsz mm% mhit mread dread pread
10:22:49 105M 0 0 0 0 0
10:22:59 113M 100 0 22 73 0
10:23:09 120M 100 0 20 68 0
10:23:19 127M 100 0 20 65 0
10:23:29 135M 100 0 22 74 0
You’ll want the 'mm%' column, the metadata miss percentage, to reach 0. So, when the 'arcsz' column is no longer growing and you still have high values for 'mm%', the ARC is too small. Increase the value of 'zfs_arc_max' and continue to monitor.
If the 1GB of ARC for 1TB of data ratio is good for a large ZFS recordsize, it is likely too small for a recordsize of 16KB. Do 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Although likely, let’s verify.
The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file. The tool needs the inode of a file and the ZFS filesystem as inputs. Here’s an invocation for one of the tables of my dataset:
# cd /var/lib/mysql/data/sbtest
# ls -li | grep sbtest1.ibd
36493 -rw-r----- 1 mysql mysql 2441084928 avr 15 15:28 sbtest1.ibd
# zdb -ddddd mysqldata/data 36493 > zdb5d.out
# more zdb5d.out
Dataset mysqldata/data [ZPL], ID 90, cr_txg 168747, 4.45G, 26487 objects, rootbp DVA[0]=<0:1a50452800:200> DVA[1]=<0:5b289c1600:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=3004977L/3004977P fill=26487 cksum=13723d4400:5d1f47fb738:fbfb87e6e278:1f30c12b7fa1d1
Object lvl iblk dblk dsize lsize %full type
36493 4 16K 16K 1.75G 2.27G 97.62 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 148991
path /var/lib/mysql/data/sbtest/sbtest1.ibd
uid 103
gid 106
atime Sun Apr 15 15:04:13 2018
mtime Sun Apr 15 15:28:45 2018
ctime Sun Apr 15 15:28:45 2018
crtime Sun Apr 15 15:04:13 2018
gen 3004484
mode 100640
size 2441084928
parent 36480
links 1
pflags 40800000004
Indirect blocks:
0 L3 0:1a4ea58800:400 4000L/400P F=145446 B=3004774/3004774
0 L2 0:1c83454c00:1800 4000L/1800P F=16384 B=3004773/3004773
0 L1 0:1eaa626400:1600 4000L/1600P F=128 B=3004773/3004773
0 L0 0:1c6926ec00:c00 4000L/c00P F=1 B=3004773/3004773
4000 L0 EMBEDDED et=0 4000L/6bP B=3004484
8000 L0 0:1c69270c00:400 4000L/400P F=1 B=3004773/3004773
c000 L0 0:1c7fbae400:800 4000L/800P F=1 B=3004736/3004736
10000 L0 0:1ce3f53600:3200 4000L/3200P F=1 B=3004484/3004484
14000 L0 0:1ce3f56800:3200 4000L/3200P F=1 B=3004484/3004484
18000 L0 0:18176fa600:3200 4000L/3200P F=1 B=3004485/3004485
1c000 L0 0:18176fd800:3200 4000L/3200P F=1 B=3004485/3004485

[more than 140k lines truncated]
The last section of the above output is very interesting, as it shows the B-tree pages. The ZFS B-tree of the file sbtest1.ibd has four levels: L3 is the root page, L2 holds the first-level (from the top) pages, L1 the second-level pages, and L0 the leaf pages. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.
The non-leaf pages are always 16KB (4000L) and they are always compressed on disk with lzop (if I read correctly). In the ARC, these pages are stored uncompressed, so they use 16KB of memory each. The fanout of a ZFS B-tree, the largest possible ratio of the number of pages between levels, is 128. With the above output, we can easily calculate the amount of metadata needed to cache all the non-leaf pages in the ARC:
# grep -c L3 zdb5d.out
1
# grep -c L2 zdb5d.out
9
# grep -c L1 zdb5d.out
1150
# grep -c L0 zdb5d.out
145447
So, each of the 330 tables of the dataset has 1160 non-leaf pages and 145447 leaf pages, a ratio very close to the predicted 0.8%. For the complete dataset of 749GB, we would need the ARC to be at least 6GB to fully cache all the metadata pages. Of course, there is some overhead to add. In my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum ARC size I should have used is almost 7GB.
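For reference, here is the arithmetic behind those numbers (an approximation, since it ignores the per-table variation):
echo $(( 330 * 1160 * 16 / 1024 ))MB              # ~5981MB of non-leaf pages, hence ~6GB
echo $(( 330 * 1160 * 16 * 115 / 100 / 1024 ))MB  # ~6878MB with 15% ARC overhead, hence ~7GB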
Of course, an ARC of 7GB on a server with 15GB of RAM is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular InnoDB page size of 16KB, if you use a page size of 32KB with a matching ZFS recordsize, you cut the ARC size requirement by half, to 0.4% of the uncompressed size.
Similarly, an InnoDB page size of 64KB with a similar ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I’ll blog more about the use of larger InnoDB pages with ZFS in the near future. If the use of larger InnoDB page sizes is not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.
So, let’s propose a new rule of thumb for the required ARC/L2ARC size for a given dataset:

Recordsize of 128KB => 0.1% of the uncompressed dataset size
Recordsize of 64KB => 0.2% of the uncompressed dataset size
Recordsize of 32KB => 0.4% of the uncompressed dataset size
Recordsize of 16KB => 0.8% of the uncompressed dataset size

The ZFS revenge
In order to improve the ZFS performance, I had three options:

Increase the ARC size to 7GB
Use a larger InnoDB page size, like 64KB
Add an L2ARC

I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.
The choice of an i3.large instance type is not accidental. The instance has a 475GB ephemeral NVMe storage device. Let’s try to use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
echo 838860800 > /sys/module/zfs/parameters/zfs_arc_meta_limit
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
echo 4 > /sys/module/zfs/parameters/l2arc_headroom
echo 16 > /sys/module/zfs/parameters/l2arc_headroom_boost
echo 0 > /sys/module/zfs/parameters/l2arc_norw
echo 1 > /sys/module/zfs/parameters/l2arc_feed_again
echo 5 > /sys/module/zfs/parameters/l2arc_feed_min_ms
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
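These tunables assume the NVMe device is already part of the pool as a cache vdev; if it is not, adding it is a single command (the device name below is an assumption for the i3.large ephemeral disk, check yours with lsblk):
# Attach the ephemeral NVMe device as the L2ARC of the mysqldata pool
zpool add mysqldata cache /dev/nvme0n1
# Confirm the cache vdev is present
zpool iostat -v mysqldata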
I then ran 'cat /var/lib/mysql/data/sbtest/* > /dev/null' to force filesystem reads and warm the caches for all of the tables. A key setting here, allowing the L2ARC to cache data, is zfs_arc_meta_limit. It needs to be slightly smaller than zfs_arc_max in order to allow some data to be cached in the ARC. Remember that the L2ARC is fed by the LRU of the ARC: you need data cached in the ARC in order to have data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ratio of only 1.28x. A more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:

Now, the comparison is very different: ZFS completely outperforms XFS, at 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher, but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could have used flashcache or bcache with XFS and improved the XFS results, but those technologies are far more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement, and cost, of the gp2 disk.
ZFS is much more complex than XFS and EXT4, but that also means it has more tunables/options. I used a simplistic setup and an unfair benchmark, which initially led to poor ZFS results. With the same benchmark, very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results, now 66% above XFS.
Conclusion
We have seen in this post why the general perception is that ZFS underperforms compared to XFS or EXT4. The presence of B-trees for the files has a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When properly cached, the performance of ZFS is excellent. ZFS allows you to optimize the use of EBS volumes, both in terms of IOPS and size, when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.

Fsync Performance on Storage Devices

While preparing a post on the design of ZFS based servers for use with MySQL, I stumbled on the topic of fsync call performance. The fsync call is very expensive, but it is essential to databases as it allows for durability (the “D” of the ACID acronym). Let’s first review the type of disk IO operations […]

The hidepid Options for /proc and Percona XtraDB Cluster Don’t Play Well Together

In this blog, we’ll look at how the hidepid options for /proc and Percona XtraDB Cluster can fight with one another.
One of the things I like about consulting at Percona is the opportunity to be exposed to unusual problems. I recently worked with a customer having issues getting SST to work with Percona XtraDB Cluster. A simple problem, you would think. After four hours of debugging, my general feeling was that nothing made sense.
I added a bash trace to the SST script and it claimed MySQL died prematurely:
+ [[ -n '' ]]
+ ps -p 11244
+ wsrep_log_error 'Parent mysqld process (PID:11244) terminated unexpectedly.'
+ wsrep_log '[ERROR] Parent mysqld process (PID:11244) terminated unexpectedly.'
++ date '+%Y-%m-%d %H:%M:%S'
+ local readonly 'tst=2017-11-28 22:02:46'
At the same time, from the MySQL error log MySQL was complaining the SST script died:
2017-11-28 22:02:46 11244 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.31.4.179' --datadir '/var/lib/my
sql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '11244' '' : 32 (Broken pipe)
2017-11-28 22:02:46 11244 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2017-11-28 22:02:46 11244 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2017-11-28 22:02:46 11244 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2017-11-28 22:02:46 11244 [ERROR] Aborting

The Solution
Clearly, something odd was at play. But what? At that point, I decided to try a few operations with the mysql user. Finally, I stumbled onto something:
[root@db-01 mysql]# su mysql -
bash-4.2$ ps fax
PID TTY STAT TIME COMMAND
11901 pts/0 S 0:00 bash -
11902 pts/0 R+ 0:00 \_ ps fax
bash-4.2$
There are way more than 100 processes on these servers, so why can’t the mysql user see them? Of course, the SST script monitors the state of its parent process using ps. Look at the bash trace above: 11244 is the mysqld pid. After a little googling exercise, I found this blog post about the /proc hidepid mount option. Of course, the customer was using this option:
[root@db-02 lib]# mount | grep '^proc'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime,hidepid=2)
I removed the hidepid option using remount, and set hidepid=0 on all the nodes:
mount -o remount,rw,nosuid,nodev,noexec,relatime,hidepid=0 /proc
This simple command solved the issue: the SST scripts started to work normally. A good lesson learned: do not overlook security settings!
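As a side note, if removing hidepid entirely feels too permissive, procfs also supports a gid= mount option naming a group whose members are exempted from the restriction. Something like this (the mysql group is an assumption about your setup) would have preserved hidepid=2 for everyone else:
# Keep hidepid=2 but exempt the mysql group from the restriction
mount -o remount,rw,nosuid,nodev,noexec,relatime,hidepid=2,gid=$(id -g mysql) /proc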
