Month: August 2015

Cached linkbench performance for MySQL 5.7.8, 5.6, WebScale and MyRocks

This extends previous results for Linkbench to compare performance for a cached database with concurrent clients. My conclusions are:

  • InnoDB compression in the Facebook patch for MySQL 5.6 is much faster for insert-heavy workloads than the same feature in upstream 5.6 and 5.7. Too bad those changes might not reach upstream.
  • InnoDB transparent page compression is faster than non-transparent compression for write-heavy workloads, assuming that feature is OK to use on your servers.
  • QPS for MyRocks suffers over time. We have work in progress to fix this. Otherwise it is already competitive with InnoDB.
  • Compression with MyRocks is much better than with InnoDB for linkbench data. That has also been true for real workloads.
  • Load rates are higher for compressed InnoDB tables when partitioning is used for 5.6 but not for 5.7. I didn't debug the slowdown in 5.7. Partitioning has been a win in the past for IO-bound linkbench because it reduces contention on the per-index mutex in InnoDB. Work has been done in 5.7 to reduce that contention.

Setup

The database size was between 10G and 30G after the load. The test was run with maxid=20000001, loaders=10 & requesters=20. Otherwise the default settings for linkbench were used. The InnoDB buffer pool was large enough to cache the database. The server has 144G of RAM, fast PCIe flash storage and 40 HW threads with HT enabled. The binlog was enabled but fsync was not done for the binlog or InnoDB redo log on commit.

I tested several configurations for compression and partitioning:

  • p0.c0 – no partitioning, no compression
  • p0.c1 – no partitioning, compression (transparent & non-transparent)
  • p1.c0 – partitioning, no compression
  • p1.c1 – partitioning, compression (transparent & non-transparent)

I tested several binaries:

  • myrocks – the Facebook patch for MySQL, 5.6.X and the RocksDB storage engine
  • fb56 – the Facebook patch for MySQL, 5.6.X and InnoDB
  • orig56.ps – upstream 5.6.26 with the performance schema (PS) enabled
  • orig57.ps – upstream 5.7.8 with non-transparent compression and PS enabled
  • orig57.tc – upstream 5.7.8 with transparent compression and PS enabled

The test was done in two parts. First I measure the load performance, then I run the query test for 12 1-hour intervals.
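For reference, a run along those lines looks roughly like the commands below. This is a sketch only: the -D overrides and property names (maxid1, loaders, requesters, maxtime) are taken from the stock LinkBench config files and may differ from the exact settings used for these results.

    # Load phase, then the request phase (repeated for 12 one-hour intervals).
    $ ./bin/linkbench -c config/LinkConfigMysql.properties \
        -Dmaxid1=20000001 -Dloaders=10 -l
    $ ./bin/linkbench -c config/LinkConfigMysql.properties \
        -Dmaxid1=20000001 -Drequesters=20 -Dmaxtime=3600 -r    # run 12 times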
The data below is the insert rate from the load (load ips), the size after load (load gb), the QPS rate during the second and twelfth 1-hour runs (2h qps, 12h qps) and the size after the second and twelfth 1-hour runs (2h gb, 12h gb).

p0.c0

load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
136041  14      43784   18      24298   20      myrocks
109724  22      49881   31      48459   51      fb56
103030  24      39979   34      39582   54      orig56.ps
116343  24      48506   35      48112   58      orig57.ps

p0.c1

load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
 73115  15      42508   20      35766   32      fb56
 45660  16      36474   22      33107   34      orig56.ps
 46737  16      40890   22      37305   36      orig57.ps
101966  17      33716   23      29695   37      orig57.tc

p1.c0

load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
101783  26      34342   30      21883   36      myrocks
105099  24      48686   33      47369   52      fb56
 97931  27      39343   36      39000   55      orig56.ps
109230  27      46671   37      46155   59      orig57.ps

p1.c1

load    load    2h      2h      12h     12h
ips     gb      qps     gb      qps     gb      config
 91884  15      46852   21      45223   36      fb56
 74080  17      39379   23      38627   38      orig56.ps
 77037  17      45156   24      44070   40      orig57.ps
 87708  19      37062   25      32424   40      orig57.tc

Graphs!

And for people who want graphs, this has the average insert rate from the load and the average query rate from the twelfth hour for the p0.c0 test (no partitioning, no compression).
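For readers who want to reproduce the compressed and partitioned configurations, they correspond roughly to InnoDB table options like the ones below. The KEY_BLOCK_SIZE, the partition count and the use of ALTER TABLE are assumptions for illustration, not the exact DDL behind these results.

    # p0.c1 with non-transparent compression (5.6 and 5.7)
    $ mysql linkdb -e "ALTER TABLE linktable ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8"
    # p0.c1 with transparent page compression (5.7.8, orig57.tc)
    $ mysql linkdb -e "ALTER TABLE linktable COMPRESSION='zlib'"
    # p1.* variants add partitioning, for example:
    $ mysql linkdb -e "ALTER TABLE linktable PARTITION BY KEY (id1) PARTITIONS 32"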

High-load clusters and desynchronized nodes on Percona XtraDB Cluster

There can be a lot of confusion and lack of planning in Percona XtraDB Clusters with regard to nodes becoming desynchronized for various reasons. This can happen a few ways:

  • An IST or SST joining node catching up after a state transfer (Joined/Joining state)
  • Using wsrep_desync for something like a backup
  • Executing a rolling schema upgrade using wsrep_OSU_method=RSU

When I say "desynchronized" I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening. For example, a node taking a backup would set wsrep_desync=ON during the backup and potentially fall behind replication by some amount. Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.

In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation. For an example of this, see my last blog about IST. If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.

Example setup

Let us take my typical 3-node cluster with workload on node1. We are taking a blocking backup of some kind on node3, so we are executing the following steps:

  1. node3> set global wsrep_desync=ON;
  2. Node3's "backup" starts, beginning with FLUSH TABLES WITH READ LOCK;
  3. Galera is paused on node3 and the wsrep_local_recv_queue grows some amount
  4. Node3's "backup" finishes, ending with UNLOCK TABLES;
  5. node3> set global wsrep_desync=OFF;

During the backup

This includes up through step 3 above. My node1 is unaffected by the backup on node3; I can see it averaging 5-6k writesets (transactions) per second, which it did before we began. node2 is also unaffected, but node3 is not applying and its queue is building up.

Unlock tables, still wsrep_desync=ON

Let's examine briefly what happens when node3 is permitted to start applying but wsrep_desync stays enabled. node1's performance is pretty much the same and node3 is not using flow control yet. However, there is a problem: it's hard to notice, but node3 is NOT catching up; instead it is falling further behind! We have potentially created a situation where node3 may never catch up.

The PXC nodes were close enough to the red line of performance that node3 can only apply just about as fast (and somewhat slower until it heats up a bit) as new transactions are coming into node1. This represents a serious concern in PXC capacity planning: nodes not only need to be fast enough to handle the normal workload, but also to catch up after maintenance operations or failures cause them to fall behind. Experienced MySQL DBAs will realize this isn't all that different from master/slave replication.

Flow control as a way to recovery

So here's the trick: if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3's apply rate. This gives node3 a good chance of catching up, but the tradeoff is reduced write throughput for the cluster. Let's see what this looks like in context with all of our steps. wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20PM. At the moment node3 starts utilizing flow control to prevent falling further behind, our write throughput (in this specific environment and workload) is reduced by approximately 1/3rd (YMMV).
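A quick way to watch this progression from the command line is to poll a few wsrep status counters on node3 (the host name is a placeholder): the queue grows while the node is desynced, and wsrep_flow_control_sent starts climbing once wsrep_desync is turned off and node3 has to throttle the cluster.

    # Apply queue and cumulative flow-control counters on node3.
    $ mysql -h node3 -e "SHOW GLOBAL STATUS WHERE Variable_name IN
        ('wsrep_local_recv_queue', 'wsrep_flow_control_sent',
         'wsrep_flow_control_paused', 'wsrep_local_state_comment')"

Polling that in a loop gives roughly the same picture as the graphs and the myq_status output below.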
The cluster will remain in this state until node3 catches up and returns to the 'Synced' state. This catchup is still happening as I write this post, almost 4 hours after it started, and will likely take another hour or two to complete. I can see a more real-time representation of this by using myq_status on node1, summarizing every minute:

[root@node1 ~]# myq_status -i 1m wsrep
mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab)
Cluster Node Outbound Inbound FlowC Conflct Gcache Appl
time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef
19:58:47 P 5 3 Sync 0.9ms 3128 2.0M 0 27 213b 0 25.4s 0 0 0 3003k 16k 62%
19:59:47 P 5 3 Sync 1.1ms 3200 2.1M 0 31 248b 0 18.8s 0 0 0 3003k 16k 62%
20:00:47 P 5 3 Sync 0.9ms 3378 2.2M 32 27 217b 0 26.0s 0 0 0 3003k 16k 62%
20:01:47 P 5 3 Sync 0.9ms 3662 2.4M 32 33 266b 0 18.9s 0 0 0 3003k 16k 62%
20:02:47 P 5 3 Sync 0.9ms 3340 2.2M 32 27 215b 0 27.2s 0 0 0 3003k 16k 62%
20:03:47 P 5 3 Sync 0.9ms 3193 2.1M 0 27 215b 0 25.6s 0 0 0 3003k 16k 62%
20:04:47 P 5 3 Sync 0.9ms 3009 1.9M 12 28 224b 0 22.8s 0 0 0 3003k 16k 62%
20:05:47 P 5 3 Sync 0.9ms 3437 2.2M 0 27 218b 0 23.9s 0 0 0 3003k 16k 62%
20:06:47 P 5 3 Sync 0.9ms 3319 2.1M 7 28 220b 0 24.2s 0 0 0 3003k 16k 62%
20:07:47 P 5 3 Sync 1.0ms 3388 2.2M 16 31 251b 0 22.6s 0 0 0 3003k 16k 62%
20:08:47 P 5 3 Sync 1.1ms 3695 2.4M 19 39 312b 0 13.9s 0 0 0 3003k 16k 62%
20:09:47 P 5 3 Sync 0.9ms 3293 2.1M 0 26 211b 0 26.2s 0 0 0 3003k 16k 62%

This reports around 20-25 seconds of flow control every minute, which is consistent with that ~1/3rd performance reduction we see in the graphs above. Watching node3 the same way proves it is sending the flow control (FlowC snt):

mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab)
Cluster Node Outbound Inbound FlowC Conflct Gcache Appl
time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef
17:38:09 P 5 3 Dono 0.8ms 0 0b 0 4434 2.8M 16m 25.2s 31 0 0 18634 16k 80%
17:39:09 P 5 3 Dono 1.3ms 0 0b 1 5040 3.2M 16m 22.1s 29 0 0 37497 16k 80%
17:40:09 P 5 3 Dono 1.4ms 0 0b 0 4506 2.9M 16m 21.0s 31 0 0 16674 16k 80%
17:41:09 P 5 3 Dono 0.9ms 0 0b 0 5274 3.4M 16m 16.4s 27 0 0 22134 16k 80%
17:42:09 P 5 3 Dono 0.9ms 0 0b 0 4826 3.1M 16m 19.8s 26 0 0 16386 16k 80%
17:43:09 P 5 3 Jned 0.9ms 0 0b 0 4957 3.2M 16m 18.7s 28 0 0 83677 16k 80%
17:44:09 P 5 3 Jned 0.9ms 0 0b 0 3693 2.4M 16m 27.2s 30 0 0 131k 16k 80%
17:45:09 P 5 3 Jned 0.9ms 0 0b 0 4151 2.7M 16m 26.3s 34 0 0 185k 16k 80%
17:46:09 P 5 3 Jned 1.5ms 0 0b 0 4420 2.8M 16m 25.0s 30 0 0 245k 16k 80%
17:47:09 P 5 3 Jned 1.3ms 0 0b 1 4806 3.1M 16m 21.0s 27 0 0 310k 16k 80%

There are a lot of flow control messages (around 30) per minute. This is a lot of ON/OFF toggles of flow control where writes are briefly delayed, rather than a steady "you can't write" for 20 seconds straight. Interestingly, the node also spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (this has the same meaning as during an IST).

Does it matter?

As always, it depends. If these are web requests and suddenly the database can only handle ~66% of the traffic, that's likely a problem, but maybe it just slows down the website somewhat. I want to emphasize that WRITES are what is affected here. Reads on any and all nodes should be normal (though you probably don't want to read from node3 since it is so far behind). If this were some queue processing that had reduced throughput, I'd expect it to possibly catch up later.

This can only be answered for your application, but the takeaways for me are:

  • Don't underestimate your capacity requirements. Being at the redline normally means you are well past the redline for abnormal events.
  • Plan for maintenance and failure recoveries. Where possible, build queuing into your workflows so diminished throughput in your architecture doesn't generate failures.

Happy clustering!

Graphs in this post courtesy of VividCortex.

The post High-load clusters and desynchronized nodes on Percona XtraDB Cluster appeared first on Percona Data Performance Blog.

MySQL replication in action – Part 5 – parallel appliers

Previous episodes:

  • MySQL replication monitoring 101
  • MySQL replication in action – Part 1: GTID & Co
  • MySQL replication in action – Part 2 – Fan-in topology
  • MySQL replication in action – Part 3 – All-masters P2P topology
  • MySQL replication in action – Part 4 – star and hybrid topologies

Parallel replication overview

One of the main grievances of replication users is that, while a well tuned master server can handle thousands of concurrent operations, an equally tuned slave is constrained to work on a single thread. In Figure 1, we see the schematics of this paradigm. Multiple operations on the master are executed simultaneously and saved to the binary log. The slave IO thread copies the binary log events to a local log, and on that log the SQL thread executes the events on the slave database. When the master is very active, chances are that the slave lags behind, causing hatred and nightmares to the DBAs.

Figure 1 – Single applier

Until 2011, there was little remedy to this problem. Then we saw the first implementation of parallel apply, provided by Tungsten Replicator. The parallelism is schema-based, and provides a 5x improvement in slave performance. Shortly after that, the same algorithm of schema-based parallel appliers was implemented, with the same performance gain, in MySQL 5.6 (Figure 2). This implementation is valuable for organizations that have data split by schemas, which can happen for companies that have adopted sharding and split their logical chunks into different schemas, or for companies that run multi-tenant services, where splitting data by schema is a necessity. This solution does not meet the needs of all users. Many systems are based on a single very active schema and perhaps a few ancillary ones with minimal traffic. For those users, parallel replication by schema is useless.

Figure 2 – Parallel applier

In more recent times, we have seen three more implementations:

  • In MariaDB 10, parallel appliers are based on information generated by the master when using group commit.
  • In MySQL 5.7, a new parallelism mode was introduced, based on the events' logical clock: events that have the same timestamp are safe to execute in parallel.
  • In MariaDB 10.1, there is an optimistic mode that breaks the commit order guaranteed by the group-commit algorithm, and allows higher performance.

We are not going to measure the performance of all these methods. Having benchmarked a few of them, I know by experience that you can get enormous performance gains, but depending on the data traffic you can also get smaller and smaller advantages, and you may even slow down the slave further in some cases. The performance depends on data distribution, on the pattern of the data, on the type and amount of indexes, and on some unknown factors that sometimes baffle both developers and DBAs. Thus, we will focus our examination on the manageability of parallel appliers. What is important to note here is that all implementations introduce, in addition to a larger number of appliers, much more complexity in the operations, where we need to know, and eventually monitor, much more than we do with single-threaded replication. We'll see in a moment how the new features rise to this challenge.

Setting up parallel replication

Compared to what we had to do for multi-source, the setup is not hard. There is a variable that we need to change. The name is different in MySQL 5.6/5.7 and MariaDB 10.
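In both flavors the change amounts to stopping the slave threads, raising the worker count, and restarting; here is a minimal sketch. The count of 10 simply matches the examples that follow, and the slave_parallel_type line applies only to 5.7.

    # MySQL 5.6 / 5.7
    $ mysql -e "STOP SLAVE SQL_THREAD;
                SET GLOBAL slave_parallel_workers=10;
                START SLAVE SQL_THREAD"

    # MySQL 5.7 only: parallelize within one schema using the logical clock
    # (set while the SQL thread is stopped).
    $ mysql -e "SET GLOBAL slave_parallel_type='LOGICAL_CLOCK'"

    # MariaDB 10
    $ mysql -e "STOP SLAVE;
                SET GLOBAL slave_parallel_threads=10;
                START SLAVE"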
As it happens for multi-source, we can set up the functionality using one of the mysql-replication-samples scripts on GitHub.

Setup in MySQL 5.7

When we run the script install_parallel.sh, what happens is that the slave gets stopped, the parallel workers variable is increased, and then the slave restarts. A look at show processlist on the slave tells us that there are now 10 processes waiting for work.

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| slave_parallel_workers | 10    |
+------------------------+-------+

+----+-------------+-----------+------+---------+------+---------------------------------------+------------------+
| Id | User        | Host      | db   | Command | Time | State                                 | Info             |
+----+-------------+-----------+------+---------+------+---------------------------------------+------------------+
|  9 | system user |           | NULL | Connect |    0 | Waiting for master to send event      | NULL             |
| 10 | system user |           | NULL | Connect |    0 | Reading event from the relay log      | NULL             |
| 11 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 12 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 13 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 14 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 15 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 16 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 17 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 18 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 19 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 20 | system user |           | NULL | Connect |    0 | Waiting for an event from Coordinator | NULL             |
| 22 | msandbox    | localhost | NULL | Query   |    0 | starting                              | show processlist |
+----+-------------+-----------+------+---------+------+---------------------------------------+------------------+

Setup in MariaDB 10

Same story in MariaDB 10. Stop the slave, change the variable, restart the slave. The process list shows willing workers.

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| slave_parallel_threads | 10    |
+------------------------+-------+

+----+-------------+-----------+------+---------+------+------------------------------------+------------------+----------+
| Id | User        | Host      | db   | Command | Time | State                              | Info             | Progress |
+----+-------------+-----------+------+---------+------+------------------------------------+------------------+----------+
| 10 | system user |           | NULL | Connect |    0 | Waiting for master to send event   | NULL             |    0.000 |
| 11 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 12 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 13 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 14 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 15 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 16 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 17 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 18 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 19 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 20 | system user |           | NULL | Connect |    0 | Waiting for work from SQL thread   | NULL             |    0.000 |
| 21 | system user |           | NULL | Connect |    0 | Slave has read all relay log; wait | NULL             |    0.000 |
| 23 | msandbox    | localhost | NULL | Query   |    0 | init                               | show processlist |    0.000 |
+----+-------------+-----------+------+---------+------+------------------------------------+------------------+----------+

The above output has been shortened for easier showing.
The message for thread 21 said "waiting for master to send event".

Monitoring parallel replication

After the setup, the interesting operations can start. We need to know what we can get from the monitoring facilities. To monitor, however, we need some action to look at. In the same mysql-replication-samples repository there are a few scripts that can generate load for multiple databases (multi_inserts.sh) or for many tables in the same database (multi_inserts_one_db.sh). We will use the first one in MySQL 5.7 to check the default algorithm (parallelism by schema) and the second script for MariaDB, again to test its default algorithm, which can run parallel appliers in the same schema. As I mentioned before, we are not comparing performance here: we just want to see what we can detect when parallel replication is running. The scripts are nothing fancy. They generate lots of concurrent inserts in many tables. This is enough to see all the parallel workers busy. You can use your favorite stress test instead of these simple scripts.

Monitoring in MySQL 5.7

Let's start with MySQL 5.7, where we have two tables dedicated to parallel replication monitoring. The first one, in the mysql schema, can show the progress of work on binary logs and relay logs. There is no mention of GTIDs. And there is no mention of the database. For a system where parallelism can happen by schema, this looks like insufficient design. Another thing that is important to notice is the field Checkpoint_group_bitmap, whose purpose I don't know, but which is very prominent when you select the table contents, because this column is filled with non-printable characters. For this reason, we will select just a few columns, to see the basic progress.

slave1 [localhost] {msandbox} (mysql) > desc slave_worker_info;
+----------------------------+---------------------+------+-----+---------+-------+
| Field                      | Type                | Null | Key | Default | Extra |
+----------------------------+---------------------+------+-----+---------+-------+
| Id                         | int(10) unsigned    | NO   | PRI | NULL    |       |
| Relay_log_name             | text                | NO   |     | NULL    |       |
| Relay_log_pos              | bigint(20) unsigned | NO   |     | NULL    |       |
| Master_log_name            | text                | NO   |     | NULL    |       |
| Master_log_pos             | bigint(20) unsigned | NO   |     | NULL    |       |
| Checkpoint_relay_log_name  | text                | NO   |     | NULL    |       |
| Checkpoint_relay_log_pos   | bigint(20) unsigned | NO   |     | NULL    |       |
| Checkpoint_master_log_name | text                | NO   |     | NULL    |       |
| Checkpoint_master_log_pos  | bigint(20) unsigned | NO   |     | NULL    |       |
| Checkpoint_seqno           | int(10) unsigned    | NO   |     | NULL    |       |
| Checkpoint_group_size      | int(10) unsigned    | NO   |     | NULL    |       |
| Checkpoint_group_bitmap    | blob                | NO   |     | NULL    |       |
| Channel_name               | char(64)            | NO   | PRI | NULL    |       |
+----------------------------+---------------------+------+-----+---------+-------+
13 rows in set (0.00 sec)

slave1 [localhost] {msandbox} (mysql) > select ID, Relay_log_name, Relay_log_pos, Master_log_name, Master_log_pos from slave_worker_info;
+----+----------------------+---------------+------------------+----------------+
| ID | Relay_log_name       | Relay_log_pos | Master_log_name  | Master_log_pos |
+----+----------------------+---------------+------------------+----------------+
|  1 | ./mysql-relay.000003 |      33029530 | mysql-bin.000002 |       33029317 |
|  2 | ./mysql-relay.000003 |      33066066 | mysql-bin.000002 |       33065853 |
|  3 | ./mysql-relay.000003 |      33111090 | mysql-bin.000002 |       33110877 |
|  4 | ./mysql-relay.000003 |      33107225 | mysql-bin.000002 |       33107012 |
|  5 | ./mysql-relay.000003 |      33059630 | mysql-bin.000002 |       33059417 |
|  6 | ./mysql-relay.000003 |      33056541 | mysql-bin.000002 |       33056328 |
|  7 | ./mysql-relay.000003 |      33086391 | mysql-bin.000002 |       33086178 |
|  8 | ./mysql-relay.000003 |      33018457 | mysql-bin.000002 |       33018244 |
|  9 | ./mysql-relay.000003 |      33052167 | mysql-bin.000002 |       33051954 |
| 10 | ./mysql-relay.000003 |      33044450 | mysql-bin.000002 |       33044237 |
+----+----------------------+---------------+------------------+----------------+
10 rows in set (0.00 sec)

We can see that workers are using different positions of the relay logs, which correspond to different positions of the binary logs. However, there are no timestamps, so it is hard to know which workers are working more. Next, we look at the table dedicated to parallel replication in performance_schema.

slave1 [localhost] {msandbox} (mysql) > use performance_schema
Database changed
slave1 [localhost] {msandbox} (performance_schema) > desc replication_applier_status_by_worker;
+-----------------------+---------------------+------+-----+-------------------+-----------------------------+
| Field                 | Type                | Null | Key | Default           | Extra                       |
+-----------------------+---------------------+------+-----+-------------------+-----------------------------+
| CHANNEL_NAME          | char(64)            | NO   |     | NULL              |                             |
| WORKER_ID             | bigint(20) unsigned | NO   |     | NULL              |                             |
| THREAD_ID             | bigint(20) unsigned | YES  |     | NULL              |                             |
| SERVICE_STATE         | enum('ON','OFF')    | NO   |     | NULL              |                             |
| LAST_SEEN_TRANSACTION | char(57)            | NO   |     | NULL              |                             |
| LAST_ERROR_NUMBER     | int(11)             | NO   |     | NULL              |                             |
| LAST_ERROR_MESSAGE    | varchar(1024)       | NO   |     | NULL              |                             |
| LAST_ERROR_TIMESTAMP  | timestamp           | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+-----------------------+---------------------+------+-----+-------------------+-----------------------------+

slave1 [localhost] {msandbox} (performance_schema) > select WORKER_ID, THREAD_ID, SERVICE_STATE, LAST_SEEN_TRANSACTION, LAST_ERROR_MESSAGE from replication_applier_status_by_worker;
+-----------+-----------+---------------+---------------------------------------------+--------------------+
| WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION                       | LAST_ERROR_MESSAGE |
+-----------+-----------+---------------+---------------------------------------------+--------------------+
|         1 |        40 | ON            | 00013253-1111-1111-1111-111111111111:133621 |                    |
|         2 |        41 | ON            | 00013253-1111-1111-1111-111111111111:133699 |                    |
|         3 |        42 | ON            | 00013253-1111-1111-1111-111111111111:133826 |                    |
|         4 |        43 | ON            | 00013253-1111-1111-1111-111111111111:133919 |                    |
|         5 |        44 | ON            | 00013253-1111-1111-1111-111111111111:133686 |                    |
|         6 |        45 | ON            | 00013253-1111-1111-1111-111111111111:133688 |                    |
|         7 |        46 | ON            | 00013253-1111-1111-1111-111111111111:133770 |                    |
|         8 |        47 | ON            | 00013253-1111-1111-1111-111111111111:133494 |                    |
|         9 |        48 | ON            | 00013253-1111-1111-1111-111111111111:133731 |                    |
|        10 |        49 | ON            | 00013253-1111-1111-1111-111111111111:133763 |                    |
+-----------+-----------+---------------+---------------------------------------------+--------------------+
10 rows in set (0.00 sec)

There is a timestamp in this table, but only for errors. The regular operations don't get one. Here too we miss info about the database, the time of extraction and apply, and we don't know what exactly the worker is doing: the column LAST_SEEN_TRANSACTION is about received GTIDs, not executed ones.

Finally, we have a look at the output of SHOW SLAVE STATUS. In the past episodes, this is where we had the most comprehensive set of information. But with parallel replication we are going to have a surprise.

slave1 [localhost] {msandbox} (performance_schema) > SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 13253
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 38675840
               Relay_Log_File: mysql-relay.000003
                Relay_Log_Pos: 35124216
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 35124003
              Relay_Log_Space: 38676469
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 30
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 00013253-1111-1111-1111-111111111111
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 00013253-1111-1111-1111-111111111111:1-150420
            Executed_Gtid_Set: 00013253-1111-1111-1111-111111111111:1-136616:136618-136624:
136626-136632:136634-136635:136637-136640:136642-136644:136646-136648:136650-136670:
136672-136681:136683-136687:136689-136703:136706-136716:136718-136720:136722-136723:
136726-136727:136730:136734-136735:136737:136739-136741:136743-136744:136746-136747:
136749:136752-136753:136755:136757:136762-136763:136765-136766:136768-136771:
136773-136774:136777:136783:136785:136787-136792:136794-136795:136797:136801:136806:
136809:136812-136814:136817:136821:136824:136826:136833:136835-136837:136841-136843:
136846:136852:136862-136864:136871:136874:136878:136881:136884:136897:136901:136908:
136913:136915:136926:136928:136940:136948:136951:136968:136986:136989:136992:136995
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
1 row in set (0.00 sec)

The output of Executed_Gtid_Set is actually one line. I split it to make it fit on the page. But even after it is split, I have trouble understanding what it is showing, and how this monster piece of information can be useful. Want to see something more horrible than this? Try parallel replication with multi-source.
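Despite those limits, the worker table above is still the quickest place to spot a failed applier. A minimal check along these lines, using only the columns shown above, avoids wading through SHOW SLAVE STATUS:

    # List only workers that have recorded an error.
    $ mysql -e "SELECT WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_TIMESTAMP, LAST_ERROR_MESSAGE
                FROM performance_schema.replication_applier_status_by_worker
                WHERE LAST_ERROR_NUMBER <> 0"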
Monitoring with MariaDB 10

In MariaDB there is only one table where we can follow progress. Unlike the one in MySQL 5.6/5.7, there is no fixed number of rows. This is the same table where operations with GTID are stored with single-threaded replication. The rows are added or pruned according to the traffic in the various channels.

slave1 [localhost] {msandbox} (mysql) > select * from gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |  28569 |         1 |  28569 |
|         0 |  28570 |         1 |  28570 |
+-----------+--------+-----------+--------+
2 rows in set (0.00 sec)

slave1 [localhost] {msandbox} (mysql) > select * from gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |  42786 |         1 |  42786 |
|         0 |  42787 |         1 |  42787 |
|         0 |  42788 |         1 |  42788 |
|         0 |  42789 |         1 |  42789 |
+-----------+--------+-----------+--------+
4 rows in set (0.00 sec)

slave1 [localhost] {msandbox} (mysql) > select * from gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |  46807 |         1 |  46807 |
|         0 |  46808 |         1 |  46808 |
+-----------+--------+-----------+--------+
2 rows in set (0.00 sec)

The information here is slim. As we have noticed in the previous articles, we get little or no visibility into the internal operations. Here we have ten appliers working concurrently, but there is little evidence that this is happening. Much like 'show processlist', the gtid table shows only a glimpse of a brief moment. Sometimes we find 10 rows showing activity, sometimes just two. It does not mean that the server is idle; it's just that we don't catch its activity. It's the same as when we try to monitor using the process list: with a busy server, catching a view of many concurrent queries is a matter of luck.

Let's have a look at SHOW SLAVE STATUS. The good news is that we don't have the horrible crowd seen in MySQL 5.7. The bad news is that we have just as much information as we would get with a single thread. And let's not forget that the only info about GTIDs in the slave status is related to the IO thread. For the SQL threads (in this case, we have 10) we need to look at the above table or the GTID variables.

slave1 [localhost] {msandbox} ((none)) > SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 25030
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 958300
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 956223
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 958300
              Relay_Log_Space: 956516
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 0-1-5826
1 row in set (0.00 sec)

Small improvements

About two years ago I wrote a post comparing parallel replication in MySQL 5.6 and Tungsten Replicator. Most of what was said there is still valid, but I want to acknowledge a small improvement. When an error happened in parallel replication, the message shown in my article said:

Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into t1 values (1)'

In this context, when working with 10 threads, finding where the problem happens may be tough.
But now the same situation results in this more helpful error message:

Last_SQL_Error: Worker 8 failed executing transaction '00021891-1111-1111-1111-111111111111:90860' at master log mysql-bin.000002, end_log_pos 23636368; Error 'Duplicate entry '340' for key 'PRIMARY'' on query. Default database: 'db1'. Query: 'insert into t7 values (340, null)'

Here we get the thread ID (Worker 8), the GTID (yay!) and, next to it, the binary log name and position that was being processed. This proves that the worker thread knows everything that is useful. Now the next step will be adding the same information to the appropriate monitoring table.

Summing up

Of all the new features, parallel replication is probably the most needed, and the one that falls short in terms of usability. MySQL 5.7 offers what looks at first sight like a large amount of monitoring data, but in the end it is not enough to run operations confidently. In MariaDB, the lack of monitoring data is so acute that I would be very reluctant to use or recommend it for production.

It's a start. I hope that both teams will pause a bit in their frenzy to add more features, and will strengthen the existing ones instead.

What's next: Percona Live in Amsterdam!

This is the last episode in the series. There is much more to say about replication features, old and new, but the focus of these articles was on monitoring capabilities, and we've seen enough. I am speaking at Percona Live Amsterdam 2015 on September 22, and I will cover all these topics with examples. I would like to remind everyone that there is a public project on GitHub with mysql-replication-samples. I am sure there could be plenty of other scripts that could be added. Participation is welcome!

Unexpected Memory Consumption for Bulk Index Creation in InnoDB (MySQL)

In my last Booking.com Hackathon, I worked on MyISAM vs InnoDB for data loading (LOAD DATA INFILE) and bulk index creation.  My motivation was the following: knowing that some are still using MyISAM for this particular use case, I wanted to verify/understand if/why InnoDB is slower than MyISAM.  I do not yet have complete results on this specific subject but I found some interesting things that

Start Your Certification Journey with Oracle

  • 90% say that Oracle Certification gives employees credibility when looking for a job.
  • 70% improve their earning power.
  • 61% of hiring managers place value on IT certification.
  • Members with certification on their LinkedIn profile get six times more views – leading to more opportunities.

In a competitive market, stand out from the rest. Don't just meet the job requirements: prove your knowledge and position yourself for success.

Access the Oracle Certification Infographic Library under Certification Resources on the Oracle Certification Home Page.

First day with InnoDB transparent page compression

I ran linkbench overnight for a database that started at 100G using MySQL 5.7.8 and InnoDB transparent page compression. After ~24 hours I have 1 mysqld crash with nothing in the error log. I don't know if that is related to bug 77738. I will attach gdb and hope for another crash. For more about transparent page compression read here, here and here. For concerns about the feature see the post by Domas. I previously wrote about this feature. On the bright side, this is a great opportunity for MyRocks, the RocksDB storage engine for MySQL.

Then I ran 'dmesg -e' and got 81 complaints from XFS on the host that uses transparent compression. The warnings are from the time when the benchmark ran. My other test host isn't using hole-punch and doesn't get these warnings. I get the same error message below from a CentOS 6.6 host using a 3.10.53 kernel.

[Aug27 05:53] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999375] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999387] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.983386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999379] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

The host has Fedora 19 and the kernel is 3.14.27-100.fc19.x86_64. I don't know if Fedora 19 is officially supported. I know that hole punch is available because this is in the error log:

    [Note] InnoDB: PUNCH HOLE support available

And this was used in a create table statement:

    ENGINE=InnoDB COMPRESSION="zlib" DEFAULT CHARSET=latin1;

From my host without hole-punch, where the files for the table are ~44G:

    $ xfs_bmap /ssd1/scratch/data.mdcallag/data/linkdb/linktable*.ibd | wc -l
    11446

And the host with it, where the files for the table are ~104G according to ls but are much smaller because of the holes:

    $ xfs_bmap /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd | wc -l
    11865620
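Since ls reports the apparent size while hole punching only shrinks the allocated size, comparing ls with du shows the real saving; a quick check on the same path:

    # Apparent size vs. blocks actually allocated on disk; the gap is the
    # space reclaimed by hole punching.
    $ ls -lh /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd
    $ du -sh /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd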
I don't like the vmstat output either. On the host that uses transparent page compression, swap is being used, and that started during the linkbench load. It is not being used on the other host. Doesn't look right.

 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
13  6 4095996 8410200   1164 6298324    0    0    65   534    0    0  8  1 89  2

But wait, there is more

This is from a host with CentOS 6.6 and the 3.10.53 kernel. After running for a few hours with transparent page compression, mysqld got stuck. I looked in the database error log and there was the standard output from threads getting long waits on InnoDB synchronization objects. I attached gdb to look at thread stacks, and that took ~15 minutes just to attach. Fortunately I got stacks before needing to reboot this host. All threads appear to be blocked on calls into the kernel. This gist shows two of the threads: one is stuck doing aio and another is trying to do fallocate64 called from os_file_punch_hole.

Deadlocked mysqld + reboot server == victory? I filed bug 78277 for this. After the reboot, dropping a ~72G ibd file that had been using hole-punch took ~3 minutes. Then I created an 80G file and dropping that took 10 seconds. It is not a good idea to have database files that take minutes to drop, given that InnoDB and filesystem state can get out of sync during a crash, requiring manual repair.

And more

I have two CentOS 6.6 servers with the 3.10.53 kernel and XFS. The host in the previous section doesn't use partitioning. The host in this section uses partitioning and transparent page compression. They both get InnoDB deadlocks, but this host was healthy enough that the InnoDB background thread was able to kill the server. This was triggered by a DROP DATABASE statement. I don't have thread stacks but will guess that a file unlink operation took too long or got stuck. Because DDL isn't transactional, the data dictionary isn't in a good state at this point. Some of the per-partition ibd files are missing.

Reboot #2 has arrived. I did kill -9 on mysqld and the process is still there in state Z after 40 minutes. I deleted the database directory but space is still in use according to df. I assume that space won't be released until the zombie process goes away. I give up on trying to use transparent compression on this host.
