This post describes how to recover Galera Cache (or gcache) on restart.
Recently Codership introduced (with Galera 3.19) a very important and long awaited feature. Now users can recover Galera cache on restart.
If you gracefully shutdown cluster nodes one after another, with some lag time between nodes, then the last node to shutdown holds the latest data. Next time you restart the cluster, the last node shutdown will be the first one to boot. Any followup nodes that join the cluster after the first node will demand an SST.
Why SST, when these nodes already have data and only few write-sets are missing? The DONOR node caches missing write-sets in Galera cache, but on restart this cache is wiped clean and restarted fresh. So the DONOR node doesn’t have a Galera cache to donate missing write-sets.
This painful set up made it necessary for users to think and plan before gracefully taking down the cluster. With the introduction of this new feature, the user can retain the Galera cache.
How does this help ?
On restart, the node will revive the galera-cache. This means the node can act as a DONOR and service missing write-sets (facilitating IST, instead of using SST). This option to retain the galera-cache is controlled by an option named gcache.recover=yes/no. The default is NO (Galera cache is not retained). The user can set this option for all nodes, or selective nodes, based on disk usage.
gcache.recover in action
The example below demonstrates how to use this option:
Let’s say the user has a three node cluster (n1, n2, n3), with all in sync.
The user gracefully shutdown n2 and n3.
n1 is still up and running, and processes some workload, so now n1 has latest data.
n1 is eventually shutdown.
Now the user decides to restart the cluster. Obviously, the user needs to start n1 first, followed by n2/n3.
n1 boots up, forming an new cluster.
n2 boots up, joins the cluster, finds there are missing write-sets and demands IST but given that n1 doesn’t have a gcache, it falls back to SST.
n2 (JOINER node log):2016-11-18 13:11:06 3277 [Note] WSREP: State transfer required:
Group state: 839028c7-ad61-11e6-9055-fe766a1886c3:4680
Local state: 839028c7-ad61-11e6-9055-fe766a1886c3:3893n1 (DONOR node log), gcache.recover=no:2016-11-18 13:11:06 3245 [Note] WSREP: IST request: 839028c7-ad61-11e6-9055-fe766a1886c3:3893-4680|tcp://192.168.1.3:5031
2016-11-18 13:11:06 3245 [Note] WSREP: IST first seqno 3894 not found from cache, falling back to SSTNow let’s re-execute this scenario with gcache.recover=yes.
n2 (JOINER node log):2016-11-18 13:24:38 4603 [Note] WSREP: State transfer required:
Group state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495
Local state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769
2016-11-18 13:24:41 4603 [Note] WSREP: Receiving IST: 726 writesets, seqnos 769-1495
2016-11-18 13:24:49 4603 [Note] WSREP: IST received: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495n1 (DONOR node log):2016-11-18 13:24:38 4573 [Note] WSREP: IST request: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769-1495|tcp://192.168.1.3:5031
2016-11-18 13:24:38 4573 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.You can also validate this by checking the lowest write-set available in gcache on the DONOR node.mysql> show status like ‘wsrep_local_cached_downto’;
| Variable_name | Value |
| wsrep_local_cached_downto | 1 |
1 row in set (0.00 sec)So as you can see, gcache.recover could restore the cache on restart and help service IST over SST. This is a major resource saver for most of those graceful shutdowns.
gcache revive doesn’t work if . . .
If gcache pages are involved. Gcache pages are still removed on shutdown, and the gcache write-set until that point also gets cleared.
Again let’s see and example:
Let’s assume the same configuration and workflow as mentioned above. We will just change the workload pattern.
n1, n2, n3 are in sync and an average-size workload is executed, such that the write-set fits in the gcache. (seqno=1-x)
n2 and n3 are shutdown.
n1 continues to operate and executes some average size workload followed by a huge transaction that results in the creation of a gcache page. (1-x-a-b-c-h) [h represent transaction seqno]
Now n1 is shutdown. During shutdown, gcache pages are purged (irrespective of the keep_page_sizes setting).
The purge ensures that all the write-sets that has seqno smaller than gcache-page-residing write-set are purged, too. This effectively means (1-h) everything is removed, including (a,b,c).
On restart, even though n1 can revive the gcache it can’t revive anything, as all the write-sets are purged.
When n2 boots up, it requests IST, but n1 can’t service the missing write-set (a,b,c,h). This causes SST to take place.
Summing it up
Needless to say, gcache.recover is a much needed feature, given it saves SST pain. (Thanks Codership.) It would be good to see if the feature can be optimized to work with gcache pages.
And yes, Percona XtraDB Cluster inherits this feature in its upcoming release.
This post describes how to recover Galera Cache (or gcache) on restart.
In this blog post, I’ll discuss how to use the InnoDB buffer pool pre-load feature in MySQL 5.7
Starting MySQL 5.6, you can configure MySQL to save the contents of your InnoDB buffer pool and load it on startup. Starting in MySQL 5.7, this is the default behavior. Without any special effort, MySQL saves and restores a portion of buffer pool in the default configuration. We made a similar feature available in Percona Server 5.5 – so the concept has been around for quite a while.
Frankly, time has reduced the need for this feature. Five years ago, we would typically store databases on spinning disks. These disks often took quite a long time to warm up with normal database workloads, which could lead to many hours of poor performance after a restart. With the rise of SSDs, warm up happens faster and reduces the penalty from not having data in the buffer pool. Typically, a system reaches 90% of its fully warmed up performance in 10 minutes or less. But since it takes virtually no effort to use, saving the contents of the InnoDB buffer pool is a great feature to enable by default.
This blog post looks into some issues with this feature that might not be totally obvious from its name or documentation.
By default, MySQL only saves 25% of the most actively accessed pages (by the LRU) in the InnoDB buffer pool (not the whole buffer pool).
This is a reasonable choice for many use cases: it saves the most valuable pages, which can then be loaded faster than if you try to load every page in the buffer pool (many of which might not be relevant for continuing workload).
You can change this number by setting the innodb_buffer_pool_dump_pct variable. If you’re using InnoDB essentially as an in-memory database, and want to ensure all data is memory resident and can be accessed without a disk read, set it to 100.
Note that this variable is based on the actual amount of data present in memory, not the buffer pool size, For example, if you have a 100GB buffer pool but it only contains 10GB of data, by default only 25% of 10GB (2.5GB) gets saved. (As the manual explains, it will not take nearly as much on disk as only the page identifiers are stored, not full page contents.)
MySQL starts and becomes accessible through the network before the buffer pool load on startup is complete. Immediately before the start, a lot of resources fetch buffer pool contents from the disk as quickly as possible, possibly affecting performance. If you have multiple MySQL nodes – like using MySQL Replication or running Percona XtraDB Cluster – you might consider bringing them back for production traffic only after the buffer pool load operation completes. You can monitor the buffer pool load progress by watching the GLOBAL STATUS variable:
Buffer pool load is in progress:| Innodb_buffer_pool_load_status | Loaded 403457/419487 pages |Buffer pool load is complete:| Innodb_buffer_pool_load_status | Buffer pool(s) load completed at 161123 9:18:57 |As a side note, it would be great if MySQL would provide a clearer concept of the “State” of the node: being UP versus being READY to serve the traffic in an optimal way are often not the same.
InnoDB’s buffer pool preload is not very efficient, at least with fast storage. In my test environment, with a rather capable NVMe storage, I get a more than 400MB/sec warmup rate if I run read-only sysbench workload. The InnoDB’s buffer pool preload warmup rate is around 100MB/sec or so. I would guess the problem is that it doesn’t drive as many parallel IO requests as SSD storage needs to perform optimally. I did not investigate it further.
Innodb buffer pool save/restore only stores the buffer pool contents on a clear shutdown. If the server crashes MySQL still does a buffer pool preload, but with the content information saved on last clean shutdown (stored in the ib_buffer_pool file). This might end up wasting time loading data that is not relevant for the current workload. Periodically running the following ensures a fresh set of pages is available for a quick warmup, even if MySQL crashed:SET GLOBAL innodb_buffer_pool_dump_now=ON;This preserves the current list of buffer pool pages.
Note that while you (hopefully) do not see your MySQL crash that often, the same issue exists with backups, MySQL slave cloning with Percona XtraBackup, or LVM snapshot. This causes these operations to be less efficient.
I hope the observations in this blog help you put this feature to better use!
Recently I wanted to create a policy managed database in my 18.104.22.168 flex cluster setup, but free servers are not available. Hence I had to unregister an administrator managed RAC database (amdbh12) to make the servers available. Having completed my testing, I dropped the policy managed database and then wanted to drop the database amdbh12 using DBCA. It was then… Continue Reading →
在继『Oracle sharding database的一些概念』和『sharding database的一些概念的补充』之后，我觉得还是有些概念需要谈一下。 1.shard prune(分片裁剪): 这个概念类似分区裁剪（partition prune），是指根据sql语句会到对应的分片上去。 但 […]
1. bitmap索引 [crayon-5871ab4e0281d337995396/] 2. 主外键关系，主键表插入数据不提交,外键表插入数据被阻塞 [crayon-5871ab4e02835797782934/] 3.插入主键同一值 [crayon-5871ab4e0283d373627861/ […]
USA Webinar-Galera Cluster® Best Practices for DBAs and DevOps Part 2: Taking Full Advantage of Multi-Master
Description This webinar will be the second in our series on best practices to follow when using Galera Cluster.In this part, we will discuss important topics related to multi-master setups:Practical considerations when using Galera in a multi-master setupEvaluating the characteristics of your database workloadPreparing your application for multi-masterDetecting and dealing with transaction conflictsThe webinar will conclude with a Q&A session where you can ask any questions you may have about Galera Cluster.Time: 9-10 AM PST (Pacific time zone), 13th of DecemberSpeakers: Philip Stoev, Quality and Release Manager, Codership Sakari Keskitalo, COO, CodershipREGISTER TO USA TIME ZONE WEBINAR
As a developer at another company (not Oracle), I once worked the support phone lines as the philosophy of the company was that developers should be familiar with real customers and their applications. I recall one call on a particular morning: a customer was crying because she had upgraded her database. The upgrade had destroyed her data. And, she had no backup. Her research data had been created for her PhD thesis. Sadly, she could not complete her degree since her work had been destroyed.
This true story has always reminded me of how deadly an upgrade can be.
Fortunately, the Oracle MySQL Cloud Service (MySQLCS) software has made the upgrade process more robust. With the MySQLCS software, you can test drive your upgrade process with your application. By spinning up a MySQLCS cloud instance you can:
Pre-check the new MySQL version to make sure system requirements such as disk space are adequate before the upgrade process begins.
Mark one less thing on your upgrade checklist. Before the patch begins, a quick MySQL Enterprise Backup automatically makes sure you always have a pre-patched version of your database in the cloud and also stored locally on the VM created by the cloud.
Down-grade to a previous MySQL version if you see that something is not to your liking with the new patch.
Here’s another consideration: when there is a new version of MySQL, the MySQLCS UI notifies you on your patch ‘panel’ that a new MySQL version has been automatically uploaded to the cloud for you. And, you don’t have to upgrade your version of MySQL. If you decide to upgrade, you can upgrade when it is convenient for you. Not when it is convenient for us.
When I talked to developers and architects attending the Cloud Expo in Santa Clara a few weeks ago about patching in MySQLCS, they really liked the idea of being able to have so much control over the patching process. Moving your Dev/Test use case to the cloud makes so much sense in these days of agile development where you get not only a refined patching process with MySQLCS but also the ability to test drive your new applications with just-released versions of MySQL.
This is only one of the many new features of MySQLCS. Note that MySQLCS is enterprise-ready with all the proven Oracle MySQL Enterprise features that include tight security mechanisms, automatic, fast backups, and a MySQL monitor that can drill down from query execution times to individual query plans.
“The statements and opinions expressed here are my own and do not necessarily represent those of the Oracle Corporation.”
-Kathy Forte, Oracle MySQL Solutions Architect
For more information about Oracle MySQLCS, visit https://cloud.oracle.com/mysql .
Percona announces the GA release of Percona XtraBackup 2.4.5 on November 29th, 2016. You can download it from our download site and from apt and yum repositories. Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered […]
Percona announces the release of Percona XtraBackup 2.3.6 on November 29, 2016. Downloads are available from our download site or Percona Software Repositories. Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open […]
本站文章除注明转载外，均为本站原创： 转载自love wife & love life —Roger […]