utf8

Fun with Bugs #46 – On Some Bugs I’ve Reported During the Year of 2016

It’s time to summarize the year of 2016. As a kind of a weird summary, in this post I’d like to share a list of MySQL bug reports I’ve created in 2016 that are still remaining “Verified” today:Bug #79831 – “Unexpected error message on crash-safe slave with max_relay_log_size set”. According to Umesh this is not repeatable with 5.7. The fact that I’ve reported the bug on January 4 probably means I was working at that time. I should not repeat this mistake again next year.Bug #80067 – “Index on BIT column is NOT used when column name only is used in WHERE clause”. People say the same problem happens with INT and, what may be even less expected, BOOLEAN columns.Bug #80424 – “EXPLAIN output depends on binlog_format setting”. Who could expect that?Bug #80619 – “Allow slave to filter replication events based on GTID”. In this feature request I’ve suggested to implement filtering by GTID pattern, so that we can skip all events originating from specific master on some slave in a complex replication chain.Bug #82127 – “Deadlock with 3 concurrent DELETEs by UNIQUE key”. It’s clear that manual is not even close to explaining how the locks are really set “by design” in this weird case. See comments in MDEV-10962 for some explanations. Nobody from Oracle event tried to really explain how things are designed to work.Bug #82212 – “mysqlbinlog can produce events larger than max_allowed_packet for mysql”. This happens for encoded row-based events. There should be some way to take this overhead into account while creating binary log, IMHO.Bug #83024 – “Internals manual does not explain COM_SLEEP in details”. One day you’ll see Sleep for some 17 seconds logged into the slow query log, and may start to wonder why…Bug #83248 – “Partition pruning is not working with LEFT JOIN”. You may find some interesting related ideas in MDEV-10946.Bug #83640 – “Locks set by DELETE statement on already deleted record”. This case shows that design of locking in InnoDB does produce really weird outcomes sometimes. This is not about “missing manual”, this is about extra lock set that is absolutely NOT needed (a gap X lock on a record in the secondary unique index is set when the same transaction transaction already has the next key lock on it). As a side note, I keep finding, explaining and reporting weird or undocumented details in InnoDB locking for years, still my talk about InnoDB locks was not accepted by Oracle once again for OOW in 2016. What do I know about the subject and who even cares about those locks… Bug #83708 – “uint expression is used for the value that is passed as my_off_t for DDL log”. I was really shocked by this finding. I assumed that all uint vs unsigned long long improper casts are already found. It seems I was mistaking.Bug #83912 – “Time spent sleeping before entering InnoDB is not measured/reported separately”. The use case that led me to reporting this bug is way more interesting than the fact that some wait is not instrumented in performance_schema. You may see more related bug reports from me next year.Bug #83950 – “LOAD DATA INFILE fails with an escape character followed by a multi-byte one”. This single bug (and related bugs and stories) were original topic for issue #46 of my “Fun With Bugs” series. I was not able to write everything I want properly over last 3 weeks, but trust me: it’s a great story, of “Let’s Make America Great Again” style. With the goal for LOAD DATA to behave exactly as INSERT when wrong utf8 data are inserted, Oracle changed the way LOAD DATA works back and forth, with the last change (back) happened in 5.7.17: “Incompatible Change: A change made in MySQL 5.7.8 for handling of multibyte character sets by LOAD DATA was reverted due to the replication incompatibility (Bug #24487120, Bug #82641)”I just can not keep up with all the related fun people have in replication environments thanks to these ongoing changes… It’s incredible.Bug #84004 – “Manual misses details on MDL locks set and released for online ALTER TABLE”. Nothing new: locks in MySQL are not properly/completely documented, metadata locks included. yes, they are documented better now, after 11+ years of my continuous efforts (of a kind), but we are “not there yet”. I am still waiting for a job offer to join MySQL Documentation Team, by the way :)Bug #84173 – “mysqld_safe –no-defaults & silently does NOT work any more”. Recent MySQL 5.7.17 release had not only given us new Group Replication plugin and introduced incompatible changes. In a hope to fix security issues it comes with pure regression – for the first time in last 11 years mysqld_safe –no-defaults stopped working for me! By the way, mysqld_safe is still NOT safe in a sense that 5.7.17 tried to enforce, and one day (really soon) you will find out why.Bug #84185 – “Not all “Statements writing to a table with an auto-increment…” are unsafe”. If you do something like DELETE FROM `table` WHERE some_col IN (SELECT some_id FROM `other_table`) where `table` has auto_increment column, why should anyone care about it? We do not generate the value, we delete rows… This bug report was actually created by Hartmut Holzgraefe and test case comes from Elena Stepanova (see MDEV-10170). I want to take this opportunity to thank them and other colleagues from MariaDB for their hard work and cooperation during the year of 2016. Thanks to Umesh (who processed most of my bug reports),  Sinisa Milivojevic and Miguel Solorzano for their verifications of my bug reports this year.In conclusion I should say that, no matter how pointless you may consider this activity, I still suggest you to report each and every problem that you have with MySQL and can not understand after reading the manual, as a public MySQL bug. Now, re-read my 4 years old post on this topic and have a Happy and Fruitful New Year 2017!

Importing the Unicode Character Database in MySQL

In Python it is easily possible to findout the name of a Unicode character and findout some properties about that character. The module which does that is called unicodedata.An example:>>> import unicodedata>>> unicodedata.name(‘☺’)’WHITE SMILING FACE’This module uses the data as released in the UnicodeData.txt file from the unicode.org website.So if UnicodeData.txt is a ‘database’, then we should be able to import it into MySQL and use it!I wrote a small Python script to automate this. The basic steps are:Download UnicodeData.txtCreate a unicodedata.ucd tableUse LOAD DATA LOCAL INFILE to load the dataThis isn’t difficult especially because the file doesn’t have the actual characters in it. It is just an ASCII file with codepoints and the data related to it.But it would be useful to have those characters in the database. So what I did is to create those with CONVERT(UNHEX(value) USING utf32.What that does is this:UNHEX: get the number of the codepointCONVERT that number from utf32. In utf32 all characters are 4-byte/32-bit and are mapped 1-on-1 to their codepoint. This is what ucs2 does with 2-byte/16-bit, but then we can’t use the full range of characters. That is why UTF-16 replaced ucs2. UTF-16 uses a variable length, just like UTF-8.So now I can do this:mysql> USE unicodedataReading table information for completion of table and column namesYou can turn off this feature to get a quicker startup with -ADatabase changedmysql> SELECT name FROM ucd WHERE `char`=’☺’;+——————–+| name |+——————–+| WHITE SMILING FACE |+——————–+1 row in set (0.00 sec)Or this:To get the data (or just a dump of the database): github.com/dveeden/mysqlunicodedataI’m speaking at Percona Live Amsterdam about MySQL and Unicode. And you can also cath me the Booking.com booth (#205) or at the Community Diner.Update 1Set your encoding to utf8mb4 in your connection properties (e.g. –default-character-set=utf8mb4) or use SET NAMES utf8mb4 to switch to the utf8mb4 character set.On Linux you might want to install google-noto-color-emoji-fonts, gdouros-symbola-fonts and/or google-android-emoji-fonts to see the emoji characters. For other groups of characters you might need to install additional fonts specific to that script.Emoji fonts should be present on Windows 10 and might not be present on Windows 7 and earlier. Use the unicode enable mysql option in the start menu which is installed by MySQL Installer. This is to enable unicode for cmd.exe (Powershell or cmd.exe with the correct codepage set might also work)Thanks to Peter Laursen for testing this.

MySQL Character encoding – part 2

In MySQL Character encoding – part 1 we stated that the myriad of ways in which character encoding can be controlled can lead to many situations where your data may not be available as expected.
UTF8 was designed on a placemat in a New Jersey diner one night in September or so 1992.

Setting MySQL Client and Server Character encoding.
Lets restart MySQL with the correct setting for our purpose, UTF8. Here we can see the setting in the MySQL configuration file, in this case /etc/mysql/my.cnf.
character-set-server = utf8

This change is then reflected in the session and global variables once the instance is restarted with the new configuration parameter.
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+——————————-+——————————–+
| @@global.character_set_server | @@session.character_set_client |
+——————————-+——————————–+
| utf8 | utf8 |
+——————————-+——————————–+
1 row in set (0.00 sec)

Now we have verified the server and client are set to use UTF8, we can go ahead, continue developing our application and create a new table people.
mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL,
last_name VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.13 sec)

Now let’s enter some data into the new table, which was created with the server and client configured for UTF8.

Something appears to have gone terribly wrong, the accent in Maciek’s surname now appears as a question mark.
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+——————————–+——————————–+
| @@session.character_set_server | @@session.character_set_client |
+——————————–+——————————–+
| utf8 | utf8 |
+——————————–+——————————–+
1 row in set (0.00 sec)

The database settings are still UTF8, this should have worked.
mysql> USE fosdem;
mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
Table: people
Create Table: CREATE TABLE `people` (
`first_name` varchar(30) NOT NULL,
`last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec)

Looking at the table, we see that despite being created under a server set to use UTF8, it appears to be set to use latin1.
How can this be?, Let’s look at the session settings.
mysql> SHOW SESSION VARIABLES LIKE ‘character_set_%’;
+————————–+—————————-+
| Variable_name | Value |
+————————–+—————————-+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+
8 rows in set (0.00 sec)

We can see the server and client values are as expected, but database is set to something else.
mysql> SHOW CREATE DATABASE fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)

Since the database was created when the server was set to latin1 it inherited that charset setting, which persists even when the server setting changes.
Can we fix this?
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people;
+————+———————-+
| last_name | HEX(last_name) |
+————+———————-+
| Lemon | 4C656D6F6E |
| Müller | 4DFC6C6C6572 |
| Dobrza?ski | 446F62727A613F736B69 |
+————+———————-+
3 rows in set (0.00 sec)

mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people;
+————+———————-+
| last_name | HEX(last_name) |
+————+———————-+
| Lemon | 4C656D6F6E |
| Müller | 4DFC6C6C6572 |
| Dobrza?ski | 446F62727A613F736B69 |
+————+———————-+
3 rows in set (0.00 sec)

Unfortunately, no matter how I try to read the data, 0x3F is ‘?’, so the ‘ń’ has been lost forever. Therefore it may not be enough to reconfigure the server, as a mismatch between client and server can permanently break data, due to the implicit conversion inside the MySQL server.
Implicit conversions happen silently when characters of one character set are inserted into a column with a different character set. This behaviour can be controlled by SQL_MODE, which allows you force MySQL to raise an error instead.
In MySQL Character encoding – part 1 we established there were a number of places you can control the character settings, now we can add a couple of important observations to our view of Character encoding settings.

Session settings

character_set_server
character_set_client
character_set_connection
character_set_database
character_set_result

Schema level Defaults – Affects new tables
Table level Defaults – Affects new columns
Column charsets

We have seen how a table created with no explicit charset declaration inherits the database (schema) charset, but what happens to a column when the table charset is changed?.
mysql> USE fosdem;
mysql> CREATE TABLE test (a VARCHAR(300), INDEX (a));
Query OK, 0 rows affected (0.62 sec)
mysql> SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> ALTER TABLE test DEFAULT CHARSET = utf8;
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.00 sec)

The columns in a table inherit their default charset value when the table is created, but do not change when the table is changed, however new columns added after the ALTER TABLE would inherit UTF8.
mysql> ALTER TABLE test ADD b VARCHAR(10);
Query OK, 0 rows affected (0.74 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec

What can you do if you detect inconsistencies in your MySQL Character encoding settings
First of all, keep calm and don’t start by changing something. Analyse the situation and make sure you understand what settings you have and what your application understands regarding reading and writing data from the database.
Once you detect a problem, try to assess the extent of the damage. Firstly, what is the scope of the damage and is it consistent. Are all the rows bad or is it just a subset such as the last days worth of inserts. Are all the bad rows broken in the same way or are there actually a mixture of problems affected different sets of rows. Are the rows actually repairable – could be that recovering from backup and rolling forward is necessary as the inserted data has already been destroyed. Has any character mapping occurred during writes (e.g. unicode over latin1/latin1) – all of this is necessary to get a good picture of where you are starting from.
Take care not to do not do any of the following:

Try to fix this table by table unless you really only have a single table. – Any fix will affect the application and database as a whole, therefore fixing a single table may lead to inconsistencies and further problems elsewhere.
ALTER TABLE … DEFAULT CHARSET = as it only changes the default character set for new columns.
ALTER TABLE … CONVERT TO CHARACTER SET … It’s not for fixing broken encoding.
ALTER TABLE … MODIFY col_name … CHARACTER SET …

What needs to be fixed?

Schema default character set

ALTER SCHEMA fosdem DEFAULT CHARSET = utf8;

Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
What about ENUM?

The information schema can provide a list of candidate tables.
SELECT CONCAT(c.table_schema, ‘.’, c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = ‘fosdem’
AND c.column_type REGEXP ‘^(.*CHAR|.*TEXT|ENUM)(\(.+\))?$’ GROUP BY candidate_table;

You must also ensure the database and application configuration is correct also, to avoid having the newly fixed tables broken by new data being introduced incorrectly (for the settings) into the tables.
How do I fix this?
Option 1. Dump and restore (Requires downtime)
Dump the data preserving the bad configuration and drop the old database
# mysqldump -u root -p –skip-set-charset –default-character-set=latin1 fosdem > fosdem.sql

mysql> DROP SCHEMA fosdem;

Correct table definitions in the dump file by editing DEFAULT CHARSET in all CREATE TABLE statements, then create the database again and import the data.
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;

# mysql -u root -p –default-character-set=utf8 fosdem < fosdem.sql

Option 2. Two step conversion (Requires downtime)
Perform a two step conversion with ALTER TABLE, converting the original encoding to VARBINARY/BLOB and then from there to the target encoding. Conversion from/to BINARY/BLOB removes character set context.

Stop applications
On each table, for each text column perform:

ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;

You may specify multiple columns per ALTER TABLE

Fix the problems (application and/or db configs)
Restart applications

Option 3. – Online character set fix; (Minimal downtime, Approximately 1 min)
Using pt-online-schema-change with the PSCE plugin and a small patch for pt-online-schema-change, you can convert columns online in the live database.

Start pt-online-schema-change on all tables – one by one with table rotation disabled (–no-swap-tables) or drop pt-online-schema-change triggers
Wait until all tables have been converted
Stop applications
Fix the problems (application and/or db configs)
Rotate the tables – should take a minute or so
Restart applications

Currently the patch to pt-online-schema-change and plugin are available on bitbucket Github.
In MySQL Character encoding part 3 we will cover the gotchas in the process of fixing broken encoding, and what best practise to follow to get it right each time you setup a new server or create a new database.

MySQL Character encoding – part 1

Breaking and unbreaking your data
Recently at FOSDEM, Maciej presented “Breaking and unbreaking your data”, a presentation about the potential problems you can incur regarding character encoding whilst working with MySQL. In short, there are a myriad of places where character encoding can be controlled, which gives ample opportunity for the system to break and for text to become unrecoverable.

The slides from the presentation are available on slideshare.

Character Encoding – MySQL DevRoom – FOSDEM 2015 from mushupl
Since slides don’t tell the whole story, we decided to create a series of blog posts to demonstrate how easy it is to go wrong, how to fix some of the issues and how to avoid such issues in the future.
What is character encoding?
The encoding is the binary representation of glyphs, where each character can be represented by 1 or more bytes. Popular schemes include ASCII and Unicode, and can include language specific character sets such as Latin US, Latin1, Latin2 which are commonly used in America and Europe and EUC-KR or GB18030 which support language characters with an Asian origin. Each character can be associated by several different codes, and one code may correspond to several different characters, depending on the encoding scheme used.
Where do you set character sets in MySQL?
Here is the core of the problem, the character encoding can be controlled from the application, database or even on a per table or column basis. Together with a set of rules regarding inheritance, it is easy to have one layer of the system configured for one character set whilst the actual data being introduced is using a different character set.
In MySQL the following area, the following settings can all affect the character encoding used.

Session settings

character_set_server
character_set_client
character_set_connection
character_set_database
character_set_result

Schema level defaults
Table level defaults
Column charsets

Character encoding in MySQL.
As Maciej pointed out in the presentation, where MySQL is concerned we are all born Swedish, as MySQL starts configured for the Latin1 character set and collation set to latin1_swedish_ci. This is even the case in MySQL 5.7, meaning by default your system expects only characters in the latin1 set and will when comparing characters it will assume the Swedish language is being used.
Lets look at how this manifests itself in a new application, where server, client and table are set to the default latin1.

mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+——————————-+——————————–+
| @@global.character_set_server | @@session.character_set_client |
+——————————-+——————————–+
| latin1 | latin1 |
+——————————-+——————————–+
1 row in set (0.00 sec)
mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)
mysql> USE fosdem;
mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec);

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
Table: locations
Create Table: CREATE TABLE `locations` (
`city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec)
So what happens when you try to save some data that is not latin1 encoded.

 
The city of Tokyo is displayed.

The application returned and rendered the new city correctly, however inside the database there is some confusion.
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from locations;
+——————–+
| city |
+——————–+
| Berlin |
| KrakÃ3w |
| 東äo¬éƒ1⁄2 |
+——————–+
3 rows in set (0.00 sec)

The data being saved was UTF8 encoded, however if an application attempts to query the database as UTF8 it receives garbage. Instead the application must ask for Latin1 to receive the original data.
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select
+———–+
| city |
+———–+
| Berlin |
| Kraków |
| 東京都 |
+———–+
3 rows in set
* from locations;
(0.00 sec)

The new city was saved and from the application the result looked correct, however what is happening here is that the connection to the database has saved the binary data without any manipulation. Hence it returned the same data, and the browser was able to do the right thing and display it correctly, as did the terminal which was set to UTF8. Inside the database though, it is not able to understand the data in the correct context.
In the next blog post we will look at how to correctly configure character sets, as well as demonstrating some of the problems we have encountered in production systems and how we fixed those.

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569