Pattern Matching Queries vs. Full-Text Indexes

Pattern Matching Queries vs. Full-Text Indexes

Pattern Matching Queries vs. Full-Text IndexesIn this blog post, we’ll compare the performance of pattern matching queries vs. full-text indexes.

In my previous blog post, I looked for a solution on how we can search only a part of the email address and how can we make faster queries where the condition is email LIKE '%n.pierre%'. I showed two possible ways that could work. Of course, they had some pros and cons as well but were more efficient and faster than a like '%n.prierre%'.

But you could also ask why I would bother with this? Let’s add a FULLTEXT index, and everybody is happy! Are you sure about that? I’m not. Let’s investigate and test a bit. (We have some nice blog posts that explain how FULLTEXT indexes work: Post 1, Post 2, Post 3.)

Let’s see if it works in our case where we were looking for email addresses. Here is the table:

CREATE TABLE `email` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=318465 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Add the default full-text index:

ALTER TABLE email ADD FULLTEXT KEY (email);

It took only five seconds for 320K email addresses.

Let’s run a search:

SELECT id, email FROM email where MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE);
+--------+--------------------------------+
| id     | email                          |
+--------+--------------------------------+
|   2940 | pierre.west@example.org        |
|  10775 | pierre.beier@example.org       |
|  24267 | schroeder.pierre@example.org   |
|  26285 | bode.pierre@example.org        |
|  27104 | pierre.franecki@example.org    |
|  31792 | pierre.jaskolski@example.com   |
|  39369 | kuphal.pierre@example.org      |
|  58625 | olson.pierre@example.org       |
|  59526 | larkin.pierre@example.net      |
|  64718 | boyle.pierre@example.com       |
|  72033 | pierre.wolf@example.net        |
|  90587 | anderson.pierre@example.org    |
| 108806 | fadel.pierre@example.org       |
| 113897 | jacobs.pierre@example.com      |
| 118579 | hudson.pierre@example.com      |
| 118798 | pierre.wuckert@example.org     |
| 118937 | green.pierre@example.net       |
| 125451 | hauck.pierre@example.net       |
| 133352 | friesen.pierre@example.net     |
| 134594 | windler.pierre@example.com     |
| 135406 | dietrich.pierre@example.org    |
| 190451 | daugherty.pierre@example.org   |
...

Immediately, we have issues with the results. It returns 43 rows, but there are only 11 rows with string n.pierre. Why? It is because of . The manual says:

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example,   (space), , (comma), and . (period).

The parser believes that a . starts a new word, so it is going to search for pierre instead of n.pierre. That’s not good news as many email addresses contain ..  What can we do? The manual says:

It is possible to write a plugin that replaces the built-in full-text parser. For details, see Section 28.2, “The MySQL Plugin API”. For example parser plugin source code, see the plugin/fulltext directory of a MySQL source distribution.

If you are willing to write your own plugin in C/C++, you can try that route. Until then, it is going to give us back a lot of irrelevant matches.

We can order the results by relevancy:

SELECT id,email,MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE)
 AS score FROM email where MATCH(email) AGAINST
('n.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc limit 10;
+-------+------------------------------+-------------------+
| id    | email                        | score             |
+-------+------------------------------+-------------------+
|  2940 | pierre.west@example.org      | 14.96491813659668 |
| 10775 | pierre.beier@example.org     | 14.96491813659668 |
| 24267 | schroeder.pierre@example.org | 14.96491813659668 |
| 26285 | bode.pierre@example.org      | 14.96491813659668 |
| 27104 | pierre.franecki@example.org  | 14.96491813659668 |
| 31792 | pierre.jaskolski@example.com | 14.96491813659668 |
| 39369 | kuphal.pierre@example.org    | 14.96491813659668 |
| 58625 | olson.pierre@example.org     | 14.96491813659668 |
| 59526 | larkin.pierre@example.net    | 14.96491813659668 |
| 64718 | boyle.pierre@example.com     | 14.96491813659668 |
+-------+------------------------------+-------------------+

This does not guarantee we get back the lines that we are looking for, however. I tried to change innodb_ft_min_token_size as well, but it did not affect the results.

Let’s see what happens when I search for williamson pierre. Two separate words. I know there is only one email address with these names.

SELECT id,email,MATCH(email) AGAINST
('williamson.pierre' IN NATURAL LANGUAGE MODE) AS score
FROM email where MATCH(email) AGAINST
('williamson.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc limit 50;
+--------+---------------------------------+-------------------+
| id     | email                           | score             |
+--------+---------------------------------+-------------------+
| 238396 | williamson.pierre@example.net   | 24.08820343017578 |
|   2940 | pierre.west@example.org         | 14.96491813659668 |
|  10775 | pierre.beier@example.org        | 14.96491813659668 |
|  24267 | schroeder.pierre@example.org    | 14.96491813659668 |
|  26285 | bode.pierre@example.org         | 14.96491813659668 |
|  27104 | pierre.franecki@example.org     | 14.96491813659668 |
|  31792 | pierre.jaskolski@example.com    | 14.96491813659668 |
|  39369 | kuphal.pierre@example.org       | 14.96491813659668 |
|  58625 | olson.pierre@example.org        | 14.96491813659668 |
...

The first result is that we still got another 49 addresses. How can the application decide which email address is relevant and which is not? I am still not happy.

Are there any other options without writing our own plugin?

Can I somehow tell the parser to use n.pierre as one word? The manual says:

A phrase that is enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".

I can use double quotes, but it will still split at . and the results are the same. I did not find a solution except writing your own plugin. If someone knows a solution, please write a comment.

With Parser Ngram

The built-in MySQL full-text parser uses delimiters between words, but we can create an Ngram-based full-text index.

mysql> alter table  email ADD FULLTEXT KEY (email) WITH PARSER ngram;
Query OK, 0 rows affected (20.10 sec)
Records: 0  Duplicates: 0  Warnings: 0

Before that, I changed the ngram_token_size to 3.

mysql> SELECT id,email,MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE) AS score FROM email where MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc;
+--------+----------------------------------+--------------------+
| id     | email                            | score              |
+--------+----------------------------------+--------------------+
|  58625 | olson.pierre@example.org         |  16.56794548034668 |
|  59526 | larkin.pierre@example.net        |  16.56794548034668 |
|  90587 | anderson.pierre@example.org      |  16.56794548034668 |
| 118579 | hudson.pierre@example.com        |  16.56794548034668 |
| 118937 | green.pierre@example.net         |  16.56794548034668 |
| 133352 | friesen.pierre@example.net       |  16.56794548034668 |
| 200608 | wilkinson.pierre@example.org     |  16.56794548034668 |
| 237928 | johnson.pierre@example.org       |  16.56794548034668 |
| 238396 | williamson.pierre@example.net    |  16.56794548034668 |
| 278384 | monahan.pierre@example.net       |  16.56794548034668 |
| 306718 | rohan.pierre@example.com         |  16.56794548034668 |
| 226737 | warren.pfeffer@example.net       | 12.156486511230469 |
|  74278 | stiedemann.perry@example.net     |  11.52701187133789 |
|  75234 | bogan.perry@example.org          |  11.52701187133789 |
...
4697 rows in set (0.03 sec)

Finally, we are getting somewhere. But it gives back 4697 rows. How can the application decide which results are relevant? Should we just use the score?

Subselect?

I dropped the Ngram FULLTEXT index and created a normal one because that gives me back only 43 results instead of 4697. I thought a full-text search might be good to narrow down the results from a million to a few thousand, and then we can run a select based on that. Example:

mysql> Select e2.id,e2.email from
(SELECT id,email FROM email where MATCH(email)
AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE))
as e2 where e2.email like '%n.pierre%';
+--------+-------------------------------+
| id     | email                         |
+--------+-------------------------------+
|  58625 | olson.pierre@example.org      |
|  59526 | larkin.pierre@example.net     |
|  90587 | anderson.pierre@example.org   |
| 118579 | hudson.pierre@example.com     |
| 118937 | green.pierre@example.net      |
| 133352 | friesen.pierre@example.net    |
| 200608 | wilkinson.pierre@example.org  |
| 237928 | johnson.pierre@example.org    |
| 238396 | williamson.pierre@example.net |
| 278384 | monahan.pierre@example.net    |
| 306718 | rohan.pierre@example.com      |
+--------+-------------------------------+
11 rows in set (0.00 sec)

Wow, this can work and it looks quite fast as well. BUT (there is always a but), if I run the following query (searching for ierre):

mysql> Select e2.id,e2.email from
(SELECT id,email FROM email where MATCH(email)
AGAINST ('ierre' IN NATURAL LANGUAGE MODE))
as e2 where e2.email like '%ierre%';
Empty set (0.00 sec)

It gives back nothing because the default full-text parser uses only full words! In our case, that is not very helpful. Let’s switch back to Ngram and re-run the query:

mysql> Select e2.id,e2.email from
(SELECT id,email FROM email where MATCH(email)
AGAINST ('ierre' IN NATURAL LANGUAGE MODE))
as e2 where e2.email like '%ierre%';
+--------+--------------------------------+
| id     | email                          |
+--------+--------------------------------+
|   2940 | pierre.west@example.org        |
|  10775 | pierre.beier@example.org       |
|  16958 | pierre68@example.com           |
|  24267 | schroeder.pierre@example.org   |
...
65 rows in set (0.05 sec)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000072 |
| checking permissions    | 0.000006 |
| Opening tables          | 0.000014 |
| init                    | 0.000027 |
| System lock             | 0.000007 |
| optimizing              | 0.000006 |
| statistics              | 0.000013 |
| preparing               | 0.000006 |
| FULLTEXT initialization | 0.006384 |
| executing               | 0.000012 |
| Sending data            | 0.020735 |
| end                     | 0.000014 |
| query end               | 0.000014 |
| closing tables          | 0.000013 |
| freeing items           | 0.001383 |
| cleaning up             | 0.000024 |
+-------------------------+----------+

It gives us back 65 rows, and it takes between 0.02-0.05s because the subquery results in many rows.

With my “shorting method”:

select e.email from email as e right join email_tib as t
on t.email_id=e.id where t.email_parts like "ierre%";
+--------------------------------+
| email                          |
+--------------------------------+
| anderson.pierre@example.org    |
| bode.pierre@example.org        |
| bode.pierre@example.org        |
| boyle.pierre@example.com       |
| bradtke.pierre@example.org     |
| bradtke.pierre@example.org     |
...
65 rows in set (0.00 sec)
mysql> show profile;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000069 |
| checking permissions | 0.000011 |
| checking permissions | 0.000003 |
| Opening tables       | 0.000020 |
| init                 | 0.000021 |
| System lock          | 0.000008 |
| optimizing           | 0.000009 |
| statistics           | 0.000070 |
| preparing            | 0.000011 |
| executing            | 0.000001 |
| Sending data         | 0.000330 |
| end                  | 0.000002 |
| query end            | 0.000007 |
| closing tables       | 0.000005 |
| freeing items        | 0.000014 |
| cleaning up          | 0.000010 |
+----------------------+----------+

It reads and gives back exactly 65 rows and it takes 0.000s.

Conclusion

When it comes to pattern matching queries vs. full-text indexes, it looks like full-text index can be helpful, and it is built in. Unfortunately, we do not have many metrics regarding full-text indexes. We cannot see how many rows were read, etc. I don’t want to make any conclusions on which one is faster. I still have to run some tests with our favorite benchmark tool sysbench on a much bigger dataset.

I should mention that full-text indexes and my previous solutions won’t solve all the problems. In this and my other blog I was trying to find an answer to a specific problem, but there are cases where my solutions would not work that well.

The post Pattern Matching Queries vs. Full-Text Indexes appeared first on Percona Database Performance Blog.

关注dbDao.com的新浪微博

扫码加入微信Oracle小密圈,了解Oracle最新技术下载分享资源

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569