Fuzzy Matching in YugabyteDB
Wildcard lookups (using “like” in a query) are common ways to get partial matches on strings. However, they are well-known to not be very performant.
This test plan will show some basic query timing using a variety of alternatives available in YugabyteDB. Please note however, that timing should be used only in rough comparison and is not meant to measure what timing would look like in a large, busy production environment.
In this test, I will highlight various methods of finding the name of the artist “Jeanne Verdoux” in a large data set of names.
Here’s our target.
select name, nationality, gender, birth_year from artists where name='Jeanne Verdoux'; name | nationality | gender | birth_year ----------------+-------------+--------+------------ Jeanne Verdoux | French | Female | 1966
Let’s say I heard this name in a lecture, but I don’t know how to spell it. Therefore, I’ll need to use less straightforward methods than querying the complete string.
Set up the environment
I’m setting up this test environment on my local laptop using `yugabyted`. Again, results would look different in a production cluster topology.
Start cluster
sudo ifconfig lo0 alias 127.0.0.2 sudo ifconfig lo0 alias 127.0.0.3 yugabyted start --base_dir /tmp/data1 --advertise_address 127.0.0.1 yugabyted start --base_dir /tmp/data2 --advertise_address 127.0.0.2 --join 127.0.0.1 yugabyted start --base_dir /tmp/data3 --advertise_address 127.0.0.3 --join 127.0.0.1
Set up database
ysqlsh -U yugabyte -d yugabyte
create database stringmatch; \c stringmatch CREATE TABLE artists ( artist_id int, name text, nationality text, gender text, birth_year int, death_year int );
Set up data
Get data from the Museum of Modern Art.
copy artists from '/tmp/artists.csv' delimiter ',' csv header; select * from artists limit 10;
Wildcard filters
A wildcard filter is a typical SQL method used to find a substring. This method is usually not performant.
\timing select name from artists where name like 'Jean%'; -- Time: 80.312 ms
As expected, the result is a long list of names starting with “Jean” as shown below.
name ----------------------------- Jean-Paul Goude Jeanette Reinhardt Jeanne Bardey Jean Puy Jeanne Moutoussamy-Ashe Jean-Francois Moriceau Jean-Baptiste-Camille Corot Jean Baudrillard Jeannette Ducrocq Tanguy ... (113 rows)
For a wildcard that is not left-hand (see following examples), the timing is about the same.
select name from artists where name like '%ean%'; -- Time: 72.488 ms select name from artists where name like 'J%n'; -- Time: 79.885 ms
Creating a regular index doesn’t help for left-hand wildcards in YugabyteDB in the same way it does in other DBMSs, due to underlying storage differences. One optimization using GIN indexes is highlighted further down in this post.
There is also the assumption that I’m guessing the first four letters of the artist’s name correctly. It could, after all, be “Jon” or “John” or “Gian” or a range of other possibilities. We can use fuzzy and trigram extensions to help find a string we don’t know how to spell exactly.
Phonetic or “fuzzy” methods
The fuzzystrmatch extension is pre-packaged with YugabyteDB. Issue the following statement to enable it.
create extension if not exists fuzzystrmatch;
There are three major options with fuzzy matching: Levenshtein distance, Soundex, and Metaphone (or double Metaphone).
Levenshtein matching
The Levenshtein score between two strings is the number of transformations needed to change the first string to the second string. For example, the Levenshtein score for baz>bat is 1. For baz>batter, it is 4.
WITH q AS ( SELECT 'Jean' AS qsn ) SELECT levenshtein(lower(substring(name from '[^ ]+'::text)),lower(qsn)) AS leven, artists.name FROM artists, q WHERE levenshtein(lower(substring(name from '[^ ]+'::text)),lower(qsn)) < 3 ORDER BY leven; -- Time: 94.325 ms
As you can see below, we get a list of first names having a transformation count (Levenshtein distance) of 0, 1, or 2. First names are being split out using the substring function above. This vastly increases the number of results, but it also helps find the name if I’ve incorrectly guessed the spelling.
leven | firstname -------+----------- 0 | Jean Vylen 0 | Jean Dewasne 0 | Jean Nouvel 0 | Jean Locey 0 | Jean Carlomusto 0 | Jean Le Gac 0 | Jean Baudrillard ... 2 | Ewan Gibbs 2 | Len Waernberg 2 | John Irvin 2 | Jens H. Quistgaard 2 | John Hays Hammond ... (760 rows)
Soundex matching
The Soundex algorithm simplifies a string to a phonetic sound pattern. I’ve chosen “Jon Verdoo” as the best phonetic spelling of the target artist’s name.
select soundex('Jon Verdoo'); -- soundex -- --------- -- J516
select name, soundex(name) from artists where soundex(name)=soundex('Jon Verdoo'); -- Time: 84.997 ms
The result set demonstrates how broadly the matching is with Soundex, and this can really help if you don’t know how to spell a string. Notice that the target name does appear on the list.
name | soundex -----------------------------------------+--------- Jeanne Bardey | J516 John Hubbard/Black Star Publishing Co. | J516 Jean-Francois Moriceau | J516 John Behringer | J516 Jean-Pierre Melville | J516 Jan Preisler | J516 Jean-Bernard Menoud | J516 Jean-Pierre Vielfaure | J516 Jan Forsberg | J516 John Opper | J516 ... Jeanne Verdoux | J516 ... (50 rows)
Adding an index on the soundex reduced the runtime for this query.
create index name_soundex on artists (soundex(name)); select name, soundex(name) from artists where soundex(name)=soundex('Jon Verdoo'); -- Time: 9.644 ms
Double metaphone matching
The metaphone and improved double metaphone algorithms also simplify a string to a phonetic sound pattern. Since this method is focused on consonant sounds, I don’t need to approximate vowel sounds very closely.
select dmetaphone('Jon Verdoo'); -- dmetaphone -- ------------ -- JNFR select name from artists where dmetaphone(name) = dmetaphone('Jon Verdoo'); -- Time: 80.838 ms
The result set is again decreased and also includes the target name. Note the differences in this result set from the previous ones. We are now getting spellings of the “J” sound that start with “G.”
name ------------------------ Jean-Francois Moriceau Gianfranco Ferroni Gianfranco Masi Jan Forsberg Gianfranco Rosi Jane Freilicher John Ford Jennifer Morla John Hoover John Frankenheimer Jeanne Verdoux ... (28 rows)
An index decreases the amount of time spent on this query significantly.
create index name_dmp on artists (dmetaphone(name)); select name from artists where dmetaphone(name) = dmetaphone('Jon Verdoo'); -- Time: 2.467 ms
Trigrams
As an alternate method to fuzzy matching, we can use the trigram extension. The trigram method breaks a string into groups of letters.
Here is a further definition taken from the PostgreSQL documentation site.
pg_trgm
ignores non-word characters (non-alphanumerics) when extracting trigrams from a string. Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at”.
The pg_trgm extension is also pre-packaged with YugabyteDB. Issue the following statement to enable it.
create extension if not exists pg_trgm;
You can search for a string (e.g., name) even if you can’t remember how to spell it.
select show_trgm('Jon Verdoo'); -- show_trgm -- ------------------------------------------------------------------- -- {" j"," v"," jo"," ve",doo,erd,jon,"on ","oo ",rdo,ver} select word_similarity('Jeanne Verdoux','Jon Verdoo'); -- word_similarity -- ----------------- -- 0.333333
select name from artists where similarity(name, 'Jon Verdoo') > 0.3; -- Time: 76.742 ms
This query results in exactly one name that contains similar groupings of letters, and it’s our target name. There’s a bit of luck involved with the selected string in this test, but you can expect a trigram search to significantly reduce results compared to the fuzzy methods. In my testing of several different strings, trigram results were no more than 2 rows for this dataset of ~15K names.
name ---------------- Jeanne Verdoux (1 row)
Current limitations in YugabyteDB prevent optimizing such queries with indexes. If you return to the wildcard query pattern instead of a phonetic search, adding a GIN index will decrease time spent on wildcard `like` queries. Here’s a brief example.
select name from artists where name like 'Jean%'; -- Time: 80.047 ms create index name_trgm on artists using gin (name gin_trgm_ops); select name from artists where name like 'Jean%'; -- Time: 28.977 ms
Combining metaphone and trigrams
We can combine the speed of the index on the double metaphone and the precision of the trigrams.
WITH q AS ( SELECT 'Jon Verdoo' AS qsn ) SELECT similarity(lower(name),lower(qsn)) AS similarity_score, artists.name FROM artists, q WHERE dmetaphone(name) = dmetaphone(qsn) ORDER BY similarity_score DESC LIMIT 1;
This gives us a sorted list. Again, there is some luck involved with the test string, but results are similar for various strings on this data. The target string shows up in the first result.
With the above index `name_dmp` on `dmetaphone(name)`, the time to run the trigram query is reduced to approximately 2 milliseconds.
similarity_score | name ------------------+---------------- 0.3 | Jeanne Verdoux (1 row) Time: 2.128 ms
Conclusion
The indexed double metaphone method is the most performant, especially for long strings that reduce the number of matches. On the other hand, the trigram method reduces the result list most accurately. YugabyteDB supports either of these methods. Combining the speed of metaphone with the precision of trigrams gives the best results.
References
The following articles demonstrate similar tests in Postgres.