Fuzzy Matching in YugabyteDB

March 15, 2023

Wildcard lookups (using “like” in a query) are common ways to get partial matches on strings. However, they are well-known to not be very performant.

This test plan will show some basic query timing using a variety of alternatives available in YugabyteDB. Please note however, that timing should be used only in rough comparison and is not meant to measure what timing would look like in a large, busy production environment.

In this test, I will highlight various methods of finding the name of the artist “Jeanne Verdoux” in a large data set of names.

Here’s our target.

select name, nationality, gender, birth_year from artists where name='Jeanne Verdoux';

  	name  	| nationality | gender | birth_year
----------------+-------------+--------+------------
 Jeanne Verdoux | French  	| Female |   	1966

Let’s say I heard this name in a lecture, but I don’t know how to spell it. Therefore, I’ll need to use less straightforward methods than querying the complete string.

Set up the environment

I’m setting up this test environment on my local laptop using `yugabyted`. Again, results would look different in a production cluster topology.

Start cluster

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

yugabyted start --base_dir /tmp/data1 --advertise_address 127.0.0.1
yugabyted start --base_dir /tmp/data2 --advertise_address 127.0.0.2 --join 127.0.0.1
yugabyted start --base_dir /tmp/data3 --advertise_address 127.0.0.3 --join 127.0.0.1

Set up database

ysqlsh -U yugabyte -d yugabyte

create database stringmatch;
\c stringmatch

CREATE TABLE artists (
    artist_id int,
   name text,
   nationality text,
   gender text,
   birth_year int,
   death_year int
);

Set up data

Get data from the Museum of Modern Art.

copy artists from '/tmp/artists.csv' delimiter ',' csv header;

select * from artists limit 10;

Wildcard filters

A wildcard filter is a typical SQL method used to find a substring. This method is usually not performant.

\timing

select name from artists where name like 'Jean%';
-- Time: 80.312 ms

As expected, the result is a long list of names starting with “Jean” as shown below.

        	name
-----------------------------
 Jean-Paul Goude
 Jeanette Reinhardt
 Jeanne Bardey
 Jean Puy
 Jeanne Moutoussamy-Ashe
 Jean-Francois Moriceau
 Jean-Baptiste-Camille Corot
 Jean Baudrillard
 Jeannette Ducrocq Tanguy
 ...
 (113 rows)

For a wildcard that is not left-hand (see following examples), the timing is about the same.

select name from artists where name like '%ean%';
-- Time: 72.488 ms

select name from artists where name like 'J%n';
-- Time: 79.885 ms

Creating a regular index doesn’t help for left-hand wildcards in YugabyteDB in the same way it does in other DBMSs, due to underlying storage differences. One optimization using GIN indexes is highlighted further down in this post.

There is also the assumption that I’m guessing the first four letters of the artist’s name correctly. It could, after all, be “Jon” or “John” or “Gian” or a range of other possibilities. We can use fuzzy and trigram extensions to help find a string we don’t know how to spell exactly.

Phonetic or “fuzzy” methods

The fuzzystrmatch extension is pre-packaged with YugabyteDB. Issue the following statement to enable it.

create extension if not exists fuzzystrmatch;

There are three major options with fuzzy matching: Levenshtein distance, Soundex, and Metaphone (or double Metaphone).

Levenshtein matching

The Levenshtein score between two strings is the number of transformations needed to change the first string to the second string. For example, the Levenshtein score for baz>bat is 1. For baz>batter, it is 4.

WITH q AS (
  SELECT 'Jean' AS qsn
)
SELECT
  levenshtein(lower(substring(name from '[^ ]+'::text)),lower(qsn)) AS leven,
  artists.name
FROM artists, q
WHERE levenshtein(lower(substring(name from '[^ ]+'::text)),lower(qsn)) < 3
ORDER BY leven;

-- Time: 94.325 ms

As you can see below, we get a list of first names having a transformation count (Levenshtein distance) of 0, 1, or 2. First names are being split out using the substring function above. This vastly increases the number of results, but it also helps find the name if I’ve incorrectly guessed the spelling.

 leven | firstname
-------+-----------
 	0 | Jean Vylen
 	0 | Jean Dewasne
 	0 | Jean Nouvel
 	0 | Jean Locey
 	0 | Jean Carlomusto
 	0 | Jean Le Gac
 	0 | Jean Baudrillard
 	...
 	2 | Ewan Gibbs
 	2 | Len Waernberg
 	2 | John Irvin
 	2 | Jens H. Quistgaard
 	2 | John Hays Hammond
 	...
 	(760 rows)

Soundex matching

The Soundex algorithm simplifies a string to a phonetic sound pattern. I’ve chosen “Jon Verdoo” as the best phonetic spelling of the target artist’s name.

select soundex('Jon Verdoo');

-- soundex
-- ---------
-- J516

select name, soundex(name) from artists where soundex(name)=soundex('Jon Verdoo');
-- Time: 84.997 ms

The result set demonstrates how broadly the matching is with Soundex, and this can really help if you don’t know how to spell a string. Notice that the target name does appear on the list.

              	name               	| soundex
-----------------------------------------+---------
 Jeanne Bardey                       	| J516
 John Hubbard/Black Star Publishing Co. | J516
 Jean-Francois Moriceau              	| J516
 John Behringer                      	| J516
 Jean-Pierre Melville                	| J516
 Jan Preisler                        	| J516
 Jean-Bernard Menoud                 	| J516
 Jean-Pierre Vielfaure               	| J516
 Jan Forsberg                        	| J516
 John Opper                          	| J516
 ...
 Jeanne Verdoux                      	| J516
 ...
 (50 rows)

Adding an index on the soundex reduced the runtime for this query.

create index name_soundex on artists (soundex(name));

select name, soundex(name) from artists where soundex(name)=soundex('Jon Verdoo');
-- Time: 9.644 ms

Double metaphone matching

The metaphone and improved double metaphone algorithms also simplify a string to a phonetic sound pattern. Since this method is focused on consonant sounds, I don’t need to approximate vowel sounds very closely.

select dmetaphone('Jon Verdoo');

-- dmetaphone
-- ------------
-- JNFR

select name from artists where dmetaphone(name) = dmetaphone('Jon Verdoo');
-- Time: 80.838 ms

The result set is again decreased and also includes the target name. Note the differences in this result set from the previous ones. We are now getting spellings of the “J” sound that start with “G.”

 name
------------------------
 Jean-Francois Moriceau
 Gianfranco Ferroni
 Gianfranco Masi
 Jan Forsberg
 Gianfranco Rosi
 Jane Freilicher
 John Ford
 Jennifer Morla
 John Hoover
 John Frankenheimer
 Jeanne Verdoux
 ...
 (28 rows)

An index decreases the amount of time spent on this query significantly.

create index name_dmp on artists (dmetaphone(name));

select name from artists where dmetaphone(name) = dmetaphone('Jon Verdoo');
-- Time: 2.467 ms

Trigrams

As an alternate method to fuzzy matching, we can use the trigram extension. The trigram method breaks a string into groups of letters.

Here is a further definition taken from the PostgreSQL documentation site.

pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams from a string. Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at”.

The pg_trgm extension is also pre-packaged with YugabyteDB. Issue the following statement to enable it.

create extension if not exists pg_trgm;

You can search for a string (e.g., name) even if you can’t remember how to spell it.

select show_trgm('Jon Verdoo');

-- show_trgm
-- -------------------------------------------------------------------
--  {"  j","  v"," jo"," ve",doo,erd,jon,"on ","oo ",rdo,ver}

select word_similarity('Jeanne Verdoux','Jon Verdoo');

-- word_similarity
-- -----------------
-- 0.333333

select name from artists where similarity(name, 'Jon Verdoo') > 0.3;
-- Time: 76.742 ms

This query results in exactly one name that contains similar groupings of letters, and it’s our target name. There’s a bit of luck involved with the selected string in this test, but you can expect a trigram search to significantly reduce results compared to the fuzzy methods. In my testing of several different strings, trigram results were no more than 2 rows for this dataset of ~15K names.

  name
----------------
 Jeanne Verdoux
 (1 row)

Current limitations in YugabyteDB prevent optimizing such queries with indexes. If you return to the wildcard query pattern instead of a phonetic search, adding a GIN index will decrease time spent on wildcard `like` queries. Here’s a brief example.

select name from artists where name like 'Jean%';
-- Time: 80.047 ms

create index name_trgm on artists using gin (name gin_trgm_ops);

select name from artists where name like 'Jean%';
-- Time: 28.977 ms

Combining metaphone and trigrams

We can combine the speed of the index on the double metaphone and the precision of the trigrams.

WITH q AS (
  SELECT 'Jon Verdoo' AS qsn
)
SELECT
  similarity(lower(name),lower(qsn)) AS similarity_score,
  artists.name
FROM artists, q
WHERE dmetaphone(name) = dmetaphone(qsn)
ORDER BY similarity_score DESC
LIMIT 1;

This gives us a sorted list. Again, there is some luck involved with the test string, but results are similar for various strings on this data. The target string shows up in the first result.

With the above index `name_dmp` on `dmetaphone(name)`, the time to run the trigram query is reduced to approximately 2 milliseconds.

 similarity_score |  	name
------------------+----------------
          	0.3 | Jeanne Verdoux
(1 row)

Time: 2.128 ms

Conclusion

The indexed double metaphone method is the most performant, especially for long strings that reduce the number of matches. On the other hand, the trigram method reduces the result list most accurately. YugabyteDB supports either of these methods. Combining the speed of metaphone with the precision of trigrams gives the best results.

References

The following articles demonstrate similar tests in Postgres.

March 15, 2023

Query

Fuzzy Matching in YugabyteDB

Related Posts

Explore Distributed SQL and YugabyteDB in Depth