SQL Puzzle: Partial Versus Expression Indexes

Karthik Ranganathan

Founder and co-CEO

Here is an intriguing SQL puzzle we came across in the context of a real-world use case. This post shows the power of advanced RDBMS features such as partial indexes and expression indexes.

Let us assume we have a table in PostgreSQL named users, where each row in the table represents a user. The table is defined as follows.

CREATE TABLE users (
  id    SERIAL PRIMARY KEY,
  email VARCHAR DEFAULT NULL,
  name  VARCHAR
);

Now, let us assume we create the following indexes on the table above.

Index #1Index #2
CREATE UNIQUE INDEX users_idx1 
  ON users (email) 
  WHERE email IS NOT NULL;
CREATE UNIQUE INDEX users_idx2
  ON users 
  (( email IS NOT NULL ));

What is the difference between the two indexes shown above? The first index #1 is a partial index while index #2 is an expression index. And it should be no surprise that because YugabyteDB reuses the PostgreSQL native query layer, it already supports both partial and expression indexes!

Let’s dive right in and take a look at what these indexes do.

Partial Indexes

We have written about partial indexes in YugabyteDB at length before. As the PostgreSQL documentation on partial indexes states, a partial index is built over a subset of rows in a table defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate. A major reason for using a partial index is to avoid indexing commonly occurring values. Since a query searching for a commonly occurring value (one that accounts for more than a few percent of all the table rows) will run through most of the table anyway, the benefit from using an index is marginal. A better strategy is to create a partial index where such rows are excluded altogether. Partial indexing reduces the size of the index and hence speeds up those queries that do use the index. It will also speed up many write operations because the index does not need to be updated in all cases. And in the context of a distributed SQL database like YugabyteDB, every such speed up helps in lowering the effective latency observed by client applications.

Consider the first index in our example.

CREATE UNIQUE INDEX users_idx1 
  ON users (email) 
  WHERE email IS NOT NULL;

This is a partial unique index that:

  • Only indexes users whose email is not NULL.
  • Ensures that no two users have the same, non-null email address.

Hence, with this index, the table can have any number of users without an email address. But if an email address for a user is specified, then another user cannot already have the same email address. This can be seen from the following examples.

/* OK: Insert a user with a name and an email. */
yugabyte=# INSERT INTO users (name, email) 
           VALUES ('James Bond', 'jbond@yugabyte.com');

/* OK: Insert the same user name without an email (email is NULL). */
yugabyte-# INSERT INTO users (name) VALUES ('James Bond');

/* FAIL: Insert a different user with same email */
yugabyte=# INSERT INTO users (name, email) 
           VALUES ('007', 'jbond@yugabyte.com');
ERROR: duplicate key value violates unique constraint "users_idx1"

Further, the index users_idx1 is used when a lookup is performed by a non-empty email, while a full-scan is performed if the query looks up users that have a NULL value for their email.

yugabyte=# EXPLAIN SELECT * FROM users 
           WHERE email='jbond@yugabyte.com';
                               QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using users_idx1 on users  (cost=0.00..4.11 rows=1 width=68)
   Index Cond: ((email)::text = 'jbond@yugabyte.com'::text)
(2 rows)


yugabyte=# EXPLAIN SELECT * FROM users WHERE email IS NULL;
                          QUERY PLAN
---------------------------------------------------------------
 Foreign Scan on users  (cost=0.00..100.00 rows=1000 width=68)
   Filter: (email IS NULL)
(2 rows)

Expression Indexes

The PostgreSQL documentation on expression indexes notes:

An index column need not be just a column of the underlying table, but can be a function or scalar expression computed from one or more columns of the table. This feature is useful to obtain fast access to tables based on the results of computations.

The second example is an expression index, as shown below.

CREATE UNIQUE INDEX users_idx2
  ON users 
  (( email IS NOT NULL ));

The index table in the above case would contain the result of evaluating the expression, ( email IS NOT NULL ). The result would evaluate to either true or false, and would allow exactly one row of each kind since it is a UNIQUE index. This means that the above index allows only two rows in the table, one row with email being NULL and the other row with a non-NULL email.

While this particular index is not very useful in practice, it certainly illustrates how expression indexes would work. Below are some example rows being inserted into the table.

/* OK: Insert a row with non-NULL email */
yugabyte=# INSERT INTO users (name, email)
           VALUES ('James Bond', 'jbond@yugabyte.com');


/* OK: Insert a row with non-NULL email */
yugabyte=# INSERT INTO users (name) VALUES ('James Bond');


/* FAIL: Row with NULL email already exists */
yugabyte=# INSERT INTO users (name) VALUES ('Sherlock Holmes');
ERROR:  duplicate key value violates unique constraint "users_idx2"


/* FAIL: Row with non-NULL email already exists */
yugabyte=# INSERT INTO users (name, email)                                                                   VALUES ('Sherlock Holmes', 'sholmes@yugabyte.com');
ERROR:  duplicate key value violates unique constraint "users_idx2"

A more useful example of an expression index in our scenario is to create a case-insensitive index as follows.

yugabyte=# CREATE UNIQUE INDEX users_idx3
           ON users (lower(email));

This would prevent inserting duplicate email addresses using a different case, as shown in the example below.

/* OK: Insert a row a new email */
yugabyte=# INSERT INTO users (name, email)
           VALUES ('James Bond', 'jbond@yugabyte.com');


/* FAIL: Lowercase version of email address already exists! */
yugabyte=# INSERT INTO users (name, email)                                                                   VALUES ('James Bond', 'JBOND@yugabyte.com');
ERROR:  duplicate key value violates unique constraint "users_idx3"

Summary

YugabyteDB supports advanced PostgreSQL features including partial and express indexes. As a cloud native distributed SQL database, it has high resilience to failures, scales writes horizontally, and can be deployed in multiple geo-distributed configurations (such as multi-zone or multi-region deployments). Read more about how we reused PostgreSQL to create a feature-rich distributed SQL database. If this excites you, come join us – we’re hiring!

Karthik Ranganathan

Founder and co-CEO

Related Posts

Explore Distributed SQL and YugabyteDB in Depth

Discover the future of data management.
Learn at Yugabyte University
Get Started
Browse Yugabyte Docs
Explore docs
PostgreSQL For Cloud Native World
Read for Free