Differences in Primary Key Usage Between Colocated and Non-Colocated Tables

Updated for Release 2.18

May 24, 2023

Although YugabyteDB is a distributed SQL database with automatic data sharding, it supports colocated tables and databases. This is a useful feature to support specific use cases. The YugabyteDB documentation page describes colocated tables and the use cases where they are a good fit. Additionally, this blog post, Boosting Performance for Small Distributed SQL Data Sets with Colocated Tables, gives a concise view of when this feature is useful.

In this blog post I will discuss the primary key differences between colocated and non-colocated databases/tables. I recommend you review the documentation and blog post mentioned above to get familiar with this topic, before diving into the syntax implications of primary key usage on colocated and non-colocated tables/databases.

Non-Colocated (Distributed) and Colocated Databases

Non-colocated/distributed databases

A non-colocated/distributed database is created using a CREATE DATABASE command. No special options are required on the CREATE DATABASE command to define this type of database.

CREATE DATABASE my_noncolocated_db;

A non-colocated database in a YugabyteDB cluster has all tables distributed across the nodes in a cluster. Distributed tables are by default hash sharded and primary key definition includes partition columns and clustering columns. Partition columns are the columns used in the hash function to determine the node on which data would be stored. Clustering columns are used to determine the sort order of data with the same hash key.

Range sharding can also be applied on distributed tables. This involves splitting the rows of a table into contiguous ranges that respect the sort order of the table based on the primary key column values.

Primary key syntaxes in non-colocated database

Set up a non-colocated db as
```
CREATE DATABASE my_noncolocated_db;
```
Connect to the database and check if it is a non-colocated database using the below query. The query will result in a 'f' for a non-colocated database
```
SELECT yb_is_database_colocated();
```
Here are the different primary key syntaxes on a non-colocated table

PRIMARY KEY (a, b, c) will have column ‘a’ as the partition column. Hash function for sharding is on column a only. Columns b and c are clustering columns with default ASC for the table

PRIMARY KEY ((a, b, c)) will have columns ‘a’,’b’ & ‘c’ as the partition columns. Hash function for sharding is a combination of the composite columns mentioned within the inner parentheses: ie columns a, b and c

PRIMARY KEY ((a, b), c)) will have columns ‘a’,’b’ as the partition columns. Hash function for sharding is on columns a & b. Column c is a clustering column with default ascending for the table

PRIMARY KEY (a asc, b) will not have any hash sharding. Data will be organized in the ascending order of columns a, b. This is range sharding.

A colocated table cannot be created in a non-colocated or a distributed database. Also tables within a colocated database will be colocated unless you opt out of colocation when creating the table.

Colocated database

A colocated database can be defined using the WITH COLOCATION = 'true' option in the CREATE DATABASE statement:

CREATE DATABASE my_colocated WITH colocation = 'true';

With the database being defined as colocated you can have both type of tables – distributed tables as well as colocated tables.

If not explicitly declared, all tables are created as colocated in a colocated database (ie) the entire table is on one single tablet in a node. As a result, hash sharding is not applicable for colocated tables, and data is organized by the columns and sort order specified in the primary key. However, for resiliency and availability, the tablets for all the tables are replicated to other nodes with a factor equivalent to the replication factor of the cluster.

Distributed tables can be created in a colocated database. These tables will be distributed across the YugabyteDB cluster nodes and hash sharded by default. A distributed table is created by explicitly specifying 'WITH COLOCATION = 'false' in the CREATE TABLE statement:

CREATE TABLE my_noncolocated_table(k int primary key, v text) WITH (colocation = 'false');

The flexibility of this combination of distributed and colocated tables is typically required when you have few large growing tables amongst several small reference-like fairly static tables. Making such tables distributed allows the possibility of scaling out the cluster to meet the demands of the table growth, as well as requests to read/write those tables.

Primary key syntaxes in a colocated database

This section provides syntax details on how to create a colocated database. It also provides details on different primary key syntaxes on a colocated table, and what they would mean to an application that needs to use them.

Setup a colocated db as:

CREATE DATABASE my_colocated_db WITH colocation = 'true';

Connect to the database and check if it is a colocated database using the below query. The query will result in a 't' for a colocated database:
```
SELECT yb_is_database_colocated();
```
Primary key syntaxes on CREATE TABLE and their result on a colocated DB. In all the syntax scenarios below, I will use a simple table called employee, which will have 2 columns – id of int type and name of text type.

CREATE TABLE Syntax	Index definition output on \d+ tablename	Comments about the created table
create table employee (id int primary key, name text);	“employee_pkey” PRIMARY KEY, lsm (id ASC)	Colocated table No partition column id is clustering column
create table employee (id int, name text)	Nothing under Indexes	Colocated table No partition column ybrowid is clustering column as assigned by yugabyteDB
create table employee (id int, name text, primary key(id))	“employee_pkey” PRIMARY KEY, lsm (id ASC)	Colocated table No partition column id is clustering column
create table employee (id int, name text, primary key(id asc, name))	“employee_pkey” PRIMARY KEY, lsm (id ASC, name ASC)	Colocated table No partition column id & name are clustering columns
create table employee (id int, name text, primary key(id asc , name desc)) ;	“employee_pkey” PRIMARY KEY, lsm (id ASC, name desc)	Colocated table No partition column id & name are clustering columns
create table employee (id int, name text, primary key((id), name))	“employee_pkey” PRIMARY KEY, lsm (id ASC, name ASC)	Colocated table No partition column id & name are clustering columns
create table employee (id int, name text, primary key (id hash));		ERROR: Invalid argument: Invalid table definition: Error creating table my_colocated.employee on the master: Cannot colocate hash partitioned table
create table employee (id int, name text, primary key((id , name))) ;		Invalid table definition: Error creating table my_colocated.employee on the master: Cannot colocate hash partitioned table
create table employee (id int primary key, name text) with (colocation = false);	“employee_pkey” PRIMARY KEY, lsm (id HASH) Options: colocation=false	Non-Colocated/ Distributed table Hash partition table id partition column no clustering columns

Summary

Colocated tables in YugabyteDB are organized by the sort order of the columns specified in the primary key in one tablet only. Therefore, any explicitly mentioned HASH or inner parentheses with multiple columns in it, with the syntax as PRIMARY KEY ((a, b, c)), will result in an error.

Non-colocated or distributed tables are hash sharded by default. They can also be range sharded if the use case demands it.

To learn more about hash sharding and range sharding, check out our documentation site.

May 24, 2023