What Is the Difference Between ‘Table Partitioning’ and ‘Sharding’ In a Distributed SQL Deployment?
Lets separate this into two separate questions:
- What is table partitioning and sharding?
- How does YugabyteDB utilize partitioning and sharding?
First, partitioning (or table partitioning) is the process of splitting large tables into smaller, more manageable ones, providing multiple physically independent tables that are logically seen as one. Partitioning has been used for traditional (single node) databases in order to efficiently query large amounts of data. In essence you are grouping multiple subsets of data on a single node based on the column you choose to partition on and a strategy such as range or list.
Sharding (or database sharding) is the process of breaking up large tables, indexes, or partitions into smaller chunks called shards (or tablets in YugabyteDB) that are then distributed across multiple servers based on a hash or range of the primary key. Sharding is similar to partitioning in that you are breaking up a table into smaller pieces. It is different in the sense that you are distributing these pieces across multiple nodes.
In YugabyteDB you can use both concepts, so we understand why it can be confusing. Within YugabyteDB partitioning is a user-defined, SQL-level concept, thus requiring an explicit definition through SQL. Sharding on the other hand, and the load balancing of shards, is a storage level concept that is performed automatically by YugabyteDB based on your replication factor. As your data grows in size, the database will continue to create more shards and distribute them across your cluster. As you add nodes, the database will replicate them to the new nodes. This allows you to horizontally scale out with no effort to meet your high throughput requirements. Because partitioning is a SQL-level property that is dependent on the tablet definition, it is not automatic.
The main use of partitioning in YugabyteDB, or declarative partitioning as it is referred to in PostgreSQL, is to group your data together. For example, YSQL time series workloads can mimic the time-to-live (TTL) found in Cassandra or YugabyteDB’s YCQL API by partitioning on the time column and dropping the oldest partition. Another use case is to implement row-level geo-partitioning, when your data cannot leave a specific region (i.e. GDPR). Partitioning for either of these scenarios is done at the query layer using YSQL, separating itself again from sharding which is done at the storage layer (DocDB). Keep in mind that indexes created on partitioned tables are automatically partitioned.
For deeper insights check out Distributed SQL Essentials: Sharding and Partitioning in YugabyteDB and the demo-driven explanations on Partitioning in YugabyteDB and Sharding in YugabyteDB.
Learn more about YugabyteDB architecture.
Discover More Tips and Tricks
A library of distributed SQL tips and tricks and general “how to” information can be found by searching the YugabyteDB blog, as well as our DEV Community Blogs.
Events and Training
Check out the upcoming YugabyteDB events, including all training sessions, conferences, in-person and virtual events, and YugabyteDB Friday Tech Talks (designed for engineers by engineers).
In addition, there is some extremely popular “how to” content on the YugabyteDB YouTube channel.
If You Have Questions About Distributed SQL
If you have questions, make sure to ask them on the YugabyteDB Slack channel, Forum, GitHub, or Stack Overflow.
Next Steps
Ready to start exploring YugabyteDB features? You have some great options to get started. Run the database locally on your laptop (Quick Start), deploy it to your favorite cloud provider (Multi-node Cluster Deployment), sign up for a free YugabyteDB Managed cluster, or request a full-featured trial. It’s easy! Get started today!