Achieving Sub-ms Latencies on Large Datasets in Public Clouds
The intent was to verify that YugabyteDB handles this case, a working set far larger than available memory, with the optimal number of IOs to the disk subsystem.
This post is a sneak peek into just one aspect of YugabyteDB’s innovative storage engine, DocDB, that supports very high data densities per node, something that helps you keep your server footprint and AWS (or other cloud provider) bills low! If you’re interested in the internals of DocDB, you can check out our GitHub repo.
Summary
We loaded about 1.4TB across 4 nodes and configured the block cache on each node to be only 7.5GB. On four 8-vCPU machines, we were able to get about 77K ops/second with average read latencies of 0.88 ms.
More importantly, as you’ll see from the details below, YugabyteDB does the optimal number of disk IOs for this workload.
This is possible because of YugabyteDB’s highly optimized storage engine (DocDB) and its ability to chunk/partition the index and bloom filter blocks and cache them effectively.
Details
Setup
4-node cluster in Google Cloud Platform (GCP)
Each node is a:
- n1-standard-8
- 8 vCPUs; Intel® Xeon® CPU @ 2.60GHz
- RAM: 30GB
- SSD: 2 x 375GB
Replication Factor = 3
Default Block Size (db_block_size) = 32KB
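For reference, the block size above (and the 7.5GB block cache mentioned in the summary) correspond to yb-tserver configuration flags. Below is a minimal sketch of how such a setup could be expressed; the flag names (--db_block_size_bytes, --db_block_cache_size_bytes) reflect our reading of the yb-tserver options, and the data directories and master addresses are placeholders, so confirm both against the documentation for your release.
# Hypothetical yb-tserver invocation; addresses and data directories are placeholders.
% yb-tserver \
    --tserver_master_addrs 10.128.0.2:7100,10.128.0.3:7100,10.128.0.4:7100 \
    --fs_data_dirs /mnt/d0,/mnt/d1 \
    --db_block_size_bytes 32768 \
    --db_block_cache_size_bytes 8053063680
Here 32768 bytes is the 32KB block size, and 8053063680 bytes is approximately the 7.5GB block cache used in this run.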
Load Phase & Data Set Size
- Number of KVs: 1.6 Billion
- KV Size: ~300 bytes
** Value size: 256 bytes (deliberately chosen to be not very compressible)
** Key size: 50 bytes
- Logical Data Size: 1.6B * 300 = 480GB (sanity-checked below this list)
- Raw Data Including Replication: 480GB * 3 = 1.4TB
- Raw Data per Node: 1.4TB / 4 nodes = 360GB
- Block Cache Size: 7.5GB
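As a quick sanity check of the sizing arithmetic above (using decimal GB and a round 1.6B keys), the shell (bash/zsh) math below just restates the numbers from the list:
% echo $(( 1600000000 * 300 / 1000**3 ))          # logical data size in GB: 480
% echo $(( 1600000000 * 300 * 3 / 1000**3 ))      # with RF=3, in GB: 1440 (~1.4TB)
% echo $(( 1600000000 * 300 * 3 / 1000**3 / 4 ))  # raw data per node, in GB: 360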
Loading Data
This was run using a sample load tester bundled with YugabyteDB.
% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar \
    --workload CassandraKeyValue \
    --nodes 10.128.0.2:9042,...,10.128.0.3:9042 \
    --num_threads_write 200 \
    --num_threads_read 0 \
    --value_size 256 \
    --num_unique_keys 1629145600 \
    --num_writes 1629145601 \
    --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/write.txt &
Running Random Read Workload
We used 150 concurrent readers; the reads use a random distribution across the 1.6B keys loaded into the system.
% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar \
    --workload CassandraKeyValue \
    --nodes 10.128.0.2:9042,...,10.128.0.3:9042 \
    --num_threads_write 0 \
    --num_threads_read 150 \
    --value_size 256 \
    --read_only \
    --max_written_key 1629145600 \
    --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/read.txt &
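While either phase is running, you can follow the load tester’s progress by tailing the log file it writes to; the sample app periodically logs the throughput and latency it is observing (the exact log format may differ across versions).
% tail -f /tmp/write.txt    # during the load phase
% tail -f /tmp/read.txt     # during the random read phase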
Disk Utilization
Sample disk IO on one of the nodes during the “random read” workload is shown below. The disk stats below show that the workload is evenly distributed across the 2 available data SSDs on the system. Each of the four nodes is handling about 16.4K disk read ops/sec (for the 77K user read ops/sec cluster-wide).
The average IO size is 230MB/s / 8.2K reads/sec = 29KB. This corresponds to our db_block_size of 32KB; it is slightly smaller because the keys in this setup compress somewhat, while the bulk of the data volume is in the value portion, which was deliberately chosen to be not very compressible.
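The per-disk numbers quoted above are the kind of stats you can gather yourself with iostat while the read workload is running; a minimal sketch (device names and exact column headings vary across sysstat versions and machines):
# Extended per-device stats in MB, refreshed every 5 seconds; look at the read ops/sec
# and read MB/s columns for the two data SSDs.
% iostat -xm 5
Dividing read MB/s by read ops/sec for each device gives the average read size, which is how the ~29KB figure above was arrived at.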
Bloom Filter Efficiency
The index blocks and bloom filters are cached effectively, and therefore all the misses, about 8.2K per disk (as shown above), are for data blocks. The IO amounts to about 230MB/s on each disk for those 8.2K disk read ops/sec.
In addition, the bloom filters are highly effective in minimizing the number of IOs (to SSTable files in our log-structured merge (LSM) storage engine), as shown in the chart below.
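If you want to inspect bloom filter behavior on your own cluster, one place to look is the yb-tserver metrics endpoint, which surfaces the underlying RocksDB counters. The sketch below simply greps for bloom-related metrics on one node; port 9000 is the default tserver web UI port, and the exact metric names vary by release, so treat this as a starting point rather than the precise counters behind the chart above.
# Dump bloom-filter-related counters from one tserver's Prometheus metrics endpoint.
% curl -s http://10.128.0.2:9000/prometheus-metrics | grep -i bloom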
Summary
This post highlights YugabyteDB’s ability to support sub-ms read latencies on large data sets in a public cloud deployment. We will cover several additional aspects of YugabyteDB’s storage engine in subsequent posts, with topics ranging from optimizations for fast-data use cases (such as “events” / “time organized” data) to modeling complex types such as collections and maps with minimal read and write amplification, and more.
If you have questions or comments, we would love to hear from you in our community forum.
If you want to try this on your own, you can get started with YugabyteDB in just 5 mins.