Achieving Sub-ms Latencies on Large Datasets in Public Clouds
The intent was to verify that YugabyteDB handles this case, a working set far larger than available memory, with the optimal number of IOs to the disk subsystem.
This post is a sneak peek into just one aspect of YugabyteDB’s innovative storage engine, DocDB, that supports very high data densities per node, something that helps you keep your server footprint and AWS (or other cloud provider) bills low! If you’re interested in the internals of DocDB, you can check out our GitHub repo.
Summary
We loaded about 1.4TB across 4 nodes and configured the block cache on each node to be only 7.5GB. On four 8-vCPU machines, we were able to get about 77K ops/second with average read latencies of 0.88 ms.
More importantly, as you’ll see from the details below, YugabyteDB does the optimal number of disk IOs for this workload.
This is possible because of YugabyteDB’s highly optimized storage engine (DocDB) and its ability to chunk/partition the index and bloom filter blocks and cache them effectively.
Details
Setup
4-node cluster in Google Cloud Platform (GCP)
Each node is a:
- n1-standard-8
- 8 vCPUs; Intel® Xeon® CPU @ 2.60GHz
- RAM: 30GB
- SSD: 2 x 375GB
Replication Factor = 3
Default Block Size (db_block_size) = 32KB
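For reference, the block size above (and the 7.5GB block cache mentioned in the summary) correspond to yb-tserver configuration flags. Below is a minimal sketch of how such a setup could be expressed; the flag names (--db_block_size_bytes, --db_block_cache_size_bytes) reflect our reading of the yb-tserver options, and the data directories and master addresses are placeholders, so confirm both against the documentation for your release.
# Hypothetical yb-tserver invocation; addresses and data directories are placeholders.
% yb-tserver \
    --tserver_master_addrs 10.128.0.2:7100,10.128.0.3:7100,10.128.0.4:7100 \
    --fs_data_dirs /mnt/d0,/mnt/d1 \
    --db_block_size_bytes 32768 \
    --db_block_cache_size_bytes 8053063680
Here 32768 bytes is the 32KB block size, and 8053063680 bytes is approximately the 7.5GB block cache used in this run.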
Load Phase & Data Set Size
- Number of KVs: 1.6 Billion
- KV Size: ~300 bytes
** Value size: 256 bytes (deliberately chosen to be not very compressible)
** Key size: 50 bytes
- Logical Data Size: 1.6B * 300 = 480GB (sanity-checked below this list)
- Raw Data Including Replication: 480GB * 3 = 1.4TB
- Raw Data per Node: 1.4TB / 4 nodes = 360GB
- Block Cache Size: 7.5GB
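As a quick sanity check of the sizing arithmetic above (using decimal GB and a round 1.6B keys), the shell (bash/zsh) math below just restates the numbers from the list:
% echo $(( 1600000000 * 300 / 1000**3 ))          # logical data size in GB: 480
% echo $(( 1600000000 * 300 * 3 / 1000**3 ))      # with RF=3, in GB: 1440 (~1.4TB)
% echo $(( 1600000000 * 300 * 3 / 1000**3 / 4 ))  # raw data per node, in GB: 360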
Loading Data
This was run using a sample load tester bundled with YugabyteDB.
% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar \
    --workload CassandraKeyValue \
    --nodes 10.128.0.2:9042,...,10.128.0.3:9042 \
    --num_threads_write 200 \
    --num_threads_read 0 \
    --value_size 256 \
    --num_unique_keys 1629145600 \
    --num_writes 1629145601 \
    --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/write.txt &
Running Random Read Workload
We used 150 concurrent readers; the reads use a random distribution across the 1.6B keys loaded into the system.
% nohup java -jar /opt/yugabyte/java/yb-sample-apps.jar \
    --workload CassandraKeyValue \
    --nodes 10.128.0.2:9042,...,10.128.0.3:9042 \
    --num_threads_write 0 \
    --num_threads_read 150 \
    --value_size 256 \
    --read_only \
    --max_written_key 1629145600 \
    --uuid ed8c67d8-7fd6-453f-b109-554131f380c1 >& /tmp/read.txt &
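While either phase is running, you can follow the load tester’s progress by tailing the log file it writes to; the sample app periodically logs the throughput and latency it is observing (the exact log format may differ across versions).
% tail -f /tmp/write.txt    # during the load phase
% tail -f /tmp/read.txt     # during the random read phase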
Disk Utilization
Sample disk IO on one of the nodes during the “random read” workload is shown below. The disk stats below show that the workload is evenly distributed across the 2 available data SSDs on the system. Each of the four nodes is handling about 16.4K disk read ops/sec (for the 77K user read ops/sec cluster-wide).
The average IO size is 230MB/s / 8.2K reads/sec = 29KB. This corresponds to our db_block_size of 32KB; it is slightly smaller because the keys in this setup compress somewhat, while the bulk of the data volume is in the value portion, which was deliberately chosen to be not very compressible.
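The per-disk numbers quoted above are the kind of stats you can gather yourself with iostat while the read workload is running; a minimal sketch (device names and exact column headings vary across sysstat versions and machines):
# Extended per-device stats in MB, refreshed every 5 seconds; look at the read ops/sec
# and read MB/s columns for the two data SSDs.
% iostat -xm 5
Dividing read MB/s by read ops/sec for each device gives the average read size, which is how the ~29KB figure above was arrived at.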
Bloom Filter Efficiency
The index blocks and bloom filters are cached effectively, and therefore all the misses, about 8.2K per disk (as shown above), are for data blocks. The IO amounts to about 230MB/s on each disk for those 8.2K disk read ops/sec.
In addition, the bloom filters are highly effective in minimizing the number of IOs (to SSTable files in our log-structured merge (LSM) storage engine), as shown in the chart below.
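If you want to inspect bloom filter behavior on your own cluster, one place to look is the yb-tserver metrics endpoint, which surfaces the underlying RocksDB counters. The sketch below simply greps for bloom-related metrics on one node; port 9000 is the default tserver web UI port, and the exact metric names vary by release, so treat this as a starting point rather than the precise counters behind the chart above.
# Dump bloom-filter-related counters from one tserver's Prometheus metrics endpoint.
% curl -s http://10.128.0.2:9000/prometheus-metrics | grep -i bloom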
Summary
This post highlights YugabyteDB’s ability to support sub-ms read latencies on large data sets in a public cloud deployment. We will cover several additional aspects of YugabyteDB’s storage engine in subsequent posts, with topics ranging from optimizations for fast-data use cases (such as “events” / “time organized” data) to modeling complex types such as collections and maps with minimal read and write amplification, and more.
If you have questions or comments, we would love to hear from you in our community forum.
If you want to try this on your own, you can get started with YugabyteDB in just 5 mins.