Distributed SQL Tips and Tricks – May 8, 2020
Welcome to this week’s tips and tricks blog where we recap some distributed SQL questions from around the Internet. We’ll also review upcoming events, new documentation, and blogs that have been published since the last post. Got questions? Make sure to ask them on our YugabyteDB Slack channel, Forum, GitHub, or Stackoverflow. Ok, let’s dive right in:
How can I connect YugabyteDB with Apache Spark?
There are two ways to connect YugabyteDB to Apache Spark, depending on which distributed SQL API you are using. For YCQL we provide a Spark connector library and for YSQL you can use Apache Spark’s JDBC API.
Why is an external load balancer recommended for the YSQL API but not for the YCQL API?
Yugabyte’s two distributed SQL APIs have different needs for load balancers because of the underlying differences in their client driver implementations. YSQL relies on PostgreSQL client drivers that are neither cluster aware nor shard aware. In other words, they are not yet “smart”. On the other hand, YCQL relies on Apache Cassandra Query Language’s smart client drivers that are both cluster aware (and hence do not require a load balancer) and shard aware. Let’s dive deeper to understand what cluster awareness and shard awareness actually mean.
In case of YSQL, most of PostgreSQL’s client drivers take only one YugabyteDB node’s IP as input and will always route requests to that node. While there is no functional issue here since all YugabyteDB nodes are equal and can accept client requests for all the data in the cluster (they will internally route the requests, collect the results, and respond to the client with the results), a load balancer simply abstracts away the cluster of nodes into a single IP so that the client driver can send requests to all the nodes in the cluster. This is different from how YCQL client drivers work. The input to a Cassandra QL client driver is typically a list of IPs of the nodes in the cluster. Even if you give only one node’s IP, the rest of the nodes of the cluster are auto discovered by the client driver after it connects to the cluster for the first time. Thereafter cluster configuration changes such as add/remove nodes are also propagated back to all active clients. Because of this built-in cluster awareness, there is never a primary node in the world of YCQL.
Coming to shard awareness, even though all nodes receive the YSQL requests using a load balancer, there can be a lot of requests that reach nodes that do not have the necessary data since the PostgreSQL driver is not shard aware. Again there is no functional issue but a performance issue, which any distributed RDBMS faces, given additional hops between the nodes to query the data. In case of YCQL, along with an updated list of nodes, the client driver also caches the shard metadata which has the mapping of which shard/tablet is located on which node. As a result, the client driver is able to route queries to the relevant nodes directly without the need for any additional hops.
How can I order returned rows by the IN
list of values?
Sometimes you want to return the rows ordered as specified in a supplied list. This can be easily done like in the example below:
CREATE TABLE books(name TEXT, publication_year TIMESTAMP, PRIMARY KEY(name ASC)); INSERT INTO books values('Pride and Prejudice', '01-28-1813'); INSERT INTO books values('The Great Gatsby', '04-10-1925'); INSERT INTO books values('The Catcher in the Rye', '07-16-1951'); INSERT INTO books values('Little Women', '09-30-1868'); INSERT INTO books values('The Adventures of Huckleberry Finn', '12-10-1884');
Now, say that you want to search for 3 books and you do this using the IN
clause.
SELECT * FROM books WHERE name IN ('The Adventures of Huckleberry Finn', 'Pride and Prejudice', 'The Great Gatsby'); name | publication_year ------------------------------------+--------------------- Pride and Prejudice | 1813-01-28 00:00:00 The Adventures of Huckleberry Finn | 1884-12-10 00:00:00 The Great Gatsby | 1925-04-10 00:00:00 (3 rows)
As you can see, the rows are returned in order of name and not in the order in which they appear in the IN
clause.
If you want to retrieve rows and order them based on their order in the IN
clause, you can make use of the VALUES
clause as shown below:
select b.* from books b join ( values ('The Adventures of Huckleberry Finn',1), ('Pride and Prejudice',2), ('The Great Gatsby',3) ) as x (name, sortorder) on b.name = x.name order by x.sortorder; name | publication_year ------------------------------------+--------------------- The Adventures of Huckleberry Finn | 1884-12-10 00:00:00 Pride and Prejudice | 1813-01-28 00:00:00 The Great Gatsby | 1925-04-10 00:00:00 (3 rows)
Alternatively, you can also use WITH ORDINALITY
:
SELECT b.* FROM books b JOIN unnest('{"The Adventures of Huckleberry Finn", "Pride and Prejudice", "The Great Gatsby"}'::text[]) WITH ORDINALITY t(name, sortorder) USING (name) ORDER BY t.sortorder; name | publication_year ------------------------------------+--------------------- The Adventures of Huckleberry Finn | 1884-12-10 00:00:00 Pride and Prejudice | 1813-01-28 00:00:00 The Great Gatsby | 1925-04-10 00:00:00 (3 rows)
How can I create a backup of a single table in a YSQL database?
You can backup databases in YSQL using the ysql_dump
tool. And a single table using the --table
argument.
Assuming you have a database called yugabyte
and a table called helloworld
, the full command to backup the table is:
./postgres/bin/ysql_dump --schema public --table books --serializable-deferrable --file “helloworld.dump.txt” yugabyte
What is the query to list all table names in my YugabyteDB distributed SQL database?
Getting a list of all tables is easily done by querying the information_schema.tables
view like below:
yugabyte=# SELECT table_name FROM information_schema.tables WHERE table_schema='public' AND table_type='BASE TABLE'; table_name ------------ comments users books (3 rows)
See the full docs on the information_schema.tables
view for additional details.
New Documentation, Blogs, Tutorials, and Videos
New Blogs
- YugabyteDB Engineering Update – May 6, 2020
- Bringing Truth to Competitive Benchmark Claims–YugabyteDB vs CockroachDB, Part 1
- Bringing Truth to Competitive Benchmark Claims–YugabyteDB vs CockroachDB, Part 2
- Why We Built YugabyteDB by Reusing the PostgreSQL Query Layer
- Spanning the Globe without Google Spanner
New Videos
- Developing Cloud Native Spring Microservices with Distributed SQL, Cluster-Aware JDBC Drivers, R2DBC, and Istio
- An Introduction to GraphQL with Distributed SQL
- Install YugabyteDB on macOS using Docker and yugabyted
New and Updated Docs
New array data types and functionality section with 20 new pages, including:
- Synopsis and important topics on array data types
- The
array[]
value constructor - Creating an array value using a literal
- Built-in SQL functions and operators for arrays — covers 30 functions and operators for working with arrays, and examples for each of them
Upcoming Events
- Webinar | YugabyteDB: a distributed PostgreSQL database, May 12, 2020 at 10 am PT [17:00 UTC]
- Webinar | Evaluating CockroachDB vs YugabyteDB: Benchmarks, Architecture, Sharding Strategies & Licensing, May 20, 2020 at 11 am PT [18:00 UTC]
- Open Source Summit North America Virtual Experience, June 29-July 2, 2020
Get Started
Ready to start exploring YugabyteDB features? Getting up and running locally on your laptop is fast. Visit our quickstart page to get started.