Session 14

The Final Query


CS 3140
Database Systems


These notes are very rough. I will improve them over time.


Opening Exercise

Recall our movie database with tables for Alice and Kurt:

    name                 year   length   genre
    --------------------------------------------     Alice
    The Big Chill        1983   105      drama
    Blade Runner         1982   117      scifi
    The Princess Bride   2020   98       fantasy
    Lethal Weapon        1987   109      action
    Black Panther        2018   134      action
    The Princess Bride   1987   98       fantasy
    The Replacements     2000   118      comedy
    Star Trek            1979   132      scifi
    --------------------------------------------     Kurt
    Black Panther        2018   134      action
    The Princess Bride   1987   98       fantasy
    The Replacements     2000   118      comedy
    Star Trek            1979   132      scifi
    Lethal Weapon        1987   109      action
    Return of the Jedi   1983   131      scifi
    48 Hrs.              1982   96       comedy
    The Big Chill        1983   105      drama

Write the SQL statements needed to...



Opening Solution

When we create an index, it applies to a single table. So we need to create two indexes here:

    sqlite> CREATE INDEX LengthIdx1 ON Alice(length);
    sqlite> CREATE INDEX LengthIdx2 ON Kurt(length);

The second task requires a cross product:

    sqlite> .mode column
    sqlite> SELECT Alice.name, Kurt.name
            FROM Alice,Kurt
            WHERE Alice.length + Kurt.length < 210;

The result is not ideal. It contains pairs with repeats when both collections contain the same movie. The tuples are also in no particular order. So let's add a condition and a sorting clause:

    sqlite> SELECT Alice.name, Kurt.name
            FROM Alice,Kurt
            WHERE Alice.length + Kurt.length < 210 AND
                  Alice.name != Kurt.name
            ORDER BY Alice.name;

This list looks like it has duplicates, but remember that (name, year) is the key for these tables. There are two different "Princess Bride" movies in Alice's collection! So let's project the year columns, too:

    sqlite> SELECT Alice.name, Alice.year, Kurt.name, Kurt.year
            FROM Alice,Kurt
            WHERE Alice.length + Kurt.length < 210 AND
                  Alice.name != Kurt.name
            ORDER BY Alice.name;

One downside to this query is that it doesn't return back-to-back Princess Brides even when they are different versions. Does that seem reasonable? How would you remedy this situation?

Alas, this query does not use either of our indexes. It adds the two attributes before it compares the sum to a constant. There are limits to when SQL can use an index to improve performance.

Here is an SQL source file with the final version of our statements.



One Last "Where Are We?"

For fourteen weeks, we have been learning how to design, implement, and use relational databases. Most recently, we learned how to create views that simulate phyical tables and indexes that enable SQL to search tables more efficiently.

Throughout the semester, I have spoken of "relational databases" to distinguish them non-relational databases... What other kinds of database databases are there? When might we use them? Why do we focus our study on only one kind?



Alternatives to the Relational Model

... the story from Day 1: flat files. Why a database? Concerns about data integrity and consistency, efficiency, and usability.

The relational model solves these problems using a small set of operators and clever implementations. It enabled users to "scale up" by adding more and more rows to tables with little or no loss of integrity, consistency, and usability.

Then... people found that the relational model was not perfect. It was easy to "scale up" but not "scale out" to distributed implementations. But it still provided the best trade-off of integrity, efficiency, usability, and cost, so companies stuck with it.

The first big break: MapReduce (story)

    - Google wanted to index the World Wide Web
    - unstructured data (web pages)
    - data changing all the time
    - commodity hardware; cost of disk, RAM, and processor
    - parallelism
Led to Hadoop, a new kind of software infrastructure for building DBs. Replace DBMS with platform.

NoSQL Systems

NoSQL databases are databases that store data in a format other than relational tables.

These alternatives began to emerge around 2000, as the cost of storage and memory decreased rapidly. They enabled users to ensure data integrity without needing to create and maintain a complex data model. As the cost of hardware dropped, developers became the primary cost of software development. NoSQL databases attempt to optimize for developer productivity.

Other motivations include simpler database designs, finer-grained control over availability, and simpler horizontal scaling to clusters of machines.

Types of NoSQL Databases

Four major types of NoSQL databases emerged over time:

Advantages of NoSQL

The various types of NoSQL databases tend to offer a similar set of benefits:

The Pendulum Swings Back

Scalable, reliable, flexible. That all sounds great. Why doesn't everyone use NoSQL all the time?

As many in industry moved to NoSQL, they learned that it, too, is not perfect. It was hard to match run-time efficiency of the relational DBMSs, and the constraints of relational model actually help to ensure correctness. (Think about type checking, Java v Python.) The pendulum began to swing back.

First, people implemented SQL interfaces on top of Hadoop and other NoSQL systems. Then came a new set of scalable databases that embraced SQL fully. Google led the way here, too, with a 2012 paper whose authors include the original MapReduce authors.

At the same time, the SQL community began to assimilate many of the features of NoSQL, adding new datatypes, better support for partitioning and replication, and the ability to write parallelized data definitions.

Researchers at Google wrote that application developers rely heavily on certain features in their DBMS:

They concluded that NoSQL methods are optimal for retrieving specific items and scanning ranges of values in tables. They work best in scenarios where retrieval is relatively straightforward.

SQL, on the other hand, provides significant value in expressing more complex data access patterns and pushing computation down to the data.



So Here We Are

In networking, there is stack of technologies, from infrastructure such as copper and fiber wires on the bottom to applications such as the web and email on the top. In order to make the stack work, programmers end up writing a lot of "glue code" that ties things together. The thing that makes all of this work as well it does is IP, the Internet Protocol. It is the universal interface by which every other layer interacts with the other.

The domain of databases has a similar shape. There is stack of technologies, from infrastructure such as physical storage and B-trees on the bottom to applications such as student information systems and data visualization tools on the top. In order to make the stack work, programmers end up writing a lot of "glue code" that ties things together.

This is the power of SQL. It has become the universal interface for data analysis, which makes the relational model an essential element of all database systems. And, unlike IP, it's a human readable language -- a programming language.

So perhaps now you can understand the focus in this course on the relational data model and SQL. You are learning the basic ideas and skills that apply no matter what kind of database model you end up working with.



Bonus SQL: Using a Database to Solve a Problem

Knowing how to write programs gives you a new power. Whenever you encounter a problem, you can ask yourself, "How might a program help me solve this?"

The same is true for many more specialized CS skills. People who know how to implement a language interpreter can ask themselves, "How might a language help me solve this problem?"

The same is true for databases.

On Tuesday, I asked the question, "Can a database help me create a spell checker?"

On Thursday, I asked the question, "Can a database help me build a passphrase generator?"

You can use the tools you have learned this semester to represent, store, and use information. That's a power you can use to solve many problems.

If you would like to play around with files and SQL code for either demo, please do!



All That's Left Is The Final

There is nothing new from this week on the final.

The material on the sxam will be split roughly like this:



Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 19, 2020