CS 3140 Online Session 13

Online Session 13

Views and Indexes

CS 3140
Database Systems

We use the Homework 7 database for the opening exercise and an extended movie database for the rest of this session. Download them now so that follow along.

Opening Exercise

Last session, we revisited our computer product database:

    Product(maker, model, type)
    PC(model, speed, ram, hd, price)
    Laptop(model, speed, ram, hd, screen, price)
    Printer(model, color, type, price)

and defined a view on it named ComboMaker:

    CREATE VIEW ComboMaker(maker, pcModel, laptopModel) AS
      SELECT maker, P1.model, P2.model
      FROM Product AS P1
             INNER JOIN
           Product AS P2
           USING (maker)
      WHERE P1.type = 'pc'
        AND P2.type = 'laptop';

Use ComboMaker to write an SQL query:

find all laptop models under $1000 from manufacturers that sell both PCs and laptops

Return the model number and price in your solution.

Opening Solution

Without the ComboMaker view, this task would require a fair bit of work. With it, though, we need only a relatively standard query:

    sqlite> SELECT model, price
            FROM   Laptop
            WHERE  price < 1000 AND
                   model IN (SELECT laptopModel
                             FROM ComboMaker);
    model  price
    -----  -----
    2002   949  
    2003   549

This solution shows that we can write queries that mix tables and views. The two kinds of relations can be used interchangeably in most queries.

Recall that ComboMaker is a name for the query that defines it. You can think of our solution as shorthand for:

    SELECT model, price
    FROM Laptop
    WHERE price < 1000  AND
          model IN (SELECT laptopModel
                    FROM (SELECT maker, P1.model, P2.model
                          FROM Product AS P1
                                 INNER JOIN
                          Product AS P2
                          USING (maker)
                          WHERE P1.type = 'pc' AND
                                P2.type = 'laptop'));

I hopes it is apparent from this substitution how valuable creating views can be for both writing and reading queries!

We could also solve this exercise using by joining ComboMaker and Laptop. This would also make it easy to include the manufacturer in our result:

    sqlite> SELECT DISTINCT model, price, maker
            FROM Laptop, ComboMaker
            WHERE Laptop.price < 1000  AND
                  Laptop.model = ComboMaker.laptopModel;
    model  price  maker
    -----  -----  -----
    2002   949    E    
    2003   549    E

We don't need the qualifiers in the WHERE clause, but sometimes they make a query easier to read than it would be without them.

Here is an SQL source file containing these two queries. Download it, study it, and experiment.

Where Are We?

We are finishing off the semester by looking at how we can work with bigger databases and how we can write and manage bigger programs.

Last time, we learned about views, a way to name a query and treat its result as a relation, similar to the physical tables that make up a database. Views allow us to encapsulate complex queries, which makes it easier to read and write other queries. They also provide an abstraction layer on top of the base tables in the database, which enables us to think about queries in a different way. For both reasons, views help us write and manage bigger SQL programs.

This time, we learn about a technique that helps us work with bigger databases by making our queries more efficient under the hood.

Querying Large Databases

When a database gets large enough, scanning a table row by row to find tuples that match a condition can be quite expensive. Consider this more complete set of movie relations we first used in Homework 3 and created in SQLite back in Online Session 9-1:

    Movie(title, year, director, rating, genre)
    Actor(actor, yearBorn)
    StarsIn(actor, title, year)
    Director(director, yearBorn)

Suppose we would like to find all the scifi movies made in 2000:

    SELECT *
    FROM   Movie
    WHERE  year = 2000 AND genre = 'scifi';

IMDB has approximately 550,000 movie titles in its database, but only a dozen or so are scifi movies made in 2000. Searching through the table in a linear fashion would use a lot of time and CPU. When the table is stored on disk, a linear approach likely requires the DBMS to load in many pages from the file system, taking more time and space.

It would be much more efficient if we could retrieve only the several hundred titles released in 2000 and search them one by one to select the scifi movies. It would be even more efficient if we could retrieve directly the dozen or so movies that satisfy both conditions.

If we wanted the details of a specific film, say:

    SELECT *
    FROM   Movie
    WHERE  title = 'Star Wars' AND year = 1977;

we would be doing all that work to find exactly one tuple!

Efficiency becomes especially vital for queries that involve products and joins. Suppose we wanted a list of films made by young directors in 2000:

    SELECT title, year
    FROM   Movie NATURAL JOIN Director
    WHERE  year = 2000 AND yearBorn > 1970;

The join of Movie and Director has as many tuples as Movie itself. But the set of movies made in 2000 is much smaller, and the set of directors born after 1970 is very much smaller. We can imagine our DBMS being able to execute this query much faster, if it could take advantage of these facts.

With an index, it can.

Indexes

An index is a data structure that enables the DBMS to find tuples that have given values for a specific attribute or set of attributes, In a typical database system, indexes are implemented using a B-tree, a self-balancing tree that allows direct queries, sequential queries, insertions, and deletions all in logarithmic time. This is a vast improvement over linear time when processing large tables.

For simplicity, we can think of an index as a binary search tree containing (key, value) pairs, where the keys are values of the indexed attribute A. This enables the DBMS implement queries such as A = 3 or even A ≥ 3.

The SQL syntax for creating indexes depends on the DBMS. The syntax used in SQLite is typical:

    CREATE INDEX name_of_index
    ON table(attribute, ...);

All three queries above refer to the year attribute in Movie. If this is a common kind of query in our applications, we can speed up thos searches with an index:

    CREATE INDEX YearIndex ON Movie(year);

The result is an index organizing all the tuples of the Movie relation keyed on the year column. Once we create YearIndex, SQLite may execute any query on Movie that refers to year using the index rather than by processing Movie's tuples sequentially.

We can build indexes on multiple attributes. In many applications, a relation's key is a common set of attributes in a query. For example, we looked up the details of the original "Star Wars" in one of our queries above. We can create an index on all of the key attributes with a common-separated list:

    CREATE INDEX KeyIndex ON Movie(title, year);

The key identifies a unique tuple, so using a key index can result in much better performance.

As already noted, we don't have to do anything to use an index. SQLite will use the available indexes whenever they are appropriate to a query. We can see this by using the EXPLAIN QUERY PLAN command:

    sqlite> EXPLAIN QUERY PLAN
       ...> SELECT * FROM Director WHERE yearBorn < 1955;
    QUERY PLAN
    `--SCAN TABLE director

    sqlite> EXPLAIN QUERY PLAN
       ...> SELECT * FROM Movie WHERE year < 2001;
    QUERY PLAN
    `--SEARCH TABLE movie USING INDEX YearIndex (year<?)

With no index on Director's yearBorn attribute, all SQLite can do is scan the table tuple-by-tuple. The YearIndex enables the DBMS to search the table using the index's data structure.

Quick Exercise

Our movie database includes four tables:

    Movie(title, year, director, rating, genre)
    Actor(actor, yearBorn)
    StarsIn(actor, title, year)
    Director(director, yearBorn)

For each table, write the SQL statement needed to create an index over one or more of its attributes. Select your attribute sets based on what you think might be the most sommon query over the table.

How many possible indexes can we create over each of these tables?

Selecting Indexes

It might seem that indexes are always good, so we should create as many of them as possible. That's not feasible, given the number of possible indexes that exist, but it's also not practical. Indexes impose costs, too. Just as with views, we have to consider the trade-off between the benefit of indexes and their costs.

The primary benefit of an index is the speed of executing queries. An index enables the DBMS to find tuples more quickly when a query specifies a value for an indexed attribute, or even a range of values. Indexes can also help the DBMS perform joins more quickly, when the join involves an indexed attribute.

Indexes come with two costs:

An index is a data structure, so it takes up space in the database.

More importantly, modifying a table (insert, delete, or update) takes more when there is an index on one of the table's attributes. The DBMS has to update both the table and the index.

The cost of querying and modifying a table is a bit more complicated than simply searching a data structure in memory. In an industrial application, databases are often large enough that only a portion of its tuples will reside in memory at a given time. The rest will reside on disk, spread across many blocks or pages. A query may require loading multiple pages from disk, which operates at a slower rate than memory access. Modifications require both reading and writing multiple pages, which takes even longer.

With more time, we could look at this process in more detail and develop a cost model for database queries and modifications that would help us understand the trade-offs of using indexes better and even help us design better indexes. Unfortunately, that is beyond the scope of this course.

Instead, let's consider at a higher level some of the useful indexes we can create and how we might identify others.

A key index is often the most useful index we can create, for two reasons:

Searches on the key are common, so queries involving the key's attributes are executed frequently.

Keys identify unique tuples, so queries that specify values for the key attributes need access only one tuple, on at most one disk page.

Non-key indexes can also be useful. Sometimes, an attribute is not a key but still identifies a small set of tuples. For instance, a director creates at most a few movies each year. An index on director in the Movie relation can greatly reduce time executing queries that involve directors. The same is true for parts of a key, such as year or title in Movie.

So, how can we determine how many indexes to create, and which ones? The more indexes we have, the more likely a particular query can be accelerated by an index. If we modifythe database often enough, though, the cost imposed by updating may offset the advantage for queries. And let's not forget that insertions and deletions often use a query to find or compute tuples, so even they can benefit some from a well-chosen index!

To determine whether a new index adds more than it costs, we can identify the mix of statements most likely to be executed and compute the average cost of that mix of statements with and without an index. Sometimes, we have access to a log of queries made over time that we can use to identify that mix. (DBMS's often log all queries and modifications by default, or they can if the feature is turned on.) If not, we can make educated guesses about the mix based on the application itself.

Consider UNI's student information system. Department heads and faculty members query the student database frequently using either student number or student name, while students are added to the database much less often and primarily at specific times of the year. Indexes on student number (the key) and on first name and last name are likely to be valuable. Students and faculty members alike query the course database frequently using the course number, and most queries involve the current semester or the next semester. Indexes in these attributes will likely be valuable.

Wrap Up

The selection and implementation of indexes is an area worthy of long, deep study. Alas, our semester is coming to an end.

If you would like to learn more, let me know.

Otherwise, let's close with two useful bits of knowledge about SQLite. First, when we define a PRIMARY KEY for a table or declare an attribute to be UNIQUE, SQLite usually creates an index the movie database, which includes keys for all for tables, using SQLite's .index dot command:

    sqlite> .index
    sqlite_autoindex_Actor_1     sqlite_autoindex_Movie_1   
    sqlite_autoindex_Director_1  sqlite_autoindex_StarsIn_1

When we create our index on year, it shows up in the list of indexes, too:

    sqlite> CREATE INDEX YearIndex ON Movie(year);
    sqlite> .index
    YearIndex                    sqlite_autoindex_Movie_1   
    sqlite_autoindex_Actor_1     sqlite_autoindex_StarsIn_1 
    sqlite_autoindex_Director_1

And, if you ever want to delete an index, you can use the DROP verb we have now seen several times:

    sqlite> DROP INDEX YearIndex;
    sqlite> .index
    sqlite_autoindex_Actor_1     sqlite_autoindex_Movie_1   
    sqlite_autoindex_Director_1  sqlite_autoindex_StarsIn_1

Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 14, 2020