CS 3140 Online Session 9-1

Online Session 9-1

SQL as a Programming Language

CS 3140
Database Systems

Opening Exercise

After Online Session 8, we have an SQLite database of stock price data named NASDAQ_200505.db. Its schema is:

    EndOfDayData(symbol, date, open, high, low, close, volume)

with data that look like this:

    AMAT|20050506|15.56|15.66|15.33|15.63|20778700

Now we can write queries in SQL to interrogate the data.

Write SQL queries to answer these requests:

Find all entries where a stock's high price for the day is more than twice the low price for the day.

Find all entries where a stock traded more than 100,000 shares and closed the day more than 25% higher than its opening price.

(Spoiler alert: the answers come next...).

Opening Solution

When we learned about selection operators in relational algebra, we saw that we can write arbitrary conditions on the attributes of a relation. SQL supports all the usual arithmetic and boolean operators we need to write conditions.

Our first query involves comparing the high and low prices for each entry in the table:

    $ sqlite3 NASDAQ_200505.db
    sqlite> SELECT * FROM EndOfDayData WHERE high > 2*low;
    PRTL|20050502|1.47|1.61|0.73|0.77|9296200
    EPMN|20050504|0.14|0.71|0.14|0.66|185800
    INSG|20050504|0.5|1.29|0.5|0.7|1090000
    CALGZ|20050505|0.09|0.09|0.04|0.04|1000
    GCFBW|20050505|0.12|0.55|0.1|0.49|46800
    CALGZ|20050506|0.09|0.76|0.07|0.3|974500

The second requires a compound condition that examines three columns in each tuple:

    sqlite> SELECT * FROM EndOfDayData
       ...> WHERE (volume > 100000) AND (close > 1.5*open);
    EPMN|20050504|0.14|0.71|0.14|0.66|185800
    CALGZ|20050506|0.09|0.76|0.07|0.3|974500

May 4, 2005 (*) was quite a day for EPMN investors. The price ranged widely, went way up in the end, and traded a lot of shares.

(*) May the Fourth be with you!

These two queries produce relations with six and two tuples, respectively. Just how rare are these phenomena? SQL can help us figure that out with one of its aggregation operators:

    sqlite> SELECT COUNT(*) from EndOfDayData;
    16420

So, only 6 trading days out of 16,420 involve stocks whose high price ever reached twice its low price, and only 2 trading days out of the 16,420 traded more than 100,000 shares and closed up more than 25%. Rare, indeed.

We'll talk more about SQL's aggregation operators soon!

Where Are We?

In recent weeks, we have learned about relational algebra, its most important operators, and how to use them to build expressions that query a relational database. We are now ready to write SQL programs that implement our queries against real databases.

In last week's online session, we learned how to import data from CSV text files into SQLite databases. This gives us the ability to work with larger and more interesting databases, such as the NASDAQ database we used in the opening exercise. We are limited mostly by our own curiosity... There are a lot of free and open source data files available on the web.

This week, we turn our attention to SQL as a programming language. We have already used SQL statements of this form repeatedly:

    SELECT attributes FROM relation WHERE condition;

as a way to implement relational queries of this form:

    π_attributes(σ_condition(relation))

Notice that, in SQL, the "select" clause acts like projection, while the "where" clause acts like selection. Programming languages are odd sometimes...

But SQL has many more features than we have seen thus far. Some of these features implement more of the relational algebra we have learned. Others enable programmers to perform other computational tasks that are useful when querying data and presenting results. We will see that SQL is a programming language: not a general-purpose language, but one designed well for the specific task of writing complete database programs.

In this session, we will explore some of the ways that SQL supports basic programming tasks. We will then look at a feature that allows us to implement relational algebra's renaming operator using local names called aliases.

Before we start, let's prepare the database we will use in our examples.

Data for Today

This section creates the database we use as a running example in this session. It teaches you a little new SQL and a little new SQLite. This material is optional. If you'd like to skip it, if only for now, download the database file now and jump to the next section.

With SQLite's .import command, we can rapidly build relational databases from text files. I refer to the data and queries from Homework 3 as examples in today's lecture, so I used what we learned about .import last week and a couple of extensions to build a movies database.

First, I created CSV files for the data listed in the assignment: [ movie.csv | actor.csv | stars_in.csv | director.csv ]. These files include a header line with names for the columns.

Then I used .import to bring the data into tables named Movie, Actor, StarsIn, and Director:

    sqlite> .mode csv
    sqlite> .import movie.csv Movie
    sqlite> .import actor.csv Actor
    sqlite> .import stars_in.csv StarsIn
    sqlite> .import director.csv Director

Notice that I did not create tables first. When the target table does not already exist, .import treats the first line of the CSV file as a header line, defines the table, and names the columns for us. That saved me typing the schema and the CREATE TABLE statement.

Unfortunately, SQLite doesn't know anything about the data in the file except the column headers. As a result, it generates schema that with the most general types possible:

    sqlite> .schema Movie
    CREATE TABLE Movie(
      "title" TEXT,
      "director" TEXT,
      "year" TEXT,
      "rating" TEXT
    );

I wanted better schema, so I exported the auto-generated schema to an SQL file so that I could edit them:

    sqlite> .output homework03-schema.sql
    sqlite> .schema
    sqlite> .quit

The .output dot command tells SQLite where to write its output for the commands that follow. By default, it writes to standard output. My command instructs it to write the output of the .schema command to a file named homework03-schema.sql.

I then edited the schemas to look the way I wanted them, including type information. For example, I modified the Movie schema to this:

    CREATE TABLE Movie(
      title VARCHAR(20),
      director VARCHAR(20),
      year SMALLINT,
      rating DECIMAL(2,1)
    );

Note the datatype I used for the rating, which is a decimal value on a scale of 1-10, with one digit after the decimal point. DECIMAL is one of the datatypes supported by SQL. Wikipedia has a list of the standard types, but it doesn't have any detail about them. I'll create a more detailed description soon and add it to the Course Resources page.

I ended up with a file of schemas that looked just as I wanted them. I then used .import again to re-import the data and create my database file. If you'd like, here is a complete SQL file for the database with all create and insert statements needed to build the database.

The end result is homework03.db, the database we will use throughout this session to illustrate the new features of SQL we learn. Download it now and follow along in a new SQLite session!

Computing Totals Over Relations

In today's opening exercise, we wrote conditions using simple boolean and arithmetic operators:

    ... WHERE (volume > 100000) AND (close > 1.5*open);

I think you'll find that SQL supports all the basic math and logic that you need to manipulate numbers and compare values.

Sometimes, we'd like to compute summary values for a query, such as the total for a set of values. SQL provides a small set of aggregation operators that replace simple loops in a more general language. We saw one in the discussion of our opening exercise, COUNT. Here's another example:

    sqlite> SELECT COUNT(*) FROM Movie;
    4

We can also count results from a narrower query:

    sqlite> SELECT COUNT(director) FROM Movie WHERE rating > 7.5;
    3

This may not be the answer we expected, because only two directors have movies rated over 7.5. We have seen this phenomenon before, in Session 7, when a relation of genres contained duplicates. In relational algebra, relations are sets, but SQL implements relations as bags, not sets.

A bag is like a set, but it keeps track of how many of each item there are. It's a cool data structure with a lot of uses! If you'd like to learn more, read this short section on Wikipedia.

In Session 7, we learned about the modifier DISTINCT, which enforces set behavior. We can use it here, too, to find the number of unique tuples in a result:

    sqlite> SELECT COUNT(DISTINCT director) FROM Movie
       ...> WHERE rating > 7.5;
    2
    sqlite> SELECT COUNT(actor) FROM StarsIn;
    6
    sqlite> SELECT COUNT(DISTINCT actor) FROM StarsIn;
    3

COUNT is one of five aggregation operators in SQL. The others are SUM, AVG, MAX, and MIN. They work much as you might guess from their names:

    sqlite> SELECT MIN(year) FROM Movie;      # year of oldest movie
    1987
    sqlite> SELECT AVG(rating) FROM Movie;    # the average rating
    7.7

These SELECT statements do not return relations. They return scalar values: a single value in one of the datatypes supported by the language. Scalars are useful as the answers to top-level queries such as these, and they will also be useful as parts of larger queries. (Coming soon...)

SQL has lots of features that enable us to write programs more suitable to our needs. For now, let's consider three briefly.

Computed Attributes

Sometimes, we'd like to change a column of values returned by a query. For example, movie ratings are on a 1-10 scale, but we might want to convert them to a scale of 1-5 stars. So we want to see the ratings cut in half. We can do that sort of arithmetic on an item in the list of attributes we give to SELECT:

    sqlite> SELECT rating/2 FROM Movie;
    4.1
    3.8
    3.7
    3.8

This statement requires to several steps in order. We can think of it projecting rating from the Movie relation, and then applying the /2 operation to each tuple in the result.

Constant Attributes

SQL returns a relation for each query. Our DBMS presents that result to us in some form. By default, SQLite uses a list format with column values separated by | characters:

    sqlite> SELECT title, rating/2 FROM Movie;
    Fargo|4.1
    Raising Arizona|3.8
    Spiderman|3.7
    Wonder Boys|3.8

The name here is self-explanatory but, once separated from the query, the rating is just a number. Maybe we'd like to include in our output a label that augments the number. SQL allows us to create a constant column that acts like a label:

    sqlite> SELECT title, rating/2, 'of five stars' FROM Movie;
    Fargo|4.1|of five stars
    Raising Arizona|3.8|of five stars
    Spiderman|3.7|of five stars
    Wonder Boys|3.8|of five stars

Adding a column with a constant value is a feature of SQL. Most DBMSes give us the ability to format output in a more pleasing fashion. SQLite gives us several such features. One very useful one is column mode:

    sqlite> .mode column
    sqlite> SELECT title, rating/2, 'of five stars' FROM Movie;
    title            rating/2  'of five stars'
    ---------------  --------  ---------------
    Fargo            4.1       of five stars  
    Raising Arizona  3.8       of five stars  
    Spiderman        3.7       of five stars  
    Wonder Boys      3.8       of five stars

It presents output in columnar format and adds column headers. In this case, one of our columns is computed and another is a constant, so SQLite uses those expressions for the headers. Later in the session, we'll see how to improve those column names.

Patterns in `WHERE` Clauses

SQL provides the ability to do several kinds of pattern matching in the conditions we write. The most basic uses the form

    s LIKE p

where s is a value, interpreted as a string, and p is a pattern.

For example, we could find all the movies made in the 2000s by selecting all movies where year &ge 2000, using a boolean comparison. But we can also use a pattern:

    sqlite> SELECT * FROM Movie WHERE year LIKE '20__';
    title        director  year  rating
    -----------  --------  ----  ------
    Spiderman    Raimi     2002  7.4   
    Wonder Boys  Hanson    2000  7.6

The pattern '20__' looks for exact matches: strings that start with '20' and contain two more characters.

We can also match arbitrary strings. For example, we can find all movies whose titles start with 'Wonder', regardless of length, using this pattern:

    sqlite> SELECT * FROM Movie WHERE year LIKE 'Wonder%s';
    title        director  year  rating
    -----------  --------  ----  ------
    Wonder Boys  Hanson    2000  7.6

The pattern 'Wonder%s' would also match 'Wonder Woman', if it were in the database.

We have only begun to scratch the surface of what SQL provides as a programming language, even for these features. We'll learn more as time goes by, and look features we might want on the fly. But now let's move on to a feature that both implements relational algebra and expands the scope of the programs we can write.

Aliases and Local Renaming

Recall our example earlier where we converted ratings to a 1-5 scale using a computed attribute:

    sqlite> SELECT title, rating/2 FROM Movie;
    title            rating/2
    ---------------  --------
    Fargo            4.1     
    Raising Arizona  3.8     
    Spiderman        3.7     
    Wonder Boys      3.8

The new column doesn't have a name in the table, so it uses the expression as the attribute name.

SQL allows us to assign a name to an attribute using an AS clause:

    sqlite> SELECT title, rating/2 AS rating FROM Movie;
    title            rating
    ---------------  ------
    Fargo            4.1     
    Raising Arizona  3.8     
    Spiderman        3.7     
    Wonder Boys      3.8

We can rename both columns, if we like:

    sqlite> SELECT title AS name, rating/2 AS rating FROM Movie;
    name             rating
    ---------------  ------
    Fargo            4.1     
    Raising Arizona  3.8     
    Spiderman        3.7     
    Wonder Boys      3.8

The AS clause enables us to give an alias to an attribute or even a relation. In these examples, the alias is useful for presentation purposes. But sometimes we may need an alias to even write the query we desire.

Consider this request: We would like to find all pairs of actors who have starred in a film together. Our StarsIn relation has all information we need to solve this problem:

    sqlite> SELECT * FROM StarsIn;
    actor      movie          
    ---------  ---------------
    Cage       Raising Arizona
    Maguire    Spiderman      
    Maguire    Wonder Boys    
    McDormand  Fargo          
    McDormand  Raising Arizona
    McDormand  Wonder Boys

All we need to do is to look at all combinations of tuples and find combinations with the same value for movie. But we have a problem when we try to find the product of StarsIn with: itself:

    sqlite> SELECT * FROM StarsIn, StarsIn;
    Error: ambiguous column name: main.StarsIn.actor

The error message here isn't as helpful as it might be. We can figure out the problem if we think about what we know when we join two tuples with a commonly-named attributes... We tell the two attributes apart by prefixing the attribute name with the relation name. But here the relations have the same name, so the names are still ambiguous!

In relational algebra, we could solve this problem by using the renaming operator, ρ, to create new relations with different name:

    ρ_StarsIn1(StarsIn) ⨯ ρ_StarsIn2(StarsIn)

SQL doesn't have a renaming operator, but we can use AS clauses to create local aliases for our relation, compute a product, and then look the same movie title in both columns:

    sqlite> SELECT StarsIn1.actor, StarsIn2.actor
       ...> FROM StarsIn AS StarsIn1, StarsIn AS StarsIn2
       ...> WHERE StarsIn1.movie = StarsIn2.movie;
    actor      actor    
    ---------  ---------
    Cage       Cage     
    Cage       McDormand
    Maguire    Maguire  
    Maguire    Maguire  
    Maguire    McDormand
    McDormand  McDormand
    McDormand  Cage     
    McDormand  McDormand
    McDormand  Maguire  
    McDormand  McDormand

Aliases to the rescue!

We still have a couple of problems. First, our query thinks that actors star in movies with themselves. Now, I think Frances McDormand is an amazing actor, but she can't star in a movie with herself. Let's only keep tuples where the actors' names are different:

    sqlite> SELECT StarsIn1.actor, StarsIn2.actor
       ...> FROM StarsIn AS StarsIn1, StarsIn AS StarsIn2
       ...> WHERE StarsIn1.movie = StarsIn2.movie
       ...>   AND StarsIn1.actor != StarsIn2.actor;
    actor      actor    
    ---------  ---------
    Cage       McDormand
    Maguire    McDormand
    McDormand  Cage     
    McDormand  Maguire

Better, but one problem remains. Our cross product has some tuples in which Cage is chosen from Stars1 and other tuples in which Cage is chosen from Stars2. We can winnow our result further by looking for combinations that favor actors in alphabetical order:

    sqlite> SELECT StarsIn1.actor, StarsIn2.actor
       ...> FROM StarsIn AS StarsIn1, StarsIn AS StarsIn2
       ...> WHERE StarsIn1.movie = StarsIn2.movie
       ...>   AND StarsIn1.actor < StarsIn2.actor;
    actor      actor    
    ---------  ---------
    Cage       McDormand
    Maguire    McDormand

And there we have it!

This example gives you a taste of what is possible when using a local alias to rename a relation. Next session, we will see more problems in which renaming an attribute or a relation within a query enables us to write more complex queries. Those queries will also use another idea that we saw briefly last week: subqueries.

Wrap-Up

There is no deliverable for this session. Just play! Try the queries you've seen in this session in SQLite. If you are feeling adventurous, write a few more SQL queries of your own design against the movie database or the stock database. Let me know if you learn anything interesting about the data as a result.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 15, 2020