Session 12

Improving Schema via Decomposition


CS 3140
Database Systems


Quick Look Back

After several weeks of learning how to query relational databases, we are taking a step back to make sure we have relational schemas that can be used effectively.

In class last week, we saw that redundancy in a relation not only wastes space and makes data harder to understand, but it also can lead to at least three kinds of usability problems in our databases:

We'd like a way to be able to tell if a schema is susceptible to these problems and, if so, how we can modify the schema in order to eliminate them.

Last time, we also learned about functional dependency, a generalization of a relation's key. Functional dependencies can help us diagnose and re-design a schema because they help us identify ways in which the relation is trying to do too much.

In last week's online session, we learned the SQL statements for inserting, deleting, and updating information in a database. If our relations are poorly designed and susceptible to anomalies, using these statements can create errors and inconsistency in our data.



Quick Look Ahead

The basic idea that ties functional dependencies to anomalies is this:

Whenever there is a functional dependency in a relation whose lefthand side is not a key or a superset of a key, then that dependency creates the potential for redundancy.

I'm a professor.  Let's just assume I'm right.

So, functional dependencies identify "cleavage" points in a relation: a place where we can decompose the schema breaks into two or more smaller relations with fewer anomalies. This ideas means that, rather than hacking away at a schema in the hope of making it better, we can use relational theory to redesign the schema in a principled way.

Today we learn two new ideas:

and then tie them together in a technique for using functional dependencies to decompose relations in a way that ensures they are in a desirable normal form.



Decomposing a Relation

We can decompose one relation into two relations by partitioning its set of attributes into two sets. If we have:

    R(A1, A2, ..., Ap)
we can decompose R into two relations, S and T:
    S(B1, B2, ..., Bq)
    T(C1, C2, ..., Cr)
where:
    { A1, A2, ..., Ap } = { B1, B2, ..., Bq } ∪ { C1, C2, ..., Cr }

    S = πB1, B2, ..., Bq(R)

    T = πC1, C2, ..., Cq(R)

The sets B and C do not have to be disjoint; they can share one or more attributes.

Recall the Movie-v0 database we used last time:

    title            year  rating  director  yearBorn  actor    
    ---------------  ----  ------  --------  --------  ---------
    Fargo            1996  8.2     Coen      1954      McDormand
    Raising Arizona  1987  7.6     Coen      1954      Cage     
    Raising Arizona  1987  7.6     Coen      1954      McDormand
    Spiderman        2002  7.4     Raimi     1959      Maguire  
    Wonder Boys      2000  7.6     Hanson    1945      Maguire  
    Wonder Boys      2000  7.6     Hanson    1945      McDormand

Its schema is:

    Movie-v0(title, year, rating, director, yearBorn, actor)

We could decompose Movie-v0 in many different ways. (How many?) We could split them down the middle:

    A = { title, year, rating }
    B = { director, yearBorn, actor }
or we could have some overlap:
    A = { title, year, rating, director }
    B = { rating, director, yearBorn, actor }
or we could separate the ends from the middle:
    A = { title, actor }
    B = { year, rating, director, yearBorn }
As you might guess, none of these decompositions make much sense or help create a better database. They are arbitrary choices that have nothing to do with what the database means.

Instead, let's decompose the Movie-v0 relation into two relations with schemas consisting of these sets of attributes:

    A = { title, year, rating, director, yearBorn }
    B = { title, year, actor }

This would result in two relations:


    title            year  rating  director  yearBorn -- Movie-v1a
    ---------------  ----  ------  --------  --------
    Fargo            1996  8.2     Coen      1954    
    Raising Arizona  1987  7.6     Coen      1954    
    Raising Arizona  1987  7.6     Coen      1954    
    Spiderman        2002  7.4     Raimi     1959    
    Wonder Boys      2000  7.6     Hanson    1945    
    Wonder Boys      2000  7.6     Hanson    1945    


    title            year  actor                      -- Movie-v1b
    ---------------  ----  ---------
    Fargo            1996  McDormand
    Raising Arizona  1987  Cage     
    Raising Arizona  1987  McDormand
    Spiderman        2002  Maguire  
    Wonder Boys      2000  Maguire  
    Wonder Boys      2000  McDormand

The new relations eliminate one of the deletion anomalies that plagues Movie-v0: If we decide not to list McDormand as a star of "Fargo" in Movie-v1b, we still have a tuple in Movie-v1a that contains all the other information about "Fargo". Nothing is lost except the information we chose to delete.

The improvement is not an accident. As we saw last time, the set { title, year, actor } is a key for Movie-v0. This followed from the fact that the functional dependency (FD):

    title year actor → rating director yearBorn
holds for the relation. But we also saw that:
    title year → rating director yearBorn
is also a functional dependency on the relation. Its key is a subset of the relation's key. That means there is a subset of attributes with higher cohesion than the relation's key can provide.

I used the latter FD to factor out a new relation containing only the attributes of that dependency. My second relation consists of all the original attributes except for { rating, director, yearBorn }, the ones on the righthand side of the FD I used to decompose the relation.

Relation Movie-v1b is now in good shape. It may seem as if it contains some redundancy, because there are two tuples for "Raising Arizona" and "Wonder Boys". But those repetitions are necessary: they allow us to represent the fact that two actors can star in the same film. Its key consists of all three attributes.

Movie-v1a still has some problems... All the other anomalies we noted last time, such as updating the rating for "Raising Arizona" or deleting "Spiderman", are present here. So we need to decompose this relation further.



Boyce-Codd Normal Form

The goal of decomposing a relation with anomalies is to produce two or more relations that do not have any anomalies.

There is a simple condition that holds for a relation that does not exhibit update, deletion, and insertion anomalies:

Whenever there is a functional dependency
    I1,...,Ij → D1,...,Dk
in the relation, then either { I1,...,Ij } is a key for the relation or it contains a key.

Two notes:

This condition is named Boyce-Codd normal form, usually abbreviated as BCNF. This name carries a lot of computer science history: Codd invented the relational data model, and Boyce co-created SQL.

BCNF illuminates the idea that functional dependencies identify cleavage points in a relation: a place where we can decompose the schema breaks into two or more smaller relations with fewer anomalies. FDs from non-key attributes to other attributes lead to tuples with redundant information.

As we just saw, Movie-v0 is not in BCNF. Its key is { title, year, actor }, but it contains a functional dependency:

    title year → rating director yearBorn
whose lefthand side is a proper subset of the relation's key. It is neither a key for the relation nor a superset of the key.

What about the new relations we created by decomposing Movie-v0?

There is a functional dependency in Movie-v1a that violates the BCNF condition. Can you see it?



Decomposing a Relation into BCNF

We can use functional dependencies as the basis for an algorithm that takes any relation R with attributes A and returns a set of relations R1, R2, ... that are in BCNF:

  1. If R is in BCNF, return R.

  2. Choose one of the functional dependencies that violates BCNF, X → Y. Decompose R into:

  3. Recursively decompose R1 and R2 using this algorithm. Return the union of the results.

Another way to say   A – Y   is   X ∪ (A – (X ∪ Y))  : X plus all the attributes not involved in the functional dependency.

This algorithm is guaranteed to terminate. R1 and R2 each have fewer attributes than R. If we ever get down to a relation with only two attributes, it must be in BCNF. (Can you see why?) Otherwise, there is functional dependency we can use in Step 2 to decompose further.

If you look back at how we decomposed Movie-v0 earlier, you'll see that we did just this. We identified a functional dependency that violated BCNF and created two new relations whose attribute sets match Step 2.

Step 3 says to recursively apply the algorithm to Movie-v1a and Movie-v1b. As we have seen, Movie-v1b is in BCNF, so we can return it.

Movie-v1a(title, year, rating, director, yearBorn) is not in BCNF. It contains at least one functional dependency that violates BCNF:

    director → yearBorn
If we know the director's name, then we know the year in which the actor was born. { director } is not a superkey, so this FD tells us that Movie-v1a is not in BCNF.

This FD also gives us a way to decompose the relation further. We create two new relations:

This would result in two relations:


    director  yearBorn                          -- Movie-v2a
    --------  --------
    Coen      1954    
    Raimi     1959    
    Hanson    1945 


    title            year  rating  director     -- Movie-v2b
    ---------------  ----  ------  --------
    Fargo            1996  8.2     Coen    
    Raising Arizona  1987  7.6     Coen    
    Spiderman        2002  7.4     Raimi   
    Wonder Boys      2000  7.6     Hanson

When we recurse on these two relations, we find that both are in BCNF. Movie-v2a consists of only two attributes, so it must be in BCNF. Movie-v2b contains no functional dependencies except for the one specified by the key:

    title year → rating director
So it is in BCNF, too.

Our two decompositions of Movie-v0 resulted in three new relations:

    Movie-v1b(title, year, actor)
    Movie-v2a(director, yearBorn)
    Movie-v2b(title, year, rating, director)
You may recognize these schema... They define the relations StarsIn, Director, and Movie from our old Movie database! They are precisely the relations I combined to create Movie-v0 for Session 11.

Notice how, after the second decomposition, we have eliminated all of the anomalies we noted last time. We can update the rating for "Raising Arizona" by modifying one tuple. We can delete "Spiderman" without losing any unrelated information. We can insert a new movie without requiring information unrelated to the movie itself. Decomposition has achieved its goal.

Using functional dependencies and Boyce-Codd normal form, we know that these relations are free of update, deletion, and insertion anomalies. Yay! One question remains: have we lost any information by decomposing the relation?



A Quick Exercise

Consider this relation about movies and the studios that make them:

    MadeBy(title, year, studio, president, presAddress)
with these functional dependencies:
    title year → studio president presAddress
    studio → president presAddress
    president → presAddress
(title, year) is the only key for the relation.

Decompose this relation into BCNF.



A Solution

The first functional dependency does not violate BCNF. (Why?)

The other two do. (Why?) We can use either to decompose MadeBy.

Let's use studio → president presAddress.

Our algorithm says to create two new relations.

    R2(studio, president, presAddress)
    R3(title, year, studio)
R3 contains studio and all of the other attributes not involved in the dependency.

The only functional dependency in R3 is title year → studio, which has the relation's key on its lefthand side. So it is in BCNF.

The key for R2 is studio, as it determines all the other attributes in the relation. But R2 includes a functional dependency -- president → presAddress -- whose lefthand side is not a superkey. So R2 is not in BCNF.

We can use this dependency to decompose the new relation. Our algorithm says to create two new relations:

    R4(president, presAddress)
    R5(studio, president)
R5 contains president and all of the other attributes not involved in the dependency.

Both relations consist of two attributes, so they are BCNF. But we can also see from the original list of functional dependencies that each contains one FD matching its key.

This leaves us with three new relations, all in BCNF:

    R3(title, year, studio)
    R4(president, presAddress)
    R5(studio, president)
We might call these Movie, StudioPresident, and Studio, respectively.

If you'd like more practice, use the dependency president → presAddress to take the first step. Do you end up with the same end result? Why or why not?



Recoverability

Decomposing a relation using the algorithm above achieves its intended goal: it creates a set of relations all in BCNF, which means that they do not harbor update, deletion, and insertion anomalies. But we need for it to achieve a second essential goal: the set of new relations faithfully represents the data in the original relation. We don't want for decomposition to reduce the set of queries we can ask of the database.

This second goal is roughly equivalent to a very simple idea: Can we recover the original relation from the tuples in the set of new relations?

Theory has been used to prove that our decomposition algorithm, driven by BCNF-violating functional dependencies, produces relations that can be joined to produce all and only the tuples in the original relation. That proof is beyond the scope of this course. However, you can see an example of it in action by computing the natural join of Movie-v1b, Movie-v2a, and Movie-v2b. The result will be the tuples of Movie-v0.

Here is a zip file containing the Movie-v0 database and CSV files for all the subrelations we created during the session. You can import them into SQLite to create the relations we saw above.

In this sense, the set of new relations faithfully represents the data in the original relation. It turns out that there is one faithfulness property that our algorithm cannot guarantee, related to the functional dependencies of the original and new relations. Other decomposition algorithms guarantee both recoverability of tuples and preservation of functional dependencies, but they cannot guarantee the elimination all anomalies!

Database design theory is a deep and rich area of study. One course is not enough to explore it all. Our two week excursion is enough to give us (1) databases that we can work with effectively and (2) an appreciation for the issues of database design and the techniques available to us.



Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 5, 2020