Intro to DB

Table of Contents

Database on Lagunita

1 SQL mini-course

1.1 Lecture videos

1.1.1 The Join family of Operators

1.1.2 Aggregation

select avg(GPA)
from Student;

select min(GPA)
from Student, Apply
where Student.sID = Apply.sID and major = 'CS';

select avg(GPA)
from Student
where sID in (select sID from Apply where major = 'CS');

select count(*)
from College
where enrollment > 15000;

select count(*)
from (select distinct sID from Apply where cName = 'Cornell');

select count(distinct sID)
from Apply
where cName = 'Cornell';

select (select avg(GPA) from Student
        where sID in (select sID from Apply where major = 'CS')) -
       (select avg(GPA) from Student
        where sID not in (select sID from Apply where major = 'CS'));

select cName, count(*)
from Apply
group by cName;

select state, sum(enrollment)
from College
group by state;

select cName, major, max(mx - mn)
from (select cName, major, min(GPA) as mn, max(GPA) as mx
      from Student join Apply using(sID)
      group by cName, major) M;

select sID, sName, count(distinct cName)
from Student join Apply using(sID)
group by sID;

select sID, sName, count(cName)
from Student join
     (select distinct sID, cName
      from Apply) using(sID)
group by sID;

select sID, sName, count(distinct cName)
from Student left join Apply using(sID)
group by sID;

select sID, sName, count(distinct cName)
from Student join Apply using(sID)
group by sID
union
select sID, sName, 0
from Student
where sID not in (select sID from Apply);

select cName
from Apply
group by cName
having count(*) < 5;

select cName
from (select distinct cName from Apply) S
where (select count(*) from Apply where cName = S.cName) < 5;


select cName
from Apply
group by cName
having count(distinct sID) < 5;

select major
from Student join Apply using(sID)
group by major
having max(GPA) < (select avg(GPA) from Student);

1.1.3 NULL values

select sID, sName, GPA
from Student
where GPA <= 3.5 or GPA > 3.5 or GPA is null;

select sID, sName, GPA, sizeHS
from Student
where GPA <= 3.5 or sizeHS < 1600 or sizeHS >= 1600;

select count(*)
from Student
where GPA is not null;

SQL uses three-valued logic (3VL).

1.1.4 Data Modification Statements

insert into College
values ('Carnegie Mellon', 'PA', 11500);

insert into Apply
select sID, 'Carnegie Mellon', 'CS', null
from Student
where sID not in (select sID from Apply);

insert into Apply
select distinct sID, 'Carnegie Mellon', 'EE', 'Y'
from Apply
where cName <> 'Carnegie Mellon' and
      major = 'EE' and
      decision = 'N';

delete from Student
where sID in (select sID
              from Apply
              group by sID
              having count(distinct major) > 2);

delete from Apply
where sID in (select sID
              from Apply
              group by sID
              having count(distinct major) > 2);

delete from College
where cName not in (select cName
                    from Apply
                    where major = 'CS');

update Apply
set major = 'economics',
    decision = 'Y'
where cName = 'Carnegie Mellon' and
      sID in (select sID from Student where GPA < 3.6);

update Apply
set major = 'CSE'
where major = 'EE' and
      sID in (select sID
              from Student
              where GPA = (select max(GPA)
                           from Apply join Student using(sID)
                           where major = 'EE'));

select * from Apply
where major = 'EE' and
      sID in (select sID
              from Student
              where GPA >= all (select GPA
                                from Student join Apply using(sID)
                                where major = 'EE'));

update Student
set GPA = (select max(GPA) from Student),
    sizeHS = (select min(sizeHS) from Student);

update Apply
set decision = 'Y';

1.2 Exercises

1.2.1 Movie-Rating Query Core

Find the titles of all movies directed by Steven Spielberg.

select title
from Movie
where director = 'Steven Spielberg';

Find all years that have a movie that received a rating of 4 or 5, and sort them in increasing order.

select year
from Movie
where mID in (select mID
              from Rating
              where stars >= 4)
order by year;

Find the titles of all movies that have no ratings.

select title
from Movie
where mID not in (select mID from Rating);

Some reviewers didn’t provide a date with their rating. Find the names of all reviewers who have ratings with a NULL value for the date.

select name
from Reviewer
where rID in (select rID from Rating
              where ratingDate is null);

Write a query to return the ratings data in a more readable format: reviewer name, movie title, stars, and ratingDate. Also, sort the data, first by reviewer name, then by movie title, and lastly by number of stars.

select name, title, stars, ratingDate
from Reviewer join Rating using(rID) join Movie using(mID)
order by name, title, stars;

For all cases where the same reviewer rated the same movie twice and gave it a higher rating the second time, return the reviewer’s name and the title of the movie.

select name, title
from (select rID, mID
      from Rating R1 join Rating R2 using(rID, mID)
      where R1.stars > R2.stars and
            R1.ratingDate > R2.ratingDate and
            (select count(*) from Rating
             where rID = R1.rID and mID = R1.mID) = 2)
      join Reviewer using(rID)
      join Movie using(mID);

For each movie that has at least one rating, find the highest number of stars that movie received. Return the movie title and number of stars. Sort by movie title.

select title, max(stars)
from Rating join Movie using(mID)
group by mID
order by title;

For each movie, return the title and the ’rating spread’, that is, the difference between highest and lowest ratings given to that movie. Sort by rating spread from highest to lowest, then by movie title.

select title, spread
from (select mID, max(stars) - min(stars) as spread
      from Rating
      group by mID)
      join Movie using(mID)
order by spread desc, title;

Find the difference between the average rating of movies released before 1980 and the average rating of movies released after 1980. (Make sure to calculate the average rating for each movie, then the average of those averages for movies before 1980 and movies after. Don’t just calculate the overall average rating before and after 1980.)

select
(select avg(avgStars)
 from (select mID, avg(stars) as avgStars
       from Rating
       group by mID) join Movie using(mID)
 where year < 1980) -
(select avg(avgStars)
 from (select mID, avg(stars) as avgStars
       from Rating
       group by mID) join Movie using(mID)
 where year >= 1980);

1.2.2 Movie-Rating Query Extras

Find the names of all reviewers who rated Gone with the Wind.

select name
from Reviewer
where rID in (select rID from Rating
              where mID = (select mID from Movie
                           where title = 'Gone with the Wind'));

For any rating where the reviewer is the same as the director of the movie, return the reviewer name, movie title, and number of stars.

select distinct name, title, stars
from Rating join Reviewer using(rID) join Movie using(mID)
where director = name;

Return all reviewer names and movie names together in a single list, alphabetized. (Sorting by the first name of the reviewer and first word in the title is fine; no need for special processing on last names or removing “The”.)

select name
from (select name
      from Reviewer
      union
      select title as name
      from Movie)
order by name;

Find the titles of all movies not reviewed by Chris Jackson.

select title
from Movie
where mID not in (select mID from Rating
                  where rID in (select rID from Reviewer
                                where name = 'Chris Jackson'));

For all pairs of reviewers such that both reviewers gave a rating to the same movie, return the names of both reviewers. Eliminate duplicates, don’t pair reviewers with themselves, and include each pair only once. For each pair, return the names in the pair in alphabetical order.

select distinct R1.name, R2.name
from (Rating join Reviewer using(rID)) R1 join
     (Rating join Reviewer using(rID)) R2
on R1.mID = R2.mID and
   R1.rID <> R2.rID and
   R1.name <= R2.name;

For each rating that is the lowest (fewest stars) currently in the database, return the reviewer name, movie title, and number of stars.

select name, title, stars
from Rating R1 join Reviewer R2 join Movie M
on R1.rID = R2.rID and
   R1.mID = m.mID and
   stars = (select min(stars) from Rating);

List movie titles and average ratings, from highest-rated to lowest-rated. If two or more movies have the same average rating, list them in alphabetical order.

select title, avgStars
from (select mID, avg(stars) as avgStars
      from Rating
      group by mID) join Movie using(mID)
order by avgStars desc, title;

Find the names of all reviewers who have contributed three or more ratings. (As an extra challenge, try writing the query without HAVING or without COUNT.)

select name
from Reviewer
where rID in (select rID
              from Rating
              group by rID
              having count(*) >= 3);

select name
from (select rID
      from Rating
      group by rID
      having count(*) >= 3)
     join Reviewer using(rID);

select name
from Rating join Reviewer using(rID)
group by rID
having count(*) >= 3;

select name
from Reviewer
where rID in (select rID
              from Rating R
              where rID in (select rID
                            from Rating S
                            where rID = R.rID and
                                  (mID <> R.mID or
                                   stars <> R.stars) and
                                  rID in (select rID
                                          from Rating
                                          where rID = R.rID and
                                                (mID <> R.mID or
                                                 stars <> R.stars) and
                                                (mID <> S.mID or
                                                 stars <> S.stars))));

Some directors directed more than one movie. For all such directors, return the titles of all movies directed by them, along with the director name. Sort by director name, then movie title. (As an extra challenge, try writing the query both with and without COUNT.)

select title, director
from Movie
where director in (select director
                   from Movie
                   group by director
                   having count(*) > 1)
order by director, title;

select title, director
from Movie R
where director in (select director
                   from Movie
                   where director = R.director and
                         mID <> R.mID)
order by director, title;

Find the movie(s) with the highest average rating. Return the movie title(s) and average rating. (Hint: This query is more difficult to write in SQLite than other systems; you might think of it as finding the highest average rating and then choosing the movie(s) with that average rating.)

select title, (select avg(stars) from Rating where mID = R.mID) as avgStars
from Movie R
where avgStars = (select max((select avg(stars) from Rating where mID = S.mID))
                  from Movie S);

Find the movie(s) with the lowest average rating. Return the movie title(s) and average rating. (Hint: This query may be more difficult to write in SQLite than other systems; you might think of it as finding the lowest average rating and then choosing the movie(s) with that average rating.)

select title, (select avg(stars) from Rating where mID = R.mID) as avgStars
from Movie R
where avgStars = (select min((select avg(stars) from Rating where mID = S.mID))
                  from Movie S);

For each director, return the director’s name together with the title(s) of the movie(s) they directed that received the highest rating among all of their movies, and the value of that rating. Ignore movies whose director is NULL.

select distinct director, title, maxStars
from (select director,
             (select max(stars)
              from Rating join Movie using(mID)
              where director = D.director) as maxStars
      from (select distinct director from Movie
            where director is not null) D)
      join Movie using(director) join Rating using(mID)
where stars = maxStars;

1.2.3 Social-Network Query Core

Find the names of all students who are friends with someone named Gabriel.

select name
from Highschooler
where ID in (select ID1
             from Friend
             where ID1 = ID and
                   ID2 in (select ID
                           from Highschooler
                           where name = 'Gabriel'));

For every student who likes someone 2 or more grades younger than themselves, return that student’s name and grade, and the name and grade of the student they like.

select (select name from Highschooler where ID = ID1),
       (select grade from Highschooler where ID = ID1),
       (select name from Highschooler where ID = ID2),
       (select grade from Highschooler where ID = ID2)
from Likes
where (select grade from Highschooler where ID = ID1) -
      (select grade from Highschooler where ID = ID2) >= 2;

For every pair of students who both like each other, return the name and grade of both students. Include each pair only once, with the two names in alphabetical order.

select *
from Likes
where ID1 < ID2
intersect
select ID2, ID1
from Likes;

select (select name from Highschooler where ID = R.ID1),
       (select grade from Highschooler where ID = R.ID1),
       (select name from Highschooler where ID = R.ID2),
       (select grade from Highschooler where ID = R.ID2)
from Likes R
where (select name from Highschooler where ID = R.ID1) <
      (select name from Highschooler where ID = R.ID2) and
      ID2 in (select ID1 from Likes
              where ID1 = R.ID2 and ID2 = R.ID1);

Find all students who do not appear in the Likes table (as a student who likes or is liked) and return their names and grades. Sort by grade, then by name within each grade.

select name, grade
from Highschooler
where ID not in (select ID1 from Likes) and
      ID not in (select ID2 from Likes)
order by grade, name;

For every situation where student A likes student B, but we have no information about whom B likes (that is, B does not appear as an ID1 in the Likes table), return A and B’s names and grades.

select (select name from Highschooler where ID = ID1),
       (select grade from Highschooler where ID = ID1),
       (select name from Highschooler where ID = ID2),
       (select grade from Highschooler where ID = ID2)
from Likes
where ID2 not in (select ID1 from Likes);

Find names and grades of students who only have friends in the same grade. Return the result sorted by grade, then by name within each grade.

select name, grade
from Highschooler
where ID not in (select ID1
                 from Friend
                 where (select grade from Highschooler where ID = ID1) <>
                       (select grade from Highschooler where ID = ID2))
order by grade, name;

For each student A who likes a student B where the two are not friends, find if they have a friend C in common (who can introduce them!). For all such trios, return the name and grade of A, B, and C.

select (select name from Highschooler where ID = T.ID1),
       (select grade from Highschooler where ID = T.ID1),
       (select name from Highschooler where ID = T.ID2),
       (select grade from Highschooler where ID = T.ID2),
       name, grade
from (select * from Likes
      except
      select * from Friend) T
     join Highschooler
on ID in (select ID2 from Friend where ID1 = T.ID1) and
   ID in (select ID2 from Friend where ID1 = T.ID2);

Find the difference between the number of students in the school and the number of different first names.

select count(ID) - count(distinct name)
from Highschooler;

Find the name and grade of all students who are liked by more than one other student.

select name2, grade2
from (select ID1, name1, grade1, ID2, name as name2, grade as grade2
      from (select ID1, name as name1, grade as grade1, ID2
            from Likes join Highschooler H on ID1 = H.ID)
                 join Highschooler H on ID2 = H.ID)
group by ID2
having count(*) > 1;

1.2.4 Social-Network Query Extras

For every situation where student A likes student B, but student B likes a different student C, return the names and grades of A, B, and C.

select H1.name, H1.grade,
       H2.name, H2.grade,
       H3.name, H3.grade
from Likes R join Likes S
     on R.ID2 = S.ID1 and
        S.ID2 <> R.ID1
     join Highschooler H1
     on R.ID1 = H1.ID
     join Highschooler H2
     on R.ID2 = H2.ID
     join Highschooler H3
     on S.ID2 = H3.ID;

Find those students for whom all of their friends are in different grades from themselves. Return the students’ names and grades.

select name, grade
from Highschooler R
where not exists
      (select * from Friend
       where ID1 = R.ID and
       (select grade from Highschooler
        where ID = ID2 and
              grade = R.grade));

What is the average number of friends per student? (Your result should be just one number.)

select avg(num)
from (select count(*) as num
      from Friend
      group by ID1);

The above code doesn’t consider the situation where there exists a student who has no friend. The following code handle that situation correctly.

select avg(num)
from (select (select count(*) from Friend
              where ID1 = ID) as num
      from Highschooler);

Find the number of students who are either friends with Cassandra or are friends of friends of Cassandra. Do not count Cassandra, even though technically she is a friend of a friend.

select count(*)
from Highschooler
where name <> 'Cassandra' and
      exists (select * from Friend R
              where R.ID1 = ID and
                    (R.ID2 in (select ID from Highschooler
                               where name = 'Cassandra') or
                     R.ID2 in (select ID1 from Friend
                               where ID1 = R.ID2 and
                                     ID2 in (select ID from Highschooler
                                             where name = 'Cassandra'))));

select count(*)
from (select ID1, ID2, ID3
      from (select L1.ID1, L1.ID2, L2.ID2 as ID3
            from Friend L1 join Friend L2
            on L1.ID2 = L2.ID1 and
               L2.ID2 <> L1.ID1))
     join Highschooler R on ID1 = R.ID
     join Highschooler S on ID2 = S.ID
     join Highschooler T on ID3 = T.ID
where S.name = 'Cassandra' or T.name = 'Cassandra';

Find the name and grade of the student(s) with the greatest number of friends.

select (select name from Highschooler where ID = ID1),
       (select grade from Highschooler where ID = ID1)
from Friend
group by ID1
having count(*) = (select max(num)
                   from (select count(*) as num
                         from Friend
                         group by ID1));

1.2.5 Movie-Rating Modification

Add the reviewer Roger Ebert to your database, with an rID of 209.

insert into Reviewer values (209, 'Roger Ebert');

Insert 5-star ratings by James Cameron for all movies in the database. Leave the review date as NULL.

insert into Rating
select (select rID from Reviewer where name = 'James Cameron'),
       mID, 5, null
from Movie;

For all movies that have an average rating of 4 stars or higher, add 25 to the release year. (Update the existing tuples; don’t insert new tuples.)

update Movie
set year = year + 25
where (select avg(stars) from Rating where mID = Movie.mID) >= 4;

Remove all ratings where the movie’s year is before 1970 or after 2000, and the rating is fewer than 4 stars.

delete from Rating
where stars < 4 and
      ((select year from Movie where mID = Rating.mID) < 1970 or
       (select year from Movie where mID = Rating.mID) > 2000)

1.2.6 Social-Network Modification

It’s time for the seniors to graduate. Remove all 12th graders from Highschooler.

delete from Highschooler
where grade = 12;

If two students A and B are friends, and A likes B but not vice-versa, remove the Likes tuple.

delete from Likes
where exists (select *
              from (select * from Friend
                    except
                    select ID2, ID1 from Likes
                    intersect
                    select * from Likes)
              where ID1 = Likes.ID1 and ID2 = Likes.ID2);

For all cases where A is friends with B, and B is friends with C, add a new friendship for the pair A and C. Do not add duplicate friendships, friendships that already exist, or friendships with oneself. (This one is a bit challenging; congratulations if you get it right.)

insert into Friend
select distinct R.ID1, S.ID2
from Friend R join Friend S
on R.ID2 = S.ID1 and S.ID2 <> R.ID1 and
   not exists (select * from Friend
             where ID1 = R.ID1 and
                   ID2 = S.ID2);

2 XPath and XQuery mini-course

2.1 Lecture videos

2.1.1 XPath Introduction

Not nearly as mature as Querying Relational

  • XPath - path expressions + conditions
  • XSLT - XPath + transformations, output
  • XQuery - XPath + full-featured Q.L.

Think of XML as a tree

Basic Constructs

/ root element and separator
{element name} or *
@{attribute name}
// any descendants including self
[condition] e.g. [@price < 50]
[number] nth sub-element

Built-in functions (lots of them)

contains(s1, s2)
name()

Navigation “axes” (13 of them)

parent::
following-sibling::
descendants:: (does not match self, cf. //)
self::

XPath queries operate on and return sequence of elements Sometimes result can be expressed as XML, but not always

2.1.2 XPath Demo

Get the titles of books in the bookstore where the price is lower than 90 and Jeffrey Widom is an author.

doc("BookstoreQ.xml")/Bookstore/Book[@Price < 90 and
        Authors/Author[Last_name = "Widom" and
        First_Name="Jeffrey"]]/Title

Get the titles of books where “Ullman” is an author and “Widom” is not an author. The below code doesn’t work since there is no way using XPath language alone to achieve the query.

doc("BookstoreQ.xml")/Bookstore/Book[
        Authors/Author/Last_Name = "Ullman" and
        Authors/Author/Last_Name != "Widom"]/Title

Get all magazines where there’s a book with the same title. (An example of self-join in XPath)

doc("BookstoreQ.xml")//Magazine[
        Title = doc("BookstoreQ.xml")//Book/Title]

The equal sign in the above code is implicitly existentially quantified. There is a lot of implicit existential quantification in XPath and XQuery.

All Elements whose parent is not “Bookstore” or “Book”.

doc("BookstoreQ.xml")/Bookstore//*[
        name(parent::*) != "Bookstore"
        and name(parent::*) != "Book"]

The above code use the navigation access parent::.

All books and magazines with non-unique titles.

doc("BookstoreQ.xml")/Bookstore/(Book|Magazine)[
        Title = following-sibling::*/Title
        or Title = preceding-sibling::*/Title]

Books where every author’s first name includes “J”. This query will need universal quantification.

doc("BookstoreQ.xml")//Book[
        count(Authors/Author[contains(First_Name, "J")]) =
        count(Authors/Author/First_Name)]

The built-in function count is used to simulate universal quantification. This technique can solve previously undoable query.

Titles of books where “Ullman” is an author and “Widom” is not an author.

doc("BookstoreQ.xml")/Bookstore/Book[
        Authors/Author/Last_Name = "Ullman"
        and count(Authors/Author[Last_Name = "Widom"]) = 0]/Title

2.1.3 XQuery Introduction

  1. XQuery is a expression language (compositional)
  2. Each expression operates on and returns sequence of elements
  3. XPath is one type of expression

XQuery: FLWOR expression

For $var in expr
Let $var := expr
Where condition
Order By expr
Return expr

Mixing queries and XML

<Result> { ...query goes here... } </Result>

2.1.4 XQuery Demo

for $b in doc("BookstoreQ.xml")/Bookstore/Book
where $b/@Price < 90
  and $b/Authors/Author/Last_Name = "Ullman"
return $b/Title
for $b in doc("BookstoreQ.xml")/Bookstore/Book
where some $fn in $b/Authors/Author/First_Name
        satisfies contains($b/Title, $fn)
return <Book>
         { $b/Title }
         { $b/Authors/Author/First_Name }
       </Book>
for $b in doc("BookstoreQ.xml")/Bookstore/Book
where some $fn in $b/Authors/Author/First_Name
        satisfies contains($b/Title, $fn)
return <Book>
         { $b/Title }
         { for $fn in $b/Authors/Authors/Author/First_Name
           where contains($b/Title, $fn) return $fn }
       </Book>
<Average>
  { let $plist := doc("BookstoreQ.xml")/Bookstore/Book/@Price
    return avg($plist) }
</Average>
let $a := avg(doc("BookstoreQ.xml")/Bookstore/Book/@Price)
for $b in doc("BookstoreQ.xml")/Bookstore/Book
where $b/@Price < $a
return <Book>
          { $b/Title }
          <Price> { $b/data(@Price) } </Price>
       </Book>
for $b in doc("BookstoreQ.xml")/Bookstore/Book
order by xs:int($b/@Price)
return <Book>
          { $b/Title }
          <Price> { $b/data(@Price) } </Price>
       </Book>
for $n in distinct-values(doc("BookstoreQ.xml")//Last_Name)
return <Last_Name> { $n } </Last_Name>
for $b in doc("BookstoreQ.xml")/Bookstore/Book
where every $fn in $b/Authors/Author/First_Name
        satisfies contains($fn, "J")
return $b
for $b1 in doc("BookstoreQ.xml")/Bookstore/Book
for $b2 in doc("BookstoreQ.xml")/Bookstore/Book
where $b1/Authors/Author/Last_Name = $b2/Authors/Author/Last_Name
      and $b1/Title < $b2/Title
return
    <BookPair>
        <Title1> { data($b1/Title) } </Title1>
        <Title2> { data($b2/Title) } </Title2>
    </BookPair>

Again, implicitly existential quantification.

<InvertedBookstore>
    { for $ln in distinct-values(doc("BookstoreQ.xml")//Author/Last_Name)
      for $fn in distinct-values(doc("BookstoreQ.xml")//Author[
                                      Last_name=$ln]/First_Name)
      return
          <Author>
              <First_Name> { $fn } </First_name>
              <Last_Name { $ln } </Last_Name>
              { for $b in doc("BookstoreQ.xml")/Bookstore/Book[
                               Authors/Atuhor/Last_Name=$ln]
                return <Book>
                          { $b/@ISBN } { $b/@Price } { $/@Edition }
                          { $b/Title } { $b/Remark }
                       </Book>
          </Author> }
</InvertedBookstore>

2.2 Exercises

2.2.1 Course-Catalog XPath and XQuery

Return all Title elements (of both departments and courses).

doc("courses.xml")//Title

Return last names of all department chairs.

doc("courses.xml")//Department/Chair//Last_Name

Return titles of courses with enrollment greater than 500.

doc("courses.xml")//Course[@Enrollment > 500]/Title

Return titles of departments that have some course that takes “CS106B” as a prerequisite.

doc("courses.xml")//Department[Course/Prerequisites[Prereq = "CS106B"]]/Title

Return last names of all professors or lecturers who use a middle initial. Don’t worry about eliminating duplicates.

doc("courses.xml")//(Lecturer|Professor)[Middle_Initial]/Last_Name

Return the count of courses that have a cross-listed course (i.e., that have “Cross-listed” in their description).

count(doc("courses.xml")//Course[contains(Description, "Cross-listed")])

Return the average enrollment of all courses in the CS department.

avg(doc("courses.xml")//Department[@Code="CS"]/Course/@Enrollment)

Return last names of instructors teaching at least one course that has “system” in its description and enrollment greater than 100.

doc("courses.xml")//Course[@Enrollment > 100 and contains(Description, "system")]/Instructors//Last_Name

Return the title of the course with the largest enrollment.

for $c in doc("courses.xml")//Course
where every $e in doc("courses.xml")//Course/@Enrollment
        satisfies xs:int($e) <= xs:int($c/@Enrollment)
return $c/Title

Only one of the function call xs:int shown above is necessary since the other operand will go through type promotion.

2.2.2 Course-Catalog XPath and XQuery Extras

Return the course number of the course that is cross-listed as “LING180”.

data(doc("courses.xml")//Course[contains(Description, "Cross-listed as LING180")]/@Number)

Return course numbers of courses that have the same title as some other course. (Hint: You might want to use the “preceding” and “following” navigation axes for this query, which were not covered in the video or our demo script; they match any preceding or following node, not just siblings.)

for $c1 in doc("courses.xml")//Course
where some $c2 in doc("courses.xml")//Course
      satisfies $c1 != $c2
                and $c1/Title = $c2/Title
return data($c1/@Number)

Return course numbers of courses taught by an instructor with first name “Daphne” or “Julie”.

data(doc("courses.xml")//Course[Instructors//First_Name = ("Daphne", "Julie")]/@Number)

Return the number (count) of courses that have no lecturers as instructors.

count(doc("courses.xml")//Course[not(Instructors[Lecturer])])

Return titles of courses taught by the chair of a department. For this question, you may assume that all professors have distinct last names.

doc("courses.xml")//Course[
    Instructors/Professor/Last_Name =
        doc("courses.xml")//Department/Chair/Professor/Last_Name
]/Title
for $c in doc("courses.xml")//Course
where $c/Instructors/(Professor|Lecturer)/Last_Name
          = doc("courses.xml")//Department/Chair/(Professor|Lecturer)/Last_Name
return $c/Title
for $c in doc("courses.xml")//Course
where some $t1 in $c/Instructors/(Professor|Lecturer)
      satisfies some $t2 in
                doc("courses.xml")//Department/Chair/(Professor|Lecturer)
                satisfies deep-equal($t1/*, $t2/*)
return $c/Title

Return titles of courses that have both a lecturer and a professor as instructors. Return each title only once.

doc("courses.xml")//Course[Instructors/Lecturer and Instructors/Professor]/Title

Return titles of courses taught by a professor with the last name “Ng” but not by a professor with the last name “Thrun”.

for $c in doc("courses.xml")//Course[Instructors/Professor/Last_Name = "Ng"]
where every $ln in $c/Instructors/Professor/Last_Name
      satisfies $ln != "Thrun"
return $c/Title
doc("courses.xml")//Course[Instructors[
        Professor/Last_Name = "Ng" and
        count(Professor[Last_Name = "Thrun"]) = 0]]/Title

Return course numbers of courses that have a course taught by Eric Roberts as a prerequisite.

for $c in doc("courses.xml")//Course
where some $p in $c/Prerequisites/Prereq
      satisfies $p = data(
          doc("courses.xml")//Course[
              Instructors/(Professor|Lecturer)[
                  First_Name = "Eric" and
                  Last_Name = "Roberts"]]/@Number)
return data($c/@Number)

Create a summary of CS classes: List all CS department courses in order of enrollment. For each course include only its Enrollment (as an attribute) and its Title (as a subelement).

<Summary>
    { for $c in doc("courses.xml")//Department[@Code="CS"]/Course
      order by xs:int($c/@Enrollment)
      return <Course>
                 { $c/@Enrollment }
                 { $c/Title }
             </Course> }
</Summary>

Return a “Professors” element that contains as subelements a listing of all professors in all departments, sorted by last name with each professor appearing once. The “Professor” subelements should have the same structure as in the original data. For this question, you may assume that all professors have distinct last names. Watch out – the presence/absence of middle initials may require some special handling. (This problem is quite challenging; congratulations if you get it right.)

<Professors>
    { for $ln in distinct-values(doc("courses.xml")//Professor/Last_Name)
      order by $ln
      return (doc("courses.xml")//Professor[Last_Name=$ln])[1] }
</Professors>

Expanding on the previous question, create an inverted course listing: Return an “Inverted_Course_Catalog” element that contains as subelements professors together with the courses they teach, sorted by last name. You may still assume that all professors have distinct last names. The “Professor” subelements should have the same structure as in the original data, with an additional single “Courses” subelement under Professor, containing a further “Course” subelement for each course number taught by that professor. Professors who do not teach any courses should have no Courses subelement at all. (This problem is very challenging; extra congratulations if you get it right.)

<Inverted_Course_Catalog>
    { for $ln in distinct-values(doc("courses.xml")//Professor/Last_Name)
      let $prof := (doc("courses.xml")//Professor[Last_Name=$ln])[1]
      order by $ln
      return <Professor>
                 { $prof/First_Name } { $prof/Middle_Initial }
                 { $prof/Last_Name }
                 { let $cs :=
                       <Courses>
                           { for $c in doc("courses.xml")//Course[
                                 Instructors/Professor/Last_Name = $ln]
                             return <Course>
                                        { data($c/@Number) }
                                    </Course> }
                       </Courses>
                   return $cs[Course] }
             </Professor>}
</Inverted_Course_Catalog>

2.2.3 World-Countries XPath and XQuery

Return the area of Mongolia.

data(doc("countries.xml")//country[@name="Mongolia"]/@area)

Return the names of all cities that have the same name as the country in which they are located.

for $c in doc("countries.xml")//country[@name = city/name]
return $c/city[name = $c/@name]/name
for $c in doc("countries.xml")//country[@name = city/name]
return <name> { data($c/@name) } </name>

Return the average population of Russian-speaking countries.

avg(data(doc("countries.xml")//country[language = "Russian"]/@population))

Return the names of all countries that have at least three cities with population greater than 3 million.

data(doc("countries.xml")//country[count(city[xs:int(population) > 3000000]) >= 3]/@name)

The function call xs:int in the above code is actually unnecessary since type promotion would occur due to right-hand operand.

Create a list of French-speaking and German-speaking countries. The result should take the form:

<result>
  <French>
    <country>country-name</country>
    <country>country-name</country>
    ...
  </French>
  <German>
    <country>country-name</country>
    <country>country-name</country>
    ...
  </German>
</result>
<result>
  <French>
    { for $cn in data(doc("countries.xml")//country[language = "French"]/@name)
      return <country> { $cn } </country> }
  </French>
  <German>
    { for $cn in data(doc("countries.xml")//country[language = "German"]/@name)
      return <country> { $cn } </country> }
  </German>
</result>
<result>
  { let $langs := ("French", "German")
    for $lang in $langs
    return element {$lang}
    { for $cn in data(doc("countries.xml")//country[language=$lang]/@name)
          return <country> { $cn } </country> } }
</result>

Return the countries with the highest and lowest population densities. Note that because the “/” operator has its own meaning in XPath and XQuery, the division operator is infix “div”. To compute population density use “(@population div @area)”. You can assume density values are unique. The result should take the form:

<result>
  <highest density="value">country-name</highest>
  <lowest density="value">country-name</lowest>
</result>
<result>
  { for $c1 in doc("countries.xml")//country
    let $density := $c1/@population div $c1/@area
    where every $c2 in doc("countries.xml")//country
          satisfies $density >= $c2/@population div $c2/@area
    return <highest density="{$density}"> { data($c1/@name) } </highest> }
  { for $c1 in doc("countries.xml")//country
    let $density := $c1/@population div $c1/@area
    where every $c2 in doc("countries.xml")//country
          satisfies $density <= $c2/@population div $c2/@area
    return <lowest density="{$density}"> { data($c1/@name) } </lowest> }
</result>

2.2.4 World-Countries XPath and XQuery Extras

Return the names of all countries with population greater than 100 million.

data(doc("countries.xml")//country[@population > 100000000]/@name)

Return the names of all countries where over 50% of the population speaks German. (Hint: Depending on your solution, you may want to use “.”, which refers to the “current element” within an XPath expression.)

data(doc("countries.xml")//country[language[@percentage > 50.0] = "German"]/@name)

Return the names of all countries where a city in that country contains more than one-third of the country’s population.

for $c in doc("countries.xml")//country
where some $cp in $c/city/population
      satisfies $cp div $c/@population > 1 div 3
return data($c/@name)

Return the population density of Qatar. Note: Since the “/” operator has its own meaning in XPath and XQuery, the division operator is “div”. To compute population density use “(@population div @area)”.

let $c := doc("countries.xml")//country[@name="Qatar"]
return $c/@population div $c/@area

Return the names of all countries whose population is less than one thousandth that of some city (in any country).

let $countries := doc("countries.xml")//country
for $c in $countries
where some $city in $countries/city
      satisfies $city/population * 0.001 > $c/@population
return data($c/@name)

Return all city names that appear more than once, i.e., there is more than one city with that name in the data. Return only one instance of each such city name. (Hint: You might want to use the “preceding” and/or “following” navigation axes for this query, which were not covered in the video or our demo script; they match any preceding or following node, not just siblings.)

let $cities := doc("countries.xml")//city
for $city in distinct-values($cities/name)
where count($cities[name = $city]) > 1
return <name>{$city}</name>

Return the names of all countries containing a city such that some other country has a city of the same name. (Hint: You might want to use the “preceding” and/or “following” navigation axes for this query, which were not covered in the video or our demo script; they match any preceding or following node, not just siblings.)

let $countries := doc("countries.xml")//country
for $c in $countries
where some $c2 in $countries
      satisfies not($c2 is $c)
                and $c/city/name = $c2/city/name
return data($c/@name)

Return the names of all countries whose name textually contains a language spoken in that country. For instance, Uzbek is spoken in Uzbekistan, so return Uzbekistan. (Hint: You may want to use “.”, which refers to the “current element” within an XPath expression.)

for $c in doc("countries.xml")//country
where some $lang in $c/language
      satisfies contains($c/@name, $lang)
return data($c/@name)

Return the names of all countries in which people speak a language whose name textually contains the name of the country. For instance, Japanese is spoken in Japan, so return Japan. (Hint: You may want to use “.”, which refers to the “current element” within an XPath expression.)

data(doc("countries.xml")//country[language[contains(., ../@name)]]/@name)

Return all languages spoken in a country whose name textually contains the language name. For instance, German is spoken in Germany, so return German. (Hint: Depending on your solution, may want to use data(.), which returns the text value of the “current element” within an XPath expression.)

data(doc("countries.xml")//country/language[contains(../@name, .)])

Return all languages whose name textually contains the name of a country in which the language is spoken. For instance, Icelandic is spoken in Iceland, so return Icelandic. (Hint: Depending on your solution, may want to use data(.), which returns the text value of the “current element” within an XPath expression.)

data(doc("countries.xml")//country/language[contains(., ../@name)])

Return the number of countries where Russian is spoken.

count(doc("countries.xml")//country[language = "Russian"])

Return the names of all countries for which the data does not include any languages or cities, but the country has more than 10 million people.

data(doc("countries.xml")//country[not(language or city) and @population > 10000000]/@name)

Return the name of the country with the highest population. (Hint: You may need to explicitly cast population numbers as integers with xs:int() to get the correct answer.)

data(doc("countries.xml")//country[
    @population = max(doc("countries.xml")//country/@population)
]/@name)
#+END_EXAMPLE)

Return the name of the country that has the city with the highest
population.  (Hint: You may need to explicitly cast population numbers
as integers with xs:int() to get the correct answer.)

#+BEGIN_EXAMPLE
data(doc("countries.xml")//country[
    city/population = max(doc("countries.xml")//city/population)
]/@name)

Return the average number of languages spoken in countries where Russian is spoken.

avg(for $c in doc("countries.xml")//country[language = "Russian"]
return count($c/language))

Return all country-language pairs where the language is spoken in the country and the name of the country textually contains the language name. Return each pair as a country element with language attribute, e.g.,

<country language="French">French Guiana</country>
for $c in doc("countries.xml")//country
for $lang in $c/language
where contains($c/@name, $lang)
return <country language="{$lang}"> { data($c/@name) } </country>

Return all countries that have at least one city with population greater than 7 million. For each one, return the country name along with the cities greater than 7 million, in the format:

<country name="country-name">
  <big>city-name</big>
  <big>city-name</big>
  ...
</country>
for $country in doc("countries.xml")//country
let $c := <country>
              { $country/@name }
              { for $city in $country/city
                where $city/population > 7000000
                return <big>{data($city/name)}</big> }
          </country>
return $c[big]

Return all countries where at least one language is listed, but the total percentage for all listed languages is less than 90%. Return the country element with its name attribute and its language subelements, but no other attributes or subelements.

for $country in doc("countries.xml")//country[language]
where sum($country/language/@percentage) < 90
return <country>
           { $country/@name }
           { $country/language }
       </country>

Return all countries where at least one language is listed, and every listed language is spoken by less than 20% of the population. Return the country element with its name attribute and its language subelements, but no other attributes or subelements.

for $country in doc("countries.xml")//country[language]
where every $lang in $country/language
      satisfies $lang/@percentage < 20
return <country>
           { $country/@name }
           { $country/language }
       </country>

Find all situations where one country’s most popular language is another country’s least popular, and both countries list more than one language. (Hint: You may need to explicitly cast percentages as floating-point numbers with xs:float() to get the correct answer.) Return the name of the language and the two countries, each in the format:

<LangPair language="lang-name">
  <MostPopular>country-name</MostPopular>
  <LeastPopular>country-name</LeastPopular>
</LangPair>
let $countries := doc("countries.xml")//country[count(language) > 1]
for $c in $countries
for $lang in $c/language
where every $lang2 in $c/language
      satisfies xs:float($lang/@percentage) >= $lang2/@percentage
return (<LangPair language="{$lang}">
            <MostPopular>{data($c/@name)}</MostPopular>
            { for $c2 in $countries[language = $lang]
              let $lang3 := $c2/language[. = $lang]
              where every $lang2 in $c2/language
                    satisfies xs:float($lang3/@percentage) <= $lang2/@percentage
              return <LeastPopular> {data($c2/@name)}</LeastPopular> }
        </LangPair>)[LeastPopular]
let $countries :=
    for $c in doc("countries.xml")//country[count(language) > 1]
    let $most_lang :=
        for $lang in $c/language
        where every $lang2 in $c/language
              satisfies xs:float($lang/@percentage) >= $lang2/@percentage
        return <MostPopular>{data($lang)}</MostPopular>
    let $least_lang :=
        for $lang in $c/language
        where every $lang2 in $c/language
              satisfies xs:float($lang/@percentage) <= $lang2/@percentage
        return <LeastPopular>{data($lang)}</LeastPopular>
    return <country>
               {$c/@name}
               {$most_lang}
               {$least_lang}
           </country>
for $c1 in $countries
for $c2 in $countries
where $c1/MostPopular = $c2/LeastPopular
return <LangPair language="{$c1/MostPopular}">
           <MostPopular>{data($c1/@name)}</MostPopular>
           <LeastPopular>{data($c2/@name)}</LeastPopular>
       </LangPair>

For each language spoken in one or more countries, create a “language” element with a “name” attribute and one “country” subelement for each country in which the language is spoken. The “country” subelements should have two attributes: the country “name”, and “speakers” containing the number of speakers of that language (based on language percentage and the country’s population). Order the result by language name, and enclose the entire list in a single “languages” element. For example, your result might look like:

<languages>
  ...
  <language name="Arabic">
    <country name="Iran" speakers="660942"/>
    <country name="Saudi Arabia" speakers="19409058"/>
    <country name="Yemen" speakers="13483178"/>
  </language>
  ...
</languages>
<languages>
    { for $lang in distinct-values(doc("countries.xml")//language)
      order by $lang
      return <language name="{$lang}">
                 { for $c in doc("countries.xml")//country[language = $lang]
                   return <country name="{$c/@name}"
                                   speakers="{xs:int(
                                       $c/language[. = $lang]/@percentage
                                           div 100 * $c/@population)}">
                          </country> }
             </language> }
</languages>

3 XSLT mini-course

3.1 Lecture videos

  • XSL = Extensible Stylesheet Language
  • XSLT = XSL (with) transformations

XSLT is more widely used than XSL.

XSLT: Rule-Based Transformations

  • Match template and replace
  • Recursively match template
  • Extract values
  • Iteration (for-each)
  • Conditionals (if)

Catches:

  • Strange default/whitespace behavior
  • Implicit template priority scheme

3.2 Exercises

3.2.1 Course-Catalog XSLT

Return a list of department titles.

Your solution should fill in the following stylesheet:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match=...>
        ... template body ...
    </xsl:template>
    ... more templates as needed ...
</xsl:stylesheet>
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="Department/Title">
        <xsl:copy-of select="."/>
    </xsl:template>
    <xsl:template match="text()"/>
</xsl:stylesheet>

Return a list of department elements with no attributes and two subelements each: the department title and the entire Chair subelement structure.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="Department">
      <Department>
        <xsl:copy-of select="Title"/>
        <xsl:copy-of select="Chair"/>
      </Department>
    </xsl:template>
    <xsl:template match="text()"/>
</xsl:stylesheet>

3.2.2 Course-Catalog XSLT Extras

Return all courses with enrollment greater than 500. Retain the structure of Course elements from the original data.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="Course[@Enrollment &gt; 500]">
    <xsl:copy-of select="."/>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>

Remove from the data all courses with enrollment greater than 60, or with no enrollment listed. Otherwise the structure of the data should be the same.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:template match="*|@*|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="Course[not(@Enrollment) or @Enrollment &gt; 60]"/>
</xsl:stylesheet>

Create a summarized version of the EE part of the course catalog. For each course in EE, return a Course element, with its Number and Title as attributes, its Description as a subelement, and the last name of each instructor as an Instructor subelement. Discard all information about department titles, chairs, enrollment, and prerequisites, as well as all courses in departments other than EE. (Note: To specify quotes within an already-quoted XPath expression, use quot;.)

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="Department[@Code='EE']">
    <xsl:for-each select="Course">
      <Course Number="{@Number}"
              Title="{Title}">
        <xsl:copy-of select="Description"/>
        <xsl:for-each select="Instructors//Last_Name">
          <Instructor><xsl:value-of select="."/></Instructor>
        </xsl:for-each>
      </Course>
    </xsl:for-each>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>

Create an HTML table with one-pixel border that lists all CS department courses with enrollment greater than 200. Each row should contain three cells: the course number in italics, course title in bold, and enrollment. Sort the rows alphabetically by course title. No header is needed. (Note: For formatting, just use “table border=1”, and “<b>” and “<i>” tags for bold and italics respectively. To specify quotes within an already-quoted XPath expression, use quot;.)

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="Department[@Code='CS']">
    <table border="1">
      <xsl:for-each select="Course[@Enrollment &gt; 200]">
        <xsl:sort select="Title"/>
        <tr>
          <td><i><xsl:value-of select="@Number"/></i></td>
          <td><b><xsl:value-of select="Title"/></b></td>
          <td><xsl:value-of select="@Enrollment"/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>

3.2.3 World-Countries XSLT

Return all countries with population between 9 and 10 million. Retain the structure of country elements from the original data.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="country[9000000 &lt;= @population and @population &lt;= 10000000]">
    <xsl:copy-of select="."/>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>

Create a table using HTML constructs that lists all countries that have more than 3 languages. Each row should contain the country name in bold, population, area, and number of languages. Sort the rows in descending order of number of languages. No header is needed for the table, but use <table border=“1”> to make it format nicely, should you choose to check your result in a browser. (Hint: You may find the data-type and order attributes of <xsl:sort> to be useful.)

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="countries">
    <html>
      <table border="1">
        <xsl:for-each select="country[count(language) &gt; 3]">
          <xsl:sort select="count(language)"
                    order="descending"/>
          <tr>
            <td><b><xsl:value-of select="@name"/></b></td>
            <td><xsl:value-of select="@population"/></td>
            <td><xsl:value-of select="@area"/></td>
            <td><xsl:value-of select="count(language)"/></td>
          </tr>
        </xsl:for-each>
      </table>
    </html>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>

Create an alternate version of the countries database: for each country, include its name and population as subelements, and the number of languages and number of cities as attributes (called “languages” and “cities” respectively).

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*|@*|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="country">
    <country languages="{count(language)}"
             cities="{count(city)}">
      <name><xsl:value-of select="@name"/></name>
      <population><xsl:value-of select="@population"/></population>
    </country>
  </xsl:template>
</xsl:stylesheet>

3.2.4 World-Countries XSLT Extras

Find all country names containing the string “stan”; return each one within a “Stan” element. (Note: To specify quotes within an already-quoted XPath expression, use quot;.)

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="country[contains(@name, 'stan')]">
    <Stan><xsl:value-of select="@name"/></Stan>
  </xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>

Remove from the data all countries with area greater than 40,000 and all countries with no cities listed. Otherwise the structure of the data should be the same.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:template match="*|@*|text()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="country[@area &gt; 40000 or not(city)]"/>
</xsl:stylesheet>

4 Relational Design Theory mini-course

4.1 Lecture videos

4.1.1 Overview

Design “anomalies”

  • Redundancy
  • Update anomaly
  • Deletion anomaly

Design by decomposition

  • Start with “mega” relations containing everything
  • Decompose into smaller, better relations with same info
  • Can do decomposition automatically

Automatic decomposition

  • “Mega” relations + properties of the data
  • System decomposes based on properties
  • Final set of relations satisfies normal form
    • No anomalies, no lost information

Properties and Normal Forms

  • Functional dependencies \(\Rightarrow\) Boyce-Codd Normal Form
  • Multivalued dependencies \(\Rightarrow\) Fourth Normal Form

4NF \(\subset\) BCNF

Functional Dependencies and BCNF

  • Apply(SSN, sName, cName)
    • Redundancy; Update & Deletion Anomalies
    • Storing SSN-sName pair once for each college
  • Functional Dependency SSN \(\rightarrow\) sName
    • Same SSN always has same sName
    • Should store each SSN’s sName only once
  • Boyce-Codd Normal Form If A \(\rightarrow\) B then A is a key

SSN should be a key but not.

  • Decompose: Student(SSN, sName) Apply(SSN, cName)

Pull out SSN and sName into its own relation

Multivalued Dependencies and 4NF

  • Apply(SSN, cName, HS)
    • Redundancy; Update & Deletion Anomalies
    • Multiplicative effect
      • C colleges, H high school
      • C * H tuples
      • Should be C + H
    • Not addressed by BCNF: No functional dependencies
  • Multivalued Dependency SSN \(\twoheadrightarrow\) cName, SSN \(\twoheadrightarrow\) HS
    • Given SSN has every combination of cName with HS
    • Should store each cName and each HS for an SSN once
  • Fourth Normal Form If A \(\twoheadrightarrow\) B then A is a key
  • Decompose: Apply(SSN, cName) HighSchool(SSN, HS)

4.1.2 Functional Dependencies

Functional dependencies are generally useful concepts

  • Data storage - compression
  • Reasoning about queries - optimization

Functional dependencies generalizes keys.

Student(SS, sName, address, HScode, HSname, HScity, GPA, priority)

Suppose priority is determined by GPA

Two tuples with same GPA have same priority

\[\forall t, u \in \mathrm{Student}\colon\, t.\mathrm{GPA} = u.\mathrm{GPA} \Rightarrow t.\mathrm{priority} = u.\mathrm{priority}\]

GPA \(\rightarrow\) priority

\[\forall t, u \in R\colon\, t[A_1, \dotsc, A_n] = u[A_1, \dotsc, A_n] \Rightarrow t[B_1, \dotsc, B_n] = u[B_1, \dotsc, B_n]\]

\(A_1, \dotsc, A_n \rightarrow B_1, \dotsc, B_n\)

\(\bar{A} \rightarrow \bar{B}\)

  • SSN \(\rightarrow\) sName
  • SSN \(\rightarrow\) address (assume students don’t work)
  • HScode \(\rightarrow\) HSname, HScity
  • HSname, HScity \(\rightarrow\) HScode (assume no schools with same name in the same city)

And

  • SSN \(\rightarrow\) GPA
  • GPA \(\rightarrow\) priority
  • SSN \(\rightarrow\) priority

transitivity

Apply(SSN, cName, state, date, major)
  • cName \(\rightarrow\) date
  • SSN, cName \(\rightarrow\) major

and

  • SSN \(\rightarrow\) state

Functional Dependencies and Keys

\(R(\bar{A}, \bar{B})\)

Trivial Functional Dependency

  • \(\bar{A} \rightarrow \bar{B}\), \(\bar{B} \subset \bar{A}\)

Nontrivial FD

  • \(\bar{A} \rightarrow \bar{B}\), \(\bar{B} \not\subset \bar{A}\)

Completely nontrivial FD (most interested)

  • \(A \rightarrow B\), \(A \cap B = \varnothing\)

Splitting rule

  • \(\bar{A} \rightarrow B_1, B_2, \dotsc, B_m \Rightarrow \bar{A} \rightarrow B_1, \bar{A} \rightarrow B_2, \dotsc\)

Combining rule

  • \(\bar{A} \rightarrow B_1, \bar{A} \rightarrow B_2, \dotsc \bar{A} \rightarrow B_n \Rightarrow \bar{A} \rightarrow B_1, B_2, \dotsc, B_n\)

Trivial-dependency rules

  • A \(\rightarrow\) B, then A \(\rightarrow\) A \(\cup\) B
  • A \(\rightarrow\) B, then A \(\rightarrow\) A \(\cap\) B

Transitive rule

  • A \(\rightarrow\) B, B \(\rightarrow\) C, then A \(\rightarrow\) C

Closure of Attributes

  • Given relations, FDs, set of attributes \(\bar{A}\)
  • Find all \(B\) such that \(\bar{A} \rightarrow B\)

\(\bar{A}^+\)

apply transitive rule repeatedly to get the closure

if \(\bar{A}^+\) = all attrs, then \(\bar{A}\) is a key.

Grow algorithms can be used to find all keys given a set of FDs.

If \(\bar{A}\) is a key, then all its supersets are keys. Hence, we are often only interested the smallest key.

Specifying FDs for a relation

  • \(S_1\) and \(S_2\) sets of FDs
  • \(S_2\) “follows from” \(S_1\) if every relation instance satisfying \(S_1\) also satisfies \(S_2\)
    • \(S_2\): {SSN \(\rightarrow\) priority}
    • \(S_1\): {SSN \(\rightarrow\) GPA, GPA \(\rightarrow\) priority}

Does A \(\rightarrow\) B follow from S?

  1. \(A^+\) based S check if B in set.
  2. Armstrong’s Axioms

Want: Minimal set of completely nontrivial FDs such that all FDs that hold on the relation follow from the dependencies in this set

4.1.3 Boyce-Codd Normal Form

Decomposition of a relational schema

  • \(R(A_1, \dotsc, A_n)\)
  • \(R_1(B_1, \dotsc, B_k)\)
  • \(R_2(C_1, \dotsc, C_m)\)

Then we want

  • \(\bar{B} \cup \bar{C} = \bar{A}\)
  • \(R_1 \Join R_2 = R\)
Student(SSN, sName, address, HScode, HSname, HScity, GPA, priority)

This is a decomposition.

S1(SSN, sName, address, HScode, GPA, priority)
S2(HScode, HSname, HScity)

This is not a decomposition.

S1(SSN, sName, address, HScode, HSname, HScity)
S2(sName, HSname, GPA, priority)

Good decomposition exhibits lossless join property.

BCNF

  • Relation R with FDs is in BCNF if:
    • For each \(\bar{A} \rightarrow B\), \(\bar{A}\) is a key.

BCNF violation (redundancy, update/delete anomalies)

note: a key can contain another key and thus is a super key.

BCNF decomposition algorithm

  • Input: relation R + FDs for R
  • Output: decomposition of R into BCNF relations with “lossless join”
  • Compute keys for \(R\)
  • Repeat until all relations are in BCNF
    • Pick any \(R'\) with \(A \rightarrow B\) that violates BCNF
    • Decompose \(R'\) into \(R_1(A, B)\) and \(R_2(A, \mathrm{rest})\)
    • Compute FDs for \(R_1\) and \(R_2\)
    • Compute keys for \(R_1\) and \(R_2\)

Randomness can result in different answer. We can extend \(A \rightarrow BA^+\).

4.1.4 Multivalued Dependencies and 4th Normal Form

Fourth Normal Form \(\subset\) Boyce-Codd Normal Form

\[R \quad \bar{A} \twoheadrightarrow \bar{B} \quad A_1, \dotsc, A_n, B_1, \dotsc, B_n\]

\[\forall t, u \in R \, \exists v \in R:\, t[\bar{A}] = u[\bar{A}] \implies v[\bar{A}] = t[\bar{A}] \land v[\bar{B}] = t[\bar{B}] \land v[\mathrm{rest}] = u[\mathrm{rest}]\]

Tuple-generating dependencies

\(\bar{A} \twoheadrightarrow \bar{B} \implies \bar{A} \twoheadrightarrow \mathrm{rest}\)

Trivial Multivalued Dependency

  • \(B \subset A\) or \(A \cup B =\) all attrs

\(A \rightarrow B \implies A \twoheadrightarrow B\)

FD-is-an-MVD rule

4NF \(\implies\) BCNF

Intersection rule

  • If \(A \twoheadrightarrow B\), \(A \twoheadrightarrow C\), then \(A \twoheadrightarrow B \cap C\)

Transitive rule

  • If \(A \twoheadrightarrow B\), \(B \twoheadrightarrow C\), then \(A \twoheadrightarrow C - B\)

Fourth Normal Form

Relation R with MVDs is in 4NF if:

  • For each nontrivial A \(\twoheadrightarrow\) B, A is a key

4.1.5 Shortcomings of BCNF/4NF

4.2 Exercises

  • R(A,B)
  • R(A,C,D)
  • R(A,C)
  • R(C,D)
  • R(C,D)
  • R(A,B,C)
  • R(A,B)
  • R(A,C)
  • R(B,C)
  • R(A,B,D)
  • R(A,B)
  • R(A,D)

5 Unified Modeling Language mini-course

5.1 Lecture Videos

5.1.1 UML Data Modeling

Entity-Relationship Model (E/R)

Unified Modeling Language

  • Data modeling subset

Both are graphical

Both can be translated to relations automatically

  • Or semi-automatically

UML Data Modeling: 5 concepts

  1. Classes
  2. Associations
  3. Association Classes
  4. Subclasses
  5. Composition & Aggregation

Classes

Name, attributes, methods

  • For data modeling: add “pk”, drop methods
|-----------|
|  Student  |
|-----------|
| sID    pk |
| sName     |
| GPA       |
|-----------|
| <methods> |
|-----------|

|-----------|
| College   |
|-----------|
| cName  pk |
| state     |
|-----------|
| <methods> |
|-----------|

Relationships between objects of two classes

|-----------|           |-----------|
|  Student  |           | College   |
|-----------|  Applied  |-----------|
| sID    pk |-----------| cName  pk |
| sName     |         ‣ | state     |
| GPA       |           |-----------|
|-----------|           | <methods> |
| <methods> |           |-----------|
|-----------|

Multiplicity of Associations

Each object of class C1 is related to at least m and at most n objects of class C2

|----|                |----|
| C1 |                | C2 |
|----|        m..n    |----|
|    |----------------|    |
|----|      A         |----|
  • m..* at least m
  • 0..n at most n
  • 0..* no restrictions

The default is 1..1

  • 1..1 abbreviated as 1
  • 0..* abbreviated as *

Students must apply somewhere and may not apply to more than 5 colleges. No college takes more than 20,000 applications.

|-----------|                   |-----------|
|  Student  |                   | College   |
|-----------| 0..20000     1..5 |-----------|
| sID    pk |-------------------| cName  pk |
| sName     |      Applied      | state     |
| GPA       |                   |-----------|
|-----------|                   | <methods> |
| <methods> |                   |-----------|
|-----------|

Types of relationships

One-to-One 0..1 0..1
Many-to-One * 0..1
Many-to-Many * *
Complete 1.. 1..

Default is complete one-to-one

Association Classes

Relationships between objects of two classes,

  • with attributes on relationships
|-----------|                          |-----------|
|  Student  |                          | College   |
|-----------|        Applied           |-----------|
| sID    pk |--------------------------| cName  pk |
| sName     |           |              | state     |
| GPA       |      |----------|        |-----------|
|-----------|      | AppInfo  |        | <methods> |
| <methods> |      |----------|        |-----------|
|-----------|      | date     |
                   | decision |
                   |----------|

It’s hard to have more than one relationship association between the same student and college.

Eliminating Association Classes

  • Unnecessary if 0..1 or 1..1 multiplicity
|----|                |----|
| C1 |                | C2 |
|----| *         1..1 |----|
| A3 |----------------| A4 |
|----|       |        |----|
           |----|
           | AC |
           |----|
           | A1 |
           | A2 |
           |----|

Self-Associations

Associations between a class and itself

|---------|
| Student |*
|---------|-----|
|         |*    | Sibling
|---------|-----|


|---------|
| College |home
|---------|------|
|         |1..1  | Branch
|         |0..10 |
|---------|------|
           Satellite

Subclasses

                    |---------|
                    | Student |  Superclass
                    |---------|  {complete, overlapping}
                    | sID  pk |
                    | sName   |
                    | GPA     |
                    |---------|
                         ↑ inherit
     |--------------------------------------|
     |                   |                  |
|----------|     |-----------|    |------------|                |------------|
| ForeignS |     | DomesticS |    | APStudents |                | APCourse   |
|----------|     |-----------|    |------------|     Took       |------------|
| Country  |     | State     |    |            |----------------| Course# pk |
|          |     | SSN       |    |            |1..*   |   1..10| title      |
|----------|     |-----------|    |------------|       |        | units      |
                                                       |        |------------|
                                                   |--------|
                                                   | APInfo |
                                                   |--------|
                                                   | year   |
                                                   | grade  |
                                                   |--------|
  • Superclass = Generalization
  • Subclass = Specialization
  • Incomplete (Partial) vs. Complete
    • Complete: every obj in the superclass is in at least one subclass
  • Disjoint (Exclusive) vs. Overlapping
    • Disjoint: every obj in the superclass is in at most one subclass

Composition & Aggregation

  • Objects of one class belong to objects of another class
|----------|               |------------|
| College  |               | Department |
|----------| 1..1          |------------|
| cName pk |◆--------------| dName      |
| state    |◇-------|      | building   |
|----------| 0..1   |      |------------|
                    |
               |-----------|
               | Apartment |
               |-----------|
               | addr pk   |
               | #units    |
               |-----------|

The symbol ◆ indicates composition, implicitly specifying 1..1.

  • Department belongs to College.

The symbol ◇ indicates aggregation, implicitly specifying 0..1.

  • Some apartments are owned by or associated with colleges, but not all of them are.

5.1.2 UML to Relations

High-Level Database Design Model

  • User-friendly (graphical) specification language
  • Translated into model of DBMS

Designs can be translated to relations automatically

  • Provided every “regular” class has a key

Every class becomes a relation: pk \(\rightarrow\) primary key

Associations

  • Relation with key from each side

Applied(sID, cName)

Keys for association relations depends on multiplicity

|-------|                      |-------|
|  C1   |                      |  C2   |
|-------| 0..1               * |-------|
| K1 pk |----------------------| K2 pk |
| O1    |          A           | O2    |
|-------|                      |-------|

C1( K1, O1) C2( K2, O2)

A(K1, K2)

|-----------|                          |-----------|
|  Student  |                          | College   |
|-----------| *      Applied     1..1  |-----------|
| sID    pk |--------------------------| cName  pk |
| sName     |           |              | state     |
| GPA       |      |----------|        |-----------|
|-----------|      | AppInfo  |        | <methods> |
| <methods> |      |----------|        |-----------|
|-----------|      | date     |
                   | decision |
                   |----------|

Applied( sID, cName)

C1( K1, O1) C2( K2, O2) \(\leftarrow\) C2( K2, O2, K1) A(K1, K2)

Student( sID, sName, GPA) \(\leftarrow\) Student( sID, sName, GPA, cName) College( cName, state) Applied( sID, cName)

Association Classes

Student( sID, sName, GPA) College( cName, state) Applied( sID, cName, date, decision)

Require a key for every “regular” class.

Self-Associations

|---------|
| Student |*
|---------|-----|
|         |*    | Sibling
|---------|-----|

Student( sID, sName, GPA) Sibling( sID1, sID2)

|----------|
| College  | home
|----------|-------|
| cName pk | 1..1  | Branch
| state    |       |
| enr      | 0..10 |
|----------|-------|
             Satellite

College( cName, state, enr) Branch(home, satellite)

College( cName, state, enr, home) ???

Subclasses

  1. Subclass relations contain superclass key + specialized attrs.
  2. Subclass relations contain all attributes.
  3. One relation containing all superclass + subclass attrs.

subclasses are not regular classes.

                     |-------|
                     |   S   |  superclass
                     |-------|
                     | K  pk |
                     | A     |
                     |-------|
                         ↑                   Subclasses
   |---------------------+-------------------|
   |                                         |
|----|                                     |----|
| S1 |                                     | S2 |
|----|                                     |----|
| B  |                                     | C  |
|----|                                     |----|
  1. S( K, A), S1( K, B), S2( K, C)
  2. S( K, A), S1( K, A, B), S2( K, A, C)
  3. S( K, A, B, C)

Heavily overlapping \(\Rightarrow\) design 3

Disjoint, Complete \(\Rightarrow\) design 2 with S( K, A) discarded

                    |---------|
                    | Student |  Superclass
                    |---------|  {complete, overlapping}
                    | sID  pk |
                    | sName   |
                    | GPA     |
                    |---------|
                         ↑ inherit
     |--------------------------------------|
     |                   |                  |
|----------|     |-----------|    |------------|                |------------|
| ForeignS |     | DomesticS |    | APStudents |                | APCourse   |
|----------|     |-----------|    |------------|     Took       |------------|
| Country  |     | State     |    |            |----------------| Course# pk |
|          |     | SSN    pk |    |            |1..*   |   1..10| title      |
|----------|     |-----------|    |------------|       |        | units      |
                                                       |        |------------|
                                                   |--------|
                                                   | APInfo |
                                                   |--------|
                                                   | year   |
                                                   | grade  |
                                                   |--------|

Student( sID, sName) ForeignS( sID, Country) DomesticS( sID, State, SSN) APStudent(sID) \(\leftarrow\) eliminated, implicated in Took APCourse( Course#, title, units) Took( sID, Course#, year, grade)

Composition & Aggregation

|----------|               |------------|
| College  |               | Department |
|----------| 1..1          |------------|
| cName pk |◆--------------| dName      | Not "regular"
| state    |◇-------|      | building   |
|----------| 0..1   |      |------------|
                    |
               |-----------|
               | Apartment |
               |-----------|
               | addr pk   |
               | #units    |
               |-----------|

College( cName, state) Department(dName, building, cName) Apartment( addr, #units, cName) ↑ nullable

5.2 Exercises

6 Indexes and Transactions mini-course

6.1 Lecture Videos

6.1.1 Indexes

Indexes

  • Primary mechanism to get improved performance
  • Persistent data structure, stored in database
  • Many interesting implementation issues
  A B C
1 cat 2
2 dog 5
3 cow 1
4 dog 9
5 cat 2
6 cat 8
7 cow 6
 

Index on T.A

  • T.A = ’cow’
  • T.A = ’cat’

Index on T.B

  • T.B = 2
  • T.B < 6
  • 4 < T.B ≤ 8

Index on T.(A, B)

  • T.A = ’cat’ and T.B > 5
  • T.A < ’d’ and T.B = 1

Utility

  • Index = difference between full table scans and immediate location of tuples
    • Orders of magnitude performance difference
  • Underlying data structures
    • Balanced trees (B trees, B+ trees)
      • A = V, A < V, V1 <= A <= V2
      • logarithmic time
    • Hash tables
      • constant time
Select sName
From student
Where sID = 18942

Index on sID

Many DBMS’s build indexes automatically on PRIMARY KEY (and sometimes UNIQUE) attributes

Select sID
From Student
Where sName = 'Mary' And GPA > 3.9
  • Index on sName ← hash or tree
  • Index on GPA ← tree-based
  • Index on (sName, GPA)
Select sName, cName
From Student, Apply
Where Student.sID = Apply.sID
  • Index on student.sid or apply.sid or both

Query planning & optimization

Downsides of Indexes

  1. Extra space - marginal
  2. Index creation - medium
  3. Index maintenance - can offset benefits

Picking which indexes to create

Benefits of an index depends on:

  • Size of table (and possibly layout)
  • Data distributions
  • Query vs. update load

“Physical design advisors”

Input: database (statistics) and workload Output: recommended indexes

Query Optimizer takes Database statistics, Query or update, and indexes as input and gives best execution plan with estimated cost as output.

Physical design advisor experiments different indexes through query optimizer and gives recommended indexes, of which benefits outweigh drawbacks.

Create Index IndexName on T(A)
Create Index IndexName on T(A1, A2, ..., An)
Create Unique Index IndexName on T(A)
Drop Index IndexName

6.1.2 Introduction to Transactions

Motivated by two independent requirements

  • Concurrent database access
  • Resilience to system failures

Concurrent Access: Attribute-level Inconsistency

Update College Set enrollment = enrollment + 1000
Where cName = 'Stanford'

concurrent with …

Update College Set enrollment = enrollment + 1500
Where cName = 'Stanford'

get; modify; put

  • Possible outcomes
\begin{align*} 15,000 + 2500 &= 17,500 \\ + 1000 &= 16,000 \\ + 1500 &= 16,500 \end{align*}

Concurrent Access: Tuple-level Inconsistency

Update Apply Set major = 'CS' Where sID = 123

concurrent with …

Update Apply Set decision = 'Y' Where sID = 123

get; modify; put

  • both changes
  • one of the two changes

Concurrent Access: Table-level Inconsistency

Update Apply Set decision = 'Y'
Where sID In (Select sID From Student Where GPA > 3.9)

concurrent with …

Update Student Set GPA = 1.1 * GPA Where sizeHS > 2500

Concurrent Access: Multi-statement inconsistency

Insert Into Archive
  Select * From Apply Where decision = 'N';
Delete From Apply Where decision = 'N';

concurrent with …

Select Count(*) From Apply;
Select Count(*) From Archive;

Concurrency Goal

Execute sequence of SQL statements so they appear to be running in isolation

  • Simple solution: execute them in isolation

But want to enable concurrency whenever safe to do so

  • Multiprocessor
  • Multithreaded
  • Asynchronous I/O

Resilience to System Failures

Rather unpleasant inconsistent state can occur due to crashes when there bulk loading to DBMS

Or

Insert Into Archive
  Select * From Apply Where decision = 'N';
Delete From Apply Where decision = 'N';

Crash failure

Lots of updates buffered in memory

System-Failure Goal

Guarantee all-or-nothing execution, regardless of failures

Solution for both concurrency and failures

Transactions

A transaction is a sequence of one or more SQL operations treated as a unit

  • Transactions appear to run in isolation
  • If the system fails, each transaction’s changes are reflected either entirely or not at all

SQL standard:

  • Transaction begins automatically on first SQL statement
  • On “commit” transaction ends and new one begins
  • Current transaction ends on session termination
  • “Autocommit” turns each statement into transaction

6.1.3 Transaction Properties

ACID Properties

  • Atomicity, Consistency, Isolation, Durability

Isolation

  • Serializability: Operations may be interleaved, but execution must be equivalent to some sequential (serial) order of all transactions

locking portion of the database

Attribute-level Inconsistency

  • T1; T2
  • T2; T1

15,000 → 17,500

Tuple-level Inconsistency

  • T1; T2
  • T2; T1

both changes

Table-level Inconsistency

  • T1; T2
  • T2; T1

Order matters here.

Multi-statement inconsistency

Order also matters here.

Durability

  • If system crashes after transaction commits, all effects of transaction remain in database.

T2 {S1; S2; S3; commit} crashes

implemented by logging

Atomicity

  • Each transaction is “all-or-nothing,” never left half done

T2 { S1; S2; ... crashes ...; commit }

Logging

Transaction Rollback (= Abort) ☆

  • Undoes partial effects of transaction
  • Can be system or client initiated
Begin Transaction;
<get input from user>
SQL commands based on input
<confirm results with user>
If ans = 'ok' Then Commit; Else Rollback;

Rollback only undoes data itself on database, but doesn’t affect variable, and cash delivery etc.

Never hold a transaction for long time. Locking

Consistency

  • Each client, each transaction:
    • Can assume all constraints hold when transaction begins
    • Must guarantee all constraints hold when transaction ends

Serializability ⇒ constraints always hold

6.1.4 Isolation Levels

Serializability ⇒ Overhead, Reduction in concurrency

Weaker “Isolation Levels” (from weak to strong)

  • Read Uncommitted
  • Read Committed
  • Repeatable Read
  • Serializable
  • ↓ Overhead ↑ Concurrency
  • ↓ Consistency Guarantees

Isolation Levels

  • Per transaction
  • “In the eye of the beholder”

Dirty Reads

  • “Dirty” data item: written by an uncommitted transaction

Example 1

Update College Set enrollment = enrollment + 1000
Where cName = 'Stanford'

concurrent with …

Select Avg(enrollment) From College

Example 2

Update Student Set GPA = 1.1 * GPA Where sizeHS > 2500

concurrent with …

Select GPA From Student Where sID = 123

concurrent with …

Update Student Set sizeHS = 2600 Where sID = 234

There is no such thing as dirty read within the same transaction.

Isolation Level: Read Uncommitted

  • A transaction may perform dirty reads

Query concurrent with …

Set Transaction Isolation Level Read Uncommitted;
Select Avg(GPA) From Student;

Isolation Level: Read Committed

  • A transaction may not perform dirty reads

Still does not guarantee global serializability

Query concurrent with …

Set Transaction Isolation Level Read Committed;
Select Avg(GPA) From Student;
Select Max(GPA) From Student;

Isolation Level: Repeatable Read

  • A transaction may not perform dirty reads
  • An item read multiple times cannot change value

Still does not guarantee global serializability

Update Student Set GPA = 1.1 * GPA;
Update Student Set sizeHS = 1500 where sID = 123;

concurrent with …

Set Transaction Isolation Level Repeatable Read;
Select Avg(GPA) From Student;
Select Avg(sizeHS) From Student;

Neither T1; T2 Nor T2; T1

But a relation can change: “phantom” tuples

Insert Into Student -- [100 new tuples]

concurrent with …

Set Transaction Isolation Level Repeatable Read;
Select Avg(GPA) From Student;
Select Max(GPA) From Student;

The inserted tuples are called phantom tuples

Delete From Student -- [100 tuples]

concurrent with Query

T1 must occur either before or after T2

Read Only transactions

  • Helps system optimize performance
  • Independent of isolation level

Orthogonal to isolation level

Set Transaction Read Only;
Set Transaction Isolation Level Repeatable Read;
Select Avg(GPA) From Student;
Select Max(GPA) From Student;

Isolation Levels: Summary

  dirty Reads nonrepeatable reads phantoms
Read Uncommitted Y Y Y
Read Committed N Y Y
Repeatable Read N N Y
Serializable N N N
  • Standard default: Serializable
  • Weaker isolation levels
    • Increased concurrency + decreased overhead = increased performance
    • Weaker consistency guarantees
    • Some systems have default Repeatable Read
  • Isolation level per transaction and “eye of the beholder”
    • Each transaction’s reads must conform to its isolation level

Note: An entire workload is globally serializable only if every transaction’s isolation level is serializable.

6.2 Exercises

6.2.1 Index Quiz

6.2.2 Transaction Quiz

7 Constraints and Triggers mini-course

7.1 Lecture Videos

7.1.1 Motivation and Overview

(Integrity) Constraints <static>

  • constrain allowable database states

Triggers <dynamic>

  • monitor database changes,
  • check conditions and initiate actions

Integrity Constraints

  • Impose restrictions on allowable data, beyond those imposed by
  • structure and types
  • 0.0 < GPA <= 4.0
  • enrollment < 50,000 (75,000)

Why use them?

  • Data-entry errors (inserts)
  • Correctness criteria (updates)
  • Enforce consistency
  • Tell system about data

Classification

  • Non-null
  • Key
  • Referential Integrity (foreign key)

Declaring and enforcing constraints

Declaration

  • With original schema - checked after bulk loading
  • Or later - checked on current DB

Enforcement

  • Check after every “dangerous” modification
  • Deferred constraint checking (transaction)

Triggers

  • “Event-Condition-Action Rules”
  • When event occurs, check condition; if true, do action

Examples

  • enrollment > 35,000 -> reject all applicants
  • insert app with GPA > 3.95 -> accept automatically
  • update sizeHS to be > 7,000 -> change to “wrong”, raise error

Why use them?

  • More logic from apps into DBMS
  • To enforce constraints
    • expressive
    • constraint “repair” logic
Create Trigger name
Before|After|Instead Of events
[ referencing-variables ]
[ For Each Row ]
When (condition)
action

Constraints and Triggers

  • For relational databases
  • SQL standard; systems vary considerably

(Integrity) Constraints

  • constrain allowable database states

Triggers

  • monitor database changes,
  • check conditions and initiate actions

7.1.2 Constraints of Several Types

Integrity Constraints

  • Impose restrictions on allowable data, beyond those imposed by structure and types
    • Non-null constraints
    • Key constraints
    • Attribute-based and tuple-based constraints (limits)
    • General assertions (not implemented by any database)

Not all constraints are implemented.

Constraints Demo

create table Student(
    sID int,
    sName text,
    GPA real not null,
    sizeHS int
);
insert into Student values (123, 'Amy', 3.9, 1000);
insert into Student values (234, 'Bob', 3.6, null);
insert into Student values (345, 'Craig', null, 500);
update Student set GPA = null where sID = 123; -- failed
update Student set GPA = null where sID = 456; -- succeed since 456 doesn't exist
create table Student(
    sID int primary key,
    sName text,
    GPA real,
    sizeHS int
);
insert into Student values (123, 'Amy', 3.9, 1000);
insert into Student values (234, 'Bob', 3.6, 1500);
insert into Student values (123, 'Craig', 3.5, 500); -- failed, key error
update Student set sID = 123 where sName = 'Bob';
update Student set sID = sID - 111; -- succeed if query executed for sID 123 before 234
update Student set sID = sID + 111; -- failed if query executed for sID 123 before 234
create table Student(
    sID int primary key,
    sName text unique,
    GPA real,
    sizeHS int
);
insert into Student values (123, 'Amy', 3.9, 1000);
insert into Student values (234, 'Bob', 3.6, 1500);
insert into Student values (345, 'Amy', 3.5, 500); -- failed
insert into Student values (456, 'Doris', 3.9, 1000);
insert into Student values (567, 'Amy', 3.8, 1500); -- failed
create table College(
    cName text,
    state text,
    enrollment int,
    primary key (cName, state)
);
insert into College values ('Mason', 'CA', 10000);
insert into College values ('Mason', 'NY', 5000);
insert into College values ('Mason', 'CA', 2000);
create table Apply(
    sID int,
    cName text,
    major text,
    decision text,
    unique(sID, cName),
    unique(sID, major)
);
insert into Apply values (123, 'Stanford', 'CS', null);
insert into Apply values (123, 'Berkeley', 'EE', null);
insert into Apply values (123, 'Stanford', 'biology', null); -- failed
insert into Apply values (234, 'Stanford', 'biology', null);
insert into Apply values (123, 'MIT', 'EE', null); -- failed
insert into Apply values (123, 'MIT', 'biology', null);
update Apply set major = 'CS' where cName = 'MIT';
insert into Apply values (123, null, null, 'Y');
insert into Apply values (123, null, null, 'N');

Both queries above succeed. SQL standard and most systems allow duplicate null values for unique key. Most systems do not permit repeated null values for primary key.

Attribute check constraint

create table Student(
    sID int,
    sName text,
    GPA real check(GPA <= 4.0 and GPA > 0.0),
    sizeHS int check(sizeHS < 5000)
);
insert into Student values (123, 'Amy', 3.9, 1000);
insert into Student values (234, 'Bob', 4.6, 1500);
create table Apply(
    sID int,
    cName text,
    major text,
    decision text,
    check(decision = 'N' or cName <> 'Stanford' or major <> 'CS')
);
insert into Apply values (123, 'Stanford', 'CS', 'N');
insert into Apply values (123, 'MIT', 'CS', 'Y');
insert into Apply values (123, 'Stanford', 'CS', 'Y');
update Apply set decision = 'Y' where cName = 'Stanford';
update Apply set cName = 'Stanford' where cName = 'MIT';

Both queries failed on SQLite and Postgres, but not in MySQL (MySQL sucks).

create table Student(
    sID int,
    sName text,
    GPA real check(GPA is not null),
    sizeHS int
);

The above is equivalent to GPA real not null.

create table T(
    A int check(A not in (select A from T))
);

The above check constraint implements key constraint. It fails on SQLite for several reasons. The order of execution and check matters. It works for at least some database system.

create table T(
    A int check((select count(distinct A) from T) =
                (select count(*) from T))
);

No known database allows subquery, esp. aggregation, in check expression, so the above query also doesn’t work.

create table Student(
    sID int,
    sName text,
    GPA real,
    sizeHS int
);

create table Apply(
    sID int,
    cName text,
    major text,
    decision text,
    check(sID in (select sID from Student))
);

The above table definitions is valid according to SQL standards but no database implements subquery in check constraint.

create table College(
    cName text,
    state text,
    enrollment int,
    check(enrollment > (select max(sizeHS) from Student))
);

Also valid in SQL standard but no database allows this. Now, suppose this works for some system, it has other problems. When student table changes, the check constraint may no longer be valid.

General Assertions

  • They are very powerful but unfortunately not implemented by any RDBMS.
create assertion Key
check ((select count(distinct A) from T) =
       (select count(*) from T));
create assertion ReferentialIntegrity
check  (not exists (select * from Apply
                    where sID not in (select sID from Student)));

It’s common to write not exists in general assertions.

create assertion AvgAccept
check (3.0 < (select avg(GPA) from Student
              where sID in
                (select sID from Apply where decision = 'Y')));

7.1.3 Referential Integrity

Integrity Constraints

  • Impose restrictions on allowable data, beyond those imposed by structure and types

Referential integrity = Integrity of references = No “dangling pointers”

Referential integrity from R.A to S.B

  • Each value in column A of table R must appear in column B of table S
    • A is called the “foreign key” (Foreign key constraints)
    • B is usually required to be the primary key for table S or at least unique
    • Multi-attribute foreign keys are allowed

Referential integrity is directional

Referential Integrity Enforcement (R.A to S.B)

  • Potentially violating modifications:
    • Insert into R
    • Delete from S
    • Update R.A
    • Update S.B

Special actions:

  • Delete from S
    • Delete from S
      • Restrict (default), Set Null, Cascade
    • Update S.B
      • Restrict (default), Set Null, Cascade
create table College(
    cName text primary key,
    state text,
    enrollment int
);

create table Student(
    sID int primary key,
    sName text,
    GPA real,
    sizeHS int
);

create table Apply(
    sID int references Student(sID),
    cName text references College(cName),
    major text,
    decision text
);
insert into Apply values (123, 'Stanford', 'CS', 'Y');
insert into Apply values (234, 'Berkeley', 'biology', 'N');

Both queries above failed.

insert into Student values (123, 'Amy', 3.9, 1000);
insert into Student values (234, 'Bob', 3.6, 1500);
insert into College values ('Stanford', 'CA', 15000);
insert into College values ('Berkeley', 'CA', 36000);

After executing the above queries, the previous queries succeed.

update Apply set sID = 345 where sID = 123; -- failed, break referential integrity
update Apply set sID = 234 where sID = 123; -- succeed
delete from College where cName = 'Stanford'; -- failed
delete from College where sID = 234;          -- failed
delete from College where sID = 123;          -- succeed
update College set cName = 'Bezerkeley' where cName = 'Berkeley'; -- failed
drop table Student;             -- failed
create table Apply(
    sID int references Student(sID) on delete set null,
    cName text references College(cName) on update cascade,
    major text,
    decision text
);
delete from Student where sID > 200;

The Bezerkeley query now succeeds.

create table T (
    A int,
    B int,
    C int,
    primary key (A, B),
    foreign key (B, C) references T(A, B) on delete cascade
);
insert into T values (1, 1, 1);
insert into T values (2, 1, 1);
insert into T values (3, 2, 1);
insert into T values (4, 3, 2);
insert into T values (5, 4, 3);
insert into T values (6, 5, 4);
insert into T values (7, 6, 5);
insert into T values (8, 7, 6);
delete from T where A = 1;

The above query will delete all the rows in T.

7.1.4 Triggers Introduction

Triggers

  • “Event-Condition-Action Rules”
  • When event occurs, check condition; if true, do action
    1. Move monitoring logic from apps into DBMS
    2. Enforce constraints
      • Beyond what constraint system supports
      • Automatic constraint “repair”

Implementations vary significantly

Create Trigger name
Before | After | Instead Of events
[ referencing-variables ]
[ For Each Row ]
When (condition)
action

Events are one of insert on T, delete on T, update [of C1, ..., Cn] on T.

[For Each Row] means that the triggers should be executed once for each modified tuple.

When event is an insert, referencing-variables can only reference the newly inserted data. When event is a delete, referencing-variables can only reference the old deleted data. When event is an update, referencing-variables can reference both old and new data.

If a row-level trigger is concerned and the event is a delete, we can refer both the old row and old table as different variables. Old table does not refer to the old state of the table in the database but specifically refers to the rows deleted in the delete event.

If a statement-level trigger is concerned, only table can be referenced.

(Condition) is like SQL where clause.

action is the action to be performed when triggered. This is where various database systems differ.

Referential Integrity:

  • R.A references S.B, cascaded delete
Create Trigger Cascade
After Delete On S
Referencing Old Row As O
For Each Row
-- [ no condition ]
Delete From R Where A = O.B
Create Trigger Cascade
After Delete On S
Referencing Old Table As OT
-- [ For Each Row ]
-- [ no condition ]
Delete From R Where A in (Select B from OT)

Tricky Issues

  • Row-level vs. Statement-level
    • New/Old Row and New/Old Table
    • Before, Instead Of
  • Multiple triggers activated at same time
  • Trigger actions activating other triggers (chaining)
    • Also self-triggering, cycles, nested invocations
  • Conditions in When vs. as part of action

SQLite supports only row-level triggers.

T(K,V) - K key, V value

Create Trigger IncreaseInserts
After Insert On T
Referencing New Row As NR, New Table As NT
For Each Row
When (Select Avg(V) From T) <
     (Select Avg(V) From NT)
Update T set V=V+10 where K=NR.K
  • No statement-level equivalent
  • Nondeterministic final state

7.1.5 Triggers Demo

  • Before and After; Insert, Delete, and Update
  • New and Old
  • Conditions and actions
  • Triggers enforcing constraints
  • Trigger chaining
  • Self-triggering, cycles
  • Conflicts
  • Nested trigger invocations

Introduction video used SQL standard

  • No DBMS implements exact standard
  • Some deviate considerably
    • In both syntax and behavior!

Postgres > SQLite >> MySQL

Postgres

  • Expressiveness/behavior = full standard row-level + statement-level, old/new row & table
  • Cumbersome & awkward syntax

SQLite

  • Row-level only, immediate activation => no old/new table

MySQL

  • Row-level only, immediate activation => no old/new table
  • Only one trigger per event type
  • Limited trigger chaining

SQLite - row-level triggers, immediate activation

  • For Each Row implicit if not specified
  • No Old Table or New Table
  • No Referencing clause
    • Old and New predefined for Old Row and New Row
  • Trigger action: SQL statements in begin-end block
create trigger R1
after insert on Student
for each row
when New.GPA > 3.3 and New.GPA <= 3.6
begin
    insert into Apply values (New.sID, 'Stanford', 'geology', null);
    insert into Apply values (New.sID, 'MIT', 'biology', null);
end;
create trigger R2
after delete on Student
for each row
begin
    delete from Apply where sID = Old.sID;
end;
create trigger R3
after update of cName on College
for each row
begin
    update Apply
    set cName = New.cName
    where cName = Old.cName;
end;
create trigger R4
before insert on College
for each row
when exists (select * from College where cName = New.cName)
begin
    select raise(ignore);
end;

create trigger R5
before update of cName on College
for each row
when exists (select * from College where cName = New.cName)
begin
    select raise(ignore);
end;
create trigger R6
after insert on Apply
for each row
when (select count(*) from Apply where cName = New.cName) > 10
begin
    update College set cName = cName || '-Done'
    where cName = New.cName;
end;
create trigger R7
before insert on Student
for each row
when New.sizeHS < 100 or New.sizeHS > 5000
begin
    select raise(ignore);
end;
create trigger R7               -- alternative version using after
after insert on Student
for each row
when New.sizeHS < 100 or New.sizeHS > 5000
begin
    delete from Student where sID = New.sID;
end;
create trigger AutoAccept
after insert on Apply
for each row
when (New.cName = 'Berkeley' and
      3.7 < (select GPA from Student where sID = New.sID) and
      1200 < (select sizeHS from Student where sID = New.sID))
begin
    update Apply
    set decision = 'Y'
    where sID = New.sID
          and cName = New.cName;
end;
create trigger TooMany
after update of enrollment on College
for each row
when (Old.enrollment <= 16000 and New.enrollment > 16000)
begin
    delete from Apply
        where cName = New.cName and major = 'EE';
    update Apply
        set decision = 'U'
        where cName = New.cName
              and decision = 'Y';
end;

self-triggering

create trigger R1
after insert on T1
for each row
begin
    insert into T1 values (New.A+1);
end;

The above query won’t enter infinite loop since SQLite disallow a trigger to be triggered more than once in a trigger processing session. But if we like, we can turn on recursive_triggers on.

pragma recursive_triggers = on;
create trigger R1
after insert on T1
for each row
when (select count(*) from T1) < 10
begin
    insert into T1 values (New.A+1);
end;

trigger each other in a cycle

create trigger R1
after insert on T1
for each row
begin
    insert into T2 values (New.A+1);
end;

create trigger R2
after insert on T2
for each row
begin
    insert into T3 values (New.A+1);
end;

create trigger R3
after insert on T3
for each row
begin
    insert into T1 values (New.A+1);
end;
pragma recursive_triggers = on;
create trigger R3
after insert on T3
for each row
when (select count(*) from T1) < 100
begin
    insert into T1 values (New.A+1);
end;
create trigger R1
after insert on T1
for each row
begin
    update T1 set A = 2;
end;

create trigger R2
after insert on T1
for each row
when exists (select * from T1 where A = 2)
begin
    update T1 set A = 3;
end;

The second trigger is executed first in SQLite.

nested trigger invocation

create trigger R1
after insert on T1
for each row
begin
    insert into T2 values (1);
    insert into T3 values (1);
end;

create trigger R2
after insert on T2
for each row
begin
    insert into T3 values (2);
    insert into T4 values (2);
end;

create trigger R3
after insert on T3
for each row
begin
    insert into T4 values (3);
end;

immediate activation semantics of SQLite for triggers

create trigger R1
after insert on T1
for each row
begin
    insert into T2 select avg(A) from T1;
end;

7.2 Exercises

7.2.1 Constraints and Triggers

7.2.2 SQL Social-Network Triggers

create trigger FriendSameGrade
after insert on Highschooler
for each row
when New.name = 'Friendly'
begin
  insert into Likes
  select New.ID, ID
  from Highschooler
  where grade = New.grade and ID <> New.ID;
end;
create trigger ManageGrade1
after insert on Highschooler
for each row
when New.grade < 9 or New.grade > 12
begin
  update Highschooler
  set grade = null
  where ID = New.ID;
end;
|
create trigger ManageGrade2
after insert on Highschooler
for each row
when New.grade is null
begin
  update Highschooler
  set grade = 9
  where ID = New.ID;
end;
create trigger SymmetricFriend1
after delete on Friend
for each row
begin
  delete from Friend
  where ID1 = Old.ID2 and ID2 = Old.ID1;
end;
|
create trigger SymmetricFriend2
after insert on Friend
for each row
begin
  insert or ignore into Friend
  values (New.ID2, New.ID1);
end;
create trigger Graduate
after update on Highschooler
for each row
when New.grade > 12
begin
  delete from Highschooler
  where ID = New.ID;
end;
create trigger Graduate
after update on Highschooler
for each row
when New.grade > 12
begin
  delete from Highschooler
  where ID = New.ID;
end;
|
create trigger GoUpOneGrade
before update on Highschooler
for each row
when Old.grade + 1 = New.grade
begin
  update Highschooler
  set grade = grade + 1
  where ID in (select ID2 from Friend
               where ID1 = Old.ID);
end;
create trigger Unfriend
after update on Likes
for each row
when Old.ID1 = New.ID1 and Old.ID2 <> New.ID2
     and exists (select * from Friend
                 where ID1 = Old.ID2 and ID2 = New.ID2)
begin
  delete from Friend
  where ((ID1 = Old.ID2 and ID2 = New.ID2)
         or (ID1 = New.ID2 and ID2 = Old.ID2));
end;

8 Views and Authorization mini-course

8.1 Lecture Videos

8.1.1 Defining and Using Views

Three-level vision of database

  • Physical - Conceptual - Logical

Why use views?

  • Hide some data from some users
  • Make some queries easier/more natural
  • Modularity of database access

Real applications tend to use lots and lots (and lots and lots!) of views

Defining and using views

  • View \(V = \mathrm{ViewQuery}(R_1, R_2, \dotsc, R_n)\)
  • Schema of \(V\) is schema of query result
  • Query \(Q\) involving \(V\), conceptually:

    \begin{align*} &V := \mathrm{ViewQuery}(R_1, R_2, \dotsc, R_n) \\ &\mathrm{Evaluate}\ Q \end{align*}
  • In reality, \(Q\) rewritten to use \(R_1, \dotsc, R_n\) instead of \(V\)
  • Notes: \(R_i\) could itself be a view

SQL Syntax

Create View Vname as
<Query>

Create View Vname(A1, A2, ..., An) As
<Query>
create view CSaccept as
select sID, cName
from Apply
where major = 'CS' and decision = 'Y';
select Student.sID, sName, GPA
from Student, CSaccept
where Student.sID = CSaccept.sID and cName = 'Stanford' and GPA < 3.8;

The above query is conceptually equivalent to the following.

create temporary table T as
select sID, cName
from Apply
where major = 'CS' and decision = 'Y';

select Student.sID, sName, GPA
from Student, T
where Student.sID = T.sID and cName = 'Stanford' and GPA < 3.8;

drop table T;

Some simple RDBMS often rewrites as the following.

select Stuent.sID, sName, GPA
from Student,
     (select sID, cName from Apply
      where major = 'CS' and decision = 'Y') as CSaccept
where Student.sID = CSaccept.sID and cName = 'Stanford' and GPA < 3.8;

More powerful RDBMS rewrites as the following.

select Student.sID, sName, GPA
from Student, Apply
where major = 'CS' and decision = 'Y'
      and Student.sID = Apply.sID and cName = 'Stanford' and GPA < 3.8;

Views that use other views

create view CSberk as
select Student.sID, sName, GPA
from Student, CSaccept
where Student.sID = CSaccept.sID and cName = 'Berkeley' and sizeHS > 500;
select * from CSberk where GPA > 3.8;

Naive rewrite

select * from
(select Student.sID, sName, GPA
 from Student, (select sID, cName from Apply
                where major = 'CS' and decision = 'Y') as CSaccept
 where Student.sID = CSaccept.sID and cName = 'Berkeley' and sizeHS > 500) CSberk
where GPA > 3.8;

Flattened rewrite

select *
from Student, Apply
where major = 'CS' and decision = 'Y'
      and Student.sID = Apply.sID and Apply.cName = 'Berkeley'
      and sizeHS > 500 and GPA > 3.8;

The following will result in an error under PostgreSQL since other objects depend on CSaccept.

drop view CSaccept;

However, under both SQLite and MySQL, the error is not thrown until we try tot use CSberk.

The following probably won’t be what you want to do for large database. Additionally, this is what Martin Fowler might refer to as natural aggregate.

create view Mega as
select College.cName, state, enrollment,
       Student.sID, sName, GPA, sizeHS, major, decision
from College, Student, Apply
where College.cName = Apply.cName and Student.sID = Apply.sID;
select sID, sName, GPA, cName
from Mega
where GPA > 3.5 and major = 'CS' and enrollment > 15000;

Flattened rewrite

select Student.sID, sName, GPA, College.cName
from College, Student, Apply
where College.cName = Apply.cName and Student.sID = Apply.sID
      and GPA > 3.5 and major = 'CS' and enrollment > 15000;

8.1.2 View Modifications - Introduction

Querying views

  • Once \(V\) defined, can reference \(V\) like any table
  • Queries involving \(V\) rewritten to use base tables

Modifying views

  • Once \(V\) defined, can we modify \(V\) like any table?
  • Doesn’t make sense: \(V\) is not stored
  • Has to make sense: views are some users’ entire “view” of the database
  • Solution: Modifications to \(V\) rewritten to modify base tables

Usually there are successful translations for view modifications. The problem is there are too many and which one the user intends.

Suppose we have the following schema.

\begin{align*} & R(A, B) \\ & V = \Pi_A(R) \end{align*}

The table \(R\) has \((1, 2)\) and thus \(V = (1)\). Say the user wants to insert \(3\) into \(V\). Then the attribute \(B\) is undecided for the newly inserted tuple.

A more radical example follows.

\begin{align*} & R(N) = \{ (1), (3), (5) \} \\ & V = \operatorname{avg}(N) = \{ (3) \} \end{align*}

And the user wants to update the average to \(7\), then there are so many ways to update \(R\) to achieve this.

Existent systems do things quite differently. There are several approaches.

  1. Rewriting process specified explicitly by view creator
    • + Can handle all modifications
    • − No guarantee of correctness (or meaningfulness)
  2. Restrict views + modifications so that translation to base table modifications is meaningful and unambiguous
    • + No user intervention
    • − Restrictions are significant

The first approach is enabled by instead of triggers or rules in Postgres’ parlance. The second approach is actually adopted by SQL standard.

8.1.3 View Modifications Using Triggers

Unlike queries, view modifications cannot be automated in general.

The following query will result in an error in SQLite, since it doesn’t allow update to views.

delete from CSaccept where sID = 123;

We can create instead of triggers to achieve it.

create trigger CSacceptDelete
instead of delete on CSaccept
for each row
begin
  delete from Apply
  where sID = Old.sID
        and cName = Old.cName
        and major = 'CS' and decision = 'Y';
end;

Now it works.

Of course, I also need an instead of trigger for the following query to work.

update CSaccept set cName = 'CMU' where sID = 345;

An example of erroneous instead of trigger

create trigger CSacceptUpdate
instead of update of cName on CSaccept
for each row
begin
  update Apply
  set cName = New.cName
  where sID = Old.sID
        and cName = Old.cName
        -- correct condition should be
        --     and major = 'CS' and decision = 'Y';
        and major = 'EE' and decision = 'N';
end;
create view CSEE as
select sID, cName, major
from Apply
where major = 'CS' or major = 'EE';
insert into CSEE values (111, 'Berkeley', 'CS');

Another example of erroneous instead of trigger

create trigger CSEEinsert
instead of insert on CSEE
for each row
begin
  insert into Apply values (New.sID, New.cName, New.major, null);
end;

What if we do the following?

insert into CSEE values (222, 'Berkeley', 'biology');

Better trigger follows.

drop trigger CSEEinsert;

create trigger  CSEEinsert
instead of insert on CSEE
for each row
when New.major = 'CS' or New.major = 'EE'
begin
  insert into Apply values (New.sID, New.cName, New.major, null);
end;

This time, the following query won’t insert data into Apply table.

insert into CSEE values (333, 'Berkeley', 'biology');

The following will successfully insert data into the underlying Apply table.

insert into CSEE values (333, 'Berkeley', 'EE');

When a view involves aggregation, it is generally a good idea to disallow user to modify views.

create view HSgpa as
select sizeHS, avg(gpa) as avgGPA
from Student
group by sizeHS;
create view Majors as
select distinct major from Apply;
create view NonUnique as
select * from Student S1
where exists (select * from Student S2
              where S1.sID <> S2.sID
                    and S2.GPA = S1.GPA and S2.sizeHS = S1.sizeHS);
create view Berk as
select Student.sID, major
from Student, Apply
where Student.sID = Apply.sID and cName = 'Berkeley';
create trigger BerkInsert
instead of insert on Berk
for each row
when New.sID in (select sID from Student)
begin
  insert into Apply values (New.sID, 'Berkeley', New.major, null);
end;
insert into Berk
select sID, 'psychology' from Student
where sID not in (select sID from Apply where cName = 'Berkeley');
create trigger BerkDelete
instead of delete on BerkDelete
for each row
begin
  delete from Apply
  where sID = old.sID and cName = 'Berkeley' and major = Old.major;
end;
delete from Berk where major = 'CS';
create trigger BerkUpdate
instead of update of major on Berk
for each row
begin
  update Apply
  set major = New.major
  where sID = New.sId and cName = 'Berkeley' and major = Old.major;
end;
update Berk set major = 'physics' where major = 'psychology';
drop table Apply;
create table Apply(
    sID int,
    cName text,
    major text,
    decision text not null;
);

Now we cannot insert into the CSEE view.

drop table Apply;
create table Apply(
    sID int,
    cName text,
    major text,
    decision text,
    unique(sID, cName, major)
);
insert into CSEE values (123, 'Berkeley', 'CS');
insert into CSEE values (123, 'Berkeley', 'EE');
insert into Berk values (123, 'EE');          -- failed.
update Berk set major = 'CS' where sID = 123; -- failed.

8.1.4 Automatic View Modifications

Restrictions in SQL Standard for “updatable views”:

  1. select (no distinct) on single table T
  2. Attributes not in view can be null or have default value
  3. Subqueries must not refer to T
  4. No group by or aggregation

MySQL is the only database (among MySQL, SQLite, and PostgreSQL) that supports automatic view modifications. However, it is a little bit more generous than the SQL standard.

insert into CSEE values (111, 'Berkeley', 'CS');
insert into CSEE values (222, 'Berkeley', 'psychology');
insert into CSaccept values (333, 'Berkeley');

All of the above queries will succeed. The second query will insert a row with 'psychology' major into Apply table, which won’t reflect in the view CSEE. And the last query will insert a row with null major and decision into Apply table. Both of the results are not what we want.

We can fix it like this.

create view CSaccept2 as
select sID, cName
from Apply
where major = 'CS' and decision = 'Y'
with check option;
create view CSEE2 as
select sID, cName, major
from Apply
where major = 'CS' or major = 'EE'
with check option;

The delete and insert operations on HSgpa view will fail under both MySQL and SQL standard, since it contains aggregation.

Views with subqueries that does not refer to outer table can be modified.

create view Bio as
select * from Student
where sID in (select sID from Apply where major like 'bio%');
delete from Bio where sName = 'Bob';

For the above query, both MySQL and SQL standard will delete Bob from only the Student table, but not from the Apply table. Of course, if the database is set up with proper referential integrity constraints, Bob will be deleted from Apply table due to it.

insert into Bio values (555, 'Karen', 3.9, 1000);

Karen will be inserted into Student table but won’t appear in the Bio view.

create view Bio2 as
select * from Student
where sID in (select sID from Apply where major like 'bio%')
with check option;

Now the following query will fail.

insert into Bio2 values (666, 'Lori', 3.9, 1000);

The with check option comes with an efficiency cost.

MySQL does allow some modification on join view, whereas SQL standard doesn’t.

create view Stan(sID, aID, sName, major) as
select Student.sID, Apply.sID, sName, major
from Student, Apply
where Student.sID = Apply.sID and cName = 'Stanford';
update Stan set sName = 'CS major' where major = 'CS';

The above query succeeds but the following will cause row being updated to disappear from the view Stan.

update Stan set aID = 666 where aID = 123;
create view Stan2(sID, aID, sName, major) as
select Student.sID, Apply.sID, sName, major
from Student, Apply
where Student.sID = Apply.sID and cName = 'Stanford'
with check option;

Now the same query will fail for view Stan2.

insert into Stan(sID, sName) values (777, 'Lance');

The above query will insert Lance only into Student table with null GPA and high school size. This is not what we want. Also, the same query will fail on view Stan2. However, by inserting a correspondent row into Apply, we can make it succeed, as shown below.

insert into Apply values (888, 'Stanford', 'history', 'Y');
insert into Stan2(sID, sName) values (888, 'Mary');
insert into Apply values (999, 'MIT', 'history', 'Y');
insert into Stan2(sID, sName) values (999, 'Nancy');

The second insert query will fail since Nancy doesn’t apply to Stanford.

delete from Stan where sID = 123;

We cannot delete from join view under MySQL.

Under SQL standard, a view is either updatable or not, meaning that if a view is updatable, then we can apply insert, delete, update on it. However, a view can be partially updatable under MySQL since it is a little bit more generous.

8.1.5 Materialized Views

The views discussed in the previous section are sometimes called virtual views.

Materialized views have an additional advantage:

  • improved query performance.

Materialized views

  • View \(V = \mathrm{ViewQuery}(R_1, R_2, \dotsc, R_n)\)
  • Create table \(V\) with schema of query result
  • Execute \(\mathrm{ViewQuery}\) and put results in \(V\)
  • Queries refer to \(V\) as if it’s a table

But…

  • \(V\) could be very large
  • Modifications to \(R_1, R_2, \dotsc, R_n \Rightarrow\)
    • recompute or modify \(V\)
create materialized view CA-CS as
select C.cName, S.sName
from College C, Student S, Apply A
where C.cName = A.cName and S.sID = A.sID
      and C.state = 'CA' and A.major = 'CS'
  • + Can use CA-CS as if it’s a table (it is!)
  • − Modifications to base data invalidate view (cf. general assertions)

Modifications on modifications views?

  • Good news: just update the stored table
  • Bad news: base tables must stay in sync
    • Same issues as with virtual views

Picking which materialized views to create

  • (Efficiency) benefits of a materialized view depend on:
    • Size of data
    • Complexity of view
    • Number of queries using view
    • Number of modifications affecting view
      • Also “incremental maintenance” versus full recomputation

Query-Update trade off (cf. indexes)

Automatic query rewriting to use materialized views

create materialized view CA-Apply as
select sID, cName, major
from Apply A
where cName in (select cName from College where state = 'CA')
select distinct S.sID, S.GPA
from College C, Student S, Apply A
where C.cName = A.cName and S.sID =A.sID
      and S.GPA > 3.5 and C.state = 'CA' and A.major = 'CS'

The above query will be rewritten as something like the following.

select distinct S.sID, S.GPA
from Student S, CA-Apply A
where S.sID = A.sID and S.GPA > 3.5 and A.major = 'CS'

8.1.6 Authorization

Database Authorization

  • Make sure users see only the data they’re supposed to see
  • Guard the database against modifications by malicious users

Users have privileges and can only operate on data for which they are authorized.

  • Select on R or Select(A1, ..., An) on R
  • Insert on R or Insert(A1, ..., An) on R
  • Update on R or Update(A1, ..., An) on R
  • Delete on R
update Apply
set dec = 'Y'
where sID in (select sID from Student where GPA > 3.9)

For the above query, we need the following privileges.

  • Apply: update(dec), select(sID)
  • Student: select(sID, GPA)
delete from student
where sID not in (select sID from Apply)
  • Student: delete
  • Apply: select(sID)

Select student info for Stanford applicants only

create view SS as
select * from Student
where sID in (select sID from Apply where cName = 'Stanford')
  • SS: select

Delete Berkeley applications only

create view BA as
select * from Apply where cName = 'Berkeley'
  • BA: delete

Note: Updatable views is required to provide modification privileges through views

In fact, authorization is one of the most important uses of views in database systems.

Obtaining Privileges

  • Relation creator is owner
  • Owner has all privileges and may grant privileges
grant privs on R to users
  [with grant option]

Revoking Privileges

revoke privs on R from users
  [cascade | restrict]

Cascade: Also revoke privileges granted from privileges being revoked (transitively), unless also granted from another source.

Restrict: Disallow if Cascade would revoke any other privileges. (This is the default.)

8.2 Exercises

8.2.1 Views Quiz

8.2.2 Authorization Quiz

8.2.3 SQL Movie-Rating View Modification

Write an instead-of trigger that enables updates to the title attribute of view LateRating.

Policy: Updates to attribute title in LateRating should update Movie.title for the corresponding movie. (You may assume attribute mID is a key for table Movie.) Make sure the mID attribute of view LateRating has not also been updated – if it has been updated, don’t make any changes. Don’t worry about updates to stars or ratingDate.

create trigger UpdateLateRatingTitle
instead of update of title on LateRating
for each row
when Old.mId = New.mID
begin
update Movie
set title = New.title
where mID = Old.mID;
end;

Write an instead-of trigger that enables updates to the stars attribute of view LateRating.

Policy: Updates to attribute stars in LateRating should update Rating.stars for the corresponding movie rating. (You may assume attributes [mID,ratingDate] together are a key for table Rating.) Make sure the mID and ratingDate attributes of view LateRating have not also been updated – if either one has been updated, don’t make any changes. Don’t worry about updates to title.

create trigger UpdateLateRatingStars
instead of update of stars on LateRating
for each row
when Old.mID = New.mID and Old.ratingDate = New.ratingDate
begin
update Rating
set stars = New.stars
where mID = Old.mID and ratingDate = Old.ratingDate;
end;

Write an instead-of trigger that enables updates to the mID attribute of view LateRating.

Policy: Updates to attribute mID in LateRating should update Movie.mID and Rating.mID for the corresponding movie. Update all Rating tuples with the old mID, not just the ones contributing to the view. Don’t worry about updates to title, stars, or ratingDate.

create trigger UpdateLateRatingMID
instead of update of mID on LateRating
for each row
begin
update Movie
set mID = New.mID
where mID = Old.mID;
update Rating
set mID = New.mID
where mID = Old.mID;
end;

Finally, write a single instead-of trigger that combines all three of the previous triggers to enable simultaneous updates to attributes mID, title, and/or stars in view LateRating. Combine the view-update policies of the three previous problems, with the exception that mID may now be updated. Make sure the ratingDate attribute of view LateRating has not also been updated – if it has been updated, don’t make any changes.

create trigger UpdateLateRating
instead of update on LateRating
for each row
when Old.ratingDate = New.ratingDate
begin
update Movie
set title = New.title,
    mID = New.mID
where mID = Old.mID;
update Rating
set mID = New.mID
where mID = Old.mID;            -- this is the catch.
update Rating
set stars = New.stars
where mID = New.mID and ratingDate = Old.ratingDate;
end;

Write an instead-of trigger that enables deletions from view HighlyRated.

Policy: Deletions from view HighlyRated should delete all ratings for the corresponding movie that have stars > 3.

create trigger DeleteHighlyRated
instead of delete on HighlyRated
for each row
begin
delete from Rating
where mID = Old.mID and stars > 3; -- the second condition is the catch.
end;

Write an instead-of trigger that enables deletions from view HighlyRated.

Policy: Deletions from view HighlyRated should update all ratings for the corresponding movie that have stars > 3 so they have stars = 3.

create trigger DeleteHighlyRated
instead of delete on HighlyRated
for each row
begin
update Rating
set stars = 3
where mID = Old.mID and stars > 3;
end;

Write an instead-of trigger that enables insertions into view HighlyRated.

Policy: An insertion should be accepted only when the (mID,title) pair already exists in the Movie table. (Otherwise, do nothing.) Insertions into view HighlyRated should add a new rating for the inserted movie with rID = 201, stars = 5, and NULL ratingDate.

create trigger InsertHighlyRated
instead of insert on HighlyRated
for each row
when exists (select * from Movie
             where mID = New.mID
                   and title = New.title)
begin
insert into Rating
values (201, New.mID, 5, null);
end;

Write an instead-of trigger that enables insertions into view NoRating.

Policy: An insertion should be accepted only when the (mID,title) pair already exists in the Movie table. (Otherwise, do nothing.) Insertions into view NoRating should delete all ratings for the corresponding movie.

create trigger InsertNoRating
instead of insert on NoRating
for each row
when exists (select * from Movie
             where mID = New.mID
                   and title = New.title)
begin
delete from Rating
where mID = New.mID;
end;

Write an instead-of trigger that enables deletions from view NoRating.

Policy: Deletions from view NoRating should delete the corresponding movie from the Movie table.

create trigger DeleteNoRating
instead of delete on NoRating
for each row
begin
delete from Movie
where mID = Old.mID;
end;

Write an instead-of trigger that enables deletions from view NoRating.

Policy: Deletions from view NoRating should add a new rating for the deleted movie with rID = 201, stars = 1, and NULL ratingDate.

create trigger DeleteNoRating
instead of delete on NoRating
for each row
begin
insert into Rating
values (201, Old.mID, 1, null);
end;

9 On-Line Analytical Processing mini-course

9.1 Lecture Videos

9.1.1 Introduction to OLAP

Two broad types of database activity

  • OLTP—Online Transaction Processing
    • Short transaction
    • Simple queries
    • Touch small portions of data
    • Frequent updates
  • OLAP—Online Analytical Processing
    • Long transaction
    • Complex queries
    • Touch large portions of data
    • Infrequent updates

More terminology

  • Data warehousing
    • Bring data from operational (OLTP) sources into a single “warehouse” for (OLAP) analysis
  • Decision Support System (DSS)
    • Infrastructure for data analysis
    • E.g., data warehouse tuned for OLAP

“Star Schema”

  • Fact table
    • Updated frequently, often append-only, very large
      • Sales transactions
      • Course enrollments
      • Page views
  • Dimension tables
    • Updated infrequently, not as large
      • Stores, items, customers
      • Students, courses
      • Web pages, users, advertisers
  • Fact table references dimension tables, hence the name “star”

OLAP queries

Sales(storeID, itemID, custID, qty, price)
Store(storeID, city, county, state)
Item(itemID, category, brand, color, size)
Customer(custID, cName, gender, age, address)

The first three attributes of Sales are known as dimension attributes, whereas the rest attributes are known as dependent attributes.

Join → Filter → Group → Aggregate

Performance

  • Inherently very slow:
    • special indexes, query processing techniques
  • Extensive use of materialized views

Data Cube (aka multidimensional OLAP)

  • Dimension data forms axes of “cube”
  • Fact (dependent) data in cells
  • Aggregated data on sides, edges, corner

Fact table uniqueness for data cube

  • If dimension attributes are not key, we must aggregate dependent attributes.
  • Date can be used to create key
    • Dimension or dependent? (Dimension)

Drill-down and Roll-up

select state, brand, sum(qty*price)
from Sales F, Store S, Item I
where F.storeID = S.storeID and F.itemID = I.itemID
group by state, brand

Drill-down

  • Examining summary data, break out by dimension attribute

Adding category to the group by clause is an drill-down.

Roll-up

  • Examining data, summarize by dimension attribute

Removing state from the group by clause is an roll-up.

SQL Constructs

  • with cube and with rollup
select dimension-attrs, aggregates
from tables
where conditions
group by dimension-attrs with [cube | rollup]

with cube

  • Add to result: faces, edges, and corner of cube using null values.

with rollup

  • For hierarchical dimensions, portion of with cube.

9.1.2 OLAP Demo

MySQL is used for the whole demo. MySQL supports with rollup and neither PostgreSQL nor SQLite supports it. MySQL does not yet support with cube, but we can simulate it.

select *
from Sales F, Store S, Item I, Customer C
where F.storeID = S.storeID and F.itemID = I.itemID and F.custID = C.custID;
select S.city, I.color,, C.cName, F.price
from Sales F, Store S, Item I, Customer C
where F.storeID = S.storeID and F.itemID = I.itemID and F.custID = C.custID
and S.state = 'CA' and I.category = 'Tshirt' and C.age < 22 and F.price < 25;
select storeID, custID, sum(price)
from Sales
group by storeID, custID;

Drill-down

select storeID, itemID, custID, sum(price)
from Sales
group by storeID, itemID, custID;

Slicing

select F.storeID, itemID, custID, sum(price)
from Sales F, Store S
where F.storeID = S.storeID and state = 'WA'
group by storeID, itemID, custID;

Dicing

select F.storeID, F.itemID, custID, sum(price)
from Sales F, Store S, Item I
where F.storeID = S.storeID and F.itemID = I.itemID
and state = 'WA' and color = 'red'
group by storeID, itemID, custID;

Roll-up

select itemID, sum(price)
from Sales F
group by itemID;
select state, category, sum(price)
from Sales F, Store S, Item I
where F.storeID = S.storeID and F.itemID = I.itemID
group by state, category;
select state, county, category, sum(price)
from Sales F, Store S, Item I
where F.storeID = S.storeID and F.itemID = I.itemID
group by state, county, category;
select state, county, category, gender, sum(price)
from Sales F, Store S, Item I, Customer C
where F.storeID = S.storeID and F.itemID = I.itemID and F.custID = C.custID
group by state, county, category, gender;
select state, gender, sum(price)
from Sales F, Store S, Customer C
where F.storeID = S.storeID and F.custID = C.custID
group by state, gender;
select storeID, itemID, custID, sum(price)
from Sales
group by storeID, itemID, custID with cube;

No RDBMS supports with cube yet, but we can simulate it using with rollup under MySQL.

select storeID, itemID, custID, sum(price)
from Sales
group by storeID, itemID, custID with rollup
union
select storeID, itemID, custID, sum(price)
from Sales
group by itemID, custID, storeID with rollup
union
select storeID, itemID, custID, sum(price)
from Sales
group by custID, storeID, itemID with rollup;
create table Cube as
select storeID, itemID, custID, sum(price) as p
from Sales
group by storeID, itemID, custID with rollup
union
select storeID, itemID, custID, sum(price) as p
from Sales
group by itemID, custID, storeID with rollup
union
select storeID, itemID, custID, sum(price) as p
from Sales
group by custID, storeID, itemID with rollup;
select C.*
from Cube C, Store S, Item I
where C.storeID = S.storeID and C.itemID = I.itemID
and state = 'CA' and color = 'blue' and custID is null;
select sum(p)
from Cube C, Store S, Item I
where C.storeID = S.storeID and C.itemID = I.itemID
and state = 'CA' and color = 'blue' and custID is null;
select C.*
from Cube C, Store S, Item I
where C.storeID = S.storeID and C.itemID = I.itemID
and state = 'CA' and color = 'blue' and custID is not null;
select sum(p)
from Cube C, Store S, Item I
where C.storeID = S.storeID and C.itemID = I.itemID
and state = 'CA' and color = 'blue' and custID is not null;
select F.*
from Sales F, Store S, Item I
where F.storeID = S.storeID and F.itemID = I.itemID
and state = 'CA' and color = 'blue';
select sum(price)
from Sales F, Store S, Item I
where F.storeID = S.storeID and F.itemID = I.itemID
and state = 'CA' and color = 'blue';

Variation of with cube in SQL standard

select storeID, itemID, custID, sum(price)
from Sales F
group by storeID, itemID, custID with cube(storeID, custID);

Simulation using with rollup

select * from
(select storeID, itemID, custID, sum(price)
 from Sales F
 group by itemID, storeID, custID with rollup) X
where X.itemID is not null
union
select * from
(select storeID, itemID, custID, sum(price)
 from Sales F
 group by itemID, custID, storeID with rollup) X
where X.itemID is not null and X.custID is not null;
select storeID, itemID, custID, sum(price)
from Sales F
group by storeID, itemID, custID with rollup;

9.2 Exercises

9.2.1 OLAP Quiz

(defun f (n1 n2 n3)
    (let ((cells (* n1 n2 n3))
          (sides (+ (* n1 n2) (* n1 n3) (* n2 n3)))
          (edges (+ n1 n2 n3)))
      (list cells (+ cells sides edges 1) (+ cells (* n1 n2) n1 1))))

(mapcar (lambda (x) (apply 'f x)) '((2 2 2) (5 4 3) (4 7 3) (5 4 3)))
8 27 15
60 120 86
84 160 117
60 120 86

10 Recursion in SQL mini-course

10.1 Lecture Videos

10.1.1 Basic Recursive WITH Statement—Introduction

SQL is not a “Turing complete” language

  • Simple, convenient, declarative
  • Expressive enough for most database queries
  • But basic SQL can’t express unbounded computations

Example 1: Ancestors

ParentOf(parent, child)
  • Find all of Mary’s ancestors

Example 2: Company hierarchy

Employee(ID, salary)
Manager(mID, eID)
Project(name, mgrID)
  • Final total salary cost of project ’X’

Example 3: Airline flights

Flight(orig, dest, airline, cost)
  • Find cheapest way to fly from ’A’ to ’B’

SQL with Statement

with R1 as (query-1),
     R2 as (query-2),
     ...
     Rn as (query-n)
<query involving R1, ..., Rn (and other tables)>
with R1(A1, A2, ..., Am) as (query-1),
     R2 as (query-2),
     ...
     Rn as (query-n)
<query involving R1, ..., Rn (and other tables)>
with recursive
     R1 as (query-1),
     R2 as (query-2),
     ...
     Rn as (query-n)
<query involving R1, ..., Rn (and other tables)>

SQL standard specifies recursive keyword goes with relation, but implementations differ and make it as modifier for with, and thus any relations are allowed to be recursive if the recursive keyword is given.

with recursive
     R as (base query
           union
           recursive query)
<query involving R1, ..., Rn (and other tables)>

10.1.2 Basic Recursive WITH Statement—Demo

with recursive Ancestor(a, d) as
(select parent as a, child as d from ParentOf
 union
 select Ancestor.a, ParentOf.child as d
 from Ancestor, ParentOf
 where Ancestor.d = ParentOf.parent)
select a from Ancestor where d = 'Mary';
with recursive Superior as
(select * from Manager
 union
 select S.mID, M.eID
 from Superior S, Manager M
 where S.eID = M.mID)
select sum(salary)
from Employee
where ID in
      (select mgrID from Project where name = 'X'
       union
       select eID from Project, Superior
       where Project.name = 'X' and Project.mgrID = Superior.mID);
with recursive Xemps(ID) as
(select mgrID as ID from Project where name = 'X'
 union
 select eID as ID
 from Manager M, Xemps X
 where M.mID = X.ID)
select sum(salary)
from Employee
where ID in (select ID from Xemps);
with recursive
  Yemps(ID) as (select mgrID as ID from Project where name = 'Y'
                union
                select eID as ID
                from Manager M, Yemps Y
                where M.mID = Y.ID),
  Zemps(ID) as (select mgrID as ID from Project where name = 'Z'
                union
                select eID as ID
                from Manager M, Zemps Z
                where M.mID = Z.ID),
select 'Y-total', sum(salary)
from Employee
where ID in (select ID from Yemps)
union
select 'Z-total', sum(salary)
from Employee
where ID in (select ID from Zemps);
with recursive
  Route(orig, dest, total) as
    (select orig, dest, cost as total from Flight
     union
     select R.orig, F.dest, cost+total as total
     from Route R, Flight F
     where R.dest = F.orig)
select min(total) from Route
where orig = 'A' and dest = 'B';
with recursive
  FromA(dest, total) as
    (select dest, cost as total from Flight where orig = 'A'
     union
     select F.dest, cost+total as total
     from FromA FA, Flight F
     where FA.dest = F.orig)
select min(total) from FromA where dest = 'B';
with recursive
  ToB(orig, total) as
    (select orig, cost as total from Flight where dest = 'B'
     union
     select F.orig, cost+total as total
     from ToB, Flight F
     where ToB.orig = Flight.dest)
select min(total) from ToB where orig = 'A';

All the above queries work well given that there is no loop in our Flight.

A loop will overflow all the above queries.

with recursive
  Route(orig, dest, total) as
    (select orig, dest, cost as total from Flight
     union
     select R.orig, F.dest, cost+total as total
     from Route R, Flight F
     where R.dest = F.orig
     and cost+total < all (select total from Route R2
                           where R2.orig = R.orig and R2.dest = F.dest))
select min(total) from Route
where orig = 'A' and dest = 'B';

The above query works conceptually but PostgreSQL doesn’t allow recursive reference within a subquery.

with recursive
  Route(orig, dest, total) as
    (select orig, dest, cost as total from Flight
     union
     select R.orig, F.dest, cost+total as total
     from Route R, Flight F
     where R.dest = F.orig)
select * from Route
where orig = 'A' and dest = 'B' limit 20;

The above query will give a table with 20 results and thus prevent infinite recursion. However, if you replace * with min(total), infinite recursion comes back again. The following is an alternative solution.

with recursive
  Route(orig, dest, total) as
    (select orig, dest, cost as total, 1 from Flight
     union
     select R.orig, F.dest, cost+total as total, R.length+1 as length
     from Route R, Flight F
     where R.dest = F.orig and R.length < 10)
select min(total) from Route
where orig = 'A' and dest = 'B';

10.1.3 Nonlinear and Mutual Recursion

Linear Recursion

  • there is at most one reference to R in the recursive query.

Example: Ancestors

with recursive
  Ancestor(a, d) as
    (select parent as a, child as d from ParentOf
     union
     select A1.a, A2.d
     from Ancestor A1, Ancestor A2
     where A1.d = A2.a)
select a from Ancestor where d = 'Mary';

This is a nonlinear recursion since there are two references to relation Ancestor.

Nonlinear versus linear

  • + Query looks cleaner
  • + Converges faster
  • − Harder to implement
  • SQL standard only requires linear

Mutual Recursion

  • recursive queries refer to recursive relations other than themselves.

They work in tandem. SQL standard allows mutual recursion in limited forms.

Example: Hubs & Authorities

Link(src, dest)
HubStart(node)
AuthStart(node)

Hub: points to greater than or equal to 3 authority node Authority: pointed to greater than or equal to 3 hub node

Some nodes are pre-designated as hub and/or authority nodes.

with recursive
  Hub(node) as (select node from HubStart
                union
                select src as node from Link L
                where dest in (select node from Auth)
                group by src having count(*) >= 3),
  Auth(node) as (select node from AuthStart
                 union
                 select dest as node from Link L
                 where src in (select node from Hub)
                 group by dest having count(*) >= 3)
select * from Hub;

It’s been said there is no implementation of mutual recursion in common RDBMSs. So the above query works only conceptually.

with recursive
  Hub(node) as (select node from HubStart
                union
                select src as node from Link L
                where src not in (select node from Auth)
                and dest in (select node from Auth)
                group by src having count(*) >= 3),
  Auth(node) as (select node from AuthStart
                 union
                 select dest as node from Link L
                 where dest not in (select node from Hub)
                 and src in (select node from Hub)
                 group by dest having count(*) >= 3)
select * from Hub;

The above query has non-unique fixed points (nondeterministic) of the recursion. This is bad and thus SQL standard doesn’t allow this type of mutual recursion. In fact, the crux of the problem is negative dependence, which can also happen even without mutual recursion. Non-mutual recursion with negative dependence is not allowed either.

Example: Recursion with Aggregation

P(x)
with recursive
  R(x) as (select x from P
           union
           select sum(x) from R)
select * from R;

Recursion with aggregation is disallowed in the SQL standard and isn’t supported by any system.

Extends expressiveness of SQL

  • Basic functionality: linear recursion
  • Extended functionality: nonlinear recursion, mutual recursion
  • Disallowed: negative dependence, aggregation

10.2 Exercises

10.2.1 Recursion Quiz

T(A) containing a set of positive integers with no duplicates.

With Recursive Mystery(X,Y) As
 (Select A As X, A As Y From T
  Union
  Select m1.X, m2.Y
  From Mystery m1, Mystery m2
  Where m2.X = m1.Y + 1)
Select Max(Y-X) + 1 From Mystery

The above query actually computes the length of the longest consecutive sequence in T.

With Recursive Peer(X,Y) As
  (Select M1.employee, M2.employee
   From Manager M1, Manager M2
   Where M1.manager = M2.manager AND M1.employee < M2.employee
   Union
   Select M1.employee, M2.employee
   From Peer S, Manager M1, Manager M2
   Where S.X = M1.manager AND S.Y = M2.manager
   And M1.employee < M2.employee)
Select * from Peer

Author: Lei Zhao

Updated: 2018-01-14 Sun 19:44

Validate