Intro to DB
Table of Contents
- 1. SQL mini-course
- 2. XPath and XQuery mini-course
- 3. XSLT mini-course
- 4. Relational Design Theory mini-course
- 5. Unified Modeling Language mini-course
- 6. Indexes and Transactions mini-course
- 7. Constraints and Triggers mini-course
- 8. Views and Authorization mini-course
- 9. On-Line Analytical Processing mini-course
- 10. Recursion in SQL mini-course
Database on Lagunita
1 SQL mini-course
1.1 Lecture videos
1.1.2 Aggregation
select avg(GPA) from Student; select min(GPA) from Student, Apply where Student.sID = Apply.sID and major = 'CS'; select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS'); select count(*) from College where enrollment > 15000; select count(*) from (select distinct sID from Apply where cName = 'Cornell'); select count(distinct sID) from Apply where cName = 'Cornell'; select (select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS')) - (select avg(GPA) from Student where sID not in (select sID from Apply where major = 'CS')); select cName, count(*) from Apply group by cName; select state, sum(enrollment) from College group by state; select cName, major, max(mx - mn) from (select cName, major, min(GPA) as mn, max(GPA) as mx from Student join Apply using(sID) group by cName, major) M; select sID, sName, count(distinct cName) from Student join Apply using(sID) group by sID; select sID, sName, count(cName) from Student join (select distinct sID, cName from Apply) using(sID) group by sID; select sID, sName, count(distinct cName) from Student left join Apply using(sID) group by sID; select sID, sName, count(distinct cName) from Student join Apply using(sID) group by sID union select sID, sName, 0 from Student where sID not in (select sID from Apply); select cName from Apply group by cName having count(*) < 5; select cName from (select distinct cName from Apply) S where (select count(*) from Apply where cName = S.cName) < 5; select cName from Apply group by cName having count(distinct sID) < 5; select major from Student join Apply using(sID) group by major having max(GPA) < (select avg(GPA) from Student);
1.1.3 NULL values
select sID, sName, GPA from Student where GPA <= 3.5 or GPA > 3.5 or GPA is null; select sID, sName, GPA, sizeHS from Student where GPA <= 3.5 or sizeHS < 1600 or sizeHS >= 1600; select count(*) from Student where GPA is not null;
SQL uses three-valued logic (3VL).
1.1.4 Data Modification Statements
insert into College values ('Carnegie Mellon', 'PA', 11500); insert into Apply select sID, 'Carnegie Mellon', 'CS', null from Student where sID not in (select sID from Apply); insert into Apply select distinct sID, 'Carnegie Mellon', 'EE', 'Y' from Apply where cName <> 'Carnegie Mellon' and major = 'EE' and decision = 'N'; delete from Student where sID in (select sID from Apply group by sID having count(distinct major) > 2); delete from Apply where sID in (select sID from Apply group by sID having count(distinct major) > 2); delete from College where cName not in (select cName from Apply where major = 'CS'); update Apply set major = 'economics', decision = 'Y' where cName = 'Carnegie Mellon' and sID in (select sID from Student where GPA < 3.6); update Apply set major = 'CSE' where major = 'EE' and sID in (select sID from Student where GPA = (select max(GPA) from Apply join Student using(sID) where major = 'EE')); select * from Apply where major = 'EE' and sID in (select sID from Student where GPA >= all (select GPA from Student join Apply using(sID) where major = 'EE')); update Student set GPA = (select max(GPA) from Student), sizeHS = (select min(sizeHS) from Student); update Apply set decision = 'Y';
1.2 Exercises
1.2.1 Movie-Rating Query Core
Find the titles of all movies directed by Steven Spielberg.
select title from Movie where director = 'Steven Spielberg';
Find all years that have a movie that received a rating of 4 or 5, and sort them in increasing order.
select year from Movie where mID in (select mID from Rating where stars >= 4) order by year;
Find the titles of all movies that have no ratings.
select title from Movie where mID not in (select mID from Rating);
Some reviewers didn’t provide a date with their rating. Find the names of all reviewers who have ratings with a NULL value for the date.
select name from Reviewer where rID in (select rID from Rating where ratingDate is null);
Write a query to return the ratings data in a more readable format: reviewer name, movie title, stars, and ratingDate. Also, sort the data, first by reviewer name, then by movie title, and lastly by number of stars.
select name, title, stars, ratingDate from Reviewer join Rating using(rID) join Movie using(mID) order by name, title, stars;
For all cases where the same reviewer rated the same movie twice and gave it a higher rating the second time, return the reviewer’s name and the title of the movie.
select name, title from (select rID, mID from Rating R1 join Rating R2 using(rID, mID) where R1.stars > R2.stars and R1.ratingDate > R2.ratingDate and (select count(*) from Rating where rID = R1.rID and mID = R1.mID) = 2) join Reviewer using(rID) join Movie using(mID);
For each movie that has at least one rating, find the highest number of stars that movie received. Return the movie title and number of stars. Sort by movie title.
select title, max(stars) from Rating join Movie using(mID) group by mID order by title;
For each movie, return the title and the ’rating spread’, that is, the difference between highest and lowest ratings given to that movie. Sort by rating spread from highest to lowest, then by movie title.
select title, spread from (select mID, max(stars) - min(stars) as spread from Rating group by mID) join Movie using(mID) order by spread desc, title;
Find the difference between the average rating of movies released before 1980 and the average rating of movies released after 1980. (Make sure to calculate the average rating for each movie, then the average of those averages for movies before 1980 and movies after. Don’t just calculate the overall average rating before and after 1980.)
select (select avg(avgStars) from (select mID, avg(stars) as avgStars from Rating group by mID) join Movie using(mID) where year < 1980) - (select avg(avgStars) from (select mID, avg(stars) as avgStars from Rating group by mID) join Movie using(mID) where year >= 1980);
1.2.2 Movie-Rating Query Extras
Find the names of all reviewers who rated Gone with the Wind.
select name from Reviewer where rID in (select rID from Rating where mID = (select mID from Movie where title = 'Gone with the Wind'));
For any rating where the reviewer is the same as the director of the movie, return the reviewer name, movie title, and number of stars.
select distinct name, title, stars from Rating join Reviewer using(rID) join Movie using(mID) where director = name;
Return all reviewer names and movie names together in a single list, alphabetized. (Sorting by the first name of the reviewer and first word in the title is fine; no need for special processing on last names or removing “The”.)
select name from (select name from Reviewer union select title as name from Movie) order by name;
Find the titles of all movies not reviewed by Chris Jackson.
select title from Movie where mID not in (select mID from Rating where rID in (select rID from Reviewer where name = 'Chris Jackson'));
For all pairs of reviewers such that both reviewers gave a rating to the same movie, return the names of both reviewers. Eliminate duplicates, don’t pair reviewers with themselves, and include each pair only once. For each pair, return the names in the pair in alphabetical order.
select distinct R1.name, R2.name from (Rating join Reviewer using(rID)) R1 join (Rating join Reviewer using(rID)) R2 on R1.mID = R2.mID and R1.rID <> R2.rID and R1.name <= R2.name;
For each rating that is the lowest (fewest stars) currently in the database, return the reviewer name, movie title, and number of stars.
select name, title, stars from Rating R1 join Reviewer R2 join Movie M on R1.rID = R2.rID and R1.mID = m.mID and stars = (select min(stars) from Rating);
List movie titles and average ratings, from highest-rated to lowest-rated. If two or more movies have the same average rating, list them in alphabetical order.
select title, avgStars from (select mID, avg(stars) as avgStars from Rating group by mID) join Movie using(mID) order by avgStars desc, title;
Find the names of all reviewers who have contributed three or more ratings. (As an extra challenge, try writing the query without HAVING or without COUNT.)
select name from Reviewer where rID in (select rID from Rating group by rID having count(*) >= 3); select name from (select rID from Rating group by rID having count(*) >= 3) join Reviewer using(rID); select name from Rating join Reviewer using(rID) group by rID having count(*) >= 3; select name from Reviewer where rID in (select rID from Rating R where rID in (select rID from Rating S where rID = R.rID and (mID <> R.mID or stars <> R.stars) and rID in (select rID from Rating where rID = R.rID and (mID <> R.mID or stars <> R.stars) and (mID <> S.mID or stars <> S.stars))));
Some directors directed more than one movie. For all such directors, return the titles of all movies directed by them, along with the director name. Sort by director name, then movie title. (As an extra challenge, try writing the query both with and without COUNT.)
select title, director from Movie where director in (select director from Movie group by director having count(*) > 1) order by director, title; select title, director from Movie R where director in (select director from Movie where director = R.director and mID <> R.mID) order by director, title;
Find the movie(s) with the highest average rating. Return the movie title(s) and average rating. (Hint: This query is more difficult to write in SQLite than other systems; you might think of it as finding the highest average rating and then choosing the movie(s) with that average rating.)
select title, (select avg(stars) from Rating where mID = R.mID) as avgStars from Movie R where avgStars = (select max((select avg(stars) from Rating where mID = S.mID)) from Movie S);
Find the movie(s) with the lowest average rating. Return the movie title(s) and average rating. (Hint: This query may be more difficult to write in SQLite than other systems; you might think of it as finding the lowest average rating and then choosing the movie(s) with that average rating.)
select title, (select avg(stars) from Rating where mID = R.mID) as avgStars from Movie R where avgStars = (select min((select avg(stars) from Rating where mID = S.mID)) from Movie S);
For each director, return the director’s name together with the title(s) of the movie(s) they directed that received the highest rating among all of their movies, and the value of that rating. Ignore movies whose director is NULL.
select distinct director, title, maxStars from (select director, (select max(stars) from Rating join Movie using(mID) where director = D.director) as maxStars from (select distinct director from Movie where director is not null) D) join Movie using(director) join Rating using(mID) where stars = maxStars;
1.2.3 Social-Network Query Core
Find the names of all students who are friends with someone named Gabriel.
select name from Highschooler where ID in (select ID1 from Friend where ID1 = ID and ID2 in (select ID from Highschooler where name = 'Gabriel'));
For every student who likes someone 2 or more grades younger than themselves, return that student’s name and grade, and the name and grade of the student they like.
select (select name from Highschooler where ID = ID1), (select grade from Highschooler where ID = ID1), (select name from Highschooler where ID = ID2), (select grade from Highschooler where ID = ID2) from Likes where (select grade from Highschooler where ID = ID1) - (select grade from Highschooler where ID = ID2) >= 2;
For every pair of students who both like each other, return the name and grade of both students. Include each pair only once, with the two names in alphabetical order.
select * from Likes where ID1 < ID2 intersect select ID2, ID1 from Likes; select (select name from Highschooler where ID = R.ID1), (select grade from Highschooler where ID = R.ID1), (select name from Highschooler where ID = R.ID2), (select grade from Highschooler where ID = R.ID2) from Likes R where (select name from Highschooler where ID = R.ID1) < (select name from Highschooler where ID = R.ID2) and ID2 in (select ID1 from Likes where ID1 = R.ID2 and ID2 = R.ID1);
Find all students who do not appear in the Likes table (as a student who likes or is liked) and return their names and grades. Sort by grade, then by name within each grade.
select name, grade from Highschooler where ID not in (select ID1 from Likes) and ID not in (select ID2 from Likes) order by grade, name;
For every situation where student A likes student B, but we have no information about whom B likes (that is, B does not appear as an ID1 in the Likes table), return A and B’s names and grades.
select (select name from Highschooler where ID = ID1), (select grade from Highschooler where ID = ID1), (select name from Highschooler where ID = ID2), (select grade from Highschooler where ID = ID2) from Likes where ID2 not in (select ID1 from Likes);
Find names and grades of students who only have friends in the same grade. Return the result sorted by grade, then by name within each grade.
select name, grade from Highschooler where ID not in (select ID1 from Friend where (select grade from Highschooler where ID = ID1) <> (select grade from Highschooler where ID = ID2)) order by grade, name;
For each student A who likes a student B where the two are not friends, find if they have a friend C in common (who can introduce them!). For all such trios, return the name and grade of A, B, and C.
select (select name from Highschooler where ID = T.ID1), (select grade from Highschooler where ID = T.ID1), (select name from Highschooler where ID = T.ID2), (select grade from Highschooler where ID = T.ID2), name, grade from (select * from Likes except select * from Friend) T join Highschooler on ID in (select ID2 from Friend where ID1 = T.ID1) and ID in (select ID2 from Friend where ID1 = T.ID2);
Find the difference between the number of students in the school and the number of different first names.
select count(ID) - count(distinct name) from Highschooler;
Find the name and grade of all students who are liked by more than one other student.
select name2, grade2 from (select ID1, name1, grade1, ID2, name as name2, grade as grade2 from (select ID1, name as name1, grade as grade1, ID2 from Likes join Highschooler H on ID1 = H.ID) join Highschooler H on ID2 = H.ID) group by ID2 having count(*) > 1;
1.2.4 Social-Network Query Extras
For every situation where student A likes student B, but student B likes a different student C, return the names and grades of A, B, and C.
select H1.name, H1.grade, H2.name, H2.grade, H3.name, H3.grade from Likes R join Likes S on R.ID2 = S.ID1 and S.ID2 <> R.ID1 join Highschooler H1 on R.ID1 = H1.ID join Highschooler H2 on R.ID2 = H2.ID join Highschooler H3 on S.ID2 = H3.ID;
Find those students for whom all of their friends are in different grades from themselves. Return the students’ names and grades.
select name, grade from Highschooler R where not exists (select * from Friend where ID1 = R.ID and (select grade from Highschooler where ID = ID2 and grade = R.grade));
What is the average number of friends per student? (Your result should be just one number.)
select avg(num) from (select count(*) as num from Friend group by ID1);
The above code doesn’t consider the situation where there exists a student who has no friend. The following code handle that situation correctly.
select avg(num) from (select (select count(*) from Friend where ID1 = ID) as num from Highschooler);
Find the number of students who are either friends with Cassandra or are friends of friends of Cassandra. Do not count Cassandra, even though technically she is a friend of a friend.
select count(*) from Highschooler where name <> 'Cassandra' and exists (select * from Friend R where R.ID1 = ID and (R.ID2 in (select ID from Highschooler where name = 'Cassandra') or R.ID2 in (select ID1 from Friend where ID1 = R.ID2 and ID2 in (select ID from Highschooler where name = 'Cassandra')))); select count(*) from (select ID1, ID2, ID3 from (select L1.ID1, L1.ID2, L2.ID2 as ID3 from Friend L1 join Friend L2 on L1.ID2 = L2.ID1 and L2.ID2 <> L1.ID1)) join Highschooler R on ID1 = R.ID join Highschooler S on ID2 = S.ID join Highschooler T on ID3 = T.ID where S.name = 'Cassandra' or T.name = 'Cassandra';
Find the name and grade of the student(s) with the greatest number of friends.
select (select name from Highschooler where ID = ID1), (select grade from Highschooler where ID = ID1) from Friend group by ID1 having count(*) = (select max(num) from (select count(*) as num from Friend group by ID1));
1.2.5 Movie-Rating Modification
Add the reviewer Roger Ebert to your database, with an rID of 209.
insert into Reviewer values (209, 'Roger Ebert');
Insert 5-star ratings by James Cameron for all movies in the database. Leave the review date as NULL.
insert into Rating select (select rID from Reviewer where name = 'James Cameron'), mID, 5, null from Movie;
For all movies that have an average rating of 4 stars or higher, add 25 to the release year. (Update the existing tuples; don’t insert new tuples.)
update Movie set year = year + 25 where (select avg(stars) from Rating where mID = Movie.mID) >= 4;
Remove all ratings where the movie’s year is before 1970 or after 2000, and the rating is fewer than 4 stars.
delete from Rating where stars < 4 and ((select year from Movie where mID = Rating.mID) < 1970 or (select year from Movie where mID = Rating.mID) > 2000)
1.2.6 Social-Network Modification
It’s time for the seniors to graduate. Remove all 12th graders from Highschooler.
delete from Highschooler where grade = 12;
If two students A and B are friends, and A likes B but not vice-versa, remove the Likes tuple.
delete from Likes where exists (select * from (select * from Friend except select ID2, ID1 from Likes intersect select * from Likes) where ID1 = Likes.ID1 and ID2 = Likes.ID2);
For all cases where A is friends with B, and B is friends with C, add a new friendship for the pair A and C. Do not add duplicate friendships, friendships that already exist, or friendships with oneself. (This one is a bit challenging; congratulations if you get it right.)
insert into Friend select distinct R.ID1, S.ID2 from Friend R join Friend S on R.ID2 = S.ID1 and S.ID2 <> R.ID1 and not exists (select * from Friend where ID1 = R.ID1 and ID2 = S.ID2);
2 XPath and XQuery mini-course
2.1 Lecture videos
2.1.1 XPath Introduction
Not nearly as mature as Querying Relational
- XPath - path expressions + conditions
- XSLT - XPath + transformations, output
- XQuery - XPath + full-featured Q.L.
Think of XML as a tree
Basic Constructs
/ root element and separator {element name} or * @{attribute name} // any descendants including self [condition] e.g. [@price < 50] [number] nth sub-element
Built-in functions (lots of them)
contains(s1, s2) name()
Navigation “axes” (13 of them)
parent:: following-sibling:: descendants:: (does not match self, cf. //) self::
XPath queries operate on and return sequence of elements Sometimes result can be expressed as XML, but not always
2.1.2 XPath Demo
Get the titles of books in the bookstore where the price is lower than 90 and Jeffrey Widom is an author.
doc("BookstoreQ.xml")/Bookstore/Book[@Price < 90 and Authors/Author[Last_name = "Widom" and First_Name="Jeffrey"]]/Title
Get the titles of books where “Ullman” is an author and “Widom” is not an author. The below code doesn’t work since there is no way using XPath language alone to achieve the query.
doc("BookstoreQ.xml")/Bookstore/Book[ Authors/Author/Last_Name = "Ullman" and Authors/Author/Last_Name != "Widom"]/Title
Get all magazines where there’s a book with the same title. (An example of self-join in XPath)
doc("BookstoreQ.xml")//Magazine[ Title = doc("BookstoreQ.xml")//Book/Title]
The equal sign in the above code is implicitly existentially quantified. There is a lot of implicit existential quantification in XPath and XQuery.
All Elements whose parent is not “Bookstore” or “Book”.
doc("BookstoreQ.xml")/Bookstore//*[ name(parent::*) != "Bookstore" and name(parent::*) != "Book"]
The above code use the navigation access parent::
.
All books and magazines with non-unique titles.
doc("BookstoreQ.xml")/Bookstore/(Book|Magazine)[ Title = following-sibling::*/Title or Title = preceding-sibling::*/Title]
Books where every author’s first name includes “J”. This query will need universal quantification.
doc("BookstoreQ.xml")//Book[ count(Authors/Author[contains(First_Name, "J")]) = count(Authors/Author/First_Name)]
The built-in function count
is used to simulate universal
quantification. This technique can solve previously undoable query.
Titles of books where “Ullman” is an author and “Widom” is not an author.
doc("BookstoreQ.xml")/Bookstore/Book[ Authors/Author/Last_Name = "Ullman" and count(Authors/Author[Last_Name = "Widom"]) = 0]/Title
2.1.3 XQuery Introduction
- XQuery is a expression language (compositional)
- Each expression operates on and returns sequence of elements
- XPath is one type of expression
XQuery: FLWOR expression
For $var in expr Let $var := expr Where condition Order By expr Return expr
Mixing queries and XML
<Result> { ...query goes here... } </Result>
2.1.4 XQuery Demo
for $b in doc("BookstoreQ.xml")/Bookstore/Book where $b/@Price < 90 and $b/Authors/Author/Last_Name = "Ullman" return $b/Title
for $b in doc("BookstoreQ.xml")/Bookstore/Book where some $fn in $b/Authors/Author/First_Name satisfies contains($b/Title, $fn) return <Book> { $b/Title } { $b/Authors/Author/First_Name } </Book>
for $b in doc("BookstoreQ.xml")/Bookstore/Book where some $fn in $b/Authors/Author/First_Name satisfies contains($b/Title, $fn) return <Book> { $b/Title } { for $fn in $b/Authors/Authors/Author/First_Name where contains($b/Title, $fn) return $fn } </Book>
<Average> { let $plist := doc("BookstoreQ.xml")/Bookstore/Book/@Price return avg($plist) } </Average>
let $a := avg(doc("BookstoreQ.xml")/Bookstore/Book/@Price) for $b in doc("BookstoreQ.xml")/Bookstore/Book where $b/@Price < $a return <Book> { $b/Title } <Price> { $b/data(@Price) } </Price> </Book>
for $b in doc("BookstoreQ.xml")/Bookstore/Book order by xs:int($b/@Price) return <Book> { $b/Title } <Price> { $b/data(@Price) } </Price> </Book>
for $n in distinct-values(doc("BookstoreQ.xml")//Last_Name) return <Last_Name> { $n } </Last_Name>
for $b in doc("BookstoreQ.xml")/Bookstore/Book where every $fn in $b/Authors/Author/First_Name satisfies contains($fn, "J") return $b
for $b1 in doc("BookstoreQ.xml")/Bookstore/Book for $b2 in doc("BookstoreQ.xml")/Bookstore/Book where $b1/Authors/Author/Last_Name = $b2/Authors/Author/Last_Name and $b1/Title < $b2/Title return <BookPair> <Title1> { data($b1/Title) } </Title1> <Title2> { data($b2/Title) } </Title2> </BookPair>
Again, implicitly existential quantification.
<InvertedBookstore> { for $ln in distinct-values(doc("BookstoreQ.xml")//Author/Last_Name) for $fn in distinct-values(doc("BookstoreQ.xml")//Author[ Last_name=$ln]/First_Name) return <Author> <First_Name> { $fn } </First_name> <Last_Name { $ln } </Last_Name> { for $b in doc("BookstoreQ.xml")/Bookstore/Book[ Authors/Atuhor/Last_Name=$ln] return <Book> { $b/@ISBN } { $b/@Price } { $/@Edition } { $b/Title } { $b/Remark } </Book> </Author> } </InvertedBookstore>
2.2 Exercises
2.2.1 Course-Catalog XPath and XQuery
Return all Title elements (of both departments and courses).
doc("courses.xml")//Title
Return last names of all department chairs.
doc("courses.xml")//Department/Chair//Last_Name
Return titles of courses with enrollment greater than 500.
doc("courses.xml")//Course[@Enrollment > 500]/Title
Return titles of departments that have some course that takes “CS106B” as a prerequisite.
doc("courses.xml")//Department[Course/Prerequisites[Prereq = "CS106B"]]/Title
Return last names of all professors or lecturers who use a middle initial. Don’t worry about eliminating duplicates.
doc("courses.xml")//(Lecturer|Professor)[Middle_Initial]/Last_Name
Return the count of courses that have a cross-listed course (i.e., that have “Cross-listed” in their description).
count(doc("courses.xml")//Course[contains(Description, "Cross-listed")])
Return the average enrollment of all courses in the CS department.
avg(doc("courses.xml")//Department[@Code="CS"]/Course/@Enrollment)
Return last names of instructors teaching at least one course that has “system” in its description and enrollment greater than 100.
doc("courses.xml")//Course[@Enrollment > 100 and contains(Description, "system")]/Instructors//Last_Name
Return the title of the course with the largest enrollment.
for $c in doc("courses.xml")//Course where every $e in doc("courses.xml")//Course/@Enrollment satisfies xs:int($e) <= xs:int($c/@Enrollment) return $c/Title
Only one of the function call xs:int
shown above is necessary since
the other operand will go through type promotion.
2.2.2 Course-Catalog XPath and XQuery Extras
Return the course number of the course that is cross-listed as “LING180”.
data(doc("courses.xml")//Course[contains(Description, "Cross-listed as LING180")]/@Number)
Return course numbers of courses that have the same title as some other course. (Hint: You might want to use the “preceding” and “following” navigation axes for this query, which were not covered in the video or our demo script; they match any preceding or following node, not just siblings.)
for $c1 in doc("courses.xml")//Course where some $c2 in doc("courses.xml")//Course satisfies $c1 != $c2 and $c1/Title = $c2/Title return data($c1/@Number)
Return course numbers of courses taught by an instructor with first name “Daphne” or “Julie”.
data(doc("courses.xml")//Course[Instructors//First_Name = ("Daphne", "Julie")]/@Number)
Return the number (count) of courses that have no lecturers as instructors.
count(doc("courses.xml")//Course[not(Instructors[Lecturer])])
Return titles of courses taught by the chair of a department. For this question, you may assume that all professors have distinct last names.
doc("courses.xml")//Course[ Instructors/Professor/Last_Name = doc("courses.xml")//Department/Chair/Professor/Last_Name ]/Title
for $c in doc("courses.xml")//Course where $c/Instructors/(Professor|Lecturer)/Last_Name = doc("courses.xml")//Department/Chair/(Professor|Lecturer)/Last_Name return $c/Title
for $c in doc("courses.xml")//Course where some $t1 in $c/Instructors/(Professor|Lecturer) satisfies some $t2 in doc("courses.xml")//Department/Chair/(Professor|Lecturer) satisfies deep-equal($t1/*, $t2/*) return $c/Title
Return titles of courses that have both a lecturer and a professor as instructors. Return each title only once.
doc("courses.xml")//Course[Instructors/Lecturer and Instructors/Professor]/Title
Return titles of courses taught by a professor with the last name “Ng” but not by a professor with the last name “Thrun”.
for $c in doc("courses.xml")//Course[Instructors/Professor/Last_Name = "Ng"] where every $ln in $c/Instructors/Professor/Last_Name satisfies $ln != "Thrun" return $c/Title
doc("courses.xml")//Course[Instructors[ Professor/Last_Name = "Ng" and count(Professor[Last_Name = "Thrun"]) = 0]]/Title
Return course numbers of courses that have a course taught by Eric Roberts as a prerequisite.
for $c in doc("courses.xml")//Course where some $p in $c/Prerequisites/Prereq satisfies $p = data( doc("courses.xml")//Course[ Instructors/(Professor|Lecturer)[ First_Name = "Eric" and Last_Name = "Roberts"]]/@Number) return data($c/@Number)
Create a summary of CS classes: List all CS department courses in order of enrollment. For each course include only its Enrollment (as an attribute) and its Title (as a subelement).
<Summary> { for $c in doc("courses.xml")//Department[@Code="CS"]/Course order by xs:int($c/@Enrollment) return <Course> { $c/@Enrollment } { $c/Title } </Course> } </Summary>
Return a “Professors” element that contains as subelements a listing of all professors in all departments, sorted by last name with each professor appearing once. The “Professor” subelements should have the same structure as in the original data. For this question, you may assume that all professors have distinct last names. Watch out – the presence/absence of middle initials may require some special handling. (This problem is quite challenging; congratulations if you get it right.)
<Professors> { for $ln in distinct-values(doc("courses.xml")//Professor/Last_Name) order by $ln return (doc("courses.xml")//Professor[Last_Name=$ln])[1] } </Professors>
Expanding on the previous question, create an inverted course listing: Return an “Inverted_Course_Catalog” element that contains as subelements professors together with the courses they teach, sorted by last name. You may still assume that all professors have distinct last names. The “Professor” subelements should have the same structure as in the original data, with an additional single “Courses” subelement under Professor, containing a further “Course” subelement for each course number taught by that professor. Professors who do not teach any courses should have no Courses subelement at all. (This problem is very challenging; extra congratulations if you get it right.)
<Inverted_Course_Catalog> { for $ln in distinct-values(doc("courses.xml")//Professor/Last_Name) let $prof := (doc("courses.xml")//Professor[Last_Name=$ln])[1] order by $ln return <Professor> { $prof/First_Name } { $prof/Middle_Initial } { $prof/Last_Name } { let $cs := <Courses> { for $c in doc("courses.xml")//Course[ Instructors/Professor/Last_Name = $ln] return <Course> { data($c/@Number) } </Course> } </Courses> return $cs[Course] } </Professor>} </Inverted_Course_Catalog>
2.2.3 World-Countries XPath and XQuery
Return the area of Mongolia.
data(doc("countries.xml")//country[@name="Mongolia"]/@area)
Return the names of all cities that have the same name as the country in which they are located.
for $c in doc("countries.xml")//country[@name = city/name] return $c/city[name = $c/@name]/name
for $c in doc("countries.xml")//country[@name = city/name] return <name> { data($c/@name) } </name>
Return the average population of Russian-speaking countries.
avg(data(doc("countries.xml")//country[language = "Russian"]/@population))
Return the names of all countries that have at least three cities with population greater than 3 million.
data(doc("countries.xml")//country[count(city[xs:int(population) > 3000000]) >= 3]/@name)
The function call xs:int
in the above code is actually unnecessary
since type promotion would occur due to right-hand operand.
Create a list of French-speaking and German-speaking countries. The result should take the form:
<result> <French> <country>country-name</country> <country>country-name</country> ... </French> <German> <country>country-name</country> <country>country-name</country> ... </German> </result>
<result> <French> { for $cn in data(doc("countries.xml")//country[language = "French"]/@name) return <country> { $cn } </country> } </French> <German> { for $cn in data(doc("countries.xml")//country[language = "German"]/@name) return <country> { $cn } </country> } </German> </result>
<result> { let $langs := ("French", "German") for $lang in $langs return element {$lang} { for $cn in data(doc("countries.xml")//country[language=$lang]/@name) return <country> { $cn } </country> } } </result>
Return the countries with the highest and lowest population densities. Note that because the “/” operator has its own meaning in XPath and XQuery, the division operator is infix “div”. To compute population density use “(@population div @area)”. You can assume density values are unique. The result should take the form:
<result> <highest density="value">country-name</highest> <lowest density="value">country-name</lowest> </result>
<result> { for $c1 in doc("countries.xml")//country let $density := $c1/@population div $c1/@area where every $c2 in doc("countries.xml")//country satisfies $density >= $c2/@population div $c2/@area return <highest density="{$density}"> { data($c1/@name) } </highest> } { for $c1 in doc("countries.xml")//country let $density := $c1/@population div $c1/@area where every $c2 in doc("countries.xml")//country satisfies $density <= $c2/@population div $c2/@area return <lowest density="{$density}"> { data($c1/@name) } </lowest> } </result>
2.2.4 World-Countries XPath and XQuery Extras
Return the names of all countries with population greater than 100 million.
data(doc("countries.xml")//country[@population > 100000000]/@name)
Return the names of all countries where over 50% of the population speaks German. (Hint: Depending on your solution, you may want to use “.”, which refers to the “current element” within an XPath expression.)
data(doc("countries.xml")//country[language[@percentage > 50.0] = "German"]/@name)
Return the names of all countries where a city in that country contains more than one-third of the country’s population.
for $c in doc("countries.xml")//country where some $cp in $c/city/population satisfies $cp div $c/@population > 1 div 3 return data($c/@name)
Return the population density of Qatar. Note: Since the “/” operator has its own meaning in XPath and XQuery, the division operator is “div”. To compute population density use “(@population div @area)”.
let $c := doc("countries.xml")//country[@name="Qatar"] return $c/@population div $c/@area
Return the names of all countries whose population is less than one thousandth that of some city (in any country).
let $countries := doc("countries.xml")//country for $c in $countries where some $city in $countries/city satisfies $city/population * 0.001 > $c/@population return data($c/@name)
Return all city names that appear more than once, i.e., there is more than one city with that name in the data. Return only one instance of each such city name. (Hint: You might want to use the “preceding” and/or “following” navigation axes for this query, which were not covered in the video or our demo script; they match any preceding or following node, not just siblings.)
let $cities := doc("countries.xml")//city for $city in distinct-values($cities/name) where count($cities[name = $city]) > 1 return <name>{$city}</name>
Return the names of all countries containing a city such that some other country has a city of the same name. (Hint: You might want to use the “preceding” and/or “following” navigation axes for this query, which were not covered in the video or our demo script; they match any preceding or following node, not just siblings.)
let $countries := doc("countries.xml")//country for $c in $countries where some $c2 in $countries satisfies not($c2 is $c) and $c/city/name = $c2/city/name return data($c/@name)
Return the names of all countries whose name textually contains a language spoken in that country. For instance, Uzbek is spoken in Uzbekistan, so return Uzbekistan. (Hint: You may want to use “.”, which refers to the “current element” within an XPath expression.)
for $c in doc("countries.xml")//country where some $lang in $c/language satisfies contains($c/@name, $lang) return data($c/@name)
Return the names of all countries in which people speak a language whose name textually contains the name of the country. For instance, Japanese is spoken in Japan, so return Japan. (Hint: You may want to use “.”, which refers to the “current element” within an XPath expression.)
data(doc("countries.xml")//country[language[contains(., ../@name)]]/@name)
Return all languages spoken in a country whose name textually contains the language name. For instance, German is spoken in Germany, so return German. (Hint: Depending on your solution, may want to use data(.), which returns the text value of the “current element” within an XPath expression.)
data(doc("countries.xml")//country/language[contains(../@name, .)])
Return all languages whose name textually contains the name of a country in which the language is spoken. For instance, Icelandic is spoken in Iceland, so return Icelandic. (Hint: Depending on your solution, may want to use data(.), which returns the text value of the “current element” within an XPath expression.)
data(doc("countries.xml")//country/language[contains(., ../@name)])
Return the number of countries where Russian is spoken.
count(doc("countries.xml")//country[language = "Russian"])
Return the names of all countries for which the data does not include any languages or cities, but the country has more than 10 million people.
data(doc("countries.xml")//country[not(language or city) and @population > 10000000]/@name)
Return the name of the country with the highest population. (Hint: You may need to explicitly cast population numbers as integers with xs:int() to get the correct answer.)
data(doc("countries.xml")//country[ @population = max(doc("countries.xml")//country/@population) ]/@name) #+END_EXAMPLE) Return the name of the country that has the city with the highest population. (Hint: You may need to explicitly cast population numbers as integers with xs:int() to get the correct answer.) #+BEGIN_EXAMPLE data(doc("countries.xml")//country[ city/population = max(doc("countries.xml")//city/population) ]/@name)
Return the average number of languages spoken in countries where Russian is spoken.
avg(for $c in doc("countries.xml")//country[language = "Russian"] return count($c/language))
Return all country-language pairs where the language is spoken in the country and the name of the country textually contains the language name. Return each pair as a country element with language attribute, e.g.,
<country language="French">French Guiana</country>
for $c in doc("countries.xml")//country for $lang in $c/language where contains($c/@name, $lang) return <country language="{$lang}"> { data($c/@name) } </country>
Return all countries that have at least one city with population greater than 7 million. For each one, return the country name along with the cities greater than 7 million, in the format:
<country name="country-name"> <big>city-name</big> <big>city-name</big> ... </country>
for $country in doc("countries.xml")//country let $c := <country> { $country/@name } { for $city in $country/city where $city/population > 7000000 return <big>{data($city/name)}</big> } </country> return $c[big]
Return all countries where at least one language is listed, but the total percentage for all listed languages is less than 90%. Return the country element with its name attribute and its language subelements, but no other attributes or subelements.
for $country in doc("countries.xml")//country[language] where sum($country/language/@percentage) < 90 return <country> { $country/@name } { $country/language } </country>
Return all countries where at least one language is listed, and every listed language is spoken by less than 20% of the population. Return the country element with its name attribute and its language subelements, but no other attributes or subelements.
for $country in doc("countries.xml")//country[language] where every $lang in $country/language satisfies $lang/@percentage < 20 return <country> { $country/@name } { $country/language } </country>
Find all situations where one country’s most popular language is another country’s least popular, and both countries list more than one language. (Hint: You may need to explicitly cast percentages as floating-point numbers with xs:float() to get the correct answer.) Return the name of the language and the two countries, each in the format:
<LangPair language="lang-name"> <MostPopular>country-name</MostPopular> <LeastPopular>country-name</LeastPopular> </LangPair>
let $countries := doc("countries.xml")//country[count(language) > 1] for $c in $countries for $lang in $c/language where every $lang2 in $c/language satisfies xs:float($lang/@percentage) >= $lang2/@percentage return (<LangPair language="{$lang}"> <MostPopular>{data($c/@name)}</MostPopular> { for $c2 in $countries[language = $lang] let $lang3 := $c2/language[. = $lang] where every $lang2 in $c2/language satisfies xs:float($lang3/@percentage) <= $lang2/@percentage return <LeastPopular> {data($c2/@name)}</LeastPopular> } </LangPair>)[LeastPopular]
let $countries := for $c in doc("countries.xml")//country[count(language) > 1] let $most_lang := for $lang in $c/language where every $lang2 in $c/language satisfies xs:float($lang/@percentage) >= $lang2/@percentage return <MostPopular>{data($lang)}</MostPopular> let $least_lang := for $lang in $c/language where every $lang2 in $c/language satisfies xs:float($lang/@percentage) <= $lang2/@percentage return <LeastPopular>{data($lang)}</LeastPopular> return <country> {$c/@name} {$most_lang} {$least_lang} </country> for $c1 in $countries for $c2 in $countries where $c1/MostPopular = $c2/LeastPopular return <LangPair language="{$c1/MostPopular}"> <MostPopular>{data($c1/@name)}</MostPopular> <LeastPopular>{data($c2/@name)}</LeastPopular> </LangPair>
For each language spoken in one or more countries, create a “language” element with a “name” attribute and one “country” subelement for each country in which the language is spoken. The “country” subelements should have two attributes: the country “name”, and “speakers” containing the number of speakers of that language (based on language percentage and the country’s population). Order the result by language name, and enclose the entire list in a single “languages” element. For example, your result might look like:
<languages> ... <language name="Arabic"> <country name="Iran" speakers="660942"/> <country name="Saudi Arabia" speakers="19409058"/> <country name="Yemen" speakers="13483178"/> </language> ... </languages>
<languages> { for $lang in distinct-values(doc("countries.xml")//language) order by $lang return <language name="{$lang}"> { for $c in doc("countries.xml")//country[language = $lang] return <country name="{$c/@name}" speakers="{xs:int( $c/language[. = $lang]/@percentage div 100 * $c/@population)}"> </country> } </language> } </languages>
3 XSLT mini-course
3.1 Lecture videos
- XSL = Extensible Stylesheet Language
- XSLT = XSL (with) transformations
XSLT is more widely used than XSL.
XSLT: Rule-Based Transformations
- Match template and replace
- Recursively match template
- Extract values
- Iteration (for-each)
- Conditionals (if)
Catches:
- Strange default/whitespace behavior
- Implicit template priority scheme
3.2 Exercises
3.2.1 Course-Catalog XSLT
Return a list of department titles.
Your solution should fill in the following stylesheet:
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match=...> ... template body ... </xsl:template> ... more templates as needed ... </xsl:stylesheet>
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="Department/Title"> <xsl:copy-of select="."/> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
Return a list of department elements with no attributes and two subelements each: the department title and the entire Chair subelement structure.
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="Department"> <Department> <xsl:copy-of select="Title"/> <xsl:copy-of select="Chair"/> </Department> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
3.2.2 Course-Catalog XSLT Extras
Return all courses with enrollment greater than 500. Retain the structure of Course elements from the original data.
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="Course[@Enrollment > 500]"> <xsl:copy-of select="."/> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
Remove from the data all courses with enrollment greater than 60, or with no enrollment listed. Otherwise the structure of the data should be the same.
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:strip-space elements="*"/> <xsl:template match="*|@*|text()"> <xsl:copy> <xsl:apply-templates select="*|@*|text()"/> </xsl:copy> </xsl:template> <xsl:template match="Course[not(@Enrollment) or @Enrollment > 60]"/> </xsl:stylesheet>
Create a summarized version of the EE part of the course catalog. For each course in EE, return a Course element, with its Number and Title as attributes, its Description as a subelement, and the last name of each instructor as an Instructor subelement. Discard all information about department titles, chairs, enrollment, and prerequisites, as well as all courses in departments other than EE. (Note: To specify quotes within an already-quoted XPath expression, use quot;.)
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="Department[@Code='EE']"> <xsl:for-each select="Course"> <Course Number="{@Number}" Title="{Title}"> <xsl:copy-of select="Description"/> <xsl:for-each select="Instructors//Last_Name"> <Instructor><xsl:value-of select="."/></Instructor> </xsl:for-each> </Course> </xsl:for-each> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
Create an HTML table with one-pixel border that lists all CS department courses with enrollment greater than 200. Each row should contain three cells: the course number in italics, course title in bold, and enrollment. Sort the rows alphabetically by course title. No header is needed. (Note: For formatting, just use “table border=1”, and “<b>” and “<i>” tags for bold and italics respectively. To specify quotes within an already-quoted XPath expression, use quot;.)
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="Department[@Code='CS']"> <table border="1"> <xsl:for-each select="Course[@Enrollment > 200]"> <xsl:sort select="Title"/> <tr> <td><i><xsl:value-of select="@Number"/></i></td> <td><b><xsl:value-of select="Title"/></b></td> <td><xsl:value-of select="@Enrollment"/></td> </tr> </xsl:for-each> </table> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
3.2.3 World-Countries XSLT
Return all countries with population between 9 and 10 million. Retain the structure of country elements from the original data.
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="country[9000000 <= @population and @population <= 10000000]"> <xsl:copy-of select="."/> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
Create a table using HTML constructs that lists all countries that have more than 3 languages. Each row should contain the country name in bold, population, area, and number of languages. Sort the rows in descending order of number of languages. No header is needed for the table, but use <table border=“1”> to make it format nicely, should you choose to check your result in a browser. (Hint: You may find the data-type and order attributes of <xsl:sort> to be useful.)
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="countries"> <html> <table border="1"> <xsl:for-each select="country[count(language) > 3]"> <xsl:sort select="count(language)" order="descending"/> <tr> <td><b><xsl:value-of select="@name"/></b></td> <td><xsl:value-of select="@population"/></td> <td><xsl:value-of select="@area"/></td> <td><xsl:value-of select="count(language)"/></td> </tr> </xsl:for-each> </table> </html> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
Create an alternate version of the countries database: for each country, include its name and population as subelements, and the number of languages and number of cities as attributes (called “languages” and “cities” respectively).
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="*|@*|text()"> <xsl:copy> <xsl:apply-templates select="*|@*|text()"/> </xsl:copy> </xsl:template> <xsl:template match="country"> <country languages="{count(language)}" cities="{count(city)}"> <name><xsl:value-of select="@name"/></name> <population><xsl:value-of select="@population"/></population> </country> </xsl:template> </xsl:stylesheet>
3.2.4 World-Countries XSLT Extras
Find all country names containing the string “stan”; return each one within a “Stan” element. (Note: To specify quotes within an already-quoted XPath expression, use quot;.)
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="country[contains(@name, 'stan')]"> <Stan><xsl:value-of select="@name"/></Stan> </xsl:template> <xsl:template match="text()"/> </xsl:stylesheet>
Remove from the data all countries with area greater than 40,000 and all countries with no cities listed. Otherwise the structure of the data should be the same.
<?xml version="1.0" encoding="ISO-8859-1"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:strip-space elements="*"/> <xsl:template match="*|@*|text()"> <xsl:copy> <xsl:apply-templates select="*|@*|text()"/> </xsl:copy> </xsl:template> <xsl:template match="country[@area > 40000 or not(city)]"/> </xsl:stylesheet>
4 Relational Design Theory mini-course
4.1 Lecture videos
4.1.1 Overview
Design “anomalies”
- Redundancy
- Update anomaly
- Deletion anomaly
Design by decomposition
- Start with “mega” relations containing everything
- Decompose into smaller, better relations with same info
- Can do decomposition automatically
Automatic decomposition
- “Mega” relations + properties of the data
- System decomposes based on properties
- Final set of relations satisfies normal form
- No anomalies, no lost information
Properties and Normal Forms
- Functional dependencies \(\Rightarrow\) Boyce-Codd Normal Form
- Multivalued dependencies \(\Rightarrow\) Fourth Normal Form
4NF \(\subset\) BCNF
Functional Dependencies and BCNF
- Apply(SSN, sName, cName)
- Redundancy; Update & Deletion Anomalies
- Storing SSN-sName pair once for each college
- Functional Dependency SSN \(\rightarrow\) sName
- Same SSN always has same sName
- Should store each SSN’s sName only once
- Boyce-Codd Normal Form If A \(\rightarrow\) B then A is a key
SSN should be a key but not.
- Decompose: Student(SSN, sName) Apply(SSN, cName)
Pull out SSN and sName into its own relation
Multivalued Dependencies and 4NF
- Apply(SSN, cName, HS)
- Redundancy; Update & Deletion Anomalies
- Multiplicative effect
- C colleges, H high school
- C * H tuples
- Should be C + H
- Not addressed by BCNF: No functional dependencies
- Multivalued Dependency SSN \(\twoheadrightarrow\) cName, SSN \(\twoheadrightarrow\) HS
- Given SSN has every combination of cName with HS
- Should store each cName and each HS for an SSN once
- Fourth Normal Form If A \(\twoheadrightarrow\) B then A is a key
- Decompose: Apply(SSN, cName) HighSchool(SSN, HS)
4.1.2 Functional Dependencies
Functional dependencies are generally useful concepts
- Data storage - compression
- Reasoning about queries - optimization
Functional dependencies generalizes keys.
Student(SS, sName, address, HScode, HSname, HScity, GPA, priority)
Suppose priority is determined by GPA
Two tuples with same GPA have same priority
\[\forall t, u \in \mathrm{Student}\colon\, t.\mathrm{GPA} = u.\mathrm{GPA} \Rightarrow t.\mathrm{priority} = u.\mathrm{priority}\]
GPA \(\rightarrow\) priority
\[\forall t, u \in R\colon\, t[A_1, \dotsc, A_n] = u[A_1, \dotsc, A_n] \Rightarrow t[B_1, \dotsc, B_n] = u[B_1, \dotsc, B_n]\]
\(A_1, \dotsc, A_n \rightarrow B_1, \dotsc, B_n\)
\(\bar{A} \rightarrow \bar{B}\)
- SSN \(\rightarrow\) sName
- SSN \(\rightarrow\) address (assume students don’t work)
- HScode \(\rightarrow\) HSname, HScity
- HSname, HScity \(\rightarrow\) HScode (assume no schools with same name in the same city)
And
- SSN \(\rightarrow\) GPA
- GPA \(\rightarrow\) priority
- SSN \(\rightarrow\) priority
transitivity
Apply(SSN, cName, state, date, major)
- cName \(\rightarrow\) date
- SSN, cName \(\rightarrow\) major
and
- SSN \(\rightarrow\) state
Functional Dependencies and Keys
\(R(\bar{A}, \bar{B})\)
Trivial Functional Dependency
- \(\bar{A} \rightarrow \bar{B}\), \(\bar{B} \subset \bar{A}\)
Nontrivial FD
- \(\bar{A} \rightarrow \bar{B}\), \(\bar{B} \not\subset \bar{A}\)
Completely nontrivial FD (most interested)
- \(A \rightarrow B\), \(A \cap B = \varnothing\)
Splitting rule
- \(\bar{A} \rightarrow B_1, B_2, \dotsc, B_m \Rightarrow \bar{A} \rightarrow B_1, \bar{A} \rightarrow B_2, \dotsc\)
Combining rule
- \(\bar{A} \rightarrow B_1, \bar{A} \rightarrow B_2, \dotsc \bar{A} \rightarrow B_n \Rightarrow \bar{A} \rightarrow B_1, B_2, \dotsc, B_n\)
Trivial-dependency rules
- A \(\rightarrow\) B, then A \(\rightarrow\) A \(\cup\) B
- A \(\rightarrow\) B, then A \(\rightarrow\) A \(\cap\) B
Transitive rule
- A \(\rightarrow\) B, B \(\rightarrow\) C, then A \(\rightarrow\) C
Closure of Attributes
- Given relations, FDs, set of attributes \(\bar{A}\)
- Find all \(B\) such that \(\bar{A} \rightarrow B\)
\(\bar{A}^+\)
apply transitive rule repeatedly to get the closure
if \(\bar{A}^+\) = all attrs, then \(\bar{A}\) is a key.
Grow algorithms can be used to find all keys given a set of FDs.
If \(\bar{A}\) is a key, then all its supersets are keys. Hence, we are often only interested the smallest key.
Specifying FDs for a relation
- \(S_1\) and \(S_2\) sets of FDs
- \(S_2\) “follows from” \(S_1\) if every relation instance satisfying \(S_1\) also
satisfies \(S_2\)
- \(S_2\): {SSN \(\rightarrow\) priority}
- \(S_1\): {SSN \(\rightarrow\) GPA, GPA \(\rightarrow\) priority}
Does A \(\rightarrow\) B follow from S?
- \(A^+\) based S check if B in set.
- Armstrong’s Axioms
Want: Minimal set of completely nontrivial FDs such that all FDs that hold on the relation follow from the dependencies in this set
4.1.3 Boyce-Codd Normal Form
Decomposition of a relational schema
- \(R(A_1, \dotsc, A_n)\)
- \(R_1(B_1, \dotsc, B_k)\)
- \(R_2(C_1, \dotsc, C_m)\)
Then we want
- \(\bar{B} \cup \bar{C} = \bar{A}\)
- \(R_1 \Join R_2 = R\)
Student(SSN, sName, address, HScode, HSname, HScity, GPA, priority)
This is a decomposition.
S1(SSN, sName, address, HScode, GPA, priority) S2(HScode, HSname, HScity)
This is not a decomposition.
S1(SSN, sName, address, HScode, HSname, HScity) S2(sName, HSname, GPA, priority)
Good decomposition exhibits lossless join property.
BCNF
- Relation R with FDs is in BCNF if:
- For each \(\bar{A} \rightarrow B\), \(\bar{A}\) is a key.
BCNF violation (redundancy, update/delete anomalies)
note: a key can contain another key and thus is a super key.
BCNF decomposition algorithm
- Input: relation R + FDs for R
- Output: decomposition of R into BCNF relations with “lossless join”
- Compute keys for \(R\)
- Repeat until all relations are in BCNF
- Pick any \(R'\) with \(A \rightarrow B\) that violates BCNF
- Decompose \(R'\) into \(R_1(A, B)\) and \(R_2(A, \mathrm{rest})\)
- Compute FDs for \(R_1\) and \(R_2\)
- Compute keys for \(R_1\) and \(R_2\)
Randomness can result in different answer. We can extend \(A \rightarrow BA^+\).
4.1.4 Multivalued Dependencies and 4th Normal Form
Fourth Normal Form \(\subset\) Boyce-Codd Normal Form
\[R \quad \bar{A} \twoheadrightarrow \bar{B} \quad A_1, \dotsc, A_n, B_1, \dotsc, B_n\]
\[\forall t, u \in R \, \exists v \in R:\, t[\bar{A}] = u[\bar{A}] \implies v[\bar{A}] = t[\bar{A}] \land v[\bar{B}] = t[\bar{B}] \land v[\mathrm{rest}] = u[\mathrm{rest}]\]
Tuple-generating dependencies
\(\bar{A} \twoheadrightarrow \bar{B} \implies \bar{A} \twoheadrightarrow \mathrm{rest}\)
Trivial Multivalued Dependency
- \(B \subset A\) or \(A \cup B =\) all attrs
\(A \rightarrow B \implies A \twoheadrightarrow B\)
FD-is-an-MVD rule
4NF \(\implies\) BCNF
Intersection rule
- If \(A \twoheadrightarrow B\), \(A \twoheadrightarrow C\), then \(A \twoheadrightarrow B \cap C\)
Transitive rule
- If \(A \twoheadrightarrow B\), \(B \twoheadrightarrow C\), then \(A \twoheadrightarrow C - B\)
Fourth Normal Form
Relation R with MVDs is in 4NF if:
- For each nontrivial A \(\twoheadrightarrow\) B, A is a key
4.1.5 Shortcomings of BCNF/4NF
5 Unified Modeling Language mini-course
5.1 Lecture Videos
5.1.1 UML Data Modeling
Entity-Relationship Model (E/R)
Unified Modeling Language
- Data modeling subset
Both are graphical
Both can be translated to relations automatically
- Or semi-automatically
UML Data Modeling: 5 concepts
- Classes
- Associations
- Association Classes
- Subclasses
- Composition & Aggregation
Classes
Name, attributes, methods
- For data modeling: add “pk”, drop methods
|-----------| | Student | |-----------| | sID pk | | sName | | GPA | |-----------| | <methods> | |-----------| |-----------| | College | |-----------| | cName pk | | state | |-----------| | <methods> | |-----------|
Relationships between objects of two classes
|-----------| |-----------| | Student | | College | |-----------| Applied |-----------| | sID pk |-----------| cName pk | | sName | ‣ | state | | GPA | |-----------| |-----------| | <methods> | | <methods> | |-----------| |-----------|
Multiplicity of Associations
Each object of class C1 is related to at least m and at most n objects of class C2
|----| |----| | C1 | | C2 | |----| m..n |----| | |----------------| | |----| A |----|
- m..* at least m
- 0..n at most n
- 0..* no restrictions
The default is 1..1
- 1..1 abbreviated as 1
- 0..* abbreviated as *
Students must apply somewhere and may not apply to more than 5 colleges. No college takes more than 20,000 applications.
|-----------| |-----------| | Student | | College | |-----------| 0..20000 1..5 |-----------| | sID pk |-------------------| cName pk | | sName | Applied | state | | GPA | |-----------| |-----------| | <methods> | | <methods> | |-----------| |-----------|
Types of relationships
One-to-One | 0..1 | 0..1 |
Many-to-One | * | 0..1 |
Many-to-Many | * | * |
Complete | 1.. | 1.. |
Default is complete one-to-one
Association Classes
Relationships between objects of two classes,
- with attributes on relationships
|-----------| |-----------| | Student | | College | |-----------| Applied |-----------| | sID pk |--------------------------| cName pk | | sName | | | state | | GPA | |----------| |-----------| |-----------| | AppInfo | | <methods> | | <methods> | |----------| |-----------| |-----------| | date | | decision | |----------|
It’s hard to have more than one relationship association between the same student and college.
Eliminating Association Classes
- Unnecessary if 0..1 or 1..1 multiplicity
|----| |----| | C1 | | C2 | |----| * 1..1 |----| | A3 |----------------| A4 | |----| | |----| |----| | AC | |----| | A1 | | A2 | |----|
Self-Associations
Associations between a class and itself
|---------| | Student |* |---------|-----| | |* | Sibling |---------|-----| |---------| | College |home |---------|------| | |1..1 | Branch | |0..10 | |---------|------| Satellite
Subclasses
|---------| | Student | Superclass |---------| {complete, overlapping} | sID pk | | sName | | GPA | |---------| ↑ inherit |--------------------------------------| | | | |----------| |-----------| |------------| |------------| | ForeignS | | DomesticS | | APStudents | | APCourse | |----------| |-----------| |------------| Took |------------| | Country | | State | | |----------------| Course# pk | | | | SSN | | |1..* | 1..10| title | |----------| |-----------| |------------| | | units | | |------------| |--------| | APInfo | |--------| | year | | grade | |--------|
- Superclass = Generalization
- Subclass = Specialization
- Incomplete (Partial) vs. Complete
- Complete: every obj in the superclass is in at least one subclass
- Disjoint (Exclusive) vs. Overlapping
- Disjoint: every obj in the superclass is in at most one subclass
Composition & Aggregation
- Objects of one class belong to objects of another class
|----------| |------------| | College | | Department | |----------| 1..1 |------------| | cName pk |◆--------------| dName | | state |◇-------| | building | |----------| 0..1 | |------------| | |-----------| | Apartment | |-----------| | addr pk | | #units | |-----------|
The symbol ◆ indicates composition, implicitly specifying 1..1.
- Department belongs to College.
The symbol ◇ indicates aggregation, implicitly specifying 0..1.
- Some apartments are owned by or associated with colleges, but not all of them are.
5.1.2 UML to Relations
High-Level Database Design Model
- User-friendly (graphical) specification language
- Translated into model of DBMS
Designs can be translated to relations automatically
- Provided every “regular” class has a key
Every class becomes a relation: pk \(\rightarrow\) primary key
Associations
- Relation with key from each side
Applied(sID, cName)
Keys for association relations depends on multiplicity
|-------| |-------| | C1 | | C2 | |-------| 0..1 * |-------| | K1 pk |----------------------| K2 pk | | O1 | A | O2 | |-------| |-------|
C1( K1, O1) C2( K2, O2)
A(K1, K2)
|-----------| |-----------| | Student | | College | |-----------| * Applied 1..1 |-----------| | sID pk |--------------------------| cName pk | | sName | | | state | | GPA | |----------| |-----------| |-----------| | AppInfo | | <methods> | | <methods> | |----------| |-----------| |-----------| | date | | decision | |----------|
Applied( sID, cName)
C1( K1, O1)
C2( K2, O2) \(\leftarrow\) C2( K2, O2, K1)
A(K1, K2)
Student( sID, sName, GPA) \(\leftarrow\) Student( sID, sName, GPA, cName)
College( cName, state)
Applied( sID, cName)
Association Classes
Student( sID, sName, GPA) College( cName, state) Applied( sID, cName, date, decision)
Require a key for every “regular” class.
Self-Associations
|---------| | Student |* |---------|-----| | |* | Sibling |---------|-----|
Student( sID, sName, GPA) Sibling( sID1, sID2)
|----------| | College | home |----------|-------| | cName pk | 1..1 | Branch | state | | | enr | 0..10 | |----------|-------| Satellite
College( cName, state, enr) Branch(home, satellite)
College( cName, state, enr, home) ???
Subclasses
- Subclass relations contain superclass key + specialized attrs.
- Subclass relations contain all attributes.
- One relation containing all superclass + subclass attrs.
subclasses are not regular classes.
|-------| | S | superclass |-------| | K pk | | A | |-------| ↑ Subclasses |---------------------+-------------------| | | |----| |----| | S1 | | S2 | |----| |----| | B | | C | |----| |----|
- S( K, A), S1( K, B), S2( K, C)
- S( K, A), S1( K, A, B), S2( K, A, C)
- S( K, A, B, C)
Heavily overlapping \(\Rightarrow\) design 3
Disjoint, Complete \(\Rightarrow\) design 2 with S( K, A) discarded
|---------| | Student | Superclass |---------| {complete, overlapping} | sID pk | | sName | | GPA | |---------| ↑ inherit |--------------------------------------| | | | |----------| |-----------| |------------| |------------| | ForeignS | | DomesticS | | APStudents | | APCourse | |----------| |-----------| |------------| Took |------------| | Country | | State | | |----------------| Course# pk | | | | SSN pk | | |1..* | 1..10| title | |----------| |-----------| |------------| | | units | | |------------| |--------| | APInfo | |--------| | year | | grade | |--------|
Student( sID, sName)
ForeignS( sID, Country)
DomesticS( sID, State, SSN)
APStudent(sID) \(\leftarrow\) eliminated, implicated in Took
APCourse( Course#, title, units)
Took( sID, Course#, year, grade)
Composition & Aggregation
|----------| |------------| | College | | Department | |----------| 1..1 |------------| | cName pk |◆--------------| dName | Not "regular" | state |◇-------| | building | |----------| 0..1 | |------------| | |-----------| | Apartment | |-----------| | addr pk | | #units | |-----------|
College( cName, state) Department(dName, building, cName) Apartment( addr, #units, cName) ↑ nullable
5.2 Exercises
6 Indexes and Transactions mini-course
6.1 Lecture Videos
6.1.1 Indexes
Indexes
- Primary mechanism to get improved performance
- Persistent data structure, stored in database
- Many interesting implementation issues
A | B | C | |
---|---|---|---|
1 | cat | 2 | … |
2 | dog | 5 | … |
3 | cow | 1 | … |
4 | dog | 9 | … |
5 | cat | 2 | … |
6 | cat | 8 | … |
7 | cow | 6 | … |
… | … | … |
Index on T.A
- T.A = ’cow’
- T.A = ’cat’
Index on T.B
- T.B = 2
- T.B < 6
- 4 < T.B ≤ 8
Index on T.(A, B)
- T.A = ’cat’ and T.B > 5
- T.A < ’d’ and T.B = 1
Utility
- Index = difference between full table scans and immediate location
of tuples
- Orders of magnitude performance difference
- Underlying data structures
- Balanced trees (B trees, B+ trees)
- A = V, A < V, V1 <= A <= V2
- logarithmic time
- Hash tables
- constant time
- Balanced trees (B trees, B+ trees)
Select sName From student Where sID = 18942
Index on sID
Many DBMS’s build indexes automatically on PRIMARY KEY (and sometimes UNIQUE) attributes
Select sID From Student Where sName = 'Mary' And GPA > 3.9
- Index on sName ← hash or tree
- Index on GPA ← tree-based
- Index on (sName, GPA)
Select sName, cName From Student, Apply Where Student.sID = Apply.sID
- Index on student.sid or apply.sid or both
Query planning & optimization
Downsides of Indexes
- Extra space - marginal
- Index creation - medium
- Index maintenance - can offset benefits
Picking which indexes to create
Benefits of an index depends on:
- Size of table (and possibly layout)
- Data distributions
- Query vs. update load
“Physical design advisors”
Input: database (statistics) and workload Output: recommended indexes
Query Optimizer takes Database statistics, Query or update, and indexes as input and gives best execution plan with estimated cost as output.
Physical design advisor experiments different indexes through query optimizer and gives recommended indexes, of which benefits outweigh drawbacks.
Create Index IndexName on T(A) Create Index IndexName on T(A1, A2, ..., An) Create Unique Index IndexName on T(A) Drop Index IndexName
6.1.2 Introduction to Transactions
Motivated by two independent requirements
- Concurrent database access
- Resilience to system failures
Concurrent Access: Attribute-level Inconsistency
Update College Set enrollment = enrollment + 1000 Where cName = 'Stanford'
concurrent with …
Update College Set enrollment = enrollment + 1500 Where cName = 'Stanford'
get; modify; put
- Possible outcomes
Concurrent Access: Tuple-level Inconsistency
Update Apply Set major = 'CS' Where sID = 123
concurrent with …
Update Apply Set decision = 'Y' Where sID = 123
get; modify; put
- both changes
- one of the two changes
Concurrent Access: Table-level Inconsistency
Update Apply Set decision = 'Y' Where sID In (Select sID From Student Where GPA > 3.9)
concurrent with …
Update Student Set GPA = 1.1 * GPA Where sizeHS > 2500
Concurrent Access: Multi-statement inconsistency
Insert Into Archive Select * From Apply Where decision = 'N'; Delete From Apply Where decision = 'N';
concurrent with …
Select Count(*) From Apply; Select Count(*) From Archive;
Concurrency Goal
Execute sequence of SQL statements so they appear to be running in isolation
- Simple solution: execute them in isolation
But want to enable concurrency whenever safe to do so
- Multiprocessor
- Multithreaded
- Asynchronous I/O
Resilience to System Failures
Rather unpleasant inconsistent state can occur due to crashes when there bulk loading to DBMS
Or
Insert Into Archive Select * From Apply Where decision = 'N'; Delete From Apply Where decision = 'N';
Crash failure
Lots of updates buffered in memory
System-Failure Goal
Guarantee all-or-nothing execution, regardless of failures
Solution for both concurrency and failures
Transactions
A transaction is a sequence of one or more SQL operations treated as a unit
- Transactions appear to run in isolation
- If the system fails, each transaction’s changes are reflected either entirely or not at all
SQL standard:
- Transaction begins automatically on first SQL statement
- On “commit” transaction ends and new one begins
- Current transaction ends on session termination
- “Autocommit” turns each statement into transaction
6.1.3 Transaction Properties
ACID Properties
- Atomicity, Consistency, Isolation, Durability
Isolation
- Serializability: Operations may be interleaved, but execution must be equivalent to some sequential (serial) order of all transactions
locking portion of the database
- T1; T2
- T2; T1
15,000 → 17,500
- T1; T2
- T2; T1
both changes
- T1; T2
- T2; T1
Order matters here.
Order also matters here.
Durability
- If system crashes after transaction commits, all effects of transaction remain in database.
T2 {S1; S2; S3; commit}
crashes
implemented by logging
Atomicity
- Each transaction is “all-or-nothing,” never left half done
T2 { S1; S2; ... crashes ...; commit }
Logging
Transaction Rollback (= Abort) ☆
- Undoes partial effects of transaction
- Can be system or client initiated
Begin Transaction; <get input from user> SQL commands based on input <confirm results with user> If ans = 'ok' Then Commit; Else Rollback;
Rollback only undoes data itself on database, but doesn’t affect variable, and cash delivery etc.
Never hold a transaction for long time. Locking
Consistency
- Each client, each transaction:
- Can assume all constraints hold when transaction begins
- Must guarantee all constraints hold when transaction ends
Serializability ⇒ constraints always hold
6.1.4 Isolation Levels
Serializability ⇒ Overhead, Reduction in concurrency
Weaker “Isolation Levels” (from weak to strong)
- Read Uncommitted
- Read Committed
- Repeatable Read
- Serializable
- ↓ Overhead ↑ Concurrency
- ↓ Consistency Guarantees
Isolation Levels
- Per transaction
- “In the eye of the beholder”
Dirty Reads
- “Dirty” data item: written by an uncommitted transaction
Example 1
Update College Set enrollment = enrollment + 1000 Where cName = 'Stanford'
concurrent with …
Select Avg(enrollment) From College
Example 2
Update Student Set GPA = 1.1 * GPA Where sizeHS > 2500
concurrent with …
Select GPA From Student Where sID = 123
concurrent with …
Update Student Set sizeHS = 2600 Where sID = 234
There is no such thing as dirty read within the same transaction.
Isolation Level: Read Uncommitted
- A transaction may perform dirty reads
Query concurrent with …
Set Transaction Isolation Level Read Uncommitted; Select Avg(GPA) From Student;
Isolation Level: Read Committed
- A transaction may not perform dirty reads
Still does not guarantee global serializability
Query concurrent with …
Set Transaction Isolation Level Read Committed; Select Avg(GPA) From Student; Select Max(GPA) From Student;
Isolation Level: Repeatable Read
- A transaction may not perform dirty reads
- An item read multiple times cannot change value
Still does not guarantee global serializability
Update Student Set GPA = 1.1 * GPA; Update Student Set sizeHS = 1500 where sID = 123;
concurrent with …
Set Transaction Isolation Level Repeatable Read; Select Avg(GPA) From Student; Select Avg(sizeHS) From Student;
Neither T1; T2 Nor T2; T1
But a relation can change: “phantom” tuples
Insert Into Student -- [100 new tuples]
concurrent with …
Set Transaction Isolation Level Repeatable Read; Select Avg(GPA) From Student; Select Max(GPA) From Student;
The inserted tuples are called phantom tuples
Delete From Student -- [100 tuples]
concurrent with Query
T1 must occur either before or after T2
Read Only transactions
- Helps system optimize performance
- Independent of isolation level
Orthogonal to isolation level
Set Transaction Read Only; Set Transaction Isolation Level Repeatable Read; Select Avg(GPA) From Student; Select Max(GPA) From Student;
Isolation Levels: Summary
dirty Reads | nonrepeatable reads | phantoms | |
---|---|---|---|
Read Uncommitted | Y | Y | Y |
Read Committed | N | Y | Y |
Repeatable Read | N | N | Y |
Serializable | N | N | N |
- Standard default: Serializable
- Weaker isolation levels
- Increased concurrency + decreased overhead = increased performance
- Weaker consistency guarantees
- Some systems have default Repeatable Read
- Isolation level per transaction and “eye of the beholder”
- Each transaction’s reads must conform to its isolation level
Note: An entire workload is globally serializable only if every transaction’s isolation level is serializable.
6.2 Exercises
6.2.1 Index Quiz
6.2.2 Transaction Quiz
7 Constraints and Triggers mini-course
7.1 Lecture Videos
7.1.1 Motivation and Overview
(Integrity) Constraints <static>
- constrain allowable database states
Triggers <dynamic>
- monitor database changes,
- check conditions and initiate actions
Integrity Constraints
- Impose restrictions on allowable data, beyond those imposed by
- structure and types
- 0.0 < GPA <= 4.0
- enrollment < 50,000 (75,000)
Why use them?
- Data-entry errors (inserts)
- Correctness criteria (updates)
- Enforce consistency
- Tell system about data
Classification
- Non-null
- Key
- Referential Integrity (foreign key)
Declaring and enforcing constraints
Declaration
- With original schema - checked after bulk loading
- Or later - checked on current DB
Enforcement
- Check after every “dangerous” modification
- Deferred constraint checking (transaction)
Triggers
- “Event-Condition-Action Rules”
- When event occurs, check condition; if true, do action
Examples
- enrollment > 35,000 -> reject all applicants
- insert app with GPA > 3.95 -> accept automatically
- update sizeHS to be > 7,000 -> change to “wrong”, raise error
Why use them?
- More logic from apps into DBMS
- To enforce constraints
- expressive
- constraint “repair” logic
Create Trigger name Before|After|Instead Of events [ referencing-variables ] [ For Each Row ] When (condition) action
Constraints and Triggers
- For relational databases
- SQL standard; systems vary considerably
(Integrity) Constraints
- constrain allowable database states
Triggers
- monitor database changes,
- check conditions and initiate actions
7.1.2 Constraints of Several Types
Integrity Constraints
- Impose restrictions on allowable data, beyond those imposed by structure and types
- Non-null constraints
- Key constraints
- Attribute-based and tuple-based constraints (limits)
- General assertions (not implemented by any database)
Not all constraints are implemented.
Constraints Demo
create table Student( sID int, sName text, GPA real not null, sizeHS int );
insert into Student values (123, 'Amy', 3.9, 1000); insert into Student values (234, 'Bob', 3.6, null); insert into Student values (345, 'Craig', null, 500);
update Student set GPA = null where sID = 123; -- failed update Student set GPA = null where sID = 456; -- succeed since 456 doesn't exist
create table Student( sID int primary key, sName text, GPA real, sizeHS int );
insert into Student values (123, 'Amy', 3.9, 1000); insert into Student values (234, 'Bob', 3.6, 1500); insert into Student values (123, 'Craig', 3.5, 500); -- failed, key error
update Student set sID = 123 where sName = 'Bob';
update Student set sID = sID - 111; -- succeed if query executed for sID 123 before 234
update Student set sID = sID + 111; -- failed if query executed for sID 123 before 234
create table Student( sID int primary key, sName text unique, GPA real, sizeHS int );
insert into Student values (123, 'Amy', 3.9, 1000); insert into Student values (234, 'Bob', 3.6, 1500); insert into Student values (345, 'Amy', 3.5, 500); -- failed insert into Student values (456, 'Doris', 3.9, 1000); insert into Student values (567, 'Amy', 3.8, 1500); -- failed
create table College( cName text, state text, enrollment int, primary key (cName, state) );
insert into College values ('Mason', 'CA', 10000); insert into College values ('Mason', 'NY', 5000); insert into College values ('Mason', 'CA', 2000);
create table Apply( sID int, cName text, major text, decision text, unique(sID, cName), unique(sID, major) );
insert into Apply values (123, 'Stanford', 'CS', null); insert into Apply values (123, 'Berkeley', 'EE', null); insert into Apply values (123, 'Stanford', 'biology', null); -- failed insert into Apply values (234, 'Stanford', 'biology', null); insert into Apply values (123, 'MIT', 'EE', null); -- failed insert into Apply values (123, 'MIT', 'biology', null);
update Apply set major = 'CS' where cName = 'MIT';
insert into Apply values (123, null, null, 'Y'); insert into Apply values (123, null, null, 'N');
Both queries above succeed. SQL standard and most systems allow duplicate null values for unique key. Most systems do not permit repeated null values for primary key.
Attribute check constraint
create table Student( sID int, sName text, GPA real check(GPA <= 4.0 and GPA > 0.0), sizeHS int check(sizeHS < 5000) );
insert into Student values (123, 'Amy', 3.9, 1000); insert into Student values (234, 'Bob', 4.6, 1500);
create table Apply( sID int, cName text, major text, decision text, check(decision = 'N' or cName <> 'Stanford' or major <> 'CS') );
insert into Apply values (123, 'Stanford', 'CS', 'N'); insert into Apply values (123, 'MIT', 'CS', 'Y'); insert into Apply values (123, 'Stanford', 'CS', 'Y');
update Apply set decision = 'Y' where cName = 'Stanford'; update Apply set cName = 'Stanford' where cName = 'MIT';
Both queries failed on SQLite and Postgres, but not in MySQL (MySQL sucks).
create table Student( sID int, sName text, GPA real check(GPA is not null), sizeHS int );
The above is equivalent to GPA real not null
.
create table T( A int check(A not in (select A from T)) );
The above check constraint implements key constraint. It fails on SQLite for several reasons. The order of execution and check matters. It works for at least some database system.
create table T( A int check((select count(distinct A) from T) = (select count(*) from T)) );
No known database allows subquery, esp. aggregation, in check expression, so the above query also doesn’t work.
create table Student( sID int, sName text, GPA real, sizeHS int ); create table Apply( sID int, cName text, major text, decision text, check(sID in (select sID from Student)) );
The above table definitions is valid according to SQL standards but no database implements subquery in check constraint.
create table College( cName text, state text, enrollment int, check(enrollment > (select max(sizeHS) from Student)) );
Also valid in SQL standard but no database allows this. Now, suppose this works for some system, it has other problems. When student table changes, the check constraint may no longer be valid.
General Assertions
- They are very powerful but unfortunately not implemented by any RDBMS.
create assertion Key check ((select count(distinct A) from T) = (select count(*) from T));
create assertion ReferentialIntegrity check (not exists (select * from Apply where sID not in (select sID from Student)));
It’s common to write not exists
in general assertions.
create assertion AvgAccept check (3.0 < (select avg(GPA) from Student where sID in (select sID from Apply where decision = 'Y')));
7.1.3 Referential Integrity
Integrity Constraints
- Impose restrictions on allowable data, beyond those imposed by structure and types
Referential integrity = Integrity of references = No “dangling pointers”
Referential integrity from R.A to S.B
- Each value in column A of table R must appear in column B of table S
- A is called the “foreign key” (Foreign key constraints)
- B is usually required to be the primary key for table S or at least unique
- Multi-attribute foreign keys are allowed
Referential integrity is directional
Referential Integrity Enforcement (R.A to S.B)
- Potentially violating modifications:
- Insert into R
- Delete from S
- Update R.A
- Update S.B
Special actions:
- Delete from S
- Delete from S
Restrict
(default),Set Null
,Cascade
- Update S.B
Restrict
(default),Set Null
,Cascade
- Delete from S
create table College( cName text primary key, state text, enrollment int ); create table Student( sID int primary key, sName text, GPA real, sizeHS int ); create table Apply( sID int references Student(sID), cName text references College(cName), major text, decision text );
insert into Apply values (123, 'Stanford', 'CS', 'Y'); insert into Apply values (234, 'Berkeley', 'biology', 'N');
Both queries above failed.
insert into Student values (123, 'Amy', 3.9, 1000); insert into Student values (234, 'Bob', 3.6, 1500); insert into College values ('Stanford', 'CA', 15000); insert into College values ('Berkeley', 'CA', 36000);
After executing the above queries, the previous queries succeed.
update Apply set sID = 345 where sID = 123; -- failed, break referential integrity update Apply set sID = 234 where sID = 123; -- succeed
delete from College where cName = 'Stanford'; -- failed delete from College where sID = 234; -- failed delete from College where sID = 123; -- succeed
update College set cName = 'Bezerkeley' where cName = 'Berkeley'; -- failed
drop table Student; -- failed
create table Apply( sID int references Student(sID) on delete set null, cName text references College(cName) on update cascade, major text, decision text );
delete from Student where sID > 200;
The Bezerkeley query now succeeds.
create table T ( A int, B int, C int, primary key (A, B), foreign key (B, C) references T(A, B) on delete cascade ); insert into T values (1, 1, 1); insert into T values (2, 1, 1); insert into T values (3, 2, 1); insert into T values (4, 3, 2); insert into T values (5, 4, 3); insert into T values (6, 5, 4); insert into T values (7, 6, 5); insert into T values (8, 7, 6);
delete from T where A = 1;
The above query will delete all the rows in T.
7.1.4 Triggers Introduction
Triggers
- “Event-Condition-Action Rules”
- When event occurs, check condition; if true, do action
- Move monitoring logic from apps into DBMS
- Enforce constraints
- Beyond what constraint system supports
- Automatic constraint “repair”
Implementations vary significantly
Create Trigger name Before | After | Instead Of events [ referencing-variables ] [ For Each Row ] When (condition) action
Events are one of insert on T
, delete on T
, update [of C1, ..., Cn] on T
.
[For Each Row]
means that the triggers should be executed once for
each modified tuple.
When event is an insert, referencing-variables
can only reference
the newly inserted data. When event is a delete, referencing-variables
can only reference the old deleted data. When event is an update,
referencing-variables
can reference both old and new data.
If a row-level trigger is concerned and the event is a delete, we can refer both the old row and old table as different variables. Old table does not refer to the old state of the table in the database but specifically refers to the rows deleted in the delete event.
If a statement-level trigger is concerned, only table can be referenced.
(Condition)
is like SQL where
clause.
action
is the action to be performed when triggered. This is where
various database systems differ.
Referential Integrity:
- R.A references S.B, cascaded delete
Create Trigger Cascade After Delete On S Referencing Old Row As O For Each Row -- [ no condition ] Delete From R Where A = O.B
Create Trigger Cascade After Delete On S Referencing Old Table As OT -- [ For Each Row ] -- [ no condition ] Delete From R Where A in (Select B from OT)
Tricky Issues
- Row-level vs. Statement-level
- New/Old Row and New/Old Table
- Before, Instead Of
- Multiple triggers activated at same time
- Trigger actions activating other triggers (chaining)
- Also self-triggering, cycles, nested invocations
- Conditions in
When
vs. as part ofaction
SQLite supports only row-level triggers.
T(K,V) - K key, V value
Create Trigger IncreaseInserts After Insert On T Referencing New Row As NR, New Table As NT For Each Row When (Select Avg(V) From T) < (Select Avg(V) From NT) Update T set V=V+10 where K=NR.K
- No statement-level equivalent
- Nondeterministic final state
7.1.5 Triggers Demo
- Before and After; Insert, Delete, and Update
- New and Old
- Conditions and actions
- Triggers enforcing constraints
- Trigger chaining
- Self-triggering, cycles
- Conflicts
- Nested trigger invocations
Introduction video used SQL standard
- No DBMS implements exact standard
- Some deviate considerably
- In both syntax and behavior!
Postgres > SQLite >> MySQL
Postgres
- Expressiveness/behavior = full standard row-level + statement-level, old/new row & table
- Cumbersome & awkward syntax
SQLite
- Row-level only, immediate activation => no old/new table
MySQL
- Row-level only, immediate activation => no old/new table
- Only one trigger per event type
- Limited trigger chaining
SQLite - row-level triggers, immediate activation
For Each Row
implicit if not specified- No
Old Table
orNew Table
- No
Referencing
clauseOld
andNew
predefined forOld Row
andNew Row
- Trigger action: SQL statements in
begin-end
block
create trigger R1 after insert on Student for each row when New.GPA > 3.3 and New.GPA <= 3.6 begin insert into Apply values (New.sID, 'Stanford', 'geology', null); insert into Apply values (New.sID, 'MIT', 'biology', null); end;
create trigger R2 after delete on Student for each row begin delete from Apply where sID = Old.sID; end;
create trigger R3 after update of cName on College for each row begin update Apply set cName = New.cName where cName = Old.cName; end;
create trigger R4 before insert on College for each row when exists (select * from College where cName = New.cName) begin select raise(ignore); end; create trigger R5 before update of cName on College for each row when exists (select * from College where cName = New.cName) begin select raise(ignore); end;
create trigger R6 after insert on Apply for each row when (select count(*) from Apply where cName = New.cName) > 10 begin update College set cName = cName || '-Done' where cName = New.cName; end;
create trigger R7 before insert on Student for each row when New.sizeHS < 100 or New.sizeHS > 5000 begin select raise(ignore); end;
create trigger R7 -- alternative version using after after insert on Student for each row when New.sizeHS < 100 or New.sizeHS > 5000 begin delete from Student where sID = New.sID; end;
create trigger AutoAccept after insert on Apply for each row when (New.cName = 'Berkeley' and 3.7 < (select GPA from Student where sID = New.sID) and 1200 < (select sizeHS from Student where sID = New.sID)) begin update Apply set decision = 'Y' where sID = New.sID and cName = New.cName; end;
create trigger TooMany after update of enrollment on College for each row when (Old.enrollment <= 16000 and New.enrollment > 16000) begin delete from Apply where cName = New.cName and major = 'EE'; update Apply set decision = 'U' where cName = New.cName and decision = 'Y'; end;
self-triggering
create trigger R1 after insert on T1 for each row begin insert into T1 values (New.A+1); end;
The above query won’t enter infinite loop since SQLite disallow a
trigger to be triggered more than once in a trigger processing
session. But if we like, we can turn on recursive_triggers
on.
pragma recursive_triggers = on; create trigger R1 after insert on T1 for each row when (select count(*) from T1) < 10 begin insert into T1 values (New.A+1); end;
trigger each other in a cycle
create trigger R1 after insert on T1 for each row begin insert into T2 values (New.A+1); end; create trigger R2 after insert on T2 for each row begin insert into T3 values (New.A+1); end; create trigger R3 after insert on T3 for each row begin insert into T1 values (New.A+1); end;
pragma recursive_triggers = on; create trigger R3 after insert on T3 for each row when (select count(*) from T1) < 100 begin insert into T1 values (New.A+1); end;
create trigger R1 after insert on T1 for each row begin update T1 set A = 2; end; create trigger R2 after insert on T1 for each row when exists (select * from T1 where A = 2) begin update T1 set A = 3; end;
The second trigger is executed first in SQLite.
nested trigger invocation
create trigger R1 after insert on T1 for each row begin insert into T2 values (1); insert into T3 values (1); end; create trigger R2 after insert on T2 for each row begin insert into T3 values (2); insert into T4 values (2); end; create trigger R3 after insert on T3 for each row begin insert into T4 values (3); end;
immediate activation semantics of SQLite for triggers
create trigger R1 after insert on T1 for each row begin insert into T2 select avg(A) from T1; end;
7.2 Exercises
7.2.1 Constraints and Triggers
7.2.2 SQL Social-Network Triggers
create trigger FriendSameGrade after insert on Highschooler for each row when New.name = 'Friendly' begin insert into Likes select New.ID, ID from Highschooler where grade = New.grade and ID <> New.ID; end;
create trigger ManageGrade1 after insert on Highschooler for each row when New.grade < 9 or New.grade > 12 begin update Highschooler set grade = null where ID = New.ID; end; | create trigger ManageGrade2 after insert on Highschooler for each row when New.grade is null begin update Highschooler set grade = 9 where ID = New.ID; end;
create trigger SymmetricFriend1 after delete on Friend for each row begin delete from Friend where ID1 = Old.ID2 and ID2 = Old.ID1; end; | create trigger SymmetricFriend2 after insert on Friend for each row begin insert or ignore into Friend values (New.ID2, New.ID1); end;
create trigger Graduate after update on Highschooler for each row when New.grade > 12 begin delete from Highschooler where ID = New.ID; end;
create trigger Graduate after update on Highschooler for each row when New.grade > 12 begin delete from Highschooler where ID = New.ID; end; | create trigger GoUpOneGrade before update on Highschooler for each row when Old.grade + 1 = New.grade begin update Highschooler set grade = grade + 1 where ID in (select ID2 from Friend where ID1 = Old.ID); end;
create trigger Unfriend after update on Likes for each row when Old.ID1 = New.ID1 and Old.ID2 <> New.ID2 and exists (select * from Friend where ID1 = Old.ID2 and ID2 = New.ID2) begin delete from Friend where ((ID1 = Old.ID2 and ID2 = New.ID2) or (ID1 = New.ID2 and ID2 = Old.ID2)); end;
8 Views and Authorization mini-course
8.1 Lecture Videos
8.1.1 Defining and Using Views
Three-level vision of database
- Physical - Conceptual - Logical
Why use views?
- Hide some data from some users
- Make some queries easier/more natural
- Modularity of database access
Real applications tend to use lots and lots (and lots and lots!) of views
Defining and using views
- View \(V = \mathrm{ViewQuery}(R_1, R_2, \dotsc, R_n)\)
- Schema of \(V\) is schema of query result
Query \(Q\) involving \(V\), conceptually:
\begin{align*} &V := \mathrm{ViewQuery}(R_1, R_2, \dotsc, R_n) \\ &\mathrm{Evaluate}\ Q \end{align*}- In reality, \(Q\) rewritten to use \(R_1, \dotsc, R_n\) instead of \(V\)
- Notes: \(R_i\) could itself be a view
SQL Syntax
Create View Vname as <Query> Create View Vname(A1, A2, ..., An) As <Query>
create view CSaccept as select sID, cName from Apply where major = 'CS' and decision = 'Y';
select Student.sID, sName, GPA from Student, CSaccept where Student.sID = CSaccept.sID and cName = 'Stanford' and GPA < 3.8;
The above query is conceptually equivalent to the following.
create temporary table T as select sID, cName from Apply where major = 'CS' and decision = 'Y'; select Student.sID, sName, GPA from Student, T where Student.sID = T.sID and cName = 'Stanford' and GPA < 3.8; drop table T;
Some simple RDBMS often rewrites as the following.
select Stuent.sID, sName, GPA from Student, (select sID, cName from Apply where major = 'CS' and decision = 'Y') as CSaccept where Student.sID = CSaccept.sID and cName = 'Stanford' and GPA < 3.8;
More powerful RDBMS rewrites as the following.
select Student.sID, sName, GPA from Student, Apply where major = 'CS' and decision = 'Y' and Student.sID = Apply.sID and cName = 'Stanford' and GPA < 3.8;
Views that use other views
create view CSberk as select Student.sID, sName, GPA from Student, CSaccept where Student.sID = CSaccept.sID and cName = 'Berkeley' and sizeHS > 500;
select * from CSberk where GPA > 3.8;
Naive rewrite
select * from (select Student.sID, sName, GPA from Student, (select sID, cName from Apply where major = 'CS' and decision = 'Y') as CSaccept where Student.sID = CSaccept.sID and cName = 'Berkeley' and sizeHS > 500) CSberk where GPA > 3.8;
Flattened rewrite
select * from Student, Apply where major = 'CS' and decision = 'Y' and Student.sID = Apply.sID and Apply.cName = 'Berkeley' and sizeHS > 500 and GPA > 3.8;
The following will result in an error under PostgreSQL since other objects depend on
CSaccept
.
drop view CSaccept;
However, under both SQLite and MySQL, the error is not thrown until we
try tot use CSberk
.
The following probably won’t be what you want to do for large database. Additionally, this is what Martin Fowler might refer to as natural aggregate.
create view Mega as select College.cName, state, enrollment, Student.sID, sName, GPA, sizeHS, major, decision from College, Student, Apply where College.cName = Apply.cName and Student.sID = Apply.sID;
select sID, sName, GPA, cName from Mega where GPA > 3.5 and major = 'CS' and enrollment > 15000;
Flattened rewrite
select Student.sID, sName, GPA, College.cName from College, Student, Apply where College.cName = Apply.cName and Student.sID = Apply.sID and GPA > 3.5 and major = 'CS' and enrollment > 15000;
8.1.2 View Modifications - Introduction
Querying views
- Once \(V\) defined, can reference \(V\) like any table
- Queries involving \(V\) rewritten to use base tables
Modifying views
- Once \(V\) defined, can we modify \(V\) like any table?
- Doesn’t make sense: \(V\) is not stored
- Has to make sense: views are some users’ entire “view” of the database
- Solution: Modifications to \(V\) rewritten to modify base tables
Usually there are successful translations for view modifications. The problem is there are too many and which one the user intends.
Suppose we have the following schema.
\begin{align*} & R(A, B) \\ & V = \Pi_A(R) \end{align*}The table \(R\) has \((1, 2)\) and thus \(V = (1)\). Say the user wants to insert \(3\) into \(V\). Then the attribute \(B\) is undecided for the newly inserted tuple.
A more radical example follows.
\begin{align*} & R(N) = \{ (1), (3), (5) \} \\ & V = \operatorname{avg}(N) = \{ (3) \} \end{align*}And the user wants to update the average to \(7\), then there are so many ways to update \(R\) to achieve this.
Existent systems do things quite differently. There are several approaches.
- Rewriting process specified explicitly by view creator
- + Can handle all modifications
- − No guarantee of correctness (or meaningfulness)
- Restrict views + modifications so that translation to base table
modifications is meaningful and unambiguous
- + No user intervention
- − Restrictions are significant
The first approach is enabled by instead of
triggers or rules in
Postgres’ parlance. The second approach is actually adopted by SQL
standard.
8.1.3 View Modifications Using Triggers
Unlike queries, view modifications cannot be automated in general.
The following query will result in an error in SQLite, since it doesn’t allow update to views.
delete from CSaccept where sID = 123;
We can create instead of
triggers to achieve it.
create trigger CSacceptDelete instead of delete on CSaccept for each row begin delete from Apply where sID = Old.sID and cName = Old.cName and major = 'CS' and decision = 'Y'; end;
Now it works.
Of course, I also need an instead of
trigger for the following query
to work.
update CSaccept set cName = 'CMU' where sID = 345;
An example of erroneous instead of
trigger
create trigger CSacceptUpdate instead of update of cName on CSaccept for each row begin update Apply set cName = New.cName where sID = Old.sID and cName = Old.cName -- correct condition should be -- and major = 'CS' and decision = 'Y'; and major = 'EE' and decision = 'N'; end;
create view CSEE as select sID, cName, major from Apply where major = 'CS' or major = 'EE';
insert into CSEE values (111, 'Berkeley', 'CS');
Another example of erroneous instead of
trigger
create trigger CSEEinsert instead of insert on CSEE for each row begin insert into Apply values (New.sID, New.cName, New.major, null); end;
What if we do the following?
insert into CSEE values (222, 'Berkeley', 'biology');
Better trigger follows.
drop trigger CSEEinsert; create trigger CSEEinsert instead of insert on CSEE for each row when New.major = 'CS' or New.major = 'EE' begin insert into Apply values (New.sID, New.cName, New.major, null); end;
This time, the following query won’t insert data into Apply table.
insert into CSEE values (333, 'Berkeley', 'biology');
The following will successfully insert data into the underlying Apply table.
insert into CSEE values (333, 'Berkeley', 'EE');
When a view involves aggregation, it is generally a good idea to disallow user to modify views.
create view HSgpa as select sizeHS, avg(gpa) as avgGPA from Student group by sizeHS;
create view Majors as select distinct major from Apply;
create view NonUnique as select * from Student S1 where exists (select * from Student S2 where S1.sID <> S2.sID and S2.GPA = S1.GPA and S2.sizeHS = S1.sizeHS);
create view Berk as select Student.sID, major from Student, Apply where Student.sID = Apply.sID and cName = 'Berkeley';
create trigger BerkInsert instead of insert on Berk for each row when New.sID in (select sID from Student) begin insert into Apply values (New.sID, 'Berkeley', New.major, null); end;
insert into Berk select sID, 'psychology' from Student where sID not in (select sID from Apply where cName = 'Berkeley');
create trigger BerkDelete instead of delete on BerkDelete for each row begin delete from Apply where sID = old.sID and cName = 'Berkeley' and major = Old.major; end;
delete from Berk where major = 'CS';
create trigger BerkUpdate instead of update of major on Berk for each row begin update Apply set major = New.major where sID = New.sId and cName = 'Berkeley' and major = Old.major; end;
update Berk set major = 'physics' where major = 'psychology';
drop table Apply; create table Apply( sID int, cName text, major text, decision text not null; );
Now we cannot insert into the CSEE view.
drop table Apply; create table Apply( sID int, cName text, major text, decision text, unique(sID, cName, major) );
insert into CSEE values (123, 'Berkeley', 'CS'); insert into CSEE values (123, 'Berkeley', 'EE'); insert into Berk values (123, 'EE'); -- failed. update Berk set major = 'CS' where sID = 123; -- failed.
8.1.4 Automatic View Modifications
Restrictions in SQL Standard for “updatable views”:
select
(nodistinct
) on single tableT
- Attributes not in view can be
null
or have default value - Subqueries must not refer to
T
- No
group by
or aggregation
MySQL is the only database (among MySQL, SQLite, and PostgreSQL) that supports automatic view modifications. However, it is a little bit more generous than the SQL standard.
insert into CSEE values (111, 'Berkeley', 'CS'); insert into CSEE values (222, 'Berkeley', 'psychology'); insert into CSaccept values (333, 'Berkeley');
All of the above queries will succeed. The second query will insert a
row with 'psychology'
major into Apply
table, which won’t reflect
in the view CSEE
. And the last query will insert a row with null
major and decision into Apply
table. Both of the results are not
what we want.
We can fix it like this.
create view CSaccept2 as select sID, cName from Apply where major = 'CS' and decision = 'Y' with check option;
create view CSEE2 as select sID, cName, major from Apply where major = 'CS' or major = 'EE' with check option;
The delete
and insert
operations on HSgpa
view will fail under
both MySQL and SQL standard, since it contains aggregation.
Views with subqueries that does not refer to outer table can be modified.
create view Bio as select * from Student where sID in (select sID from Apply where major like 'bio%');
delete from Bio where sName = 'Bob';
For the above query, both MySQL and SQL standard will delete Bob from
only the Student
table, but not from the Apply
table. Of course,
if the database is set up with proper referential integrity
constraints, Bob will be deleted from Apply
table due to it.
insert into Bio values (555, 'Karen', 3.9, 1000);
Karen will be inserted into Student
table but won’t appear in the
Bio
view.
create view Bio2 as select * from Student where sID in (select sID from Apply where major like 'bio%') with check option;
Now the following query will fail.
insert into Bio2 values (666, 'Lori', 3.9, 1000);
The with check option
comes with an efficiency cost.
MySQL does allow some modification on join view, whereas SQL standard doesn’t.
create view Stan(sID, aID, sName, major) as select Student.sID, Apply.sID, sName, major from Student, Apply where Student.sID = Apply.sID and cName = 'Stanford';
update Stan set sName = 'CS major' where major = 'CS';
The above query succeeds but the following will cause row being
updated to disappear from the view Stan
.
update Stan set aID = 666 where aID = 123;
create view Stan2(sID, aID, sName, major) as select Student.sID, Apply.sID, sName, major from Student, Apply where Student.sID = Apply.sID and cName = 'Stanford' with check option;
Now the same query will fail for view Stan2
.
insert into Stan(sID, sName) values (777, 'Lance');
The above query will insert Lance only into Student
table with
null
GPA and high school size. This is not what we want. Also, the
same query will fail on view Stan2
. However, by inserting a
correspondent row into Apply
, we can make it succeed, as shown
below.
insert into Apply values (888, 'Stanford', 'history', 'Y'); insert into Stan2(sID, sName) values (888, 'Mary');
insert into Apply values (999, 'MIT', 'history', 'Y'); insert into Stan2(sID, sName) values (999, 'Nancy');
The second insert
query will fail since Nancy doesn’t apply to
Stanford.
delete from Stan where sID = 123;
We cannot delete from join view under MySQL.
Under SQL standard, a view is either updatable or not, meaning that if
a view is updatable, then we can apply insert
, delete
, update
on
it. However, a view can be partially updatable under MySQL since it
is a little bit more generous.
8.1.5 Materialized Views
The views discussed in the previous section are sometimes called virtual views.
Materialized views have an additional advantage:
- improved query performance.
Materialized views
- View \(V = \mathrm{ViewQuery}(R_1, R_2, \dotsc, R_n)\)
- Create table \(V\) with schema of query result
- Execute \(\mathrm{ViewQuery}\) and put results in \(V\)
- Queries refer to \(V\) as if it’s a table
But…
- \(V\) could be very large
- Modifications to \(R_1, R_2, \dotsc, R_n \Rightarrow\)
- recompute or modify \(V\)
create materialized view CA-CS as select C.cName, S.sName from College C, Student S, Apply A where C.cName = A.cName and S.sID = A.sID and C.state = 'CA' and A.major = 'CS'
- + Can use
CA-CS
as if it’s a table (it is!) - − Modifications to base data invalidate view (cf. general assertions)
Modifications on modifications views?
- Good news: just update the stored table
- Bad news: base tables must stay in sync
- Same issues as with virtual views
Picking which materialized views to create
- (Efficiency) benefits of a materialized view depend on:
- Size of data
- Complexity of view
- Number of queries using view
- Number of modifications affecting view
- Also “incremental maintenance” versus full recomputation
Query-Update trade off (cf. indexes)
Automatic query rewriting to use materialized views
create materialized view CA-Apply as select sID, cName, major from Apply A where cName in (select cName from College where state = 'CA')
select distinct S.sID, S.GPA from College C, Student S, Apply A where C.cName = A.cName and S.sID =A.sID and S.GPA > 3.5 and C.state = 'CA' and A.major = 'CS'
The above query will be rewritten as something like the following.
select distinct S.sID, S.GPA from Student S, CA-Apply A where S.sID = A.sID and S.GPA > 3.5 and A.major = 'CS'
8.1.6 Authorization
Database Authorization
- Make sure users see only the data they’re supposed to see
- Guard the database against modifications by malicious users
Users have privileges and can only operate on data for which they are authorized.
Select on R
orSelect(A1, ..., An) on R
Insert on R
orInsert(A1, ..., An) on R
Update on R
orUpdate(A1, ..., An) on R
Delete on R
update Apply set dec = 'Y' where sID in (select sID from Student where GPA > 3.9)
For the above query, we need the following privileges.
- Apply:
update(dec)
,select(sID)
- Student:
select(sID, GPA)
delete from student where sID not in (select sID from Apply)
- Student:
delete
- Apply:
select(sID)
Select student info for Stanford applicants only
create view SS as select * from Student where sID in (select sID from Apply where cName = 'Stanford')
- SS:
select
Delete Berkeley applications only
create view BA as select * from Apply where cName = 'Berkeley'
- BA:
delete
Note: Updatable views is required to provide modification privileges through views
In fact, authorization is one of the most important uses of views in database systems.
Obtaining Privileges
- Relation creator is owner
- Owner has all privileges and may grant privileges
grant privs on R to users [with grant option]
Revoking Privileges
revoke privs on R from users [cascade | restrict]
Cascade: Also revoke privileges granted from privileges being revoked (transitively), unless also granted from another source.
Restrict: Disallow if Cascade would revoke any other privileges. (This is the default.)
8.2 Exercises
8.2.3 SQL Movie-Rating View Modification
Write an instead-of trigger that enables updates to the title attribute of view LateRating.
Policy: Updates to attribute title in LateRating should update Movie.title for the corresponding movie. (You may assume attribute mID is a key for table Movie.) Make sure the mID attribute of view LateRating has not also been updated – if it has been updated, don’t make any changes. Don’t worry about updates to stars or ratingDate.
create trigger UpdateLateRatingTitle instead of update of title on LateRating for each row when Old.mId = New.mID begin update Movie set title = New.title where mID = Old.mID; end;
Write an instead-of trigger that enables updates to the stars attribute of view LateRating.
Policy: Updates to attribute stars in LateRating should update Rating.stars for the corresponding movie rating. (You may assume attributes [mID,ratingDate] together are a key for table Rating.) Make sure the mID and ratingDate attributes of view LateRating have not also been updated – if either one has been updated, don’t make any changes. Don’t worry about updates to title.
create trigger UpdateLateRatingStars instead of update of stars on LateRating for each row when Old.mID = New.mID and Old.ratingDate = New.ratingDate begin update Rating set stars = New.stars where mID = Old.mID and ratingDate = Old.ratingDate; end;
Write an instead-of trigger that enables updates to the mID attribute of view LateRating.
Policy: Updates to attribute mID in LateRating should update Movie.mID and Rating.mID for the corresponding movie. Update all Rating tuples with the old mID, not just the ones contributing to the view. Don’t worry about updates to title, stars, or ratingDate.
create trigger UpdateLateRatingMID instead of update of mID on LateRating for each row begin update Movie set mID = New.mID where mID = Old.mID; update Rating set mID = New.mID where mID = Old.mID; end;
Finally, write a single instead-of trigger that combines all three of the previous triggers to enable simultaneous updates to attributes mID, title, and/or stars in view LateRating. Combine the view-update policies of the three previous problems, with the exception that mID may now be updated. Make sure the ratingDate attribute of view LateRating has not also been updated – if it has been updated, don’t make any changes.
create trigger UpdateLateRating instead of update on LateRating for each row when Old.ratingDate = New.ratingDate begin update Movie set title = New.title, mID = New.mID where mID = Old.mID; update Rating set mID = New.mID where mID = Old.mID; -- this is the catch. update Rating set stars = New.stars where mID = New.mID and ratingDate = Old.ratingDate; end;
Write an instead-of trigger that enables deletions from view HighlyRated.
Policy: Deletions from view HighlyRated should delete all ratings for the corresponding movie that have stars > 3.
create trigger DeleteHighlyRated instead of delete on HighlyRated for each row begin delete from Rating where mID = Old.mID and stars > 3; -- the second condition is the catch. end;
Write an instead-of trigger that enables deletions from view HighlyRated.
Policy: Deletions from view HighlyRated should update all ratings for the corresponding movie that have stars > 3 so they have stars = 3.
create trigger DeleteHighlyRated instead of delete on HighlyRated for each row begin update Rating set stars = 3 where mID = Old.mID and stars > 3; end;
Write an instead-of trigger that enables insertions into view HighlyRated.
Policy: An insertion should be accepted only when the (mID,title) pair already exists in the Movie table. (Otherwise, do nothing.) Insertions into view HighlyRated should add a new rating for the inserted movie with rID = 201, stars = 5, and NULL ratingDate.
create trigger InsertHighlyRated instead of insert on HighlyRated for each row when exists (select * from Movie where mID = New.mID and title = New.title) begin insert into Rating values (201, New.mID, 5, null); end;
Write an instead-of trigger that enables insertions into view NoRating.
Policy: An insertion should be accepted only when the (mID,title) pair already exists in the Movie table. (Otherwise, do nothing.) Insertions into view NoRating should delete all ratings for the corresponding movie.
create trigger InsertNoRating instead of insert on NoRating for each row when exists (select * from Movie where mID = New.mID and title = New.title) begin delete from Rating where mID = New.mID; end;
Write an instead-of trigger that enables deletions from view NoRating.
Policy: Deletions from view NoRating should delete the corresponding movie from the Movie table.
create trigger DeleteNoRating instead of delete on NoRating for each row begin delete from Movie where mID = Old.mID; end;
Write an instead-of trigger that enables deletions from view NoRating.
Policy: Deletions from view NoRating should add a new rating for the deleted movie with rID = 201, stars = 1, and NULL ratingDate.
create trigger DeleteNoRating instead of delete on NoRating for each row begin insert into Rating values (201, Old.mID, 1, null); end;
9 On-Line Analytical Processing mini-course
9.1 Lecture Videos
9.1.1 Introduction to OLAP
Two broad types of database activity
- OLTP—Online Transaction Processing
- Short transaction
- Simple queries
- Touch small portions of data
- Frequent updates
- OLAP—Online Analytical Processing
- Long transaction
- Complex queries
- Touch large portions of data
- Infrequent updates
More terminology
- Data warehousing
- Bring data from operational (OLTP) sources into a single “warehouse” for (OLAP) analysis
- Decision Support System (DSS)
- Infrastructure for data analysis
- E.g., data warehouse tuned for OLAP
“Star Schema”
- Fact table
- Updated frequently, often append-only, very large
- Sales transactions
- Course enrollments
- Page views
- Updated frequently, often append-only, very large
- Dimension tables
- Updated infrequently, not as large
- Stores, items, customers
- Students, courses
- Web pages, users, advertisers
- Updated infrequently, not as large
- Fact table references dimension tables, hence the name “star”
OLAP queries
Sales(storeID, itemID, custID, qty, price) Store(storeID, city, county, state) Item(itemID, category, brand, color, size) Customer(custID, cName, gender, age, address)
The first three attributes of Sales are known as dimension attributes, whereas the rest attributes are known as dependent attributes.
Join → Filter → Group → Aggregate
Performance
- Inherently very slow:
- special indexes, query processing techniques
- Extensive use of materialized views
Data Cube (aka multidimensional OLAP)
- Dimension data forms axes of “cube”
- Fact (dependent) data in cells
- Aggregated data on sides, edges, corner
Fact table uniqueness for data cube
- If dimension attributes are not key, we must aggregate dependent attributes.
- Date can be used to create key
- Dimension or dependent? (Dimension)
Drill-down and Roll-up
select state, brand, sum(qty*price) from Sales F, Store S, Item I where F.storeID = S.storeID and F.itemID = I.itemID group by state, brand
Drill-down
- Examining summary data, break out by dimension attribute
Adding category
to the group by
clause is an drill-down.
Roll-up
- Examining data, summarize by dimension attribute
Removing state
from the group by
clause is an roll-up.
SQL Constructs
with cube
andwith rollup
select dimension-attrs, aggregates from tables where conditions group by dimension-attrs with [cube | rollup]
with cube
- Add to result: faces, edges, and corner of cube using
null
values.
with rollup
- For hierarchical dimensions, portion of
with cube
.
9.1.2 OLAP Demo
MySQL is used for the whole demo. MySQL supports with rollup
and
neither PostgreSQL nor SQLite supports it. MySQL does not yet support
with cube
, but we can simulate it.
select * from Sales F, Store S, Item I, Customer C where F.storeID = S.storeID and F.itemID = I.itemID and F.custID = C.custID;
select S.city, I.color,, C.cName, F.price from Sales F, Store S, Item I, Customer C where F.storeID = S.storeID and F.itemID = I.itemID and F.custID = C.custID and S.state = 'CA' and I.category = 'Tshirt' and C.age < 22 and F.price < 25;
select storeID, custID, sum(price) from Sales group by storeID, custID;
Drill-down
select storeID, itemID, custID, sum(price) from Sales group by storeID, itemID, custID;
Slicing
select F.storeID, itemID, custID, sum(price) from Sales F, Store S where F.storeID = S.storeID and state = 'WA' group by storeID, itemID, custID;
Dicing
select F.storeID, F.itemID, custID, sum(price) from Sales F, Store S, Item I where F.storeID = S.storeID and F.itemID = I.itemID and state = 'WA' and color = 'red' group by storeID, itemID, custID;
Roll-up
select itemID, sum(price) from Sales F group by itemID;
select state, category, sum(price) from Sales F, Store S, Item I where F.storeID = S.storeID and F.itemID = I.itemID group by state, category;
select state, county, category, sum(price) from Sales F, Store S, Item I where F.storeID = S.storeID and F.itemID = I.itemID group by state, county, category;
select state, county, category, gender, sum(price) from Sales F, Store S, Item I, Customer C where F.storeID = S.storeID and F.itemID = I.itemID and F.custID = C.custID group by state, county, category, gender;
select state, gender, sum(price) from Sales F, Store S, Customer C where F.storeID = S.storeID and F.custID = C.custID group by state, gender;
select storeID, itemID, custID, sum(price) from Sales group by storeID, itemID, custID with cube;
No RDBMS supports with cube
yet, but we can simulate it using with
rollup
under MySQL.
select storeID, itemID, custID, sum(price) from Sales group by storeID, itemID, custID with rollup union select storeID, itemID, custID, sum(price) from Sales group by itemID, custID, storeID with rollup union select storeID, itemID, custID, sum(price) from Sales group by custID, storeID, itemID with rollup;
create table Cube as select storeID, itemID, custID, sum(price) as p from Sales group by storeID, itemID, custID with rollup union select storeID, itemID, custID, sum(price) as p from Sales group by itemID, custID, storeID with rollup union select storeID, itemID, custID, sum(price) as p from Sales group by custID, storeID, itemID with rollup;
select C.* from Cube C, Store S, Item I where C.storeID = S.storeID and C.itemID = I.itemID and state = 'CA' and color = 'blue' and custID is null;
select sum(p) from Cube C, Store S, Item I where C.storeID = S.storeID and C.itemID = I.itemID and state = 'CA' and color = 'blue' and custID is null;
select C.* from Cube C, Store S, Item I where C.storeID = S.storeID and C.itemID = I.itemID and state = 'CA' and color = 'blue' and custID is not null;
select sum(p) from Cube C, Store S, Item I where C.storeID = S.storeID and C.itemID = I.itemID and state = 'CA' and color = 'blue' and custID is not null;
select F.* from Sales F, Store S, Item I where F.storeID = S.storeID and F.itemID = I.itemID and state = 'CA' and color = 'blue';
select sum(price) from Sales F, Store S, Item I where F.storeID = S.storeID and F.itemID = I.itemID and state = 'CA' and color = 'blue';
Variation of with cube
in SQL standard
select storeID, itemID, custID, sum(price) from Sales F group by storeID, itemID, custID with cube(storeID, custID);
Simulation using with rollup
select * from (select storeID, itemID, custID, sum(price) from Sales F group by itemID, storeID, custID with rollup) X where X.itemID is not null union select * from (select storeID, itemID, custID, sum(price) from Sales F group by itemID, custID, storeID with rollup) X where X.itemID is not null and X.custID is not null;
select storeID, itemID, custID, sum(price) from Sales F group by storeID, itemID, custID with rollup;
9.2 Exercises
10 Recursion in SQL mini-course
10.1 Lecture Videos
10.1.1 Basic Recursive WITH Statement—Introduction
SQL is not a “Turing complete” language
- Simple, convenient, declarative
- Expressive enough for most database queries
- But basic SQL can’t express unbounded computations
Example 1: Ancestors
ParentOf(parent, child)
- Find all of Mary’s ancestors
Example 2: Company hierarchy
Employee(ID, salary) Manager(mID, eID) Project(name, mgrID)
- Final total salary cost of project ’X’
Example 3: Airline flights
Flight(orig, dest, airline, cost)
- Find cheapest way to fly from ’A’ to ’B’
SQL with
Statement
with R1 as (query-1), R2 as (query-2), ... Rn as (query-n) <query involving R1, ..., Rn (and other tables)>
with R1(A1, A2, ..., Am) as (query-1), R2 as (query-2), ... Rn as (query-n) <query involving R1, ..., Rn (and other tables)>
with recursive R1 as (query-1), R2 as (query-2), ... Rn as (query-n) <query involving R1, ..., Rn (and other tables)>
SQL standard specifies recursive
keyword goes with relation, but
implementations differ and make it as modifier for with
, and thus
any relations are allowed to be recursive if the recursive
keyword
is given.
with recursive R as (base query union recursive query) <query involving R1, ..., Rn (and other tables)>
10.1.2 Basic Recursive WITH Statement—Demo
with recursive Ancestor(a, d) as (select parent as a, child as d from ParentOf union select Ancestor.a, ParentOf.child as d from Ancestor, ParentOf where Ancestor.d = ParentOf.parent) select a from Ancestor where d = 'Mary';
with recursive Superior as (select * from Manager union select S.mID, M.eID from Superior S, Manager M where S.eID = M.mID) select sum(salary) from Employee where ID in (select mgrID from Project where name = 'X' union select eID from Project, Superior where Project.name = 'X' and Project.mgrID = Superior.mID);
with recursive Xemps(ID) as (select mgrID as ID from Project where name = 'X' union select eID as ID from Manager M, Xemps X where M.mID = X.ID) select sum(salary) from Employee where ID in (select ID from Xemps);
with recursive Yemps(ID) as (select mgrID as ID from Project where name = 'Y' union select eID as ID from Manager M, Yemps Y where M.mID = Y.ID), Zemps(ID) as (select mgrID as ID from Project where name = 'Z' union select eID as ID from Manager M, Zemps Z where M.mID = Z.ID), select 'Y-total', sum(salary) from Employee where ID in (select ID from Yemps) union select 'Z-total', sum(salary) from Employee where ID in (select ID from Zemps);
with recursive Route(orig, dest, total) as (select orig, dest, cost as total from Flight union select R.orig, F.dest, cost+total as total from Route R, Flight F where R.dest = F.orig) select min(total) from Route where orig = 'A' and dest = 'B';
with recursive FromA(dest, total) as (select dest, cost as total from Flight where orig = 'A' union select F.dest, cost+total as total from FromA FA, Flight F where FA.dest = F.orig) select min(total) from FromA where dest = 'B';
with recursive ToB(orig, total) as (select orig, cost as total from Flight where dest = 'B' union select F.orig, cost+total as total from ToB, Flight F where ToB.orig = Flight.dest) select min(total) from ToB where orig = 'A';
All the above queries work well given that there is no loop in our Flight.
A loop will overflow all the above queries.
with recursive Route(orig, dest, total) as (select orig, dest, cost as total from Flight union select R.orig, F.dest, cost+total as total from Route R, Flight F where R.dest = F.orig and cost+total < all (select total from Route R2 where R2.orig = R.orig and R2.dest = F.dest)) select min(total) from Route where orig = 'A' and dest = 'B';
The above query works conceptually but PostgreSQL doesn’t allow recursive reference within a subquery.
with recursive Route(orig, dest, total) as (select orig, dest, cost as total from Flight union select R.orig, F.dest, cost+total as total from Route R, Flight F where R.dest = F.orig) select * from Route where orig = 'A' and dest = 'B' limit 20;
The above query will give a table with 20 results and thus prevent
infinite recursion. However, if you replace *
with min(total)
,
infinite recursion comes back again. The following is an alternative
solution.
with recursive Route(orig, dest, total) as (select orig, dest, cost as total, 1 from Flight union select R.orig, F.dest, cost+total as total, R.length+1 as length from Route R, Flight F where R.dest = F.orig and R.length < 10) select min(total) from Route where orig = 'A' and dest = 'B';
10.1.3 Nonlinear and Mutual Recursion
Linear Recursion
- there is at most one reference to R in the recursive query.
Example: Ancestors
with recursive Ancestor(a, d) as (select parent as a, child as d from ParentOf union select A1.a, A2.d from Ancestor A1, Ancestor A2 where A1.d = A2.a) select a from Ancestor where d = 'Mary';
This is a nonlinear recursion since there are two references to relation Ancestor.
Nonlinear versus linear
- + Query looks cleaner
- + Converges faster
- − Harder to implement
- SQL standard only requires linear
Mutual Recursion
- recursive queries refer to recursive relations other than themselves.
They work in tandem. SQL standard allows mutual recursion in limited forms.
Example: Hubs & Authorities
Link(src, dest) HubStart(node) AuthStart(node)
Hub: points to greater than or equal to 3 authority node Authority: pointed to greater than or equal to 3 hub node
Some nodes are pre-designated as hub and/or authority nodes.
with recursive Hub(node) as (select node from HubStart union select src as node from Link L where dest in (select node from Auth) group by src having count(*) >= 3), Auth(node) as (select node from AuthStart union select dest as node from Link L where src in (select node from Hub) group by dest having count(*) >= 3) select * from Hub;
It’s been said there is no implementation of mutual recursion in common RDBMSs. So the above query works only conceptually.
with recursive Hub(node) as (select node from HubStart union select src as node from Link L where src not in (select node from Auth) and dest in (select node from Auth) group by src having count(*) >= 3), Auth(node) as (select node from AuthStart union select dest as node from Link L where dest not in (select node from Hub) and src in (select node from Hub) group by dest having count(*) >= 3) select * from Hub;
The above query has non-unique fixed points (nondeterministic) of the recursion. This is bad and thus SQL standard doesn’t allow this type of mutual recursion. In fact, the crux of the problem is negative dependence, which can also happen even without mutual recursion. Non-mutual recursion with negative dependence is not allowed either.
Example: Recursion with Aggregation
P(x)
with recursive R(x) as (select x from P union select sum(x) from R) select * from R;
Recursion with aggregation is disallowed in the SQL standard and isn’t supported by any system.
Extends expressiveness of SQL
- Basic functionality: linear recursion
- Extended functionality: nonlinear recursion, mutual recursion
- Disallowed: negative dependence, aggregation
10.2 Exercises
10.2.1 Recursion Quiz
T(A) containing a set of positive integers with no duplicates.
With Recursive Mystery(X,Y) As (Select A As X, A As Y From T Union Select m1.X, m2.Y From Mystery m1, Mystery m2 Where m2.X = m1.Y + 1) Select Max(Y-X) + 1 From Mystery
The above query actually computes the length of the longest consecutive sequence in T.
With Recursive Peer(X,Y) As (Select M1.employee, M2.employee From Manager M1, Manager M2 Where M1.manager = M2.manager AND M1.employee < M2.employee Union Select M1.employee, M2.employee From Peer S, Manager M1, Manager M2 Where S.X = M1.manager AND S.Y = M2.manager And M1.employee < M2.employee) Select * from Peer