Database
Question 1: Relational Algebra Equivalences Each of the proposed equivalences below on relations r(R) and s(S) only holds under certain conditions. In the equivalences, • X, R and S are sets of attributes, • C is a single attribute, • COUNT(r) returns the number of tuples in r, and • ⋈ is natural join. (a) (4 points each) For each equivalence, give example schemas and instances for r and s where the equivalence does not hold. (Assume that the expressions contain no syntax errors. For example, in ii, X is assumed to be a subset of R.) (b) (6 points) For each equivalence, give a side condition that guarantees the equivalence will hold. The condition should be at the schema level (relation schemes, keys, foreign keys) and not at the instance level (number of tuples, specific values). The less restrictive the condition, the better. i. pX(r ⋈ s) º pX(r) ⋈ s ii. pX(r – (sC=5(r))) º pX(r) – pX(sC=5(r)) iii. COUNT(r ⋈ s) º COUNT(r) * COUNT(s) iv. COUNT(r ⋈ s) º COUNT(r) YOUR ANSWERS ------------------------------------------------- Question 2: Statistics a. (5 points) Determine the min and max salary values in the agent table, and the number of rows in that table. b. (5 points) Give an estimate for the number of rows in agent with salary < 75000, assuming a uniform distribution of salaries between min and max salary. Explain how you derived your estimate. c. (5 points) Find the 25th, 50th and 75th percentile values for salaries in the agent table. (The 50th percentile value, for instance, is the smallest number s such that 50% of the rows have salary value less than or equal to s.) d. (5 points) Give an estimate of the number of rows in agent with salary < 75000, assuming in a uniform distribution in each quartile determined in c. Explain how you derived your estimate. e. (5 points) How many rows in agent actually have salary < 75000? CHECK ANSWERS - JUST CHECK IF IT FOLLOW THE LOGIC A) select max(a.salary), min(a.salary), count(a.agent_id) from agent a; max | min | count --------+-------+------- 366962 | 50008 | 662
(1 row) B) select ((c.min_sal_rows - 75000)/(-(b.max - b.min)/b.count)) as estimated from (select max(a.salary), min(a.salary), count(a.agent_id) from agent a) as b, (select count(*) as min_sal_rows from agent where agent.salary < 75000) as c; estimate ---------- 156 ((#Rows of salaries less 75000) - 75000) / -((max salary) - (min salary)) / (total # of rows) C) f17tdb47=> select (b.max - b.min)*.5 fiftieth from (select max(a.salary), min(a.salary), count(a.agent_id) from agent a) as b; fiftieth ---------- 158477.0 (1 row) f17tdb47=> select (b.max - b.min)*.25 as twentyfifth from (select max(a.salary), min(a.salary), count(a.agent_id) from agent a) as b; twentyfifth ---------- 79238.50 (1 row) f17tdb47=> select (b.max - b.min)*.75 as seventyfifth from (select max(a.salary), min(a.salary), count(a.agent_id) from agent a) as b;
seventyfifth ----------- 237715.50 (1 row) ------ D) f17tdb47=> select ((75000 - b.min)/(b.count*.25)) as estimated from (select max(a.salary), min(a.salary), count(a.agent_id) from agent a) as b; estimated ---------------------- 151.0090634441087613
(1 row) E) f17tdb47=> select count(*) from agent a where a.salary < 75000 f17tdb47-> ; count ------- 427 (1 row) -------------------------------------------------
Question 3: Query Plans For each SQL statement below, draw the query plan that Postgres chooses (which you can obtain with the EXPLAIN command). For each plan, suggest a reason that the particular join algorithms were chosen. 1- explain SELECT A1.last, A2.last FROM agent A1, agent A2, languagerel LR, Language L WHERE A1.salary < A2.salary AND A2.agent_ID = LR.agent_ID AND LR.lang_ID = L.lang_ID AND L.language = 'French'; QUERY PLAN --------------------------------------------------------------------------------------------------------------- Nested Loop (cost=18.80..1044.78 rows=21967 width=14) Join Filter: (a1.salary < a2.salary) -> Seq Scan on agent a1 (cost=0.00..14.62 rows=662 width=11) -> Materialize (cost=18.80..37.41 rows=100 width=11) -> Hash Join (cost=18.80..36.91 rows=100 width=11) Hash Cond: (a2.agent_id = lr.agent_id) -> Seq Scan on agent a2 (cost=0.00..14.62 rows=662 width=15) -> Hash (cost=17.55..17.55 rows=100 width=4) -> Nested Loop (cost=5.05..17.55 rows=100 width=4) -> Seq Scan on language l (cost=0.00..1.25 rows=1 width=4) Filter: ((language)::text = 'French'::text) -> Bitmap Heap Scan on languagerel lr (cost=5.05..15.30 rows=100 width=8) Recheck Cond: (lang_id = l.lang_id) -> Bitmap Index Scan on languagerel_pkey (cost=0.00..5.03 rows=100 width=0) Index Cond: (lang_id = l.lang_id) (15 rows)
YOUR ANSWER 2- f17tdb47=> explain SELECT A1.last, A2.last FROM agent A1, agent A2, languagerel LR, Language L WHERE A1.salary = A2.salary AND A2.agent_ID = LR.agent_ID AND LR.lang_ID = L.lang_ID AND L.language = 'French'; QUERY PLAN --------------------------------------------------------------------------------------------------------------- Hash Join (cost=38.16..57.69 rows=242 width=14) Hash Cond: (a1.salary = a2.salary) -> Seq Scan on agent a1 (cost=0.00..14.62 rows=662 width=11) -> Hash (cost=36.91..36.91 rows=100 width=11) -> Hash Join (cost=18.80..36.91 rows=100 width=11) Hash Cond: (a2.agent_id = lr.agent_id) -> Seq Scan on agent a2 (cost=0.00..14.62 rows=662 width=15) -> Hash (cost=17.55..17.55 rows=100 width=4) -> Nested Loop (cost=5.05..17.55 rows=100 width=4) -> Seq Scan on language l (cost=0.00..1.25 rows=1 width=4) Filter: ((language)::text = 'French'::text) -> Bitmap Heap Scan on languagerel lr (cost=5.05..15.30 rows=100 width=8) Recheck Cond: (lang_id = l.lang_id) -> Bitmap Index Scan on languagerel_pkey (cost=0.00..5.03 rows=100 width=0) Index Cond: (lang_id = l.lang_id) (15 rows) YOUR ANSWER 3- explain SELECT A1.last, A2.last FROM agent A1, agent A2, languagerel LR, Language L WHERE A1.salary > A2.salary AND A2.agent_ID = LR.agent_ID AND LR.lang_ID = L.lang_ID AND L.language <> 'French'; QUERY PLAN ------------------------------------------------------------------------------------ Hash Join (cost=80.41..12135.98 rows=417379 width=14) Hash Cond: (a2.agent_id = lr.agent_id) -> Nested Loop (cost=0.00..6604.56 rows=146081 width=18) Join Filter: (a1.salary > a2.salary) -> Seq Scan on agent a1 (cost=0.00..14.62 rows=662 width=11) -> Materialize (cost=0.00..17.93 rows=662 width=15) -> Seq Scan on agent a2 (cost=0.00..14.62 rows=662 width=15) -> Hash (cost=56.77..56.77 rows=1891 width=4)
-> Hash Join (cost=1.49..56.77 rows=1891 width=4) Hash Cond: (lr.lang_id = l.lang_id) -> Seq Scan on languagerel lr (cost=0.00..28.91 rows=1991 width=8) -> Hash (cost=1.25..1.25 rows=19 width=4) -> Seq Scan on language l (cost=0.00..1.25 rows=19 width=4) Filter: ((language)::text <> 'French'::text) (14 rows) YOUR ANSWER
------------------------------------------------- Question 4: Recovery (10 points) There have been proposals to use multiple log files for recovery. Log records can be appended to any one of the files, but all records from the same transaction must go to the same file. Give one advantage of multiple log files and one disadvantage. CHECK ANSWERS - CORRECT IT IF IT'S WRONG Advantages of Multiple Log Files: The advantage of the multiple log files is that it can only be to an advantage if there are multiple file groups because in this case there will be separation of the read only data and also partitioning can be done equally.Hence, it is good with multiple file groups.Indexing on the different file groups is the only advantage. Disadvantages of Multiple Log Files: As log files will be written sequentially hence, they are of no use when not multiple file groups as they will be clustering indexes and hence, making the recovery process tough than earlier. Hence, these are the advantages and disadvantages of the multiple log file. ------------------------------------------------- Question 5: Decomposition Consider the following relational schema: viols(VID VT VD InD BID SNu SNa SCo Bro BrI Zip). (The names are short for Violation ID, Violation Type, Violation Description, Inspection Date, Building ID, Street Number, Street Name, Street Code, Borough, Boro ID, Zip Code.) Assume the following functional dependencies hold on violations. VID ® VT VD InD BID SNu SNa SCo Bro BrI Zip BID ® SNu SNa SCo Bro BrI Zip Bro ® BrI BrI ® Bro VT ® VD VD ® VT
SCo ® SNa SNu SCo Zip ® Bro a. (15 points) Show a decomposition of violations into BCNF. Show each step in your decomposition and the keys for each relation scheme. (You can have more than one key per scheme.) b. (5 points) Find a functional dependency in the original scheme that does not correspond to a key in your normalized scheme. c. (10 points) Is it possible that a legal (satisfies all keys) database instance d on your normalized scheme violates the FD in b., in the sense that if you join all the relations in d together, the result r violates the FD? Explain why or why not. YOUR ANSWERS