basic questions about database theory..
1. Introduction to Databases.pdf
CMSC 508 Database Theory
CMSC 508 Database Theory
Introduction to Databases
Dr. Alberto Cano Assistant Professor Department of Computer Science
Chapter 1 from Database System Concepts, 6th Ed. by Silberschatz, Korth, Sudarshan, 2011 Chapter 1 from Database Management Systems, 3rd Ed. by Ramakrishnan, Gehrke, 2003
Introduction to Databases
1
CMSC 508 Database Theory Introduction to Databases
Database Management System (DBMS) • Collection of interrelated data • Set of programs to access the data • An environment that is convenient, efficient, and reliable
Databases model the real world • Entities: abstractions of data • Relationships: semantic information between entities
Historical perspective • 1960s Network data model • 1960s Hierarchical data model • 1970s Relational data model • 1980s SQL standardization • 1990s Enterprise resource planning (ERP)
Management resource planning (MRP) 2
CMSC 508 Database Theory Introduction to Databases
Example: University database • Add new students, instructors, and courses • Register students for courses, and generate class rosters • Assign grades to students, compute GPA
3
CMSC 508 Database Theory Introduction to Databases
Example: University database • Add new students, instructors, and courses • Register students for courses, and generate class rosters • Assign grades to students, compute GPA
Drawbacks of using file systems to store data • Data redundancy and inconsistency in multiple files and formats • Difficulty in accessing data • Need to write a new program to carry out each new task • Data isolation • Integrity problems and constraints • Atomicity of updates • Concurrent access by multiple users • Security problems
Do not underestimate the power of Excel for a small business!! 4
CMSC 508 Database Theory Introduction to Databases
Advantages of a DBMS • Data Independence • Efficient Data Access • Data Integrity • Security and authorization • Data Administration • Concurrent Access and Crash Recovery • Reduced Application Development Time
Schema: logical structure of the database • Conceptual schema: describes the data model • Physical schema: defines how the conceptual schema is actually
stored in data structures on disks
Instance: actual contents of the database at a given point in time
5
CMSC 508 Database Theory Introduction to Databases
Levels of Abstraction • Physical level
Describes how data is stored • Logical level
Describes data and relationships • View level
Application interfaces
Physical Data Independence (physical schema - logical schema) • Applications depend on the views and logical schema interfaces
Physical Schema
Logical Schema
View 1 View 2 View 3
6
CMSC 508 Database Theory Introduction to Databases
Data Models • A collection of tools for describing
o Data o Data relationships o Data semantics o Data constraints
• Relational model • Entity-Relationship data model • Object-oriented data models • Semistructured data model (XML) • Other older models:
o Network model o Hierarchical model
7
CMSC 508 Database Theory Introduction to Databases
Relational model • Collection of tables to represent both data and the relationships • Tables have multiple columns with unique names • Tables contain records (tuples) that correspond to the attributes • Tables are called relations
Columns
Rows
8
CMSC 508 Database Theory Introduction to Databases
Attribute-Relation File Format (ARFF) • Developed by Weka team for Machine Learning and Data Mining • Relational model in a text file that describes a list of instances
sharing a set of attributes
@RELATION iris @ATTRIBUTE sepallength REAL @ATTRIBUTE sepalwidth REAL @ATTRIBUTE petallength REAL @ATTRIBUTE petalwidth REAL @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica} @DATA 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 5.5,2.3,4.0,1.3,Iris-versicolor 6.5,2.8,4.6,1.5,Iris-versicolor 5.7,2.8,4.5,1.3,Iris-versicolor 6.5,3.0,5.8,2.2,Iris-virginica 7.6,3.0,6.6,2.1,Iris-virginica 4.9,2.5,4.5,1.7,Iris-virginica 7.3,2.9,6.3,1.8,Iris-virginica
9
CMSC 508 Database Theory Introduction to Databases
Data Definition Language (DDL) • Specification notation for defining the database schema
Example: create table instructor ( ID char(5), name varchar(20), dept_name varchar(20), salary numeric(8,2))
• DDL compiler generates a set of tables stored in a data dictionary • Data dictionary contains metadata
o Database schema o Integrity constraints
• Primary key • Referential integrity
o Authorization
10
CMSC 508 Database Theory Introduction to Databases
Data Manipulation Language (DML) • Language for accessing and manipulating the data • Two classes of languages
o Procedural • Specifies what data is required and how to get those data
o Declarative • Specifies what data is required without specifying how
• SQL is the most widely used query language (declarative)
Example: select name from instructor where instructor.ID = ‘22222’
Application programs generally access databases through: • Language extensions to allow embedded SQL • Application program interface (e.g., ODBC/JDBC) which allow SQL
queries to be sent to a database (drivers) 11
CMSC 508 Database Theory Introduction to Databases
Database Design • Logical Design
• Deciding the specification on the database schema • Effective and efficient collection of relation schemas • What relations, attributes, and their types should we have
• Physical Design • Deciding on the physical layout of the database
• Software Design perspective 1. Analysis of requirements 2. Functional and nonfunctional analysis 3. Design synthesis
12
CMSC 508 Database Theory Introduction to Databases
Is there any problem with this design?
13
CMSC 508 Database Theory Introduction to Databases
Database design major issues • Isolation • Inconsistency
o Different rows should different content for an attribute that should have the same value
• Integrity o A given value is not valid according to the specification
Normalization theory • Process of organizing the columns and tables of a relational
database to minimize data redundancy • Normalization decomposes a table into smaller tables without
losing information • References are defined using primary keys and foreign keys
14
CMSC 508 Database Theory Introduction to Databases
The Entity-Relationship model • Data modelling as a collection of entities and relationships • Represented diagrammatically by an entity-relationship diagram • Unified Modeling Language (UML) • Entity sets are represented by a rectangular box • Relationship sets are represented by a diamond connecting
related entity sets
• Entities are translated into tables • How do we translate the relationship?
15
CMSC 508 Database Theory Introduction to Databases
Object-Relational Data Models • Objects, classes and inheritance are directly supported in
database schemas and in the query language
• Extend the relational data model by including object orientation and constructs to deal with added data types
• Allow attributes of tuples to have complex types, including non- atomic values such as nested relations
• Preserve relational foundations, in particular the declarative access to data, while extending modeling power
• Provide upward compatibility with existing relational languages
16
CMSC 508 Database Theory Introduction to Databases
Semistructured Data Model: XML - Extensible Markup Language • Originally intended as a document markup language • The ability to specify new tags, and to create nested tag
structures made XML a great way to exchange data • XML has become the basis for new generation data interchange • A wide variety of tools is available for parsing, browsing and
querying XML documents/data
<?xml version="1.0"?> <catalog>
<book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description>
</book> </catalog>
17
CMSC 508 Database Theory Introduction to Databases
Storage Management • Storage manager is a program that provides the interface
between the low-level data stored in the database and the application programs and queries submitted to the system.
• The storage manager is responsible to the following tasks:
o Interaction with the file manager
o Efficient storing, retrieving and updating of data
• Issues:
o Storage access
o File organization
o Indexing and hashing
18
CMSC 508 Database Theory Introduction to Databases
Query processing 1. Parsing and translation 2. Optimization 3. Evaluation
• Cost difference between equivalent queries can be enormous 19
CMSC 508 Database Theory Introduction to Databases
Transaction management and concurrency • System failures (“If anything can go wrong, it will”)
o DBMS provides fault-tolerant design using transactions
• Concurrency o Interleaving actions of different users leads to inconsistency
20
CMSC 508 Database Theory Introduction to Databases
Transaction management and concurrency • Transaction: collection of operations that performs a single
logical function (atomic) in a database application
• Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures
• Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database
21
CMSC 508 Database Theory Introduction to Databases
ACID properties • Atomicity: Either all operations of the transaction are properly
reflected in the database or none are
• Consistency: Execution of a transaction in isolation preserves the consistency of the database
• Isolation: Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions
For every pair of transactions Ti and Tj, it appears to Ti that either Tj, finished execution before Ti started, or Tj started execution after Ti finished
• Durability: After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures
22
CMSC 508 Database Theory Introduction to Databases
Database users and administrators • End users use application interfaces (e.g. web user)
• Application programmers write functionality (e.g. webmasters)
• Analysts use query tools (e.g. statistical business analyst)
• Database administrator (DBA)
o Designs logical /physical schemas
o Handles security and authorization
o Data availability, crash recovery
o Database tuning as needs evolve
• Authorization and abstraction level depend on the user type
23
CMSC 508 Database Theory Introduction to Databases
Database system internals
24
CMSC 508 Database Theory Introduction to Databases
Database architecture • Centralized: run locally on a single computer • Parallel / distributed • Most databases today use client-server architecture over network
25
CMSC 508 Database Theory Introduction to Databases
Summary • DBMS consists of a collection of interrelated data and programs
• DBMS provide an environment that is both convenient and efficient to use in retrieving and storing information
• A major purpose of a database system is to provide users with an abstract view of the data
• The relational data model is the most widely deployed model for storing data in databases
• DDL and DML allow users to specify, access and manipulate data
• Transaction management ensures that the database remains in a consistent (correct) state despite system failures
• There are four different types of database-system users, differentiated by the way to interact with the system
26
CMSC 508 Database Theory Introduction to Databases
Review Terms
Database management system (DBMS) Data inconsistency Consistency constraints Data abstraction Instance Schema Physical schema Logical schema Physical data independence Data models Entity-relationship model Relational data model Object-based data model Semistructured data model Database languages Data-definition language
Data-manipulation language Query language Metadata Application program Normalization Data dictionary Storage manager Query processor Transactions ACID properties Atomicity Failure recovery Concurrency control Two- and three-tier architectures Database administrator (DBA) Tuple
27
CMSC 508 Database Theory
CMSC 508 Database Theory
Introduction to Databases
Dr. Alberto Cano Assistant Professor Department of Computer Science
Chapter 1 from Database System Concepts, 6th Ed. by Silberschatz, Korth, Sudarshan, 2011 Chapter 1 from Database Management Systems, 3rd Ed. by Ramakrishnan, Gehrke, 2003
Introduction to Databases
28
__MACOSX/._1. Introduction to Databases.pdf
2. The relational model and relational algebra(2).pdf
CMSC 508 Database Theory
CMSC 508 Database Theory
The relational model and relational algebra
Dr. Alberto Cano Assistant Professor Department of Computer Science
Chapters 2,6 from Database System Concepts, 6th Ed. by Silberschatz, Korth, Sudarshan, 2011 Chapters 3,4 from Database Management Systems, 3rd Ed. by Ramakrishnan, Gehrke, 2003
Relational model
1
CMSC 508 Database Theory Relational model
Relational model • Collection of tables to represent both data and the relationships • Tables have multiple columns (attributes) with unique names • Tables contain records (tuples) that correspond to the attributes • Tables are called relations • #Rows = cardinality, #Columns = arity • Order of tuples is irrelevant Columns
Rows
2
CMSC 508 Database Theory Relational model
Attribute types • Domain: set of allowed values for each attribute
• Attribute values are (normally) required to be atomic (indivisible)
• The special value null is a member of every domain
• The null value causes complications in many operations
• The DDL specifies the attribute domain o Strings (VARCHAR, CHAR) o Numeric (NUMBER, INTEGER) o Date (DATE) o Objects (BFILE)
3
CMSC 508 Database Theory Relational model
Integrity constraints • Conditions that must be true for any instance
• Specified when schema is defined
• Checked when relations are modified
• Domain integrity:
o Data type o Length or size o Is null value allowed o Is the value unique or not for an attribute
4
CMSC 508 Database Theory Relational model
Keys • Set of attributes. Let K R
• K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)
Example: {ID} and {ID,name} are both superkeys of instructor
• Superkey K is a candidate key if K is minimal
Example: {ID} is a candidate key for Instructor
• One of the candidate keys is selected to be the primary key
• Any two individual tuples in the relation are prohibited from
having the same value on the key attributes at the same time
• Foreign key is a column(s) that references a column (most often
the primary key) of another table
• Use of foreign keys ensure referential integrity of the data 5
CMSC 508 Database Theory Relational model
Schema Diagram for University Database
• Identification of entities
• Identification of relations
• Conceptual schema -> relational schema
Time to think a basic conceptual and relational schema
6
CMSC 508 Database Theory Relational model
Schema Diagram for University Database
• Identification of entities o Student (name) o Instructor (name, salary) o Department (name, budget) o Course (title, credits)
• Identification of relationships o Students take courses o Instructors teach courses o Instructors are members of departments o Courses are organized by departments
7
CMSC 508 Database Theory Relational model
Schema Diagram for University Database • Conceptual schema
• Primary keys? • Foreign keys? • Cardinalities? • Grades? • Translation into a relational schema • What if … ? 8
CMSC 508 Database Theory Relational model
Schema Diagram for University Database
9
CMSC 508 Database Theory Relational model
Schema Diagram for University Database
classroom(building, room number, capacity)
department(dept name, building, budget)
course(course id, title, dept name, credits)
instructor(ID , name, dept name, salary)
section(courseid, secid, semester, year, building, roomnumber, time slot id)
teaches(ID , course id, sec id, semester, year)
student(ID , name, dept name, tot cred)
takes(ID , course id, sec id, semester, year, grade)
advisor(sID , iID )
time slot(time slot id, day, start time, end time)
prereq(course id, prereq id)
10
CMSC 508 Database Theory Relational algebra
Relational Query Languages • Procedural vs or declarative • Pure languages:
o Relational algebra o Tuple relational calculus o Domain relational calculus
• Relational operators o Set operators: union, intersection, difference, product o Selection o Projection o Join: natural join, θ-join, equijoin, outer joins, inner join o Division
11
CMSC 508 Database Theory Relational algebra
Union () • Let r and s be relations on schemas R and S respectively • 𝑅 𝑆 = 𝑥 𝑥 ∈ 𝑅 𝑜𝑟 𝑥 ∈ 𝑆}
• union-compatible: relations must have the same set of attributes
12
r s
CMSC 508 Database Theory Relational algebra
Intersection () • Let r and s be relations on schemas R and S respectively • 𝑅 𝑆 = 𝑥 𝑥 ∈ 𝑅 𝑎𝑛𝑑 𝑥 ∈ 𝑆}
• union-compatible: relations must have the same set of attributes
13
r s
CMSC 508 Database Theory Relational algebra
Difference(−) • Let r and s be relations on schemas R and S respectively • 𝑅 − 𝑆 = { 𝑥 ∈ 𝑅 | 𝑥 ∉ 𝑆}
• union-compatible: relations must have the same set of attributes
14
r −s
CMSC 508 Database Theory Relational model
Cartesian product(×) • Output all pairs of rows from the two input relations • 𝑅 × 𝑆 = 𝑟, 𝑠 𝑟 ∈ 𝑅 𝑎𝑛𝑑 𝑠 ∈ 𝑆}
• Properties o NOT commutative 𝑅 × 𝑆 ≠ 𝑆 × 𝑅 o NOT associative 𝑅 × 𝑆 × 𝑊 ≠ 𝑅 × (𝑆 × 𝑊) o 𝑅 𝑆 × 𝑊 𝑍 = 𝑅 𝑊 × 𝑆 𝑍 o 𝑅 𝑆 × 𝑊 𝑍 ≠ 𝑅𝑊 × 𝑆 𝑍 o 𝑅 × 𝑆𝑊 = 𝑅 × 𝑆 𝑅 × 𝑆 o 𝑅 × 𝑆𝑊 = 𝑅 × 𝑆 𝑅 × 𝑆 15
r ×s
CMSC 508 Database Theory Relational algebra
Cartesian product(×)
16
SQL example: select leader_name, title from LEADER, OCCUPATION
Be aware of the size of the product! Attributes are disjoint
CMSC 508 Database Theory Relational algebra
Selection (σ) • Returns rows of the relation that satisfy the predicate • Let R (A,B,C,D)
Properties • Idempotent
σA(R ) = σA(σA(R ))
• Commutative
σA(σB(R ) ) = σB(σA(R ))
• Breaking up
σA ∧ B(R) = σA(σB(R)) = σB(σA(R))
σA ∨ B(R) = σA(R) σB(R)
17
σA=B ∧ D > 5(R )
CMSC 508 Database Theory Relational algebra
Projection (∏) • Outputs specified attributes from all rows of the input relation,
and removes duplicate tuples from the output • Let R (A,B,C)
Properties • Idempotent
∏A1,…,AN (∏B1,…,BM
(R )) = ∏A1,…,AN (R )
where {A1,…,AN} ⊆ {B1,…,BM}
• Distributive over set union
∏A1,…,AN (R P ) = ∏A1,…,AN
(R ) ∏A1,…,AN (P )
Projection does NOT distribute over intersection and set difference 18
∏A,C(R )
CMSC 508 Database Theory Relational algebra
Natural join (⨝) • Let r and s be relations on schemas R and S respectively • Natural join of relations R and S is a relation on schema R S as:
o Consider each pair of tuples tr from r and ts from s. o If tr and ts have the same value on each of the attributes
in R S, add a tuple t to the result, where t has the same value as tr on r t has the same value as ts on s
• Properties: associative and commutative
19r ⨝ s
CMSC 508 Database Theory Relational algebra
Theta-join (θ) • R ⨝θ S = σθ (R×S) • Selection σ meeting condition θ after cross product
Equijoin • θ is the equality operator
20
A B
3 2
2 5
C D
2 1
4 3
4 7
A B C D
3 2 4 7
2 5 4 3
2 5 4 7
R ⨝A<D SR S
CMSC 508 Database Theory Relational algebra
Outer join • An extension of the join operation that avoids loss of information • Computes the join and then adds tuples form one relation that
does not match tuples in the other relation • Uses null values
o null signifies unknown or does not exist o All comparisons involving null are false by definition o The result of any arithmetic expression involving null is null
21
Comp. Sci. Finance Music
ID dept_name
10101 12121 15151
name
Srinivasan Wu Mozart
ID course_id
10101 12121 76766
CS-101 FIN-201 BIO-101
Instructor Teaches
CMSC 508 Database Theory Relational algebra
Inner join • Instructor ⨝ teaches
Left outer join • Instructor teaches
22
Comp. Sci. Finance
ID dept_name
10101 12121
name
Srinivasan Wu
course_id
CS-101 FIN-201
Comp. Sci. Finance Music
ID dept_name
10101 12121 15151
name
Srinivasan Wu Mozart
course_id
CS-101 FIN-201 null
CMSC 508 Database Theory Relational algebra
Right outer join • Instructor teaches
Full outer join • Instructor teaches
23
Comp. Sci. Finance null
ID dept_name
10101 12121 76766
name
Srinivasan Wu null
course_id
CS-101 FIN-201 BIO-101
Comp. Sci. Finance Music null
ID dept_name
10101 12121 15151 76766
name
Srinivasan Wu Mozart null
course_id
CS-101 FIN-201 null BIO-101
CMSC 508 Database Theory Relational algebra
Division • Let r and s be relations on schemas R and S respectively and S ⊆ R • R / S is a relation on schema R – S • A tuple t is in R/S if satisfies both:
o t is in ∏R-S(r) o For every tuple ts in s, there is a tuple tr in r satisfying both:
• tr[S] = ts[S] • tr[R-S] = t
24
CMSC 508 Database Theory Relational algebra
Division
25
A
B1
B2
B3
A/B1 A/B2 A/B3
CMSC 508 Database Theory Relational model
Summary • The relational data model is based on a collection of tables
• The schema refers to the logical design, instance refers to its contents
• The schema includes attributes, types of the attributes and constraints
• A superkey is a set of attributes guaranteed to identify tuples uniquely
• A candidate key is a minimal superkey
• The primary key is one of the candidate keys
• A foreign key is a set of attributes in a referencing relation
• A schema diagram shows the relations, attributes, and keys
• The relational algebra provides a set of operations between relations
26
CMSC 508 Database Theory Relational model
Review Terms
Table Relation Tuple Attribute Domain Atomic domain Null value Database schema Database instance Relation schema Relation instance Keys Superkey Candidate key
Primary key Foreign key Referential integrity constraint Schema diagram Query language Procedural language Nonprocedural language Operations on relations Selection Projection Joins Cartesian product Set operations Relational algebra
27
CMSC 508 Database Theory
CMSC 508 Database Theory
The relational model and relational algebra
Dr. Alberto Cano Assistant Professor Department of Computer Science
Chapters 2,6 from Database System Concepts, 6th Ed. by Silberschatz, Korth, Sudarshan, 2011 Chapters 3,4 from Database Management Systems, 3rd Ed. by Ramakrishnan, Gehrke, 2003
Relational model
28
__MACOSX/._2. The relational model and relational algebra(2).pdf
Homework1.pdf
CMSC 508 – Database Theory
Introduction to databases, relational model and relational algebra
1. List four significant differences between a file-processing system and a DBMS. Why would you
choose a database system instead of simply storing data in operating system files? When would
it make sense not to use a database system?
2. Explain the concept of physical data independence, and its importance in database systems.
3. Explain the difference between two-tier and three-tier architectures.
4. What is a transaction? Define ACID.
5. What is a foreign key constraint? Why are such constraints important? What is referential
integrity?
6. A university database contains information about professors (identified by social security
number, or SSN) and courses (identified by courseid). Professors can teach the same course in
several semesters, and only the most recent such offering needs to be recorded. Draw a diagram
that describes it (assuming no further constraints hold).
7. Consider the instance of the Students relation shown below.
1. Give an example of an attribute (or set of attributes) that you can deduce is not a
candidate key, based on this instance being legal.
2. Is there any example of an attribute (or set of attributes) that you can deduce is a
candidate key, based on this instance being legal?
3. Which of the candidate keys are valid for primary keys? Which one would you choose?
8. Given two relations R1 and R2, where R1 contains N1 tuples, R2 contains N2 tuples, and
N2 > N1 > 0, give the minimum and maximum possible sizes (in tuples) for the resulting relation
produced by each of the following relational algebra expressions. In each case, state any
assumptions about the schemas for R1 and R2 needed to make the expression meaningful:
(1) R1 ∪ R2, (2) R1 ∩ R2, (3) R1−R2, (4) R1 × R2, (5) σa=5(R1), (6) πa(R1), and (7) R1/R2
9. Consider the following bank database. What are the appropriate primary and foreign keys?
branch (branch name, branch city, assets)
customer (customer name, customer street, customer city)
loan (loan number, branch name, amount)
borrower (customer name, loan number)
account (account number, branch name, balance)
depositor (customer name, account number)
10. Consider the following relational database.
employee (person name, street, city)
works (person name, company name, salary)
company (company name, city)
manages (person name, manager name)
Give an expression in the relational algebra to express each of the following queries:
a. Find the names of all employees who work for “First Bank Corporation”.
b. Find the names and cities of residence of all employees who work for “First Bank Corporation”.
c. Find the names, street address, and cities of residence of all employees who work for “First
Bank Corporation” and earn more than $10,000.
d. Find the names of all employees in this database who live in the same city as the company for
which they work.
e. Assume the companies may be located in several cities. Find all companies located in every
city in which “Small Bank Corporation” is located.
11. Consider the following relations, calculate the operations:
a. Graduates U Managers
b. Graduates ∩ Managers
c. Graduates – Managers
12. Consider the following relations, calculate the operation:
13. Define and give an example of the division operator using two relations.