+ All Categories
Home > Documents > Database System Concepts

Database System Concepts

Date post: 07-Jan-2016
Category:
Upload: ike
View: 46 times
Download: 0 times
Share this document with a friend
Description:
Database System Concepts. Instructor: Jin Yuanping [email protected] Professor of College of Software Engineering, Southeast University The reference book: Database System Concepts (4th Edition), A. Silberschatz , et al, McGraw-Hill, 2002. The Content of the Course 1 Introduction - PowerPoint PPT Presentation
327
1 JYP Database System Database System Concepts Concepts Instructor: Jin Yuanping [email protected] Professor of College of Software Engineering, Southeast University The reference book: Database System Concepts (4th Edition), A. Silberschatz , et al, McGraw-Hill, 2002
Transcript
Page 1: Database System Concepts

1JYP

Database System ConceptsDatabase System Concepts

Instructor:

Jin Yuanping [email protected] Professor of College of Software Engineering, Southeast University

The reference book:

Database System Concepts (4th Edition),

A. Silberschatz , et al, McGraw-Hill, 2002

Page 2: Database System Concepts

2JYP

The Content of the Course1 Introduction

2 Entity-Relationship Model

3 Relational Model

4 SQL

5 Integrity Constraints

6 Transactions

7 Concurrency Control

8 Database System Architecture

9 New Applications

10 XML

Page 3: Database System Concepts

3JYP

Topic 1: IntroductionTopic 1: Introduction

Database System

View of Data

Data Models

Data Definition Language

Data Manipulation Language

Transaction Management

Storage Management

Database Administrator

Database Users

Overall System Structure

Page 4: Database System Concepts

4JYP

A Database System A Database System

A collection of interrelated data (DB)

A set of programs to access the data (DBMS)

A set of application programs

A group of people to administrate it (DBA)

DB contains information about a particular enterprise

DBMS provides an environment that is both convenient and efficient to use.

Page 5: Database System Concepts

5JYP

View of DataView of Data

An architecture for a database system

Page 6: Database System Concepts

6JYP

Levels of AbstractionLevels of Abstraction

Physical level describes how a record (e.g., customer) is stored.

Logical level: describes data stored in database, and the relationships among the data.

type customer = recordname : string;street : string;city : integer;

end;

View level: application programs hide details of data types. Views can also hide information (e.g., salary) for security purposes.

Page 7: Database System Concepts

7JYP

Instances and SchemasInstances and Schemas

Similar to types, variables and values in programming languages

Schema – the logical structure of the database (e.g., set of customers and accounts and the relationship between them)

Instance – the actual content of the database at a particular point in time

Page 8: Database System Concepts

8JYP

Data IndependenceData Independence

Ability to modify a schema definition in one level without affecting a schema definition in the next higher level.

The interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.

Two levels of data independence: Physical data independence

Logical data independence

Page 9: Database System Concepts

9JYP

Data ModelsData Models

A collection of concepts and tools for describing data

data relationships

data semantics

data constraints

Entity-relationship model

Object-oriented model

Relational model (Oracle, DB2)

Semi-structured model (XML)

Page 10: Database System Concepts

10JYP

Entity-Relationship ModelEntity-Relationship Model

Example of entity-relationship model

Page 11: Database System Concepts

11JYP

Relational ModelRelational Model

Example of tabular data in the relational model

JonesSmithCurry

Lindsay

customer-name

MainNorthNorthPark

customer-street

HarrisonRyeRye

Pittsfield

customer-city

customer

Page 12: Database System Concepts

12JYP

XML DataXML Data

<bank> <account>

<account-number> A-101 </account-number> <branch-name> Downtown </branch-name> <balance> 500 </balance>

</account> <depositor>

<account-number> A-101 </account-number> <customer-name> Johnson </customer-name>

</depositor> </bank>

Page 13: Database System Concepts

13JYP

Data Definition Language (DDL)Data Definition Language (DDL)

Specification notation for defining the database schema

DDL compiler generates a set of tables

Data dictionary contains metadata (i.e., data about data)

Data storage and definition language – special type of DDL in which the storage structure and access methods used by the database system are specified

Page 14: Database System Concepts

14JYP

Data Manipulation Language (DML)Data Manipulation Language (DML)

Language for accessing and manipulating the data organized by the appropriate data model

Two classes of languages Procedural – user specifies what data is required and how to get

those data

Nonprocedural – user specifies what data is required without specifying how to get those data

Page 15: Database System Concepts

15JYP

Transaction ManagementTransaction Management

A transaction is a collection of operations that performs a single logical function in a database application

Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.

Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.

Page 16: Database System Concepts

16JYP

Storage ManagementStorage Management

Storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.

The storage manager is responsible for the following tasks: interaction with the file manager

efficient storing, retrieving and updating of data

Page 17: Database System Concepts

17JYP

Database AdministratorDatabase Administrator

Coordinates all the activities of the database system; the database administrator has a good understanding of the enterprise’s information resources and needs.

Database administrator's duties include: Schema definition

Storage structure and access method definition

Schema and physical organization modification

Granting user authority to access the database

Specifying integrity constraints

Acting as liaison with users

Monitoring performance and responding to changes in requirements

Page 18: Database System Concepts

18JYP

Database UsersDatabase Users

Users are differentiated by the way they expect to interact with the system

Application programmers – interact with system through DML calls

Sophisticated users – form requests in a database query language

Specialized users – write specialized database applications that do not fit into the traditional data processing framework

Naïve users – invoke one of the permanent application programs that have been written previously

Page 19: Database System Concepts

19JYP

Overall System StructureOverall System Structure

Page 20: Database System Concepts

20JYP

Overall System Structure (Cont.)Overall System Structure (Cont.)

The storage manager components also include:

Authorization and integrity manager

Page 21: Database System Concepts

21JYP

Topic 2: Entity-Relationship ModelTopic 2: Entity-Relationship Model

Entity Sets

Relationship Sets

Mapping Constraints

Keys

E-R Diagram

Design of an E-R Database Schema

Reduction of an E-R Schema to Tables

Page 22: Database System Concepts

22JYP

Entity SetsEntity Sets

A database can be modeled as: a collection of entities,

relationship among entities.

An entity is an object that exists and is distinguishable from other objects.

Example: specific person, company, event, plant

An entity set is a set of entities of the same type that share the same properties.

Example: set of all persons, companies, trees, holidays

Page 23: Database System Concepts

23JYP

Entity Sets Customer and LoanEntity Sets Customer and Loan

Page 24: Database System Concepts

24JYP

AttributesAttributes

An entity is represented by a set of attributes, that is descriptive properties possessed by all members of an entity set.

Example:

customer = (customer-name, social-security, customer-street, customer-city)

account = (account-number, balance)

Domain – the set of permitted values for each attribute

Attribute types: Simple and composite attributes.

Single-valued and multi-valued attributes.

Null attributes.

Derived attributes.

Page 25: Database System Concepts

25JYP

Composite Attributes Composite Attributes customer-namecustomer-name and and customer-addresscustomer-address

Page 26: Database System Concepts

26JYP

Relationship SetsRelationship Sets

A relationship is an association among several entitiesExample:

Hayes (Hayes, A-102) A-102customer depositor account

entity set relationship set entity set

A relationship set is a mathematical relation among n 2 entity sets.

{(e1, e2, … en) | e1 E1, e2 E2, …, en En}

where (e1, e2, …, en) is a relationship Example:

(Hayes, A-102) depositor

Page 27: Database System Concepts

27JYP

Relationship Sets (Cont.)Relationship Sets (Cont.)

an attribute can also be property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date

E-R Diagram with an Attribute Attached to a Relationship

Page 28: Database System Concepts

28JYP

Relationship set Relationship set borrowerborrower

Page 29: Database System Concepts

29JYP

Degree of a Relationship SetDegree of a Relationship Set

Refers to number of entity sets that participate in a relationship set.

Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.

Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Page 30: Database System Concepts

30JYP

RolesRoles

The labels “manager” and “worker” are called roles; they specify how employee entities interact via the works-for relationship set.

Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

Role labels are optional, and are used to clarify semantics of the relationship

Entity sets of a relationship need not be distinct

Page 31: Database System Concepts

31JYP

Mapping CardinalitiesMapping Cardinalities

Express the number of entities to which another entity can be associated via a relationship set.

Most useful in describing binary relationship sets.

For a binary relationship set the mapping cardinality must be one of the following types: One to one

One to many

Many to one

Many to many

We distinguish among these types by drawing either a directed line (), signifying “one,” or an undirected line (—), signifying “many,” between the relationship set and the entity set.

Page 32: Database System Concepts

32JYP

Mapping CardinalitiesMapping Cardinalities

One to one One to many

Page 33: Database System Concepts

33JYP

Mapping Cardinalities Mapping Cardinalities

Many to one Many to many

Page 34: Database System Concepts

34JYP

One-To-One RelationshipOne-To-One Relationship

A customer is associated with at most one loan via the relationship borrower

A loan is associated with at most one customer via borrower

Page 35: Database System Concepts

35JYP

One-To-Many and Many-To-One One-To-Many and Many-To-One RelationshipsRelationships

Page 36: Database System Concepts

36JYP

One-To-Many and Many-To-One (Cont.)One-To-Many and Many-To-One (Cont.)

In the one-to-many relationship (a), a loan is associated with at most one customer via borrower, a customer is associated with several (including 0) loans via borrower

In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower, a customer is associated with at most one loan via borrower

Page 37: Database System Concepts

37JYP

Many-To-Many RelationshipMany-To-Many Relationship

A customer is associated with several (possibly 0) loans via borrower

A loan is associated with several (possibly 0) customers via borrower

Page 38: Database System Concepts

38JYP

Existence DependenciesExistence Dependencies

If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y. y is a dominant entity (in example below, loan)

x is a subordinate entity (in example below, payment)

loan-payment paymentloan

If a loan entity is deleted, then all its associated payment entities must be deleted also.

Page 39: Database System Concepts

39JYP

KeysKeys

A super key of an entity set is a set of one or more attributes whose values uniquely determine each entity.

A candidate key of an entity set is a minimal super key social-security is candidate key of customer

account-number is candidate key of account

Although several candidate keys may exist, one of the candidate keys is selected to be the primary key.

The combination of primary keys of the participating entity sets forms a candidate key of a relationship set. must consider the mapping cardinality and the semantics of the

relationship set when selecting the primary key.

(social-security, account-number) is the primary key of depositor)

Page 40: Database System Concepts

40JYP

E-R Diagram ComponentsE-R Diagram Components

Rectangles represent entity sets.

Ellipses represent attributes

Diamonds represent relationship sets.

Lines link attributes to entity sets and entity sets to relationship sets.

Double ellipses represent multivalued attributes.

Dashed ellipses denote derived attributes.

Primary key attributes are underlined.

Page 41: Database System Concepts

41JYP

Weak Entity SetsWeak Entity Sets

An entity set that does not have a primary key is referred to as a weak entity set.

The existence of a weak entity set depends on the existence of a strong entity set, it must relate to the strong set via a one-to-many relationship set.

The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set that depend on one particular strong entity.

The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set’s discriminator.

Page 42: Database System Concepts

42JYP

Weak Entity Sets (Cont.)Weak Entity Sets (Cont.) We depict a weak entity set by double rectangles.

We underline the discriminator of a weak entity set with a dashed line.

payment-number – discriminator of the payment entity set

Primary key for payment – (loan-number, payment-number)

Page 43: Database System Concepts

43JYP

SpecializationSpecialization

Top-down design process, we designate subgroupings within an entity set that are distinctive from other entities in the set.

These subgroupings become lower-level entity sets that have attributes or participate in relationships that do not apply to the higher-level entity set.

Depicted by a triangle component labeled ISA (i.e., saving-account “is an” account).

Page 44: Database System Concepts

44JYP

Specialization ExampleSpecialization Example

Page 45: Database System Concepts

45JYP

GeneralizationGeneralization

A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.

Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.

Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Page 46: Database System Concepts

46JYP

Design Constraints on a GeneralizationDesign Constraints on a Generalization

Constraint on which entities can be members of a given lower-level entity set. condition-defined user-defined

Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization. disjoint overlapping

Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization.

total partial

Page 47: Database System Concepts

47JYP

E-R Diagram for Banking EnterpriseE-R Diagram for Banking Enterprise

Page 48: Database System Concepts

48JYP

Reduction of an E-R Schema to TablesReduction of an E-R Schema to Tables

Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.

A database which conforms to an E-R diagram can be represented by a collection of tables.

For each entity set and relationship set there is a unique table which is assigned the name of the corresponding entity set or relationship set.

Each table has a number of columns (generally corresponding to attributes), which have unique names.

Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Page 49: Database System Concepts

49JYP

Representing Entity Sets as TablesRepresenting Entity Sets as Tables

A strong entity set reduces to a table with the same attributes.

JonesSmithHayes

321-12-3123019-28-3746677-89-9011

MainNorthMain

HarrisonRye

Harrison

customer-name social-security c-street c-city

A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

L-17L-23L-15

51122

10 May 199617 May 199612 May 1996

loan-number payment-number payment-date

5075

300

payment-amount

Page 50: Database System Concepts

50JYP

Representing Relationship Sets as TablesRepresenting Relationship Sets as Tables A many-to-many relationship set is represented as a table with

columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.

social-security account-number access-date

... ... ...

The depositor table

The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant.The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

Page 51: Database System Concepts

51JYP

Representing Generalization as TablesRepresenting Generalization as Tables Method 1: Form a table for the generalized entity account.

Form a table for each entity set that is specialized (include primary key of generalized entity set)

table table attributesaccount account-number, balance, account-typesavings-account account-number, interest-ratechecking-account account-number, overdraft-amount

Method 2: If the generalization is disjoint and total, form a table for each entity set that is specialized

table table attributessavings-account account-number, balance, interest-ratechecking-account account-number, balance, overdraft-

amountMethod 2 has no table for generalized entity account

Page 52: Database System Concepts

52JYP

Topic 3: Relational ModelTopic 3: Relational Model

Structure of Relational Databases

Relational Algebra

Extended Relational-Algebra-Operations

Modification of the Database

Views

Page 53: Database System Concepts

53JYP

Basic StructureBasic Structure

Given sets A1, A2, …. An, a relation r is a subset of

A1 x A2 x … x An

Thus a relation is a set of n-tuples (a1, a2, …, an)

where ai Ai

Example: if

customer-name = {Jones, Smith, Curry, Lindsay}customer-street = {Main, North, Park}customer-city = {Harrison, Rye, Pittsfield}

Then

r = {(Jones, Main, Harrison), (Smith, North, Rye), (Curry, North, Rye),

(Lindsay, Park, Pittsfield)}

is a relation over

customer-name x customer-street x customer-city

Page 54: Database System Concepts

54JYP

Relation SchemaRelation Schema

A1, A2, …, An are attributes and also as their domains

R = (A1, A2, …, An ) is a relation schema

Customer-schema = (customer-name, customer-street, customer-city)

r(R) is a relation on the relation schema R

customer (Customer-schema)

Page 55: Database System Concepts

55JYP

Relation InstanceRelation Instance

The current values (relation instance) of a relation are specified by a table

An element t of r is a tuple, represented by a row in a table

JonesSmithCurry

Lindsay

customer-name

MainNorthNorthPark

customer-street

HarrisonRyeRye

Pittsfield

customer-city

customer

Page 56: Database System Concepts

56JYP

KeysKeys

Let K R

K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R) by “possible r” we mean a relation r that could exist in the enterprise we are modeling.Example: {customer-name, customer-street} and {customer-name} are both superkeys of Customer, if no two customers can possibly have the same name.

K is a candidate key if K is minimalExample: {customer-name} is a candidate key for Customer, since it is a superkey (assuming no two customers can possibly have the same name), and no subset of it is a superkey.

Page 57: Database System Concepts

57JYP

Determining Keys from E-R SetsDetermining Keys from E-R Sets

Strong entity set. The primary key of the entity set becomes the primary key of the relation.

Weak entity set. The primary key of the relation consists of the union of the primary key of the strong entity set and the discriminator of the weak entity set.

Relationship set. The union of the primary keys of the related entity sets becomes a super key of the relation.For binary many-to-one relationship sets, the primary key of the “many” entity set becomes the relation’s primary key.For one-to-one relationship sets, the relation’s primary key can be that of either entity set.

Page 58: Database System Concepts

58JYP

Query LanguagesQuery Languages

Language in which user requests information from the database.

Typically a level higher than a standard programming language.

Categories of languages procedural

non-procedural

“Pure” languages---concise: Relational Algebra

Tuple Relational Calculus

Domain Relational Calculus

Pure languages form underlying basis of query languages that people use.

Page 59: Database System Concepts

59JYP

Relational AlgebraRelational Algebra

Procedural language

Six basic operators select

project

union

set difference

Cartesian product

rename

The operators take one or two relations as inputs and give a new relation as a result.

Page 60: Database System Concepts

60JYP

Select OperationSelect Operation

Notation: p(r)

Defined as:

p(r) = {t | t r and p(t)}

Where P is a formula in propositional calculus, dealing with terms of the form:

<attribute> = <attribute> or <constant>

>

<

“connected by” : (and), (or), (not)

Page 61: Database System Concepts

61JYP

Select Operation – ExampleSelect Operation – Example

• Relation r A B C D

15

1223

773

10

A=B ^ D > 5 (r)A B C D

123

710

Page 62: Database System Concepts

62JYP

Project OperationProject Operation

Notation:

A1, A2, …, Ak (r)

where A1, A2 … Ak are attribute names and r is a relation name.

The result is defined as the relation of k columns obtained by erasing the columns that are not listed

Duplicate rows removed from result, since relations are sets

Page 63: Database System Concepts

63JYP

Project Operation – ExampleProject Operation – Example

Relation r: A B C

10203040

1112

A C

1112

=

A C

112

)( , rCA

Page 64: Database System Concepts

64JYP

Union OperationUnion Operation

Notation: r s

Defined as:

r s = {t | t r or t s}

For r s to be valid.

1. r, s must have the same number of attributes

2. The attribute domains must be compatible (e.g., 2nd column of r deals with the same type of values as does the 2nd column of s)

Page 65: Database System Concepts

65JYP

Union Operation – ExampleUnion Operation – Example

Relations r, s:

r s:

A B

121

A B

23

rs

A B

1213

Page 66: Database System Concepts

66JYP

Set Difference OperationSet Difference Operation

Notation r – s

Defined as:

r – s = {t | t r and t s}

Set differences must be taken between compatible relations. r and s must have the same number of attributes

attribute domains of r and s must be compatible

Page 67: Database System Concepts

67JYP

Set Difference Operation – ExampleSet Difference Operation – Example

Relations r, s:

r – s:

A B

121

A B

23

rs

A B

11

Page 68: Database System Concepts

68JYP

Cartesian-Product OperationCartesian-Product Operation

Notation r x s

Defined as:

r x s = {t q| t r and q s}

Assume that attributes of r(R) and s(S) are disjoint. (That is, R S = ).

If attributes of r(R) and s(S) are not disjoint, then renaming must be used.

Page 69: Database System Concepts

69JYP

Cartesian-Product Operation – Cartesian-Product Operation – ExampleExample

Relations r, s:

r x s:

A B

12

A B

11112222

C D

1019201010102010

E

++––++––

C D

10102010

E

++––r

s

Page 70: Database System Concepts

70JYP

Composition of OperationsComposition of Operations

Can build expressions using multiple operations

Example: A=C(r x s)

Notation: r s (natural join) Let r and s be relations on schemas R and S respectively. The

result is a relation on schema R S which is obtained by considering each pair of tuples tr from r and ts from s.

If tr and ts have the same value on each of the attributes in R S, a tuple t is added to the result, where

t has the same value as tr on r

t has the same value as ts on s

Page 71: Database System Concepts

71JYP

Composition of Operations (Cont.)Composition of Operations (Cont.)

Example:

R = (A, B, C, D)

S = (E, B, D)

Result schema = (A, B, C, D, E)

r s is defined as:

r.A, r.B, r.C ,r.D, s.E (c.B=s.B ^ r.D=s.D(r x s))

Page 72: Database System Concepts

72JYP

Natural Join Operation – ExampleNatural Join Operation – Example

Relations r, s:

A B

12412

C D

aabab

B

13123

D

aaabb

E

r

A B

11112

C D

aaaab

E

s

r s

Page 73: Database System Concepts

73JYP

Banking ExampleBanking Example

branch (branch-name, branch-city, assets)

customer (customer-name, customer-street, customer-city)

account (branch-name, account-number, balance)

loan (branch-name, loan-number, amount)

depositor (customer-name, account-number)

borrower (customer-name, loan-number)

Page 74: Database System Concepts

74JYP

Extended Relational-Algebra-OperationsExtended Relational-Algebra-Operations

Generalized Projection

Outer Join

Aggregate Functions

Page 75: Database System Concepts

75JYP

Generalized ProjectionGeneralized Projection

Extends the projection operation by allowing arithmetic functions to be used in the projection list.

P F1, F2, …, Fn(E)

E is any relational-algebra expression

Each of F1, F2, …, Fn are arithmetic expressions involving constants and attributes in the schema of E.

Given relation credit-info(customer-name, limit, credit-balance), find how much more each person can spend:

P customer-name, limit – credit-balance (credit-info)

Page 76: Database System Concepts

76JYP

Outer JoinOuter Join

An extension of the join operation that avoids loss of information.

Computes the join and then adds tuples form one relation that does not match tuples in the other relation to the result of the join.

Uses null values: null signifies that the value is unknown or does not exist

All comparisons involving null are false by definition.

Page 77: Database System Concepts

77JYP

Outer Join – ExampleOuter Join – Example

Relation loan

branch-name loan-number amount

DowntownRedwoodPerryridge

L-170L-230L-260

300040001700

Relation borrower

customer-name loan-number

JonesSmithHayes

L-170L-230L-155

Page 78: Database System Concepts

78JYP

Outer Join – ExampleOuter Join – Example

loan borrower

branch-name loan-number amount

DowntownRedwood

L-170L-230

30004000

customer-name

JonesSmith

branch-name loan-number amount

DowntownRedwoodPerryridge

L-170L-230L-260

300040001700

customer-name

JonesSmithnull

loan borrower

Page 79: Database System Concepts

79JYP

Outer Join – ExampleOuter Join – Example

loan borrower

branch-name loan-number amount

DowntownRedwoodnull

L-170L-230L-155

30004000null

customer-name

JonesSmithHayes

branch-name loan-number amount

DowntownRedwoodPerryridgenull

L-170L-230L-260L-155

300040001700null

customer-name

JonesSmithnullHayes

loan borrower

Page 80: Database System Concepts

80JYP

Aggregate FunctionsAggregate Functions

Aggregation operator takes a collection of values and returns a single value as a result.

avg: average valuemin: minimum valuemax: maximum valuesum: sum of valuescount: number of values

G1, G2, …, Gn F1, A1, F2, A2, …, Fm, Am (E)

E is any relational-algebra expression

G1, G2 …, Gn is a list of attributes on which to group

Fi is an aggregate function

Ai is an attribute name

Page 81: Database System Concepts

81JYP

Aggregate Function – ExampleAggregate Function – Example

Relation r:

A B

C

773

10

sumC(r)sum-C

27

Page 82: Database System Concepts

82JYP

Aggregate Function – ExampleAggregate Function – Example

Relation account grouped by branch-name:

branch-name sum balance (account)

branch-name account-number balance

PerryridgePerryridgeBrightonBrightonRedwood

A-102A-201A-217A-215A-222

400900750750700

branch-name sum of balance

PerryridgeBrightonRedwood

13001500700

Page 83: Database System Concepts

83JYP

Modification of the DatabaseModification of the Database

The content of the database may be modified using the following operations: Deletion

Insertion

Updating

All these operations are expressed using the assignment operator().

Page 84: Database System Concepts

84JYP

DeletionDeletion

A delete request is expressed similarly to a query, except instead of displaying tuples to the user, the selected tuples are removed from the database.

Can delete only whole tuples; cannot delete values on only particular attributes

A deletion is expressed in relational algebra by:

r r – E

where r is a relation and E is a relational algebra query.

Page 85: Database System Concepts

85JYP

Deletion ExamplesDeletion Examples

Delete all account records in the Perryridge branch.

account account – branch-name = “Perryridge” (account)

Delete all loan records with amount in the range of 0 to 50

loan loan – amount 0and amount 50 (loan)

Delete all accounts at branches located in Needham.

r1 branch-city = “Needham” (account branch)

r2 branch-name, account-number, balance (r1)

r3 customer-name, account-number (r2 depositor)

account account – r2

depositor depositor – r3

Page 86: Database System Concepts

86JYP

InsertionInsertion

To insert data into a relation, we either: specify a tuple to be inserted

write a query whose result is a set of tuples to be inserted

in relational algebra, an insertion is expressed by:

r r E

where r is a relation and E is a relational algebra expression.

The insertion of a single tuple is expressed by letting E be a constant relation containing one tuple.

Page 87: Database System Concepts

87JYP

Insertion ExamplesInsertion Examples

Insert information in the database specifying that Smith has $1200 in account A-973 at the Perryridge branch.

account account {(“Perryridge”, A-973, 1200)}

depositor depositor {(“Smith”, A-973)}

Provide as a gift for all loan customers in the Perryridge branch, a $200 savings account. Let the loan number serve as the account number for the new savings account.

r1 (branch-name = “Perryridge” (borrower loan))

account account ( branch-name, loan-number (r1) x{(200)})

depositor depositor customer-name, loan-number (r1)

Page 88: Database System Concepts

88JYP

UpdatingUpdating

A mechanism to change a value in a tuple without changing all values in the tuple

Use the generalized projection operator to do this task

r F1, F2, …, Fn (r)

Each Fi, is either the ith attribute of r, if the ith attribute is not updated, or, if the attribute is to be updated Fi is an expression, involving only constants and the attributes of r, which gives the new value for the attribute

Page 89: Database System Concepts

89JYP

Update ExamplesUpdate Examples

Make interest payments by increasing all balances by 5 percent.

account branch-name, account-number, balance * 1.05 (account)

Pay all accounts with balances over $10,0006 percent interest and pay all others 5 percent

account branch-name, account-number, balance * 1.06 ( balance 10000 (account)) branch-name, account-number, balance * 1.05 ( balance 10000 (account))

Page 90: Database System Concepts

90JYP

ViewsViews

In some cases, it is not desirable for all users to see the entire logical model (i.e., all the actual relations stored in the database.)

Consider a person who needs to know a customer’s loan number and branch name but has no need to see the loan amount. This person should see a relation described, in the relational algebra, by

customer-name, loan-number, branch-name (borrower loan)

Any relation that is not of the conceptual model but is made visible to a user as a “virtual relation” is called a view.

Page 91: Database System Concepts

91JYP

View DefinitionView Definition

A view is defined using the create view statement which has the form

create view v as <query expression>

where <query expression> is any legal relational algebra query expression. The view name is represented by v.

Once a view is defined, the view name can be used to refer to the virtual relation that the view generates.

View definition is not the same as creating a new relation by evaluating the query expression. Rather, a view definition causes the saving of an expression to be substituted into queries using the view.

Page 92: Database System Concepts

92JYP

View ExamplesView Examples

Consider the view (named all-customer) consisting of branches and their customers.

create view all-customer as

branch-name, customer-name (depositor account)

branch-name, customer-name (borrower loan)

We can find all customers of the Perryridge branch by writing:

customer-name( branch-name = “Perryridge” (all-customer))

Page 93: Database System Concepts

93JYP

Updates Through ViewUpdates Through View

Database modifications expressed as views must be translated to modifications of the actual relations in the database.

Consider the person who needs to see all loan data in the loan relation except amount. The view given to the person, branch-loan, is defined as:

create view branch-loan as

branch-name, loan-number (loan) Since we allow a view name to appear wherever a relation name

is allowed, the person may write:

branch-loan branch-loan {(“Perryridge”, L-37)}

Page 94: Database System Concepts

94JYP

Updates Through Views (Cont.)Updates Through Views (Cont.)

The previous insertion must be represented by an insertion into the actual relation loan from which the view branch-loan is constructed.

An insertion into loan requires a value for amount. The insertion can be dealt with by either. rejecting the insertion and returning an error message to the user.

inserting a tuple (“Perryridge”, L-37, null) into the loan relation

Page 95: Database System Concepts

95JYP

Views Defined Using Other ViewsViews Defined Using Other Views

One view may be used in the expression defining another view

A view relation v1 is said to depend directly on a view relation v2 if v2 is used in the expression defining v1 , this is depicted by a directed edge < v2 v1 >

A view relation v1 is said to depend on view relation v2 , if there is a path from v2 to v1 .

A view relation v is said to be recursive if it depends on itself.

Page 96: Database System Concepts

96JYP

View ExpansionView Expansion

A way to define the meaning of views defined in terms of other views.

Let view v1 be defined by an expression e1 that may itself contain uses of view relations.

View expansion of an expression repeats the following replacement step:

repeatFind any view relation vi in e1

Replace the view relation vi by the expression defining vi until no more view relations are present in e1

As long as the view definitions are not recursive, this loop will terminate

Page 97: Database System Concepts

97JYP

Topic 4: SQLTopic 4: SQL

SQL: Structured Query Language for commercial DBMS---more user-friendly

Basic Structure Set Operations Aggregate Functions Null Values Nested Subqueries Derived Relations Views Modification of the Database Joined Relations Data Definition Language Embedded SQL

Page 98: Database System Concepts

98JYP

Basic Structure Basic Structure

SQL is based on set and relational operations with certain modifications and enhancements

A typical SQL query has the form:select A1, A2, ..., An

from r1, r2, ..., rm

where P

Ais represent attributes

ris represent relations

P is a predicate.

This query is equivalent to the relational algebra expression.

A1, A2, ..., An (P (r1 x r2 x ... x rm))

The result of an SQL query is a relation.

Page 99: Database System Concepts

99JYP

The select ClauseThe select Clause

The select clause corresponds to the projection operation of the relational algebra. It is used to list the attributes desired in the result of a query.

Q1:Find the names of all branches in the loan relation

select branch-namefrom loan

the query is equivalent to:

branch-name(loan)

An asterisk in the select clause denotes “all attributes”

select * from loan

Page 100: Database System Concepts

100JYP

The select Clause (Cont.)The select Clause (Cont.)

SQL allows duplicates in relations as well as in query results.

To force the elimination of duplicates, insert the keyword distinct after select.

Q2: Find the names of all branches in the loan relations, and remove duplicates

select distinct branch-namefrom loan

The keyword all specifies that duplicates not be removed.

select all branch-namefrom loan

Page 101: Database System Concepts

101JYP

The select Clause (Cont.)The select Clause (Cont.)

The select clause can contain arithmetic expressions involving the operation +, –, , and /, and operating on constants or attributes of tuples---generalized projection.

Q3:

select branch-name, loan-number, amount 100 from loan

would return a relation which is the same as the loan relations, except that the attribute amount is multiplied by 100.

Page 102: Database System Concepts

102JYP

The where ClauseThe where Clause

The where clause corresponds to the selection predicate of the relational algebra. It consists of a predicate involving attributes of the relations that appear in the from clause.

Q4: Find all loan number for loans made at the Perryridge branch with loan amounts greater than $1300.

select loan-numberfrom loanwhere branch-name = “ Perryridge” and amount >

1300

SQL uses the logical connectives and, or, and not. It allows the use of arithmetic expressions as operands to the comparison operators.

Page 103: Database System Concepts

103JYP

The where Clause (Cont.)The where Clause (Cont.)

SQL Includes a between comparison operator in order to simplify where clauses that specify that a value be less than or equal to some value and greater than or equal to some other value.

Q5:Find the loan number of those loans with loan amounts between $900 and $1000 (that is, $900 and $1000)

select loan-numberfrom loanwhere amount between 900 and 1000

Page 104: Database System Concepts

104JYP

The from ClauseThe from Clause

The from clause corresponds to the Cartesian product operation of the relational algebra. It lists the relations to be scanned in the evaluation of the expression.

Q6:Find the Cartesian product borrower x loan

select from borrower, loan

Q7:Find the name and loan number of all customers having a loan at the Perryridge branch.

select distinct customer-name, borrower.loan-numberfrom borrower, loanwhere borrower.loan-number = loan.loan-number and

branch-name = “Perryridge”

Page 105: Database System Concepts

105JYP

The Rename OperationThe Rename Operation

The SQL mechanism for renaming attributes is accomplished through the as clause:

old-name as new-name

Q8: Find the name and loan number of all customers having a loan at the Perryridge branch; replace the column name loan-number with the name loan-id.

select distinct customer-name, borrower.loan-number as loan-idfrom borrower, loan

where borrower.loan-number = loan.loan-number and branch-name = “Perryridge”

Page 106: Database System Concepts

106JYP

Tuple VariablesTuple Variables Tuple variables are defined in the from clause via the use of the

as clause. Q9:Find the customer names, their loan numbers and amount for

all customers having a loan at some branch.

select distinct customer-name, T.loan-number, S.amount from borrower as T, loan as S where T.loan-number = S.loan-number

Q10:Find the names of all branches that have greater assets than some branch located in Brooklyn.

select distinct T.branch-namefrom branch as T, branch as SWhere T.assets > S.assets and S.branch-city = “Brooklyn”

Page 107: Database System Concepts

107JYP

String OperationsString Operations

SQL includes a string-matching operator for comparisons on character strings. Patterns are described using two special characters: percent (%). The % character matches any substring.

underscore (_). The _ character matches any character.

Find the names of all customers whose street includes the substring “Main”.

select customer-namefrom customerwhere customer-street like “%Main%”

Match the name “Main%”

like “Main\%” escape “\”

Note in ACCESS, “%” is replaced by “*”, see query X1.

Page 108: Database System Concepts

108JYP

Ordering the Display of TuplesOrdering the Display of Tuples

Q11:List in alphabetic order the names of all customers having a loan in Perryridge branch

select distinct customer-namefrom borrower, loanwhere borrower.loan-number = loan.loan-number and branch-name = “Perryridge”order by customer-name

We may specify desc for descending order or asc for ascending order, for each attribute; ascending order is the default.

SQL must perform a sort to fulfil an order by request. Since sorting a large number of tuples may be costly. It is desirable to sort only when necessary.

Page 109: Database System Concepts

109JYP

DuplicatesDuplicates

In relations with duplicates, SQL can define how many copies of tuples appear in the result.

Multiset versions of some of the relational algebra operators – given multiset relations r1 and r2:

1. If there are c1 copies of tuple t1 in r1, and t1 satisfies selection , then there are c1 copies of t1 in (r1).

2. For each copy of tuple t1 in r1, there is a copy of tuple A(t1) in A(r1) where A(t1) denotes the projection of the single tuple t1.

3. If there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2, there are c1 x c2 copies of the tuple t1. t2 in r1 x r2

Page 110: Database System Concepts

110JYP

Duplicates (Cont.)Duplicates (Cont.)

Suppose relations r1 with schema (A, B) and r2 with schema (C) are the following multisets:

r1 = {(1, a) (2,a)} r2 = {(2), (3), (3)}

Then B(r1) would be {(a), (a)}, while B(r1) x r2 would be

{(a,2), (a,2), (a,3), (a,3), (a,3), (a,3)}

SQL duplicate semantics:

select A1,, A2, ..., An

from r1, r1, ..., rm

where P

is equivalent to the multiset version of the expression:

A1,, A2, ..., An(P(r1 x r2 x ... x rm))

Page 111: Database System Concepts

111JYP

Set OperationsSet Operations

The set operations union, intersect, and except operate on relations and correspond to the relational algebra operations

Each of the above operations automatically eliminates duplicates; to retain all duplicates use the corresponding multiset versions union all, intersect all and except all. Suppose a tuple occurs m times in r and n times in s, then, it occurs:

m + n times in r union all s

min (m, n) times in r intersect all s

Page 112: Database System Concepts

112JYP

Set OperationsSet Operations

Q12: Find all customers who have a loan, an account, or both:

(select customer-name from depositor)union(select customer-name from borrower)

Find all customers who have both a loan and an account.

(select customer-name from depositor)intersect(select customer-name from borrower)

Find all customers who have an account but no loan.

select customer-name from depositor)except(select customer-name from borrower)

Page 113: Database System Concepts

113JYP

Aggregate FunctionsAggregate Functions

These functions operate on the multiset of values of a column of a relation, and return a value

avg: average valuemin: minimum valuemax: maximum valuesum: sum of valuescount: number of values

Page 114: Database System Concepts

114JYP

Aggregate Functions (Cont.)Aggregate Functions (Cont.)

Q13:Find the average account balance at the Brighton branch.

select avg (balance)from accountwhere branch-name = “Brighton”

Q14:Find the number of tuples in the customer relation.

select count (*)from customer

Find the number of depositors in the bank.

select count (distinct customer-name)

from depositor

Page 115: Database System Concepts

115JYP

Aggregate Functions – Group ByAggregate Functions – Group By

Find the number of depositors for each branch.

select branch-name, count (distinct customer-name)

from depositor, accountwhere depositor.account-number =

account.account-numbergroup by branch-name

Note: Attributes in select clause outside of aggregate functions must appear in group by list

Page 116: Database System Concepts

116JYP

Aggregate Functions – Having ClauseAggregate Functions – Having Clause

Q15: Find the names of all branches where the average account balance is more than $400.

select branch-name, avg (balance)from accountgroup by branch-namehaving avg (balance) > 400

Note: predicates in the having clause are applied after the formation of groups.

Page 117: Database System Concepts

117JYP

Null ValuesNull Values

It is possible for tuples to have a null value, denoted by null, for some of their attributes, null signifies an unknown value or that a value does not exist.

The result of any arithmetic expression involving null is null.

Roughly speaking, all comparisons involving null return false. More precisely. Any comparison with null returns unknown

(unknown or unknown) = unknown,(true and unknown) = unknown, (false and unknown) = false, (unknown and unknown) = unknown

Result of where clause predicate is treated as false if it evaluates to unknown

“P is unknown” evaluates to true if predicate P evaluates to unknown

Page 118: Database System Concepts

118JYP

Null Values (Cont.)Null Values (Cont.)

Find all loan number which appear in the loan relation with nulll values for amount.

select loan-numberfrom loanwhere amount is null

Total all loan amounts

select sum (amount)from loan

Above statement ignores null amounts; result is null if there is no non-null amount.

All aggregate operations except count(*) ignore tuples with null values on the aggregated attributes.

Page 119: Database System Concepts

119JYP

Nested SubqueriesNested Subqueries

SQL provides a mechanism for the nesting of subqueries.

A subquery is a select-from-where expression that is nested within another query.

A common use of subqueries is to perform tests for set membership, set comparisons, and set cardinality.

Page 120: Database System Concepts

120JYP

Set MembershipSet Membership

F in r t r (t = F)04

5

(5 in ) = true

04

6

(5 in ) = false

04

6

(5 not in ) = true

Page 121: Database System Concepts

121JYP

Example QueryExample Query

Q16:Find all customers who have both an account and a loan at the bank.

select distinct customer-namefrom borrowerwhere customer-name in (select customer-name

from depositor)

Q17:Find all customers who have a loan at the bank but do not have an account at the bank

select distinct customer-namefrom borrowerwhere customer-name not in (select customer-name

from depositor)

Page 122: Database System Concepts

122JYP

Example QueryExample Query

Find all customers who have both an account and a loan at the Perryridge branch.

select distinct customer-namefrom borrower, loanwhere borrower.loan-number = loan.loan-number and

branch-name = “Perryridge” and (branch-name, customer-name) in

(select branch-name, customer-namefrom depositor, accountwhere depositor.account-number =

account.account-number)

Page 123: Database System Concepts

123JYP

The Some ClauseThe Some Clause

F <comp> some r t(t r [F <comp> t])Where <comp> can be:

056

(5< some ) = true

05

0

) = false

5

05(5 some ) = true (since 0 5)

(read: 5 < some tuple in the relation)

(5< some

) = true(5 = some

(= some) inHowever, ( some) not in

Page 124: Database System Concepts

124JYP

Example QueryExample Query

Q18:Find all branches that have greater assets than some branch located in Brooklyn.

select branch-namefrom branchwhere assets >some

(select assetsfrom branchwhere branch-city = “Brooklyn”)

Page 125: Database System Concepts

125JYP

The All ClauseThe All Clause

F <comp> all r t(t r [F <comp> t])

056

(5< all ) = false

610

4

) = true

5

46(5 all ) = true (since 5 4 and 5 6)

(5< all

) = false(5 = all

( all) not inHowever, (= all) in

Page 126: Database System Concepts

126JYP

Example QueryExample Query

Q19:Find the names of all branches that have greater assets than all branches located in Horsenneck.

select branch-namefrom branchwhere assets > all

(select assetsfrom branchwhere branch-city = “Horsenneck”)

Page 127: Database System Concepts

127JYP

Test for Empty RelationsTest for Empty Relations

The exists construct returns the value true if the argument subquery is nonempty.

exists r r Ø

not exists r r = Ø

Page 128: Database System Concepts

128JYP

Example QueryExample Query

Find all customers who have an account at all branches located in Brooklyn.

select distinct S.customer-namefrom depositor as Swhere not exists (

(select branch-namefrom branchwhere branch-city = “Brooklyn”)

except(select R.branch-namefrom depositor as T, account as Rwhere T.account-number = R.account-number and

S.customer-name=T.customer-name))

Note that X – Y = Ø X Y

Page 129: Database System Concepts

129JYP

Test for Absence of Duplicate TuplesTest for Absence of Duplicate Tuples

The unique construct tests whether a subquery has any duplicate tuples in its result.

Q20:Find all customers who have only one account at the Perryridge branch.

select T.customer-namefrom depositor as Twhere unique (

select R.customer-namefrom account, depositor as Rwhere T.customer-name = R.customer-name and

R.account-number = account.account-number

andaccount.branch-name = “Perryridge”)

Note: in ACCESS, it does not work when the subquery return more than one tuple.

Page 130: Database System Concepts

130JYP

Example QueryExample Query

Find all customers who have at least two accounts at the Perryridge branch.

select distinct T.customer-namefrom depositor Twhere not unique (

select R.customer-namefrom account, depositor as Rwhere T.customer-name = R.customer-name and

R.account-number - account.account-number andaccount.branch-name = “Perryridge”)

Page 131: Database System Concepts

131JYP

Derived RelationsDerived Relations

Find the average account balance of those branches where the average account balance is greater than $1200.

select branch-name, avg-balancefrom (select branch-name, avg (balance)

from accountgroup by branch-name)

as result (branch-name, avg-balance)where avg-balance > 1200

Note that we do not need to use the having clause, since we compute in the from clause the temporary relation result, and the attributes of result can be used directly in the where clause.

Page 132: Database System Concepts

132JYP

ViewsViews

Provide a mechanism to hide certain data from the view of certain users. To create a view we use the command:

create view v as <query expression>

where: <query expression> is any legal expression

The view name is represented by v

Page 133: Database System Concepts

133JYP

Example QueriesExample Queries

A view consisting of branches and their customerscreate view all-customer as

(select branch-name, customer-namefrom depositor, accountwhere depositor.account-number = account.account-number)

union(select branch-name, customer-namefrom borrower, loanwhere borrower.loan-number = loan.loan-number)

Find all customers of the Perryridge branch

select customer-namefrom all-customerwhere branch-name = “Perryridge”

Page 134: Database System Concepts

134JYP

Modification of the Database – DeletionModification of the Database – Deletion

Delete all account records at the Perryridge branch

delete from depositorwhere account-number in (select account-number

from account where branch-name =

“Perryridge”)

delete from accountwhere branch-name = “Perryridge”

Page 135: Database System Concepts

135JYP

Example DeletionExample Deletion

Delete all accounts at every branch located in Needham.

delete from depositorwhere account-number in (select account-number

from branch, accountwhere branch-city = “Needham”and branch.branch-name = account.branch-

name)

delete from accountwhere branch-name in (select branch-name

from branchwhere branch-city = “Needham”)

Page 136: Database System Concepts

136JYP

Example Deletion (Cont.)Example Deletion (Cont.)

Delete the record of all accounts with balances below the average at the bank.

delete from accountwhere balance < (select avg (balance)

from account) Problem: as we delete tuples from account, the average balance

changes

Solutions used in SQL:

1. First, compute avg balance and find all tuples to delete

2. Next, delete all tuples found above (without recomputing avg or retesting the tuples)

Page 137: Database System Concepts

137JYP

Modification of the Database – Modification of the Database – InsertionInsertion

Add a new tuple to account

insert into accountvalues (“Perryridge”, “A-9732”, 1200)

or equivalently

insert into account (branch-name, balance, account-number)values (“Perryridge”, 1200, “A-9732”)

Add a new tuple to account with balance set to null

insert into accountvalues (“Perryridge”, “A-777”, null)

Page 138: Database System Concepts

138JYP

Modification of the Database – Modification of the Database – InsertionInsertion

Provide as a gift for all loan customers of the Perryridge branch, a $200 savings account. Let the loan number serve as the account number for the new savings account

insert into accountselect branch-name, loan-number, 200from loanwhere branch-name = “Perryridge”

insert into depositorselect customer-name, loan-numberfrom loan, borrowerwhere branch-name = “Perryridge”and loan.account-number = borrower.account-number

Page 139: Database System Concepts

139JYP

Modification of the Database – UpdatesModification of the Database – Updates

Increase all accounts with balances over $10,000 by 6%, all other accounts receive 5%. Write two update statements:

update accountset balance = balance 1.06where balance > 10000

update accountset balance = balance 1.05where balance 10000

The order is important

Page 140: Database System Concepts

140JYP

Update of a ViewUpdate of a View

Create a view of all loan data in loan relation, hiding the amount attribute

create view branch-loan asselect branch-name, loan-numberfrom loan

Add a new tuple to branch-loan

insert into branch-loanvalues (“Perryridge”,” L-307”)

This insertion must be represented by the insertion of the tuple

(“Perryridge”, “L-307”, null)

into the loan relation

Updates on more complex views are difficult or impossible to translate, and hence are disallowed.

Page 141: Database System Concepts

141JYP

Joined RelationsJoined Relations

Join operations take two relations and return as a result another relation.

These additional operations are typically used as subquery expressions in the from clause.

Join condition – defines which tuples in the two relations match, and what attributes are present in the result of the join.

Join type – defines how tuples in each relation that do not match any tuple in the other relation (based on the join condition) are treated.

Join Types

inner joinleft outer joinright outer joinfull outer join

Join Conditions

naturalon <predicate>using (A1, A2, ..., An)

Page 142: Database System Concepts

142JYP

Joined Relations – Datasets for Joined Relations – Datasets for ExamplesExamples

Relation loan

branch-name loan-number amount

Downtown

Redwood

Perryridge

L-170

L-230

L-260

3000

4000

1700

Relation borrower

customer-name loan-number

Jones

Smith

Hayes

L-170

L-230

L-155

Page 143: Database System Concepts

143JYP

Joined Relations – Examples Joined Relations – Examples

Q21:loan inner join borrower onloan.loan-number = borrower.loan-number

branch-name loan-number amount

Downtown

Redwood

L-170

L-230

3000

4000

customer-name loan-number

Jones

Smith

L-170

L-230

branch-name loan-number amount

Downtown

Redwood

Perryridge

L-170

L-230

L-260

3000

4000

1700

customer-name loan-number

Jones

Smith

null

L-170

L-230

null

Q22: loan left outer join borrower onloan.loan-number = borrower.loan-number

Page 144: Database System Concepts

144JYP

Joined Relations – ExamplesJoined Relations – Examples

loan natural inner join borrower

branch-name loan-number amount

Downtown

Redwood

L-170

L-230

3000

4000

customer-name

Jones

Smith

loan natural right outer join borrower

branch-name loan-number amount

Downtown

Redwood

null

L-170

L-230

L-155

3000

4000

null

customer-name

Jones

Smith

Hayes

Page 145: Database System Concepts

145JYP

Joined Relations – ExamplesJoined Relations – Examples

loan natural full outer join borrower using (loan-number)

branch-name loan-number amount

Downtown

Redwood

Perryridge

null

L-170

L-230

L-260

L-155

3000

4000

1700

null

customer-name

Jones

Smith

null

Hayes

Find all customers who have either an account or a loan (but not both) at the bank.

select customer-namefrom (depositor natural full outer join borrower)where account-number is null or loan-number is null

Page 146: Database System Concepts

146JYP

Data Definition Language (DDL)Data Definition Language (DDL)

The schema for each relation.

The domain of values associated with each attribute.

Integrity constraints

The set of indices to be maintained for each relations.

Security and authorization information for each relation.

The physical storage structure of each relation on disk.

Allows the specification of not only a set of relations but also information about each relation, including:

Page 147: Database System Concepts

147JYP

Domain Types in SQLDomain Types in SQL

char(n). Fixed length character string, with user-specified length n.

varchar(n). Variable length character strings, with user-specified maximum length n.

integer. Integer (a finite subset of the integers that is machine-dependent).

smallint. Small integer (a machine-dependent subset of the integer domain type).

numeric(p,d). Fixed point number, with user-specified precision of p digits, with d digits to the right of decimal point.

Page 148: Database System Concepts

148JYP

Domain Types in SQL (Cont.)Domain Types in SQL (Cont.)

real, double precision. Floating point and double-precision floating point numbers, with machine-dependent precision.

float(n). Floating point number, with user-specified precision of at least n digits.

date. Dates, containing a (4 digit) year, month and date.

time. Time of day, in hours, minutes and seconds. Null values are allowed in all the domain types. Declaring an

attribute to be not null prohibits null values for that attribute.

create domain construct in SQL-92 creates user-defined domain types

create domain person-name char(20) not null

Page 149: Database System Concepts

149JYP

Create Table ConstructCreate Table Construct

An SQL relation is defined using the create table command:

create table r (A1 D1, A2 D2, ..., An Dn,(integrity-constraint1),...,(integrity-constraintk))

r is the name of the relation

each Ai is an attribute name in the schema of relation r

Di is the data type of values in the domain of attribute Ai

Example:

create table branch(branch-name char(15) not null,branch-city char(30),assets integer)

Page 150: Database System Concepts

150JYP

Integrity Constraints in Create TableIntegrity Constraints in Create Table

not null

primary key (A1, ..., An)

check (P) , where P is a predicate

Example: Declare branch-name as the primary key for branch and ensure that the values of assets are non-negative.

create table branch(branch-name char(15) not null,branch-city char(30),assets integer,primary key (branch-name),check (assets >=0))

primary key declaration on an attribute automatically ensures not null in SQL-92

Page 151: Database System Concepts

151JYP

Drop and Alter Table ConstructsDrop and Alter Table Constructs

The drop table command deletes all information about the dropped relation from the database.

The alter table command is used to add attributes to an existing relation. All tuples in the relation are assigned null as the value for the new attribute. The form of the alter table command is

alter table r add A D

where A is the name of the attribute to be added to relation r and D is the domain of A.

The alter table command can also be used to drop attributes of a relation

alter table r drop Awhen A is the name of an attribute of relation r.

Page 152: Database System Concepts

152JYP

Embedded SQLEmbedded SQL

The SQL standard defines embeddings of SQL in a variety of programming languages such as Pascal, PL/I, Fortran, C, and Cobol.

A language to which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL.

The basic form of these languages follows that of the system R embedding of SQL into PL/I.

EXEC SQL statement is used to identify embedded SQL request to the preprocessor

EXEC SQL <embedded SQL statement > END-EXEC

Page 153: Database System Concepts

153JYP

Example QueryExample Query

Specify the query in SQL and declare a cursor for it

EXEC SQL

declare c cursor for select customer-name, account-numberfrom depositor, accountwhere depositor account-number = account.account-number

and account.balance > :amount

END-EXEC

From within a host language, find the names and account numbers of customers with more than the variable amount dollars in some account.

Page 154: Database System Concepts

154JYP

Embedded SQL (Cont.)Embedded SQL (Cont.)

The open statement causes the query to be evaluated

EXEC SQL open c END-EXEC

The fetch statement causes the values of one tuple in the query result to be placed on host language variables.

EXEC SQL fetch c into :cn , :an END-EXECRepeated calls to fetch get successive tuples in the query result; a variable in the SQL communication area indicates when end-of-file is reached.

The close statement causes the database system to delete the temporary relation that holds the result of the query.

EXEC SQL close c END-EXEC

Page 155: Database System Concepts

155JYP

Dynamic SQLDynamic SQL

Allows programs to construct and submit SQL queries at run time.

Example of the use of dynamic SQL from within a C program.

char * sqlprog = “ update account set balance = balance * 1.05 where account-number = ? ”EXEC SQL prepare dynprog from :sqlprog;char account [10] = “A-101”;EXEC SQL execute dynprog using :account;

The dynamic SQL program contains a ?, which is a place holder for a value that is provided when the SQL program is executed.

Page 156: Database System Concepts

156JYP

Other SQL FeaturesOther SQL Features

Fourth-generation languages – special language to assist application programmers in creating templates on the screen for a user interface, and in formatting data for report generation, available in most commercial database products

SQL sessions – provide the abstraction of a client and a server (possibly remote) client connects to an SQL server, establishing a session

executes a series of statements

disconnects the session

can commit or rollback the work carried out in the session

Page 157: Database System Concepts

157JYP

An SQL environment contains several components, including a user identifier, and a schema, which identifies which of several schemas a session is using.

the instructions appear in each individual transaction.

serializable schedules to be generated, while others allow view-serializable schedules that are not conflict-serializable.

Page 158: Database System Concepts

158JYP

Topic 5: Integrity ConstraintsTopic 5: Integrity Constraints

Domain Constraints

Referential Integrity

Assertions

Triggers

Functional Dependencies

Page 159: Database System Concepts

159JYP

Domain ConstraintsDomain Constraints

Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency.

Domain constraints are the most elementary form of integrity constraint.

They test values inserted in the database, and test queries to ensure that the comparisons make sense.

Page 160: Database System Concepts

160JYP

Domain Constraints (Cont.)Domain Constraints (Cont.)

The check clause in SQL-92 permits domains to be restricted: Use check clause to ensure that an hourly-wage domain allows only

values greater than a specified value.

create domain hourly-wage numeric(5,2)constraint value-test check(value > = 4.00)

The domain hourly-wage is declared to be a decimal number with 5 digits, 2 of which are after the decimal point

The domain has a constraint that ensures that the hourly-wage is greater than or equal to 4.00

The clause constraint value-test is optional; useful to indicate which constraint an update violated.

Page 161: Database System Concepts

161JYP

Referential IntegrityReferential Integrity

Ensures that a value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another relation. Example: If “Perryridge” is a branch name appearing in one of the

tuples in the account relation, then there exists a tuple in the branch relation for branch “Perryridge”.

Formal Definition

Let r1(R1) and r2(R2) be relations with primary keys K1 and K2 respectively.

The subset of R2 is a foreign key referencing K1 in relation r1, if for every t2 in r2 there must be a tuple t1 in r1 such that t1[K1] = t2[].

Referential integrity constraint: (r2) K1 (r1)

Page 162: Database System Concepts

162JYP

referential Integrity in the E-R Modelreferential Integrity in the E-R Model

Consider relationship set R between entity sets E1 and E2. The relational schema for R includes the primary keys K1 of E1 and K2 of E2.Then K1 and K2 form foreign keys on the relational schemas for E1 and E2 respectively.

Weak entity sets are also a source of referential integrity constraints. For the relation schema for a weak entity set must include the primary key of the entity set on which it depends.

Page 163: Database System Concepts

163JYP

Database ModificationDatabase Modification

The following tests must be made in order to preserve the following referential integrity constraint:

(r2) K1 (r1)

Insert. If a tuple t2 is inserted into r2, the system must ensure that there is a tuple t1 in r1 such that t1[K1] = t2[]. That is

t2 [] K1 (r1)

Delete. If a tuple, t1 is deleted from r1, the system must compute the set of tuples in r2 that reference t1:

= t1[K1] (r2)

If this set is not empty, either the delete command is rejected as an error, or the tuples that reference t1 must themselves be deleted (cascading deletions are possible).

Page 164: Database System Concepts

164JYP

Database Modification (Cont.)Database Modification (Cont.)

Update. There are two cases:

If a tuple t2 is updated in relation r2 and the update modifies values

for foreign key , then a test similar to the insert case is made. Let t2’ denote the new value of tuple t2. The system must ensure that

t2’[] K1(r1)

If a tuple t1 is updated in r1, and the update modifies values for the

primary key (K1), then a test similar to the delete case is made. The

system must compute

= t1[K1] (r2)

using the old value of t1 (the value before the update is applied). If

this set is not empty, the update may be rejected as an error, or the update may be cascaded to the tuples in the set, or the tuples in the set may be deleted.

Page 165: Database System Concepts

165JYP

Referential Integrity in SQLReferential Integrity in SQL

Primary and candidate keys and foreign keys can be specified as part of the SQL create table statement: The primary key clause of the create table statement includes a

list of the attributes that comprise the primary key.

The unique key clause of the create table statement includes a list of the attributes that comprise a candidate key.

The foreign key clause of the create table statement includes both a list of the attributes that comprise the foreign key and the name of the relation referenced by the foreign key.

Page 166: Database System Concepts

166JYP

Referential Integrity in SQL – ExampleReferential Integrity in SQL – Example

create table customer(customer-name char(20) not null,customer-street char(30),customer-city char(30),primary key (customer-name))

create table branch(branch-name char(15) not null,branch-city char(30),assets integer,primary key (branch-name))

Page 167: Database System Concepts

167JYP

Referential Integrity in SQL – Example (Cont.)Referential Integrity in SQL – Example (Cont.)

create table account(branch-name char(15),account-number char(10) not null,balance integer,primary key (account-number), foreign key (branch-name) references branch)

create table depositor(customer-name char(20) not null,account-number char(10) not null,primary key (customer-name, account-number),foreign key (account-number) references account,foreign key (customer-name) references customer)

Page 168: Database System Concepts

168JYP

Cascading Actions in SQLCascading Actions in SQL

create table account

. . .foreign key(branch-name) references branch

on delete cascadeon update cascade.

. . . )

Due to the on delete cascade clauses, if a delete of a tuple in branch results in referential-integrity constraint violation, the delete “cascades” to the account relation, deleting the tuple that refers to the branch that was deleted.

Cascading updates are similar.

Page 169: Database System Concepts

169JYP

Cascading Actions in SQL (Cont.)Cascading Actions in SQL (Cont.)

If there is a chain of foreign-key dependencies across multiple relations, with on delete cascade specified for each dependency, a deletion or update at one end of the chain can propagate across the entire chain.

If a cascading update to delete causes a constraint violation that cannot be handled by a further cascading operation, the system aborts the transaction. As a result, all the changes caused by the transaction and its cascading actions are undone.

Page 170: Database System Concepts

170JYP

AssertionsAssertions

An assertion is a predicate expressing a condition that we wish the database always to satisfy.

An assertion in SQL-92 takes the form

create assertion <assertion-name> check <predicate>

When an assertion is made, the system tests it for validity. This testing may introduce a significant amount of overhead; hence assertions should be used with great care.

Page 171: Database System Concepts

171JYP

Assertion ExampleAssertion Example

The sum of all loan amounts for each branch must be less than the sum of all account balances at the branch.

create assertion sum-constraint check(not exists (select * from branch

where (select sum(amount) from loanwhere loan.branch-name = branch.branch-name)

>=(select sum(balance) from accountwhere account.branch-name = branch.branch-name)))

Page 172: Database System Concepts

172JYP

Assertion ExampleAssertion Example Every loan has at least one customer who maintains an account

with a minimum balance of $1000.

create assertion balance-constraint check

(not exists (select * from loan

where not exists ( select *

from borrower, depositor, account

where loan.loan-number = borrower.loan-number

and borrower.customer-name = depositor.customer-name

and depositor.account-number = account.account-number

and account.balance >=1000)))

Note if a loan has no customer who maintains an account with a

minimum balance of $1000, the sub-query will evaluate to .

Page 173: Database System Concepts

173JYP

Assertion ExampleAssertion Example All borrower of every loan should maintain an account with a

minimum balance of $1000.

create assertion balance-constraint check

(not exists (select * from loan

where exists ( select *

from borrower, depositor, account

where loan.loan-number = borrower.loan-number

and borrower.customer-name = depositor.customer-name

and depositor.account-number = account.account-number

and account.balance <1000))

and not exists ( select * from loan

where not exists ( select *

from borrower, depositor

where borrower.customer-name = depositor.customer-name

and borrower.loan-number = loan.loan-number)))

Page 174: Database System Concepts

174JYP

TriggersTriggers

A trigger is a statement that is executed automatically by the system as a side effect of a modification to the database.

To design a trigger mechanism, we must: Specify the conditions under which the trigger is to be executed.

Specify the actions to be taken when the trigger executes.

The SQL-92 standard does not include trigger, but many implementations support triggers.

Page 175: Database System Concepts

175JYP

Trigger ExampleTrigger Example

Suppose that instead of allowing negative account balances, the bank deals with overdrafts by setting the account balance to zero

creating a loan in the amount of the overdraft

giving this loan a loan number identical to the account number of the overdrawn account

The condition for executing the trigger is an update to the account relation that results in a negative balance value.

Page 176: Database System Concepts

176JYP

Trigger Example (Cont.)Trigger Example (Cont.)

define trigger overdraft on update of account T(if new T.balance < 0then (insert into loan values

(T.branch-name, T.account-number, –new T.balance)insert into borrower

(select customer-name, account-numberfrom depositorwhere T.account-number = depositor account-

number)update account Sset S.balance = 0where S.account-number = T.Account-number))

The keyword new used before T.balance indicates that the value of T.balance after the update should be used; if it is omitted, the value before the update is used.

Page 177: Database System Concepts

177JYP

Functional DependenciesFunctional Dependencies

Constraints on the set of legal relations.

Require that the value for a certain set of attributes determines uniquely the value for another set of attributes.

A functional dependency is a generalization of the notion of a key.

Page 178: Database System Concepts

178JYP

Functional Dependencies (Cont.)Functional Dependencies (Cont.)

Let R be a relation schema

R R

The functional dependency

holds on R if and only if for any legal relations r(R), whenever any two tuples t1 and t2 of r agree on the attributes , they also agree on the attributes . That is,

t1[] = t2 [] t1[] = t2 []

K is a superkey for relation schema R if and only if K R

K is a candidate key for R if and only if K R, and

There is no K, R

Page 179: Database System Concepts

179JYP

Functional Dependencies (Cont.)Functional Dependencies (Cont.)

Functional dependencies allow us to express constraints that cannot be expressed using superkeys. Consider the schema:

Loan-info-schema = (branch-name, loan-number,customer-name, amount).

We expect this set of functional dependencies to hold:

loan-number amountloan-number branch-name

but would not expect the following to hold:

loan-number customer-name

Page 180: Database System Concepts

180JYP

Use of Functional DependenciesUse of Functional Dependencies

We use functional dependencies to: test relations to see if they are legal under a given set of functional

dependencies. If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.

specify constraints on the set of legal relations; we say that F holds on R if all legal relations on R satisfy the set of functional dependencies F.

Note: A specific instance of a relation schema may satisfy a functional dependency even if the functional dependency does not hold on all legal instances. For example, a specific instance of Loan-schema may, by chance, satisfy loan-number customer-name.

Page 181: Database System Concepts

181JYP

Topic 6: TransactionsTopic 6: Transactions

Transaction Concept

Transaction State

Implementation of Atomicity and Durability

Concurrent Executions

Serializability

Recoverability

Implementation of Isolation

Transaction Definition in SQL

Testing for Serializability.

Page 182: Database System Concepts

182JYP

Transaction ConceptTransaction Concept

A transaction is a unit of program execution that accesses and possibly updates various data items.

A transaction must see a consistent database.

During transaction execution the database may be inconsistent.

When the transaction is committed, the database must be consistent.

Two main issues to deal with: Failures of various kinds, such as hardware failures and

system crashes

Concurrent execution of multiple transactions

Page 183: Database System Concepts

183JYP

ACID PropertiesACID Properties

Atomicity. Either all operations of the transaction are properly reflected in the database or none are.

Consistency. Execution of a transaction in isolation preserves the consistency of the database.

Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions. That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.

Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

To preserve integrity of data, the database system must ensure:

Page 184: Database System Concepts

184JYP

Example of Fund TransferExample of Fund Transfer

Transaction to transfer $50 from account A to account B:

1. read(A)

2. A := A – 50

3. write(A)

4. read(B)

5. B := B + 50

6. write(B)

Consistency requirement — the sum of A and B is unchanged by the execution of the transaction.

Atomicity requirement — if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, otherwise an inconsistency will result.

Page 185: Database System Concepts

185JYP

Example of Fund Transfer (Cont.)Example of Fund Transfer (Cont.)

Durability requirement — once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the database by the transaction must persist despite failures.

Isolation requirement — if between steps 3 and 6, another transaction is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).Can be ensured trivially by running transactions serially, that is one after the other. However, executing multiple transactions concurrently has significant benefits, as we will see.

Page 186: Database System Concepts

186JYP

Transaction StateTransaction State

Active, the initial state; the transaction stays in this state while it is executing

Partially committed, after the final statement has been executed.

Failed, after the discovery that normal execution can no longer proceed.

Aborted, after the transaction has been rolled back and the database restored to its state prior to the start of the transaction.

Committed, after successful completion.

Page 187: Database System Concepts

187JYP

Transaction State (Cont.)Transaction State (Cont.)

Page 188: Database System Concepts

188JYP

Implementation of Atomicity and Implementation of Atomicity and DurabilityDurability

The recovery-management component of a database system implements the support for atomicity and durability.

Page 189: Database System Concepts

189JYP

Concurrent ExecutionsConcurrent Executions

Multiple transactions are allowed to run concurrently in the system. Advantages are: increased processor and disk utilization, leading to better

transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk

reduced average response time for transactions: short transactions need not wait behind long ones.

Concurrency control schemes – mechanisms to control the interaction among the concurrent transactions in order to prevent them from destroying the consistency of the database

Page 190: Database System Concepts

190JYP

SchedulesSchedules

Schedules – sequences that indicate the chronological order in which instructions of concurrent transactions are executed a schedule for a set of transactions must consist of all instructions of

those transactions

must preserve the order in which the instructions appear in each individual transaction.

Page 191: Database System Concepts

191JYP

Example SchedulesExample Schedules Let T1 transfer $50 from A to B, and T2 transfer 10% of

the balance from A to B. The following is a serial schedule (Schedule 1), in which T1 is followed by T2.

T1 T2

read(A)A := A – 50write(A)read(B)B := B + 50write(B)

read(A)temp := A*0.1;A := A – tempwrite(A)read(B)B := B + tempwrite(B)

Page 192: Database System Concepts

192JYP

Let T1 and T2 be the transactions defined previously. The following schedule (Schedule 3) is not a serial schedule, but it is equivalent to Schedule 1.

T1 T2

read(A)A := A – 50write(A)

read (A)temp := A*0.1A = A – tempwrite(A)

read(B)B := B + 50write(B)

read(B)B := B + tempwrite(B)

In both Schedule 1 and 3, the sum A + B is preserved.

Page 193: Database System Concepts

193JYP

Example Schedules (Cont.)Example Schedules (Cont.) The following concurrent schedule (Schedule 4) does not preserve the value of the sum A + B.

T1 T2

read(A)A := A – 50

read (A)temp := A*0.1

A = A – tempwrite(A)read(B)

write(A)read(B)B := B + 50write(B)

B := B + tempwrite(B)

A=90 A=40

temp=4 A=36

B=80

A=40

B=130

B=84

Page 194: Database System Concepts

194JYP

SerializabilitySerializability

Basic Assumption – Each transaction preserves database consistency.

Thus serial execution of a set of transactions preserves database consistency.

A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:

1. conflict serializability

2. view serializability

We ignore operations other than read and write instructions, and we assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes. Our simplified schedules consist of only read and write instructions.

Page 195: Database System Concepts

195JYP

Conflict SerializabilityConflict Serializability

Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.

1. li = read(Q), lj = read(Q). li and lj don’t conflict.2. li = read(Q), lj = write(Q). They conflict.3. li = write(Q), lj = read(Q). They conflict4. li = write(Q), lj = write(Q). They conflict

Intuitively, a conflict between li and lj forces a (logical) temporal order between them. If li and lj are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule.

Page 196: Database System Concepts

196JYP

Conflict Serializability (Cont.)Conflict Serializability (Cont.)

If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.

We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule

Example of a schedule that is not conflict serializable:

T3 T4

read(Q)write(Q)

write(Q)

We are unable to swap instructions in the above schedule to obtain either the serial schedule < T3, T4 >, or the serial schedule < T4, T3 >.

Page 197: Database System Concepts

197JYP

Conflict Serializability (Cont.)Conflict Serializability (Cont.)

Schedule 3 below can be transformed into Schedule 1, a serial schedule where T2 follows T1, by series of swaps of non-conflicting instructions. Therefore Schedule 3 is conflict serializable.

T1 T2

read(A)write(A)

read (A)write(A)

read (B) write(B)

read (B) write(B)

Page 198: Database System Concepts

198JYP

View Serializability*View Serializability* Let S and S´ be two schedules with the same set of transactions. S

and S´ are view equivalent if the following three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.

2. For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj.

3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.

As can be seen, view equivalence is also based purely on reads and writes alone.

Page 199: Database System Concepts

199JYP

View Serializability *(Cont.)View Serializability *(Cont.)

A schedule S is view serializable if it is view equivalent to a serial schedule.

Every conflict serializable schedule is also view serializable.

Schedule 9 — a schedule which is view-serializable but not conflict serializable.

T3 T4 T6

read(Q)write(Q)

write(Q)write (Q)

Every view serializable schedule which is not conflict serializable has bind writes ( no read before write).

Page 200: Database System Concepts

200JYP

RecoverabilityRecoverability

Recoverable schedule — if a transaction Tj reads a data items previously written by a transaction Ti , the commit operation of Ti appears before the commit operation of Tj.

The following schedule (Schedule 11) is not recoverable if T9 commits immediately after the read

T8 T9

read(A)write(A)

read(A) read(B)

If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence database must ensure that schedules are recoverable.

Need to address the effect of transaction failures on concurrently running transactions.

Page 201: Database System Concepts

201JYP

Recoverability (Cont.)Recoverability (Cont.)

Cascading rollback – a single transaction failure leads to a series of transaction rollbacks. Consider the following schedule where none of the transactions has yet committed (so the schedule is recoverable)

T10 T11 T12

read(A)read(B)write(A)

read(A) write(A)

read(A)

If T10 fails, T11 and T12 must also be rolled back.

Can lead to the undoing of a significant amount of work

Page 202: Database System Concepts

202JYP

Recoverability (Cont.)Recoverability (Cont.)

Cascadeless schedules — cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.

Every cascadeless schedule is also recoverable

It is desirable to restrict the schedules to those that are cascadeless

Page 203: Database System Concepts

203JYP

Implementation of IsolationImplementation of Isolation

Schedules must be conflict or view serializable, and recoverable, for the sake of database consistency, and preferably cascadeless.

A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency.

Concurrency-control schemes tradeoff between the amount of concurrency they allow and the amount of overhead that they incur.

Some schemes allow only conflict-serializable schedules to be generated, while others allow view-serializable schedules that are not conflict-serializable.

Page 204: Database System Concepts

204JYP

Transaction Definition in SQLTransaction Definition in SQL

Data manipulation language must include a construct for specifying the set of actions that comprise a transaction.

In SQL, a transaction begins implicitly.

A transaction in SQL ends by: Commit work commits current transaction and begins a new one.

Rollback work causes current transaction to abort.

Levels of consistency specified by SQL-92: Serializable — default

Repeatable read

Read committed

Read uncommitted

Page 205: Database System Concepts

205JYP

Levels of Consistency in SQL-92Levels of Consistency in SQL-92

Serializable — default

Repeatable read —only committed records to be read, repeated reads of same record must return same value. However, a transaction may not be serializable – it may find some records inserted by a transaction but not find others that have not yet been commited.

Read committed — only committed records can be read, but successive reads of record may return different (but committed) values.

Read uncommitted — even uncommitted records may be read.

Lower degrees of consistency useful for gathering approximateinformation about the database, e.g., statistics for query optimizer.

Page 206: Database System Concepts

206JYP

Testing for SerializabilityTesting for Serializability

Consider some schedule of a set of transactions T1, T2, ..., Tn

Precedence graph — a direct graph where the vertices are the transactions (names).

We draw an arc from Ti to Tj if the two transaction conflict, and Ti accessed the data item on which the conflict arose earlier.

We may label the arc by the item that was accessed.

Example 1

x

y

Page 207: Database System Concepts

207JYP

Example Schedule (Schedule A)Example Schedule (Schedule A)T1 T2 T3 T4 T5

read(X)read(Y)read(Z)read(V)read(W)read(W)read(Y)write(Y)write(Z)read(U)read(Y)write(Y)read(Z)write(Z)

read(U)write(U)

Page 208: Database System Concepts

208JYP

Precedence Graph for Schedule APrecedence Graph for Schedule A

T3

T4T1

T2

T5

Page 209: Database System Concepts

209JYP

Test for Conflict SerializabilityTest for Conflict Serializability

A schedule is conflict serializable if and only if its precedence graph is acyclic.

Cycle-detection algorithms exist which take order n2 time, where n is the number of vertices in the graph. (Better algorithms take order n + e where e is the number of edges.)

If precedence graph is acyclic, the serializability order can be obtained by a topological sorting of the graph. This is a linear order consistent with the partial order of the graph.For example, a serializability order for Schedule A would beT5 T1 T3 T2 T4 .

Page 210: Database System Concepts

210JYP

Test for View SerializabilityTest for View Serializability

The precedence graph test for conflict serializability must be modified to apply to a test for view serializability.

Construct a labeled precedence graph. Look for an acyclic graph which is derived from the labeled precedence graph. Schedule is view serializable if and only if such an acyclic graph can be found.

The problem of looking for such an acyclic graph falls in the class of NP-complete problems. Thus existence of an efficient algorithm is unlikely.

However practical algorithms that just check some sufficient conditions for view serializability can still be used.

Page 211: Database System Concepts

211JYP

Concurrency Control vs. Serializability TestsConcurrency Control vs. Serializability Tests

Testing a schedule for serializability after it has executed is a little too late!

Goal – to develop concurrency control protocols that will assure serializability. They will generally not examine the precedence graph as it is being created; instead a protocol will impose a discipline that avoids nonseralizable schedules.Will study such protocols later.

Tests for serializability help understand why a concurrency control protocol is correct.

Page 212: Database System Concepts

212JYP

Topic 7: Concurrency ControlTopic 7: Concurrency Control

Lock-Based Protocols

Multiple Granularity

Deadlock Handling

Insert and Delete Operations

Page 213: Database System Concepts

213JYP

Lock-Based ProtocolsLock-Based Protocols

A lock is a mechanism to control concurrent access to a data item

Data items can be locked in two modes :

1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction.

2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction.

Lock requests are made to concurrency-control manager. Transaction can proceed only after request is granted.

Page 214: Database System Concepts

214JYP

Lock-Based Protocols (Cont.)Lock-Based Protocols (Cont.)

Lock-compatibility matrix

A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions

The matrix allows any number of transactions to hold shared locks on an item, but if any transaction holds an exclusive on the item no other transaction may hold any lock on the item.

S X

falsetrueS

X true false

Page 215: Database System Concepts

215JYP

Lock-Based Protocols (Cont.)Lock-Based Protocols (Cont.)

If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.

Page 216: Database System Concepts

216JYP

Lock-Based Protocols (Cont.)Lock-Based Protocols (Cont.)

Example of a transaction performing locking:

T2: lock-S(A);

read (A);

unlock(A);

lock-S(B);

read (B);

unlock(B);

display(A+B)

Locking as above is not sufficient to guarantee serializability — if A and B get updated in-between the read of A and B, the displayed sum would be wrong.

A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.

Page 217: Database System Concepts

217JYP

Pitfalls of Lock-Based ProtocolsPitfalls of Lock-Based Protocols

Consider the partial schedule

Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.

Such a situation is called a deadlock. To handle a deadlock one of T3 or T4 must be rolled back and its locks released.

T3 T4

lock-X(B)read(B)B:= B-50write(B)

lock-S(A)read(A)lock-S(B)

lock-X(A)

Page 218: Database System Concepts

218JYP

Pitfalls of Lock-Based Protocols (Cont.)Pitfalls of Lock-Based Protocols (Cont.)

The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.

Starvation is also possible if concurrency control manager is badly designed. For example: A transaction may be waiting for an X-lock on an item, while a

sequence of other transactions request and are granted an S-lock on the same item.

The same transaction is repeatedly rolled back due to deadlocks.

Concurrency control manager can be designed to prevent starvation.

Page 219: Database System Concepts

219JYP

The Two-Phase Locking ProtocolThe Two-Phase Locking Protocol

This is a protocol which ensures conflict-serializable schedules.

Phase 1: Growing Phase transaction may obtain locks

transaction may not release locks

Phase 2: Shrinking Phase transaction may release locks

transaction may not obtain locks

The protocol assures serializability. It can be proved that the

transactions can be serialized in the order of their lock points

(i.e. the point where a transaction acquired its final lock).

Page 220: Database System Concepts

220JYP

The Two-Phase Locking Protocol (Cont.)The Two-Phase Locking Protocol (Cont.)

Two-phase locking does not ensure freedom from deadlocks

Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.

Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.

Page 221: Database System Concepts

221JYP

The Two-Phase Locking Protocol (Cont.)The Two-Phase Locking Protocol (Cont.)

There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.

e.g., T1 T2

lock-S(X)read (X)unlock-S(X)

lock-X(Y)write (Y)unlock-X(Y)

lock-X(X)write (X)unlock-X(X)

Page 222: Database System Concepts

222JYP

The Two-Phase Locking Protocol (Cont.)The Two-Phase Locking Protocol (Cont.)

However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:

Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.

Page 223: Database System Concepts

223JYP

Lock ConversionsLock Conversions

Two-phase locking with lock conversions:

– First Phase: can acquire a lock-S on item

can acquire a lock-X on item

can convert a lock-S to a lock-X (upgrade)

– Second Phase: can release a lock-S

can release a lock-X

can convert a lock-X to a lock-S (downgrade)

This protocol assures serializability. But still relies on the

programmer to insert the various locking instructions.

Page 224: Database System Concepts

224JYP

Automatic Acquisition of LocksAutomatic Acquisition of Locks

A transaction Ti issues the standard read/write instruction,

without explicit locking calls.

The operation read(D) is processed as:

if Ti has a lock on D

then

read(D)

else

begin

if necessary wait until no other

transaction has a lock-X on D

grant Ti a lock-S on D;

read(D) end

Page 225: Database System Concepts

225JYP

Automatic Acquisition of Locks (Cont.)Automatic Acquisition of Locks (Cont.)

write(D) is processed as:

if Ti has a lock-X on D

then write(D) else begin if necessary wait until no other trans. has any lock on D,

if Ti has a lock-S on D then upgrade lock on D to lock-X else grant Ti a lock-X on D

write(D) end; All locks are released after commit or abort

Page 226: Database System Concepts

226JYP

Multiple GranularityMultiple Granularity

Allow data items to be of various sizes and define a hierarchy of data granularities, where the small granularities are nested within larger ones

Can be represented graphically as a tree

when a transaction locks a node in the tree explicitly, it implicitly locks all the node's descendents in the same mode.

Granularity of locking (level in tree where locking is done): fine granularity (lower in tree): high concurrency, high locking

overhead

coarse granularity (higher in tree): low locking overhead, low concurrency

Page 227: Database System Concepts

227JYP

Example of Granularity HierarchyExample of Granularity Hierarchy

The highest level in the example hierarchy is the entire database.

The levels below are of type area, file and record in that order.

Page 228: Database System Concepts

228JYP

Intention Lock ModesIntention Lock Modes

In addition to S and X lock modes, there are three additional lock modes with multiple granularity: intention-shared (IS): indicates explicit locking at a lower level of the

tree but only with shared locks.

intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks

shared and intention- exclusive (SIX): the subtree rooted by that node is locked explicitly in shared mode and indicates explicit locking at a lower level with exclusive-mode locks.

intention locks allow a higher level node to be locked in S or X mode without having to check all descendent nodes.

Page 229: Database System Concepts

229JYP

Compatibility Matrix with Intention Lock ModesCompatibility Matrix with Intention Lock Modes

The compatibility matrix for all lock modes is:

IS IX S S IX X

IS

IX

S

S IX

X

Page 230: Database System Concepts

230JYP

Multiple Granularity Locking SchemeMultiple Granularity Locking Scheme

Transaction Ti can lock a node Q, using the following rules:

1. The lock compatibility matrix must be observed.

2. The root of the tree must be locked first, and may be locked in

any mode.

3. A node Q can be locked by Ti in IS or S mode only if the parent

of Q is currently locked by Ti in either IX or IS mode.

4. A node Q can be locked by Ti in X, SIX, or IX mode only if the

parent of Q is currently locked by Ti in either IX or SIX mode.

5. Ti can lock a node only if it has not previously unlocked any node

(that is, Ti is two-phase).

6. Ti can unlock a node Q only if none of the children of Q are

currently locked by Ti.

Observe that locks are acquired in root-to-leaf order, whereas they are released in leaf-to-root order.

Page 231: Database System Concepts

231JYP

Deadlock HandlingDeadlock Handling

Consider the following two transactions:

T1: write (X) T2: write(Y)

write(Y) write(X)

Schedule with deadlock

T1 T2

lock-X on Xwrite (X)

lock-X on Ywrite (Y) wait for lock-X on X

wait for lock-X on Y

Page 232: Database System Concepts

232JYP

Deadlock HandlingDeadlock Handling

System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.

Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies : Require that each transaction locks all its data items before it begins

execution (predeclaration).

Impose partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).

Page 233: Database System Concepts

233JYP

More Deadlock Prevention StrategiesMore Deadlock Prevention Strategies

Following schemes use transaction timestamps for the sake of deadlock prevention alone.

wait-die scheme — non-preemptive older transaction may wait for younger one to release data item.

Younger transactions never wait for older ones; they are rolled back instead.

a transaction may die several times before acquiring needed data item

wound-wait scheme — preemptive older transaction wounds (forces rollback) younger transaction

instead of waiting for it. Younger transactions may wait for older ones.

may be fewer rollbacks than wait-die scheme.

Page 234: Database System Concepts

234JYP

Deadlock prevention (Cont.)Deadlock prevention (Cont.)

Both in wait-die and in wound-wait schemes, a rolled back transactions is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.

Timeout-Based Schemes : a transaction waits for a lock only for a specified amount of time.

After that, the wait times out and the transaction is rolled back.

thus deadlocks are not possible

simple to implement; but starvation is possible. Also difficult to determine good value of the timeout interval.

Page 235: Database System Concepts

235JYP

Deadlock Detection*Deadlock Detection*

Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E), V is a set of vertices (all the transactions in the system)

E is a set of edges; each element is an ordered pair Ti Tj.

If Ti Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.

When Ti requests a data item currently being held by Tj, then the edge Ti Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.

The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.

Page 236: Database System Concepts

236JYP

Deadlock Detection* (Cont.)Deadlock Detection* (Cont.)

Wait-for graph with a cycle

Wait-for graph without a cycle

Page 237: Database System Concepts

237JYP

Deadlock Recovery*Deadlock Recovery*

When deadlock is detected : Some transaction will have to rolled back (made a victim) to break

deadlock. Select that transaction as victim that will incur minimum cost.

Rollback -- determine how far to roll back transaction

Total rollback: Abort the transaction and then restart it.

More effective to roll back transaction only as far as necessary to break deadlock.

Starvation happens if same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation

Page 238: Database System Concepts

238JYP

Topic 8: Database System ArchitecturesTopic 8: Database System Architectures

Centralized Systems

Client--Server Systems

Distributed Systems

Page 239: Database System Concepts

239JYP

Centralized SystemsCentralized Systems

Run on a single computer system and do not interact with other computer systems.

General-purpose computer system: one to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory.

Single-user system (e.g., personal computer or workstation): desk-top unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.

Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serve a large number of users who are connected to the system via terminals. Often called server systems.

Page 240: Database System Concepts

240JYP

Client-Server SystemsClient-Server Systems

Server systems satisfy requests generated at m client systems, whose general structure is shown below:

Client Client Client Client….

network

Server

Page 241: Database System Concepts

241JYP

Client-Server Systems (Cont.)Client-Server Systems (Cont.)

Database functionality can be divided into: Back-end: manages access structures, query evaluation and

optimization, concurrency control and recovery.

Front-end: consists of tools such as forms, report-writers, and graphical user interface facilities.

The interface between the front-end and the back-end is through SQL or through an application program interface.

Page 242: Database System Concepts

242JYP

Client-Server Systems (Cont.)Client-Server Systems (Cont.)

Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines: better functionality for the cost

flexibility in locating resources and expanding facilities

better user interfaces

easier maintenance

One kind of server systems is transaction servers which are widely used in relational database systems.

Page 243: Database System Concepts

243JYP

Transaction ServersTransaction Servers

Also called query server systems or SQL server systems; clients send requests to the server system where the transactions are executed, and results are shipped back to the client.

Requests specified in SQL, and communicated to the server through a remote procedure call (RPC) mechanism.

Transactional RPC allows many RPC calls to collectively form a transaction.

Open Database Connectivity (ODBC) is an application program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.

Page 244: Database System Concepts

244JYP

Distributed SystemsDistributed Systems

Data spread over multiple machines (also referred to as sites or nodes.

Network interconnects the machines

Data shared by users on multiple machines

Differentiate between local and global transactions A local transaction accesses data in the single site at which the

transaction was initiated.

A global transaction either accesses data in a site different from the one at which the transaction was initiated or accesses data in several different sites.

Page 245: Database System Concepts

245JYP

Trade-offs in Distributed SystemsTrade-offs in Distributed Systems

Sharing data – users at one site able to access the data residing at some other sites.

Autonomy – each site is able to retain a degree of control over data stored locally.

Higher system availability through redundancy — data can be replicated at remote sites, and system can function even if a site fails.

Disadvantage: added complexity required to ensure proper coordination among sites. Software development cost.

Greater potential for bugs.

Increased processing overhead.

Page 246: Database System Concepts

246JYP

Topic 9: New ApplicationsTopic 9: New Applications

Decision-Support Systems

Data Analysis

Data Mining

Data Warehousing

Spatial and Geographic Databases

Multimedia Databases

Mobility and Personal Databases

Information-Retrieval Systems

Page 247: Database System Concepts

247JYP

Decision Support SystemDecision Support System

Decision-Support systems are used to make business decisions often based on data collected by on-line transaction-processing systems.

Examples of business decisions: what items to stock?

What insurance premium to change?

Who to send advertisements to?

Examples of data used for making decisions Retail sales transaction details

Customer profiles (income, age, sex, etc.)

Page 248: Database System Concepts

248JYP

Decision-Support Systems: OverviewDecision-Support Systems: Overview

Data analysis tasks are simplified by SQL extensions.

Statistical analysis packages (e.g., S++) can be interfaced with databases Statistical analysis is a large field, will not study it here

Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from Large databases.

A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.

Page 249: Database System Concepts

249JYP

Data AnalysisData Analysis

Aggregate functions summarize large volumes of data

A histogram partitions the values taken by an attribute into ranges, and computes an aggregate over the values in each range; cumbersome to use standard SQL to construct a histogram. Extension proposed by Red Brick:

select percentile, avg (balance)

from account

group by N_tile (balance, 10) as percentile

Here, N_tile(balance,10) divides the values for balance into 10 consecutive ranges, with an equal number of values in each range, duplicates are not eliminated.

Page 250: Database System Concepts

250JYP

Small Medium Large Total

LightDark

820

3510

10 5

5335

Total 28 45 15 88

Cross-tabulation of number by size and color of sample relation sales with the schema Sales(color, size, number).

Data Analysis (Cont.)Data Analysis (Cont.)

Page 251: Database System Concepts

251JYP

Data Analysis (Cont.)Data Analysis (Cont.)Color Size Number

LightLightLightLightDarkDarkDarkDark

allallallall

SmallMediumLarge

allSmall

MediumLarge

allSmall

MediumLarge

all

835105320105

3528451588

Can represent subtotals in relational form by using the value all

E.g. : obtain (Light, all, 53) and (Dark, all, 35) by aggregating individual tuples with different values for size for each color.

Page 252: Database System Concepts

252JYP

Data Analysis (Cont.)Data Analysis (Cont.)

Rollup: Moving from finer-granularity data to a coarser granularity by means of aggregation.

Drill down: Moving from coarser-granularity data to finer-granularity data.

Proposed extensions to SQL, such as the cube operation help to support generation of summary data

The following query generates the previous table.

select color, size, sum (number)

from sales

groupby color, size with cube

Page 253: Database System Concepts

253JYP

Data Analysis (Cont.)Data Analysis (Cont.)

Figure shows the combinations of dimensions size, color, price

In general computing cube operation with n groupby columns gives 2n different groupby combinations.

Page 254: Database System Concepts

254JYP

Data MiningData Mining

Like knowledge discovery in artificial intelligence data mining discovers statistical rules and patterns it differs from machine learning in that it deals with large volumes of data stored primarily on disk.

Knowledge discovered from a database can be represented by a set of rules.

e.g.,: “Young women with annual incomes greater than $50,000 are most likely to buy sports cars”

Discover rules using one of two models:

1. The user is involved directly in the process of knowledge discovery.

2. The system is responsible for automatically discovering knowledge from the database by detecting patterns and correlation's in the data.

Page 255: Database System Concepts

255JYP

Knowledge Representation Using RulesKnowledge Representation Using Rules

General form of rules X antecedent consequent X is a list of one or more variables with associated ranges.

The rule transactions T, buys (T, bread) buys(T, milk) states: if there is a tuple (t1, bread) in the relation buys, there must also be a tuple (t1, milk) in the relation buys.

Population: Cross-product of the ranges of the variables in the rule. In the above example, the set of all transactions.

Support: Measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. e.g., 10% of transactions buy bread and milk.

Confidence : Measure of how often the consequent is true when the antecedent is true. e.g., 80% of transactions that buy bread also buy milk

Page 256: Database System Concepts

256JYP

Some Classes of Data-Mining ProblemsSome Classes of Data-Mining Problems

Classification : Finding rules that partition the given data into disjoint groups (classes) that are relevant for making a decision (e.g.,: which of several factors help classify a person’s credit

worthiness).

Associations: Useful to determine associations between different items (e.g.,: someone who buys bread is quite likely also to buy

milk).

Sequence correlations: determine correlations between related sequence data. (e.g., : when bond rates go up stock prices go down within

two days.)

Page 257: Database System Concepts

257JYP

User-Guided Data MiningUser-Guided Data Mining

In user-guided data mining, primary responsibility for discovering rules is with the user.

User may runs tests on the database to verify or refute a hypothesis. Confidence and support for rules expressing a hypothesis are derived from the database.

An iterative process of forming and refining rules is used. Example: Test the hypothesis “People who hold master’s degrees are the most likely to have an excellent credit rating.” If confidence of rule is low, may refine it into the rule:

people P, P.degree = Masters and P.income 75, 000

P.credit = excellent

Data-visualization though graphical representations like maps, charts, and color-coding, helps detect patterns in data

Page 258: Database System Concepts

258JYP

Classification RulesClassification Rules

Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?

Classification rules for above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.

Classification rules can be compactly shown as a Classification tree.

Page 259: Database System Concepts

259JYP

Part of Credit Risk Classification TreePart of Credit Risk Classification Tree

Page 260: Database System Concepts

260JYP

Discovery of Classification Rules*Discovery of Classification Rules*

Training set: a data sample in which the grouping for each tuple is already known.

Top down generation of classification tree. Each internal node of the tree partitions the data into groups

based on the attribute.

The data at a node is not partitioned further if either all (or most) of the items at the node belong to the same class, or all attributes have been considered. Such a node is a leaf node.

Otherwise the data at the node is partitioned further by picking an attribute for partitioning data at the node.

Page 261: Database System Concepts

261JYP

Discovery of Classification Rules *(Cont.)Discovery of Classification Rules *(Cont.)

Consider credit risk example: Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.

At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.

Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.

In general, different branches of the tree could grow to different levels. Different nodes at the same level may use different partitioning attributes.

Page 262: Database System Concepts

262JYP

Discovery of Association RulesDiscovery of Association Rules

Example: transactions T, buys (T, bread) buys (T, milk)

In general: notion of transaction , and its itemset, the set of items contained in the transaction

General form of rule:

transactions T, c(T, i1) and . . . and c(T, in) c(T, i0)

where c(T, ik) denotes that transaction T contains item ik.

Above can be represented as A b where A = {i1, i2, . . . , in}

and b = i0.

Support of rule = number of transactions whose itemsets contain A {b}

Usually desire rules with strong support, which will involve only items purchased in a significant percentage of the transactions.

Page 263: Database System Concepts

263JYP

Discovery of Association Rules (Cont.)Discovery of Association Rules (Cont.)

Consider all possible sets of relevant items.

For each set find its support (i.e. , how many transactions purchase all items in the set).

Use sets with sufficiently high support to generate association rules. From set A generate the rules A - {b} b for each b A.

Support of each of the rules is support of A.

Confidence of a rule is support of A divided by support of A - {b}.

Page 264: Database System Concepts

264JYP

Finding SupportFinding Support

Few sets: Determine level of support via a single pass. A count is maintained for each set, initially set to 0.

When a transaction is fetched, the count is incremented for each set of items which contained in the itemset of the transaction.

Sets with a high count at the end of the pass correspond to items with a high degree of association.

Many sets: If memory not enough to hold all counts for all sets Use multiple passes, considering only some sets in each pass.

Optimization: Once a set is eliminated because it occurs in too small a fraction of the transactions, none of its supersets needs to be considered.

Page 265: Database System Concepts

265JYP

Data WarehousingData Warehousing

A data warehouse is a repository of information gathered from multiple sources.

Page 266: Database System Concepts

266JYP

Data Warehousing (Cont.)Data Warehousing (Cont.)

Provides a single consolidated interface to data

Data stored for an extended period, providing access to historical data

Data/updates are periodically downloaded from online transaction processing (OLTP) systems. Typically, download happens each night.

Data may not be completely up-to-date, but is recent enough for analysis.

Running large queries at the warehouse ensures that OLTP systems are not affected by the decision-support workload.

Page 267: Database System Concepts

267JYP

Issues in Building a WarehouseIssues in Building a Warehouse

When and how to gather data. Source driven: data source initiates data transfer

Destination driven: warehouse initiates data transfer

What schema to use. Schema integration

Cleaning and conversion of incoming data

What data to summarize. Raw data may be too large to store on-line

Aggregate values (totals/subtotals) often suffice

Queries on raw data can often be transformed by query optimizer to use aggregate values

How to propagate updates. Date at warehouse is a view on source data

Efficient view maintenance techniques required

Page 268: Database System Concepts

268JYP

Spatial and Geographic DatabasesSpatial and Geographic Databases

Spatial databases store information related to spatial locations, and support efficient storage, indexing and querying of spatial data.

Special purpose index structures are important for accessing spatial data, and for processing spatial join queries.

Design databases (or CAD databases) store design information about how objects are constructed e.g.: designs of buildings, aircraft, layouts of integrated-circuits

Geographic databases store geographic information (e.g., maps): often called geographic information systems or GIS.

Page 269: Database System Concepts

269JYP

Representation of Geometric InformationRepresentation of Geometric Information

Various geometric constructs can be represented in a database in a normalized fashion.

Represent a line segment by the coordinates of its endpoints.

Approximate a curve by partitioning it into a sequence of segments; represents each segment as a separate tuple that also carries with it the identifier of the curve.

Closed polygons: list its vertices in order, starting vertex is the same as the ending vertex.

Alternative: triangulation — divide polygon into triangles; give the polygon an identifier with each of its

triangles.

Page 270: Database System Concepts

270JYP

Representation of Geometric Constructs (Cont.)Representation of Geometric Constructs (Cont.)

Page 271: Database System Concepts

271JYP

Representation of Geometric Information (Cont.)Representation of Geometric Information (Cont.)

Representation of points and line segment in 3-D similar to 2-D, except that points have an extra z component

Represent arbitrary polyhedra by dividing them into tetrahedrons, like triangulating polygons.

Alternative: List their faces, each of which is a polygon, along with an indication of which side of the face is inside the polyhedron.

Page 272: Database System Concepts

272JYP

Design DatabasesDesign Databases

Represent design components as objects (generally geometric objects); the connections between the objects indicate how the design is structured.

Simple two-dimensional objects: points, lines, triangles, rectangles, polygons.

Complex two-dimensional objects: formed from simple objects via union, intersection, and difference operations.

Complex three-dimensional objects: formed from simpler objects such as spheres, cylinders, and cuboids, by union, intersection, and difference operations.

Wireframe models represent three-dimensional surfaces as a set of simpler objects.

Page 273: Database System Concepts

273JYP

Representation of Geometric ConstructsRepresentation of Geometric Constructs

Design databases also store non-spatial information about objects (e.g., construction material, color, etc.)

Spatial integrity constraints are important.

E.g., pipes should not intersect, wires should not be too close to each other, etc.

(a) Difference of cylinders (b) Union of cylinders

Page 274: Database System Concepts

274JYP

Geographic DataGeographic Data

Raster data consist of bit maps or pixel maps, in two or more dimensions. Example 2-D raster image: satellite image of cloud cover, where

each pixel stores the cloud visibility in a particular area.

Additional dimensions might include the temperature at different altitudes at different regions, or measurements taken at different points in time.

Design databases generally do not store raster data.

Page 275: Database System Concepts

275JYP

Geographic Data (Cont.)Geographic Data (Cont.)

Vector data are constructed from basic geometric objects: points, line segments, triangles, and other polygons in two dimensions, and cylinders, speheres, cuboids, and other polyhedrons in three dimensions.

Vector format often used to represent map data. Roads can be considered as two-dimensional and represented

by lines and curves.

Some features, such as rivers, may be represented either as complex curves or as complex polygons, depending on whether their width is relevant.

Features such as regions and lakes can be depicted as polygons.

Page 276: Database System Concepts

276JYP

Applications of Geographic DataApplications of Geographic Data Examples of geographic data

map data for vehicle navigation

distribution network information for power, telephones, water supply, and sewage

Vehicle navigation systems store information about roads and services for the use of drivers: Spatial data: e.g, road/restaurant/gas-station coordinates

Non-spatial data: e.g., one-way streets, speed limits, traffic congestion

Global Positioning System or GPS unit - utilizes information broadcast from GPS satellites to find the current location of user with an accuracy of tens of meters. increasingly used in vehicle navigation systems as well as utility maintenance

applications.

Page 277: Database System Concepts

277JYP

Spatial QueriesSpatial Queries

Nearness queries request objects that lie near a specified location.

Nearest neighbor queries, given a point or an object, find the nearest object that satisfies given conditions.

Region queries deal with spatial regions. e.g., ask for objects that lie partially or fully inside a specified region.

Queries that compute intersections or unions of regions.

Spatial join of two spatial relations with the location playing the role of join attribute.

Page 278: Database System Concepts

278JYP

Spatial Queries (Cont.)Spatial Queries (Cont.)

Spatial data is typically queried using a graphical query language; results are also displayed in a graphical manner.

Graphical interface constitutes the front-end

Extensions of SQL with abstract data types, such as lines, polygons and bit maps, have been proposed to interface with back-end. allows relational databases to store and retrieve spatial

information

queries can mix spatial and nonspatial conditions

extensions also include and allowing spatial conditions (contains or overlaps).

Page 279: Database System Concepts

279JYP

Multimedia DatabasesMultimedia Databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database (rather than storing them outside the database, in a file system).

The database must handle large object representation.

Similarity-based retrieval must be provided by special index structures.

Must provide guaranteed steady retrieval rates for continuos-media data.

Page 280: Database System Concepts

280JYP

Similarity-Based RetrievalSimilarity-Based Retrieval

Pictorial data - Two pictures or images that are slightly different as represented in the database may be considered the same by a user.

E.g., identify similar designs for registering a new trademark.

Audio data- Speech-based user interfaces allow the user to give a command or identify a data item by speaking.

E.g., test user input against stored commands.

Handwritten data -- Identify a handwritten data item or command stored in the database (requires similarity testing).

Page 281: Database System Concepts

281JYP

Continuos-Media DataContinuos-Media Data

Most important types are video and audio data.

Characterized by high data volumes and real-time information-delivery requirements. Data must be delivered sufficiently fast that there are no

gaps in the audio or video.

Data must be delivered at a rate that does not cause overflow of system buffers.

Synchronization among distinct data streams must be maintained (video of a person speaking must show lips moving synchronously with the audio).

Page 282: Database System Concepts

282JYP

Multimedia Data FormatsMultimedia Data Formats

Store and transmit multimedia data in compressed form;JPEG is the most widely used format for image data.

Encoding each frame of a video using JPEG is wasteful since, successive frames of a video are often nearly the same.

MPEG standards use commonalties among a sequence of frames to achieve a greater degree of compression.

MPEG-1 stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB (compares with 75 MB for video using only JPEG); quality comparable to VHS video tape.

MPEG-2 designed for digital broadcast systems and DVD; negligible loss of video quality. Compresses 1 minute of audio-video to approximately 17 MB.

MPEG-4

Page 283: Database System Concepts

283JYP

Video ServersVideo Servers

Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.

Head-end terminals - used to view multimedia data; PCs or TVs attached to a small, inexpensive computer called a set-top box.

Network - Transmission of multimedia data from a server to multiple head-end terminals requires a high-capacity network, such as asynchronous-transfer-mode (ATM) network.

Page 284: Database System Concepts

284JYP

Mobility and Personal DatabasesMobility and Personal Databases

A mobile computing environment consists of mobile computers, referred to as mobile hosts, and a wired network of computers.

Mobile host may be able to communicate with wired network through a wireless digital communication network.

A model for mobile communication Mobile hosts communicate to the wired network via

computers referred to as mobile support stations.

Each mobile support station manages those mobile hosts within its cell.

When mobile hosts move between cells, there is a handoff of control from one mobile support station to another.

Page 285: Database System Concepts

285JYP

Database Issues in Mobile ComputingDatabase Issues in Mobile Computing

New issues for query optimization. energy (battery power) is a scarce resource and its usage

must be minimized

mobile user’s locations may be a parameter of the query.

Users may need to be able to perform database updates even while the mobile computer is disconnected. e.g., mobile salesman records sale of products on (local

copy of) database.

Can result in conflicts detected on reconnection, which may need to be resolved manually.

Page 286: Database System Concepts

286JYP

Disconnectivity and ConsistencyDisconnectivity and Consistency

A mobile host may remain in operation during periods of disconnection.

Problems created if the user of the mobile host issues queries and updates on data that resides or is cached locally: Recoverability: Updates entered on a disconnected machine

may be lost if the mobile host fails. Since the mobile host represents a single point of failure, stable storage cannot be simulated well.

Consistency : Cached data may become out of date, but the mobile host cannot discover this until it is reconnected.

Page 287: Database System Concepts

287JYP

Information Retrieval SystemsInformation Retrieval Systems

Information retrieval (IR) systems use a simpler data model than database systems, but provide more powerful querying capabilities within the restricted model.

Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords. e.g., find documents containing the words “database

systems”

Information retrieval systems order answers based on their estimated relevance. e.g., user may really only want documents about

database systems, but the system may retrieve all documents that mention the phrase “database systems”.

Documents may be ordered by, for example, how many times the phrase appears in the document.

Page 288: Database System Concepts

288JYP

Differences From Database SystemsDifferences From Database Systems

Information retrieval systems, unlike traditional database systems, handle: Unstructured documents

Searching using keywords and relevance ranking

Most information retrieval systems do not handle: High update rates

Concurrency control

Data structured using more complex data models (e.g., relational or object oriented data models)

Complex queries written in, e.g., SQL

Page 289: Database System Concepts

289JYP

Topic 10: XMLTopic 10: XML

XML: Extensible Markup Language

Defined by the WWW Consortium (W3C)

Originally intended as a document markup language not a database language Documents have tags giving extra information about sections of the

document

E.g. <title> XML </title> <slide> Introduction …</slide>

Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML

Extensible, unlike HTML

Users can add new tags, and separately specify how the tag should be handled for display

Goal was (is?) to replace HTML as the language for publishing documents on the Web

Page 290: Database System Concepts

290JYP

XML Introduction (Cont.)XML Introduction (Cont.)

The ability to specify new tags, and to create nested tag structures made XML a great way to exchange data, not just documents. Much of the use of XML has been in data exchange applications, not as a

replacement for HTML

Tags make data (relatively) self-documenting E.g.

<bank> <account>

<account-number> A-101 </account-number> <branch-name> Downtown </branch-name> <balance> 500 </balance>

</account> <depositor>

<account-number> A-101 </account-number> <customer-name> Johnson </customer-name>

</depositor>

</bank>

Page 291: Database System Concepts

291JYP

XML: MotivationXML: Motivation

Data interchange is critical in today’s networked world Examples:

Banking: funds transfer

Order processing (especially inter-company orders)

Scientific data

– Chemistry: ChemML, …

– Genetics: BSML (Bio-Sequence Markup Language), …

Paper flow of information between organizations is being replaced by electronic flow of information

Each application area has its own set of standards for representing information

XML has become the basis for all new generation data interchange formats

Page 292: Database System Concepts

292JYP

XML Motivation (Cont.)XML Motivation (Cont.) Earlier generation formats were based on plain text with line

headers indicating the meaning of fields Similar in concept to email headers

Does not allow for nested structures, no standard “type” language

Tied too closely to low level document structure (lines, spaces, etc)

Each XML based standard defines what are valid elements, using XML type specification languages to specify the syntax

DTD (Document Type Descriptors)

XML Schema

Plus textual descriptions of the semantics

XML allows new tags to be defined as required However, this may be constrained by DTDs

A wide variety of tools is available for parsing, browsing and querying XML documents/data

Page 293: Database System Concepts

293JYP

Structure of XML DataStructure of XML Data

Tag: label for a section of data

Element: section of data beginning with <tagname> and ending with matching </tagname>

Elements must be properly nested Proper nesting

<account> … <balance> …. </balance> </account>

Improper nesting

<account> … <balance> …. </account> </balance>

Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.

Every document must have a single top-level element

Page 294: Database System Concepts

294JYP

Example of Nested ElementsExample of Nested Elements

<bank-1> <customer>

<customer-name> Hayes </customer-name> <customer-street> Main </customer-street> <customer-city> Harrison </customer-city> <account>

<account-number> A-102 </account-number> <branch-name> Perryridge </branch-name> <balance> 400 </balance>

</account> <account> … </account>

</customer> . .

</bank-1>

Page 295: Database System Concepts

295JYP

Motivation for NestingMotivation for Nesting

Nesting of data is useful in data transfer Example: elements representing customer-id, customer name, and

address nested within an order element

Nesting is not supported, or discouraged, in relational databases With multiple orders, customer name and address are stored

redundantly

normalization replaces nested structures in each order by foreign key into table storing customer name and address information

Nesting is supported in object-relational databases

But nesting is appropriate when transferring data External application does not have direct access to data referenced

by a foreign key

Page 296: Database System Concepts

296JYP

Structure of XML Data (Cont.)Structure of XML Data (Cont.)

Mixture of text with sub-elements is legal in XML. Example:

<account> This account is seldom used any more. <account-number> A-102</account-number> <branch-name> Perryridge</branch-name> <balance>400 </balance></account>

Useful for document markup, but discouraged for data representation

Page 297: Database System Concepts

297JYP

AttributesAttributes

Elements can have attributes <account acct-type = “checking” >

<account-number> A-102 </account-number> <branch-name> Perryridge </branch-name> <balance> 400 </balance>

</account>

Attributes are specified by name=value pairs inside the starting tag of an element

An element may have several attributes, but each attribute name can only occur once

<account acct-type = “checking” monthly-fee=“5”>

Page 298: Database System Concepts

298JYP

Attributes Vs. SubelementsAttributes Vs. Subelements

Distinction between subelement and attribute In the context of documents, attributes are part of markup, while

subelement contents are part of the basic document contents

In the context of data representation, the difference is unclear and may be confusing

Same information can be represented in two ways

– <account account-number = “A-101”> …. </account>

– <account> <account-number>A-101</account-number> … </account>

Suggestion: use attributes for identifiers of elements, and use subelements for contents

Page 299: Database System Concepts

299JYP

More on XML SyntaxMore on XML Syntax

Elements without subelements or text content can be abbreviated by ending the start tag with a /> and deleting the end tag <account number=“A-101” branch=“Perryridge” balance=“200 />

To store string data that may contain tags, without the tags being interpreted as subelements, use CDATA as below

<![CDATA[<account> … </account>]]>

Here, <account> and </account> are treated as just strings

Page 300: Database System Concepts

300JYP

NamespacesNamespaces

XML data has to be exchanged between organizations

Same tag name may have different meaning in different organizations, causing confusion on exchanged documents

Specifying a unique string as an element name avoids confusion

Better solution: use unique-name:element-name

Avoid using long unique names all over document by using XML Namespaces

<bank Xmlns:FB=‘http://www.FirstBank.com’> …

<FB:branch> <FB:branchname>Downtown</FB:branchname>

<FB:branchcity> Brooklyn </FB:branchcity> </FB:branch>…

</bank>

Page 301: Database System Concepts

301JYP

XML Document SchemaXML Document Schema

Database schemas constrain what information can be stored, and the data types of stored values

XML documents are not required to have an associated schema

However, schemas are very important for XML data exchange Otherwise, a site cannot automatically interpret data received from

another site

Two mechanisms for specifying XML schema Document Type Definition (DTD)

Widely used

XML Schema

Newer, increasing use

Page 302: Database System Concepts

302JYP

Document Type Definition (DTD)Document Type Definition (DTD)

The type of an XML document can be specified using a DTD

DTD constraints structure of XML data What elements can occur

What attributes can/must an element have

What subelements can/must occur inside each element, and how many times.

DTD does not constrain data types All values represented as strings in XML

DTD syntax <!ELEMENT element (subelements-specification) >

<!ATTLIST element (attributes) >

Page 303: Database System Concepts

303JYP

Element Specification in DTDElement Specification in DTD

Subelements can be specified as names of elements, or #PCDATA (parsed character data), i.e., character strings EMPTY (no subelements) or ANY (anything can be a subelement)

Example<! ELEMENT depositor (customer-name account-number)>

<! ELEMENT customer-name (#PCDATA)><! ELEMENT account-number (#PCDATA)>

Subelement specification may have regular expressions <!ELEMENT bank ( ( account | customer | depositor)+)>

Notation:

– “|” - alternatives

– “+” - 1 or more occurrences

– “*” - 0 or more occurrences

Page 304: Database System Concepts

304JYP

Bank DTDBank DTD

<!DOCTYPE bank [<!ELEMENT bank ( ( account | customer | depositor)+)><!ELEMENT account (account-number branch-name balance)><! ELEMENT customer(customer-name customer-street customer-city)><! ELEMENT depositor (customer-name account-number)><! ELEMENT account-number (#PCDATA)><! ELEMENT branch-name (#PCDATA)><! ELEMENT balance(#PCDATA)><! ELEMENT customer-name(#PCDATA)><! ELEMENT customer-street(#PCDATA)><! ELEMENT customer-city(#PCDATA)>

]>

Page 305: Database System Concepts

305JYP

Attribute Specification in DTDAttribute Specification in DTD

Attribute specification : for each attribute Name Type of attribute

CDATA ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)

– more on this later Whether

mandatory (#REQUIRED) has a default value (value), or neither (#IMPLIED)

Examples <!ATTLIST account acct-type CDATA “checking”> <!ATTLIST customer

customer-id ID # REQUIREDaccounts IDREFS # REQUIRED >

Page 306: Database System Concepts

306JYP

IDs and IDREFsIDs and IDREFs

An element can have at most one attribute of type ID

The ID attribute value of each element in an XML document must be distinct Thus the ID attribute value is an object identifier

An attribute of type IDREF must contain the ID value of an element in the same document

An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document

Page 307: Database System Concepts

307JYP

Bank DTD with AttributesBank DTD with Attributes

Bank DTD with ID and IDREF attribute types. <!DOCTYPE bank-2[

<!ELEMENT account (branch, balance)> <!ATTLIST account account-number ID # REQUIRED

owners IDREFS # REQUIRED> <!ELEMENT customer(customer-name, customer-street,

customer-city)> <!ATTLIST customer

customer-id ID # REQUIRED accounts IDREFS # REQUIRED>

… declarations for branch, balance, customer-name, customer-street and customer-city]>

Page 308: Database System Concepts

308JYP

XML data with ID and IDREF attributesXML data with ID and IDREF attributes

<bank-2><account account-number=“A-401” owners=“C100 C102”>

<branch-name> Downtown </branch-name> <balance> 500 </balance>

</account><customer customer-id=“C100” accounts=“A-401”>

<customer-name>Joe </customer-name> <customer-street> Monroe </customer-street> <customer-city> Madison</customer-city>

</customer><customer customer-id=“C102” accounts=“A-401 A-402”>

<customer-name> Mary </customer-name> <customer-street> Erin </customer-street> <customer-city> Newark </customer-city>

</customer></bank-2>

Page 309: Database System Concepts

309JYP

Limitations of DTDsLimitations of DTDs

No typing of text elements and attributes All values are strings, no integers, reals, etc.

Difficult to specify unordered sets of subelements Order is usually irrelevant in databases

(A | B)* allows specification of an unordered set, but

Cannot ensure that each of A and B occurs only once

IDs and IDREFs are untyped The owners attribute of an account may contain a reference to

another account, which is meaningless

owners attribute should ideally be constrained to refer to customer elements

Page 310: Database System Concepts

310JYP

XML SchemaXML Schema

XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports Typing of values

E.g. integer, string, etc

Also, constraints on min/max values

User defined types

Is itself specified in XML syntax, unlike DTDs

More standard representation, but verbose

Is integrated with namespaces

Many more features

List types, uniqueness and foreign key constraints, inheritance ..

BUT: significantly more complicated than DTDs, not yet widely used.

Page 311: Database System Concepts

311JYP

XML Schema Version of Bank DTDXML Schema Version of Bank DTD<xsd:schema xmlns:xsd=http://www.w3.org/2001/XMLSchema>

<xsd:element name=“bank” type=“BankType”/>

<xsd:element name=“account”><xsd:complexType> <xsd:sequence> <xsd:element name=“account-number” type=“xsd:string”/> <xsd:element name=“branch-name” type=“xsd:string”/> <xsd:element name=“balance” type=“xsd:decimal”/> </xsd:squence></xsd:complexType>

</xsd:element>….. definitions of customer and depositor ….

<xsd:complexType name=“BankType”><xsd:squence>

<xsd:element ref=“account” minOccurs=“0” maxOccurs=“unbounded”/><xsd:element ref=“customer” minOccurs=“0” maxOccurs=“unbounded”/><xsd:element ref=“depositor” minOccurs=“0” maxOccurs=“unbounded”/>

</xsd:sequence></xsd:complexType></xsd:schema>

Page 312: Database System Concepts

312JYP

Querying and Transforming XML DataQuerying and Transforming XML Data

Translation of information from one XML schema to another

Querying on XML data

Above two are closely related, and handled by the same tools

Standard XML querying/translation languages XPath

Simple language consisting of path expressions

XQuery

An XML query language with a rich set of features

Wide variety of other languages have been proposed, and some served as basis for the Xquery standard XML-QL, Quilt, XQL, …

Page 313: Database System Concepts

313JYP

Tree Model of XML DataTree Model of XML Data

Query and transformation languages are based on a tree model of XML data

An XML document is modeled as a tree, with nodes corresponding to elements and attributes Element nodes have children nodes, which can be attributes or

subelements Text in an element is modeled as a text node child of the element Children of a node are ordered according to their order in the XML

document Element and attribute nodes (except for the root node) have a single

parent, which is an element node The root node has a single child, which is the root element of the

document

We use the terminology of nodes, children, parent, siblings, ancestor, descendant, etc., which should be interpreted in the above tree model of XML data.

Page 314: Database System Concepts

314JYP

XPathXPath

XPath is used to address (select) parts of documents using path expressions

A path expression is a sequence of steps separated by “/” Think of file names in a directory hierarchy

Result of path expression: set of values that along with their containing elements/attributes match the specified path

E.g. /bank-2/customer/customer-name evaluated on the bank-2 data we saw earlier returns

<customer-name>Joe</customer-name>

<customer-name>Mary</customer-name>

E.g. /bank-2/customer/customer-name/text( )

returns the same names, but without the enclosing tags

Page 315: Database System Concepts

315JYP

XPath (Cont.)XPath (Cont.)

The initial “/” denotes root of the document (above the top-level tag)

Path expressions are evaluated left to right Each step operates on the set of instances produced by the previous step

Selection predicates may follow any step in a path, in [ ] E.g. /bank-2/account[balance > 400]

returns account elements with a balance value greater than 400

/bank-2/account[balance] returns account elements containing a balance subelement

Attributes are accessed using “@” E.g. /bank-2/account[balance > 400]/@account-number

returns the account numbers of those accounts with balance > 400

IDREF attributes are not dereferenced automatically (more on this later)

Page 316: Database System Concepts

316JYP

Functions in XPathFunctions in XPath

XPath provides several functions The function count() at the end of a path counts the number of

elements in the set generated by the path E.g. /bank-2/account[customer/count() > 2]

– Returns accounts with > 2 customers Also function for testing position (1, 2, ..) of node w.r.t. siblings

Boolean connectives and and or and function not() can be used in predicates

IDREFs can be referenced using function id() id() can also be applied to sets of references such as IDREFS and

even to strings containing multiple references separated by blanks E.g. /bank-2/account/id(@owner)

returns all customers referred to from the owners attribute of account elements.

Page 317: Database System Concepts

317JYP

More XPath FeaturesMore XPath Features Operator “|” used to implement union

E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower)

gives customers with either accounts or loans

However, “|” cannot be nested inside other operators.

“//” can be used to skip multiple levels of nodes E.g. /bank-2//customer-name

finds any customer-name element anywhere under the /bank-2 element, regardless of the element in which it is contained.

A step in the path can go to:

parents, siblings, ancestors and descendants

of the nodes generated by the previous step, not just to the children “//”, described above, is a short from for specifying “all descendants”

“..” specifies the parent.

We omit further details,

Page 318: Database System Concepts

318JYP

XQueryXQuery

XQuery is a general purpose query language for XML data

Currently being standardized by the World Wide Web Consortium (W3C) The textbook description is based on a March 2001 draft of the standard.

The final version may differ, but major features likely to stay unchanged.

Alpha version of XQuery engine available free from Microsoft

XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL

XQuery uses a for … let … where .. result … syntax for SQL from where SQL where result SQL select let allows temporary variables, and has no equivalent in SQL

Page 319: Database System Concepts

319JYP

FLWR Syntax in XQuery FLWR Syntax in XQuery

For clause uses XPath expressions, and variable in for clause ranges over values in the set returned by XPath

Simple FLWR expression in XQuery

find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag

for $x in /bank-2/account

let $acctno := $x/@account-number where $x/balance > 400 return <account-number> $acctno </account-number>

Let clause not really needed in this query, and selection can be done In XPath. Query can be written as:

for $x in /bank-2/account[balance>400]return <account-number> $x/@account-number

</account-number>

Page 320: Database System Concepts

320JYP

Path Expressions and FunctionsPath Expressions and Functions

Path expressions are used to bind variables in the for clause, but can also be used in other places E.g. path expressions can be used in let clause, to bind variables to

results of path expressions

The function distinct( ) can be used to removed duplicates in path expression results

The function document(name) returns root of named document

E.g. document(“bank-2.xml”)/bank-2/account

Aggregate functions such as sum( ) and count( ) can be applied to path expression results

XQuery does not support group by, but the same effect can be got by nested queries, with nested FLWR expressions within a result clause More on nested queries later

Page 321: Database System Concepts

321JYP

JoinsJoins Joins are specified in a manner very similar to SQL

for $a in /bank/account, $c in /bank/customer, $d in /bank/depositor

where $a/account-number = $d/account-number and $c/customer-name = $d/customer-name

return <cust-acct> $c $a </cust-acct> The same query can be expressed with the selections specified

as XPath selections:

for $a in /bank/account $c in /bank/customer

$d in /bank/depositor[ account-number = $a/account-number and customer-name = $c/customer-name] return <cust-acct> $c $a</cust-acct>

Page 322: Database System Concepts

322JYP

Changing Nesting StructureChanging Nesting Structure

The following query converts data from the flat structure for bank information into the nested structure used in bank-1

<bank-1> for $c in /bank/customer return

<customer> $c/* for $d in /bank/depositor[customer-name = $c/customer-name], $a in /bank/account[account-number=$d/account-number] return $a

</customer> </bank-1>

$c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag

Exercise for reader: write a nested query to find sum of accountbalances, grouped by branch.

Page 323: Database System Concepts

323JYP

XQuery Path ExpressionsXQuery Path Expressions

$c/text() gives text content of an element without any subelements/tags

XQuery path expressions support the “–>” operator for dereferencing IDREFs Equivalent to the id( ) function of XPath, but simpler to use

Can be applied to a set of IDREFs to get a set of results

June 2001 version of standard has changed “–>” to “=>”

Page 324: Database System Concepts

324JYP

Sorting in XQuery Sorting in XQuery

Sortby clause can be used at the end of any expression. E.g. to return customers sorted by name for $c in /bank/customer return <customer> $c/* </customer> sortby(name)

Can sort at multiple levels of nesting (sort by customer-name, and by account-number within each customer)

<bank-1> for $c in /bank/customer return

<customer> $c/* for $d in /bank/depositor[customer-name=$c/customer-name], $a in /bank/account[account-number=$d/account-number]

return <account> $a/* </account> sortby(account-number)</customer> sortby(customer-name)

</bank-1>

Page 325: Database System Concepts

325JYP

Functions and Other XQuery FeaturesFunctions and Other XQuery Features

User defined functions with the type system of XMLSchema function balances(xsd:string $c) returns list(xsd:numeric) { for $d in /bank/depositor[customer-name = $c], $a in /bank/account[account-number=$d/account-number] return $a/balance

}

Types are optional for function parameters and return values

Universal and existential quantification in where clause predicates

some $e in path satisfies P

every $e in path satisfies P

XQuery also supports If-then-else clauses

Page 326: Database System Concepts

326JYP

Application Program InterfaceApplication Program Interface

There are two standard application program interfaces to XML data:SAX (Simple API for XML)

Based on parser model, user provides event handlers for parsing events

– E.g. start of element, end of element

– Not suitable for database applicationsDOM (Document Object Model)

XML data is parsed into a tree representation Variety of functions provided for traversing the DOM tree E.g.: Java DOM API provides Node class with methods

getParentNode( ), getFirstChild( ), getNextSibling( ) getAttribute( ), getData( ) (for text node) getElementsByTagName( ), …

Also provides functions for updating DOM tree

Page 327: Database System Concepts

327JYP

The end of the courseThe end of the course

Thank you for your cooperation! Thank you for your cooperation!


Recommended