
www.vidyarthiplus.com RJ Edition


BE- CSE/IT REGULATION 2013 SEMESTER III

CS6302 Database Management Systems UNIT I-INTRODUCTION TO DBMS

PART A-2 MARKS

1. What is a database management system? Why do we need it?

A DBMS is a collection of interrelated data and a set of programs to access those data.

It provides facilities for controlling data access, enforcing data integrity, managing concurrency, recovering the database after failures and restoring it from backup files, and maintaining database security.

2. List any eight applications of DBMS.

a) Banking b) Airlines c) Universities d) Credit card transactions e) Telecommunication

f) Finance g) Sales h) Manufacturing i) Human resources

3. What are the disadvantages of file processing system?

The disadvantages of file processing systems are

Data redundancy and inconsistency

Difficulty in accessing data

Data isolation

Integrity problems

Atomicity problems

Concurrent access anomalies

4. Define entity, attribute and relationship with examples.

An entity represents a real-world object. Example: a customer.

An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each member of an entity set.

Example: customer name, customer id, customer street, customer city.

A relationship is an association among several entities.

Example: A depositor relationship associates a customer with each account that he/she has.



5. Define weak and strong entity sets.

Weak entity set: An entity set that does not have sufficient attributes of its own to form a primary key is called a weak entity set.

Strong entity set: An entity set that has a primary key is termed a strong entity set.

6. What is a data model? List the types of data model used.

A data model determines the manner in which data can be stored, organized, and manipulated in a database system.

Types of data model used

Hierarchical model

Network model

Relational model

Entity-relationship model

Object-relational model

Object model

7. Write short notes on relational model

The relational model uses a collection of tables to represent both data and the relationships among those data.

The relational model is an example of a record-based model.

Attribute: a column header

Tuple: a row

8. List the six fundamental operators and four additional operators in relational algebra.

Six fundamental operators: Selection (σ), Projection (π), Union (∪), Set Difference (−), Cartesian Product (×), Rename (ρ)

Four additional operators: Set Intersection (∩), Natural Join (⋈), Division (÷), Assignment (←)

9. What is the difference between BCNF and third normal form?

1. 3NF: R should be in 2NF.
   BCNF: R should be in 2NF.

2. 3NF: For every nontrivial functional dependency X ⟶ A in R, at least one of the following conditions is met:
   X is a key or superkey in R
   A is a prime attribute in R
   BCNF: For every nontrivial functional dependency X ⟶ A in R, X is a key or superkey in R. It is not sufficient for A to be a prime attribute.

3. 3NF: Not every 3NF relation is in BCNF.
   BCNF: Every BCNF relation is in 3NF.

10. What is the purpose of normalization?

The purpose of normalization is to reduce redundancy by decomposing relations into smaller, well-structured relations, which avoids insertion, update, and deletion anomalies.

11. What is meant by join dependency?

A table T is subject to a join dependency if T can always be recreated by joining multiple tables, each having a subset of the attributes of T. If one of the tables in the join has all the attributes of the table T, the join dependency is called trivial.

12. What are the axioms of functional dependency?

Reflexivity: e.g., ssn, name -> ssn

Augmentation: e.g., if ssn -> name, then ssn, grade -> name, grade

Transitivity: e.g., if ssn -> address and address -> county-tax-rate, then ssn -> county-tax-rate
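The axioms above are usually applied mechanically by computing the closure of a set of attributes. The following is a minimal Python sketch of that idea, not a definitive implementation. The function name `closure` and the encoding of dependencies as pairs of tuples are assumptions made for illustration; the dependencies themselves are the ones used in the examples above.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by `attrs`.

    `fds` is a list of (lhs, rhs) pairs; each side is a tuple of attribute names.
    Repeatedly applying a dependency whose left side is already covered has the
    same effect as chaining reflexivity, augmentation and transitivity.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# Dependencies taken from the examples in the text (names use underscores here).
fds = [
    (("ssn",), ("name",)),
    (("ssn",), ("address",)),
    (("address",), ("county_tax_rate",)),
]

ssn_closure = closure({"ssn"}, fds)
print(sorted(ssn_closure))
```

Because `county_tax_rate` ends up in the closure of `ssn`, the dependency ssn -> county-tax-rate follows from the given ones, which is exactly the transitivity example.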



PART B

16 MARKS

1. Explain the purpose of database systems and the disadvantages of DBMS.

The typical file-processing system is supported by a conventional operating system.

It stores permanent records in various files, and it needs different application programs to extract records from, and add records to, the appropriate files.

Disadvantages of file-processing systems:

1. Data redundancy and inconsistency:

Storing the same data multiple times is called data redundancy.

This redundancy leads to several problems.

• Need to perform a single logical update multiple times.

• Storage space is wasted.

• Files that represent the same data may become inconsistent.

Data inconsistency means that the various copies of the same data may no longer agree.

Example:

One user group may enter a student's birth date erroneously as JAN-19-1984, whereas the other user groups may enter the correct value of JAN-29-1984.

2. Difficulty in accessing data

File-processing environments do not allow needed data to be retrieved in a convenient and efficient manner.

Example:

Suppose that one of the bank officers needs to find out the names of all customers who live within a particular area. The bank officer now has two choices: either obtain the list of all customers and extract the needed information manually, or ask a system programmer to write the necessary application program. Both alternatives are obviously unsatisfactory. Suppose that such a program is written, and that, several days later, the same officer needs to trim that list to include only those customers who have an account balance of $10,000 or more. A program to generate such a list does not exist. Again, the officer has the preceding two options, neither of which is satisfactory.



3. Data isolation

Because data are scattered in various files, and files may be in different formats, writing new application programs to retrieve the appropriate data is difficult.

4. Integrity problems

The data values stored in the database must satisfy certain types of consistency constraints.

Example: The balance of certain types of bank accounts may never fall below a prescribed amount ($25). Developers enforce these constraints in the system by adding appropriate code in the various application programs.
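In contrast to scattering such checks across application programs, a DBMS lets the constraint be declared once in the schema. The sketch below uses Python's built-in sqlite3 module; the table name `account`, its columns, and the account number `A-101` are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The minimum balance of $25 is declared once, as a CHECK constraint.
conn.execute(
    "CREATE TABLE account ("
    " account_no TEXT PRIMARY KEY,"
    " balance REAL NOT NULL CHECK (balance >= 25))"
)
conn.execute("INSERT INTO account VALUES ('A-101', 500)")

# An update that would violate the constraint is rejected by the database itself.
try:
    conn.execute("UPDATE account SET balance = 10 WHERE account_no = 'A-101'")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

balance = conn.execute(
    "SELECT balance FROM account WHERE account_no = 'A-101'"
).fetchone()[0]
print(rejected, balance)
```

Every program that touches the table is subject to the same rule, so a new application cannot forget to enforce it.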

5. Atomicity problems

Atomic means the transaction must happen in its entirety or not at all.

Example:

Consider a program to transfer $50 from account A to account B. If a system failure occurs during the execution of the program, it is possible that the $50 was removed from account A but was not credited to account B, resulting in an inconsistent database state.
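The transfer scenario can be sketched with a database transaction, which is how a DBMS avoids the half-finished state. This is an illustrative sketch using Python's sqlite3 module; the `account` table, the starting balances of 100, and the `fail_midway` flag that simulates the system failure are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 100)])
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst as one all-or-nothing unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated system failure")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except RuntimeError:
        pass  # the debit has already been rolled back


transfer(conn, "A", "B", 50, fail_midway=True)
after_failure = dict(conn.execute("SELECT name, balance FROM account"))

transfer(conn, "A", "B", 50)
after_success = dict(conn.execute("SELECT name, balance FROM account"))
print(after_failure, after_success)
```

After the simulated failure both balances are unchanged, so the $50 is never removed from A without being credited to B.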

6. Concurrent-access anomalies

Multiple users may update the data simultaneously.

The interaction of concurrent updates may result in inconsistent data.

Example: When several reservation clerks try to assign a seat on an airline flight, the system should ensure that each seat can be accessed by only one clerk at a time for assignment to a passenger.

7. Security problems

Enforcing security constraints in a file-processing system is difficult.

Disadvantages of DBMS

The disadvantages of the database approach are:

• Complexity • Size • Cost of DBMSs • Additional hardware costs • Cost of conversion

• Performance • Higher impact of a failure

2. Explain the different kinds of data models.

The data model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.



A data model provides a way to describe the design of a database at the physical, logical and view levels.

The purpose of a data model is to represent data and to make the data understandable.

According to the types of concepts used to describe the database structure, there are three data models:

1. An external data model, to represent each user's view of the organization.

2. A conceptual data model, to represent the logical view that is DBMS independent.

3. An internal data model, to represent the conceptual schema in such a way that it can be understood by the DBMS.

Categories of data model:

1. Record-based data models

2. Object-based data models

3. Physical-data models.

The first two are used to describe data at the conceptual and external levels; the last is used to describe data at the internal level.

1. Record - Based data models

In a record-based model, the database consists of a number of fixed format records possibly of differing types.

Each record type defines a fixed number of fields, each typically of a fixed length.

There are three types of record-based logical data model.

• Hierarchical data model.

• Network data model

• Relational data model

Hierarchical data model

In the hierarchical model, data is represented as collections of records and relationships are represented by sets.

The hierarchical model allows a node to have only one parent.

A hierarchical model can be represented as a tree graph, with records appearing as nodes, also called segments, and sets as edges.



Network data model

In the network model, data is represented as collections of records and relationships are represented by sets. Each set is composed of at least two record types:

An owner record that is equivalent to the hierarchical model's parent

A member record that is equivalent to the hierarchical model's child

A set represents a 1 :M relationship between the owner and the member.

Relational data model:

The relational data model is based on the concept of mathematical relations.

Relational model stores data in the form of a table.

Each table corresponds to an entity, and each row represents an instance of that entity.

Tables, also called relations, are related to each other through the sharing of a common entity characteristic.

Example: Relational DBMSs include DB2, Oracle, and MS SQL Server.

2. Object - Based Data Models

Object-based data models use concepts such as entities, attributes, and relationships.

An entity is a distinct object in the organization that is to be represented in the database.

An attribute is a property that describes some aspect of the object, and a relationship is an association between entities.

Common types of object-based data model are:

• Entity - Relationship model

• Object - oriented model

• Semantic model

Entity - Relationship Model:

The ER model is based on the following components:

Entity: An entity is defined as anything about which data are to be collected and stored.



Each entity is described by a set of attributes that describes particular characteristics of the entity.

Object oriented model:

In the object-oriented data model (OODM) both data and their relationships are contained in a single structure known as an object.

An object includes information about relationships between the facts within the object, as well as information about its relationships with other objects.

The OODM is said to be a semantic data model because semantic indicates meaning.

The OO data model is based on the following components:

An object is an abstraction of a real-world entity.

Attributes describe the properties of an object.

3. Briefly explain about Entity-Relationship model:

The semantic aspect of the model lies in its representation of the meaning of the data. The E-R model is very useful in mapping the meanings and interactions of real-world enterprises onto a conceptual schema.

Entity-relationship diagrams (ERDs) represent three main components: entities, attributes and relationships.

Entity sets:

An entity is a thing or object in the real world that is distinguishable from all other objects.

Example: Each person in an enterprise is an entity.

An entity has a set of properties, and the values for some set of properties may uniquely identify an entity.

Example: A person may have a person-id property whose value uniquely identifies that person.

Relationship sets:

A relationship is an association among several entities.

Example: A relationship that associates customer Smith with loan L-16 specifies that Smith is a customer with loan number L-16.



Attributes:

For each attribute, there is a set of permitted values, called the domain, or value set, of that attribute.

Example: The domain of attribute customer name might be the set of all text strings of a certain length.

An attribute of an entity set is a function that maps from the entity set into a domain.

An attribute can be characterized by the following attribute types:

• Simple and composite attributes.

• Single valued and multi valued attributes.

• Derived attribute.

Simple attribute (atomic attribute)

An attribute composed of a single component with an independent existence is called a simple attribute.

Simple attributes cannot be further subdivided into smaller components.

Composite attribute

An attribute composed of multiple components, each with an independent existence, is called a composite attribute.

Example: The address attribute of the branch entity can be subdivided into street, city, and postcode attributes.

Single-valued Attributes:

An attribute that holds a single value for each occurrence of an entity type is called a single-valued attribute.

Example: Each occurrence of the Branch entity type has a single value for the branch number (branchNo) attribute (for example, B003).

Multi-valued attribute

An attribute that holds multiple values for each occurrence of an entity type is called a multi-valued attribute.

Example: Each occurrence of the Branch entity type can have multiple values for the telNo attribute (for example, branch number B003 has telephone numbers 0141-339-2178 and 0141-339-4439).

Derived attribute

An attribute that represents a value that is derivable from the value of a related attribute or set of attributes, not necessarily in the same entity type, is called a derived attribute.



4. Briefly explain 1NF, 2NF and 3NF.

First normal form: A relation is in first normal form if all its attributes are simple. In other words, none of the attributes of the relation is itself a relation.

Example 1. Assume the following relations:

Student-courses (Sid:pk, Sname, Phone, Courses-taken)

Course-taken (Course-id:pk, Course-description, Credit-hours, Grade)

According to the definition of first normal form, relation Student-courses is not in first normal form because one of its attributes, Courses-taken, is itself a table and is not a simple attribute.

Student-courses

Sid   Sname     Phone      Courses-taken
100   John      487 2454   St-100-courses-taken
200   Smith     671 8120   St-200-courses-taken
300   Russell   871 2356   St-300-courses-taken

St-100-Course-taken

Course-id   Course-description      Credit-hours   Grade
IS380       Database Concepts       3              A
IS416       Unix Operating System   3              B

St-200-Course-taken

Course-id   Course-description      Credit-hours   Grade
IS380       Database Concepts       3              B
IS416       Unix Operating System   3              B
IS420       Data Net Work           3              C

St-300-Course-taken

Course-id   Course-description      Credit-hours   Grade
IS417       System Analysis         3              A

To bring the data into first normal form, the nested course tables are flattened into the student rows, giving the relation:

Student-courses (Sid:pk1, Sname, Phone, Course-id:pk2, Course-description, Credit-hours, Grade)

Notice that the primary key of this table is a composite key made up of two parts, Sid and Course-id. Here pk1 following an attribute indicates that the attribute is the first part of the primary key, and pk2 indicates that the attribute is the second part of the primary key.

Student-courses

Sid   Sname     Phone      Course-id   Course-description      Credit-hours   Grade
100   John      487 2454   IS380       Database Concepts       3              A
100   John      487 2454   IS416       Unix Operating System   3              B
200   Smith     671 8120   IS380       Database Concepts       3              B
200   Smith     671 8120   IS416       Unix Operating System   3              B
200   Smith     671 8120   IS420       Data Net Work           3              C
300   Russell   871 2356   IS417       System Analysis         3              A

The new relation Student-courses still suffers from all three anomalies for the following reasons:

1. Insertion anomaly: We cannot add a new course such as IS247 with course description Programming Techniques to the database unless we add a student who takes the course.

2. Update anomaly: If we change the course description for IS380 from Database Concepts to New_Database_Concepts, we have to make changes in more than one place or else the database will be inconsistent.

3. Deletion anomaly: If student Russell is deleted from the database, we also lose the information that we had on course IS417 with description System_Analysis.

Second normal form: A first normal form relation is in second normal form if all its non-primary attributes are fully functionally dependent on the primary key.

In the Student-courses relation both Sid and Course-id are primary attributes because they are components of the primary key. To convert Student-courses to second normal form relations, we have to make all non-primary attributes fully functionally dependent on the primary key.

PROJECT Student-courses ON (Sid, Sname, Phone) creates a table; call it Student. The relation will be Student (Sid:pk, Sname, Phone).

PROJECT Student-courses ON (Sid, Course-id, Grade) creates a table; call it Student-grade. The relation will be Student-grade (Sid:pk1:fk:Student, Course-id:pk2:fk:Courses, Grade).

PROJECT Student-courses ON (Course-id, Course-description, Credit-hours) creates a table; call it Courses. The relation will be Courses (Course-id:pk, Course-description, Credit-hours).

Student (Sid:pk, Sname, Phone)

Sid   Sname     Phone
100   John      487 2454
200   Smith     671 8120
300   Russell   871 2356

Courses (Course-id:pk, Course-description, Credit-hours)

Course-id   Course-description      Credit-hours
IS380       Database Concepts       3
IS416       Unix Operating System   3
IS420       Data Net Work           3
IS417       System Analysis         3

Student-grade (Sid:pk1:fk:Student, Course-id:pk2:fk:Courses, Grade)

Sid   Course-id   Grade   Marks
100   IS380       A       90
100   IS416       B       80
200   IS380       B       80
200   IS416       B       80
200   IS420       C       70
300   IS417       A       90

Further, these three relations are free from all anomalies:

Insertion anomaly: A new course with course-id IS247 and its course description can now be inserted into the table Courses. Equally, we can add any new student to the database by adding the student's id, name and phone to the Student table. Therefore our database, which is made up of these three tables, does not suffer from the insertion anomaly.

Update anomaly: Since the redundancy of the data was eliminated, no update anomaly can occur. To change the course description for IS380, only one change is needed in the table Courses.

Deletion anomaly: The deletion of student Russell from the database is achieved by deleting Russell's records from both the Student and Student-grade relations. This does not have any side effect because the course IS417 remains untouched in the table Courses.

4. Discuss 3NF.

Third normal form: A second normal form relation is in third normal form if no non-primary attribute is transitively dependent on the primary key.

Assume the relation STUDENT (Sid:pk, Activity, Fee), in which Activity -> Fee, that is, the activity determines the fee.

STUDENT

Sid   Activity   Fee
100   Swimming   100
200   Tennis     100
300   Golf       300
400   Swimming   100

Table STUDENT is in first normal form because all its attributes are simple. STUDENT is also in second normal form because all its non-primary attributes are fully functionally dependent on the primary key (Sid). Table STUDENT suffers from all three anomalies: a new student cannot be added to the database unless he/she takes an activity, and no activity can be inserted into the database unless we get a student to take that activity. There is redundancy in the table (see Swimming), so to change the fee for Swimming we must make changes in more than one place, which causes the update anomaly. If student 300 is deleted from the table, we also lose the fact that the Golf activity has a fee of 300.

To overcome these anomalies, the STUDENT table should be converted to smaller tables. Consider the following three projections of the STUDENT relation.

PROJECT STUDENT ON [Sid, Activity] gives a relation named STUD-AVT (Sid:pk, Activity) with the following data:

STUD-AVT

Sid   Activity
100   Swimming
200   Tennis
300   Golf
400   Swimming

PROJECT STUDENT ON [Activity, Fee] gives a relation named AVT-Fee (Activity:pk, Fee) with the following data (the projection eliminates the duplicate Swimming row):

AVT-Fee

Activity   Fee
Swimming   100
Tennis     100
Golf       300

PROJECT STUDENT ON [Sid, Fee] gives a relation named Sid-Fee (Sid:pk, Fee) with the following data:

Sid-Fee

Sid   Fee
100   100
200   100
300   300
400   100

The projections to keep are STUD-AVT and AVT-Fee, because the join of these two projections on the common attribute Activity recreates the original STUDENT table. Such projections are called non-loss projections.

In contrast, the join of projections Sid-Fee and AVT-Fee on their common attribute Fee generates erroneous data that were not in the original STUDENT table; such projections are called loss projections. Following is the join of projections Sid-Fee and AVT-Fee on their common attribute Fee:

Sid   Activity   Fee
100   Swimming   100
100   Tennis     100
200   Tennis     100
200   Swimming   100
300   Golf       300
400   Swimming   100
400   Tennis     100
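The difference between the non-loss and loss projections can be checked mechanically. The following Python sketch represents each relation as a set of tuples and uses the STUDENT data from the example; the variable names are assumptions made for illustration.

```python
# STUDENT (Sid, Activity, Fee) from the example.
student = {
    (100, "Swimming", 100),
    (200, "Tennis", 100),
    (300, "Golf", 300),
    (400, "Swimming", 100),
}

# The three projections; using sets removes duplicate rows automatically.
stud_avt = {(sid, act) for sid, act, fee in student}
avt_fee = {(act, fee) for sid, act, fee in student}
sid_fee = {(sid, fee) for sid, act, fee in student}

# Join STUD-AVT with AVT-Fee on the common attribute Activity.
lossless = {
    (sid, act, fee)
    for sid, act in stud_avt
    for act2, fee in avt_fee
    if act == act2
}

# Join Sid-Fee with AVT-Fee on the common attribute Fee.
lossy = {
    (sid, act, fee)
    for sid, fee1 in sid_fee
    for act, fee in avt_fee
    if fee1 == fee
}

print(lossless == student)
print(sorted(lossy - student))
```

The first join returns exactly the original four rows, while the second produces seven rows, three of which were never in STUDENT.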

Both projections STUD-AVT and AVT-Fee are in third normal form and they do not suffer from any anomalies.

5. Briefly explain the fundamental relational algebra operations.

The select, project and rename operations are called unary operations, because they operate on one relation.

The union, Cartesian product, and set difference operations operate on pairs of relations and are called binary operations.

Staff

Staffno   Name    Position     Sex   DOB        Salary   Branchno
SL21      John    Manager      M     1-9-45     30000    B005
SG37      Ann     Assistant    F     10-11-60   20000    B003
SG14      David   Supervisor   M     4-3-58     18000    B003
SA9       Mary    Assistant    F     3-6-40     12000    B007
SG9       Julie   Manager      F     4-5-70     9000     B003
SL41      Susan   Assistant    F     6-8-80     20000    B005

Branch

Branchno   Street         City       Postcode
B005       22 Deer Rd     London     SW1 4EH
B007       16 Argyll St   Aberdeen   AB2 3SU
B003       163 Main St    Glasgow    G11 9QX
B004       32 Manse Rd    Bristol    BS99 1NZ
B002       56 Clover Dr   London     NW10 6EU

PropertyForRent

PropertyNo   Street          City       Postcode   Type    Rooms   Rent   OwnerNo   Staffno   Branchno
PA14         16 Holhead      Aberdeen   AB7 5SU    House   6       650    CO46      SA9       B007
PL94         6 Argyll St     London     NW2        Flat    4       400    CO87      SL41      B005
PG4          6 Lawrence St   Glasgow    G11 9QX    Flat    3       350    CO40                B003
PG36         2 Manor Rd      Glasgow    G32 4QX    Flat    3       375    CO93      SG37      B003
PG21         18 Dale Rd      Glasgow    G12        House   5       600    CO87      SG37      B003
PG16         5 Novar Dr      Glasgow    G12 0QX    Flat    4       450    CO93      SG14      B003

Client

ClientNo   Name    telNo           prefType   maxRent
CR76       John    0207-774-5632   Flat       425
CR56       Aline   0141-848-1825   Flat       350
CR74       Mike    01475-392178    House      750
CR62       Mary    01224-196720    Flat       600

Viewing

clientNo   propertyNo   viewDate   Comment
CR56       PA14         24-05-01   Too small
CR76       PG4          20-04-01   Too remote
CR56       PG4          26-05-01
CR62       PA14         14-05-01   No dining room
CR56       PG36         28-04-01

clientNo   branchNo   staffNo   dateJoined
CR76       B005       SL41      02-01-01
CR56       B003       SG37      11-04-00
CR74       B003       SG37      16-11-99
CR62       B007       SA9       07-03-00

Selection (or Restriction) (σ)

The selection operation works on a single relation R and defines a relation that contains only those tuples of R that satisfy the specified condition (predicate).

Syntax:

σ predicate (R)

Example:

List all staff with a salary greater than 10000.

Sol:

σ salary > 10000 (Staff)

The input relation is Staff and the predicate is salary > 10000. The selection operation defines a relation containing only those Staff tuples with a salary greater than 10000.

Staffno   Name    Position     Sex   DOB        Salary   Branchno
SL21      John    Manager      M     1-9-45     30000    B005
SG37      Ann     Assistant    F     10-11-60   20000    B003
SG14      David   Supervisor   M     4-3-58     18000    B003
SA9       Mary    Assistant    F     3-6-40     12000    B007
SL41      Susan   Assistant    F     6-8-80     20000    B005



Projection (π):

The projection operation works on a single relation R and defines a relation that contains a vertical subset of R, extracting the values of specified attributes and eliminating duplicates.

Syntax: π a1, ..., an (R)

Example:

Produce a list of salaries for all staff, showing only the staffNo, name and salary.

Sol:

π staffNo, name, salary (Staff)

The result of this operation is

Staffno   Name    Salary
SL21      John    30000
SG37      Ann     20000
SG14      David   18000
SA9       Mary    12000
SG9       Julie   9000
SL41      Susan   20000

Rename (ρ):

Rename operation can rename either the relation name or the attribute names or both

Syntax: ρ S(B1, B2, ..., Bn) (R) or ρ S (R) or ρ (B1, B2, ..., Bn) (R)

S is the new relation name and B1, B2, ..., Bn are the new attribute names.

The first expression renames both the relation and its attributes, the second renames the relation only, and the third renames the attributes only. If the attributes of R are (A1, A2, ..., An) in that order, then each Ai is renamed as Bi.
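The three unary operations described above can be sketched in a few lines of Python, treating a relation as a list of rows. This is an illustrative sketch only: the helper names `select`, `project` and `rename` are assumptions, and the rows are a subset of the columns of the Staff relation shown earlier.

```python
staff = [
    {"staffNo": "SL21", "name": "John", "salary": 30000},
    {"staffNo": "SG37", "name": "Ann", "salary": 20000},
    {"staffNo": "SG14", "name": "David", "salary": 18000},
    {"staffNo": "SA9", "name": "Mary", "salary": 12000},
    {"staffNo": "SG9", "name": "Julie", "salary": 9000},
    {"staffNo": "SL41", "name": "Susan", "salary": 20000},
]


def select(relation, predicate):
    """Selection: keep only the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection: keep only the named attributes and eliminate duplicates."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


def rename(relation, mapping):
    """Rename: replace attribute names according to `mapping`."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in relation]


high_paid = select(staff, lambda row: row["salary"] > 10000)
distinct_salaries = project(staff, ["salary"])
renamed = rename(project(staff, ["staffNo", "salary"]), {"salary": "pay"})
print([row["name"] for row in high_paid])
```

Projecting on salary alone yields five rows rather than six, because the two 20000 values collapse into one; this is the duplicate elimination mentioned in the definition of projection.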

6. Explain about Multi-valued dependencies and Fourth Normal Form:

Multi-valued dependency (MVD)

• A multi-valued dependency occurs when a determinant determines more than one dependent, and the dependents are independent of each other.

• Example: course multi-determines teacher and course multi-determines text, where teacher and text are independent.

• A relvar with course, teacher and text is all key and exhibits redundancy, but is in 3NF. Updates can exhibit anomalies.

• Relvar R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial multi-valued dependency A multi-determines B is satisfied, all attributes of R are also functionally dependent on A.

4th Normal Form

A table is in fourth normal form (4NF) if and only if it is in BCNF and contains no more than one multi-valued dependency.

1. Anomalies can occur in relations in BCNF if there is more than one multi-valued dependency.

2. If A ->> B and A ->> C but B and C are unrelated, i.e., A ->> (B, C) is false, then we have more than one multi-valued dependency.

Example: Assume the following relation with multi-valued dependencies:

Employee (Eid:pk1, Language:pk2, Skill:pk3)

Eid ->> Language

Eid ->> Skill

Language and Skill are independent.

Eid   Language   Skill
100   English    Teaching
100   Kurdish    Politic
100   English    Politic
100   Kurdish    Teaching
200   Arabic     Singing

This relation is not in fourth normal form and suffers from all three types of anomalies.

Insertion anomaly: To insert the row (200, English, Cooking) we have to insert two extra rows, (200, Arabic, Cooking) and (200, English, Singing); otherwise the database will be inconsistent. The table will then be as follows:

Eid   Language   Skill
100   English    Teaching
100   Kurdish    Politic
100   English    Politic
100   Kurdish    Teaching
200   Arabic     Singing
200   English    Cooking
200   Arabic     Cooking
200   English    Singing

Deletion anomaly: If employee 100 discontinues the Politic skill, we have to delete two rows, (100, Kurdish, Politic) and (100, English, Politic); otherwise the database will be inconsistent.

Update anomaly: If employee 200 changes his skill from Singing to Dancing, we have to make changes in more than one place.

The relation is projected to the following two non-loss projections, which are in fourth normal form:

Employee_Language (Eid:pk1, Language:pk2)

Eid   Language
100   English
100   Kurdish
200   Arabic

Employee_Skill (Eid:pk1, Skill:pk2)

Eid   Skill
100   Teaching
100   Politic
200   Singing
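The insertion anomaly above comes from the requirement that, for each employee, every language must appear with every skill. The Python sketch below checks that condition on the Employee data from the example; the function name `satisfies_mvd` is an assumption made for illustration.

```python
from itertools import product

# Employee (Eid, Language, Skill) from the example.
employee = {
    (100, "English", "Teaching"),
    (100, "Kurdish", "Politic"),
    (100, "English", "Politic"),
    (100, "Kurdish", "Teaching"),
    (200, "Arabic", "Singing"),
}


def satisfies_mvd(rows):
    """Check Eid ->> Language and Eid ->> Skill with the two being independent.

    For every Eid, the stored (Language, Skill) pairs must equal the full
    cross product of that employee's languages and skills.
    """
    for eid in {r[0] for r in rows}:
        langs = {lang for e, lang, skill in rows if e == eid}
        skills = {skill for e, lang, skill in rows if e == eid}
        pairs = {(lang, skill) for e, lang, skill in rows if e == eid}
        if pairs != set(product(langs, skills)):
            return False
    return True


ok_original = satisfies_mvd(employee)

# Inserting only (200, English, Cooking) leaves the relation inconsistent.
partial = employee | {(200, "English", "Cooking")}
ok_partial = satisfies_mvd(partial)

# Adding the two extra rows named in the text restores consistency.
complete = partial | {(200, "Arabic", "Cooking"), (200, "English", "Singing")}
ok_complete = satisfies_mvd(complete)
print(ok_original, ok_partial, ok_complete)
```

In the two 4NF projections the same change needs only one new row in each table, which is why the decomposition removes the anomaly.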



UNIT II-SQL & QUERY OPTIMIZATION

PART A -2 MARKS

1. What is embedded SQL? What are its advantages?

Embedded SQL is a method of combining the computing power of a programming language and the database manipulation capabilities of SQL.

The embedded SQL statements are parsed by an embedded SQL preprocessor and replaced by host-language calls to a code library.

The output from the preprocessor is then compiled by the host compiler.

Host languages include C/C++, COBOL and Fortran.

2. What are the categories of SQL commands?

SQL commands are divided into the following categories:

1. Data definition language 2. Data manipulation language 3. Data query language

4. Data control language 5. Data administration statements 6. Transaction control statements

3. What are the differences between drop, truncate and delete?

The DELETE command is used to remove rows from a table.

The DROP command is used to remove a table from the database / data dictionary.

The TRUNCATE command is used to remove all rows from a table.

4. What are the measures of quality of disk?

The main measures of the quality of a disk are capacity, access time, data transfer rate, and reliability.

5. What is meant by query optimization?

The phase that identifies an efficient execution plan for evaluating a query that has the least estimated cost is referred to as query optimization.

6. What is meant by the term heuristic optimization?

Heuristic optimization is a form of query processing in which the query tree (algebra tree) is transformed using a set of predefined rules that improve the query's performance. A typical rule is to perform selections as early as possible to reduce the load.

7. What is meant by DDL? What commands are used?

DDL: A database schema is specified by a set of definitions expressed in a special language called a data definition language.

Commands: CREATE, ALTER, TRUNCATE, DROP

8. What is meant by DML? What commands are used?

DML: A data manipulation language is a language that enables users to access or manipulate data as organized by the appropriate data model.

Commands: INSERT, UPDATE, DELETE, SELECT
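The split between DDL and DML commands can be seen in a short session. The sketch below runs the statements through Python's sqlite3 module; the `student` table reuses names from the normalization example, while the updated phone value is made up for illustration. SQLite has no TRUNCATE command, so it is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and change the schema.
conn.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, sname TEXT)")
conn.execute("ALTER TABLE student ADD COLUMN phone TEXT")

# DML: insert, modify, delete and retrieve the data.
conn.execute("INSERT INTO student VALUES (100, 'John', '487 2454')")
conn.execute("INSERT INTO student VALUES (200, 'Smith', '671 8120')")
conn.execute("UPDATE student SET phone = '000 0000' WHERE sid = 200")
conn.execute("DELETE FROM student WHERE sid = 100")
rows = conn.execute("SELECT sid, sname, phone FROM student").fetchall()

# DDL again: DROP removes the table itself, not just its rows.
conn.execute("DROP TABLE student")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(rows, tables)
```

DELETE leaves an empty or partly filled table behind, whereas after DROP the table no longer appears in the catalog at all.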

9. What is the use of the group by clause?

The group by clause is used to apply aggregate functions to a set of tuples. The attributes given in the group by clause are used to form groups. Tuples with the same value on all attributes in the group by clause are placed in one group.
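As an illustration of grouping, the sketch below loads three columns of the Staff relation from Unit I into an in-memory SQLite database and aggregates per branch. The use of Python's sqlite3 module and the choice of COUNT and SUM are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staffNo TEXT, branchNo TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("SL21", "B005", 30000),
        ("SG37", "B003", 20000),
        ("SG14", "B003", 18000),
        ("SA9", "B007", 12000),
        ("SG9", "B003", 9000),
        ("SL41", "B005", 20000),
    ],
)

# Tuples with the same branchNo form one group; the aggregates run per group.
rows = conn.execute(
    "SELECT branchNo, COUNT(*), SUM(salary) "
    "FROM staff GROUP BY branchNo ORDER BY branchNo"
).fetchall()
print(rows)
```

Each output row corresponds to one group, so six staff tuples are reduced to three rows, one per branch.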

10. What is the use of subqueries?

A subquery is a select-from-where expression that is nested within another query. A common use of subqueries is to perform tests for set membership, make set comparisons, and determine set cardinality.
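Two of these uses can be sketched with the Staff and Branch relations from Unit I. The example below uses Python's sqlite3 module and keeps only the columns it needs; the queries themselves are illustrative choices, not taken from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branch (branchNo TEXT, city TEXT)")
conn.execute("CREATE TABLE staff (name TEXT, salary INTEGER, branchNo TEXT)")
conn.executemany(
    "INSERT INTO branch VALUES (?, ?)",
    [("B005", "London"), ("B007", "Aberdeen"), ("B003", "Glasgow"),
     ("B004", "Bristol"), ("B002", "London")],
)
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("John", 30000, "B005"), ("Ann", 20000, "B003"), ("David", 18000, "B003"),
     ("Mary", 12000, "B007"), ("Julie", 9000, "B003"), ("Susan", 20000, "B005")],
)

# Set membership: staff whose branch is in the set of London branches.
in_london = conn.execute(
    "SELECT name FROM staff "
    "WHERE branchNo IN (SELECT branchNo FROM branch WHERE city = 'London') "
    "ORDER BY name"
).fetchall()

# Comparison against a value computed by a subquery: above-average salaries.
above_average = conn.execute(
    "SELECT name FROM staff "
    "WHERE salary > (SELECT AVG(salary) FROM staff) "
    "ORDER BY name"
).fetchall()
print(in_london, above_average)
```

In both queries the inner select is evaluated to a set or a single value, and the outer query then tests each staff tuple against it.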

11. What is view in SQL? How is it defined?

Any relation that is not part of the logical model, but is made visible to a user as a virtual relation, is called a view. We define a view in SQL by using the create view command. The form of the create view command is

create view v as <query expression>
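A short sketch may help show that a view is virtual rather than a stored copy. It uses Python's sqlite3 module and a few Staff rows from Unit I; the view name `b003_staff` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staff (staffNo TEXT, name TEXT, salary INTEGER, branchNo TEXT)"
)
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?, ?)",
    [
        ("SL21", "John", 30000, "B005"),
        ("SG37", "Ann", 20000, "B003"),
        ("SG14", "David", 18000, "B003"),
    ],
)

# The view stores only the query expression, not the rows.
conn.execute(
    "CREATE VIEW b003_staff AS "
    "SELECT staffNo, name FROM staff WHERE branchNo = 'B003'"
)
before = conn.execute("SELECT * FROM b003_staff ORDER BY staffNo").fetchall()

# A later change to the base table is visible through the view.
conn.execute("INSERT INTO staff VALUES ('SG9', 'Julie', 9000, 'B003')")
after = conn.execute("SELECT * FROM b003_staff ORDER BY staffNo").fetchall()
print(before, after)
```

The view also hides the salary column, which is one reason views are used to restrict what a user can see.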

12. Describe SQL objects.

SQL objects are schemas, data dictionaries, journals, catalogs, tables, aliases, views, indexes, constraints, triggers, sequences, stored procedures, user-defined functions, user-defined types, and SQL packages. SQL creates and maintains these objects as system objects.



PART B

1. Explain about Embedded SQL.

A language in which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language constitute embedded SQL.

Purpose of embedded SQL

• Not all queries can be expressed in SQL, since SQL does not provide the full expressive power of a general purpose language.

• Non declarative actions cannot be done from within SQL.

To identify embedded SQL requests to the preprocessor, the EXEC SQL statement is used.

Syntax

EXEC SQL <embedded SQL statement>;

Before executing any SQL statements, the program must first connect to the database by using

EXEC SQL CONNECT :username USING <connect string>;

SQLCA

The DBMS uses an SQL Communications Area (SQLCA) to report runtime errors to the application program.

The SQLCA is a data structure that contains error variables and status indicators.

Syntax:

EXEC SQL INCLUDE SQLCA;

Whenever Statement

The WHENEVER statement is a directive to the precompiler to automatically generate code to handle errors after every SQL statement.

Syntax:

EXEC SQL WHENEVER <condition> <action>

The condition can be one of the following:

• SQLERROR tells the precompiler to generate code to handle errors (SQLCODE < 0).

• SQLWARNING tells the precompiler to generate code to handle warnings (SQLCODE > 0).

• NOT FOUND tells the precompiler to generate code to handle the specific warning that a retrieval operation has found no more records.

The action can be



• Continue, to ignore the condition and proceed to the next statement.

• Do, to transfer control to an error handling function.

• Do break, to place an actual 'break' statement in the program.

• Do continue, to place an actual 'continue' statement in the program.

• Goto label, to transfer control to the specified label.

• STOP, to rollback all uncommitted work and terminate the program.

Host language variables

All host variables must be declared to SQL in a BEGIN DECLARE SECTION..... END DECLARE SECTION block.

This block must appear before any of the variables are used in an embedded SQL statement.

2. Explain about SQL Fundamentals:

Structured Query Language (SQL) is the standard command set used to communicate with relational database management systems.

Advantages of SQL:

• SQL is a high level language that provides a greater degree of abstraction than procedural languages.

• Increased acceptance and availability of SQL.

• Applications written in SQL can be easily ported across systems.

• SQL as a language is independent of the way it is implemented internally.

• Simple and easy to learn.

• The set-at-a-time feature of SQL makes it more powerful than the record-at-a-time processing technique.

• SQL can handle complex situations.

SQL data types:

SQL supports the following data types.

• CHAR(n) - fixed length string of exactly 'n' characters.

• VARCHAR(n) - varying length string whose maximum length is 'n' characters.

• FLOAT - floating point number.



Types of SQL commands:

SQL statements are divided into the following categories:

• Data Definition Language (DDL): used to create, alter and delete database objects.

• Data Manipulation Language (DML): used to insert, modify and delete the data in the database.

• Data Query Language (DQL): enables the users to query one or more tables to get the information they want.

• Data Control Language (DCL): controls the user access to the database objects.

• Transaction control statements (TCS): manage all the changes made by the DML statements.

SQL operators:

Arithmetic operators - are used to add, subtract, multiply, divide and negate data value (+, -, *, /).

Comparison operators - are used to compare one expression with another. Some comparison operators are =, >, >=, <, <=, IN, ANY, ALL, SOME, BETWEEN, EXISTS, and so on.

Logical operators - are used to produce a single result from combining the two separate conditions. The logical operators are AND, OR and NOT.

Set operators - combine the results of two separate queries into a single result. The set operators are UNION, UNION ALL, INTERSECT, MINUS and so on.

3. Explain about data integrity constraints.

Data integrity refers to the correctness and completeness of the data in a database.

An integrity constraint is a mechanism used to prevent invalid data entry into the table.

The various types of integrity constraints are

1) Domain integrity constraints

2) Entity integrity constraints

3) Referential integrity constraints.

Domain integrity constraints

These constraints set a range, and any violations that take place will prevent the user from performing the manipulation.

There are two types of domain integrity constraints

• Not null constraint



• Check constraint

Not null constraints

When a 'Not Null' constraint is enforced on a column or a set of columns in a table, those columns will not allow Null values.

The user has to provide a value for the column.

Check constraints

Check constraints specify conditions that each row must satisfy.

These are rules governed by logical expressions or Boolean expressions.

Check conditions cannot contain subqueries.

Entity integrity constraints

An entity is any data recorded in a database.

Each entity represents a table and each row of a table represents an instance of that entity.

Each row in a table can be uniquely identified using the entity constraints.

There are two types of entity integrity constraints

• Unique constraints

• Primary key constraints

Unique constraints

It is used to prevent the duplication of values within the rows of a specified column or a set of columns in a table.

Columns defined with this constraint can also allow Null values.

Primary key constraints

The primary key constraint avoids duplication of rows and does not allow Null values when enforced on a column or set of columns.

Referential integrity constraints

A referential integrity constraint is used to establish a 'parent-child' or a 'master-detail' relationship between two tables having a common column.

To implement this, define the column in the parent table as a primary key and the same column in the child table as a foreign key referring to the corresponding parent entry.
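As an illustrative sketch (not part of the original notes), the constraint types described above can be tried out with Python's built-in sqlite3 module. The tables dept and emp and their columns are hypothetical, and SQLite only checks foreign keys after PRAGMA foreign_keys = ON, which is an SQLite-specific detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute("""
    CREATE TABLE emp (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  INTEGER CHECK (salary > 0),
        dept_id INTEGER REFERENCES dept(dept_id)
    )""")
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO emp VALUES (10, 'Asha', 5000, 1)")

def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

results = {
    "not_null":    rejected("INSERT INTO emp VALUES (11, NULL, 4000, 1)"),
    "check":       rejected("INSERT INTO emp VALUES (12, 'Ravi', -5, 1)"),
    "unique":      rejected("INSERT INTO dept VALUES (2, 'Sales')"),
    "primary_key": rejected("INSERT INTO emp VALUES (10, 'Mala', 3000, 1)"),
    "foreign_key": rejected("INSERT INTO emp VALUES (13, 'Kumar', 3000, 99)"),
}
print(results)
```

Each entry reports whether the database refused the offending row; the domain, entity and referential constraints are all checked by the DBMS itself rather than by application code.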



Syntax (column constraints) Creating constraints on a new table

4. Explain about Data Definition Language:

• Create table command

• Alter table command

• Truncate table command

• Drop table command.

Create table

The create table statement creates a new base table.

Syntax:

Create table table-name (col1-definition [, col2-definition] ... [, coln-definition] [, primary-key-definition] [, alternate-key-definition] [, foreign-key-definition]);

Example:

SQL>create table Book

(ISBN char(10) not null,

Title char(30) not null with default,

Author char(30) not null with default,

Publisher char(30) not null with default,

Year integer not null with default,

Price integer null,

Primary key (ISBN));

O/P: table created.

Alter table

An existing base table can be modified by using the alter table statement.

Syntax:

Alter table base-table-name Add column datatype {null / not null with default};

Example:

SQL>Alter table book Add discount integer null;

O/P: Table altered.

This adds another column discount with data type integer.

Truncate table

If there is no further use of records stored in a table and the structure has to be retained then the records alone can be deleted.

Syntax

Truncate table table-name;

Example:

SQL>Truncate table book;

O/P: table truncated.

This command would delete all the records from the table, book.



DESC

The Desc command is used to view the structure of the table.

Syntax

Desc table-name;

Example:

SQL>Desc book;

O/P:

Name        Null?   Type
ISBN                char(10)
Title               char(30)
Author              char(30)
Publisher           char(30)
Year                integer
Price       null    integer

Drop table

An existing base table can be deleted at any time by using the drop table statement.

Syntax

Drop table table-name;

O/P: table dropped.

This command will delete the table named book along with its contents, indexes and any views defined for that table.

Note on embedded SQL host variables: to use a host variable in an embedded SQL statement, the variable name is prefixed by a colon (:).

5. Explain about query optimization?

Introduction

In non-procedural DMLs (e.g. SQL), the user specifies what data is required rather than how it is to be retrieved. This relieves the user of knowing what constitutes a good execution strategy and gives the DBMS more control over system performance.

Two main techniques for query optimization:

• heuristic rules that order operations in a query.

• comparing different strategies based on relative costs, and selecting one that minimizes resource usage.

Disk access tends to be the dominant cost in query processing for a centralized DBMS.

Query Processing

Query processing: activities involved in retrieving data from the database.

Aims of QP:

• transform a query written in a high-level language (e.g. SQL) into a correct and efficient execution strategy expressed in a low-level language (implementing RA);

• execute the strategy to retrieve the required data.



Query Optimization

Query optimization: activity of choosing an efficient execution strategy for processing a query. As there are many equivalent transformations of the same high-level query, the aim of QO is to choose one that minimizes resource usage. Generally, the goal is to reduce the total execution time of the query. The problem is computationally intractable with a large number of relations, so the strategy adopted is reduced to finding a near-optimum solution.

Phases of Query Processing

QP has 4 main phases:

• decomposition – aims are to transform the high-level query into an RA query and to check that the query is syntactically and semantically correct.

• optimization

• code generation

• execution.

Optimization: Heuristical Processing Strategies

• Perform selection operations as early as possible.

• Keep predicates on the same relation together.

• Combine a Cartesian product with a subsequent selection whose predicate represents a join condition into a join operation.

• Use associativity of binary operations to rearrange leaf nodes so that leaf nodes with the most restrictive selection operations are executed first.
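The first and third heuristics can be illustrated with a small sketch in Python. The relations sailors and reserves below are hypothetical toy data, and the functions only count intermediate rows; they are not how a real optimizer is implemented.

```python
# Hypothetical toy relations: sailors(sid, sname) and reserves(sid, bid)
sailors = [(i, f"s{i}") for i in range(100)]
reserves = [(i, 100 + i % 5) for i in range(100)]

def product_then_select():
    """Cartesian product first, selection and join condition applied afterwards."""
    pairs = [(s, r) for s in sailors for r in reserves]      # 100 * 100 intermediate rows
    rows = [s[1] for s, r in pairs if r[1] == 103 and s[0] == r[0]]
    return rows, len(pairs)

def select_then_join():
    """Selection pushed down; product plus join predicate merged into a join."""
    picked = [r for r in reserves if r[1] == 103]            # selection as early as possible
    by_sid = {s[0]: s[1] for s in sailors}                   # lookup table on the join key
    rows = [by_sid[r[0]] for r in picked if r[0] in by_sid]
    return rows, len(picked)

slow_rows, slow_size = product_then_select()
fast_rows, fast_size = select_then_join()
print(sorted(slow_rows) == sorted(fast_rows), slow_size, fast_size)
```

Both strategies return the same answer, but the second one carries a far smaller intermediate result into the join.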

6. Discuss about query tuning process.

Query optimization is a very complex task.



• Combinatorial explosion.

• The task is to find one good query evaluation plan, not the best one.

• No optimizer optimizes all queries adequately.

• There is a need for query tuning.

• All optimizers differ in their ability to optimize queries, making it difficult to prescribe principles.

• Having to tune queries is a fact of life.

Query tuning has a localized effect and is thus relatively attractive. It is a time-consuming and specialized task. It makes the queries harder to understand. However, it is often a necessity. This is not likely to change any time soon.

Query Tuning Issues

• Need too many disk accesses (e.g. scan for a point query)?

• Need unnecessary computation? Redundant DISTINCT:

SELECT DISTINCT cpr# FROM Employee WHERE dept = 'computer'

• Relevant indexes are not used?

• Unnecessary nested subqueries?

Nested Queries

• The nested block is optimized independently, with the outer tuple considered as providing a selection condition.

• The outer block is optimized with the cost of 'calling' the nested block computation taken into account.

• Implicit ordering of these blocks means that some good strategies are not considered.

• The non-nested version of the query is typically optimized better.

SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid = 103 AND R.sid = S.sid)

Nested block to optimize:

SELECT * FROM Reserves R WHERE R.bid = 103 AND R.sid = outer value

Equivalent non-nested query:

SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid = R.sid AND R.bid = 103
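The equivalence of the nested and non-nested forms can be checked with a minimal sketch using Python's sqlite3 module. The sample rows are made up for illustration, and the two queries return the same rows only on the assumption that each sailor reserves boat 103 at most once (otherwise the join form returns duplicates).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sailors  (sid INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE Reserves (sid INTEGER, bid INTEGER);
    INSERT INTO Sailors  VALUES (1, 'Dustin'), (2, 'Lubber'), (3, 'Rusty');
    INSERT INTO Reserves VALUES (1, 103), (2, 101), (3, 103);
""")

# Nested (correlated) form
nested = conn.execute("""
    SELECT S.sname FROM Sailors S
    WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid = 103 AND R.sid = S.sid)
""").fetchall()

# Non-nested (join) form
flat = conn.execute("""
    SELECT S.sname FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103
""").fetchall()

print(sorted(nested), sorted(flat))
```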



UNIT III-TRANSACTION PROCESSING AND CONCURRENCY CONTROL

PART A-2 MARKS

1. What are the ACID properties?

ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably, so that the integrity of the data is ensured.

2. What are the states of transaction?

The states of transaction are

Active

Partially committed

Failed

Aborted

Committed

Terminated

3. Give the reasons for allowing concurrency?

If the transactions run serially, a short transaction may have to wait for a preceding long transaction to complete, which can lead to unpredictable delays in running a transaction.

So concurrent execution reduces the unpredictable delays in running transactions.

4. Differentiate strict two phase locking protocol and rigorous two phase locking protocol.

In strict two phase locking protocol all exclusive mode locks taken by a transaction are held until that transaction commits.

Rigorous two phase locking protocol requires that all locks be held until the transaction commits.

5. Define deadlock?

A deadlock is a situation in which each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T' in the set.

6. State the benefits of strict two phase locking.

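Strict two phase locking holds all exclusive locks until the transaction commits, so no other transaction can read or overwrite uncommitted data. This avoids cascading rollbacks. Note that it does not by itself prevent deadlock.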

Ensures Serializability.



7. List the sql statements used for transaction control

Commit & Rollback

The COMMIT statement commits the database changes that were made during the current transaction, making the changes permanent.

The ROLLBACK statement backs out, or cancels, the database changes that are made by the current transaction and restores changed data to the state before the transaction began
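A minimal sketch of COMMIT and ROLLBACK, using Python's sqlite3 module as a stand-in for any SQL DBMS. The account table is hypothetical; conn.commit() and conn.rollback() issue the corresponding SQL statements for the current transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()                      # COMMIT: the insert is now permanent

conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
conn.rollback()                    # ROLLBACK: the update is cancelled
after_rollback = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]

conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
conn.commit()                      # this time the change is kept
after_commit = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(after_rollback, after_commit)
```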

8. What is the need for concurrency?

Simultaneous execution of transactions over a shared database can create several data integrity and consistency problems:

• Lost Updates. • Uncommitted Data. • Inconsistent retrievals

9. What are the two approaches in distributed environment?

Locking

Timestamp

Locking guarantees that the concurrent execution is equivalent to some unpredictable serial execution of those transactions.

10. What is average response time?

The average response time is the average time for a transaction to be completed after it has been submitted.

11. What are the two types of serializability?

The two types of serializability are

Conflict serializability

View serializability

12. What are the different modes of lock?

The modes of lock are:

Shared

Exclusive



PART B

1. Explain about Locking Protocols:

Locking is a procedure used to control concurrent access to data. When one transaction is accessing the database, a lock may deny access to other transactions to prevent incorrect results.

A transaction must obtain a read or write lock on a data item before it can perform a read or write operation.

Lock compatibility matrix (S = shared, X = exclusive):

        S       X
S       true    false
X       false   false

The read lock is also called a shared lock.

The write lock is also known as an exclusive lock.

The basic rules for locking are

• If a transaction has a read lock on a data item, it can read the item but not update it.

• If a transaction has a read lock on a data item, other transactions can obtain a read lock on the data item, but no write locks.

• If a transaction has a write lock on a data item, it can both read and update the data item.

• If a transaction has a write lock on a data item, then other transactions cannot obtain either a read lock or a write lock on the data item.

The locking works as

• All transactions that need to access a data item must first acquire a read lock or write lock on the data item, depending on whether it is a read-only operation or not.

• If the data item for which the lock is requested is not already locked, the transaction is granted the requested lock.

• If the item is currently locked, the DBMS determines what kind of lock is the current one. The DBMS also finds out what lock is requested.

• If a read lock is requested on an item that is already under a read lock, then the request will be granted.

• If a read lock or a write lock is requested on an item that is already under a write lock, then the request is denied and the transaction must wait until the lock is released.

• A transaction continues to hold the lock until it explicitly releases it either during execution or when it terminates.

• The effects of a write operation will be visible to other transactions only after the write lock is released.
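The granting rules above amount to a lookup in the lock compatibility matrix. The following is a minimal sketch in Python; the names COMPATIBLE and can_grant are made up for illustration, and a real lock manager would also keep a queue of waiting transactions.

```python
# Lock compatibility matrix: COMPATIBLE[held][requested]
COMPATIBLE = {
    "S": {"S": True,  "X": False},
    "X": {"S": False, "X": False},
}

def can_grant(requested, held_by_others):
    """Grant a lock only if it is compatible with every lock other transactions hold."""
    return all(COMPATIBLE[held][requested] for held in held_by_others)

print(can_grant("S", []))           # item not locked
print(can_grant("S", ["S", "S"]))   # read lock on an item already under read locks
print(can_grant("X", ["S"]))        # write lock requested while a read lock is held
print(can_grant("S", ["X"]))        # read lock requested while a write lock is held
```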

Two phase locking protocol requires that each transaction issue lock and unlock requests in two phases:

1. Growing phase -A transaction may obtain locks, but may not release any lock.

2. Shrinking phase -A transaction may release locks, but may not obtain any new locks.

Initially, a transaction is in the growing phase. The transaction acquires locks as needed. Once the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock requests.

The point in the schedule where the transaction has obtained its final lock (the end of its growing phase) is called the lock point of the transaction.

Example 1:

Transactions T1 and T2 do not follow the two-phase locking protocol:

T1                      T2
read-lock (y);          read-lock (x);
read-item (y);          read-item (x);
unlock (y);             unlock (x);
write-lock (x);         write-lock (y);
read-item (x);          read-item (y);
x = x + y;              y = x + y;
write-item (x);         write-item (y);
unlock (x);             unlock (y);
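Whether a transaction obeys the two-phase rule can be checked mechanically: no lock request may appear after the first unlock. The sketch below, in Python, encodes T1 from Example 1 as a list of strings (an encoding chosen here purely for illustration) together with a reordered variant that does follow the protocol.

```python
def follows_two_phase_locking(ops):
    """Return True if no lock is requested after the first unlock."""
    shrinking = False
    for op in ops:
        action = op.split("(")[0]
        if action == "unlock":
            shrinking = True                  # shrinking phase has started
        elif action in ("read_lock", "write_lock") and shrinking:
            return False                      # lock requested after an unlock
    return True

# T1 exactly as in Example 1: unlock(y) comes before write_lock(x)
t1 = ["read_lock(y)", "read_item(y)", "unlock(y)", "write_lock(x)",
      "read_item(x)", "x = x + y", "write_item(x)", "unlock(x)"]

# Hypothetical rewrite of T1: all locks are acquired before any is released
t1_2pl = ["read_lock(y)", "read_item(y)", "write_lock(x)", "unlock(y)",
          "read_item(x)", "x = x + y", "write_item(x)", "unlock(x)"]

print(follows_two_phase_locking(t1), follows_two_phase_locking(t1_2pl))
```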

2. Explain about Deadlock:

Deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T in the set.

There are three general techniques for handling deadlock:

• Timeouts

• Deadlock prevention

• Deadlock detection and recovery.



Timeouts

A transaction that requests a lock will wait for only a system defined period of time. If the lock has not been granted within this period, the lock request times out.

In this case, the DBMS assumes the transaction may be deadlocked, even though it may not be, and it aborts and automatically restarts the transaction.

Deadlock Prevention

Using transaction timestamps.

Wait-die algorithm: allows only an older transaction to wait for a younger one; otherwise the transaction is aborted and restarted with the same timestamp, so that eventually it will become the oldest active transaction and will not die.

Wound-wait algorithm: allows only a younger transaction to wait for an older one. If an older transaction requests a lock held by a younger one, the younger one is aborted.
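The two prevention schemes differ only in what happens for each age ordering of requester and lock holder. A minimal sketch in Python follows, assuming that a smaller timestamp means an older transaction; the function names and return strings are illustrative only.

```python
def wait_die(requester_ts, holder_ts):
    """Older requester waits; younger requester dies (is aborted and restarted)."""
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    """Older requester wounds (aborts) the younger holder; younger requester waits."""
    return "abort holder" if requester_ts < holder_ts else "wait"

# Transaction with timestamp 5 is older than the one with timestamp 9
print(wait_die(5, 9), "|", wait_die(9, 5))
print(wound_wait(5, 9), "|", wound_wait(9, 5))
```

In both schemes it is always the younger transaction that is aborted, which is why no cycle of waiting transactions can form.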

Deadlock detection and Recovery

Deadlock detection is usually handled by the construction of a wait-for graph (WFG) that shows the transaction dependencies; that is, transaction Ti is dependent on Tj if transaction Tj holds the lock on a data item that Ti is waiting for.

• Deadlock exists if and only if the WFG contains a cycle.

• When a detection algorithm determines that a deadlock exists, the system must recover from the deadlock. The most common solution is to roll back one or more transactions to break the deadlock.

• Starvation occurs when the same transaction is always chosen as the victim, and the transaction can never complete.
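Deadlock detection on a wait-for graph reduces to cycle detection. The sketch below uses a depth-first search in Python; the dictionary representation of the WFG and the name has_deadlock are assumptions made for the example.

```python
def has_deadlock(wait_for):
    """Detect a cycle in a wait-for graph {transaction: set of transactions it waits for}."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour = {}

    def visit(t):
        colour[t] = GREY
        for u in wait_for.get(t, ()):
            state = colour.get(u, WHITE)
            if state == GREY:             # back edge: u is on the current path
                return True
            if state == WHITE and visit(u):
                return True
        colour[t] = BLACK
        return False

    return any(colour.get(t, WHITE) == WHITE and visit(t) for t in list(wait_for))

print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # T1 and T2 wait for each other
print(has_deadlock({"T1": {"T2"}, "T2": {"T3"}}))   # a chain, no cycle
```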

3. Briefly explain about Serializability:

Schedule is a sequence of the operations by a set of concurrent transactions that preserves the order of the operations in each of the individual transactions.

Serial schedule is a schedule where the operations of each transaction are executed consecutively without any interleaved operations from other transactions.



In a serial schedule, the transactions are performed in serial order, i.e. if T1 and T2 are transactions, serial order would be T1 followed by T2 or T2 followed by T1.

Non serial schedule is a schedule where the operations from a set of concurrent transactions are interleaved.

The objective of serializability is to find non serial schedules that allow transactions to execute concurrently without interfering with one another.

Conflict serializability:

In serializability, the ordering of read and write operations is important:

• If two transactions only read a data item, they do not conflict and order is not important.

• If two transactions either read or write completely separate data items, they do not conflict and order is not important.

• If one transaction writes a data item and another either reads or writes the same data item, the order of execution is important.

The instructions Ii and Ij conflict if they are operations by different transactions on the same data item, and at least one of these instructions is a write operation.

Example conflicting instructions.

Schedule S1

T1              T2
read (A)
write (A)
                read (A)
                write (A)
read (B)
write (B)
                read (B)
                write (B)

The write (A) instruction of T1 conflicts with the read (A) instruction of T2. However, the write (A) instruction of T2 does not conflict with the read (B) instruction of T1, because the two instructions access different data items.
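A common way to test conflict serializability, not spelled out in these notes, is to build a precedence graph from the conflicting pairs and check that it has no cycle. The sketch below does this in Python; the tuple encoding (transaction, action, item) is an assumption made for the example.

```python
def precedence_edges(schedule):
    """schedule: list of (transaction, action, item) with action 'r' or 'w'."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write
            if ti != tj and xi == xj and "w" in (ai, aj):
                edges.add((ti, tj))
    return edges

def is_conflict_serializable(schedule):
    edges = precedence_edges(schedule)
    nodes = {t for t, _, _ in schedule}
    # Repeatedly remove transactions with no incoming edge (topological sort)
    while nodes:
        free = [n for n in nodes if not any(v == n and u in nodes for u, v in edges)]
        if not free:
            return False                  # every remaining node lies on a cycle
        nodes -= set(free)
    return True

# The interleaved schedule of T1 and T2 on items A and B
s1 = [("T1", "r", "A"), ("T1", "w", "A"), ("T2", "r", "A"), ("T2", "w", "A"),
      ("T1", "r", "B"), ("T1", "w", "B"), ("T2", "r", "B"), ("T2", "w", "B")]
# A hypothetical schedule where T2 writes A between T1's read and write of A
bad = [("T1", "r", "A"), ("T2", "w", "A"), ("T1", "w", "A")]

print(precedence_edges(s1), is_conflict_serializable(s1), is_conflict_serializable(bad))
```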

View serializability:

The schedules S and S' are said to be view equivalent if the following conditions are met:

• For each data item x, if transaction Ti reads the initial value of x in schedule S, then transaction Ti must, in schedule S', also read the initial value of x.

• For each data item x, if transaction Ti executes read (x) in schedule S, and if that value was produced by a write (x) operation executed by transaction Tj, then the read (x) operation of transaction Ti must, in schedule S', also read the value of x that was produced by the same write (x) operation of transaction Tj.

• For each data item x, the transaction that performs the final write (x) operation in schedule S must perform the final write (x) operation in schedule S'.

Schedule 1

T1              T2
read (A)
write (A)
read (B)
write (B)
                read (A)
                write (A)
                read (B)
                write (B)

Schedule 2

T1              T2
read (A)
write (A)
                read (A)
                write (A)
read (B)
write (B)
                read (B)
                write (B)

4. Briefly explain about two phase commit:

A centralized database requires only one DP (data processor).

Database operations take place at only one site, and the consequences of database operations are immediately known to the DBMS.

The two-phase commit protocol guarantees that if a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state.



Each DP maintains its own transaction log. The two-phase commit protocol requires that the transaction entry log for each DP be written before the database fragment is actually updated.

Therefore, the two-phase commit protocol requires a DO-UNDO-REDO protocol and a write-ahead protocol.

The DO-UNDO-REDO protocol is used by the DP to roll back and / or roll forward transactions with the help of the system's transaction log entries. The DO-UNDO-REDO protocol defines three types of operations:

• DO performs the operation and records the "before" and "after" values in the transaction log.

• UNDO reverses an operation, using the log entries written by the DO portion of the sequence.

• REDO redoes an operation, using the log entries written by the DO portion of the sequence.

To ensure that the DO, UNDO, and REDO operations can survive a system crash while they are being executed, a write-ahead protocol is used.

The two-phase commit protocol defines the operations between two types of nodes: The coordinator and one or more subordinates, or cohorts.

The participating nodes agree on a coordinator. Generally, the coordinator role is assigned to the node that initiates the transaction.



The protocol is implemented in two phases:

Phase 1: Preparation

1) The coordinator sends a PREPARE TO COMMIT message to all subordinates.

2) The subordinates receive the message, write the transaction log using the write-ahead protocol, and send an acknowledgement (YES / PREPARED TO COMMIT or NO / NOT PREPARED) message to the coordinator.

3) The coordinator makes sure that all nodes are ready to commit, or it aborts the action.

If all nodes are PREPARED TO COMMIT, the transaction goes to phase-2. If one or more nodes reply NO or NOT PREPARED, the coordinator broadcasts an ABORT message to all subordinates.

Phase 2: The Final Commit

1) The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.

2) Each subordinate receives the COMMIT message, then updates the database using the DO protocol.

3) The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator. If one or more subordinates did not COMMIT, the coordinator sends an ABORT message, thereby forcing them to UNDO all changes.

The objective of the two-phase commit is to ensure that all nodes commit their part of the transaction, otherwise, the transaction is aborted.

If one of the nodes fails to commit, the information necessary to recover the database is in the transaction log, and the database can be recovered with the DO-UNDO-REDO protocol.
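The message flow of the two phases can be traced with a small simulation in Python. This is only a sketch under simplifying assumptions: the subordinates' votes are given up front as a dictionary, every subordinate that votes YES later commits successfully, and no messages are lost.

```python
def two_phase_commit(votes):
    """votes: {subordinate: True (PREPARED TO COMMIT) or False (NOT PREPARED)}."""
    log = []                                   # (sender, receiver, message)
    # Phase 1: preparation
    for node in votes:
        log.append(("coordinator", node, "PREPARE TO COMMIT"))
    for node, ready in votes.items():
        log.append((node, "coordinator", "PREPARED TO COMMIT" if ready else "NOT PREPARED"))
    if not all(votes.values()):
        for node in votes:
            log.append(("coordinator", node, "ABORT"))
        return "ABORTED", log
    # Phase 2: final commit
    for node in votes:
        log.append(("coordinator", node, "COMMIT"))
    for node in votes:
        log.append((node, "coordinator", "COMMITTED"))
    return "COMMITTED", log

outcome_ok, log_ok = two_phase_commit({"site_A": True, "site_B": True})
outcome_bad, log_bad = two_phase_commit({"site_A": True, "site_B": False})
print(outcome_ok, outcome_bad)
```

A single NOT PREPARED vote is enough for the coordinator to broadcast ABORT, so either every site commits or none does.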

5. Briefly explain about Transaction states:

A transaction must be in one of the following states:

• Active -This is the initial state, the transaction stays in this state while it is executing.

• Partially committed -A transaction is in this state when it has executed the final statement.

• Failed -A transaction is in this state once the normal execution of the transaction cannot proceed.

• Aborted -A transaction is said to be aborted when the transaction has rolled back and the database is being restored to the consistent state prior to the start of the transaction.

• Committed -A transaction is in the committed state once it has been successfully executed and the database is transformed into a new consistent state.



A transaction starts in the active state.

A transaction contains a group of statements that form a logical unit of work. When the transaction has finished executing the last statement, it enters the partially committed state.

At this point the transaction has completed execution, but it is still possible that it may have to be aborted. The database system then writes enough information to the disk. When the last of this information is written, the transaction enters the committed state.

A transaction enters the failed state once the system determines that the transaction can no longer proceed with its normal execution.

This could be due to hardware failures or logical errors. Such a transaction should be rolled back. When the roll back is complete, the transaction enters the aborted state. When a transaction aborts, the system has two options as follows:

• Restart the transaction

• Kill the transaction.

6. What are the issues in concurrency control?

• Process of managing simultaneous operations on the database without having them interfere with one another.

• Prevents interference when two or more users are accessing the database simultaneously and at least one is updating data.

• Although two transactions may be correct in themselves, interleaving of operations may produce an incorrect result.

Need for Concurrency Control

Three examples of potential problems caused by concurrency:

• Lost update problem

• Uncommitted dependency problem

• Inconsistent analysis problem.

Lost Update Problem

A successfully completed update is overridden by another user.

Example:

• T1 withdraws £10 from an account with balx, initially £100.

• T2 deposits £100 into the same account.

• Serially, the final balance would be £190.

This can be avoided by preventing T1 from reading balx until after T2's update.

Uncommitted Dependency Problem

Occurs when one transaction can see intermediate results of another transaction before it has committed.

Example:

• T4 updates balx to £200 but it aborts, so balx should be back at its original value of £100.

• T3 has read the new value of balx (£200) and uses this value as the basis of a £10 reduction, giving a new balance of £190, instead of £90.

The problem is avoided by preventing T3 from reading balx until after T4 commits or aborts.

Inconsistent Analysis Problem

Occurs when a transaction reads several values but a second transaction updates some of them during the execution of the first.

Example:

• T6 is totaling balances of account x (£100), account y (£50), and account z (£25).

• Meantime, T5 has transferred £10 from balx to balz, so T6 now has the wrong result (£10 too high).

The problem is avoided by preventing T6 from reading balx and balz until after T5 has completed its updates.
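The lost update example can be replayed with a short Python sketch. The schedule encoding (transaction, step, amount) and the particular interleaving are assumptions chosen to reproduce the problem; the figures (£100 initial balance, £10 withdrawal, £100 deposit) are those of the example above.

```python
def run(schedule, balance=100):
    """Execute read/update/write steps; each transaction works on its own local copy."""
    local = {}
    for txn, step, amount in schedule:
        if step == "read":
            local[txn] = balance          # read balx into a local variable
        elif step == "update":
            local[txn] += amount          # change the local copy only
        elif step == "write":
            balance = local[txn]          # write the local copy back to balx
    return balance

serial = [("T1", "read", 0), ("T1", "update", -10), ("T1", "write", 0),
          ("T2", "read", 0), ("T2", "update", 100), ("T2", "write", 0)]

# T1 reads balx before T2 has written its update, then overwrites it
interleaved = [("T2", "read", 0), ("T1", "read", 0), ("T2", "update", 100),
               ("T2", "write", 0), ("T1", "update", -10), ("T1", "write", 0)]

print(run(serial), run(interleaved))
```

In the interleaved run T2's deposit is overwritten by T1, which is exactly the lost update.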



UNIT IV-TRENDS IN DATABASE TECHNOLOGY

PART A

1. Distinguish between static hashing and dynamic hashing?

Static hashing

Static hashing uses a hash function in which the set of bucket addresses is fixed. Such hash functions cannot easily accommodate databases that grow larger over time.

Dynamic hashing

Dynamic hashing allows us to modify the hash function dynamically. Dynamic hashing copes with changes in database size by splitting and coalescing buckets as the database grows and shrinks.

2. What can be done to reduce the occurrences of bucket overflows in a hash file organization?

1. Choose the hash function more carefully, and make better estimates of the relation size.

2. If the estimated size of the relation is nr and the number of records per block is fr, allocate (nr/fr)*(1+d) buckets instead of (nr/fr) buckets.

Here d is a fudge factor, typically around 0.2.
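As a worked illustration of the formula, the following Python sketch computes the number of buckets to allocate. The figures (10,000 records, 20 records per block) are made up, and rounding up to a whole bucket is an assumption added here.

```python
import math

def bucket_count(n_r, f_r, d=0.2):
    """Buckets to allocate: (n_r / f_r) * (1 + d), rounded up to a whole bucket."""
    return math.ceil((n_r / f_r) * (1 + d))

without_slack = bucket_count(10_000, 20, d=0)   # plain n_r / f_r
with_slack = bucket_count(10_000, 20)           # 20% extra buckets
print(without_slack, with_slack)
```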

3. Compare sequential access devices with random access devices.

Random Access (DISK)

Storage space allocation and tracking-Disk blocks

Concurrent volume access - A volume can be accessed concurrently by different operations

Sequential Access (FILE)

Storage space allocation and tracking - Volumes

Concurrent volume access -A volume can be accessed concurrently by different operations

4. What are ordered indices?

Ordered Indices

• In order to allow fast random access, an index structure may be used.

• A file may have several indices on different search keys.

• If the file containing the records is sequentially ordered, the index whose search key specifies the sequential order of the file is the primary index, or clustering index.



• Indices whose search key specifies an order different from the sequential order of the file are called the secondary indices, or nonclustering indices.

5. What are differences between sparse index and dense index?

Dense Index: An index record appears for every search key value in file. This record contains search key value and a pointer to the actual record

Sparse Index: Index records are created only for some of the records. To locate a record, we find the index record with the largest search key value less than or equal to the search key value we are looking for.
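The sparse index lookup rule can be sketched with Python's bisect module. The block contents below are hypothetical, and the index is assumed to hold the first search key of each block of a sorted file.

```python
from bisect import bisect_right

# Hypothetical sorted data file split into blocks
blocks = [[5, 8, 12], [17, 21, 30], [34, 40, 47]]
sparse_index = [block[0] for block in blocks]       # one index entry per block

def lookup(key):
    """Find the index entry with the largest key <= the search key, then scan that block."""
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return None                                 # key is smaller than every indexed key
    return (pos, key) if key in blocks[pos] else None

print(lookup(21), lookup(22), lookup(3))
```

A dense index would instead hold an entry for every search key value, so the final scan within the block would not be needed.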

6. How does a B tree differ from a B+ tree? Why is a B+ tree usually preferred as an access structure to a data file?

1. In a B tree, search keys and data are stored in internal or leaf nodes, but in a B+ tree data are stored only in leaf nodes.

2. Searching for any data in a B+ tree is very easy because all data are found in leaf nodes, whereas in a B tree the data may be in an internal node.

3. Insertion in a B tree is more complicated than in a B+ tree.

4. A B+ tree stores redundant search keys, but a B tree has no redundant values.

7. Give the comparison between ordered indexing and hashing?

Ordered indexing

An ordered index is based on a sorted ordering of the values.

To access the records we use an index structure. Each index structure is associated with a search key. Ordered indices are divided into two groups.

i) Primary indices (or) clustering indices

ii) Secondary indices (or) non-clustering indices.

Hashing

A Hashed index is based on the values being uniformly distributed using a mathematical function called hash function.

File organizations based on hashing techniques allow us to avoid accessing an index structure. Hashing techniques are divided into two types.

i) Static Hashing

ii) Dynamic Hashing

8. What are the factors to be taken into account when choosing a RAID level?



Monetary cost of extra disk storage requirements

Performance requirements in terms of number of I/O operations

Performance when a disk has failed.

Performances during rebuild.

9. What are the different phases in Knowledge discovery?

i) Data Selection – Selecting data about specific item or category

ii) Data cleansing – Correcting invalid data or eliminating records

iii) Enrichment – Enhancing data with additional sources of information

iv) Data transformation and Encoding – Reducing amount of data by generalization

v) Data mining – Techniques to mine different rules and patterns

vi) Reporting and Display of discovered information – Displaying result as listings, graphical outputs, summary tables or visualizations in a user understandable manner.

10. What are the ways to represent knowledge extracted during data mining?

i) Association rules

ii) Classification hierarchies

iii) Sequential patterns

iv) Patterns within time series

v) Categorization and Segmentation

11. What are the two types of views in multidimensional model?

i) Roll-up display – Moves up the hierarchy, by grouping into larger units along a dimension. This is a coarser grained view, which increases generalization.

ii) Drill-down display – This furnishes a finer-grained view.

12. What are the Applications of Data warehousing?

i) OLAP (Online Analytical Processing)

ii) DSS (Decision Support System)/EIS (Executive Information Systems)

iii) Data Mining



PART B

1. Briefly explain about RAID:

A variety of disk-organization techniques, collectively called redundant arrays of independent disks (RAID), have been proposed to achieve improved performance and reliability.

RAID levels

Mirroring provides high reliability, but it is expensive. Striping provides high data - transfer rates, but does not improve reliability. Various alternative schemes aim to provide redundancy at lower cost by combining disk striping with "parity" bits. The schemes are classified into RAID levels.

RAID level 0

RAID level 0 uses data striping at the level of blocks and has no redundant data (such as mirroring or parity bits); hence it has the best write performance, since updates do not have to be duplicated. However, its read performance is not as good as that of RAID level 1.

RAID level 1

RAID level 1 refers to disk mirroring with block striping. Its read performance is better than that of RAID level 0. Performance improvement is possible by scheduling a read request to the disk with the shortest expected seek and rotational delay.

RAID level 2

RAID level 2 uses memory-style redundancy based on Hamming codes, which contain parity bits for distinct overlapping subsets of components.

The disks labeled P store the error-correction bits. If one of the disks fails, the remaining bits of the byte and the associated error-correction bits can be read from other disks, and can be used to reconstruct the damaged data.

RAID level 3

RAID level 3, bit-interleaved parity organization, improves on level 2 by exploiting the fact that disk controllers can detect whether a sector has been read correctly, so a single parity bit can be used for error correction.

If one of the sectors gets damaged, the system knows exactly which sector it is, and, for each bit in the sector, the system can figure out whether it is a 1 or a 0 by computing the parity of the corresponding bits from sectors in the other disks.

If the parity of the remaining bits is equal to the stored parity, the missing bit is 0; otherwise, it is 1. RAID level 3 supports a lower number of I/O operations per second, since every disk has to participate in every I/O request.



RAID level 4

RAID level 4, block-interleaved parity organization, uses block-level striping and keeps a parity block on a separate disk for corresponding blocks from N other disks.

If one of the disks fails, the parity block can be used with the corresponding blocks from the other disks to restore the blocks of the failed disk.

Multiple read accesses can proceed in parallel, leading to a higher overall I/O rate.

A single write requires four disk accesses: two to read the two old blocks (the data block and the parity block), and two to write the two new blocks.
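The reconstruction and the four-access write described above both rest on XOR parity. A minimal sketch, assuming short in-memory byte strings stand in for disk blocks; the function names are illustrative and not taken from any RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks; this is the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Restore the block of a failed disk from the other blocks and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

def updated_parity(old_parity, old_block, new_block):
    """Small write: only the old data block and the old parity block are read."""
    return xor_blocks([old_parity, old_block, new_block])

# Three data blocks on three disks plus one parity block (made-up contents).
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(data)
```

Because XOR is its own inverse, XOR-ing the old block out of the parity and the new block in gives the same result as recomputing the parity from all data blocks.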

RAID level 5

RAID level 5, block-interleaved distributed parity, improves on level 4 by partitioning data and parity among all N + 1 disks.

In level 5, all disks can participate in satisfying read requests, so level 5 increases the total number of requests that can be met in a given amount of time.

RAID level 6



RAID level 6, the P + Q redundancy scheme, is much like RAID level 5, but stores extra redundant information to guard against multiple disk failures. Instead of using parity, level 6 uses error-correcting codes. In this scheme, 2 bits of redundant data are stored for every 4 bits of data, and the system can tolerate two disk failures.

2. Briefly explain about Organization of records in files:

The order in which records are stored and accessed in the file is dependent on the file organization.

The physical arrangement of data in a file into records and pages on secondary storage is called file organization.

The main types of file organization are:

• Heap (unordered) files

• Sequential (ordered) files

• Hash files

Heap files

Records are placed on disk in no particular order.

Records are placed in the file in the same order as they are inserted. A new record is inserted in the last page of the file.

A linear search must be performed to access a record from the file until the required record is found.

To delete a record, the required page first has to be retrieved, the record marked as deleted, and the page written back to disk.

Heap files are one of the best organizations for bulk loading data into a table, as records are inserted at the end of the sequence.

Sequential (ordered files)

Records are ordered by the value of specified fields.

A binary search can be performed to access a record as follows:

• Retrieve the mid-page of the file and check whether the required record is between the first and last records of this page.

• If the value of the key field in the first record on the page is greater than the required value, the required value occurs on an earlier page; therefore repeat the above steps on the lower half of the file.

• If the value of the key field in the last record on the page is less than the required value, it occurs on a later page, and so repeat the above steps on the upper half of the file.

The binary search is more efficient than a linear search.
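The page-level binary search can be sketched as follows. This is an illustration only: pages are modelled as non-empty Python lists of (key, record) pairs sorted by key, and find_record is a hypothetical name.

```python
def find_record(pages, key):
    """Binary search over the pages of an ordered file.

    pages: list of non-empty pages, each a list of (key, record)
    tuples, with keys in ascending order across the whole file.
    """
    low, high = 0, len(pages) - 1
    while low <= high:
        mid = (low + high) // 2          # retrieve the mid-page
        page = pages[mid]
        if key < page[0][0]:             # first key on page is greater
            high = mid - 1               # -> record is on an earlier page
        elif key > page[-1][0]:          # last key on page is smaller
            low = mid + 1                # -> record is on a later page
        else:
            for k, record in page:       # the key can only be on this page
                if k == key:
                    return record
            return None
    return None
```

Each iteration halves the number of pages still under consideration, so roughly log2(number of pages) page reads are needed instead of a full scan.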

Inserting a record near the start of a large file could be very time-consuming. One solution is to

Hash files (Random or direct files)

Records are placed on disk according to a hash function.

A hash function calculates the address of the page in which the record is to be stored based on one or more fields in the record.

The base field is called the hash field, or if the field is also a key field of the file, it is called the hash key. The hash function is chosen so that records are as evenly distributed as possible throughout the file.

Each address generated by a hashing function corresponds to a page, or bucket, with slots for multiple records. Within a bucket, records are placed in order of arrival. When the same address is generated for two or more records, a collision is said to have occurred. The records are called synonyms.

There are several techniques that can be used to manage collisions.

• Open addressing

• Unchained overflow

• Chained overflow

• Multiple hashing

Open addressing

If a collision occurs, the system performs a linear search to find the first available slot in which to insert the new record.

Unchained overflow

Instead of searching for a free slot, an overflow area is maintained for collisions that cannot be placed at the hash address.

Chained overflow

An overflow area is maintained for collisions that cannot be placed at the hash address, and each bucket has an additional field, called a synonym pointer, that indicates whether a collision has occurred. If so, it points to the overflow page used; if the pointer is zero, no collision has occurred.

Multiple hashing



An alternative approach to collision management is to apply a second hashing function if the first one results in a collision. The aim is to produce a new hash address that will avoid a collision. The second hashing function is generally used to place records in an overflow area.
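As an illustration of the first of these techniques, open addressing, here is a minimal sketch. It assumes integer keys, one record per slot, a modulo hash function and no deletions; the class and method names are hypothetical.

```python
class OpenAddressingFile:
    """Hash file with open addressing: on a collision, search linearly
    (wrapping around) for the first available slot."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def _address(self, key):
        return key % len(self.slots)      # assumed hash function

    def insert(self, key, record):
        start = self._address(key)
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, record)
                return i                  # slot actually used
        raise RuntimeError("file is full")

    def lookup(self, key):
        start = self._address(key)
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            if self.slots[i] is None:     # empty slot ends the search
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```

With five slots, keys 3, 8 and 13 all hash to address 3, so they are synonyms; the second and third are placed in the next free slots (4, then 0 after wrapping around).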

3. Explain about several types of ordered indexes:

Indices whose search key specifies an order different from the sequential order of the file are called non-clustering indices or secondary indices.

Files that are ordered sequentially on some search key, with a clustering index on that search key, are called index-sequential files. There are several types of ordered indexes.

• Primary index

• Clustering index

• Secondary index

Primary indexes

A primary index is an ordered file whose records are of fixed length with two fields.

The first field is the primary key of the data file, and the second field is a pointer to a disk block (a block address).

Indexes can also be characterized as dense or sparse.

A dense index has an index entry for every search key value in the data file.

A sparse (or non dense) index has index entries for only some of the search values.

A primary index is hence a non-dense (sparse) index, since it includes an entry for each disk block of the data file, holding the key of the block's anchor record, rather than an entry for every search value.

To retrieve a record, given the value K of its primary key field, do a binary search on the index file to find the appropriate index entry i, and then retrieve the data file block whose address is P (i).
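The lookup through a sparse primary index can be sketched as below. The sketch assumes blocks are in-memory lists of (key, record) pairs sorted by primary key and that the anchor record is the first record of each block; the function names are illustrative.

```python
from bisect import bisect_right

def build_primary_index(blocks):
    """One index entry per block: (key of the anchor record, block number)."""
    return [(block[0][0], block_no) for block_no, block in enumerate(blocks)]

def lookup_by_primary_key(index, blocks, key):
    """Binary-search the index for entry i, then read only block P(i)."""
    anchors = [anchor for anchor, _ in index]
    i = bisect_right(anchors, key) - 1    # last entry whose anchor <= key
    if i < 0:
        return None                       # key is smaller than every anchor
    block = blocks[index[i][1]]
    for k, record in block:
        if k == key:
            return record
    return None
```

The index has one entry per block rather than one per record, which is what makes it sparse; only a single data block is read per lookup.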



Clustering indexes

If the records of a file are physically ordered on a non-key field, that field is called the clustering field.

A clustering index is also an ordered file with two fields. The first field is of the same type as the clustering field of the data file, and the second field is a block pointer.

This differs from a primary index, which requires that the ordering field of the data file have a distinct value for each record.

Record insertion and deletion still cause problems, because the data records are physically ordered. To alleviate the problem of insertion, it is common to reserve a whole block for each value of the clustering field. All records with that value are placed in the block.

A secondary index is also an ordered file similar to a primary index.

There are several techniques for handling non-unique secondary indexes.

• Produce a dense secondary index that maps on to all records in the data file, thereby allowing duplicate key values to appear in the index.

• Allow the secondary index to have an index entry for each distinct key value, but allow the block pointers to be multi-valued, with an entry corresponding to each duplicate key value in the data file.

• Allow the secondary index to have an index entry for each distinct key value. However, the block pointer would not point to the data file but to a bucket that contains pointers to the corresponding records in the data file.

• The secondary index may be on a field which is a candidate key and has a unique value in every record, or a non key with duplicate values.



• A secondary index structure can be built on a key field that has a distinct value for every record. Such a field is sometimes called a secondary key. In this case there is one index entry for each record in the data file, which contains the value of the secondary key for the record and a pointer either to the block in which the record is stored or to the record itself. Hence, such an index is dense.

Multilevel indexes

When an index file becomes large and extends over many pages, the search time for the required index entry increases.

4. Briefly explain about B+ tree index file:

A binary tree has order 2 in which each node has no more than two children. The rules for a B+ tree are as follows.

• If the root is not a leaf node, it must have at least two children.

• For a tree of order n, each node except the root and leaf nodes must have between n/2 and n pointers and children. If n/2 is not an integer, the result is rounded up.

• For a tree of order n, the number of key values in a leaf node must be between (n-1)/2 and (n-1). If (n-1)/2 is not an integer, the result is rounded up.

• The number of key values contained in a non leaf node is 1 less than the number of pointers.

• The tree must always be balanced, i.e. every path from the root node to a leaf must have the same length.

Queries on B+ Trees

If the search key value is less than or equal to a key value in the node, the pointer to the left of that key value is used to find the next node to be searched; otherwise the pointer at the end of the node is used.

Insertion



• Find the leaf node in which the search - key value would appear.

• If there is room to insert the search-key value, insert the value in the leaf node, and position it such that the search keys are still in order.

• If there is no room to insert the search-key value, split the node into two nodes.

Put the first ⌈n/2⌉ values in the existing node and the remaining values in a new node. Then insert the smallest search-key value of the new node into the parent of the leaf node that was split.

This insertion into the parent is possible only if the parent has room for an added search-key value. If there is no room, the parent has to be split as well.

Example

Consider the B+ tree

To insert a record with a branch-name value of "Clearview", find that "Clearview" should appear in the node containing "Brighton" and "Downtown". There is no room to insert the search-key value "Clearview". Therefore, the node is split into two nodes.

Two leaf nodes result from inserting "Clearview" and splitting the node containing "Brighton" and "Downtown". Put the first ⌈n/2⌉ values in the existing node and the remaining values in a new node.

The new node has "Downtown" as its smallest search-key value, so this search-key value is inserted into the parent of the leaf node that was split.
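The leaf split in this example can be sketched as follows. The order n = 3 (so a leaf holds at most n - 1 = 2 keys) is an assumption made to match the example, and insert_into_leaf is a hypothetical helper that handles a single leaf only, not the whole tree.

```python
import math
from bisect import insort

def insert_into_leaf(leaf_keys, key, n):
    """Insert key into a sorted leaf of a B+ tree of order n.

    Returns (left_keys, right_keys, key_for_parent). When no split is
    needed, right_keys and key_for_parent are None.
    """
    keys = list(leaf_keys)
    insort(keys, key)                     # keep the search keys in order
    if len(keys) <= n - 1:                # there is room in the leaf
        return keys, None, None
    cut = math.ceil(n / 2)                # first ceil(n/2) stay in old node
    left, right = keys[:cut], keys[cut:]
    return left, right, right[0]          # smallest key of new node goes up

left, right, up = insert_into_leaf(["Brighton", "Downtown"], "Clearview", 3)
# left == ["Brighton", "Clearview"], right == ["Downtown"], up == "Downtown"
```

Propagating the returned key into the parent, and splitting the parent when it is full, would follow the same pattern one level up.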



Deletion:

Find the record to be deleted, and remove it from the file. To delete a leaf node, the pointer to it must also be deleted from its parent. This deletion leaves the parent node, which formerly contained three pointers, with only two pointers.

5. Explain about Static hashing:

In static hashing the hash address space is fixed when the file is created. The term bucket denotes a unit of storage that can store one or more records.

A hash function h is a function from K to B, where K denotes the set of all search-key values, and B denotes the set of all bucket addresses.

Hash functions



The worst possible hash function maps all search-key values to the same bucket.

An ideal hash function distributes the stored keys uniformly across all the buckets, so that every bucket has the same number of records.

Choose a hash function that assigns search-key values to buckets in such a way that the distribution has these qualities.

• The distribution is uniform

• The distribution is random

Handling of bucket overflows

So far it has been assumed that, when a record is inserted, the bucket to which it is mapped has space to store the record. If the bucket does not have enough space, a bucket overflow is said to occur. Bucket overflow can occur for several reasons:

• Insufficient buckets

• Skew: some buckets are assigned more records than others.

Skew can occur for two reasons

1. Multiple records may have the same search key.

2. The chosen hash function may result in non uniform distribution of search keys.

Bucket overflow can be handled by using overflow buckets

If a record must be inserted into a bucket b, and b is already full, the system provides an overflow bucket for b and inserts the record into the overflow bucket. If the overflow bucket is also full, the system provides another overflow bucket, and so on.

All the overflow buckets of a given bucket are chained together in a linked list. Overflow handling using such a linked list is called overflow chaining.
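Overflow chaining can be sketched as follows. The sketch assumes integer search keys, a modulo hash function and in-memory buckets; the class names Bucket and StaticHashFile are illustrative.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []                 # (key, record) pairs
        self.overflow = None              # next bucket in the overflow chain

class StaticHashFile:
    """Fixed number of buckets; full buckets grow an overflow chain."""

    def __init__(self, num_buckets, capacity):
        self.capacity = capacity
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _h(self, key):
        return key % len(self.buckets)    # assumed hash function

    def insert(self, key, record):
        bucket = self.buckets[self._h(key)]
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:   # provide another overflow bucket
                bucket.overflow = Bucket(self.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        """Examine bucket h(key) and every bucket on its overflow chain."""
        matches = []
        bucket = self.buckets[self._h(key)]
        while bucket is not None:
            matches.extend(r for k, r in bucket.records if k == key)
            bucket = bucket.overflow
        return matches
```

With 4 buckets of capacity 2, the keys 1, 5, 9 and 13 all map to bucket 1, so the last two land in an overflow bucket; this is the skew case described above.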

Lookup algorithm

The system uses the hash function on the search key to identify a bucket b.

The system must examine all the records in bucket b to see whether they match the search key. If bucket b has overflow buckets, the system must examine the records in all the overflow buckets as well.

Closed hashing means that the set of buckets is fixed and overflow chains are used.

In open hashing, the set of buckets is fixed, and there are no overflow chains. If a bucket is full, the system inserts records in the next bucket (in cyclic order) that has space; this policy is called linear probing.

Open hashing has been used to construct symbol tables for compilers and assemblers, but closed hashing is preferable for database systems.

Hash indices



Hashing can be used not only for file organization, but also for index structure creation. A hash index organizes the search keys, with their associated pointers, into a hash file structure.

6. Briefly explain about spatial database.

A spatial database is a database that is optimized to store and query data related to objects in space, including points, lines and polygons. It is a collection of spatially referenced data that acts as a model of reality.

Spatial Databases Concept

• Keep track of objects in a multidimensional space, e.g. Geographical Information Systems (GIS), maps, weather.

• In general, spatial databases are multi-dimensional.

Applications of Spatial Databases

• Geographic Information Systems (GIS): e.g., ESRI’s ArcInfo; OpenGIS Consortium. Geospatial information. All classes of spatial queries and data are common.

• Computer-Aided Design/Manufacturing: store spatial objects such as the surface of an airplane fuselage. Range queries and spatial join queries are common.

• Multimedia Databases: images, video, text, etc. are stored and retrieved by content. They are first converted to feature vector form, which has high dimensionality. Nearest-neighbor queries are the most common.

Types of Spatial Data

• Point Data: points in a multidimensional space. E.g., raster data such as satellite imagery, where each pixel stores a measured value; feature vectors extracted from text.

• Region Data: objects have spatial extent with location and boundary. The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.
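Two of the query classes mentioned above, range queries and nearest-neighbor queries, can be illustrated on point data. This is a deliberately naive sketch that scans every point; the points and function names are made up, and a real spatial database would typically answer such queries through a spatial index rather than a linear scan.

```python
import math

def range_query(points, x_min, y_min, x_max, y_max):
    """All points inside the axis-aligned rectangle (borders included)."""
    return [p for p in points
            if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max]

def nearest_neighbor(points, query):
    """The point with the smallest Euclidean distance to the query point."""
    return min(points, key=lambda p: math.dist(p, query))

points = [(1, 1), (4, 5), (9, 2), (6, 6)]
inside = range_query(points, 0, 0, 5, 5)      # [(1, 1), (4, 5)]
closest = nearest_neighbor(points, (8, 3))    # (9, 2)
```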



UNIT V- ADVANCED TOPICS

PART A

1. What is the importance of database security?

• To prevent unauthorized data observation.

• To prevent unauthorized data modification.

• To ensure that the data remains confidential.

• To make sure the data integrity is preserved.

• To make sure only authorized users have access to the data.

2. What are the types of security?

Types of Security

– Legal and ethical issues

– Policy issues

– System-related issues

– The need to identify multiple security levels

3. Define data encryption?

A final security issue is data encryption, which is used to protect sensitive data (such as credit card numbers) that is being transmitted via some type of communication network.

4. State the role of DBA?

The database administrator (DBA) is the central authority for managing a database system. The DBA’s responsibilities include granting privileges to users who need to use the system and classifying users and data in accordance with the policy of the organization. The DBA has a DBA account in the DBMS, sometimes called a system or super user account, which provides powerful capabilities:

5. What restrictions are enforced on data access?

1. A subject S is not allowed read access to an object O unless class(S) ≥ class (O). This is known as the simple security property.

2. A subject S is not allowed to write an object O unless class(S) ≤ class (O). This is known as the star property (or * property).



6. What are the drawbacks of DAC model?

The main drawback of DAC models is their vulnerability to malicious attacks, such as Trojan horses embedded in application programs.

7. Define Distributed Database.

A logically interrelated collection of shared data (and a description of this data) physically distributed over a computer network.

8. What is Homogeneous and Heterogeneous DDBMS?

In a homogeneous system, all sites use the same DBMS product. In a heterogeneous system, sites may run different DBMS products.

9. What are the failures in distributed DBMS? i) The loss of a message ii) The failure of a communication link iii) The failure of a site

10. Define OODM, OODB, OODBMS. OODM: A logical data model that captures the semantics of objects supported in object-oriented programming. OODB: A persistent and sharable collection of objects defined by an OODM. OODBMS: The manager of an OODB.

11. What are the mandatory features proposed by Object Oriented Database System?

Encapsulation must be supported. Types or classes must be supported. Types or classes must be able to inherit from their ancestors. Dynamic binding must be supported. The DML must be computationally complete.

12. Define Association Rule? An association rule is of the form X => Y, where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of items, with xi and yj being distinct items for all i and j.

This association rule states that if a customer buys X, he or she is also likely to buy Y. In general, any association rule has the form LHS => RHS, where LHS and RHS are sets of items.



PART B

1. Explain about Distributed Databases:

A distributed database is a database physically stored in two or more computer systems. Although geographically dispersed, a distributed database system manages and controls the entire database as a single collection of data.

Distributed database architecture

A Distributed Database Management System (DDBMS) consists of a single logical database that is split into a number of fragments. Each fragment is stored on one or more computers under the control of a separate DBMS, with the computers connected by a communications network. Each site is capable of independently processing user requests that require access to local data and is also capable of processing data stored on other computers in the network.

In a homogeneous distributed database system, each database in the system is from the same vendor. In a heterogeneous distributed database system, at least one of the databases is from a different vendor.

Distributed processing refers to the operations that occur when an application distributes its tasks among different computers in a network.

Distributed database applications use distributed transactions to access both local and remote data and modify the global database in real-time.

Full replication is the scheme in which a copy is stored at every site in the system.

Data fragmentation

If relation r is fragmented, r is divided into a number of fragments r1, r2, ..., rn. These fragments contain sufficient information to allow reconstruction of the original relation r.

There are two different schemes for fragmenting a relation

• Horizontal fragmentation

• Vertical fragmentation

Horizontal fragmentation splits the relation by assigning each tuple of r to one or more fragments.

Vertical fragmentation splits the relation by decomposing the scheme R of relation r.
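Both schemes, and the reconstruction of r from its fragments, can be sketched on a toy relation. The ACCOUNT relation, its attribute names and the helper functions are assumptions made for illustration; horizontal fragments are rebuilt by union and vertical fragments by a join on the shared key.

```python
# Hypothetical relation r, stored as a list of tuples (dicts).
ACCOUNT = [
    {"account_no": "A-101", "branch": "Downtown", "balance": 500},
    {"account_no": "A-215", "branch": "Brighton", "balance": 700},
    {"account_no": "A-305", "branch": "Downtown", "balance": 350},
]

def horizontal_fragments(relation, attribute):
    """Assign each tuple to the fragment for its value of `attribute`."""
    fragments = {}
    for row in relation:
        fragments.setdefault(row[attribute], []).append(row)
    return fragments

def vertical_fragment(relation, attributes):
    """Decompose the schema: keep only the listed attributes."""
    return [{a: row[a] for a in attributes} for row in relation]

def reconstruct_horizontal(fragments):
    """Union of the horizontal fragments."""
    return [row for fragment in fragments.values() for row in fragment]

def reconstruct_vertical(left, right, key):
    """Join two vertical fragments on their common key attribute."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]
```

Note that each vertical fragment keeps the key account_no; without a shared key the join could not put the original tuples back together.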

Transparency



The user of a distributed database system should not be required to know either where the data are physically located or how the data can be accessed at the specific local site. This characteristic is called data transparency.

There are several forms of data transparency:

• Fragmentation transparency: Users are not required to know how a relation has been fragmented.

• Replication transparency: Users do not have to be concerned with what data objects have been replicated, or where replicas have been placed.

• Location transparency: Users are not required to know the physical location of the data.

A distributed system may suffer from

• Failure of a site

• Loss of messages

• Failure of a communication link

• Network partition

Advantages of DDBMS

• Reflects organizational structure

• Improved shareability

• Improved availability

• Improved reliability

• Improved performance

• Economics

• Modular growth

Disadvantages of DDBMS

• Complexity

• Cost

• Security

• Integrity control more difficult

• Lack of standards



• Lack of experience

• Database design more complex.

Characteristics of DDBMS

• A collection of logically related shared data

• The data is split into a number of fragments.

• Fragments may be replicated.

• Fragments / replicas are allocated to sites.

• The sites are linked by a communications network.

• The data at each site is under the control of a DBMS.

• The DBMS at each site can handle local applications autonomously.

• Each DBMS participates in at least one global application.

Functions of distributed DBMSs.

• Distributed query processing

• Data tracking

• Distributed transaction management

• Replicated data management

• Distributed data recovery

• Security

• Distributed catalog management.



2. Discuss briefly about the architecture of Data warehouse?

Three-Tier Data Warehouse Architecture

Generally the data warehouses adopt the three-tier architecture. Following are the three tiers of data warehouse architecture.

Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is a relational database system. Back-end tools and utilities are used to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.

Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways.

o By relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.

o By multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.

Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools and data mining tools.

Following diagram explains the Three-tier Architecture of Data warehouse:



Data Warehouse Models

From the perspective of data warehouse architecture we have the following data warehouse models:

• Virtual Warehouse

• Data Mart

• Enterprise Warehouse

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse.

Building the virtual warehouse requires excess capacity on operational database servers.

Data Mart

A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group within an organisation.

Note: in other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.



Points to remember about data marts

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

The implementation cycle of a data mart is short: it is measured in weeks rather than months or years.

The life cycle of a data mart may become complex in the long run if its planning and design are not organisation-wide.

Data marts are small in size. Data marts are customized by department. The source of a data mart is a departmentally structured data warehouse. Data marts are flexible.

Enterprise Warehouse

The enterprise warehouse collects all the information about all the subjects spanning the entire organization.

This provides enterprise-wide data integration. The data is integrated from operational systems and external information providers. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

3. Discuss about access control in DB security?

The discretionary access control technique of granting and revoking privileges on relations has traditionally been the main security mechanism for relational database systems.

This is an all-or-nothing method: A user either has or does not have a certain privilege.

In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms.

Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U

The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O).

Two restrictions are enforced on data access based on the subject/object classifications:



1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the simple security property.

2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or * property).
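The two restrictions can be sketched as simple comparisons over the ordering TS ≥ S ≥ C ≥ U. The numeric LEVELS mapping and the function names are assumptions made for this illustration.

```python
# Assumed numeric encoding of the ordering TS >= S >= C >= U.
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}

def can_read(subject_class, object_class):
    """Simple security property: read only if class(S) >= class(O)."""
    return LEVELS[subject_class] >= LEVELS[object_class]

def can_write(subject_class, object_class):
    """Star property: write only if class(S) <= class(O)."""
    return LEVELS[subject_class] <= LEVELS[object_class]
```

For example, a subject with clearance S may read a C object but may not write to it; the star property thereby stops information from flowing from a higher classification to a lower one.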

To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. Hence, a multilevel relation schema R with n attributes would be represented as

R(A1,C1,A2,C2, …, An,Cn,TC)

where each Ci represents the classification attribute associated with attribute Ai.

The value of the TC attribute in each tuple t – which is the highest of all attribute classification values within t – provides a general classification for the tuple itself, whereas each Ci provides a finer security classification for each attribute value within the tuple.

The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular(single-level) relation.

4. Discuss about object oriented database?

MAIN CLAIM: OO databases try to maintain a direct correspondence between real-world and database objects so that objects do not lose their integrity and identity and can easily be identified and operated upon

Object: Two components: state (value) and behavior (operations). Similar to program variable in programming language, except that it will typically have a complex data structure as well as specific operations defined by the programmer

In OO databases, objects may have an object structure of arbitrary complexity in order to contain all of the necessary information that describes the object.

In contrast, in traditional database systems, information about a complex object is often scattered over many relations or records, leading to loss of direct correspondence between a real-world object and its database representation



The internal structure of an object in OOPLs includes the specification of instance variables, which hold the values that define the internal state of the object.

An instance variable is similar to the concept of an attribute, except that instance variables may be encapsulated within the object and thus are not necessarily visible to external users

Some OO models insist that all operations a user can apply to an object must be predefined. This forces a complete encapsulation of objects.

To encourage encapsulation, an operation is defined in two parts:

signature or interface of the operation, specifies the operation name and arguments (or parameters).

method or body, specifies the implementation of the operation.

Operations can be invoked by passing a message to an object, which includes the operation name and the parameters. The object then executes the method for that operation.

This encapsulation permits modification of the internal structure of an object, as well as the implementation of its operations, without the need to disturb the external programs that invoke these operations

Some OO systems provide capabilities for dealing with multiple versions of the same object (a feature that is essential in design and engineering applications).

For example, an old version of an object that represents a tested and verified design should be retained until the new version is tested and verified. This is crucial for designs in manufacturing process control, architecture, and software systems.

Operator polymorphism: It refers to an operation’s ability to be applied to different types of objects; in such a situation, an operation name may refer to several distinct implementations, depending on the type of objects it is applied to.

This feature is also called operator overloading
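Encapsulation and operator polymorphism can be illustrated with two small classes. The Circle and Rectangle classes are made up for this sketch; the operation name area is the shared signature, and each class supplies its own method (body).

```python
import math

class Circle:
    def __init__(self, radius):
        self._radius = radius             # instance variable, hidden by convention

    def area(self):                       # signature: operation name "area"
        return math.pi * self._radius ** 2    # method: circle implementation

class Rectangle:
    def __init__(self, width, height):
        self._width = width
        self._height = height

    def area(self):                       # same operation name
        return self._width * self._height     # different implementation

def total_area(shapes):
    """Sends the same message to each object; the object's type decides
    which implementation of area() runs."""
    return sum(shape.area() for shape in shapes)
```

A caller such as total_area depends only on the operation name, so the internal structure of either class can change without disturbing the programs that invoke area().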

An object-oriented database system is a database system that incorporates all the important object-oriented concepts.

Some additional features

Unique Object identifiers



Persistent object handling

Advantages of OODBS

Designer can specify the structure of objects and their behavior (methods)

Better interaction with object-oriented languages such as Java and C++

Definition of complex and user-defined types

Encapsulation of operations and user-defined methods

Object Query Language (OQL)

Declarative query language

Not computationally complete

Syntax based on SQL (select, from, where)

Additional flexibility (queries with user defined operators and types)

5. Describe about association rules in data mining?

Association Rule Mining

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications

Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Examples

Rule form: “Body => Head [support, confidence]”.

buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]

major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]

Confidence (A => B) = #tuples containing both A & B / #tuples containing A = P(B|A) = P(A U B) / P(A)

Support (A => B) = #tuples containing both A & B / total number of tuples = P(A U B)

Here P(A U B) denotes the probability that a tuple contains every item of both A and B, i.e. the union of the two itemsets.
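The two measures can be computed directly from their definitions. In this sketch the four transactions are made-up data and the function names are illustrative.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(LHS and RHS together) / support(LHS)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)
```

For the rule {diapers} => {beer}, two of the four transactions contain both items (support 0.5) and three contain diapers, so the confidence is 2/3.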



Key Concepts :

• Frequent Itemsets: The sets of items that have minimum support (denoted by Li for the i-itemsets).

• Apriori Property: Any subset of a frequent itemset must be frequent.

• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

• Join Step: Ck is generated by joining Lk-1 with itself.

• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

• Pseudo-code:

Ck: Candidate itemset of size k

Lk : frequent itemset of size k

L1 = {frequent items};

for (k = 1; Lk != ∅; k++) do begin

Ck+1 = candidates generated from Lk;

for each transaction t in database do

increment the count of all candidates in Ck+1 that are contained in t

Lk+1 = candidates in Ck+1 with min_support

end

return ∪k Lk;
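The pseudo-code above can be turned into a short runnable sketch. The Python version below is one possible reading, under stated assumptions: the function name apriori and the parameter min_support_count (an absolute count rather than a percentage) are hypothetical, and the join step is implemented simply by taking unions of pairs from Lk that have exactly k+1 items.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def apriori(transactions: Iterable[Iterable[str]],
            min_support_count: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset together with its support count."""
    db = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting the minimum support.
    counts: Dict[FrozenSet[str], int] = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}

    frequent = dict(current)
    k = 1
    while current:  # loop while Lk is not empty
        # Join step: combine pairs from Lk into candidate (k+1)-itemsets.
        candidates: Set[FrozenSet[str]] = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must be in Lk.
            if all(frozenset(sub) in current for sub in combinations(union, k)):
                candidates.add(union)

        # Scan the database and count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Lk+1: candidates with minimum support.
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent.update(current)
        k += 1

    return frequent  # union of all Lk
```

For example, calling apriori([{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}], 2) on this hypothetical database yields nine frequent itemsets, the largest being {B, C, E} with a count of 2.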

6. Discuss data mining architecture.

Data mining (knowledge discovery in databases):

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

7 Data Mining Steps

1. Data cleaning – remove noise and inconsistent data

2. Data integration – combine multiple sources

3. Data selection – retrieve from the database data relevant to the analysis task

4. Data transformation – data are transformed or consolidated into forms appropriate for mining (e.g. performing summary or aggregation operations)

5. Data mining – intelligent methods are applied to extract data patterns

6. Pattern evaluation – identify truly interesting patterns representing knowledge based on some interestingness measures

7. Knowledge presentation – present mined knowledge to the user
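The seven steps above can be sketched as a chain of small functions. This is only an illustrative Python sketch: the two source lists, the field names, and every function name are hypothetical, and the "mining" step is a trivial stand-in (ranking items by total quantity) rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw data from two sources, containing some noise.
store_a = [{"id": 1, "item": "milk", "qty": 2}, {"id": 2, "item": None, "qty": 1}]
store_b = [{"id": 3, "item": "milk", "qty": 1}, {"id": 4, "item": "bread", "qty": -5}]


def clean(rows):
    # 1. Data cleaning: drop rows with a missing item or an impossible quantity.
    return [r for r in rows if r["item"] is not None and r["qty"] > 0]


def integrate(*sources):
    # 2. Data integration: combine multiple sources into one list.
    return [r for source in sources for r in source]


def select(rows):
    # 3. Data selection: keep only the fields relevant to the analysis task.
    return [(r["item"], r["qty"]) for r in rows]


def transform(pairs):
    # 4. Data transformation: aggregate quantities per item.
    totals = Counter()
    for item, qty in pairs:
        totals[item] += qty
    return totals


def mine(totals):
    # 5. Data mining (stand-in): rank items by total quantity.
    return totals.most_common()


def evaluate(patterns, min_qty):
    # 6. Pattern evaluation: keep patterns that pass an interestingness threshold.
    return [(item, qty) for item, qty in patterns if qty >= min_qty]


def present(patterns):
    # 7. Knowledge presentation: format the result for the user.
    return [f"{item}: {qty} units sold" for item, qty in patterns]


report = present(evaluate(mine(transform(select(
    integrate(clean(store_a), clean(store_b))))), min_qty=2))
```

Each function corresponds to one numbered step, so the nesting order of the final call mirrors the order of the list.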

Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of the application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of the effort)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining: summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge



