CS 426 - Advanced Database
DBMS (Database Management System)
A database-management system (DBMS) is a collection of interrelated data and a set of programs to access those data. The primary goal of a DBMS is to provide an environment that is both convenient and efficient to use.
Database System Applications
Databases are widely used. Here are some representative applications:
• Enterprise Information: sales, accounting, human resources, manufacturing, online retailers
• Banking and Finance: banking, credit-card transactions, finance
• Universities, Airlines, and Telecommunications
Data Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or manipulate data as organized by the appropriate data model. The types of access are:
• Retrieval of information stored in the database
• Insertion of new information into the database
• Deletion of information from the database
• Modification of information stored in the database
Example:
select instructor.name from instructor where instructor.dept_name = 'History';
By Dr. Nitin S. Goje Page 1 | 49
Data Definition Language
A database schema is specified by a set of definitions expressed in a special language called a data-definition language (DDL).
The DDL is also used to specify additional properties of the data.
Example:
create table department (dept_name char (20), building char (15), budget numeric (12,2));
Relational Database
A relational database is based on the relational model and uses a collection of tables to represent both data and the relationships among those data.
Each table has multiple columns and each column has a unique name.
Example:
Structured Query Language
SQL can define the structure of the data, modify data in the database, and specify security constraints.
The SQL language has several parts:
• Data-definition language (DDL). The SQL DDL provides commands for defining relation schemas, deleting relations, and modifying relation schemas.
• Data-manipulation language (DML). The SQL DML provides the ability to query information from the database and to insert tuples into, delete tuples from, and modify tuples in the database.
• Integrity. The SQL DDL includes commands for specifying integrity constraints that the data stored in the database must satisfy. Updates that violate integrity constraints are disallowed.
• View definition. The SQL DDL includes commands for defining views.
• Transaction control. SQL includes commands for specifying the beginning and ending of transactions.
• Embedded SQL and dynamic SQL. Embedded and dynamic SQL define how SQL statements can be embedded within general-purpose programming languages, such as C, C++, and Java.
• Authorization. The SQL DDL includes commands for specifying access rights to relations and views.
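The DDL, DML, and transaction-control parts listed above can be exercised end to end. The sketch below uses Python's sqlite3 module as an illustrative host (SQLite is an assumption for demonstration, not part of the notes; the department table mirrors the DDL example above):

```python
import sqlite3

# A minimal end-to-end sketch of the SQL parts above, using SQLite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a relation schema.
cur.execute("create table department "
            "(dept_name char(20), building char(15), budget numeric(12,2))")

# DML: insert tuples and query them.
cur.execute("insert into department values ('History', 'Painter', 50000)")
cur.execute("insert into department values ('Physics', 'Watson', 70000)")
rows = cur.execute(
    "select dept_name from department where budget > 60000").fetchall()

# Transaction control: commit makes the updates durable.
conn.commit()
print(rows)  # [('Physics',)]
```

The same statements would run, with minor dialect differences, on any relational DBMS.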
SQL Data Types
The SQL standard supports a variety of built-in types, including:
• char(n): A fixed-length character string with user-specified length n. The full form, character, can be used instead.
• varchar(n): A variable-length character string with user-specified maximum length n. The full form, character varying, is equivalent.
• int: An integer (a finite subset of the integers that is machine dependent). The full form, integer, is equivalent.
• smallint: A small integer (a machine-dependent subset of the integer type).
• numeric(p, d): A fixed-point number with user-specified precision. The number consists of p digits (plus a sign), and d of the p digits are to the right of the decimal point. Thus, numeric(3,1) allows 44.5 to be stored exactly, but neither 444.5 nor 0.32 can be stored exactly in a field of this type.
• real, double precision: Floating-point and double-precision floating-point numbers with machine-dependent precision.
• float(n): A floating-point number, with precision of at least n digits.
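The numeric(3,1) rule above can be checked mechanically. The predicate below is a hypothetical helper (not part of any SQL standard) that tests whether a value fits in a numeric(3,1) field, using Python's exact decimal arithmetic:

```python
from decimal import Decimal

# Hypothetical check for the numeric(3,1) type described above:
# at most 3 significant digits, 1 of them after the decimal point.
def fits_numeric_3_1(value: str) -> bool:
    d = Decimal(value)
    scaled = d.scaleb(1)  # shift the decimal point one place right
    # exact iff no digits remain after the shift, and at most 3 digits total
    return scaled == scaled.to_integral_value() and abs(scaled) < 1000

print(fits_numeric_3_1("44.5"))   # True  -- stored exactly
print(fits_numeric_3_1("444.5"))  # False -- too many digits
print(fits_numeric_3_1("0.32"))   # False -- too many fractional digits
```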
Advanced SQL
1. Accessing SQL From a Programming Language
There are two approaches to accessing SQL from a general-purpose programming language:
• Dynamic SQL:
Dynamic SQL allows the program to construct an SQL query as a character string at runtime, submit the query, and then retrieve the result into program variables a tuple at a time.
• Embedded SQL:
Like dynamic SQL, embedded SQL provides a means by which a program can interact with a database server.
However, under embedded SQL, the SQL statements are identified at compile time using a preprocessor.
The preprocessor submits the SQL statements to the database system for pre-compilation and optimization; then it replaces the SQL statements in the application program with appropriate code and function calls before invoking the programming-language compiler.
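Dynamic SQL as described above can be sketched with Python's sqlite3 module (an illustrative assumption; the instructor table and its contents are made up for the example). The query string is assembled at runtime, and results are fetched into program variables one tuple at a time:

```python
import sqlite3

# Illustrative setup: an in-memory instructor relation.
conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text, dept_name text)")
conn.executemany("insert into instructor values (?, ?)",
                 [("El Said", "History"), ("Katz", "Comp. Sci.")])

# Dynamic SQL: the query text is constructed at runtime...
table, column = "instructor", "dept_name"   # known only at runtime
query = f"select name from {table} where {column} = ?"
cur = conn.execute(query, ("History",))

# ...and the result is retrieved a tuple at a time.
names = []
row = cur.fetchone()
while row is not None:
    names.append(row[0])
    row = cur.fetchone()
print(names)  # ['El Said']
```

Embedded SQL differs in that the statements are fixed at compile time and translated by a preprocessor rather than assembled as strings.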
2. Functions and Procedures
Procedures and functions allow “business logic” to be stored in the database, and executed from SQL statements.
Example:
create function dept_count(dept_name varchar(20))
returns integer
begin
declare d_count integer;
select count(*) into d_count
from instructor
where instructor.dept_name = dept_count.dept_name;
return d_count;
end
3. Triggers
A trigger is a statement that the system executes automatically as a side effect of a modification to the database. To design a trigger mechanism, we must meet two requirements:
1. Specify when a trigger is to be executed. This is broken up into an event that causes the trigger to be checked and a condition that must be satisfied for trigger execution to proceed.
2. Specify the actions to be taken when the trigger executes.
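The two requirements above (event + condition, then action) map directly onto trigger syntax. The sketch below uses SQLite via Python as an illustrative host, with made-up section and takes tables; the event is an insert on takes, the condition is the when clause, and the action updates the enrollment count:

```python
import sqlite3

# Illustrative trigger: event = insert on takes, condition = when clause,
# action = maintain the enrollment count in section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table section (course_id text, enrollment int);
create table takes (student_id text, course_id text);
insert into section values ('CS-426', 0);

create trigger takes_insert after insert on takes
when new.course_id is not null
begin
    update section set enrollment = enrollment + 1
    where course_id = new.course_id;
end;
""")

# The system executes the trigger automatically as a side effect:
conn.execute("insert into takes values ('S1', 'CS-426')")
enrollment = conn.execute(
    "select enrollment from section where course_id = 'CS-426'").fetchone()[0]
print(enrollment)  # 1
```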
Object Models
An object typically has two components: state (value) and behavior (operations).
One goal of an ODMS (Object Data Management System) is to maintain a direct correspondence between real-world and database objects so that objects do not lose their integrity and identity and can easily be identified and operated upon.
define type EMPLOYEE
tuple ( Fname:      string;
        Minit:      char;
        Lname:      string;
        Ssn:        string;
        Birth_date: DATE;
        Address:    string;
        Sex:        char;
        Salary:     float;
        Supervisor: EMPLOYEE;
        Dept:       DEPARTMENT; );

define type DATE
tuple ( Year:  integer;
        Month: integer;
        Day:   integer; );
Object Data Management Group (ODMG)
The ODMG object model is the data model upon which the object definition language (ODL) and object query language (OQL) are based.
It is meant to provide a standard data model for object databases, just as SQL describes a standard data model for relational databases.
It also provides a standard terminology in a field where the same terms were sometimes used to describe different concepts.
Object Definition Language
The ODL is designed to support the semantic constructs of the ODMG object model and is independent of any particular programming language.
Its main use is to create object specifications—that is, classes and interfaces. Hence, ODL is not a full programming language.
A user can specify a database schema in ODL independently of any programming language, and then use the specific language bindings to specify how ODL constructs can be mapped to constructs in specific programming languages, such as C++, Smalltalk, and Java.
Object Query Language
The object query language OQL is the query language proposed for the ODMG object model.
It is designed to work closely with the programming languages for which an ODMG binding is defined, such as C++, Smalltalk, and Java.
Hence, an OQL query embedded into one of these programming languages can return objects that match the type system of that language.
Additionally, the implementations of class operations in an ODMG schema can have their code written in these programming languages.
The OQL syntax for queries is similar to the syntax of the relational standard query language SQL, with additional features for ODMG concepts, such as object identity, complex objects, operations, inheritance, polymorphism, and relationships.
Transaction
Collections of operations that form a single logical unit of work are called transactions.
A database system must ensure proper execution of transactions despite failures—either the entire transaction executes, or none of it does.
Furthermore, it must manage concurrent execution of transactions in a way that avoids the introduction of inconsistency.
A transaction is a unit of program execution that accesses and possibly updates various data items.
Usually, a transaction is initiated by a user program written in a high-level data-manipulation language (typically SQL), or programming language (for example, C++, or Java), with embedded database accesses in JDBC or ODBC.
Begin Transaction and End Transaction
A transaction is delimited by statements (or function calls) of the form begin transaction and end transaction. The transaction consists of all operations executed between the begin transaction and the end transaction.
Example
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined as:
Ti : read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
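The transfer Ti above can be run as one atomic unit of work. The sketch below uses Python's sqlite3 module as an illustrative host (the account table and balances are assumptions for the example); if any step fails, the rollback leaves both accounts unchanged:

```python
import sqlite3

# Illustrative accounts for the transfer transaction Ti above.
conn = sqlite3.connect(":memory:")
conn.execute("create table account (id text primary key, balance int)")
conn.executemany("insert into account values (?, ?)",
                 [("A", 100), ("B", 200)])

try:
    with conn:  # the with-block commits on success, rolls back on error
        conn.execute("update account set balance = balance - 50 where id = 'A'")
        conn.execute("update account set balance = balance + 50 where id = 'B'")
except sqlite3.Error:
    pass  # on failure, neither update is reflected in the database

balances = dict(conn.execute("select id, balance from account"))
print(balances)  # {'A': 50, 'B': 250}
```

Either both updates survive or neither does, which is precisely the atomicity property discussed next.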
Properties of Transactions
Atomicity - Either all operations of the transaction are reflected properly in the database, or none are.
Consistency - Execution of a transaction in isolation (that is, with no other transaction executing concurrently) preserves the consistency of the database.
Isolation - Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
Durability - After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
States of a Transaction
A transaction must be in one of the following states:
Active, the initial state; the transaction stays in this state while it is executing.
Partially committed, after the final statement has been executed.
Failed, after the discovery that normal execution can no longer proceed.
Aborted, after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
Committed, after successful completion.
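The five states above form a small state machine. A sketch of its legal transitions (the encoding is an illustration, not part of any DBMS API):

```python
# Legal transitions between the five transaction states described above.
TRANSITIONS = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "committed":           set(),
    "aborted":             set(),
}

def run(history):
    """Check that a sequence of states is a legal life of a transaction."""
    state = "active"  # every transaction starts in the active state
    for nxt in history:
        if nxt not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {nxt}")
        state = nxt
    return state

print(run(["partially committed", "committed"]))  # committed
print(run(["failed", "aborted"]))                 # aborted
```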
States Diagram of Transactions
We say that a transaction has committed only if it has entered the committed state. Similarly, we say that a transaction has aborted only if it has entered the aborted state. A transaction is said to have terminated if it has either committed or aborted.
A transaction starts in the active state. When it finishes its final statement, it enters the partially committed state. At this point, the transaction has completed its execution, but it is still possible that it may have to be aborted, since the actual output may still be temporarily residing in main memory, and thus a hardware failure may preclude its successful completion.
The database system then writes out enough information to disk that, even in the event of a failure, the updates performed by the transaction can be re-created when the system restarts after the failure.
When the last of this information is written out, the transaction enters the committed state.
Concurrency Control
When several transactions execute concurrently in the database, however, the isolation property may no longer be preserved. To ensure that it is, the system must control the interaction among the concurrent transactions; this control is achieved through one of a variety of mechanisms called concurrency-control schemes.
In practice, the most frequently used schemes are:
• Two-phase locking and
• Snapshot isolation.
Lock-Based Protocols
One way to ensure isolation is to require that data items be accessed in a mutually exclusive manner; that is, while one transaction is accessing a data item, no other transaction can modify that data item.
The most common method used to implement this requirement is to allow a transaction to access a data item only if it is currently holding a lock on that item.
Timestamp-Based Protocols
The locking protocols that we have described thus far determine the order between every pair of conflicting transactions at execution time, by the first lock that both members of the pair request that involves incompatible modes.
Another method for determining the serializability order is to select an ordering among transactions in advance. The most common method for doing so is to use a timestamp-ordering scheme.
The Two-Phase Locking Protocol
One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction issue lock and unlock requests in two phases:
◦ 1. Growing phase. A transaction may obtain locks, but may not release any lock.
◦ 2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.
Initially, a transaction is in the growing phase. The transaction acquires locks as needed. Once the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock requests.
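The two-phase rule can be sketched as a per-transaction check (a minimal illustration, not a full lock manager: lock compatibility and waiting between transactions are omitted):

```python
# A minimal sketch of the two-phase rule: once a transaction has released
# any lock, it is in the shrinking phase and may acquire no new locks.
class Transaction:
    def __init__(self):
        self.locks = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("two-phase rule: no locks after first unlock")
        self.locks.add(item)        # growing phase

    def unlock(self, item):
        self.locks.remove(item)
        self.shrinking = True       # shrinking phase begins

t = Transaction()
t.lock("A")
t.lock("B")      # growing phase: acquire locks as needed
t.unlock("A")    # first unlock: shrinking phase begins
try:
    t.lock("C")  # violates the two-phase rule
except RuntimeError as e:
    print(e)
```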
Snapshot Isolation
Snapshot isolation involves giving a transaction a “snapshot” of the database at the time when it begins its execution. It then operates on that snapshot in complete isolation from concurrent transactions. The data values in the snapshot consist only of values written by committed transactions.
This isolation is ideal for read-only transactions, since they never wait and are never aborted by the concurrency manager.
Transactions that update the database must, of course, interact with potentially conflicting concurrent update transactions before updates are actually placed in the database. Updates are kept in the transaction’s private workspace until the transaction successfully commits, at which point the updates are written to the database.
When a transaction T is allowed to commit, the transition of T to the committed state and the writing of all of the updates made by T to the database must be done as an atomic action, so that any snapshot created for another transaction either includes all updates by transaction T or none of them.
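The mechanism above can be sketched in a few lines: each transaction reads from a snapshot taken at start, buffers writes in a private workspace, and at commit the writes are installed atomically. This sketch resolves conflicting updates by aborting the later committer (a simplified value-based check, assumed for illustration; real systems track versions):

```python
# A sketch of snapshot isolation with a private write workspace.
class SnapshotDB:
    def __init__(self):
        self.data = {}  # committed values only

    def begin(self):
        # the transaction sees the database as of its start
        return {"snapshot": dict(self.data), "writes": {}}

    def read(self, txn, key):
        return txn["writes"].get(key, txn["snapshot"].get(key))

    def write(self, txn, key, value):
        txn["writes"][key] = value  # kept in the private workspace

    def commit(self, txn):
        # abort if a conflicting update committed since this txn began
        for key in txn["writes"]:
            if self.data.get(key) != txn["snapshot"].get(key):
                return False
        self.data.update(txn["writes"])  # installed as one atomic action
        return True

db = SnapshotDB()
db.data["x"] = 1
t1, t2 = db.begin(), db.begin()          # both see x = 1
db.write(t1, "x", db.read(t1, "x") + 1)
db.write(t2, "x", db.read(t2, "x") + 10)
print(db.commit(t1), db.commit(t2))      # True False -- t2 is aborted
```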
Recovery System
A computer system, like any other device, is subject to failure from a variety of causes: disk crash, power outage, software error, a fire in the machine room, even sabotage. In any failure, information may be lost. Therefore, the database system must take actions in advance to ensure that the atomicity and durability properties of transactions are preserved.
An integral part of a database system is a recovery scheme that can restore the database to the consistent state that existed before the failure.
Failure Classification
There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner. We shall consider only the following types of failure:
• Transaction failure. There are two types of errors that may cause a transaction to fail:
◦ Logical error. The transaction can no longer continue with its normal execution because of some internal condition, such as bad input, data not found, overflow, or resource limit exceeded.
◦ System error. The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be re-executed at a later time.
• System crash. There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage and brings transaction processing to a halt. The content of nonvolatile storage remains intact and is not corrupted.
The assumption that hardware errors and bugs in the software bring the system to a halt, but do not corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have numerous internal checks, at the hardware and the software level, that bring the system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.
• Disk failure. A disk block loses its content as a result of either a head crash or a failure during a data-transfer operation. Copies of the data on other disks, or archival backups on tertiary media such as DVDs or tapes, are used to recover from the failure.
Week-7
Revision From Week-1 to Week-6
TEST-1
Week-8
Database System Architecture:
The architecture of a database system determines its capability, reliability, effectiveness, and efficiency in meeting user requirements. Besides the visible functions seen through some data-manipulation language, a good database architecture should provide:
a) Independence of data and programs
b) Ease of system design
c) Ease of programming
d) Powerful query facilities
e) Protection of data
Centralized System: Runs on a single computer system and does not interact with other computer systems.
• General-purpose computer system: one to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory.
• Single-user system (e.g., personal computer or workstation): desk-top unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.
• Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serves a large number of users who are connected to the system via terminals. Often called server systems.
Fig: A Centralized Computer System
Client-Server Systems:
Centralized systems act as server systems that satisfy requests generated by client systems.

Fig: General Structure for a Client-Server System

Database functionality can be divided into:
• Back-end: manages access structures, query evaluation and optimization, concurrency control, and recovery.
• Front-end: consists of tools such as forms, report writers, and graphical user interface facilities.
The interface between the front-end and the back-end is through SQL or through an application program interface.
Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines:
• better functionality for the cost
• flexibility in locating resources and expanding facilities
• better user interfaces
• easier maintenance
Server System Architecture:
Server systems can be broadly categorized into two kinds:
• transaction servers, which are widely used in relational database systems, and
• data servers, used in object-oriented database systems.
Transaction Servers: Also called query server systems or SQL server systems
Clients send requests to the server
Transactions are executed at the server
Results are shipped back to the client.
Requests are specified in SQL and communicated to the server through a remote procedure call (RPC) mechanism. Transactional RPC allows many RPC calls to collectively form a transaction.
Data Servers:
Used in high-speed LANs, in cases where:
• the clients are comparable in processing power to the server, and
• the tasks to be executed are compute-intensive.
Data are shipped to clients, where processing is performed, and results are then shipped back to the server. This architecture requires full back-end functionality at the clients. Used in many object-oriented database systems.
Issues:
Page-Shipping versus Item-Shipping
Locking
Data Caching
Lock Caching
Parallel Systems:
Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network.
• A coarse-grain parallel machine consists of a small number of powerful processors.
• A massively parallel or fine-grain parallel machine utilizes thousands of smaller processors.
Two main performance measures:
• throughput --- the number of tasks that can be completed in a given time interval
• response time --- the amount of time it takes to complete a single task from the time it is submitted
Speed-Up and Scale-Up
Speedup: a fixed-sized problem executing on a small system is given to a system that is N times larger.
Measured by: speedup = TS / TL, where TS is the execution time of the task on the smaller system and TL is the execution time of the same task on the larger system.
Speedup is linear if the equation equals N.
Scaleup: increase the size of both the problem and the system; an N-times larger system is used to perform an N-times larger job.
Measured by: scaleup = TS / TL, where TS is the execution time of the small task on the smaller system and TL is the execution time of the large task on the larger system.
Scaleup is linear if the equation equals 1.
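A small worked example of the two measures (the execution times are made-up illustrations):

```python
# Worked example of the speedup and scaleup measures above.
def speedup(t_small, t_large):
    """Same problem: time on the small system / time on the larger system."""
    return t_small / t_large

def scaleup(t_small_job_small_sys, t_large_job_large_sys):
    """N-times larger problem run on an N-times larger system."""
    return t_small_job_small_sys / t_large_job_large_sys

# A 4x larger system solves the same problem in 100s instead of 400s:
print(speedup(400, 100))   # 4.0 -> linear speedup for N = 4
# The 4x system takes 110s on a 4x-larger job that took 100s before:
print(scaleup(100, 110))   # about 0.91 -> sub-linear scaleup
```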
Interconnection Network Architectures
• Bus. System components send data on and receive data from a single communication bus.
• Mesh. Components are arranged as nodes in a grid, and each component is connected to all adjacent components. The number of communication links grows with the number of components, and so a mesh scales better than a bus.
• Hypercube. Components are numbered in binary; components are connected to one another if their binary representations differ in exactly one bit.
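The hypercube rule above is easy to state in code: two nodes are neighbors exactly when the XOR of their numbers has a single 1-bit.

```python
# Hypercube connectivity: nodes i and j are connected iff their binary
# representations differ in exactly one bit.
def hypercube_neighbors(i: int, j: int) -> bool:
    return bin(i ^ j).count("1") == 1

# In a 3-dimensional hypercube (nodes 0..7):
print(hypercube_neighbors(0b000, 0b001))  # True  (differ in one bit)
print(hypercube_neighbors(0b000, 0b011))  # False (differ in two bits)
# Node 5 (101) is connected to 4 (100), 7 (111), and 1 (001):
print(sorted(j for j in range(8) if hypercube_neighbors(0b101, j)))  # [1, 4, 7]
```

Each node in an n-dimensional hypercube thus has exactly n neighbors, one per bit position.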
Fig: Interconnection Architectures
Parallel Database Architectures:
• Shared memory -- processors share a common memory
• Shared disk -- processors share a common disk
• Shared nothing -- processors share neither a common memory nor a common disk
• Hierarchical -- hybrid of the above architectures
Fig: Parallel Database Architectures

Distributed Systems:
Data spread over multiple machines (also referred to as sites or nodes).
Network interconnects the machines
Data shared by users on multiple machines
Fig: A Distributed System
Network Types:
• Local-area networks (LANs) – composed of processors that are distributed over small geographical areas, such as a single building or a few adjacent buildings.
• Wide-area networks (WANs) – composed of processors distributed over a large geographical area.
Local-Area Network:
WANs with continuous connection (e.g., the Internet) are needed for implementing distributed database systems.
Groupware applications such as Lotus Notes can work on WANs with discontinuous connection:
• Data is replicated.
• Updates are propagated to replicas periodically.
• Copies of data may be updated independently.
• Non-serializable executions can thus result. Resolution is application dependent.
Week-9
Parallel Databases:
Introduction:
Parallel machines are becoming quite common and affordable:
• Prices of microprocessors, memory, and disks have dropped sharply.
• Recent desktop computers feature multiple processors, and this trend is projected to accelerate.
Databases are growing increasingly large:
• Large volumes of transaction data are collected and stored for later analysis.
• Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data
• processing time-consuming decision-support queries
• providing high throughput for transaction processing
Parallelism in Databases:
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism:
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
• Round-robin: Send the i-th tuple inserted in the relation to disk i mod n.
• Hash partitioning: Choose one or more attributes as the partitioning attributes. Choose a hash function h with range 0…n - 1. Let i denote the result of hash function h applied to the partitioning-attribute value of a tuple. Send the tuple to disk i.
• Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning-attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v ≥ vn-2 go to disk n - 1.
E.g., with a partitioning vector [5,11], a tuple with partitioning-attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
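The three techniques above can be sketched for n = 3 disks (the tuples and the partitioning vector [5, 11] follow the example in the text; the helper functions are illustrative, not any system's API):

```python
# Sketches of the three I/O partitioning techniques for n = 3 disks.
n = 3

def round_robin(tuples):
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):       # i-th inserted tuple -> disk i mod n
        disks[i % n].append(t)
    return disks

def hash_partition(tuples, key=lambda t: t):
    disks = [[] for _ in range(n)]
    for t in tuples:                     # h(v) in 0..n-1 selects the disk
        disks[hash(key(t)) % n].append(t)
    return disks

def range_partition(tuples, vector=(5, 11), key=lambda t: t):
    disks = [[] for _ in range(n)]
    for t in tuples:
        v = key(t)
        disk = sum(1 for bound in vector if v >= bound)  # count bounds passed
        disks[disk].append(t)
    return disks

# The worked example from the text: vector [5, 11], values 2, 8, 20.
print(range_partition([2, 8, 20]))  # [[2], [8], [20]]
```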
Comparison of Partitioning Techniques:
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
Round-robin:
Advantages:
• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process.
• No clustering -- tuples are scattered across all disks.
Hash partitioning:
• Good for sequential access: assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute: can look up a single disk, leaving others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so difficult to answer range queries.
Range partitioning:
• Provides data clustering by partitioning-attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries. Good if result tuples are from one to a few blocks. If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted -- an example of execution skew.
Interquery Parallelism:
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures:
• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained -- reads and writes of data in buffer must find the latest version of the data.
Intraquery Parallelism:
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
• Intraoperation parallelism – parallelize the execution of each individual operation in the query.
• Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism because the number of tuples processed by each operation is typically greater than the number of operations in a query.
Week-10
Distributed Database System:
A distributed database system consists of loosely coupled sites that share no physical component.
• Database systems that run on each site are independent of each other.
• Transactions may access data at one or more sites.
Homogeneous Distributed Databases:
In a homogeneous distributed database:
• All sites have identical software.
• Sites are aware of each other and agree to cooperate in processing user requests.
• Each site surrenders part of its autonomy in terms of the right to change schemas or software.
• The system appears to the user as a single system.
Heterogeneous Distributed Databases:
In a heterogeneous distributed database:
• Different sites may use different schemas and software.
• Difference in schema is a major problem for query processing.
• Difference in software is a major problem for transaction processing.
• Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.
Distributed Data Storage:
Assume a relational data model.
• Replication: the system maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance.
• Fragmentation: a relation is partitioned into several fragments stored in distinct sites.
• Replication and fragmentation can be combined: a relation is partitioned into several fragments, and the system maintains several identical replicas of each such fragment.
Data Replication:
A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.
Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication:
• Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
• Parallelism: queries on r may be processed by several nodes in parallel.
• Reduced data transfer: relation r is available locally at each site containing a replica of r.
Disadvantages of Replication:
• Increased cost of updates: each replica of relation r must be updated.
• Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency-control mechanisms are implemented.
One solution: choose one copy as the primary copy and apply concurrency-control operations on the primary copy.
Data Fragmentation:
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to one or more fragments.
• Vertical fragmentation: the schema for relation r is split into several smaller schemas.
All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property. A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.
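Both fragmentation styles can be sketched on a small account relation (the tuples and attribute names are illustrative, chosen to match the account example referenced below):

```python
# Sketch of horizontal and vertical fragmentation of an account relation.
account = [
    {"account_number": "A-101", "branch": "Hillside", "balance": 500},
    {"account_number": "A-215", "branch": "Valleyview", "balance": 700},
]

# Horizontal fragmentation: whole tuples are split by a predicate,
# e.g. each branch's tuples are stored at that branch's site.
hillside = [t for t in account if t["branch"] == "Hillside"]
valleyview = [t for t in account if t["branch"] == "Valleyview"]

# Vertical fragmentation: the schema is split; both fragments keep the
# candidate key (account_number) so the relation can be rebuilt by a join.
frag1 = [{"account_number": t["account_number"], "branch": t["branch"]}
         for t in account]
frag2 = [{"account_number": t["account_number"], "balance": t["balance"]}
         for t in account]
rebuilt = [{**a, **b} for a in frag1 for b in frag2
           if a["account_number"] == b["account_number"]]
print(rebuilt == account)  # True -- the join is lossless
```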
Horizontal Fragmentation of account Relation:
Vertical Fragmentation of employee_info Relation:
Advantages of Fragmentation:
Horizontal:
• allows parallel processing on fragments of a relation
• allows a relation to be split so that tuples are located where they are most frequently accessed
Vertical:
• allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
• the tuple-id attribute allows efficient joining of vertical fragments
• allows parallel processing on a relation
Fragments may be successively fragmented to an arbitrary depth.
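A minimal sketch of both fragmentation styles, using Python's sqlite3 with a hypothetical account(account_number, branch_name, balance) relation: horizontal fragments are reconstructed by union, vertical fragments by a join on the shared candidate key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical relation: account(account_number, branch_name, balance)
cur.execute("CREATE TABLE account (account_number TEXT, branch_name TEXT, balance REAL)")
cur.executemany("INSERT INTO account VALUES (?,?,?)",
                [("A-101", "Hillside", 500), ("A-215", "Valleyview", 700),
                 ("A-102", "Hillside", 400)])

# Horizontal fragmentation: each tuple is assigned to the fragment for its branch.
cur.execute("CREATE TABLE account1 AS SELECT * FROM account WHERE branch_name = 'Hillside'")
cur.execute("CREATE TABLE account2 AS SELECT * FROM account WHERE branch_name = 'Valleyview'")

# Reconstruction of r is the union of the horizontal fragments.
rows = cur.execute("SELECT * FROM account1 UNION SELECT * FROM account2 "
                   "ORDER BY account_number").fetchall()
print(rows)

# Vertical fragmentation: the schema is split; both fragments keep account_number
# (a candidate key), ensuring the lossless-join property.
cur.execute("CREATE TABLE acct_branch AS SELECT account_number, branch_name FROM account")
cur.execute("CREATE TABLE acct_balance AS SELECT account_number, balance FROM account")
joined = cur.execute("""SELECT b.account_number, b.branch_name, c.balance
                        FROM acct_branch b JOIN acct_balance c
                          ON b.account_number = c.account_number
                        ORDER BY b.account_number""").fetchall()
print(joined)
```

Both reconstructions yield the original tuples, which is exactly the "sufficient information to reconstruct relation r" requirement.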
Data Transparency: The degree to which a system user may remain unaware of the details of
how and where the data items are stored in a distributed system.
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Distributed Transactions: A transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the transactions
executing at that site.
Each site has a transaction coordinator, which is responsible for:
Starting the execution of transactions that originate at the site.
Distributing sub-transactions at appropriate sites for execution.
Coordinating the termination of each transaction that originates at the site,
which may result in the transaction being committed at all sites or aborted at all
sites.
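The commit-at-all-sites-or-abort-at-all-sites requirement is what the two-phase commit protocol provides; a toy sketch of the coordinator's role, with all class and method names hypothetical:

```python
def two_phase_commit(participants):
    """Toy 2PC: phase 1 collects votes, phase 2 applies the unanimous decision."""
    # Phase 1: the coordinator asks every participating site to prepare.
    votes = [site.prepare() for site in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: the decision is applied at ALL sites, never at just some.
    for site in participants:
        site.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        return self.can_commit   # vote based on local log and lock state
    def finish(self, decision):
        self.state = decision

sites = [Participant(True), Participant(True), Participant(False)]
print(two_phase_commit(sites))  # one "no" vote forces a global abort
```

In a real system each vote and decision is force-written to the local recovery log before being sent, so that a crashed site can rejoin the protocol consistently.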
Transaction System Architecture:
Week-11
Information Integration:
Integration in Data Management: Evolution
The Classical Database Application
Database Application with Several DBMSs
Data Access via Distributed DBMS
Federated Database System
Data Integration (with Global Schema)
The Classical Database Application:
Centralized system with three-tier architecture
Implicit integration: integration supported by the Database Management System
(DBMS), i.e., the data manager
Database Application with Several DBMS’s:
Centralized system with three-tier architecture and multiple stores
Application hides integration: integration "embedded" within application
Data Access via Distributed DBMS:
Centralized system with three-tier architecture and multiple data stores
Distributed data management: different data sources of the same type, under the
control of the organization, managed by a Distributed DBMS
Federated Database System:
Centralized system with three-tier architecture and distributed stores
Data federation: different data sources, not necessarily of the same type, nor
necessarily under the control of the organization, federated within one data layer
Data Integration (with Global Schema):
Centralized system with four-tier architecture and distributed stores
Data exchange and integration: the global schema is "independent" of the different
data sources, which are heterogeneous and not necessarily under the control of a
single organization
Application-based Distribution:
Decentralized system
Application-based distribution: distributed integration realized within application
P2P Data Integration:
Decentralized system
Peer-to-peer data exchange and integration: distributed data integration realized with
no central global schemas
What is Information Integration?: Information integration is the problem of
providing a unified and transparent view of a collection of data stored in
multiple, autonomous, and heterogeneous data sources.
The unified view is achieved through a global (or target) schema, and is realized either
through a materialized database (exchange), or
through a virtualization mechanism based on querying (integration).
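A toy sketch of the virtualization mechanism: two hypothetical sources are wrapped under a global schema person(name, city), and queries against the global schema are answered on demand without materializing anything. All names and data here are invented for illustration; this is not a real integration API.

```python
# Hypothetical heterogeneous sources: one dict-based (e.g., a relational DB),
# one tuple-based (e.g., a CSV export).
source_a = [{"full_name": "Ada", "town": "London"}]
source_b = [("Grace", "Arlington")]

def wrap_a():
    # Wrapper maps source A's attribute names onto the global schema.
    return [{"name": r["full_name"], "city": r["town"]} for r in source_a]

def wrap_b():
    return [{"name": n, "city": c} for (n, c) in source_b]

def query_global(city=None):
    # Virtualization: nothing is materialized; wrappers are queried on demand.
    rows = wrap_a() + wrap_b()
    return [r for r in rows if city is None or r["city"] == city]

print(query_global())
print(query_global(city="London"))
```

The exchange alternative would instead run the wrappers once and store their union in a materialized target database.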
Relevance of Information Integration: Growing demand (and market)
At least two contexts
Intra-organization information integration
(e.g., Enterprise Information Systems)
Inter-organization information integration
(e.g., integration on the Web)
Basic Approaches to Sharing Information: There are three basic approaches to sharing information.
You can consolidate the information into a single database, which eliminates the
need for further integration.
You can leave information distributed, and provide tools to federate that
information, making it appear to be in a single virtual database. Or,
you can share information, which lets you maintain the information in multiple
data stores and applications.
Information Integration: Available Industrial Solutions: Distributed database systems
Tools for source wrapping
Tools for ETL (Extraction, Transformation and Loading)
Data warehousing
Tools based on database federation, e.g., DB2 Information Integrator
Distributed query optimization
Current Information Integration Tools: Characteristics: Physical transparency, i.e., masking from the user the physical characteristics of the
sources
Heterogeneity, i.e., federating highly diverse types of sources
Extensibility
Autonomy of data sources
Performance, through distributed query optimization
However, current tools do not (directly) support the so-called logical (or conceptual)
transparency (via an integrated schema), which is crucial in data integration
Advantages of Information Integration: Understand information – Analyze the data and its relationships. Share definitions and
policies across projects. Despite complexity, govern big data based on business needs.
Improve information – Deliver accurate, current data, with consistency across master
data entities. Manage information throughout its lifecycle. Document its lineage. Secure
and protect it.
Act on information – Accelerate projects by enabling confidence, adapting quickly to
change, and making high-value information continuously available.
Week-12
Revision from Week-8 to Week-11
Test-2
Week-13
Object Relational Database Management System:
Among modern database technologies, the object-relational database management
system (ORDBMS) is a newer database technology that can successfully deal with
very large data volumes of great complexity.
According to Stonebraker and Moore, database technologies can be grouped into
four main categories: file systems, relational DBMSs, object-oriented DBMSs,
and object-relational DBMSs.
Based on these four categories, Stonebraker and Moore developed their database
classification matrix, shown in the following figure.
What is ORDBMS?
ORDBMS is similar to a relational database.
It has object oriented database models like objects, classes and inheritance etc.
It also directly supports database schemas in the query language.
The gap between OODBMS and RDBMS is bridged by ORDBMS.
ORDBMSs allow developers to implement new data types and functions in languages such as Java and C.
ORDBMSs have changed the query-centric approach to data management.
Tools Available for ORDBMS: Main Proprietary Tools Available in the Market
DB2
Microsoft SQL
Oracle Databases
Informix
Adaptive Server Enterprise
Valentina
Cache
Main Open Source Tools Available in Market
PostgreSQL
CUBRID
Zope Object database
Giga Base
Greenplum database
Main Features Available in ORDBMSs: Object Types: user-defined data types (UDTs) or abstract data types (ADTs) can be
referred to as object types.
Functions/Methods: For each object type, the user can define the methods for data
access.
Varray: The varray is a collection type that allows the user to embed homogeneous data
into an array to form an object of a pre-defined array data type.
Nested table: A nested table is a collection type that can be stored within another table.
Inheritance: With Object type inheritance, users can build subtypes in hierarchies of
database types in ORDBs.
Object View: Object view allows users to develop object structures in existing relational
tables.
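Object types and their methods can be imitated on top of an ordinary relational store; a sketch using Python's sqlite3 adapter/converter hooks, with the Point type and its norm method purely hypothetical stand-ins for an ORDBMS user-defined type:

```python
import sqlite3

class Point:
    """Hypothetical object type with state (x, y) and a method (norm)."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def norm(self):                       # a method defined for the object type
        return (self.x ** 2 + self.y ** 2) ** 0.5

# Teach sqlite3 how to store and rebuild Point values in a POINT-declared column.
sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")
sqlite3.register_converter("POINT",
    lambda b: Point(*map(float, b.decode().split(";"))))

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE shapes (id INTEGER, location POINT)")
con.execute("INSERT INTO shapes VALUES (?, ?)", (1, Point(3.0, 4.0)))
p = con.execute("SELECT location FROM shapes").fetchone()[0]
print(p.norm())  # 5.0
```

A real ORDBMS (e.g., Oracle or PostgreSQL) defines such types and methods directly in SQL; this sketch only shows the idea of storing and retrieving typed objects through a relational engine.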
Advantages of ORDBMS: Reusability and sharing
Increased flexibility and functionality
Highly maintainable
Easily extensible and reliable
Can work with complex data types
Enhances overall system performance
Disadvantages of ORDBMS: More complex than traditional relational databases
It is costlier
Full object orientation is missing
Difficult to find qualified database professionals
Week-14
Object Oriented Database:
Object: Definitions: Objects are user-defined complex data types.
An object has structure or state (variables) and methods (behavior/operations).
An object is described by four characteristics
Identifier: a system-wide unique id for an object
Name: an object may also have a unique name in DB (optional)
Lifetime: determines if the object is persistent or transient
Structure: Construction of objects using type constructors
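The four characteristics above can be sketched as a small Python class (all names hypothetical):

```python
import itertools

_oid_counter = itertools.count(1)

class DBObject:
    """Sketch of the four object characteristics."""
    def __init__(self, name=None, persistent=False):
        self.oid = next(_oid_counter)     # identifier: system-wide unique id
        self.name = name                  # name: optional unique name in the DB
        self.persistent = persistent      # lifetime: persistent vs transient
        self.state = {}                   # structure: variables built by constructors

a = DBObject(name="root", persistent=True)
b = DBObject()                            # transient, unnamed
print(a.oid != b.oid)  # identity is independent of state or name
```

The key point the sketch illustrates is that object identity (the OID) is assigned by the system and never changes, even if the object's name or state does.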
Object-Oriented Concepts: Abstract Data Types
Class definition, provides extension to complex attribute types
Encapsulation
Implementation of operations and object structure hidden
Inheritance
Sharing of data within hierarchy scope, supports code reusability
Polymorphism
Operator overloading
What is Object Oriented Database (OODB)?: A database system that incorporates all the important object-oriented concepts like
Encapsulation, Inheritance and Polymorphism.
Object databases work well with:
CAD/CAM/CASE applications (computer-aided design, computer-aided manufacturing,
and computer-aided software engineering)
Multimedia Applications
Objects that change over time
Commerce
Advantages of OODBS: Designer can specify the structure of objects and their behavior (methods)
Better interaction with object-oriented languages such as Java and C++
Definition of complex and user-defined types
Encapsulation of operations and user-defined methods
Disadvantages of OODBS: Lower efficiency when data is simple and relationships are simple.
Relational tables are simpler.
Late binding may slow access speed.
More user tools exist for RDBMS.
Standards for RDBMS are more stable.
OODBS Standards: Object Data Management Group
Object Database Standard ODM6.2.0
Object Query Language
OQL support of SQL9
Week-15
Object Query Language (OQL):
OQL is an object database query language, and is specified as part of the ODMG
standards.
OQL is being used as an embedded query language.
OQL can also be used as a stand-alone query language.
OQL is based on SQL.
Many queries in SQL are also valid in OQL.
OQL also extends SQL to deal with object-oriented notions.
Example of OQL query: The following is a sample query:
"What are the names of the black products?"
select distinct p.name
from products p
where p.color = 'black'
This query is valid in both SQL and OQL, but the results are different.
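A runnable check of the SQL side of this query, using Python's sqlite3 with hypothetical product rows; in SQL the result is a relation (a one-column table), whereas OQL would return a set of string literals:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (name TEXT, color TEXT)")
con.executemany("INSERT INTO products VALUES (?, ?)",
                [("pen", "black"), ("pen", "black"), ("mug", "white"),
                 ("cable", "black")])

# The SQL form of the query from the notes; DISTINCT removes the duplicate 'pen'.
rows = con.execute("""SELECT DISTINCT p.name
                      FROM products p
                      WHERE p.color = 'black'
                      ORDER BY p.name""").fetchall()
print(rows)  # [('cable',), ('pen',)]
```

The rows come back as one-element tuples, reflecting SQL's table-shaped result; an OQL processor would instead yield the bare set {"cable", "pen"}.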
Result of the query (SQL):
Result of the query (OQL):
Comparison:
OQL vs C++: OQL is declarative
OQL can be used interactively
OQL embedded in C++ makes programs simpler
OQL can be seamlessly optimised
OQL guarantees logical/physical independence
OQL vs. SQL2: OQL supports complex objects
OQL supports methods
OQL vs. SQL3: OQL is stable, implemented, available while SQL3 is still in the design process
OQL is a simple query language while SQL3 is a full DB PL
OQL definition takes 20 pages while SQL3 is currently 1300 pages
OQL can match different data models (C++, ODMG, SQL2, SQL3)
Revision from Week-1 to Week-15