CS 426 - Advanced Database
DBMS (Database Management System)
A database-management system (DBMS) is a collection of interrelated data and a set of programs to access those data. The primary goal of a DBMS is to provide an environment that is both convenient and efficient to use.
Database System Applications
Databases are widely used. Here are some representative applications:
• Enterprise Information: sales, accounting, human resources, manufacturing, online retailers
• Banking and Finance: banking, credit-card transactions, finance
• Universities, Airlines, and Telecommunications
Data Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or manipulate data as organized by the appropriate data model. The types of access are:
• Retrieval of information stored in the database
• Insertion of new information into the database
• Deletion of information from the database
• Modification of information stored in the database
Example:
select instructor.name from instructor where instructor.dept_name = 'History';
By Dr. Nitin S. Goje Page 1 | 49
Data Definition Language
A database schema is specified by a set of definitions expressed in a special language called a data-definition language (DDL).
The DDL is also used to specify additional properties of the data.
Example:
create table department (dept_name char (20), building char (15), budget numeric (12,2));
Relational Database
A relational database is based on the relational model and uses a collection of tables to represent both data and the relationships among those data.
Each table has multiple columns and each column has a unique name.
Example:
Structured Query Language
SQL can define the structure of the data, modify data in the database, and specify security constraints.
The SQL language has several parts:
• Data-definition language (DDL). The SQL DDL provides commands for defining relation schemas, deleting relations, and modifying relation schemas.
• Data-manipulation language (DML). The SQL DML provides the ability to query information from the database and to insert tuples into, delete tuples from, and modify tuples in the database.
• Integrity. The SQL DDL includes commands for specifying integrity constraints that the data stored in the database must satisfy. Updates that violate integrity constraints are disallowed.
• View definition. The SQL DDL includes commands for defining views.
• Transaction control. SQL includes commands for specifying the beginning and ending of transactions.
• Embedded SQL and dynamic SQL. Embedded and dynamic SQL define how SQL statements can be embedded within general-purpose programming languages, such as C, C++, and Java.
• Authorization. The SQL DDL includes commands for specifying access rights to relations and views.
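The DDL, DML, and transaction-control parts listed above can be exercised end to end. The sketch below uses Python's sqlite3 module as an illustrative host (SQLite is an assumption for demonstration, not part of the notes; the department table mirrors the DDL example above):

```python
import sqlite3

# A minimal end-to-end sketch of the SQL parts above, using SQLite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a relation schema.
cur.execute("create table department "
            "(dept_name char(20), building char(15), budget numeric(12,2))")

# DML: insert tuples and query them.
cur.execute("insert into department values ('History', 'Painter', 50000)")
cur.execute("insert into department values ('Physics', 'Watson', 70000)")
rows = cur.execute(
    "select dept_name from department where budget > 60000").fetchall()

# Transaction control: commit makes the updates durable.
conn.commit()
print(rows)  # [('Physics',)]
```

The same statements would run, with minor dialect differences, on any relational DBMS.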
SQL Data Types
The SQL standard supports a variety of built-in types, including:
• char(n): A fixed-length character string with user-specified length n. The full form, character, can be used instead.
• varchar(n): A variable-length character string with user-specified maximum length n. The full form, character varying, is equivalent.
• int: An integer (a finite subset of the integers that is machine dependent). The full form, integer, is equivalent.
• smallint: A small integer (a machine-dependent subset of the integer type).
• numeric(p, d): A fixed-point number with user-specified precision. The number consists of p digits (plus a sign), and d of the p digits are to the right of the decimal point. Thus, numeric(3,1) allows 44.5 to be stored exactly, but neither 444.5 nor 0.32 can be stored exactly in a field of this type.
• real, double precision: Floating-point and double-precision floating-point numbers with machine-dependent precision.
• float(n): A floating-point number, with precision of at least n digits.
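The numeric(3,1) rule above can be checked mechanically. The predicate below is a hypothetical helper (not part of any SQL standard) that tests whether a value fits in a numeric(3,1) field, using Python's exact decimal arithmetic:

```python
from decimal import Decimal

# Hypothetical check for the numeric(3,1) type described above:
# at most 3 significant digits, 1 of them after the decimal point.
def fits_numeric_3_1(value: str) -> bool:
    d = Decimal(value)
    scaled = d.scaleb(1)  # shift the decimal point one place right
    # exact iff no digits remain after the shift, and at most 3 digits total
    return scaled == scaled.to_integral_value() and abs(scaled) < 1000

print(fits_numeric_3_1("44.5"))   # True  -- stored exactly
print(fits_numeric_3_1("444.5"))  # False -- too many digits
print(fits_numeric_3_1("0.32"))   # False -- too many fractional digits
```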
Advanced SQL
1. Accessing SQL From a Programming Language
There are two approaches to accessing SQL from a general-purpose programming language:
• Dynamic SQL:
Dynamic SQL allows the program to construct an SQL query as a character string at runtime, submit the query, and then retrieve the result into program variables a tuple at a time.
• Embedded SQL:
Like dynamic SQL, embedded SQL provides a means by which a program can interact with a database server.
However, under embedded SQL, the SQL statements are identified at compile time using a preprocessor.
The preprocessor submits the SQL statements to the database system for pre-compilation and optimization; then it replaces the SQL statements in the application program with appropriate code and function calls before invoking the programming-language compiler.
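Dynamic SQL as described above can be sketched with Python's sqlite3 module (an illustrative assumption; the instructor table and its contents are made up for the example). The query string is assembled at runtime, and results are fetched into program variables one tuple at a time:

```python
import sqlite3

# Illustrative setup: an in-memory instructor relation.
conn = sqlite3.connect(":memory:")
conn.execute("create table instructor (name text, dept_name text)")
conn.executemany("insert into instructor values (?, ?)",
                 [("El Said", "History"), ("Katz", "Comp. Sci.")])

# Dynamic SQL: the query text is constructed at runtime...
table, column = "instructor", "dept_name"   # known only at runtime
query = f"select name from {table} where {column} = ?"
cur = conn.execute(query, ("History",))

# ...and the result is retrieved a tuple at a time.
names = []
row = cur.fetchone()
while row is not None:
    names.append(row[0])
    row = cur.fetchone()
print(names)  # ['El Said']
```

Embedded SQL differs in that the statements are fixed at compile time and translated by a preprocessor rather than assembled as strings.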
2. Functions and Procedures
Procedures and functions allow “business logic” to be stored in the database, and executed from SQL statements.
Example:
create function dept_count(dept_name varchar(20))
returns integer
begin
declare d_count integer;
select count(*) into d_count
from instructor
where instructor.dept_name = dept_count.dept_name;
return d_count;
end
3. Triggers
A trigger is a statement that the system executes automatically as a side effect of a modification to the database. To design a trigger mechanism, we must meet two requirements:
1. Specify when a trigger is to be executed. This is broken up into an event that causes the trigger to be checked and a condition that must be satisfied for trigger execution to proceed.
2. Specify the actions to be taken when the trigger executes.
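The two requirements above (event + condition, then action) map directly onto trigger syntax. The sketch below uses SQLite via Python as an illustrative host, with made-up section and takes tables; the event is an insert on takes, the condition is the when clause, and the action updates the enrollment count:

```python
import sqlite3

# Illustrative trigger: event = insert on takes, condition = when clause,
# action = maintain the enrollment count in section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table section (course_id text, enrollment int);
create table takes (student_id text, course_id text);
insert into section values ('CS-426', 0);

create trigger takes_insert after insert on takes
when new.course_id is not null
begin
    update section set enrollment = enrollment + 1
    where course_id = new.course_id;
end;
""")

# The system executes the trigger automatically as a side effect:
conn.execute("insert into takes values ('S1', 'CS-426')")
enrollment = conn.execute(
    "select enrollment from section where course_id = 'CS-426'").fetchone()[0]
print(enrollment)  # 1
```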
Object Models
An object typically has two components: state (value) and behavior (operations).
One goal of an ODMS (Object Data Management System) is to maintain a direct correspondence between real-world and database objects so that objects do not lose their integrity and identity and can easily be identified and operated upon.
define type EMPLOYEE
tuple ( Fname:      string;
        Minit:      char;
        Lname:      string;
        Ssn:        string;
        Birth_date: DATE;
        Address:    string;
        Sex:        char;
        Salary:     float;
        Supervisor: EMPLOYEE;
        Dept:       DEPARTMENT; );

define type DATE
tuple ( Year:  integer;
        Month: integer;
        Day:   integer; );
Object Data Management Group (ODMG)
The ODMG object model is the data model upon which the object definition language (ODL) and object query language (OQL) are based.
It is meant to provide a standard data model for object databases, just as SQL describes a standard data model for relational databases.
It also provides a standard terminology in a field where the same terms were sometimes used to describe different concepts.
Object Definition Language
The ODL is designed to support the semantic constructs of the ODMG object model and is independent of any particular programming language.
Its main use is to create object specifications—that is, classes and interfaces. Hence, ODL is not a full programming language.
A user can specify a database schema in ODL independently of any programming language, and then use the specific language bindings to specify how ODL constructs can be mapped to constructs in specific programming languages, such as C++, Smalltalk, and Java.
Object Query Language
The object query language OQL is the query language proposed for the ODMG object model.
It is designed to work closely with the programming languages for which an ODMG binding is defined, such as C++, Smalltalk, and Java.
Hence, an OQL query embedded into one of these programming languages can return objects that match the type system of that language.
Additionally, the implementations of class operations in an ODMG schema can have their code written in these programming languages.
The OQL syntax for queries is similar to the syntax of the relational standard query language SQL, with additional features for ODMG concepts, such as object identity, complex objects, operations, inheritance, polymorphism, and relationships.
Transaction
Collections of operations that form a single logical unit of work are called transactions.
A database system must ensure proper execution of transactions despite failures—either the entire transaction executes, or none of it does.
Furthermore, it must manage concurrent execution of transactions in a way that avoids the introduction of inconsistency.
A transaction is a unit of program execution that accesses and possibly updates various data items.
Usually, a transaction is initiated by a user program written in a high-level data-manipulation language (typically SQL), or programming language (for example, C++, or Java), with embedded database accesses in JDBC or ODBC.
Begin Transaction and End Transaction
A transaction is delimited by statements (or function calls) of the form begin transaction and end transaction. The transaction consists of all operations executed between the begin transaction and the end transaction.
Example
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined as:
Ti : read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
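The transfer Ti above can be run as one atomic unit of work. The sketch below uses Python's sqlite3 module as an illustrative host (the account table and balances are assumptions for the example); if any step fails, the rollback leaves both accounts unchanged:

```python
import sqlite3

# Illustrative accounts for the transfer transaction Ti above.
conn = sqlite3.connect(":memory:")
conn.execute("create table account (id text primary key, balance int)")
conn.executemany("insert into account values (?, ?)",
                 [("A", 100), ("B", 200)])

try:
    with conn:  # the with-block commits on success, rolls back on error
        conn.execute("update account set balance = balance - 50 where id = 'A'")
        conn.execute("update account set balance = balance + 50 where id = 'B'")
except sqlite3.Error:
    pass  # on failure, neither update is reflected in the database

balances = dict(conn.execute("select id, balance from account"))
print(balances)  # {'A': 50, 'B': 250}
```

Either both updates survive or neither does, which is precisely the atomicity property discussed next.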
Properties of Transactions
Atomicity - Either all operations of the transaction are reflected properly in the database, or none are.
Consistency - Execution of a transaction in isolation (that is, with no other transaction executing concurrently) preserves the consistency of the database.
Isolation - Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
Durability - After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
States of a Transaction
A transaction must be in one of the following states:
Active, the initial state; the transaction stays in this state while it is executing.
Partially committed, after the final statement has been executed.
Failed, after the discovery that normal execution can no longer proceed.
Aborted, after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
Committed, after successful completion.
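The five states above form a small state machine. A sketch of its legal transitions (the encoding is an illustration, not part of any DBMS API):

```python
# Legal transitions between the five transaction states described above.
TRANSITIONS = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "committed":           set(),
    "aborted":             set(),
}

def run(history):
    """Check that a sequence of states is a legal life of a transaction."""
    state = "active"  # every transaction starts in the active state
    for nxt in history:
        if nxt not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {nxt}")
        state = nxt
    return state

print(run(["partially committed", "committed"]))  # committed
print(run(["failed", "aborted"]))                 # aborted
```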
States Diagram of Transactions
We say that a transaction has committed only if it has entered the committed state. Similarly, we say that a transaction has aborted only if it has entered the aborted state. A transaction is said to have terminated if it has either committed or aborted.
A transaction starts in the active state. When it finishes its final statement, it enters the partially committed state. At this point, the transaction has completed its execution, but it is still possible that it may have to be aborted, since the actual output may still be temporarily residing in main memory, and thus a hardware failure may preclude its successful completion.
The database system then writes out enough information to disk that, even in the event of a failure, the updates performed by the transaction can be re-created when the system restarts after the failure.
When the last of this information is written out, the transaction enters the committed state.
Concurrency Control
When several transactions execute concurrently in the database, however, the isolation property may no longer be preserved. To ensure that it is, the system must control the interaction among the concurrent transactions; this control is achieved through one of a variety of mechanisms called concurrency-control schemes.
In practice, the most frequently used schemes are:
• Two-phase locking and
• Snapshot isolation.
Lock-Based Protocols
One way to ensure isolation is to require that data items be accessed in a mutually exclusive manner; that is, while one transaction is accessing a data item, no other transaction can modify that data item.
The most common method used to implement this requirement is to allow a transaction to access a data item only if it is currently holding a lock on that item.
Timestamp-Based Protocols
The locking protocols that we have described thus far determine the order between every pair of conflicting transactions at execution time, by the first lock that both members of the pair request that involves incompatible modes.
Another method for determining the serializability order is to select an ordering among transactions in advance. The most common method for doing so is to use a timestamp-ordering scheme.
The Two-Phase Locking Protocol
One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction issue lock and unlock requests in two phases:
◦ 1. Growing phase. A transaction may obtain locks, but may not release any lock.
◦ 2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.
Initially, a transaction is in the growing phase. The transaction acquires locks as needed. Once the transaction releases a lock, it enters the shrinking phase, and it can issue no more lock requests.
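The two-phase rule can be sketched as a per-transaction check (a minimal illustration, not a full lock manager: lock compatibility and waiting between transactions are omitted):

```python
# A minimal sketch of the two-phase rule: once a transaction has released
# any lock, it is in the shrinking phase and may acquire no new locks.
class Transaction:
    def __init__(self):
        self.locks = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("two-phase rule: no locks after first unlock")
        self.locks.add(item)        # growing phase

    def unlock(self, item):
        self.locks.remove(item)
        self.shrinking = True       # shrinking phase begins

t = Transaction()
t.lock("A")
t.lock("B")      # growing phase: acquire locks as needed
t.unlock("A")    # first unlock: shrinking phase begins
try:
    t.lock("C")  # violates the two-phase rule
except RuntimeError as e:
    print(e)
```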
Snapshot Isolation
Snapshot isolation involves giving a transaction a “snapshot” of the database at the time when it begins its execution. It then operates on that snapshot in complete isolation from concurrent transactions. The data values in the snapshot consist only of values written by committed transactions.
This isolation is ideal for read-only transactions, since they never wait and are never aborted by the concurrency manager.
Transactions that update the database must, of course, interact with potentially conflicting concurrent update transactions before updates are actually placed in the database. Updates are kept in the transaction’s private workspace until the transaction successfully commits, at which point the updates are written to the database.
When a transaction T is allowed to commit, the transition of T to the committed state and the writing of all of the updates made by T to the database must be done as an atomic action, so that any snapshot created for another transaction either includes all updates by transaction T or none of them.
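The mechanism above can be sketched in a few lines: each transaction reads from a snapshot taken at start, buffers writes in a private workspace, and at commit the writes are installed atomically. This sketch resolves conflicting updates by aborting the later committer (a simplified value-based check, assumed for illustration; real systems track versions):

```python
# A sketch of snapshot isolation with a private write workspace.
class SnapshotDB:
    def __init__(self):
        self.data = {}  # committed values only

    def begin(self):
        # the transaction sees the database as of its start
        return {"snapshot": dict(self.data), "writes": {}}

    def read(self, txn, key):
        return txn["writes"].get(key, txn["snapshot"].get(key))

    def write(self, txn, key, value):
        txn["writes"][key] = value  # kept in the private workspace

    def commit(self, txn):
        # abort if a conflicting update committed since this txn began
        for key in txn["writes"]:
            if self.data.get(key) != txn["snapshot"].get(key):
                return False
        self.data.update(txn["writes"])  # installed as one atomic action
        return True

db = SnapshotDB()
db.data["x"] = 1
t1, t2 = db.begin(), db.begin()          # both see x = 1
db.write(t1, "x", db.read(t1, "x") + 1)
db.write(t2, "x", db.read(t2, "x") + 10)
print(db.commit(t1), db.commit(t2))      # True False -- t2 is aborted
```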
Recovery System
A computer system, like any other device, is subject to failure from a variety of causes: disk crash, power outage, software error, a fire in the machine room, even sabotage. In any failure, information may be lost. Therefore, the database system must take actions in advance to ensure that the atomicity and durability properties of transactions are preserved.
An integral part of a database system is a recovery scheme that can restore the database to the consistent state that existed before the failure.
Failure Classification
There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner. We shall consider only the following types of failure:
• Transaction failure. There are two types of errors that may cause a transaction to fail:
◦ Logical error. The transaction can no longer continue with its normal execution because of some internal condition, such as bad input, data not found, overflow, or resource limit exceeded.
◦ System error. The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be re-executed at a later time.
• System crash. There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage and brings transaction processing to a halt. The content of nonvolatile storage remains intact and is not corrupted.
The assumption that hardware errors and bugs in the software bring the system to a halt, but do not corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have numerous internal checks, at the hardware and the software level, that bring the system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.
• Disk failure. A disk block loses its content as a result of either a head crash or a failure during a data-transfer operation. Copies of the data on other disks, or archival backups on tertiary media such as DVDs or tapes, are used to recover from the failure.
Week-7
Revision From Week-1 to Week-6
TEST-1
Week-8
Database System Architecture:
The architecture of a database system determines its capability, reliability, effectiveness, and efficiency in meeting user requirements. Besides the visible functions seen through some data-manipulation language, a good database architecture should provide:
a) Independence of data and programs
b) Ease of system design
c) Ease of programming
d) Powerful query facilities
e) Protection of data
Centralized System: Runs on a single computer system and does not interact with other computer systems.
• General-purpose computer system: one to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory.
• Single-user system (e.g., personal computer or workstation): desk-top unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.
• Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serves a large number of users who are connected to the system via terminals. Often called server systems.
Fig: A Centralized Computer System
Client-Server Systems:
Centralized systems act as server systems that satisfy requests generated by client systems.

Fig: General Structure for a Client-Server System

Database functionality can be divided into:
• Back-end: manages access structures, query evaluation and optimization, concurrency control, and recovery.
• Front-end: consists of tools such as forms, report writers, and graphical user interface facilities.
The interface between the front-end and the back-end is through SQL or through an application program interface.
Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines:
• better functionality for the cost
• flexibility in locating resources and expanding facilities
• better user interfaces
• easier maintenance
Server System Architecture:
Server systems can be broadly categorized into two kinds:
• transaction servers, which are widely used in relational database systems, and
• data servers, used in object-oriented database systems.
Transaction Servers: Also called query server systems or SQL server systems
Clients send requests to the server
Transactions are executed at the server
Results are shipped back to the client.
Requests are specified in SQL and communicated to the server through a remote procedure call (RPC) mechanism. Transactional RPC allows many RPC calls to collectively form a transaction.
Data Servers:
Used in high-speed LANs, in cases where:
• the clients are comparable in processing power to the server, and
• the tasks to be executed are compute-intensive.
Data are shipped to clients, where processing is performed, and results are then shipped back to the server. This architecture requires full back-end functionality at the clients. Used in many object-oriented database systems.
Issues:
Page-Shipping versus Item-Shipping
Locking
Data Caching
Lock Caching
Parallel Systems:
Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network.
• A coarse-grain parallel machine consists of a small number of powerful processors.
• A massively parallel or fine-grain parallel machine utilizes thousands of smaller processors.
Two main performance measures:
• throughput --- the number of tasks that can be completed in a given time interval
• response time --- the amount of time it takes to complete a single task from the time it is submitted
Speed-Up and Scale-Up
Speedup: a fixed-sized problem executing on a small system is given to a system that is N times larger.
Measured by: speedup = TS / TL, where TS is the execution time of the task on the smaller system and TL is the execution time of the same task on the larger system.
Speedup is linear if the equation equals N.
Scaleup: increase the size of both the problem and the system; an N-times larger system is used to perform an N-times larger job.
Measured by: scaleup = TS / TL, where TS is the execution time of the small task on the smaller system and TL is the execution time of the large task on the larger system.
Scaleup is linear if the equation equals 1.
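A small worked example of the two measures (the execution times are made-up illustrations):

```python
# Worked example of the speedup and scaleup measures above.
def speedup(t_small, t_large):
    """Same problem: time on the small system / time on the larger system."""
    return t_small / t_large

def scaleup(t_small_job_small_sys, t_large_job_large_sys):
    """N-times larger problem run on an N-times larger system."""
    return t_small_job_small_sys / t_large_job_large_sys

# A 4x larger system solves the same problem in 100s instead of 400s:
print(speedup(400, 100))   # 4.0 -> linear speedup for N = 4
# The 4x system takes 110s on a 4x-larger job that took 100s before:
print(scaleup(100, 110))   # about 0.91 -> sub-linear scaleup
```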
Interconnection Network Architectures
• Bus. System components send data on and receive data from a single communication bus.
• Mesh. Components are arranged as nodes in a grid, and each component is connected to all adjacent components. The number of communication links grows with the number of components, and so a mesh scales better than a bus.
• Hypercube. Components are numbered in binary; components are connected to one another if their binary representations differ in exactly one bit.
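The hypercube rule above is easy to state in code: two nodes are neighbors exactly when the XOR of their numbers has a single 1-bit.

```python
# Hypercube connectivity: nodes i and j are connected iff their binary
# representations differ in exactly one bit.
def hypercube_neighbors(i: int, j: int) -> bool:
    return bin(i ^ j).count("1") == 1

# In a 3-dimensional hypercube (nodes 0..7):
print(hypercube_neighbors(0b000, 0b001))  # True  (differ in one bit)
print(hypercube_neighbors(0b000, 0b011))  # False (differ in two bits)
# Node 5 (101) is connected to 4 (100), 7 (111), and 1 (001):
print(sorted(j for j in range(8) if hypercube_neighbors(0b101, j)))  # [1, 4, 7]
```

Each node in an n-dimensional hypercube thus has exactly n neighbors, one per bit position.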
Fig: Interconnection Architectures
Parallel Database Architectures:
• Shared memory -- processors share a common memory
• Shared disk -- processors share a common disk
• Shared nothing -- processors share neither a common memory nor a common disk
• Hierarchical -- hybrid of the above architectures
Fig: Parallel Database Architectures

Distributed Systems:
Data spread over multiple machines (also referred to as sites or nodes).
Network interconnects the machines
Data shared by users on multiple machines
Fig: A Distributed System
Network Types:
• Local-area networks (LANs) – composed of processors that are distributed over small geographical areas, such as a single building or a few adjacent buildings.
• Wide-area networks (WANs) – composed of processors distributed over a large geographical area.
Local-Area Network:
WANs with continuous connection (e.g., the Internet) are needed for implementing distributed database systems.
Groupware applications such as Lotus Notes can work on WANs with discontinuous connection:
• Data is replicated.
• Updates are propagated to replicas periodically.
• Copies of data may be updated independently.
• Non-serializable executions can thus result. Resolution is application dependent.
Week-9
Parallel Databases:
Introduction:
Parallel machines are becoming quite common and affordable:
• Prices of microprocessors, memory, and disks have dropped sharply.
• Recent desktop computers feature multiple processors, and this trend is projected to accelerate.
Databases are growing increasingly large:
• Large volumes of transaction data are collected and stored for later analysis.
• Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data
• processing time-consuming decision-support queries
• providing high throughput for transaction processing
Parallelism in Databases:
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism:
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
• Round-robin: Send the i-th tuple inserted in the relation to disk i mod n.
• Hash partitioning: Choose one or more attributes as the partitioning attributes. Choose a hash function h with range 0…n - 1. Let i denote the result of hash function h applied to the partitioning-attribute value of a tuple. Send the tuple to disk i.
• Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning-attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v ≥ vn-2 go to disk n - 1.
E.g., with a partitioning vector [5,11], a tuple with partitioning-attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
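The three techniques above can be sketched for n = 3 disks (the tuples and the partitioning vector [5, 11] follow the example in the text; the helper functions are illustrative, not any system's API):

```python
# Sketches of the three I/O partitioning techniques for n = 3 disks.
n = 3

def round_robin(tuples):
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):       # i-th inserted tuple -> disk i mod n
        disks[i % n].append(t)
    return disks

def hash_partition(tuples, key=lambda t: t):
    disks = [[] for _ in range(n)]
    for t in tuples:                     # h(v) in 0..n-1 selects the disk
        disks[hash(key(t)) % n].append(t)
    return disks

def range_partition(tuples, vector=(5, 11), key=lambda t: t):
    disks = [[] for _ in range(n)]
    for t in tuples:
        v = key(t)
        disk = sum(1 for bound in vector if v >= bound)  # count bounds passed
        disks[disk].append(t)
    return disks

# The worked example from the text: vector [5, 11], values 2, 8, 20.
print(range_partition([2, 8, 20]))  # [[2], [8], [20]]
```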
Comparison of Partitioning Techniques:
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
Round-robin:
Advantages:
• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process.
• No clustering -- tuples are scattered across all disks.
Hash partitioning:
• Good for sequential access: assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute: can look up a single disk, leaving others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so difficult to answer range queries.
Range partitioning:
• Provides data clustering by partitioning-attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries. Good if result tuples are from one to a few blocks. If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted -- an example of execution skew.
Interquery Parallelism:
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures:
• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained -- reads and writes of data in buffer must find the latest version of the data.
Intraquery Parallelism:
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
• Intraoperation parallelism – parallelize the execution of each individual operation in the query.
• Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism because the number of tuples processed by each operation is typically greater than the number of operations in a query.
Week-10
Distributed Database System:
A distributed database system consists of loosely coupled sites that share no physical component.
• Database systems that run on each site are independent of each other.
• Transactions may access data at one or more sites.
Homogeneous Distributed Databases:
In a homogeneous distributed database:
• All sites have identical software.
• Sites are aware of each other and agree to cooperate in processing user requests.
• Each site surrenders part of its autonomy in terms of the right to change schemas or software.
• The system appears to the user as a single system.
Heterogeneous Distributed Databases:
In a heterogeneous distributed database:
• Different sites may use different schemas and software.
• Difference in schema is a major problem for query processing.
• Difference in software is a major problem for transaction processing.
• Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.
Distributed Data Storage:
Assume a relational data model.
• Replication: the system maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance.
• Fragmentation: a relation is partitioned into several fragments stored in distinct sites.
• Replication and fragmentation can be combined: a relation is partitioned into several fragments, and the system maintains several identical replicas of each such fragment.
Data Replication:
A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.
Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication:
• Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
• Parallelism: queries on r may be processed by several nodes in parallel.
• Reduced data transfer: relation r is available locally at each site containing a replica of r.
Disadvantages of Replication:
• Increased cost of updates: each replica of relation r must be updated.
• Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency-control mechanisms are implemented.
One solution: choose one copy as the primary copy and apply concurrency-control operations on the primary copy.
Data Fragmentation:
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to one or more fragments.
• Vertical fragmentation: the schema for relation r is split into several smaller schemas.
All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property. A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.
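Both fragmentation styles can be sketched on a small account relation (the tuples and attribute names are illustrative, chosen to match the account example referenced below):

```python
# Sketch of horizontal and vertical fragmentation of an account relation.
account = [
    {"account_number": "A-101", "branch": "Hillside", "balance": 500},
    {"account_number": "A-215", "branch": "Valleyview", "balance": 700},
]

# Horizontal fragmentation: whole tuples are split by a predicate,
# e.g. each branch's tuples are stored at that branch's site.
hillside = [t for t in account if t["branch"] == "Hillside"]
valleyview = [t for t in account if t["branch"] == "Valleyview"]

# Vertical fragmentation: the schema is split; both fragments keep the
# candidate key (account_number) so the relation can be rebuilt by a join.
frag1 = [{"account_number": t["account_number"], "branch": t["branch"]}
         for t in account]
frag2 = [{"account_number": t["account_number"], "balance": t["balance"]}
         for t in account]
rebuilt = [{**a, **b} for a in frag1 for b in frag2
           if a["account_number"] == b["account_number"]]
print(rebuilt == account)  # True -- the join is lossless
```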
Horizontal Fragmentation of account Relation:
Vertical Fragmentation of employee_info Relation:
Advantages of Fragmentation:
Horizontal:
• allows parallel processing on fragments of a relation
• allows a relation to be split so that tuples are located where they are most frequently accessed
Vertical:
• allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
• the tuple-id attribute allows efficient joining of vertical fragments
• allows parallel processing on a relation
Fragments may be successively fragmented to an arbitrary depth.
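A minimal sketch of both fragmentation styles, using Python's sqlite3 with a hypothetical account(account_number, branch_name, balance) relation: horizontal fragments are reconstructed by union, vertical fragments by a join on the shared candidate key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical relation: account(account_number, branch_name, balance)
cur.execute("CREATE TABLE account (account_number TEXT, branch_name TEXT, balance REAL)")
cur.executemany("INSERT INTO account VALUES (?,?,?)",
                [("A-101", "Hillside", 500), ("A-215", "Valleyview", 700),
                 ("A-102", "Hillside", 400)])

# Horizontal fragmentation: each tuple is assigned to the fragment for its branch.
cur.execute("CREATE TABLE account1 AS SELECT * FROM account WHERE branch_name = 'Hillside'")
cur.execute("CREATE TABLE account2 AS SELECT * FROM account WHERE branch_name = 'Valleyview'")

# Reconstruction of r is the union of the horizontal fragments.
rows = cur.execute("SELECT * FROM account1 UNION SELECT * FROM account2 "
                   "ORDER BY account_number").fetchall()
print(rows)

# Vertical fragmentation: the schema is split; both fragments keep account_number
# (a candidate key), ensuring the lossless-join property.
cur.execute("CREATE TABLE acct_branch AS SELECT account_number, branch_name FROM account")
cur.execute("CREATE TABLE acct_balance AS SELECT account_number, balance FROM account")
joined = cur.execute("""SELECT b.account_number, b.branch_name, c.balance
                        FROM acct_branch b JOIN acct_balance c
                          ON b.account_number = c.account_number
                        ORDER BY b.account_number""").fetchall()
print(joined)
```

Both reconstructions yield the original tuples, which is exactly the "sufficient information to reconstruct relation r" requirement.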
Data Transparency: The degree to which a system user may remain unaware of the details of
how and where the data items are stored in a distributed system.
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Distributed Transactions: A transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the transactions
executing at that site.
Each site has a transaction coordinator, which is responsible for:
Starting the execution of transactions that originate at the site.
Distributing sub-transactions at appropriate sites for execution.
Coordinating the termination of each transaction that originates at the site,
which may result in the transaction being committed at all sites or aborted at all
sites.
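The commit-at-all-sites-or-abort-at-all-sites requirement is what the two-phase commit protocol provides; a toy sketch of the coordinator's role, with all class and method names hypothetical:

```python
def two_phase_commit(participants):
    """Toy 2PC: phase 1 collects votes, phase 2 applies the unanimous decision."""
    # Phase 1: the coordinator asks every participating site to prepare.
    votes = [site.prepare() for site in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: the decision is applied at ALL sites, never at just some.
    for site in participants:
        site.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "active"
    def prepare(self):
        return self.can_commit   # vote based on local log and lock state
    def finish(self, decision):
        self.state = decision

sites = [Participant(True), Participant(True), Participant(False)]
print(two_phase_commit(sites))  # one "no" vote forces a global abort
```

In a real system each vote and decision is force-written to the local recovery log before being sent, so that a crashed site can rejoin the protocol consistently.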
Transaction System Architecture:
Week-11
Information Integration:
Integration in Data Management: Evolution
The Classical Database Application
Database Application with Several DBMSs
Data Access via Distributed DBMS
Federated Database System
Data Integration (with Global Schema)
The Classical Database Application:
Centralized system with three-tier architecture
Implicit integration: integration supported by the Database Management System
(DBMS), i.e., the data manager
Database Application with Several DBMS’s:
Centralized system with three-tier architecture and multiple stores
Application hides integration: integration "embedded" within application
Data Access via Distributed DBMS:
Centralized system with three-tier architecture and multiple data stores
Distributed data management: different data sources of the same type, under the
control of the organization, managed by a Distributed DBMS
Federated Database System:
Centralized system with three-tier architecture and distributed stores
Data federation: different data sources, not necessarily of the same type, nor
necessarily under the control of the organization, federated within one data layer
Data Integration (with Global Schema):
Centralized system with four-tier architecture and distributed stores
Data exchange and integration: the global schema is "independent" of the different
data sources, which are heterogeneous and not necessarily under the control of a
single organization
Application-based Distribution:
Decentralized system
Application-based distribution: distributed integration realized within application
P2P Data Integration:
Decentralized system
Peer-to-peer data exchange and integration: distributed data integration realized with
no central global schemas
What is Information Integration?: Information integration is the problem of
providing a unified and transparent view of a collection of data stored in
multiple, autonomous, and heterogeneous data sources.
The unified view is achieved through a global (or target) schema, and is realized either
through a materialized database (exchange), or
through a virtualization mechanism based on querying (integration).
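A toy sketch of the virtualization mechanism: two hypothetical sources are wrapped under a global schema person(name, city), and queries against the global schema are answered on demand without materializing anything. All names and data here are invented for illustration; this is not a real integration API.

```python
# Hypothetical heterogeneous sources: one dict-based (e.g., a relational DB),
# one tuple-based (e.g., a CSV export).
source_a = [{"full_name": "Ada", "town": "London"}]
source_b = [("Grace", "Arlington")]

def wrap_a():
    # Wrapper maps source A's attribute names onto the global schema.
    return [{"name": r["full_name"], "city": r["town"]} for r in source_a]

def wrap_b():
    return [{"name": n, "city": c} for (n, c) in source_b]

def query_global(city=None):
    # Virtualization: nothing is materialized; wrappers are queried on demand.
    rows = wrap_a() + wrap_b()
    return [r for r in rows if city is None or r["city"] == city]

print(query_global())
print(query_global(city="London"))
```

The exchange alternative would instead run the wrappers once and store their union in a materialized target database.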
Relevance of Information Integration: Growing demand (and market)
At least two contexts
Intra-organization information integration
(e.g., Enterprise Information Systems)
Inter-organization information integration
(e.g., integration on the Web)
Basic Approaches to Sharing Information: There are three basic approaches to sharing information.
You can consolidate the information into a single database, which eliminates the
need for further integration.
You can leave information distributed, and provide tools to federate that
information, making it appear to be in a single virtual database. Or,
you can share information, which lets you maintain the information in multiple
data stores and applications.
Information Integration: Available Industrial Solutions: Distributed database systems
Tools for source wrapping
Tools for ETL (Extraction, Transformation and Loading)
Data warehousing
Tools based on database federation, e.g., DB2 Information Integrator
Distributed query optimization
Current Information Integration Tools: Characteristics: Physical transparency, i.e., masking from the user the physical characteristics of the
sources
Heterogeneity, i.e., federating highly diverse types of sources
Extensibility
Autonomy of data sources
Performance, through distributed query optimization
However, current tools do not (directly) support the so-called logical (or conceptual)
transparency (via an integrated schema), which is crucial in data integration
Advantages of Information Integration: Understand information – Analyze the data and its relationships. Share definitions and
policies across projects. Despite complexity, govern big data based on business needs.
Improve information – Deliver accurate, current data, with consistency across master
data entities. Manage information throughout its lifecycle. Document its lineage. Secure
and protect it.
Act on information – Accelerate projects by enabling confidence, adapting quickly to
change, and making high-value information continuously available.
Week-12
Revision from Week-8 to Week-11
Test-2
Week-13
Object Relational Database Management System:
Among modern database technologies, the object-relational database management
system (ORDBMS) is a newer database technology that can successfully deal with
very large data volumes of great complexity.
According to Stonebraker and Moore, database technologies can be grouped into
four main categories: file systems, relational DBMSs, object-oriented DBMSs,
and object-relational DBMSs.
Based on these four categories, Stonebraker and Moore developed their database
classification matrix, shown in the following figure.
What is ORDBMS?
ORDBMS is similar to a relational database.
It has object oriented database models like objects, classes and inheritance etc.
It also directly supports database schemas in the query language.
The gap between OODBMS and RDBMS is bridged by ORDBMS.
ORDBMSs allow developers to implement new data types and functions in languages such as Java and C.
ORDBMSs have changed the query-centric approach to data management.
Tools Available for ORDBMS: Main Proprietary Tools Available in the Market
DB2
Microsoft SQL
Oracle Databases
Informix
Adaptive Server Enterprise
Valentina
Cache
Main Open Source Tools Available in Market
PostgreSQL
CUBRID
Zope Object database
Giga Base
Greenplum database
Main Features Available in ORDBMSs: Object Types: user-defined data types (UDTs) or abstract data types (ADTs) can be
referred to as object types.
Functions/Methods: For each object type, the user can define the methods for data
access.
Varray: The varray is a collection type that allows the user to embed homogeneous data
into an array to form an object of a pre-defined array data type.
Nested table: A nested table is a collection type that can be stored within another table.
Inheritance: With Object type inheritance, users can build subtypes in hierarchies of
database types in ORDBs.
Object View: Object view allows users to develop object structures in existing relational
tables.
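Object types and their methods can be imitated on top of an ordinary relational store; a sketch using Python's sqlite3 adapter/converter hooks, with the Point type and its norm method purely hypothetical stand-ins for an ORDBMS user-defined type:

```python
import sqlite3

class Point:
    """Hypothetical object type with state (x, y) and a method (norm)."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def norm(self):                       # a method defined for the object type
        return (self.x ** 2 + self.y ** 2) ** 0.5

# Teach sqlite3 how to store and rebuild Point values in a POINT-declared column.
sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")
sqlite3.register_converter("POINT",
    lambda b: Point(*map(float, b.decode().split(";"))))

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE shapes (id INTEGER, location POINT)")
con.execute("INSERT INTO shapes VALUES (?, ?)", (1, Point(3.0, 4.0)))
p = con.execute("SELECT location FROM shapes").fetchone()[0]
print(p.norm())  # 5.0
```

A real ORDBMS (e.g., Oracle or PostgreSQL) defines such types and methods directly in SQL; this sketch only shows the idea of storing and retrieving typed objects through a relational engine.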
Advantages of ORDBMS: Reusability and sharing
Increased flexibility and functionality
Highly maintainable
Easily extensible and reliable
Can work with complex data types
Enhances overall system performance
Disadvantages of ORDBMS: More complex than traditional relational databases
It is costlier
Full object orientation is missing
Difficult to find qualified database professionals
Week-14
Object Oriented Database:
Object: Definitions: Objects are user-defined complex data types.
An object has structure or state (variables) and methods (behavior/operations).
An object is described by four characteristics
Identifier: a system-wide unique id for an object
Name: an object may also have a unique name in DB (optional)
Lifetime: determines if the object is persistent or transient
Structure: Construction of objects using type constructors
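The four characteristics above can be sketched as a small Python class (all names hypothetical):

```python
import itertools

_oid_counter = itertools.count(1)

class DBObject:
    """Sketch of the four object characteristics."""
    def __init__(self, name=None, persistent=False):
        self.oid = next(_oid_counter)     # identifier: system-wide unique id
        self.name = name                  # name: optional unique name in the DB
        self.persistent = persistent      # lifetime: persistent vs transient
        self.state = {}                   # structure: variables built by constructors

a = DBObject(name="root", persistent=True)
b = DBObject()                            # transient, unnamed
print(a.oid != b.oid)  # identity is independent of state or name
```

The key point the sketch illustrates is that object identity (the OID) is assigned by the system and never changes, even if the object's name or state does.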
Object-Oriented Concepts: Abstract Data Types
Class definition, provides extension to complex attribute types
Encapsulation
Implementation of operations and object structure hidden
Inheritance
Sharing of data within hierarchy scope, supports code reusability
Polymorphism
Operator overloading
What is Object Oriented Database (OODB)?: A database system that incorporates all the important object-oriented concepts like
Encapsulation, Inheritance and Polymorphism.
Object databases work well with:
CAD/CAM/CASE applications (computer-aided design, computer-aided manufacturing,
and computer-aided software engineering)
Multimedia Applications
Objects that change over time
Commerce
Advantages of OODBS: Designer can specify the structure of objects and their behavior (methods)
Better interaction with object-oriented languages such as Java and C++
Definition of complex and user-defined types
Encapsulation of operations and user-defined methods
Disadvantages of OODBS: Lower efficiency when data is simple and relationships are simple.
Relational tables are simpler.
Late binding may slow access speed.
More user tools exist for RDBMS.
Standards for RDBMS are more stable.
OODBS Standards: Object Data Management Group
Object Database Standard ODM6.2.0
Object Query Language
OQL support of SQL9
Week-15
Object Query Language (OQL):
OQL is an object database query language, and is specified as part of the ODMG
standards.
OQL is being used as an embedded query language.
OQL can also be used as a stand-alone query language.
OQL is based on SQL.
Many queries in SQL are also valid in OQL.
OQL also extends SQL to deal with object-oriented notions.
Example of OQL query: The following is a sample query:
"What are the names of the black products?"
select distinct p.name
from products p
where p.color = 'black'
This query is valid in both SQL and OQL, but the results are different.
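A runnable check of the SQL side of this query, using Python's sqlite3 with hypothetical product rows; in SQL the result is a relation (a one-column table), whereas OQL would return a set of string literals:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (name TEXT, color TEXT)")
con.executemany("INSERT INTO products VALUES (?, ?)",
                [("pen", "black"), ("pen", "black"), ("mug", "white"),
                 ("cable", "black")])

# The SQL form of the query from the notes; DISTINCT removes the duplicate 'pen'.
rows = con.execute("""SELECT DISTINCT p.name
                      FROM products p
                      WHERE p.color = 'black'
                      ORDER BY p.name""").fetchall()
print(rows)  # [('cable',), ('pen',)]
```

The rows come back as one-element tuples, reflecting SQL's table-shaped result; an OQL processor would instead yield the bare set {"cable", "pen"}.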
Result of the query (SQL):
Result of the query (OQL):
Comparison:
OQL vs C++: OQL is declarative
OQL can be used interactively
OQL embedded in C++ makes programs simpler
OQL can be seamlessly optimised
OQL guarantees logical/physical independence
OQL vs. SQL2: OQL supports complex objects
OQL supports methods
OQL vs. SQL3: OQL is stable, implemented, available while SQL3 is still in the design process
OQL is a simple query language while SQL3 is a full DB PL
OQL definition takes 20 pages while SQL3 is currently 1300 pages
OQL can match different data models (C++, ODMG, SQL2, SQL3)
Revision from Week-1 to Week-15