Distributed Database

Date post: 20-Jul-2016
Upload: indira-kundu
Description:
brief introduction to distributed data base
Parallel and Distributed Databases
Transcript
Page 1: Distributed Database

Parallel and Distributed Databases

Page 2: Distributed Database

• Parallel DBMS - What and Why?

• What is a Client/Server DBMS?

• Why do we need Distributed DBMSs?

• Date’s rules for a Distributed DBMS

• Benefits of a Distributed DBMS

• Issues associated with a Distributed DBMS

• Disadvantages of a Distributed DBMS

Page 3: Distributed Database

PARALLEL DATABASE SYSTEM

Page 4: Distributed Database

PARALLEL DBMSs: WHY DO WE NEED THEM?

• More and More Data!

We have databases that hold very large amounts of data, on the order of 10^12 bytes:

1,000,000,000,000 bytes!

• Faster and Faster Access!

We have data applications that need to process data at very high speeds:

10,000s of transactions per second!

SINGLE-PROCESSOR DBMSs AREN’T UP TO THE JOB!

Page 5: Distributed Database

PARALLEL DBMSs: BENEFITS OF A PARALLEL DBMS

INTERQUERY PARALLELISM

It is possible to process a number of transactions in parallel with each other.

Improves Throughput.

INTRAQUERY PARALLELISM

It is possible to process ‘sub-tasks’ of a transaction in parallel with each other.

Improves Response Time.

Page 6: Distributed Database

PARALLEL DBMSs: HOW TO MEASURE THE BENEFITS

Speed-Up.

As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor:

10 seconds to scan a DB of 10,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs

Scale-up.

As you multiply resources, the size of a task that can be executed in a given time should be increased by the same factor:

1 second to scan a DB of 1,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
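
The two measures above can be written as simple ratios. The following is a minimal sketch, not taken from the slides: the function names `speed_up` and `scale_up` are illustrative, and the numbers are the slide's own scan examples.

```python
def speed_up(time_small_system: float, time_large_system: float) -> float:
    """Same task on both systems: elapsed time on the small system / on the large one."""
    return time_small_system / time_large_system


def scale_up(time_small_task: float, time_large_task: float) -> float:
    """Small task on the small system vs. an N-times-larger task on an N-times-larger system."""
    return time_small_task / time_large_task


# Slide example: 10,000 records take 10 s on 1 CPU and 1 s on 10 CPUs.
linear_speed_up = speed_up(10.0, 1.0)   # matches the 10x increase in CPUs

# Slide example: 1,000 records on 1 CPU and 10,000 records on 10 CPUs both take 1 s.
linear_scale_up = scale_up(1.0, 1.0)    # a ratio of 1.0 means linear scale-up
```

A speed-up smaller than the resource factor, or a scale-up ratio below 1.0, corresponds to the sub-linear case.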

Page 7: Distributed Database

PARALLEL DBMSs: SPEED-UP

[Figure: number of transactions per second plotted against number of CPUs, showing a linear speed-up (ideal) line and a sub-linear speed-up curve. Labelled points: 1000/sec at 5 CPUs, 2000/sec at 10 CPUs, 1600/sec at 16 CPUs.]

Page 8: Distributed Database

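PARALLEL DBMSs: SCALE-UP

[Figure: number of transactions per second plotted against number of CPUs and database size, showing a linear scale-up (ideal) line and a sub-linear scale-up curve. Labelled points: 1000/sec with 5 CPUs and a 1 GB database; 900/sec with 10 CPUs and a 2 GB database.]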

Page 9: Distributed Database

Shared Memory – Parallel Database Architecture

[Figure: six CPUs attached to a single shared MEMORY.]

Page 10: Distributed Database

Shared Disk – Parallel Database Architecture

[Figure: six CPUs, each with its own memory (M).]

Page 11: Distributed Database

Shared Nothing – Parallel Database Architecture

[Figure: five nodes, each pairing a CPU with its own memory (M).]

Page 12: Distributed Database

MAINFRAME DATABASE SYSTEM

Page 13: Distributed Database

[Figure: DUMB TERMINALS linked by a SPECIALISED NETWORK CONNECTION to a MAINFRAME COMPUTER, which holds the PRESENTATION LOGIC, BUSINESS LOGIC and DATA LOGIC.]

Page 14: Distributed Database

DISTRIBUTED DATABASE SYSTEM

Page 15: Distributed Database

DISTRIBUTED DATABASES: WHAT IS A DISTRIBUTED DATABASE?

A distributed database system is a collection of logically related databases that co-operate in a transparent manner. Transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database.

There should be ‘location independence’, i.e. because the user is unaware of where the data is located, it is possible to move the data from one physical location to another without affecting the user.

Page 16: Distributed Database

DISTRIBUTED DATABASE ARCHITECTURE

[Figure: four sites (Leytonstone, Stratford, Barking and Leyton), each with clients on a LAN and a local DBMS, connected by a WIDE AREA NETWORK.]

Page 17: Distributed Database

M:N CLIENT/SERVER DBMS ARCHITECTURE

[Figure: CLIENT #1, CLIENT #2 and CLIENT #3 connected to D/BASE SERVER #1 and D/BASE SERVER #2.]

NOT TRANSPARENT!

Page 18: Distributed Database

COMPONENTS OF A DDBMS

[Figure: Site 1 and Site 2 connected by a computer network; the components shown are GSC, DDBMS, DC, LDBMS and DB.]

LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS

Page 19: Distributed Database

DISTRIBUTED DATABASES: ADVANTAGES

• Reduced Communication Overhead

Most data access is local, less expensive and performs better.

• Improved Processing Power

Instead of one server handling the full database, we now have a collection of machines handling the same database.

• Removal of Reliance on a Central Site

If a server fails, then the only part of the system that is affected is the relevant local site. The rest of the system remains functional and available.

Page 20: Distributed Database

DISTRIBUTED DATABASES: ADVANTAGES

• Expandability

It is easier to accommodate increasing the size of the global (logical) database.

• Local autonomy

The database is brought nearer to its users. This can effect a cultural change as it allows potentially greater control over local data.

Page 21: Distributed Database

DISTRIBUTED DATABASES: DATE’S TWELVE RULES FOR A DDBMS

A distributed system looks exactly like a non-distributed system to the user!

1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence

Page 22: Distributed Database

DISTRIBUTED PROCESSING ARCHITECTURE

[Figure: clients on LANs at Leyton, Stratford, Barking and Leytonstone, connected over a WIDE AREA NETWORK to a single DBMS.]

Page 23: Distributed Database

DISTRIBUTED DATABASES: ISSUES

• Data Allocation

• Data Fragmentation

• Distributed Catalogue Management

• Distributed Transactions

• Distributed Queries – (see chapter 20)

Page 24: Distributed Database

DISTRIBUTED DATABASES: DATA ALLOCATION METRICS

1. Locality of reference: Is the data near to the sites that need it?

2. Reliability and availability: Does the strategy improve fault tolerance and accessibility?

3. Performance: Does the strategy result in bottlenecks or under-utilisation of resources?

4. Storage costs: How does the strategy affect the availability and cost of data storage?

5. Communication costs: How much network traffic will result from the strategy?

Page 25: Distributed Database

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

CENTRALISED

Locality of Reference: Lowest
Reliability/Availability: Lowest
Storage Costs: Lowest
Performance: Unsatisfactory
Communication Costs: Highest

Page 26: Distributed Database

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

PARTITIONED/FRAGMENTED

Locality of Reference: High
Reliability/Availability: Low (item) – High (system)
Storage Costs: Lowest
Performance: Satisfactory
Communication Costs: Low

Page 27: Distributed Database

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

COMPLETE REPLICATION

Locality of Reference: Highest
Reliability/Availability: Highest
Storage Costs: Highest
Performance: High
Communication Costs: High (update) – Low (read)

Page 28: Distributed Database

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

SELECTIVE REPLICATION

Locality of Reference: High
Reliability/Availability: Low (item) – High (system)
Storage Costs: Average
Performance: Satisfactory
Communication Costs: Low

Page 29: Distributed Database

DISTRIBUTED DATABASES: WHY FRAGMENT DATA?

• Usage: Applications are usually interested in ‘views’ not whole relations.

• Efficiency: It’s more efficient if data is close to where it is frequently used.

• Parallelism: It is possible to run several ‘sub-queries’ in tandem.

• Security: Data not required by local applications is not stored at the local site.

Page 30: Distributed Database

CLIENT/SERVER DATABASE SYSTEM

Page 31: Distributed Database

CLIENT/SERVER DBMS: CLIENT PROCESS

• Manages user interface

• Accepts user data

• Processes application/business logic

• Generates database requests (SQL)

• Transmits database requests to server

• Receives results from server

• Formats results according to application logic

• Presents results to the user

Page 32: Distributed Database

CLIENT/SERVER DBMS: SERVER PROCESS

• Accepts database requests

• Processes database requests
– Performs integrity checks
– Handles concurrent access
– Optimises queries
– Performs security checks
– Enacts recovery routines

• Transmits result of database request to client

Page 33: Distributed Database

CLIENT/SERVER DBMS ARCHITECTURE (FAT CLIENT)

[Figure: CLIENT #1, CLIENT #2 and CLIENT #3 hold the PRESENTATION LOGIC and BUSINESS LOGIC; the D/BASE SERVER holds the DATA LOGIC. Data requests flow from client to server and data responses flow back.]

Page 34: Distributed Database

CLIENT/SERVER DBMS ARCHITECTURE (THIN CLIENT)

[Figure: CLIENT #1, CLIENT #2 and CLIENT #3 hold the PRESENTATION LOGIC; the D/BASE SERVER holds the BUSINESS LOGIC and DATA LOGIC (PL/SQL). Data requests flow from client to server and data responses flow back.]

Page 35: Distributed Database

DISTRIBUTED PROCESSING ARCHITECTURE

[Figure: clients on LANs at Leyton, Stratford, Barking and Leytonstone, connected over a WIDE AREA NETWORK to a single DBMS.]

Page 36: Distributed Database

Middleware Systems Overview and Introduction

Page 37: Distributed Database

Middleware Systems

• Middleware systems are comprised of abstractions and services to facilitate the design, development, integration and deployment of distributed applications in heterogeneous networking environments.
– remote communication mechanisms (Web services, CORBA, Java RMI, DCOM - i.e. request brokers)
– event notification and messaging services (COSS Notifications, Java Message Service etc.)
– transaction services
– naming services (COSS Naming, LDAP)

Page 38: Distributed Database

Definition by Example

• The following constitute middleware systems or middleware platforms
– CORBA, DCE, RMI, J2EE (?), Web Services, DCOM, COM+, .Net Remoting, application servers, …
– some of these are collections and aggregations of many different services
– some are marketing terms

Page 39: Distributed Database

What & Where is Middleware ?

[Figure: Middleware Systems shown in relation to Distributed Systems, Programming Languages, Databases, Operating Systems and Networking.]

• middleware is dispersed among many disciplines

Page 40: Distributed Database

What & Where is Middleware ?

[Figure: the same disciplines, labelled with their main venues]

Distributed Systems: ACM PODC, ICDE
Middleware: ACM/IFIP/IEEE Middleware Conference, DEBS, DOA, EDOC
Programming Languages
Databases: SIGMOD, VLDB, ICDE
Operating Systems: SIGOPS
Networking: SIGCOMM, INFOCOM

• mobile computing, software engineering, ….

Page 41: Distributed Database

Middleware Research

• dispersed among different fields

• with different research methodologies

• different standards, points of view, and approaches

• a Middleware research community is starting to crystallize around conferences such as Middleware, DEBS, DOA, EDOC et al.
– Many other conferences have middleware tracks

• many existing fields/communities are broadening their scope

• “middleware” is still somewhat a trendy or marketing term, but I think it is crystallizing into a separate field - middleware systems.

• in the long term we are trying to identify concepts and build a body of knowledge that identifies middleware systems - much like OS - PL - DS ...

Page 42: Distributed Database

Middleware Systems I

• In a nutshell:
– Middleware is about supporting the development of distributed applications in networked environments

• This also includes the integration of systems

• About making this task easier, more efficient, less error prone

• About enabling the infrastructure software for this task

Page 43: Distributed Database

Middleware Systems II

• software technologies to help manage complexity and heterogeneity inherent to the development of distributed systems, distributed applications, and information systems

• layer of software above the operating system and the network substrate, but below the application

• Higher-level programming abstraction for developing the distributed application

• higher than “lower” level abstractions, such as sockets provided by the operating system
– a socket is a communication end-point from which data can be read or onto which data can be written
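
To make the socket definition concrete, here is a minimal sketch using Python's standard `socket` module. It assumes a platform where `socket.socketpair()` is available, and the payload bytes are purely illustrative. Data written onto one end-point is read from the other, which is the "lower-level" abstraction that middleware sits above.

```python
import socket

# socketpair() returns two already-connected end-points in one process,
# which keeps the sketch self-contained (no server address or port needed).
left, right = socket.socketpair()
try:
    left.sendall(b"SELECT 1")      # write bytes onto one end-point
    received = right.recv(1024)    # read them from the other end-point
finally:
    left.close()
    right.close()
```

Everything beyond moving raw bytes (message framing, naming, failure handling, data conversion) is left to the programmer at this level.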

Page 44: Distributed Database

Middleware Systems III

• aims at reducing the burden of developing distributed applications for the developer

• informally called “plumbing”, i.e., like pipes that connect entities for communication

• often called “glue code”, i.e., it glues independent systems together and makes them work together

• it masks the heterogeneity programmers of distributed applications have to deal with
– network & hardware
– operating system & programming language
– different middleware platforms
– location, access, failure, concurrency, mobility, ...

• often also referred to as transparencies, i.e., network transparency, location transparency

Page 45: Distributed Database

Middleware Systems IV

• an operating system is “the software that makes the hardware usable”

• similarly, a middleware system makes the distributed system programmable and manageable

• a bare computer without an OS could be programmed; likewise, a distributed application could be developed without middleware

• programs could be written in assembly, but higher-level languages are far more productive for this purpose

• however, sometimes the assembly-variant is chosen - WHY?

Page 46: Distributed Database

The Questions

• What are the right programming abstractions for middleware systems?

• What protocols do these abstractions require to work as promised?

• What, if any, of the underlying systems (networks, hardware, distribution) should be exposed to the application developer?
– Views range from
  • full distribution transparency to
  • full control and visibility of underlying system to
  • fewer hybrid approaches achieving both
– With each having vast implications on the programming abstractions offered

Page 47: Distributed Database

Middleware Metaphorically

[Figure: Host 1 and Host 2 each run a stack of Distributed application, Middleware and Operating system; the two hosts are connected by the Network.]

Page 48: Distributed Database

Categories of Middleware

• remote invocation mechanisms
– e.g., DCOM, CORBA, DCE, Sun RPC, Java RMI, Web Services ...

• naming and directory services
– e.g., JNDI, LDAP, COSS Naming, DNS, COSS trader, ...

• message oriented middleware
– e.g., JMS, MQSI, MQSeries, ...

• publish/subscribe systems
– e.g., JMS, various proprietary systems, COSS Notification

Page 49: Distributed Database

Categories II

• (distributed) tuple spaces
– (databases) - I do not consider a DBMS a middleware system
– Linda, initially an abstraction for developing parallel programs
– inspired InfoSpaces, later JavaSpaces, later JINI

• transaction processing system (TP-monitors)
– implement transactional applications, e.g., ATM example

• adapters, wrappers, mediators

Page 50: Distributed Database

Categories III

• choreography and orchestration
– Workflow and business process tools (BPEL et al.)
– a.k.a. Web service composition

• fault tolerance, load balancing, etc.

• real-time, embedded, high-performance, safety critical

Page 51: Distributed Database

Middleware Curriculum

• A middleware curriculum needs to capture the invariants defining the above categories and present them

• A middleware curriculum needs to capture the essence and the lessons learned from specifying and building these types of systems over and over again

• We have witnessed the re-invention of many of these abstractions without any functional changes over the past 25 years (see later in the course.)

• Due to lack of time and the invited guest lectures, we will only look at a few of these categories

Page 52: Distributed Database

Concurrency Control

Page 53: Distributed Database

Lock-Based Protocols

• A lock is a mechanism to control concurrent access to a data item

• Data items can be locked in two modes:
  1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction.
  2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction.

• Lock requests are made to concurrency-control manager. Transaction can proceed only after request is granted.

Page 54: Distributed Database

Lock-Based Protocols (Cont.)

• Lock-compatibility matrix

[Figure: lock-compatibility matrix for S and X modes]

• A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions

• Any number of transactions can hold shared locks on an item,
– but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.

• If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
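
The compatibility rule stated above (shared locks coexist; an exclusive lock excludes everything) can be sketched as a lookup table plus a grant check. The names `COMPATIBLE` and `can_grant` are illustrative and do not come from the slides.

```python
from typing import Iterable

# COMPATIBLE[(held, requested)] is True only when both locks may coexist.
# Only shared/shared is compatible, as the bullets above state.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}


def can_grant(requested: str, held_by_others: Iterable[str]) -> bool:
    """Grant only if the request is compatible with every lock other transactions hold."""
    return all(COMPATIBLE[(held, requested)] for held in held_by_others)


shared_with_readers = can_grant("S", ["S", "S"])   # readers can share the item
writer_with_reader = can_grant("X", ["S"])         # a writer must wait for the reader
```

When `can_grant` returns False, the requesting transaction would wait until the incompatible locks are released.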

Page 55: Distributed Database

Lock-Based Protocols (Cont.)

• Example of a transaction performing locking:

    T2: lock-S(A);
        read (A);
        unlock(A);
        lock-S(B);
        read (B);
        unlock(B);
        display(A+B)

• Locking as above is not sufficient to guarantee serializability: if A and B get updated in-between the read of A and B, the displayed sum would be wrong.

• A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
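
The problem with T2 can be reproduced in a few lines. This is a sketch under assumed values: the starting balances (A = 100, B = 200) and the interleaved transfer of 50 from A to B are hypothetical and only serve to show the effect.

```python
def run_t2(interleave_update: bool) -> int:
    db = {"A": 100, "B": 200}      # hypothetical starting values
    a = db["A"]                    # T2: lock-S(A); read(A); unlock(A)
    if interleave_update:
        # T2 holds no lock at this point, so another transaction may run:
        # it moves 50 from A to B, leaving the true total unchanged at 300.
        db["A"] -= 50
        db["B"] += 50
    b = db["B"]                    # T2: lock-S(B); read(B); unlock(B)
    return a + b                   # T2: display(A+B)


serial_sum = run_t2(interleave_update=False)       # 300
interleaved_sum = run_t2(interleave_update=True)   # 350, a total that never existed
```

T2 sees the old value of A and the new value of B, so the displayed sum matches no serial order of the two transactions.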

Page 56: Distributed Database

Pitfalls of Lock-Based Protocols

• Consider the partial schedule

[Figure: partial schedule for T3 and T4]

• Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to wait for T3 to release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.

• Such a situation is called a deadlock.
– To handle a deadlock one of T3 or T4 must be rolled back and its locks released.
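
One common way to detect the situation described above is to look for a cycle in a wait-for graph. The slides do not name a detection method, so this is an illustrative sketch; `find_cycle` and the dictionary layout are assumptions.

```python
def find_cycle(waits_for: dict) -> list:
    """Return one cycle of transactions in the wait-for graph, or [] if there is none."""
    visiting, done = [], set()

    def visit(txn):
        if txn in done:
            return []
        if txn in visiting:
            # We came back to a transaction on the current path: that path suffix is a cycle.
            return visiting[visiting.index(txn):]
        visiting.append(txn)
        for blocker in waits_for.get(txn, ()):
            cycle = visit(blocker)
            if cycle:
                return cycle
        visiting.pop()
        done.add(txn)
        return []

    for txn in waits_for:
        cycle = visit(txn)
        if cycle:
            return cycle
    return []


# T3 waits for T4 (lock-X(A)) and T4 waits for T3 (lock-S(B)), as in the text above.
deadlocked = find_cycle({"T3": ["T4"], "T4": ["T3"]})
```

Any transaction on the returned cycle is a candidate victim to roll back so that the others can proceed.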

Page 57: Distributed Database

Pitfalls of Lock-Based Protocols (Cont.)

• The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.

• Starvation is also possible if concurrency control manager is badly designed. For example:
– A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.
– The same transaction is repeatedly rolled back due to deadlocks.

• Concurrency control manager can be designed to prevent starvation.

Page 58: Distributed Database

The Two-Phase Locking Protocol

• This is a protocol which ensures conflict-serializable schedules.

• Phase 1: Growing Phase
– transaction may obtain locks
– transaction may not release locks

• Phase 2: Shrinking Phase
– transaction may release locks
– transaction may not obtain locks

• The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
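
The two-phase rule can be checked mechanically for a single transaction's sequence of lock operations. This is a minimal sketch: the `(action, item)` encoding and the name `lock_point` are assumptions made for illustration.

```python
def lock_point(ops: list) -> int:
    """Return the index of the final lock acquisition if ops obey two-phase locking.

    ops is a list of (action, item) pairs, with action "lock" or "unlock".
    Raises ValueError if a lock is requested after any unlock.
    """
    shrinking = False
    last_lock = -1
    for i, (action, _item) in enumerate(ops):
        if action == "unlock":
            shrinking = True            # the shrinking phase has begun
        elif action == "lock":
            if shrinking:
                raise ValueError(f"lock at step {i} follows an unlock: not two-phase")
            last_lock = i               # still growing; remember the latest acquisition
    return last_lock


two_phase = [("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B")]
# Same lock/unlock pattern as T2 earlier in the deck: A is released before B is locked.
not_two_phase = [("lock", "A"), ("unlock", "A"), ("lock", "B"), ("unlock", "B")]
```

For `two_phase` the lock point is step 1 (the acquisition of B); `not_two_phase` is rejected because it acquires a lock after releasing one.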

Page 59: Distributed Database

The Two-Phase Locking Protocol (Cont.)

• Two-phase locking does not ensure freedom from deadlocks

• Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.

• Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.

Page 60: Distributed Database

The Two-Phase Locking Protocol (Cont.)

• There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.

• However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:

Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.

Page 61: Distributed Database

Lock Conversions

• Two-phase locking with lock conversions:
– First Phase:
  – can acquire a lock-S on item
  – can acquire a lock-X on item
  – can convert a lock-S to a lock-X (upgrade)
– Second Phase:
  – can release a lock-S
  – can release a lock-X
  – can convert a lock-X to a lock-S (downgrade)

• This protocol assures serializability, but it still relies on the programmer to insert the various locking instructions.

Page 62: Distributed Database

Automatic Acquisition of Locks

• A transaction Ti issues the standard read/write instruction, without explicit locking calls.

• The operation read(D) is processed as:

    if Ti has a lock on D
      then
        read(D)
      else begin
        if necessary wait until no other transaction has a lock-X on D
        grant Ti a lock-S on D;
        read(D)
      end

Page 63: Distributed Database

Automatic Acquisition of Locks (Cont.)

• write(D) is processed as:

    if Ti has a lock-X on D
      then
        write(D)
      else begin
        if necessary wait until no other trans. has any lock on D,
        if Ti has a lock-S on D
          then
            upgrade lock on D to lock-X
          else
            grant Ti a lock-X on D
        write(D)
      end;

• All locks are released after commit or abort
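
The read(D) and write(D) rules can be modelled in a small single-threaded sketch. The class name `AutoLocker` is hypothetical, and "wait" is simplified to returning False so the caller can retry; a real lock manager would block or queue the request instead.

```python
class AutoLocker:
    """Single-threaded model: a request that would have to wait returns False."""

    def __init__(self):
        self.locks = {}  # item -> {transaction: "S" or "X"}

    def read(self, txn, item):
        holders = self.locks.setdefault(item, {})
        if txn in holders:
            return True                  # Ti already has a lock on D: read(D)
        if "X" in holders.values():
            return False                 # must wait: another transaction has lock-X
        holders[txn] = "S"               # grant Ti a lock-S on D, then read(D)
        return True

    def write(self, txn, item):
        holders = self.locks.setdefault(item, {})
        if holders.get(txn) == "X":
            return True                  # Ti already has lock-X: write(D)
        if any(other != txn for other in holders):
            return False                 # must wait: another transaction holds a lock
        holders[txn] = "X"               # upgrade S to X, or grant a fresh lock-X
        return True

    def finish(self, txn):
        """Release all locks held by txn, as happens after commit or abort."""
        for holders in self.locks.values():
            holders.pop(txn, None)


locker = AutoLocker()
t1_read = locker.read("T1", "D")          # granted lock-S
t2_read = locker.read("T2", "D")          # lock-S is shared, so also granted
t1_write = locker.write("T1", "D")        # refused while T2 holds lock-S
locker.finish("T2")                       # T2 commits and releases its locks
t1_write_retry = locker.write("T1", "D")  # now upgraded to lock-X
```

Because locks are only released in `finish`, every transaction in this model holds all its locks until commit or abort.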

Page 64: Distributed Database

Implementation of Locking

• A lock manager can be implemented as a separate process to which transactions send lock and unlock requests

• The lock manager replies to a lock request by sending a lock grant message (or a message asking the transaction to roll back, in case of a deadlock)

• The requesting transaction waits until its request is answered

• The lock manager maintains a data-structure called a lock table to record granted locks and pending requests

• The lock table is usually implemented as an in-memory hash table indexed on the name of the data item being locked
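
A lock table along these lines can be sketched with Python dictionaries, which are in-memory hash tables. The class name `LockTable`, the FIFO queue per item, and the rule that new requests queue behind earlier waiters are all illustrative design choices, not details given in the slides; lock upgrades are left out for brevity.

```python
from collections import defaultdict, deque


class LockTable:
    """Hash table keyed by data-item name: granted locks plus a FIFO queue of pending requests."""

    def __init__(self):
        self.granted = defaultdict(dict)    # item -> {txn: mode}
        self.pending = defaultdict(deque)   # item -> queue of (txn, mode)

    def _compatible(self, item, mode):
        held = list(self.granted[item].values())
        return not held or (mode == "S" and all(m == "S" for m in held))

    def lock(self, txn, item, mode):
        """Return True if granted immediately, False if the request is queued."""
        # Queue behind earlier waiters so a waiting X request is not starved by later S requests.
        if not self.pending[item] and self._compatible(item, mode):
            self.granted[item][txn] = mode
            return True
        self.pending[item].append((txn, mode))
        return False

    def unlock(self, txn, item):
        """Release a lock and return the transactions whose queued requests are now granted."""
        self.granted[item].pop(txn, None)
        woken = []
        while self.pending[item] and self._compatible(item, self.pending[item][0][1]):
            waiter, mode = self.pending[item].popleft()
            self.granted[item][waiter] = mode
            woken.append(waiter)
        return woken


table = LockTable()
first = table.lock("T1", "A", "S")     # granted
second = table.lock("T2", "A", "X")    # queued behind T1's shared lock
third = table.lock("T3", "A", "S")     # queued behind T2, not granted ahead of it
after_t1 = table.unlock("T1", "A")     # T2's exclusive request is granted
after_t2 = table.unlock("T2", "A")     # T3's shared request is granted
```

Serving the queue in arrival order is one simple way to avoid the starvation case mentioned earlier, where a stream of S-lock requests keeps an X-lock request waiting.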

