Chapter 10: Distributed Database Management Systems
Database Systems: Design, Implementation, and Management
DDBMS
Distributed Database Management System
Governs the storage and processing of logically related data over interconnected computer systems in which both data and processing functions are distributed among several sites.
The Evolution of Distributed DBMS
Centralized DBMS in the 1970s
Support for structured information needs.
Regularly issued formal reports in standard formats.
Prepared by specialists using 3GLs in response to precisely channeled requests.
Centrally stored corporate data.
Data access through dumb terminals.
Incapable of providing quick, unstructured, and ad hoc information for decision makers in a dynamic business environment.
The Evolution of Distributed DBMS
Social and Technical Changes in the 1980s
Business operations became more decentralized geographically.
Competition increased at the global level.
Customer demands and market needs favored a decentralized management style.
Rapid technological change created low-cost microcomputers. The LANs became the basis for computerized solutions.
The large number of applications based on DBMSs and the need to protect investments in centralized DBMS software made the notion of data sharing attractive.
The Evolution of Distributed DBMS
Two Database Requirements in a Dynamic Business Environment:
Quick ad hoc data access became crucial in the quick-response decision making environment.
The decentralization of management structure based on the decentralization of business units made decentralized multiple-access and multiple-location databases a necessity.
Ad hoc query: an improvised, spur-of-the-moment query.
The Evolution of Distributed DBMS
Developments in the 1990s affecting DBMS
The growing acceptance of the Internet and the World Wide Web as the platform for data access and distribution.
The increased focus on data analysis that led to data mining and data warehousing.
Data mining is proactive (in contrast to reactive DSS tools). Instead of having the end user define the problem, select the data, and select the tools to analyze the data, the data-mining tool automatically searches the data for anomalies and possible relationships, thereby identifying problems that have not yet been identified by the end user.
Data warehousing: an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.
The Evolution of Distributed DBMS
DDBMS Advantages
Data are located near the “greatest demand” site.
Faster data access
Faster data processing
Growth facilitation: new sites can be added to the network without affecting the operations of other sites.
Improved communications: a local accounts receivable operation uses sales department data directly, without having to depend on delayed reports from the central office.
The Evolution of Distributed DBMS
DDBMS Advantages
Reduced operating costs: low-cost PCs versus mainframes.
User-friendly interface: easy-to-use GUIs.
Less danger of a single-point failure
Processor independence: requests do not depend on a specific processor; any available processor can handle the user’s request.
The Evolution of Distributed DBMS
DDBMS Disadvantages
Complexity of management and control.
Security: the possibility of security lapses increases when data are located at multiple sites.
Lack of standards: there are no standard communication protocols for DDBMSs.
Increased storage requirements: data replication.
Distributed Processing and Distributed Database
Distributed processing shares the database’s logical processing among two or more physically independent sites that are connected through a network. (Figure 10.1)
A distributed database stores a logically related database over two or more physically independent sites connected via a computer network. (Figure 10.2)
Distributed Processing Environment
Figure 10.1
Distributed Database Environment
Figure 10.2
Distributed Processing and Distributed Database
Distributed processing does not require a distributed database.
A distributed database requires distributed processing.
Distributed processing may be based on a single database located on a single computer. To manage distributed data, copies or parts of the database processing functions must be distributed to all data storage sites.
Both distributed processing and distributed databases require a network to connect all components.
What Is a Distributed DBMS?
A distributed database management system (DDBMS) governs the storage and processing of logically related data over interconnected computer systems in which both data and processing functions are distributed among several sites.
What Is a Distributed DBMS?
Functions of a DDBMS
Application interface
Validation (analyze data requests)
Transformation (determine whether the request’s components are distributed or local)
Query optimization to find the best access strategy
Mapping to determine the data location
I/O interface to read or write data
Formatting to prepare the data for presentation
Security to provide data privacy
Backup and recovery
Database administration
Concurrency control
Transaction management
Centralized Database Management System
Figure 10.3
Fully Distributed Database Management System
Figure 10.4
Single Logical Database (Fig. 10.4)
Both users see only one logical database and do not need to know the names of the fragments.
The end user need not even know:
that the database is divided into separate fragments
where the fragments are located
DDBMS Components
Computer workstations that form the network system.
Network hardware and software components that reside in each workstation.
Communications media that carry the data from one workstation to another.
Transaction processor (TP) receives and processes the application’s data requests.
Data processor (DP) stores and retrieves data located at the site. Also known as data manager (DM).
Distributed Database System Components
Figure 10.5
DDBMS Components
DDBMS protocol determines how the DDBMS will:
Interface with the network to transport data and commands between DPs and TPs.
Synchronize all data received from DPs (TP side) and route retrieved data to the appropriate TPs (DP side).
Ensure common database functions in a distributed system -- security, concurrency control, backup, and recovery.
Levels of Data & Process Distribution
Database systems can be classified based on process distribution and data distribution
Data may be single-site (SD) or multiple-site (MD); processing may be single-site (SP) or multiple-site (MP). The combinations discussed below are SPSD, MPSD, and MPMD.
Levels of Data & Process Distribution
SPSD (Single-Site Processing, Single-Site Data)
All processing is done on a single CPU or host computer.
All data are stored on the host computer’s local disk.
The DBMS is located on the host computer.
The DBMS is accessed by dumb terminals.
Typical of most mainframe and minicomputer DBMSs.
Typical of the first generation of single-user microcomputer databases.
Host DBMS
Nondistributed (Centralized) DBMS: SPSD (Single-Site Processing, Single-Site Data)
Figure 10.6
Levels of Data & Process Distribution
MPSD (Multiple-Site Processing, Single-Site Data)
Typically, MPSD requires a network file server on which conventional applications are accessed through a LAN.
A variation of the MPSD approach is known as a client/server architecture. (Chapter 12)
LAN DBMS
Levels of Data & Process Distribution
MPMD (Multiple-Site Processing, Multiple-Site Data)
Fully distributed DBMS with support for multiple DPs and TPs at multiple sites.
Homogeneous DDBMSs integrate only one type of centralized DBMS over the network.
Heterogeneous DDBMS integrate different types of centralized DBMSs over a network. (See Figure 10.8)
Figure 10.8 Heterogeneous Distributed Database Scenario
Distributed DB Transparency
DDBMS transparency features have the common property of allowing the end user to think that he or she is the database’s only user.
Distribution transparency
Transaction transparency
Failure transparency
Performance transparency
Heterogeneity transparency
Distribution Transparency
Distribution transparency allows us to manage a physically dispersed database as though it were a centralized database.
Three Levels of Distribution Transparency:
Fragmentation transparency
Location transparency
Local mapping transparency
Names?
Distribution Transparency
Example (Figure 10.9): Employee data (EMPLOYEE) are distributed over three locations: New York, Atlanta, and Miami. Depending on the level of distribution transparency support, three different cases of queries are possible:
Distribution Transparency
Case 1: DB Supports Fragmentation Transparency
SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1940';
Distribution Transparency
Case 2: DB Supports Location Transparency
SELECT * FROM E1 WHERE EMP_DOB < '01-JAN-1940'
UNION
SELECT * FROM E2 WHERE EMP_DOB < '01-JAN-1940'
UNION
SELECT * FROM E3 WHERE EMP_DOB < '01-JAN-1940';
Distribution Transparency
Case 3: DB Supports Local Mapping Transparency
SELECT * FROM E1 NODE NY WHERE EMP_DOB < '01-JAN-1940'
UNION
SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '01-JAN-1940'
UNION
SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '01-JAN-1940';
Distribution Transparency
Distribution transparency is supported by a distributed data dictionary (DDD) or distributed data catalog (DDC).
The DDC contains the description of the entire database as seen by the database administrator.
The database description, known as the distributed global schema, is the common database schema used by local TPs to translate user requests into subqueries.
Transaction Transparency
Transaction transparency ensures that database transactions will maintain the database’s integrity and consistency.
The transaction will be completed only if all database sites involved in the transaction complete their part of the transaction.
Related Concepts:
Remote Requests
Remote Transactions
Distributed Transactions
Distributed Requests
Transaction Transparency
Distributed Requests and Distributed Transactions
A remote request allows us to access data to be processed by a single remote database processor. (Figure 10.10)
A Remote Request
Figure 10.10
Transaction Transparency
Distributed Requests and Distributed Transactions
A remote transaction, composed of several requests, may access data at only a single site. (Figure 10.11)
A Remote Transaction
Figure 10.11
Transaction Transparency
Distributed Requests and Distributed Transactions
A distributed transaction allows a transaction to reference several different (local or remote) DP sites. (Figure 10.12)
A Distributed Transaction
Transaction Transparency
Distributed Requests and Distributed Transactions
A distributed request lets us reference data from several remote DP sites. (Figure 10.13)
It also allows a single request to reference a physically partitioned table. (Figure 10.14)
A Distributed Request
Another Distributed Request
Distributed Concurrency Control
Concurrency control becomes especially important because multisite, multiple-process operations are much more likely to create data inconsistencies and deadlocked transactions.
Premature commit (Fig. 10.15)
Each transaction operation was committed by a local DP (sites A and B), but one of the DPs could not commit the transaction’s results (site C).
This yields an inconsistent database, because we cannot uncommit committed data.
Solution: two-phase commit protocol
Figure 10.15
Transaction Transparency
Two-Phase Commit Protocol
The two-phase commit protocol guarantees that, if a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state.
Each DP maintains its own transaction log. The two-phase protocol requires that each individual DP’s transaction log entry be written before the database fragment is actually updated.
The two-phase commit protocol requires a DO-UNDO-REDO protocol and a write-ahead protocol.
Transaction Transparency
Two-Phase Commit Protocol
The DO-UNDO-REDO protocol is used by the DP to roll back and/or roll forward transactions with the help of the system’s transaction log entries.
DO performs the operation and records the “before” and “after” values in the transaction log.
UNDO reverses an operation, using the log entries written by the DO portion of the sequence.
REDO redoes an operation, using the log entries written by the DO portion of the sequence.
To ensure that DO-UNDO-REDO operations can survive a system crash while they are being executed, a write-ahead protocol is used.
The write-ahead protocol forces the log entry to be written to permanent storage before the actual operation takes place.
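The DO-UNDO-REDO idea with a write-ahead log can be sketched in a few lines. This is only an illustration, not the textbook's implementation; all names (WriteAheadLog, do, undo, redo) are invented for the example.

```python
# Minimal sketch of DO-UNDO-REDO with a write-ahead log.
class WriteAheadLog:
    def __init__(self):
        self.entries = []          # each entry: (key, before, after)

    def append(self, key, before, after):
        # Write-ahead: the entry is recorded BEFORE the data changes.
        self.entries.append((key, before, after))

db = {"balance": 100}
log = WriteAheadLog()

def do(key, new_value):
    log.append(key, db[key], new_value)   # log first (write-ahead)
    db[key] = new_value                   # then update the fragment

def undo(entry):
    key, before, _ = entry
    db[key] = before                      # roll back using the "before" value

def redo(entry):
    key, _, after = entry
    db[key] = after                       # roll forward using the "after" value

do("balance", 250)
undo(log.entries[-1])      # balance rolls back to 100
redo(log.entries[-1])      # balance rolls forward to 250 again
```

Because the log entry survives even if the crash occurs between `log.append` and the data update, recovery can always decide whether to UNDO or REDO.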
Transaction Transparency
The two-phase commit protocol defines the operations between two types of nodes: the coordinator and one or more subordinates or cohorts. The protocol is implemented in two phases:
Phase 1: Preparation
The coordinator sends a PREPARE TO COMMIT message to all subordinates.
The subordinates receive the message, write the transaction log using the write-ahead protocol, and send an acknowledgement message to the coordinator.
The coordinator makes sure that all nodes are ready to commit, or it aborts the transaction.
Transaction Transparency
Phase 2: The Final Commit
The coordinator broadcasts a COMMIT message to all subordinates and waits for the replies.
Each subordinate receives the COMMIT message and then updates the database, using the DO protocol.
The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.
If one or more subordinates did not commit, the coordinator sends an ABORT message, thereby forcing them to UNDO all changes.
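The two phases above can be sketched as follows. The class and function names are invented for illustration, and the "subordinates" are plain objects standing in for remote DP sites.

```python
# Hedged sketch of the two-phase commit flow.
class Subordinate:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: write the transaction log (write-ahead) and acknowledge.
        self.state = "prepared" if self.will_commit else "refused"
        return self.state == "prepared"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"       # UNDO any changes made so far

def two_phase_commit(coordinator_log, subordinates):
    # Phase 1: Preparation -- all sites must acknowledge readiness.
    if not all(s.prepare() for s in subordinates):
        for s in subordinates:
            s.abort()
        coordinator_log.append("ABORT")
        return False
    # Phase 2: Final commit -- only reached when every site is prepared.
    for s in subordinates:
        s.commit()
    coordinator_log.append("COMMIT")
    return True

log = []
sites = [Subordinate("A"), Subordinate("B"), Subordinate("C", will_commit=False)]
two_phase_commit(log, sites)   # site C cannot commit, so every site aborts
```

The key property is visible in the control flow: no site commits until every site has acknowledged the PREPARE, so a single failed site forces a global UNDO instead of a premature commit.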
Phase 1
Phase 2
Finite State Diagram for Coordinator and Cohort
Performance Transparency and Query Optimization
The objective of a query optimization routine is to minimize the total cost associated with the execution of a request. The costs associated with a request are a function of the:
Access time (I/O) cost involved in accessing the physical data stored on disk.
Communication cost associated with the transmission of data among nodes in distributed database systems.
CPU time cost associated with the processing overhead of managing distributed transactions.
Query optimization must provide distribution transparency as well as replica transparency.
Replica transparency refers to the DDBMS’s ability to hide the existence of multiple copies of data from the user.
Most query optimization algorithms are based on two principles:
Selection of the optimum execution order.
Selection of the sites to be accessed, to minimize communication costs.
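A toy illustration of cost-based plan selection: the optimizer totals the three cost components listed above and picks the cheapest strategy. The plan names and numbers are invented for the example; in a real DDBMS the figures would come from catalog statistics.

```python
# Candidate access strategies with assumed (made-up) cost components.
plans = {
    "ship_whole_table":      {"io": 20, "comm": 500, "cpu": 5},
    "filter_at_remote_site": {"io": 25, "comm": 40,  "cpu": 10},
}

def total_cost(plan):
    # Total cost = access (I/O) + communication + CPU time.
    return plan["io"] + plan["comm"] + plan["cpu"]

best = min(plans, key=lambda name: total_cost(plans[name]))
# Filtering at the remote site wins because it transmits far less data.
```

This mirrors the second principle above: moving the selection to the data's site shrinks the dominant communication cost.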
Performance Transparency and Query Optimization
Operation Modes of Query Optimization
Automatic query optimization means that the DDBMS finds the most cost-effective access path without user intervention.
Manual query optimization requires that the optimization be selected and scheduled by the end user or programmer.
Timing of Query Optimization
Static query optimization takes place at compilation time.
Dynamic query optimization takes place at execution time.
Performance Transparency and Query Optimization
Optimization Techniques by Information Used
Statistically based query optimization uses statistical information about the database.
In the dynamic statistical generation mode, the DDBMS automatically evaluates and updates the statistics after each access.
In the manual statistical generation mode, the statistics must be updated periodically through a user-selected utility.
Rule-based query optimization algorithm is based on a set of user-defined rules to determine the best query access strategy.
Distributed Database Design
The design of a distributed database introduces three new issues:
How to partition the database into fragments.
Which fragments to replicate.
Where to locate those fragments and replicas.
Data Fragmentation
Data fragmentation allows us to break a single object into two or more segments, or fragments.
Each fragment can be stored at any site over a computer network.
Data fragmentation information is stored in the distributed data catalog (DDC), from which it is accessed by the transaction processor (TP) to process user requests.
Three Types of Fragmentation Strategies:
Horizontal fragmentation
Vertical fragmentation
Mixed fragmentation
A Sample CUSTOMER Table
Figure 10.16
Data Fragmentation
Horizontal Fragmentation
Division of a relation into subsets (fragments) of tuples (rows). Each fragment is stored at a different node, and each fragment has unique rows. Each fragment represents the equivalent of a SELECT statement, with the WHERE clause on a single attribute.
Table 10.3 Horizontal Fragmentation of the CUSTOMER Table By State
Table Fragments In Three Locations
Figure 10.17
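Horizontal fragmentation by state can be demonstrated with sqlite3. The sample rows are invented (they only loosely echo the chapter's CUSTOMER example), and the three fragment tables stand in for tables stored at three different sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (CUS_NUM INTEGER, CUS_NAME TEXT, CUS_STATE TEXT)")
conn.executemany("INSERT INTO CUSTOMER VALUES (?, ?, ?)", [
    (10, "Sinex, Inc.",  "TN"),
    (12, "Trydon Corp.", "FL"),
    (15, "Laval Bros.",  "GA"),
    (18, "Dunne Supply", "TN"),
])

# Each fragment holds the rows selected by a WHERE clause on one
# attribute (CUS_STATE), as if stored at a different site.
for state in ("TN", "FL", "GA"):
    conn.execute(f"CREATE TABLE CUST_{state} (CUS_NUM INTEGER, CUS_NAME TEXT, CUS_STATE TEXT)")
    conn.execute(f"INSERT INTO CUST_{state} SELECT * FROM CUSTOMER WHERE CUS_STATE = ?", (state,))

# A UNION of all fragments reconstructs the original table.
rebuilt = conn.execute(
    "SELECT * FROM CUST_TN UNION SELECT * FROM CUST_FL UNION SELECT * FROM CUST_GA "
    "ORDER BY CUS_NUM"
).fetchall()
```

Because every row lands in exactly one fragment, the fragments are disjoint and their union loses nothing.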
Data Fragmentation
Vertical Fragmentation
Division of a relation into attribute (column) subsets. Each subset (fragment) is stored at a different node, and each fragment has unique columns, with the exception of the key column. This is the equivalent of the PROJECT statement.
Table 10.4 Vertical Fragmentation of the CUSTOMER Table
Vertically Fragmented Table Contents
Figure 10.18
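Vertical fragmentation can be sketched the same way: each fragment keeps a different column subset, every fragment repeats the key (CUS_NUM here), and a join on the key rebuilds the original row. Table and column names are illustrative, not the textbook's exact figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (CUS_NUM INTEGER PRIMARY KEY, CUS_NAME TEXT, CUS_BALANCE REAL)")
conn.execute("INSERT INTO CUSTOMER VALUES (10, 'Sinex, Inc.', 1255.50)")

# Fragment 1: key + name column (the equivalent of a PROJECT operation).
conn.execute("CREATE TABLE CUST_V1 AS SELECT CUS_NUM, CUS_NAME FROM CUSTOMER")
# Fragment 2: key + balance column.
conn.execute("CREATE TABLE CUST_V2 AS SELECT CUS_NUM, CUS_BALANCE FROM CUSTOMER")

# Rebuilding a full row requires joining the fragments on the shared key.
row = conn.execute(
    "SELECT v1.CUS_NUM, v1.CUS_NAME, v2.CUS_BALANCE "
    "FROM CUST_V1 v1 JOIN CUST_V2 v2 ON v1.CUS_NUM = v2.CUS_NUM"
).fetchone()
```

The repeated key column is the storage price paid for being able to recombine the fragments.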
Data Fragmentation
Mixed Fragmentation
Combination of horizontal and vertical strategies. A table may be divided into several horizontal subsets (rows), each one having a subset of the attributes (columns).
Table 10.5 Mixed Fragmentation of the CUSTOMER Table
Figure 10.19
Data Replication
Data replication refers to the storage of data copies at multiple sites served by a computer network.
Fragment copies can be stored at several sites to serve specific information requirements.
The existence of fragment copies can enhance data availability and response time, reducing communication and total query costs.
Figure 10.20
Data Replication
Mutual Consistency Rule
Replicated data are subject to the mutual consistency rule, which requires that all copies of data fragments be identical.
The DDBMS must ensure that a database update is performed at all sites where replicas exist.
Data replication imposes additional DDBMS processing overhead.
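The overhead the mutual consistency rule imposes can be seen in a tiny sketch: a replicated update is not one write but one write per site holding a copy. The dict-based "sites" are an invented stand-in for real DP nodes.

```python
# Three sites, each holding a replica of the same fragment value.
sites = {
    "NY":  {"CUS_1010": 100},
    "ATL": {"CUS_1010": 100},
    "MIA": {"CUS_1010": 100},
}

def replicated_update(key, value):
    # The DDBMS must perform the update at EVERY site holding a replica.
    for db in sites.values():
        db[key] = value

def mutually_consistent(key):
    # The rule holds when all copies of the fragment are identical.
    return len({db[key] for db in sites.values()}) == 1

replicated_update("CUS_1010", 250)
```

Updating only one site would break `mutually_consistent`, which is exactly the processing overhead (and the failure mode) the rule guards against.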
Data Replication
Replication Conditions
A fully replicated database stores multiple copies of all database fragments at multiple sites.
A partially replicated database stores multiple copies of some database fragments at multiple sites.
Factors in the Data Replication Decision:
Database size
Usage frequency
Data Allocation
Data allocation describes the process of deciding where to locate data.
Data Allocation Strategies
Centralized: the entire database is stored at one site.
Partitioned: the database is divided into several disjoint parts (fragments) and stored at several sites.
Replicated: copies of one or more database fragments are stored at several sites.
Data Allocation
Data allocation algorithms take into consideration a variety of factors:
Performance and data availability goals.
Size, number of rows, and the number of relationships that an entity maintains with other entities.
Types of transactions to be applied to the database, and the attributes accessed by each of those transactions.
Client/Server vs. DDBMS
Client/server architecture refers to the way in which computers interact to form a system.
It features a user of resources (the client) and a provider of resources (the server).
The architecture can be used to implement a DBMS in which the client is the transaction processor (TP) and the server is the data processor (DP).
Client/Server Architecture
Client/Server Advantages
Client/server solutions tend to be less expensive.
Client/server solutions allow the end user to use the microcomputer’s graphical user interface (GUI), thereby improving functionality and simplicity.
There are more people with PC skills than with mainframe skills.
The PC is well established in the workplace.
Numerous data analysis and query tools exist to facilitate interaction with many of the DBMSs.
There are considerable cost advantages to off-loading application development from the mainframe to PCs.
Client/Server Architecture
Client/Server Disadvantages
The client/server architecture creates a more complex environment with different platforms to manage.
An increase in the number of users and processing sites often paves the way for security problems.
The burden of training a wider circle of users and computer personnel increases the cost of maintaining the environment.
C. J. Date’s 12 Commandments for Distributed Databases
1. Local Site Independence: each local site can act as an independent, autonomous, centralized DBMS.
2. Central Site Independence: no site in the network relies on a central site or any other site.
3. Failure Independence: the system is not affected by node failures.
4. Location Transparency
5. Fragmentation Transparency
6. Replication Transparency
7. Distributed Query Processing
8. Distributed Transaction Processing
9. Hardware Independence
10. Operating System Independence
11. Network Independence
12. Database Independence