Distributed Database Management Systems by xsz60651

VIEWS: 35 PAGES: 14

More Info
									DDBMS - Lecture 3                 Overview of Distributed Database Management Systems                                                      i


                                                     Table of Contents
Evolution of DDBMS................................................................................................................. 1
     DDBMS Advantages .......................................................................................................... 1
     DDBMS Disadvantages ..................................................................................................... 1
Distributed Database vs. Distributed Processing ....................................................................... 1
Functions of DDBMS................................................................................................................. 2
          Centralized Database .................................................................................................. 2
DDBMS Components ................................................................................................................ 3
Levels of Data and Process Distribution .................................................................................... 3
     Single-Site Processing, Single-Site Data (SPSD) .............................................................. 3
     Multiple-Site Processing, Single-Site Data (MPSD) ......................................................... 4
     Multiple-Site Processing, Multiple-Site Data (MPMD) .................................................... 4
Distributed DB Transparency Features ...................................................................................... 4
Distribution Transparency .......................................................................................................... 5
Transaction Transparency........................................................................................................... 6
    Management mechanisms .................................................................................................. 6
    Distributed Concurrency Control ....................................................................................... 7
    Two-Phase Commit Protocol ............................................................................................. 7
          Phases ......................................................................................................................... 8
Performance Transparency and Query Optimization ................................................................. 8
Data Fragmentation .................................................................................................................... 9
     Horizontal Fragmentation .................................................................................................. 9
     Vertical Fragmentation ..................................................................................................... 10
     Mixed Fragmentation ....................................................................................................... 10
Data Replication ....................................................................................................................... 10
Data Allocation ......................................................................................................................... 11
Client/Server vs. DDBMS ........................................................................................................ 12
Date’s 12 Commandments for Distributed Databases.............................................................. 12
DDBMS - Lecture 3     Overview of Distributed Database Management Systems                 1


Evolution of DDBMS
 Decentralized database management systems (DDBMS)
   Interconnected computer systems
   Data/processing functions reside on multiple sites
 1970’s: Centralized DBMS
 1980’s: Social and Technical Changes
   Ad hoc capability required
   Decentralized management structure common
 1990’s: New forces
   Internet and the World Wide Web used for data access and distribution
   Data analysis through data mining and data warehousing
 The decentralized database is especially desirable because centralized database
  management is subject to problem such as:
   Performance degradation – remote locations over greater distances.
   High costs – maintaining and operating large central (mainframe) database systems.
   Reliability problems – by dependence on a central site.

DDBMS Advantages
 Data located near site with greatest demand
 Faster data access
 Faster data processing
 Growth facilitation
 Improved communications
 Reduced operating costs
 User-friendly interface
 Less danger of single-point failure
 Processor independence

DDBMS Disadvantages
 Complexity of management and control
 Security
 Lack of standards
 Increased storage requirements
 Greater difficulty in managing data environment
 Increased training costs

Distributed Database vs. Distributed Processing
 Distributed processing (Figure 1) – Stores logically related database over physically
  independent sites
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                 2


   Does not require distributed database but shares the processing chores among several
     sites.
   May be based on a single database on single computer
   Copies or parts of database processing functions must be distributed to all data storage
     sites
 Distributed database (Figure 2) – Stores logically related database over two or more
  physically independent sites
   Requires distributed processing
   A database is a composed of several parts known as database fragments.
 Both
   Require a network to connect components

Functions of DDBMS
   Application/end user interface
   Validation to analyze data requests
   Transformation to determine request components
   Query optimization to find the best access strategy
   Mapping to determine the data location
   I/O interface to read or write data
   Formatting to prepare the data for presentation
   Security to provide data privacy
   Backup and recovery
   DB Administration
   Concurrency Control
   Transaction Management

Centralized Database
 Refer Figure 3: all activities illustrated below are transparent to the end user.
    Receives an application’s (or an end user’s) request.
    Validates, analyzes, and decomposes the request.
    Maps the request’s logical-to-physical data components.
    Decomposes the request into several disk I/O operations.
    Searches for, locates, reads, and validates the data.
    Ensures database consistency, security, and integrity.
    Validates the data for the conditions, if any, specified by the request.
   Presents the selected data in the required format.
 A fully Distributed Database Management System (See Figure 4) must perform all the
  functions of a centralized DBMS.
   It must handle all necessary functions imposed by the distribution of data and
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                   3


      processing.
     It must perform these additional functions transparently to the end users.
     Users see only one logical database and do not need to know the names of the
      fragments; i.e., they need not even know that the database is divided into separate
      fragment, nor do they need to know where the fragments are located.

DDBMS Components
 The DDBMS must include (at least) the following components:
   Computer workstations
     Network hardware and software components
     Communications media
     Transaction processor (TP) – receives and processes the application’s data requests.
       Also called application manager (AP) or transaction manager (TM)
     Data processor (DP) – stores and retrieves data located at the site.
       Also called data manager (DM)
       May even be a centralized DBMS.
     The Distributed Database Components are illustrated in Figure 5.
     Note: each TP can access data on any DP, and each DP handles all requests for
        local data from any TP.
 DDBMS Protocols
   Interface with network to transport data and commands between DPs and TPs
   Synchronize data received from DPs and route to appropriate TPs
   Ensure common database functions
     Security
     Concurrency control
     Backup and recovery

Levels of Data and Process Distribution
-   Database systems can be classified based on process distribution and data distribution




Single-Site Processing, Single-Site Data (SPSD)
 Refer Figure 6
 All processing on single CPU or host computer
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                  4


   All data are stored on host computer disk
   DBMS located on the host computer
   DBMS accessed by dumb terminals
   Typical of mainframe and minicomputer DBMSs
   Typical of 1st generation of single-user microcomputer database

Multiple-Site Processing, Single-Site Data (MPSD)
 Refer Figure 7
 Requires network file server
 Applications accessed through LAN
 Variation known as client/server architecture

Multiple-Site Processing, Multiple-Site Data (MPMD)
 Fully distributed DDBMS with support for multiple DPs and TPs at multiple sites
 Homogeneous
     Integrate one type of centralized DBMS over the network
     The same DBMS will be running on different mainframes, minicomputers, and
      microcomputers.
 Heterogeneous (Figure 8)
   Integrate different types of centralized DBMSs over a network
   The fully heterogeneous DDBMS will support different DBMSs that may even
       support different data models (relational, hierarchical, or network) running under
       different computer systems listed above.
 Some DDBMS implementations support several platforms, operating systems, and
  networks, and allow remote data access to another DBMS. However, such DDBMSs
  still are subject to certain restrictions:
   Remote access is provided on a read-only basis and does not support write privileges.
     Restrictions are placed on the number of remote tables that may be accessed in a
      single transaction.
     Restrictions are placed on the number of distinct databases that may be accessed.
     Restrictions are placed on the database model that may be accessed with relational
      databases but not network or hierarchical databases.

Distributed DB Transparency Features
 Functional characteristics and properties:
   Be grouped and described as transparency features.
   Allows end users to feel like only database user.
   Hides complexities of distributed database.
 Transparency features
DDBMS - Lecture 3       Overview of Distributed Database Management Systems                   5


     Distribution Transparency – allows a distributed database to be treated as a single
      logical database, the user doesn’t need to know
       The data are partitioned.
       The data can be replicated at several sites.
       The data location.
     Transaction Transparency – allows a transaction to update data at several network
      sites.
     Failure Transparency – ensures that the system will continue to operate in the event of
      a node failure.
     Performance Transparency – allows the system to perform as if it were a centralized
      DBMS.
     Heterogeneity Transparency – allows the integration of several different local DBMSs
      (relational, network, and hierarchical) under a common, or global, schema.

Distribution Transparency
-   Allows management of a physically dispersed database as though it were centralized
-   Three Levels:
     Fragmentation transparency – the highest level of transparency; the end user or
      programmer does not need to know that a database is partitioned. Therefore, neither
      fragment names nor fragment locations are specified prior to data access.
     Location transparency – when the end user or programmer must specify the database
      fragment names but does not need to specify where these fragments are located.
     Local mapping transparency – when the end user or programmer must specify both
      the fragment names and their locations.




 The use of various transparency levels – refers Figure 9, in which, assume
  1. The EMPLOYEE table is fragmented and each fragment is unique.
  2. The end user wants to list all the employees with a date of birth prior to January 1, 1940.
    3. No partition of the database is replicated at any other site on the network.
     Case 1: The Database Supports Fragmentation Transparency –
        It does not specify fragment names or locations.
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                       6


       SELECT * FROM EMPLOYEE
       WHERE EMP_DOB < ’01-JAN-1940’;
     Case 2: The Database Supports Location Transparency –
      Fragment names must be specified in the query, but fragment location is not specified.
      SELECT FROM E1
      WHERE EMP_DOB < ’01-JAN-1940’;
      UNION
      SELECT * FROM E2
      WHERE EMP_DOB < ’01-JAN-1940’;
     Case 3: The Database Supports Local Mapping Transparency –
      Both the fragment name and location must be specified in the query.
      SELECT * FROM E1 NODE NY
      WHERE EMP_DOB < ’01-JAN-1940’;
      UNION
      SELECT * FROM E2 NODE ATL
      WHERE EMP_DOB < ’01-JAN-1940’;
      UNION
      SELECT * FROM E3 NODE MIA
       WHERE EMP_DOB < ’01-JAN-1940’;
 Distribution transparency is supported by a distributed data dictionary (DDD), or a
  distributed data catalog (DDC).
   DDC contains the description of the entire database as seen by the database
      administrator.
     The database description, known as the distributed global schema, is the common
      database schema used by local TPs to translate user requests into subqueries (remote
      requests) that will be processed by different DPs.
     Some of the current DDBMS implementations impose limitations on the level of
      transparency support; e.g., you might be able to distribute a database, but not a table,
      across multiple sites. Such a condition indicates that the DDBMS supports location
      transparency but not fragmentation transparency.

Transaction Transparency
-   Ensures transactions maintain integrity and consistency
-   Completed only if all involved database sites complete their part of the transaction

Management mechanisms
 Remote request (refer Figure 10) – access data to be processed by a single remote
  database processor; i.e., the SQL statement (or request) can reference data at only one
  remote site.
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                     7


      The transaction updates the CUSTOMER and INVOICE tables.
      Both tables are located at Site B.
      The transaction can reference only one remote DP.
      Each SQL statement (or request) can reference only one (the same) remote DP at a
       time, and the entire transaction can reference and is executed at only one remote DP.
 Remote transaction (refer Figure 11) – composed of several request, may access data at
  only a single site.
 Distributed transaction (refer Figure 12) – allows a transaction to reference several
  different (local or remote) DP sites.
    The transaction references two remote sites (B and C).
    The first request (SELECT statement) is processed by the DP at the remote Site B, and
     the next requests (UPDATE and INSERT) are processed by the DP at the remote Site
     C.
    Each request can access only one remote site at a tome.
 Distributed request (refer Figure 13 & 14) – reference data from several remote DP sites.
   Partition a database table into several fragments.
   Reference one or more of those fragments with only one request; i.e., fragmentation
     transparency is performed.

Distributed Concurrency Control
 Multi-site, multiple-process operations more likely to create data inconsistencies and
   deadlocked transactions
 Problems (Figure 15)
    Transaction committed by local DP
    One DP could not commit transaction’s result
    Yields inconsistent database
    Solution: two-phase commit protocol!

Two-Phase Commit Protocol
 If a portion of a transaction operation cannot be committed, all changes made at the other
  sites participating in the transaction will be undone to maintain a consistent database state.
 The protocol requires that each individual DP’s transaction log entry, which is maintained
  by unique DP, be written before the dataset fragment is actually updated.
 DO-UNDO-REDO protocol and write-ahead protocol are necessary.
   DO-UNDO-REDO protocol – rolls back and/or rolls forward transactions with the
       help of the system’s transaction log entries. Three types below:
        DO – performs the operation and records the “before” and “after” values in the
           transaction log.
        UNDO – reverses an operation, using the log entries written by the DO portion of
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                      8


         the sequence.
      REDO – redoes an operation, using the log entries written by the DO portion of
         the sequence.
   Write-ahead protocol – forces the log entry to be written to permanent storage before
     he actual operation takes place.
 The protocol defines the operations between two kinds of nodes
   Coordinator
   Subordinates (or cohorts)

Phases
 Phase 1: Preparation
    Coordinator sends message to all subordinates
    Subordinates receive the message, write the transaction log using the write-ahead
       protocol, and send an acknowledgement (YES/PREPARED TO COMMIT or NO/NOT
       PREPARED) message to the coordinator.
   Confirms all are ready to commit or abort
 Phase 2: Final Commit – Ensures all subordinates have committed or aborted
   The coordinator broadcasts a COMMIT message to all subordinates and waits for the
     replies.
    Each subordinate receives the COMMIT message, then updates the database using the
     DO protocol.
    The subordinates reply with a COMMITTED or NOT COMMITTED message to the
       coordinator.

Performance Transparency and Query Optimization
 Objective: Minimize total cost associated with execution since the data may be replicated
  in several different sites and a database may be partitioned into several fragments.
 Main costs
   Access time – involved in accessing the physical data stored on disk.
   Communication – associated with the transmission of data among nodes in DDBS.
   CPU time – associated with the processing overhead of managing distributed
      transactions.
 Basis for query optimization algorithms
   Classified by communication or processing costs.
   To evaluate query optimization, the TP must receive data from the DP, synchronize it,
     assemble the answer, and present it to the end user or an application.
 Replica Transparency – as important as distribution transparency, it hides the existence of
  multiple copies of data from the user.
   The selection of the optimum execution order
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                     9


   The selection of sites accessed to minimize communication costs
 Evaluated on the basis of operation mode of classification
   Automatic query optimization – the DDBMS finds the most cost-effective access path
     without user intervention. (More desirable from the end user’s view, but high cost!)
   Manual query optimization – the optimization be selected and scheduled by the end
     user or programmer.
 Timing of classification according to when the optimization is down.
     Static query optimization – takes place at compilation time; i.e., the best
      optimization strategy is selected when the query is compiled by DBMS. (e.g., SQL
       statements are embedded in procedural programming language, such as COBOL or
       Pascal.)
     Dynamic query optimization – takes place at execution time; i.e., database access
      strategy is determined by the DBMS at the run time, using the most up-to-date
      information about the database. (Efficient but expensive!)
 Classified according to the type of information that is used to optimize the query:
     Statistically based optimization algorithms – provide information about such
      database characteristics as size, number of records, average access time, number of
      request serviced, number of users with access rights, and so on.
        Dynamic statistical generation mode – the DBMS automatically evaluates and
         updates the statistics after each access.
        Manual statistical generation mode – the statistics must be updated periodically
         through a user-selected utility, such as IBM’s RUNSTAT command used by OS/2
         Database Manager
     Rule-based query optimization algorithms – determine the best query strategy
      based on a set of user-defined rules. The rules are entered by the end user or
      database administrator, and they typically are very general in nature.

Data Fragmentation
-   Partition database into fragments, the data fragmentation allows us to break a single object
    into two or more segments or fragments.
-   Data fragmentation information is stored in the distributed data catalog (DDC), from
    which it is accessed by the TP to process user requests.
-   Three types of data fragmentation strategies: horizontal, vertical and mixed fragmentation.
-   Figure 16 is selected as a sample CUSTOMER table.

Horizontal Fragmentation
 Refers to the division of a relation into subsets (fragments) of tuples (rows).
 Each fragment is stored at a different node, and each fragment has unique rows while all
   unique rows the same attributes (columns).
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                  10


 It is equivalent to SELECT statement with WHERE clause on a single attribute.
 Refer the sample table below and Figure 17.




Vertical Fragmentation
 Refers to the division of a relation into attribute (column) subsets.
 Each subset (fragment) is stored at a different node, and each fragment has unique
   columns – with the exception of the key column, which is common to all fragments.
 It is equivalent to PROJECT statement.
 Refer the sample table below and Figure 18.




Mixed Fragmentation
 Refers to a combination of horizontal and vertical strategies.
 A table may be divided into several horizontal subsets (rows), each one having a subset of
   the attributes (columns).
 Refer the sample table below and Figure 19.




Data Replication
 Data replication refers to the storage of data copies at multiple sites served by a computer
DDBMS - Lecture 3     Overview of Distributed Database Management Systems                       11


  network where the fragment copies can be stored at several sites to serve specific
  information requirements.
 Mutual consistency rule – all copies of data fragments are identical; DDBMS must ensure
  that a database update is performed at all sites where replicas exist.
   If the database is fragmented, the DBMS must decompose a query into subqueries to
      access the appropriate fragments.
   If the database is replicated, the DDBMS must decide which copy to access.
       A READ operation selects the nearest copy.
       A WRITE operation requires all copies be selected and updated.
   The TP sends a data request to each selected DP for execution.
   The DP receives and executes each request and sends the data back to the TP.
   The TP assembles the DP responses.
 Replication scenarios:
   Fully replicated databases – store multiple copies of each database fragment at
     multiple sites; i.e., all database fragments are replicated.
    Partially replicated databases – store multiple copies of some database fragments at
     multiple sites. (available for most DDBMS)
    Unreplicated replicated databases – store each database fragment at a single site; i.e.,
     no duplicate database fragment.
 Factors influence the decision to use data replication:
   Database size
   Usage frequency
   Costs – performance, software overhead, and management – associated with
     synchronizing transactions and their components vs. fault-tolerance benefits that are
     associated with replicated data.

Data Allocation
 Data allocation describes the process of deciding where to locate data.
 The strategies of data allocation follows:
   Centralized data allocation – the entire database is stored at one site.
   Partitioned data allocation – the database is divided into several disjoint parts
     (fragments) and stored at several sites.
   Replicated data allocation – copied of one or more database fragments are stored at
     several sites.
 Study issue: which data to locate where. The algorithm considers:
    Performance and data availability goals.
    Size, number of rows, and the number of relations that an entity maintains with other
     entities.
DDBMS - Lecture 3      Overview of Distributed Database Management Systems                     12


    Types of transactions to be applied to the database, the attributes accessed by each of
     those transactions, and so on.

Client/Server vs. DDBMS
 Client/Server Advantages Over DDBMS
   Client/server less expensive
   Client/server solutions allow use of microcomputer’s GUI
   More people with PC skills than mainframe skills
   PC is well established in workplace
   Numerous data analysis and query tools exist
   Considerable cost advantages to off-loading application development
 Client/Server Disadvantages
   Creates more complex environment with different platforms
   Increased number of users and sites creates security problems
   Training issues become more complex and expensive

Date’s 12 Commandments for Distributed Databases
 No discussion of distributed databases is complete unless it includes C.J. Date’s
  distributed database commandments with twelve rules below:
  1. Local Site Independence – Each local site can act as an independent, autonomous,
      centralized DBMS with responsibility on security, concurrency control, backup, and
      recovery.
  2. Central Site Independence – No site in the network relies on a central site or any other
      site, but all have the same capability.
  3. Failure Independence – The system is not affected by node failures.
  4. Location Transparency – The user does not need to know the location of the data in
        order to retrieve those data.
   5.   Fragmentation Transparency – The user sees only one single logical database because
        data fragmentation is transparent to the user.
   6.   Replication Transparency – The user sees only one single logical database because the
        DDBMS transparently selects the database fragment to access.
   7.   Distributed Query Processing – A distributed query may be executed at several
        different DP sites with transparent performance of query optimization.
   8.   Distributed Transaction Processing – A transaction may update data at several different
        sites.
   9. Hardware Independence – The system must run on any hardware platform.
   10. Operating System Independence – The system must run on any operating system
       software platform.
DDBMS - Lecture 3    Overview of Distributed Database Management Systems                13


   11. Network Independence – The system must run on any network platform.
   12. Database Independence – The system must support any vendor’s database product.

								
To top