DISTRIBUTED DATABASE DESIGN - Amazon S3 by fjzhangweiqun


									                          DISTRIBUTED DATABASE DESIGN

       1.0 Objectives
       5.1 Introduction
       5.2 A Framework for Distributed Database Design
              5.2.1 Objectives of the Design of Data Distribution
              5.2.2 Top – Down and Bottom – Up Approach –A classical Design
       5.3 The Design of Database Fragmentation
              5.3.1 Horizontal Fragmentation
           Primary Fragmentation
           Derived Horizontal Fragmentation
              5.3.2 Vertical Fragmentation
              5.3.3 Mixed Fragmentation
       5.4 The Allocation of Fragments
       5.5 Summary
5.0 Objectives: In this unit you will come to know the different design aspects of
distributed databases. At the end of this unit you will be able to describe the topics
          A framework for distributed database design
               The objectives of design of data distribution
               Top – Down and Bottom – Up design approaches
          The design of database fragmentation
               Horizontal Fragmentation
               Vertical Fragmentation
               Mixed Fragmentation
          The allocation of fragments
               General Criteria for Fragment allocation
5.1 Introduction: The concept of data distribution itself is difficult to design and
implement because of various technical and organizational issues. So we need to have
an efficient design methodology. From the technical aspect, the interconnection of
sites and appropriate distribution of the data and applications to the sites depending
upon the requirement of applications and for optimizing performances. From the
organizational point, the issue of decentralization is crucial and distributing an
application has a greater effect on the organization. In recent years, lot of research
work has taken place in this area and the major outcome of this are:
   o Design criteria for effective data distribution
   o Mathematical background of the design aids
In the section 5.2 you will learn a framework of the design including the design
approaches like Top – Down and Bottom – Up. The section 5.3 explains about the
design of Horizontal and Vertical Fragmentation. In the section 5.4 we will give
principles and concepts in Fragment allocation.
5.2 A Framework for Distributed Database Design: The design of a centralized
database concentrates on:
    Designing the conceptual schema that describes the complete database
    Designing the Physical database, which maps the conceptual schema to the
       storage areas and determines the appropriate access methods.
The above two steps contributes in distributed database towards the design of Global
schema and the design of local databases. The added steps are:
    Designing the Fragmentation: - The actual procedure of dividing the existing
       global relations into horizontal, vertical or mixed fragments
    Designing the allocation of fragments: -Allocation of different fragments
       according to the site requirements
Before designing the Distributed database a thorough knowledge about the application
is a must. In this case we expect the following things from the designer.
              Site of Origin: The site from which the application is issued.
              The frequency of invoking the request at each site
              The number, type and the statistical distribution of accesses made by
               each application to each required data.
In the coming section let us try to know the actual need of design of data distribution.
5.2.1 Objectives of the Design of Data Distribution: In the design of data
distribution the following objectives should be considered.
      Processing locality: Reducing the remote references in turn maximizing the
       local references is the primary aim of the data distribution. This can be
       achieved by having redundant fragment allocation meeting the site
       requirements. Complete locality is an extended idea, which simplifies the
       execution of application.
      Availability and reliability of distributed data: Availability is achieved by
       having multiple copies of the data for read only applications. Reliability is
       achieved by     storing the multiple copies of the information, as it will be
       helpful in case of system crashes.
      Workload distribution: workload distribution is the major goal to have high
       degree of parallelism.
      Storage costs and Processing locality: Cost criteria and Availability of
       storage areas should be intelligently handled for effective data distribution.
Using the all above criteria may increase the design complexity. So important aspects
are taken as objectives depending upon the requirement and others are treated as
constraints. In the next section let us design a simple approach for maximizing the
processing locality.
5.2.2 Top – Down and Bottom – Up Approach –A classical Design
There are two classical approaches as far as distributed databases design is concerned.
They are:
   1. Top – Down Approach: This may be quite useful when the system has to be
       designed from the scratch. Here we follow the following steps:
             Design of Global Schema.
             Design of Fragmentation Schema.
             Design of Allocation Schema.
             Design of Local Schema (Design of “Physical Databases”).
   2. Bottom - Up Approach: This can be used for an existing system. This
       approach is based on the integration of existing schemata into a single, global
       schema. But requires that the following aspects have to be fulfilled.
             The selection of a common database model for describing the Global
               schema of the database.
             The translation of each local schema into the common data model.
             The Integration of common schemata into a common Global schema.
               i.e the merging of common data definitions and the resolution of
               conflicts among different representations given to the same data.
       The Bottom – Up design require solving these three problems. Then of course
       the design steps are just reverse of the previous method.
5.3 The Design of Database Fragmentation: Here we discuss the design of non-
overlapping fragments, which are the logical units of allocation. That is, it is
important to have an efficient design methodology so that we can overcome the
related problems of allocation. In the following, we explain the design of Horizontal,
Vertical and Mixed Fragmentations.
5.3.1 Horizontal Fragmentation: Here we discuss two important methods called
Primary and Derived. Determining the horizontal fragmentation involves knowing:
    The logical properties of the data such as fragmentation predicates.
    The statistical properties of the data such as the number of references of
       applications to the fragments. Primary Fragmentation: The correctness of Primary fragmentation requires
that each global relation be selected in one and only one fragment. Thus, determining
the primary horizontal fragmentation of a global relation requires determining a set of
disjoint and complete selection predicates (we shall define this later in this section).
The property we expect from each fragment is that the elements of them must be
referenced homogeneously by all applications.
                    Let G be the global relation for which we want to produce a
       horizontal       primary fragmentation. Let us define some terminologies.
              A Simple Predicate: is a predicate of the type:
                                           Attribute = value
              A Min-term Predicate y for a set P of simple predicates is the
               conjunction of all predicates appearing in P, either taken in natural
               form or negated. Thus:
                                        y =  pi *
                                           pi  p
                         Where (pi* = pi or pi * =NOT pi) and y  false
              A     fragment is the set of all tuples for which a min-term predicate
              A simple predicate is relevant respect to a set P of simple predicates if
               there exists at least two min-term predicates of P whose expression
               differs only in the predicate pi itself such that the corresponding
               fragments are referenced in a different way by at least one application.
Let us try to understand the above terminologies by taking an example. Let us
consider the relations DEPT (DEPTNUM, NAME, AREA) and JOB(JOBID,JOB
NAME). Let us assume that only two departments are functioning i.e 1 & 2.Now
some examples for simple predicates are:
                  DEPTNUM =1 or DEPTNUM  2, DEPTNUM = 2 or DEPTNUM  1
                  JOB = “programmer” or JOB  “programmer”.
The corresponding min-term predicates are
                  DEPTNUM =1 AND JOB = “programmer”
                  DEPTNUM =1 AND JOB  “programmer”
                  DEPTNUM 1 AND JOB  “programmer”
                  DEPTNUM 1 AND JOB = “programmer”
Now let us concentrate on some more supporting terminologies. Let P = {p1,p2,….p
n}be   a set of simple predicates. For correct and efficient fragmentation P must be
complete and minimal.
                   We say that a set P of predicates is complete if and only if any two
                    tuples belonging to the same fragment are referenced with the same
                    probabilities by any application.
                   The set P is minimal if all its predicates are relevant.
Example: In the above example, P1 ={DEPTNUM = 1} is not complete since the
application is even interested in the employees who are “programmers”. So in this
case P2 = {DEPTNUM =1,JOB = “programmer”} is complete and minimal. The set
P3 = {DEPTNUM =1, JOB = “programmer”, SAL > 50} is complete but not minimal
since SAL >50 is not relevant.
        By knowing the minimum characteristics that are to be considered now let us
generalize the method to be followed while producing fragments of the given global
            I. Consider a predicate pi that partitions the tuples of the global relation G
                  into two parts, which are referenced differently at least by one
                  application. Let P = p1.
            II. Consider a new simple predicate pi which partitions at least one
                  fragment of P into two parts, which are referenced in a different way by
                  at least one application. Eliminate non-relevant predicates from P.
                  Repeat this step until the set of min-term fragments of P is complete.
Example: Let us take two cities of Karnataka: Shimoga and Mysore. The example
application considered is the marketing of medical goods. The global schema for this
application includes the relations EMPL, DEPT, SUPPLIER and SUPPLY. These
relations look as follows:
We design the fragmentation of SUPPLIER and DEPT with a Primary Fragmentation.
Now let us take a query. Find the names of suppliers with a given number SNUM.
As you have already come across a popular query language SQL can be used for
representing this query.
                               Select NAME
                               From SUPPLIER
                               Where SNUM = $Y

       This query issued at any one of the sites. Let us assume that we have three
sites in our purview. Site 1 is in Shimoga, Site 2 is in Mysore and Site 3 is in between
Shimoga and Mysore. So, if the query is issued at Site 1 it references SUPPLIERS
whose CITY is “Shimoga” with almost 90% probability; if it is issued at Site 2 it
references SUPPLIERS of “Shimoga” and “Mysore” with equal probability; if it is
issued at Site3 it references SUPPLIERS whose CITY is “Mysore” with almost 90%
probability. This is because the obvious fact that department around one city tends to
use suppliers, which are close to them.
We can write the predicates for the above application,
              P1: CITY = “SHIMOGA”
              P2: CITY = “MYSORE”
Since the set {P1, P2} is complete and minimal, the search is terminated.
       Let us now consider the global relation DEPT: DEPT (DEPTNUM, NAME,
AREA, MGRNUM). Some example predicates that are suitable for administrative
applications are considered.
       P1: DEPTNUM < = 10
       P2: (10 < DEPTNUM < = 20)
       P3: DEPTNUM > 20
       P4: AREA = “NORTH”
       P5: AREA = “SOUTH”
       If we assume that in the northern area the departments with DEPTNUM > 20
will never be there, then AREA = “NORTH” implies that DEPTNUM > 20 is false.
Thus the fragments are reduced to the following four:
       Y1: DEPTNUM < = 10
       Y2: (10 < DEPTNUM < = 20) AND (AREA = “NORTH” )
       Y3: (10 < DEPTNUM < = 20) AND (AREA = “SOUTH” )
       Y4: DEPTNUM > 20
       If we now concentrate about the fragment allocation we can easily allocate
fragments corresponding to y1 and y4 at sites 1 and 3.But depending upon the
requirement fragments y2 and y3 will be allocated to either sites 1 or 3. Derived Horizontal Fragmentation: This is not based on the properties of its
own attributes, but it is derived from the horizontal fragmentation of another relation.
It is used to make the join between the fragments. A distributed join is a join
between horizontally fragmented relations. That is when you want to join the two
relations G and H you have to compare their fragments. Join Graphs can efficiently
represent it. The fig 5.1 represents the different possible join graphs.
           o Total: The join graph is total when it contains all possible edges
               between fragments of G and H.
           o Reduced: The join graph is reduced when some of the edges between
               G and H are missing. Here we have two types:
                    Partitioned: A reduced graph is partitioned if the graph is
                       composed of two or more sub graphs without edge between
                    Simple: A reduced graph is simple if it is partitioned and each
                       sub graph has just one edge.
Example: Consider the relation SUPPLY (SNUM, PNUM, DEPTNUM, QUAN). Let
us take the following case. Some application
           o Requires the information about supplies of given suppliers; thus they
               join between SUPPLY and SUPPLIER in the SNUM attribute.
           o Requires the information about supplies at a given department; then
               they perform join between SUPPLY and DEPT on the DEPTNUM
Let us assume that the relation DEPT is horizontally fragmented on the attribute
DEPTNUM and that SUPPLIER is horizontally fragmented on the attribute SNUM.
The derived horizontal fragmentation can be obtained for relation SUPPLY by either
performing a Semi - join operation with SUPPLIER on SNUM or with DEPT on
DEPTNUM; both of them are correct.
5.3.2 Vertical Fragmentation: This requires grouping the attributes into sets, which
are referenced in the similar manner by applications. This method has been discussed
by considering two separate types of problems:
             The Vertical Partitioning Problem: Here set must be disjoint. Of
              course one attribute must be common. For example assume that a
              relation S is vertically fragmented using this concept into S1 and
              S2.This can be useful where an application can be executed using
              either S1 or S2.Otherwise having the complete S at a particular site
              may be a unnecessary burden.
              Two possible design approaches:
                         1. The      split   approach:    The global relations are
                               progressively split into fragments
                         2. The       Grouping     approach:        The    attributes   are
                               progressively aggregated to constitute fragments.
              Both are Heuristic approaches as each iteration steps look for best
              choice. In both the cases formulas are used to indicate the best
              possible splitting or grouping.

 R1                                  R1                        S1         R1                  S1

 R2                                                                       R2                  S2

                                                                          R3                   S3
 R3                                  R3
                                                                S1                             S4
 R4                                  R4                                   R4

                        Figure 5.1 The different possible join graphs
                The Vertical Clustering Problem: Here sets can overlap. Here
                 depending upon the requirement you may have more than one common
                 attribute in the two different fragments of a global relation. It
                 introduces Replication within fragments, as some common attributes
                 are present in the fragments. It is suitable only for Read-Only
                 applications; because for applications, which involve frequent updating
                 of these common attributes needs to be referred to the sites where all
                 these attributes are present. Therefore, Vertical clustering is suggested
                 where overlapping attributes are not heavily updated.
Example: Consider the global relation          EMPL (EMPNUM, NAME, SAL, TAX,
MGRNUM, DEPTNUM). The following are made:
                Administrative    applications,   requires   NAME,      SAL,   TAX    of
                The department, requires NAME, MGRNUM and DEPTNUM
Here Vertical clustering is suggested as the attribute NAME is required in both the
fragments. So the fragments may be:
                  EMPL1 (EMPNUM, NAME, SAL, TAX)
                  EMPL2 (EMPNUM, NAME, MGRNUM, DEPTNUM)
5.3.3 Mixed Fragmentation: The simple way for performing this is:
                Apply Horizontal fragmentation to Vertical fragments
                Apply Vertical fragmentation to Horizontal fragments
Both these aspects are illustrated using the following diagrams 5.2 and 5.3.
         A1        A2       A3         A4          A5

Fig: 5.2 Vertical fragmentations followed by horizontal fragmentation.
        A1           A2           A3        A4          A5
Fig: 5.3 Vertical fragmentations followed by horizontal fragmentation

5.4 The Allocation of Fragments:
In this section we explain the different aspects to be considered when you go for
allocating a particular fragment to site. This section describes some general criteria
that can be used for allocating fragments. There are two types of allocation methods,
which can be followed. They are:
Non-redundant Allocation: It is simple. A method known as “Best-fit approach” can
be used; i.e a measure is associated with each possible allocation, and the site with the
bets measure is selected. It avoids placing a fragment at a given site where already a
fragment is present which is related to this fragment.
      Redundant Allocation: It is complex design, since:
           o The degree of replication is a variable of the problem.
           o The modeling of read applications is complicated as the applications
               may select any of the several alternatives.
       The following two methods can be used for determining the redundant
       allocation of fragments:
                      Determine the set of all sites where the benefit of allocating one
                       copy of the fragment is higher than the cost, and allocate a copy
                       of the fragment to each element of this site; this method selects
                       “all beneficial sites”.
                      Start from a non-replicated version. Then progressively
                       introduce replicated copies from the most beneficial; the
                       process is terminated when no additional replication is
Both the reliability and availability of the system increases if there are two or three
copies of the fragment, but further copies give a less than proportional increase.
5.5 Summary:
In this unit we have discussed the four phases of the design of Distributed databases:
Global schema, Fragmentation schema, Allocation schema and Local schema. Some
important aspects of design of fragmentation and allocation schemas are described in
detail. Also some of the practical examples are chosen for familiarizing the new

To top