Distributed Database - Department of Electrical Engineering and

Document Sample
Distributed Database - Department of Electrical Engineering and Powered By Docstoc
					Distributed Database Systems
           COP5711
What is a Distributed Database System ?
A distributed database is a collection of databases which are distributed
over different computers of a computer network.

  • Each site has autonomous processing capability and can perform local
    applications.

  • Each site also participates in the execution of at least one global
    application which requires accessing data at several sites.
  Multiprocessor Database Computers
                            Cannot run an
                             application
                              by itself
                                            Access
                                            Processor
         Application
                           Interface        Access
         (front-end)
                           Processor        Processor
         computer
                                            Access
                                            Processor



What we miss here is the existence of local
applications, in the sense that the integration of the
system has reached the point where no one of the
computers (i.e., IFPs & ACPs) is capable of executing
an application by itself.
          Why Distributed Databases ?
1.   Local Autonomy: permits setting and enforcing local policies regarding
     the use of local data (suitable for organization that are inherently
     decentralized).
2. Improved Performance: The regularly used data is proximate to the
     users and given the parallelism inherent in distributed systems.
3. Improved Reliability/Availability:
        Data replication can be used to obtain higher reliability and
         availability.
        The autonomous processing capability of the different sites
         ensures a graceful degradation property.

4. Incremental Growth: supports a smooth incremental growth with a
     minimum degree of impact on the already existing sites.
5. Shareability: allows preexisting sites to share data.
6. Reduced Communication Overhead: The fact that many applications
     are local clearly reduces the communication overhead with respect to
     centralized databases.
        Disadvantages of DDBSs
Cost: replication of effort (manpower).
Security: More difficult to control
Complexity:
   •   The possible duplication is mainly due to reliability and
       efficiency considerations. Data redundancy, however,
       complicates update operations.
   •   If some sites fail while an update is being executed, the
       system must make sure that the effects will be
       reflected on the data residing at the failing sites as
       soon as the system can recover from the failure.
   •   The synchronization of transactions on multiple sites is
       considerably harder than for a centralized system.
Distributed DBMS
   Architecture
     NetworkTransparancy
• The user should be protected from the
  operational details of the network.
• It is desirable to hide even the existence
  of the network, if possible.
   Location transparency: The command used is
    independent of the system on which the data is
    stored.
   Naming transparency: a unique name is
    provided for each object in the database.
    Replication & Fragmentation
           Transparancy
• The user is unaware of the replication of
  framents
• Queries are specified on the relations
  (rather than the fragments).                   Site A
                                  Copy 1 of R1

                                  Copy 1 of R2
  Relation R
                 Fragment R1                     Site B
                                  Copy 2 of R1
                 Fragment R2

                 Fragment R3

                 Fragment R4                     Site C
                                  Copy 2 of R2
    ANSI/SPARC Architecture

 External Schema      External           External        External
                      view               view            view


                                         Conceptual
 Conceptual Schema
                                         view


                                         Internal
 Internal Schema
                                         view


Internal view: deals with the physical definition and organization of data.
Conceptual view: abstract definition of the database. It is the “real
world” view of the enterprise being modeled in the database.
External view: individual user’s view of the database.
A Taxonomy of Distributed Data Systems

A distributed database
can be defined as                       Distributed data systems

 • a logically integrated
   collection of shared           Homogeneous
                                                      Heterogeneous
                                                      (Multidatabase)
   data which is
 • physically distributed
   across the nodes of a               Federated
                                                          Unfederated
                                                          (no local users)
   computer network.

                            Loosely coupled        Tightly coupled
                            (interoperable DB      (/w global schema)
                              systems using
                               export schema)
Architecture of a Homogeneous DDBMS

   Global user                   Global user
      view 1                       view n       A homogeneous
                 Global Schema
                                                DDBMS resembles
                                                a centralized DB,
                 Fragmentation
                    Schema                      but instead of
                  Allocation                    storing all the
                   Schema
                                                data at one site,
    Local                             Local     the data is
 conceptual                        conceptual
  schema 1                          schema n    distributed across
     Local                             Local    a number of sites
   internal                          internal
  schema 1                          schema n    in a network.

 Local DB 1                       Local DB n
 Fragmentation Schema & Allocation Schema

Fragmentation Schema: describes how the global
relations are divided into fragments.

Allocation Schema: specifies at which sites each
fragment is stored.

Example: Fragmentation of global relation R.

     A            B           To materialize R, the following
                              operations are required:
    C            D            R = (A B) U ( C D) U E

            E
       Homogeneous vs. Heterogeneous
                           Global                     • Homogeneous DDBMS
                           user
                                                         – No local users
 Local                                                   – Most systems do not have
 user                                  Local
                                       user
                                                           local schemas
                 Multidatabase
                 Management                           • Heterogeneous DDBMS
                    system
                                                         – There are both local and
                                                           global users
                                                         – Multidatabase systems are
                                                           split into:
DBMS         DBMS         DBMS           DBMS
                                                            • Tightly Coupled Systems:
                                                              have a global schema
Database 1   Database 2   Database 3     Database 4         • Loosely Coupled Systems:
                                                              do not have a global
                                                              schema.
       Schema Architecture of a Tightly-
               Coupled System
             Global user                        Global user       An individual node’s
             view 1                             view n            participation in the MDB
                                                                  is defined by means of a
                                                                  participation schema.
                           Global Conceptual Schema


Auxiliary        Local                                        Local           Auxiliary
Schema 1         Participation                                Participation   Schema 1
                 Schema 1                                     Schema 1

Local user
                 Local                                        Local
view 1                                                                        Local user
                 Conceptual                                   Conceptual
                 Schema 1                                     Schema 1        view 1

Local user       Local                                        Local
view 2           Internal                                     Internal        Local user
                 Schema 1                                     Schema 1        view 2


                 Local DB 1                                   Local DB 1
        Auxiliary Schema (1)
Auxiliary schema describes the rules which
govern the mappings between the local and
global levels.
 Rules for unit conversion: may be required when
  one site expresses distance in kilometers and
  another in miles, …
 Rules for handling null values: may be necessary
  where one site stores additional information which
  is not stored at another site.
   – Example: One site stores the name, home address and
     telephone number of its employees, whereas another just
     stores names and addresses.
           Auxiliary Schema (2)
 Rules for naming conflicts: naming conflicts occur when:
    semantically identical data items are named differently
      • DNAME  Department name (at Site 1)
      • DEPTNAME  Department name (at Site 2)
    semantically different data items are named identically.
      • NAME  Department name (at Site 1)
      • NAME  Manager name (at Site 2)

 Rules for handling data representation conflicts: Such
  conflicts occur when semantically identical data items
  are represented differently in different data source.
    Example: Data represented as a character string in one
     database may be represented as a real number in the other
     database.
       Auxiliary Schema (3)
Rules for handling data scaling conflicts: Such
 conflicts occur when semantically identical
 data items stored in different databases
 using different units of measure.
   Example: “Large”, “New”, “Good”, etc.



           These problems are called
           domain mismatch problems
              Loosely-Coupled Systems
         (Interoperable Database Systems)

                   Global        Global        Global
                 user view 1   user view 2   user view 3


   Local
user view 1         Local        Local         Local
                 Conceptual    Conceptual    Conceptual
   Local          schema 1     Schema 2      Schema n
user view 2
                    Local         Local         Local
                  internal      internal      internal
                 schema 1      Schema 2      Schema n


                 Local DB 1    Local DB 2     Local DB n
              Loosely-Coupled Systems
                   Global             Global            Global
                 user view 1        user view 2       user view m



                  Export        Export      Export     Export
                 schema 1      schema 2    Schema 3   Schema n
   Local
user view 1         Local             Local             Local
                 Conceptual         Conceptual        Conceptual
   Local          schema 1          Schema 2          Schema n
user view 2
                    Local              Local             Local
                  internal           internal          internal
                 schema 1           Schema 2          Schema n


                 Local DB 1         Local DB 2         Local DB n
Integration of Heterogeneous Data Models
• Provide bidirectional translators between all
  pairs of models
  – Advantage: support multiple models at the global level.
    No need to learn another data model and language
  – Disadvantage: requires n(n-1) translators, where n is
    the number of different models.
• Adopt a single model (called canonical model) at
  the global level and map all the local models onto
  this model
  – Advantage: requires only 2n translators
  – Disadvantage: translations must go through the global
    model.
           (The 2nd approach is more widely used)
Mini Project Suggestion 1 (1/4)

                     Mobile
                     object
                     within          Mobile
                     range           sensor




                                     Sensing
                                      range

                              Undetected
                                mobile
                                object
  Mini Project Suggestion 1 (2/4)
Group Mobility Model
   All mobile objects start in the same region
   Mobile objects belong to k different mobile groups
   Clients in each group have a group mobility
    pattern, in which
      the group leader follows the random way-point mobility
       model, and
      other group members move with their leader with
       random local movement
   Mini Project Suggestion 1 (3/4)
Two objectives
  1. Provide full sensing coverage for all mobile objects
  2. Prevent network partition among the mobile
     sensors
                               Network
                               partition
         Avoid
          this
  Mini Project Suggestion 1 (4/4)
Presentation:
  1. Explain your techniques
  2. Present plots to discuss your simulation results
  3. Demo - real-time visualization of the simulation

Documentation:
  Submit a report on the distributed algorithms,
  diagrams for illustration, and plots for simulation
  results.
Distributed Database Design

•Top-Down Approach: The database system is
 being designed from scratch.

  • Issues: fragmentation & allocation

•Bottom-up Approach: Integrating existing
 databases into one database

  • Issues: Design of the export and global
    schemas.
                    TOP-DOWN DESIGN PROCESS
                              Requirements Analysis
Entity analysis +
   functional                 System Requirements                         Defining the
    analysis                          (Objectives)                       interfaces for
                                                                            end users
                      Conceptual     View integration
                      design                             View Design


         Global                     Access                  External Schema
         conceptual                 information             Definitions
         schema

                                   Distribution Design


                              Local Conceptual Schemas            Fragmentation
                                                                   & allocation
   Maps the local
    conceptual
    schemas to                     Physical Design
  physical storage
      devices
                                   Physical Schema
           Design Consideration (1)

The organization of distributed systems can be
  investigated along three dimensions:

Level of sharing

   1. No sharing: Each application and its data
      execute at one site.
   2. Data sharing: Programs are replicated at all
      sites, but data files are not.
   3. Data + Program Sharing: Both data and
      programs may be shared.
        Design Consideration (2)

Access Pattern
  1. Static: Access patterns do not change.
  2. Dynamic: Access patterns change over
     time.
Level of Knowledge
  1. No information
  2. Partial information: Access patterns may
     deviate from the predictions.
  3. Complete information: Access patterns
     can reasonably be predicted.
            Fragmentation Alternatives
                 J   JNO       JNAME           BUDGET          LOC
                     J1        Instrumental    150,000        Montreal
                     J2        Database Dev.   135,000        New York
                     J3        CAD/CAM         250,000        New York
                     J4        Maintenance     350,000        Paris


     Horizontal Partitioning                                  Vertical Partitioning
J1    JNO     JNAME          BUDGET        LOC                  JNO      BUDGET
                                                                 J1      150,000
      J1     Instrumental     150,000    Montreal
                                                                 J2      135,000
      J2     Database Dev.    135,000    New York                J3      250,000
                                                                 J4      310,000

J2    JNO     JNAME          BUDGET        LOC
                                                         JNO        JNAME           LOC
                                                         J1     Instrumentation   Montreal
      J3      CAD/CAM        150,000     Montreal
                                                         J2     Database Devl     New York
      J4      Maintenance.   310,000     Paris
                                                         J3     CAD/CAM           New York
                                                         J4     Maintenance       Paris
        Why fragment at all?

Reasons:
 • Interquery concurrency
 • Intraquery concurrency

Disadvantages:
 • Vertical fragmentation may incur overhead.
 • Attributes participating in a dependency may be
   allocated to different sites.
      Integrity checking is more costly.
Degree of Fragmentation

• Application views are usually subsets of
  relations. Hence, it is only natural to
  consider subsets of relations as
  distribution units.

• The appropriate degree of fragmentation
  is dependent on the applications.
  Correctness Rules

• Vertical Partitioning
   • Lossless
     decomposition          Allocation Alternatives
   • Dependency
     preservation           •Partitioning: No replication
                            •Partial Replication: Some
• Horizontal Partitioning    fragments are replicated
   • Disjoint fragments
                            •Full Replication: Database
                             exists in its entirety at
                             each site
                Notations
S   Title SAL

          L1

E   ENO ENAME TITLE        J   JNO JNAME BUDGET LOC


                L2
                                        L3

           G     ENO JNO RESP DUR



           L1: 1-to-many relationship
           S: Owner(L1), Source relation
           E: Member(L1), Target relation
              Simple Predicates
Given a relation R(A1, A2, …, An) where Ai has domain Di, a
simple predicate pj defined on R has the form
   pj: Ai  Value
   where  {, , , , , } and Value  Di

  Example:

        J    JNO    JNAME            BUDGET    LOC
             J1     Instrumental    150,000   Montreal
             J2     Database Dev.   135,000   New York
             J3     CAD/CAM         250,000   New York
             J4     Maintenance     350,000   Orlando

  Simple predicates:
                          p1: JNAME = “Maintenance”
                          P2: BUDGET < 200,000

  Note: A simple predicate defines a data fragment
          MINTERM PREDICATE
Given a set of simple predicates for relation R.
    P = {p1, p2, …, pm}
The set of minterm predicates        TITLE        SAL
    M = {m1, m2, …, mn}              Elect. Eng.  40,000
is defined as                        Syst. Analy. 54,000

   M = {mi | mi = p j P
                     p *j      }                   Mech. Eng.       32,000
                                                    Programmer       42,000
   where      p*  p j or p*  p j
               j           j

  Possible simple predicates:               Some corresponding
  P1: TITLE=“Elect. Eng.”                   minterm predicates:
  P2: TITLE=“Syst. Analy”           m1 : TITLE  " Elect.Eng ."  SAL  30 ,000
  P3: TITLE=“Mech. Eng.”
                                    m 2 : TITLE  " Elect.Eng "  SAL  30 ,000
  P4: TITLE=“Programmer”
  P5: SAL ≤ 35,000                        A minterm predicate defines
  P6: SAL > 35,000                        a data fragment
Primary Horizontal Fragmentation

A primary horizontal fragmentation is defined by a selection
operation on the owner relations of a database schema.


E   ENO ENAME TITLE             J     JNO JNAME BUDGET LOC
                     L2
                                                 L3
              G    ENO JNO RESP DUR                   Owner(L3) = J


 A possible fragmentation of J is defined as follows:
                  J1   BUDGET200,000 ( J )
                  J 2   BUDGET200,000 ( J )
    Horizontal Fragments

Thus, a horizontal fragment Ri of relation R
consists of all the tuples of R that satisfy a
minterm predicate mi.

There are as many horizontal fragments
(also called minterm fragments) as there are
minterm predicates.
                  Completeness (1)
  A set of simple predicate Pr is said to be complete if and only
  if there is an equal probability of access by every application
  to any two tuples belonging to any minterm fragment that is
  defined according to Pr.
  Simple Predicates      Minterm Fragments             Applications
                                            p1
  A1 ≥ k1                                             A1
                            F1
                                            p1
  A2 = k2                                        p3
                                       p3                  A2
                            F2
  A3 ≤ k3
                                                           A3
  A4 = k4                   F3
                                                       A4
The fragments look homogeneous
               Completeness (2)
Simple Predicates   Minterm Fragments          Applications
                                     p1
A1 ≥ k1                                          A1
                      F1
                                     p1
A2 = k2                                   p3
                                p3                    A2
                      F2
A3 ≤ k3
                                                      A3
A4 = k4                                   p4
                     F3                          A4
                                     p5

                                                Set of simple
                                                predicates is
                                                 incomplete
                       Completeness (2)
Simple Predicates         Minterm Fragments          Applications
                                           p1
A1 ≥ k1                                                A1
                            F1
                                           p1
A2 = k2                                         p3
                                      p3                    A2
                            F2
A3 ≤ k3
                                                            A3
A4 = k4                                         p4
                            F31
                           F3                          A4
A5 > k5                                    p5

                            F32
          Additional
            simple
          predicate
                            Completeness (4)
  A set of simple predicate Pr is said to be complete if and only
  if there is an equal probability of access by every application
  to any two tuples belonging to any minterm fragment that is
  defined according to Pr.


J 1   LOC " MONTREAL " ( J )   Case 1: The only application that accesses
                                  J wants to access the tuples according to
J 2   LOC " NewYork " ( J )    the location.

J 3   LOC" Orlando " ( J )     The set of simple predicates
      LOC=“Montreal”                          LOC=“Montreal”,
                       J1                Pr = LOC=“New York”,
                 LOC=“New York”               LOC=“Orlando”
  J                    J2
                                  is complete because each tuple of each
                                  fragment has the same probability of
                       J3
       LOC=“Orlando”              being accessed.
               Completeness (5)

   Example:                 J1   JNO
                                 001
                                         JNAME
                                        Instrumental
                                                         BUDGET
                                                         150,000
                                                                    LOC
                                                                   Montreal

                           J2
                                 JNO      JNAME          BUDGET     LOC
                                 004       GUI            135,000 New York
     LOC=“Montreal”,             007     CAD/CAM         250,000 New York
Pr = LOC=“New York”,
     LOC=“Orlando”         J3    JNO     JNAME           BUDGET LOC
                                 003     Database Dev.   310,000 Orlando



   Case 2: There is a second application which accesses only those
   project tuples where the budget is less than $200,000.
      Since tuple “004” is accessed more frequently than tuple
       “007”, Pr is not complete.
      To make the the set complete, we need to add
       (BUDGET< 200,000) to Pr.
            Completeness (6)
                        BUDGET<=200,000

                          J11
    LOC=“Montreal”

            J1            J12
                        BUDGET>200,000

J    LOC=“New York”     BUDGET<=200,000
            J2            J21       Small-budget applications


        LOC=“Orlando”     J22
                                             Note: Completeness is a
          J3            BUDGET>200,000
                                             desirable property because a
                        BUDGET<=200,000      complete set defines
                                             fragments that are not only
                          J31
                                             logically uniform in that they
                                             all satisfy the minterm
                          J32                predicate, but statistically
                                             homogeneous.
                        BUDGET>200,000
Redundant Fragmentation

                       Fragment 1
    Logically
   uniform &
  statistically
  homogeneous
    fragment           Fragment 2




• Fragments 1 and 2 have the same
  characteristics
• The fragmentation is unnecessary
                           Minimality
Relevant:
Let mi and mj be two almost identical minterm predicates:
      mi =        p1 Λ p 2 Λ p3                      fragment fi
      mj =        p 1 Λ ¬ p 2 Λ p3                   fragment fj
    p2 is relevant if and only if
                                                       Access frequency
                         acc(mi )      acc(m j )
                                     
                         card ( f i ) card ( f j )
                                                       Cardinality




      p1     f1                  p2      fi     Prob1
f                   p3    f12                                A
                                 ¬p2    fj           Prob2           Prob1 ≠ Prob2
                         Minimality
  Relevant:
  Let mi and mj be two almost identical minterm predicates:
      mi =     p1 Λ p 2 Λ p3                      fragment fi
      mj =     p 1 Λ ¬ p 2 Λ p3                   fragment fj
   p2 is relevant if and only if
                                                    Access frequency
                      acc(mi )      acc(m j )
                                  
                      card ( f i ) card ( f j )
                                                    Cardinality



That is, there should be at least one application that accesses fi and fj
differently.
    i.e., The simple predicate pi should be relevant in determining a
          fragmentation.
Minimal:     If all the predicates of a set Pr are relevant, Pr is minimal.
A Complete and Minimal Example
Two applications:
1. One application accesses the tuples according
   to location.
2. Another application accesses only those project
   tuples where the budget is less than $200,000.


   Case 1: Pr={Loc=“Montreal”, Loc=“New York”, Loc=“Orlando”,
           BUDGET<=200,000,BUDGET>200,000} is
           complete and minimal.

   Case 2: If, however, we were to add the predicate
           JNAME= “Instrumentation” to Pr, the resulting
           set would not be minimal since the new predicate
           is not relevant with respect to the applications.
                          BUDGET<=200,000   JNAME = “Instrument”
                            J11                 J121
      LOC=“Montreal”

               J1           J12                 J122
                          BUDGET>200,000
                                            JNAME!  “Instrument”
J      LOC=“New York”     BUDGET<=200,000
               J2           J21
                                              [ JNAME = “Instrument” ]
                                              is not relevant.
          LOC=“Orlando”     J22
            J3            BUDGET>200,000

                          BUDGET<=200,000

                            J31

                            J32
                          BUDGET>200,000

    Relevant                          Irrelevant
       Application Information
• Qualification Information
   – The fundamental qualification information consists of the
     predicates used in user queries (i.e., “where” clauses in SQL).
   – 80/20 rule: 20% of user queries account for 80% of the total
     data access.
        One should investigate the more important queries.
• Quantitative Information
   – Minterm Selectivity sel(mi): number of tuples that would be
     accessed by a query specified according to a given minterm
     predicate.
   – Access Freequency acc(qi): the access frequency of queries in
     a given period.

   Qualitative information guides the fragmentation activity.
   Quantitative information guides the allocation activity.
Determine the set of meaningful minterm predicates
 Applications:
    • Take the salary and determine a raise accordingly.
    • The employee records are managed in two places, one handling the
      records of those with salary less than or equal to $30,000 and the other
      handling the records of those who earn more than $30,000.

Pr={p1: SAL<=30,000, p2: SAL>30,000} is complete and minimal.

The minterm predicates:
     m1 : ( SAL  30,000)  ( SAL  30,000)
     m2 : ( SAL  30,000)  ( SAL  30,000)    i1  m1 is contradictory
     m3 : ( SAL  30,000)  ( SAL  30,000)
     m4 : ( SAL  30,000)  ( SAL  30,000)   i 2  m4 is contradictory

Implications:                                           Therefore, we are left with
                                                        M = {m2, m3}
     i1 : ( SAL  30,000)  ( SAL  30,000)
     i 2 : ( SAL  30,000)  ( SAL  30,000)
     i 3 : ( SAL  30,000)  ( SAL  30,000)
     i 4 : ( SAL  30,000)  ( SAL  30,000)
                 Invalid Implications

            J    JNO    JNAME                BUDGET       LOC
                 J1     Instrumental         150,000      Montreal
                 J2     Database Dev.        135,000      New York
                 J3     CAD/CAM              250,000      New York
                 J4     Maintenance          350,000      Orlando



Simple predicates      VALID Implications              INVALID Implications
p1: LOC = “Montreal”   i 1 : p 1  p 2  p 3    i 8 : LOC " Montreal"  ( BUDGET  200 ,000 )
p2: LOC = “New York”                              i 9 : LOC " Orlando"  ( BUDGET  200 ,000 )
                       i 2 : p 2  p 1  p 3
p3: LOC = “Orlando”
p4: BUDGET ≤ 200,000   i 3 : p 3  p 1  p 2
p5: BUDGET > 200,000
                                                 Implications should be
                       i 4 : p 4  p 5          defined according to the
                       i 5 : p 5  p 4          semantics of the database,
                       i 6 : p 4  p 5          not according to the
                       i 7 : p 5  p 4          current values.
Compute Complete & Minimal Set
 Rule: a relation or fragment is partitioned “into at least two parts which are
 accessed differently by at least one application.
 Relevant: a simple predicate which satisfies the above rule, is relevant.

• Repeat until the predicate set is complete
    –   Find a simple predicate pi that is relevant
    –   Determine minterm fragments fi and fj according to pi
    –   Accept pi , fi , and fj
    –   Remove any pk and fk from acceptance list if pk becomes
        nonrelevant /* the list is minimal */
• Determine the set of minterm predicates M (using
  the acceptance list)
• Determine the set of implications I (among the
  acceptance list)
• For each mi in M, remove mi if it is contradictory
  according to I
Derived Horizontal Fragmentation


Derived fragmentation is used to facilitate the
join between fragments.


In some cases, the horizontal fragmentation of a
relation cannot be based on a property of its own
attributes, but is derived from the horizontal
fragmentation of another relation.
  Benefits of Derived Fragmentation
         PAY (TITLE, SAL)     Primary Fragmentation:
                                      PAY 1  SA L  30, 000( PAY )
EMP (ENO, ENAME, TITLE)               PAY 2   ( SA L  30, 000)( PAY )

                                   EMP1 = EMP SJ PAY1
Using Derived Fragmentation:       EMP2 = EMP SJ PAY2

 EMP1                       PAY1      EMPi and PAYi can be allocated
 EMP2                       PAY2      to the same site.


Not using derived fragmentation: one can divide EMP into EMP1
and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3
based on SAL. To join EMP and PAY, we have the following
scenarios.
                            PAY1

  EMP1                      PAY2    More communication overhead !
  EMP2                      PAY3
               Chain Relationships

                     •   Design the primary
                         fragmenation for R1.

R1 (R1PK, …)
                     •   Derive the derived
                         fragmentation for Rk as
R2 (R2PK, R1FK, …)       follows:

R3 (R3PK, R2FK, …)       • Rk = Rk SJRKFK=R(k-1)PK R(k-1)
                         • for 2  k  n in that order.
...
              Derived Fragmentation

  EMP (ENO, ENAME, TITLE)      PROJ (PNO, PNAME, BUDGET)


Join might      EMP_PROJ (ENO, PNO, RESP, DUR)
be required



      • How do we fragment EMP_PROJ ?
          – Semi-Join with EMP, or
          – Semi-Join with PROJ
      • Criterion: Suport the more-frequent join
        operation.
       VERTICAL FRAGMENTATION
Purpose: Identify fragments Ri such that
many applications can be executed using
just one fragment.                                          A7          A1


Advantage: When many applications which
use R1 and many applications which use R2              R2          R1
are issued at different sites, fragmenting
R avoids communication overhead.

   Vertical partitioning is more complicated than horizontal
   partitioning:
    • Vertical Partitioning: The number of possible fragments is
      equal to mm where m is the number of nonprimary key
      attributes
    • Horizontal Partitioning: 2n possible minterm predicates can
      be defined, where n is the number of simple predicates in the
      complete and minimal set Pr.
  Vertical Fragmentation Approaches
Greedy Heuristic Approaches:
  Split Approach: Global relations are
  progressively split into fragments.
  Grouping Approach: Attributes are
  progressively aggregated to constitute
  fragments.

Correctness:
  Each attribute of R belongs to at least one
  fragment.
  Each fragment includes either a key of R or a
  “tuple identifier”.
Vertical Clustering - Replication
   In evaluating the convenience of vertical
   clustering, it is important that overlapping
   attributes are not heavily updated.
  Example:        EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)



                 Administrative Applications   Applications
                          at Site 1             at all sites
                                                               NAME is
                                                               relatively
     Bad Fragmentation: NAME not available in EMP2              stable
         1.  EMP1(ENUM,NAME,TAX,SAL)
         2. EMP2(ENUM,MGRNUM,DNUM)

             Good Fragmentation:
                 1.  EMP1(ENUM, NAME, TAX, SAL)
                 2. EMP2(ENUM, NAME, MGRNUM, DNUM)
                     Split Approach

•   Splitting is considered only for attributes that do
    not participate in the primary key.

•   The split approach involves three steps:
    1.   Obtain attribute affinity matrix.
    2. Use a clustering algorithm to group some attributes
       together based on the attribute affinity matrix. This
       algorithm produces a clustered affinity matrix.
    3. Use a partitioning algorithm to partition attributes
       such that set of attributes are accessed solely or for
       the most part by distinct set of applications.
       Attribute Usage Matrix
               PROJ     PNO PNAME BUDGET LOC
                         A1   A2       A3     A4

                                             1 if Aj is referenced by qi
                              use(qi,Aj) =
q1: SELECT BUDGET                            0 otherwise
     FROM PROJ
     WHERE PNO=Value;
                                             A1 A2 A3 A4
q2: SELECT PNAME, BUDGET
     FROM PROJ;                        q1 1 0         1    0
q3: SELECT PNAME
                                       q2 0 1         1    0
     FROM PROJ                         q3 0 1         0    1
     WHERE LOC=Value;
                                       q4 0 0         1    1
q4: SELECT SUM(BUDGET)
     FROM PROJ
     WHERE Loc=Value                  Attribute Usage Matrix
               Attribute Affinity Measure
    aff ( Ai, Aj )                           
                          k , use ( qk , Ai )  s  use ( qk , Aj )  s
                                                                            ref (q )  acc (q )
                                                                           s
                                                                                   s     k             s        k

                         For each query qk that uses both Ai and Aj             Popularity of such Ai-Aj pair at
                                                                                            all sites
Popularity
 of using    Relation R                                 Site m
Ai and Aj                                                                                              Site n
                                                     qk
together

                 Ai                                                                               qi
                                                          qi                                               qi
               Ak

                    Aj                ref s (qk )                      Site s

                                                                  qk
     refs(qk) : Number of accesses to                                                  accs ( qk )
                attributes (Ai,Aj) for each
                                                                           qi
                execution of qk at site s
                                                                 accs (qk) : Application access
                                                                            frequency of qk at site s.
          Attribute Affinity Matrix
aff ( Ai, Aj )                        
                   k , use ( qk , Ai )  s  use ( qk , Aj )  s
                                                                     ref (q )  acc (q )
                                                                    s
                                                                            s     k              s    k

                   For each query qk that uses both Ai and Aj            Popularity of such Ai-Aj pair at
                                                                                     all sites




 refs (qk): Number of accesses                                            A1 A2 A3 A4
             to attributes (Ai,Aj)
                                                                    A1
             for each execution
             of qk at site s                                        A2                aff ( A2, A3)

                                                                    A3
 accs (qk): Application access
            frequency of qk at                                      A4
            site s.                                                 Attribute Affinity Matrix
Attribute Affinity Matrix Example
          A1 A2 A3 A4              A1 A2    A3    A4
     q1   1   0   1   0      A1   45 0 45 0
     q2   0   1   1   0      A2   0 80 5 75
     q3   0   1   0   1      A3   45 5 53 3
     q4   0   0   1   1      A4   0 75 3 78
    Attribute Usage Matrix   Attribute Affinity Matrix (AA)




Next Step - Determine clustered affinity (CA) matrix
           Clustered Affinity Matrix
             Step 1: Initialize CA

                                 Copy first 2 columns


       A1 A2     A3   A4                A1    A2    A3 A4
  A1   45 0 45 0                   A1 45 0
  A2   0 80 5 75                   A2 0 80
  A3   45 5 53 3                   A3 45 5
  A4   0 75 3 78                   A4 0 75
Attribute Affinity Matrix (AA)   Clustered Affinity Matrix (CA)
       Clustered Affinity Matrix
   Step 2: Determine Location for A3
          3 possible
          positions                                      A1 A2 A3
          for A3                        A1 A3 A2
                            A0 A3 A 1

  A0   A1 A2     A3    A4    A5           A0       A1   A2   A3 A4   A5
  A1   45 0 45 0                          A1    45 0
  A2   0 80 5 75                          A2    0 80
  A3   45 5 53 3                          A3    45 5
  A4   0 75 3 78                          A4    0 75
Attribute Affinity Matrix (AA)          Clustered Affinity Matrix (CA)
          Clustered Affinity Matrix
     Step 2: Determine the order for A3
                                        n
                  bond ( Ax , Ay )   aff ( Az , Ax )  aff ( Az , Ay )
                                       z 1
Contribution
                 cont ( Ai , Ak , A j )  2  bond ( Ai , Ak )  2  bond ( Ak , A j )  2  bond ( Ai , A j )

                Cont(A0,A3,A1) = 8820           Cont(A1,A3,A2) = 10150            Cont(A2,A3,A4) = 1780
                Since Cont(A1,A3,A2) is the greatest, [A1,A3,A2] is the best order.

               A1 A2        A3       A4                                     A1      A3       A2 A4
     A1    45 0 45 0                                               A1      45 45 0
     A2    0 80 5 75                                               A2      0 5 80
     A3    45 5 53 3                                               A3      45 53 5
     A4    0 75 3 78                                               A4      0 3 75
  Attribute Affinity Matrix (AA)                               Clustered Affinity Matrix (CA)

       Note: aff(A0,Ai)=aff(Ai,A0)=aff(A5,Ai)=aff(Ai,A5)=0 by definition
       Clustered Affinity Matrix
  Step 2: Determine the order for A4

   Since Cont(A3,A2,A4) is the biggest, [A3,A2,A4] is the best order.



        A1 A2     A3    A4                       A1   A3    A2    A4
  A1   45 0 45 0                           A1   45 45 0 0
  A2   0 80 5 75                           A2   0 5 80 75
  A3   45 5 53 3                           A3   45 53 5 3
  A4   0 75 3 78                           A4   0 3 75 78
Attribute Affinity Matrix (AA)          Clustered Affinity Matrix (CA)
        Clustered Affinity Matrix
       Step 3: Re-order the Rows

 The rows are organized in the same order as the columns.


       A1    A3   A2   A4               A1    A3   A2   A4
  A1   45 45 0 0                   A1   45 45 0 0
  A2   0 5 80 75                   A3   45 53 5 3
  A3   45 53 5 3                   A2   0 5 80 75
  A4   0 3 75 78                   A4   0 3 75 78
Clustered Affinity Matrix (CA)   Clustered Affinity Matrix (CA)
Partitioning                          Bad grouping since
                                      A1 and A2 are never
                                      accessed together
                                                                A4 and A3
                                                                   are
                                                                 usually
Find the sets of attributes          A1     A3   A2    A4          not
                                                                accessed
that are accessed, for the
most part, by distinct sets
                                A1 45 45 0 0                    together

of applications.                A3 45 53 5 3
                                                                 A4 and A2
                                A2 0 5 80 75                     are often
We look for dividing points
along the diagonal such that    A4 0 3 75 78                     accessed
                                                                 together
 • Total accesses to only      Clustered Affinity Matrix (CA)
   one fragment are
   maximized, while                       Cluster 1:   A1 & A3
 • Total accesses to more                 Cluster 2:   A2 & A4
   than one fragments are
                                 Two vertical fragments:
   minimized.                     PROJ1(A1, A3) and PROJ2(A2, A4)
           MIXED FRAGMENTATION
•Apply horizontal fragmentation to vertical fragments.
•Apply vertical fragmentation to horizontal fragments.

Example: Applications about work at each department reference tuples
of employees in the departments located around the site with 80%
probability.
                   EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)

    ENUM NAME TAX SAL           ENUM   NAME   MGRNUM   DNUM

                                    Jacksonville
                                    Orlando
                                    Miami

                                                              Horizontal
           Vertical fragmentation                             Fragmentation
                                                              (local work)
i:     fragment index            ALLOCATION –
j:     site index                 Notations
k:     application index
fkj:   the frequency of
                                                   Site j
       application k at site j
                                     Fragment i
rki:  the number of retrieval
      references of application k                 uki
      to fragment i.                 rki
uki: the number of update
      references of application k       Application k
      to fragment i.                     /w freq. fkj

nki = rki + uki
 Allocation of Horizontal Fragments (1)
 No replication: Best Fit Strategy
 • The number of local references of Ri at site j is

       Benefit to
                         Bij   f kj nki                     Number of
                                                              Access by k
           Site j                        k
                                             Frequency of
                    All applications k
                                              application k
                             at Site j


 • Ri is allocated at site j* such that Bij* is maximum.


Advantage: A fragment is allocated to a site that needs it most.
Disadvantage: It disregards the “mutual” effect of placing a
fragment at a given site if a related fragment is also at that
site.
Allocation of Horizontal Fragments (2)
All beneficial sites approach (replication)


                    Bij   f kjrki  c  f kj'uki
      Fragment i          k           j ' j k


        Site j
                    Savings due to     Cost of update
                    retrieval          references from
                    references         other sites



 • Ri is allocated at all sites j* such that Bij* > 0.
 • When all Bij’s are negative, a single copy of Ri is
   placed at the site such that Bij* is maximum.
       Allocation of Horizontal Fragments (3)
     Another Replication Approach:

           di     The degree of redundancy of Ri

           Fi     The reliability and availability benefit of having Ri fully replicated.

                  The reliability and availability benefit when the fragment has di
         (di)
                  copies.


  (d i )  (1  21d )  F i
                      i
                                    (1)  0,  (2)  F i ,  (3)  3  F i ,      β
                                                         2                 4
                                                                                      Fi
The benefit of introducing a new copy of Ri at site j :

      Bij   f kj rki  c  f kj'uki   (d i)
                                                                                           1   di
                 k                    k   j ' j
                                                                        Also takes into
                          Same as All Beneficial                        account the benefit
                             Sites approach                             of availability
      Allocation of Vertical Fragments
                                                                                            Should we allocate fragment Rs
                                PSr         A1           A3              A2                 to site PSs , and fragment Rt to
                                                                                            site PSt ?
                                             Ri          Rs        Rt
                                                                                                  PSr
  Application type A1
    at site PSr , that
                                                                                                              A1       A3    A2
    accesses only Rs


                       As                 At                  A4   ...          An                     As         Rs         Rt     At
                                                                                                                                         PSt

                      PSs                PSt              PS4                 PSn                     PSs
                                                                                                                       A4     PS4
       B ist             f ks n ks              f kt n kt              f ks n ks                                  .
                                                                                                                        .
                 k   As                  k  At                   k   A1                                              .
Applications
 of type As
                          
                           k
                                      f kt n kt     2 f
                                                    k
                                                                    ki   n ki     
                                                                                  4  l  n k
                                                                                                      f kl n ki         An    PSn

      at PSs
                                A2                       A3                                      Al

    This formula can be used within an exhaustive “splitting”
      algorithm by trying all possible combinations of sites s and t.
                         SUMMARY
Design of a distributed DB consists of four phases:
   – Phase 1: Global schema design (same as in centralized DB
     design)
   – Phase 2: Fragmentation
      • Horizontal Fragmentation
          – Primary: Determent a complete and minimal set of predicates
          – Derived: Use semijoin
      • Vertical Fragmentation
          Identify fragments such that many applications can be executed
            using just one fragment.
   – Phase 3: Allocation
      The primary goal is to minize the number of remote accesses.
   – Phase 4: Physical schema design (same as in centralized DB
     design).
        Mini Research Project 2
• Formulate the data placement problem for
  map-reduce applications in cloud computing
• Apply the techniques discussed in class or
  introduce new techniques to solve the problem
• Objectives:
  – Maximize node utilization in the cloud
  – Maximize parallelism for each of the applications
Database Integration
     Bottom-up Design
                        Overview
• The design process in
  multidatabase systems is            Database 1      Database 2    Database 3
  bottomup.
   – The individual databases         Translator 1   Translator 2   Translator 3
     actually exists
   – Designing the global                 InS1           InS2           InS3
     conceptual schema (GCS)
     involves integrating these   Intermediate
     local databases into a       schema in
                                  canonical
     multidatabase.               representation     INTEGRATOR

• Database integration can
  occur in two steps:
                                                         GCS
  Schema Translation and
  Schema Integration.
    Network Data Model (Review)
• There are two basic data structures in the
  network model: records and sets.      owner record type
   DEPARTMENT (DEPT-NAME, BUDGET, MANAGER)
                             Employs                          set type
   EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)                 member record type
   Record type: a group of records of the same type.
   Set type: indicates a many-to-one relationship in the direction of the
     arrow.

• Representation of set instances:
 DEPARTMENT (owner record)
 Database

                                           EMPLOYEE
            Jones, L.                      (member records)

                 Patel, J.                          Vu, K.
Example: Three Local Databases
Database 1 (Relational Model):
               S (TITLE, SAL)

 E (ENO, ENAME, TITLE)

               J (JNO, JNAME, BUDGET, LOC, CNAME)

      G (ENO, JNO, RESP, DUR)

Database 2 (Network Model):
  DEPARTMENT (DEPT_NAME, BUDGET, MANAGER)
                         Employs
                                       Dummy
                    Work
                                     Record Type
                         Worksin

  EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)
Example: Three Local Databases

Database 3 (ER Model):
                                                       Project        Project
                           Responsibility
 Engineer   Engineer                                     No.           Name
   No.       Name
                                                                                Budget
                                              1
   ENGINEER            N      WORKS                      PROJECT
                               IN
                                                                 N
                                                                                Location

   Title    Salary                                         CONTRACTED
                             Duration                          BY

                                            Contract             1
                                              Date
                                                           CLIENT

                                                       Client        Address
                                                       Name
     Schema Translation: Relational to ER
              S (TITLE, SAL)
                                                                                JNO      JNAME
                                                   ENO    ENAME       RESP
E (ENO, ENAME, TITLE)
                                                                 N              M
          J (JNO, JNAME, BUDGET, LOC, CNAME)             E                              J
                                                          1
   G (ENO, JNO, RESP, DUR)                                             DUR                     LOC
                                                                               BUDGET
                                                         PAY
                                                                                       CNAME
• E & J have a many-to-                                    N
  many relationship
• E & S have a 1-to-many                                 S
  relationship
                                                                SAL
                                                      TITLE
                                                                             Treat salary as
                                                                             an attribute of
                                                                             an engineer
                                  JNO      JNAME                             entity
  ENO     ENAME            RESP

                N                 M
          E                                J
                           DUR                  LOC
              SAL                 BUDGET                     Relationships may be identified from
  TITLE
                                        CNAME                the foreign keys defined for each
                                                             relation.
 Schema Translation: Network to ER
    DEPARTMENT            EMPLOYEE                         WORK
                                                       N          M
         Employs          Works-in

                   WORK                            EMPLOYS        WORKS-IN

  Dummy                                        1                        1
record type
                                          DEPARTMENT              EMPLOYEE


                    N                M
   DEPARTMENT              EMPLOYS       EMPLOYEE


• Map each record type in the network schema to an entity
  and each set type to a relationship.
• Network model uses dummy records in its representation of
  many-to-many relationships that need to be recognized
  during mapping.
        Schema Integration
Schema integration follows the translation
process and generates the GCS by
integrating the intermediate schemas.
  – Identify the components of a database which
    are related to one another.
     • Two components can be related as (1) equivalent, (2)
       one contained in the other one, (3) overlapped, or (4)
       disjoint.

  – Select the best representation for the GCS.
  – Integrate the components of each
    intermediate schema.
           Integration Methodologies

                  Integration
                    Process
                                                 Binary: Decreases the
                                                 potential integration
         Binary                 N-ary
                                                 complexity and lead toward
                                                 automation techniques.

Ladder     Balanced       One-shot   Iterative   One-shot: There is no
                                                 implied priority for
                                                 integration order of
                                                 schemas, and the trade-off
                                                 can be made among all
                                                 schemas rather than among
                                                 a few.
           Integration Process
Schema integration occurs in a sequence of four
steps:

• Preintegration: establish the “rules” of the integration
  process before actual integration occurs.

• Comparison: naming and structural conflicts are identified.

• Conformation: resolve naming and structural conflicts

• Merging and restructuring: all schemas must be merged into
  a single database schema and then restructured to create
  the “best integrated schema.
   Schema Integration: Preintegration
1. An integration method (binary or n-ary) must be
   selected and the schema integration order defined.
   – The order implicitly defines priorities.

2. Candidate keys in each schema are identified to
   enable the integrator to determine dependencies
   implied by the schemas.

3. The mapping or transformation rules should be
   described before integration begins.
   – e.g., mapping from degree Celsius in one schema to
     degrees Fahrenheit in another.
Preintegration Example: InS1

                                                      Project        Project
                          Responsibility                No.           Name
Engineer   Engineer
  No.       Name
                                                                               Budget
                                             1
  ENGINEER            N       WORKS                     PROJECT
                               IN
                                                                N
                                                                               Location

  Title    Salary                                         CONTRACTED
                             Duration                         BY

                                           Contract             1
                                             Date
                                                          CLIENT

                                                      Client        Address
                                                      Name
 Preintegration Example: InS2 & InS3

              Name                                  Dept-name   Budget
E#
                              Title
                                                                         Manager
                                                1
                                  N
      EMPLOYEE                        EMPLOYS       DEPARTMENT

     Address         Salary                            InS2


                                      Resp            JNO       Jname
      Eno            Ename

                                                                         Budget
                                                M
      ENGINEER                   N    EMPLOYS               J
                                                                          InS3
                                                       Cname      Loc
      Title           Sal
                                       Dur
  Keys & Integration Order

                                   KEYS

                     InS1:   Engineer No. in ENGINEER
                             Project No. in PROJECT
                             Client name in CLIENT

                     InS2:   E# in EMPLOYEE
InS1   InS2   InS3
                             Dept-name in DEPARTMENT

Integration method   InS3:   Eno in E
                             Jno in J
         Schema Comparison:
         Naming Conflict (1)
Synonyms: two identical entities that have
different names.

                  InS1                 InS3
             ENGINEER           E
               Engineering No       Eno
               Engineer Name        Ename
               Salary               Sal
             WORKSIN            G
               Responsibility       Resp
               Duration             Dur
             PROJECT            J
               Project No           Jno
               Project Name         Jname
               Location             Loc
           Schema Comparison:
           Naming Conflict (2)
Homonyms: Two different entities that have
identical names.

• In InS1, ENGINEER.Title refers to the title of
  engineers.

• In InS2, EMPLOYEE.Title refers to the title of all
  employees.


  domain (EMPLOYEE.Title) >> domain (ENIGNEREER.Title)
Schema Comparison – Relation
     between Schemas
• Two schemas can be related in four
  possible ways:
  – They can be identical to one another.
  – One can be a subset of the other.
  – Some components from one may occur in other
    while retaining some unique features
  – They could be completely different with no
    overlap.
• An attribute in one schema may represent
  the same information as an entity in
  another one
    Schema Comparison Example
• InS3 is a subset of InS1                     E#


                                              Name

      ENGINEER                   EMPLOYEE     Title


                                             Address
                                   EMPLOYS
            IS-A relationship                  Salary



                                DEPARTMENT




• Some parts of InS1 (about engineers) and InS3
  (about engineers) occur in InS2 (about employees)
    Schema Comparison – Structural
            Conflicts (1)
• Type conflicts: occur when the same object is
  represented by an attribute in one schema and by an
  entity in another schema.
   – The client of a project is modeled as an entity in InS1,
     however
   – the client is included as an attribute of the J entity in InS3
                                                                        N
  Resp          JNO       Jname                           CONTRACTED         PROJECT
                                                              BY

                                  Budget   Contract            1
            M
  EMPLOYS
                      J                      Date
                                                                             InS1
                                                          CLIENT
                              InS3
                Cname       Loc
   Dur                                                Client       Address
                                                      Name
   Schema Comparison – Structural
           Conflicts (2)
                                                                      This is
                                                                    1-to-many
                                                                                  Project
Dependency conflicts:               Engineer
                                      No.
                                                      Engineer
                                                       Name
                                                                                    No.

occur when different
                                                                             1
relationship modes are                ENGINEER                 N     WORKS
                                                                      IN
                                                                                  PROJECT
used to represent the
                                      Title           Salary                     InS1
same thing in different
schemas.
                                                                      This is
                                               Resp                many-to-many
                Eno     Ename


                                                               M
                ENGINEER        N         EMPLOYS                     J

                Title     Sal                  Dur          InS3
   Schema Comparison: Structural
          Conflicts (3)

• Key conflicts: occur when different candidate keys
  are available and different primary keys are
  selected in different schemas

• Behavioral conflicts: are implied by the modeling
  mechanism,
   – e.g., deletion of the last employee causes the dissolution
     of the department.
    Conformation: Naming Conflicts
Naming conflicts are resolved simply by renaming
conflict ones.
                                                  Homonyms:
Synonyms: rename the schema of InS3               • Prefix each attribute
to conform to the naming of InS1.                   by the name of the
                                                    entity to which it
              InS3                  InS1            belong,
E                              ENGINEER              e.g., ENGINEER.Title
    Eno    Engineering No       Engineering No
                                                           EMPLOYEE.Title
    Ename  Engineering Name     Engineer Name
    Sal    Salary               Salary           • and prefix each entity
G                              WORKSIN              by the name of the
    Resp    Responsibility      Responsibility
                                                    schema to which it
    Dur     Duration            Duration
J                              PROJECT              belongs.
    Jno    Project No           Project No          e.g., InS1.ENGINEER
    Jname  Project Name         Project Name
                                                           InS2.EMPLOYEE
    Loc    Location             Location
                  Resolving Structural Conflicts
    Transforming entities/attributes/relationships among one another

                                             Responsibility             Project        Project
       Engineer        Engineer
                                                                          No.           Name
         No.            Name
                                                                                                   Budget

InS3     ENGINEER                    N           WORKS
                                                  IN
                                                                M
                                                                         PROJECT
                                                                                                  Location
                                                                             Client
         Title         Salary                                                Name
                                                Duration


                                                                                   Project       Project
                                                       Responsibility
                  Engineer        Engineer                                           No.          Name
                    No.            Name
                                                                                                             Budget
                                                                         M
                    ENGINEER                     N            WORKS                 PROJECT
                                                               IN
                                                                                             N
                                                                                                             Location

                    Title         Salary                                                 C-P
                                                           Duration
Example:                                                                                     M
                                                                                                       New
Transform the attribute Client name in                                    Client
                                                                                             C
                                                                                                       InS3
InS3 to an entity C to make InS3                                          Name

conform to the presentation of InS1.
                Schema Integration:
              Merging & Restructuring

Merging requires that the information contained in the
participating schemas be retained in the integrated
schema.
                                        Merging using the IS-A
                                        relationship

                                         Use InS3 as the final schema
                                         since it is more general in
                                         terms of the C-P relationship
                                         (i.e., many-to-many)
                                         (next page)
   InS2         InS3      InS1
(Employees)   (Engineers) (Engineers)
               Integrate InS1 & InS3
                                                                                           Project        Project
                                                                   Responsibility
                           Engineer        Engineer                                          No.           Name

            InS1
                             No.            Name
                                                                                                                    Budget
                                                                                    1
                             ENGINEER                     N           WORKS                 PROJECT
                                                                       IN
                                                                                                     N              Location
                             Title         Salary                                             CONTRACTED
                                                                     Duration                     BY

                                                                                Contract             1
                                                                                  Date
                                                                                              CLIENT

                                                                                           Client        Address
                                                     Project         Project               Name
                          Responsibility
Engineer   Engineer                                    No.            Name
  No.       Name
                                                                                Budget
                                           M
                                                      PROJECT
                                                                                                    InS3 is
                      N
 ENGINEER                    WORKS
                              IN
                                                               N                Location
  Title    Salary                                      CONTRACTED
                                                                                                    more
                            Duration                       BY
                                                                                                    general
                                                               M                InS3
                                            Client
                                            Name               C
       Merging & Restructuring Example
 Final Result:                                                   Project     Project
                                 Duration                          No.        Name
                                                Responsibility

                                                                                        Budget
                                                            M
                 ENGINEER          N           WORKS               PROJECT
                                                IN
                                                                                        Location

                                                                     CONTRACTED
                                              E#                         BY


                                             Name
                                                                     CLIENT           InS1/InS3
                 EMPLOYEE                    Title
                   N

                                            Address              Client     Address
                   EMPLOYS                                       name
                                             SAL
                   1
InS2            DEPARTMENT
                                                       Unfortunately, Conformation and
       Budget          Manager   Dept-name
                                                       restructuring stages are an art
                                                       rather then a science
 Query Processing in
Multidatabase Systems
Query Processing in Three Steps
1. Global query is                 Schema Integration
   decomposed into local
   queries
                           Local Schema 1    Local Schema 2    Local Schema 3




                           Translator 1     Translator 2       Translator 3


                              InS1                 InS2              InS3

                            Q1,1            Q1,2              Q1,3



                                            INTEGRATOR


                                       Q1
                                                   GCS
Query Processing in Three Steps
2. Each local query is             Schema Integration
   translated into
   queries over the
                           Local Schema 1      Local Schema 2      Local Schema 3

                         Q’1,1               Q’1,2              Q’1,3
   corresponding local
                          Translator 1
   database system                             Translator 2        Translator 3


                                 InS1                InS2               InS3

                            Q1,1              Q1,2               Q1,3



                                             INTEGRATOR


                                        Q1
                                                     GCS
Query Processing in Three Steps
3. Results of the local             Schema Integration
   queries are combined
   into the answer
                            Local Schema 1      Local Schema 2      Local Schema 3

                          Q’1,1               Q’1,2              Q’1,3

                           Translator 1         Translator 2        Translator 3


    Final                         InS1                InS2               InS3
   answer     Combine        Q1,1              Q1,2               Q1,3



                                              INTEGRATOR


                                         Q1
                                                      GCS
 Query Processing in Three Steps
1.   Global query is                Schema Integration
     decomposed into local
     queries
                             Local Schema 1    Local Schema 2   Local Schema 3




2. Each local query is       Translator 1     Translator 2      Translator 3
   translated into
   queries over the             InS1               InS2              InS3

   corresponding local
   database system
                                              INTEGRATOR
3. Results of the local
   queries are combined
   into the answer                                 GCS
               Outline
• Overview of major query processing
  components in multidatabase systems:
  – Query Decomposition
  – Query Translation
  – Global Query Optimization

• Techniques for each of the above
  components
Query Decomposition
Query Decomposition                                     Global Query
Overview
                                      Query decomposition &
                                       global optimization



    SQ1           SQ2
                              ...      SQn            PQ1   …     PQn


Query
translator 1
               Query
               translator 2
                              …     Query
                                    translator n
                                                   SQi export schema subquery
                                                       in global query language
                                                   TQi target query (local

   TQ1            TQ2         ...      TQn
                                                       subquery) in local query
                                                       language
                                                   PQi postprocessing query
                                                       used to combine results
   DB1            DB2                   DBn            returned by subqueries
                                                       to form the answer
            Assumptions
• We use the object-oriented data model to
  present a query modification algorithm
• To simplify the discussion, we assume that
  there are only two export schemas:
        ES1                ES2
      Emp1: SSN          Emp2: SSN
            Name               Name
            Salary             Salary
            Age                Rank
                 Definitions
• type: Given a class C, the type
  of C denoted by type(C ), is the
  set of attributes defined for C
  and their corresponding
  domains.                            World

• world: the world of C, denoted
  by world(C ), is the set of real-                Type

  world objects described by C.
• extension: the extension of C,
                                              Extension
  denoted by extension(C ), is the
  set of instances contained in C.
                                              A Class
     Review: Outerjoin
The outerjoin of relation R1 and R2
(R1 ⋈o R2 ) is the union of three
components:
  – the join of R1 and R2,
  – dangling tuples of R1 padded with null
    values, and
  – dangling tuples of R2 padded with null
    values.
                Outerjoin Example
Emp1                                      EmpO
 OID   SSN    Name     Salary     Age     OID     SSN    Name    Salary    Age    Rank
  3    6789   Smith    90,000      40
                                           1      2222   Ahad    98,000    null   S. Mgr.
  4    4321   Chang    62,000      30
                                           2      7531   Wang    95,000    mull   S. Mgr.
  5    8642   Patel    75,000      35
                                                                 Incon-
                                           3      6789   Smith   sistent
                                                                           40      Mgr.

Emp2                                       4      4321   Chang   62,000    30      null

 OID   SSN    Name    Salary    Rank       5      8642   Patel   75,000    35      null
  1    2222   Ahad    98,000    S. Mgr.
  2    7531   Wang    95,000    S. Mgr.
  3    6789   Smith   25,000     Mgr.          Dangling Tuple       Dangling Tuple
Schema Integration - Outerjoin
Two classes C1 and C2 can be integrated
by equi-outerjoining the two classes on
the OID to form a new class C.
     – extension(C ) = extension(C1 ) ⋈o extension(C2 )
     – type(C ) = type(C1 ) ⋃ type(C2 )
     – world(C ) = world(C1 ) ⋃ world(C2 )



C1                C2                        C
  Schema Integration - Generalization

Two classes C1 and C2 can be integrated by
generalizing the two classes to form the
superclass C.
                                             Outer
                                             union
  type(C ) = type(C1 ) ⋂ type(C2 )

  extension(C ) = ᅲtype(C) [extension(C1 ) ⋃o extension(C2 )]

  world(C ) = world(C1 ) ⋃ world(C2 )
     Generalization Example
Emp1: SSN                          Emp2: SSN                 EmpG: SSN
      Name                               Name                      Name
      Salary                             Salary                    Salary
      Age                                Rank

• Emp1 and Emp2 will also appear in the
  global schema since not all information in
  Emp1 and Emp2 is retained in EmpG
                                  EmpG SSN
               Generalization




                                       Name                  More
                                       Salary               specific

                                Emp1 Age        Rank Emp2
  Inconsistency Resolution
• The schema integration techniques
  work as long as there is no data
  inconsistency
• If data inconsistency occurs,
  aggregate functions may be used to
  resolve the problem.
     Inconsistency Resolution Example
        Export Schemas                                 Integrated Schema
Emp1: SSN        Emp2: SSN                       EmpG: SSN           EmpO: SSN
      Name             Name                            Name      or        Name
      Salary           Salary                          Salary              Salary
      Age              Rank           Generalization                       Age
                                                                 Outer     Rank
Aggregate Functions - Examples:                                    join
EmpG.Name = Emp1.Name, if EmpG is in world(Emp1)
          = Emp2.Name, if EmpG is in world(Emp2) – world(Emp1)
EmpG.Salary = Emp1.Salary, if EmpG is in world(Emp1) – world(Emp2)
            = Emp2.Salary, ifEmpG is in world(Emp2) – world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ⋂ world(Emp2)
EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
         = Null, if EmpO is in world(Emp2) – world(Emp1)                              world(Emp2) –
                                                                                       world(Emp1)
EmpO.Rank = Emp2.Rank, if EmpO is in world(Emp2)                            world(Emp1)
                                                                                 ⋂
                                                                            world(Emp2)
          = Null, if EmpO is in world(Emp1) – world(Emp2)   world(Emp1) –
                                                             world(Emp2)



                                                            World (Emp1)            World (Emp2)
                 Query Modification
  Step 1: Determine Number of Subqueries
Global            Select     EmpO.Name, EmpO.Rank
Query             From       EmpO
                             Where  EmpO.Salary > 80,000 AND
                                    EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate
function used to resolve the data inconsistency.
Option 1 (based on Salary)
                                         Inconsistency Function:
part. 1: world(Emp1) – world(Emp2)
part. 2: world(Emp2) – world(Emp1)       EmpO.Salary = Emp1.Salary, if
                                                       EmpG is in world(Emp1) – world(Emp2)
part. 3: world(Emp1) ⋂ world(Emp2)
                                                      = Emp2.Salary, if
                                                        EmpG is in world(Emp2) – world(Emp1)
         world(Emp1)                                  = Sum(Emp1.Salary,Emp2.Salary), if
                                                        EmpG is in world(Emp1) ⋂ world(Emp2)
                       3             2
             1
                                     world(Emp2)
                 Query Modification
   Step 1: Determine Number of Subqueries
Global               Select     EmpO.Name, EmpO.Rank
Query                From       EmpO
                                Where  EmpO.Salary > 80,000 AND
                                       EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate
function used to resolve the data inconsistency.
                                                   Option 2 (based on Age)
Inconsistency Function:                            part. 1: world(Emp1)
EmpO.Age                                           part. 2: world(Emp2) – world(Emp1)
 = Emp1.Age, if EmpO is in world(Emp1)
 = Null, if EmpO is in world(Emp2) – world(Emp1)
                                                   world(Emp1)


                                                           1            2
                                                                         world(Emp2)
               Query Modification
  Step 1: Determine Number of Subqueries
Global             Select       EmpO.Name, EmpO.Rank
Query              From         EmpO
                                Where  EmpO.Salary > 80,000 AND
                                       EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate
function used to resolve the data inconsistency.
Option 1 (based on Salary)                      Option 2 (based on Age)
part. 1: world(Emp1) – world(Emp2)              part. 1: world(Emp1)
part. 2: world(Emp2) – world(Emp1)              part. 2: world(Emp2) –
part. 3: world(Emp1) ⋂ world(Emp2)                            world(Emp1)
 world(Emp1)                                  world(Emp1)

               3            2                                       2
     1                                                 1
                            world(Emp2)                              world(Emp2)

We use Option 1 since it is the finest partition among all the partitions.
          Query Modification
                     Another Example

 Option 1:                           Option 2:
world(Emp1)                        world(Emp1)

                   2                                 2
          1                                  1
                    world(Emp2)                      world(Emp2)




Use finer partition (Option 3):

                 world(Emp1)

                               3       2
                     1
                                       world(Emp2)
                      Query Modification
                  Step 2: Query Decomposition
Global Query:                                part. 1: Select Emp1.Name
Select EmpO.Name, EmpO.Rank                           From Emp1
From EmpO
Where EmpO.Salary > 80,000 AND
                                                      Where Emp1.Salary > 80,000 AND
        EmpO.Age > 35                                         Emp1.Age > 35 AND
                                                              Emp1.SSN NOT IN
Partition:                                                         (Select Emp2.SSN
                                                                    From Emp2)
       1          3                          part. 2: This subquery is discarded because
                           2                          EmpO.Age is Null.
 world(Emp1)               world(Emp2)
                                               part. 3: Select Emp1.Name, Emp2.Rank
 Query Decomposition: Obtain                             From Emp1, Emp2
    a query for each subset in                           Where Sum(Emp1.Salary,
    the chosen partition.                                       Emp2.Salary) > 80,000 AND
EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)                 Emp1.Age > 35 AND
         = Null, if EmpO is in world(Emp2) – world(Emp1)       Emp1.SSN = Emp2.SSN
EmpO.Salary = Emp1.Salary, if EmpG is in world(Emp1) – world(Emp2)
           = Emp2.Salary, ifEmpG is in world(Emp2) – world(Emp1)
           = Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ⋂ world(Emp2)
            Query Modification
         Step 3: Further Decomposition
STEP 3: Some resulting query may still reference
data from more than one database. They need to be
further decomposed into subqueries and possibly also
postprocessing queries
                                 Select Emp1.Name
                                 From Emp1
Before STEP 3:                   Where Emp1.Salary > 80,000 and
Select Emp1.Name                        Emp1. Age > 35 and
From Emp1                               Emp1.SSN NOT IN X
Where Emp1.Salary > 80,000 and
       Emp1. Age > 35 and
       Emp1.SSN NOT IN                              X
         (Select Emp2.SSN
          From Emp2)
                                 Insert INTO X
                                 Select Emp2.SSN
                                 From Emp2)
      Query Modification
      Step 4: Query Optimization


STEP 4: It may be desirable to reduce
 the number of subqueries by
 combining subqueries for the same
 database.
Query Translation
     Query Translation (1)

IF     Global Query Language ≠
                Local Query Language

THEN     Export                   Local
         Schema      Translator   Query
         Subquery                 Language
      Query Translation (2)
IF the source query language has a higher
  expressive power THEN EITHER
  – Some source queries cannot be translated; or
  – they must be translated using both
     • the syntax of the target query language, and
     • some facilities of a high-level programming language.

Example: A recursive OODB query may not be
  translated into a relational query using SQL
  alone.
  Translation Techniques (1)
CASE 1: A single target query is generated
  IF the target database system has a query
  optimizer
     THEN the query optimizer can be used
          to optimize the translated query
     ELSE the translator has to consider the
          performance issues
 Translation Techniques (2)
CASE 2: A set of target queries is needed.
• It might pay to have the minimum number of
  queries
   – It minimizes the number of invocations of the target
     system
   – It may also reduce the cost of combining the partial
     results
• It might pay for a set to contain target queries
  that can be well coordinated
   – The results or intermediate results of the queries
     processed earlier can be used to reduce the cost of
     processing the remaining queries
  Relation-to-OO Translation
OODB Schema:

Auto             Company               People                City
 OID               OID                   OID                   OID
 Color             Name                  Name                  Name
 Manufacturer      Profit                Hometown              State
                   Headquarter           Automobile
                   President             Age


     Equivalent Relational Schema:
     Auto (Auto-OID, Color, Company-OID)
     Company (Company-OID, Name, Profit, City-OID, People-OID)
     People (People-OID, Name, Age, City-OID, Auto-OID)
     City (City-OID, Name, State)
       Relational-to-OO Example (1)
    Global Query:
    Select Auto1.*                        Relational Predicate Graph:
    From   Auto Auto1, Auto Auto2,
           Company, People,                        1) Company-OID
                                         Auto1                      Company
           City City1, City City2
                                                                         2) People-OID
    Where Auto1.Conmpany-OID =
            Company,Company-OID AND                                  People
          Company.People-OID =            City2                      Age=52
1           People.People-OID AND
          People.Age = 52 AND                                             3) Auto-OID
2
          People.Auto-OID =
3                                                                     Auto2
            Auto2.Auto-OID AND             City1                     Color=red
4         Auto2.Color = “red” AND
          People.City-OID =
5
            City1.City-OID AND          Find all red cars own by a 52 year
6         City1.Name = City2.Name AND   old who is the President of the car
          Company.City-OID =            manufacturer and lives in the same
            City2.City-OID              city of the car manufacturer
 Relational-to-OO Example (2)
OO Predicate Graph:       Auto1
                                    Company-OID
                                                  Company
                                                       People-OID

                                                                Predicate 1
                           City1                   People
                                                   Age=52

                                                      Auto-OID
     Predicate 3
                                                    Auto2
                            City2                  Color=red
                                                               Predicate 2

  OO Query:
  Where Auto.Manufacturer.President.Age = 52 AND
        Auto.Manufacturer.President.Automobile.Color = red AND
        Auto.Manufacturer.Headquarter.Name =
                 Auto.Manufacturer.President.Hometown.Name
Global Query Optimization
 Global Query Optimization (1)
• A query obtained by the query modification
  process may still reference data from
  more than one database.
 Example: part. 3 (i.e., world(Emp1) ⋂ world(Emp2))
 on page 126
   Select Emp1.Name, Emp2.Rank
   From Emp1, Emp2      /* access two databases
   Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
            Emp1.Age > 35 AND
            Emp1.SSN = Emp2.SSN
  → Some global strategy is needed to process such queries
 Global Query Optimization (2)
• Select Emp1.Name, Emp2.Rank
  From Emp1, Emp2     /* access two databases
  Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
             Emp1.Age > 35 AND
             Emp1.SSN = Emp2.SSN
    → Some global strategy is needed to process such queries
  Site 1              Site 2


           Emp2                Emp2        Site 1              Site 2


                                                    Emp1                Emp2


   form                form
   result              result
                                                           form
                                                           result
    Emp1                Emp1
                                                                Site 3
             Site 2               Site 1
                        Data Inconsistency
       • If C is integrated from C1 and C2 with no
         data inconsistency on attribute A, then
            бA op a (C) = бA op a (C1) ⋃ бA op a (C2)
       • If A has data inconsistency, then the above
         equality may no longer hold.
              Example: Consider the select operation
EmpO
                                                 бEmpO.Salary > 100,000 (EmpO)
OID    SSN    Name    Salary    Age    Rank

 1     2222   Ahad    98,000    null   S. Mgr.    The correct answer should
 2     7531   Wang    95,000    mull   S. Mgr.    have the record for Smith.
 3     6789   Smith
                      Incon-
                                40      Mgr.
                                                  However, the above query
                      sistent
                                                  returns an empty set
 4     4321   Chang   62,000    30      null
                                                   Smith has a combined salary greater than 100,000
 5     8642   Patel   75,000    35      null
Data Inconsistency - Optimization
Express an outerjoin (or a generalization) as
  outer-unions as follows:
   C1 ⋈o C2 = C1-O ⋃o C2-O ⋃o (C1-C ⋈OID C2-C)
   C1-O: Those tuples of C1 that have no matching tuples
     in C2 (private part)
   C1-C: Those tuples of C1 that have matching tuples in
     C2 (overlap part)

бA op a (C1 ⋈o C2 ) = бA op a (C1-O) ⋃o бA op a (C2-O)
                         ⋃o бA op a (C1-C ⋈ C2-C)
                                     Can we improve this term ?
   Distribution of Selections (1)

бA op a (C1 ⋈o C2 ) = бA op a (C1-O) ⋃o бA op a (C2-O)
                        ⋃o бA op a (C1-C ⋈ C2-C)


When can we dustribute
б over ⋈ ?                        Expensive operation



                       Attribute A is defined by
                       an aggregate function
                       (see page 126)
     Distribution of Selection (2)
     Four cases were identified when all arguments of the aggregate
     function (for resolving conflicts) are non-negative

     1. f(A1,A2) op a ≡ A1 op a AND A2 op a:
Aggregate      бA op a (C1-C ⋈ C2-C) = бA op a (C1-C) ⋈ бA op a ( C2-C)
 function
            Example: max(Emp1-C.Salary, Emp2-C.Salary) < 30K
                       ≡ Emp1-C.Salary < 30K AND
                         Emp2-C.Salary < 30K

     2. f(A1,A2) op a ≡ f(A1 op a, A2 op a) op a:
               бA op a(C1-C ⋈ C2-C) = бA op a(бA1 op a(C1-C) ⋈ бA2 op a(C2-C))
            Example: sum(Emp1-C.Salary, Emp2-C.Salary) < 30K
                       ≡ sum(Emp1-C.Salary < 30K,
                         Emp2-C.Salary < 30K) < 30K
    Distribution of Selection (3)
3. f(A1,A2) op a ≡ f(A1 op’ a, A2 op’ a) op a:
       бA op a(C1-C ⋈ C2-C) = бA op a(бA1 op’ a(C1-C) ⋈
                                           бA2 op’ a(C2-C))
    Example: sum(Emp1-C.Salary, Emp2-C.Salary) = 30K
                 ≡ sum(Emp1-C.Salary ≤ 30K,
                        Emp2-C.Salary ≤ 30K) = 30K

4. No improvement is possible:
    Example: sum(Emp1-C.Salary, Emp2-C.Salary) > 30K
Distribution Rules for б over ⋈

           бA op a(C1-C ⋈ C2-C)

           op
    f            > ≥ ≤ < = ≠ in Not in
   sum(A1, A2)   4 4 2 2 3 4 4     4
   avg(A1, A2)   4 4 2 2 3 4 4     4
   max(A1, A2)   4 4 1 1 3 4 4     4
   min(A1, A2)   1 1 4 4 3 4 4     4
        Problem in Global Query
            Optimization (1)
Important information about local entity sets that is
needed to determine global query processing plans
may not be provided by the local database systems.
   – Example: cardinalities
                availability of fast access paths
   – Techniques:
      • Sampling queries may be designed to collect statistics
        about the local databases.
      • A monitoring system can be used to collect the
        completion time for subqueries. This can be used to
        better estimate subsequent subqueries.
         Problems in Global Query
             Optimization (2)
• Different query processing algorithms may have been
  used in different local database systems.
   → Cooperation across different systems difficult
      Examples: Semijoin may not be supported on some
                 local systems.
• Data transmission between different local database
  systems may not be fully supported.
   Examples:
   – A local database system may not allow update operations
   – For many nonrelational systems, the instances of one entity
     set are more likely to be clustered with the instances of
     other entity sets. Such clustering makes it very expensive to
     extract data for one entity set.
   → Need more sophisticated decomposition algorithms.

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:3
posted:11/12/2012
language:Unknown
pages:147
About Good!!!NICE!!! The best document database!