Docstoc

A Comparative Analysis of Methodologies for Database Schema

Document Sample
A Comparative Analysis of Methodologies for Database Schema Powered By Docstoc
					A Comparative Analysis of Methodologies
for Database Schema Integration
C. BATINI and M. LENZERINI
Dipartimento di Informutica e Sistemistica, University of Rome, Rome, Italy

S. B. NAVATHE
Database Systems Research and Development Center, Computer and Information Sciences Department,
University of Florida, Gainesville, Florida 32601




             One of the fundamental principles of the database approach is that a database allows a
             nonredundant, unified representation of all data managed in an organization. This is
             achieved only when methodologies are available to support integration across
             organizational and application boundaries.
                Methodologies for database design usually perform the design activity by separately
             producing several schemas, representing parts of the application, which are subsequently
             merged. Database schema integration is the activity of integrating the schemas of existing
             or proposed databases into a global, unified schema.
                The aim of the paper is to provide first a unifying framework for the problem of
             schema integration, then a comparative review of the work done thus far in this area.
             Such a framework, with the associated analysis of the existing approaches, provides a
             basis for identifying strengths and weaknesses of individual methodologies, as well as
             general guidelines for future improvements and extensions.

             Categories and Subject Descriptors: H.0 [Information  Systems]: General; H.2.1
             [Database Management]:      Data Models; Schema and Subschema; H.2.5: [Database
             Management]:    Heterogeneous Databases; D.2.1: [Requirements/Specifications]:
             Methodologies
             General Terms: Management
             Additional Key Words and Phrases: Conceptual database design, database integration,
             database schema integration, information systems design, models, view integration



INTRODUCTION                                             models, significantly    more powerful than
1.1 Schema Integration
                                                         primary data models in representing the
                                                         application of interest, have been proposed
Database management systems (DBMSs)                      (e.g., see Smith’s abstraction     hierarchy
have been developed in the past two dec-                 model [Smith and Smith 19771, the Seman-
ades using various data models and archi-                tic Data Model [Hammer and McLeod
tectures. The primary data models used for               19811, the TAXIS data model [Mylopoulos
implementation    are the hierarchical, net-             et al. 19801, DAPLEX [Shipman 19801, and
work, and relational data models. More re-               recently, the Galileo data model [Albano et
cently, several so-called semantic data                  al. 19851).

Authors in alphabetical order.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
0 1987 ACM 0360-0300/86/1200-0323        $1.50


                                                             ACM Computing Surveys, Vol. 18, No. 4, December 1986
 324         l       C. Batini,      M. Lenzerini,          and S. B. Navathe
     CONTENTS                                                       traditional    “data-processing-using-files”
                                                                    approach is that a database management
                                                                    system makes it possible to define an inte-
                                                                    grated view of relevant data for all appli-
     INTRODUCTION                                                   cations. This eliminates duplication, avoids
         I.1 Schema Integration                                     problems of multiple updates, and mini-
         1.2 View Integration in Database Design                    mizes inconsistencies    across applications.
         1.3 Database Integration
         I.4 Organisational    Context for Integration              Whereas the above claims of the database
         1.5 Structure of the Paper                                 approach are highly touted, database text-
     1. METHODOLOGIES               FOR SCHEMA                     books and survey literature to date have
        INTEGRATION                                                paid scant attention to this topic. At the
         1.1 An example
         1.2 Causes for Schema Diversity
                                                                   same time, research on the problem of
        1.3 Steps and Goals                                        integration    has proceeded, and most
             of the Integration Process                            of the researchers have suggested perform-
        1.4 Influence of the Conceptual Model                      ing the integration     activity as a part of
             on the Integration Process                            the conceptual design step. In this paper
    2. A COMPARISON             OF METHODOLOGIES
        2.1 Introduction
                                                                   we refer to the integration      activity by a
        2.2 Applicability    of Integration Methodologies          generic term, schema integration, which is
        2.3 Methodologies Viewed as Black Boxes                    defined as the activity of integrating the
        2.4 Gross Architecture       of Methodologies              schemas of existing or proposed databases
        2.5 Preintegration
        2.6 Comparison of Schema8
                                                                   into a global, unified schema. Schema
        2.7 Conforming of Schemas                                  integration,   as defined here, occurs in
        2.8 Merging and Restructuring                              two contexts:
    3. CONCLUSIONS            AND FUTURE          WORK
        3.1 General Remarks
        3.2 Missing Aspects
                                                                   (1) View integration (in database design)
        3.3 Future Research Directions                                 produces a global conceptual descrip-
    APPENDIX         1. A SUMMARY           DESCRIPTION                tion of a proposed database.
                        OF METHODOLOGIES                           (2) Database     integration (in distributed
    APPENDIX        2. THE ENTITY-RELATIONSHIP                         database management) produces the
                        MODEL
    ACKNOWLEDGMENTS
                                                                       global schema of a collection of data-
    REFERENCES                                                         bases. This global schema is a virtual
                                                                       view of all databases taken together in
                                                                       a distributed database environment.

    Since semantic models allow data to be                            The database technology has progressed
 described in a very abstract and under-                           to a level where thousands of organizations
 standable manner, they are currently used                         are using databases for their day-to-day,
 in designing the conceptual structure of                          tactical, and strategic management appli-
 databases. This conceptual activity is called                     cations. The distributed database manage-
 conceptual database design. Its goal is to                        ment area is also becoming sufficiently well
produce an abstract, global view of the data                       understood, and we expect to see a large
 of the application,       called   conceptual                     number of organizations changing to dis-
schema.                                                           tributed    databases by integrating      their
    The introduction  of a conceptual step in                     current diverse databases.
design methodologies is a fairly recent de-                           The contributions   to the state of the art
velopment. It allows designers and users to                       of database design methodologies, and in
cooperate in collecting requirements and                          particular schema integration,      have been
provides a high-level specification     of the                    particularly significant in the last ten years.
data involved in the application. Further-                        Our goal is to provide first a framework by
more, it simplifies the integration of differ-                    which the problem of schema integration
ing perspectives     and expectations     that                    can be better understood, and second a
various users have of the application.                            comparative review of the work done thus
   One of the basic motivations      for using                    far on this problem. Such a framework with
the database approach instead of the                              an associated analysis of the prevalent
ACM Computing     Surveys, Vol. 18, No. 4, December     1986
                Comparison of Methodologies     for Database Schema Integration                l      325
approaches provides                               guage (or data definition  language) of a
                                                  specific DBMS. Figure 1 shows the phases
(1) a conceptual foundation to the problem        of database design and the intermediate
    of schema integration;                        schema representations.   The phases of
(2) a basis upon which to identify strengths      database design are
    and weaknesses and the missing fea-
    tures about individual methodologies;          (1) Requirements    Specification and Analy-
(3) general guidelines for future improve-            sis. An analysis of the information      re-
    ments and extensions to the present               quirements of various areas within an
    approaches.                                       organization resulting in a preliminary
                                                      specification of the information      needs
In the next section we explain the view               of various user groups.
integration activity; Section I.3 is devoted      (2) Conceptual Design. Modeling and rep-
to database integration. In Section I.4 we            resentation of users’ and applications’
elaborate on the motivation      for investi-         views of information        and possibly a
gating integration. Finally, in Section I.5           specification of the processing or use of
we describe the general structure of the              the information. The final result of this
remainder of the paper.                               activity is a conceptual schema that
                                                      represents a global, high-level descrip-
                                                      tion of the requirements.
1.2 View Integration   in Database   Design
                                                  (3) Implementation     Design. Transforming
 The problem of database design is one of             a conceptual schema into the logical
 designing the structure of a database in a           schema of a DBMS. The second and
 given environment     of users and applica-          third phases taken together are called
tions such that all users’ data requirements          logical database design.
 and all applications’ process requirements       (4) Physical Schema Design and Optimi-
 are “best satisfied.” This problem has ex-           zation. Mapping the logical schema of
isted ever since DBMSs came into being.               a database into an appropriate stored
    The DBMSs that store and manipulate               representation    in a DBMS, including
 a database must have a definition of the             new physical parameters to optimize
 database in the form of a schema. This is            the database performance against a set
 termed the intension of the database. The            of transactions.
 actual values of data in a database are
called instances or occurrences of data.          Typically, the application design activity
 Sometimes they are termed the extension          proceeds in parallel with database design.
of a database, or just “the database.”            Hence, Figure 1 also shows specifications
Whereas the extension of a database keeps         related to applications as the outputs of the
changing with time, the intension of the          last two phases.
database is supposed to be time invariant.           As shown in Figure 1, the activity of view
The database design problem aims at de-           integration   can be performed at several
signing the intension schema of the data-         points of the database design process. It
base, which includes logical specifications       usually is performed during conceptual de-
 such as groupings of attributes and rela-        sign. In that case, its goal is to produce an
tionships among these groupings (logical          integrated schema starting from several ap-
schema), as well as physical specifications       plication views that have been produced
such as the type of access to records, in-        independent1y.l
dexes, ordering, and physical placement
 (physical schema). On the basis of this dis-     ‘There is a body of work that regards conceptual
tinction, the corresponding database design       design as an activity that considers the application as
activities are termed logical schema design       a whole, thus producing a single schema. This includes
andphysical schema design. Logical schema         Batini et al. 119841, Biller and Neuhold 119821, Brodie
                                                  [1981], Brodie and Zilles [1981], Ceri [i983];Ceri    et
design involves the problem of designing          al. [1981], Chen [1983], Lum et al. [1970], Olle et al.
the conceptual schema and mapping such            [1982], Rolland and Richards        [1982], and Sakai
a schema into the schema definition lan-          [1981].

                                                      ACM Computing Surveys, Vol. 18, No. 4, December 1986
 326      .         C. Batini,      M. Lenzerini,         and 5’. B. Navathe
                                          User Requirements in
                                          an Application Domain


                                                      1

                                    1      Requirement    Analysis      1



                                                      1
                                   Specification and Representation
                                           of Information Needs     *----v---y
                                                                                               \
                                   of Organization and User Groups
                                                                                                   I
                               c
                                                      1                                            I




        Logical                          Global Conceptual Schema

         Design



                                    +,

                                                                                                   i
                                                                                               J
                                                                            *--------~
                                       1                           1
                               ktgical Schema             Functional Design
                                                           of Applications




                               Implemented                   Application
                             Database Schema                  Programs

              Figure 1.    Phases of database design. (Adapted from Navathe and Schkolnick [1978].)


  The reason for integration is twofold:                                    their own reauirements and exnecta-
                                                                            tions of data, -which may conflict with
(1) The structure of the database for large                                 other user groups.
    annlications (orzanizations) is too com-
    plex to be modeTedby a single designer                              Another possibility (Figure 1) is to per-
    in a single view.                                                form integration even before the “concep-
(2) User groups typically operate inde-                              tual design” step is undertaken. In this
    pendently in organizations and have                              case, view integration still occurs; however,

ACM Computing     Surveys, Vol. 18, No. 4, December   1986
              Comparison of Methodologies    for Database Schema Integration                   l        327
views are less formal and are mostly in the      requests of users-retrievals as well as up-
form of narrative descriptions of require-       dates-from such a semantic data model
ments. The last possibility shown in the         into the actual databases.
figure is to perform integration after the           The database integration activity is de-
implementation design step, that is, start       scribed in a general way in Figure 2. It
from schemas expressed as implementable          shows that this activity has as input the
logical schemas. This is the approach fol-       local schemas and the local queries and
lowed in methodologies based strictly on         transactions. Most existing work, however,
the relational model (see Al-Fedaghi and         does not explicitly take into account the
 Scheuermann [ 19811and Casanova and Vi-         latter process-oriented information in de-
 da1 [1983]) that do not advocate a concep-      veloping the integrated schema. It is
tual step and model requirements directly        strictly used in mapping the queries (query
 in terms of the relational model.               mapping) between the global and the local
                                                 levels. Hence, we show the global schema
                                                 as well as the data and query-mapping spec-
1.3 Database Integration                         ifications to be the outputs of the database
Database integration is a relatively recent      integration activity.
problem that has appeared in the context
of distributed databases. A distributed da-      1.4 Organizational Context for Integration
tabase is a collection of data that logically
belong to the same system but are spread         Thus far we have pointed out how schema
over the sites of a computer network [Ceri       integration arises in database design. As we
and Pelagatti 19841. Distributed databases       survey the work on schema integration, it
and distributed database management sys-         is worthwhile to point out an organizational
tems can be classified into two major cate-      context for this important area.
gories: homogeneous, dealing with local             There is a growing trend to regard data
databases having the same data model             as an autonomous resource of the organi-
and identical DBMSs, and heterogeneous,          zation, independent of the functions cur-
having a diversity in data models and            rently in use in the organization [National
DBMSs. The term Federated Database is            Bureau of Standards 19821.There is a need
used (e.g., in McLeod and Heimbigner             to capture the meaning of data for the
 [1980]) to refer to a collection of databases   whole organization in order to manage it
in which the sharing is made more explicit       effectively. Because of this awareness, in-
by allowing export schemas, which define         tegration of data has become an area of
the sharable part of each local database.        growing interest in recent years.
Each application is able to design its own          One of the fundamental principles of the
global schema by integrating the export          database approach is that a database allows
schemas.                                         a nonredundant, unified representation of
   The above contexts require that an in-        all data managed in an organization. This
tegrated global schema be designed from          is true only when methodologies are avail-
the local schemas, which refer to existing       able to support integration across organi-
databases. This too can be considered a          zational and application boundaries. More
database design activity. Existing work on       and more organizations are becoming aware
database integration included in our survey      of the potential of database systems and
implicitly addresses this problem. The au-       wish to use them for integrated applications
thors of these works [Dayal and Hwang            and not just as software for fast retrieval
1984; ElMasri et al. 1987; Mannino and           and updating of data.
Effelsberg 1984a; Motro and Buneman                 Even when applications and user groups
19811 use a semantic data model as an            are structurally disconnected, as in most
intermediate model to facilitate the inte-       governmental and large administrative set-
gration. Another implicit assumption they        ups, there is something to be gained by
make is that the heterogeneous database          having an enterprise-wide view of the data
management system is able to map the             resource. This potentially affords individ-

                                                    ACM Computing   Surveys, Vol. 18, No. 4, December   1986
 328       l      C. Batini, M. Lenzerini, and S. B. Navathe
                                   Local Database                  Local Database
                                      Schemas                   Queries/Transactions




                 A global                                                                    Mapping of
                 database                             Data Mapping                     queries/transactions
                  schema                              from global                           from global
                                                         to local                       to local databases


                               Figure 2.    Inputs and outputs of database integration.


ual applications to “build bridges” among                           In Section 1 we establish the general
themselves and understand how the data-                         framework for a comparison of schema
bases or files relate to one another.                           integration    methodologies.   An example
    With the increasing use of databases, we                    introduces the aspects that influence the
expect the integration problem to be more                       integration process; we then identify the
severe and pervasive. New technologies of                       activities usually performed during schema
networking, distributed databases, knowl-                       integration. These activities are used as a
edge-based systems, and office systems will                     basis for comparing methodologies. Finally,
tend to spread the shared use of data in                        we examine the influence of the conceptual
terms of number of users, diversity of ap-                      model on the overall integration process.
plications, and sophistication   of concepts.                       Section 2 is devoted to a detailed com-
Design, manufacturing,       and engineering                    parative analysis of the methodologies. The
applications are becoming centered around                       results of the analysis are presented in the
database management systems. The need                           following format:
for methodologies for integrating data in
its diverse conceptual and physical forms                       (1) Tables      illustrating    comparative     fea-
is thus expected to increase substantially.                          tures. The table entries are drawn from
                                                                     the original publications on the meth-
                                                                     odologies and are not exhaustively ex-
1.5 Structure of the Paper                                          plained. However, we extract salient
                                                                     features and trends that are evident in
As described in Section 1.1, our main goals                         these tables. We highlight the approach
are to provide a conceptual foundation for                          of a specific methodology when it ex-
schema integration and perform a detailed                           plains a specific feature.
comparison of existing work in this area.
                                                                (2) Because of the diversity               of the
   It is possible to classify the existing lit-                     data models (entity-relationship,            en-
erature into two categories:                                        tity-category-relationship,        functional,
(A) Complete methodologies for schema in-                           structural, Navathe-Schkolnick,            rela-
    tegration. These include view integra-                          tional, and generalized entity manipu-
    tion and database integration.                                  lator) used in the methodologies, we
(B) Related works addressing specific is-                           have adopted a uniform              treatment
    sues of schema integration.                                     of concepts primarily          based on the
                                                                    entity-relationship      model. The entity-
In the References, the relevant literature                 is       relationship       model is briefly summa-
placed under Categories A and B.                                    rized in Appendix 2.
ACM Computing   Surveys, Vol. 18, No. 4, December   1986
             Comparison of Methodologies for Database Schema Integration                    l        329



   In Section 3, we present the conclusions Topics. Observe the corresponding change
of this investigation, identify missing as- in the second schema as we go from (a) to
pects and open problems, and indicate some (b) in Figure 4. When we look at the new
research directions.                          schemas (Figure 4b), another difference we
   We compare 12 different complete meth- notice is that Publisher is present in the
odologies. A summary description of each two schemas with different types: It is an
is in Appendix 2. The data model used, entity in the first schema and an attribute
inputs and outputs, the general strategy, in the second. The reason for choosing dif-
and special features are briefly described ferent types (attribute vs. entity) comes
for each methodology.                         from the different relevance that Publisher
   In order to make the treatment of schema has in the two schemas. However, we have
integration uniformly applicable to both to conform the two representations if we
the view integration as well as the database want to merge them. Therefore we trans-
integration contexts, we use the following form the attribute Publisher into an entity
terminology:                                  in the second schema and add a new attri-
                                             bute, Name, to it (see Figure 4~). We now
   General terms used for schema integra- can superimpose the two schemas, produc-
      tion: Component Schema, Integrated ing the representation in Figure 4d. We
      Schema.                                have not finished merging yet, since we
   View integration context: user view, con- have to look for properties that relate con-
      ceptual view.                          cepts belonging to different schemas, which
   Database integration      context: local were “hidden” previously. This is the case
      schema, global schema.                 with the subset relationship between the
                                             concepts Book and Publication. We can
                                             add such a subset relationship to the
1. METHODOLOGIES      FOR SCHEMA             merged schema, producing the result shown
    INTEGRATION                              in Figure 4e. Now, to simplify the represen-
1.1 An Example                               tation, we can restructure the schema by
                                             dropping the properties (relationships and
In order to introduce the reader to the main attributes) of Book that are common to
features and problems of schema integra- Publication. This is allowable since the
tion, we present an example. In Figure 3, subset relationship implies that all the
we show two descriptions of requirements properties of publications are implicitly in-
and corresponding possible conceptual herited by Book. The final schema is shown
schemas that model them.                     in Figure 4f.
   The following additional information
applies to this example:
                                              1.2 Causes for Schema Diversity
(1) The meaning of “Topics” in the first      The example of schema integration used
    schema is the same as that of “Key-
    word” in the second schema.               above is obviously a “toy example” that
(2) “Publication” in the second schema is     highlights some of the basic problems in-
    a more abstract concept than “Book”       volved. That the integration of realistic
    in the first schema. That is, “Publica-   sized component schemas can be a complex
    tion” includes additional things such     endeavor is amply evident from this ex-
    as proceedings, journals, monographs,     ample.
    etc.                                         The basic problems to be dealt with dur-
                                              ing integration come from structural and
Figure 4 shows a set of activities that may   semantical diversities of schemas to be
be performed to integrate the schemas.        merged. Our investigation of integration
   Let us look at the two schemas in Fig-     starts with a classification of the various
ure 4a. Topics and Keywords correspond to     causes for schema diversity, which are dif-
the same concept. Since we have to merge      ferent perspectives, equivalence among
the schemas, the names should be unified      constructs of the model, and incompatible
into a single name. Let us choose the name    design specifications.
                                                 ACM Computing   Surveys, Vol. 18, No. 4, December   1986
330      l       C. Batini,     M. Lenzerini,        and S. B. Navathe


                                             Books have titles. They are published
                                           by Publishers with names and addresses.
                                               Books are adopted by Universities




                                               publications    of different  types.
                                                 Each publication      has a lille.
                                             a publisher    and a list of keywords.
                                              Each keyword consists of a name.
                                                  a code and a research area.




                                                                           Area


                        Figure 3. Examples of requirements and corresponding schemas.


1.2.1 Different Perspectives                                           Another example is given in Figure 5, in
                                                                    which the two schemas represent informa-
In the design process, different user groups                        tion about employees and their depart-
or designers adopt their own viewpoints in                          ments. In Figure 5a information is modeled
modeling the same objects in the applica-                           by means of the relationship E-D. In Figure
tion domain. For instance, in the example                           5b, relationship E-P relates the employees
in Section 1.1, different names were at-                            with projects, whereas relationship P-D as-
tached to the same concept in the two                               sociates projects with departments. It is
views.                                                              assumed that an Employee “belongs to”


ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                        Title                           t4an-e
             Book

Address




                                                                  L

                       Figure 4a. Original   schemas.



                        Title                           Name




          Figure 4b. Choose “Topics”   for “Keyword”     (Schema 2).



                        Title                           Name




                                                                       Area


          Figure 4c. Make Publisher    into an entity   (Schema 2).
                                                 Title




    Figure 4d. Superimposition     of schemas.




Figure 4e. Creation of a subset relationship.
                                                                                 Title




Figure 4f. Drop the properties   of Book common
to Publication.




                                                  Figure 4. An example of integration.
334       l         C. Batini,      M. Lenzerini,     and S. B. Navathe

                                               Employee
                                                              shows that an Employee is always assigned
                                                              to a unique project, since the cardinality
                                                              constraint 1 : n has been specified. The
                                                              correct situation (that an Employee may be
                                                              assigned to many projects) appears in
                                                              Schema 2.
                                                                 These three aspects are concerned with
                                                              what we can call the common part of the
                                                              various schemas, that is, the set of concepts
  1 Department      1
                                                              of the application domain that are repre-
                                                              sented in all of the schemas. In other words,
                                                              the above aspects represent the reasons
                                                              why the common part may be modeled in
                                            \Department   1   different ways in different schemas.
                                                                 In order to perform integration, it is CN-
              Figure 5. Different    perspectives.            cial to single out not only the set of common
                                                              concepts but also the set of different con-
                                                              cepts in different schemas that are mu-
those departments that are involved in the                    tually related by some semantic properties.
projects the employee works on. Therefore                     We refer to these as interschemaproperties.
the relationship between Employee and De-                     They are semantic relationships holding
partment is perceived as a direct relation-                   between a set of objects in one schema and
ship in one schema, whereas it is seen via                    a different set of objects in another schema.
the entity Project and two relationships in                   In the rest of this section, we provide a
another.                                                      further taxonomy to address correspond-
                                                              ences among common concepts and con-
1.2.2 Equivalence among Constructs                            cepts related by interschema properties.
      of the Model
                                                              1.2.4 Common Concepts
Typically, in conceptual models, several
combinations of constructs can model the                      Owing to the causes for schema diversity
same application domain equivalently. As                      described above, it may very well happen
a consequence, “richer” models give rise to                   that the same concept of the application
a larger variety of possibilities to model the                domain can be represented by different rep-
same situation. For example, in Figure 3,                     resentations R1 and Rz in different schemas,
the association between Book and Pub-                         and several types of semantic relationships
lisher was modeled as an attribute of Pub-                    can exist between such representations.
lisher in one schema and as a relationship                    They may be identical, equivalent, compat-
between Book and Publisher in the other.                      ible, or incompatible:
Figure 6 shows another example of equiv-
alent constructs. Man and Woman are dis-                      (1) Identical:   R1 and RP are exactly the
tinguished by a generalization hierarchy in                       same. This happens when the same
the first schema, whereas in the second                           modeling constructs are used, the same
schema they are distinguished by the dif-                         perceptions are applied, and no inco-
ferent values of the attribute Sex.                               herence enters into the specification.
                                                              (2) Equivalent:   R1 and RP are not exactly
                                                                  the same because different but equiva-
1.2.3 Incompatible        Design Specifications
                                                                  lent modeling constructs have been ap-
Erroneous choices regarding names, types,                         plied. The perceptions are still the
integrity constraints, etc. may result in er-                     same and coherent. Several definitions
roneous inputs to the schema integration                          of equivalence have been proposed in
process. A good schema integration meth-                          the literature (see Atzeni et al. [1982],
odology must lead to the detection of such                        Beeri et al. [1978], Biller [ 19791, Na-
errors. Schema 1 in Figure 7 erroneously                          vathe and Gadgie [1982], Ng et al.

ACM Computing Surveys, Vol. 18, No. 4, December 1986
               Comparison of Methodologies for Database Schema Integration                                     l          335

                            Person




                         T    Sex




                               (a)                                                              (b)
             Figure 6.   Equivalent   constructs.    (a) Generalization   hierarchy.   (b) A single entity.




7   Project
          1                               Lr’
                                                Project
                                                          n

                                                                    Figure 7.    Incompatible    design specifications.
3                                         3                           SchemaMmna
                                                                    (a)    W
                                                                          1.   2.

       (a)                                          (b)



 [1983], Rissanen [1977]). Although                           (3) Compatible: R1 and RS are neither
several semantic data models are in                               identical nor equivalent. However, the
existence today, the authors of these                             modeling constructs, designer percep-
models do not provide any criteria for                            tion, and integrity constraints are not
equivalence of concepts. Definitions                              contradictory.
are typically based on three different                        (4) Incompatible:   I$ and Rz are contra&c-
types of equivalence:                                             tory because of the incoherence of the
                                                                  specification.
(4 Behavioral:     R1 is equivalent to Rz                     Situations (2), (3), and (4) above can be
    if for every instantiation     of R1, a                   interpreted as conflicts. Conflicts and their
    corresponding instantiation       of Rz                   resolutions are central to the problems of
    exists that has the same set of an-                       integration. A general definition of the term
    swers to any given query and vice                         conflict would be as follows:
    versa [Atzeni et al. 19821.
(b) Mapping: R, and Rz are equivalent                            A conflict between two representations
    if their instances can be put in a                           R, and RP of the same concept is every
    one-to-one     correspondence      (e.g.,                    situation that gives rise to the represen-
    see Rissanen [ 19771).                                       tations R1 and R2 not being identical.
(c) Transformational:     R1 is equivalent
    to Rz if RP can be obtained from R1                        1.2.5 Concepts Related
    by applying a set of atomic trans-                               by Some Semantic Property
    formations that by definition pre-
    serve equivalence. (Navathe and                           Regarding      the concepts in component sche-
    Gadgie [ 19821 call this “restructure                     mas that     are not the same but are related,
    equivalence.“)    This technique is                       we need      to discover all the interschema
    common in other disciplines (e.g.,                        properties    that relate them. In Figure 8, we
    program equivalence).                                     show two     examples of interschema proper-


                                                                   ACM Computing Surveys, Vol. 18, No. 4, December        1986
336        l      C. Batini,       M. Lenzerini,        and S. B. Navathe


                                                                                                    Application
                                                                                                      Domains




                                                                                                          Schema   1



                    Interschema
               Subset      Relationship



                                                                                                          Schema   2

                                                                 (a)




                                                                 lb)

                          Figure 8.       Interschema   properties.     (a) Example   1. (b) Example 2.


ties. The subset relationship among Person                             1.3.1 Preintegration
and Employee (Example 1) and the rela-                                 An analysis of schemas is carried out before
tionship    “Belongs-to”    between Country                            integration to decide upon some integration
and State (Example 2) are interschema                                  policy. This governs the choice of schemas
properties that could not be perceived in                              to be integrated, the order of integration,
the original component schemas.                                        and a possible assignment of preferences to
                                                                       entire schemas or portions of schemas. Giv-
                                                                       ing preference to financial        applications
1.3 Steps and Goals
                                                                       over production applications is one exam-
    of the integration        Process
                                                                       ple of an integration policy that could be
Thus far, we have discussed the nature of                              set by management.
the schema integration problem and iden-                                  Global strategies for integration, namely,
tified the causes and implications       of                            the amount of designer interaction and the
schema diversity. How do the methodolo-                                number of schemas to be integrated at one
gies accomplish the task of integration?                               time, are also decided in this phase. Collec-
Each methodology follows its own solution                              tion of additional information      relevant to
procedure.    However,    any methodology                              integration,   such as assertions or con-
eventually can be considered to be a mix-                              straints among views, is also considered to
ture of the following activities.                                      be a part of this phase.

ACM Computing Surveys, Vol. 18, No. 4, December 1986
                  Comparison of Methodologies      for Database Schema Integration                     l        337

1.3.2 Comparison of the Schemas                          1.4 Influence of the Conceptual         Model
                                                             on the Integration Process
Schemas are analyzed and compared to
determine   the correspondences   among              All of the above issues and activities are
concepts and detect possible conflicts.              strongly influenced by the data model used
Interschema properties may be discovered             to express conceptual schemas. The rela-
while comparing schemas.                             tionship between the comparison and con-
                                                     forming activity and the choice of data
                                                     model is apparent in all the methodologies
    1.3.3 Conforming the Schemas                     that perform these activities “by layers”
Once conflicts are detected, an effort is            [Batini et al. 1983; ElMasri et al. 1987;
made to resolve them so that the merging             Kahn 1979; Navathe and Gadgil 1982; Teo-
of various schemas is possible. Automatic            rey and Fry 1982; Wiederhold and ElMasri
conflict resolution is generally not feasible;       1979; Yao et al. 19821. These layers corre-
close interaction with designers and users           spond to the different semantic constructs
is required       before  compromises      can       supported by the model; Table 1 makes an
be achieved in any real-life integration             interesting point concerning the specific
activity.                                            order of the layers of schema constructs
                                                     used in the methodologies. The comparison
                                                     activity focuses on primitive objects first
    1.3.4 Merging and Restructuring                  (e.g., entities in the entity-relationship
                                                     model); then it deals with those modeling
Now the schemas are ready to be superim-             constructs     that   represent    associations
posed, giving rise to some intermediate              among primitive objects (e.g., relationships
integrated   schema(s). The intermediate             in the entity-relationship      model). Note
results are analyzed and, if necessary,              that relational-model-based     methodologies
restructured    in order to achieve several          do not show up in this table because the
desirable qualities. A global conceptual             relation is their only schema construct.
schema may be tested against the following              A few qualitative        observations     can
qualitative criteria:                                be made concerning the relative merit of
                                                     different models.
l    Completeness and Correctness. The in-              A simpler data model, that is, one with
     tegrated schema must contain all con-           fewer data-modeling constructs, properties,
     cepts present in any component schema           and constraints has an advantage in con-
     correctly. The integrated schema must be        forming and merging activities. This stems
     a representation of the union of the ap-        from various factors:
     plication     domains associated with the
     schemas.                                        l    the possibility of type conflicts is smaller;
l    Minimality.     If the same concept is rep-     l    the    transformation     operations      are
     resented in more than one component                  simpler;
     schema, it must be represented only once        l    merging involves fewer primitive opera-
     in the integrated schema.                            tions.
l     Understandability.         The integrated
     schema should be easy to understand for         On the other hand, a simpler model consti-
     the designer and the end user. This im-         tutes a weaker tool in the hands of the
     plies that among the several possible rep-      designer in discovering similarities, dissim-
     resentations      of results of integration     ilarities, or incompatibilities.   Models with
     allowed by a data model, the one that is        a rich set of type and abstraction mecha-
     (qualitatively)    the most understandable      nisms have the advantage of representing
     should be chosen.                               predefined groupings of concepts and al-
                                                     lowing comparisons at a higher level of
We make use of the above four phases of              abstraction.
schema integration for analyzing and com-                Schema integration       comes about when
paring different methodologies.                      the design of a global schema is attempted

                                                            ACM Computing   Surveys, Vol. 18, No. 4, December   1986
338       9      C. Batini,     M. Lenzerini,         and S. B. Navathe
                          Table 1. Order of Schema Constructs Subjected to Integration
                              Reference                         Phase 1                 Phase 2
                Batini et al. [ 19831                      Entities               Relationships
                ElMasri et ai. [1987]                      Object classes         Relationship classes
                Kahn [ 19791                               Entities               Relationships
                Navathe and Gadgil [ 19821                 Objects                Connections
                Teorey and Fry [ 19821                     Aggregations           Generalizations
                Yao et al. 119821                          Obiects                Functions
                Wiederhold and ElMasri [ 19791             Primary relations      Connections



using the “divide and conquer” philosophy.                         rally avoid dealing with a large subset
It is an inherent attribute or property of                         of the possible conflicts. The semantic-
this philosophy that the “global character-                        model-based methodologies in general
istics” that cannot be captured by the in-                         allow a larger amount of freedom in
dividual views must be added when a global                         terms of naming, compatible and in-
view becomes available. Consider the rela-                         compatible      design perspectives,    etc.
tive advantage of the entity-relationship                          Correspondingly,      they deal with a
model over the relational model in this                            much wider spectrum of conflicts.
respect.                                                       (2) The more recent relational-model-
    Referring to the example in Section 1.1,                       based methodologies (e.g., Biskup and
adding the subset between Book and Pub-                            Convent [ 19861, Casanova and Vidal
lication allowed us to incorporate a “global                        [1983]) use inclusion, exclusion, and
characteristic” that was not evident in com-                       union     functional    dependencies      in
ponent schemas. The relational         model                       addition to conventional functional de-
lacks this modeling feature. Hence, it could                       pendencies. An inclusion       (exclusion)
only be captured by expressing and enforc-                         dependency is used to constrain the set
ing it as a part of the transactions      that                     of values associated with a given attri-
operate on the global schema or by defining                        bute to be subset of (disjoint from) the
new dependencies such as inclusion inter-                          set of values corresponding to another
dependencies (e.g., as in Casanova and                             attribute. By making use of these de-
Vidal [ 19831).                                                    pendencies, they claim to achieve the
    The current body of work on the schema                         same semantic expressiveness as the
integration problem can be divided into two                        semantic models. Owing to the well-
schools: one using the relational or func-                         defined semantics of the relational
tional models and one using semantic                               model in terms of set theory and de-
models. Among the semantic models, the                             pendency theory [Maier 1983; Ullman
entity-relationship   model and its variants                       19821, they are able to address the
are dominant.                                                      problem of minimal redundancy in a
    A few observations can be made when                            formal wav.
comparing these two schools:
(1) Methodologies     using the relational                     2. A COMPARISON         OF METHODOLOGIES
    model ([AlIFedaghi and Scheuermann                         2.1 Introduction
    1981; Casanova and Vidal 19831) make
    the universal relation schema assump-                      There are 12 different complete methodol-
    tion; that is, every attribute name is                     ogies that we consider (see Category A Ref-
    unique, across the entire database. As                     erences). We have placed a summary
    a consequence, problems related to                         description of each in Appendix 2, which
    naming and contradictory         specifica-                include the data model used, inputs and
    tions are ignored. Furthermore,        they                outputs, the general strategy, and special
    are not really able to state different                     features.
    perspectives (e.g., contradictory    func-                    In this section, each methodology is an-
    tional dependencies in two views at the                    alyzed and compared on the basis of some
    start would not be allowed) and natu-                      common criteria. In Section 2.2, we concen-

ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                    Comparison of Methodologies            for Database Schema Integration                       339



                                         Table 2.   Placement of Methodologies
             Phases of database design                                             References
Between requirement      analysis and conceptual              Kahn [1979]
  design
Conceptual    design                                         Batini and Lenzerini [1984], ElMasri et al. [1987],
                                                               Navathe and Gadgil [ 19821, Teorey and Fry [ 19821,
                                                               Wiederhold and ElMasri [1979]
Implementation      design                                   Al-Fedaghi and Scheuermann [ 19811, Casanova and
                                                                Vidal [1983], Yao et al. [1982]



trate on the phases of database design,                           Performing integration during the re-
where the integration methodologies are                        quirements analysis phase is difficult be-
most applicable. It is seen there that the                     cause user requirements are generally very
different methodologies apply to different                     poorly structured, and are difficult to deal
portions of the design process from require-                   with in terms of a formal methodology in-
ments analysis to implementation design.                       volving a semantic analysis. Among the
We deepen the framework provided in Sec-                       methodologies, only that of Kahn [1979]
tion 1 by first considering these methodol-                    can be considered applicable to the require-
ogies as “black boxes” and examine their                       ments analysis phase. There, a loosely
inputs and outputs. Then we deal with their                    structured data model is used that resem-
general structure by examining the proce-                      bles those used for collecting requirements
dures that they follow in terms of the four                    specifications.
main activities: preintegration, compari-                         On the other hand, performing integra-
son, conforming, merging, and restructur-                      tion during the implementation design
ing. Finally, we describe each of these                        phase is difficult because representations
activities in detail.                                          at that point do not allow one to make
                                                               effective use of abstractions. Methodologies
                                                               such as those of Al-Fedaghi and Scheuer-
2.2 Applicability      of Integration    Methodologies         mann [1981] and Yao et al. [1982] are able
A majority of the methodologies being an-                     to do integration as a part of the logical
alyzed here fall into the class of view inte-                 design phase by working with the relational
gration methodologies. In fact, all except                     (or a functional) data model and various
those of Dayal and Hwang [ 19841, Motro                       types of dependencies. Pure relational syn-
and Buneman [1981], and Mannino and                           thesis algorithms (e.g., Bernstein [ 19761
Effelsberg [1984a] belong to this class.                      and Biskup et al. [1979]) can also be con-
That of ElMasri et al. [1987] belongs                         sidered examples of this approach. As such,
to both view integration and database                         they do not deal with the more powerful
integration.                                                  semantic constructs or abstractions.
   There is little choice in terms of deciding                    The above observations suggest that the
when schemas are integrated in the case of                    preferred phase for integration is the con-
database integration; it has to be performed                  ceptual design phase, where the use of ab-
on the basis of existing local database sche-                 straction is very helpful in comparing and
mas when a global interface is desired to                     conforming different perceptions of the ap-
access them. In contrast, view integration                    plication domain by different user groups.
can occur at different times (see Figure 1).                      Another viewpoint regarding the phase
It is therefore worthwhile to consider the                    when schema integration should be per-
correspondence between the phases of                          formed may be stated in terms of the fol-
database design and the various view                          lowing statements:
integration methodologies.
   Table 2 shows the phases at which the                      (1) Perform integration as early as pos-
various view integration methodologies can                        sible because the cost of carrying
be considered to be best applicable.                              erroneous/inconsistent data increases
                                                                 ACM Computing Surveys, Vol. 16, No. 4, December 1966
340       l      C. Batini,   M. Lenzerini,     and S. B. Navathe

    during the life cycle of the database                Assertions. These correspond to con-
    and the application.                                 straints. Intraview    assertions are con-
(2) Perform integration    only after com-               straints defined on concepts within one
    plete, correct, minimal, unambiguous                 schema, whereas interview assertions are
    representations are available.                       constraints among concepts belonging to
                                                         different views. Methodologies that as-
    This again leads one to the conclusion
                                                         sume interview assertions to be input
that schema integration should be placed
                                                         implicitly   require    that some global
after requirements analysis but before im-
                                                         knowledge pertaining to the diverse ap-
plementation    design. Methodologies    [Ba-            plications is supplied to the designer.
tini and Lenzerini 1984; ElMasri et al. 1987;            Modified assertions in the output are re-
Navathe and Gadgil 1982; Teorey and Fry
                                                         vised constraints.
1982; Wiederhold and ElMasri 19791 in-                   Processing Requirements. These refer to
deed confirm this position. We have placed               the operations defined on component
these methodologies under “conceptual de-                views. They may be specified in the form
sign” in Table 2 according to the present                of a high-level     data manipulation   or
terminology. Database integration can be
                                                         query language.
considered to apply more to the conceptual               Mapping Rules. These define the map-
design phase rather than the other two. The              ping from queries (operations) applicable
above point of view is confirmed by [Dayal               to component schemas to queries (oper-
and Hwang 1984; ElMasri et al. 1987; Man-                ations) against the integrated schema.
nino and Effelsberg 1984a; Motro and                     Statement of Conflicts. This is a set of
Buneman 19811 in that for doing database
                                                         conflicts that the designer is not able to
integration they advocate translating het-               resolve and is beyond the scope of the
erogeneous logical schemas into conceptual               methodology to resolve automatically.
data representations. Hence, all methodol-
ogies for database integration    [Dayal and                One issue deserving special attention is
Hwang 1984; ElMasri et al. 1987; Mannino                the treatment of processing requirements.
and Effelsberg 1984a; Motro and Buneman                 Some methodologies         [Al-Fedaghi    and
 19811 are placed in that category.                     Scheuermann 1981; Batini and Lenzerini
                                                        1984; Casanova and Vidall983; Kahn 1979;
                                                        Teorey and Fry 1982; Wiederhold           and
2.3 Methodologies      Viewed as Black Boxes            ElMasri 19791 ignore processing require-
                                                        ments totally. Navathe and Gadgil [1982]
The basic input to schema integration is a
                                                        and Yao et al. [1982] refer to the transac-
number of component schemas and the
                                                        tions and queries on component schemas
basic output is an integrated schema.
                                                        that have to be supported after integration.
   Table 3 shows the specific inputs and
                                                        Navathe and Gadgil [1982] show that this
outputs taken into account by different
                                                        support of processing requirements is pro-
methodologies. Since Navathe and Gadgil
                                                        vided by a set of mapping rules. In Dayal
[ 19821 represented the view integration
                                                        and Hwang [1984] and Motro and Bune-
process with the most comprehensive list-
                                                        man [1981] the query modification process
ing of inputs and outputs, which roughly
                                                        is addressed in detail to deal with the proc-
represent a union of all methodologies, we
                                                        essing of local queries on the global data-
discuss their terminology:
                                                        base. Batini et al. [1983] and Yao et al.
l   Enterprise View. Pertinent only to view              [1982] consider the problem of query mod-
    integration, and not to database integra-           ification during view integration.
    tion, this view is an initial conceptual                We can conclude that a complete treat-
    schema which is the enterprise’s view of            ment of processing requirements during in-
    the most important and stable concepts              tegration is not present in any of the meth-
    in the application domain. Having such              odologies surveyed. Some recent proposals
    a view at one’s disposal makes the activ-           have been made to combine process de-
    ities of comparing and conforming views             sign with database design [Carswell and
    easier.                                             Navathe 19861.

ACM Computing Surveys, Vol. 18, No. 4, December 1986
                     Comparison of Methodologies for Database Schema Integration                                            341
                                             Table 3. Inputs and Outputs
                Reference                                 Inputs                                   outputs
Al-Fedaghi     and Scheuermann             n External   views                        n External schemas
   [1981]                                                                            Conceptual schema
                                                                                     Mapping between external
                                                                                       schemas and conceptual
                                                                                       schema
Batini    and Lenzerini     [ 19841        User schemas                              Global schema
                                           Weights for schemas
                                           Enterprise schema
Casanova and Vidal [ 19831                 User views                                Conceptual   schema
Dayal and Hwang [ 19841                    Local schemas of existing                 Global interface to databases
                                             databases                               Modified queries
                                           Queries
ElMasri    et al. [1987]                   Local schemas                             Global schema
                                           Interschema assertions                    Mapping rules
Kahn [1979]                                Local information       structures        Global information      structure
Motro and Buneman           [ 19811        Logical schemas                           Super-view
                                           Database queries                          Modified queries
Mannino      and Effelsberg    [1984a]     Local schemas                             Global schema
                                           Interschema assertions about              Mapping rules
                                              entities and attributes                Definition of integration     schema
                                                                                       objects
Navathe and Gadgil [ 19821                 Enterprise view                           Global view
                                           Local views                               Mapping rules
                                           Interview assertions                      Modified assertions
                                           Intraview assertions                      Conflicts
                                           Processing requirements
Teorey and Fry [1982]                      Information,  application, event,         Global information      structure
                                              corporate perspectives                 Conflicts
                                           Policy guidance and rules
Yao et al. [ 19821                         Views                                     Global view
                                           Processing specifications                 Modified processing specification
Wiederhold     and ElMasri      [1979]     Two schemas                               Global schema


   The form in which the inputs and out-                        Table 4:
puts exist in an integration system (which
                                                                (1) Those that perform a repetitive com-
may be partly automated) is not stated
                                                                    parison, conforming, and merging of
explicitly by any of the authors considered.
                                                                    schemas, and avoid the need to restruc-
It is obvious that in order to process the
                                                                    ture later [Mannino    and Effelsberg
schemas in an automated environment,
                                                                    1984a; Navathe and Gadgil 1982; Wied-
they must be expressed in some well-
                                                                    erhold and ElMasri 19791.
defined     language    or some internal
                                                                (2) Those that perform most of the activi-
representation using data structures.
                                                                    ties during and after the merging of
2.4 Gross Architecture           of Methodologies
                                                                    schemas. They include Steps 3 and 4
                                                                    only and avoid comparison and con-
Let us consider the four activities of the                          forming of the schemas [Al-Fedaghi
integration process. In Table 4, we show                            and Scheuermann       1981; Casanova
the steps that are performed by each of the                         and Vidal 1983; Motro and Buneman
methodologies and the looping structure                             1981; Teorey and Fry 1982; Yao et al.
present in them.                                                    19821.
   It is possible to classify the methodolo-                    (3) Those that perform all four activities
gies into four groups on the basis of                               [Batini and Lenzerini 1984; Dayal and

                                                                       ACM Computing Surveys, Vol. 18, No. 4, December 1986
342          .       C. Batini,         M. Lmzerini,         and S. B. Navathe
                                             Table 4.        Schema Integration Activities
                                               Preintegration            Compare       Conform      Merging          Restructuring
                 References                        (Step 1)              (Step 2)      (Step   3)   (SW         44     (Step 4b)

Al-Fedaghi       and Scheuermann                        -                   -             -               X--TX
   [ 19811
Batini    and Lenzerini       [ 19841                   -                  XQX--rX‘X
Casanova and Vidal[1983]                                -                  -              -               x-x
Dayal and Hwang [ 19841                                 -                   x-x-x-x
ElMasri    et al. [ 19871                               x-x-x-x-----+x
Kahn [1979]                                             -                  x-x-x-x
Motro and Buneman             [1981]                    -                  -              -               x-x
Mannino      and Effelsberg      [1984a]                x-x-x-x                                                            -
Navathe and Gadgil [ 19821                              x-x5x+x                                                            -
Teorey and Fry [ 19821                                  -                  -              -               x-x
Yao et al. [ 19821                                      -                  -              -               x-x
Wiederhold       and ElMasri      [1979]                -                   x4--x+x                                        -



    Hwang 1984; ElMasri et al. 1987; Kahn                                       attempted. As such, they appear to
    19791.                                                                      have the maximum interaction with the
(4) Those that explicitly mention preinte-                                      user/designer.
    gration analysis [ElMasri et al. 1987;
    Mannino and Effelsberg 1984a; Na-
                                                                         2.5 Preintegration
    vathe and Gadgil 19821.
On the basis of the looping structure alone, As shown in Table 4, only three method-
the following similarities can be observed:  ologies [ElMasri et al. 1987; Mannino and
                                             Effelsberg 1984a; Navathe and Gadgil
(1) Casanova and Vidal [ 19831and Teorey 19821 explicitly mention preintegration.
    and Fry [1982] have a “no-feedback”      They basically propose a collection of cor-
    approach to integration. They only per- respondences among schemas in the form
    form the merging and restructuring       of constraints and assertions among com-
    steps.                                   ponent schemas. These specifications are
(2) Al-Fedaghi and Scheuermann [ 19811; used, for example, to relate names, to es-
    Dayal and Hwang [1984], Motro and tablish that an object in one schema is the
    Buneman [1981], and Yao et al. [1982] result of some operation on a set of objects
    are similar to the above group in that in another schema, etc.
    they perform only merging and restruc-      For all methodologies, whether or not
    turing; however, they allow a feedback preintegration is explicitly mentioned, the
    between these two steps.                 sequencing and grouping of schemas for
(3) Kahn [1979], Mannino and Effelsberg      integration has to be considered. In this
    [1984a], Navathe and Gadgil [1982], section we describe the different strategies
    Wiederhold and ElMasri [1979] pro- that address this problem.
    vide a global loop from the end of the      The first step, choice of schemas, in-
    process to the initial comparison activ- volves processing component schemas in
    ity. Kahn [1979] includes the restruc- some sequence. In general, the number of
    turing step, whereas the others do not. schemas considered for integration of each
(4) Finally, Batini and Lenzerini [1984]     step can be n > 2. Figure 9 shows four
    and ElMasri et al. [1987] cover all the possible variations termed integration-
    steps; moreover, they provide an itera- processing strategies. Each strategy is
    tive execution of comparison and con- shown in the form of a tree. The leaf nodes
    forming steps before any merging is of the tree correspond to the component
ACM Computing       Surveys, Vol. 18, No. 4, December       1986
                Comparison of Methodologies for Database Schema Integration                                            l      343
                                                Integration      Process




                      binary                                                                    n-&y




       ladder                        balanced                              one-shot             I      I       iterative




7-l                            r

                            Figure 9.   Types of integration-processing           strategies.


                                   Table 5.     Integration-Processing       Strategies
                                                               Type of integration-
                Reference                                      processing strategy                     Balance of strategy
  Al-Fedaghi and Scheuermann [1981]                           One-shot n-ary                                    -
  Batik and Lenzerini [1984]                                  Binary                                        Ladder
  Casanova and Vidal[1983]                                    Binary                                        Balanced
  Dayal and Hwang [ 19841                                     Binary                                        No claim
  ElMasri et al. [ 19871                                      One-shot n-ary                                    -
  Kahn [1979]                                                 Binary                                        No claim
  Motro and Buneman [1981]                                    Binary                                        No claim
  Mannino and Effelsberg [1984a]                              Binary among families                         No claim
  Navathe and Gadgil [1982]                                   Iterative n-ary                                   -
  Teorey and Fry [1982]                                       Binary                                        Balanced
  Yao et al. [1982]                                           One-shot n-ary                                    -
  Wiederhold and ElMasri [1979]                               Binary                                        Ladder



schemas, and the nonleaf nodes correspond                          Table 5 is a comparison of methodologies
to intermediate results of integration. The                     along two dimensions: binary versus n-ary
root node is the final result. The primary                      and the nature of balancing.
classification of strategies is binary versus                      We now comment on the specific features
n-ary.                                                          pertaining to the above classes of strategies.
   Binary strategies allow the integration of                      The advantage of binary strategies is in
two schemas at a time. They are called                          terms of simplifying the activities of com-
ludder strategies when a new component                          parison and conforming at each integration
schema is integrated with an existing inter-                    step. It is evident from the table that most
mediate result at each step. A binary strat-                    of the methodologies agree in adopting a
egy is balanced when the schemas are di-                        binary strategy because of the increasing
vided into pairs at the start and are inte-                     complexity of the integration step with re-
grated in a symmetric fashion (see Fig-                         spect to the number of schemas to be inte-
ure 9, balanced).                                               grated. In general, the merging algorithm
   N-ary strategies allow integration of n                      for n schemas can be shown to be n2 in
schemas at a time (n > 2). An n-ary strategy                    complexity. Hence, keeping n down is de-
is one shot when the n schemas are inte-                        sirable from the standpoint of complexity.
grated in a single step; it is iterative other-                 The disadvantages of binary strategies are
wise. The latter is the most general case.                      an increased number of integration opera-
                                                                     ACM Computing        Surveys, Vol. 16, No. 4, December   1986
344       l      C. Batini, M. Lenzerini, and S. B. Navathe
tions and the need for a final analysis to                 2.6 Comparison   of Schemas
add missing global properties.                             The fundamental activity in this step con-
   The motivation behind the ladder pro-                   sists of checking all conflicts in the repre-
cessing strategy comes from two reasons:                   sentation of the same objects in different
(1) Component schemas can be picked up                     schemas. Methodologies broadly distin-
    for integration in the decreasing order                guish two types of conflicts (see Table 6):
    of their relevance (or “weights,” as Ba-               naming conflicts and structural conflicts.
    tini and Lenzerini [1984] call them);                  We now examine each in detail.
(2) There is an inherent importance asso-
    ciated with an already existing partially              2.6.1 Naming Conflicts
    integrated schema.
                                               Schemas in data models incorporate names
   An integration step could take advantage for the various objects represented. People
of this situation by resolving conflicts in from different application areas of the same
 favor of the partially integrated schema. organization refer to the same data using
For instance, an enterprise view (see Sec- their own terminology and names. This
tion 2.3) is frequently viable in an organi- results in a proliferation of names as well
zation. Choosing it as the initial schema as a possible inconsistency among names
makes the detection and resolution of dis- in the component schemas. The problem-
similarities more efficient.                   atic relationships among names are of two
   A binary balanced strategy has been pro- types:
posed only by Teorey and Fry [ 19821.They
justify it on the basis of minimizing the (1) Homonyms: When the same name is
number of comparisons among concepts in             used for two different concepts, giving
the schemas.                                        rise to inconsistency unless detected.
   The work of ElMasri and Navathe                  Consider the two schemas shown in
                                                    Figure 10. Both schemas include an
 [ElMasri 1980; Navathe et al. 19841 are
good examples of one-shot n-ary strategies.         entity named EQUIPMENT. However,
They consider that during Step 2, an anal-          the EQUIPMENT in Figure 10a refers
ysis of the n schemas is performed together.        to Computers/Copiers/Mimeographic
After collecting, analyzing, and validating         machines, whereas in Figure lob it re-
all the interview assertions, they perform          fers to pieces of furniture as well as air
the integration in a single step. The obvious       conditioners. It is obvious that merging
advantages of n-ary integration are:                the two entities in the integrated
                                                    schema would result in producing a sin-
 (1) A considerable amount of semantic              gle entity for two conceptually distinct
      analysis can be performed before merg-        objects.
      ing, avoiding the necessity of a further (2) Synonyms: When the same concept is
      analysis and transformation of the            described by two or more names. Un-
      integrated schema;                            less different names improve the un-
(2) The number of steps for integration is          derstanding of different users, they are
      minimized.                                    not justified.
                                                       An example appears in Figure 11,
The recommended procedure given by Na-              where CLIENT and CUSTOMER are
vathe and Gadgil [1982] is an iterative n-          synonyms; the entities with these two
ary strategy where “equivalence groups” of          names in the two schemas refer to the
user views are initially formed, the views          same real-world concept. In this case,
within the groups are merged first, creating        keeping two distinct entities in the in-
intermediate integrated schemas that are            tegrated schema would result in mod-
again analyzed and grouped iteratively.             eling a single object by means of two
   Not all the analyzed methodologies state         different entities.
what strategy they adopt. Hardly any (ex-
cept Teorey and Fry [1982]) make any              The motivation behind establishing
statement about balancing.                     naming correspondences and discovering

ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                    Comparison of Methodologies for Database Schema Integration                                            l    345
                                           Table 6.   Naming and Structural Conflicts
               Reference                              Naming conflicts                        Structural       conflicts
Al-Fedaghi     and Scheuermann                                -                                            -
   [1981]
Batini    and Lenzerini      [1984]          Homonyms                                Type inconsistencies
                                             Synonyms                                Integrity constraints        conflicts
Casanova and Vidal [ 19831                                    -                                            -
Dayal and Hwang [ 19841                      Homonyms                                Schema level conflicts:
                                             Synonyms                                  Scale differences
                                                                                       Structural differences
                                                                                       Differences in abstraction
                                                                                     Data level inconsistencies:
                                                                                       Various levels of obsolescence
                                                                                          and reliability
ElMasri     et al. [ 19871                   Homonyms                                Open ended treatment of conflicts,
                                             Synonyms                                     specifically:
                                             Attribute equivalence assertions          Differences in abstraction levels
                                             Entity class equivalence                  Differences in roles, degree, and
                                                                                          cardinality constraints of
                                                                                          relationships
Kahn [ 19791                                 Homonyms                                Cardinality   ratio conflicts
                                             Synonyms
Motro and Buneman            [1981]                           -
Mannino      and Effelsberg      [1984a]     Use of qualified names                  Differences   in abstractions
                                             Attribute equivalence specification
Navathe      and Gadgil [1982]               Homonyms                                Dependency conflicts
                                             Synonyms                                Redundancy conflicts
                                                                                     Modeling conflicts
Teorey and Fry [ 19821                                                                                     -
Yao et al. [1982]                                                                                          -
Wiederhold      and ElMasri      [1979]                                              Cardinality   ratio conflicts




                                            I    Building




  piiz.q
    (4       Figure 10.      Example of homonyms.
                                                                             (4
                                                                              Figure 11.    Example of synonyms.
                                                                                                                           (b)

homonyms and synonyms is to be able to                             comparing concepts with the same name in
determine the four kinds of semantic rela-                         different schemas, synonyms can only be
tionships among component schemas that                             detected after an external specification.
we introduced in Section 1.2. Note that                               Data dictionaries have been advocated as
whereas homonyms can be detected by                                a useful adjunct tool to schema integration


                                                                         ACM Computing Surveys, Vol. 18, No.       4,   December 1986
346       .       C. Batini,     M. Lenzerini,        and 5’. B. Navathe
 methodologies for a better management of                     back to the three reasons for schema diver-
 names [Navathe and Kerschberg 19861.                         sity described in Section 1.2. Table 6 lists
   Methodologies [Al-Fedaghi and Scheuer-                     the different kinds of structural conflicts
 mann 1981; Casanova and Vidal1983; Mo-                       that are taken into account in various
tro and Buneman 1981; Teorey and Fry                          methodologies. Here we present a classifi-
 1982; Wiederhold and ElMasri 1979; Yao                       cation of structural conflicts that is inde-
et al. 19821 make no mention of naming                        pendent from the various terminologies and
correspondences, probably as a result of an                   from the specific characteristics of the dif-
implicit assumption that such correspon-                      ferent data models adopted in the method-
dences are preestablished and thus no                         ologies. Such a classification distinguishes
naming conflicts can arise (see also the                      the following kinds of conflicts:
discussion of the relational model in Sec-
tion 1.4). In ElMasri et al. [1987] a full                    (1) Type Conflicts. These arise when the
naming convention                                                   same concept is represented by differ-
                                                                   ent modeling constructs in different
schemaname.objectname                                              schemas. This is the case when, for
for objects and                                                    example, a class of objects is repre-
                                                                   sented as an entity in one schema and
schemaname.objectname.attributename                                as an attribute in another schema.
for attributes is adopted to assure unique-                   (2) Dependency       Conflicts. These arise
ness of names. As a consequence, hom-                              when a group of concepts are related
onyms cannot arise. The synonym problem                            among themselves with different de-
still remains and must be dealt with via the                       pendencies in different schemas. For
establishment of attribute classes. There is                       example, the relationship Marriage be-
also a cross-reference lexicon of names                            tween Man and Woman is 1: 1 in one
maintained to keep information on syn-                             schema, but m : n in another accounting
onyms. In Batini and Lenzerini [ 19841 and                         for a marriage history.
ElMasri et al. [ 19871the integration system                  (3) Key Conflicts. Different keys are as-
automatically assigns a “degree of similar-                        signed to the same concept in different
ity” to pairs of objects, based on several                         schemas. For example, SS# and Emp-
matching criteria. Users are presented with                        id may be the keys of Employee in two
the similarity information to help them                           component schemas.
detect synonyms.                                              (4) Behavioral Conflicts. These arise when
   A type of homonyms arises when for the                         different insertion/deletion policies are
same concept there is a match on names                            associated with the same class of ob-
but no match on the corresponding sets of                         jects in distinct schemas. For example,
instances. They can occur at various levels                       in one schema a department may be
of abstraction. For example, at the attribute                     allowed to exist without employees,
level, size refers to dress size (a single in-                    whereas in another, deleting the last
teger code) in one schema, whereas it refers                      employee associated with,a department
to trouser size (a pair of integers) in another                   leads to the deletion of the department
schema. At the entity level, STUDENT                              itself. Note that these conflicts may
refers to all students in the database kept                       arise only when the data model allows
in the registrar’s office, whereas it refers                      for the representation of behavioral
to married students only in the married-                          properties of objects.
student-housing database.                                        Another activity typically performed
2.6.2 Structural Conflicts
                                                              during the schema comparison step is the
                                                              discovery of interschema properties. Meth-
We use the term structural conflicts to in-                   odologies usually consider this discovery to
clude conflicts that arise as a result of a                   be a by-product of conflict detection. If any
different choice of modeling constructs or                    interschema properties are discovered dur-
integrity constraints. They can be traced                     ing this step, they are saved and processed


ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                 Comparison of Methodologies for Database Schema Integration                           l      347
                          Table 7.   Schema Transformations Performed by Methodologies
                 Reference                            Conform                      Merge and restructure
  Al-Fedaghi and Scheuermann [1981]                       -                  Removal of redundant
                                                                               dependencies
  Batini and Lenzerini [1984]                  Type transformations          Subsetting
                                               Restructuring                 Aggregation
                                               Renaming                      Restructuring
  Casanova and Vidal [ 19831                                                 Optimization
  Dayal and Hwang [ 19841                                                    Include
                                                                             Integration by generalization
                                                                             Define supertype
                                                                             Define subtype
                                                                             Scale unifying
                                                                             Renaming
  ElMasri et al. [1987]                        Modify assertions             Remove redundant relationships
                                               Renaming
  Kahn [ 19791                                                               Redundancy elimination
  Mannino and Effelsberg, [1984a]              Algebraic operations          Create generalization hierarchies
                                                                             Create subtype
  Motro and Buneman [1981]                                -                  Meet
                                                                             Fold
                                                                             Aggregate
                                                                             Join
                                                                             Add
                                                                             Delete
  Navathe and Gadgil [1982]                                                  Attribute enhancement
                                                                             Attribute creation
                                                                             Restriction
  Teorey and Fry [1982]                                                      Aggregation
                                                                             Generalization
  Yao et al. [ 19821                                     -                   Removal of functions
  Wiederhold and ElMasri [ 19791                         -                   Subsetting



during schema merging [ElMasri et al.                        quires that schema transformations be per-
19871 or schema restructuring [Batini and                    formed. In order to resolve a conflict, the
Lenzerini 19841.                                             designer must understand the semantic re-
   In general, both the discovery of conflicts               lationships among the concepts involved in
and the interschema properties are aided                     the conflict. Sometimes conflicts cannot be
by a strong interaction between the de-                      resolved because they arose as a result of
signer and the user. This is the position                    some basic inconsistency. In this case, the
advocated by [Batini and Lenzerini [1984],                   conflicts are reported to the users, who
ElMasri et al. [1987], Kahn [1979], Man-                     must guide the designer in their resolution.
nino and Effelsberg [1984a], and Navathe                        The concept of schema transformation is
and Gadgil [ 19821.                                          central to conflict resolution and therefore
                                                             to the conforming activity. Since method-
                                                             ologies also perform schema transforma-
2.7 Conforming of Schemas
                                                             tions during merging and restructuring, in
The goal of this activity is to conform or                   Table 7 we introduce a comprehensive tax-
align schemas to make them compatible for                    onomy of all types of transformations.
integration. Achieving this goal amounts to                     From this table it is clear that a limited
resolving the conflicts, which in turn re-                   number of transformations are proposed


                                                                ACM ComputingSurveys,Vol. 18, No. 4, December1986
 348      l       C. Batini, M. Lenzerini, and S. B. Navathe




                                                           (c)
                            Figure 12.    Transformation         of an attribute   into an entity.



 for conflict resolution. Simple renaming                          identification    to entity E (since 1: n means
 operations are used for solving naming con-                       that every instance of A participates only
 flicts by most methodologies. With regard                         once in the relationship with E). In Figure
 to other types of conflicts, the methodolo-                        12c attribute A is only a part of an identifier
 gies do not spell out formally how the pro-                       and so in the new structure, entity A be-
 cess of resolution is carried out; however,                       comes a part of a compound identifier for
 an indication is given in several of them as                      entity E.
to how one should proceed. For example,                               It is interesting to note that among the
when dealing with equivalence, Batini and                          methodologies surveyed, none provide an
Lenzerini [1984] suggest that atomic con-                          analysis or proof of the completeness of the
cepts be transformed (i.e., transform enti-                        schema transformation       operations from the
ties/attributes/relationships     among one                        standpoint of being able to resolve any type
another) to reach a common canonical rep-                          of conflict that can arise.
resentation of the schemas.                                           All the methodologies take the goal of
     We show in Figure 12 three examples of                        the conforming activity to be the construc-
transforming an attribute into an entity, as                       tion of a single “consensus schema” by pos-
suggested by Batini and Lenzerini [1984].                          sibly changing some user views. This is
The dashed lines in these figures specify                          consistent with the ANSI/SPARC            [Klug
identifiers. In Figure 12a attribute A is not                      and Tsichritzis     19771 three-schema archi-
an identifier. It is shown to be transformed                       tecture in which the conceptual schema is
into an entity. In Figure 12b, attribute A,                        a unified representation of the whole appli-
which is an identifier, becomes an entity in                       cation, whereas individual perspectives are
the new schema; entity A now provides                              captured by external schemas.

ACM Computing   Surveys, Vol. 18, No. 4, December   1986
               Comparison of Methodologies   for Database Schema Integration              l      349




2.8 Merging   and Restructuring              grated schema for the representation of the
                                             set of instances that are common to two
The activities usually performed by meth- different classes.
odologies during this phase require differ-     Other types of interschema properties
ent kinds of operations to be performed on are concerned with aggregation relation-
either the component schemas or the tem- ships among classes. Batini and Lenzerini
porary integrated schema. In order to es- [ 19841, Motro and Buneman [ 19811 and
tablish a common framework for this Teorey and Fry [1982], propose specific
phase, we assume that all methodologies transformations for introducing new rela-
first merge the component schemas by tionships in the integrated schema so that
means of a simple superimposition of com- aggregation among classes coming from
mon concepts, and then perform restruc- different component schemas can be
turing operations on the integrated schema represented.
obtained by such a merging. Table 8 shows       Finally, there is a set of transformations
the transformations proposed in the meth- that introduces new concepts in order to
odologies for this step. Each transforma-    convey all the information represented in
tion is performed in order to improve the the component schemas. In Navathe and
schema with respect to one of the three Gadgil [1982 ] “attribute creation” is the
qualities described in Section 1.3, namely, transformation that adds a new attribute
completeness, minimality, and understand- to an entity in the integrated schema (a
ability. We now analyze each quality sepa- similar transformation is called “add” by
rately.                                      Motro and Buneman [1981]). For example,
2.8.1 Completeness                           the attribute Category for the class Student
                                             in the integrated schema may be used to
To achieve completeness, the designer has distinguish among Graduate Students (the
to conclude the analysis and addition of students represented in View 1) and
 interschema properties that is initiated in Undergraduate Students (the students
previous design steps. In Figure 8 we represented in View 2).
 showed examples of interschema proper-          Note that the variety of interschema
ties. In Table 8 we present a comprehensive properties is strongly related to the reper-
 list of interschema properties mentioned in tory of schema constructs at the disposal of
the methodologies. Note that “subsetting”    the data model. Among the semantic
 is the interschema property used by most models, Wiederhold and ElMasri [1979]
 methodologies. In fact, it is considered to provide the richest set of interschema prop-
be the basis for accommodating multiple       erties in the form of various subsets among
user perspectives on comparable classes of different schema constructs. For every
 objects.                                     meaningful pair of constructs in their
    Batini and Lenzerini [1984], Dayal and model, they show an exhaustive list of cases
 Hwang [1984], Mannino and Effelsberg and show how to integrate each by adding
 [1984a], Motro and Buneman [1981], Teo- interschema properties. Among the rela-
 rey and Fry [ 19821, and Wiederhold and tional model based approaches, the richest
 ElMasri [ 19791 propose suitable transfor-   set of interschema properties-inclusion,
 mations for introducing subset-generaliza-   exclusion, and union functional dependen-
 tion relationships in the integrated schema cies-are provided by Casanova and Vidal
 (subsetting, integration by generalization,  [1983] and more recently in the extension
 define subtype, etc. are the names of such of this methodology by Biskup and Convent
 transformations). In Motro and Buneman [1986].
  [1981], “meet” is the transformation that
 produces a common generalization of two 2.8.2 Minimality
 classes. Such a transformation is based on
 the existence of a common key for the two In most of the methodologies, the objective
 classes. On the other hand, “join” produces of minimality is to discover and eliminate
 a common subtype for the two classes. It is redundancies. A different approach is fol-
used when a class is needed in the inte- lowed by Batini and Lenzerini [1984],

                                                   ACM Computing Surveys, Vol. 18, No. 4, December 1986
350       l      C. Batini, M. Lenzerini, and S. B. Navathe
                                            Table 8.      Interschema   Properties
                                Reference                                Interschema properties
                  Al-Fedaghi     and Scheuermann                                    -
                     [1981]
                  Batini and Lenzerini      [ 19841               Subsetting
                                                                  Generalization
                                                                  Relationship
                  Casanova and Vidal [ 19831                      Inclusion dependencies
                                                                  Exclusion dependencies
                                                                  Union functional dependencies
                  Dayal and Hwang [ 19841                         Subsetting
                                                                  Subfunction
                  ElMasri   et al. [1987]                         Assertions    related to extensions
                                                                  Clustering    of attributes into classes
                  Kahn [1979]                                                           -

                  Motro and Buneman         [ 19811               Subsetting
                  Mannino      and Effelsberg   [1984a]           Generalization
                                                                  Overlap and nonoverlap
                                                                  Attribute scope and meaning
                  Navathe and Gadgil [1982]                       Categorization
                                                                  Subsetting
                                                                  Partitioning
                  Teorey and Fry [1982]                           Generalization
                                                                  Aggregation
                  Yao et al. [1982]                                                   -

                  Wiederhold     and ElMasri    119791            Subsetting



where it is stated that discovering the re-
dundancies is a task of conceptual design,
whereas their elimination has to be per-
                                                                                                 I    Employee
                                                                                                                      I

formed during the implementation design
phase.
   We motivate the minimality notion in
Figure 13. There, three subset relationships
are present, indicated by double-lined ar-                                      Engineer
rows; each arrow points from a subentity
to a superentity.
   The subset relationship between Engi-
neering-manager and Employee is redun-
dant since it can be derived from the other
                                                                                                     Engineering-
two. Keeping a minimal number of con-                                                                  Manager
cepts in the global schema implies dropping
the redundant relationship from it. Other
typical situations sought are cycles of rela-                            Figure 13.       A schema with redundancy.
tionships, derived attributes [Batini and
Lenzerini 1984; ElMasri et al. 1987; Na-                         mann 1981; Casanova and Vidal19831. For
vathe and Gadgil 19821, and composition                          these approaches, minimality is the driving
of functions [Yao et al. 19821. In the rela-                     force behind integration.
tional-model-based approaches, redundan-                            As seen from Table 7, most of the schema
cies are related to derived dependencies of                      transformations during restructuring are
various types [Al-Fedaghi and Scheuer-                           geared for a removal of redundancy.

ACM Computing Surveys, Vol. 18, No. 4, December 1986
                Comparison of Methodologies for Database Schema Integration                                l     351




                                                                                                 Journal

                                                    (a)



   Researcher




                                                    (b)

                   Figure 14.   Improving understandability. (a) Schema A. (b) Schema B.


2.8.3 Understandability                                   mapping of queries is an output of the
                                                          database integration process. Dayal and
Attention to the issue of understandability               Hwang [ 19841 develop query modification
is diffused in all methodologies. The prob-               algorithms for modifying global queries
lem is addressed explicitly by Batini and                 into essential local subqueries with dupli-
Lenzerini [ 19841.We reproduce an example                 cate elimination.
in Figure 14, where they argue on qualita-
tive terms that, while the two schemas are
equivalent, Schema B is more understand-                  3. CONCLUSIONS        AND FUTURE WORK
able than Schema A. Schema B was
obtained from Schema A by adding a                        3.1 General Remarks
generalization hierarchy relating Publica-                A few general remarks about the method-
tion to Book and Paper. In general, for                   ologies are in order. The methodologies sur-
improving understandability,      additional              veyed can be reviewed on the basis of some
schema transformations are needed.                        general criteria as follows.
    At present, to our knowledge, no quan-
titative and objective measures of concep-                3.1.1 Use
tual understandability exist that can be
applied here. If a graphical representation               Most methodologies were developed as
of the conceptual model is supported, the                 parts of research projects with low empha-
shape of the diagram, the total length of                 sis on developing full-scale automated sys-
connections, the number of crossings and                  tems. It is obvious that design tools can be
bends, and so forth may be used as param-                 built using the concepts from individual
eters to define graphic understandability                 methodologies. If the size of the problem
[Batini et al. 19861.                                     can be contained within manual means,
    A specific activity performed during the              however, methodologies also can be used
restructuring step by database integration                manually.
methodologies is query modification. We                      Partial implementation of some of the
already have indicated in Figure 2 that the               methodologies (e.g., Batini and Lenzerini

                                                             ACM Computing   Surveys, Vol. 18, No. 4, December   1986
 352       l      C. Batini,     M. knzerini,          and S. B. Navathe

   [ 19841, Teorey and Fry [ 19821, and Yao et 3.2 Missing Aspects
  al. [1982]) have been reported. Nothing has Several aspects are currently missing in
  been reported, however, on the actual use methodologies for view integration.
  of these methodologies to perform view
  integration.                                  (4 Processing Specifications in View Inte-
      The entity-relationship     model, which      gration. This is the specification of the
  provides a basis for the [Batini and Len-         queries and transactions on component
  zerini 1984; ElMasri et al. 1987; Kahn 19791      schemas. An initial position reported
  methodologies, was reported to be the most        on view integration in the database de-
  widely used model in practice. Chilson and        sign literature [Yao et al. 19821 was
  Kudlac [1983] report that the Navathe and         that a view integration methodology
  Schkolnick model [Navathe and Schkol-             should have two goals with respect to
  nick 19781,used in the Navathe and Gadgil        processing specification:
  [1982] methodology, was also known to the
 users surveyed.                                    (1) Feasibility. The integrated schema
      Out of the methodologies for database               supports any processes on the com-
 integration, that of Dayal and Hwang                     ponent schema.
  [1984] has been used with modifications           (2) Performance.         The integrated
 within the framework of the Multibase                    schema is “optimal” with respect to
 project at Computer Corporation of Amer-                 a given set of component schema
 ica. The Multibase system [Landers and                   queries and transactions. Specifi-
 Rosenberg 19821 has been designed and                    cations of queries and transactions
 implemented to allow users access to                     are not explicitly used in any meth-
 heterogeneous databases in a single sys-                 odologies except that of Yao et al.
 tem. Several researchers [Hubbard 1980;                  [1982], where an “optimal” struc-
 Chiang et al. 1983; Data Designer 1981;                 ture is selected on the basis of a
 Ferrara 19851 describe tools that allow an              given set of transactions. We be-
 integration capability to a limited extent.             lieve that performance analysis
                                                         based on processing specifications
                                                         is not meaningful at the conceptual
 3.1.2 Completeness and Detailed Specification           design level since no reasonable
                                                         performance predictions can be
 Most of the surveyed methodologies do not               made. Such performance analysis
provide an algorithmic specification of the              is meaningful only when logical
 integration activities, and they rarely show            and physical schemas are fully de-
whether the set of conflicts or the set of               fined in a DBMS. On the other
transformations considered is complete in                hand, we stress that the real per-
some sense. What they provide are general                formance measures of conceptual
guidelines and concepts on different steps.              schemas are the goals that we
Methodologies that address well-defined                  stated in Section 1.3, namely, com-
problems of logical design based on purely              pleteness and correctness, mini-
mechanized procedures such as Al-Fedaghi                mality, and understandability.
and Scheuerman [ 19811, Casanova and Vi- (b) Behavioral Specification. This is the
da1 [1983], and Yao et al. [1982] are able to      specification of the dynamic properties
construct precise algorithms. But by their         of objects in the schema (e.g., the value
very nature, they cover more of the imple-         of the salary of an employee can never
mentation design compared to conceptual            decrease).
design (according to Table 2).                        None of the methodologies surveyed
     A side effect of the above problem is that    model behavioral properties fully. The
there is no easy way to guarantee conver-          models adopted by ElMasri et al. [ 19871
gence in these methodologies, especially for      and Navathe and Gadgil [1982] allow
those involving looping structures (see Ta-       them to formulate limited types of
ble 5). The termination of the loops is es-       behavioral properties in the form
sentially left to the designer’s discretion.      of insertion/deletion constraints.

ACM Computing   Surveys, Vol. 18, No. 4, December   1986
               Comparison of Methodologies for Database Schema Integration                              353
    Schema Mappings. To support the lo-          aided design (CAD). In the first case, the
    cal views (i.e., external schemas accord-    integration methodology has to deal with
    ing to the ANSI/SPARC [Klug and              data at different levels of summarization.
    Tsichritzis 19771 terminology) of the        This leads to a greater complexity of the
    users of component schemas on the            semantic analysis, an accompanying in-
    basis of the integrated schema is a          crease in conflicts, and a corresponding
    problem that is well addressed by the        increase in the complexity of conflict-
    database integration        methodologies    resolution strategies. In the case of CAD
    [Dayal and Hwang 1984; Motro and             databases, problems arise as a result of
    Buneman 19811. However, it is only           multiple representations of the same data,
    hinted at by ElMasri et al. [ 19871,Man-     as in very large scale integration design,
    nino and Effelsberg [1984a], and Na-         top-down organization of design data, and
    vathe and Gadgil [1982] in the form of       the far-reaching update propagation.
    recognizing “mapping rules” as an out-          The statistical and CAD databases are
    put of integration. Only Wiederhold          often subjected to database integration for
    and ElMasri [1979] have given a com-         allowing sharing of information. New
    plete set of rules to support component      methodologies of database integration for
    schemas. Actually, various levels of         such cases need to be designed; the existing
    mappings need to be addressed in going       works seem limited in this area.
    from (or building) external schemas of
    the integrated schema to (from) one          3.3.2 Knowledge Base Integration
    or more external schema(s) of the
    component schemas.                           Integration of knowledge bases has received
                                                 attention in the literature only recently
3.3 Future Research    Directions
                                                  [Eick and Lockemann 19851. Knowledge
                                                 bases treat classes and their instances to-
From Sections 3.1 and 3.2, it is obvious that    gether: This implies that data and metadata
more work is required on incorporating           coexist in the representation. Moreover,
processing specifications, behavior model-       they provide richer linguistic mechanisms:
ing, and schema mapping in the schema            Knowledge is often expressed in the form
integration methodologies. More research         of logical assertions, productions, and in-
is also required to settle open issues such      ference rules. Rule integration is a problem
as the choice of data models and levels of       in itself. These considerations bring a new
integrity constraint specification. Along        set of issues that are not covered presently
with these, the following directions for fu-     in the surveyed methodologies.
ture research are important.
                                                 3.3.3 Process Integration
3.3.1 Extension of Integration Methodologies
                                                 This refers to the activity of integrating
View integration methodologies need to be        and transforming a set of processes appli-
extended to be used in distributed database      cable to component schemas into a set of
design. This would imply enriching the in-       processes applicable to the integrated
puts by adding more information on the           schema. It seems that many notions used
distribution preference of users as well as      for data schema integration can be trans-
distributed processing requirements. The         ferred to process integration: For example,
principle behind the integration process         the goals (Section 1.3) are equally applica-
would remain practically unaltered, but a        ble; so are the concepts of equivalence,
new set of problems would have to be con-        semantic analysis, conflict detection and
sidered in terms of materializing the so-        resolution, and transformations. Tucher-
called local conceptual schemas.                 man et al. [1985] consider database design
   Another possible extension could be to        to be an integration of process modules.
address the design of databases with special     Some preliminary work is under way on the
properties, such as scientific and statistical   related problem of program transformation
databases and databases for computer-            [Demo 1983; Demo and Kundu 19851.

                                                    ACM Computing   Surveys, Vol. 18, No. 4, December   1986
354      l       C. Batini, M. Lmzerini, and S. B. Navathe
3.3.4 Expert Systems for Schema Integration                Input: n external views (component sche-
 As pointed out in the Introduction and                      mas), given in terms of relations and
 Section 1, schema integration is a difficult                functional dependencies.
 and complex task. An expert system ap- Output: n external schemas, one concep-
proach to database design in general and to      tual schema (integrated schema), a
schema integration in particular on the          mapping mechanism between external
basis of the rules and heuristics of design      schemas and conceptual schema.
is worth investigating. Projects have al-
ready been under way in this area (e.g., Processing specifications considered: No.
see Bouzeghoub et al. [1986] and Shin Integration strategy:
and Irani [1985]). Model dependent rules
should be used in the comparison and con- (1) Find sets of functional dependencies
forming activities with the goal of improv-        common to some set of external views.
ing the equivalence and/or compatibility of (2) Eliminate in previous sets (local) re-
component schemas. Alternative schema              dundant dependencies.
transformations can be suggested or eval-      (3) Remove redundant dependencies due
uated by the system when a conflict must           to transitivity in the global set of de-
be solved. Selection among alternative             pendencies, thus producing a nonre-
schemas for integration can be guided by           dundant cover of the conceptual
system-designer interaction.                       schema.
                                               (4) Identify dependencies that were elimi-
APPENDIX 1. A SUMMARY DESCRIPTION                  nated in previous steps, but must now
               OF METHODOLOGIES                    be readded to external views in order
                                                   to minimize their effect on the mapping
 In the following, the methodologies sur-          process; construct external views.
veyed in this paper are briefly described.
 The same categories of description are used Special features:
for each methodology. These descriptions (1) The main goal of the methodology is to
 should only be treated as a quick reference       obtain mappings that
guide and not as a substitute for the original     (a) preserve compatibility between re-
descriptions of the methodologies. They are             lations and dependencies in exter-
included here to highlight the fact that,               nal schemas and in the integrated
although the general intent of all method-              schema;
ologies is very similar, the actual mechanics      (b) reduce interferences between in-
vary greatly. The terminology of the au-                sert/delete operations in different
thors is used without modification. Words               external schemas.
in parentheses refer to equivalent terms (2) The methodology assures that all rela-
used in this paper.                                tions are projections of a universal
   Of the above methodologies surveyed,            relation.
those of Dayal and Hwang [ 19841,Mannino
and Effelsberg [1984a], and Motro and Batini and Lenzerini [1984]
Buneman [ 19811apply to database integra-
tion; the method of [ElMasri et al. [1987] Type: View integration methodology.
is used for database integration as well as Model: Entity-relationship model (see Ap-
view integration, whereas the remaining          pendix 2).
methodologies apply to view integration
only.                                          Input:
                                                 user schemata (component schemas),
Al-Fedaghi and Scheuermann [1981]
                                                 enterprise schema,
                                                 weights for schemata.
Type: View integration methodology.
                                               Output:      Global schema (integrated
Model: Relational model.                       schema).


ACM Computing   Surveys, Vol. 18, No. 4, December   1986
              Comparison of Methodologies    for Database Schema Integration                  l        355
Processing specifications considered: No.       Integration strategy:
Integration strategy:                           (1) Combine user views, merging relation
                                                    schemas of the two different views and
(1) Choose the enterprise schema as the             defining new inclusion, exclusion, and
    base schema.
                                                    union functional dependencies.
(2) While new schemas are to be inte-           (2) Optimize the temporary conceptual
    grated, do
    (2.1) Choose a new schema.                      schema, trying to minimize redundancy
                                                    and the size of the schema.
    (2.2) Find conflicts between the two
          schemas.
                                                Special features:
    (2.3) Amend the two schemas in order
          to conform them.
    (2.4) Merge the schemas.                    (1) The relational model is enriched with
                                                    interrelational dependencies useful for
    (2.5) Analyze the draft integrated              expressing how data in distinct views
          schema in order to discover re-
          dundancies and simplify the rep-          are interrelated.
          resentation.                          (2) It is assumed that a preliminary inte-
                                                    gration process has been carried out to
Special features:                                   detect which structures of different
                                                    views represent the same information
(1) Several indications are suggested to            and interrelational dependencies.
    guide the designer in the investigation     (3) The optimization procedure (Step 2) is
    of conflicts (e.g., type inconsistencies,       shown to be correct for a special class
    concept likeness/unlikeness).                   of schemas, called restricted schemas;
(2) For every indication, several scenarios         a restricted schema is essentially a rep-
    are proposed (i.e., typical solutions of        resentation of a collection of entities-
    the conflict).                                  relationships, identified by their keys.
(3) Several types of equivalence transfor-
    mations are supplied to confirm the
    representation of concepts.                 Dayal and Hwang [ 19841
(4) A specific activity is suggested to im-     Type: Database integration methodology.
    prove understandability of the global
    schema.
                                                Model: Functional model. The model uses
Related references: Batini and Lenzerini         two constructs: entities and functions
  [1983] and Batini et al. [1983].               (i.e., properties of entities, relationships
                                                 among entities). Functions may be single
                                                 valued or multivalued. Entities may be
Casanova   and Vidal [ 19831                     user defined (e.g., Person) or else con-
Type: View integration methodology.              stants (e.g., Boolean). A generalization
                                                 abstraction is provided among entities
Model: Relational model. Besides func-           and functions.
tional dependencies, other types of
dependencies are considered: inclusion,         Input:
exclusion, and union functional dependen-         local schemas of existing databases,
cies.                                             queries.
Input: Two       user   views   (component      Output:
schema).                                         global interface to databases,
Output: Conceptual      schema (integrated       modified queries.
schema).
                                                Processing     specifications           considered:
Processing specifications considered: No.         Queries.


                                                   ACM Computing   Surveys, Vol. 18, No. 4, December   1986
356       l      C. Batini,      M. Lmzerini,        and 5’ B. Navathe
Integration strategy:                                       Integration strategy:

(1) Solve conflicts among concepts in local                 (1)Transform
                                                              existing schemas into ECR
    schemas (naming, scale, structural, ab-      if needed.
    straction conflicts).                    (2) Preintegration, which consists of an in-
                                                 terleaved application of schema analy-
(2) Solve conflicts among data in existing       sis and modification with assertion
    databases (inconsistencies in identi-
    fiers, different degree of obsolescence,     specification.
    different degree of reliability).            Integration of object classes.
(3) Modify queries and make them consist-
                                                            2
                                                 Integration of relationship classes.
    ent with the global schema.              (5) Generation of mappings.
                                             The above procedure is followed as an
Special features:                            n-ary integration process.
(1) Generalization abstraction is uniformly                 Special features:
    used as a means to combine entities                     (1) A very detailed treatment of attribute
    and resolve different types of conflicts.                   and object extension assertions via con-
(2) A detailed algorithm is given for query                     sistency checking and verification of
    modification and is formally proved                         algorithms is included.
    correct and nonredundant by Hwang                       (2) The methodology uses the notion of
    [1982].                                                     extension of attribute types and object
                                                                classes as a basis for comparison.
ElMasri et al. [ 19871                                      (3) The methodology applies equally to
                                                                view integration and database integra-
Type: Both view integration in logical da-                      tion.
  tabase design and database integration.
                                                            Related references: ElMasri and Navathe
Model:    Entity-Category-Relationship                        [1984], Larson et al. [1986], Navathe
 (ECR) model [ElMasri et al. 19851,which                      et al. [1984], Navathe et al. [1986],
 recognizes, besides entities and relation-                   Weeldreyer [ 19861.
 ships, the concept of categories. Cate-
 gories are used for two purposes: to show                  Kahn [1979]
 a generalization of a superentity into sub-
 entities and to simply allow for the defi-                 Type: View integration methodology.
 nition of a subset of an entity based on                   Model: Entity-relationship   model (see Ap-
 some predicate.                                             pendix 2).
Input:                                                      Input: Local information structures (com-
  n schemas, which represent either user                      ponent schemas).
     views or existing databases repre-                     Output: Global information structure (in-
     sented in the ECR model;                                tegrated schema).
     attribute equivalence assertions;
     object class extension assertions.                     Processing specifications considered: No.
                                                            Integration strategy:
output:
  integrated schema,                                        (1) (Entity step) Aggregate entities.
     mappings between integrated and con-                       (1.1) Standardize names.
     ceptual schemas.                                           (1.2) Aggregate entities to form a non-
                                                                       redundant collection.
Processing specifications considered: Not                       (1.3) Check entities against processing
  to determine the result of integration.                              requirements.
  However, the problem of dealing with                          (1.4) Eliminate    nonessential attri-
  queries on the integrated schema is ad-                              butes.
  dressed in terms of mappings.                                 (1.5) Simplify the representation.
ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                            .*
              Comparison of Methodologies for Database Schema Integration                    l        357

(2) (Relationship step) Aggregate relation-   (2) Companion global view definition lan-
    ships.                                        guage that uses the same set of integra-
    (2.1) Standardize names.                      tion operators as the methodology.
    (2.2) Analyze consistency of relation-    (3) Semantic equivalence and range of
           ship cardinalities versus entity       meaning of individual        attributes,
           cardinalities.                         groups of attributes, and functions of
    (2.3) Aggregate relationships.                attributes can be defined in attribute
    (2.4) Determine conditional and exist-        assertions.
           ence-dependent relationship.       Steps 2, 3, and 4 may be performed in
    (2.5) Eliminate all redundant relation-     sequence or iteratively with backtrack-
           ships.                               ing.
Special features:                             Related references: Mannino and Effels-
0) A rich set of heuristics is suggested to     berg [ 1984b], Mannino and Karle [ 19861,
    guide the designer in discovering con-      and Mannino et al. [1986].
    flicts.
(59 Several types of qualities are defined    ~otro and Buneman [1981]
    for the integrated schema, and strate-
    gies are suggested to achieve these.      Type: Database integration methodology.
                                              Model: Functional model. Constructs of
Mannino and Effelsberg   [1984a]               the model are classes of objects, which
Type: Database integration.                    may be related by two types of func-
                                               tions-att, by which one class becomes
Model: Generalized entity manipulator.         an attribute of another class, and gen,
Input: Local schemas in a common data          which establishes a generalization rela-
  model, interschema assertions about en-      tionship.
  tity types and attributes.                  Input:
Output: Global view objects, global view        Two logical (component) schemas with
  mapping, integration schema objects.             the corresponding databases,
                                                queries.
Processing specifications considered: No.
                                              output:
Integration strategy:                           superview        (global     schema), modified
(1) Transform each local schema into an
    easy-to-integrate form.                   Processing         specifications         considered:
(2) Match the entity types and attributes       Queries.
    of the local schemas.
(3) Define assertions about the entity
                                              Integration strategy:
    types that can be generalized and then    (1) Merge the two (independent) logical
    define assertions about equivalent at-        schemas by combining initially primi-
    tributes.                                     tive classes.
(4) Merge pairs of “generalizable” entity     (2) While new restructurings can be ap-
    families as indicated by the assertions       plied to the temporary integrated
    and designer preferences.                     schema, do
(5) Define global attribute formats and
    conversion rules for the global entity        (2.1) Choose a restructuring primitive
    types.                                              and apply to the integrated
     .*,. ,                                             schema.
Sneciar reatures:
                                              Special features:
(1) The merging step uses entity families
    (collection of entity types related       (1) The main feature of the methodology
    by generalization) rather than simple         is to provide a large and powerful set
    entity types.                                 of restructuring primitives while no

                                                 ACM Computing    Surveys, Vol. 18, No. 4, December   1986

                                                                       !m
358       l      C. Batini, M. Lenzerini, and S. B. Navathe
      heuristics are given to discipline their             Special features:
      use.
                                                           (1) Equivalence and containment relations
Related references: Motro [ 19811.                             among information contents of user
                                                               schemas are assumed as input to the
Navathe and Gadgil [ 19821                                     design.
                                                           (2) A taxonomy is given for types of com-
Type: View integration methodology.                            parisons among objects, conflicts, and
Model: Navathe-Schkolnick       model. The                     view integration operations.
 main construct of the model is the object                 (3) Conflicts are generally resolved by
 (type), representing either an entity or                      adopting the most restrictive specifi-
 an association, which can be recursively                      cation.
 defined in terms of entities or associa-                  (4) Attention is given to the problem of
 tions. Other concepts are connectors,                         automating the view integration proc-
 which model insertion/deletion proper-                        ess, distinguishing activities that can
 ties of associations and subsets between                      be solved automatically and activities
 objects. Associations are divided into                        that ask for interaction with the
 three types: subsetting, partitioning, and                    designer/user.
 categorizing associations.
                                                           Teorey and Fry 119821
Input:
  enterprise view,                                         Type: View integration methodology.
  local views (component schemas),
  integration strategy,                                    Model: Semantic hierarchical model. Con-
  interview and intraview assertions,                       structs are classes of objects, aggregation
  processing requirements.                                  abstractions among objects by which an
                                                            object is seen as an abstraction of the
output:                                                     set of its properties, and generalization
  global view,                                              abstractions.
  mapping rules,
  unresolved conflicts,                                    Input:
  modified assertions.                                       information    perspective     (component
                                                                schemas),
Integration strategy:                                        application perspective,
(1) Divide views into classes of                             event perspective,
                                                             corporate perspective,
      equivalent views,                                      policy guidance and rules.
      identical views,
      single views.                                        output:
                                                             global information structure (integrated
(2) Integrate classes checking for conflicts                    schema),
    (among names, keys, etc.) and solving                    conflicts.
    them on the basis of assertions and
    order of preference.                                   Processing specifications considered: No.
(3) While new view assertion operations
    are applicable, do                                     Integration strategy:
    (3.1) Perform new integrations be-
          tween intermediate and semiin-                   (1) Order local views as to importance with
          tegrated views in a way similar to                   respect to the specific design objectives.
          Step 2.                                          (2) Consolidate local views, two at a time
(4) Generate mapping rules determining                         (the order is determined by Step 1).
    how each of the component views can                    (3) Solve conflicts that have arisen in
    be obtained from the integrated view.                      Step 2.


ACM Computing   Surveys, Vol. 18, No. 4, December   1986
               Comparison of Methodologies for Database Schema Integration                     l        359




Special features:                               Related references: ElMasri [ 19801 and
                                                  ElMasri and Wiederhold [ 19791.
(1) Attention is given to the problem of
    integration of processing specifica-
    tions, but no specific strategies and       Yao et al. [1982]
    methods are proposed.                       Type: View integration methodology.
(2) Different types of integration process-
    ing strategies (see Section 2.5) are com-   Model: Functional model. Constructs of
    pared. The binary balanced strategy is       the model are nodes, classified into
    claimed to be the most effective.            simple nodes representing atomic data
                                                 elements and tuple nodes, represent-
                                                 ing nonfunctional (i.e., many-to-many)
Wiederhold   and ElMasri [ 19791
                                                 relationships among nodes, and func-
Type: View integration methodology.              tions among nodes.

Model: Structural model. Such a model is Input: Two schemas.
  constructed from relations that are used
  to represent entity classes and several Output: The integrated schema.
  types of relationships among entity
  classes. Other types of relationships are Processing specifications considered: Yes,
  represented by connections between re-       in language TASL.
  lations.
                                            Integration strategy:
Input: Two data models (component sche-
  mas).                                     (1) Merge nodes with same values.
                                            (2) Merge nodes that are subsets of other
output:                                          nodes.
  Integrated database model (integrated (3) Remove redundant functions and mod-
     schema),                                    ify corresponding transaction speci-
  database submodels.                            fications.
Processing specifications considered: Only      Special features:
  primitive operations on concepts (inser-
  tion, deletion).                              (1) The main aspect dealt with in the
                                                    methodology is to determine and
Integration strategy:                               remove redundancv.
                                                (2) Paths to be removed are found by using
(1) Find all compatible pairs of relations.         processing specification information.
(2) For each pair of relations, integrate the   (3) A transaction         specification        language
    connection between them.                        (TASL) accompanies the methodology.
(3) Integrate compatible relations.
Special features:                               APPENDIX    2. THE ENTITY-RELATIONSHIP
                                                               MODEL
(1) Owing to the rich variety of modeling
    constructs of the structural model, an      The original model, known as the entity-
    extensive set of conflicts is presented     relationship Model (E-R), was proposed by
    and analyzed, and solutions are pro-        Chen [1976]. Further extensions of the
    vided.                                      model appear in DOS Santos et al. [1980]
(2) Mapping rules are derived from the          and Scheuermann et al. [ 19801.The follow-
    integration process to express data         ing concepts are defined in the model.
    models consistently with the integrated        An entity is a class of objects of the real
    database model.                             world having similar characteristics and


                                                   ACM Computing    Surveys, Vol. 16, No. 4, December   1986
360       .      C. Batini,     M. Lenzerini,        and S. B. Navathe

properties. A relationship is a class of ele-                Table 9. Concepts of the Entity-Relationship Model
mentary facts or associations among enti-                              and Corresponding Symbols
ties. An attribute is an elementary property
either of an entity or a relationship. An                        CONCEPT                         SYMBOL
entity El is a subset of an entity Ez if every
object belonging to E1 also belongs to Ez.
An entity E is a generalization of entities                      Entity
Ez, E?, . . . , E, if                                                                        L

(1) every Ei is a subset of E, and
(2) every object belonging to E belongs ex-
    actly to one of the Ei’s.                                    Relationship

A diagrammatic representation is widely
used with the E-R model. In Table 9 we
show the correspondence between the con-                         Attribute                       Q-
cepts of the model and the diagrammatic
symbols.
   An example of a schema appears in Fig-
ure 15, which describes information of in-                       Subset
terest in a data processing department of a
company.
   The information is about employees
(which includes programmers, managers,
and senior programmers), projects, and lan-                      Generalization
guages. The entities and their correspond-
ing attributes are as follows:
  Employee: Employee#, Last-name, Age
  Project: Project#, Name
  Language: Name, Version
  Manager: Budget                                           mer inherit all properties (attributes and
  Programmer: none                                          relationship types) of Employee, which in-
  Senior-programmer: Years-of-experi-                       clude attributes of Employee and relation-
    ence                                                    ship “Works-on.” Owing to the subset re-
                                                            lationship, Senior-programmer inherits all
    The relationships among the above en-                   the properties of Programmer, which in-
tities are:                                                 clude relationship “Uses” and all attributes
  “Works-on,” connecting Employee and                       from Employee.
Projects.                                                      Various types of constraints have been
  “Uses,” connecting Programmer, Project                    specified to go with the E-R model. Here
and Language.                                               we only refer to the cardinality constraints.
                                                            The cardinality constraint restricts the
   The Works-on relationship has an attri-                  number of relationships in which a specific
bute %-of-time.                                             entity can participate. In the example, the
   Senior-Programmer is a subset of Pro-                    cardinality     constraint   governing the
grammer, and Employee is a generalization                   “Works On” relationship is many to many
of Programmer and Manager. The resulting                    (m : n); that is, an employee may Work on
hierarchy among Employee as a generic                       many projects, and a project may have
entity and Programmer and Manager as its                    many employees who Work on it. Common
specialized entities is denoted by the name                 cardinality constraints are: one to one
Rank. Note that, by virtue of the generali-                 (1: l), one to many (1: n), many to one
zation hierarchies, Manager and Program-                    (n:l), and many to many (m:n).


ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                 Comparison of Methodologies            for Database Schema Integration                       l        361

          Employee *                                         % of time
                                                                                               Project    *
              Last name
                   Age                                                                         Name




 Budget



                                                                                               Name


                                                                                               Compiler           Version

                                  Figure 15.   An entity-relationship    schema.


               ACKNOWLEDGMENTS                                Rep., Honeywell Corporate Research Center
                                                              (submitted for publication).
Navathe’s research was partly supported by National       KAHN, B. 1979. A structured logical data base design
Science Foundation grant No. INT-8400216. Batini’s           methodology. Ph.D. dissertation, Computer Sci-
and Lenzerini’s research was partly supported by Pro-        ence Dept., Univ. of Michigan, Ann Arbor, Mich.
getto Finalizzato Informatica and Progetto Finalizzato    MANNINO, M. V., AND EFFELSBERG,W. 1984a. A
Transporti, CNR, Italy.                                      methodology for global schema design, Computer
    We gratefully acknowledge the patience and coop-         and Information Sciences Dept., Univ. of Florida,
eration of Sharon Grant and Claudio Dollari in pre-          Tech. Rep. No. TR-84-1, Sept.
paring this manuscript. The comments of anonymous         MOTRO, A., AND BUNEMAN, P. 1981. Constructing
referees were very helpful in revising an earlier draft      superviews. In Proceedings of the International
of the paper.                                                Conference on Management of Data (Ann Arbor,
                                                             Mich., Apr. 29-May 1). ACM, New York.
                                                          NAVATHE, S. B., AND GADGIL, S. G. 1982. A meth-
                   REFERENCES                                odology for view integration in logical data base
                                                             design. In Proceedings of the 8th International
A. Complete    Methodologies   for Schema Integration        Conference on Very Large Data Bases (Mexico
                                                             City). VLDB Endowment, Saratoga, Calif.
AL-FEDAGHI, S., AND SCHEUERMANN, P. 1981.                 TEOREY, T., AND FRY, J. 1982. Design of Database
    Mapping considerations in the design of schemas          Structures. Prentice-Hall, Englewood Cliffs, N.J.
    for the relational model. IEEE Trans. Softw. Eng.     WIEDERHOLD,G., AND ELMASRI, R. 1979. A struc-
    SE-7, 1 (Jan.).                                          tural model for database systems. Rep. STAN-
BATINI, C., AND LENZERINI, M. 1984. A methodol-              CS-79-722, Computer Science Dept., Stanford
    ogy for data schema integration in the entity            Univ., Stanford, Calif.
    relationship model. IEEE Trans. Softw. Eng.           YAO, S. B., WADDLE, V., AND HOUSEL, B. 1982.
    SE-lo, 6 (Nov.), 650-663.                                View modeling and integration using the func-
CASANOVA, M., AND VIDAL, M. 1983. Towards a                  tional data model. IEEE Trans. Softw. Eng. SE-
     sound view integration methodology. In Proceed-           8,6, 544-553.
     ings of the 2nd ACM SZGACTISZGMOD        Confer-
     ence dn Principles of Database’Systems (Atlanta,     B. Related Work
   Ga., Mar. 21-23). ACM, New York, pp. 36-47.
DAYAL, U., AND HWANG, H. 1984. View definition            ALBANO, A., CARDELLI, L., AND ORSINI, R.
   and generalization for database integration in             1985. Galileo: A strongly typed, interactive con-
   multibase: A system for heterogeneous distrib-             ceptual language. ACM Trans. Database Syst. 10,
   uted databases. IEEE Trans. Softw. Eng. SE-lo,             2 (June), 230-260.
    6 (Nov.), 628-644.                                    ATZENI, P., AUSIELLO, G., BATINI, C., AND MOSCAR-
ELMASRI, R., LARSON, J., AND NAVATHE, S. B.                   INI, M. 1982. Inclusion and equivalence between
   1987. Integration algorithms for federated                 relational database schemata. Theor. Comput.
   databases and logical database design. Tech.                Sci. 19, 267-285.




                                                               ACM Computing   Surveys, Vol. 18, No. 4, December       1986
362       .      C. Batini,     M. Lenzerini,        and S. B. Navathe
BATINI, C., AND LENZERINI, M. 1983. A conceptual            proach, S. Spaccapietra, Ed. (Dijon, France,
      foundation to view integration. In Proceedings of      Nov.), pp. 19-36.
      the IFIP TC.2 Workina Conference on &stem         CERI, S., ED. 1983. Methodology and Tools for
      Description MethodologGs (Kecskmet, Hungary).          Database Design. North-Holland, Amsterdam.
      Elsevier, Amsterdam, pp. 109-139.                 CERI, S., AND PELAGATPI, G. 1984. Distributed Da-
BATINI, C., LENZERINI, M., AND MOSCARINI, M.                 tabases: Principles and Systems. McGraw-Hill,
      1983. Views integration. In Methodology and            New York.
      Tools for Data Base Design, S. Ceri, Ed. North-   CERI, S., PELAGA~I, G., AND BRACCHI, G. 1981. A
     Holland, Amsterdam.                                     structured methodology for designing static and
BATINI, C., DEMO, B., AND DI LEVA, A. 1984. A                dynamic aspects of data base applications. Znf.
     methodology for conceptual design of office data        Syst. 6, 1, 31-45.
     bases. Znf. Syst. 9,3,251-263.                     CHEN, P. P. 1976. The entity-relationship model-
BATINI, C., NARDELLI, E., AND TAMASSIA, R. 1986.            Toward a unified view of data. ACM Trans.
     A layout algorithm for data flow diagrams. IEEE        Database Syst. 1, 1 (Mar.), 9-36.
     Trans. Softw. Eng. SE-12,4 (Apr.), 538-546.        CHEN, P. P. 1983. English sentence structure and
BEERI, C., BERNSTEIN, P., AND GOODMAN, N.                   entity-relationship    diagrams. J. Znf. Sci. 29,
     1978. A sophisticate’s introduction to database         127-150.
     normalization theory. In Proceedings of the 4th CHIANG, W., BASAR, E., LIEN, C., AND TEICHROEW,
    Intemutionul Conference on Very Lurge Data              D. 1983. Data modeling with PSL/PSA: The
    Bases (West Berlin, Sept. 13-15). IEEE, New             view integration system (VIS). ISDOS Rep. No.
    York.                                                   M0549-0, Ann Arbor, Mich.
BERNSTEIN, P. A. 1976. Synthesizing third normal CHILSON, D., AND KUDLAC, C. 1983. Database de-
    form relations from functional dependencies.            sign: A survey of logical and physical design tech-
    ACM Trans. Database Syst. 1,4 (Dec.), 277-298.          niques. Database 15, 1 (Fall).
BILLER, H. 1979. On the equivalence of data base DATA DESIGNER 1981. Data designer product de-
     schemas: A semantic approach to data transla-          scription. Database Design Inc., Ann Arbor,
    tion. Znf. Syst. 4, 1, 35-47.                           Mich.
BILLER, H., AND NEUHOLD, E. J. 1982. Concepts for DEMO, B. 1983. Program analysis for conversion
     the conceptual schema. In Architecture and             from a navigation to a specification database
     Models in Data Base Management Systems,                interface. In Proceedilzps of the 9th International
     G. M. Nijssen, Ed. North Holland, Amsterdam,           Conference on Very Layge’Data Bases (Florence,
    pp. l-30.                                               Italy). VLDB Endowment, Saratoga, Calif.
BISKUP, J., AND CONVENT, B. 1986. A formal view DEMO, B., AND KUNDU, S. 1985. Modeling the CO-
     integration method. In Proceedings of the Znter-      DASYL DML execution context dependency for
     notional Conference on the Management of Data         application program conversion. In Proceedings
     (Washington, D.C., May 28-30). ACM, New               of the International Conference on Management
    York.                                                  of Data (Austin, TX., May 28-30). ACM, New
BISKUP, J., DAYAL, U., AND BERNSTEIN, P. A.                York, pp. 354-363.
     1979. Independent database schemas. In Pro- DOS SANTOS,C. S., NEUHOLD, E. J., AND FURTADO,
    ceedings of the Znternutionul Conference on the        A. L. 1980. A data type approach to the entity
    Management of Data (Boston, Mass., May 30-             relationship model. In Proceedings of the Znter-
    June 1). ACM, New York.                                national Conference on the Entity Re.!ationship
BOUZEGHOUB,M., GARDARIN, G., AND METAIS, E.                Approach to System Anulysis and Design, P. Chen,
    1986. Database design tools: An expert systems         Ed. (Los Angeles, 1979). North-Holland, Amster-
    approach. In Proceedings of 11th Znternutionul         dam, pp. 103-120.
    Conference of Very Large Databases (Stockholm, EICK, C. F., AND LOCKEMANN, P. C. 1985.
    Sweden). Morgan Kaufmann, Los Altos, Calif.            Acquisition of terminological knowledge using da-
BRODIE, M. L. 1981. On modelling behavioural se-           tabase design techniques. In Proceedings of the
    mantics of data. In Proceedinns of the 7th Znter-      International Conference on Management of Data
    national Conference on Very-Large Data Bases           (Austin, TX., May 28-30). ACM, New York, pp.
    (Cannes. France, Se&. 9-11). IEEE. New York.           84-94.
    Pp. 32-41.             -                            ELMASRI, R. 1980. On the design, use and integra-
BRODIE, M. L., AND ZILLES, S. N., EDS. 1981. In            tion of data models. Ph.D. dissertation, Rep. No.
    Proceedings of the Workshop on Data Abstraction,       STAN-CS-80-801, Dept. of Computer Science,
    Databases, and Conceptual Modelling. SIGPLAN           Stanford Univ., Stanford, Calif.
    Not. 16, 1 (Jan.).                                  ELMASRI, R., AND NAVATHE, S. B. 1984. Object
CARSWELL, J. L., AND NAVATHE, S. B. 1986.                   integration in database design. In Proceedings of
    SA-ER: A methodology that links structured              the IEEE COMPDEC Conference (Anaheim,
    analysis and entity relationship modeling for da-       Calif., Apr.). IEEE, New York, pp. 426-433.
    tabase design. In Proceedings of the 5th Znternu- ELMASRI, R., AND WIEDERHOLD, G. 1979. Data
    tionul Conference on the Entity Relationship Ap-        model integration using the structural model. In


ACM Computing   Surveys, Vol. 18, No. 4, December   1986
                 Comparison of Methodologies for Database Schema Integration                          l         363
    Proceedings of the International Conference on            Computer Science Dept., Univ. of Pennsylvania,
    Management of Data (Boston. Mass., May 30-                Philadelphia, Pa. 1981.
    June 1). ACM, New York.                               MYLOPOULOS,J., BERNSTEIN, P. A., AND WONG, H.
ELMASRI, R., WEELDRYER, J., AND HEVNER, A.                    K.T. 1980. A language facility for designing
    1985. The category concept: An extension to the           database-intensive applications. ACM Trans.
    entity-relationship model. Data Knawl. Eng. I, 1          Database Syst. 5,2 (June) 185-207.
    (June).                                               NATIONAL BUREAU OF STANDARDS1982. Data base
FERRARA, F. M. 1985. EASY-ER: An integrated sys-              directions: Information resource management-
    tem for the design and documentation of data              strategies and tools. Special Publ. 500-92, A.
    base applications. In Proceedings of the 4th Znter-       Goldfine, Ed. U.S. Dept. of Commerce, Washing-
    natianal Conference on the Entity Relationship            ton, D.C., Sept. 1982.
    Approach (Chicago, Ill.). IEEE Computer Society,      NAVATHE, S.B., AND SCHKOLNICK, M. 1978. View
   Silver Spring, Md., pp. 104-113.                           representation in logical database design. In Pro-
HAMMER, M., AND MCLEOD, D. 1981. Database de-                 ceedings of tke International Conference on Man-
   scription with SDM: A semantic database model.             agement of Data (Austin, Tex.). ACM, New York,
   ACM Trans. Database Syst. 6,3 (Sept.), 351-386.            pp. 144-156.
HUBBARD, G. 1980. Computer Assisted Data Base             NAVATHE, S. B., AND KERSCHBERG,L. 1986. Role
   Design. Van Nostrand-Reinhold, New York.                   of data dictionaries in information resource man-
HWANG, H. Y. 1982. Database integration and op-               agement. Znf. Manage. 10, 1.
   timization in multidatabase systems. Ph.D. dis-        NAVATHE, S. B., SASHIDHAR,T., AND ELMASRI, R.
   sertation, Dept. of Computer Science, Univ. of             1984. Relationship matching in schema integra-
   Texas, Austin, Oct.                                        tion. In Proceedings of the 10th International
KLUG, A., AND TSICHRITZIS, D., Eds. 1977. The                 Conference on Very Large Data Bases (Singa-
   ANSIJKBJSPARC Report of the Study Group on                 pore). Morgan Kaufmann, Los Altos, Calif.
   Data Base Management Systems. AFIPS Press,             NAVATHE, S. B., ELMASRI, R., AND LARSON, J.
   Reston, Va.                                                1986. Integrating user views in database design.
LANDERS, T. A., AND ROSENBERG,R. L. 1982. An                  IEEE Computer 19,l (Jan.), 50-62.
   overview of Multibase. In Distributed Databases,                                               F.
                                                          Nc, P.. JAJODIA,S.. AND SPRINGSTEEL, 1983. The
   H. J. Schneider, Ed. North-Holland, Amsterdam.             problem of equivalence of entity relationship
LARSON, J., NAVATHE, S. B., AND ELMASRI, R.                    diasrams. IEEE Trans. Softw. Enc. SE-g. 5,
    1986. Attribute equivalence and its role in                617u-630.
   schema integration. Tech. Rep., Honeywell Com-         OLLE, T. W., SOL, H. G., AND VERRIJN-STUART,A.
   puter Sciences Center, Golden Valley, Minn.                A., Eds. 1982. Information systems design
LUM, V., GHOSH, S., SCHKOLNICK, M., JEFFERSON,                methodologies: A comparative review. In Proceed-
   D., Su, S., FRY, J., AND YAO, B. 1979. 1978 New            ings of the IFIP WG 8.1 Working Conference on
   Orleans data base design workshop. In Proceed-             Comparative Review of Znformation Systems De-
    ings of the 5th Zntematianal Conference on Very           sign Methodologies (Noordwijkerhout, The Neth-
    Large Data Bases (Rio de Janeiro, Oct. 3-5).              erlands). North-Holland, Amsterdam.
    IEEE, New York, pp. 328-339.                          RISSANEN,J. 1977. Independent components of rela-
MAIER, D. 1983. The Theory of Relational Databases.           tions. ACM Trans. Database Syst. 2, 4 (Dec.),
    Computer Science Press, Potomac, Md.                      317-325.
MANNINO,       M.    V., AND EFFELSBERG, W.               ROLLAND, C., AND RICHARDS, C. 1982. Transaction
    1984b. Matching techniques in global schema               modeling. In Proceedings of the Znternatianal
    design. In Proceedings of the IEEE COMPDEC                Conference on Management of Data (Orlando,
    Conference (Los Angeles, Calif.). IEEE, New               Fla., June 2-4). ACM, New York, pp. 265-275.
    York, pp. 418-425.                                    SAKAI, H. 1981. A method for defining information
MANNINO, M. V., AND KARLE, C. 1986. An extension              structures and transactions in conceptual schema
    of the general entity manipulator language for            design. In Proceedings of the 7th Znternatianal
    global view definition. Data Knawl. Eng. 2, 1.            Conference on Very Large Data Bases (Cannes,
MANNINO, M. V., NAVATHE, S. B., AND EFFELSBERG,               France, Sept. 9-11). IEEE, New York, pp.
    W. 1986. Operators and rules for merging gen-             225-234.
    eralization hierarchies. Working Paper, Graduate      SCHEUERMANN, P., SCHIFFNER,G., AND WEBER, H.
    School of Business, Univ. of Texas, Austin, April         1980. Abstraction capabilities and invariant
    1986.                                                     properties modeling within the entity relation-
MCLEOD, D., AND HEIMBIGNER, D. 1980. A feder-                 ship approach. In Proceedings of the Zntematianal
    ated architecture for data base systems. In               Conference on Entity Relationship Approach to
    Proceedings   of the AFIPS      National  Cam-            System Analysis and Design, P. Chen, Ed. (Los
    puter Conference, vol. 39. AFIPS Press, Arling-           Angeles, 1979). North-Holland, Amsterdam.
   ton, Va.                                               SHIN, D. G., AND IRANI, K. B. 1985. Knowledge-
MOTRO, A. 1981. Virtual merging of databases.                 based distributed database system design. In
   Ph.D. dissertation, Tech. Rep. #MS-CIS-80-39,              Proceedings   of the International   Conference    on



                                                              ACM ComputingSurveys,Vol. 18,No. 4, December1966
364      .      C. Batini,   M. Lenzerini,     and S. B. Navathe
    Management of Data (Austin, Tex., May 28-30).           In Proceedings of the 11th Internntionul Confer-
    ACM, New York, pp. 95-105.                              ence on Very Large Data Bases (Stockholm,
SHIPMAN, D. W. 1980. The functional data model              Sweden). Morgan Kaufmann, Los Altos, Calif.
    and data language DAPLEX. ACM Trans. Data-          ULLMAN, J. D. 1982. Principles of Database Systems,
    base Syst. 6, 1 (Mar.), 140-173.                        2nd ed. Computer Science Press, Potomac, Md.
SMITH, J. M., AND SMITH, D. C. 1977. Database           WEELDREYER, J. A. 1986. Structural aspects of the
    abstraction:   Aggregation    and generalization.       entity-category-relationship  model of data,
    ACM Trans. Database Syst. 2,2 (June), 105-133.          Tech. Rep. HR-80-251, Honeywell Computer Sci-
TUCHERMAN, L., FURTADO, A. L., AND CASANOVA,                ences Center, Golden Valley, Minn.
    M. A. 1985. A tool for modular database design.


Received March 1985; final revision accepted December 1986.




ACM Computing Surveys, Vol. 18, No. 4, December 1966

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:3
posted:11/13/2012
language:Unknown
pages:42