Document Sample

                             Richard Wang and Stuart E . Madnick, Sloan School of Managemen t
             Timely access to multiple disparate databases which       evolved from our observation that inter-databas e
        were independently developed and administered to               incompatibilities at the instance value level as well as thos e
        produce composite information has become increasingl y         at the schema level must be resolved . The methods used i n
        critical for organizations to gain competitive advantage .     this approach include schema integration, Inter-Databas e
        However, many inter-database problems such a s                 Table (IDT), Inter-Database Instance Identification Tabl e
        inconsistency, ambiguity, and contradiction remai n            (IDIIT), object hierarchies, and heuristic rules .
        unresolved. This paper presents an approach for resolvin g          Concepts and research background of CIS are presente d
        these problems . The techniques employed in this approac h     in the remainder of this section . Section 2 presents a cas e
        include schema integration, inter-database tables ,            study of tour-guide databases to exemplify issues involved
        attribute subsetting, object hierarchies, and heuristi c       in attaining connectivity . In section 3, a connectivit y
        rules . Schema integration techniques resolve th e             strategy is presented . Finally, concluding remarks appear i n
        incompatibilities among the databases at the schema level .    section 4 .
        Inter-database tables resolve the semantic inconsistency
        and concept granularity at the instance value level . The      1 .1 A Strategic Applications, Technology, an d
        inter-database instance identification table identifies a n          Organizational Research Initiativ e
        instance across databases. Object hierarchies represen t           The potential strategic importance of informatio n
        schemata as well as instances . Finally, heuristic rules are   technology (IT) is now a widely accepted fact (e .g ., Cas h
        used to facilitate the construction of the inter-databas e     and Konsynski, 1985 ; Clemons & McFarlan, 1986 ; Ive &
        instance identification table and the production of            Learmonth, 1984) . It has also become increasingly clea r
        composite information .                                        that the identification of strategic applications alone doe s
                                                                       not result in success for an organization . A carefu l
        KEY WORDS AND PHRASES : Distributed Databas e                  coordination from the domains of strategic applications ,
        Management Systems, Data Administration, Databas e             information technologies, and organizational structure s
        Administration, Database Integration,              Logica l    must be made in order to attain success, as in Figure . 1 .
        Connectivity, Object-Oriented Programming ,
        Organizational Information Systems, , Strategic Information              StrategicApplications

        Systems .

                   1, INTRODUCTIO N
         Significant advances in the price, speed performance ,                                                1        ,r—►
     capacity, and capabilities of new database technology hav e                                                   ,'   Technology
    created a wide range of opportunities for busines s
    applications . These opportunities can be exploited to mee t
     corporate strategic goals . One important category o f             Figure 1 .A Strategic Application Technology an d
    strategic applications involve inter-corporate linkage (e .g . ,      Organizational Research Initiative (SATORI )
     tying into supplier and/or buyer systems) and/or intra-                An effective corporation is one that successfull y
     corporate integration (e .g ., tying together disparat e          reconciles the problems and opportunities of linking thes e
     functional areas within a firm) of information systems . Thi s    three domains . It is important to recognize that no singl e
    category of information systems has been referred to a s           pattern of interconnection among these three domains i s
    Composite Information Systems (CIS) (Lam and Madnick ,             likely to be consistently successful . Thus, one corporatio n
     1978 ; Levine, 1987 ; Madnick and Wang, 1988a,b,c) .              may wish to lead from its technological domain an d
         A key benefit of CIS is to provide timely access t o          reconcile the other two domains accordingly . In contrast ,
    multiple disparate databases in concert in order to produc e       another corporation may wish to develop its strategi c
    composite information . The process for obtaining thi s            applications from its product/market choice and develop it s
    benefit is referred to as connectivity in this paper . Withou t    technological and organizational capabilities accordingly .
    connectivity, it is difficult, expensive, time-consuming, an d     The way that the corporation matches its interna l
    error-prone to produce composite answers from informatio n         capabilities with the external requirements determines it s
    which may be stored in different databases located i n             success in the marketplace . The primary research activitie s
    different divisions of organizations .                             related to CIS are discussed below .
         In the past, problems such as inconsistency an d
    contradiction among the disparate databases have been deal t       1 .2   Related Work
    with on an ad hoc basis . This paper presents an approach fo r
                                                                           The pioneering research work on CIS began over a
    resolving these problems through enhancing the semanti c
                                                                       decade ago (Lam and Madnick, 1978) . Researchers in th e
    power of the database integration . The enhanced approach

    38 DATA BASE Fall 1989
information systems field have since evolved concepts suc h            The research results have created an opportunity fo r
as inter-organizational information systems and distribute d      organizations to produce composite information that may b e
systems, which are summarized below .                             stored in different databases located in different divisions o f
     Barrett and Konsynski (1982) discussed concept s             organizations . Moreover, the increasingly availabl e
underlying the growth of inter-organizational informatio n        commercial products are important for implementing CI S
systems (IOS) . A classification scheme was presented t o         with high return on investment, as illustrated below .
examine issues of cost commitment, responsibility, an d
complexity of the operating environments . Barrett (1987 )        1 .3   Strategic CIS Opportunitie s
further discussed a range of strategic options and IO S                Consider the following case study of a majo r
implementations . Their work represents a managerial              international bank (Frank, Madnick, and Wang, 1986) .
perspective on the development and deployment of CIS .            Three separate database systems, shown in Figure 2, ar e
     In linking business and technology planning, Benso n         used for cash management, loan management, and line o f
and Parker (1985) argued that business planning shoul d           credit processing . Suppose a client requests that $100,00 0
drive technology planning . Enterprise-Wide Information           be transferred from one account to another . If the client' s
Management (EwIM) grids were proposed to enabl e                  cash balances in the funds transfer system can not cover th e
practitioners as well academics to apply the EwIM tools o f       transaction, it will be rejected -- even though the client ma y
planning . Many of the IS planning tools such as Busines s        have a $1,000,000 active line of credit! This rejection ,
Systems Planning (BSP) and Critical Success Factor s              besides being annoying and possibly embarrassing to th e
(CSF) were mapped onto the grids . The work represents a          client, will require significant effort later to correct b y
trend towards articulating issues involved in business and I T    manually drawing on the line of credit to cover the tr ansfe r
at the planning level, eventually evolving into a                 of funds .
methodology for linking strategic applications t o
appropriate IT and to the organizational context .
      In the technical arena, much research has bee n
conducted on the design of large capacity, cost-effectiv e
memory systems with rapid access time . Goya( an d
Agerwala (1984) analyzed the performance of future share d
storage systems . Madnick and Wang (1986) modeled th e                                                             r
INFOPLEX database computer in order to provid e
substantial performance improvements over conventiona l                                       CAS H
                                                                                                                        LINE OF
                                                                                                                        CREDI T
computers (e .g., up to 1000 fold increases in throughput) i n                               SYSTEM                     SYSTE M

information management, to support very large comple x
databases (e .g ., over 100 billion bytes of structured data) ,                              DATABAS E             ' DATABAS E
and to insure extremely high reliability .                                                                         -,

      In parallel, the MULTIBASE research project a t
Computer Corporation of America (Goldhirch, Lander ,
Rosenberg, and Yedwab, 1984) attempts to provide a                 Figure 2 . An Electronic Banking System Withou t
uniform interface through a single query language an d                                     Integratio n
database schema to data in pre-existing, heterogeneous ,               If the bank can connect the three separate databas e
distributed databases . The Federated Architectur e               systems together so that information is accessed in concert ,
 (Heimbigner, 1985) provides mechanisms for sharing data ,        and so that funds can be automatically drawn on the line o f
 for combining information from several components, an d          credit, then product differentiation will be achieved throug h
 for coordinating activities among autonomous component s         the enhanced quality of service . Reprocessing costs will als o
 via negotiation . Hewitt at MIT (1986) deals with highl y        be reduced because special manual intervention can b e
 parallel open systems . The underlying assumption of thei r      avoided
research is that future IT applications will involve th e              Two levels of connectivity need to be considered i n
 interaction of subsystems that have been independentl y          producing composite information : physical connectivity an d
 developed and administered at disparate geographica l            logical connectivity . Physical connectivity refers to th e
 locations .                                                      process of actual communication among disparate databases .
      In the private sector, commercial database machines ,       Although many issues need to be addressed in physica l
 such as Britton Lee's IDM 500 and Teradata's DBC 1012 ,          connectivity (e .g ., bandwidths, security, availability, an d
 have been introduced . Furthermore, homogeneou s                 reliability), we assume that adequate communicatio n
 distributed database products such as INGRES* an d               solutions are available . Our focus is on the semanti c
 SQL*STAR are now commercially available . It i s                 incompatibilities of databases . The process of resolving th e
 conceivable that computation power approaching Cray 1 ca n       semantic contradiction, inconsistency, and ambiguity tha t
 be available on the desktop by the mid 1990's . Meanwhile ,      results from different assumptions made in disparat e
 the window, mouse, and icon-based software coupled wit h         databases is referred to as logical connectivity . For brevity ,
 rule-based techniques have provided the end user wit h           connectivity hereafter refers to logical connectivity . A tour -
 dramatically easier user interfaces to the computer-base d       guide case is presented below to illustrate issues involved i n
 information . Furthermore, commercial on-line database s         connectivity .
 such as Dow Jones are increasingly accessible for up-to-dat e
 information .

                                                                                                   DATA BASE Fall 1989 39

                                                                       types of rooms) . To attain the most complete an d
                                                                       comprehensive information, we would need to access al l
             2. TOUR-GUIDE DATABASES                                   three tour guides . Let us suppose that AAA is implemente d
                                                                       in INGRES*, FODOR in SQL*STAR, and MASS in R *
         Tour guides are easy to understand, abundant in dat a         by different organizations . Suppose also that we can acces s
    semantics, and representative of the situation involved i n        them in concert through computer networks to produc e
    CIS . We chose tour guides in order to raise issues involve d      composite information such as price, location, and facility .
    in resolving semantic incompatibilities in the delivery o f             Interacting with a CIS front end processor, a touris t
    timely, appropriate, and comprehensive composite                   may wish to get composite information about the facilitie s
    information . Three tour guides are presented : AAA Tou r          at the Logan Airport Hilton in Boston from all three tou r
    Book for Massachusetts, 1987 (abbr . AAA hereinafter) ,            guides . Let us see how we can formulate a composit e
    FODOR's New England, 1987 (abbr . FODOR), and Th e                 answer for the question, " What are the facilities at th e
    Spirit of Massachusetts, 1987 (abbr . MASS) . As discussed         Logan Airport Hilton in Boston? " from the tour-guid e
    below, each tour guide contains somewhat differen t                databases with schemata shown in Figure 3 .
    information and different degrees of detail or perspective o n
    common information (e .g ., average price of room ,
    minimum and maximum room rates, and price of different
                                                                             S DATA DICTIONAR Y

                           TNAME                               CNAME                        COLTYPE/LENGT H

                         MASS-Info                               Name                               Char 3 0
                         MASS-Info                              Address
                         MASS-Info                           Facilit -T .
                         MASS-Info                              Ratin ;                              Ch . 4
                         MASS-Info                            #-of-Room s
                         MASS-Info                               Other                                    :1
                         MASS-Phone                             Phone#
                          MASS-CC                                 CC                                Char 2
                        MASS-Amenity                         Amenity Code                           Num(1)
                        MASS-Packae                          Package Name                           Char 4 0
                        MASS-Package                        Package Descript                        Char(80)

                                                                    n Airport Hilton with Rati n
                            AA A                               FODO R                           MAS S
                      (Character 25+)                     (Character 30+)                 (Character 25+ )
                         Parking lot                                                          Free Barkin g
                            CCTV                                  TV*                          Cable T V
                             A/C                                  A/C*                     Air Conditionin g
                           Phones                               Phone*        - -         Telephone in roo m
                             Pool                            Outdoor pool       -                 Poo l
                       Airport transport                  Airport car available       Free transportation to/from
                          Dining rm                           Restaurants                   _ Restauran t
                      Non-smokers ' room                                                  Non-smoker room s
                                                                  Pets                        pets allowed
                           Cocktail                                Bar                           Lounge
                            Suites                           Entertainment             Near public transportation         _
                        Smoke detectors                        Heating*                 Handicapped accessibl e

                 + The data formats of the attributes .
                 * The facility inferred from the FODOR expensive category .
    2 .1 Problems Encountered In Extractin g                         data dictionaries need to be accessed, as exemplified in Tabl e
            Composite Informatio n                                   1, In addition, the numeric amenity codes in MASS have t o
         Different queries need to be generated to access th e       he converted to the descriptions used in AAA and FODO R
    relations in AAA, FODOR, and MASS to accumulate th e             (e .g ., "6" means "pool") .
    facility data of the Logan Airport Hilton . In this process, i t       The information that would be accumulated from tha t
    is necessary to realize that amenity in MASS is equivalen t      process is shown in Table 2 (except the entries with a
    to facility in FODOR and AAA . In order to retrieve the data     "'D . In order to know that TV, A/C, phone, and heating ar e
    format of the facilities in Figure 3, the COLUMNS in th e        also available from FODOR, it is necessary to know tha t

    40 DATA BASE Fall 1989

     the Logan Airport Hilton is categorized as expensive b y                                              transportation available ; restaurant ; non-smokers '
     FODOR where expensive means, among other criteria ,                                                   room ; and pets allowed . In addition, the followin g
     " bath or shower in each room, restaurants, TV, phone ,                                               facilities have been reported : suites, smok e
     attractive furnishings, heating, and A/C ." Since th e                                                detectors, entertainment, cocktail, bar, lounge, nea r
     meaning of expensive is not stored as part of the relations, a                                        public transportation, and handicapped accessible . "
     procedure is needed to obtain the information .
                                                                                                  2 .2 .   Insights Gained From the Exampl e
                           AAA    Relation s
                                                                                                       The tour-guide example revealed that two levels o f
      AAA-Info:            (Name', Address, Rctc-Ctcic . Lodging Type, Phone N, Other )
      AAA-Dinxtion :       (Address', Direction)                                                  incompatibilities need to be resolved : one at the schem a
      AAA-Facility :
                           (Name', Facility' )
                           (Name' . Credit-Cord' )
                                                                                                  level and the other at the instance value level . At th e
      AAA-Rate :                                         2
                           (Name', Season', IPL, I PIi . 2PIIIL, 2PI BR, 21'2B11, XI', F-ctok )   schema level, incompatibilities include synonyms ,
                           FODOR Relation s                                                       structural differences, and incompleteness :
      FODOR-Info :         (IN', Name . Address, Comment, Location, Package, Category )           • Type of lodging such as hotel, motel, and inn in AA A
      FODOR-Phone:         (ION', Mown' )
      FODOR-Facility:      (1DN', Facility' )                                                          is referred to in MASS as type of facilities . They are
      FODOR .Service:      (IN', Service' )
                                                                                                       synonyms at the attribute level (or entity level ,
                           MASS     Relation s
                                                                                                       depending on how they are modeled) since they refer t o
                           (Name', Address, Facility-Type. Rating, N-of-Rooms, Other )
                           (Name`, Phoncti' )
                                                                                                       the same domain of values . The attributes " comment "
       MASS-Amenity    .
                           (Name', CC' )
                           (Atcecity-codc' )
                                                                                                       in FODOR and "other" in MASS are also synonym s
       MASS-Package        (Name', Package-Name, Package-Descript)                                     because both refer to the general comments given to a
                                                                                                       lodging. Similarly, amenity in MASS is equivalent t o
     Figure 3 . Relational Schemata for AAA, FODOR ,                                                   facility in AAA .
                         and MAS S                                                                • Structural conflicts such as type conflicts and ke y
                                                                                                       conflicts are revealed . For example, "package" is a
         Many other semantic problems must also be resolve d                                           relation in MASS but an attribute in FODOR, causin g
     in order to get composite answers . Two examples ar e                                             a type conflict . ID# is used in FODOR, but name i s
     presented below to illustrate the complexity .                                                    used in MASS instead as the primary keys, causing a
                                                                                                       key conflict .
     Example 1 :        How can one identify an instanc e                                         •    Incompleteness arises since each guide specializes i n
                        across multiple databases ?                                                    certain aspects of the problem domain . For example ,
          A unique global key identifier may not always exis t                                         AAA has a detailed rate relation while FODO R
     when multiple disparate databases are involved . Fo r                                             specializes in service and location .
     example, the names, addresses, and phone numbers of thi s
     hotel are reported as follows :                                                                   At the instance value level, incompatibilities occur on
     AAA :         Logan Airport Hilton ; Logan Internationa l                                    a continuum, ranging from simple to complicated . In a
                   Airport, East Boston, 02128, (617) 569-930 0                                   simpler case, code conversion may suffice since a regula r
     FODOR :       Hilton Inn at Logan ; Logan Int'l Airport ,                                    pattern may be available . For example, the amenity code 6
                   569-930 0                                                                      means pool in MASS, but the characters "pool" are use d
     MASS : The Logan Airport Hilton ; Loga n                                                     directly in AAA . This type of conversion can be easil y
                   International Airport, Boston, 02128, (617 )                                   made once the incompatibility is recognized . In a ver y
                   569-9300 or 1-800-HILTON S                                                     complicated case, however, each instance value may b e
     The identity of the lodging needs to be resolved in order t o                                inconsistent, as exemplified by the lodging identificatio n
     retrieve the facility data of " Logan Airport Hilton" acros s                                problem across the tour-guide databases (discussed i n
     the three databases .                                                                        Example 1) . The granularity and ambiguity of instanc e
                                                                                                  values may further complicate the problem . The followin g
     Example 2 :         How can one judge credibility ?                                          section presents a connectivity strategy to resolve thes e
          Contradiction, granularity, and ambiguity ar e                                          problems .
     unavoidable when integrating disparate databases . Fo r
     example, AAA indicates that the Logan Airport Hilton ha s                                               3 . CONNECTIVITY STRATEG Y
     color TV without cable, but MASS reports that cable TV i s
     available -- an apparent contradiction . A closer examinatio n
                                                                                                        The incompatibilities revealed from the tour-guid e
     reveals that AAA has three categories for TV : C/TV fo r                                     example suggest that schema integration methodologie s
     color TV, CATV for cable TV, and C/CATV for color cabl e
                                                                                                  (e .g ., Batini, Lenezini, and Navathe, 1986 ; Dayal an d
     TV ; MASS indicates only if cable TV is available .                                          Hwang, 1984 ; Elmasri, Larson, and Navathe, 1987) can be
     Therefore, AAA is more detailed and may be assumed to be                                     effective in resolving problems at the schema level . Schem a
     more credible in reporting TV information . The credibility
                                                                                                  integration offers the CIS developer an opportunity t o
     knowledge needs to be incorporated if the contradiction is to
                                                                                                  identify the syntax and semantic problems inherent i n
     be resolved .                                                                                disparate databases . On the other hand, inter-database table s
          If all the semantic problems can be solved, a composite
                                                                                                  (IDT), inter-database instance identification tables (IDIIT) ,
     answer for the facilities of the Logan Airport Hilton may be
                                                                                                  and knowledge-based techniques are used to resolv e
     obtained as follows :
                                                                                                  incompatibilities and ambiguities at the instance valu e
          "free parking ; color TV without cable ; ai r
          conditioning ; phone in room ; pool ; airport

                                                                                                                                      DATA BASE Fall 1989 41

     level, as section 3 .2 discusses . FODOR and MASS are use d       3 .2      Resolving Incompatibilities                             at th e
     to illustrate the schema integration process below .                        Instance Value Leve l
                                                                            Although the name conflict between amenity in MAS S
     3.1 Resolving Incompatibilities at the Schem a                    and facility in FODOR is resolved at the schema level, th e
          Leve l                                                       problem is not solved yet at the instance level . Fo r
        Techniques used in the literature ]. show that man y           example, "outdoor pool" is used in FODOR, "pool" i n
    incompatibilities between FODOR and MASS can b e                   MASS ; similarly, "airport car available" is used i n
    revealed and resolved, as listed below .                           FODOR, "free transportation to/from airport" in MASS .
                                                                       This kind of problem can be avoided in the single databas e
         The FODOR-info and MASS-info relations are rename d           environment since the DB designer can predefine the domai n
         "lodging . "                                                  values . In MASS, for example, the amenity code is used t o
                                                                       encode the domain values (from 1 to 23, where 6 mean s
         The ID# in FODOR is not used since it is unique onl y         pool) ; therefore, all the values in MASS for amenity hav e
         locally . Instead, the lodging name is used as th e           an exact interpretation . However, there is a problem whe n
         primary key to identify a lodging . As we will elaborate      multiple databases are involved : in producing composit e
         later, lodging identification across multiple databases i s   information, it is difficult for the computer to interpret th e
         a central issue in attaining connectivity .                   relationship between "outdoor pool" and "pool" or "airpor t
                                                                       car available" and "free transportation to/from airport . "
         The attributes "comment" in FODOR and "other" i n
         MASS are merged as an attribute of lodging, rename d                                Package                    Lodging ID I T
         "comments . "

                                                                                   Credit Card              Lodging
         The attribute "package" in MASS is converted into a n
         entity in the integrated schema .                                                          /
                                                                                                            Facilit y         Service
         The attribute " location" in FODOR is carried over as a n
         attribute of lodging .                                                                            Fcode IDT         Scode IDT

         The attributes " facility type " (renamed lodging type )
         and "# of units" in MASS are carried over as attribute s                 Category       Ratin g
         of lodging .
                                                                        Figure 4 . An Integrated Schema for FODOR an d
         The entity "CC" (credit card) in MASS is also carrie d
                                                                                              MAS S
         over to the integrated schema, renamed "Credit Card . "
                                                                            The phenomenon described above is not uncommo n
         The entity "amenity" in MASS becomes the entit y              when multiple databases are involved . For each commo n
         "facility" in the integrated schema .                         attribute in two different databases, the domains need to b e
                                                                       checked for their values . If the ranges are inconsistent, the n
         In this way, the obvious name conflicts, structura l          an inter-database table (IDT) is created to reconcile th e
    differences, and incompleteness between FODOR an d                 difference, as exemplified below .
    MASS are resolved . The new entities for lodging, package ,
    credit card, and phone# are depicted in the extended entit y       3 .2 .1   Inter-Database Tables (IDT)
    relation diagram shown in Figure 4 . However, many more
                                                                            To resolve the facility differences in FODOR an d
    subtle incompatibilities remain unresolved, as discusse d          MASS, a unique concept ID, concept group, and concep t
    below .                                                            level are assigned to each concept . As shown in Table 3, th e
                                                                       concept ID 101 is assigned to " A/C " in FODOR and " ai r
                                                                       conditioning" in MASS . Concepts with different degrees o f
                                                                       granularity are assigned to the same concept group, but th e
                                                                       more generic concept is assigned a higher concept level
         1 Batini, Lenzirini, and Navathe [1986] gave an exampl e      number. For example, outdoor pool (103) and pool (104 )
    of schema integration to serve as the background of a              are both assigned to the same concept group (3), but pool i s
    comparative analysis of methodologies for schem a                  assigned a higher concept level number (2) than outdoo r
    integration . Elmasri, Larson, and Navathe [1987] presente d       pool (1) . In this way, the facility of FODOR and MASS ar e
    schema integration algorithms for federation databases an d        reconciled . Furthermore, such assignments provide a
    logical database design . Many issues in schema integratio n       mechanism to group and differentiate concepts . Thi s
    regarding entity, attribute, and relation equivalence have als o   mechanism is crucial for producing composite information .
    been discussed by many other researchers . For instance, i n            Although the IDT provides a mechanism to group an d
    resolving conflicts in different schemata, Dayal and Hwan g        differentiate concepts when a granularity problem arises, i t
    [1984] included naming conflicts, scale differences ,              does not help resolve contradictions . Recall that AA A
    structural differences, and differences in abstraction .           indicates "color TV without cable" as a facility, but MAS S
                                                                       reports that "cable TV" is available . Since "cable TV "

    42 DATA BASE Fall 1989

    appears in both AAA and MASS, the same concept ID i s                 Note that the IDT also allows us to indicate tha t
    used to encode the facility . As a result, the table parser ca n "dining rm" and "restaurant " are equivalent . In addition, i t
    not detect the contradiction . One way to resolve th e           permits us to encode the judgment that "cocktail", "bar" ,
    contradiction is to incorporate the judgment that AAA i s        and "lounge" are similar concepts (all with the sam e
    more credible into the system into the IDT for facility, as      specificity, group, and credit index) . Similarly, a servic e
    shown in Table 4 . The credibility index for TV indicate s       IDT is created for the entity service . We now turn ou r
    that when in doubt, one should use the information retrieved     attention to another subtle incompatibility .
    from AAA .                         Table 3 Inter-Database Table for Facilitie s
                      Concep t              Concep t
                                      Concep t                  Interpretation                Synony m
                       Level           Group  ID
                         1               1    101                       A/C                air conditioning
                         1               2    102                      shone                   tele hone
                         1               3    103                 outdoor poo l
                         2               3    104                      poo l
                         1               4    105                color cable TV                C/CAT V
                         1              4     106                   cable TV                    CATV
                         1              4     107              color TV w/o cable                C/T V
                           2      _     4     108                       TV
                           1            5     109    , Free transportation to/from airport                 -
                           1            5   - 110               airport transport          airport car avai l
                           1            6     111 _                 restaurant    -            dining rm
                                        7 ~   112                    cocktail - -
                           1 --
                           1            ~                                            - --               - -
                                              113                       bar
                           1            7     114                     loun - e
                               Table 4 Credibility Inter-Database Table for Facilitie s
                    Concep t Concep t            Credibilit y   Concep t        AAA          FODOR MASS
                     Level    G_ ro u_ i           Index          ID                                              -
                       1          4                 AAA           105         C/CAT V
                       1          4                 AAA           106         cable TV                   cable TV
                       1          4                 AAA           107    color TV w/o cabl e
                                  4                 AAA _         108                            TV
                       1          6                               111       dining room      restaurants restaurant
                       1 _        7                SAME           112          cocktai l
                       1          7                SAME           113                            bar
                       1          7                SAME           114                                     lounge

    3 .2 .2   Converting Indecomposable Attribute s                     instance level -- the unique inter-database identifie r
         The attributes category in FODOR and rating in MAS S           problem .
    were discovered to be neither disjoint nor equivalent 2 . Th e      3 .2 .3     Inter-Database Instance Identificatio n
    domains are (inexpensive, moderate, expensive, deluxe ,                       Tables (IDIIT)
    super deluxe) for FODOR category and ($, $$, $$$, $$$$ )                 Recall that Logan Airport Hilton was reported as th e
    for MASS rating . However, they do refer to something i n           name identifier for a particular lodging in AAA, Hilton In n
    common in terms of their role and structural identity .             at Logan in FODOR, and The Logan Airport Hilton i n
    Although the literature has suggested that an attribute             MASS respectively, causing an identification problem .
    should be converted to an entity if it is represented as a n        Such an instance level inconsistency can occur for eac h
    entity in another schema (e .g ., department is an attribute i n    instance ; on the other hand, in the facility attribute, th e
    one schema but an entity in the other), none has suggested ,        domain set has a limited number of values no matter ho w
    to our knowledge, how to integrate disjoint attributes suc h        many instances exist in the databases . Note that thi s
    as category vs . rating . To produce the integrated schema a s      problem also occurs in the nonkey attributes such as addres s
    shown in Figure 4, we convert category and rating int o             and phone numbers, which presume different values fo r
    entities, then create "CAT/RAT" as a generalized entity .           different lodgings, causing potential inconsistency ,
    Note that knowledge needs to be used to store th e                  ambiguity, and contradiction . The key uniqueness problem
    information for conversion purposes . We now turn ou r              is more critical since it is used to identify the same lodgin g
    attention to an even more challenging incompatibility at th e       across multiple databases .
                                                                             It is possible that a tax ID, which uniquely identifies a
                                                                        lodging, may be stored in FODOR and MASS . It may als o
         2 Elmasi, Larson, and Navathe (1987) refined th e              be possible to find a combination of attributes to identify a
    characteristics of attributes and defined three types o f           lodging uniquely (e .g ., tax ID, phone number, and zi p
    attribute equivalences : (1) strong attribute equivalence ; (2 )    code) . If neither of the conditions exists but the problem ca n
    weak attribute equivalence ; and (3) disjoint equivalence .
                                                                                                         DATA BASE Fall 1989 43

     be confined with additional assumptions (such as only on e          Eventually a small set of instances in each of the database s
     phone number for each lodging), then the problem is als o           is generated for the final identification .
     reduced to one of the first two cases . If none of the abov e            The identification process can be done each time a n
     cases applies, then the attribute subsetting technique shoul d      instance needs to be identified . Alternatively, an inter -
     be employed .                                                       database instance identifier table (IDIIT) can be create d
          Attribute subsetting is a process for eliminatin g             whereby each lodging is assigned a unique inter-database ID ,
     unrelated inter-database instances by comparing commo n             as shown in Table 5. Once the IDIIT is established ,
     attribute values . Instances that have a common attribute bu t      identifying a lodging across databases is a straightforwar d
     have different attribute values are eliminated from the             table look up . The trade-off is that IDIIT is proportional t o
     candidate set . For instance, if a target instance has a Iodgin g   the size of the overall databases ; it may be problematic i f
     type hotel, then instances in other databases which hav e           instance updating occurs frequently . The lodging IDIIT i s
     lodging type motel are eliminated from the candidate set .          also depicted in Figure 4 .
                                          Table 5 An Instance of IDIIT for Lod i n
                 Inter-Database ID                  AAA ID                   FODOR ID                     MASS ID
                    3456789876543            Logan Airport Hilton        Hilton Inn at Logan        The Logan Airport Hilton
         We have presented several techniques to resolve th e            representation and rule-based inference prototype called th e
    incompatibilities among the databases . It is interesting t o        Knowledge Object Representation Language (KOREL )
    note that artificial intelligence concepts, such as frames an d      (Levine, 1987) .
    rules, and the object-oriented approach provide an expressiv e            Figure 5 depicts part of the integrated schem a
    and general way of thinking about the problems and ou r              represented in the KOREL notation . Each entity can b e
    solution techniques .                                                implemented as a frame with a set of slots . Each slot ha s
                                                                         one or more facets . For example, the entity lodging ha s
    3 .2 .4 Knowledge-Based Technique s                                  slots for its attributes such as name, address, lodging type ,
          The integrated schema shown in Figure 4 can b e                direction, and comments . In addition, it has JOIN slots t o
    represented as frames . Many object-oriented languages (e .g . ,     link lodging with phone#, package, cat/rat, credit card ,
    LOOPS) are now commercially available to implemen t                  facility, and service frames . The JOIN slot has two facets :
    frames and inheritance properties (Stefik and Bobrow ,               the join name and the join key . The generalized property i s
    1986) . Our goal is to experiment with various nove l                implemented through the subtype slot, as shown in th e
    concepts in a multi-process environment in which the direc t         cat/rat frame, which has category and rating as its subtypes .
    access of multiple databases is possible . Therefore, w e            Once the frames are defined, KOREL commands can be use d
    developed a specialized frame-based knowledge                        to invoke methods to produce composite information .
             (LODGING                                                 (CAT/RA T
                  (NAME : (VALUE-TYPE string))                             (NAME : (VALUE-TYPE string) )
                  (ADDRESS : (VALUE-TYPE string))                          (SUBTYPE : (category, rating) )
                  (LODGING-TYPE : (VALUE-TYPE                              (JOIN : (JOIN-NAME lodging )
                  integer))                                                         (JOIN-KEY name)) )
                  (#-OF-UNITS : (VALUE-TYPE integer))                 (PHONE #
                 (DIRECTION : (VALUE-TYPE string))                        (NAME : (VALUE-TYPE string) )
                 (LOCATION : (VALUE-TYPE string))                         (NUMBERS : (VALUE-TYPE string )
                 (COMMENTS : (VALUE-TYPE : string))                       (MULTIPLE-VALUE-FUNCTION true) )
                 (JOIN : (JOIN-NAME phone#)                               (JOIN : (JOIN-NAME lodging )
                           (JOIN-KEY name))                                        (JOIN-KEY name)) )
                 (JOIN : (JOIN-NAME package)                         (FACILIT Y
                           (JOIN-KEY name))                               (NAME : (VALUE-TYPE string) )
                 (JOIN : (JOIN-NAME lodging-idit)                         (FCODE : (VALUE-TYPE integer )
                          (JOIN-KEY name))                                (MULTIPLE-VALUE-FUNCTION true) )
                (JOIN : (JOIN-NAME cat/rat)                              (JOIN : (JOIN-NAME lodging)
                          (JOIN-KEY name))                                         (JOIN-KEY name) )
                (JOIN : (JOIN-NAME credit-card)                          (JOIN : (JOIN-NAME fcode-idt )
                          (JOIN-KEY name))                                         (JOIN-KEY fcode)) )
                (JOIN : (JOIN-NAME facility)                         (CATEGOR Y
                          (JOIN-KEY name))                               (SUPERTYPE : (cat/rat)) )
                (JOIN : (JOIN-NAME service )
                          (JOIN-KEY name)) )
         Figure 5 A Partial Representation of the Integrated Schema For FODOR and MASS in KORE L
        KOREL can also be used to represent the concept level,     Take the IDT for facility as an example . The issues there ar e
    concept group, credibility, and other inheritance properties . how to represent synonyms, concepts, specificity, an d

    44 DATA BASE Fall 1989

    credibility information, as shown in Table 3 and Table 4 .          discriminating power . Our primary focus is on heuristics
    An object hierarchy is created in Figure 6 to depict the            that are generalizable to various application domains .
    concepts related to facility . The numbers from 101 to 11 4              We have illustrated frame-based representation, objec t
    denote the concepts identified in Table 3 . A node "HI" i s         hierarchy, and heuristic rules . The expressive power offere d
    also created as a higher level concept for cocktail, bar, an d      by knowledge-based techniques can be exploited in th e
    lounge . Each object can be implemented as a KORE L                 implementation of a system to access multiple databases, a s
    frame . For example, TV (108) can be implemented as a               discussed below .
    frame that inherits properties from facility and credibility i n
    AAA . It has slots for its concept ID (108), concept nam e          3,2 .5   Prototype Implementation
    (TV), and synonyms (e .g ., television) . The concept leve l             An Abstract Data Base Management System (ADBMS )
    and concept group are elegantly represented in the hierarchy .      was implemented in KOREL as a CIS front end to acces s
                                                                        disparate databases for composite answers . ADBMS is a
                                    Concep t                            higher level conceptual DBMS that conceals th e
                                                                        implementation details of the actual DBMSs from othe r
                 Credibility                Facility   Location         objects in the community . It applies an integrated schema ,
                                                                        as illustrated in Figure 4, of the local database schemata t o
                                                                        implement the CIS front end . With the information fro m
                                                                        the integrated schema and the corresponding information
                               08              +
                                                                        from the local databases, it sends queries (via messages) t o
                                      112                               the local databases (e .g ., AAA, FODOR, and MASS) to
                                                                        access the appropriate information .3 Adding a new DBM S
                                                                        will not result in any change to the existing applications .
                                                                             Also implemented was a set of commands . Th e
         Figure 6 . An Object Hierarchy for Facilit y                   commands provide the basic features of an object-oriented
                                                                        language with extensions to simplify constraint an d
         It is interesting to observe the ramifications of givin g      knowledge representation . Mechanisms are provided fo r
    MASS credibility for pool (104) . Without the additiona l           interfaces with databases as well as building, relating, and
    credibility information, outdoor pool (103) would b e               showing objects . The functional relationship among
    selected to formulate a composite answer because it is more         ADBMS, database objects, and the actual DBMS i s
    specific than pool . With the new credibility information, a n      illustrated in Figure 7 .
    interesting situation is created in which the more specific              A new prototype, called the CIS/Tool Kit (CIS/TK), i s
    information has less credibility (FODOR reported "outdoo r          currently being developed with the purpose to provid e
    pool" whereas MASS reported "pool") . A heuristic rule ca n         improvements in performance and flexibility and genera l
    be added to make the general judgment call . For instance, I F      tools to facilitate rapid development of interfaces to externa l
    the concept level is higher but the source of data is mor e         commercial information services, such as I .P . Sharp' s
    credible, THEN select the source of data .                          Disclosure and Finsbury Data Services Dataline .
         Heuristic rules can also be employed to extrac t
    additional information unattainable before . In Figure 6 ,                    	 4.   CONCLUDING REMARKS
    lodging information is included in the object hierarch y
    (which is not in Table 3 or Table 4 because lodging is not a
                                                                             As information technologies rapidly become availabl e
    facility) . Conceivably, additional information about th e          to society, a key issue for information systems researcher s
    facilities of a lodging is embedded in a lodging's locatio n        will be how to deliver timely, appropriate, an d
    and its lodging type . For example, IF the lodging type is a        comprehensive information to the end user . To obtain thi s
    motel, THEN it would be reasonable to encode a heuristi c
                                                                        information, one may have to extract informatio n
    rule stating that free parking is available . Alternatively, IF a
    lodging's location is in the Boston Back Bay area (from zi p
    code 02116), and the lodging is rated as $$S, THEN valet
                                                                             3 Note that in the process of accessing the loca l
    parking is available .                                              databases, it is also necessary to translate a query in on e
         Another important application of the heuristic rules i s       general form into each particular format used by a loca l
    in attribute subsetting . An instance may have man y                DBMS . This transformation would require very specifi c
    attributes to select for subsetting . The choice is domai n
                                                                        knowledge of the local DBMS . Research conducted at th e
    specific and requires intimate knowledge of the applicatio n        Computer Corporation of America on MULTIBASE [10 ]
    domain . In the lodging inter-database identification problem ,     and more recently on PROBE has addressed the problem . A
     for example, a lodging has many attributes . Furthermore ,         Global Data Manager (GDM) and Local] Database Interface s
    additional information for subsetting may also be availabl e        (LDI) were developed, for example, to perform th e
     from other frames such as phoneli, package, and credit card .      transformation from local databases to GDM . The reader i s
    How would the system know that it is useful to subse t              referred to [7, 10, 13] for a more detailed discussion of th e
     from lodging type and zip code instead of from comments o r        issues involved in query transformation and modification i n
    direction? Designing a good heuristic for attribut e
                                                                        DBMS . Our research focuses on semantic reconciliatio n
     subsetting is a critical task . We are exploring genera l          and instance identification problems in the contents of th e
     heuristics, which include rules such as to choose th e
     attribute in the current set that has the maximum

                                                                                                         DATA BASE Fall 1989 45

    distributed throughout disparate databases within and/o r              Multidatabase System," IEEE Transactions on Softwar e
    across organizational boundaries . How to extract th e                 Engineering, Vol . SE-10, No . 6, November 1984, pp .
                                                                           628-644 .
    appropriate information from these disparate database s
                                                                    [8]    Elmasri R ., Larson J . and Navathe, S . "Schem a
    efficiently, how to reconcile semantic differences among the
                                                                           Integration Algorithms for Federated Databases and
    databases so as to produce composite information, and ho w             Logical Database Design," Submitted for Publication ,
    to deliver the composite information to the user expedientl y           1987 .
    are the issues that we have discussed in this paper .           [9]    Frank, W .F ., Madnick, S .E ., and Wang, Y .R . " A
                                                                           Conceptual Model for Integrated Autonomou s
                                                                           Processing : An International Bank's Experience with
                                                                           Large Databases," Proceedings of the 8th Annua l
                                                                           International Conference on Information System s
                                                                           (ICIS), December 1987, pp . 219-231 .
                                                                    [10]   Goldhirsch, D ., Landers, T ., Rosenberg, R ., an d
                                                                           Yedwab, L. MULTIBASE : System Administrator' s
                                                                           Guide, Computer Corporation of America, Cambridge ,
                                                                           MA, November 1984 .
                                                                    [11]   Goyal, A . and Agerwala, T . " Performance Analysis o f
                                                                           Future Shared Storage Systems," IBM Journal o f
                                                                           Research and Development January 1984, p . 126-138 .
                                                                    [12]   Hewitt, C . E . Office Are Open Systems . ACM
                                                                           Transactions on Office Information Systems, Vol . 4 ,
                                                                           No . 3, July 1986, pp . 271-287 .
                                                                    [13]   Heimbigner, D . and Mcleod D . "A Federate d
                                                                           Architecture for Information Management," AC M
                                                                           Transactions on Office Information Systems, Vol . 3 ,
                                                                           No . 3, July 1985, pp . 253-278 .
                                                                    [14]   Ives, B . and Learmonth, G .P ., "The Informatio n
          Figure 7 . Functional Relationship Amon g                        Systems as a Competitive Weapon," Communication s
               ADBMS and the Actual DBM S                                  of the ACM, Vol . 27(12), December 1984, pp . 1193 -
         We have presented a connectivity strategy based on                1201 .
    schema integration, inter-database tables (IDT), inter -        [15]   Lam, C .Y . and Madnick, S .E ., Composite Informatio n
    database instance identification tables (IDIIT), an d                  Systems - a new concept in information systems .
    knowledge-based techniques in order to resolve problem s               CISR WP# 35, Sloan School of Management, MIT ,
    such as inconsistency, ambiguity, and contradiction ; th e             May 1978 .
                                                                    [16]   Levine, S ., "Interfacing Objects and Database, "
    resolution of those problems makes connectivity attainable .
                                                                           Master's Thesis, Electrical Engineering and Computer
    This research has provided a concrete step towards building a          Science, MIT, May 1987 .
    theoretical foundation of connectivity that reconciles th e     [17]   Madnick, S . E . and Wang, Y . R ., "Modeling th e
    different assumptions and perspectives resulting from th e             INFOPLEX database computer : a multiprocesso r
    different mental models embedded in the different database s           systems with unbalanced flows ." Proceedings of the
    being integrated .                                                     6th Advanced Database Symposium, August 1986, pp .
                                                                           85-93 .
                                                                    [18]   Madnick, S .E. and Wang, Y .R . "Integrating Disparat e
                       REFERENCE S                                         Databases For Composite Answers ." Proceedings o f
                                                                           the Twenty-first Annual Hawaii Internationa l
    [1]  Barrett, S . "Strategic Alternatives and Inter -                  Conference on System Sciences, Vol . II, Januar y
         Organizational Systems Implementations : A n                      1988, pp .583-592 .
         Overview," Journal of MIS, Vol . 3, No . 3, Winter         [19]   Madnick, S .E . and Wang, Y .R . "A Framework o f
         1986-87, pp . 3-16 .                                              Composite Information Systems for strategi c
    [2]  Barrett, S ., and Konsynski, B .K . "Inter-Organizatio n          advantage ." Proceedings of the Twenty-first Annua l
         Information Sharing Systems,"          MIS Quarterly ,            Hawaii International Conference on System Sciences ,
         Special Issue 1982, pp . 93-104 .                                 Vol . III, January 1988, pp.35-43 .
    [3] Batini, C . Lenzirini, M . and Navathe, S .B . " A          [20]   Madnick, S .E . and Wang, Y .R . " Evolution Toward s
         Comparative Analysis of Methodologies for Databas e               Strategic Applications of Databases Throug h
         Schema Integration," ACM Computing Surveys, Vol .                 Composite Information Systems," To appear in th e
         18, No . 4, December 1986, pp . 323 - 363 .                       Journal of Management Information Systems, Summe r
    [4] Benson, R .I . and Parker, M . M . Enterprise-Wid e                or Fall, 1988 .
         Information Management : An Introduction to th e           [21]   Stefik, M . and Bobrow, D .G . "Object-Oriente d
         Concepts . IBM Los Angeles Scientific Center, G320 -              Programming : Themes and Variations," AI Magazine ,
         2768, May 1985 .                                                  Vol . 6, No . 4, Winter 1986, pp . 40 - 62 .
    [5 ] Cash, J . I ., and Konsynski, B .R . "IS Redraw s                           Richard Wang and Stuart E . hfadnic k
         Competitive Boundaries," Harvard Business Review ,                          E53 . 320, Sloan School of Managemen t
                                                                                            MIT, Cambridge, MA 0213 9
         March-April 1985, 134-142 .                                                               (617) 253. 265 6
    [6]  Clemons, E .K . and McFarlan, F.W ., "Telecom : Hoo k                       ACKNOWLEDGEMENTS : Work reported herei n
         Up or Lose Out," Harvard Business Review, July -                       has been supported in part by the Department o f
         August, 1986 .                                                         Transportation's Transportation Systems Center, the U .S .
                                                                                Air Force, the Center for Management of Information at th e
    [7]  Dayal, U . and Hwang, K . " View Definition an d                       University of Arizona, and Citibank . The authors also wis h
         Generalization for Database Integration i n                            to acknowledge the valuable comments from th e
                                                                                anonymous referees on an earlier version of this paper .

    46 DATA BASE Fall 1989

Shared By: