THIRD GENERATION DATABASE SYSTEM MANIFESTO

Document Sample
THIRD GENERATION DATABASE SYSTEM MANIFESTO Powered By Docstoc
					THIRD GENERATION
DATABASE SYSTEM
   MANIFESTO
The Committee for advanced DBMS
            Function
           INTRODUCTION
• First Generation DB systems: Network and hierarchical
  systems present in the 70s(Offering data definition and
  manipulation languages)
• Second Generation DB systems: Non-procedural data
  manipulation language and data independence
  (DB2,INGRES,NON_STOP SQL, ORACLE,..) Focused
  on business data processing. Inadequate for CAD,CASE,
  hypertext applications.
          INTRODUCTION
• Consider a publishing application. Client wishes to
  arrange the layout of a newspaper.Storing text
  segments, graphics and icons is required.
  Supporting such data is difficult in 2nd generation
  systems.
• 2nd generations systems sometimes do not support
  business data processing adequately
  either.Consider an insurance application that
  processes claims. Requires traditional data like
  names, ages etc. But also requires photographs of
  the events to which claims are related.
   THIRD GENERATION DBMS
          SYSTEMS
• PRINCIPLE 1:
  Besides traditional data management
  services, third generation DBMS will
  provide support for richer object structures
  and rules.
            PRINCIPLE 1
• Richer object structures to store and
  manipulate text and spatial data are
  required.
• Capability of specifying rules should be
  given to the designer.
• Referential integrity in a relational context
  is an example for a rule
          PRINCIPLE 1
• Ex: Newspaper application
• Text,Icons,maps,advertisement copy(Rich
  object structures)
• Rate,Number of days the adv. will run,
  classification(Traditional data)
• Can`t put advertisement of Sun on the
  same page as Microsoft.(Rule)
           PRINCIPLE 1
• Ex: Insurance application:
• Need to store photographs
• Making changes to insurance coverage is a
  standard transaction.
• Rules such as:
 Cancel the coverage of any customer who
  has had a claim of type Y over value X
Escalate any claim older than N days.
   THIRD GENERATION DBMS
          SYSTEMS
• PRINCIPLE 2:
 Third generation DBMSs must subsume
  second generation DBMSs
             PRINCIPLE 2
• Non procedural access and data independence
  must not be compromised by third generation
  systems.
• Query language is a necessity.
• Ex: mechanical CAD systems. Store the parts
  which compose a product.
• Along with spatial geometry of data, store cost of
  the part, color,meantime to failure...
• Query Example:
  How much does the cost of my product if
  supplier X raises prices by Y percent.
             PRINCIPLE 2
• Second advance is data independence.
  Second generation systems maintain the
  consistency of all access paths to data and
  a query optimizer chooses the best way to
  access the data. Also, second generation
  systems provide views whereby a user can
  be insulated from the changes underlying
  the set of collections.
   THIRD GENERATION DBMS
          SYSTEMS
• PRINCIPLE 3:
 Third generation DBMSs must be open to
  other subsystems.
           PRINCIPLE 3
• Must have 4th generation language(4GL),
  various decision support tools, access from
  many PLs, access to popular systems like
  LOTUS, graphical interfaces, distributed
  DBMS.
• Must allow access from additional tools
  running in different environments
    THIRD GENERATION DBMS
           SYSTEMS
• There are three groups of propositions which
  should be followed by third generation DBMSs.
• Propositions which result from Principle1 and
  refine the requirements of object & rule
  management.
• Propositions which claim that a 3rd generation
  DBMS should subsume a 2nd generation DBMS
• Propositions which result from the requirement
  that a third generation system be open
Propositions Concerning Object and
         Rule Management
• DBMSs cannot anticipate all the kinds of
  data elements that an application might
  want. Most people think time is measured
  in seconds and days but the day ends at
  16.30 at most banks and yesterday may
  need to skip the weekends and holidays.
  Hence it is imperative that a third
  generation system manage a diversity of
  objects.
    1.1 A Third Generation DBMS must have a
                 rich type system

•   An abstract data type to construct new ones
•   An array constructor
•   A Sequence type constructor
•   A record type constructor
•   A set type constructor
 1.1: A Third generation DBMS must have a
              rich type system:
                   (cont.d)

• Functions as a type
• A union type constructor
• Recursive composition of the above constructors

• In addition to implementing these type constructors, the
  underlying query language must be extended.
 1.1: A Third generation DBMS must have a
               rich type system
• The first mechanism allows one to construct new
  base types in addition to the standard integers,
  floats and character strings available in most
  systems. These include bit strings, points, lines,
  complex numbers, etc.
• The second mechanism allows one to have arrays
  of data elements, such as found in many scientific
  applications. Arrays normally have the property that
  a new element cannot be inserted into the middle of
  the array and cause all the subsequent members to
  have their position incremented. In some
  applications such as the lines of text in a document,
  one requires this insertion property, and the third
  type constructor supports such sequences.
 1.1: A Third generation DBMS must have a
               rich type system
• The fourth mechanism allows one to group data
  elements into records. Using this type constructor
  one could form, for example, a record of data items
  for a person who is one of the "old guard" of a
  particular university.
• The fifth mechanism is required to form unordered
  collections of data elements or records. For
  example, the set type constructor is required to form
  the set of all the old guard. The next mechanism
  allows one to construct a data element which can
  take a value from one of several types.
• The last mechanism allows type constructors to be
  recursively composed to support complex objects
  which have internal structure such as documents,
  spatial geometries, etc.
     1.2 Inheritance is a good idea

• Allowing types to be organized into an
  inheritance hierarchy is a good idea.
• If only single inheritance is supported then
  there are too many situations that can not
  be adequately modeled. So, multiple
  inheritance is better.
Multiple Inheritance
1.3 Functions, including database procedures
 and methods and encapsulation are a good
                     idea
• 2nd generation systems support functions
  restrictedly(SQL supports -create, alter,drop-
  which are functions of tables.)
• Encapsulation: Users should not be allowed
  direct access to EMPLOYEE collection but given
  funcions HIRE(EMPLOYEE),
  FIRE(EMPLOYEE),
  RAISE_SAL(EMPLOYEE)
• Encapsulation encourages modularity and
  registers functions along with the data
1.3 Functions, including database procedures
 and methods and encapsulation are a good
                     idea
• Encapsulation has performance advantages
   in a protected or distributed system.
• Ex: function HIRE(EMPLOYEE).
 If specified as a function to be executed
   internally, only one round trip message btw
   DBMS and application is required.
 If runs in a user program, one round trip
   message for each access.
1.3 Functions, including database procedures
 and methods and encapsulation are a good
                     idea
• Such functions can be inherited and overridden in
  inheritance hierarchy.
• HIRE(EMPLOYEE) can be applied to
  STUDENT EMPLOYEE by being overridden.
• Users should write functions in a high level
  language and obtain access through a non-
  procedural access language.
• Functions should run queries and not perform
  their own navigation using lower-level DBMS
  calls.
1.3 Functions, including database procedures
 and methods and encapsulation are a good
                     idea
• Some OODB enthisuasts claim that the only way to
  access a data type be via functions. For example, access
  EMPLOYEE via HIRE(EMPLOYEE). But select *
 from EMPLOYEE
 where salary > 1000
  needs to access all data elements. So data elemets inside
  should be accessed with accessor functions. An
  authorization system is required to control this access.
• Functions should be inherited. Functions stored in the
  database with DBMS calls in the query language are
  very common.
1.4 Unique Identifiers(UIDs) for records should
be assigned by the DBMS only if a user-defined
          primary key is not available
• 2nd generation systems support primary
  keys. Its value never changes and it is also
  human-readable.
• If no primary key is available, the system
  assigns a UID
 1.5 Rules(triggers,constraints) will become a major
    feature in future systems. They should not be
   associated with a specific function or collection
• OODB researchers ignore the importance of rules. They
  say rules are implemented by adding some code to a
  function which operates on the collection.
• Ex: Rule: No employee wins greater than his
  manager.Insert code for this into RAISE_SAL and
  HIRE_EMPLOYEE functions
• But, this way whenever a new function is added some
  code must be included by it. Also, same code is added to
  two different places which consumes programming
  effort(replication).
 1.5 Rules(triggers,constraints) will become a major
    feature in future systems. They should not be
   associated with a specific function or collection
• Example Rules:
• Whenever Hasan gets a salary adjustment, propagate the
  change to Can.
• Whenever Can gets a salary adjustment, propagate the
  change to Osman.
• Also, the users might ask queries about the rules applied .
  If they are embedded in functions,it will be hard to do
  this. It will be hard too, when a rule is deleted.
• So, rules must be enforced by the DBMS but not be
  bound to any function or collection.
     3.2 Propositions Concerning
     Increasing DBMS Function
• 3rd generation systems must subsume all
  capabilities of 2nd generation systems.
• These are query languages, specification of
  sets of data elements and data
  independence
  2.1 Essentially all programatic access to a
      database should be through a non-
    procedural high-level access language.

• The expressive power of a query language must
  be present in every programatic interface.
• Long term this can be provided by adding query
  language constructs to programming languages
• Short term,this can be provided by embedding a
  query language into conventional programming
  languages.
  2.1 Essentially all programatic access to a
database should be through a non-procedural
          high-level access language.
• Many OODB researchers state that they wish to
  navigate to desired data using a low-level
  procedural interface
• But this kind of interface is undesirable and
  shouldn`t be used
• When the programmer navigates to data in this
  fashion, he replaces the function of query
  optimizer. A well written optimizer can always
  do better than a programmer.
  2.1 Essentially all programatic access to a
database should be through a non-procedural
          high-level access language.
• Second problem is schema evolution. If the
  number of indexes change or the physical
  access paths to data changes, the
  programmer must modify his program. On
  the other hand, query optimizer simply
  produces a new plan for the new
  environment.
  2.1 Essentially all programatic access to a
database should be through a non-procedural
          high-level access language.
• Consider a user who is browsing a DB ie,
  navigating from one record to the other.
 “select *
  from collection
  where collection.key=value”
  Although there is little to optimize in this query,
  one is still insulated from required program
  maintenance in the event of schema changes.
• One does not obtain this service if a lower level
  interface is used such as
  “dereference(pointer)
  2.1 Essentially all programatic access to a
database should be through a non-procedural
          high-level access language.
• A majority of OODB researchers suggest that a
  pointer be soft, ie its value not change even if the
  data it points to is moved. This characteristic,
  location independence, is desirable because it
  allows data elements to be moved without
  compromising the structure of DB.
• There is little performance benefit to using the
  lower level interface when a single data is
  returned and there is performance degrading
  when multiple data elements are returned.
  2.1 Essentially all programatic access to a
database should be through a non-procedural
          high-level access language.
• OODB enthusiasts claim that CAD programmers
  want to perform their own navigation and
  therefore need lower level interface. This is like
  resisting to move from assembly language to high
  level programming languages. These
  programmers require education.
• All DBMS access should be specified by queries
  in a non-procedural high-level access notation.
     2.2 There should be at least two ways to specify
 collections, one using enumeration of members and one
     using the query language to specify membership

• The OODB literature suggests specifying sets by enumerating
  members by means of a linked list or array of identifiers. This is an
  inferior choice because:
 ALUMNI(name,age,address)
 GROUPS(g-name, composition)
 Here we have a collection of alumni of a university along with
  collection of groups of alumni. Each group has a name and
  composition fiels indicates members of each group.
• Composition can be specified as an array of pointers to alumni but
  this is inefficient because sets are quite large and they overlap. Also.,
  when a new person is added to ALUMNI collection, application
  programmer should add him to all appropriate groups.
    2.2 There should be at least two ways to specify
collections, one using enumeration of members and one
    using the query language to specify membership

• It is also possible to represent GROUPS as follows:
 GROUPS(g-name,min-age,max-age,composition).
• In this specification, groups are parametrized by age
  and set membership becomes automatic. When new
  alumnus are added, they are added to the appropriate
  groups. Also look at the following query
  Select g-name
  from GROUPS
  where composition.name=„Hasan‟
     2.2 There should be at least two ways to specify
 collections, one using enumeration of members and one
     using the query language to specify membership

• If an array of pointers is used, query optimizer will scan
   all records in GROUPS and then find „Hasan‟. On the
   other hand if automatic membership method is used, the
   optimizer will transform the query into:
 “ select g-name
   from GROUPS,ALUMNI
   where ALUMNI.name=„Hasan‟ and
   Alumni.age > GROUPS.min_age and Alumni.age <
   GROUPS.max_age”
• If there is an index on GROUPS.min-age or
   GROUPS.max-age or ALUMNI.name, this query will
   perform better.
     2.2 There should be at least two ways to specify
 collections, one using enumeration of members and one
     using the query language to specify membership

• In summary, there are two ways to specify collections
  such as sets, arrays and sequences. They can be specified
  either extensionally through collections of pointers or
  intensionally through expressions which also maintains
  automatic membership.
• Also with an intensional specification optimizer is free to
  use any access path, not limited to use any pointer
  structures.
• Both representations are required but intensional should
  be favored.
   2.3 Updatable views are essential

• Most of the databases are dynamic,when a set of
  collections changes, the DB needs maintenance.
• Encapsulation of DB access into functions and
  encapsulation of functions with a single
  collection is helpful in identifying the functions
  to be changed. But if a change is made to
  schema, it may take too long to change the
  affected functions.
    2.3 Updatable views are essential

• A better approach is to support virtual collections(views).
• In second generation systems, it is not possible to update
  relational views.
• The traditional way to support view updates is to perform
  command transformations along the lines. To
  disambiguate view updates additional semantic
  information must be provided.
• One approach is to require that each collection be opaque
  which might become a view at a later time. This way
  there is a group of functions for all accesses and view
  definer must perform program maintenance on these
  functions.
   2.3 Updatable views are essential

• Alternately, a suitable rules system can be used to
  provide the necessary semantics.This way, only
  one(or a small number of) rules need to be
  specified to provide view update semantics.
 2.4 Performance indicators have nothing
   to do with data models and must not
             appear in them.
• Main determiners of performance are:
       The amount of performance tuning done on DBMS
       Usage of compilation techniques by the DBMS
       Location of the buffer pool(in client or DBMS)
       Kind of indexing available
       Performance of the client-DBMS interface
       Clustering performed.
• Such issues have nothing to do with the data model or
  usage of a high-level language versus a lower level
  interface.
2.4 Performance indicators have nothing to
do with data models and must not appear in
                  them.
• Clustering related objects together has been
  highlighted as an important OODB feature. But
  this is a physical representation issue and has
  nothing to do with the data model.
• Some nontraditional problems such as CAD need
  a database environment which is optimized for
  the specific application like
   An access method for spatial data
   A buffer pool in the engineers WS.
   • But these are all performance issues and have nothing
    to do with the data model.
   3.3 Propositions that result from the
        necessity of an open system

• Up to now, we have discussed the
  characteristics of third generation DBMS`s.
  Now we turn our attention to the API, through
  which a user program will communicate with
  the DBMS.
   3.1 Third generation DBMSs must be
       accesible from multiple HLLs.
• Some designers wrongly claim that, a function
  should yield the same result if it is executed in
  the user space on transient data or inside the
  DBMS on persistent data. This can happen only
  if the execution model of DBMS is identical to
  that HLL.
• There is no aggrement on the single HLL.
• An open DBMS must be multi-lingual. It must
  allow access from a variety of externally written
  application systems.
 3.2 Persistent X for a variety of Xs is a good idea. They
    will all be supported on top of a single DBMS by
  compiler extensions and a(more or less) complex run
                        time system
• Second generation systems were interfaced to
  programming languages using a preprocessor partly
  because early DBMS developers did not have the
  cooperation of compiler developers. But the resulting
  interfaces were not very friendly and were characterized
  as “like glueing an apple on a pancake”. Obviously it is
  possible to provide the same level of elegance for general
  purpose programming languages.
• It is crucial to have a closer match between type systems.
  Also, it will be nice to allow any variable in a user`s
  program to be optionally persistent(ie remembered after
  the program terminates).
 3.2 Persistent X for a variety of Xs is a good idea. They
    will all be supported on top of a single DBMS by
  compiler extensions and a(more or less) complex run
                        time system
• In order to perform well, persistent X must maintain a
  cache of data elements and manage this cache using some
  replacement algorithm. A user space cache will improve
  program performance 100-1000 times.
• Functions should be coded by including calls to the
  DBMS expressed in the query language. Hence,
  persistent X requires some way to express queries. Such
  queries can be expressed in a notation appropriate to the
  HLL in question. The run-time system for the HLL must
  accept and process such queries.
• Such a run-time system will be difficult to build
  depending on the HLL.
      3.3 For better or worse, SQL is
          intergalactic dataspeak.

• SQL is the universal way of expressing
  queries today. Although it has a variety of
  well known minor problems, it is necessary
  for commercial viability.
   3.4 Queries and their resulting answers
 should be the lowest level of communication
        between a client and a server.
• In an environment where a user interacts with a
  remote DB, there is a question about the protocol to
  be used. OODB enthisuasts debate whether requests
  should be for single records or pages.
• Expressions in the query language should be the
  lowest level unit of communication. If queries can be
  packed into a function, then the user can use RPC to
  cause function execution on the server.
• If a lower level specification is used, such as record
  or page transfers, the protocol design will be harder
  because of machine dependencies and states.
                   Summary
    We agree with OODB enthisuasts on :
•    Rich type system, functions, inheritance and
    encapsulation.
•   Disagree on object management issues. We
    propose a larger issue of support for data, rule
    and object management.
•   DBMS access should only occur through HLLs.
•   Use of automatic collections should be
    encouraged whenever possible. Persistence may
    be added to HLLs. Unique identifiers should be
    used either user-defined or system-defined.
          Summary(Cont.d)
• To become true third generation systems, OODB
  systems must add a query language, query optimizer,
  a rules system, SQL client/server support, support for
  views, persistent programming languages.
• There is significant research to overcome these
  propositions.
Michael Stonebraker of the University of California, Berkeley,
Lawrence A. Rowe of the Uni-versity of California, Berkeley,
Bruce Lindsay of IBM Research,
James Gray of Tandem Computers,
 Michael Carey of the University of Wisconsin,
 Michael Brodie of GTE Laboratories,
Philip Bernstein of Digital Equipment Corporation, and
David Beech of OracleCorporation.



                              Hazirlayan ve Sunan: Hasan
Dogu