Parallel and Distributed Database Systems



                Mark Kaiser
               Blake Reinhart
            Sarvani Mallapragada
OUTLINE
 Introduction
 Description of the application area
    Underlying Principles
 Different ways to parallelize
    Architectural Issues
    Concurrency control
 Existing Software
 Very Brief Demo of Software
Introduction
 Database Management System (DBMS)
   computer software that manages databases
 Distributed Computing
   program parts run simultaneously on multiple
     computers communicating over a network
 Parallel Computing
   program parts run simultaneously on multiple
     processors in the same computer
Introduction
 Both types of processing require dividing a
  program into parts that can run simultaneously

 Distributed programs often must deal with
  heterogeneous environments, network links of
  varying latencies, and unpredictable failures in
  the network or the computers.
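
 A minimal sketch of "dividing a program into parts that can run simultaneously". It is an illustration only: the names split and parallel_sum are hypothetical, and worker threads stand in for the separate processors or computers. A real distributed version would also have to handle the latency and failure issues listed above.

```python
from concurrent.futures import ThreadPoolExecutor

def split(rows, parts):
    """Divide rows into at most `parts` contiguous chunks of similar size."""
    size = max(1, (len(rows) + parts - 1) // parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def parallel_sum(rows, parts=4):
    """Sum each chunk in its own worker, then combine the partial sums."""
    chunks = split(rows, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

total = parallel_sum(list(range(1, 101)))
print(total)
```

 The combining step (the final sum) is the only place where the parts have to be brought back together.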
Introduction
 The improvement of DBMS technology
  coincided with significant developments in
  distributed computing and parallel processing
  technologies.
 The end result
   the emergence of distributed DBMSs and parallel DBMSs
 These systems have become the dominant data management
  tools for highly data-intensive applications
Underlying Principles
 A distributed database (DDB) is a collection of
  multiple, logically interrelated databases distributed
  over a computer network

 Each site
    has its own primary and secondary storage
    runs its own operating system (which may be the same or
     different at different sites)
    has the capability to execute applications on its own
    sites are interconnected by a computer network rather than a
     multiprocessor configuration
Architecture of DBMS
 possible distribution alternatives
    Multiple-client/single-server architecture
    Multiple-client/multiple-server architecture
       More distributed and more flexible architecture


 Most current DBMSs implement one or the
  other of these client-server architectures
Architecture of DBMS
 Multiple-client/single-server architecture
    a number of client machines access a single
     database server
    management problems are considerably simplified
     since the database is stored on a single server
    pertinent issues relate to the management of client
     buffers and the caching of data and (possibly) locks.
    data management is done centrally at the single
     server
Architecture of DBMS
 Multiple-client/multiple-server architecture
    the database is distributed across multiple servers
     which have to communicate with each other in
     responding to user queries and in executing
     transactions
    Each client machine has a “home” server to which
     it directs user requests
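
 The "home server" idea can be sketched roughly as follows. This is a toy model under stated assumptions: the Server and Client classes and the peer-scan lookup are hypothetical, and a real system would use a catalog or directory to locate data rather than asking every peer.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.data = {}    # the fragment of the database stored at this server
        self.peers = []   # other servers this one can communicate with

    def query(self, key):
        """Answer a request locally if possible, otherwise ask the peers."""
        if key in self.data:
            return self.data[key], self.name
        for peer in self.peers:
            if key in peer.data:
                return peer.data[key], peer.name
        raise KeyError(key)

class Client:
    def __init__(self, home):
        self.home = home  # the client's "home" server

    def get(self, key):
        # The client talks only to its home server.
        return self.home.query(key)

s1, s2 = Server("s1"), Server("s2")
s1.peers, s2.peers = [s2], [s1]
s1.data["emp:1"] = "Ann"
s2.data["emp:2"] = "Bob"

client = Client(home=s1)
local_hit = client.get("emp:1")    # served by the home server itself
remote_hit = client.get("emp:2")   # home server fetches it from s2
```

 The client never needs to know which server actually holds the data; that is the home server's job.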
Distributed DBMS Architecture
 A truly distributed DBMS does not distinguish
  between client and server machines
 Ideally, each site can perform the functionality
  of a client and a server
 Such architectures, called peer-to-peer, require
  sophisticated protocols to manage the data
  distributed across multiple sites
 The complexity of required software has
  delayed the offering of peer-to-peer distributed
  DBMS products
Parallel DBMS Architecture
 Architectures range between two extremes
   the shared-nothing architecture
   the shared-memory architecture
   A useful intermediate point is the shared-disk
    architecture
PDBMS- Shared Nothing Architecture
 each processor has exclusive access to its main
  memory and disk unit(s)
 Thus, each node can be viewed as a local site
  (with its own database and software) in a
  DDBMS
 The difference between shared-nothing PDBMS
  and DDBMS is basically one of implementation
  platform
 So most solutions designed for DDBMS may be
  re-used in PDBMS
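
 A rough sketch of the shared-nothing idea: each node owns its partition exclusively, and a query is answered by combining per-node results. The Node and Cluster classes and the modulo partitioning on integer keys are assumptions made for illustration; the slides do not prescribe a partitioning scheme.

```python
class Node:
    """One shared-nothing node with private memory and private storage."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = {}  # only this node ever touches this partition

    def insert(self, key, value):
        self.rows[key] = value

    def local_count(self, predicate):
        """Evaluate the predicate over this node's rows only."""
        return sum(1 for value in self.rows.values() if predicate(value))

class Cluster:
    def __init__(self, n):
        self.nodes = [Node(i) for i in range(n)]

    def owner(self, key):
        # Assumed partitioning rule: integer key modulo number of nodes.
        return self.nodes[key % len(self.nodes)]

    def insert(self, key, value):
        self.owner(key).insert(key, value)

    def count(self, predicate):
        # Scatter the predicate to every node, then gather the partial counts.
        return sum(node.local_count(predicate) for node in self.nodes)

cluster = Cluster(3)
for k in range(10):
    cluster.insert(k, k * 10)

matches = cluster.count(lambda v: v >= 50)
sizes = [len(node.rows) for node in cluster.nodes]
```

 Uneven values in sizes are a small-scale picture of the load balancing problem that can arise in this architecture.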
PDBMS- Shared Nothing Architecture
 Three main virtues:
    Cost
    Extensibility
    Availability
 Pitfalls:
   higher complexity
    (potential) load balancing problems

 Examples
    Teradata’s DBC
    Tandem’s NonStop SQL products
    a number of prototypes such as BUBBA, EDS, GAMMA, GRACE,
     PRISMA and ARBRE
PDBMS- Shared Memory Architecture
 Any processor has access to any memory module or
  disk unit through a fast interconnect (e.g., a high-speed
  bus or a cross-bar switch)
 Several new mainframe designs such as the IBM 3090
  or Bull’s DPS8, and symmetric multiprocessors such
  as Sequent and Encore, follow this approach
 All the shared-memory commercial products (e.g.,
  INGRES and ORACLE) today exploit inter-query
  parallelism only (i.e., no intra-query parallelism).
PDBMS- Shared Memory Architecture
 Virtues:
    simplicity
    load balancing
 Pitfalls:
    cost
    limited extensibility
    low availability
 Examples
    XPRS, DBS3 [Bergsten] and Volcano [Graefe]
    In a sense, the implementation of DB2 on an IBM 3090 with
     6 processors was the first example.
PDBMS- Shared Disk Architecture
 Any processor has access to any disk unit
  through the interconnect, but exclusive
  (non-shared) access to its main memory
 Each processor can then access database pages
  on the shared disk and copy them into its own
  cache
 To avoid conflicting accesses to the same
  pages, global locking and protocols for the
  maintenance of cache coherency are needed
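
 One way to picture global locking plus cache coherency is the sketch below. It assumes a simple write-invalidate scheme, which is only one possible protocol and is not taken from the slides. The SharedDisk and Processor classes are hypothetical, and reads of an already cached page skip the lock for brevity.

```python
import threading

class SharedDisk:
    def __init__(self):
        self.pages = {}        # page_id -> contents on the shared disk
        self.locks = {}        # page_id -> global lock for that page
        self.processors = []   # every processor attached to this disk
        self._guard = threading.Lock()

    def lock_for(self, page_id):
        """Return the single global lock that protects a page."""
        with self._guard:
            return self.locks.setdefault(page_id, threading.Lock())

class Processor:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}        # private (non-shared) main-memory cache
        disk.processors.append(self)

    def read(self, page_id):
        if page_id not in self.cache:
            with self.disk.lock_for(page_id):
                self.cache[page_id] = self.disk.pages[page_id]
        return self.cache[page_id]

    def write(self, page_id, contents):
        with self.disk.lock_for(page_id):
            self.disk.pages[page_id] = contents
            self.cache[page_id] = contents
            # Coherency step: drop stale copies held by other processors.
            for other in self.disk.processors:
                if other is not self:
                    other.cache.pop(page_id, None)

disk = SharedDisk()
disk.pages["p1"] = "v1"
a, b = Processor(disk), Processor(disk)

first = (a.read("p1"), b.read("p1"))   # both cache the page
a.write("p1", "v2")                    # invalidates b's cached copy
stale_gone = "p1" not in b.cache
second = b.read("p1")                  # b re-reads from the shared disk
```

 The extra messages needed for locking and invalidation are one source of the potential performance problems of this architecture.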
PDBMS- Shared Disk Architecture
 Virtues:
      cost
      extensibility
      load balancing
      availability
      easy migration from uni-processor systems
 Pitfalls:
    higher complexity
    potential performance problems
PDBMS- Shared Disk Architecture
 Examples
   IBM’s IMS/VS Data Sharing product
   DEC’s VAX DBMS and Rdb products
   The implementation of ORACLE on DEC’s
    VAXcluster and NCUBE computers
   Note that all these systems exploit inter-query
    parallelism only
Concurrency control
 Since multiple users access a shared database, these
  accesses need to be synchronized to ensure database
  consistency
 Synchronization is achieved by means of concurrency control algorithms
 The most popular concurrency control algorithms are
  locking-based
 User accesses are encapsulated as transactions, whose
  operations at the lowest level are a set of read and write
  operations to the database
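
 A minimal sketch of locking-based concurrency control, assuming shared ("S") locks for reads and exclusive ("X") locks for writes. The LockManager class is hypothetical and non-blocking: a refused request simply returns False, whereas a real DBMS would queue the transaction and also deal with deadlocks.

```python
class LockManager:
    def __init__(self):
        self.table = {}  # item -> (mode, set of transaction ids holding it)

    def acquire(self, txn, item, mode):
        """Request an "S" (read) or "X" (write) lock; return True if granted."""
        held = self.table.get(item)
        if held is None:
            self.table[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if holders == {txn}:
            # The sole holder may keep its lock or upgrade it from S to X.
            new_mode = "X" if "X" in (mode, held_mode) else "S"
            self.table[item] = (new_mode, holders)
            return True
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # several readers can share the item
            return True
        return False          # conflict: a writer is involved

    def release_all(self, txn):
        """Release every lock held by a transaction (e.g., at commit)."""
        for item in list(self.table):
            _, holders = self.table[item]
            holders.discard(txn)
            if not holders:
                del self.table[item]

lm = LockManager()
r1 = lm.acquire("T1", "x", "S")   # T1 reads x
r2 = lm.acquire("T2", "x", "S")   # T2 reads x at the same time
w2 = lm.acquire("T2", "x", "X")   # T2 cannot write while T1 still reads
lm.release_all("T1")              # T1 finishes
w2_retry = lm.acquire("T2", "x", "X")
r1_late = lm.acquire("T1", "x", "S")
```

 Two reads are compatible, but any combination involving a write from a different transaction is refused, which is what keeps the database consistent.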
Existing Software
 ScimoreDB
 Objectivity/DB
 CouchDB (written in Erlang)
 Mnesia (written in Erlang)
 Greenplum (uses PostgreSQL)
 Drizzle (designed for concurrency on multi-core
  architectures; uses modified MySQL code)
 Lotus Notes/Domino
Questions?