Parallel Database Systems A SNAP Application

                        Parallel Database Systems:
                           A SNAP Application

        Gordon Bell                         Jim Gray
        450 Old Oak Court                   310 Filbert, SF CA 94133
        Los Altos, CA 94022                 Gray@Microsoft.com
        GBell@Microsoft.com


                 [Figure: a Network connecting Platforms]
Bell & Gray 4/15/95
                                      Outline
 • Cyberspace Pep Talk:
    • Databases are the dirt of Cyberspace
    • Billions of clients mean millions of servers
 • Parallel Imperative:
        •   Hardware trend: Many little devices
        •   Consequence: Servers are arrays of commodity components
        •   PC’s are the bricks of Cyberspace
        •   Must automate parallel {design / operation / use}
        •   Software parallelism via dataflow & Data Partitioning
 • Parallel database techniques
        •   Parallel execution of many little jobs (OLTP)
        •   Data Partitioning
        •   Pipeline Execution
        •   Automation techniques
 • Summary
            Kinds Of Information Processing

                     Point-to-Point            Broadcast
     Immediate       conversation, money       lecture, concert     (Network)
     Time-Shifted    mail                      book, newspaper      (Database)

 It's ALL going electronic
 Immediate traffic is being stored for analysis (so it ALL becomes database)
 Analysis & automatic processing are being added
       Why Put Everything in Cyberspace?

     Low rent:             min $/byte
     Shrinks time:         now or later
     Shrinks space:        here or there
     Automate processing:  knowbots

     [Figure: Point-to-Point OR Broadcast traffic, Immediate OR Time
      Delayed, flows through the Network to a Database that can Locate,
      Process, Analyze, and Summarize]
             Databases Store ALL Data Types

 • The Old World:                       • The New World:
    – Millions of objects                  – Billions of objects
    – 100-byte objects                     – Big objects (1 MB)
                                           – Objects have behavior (methods)

   [Figure: the old People table has Name and Address columns; the new
    People table adds Papers, Picture, and Voice columns for the same
    rows David/NY, Mike/Berk, Won/Austin]

   The new world enables: the paperless office; the Library of Congress
   online; all information online (entertainment, publishing, business);
   the Information Network; Knowledge Navigator;
   information at your fingertips.
     Magnetic Storage Cheaper than Paper

 • File Cabinet:        cabinet (4 drawer)      250$
                        paper (24,000 sheets)   250$
                        space (2x3 @ 10$/ft2)   180$
                        total                   700$
                        3 ¢/sheet
 • Disk:                disk (8 GB)            4,000$
                        ASCII: 4 M pages
                        0.1 ¢/sheet (30x cheaper)
 • Image:               200 K pages
                        2 ¢/sheet (similar to paper)
 • Store everything on disk



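The cost figures above can be checked with a few lines of arithmetic (a sketch, not from the talk; the ~2 KB-per-ASCII-page assumption is what makes 8 GB come out to about 4 M pages, and the slide rounds 2.8 ¢ and 28x up to 3 ¢ and 30x):

```python
# Paper: a 4-drawer cabinet holding 24,000 sheets.
cabinet, paper, space = 250, 250, 180                 # dollars
paper_cents = (cabinet + paper + space) / 24_000 * 100
# Disk: 8 GB at 4,000$; at ~2 KB per ASCII page that is ~4 M pages.
disk_cents = 4_000 / 4_000_000 * 100
print(f"paper: {paper_cents:.1f} c/sheet")            # ~2.8 c/sheet
print(f"disk:  {disk_cents:.1f} c/sheet")             # 0.1 c/sheet
print(f"disk is ~{paper_cents / disk_cents:.0f}x cheaper")
```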
                    Cyberspace Demographics
  • Computer History:
        1950 National Computer
        1960 Corporate Computer
        1970 Site Computer
        1980 Departmental Computer
        1990 Personal Computer
        2000 ?
  • Most computers are small
        NEXT: 1 Billion X, for some X (phone?)
        [Chart: installed base by class, from ~1K supercomputers and
         mainframes, to ~1M minis and workstations, to ~100M PCs]
  • Most of the money is in clients and wiring
        1990: 50% desktop
        1995: 75% desktop
        [Chart: spending by class on a 1B$ to 100B$ scale]
                        Billions of Clients

           • Every device will be “intelligent”
           • Doors, rooms, cars, ...
           • Computing will be ubiquitous




                        Billions of Clients Need
                           Millions of Servers
 All clients are networked to servers
             may be nomadic or on-demand
 Fast clients want faster servers
 Servers provide
   data,
   control,
   coordination,
   communication

 [Figure: mobile and fixed Clients connect to Servers; Super Servers
  hold large databases and high-traffic shared data]
                                      Outline
 • Cyberspace Pep Talk:
        • Databases are the dirt of Cyberspace
        • Billions of clients mean millions of servers
 • Parallel Imperative:
    • Hardware trend: Many little devices
    • Consequence: Server arrays of commodity parts
    • PC’s are the bricks of Cyberspace
    • Must automate parallel {design / operation / use}
    • Software parallelism via dataflow & Data Partitioning
 • Parallel database techniques
        •   Parallel execution of many little jobs (OLTP)
        •   Data Partitioning
        •   Pipeline Execution
        •   Automation techniques
 • Summary
                Moore’s Law Restated
             Many Little Won over Few Big
    Hardware trends: few generic parts:
         CPU, RAM, disk & tape arrays,
         ATM for LAN/WAN, ?? for CAN, ?? for OS

    [Chart: Mainframe (1 M$), Mini (100 K$), Micro (10 K$), Nano;
     disks shrinking 9", 5.25", 3.5", 2.5", 1.8"]

    These parts will be inexpensive (commodity components)
    Systems will be arrays of these parts
    Software challenge: how to program arrays
     Future SuperServer

   [Figure: an array of 100 nodes on a High Speed Network (10 Gb/s):
    1,000 discs = 10 Terabytes; 100 tape transports = 1,000 tapes
    = 1 PetaByte; 1 Tips of processing]

   Array of processors, disks, tapes, comm lines
   Challenge:
          How to program it
          Must use parallelism:
            Pipeline:  hide latency
            Partition: bandwidth, scaleup
                        The Hardware is in Place and
                             Then A Miracle Occurs

                          SNAP
                          Scaleable Network And Platforms
                            Commodity Distributed OS
                              built on
                            Commodity Platforms
                            Commodity Network Interconnect


               Why Parallel Access To Data?
   At 10 MB/s:                                  1,000 x parallel:
   1.2 days to scan 1 Terabyte                  ~1.7 minute SCAN

                              Parallelism:
                              divide a big problem
                               into many smaller ones
                                to be solved in parallel.
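The scan arithmetic on this slide is easy to verify; a minimal sketch (Python here is purely illustrative):

```python
TB = 10**12                 # 1 Terabyte
rate = 10 * 10**6           # one disk arm: 10 MB/s
serial = TB / rate          # seconds for a serial scan
print(serial / 86_400)      # ~1.16 days for the serial scan
print(serial / 1_000 / 60)  # ~1.67 minutes with 1,000-way parallelism
```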
             DataFlow Programming
        Prefetch & Postwrite Hide Latency

  • Can't wait for the data to arrive
  • Need a memory that gets the data in advance (~100 MB/s)

  • Solution:
     • Pipeline from source (tape, disc, ram...) to cpu cache
     • Pipeline results to destination




                        The New Law of Computing

 Grosch's Law:           2x $ is 4x performance

 Parallel Law:           2x $ is 2x performance
     Needs linear speedup and linear scaleup
     Not always possible

 [Chart: 1 MIPS costs 1$; Grosch's Law would predict 1,000 MIPS for
  32$ (.03$/MIPS); the Parallel Law gives 1,000 MIPS for 1,000$]
      Parallelism: Performance is the Goal

     Goal is to get 'good' performance.
    Law 1: parallel system should be
           faster than serial system

     Law 2: parallel system should give
              near-linear scaleup or
              near-linear speedup or
              both.
   Parallelism is faster, not cheaper:
              trades money for time.
                        The Perils of Parallelism

   [Chart: a bad speedup curve shows no parallelism benefit; a good
    curve stays near linearity but is bent by three perils (Startup,
    Interference, Skew) as processors & discs are added]

      Startup:      creating processes,
                    opening files,
                    optimization
      Interference: device (cpu, disc, bus);
                    logical (lock, hotspot, server, log, ...)
      Skew:         if tasks get very small, variance > service time
                    Kinds of Parallel Execution


   Pipeline:  one sequential program feeds its output
              to the next sequential program

   Partition: many copies of the same sequential program,
              each running on a different partition of the data;
              outputs split N ways, inputs merge M ways
                              Data Rivers
                        Split + Merge Streams

                            N X M Data Streams

                                              M Consumers
                        N producers
                                      River

                 Producers add records to the river,
                 Consumers consume records from the river
                 Purely sequential programming.
                 River does flow control and buffering
                       does partition and merge of data records
                  River = Exchange operator in Volcano.
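A data river can be sketched with ordinary queues standing in for the network streams (an illustrative toy, not any product's implementation): producers hash-split records across lanes, consumers drain their lane purely sequentially, and the river does the flow control and buffering.

```python
import queue
import threading

def consume(lane, out):
    """Consumer: sequentially drain one lane of the river."""
    while True:
        rec = lane.get()
        if rec is None:          # None is the end-of-stream sentinel
            break
        out.append(rec)

def produce(river, records):
    """Producer: purely sequential code; the river does the split."""
    for rec in records:
        river[hash(rec) % len(river)].put(rec)   # route by hash

M = 2                                            # M consumers
river = [queue.Queue() for _ in range(M)]        # one lane per consumer
outs = [[] for _ in range(M)]
threads = [threading.Thread(target=consume, args=(river[i], outs[i]))
           for i in range(M)]
for t in threads:
    t.start()
produce(river, ["a", "b", "c", "d"])             # producer 1
produce(river, ["e", "f"])                       # producer 2 (N = 2)
for lane in river:
    lane.put(None)                               # close the river
for t in threads:
    t.join()
print(sorted(outs[0] + outs[1]))                 # ['a','b','c','d','e','f']
```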
              Partitioned Data and Execution
             Spreads computation and IO among processors

   [Figure: a Count operator runs on each of five partitions of a table
    (A...E, F...J, K...N, O...S, T...Z); a top-level Count merges the
    sub-counts]

   Partitioned data gives
          NATURAL execution parallelism
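The partitioned COUNT above can be mimicked in a few lines (an illustrative sketch; the table and its A...E-style ranges are hypothetical, and threads stand in for the processor/disk nodes):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, range-partitioned on the first letter of Name.
partitions = {
    "A...E": ["Adams", "Baker"],
    "F...J": ["Frank"],
    "K...N": ["Kim", "Lee", "Ng"],
    "O...S": ["Page"],
    "T...Z": ["Wong", "Young"],
}

with ThreadPoolExecutor() as pool:
    # Each "node" counts its own partition in parallel...
    subcounts = list(pool.map(len, partitions.values()))

total = sum(subcounts)   # ...and a top-level node merges the sub-counts.
print(subcounts, total)  # [2, 1, 3, 1, 2] 9
```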
                Partitioned + Merge + Pipeline
                          Execution
   [Figure: five Joins feed five Sorts over partitions A...E through
    T...Z; a single Merge combines the five sorted streams]

      Pure dataflow programming
         gives linear speedup & scaleup,
         but the top node may be a bottleneck.
         So....
                        N x M way Parallelism

   [Figure: five Joins feed five Sorts over partitions A...E through
    T...Z; the sorted streams fan out to three Merges]

                 N inputs, M outputs, no bottlenecks.
                  Why are Relational Operators
                   Successful for Parallelism?

  Relational data model: uniform operators
                         on uniform data streams,
                         closed under composition

  Each operator consumes 1 or 2 input streams
  Each stream is a uniform collection of data
  Sequential data in and out: pure dataflow
  Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...)
                      requires innovation

                    AUTOMATIC PARALLELISM
                                        SQL
   a NonProcedural Programming Language

 • SQL: a functional programming language
                    that describes the answer set.
 • Optimizer picks best execution plan
    • Picks data flow web (pipeline),
    • degree of parallelism (partitioning)
    • other execution parameters (process placement, memory,...)
     [Figure: on the Planning side, a GUI and the Schema feed the
      Optimizer, which produces a Plan; on the Execution side, a Monitor
      oversees Executors connected by Rivers]
     Database Systems “Hide” Parallelism


       • Automate system management via tools
             • data placement
             • data organization (indexing)
             • periodic tasks (dump / recover / reorganize)
       • Automatic fault tolerance
             • duplex & failover
             • transactions
       • Automatic parallelism
             • among transactions (locking)
             • within a transaction (parallel execution)
                        Success Stories

 • Online Transaction Processing
        • many little jobs
        • SQL systems support 3,700 tps-A (24 cpu, 240 disk)
        • SQL systems support 21,000 tpm-C (110 cpu, 800 disk)

 • Batch (decision support and utility)
        • few big jobs, parallelism inside
        • scan data at 100 MB/s
        • linear scaleup to 50 processors
                        Kinds of Partitioned Data

    Split a SQL table to subset of nodes & disks

     Partition within set:  Range, Hash, or Round Robin

     [Figure: five disks per scheme, holding ranges A...E, F...J,
      K...N, O...S, T...Z]

     Range:       good for equijoins, range queries, group-by
     Hash:        good for equijoins
     Round Robin: good to spread load

     Shared disk and memory are less sensitive to partitioning;
     shared nothing benefits from "good" partitioning
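The three schemes reduce to three small routing functions; a sketch (the boundary values and keys are hypothetical):

```python
import itertools

def range_partition(key, boundaries):
    """Route key to the first range whose upper bound covers it."""
    for i, bound in enumerate(boundaries):
        if key <= bound:
            return i
    return len(boundaries)          # falls into the last range

def hash_partition(key, n):
    return hash(key) % n            # spreads keys; good for equijoins

def round_robin(counter, n):
    return next(counter) % n        # spreads load evenly

bounds = ["E", "J", "N", "S"]       # A...E, F...J, K...N, O...S, T...Z
print(range_partition("Gray", bounds))   # 1  (lands in F...J)
print(range_partition("Won", bounds))    # 4  (lands in T...Z)
rr = itertools.count()
print(round_robin(rr, 5), round_robin(rr, 5), round_robin(rr, 5))  # 0 1 2
```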
                        Picking Data Ranges

      Disk Partitioning
      For range partitioning, sample load on disks.
         Cool hot disks by making range smaller
      For hash partitioning,
         Cool hot disks by mapping some buckets to others

      River Partitioning
      Use hashing and assume uniform
      If range partitioning, sample data and use
          histogram to level the bulk

      Teradata, Tandem, Oracle use these tricks


                         Parallel Data Scan
    Select image
    from landsat
    where date between 1970 and 1990            -- temporal predicate
      and overlaps(location, :Rockies)          -- spatial predicate
      and snow_cover(image) > .7;               -- image predicate

       [Figure: the Landsat table (date, loc, image) spans rows from
        1/2/72 33N 120W to 4/8/95 34N 120W; the date, location, & image
        tests run in parallel and return the answer images]

       Assign one process per processor/disk:
           find images with the right date & location;
           analyze the image and, if 70% snow, return it.
                        Parallel Aggregates
 For aggregate function, need a decomposition strategy:

        count(S) = Σ count(s(i)), ditto for sum()
        avg(S)   = (Σ sum(s(i))) / (Σ count(s(i)))
        and so on...

 For groups,
    sub-aggregate groups close to the source
    drop sub-aggregates into a hash river.

                   [Figure: Count runs on each partition A...E through
                    T...Z of the table; a top-level Count merges the
                    sub-aggregates]
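The decomposition rules above translate directly to code; a sketch with a hypothetical partitioned column:

```python
# Hypothetical column, spread over three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]

# Each node returns a local (count, sum) pair...
subs = [(len(p), sum(p)) for p in partitions]

# ...and the merge step combines the sub-aggregates:
count = sum(c for c, _ in subs)        # count(S) = Σ count(s(i))
total = sum(s for _, s in subs)        # sum(S)   = Σ sum(s(i))
avg = total / count                    # avg(S)   = sum(S) / count(S)
print(count, total, avg)               # 8 31 3.875
```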
                                        Parallel Sort
 M input, N output sort design

 [Figure: a Scan (or other source) feeds a Range- or Hash-Partitioned
  River; sub-sorts generate runs; merges combine the runs. The disk and
  merge steps are not needed if the sort fits in memory.]

 Scales (nearly) linearly because
    log(10^6) / log(10^12) = 6/12  =>  only 2x slower

 Sort is the benchmark from hell for shared-nothing machines:
    net traffic = disk bandwidth, no data filtering at the source
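The range-partitioned sort can be sketched in a few lines: route each key into a lane of the river, sub-sort each lane, and the lanes concatenate into globally sorted output (illustrative; a single process stands in for the partitioned nodes, and the in-memory case skips the run/merge machinery):

```python
def parallel_sort(partitions, boundaries):
    """Range-partition the river, then sub-sort each lane.
    Lane i holds keys in (boundaries[i-1], boundaries[i]], so the
    sorted lanes simply concatenate into globally sorted output."""
    lanes = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:                       # each source scans its data
        for key in part:
            i = sum(key > b for b in boundaries)  # route into the river
            lanes[i].append(key)
    return [sorted(lane) for lane in lanes]       # sub-sorts generate runs

lanes = parallel_sort([[9, 1, 7], [4, 8, 2]], boundaries=[3, 6])
print(lanes)   # [[1, 2], [4], [7, 8, 9]]  (concatenation is fully sorted)
```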
    Blocking Operators = Short Pipelines
      An operator is blocking
         if it does not produce any output
         until it has consumed all its input.

      [Figure: a database-load template with three blocked phases:
       Scan the tape file, Sort Runs, Merge Runs, Table Insert into the
       SQL table; then Sort Runs, Merge Runs, Index Insert for each of
       Index 1, Index 2, and Index 3]

      Examples:
         Sort,
         Aggregates,
         Hash-Join (reads all of one operand)

      Blocking operators kill pipeline parallelism
      and make partition parallelism all the more important.
                          Hash Join

 Hash the smaller table into N buckets (hope N=1).
 If N=1: read the larger table and probe the hash buckets.
 Else: hash the outer to disk, then
     join bucket-by-bucket.

     [Figure: the Left Table probes Hash Buckets built from the
      Right Table]

 Purely sequential data behavior
 Always beats sort-merge and nested loops
    unless the data is clustered.
 Good for equi-, outer-, and exclusion-joins
 Lots of papers,
    products just appearing (what went wrong?)

 Hash reduces skew
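The in-memory (N=1) case of hash join is short enough to sketch (illustrative code, not any product's implementation; the tables are hypothetical):

```python
def hash_join(left, right, key_l=0, key_r=0):
    """Build a hash table on the smaller (right) operand, then probe it
    with one sequential pass over the larger (left) operand."""
    buckets = {}
    for row in right:                            # build phase
        buckets.setdefault(row[key_r], []).append(row)
    out = []
    for row in left:                             # probe phase
        for match in buckets.get(row[key_l], []):
            out.append(row + match)
    return out

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y")]
print(hash_join(left, right))   # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]
```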
              Observation: Execution “easy”
                   Automation “hard”
 It is “easy” to build a fast parallel execution environment
      (no one has done it, but it is just programming)

 It is hard to write a robust and world-class query optimizer.
      There are many tricks
      One quickly hits the complexity barrier

 Common approach:
   Pick best sequential plan
   Pick degree of parallelism based on bottleneck analysis
   Bind operators to processes
   Place processes at nodes
   Place scratch files near processes
   Use memory as a constraint
                Systems That Work This Way

     Shared Nothing:
           Teradata:          400 nodes
           Tandem:            110 nodes
           IBM / SP2 / DB2:    48 nodes
           ATT & Sybase:      112 nodes
           Informix / SP2:     48 nodes

     Shared Disk:
           Oracle:            170 nodes
           Rdb:                24 nodes

     Shared Memory:
           Informix:            9 nodes
           RedBrick:            ? nodes
          Research Problems

          [Figure: the 100-node SuperServer array: 1,000 discs =
           10 Terabytes; 100 tape transports = 1,000 tapes = 1 PetaByte;
           1 Tips; High Speed Network (10 Gb/s)]

 • Automatic data placement
      (partition: random or organized)
 • Automatic parallel programming
      (process placement)
 • Parallel concepts, algorithms & tools
 • Parallel Query Optimization
 • Execution techniques:
      load balance,
      checkpoint/restart,
      pacing
                                       Summary
• Cyberspace is growing
• Databases are the dirt of cyberspace
      PCs are the bricks, networks are the mortar.
      Many little devices: performance via arrays of {cpu, disk, tape}
• Then a miracle occurs: a scaleable distributed OS and net
   • SNAP: Scaleable Networks and Platforms
• Then parallel database systems give software parallelism
   • OLTP: lots of little jobs run in parallel
   • Batch TP: data flow & data partitioning
   • Automate processor & storage array administration
   • Automate processor & storage array programming
• 2000 platforms as easy as 1 platform.