Parallel Database Systems 101 Jim Gray Gordon Bell Micr - PowerPoint

Document Sample
Parallel Database Systems 101 Jim Gray Gordon Bell Micr - PowerPoint Powered By Docstoc
					                                                Parallel Database
                                                  Systems 101
                                                           Jim Gray & Gordon Bell
                                                            Microsoft Corporation

                               presented at VLDB 95, Zurich Switzerland, Sept 1995

     • Detailed notes available from Gray@Microsoft.com
             – this presentation is 120 of the 174 slides (time limit)
             – Notes in PowerPoint7 and Word7




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                     1
                                                                   Outline
      • Why Parallelism:
              –technology push
              –application pull
      • Benchmark Buyer‟s Guide
              – metrics
              – simple tests
      • Parallel Database Techniques
              – partitioned data
              – partitioned and pipelined execution
              – parallel relational operators
      • Parallel Database Systems
              – Teradata. Tandem, Oracle, Informix, Sybase, DB2, „RedBrick
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             2
                     Kinds Of Information Processing

                                              Point-to-Point         Broadcast
                                                     conversation   lecture
           Immediate                                 money          concert      Net
                                                                                 work

              Time                                   mail           book
                                                                    newspaper    Data
              Shifted                                                            Base

  Its ALL going electronic
  Immediate is being stored for analysis (so ALL database)
  Analysis & Automatic Processing are being added
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                    3
               Why Put Everything in Cyberspace?

                                                                                              Point-to-Point




                                                                   mmediate OR Time Delayed
                                                                                              OR
        Low rent                                                                              Broadcast
          min $/byte

        Shrinks time                                                                                       Network
         now or later
                                                                                               Locate
        Shrinks space                                                                          Process
         here or there                                                                         Analyze
                                                                                               Summarize
        Automate processing                                                                                    Data
         knowbots                                                                                              Base
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                      4
                                                             Databases:
                                           Information At Your Fingertips™
                                                 Information Network™
                                                Knowledge Navigator™


 • All information will be in an online database (somewhere)
 • You might record everything you
         • read: 10MB/day, 400 GB/lifetime (two tapes)
         • hear: 400MB/day, 16 TB/lifetime (a tape per decade)
         • see: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday)
 •    Data storage, organization, and analysis is a challenge.
 •    That is what databases are about
 •    DBs do a good job on “records”
 •    Now working on text, spatial, image, and sound.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             5
                      Database Store ALL Data Types

  • The Old World:
     – Millions of objects                                                             • The New World:
     – 100-byte objects                                                                   • Billions of objects
                          People                                                          • Big objects (1MB)
                     Name Address
                                                                                          • Objects have behavior
                     David        NY
                                                                                             (methods)
                     Mike         Berk

                     Won        Austin                                      People
                                                                                                     Paperless office
                                                                                                     Library of congress online
                                                      Name Address Papers            Picture Voice   All information online
                                                      David        NY                                    entertainment
                                                                                                         publishing
                                                      Mike         Berk
                                                                                                         business
                                                      Won          Austin                            Information Network,
                                                                                                     Knowledge Navigator,
                                                                                                     Information at your fingertips
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                                      6
            Magnetic Storage Cheaper than Paper

  • File Cabinet:                                 cabinet (4 drawer)       250$
                                                  paper (24,000 sheets)    250$
                                                  space (2x3 @ 10$/ft2)    180$
                                                  total                    700$
                                  3 ¢/sheet
  • Disk:                                         disk (8 GB =)    2,000$
                                                  ASCII: 4 m pages
                                  0.05 ¢/sheet                     (60x cheaper)
  • Image:                                        200 k pages
                                  1 ¢/sheet                        (3x cheaper than paper)
  • Store everything on disk

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                             7
                                                            Moore‟s Law
•XXX doubles every 18 months 60% increase per year
   –Micro Processor speeds
                                  1GB
   –chip density                 128MB
   –Magnetic disk density         8MB
                                       1 chip memory size
                                       ( 2 MB to 32 MB)
   –Communications bandwidth     1MB
        WAN bandwidth approaching LANs                             128KB

•Exponential Growth:                      8KB
                                                        1980        1990       2000
                                             1970
   –The past does not matter            bits: 1K 4K 16K 64K 256K 1M 4M 16M64M 256M

   –10x here, 10x there, soon you're talking REAL change.
•PC costs decline faster than any other platform
   –Volume & learning curves
   –PCs will be the building bricks of all future systems
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                10
                         In The Limit: The Pico Processor
                                                                      3
                                                                   1 MM
                                                                          1 M SPECmarks,
                                          Pico Processor                           1TFLOP

            1 MB                       10 pico-second ram                 106 clocks to bulk ram

                                     10 nano-second ram
                                                                          Event-horizon on chip.
          100 MB
           10 GB 10 microsecond ram                                       VM reincarnated
                1 TB            10 millisecond disc
                                                                          Multi-program cache
          100 TB 10 second tape archive                                     On-Chip SMP



                             Terror Bytes!
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                   14
                                               What's a Terabyte?
                                              (250 K$ of Disk @ .25$/MB)
         1 Terabyte
            1,000,000,000 business letters                                150 miles of bookshelf
              100,000,000 book pages                                       15 miles of bookshelf
               50,000,000 FAX images                                        7 miles of bookshelf
               10,000,000 TV pictures (mpeg)                               10 days of video
                    4,000 LandSat images

         Library of Congress (in ASCII) is 25 TB

          1980: 200 M$ of disc                                     10,000 discs
                  5 M$ of tape silo                                10,000 tapes

          1995: 250 K$ of magnetic disc                               70 discs
                500 K$ of optical disc robot                         250 platters
                 50 K$ of tape silo                                   50 tapes

         Terror Byte !!
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                   23
                                           Summary (of storage)
    • Capacity and cost are improving fast (100x per decade)
    • Accesses are getting larger (MOX, GOX, SCANS)
    • BUT Latencies and bandwidth are not improving much
                            • (3x per decade)
    • How to deal with this???
    • Bandwidth:
       – Use partitioned parallel access (disk & tape farms)
    • Latency
       – Pipeline data up storage hierarchy (next section)



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   30
                                    Interesting Storage Ratios
    • Disk is back to 100x cheaper than RAM
    • Nearline tape is only 10x cheaper than disk
      – and the gap is closing!       RAM $/MB
                                                                         Disk $/MB
  100:1
                                                                       Disk & DRAM look good

                                               30:1
    10:1

                                  Disk $/MB
                                                                   ?   ??? Why bother with Tape
                                  Nearline Tape


       1:1
                 1960 1970 1980 1990 2000
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                  31
                     Performance =Storage Accesses
                         not Instructions Executed
  • In the “old days” we counted instructions and IO‟s
  • Now we count memory references
  • Processors wait most of the time
               Where the time goes:
   clock ticks used by AlphaSort Components
                                                                        70 MIPS
     Disc Wait                             Sort
                                             Sort                       “real” apps have worse Icache
          Disc Wait                                                OS   misses so run at 60 MIPS
                                                                        if well tuned, 20 MIPS if not

              Memory Wait                                                     I-Cache
                                                      B-Cache                  Miss
                                                      Data Miss         D-Cache
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
                                                                        Miss                            32
                                     Storage Latency:
                                  How Far Away is the Data?
                                       Andromdeda
                             9
                        10       Tape /Optical                                     2,000 Years
                                 Robot
          Clock Ticks




                        10 6 Disk                                   Pluto              2 Years




                                                                   Sacramento           1.5 hr
                   100           Memory
                        10       On Board Cache                      This Campus       10 min
                         2       On Chip Cache                               This Room
                         1       Registers                                     My Head  1 min
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                 33
                                                  Network Speeds

                                                                          Comm Speedups
• Network speeds grow 60% / year                                   1e 9
• WAN speeds limited by politics
                                                                   1e 8
        • if voice is X$/minute, how much is video?
• Switched 100Mb Ethernet                                          1e 7      Processors (i/s)
   • 1,000x more bandwidth                                         1e 6
• ATM is a scaleable net:                                                         LANs &
                                                                   1e 5
   • 1 Gb/s to desktop & wall plug                                                WANs (b/s)
   • commodity: same for LAN, WAN                                  1e 4
• 1Tb/s fibers in laboratory                                       1e 3
                                                                          1960 1970 1980 1990 2000
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                    Year         34
                              Network Trends & Challenge
                                                                   1010
      •    Bandwidth UP 104 Price DOWN                             10 1 Gb/s
                                                                     9



      •    Speed-of-light unchanged                                108
                                                                                  CAN          PC Bus

      •
                                                                     7
           Software got worse                                      10
                                                                                 LAN
                                                                   10 1 Mb/s
                                                                     6

      •    Standard Fast Nets                                        5           WAN
                                                                   10
                      »   ATM                                        4
                                                                   10
                      »   PCI                                                    POTS
                      »   Myrinet                                  10 1 Kb/s
                                                                     3


                                                                     2
                      »   Tnet                                     10          1975     1985      1995 2000
                                                                     1965
      • HOPE:
         – Commodity Net
         – Good software
      • Then clusters become a SNAP!
      • commodity: 10k$/slice
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                        35
                                           The Seven Price Tiers

•                  10$:                                 wrist watch computers
•                 100$:                                 pocket/ palm computers
•                1,000$:                                portable computers
•               10,000$:                                personal computers (desktop)
•              100,000$:                                departmental computers (closet)
•             1,000,000$:                               site computers (glass house)
•             10,000,000$:                              regional computers (glass castle)
               SuperServer: Costs more than 100,000 $
               “Mainframe” Costs more than 1M$
               Must be an array of processors,
                                   disks, tapes
                                   comm ports
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                            36
                                The New Computer Industry

 • Horizontal integration                                              Function      Example
       is new structure                                                Operation     AT&T
 • Each layer picks best                                              Integration    EDS
       from lower layer.                                            Applications     SAP
 • Desktop (C/S) market                                              Middleware      Oracle
    • 1991: 50%                                                        Baseware      Microsoft
    • 1995: 75%                                                         Systems      Compaq
                                                                   Silicon & Oxide   Intel & Seagate




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                   38
                      Software Economics: Bill‟s Law

  • Bill Joy‟s law (Sun):
       Don‟t write software for less than 100,000 platforms.
          @10M$ engineering expense, 1,000$ price
  • Bill Gate‟s law:
    Don‟t write software for less than 1,000,000 platforms.
        @10M$ engineering expense, 100$ price
  • Examples:
     • UNIX vs NT: 3,500$ vs 500$
     • UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$
     • No Spreadsheet or Presentation pack on UNIX/VMS/...
  • Commoditization of base Software & Hardware
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   40
                                   Thesis
                      Many Little will Win over Few Big


                                   1 M$
                                                                   100 K$           10 K$

                                                                                    Micro            Nano
                         Mainframe                                  Mini


                                                                                              2.5"   1.8"
                                                                            5.25"      3.5"
                                                               9"
                                  14"

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                            44
                                         Year 2000 4B Machine

  • The Year 2000 commodity PC (3K$)
                                                                   1 Bips Processor
  •Billion Instructions/Sec
                                                                    .1 B byte RAM
  •Billion Bytes RAM
                                                                    10 B byte Disk
  •Billion Bits/s Net
  • 10 B Bytes Disk

  •Billion Pixel display
          • 3000 x 3000 x 24 pixel

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                      45
              4 B PC‟s: The Bricks of Cyberspace

    • Cost 3,000 $
    • Come with
       • OS (NT, POSIX,..)
       • DBMS
       • High speed Net
       • System management
       • GUI / OOUI
       • Tools
    • Compatible with everyone else
    • CyberBricks
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   46
                   Implications of Hardware Trends
      Large Disc Farms will be inexpensive ( 100$/GB)

      Large RAM databases will be inexpensive (1,000$/GB)

      Processors will be inexpensive                               1k SPECint
                                                                       CPU
      So
               The building block will be
                 a processor                                                    50 GB Disc
                 with large RAM                                    5 GB RAM
                 lots of Disc




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                             47
   Implication of Hardware Trends: Clusters


                                                                                  CPU



                                                                                           50 GB Disc
                                                                                5 GB RAM




                                                                   Future Servers are CLUSTERS
                                                                   of processors, discs

                                                                   Distributed Database techniques
                                                                   make clusters work


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                        48
                                                                   100 Tape Transports
       Future SuperServer                                             = 1,000 tapes
                                                                      = 1 PetaByte
           4T Machine




                                                                                         High Speed Network ( 10 Gb/s)
  Array of 1,000 4B machines                                        1,000 discs =
     processors,                                                    10 Terrorbytes
     disks,
     tapes                                                             100 Nodes
     comm lines                                                          1 Tips
  A few MegaBucks
  Challenge:
     Manageability
     Programmability
     Security
     Availability
     Scaleability
     Affordability
  As easy as a single system
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                   49
              Great Debate: Shared What?

  Shared Memory               Shared Disk             Shared Nothing
      (SMP)                                              (network)
                              CLIENTS              CLIENTS
           CLIENTS

          Processors
           Memory




  Easy to program                                   Hard to program
  Difficult to build                                Easy to build
  Difficult to scaleup                              Easy to scaleup
  Sequent, SGI, Sun         VMScluster, Sysplex   Tandem, Teradata, SP2

        Winner will be a synthesis of these ideas
        Distributed shared memory (DASH, Encore) blurs distinction
        between Network and Bus (locality still important)
                        But gives Shared memory message cost.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey          50
                       Scaleables: Uneconomic So Far
      • A Slice is a processor, memory, and a few disks.
      • Slice Price of Scaleables so far is 5x to 10x markup
         – Teradata: 70K$ for a Intel 486 + 32MB + 4 disk.
         – Tandem: 100k$ for a MipsCo R4000 + 64MB + 4 disk
         – Intel:     75k$ for an I860 +32MB + 2 disk
         – TMC:       75k$ for a SPARC 3 + 32MB + 2 disk.
         – IBM/SP2: 100k$ for a R6000 + 64MB + 8 disk
      • Compaq Slice Price is less than 10k$
      • What is the problem?
         – Proprietary interconnect
         – Proprietary packaging
         – Proprietary software (vendorIX)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   51
                                 Summary
   • Storage trends force pipeline & partition parallelism
      – Lots of bytes & bandwidth per dollar
      – Lots of latency
   • Processor trends force pipeline & partition
      – Lots of MIPS per dollar
      – Lots of processors
   • Putting it together Scaleable Networks and Platforms)
      – Build clusters of commodity processors & storage
      – Commodity interconnect is key (S of PMS)
           » Traditional interconnects give 100k$/slice.
              – Commodity Cluster Operating System is key
              – Fault isolation and tolerance is key
              – Automatic Parallel Programming is key
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   52
                                              The Hardware is in Place and
                                                    Then A Miracle Occurs
                   ?

                                                                   SNAP
                                                    Scaleable Network And Platforms
                                                            Commodity Distributed OS
                                                              built on
                                                            Commodity Platforms
                                                            Commodity Network Interconnect
                                                    Enables
                                                      Parallel Applications
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                             53
         Why Parallel Access To Data?
    At 10 MB/s             1,000 x parallel
    1.2 days to scan       1.5 minute SCAN.

               1 Terabyte                                            1 Terabyte




                              10 MB/s                Parallelism:
                                                     divide a big problem
                                                      into many smaller ones
                                                       to be solved in parallel.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                   56
                      DataFlow Programming
                 Prefetch & Postwrite Hide Latency

   • Can't wait for the data to arrive (2,000 years!)
   • Need a memory that gets the data in advance ( 100MB/S)

   • Solution:
      • Pipeline from source (tape, disc, ram...) to cpu cache
      • Pipeline results to destination




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   57
                       Why are Relational Operators
                       So Successful for Parallelism?
    Relational data model                                          uniform operators
                                                                   on uniform data stream
                                                                   Closed under composition

    Each operator consumes 1 or 2 input streams
    Each stream is a uniform collection of data
    Sequential data in and out: Pure dataflow
    partitioning some operators (e.g. aggregates, non-equi-join, sort,..)
                     requires innovation

                          AUTOMATIC PARALLELISM
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                              58
            Database Systems “Hide” Parallelism
       • Automate system management via tools
               • data placement
               • data organization (indexing)
               • periodic tasks (dump / recover / reorganize)
       • Automatic fault tolerance
               • duplex & failover
               • transactions
       • Automatic parallelism
               • among transactions (locking)
               • within a transaction (parallel execution)

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   59
                                  Automatic Parallel OR DB
     Select image
     from landsat                                                                Temporal
     where date between 1970 and 1990
     and overlaps(location, :Rockies)                                                Spatial
     and snow_cover(image) >.7;
                                                                              Image
          Landsat                          Assign one process per processor/disk:              Answer
 date loc image                               find images with right data & location           image
  1/2/72      33N                             analyze image, if 70% snow, return it
  .           120W
  .           .
  .           .
  .           .
  .           .
  ..          .
  .           .                                                    date, location,
  .           .
  4/8/95      34N                                                  & image tests
              120W

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                    60
                                                                   Outline
      • Why Parallelism:
              – technology push
              – application pull
      • Benchmark Buyer‟s Guide
         –metrics
         –simple tests
      • Parallel Database Techniques
              – partitioned data
              – partitioned and pipelined execution
              – parallel relational operators
      • Parallel Database Systems
              – Teradata. Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             61
         Parallelism: Speedup & Scaleup
      Speedup:         100GB            100GB
            Same Job,
            More Hardware
            Less time

      Scaleup:                                                     100GB                1 TB
            Bigger Job,
            More Hardware
            Same time
                                                                    1 k clients   10 k clients
      Transaction                                                  100GB               1 TB
      Scaleup:
            more clients/servers                                                     Server
            Same response time                                     Server

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                 62
                                The New Law of Computing

 Grosch's Law:
                                                             2x $ is 4x performance

                                                     1,000 MIPS
                                                         32 $
                      1 MIPS                          .03$/MIPS              2x $ is
                      1$
                                                                         2x performance

                                                                   Parallel Law:
                                                                                   1,000 MIPS
     Needs                                                                           1,000 $  1 MIPS
      Linear Speedup and Linear Scaleup                                                          1$
      Not always possible
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                  63
           Parallelism: Performance is the Goal
         Goal is to get 'good' performance.
        Law 1: parallel system should be
               faster than serial system

          Law 2: parallel system should give
                   near-linear scaleup or
                   near-linear speedup or
                   both.


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   64
          The New Performance Metrics
  • Transaction Processing Performance Council:
         – TPC-A: simple transaction
         – TPC-B: server only, about 3x lighter than TPC-A
         – Both obsoleted by TPC-C (no new results after 6/7/95)
  • TPC-C (revision 3) Transactions Per Minute tpm-C
         – Mix of 5 transactions: query, update, minibatch
         – Terminal price eliminated
         – about 5x heavier than tpcA (so 3.5 ktpcA - 20 ktpmC)
  • TPC-D approved in March 1995 - Transactions Per Hour
         – Scaleable database (30 GB, 100GB, 300GB,... )
         – 17 complex SQL queries (no rewrites, no hints without permission)
         – 2 load/purge queries
         – No official results yet, many “customer” results.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey           65
                                                           TPC-C Results 12/94
                                                          Courtesy of Charles Levine of Tandem (of course)

                                   3000
                                                                                                                                       Tan dem
                                                                                                                                       HP-H7 0
                                                                                                                                       AS40 0
                                                                                                                                       HP90 00
                                                   HP9000                                                                              RS600 0
                                                                                                                                       Sun
                                   2000                                                                                                DG
                                                                                                                                       HP-H7 0
                                                  AS400                                                                                HP T500 -8
                                                                                                Tandem Himalaya Server
                   COST ($/TPMC)




                                                            16 c pus          32 c pus                   64 c pus                                   11 2 cpus

                                                    RS6000

                                   1000                             HP T500

                                                              SUN

                                                  HP 9000 E55, H70




                                          0
                                              0      2000        4000         6000       8000   10000   12000       14000   16000   18000        20000    22000

                                                                                            PERFORMANCE (TPMC)


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                                                                  66
                                                    Success Stories

   • Online Transaction Processing
           • many little jobs
           • SQL systems support 3700 tps-A
                                                     (24 cpu, 240 disk)
           • SQL systems support 21,000 tpm-C
                                                    (112 cpu,670 disks)   hardware

   • Batch (decision support and Utility)
           • few big jobs, parallelism inside
           • Scan data at 100 MB/s
           • Linear Scaleup to 500 processors
                                                                          hardware
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                     67
                                     The Perils of Parallelism
                          The Good                          A Bad Speedup Curve     A Bad Speedup Curve
              NewTime
    Speedup = OldTime



                        Speedup Curve                           No Parallelism             3-Factors
                                                                    Benefit




                                                                                                Interference
                                                                                      Startup
                                         it   y
                                      ar




                                                                                                               Skew
                                   e
                                Lin                                 Linearity


                        Processors & Discs                     Processors & Discs
                                                                                     Processors & Discs

          Startup:                                     Creating processes
                                                       Opening files
                                                       Optimization
          Interference:                                Device (cpu, disc, bus)
                                                       logical (lock, hotspot, server, log,...)
          Skew:                                        If tasks get very small, variance > service time

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                      68
                                Benchmark Buyer's Guide
                                                                   The Whole Story
 Things to ask                                                     (for any system)
                                                                                         The Benchmark
 When does it stop scaling?                                                                      Report

 Throughput numbers,




                                                                                       Throughput
   Not ratios.


 Standard benchmarks allow
    Comparison to others
                                                                                      Processors & Discs
 Comparison to sequential


 Ratios and non-standard benchmarks are red flags.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                           69
                            Performance 101: Scan Rate
       Disk is 3MB/s to 10MB/s
       Record is 100B to 200B (TPC-D 110...160, Wisconsin 204)
       So should be able to read 10kr/s to 100kr/s

       Simple test: Time this on a 1M record table
                       SELECT count(*) FROM T WHERE x < :infinity;
                       (table on one disk, turn off parallelism)
                                                                     Scan
       Typical problems:
               disk or controller is an antique
                                                                      Agg
                                                                     Count
               no read-ahead in operating system or DB
               small page reads (2kb)
               data not clustered on disk
               big cpu overhead in record movement

       Parallelism is not the cure for these problems
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey            70
                                               Parallel Scan Rate
  Simplest parallel test:
    Scaleup previous test:
      4 disks,                                                      Scan
                                                                     Scan
      4 controllers,                                                  Scan
                                                                       Scan
      4 processors                                                   Agg
      4 times as many records                                       Count
                                                                      Agg
                                                                     Count
                                                                       Agg
                                                                      Count
            partitioned 4 ways.                                         Agg
                                                                       Count
    Same query

          Should have same elapsed time.                                Agg
                                                                       Sum
          Some systems do.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey               71
                                          Parallel Update Rate

        Test: UPDATE T
              SET x = x + :one;

        Test for million row T on 1 disk

        Test for four million row T on 4 disks

        Look for bottlenecks.                                      Log   UPDATE

        After each call, execute ROLLBACK WORK

        See if UNDO runs at the DO speed

        See if UNDO is parallel (scales up)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                  72
            The records/$/second Metric
  • parallel database systems scan data
  • An interesting metric (100 byte record):
     – Record Scan Rate / System Cost
  • Typical scan rates: 1k records/s to 30k records/s
  • Each Scaleable system has a “slice price” guess:
      –   Gateway: 15k$ (P5 + ATM + 2 disks +NT + SQLserver or Informix or Oracle)
      –   Teradata: 75k$
      –   Sequent: 75k$ (P5+2 disks+Dynix+Informix)
      –   Tandem: 100k$
      –   IBM SP2: 130k$ (RS6000+2 disks, AIX, DB2)
     • You can compute slice price for systems later in presentation
     • BAD:                            0.1 records/s/$ (there is one of these)
     • GOOD:                           0.33 records/s/$ (there is one of these)
     • Super!                          1.00 records/s/$ (there is one of these)
     • We should aim at 10 records/s/$ with P6.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                74
                             Embarrassing Questions to
                               Ask Your PDB Vendor
      How are constraints checked?
           ask about unique secondary indices
           ask about deferred constraints
           ask about referential integrity

      How does parallelism interact with
           triggers
           Stored procedures
           OO extensions

      How can I change my 10 TB database design in an hour?
           add index
           add constraint
           reorganize / repartition

      These are hard problems.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   75
                                                                   Outline
    • Why Parallelism:
            – technology push
            – application pull
    • Benchmark Buyer‟s Guide
            – metrics
            – simple tests

    • Parallel Database Techniques
            –partitioned data
            –partitioned and pipelined execution
            –parallel relational operators
    • Parallel Database Systems
            – Teradata. Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             76
                             Automatic Data Partitioning
       Split a SQL table to subset of nodes & disks

       Partition within set:
       Range                                                       Hash                         Round Robin




    A...E F...J        K...N O...S T...Z                   A...E F...J   K...N O...S T...Z     A...E F...J   K...N O...S T...Z

 Good for equijoins,                                      Good for equijoins                 Good to spread load
 range queries
 group-by
        Shared disk and memory less sensitive to partitioning,
        Shared nothing benefits from "good" partitioning
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                             77
                                                 Index Partitioning

   Hash indices partition by hash

                                                                      0...9   10..19   20..29 30..39 40..•




   B-tree indices partition as a forest of trees.
         One tree per range
                                                                      A..C     D..F     G...M   N...R   S..Z




   Primary index clusters data


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                        78
                             Secondary Index Partitioning
In shared nothing, secondary indices are Problematic

Partition by base table key ranges
        Insert: completely local (but what about unique?)
                                                                                  A..Z    A..Z   A..Z
        Lookup: examines ALL trees (see figure)                    A..Z   A..Z

        Unique index involves lookup on insert.                           Base Table

Partition by secondary key ranges
        Insert: two nodes (base and index)
        Lookup: two nodes (index -> base)
                                                                   A..C          G...M
        Uniqueness is easy                                                D..F           N...R   S..•

                                                                          Base Table
         Teradata solution


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                        79
                              Kinds of Parallel Execution

                                                                      Any            Any
                                                                   Sequential     Sequential
    Pipeline                                                        Program        Program




   Partition                                                            Any            Any
                                                                                    Sequential
                                                                     Sequential
       outputs split N ways                                           Program        Program
       inputs merge M ways


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                 80
                                                Data Rivers
                                          Split + Merge Streams
                                                        N X M Data Streams

                                                                           M Consumers
                                          N producers

                                                                   River
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
      does partition and merge of data records

River = Split/Merge in Gamma = Exchange operator in Volcano.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                         81
                                        Partitioned Execution
                  Spreads computation and IO among processors

                                                                         Count

                                     Count               Count           Count     Count      Count




                                                                   A Table
                                     A...E              F...J           K...N    O...S     T...Z


       Partitioned data gives
                        NATURAL parallelism

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                      82
                                        N x M way Parallelism

                                                           Merge     Merge      Merge

                                         Sort                Sort     Sort       Sort       Sort

                                          Join               Join      Join      Join       Join




                                      A...E              F...J      K...N     O...S     T...Z


               N inputs, M outputs, no bottlenecks.

               Partitioned Data
               Partitioned and Pipelined Data Flows
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                   83
                                           Picking Data Ranges

        Disk Partitioning
        For range partitioning, sample load on disks.
           Cool hot disks by making range smaller
        For hash partitioning,
           Cool hot disks by mapping some buckets to others

        River Partitioning
        Use hashing and assume uniform
        If range partitioning, sample data and use
            histogram to level the bulk

        Teradata, Tandem, Oracle use these tricks


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   84
           Blocking Operators = Short Pipelines
        An operator is blocking,
                                                                                                               Database Load
           if it does not produce any output,                                                                  Template has
                                                                                                               three blocked
           until it has consumed all its input                                                                 phases



                                                                   Tape        Scan   Sort Runs   Merge Runs     Table Insert
                                                                   File                                                         SQL Table
                                                                   SQL Table
                                                                   Process


                                                                                      Sort Runs   Merge Runs     Index Insert   Index 1

        Examples:                                                                     Sort Runs   Merge Runs     Index Insert
           Sort,                                                                                                                Index 2


           Aggregates,                                                                Sort Runs   Merge Runs    Index Insert
                                                                                                                                 Index 3

           Hash-Join (reads all of one operand)

        Blocking operators kill pipeline parallelism
        Make partition parallelism all the more important.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                                            85
                   Simple Aggregates (sort or hash?)
       Simple aggregates (count, min, max, ...) can use indices
             More compact
             Sometimes have aggregate info.

       GROUP BY aggregates
         scan in category order if possible (use indices)
         Else
            If categories fit in RAM use RAM category hash table
            Else
                make temp of <category, item>
                sort by category,
                do math in merge step.


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   86
                                                                   Sort
  Used for
     loading and reorganization (sort makes them sequential)
     build B-trees
     reports
         non-equijoins
  Rarely used for aggregates or equi-joins (if hash available




                                                   Sort                    Merge
                          Input                                     Runs           Sorted
                          Data                                                     Data
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                            88
                                                                   Parallel Sort
  M input N output
  Sort design                                                         River is range or hash partitioned
                                                                                                      Merge
  Disk and merge                                                                                      runs
  not needed if
  sort fits in
                                                                                                      Sub-sorts
  memory                                                                                              generate
                                                                                                      runs
  Scales linearly because
                                                                      Range or Hash Partition River
 log(106 )   6
                       =               =>     2x slower
 log(10 )
               12
                             12                                            Scan     or     other source
Sort is benchmark from hell for shared nothing machines
   net traffic = disk bandwidth, no data filtering at the source

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                            89
     SIGMOD Sort Award                                                    1.0E+06
                                                                                                   Sort Records/second vs Time


 Datamation Sort: 1M records (100 B recs)                                 1.0E+05
                                                                                                                    Cray YMP
                                                                                                                                   IBM 3090
                                                                                                                                                       SGI
                                                                                                                                                         Alpha


         1000 seconds 1986
                                                                                                                                  Sequent
           60 seconds 1990                                                1.0E+04                                                                 Intel
                                                                                                                                            HyperCube
           7 seconds 1994
                                                                          1.0E+03                                                 Hardware Sorter

              3.5 seconds 1995 (SGI challenge)                                                                  Tandem
                                                                                                           M68000
                                                                          1.0E+02
         micros finally beat the mainframe!                                                              1985                   1990                     1995


         finally! a UNIX system that does IO                                                       250
                                                                                                           Sort Time on an SGI Challenge
                                                                                                                       1.6 GB (16 M 100-byte records)
 SIGMOD MinuteSort                                                                                 200                 12 cpu,
                                                                                                                       2.2 GB,
                                                                                                                                         write done

                                                                                                                       96 disk           lists merged




                                                                          Elapsed Time (seconds)
         1.1GB, Nyberg,                                            1994                            150
                                                                                                                                                 lists-sorted

                                                                                                                                                 read-done
                  Alpha 3cpu                                                                       100
                                                                                                                                                 pin



         1.6GB, Nyberg,                                            1995                             50



                 SGI Challenge (12 cpu)                                                              0
                                                                                                                1        2         4         6           10
 no SIGMOD PennySort record                                                                                     Threads (Sprocs) devoted to sorting
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                                                              90
                                               Nested Loops Join

If inner table indexed on join cols (b-tree or hash)
     then sequential scan outer (from start key)
     For each outer record
        probe inner table for matching recs

Works best if inner is in RAM (=> small inner




                                                                         Inner
                                                                   Outer Table
                                                                   Table


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             91
                    Merge Join (and sort-merge join)

  If tables sorted on join cols (b-tree or hash)
       then sequential scan each (from start key)
       left < right    left=right   left > right
       advance left    match        advance right                  NxM
                                                                   case
                                                                   Cartesian
  Nice sequential scan of data (disk speed)                        product
     (MxN case may cause backwards rescan)

  Sort-merge join sorts before doing the merge
                                                                         Left    Right
                                                                         Table   Table
Partitions well: partition smaller to larger partition.

Works for all joins (outer, non-equijoins, Cartesian, exclusion,...)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                    92
                                                                   Hash Join

  Hash smaller table into N buckets (hope N=1)                                    Right Table
  If N=1 read larger table, hash to smaller
  Else, hash outer to disk then
      bucket-by-bucket hash join.                                                        Hash
                                                                                         Buckets
  Purely sequential data behavior                                         Left
                                                                          Table
  Always beats sort-merge and nested
     unless data is clustered.
  Good for equi, outer, exclusion join
  Lots of papers,
     products just appearing (what went wrong?)

  Hash reduces skew
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                93
                                               Parallel Hash Join
         ICL implemented hash join with bitmaps in CAFS machine
               (1976)!

         Kitsuregawa pointed out the parallelism benefits of hash
               join in early 1980‟s (it partitions beautifully)

         We ignored them! (why?) But now, Everybody's doing it.
                         (or promises to do it).


         Hashing minimizes skew, requires little thinking for
           redistribution

         Hashing uses massive main memory


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey    95
                                                        Observations
         It is easy to build a fast parallel execution environment
              (no one has done it, but it is just programming)

         It is hard to write a robust and world-class query optimizer.
              There are many tricks
              One quickly hits the complexity barrier

         Common approach:
            Pick best sequential plan
            Pick degree of parallelism based on bottleneck analysis
         Bind operators to process



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey         98
                                 What‟s Wrong With That?
      Why isn‟t the best serial plan, the best parallel plan?

      Counter example:
        Table partitioned with local secondary index at two nodes
        Range query selects all of node 1 and 1% of node 2.
        Node 1 should do a scan of its partition.
        Node 2 should use secondary index.

      SELECT * FROM telephone_book WHERE name < “NoGood”;

      Sybase Navigator & DB2 PE                                    Table   Index
           should get this right.                                  Scan    Scan



      We need theorems here                                        A..M    N..Z
        (practitioners do not have them)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                   99
                         What Systems Work This Way

       Shared Nothing                                                  CLIENTS


               Teradata:        400 nodes
               Tandem:          110 nodes
               IBM / SP2 / DB2: 128 nodes
               Informix/SP2      48 nodes
               ATT & Sybase 8x14 nodes

       Shared Disk                                                     CLIENTS


               Oracle                                  170 nodes
               Rdb                                      24 nodes

       Shared Memory                                                        CLIENTS

               Informix                                      9 nodes       Processors
                                                                            Memory

               RedBrick                                      ? nodes
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                        101
                                                                   Outline
 • Why Parallelism:
         – technology push
         – application pull
 • Benchmark Buyer‟s Guide
         – metrics
         – simple tests
 • Parallel Database Techniques
         – partitioned data
         – partitioned and pipelined execution
         – parallel relational operators

 • Parallel Database Systems
         –Teradata                                                 - Oracle     -DB2
         –Tandem                                                   - Informix   -RedBrick
                                                                   - Sybase
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                            102
                         System Survey Ground Rules
      Premise: The world does not need yet another PDB survey
                              It would be nice to have a survey of “real” systems
      Visited each parallel DB vendor I could (time limited)
      Asked not to be given confidential info.
      Asked for public manuals and benchmarks
      Asked that my notes be reviewed
      I say only nice things (I am a PDB booster)




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                    103
                    Acknowledgments
    Teradata
           Todd Walter and Carrie Ballinger
    Tandem
           Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong
    Oracle
           Gary Hallmark, Bill Widdington
    Informix
           Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay
    Navigator
           Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman ,
           Bob McDonald, Jan Graveson
           Ron Chung Hu, Stuart Thompto
    DB2
           Chaitan Baru, Gilles Fecteau, James Hamilton,
                   Hamid Pirahesh
    Redbrick
          Gordon Bell: VLDB 95 Parallel Database Donovan Schneider
Jim Gray &Phil Fernandez,Systems Survey                                   104
                                                                   Teradata
  • Ship 1984, now an ATT GIS brand name

  • Parallel DB server for decision support
              SQL in, tables out
  • Support Heterogeneous data (convert to client format)

  Data hash partitioned among AMPs UNIX                                               PC   Mac

    with fallback (mirror) hash.   VMS
                                                                              AS400
                                                                              MAC      Application
                                                                                       Processor     IBM

  Applications run on clients                                                               PEP



  Biggest installation: 476 nodes, 2.4 TB                                                   AMP



  Ported to UNIX base
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                           105
                                                                          PC   Mac
                                                                   UNIX
  Parsing Engines                                                  VMS
                                                  Application
    Interface to IBM or Ethernet or...     MAC
                                           AS400
                                                  Processor                          IBM
    Accept SQL, return records and status.             PEP
    Support SQL 89, moving to SQL92
       Parse, Plan & authorize SQL
                                                        AMP
       cost based optimizer
       Issue requests to AMPs
       Merge AMP results to requester.
       Some global load control based on client priority
           (adaptive and GREAT!)
    Access Modules
    Almost all work done in AMPs
    A shared nothing SQL engine
       scans, inserts, joins, log, lock,....
    Manages up to 4 disks (as one logical volume)
    Easy design, manage, grow (just add disk)

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                           106
                       Data Layout: Hash Partitioning
   All data declustered to all nodes
   Each table has a hash key (may be compound)
   Key maps to one of 4,000 buckets
   Buckets map to one of the AMPs
   Non-Unique secondary index partitioned by table
     criterion
   Fallback bucket maps to second AMP in cluster.
       Typical cluster is 6 nodes (2 is mirroring).
   Cluster limits failure scope:
       2 failures only cause data outage if both in same
         cluster.

   Within a node, each hash to cylinder
         then hash to “page”
             Page is a heap with a sorted directory
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   107
                Teradata Optimization & Execution
 Sophisticated query optimizer
         (many tricks) Great emphasis on Joins & Aggregates.

 Nested, merge, product, bitmap join (no hash join)

 Automatic load balancing from hashing & load control

 Excellent utilities for data loading, reorganize
 Move > 1TB database from old to new in 6 days,
      in background while old system running

 Old hardware, 3.8B row table (1TB), >300 AMPs
       typical scan, sort, join averages 30 minutes

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   108
                                                  Query Execution
     Protocol
         PE requests work
         AMP responds OK (or pushback)
         AMP works (if all OK)
         AMP declares finished
         When all finished, PE does 2PC and starts pull
     Simple scan:
         PE broadcasts scan to each AMP
         Each AMP scans produces answer spool file
         PE pulls spool file from AMPs via Ynet
     If scan were ordered, sort “catcher” would be forked
         at each AMP pipelined to scans
         Ynet and PE would do merge of merges from AMPs


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey    109
                                           Aggregates, Updates
      Aggregate of Scan:
        Scan‟s produce local sub-aggregates
        Hash sub-aggregates to Ynet
        Each AMP “catches” its sub-aggregate hash buckets
        Consolidate sub-aggregates.
        PE pulls aggregates from AMPs via Ynet.
        Note: fully scaleable design

      Insert / Update / Delete at a AMP node
         generates insert / update /delete messages to
             unique-secondary indices
             fallback bucket of base table.
             messages saved in spool if node is down

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   110
                                    Query Execution: Joins
 Great emphasis on Joins.
 Includes small-table large-table optimization
    cheapest triple, then cheapest in triple.

 If equi-partitioned, do locally
 If not equi-partitioned,
     May replicate small table to large partition (Ynet shines)
     May repartition one if other is already partitioned on join
     May repartition both (in parallel)

 Join algorithm within node is
    Product
    Nested
    Sort-merge
    Hash bit map of secondary indices, intersected.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   111
                                                                   Utilities
    Bulk Data Load, Fast Data Load, Multi-load,
       Blast 32KB of data to an AMP
       Multiple sessions by multiple clients can drive 200x parallel
       Double buffer
       AMP unpacks, and puts “upsert”onto Ynet
       One record can generate multiple upserts
          (transaction-> inventory, store-sales, ...)
       Catcher on Ynet, grabs relevant “upserts” to temp file.
       Sorts and then batches inserts (survives restarts).
       Online and restartable.
       Customers cite this as Teradata strength.

    Fast Export (similar to bulk data load)

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey               112
                                                                   Utilities II
       Backup / Restore: Rarely needed because of fallback.
         Cluster is unit of recovery
         Backup is online, Restore is offline

       Reorganize:
         Rarely needed, add disk is just restart
         Add node:
            rehash all buckets that go to that node:
                (Ynet has old and new bucket map)
         Fully parallel and fault tolerant, takes minutes




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                  113
                                                         Port To UNIX
    New design (3700 series) described in VLDB 93
    Ported to UNIX platforms (3600 AP, PE, AMP)
    Moved Teradata to Software Ynet on SMPs
       Based on Bullet-Proof UNIX with TOS layer atop.
                    message system
                    communications stacks
                    raw disk & virtual processors
                    virtual partitions (buckets go to virtual partitions)
                    removes many TOS limits
                    Result is 10x to 60x faster                           SQL Applications
                      than an AMP                          Parsing engine (parallelism)
                                    Teradata SQL (AMP logic)
    Compiled expression evaluation UNIX PDE: TOS adapter
       (gives 50x speedup on scans) UNIX 5.4 (SMP, RAS, virtual Ynet)
    Large main memory helps                  HARDWARE
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                       114
                               Customer Benchmarks
    Standard Benchmarks
           Only old Boral/DeWitt Wisconsin numbers.
           Nothing public.

   Moving > 1TB database from one old to new in 6 days,
            in background while old system running
     So: unload-load rate > 2MB/s sustained
        Background task (speed limited by host speed/space)

   Old hardware, 3.8B row table, >300 AMPs
         typical scan, sort, join averages 30 minutes

   rates (rec size not cited):                                      krec/s/AMP    k rec/s
                                                                 scan:     9     2.7 mr/s !!!!!!
                                                clustered join:            2     600 kr/s
                                                    insert-select:       .39     120 kr/s
                                         Hash index build:
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
                                                                         3.3     100 kr/s      115
                              UNIX/SMP Port of Teradata
  Times to process a Teradata Test DB on a 8 Pentium, 3650.
    These numbers are 10 to 150x better than a single AMP
        Compiled expression handling
         more memory
    op                        rows seconds k r/s MB/s
    scan                   50000000     737     67.8 11.0
    copy                    5000000 1136        4.4    0.7
    aggregate              50000000     788     63.5 10.3
    Join 50x2M (clustered) 52000000     768     67.7 11.0
    Join 5x5 (unclustered) 10000000     237     42.2 6.8
    Join 50Mx.1K           50000100 1916        26.1 4.2

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   116
                                        Teradata Good Things
 Scaleable to large (multi-terabyte) databases

 Available TODAY!

 It is VERY real: in production in many large sites

 Robust and complete set of utilities

 Automatic management.

 Integrates with the IBM mainframe OLTP world

 Heterogeneous data support is good data warehouse
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   117
                                                                   Tandem
   Message-based OS (Guardian):
    (1) location transparency
    (2) fault isolation (failover to other nodes).
   Expand software 255 Systems WAN
                                                                                         224
                                                                        4 node System
                                                                                         PROCESSORS
     Classic shared-nothing                                                                30MB/S
     system (like Teradata                                                        8 x1M B/S
     except applications                                                             1-16 MIPS R4400 cpus
     run inside DB machine.                                                          dual port controllers,
                                                                                     dual 30MB/s LAN
  1974-1985: Encompass: Fault-tolerant Distributed OLTP
  1986: NonStopSQL: First distributed and high-performance SQL (200 tps)
  1989: Parallel NonStopSQL: Parallel query optimizer/executor
  1994: Parallel and Online SQL (utilities, DDL, recovery, ....)
  1995: Moving to ServerNet: shared disk model
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                         118
                                          Tandem Data Layout
   Each table or index range partitioned to a set of disks
      (anywhere in network)                      Partition
                                                                    Block File= {parts}
   Index is B-tree per partition
      clustering index is B+ tree

   Table fragments are files (extent based).
                                                                   Extents may be added
   Descriptors for all local files live in local catalog
         (node autonomy)

   Tables can be distributed in network (lan or wan)

   Duplexed disks and disk processes for failover
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                     119
         Tandem Software (Process Structure)

                SQL                                                    Disk             Disk Pair
                                                                    Server Pair         or Array
      C/COBOL/..                             SQL engine                        buffer

      Application                            Joins, Sorts                      pool
                                                                   Selects
                                             global aggs
                                                                   Update, Delete
                                                                                          Data partition
                                             triggers
                                                                   Record/Set insert
                                             index maintenance
                                                                   Aggregates
                                             views
                                                                   Assertions
                                             security
                                                                   Locking
                                                                   Logging

                                                                   Query Compiler Transactions
                                    GUI                                              Helper
                                                                           Utilities
                                                                                     Processes

     Hardware & OS move data at 4MB/s with >1 ins/byte
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                           120
                                                     OLTP Features
           Insert / Update / Delete index in parallel with base table
              If 5 indices, 5x faster response time.

           Record and key-value range locking, SQL92 isolation levels

           Undo scanner per log: double-buffers undo to each server

           21 k tpc-C (WOW!!) with 110 node server (800GB db)

           Can mix OLTP and batch.
              Priority serving to avoid priority inversion problem
           Buffer management prevents sequential buffer pollution




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey        121
                     Tandem Query Plan & Execution
  Simple selects & aggregates done in disk servers

  Parallelism chosen: scan: table fragmentation                      Application
                          hash: # processors                       SQL subsystem
                             or Outer table fragments
  Sorts: redistribution, sort in executors (N-M)                   Executors
  Joins done in executors (nest, sort-merge, hash).
                                                                     Disk
  Redistribution is always a hash (minimize skew)                    Servers
  Pipeline as deep as possible (use lots of processes)

  Multiple logs & parallel UNDO avoid bottlenecks

  Can mix OLTP and batch.
     Priority serving to avoid priority inversion problem
  Buffer management prevents sequential buffer pollution
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                   122
                                                Parallel Operators
  Initially just inserted rivers between sequential operators
  Parallel query optimizer
  Created executors at all clustering nodes or
                        at all nodes, repartitioned via hash to them
  Gave parallel select, insert, update, delete
                            join, sort, aggregates,...
                            correlated subqueries are blocking
  Got linear speedup/scaleup on Wisconsin.
  Marketing never noticed, product slept from 1989-1993
      Developers added: Hash Join
                            aggregates in disk process
                            SQL92 features
                            parallel utilities
                            online everything
                            converted to MIPSco
                            fixed bugs
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey       123
                                                     Join Strategies
     Nested loop
     Sort merge
     Both can work off index-only access
     Replicate small to all partitions (when one small)
     Small-table Cartesian product large-table optimization
     Now hybrid-hash join
        uses many small buckets
        tuned to memory demand
        tuned to sequential disk performance
        no bitmaps because (1) parallel hash
                               (2) equijoins usually do not benefit

     When both large, and unclustered (rare case)
       N+M scanners, 16 catchers: sortmerge or hybrid hash
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey       124
   Administration (Parallel & Online everything)
      All utilities are                     online (claim to reduce outages by 40%):
          Add table, column,...
          Add index:
                      builds index from stale copy
                      uses log for catchup
                      in final minute, gets lock, completes index.
        Reorg B-tree while it is accessed
        Add / split/ merge/ reorg partition
        Backup
        Recover page, partition, file.
        Add, alter logs, disks, processors, ...
      You need this: Terabyte operations take a long time!

      Parallel Utilities:
            load (M to N)
            index build (M scanners, N inserters, in background)
            recovery:
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                       125
                                                          Benchmarks
    No official DSS benchmark reports
    Unofficial results
       1 to 16 R4400 class processors, 64MB each (Himalayas)
           3 disks, 3 ctlrs each
                                         Sequential                                    16x Parallel
                                   rec/s      MB/s                              rec/s         MB/s speedup
           Load Wisc            1.6 kr/s 321 Kb/s                              28 kr/s    5.4 MB/s      16
           Parallel Index build 1.5 kr/s    15Kb/s                             24 kr/s    240 KB/s      16
           SCAN                                 28 kr/s5.8 MB/s 470 kr/s 94 MB/s                       16 !!!!!!!
           Aggregate (1 col)                        25 kr/s        4.9 MB/s   400 kr/s   58 MB/s       16
           Aggregate (6 col)                        18 kr/s        3.6 MB/s   300 kr/s   60 MB/s       16
           2-Way hash Join                          13 kr/s        2.6 MB/s   214 kr/s   42 MB/s       16
           3-Way hash Join                           ? kr/s          ? Mb/s     ? kr/s    ? MB/s        ?

          1x and 16x rates are best I‟ve seen anywhere.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                              126
                                         Tandem Good Things
      21 K TPM-C (WOW!)

      It is available TODAY!

      Online everything

      Fault tolerant, distributed, high availability

      Mix OLTP and batch

      Great Hash Join Algorithm

      Probably the best peak performance available
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   127
                                                                   Oracle
 Parallel Server (V7): Multiple threads in a server
                       Multiple servers in a cluster
                   Client/server, OLTP & clusters (TP lite)

 Parallel Query (V7.1) Parallel SELECT (and sub-selects)

 Parallel Recovery: (V7.1) @ restart, one log scanner,
                                multiple redoers
                        Beta in 1993, Ship 6/94.
 More Parallel (create table): V7.2, 6/95

 Shared disk implementation ported to most platforms

 Parallel SELECT (no parallel INSERT, UPDATE, DELETE, DDL)
                                                    except for sub-selects inside these verbs.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                 128
                                              Oracle Data Layout
 Homogenous:                                                                        Segment     Table Space
  one table (index) per segment                                                           Block = File Set




                                                                   Table or Index
  extents picked from a TableSpace




                                                                                      Extents
 Files may be raw disk
 Segments are B-trees or heaps.
                                                                                                Extents may be added

 data -> disk map is automatic
                 No range / hash / round-robin partitioning

 ROWID can be used as scan partitioning on base tables.

 Guiding principal:
  If its not organized, it can‟t get disorganized,
              and doesn‟t need to be reorganized.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                   129
       Oracle Parallel Query Product Concept
   Convert serial SELECT plan to parallel plan
    If Table scan or HINT then consider parallel plan
    Table has default degree of parallelism (explicitly set)
    Overridden by system limits and hints.
    Use max degree of all participating tables.
    Intermediate results are hash partitioned
    Nested Loop Join and Merge Join

   User hints can (must?) specify join order, join strategy,
   index, degree of parallelism,...
                       es
                  a nc
             n st
          I
         D                Multi-
                                                        DB                  Query
           eg        process & thread
                                                                   Client
                                                                            Coordinator
                re
                  e
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                          130
                                                       Query Planning
   Query Coordinator starts with Oracle Cost-Based plan

   If plan requests Table scan or HINT then consider parallel plan

                    Table has default degree of parallelism (explicitly set)
                    Overridden by system limits and hints.
                    Use max degree of all participating tables.

   Shared disk makes temp space allocation easy

   Planner picks                            degree of parallelism and
                                            river partitioning.

   Proud of their OR optimization.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey               131
                                                   Query Execution
    Coordinator does extra work to

            merge the outputs of several sorts
                 subsorts pushed to servers

            aggregate the outputs of several aggregates
                 aggregates pushed to servers

    Parallel function invocation is potentially a big win.

                    SELECT COUNT ( f(a,b,c,...)) FROM T;

            Invokes function f on each element of T, 100x parallel.


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey      132
                                                     Join Strategies
            Oracle has (1) Nested Loop Join
                       (2) Merge Join

            Replicate inner to outer partition automatic in shared disk
                  (looks like partition outer).

            Has small-table large-table optimization
                 (Cartesian product join)

            User hints can specify join order, join strategy, index
              degree of parallelism,...



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey          133
                                 Transactions & Recovery
       Transactions and transaction save points (linear nest).

       ReadOnly snapshots for decision support.

       SQL92 isolation levels (ACID = Snapshot isolation)

       Database has multiple rollback segments UNDO log,

       Transaction has one commit/REDO log so may be a bottleneck

       Parallel recovery at restart:
          One log scanner,
             DEGREE REDO streams, typically one per disk
             INSTANCE REDO streams, typically two-deep per disk


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   134
                            Administration
    Not much special:
         Limit degree of parallelism at a server
         Set default parallelism of a table
         Query can only lower these limits
    No special tools, meters, monitors,...
      Just ordinary Parallel Server
                      Oracle Utilities
    User can write parallel load / unload utility
        Index build, Constraints, are separate steps
        Not incremental or online or restartable.
         Update Statistics (Analyze) is not parallel
         Index build is a N-1 parallel: N scanner/sorter, 1 inserter.
         Parallel recovery at restart:
                One log scanner,
                        DEGREE REDO streams, typically one per disk
                        INSTANCE REDO streams, typically two-deep per disk 135
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
                                                          Benchmarks
        Sequent 20x 50MHz 486, .5GB RAM, 20 disk
                                                                   Sequential           20x Parallel
        rec/s                                    krecr/s                KB/s      rec/s       MB/s speedup
        Load 5M Wisc                              .5 kr/s           113 KB/s    8.8 kr/s 1.8 MB/s       16
        Parallel Index load                      2.2 kr/s            18 Kb/s    29 kr/s 235 KB/s        13
        SCAN                                     1.7 kr/s           364 KB/s    26 kr/s 5.3 MB/s        15
        Agg MJ                                   3.3 kr/s           660 KB/s    45 kr/s 9.3 MB/s        14
        Agg NJ                                   1.4 kr/s           290 KB/s    26 kr/s 5.4 MB/s        19
     Same benchmark on
        16x SP1 (a shared nothing machine), got similar results.
        168x N-cube ( 16MB/node), 4 lock nodes, 64 disk nodes got good scaleup
     Oracle has published details on all these benchmarks.
    Sept 1994 news:
     20 Pentium, 40 disk system, SCAN at 44 MB/s 55% cpu
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                             136
                                                    Oracle Good Things
    Available now!

    Parallel Everywhere (on everybody‟s box)

    A HIGH FUNCTION SQL

    No restrictions (triggers, indices,...)

    Very easy to use (almost no knobs or options)

    Parallel invocation of stored procedures

    Near-linear scaleup and speedup of SELECTs.

    Respectable performance on Sequent
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey         137
                                                                   Informix
     DSA (Dynamic Scaleable Architecture) describes redesign to
              thread-based, server-based system.
       V6 - 1993 - : DSA -- rearchitecture (threads, OLTP focus)
       V7 - 1994 - : PDQ -- Parallel Data Query (SMP)
       V8 - 1995 - : XMP -- Cluster parallelism (shared disk/nothing).

     Parallelism is a MAJOR focus now that SQL92 under control

     Other major focus is TOOLS (ODBC, DRDA, NewEra 4GL).

     Informix is a UNIX SQL system:
             AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, SUN (SunOS, Solaris)


     Today shared nothing parallelism on IBM SP2, ATT3650, ICL, (beta)

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                 138
                                          Informix Data Layout
  Table or index maps to homogeneous set of DB spaces
    contains “chunks” (extents)     DBspace  File
                                                                    Block
  Partition by: range,
             round robin
             expression
             hash (V8)
                                                                   Chunks may be added

  Access via B+Tree, B* tree, and hash (V8)

  Built an extent-based file system on raw disks or files

  High speed sequential, clustering, async IO,...
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                         139
                                              Informix Execution
  Completely parallel DML, some parallel DDL
  Parallel SELECT, UPDATE, DELETE
                                                           Virtual      helpers
     Executor per partition in all cases.                  Processes
                                               Client
  Parallel       sort,
                 joins (nest, merge, hash)                  Buffer Pool
                 aggregates,
                 union                                        M join
  Whenever an operator has input and
                                                        scan
        a free output buffer, it can work                            M join
        to fill the output buffer.
        Natural flow control                                    scan        scan
  Blocking operators (sort, hash join, aggregates, correlated subqueries)
        Spool to a buffer (if small), else spool to disk.
  Shared buffer pool minimizes data copies.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             140
                                                       Parallel Plans
      Query plan is parallelized by
        scanner per table partition (does select, project)
        sub-aggregates per partition (hash or sort)

      If clustered join (nested loop or merge) then operator
          per outer or per partition

      If hash-join, parallel scan smaller first,
             build bitmap and hash buckets
          then scan larger and:
             join to smaller if it fits in memory
             else filter via bitmap and build larger buckets
          then join bucket by bucket
      Hybrid hash join with bitmaps and bucket tuning.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey        141
                                               Parallel Operators
                Parallel SELECT, UPDATE, DELETE
                   Executor per partition in all cases.

                Parallel                        sort,
                                                joins,
                                                aggregates,
                                                union

                Only correlated subqueries are blocking

                Completely parallel DML, some parallel DDL



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey    142
                                 Transactions & Recovery
     SQL 2 isolation levels allow DSS to run in background

     Transaction save points

     Separate logical and physical logs.
       Bulk updates could bottleneck on single log.

     Recovery unit is data partition (DBspace)

     Parallel recovery: thread per DBspace

     If DB fragment

      unavailable, DSS readers can skip it
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   143
                                     Informix Administration
    Can assign % of processors, memory, IO to DSS
      (parallel query)

    Sum of all parallel queries live within this quota

    Each query can specify the % of the total that it wishes.
      (0 means sequential execution)

                                                                   Utilities
     Parallel Data load (SMP only)
     Parallel Index Build (N - M)
     Parallel recovery
     Online backup / restore
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey               144
                                                          Benchmarks
       Sequent system:                                             9 Pentium processors
                                                                   1 GB main memory
                                                                   Base tables on 16 disk (FWD SCSI)
                                                                   Indices on 10 discs
                                                                   Temp space on 10 disks

                                                               Sequential                 Parallel
                                                         rec/s      MB/s            rec/s    MB/s speedup
       Load 300M Wisc                                    3kr/s   600Kb/s
       Parallel Index load                                                        48kr/s    1MB/s
       SCAN                                            17kr/s         3.5MB/s    147kr/s   30MB/s       8.3
       Aggregate                                       11kr/s         2.3MB/s    113kr/s   23MB/s      10.1
       2-Way hash Join                                 18kr/s         3.2MB/s    242kr/s   31MB/s       9.7
       3-Way hash Join                                 25kr/s         3.5Mb/s    239kr/s   33MB/s       9.5



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                              145
              Informix Shared Nothing Benchmark
     IBM SP2 - : TPC-D-like database 48 SP2 Processors
     Customer Benchmark, Not audited benchmark.

     Load              60 GB in 40 minutes,
                    250 GB in 140 min                              about 100 GB/hr !   2GB/node/hr

     Scan & Aggregate (#6)
                     60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s
                    260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s

     Power Test (17 complex queries and 2 load/purge ops)
                        60 GB in 5 hrs
                       260 GB in 18 hrs

     Multiuser Test:
                    1 user, 12 queries: 10 hrs, 4 users, 3 queries: 10 hrs
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                146
                                       Informix Good Things
            A full function SQL

            Available today on Sequent

            Beautiful manuals

            Linear speedup and scaleup

            Best published performance on UNIX systems
              Probably best price performance.
                             (but things are changing fast!)

            Some mechanisms to mix OLTP and batch.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   147
                Sybase Navigator Product concept
   Two layer software architecture:
    (1) Navigator
         drives array of shared-nothing SQL engines.
    (2) Array of SQL engines, each unaware of others.
         similar to Tandem disk processes               TOR
                                                  ISTRA
                                            ADMIN
         SQL engine is COTS.




                                                                            NAVIGATOR
   Goal: linear scaleup and speedup,                                                          SQL
                                                                                                      SQL
                                                                                                        SQL
                                                                                                          SQL
                                                                                                    SQL
        plus good OLTP support                                                   CO
                                                                                        SQL
                                                                                          SQL
                                                                                                      SQL
                                                                                              SQL
                                                                                    NF
                                                                                       IGU
                                                                                           RA
   Emphasize WHOLE LIFECYCLE                                                                  TO
                                                                                                  R
       Configurator: tools to design a parallel system
       Administrator: tools to manage a parallel system
                                   (install/upgrade, start/stop, backup/restore, monitor/tune)
                   Optimizer: execute requests in parallel.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                            148
                                                         Configurator
        Fully graphical design tool

        Given    ER model and dataflow model of the application
                 workload characteristics
                 response time requirements,
                 hardware components
                 (heavy into circles and arrows)
        Recommends hardware configuration/
                 Table definitions (SQL)
                 table partitioning




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey        149
                                                       Administrator
        Made HUGE investments in this area.
        Truly industry leading
           graphical tools make MPP configuration “doable”.

        GUI interface to manage:
              startup / shutdown of cluster
              backup / restore / manage logs
              configure (install, add nodes, configure and tune servers)
              Manage / consolidate system event logs
              System stored procedures (global operations)
                  (e.g. aggregate statistics from local to global cat)
              Monitor SQL Server events



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey           150
                                                           Data Layout
      Pure shared nothing
      Navigator partitions data among SQL servers
            • map to a subset of the servers
            • range partition or hash partition.
      Secondary indices are partitioned with base table
      No Unique secondary indices
      Only shorthand views, no protection views
      Schema server stores global data definition for all nodes.
      Each partition server has
            schema for its partition
            data for its partition.
            log for its partition


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey         151
                Sybase SQL Server Backgrounder
    Recently became SQL89 compliant (cursors, nulls, etc)
    Stored procedures, multi-threaded, internationalized,
    B*-tree centric (clustering index is B+tree)
    Use nested loops, sort-merge join (sort is index build).
    Page locking, 2K disk IO, ... other little-endian design decisions.
    Respectable TPC-C results (AIX RS/6000).
    UNIX raw disks or files are base (also on OS/2, NetWare,...).
    table->disk mapping
       CREATE DATABASE name ON {device...} LOG ON {device...}
       SP_ADDSEGMENT segment, device
       CREATE TABLE name(cols) [ ON segment]

    Microsoft has a copy of the code, deep ported to NT

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey     152
                 Navigator Extension Mechanisms
    Navigator extended Sybase TDS by
      Adding stored procedures to do things
      Extending the syntax (e.g. see data placement syntax below)

    Sybase TDS and OpenServer design are great for this
      All “front ends based on OpenServer and threads”




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey    153
           Process Structure - Pure Shared Nothing
    DBA Server does everything: SQL compilation
                                System management
                                Catalog management
                                SQL server restart (in 2nd node)
                                DBA fallback detects deadlock
                                does DBA takeover on fail
    Control server at each node manages SQL servers there
       (security, request caching, 2PC, final merge /aggregate,...
           parallel stored procedures (SMID) )
    Split server manages re-partitioning of data
    SQL Server is unit of query parallelism, (one per cpu per node)
                                Clients

                            Control                DBA             = system             GUI
                                                   server            manager
                              (1/node)
                                                                     monitor            Navigator
                                                                                        Manager
                           Split                                   & SQL optimizer
                                                   schema          = catalogs
                           SQL                     server
                                                                     database in a SQL server
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                    154
                             Simple Request Processing
   Client connects to Navigator (a Control Server) using
      standard Sybase TDS protocol.

   SQL request flows to DBA server that compiles it
     sends stored procedures (plans) to all control servers
           plans to all relevant SQL servers
     Control server executes plan.
     Pass to SQL server, returns results.
                                                                   Client
   Plan cached on second call,
                                                                                DBA
      DBA server not invoked.                                       Control
                                                                     (1/node)
                                                                                server
      Good for OLTP
                                                                   Split
                                                                                schema
                                                                   SQL          server


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                         155
                              Parallel Request Processing
           If query involves multiple nodes, then command
               sent to each one (diagram shows secondary index lookup)
           Query sent to SQL servers that may have relevant data.

           If data needs to be redistributed or aggregated,
                  split servers issue queries and inserts
                                    (that is their only role)
                           split servers have no persistent storage.
                                                        Client

                                                                               Control    DBA
                                                           Control   Control              server
                                                                               Split
                                                          Split      Split               schema
                                                          SQL                            server

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                   156
                                                Data Manipulation
   SQL server is unit of parallelism

   "Parallelized EVERYTHING in the T-SQL language"
         Includes SIMD execution of T-SQL procedures,
         plus N-M data move operations.

   Two-level optimization:
                DBA Server has optimizer
                      (BIG investment, all new code,
                      NOT the infamous Sybase optimizer)
                Each SQL server has Sybase optimizer
         If extreme skew, different servers have different plans
         DBA optimizer shares code with SQL server
                      (so they do not play chess with one another).
         Very proud of their optimizer.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey    157
                                                Query Execution
     Classic Sellinger cost-based optimizer.
     SELECT, UPDATE, DELETE N-to-M parallel
     Bulk and async INSERT interface.
     N-M Parallel sort
     Aggregate (hash/sort)
     select and join can do index-only access if data is there.
     eliminate correlated subqueries (convert to join).
                         (Gansky&Wong. SIGMOD87 extended)
     Join: nested-loop, sort-merge, index only
            Sybase often dynamically builds index to
                   support nested loop (fake sort-merge)
     Typically left-deep sequence of binary joins.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   158
                              Join and Partition Strategy
   Partition strategies
          If already partitioned on join, then no splitting
          Else Move subset of T1 to T2 partitions.
          or      Replicate T1 to all T2 partitions
          or      repartition both T1 and T2 to width of home nodes
                  or target.
          No hash join, but
                  all (re) partitioning is range or hash based.

   Not aggressive parallelism/pipelining: 2 op at a time.
   Pipeline to disk via split server (not local to disk and then split).
   Split servers fake subtables for SQL engines.
   Top level aggregates merged by control, others done by split.

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey       159
                                                                   Utilities
             Bulk data load (N-M) async calls

             GUI manages
               Backup all SQL serves in parallel

             Reorg via                       CREATE TABLE <new> ,
                                             INSERT INTO <new> SELECT * FROM <old>

             Utilities are mostly offline (as per Sybase)

             Nice EXPLAIN utility



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                     160
                                                                   Futures
            Hash join within split servers

            Shared memory optimizations

            Full support for unique secondary indices

            Full trigger support (cross-server triggers)

            Full security and view support.




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             161
                           Benchmarks
  Preliminary: 8x8 3600 - Ynet.
  node: 8 x (50MHz 486 256k local cache) 512MB main memory,
        2 x 10 disk arrays, @ 2GB 4 MB/s per disk.
        6 x Sybase servers
  Scaleup & speedup tests of 1, 4, and 8 nodes.
  Numbers (except loading) reported as ratios of elapsed times

  S&S tests show a >7x speedup of 8-way over 1-way

  Tests cover insert, select, update, delete, join, aggregate, load

     Reference Account: Chase Manahattan Bank
                    14x8 P5 ATT 3600 cluster: (112 processors)
                    56 SQL servers, 10GB each = 560 GB
                    100x faster than DB2/MVS (minutes vs days)
     Linearity is great.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey      162
                                     Navigator Good Things
               Concern for lifecycle
                      design,
                      install,
                      manage,
                      operate,
                      use

               Good optimization techniques

               Fully parallel, including stored procedures!

               Scaleup and Speedup are near linear.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   163
                                                              Sybase IQ
     Sybase bought Expressway
     Expressway evolved from Model 204
       bitmap technology: index duplicates with bitmap
         compress bitmap.
       Can give 10x or 100x speedup.
       Can save space and IO bandwidth
     Currently, two products (Sybase and IQ) not integrated



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey          164
                                                                   DB2
DB2/VM: = SQL/DS: System R gone public
DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...)
    Parallel and async IO into one process (on mainframe)
    Parallel execution in next release (late next year?)
    MVS PQS now withdrawn?
DB2/AS400: Home grown

DB2-2-PE: OS2/DM grown large. First moved to AIX
    Being extended parallelism
    Parallelism based on SP/2 -- shared nothing done right.
    Benchmarks today - Beta everywhere
DB2++: separate code path has OO extensions, good TPC-C
     Ported to HP/UX, Solaris, NT in beta

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey         165
                         DB2/2 Data Layout
   •       DATABASE: a collection of nodes (up to 128 SP2s so far)
   •       NODEGROUP: a collection of logical nodes (a 4k hash map
   •       LOGICAL NODE: A DB2 instance (segments, log, locks...)
   •       PHYSICAL NODE: A box.
                                                                                Segments
   •       Logical Node: Segments of 4 k pages
              – Segments allocated in units (64K default)
              – Tables stripe across all segments
       • Table created in NodeGroup:
              – Hash (partition key) across all members of group
       • Cluster has single system Image
                                                                        Group 2
                                                                 Nodes:
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                 Group 1 166
                                     DB2/2 Query Execution
     • Each node maintains pool of AIX server processes
     • Query optimizer does query decomposition to node
       plans (like R* distributed query decomposition)
     • Parallel Optimization is 1Ø (not like Wai Hong‟s work)
     • Sends sub-plans to nodes to be executed by servers
     • Node binds plan to server process
     • Intermediate results hashed
     • Proud that Optimizer does not need hints.
     • “Standard” join strategies (except no hash join).



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   167
                                                       DB2/2 Utilities
       • 4 loaders:
          – import
          – raw-insert (fabricates raw blocks, no checks)
          – insert
          – bulk insert
       • Reorganize hash map, add / drop nodes, add devices
               – Table unavailable during these operations
       • Online & Incremental backup
       • Fault tolerance via HACMP



Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey         168
                       DB2/2 Performance:
                  Good performance Great Scaling
                                                                        Speedup vs Nodes
Wisconsin scaleups                                                 25.0
                                                                        DB2/2 PE on SP2                     Load
 big = 4.8 M rec = 1 GB                                                                                     Scan
                                                                   20.0
small = 1.2 M rec = 256MB                                                                                   Agg
scan rate ~12 kr/s/node                                            15.0                                     SMJ
raw load: 2.5 kr/s/node                                                                                     NLJ
see notes for more data                                            10.0
                                                                                                            SMJ2

                                                                                                            Index1
                                                                    5.0
                                                                                                            Index2

                                                                    0.0                                     MJ
                                                                          0   2   4   6   8   10 12 14 16


Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey                                                   169
                                            DB2/2 Good Things
•    Scaleable to 128 nodes (or more)
•    From IBM
•    Good performance
•    Complete SQL (update, insert,...)
•    Will converge with DB2/3 (OO and TPC-C stuff)
•    Will be available off AIX someday
        – (aix is slow and SP2 is very expensive)




Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey   170
                                                                   RedBrick
• Read-only (LOAD then SELECT only) Database system
   – Load is incremental and sophisticated
• Precompute indices to make small-large joins run fast
        – Indices use compression techniques.
        – Only join via indices
• Many aggregate functions to make DSS reports easy
• Parallelism:
   – Pipeline IO
   – Typically a thread per processor (works on index partition)
   – Piggyback many queries on one scan
   – Parallel utilities (index in parallel, etc)
   – SP2 implementation uses shared disk model.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey              171
                                                                   Summary
  There is a LOT of activity
       (many products coming to market)

  Query optimization is near the complexity barrier
    Needs a new approach?

  All have good speedup & scaleup if they can find a plan

  Managing huge processor / disk / tape arrays is hard.

  I am working on commoditizing these ideas:
        low $/record/sec (scaleup PC technology)
        low Admin $/node (automate, automate, automate,...)
        Continuous availability (online & fault tolerant)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey             172