
									Building Peta-Byte Servers
   Jim Gray
   Microsoft Research
   Gray@Microsoft.com
   http://www.Research.Microsoft.com/~Gray/talks

   Kilo    10^3
   Mega    10^6
   Giga    10^9
   Tera    10^12            <= today, we are here
   Peta    10^15
   Exa     10^18
                                      1
    Outline
• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system

• Conclusion 1
  – Think about Maps and SCANS

• Conclusion 2:
  – Think about Clusters




                                              2
  The Challenge -- EOS/DIS
• Antarctica is melting -- 77% of fresh water liberated
   – sea level rises 70 meters
   – Chico & Memphis are beach-front property
   – New York, Washington, SF, SB, LA, London, Paris 

• Let’s study it! Mission to Planet Earth
• EOS: Earth Observing System (17B$ => 10B$)
   – 50 instruments on 10 satellites 1997-2001
   – Landsat (added later)

• EOS DIS: Data Information System:
   – 3-5 MB/s raw, 30-50 MB/s cooked.
   – 4 TB/day,
   – 15 PB by year 2007
                                                           3
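A quick back-of-the-envelope check of how the daily rate compounds into the quoted total (a sketch assuming the ~4 TB/day rate is sustained for roughly the decade 1997-2007):

```python
# Rough volume check for EOS/DIS (assumption: ~4 TB/day sustained for ~10 years).
TB_PER_DAY = 4
YEARS = 10                      # roughly 1997 through 2007
total_tb = TB_PER_DAY * 365 * YEARS
print(f"{total_tb:,} TB  ~ {total_tb / 1_000:.1f} PB")   # ~14.6 PB, i.e. the ~15 PB quoted
```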
  The Process Flow
• Data arrives and is pre-processed.
  – instrument data is calibrated, gridded, and averaged
  – Geophysical data is derived
• Users ask for stored data
  OR ask to analyze and combine data.
• Can make the pull-push split dynamically.

  [Diagram: Pull Processing | Other Data | Push Processing]

                                                    4
   Designing EOS/DIS (for success)
• Expect that millions will use the system (online)
 Three user categories:
  – NASA 500 -- funded by NASA to do science
  – Global Change 10 k - other dirt bags
  – Internet 20 m - everyone else
                      Grain speculators
                      Environmental Impact Reports
                      school kids
                      New applications
     => discovery & access must be automatic

• Allow anyone to set up a peer node (DAAC & SCF)
• Design for ad hoc queries,
  not just standard data products.
     If push is 90%, then only 10% of the data is read (on average).
     => A failure: no one uses the data. In DSS, push is 1% or less.
     => Computation demand is enormous (pull:push is 100:1).
                                                                        5
     The (UC alternative) Architecture
• 2+N data center design
• Scaleable DBMS to manage the data
• Emphasize Pull vs Push processing
• Storage hierarchy
• Data Pump
• Just in time acquisition


                                         6
   2+N Data Center Design
• Duplex the archive (for fault tolerance)
• Let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways).
• Each partition is a free-standing DBMS
     (similar to Tandem, Teradata designs).
• Clients and Partitions interact via standard protocols
   – DCOM/CORBA, OLE-DB, HTTP,…

• Use the (Next Generation) Internet
                                                             7
Obvious Point:
EOS/DIS will be a Cluster of SMPs
 • It needs 16 PB storage
   = 1 M disks in current technology
   = 500K tapes in current technology

 • It needs 100 TeraOps of processing
   = 100K processors (current technology)
   and ~ 100 Terabytes of DRAM

 • 1997 requirements are 1000x smaller
   – smaller data rate
   – almost no re-processing work
                                             8
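The device counts imply rough per-unit sizes; a minimal sketch of that arithmetic (assuming 16 PB means 16x10^15 bytes and the counts are round numbers):

```python
# Implied per-unit capacities behind the device counts on this slide.
PB = 1e15
storage = 16 * PB
print(f"per disk: {storage / 1e6 / 1e9:.0f} GB")     # 16 PB / 1M disks   ~ 16 GB/disk
print(f"per tape: {storage / 5e5 / 1e9:.0f} GB")     # 16 PB / 500K tapes ~ 32 GB/tape
print(f"per CPU : {100e12 / 100e3 / 1e9:.0f} GOps")  # 100 TeraOps / 100K CPUs = 1 GOps each
```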
   Hardware Architecture
• 2 Huge Data Centers
• Each has 50 to 1,000 nodes in a cluster
  – Each node has about 25…250 TB of storage (FY00 prices)
        –   SMP               .5Bips to 50 Bips    20K$
        –   DRAM              50GB to 1 TB         50K$
        –   100 disks         2.3 TB to 230 TB    200K$
        –   10 tape robots    50 TB to 500 TB     100K$
        –   2 Interconnects   1GBps to 100 GBps    20K$

• Node costs 500K$
• Data Center costs 25M$ (capital cost)
                                                             9
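The per-node component prices above roughly add up to the quoted node and data-center figures; a small roll-up sketch (treating the gap up to 500 K$ as packaging and overhead, which is an assumption):

```python
# Cost roll-up from the per-node component prices on the slide above (in K$).
components_k = {"SMP": 20, "DRAM": 50, "100 disks": 200,
                "10 tape robots": 100, "2 interconnects": 20}
node_k = sum(components_k.values())        # ~390 K$; the slide rounds up to ~500 K$
center_m = 50 * 500 / 1000                 # 50 nodes at 500 K$ each
print(f"node ~ {node_k} K$,  data center ~ {center_m:.0f} M$")   # the 25 M$ capital cost quoted
```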
       Scaleable DBMS
• Adopt cluster approach (Tandem, Teradata, VMScluster,..)
• System must scale to many processors, disks, links
• Organize data as a Database, not a collection of files
   – SQL rather than FTP as the metaphor
   – add object types unique to EOS/DIS (Object Relational DB)

• DBMS based on standard object model
   – CORBA or DCOM (not vendor specific)

• Grow by adding components
• System must be self-managing
                                                                  10
    Storage Hierarchy
• Cache hot 10% (1.5 PB) on disk.
• Keep cold 90% on near-line tape.
• Remember recent results on speculation.
   – research challenge: how to trade push + store vs. pull
• (more on this later: Maps & SCANs)

  [Hierarchy: 10 TB of RAM (500 nodes) -> 1 PB of disk (10,000 drives)
   -> 15 PB of tape (4 x 1,000 robots)]
                                                          11
     Data Pump


• Some queries require reading ALL the data
     (for reprocessing)
• Each Data Center scans the data every 2 days.
    – Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
• Compute small jobs on demand:
   – less than 1,000 tape mounts
   – less than 100 M disk accesses
   – less than 100 TeraOps
   – (less than 30 minute response time)
• For BIG JOBS, scan the entire 15 PB database.
• Queries (and extracts) “snoop” this data pump.
                                                        12
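A quick check of the data-pump rate arithmetic (assuming the 15 PB archive is scanned every 2 days across roughly 1,000 nodes, and 1 MB = 10^6 bytes):

```python
# Data-pump scan-rate check.
archive_pb, scan_days, nodes = 15, 2, 1000
per_day_pb = archive_pb / scan_days        # 7.5 PB/day; the slide rounds to ~10 PB/day
per_node_tb = per_day_pb * 1000 / nodes    # ~7.5 TB per node per day
mb_per_s = per_node_tb * 1e6 / 86400       # ~90 MB/s (~120 MB/s at the rounded 10 TB/node/day)
print(f"{per_day_pb:.1f} PB/day, {per_node_tb:.1f} TB/node/day, {mb_per_s:.0f} MB/s per node")
```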
Just-in-time acquisition 30%
•   Hardware prices decline 20%-40%/year
•   So buy at last moment
•   Buy best product that day: commodity
•   Depreciate over 3 years so that facility is fresh.
•   (after 3 years, cost is 23% of original)

  [Chart: EOS DIS disk storage size and cost, 1994-2008, log scale,
   assuming a 40% price decline/year; curves show Data Need (TB) and
   Storage Cost (M$); annotation: 60% decline peaks at 10 M$.]
                                                                                13
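The "23% of original" figure is just the price decline compounded over the 3-year depreciation window; a minimal sketch:

```python
# Residual price after 3 years of compounding declines.
for decline in (0.30, 0.40, 0.50):
    print(f"{decline:.0%}/yr decline -> {(1 - decline) ** 3:.0%} of original after 3 years")
# 40%/yr gives ~22% (the slide's ~23%); 50%/yr gives ~13% -- hence "buy at the last moment".
```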
Just-in-time acquisition 50%!!!!!!!
• Hardware prices decline 50%/year lately
• The PC revolution!
• It's amazing!

  [Chart: EOS-DIS storage needs, 1994-2007, log scale; curves show Total
   Storage Capacity (TB), Disk Cost (M$) at a 40%/year price cut, and
   Disk Cost (M$) at a 50%/year price cut.]
                                                                                    14
TPC-C improved fast (250%/year!)
(40% hardware, 100% software, 100% PC Technology)

  [Charts: $/tpmC vs. time and tpmC vs. time, Mar-94 through Jan-98,
   log scale; both show roughly 250%/year improvement.]
                                                                                            15
Problems
• HSM (hierarchical storage management)
• Design and Meta-data
• Ingest
• Data discovery, search, and analysis
• reorganize-reprocess
• disaster recovery
• management/operations cost
                                          16
Demo


  http://msrlab/terraserver




                              17
    Outline
• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system

• Conclusion 1
  – Think about Maps and SCANS

• Conclusion 2:
  – Think about Clusters




                                              18
Meta-Message:
Technology Ratios Are Important
• If everything gets faster & cheaper
            at the same rate
                   THEN nothing really changes.
• Things getting MUCH BETTER:
  – communication speed & cost 1,000x
  – processor speed & cost 100x
  – storage size & cost 100x

• Things staying about the same
  – speed of light (more or less constant)
  – people (10x more expensive)
  – storage speed (only 10x better)
                                                   19
              Today’s Storage Hierarchy :
              Speed & Capacity vs Cost Tradeoffs
  [Figure: two log-log charts, "Size vs Speed" (typical system capacity in
   bytes, 10^3 to 10^15) and "Price vs Speed" ($/MB, 10^-4 to 10^4), both
   plotted against access time (10^-9 to 10^3 seconds) for cache, main
   memory, secondary storage (disc), online tape, nearline tape, and
   offline tape.]
                                                                                                      20
                               Storage Ratios Changed
                   • 10x better access time
                   • 10x more bandwidth
                   • 4,000x lower media price
                    • DRAM/DISK price ratio: 100:1 to 10:1 to 50:1

  [Charts, 1980-2000, log scale: disk performance vs. time (access time in
   ms, bandwidth in MB/s), disk accesses/second and capacity (GB) vs. time,
   and storage price ($/MB) vs. time.]
                                                                                   21
            What's a Terabyte
1 Terabyte
1,000,000,000 business letters
  100,000,000 book pages                 150 miles of bookshelf
   50,000,000 FAX images                  15 miles of bookshelf
   10,000,000 TV pictures (mpeg)           7 miles of bookshelf
        4,000 LandSat images              10 days of video
Library of Congress (in ASCII) is 25 TB

1980: 200 M$ of disc                 10,000 discs
        5 M$ of tape silo            10,000 tapes

1997: 200 K$ of magnetic disc          120 discs
      250 K$ of optical disc robot      200 platters
       25 K$ of tape silo                25 tapes


Terror Byte !!
.1% of a PetaByte!!!!!!!!!!!!!!!!!!
                                                                  22
      The Cost of Storage & Access
• File Cabinet:   cabinet (4 drawer)        250$
                  paper (24,000 sheets)     250$
                  space (2x3 @ 10$/ft2)     180$
                  total                     700$
                       => 3 ¢/sheet
• Disk:           disk (9 GB)             2,000$
                  ASCII: 5 m pages
                       => 0.2 ¢/sheet (15x cheaper than paper)
• Image:          200 k pages
                       => 1 ¢/sheet (similar to paper)
                                                       23
Trends:
Application Storage Demand Grew
    • The Old World:                 • The New World:
        – Millions of objects            – Billions of objects
        – 100-byte objects               – Big objects (1 MB)

  [Figure: a small "People" table (Name, Address: David/NY, Mike/Berk,
   Won/Austin), standing in for the old world of small records.]
                                                         24
     Trends:
     New Applications
Multimedia: Text, voice, image, video, ...
 The paperless office

 Library of Congress online (on your campus)

 All information comes electronically
    entertainment
    publishing
    business

 Information Network,
 Knowledge Navigator,
 Information at Your Fingertips

                                               25
Thesis: Performance = Storage Accesses,
not Instructions Executed
• In the “old days” we counted instructions and IO’s
• Now we count memory references
• Processors wait most of the time
                Where the time goes:
  [Pie charts: clock ticks used by AlphaSort components -- disc wait, sort,
   OS, memory wait, B-cache data miss, D-cache miss, and I-cache miss.]
                                                          26
            The Pico Processor

  1 mm^3, 1 M SPECmarks

  10 pico-second RAM        megabyte         10^6 clocks / fault to bulk RAM
  10 nano-second RAM        10 gigabyte      Event-horizon on chip
  10 microsecond RAM        1 terabyte       VM reincarnated
  10 millisecond disc       100 terabyte
  10 second tape archive    100 petabyte     Multi-program cache

  Terror Bytes!
                                                                     27
  Storage Latency: How Far Away is the Data?

  (access time, in clock ticks)
  10^9   Tape / Optical robot    Andromeda      2,000 years
  10^6   Disk                    Pluto          2 years
  100    Memory                  Sacramento     1.5 hr
  10     On-board cache          This campus    10 min
  2      On-chip cache           This room
  1      Registers               My head        1 min
                                                        28
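The distance analogy scales if you read one clock tick as roughly one minute of human time (registers = "my head, 1 min"); a small sketch of the scaling:

```python
# Latency analogy: 1 clock tick ~ 1 minute of human time.
MIN_PER_YEAR = 60 * 24 * 365
for device, ticks in [("registers", 1), ("memory", 100), ("disk", 10**6), ("tape robot", 10**9)]:
    years = ticks / MIN_PER_YEAR          # each tick counted as one minute
    print(f"{device:10s}: {ticks:>13,} ticks ~ {years:10.4f} years")
# memory ~ 1.7 hours, disk ~ 1.9 years, tape ~ 1,900 years --
# roughly the Sacramento / Pluto / Andromeda figures above.
```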
         The Five Minute Rule
 • Trade DRAM for Disk Accesses
 • Cost of an access (DriveCost / Access_per_second)
 • Cost of a DRAM page ( $/MB / pages_per_MB)
 • Break even has two terms:
 • Technology term and an Economic term
      BreakEvenReferenceInterval =
          (PagesPerMBofDRAM / AccessesPerSecondPerDisk)
        × (PricePerDiskDrive / PricePerMBofDRAM)

 • Grew page size to compensate for changing ratios.
 • Still at 5 minutes for random, 1 minute for sequential.
                                                                                                        29
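Plugging 1997-ish numbers into the break-even formula (a sketch assuming 8 KB pages, a ~$2,000 disk doing ~100 random accesses/s, and DRAM at ~$15/MB, consistent with the "For the Record" slide a few slides below):

```python
# Five-minute-rule break-even interval with 1997-ish device numbers.
pages_per_mb_dram = 1024 // 8        # 8 KB pages -> 128 pages per MB
accesses_per_sec = 100               # random accesses/s per disk
disk_price = 2000.0                  # $ per drive
dram_price_mb = 15.0                 # $ per MB of DRAM
interval = (pages_per_mb_dram / accesses_per_sec) * (disk_price / dram_price_mb)
print(f"break-even reference interval ~ {interval:.0f} s (~{interval / 60:.1f} minutes)")
# ~170 s, i.e. a few minutes -- the "five minute" ballpark.
```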
Shows Best Index Page Size is ~16 KB

  Index page utility vs. page size and index entry size:

  Page size (KB)       2      4      8     16     32     64    128
  16 B entries       0.64   0.72   0.78   0.82   0.79   0.69   0.54
  32 B entries       0.54   0.62   0.69   0.73   0.71   0.63   0.50
  64 B entries       0.44   0.53   0.60   0.64   0.64   0.57   0.45
  128 B entries      0.34   0.43   0.51   0.56   0.56   0.51   0.41

  Index page utility vs. page size and disk performance:

  Page size (KB)       2      4      8     16     32     64    128
  40 MB/s            0.65   0.74   0.83   0.91   0.97   0.99   0.94
  10 MB/s            0.64   0.72   0.78   0.82   0.79   0.69   0.54
  5 MB/s             0.62   0.69   0.73   0.71   0.63   0.50   0.34
  3 MB/s             0.51   0.56   0.58   0.54   0.46   0.34   0.22
  1 MB/s             0.40   0.44   0.44   0.41   0.33   0.24   0.16
                                                                            30
     Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB: today at 10MB & 100$/MB
  – Disk: GB and $/GB: today at 10 GB and 200$/GB
  – Tape: TB and $/TB: today at .1TB and 25k$/TB (nearline)
• Access time (latency)
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate
  – RAM:      1 GB/s
  – Disk:     5 MB/s - - - Arrays can go to 1GB/s
  – Tape:     5 MB/s - - - striping is problematic
                                                              31
     New Storage Metrics:
     Kaps, Maps, SCAN?
• Kaps: How many kilobyte objects served per second
  – The file server, transaction processing metric
  – This is the OLD metric.
• Maps: How many megabyte objects served per
  second
  – The Multi-Media metric
• SCAN: How long to scan all the data
  – the data mining and utility metric
• And
  – Kaps/$, Maps/$, TBscan/$

                                                      32
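These metrics follow directly from a device's latency, bandwidth, capacity, and price. A sketch using the disk column of the "For the Record" table on the next slide; the 3-year amortization behind the $/Kaps figure is an assumption that happens to reproduce that table:

```python
# Kaps / Maps / SCAN for one device (example: the 1997 disk on the next slide).
latency_s, bw, capacity, price = 0.01, 5e6, 9e9, 2000.0   # 10 ms, 5 MB/s, 9 GB, $2,000
three_years = 3 * 365 * 24 * 3600

kaps = 1 / (latency_s + 1e3 / bw)     # KB-sized objects per second  (~100)
maps = 1 / (latency_s + 1e6 / bw)     # MB-sized objects per second  (~4.8)
scan_s = capacity / bw                # time to scan the whole unit  (~1,800 s)
print(f"Kaps {kaps:.0f}, Maps {maps:.2f}, unit scan {scan_s:.0f} s")
print(f"$/Kaps ~ {price / (kaps * three_years):.1e}")     # ~2e-7, as in the table
```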
  For the Record (good 1997 devices)
                     DRAM     DISK    TAPE robot
Unit capacity (GB)     1         9        35 X 14
   Unit price $      15000     2000     10000
       $/GB          15000     222        20
   Latency (s)       1.E-7    1.E-2     3.E+1
Bandwidth (MB/s)      500        5        5
       Kaps          5.E+5    1.E+2     3.E-2
       Maps          5.E+2     4.76     3.E-2
 Scan time (s/TB)      2       1800     98000
      $/Kaps         3.E-10   2.E-7     3.E-3
      $/Maps         3.E-7    4.E-6     3.E-3
     $/TBscan        $0.32      $4       $296
                                               33
  How To Get Lots of Maps, SCANs
• parallelism: use many little devices in parallel

     At 10 MB/s, scanning 1 Terabyte takes 1.2 days.
     1,000x parallel: 100 second SCAN.

  [Figure: one drive at 10 MB/s vs. 1,000 drives scanning 1 Terabyte in parallel.]

Parallelism: divide a big problem into many smaller ones
to be solved in parallel.
• Beware of the media myth
• Beware of the access time myth
                                                           34
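The scan-time arithmetic behind the 1.2-day vs. 100-second comparison (assuming 1 Terabyte = 10^12 bytes):

```python
# Serial vs. parallel scan of 1 Terabyte at 10 MB/s per drive.
TB, rate = 1e12, 10e6
serial_s = TB / rate
print(f"1 drive     : {serial_s:>9,.0f} s = {serial_s / 86400:.1f} days")  # ~1.2 days
print(f"1,000 drives: {serial_s / 1000:>9,.0f} s")                         # ~100 seconds
```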
      The Disk Farm On a Card
The 100GB disc card
An array of discs                           14"


Can be used as
    100 discs
       1 striped disc
      10 Fault Tolerant discs
      ....etc
LOTS of accesses/second and bandwidth

Life is cheap, it's the accessories that cost ya.

Processors are cheap, it's the peripherals that cost ya
             (a 10 k$ disc card).
                                                          35
Tape Farms for Tertiary Storage
Not Mainframe Silos
  [Figure: 100 robots -- 1 M$, 50 TB, 50 $/GB, 3K Maps, 27 hr scan.
   Each 10 K$ robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps;
   scan in 27 hours. Many independent tape robots (like a disc farm).]
                                                     36
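Aggregate figures for the tape farm, assuming 100 robots at 500 GB and 5 MB/s each:

```python
# Tape-farm capacity and scan time.
robots, gb_each, mb_s_each = 100, 500, 5
capacity_tb = robots * gb_each / 1000                    # 50 TB
scan_hours = capacity_tb * 1e12 / (robots * mb_s_each * 1e6) / 3600
print(f"{capacity_tb:.0f} TB, full scan in ~{scan_hours:.0f} hours")   # ~28 hours ("27 hr Scan")
```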
           The Metrics:
           Disk and Tape Farms Win
  [Chart, log scale: GB/K$, Kaps, Maps, and SCANS/Day compared for a
   1,000x disc farm, an STC tape robot (6,000 tapes, 8 readers), and a
   100x DLT tape farm.]

  Data Motel: data checks in, but it never checks out.
                                                                              37
      Tape & Optical:
      Beware of the Media Myth

Optical is cheap: 200 $/platter
                    2 GB/platter
    => 100$/GB (2x cheaper than disc)

Tape is cheap:     30 $/tape
                   20 GB/tape
    => 1.5 $/GB    (100x cheaper than disc).



                                               38
Tape & Optical Reality:
Media is 10% of System Cost
 Tape needs a robot (10 k$ ... 3 m$ )
   10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB
   (1x…10x cheaper than disc)
 Optical needs a robot (100 k$ )
  100 platters = 200GB ( TODAY ) => 400 $/GB
   ( more expensive than mag disc )

 Robots have poor access times
    Not good for Library of Congress (25TB)
    Data motel: data checks in but it never checks out!

                                                            39
      The Access Time Myth
The Myth: seek or pick time dominates.
The reality: (1) Queuing dominates,
             (2) Transfer dominates for BLOBs,
             (3) Disk seeks are often short.
Implication: many cheap servers are
  better than one fast, expensive server
   – shorter queues
   – parallel transfer
   – lower cost/access and cost/byte
This is now obvious for disk arrays.
This will be obvious for tape arrays.

  [Pie charts: request time split among wait, transfer, rotate, and seek.]
                                                             40
    Outline
• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system

• Conclusion 1
  – Think about Maps and SCAN & 5 minute rule

• Conclusion 2:
  – Think about Clusters




                                                41
      Scaleable Computers:
      BOTH SMP and Cluster

 • Grow UP with SMP (a 4xP6 is now standard)
 • Grow OUT with Cluster (a cluster has inexpensive parts)

  [Figure: personal system, departmental server, and SMP super server,
   each scaling out into a cluster of PCs.]
                                                  42
    What do TPC results say?
• Mainframes do not compete on performance or price
     They have great legacy code (MVS)
• PC node performance is 1/3 that of high-end UNIX nodes
  – 6xP6 vs 48xUltraSparc

• PC Technology is 3x cheaper than high-end UNIX
• Peak performance is a cluster
  – Tandem 100 node cluster
  – DEC Alpha 4x8 cluster

• Commodity solutions WILL come to this market
                                                       43
     Cluster Advantages
• Clients and Servers made from the same stuff.
   – Inexpensive: Built with commodity components

• Fault tolerance:
   – Spare modules mask failures
• Modular growth
   – grow by adding small modules

• Parallel data search
   – use multiple processors and disks


                                                    44
         Clusters being built
•   Teradata 500 nodes                    (50k$/slice)
•   Tandem,VMScluster 150 nodes          (100k$/slice)
•   Intel, 9,000 nodes @ 55M$              ( 6k$/slice)
•   Teradata, Tandem, DEC moving to NT+low slice price

• IBM: 512 nodes ASCI @ 100m$               (200k$/slice)
• PC clusters (bare handed) at dozens of nodes
  web servers (msn, PointCast,…), DB servers

• KEY TECHNOLOGY HERE IS THE APPS.
  – Apps distribute data
  – Apps distribute execution
                                                            45
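A quick check of the price-per-slice figures above:

```python
# $/slice for two of the clusters listed above.
for name, dollars, nodes in [("Intel (9,000 nodes)", 55e6, 9000),
                             ("IBM ASCI (512 nodes)", 100e6, 512)]:
    print(f"{name:22s}: {dollars / nodes / 1e3:5.0f} k$/slice")
# ~6 k$ and ~195 k$ per slice, matching the 6 k$ and 200 k$ figures quoted.
```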
  Clusters are winning the high end
• Until recently a 4x8 cluster has best TPC-C performance
• Clusters have best data mining story (TPC-D)
• This year, a 32xUltraSparc cluster won the MinuteSort
  [Chart: sort records/second vs. time, 1985-2000, log scale (1.0E+02 to
   1.0E+07). Points include M68000, Tandem, hardware sorter, Sequent,
   Cray YMP, Intel Hypercube, IBM 3090, Alpha, IBM RS6000, SGI, NOW, and
   "Next NOW (100 nodes)".]
                                                                                46
    Clusters (Plumbing)
• Single system image
  – naming
  – protection/security
  – management/load balance

• Fault Tolerance
• Hot Pluggable hardware & Software




                                      47
    So, What’s New?
• When slices cost 50k$, you buy 10 or 20.
• When slices cost 5k$ you buy 100 or 200.
• Manageability, programmability, usability
  become key issues (total cost of ownership).
• PCs are MUCH easier to use and program

  [Diagram: the MPP vicious cycle -- no customers, so every generation needs
   a new MPP, a new app, and a new OS -- vs. the CP/commodity virtuous cycle,
   where standard OS & hardware attract apps and customers.
   Standards allow progress and investment protection.]
                                                       48
    Where We Are Today
• Clusters moving fast
  – OLTP
  – Sort
  – WolfPack

• Technology ahead of schedule
  – CPUs, disks, tapes, wires, ...

• Databases are evolving
• Parallel DBMSs are evolving
• Operations (batch) has a long way to go on Unix/PC.
                                                         50
      Outline
• The challenge: Building GIANT data stores
  – for example, the EOS/DIS 15 PB system

• Conclusion 1
  – Think about Maps and SCANs & 5 minute rule

• Conclusion 2:
  – Think about Clusters

• Slides & paper:
  http://www.research.Microsoft.com/~Gray/talks
  December SIGMOD RECORD
  http://www.research.Microsoft.com/~Gray/5_Min_Rule_Sigmod.doc
                                                              51

								