In Search of PetaByte Databases

Jim Gray
Tony Hey

The Cost of Storage
(heading for 1K$/TB soon)

[Figure: charts for 12/1/1999, 9/1/2000 and 9/1/2001. Left panels, "Price vs disk capacity": price ($) against raw disk unit size (GB) for SCSI and IDE drives, with linear fits labeled y = 17.9x, y = 13x, y = 7.2x, y = 6.7x, y = 3.8x and y = 2.0x. Right panels: raw k$/TB against disk unit size (GB) for SCSI and IDE.]
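
The linear fits on the price charts (y in $, x in GB) translate directly into the k$/TB figures on the companion charts. A minimal sketch of that conversion, assuming decimal units (1 TB = 1000 GB); only the slope values come from the slide, and the function name is made up for illustration:

```python
# Slopes of the "Price vs disk capacity" fits (y = slope * x, y in $, x in GB).
# Which drive type and date each slope belongs to is not recoverable from the
# extracted text, so none is assigned here.
FIT_SLOPES_DOLLARS_PER_GB = [17.9, 13.0, 7.2, 6.7, 3.8, 2.0]

GB_PER_TB = 1000  # assumption: decimal units


def slope_to_kdollars_per_tb(slope_dollars_per_gb: float) -> float:
    """Convert a fit slope in $/GB to a raw price in k$/TB (hypothetical helper)."""
    dollars_per_tb = slope_dollars_per_gb * GB_PER_TB
    return dollars_per_tb / 1000.0


for slope in FIT_SLOPES_DOLLARS_PER_GB:
    print(f"y = {slope}x  ->  {slope_to_kdollars_per_tb(slope):.1f} k$/TB")
```

With these units a slope of k $/GB is the same number in k$/TB, so the flattest fit (y = 2.0x) sits at about 2 k$/TB, within a factor of two of the 1K$/TB the slide title is heading for.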

Summary

• DBs own the sweet-spot:
  – 1GB to 100TB
• Big data is not in databases
  HPTS does not own
  high performance storage (BIG DATA)
• We should
• Cost of storage is people:
  – Performance goal:
    1 Admin per PB

State is Expensive

• Stateless clones are easy to manage
  – App servers are middle tier
• Cost goes to zero with Moore’s law.
  – One admin per 1,000 clones.
  – Good story about scaleout.
• Stateful servers are expensive to manage
  – 1TB to 100TB per admin
  – Storage cost is going to zero (2k$ to 200k$).
• Cost of storage is management cost

The Personal Petabyte (someday)
Personal 100 GB today

• It’s coming (2M$ today…2K$ in 10 years)
• Today the pack rats have ~10-100GB
  – 1-10 GB in text (eMail, PDF, PPT, OCR…)
  – 10GB – 50GB tiff, mpeg, jpeg,…
  – Some have 1TB (voice + video).
• Video can drive it to 1PB.
• Online PB affordable in 10 years.
• Get ready: tools to capture, manage,
  organize, search, display will be big app.
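
The "2M$ today…2K$ in 10 years" projection implies a particular rate of price decline. A small sketch of the arithmetic, assuming the decline is a smooth exponential (the slide gives only the two endpoints):

```python
import math

price_today = 2_000_000   # $ for a petabyte today, from the slide (2M$)
price_future = 2_000      # $ for a petabyte in 10 years, from the slide (2K$)
years = 10

drop_factor = price_today / price_future                  # 1000x cheaper
halvings = math.log2(drop_factor)                         # just under 10 halvings
halving_time_years = years / halvings                     # about one year per halving
annual_price_ratio = (price_future / price_today) ** (1 / years)

print(f"price drops {drop_factor:.0f}x, halving every {halving_time_years:.2f} years")
print(f"each year costs {annual_price_ratio:.2f} of the previous year")
```

Under that assumption the projection amounts to storage prices halving roughly every year.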

An Image Database: TerraServer
10 TB

• Snapshot of the USA (1 meter granularity)
  – 10,000,000,000,000 (=10^13) sq meters
  – == 15TB raw (some duplicates)
  – == 5 TB cooked
     • 5x compression
     • + Image pyramid
     • + gazetteer
• Interesting things:
  – It’s all in the Database
  – Clustered (allows flaky hardware, online upgrade)
  – Triplexed – snapshot each night
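
The TerraServer sizes can be sanity-checked with a few divisions. This is only a sketch: it assumes decimal terabytes and that the 5x compression applies to the whole 15 TB of raw imagery, neither of which the slide states.

```python
SQ_METERS = 10**13        # area covered at 1 meter granularity (from the slide)
RAW_TB = 15               # raw imagery, including some duplicates
COOKED_TB = 5             # stored size after processing
COMPRESSION = 5           # 5x compression
BYTES_PER_TB = 10**12     # assumption: decimal units

raw_bytes_per_sq_meter = RAW_TB * BYTES_PER_TB / SQ_METERS        # 1.5 bytes
cooked_bytes_per_sq_meter = COOKED_TB * BYTES_PER_TB / SQ_METERS  # 0.5 bytes

# Assumption: all raw data is compressed 5x; whatever remains of the 5 TB is
# then attributable to the image pyramid and the gazetteer.
compressed_tb = RAW_TB / COMPRESSION                   # 3 TB
pyramid_and_gazetteer_tb = COOKED_TB - compressed_tb   # 2 TB

print(raw_bytes_per_sq_meter, cooked_bytes_per_sq_meter)
print(compressed_tb, pyramid_and_gazetteer_tb)
```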

Databases (== SQL)

• VLDB survey (Winter Corp).
• 10 TB to 100TB DBs.
  – Size doubling yearly
  – Riding disk Moore’s law
  – 10,000 disks at 18GB is 100TB cooked.
• Mostly DSS and data warehouses.
• Some media managers
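
Two figures on this slide invite a quick calculation: how much of the raw disk ends up as "cooked" database, and how long yearly doubling takes to reach a petabyte. A sketch assuming decimal units and steady doubling:

```python
import math

DISKS = 10_000
GB_PER_DISK = 18
COOKED_TB = 100          # usable database size quoted on the slide
GB_PER_TB = 1000         # assumption: decimal units

raw_tb = DISKS * GB_PER_DISK / GB_PER_TB     # 180 TB of raw disk
cooked_fraction = COOKED_TB / raw_tb         # a bit over half of raw is usable

# "Size doubling yearly": years for a 100 TB database to reach 1 PB (1000 TB).
years_to_petabyte = math.log2(1000 / COOKED_TB)

print(f"{raw_tb:.0f} TB raw, {cooked_fraction:.0%} cooked")
print(f"{years_to_petabyte:.1f} years of doubling from 100 TB to 1 PB")
```

The gap between raw and cooked capacity presumably covers redundancy, indices and free space; the slide does not break it down.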

DB iFS

• DB2: leave the files where they live
  – Referential integrity between DBMS and FS.
• Oracle: put the files in the DBMS
  – One security model
  – One storage management model
  – One space manager
  – One recovery manager
  – One replication system
  – One thing to tune.
  – Features: transactions,….

Interesting facts

• No DBMSs beyond 100TB.
• Most bytes are in files.
• The web is file centric.
• eMail is file centric.
• Science (and batch) is file centric.
• But….
• SQL performance is better than CIFS/NFS.
  – CISC vs RISC

BaBar: the biggest DB

• 350 TB
• Uses Objectivity™
• SLAC events
• Linux cluster scans DB looking for patterns

Hotmail / Yahoo
300 TB (cooked)

• Clone front ends
  ~10,000 @ hotmail.
• Application servers
  – ~100 @ hotmail
  – Get mail box
  – Get/put mail
  – Disk bound
     • ~30,000 disks
• ~20 admins
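
The per-disk and per-admin ratios follow from the slide's round numbers. A sketch, assuming the 300 TB, the ~30,000 disks and the ~20 admins all describe the same installation (the slide does not say so explicitly):

```python
COOKED_TB = 300      # from the slide heading
DISKS = 30_000       # approximate, from the slide
ADMINS = 20          # approximate, from the slide
GB_PER_TB = 1000     # assumption: decimal units

cooked_gb_per_disk = COOKED_TB * GB_PER_TB / DISKS   # 10 GB of cooked data per disk
tb_per_admin = COOKED_TB / ADMINS                    # 15 TB per admin
disks_per_admin = DISKS / ADMINS                     # 1,500 disks per admin

print(cooked_gb_per_disk, tb_per_admin, disks_per_admin)
```

The low cooked figure per disk is consistent with the "Disk bound" bullet: the disks would be bought for arms, not capacity.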

AOL (msn)
(1PB?)

• 10 B transactions per day (10% of that)
• Huge storage
• Huge traffic
• Lots of eye candy
• DB used for security/accounting.
• GUESS AOL is a petabyte
  – (40M x 10MB = 400 x 10^12)

Google
1.5PB as of last spring

• 8,000 no-name PCs
  – Each 1/3U, 2 x 80 GB disk, 2 cpu, 256MB ram
• 1.4 PB online.
• 2 TB ram online
• 8 TeraOps
• Slice-price is 1K$ so 8M$.
• 15 admins (!) (== 1/100TB).
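
Most of the Google totals can be rebuilt from the per-PC figures. A sketch using decimal units (an assumption); the variable names are illustrative only:

```python
PCS = 8_000
DISKS_PER_PC, GB_PER_DISK = 2, 80
CPUS_PER_PC = 2
RAM_MB_PER_PC = 256
SLICE_PRICE_USD = 1_000
ADMINS = 15
ONLINE_PB = 1.4          # figure quoted on the slide
TERAOPS = 8

disk_pb = PCS * DISKS_PER_PC * GB_PER_DISK / 1_000_000    # GB -> PB
ram_tb = PCS * RAM_MB_PER_PC / 1_000_000                  # MB -> TB
cost_musd = PCS * SLICE_PRICE_USD / 1_000_000             # $ -> M$
tb_per_admin = ONLINE_PB * 1000 / ADMINS
gigaops_per_cpu = TERAOPS * 1000 / (PCS * CPUS_PER_PC)

print(f"{disk_pb:.2f} PB disk, {ram_tb:.2f} TB ram, {cost_musd:.0f} M$")
print(f"{tb_per_admin:.0f} TB per admin, {gigaops_per_cpu:.1f} GigaOps per cpu")
```

The per-PC disk figures multiply out to about 1.28 PB, a little under the 1.4 PB quoted, so the slide's numbers are evidently rounded. The admin ratio comes to roughly 93 TB each, which matches the "1/100TB" note.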

Computational Science

• Traditional Empirical Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Computational Science
  – Data captured by instruments
    Or data generated by simulator
  – Processed by software
  – Placed in a database
  – Scientist analyzes database
  – tcl scripts
     • on C programs
         – on ASCII files

Astronomy

• I’ve been trying to apply DB to astronomy
• Today they are at 10TB per data set
• Heading for Petabytes
• Using Objectivity
• Trying SQL (talk to me offline)

Fast Moving Objects

• Find near earth asteroids:
  SELECT r.objID as rId, g.objId as gId, r.run, r.camcol, r.field as field, g.field as gField,
      r.ra as ra_r, r.dec as dec_r, g.ra as ra_g, g.dec as dec_g,
      sqrt( power(r.cx -g.cx,2)+ power(r.cy-g.cy,2)+power(r.cz-g.cz,2) )*(10800/PI()) as distance
    FROM PhotoObj r, PhotoObj g
    WHERE
      r.run = g.run and r.camcol=g.camcol and abs(g.field-r.field)<2 -- the match criteria
      -- the red selection criteria
      and ((power(r.q_r,2) + power(r.u_r,2)) > 0.111111 )
      and r.fiberMag_r between 6 and 22 and r.fiberMag_r < r.fiberMag_g and r.fiberMag_r < r.fiberMag_i
      and r.parentID=0 and r.fiberMag_r < r.fiberMag_u and r.fiberMag_r < r.fiberMag_z
      and r.isoA_r/r.isoB_r > 1.5 and r.isoA_r>2.0
      -- the green selection criteria
      and ((power(g.q_g,2) + power(g.u_g,2)) > 0.111111 )
      and g.fiberMag_g between 6 and 22 and g.fiberMag_g < g.fiberMag_r and g.fiberMag_g < g.fiberMag_i
      and g.fiberMag_g < g.fiberMag_u and g.fiberMag_g < g.fiberMag_z
      and g.parentID=0 and g.isoA_g/g.isoB_g > 1.5 and g.isoA_g > 2.0
      -- the matchup of the pair
      and sqrt(power(r.cx -g.cx,2)+ power(r.cy-g.cy,2)+power(r.cz-g.cz,2))*(10800/PI())< 4.0
      and abs(r.fiberMag_r-g.fiberMag_g)< 2.0




• Finds 3 objects in 11 minutes
• Ugly,
  but consider the alternatives
  (C programs and files and…)
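
The distance expression in the query, sqrt(...)*(10800/PI()), turns the straight-line gap between two positions into arcminutes. A minimal sketch of the same calculation, assuming cx, cy, cz are the components of a unit vector pointing at the object (the slide does not define them); the helper names are hypothetical:

```python
import math


def unit_vector(ra_deg: float, dec_deg: float) -> tuple:
    """Unit vector for a sky position; assumed to match the (cx, cy, cz) columns."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def distance_arcmin(a: tuple, b: tuple) -> float:
    """Chord length between two unit vectors, scaled by 10800/pi as in the query.

    For small separations the chord length is nearly the angle in radians, and
    10800/pi converts radians to arcminutes (180 * 60 / pi).
    """
    chord = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return chord * (10800 / math.pi)


r = unit_vector(180.0, 0.0)
g = unit_vector(180.0, 1.0 / 60.0)   # one arcminute further north
print(distance_arcmin(r, g))
```

Read this way, the match condition `< 4.0` pairs a red and a green detection that lie within about 4 arcminutes of each other.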

Particle Physics – Hunting the Higgs and Dark Matter

• April 2006: First pp collisions at TeV energies at
  the Large Hadron Collider in Geneva
• ATLAS/CMS Experiments involve 2000 physicists
  from 200 organizations in US, EU, Asia
• Need to store, access, process, analyse 10 PB/yr
  with 200 TFlop/s distributed computation
• Building hierarchical Grid infrastructure to
  distribute data and computation
• Many tens of millions of $ in funding – GriPhyN,
  PPDataGrid, iVDGL, DataGrid, DataTag, GridPP

ExaBytes and PetaFlop/s by 2015
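
To get a feel for 10 PB/yr against 200 TFlop/s, it helps to turn both into per-second and per-byte terms. A sketch assuming decimal petabytes and data arriving evenly through a 365-day year (real accelerators run in bursts, so peak rates would be higher):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
BYTES_PER_YEAR = 10 * 10**15      # 10 PB/yr, decimal units assumed
FLOPS = 200e12                    # 200 TFlop/s

avg_mb_per_s = BYTES_PER_YEAR / SECONDS_PER_YEAR / 1e6
avg_gbit_per_s = avg_mb_per_s * 8 / 1000
flop_per_byte = FLOPS * SECONDS_PER_YEAR / BYTES_PER_YEAR

print(f"{avg_mb_per_s:.0f} MB/s sustained ({avg_gbit_per_s:.1f} Gbit/s)")
print(f"{flop_per_byte:,.0f} floating-point operations per stored byte")
```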

Astronomy: Past and Future of the Universe

• Virtual Observatories – NVO, AVO, AstroGrid
    – Store all wavelengths, need distributed joins
    – NVO 500 TB/yr from 2004
• Laser Interferometer Gravitational-Wave Observatory
    – Search for direct evidence for gravitational waves
    – LIGO 250 TB/yr, random streaming from 2002
• VISTA Visible and IR Survey Telescope in 2004
    – 250 GB/night, 100 TB/yr, Petabytes in 10 yrs

New phase of astronomy, storing, searching
and analysing Petabytes of data
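
The VISTA figures are round numbers, as a quick check shows. A sketch assuming decimal units and, for the upper bound, observing on every night of the year:

```python
GB_PER_NIGHT = 250
TB_PER_YEAR = 100        # yearly figure quoted on the slide
GB_PER_TB = 1000         # assumption: decimal units

tb_if_every_night = GB_PER_NIGHT * 365 / GB_PER_TB        # about 91 TB
nights_for_quoted_rate = TB_PER_YEAR * GB_PER_TB / GB_PER_NIGHT   # 400 nights
years_to_one_pb = 1000 / TB_PER_YEAR                      # 10 years

print(tb_if_every_night, nights_for_quoted_rate, years_to_one_pb)
```

So 250 GB/night gives a little under 100 TB/yr, and at the quoted yearly rate the archive reaches a petabyte in 10 years, in line with "Petabytes in 10 yrs".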

Engineering, Environment and Medical Applications

• Real-Time Health Monitoring
    – UK DAME project for Rolls Royce Aero Engines
    – 1 GB sensor data/flight, 100,000 engine hours/day
• Earth Observation
    – ESA satellites generate 100 GB/day
    – NASA 15 PB by 2007
• Medical Images to Information
    – UK IRC Project on mammograms and MRIs
    – 100 MB/mammogram, UK 3M/yr, US 26M/yr
    – 200 MB/patient, Oxford 500 women/yr

Many Petabytes of data of real commercial interest
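
The yearly volumes behind these bullets are easy to multiply out. A sketch in decimal units (an assumption), using only the per-item sizes and counts given on the slide:

```python
MB_PER_MAMMOGRAM = 100
UK_MAMMOGRAMS_PER_YEAR = 3_000_000
US_MAMMOGRAMS_PER_YEAR = 26_000_000

uk_tb_per_year = MB_PER_MAMMOGRAM * UK_MAMMOGRAMS_PER_YEAR / 1_000_000       # MB -> TB
us_pb_per_year = MB_PER_MAMMOGRAM * US_MAMMOGRAMS_PER_YEAR / 1_000_000_000   # MB -> PB

oxford_gb_per_year = 200 * 500 / 1000        # 200 MB/patient x 500 women/yr, in GB
esa_tb_per_year = 100 * 365 / 1000           # 100 GB/day, in TB

print(f"UK mammograms: {uk_tb_per_year:.0f} TB/yr, US: {us_pb_per_year:.1f} PB/yr")
print(f"Oxford: {oxford_gb_per_year:.0f} GB/yr, ESA: {esa_tb_per_year:.1f} TB/yr")
```

US mammography alone works out to about 2.6 PB per year, which is where the "many Petabytes" comes from.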

Grids, Databases and Cool Tools

• Scientists:
    – will build Grids based on Globus Open Source middleware
    – will have instruments generating Petabytes of data
    – will annotate their data with XML-based metadata
  Realize a version of Licklider and Taylor’s original
  vision of resource sharing and the ARPANET
• TP and DB community:
    – Should assist in developing Grid Interfaces to DBMS
    – Should develop ‘Cool Tools’ for Grid Services
  There will be commercial Grid applications and
  viable business opportunities

Summary

• DBs own the sweet-spot:
  – 1GB to 100TB
• Big data is not in databases
• HPTS crowd is not really high
  performance storage (BIG DATA)
• Cost of storage is people:
  – Performance goal:
    1 Admin per PB
								