The Quest for Scalable Support of Data Intensive - LRZ Leibniz
• Segregated storage and compute
  – NFS, GPFS, PVFS, Lustre
  – Batch-scheduled systems: Clusters, Grids, and Supercomputers
  – Programming paradigm: HPC, MTC, and HTC
• Co-located storage and compute
  – HDFS, GFS
  – Data centers at Google, Yahoo, and others
  – Programming paradigm: MapReduce
  – Others from academia: Sector, MosaStore, Chirp
• Local Disk:
  – 2002-2004: ANL/UC TG Site (70GB SCSI)
  – Today: PADS (RAID-0, 6 drives 750GB SATA)
• Cluster:
  – 2002-2004: ANL/UC TG Site (GPFS, 8 servers, 1Gb/s each)
  – Today: PADS (GPFS, SAN)
• Supercomputer:
  – 2002-2004: IBM Blue Gene/L (GPFS)
  – Today: IBM Blue Gene/P (GPFS)

[Figure: MB/s per processor core (log scale), 2002-2004 vs. today, for local disk, cluster, and supercomputer storage; per-core bandwidth declines of -2.2X, -99X, -15X, and -438X]
What if we could combine the scientific community’s existing programming paradigms, yet still exploit the data locality that naturally occurs in scientific workloads?
[Figure: problem space by input data size (low, med, hi) vs. number of tasks (1 to 1M): HPC (heroic MPI tasks) at low data size and low task counts; HTC/MTC (many loosely coupled tasks) at low data size and high task counts; MapReduce/MTC (data analysis, mining) at high data size and moderate task counts; MTC (big data and many tasks) at high data size and high task counts]
[MTAGS08] “Many-Task Computing for Grids and Supercomputers”
“Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks.”
• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance
• Resources acquired in response to demand
• Data diffuses from archival storage to newly acquired transient resources
• Resource “caching” allows faster responses to subsequent requests (see the cache sketch below)
• Resources are released when demand drops
• Optimizes performance by co-scheduling data and computations
• Decreases dependency on a shared/parallel file system
• Critical to support data-intensive MTC

[Figure: data diffusion architecture: a data-aware scheduler and task dispatcher place tasks on provisioned resources, pulling data from persistent storage (shared file system) and drawing on idle resources as demand grows]

[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
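A minimal sketch of the per-node “resource caching” behavior described above (illustrative only; the class and function names are assumptions, not the actual implementation's API): a node serves a file from its local cache on a hit and otherwise diffuses it from persistent storage, evicting least-recently-used entries when the cache fills.

  # Hedged sketch of per-node resource caching for data diffusion.
  # Names (NodeCache, persistent_read) are illustrative assumptions.
  from collections import OrderedDict

  class NodeCache:
      def __init__(self, capacity_bytes, persistent_read):
          self.capacity = capacity_bytes
          self.read_from_persistent = persistent_read  # e.g. a shared-file-system read
          self.cache = OrderedDict()                   # file name -> size, in LRU order

      def get(self, name, size):
          if name in self.cache:                       # cache hit: serve locally
              self.cache.move_to_end(name)
              return f"local:{name}"
          data = self.read_from_persistent(name)       # cache miss: diffuse from shared storage
          while self.cache and sum(self.cache.values()) + size > self.capacity:
              self.cache.popitem(last=False)           # evict least recently used file
          self.cache[name] = size
          return data

  # Toy usage with a hypothetical file name and a stand-in for a shared-storage read:
  cache = NodeCache(4 * 2**30, persistent_read=lambda n: f"shared:{n}")
  print(cache.get("obj001.fit", 24 * 2**20))  # miss -> read from persistent storage
  print(cache.get("obj001.fit", 24 * 2**20))  # hit  -> served from local cache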
• What would data diffusion look like in practice?
• Extend the Falkon framework

[SC07] “Falkon: a Fast and Light-weight tasK executiON framework”
• FA: first-available
  – simple load balancing
• MCH: max-cache-hit
  – maximize cache hits
• MCU: max-compute-util
  – maximize processor utilization
• GCC: good-cache-compute
  – maximize both cache hits and processor utilization at the same time (see the sketch below)

[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
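A minimal sketch of how these four policies might pick an executor for a task (illustrative only; Executor, Task, and pick_executor are assumed names, not Falkon's actual data-aware scheduler code):

  # Hedged sketch of the four data-aware scheduling policies.
  from dataclasses import dataclass, field

  @dataclass
  class Executor:
      name: str
      cached: set = field(default_factory=set)   # files currently cached on this node
      busy: bool = False

  @dataclass
  class Task:
      files: set                                  # files the task needs

  def pick_executor(task, executors, policy):
      idle = [e for e in executors if not e.busy]
      hits = lambda e: len(task.files & e.cached)
      if policy == "FA":    # first-available: simple load balancing, ignores locality
          return idle[0] if idle else None
      if policy == "MCH":   # max-cache-hit: best locality, even if the node is busy
          return max(executors, key=hits)
      if policy == "MCU":   # max-compute-util: keep processors busy, prefer locality among idle nodes
          return max(idle, key=hits) if idle else None
      if policy == "GCC":   # good-cache-compute: idle node with cache hits, else any idle node
          idle_hits = [e for e in idle if hits(e) > 0]
          if idle_hits:
              return max(idle_hits, key=hits)
          return idle[0] if idle else max(executors, key=hits)
      raise ValueError(policy)

  # Toy usage: GCC prefers the idle node that already caches the task's file.
  execs = [Executor("n0", {"f1"}), Executor("n1", set())]
  print(pick_executor(Task({"f1"}), execs, "GCC").name)  # -> n0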
• 3GHz dual CPUs
• ANL/UC TG with 128 processors
• Scheduling window: 2500 tasks
• Dataset: 100K files, 1 byte each
• Tasks: read 1 file, write 1 file

[Figure: per-task CPU time (ms) broken down into task submit, notification for task availability, task dispatch (data-aware scheduler), task results (data-aware scheduler), notification for task results, and WS communication, plus throughput (tasks/sec), for first-available without I/O, first-available with I/O, max-compute-util, max-cache-hit, and good-cache-compute]

[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
• Monotonically Increasing Workload
  – Emphasizes increasing loads
• Sine-Wave Workload
  – Emphasizes varying loads
• All-Pairs Workload
  – Compare to best-case model of active storage
• Image Stacking Workload (Astronomy)
  – Evaluate data diffusion on a real large-scale data-intensive application from the astronomy domain

[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
• 250K tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Increment function: CEILING(previous rate * 1.3) (see the sketch below)
  – Max: 1000 tasks/sec
• 128 processors
• Ideal case:
  – 1415 sec
  – 80Gb/s peak throughput

[Figure: arrival rate (tasks per second) and cumulative tasks completed vs. time for the monotonically increasing workload]
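A small sketch of that arrival schedule, assuming the increment applies to the previous rate at each step (the slide does not state how long each rate is held):

  # Hedged sketch: arrival rates from 1 task/sec up to 1000 tasks/sec,
  # incremented as CEILING(previous rate * 1.3).
  import math

  def increasing_rates(start=1, cap=1000, factor=1.3):
      rate = start
      while rate < cap:
          yield rate
          rate = min(math.ceil(rate * factor), cap)
      yield cap

  print(list(increasing_rates()))
  # -> [1, 2, 3, 4, 6, 8, 11, 15, 20, 26, 34, 45, ..., 836, 1000]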
• GPFS vs. ideal: 5011 sec vs. 1415 sec
[Figure: throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time for the GPFS baseline]
Max-compute-util vs. max-cache-hit

[Figure: two panels (max-compute-util, max-cache-hit) showing cache hit local %, cache hit global %, cache miss %, CPU utilization, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time]
[Figure: four panels (per-node cache sizes of 1GB, 1.5GB, 2GB, and 4GB) showing cache hit local %, cache hit global %, cache miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time]
• Data Diffusion vs. ideal: 1436 sec vs. 1415 sec

[Figure: cache hit local %, cache hit global %, cache miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time for data diffusion]
• Throughput:
  – Average: 14Gb/s vs. 4Gb/s
  – Peak: 81Gb/s vs. 6Gb/s
• Response Time:
  – 3 sec vs. 1569 sec (506X improvement)

[Figure: aggregate throughput (local worker caches, remote worker caches, GPFS) for Ideal, FA, GCC 1GB, GCC 1.5GB, GCC 2GB, GCC 4GB, MCH 4GB, and MCU 4GB; average response time per policy: FA 1569 sec, GCC 1GB 1084 sec, GCC 1.5GB 114 sec, GCC 2GB 3.4 sec, GCC 4GB 3.1 sec, MCH 4GB 230 sec, MCU 4GB 287 sec]
• Performance Index:
  – 34X higher
• Speedup:
  – 3.5X faster than GPFS

[Figure: performance index and speedup (compared to first-available) for FA, GCC 1GB, GCC 1.5GB, GCC 2GB, GCC 4GB, GCC 4GB SRP, MCH 4GB, and MCU 4GB]
• 2M tasks
  – 10MB reads
  – 10ms compute
• Vary arrival rate:
  – Min: 1 task/sec
  – Arrival rate function: A = (sin(sqrt(time+0.11)*2.859678)+1)*(time+0.11)*5.705
  – Max: 1000 tasks/sec
• 200 processors
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput

[Figure: arrival rate (per second) and number of tasks completed vs. time for the sine-wave workload]
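A quick sketch of that arrival-rate function as reconstructed above (the "+" signs were lost in extraction, so the exact form and its time units are an assumption), clamped to the stated 1-1000 tasks/sec range:

  # Hedged sketch of the sine-wave arrival-rate function.
  import math

  def arrival_rate(t):
      a = (math.sin(math.sqrt(t + 0.11) * 2.859678) + 1) * (t + 0.11) * 5.705
      return min(max(a, 1), 1000)   # clamp to [1, 1000] tasks/sec

  for t in (0, 60, 120, 180, 240):  # sample a few points of the varying load
      print(t, round(arrival_rate(t), 1))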
• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs

[Figure: throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time for GPFS on the sine-wave workload]
• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• GCC+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs

[Figure: cache hit local %, cache hit global %, cache miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time for GCC+SRP on the sine-wave workload]
• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• GCC+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
• GCC+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs

[Figure: cache hit local %, cache hit global %, cache miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated vs. time for GCC+DRP on the sine-wave workload]
• All-Pairs( set A, set B, function F ) returns matrix M:
  – Compare all elements of set A to all elements of set B via function F, yielding matrix M, such that M[i,j] = F(A[i],B[j])

    foreach $i in A
      foreach $j in B
        submit_job F $i $j
      end
    end

• 500x500
  – 250K tasks
  – 24MB reads
  – 100ms compute
  – 200 CPUs
• 1000x1000
  – 1M tasks
  – 24MB reads
  – 4sec compute
  – 4096 CPUs
• Ideal case:
  – 6505 sec
  – 80Gb/s peak throughput

[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
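A runnable sketch of the All-Pairs abstraction above (here F is applied locally; in the slide, submit_job would dispatch each comparison as a task):

  # Hedged sketch: All-Pairs builds M[i][j] = F(A[i], B[j]) for every pair.
  def all_pairs(A, B, F):
      return [[F(a, b) for b in B] for a in A]

  # Toy usage: compare two small sets of vectors by dot product.
  A = [(1, 0), (0, 1)]
  B = [(1, 1), (2, 3)]
  M = all_pairs(A, B, lambda a, b: sum(x * y for x, y in zip(a, b)))
  print(M)  # [[1, 2], [1, 3]]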
• Efficiency: 75%

[Figure: throughput (data diffusion), max throughput (GPFS), max throughput (local disk), cache hit local %, cache hit global %, and cache miss % vs. time]
• Efficiency: 86%

[Figure: throughput (data diffusion), max throughput (GPFS), max throughput (local memory), cache hit local %, cache hit global %, and cache miss % vs. time]

[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
• Pull vs. Push
  – Data Diffusion
      • Pulls task working set
      • Incremental spanning forest
  – Active Storage
      • Pushes workload working set to all nodes
      • Static spanning tree

Christopher Moretti, Douglas Thain, University of Notre Dame

[Figure: efficiency of Best Case (active storage), Falkon (data diffusion), and Best Case (parallel file system) for 500x500 / 200 CPUs / 1 sec, 500x500 / 200 CPUs / 0.1 sec, 1000x1000 / 4096 CPUs / 4 sec, and 1000x1000 / 5832 CPUs / 4 sec]

Experiment                    Approach                      Local Disk/Memory (GB)   Network node-to-node (GB)   Shared File System (GB)
500x500, 200 CPUs, 1 sec      Best Case (active storage)    6000                     1536                        12
                              Falkon (data diffusion)       6000                     1698                        34
500x500, 200 CPUs, 0.1 sec    Best Case (active storage)    6000                     1536                        12
                              Falkon (data diffusion)       6000                     1528                        62
1000x1000, 4096 CPUs, 4 sec   Best Case (active storage)    24000                    12288                       24
                              Falkon (data diffusion)       24000                    4676                        384
1000x1000, 5832 CPUs, 4 sec   Best Case (active storage)    24000                    12288                       24
                              Falkon (data diffusion)       24000                    3867                        906

[DIDC09] “Towards Data Intensive Many-Task Computing”, under review
• Best to use active storage if
  – Slow data source
  – Workload working set fits on local node storage
• Best to use data diffusion if
  – Medium to fast data source
  – Task working set << workload working set
  – Task working set fits on local node storage
• If task working set does not fit on local node storage
  – Use a parallel file system (e.g. GPFS, Lustre, PVFS, etc.)
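The rule of thumb above, as a hedged sketch (the thresholds and the notion of a "slow" or "fast" data source are assumptions for illustration):

  # Hedged sketch of the storage-choice heuristic above.
  def choose_storage(source_speed, task_ws_gb, workload_ws_gb, local_storage_gb):
      if task_ws_gb > local_storage_gb:
          return "parallel file system (e.g. GPFS, Lustre, PVFS)"
      if source_speed == "slow" and workload_ws_gb <= local_storage_gb:
          return "active storage"
      if source_speed in ("medium", "fast") and task_ws_gb < workload_ws_gb:
          return "data diffusion"
      return "parallel file system (e.g. GPFS, Lustre, PVFS)"

  print(choose_storage("fast", task_ws_gb=2, workload_ws_gb=600, local_storage_gb=50))
  # -> data diffusion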
• Purpose
   – On-demand “stacks” of random locations within ~10TB dataset
• Challenge
   – Processing Costs:
      • O(100ms) per object
   – Data Intensive:
      • 40MB:1sec
   – Rapid access to 10-10K “random” files
   – Time-varying load

[Figure: stacking of image cutouts drawn from the Sloan (AstroPortal) dataset]

   Locality   Number of Objects   Number of Files
      1            111700              111700
      1.38         154345              111699
      2             97999               49000
      3             88857               29620
      4             76575               19145
      5             60590               12120
     10             46480                4650
     20             40460                2025
     30             23695                 790

                                                                              32
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
[TG06] “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”
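The locality values in the table above are the ratio of requested objects to the distinct files they are read from (e.g. 154345 objects spread over 111699 files gives a locality of about 1.38). The hypothetical helper below, not part of AstroPortal, computes that ratio from an object-to-file mapping.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Hypothetical helper: locality = objects requested / distinct files touched. */
    class LocalityCalculator {
        static double locality(List<String> fileOfEachObject) {
            // One entry per requested object, naming the file that contains it.
            Set<String> distinctFiles = new HashSet<String>(fileOfEachObject);
            return (double) fileOfEachObject.size() / distinctFiles.size();
        }
    }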
[Figure: time (ms) per stacking operation — open, radec2xy, readHDU+getTile+curl+convertArray, calibration+interpolation+doStacking, writeStacking — for GPFS vs. local storage and GZ vs. FIT image formats]
                                                                                                                     33
[DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
• Low data locality
   – Similar (but better) performance to GPFS
• High data locality
   – Near perfect scalability

[Figure: time (ms) per stack per CPU vs. number of CPUs (2-128), comparing Data Diffusion (GZ, FIT) against GPFS (GZ, FIT) under low and high data locality]
                                                                                                                                                                       34
     [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
• Aggregate throughput:
   – 39Gb/s
   – 10X higher than GPFS
• Reduced load on GPFS
   – 0.49Gb/s
   – 1/10 of the original load
• Big performance gains as locality increases

[Figure: aggregate throughput (Gb/s) — data diffusion local, cache-to-cache, and GPFS components vs. GPFS-only (GZ, FIT) — and time (ms) per stack per CPU vs. data locality (1-30), with the ideal curve]

                                                                              35
  [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
• Data access patterns: write once, read many
• Task definition must include input/output file metadata
  (a sketch follows after this list)
• Per-task working set must fit in local storage
• Needs IP connectivity between hosts
• Needs local storage (disk, memory, etc.)
• Needs Java 1.4+
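Because scheduling is data-aware, each task has to declare its files up front. The fragment below is a hypothetical illustration of such a task description; it is not Falkon's actual API.

    import java.util.List;

    /** Hypothetical task description carrying the file metadata a data-aware scheduler needs. */
    class TaskDescription {
        final String executable;          // black-box application to run
        final List<String> arguments;
        final List<String> inputFiles;    // read-only inputs (write once, read many)
        final List<String> outputFiles;   // results written back to persistent storage
        final long workingSetBytes;       // must fit in the node's local storage

        TaskDescription(String executable, List<String> arguments,
                        List<String> inputFiles, List<String> outputFiles,
                        long workingSetBytes) {
            this.executable = executable;
            this.arguments = arguments;
            this.inputFiles = inputFiles;
            this.outputFiles = outputFiles;
            this.workingSetBytes = workingSetBytes;
        }
    }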

                                                36
•   [Ghemawat03,Dean04]: MapReduce+GFS
•   [Bialecki05]: Hadoop+HDFS
•   [Gu06]: Sphere+Sector
•   [Tatebe04]: Gfarm
•   [Chervenak04]: RLS, DRS
•   [Kosar06]: Stork

• Conclusions
    – None focused on the co-location of storage and generic
      black box computations with data-aware scheduling while
      operating in a dynamic elastic environment
    – Swift + Falkon + Data Diffusion is arguably a more generic
      and powerful solution than MapReduce                    37
• Identified that data locality is crucial to the efficient use of
  large-scale distributed systems for data-intensive applications
  Data Diffusion:
   – Integrated streamlined task dispatching with data-aware
     scheduling policies (a simplified policy sketch follows below)
  – Heuristics to maximize real world performance
  – Suitable for varying, data-intensive workloads
  – Proof of O(NM) Competitive Caching
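A minimal sketch of a data-aware dispatch policy in the spirit described above, assuming per-worker cache bookkeeping; this is a simplification, not Falkon's actual scheduler.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Simplified data-aware dispatch: pick the idle worker caching the most task inputs. */
    class DataAwareScheduler {
        // worker id -> files currently cached on that worker (hypothetical bookkeeping)
        private final Map<String, Set<String>> workerCaches = new HashMap<String, Set<String>>();

        String dispatch(Collection<String> idleWorkers, List<String> taskInputFiles) {
            String best = null;
            int bestHits = -1;
            for (String worker : idleWorkers) {
                Set<String> cache = workerCaches.containsKey(worker)
                        ? workerCaches.get(worker) : Collections.<String>emptySet();
                int hits = 0;
                for (String f : taskInputFiles) {
                    if (cache.contains(f)) hits++;
                }
                if (hits > bestHits) {   // falls back to any idle worker when nothing is cached
                    bestHits = hits;
                    best = worker;
                }
            }
            return best;
        }
    }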



                                                    38
      • Falkon is a real system
              – Late 2005: Initial prototype, AstroPortal
              – January 2007: Falkon v0
              – November 2007: Globus incubator project v0.1
                      • http://dev.globus.org/wiki/Incubator/Falkon
              – February 2009: Globus incubator project v0.9
      • Implemented in Java (~20K lines of code) and C
        (~1K lines of code)
              – Open source: svn co https://svn.globus.org/repos/falkon
       • Source code contributors (besides myself)
              – Yong Zhao, Zhao Zhang, Ben Clifford, Mihael Hategan
                                                                      39
[Globus07] “Falkon: A Proposal for Project Globus Incubation”
• Workload
   – 160K CPUs
   – 1M tasks
   – 60 sec per task
• 2 CPU years in 453 sec
• Throughput: 2312 tasks/sec
• 85% efficiency (a rough sanity check follows below)
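As a rough sanity check, assuming exactly 1M tasks of 60 sec each on 160K cores: the total work is 1,000,000 x 60 s = 60,000,000 CPU-seconds, or about 1.9 CPU-years; the ideal makespan on 160,000 cores is 60,000,000 / 160,000 = 375 s, so 375 / 453 gives roughly 83% efficiency and 1,000,000 / 453 gives roughly 2,200 tasks/sec — close to the reported 85% and 2312 tasks/sec, with the small gap attributable to the rounded task length and core count used in this back-of-the-envelope estimate.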



                                                                           40
[TPDS09] “Middleware Support for Many-Task Computing”, under preparation
                                                                           41
[TPDS09] “Middleware Support for Many-Task Computing”, under preparation
ACM MTAGS09 Workshop
        @ SC09
Due Date: August 1st, 2009




                             42
    IEEE TPDS Journal
   Special Issue on MTC
Due Date: December 1st, 2009




                          43
• More information:
   – Other publications: http://people.cs.uchicago.edu/~iraicu/
   – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
   – Swift: http://www.ci.uchicago.edu/swift/index.php
• Funding:
   – NASA: Ames Research Center, GSRP
   – DOE: Office of Advanced Scientific Computing Research,
     Office of Science, U.S. Dept. of Energy
   – NSF: TeraGrid
• Relevant activities:
   – ACM MTAGS09 Workshop at Supercomputing 2009
      • http://dsl.cs.uchicago.edu/MTAGS09/
   – Special Issue on MTC in IEEE TPDS Journal
      • http://dsl.cs.uchicago.edu/TPDS_MTC/                  44

				