EMC Presentation

Research @ Northeastern University
• I/O storage modeling and performance
     – David Kaeli


• Soft error modeling and mitigation
     – Mehdi B. Tahoori




I/O Storage Research at Northeastern University

David Kaeli
Yijian Wang
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu
                   Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file
  I/O
• I/O Qualification Laboratory @ NU
• Areas for future work



         Important File-based I/O Workloads
• Many subsurface sensing and imaging workloads
  involve file-based I/O
     – Cellular biology – in-vitro fertilization with NU biologists
     – Medical imaging – cancer therapy with MGH
     – Underwater mapping – multi-sensor fusion with Woods Hole
       Oceanographic Institution
     – Ground-penetrating radar – toxic waste tracking with Idaho
       National Labs




        The Impact of Profile-guided Parallelization on SSI Applications

•   Reduced the runtime of a single-body Steepest Descent Fast Multipole
    Method (SDFMM) application by 74% on a 32-node Beowulf cluster
     •   Hot-path parallelization
     •   Data restructuring
•   Reduced the runtime of a Monte Carlo scattered light simulation by 98%
    on a 16-node Silicon Graphics Origin 2000
     •   Matlab-to-C compilation
     •   Hot-path parallelization
•   Obtained superlinear speedup of the Ellipsoid Algorithm run on a
    16-node IBM SP2
     •   Matlab-to-C compilation
     •   Hot-path parallelization

[Figures: subsurface geometry sketch (air/soil/mine); scattered light simulation
speedup chart (runtime in seconds for the original, Matlab-to-C, and hot-path
parallelization versions); Ellipsoid Algorithm speedup versus the serial C version
for 64-, 256-, and 1024-vector problem sizes on 1-16 nodes]
           Limits of Parallelization
• For compute-bound workloads, Beowulf clusters can
  be used effectively to overcome computational
  barriers
• Middleware (e.g., MPI and MPI-IO) can significantly
  reduce the programming effort on parallel systems
• Multiple clusters can be combined using Grid
  middleware (e.g., the Globus Toolkit)
• For file-based, I/O-bound workloads, Beowulf clusters
  and Grid systems are presently ill-suited to exploit the
  parallelism available in these workloads


                   Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file
  I/O
• I/O Qualification Laboratory @ NU
• Areas for future work



      Parallel I/O Acceleration
  • The I/O bottleneck
        – The growing gap between the speed of
          processors, networks and underlying I/O devices
        – Many imaging and scientific applications access
          disks very frequently
  • I/O intensive applications
        – Out-of-core applications
             – Work on large datasets that cannot fit in main memory
        – File-intensive applications
             – Access file-based datasets frequently
             – Large number of file operations
                       Introduction

• Storage architectures
     – Direct Attached Storage (DAS)
           – Storage device is directly attached to the computer
     – Network Attached Storage (NAS)
           – Storage subsystem is attached to a network of servers
             and file requests are passed through a parallel filesystem
             to the centralized storage device
     – Storage Area Network (SAN)
           – A dedicated network to provide an any-to-any connection
             between processors and disks



                        I/O Partitioning

[Diagram: an I/O-intensive application can spread its data over multiple disks
(i.e., RAID data striping) and over multiple processes (i.e., MPI-IO); combining
the two yields data partitioning, where each process accesses its own set of disks]
                    I/O Partitioning
• I/O is parallelized at both the application level
  (using MPI and MPI-IO) and the disk level (using file
  partitioning); a minimal application-level sketch
  follows this list
• Ideally, every process will only access files on its
  local disk (though this is typically not possible
  due to data sharing)
• How do we recognize the access patterns?
   – With a profile-guided approach
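
For reference, a minimal application-level sketch using mpi4py, where each rank
issues a collective read at its own offset into a shared file (the filename and
chunk size here are illustrative, not taken from the slides):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

chunk = 2040                                    # illustrative access size in bytes
buf = np.empty(chunk, dtype=np.uint8)

fh = MPI.File.Open(comm, "dataset.bin", MPI.MODE_RDONLY)
fh.Read_at_all(rank * chunk, buf)               # collective read at a rank-specific offset
fh.Close()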



                   Profile Generation

1. Run the application
2. Capture I/O execution profiles
3. Apply our partitioning algorithm
4. Rerun the tuned application
        I/O traces and partitioning
 • For every process and every contiguous file access,
   we capture the following I/O profile information
   (an illustrative record layout follows this list):
       –   Process ID
       –   File ID
       –   Address
       –   Chunk size
       –   I/O operation (read/write)
       –   Timestamp
 • Generate a partition for every process
 • Optimal partitioning is NP-complete, so we develop a
   greedy algorithm
 • We have found we can use partial profiles to guide
   partitioning
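
For concreteness, one plausible in-memory layout for a single profile record
(field names are illustrative; the slides do not specify the trace format):

from dataclasses import dataclass

@dataclass
class IORecord:
    pid: int          # process ID
    file_id: int      # file ID
    address: int      # starting byte offset of the contiguous access
    chunk_size: int   # length of the access in bytes
    op: str           # 'read' or 'write'
    timestamp: float  # time of the access
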
        Greedy File Partitioning Algorithm
for each I/O process, create a partition;
for each contiguous data chunk {
        total up the number of read/write accesses on a per-process basis;
        if the chunk is accessed by only one process
                assign the chunk to that process's partition;
        else if the chunk is read (but never written) by multiple processes
                duplicate the chunk in every partition where it is read;
        else if the chunk is written by one process but later read by multiple processes
                assign the chunk to every partition where it is read,
                and broadcast the updates on writes;
        else
                assign the chunk to a shared partition;
}
for each partition
        sort chunks by the earliest access timestamp of each chunk;
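
A minimal Python sketch of the greedy assignment above, operating on profile
records like those on the previous slide (the chunk identifier, field names, and
tie-breaking details are assumptions, not taken from the paper):

from collections import defaultdict

def partition_chunks(trace):
    """trace: records with .pid, .chunk (a hashable id, e.g. (file_id, address)),
    .op ('read'/'write'), and .timestamp."""
    readers, writers, first_seen = defaultdict(set), defaultdict(set), {}
    for r in trace:
        (readers if r.op == "read" else writers)[r.chunk].add(r.pid)
        first_seen[r.chunk] = min(first_seen.get(r.chunk, r.timestamp), r.timestamp)

    partitions = defaultdict(list)
    for chunk in first_seen:
        touched = readers[chunk] | writers[chunk]
        if len(touched) == 1:                       # accessed by a single process
            partitions[next(iter(touched))].append(chunk)
        elif not writers[chunk]:                    # read-shared, never written
            for pid in readers[chunk]:
                partitions[pid].append(chunk)       # duplicate in every reading partition
        elif len(writers[chunk]) == 1:              # one writer, several readers
            for pid in touched:
                partitions[pid].append(chunk)       # replicate; writes must be broadcast
        else:
            partitions["shared"].append(chunk)      # everything else goes to a shared partition

    for chunks in partitions.values():              # order by earliest access time
        chunks.sort(key=first_seen.get)
    return dict(partitions)
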
                      Parallel I/O Workloads
• NAS Parallel Benchmarks (NPB2.4)/BT
   – Computational fluid dynamics
   – Generates a file (~1.6 GB) dynamically and then reads it back
   – Writes/reads sequentially in chunk sizes of 2040 bytes
• SPEChpc96/seismic
   – Seismic processing
   – Generates a file (~1.5 GB) dynamically and then reads it back
   – Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB
• Tile-IO
   – Parallel Benchmarking Consortium
   – Tile access to a two-dimensional matrix (~1 GB) with overlap
   – Writes/reads sequential chunks of 32 KB, with 2KB of overlap
• Perf
   – Parallel I/O test program within MPICH
   – Writes a 1 MB chunk at a location determined by rank, no overlap
• Mandelbrot
   – An image processing application that includes visualization
   – Chunk size is dependent on the number of processes
                        Beowulf Cluster

[Diagram: six P2-350MHz nodes connected by a 10/100Mb Ethernet switch; each node
has a local PCI-IDE disk, and two RAID nodes are also attached to the switch]
                   Hardware Specifics
• DAS configuration
     – Linux box, Western Digital WD800BB (IDE), 80 GB,
       7200 RPM
• Beowulf cluster (base configuration)
     – Fast Ethernet, 100 Mbits/sec
     – Network-attached RAID – Morstor TF200 with six 9 GB
       Seagate SCSI drives, 7200 RPM, RAID-5
     – Locally attached IDE disks – IBM UltraATA-350840, 5400 RPM
• Fibre Channel disks
     – Seagate Cheetah X15 ST-336752FC, 15000 RPM




                        Write/Read Bandwidth

[Charts: write and read bandwidth (MB/sec) for NPB2.4/BT (4, 9, 16, and 25
processes) and SPECHPC/seis (4, 8, 16, and 24 processes), comparing Unix, MPI-IO,
and partitioned I/O (P-IO) writes and reads]
                        Write/Read Bandwidth

[Charts: write and read bandwidth (MB/sec) for MPI-Tile, Perf, and Mandelbrot,
comparing MPI-IO writes/reads against partitioned I/O (PIO) writes/reads]
                        Total Execution Time

[Chart: total execution time (seconds) using MPI-IO versus partitioned I/O (PIO)
for NPB2.4/BT, SPEChpc, Tile-IO, Perf, and Mandelbrot]
Profile training sensitivity analysis
• We have found that I/O access patterns are
  independent of file-based data values
• When we increase the problem size or
  reduce the number of processes, either:
     – the number of I/Os increases, but the access patterns
       and chunk size remain the same (SPEChpc96,
       Mandelbrot), or
     – the number of I/Os and the I/O access patterns remain
       the same, but the chunk size increases (NPB/BT, Tile-
       IO, Perf)
• Re-profiling can be avoided
Execution-driven Parallel I/O Modeling
  • Growing need to process large, complex
    datasets in high performance parallel
    computing applications
  • Efficient implementation of storage
    architectures can significantly improve system
    performance
  • An accurate simulation environment is needed for
    users to test and evaluate different storage
    architectures and applications

Execution-driven I/O Modeling
• Target applications: parallel scientific programs
  (MPI)
• Target machine/Host machine: Beowulf clusters
• Use DiskSim as the underlying disk drive
  simulator
• Direct execution to model CPU and network
  communication
• We execute the real parallel I/O accesses while
  simultaneously calculating the simulated I/O response
  time (see the sketch below)
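
A rough sketch of this direct-execution idea, assuming a hypothetical disk-model
object in place of the real DiskSim interface (the class, its parameters, and the
helper function below are illustrative, not the actual simulator code):

class ToyDiskModel:
    """Stand-in for DiskSim (hypothetical: fixed overhead plus transfer time)."""
    def __init__(self, overhead_s=0.008, bytes_per_s=40e6):
        self.overhead_s = overhead_s
        self.bytes_per_s = bytes_per_s

    def service_time(self, nbytes):
        return self.overhead_s + nbytes / self.bytes_per_s

def simulated_read(fh, offset, nbytes, model, sim_clock):
    """Perform the real read (direct execution) while charging simulated disk time."""
    fh.seek(offset)
    data = fh.read(nbytes)
    sim_clock["io"] += model.service_time(nbytes)   # accumulate simulated response time
    return data
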
                Validation – Synthetic I/O Workload on DAS

[Charts: modeled versus measured response time (seconds) for sequential reads and
writes (access sizes of 1-16 blocks, 1000 accesses) and for non-contiguous reads
and writes (seek distances of 1-32 blocks, access size of 1 block, 1000 accesses)]
                Simulation Framework - NAS

[Diagram: per-node local I/O traces are passed over the LAN/WAN to a network file
system; filesystem metadata translates logical file access addresses into I/O
requests for the RAID controller, which is modeled with DiskSim]
                Execution Time of NPB2.4/BT on NAS - base configuration

[Chart: modeled versus measured execution time (seconds) for 4, 9, 16, and 25
processors]
        Simulation Framework – SAN direct
  • A variant of SAN in which disks are distributed across the network and each
    server is directly connected to a single device
  • File partitioning
  • Utilize I/O profiling and data partitioning heuristics to distribute portions of
    files to disks close to the processing nodes
[Diagram: four filesystem nodes connected over a LAN/WAN; each node's local I/O
traces drive its own directly attached disk, modeled with DiskSim]
                Execution Time of NPB2.4/BT on SAN-direct - base configuration

[Chart: modeled versus measured execution time (seconds) for 4, 9, 16, and 25
processors]
              Hardware Specifications

[Table not reproduced in this text version]
                I/O Bandwidth of SPEChpc/seis

[Chart: I/O bandwidth (MB/s) for 4, 8, and 16 processors across storage
architectures: NAS-joulian, NAS-ATA, NAS-SCSI, NAS-FC, SAN-joulian,
SAN-direct-ATA, SAN-direct-SCSI, and SAN-direct-FC]
                I/O Bandwidth of Mandelbrot

[Chart: I/O bandwidth (MB/s) for 4, 8, and 16 processors across the same storage
architectures as the previous slide]
                            Publications
1.   “Profile-guided File Partitioning on Beowulf Clusters,” Journal of Cluster
     Computing, Special Issue on Parallel I/O, to appear 2005.
2.   “Execution-Driven Simulation of Network Storage Systems,” Proceedings of the
     12th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of
     Computer and Telecommunication Systems (MASCOTS), October 2004, pp. 604-611.
3.   “Profile-Guided I/O Partitioning,” Proceedings of the 17th ACM International
     Symposium on Supercomputing, June 2003, pp. 252-260.
4.   “Source Level Transformations to Apply I/O Data Partitioning,” Proceedings of
     the IEEE Workshop on Storage Network Architecture and Parallel I/O, October
     2003, pp. 12-21.
5.   “Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging
     Applications,” International Journal of Systems, Science and Technology,
     September 2002, pp. 40-55.




   Summary of Cluster-based Work
• Many imaging applications are dominated by file-based
  I/O
• Parallel systems can only be effectively utilized if I/O is
  also parallelized
• Developed a profile-guided approach to I/O data
  partitioning
• Impacting clinical trials at MGH
• Reduced overall execution time by 27-82% over MPI-IO
• Execution-driven I/O model is highly accurate and
  provides significant modeling flexibility


                   Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file
  I/O
• I/O Qualification Laboratory @ NU
• Areas for future work



         I/O Qualification Laboratory
• Working with Enterprise Strategy Group
• Develop a state-of-the-art facility to provide
  independent performance qualification of
  Enterprise Storage systems
• Provide a quarterly report to the ES customer
  base on the status of current ES offerings
• Work with leading ES vendors to provide
  them with custom early performance
  evaluation of their beta products

         I/O Qualification Laboratory
• Contacted by IOIntegrity and SANGATE
  for product qualification
• Identified potential partners that are
  leaders in the ES field
• Initial proposals already reviewed by
  IBM, Hitachi and other ES vendors
• Looking for initial endorsement from
  industry

         I/O Qualification Laboratory
• Why @ NU
     – Track record with industry (EMC, IBM,
       Sun)
     – Experience with benchmarking and IO
       characterization
     – Interesting set of applications (medical,
       environmental, etc.)
     – Great opportunity to work within the
       cooperative education model

                   Outline
• Motivation to study file-based I/O
• Profile-driven partitioning for parallel file
  I/O
• I/O Qualification Laboratory @ NU
• Areas for future work



                      Areas for Future Work
• Designing a Peer-to-Peer storage system on a Grid system
  by partitioning datasets across geographically distributed
  storage devices
[Diagram: two clusters connected over the Internet through their head nodes - a
31-sub-node cluster with RAID and 100 Mbit/s links (joulian.hpcl.neu.edu) and an
8-sub-node cluster with 1 Gbit/s links (keys.ece.neu.edu)]
                NPB2.4/BT read performance

[Chart: read bandwidth (MB/s) for 4, 9, 16, and 25 processes on single-server,
dual-server, and P2P configurations]
                   Areas for Future Work
• Reduce simulation time by identifying
  characteristic “phases” in I/O workloads
• Apply machine learning algorithms to identify
  clusters of representative I/O behavior
• Use K-Means and multinomial clustering to obtain
  high fidelity in simulation runs based on sampled
  I/O behavior (see the sketch below)

“A Multinomial Clustering Model for Fast Simulation of
Architecture Designs”, submitted to the 2005 ACM KDD
Conference.
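
As an illustration of the clustering idea, a minimal scikit-learn K-Means sketch
over hypothetical per-interval I/O feature vectors (the feature choice and data
below are made up; the multinomial model from the paper is not shown):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features for each execution interval:
# [reads per second, writes per second, mean chunk size in KB]
intervals = np.array([
    [120.0,  10.0, 32.0],
    [115.0,  12.0, 32.0],
    [  5.0, 200.0, 96.0],
    [  8.0, 190.0, 96.0],
    [ 60.0,  60.0,  2.0],
])

phases = KMeans(n_clusters=3, n_init=10, random_state=0).fit(intervals)
print(phases.labels_)   # phase label per interval; simulate one representative per phase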


				