					Unified Parallel C (UPC)
               Kathy Yelick

           UC Berkeley and LBNL


4/3/2004        PSC Petascale Methods   1
  UPC Projects
• GWU: http://upc.gwu.edu
  • Benchmarking, language design
• MTU: http://www.upc.mtu.edu
  • Language, benchmarking, MPI runtime for HP compiler
• UFL: http://www.hcs.ufl.edu/proj/upc
  • Communication runtime (GASNet)
• UMD: http://www.cs.umd.edu/~tseng/
  • Benchmarks
• IDA: http://www.super.org
  • Language, compiler for t3e
• Other companies (Intel, Sun,…) and labs
  4/3/2004           PSC Petascale Methods           2
 UPC Compiler Efforts
• HP: http://www.hp.com/go/upc
    • Compiler, tests, language
• Etnus: http://www.etnus.com
    • Debugger
• Intrepid: http://www.intrepid.com/upc
    • Compiler based on gcc
• UCB/LBNL: http://upc.lbl.gov
    • Compiler, runtime, applications
• IBM: http://www.ibm.com
    • Compiler under development for SP line
• Cray: http://www.cray.com
    • Compiler product for X1
4/3/2004              PSC Petascale Methods    3
Comparison to MPI
• One-sided vs. two-sided communication models (contrast sketched below)
• Programmability
    • Two-sided works reasonably well for regular
      computation
    • When computation is irregular/asynchronous, issuing
      receives can be difficult
    • To simplify programming, communication is grouped
      into a phase, which limits overlap
• Performance
    • Some hardware does one-sided communication
    • RDMA support is increasingly common
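To make the contrast concrete, here is a minimal sketch (not from the slides) of moving one value to another thread under each model; the buffer and helper names are illustrative only.

   /* With two-sided MPI the same transfer needs matching calls on both ranks:
         rank 0: MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
         rank 1: MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
      In UPC, thread 0 simply writes into thread 1's partition; thread 1
      issues no receive. (Assumes THREADS >= 2.) */
   #include <upc.h>

   shared double buf[THREADS];     /* one element with affinity to each thread */

   void send_one_sided(double x) {
       if (MYTHREAD == 0)
           buf[1] = x;             /* one-sided remote store into thread 1's element */
       upc_barrier;                /* make the write visible before thread 1 reads it */
   }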


4/3/2004             PSC Petascale Methods             4
  Communication Support Today
[Chart: measured µsec cost of small-message communication — added latency, send overhead (alone), send & receive overhead, receive overhead (alone) — across platforms and APIs (T3E Shmem / E-Reg / MPI, IBM LAPI / MPI, Quadrics Shmem / MPI, Myrinet GM / MPI, GigE VIPL / MPI)]
• Potential performance advantage for fine-grained, one-sided programs
• Potential productivity advantage for irregular applications
   4/3/2004                   PSC Petascale Methods                     5
MPI vs. PGAS Languages

• GASNet - portable, high-performance communication layer
  • compilation target for both UPC and Titanium
  • reference implementation over MPI 1.1 (AMMPI-based)
  • direct implementation over many vendor network API's:
        • IBM LAPI, Quadrics Elan, Myrinet GM, Infiniband vapi, Dolphin
          SCI, others on the way…
• Applications: NAS parallel benchmarks (CG & MG)
   • Standard benchmarks written in UPC by GWU
   • Compiled using Berkeley UPC compiler
   • Difference is GASNet backend: MPI 1.1 vs vendor API
   • Also used HP/Compaq UPC compiler where available
• Caveats
   • Not a comparison of MPI as a programming model
 4/3/2004                 PSC Petascale Methods                    6
Performance Difference Translates to Applications

                                          •Bulk-synchronous
                                           NAS MG and CG
                                           codes in UPC
                                          •Elan-based layer
                                           beats MPI
                                            • Performance and
                                              scaling
                                          •The only difference
                                           in the Berkeley lines
                                           is the network API!
                                          •Machine: Alpha +
                                           Quadrics, Lemieux
Source: Bonachea and Duell
  4/3/2004        PSC Petascale Methods                   7
Performance Difference Translates to Applications

                                          Apps on GM-based
                                           layer beat apps on
                                           MPI-based layer by
                                           ~ 20%
                                          The only difference is
                                           the network API!


                                          Machine:
                                          • Pentium 3 + Myrinet
                                          • NERSC Alvarez
                                            cluster

  4/3/2004        PSC Petascale Methods                   8
Performance Difference Translates to Applications




App on LAPI-based layer provides significantly better absolute
performance and scaling than same app on MPI-based layer
The only difference is the network API!
Machine: IBM SP, Seaborg at NERSC
   4/3/2004             PSC Petascale Methods             9
Productivity
• Productivity is hard to measure
    • # lines (or characters) is easy to measure
    • May not reflect programmability, but if the same
      algorithms are used, it can reveal some differences
• Fast fine-grained communication is useful
    • Incremental program development
    • Inherently fine-grained applications
    • Compare performance of these fine-grained versions




4/3/2004             PSC Petascale Methods            10
Productivity Study [El-Ghazawi et al., GWU]
                      SEQ*1      MPI    SEQ*2      UPC   MPI/SEQ (%)   UPC/SEQ (%)
  GUPS       #line       41       98       41       47       139.02        14.63
             #char     1063     2979     1063     1251       180.02        17.68
  Histogram  #line       12       30       12       20       150.00        66.67
             #char      188      705      188      376       275.00       100.00
  NAS-EP     #line      130      187      127      149        43.85        17.32
             #char     4741     6824     2868     3326        44.94        15.97
  NAS-FT     #line      704     1281      607      952        81.96        56.84
             #char    23662    44203    13775    20505        86.81        48.86
  N-Queens   #line       86      166       86      139        93.02        61.63
             #char     1555     3332     1555     2516       124.28        61.80
All the line counts are the number of real code lines (no comments, no blank lines)
*1: The sequential code is coded in C except for NAS-EP and FT which are coded in Fortran.
*2: The sequential code is always in C.


      4/3/2004                       PSC Petascale Methods                                   11
Fine-Grained Applications have Larger Spread

                                                    Machine:
                                                    • HP Alpha +
                                                      Quadrics, Lemieux
                                                    Benchmark:
                                                    • Naïve CG with
                                                      fine-grained
                                                      remote accesses
For comparison purposes
   • All versions scale poorly due to naïve algorithm, as expected
   • Absolute performance: Elan version is more than 4x faster!
   • Means more work for application programmers in MPI
   • Elan-based layer more suitable for:
        • incremental application development and fine-grained algorithms
   4/3/2004                 PSC Petascale Methods                   12
    A Brief Look at the Past
 • Conjugate Gradient dominated by sparse matrix-
   vector multiply

[Chart: Sparse Matrix-Vector Multiply on the T3E, Mflops vs. processors (1-32), comparing UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small]
• Same fine-grained version used on the previous slide
• Shows advantage of the T3E network model and UPC
• Will we get a machine like this again?
• Longer term, identify a large application
         4/3/2004                              PSC Petascale Methods                13
Goals of the Berkeley UPC Project
• Make UPC Ubiquitous
   • Parallel machines
   • Workstations and PCs for development
   • A portable compiler: for future machines too
• Research in compiler optimizations for parallel
  languages
• Demonstration of UPC on real applications
• Ongoing language development with the UPC
  Consortium

• Collaboration between LBNL and UCB
4/3/2004           PSC Petascale Methods        14
        Example: Berkeley UPC Compiler

[Diagram: compilation flow — UPC source → Higher WHIRL → optimizing transformations → Lower WHIRL → C + Runtime (or Assembly: IA64, MIPS, … + Runtime)]
• Compiler based on Open64
   • Multiple front-ends, including gcc
   • Intermediate form called WHIRL
• Current focus on C backend
   • IA64 possible in future
• UPC Runtime
   • Pointer representation
   • Shared/distributed memory
• Communication in GASNet
   • Portable
   • Language-independent
    4/3/2004                   PSC Petascale Methods                    15
    Portability Strategy in UPC Compiler
• Generation of C code from translator
• Layered approach to runtime (stack shown below)
   • Core GASNet API:
      • Most basic required primitives, as narrow and general as possible
      • Implemented directly on each platform
      • Based heavily on the active messages paradigm
   • Extended API:
      • Wider interface that includes more complicated operations
      • Reference implementation provided in terms of core
      • Implementers may tune for network
   • UPC Runtime:
      • pointer representation (specific to UPC, possibly to machine)
      • thread implementation
[Diagram: runtime layers, top to bottom — compiler-generated code, language-specific runtime, GASNet Extended API, GASNet Core API, network hardware]
    4/3/2004                 PSC Petascale Methods                     16
Portability of Berkeley UPC Compiler
• Make UPC Ubiquitous
   • Current and future parallel machines
   • Workstations and PCs for development
• Ports of Berkeley UPC Compiler
  • OS: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris,
    MSWindows(cygwin), MacOSX, Unicos, SuperUX
  • CPU: x86, Itanium, Alpha, PowerPC, PA-RISC
  • Supercomputers: Cray T3e, Cray X-1, IBM SP, NEC SX-6,
    Cluster X (Big Mac), SGI Altix 3000
• Recently added a net-compile option
   • Only install runtime system locally
• Runtime ported to Posix Threads (direct load/store)
   • Run on SGI Altix as well as SMPs
• GASNet tuned to vendor-supplied communication layer
   • Myrinet GM, Quadrics Elan, Mellanox Infiniband
     VAPI, IBM LAPI, Cray X1, Cray/SGI SHMEM
4/3/2004               PSC Petascale Methods                17
 Pointer-to-Shared: Phases
• UPC has three different kinds of pointers:
   • Block-cyclic:
        shared [4] double a [n];
   • Cyclic:
        shared double a [n];
   • Indefinite (always local):
        shared [0] double *a = (shared [0] double *) upc_alloc(n);
• A pointer needs a “phase” to keep track of its relative position
  within a block
   • Source of overhead for updating and dereferencing (the sketch
     below shows how to inspect it)
• Special case for “phaseless” pointers
   • Cyclic pointers always have phase 0
   • Indefinite blocked pointers only have one block
   • Don’t need to keep phase for cyclic and indefinite
   • Don’t need to update thread id for indefinite
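A minimal sketch (not from the slides) of the phase bookkeeping, using the standard upc_threadof and upc_phaseof queries; the array name, block size, and element index are arbitrary.

   #include <upc.h>
   #include <stdio.h>

   shared [4] double a[4*THREADS];           /* block-cyclic with block size 4 */

   int main(void) {
       shared [4] double *p = &a[6];         /* element 6 sits in block 1 (assuming THREADS >= 2) */
       if (MYTHREAD == 0)
           printf("a[6]: thread = %d, phase = %d\n",
                  (int) upc_threadof(p),     /* owning thread: (6/4) % THREADS */
                  (int) upc_phaseof(p));     /* position within its block: 6 % 4 = 2 */
       return 0;
   }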

  4/3/2004               PSC Petascale Methods                  18
              Accessing Shared Memory in UPC

[Diagram: shared memory partitioned across Thread 0 … Thread N-1; a pointer-to-shared is shown as the triple (address = addr, thread = 0, phase = 2), where the phase is the offset from the start of the current block of the array object]
   4/3/2004                  PSC Petascale Methods                19
 Pointer-to-Shared Representation
• Shared pointer representation trade-offs
   • Use of scalar types (long) rather than a struct may improve
     backend code quality
            • Faster pointer manipulation, e.g., ptr+int and dereferencing
      • Important in C, because array references are based on pointers
   • Pointer size is important to performance
            • Use of smaller types, 64 bits, rather than 128 bits may allow
              pointers to reside in a single register
            • But very large machines may require a longer pointer type
• Consider two different machines:
   • 2048-processor machine with 16 GB/processor → 128 bits
   • 64-processor machine with 2 GB/processor → 64 bits
      • 6 bits for thread, 31 bits of address, 27 bits for phase → 64 bits
        (one possible packing is sketched below)
• Portability and performance balance in UPC compiler
   • The pointer representation is hidden in the runtime layer
   • Can easily switch at compiler installation time
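For illustration only (this is an assumption, not the Berkeley runtime's actual encoding), a packed 64-bit pointer for the 64-processor case above might be laid out like this:

   #include <stdint.h>

   typedef uint64_t upc_shared_ptr_t;        /* fits in one register */

   #define THREAD_BITS 6
   #define PHASE_BITS  27
   #define ADDR_BITS   31

   /* Pack thread, phase, and local address into one 64-bit word. */
   static inline upc_shared_ptr_t pack(uint64_t thread, uint64_t phase, uint64_t addr) {
       return (thread << (PHASE_BITS + ADDR_BITS)) | (phase << ADDR_BITS) | addr;
   }
   static inline uint64_t thread_of(upc_shared_ptr_t p) { return p >> (PHASE_BITS + ADDR_BITS); }
   static inline uint64_t phase_of(upc_shared_ptr_t p)  { return (p >> ADDR_BITS) & ((1ULL << PHASE_BITS) - 1); }
   static inline uint64_t addr_of(upc_shared_ptr_t p)   { return p & ((1ULL << ADDR_BITS) - 1); }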
 4/3/2004                       PSC Petascale Methods                         20
   Performance of Shared Pointer Arithmetic
[Chart: cost of shared pointer operations in cycles (1 cycle = 1.5 ns) — ptr + int, ptr - ptr, and ptr equality, struct vs. packed representation — for generic, cyclic, indefinite, and regular pointer types]
• Phaseless pointers are an important optimization
   • Indefinite pointers almost as fast as regular C pointers
• Packing also helps, especially for pointer and int addition
    4/3/2004                                PSC Petascale Methods                            21
Comparison with HP UPC v1.7

[Chart: pointer-to-shared operation costs in cycles (1 cycle = 1.5 ns) — ptr + int, ptr - ptr, ptr == ptr — HP vs. Berkeley, for generic, cyclic, and indefinite pointer types]
• HP a little faster, due to it generating assembly code
• Gap for addition likely smaller with further optimizations

4/3/2004                                            PSC Petascale Methods                                 22
                     Cost of Shared Memory Access

[Charts: cost of shared local memory access (left) and shared remote access (right), in cycles, for HP and Berkeley UPC reads and writes]
• Local accesses somewhat slower than private accesses
• Remote accesses significantly worse, as expected
                      4/3/2004                                             PSC Petascale Methods                                                           23
Optimizing Explicitly Parallel Code
• Compiler optimizations for parallel languages
  • Enabled optimizations in Open64 base
  • Static analyses for parallel code
    • Problem is to understand when code motion is legal
      without changing views from other processors
    • Extended cycle detection to arrays with three
      different algorithms [LCPC '03]
  • Message strip-mining
    • Packing messages is good, but it can go too far
    • Use performance model to strip-mine messages into
      smaller chunks to optimize overlap [VECPAR '04]
   • Automatic message vectorization (packing)
      underway
4/3/2004          PSC Petascale Methods              24
         Performance Example
         • Performance of the Berkeley MG UPC code
         • HP (Lemieux, left) includes MPI comparison

[Charts: NAS Multigrid MFLOPS vs. processors — left: Alpha/Quadrics (Lemieux), MPI vs. UPC; right: SGI Altix, UPC]
         4/3/2004                         PSC Petascale Methods                                       25
Berkeley UPC on the X1

[Chart: Livermore Loops, performance ratio of original C to UPC for kernels 1-24; ratios are near 1 except one kernel marked 48x]
• Translator generated C code usually vectorizes as
   well as original C code
• Source-to-source translation a reasonable strategy
• Work needed for 3D arrays
   4/3/2004                  PSC Petascale Methods                  26
                                            GASNet/X1 Performance
[Charts: GASNet/X1 put per-message gap (left) and get latency (right), in µsec, vs. message size (1-2048 bytes), comparing Shmem, GASNet, and MPI for RMW, scalar, vector, and bcopy() transfers]

                            • GASNet/X1 improves small message performance over
                              shmem and MPI
                            • GASNet/X1 communication can be integrated seamlessly
                              into long computation loops and is vectorizable
                            • GASNet/X1 operates directly on global pointers
                                                4/3/2004                                                   PSC Petascale Methods                                                                                          27
NAS CG: OpenMP style vs. MPI style

[Chart: NAS CG performance, MFLOPS per thread vs. threads (2-12, SSP mode, two nodes), comparing UPC (OpenMP style), UPC (MPI style), and MPI Fortran]

• GAS language outperforms MPI+Fortran (flat is good!)
• Fine-grained (OpenMP style) version still slower
    • shared memory programming style leads to more
    overhead (redundant boundary computation)
• GAS languages can support both programming styles
 4/3/2004                                PSC Petascale Methods                        28
EP on Alpha/Quadrics (GWU Bench)

[Chart: NAS EP, class A — Mops/second vs. number of threads (up to ~16), HP UPC vs. Berkeley UPC]

4/3/2004                       PSC Petascale Methods             29
IS on Alpha/Quadrics (GWU Bench)


[Chart: NAS IS, class B — Mops/second vs. number of threads (up to ~8), HP UPC vs. Berkeley UPC]


  4/3/2004                  PSC Petascale Methods        30
MG on Alpha/Quadrics (Berkeley version)

[Chart: NAS MG, class B — MFlop/s vs. processors (up to ~32), comparing three Berkeley UPC versions (berkeley, berkeley2, berkeley3) and F77+MPI]



4/3/2004                  PSC Petascale Methods              31
   Multigrid on Cray X1
[Chart: NAS Multigrid on the Cray X1 — GFLOPS vs. number of SSPs (1-32, 1 MSP == 4 SSPs), comparing UPC MSP, MPI Fortran MSP, UPC SSP, and MPI Fortran SSP]

• Performance similar to MPI
• Cray C does not automatically vectorize/multistream this code (pragmas were added)
• 4 SSPs are slightly better than 1 MSP; 2 MSPs are much better than 8 SSPs (cache
  conflicts caused by the layout of private data)
   4/3/2004                   PSC Petascale Methods                   32
Integer Sort
[Chart: NAS IS performance — MFLOPS vs. threads (1-8), Berkeley UPC vs. MPI C]


• Benchmark written in bulk synchronous style
• Performance is similar to MPI
• Code does not vectorize – even the best performer is much
  slower than on a cache-based superscalar architecture
 4/3/2004                      PSC Petascale Methods                 33
 Fine-grained Irregular Accesses – UPC GUPS

[Chart: GUPS performance — million updates/second vs. threads (1-4), comparing Cray UPC, Berkeley UPC, and Berkeley UPC with scatter/gather]



• Hard to control vectorization of fine-grained accesses
    • temporary variables, casts, etc.
• Communication libraries may help
  4/3/2004               PSC Petascale Methods                        34
Recent Progress on Applications
• Application demonstration of UPC
    • NAS PB-size problems
           • Berkeley NAS MG avoids most global barriers and relies
             on UPC relaxed memory model
           • Berkeley NAS CG has several versions, including simpler,
             fine-grained communication
    • Algorithms that are challenging in MPI
           • 2D Delaunay Triangulation [SIAM PP '04]
           • AMR in UPC: Chombo (non-adaptive) Poisson solver




4/3/2004                    PSC Petascale Methods                 35
Progress in Language
• Group is active in UPC Consortium meetings, mailing
  list, SC booth, etc.
• Recent language level work:
    • Specification of UPC memory model in progress
      • Joint with MTU
      • Behavioral spec [Dagstuhl03]
  • UPC IO nearly finalized
      • Joint with GWU and ANL
  • UPC Collectives V 1.0 finalized
      • Effort led by MTU
  • Improvements/updates to UPC Language Spec
      • Led by IDA



4/3/2004                    PSC Petascale Methods       36
Center Overview
• Broad collaboration between three groups:
   • Library efforts: MPI, ARMCI, GA, OpenMP
   • Language efforts: UPC, CAF, Titanium
   • New model investigations: multi-threading, memory
     consistency models
• Led by Rusty Lusk at ANL
• Major focus is common runtime system
   • GASNet for UPC, Titanium and (soon) CAF
• Also common compiler
   • CAF, UPC, and OpenMP work based on Open64




4/3/2004             PSC Petascale Methods               37
 Progress on UPC Runtime

• Cross-language support: Berkeley UPC and MPI
   • Calling MPI from UPC
   • Calling UPC from MPI
• Runtime for gcc-based UPC compiler by Intrepid
• Interface UPC compiler to parallel collectives libraries (end of
  FY04)
   • Reference implementation just released by HP/MTU
• Thread version of the Berkeley UPC runtime layer
   • Evaluating performance on hybrid GASNet systems




  4/3/2004                  PSC Petascale Methods                    38
 Progress on GASNet

• GASNet: Myrinet GM, Quadrics Elan-3, IBM LAPI, UDP, MPI,
  Infiniband
• Ongoing: SCI (with UFL), Cray X1/SGI Shmem, and reviewing
  future Myrinet and the latest Elan-4
• Extension to GASNet to support strided and scatter/gather
  communication
   • Also proposed support for UPC bulk copies
• Analysis of MPI one-sided for GAS languages
   • Problems with synchronization model
• Multiple protocols for managing “pinned” memory in Direct
  Memory Addressing systems [CAC ’03]
   • Depends on language usage as well as network architecture



  4/3/2004               PSC Petascale Methods                   39
Future Plans
• Architecture-specific GASNet for scatter-gather and
  strided hardware support.
   • Need for CAF and for UPC with message vectorization
• Optimized collective communication library
   • Spec agreed on in 2003
   • New reference implementation
   • Developing GASNet extension for building optimized
     collectives
• Application- and architecture- driven optimization
• Interface to the UPC I/O library
• Evaluate GASNet on machines with non-cache coherent
  shared memory
   • BlueGene/L and NEC SX6
4/3/2004            PSC Petascale Methods            40
Try It Out
• Download from the Berkeley UPC web page
    • http://upc.lbl.gov
• May just get runtime system (includes GASNet)
    • Netcompile is default
    • Runtime is easier to install
• New release planned for this summer
    • Not quite open development model
    • We “publicize” a “latest stable version” that is not
      fully tested
• Let us know what happens (good and bad)
    • Mail upc@lbl.gov


4/3/2004               PSC Petascale Methods                 41
     UPC Outline

1. Background and Philosophy          8. Synchronization
2. UPC Execution Model                9. Performance Tuning
3. UPC Memory Model                       and Early Results
4. Data and Pointers                  10. Concluding Remarks
5. Dynamic Memory
   Management
6. Programming Examples




     4/3/2004        PSC Petascale Methods                 42
   Context

• Most parallel programs are written using either:
    • Message passing with a SPMD model
        • Usually for scientific applications with C++/Fortran
        • Scales easily
    • Shared memory with threads in OpenMP,
      Threads+C/C++/F or Java
        • Usually for non-scientific applications
        • Easier to program, but less scalable performance
• Global Address Space (GAS) Languages take the best of both
    • global address space like threads (programmability)
    • SPMD parallelism like MPI (performance)
    • local/global distinction, i.e., layout matters (performance)
   4/3/2004              PSC Petascale Methods              43
    Partitioned Global Address Space
•   Explicitly-parallel programming model with SPMD parallelism
     • Fixed at program start-up, typically 1 thread per processor
•   Global address space model of memory
     • Allows programmer to directly represent distributed data
       structures
•   Address space is logically partitioned
     • Local vs. remote memory (two-level hierarchy)
•   Programmer control over performance critical decisions
     • Data layout and communication
•   Performance transparency and tunability are goals
     • Initial implementation can use fine-grained shared memory
•   Base languages differ: UPC (C), CAF (Fortran), Titanium
    (Java)
    4/3/2004             PSC Petascale Methods             44
     Global Address Space Eases
     Programming
[Diagram: global address space across Thread 0 … Thread n — each thread owns a shared partition (holding X[0], X[1], …, X[P]) plus a private area; each thread's private ptr: can point into the shared space]
• The languages share the global address space abstraction
   • Shared memory is partitioned by processors
   • Remote memory may stay remote: no automatic caching implied
   • One-sided communication through reads/writes of shared
      variables
   • Both individual and bulk memory copies
• Differ on details
   • Some models have a separate private memory area
   • Distributed array generality and how they are constructed

   4/3/2004                               PSC Petascale Methods                        45
One-Sided Communication Is Sometimes Faster

[Chart: same latency/overhead comparison as slide 5 — µsec cost of added latency, send overhead, and receive overhead across platforms and APIs (T3E, IBM, Quadrics, Myrinet, GigE, with native layers vs. MPI)]
• Potential performance advantage for fine-grained, one-sided programs
• Potential productivity advantage for irregular applications
   4/3/2004                    PSC Petascale Methods                                   46
  Current Implementations

• A successful language/library must run everywhere
• UPC
   • Commercial compilers available on Cray, SGI, HP machines
   • Open source compiler from LBNL/UCB (and another from MTU)
• CAF
   • Commercial compiler available on Cray machines
   • Open source compiler available from Rice
• Titanium (Friday)
   • Open source compiler from UCB runs on most machines
• Common tools
   • Open64 open source research compiler infrastructure
   • ARMCI, GASNet for distributed memory implementations
   • Pthreads, System V shared memory

  4/3/2004              PSC Petascale Methods               47
UPC Overview and Design Philosophy
• Unified Parallel C (UPC) is:
   • An explicit parallel extension of ANSI C
   • A partitioned global address space language
   • Sometimes called a GAS language
• Similar to the C language philosophy
   • Programmers are clever and careful, and may
     need to get close to hardware
           • to get performance, but
           • can get in trouble
   • Concise and efficient syntax
• Common and familiar syntax and semantics for
  parallel C with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP
4/3/2004                   PSC Petascale Methods   48
           UPC Execution
              Model




4/3/2004     PSC Petascale Methods   49
 UPC Execution Model
• A number of threads working independently in a SPMD
  fashion
   • Number of threads specified at compile-time or run-time;
     available as program variable THREADS
   • MYTHREAD specifies thread index (0..THREADS-1)
   • upc_barrier is a global synchronization: all wait
   • There is a form of parallel loop that we will see later
• There are two compilation modes
   • Static Threads mode:
            • Threads is specified at compile time by the user
             • The program may use THREADS as a compile-time constant
    • Dynamic threads mode:
            • Compiled code may be run with varying numbers of threads
 4/3/2004                    PSC Petascale Methods                 50
Hello World in UPC
• Any legal C program is also a legal UPC program
• If you compile and run it as UPC with P threads, it will
  run P copies of the program.
• Using this fact, plus the identifiers from the previous
  slides, we can write a parallel hello world:

#include <upc.h> /* needed for UPC extensions */
#include <stdio.h>

main() {
  printf("Thread %d of %d: hello UPC world\n",
         MYTHREAD, THREADS);
}


4/3/2004              PSC Petascale Methods                  51
Example: Monte Carlo Pi Calculation
• Estimate Pi by throwing darts at a unit square
• Calculate percentage that fall in the unit circle
      • Area of square = r² = 1
      • Area of circle quadrant = ¼ π r² = π/4
• Randomly throw darts at x,y positions
• If x² + y² < 1, then point is inside circle
• Compute ratio:
      • # points inside / # points total
      • π ≈ 4*ratio


[Diagram: quarter circle of radius r = 1 inside a unit square]


4/3/2004               PSC Petascale Methods                 52
Pi in UPC
• Independent estimates of pi:
  main(int argc, char **argv) {
    int i, hits = 0, trials = 0;            /* each thread gets its own copy of these variables */
    double pi;

    if (argc != 2) trials = 1000000;        /* each thread can use the input arguments */
    else trials = atoi(argv[1]);

    srand(MYTHREAD*17);                     /* initialize random in math library */

    for (i=0; i < trials; i++) hits += hit();   /* each thread calls "hit" separately */
    pi = 4.0*hits/trials;
    printf("PI estimated to %f.", pi);
  }
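The hit() helper is not shown on the slide; a minimal sketch of one possible definition (an assumption, using rand() for simplicity) is:

   #include <stdlib.h>

   /* Throw one dart at the unit square; return 1 if it lands inside the
      quarter circle of radius 1, else 0. */
   int hit(void) {
       double x = (double) rand() / RAND_MAX;
       double y = (double) rand() / RAND_MAX;
       return (x*x + y*y <= 1.0);
   }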
4/3/2004             PSC Petascale Methods                       53
               UPC Memory
                 Model
           • Scalar Variables
           • Distributed Arrays
           • Pointers to shared data




4/3/2004           PSC Petascale Methods   55
Private vs. Shared Variables in UPC
• Normal C variables and objects are allocated in the
  private memory space for each thread.
• Shared variables are allocated only once, with thread 0
                  shared int ours;
                  int mine;
• Simple shared variables of this kind may not occur
  within a function definition

[Diagram: global address space — ours: sits in the shared space (with affinity to thread 0); each of Thread 0 … Thread n has its own private mine:]
4/3/2004                            PSC Petascale Methods                       56
  Pi in UPC (Cooperative Version)
• Parallel computing of pi, but with a race condition
  shared int hits;                          /* shared variable to record hits */
  main(int argc, char **argv) {
      int i, my_trials = 0;
      int trials = atoi(argv[1]);
      my_trials = (trials + THREADS - 1
                   - MYTHREAD)/THREADS;     /* divide work up evenly */
      srand(MYTHREAD*17);
      for (i=0; i < my_trials; i++)
        hits += hit();                      /* accumulate hits (the race) */
      upc_barrier;
      if (MYTHREAD == 0) {
        printf("PI estimated to %f.", 4.0*hits/trials);
      }
   }
  4/3/2004            PSC Petascale Methods             57
Pi in UPC (Cooperative Version)
• The race condition can be fixed in several ways:
    • Add a lock around the hits increment (later; a sketch follows this slide)
    • Have each thread update a separate counter:
          • Have one thread compute sum
          • Use a “collective” to compute sum (recently added to UPC)
 shared int all_hits [THREADS];             /* all_hits is shared by all
                                               processors, just as hits was */
 main(int argc, char **argv) {
    … declarations and initialization code omitted
    for (i=0; i < my_trials; i++)
       all_hits[MYTHREAD] += hit();
    upc_barrier;                            /* Where does it live? */
    if (MYTHREAD == 0) {
       for (i=0; i < THREADS; i++) hits += all_hits[i];
       printf("PI estimated to %f.", 4.0*hits/trials);
    }
 }
 4/3/2004                 PSC Petascale Methods                  58
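A minimal sketch (not from the slides) of the lock-based fix mentioned above, using the standard UPC lock API (upc_all_lock_alloc, upc_lock, upc_unlock); each thread accumulates locally and updates the shared counter once:

   #include <upc.h>
   #include <stdio.h>
   #include <stdlib.h>

   shared int hits;
   upc_lock_t *hit_lock;                       /* same lock object on every thread */

   int main(int argc, char **argv) {
       int i, my_hits = 0;
       int trials = atoi(argv[1]);
       int my_trials = (trials + THREADS - 1 - MYTHREAD) / THREADS;

       hit_lock = upc_all_lock_alloc();        /* collective: all threads share one lock */
       srand(MYTHREAD*17);
       for (i = 0; i < my_trials; i++) my_hits += hit();

       upc_lock(hit_lock);                     /* protect the shared accumulation */
       hits += my_hits;
       upc_unlock(hit_lock);

       upc_barrier;
       if (MYTHREAD == 0)
           printf("PI estimated to %f.", 4.0*hits/trials);
       return 0;
   }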
    Shared Arrays Are Cyclic By Default
• Shared array elements are spread across the threads
    shared int x[THREADS];     /* 1 element per thread */
    shared int y[3][THREADS];  /* 3 elements per thread */
    shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
• In the pictures below
    • Assume THREADS = 4
    • Elements with affinity to processor 0 are red

[Diagram: element layouts of x, y, and z across the 4 threads, with the elements that have affinity to thread 0 shown in red; as a 2D array, this is logically blocked by columns]
4/3/2004              PSC Petascale Methods                   59
 Example: Vector Addition
• Questions about parallel vector additions:
  • How to layout data (here it is cyclic)
  • Which processor does what (here it is “owner computes”)

     /* vadd.c */
     #include <upc_relaxed.h>
     #define N 100*THREADS

     shared int v1[N], v2[N], sum[N];      /* cyclic layout */
     void main() {
         int i;
         for(i=0; i<N; i++)                /* owner computes */
               if (MYTHREAD == i%THREADS)
                     sum[i]=v1[i]+v2[i];
     }

  4/3/2004             PSC Petascale Methods                   60
 Vector Addition with upc_forall
• The loop in vadd is common, so there is upc_forall:
   • 4th argument is int expression that gives “affinity”
  • Iteration executes when:
       • affinity%THREADS is MYTHREAD
      /* vadd.c */
      #include <upc_relaxed.h>
      #define N 100*THREADS

      shared int v1[N], v2[N], sum[N];

      void main() {
          int i;
          upc_forall(i=0; i<N; i++; i)
                     sum[i]=v1[i]+v2[i];
      }
  4/3/2004               PSC Petascale Methods              61
Work Sharing with upc_forall()
• Iterations are independent
• Each thread gets a bunch of iterations
• Simple C-like syntax and semantics
   upc_forall(init; test; loop; affinity)
      statement;
• Affinity field to distribute the work (see the sketch below)
   • Cyclic (round robin) distribution
   • Blocked (chunks of iterations) distribution
• Semantics are undefined if there are dependencies
  between iterations executed by different threads
   • Programmer has indicated iterations are
      independent
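A small sketch (not from the slides) of the two common affinity forms, using the same vector-add arrays as before:

   int i;
   /* Integer affinity: iteration i runs on thread (i % THREADS) -- cyclic. */
   upc_forall (i=0; i<N; i++; i)
       sum[i] = v1[i] + v2[i];

   /* Pointer affinity: iteration i runs on the thread with affinity to
      sum[i], whatever layout sum was declared with -- "owner computes". */
   upc_forall (i=0; i<N; i++; &sum[i])
       sum[i] = v1[i] + v2[i];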

4/3/2004           PSC Petascale Methods              62
 UPC Matrix Vector Multiplication Code
• Here is one possible matrix-vector multiplication
  #include <upc_relaxed.h>

  shared int a[THREADS][THREADS];
  shared int b[THREADS], c[THREADS];

  void main (void) {
        int i, j , l;

             upc_forall( i = 0 ; i < THREADS ; i++; i) {
                   c[i] = 0;
                    for ( l= 0 ; l < THREADS ; l++)
                         c[i] += a[i][l]*b[l];
             }
  }


  4/3/2004               PSC Petascale Methods         63
Data Distribution

[Diagram: under the default cyclic layout of the previous code, thread j owns column j of A, while B and C are spread element-wise across Threads 0-2; A * B = C]



4/3/2004                                  PSC Petascale Methods           64
A Better Data Distribution

[Diagram: with the better layout, thread i owns row i of A and element i of B and of C, for Threads 0-2; A * B = C]



4/3/2004                               PSC Petascale Methods           65
Layouts in General
• All non-array shared variables have affinity with thread zero.
• Array layouts are controlled by layout specifiers:
              shared [b] double x [n];
     • Groups of b elements are wrapped around
     • Empty: cyclic layout of data in 1D view
     • layout_specifier [ integer_expression ]
• The affinity of an array element is defined in terms of the
  block size, a compile-time constant, and THREADS, a
  runtime constant.
• Element i has affinity with thread
    (i / block_size) % THREADS.
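A worked example (assuming THREADS == 4) of the formula above:

    shared [3] double x[10];   /* block size 3 */
    /* x[7] is in block 7/3 = 2, so its affinity is (7/3) % THREADS = 2 % 4 = 2,
       and its phase within the block is 7 % 3 = 1. */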


4/3/2004              PSC Petascale Methods              66
Layout Terminology
• Notation is HPF, but terminology is language-independent
   • Assume there are 4 processors




[Diagram: six 2D layouts over 4 processors — (Block, *), (*, Block), (Block, Block), (Cyclic, *), (Cyclic, Cyclic), (Cyclic, Block)]

4/3/2004                 PSC Petascale Methods                     67
2D Array Layouts in UPC
• Array a1 has a row layout and array a2 has a block row
  layout.
           shared [m] int a1 [n][m];
           shared [k*m] int a2 [n][m];


• If (k + m) % THREADS == 0 then a3 has a row layout
     shared int a3 [n][m+k];
• To get more general HPF and ScaLAPACK style 2D
  blocked layouts, one needs to add dimensions.
• Assume r*c = THREADS;
   shared [b1][b2] int a5 [m][n][r][c][b1][b2];
• or equivalently
    shared [b1*b2] int a5 [m][n][r][c][b1][b2];
4/3/2004                PSC Petascale Methods         68
 UPC Matrix Vector Multiplication Code
• Matrix-vector multiplication with better layout
   #include <upc_relaxed.h>

   shared [THREADS] int a[THREADS][THREADS];
   shared int b[THREADS], c[THREADS];

   void main (void) {
         int i, j , l;

             upc_forall( i = 0 ; i < THREADS ; i++; i)
   {
                   c[i] = 0;
                   for ( l= 0 ; l < THREADS ; l++)
                         c[i] += a[i][l]*b[l];
             }
   }

  4/3/2004              PSC Petascale Methods            69
Example: Matrix Multiplication in UPC


• Given two integer matrices A(NxP) and B(PxM)
• Compute C =A x B.
• Entries C_ij in C are computed by the formula:

              C_ij = Σ (l = 1 to P)  A_il * B_lj




4/3/2004              PSC Petascale Methods   70
  Matrix Multiply in C
#include <stdlib.h>
#include <time.h>

#define N 4
#define P 4
#define M 4

int a[N][P], c[N][M];
int b[P][M];

void main (void) {
  int i, j , l;
  for (i = 0 ; i<N ; i++) {
   for (j=0 ; j<M ;j++) {
       c[i][j] = 0;
       for (l = 0 ; l<P ; l++) c[i][j] += a[i][l]*b[l][j];
   }
  }
}

  4/3/2004              PSC Petascale Methods           71
        Domain Decomposition for UPC
• Exploits locality in matrix multiplication
• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS as shown below:
      • Thread 0: elements 0 .. (N*P / THREADS) - 1
      • Thread 1: elements (N*P / THREADS) .. (2*N*P / THREADS) - 1
      • …
      • Thread THREADS-1: elements ((THREADS-1)*N*P) / THREADS .. (THREADS*N*P / THREADS) - 1
• B (P × M) is decomposed column-wise into M / THREADS blocks as shown below:
      • Thread 0: columns 0 .. (M/THREADS) - 1
      • …
      • Thread THREADS-1: columns ((THREADS-1)*M)/THREADS .. M - 1
• Note: N and M are assumed to be multiples of THREADS

        4/3/2004                                    PSC Petascale Methods                                 72
UPC Matrix Multiplication Code
/* mat_mult_1.c */
#include <upc_relaxed.h>

#define N 4
#define P 4
#define M 4

shared [N*P /THREADS] int a[N][P], c[N][M];
// a and c are row-wise blocked shared matrices

shared[M/THREADS] int b[P][M]; //column-wise blocking

void main (void) {
          int i, j , l; // private variables

           upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
                     for (j=0 ; j<M ;j++) {
                                 c[i][j] = 0;
                                 for (l= 0 ; lP ; l++) c[i][j] += a[i][l]*b[l][j];
                     }
           }
}
4/3/2004                            PSC Petascale Methods                             73
Notes on the Matrix Multiplication
Example
• The UPC code for the matrix multiplication is almost
  the same size as the sequential code
• Shared variable declarations include the keyword
  shared
• Making a private copy of matrix B in each thread
  might result in better performance since many remote
  memory operations can be avoided
• Can be done with the help of upc_memget




4/3/2004            PSC Petascale Methods            74
 Pointers to Shared vs. Arrays
• In the C tradition, arrays can be accessed through pointers
• Here is the vector addition example using pointers

#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], sum[N];
void main() {
  int i;
  shared int *p1, *p2;
  p1 = v1; p2 = v2;
  for (i=0; i<N; i++, p1++, p2++ )
     if (i % THREADS == MYTHREAD)
            sum[i] = *p1 + *p2;
}
  4/3/2004             PSC Petascale Methods                 75
 UPC Pointers
                              Where does the pointer reside?
                                  Private           Shared
    Where does       Private      PP (p1)           PS (p3)
    it point?        Shared       SP (p2)           SS (p4)


int *p1;             /* private pointer to local memory */
shared int *p2; /* private pointer to shared space */
int *shared p3; /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to
                                 shared space */
Shared to private is not recommended.

 4/3/2004            PSC Petascale Methods                76
  UPC Pointers

    address space   Thread0 Thread1                        Threadn
                     p3:     p3:                           p3:
                     p4:     p4:                           p4:       Shared
       Global




                    p1:      p1:                           p1:
                    p2:      p2:                           p2:       Private


 int *p1;            /* private pointer to local memory */
 shared int *p2; /* private pointer to shared space */
 int *shared p3; /* shared pointer to local memory */
 shared int *shared p4; /* shared pointer to
                                   shared space */
Pointers to shared often require more storage and are more costly to
dereference; they may refer to local or remote memory.
  4/3/2004                         PSC Petascale Methods                       77
Common Uses for UPC Pointer Types
int *p1;
• These pointers are fast
• Use to access private data in part of code performing local
  work
• Often cast a pointer-to-shared to one of these to get faster
  access to shared data that is local
shared int *p2;
• Use to refer to remote data
• Larger and slower due to test-for-local + possible
  communication
int *shared p3;
• Not recommended
shared int *shared p4;
• Use to build shared linked structures, e.g., a linked list
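
• A minimal sketch of the four pointer kinds in use (the array, pointer, and function names here are illustrative only):

      #include <upc_relaxed.h>
      #define N (100*THREADS)

      shared int data[N];        /* default cyclic layout: data[i] lives on thread i % THREADS */
      shared int *shared head;   /* p4-style: e.g., the head of a shared linked structure      */

      void local_update(void) {
          shared int *sp = &data[0];          /* p2-style: can reference any thread's data      */
          int *lp = (int *)&data[MYTHREAD];   /* p1-style: cast is valid because data[MYTHREAD] */
          *lp = MYTHREAD;                     /* has affinity with the calling thread           */
          (void)sp; (void)head;               /* silence unused-variable warnings               */
      }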

4/3/2004                PSC Petascale Methods                    78
UPC Pointers


 • In UPC pointers to shared objects have three fields:
    • thread number
    • local address of block
    • phase (specifies position in the block)
           Virtual Address           Thread          Phase

 • Example: Cray T3E implementation

    Phase (bits 63..49)  |  Thread (bits 48..38)  |  Virtual Address (bits 37..0)
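
• As a rough sketch only (no claim about the exact Cray encoding), such a pointer could be packed into a 64-bit word like this:

      #include <stdint.h>

      /* Hypothetical packing that matches the field widths above:
         15-bit phase, 11-bit thread, 38-bit virtual address. */
      static inline uint64_t pack_shared_ptr(uint64_t phase, uint64_t thread, uint64_t addr) {
          return ((phase  & 0x7FFFULL) << 49) |     /* bits 63..49 */
                 ((thread & 0x7FFULL)  << 38) |     /* bits 48..38 */
                 (addr & ((1ULL << 38) - 1));       /* bits 37..0  */
      }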


4/3/2004                     PSC Petascale Methods               79
UPC Pointers

• Pointer arithmetic supports blocked and non-blocked
  array distributions
• Casting of shared to private pointers is allowed but
  not vice versa !
• When casting a pointer to shared to a private pointer,
  the thread number of the pointer to shared may be
  lost
• Casting of shared to private is well defined only if the
  object pointed to by the pointer to shared has affinity
  with the thread performing the cast



4/3/2004             PSC Petascale Methods              80
Special Functions


• size_t upc_threadof(shared void *ptr);
  returns the thread number that has affinity to the pointer
  to shared
• size_t upc_phaseof(shared void *ptr);
  returns the index (position within the block) field of the
  pointer to shared
• size_t upc_addrfield(shared void *ptr);
  returns the address of the block which is pointed at by
  the pointer to shared
• shared void *upc_resetphase(shared void *ptr); resets
  the phase to zero
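
• A small sketch (hypothetical array and helper function) combining these queries with a cast to a private pointer:

      #include <upc_relaxed.h>
      #include <stdio.h>

      shared [4] int a[4*THREADS];

      void inspect(shared int *p) {
          printf("thread=%lu phase=%lu addr=%lu\n",
                 (unsigned long)upc_threadof(p),
                 (unsigned long)upc_phaseof(p),
                 (unsigned long)upc_addrfield(p));

          if (upc_threadof(p) == MYTHREAD) {  /* cast only when the target is local */
              int *lp = (int *)p;
              *lp += 1;
          }
      }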

4/3/2004             PSC Petascale Methods              81
Synchronization

 • No implicit synchronization among the threads
 • UPC provides many synchronization
   mechanisms:
   • Barriers (Blocking)
           • upc_barrier
       • Split-Phase Barriers (Non-blocking) - see the sketch below
            • upc_notify
            • upc_wait
       • An optional integer expression may be given to barrier statements
       • Locks
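
• A minimal sketch of the split-phase barrier (the shared array and the local work are placeholders):

      #include <upc_relaxed.h>

      shared int flags[THREADS];

      void exchange(void) {
          flags[MYTHREAD] = 1;   /* publish my contribution             */
          upc_notify;            /* signal arrival without blocking     */

          /* ... purely local work here overlaps with other threads ... */

          upc_wait;              /* complete the barrier                */
      }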

4/3/2004                   PSC Petascale Methods   82
Synchronization - Locks

 • In UPC, shared data can be protected against
   multiple writers :
    • void upc_lock(upc_lock_t *l)
    • int upc_lock_attempt(upc_lock_t *l) //returns 1 on
       success and 0 on failure
    • void upc_unlock(upc_lock_t *l)
 • Locks can be allocated dynamically. Dynamically
   allocated locks can be freed
 • Dynamically allocated locks are properly initialized; statically
   allocated locks need explicit initialization
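
• A minimal sketch (names are illustrative) of a dynamically allocated lock together with the non-blocking upc_lock_attempt:

      #include <upc_relaxed.h>

      shared int counter;
      upc_lock_t *counter_lock;      /* same lock on every thread: set by the collective call */

      void init(void) {
          counter_lock = upc_all_lock_alloc();   /* called by all threads */
      }

      void add(int delta) {
          while (!upc_lock_attempt(counter_lock)) {
              /* ... do useful local work instead of spinning idly ... */
          }
          counter += delta;
          upc_unlock(counter_lock);
      }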



4/3/2004             PSC Petascale Methods              83
  Pi Example, Corrected Version
• Parallel computation of pi, with the shared counter update protected by a lock
  shared int hits;
                                                  all threads collectively
  main(int argc, char **argv) {
                                                  allocate lock
      int i, my_hits = 0;
      upc_lock_t *hit_lock = upc_all_lock_alloc();
      ...initialization of trials, my_trials, srand code omitted
      for (i=0; i < my_trials; i++)
          my_hits += hit();
      upc_lock(hit_lock);
      hits += my_hits;                        update in critical
      upc_unlock(hit_lock);                   region
      upc_barrier;
      if (MYTHREAD == 0) {
          printf("PI estimated to %f.", 4.0*hits/trials);
      }
      upc_lock_free(hit_lock);
   }
  4/3/2004                 PSC Petascale Methods                    84
Memory Consistency in UPC
• The consistency model for shared memory accesses is
  controlled by designating accesses as strict, relaxed, or
  unqualified (the default).

• There are several ways of designating the ordering type.

• A type qualifier, strict or relaxed can be used to affect all
  variables of that type.

• Labels strict or relaxed can be used to control the
  accesses within a statement.
•     strict : { x = y ; z = y+1; }

• A strict or relaxed cast can be used to override the
  current label or type qualifier.
4/3/2004               PSC Petascale Methods                85
Synchronization- Fence


• UPC provides a fence construct
   • Equivalent to a null strict reference, and has the
     syntax
           • upc_fence;
     • UPC ensures that all shared references issued
       before the upc_fence are complete




4/3/2004                  PSC Petascale Methods           86
 Matrix Multiplication with Blocked
 Matrices
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P], c[N][M];

shared [M/THREADS] int b[P][M];
int b_local[P][M];

void main (void) {
       int i, j , l; // private variables

         upc_memget(b_local, b, P*M*sizeof(int));

       upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
              for (j=0 ; j<M ;j++) {
                     c[i][j] = 0;
                     for (l= 0 ; lP ; l++) c[i][j] +=
a[i][l]*b_local[l][j];
              }
       }
}

  4/3/2004               PSC Petascale Methods           87
   Shared and Private Data

Assume THREADS = 4
shared [3] int A[4][THREADS];
will result in the following data layout:

  Thread 0        Thread 1           Thread 2   Thread 3
   A[0][0]        A[0][3]             A[1][2]   A[2][1]
   A[0][1]        A[1][0]             A[1][3]   A[2][2]
   A[0][2]        A[1][1]             A[2][0]   A[2][3]
   A[3][0]        A[3][3]
   A[3][1]
   A[3][2]


4/3/2004               PSC Petascale Methods               88
   UPC Pointers



[Figure: X[0]..X[15] in the default cyclic layout over Threads 0-3, with the targets of dp, dp+1, ..., dp+9 and of dp1 marked to illustrate pointer-to-shared arithmetic]



    4/3/2004                     PSC Petascale Methods                         89
  UPC Pointers


[Figure: X[0]..X[15] in a blocked layout (block size 3) over Threads 0-3, with the targets of dp, dp+1, ..., dp+9 and of dp1 marked to show how pointer arithmetic follows the block layout]




  4/3/2004                         PSC Petascale Methods                     90
Bulk Copy Operations in UPC
• UPC provides standard library functions to move data
  to/from shared memory
• Can be used to move chunks in the shared space or
  between shared and private spaces
• Equivalent of memcpy :
   • upc_memcpy(dst, src, size) : copy from shared to
      shared
   • upc_memput(dst, src, size) : copy from private to
      shared
   • upc_memget(dst, src, size) : copy from shared to
      private
• Equivalent of memset:
   • upc_memset(dst, char, size) : initialize shared
      memory with a character
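
• A minimal sketch (block size and names chosen for illustration) of moving one thread's block between shared and private space:

      #include <upc_relaxed.h>
      #define B 100                      /* elements owned per thread */

      shared [B] int table[B*THREADS];   /* one contiguous block of B ints per thread */
      int scratch[B];                    /* private buffer */

      void refresh_my_block(void) {
          /* shared -> private */
          upc_memget(scratch, &table[MYTHREAD * B], B * sizeof(int));

          /* ... work on scratch with ordinary C code ... */

          /* private -> shared */
          upc_memput(&table[MYTHREAD * B], scratch, B * sizeof(int));
      }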
4/3/2004            PSC Petascale Methods            91
Worksharing with upc_forall
• Distributes independent loop iterations across threads as the programmer
  specifies, typically to boost locality exploitation
• Simple C-like syntax and semantics
    upc_forall(init; test; loop; expression)
     statement
• Expression could be an integer expression or a reference to
  (address of) a shared object




4/3/2004                  PSC Petascale Methods                     92
  Work Sharing: upc_forall()
• Example 1: Exploiting locality
  shared int a[100],b[100], c[101];
  int i;
  upc_forall (i=0; i<100; i++; &a[i])
       a[i] = b[i] * c[i+1];
• Example 2: distribution in a round-robin fashion
  shared int a[100],b[100], c[101];
  int i;
  upc_forall (i=0; i<100; i++; i)
       a[i] = b[i] * c[i+1];

  Note: Examples 1 and 2 happen to result in the same distribution



  4/3/2004                 PSC Petascale Methods              93
           Work Sharing: upc_forall()

• Example 3: distribution by chunks
     shared int a[100],b[100], c[101];
     int i;
     upc_forall (i=0; i<100; i++; (i*THREADS)/100)
          a[i] = b[i] * c[i+1];


      i                 i*THREADS               i*THREADS/100      (assuming THREADS = 4)
      0..24             0..96                   0
      25..49            100..196                1
      50..74            200..296                2
      75..99            300..396                3

4/3/2004                PSC Petascale Methods                   94
     UPC Outline

1. Background and Philosophy          8. Synchronization
2. UPC Execution Model                9. Performance Tuning
3. UPC Memory Model                       and Early Results
4. UPC: A Quick Intro                 10. Concluding
5. Data and Pointers                     Remarks
6. Dynamic Memory
   Management
7. Programming Examples




     4/3/2004        PSC Petascale Methods                 95
Dynamic Memory Allocation in UPC

• Dynamic memory allocation of shared memory is
  available in UPC
• Functions can be collective or not
• A collective function has to be called by every thread
  and will return the same value to all of them




4/3/2004             PSC Petascale Methods                 96
    Global Memory Allocation
 shared void *upc_global_alloc(size_t
   nblocks, size_t nbytes);
   nblocks : number of blocks
   nbytes : block size
 • Non collective, expected to be called by one thread
 • The calling thread allocates a contiguous memory
   space in the shared space
 • If called by more than one thread, multiple regions are
   allocated and each thread which makes the call gets
   a different pointer
 • Space allocated per calling thread is equivalent to :
   shared [nbytes] char[nblocks * nbytes]
 • (Not yet implemented on Cray)
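
• A minimal sketch (block size and names are illustrative) of one thread allocating shared space that all threads then use:

      #include <upc_relaxed.h>

      shared [100] int *shared buf;   /* one shared pointer, visible to all threads */

      void setup(void) {
          if (MYTHREAD == 0)
              buf = (shared [100] int *)upc_global_alloc(THREADS, 100 * sizeof(int));
          upc_barrier;                      /* make the pointer visible before use   */
          buf[MYTHREAD * 100] = MYTHREAD;   /* each thread writes into its own block */
      }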
4/3/2004             PSC Petascale Methods             97
Collective Global Memory Allocation



 shared void *upc_all_alloc(size_t nblocks, size_t nbytes);

     nblocks:   number of blocks
     nbytes:    block size

 • This function has the same result as upc_global_alloc. But this
   is a collective function, which is expected to be called by all
   threads
 • All the threads will get the same pointer
 • Equivalent to :
   shared [nbytes] char[nblocks * nbytes]
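
• A minimal sketch (illustrative names and sizes) combining the collective allocation with upc_free:

      #include <upc_relaxed.h>

      void collective_buffer(void) {
          /* every thread makes the call and receives the same pointer */
          shared [64] double *a =
              (shared [64] double *)upc_all_alloc(THREADS, 64 * sizeof(double));

          a[MYTHREAD * 64] = (double)MYTHREAD;   /* touch my own block */
          upc_barrier;

          if (MYTHREAD == 0)
              upc_free(a);    /* upc_free is not collective: free exactly once */
      }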



  4/3/2004               PSC Petascale Methods                   98
Memory Freeing


  void upc_free(shared void *ptr);

• The upc_free function frees the dynamically allocated
  shared memory pointed to by ptr
• upc_free is not collective




4/3/2004             PSC Petascale Methods                99
     UPC Outline

1. Background and Philosophy          8. Synchronization
2. UPC Execution Model                9. Performance Tuning
3. UPC Memory Model                       and Early Results
4. UPC: A Quick Intro                 10. Concluding
5. Data and Pointers                     Remarks
6. Dynamic Memory
   Management
7. Programming Examples




     4/3/2004        PSC Petascale Methods                 100
Example: Matrix Multiplication in UPC



• Given two integer matrices A(NxP) and B(PxM), we
  want to compute C =A x B.
• Entries cij in C are computed by the formula:



                $c_{ij} = \sum_{l=1}^{P} a_{il}\, b_{lj}$




4/3/2004                 PSC Petascale Methods       101
Doing it in C
 #include <stdlib.h>
 #include <time.h>
 #define N 4
 #define P 4
 #define M 4
 int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
 int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

 void main (void) {
     int i, j , l;
     for (i = 0 ; i<N ; i++) {
           for (j=0 ; j<M ;j++) {
                     c[i][j] = 0;
                     for (l = 0 ; lP ; l++) c[i][j] += a[i][l]*b[l][j];
           }
     }
 }


    Note: some compilers do not yet support initialization in declaration statements
4/3/2004                              PSC Petascale Methods                              102
        Domain Decomposition for UPC
• Exploits locality in matrix multiplication
• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS:
       Thread 0:           elements 0 .. (N*P / THREADS) - 1
       Thread 1:           elements (N*P / THREADS) .. (2*N*P / THREADS) - 1
       ...
       Thread THREADS-1:   elements ((THREADS-1)*N*P / THREADS) .. (THREADS*N*P / THREADS) - 1
• B (P × M) is decomposed column-wise into M / THREADS blocks:
       Thread 0:           columns 0 .. (M / THREADS) - 1
       ...
       Thread THREADS-1:   columns ((THREADS-1)*M / THREADS) .. M - 1
• Note: N and M are assumed to be multiples of THREADS
        4/3/2004                                     PSC Petascale Methods                               103
UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4

shared [N*P /THREADS] int a[N][P] =
{1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
// a and c are blocked shared matrices, initialization is not currently
implemented
shared[M/THREADS] int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void) {
           int i, j , l; // private variables

           upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
                    for (j=0 ; j<M ;j++) {
                                c[i][j] = 0;
                                for (l= 0 ; lP ; l++) c[i][j] += a[i][l]*b[l][j];
                    }
           }
}
4/3/2004                         PSC Petascale Methods                               104
UPC Matrix Multiplication
Code with block copy
    #include <upc_relaxed.h>
    shared [N*P /THREADS] int a[N][P], c[N][M];
    // a and c are blocked shared matrices, initialization is not currently implemented
    shared[M/THREADS] int b[P][M];
    int b_local[P][M];

    void main (void) {
             int i, j , l; // private variables

               upc_memget(b_local, b, P*M*sizeof(int));

               upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
                        for (j=0 ; j<M ;j++) {
                                    c[i][j] = 0;
                                    for (l= 0 ; lP ; l++) c[i][j] += a[i][l]*b_local[l][j];
                        }
               }
    }
    4/3/2004                         PSC Petascale Methods                               105
     UPC Outline

1. Background and Philosophy          8. Synchronization
2. UPC Execution Model                9. Performance Tuning
3. UPC Memory Model                       and Early Results
4. UPC: A Quick Intro                 10. Concluding
5. Data and Pointers                     Remarks
6. Dynamic Memory
   Management
7. Programming Examples




     4/3/2004        PSC Petascale Methods                 106
Memory Consistency Models


• Has to do with the ordering of shared operations
• Under the relaxed consistency model, the shared
  operations can be reordered by the compiler / runtime
  system
• The strict consistency model enforces sequential
  ordering of shared operations. (no shared operation
  can begin before the previously specified one is done)




4/3/2004            PSC Petascale Methods            107
Memory Consistency Models



 • User specifies the memory model through:
    • declarations
    • pragmas for a particular statement or sequence of
      statements
    • use of barriers, and global operations
 • Consistency can be strict or relaxed
 • Programmers responsible for using correct
   consistency model



4/3/2004            PSC Petascale Methods           108
Memory Consistency
• Default behavior can be controlled by the programmer:
   • Use strict memory consistency
           #include<upc_strict.h>
    • Use relaxed memory consistency
           #include<upc_relaxed.h>




4/3/2004                    PSC Petascale Methods    109
Memory Consistency


• Default behavior can be altered for a variable definition
  using:
    • Type qualifiers: strict & relaxed
• Default behavior can be altered for a statement or a
  block of statements using
    • #pragma upc strict
    • #pragma upc relaxed
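
• A minimal flag-passing sketch (variable names are illustrative) combining the type qualifier with the per-block pragma:

      #include <upc_relaxed.h>          /* relaxed is the file-wide default               */

      strict shared int flag;           /* type qualifier: every access to flag is strict */
      shared int data;                  /* relaxed by default                             */

      void producer(void) {
          data = 42;                    /* relaxed write                                  */
          flag = 1;                     /* strict write: not reordered before the write
                                           to data, so consumers see data first           */
      }

      void consumer(void) {
          while (!flag)                 /* strict read: eventually observes the update    */
              ;
          {
      #pragma upc strict                /* per-block override: accesses below are strict  */
              int x = data;
              (void)x;
          }
      }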




4/3/2004             PSC Petascale Methods              110
     UPC Outline

1. Background and Philosophy          8. Synchronization
2. UPC Execution Model                9. Performance Tuning
3. UPC Memory Model                       and Early Results
4. UPC: A Quick Intro                 10. Concluding
5. Data and Pointers                     Remarks
6. Dynamic Memory
   Management
7. Programming Examples




     4/3/2004        PSC Petascale Methods                 111
How to Exploit the Opportunities for
Performance Enhancement?
• Compiler optimizations
• Run-time system
• Hand tuning




4/3/2004            PSC Petascale Methods   112
   List of Possible Optimizations for
   UPC Codes
• Space privatization: use private pointers instead of
  pointers to shared when dealing with local shared data
  (through casting and assignments)
• Block moves: use block copies instead of copying
  elements one by one in a loop, through string
  operations or structures
• Latency hiding: for example, overlap remote accesses
  with local processing using split-phase barriers
• Vendors can also help by decreasing the cost of address
  translation and by providing optimized standard libraries



4/3/2004             PSC Petascale Methods            113
Performance of Shared vs. Private Accesses
(Old COMPAQ Measurement)

         MB/s                   read single elements   write single elements
         CC                            640.0                  400.0
         UPC Private                   686.0                  565.0
         UPC local shared                7.0                   44.0
         UPC remote shared               0.2                    0.2

   Recent compiler developments have improved some of that
   4/3/2004            PSC Petascale Methods                114
Using Local Pointers Instead of pointer
to shared

  …
  upc_forall(i=0; i<N; i++; &A[i][0]) {
        int *pa = (int*) &A[i][0];   /* private pointers to rows that are local here */
        int *pc = (int*) &C[i][0];
        for(j=0; j<P; j++)
                  pa[j] += pc[j];
  }
  …
• Pointer arithmetic is faster using local pointers than
  pointer to shared
• The pointer dereference can be one order of
  magnitude faster

4/3/2004                  PSC Petascale Methods            115
Performance of UPC


• UPC benchmarking results
   • Nqueens Problem
   • Matrix Multiplication
   • Sobel Edge detection
   • Stream and GUPS
   • NPB
   • Splash-2
• Compaq AlphaServer SC and Origin 2000/3000
• Check the web site for new measurements


4/3/2004          PSC Petascale Methods        116
Shared vs. Private Accesses (Recent SGI
Origin 3000 Measurement)
  STREAM BENCHMARK (MB/s)
                           Memcpy   Array Copy   Scale   Sum   Block Get   Block Scale
  GCC                        400        266        266   800      N/A          N/A
  UPC Private                400        266        266   800      N/A          N/A
  UPC Local                  N/A         40         44   100      400          400
  UPC Shared (SMP)           N/A         40         44    88      266          266
  UPC Shared (Remote)        N/A         34         38    72      200          200




                    4/3/2004               PSC Petascale Methods                   117
                          Execution Time over SGI Origin 2k, NAS-EP, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O0 and GCC]
                         4/3/2004           PSC Petascale Methods                 118
                          Performance of Edge Detection on the Origin 2000
                          [Left chart: execution time (sec, log scale) vs. NP (1, 2, 4, 8, 16, 32); curves: UPC no opt. and UPC full opt.]
                          [Right chart: speedup vs. NP (1, 2, 4, 8, 16, 32); curves: fully optimized and optimal]
                              4/3/2004                                           PSC Petascale Methods                                              119
                          Execution Time over SGI Origin 2k, NAS-FT, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O0, UPC -O1, and GCC]
                          4/3/2004           PSC Petascale Methods                      120
                          Execution Time over SGI Origin 2k, NAS-CG, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O0, UPC -O1, UPC -O3, and GCC]

                          4/3/2004              PSC Petascale Methods                 121
                          Execution Time over SGI Origin 2k, NAS-EP, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O0, MPI, OpenMP, F/CC, and GCC]
                          Note: the MPI and OpenMP versions are written in Fortran and compiled by F77; the UPC version is compiled by GCC.
                    4/3/2004               PSC Petascale Methods                                                 122
                          Execution Time over SGI Origin 2k, NAS-FT, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O1, MPI, OpenMP, F/CC, and GCC]
                          Note: the MPI and OpenMP versions are written in Fortran and compiled by F77; the UPC version is compiled by GCC.
                         4/3/2004              PSC Petascale Methods                                                 123
                          Execution Time over SGI Origin 2k, NAS-CG, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O3, MPI, OpenMP, F/CC, and GCC]
                          Note: the MPI and OpenMP versions are written in Fortran and compiled by F77; the UPC version is compiled by GCC.
                         4/3/2004               PSC Petascale Methods                                                 124
                          Execution Time over SGI Origin 2k, NAS-MG, Class A
                          [Chart: computation time (sec) vs. number of processors (1, 2, 4, 8, 16, 32); curves: UPC -O3, MPI, OpenMP, F/CC, and GCC]
                          Note: the MPI and OpenMP versions are written in Fortran and compiled by F77; the UPC version is compiled by GCC.
                         4/3/2004               PSC Petascale Methods                                                125
     UPC Outline

1. Background and Philosophy          8. Synchronization
2. UPC Execution Model                9. Performance Tuning
3. UPC Memory Model                       and Early Results
4. UPC: A Quick Intro                 10. Concluding
5. Data and Pointers                     Remarks
6. Dynamic Memory
   Management
7. Programming Examples




     4/3/2004        PSC Petascale Methods                 126
Conclusions
     UPC Time-To-Solution = UPC Programming Time + UPC Execution Time

Programming time:
•   Simple and familiar view
     • Domain decomposition maintains a global application view
     • No function calls
•   Concise syntax
     • Remote writes with assignment to shared
     • Remote reads with expressions involving shared
     • Domain decomposition (mainly) implied in declarations (a logical place!)

Execution time:
•   Data locality exploitation
•   No calls
•   One-sided communications
•   Low overhead for short accesses



4/3/2004                    PSC Petascale Methods                    127
 Conclusions

• UPC is easy to program in for C programmers, and at times
  significantly easier than alternative paradigms
• UPC exhibits very little overhead when compared with
  MPI for problems that are embarrassingly parallel. No
  tuning is necessary.
• For other problems, compiler optimizations are improving
  but are not yet fully mature
• With hand-tuning, UPC performance compared
  favorably with MPI
• Hand tuned code, with block moves, is still
  substantially simpler than message passing code


 4/3/2004             PSC Petascale Methods           128
Conclusions


• Automatic compiler optimizations should focus on
    • Inexpensive address translation
    • Space Privatization for local shared accesses
    • Prefetching and aggregation of remote accesses,
      prediction is easier under the UPC model
• More performance help is expected from optimized
  standard library implementations, especially collective
  operations and I/O




4/3/2004             PSC Petascale Methods             129
    References
•   The official UPC website, http://upc.gwu.edu
•   T. A.El-Ghazawi, W.W.Carlson, J. M. Draper. UPC Language Specifications
    V1.1 (http://upc.gwu.edu). May, 2003
•   François Cantonnet, Yiyi Yao, Smita Annareddy, Ahmed S. Mohamed, Tarek
    A. El-Ghazawi Performance Monitoring and Evaluation of a UPC
    Implementation on a NUMA Architecture, International Parallel and
    Distributed Processing Symposium (IPDPS'03), Nice Acropolis Convention
    Center, Nice, France, 2003.
•   Wei-Yu Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu,
    Katherine Yelick, A performance analysis of the Berkeley UPC compiler,
    International Conference on Supercomputing, Proceedings of the 17th
    annual international conference on Supercomputing 2003,San Francisco,
    CA, USA
•   Tarek A. El-Ghazawi, François Cantonnet, UPC Performance and Potential:
    A NPB Experimental Study, SuperComputing 2002 (SC2002). IEEE,
    Baltimore MD, USA, 2002.
•   Tarek A.El-Ghazawi, Sébastien Chauvin, UPC Benchmarking Issues,
    Proceedings of the International Conference on Parallel Processing
    (ICPP'01). IEEE CS Press. Valencia, Spain, September 2001.
    4/3/2004                 PSC Petascale Methods                   130
CS267 Final Projects
• Project proposal
    • Teams of 3 students, typically across departments
    • Interesting parallel application or system
    • Conference-quality paper
    • High performance is key:
        • Understanding performance, tuning, scaling, etc.
        • More important than the difficulty of the problem

• Leverage
    • Projects in other classes (but discuss with me first)
    • Research projects


4/3/2004               PSC Petascale Methods              131
Project Ideas
• Applications
    • Implement existing sequential or shared memory
      program on distributed memory
    • Investigate SMP trade-offs (using only MPI versus
      MPI and thread based parallelism)
• Tools and Systems

   • Effects of reordering on sparse matrix factoring and
     solves
• Numerical algorithms
   • Improved solver for immersed boundary method
   • Use of multiple vectors (blocked algorithms) in
     iterative solvers
4/3/2004             PSC Petascale Methods             132
Project Ideas
• Novel computational platforms
    • Exploiting hierarchy of SMP-clusters in benchmarks
    • Computing aggregate operations on ad hoc networks
      (Culler)
    • Push/explore limits of computing on “the grid”
    • Performance under failures
• Detailed benchmarking and performance analysis,
  including identification of optimization opportunities
    • Titanium
    • UPC
    • IBM SP (Blue Horizon)


4/3/2004            PSC Petascale Methods           133
Hardware Limits to Software Innovation




• Software send overhead for 8-byte messages over time.
• Not improving much over time (even in absolute terms)
4/3/2004           PSC Petascale Methods          134

				