CS420: OpenMP Performance Issues

Laxmikant V. Kale
Jacobi with OpenMP:
       Different computation schemes
           Which dimension to iterate over first, X or Y?
           Which dimension to parallelize, X or Y?
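
       A minimal sketch of the variants being compared (assumptions: the grid size N, the
       array names cur and nxt, and the reading that "xy" means the row index is the outer
       loop; only the two xy variants are written out):

        #include <omp.h>
        #define N 1024
        double cur[N][N], nxt[N][N];        // illustrative 2-D Jacobi grids

        void xy_outer(void) {               // rows outside, outer (row) loop parallelized
            #pragma omp parallel for
            for (int i = 1; i < N-1; i++)
                for (int j = 1; j < N-1; j++)
                    nxt[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] +
                                        cur[i][j-1] + cur[i][j+1]);
        }

        void xy_inner(void) {               // rows outside, inner (column) loop parallelized
            for (int i = 1; i < N-1; i++) {
                #pragma omp parallel for
                for (int j = 1; j < N-1; j++)
                    nxt[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] +
                                        cur[i][j-1] + cur[i][j+1]);
            }
        }
        // yx_outer / yx_inner swap the i and j loops, so the innermost accesses stride
        // down columns instead of running along contiguous rows.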




Different computation schemes
                       [Chart: Performance comparison of different parallelization/computation
                        schemes. Avg. Iter Time (ms) for xy_outer, xy_inner, yx_outer, and
                        yx_inner, each run with OMP_NUM_THREADS = 1, 2, 4, and 8.]




Different way of expressing the same
parallelization scheme (1)
       Take xy_inner as an example
           overhead of creating OpenMP threads every time the parallelized inner loop is entered
       Implicit → Explicit parallelism expression
           Removes the above overhead (see the sketch below)
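
       A sketch of the two ways of writing the same xy_inner scheme (reusing the illustrative
       N, cur, and nxt from the earlier sketch): the implicit form opens a fresh parallel
       region on every outer-loop iteration, while the explicit form opens one region and only
       redistributes work inside it.

        #include <omp.h>
        #define N 1024
        extern double cur[N][N], nxt[N][N];

        void xy_inner_implicit(void) {      // a new parallel region per outer iteration
            for (int i = 1; i < N-1; i++) {
                #pragma omp parallel for
                for (int j = 1; j < N-1; j++)
                    nxt[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] +
                                        cur[i][j-1] + cur[i][j+1]);
            }
        }

        void xy_inner_explicit(void) {      // one parallel region for the whole sweep
            #pragma omp parallel
            {
                for (int i = 1; i < N-1; i++) {
                    #pragma omp for         // work-sharing only; no region creation
                    for (int j = 1; j < N-1; j++)
                        nxt[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] +
                                            cur[i][j-1] + cur[i][j+1]);
                }
            }
        }

       In the explicit form every thread runs the i loop, and each omp for splits the j
       iterations among them, so the work assignment stays the same; only the region creation
       (and any thread start-up) moves out of the loop.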







Different way of expressing the same
parallelization scheme (2)

                          [Chart: Performance comparison between different ways of expressing
                           parallelism. Avg. Iter Time (ms) for xy_inner_implicit and
                           xy_inner_explicit, each run with OMP_NUM_THREADS = 1, 2, 3, and 8.]




Parallelization for absolute performance (1)
       Implicit barrier for each parallel construct in OpenMP
           Each iteration of the outermost loop is separated by an implicit
            OpenMP barrier
       Is it possible for one thread to start the next iteration
        without waiting for all other threads to finish the current
        iteration?
           Removing the barrier could let computation from different
            iterations overlap, thus saving time (see the nowait sketch
            below)
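
       A sketch of where the implicit barriers sit, and of OpenMP's nowait clause, which drops
       the barrier at the end of a work-sharing loop (MAX_ITERS and the arrays are the same
       illustrative names as before); whether skipping a barrier is actually safe depends on the
       data dependences, which the block-decomposition discussion a few slides later addresses.

        #pragma omp parallel
        {
            for (int iter = 0; iter < MAX_ITERS; iter++) {
                #pragma omp for                    // implicit barrier at the end of this loop
                for (int i = 1; i < N-1; i++)
                    for (int j = 1; j < N-1; j++)
                        nxt[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] +
                                            cur[i][j-1] + cur[i][j+1]);

                #pragma omp for nowait             // nowait removes the implicit barrier, so a
                for (int i = 1; i < N-1; i++)      // fast thread may run ahead into the next
                    for (int j = 1; j < N-1; j++)  // iteration; extra synchronization is then
                        cur[i][j] = nxt[i][j];     // needed to keep the update correct
            }
        }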




Is square decomposition good?








    No, because the situation is asymmetric across dimensions
     (row-major storage makes cache-line traffic differ along the two
     directions).
    It should be a rectangle with the longer dimension along rows.
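
       As a rough worked example (assuming row-major double-precision data and 64-byte cache
       lines, i.e. 8 doubles per line): for an r-row by c-column tile, the two column
       boundaries touch about 2r distinct cache lines (one per element), while the two row
       boundaries touch only about 2c/8, so keeping the row direction as the longer side keeps
       the expensive strided boundaries short.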


Parallelization for scalability (2)




                     [Chart: Performance comparison among different tile shapes, running on
                      8 threads. Avg Iter Time (ms) for the 1*8, 2*4, 4*2, and 8*1
                      decompositions.]




Parallelization for absolute performance (2)
       Considering block decomposition of the matrix
       For simplicity, each thread holds one block


       • Observation:
           • Thread 5 can start the next iteration when its neighbor
             threads 1, 4, 6, and 9 finish their updates for the shaded
             parts, respectively
           • Thread 5 doesn’t need to wait for its non-neighbor threads
             (such as 0, 3, 7, 13, etc.) to finish the current iteration
             before starting the next one
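
       A simplified sketch of this neighbor-only synchronization, assuming one thread per block
       on a 4 x 4 block grid: each thread publishes how many iterations it has finished, and
       before iteration t it waits only for its up/down/left/right neighbors to have finished
       iteration t-1 (the slide's finer condition, waiting only for the shaded boundary parts,
       is coarsened here to whole neighbor iterations). The names done, neighbors_of, and
       compute_block are illustrative, not from any library.

        #include <omp.h>
        #include <atomic>

        const int NT = 16;                     // 4 x 4 grid of blocks, one thread per block
        std::atomic<int> done[NT];             // done[t] = iterations thread t has finished

        int neighbors_of(int me, int nbr[4]) { // up/down/left/right neighbors in a 4 x 4 grid
            int n = 0, r = me / 4, c = me % 4;
            if (r > 0) nbr[n++] = me - 4;
            if (r < 3) nbr[n++] = me + 4;
            if (c > 0) nbr[n++] = me - 1;
            if (c < 3) nbr[n++] = me + 1;
            return n;
        }

        void jacobi_neighbor_sync(int max_iters) {
            for (int i = 0; i < NT; i++) done[i] = 0;
            #pragma omp parallel num_threads(NT)
            {
                int me = omp_get_thread_num();
                int nbr[4];
                int nn = neighbors_of(me, nbr);
                for (int t = 0; t < max_iters; t++) {
                    for (int k = 0; k < nn; k++)        // wait only for neighbors, never
                        while (done[nbr[k]].load() < t) // for the other NT-1 threads
                            ;                           // spin; real code would back off
                    // compute_block(me);               // hypothetical per-block Jacobi update
                    done[me].fetch_add(1);              // publish: iteration t is finished
                }
            }
            // A complete version must also keep a thread from getting more than one iteration
            // ahead of its neighbors (because of the cur/nxt double buffering).
        }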


OpenMP Performance Issues
    Using OpenMP to parallelize a program isn’t as simple as it
     looks: just adding compiler directives is rarely enough,
     especially once performance is considered
        More changes to the original program are needed
        Write programs in explicitly parallel (not “parallel for”) form
            But then there is almost no difference between OpenMP and
             Pthreads in terms of programmability
            Need more high-level language constructs to help the programmer




OpenMP Performance Issues
    Essentially, a major problem is that the programming
     model does not correspond to the performance model
        Programmer doesn’t see the cost of the constructs
    You must be aware of communication via cache lines:
        If one processor writes data and another reads it, that is
         communication with real costs, even though the hardware is
         shared memory (see the false-sharing sketch below)
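
       A sketch of the cheapest way to see this cost: false sharing. Each thread below writes
       only its own counter, yet if the counters sit on one cache line the line bounces between
       cores on every write. The 64-byte line size and all names are assumptions for
       illustration.

        #include <omp.h>

        const int T = 8;

        int counts[T];                      // adjacent ints: likely one cache line, so every
                                            // write invalidates the other threads' copies

        struct Padded {                     // padding each counter out to an assumed 64-byte
            int value;                      // line keeps every thread in its own line
            char pad[64 - sizeof(int)];
        };
        Padded padded[T];

        void count_positives(const int *data, long n) {
            #pragma omp parallel num_threads(T)
            {
                int me = omp_get_thread_num();
                #pragma omp for
                for (long i = 0; i < n; i++)
                    if (data[i] > 0)
                        padded[me].value++; // no coherence traffic between threads
                        // counts[me]++;    // same logic, but the counters share a line
            }
        }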




New short topic: Charm++ and shared memory nodes
    Basic approach:
        Data is private by default (objects) and shared explicitly
         (via asynchronous method invocation)
        However: several data sharing mechanisms are available:
            Readonly: directly supported
            Nodegroups: node-level shared object, with atomic methods
    Just plain old shared memory is also accessible
        Disciplined use: send a pointer in a message
        Two modes (not enforced, but conventions):
            You-can-modify-this (and I know no one else will)
            You can read this (and I promise no one will change it until you have
             read it)

Node Groups
    Node Groups - a collection of objects (chares)
        Exactly one representative on each node
            Ideally suited for system libraries on SMP
        Similar to arrays:
            Broadcasts, reductions, indexing
        But not completely like arrays:
            Non-migratable; one per node




Declarations
    .ci file
          nodegroup mynodegroup {
            entry mynodegroup();                //Constructor
            entry void foo(foomsg *);           //Entry method
            entry [exclusive] void foo2();      // exclusive entry method
         };
    C++ file
     class mynodegroup : public CBase_mynodegroup {
         int count;
     public:
         mynodegroup() : count(0) {}
         void foo(foomsg *m) { CkPrintf("Do Nothing"); }
         void foo2() { count++; }
     };

     Exclusive entry methods, which exist only on node groups, are entry methods that do not
        execute while other exclusive entry methods of their node group are executing on the
        same node. The execution of these entry methods is lock-protected.



Creating and Calling Groups
    Creation
     p = CProxy_mygroup::ckNew();
    Remote invocation
     p.foo(msg);   //broadcast
     p[1].foo(msg); //asynchronous
     p.foo(msg, npes, pes); // list send
    Direct local access
     mygroup *g = p.ckLocalNodeBranch();
     g->foo(...);   // local invocation
    Danger: if you migrate, the group stays behind!




				