Java for High Performance Computing

Multithreaded and Shared-Memory
       Programming in Java
    http://www.hpjava.org/courses/arl
            Instructor: Bryan Carpenter
            Pervasive Technology Labs
                 Indiana University




              dbcarpen@indiana.edu
Java as a Threaded Language
   In C, C++, etc., it is possible to do multithreaded programming,
    given a suitable library.
    – e.g. the pthreads library.
   Thread libraries provide one approach to doing parallel
    programming on computers with shared memory.
    – Another approach is OpenMP, which uses compiler directives. This
      will be discussed later.
   Unlike these (traditional HPC) languages, Java integrates
    threads into the basic language specification in a much tighter
    way.
    – Every Java Virtual Machine must support threads.
   Although this close integration doesn’t exactly make
    multithreaded programming in Java easy, it does help avoid
    common programming errors, and keeps the semantics clean.
Features of Java Threads
   Java provides a set of synchronization primitives based on the
    monitor and condition-variable paradigm of C.A.R. Hoare.
    – Underlying functionality is similar to POSIX threads, for example.
   Syntactic extension for threads is (deceptively?) small:
    –   synchronized attribute on methods.
    –   synchronized statement.
    –   volatile keyword.
    –   Other thread management and synchronization captured in the Thread
        class and related classes.
   But the presence of threads has a more wide-ranging effect on
    the language specification and JVM implementation.
    – e.g., the Java memory model.



Contents of this Lecture Set
   Introduction to Java Threads.
    – Mutual Exclusion.
    – Synchronization between Java Threads using wait() and notify().
    – Other features of Java Threads.
   Shared Memory Parallel Computing with Java Threads
    – We review select parallel applications and benchmarks of Java on
      SMPs from the recent literature.
   Special Topic: JOMP
    – JOMP is a prototype implementation of the OpenMP standard for
      Java.
   Suggested Exercises



Java Thread Basics




Creating New Threads in a Java Program
   Any Java thread of execution is associated with an instance of
    the Thread class. Before starting a new thread, you must
    create a new instance of this class.
   The Java Thread class implements the interface Runnable.
    So every Thread instance has a method:
         public void run() { . . . }

   When the thread is started, the code executed in the new
    thread is the body of the run() method.
    – Generally speaking the new thread ends when this method returns.




Making Thread Instances
   There are two ways to create a thread instance (and define the thread
    run() method). Choose at your convenience:
    1. Extend the Thread class and override the run() method, e.g.:
        class MyThread extends Thread {
           public void run() {
             System.out.println("Hello from another thread") ;
           }
        }
        ...
        Thread thread = new MyThread() ;

    2. Create a separate Runnable object and pass it to the Thread constructor,
       e.g.:
        class MyRunnable implements Runnable {
           public void run() {
             System.out.println("Hello from another thread") ;
           }
        }
        ...
        Thread thread = new Thread(new MyRunnable()) ;
Starting a Thread
   Creating the Thread instance does not in itself start the
    thread running.
   To do that you must call the start() method on the new
    instance:
        thread.start() ;
    This operation causes the run() method to start executing
    concurrently with the original thread.
   In our example the new thread will print the message “Hello
    from another thread” to standard output, then immediately
    terminate.
   You can only call the start() method once on any Thread
    instance. Trying to "restart" a thread causes an exception to
    be thrown.

Example: Multiple Threads
    class MyThread extends Thread {
      MyThread(int id) {
        this.id = id ;
      }
      public void run() {
        System.out.println("Hello from thread " + id) ;
      }
      private int id ;
    }
    ...
    Thread [] threads = new Thread [p] ;
    for(int i = 0 ; i < p ; i++)
       threads [i] = new MyThread(i) ;
    for(int i = 0 ; i < p ; i++)
       threads [i].start() ;
Remarks
   This is one way of creating and starting p new threads to run
    concurrently.
   The output might be something like (for p = 4):
        Hello from thread 3
        Hello from thread 0
        Hello from thread 2
        Hello from thread 1
    Of course there is no guarantee of order (or atomicity) of
    outputs, because the threads are concurrent.
   One might worry about the efficiency of this approach for
    large numbers of threads (massive parallelism).


JVM Termination and Daemon Threads
   When a Java application is started, the main() method of the application
    is executed in a singled-out thread called the main thread.
   In the simplest case—if the main method never creates any new
    threads—the JVM keeps running until the main() method completes (and
    the main thread terminates).
    –   If the JVM was started with the java command, the command finishes.
   If the main() method creates new threads, then by default the JVM
    terminates when all user-created threads have terminated.
   More generally there are system threads executing in the background,
    concurrent with the user threads (e.g. threads might be associated with
    garbage collection). These threads are marked as daemon threads—
    meaning just that they don't have the property of "keeping the JVM
    alive". So, more strictly, the JVM terminates when all non-daemon
    threads terminate.
   Ordinary user threads can create daemon threads by applying the
    setDaemon() method to the thread instance before starting it.
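   For instance, a background housekeeping thread might be marked as a
    daemon so that it cannot by itself keep the JVM alive (a minimal sketch,
    reusing the MyRunnable class from the earlier example):

        Thread logger = new Thread(new MyRunnable()) ;
        logger.setDaemon(true) ;   // must precede start(); the JVM may now
        logger.start() ;           // exit even while this thread is running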


Mutual Exclusion




Avoiding Interference
   In any non-trivial multithreaded (or shared-memory-parallel) program,
    interference between threads is an issue.
   Generally interference (or a race condition) occurs if two threads are
    trying to do operations on the same variables at the same time. This often
    results in corrupt data.
    –   But not always. It depends on the exact interleaving of instructions. This
        non-determinism is the worst feature of race conditions.
   A popular solution is to provide some kind of lock primitive. Only one
    thread can acquire a particular lock at any particular time. The
    concurrent program can then be written so that operations on a given
    group of variables are only ever performed by threads that hold the lock
    associated with that group.
   In POSIX threads, for example, the lock objects are called mutexes.



Example use of a Mutex (in C)
               Thread A                                 Thread B

pthread_mutex_lock(&my_mutex) ;


/* critical region */
tmp1 = count ;                           pthread_mutex_lock(&my_mutex) ;
count = tmp1 + 1 ;

                                                         Blocked
pthread_mutex_unlock(&my_mutex) ;


                                         /* critical region */
                                         tmp2 = count ;
                                         count = tmp2 - 1 ;


                                         pthread_mutex_unlock(&my_mutex) ;

Pthreads-style mutexes
   In POSIX threads, a mutex is allocated then initialized with
    pthread_mutex_init().
   Once a mutex is initialized, a thread can acquire a lock on it by calling
    e.g. pthread_mutex_lock(). While it holds the lock it performs some
    update on the variables guarded by the lock (critical region). Then the
    thread calls pthread_mutex_unlock() to release the lock.
   Other threads that try to call pthread_mutex_lock() while the critical
    region is being executed are blocked until the first thread releases the
    lock.
   This is fine, but opportunities for error include:
    –   There is no built-in association between the lock object (mutex) and the set of
        variables it guards—it is up to the program to maintain a consistent
        association.
    –   If you forget to call pthread_mutex_unlock() after pthread_mutex_lock(),
        deadlocks will occur.


    Monitors
    Java addresses these problems by adopting a version of the monitors proposed
     by C.A.R. Hoare.
    Every Java object is created with its own lock (and every lock is associated
     with an object—there is no way to create an isolated mutex). In Java this lock
     is often called the monitor lock.
    Methods of a class can be declared to be synchronized.
    The object’s lock is acquired on entry to a synchronized method, and
     released on exit from the method.
     –   Synchronized static methods need slightly different treatment.
    Assuming methods generally modify the fields (instance variables) of the
     objects they are called on, this leads to a natural and systematic association
     between locks and the variables they guard: viz. a lock guards the instance
     variables of the object it is attached to.
    The critical region becomes the body of the synchronized method.
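    As an illustration, the diagram on the next slide assumes a counter class
     along these lines (a minimal sketch):

        class Counter {
           private int count = 0 ;
           public synchronized void increment() {   // acquires this object's lock on entry
              count = count + 1 ;
           }
           public synchronized void decrement() {   // blocks while another thread holds the lock
              count = count - 1 ;
           }
        }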

Example use of Synchronized Methods
             Thread A                                     Thread B

… call to counter.increment() …


// body of synchronized method
tmp1 = count ;                                … call to counter.decrement() …
count = tmp1 + 1 ;

                                                           Blocked
… counter.increment() returns …


                                              // body of synchronized method
                                              tmp2 = count ;
                                              count = tmp2 - 1 ;


                                              … counter.decrement() returns …

Caveats
   This approach helps to encourage good practices, and make multithreaded
    Java programs less error-prone than, say, multithreaded C programs.
   But it isn’t magic:
    –   It still depends on correct identification of the critical regions, to avoid race
        conditions.
    –   The natural association between the lock of the object and its fields relies on the
        programmer following conventional patterns of object oriented programming
        (which the language encourages but doesn’t enforce).
    –   By using the synchronized construct (see later), programs can break this
        association altogether.
    –   There are plenty more insidious ways to introduce deadlocks, besides accidentally
        forgetting to release a lock!
   Concurrent programming is hard, and if you start with the assumption Java
    somehow makes concurrent programming "easy", you are probably going to
    write some broken programs!



Example: A Simple Queue
  public class SimpleQueue {
      private Node front, back ;
      public synchronized void add(Object data) {
         if (front != null) {
              back.next = new Node(data) ;
              back = back.next ;
         }
         else {
               front = new Node(data) ;
               back = front ;
          }
      }
      public synchronized Object rem() {
          Object result = null ;
          if (front != null) {
                result = front.data ;
                front = front.next ;
          }
          return result ;
      }
  }
Remarks
   This queue is implemented as a linked list with a front pointer
    and a back pointer.
   The method add() adds a node to the back of the list; the
    method rem() removes a node from the front of the list.
   The Node class just has a data field (type Object) and a next
    field (type Node); a minimal sketch is given below.
   The rem() method immediately returns null when the queue
    is empty.
   The following slide gives an example of what could go wrong
    without mutual exclusion. It assumes two threads
    concurrently add nodes to the queue.
    – In the initial state, Z is the last item in the queue. In the final state, the
      X node is orphaned, and the back pointer is null.
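   A minimal sketch of the Node class:

        class Node {
           Object data ;
           Node next ;
           Node(Object data) { this.data = data ; }
        }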


The Need for Synchronized Methods
[Diagram: Thread A executes add(X) while Thread B concurrently executes
add(Y) on a queue whose last node is Z, with no synchronization. Each thread
performs back.next = new Node(...) followed by back = back.next. In the
interleaving shown, B's write to back.next overwrites Z's link to A's new X
node; A then sets back to the Y node, and B's final back = back.next follows
Y's null next link. The result: the X node is orphaned and back is null—a
corrupt data structure!]
The synchronized construct
   The keyword synchronized also appears in the synchronized
    statement, which has syntax like:
        synchronized (object) {
          … critical region …
        }
   Here object is a reference to any object. The synchronized
    statement first acquires the lock on this object, then executes
    the critical region, then releases the lock.
   Typically you might use this as the lock object, somewhere
    inside a non-synchronized method, when the critical region
    is smaller than the whole method body.
   In general, though, the synchronized statement allows you to
    use the lock in any object to guard any code.
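   For example, a method might do a long computation on private data and
    only lock this for the brief shared update (a sketch; the class and
    names are illustrative):

        class Accumulator {
           private long total = 0 ;
           public void add(long [] values) {
              long sum = 0 ;
              for(int i = 0 ; i < values.length ; i++)
                 sum += values[i] ;          // long-running work on private data
              synchronized (this) {          // lock held only for the update itself
                 total += sum ;
              }
           }
        }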

Performance Overheads of synchronized
   Acquiring locks obviously introduces an overhead in
    execution of synchronized methods. See, for example:
    "Performance Limitations of the Java Core Libraries",
    Allan Heydon and Marc Najork (Compaq),
    Proceedings of ACM 1999 Java Grande Conference.
   Many of the utility classes in the Java platform (e.g. Vector,
    etc) were originally specified to have synchronized methods,
    to make them safe for the multithreaded environment.
   This is now generally thought to have been a mistake: newer
    replacement classes (e.g. ArrayList) usually don’t have
    synchronized methods—it is left to the user to provide the
    synchronization, as needed, e.g. through wrapper classes.
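     – For example, java.util.Collections provides such wrappers:
          List list = Collections.synchronizedList(new ArrayList()) ;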



General Synchronization




Beyond Mutual Exclusion
   The mutual exclusion provided by synchronized methods and
    statements is an important special sort of synchronization.
   But there are other interesting forms of synchronization
    between threads. Mutual exclusion by itself is not enough to
    implement these more general sorts of thread interaction (not
    efficiently, at least).
   POSIX threads, for example, provides a second kind of
    synchronization object called a condition variable to
    implement more general inter-thread synchronization.
   In Java, condition variables (like locks) are implicit in the
    definition of objects: every object effectively has a single
    condition variable associated with it.



A Motivating Example
   Consider the simple queue from the previous example.
   If we try to remove an item from the front of the queue when
    the queue is empty, SimpleQueue was specified to just return
    null.
   This is reasonable if our queue is just meant as a data structure
    buried somewhere in an algorithm. But what if the queue is a
    message buffer in a communication system?
   In that case, if the queue is empty, it may be more natural for
    the "remove" operation to block until some other thread adds
    a message to the queue.




Busy Waiting
   One approach would be to add a method that polls the queue until data is
    ready:
     public Object get() {               // note: must NOT be synchronized, or
         while(true) {                   // add() could never acquire the lock
            Object result = rem() ;      // rem() briefly acquires the queue's lock
            if (result != null)
                 return result ;
         }
     }
   This works, but it may be inefficient to keep doing the basic rem()
    operation in a tight loop, if these machine cycles could be used by other
    threads.
     – This isn’t clear cut: sometimes busy waiting is the most efficient approach.
   Another possibility is to put a sleep() operation in the loop, to deschedule
    the thread for some fixed interval between polling operations. But then we
    lose responsiveness.
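   A sketch of the sleep() variant—another method that could be added to
    SimpleQueue (the 10 ms interval is an arbitrary choice):

      public Object getPolling() throws InterruptedException {
          while(true) {
             Object result = rem() ;
             if (result != null)
                  return result ;
             Thread.sleep(10) ;   // deschedule between polls; up to 10 ms extra latency
          }
      }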



wait() and notify()
   In general a more elegant approach is to use the wait() and
    notify() family of methods. These are defined in the Java
    Object class.
   Typically a call to a wait() method puts the calling thread to
    sleep until another thread wakes it up again by calling a
    notify() method.
   In our example, if the queue is currently empty, the get()
    method would invoke wait(). This causes the get() operation
    to block. Later when another thread calls add(), putting data
    on the queue, the add() method invokes notify() to wake up
    the "sleeping" thread. The get() method can then return.




A Simplified Example
public class Semaphore {
   int s ;
     public Semaphore(int s) { this.s = s ; }
     public synchronized void add() {
        s++ ;
        notify() ;
     }
     public synchronized void get() throws InterruptedException {
        while(s == 0)
            wait() ;
        s-- ;
     }
 }


Remarks I
   Rather than a linked list we have a simple counter, which is
    required always to be non-negative.
    – add() increments the counter.
    – get() decrements the counter, but if the counter was zero it blocks until
      another thread increments the counter.
   The data structures are simplified, but the synchronization
    features used here are essentially identical to what would be
    needed in a blocking queue (left as an exercise).
   You may recognize this as an implementation of a classical
    semaphore—an important synchronization primitive in its
    own right.




Remarks II
   wait() and notify() must be used inside synchronized methods of the
    object they are applied to.
   The wait() operation ―pauses‖ the thread that calls it. It also releases the
    lock which the thread holds on the object (for the duration of the wait()
    call: the lock will be claimed again before continuing, after the pause).
   Several threads can wait() simultaneously on the same object.
   If any threads are waiting on an object, the notify() method ―wakes up‖
    exactly one of those threads. If no threads are waiting on the object,
    notify() does nothing.
   A wait() method may throw an InterruptedException (rethrown by
    get() in the example). This will be discussed later.
   Although the logic in the example doesn't strictly require it, universal lore
    has it that one should always put a wait() call in a loop, in case the
    condition that caused the thread to sleep has not been resolved when the
    wait() returns (a programmer flouting this rule might use if in place of
    while).

Another Example
public class Barrier {
    private int n, generation = 0, count = 0 ;
    public Barrier(int n) { this.n = n ; }
    public synchronized void synch() throws InterruptedException {
       int genNum = generation ;
       count++ ;
       if(count == n) {
           count = 0 ;
           generation++ ;
           notifyAll() ;
       }
       else
           while(generation == genNum)
              wait() ;
    }
}

Remarks
   This class implements barrier synchronization—an important operation in
    shared memory parallel programming.
   It synchronizes n processes: when n threads make calls to synch() the first
    n-1 block until the last one has entered the barrier.
   The method notifyAll() generalizes notify(). It wakes up all threads
    currently waiting on this object.
     – Many authorities consider use of notifyAll() to be "safer" than notify(), and
       recommend always using notifyAll().
   In the example, the generation number labels the current, collective barrier
    operation: it is only really needed to control the while loop round wait().
     – And this loop is only really needed to conform to the standard pattern of
       wait()-usage, mentioned earlier.




Concluding Remarks on Synchronization
   We illustrated with a couple of simple examples that wait()
    and notify() allow various interesting patterns of thread
    synchronization (or thread communication) to be
    implemented.
   In some sense these primitives are sufficient to implement
    "general" concurrent programming—any pattern of thread
    synchronization can be implemented in terms of these
    primitives.
    – For example you can easily implement message passing between
      threads (left as an exercise…)
   This doesn’t mean these are necessarily the last word in
    synchronization: e.g. for scalable parallel processing one
    would like a primitive barrier operation more efficient than
    the O(n) implementation given above.

Other Features of Java Threads




Other Features
   This lecture isn’t supposed to cover all the details—for those
    you should look at the spec!
   But we mention here a few other features you may find
    useful.




Join Operations
   The Thread API has a family of join() operations. These
    implement another simple but useful form of synchronization,
    by which the current thread can simply wait for another thread
    to terminate, e.g.:
        Thread child = new MyThread() ;
        child.start() ;
        … Do something in current thread …
        child.join() ;    // wait for child thread to finish




Priority and Name
   Threads have properties priority and name, which can be
    set by suitable setter methods, before starting the thread,
    and accessed by getter methods.
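   For example:

        Thread worker = new Thread(new MyRunnable()) ;
        worker.setName("worker-0") ;
        worker.setPriority(Thread.MAX_PRIORITY) ;  // a hint to the scheduler, not a guarantee
        worker.start() ;
        System.out.println(worker.getName()) ;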




Sleeping
   You can cause a thread to sleep for a fixed interval using the
    sleep() methods.
   This operation is distinct from and less powerful than wait().
    It is not possible for another thread to prematurely wake up a
    thread paused using sleep().
    – If you want to sleep for a fixed interval, but allow another thread to
      wake you beforehand if necessary, use the variants of wait() with
      timeouts instead.
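       A minimal sketch (pausableSleep is a hypothetical helper; lock is any
       object shared with the thread that may wake us):

          void pausableSleep(Object lock, long millis) throws InterruptedException {
             synchronized (lock) {     // must hold the lock to call wait()
                lock.wait(millis) ;    // returns after millis ms, or earlier on notify()
             }
          }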




Deprecated Thread Methods
   There is a family of methods of the Thread class that was originally
    supposed to provide external "life-or-death" control over threads.
   These were never reliable, and they are now officially "deprecated". You
    should avoid them.
   If you have a need to interrupt a running thread, you should explicitly write
    the thread in such a way that it listens for interrupt conditions (see the
    next slide).
     – If you want to run an arbitrary thread in such a way that it can be killed and
       cleaned up by an external agent, you probably need to fork a separate
       process to run it.
   The most interesting deprecated methods are:
     –   stop()
     –   destroy()
     –   suspend()
     –   resume()

Interrupting Threads
   Calling the method interrupt() on a thread instance requests cancellation
    of the thread execution.
   It does this in an advisory way: the code for the interrupted thread must be
    written to explicitly test whether it has been interrupted, e.g.:
     public void run() {
        while(!interrupted())
            … do something …
     }
     Here interrupted() is a static method of the Thread class which
     determines whether the current thread has been interrupted.
   If the interrupted thread is executing a blocking operation like wait() or
    sleep(), the operation may end with an InterruptedException. An
    interruptible thread should catch this exception and terminate itself.
   Clearly this mechanism depends wholly on suitable implementation of the
    thread body. The programmer must decide at the outset whether it is
    important that a particular thread be responsive to interrupts—often it
    isn’t.
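   Putting the two together, the run() method of an interrupt-aware Thread
    subclass might look like this (a sketch; doWork() is a hypothetical unit
    of work):

      public void run() {
         try {
            while(!interrupted()) {
               doWork() ;      // hypothetical unit of work
               sleep(100) ;    // blocking call: interrupt() arrives here as an exception
            }
         } catch(InterruptedException e) {
            // interrupted during sleep(): fall through and let run() return
         }
      }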


Thread Groups
   There is a mechanism for organizing threads into groups.
    This may be useful for imposing security restrictions on
    which threads can interrupt other threads, for example.
   Check out the API of the ThreadGroup class if you think this
    may be important for your application.




Thread-Local Variables
   An object of the ThreadLocal class stores a value which has a
    different, local copy in every thread.
   For example, suppose you implemented the MPI message-
    passing interface, mapping each MPI process to a Java thread.
     You decide that the "world communicator" should be a static
    variable of the Comm class. But then how do you get the
    rank() method to return a different process ID for each
    thread, so this kind of thing works?:
        int me = Comm.WORLD.rank() ;
   One approach would be to store the process ID in the
    communicator object in a thread local variable.
    – Another approach would be to use a hash map keyed by
      Thread.currentThread().
   Check the API of the ThreadLocal class for details.
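   A minimal sketch of the thread-local approach (the Comm class and its
    methods are hypothetical, following the MPI example above; each thread
    must call setRank() once during its own initialization):

        public class Comm {
           public static final Comm WORLD = new Comm() ;
           private final ThreadLocal rank = new ThreadLocal() ;
           public void setRank(int r) { rank.set(new Integer(r)) ; }  // store this thread's ID
           public int rank() {
              return ((Integer) rank.get()).intValue() ;  // returns the caller's own ID
           }
        }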

Volatile Variables
   Suppose the value of a variable must be accessible by multiple threads,
    but for some reason you decide you cannot afford the overheads of
    synchronized methods or the synchronized statement.
     – Presumably effects of race conditions have been proved innocuous.
   In general Java does not guarantee that—in the absence of lock operations
    to force synchronization of memory—the value of a variable written by
    one thread will be visible to other threads.
   But if you declare a field to be volatile:
       volatile int myVariable ;
    the JVM is supposed to synchronize the value of any thread-local (cached)
    copy of the variable with central storage (making it visible to all threads)
     every time the variable is updated (typical usage is sketched below).
   The exact semantics of volatile variables, and the Java memory model in
    general, are still controversial; see for example:
     “A New Approach to the Semantics of Multithreaded Java”,
     Jeremy Manson and William Pugh,
     http://www.cs.umd.edu/~pugh/java/memoryModel/
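   The classic use is a completion flag written by one thread and polled by
    another (a sketch):

        class Worker implements Runnable {
           private volatile boolean done = false ;   // visible across threads without locking
           public void shutdown() { done = true ; }  // called from another thread
           public void run() {
              while(!done) {
                 // ... do a unit of work ...
              }
           }
        }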


Thread-based Parallel Applications and
            Benchmarks




Threads on Symmetric Multiprocessors
   Most modern implementations of the Java Virtual Machine
    will map Java threads into native threads of the underlying
    operating system.
    – For example these may be POSIX threads.
   On multiprocessor architectures with shared memory, these
    threads can exploit multiple available processors.
   Hence it is possible to do true parallel programming using
    Java threads within a single JVM.




Select Application Benchmark Results
   We present some results, borrowed from the literature in this
    area.
   Two codes are described in
    "High-Performance Java Codes for Computational Fluid Dynamics",
    C. Riley, S. Chatterjee, and R. Biswas,
    ACM 2001 Java Grande/ISCOPE.
   LAURA is a finite-volume flow solver for multiblock,
    structured grids.
   2D_TAG is a triangular adaptive grid generation tool.




Parallel Speedup of LAURA Code




Parallel Speedup of 2D_TAG Code




Remarks
   LAURA speedups are generally fair, and are outstanding with the Jalapeno
    JVM on PowerPC (unfortunately this is the only JVM that isn't freely
    available).
     – LAURA is "regular", and the parallelization strategy needs little or no locking.
   2D_TAG results are only reported for PowerPC (presumably worse on
    other platforms). This code is very dynamic, very OO, and the naïve
    version uses many synchronized methods, hence poor performance.
     – The two optimized versions cut down the amount of locking by partitioning
       the grid, and only using locks in accesses to edge regions.
   As noted earlier, synchronization in Java is quite expensive.




CartaBlanca
   CartaBlanca, from Los Alamos National Lab, is a general-purpose non-
    linear solver environment for physics computations on unstructured grids.
     "Parallel Operation of CartaBlanca on Shared and Distributed Memory
        Computers"
     N. Padial-Collins, W. VanderHeyden, D. Zhang, E. Dendy, D. Livescu.
     2002 ACM Java Grande/ISCOPE conference.
   It employs an object-oriented component design, and is pure Java.
   Parallel versions are based on partitioned meshes, and run either in
    multithreaded mode on SMPs, or on networks using a suitable high
    performance RMI. Here we are interested in the former case
    (multithreaded approach).
   Shared memory results are for an 8-processor Intel SMP machine with 900
    MHz chips.
     – The following results are from a pre-publication version of the paper cited,
       and should presumably be taken as indicative only.

Broken Dam Problem
        Two immiscible fluids, side by side.
        3D tetrahedral mesh.




Broken Dam Benchmark Results




       Best speedup: 3.65 on 8 processors
Heat Transfer Problem
   Solves transient heat equation on a square domain.




                Best speedup: 4.6 on 8 processors
NAS Parallel Benchmarks
   The NAS parallel benchmark suite contains a set of kernel
    applications based on CFD codes.
   In
     Implementation of the NAS Parallel Benchmarks in Java
     Michael A. Frumkin, Matthew Schultz, Haoqiang Jin, and Jerry Yan
     NAS Technical Report NAS-02-009
    the OpenMP versions of the following codes:
     –   BT, SP, LU: Simulated CFD applications.
     –   FT: Kernel of a 3-D FFT.
     –   MG: Multigrid solver for 3-D Poisson equation.
     –   CG: Conjugate Gradient method to find eigenvalues of a sparse matrix.
    were translated to Java using Java threads.


Timings on IBM p690




Timings for Origin2000




Timings for SUN Enterprise10000




Remarks on NAS Benchmark Results
   In this paper the comparisons with Fortran are not flattering to
    Java (in fact this was the main conclusion of the authors).
    – It seems the SGI Java is particularly slow.
    – Generally speaking Java fares much better on Linux or Windows
      platforms—not represented here.
   Nevertheless, concentrating here on parallel aspects, all cases
    show useful or excellent speedup.
    – These results are for the largest problem size.




Special Topic: JOMP




OpenMP and Java
   OpenMP (www.openmp.org) is a well-known standard for parallel
    programming on shared-memory computers.
   It consists of a series of directives and libraries embedded in a base
    language, typically Fortran or C/C++.
     – The directives include parallel regions, executed by a thread team;
       work-sharing directives, which typically define how iterations of a loop are
       divided between the thread team; and synchronization features like barriers
       and critical regions.
   For scientific programming, the OpenMP paradigm can be considerably
    more concise and expressive than using generic thread libraries.
   Although it doesn't seem to have been widely taken up, there is an
    interesting implementation of OpenMP for Java from Edinburgh Parallel
    Computing Centre—namely, JOMP.
     http://www.epcc.ed.ac.uk/research/jomp/


JOMP Execution Model
   A Java program annotated with JOMP directives is written in
    a source file with filename extension .jomp.
   This file is fed to the JOMP compiler, which outputs a cryptic
    .java file, including calls to a runtime library responsible for
    managing Java threads to handle parallel regions, partitioning
    work-shared loops, doing barrier synchronizations, etc.
    – In the most straightforward case the run-time library is implemented
      on top of the standard Java thread library.
    – The intermediate .java file is compiled by a standard javac.
   One simply executes the resulting class in the usual way, e.g.:
         java -Djomp.threads=10 MyClass




    JOMP Directives
   JOMP directives are syntactically Java comments.
     – Note however that OpenMP does not automatically have a property of HPF and
       related dialects of Fortran, namely that removing directives leaves a sequential
       program with the same semantics.
     – In general OpenMP directives will change the behavior and results of a program.
       So the fact the directives have the syntax of comments is, arguably, slightly
       misleading.
     – It is nevertheless possible, and presumably good practice, to write OpenMP
       programs in such a way that they do execute the same under a sequential compiler.
   In JOMP the typical syntax of a construct is:
         //omp directive
         Java-statement
 where the Java-statement is most often a compound statement, like a for
  statement, or a block in braces, { }.
 Other JOMP directives stand alone, and are not associated with a following
  Java block (e.g. the barrier directive).

JOMP Directives: parallel
   The parallel construct has basic form:
         //omp parallel
         Java-statement
   The effect is to "create a thread team". The original thread
    becomes master of the team. Master and all other team
    members execute the Java-statement, which will nearly always
    be some compound statement.
   There is an implicit barrier synchronization at the end of the
    parallel region.
   There are various optional clauses, e.g.:
         //omp parallel shared(vars1) private(vars2) …
         Java-statement
    Here vars1, vars2 are lists of variable names. These clauses
    control whether the specified variables are shared or private to
    threads within the parallel region.
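   A minimal example (this sketch assumes the JOMP runtime class
    jomp.runtime.OMP, whose getThreadNum() method plays the role of
    OpenMP's omp_get_thread_num()):

         int id ;
         //omp parallel private(id)
         {
             id = jomp.runtime.OMP.getThreadNum() ;
             System.out.println("Hello from thread " + id) ;
         }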
    JOMP Directives: for
   The for construct has basic form:
         //omp for
         for(integer-type var = expr ; test-var ; incr-var)
             Java-statement
   A for construct must appear nested (lexically or "dynamically") inside a
    parallel construct—the iterations are divided amongst the active thread team.
   The form of the enclosed Java for statement is restricted so that the number of
    iterations can be computed before the loop starts (it must be basically
    equivalent to a traditional FORTRAN DO loop).
   There is an implicit barrier synchronization at the end of the for construct; this
    can be disabled by using the nowait clause:
         //omp for nowait …
         Java-for-statement
   The way the iterations are partitioned can be controlled by a schedule clause:
         //omp for schedule(mode, chunk-size) …
         Java-for-statement
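   For example, a sketch of a work-shared loop with a static schedule
    (assuming double arrays a and b and int i are declared in the enclosing
    scope):

         //omp parallel shared(a, b) private(i)
         {
             //omp for schedule(static, 100)
             for(i = 0 ; i < a.length ; i++)
                 a[i] = 2.0 * b[i] ;   // iterations handed out in chunks of 100
         }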


JOMP Directives: master, critical, barrier
   The master construct has the form:
          //omp master
          Java-statement
    This appears inside a parallel construct. The Java-statement is only
    executed by the master thread.
   The critical construct has the form:
          //omp critical [name]
          Java-statement
    This appears inside a parallel construct, and defines a critical region. The
    optional name specifies a lock (if omitted, the default lock is used).
   The barrier directive has the form:
          //omp barrier
    This appears inside a parallel construct. It is treated like an executable
    statement and specifies a barrier synchronization at the point it appears in the
    program.

Other Features
   There are various other directives and clauses in OpenMP and
    JOMP that we didn’t cover, but we gave a flavor.
   It is up to the programmer to ensure that all threads in a thread
    team encounter work-sharing constructs, and other constructs
    that imply barrier synchronization, in the same order.




An Example
   We describe an example from the JOMP release, a CFD code
    taken from the JOMP version of the Java Grande Benchmark
    suite.
   The code Tunnel (presumably short for "Java Wind Tunnel"),
    in the JG JOMP benchmarks folder section3/euler/, solves the
    Euler equations for 2-dimensional inviscid flow.




The Euler Equations (in one slide)
   Euler equations are a family of conservation equations for (in 2D) 4 quantities:
    matter (or mass), 2 components of momentum, and total energy:

        \frac{\partial U}{\partial t} + \frac{\partial f}{\partial x} + \frac{\partial g}{\partial y} = 0

    Flow variables (f, g) are related to the dependent variables U by:

        U = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho E \end{pmatrix}, \qquad
        f = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u H \end{pmatrix}, \qquad
        g = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + p \\ \rho v H \end{pmatrix}

    ρ = density, (u, v) = velocity, E = energy, H = enthalpy, p = pressure.
    With equations of state, one can easily compute (f, g) as a function of U.
Finite Volume Discretization
   A finite volume approach: divide space into a series of
    quadrilaterals, and integrate over a single cell using Gauss'
    theorem:

        \frac{dU}{dt} = -\frac{1}{\Omega} R(U)

    where Ω is cell volume and:

        R(U) = \oint_{\partial\,\mathrm{cell}} (f \, dy - g \, dx)

    Note U and (f, g) are defined in the centers of the cells. The
    face values of (f, g) are computed as averages between
    neighboring cells. So R(U) ends up involving values from
    neighboring cells.
Runge Kutta Integration
   Now we have a large set of coupled ordinary differential
    equations that can be integrated by, e.g., a Runge-Kutta
    scheme. In the variant used here a single integration step
    is a sequence of four stages each of the form:

        U^{(k)} = U^{(0)} - \alpha_k \frac{\Delta t}{\Omega} R(U^{(k-1)}), \qquad k = 1, \ldots, 4

    where the αs are small constants.
   We skipped over some important details (e.g. boundary
    conditions, artificial damping) but the equations above
    summarize the core algorithm implemented by the Tunnel
    program.
Organization
   So the program Tunnel involves a sequence of steps like:
    1. Calculate p, H from current U components (using equations of state
       for fluid).
    2. Calculate f from U, p, H.
    3. Calculate g from U, p, H.
    4. Calculate R from (f, g).
    5. Update U.
   Parallelizing in JOMP is very straightforward.
   The operations above are called from the method doIteration().
    A simplified, schematic form is given on the next slide.



Main Driver
void doIteration() {
   int i, j ;
    //omp parallel private(i, j)
    {
        …
        calculateStateVar(pg1, tg1, ug1) ;
        calculateF(pg1, tg1, ug1) ;
        calculateG(pg1, tg1, ug1) ;
        calculateR() ;
      //omp for
      for(i = 1 ; i < imax ; i++)
         for(j = 1 ; j < jmax ; ++j) {
             ug1[i][j].a = ug[i][j].a - 0.5*deltat/a[i][j]*(r[i][j].a - d[i][j].a) ;
             ug1[i][j].b = ug[i][j].b - 0.5*deltat/a[i][j]*(r[i][j].b - d[i][j].b) ;
             ug1[i][j].c = ug[i][j].c - 0.5*deltat/a[i][j]*(r[i][j].c - d[i][j].c) ;
             ug1[i][j].d = ug[i][j].d - 0.5*deltat/a[i][j]*(r[i][j].d - d[i][j].d) ;
         }
       … Three other similar stages
    }
     …
}

Remarks
   The arrays ug, ug1 hold states of the four components of U
    (in fields a, b, c, d of their elements).
   The arrays pg and tg are pressure and temperature. The array
    a is cell volume.
   The array d holds artificial viscosity, which we didn’t discuss.
   We have shown one work-sharing construct within this
    method—a for construct.
   Much of the real work is in the routines calculateR(), etc,
    which are called inside the parallel construct.




Calculation of R
      private void calculateR() {
          double temp;
          int j;
          //omp for
          for (int i = 1; i < imax; i++)
              for (j = 1; j < jmax; ++j) {
                   …Set fields of r[i][j] to zero…
                  /* East Face */
                  temp = 0.5 * (ynode[i][j] - ynode[i][j-1]);
                  r[i][j].a += temp*(f[i][j].a + f[i+1][j].a);
                  r[i][j].b += temp*(f[i][j].b + f[i+1][j].b);
                  r[i][j].c += temp*(f[i][j].c + f[i+1][j].c);
                  r[i][j].d += temp*(f[i][j].d + f[i+1][j].d);
                  temp = - 0.5*(xnode[i][j] - xnode[i][j-1]);
                  r[i][j].a += temp * (g[i][j].a+g[i+1][j].a);
                  r[i][j].b += temp * (g[i][j].b+g[i+1][j].b);
                  r[i][j].c += temp * (g[i][j].c+g[i+1][j].c);
                  r[i][j].d += temp * (g[i][j].d+g[i+1][j].d);
                  …Add similar contributions of N, S, W faces…
              }
      }
    Remarks
   This is the practical implementation of the mathematical definition
    of R given on the earlier slide about discretization.
    – We average cell-centered flow values because they are formally evaluated
      on the faces of the cell.
    – The arrays xnode, ynode contain coordinates of the vertices of the grid.
   The for directive here is not in the lexical scope of a parallel
    construct. It is called an orphaned directive.
    – It is in the dynamic scope of a parallel construct—i.e. this method is called
      from the body of a parallel construct.
   Those of us conditioned by distributed-memory domain
    decomposition tend to think of the accesses to neighbors as
    communications. Of course this is wrong here. There is no fixed
    association between array elements and threads.
    – Note, on the other hand, there is an implicit thread synchronization
      associated with the barrier finishing every for construct (by default).

General Observations on JOMP
   In the example we borrowed, parallelization of code was very
    easy compared with most other approaches.
   The automatic work-sharing of the for construct is very
    convenient.
     – One limitation is that it only naturally applies to the outer loop of a
       multidimensional loop nest—doesn’t generally support
       multidimensional blocking.
   It relies on a good implementation of the OpenMP
    synchronization primitives. Not obvious a priori that the
    frequent, implicit barriers will translate efficiently to standard
    Java thread operations.
   Lack of implementations:
     – Sources of the prototype JOMP compiler, etc. are not available at the
       EPCC Web site? Will anybody use the software if it's non-standard
       and closed source?

Suggested Exercises




1. Java Thread Synchronization
   Following the pattern of the Semaphore class, complete the
    implementation of a queue with a get() operation that blocks
    when the queue is empty, using wait() and notify().
   Extend this to implement send() and recv() primitives for
    communicating between threads by message-passing.
     – Try implementing other features of the MPI standard for
       communication between threads, e.g. synchronous mode sends (where
       both sender and receiver block until both are ready to communicate).
       Avoid "busy-waiting" or polling solutions.
    – Define an MPI_Comm_Rank-like function using a ThreadLocal
      variable.




2. Experiment with JOMP
   Download and install the JOMP software from
     http://www.epcc.ed.ac.uk/research/jomp/
    (This just involves downloading a .jar file and adding it to
    your CLASSPATH.)
   You may also want to download the JOMP version of the Java
    Grande benchmarks from the same location.
   Implement your favorite parallel algorithm in Java using the
    JOMP implementation of OpenMP.
    – Do you encounter any implementation limits of the compiler or
      runtime?
    – Do you get parallel speedup on multiprocessor machines?
    – Tell us your results!
