					 Prof. Thomas Sterling
 Dr. Hartmut Kaiser
 Department of Computer Science
 Louisiana State University
 March 24th, 2011


HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 4


                                  CSC 7600 Lecture 18: Applied Parallel Algorithms 4
                                                                         Spring 2011   1
                        Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation

Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort

                       Puzzle of the Day
Duff's device: what is going on here?


void copy(char *to, char const *from, int count)
{
   if (count <= 0)                /* the device assumes count > 0 */
      return;
   int n = (count + 3) / 4;       /* number of passes through the loop */
   switch (count % 4) {           /* jump into the unrolled loop body */
   case 0:                        /* 'case' defines jump labels only! */
      do {
         *to++ = *from++;
   case 3:
         *to++ = *from++;
   case 2:                        /* missing 'break' makes code 'fall through' */
         *to++ = *from++;
   case 1:
         *to++ = *from++;
      } while (--n > 0);
   }
}
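
One way to read the puzzle: the switch jumps into the middle of a loop that is unrolled four times, so the count % 4 leftover bytes are copied first and every later pass copies a full group of four. A minimal sketch of an equivalent, more conventional formulation follows; the name copy_unrolled is made up for illustration and does not appear on the original slide.

```c
#include <assert.h>

/* Hypothetical name: the same copy as Duff's device, with the remainder
   handled by a separate loop instead of a jump into the loop body. */
void copy_unrolled(char *to, char const *from, int count)
{
    assert(count >= 0);

    int rem = count % 4;          /* leftover bytes, copied first */
    int groups = count / 4;       /* full groups of four bytes */

    while (rem-- > 0)
        *to++ = *from++;

    while (groups-- > 0) {        /* loop body unrolled four times */
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
    }
}
```

Duff's device folds the two loops into one by letting the case labels act as entry points into the do-while body.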


                Time and Frequency Domain Representation of Signals
• Two ways of looking at the same signal

Example 1: Time and frequency domain representations of a sine wave




    http://robots.freehostia.com/Radio/Image137.gif   http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf

Example 2: Time and frequency domain representations of a 4 Hz + 12 Hz sine wave




    http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf
                                                Fourier Analysis
   • Fourier analysis: Represent continuous functions by
     potentially infinite series of sine and cosine functions




NOTE: The signal sum is composed from sine and cosine functions
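
For reference, the series mentioned above can be written out for a function f(t) with period T. This is the standard textbook form; the symbols a_n, b_n and T are not named on the slide and are introduced here only for illustration.

```latex
f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}
       \left[ a_n \cos\!\left(\frac{2\pi n t}{T}\right)
            + b_n \sin\!\left(\frac{2\pi n t}{T}\right) \right],
\qquad
a_n = \frac{2}{T}\int_0^T f(t)\cos\!\left(\frac{2\pi n t}{T}\right)dt,
\quad
b_n = \frac{2}{T}\int_0^T f(t)\sin\!\left(\frac{2\pi n t}{T}\right)dt
```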
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
http://zone.ni.com/cms/images/devzone/tut/a/8c34be30580.gif
                       Fourier Analysis




Nice demo: http://www.imaios.com/en/e-Courses/e-MRI/image-formation/Fourier-transform




           Fourier Representation of Square Wave




• Spectrum extends to infinity
• As we move from left to right along the frequency axis, the amplitude of the
  components decreases monotonically
http://www.engr.colostate.edu/~dga/mechatronics/figures/4-5.gif
         Fourier Representation of Square Wave
 • Synthesis of a square wave (with zero DC component) from
   its frequency domain components
 • The ideal square wave is represented by the thick black line




http://mathworld.wolfram.com/FourierSeriesSquareWave.html

 Fourier Representation of Square Wave




Nice demo: http://www.imaios.com/en/e-Courses/e-MRI/image-formation/Fourier-transform




                    Digital Signals
• Digital signal: a signal that is both discrete in time and
  quantized in amplitude
• Digital signals can be obtained by sampling analog
  signals
• The figure represents an analog-to-digital converter that
  performs sampling and quantization




                         Digital Signal Processing
    • Processing of digital signals with the help of a computer

Continuous Input -> A/D Converter -> Digital Signal Processing -> D/A Converter -> Continuous Output




http://www.ece.rochester.edu/courses/ECE446/Introduction%20to%20Digital%20Signal%20Processing.pdf
                  Advantages of Digital Signal Processing

 •   Digital system can be simply reprogrammed for other applications / ported
     to different hardware / duplicated (Reconfiguring analog system means
     hardware redesign, testing, verification)

 •   DSP provides better control of accuracy requirements (an analog system
     depends on strict component tolerances, and its response may drift with
     temperature)

 •   Digital signals can be easily stored without deterioration (Analog signals are
     not easily transportable and often can’t be processed off-line)

 •   More sophisticated signal processing algorithms can be implemented
     (Difficult to perform precise mathematical operations in analog form)

Adapted from http://www-sigproc.eng.cam.ac.uk/~op205/3F3_1_Introduction_to_DSP.pdf

              Why use Discrete Fourier Transform?


  • Digital Signal Processing applications often require
    mapping of data in the time domain to its frequency
    domain counterparts

  • Many applications in science and engineering




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                                Example 1
     • Spectrogram of Speech Signal




  NOTE: Spectrogram is a 3D representation of signal amplitude vs
  time and frequency


http://ccrma.stanford.edu/~jos/st/Spectrogram_Speech.html

                                                  Example 2
            • Removing blemishes from a photograph

              DFT is used for converting image data in the spatial (2D) domain to the
              frequency domain before filtering, and for conversion back to the spatial
              domain afterwards.

              To filter an image in the frequency domain:
                    1.     Compute F(u,v), the DFT of the image
                    2.     Multiply F(u,v) by a filter function H(u,v)
                    3.     Compute the inverse DFT of the result

              [Figure: Output of different Gaussian low pass filters for removing blemishes]
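
In symbols, the three filtering steps listed on this slide amount to the relations below. The Gaussian low-pass form of H is the common textbook definition and is assumed here; D(u,v) denotes the distance from the center of the frequency plane and D_0 a cutoff parameter. The slide does not state which cutoff values its example images use.

```latex
\begin{aligned}
F(u,v) &= \mathrm{DFT}\{ f(x,y) \} \\
G(u,v) &= H(u,v)\, F(u,v) \\
g(x,y) &= \mathrm{DFT}^{-1}\{ G(u,v) \} \\
H(u,v) &= e^{-D^2(u,v) / (2 D_0^2)}
\end{aligned}
```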


Adapted from www.comp.dit.ie/bmacnamee/materials/dip/lectures/ImageProcessing7-FrequencyFiltering.ppt
            Discrete Fourier Transform (Qualitative)

   • Discrete Fourier transform: Map a sequence over time to
     another sequence over frequency

           – Signal strength as a function of time 
           – Fourier coefficients as a function of frequency




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                           DFT Example (1/4)
                  16 data points representing signal strength over time




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                           DFT Example (2/4)
                  DFT yields amplitudes and frequencies of sine/cosine functions




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                           DFT Example (3/4)
                  Plot of four constituent sine/cosine functions and their sum




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                           DFT Example (4/4)
                  Continuous function and original 16 samples




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                        Formal Definition of DFT

• DFT of a discrete signal x[n] of N sample points is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\,\omega^{nk}, \qquad \omega = e^{2\pi i/N}, \qquad 0 \le k < N$$

 • Direct implementation of this equation requires on the order of $N^2$
 complex additions and multiplications

 NOTE: DFT of an N point sequence gives N points in the
      transform domain
 http://cas.ensmp.fr/~chaplais/wavetour_presentation/transformees/Fourier/FFTUS.html

                                    Formal Definition of DFT
• Complex plane, relation of different powers of ω for N = 8

[Figure: the eight powers $\omega_8^k = e^{2\pi i k/8}$, $k = 0, \dots, 7$, on the unit circle centered at (0,0) in the complex plane (axes re and im); $\omega_8^0$ lies on the positive real axis and successive powers proceed counterclockwise]

Forward transform:
$$X[k] = \sum_{n=0}^{N-1} x[n]\,\omega^{nk}, \qquad \omega = e^{2\pi i/N}, \qquad 0 \le k < N$$

Inverse transform:
$$x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega^{-nk}, \qquad \omega = e^{2\pi i/N}, \qquad 0 \le n < N$$
                                                Computing DFT

   • Writing the previous definition of DFT in matrix form

   • Matrix-vector product $X = F_n x$
           – x is input vector (signal samples)
           – Each element of $F_n$ is $f_{i,j} = \omega_n^{ij}$ for $0 \le i, j < n$, where $\omega_n$ is the primitive nth root of unity
           – X is output vector (discrete Fourier coefficients)

   NOTE: $\omega_n$ is a complex number defined as $e^{2\pi i/n}$ (here n = N, the number of samples)


Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                                        Example 1

   How to compute the DFT of a vector having two elements?

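   • Example Vector: (2, 3)
   • $\omega_2$, the primitive square root of unity, is -1

   Forward transform:
$$\begin{pmatrix} \omega_2^{0\cdot 0} & \omega_2^{0\cdot 1} \\ \omega_2^{1\cdot 0} & \omega_2^{1\cdot 1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 5 \\ -1 \end{pmatrix}$$

   Inverse transform:
$$\frac{1}{2} \begin{pmatrix} \omega_2^{-0\cdot 0} & \omega_2^{-0\cdot 1} \\ \omega_2^{-1\cdot 0} & \omega_2^{-1\cdot 1} \end{pmatrix} \begin{pmatrix} X_0 \\ X_1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 5 \\ -1 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$$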
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                                                        Example 2

   How to compute the DFT of a vector having four elements?

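   • Example Vector: (1, 2, 4, 3)
   • The primitive 4th root of unity is i

$$\begin{pmatrix} \omega_4^0 & \omega_4^0 & \omega_4^0 & \omega_4^0 \\ \omega_4^0 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ \omega_4^0 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ \omega_4^0 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 4 \\ 3 \end{pmatrix} = \begin{pmatrix} 10 \\ -3 - i \\ 0 \\ -3 + i \end{pmatrix}$$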

Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                  Why Fast Fourier Transform (FFT)?
   •       Reduce the computational operations required

   •       Straightforward implementation: Θ(n^2)
   •       Fast Fourier transform: Θ(n log n)
                - n log n << n^2 for large values of n (for n = 1024: n log2 n = 10,240 versus n^2 = 1,048,576)




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                 Fast Fourier Transform

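Fourier matrix $F_N$ can be decomposed into half-size Fourier matrices $F_{N/2}$:

$$F_N = \begin{pmatrix} I & D_{N/2} \\ I & -D_{N/2} \end{pmatrix} \begin{pmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{pmatrix} P_N, \qquad D_N = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & \omega & 0 & \cdots & 0 \\ 0 & 0 & \omega^2 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \omega^{N-1} \end{pmatrix}$$

 I : Identity matrix
 P : permutation matrix ($P_N$: row reordering, first even rows, then odd)
 D : diagonal matrix of powers of ω; in the factorization of $F_N$ the entries of $D_{N/2}$ are $1, \omega_N, \dots, \omega_N^{N/2-1}$

  Example (N = 4, $\omega_4 = i$):

$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & i & i^2 & i^3 \\ 1 & i^2 & i^4 & i^6 \\ 1 & i^3 & i^6 & i^9 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & i \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -i \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$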


                  Fast Fourier Transform

• Based on divide-and-conquer strategy
• We want to evaluate f(x), a polynomial with n coefficients (degree n - 1,
  where n is a power of 2), at the n complex nth roots of unity

• We define two new functions, $f^{[0]}$ and $f^{[1]}$

$$\begin{aligned}
f(x) &= a_0 + a_1 x + a_2 x^2 + \dots + a_{n-1} x^{n-1} \\
f^{[0]}(x) &= a_0 + a_2 x + a_4 x^2 + \dots + a_{n-2} x^{n/2-1} \\
f^{[1]}(x) &= a_1 + a_3 x + a_5 x^2 + \dots + a_{n-1} x^{n/2-1} \\
f(x) &= f^{[0]}(x^2) + x\, f^{[1]}(x^2)
\end{aligned}$$

  (in the last line, x is replaced by $x^2$ inside $f^{[0]}$ and $f^{[1]}$)
                                        Adapted from slides(and text) of Parallel Programming in C with MPI and
                                        OpenMP by Michael Quinn
                                                    FFT (Cont…)


   • Problem of evaluating f(x) at n values of ω reduces to
           a) Evaluating $f^{[0]}(x)$ and $f^{[1]}(x)$ at n/2 values of ω
                That is, computing f(x) at the points
                    $\omega_n^0, \omega_n^1, \omega_n^2, \dots, \omega_n^{n-1}$
                becomes evaluating $f^{[0]}$ and $f^{[1]}$ at
                    $(\omega_n^0)^2, (\omega_n^1)^2, (\omega_n^2)^2, \dots, (\omega_n^{n/2-1})^2$

           b) Performing $f^{[0]}(x^2) + x\, f^{[1]}(x^2)$

   • Leads to recursive algorithm with time complexity
     Θ(n log n)
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
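
The reduction can be spelled out with two standard identities for roots of unity, $(\omega_n^k)^2 = \omega_{n/2}^k$ and $\omega_n^{k+n/2} = -\omega_n^k$. The derivation below is a sketch added for illustration; T(n) is a symbol introduced here for the running time on n points.

```latex
\begin{aligned}
f(\omega_n^{k})       &= f^{[0]}(\omega_{n/2}^{k}) + \omega_n^{k}\, f^{[1]}(\omega_{n/2}^{k}),
  \qquad 0 \le k < n/2 \\
f(\omega_n^{k+n/2})   &= f^{[0]}(\omega_{n/2}^{k}) - \omega_n^{k}\, f^{[1]}(\omega_{n/2}^{k}) \\
T(n) &= 2\,T(n/2) + \Theta(n) \;\Longrightarrow\; T(n) = \Theta(n \log n)
\end{aligned}
```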
              Recursive Sequential Implementation of FFT
  Recursive_FFT(a, n)

  Parameter   n              Number of elements in a
              a[0…(n-1)]     Coefficients
  Local       ω_n            Primitive nth root of unity
              ω              Evaluate polynomial at this point
              a^[0]          Even numbered coefficients
              a^[1]          Odd numbered coefficients
              y              Result of transform
              y^[0]          Result of FFT of a^[0]
              y^[1]          Result of FFT of a^[1]



Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
         Recursive Sequential Implementation of FFT (Cont…)
if n = 1 then
      return a
else
     ω_n ← e^(2πi/n)
     ω ← 1
     a^[0] ← (a[0], a[2], …, a[n-2])
     a^[1] ← (a[1], a[3], …, a[n-1])
     y^[0] ← Recursive_FFT(a^[0], n/2)
     y^[1] ← Recursive_FFT(a^[1], n/2)
     for k ← 0 to n/2 - 1 do
           y[k] ← y^[0][k] + ω * y^[1][k]
           y[k+n/2] ← y^[0][k] - ω * y^[1][k]
           ω ← ω * ω_n
     end for
     return y
endif
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
                  Iterative Implementation Preferable

               • A well-written iterative version performs fewer index
                 computations than the recursive version
               • The iterative version evaluates the key common
                 sub-expression only once
               • It is easier to derive a parallel FFT algorithm when
                 the sequential algorithm is in iterative form




Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
                       Recursive → Iterative (1/4)
We now discuss the derivation of an iterative algorithm starting with the recursive one

• Each rounded rectangle indicates an fft function call
• The function goes on dividing the vector in half until a scalar is obtained
  (NOTE: the DFT of a scalar is the scalar itself)
• The values returned as the result of each function call are indicated on the curved arrows

Recursive implementation of FFT for the input sequence (1,2,4,3) is shown below

fft(1,2,4,3) returns (10,-3-i,0,-3+i)
    fft(1,4) returns (5,-3)
        fft(1) returns (1)
        fft(4) returns (4)
    fft(2,3) returns (5,-1)
        fft(2) returns (2)
        fft(3) returns (3)

Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
           Recursive → Iterative (2/4)
• Determining which computations are performed for each function invocation

• For each rounded rectangle, the computation is of the form
       x+y(z)
       x-y(z)
     which corresponds to the following statements of the recursive algorithm
       y[k] ← y[0][k] + ω * y[1][k]
       y[k+n/2] ← y[0][k] - ω * y[1][k]
                                                     Adapted from slides(and text) of Parallel
                                                     Programming in C with MPI and
                                                     OpenMP by Michael Quinn

   fft(1,2,4,3):          5+1(5)   -3+i(-1)   5-1(5)   -3-i(-1)
   fft(1,4) = (5, -3):    1+1(4)   1-1(4)
   fft(2,3) = (5, -1):    2+1(3)   2-1(3)
   Scalars:               1   4   2   3

                                     Recursive → Iterative (3/4)
             • This diagram tracks the propagation of data values (input vector at the bottom
             and FFT output at the top)
             • Permutation stage: Index i of the input vector is replaced by rev(i), where
             rev(i) is the binary value of i read in the reverse order (00=>00, 01=>10,
             10=>01, 11=>11)

             FFT output:        10          -3-i          0           -3+i
             Computation:       5+1*5       -3+i*(-1)     5-1*5       -3-i*(-1)
             Values:            5           -3            5           -1
             Computation:       1+1*4       1-1*4         2+1*3       2-1*3
             Values:            1           4             2           3
             Permuted input:    1           4             2           3
             Input vector:      1           2             4           3

Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
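
The permutation stage can be written down directly from the definition of rev(i). The sketch below is illustrative only: the names reverse_bits and bit_reverse_permute are hypothetical, and n is assumed to be a power of two.

```c
#include <assert.h>

/* rev(i): the lowest `bits` bits of i, read in reverse order. */
unsigned reverse_bits(unsigned i, int bits)
{
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* append the current low bit of i */
        i >>= 1;
    }
    return r;
}

/* Permutation stage: element i trades places with element rev(i).
   Assumption: n is a power of two. */
void bit_reverse_permute(double *a, int n)
{
    assert(n >= 1 && (n & (n - 1)) == 0);
    int bits = 0;
    while ((1 << bits) < n)
        bits++;                    /* bits = log2(n) */

    for (unsigned i = 0; i < (unsigned)n; i++) {
        unsigned r = reverse_bits(i, bits);
        if (i < r) {               /* swap each pair only once */
            double tmp = a[i];
            a[i] = a[r];
            a[r] = tmp;
        }
    }
}
```

For n=4 this exchanges positions 01 and 10 and leaves 00 and 11 in place, turning the input (1,2,4,3) into (1,4,2,3) as in the diagram.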
            Recursive → Iterative (4/4)
• Initially, the scalars are simply forwarded upwards as the DFT
  of a scalar is the scalar itself
• For other stages, computation of the output is performed
  using two values forwarded from the previous stage
• The arrows depicting data flow form butterfly patterns


• An iterative algorithm can be deduced from the previous
  diagram
• The computation represented in each row (excluding the
  bottommost row) corresponds to one iteration of the algorithm
• Hence log(n) iterations should be performed (log(4)=2 in the
  previous example)
• For each iteration the algorithm modifies the value of every
  index (here n indices)
                                          Adapted from slides(and text) of Parallel
                                          Programming in C with MPI and
                                          OpenMP by Michael Quinn

                   Stages of Parallel Program Design
      • Partition
             – Divide problem into tasks
      • Communicate
             – Determine amount and pattern of
               communication
      • Agglomerate
             – Combine tasks
      • Map
             – Assign agglomerated tasks to
               processors
      • Efficiency analysis

Adapted from http://nereida.deioc.ull.es/html/openmp/minnesotatutorial/content_openMP.html

           Parallel FFT Program Design
• Domain decomposition
   – Associate primitive task with each element of input vector a and
     corresponding element of output vector y


• Add channels to handle communications between tasks




                                             Adapted from slides(and text) of Parallel
                                             Programming in C with MPI and
                                             OpenMP by Michael Quinn


               FFT Task/Channel Graph (n=8)



•Long rounded rectangles represent
tasks and arrows indicate
communication between processes




                                          Adapted from slides(and text) of Parallel
                                          Programming in C with MPI and
                                          OpenMP by Michael Quinn
               FFT Task/Channel Graph (n=8) Cont…
Steps:

• Permute vector as follows
  (000=>000, 001=>100, …, 110=>011, 111=>111)

• Perform log(n) iterations (log(8)=3)
  - stage 1 completed after iteration 1
  - stage 2 completed after iteration 2
  - stage 3 completed after iteration 3
    (Vector y after stage 3 gives the output)

NOTE: Vector y will contain the intermediate results of stage 1 and stage 2

Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
           Diagrammatic Representation of
                  Profiling Results
Conventions:

    C    represents a function compute(args) that accepts the propagated
values and performs the following computation (refer slide 33)
              x+y(z)
              x-y(z)

   S   represents the MPI_Send(args) command

   R   represents the MPI_Recv(args) command

   P   represents the function permute(args) which is basically
        permute(args)
        {
        ………
        MPI_Send(args)
        ………
        }

           represents the time for which the process is idle
                         http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/images/petsc/
                   Diagrammatic Representation of
                          Profiling Results
        Permutation     Stage 1      Stage 2      Stage 3      Output
           Phase
P0                      S  R  C      S  R  C      S  R  C      y[0]
P1         P  R         R  S  C      S  R  C      S  R  C      y[1]
P2                      S  R  C      R  S  C      S  R  C      y[2]
P3         P  R         R  S  C      R  S  C      S  R  C      y[3]
P4         R  P         S  R  C      S  R  C      R  S  C      y[4]
P5                      R  S  C      S  R  C      R  S  C      y[5]
P6         R  P         S  R  C      R  S  C      R  S  C      y[6]
P7                      R  S  C      R  S  C      R  S  C      y[7]

NOTE: The diagram is oversimplified to enhance understanding of butterfly diagram

          Agglomeration and Mapping

• Agglomerate primitive tasks associated with contiguous
  elements of vector to reduce communication
• Map one agglomerated task to each process




                                      Adapted from slides(and text) of Parallel
                                      Programming in C with MPI and
                                      OpenMP by Michael Quinn

         After Agglomeration, Mapping
                         In general, an n point FFT can be
                         implemented on a multicomputer
                         supporting p processes

                         In this case, n=16 and p=4.
                             a[0], a[1], a[2], a[3] → process 1
                             a[4], a[5], a[6], a[7] → process 2
                             and so on

                             Adapted from slides(and text) of Parallel
                             Programming in C with MPI and
                             OpenMP by Michael Quinn


       Phases of Parallel FFT Algorithm
• Phase 1: Processes permute a’s (all-to-all
  communication)
• Phase 2:
   – First log n – log p iterations of FFT
   – No message passing is required
• Phase 3:
   – Final log p iterations
   – Processes organized as logical hypercube
   – In each iteration every process swaps values with
     partner across a hypercube dimension
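
The split between Phase 2 and Phase 3 can be made concrete with a little index arithmetic. The sketch below is an illustration under stated assumptions, not code from the lecture: the function names are hypothetical, ranks are numbered from 0 (the slides count processes from 1), n and p are powers of two with p <= n, and contiguous blocks of n/p elements are assigned to each process. The actual exchange of values would use MPI send and receive calls, which are left out here.

```c
#include <assert.h>

/* log2 of a power of two (hypothetical helper). */
static int log2_exact(int x)
{
    assert(x > 0 && (x & (x - 1)) == 0);
    int l = 0;
    while ((1 << l) < x)
        l++;
    return l;
}

/* Rank (0-based) of the process that holds element i under a block mapping. */
int owner_of(int i, int n, int p)
{
    assert(p > 0 && n % p == 0 && i >= 0 && i < n);
    return i / (n / p);
}

/* Phase 2: number of iterations that need no message passing. */
int local_iterations(int n, int p)
{
    return log2_exact(n) - log2_exact(p);
}

/* Partner of `rank` in iteration `iter` (1-based, 1..log n).
   During Phase 2 both butterfly inputs are local, so the rank itself is
   returned. During Phase 3 the partner differs in exactly one bit of the
   rank, i.e. it is the neighbour across one hypercube dimension. */
int partner_rank(int rank, int iter, int n, int p)
{
    int local = local_iterations(n, p);
    assert(iter >= 1 && iter <= log2_exact(n));
    if (iter <= local)
        return rank;
    return rank ^ (1 << (iter - local - 1));
}
```

With n=16 and p=4, iterations 1 and 2 are local, iteration 3 pairs rank r with r XOR 1, and iteration 4 pairs rank r with r XOR 2.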

                                       Adapted from slides(and text) of Parallel
                                       Programming in C with MPI and
                                       OpenMP by Michael Quinn


       Computation Complexity Analysis
• Each process performs equal share of computation
   – Sequential complexity: Θ(n log n)
• Hence the complexity of parallel implementation is
                 Θ(n log n / p)




                                              Adapted from slides(and text) of Parallel
                                              Programming in C with MPI and
                                              OpenMP by Michael Quinn


       Communication Complexity Analysis

• A maximum of ceil(n / p) elements of the vector are
  associated with each process
• In the all-to-all communication stage, every process
  swaps about n/p values with its counterpart
    – Time complexity: Θ((n/p) log p)
• A total of log p iterations need communication with
  other processes (on average n/p swaps each)
    – Time complexity: Θ((n/p) log p)
• Hence the total communication complexity of the parallel
  implementation is
                    Θ((n/p) log p)
                                                    Adapted from slides(and text) of Parallel
                                                    Programming in C with MPI and
                                                    OpenMP by Michael Quinn

                      Parallel Sorting
• Finding a permutation of a sequence [a1, a2, ..., an], such
  that a1 <= a2 <= … <= an
• Often we sort records based on key
• Parallel sort results in:
   – Partial sequences are sorted on all nodes
   – Largest value on node N-1 is smaller than or equal to the smallest value
     on node N
• Several ways to parallelize
   – Chunk sequence, sort locally, merge back (bubblesort)
   – Project algorithm structure onto communication and distribution
     scheme (quicksort)



                                 Bubble Sort
•  The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the
   slowest.
• The bubble sort works by comparing each item in the list with the item next to it,
   and swapping them if required.
• The algorithm repeats this process until it makes a pass all the way through the
   list without swapping any items (in other words, all items are in the correct order).
• This causes larger values to "bubble" to the end of the list while smaller values
   "sink" towards the beginning of the list.
The bubble sort is generally considered to be the most inefficient sorting algorithm in
   common usage. Under best-case conditions (the list is already sorted), the bubble
   sort can approach a linear O(n) level of complexity. The general case is O(n^2).
Pros: Simplicity and ease of implementation.
Cons: Extremely inefficient.
Reference
    http://math.hws.edu/TMCM/java/xSortLab/
Source
    http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/bubblesort.c

                                                        http://www.sci.hkbu.edu.hk
                           Bubblesort
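void sort(int *v, int n)
{
   int i, j, tmp;
   /* After each outer pass, the largest remaining value sits at v[i+1] */
   for(i = n-2; i >= 0; i--)
        for(j = 0; j <= i; j++)
                if(v[j] > v[j+1]) {
                         /* Swap by hand: a swap() function taking its
                            arguments by value could not modify v */
                         tmp = v[j];
                         v[j] = v[j+1];
                         v[j+1] = tmp;
                }
}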
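
The routine above always performs every pass. The variant described in the text, which stops after a pass that swaps nothing, might look like the following sketch. The function name and the returned pass count are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bubble sort that stops as soon as a full pass makes no swap.
   Returns the number of passes performed. */
int bubble_sort_early_exit(int *v, int n)
{
    assert(v != NULL || n <= 0);
    int passes = 0;
    bool swapped = true;

    /* `end` is the last index a pass still has to consider */
    for (int end = n - 1; end > 0 && swapped; end--) {
        swapped = false;
        for (int j = 0; j < end; j++) {
            if (v[j] > v[j + 1]) {
                int tmp = v[j];
                v[j] = v[j + 1];
                v[j + 1] = tmp;
                swapped = true;
            }
        }
        passes++;
    }
    return passes;
}
```

On an already sorted list the first pass swaps nothing, so the loop ends after a single pass of n-1 comparisons, which is the O(n) best case.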




                              Discussion

•   Bubble sort takes time proportional to N*N/2 for N data items
•   This parallelization splits the N data items into chunks of N/P, so the time on
    each of the P processors is now proportional to ((N/P)*(N/P))/2
     – i.e. the sorting time is reduced by a factor of P*P!
•   Bubble sort is much slower than quick sort!
     – Better to run quick sort on a single processor than bubble sort on many processors!
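
The "chunk, sort locally, merge back" scheme can be imitated on a single processor to check the comparison counts. The sketch below is a sequential stand-in, not the MPI program from the referenced source: all names are hypothetical, P is assumed to divide N evenly, and the chunks are processed one after another rather than in parallel. Note that the P*P estimate covers only the local sorting; the merge step adds work that the estimate does not count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Bubble sort v[0..n-1]; returns the number of comparisons performed. */
static long bubble_sort_count(int *v, int n)
{
    long comparisons = 0;
    for (int i = n - 2; i >= 0; i--) {
        for (int j = 0; j <= i; j++) {
            comparisons++;
            if (v[j] > v[j + 1]) {
                int tmp = v[j];
                v[j] = v[j + 1];
                v[j + 1] = tmp;
            }
        }
    }
    return comparisons;
}

/* Merge the sorted runs v[0..mid-1] and v[mid..n-1] via a scratch buffer. */
static void merge_runs(int *v, int mid, int n, int *scratch)
{
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n)
        scratch[k++] = (v[i] <= v[j]) ? v[i++] : v[j++];
    while (i < mid)
        scratch[k++] = v[i++];
    while (j < n)
        scratch[k++] = v[j++];
    memcpy(v, scratch, (size_t)n * sizeof *v);
}

/* Split v into p equal chunks, bubble-sort each chunk (one chunk per
   "process"), then merge the sorted chunks one after another.
   Returns the largest per-chunk comparison count. */
long chunked_bubble_sort(int *v, int n, int p)
{
    assert(n > 0 && p > 0 && n % p == 0);   /* assumption: p divides n */
    int chunk = n / p;
    long worst = 0;
    int *scratch = malloc((size_t)n * sizeof *scratch);
    assert(scratch != NULL);

    for (int r = 0; r < p; r++) {
        long c = bubble_sort_count(v + r * chunk, chunk);
        if (c > worst)
            worst = c;
    }
    /* After merging chunk r, v[0..(r+1)*chunk-1] is one sorted run */
    for (int r = 1; r < p; r++)
        merge_runs(v, r * chunk, (r + 1) * chunk, scratch);

    free(scratch);
    return worst;
}
```

For N=8 a single bubble sort makes 8*7/2 = 28 comparisons, while with P=2 each chunk of 4 items needs only 4*3/2 = 6.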




                                                        http://www.sci.hkbu.edu.hk
                                  Merge Sort
•   The merge sort splits the list to be sorted into two equal halves, and places them in
    separate arrays.
•   Each array is recursively sorted, and then merged back together to form the final
    sorted list.
•   Like most recursive sorts, the merge sort has an algorithmic complexity of O(n log n).
•   Elementary implementations of the merge sort make use of three arrays - one for
    each half of the data set and one to store the sorted list in. The below algorithm
    merges the arrays in-place, so only two arrays are required. There are non-recursive
    versions of the merge sort, but they don't yield any significant performance
    enhancement over the recursive algorithm on most machines.

Pros: Marginally faster than the heap sort for larger sets.
Cons: At least twice the memory requirements of the other sorts; recursive.

Reference
http://math.hws.edu/TMCM/java/xSortLab/
Source
http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/mergesort.c

                              Merge Sort




[cdekate@celeritas sort]$ mpiexec -np 4 ./mergesort
1000000; 4 processors; 0.250000 secs
[cdekate@celeritas sort]$

                              Mergesort
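#include <stdlib.h>
#include <string.h>

/* Assumed helper (not shown on the slide): merges the sorted runs
   A[0..asize-1] and B[0..bsize-1] through a scratch array C and copies
   the result back to A. Requires B to start directly after A in memory. */
int *merge(int *A, int asize, int *B, int bsize)
{
    int i = 0, j = 0, k = 0;
    int *C = malloc((asize + bsize) * sizeof *C);

    if (C == NULL) abort();
    while (i < asize && j < bsize)
        C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
    while (i < asize) C[k++] = A[i++];
    while (j < bsize) C[k++] = B[j++];
    memcpy(A, C, (asize + bsize) * sizeof *C);
    free(C);
    return A;
}

void msort(int *A, int min, int max)
{
   int mid = min + (max - min) / 2;   /* avoids overflow of min+max */
   int lowerCount = mid - min + 1;
   int upperCount = max - mid;

    /* If the range consists of at most one element, it's already sorted */
    if (max <= min)
         return;
    /* Otherwise, sort the first half */
    msort(A, min, mid);
    /* Now sort the second half */
    msort(A, mid+1, max);
    /* Now merge the two halves */
    merge(A + min, lowerCount, A + mid + 1, upperCount);
}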




                                   Heap Sort
• The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge
  and quick sorts it doesn't require massive recursion or multiple arrays to work. This
  makes it the most attractive option for very large data sets of millions of items.
• The heap sort works as its name suggests
    1.   It begins by building a heap out of the data set.
    2.   It then removes the largest item and places it at the end of the sorted array.
    3.   After removing the largest item, it reconstructs the heap, removes the largest remaining
         item, and places it in the next open position from the end of the sorted array.
    4.   This is repeated until there are no items left in the heap and the sorted array is full.
• Elementary implementations require two arrays - one to hold the heap and the other to hold
  the sorted elements.
• To save the space the second array would require, an in-place implementation
   "cheats" by using the same array to store both the heap and the
   sorted array. Whenever an item is removed from the heap, it frees up a space at
   the end of the array that the removed item can be placed in.
Pros: In-place and non-recursive, making it a good choice for extremely large data
  sets.
Cons: Slower than the merge and quick sorts.
Reference
    http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/heapsort.html
Source
    http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/heapsort/heapsort.c
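
The in-place scheme described on this slide can be sketched as follows. This is an illustrative version, not the heapsort.c file listed under Source; the names sift_down and heap_sort are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at `root`,
   looking only at indices 0..end (inclusive). */
static void sift_down(int *v, int root, int end)
{
    for (;;) {
        int child = 2 * root + 1;            /* left child */
        if (child > end)
            break;                           /* root is a leaf */
        if (child + 1 <= end && v[child + 1] > v[child])
            child++;                         /* pick the larger child */
        if (v[root] >= v[child])
            break;                           /* heap property holds */
        int tmp = v[root];
        v[root] = v[child];
        v[child] = tmp;
        root = child;
    }
}

/* In-place, non-recursive heap sort in ascending order. */
void heap_sort(int *v, int n)
{
    assert(v != NULL || n <= 0);

    /* Step 1: build a max-heap out of the data set */
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(v, i, n - 1);

    /* Steps 2-4: move the largest item into the slot freed at the end
       of the heap, then rebuild the heap on the remaining items */
    for (int end = n - 1; end > 0; end--) {
        int tmp = v[0];
        v[0] = v[end];
        v[end] = tmp;
        sift_down(v, 0, end - 1);
    }
}
```

The array holds the heap in its front part and the sorted items in its back part, so no second array is needed.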
                                  Quick Sort
•   The quick sort is an in-place, divide-and-conquer, massively recursive sort.
•   Divide and Conquer Algorithms
     –   Algorithms that solve (conquer) problems by dividing them into smaller sub-
         problems until the problem is so small that it is trivially solved.
•   In Place
     –   In place sorting algorithms don't require additional temporary space to store
         elements as they sort; they use the space originally occupied by the elements.
•   Quicksort takes time proportional to N*N for N data items in the worst case, but
    usually N*log2N, which is much faster
     –   for 1,000,000 items, N*log2N ~ 1,000,000*20
•   Constant communication cost – 2*N data items
     –   for 1,000,000 items, must send/receive 2*1,000,000 from/to root
•   In general, processing/communication is proportional to (N*log2N)/(2*N) = (log2N)/2
     –   so for 1,000,000 items, only 20/2 = 10 times as much processing as communication
•   Suggests that this parallelization can only achieve speedup for very large N

Reference
     http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html
Source
     http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/qsort/qsort.c
                                                          http://www.sci.hkbu.edu.hk
                                Quick Sort
• The recursive algorithm consists of four steps (which closely resemble the
  merge sort):
     1. If there is one element or fewer in the array to be sorted, return immediately.
     2. Pick an element in the array to serve as a "pivot" point. (Usually the left-most
        element in the array is used.)
     3. Split the array into two parts - one with elements larger than the pivot and the
        other with elements smaller than the pivot.
     4. Recursively repeat the algorithm for both parts of the original array.
•   The efficiency of the algorithm is strongly affected by which element is
    chosen as the pivot point.
•   The worst-case efficiency of the quick sort, O(n^2), occurs when the list is
    already sorted and the left-most element is chosen as the pivot.
•   If the data to be sorted isn't random, randomly choosing a pivot point is
    recommended. As long as the pivot point is chosen randomly, the quick sort
    has an expected algorithmic complexity of O(n log n).

Pros: Extremely fast.
Cons: Very complex algorithm, massively recursive
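
The four steps, with the random pivot choice recommended above, can be sketched as follows. This is an illustrative sequential version, not the qsort.c file referenced earlier: the function names are hypothetical, and the partitioning shown is the Lomuto scheme, chosen here for brevity.

```c
#include <assert.h>
#include <stdlib.h>

static void swap_ints(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Sort v[lo..hi] (inclusive bounds). */
static void quick_sort_range(int *v, int lo, int hi)
{
    /* Step 1: one element or fewer, nothing to do */
    if (lo >= hi)
        return;

    /* Step 2: pick a random pivot and park it at the right end */
    int pivot_index = lo + rand() % (hi - lo + 1);
    swap_ints(&v[pivot_index], &v[hi]);
    int pivot = v[hi];

    /* Step 3: move elements smaller than the pivot to the front */
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (v[i] < pivot) {
            swap_ints(&v[i], &v[store]);
            store++;
        }
    }
    swap_ints(&v[store], &v[hi]);        /* pivot lands in its final slot */

    /* Step 4: recurse on both parts */
    quick_sort_range(v, lo, store - 1);
    quick_sort_range(v, store + 1, hi);
}

/* In-place quick sort of v[0..n-1] in ascending order. */
void quick_sort(int *v, int n)
{
    assert(v != NULL || n <= 0);
    if (n > 1)
        quick_sort_range(v, 0, n - 1);
}
```

Because the pivot is drawn at random, an already sorted input no longer triggers the worst case that the left-most pivot would.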
                                                       http://www.sci.hkbu.edu.hk
        Summary : Material for the Test


• Discrete Fourier Transform:   Slides 24-26
• Fast Fourier Transform (FFT): Slides 30-40
• Parallel FFT:                 Slides 41-52



