High-Performance Matrix Computations

Prof. Paolo Bientinesi
pauldj@aices.rwth-aachen.de
May 15, 2011
History of Dense Linear Algebra Libraries

  1971        NAG              Linear Algebra & more
  1972        EISPACK          Eigenproblems
  1973        BLAS-1           Vector operations
  1978        LINPACK          Linear systems, least-squares
  1988        BLAS-2           Matrix-vector operations
  1990        BLAS-3           Matrix-matrix operations
  1992        LAPACK           EISPACK + LINPACK
  1993        ScaLAPACK        Distributed memory LAPACK
  1997        PLAPACK          Environment for distributed memory
  2001        FLAME            High-level abstraction; shared & distr. mem.
  2006        PLASMA, CellSs   Multicores


Memory Hierarchy




BLAS: Basic Linear Algebra Subprograms

 Linear Algebra operations decomposed into simpler operations.


                     BLAS-1:      y := y + αx                x, y ∈ R^n
                                dot := α + x^T y
                     BLAS-2:      y := y + Ax                A ∈ R^(n×n),  x, y ∈ R^n
                                  y := A^(-1) x
                     BLAS-3:      C := C + AB                A, B, C ∈ R^(n×n)
                                  C := A^(-1) B


                     BLAS       #FLOPS     Mem. refs.     Ratio
                     Level 1      2n          3n           2/3
                     Level 2     2n^2         n^2           2
                     Level 3     2n^3        4n^2          n/2
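To make the three levels concrete, here is a minimal C sketch calling one representative routine per level through the CBLAS interface (assumes a CBLAS-providing library, e.g. GotoBLAS or ATLAS, is linked; the 3x3 sizes and values are arbitrary):

    #include <cblas.h>

    int main(void) {
        double alpha = 2.0;
        double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
        double A[9] = {0}, B[9] = {0}, C[9] = {0};   /* 3x3, column-major */

        /* BLAS-1: y := y + alpha*x   (daxpy: O(n) flops, O(n) data) */
        cblas_daxpy(3, alpha, x, 1, y, 1);

        /* BLAS-2: y := A*x + y       (dgemv: O(n^2) flops, O(n^2) data) */
        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 1.0, y, 1);

        /* BLAS-3: C := A*B + C       (dgemm: O(n^3) flops, O(n^2) data) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    3, 3, 3, 1.0, A, 3, B, 3, 1.0, C, 3);
        return 0;
    }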
References

   BLAS
           GotoBLAS:
           http://web.tacc.utexas.edu/resources/software/
           “Automated Empirical Optimization of Software and the ATLAS
           Project” by Clint Whaley, Antoine Petitet and Jack Dongarra.
           Parallel Computing, 27, 2001.
           “Is Search Really Necessary to Generate High-Performance
           BLAS?” by Yotov, Li, Ren, Garzarán, Padua, Pingali and
           Stodghill. Proceedings of the IEEE, 93(2), 2005.
           “Anatomy of High-Performance Matrix Multiplication”
           by Kazushige Goto and Robert van de Geijn.
           ACM Transactions on Mathematical Software, 34 (3), 2008.
           “High-Performance Implementation of the Level-3 BLAS”
           by Kazushige Goto and Robert van de Geijn.
           ACM Transactions on Mathematical Software, 35 (1), 2008.

References




         LAPACK
                 www.netlib.org/lapack/
                 Working Notes:    www.netlib.org/lapack/lawns
                 “LAPACK Users’ Guide” (third ed.),
                 Society for Industrial and Applied Mathematics, 1999.




Performance



                                       Quality of an algorithm


                              Metrics?
                                 Execution time
                                 Memory usage
                                 Memory/network accesses
                                 Accuracy
                                 Size
                                 ...




Execution time



                          It measures how fast an algorithm is

                          It does not measure the memory usage

                          It might depend on the input

                          It depends on the processor

                          It does not measure how well the algorithm
                          takes advantage of a given architecture




The quality of an algorithm should be related
to the potential of the architecture
“Potential of an architecture”?

      Theoretical Peak Performance
         = #cores × frequency × #flops/cycle

              What is the frequency?  The clock rate, e.g. in GHz.
              What is a flop?  A floating-point operation.
              What is a cycle?  One clock period; it determines how often
              the processor can initiate new operations.
              #flops/cycle = the number of flops that can be initiated in one cycle.

      In practice, the theoretical peak performance is unattainable.
      Why?

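As a worked example of the formula above, a small C sketch with made-up hardware parameters (4 cores, 3 GHz, 4 flops/cycle; these are assumptions, not the figures of a specific processor):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical machine: all values are assumptions. */
        int    cores           = 4;
        double frequency_ghz   = 3.0;   /* clock rate in GHz          */
        int    flops_per_cycle = 4;     /* e.g. 2-wide SIMD add + mul */

        double peak_gflops = cores * frequency_ghz * flops_per_cycle;
        printf("Theoretical peak: %.1f GFLOPS\n", peak_gflops);   /* 48.0 */
        return 0;
    }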
CPU vs. Memory




  For the CPU to execute an operation,
  data needs to be in registers

  Cache misses:  L1, L2, L3, TLB, instruction misses, ...

  Cache line:  the unit of data transferred between memory and the caches




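A small C sketch of why cache lines matter: the same sum over a 2D array, once with unit-stride accesses and once with stride-N accesses. The array size is arbitrary and the measured effect depends on the cache hierarchy of the machine at hand:

    #include <stdio.h>
    #define N 2048

    static double A[N][N];   /* row-major in C: A[i][j] and A[i][j+1] are adjacent */

    int main(void) {
        double s1 = 0.0, s2 = 0.0;

        /* Unit stride: consecutive accesses fall in the same cache line. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s1 += A[i][j];

        /* Stride-N: consecutive accesses typically touch a new cache line
           (and stress the TLB), causing far more misses.                 */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s2 += A[i][j];

        printf("%f %f\n", s1, s2);
        return 0;
    }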
Back to performance




   Theoretical peak performance → unattainable

   Practical peak performance? (practical = attainable)

   DGEMM (BLAS)
   Double (precision) GEneral Matrix Matrix multiply

   Peak performance ≡ DGEMM

   How to compute DGEMM’s performance?




Performance


                                  Perf = #ops / exec. time


                               What is #ops?
                               What is #ops in your algorithm?
                               What if the algorithm is iterative?
                               (“iterative algorithm” = “loop-based”)


                                  Efficiency = Perf / Peak Perf,        0 ≤ Efficiency ≤ 1



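A minimal timing sketch for DGEMM in C (assumes a linked CBLAS library and POSIX gettimeofday; 2n^3 is the flop count of dgemm, and the peak value plugged in below is a made-up number standing for whatever the peak-performance formula gives on the machine at hand):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <cblas.h>

    int main(void) {
        int n = 2000;                                   /* arbitrary test size */
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);
        gettimeofday(&t1, NULL);

        double secs   = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
        double gflops = 2.0 * n * n * n / secs / 1e9;   /* Perf = #ops / exec. time */
        double peak   = 48.0;                           /* assumed peak, in GFLOPS  */

        printf("dgemm: %.2f GFLOPS, efficiency %.2f\n", gflops, gflops / peak);
        free(A); free(B); free(C);
        return 0;
    }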
BLAS: Performance
Multithreaded DGEMM




   [Figure: performance of DGEMM on an Intel Xeon 5450 (nb = 192);
    efficiency (%) vs. matrix size (0–2500), for 1, 2, 4, and 8 threads.]
Time vs. Performance



                          Can you cheat time?

                          Can you cheat performance?

                          “fast” flops vs. “slow” flops

                           5n log n  vs.  2n^2


                                           USE BOTH!




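A small sketch of this trade-off: compare the two flop counts and the resulting times under made-up per-flop rates (1 GFLOPS for the "slow" flops of the n log n algorithm, 10 GFLOPS for the "fast" flops of the n^2 one; both rates are assumptions). The point is that an algorithm with fewer flops can still lose if its flops run at a lower rate:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Hypothetical sustained rates, purely for illustration. */
        double slow = 1e9, fast = 10e9;

        for (double n = 1e2; n <= 1e6; n *= 10) {
            double ops1 = 5.0 * n * log2(n);   /* fewer flops, "slow" flops */
            double ops2 = 2.0 * n * n;         /* more flops, "fast" flops  */
            printf("n=%8.0f   %10.2e ops vs %10.2e ops  |  %8.2e s vs %8.2e s\n",
                   n, ops1, ops2, ops1 / slow, ops2 / fast);
        }
        return 0;
    }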
Parallel Performance

                    p processors
                    Total performance? Execution time?

                     T_p(n) = execution time for a problem of size n

                     Speedup = T_1(n) / T_p(n)         Efficiency = Speedup / p
                     T_1(n) = ?      0 ≤ Speedup ≤ p,      0 ≤ Efficiency ≤ 1

                     Strong Scalability:  T_p(n) / T_2p(n), with p increasing
                     (fixed problem, increasing number of processors)

                     Weak Scalability:  T_p(n) / T_kp(2n), with p increasing
                     (fixed memory use per processor,
                      increasing number of processors)
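A worked sketch of the definitions above, with made-up timings for one fixed problem size:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical measurements T_p(n) for a fixed n. */
        int    procs[]  = {1, 2, 4, 8};
        double time_s[] = {100.0, 52.0, 28.0, 16.0};

        for (int i = 0; i < 4; i++) {
            double speedup    = time_s[0] / time_s[i];   /* T_1(n) / T_p(n) */
            double efficiency = speedup / procs[i];      /* Speedup / p     */
            printf("p=%d  speedup=%.2f  efficiency=%.2f\n",
                   procs[i], speedup, efficiency);
        }
        return 0;
    }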
Parallel Matrix Distribution
How to store a (large) matrix using p = r × c processors?

                        bad ideas
                                1D: by rows
                                1D: by columns
                                2D: r × c quadrants
                       Load imbalance, bad scalability

                        not so great ideas
                                1D (block) cyclic: by rows, slicing (in blocks
                                of) rows and wrapping around
                                1D (block) cyclic: by columns, slicing (in
                                blocks of) columns and wrapping around
                       Bad scalability


Parallel Matrix Distribution

How to store a (large) matrix using p = r × c processors?

                     good idea
                             2D (block) cyclic: overdecomposing according
                             to an r × c mesh, and wrapping around



    References
          Applet:
          http://acts.nersc.gov/scalapack/hands-on/datadist.html
          “Introduction to High-Performance Scientific Computing”
          by Victor Eijkhout (free download). Page 164.
          “A Comprehensive Approach to Parallel Linear Algebra Libraries”
          (Technical Report, University of Texas). Chapter 3, pages 19–40.



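A minimal sketch of the 2D block-cyclic map: with an r × c processor mesh and block size nb, the owner of matrix entry (i, j) follows from the block indices modulo the mesh dimensions (mesh, block, and matrix sizes below are arbitrary, chosen only to print a small ownership map):

    #include <stdio.h>

    int main(void) {
        int r = 2, c = 3, nb = 2;          /* 2x3 processor mesh, 2x2 blocks */
        int n = 8;                         /* small matrix, for illustration */

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int pr = (i / nb) % r;     /* mesh row owning entry (i,j)    */
                int pc = (j / nb) % c;     /* mesh column owning entry (i,j) */
                printf("(%d,%d) ", pr, pc);
            }
            printf("\n");
        }
        return 0;
    }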
Collective Communication



                              Broadcast Reduce         Allreduce
                              Scatter    Gather        Allgather
                              Reduce-scatter           Alltoall



  References
        “Collective Communication: Theory, Practice, and Experience”,
        Chan, Heimlich, Purkayastha, van de Geijn.
        (FLAME working note #22)
        Collective Communications in MPI
        http://www.mcs.anl.gov/research/projects/mpi/tutorial/gropp/node72.html




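A minimal MPI sketch exercising three of these primitives (standard MPI calls; the data is trivial and only illustrates which processor ends up with what):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double x = rank + 1.0, sum = 0.0, root_val = 0.0;

        /* Broadcast: the root's value becomes available on every processor. */
        if (rank == 0) root_val = 42.0;
        MPI_Bcast(&root_val, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce: the sum of all x lands on the root only. */
        MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        /* Allreduce: the same sum lands on every processor. */
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("p = %d, sum = %f, bcast = %f\n", p, sum, root_val);
        MPI_Finalize();
        return 0;
    }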
Collective Communication: Lower Bounds


                              Cost of communication:        α + nβ
                              Cost of computation:          γ #ops

            Primitive           Latency        Bandwidth          Computation
            Broadcast           log2(p) α      n β                -
            Reduce              log2(p) α      n β                (p−1)/p n γ
            Allreduce           log2(p) α      2 (p−1)/p n β      (p−1)/p n γ
            Scatter             log2(p) α      (p−1)/p n β        -
            Gather              log2(p) α      (p−1)/p n β        -
            Allgather           log2(p) α      (p−1)/p n β        -
            Reduce-Scatter      log2(p) α      (p−1)/p n β        (p−1)/p n γ



Dense Matrix-Vector Product: Scalability
1D matrix distribution

                                   y = Ax,    x, y ∈ R^n,    A ∈ R^(n×n)

  A is partitioned by rows and distributed over p = r × c processors.
  Processor i owns the block of rows A_i.
  Similarly for x: processor i owns x_i.


        Algorithm
         1   Allgather(x_i)              x becomes available to every processor
         2   y_i = A_i x                 local computation

        Parallel cost (lower bound for T_p(n))
         1   log2(p) α + (p−1)/p n β  ≈  log2(p) α + n β
         2   (2n^2 / p) γ


        Sequential cost                T_1(n) = 2n^2 γ
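A sketch of the two steps in MPI + CBLAS (assumes p divides n, a column-major local block A_i of n/p rows, and a linked CBLAS library; mv_1d is a hypothetical helper name, not part of any library):

    #include <stdlib.h>
    #include <mpi.h>
    #include <cblas.h>

    /* Each processor owns nloc = n/p rows of A (an nloc x n block, column-major,
       leading dimension nloc) and the corresponding piece xloc of x.            */
    void mv_1d(int n, int p, double *Ai, double *xloc, double *yi, MPI_Comm comm)
    {
        int nloc = n / p;                           /* assumes p divides n */
        double *x = malloc((size_t)n * sizeof(double));

        /* Step 1: Allgather(x_i): the full vector x becomes available everywhere,
           cost ~ log2(p) α + (p−1)/p n β.                                        */
        MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

        /* Step 2: local computation y_i := A_i x, (2n^2 / p) γ.                  */
        cblas_dgemv(CblasColMajor, CblasNoTrans, nloc, n,
                    1.0, Ai, nloc, x, 1, 0.0, yi, 1);
        free(x);
    }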
GEMV: Scalability
1D matrix distribution



  Speedup               S_p(n) = T_1(n) / T_p(n) = 2n^2 γ / ( log2(p) α + n β + (2n^2/p) γ )

  Efficiency            E_p(n) = S_p(n) / p = 1 / ( 1 + (p log2(p))/(2n^2) · α/γ + p/(2n) · β/γ )

  Strong scalability    lim_{p→∞} E_p(n) = 0    (no)

  Weak scalability
  M = local memory;  M p = combined memory;
  n_M = largest problem solvable with p processors  ⇒  n_M^2 = M p.

  E_p(n_M) = 1 / ( 1 + log2(p)/(2M) · α/γ + √p/(2√M) · β/γ )   ⇒   lim_{p→∞} E_p(n_M) = 0    (no)
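A small sketch that evaluates the weak-scaling efficiency E_p(n_M) above for growing p; α, β, γ, and M are made-up values, chosen only to show that the √p term in the bandwidth cost eventually drives the efficiency to zero:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Hypothetical machine constants (assumptions, not measurements):
           alpha = latency per message, beta = time per word,
           gamma = time per flop, M = words of local memory per processor. */
        double alpha = 1e-6, beta = 1e-8, gamma = 1e-10, M = 1e8;

        for (double p = 2; p <= 1 << 20; p *= 4) {
            double E = 1.0 / (1.0 + log2(p) / (2.0 * M) * alpha / gamma
                                  + sqrt(p) / (2.0 * sqrt(M)) * beta / gamma);
            printf("p = %8.0f   E_p(n_M) = %.3f\n", p, E);
        }
        return 0;
    }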
GEMV: Scalability
2D matrix distribution




 TODO list
       How is A partitioned?
      x and y partitioned as before.
      Algorithm? Which type of communication is needed?
      Cost for the algorithm? communication? computation?
      Speedup? Parallel efficiency?
      Strong scalability? still no.
      Weak scalability? no, but almost.
       Especially true for n^3 algorithms.



