High-Performance Matrix Computations
Prof. Paolo Bientinesi
pauldj@aices.rwth-aachen.de
May 15, 2011

History of Dense Linear Algebra Libraries

  1971  NAG         Linear algebra & more
  1972  EISPACK     Eigenproblems
  1973  BLAS-1      Vector operations
  1978  LINPACK     Linear systems, least-squares
  1988  BLAS-2      Matrix-vector operations
  1990  BLAS-3      Matrix-matrix operations
  1992  LAPACK      EISPACK + LINPACK
  1993  ScaLAPACK   Distributed-memory LAPACK
  1997  PLAPACK     Environment for distributed memory
  2001  FLAME       High-level abstraction; shared & distributed memory
  2006  PLASMA, CellSs   Multicores

Memory Hierarchy

[Figure: the memory hierarchy]

BLAS: Basic Linear Algebra Subroutines

Linear algebra operations are decomposed into simpler operations (one representative routine per level is sketched after the references below).

  BLAS-1:  y := y + αx              x, y ∈ R^n
           dot := α + x^T y
  BLAS-2:  y := y + Ax              A ∈ R^{n×n}, x, y ∈ R^n
           y := A^{-1} x
  BLAS-3:  C := C + AB              A, B, C ∈ R^{n×n}
           C := A^{-1} B

  BLAS     | #FLOPS | Mem. refs. | Ratio
  Level 1  | 2n     | 3n         | 2/3
  Level 2  | 2n²    | n²         | 2
  Level 3  | 2n³    | 4n²        | n/2

References: BLAS

  GotoBLAS: http://web.tacc.utexas.edu/resources/software/
  "Automated Empirical Optimization of Software and the ATLAS Project", Clint Whaley, Antoine Petitet, and Jack Dongarra. Parallel Computing, 27, 2001.
  "Is Search Really Necessary to Generate High-Performance BLAS?", Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Proceedings of the IEEE, 93(2), 2005.
  "Anatomy of High-Performance Matrix Multiplication", Kazushige Goto and Robert van de Geijn. ACM Transactions on Mathematical Software, 34(3), 2008.
  "High-Performance Implementation of the Level-3 BLAS", Kazushige Goto and Robert van de Geijn. ACM Transactions on Mathematical Software, 35(1), 2008.
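The three levels map directly onto standard BLAS routines. A minimal sketch with one representative call per level, assuming a CBLAS interface (e.g. OpenBLAS or ATLAS) is installed and linked:

    /* One representative call per BLAS level, via the standard CBLAS
       interface.  Build, e.g.:  cc blas_levels.c -lopenblas          */
    #include <stdlib.h>
    #include <cblas.h>

    int main(void) {
        const int n = 1000;
        double alpha = 2.0;
        double *x = calloc(n, sizeof *x), *y = calloc(n, sizeof *y);
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);

        /* BLAS-1: y := y + alpha*x       (2n flops, 3n memory refs)    */
        cblas_daxpy(n, alpha, x, 1, y, 1);

        /* BLAS-2: y := 1*A*x + 1*y       (2n² flops, n² memory refs)   */
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    1.0, A, n, x, 1, 1.0, y, 1);

        /* BLAS-3: C := 1*A*B + 1*C       (2n³ flops, 4n² memory refs)  */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);

        free(x); free(y); free(A); free(B); free(C);
        return 0;
    }

Note how only the Level-3 call performs O(n) flops per memory reference, which is what allows it to hide the memory hierarchy and approach peak performance.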
References: LAPACK

  www.netlib.org/lapack/
  Working Notes: www.netlib.org/lapack/lawns
  "LAPACK Users' Guide" (third ed.), Society for Industrial and Applied Mathematics, 1999.

Performance

Quality of an algorithm. Metrics?

  Execution time
  Memory usage
  Memory/network accesses
  Accuracy
  Size
  ...

Execution time

It says how fast an algorithm is, but:

  it does not measure the memory usage;
  it might depend on the input;
  it depends on the processor;
  it does not measure how well the algorithm takes advantage of a given architecture.

The quality of an algorithm should be related to the potential of the architecture

"Potential of an architecture"? Theoretical peak performance.

  Theoretical peak performance = #cores × frequency × #flops/cycle

  What is frequency?   cycles per second (GHz)
  What is a flop?      a floating-point operation
  What is a cycle?     how often the processor can initiate new operations
  #flops/cycle:        the number of flops initiated in one cycle

In practice, the theoretical peak performance is unattainable. Why?

CPU vs. Memory

For the CPU to execute an operation, data needs to be in registers.

  Cache misses: L1 misses, L2, L3, TLB, instruction misses...
  Cache line

Back to performance

  Theoretical peak performance → unattainable
  Practical peak performance? (practical = attainable)
  DGEMM (BLAS): Double-precision GEneral Matrix-Matrix multiply
  Peak performance ≡ DGEMM
  How to compute DGEMM's performance?

Performance

  Perf = #ops / exec. time

  What is #ops? What is #ops in your algorithm?
  What if the algorithm is iterative? ("iterative algorithm" = "loop-based")

  Efficiency = Perf / Peak Perf,    0 ≤ Efficiency ≤ 1
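These definitions translate directly into a measurement. Below, a minimal sketch that times one DGEMM call and reports Perf = #ops / exec. time and Efficiency = Perf / Peak Perf; the peak parameters (8 cores, 3.0 GHz, 4 flops/cycle) are assumed values for illustration only, and timing a single cold call is a simplification (a real benchmark would repeat and warm up):

    /* Time one DGEMM and report performance and efficiency.
       Peak parameters below are ASSUMED for illustration; substitute
       your machine's.  Build, e.g.:  cc dgemm_perf.c -lopenblas      */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    static double wall_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    int main(void) {
        const int n = 2000;
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);

        double t0 = wall_seconds();
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        double t = wall_seconds() - t0;

        double ops  = 2.0 * n * n * n;  /* #ops of an n×n×n GEMM        */
        double perf = ops / t;          /* flops per second             */
        double peak = 8 * 3.0e9 * 4;    /* #cores * frequency * #flops/cycle */

        printf("perf = %.2f GFlops/s, efficiency = %.2f\n",
               perf / 1e9, perf / peak);
        free(A); free(B); free(C);
        return 0;
    }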
BLAS: Performance

Multithreaded DGEMM: performance of DGEMM on an Intel Xeon 5450.

[Plot: efficiency (%) vs. matrix size (0 to 2500), curves for 1, 2, 4, and 8 threads; block size nb = 192.]

Time vs. Performance

  Can you cheat time? Can you cheat performance?
  "fast" flops vs. "slow" flops: 5n·log n vs. 2n²
  (an algorithm with more flops can attain a higher flop rate yet still take longer)
  USE BOTH!

Parallel Performance

p processors. Total performance? Execution time?

  T_p(n) = execution time for a problem of size n

  Speedup = T_1(n) / T_p(n)
  Efficiency = Speedup / p = T_1(n) / (p · T_p(n))

  0 ≤ Speedup ≤ p,    0 ≤ Efficiency ≤ 1

  Strong scalability: T_p(n) / T_{2p}(n), with p increasing
    (fixed problem, increasing number of processors)
  Weak scalability: T_p(n) / T_{kp}(2n), with p increasing
    (fixed memory use per processor, increasing number of processors;
     e.g. k = 4 when the data grows as n², as for an n × n matrix)

Parallel Matrix Distribution

How to store a (large) matrix using p = r × c processors?

Bad ideas:
  1D: by rows
  1D: by columns
  2D: r × c quadrants
  → load imbalance, bad scalability

Not-so-great ideas:
  1D (block) cyclic: by rows, slicing (blocks of) rows and wrapping around
  1D (block) cyclic: by columns, slicing (blocks of) columns and wrapping around
  → bad scalability

Good idea (see the sketch after the references below):
  2D (block) cyclic: overdecomposing according to an r × c mesh, and wrapping around

References

  Applet: http://acts.nersc.gov/scalapack/hands-on/datadist.html
  "Introduction to High-Performance Scientific Computing" by Victor Eijkhout (free download), page 164.
  "A Comprehensive Approach to Parallel Linear Algebra Libraries" (Technical Report, University of Texas), chapter 3, pages 19-40.
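To make the 2D block-cyclic idea concrete, here is a small sketch of the index-to-owner map used by ScaLAPACK-style distributions: global entry (i, j) belongs to process (⌊i/nb⌋ mod r, ⌊j/nb⌋ mod c) of the r × c mesh. The function names and the zero process offset are illustrative choices, not a library API:

    /* 2D block-cyclic ownership: which process of an r x c mesh owns
       global entry (i, j), for block size nb?  (Illustrative sketch.) */
    #include <stdio.h>

    int owner_row(int i, int nb, int r) { return (i / nb) % r; }
    int owner_col(int j, int nb, int c) { return (j / nb) % c; }

    int main(void) {
        int r = 2, c = 3, nb = 2;   /* 2 x 3 process mesh, 2 x 2 blocks */
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++)
                printf("(%d,%d) ", owner_row(i, nb, r), owner_col(j, nb, c));
            printf("\n");
        }
        return 0;
    }

Every process receives blocks from all over the matrix, which is what restores load balance for algorithms whose active region shrinks as the computation proceeds (e.g. factorizations).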
Collective Communication

  Broadcast, Reduce, Allreduce,
  Scatter, Gather, Allgather,
  Reduce-scatter, Alltoall

References

  "Collective Communication: Theory, Practice, and Experience", Chan, Heimlich, Purkayastha, van de Geijn (FLAME Working Note #22).
  Collective Communications in MPI: http://www.mcs.anl.gov/research/projects/mpi/tutorial/gropp/node72.html

Collective Communication: Lower Bounds

Cost of communication: α + nβ (latency α, per-item transfer cost β). Cost of computation: γ per op.

  Primitive       | Latency    | Bandwidth        | Computation
  Broadcast       | log2(p) α  | n β              | -
  Reduce          | log2(p) α  | n β              | ((p−1)/p) n γ
  Allreduce       | log2(p) α  | 2 ((p−1)/p) n β  | ((p−1)/p) n γ
  Scatter         | log2(p) α  | ((p−1)/p) n β    | -
  Gather          | log2(p) α  | ((p−1)/p) n β    | -
  Allgather       | log2(p) α  | ((p−1)/p) n β    | -
  Reduce-scatter  | log2(p) α  | ((p−1)/p) n β    | ((p−1)/p) n γ

Dense Matrix-Vector Product: Scalability (1D matrix distribution)

y = Ax, with x, y ∈ R^n and A ∈ R^{n×n}.
A is partitioned by rows and distributed over p = r × c processors: processor i owns the block of rows A_i. Similarly for x: processor i owns x_i.

Algorithm (an MPI sketch appears at the end of this section):
  1. Allgather(x_i)   x becomes available to every processor
  2. y_i = A_i x      local computation

Parallel cost (lower bound for T_p(n)):
  1. log2(p) α + ((p−1)/p) n β ≈ log2(p) α + n β
  2. (2n²/p) γ

Sequential cost: T_1(n) = 2n² γ.

GEMV: Scalability (1D matrix distribution)

Speedup:
  S_p(n) = T_1(n) / T_p(n) = 2n²γ / ( log2(p) α + n β + (2n²/p) γ )

Efficiency:
  E_p(n) = S_p(n) / p = 1 / ( 1 + p·log2(p)·α / (2n²γ) + p·β / (2n·γ) )

Strong scalability:
  lim_{p→∞} E_p(n) = 0

Weak scalability: let M = local memory, Mp = combined memory, and n_M = the largest problem solvable with p processors, so n_M² = Mp. Then
  lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / ( 1 + log2(p)·α / (2Mγ) + √p·β / (2√M·γ) ) = 0

GEMV: Scalability (2D matrix distribution)

TODO list:
  How is A partitioned? (x and y are partitioned as before.)
  Algorithm? Which type of communication is needed?
  Cost for the algorithm? Communication? Computation?
  Speedup? Parallel efficiency?
  Strong scalability? Still no.
  Weak scalability? No, but almost. Especially true for n³ algorithms.
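A minimal MPI sketch of the two steps of the 1D algorithm above, Allgather followed by a local GEMV. It assumes n is divisible by p and that a CBLAS implementation is linked; the buffer names and row-major layout are illustrative choices:

    /* 1D row-distributed GEMV: Allgather(x_i), then y_i := A_i x.
       Assumes n divisible by p.  Build, e.g.:
       mpicc gemv_1d.c -lopenblas && mpirun -np 4 ./a.out              */
    #include <stdlib.h>
    #include <mpi.h>
    #include <cblas.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024, nloc = n / p;   /* rows owned by this process */
        double *Ai = calloc((size_t)nloc * n, sizeof *Ai); /* my rows A_i */
        double *xi = calloc(nloc, sizeof *xi);  /* my piece x_i           */
        double *x  = calloc(n, sizeof *x);      /* full x after Allgather */
        double *yi = calloc(nloc, sizeof *yi);  /* my piece y_i           */

        /* Step 1: x becomes available to every processor
           (cost ~ log2(p) α + ((p−1)/p) n β)                            */
        MPI_Allgather(xi, nloc, MPI_DOUBLE,
                      x,  nloc, MPI_DOUBLE, MPI_COMM_WORLD);

        /* Step 2: local computation y_i := A_i x  (2n²/p flops)         */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, nloc, n,
                    1.0, Ai, n, x, 1, 0.0, yi, 1);

        free(Ai); free(xi); free(x); free(yi);
        MPI_Finalize();
        return 0;
    }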