C++ Expression Templates in an Embedded, Parallel, Real-Time Signal Processing Library
Edward M. Rutledge
MIT Lincoln Laboratory

Abstract

1.0 INTRODUCTION

In order to facilitate a smooth transition from an algorithm's linear algebra specification to its software implementation, a high performance signal processing library should provide software constructs that allow linear algebra to be easily translated into high performance code. Currently, C is the prevalent language for high performance signal processing libraries, largely because of concerns about the efficiency of C++. If it were not for these concerns, C++ would be a better choice, partly because of its expressiveness. Using C++, we can define vector and matrix classes and overloaded operators that allow linear algebra to be more easily and intuitively translated into code. However, C++ operator overloading is the source of much of the concern about the efficiency of C++. Traditional techniques of operator overloading incur added overhead that may be unacceptable in an embedded real-time system. C++ "expression templates," which are currently gaining popularity in the scientific computing arena, provide a solution to this problem.

In this presentation, we explore the impact of C++ expression templates on the simplicity and performance of an experimental, embedded, parallel, real-time signal processing library. We demonstrate that expression templates can be as beneficial in the rapid development of embedded, parallel, real-time signal processing applications as they have proven to be in the rapid development of high performance scientific simulations. Our examples and experiments focus on architectures, problem sizes, and kernels relevant to radar and other array-sensor processing applications.

2.0 OVERVIEW OF C++ EXPRESSION TEMPLATES AND PETE

C++ overloaded operators typically return temporary objects containing the results of the operation.
This technique incurs the overhead of creating and destroying the temporary objects, which includes additional memory use, copy overhead, and the possible cost of allocating and freeing dynamic memory within the objects. Alternatively, C++ expression templates can be used to transform an arbitrary expression of objects such as vectors or matrices into a parse tree representation of that expression. This transformation is done at compile time, and the parse tree object itself occupies little memory, so little overhead is incurred at run time. With an expression's entire parse tree represented in a single object, the expression can be evaluated as a whole instead of one operation at a time. This eliminates the need to produce temporary objects to hold intermediate results, which makes C++ operator overloading far better suited to embedded real-time systems, where efficient use of system resources is essential.

The Portable Expression Template Engine (PETE) is a free software package, developed at Los Alamos National Laboratory's Advanced Computing Laboratory, that allows programmers to add expression template capability to their C++ programs. PETE furnishes a utility for defining C++ operators that transform expressions of class objects into parse trees, along with functions that help in navigating these parse trees and in efficiently evaluating expressions of element-wise operations. We use PETE to add expression template capability to the experimental C++ signal processing library described in this presentation.

3.0 IMPACT OF PETE ON CODE SIMPLICITY AND PERFORMANCE

Through the following examples and experiments, we show that PETE and C++ can combine to form an efficient and easy-to-use signal processing programming capability.
3.1 Simplicity

To show the impact of C++ and operator overloading on the simplicity and readability of signal processing code, we compare implementations of linear algebra expressions using our experimental C++ signal processing library with operator overloading to implementations using a C signal processing library.

3.2 Performance

To show the performance impact of PETE in our experimental signal processing library, we measure and compare the execution time of a linear algebra expression implemented in three different ways:
− using an optimized native C library,
− using our experimental C++ library with PETE operators and vector and matrix classes that evaluate the expression using an optimized native C library, and
− using our experimental C++ library with PETE operators and vector and matrix classes that evaluate the expression using PETE methods of element-wise evaluation.

The linear algebra expression and problem size are extracted from an existing radar signal processing application. The first two approaches are compared to show the overhead introduced by the C++/PETE approach. The second two approaches are compared to show the possible performance advantage of PETE element-wise expression evaluation over evaluation through optimized native C library calls, especially for expressions in which many element-wise operations are chained together. The experiments are run on both serial and distributed memory message passing architectures. In the latter case, we examine both data mappings that require inter-processor communication and those that do not.

4.0 FUTURE OPTIMIZATIONS

We also discuss some possibilities for further optimization of mathematical expressions, including determining the optimal order for performing a series of chained operations in an expression, both with respect to operation count and with respect to communication overhead. These optimizations would not be possible using traditional C++ operator overloading.
C++ Expression Templates in an Embedded, Parallel, Real-Time Signal Processing Library
Eddie Rutledge
MIT Lincoln Laboratory, 4/27/01

*This work is sponsored by the US Navy, under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and not necessarily endorsed by the United States Air Force.

Evolution of Parallel Processing Libraries

[Figure: 1988-2000 timeline of libraries for scientific (non-real-time) computing and real-time signal processing, from single-processor libraries (LAPACK: Fortran; VSIPL: C, object-based), through parallel communications (MPI, MPI/RT: C, object-based) and parallel processing libraries (ScaLAPACK: Fortran, object-based; STAPL), to PVL (C++, object-oriented) and PETE.]

LAPACK = Linear Algebra Package
ScaLAPACK = Scalable LAPACK
MPI = Message-Passing Interface
MPI/RT = MPI Real-Time
VSIPL = Vector, Signal, and Image Processing Library
STAPL = Space-Time Adaptive Processing Library
PVL = Parallel Vector Library
PETE = Portable Expression Template Engine

PVL:
• Collaboration between Lincoln and Lockheed Martin
• Component of AEGIS Common Signal Processor (CSP)
• Software Application Program Interface (API)
• Combines the VSIPL API and parallel constructs from STAPL
• Portable, high performance, standardized signal processing library for real-time array signal processing

C vs. C++ Code Simplicity

C (functional approach):
    for (i = 0; i < rows; i++)
      for (j = 0; j < cols; j++)
        A[i][j] = B[i][j] + C[i][j];
C (object-based, VSIPL):
    vsip_vadd_f(&B, &C, &A);
C++ (object-oriented):
    A = B + C;

We examine C++ expression templates as a way to increase code performance.

C vs. C++ Performance

[Figure: application space vs. library space diagrams plotting code complexity and optimization potential for the functional approach, the object-based C libraries (VSIPL, MPI), and the object-oriented C++ library (PVL) with expression templates.]

Focus: we compare the performance of the object-oriented C++ approach (PVL) with the performance of the object-based C approach (VSIPL/MPI).

Typical C++ Operator Overloading: Overhead

Example: A = B + C (vector add). Two temporary vectors are created:
1. Main passes B and C references to operator+.
2. operator+ creates a temporary result vector.
3. operator+ calculates the results and stores them in the temporary.
4. operator+ returns a copy of the temporary.
5. Main passes a reference to the results to operator=.
6. operator= performs the assignment.

Additional memory use:
• Static memory
• Dynamic memory (also affects execution time)
Additional execution time:
• Time to create a new vector
• Time to create a copy of a vector
• Time to destruct both temporaries

C++ Expression Templates and PETE

Expression templates transform an expression into a parse tree at compile time. For A = B + C * D, the expression type is
    BinaryNode<OpAssign, Vector,
      BinaryNode<OpAdd, Vector,
        BinaryNode<OpMultiply, Vector, Vector>>>
Parse trees, not vectors, are created:
1. Main passes B and C references to operator+.
2. operator+ creates an expression parse tree.
3. operator+ returns the expression parse tree.
4. Main passes an expression tree reference to operator=.
5. operator= calculates the result and performs the assignment.

Reduced memory use:
• The parse tree contains only references.
Reduced execution time:
• The parse tree is created at compile time.

PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory (http://www.acl.lanl.gov/pete). PETE provides:
– Expression template capability
– Facilities to help navigate and evaluate parse trees

Experiments

We measure execution time across three experiments of increasing parallel complexity, for three software technologies:
• C: VSIPL and MPI
• C++/VSIPL: PVL (MPI, VSIPL, PETE expression templates)
• C++/PETE: PVL (MPI, PETE expression templates and PETE evaluation)
Experiment 1: single node. Experiment 2: identical distributions. Experiment 3: varying distributions.

Experiment 1: Single Node

[Figure: relative execution time vs. vector length (8 to 131072) for C, C++/VSIPL, and C++/PETE, for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F).]
• Platform: Linux PC
• Element-wise multiply and divide
The relative overhead of C++/VSIPL vs. C is small when expression templates are used. For longer chained expressions of element-wise operations, C++ with PETE element-wise evaluation outperforms the other implementations.

Experiment 2: Identical Distributions

[Figure: relative execution time vs. vector length (8 to 131072) for the same three implementations and expressions.]
• Platform: 4-node Linux cluster
• Element-wise multiply and divide
Performance is similar to the single-node case. The relative differences in A=B+C*D/E+fft(F) are smaller because of the communication required in the (unoptimized) FFT.

Experiment 3: Different Distributions
[Figure: relative execution time vs. vector length (8 to 131072) for C, C++/VSIPL, and C++/PETE, for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F).]
• Platform: 4-node Linux cluster
• Element-wise multiply and divide
• A distributed on 2 nodes; B, C, D, E, and F distributed on 4 nodes
The relative differences between the three approaches are much smaller because communication is required in all cases.

Results Summary

[Figure: execution time relative to the C/VSIPL implementation at vector lengths 8 and 131072, for the single-node cases A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), and for A=B+C*D/E+fft(F) with identical and with different distributions.]
• C++/VSIPL: small overhead compared to C
• C++/PETE: significant advantage for chained expressions
• Relative differences are largest where no communication is required

Future Optimizations

Two optimizations are made possible by expression templates that are not possible with typical C++ operator overloading.

Minimize operation count. Example: A = B x C x D x E x F, where B is 30x35, C is 35x15, D is 15x5, E is 5x10, and F is 10x20.
• Default ordering, A = (((B x C) x D) x E) x F: 25,500 scalar multiplies
• Optimal ordering, A = (B x (C x D)) x (E x F): 11,875 scalar multiplies

Minimize communication. Example: A = B + C, where A and C are on processor 1 and B is on processor 2.
• A typical C++ operator overloading solution could involve unnecessary communication: C to processor 2, then the results back to processor 1.
• Optimal solution: move B's data to processor 1, then evaluate.

Both optimizations require knowledge of the entire expression, which expression templates provide.

Conclusions

• C++ is preferable to C for high-performance signal processing.
  – More direct translation from linear algebra to code
  – Use of C++ expression templates results in similar performance
• PETE element-wise evaluation is in some cases preferable to mathematical libraries such as VSIPL, which necessitate the creation of temporary objects. Example: A=B+C*D.
• Expression templates allow optimizations not possible with typical C++ operator overloading.
  – Minimize operation count. Example: A=BxCxDxExF
  – Minimize communication. Example: A=B+C

Acknowledgements

PETE was obtained from Los Alamos National Laboratory's Advanced Computing Laboratory. Thanks to colleagues at Lockheed Martin and MIT Lincoln Laboratory: Phil Barile, Jane Kent, Jim Daly, Jan Matlis, Nathan Doss, Mike Lontoc, Jim Demers, Patrick Richardson, Mary Frances Caravaglio, Rathin Putatunda, Dan Drake, Eddie Rutledge, Michael Iaquinto, Roelan Teachey, Hank Hoffmann, Glenn Schrader, Jeremy Kepner, Ben Sadoski, James Lebak.