
        C++ Expression Templates in an Embedded, Parallel, Real-Time Signal Processing Library

                                         Edward M. Rutledge
                                         MIT Lincoln Laboratory


                                                Abstract

1.0 INTRODUCTION

In order to facilitate a smooth transition from an algorithm's linear algebra specification to its software
implementation, a high performance signal processing library should provide software constructs that
allow linear algebra to be easily translated into high performance code. Currently, C is the prevalent
language for high performance signal processing libraries, largely because of concerns about the
efficiency of C++. If it were not for these concerns, C++ would be a better choice, partly because of its
expressiveness. Using C++, we can define vector and matrix classes and overloaded operators that allow
linear algebra to be more easily and intuitively translated into code. However, C++ operator overloading is
the source of much of the concern about the efficiency of C++. Traditional techniques of operator
overloading incur added overhead that may be unacceptable in an embedded real-time system. C++
"expression templates," which are currently gaining popularity in the scientific computing arena, provide a
solution to this problem.

In this presentation, we explore the impact of C++ expression templates on the simplicity and
performance of an experimental, embedded, parallel, real-time signal processing library.

We demonstrate that expression templates can be as beneficial in the rapid development of embedded,
parallel, real-time signal processing applications as they have proven to be in the rapid development of
high performance scientific simulations. Our examples and experiments focus on architectures, problem
sizes, and kernels relevant to radar and other array-sensor processing applications.

2.0 OVERVIEW OF C++ EXPRESSION TEMPLATES AND PETE

C++ overloaded operators typically return temporary objects containing the results of the operation. This
technique incurs the overhead of creating and destroying the temporary objects, which includes additional
memory use, copy overhead, and possibly the overhead of allocating and freeing dynamic memory in the
objects. Alternatively, C++ expression templates can be used to transform an arbitrary expression of
objects such as vectors or matrices into a parse tree representation of that expression. This is done at
compile time, and the parse tree object itself takes a small amount of memory, so little overhead is
incurred at run time. With an expression's entire parse tree represented in a single object, the expression
can be evaluated as a whole instead of being evaluated one operation at a time. This eliminates the need
to produce temporary objects to hold intermediate results, creating a profound positive impact on the
suitability of C++ operator overloading for embedded real-time systems, where efficient use of system
resources is essential.

The Portable Expression Template Engine (PETE) is a free software package, developed at Los Alamos
National Laboratory's Advanced Computing Laboratory, that allows programmers to add expression
template capability to their C++ programs. PETE furnishes a utility for defining C++ operators that
transform expressions of class objects into parse trees, and PETE furnishes functions that help in
navigating these expression parse trees and help in efficiently evaluating expressions of element-wise
operations. We use PETE to add expression template capability to the experimental C++ signal
processing library referred to in this presentation.

3.0 IMPACT OF PETE ON CODE SIMPLICITY AND PERFORMANCE

Through the following examples and experiments, we show that PETE and C++ can combine to form an
efficient and easy to use signal processing programming capability.

3.1 Simplicity

To show the impact of C++ and operator overloading on the simplicity and readability of signal processing
code, we compare implementations of linear algebra expressions using our experimental C++ signal
processing library with operator overloading to implementations using a C signal processing library.

3.2 Performance

To show the performance impact of PETE in our experimental signal processing library, we measure and
compare the execution time of a linear algebra expression implemented in three different ways:

      −   using an optimized native C library,

      −   using our experimental C++ library with PETE operators and vector and matrix classes that
          evaluate the expression using an optimized native C library, and

      −   using our experimental C++ library with PETE operators and vector and matrix classes that
          evaluate the expression using PETE methods of element-wise evaluation.

The linear algebra expression and problem size are extracted from an existing radar signal processing
application.

The first two approaches are compared to show the overhead introduced by the C++/PETE approach.
The last two approaches are compared to show the possible performance advantage of PETE
element-wise expression evaluation over optimized native C library call expression evaluation, especially
for evaluating expressions where many element-wise operations are chained together.
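
The difference between the two evaluation styles can be sketched in plain C++ for the expression A=B+C*D. This is an illustration only: vec_mul and vec_add are hypothetical stand-ins for per-operation library kernels, not actual VSIPL calls, and eval_fused merely shows the kind of single loop that element-wise evaluation aims to produce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Hypothetical per-operation kernels, standing in for optimized library calls.
void vec_mul(const Vec& x, const Vec& y, Vec& out) {
    assert(x.size() == y.size() && x.size() == out.size());
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = x[i] * y[i];
}

void vec_add(const Vec& x, const Vec& y, Vec& out) {
    assert(x.size() == y.size() && x.size() == out.size());
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = x[i] + y[i];
}

// Library-call style: one pass over memory per operation, plus an
// intermediate buffer to hold C*D.
Vec eval_by_calls(const Vec& b, const Vec& c, const Vec& d) {
    Vec tmp(b.size()), a(b.size());
    vec_mul(c, d, tmp);
    vec_add(b, tmp, a);
    return a;
}

// Element-wise style: a single pass, no intermediate buffer.
Vec eval_fused(const Vec& b, const Vec& c, const Vec& d) {
    assert(b.size() == c.size() && b.size() == d.size());
    Vec a(b.size());
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + c[i] * d[i];
    return a;
}
```

Both functions compute the same values; the fused version reads each operand once and writes only the result, so its relative advantage would be expected to grow as more element-wise operations are chained.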

The experiments are run on both serial and distributed memory message passing architectures. In the
latter case, we examine both data mappings that require inter-processor communication and those that
do not.

4.0 FUTURE OPTIMIZATIONS

We also discuss some possibilities for further optimization of mathematical expressions, including
determining the optimal order for performing a series of chained operations in an expression with respect
to operation count and with respect to communication overhead. These optimizations would not be
possible using traditional C++ operator overloading.
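
As a hypothetical illustration of ordering by operation count (our example, not one taken from the experiments), consider y = A*B*x with n-by-n matrices A and B and a length-n vector x. The sketch below counts scalar multiplications for the two possible evaluation orders, assuming the textbook triple-loop matrix product; the function names are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Scalar multiplications for an (r x k) times (k x c) product,
// assuming the textbook triple-loop algorithm.
std::uint64_t matmul_cost(std::uint64_t r, std::uint64_t k, std::uint64_t c) {
    return r * k * c;
}

// (A*B)*x: one matrix-matrix product, then one matrix-vector product.
std::uint64_t cost_left_first(std::uint64_t n) {
    return matmul_cost(n, n, n) + matmul_cost(n, n, 1);
}

// A*(B*x): two matrix-vector products.
std::uint64_t cost_right_first(std::uint64_t n) {
    assert(n > 0);
    return matmul_cost(n, n, 1) + matmul_cost(n, n, 1);
}
```

For n = 64 the left-to-right order costs 266240 multiplications and the right-to-left order 8192. An operator that returns its result immediately never sees the rest of the expression, whereas a parse tree of the whole expression is what would make such a reordering possible.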

        C++ Expression Templates in an Embedded, Parallel, Real-Time Signal Processing Library

                                        Eddie Rutledge

*This work is sponsored by the US Navy, under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and
              recommendations are those of the author and not necessarily endorsed by the United States Air Force.

 000801-er-1
                                                                                             MIT Lincoln Laboratory
 KAM 4/27/01
Evolution of Parallel Processing Libraries

[Figure: timeline, 1988 to 2000, of libraries grouped by layer. The legend distinguishes
scientific (non-real-time) computing from real-time signal processing.]

Parallel processing libraries:
 • ScaLAPACK: Fortran, object-based
 • STAPL: C, object-based
 • PVL: C++, object-oriented

Parallel communications libraries:
 • MPI: C, object-based
 • MPI/RT: C, object-based

Single-processor libraries:
 • LAPACK: Fortran
 • VSIPL: C, object-based
 • PETE: C++, object-oriented



PVL
 • Collaboration between Lincoln and Lockheed Martin
 • Component of AEGIS Common Signal Processor (CSP) Software Application Program Interface (API)
 • Combines VSIPL API & parallel constructs from STAPL
 • Portable, high performance, standardized signal processing library for real-time array signal processing

Abbreviations:
 LAPACK = Linear algebra package
 MPI = Message-passing interface
 MPI/RT = MPI real-time
 ScaLAPACK = Scalable LAPACK
 VSIPL = Vector, Signal, and Image Processing Library
 STAPL = Space-Time Adaptive Processing Library
 PETE = Portable Expression Template Engine



C vs. C++ Code Simplicity

C (Functional Approach):
   for (i=0;i<rows;i++)
    for (j=0;j<cols;j++)
     A[i][j]=B[i][j]+C[i][j];

C (Object Based):
   vsip_vadd_f(&B,&C,&A);

C++ (Object Oriented):
   A=B+C;




We examine C++ expression templates as a way to increase code performance
C vs. C++ Performance

[Figure: two charts, "Code Complexity" and "Optimization Potential", each plotted against
programming approach: functional approach, object-based C library (VSIPL, MPI), and
object-oriented C++ library (PVL). Each chart marks an application space and a library
space; in the optimization-potential chart, expression templates are marked alongside PVL.]

Focus: Compare performance

We compare the performance of the object-oriented C++ approach (PVL)
with the performance of the object-based C approach (VSIPL/MPI)
Typical C++ Operator Overloading Overhead

Example: A=B+C vector add

Main
 1. Pass B and C references to operator +
Operator +
       2. Create temporary result vector
       3. Calculate results, store in temporary
       4. Return copy of temporary
 5. Pass results reference to operator =
Operator =
       6. Perform assignment

2 temporary vectors created

Additional Memory Use
 • Static memory
 • Dynamic memory (also affects execution time)

Additional Execution Time
 • Time to create a new vector
 • Time to create a copy of a vector
 • Time to destruct both temporaries
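
The steps above can be sketched with a small, hypothetical vector class (NaiveVec is our name, not a class from PVL) that counts how many vectors are constructed while evaluating A=B+C. The counter and the helper temporaries_for_add exist only to make the temporaries visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts every NaiveVec construction (illustration only).
static int naive_vec_constructed = 0;

// Hypothetical vector class using traditional operator overloading.
struct NaiveVec {
    std::vector<float> data;

    explicit NaiveVec(std::size_t n, float v = 0.0f) : data(n, v) {
        ++naive_vec_constructed;
    }
    NaiveVec(const NaiveVec& o) : data(o.data) { ++naive_vec_constructed; }
    NaiveVec& operator=(const NaiveVec& o) { data = o.data; return *this; }
};

// Steps 2 to 4: create a temporary, fill it, return it by value.
NaiveVec operator+(const NaiveVec& b, const NaiveVec& c) {
    assert(b.data.size() == c.data.size());
    NaiveVec temp(b.data.size());
    for (std::size_t i = 0; i < temp.data.size(); ++i)
        temp.data[i] = b.data[i] + c.data[i];
    return temp;  // the copy here may be elided by the compiler
}

// Returns how many vectors were constructed while evaluating a = b + c.
int temporaries_for_add(std::size_t n) {
    NaiveVec a(n), b(n, 1.0f), c(n, 2.0f);
    const int before = naive_vec_constructed;
    a = b + c;  // steps 1 to 6
    assert(a.data[0] == 3.0f);
    return naive_vec_constructed - before;
}
```

Without copy elision the count is two, matching the slide. Compilers that elide the copy in step 4 reduce it to one, but the temporary result vector and its dynamic allocation remain.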


C++ Expression Templates and PETE

Expression: A=B+C*D

Expression templates turn the expression into a parse tree with the expression type:

   BinaryNode<OpAssign, Vector,
     BinaryNode<OpAdd, Vector,
       BinaryNode<OpMultiply, Vector, Vector>>>

Main
 1. Pass B and C references to operator +
Operator +
       2. Create expression parse tree (a + node holding references B& and C&)
       3. Return expression parse tree (by copy)
 4. Pass expression tree reference to operator =
Operator =
       5. Calculate result (B+C) and perform assignment to A

Parse trees, not vectors, created

Reduced Memory Use
 • Parse tree only contains references

Reduced Execution Time
 • Parse tree created at compile time


• PETE, the Portable Expression Template Engine, is available from the
  Advanced Computing Laboratory at Los Alamos National Laboratory
  (PETE: http://www.acl.lanl.gov/pete)
• PETE provides:
     – Expression template capability
     – Facilities to help navigate and evaluate parse trees
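
A minimal sketch of the technique follows. It is hand-written for illustration and does not use PETE's actual API; the names BinaryNode, OpAdd, OpMultiply and Vector are borrowed from the expression type shown on this slide, while IsExpr and demo_fused are our own hypothetical helpers. For brevity, assignment is handled directly in Vector::operator= rather than through an OpAssign node.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

struct OpAdd      { static float apply(float x, float y) { return x + y; } };
struct OpMultiply { static float apply(float x, float y) { return x * y; } };

// Parse-tree node: stores only references to its operands, so it must be
// consumed within the full expression that created it.
template <class Op, class L, class R>
struct BinaryNode {
    const L& left;
    const R& right;
    float operator[](std::size_t i) const { return Op::apply(left[i], right[i]); }
    std::size_t size() const { return left.size(); }
};

struct Vector {
    std::vector<float> data;
    explicit Vector(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Evaluates the whole tree in one loop; no temporary vectors are created.
    template <class Op, class L, class R>
    Vector& operator=(const BinaryNode<Op, L, R>& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Restricts the operators below to vectors and parse-tree nodes.
template <class T> struct IsExpr : std::false_type {};
template <> struct IsExpr<Vector> : std::true_type {};
template <class Op, class L, class R>
struct IsExpr<BinaryNode<Op, L, R> > : std::true_type {};

template <class L, class R,
          class = typename std::enable_if<IsExpr<L>::value && IsExpr<R>::value>::type>
BinaryNode<OpAdd, L, R> operator+(const L& l, const R& r) { return {l, r}; }

template <class L, class R,
          class = typename std::enable_if<IsExpr<L>::value && IsExpr<R>::value>::type>
BinaryNode<OpMultiply, L, R> operator*(const L& l, const R& r) { return {l, r}; }

// The type of B+C*D is fixed at compile time and mirrors the parse tree.
static_assert(
    std::is_same<decltype(std::declval<Vector&>() +
                          std::declval<Vector&>() * std::declval<Vector&>()),
                 BinaryNode<OpAdd, Vector,
                            BinaryNode<OpMultiply, Vector, Vector> > >::value,
    "B+C*D should build a nested BinaryNode type");

// Evaluates a = b + c * d on constant-filled vectors and returns a[0].
float demo_fused(std::size_t n) {
    Vector a(n), b(n, 1.0f), c(n, 2.0f), d(n, 3.0f);
    a = b + c * d;
    return a[0];
}
```

Because each node holds references, storing a tree in a named variable and evaluating it later would leave dangling references to inner nodes; the sketch sidesteps this by evaluating inside the assignment statement.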
Experiments

[Figure: the experiments measure execution time along three axes: parallel complexity,
software technology, and expression complexity.]

Parallel complexity:
 • Experiment 1: Single node
 • Experiment 2: Identical distributions
 • Experiment 3: Varying distributions

Software technology:
 • C: C with MPI and VSIPL
 • C++/VSIPL: C++ with PVL (MPI, VSIPL, and PETE expression templates)
 • C++/PETE: C++ with PVL (MPI and PETE expression templates and evaluation)
Experiment 1: Single Node

[Figure: three plots of relative execution time versus vector length (8 to 131072) for
A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), each comparing C, C++/VSIPL, and C++/PETE.]

 • Platform: Linux PC
 • Element-wise multiply and divide

Relative overhead of C++/VSIPL vs. C is small when expression templates are used.

For longer chained expressions of element-wise operations, C++ with PETE
element-wise evaluation outperforms other implementations.
Experiment 2: Identical Distributions

[Figure: three plots of relative execution time versus vector length (8 to 131072) for
A=B+C, A=B+C*D, and A=B+C*D/E+fft(F), each comparing C, C++/VSIPL, and C++/PETE.]

 • Platform: 4 node Linux cluster
 • Element-wise multiply and divide

Similar performance to single node case.

Relative differences in A=B+C*D/E+fft(F) are smaller because of
communication required in (unoptimized) FFT.
                                                           Experiment 3: Different Distributions
                                                      A=B+C                                                                                A=B+C*D                                                                                 A=B+C*D/E+fft(F)
                          1. 1                                                                                              1.1                                                                                          1.1
                                                  C
                                                  C++/VSIPL




                                                                                                  Relative Execution Time




                                     [Figure: three plots of Relative Execution Time versus Vector Length (8 to 131072), comparing C, C++/VSIPL, and C++/PETE]



                                     •    Platform: 4 node Linux cluster
                                     •    Element-wise multiply and divide
                                     •    A distributed on 2 nodes, B, C, D, E and F distributed on 4 nodes

                                             Relative differences between the 3 approaches are much smaller
                                                     because of communication required in all cases
                                                                  Results Summary

                                                 Execution time relative to C/VSIPL implementation

   [Five bar charts, each comparing C++/VSIPL and C++/PETE at Vector Length=8 and Vector Length=131072:]
      Single Node: A=B+C
      Single Node: A=B+C*D
      Single Node: A=B+C*D/E+fft(F)
      Identical Distributions: A=B+C*D/E+fft(F)
      Different Distributions: A=B+C*D/E+fft(F)

      • C++/VSIPL: Small overhead compared to C
      • C++/PETE: Significant advantage for chained expressions
      • Relative differences largest where no communication is required


                               Future Optimizations
             2 optimizations made possible by expression templates,
             not possible with typical C++ operator overloading

   Minimize Op Count

      Example: matrix-matrix multiply A=BxCxDxExF
         B: 30x35
         C: 35x15
         D: 15x5
         E: 5x10
         F: 10x20

      Default ordering: 25,500 scalar multiplies
         A=(((BxC)xD)xE)xF
      Optimal ordering: 11,875 scalar multiplies
         A=(Bx(CxD))x(ExF)

   Minimize Communication

      Example: A=B+C
         • A and C are on processor 1
         • B is on processor 2

      Typical C++ operator overloading solution could involve unnecessary
      communication of C to processor 2, then results to processor 1

      Optimal solution: Move B’s data to processor 1, then evaluate

           Both optimizations require knowledge of the entire expression,
           which expression templates provide
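
The operation counts quoted for A=BxCxDxExF can be checked with a short sketch. The code below is not part of the library described in these slides; the function names `leftToRightCost` and `optimalCost` are hypothetical, and the optimal count comes from the standard matrix-chain dynamic program. It assumes that multiplying a p x q matrix by a q x r matrix costs p*q*r scalar multiplies.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// dims holds n+1 entries for a chain of n matrices:
// matrix i has dims[i] rows and dims[i+1] columns.

// Scalar multiplies for strict left-to-right evaluation, (((M0 x M1) x M2) x ...).
long leftToRightCost(const std::vector<long>& dims) {
    assert(dims.size() >= 2);
    long cost = 0;
    for (std::size_t k = 2; k < dims.size(); ++k) {
        // The running product is dims[0] x dims[k-1]; multiply it by matrix k-1.
        cost += dims[0] * dims[k - 1] * dims[k];
    }
    return cost;
}

// Minimum scalar multiplies over all parenthesizations (O(n^3) dynamic program).
long optimalCost(const std::vector<long>& dims) {
    assert(dims.size() >= 2);
    const std::size_t n = dims.size() - 1;  // number of matrices
    // m[i][j] = cheapest cost of multiplying matrices i..j inclusive.
    std::vector<std::vector<long>> m(n, std::vector<long>(n, 0));
    for (std::size_t len = 2; len <= n; ++len) {
        for (std::size_t i = 0; i + len <= n; ++i) {
            const std::size_t j = i + len - 1;
            m[i][j] = std::numeric_limits<long>::max();
            for (std::size_t k = i; k < j; ++k) {
                // Split the chain as (i..k) x (k+1..j).
                const long c = m[i][k] + m[k + 1][j]
                             + dims[i] * dims[k + 1] * dims[j + 1];
                if (c < m[i][j]) m[i][j] = c;
            }
        }
    }
    return m[0][n - 1];
}
```

Called with the dimensions from the example, {30, 35, 15, 5, 10, 20}, the first function gives 25,500 and the second gives 11,875, matching the two orderings above. An expression-template library could in principle run such a search because the whole product is visible when the assignment is evaluated, whereas an overloaded operator sees only two operands at a time.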
                                     Conclusions
      • C++ is preferable to C for high-performance signal processing
               – More direct translation from linear algebra to code
               – Use of C++ expression templates results in performance
                 similar to C
      • PETE element-wise evaluation is preferable in some cases
        to mathematical libraries such as VSIPL, which necessitate
        the creation of temporary objects. Example: A=B+C*D
      • Expression templates allow optimizations not possible with
        typical C++ operator overloading
               – Minimize operation count. Example: A=BxCxDxExF
               – Minimize communication. Example: A=B+C
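
To make the A=B+C*D point concrete, here is a minimal hand-written expression-template sketch. It is not PETE's or VSIPL's API; every name in it (`Expr`, `Vec`, `BinExpr`, `Add`, `Mul`) is hypothetical. It only illustrates the general technique: the operators build a lightweight expression object, and the assignment evaluates it in a single element-wise loop without temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: marks a type as an expression so the operators below
// match only expression types.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    std::size_t size() const { return data_.size(); }
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }

    // Assignment from any expression: one loop over the elements,
    // no intermediate vector for C*D or for B+(C*D).
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }

private:
    std::vector<double> data_;
};

struct Add { static double apply(double a, double b) { return a + b; } };
struct Mul { static double apply(double a, double b) { return a * b; } };

// Node of the expression tree. It stores references only, so it must not
// outlive the statement that created it.
template <typename Op, typename L, typename R>
class BinExpr : public Expr<BinExpr<Op, L, R>> {
public:
    BinExpr(const L& l, const R& r) : l_(l), r_(r) {}
    std::size_t size() const { return l_.size(); }
    double operator[](std::size_t i) const { return Op::apply(l_[i], r_[i]); }

private:
    const L& l_;
    const R& r_;
};

template <typename L, typename R>
BinExpr<Add, L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return BinExpr<Add, L, R>(l.self(), r.self());
}

template <typename L, typename R>
BinExpr<Mul, L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return BinExpr<Mul, L, R>(l.self(), r.self());
}
```

With this sketch, writing `A = B + C * D;` for four `Vec` objects of equal length produces a `BinExpr<Add, Vec, BinExpr<Mul, Vec, Vec>>` and then fills A element by element. A typical overloaded-operator design would instead return a full temporary vector from each operator call.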
                             Acknowledgements
      • PETE was obtained from Los Alamos National Laboratory’s
        Advanced Computing Laboratory

      Lockheed Martin: Phil Barile, Nathan Doss, Mary Frances Caravaglio,
      Michael Iaquinto, Jane Kent, Mike Lontoc, Rathin Putatunda, Roelan Teachey

      MIT Lincoln Laboratory: Jim Daly, Jim Demers, Dan Drake, Hank Hoffmann,
      Jeremy Kepner, James Lebak, Jan Matlis, Patrick Richardson,
      Eddie Rutledge, Glenn Schrader, Ben Sadoski