                    Experimentally Validated Energy Analysis of Embedded Software
                   AJ Elbirt (i), S. Durphey (ii), D. Faha (iii), and S. Raje (iv)
                                 University of Massachusetts Lowell



                                                     Abstract

A wide range of current and new technologies employ embedded systems that are characterized as both low-power and
low-cost. Because of these constraints, maximizing the performance of these systems becomes increasingly difficult.
Examples of these types of systems include mobile communication devices (such as cellular telephones and personal
digital assistants) and smart cards. Based on the rapid growth of the Internet and increased consumer demand for
wireless communication devices, a corresponding increase is expected in the number of applications that utilize
embedded processors. Many Internet and wireless applications – electronic mail, electronic banking, medical
databases, and electronic commerce – require the exchange of private information. These applications require
information security to protect sensitive data, resulting in the need for research in the areas of application-specific
cryptography and data security for these types of systems. In particular, cryptographic algorithms tend to be
computation-intensive, resulting in a significant burden being placed upon the system power budget. This burden is
further exacerbated when targeting systems with limited hardware resources, which are exemplified by wireless
communication devices. This study is a first step toward the realization of software compilers that solve the
challenging problem of creating power-efficient and high-performance software implementations when targeting
systems with limited hardware resources. Representative embedded systems targeting the 68HC11 and 80C51
microprocessors will be examined, and measured power and energy consumption data for the microprocessors will be
presented. This data will lead to a set of recommendations for use in developing processor-specific power-optimized
compilers. It is expected that these compilers will be capable of generating software that will address the current and
future power consumption requirements of systems with limited hardware resources.

Keywords: cryptography, embedded, software, energy, power

1 Introduction

Many current and new technologies employ low-power and low-cost embedded systems, and
maximizing the performance of these systems becomes increasingly difficult. Examples of these
types of systems include mobile communication devices (such as cellular telephones and personal
digital assistants) and smart cards. New technologies that rely on wireless communication are
being readily accepted by consumers as products and services become available at an affordable
price. These products rely on embedded processors, and industry trends continue to indicate that
the vast majority of microprocessors produced by chip manufacturers will be used by these types of
products [1, 2, 3, 4, 5, 6].

Algorithms that are both computation-intensive and iterative tend to consume a significant amount
of an embedded system’s power budget, especially if these algorithms employ wide operands.
Cryptographic algorithms are a representative subset of these types of algorithms and are required
to achieve secure communications when performing electronic commerce via mobile
communication devices. Therefore, a means of optimizing cryptographic algorithm
implementations to minimize power consumption is expected to be of great interest to the
wireless communication industry.


_______________________________________________
(i) Email: Adam_Elbirt@uml.edu
(ii) Email: Shannen_Durphey@student.uml.edu
(iii) Email: David_Faha@student.uml.edu
(iv) Email: Saylee_Raje@student.uml.edu

This study is a first step toward the realization of software compilers that solve the challenging
problem of creating power-efficient and high-performance software implementations when
targeting embedded systems that are low-power and low-cost. Representative embedded systems
will be examined, and data will be presented that will lead to a set of recommendations for use in
developing processor-specific compilers capable of generating software that will address the
current and future power and energy consumption requirements of low-cost embedded systems.

2 Methodology

In order to better understand the power consumption requirements of the targeted embedded
systems, and therefore design power-efficient software implementations of cryptographic
algorithms, an investigation of processor power consumption must be performed at the most
elementary level - the instruction level [7, 8, 9, 10, 11, 12, 13, 14]. Research efforts in the area of
instruction-level software power minimization revolve around the hypothesis that the power
consumed by a processor performing the repeated execution of a given instruction may be
considered to be the power cost of that instruction. This cost is termed the base power cost of the
instruction (denoted as P) and is calculated as P = I × V where I and V are the current and voltage
required by the processor to perform the instruction. The base energy cost of an instruction,
denoted as E, may then be defined in terms of the base power cost and the time required to execute
the instruction (denoted as T) and is calculated as E = P × T. The time required to execute an
instruction is determined by the number of machine cycles required for the given instruction
(denoted as N) and the clock period (denoted as τ) and is calculated as T = N × τ.
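
As a minimal sketch of the three relations above (P = I × V, T = N × τ, E = P × T), the following Python computes the base energy cost of a single instruction. The function names and the numeric inputs are illustrative assumptions, not measurements from this study.

```python
def base_power(current_a: float, voltage_v: float) -> float:
    """Base power cost P = I x V, in watts."""
    return current_a * voltage_v


def execution_time(machine_cycles: int, clock_period_s: float) -> float:
    """Execution time T = N x tau, in seconds."""
    return machine_cycles * clock_period_s


def base_energy(current_a: float, voltage_v: float,
                machine_cycles: int, clock_period_s: float) -> float:
    """Base energy cost E = P x T, in joules (watt seconds)."""
    return base_power(current_a, voltage_v) * execution_time(machine_cycles, clock_period_s)


# Hypothetical inputs: 20 mA at 5 V, a 2-cycle instruction, 500 ns clock period.
example_energy = base_energy(0.020, 5.0, 2, 500e-9)
print(f"E = {example_energy:.3e} J")
```

Keeping P, T, and E as separate functions mirrors the way the text defines each quantity in terms of the previous one.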

For a given program, there also exists an inter-instruction power cost that impacts the total power
cost. This additional cost is incurred by events such as cache misses, pipeline stalls, and resource
limitations. Note that this method of power analysis determines the average base and inter-
instruction power costs for a given addressing mode and set of operands. This is significant in that
digital multimeters are capable of measuring average current and voltage values over a given time
window. Additionally, to account for variations in power consumption due to addressing mode,
address values, and operand values, the instruction instances within the looping structure must also
vary these parameters [7, 8, 9, 13, 14, 15, 16, 17, 18, 19, 20].

To determine the average base power cost of an instruction, both the current draw of the processor
and the voltage applied to the processor when executing the instruction loop must be known.
These parameters may be measured via a voltmeter and an ammeter applied to the VCC input of the
processor during program operation. The program used to determine the base power cost of the
instruction is simply a loop that implements several instances of the instruction. As the size of the
loop increases, the impact of the branch or jump instruction at the end of the loop is minimized,
resulting in the measured current value converging to the true value. However, the size of the loop
must not grow past either the size of the processor's cache or the measurement window of the
ammeter. Either of these conditions results in an increase in power consumption that is not a true
component of the base power cost of the instruction [7, 8, 9, 13].

The average base energy cost of an instruction follows directly from the average base power cost.
An instruction set may then be partitioned into instruction groupings based on average base energy
and average base power costs. Note that for a complete energy-based characterization of a
processor's instruction set, each instruction must be evaluated for all possible addressing modes.
Moreover, conditional branch instructions must be evaluated for both the success and failure cases.
Algorithm implementations, and cryptographic algorithms in particular, may be optimized based on
the resultant groupings to minimize the average energy and average power costs through the use of
those instructions classified as low energy and low power when combined with architecture-
specific optimizations [7, 9, 13, 18, 21].

To measure the base power cost of an instruction, the processor is configured to run a program that
repeatedly executes the instruction. Each program begins with an initialization sequence to set up
the processor. Note that the initialization sequence is extremely short. Because this sequence is
executed only once, its influence on the power and energy consumption of an instruction under test
is negligible and therefore not included in the power and energy calculations.

Following the initialization sequence, multiple instances of the instruction under test are executed,
followed by an unconditional jump to the first instance of the instruction under test. The number of
instances is limited by the number of bytes required by the instruction and the size of the available
code space in the embedded system. As a result, it is usually impossible to generate a single loop
that fills the 100 ms measurement window of the ammeter without overrunning code space. To
address this issue, the loop is executed multiple times to allow the ammeter to measure the average
current draw throughout the entire 100 ms measurement window. However, the multiple loop
executions result in an increase in the power and energy attributable to the repeated execution of
the unconditional jump instruction.
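
The trade-off described above can be sketched numerically. The Python below estimates how many complete passes of a test loop fit in the 100 ms window and what fraction of the executed cycles belongs to the closing jump. The function name, the cycle counts, and the 500 ns clock period are hypothetical values chosen for illustration only.

```python
def loops_in_window(instances_per_loop: int, cycles_per_instance: int,
                    jump_cycles: int, clock_period_ns: int,
                    window_ns: int = 100_000_000):
    """Return (complete loop passes in the window, fraction of cycles spent in the jump).

    Integer nanoseconds are used so the floor division is exact.
    """
    cycles_per_pass = instances_per_loop * cycles_per_instance + jump_cycles
    pass_time_ns = cycles_per_pass * clock_period_ns
    num_loops = window_ns // pass_time_ns
    jump_fraction = jump_cycles / cycles_per_pass
    return num_loops, jump_fraction


# Hypothetical test loop: 100 instances of a 2-cycle instruction,
# closed by a 3-cycle jump, with a 500 ns clock period.
num_loops, jump_fraction = loops_in_window(100, 2, 3, 500)
print(num_loops, f"{jump_fraction:.4f}")
```

A longer loop body lowers the jump fraction, which is why the number of instances is pushed as high as the code space allows.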

Based on this analysis, to obtain an accurate measurement of the base power cost of each
instruction, the first instruction to be measured is the unconditional jump. The current I and
voltage V are measured at the VCC pin on the processor when executing a program that implements
a single unconditional jump. Based on the number of clock cycles required for the unconditional
jump, the number of instructions n executed in a 100 ms measurement window is determined. The
true base power cost and base energy cost of the unconditional jump are then calculated as

       P_JUMP = (I × V) / n
       E_JUMP = P_JUMP × N × τ

where N is the number of machine cycles required to execute the instruction and τ is the system
clock period.
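
A minimal sketch of the jump calculation above follows, assuming that n is obtained by dividing the 100 ms window by the execution time of one jump. The function name and all numeric inputs are hypothetical and are not measurements from this study.

```python
WINDOW_S = 0.100  # 100 ms ammeter measurement window, as stated in the text


def jump_costs(current_a: float, voltage_v: float,
               jump_cycles: int, clock_period_s: float,
               window_s: float = WINDOW_S):
    """Return (n, jump base power, jump base energy) for the unconditional jump."""
    # n: number of jumps executed in the window (assumed back-to-back execution).
    n = int(window_s / (jump_cycles * clock_period_s))
    p_jump = (current_a * voltage_v) / n
    e_jump = p_jump * jump_cycles * clock_period_s
    return n, p_jump, e_jump


# Hypothetical inputs: 20 mA at 5 V, a 3-cycle jump, 500 ns clock period.
n, p_jump, e_jump = jump_costs(0.020, 5.0, 3, 500e-9)
print(n, f"{p_jump:.3e} W", f"{e_jump:.3e} J")
```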

Having determined the base power cost and base energy cost of the unconditional jump instruction,
a similar analysis is performed on all of the other instructions within the instruction set. For each
program, the total number of instructions executed within the 100 ms measurement window
(INST_TOTAL) is calculated as the product of the number of instructions per loop and the number of
loops (NUM_LOOP) executed within the 100 ms measurement window. The measured values for I
and V are used to calculate P_TOTAL. However, this P_TOTAL also includes the power associated with
the unconditional jumps. Therefore, the true total power, base power cost, and base energy cost for
the instruction are then calculated as

       P_TRUE = P_TOTAL - (NUM_LOOP × P_JUMP)
       P_BASE = P_TRUE / INST_TOTAL
       E_BASE = P_BASE × N × τ
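
The three equations above translate directly into a few lines of Python. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the numeric inputs are invented for illustration rather than taken from the measurements.

```python
def base_costs(current_a: float, voltage_v: float,
               instr_per_loop: int, num_loops: int,
               p_jump_w: float, machine_cycles: int, clock_period_s: float):
    """Return (true total power, base power cost, base energy cost) for one instruction."""
    p_total = current_a * voltage_v            # measured total power
    inst_total = instr_per_loop * num_loops    # instructions executed in the window
    p_true = p_total - num_loops * p_jump_w    # remove the jump contribution
    p_base = p_true / inst_total
    e_base = p_base * machine_cycles * clock_period_s
    return p_true, p_base, e_base


# Hypothetical inputs: 20 mA at 5 V, 100 instructions per loop, 1000 loops,
# a jump power cost of 1e-6 W, a 2-cycle instruction, 500 ns clock period.
p_true, p_base, e_base = base_costs(0.020, 5.0, 100, 1000, 1e-6, 2, 500e-9)
print(f"{p_true:.4f} W", f"{p_base:.3e} W", f"{e_base:.3e} J")
```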

It is important to note that because energy consumption is a function of power consumption,
minimizing the energy consumption of an embedded system will also minimize the power
consumption of that system. We therefore focus on minimization of system energy consumption.

3 Results

The CMD11E1 development board for the Motorola 68HC11E1 (manufactured by Axiom
Manufacturing) and the SBC-1 development board for the Intel 80C51 (manufactured by
Industrologic, Inc.) were used to collect data. The VCC pin on each processor was lifted from its
socket to allow for the connection of an ammeter in series with the circuit to measure the average
current draw over a 100 ms time window. A voltmeter was also connected between the processor's
VCC and GND pins to measure the average processor voltage over a 100 ms time window. The
Fluke 45 Dual Display Multimeter was used to perform both the current and voltage measurements.

3.1 General Results

The measurements and calculations described in the previous section were performed for the
68HC11 and 80C51 microprocessors. Figure 1 details the base energy cost per instruction for the
68HC11. Similarly, Figure 2 details the base energy cost per instruction for the 80C51. Note that
for all conditional branch instructions, the value shown reflects the higher result of comparing the
success and failure cases.

[Figure 1 plot: Energy (10^-14 Watt seconds, logarithmic axis, 1 to 10000) versus Clock Cycles Per Instruction (2 to 7).]

Figure 1. 68HC11 Energy per instruction, sorted by clock cycles per instruction.

From Figure 1 it is clear that 68HC11 energy consumption increases in a logarithmic manner as the
clock cycles per instruction increase. Figure 1 also demonstrates that 68HC11 instructions with the
same number of clock cycles per instruction exhibit similar energy consumption
characteristics. The energy consumption values of Figure 1 are also significant in that they are
easily adaptable for use as cost values. These cost values will be used by the compiler as
guidelines toward achieving software implementations that are optimized to reduce energy
consumption as described in [11, 13, 14, 22, 23].

Figure 2 demonstrates that 80C51 energy consumption increases in a logarithmic manner as the
clock cycles per instruction increase. Figure 2 also demonstrates that 80C51 instructions with the
same number of clock cycles per instruction exhibit similar energy consumption
characteristics. Much like the values detailed in Figure 1, the energy consumption values of Figure
2 are easily adaptable for use as cost values to be used by the compiler as guidelines toward
achieving software implementations that are optimized to reduce energy consumption.

[Figure 2 plot: Energy (10^-14 Watt seconds, logarithmic axis, 10 to 10000) versus Clock Cycles Per Instruction (12, 24, 48).]

Figure 2. 80C51 Energy per instruction, sorted by clock cycles per instruction.

3.2 68HC11 Architecture-Specific Results

Figure 3 compares the average energy consumption of instructions targeting Accumulator A versus
Accumulator B for each memory addressing mode of the 68HC11. While the choice of
accumulator has only a minor impact on average energy consumption, using Accumulator B does
result in a slight reduction in energy consumption for all memory addressing modes except Indexed
X mode, with Immediate mode exhibiting the greatest reduction in energy, 2.93 % on average.
Therefore, it is recommended that Accumulator B be used when operating in memory addressing
modes other than Indexed X mode, where Accumulator A should be used.

[Figure 3 plot: Average Energy Consumption (10^-14 Watt seconds, 0 to 100) by Addressing Mode (Direct, Extended, Immediate, Indexed X, Indexed Y, Inherent), with series Accumulator A and Accumulator B.]

Figure 3. 68HC11 Average energy per instruction, sorted by accumulator A and B usage.

Table 1 compares the energy costs for performing double precision operations using either one
68HC11 double precision instruction or two 68HC11 single precision instructions. In the case of
arithmetic operations - addition and subtraction - the use of two single precision instructions results
in a reduction of energy for all memory addressing modes except Indexed Y mode. However, these
energy totals do not account for inter-instruction energy costs. Given that the energy reduction
using two single precision instructions is greater than 10 % when operating in Direct, Extended, or
Immediate memory addressing modes, it is expected that this form of implementation will still
result in an energy reduction after accounting for inter-instruction energy costs. Using two single
precision instructions when operating in Indexed X memory addressing mode exhibits only a 6.18
% energy reduction and it is likely that this reduction will be negated after accounting for inter-
instruction energy costs. Therefore, it is recommended that two single precision arithmetic
instructions be used when operating in memory addressing modes other than Indexed X and
Indexed Y modes, for which one double precision arithmetic instruction should be used.

When considering memory access operations - load and store - the use of one 68HC11 double
precision instruction results in a substantial reduction of energy consumption for all memory
addressing modes except Immediate mode. This energy savings peaks at 31.35 % when loading
data from memory in Indexed Y mode and nearly all of the other memory access operations exhibit
an energy reduction of over 10 %, a characteristic that is expected to improve further when inter-
instruction energy costs are considered for the cases involving two single precision instructions.
Similarly, it is expected that even Immediate mode memory access operations will exhibit an
energy reduction when using one double precision instruction when inter-instruction energy costs
are considered. Therefore, it is recommended that one double precision memory access instruction
be used for all memory addressing modes.

Table 1. 68HC11 Double precision versus single precision energy cost (10^-14 Watt seconds).

       Operation     Direct   Extended   Immediate   Indexed X   Indexed Y
       ADDD            78.2      109.5        48.4       118.1       167.8
       ADDB, ADCA      52.9       93.3        20.5       103.9       175.6
       SUBD            78.3      109.6        48.7       125.1       167.3
       SUBB, SBCA      53.5       95.7        22.8       117.4       175.6
       LDD             45.9       72.5        24.7        85.8       120.2
       LDAB, LDAA      52.1       94.9        23.9       116.6       175.1
       STD             52.1       79.0           •        91.5       128.1
       STAB, STAA      57.2       99.0           •       120.1       180.4
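
The comparison in Table 1 can be reproduced with a short sketch. The Python below stores the table values and reports, for a given operation and addressing mode, which form has the lower base energy and by what percentage. It compares base costs only and does not model inter-instruction costs, so its answer can differ from the recommendations in the text (for example, arithmetic in Indexed X mode). The dictionary layout and function name are assumptions made for illustration.

```python
MODES = ["Direct", "Extended", "Immediate", "Indexed X", "Indexed Y"]

# (one double precision instruction, two single precision instructions),
# in units of 10^-14 Watt seconds. None marks the cells with no value in Table 1.
TABLE_1 = {
    "add":   ([78.2, 109.5, 48.4, 118.1, 167.8], [52.9, 93.3, 20.5, 103.9, 175.6]),
    "sub":   ([78.3, 109.6, 48.7, 125.1, 167.3], [53.5, 95.7, 22.8, 117.4, 175.6]),
    "load":  ([45.9, 72.5, 24.7, 85.8, 120.2], [52.1, 94.9, 23.9, 116.6, 175.1]),
    "store": ([52.1, 79.0, None, 91.5, 128.1], [57.2, 99.0, None, 120.1, 180.4]),
}


def cheaper_form(operation: str, mode: str):
    """Return (cheaper form, percent saved versus the costlier form), or None if no data."""
    double_row, single_row = TABLE_1[operation]
    i = MODES.index(mode)
    double, single = double_row[i], single_row[i]
    if double is None or single is None:
        return None
    if double <= single:
        return "double", 100.0 * (single - double) / single
    return "two-single", 100.0 * (double - single) / double


form, saved = cheaper_form("load", "Indexed Y")
print(form, f"{saved:.2f} %")
```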

3.3 80C51 Architecture-Specific Results

Table 2 compares the energy costs for performing a load/store operation using the 80C51
accumulator as an intermediate storage location versus performing a direct load/store operation
between registers and memory. In all cases, using the accumulator as an intermediate storage
location results in a substantial reduction of energy consumption even though both forms of the
operation require the same number of clock cycles. This energy savings peaks at 49.24 % while all
cases exhibit an energy reduction of over 46 %. It is expected that this form of implementation will
continue to result in a significant energy reduction even after accounting for inter-instruction
energy costs. Therefore, it is recommended that the accumulator be used as an intermediate storage
location for register-based load/store operations.

Table 2. 80C51 Energy cost for intermediate accumulator usage versus direct load/store to memory
(10^-14 Watt seconds).

       Operation                     Clock Cycles   Energy
       MOV 23H, R1                        24         185.4
       MOV A, R1 → MOV 23H, A             24         100.0
       MOV R1, 23H                        24         189.8
       MOV A, 23H → MOV R1, A             24          99.8
       MOV 23H, @R1                       24         188.0
       MOV A, @R1 → MOV 23H, A            24         101.1
       MOV @R1, 23H                       24         196.8
       MOV A, 23H → MOV @R1, A            24          99.9
       MOV 24H, #12H                      24         192.8
       MOV A, #12H → MOV 24H, A           24         103.4

Table 3 compares the energy costs for performing a logical operation using the 80C51 accumulator
as an intermediate storage location versus performing a logical operation directly upon a memory
location. In all cases, using the accumulator as an intermediate storage location results in a
substantial reduction of energy consumption despite the fact that accumulator use increases the
number of clock cycles required to complete the operation. This energy savings peaks at 20.58 %
while all cases exhibit an energy reduction of over 20 %. It is expected that this form of
implementation will continue to result in a significant energy reduction even after accounting for
inter-instruction energy costs. Therefore, it is recommended that the accumulator be used as an
intermediate storage location for logical operations as opposed to performing these operations
directly upon a memory location.

Table 3. 80C51 Energy cost for intermediate accumulator usage versus direct operation on memory
operands (10^-14 Watt seconds).

       Operation                                  Clock Cycles   Energy
       ANL 23H, #22H                                   24         193.6
       MOV A, 23H → ANL A, #22H → MOV 23H, A           36         154.1
       ORL 23H, #22H                                   24         194.6
       MOV A, 23H → ORL A, #22H → MOV 23H, A           36         154.6
       XRL 23H, #22H                                   24         195.3
       MOV A, 23H → XRL A, #22H → MOV 23H, A           36         155.1
3.4 Recommendations for Power and Energy Optimization

Based on the results detailed in the previous section, the following is a list of recommendations for
use in creating a compiler that optimizes for power and energy consumption when targeting the
68HC11 microprocessor:

       • Use the measured base energy cost for each instruction as the instruction cost value during
         compilation. Instructions will be selected by the compiler to minimize the total program
         cost, resulting in minimized power and energy consumption.
       • Use Accumulator B when operating in memory addressing modes other than Indexed X
         mode.
       • Use Accumulator A when operating in Indexed X memory addressing mode.
       • Use two single precision arithmetic instructions instead of one double precision arithmetic
         instruction when operating in memory addressing modes other than Indexed X and Indexed
         Y modes.
       • Use one double precision arithmetic instruction instead of two single precision arithmetic
         instructions when operating in Indexed X and Indexed Y memory addressing modes.
       • Use one double precision memory access instruction instead of two single precision
         memory access instructions for all memory addressing modes.
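
One way a compiler back end might encode the 68HC11 recommendations listed above is as a small rule lookup. The sketch below is hypothetical: the function names, the "arithmetic" and "memory" operation classes, and the returned labels are assumptions for illustration and do not describe an existing compiler.

```python
# Addressing modes for which one double precision arithmetic instruction is recommended.
DOUBLE_ARITH_MODES = {"Indexed X", "Indexed Y"}


def recommend_accumulator(mode: str) -> str:
    """Accumulator A for Indexed X mode, Accumulator B for all other modes."""
    return "A" if mode == "Indexed X" else "B"


def recommend_width(op_class: str, mode: str) -> str:
    """Return 'double' or 'two-single' for an operation class and addressing mode."""
    if op_class == "memory":
        # Load/store: one double precision instruction for every addressing mode.
        return "double"
    if op_class == "arithmetic":
        return "double" if mode in DOUBLE_ARITH_MODES else "two-single"
    raise ValueError(f"unknown operation class: {op_class}")


print(recommend_accumulator("Direct"), recommend_width("arithmetic", "Direct"))
```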

When targeting the 80C51 microprocessor, the following is a list of recommendations for use in
creating a compiler that optimizes for power and energy consumption:

       • Use the measured base energy cost for each instruction as the instruction cost value during
         compilation. Instructions will be selected by the compiler to minimize the total program
         cost, resulting in minimized power and energy consumption.
       • Use the accumulator as an intermediate storage location for register-based load/store
         operations to/from memory.
       • Use the accumulator as an intermediate storage location for logical operations to be
         performed on the contents of a memory location.

4 Conclusions

This study is a first step toward the realization of software compilers that solve the challenging
problem of realizing power-efficient and high-performance software implementations when
targeting embedded systems that are low-power and low-cost. Representative embedded systems
targeting the 68HC11 and 80C51 microprocessors were examined, and measured power and energy
consumption data for the microprocessors was presented. This data led to a set of
recommendations for use in developing power-optimized compilers that are expected to generate
software that will address the current and future power and energy consumption requirements of
embedded systems employing microprocessors such as the 68HC11 and the 80C51.

5 References

1.    D. Estrin, R. Govindan, and J. Heidemann, “Embedding the Internet,” Communications of the ACM, vol. 43, pp.
      39-41, May 2000.
2.    D. Tennenhouse, “Proactive Computing,” Communications of the ACM, vol. 43, pp. 43-50, May 2000.
3.    A. Cataldo, “Chip Makers Detect the Scent of Recovery,” Electronic Engineering Times, March 20 2002.
4.    P. Krill, “Panel Calls for Chip Standards,” InfoWorld, May 1 2002.
5.    S. Ohr, “Power-Hungry Portables Eye Voltage Regulator,” Electronic Engineering Times, pp. 55-57, July 8 2002.
6.    F. Yang, “Samueli Predicts Comms Design Future,” Electronic Engineering Times, March 15 2002.
7.    M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, “Power Analysis and Low-Power Scheduling Techniques for
      Embedded DSP Software,” in Proceedings of the Eighth International Symposium on System Synthesis, (Cannes,
      France), pp. 110-115, September 13-15 1995.
8.    V. Tiwari and M. T.-C. Lee, “Power Analysis of a 32-Bit Embedded Microcontroller,” in Proceedings of the 1995
      Asia and South Pacific Design Automation Conference – ASP-DAC 1995, (Chiba, Japan), pp. 141-148, August
      1995.
9.    M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, “Power Analysis and Minimization Techniques for Embedded
      DSP Software,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, pp. 123-135, March
      1997.
10.   V. Tiwari, S. Malik, and A. Wolfe, “Power Analysis of the Intel 486DX2,” Tech. Rep. CE-M94-5, Princeton
      University Department of Electrical Engineering, June 1994.
11.   V. Tiwari, S. Malik, and A. Wolfe, “Compilation Techniques for Low Energy: An Overview,” in Proceedings of
      the 1994 IEEE Symposium on Low Power Electronics, (Piscataway, New Jersey, USA), pp. 38-39, October 10-12
      1994.
12.   V. Tiwari, T.-C. Lee, M. Fujita, and D. Maheshwari, “Power Analysis of the SPARClite MB86934,” Tech. Rep.
      FLA-CAD-94-01, Fujitsu Labs of America, August 1994.
13.   V. Tiwari, S. Malik, and A. Wolfe, “Power Analysis of Embedded Software: A First Step Towards Software
      Power Minimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, pp. 437-445,
      December 1994.
14.   V. Tiwari, S. Malik, A. Wolfe, and M. T.-C. Lee, “Instruction Level Power Analysis and Optimization of
      Software,” in Proceedings of the Ninth International Conference on VLSI Design, (Bangalore, India), pp. 326-328,
      January 3-6 1996.
15.   N. Chang, K. Kim, and H. G. Lee, “Cycle-Accurate Energy Consumption Measurement and Analysis: Case Study
      of ARM7TDMI,” in Proceedings of the 2000 International Symposium on Low Power Electronics and Design –
      ISLPED 2000, (Portofino Coast, Italy), pp. 185-190, July 26-27 2000.
16.   N. Chang, K. Kim, and H. G. Lee, “Cycle-Accurate Energy Measurement and Characterization With a Case Study
      of the ARM7TDMI,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, pp. 146-154,
      April 2002.
17.   F. N. Najm, “A Survey of Power Estimation Techniques in VLSI Circuits,” IEEE Transactions on Very Large
      Scale Integration (VLSI) Systems, vol. 2, pp. 446-455, December 1994.
18.   C.-L. Su, C.-Y. Tsui, and A. M. Despain, “Saving Power in the Control Path of Embedded Processors,” IEEE
      Design & Test of Computers, vol. 11, pp. 24-30, Winter 1994.
19.   N. Julien, J. Laurent, E. Senn, and E. Martin, “Power Estimation of a C Algorithm Based on the Function-Level
      Power Analysis of a Digital Signal Processor,” in Proceedings of the Fourth International Symposium on High
      Performance Computing – ISHPC 2002, vol. 2327, pp. 354-360, 2002.
20.   S. Nikolaidis and T. Laopoulos, “Instruction-Level Power Consumption Estimation of Embedded Processors for
      Low-Power Applications,” Computer Standards & Interfaces, vol. 24, no. 2, pp. 133-137, 2002.
21.   C.-L. Su, C.-Y. Tsui, and A. M. Despain, “Low Power Architecture Design and Compilation Techniques for High-
      Performance Processors,” Compcon, pp. 489-498, Spring 1994.
22.   V. Tiwari, S. Malik, A. Wolfe, and M. T.-C. Lee, “Instruction Level Power Analysis and Optimization of
      Software,” Journal of VLSI Signal Processing, vol. 13, pp. 1-18, August 1996.
23.   P. Marwedel, S. Steinke, and L. Wehmeyer, “Compilation Techniques for Energy-, Code-Size-, and Run-Time-
      Efficient Embedded Software,” in Workshop on Advanced Compiler Techniques for High Performance Embedded
      Processors, (Bucharest, Romania), July 18-21 2001.

								