                  Implementation of a CORDIC based double precision floating point unit
                              used in an ASIP based GNSS receiver

                       Götz Kappen, Sofian el Bahri, Oliver Priebe, Tobias G. Noll
          Chair of Electrical Engineering and Computer Systems, RWTH Aachen University, Germany



BIOGRAPHY

Götz KAPPEN received the Dipl.-Ing. degree in 2002 from RWTH Aachen University. Since then he has been working
as a PhD student at the Chair of Electrical Engineering and Computer Systems, RWTH Aachen University. His fields of
research are satellite navigation systems and digital signal processing.

Sofian el BAHRI received the Dipl.-Ing. degree in electrical engineering in 2006 from RWTH Aachen University. His
fields of research are application specific architectures for digital signal processing.

Oliver PRIEBE received the Dipl.-Ing. degree in 2002 from RWTH Aachen University. Since 2004 he has been
working as a scientific co-worker at the Chair of Electrical Engineering and Computer Systems, RWTH Aachen
University. His fields of research are satellite navigation systems and digital signal processing.

Tobias G. NOLL received the Ing. (grad.) degree in Electrical Engineering from the Fachhochschule Koblenz in 1974,
the Dipl.-Ing. degree in Electrical Engineering from the Technical University of Munich in 1982, and the Dr.-Ing.
degree from the Ruhr-University of Bochum in 1989. From 1974 to 1976, he was with the Max-Planck-Institute of
Radio Astronomy, Bonn. From 1976 he was with the Corporate Research and Development Department of Siemens, where
from 1987 he headed a group of laboratories concerned with CMOS circuits for digital signal processing. In 1992, he
joined the RWTH Aachen University where he is a Professor holding the Chair of Electrical Engineering and Computer
Systems. His activities focus on low power deep submicron CMOS architectures, circuits and design methodologies, as
well as digital signal processing for communications and medicine electronics.



1. INTRODUCTION

Current GNSS receivers feature an embedded standard processor (e.g. ARM7) as the central processing unit realizing the
navigation processor functionality (e.g. correlator channel control, position estimation, pseudorange correction and
coordinate transformations). Although the performance of embedded processors keeps increasing, floating point
instructions, which allow for high accuracy and a straightforward implementation of the signal flow developed by the
algorithm designer, still offer low performance as they are emulated in software. For this reason, Digital Signal
Processors (DSPs) as well as General Purpose (GP) processors used in high performance signal processing applications
often realize floating point operations using a dedicated Floating Point Unit (FPU) (e.g. [1], [2]). Recently, GNSS
receivers for space applications [3] or high accuracy multioperable positioning [4] have been assembled using an FPU to
enhance the overall performance.

FPUs generally calculate the specified function within a few clock cycles in parallel to the main processor’s pipeline.
Nevertheless, most embedded microprocessors used in mobile and thus energy critical applications do not feature a
dedicated FPU because of its power and area overhead. In this case floating point arithmetic is emulated in software. In
addition, most embedded processors emulate trigonometric and standard mathematical functions using this floating
point arithmetic. Therefore, an enhanced floating point performance implicitly improves the performance of
trigonometric and standard mathematical functions. Nevertheless, the performance of these functions can be further
improved using dedicated co-processors.

This paper presents a flexible FPU capable of calculating a wide variety of mathematical and floating point functions to
enhance the performance of the Application Specific Instruction Set Processor (ASIP), which forms the central
processing unit of a flexible GNSS receiver architecture. As presented in [5], the target architecture is assembled from
application specific hardware blocks to reduce the energy and area consumption of next generation multioperable GNSS
receivers, while maintaining maximum flexibility. In this approach the standard embedded microprocessor used in most
commercially available receivers is replaced by an ASIP to increase the receiver’s area and energy efficiency, i.e.
Million Operations Per Second (MOPS) per mm² and MOPS per mW.

This paper focuses on the implementation of a configurable co-processor realizing double precision floating point and
trigonometric as well as standard mathematical functions. A library based modular approach allows using a set of
specified co-processor functions without any modification of the receiver’s source code. Due to the configurable
complexity of the co-processor, the area-energy product can be minimized using an application specific subset of the
supported functions.

The rest of the paper is organized as follows: Section 2 sketches the idea of replacing the processor in a standard GNSS
receiver by an ASIP and describes the ASIP’s design flow. Performance analyses of a GNSS receiver algorithm on the
template ASIP show that the real-time constraints cannot be met. A detailed profiling reveals that floating point
instructions as well as trigonometric instructions are performance critical. To reduce the number of processing cycles,
section 3 introduces the implementation and functional verification of the configurable double precision co-processor to
be used in embedded systems. The presented implementation is compared to existing co-processors. The required
accuracy of the floating point representation is estimated and compared to the accuracy achieved by the co-processor.
Section 4 details the coupling of ASIP and co-processor. Emphasis is placed on the required modifications of the ASIP’s
hard- and software, and the approach is compared with existing solutions. Sections 5 and 6 summarize the performance
and cost figures of the ASIP / co-processor architecture. In section 7 the presented work is compared to existing
co-processors. Finally, section 8 concludes the paper.


2. ASIP BASED GNSS RECEIVER

2.1 ASIP Motivation and Design Flow

As described in [6] hardware architectures used for today’s System-on-Chip (SoC) designs face a
power-versus-flexibility conflict. That is, programmable and flexible architectures like for instance GP processors or
DSPs offer the lowest area and power efficiency. On the other hand, dedicated and for that reason fixed hardware
architectures like for instance Application Specific Integrated Circuits (ASICs) offer the highest efficiency in terms of
area and power.

In this context ASIPs offer a trade-off between GP processors and dedicated solutions by exploiting a priori knowledge
about the class of application to be implemented [7]. This knowledge allows for adaptation of the processor’s
instruction set leading to a modified Arithmetic Logic Unit (ALU) as well as pipeline architecture. Additionally, ASIPs
may feature for instance optimized memory architecture as well as external interfaces.

[Figure 1. ASIP design and optimization flow: the architecture description feeds the Processor Designer, which
generates the software development tools (simulator, assembler, linker, compiler) and a VHDL description; the
application (C/C++ code) is compiled to an executable and profiled, while the VHDL description is synthesized to a
gate-level model used for cost evaluation.]
The ASIP’s design and optimization flow is shown in Figure 1 and starts by implementing the target application on a
template processor described in an Architecture Description Language (ADL). In this work the LISA processor
description language [8] is used for description of the ASIP architecture and generation of associated tools and
hardware description files [9]. To allow for meaningful results, the template processor belongs to the same class as the
target processor. Typical classes are for instance Very Long Instruction Word (VLIW) and Single Instruction Multiple
Data (SIMD) architectures, often used for parallel data processing. In this paper a Reduced Instruction Set Computer
(RISC) processor architecture, typical for control functionality and commonly implemented in standard GNSS receivers,
has been chosen as the starting point for the development of the navigation processor.



For software development, simulation and performance evaluation the ASIP development tools (i.e. Processor
Designer) automatically generate assembler, linker and compiler as well as a cycle accurate simulator based on the
ADL. By using this automatic tool generation, information about the costs and benefits of processor extensions can be
derived rapidly. Additionally, hardware description files are generated based on the ADL description. The HDL
description files can be used to synthesize the ASIP for FPGAs or standard cell technologies allowing for real-time
functional verification and analysis of silicon area and power consumption.

The optimization process is based on profiling data of the cycle accurate ASIP model generated throughout the
design flow. The profiling reveals performance critical functions which are potential optimization candidates. During
this optimization process, application specific modifications of the ASIP’s instruction set and architecture can be used
to improve performance and efficiency. Besides instruction set modifications co-processors are another promising
option, as described in [7]. For this paper’s purpose performance constraints are as follows: The ASIP should allow for
real-time position estimation at a position fix rate of 1-10 Hz using at least four satellites in view and a typical GNSS
processor clock frequency of 20 MHz.


2.2 Template ASIP’s Performance Evaluation

Throughout this paper a standard Position, Velocity and Time (PVT) algorithm is used for the performance evaluation
of the ASIP / co-processor architecture. The standard algorithm is further divided into four sub-functions:
     1. Calculation of the satellite position (satellite_position)
     2. Correction of the measured pseudoranges (correction)
     3. Estimation of the receiver position by solving a set of non-linear equations (estimation)
     4. Transformation of Cartesian (ECEF) coordinates to latitude, longitude and height (ecef_to_llh); a sketch of this
        conversion follows the list.
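To illustrate why these sub-functions are dominated by floating point, trigonometric and square root operations, the
following sketch shows a common iterative WGS-84 implementation of the ecef_to_llh conversion (sub-function 4). The
constants, iteration count and test coordinates are textbook / illustrative values and are not taken from the receiver's
source code.

    #include <math.h>
    #include <stdio.h>

    #define WGS84_A   6378137.0             /* semi-major axis [m] */
    #define WGS84_E2  6.69437999014e-3      /* first eccentricity squared */
    #define RAD2DEG   (180.0 / 3.14159265358979323846)

    /* Iterative ECEF -> latitude/longitude/height conversion (WGS-84). */
    static void ecef_to_llh(double x, double y, double z,
                            double *lat, double *lon, double *h)
    {
        double p   = sqrt(x * x + y * y);
        double phi = atan2(z, p * (1.0 - WGS84_E2));        /* initial latitude guess */
        double n   = WGS84_A;

        *lon = atan2(y, x);
        for (int i = 0; i < 5; i++) {                        /* converges in a few iterations */
            double sp = sin(phi);
            n    = WGS84_A / sqrt(1.0 - WGS84_E2 * sp * sp); /* prime vertical radius */
            *h   = p / cos(phi) - n;
            phi  = atan2(z, p * (1.0 - WGS84_E2 * n / (n + *h)));
        }
        *lat = phi;
    }

    int main(void)
    {
        double lat, lon, h;
        ecef_to_llh(4027894.0, 307045.0, 4919705.0, &lat, &lon, &h);  /* illustrative ECEF coordinates */
        printf("lat = %.6f deg, lon = %.6f deg, h = %.1f m\n",
               lat * RAD2DEG, lon * RAD2DEG, h);
        return 0;
    }

Each call requires several sqrt, sin, cos and atan2 evaluations in double precision, which is exactly the class of
operations targeted by the co-processor presented below.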

Executing this standard algorithm on the template processor shows that the overall number of clock cycles required for a
position estimation using 5 satellites is about 78 million and increases if more satellites are visible. This clearly
conflicts with the real-time requirements stated in section 2.1 and prevents using the template ASIP in a GNSS receiver.

Profiling and analysis of the functions executed by the ASIP shows that floating point as well as trigonometric functions
require a significant fraction of the overall number of processing cycles. For a detailed investigation Table 1 compares
standard functions realized on the template ASIP, a LEON processor and an ARM7TDMI. The results were determined
using a cycle accurate LEON [10] and ARM [11] processor simulator, respectively. It can be seen that the template ASIP
requires a significantly larger number of processing cycles for standard operations, leading to a poor overall
performance compared to LEON and ARM7TDMI. This performance drawback is caused by the flexible library concept
used for ASIP development: as the standard and floating point libraries have to be compliant with a wide variety of
processor classes during the development process, the ASIP features generic library implementations offering poor
performance.

                              Table 1. Comparison of standard mathematical functions (clock cycles per call)
                              Function       Template ASIP         LEON            ARM
                                mul              6354                443            73
                                add              3977                988            78
                                sqrt            175445              2265           1243
                                cos             170393             18281           2176
                                sin             154102             18281           2100

This paper aims to improve the performance of the template ASIP by coupling a configurable co-processor supporting
trigonometric, standard mathematical and floating point operations. By offering configurability, the overall receiver
hardware can be optimized for the application code to offer the highest performance at the lowest cost in terms of area
and power consumption.


3. CONFIGURABLE DOUBLE PRECISION CO-PROCESSOR

3.1 CORDIC Algorithm

The co-processor implementation is based on the Coordinate Rotation Digital Computer (CORDIC) algorithm first
developed for real-time navigation. The CORDIC algorithm allows for a hardware efficient way to iteratively calculate
trigonometric functions using only shift and add operations. The unified CORDIC [12] extends the basic principle to
allow for calculation of various mathematical functions.

The basic principle behind CORDIC is the rotation of two input values interpreted as vector components. In each
iteration step the input vector is rotated by a predefined, decreasing angle. There are two different modes of operation.
In the rotation mode the input vector is rotated by the angular input value, which is driven towards zero. In contrast,
the vector mode minimizes one of the vector components so that the output vector coincides with a coordinate axis. The
vector mode, for instance, allows for a simple calculation of the input vector’s length.

The following equations describe the unified CORDIC algorithm presented in [12] for the vector components (x, y) and
the angle z:

         x_{i+1} = x_i - m \cdot d_i \cdot y_i \cdot 2^{-i}
         y_{i+1} = y_i + d_i \cdot x_i \cdot 2^{-i}
         z_{i+1} = z_i - d_i \cdot e_i

Here, m specifies the coordinate system (1, 0, -1 for circular, linear and hyperbolic, respectively) and e_i is the
elementary angle at iteration step i. The decision variable d_i is determined in each step depending on the mode of
operation:
         d_i = \begin{cases} -\operatorname{sign}(y_i) & \text{for vector mode} \\ +\operatorname{sign}(z_i) & \text{for rotation mode} \end{cases}
In contrast to the integer implementation introduced in [13], the above equations have to be performed in floating point
arithmetic.
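The iteration structure can be illustrated with the following minimal sketch of the circular (m = 1) rotation mode,
which computes sine and cosine. It is written in plain double precision C for clarity and is not the bit-true hardware
model; the iteration count, scaling constant handling and convergence criterion of the actual co-processor differ (see
section 3.2).

    #include <math.h>
    #include <stdio.h>

    #define N_ITER 40                       /* illustrative; roughly 1 bit of accuracy per iteration */

    /* Circular (m = 1) rotation mode: drives z towards 0; with x0 = K, y0 = 0
     * the outputs converge to cos(z0) and sin(z0). Vector mode (not shown)
     * would instead use d = -sign(y) to drive y towards 0. */
    static void cordic_sin_cos(double z0, double *c, double *s)
    {
        double k = 1.0;                     /* gain compensation K = prod 1/sqrt(1 + 2^-2i) */
        for (int i = 0; i < N_ITER; i++)
            k /= sqrt(1.0 + ldexp(1.0, -2 * i));

        double x = k, y = 0.0, z = z0;
        for (int i = 0; i < N_ITER; i++) {
            double d  = (z >= 0.0) ? 1.0 : -1.0;   /* d_i = sign(z_i) in rotation mode */
            double p  = ldexp(1.0, -i);            /* 2^-i, a shift in a fixed point unit */
            double xn = x - d * y * p;             /* x_{i+1} = x_i - m*d_i*y_i*2^-i, m = 1 */
            double yn = y + d * x * p;             /* y_{i+1} = y_i + d_i*x_i*2^-i */
            z        -= d * atan(p);               /* z_{i+1} = z_i - d_i*e_i, e_i = atan(2^-i) */
            x = xn;
            y = yn;
        }
        *c = x;
        *s = y;
    }

    int main(void)
    {
        double c, s;
        cordic_sin_cos(0.5, &c, &s);
        printf("cos(0.5) = %.9f  sin(0.5) = %.9f\n", c, s);  /* reference: 0.877582562, 0.479425539 */
        return 0;
    }

With m = 0 (linear coordinate system) the same structure yields multiplication and division, and with m = -1
(hyperbolic) square roots, which corresponds to the functions listed in Table 2.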


3.2 Implementation

The co-processor presented in this paper is based on an iterative implementation of the CORDIC equations [13]. This is
mainly due to the fact that the co-processor does not require high throughput rates and that the non-continuous
co-processor access prevents reasonable pipelining.

The schematic of the iterative CORDIC implementation is shown in Figure 2. After a mode dependent initialisation the
parameters required for the following CORDIC iterations are determined. The CORDIC calculation block implements
the unified CORDIC equations in double precision IEEE 754 compliant arithmetic. Iterations are performed until a
predefined accuracy or a maximum number of iteration steps is reached. The CORDIC calculation block shown in
Figure 2 is realized using three pipeline stages to increase the operating frequency.

[Figure 2. Schematic of the co-processor architecture: a mode dependent Init block feeds the CORDIC core, which
consists of Parameter Determination, the CORDIC Calculation loop (running for #lat_cyc cycles) and a Convergence
Check, followed by mode dependent Result Postprocessing.]
In this work two different versions of the CORDIC co-processor have been developed to trade off co-processor area
versus latency. The difference between these two implementations concerns the realization of the CORDIC calculation
block: to reduce the required area, the serial version solves the CORDIC equations successively for the parameters x, y
and z, whereas the parallel version solves them concurrently. In addition to the CORDIC mode and coordinate system,
Table 2 summarizes the latency values of the serial and parallel FPU implementations for each supported function.




                 Table 2. Co-processor instruction latency in clock cycles for the serial (#lat_ser) and parallel (#lat_par) implementation
                 Operation          Coord. system   Mode        #lat_ser    #lat_par
                 sin(x), cos(x)     circular        rotation    212         159
                 arcsin(x)          circular        vector      412         309
                 arctan2(x)         circular        vector      208         156
                 x·y, x+y           -               -           5           3
                 x/y                circular        vector      208         156
                 sqrt(x)            hyperbolic      vector      220         165



3.3 Required Precision

To get a first idea of the required co-processor accuracy, the estimation function has been implemented using a generic
floating point format with variable mantissa width. For this purpose, ideal measurement data is generated by choosing a
random position on the earth’s surface and computing a variable number of visible satellites based on almanac data.
Using these ideal input values the estimation function computes the receiver’s position using the generic floating point
format. Based on these assumptions, Figure 3 shows the probability that a specified accuracy (i.e. 1 m, 2 m, 5 m and
10 m) is achieved with a predefined mantissa bit width. For this simulation and ideal input data, 30 mantissa bits are
sufficient to achieve an accuracy of less than 1 m.
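A simple way to reproduce such a sweep is to round intermediate results to a reduced mantissa width. The following
sketch shows one possible quantization helper based on IEEE 754 doubles; it is an illustrative reconstruction under the
stated assumptions and not the authors’ bit-true simulation model.

    #include <math.h>
    #include <stdio.h>

    /* Keep only 'bits' mantissa bits of x (round to nearest), 1 <= bits <= 52. */
    static double quantize_mantissa(double x, int bits)
    {
        if (x == 0.0 || !isfinite(x))
            return x;
        int e;
        double m = frexp(x, &e);            /* x = m * 2^e with 0.5 <= |m| < 1 */
        double scale = ldexp(1.0, bits);    /* 2^bits */
        return ldexp(round(m * scale) / scale, e);
    }

    int main(void)
    {
        /* Example: quantizing a pseudorange-sized value shows how the
         * representable resolution shrinks as the mantissa gets shorter. */
        double rho = 21436789.123;          /* metres, roughly a GPS pseudorange */
        for (int bits = 18; bits <= 30; bits += 4)
            printf("%2d mantissa bits: representation error = %8.3f m\n",
                   bits, fabs(rho - quantize_mantissa(rho, bits)));
        return 0;
    }

In the sweep used for Figure 3 such a reduced precision format would be applied to all intermediate results of the
estimation function rather than to a single value.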

[Figure 3. Probability of achieving a specified accuracy (< 1 m, < 2 m, < 5 m, < 10 m) as a function of the number of
mantissa bits (18 to 30).]


3.4 Functional Verification and Achieved Accuracy

The co-processor has been coupled to a reference processor for the functional verification of the co-processor coupling
as well as for accuracy estimations. For this purpose a NIOS 2 processor, which is frequently used in FPGA based SoCs,
is employed.

To estimate the accuracy, values with random exponent and mantissa have been generated and used as inputs for each
supported co-processor function. The result of the co-processor computation has been compared to the solution
determined by the NIOS 2 software emulation. Figure 4 shows the minimum number of equal mantissa bits for these
two implementations. It can be seen that for a reasonable input range the software emulated and co-processor based
floating point calculations differ by less than 4 mantissa bits. A degradation of the accuracy by more than 3-4 mantissa
bits is mostly caused by input range restrictions of the software and co-processor implementations.
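The “number of equal mantissa bits” metric can be reproduced, for example, by comparing the mantissa fields of the two
IEEE 754 results bit by bit. The following sketch assumes both results share the same sign and exponent and is an
illustrative reconstruction, not the verification code used by the authors.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Count how many leading mantissa bits of two doubles agree, assuming
     * both values already share the same sign and exponent field. */
    static int equal_mantissa_bits(double a, double b)
    {
        uint64_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        uint64_t diff = (ua ^ ub) & 0x000FFFFFFFFFFFFFULL;  /* 52 bit mantissa field */
        int n = 0;
        for (int bit = 51; bit >= 0 && !(diff & (1ULL << bit)); bit--)
            n++;
        return n;                                           /* 52 means the mantissas are identical */
    }

    int main(void)
    {
        double sw = sin(0.75);                  /* stand-in for the software emulated result */
        double cp = sin(0.75) * (1.0 + 1e-15);  /* stand-in for the co-processor result */
        printf("equal leading mantissa bits: %d\n", equal_mantissa_bits(sw, cp));
        return 0;
    }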

It can be seen that the co-processor accuracy satisfies the demands of GNSS receivers stated in section 3.3. As the
CORDIC based co-processor implementation gains about 1 bit of accuracy per iteration, trading the co-processor’s
latency against accuracy could be a promising approach to further minimize the application’s execution time.




[Figure 4. Comparison of software emulated and co-processor mantissa accuracy: number of equal mantissa bits versus
exponent value for cos, arcsin, arctan2, multiplication, division and sqrt.]


3.5 Configurations

The complexity of the co-processor, that is, the set of supported functions, can be specified at compile time to match the
considered application. For instance, a co-processor implementation supporting only sine and cosine functions could be
an interesting option if the implemented algorithm makes extensive use of these trigonometric functions. On the other
hand, a co-processor featuring just floating point additions and multiplications could be promising for floating point
matrix computations. Functions which are not supported by the co-processor are implemented using the ASIP’s
libraries.

                             Table 3. Area Comparison of Co-Processor Configurations
                            Functions                                                                                         Parallel                                                  Serial
                            sin, cos                                                                                          1977 ALUTs                                                2177 ALUTs
                            + [atan, atan2]                                                                                   2013 ALUTs                                                2262 ALUTs
                            + [asin, div, add]                                                                                4779 ALUTs                                                2839 ALUTs
                            + [sqrt]                                                                                          7936 ALUTs                                                6140 ALUTs
                            + [mul]                                                                                           7944 ALUTs                                                6145 ALUTs


Based on a detailed cost-benefit analysis, an application specific subset of co-processor functions can be selected to
maximize the area and power efficiency of the overall design. Table 3 summarizes possible configurations and the
required area for the parallel and serial CORDIC implementations using an Altera Stratix II FPGA (i.e.
EP2S60F1020C3). As can be seen, the maximal and minimal area requirements of the co-processor differ by a factor of
approximately 2.8 for the serial and approximately 4 for the parallel implementation. For FPGA synthesis the
replacement of multiplier logic by dedicated internal multipliers has been turned off. If dedicated multipliers are used,
the values of the last two configurations are reduced by approximately 2500 ALUTs.


4. ASIP / CO-PROCESSOR COUPLING

This section deals with coupling aspects of ASIP and co-processor in the hardware and software domain. The main goal
of this work is the optimization of floating point and trigonometric as well as standard mathematical library functions.
These optimizations will target the ASIP’s mathematical and standard libraries to allow for analysis of various
co-processor configurations without the need for any application source code modifications.

4.1 Hardware Modifications

The template RISC processor used as a starting point for the ASIP development features a five-stage pipeline, 16 general
purpose registers and a load / store architecture.
The applied hardware modifications belong to one of the following groups:
    1. Interface extension
    2. Instruction set extension
    3. Processor control flow modifications



To realize the coupling of ASIP and co-processor, additional input and output ports have been added to the ADL
processor description, including two arguments (i.e. arg1, arg2), one result (res) as well as control ports (mode,
hold_proc, clk_en, result_ready). In a next step the interface instructions are implemented. One instruction (funcarg)
realizes the output of control signals and arguments; this instruction is present for each supported co-processor function
(func). The instruction (getresult) reads the result data after sensing a logically high result_ready signal. Each of these
instructions is added as a new leaf in the coding tree. A typical co-processor access begins by transmitting the arguments
and function specific control signals and enabling the co-processor (clk_en = '1'). In the following processing cycle the
co-processor sets the hold_proc signal and the ASIP stops the execution of its pipeline. After the co-processor has
finished the calculation, the ASIP is reactivated by setting hold_proc = '0' and the results are read using the added
special instruction getresult.

By adding the co-processor interface the number of required FPGA resources increases while the maximum clock
frequency remains nearly constant (Table 4).
                                  Table 4. Area overhead for co-processor coupling
                                                              ALUTs                     Frequency
                          Template Processor                  3666                      63.26 MHz
                          + interface                         5081                      58.85 MHz


4.2 Co-Processor Access

A co-processor access (Figure 5) starts by setting the mode signal for the specific instruction and transferring one or two
input arguments (i.e. arg1 and arg2). In the following cycle the co-processor activates the ASIP’s sleep mode by setting
the proc_sleep signal and begins the CORDIC iteration process. After a predefined accuracy or a maximum number of
iteration steps is reached, the co-processor sends the result back to the ASIP and releases the proc_sleep signal.
[Figure 5. Co-processor access timing diagram: mode, arg1 and arg2 are applied while clk_en is asserted, proc_sleep is
held for #lat_cyc cycles during the CORDIC iterations, and res is returned together with res_ready.]
4.3 Software Modifications

As mentioned above, the main goal is to avoid source code modifications. Therefore, the co-processor instructions are
inserted into the ASIP’s standard libraries. Using the simple plug-and-play approach shown in Figure 6, functions can be
implemented using the co-processor or using software emulation based on the ASIP’s generic libraries.
[Figure 6. Library modifications: application.c calls a = sin(b) via #include <math.h>; math.h declares the standard
prototypes (double sin(double), double cos(double), double atan2(double, double), ...); sin.c contains either the generic
C implementation or the co-processor wrapper double sin(double x) { sinarg(x); return getresult; }.]



Figure 6 shows this approach exemplarily for a sine calculation called in the application domain. As for standard
processors, the function call is resolved using the standard mathematical libraries (i.e. math.h). Depending on the
selection made in the ASIP’s definition file, the toolchain selects the software emulation or the co-processor
implementation during the linking phase.
This approach has been used to include different subsets of co-processor functions in the compilation of the PVT
estimation algorithm and hence allows for a comparison of the performance achieved with different co-processor
configurations.
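The plug-and-play idea can be mimicked on a host PC with a build-time switch; the following sketch is purely
illustrative (the macro USE_COPROCESSOR, the wrapper name my_sin and the mocked sinarg / getresult are
hypothetical stand-ins for the definition file entry and the co-processor intrinsics shown in Figure 6).

    #include <math.h>
    #include <stdio.h>

    #ifdef USE_COPROCESSOR
    /* Mocked co-processor access: in the real system sinarg would issue the
     * funcarg-type instruction and getresult would read res once res_ready
     * is asserted; here both are plain software stand-ins. */
    static double cp_result;
    static void   sinarg(double x)  { cp_result = sin(x); }
    static double getresult(void)   { return cp_result; }

    static double my_sin(double x)  { sinarg(x); return getresult(); }
    #else
    /* Generic library emulation path. */
    static double my_sin(double x)  { return sin(x); }
    #endif

    int main(void)
    {
        /* The application code stays identical for both configurations. */
        printf("sin(0.5) = %f\n", my_sin(0.5));
        return 0;
    }

In the actual toolchain the selection is made once in the ASIP’s definition file and resolved during linking, so the
application code (a = sin(b);) never changes.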

5. ASIP / CO-PROCESSOR PERFORMANCE

After implementation of the hardware and software modifications described in section 4, the PVT estimation’s source
code can be compiled for the ASIP / co-processor architecture. For functional verification and real-time implementation,
as well as for the determination of processing cycles, the ASIP’s hardware description files are generated as described in
section 2.1. Finally, the ASIP is connected to the co-processor component as well as program and data memory, and the
compiled executable is loaded into the program memory. The execution time as well as the number of co-processor
active cycles required for the power estimation can be determined using ModelSim.

The standard navigation algorithm used as an example throughout this paper has been described in section 2.2. Figure 7
shows the reduction of processing cycles per function call that is achieved using the different co-processor configurations.
[Figure 7. Processing cycles of the standard navigation receiver functions (estimation, satellite_position, correction,
ecef_to_llh) per function call, on a logarithmic scale from 1e4 to 1e8 cycles, for the template ASIP and the co-processor
configurations [sin, cos], + [atan, atan2], + [asin, div, add], + [sqrt] and + [mul].]
The gain achievable by coupling ASIP and co-processor is clearly related to the fraction of trigonometric and floating
point instructions. Here, reductions of one (ecef_to_llh) to two (estimation) orders of magnitude compared to the
template ASIP are reached.


6. AE COST COMPARISONS

For the standard cell implementation, costs and performance can be visualized using an Area-Energy (AE) diagram.
Figure 8 shows each co-processor configuration classified by the energy consumption per function call and the area
required for an implementation using a 90 nm standard cell technology.
[Figure 8. AE cost comparisons for the co-processor configurations: energy per call [mJ] versus area [mm²] with
A*E = const contours, for the estimation function (left) and the correction function (right).]



For the estimation function (left) one can observe reduced AE costs for each of the co-processor configurations. In
contrast, only the full featured co-processor offers a slightly improved AE product for the implementation of the
correction function (right). These two functions clarify the importance of choosing an application specific subset of
co-processor functions to improve the overall costs and performance.

Finally Figure 9 compares the AE costs for the overall PVT algorithm (i.e. correction, estimation, satellite_position,
ecef_to_llh).
[Figure 9. AE cost comparison for the overall PVT algorithm, scaled A*E normalized to the template ASIP: template
ASIP 100, [sin, cos] 67.1, + [atan, atan2] 65.4, + [asin, div, add] 47.8, + [sqrt] 79.1, + [mul] 14.1.]


7. RELATED WORK

There is a large number of floating point co-processors designed to enhance the performance of embedded
microprocessors. Prominent examples are presented in [14] and [15]. Most of these approaches realize standard floating
point arithmetic including addition, subtraction, multiplication and division based on double or single precision
arithmetic. In contrast to most FPUs, [16] introduces a CORDIC based double precision unit, which additionally
supports trigonometric and standard mathematical functions, to enhance the performance of a standard processor in an
SoC.

Finally, [17] describes the coupling of a configurable and synthesizable LEON processor with a CORDIC based single
precision floating point unit. The main difference between the approach in [17] and this work concerns the
implementation of the FPU as well as the processor / FPU coupling. Especially in the latter aspect the ASIP offers a high
degree of flexibility to modify the register bank, the interface ports as well as the instruction set and pipeline
architecture. This flexibility enables a tight coupling of FPU and ASIP, ensuring highest performance and a minimum
amount of interface overhead.

As a benefit of the ASIP environment, the designer has direct access to the floating point and standard libraries. By
modifying these libraries as described in section 4.3, a seamless integration of the co-processor instructions without the
source code modifications required in [17] becomes feasible. In particular, the time consuming replacement of operators
(i.e. +, *, /) by dedicated co-processor instructions (i.e. cpadd, cpmul, cpdiv) is unnecessary.

                                     Table 5. Comparison of CORDIC based FPUs (latency in cycles / throughput in MOPS)
                                                          [16]            Serial Implementation    Parallel Implementation
               FPGA Implementation
                 ALUTs                                    5290            6145                     7944
                 fmax                                     109 MHz         34.86 MHz                43.94 MHz
               Function Latency (cycles / MOPS)
                 add / sub                                114 / 0.96      5 / 6.39                 3 / 12.61
                 mul                                      153 / 0.71      5 / 6.39                 3 / 12.61
                 div                                      197 / 0.55      208 / 0.15               156 / 0.24
                 sqrt                                     141 / 0.77      220 / 0.15               165 / 0.23
                 sin / cos                                360 / 0.3       212 / 0.15               159 / 0.24
                 atan                                     467 / 0.23      208 / 0.15               156 / 0.24


The FPU implemented throughout this work is based on the iterative CORDIC implementation described in section 3.
This yields latencies of the co-processor instructions of 3 cycles (for add / mul and the parallel implementation) up to
approximately 100 cycles (for asin and the serial implementation), compared to a constant latency of 3 cycles for the
approach in [17] and cycle counts of 114 (addition) to 467 (arctangent) in [16]. Table 5 compares the latency and the
performance in MOPS of the co-processor instructions as well as the required ALUTs of the implementations on a
Stratix II FPGA.

                                    Table 6. Profiling of the PVT estimation algorithm (number of calls per sub-function)
                              sin      cos      atan      atan2    asin      div      add       sqrt     mul
         estimation           90       90       0         0        0         96       954       18       1044
         satellite_position   16       17       0         2        1         3        40        2        53
         correction           1        3        0         0        0         10       23        0        31
         ecef_to_llh          3        4        2         0        0         6        7         2        21

Table 6 shows the profiling results of the four PVT sub-functions for a test case with 5 satellites in view. For this test
case the co-processor presented in this work reduces the number of co-processor cycles by a factor of more than 6.2
(parallel) and 4.5 (serial) compared to [16]. As the commercial solution [16] features a higher clock frequency, the
overall performance (operations per second) increases only by a factor of 2.2 (parallel) and 1.3 (serial), respectively.


8. CONCLUSION

This paper presents a double precision floating point unit used to enhance the performance of an ASIP based low power
GNSS receiver architecture. The ASIP’s design process has been described, and a performance comparison of the ASIP
and standard embedded processors reveals a poor floating point performance. To improve the performance of floating
point as well as trigonometric and standard mathematical functions, a configurable CORDIC based co-processor has
been implemented. To trade off performance and area, a serial and a parallel implementation of the CORDIC equations
for x, y and z have been realized, and the required ALUTs and maximal clock frequency for an FPGA implementation
have been reported. The co-processor has been functionally verified and the achieved accuracy has been compared to a
NIOS 2 software emulation. For reasonable input parameters the accuracy of the co-processor instructions satisfies the
accuracy required by the PVT estimation algorithm.

The hardware and software modifications required for the ASIP / co-processor coupling have been described. Special
emphasis has been placed on a seamless integration of the co-processor functions: a subset of functions can be selected
in a definition file, and the corresponding software emulations are replaced during the compilation process.

The gain achievable using the presented co-processor has been shown exemplarily for the four sub-functions of a
standard PVT estimation algorithm. The performance analysis shows improvements by one or two orders of magnitude
compared to the ASIP’s software emulation, depending on the executed function. Hence, using the ASIP / co-processor
architecture the assumed real-time processing constraints (i.e. 1-10 Hz position fix rate at 20 MHz processor clock
speed) can be met.

For a comparison of the co-processor configurations the AE costs have been derived for a 90 nm TSMC CMOS standard
cell technology. The results show that for the standard GNSS algorithm the full featured co-processor minimizes the AE
costs. Implementations on an FPGA show performance and area values comparable to a commercial CORDIC
co-processor.

The presented ASIP / co-processor coupling can be further enhanced by replacing the currently used sleep mode with a
parallel execution of ASIP and co-processor instructions.


9. REFERENCES

[1]   www.analog.com
[2]   www.ti.com
[3]   S. Fischer, P. Rastetter, M. Mittnacht, F. Griesauer, P. Silvestrin, “AGGA-3 in an Avionic System“, ESA
      Workshop on Spacecraft Data Systems and Software.
[4]   www.javad.com
[5]   G. Kappen, T.G. Noll, “Mapping of multioperable GNSS receiver algorithms to a heterogeneous ASIP based
      platform” , Proceedings of the International Global Navigation Satellite Systems Society (IGNSS) Symposium
      2006, Surfers Paradise, Australia, 17.-21.7.2006.
[6]   T. G. Noll, “Application Domain Specific Embedded FPGAs for SoC Platforms”, Invited Survey Lecture, Irish
      Signals and Systems Conference 2004 (ISSC'04), Jun. 2004.
[7]   K. Keutzer, S. Malik, A. R. Newton, “From ASIC to ASIP: The Next Design Discontinuity”, ICCD Proceedings,
      2002.



[8]    CoWare, LISATek Language Reference Manual, Version V2005.2.2, July 2006.
[9]    CoWare, LISATek Processor Designer Manual, Version V2005.2.2, July 2006.
[10]   Gaisler Research, “TSIM2 Simulator User’s Manual”, Version 2.0.8, April 2007.
[11]   ARM Ltd, “Benchmarking with ARMulator”, March, 2002.
[12]   J.S. Walther, “A unified algorithm for elementary functions”, Proc. of the Spring Joint Computer Conference,
       1971, pp. 379-385.
[13]   R. Andraka, “A survey of CORDIC algorithms for FPGA based computers”, International Symposium on Field
       Programmable Gate Arrays, Monterey, 1998, pp. 191-200.
[14]   E. Catovic, “GRFPU – High Performance IEEE-754 Floating-Point Unit”, DASIA 2004, Nice.
[15]   Nallatech, “Double Precision Floating Point Core”, 2004.
[16]   Digital Core Design, “DFPMU Floating Point Coprocessor”, 2007.
[17]   T. Vladimirova, D. Eamey, S. Keller, M. N. Sweeting, “Floating-Point Mathematical Co-Processor for a Single Chip
       On-Board Computer”, Proceedings of the 6th Military and Aerospace Applications of Programmable Logic
       Devices and Technologies (MAPLD'2003), September 9-11, 2003, Washington DC, US.



