Implementation of a CORDIC based double precision floating point unit used in an ASIP based GNSS receiver

Götz Kappen, Sofian el Bahri, Oliver Priebe, Tobias G. Noll
Chair of Electrical Engineering and Computer Systems, Aachen University, Germany

BIOGRAPHY

Götz KAPPEN received the Dipl.-Ing. degree in 2002 from RWTH Aachen University. Since then he has been working as a PhD student at the Chair of Electrical Engineering and Computer Systems, RWTH Aachen University. His fields of research are satellite navigation systems and digital signal processing.

Sofian el BAHRI received the Dipl.-Ing. degree in electrical engineering in 2006 from RWTH Aachen University. His fields of research are application specific architectures for digital signal processing.

Oliver PRIEBE received the Dipl.-Ing. degree in 2002 from RWTH Aachen University. Since 2004 he has been working as a scientific co-worker at the Chair of Electrical Engineering and Computer Systems, RWTH Aachen University. His fields of research are satellite navigation systems and digital signal processing.

Tobias G. NOLL received the Ing. (grad.) degree in Electrical Engineering from the Fachhochschule Koblenz in 1974, the Dipl.-Ing. degree in Electrical Engineering from the Technical University of Munich in 1982, and the Dr.-Ing. degree from the Ruhr-University of Bochum in 1989. From 1974 to 1976, he was with the Max-Planck-Institute of Radio Astronomy, Bonn. From 1976 he was with the Corporate Research and Development Department of Siemens, where from 1987 he headed a group of laboratories concerned with CMOS circuits for digital signal processing. In 1992, he joined RWTH Aachen University, where he is a Professor holding the Chair of Electrical Engineering and Computer Systems. His activities focus on low power deep submicron CMOS architectures, circuits and design methodologies, as well as digital signal processing for communications and medical electronics.

1.
INTRODUCTION

Currently, GNSS receivers feature an embedded standard processor (e.g. ARM7) as central processing unit to realize the navigation processor functionality (e.g. correlator channel control, position estimation, pseudorange correction and coordinate transformations). Although the performance of embedded processors is increasing, floating point instructions, which allow for high accuracy and a straightforward implementation of the signal flow developed by the algorithm designer, offer low performance because they are emulated in software. For this reason Digital Signal Processors (DSPs) as well as General Purpose (GP) processors used in high performance signal processing applications often realize floating point operations using a dedicated Floating Point Unit (FPU) (e.g. [1], [2]). Recently, GNSS receivers for space applications [3] or high accuracy multioperable positioning [4] have been equipped with an FPU to enhance overall performance. FPUs generally calculate the specified function in a few clock cycles in parallel to the main processor’s pipeline.

Nevertheless, most embedded microprocessors used in mobile and thus energy critical applications do not feature a dedicated FPU because of the power and area overhead. In this case floating point arithmetic is emulated in software. In addition, most embedded processors emulate trigonometric and mathematical standard functions using floating point arithmetic. Therefore, an enhanced floating point performance implicitly improves the performance of trigonometric and mathematical standard functions. The performance of these functions can be improved further using dedicated co-processors.

This paper presents a flexible FPU capable of calculating a wide variety of mathematical and floating point functions to enhance the performance of the Application Specific Instruction Set Processor (ASIP), which forms the central processing unit of a flexible GNSS receiver architecture.
As presented in [5], the target architecture is assembled from application specific hardware blocks to reduce the energy and area consumption of next generation multioperable GNSS receivers while maintaining maximum flexibility. In this approach the standard embedded microprocessor used in most commercially available receivers is replaced by an ASIP to increase the receiver’s area and energy efficiency, i.e. mm² per Million Operations Per Second (MOPS) and MOPS per mW. This paper focuses on the implementation of a configurable co-processor realizing double precision floating point and trigonometric as well as standard mathematical functions. A library based modular approach allows a set of specified co-processor functions to be used without any modification of the receiver’s source code. Because the complexity of the co-processor is configurable, the area-energy product can be minimized using an application specific subset of supported functions.

The rest of the paper is organized as follows: Section 2 sketches the idea of replacing the processor in a standard GNSS receiver by an ASIP and describes the ASIP’s design flow. A performance analysis of a GNSS receiver algorithm on the template ASIP shows that the real-time constraints cannot be met. Detailed profiling reveals that floating point instructions as well as trigonometric instructions are performance critical. To reduce the number of processing cycles, section 3 introduces the implementation and functional verification of the configurable double precision co-processor to be used in embedded systems. The required accuracy of the floating point representation is estimated and compared to the accuracy achieved by the co-processor. Section 4 details the coupling of ASIP and co-processor. Emphasis is placed on the required modifications of the ASIP’s hardware and software.
Sections 5 and 6 summarize the performance and cost figures of the ASIP / co-processor architecture. In section 7 the presented work is compared to existing co-processors. Finally, section 8 concludes the paper.

2. ASIP BASED GNSS RECEIVER

2.1 ASIP Motivation and Design Flow

As described in [6], hardware architectures used for today’s System-on-Chip (SoC) designs face a power-versus-flexibility conflict. That is, programmable and flexible architectures such as GP processors or DSPs offer the lowest area and power efficiency. On the other hand, dedicated and therefore fixed hardware architectures such as Application Specific Integrated Circuits (ASICs) offer the highest efficiency in terms of area and power. In this context ASIPs offer a trade-off between GP processors and dedicated solutions by exploiting a priori knowledge about the class of application to be implemented [7]. This knowledge allows the processor’s instruction set to be adapted, leading to a modified Arithmetic Logic Unit (ALU) as well as a modified pipeline architecture. Additionally, ASIPs may feature, for instance, an optimized memory architecture as well as external interfaces.

Figure 1. ASIP design and optimization flow (blocks: architecture description, Processor Designer, software development tools with assembler, linker, compiler and simulator, VHDL description, synthesis, gate level model, application C/C++ code, executable, profiling, cost evaluation)

The ASIP’s design and optimization flow is shown in Figure 1 and starts by implementing the target application on a template processor described in an Architecture Description Language (ADL). In this work the LISA processor description language [8] is used for the description of the ASIP architecture and the generation of associated tools and hardware description files [9]. To allow for meaningful results, the template processor belongs to the same class as the target processor.
Typical classes are, for instance, Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) architectures, which are often used in parallel data paths. In this paper a Reduced Instruction Set Computer (RISC) processor architecture, typical for control functionality and used in most standard GNSS receiver architectures, has been chosen as starting point for the development of the navigation processor.

For software development, simulation and performance evaluation the ASIP development tools (i.e. Processor Designer) automatically generate assembler, linker and compiler as well as a cycle accurate simulator based on the ADL. With this automatic tool generation, information about costs and benefits of processor extensions can be derived rapidly. Additionally, hardware description files are generated based on the ADL description. The HDL description files can be used to synthesize the ASIP for FPGAs or standard cell technologies, allowing for real-time functional verification and analysis of silicon area and power consumption.

The optimization process is based on profiling data of the cycle accurate ASIP model generated throughout the design flow. The profiling reveals performance critical functions, which are potential optimization candidates. During this optimization process, application specific modifications of the ASIP’s instruction set and architecture can be used to improve performance and efficiency. Besides instruction set modifications, co-processors are another promising option, as described in [7].

For the purposes of this paper the performance constraints are as follows: the ASIP should allow for real-time position estimation at a position fix rate of 1-10 Hz using at least four satellites in view and a typical GNSS processor clock frequency of 20 MHz.
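To make this constraint concrete, the cycle budget per position fix follows directly from the clock frequency and the fix rate stated above. The symbols N_budget, f_clk and f_fix are introduced here for illustration only and do not appear in the original text:

\[
N_{\text{budget}} = \frac{f_{\text{clk}}}{f_{\text{fix}}}, \qquad
\frac{20\ \text{MHz}}{1\ \text{Hz}} = 2 \cdot 10^{7}\ \text{cycles}, \qquad
\frac{20\ \text{MHz}}{10\ \text{Hz}} = 2 \cdot 10^{6}\ \text{cycles}
\]

Under the assumption that the processor is not shared with other tasks, all navigation processing for one fix, including every floating point and trigonometric call, has to fit into roughly 2 to 20 million cycles.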
2.2 Template ASIP’s Performance Evaluation

Throughout this paper a standard Position, Velocity and Time (PVT) algorithm is used for the performance evaluation of the ASIP / co-processor architecture. The standard algorithm is divided into four sub-functions:

1. Calculation of the satellite position (satellite_position)
2. Correction of the measured pseudoranges (correction)
3. Estimation of the receiver position by solving a set of non-linear equations (estimation)
4. Transformation of Cartesian coordinates to latitude, longitude and height (ecef_to_llh)

Executing this standard algorithm on the template processor shows that position estimation using 5 satellites requires about 78 million clock cycles, and more if more satellites are visible. At a clock frequency of 20 MHz this corresponds to about 3.9 s per position fix, which clearly conflicts with the real-time requirements stated in section 2.1 and prevents using the template ASIP in a GNSS receiver. Profiling and analysis of the functions executed by the ASIP show that floating point as well as trigonometric functions account for a significant fraction of the overall number of processing cycles. For a detailed investigation, Table 1 compares standard functions realized on the template ASIP, a LEON processor and an ARM7TDMI. The results were determined using cycle accurate LEON [10] and ARM [11] processor simulators, respectively. The template ASIP requires a significantly larger number of processing cycles for standard operations, leading to a poor overall performance compared to LEON and ARM7TDMI. This performance drawback is caused by the flexible library concept used for ASIP development. As the standard and floating point libraries have to be compatible with a wide variety of processor classes during the development process, the ASIP features generic library implementations that offer poor performance.

Table 1. Comparison of standard mathematical functions (processing cycles)

Function   Template ASIP   LEON    ARM
mul        6354            443     73
add        3977            988     78
sqrt       175445          2265    1243
cos        170393          18281   2176
sin        154102          18281   2100

This paper aims to improve the performance of the template ASIP by coupling it with a configurable co-processor supporting trigonometric, standard math and floating point operations. Because the co-processor is configurable, the overall receiver hardware can be optimized for the application code to offer the highest performance at the lowest cost in terms of area and power consumption.

3. CONFIGURABLE DOUBLE PRECISION CO-PROCESSOR

3.1 CORDIC Algorithm

The co-processor implementation is based on the Coordinate Rotation Digital Computer (CORDIC) algorithm, first developed for real-time navigation. The CORDIC algorithm provides a hardware efficient way to iteratively calculate trigonometric functions using only shift and add operations. The unified CORDIC [12] extends the basic principle to allow for the calculation of various mathematical functions.

The basic principle behind CORDIC is the rotation of two input values interpreted as vector components. In each iteration step the input vector is rotated by a predefined, decreasing angle. There are two different modes of operation. In rotation mode the input vector is rotated by a given angle by minimizing the angle variable z. In contrast, vector mode minimizes one of the input components so that the output vector coincides with a coordinate axis. Vector mode, for instance, allows for a simple calculation of the input vector’s length. The following equations describe the unified CORDIC algorithm presented in [12] for the vector components (x, y) and the angle z:

\[ x_{i+1} = x_i - m \cdot y_i \cdot d_i \cdot 2^{-i} \]
\[ y_{i+1} = y_i + x_i \cdot d_i \cdot 2^{-i} \]
\[ z_{i+1} = z_i - d_i \cdot e_i \]

Here, m specifies the coordinate system (1, 0, -1 for circular, linear and hyperbolic) and e_i is the elementary angle at iteration step i.
The decision variable d_i is determined in each step depending on the mode of operation:

\[ d_i = \begin{cases} -\operatorname{sign}(y_i) & \text{for vector mode} \\ \operatorname{sign}(z_i) & \text{for rotation mode} \end{cases} \]

In contrast to the integer implementation introduced in [13], the above equations have to be evaluated in floating point arithmetic.

3.2 Implementation

The co-processor presented in this paper is based on an iterative implementation of the CORDIC equations [13]. This choice is mainly due to the fact that the co-processor does not require high throughput rates and that non-continuous co-processor access prevents reasonable pipelining. The schematic of the iterative CORDIC implementation is shown in Figure 2. After a mode dependent initialisation the parameters required for the following CORDIC iterations are determined. The CORDIC calculation block implements the unified CORDIC equations in double precision IEEE754 compliant arithmetic. Iterations are performed until a predefined accuracy or the maximum number of iteration steps is reached. The CORDIC calculation block shown in Figure 2 is realized using three pipeline steps to increase the operating frequency.

Figure 2. Schematic of co-processor architecture (blocks: Init, Parameter Determination, CORDIC Calculation, Convergence Check, Result Postprocessing)

In this work two different versions of the CORDIC co-processor have been developed to trade off co-processor area against latency. The difference between these two implementations concerns the realization of the CORDIC calculation block. To reduce the required area, the serial version solves the CORDIC equations successively for x, y and z. Besides the CORDIC mode and coordinate system, Table 2 summarizes the latency values of the serial and the parallel FPU implementation for each supported function.

Table 2. Co-processor instruction latency (cycles)

Operation        Coord.       Mode       #lat_ser   #lat_par
sin(x), cos(x)   circular     rotation   212        159
arcsin(x)        circular     vector     412        309
arctan2(x)       circular     vector     208        156
x·y, x+y         -            -          5          3
x/y              circular     vector     208        156
sqrt(x)          hyperbolic   vector     220        165

3.3 Required Precision

To get a first idea of the required co-processor accuracy, the estimation function has been implemented using a generic floating point format with variable mantissa accuracy. For this purpose ideal measurement data is generated by choosing a random position on the earth’s surface and computing a variable number of visible satellites based on almanac data. Using these ideal input values the estimation function computes the receiver’s position in the generic floating point format. Based on these assumptions, Figure 3 shows the probability that a specified accuracy (i.e. 1 m, 2 m, 5 m and 10 m) is achieved with a given mantissa bit width. For this simulation and ideal input data, 30 mantissa bits are sufficient to achieve an accuracy of better than 1 m.

Figure 3. Probability of achieved accuracy depending on number of mantissa bits (curves for < 1 m, < 2 m, < 5 m and < 10 m; axes: probability in %, number of mantissa bits from 18 to 30)

3.4 Functional Verification and Achieved Accuracy

The co-processor has been coupled to a reference processor for functional verification of the co-processor coupling as well as for accuracy estimation. For the purposes of this paper a NIOS 2 processor, frequently used in FPGA based SoCs, is employed. To estimate the accuracy, values with random exponent and mantissa have been generated and used as inputs for each supported co-processor function. The result of the co-processor computation has been compared to the result of the NIOS 2 software emulation. Figure 4 shows the minimum number of equal mantissa bits for these two implementations. For a reasonable input range, software emulated and co-processor based floating point calculations differ by less than 4 mantissa bits.
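The text does not state how the number of equal mantissa bits was determined. One plausible way, sketched here purely as an assumption, is to compare the raw IEEE 754 double precision bit patterns of the two results and count the leading mantissa bits that agree. The function name equal_mantissa_bits is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical metric: number of leading mantissa bits (0..52) on which two
 * doubles agree. Results with different sign or exponent count as 0 bits,
 * so two close values on either side of a power of two are reported as 0. */
int equal_mantissa_bits(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);   /* reinterpret the bits without aliasing issues */
    memcpy(&ub, &b, sizeof ub);

    if ((ua >> 52) != (ub >> 52)) {
        return 0;                 /* sign or exponent field differs */
    }

    uint64_t diff = (ua ^ ub) & ((UINT64_C(1) << 52) - 1);
    int equal = 52;
    while (diff != 0) {           /* locate the highest differing mantissa bit */
        diff >>= 1;
        --equal;
    }
    return equal;
}
```

With this definition, a difference of less than 4 mantissa bits corresponds to a return value of more than 48 for a pair of results.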
Degradation of the accuracy by more than 3-4 mantissa bits is mostly caused by input range restrictions of the software and co-processor implementations. The co-processor accuracy thus satisfies the demands of GNSS receivers stated in section 3.3. As the CORDIC based co-processor implementation gains about 1 bit of accuracy per iteration, trading the co-processor’s latency against accuracy could be a promising approach to minimize the application’s execution time.

Figure 4. Comparison of software emulated and co-processor mantissa accuracy (panels: cos, arcsin, arctan2, multiplication, division, sqrt; axes: exponent value, number of equal mantissa bits)

3.5 Configurations

The complexity of the co-processor, that is, the set of supported functions, can be specified at compile time to match the considered application. For instance, a co-processor implementation supporting only sine and cosine functions could be an interesting option if the implemented algorithm makes extensive use of these trigonometric functions. On the other hand, a co-processor featuring just floating point additions and multiplications could be promising for floating point matrix computations. Functions which are not supported by the co-processor are implemented using the ASIP’s libraries.

Based on a detailed cost-benefit analysis an application specific subset of co-processor functions can be selected to maximize the area and power efficiency of the overall design. Table 3 summarizes possible configurations and the required area for the parallel and serial CORDIC implementation using an Altera Stratix II FPGA (i.e. EP2S60F1020C3). The maximal and minimal area requirements of the co-processor differ by a factor of approximately 2.8 for the serial and 4 for the parallel implementation. For FPGA synthesis, the replacement of multiplier logic by internal dedicated multipliers has been turned off. If dedicated multipliers are used, the values of the last two configurations are reduced by approximately 2500 ALUTs.

Table 3. Area Comparison of Co-Processor Configurations

Functions             Parallel     Serial
sin, cos              1977 ALUTs   2177 ALUTs
+ [atan, atan2]       2013 ALUTs   2262 ALUTs
+ [asin, div, add]    4779 ALUTs   2839 ALUTs
+ [sqrt]              7936 ALUTs   6140 ALUTs
+ [mul]               7944 ALUTs   6145 ALUTs

4. ASIP / CO-PROCESSOR COUPLING

This section deals with the coupling of ASIP and co-processor in the hardware and software domain. The main goal of this work is the optimization of floating point and trigonometric as well as standard mathematical library functions. These optimizations target the ASIP’s mathematical and standard libraries to allow for the analysis of various co-processor configurations without the need for any modification of the application source code.

4.1 Hardware Modifications

The template RISC processor used as a starting point for ASIP development features a five stage pipeline, 16 general purpose registers and a load / store architecture. The applied hardware modifications belong to one of the following groups:

1. Interface extension
2. Instruction set extension
3. Processor control flow modifications

To realize the coupling of ASIP and co-processor, additional input and output ports have been added to the ADL processor description, including two arguments (i.e. arg1, arg2), one result (res) as well as control ports (mode, hold_proc, clk_en, result_ready). In a next step interface instructions are implemented. One instruction (funcarg) realizes the output of control signals and arguments. This instruction is present for each supported co-processor instruction (func). The instruction (getresult) reads the data after sensing a logical high result_ready signal. Each of these instructions is added as a new leaf in the coding tree.

A typical co-processor access begins by transmitting the arguments and function specific control signals and enabling the co-processor (clk_en = '1'). In the following processing cycle the co-processor sets the hold_proc signal and the ASIP stops execution of the pipeline. After the co-processor has finished the calculation, the ASIP is reactivated by setting hold_proc = '0' and the result is read using the added special instruction getresult. Adding the co-processor interface increases the number of required FPGA resources, while the maximum clock frequency remains nearly constant (Table 4).

Table 4. Area overhead for co-processor coupling

                     ALUTs   Frequency
Template processor   3666    63.26 MHz
+ interface          5081    58.85 MHz

4.2 Co-Processor Access

Co-processor access (Figure 5) starts by setting the mode signal for the specific instruction and transferring one or two input arguments (i.e. arg1 and arg2). In the following cycle the co-processor activates the ASIP’s sleep mode by setting the proc_sleep signal and begins the CORDIC iteration process. After a predefined accuracy or the maximum number of iteration steps is reached, the co-processor sends the result back to the ASIP and releases the proc_sleep signal.

Figure 5. Co-Processor Access Timing Diagram (signals: clk (in), res_ready (out), clk_en (in), proc_sleep (out), mode (in), arg1 (in), arg2 (in), res (out); the access lasts #lat_cyc cycles)

4.3 Software Modifications

As mentioned above, the main goal is to avoid source code modifications.
Therefore, the co-processor instructions are inserted into the ASIP’s standard libraries. Using the simple plug-and-play approach shown in Figure 6, functions can be implemented using either the co-processor or software emulation based on the ASIP’s generic libraries.

application.c:

    #include <math.h>
    int main(void)
    {
        ...
        a = sin(b);
        ...
    }

math.h:

    ...
    double cos(double);
    double atan2(double, double);
    ...
    double sin(double);
    ...

sin.c (software emulation):

    double sin(double x)
    {
        // C implementation
        ...
    }

sin.c (co-processor):

    double sin(double x)
    {
        sinarg(x);
        return getresult;
    }

Figure 6. Library Modifications

Figure 6 illustrates this approach for a sine calculation called in the application domain. As for standard processors, the function call is resolved using the mathematical standard libraries (i.e. math.h). Depending on the selection made in the ASIP’s definition file, either the software emulation or the co-processor implementation is selected during the linking phase. This approach has been used to include different subsets of co-processor functions in the compilation of the PVT estimation algorithm and hence allows for a comparison of the performance achieved using different co-processor configurations.

5. ASIP / CO-PROCESSOR PERFORMANCE

After implementation of the hardware and software modifications described in section 4, the PVT estimation’s source code can be compiled for the ASIP / co-processor architecture. For functional verification and real-time implementation as well as for the determination of processing cycles, the ASIP’s hardware description files are generated as described in section 2.1. Finally, the ASIP is connected to the co-processor component and to program and data memory. The compiled executable is loaded into program memory. The execution time as well as the number of co-processor active cycles required for power estimation can be determined using Modelsim. The standard navigation algorithm used as an example throughout this paper has been described in section 2.2.
Figure 7 shows the reduction of processing cycles per function call that can be achieved using different configurations.

Figure 7. Processing cycles of standard navigation receiver functions (series: satellite_position, ecef_to_llh, estimation, correction; cycles on a logarithmic axis from 1E+04 to 1E+08; configurations: template ASIP, [sin, cos], + [atan, atan2], + [asin, div, add], + [sqrt], + [mul])

The gain achievable by coupling ASIP and co-processor is clearly related to the fraction of trigonometric and floating point instructions. Here, reductions by one (ecef_to_llh) to two (estimation) orders of magnitude compared to the template ASIP could be reached.

6. AE COST COMPARISONS

For the standard cell implementation, costs and performance can be visualized using an Area-Energy (AE) diagram. Figure 8 shows each co-processor configuration classified by energy consumption per function call and by the area required for an implementation in a 90nm standard cell technology.

Figure 8. AE cost comparisons for co-processor configurations; estimation (left) and correction (right) (axes: area in mm², energy per call in mJ; curves of constant A*E; configurations: template ASIP, [sin, cos], + [atan, atan2], + [asin, div, add], + [sqrt], + [mul])

For the estimation function (left) one can observe reduced AE costs for each of the co-processor configurations. In contrast, only the full featured co-processor offers a slightly improved AE product for the implementation of the correction function (right). These two examples illustrate the importance of choosing an application specific subset of co-processor functions to improve overall costs and performance. Finally, Figure 9 compares the AE costs for the overall PVT algorithm (i.e. correction, estimation, satellite_position, ecef_to_llh).

Figure 9. AE cost comparison for overall PVT algorithm (scaled A*E, in the order of the plotted bars: template ASIP 100, [sin, cos] 79.1, + [atan, atan2] 67.1, + [asin, div, add] 65.4, + [sqrt] 47.8, + [mul] 14.1)

7. RELATED WORK

There is a large number of floating point co-processors designed to enhance the performance of embedded microprocessors. Prominent examples are presented in [14] and [15]. Most of these designs realize standard floating point arithmetic including addition, subtraction, multiplication and division based on double or single precision arithmetic. In contrast to most FPUs, [16] introduces a CORDIC based double precision unit to enhance the performance of a standard processor in a SoC. This unit additionally supports trigonometric and standard math functions. Finally, [17] describes the coupling of the configurable and synthesizable LEON processor with a CORDIC based single precision floating point unit.

The main difference between the approach in [17] and this work concerns the implementation of the FPU as well as the coupling of processor and FPU. Especially in the latter aspect the ASIP offers a high degree of flexibility to modify the register bank, interface ports as well as instruction set and pipeline architecture. This flexibility enables a tight coupling of FPU and ASIP, ensuring the highest performance and the smallest amount of interface overhead. As a benefit of the ASIP environment the designer has direct access to the floating point and standard libraries. Modifying these libraries as described in section 4.3 makes a seamless integration of the co-processor instructions feasible without the source modifications required in [17]. In particular, the time consuming replacement of operators (i.e. +, *, /) by dedicated co-processor instructions (i.e. cpadd, cpmul, cpdiv) is unnecessary.

Table 5. Comparison of CORDIC based FPUs (function entries: latency in cycles / MOPS)

                        [16]          Serial           Parallel
                                      implementation   implementation
FPGA implementation
  ALUTs                 5290          6145             7944
  fmax                  109 MHz       34.86 MHz        43.94 MHz
Function latency
  add / sub             114 / 0.96    5 / 6.39         3 / 12.61
  mul                   153 / 0.71    5 / 6.39         3 / 12.61
  div                   197 / 0.55    208 / 0.15       156 / 0.24
  sqrt                  141 / 0.77    220 / 0.15       165 / 0.23
  sin / cos             360 / 0.3     212 / 0.15       159 / 0.24
  atan                  467 / 0.23    208 / 0.15       156 / 0.24

The FPU implemented in this work is based on the iterative CORDIC implementation described in section 3. This yields a latency of co-processor instructions of 3 (for add / mul and the parallel implementation) to approximately 100 (for asin and the serial implementation) cycles, compared to a constant latency of 3 cycles for the approach in [17] and 114 (addition) to 467 (arctangent) cycles in [16]. Table 5 compares the latency and the performance in MOPS of the co-processor instructions as well as the required ALUTs for an implementation on a Stratix II FPGA.

Table 6. Profiling of PVT estimation algorithm (number of calls)

                     sin   cos   atan   atan2   asin   div   add   sqrt   mul
estimation           90    90    0      0       0      96    954   18     1044
satellite_position   16    17    0      2       1      3     40    2      53
correction           1     3     0      0       0      10    23    0      31
ecef_to_llh          3     4     2      0       0      6     7     2      21

Table 6 shows the profiling results of the four PVT sub-functions for a test case with 5 satellites in view. For this test case the co-processor presented in this work reduces the number of co-processor cycles by a factor of more than 6.2 (parallel) and 4.5 (serial) compared to [16]. As the commercial solution [16] features a higher clock frequency, the overall performance (operations per second) increases only by a factor of 2.2 (parallel) and 1.3 (serial), respectively.

8. CONCLUSION

This paper presents a double precision floating point unit used to enhance the performance of an ASIP based low power GNSS receiver architecture. The ASIP’s design process has been described, and a performance comparison of the ASIP and standard embedded processors reveals a poor floating point performance.
To improve the performance of floating point as well as trigonometric and standard math functions, a configurable CORDIC based co-processor has been implemented. To trade off performance and area, a serial and a parallel implementation of the CORDIC equations for x, y and z have been realized, and the required ALUTs and maximum clock frequency for an FPGA implementation have been shown. The co-processor has been functionally verified and the achieved accuracy has been compared to a NIOS 2 software emulation. For reasonable input parameters the accuracy of the co-processor instructions satisfies the accuracy required by the PVT estimation algorithm. The hardware and software modifications required for ASIP / co-processor coupling have been described. Special emphasis has been placed on a seamless integration of the co-processor functions: a subset of functions can be selected in a definition file, and the corresponding software emulations are replaced during the compilation process.

The gain achievable using the presented co-processor has been shown by way of example for four sub-functions of a standard PVT estimation algorithm. The performance analysis shows improvements of one or two orders of magnitude compared to the ASIP’s software emulation, depending on the executed function. Hence, using the ASIP / co-processor architecture, the assumed real-time processing constraints (i.e. 1-10 Hz position fix rate at 20 MHz processor clock speed) could be met. For a comparison of co-processor configurations the AE costs have been derived for a 90nm TSMC CMOS standard cell technology. The results show that for a standard GNSS algorithm the full featured co-processor minimizes the AE costs. Implementations on an FPGA show performance and area values comparable to a commercial CORDIC co-processor. The presented coupling of ASIP and co-processor can be further enhanced by replacing the currently used sleep mode with parallel execution of ASIP and co-processor instructions.

9.
REFERENCES

[1] www.analog.com
[2] www.ti.com
[3] S. Fischer, P. Rastetter, M. Mittnacht, F. Griesauer, P. Silvestrin, “AGGA-3 in an Avionic System”, ESA Workshop on Spacecraft Data Systems and Software.
[4] www.javad.com
[5] G. Kappen, T. G. Noll, “Mapping of multioperable GNSS receiver algorithms to a heterogeneous ASIP based platform”, Proceedings of the International Global Navigation Satellite Systems Society (IGNSS) Symposium 2006, Surfers Paradise, Australia, 17.-21.7.2006.
[6] T. G. Noll, “Application Domain Specific Embedded FPGAs for SoC Platforms”, Invited Survey Lecture, Irish Signals and Systems Conference 2004 (ISSC'04), Jun. 2004.
[7] K. Keutzer, S. Malik, A. R. Newton, “From ASIC to ASIP: The Next Design Discontinuity”, ICCD Proceedings, 2002.
[8] CoWare, LISATek Language Reference Manual, Version V2005.2.2, July 2006.
[9] CoWare, LISATek Processor Designer Manual, Version V2005.2.2, July 2006.
[10] Gaisler Research, “TSIM2 Simulator User’s Manual”, Version 2.0.8, April 2007.
[11] ARM Ltd, “Benchmarking with ARMulator”, March 2002.
[12] J. S. Walther, “A unified algorithm for elementary functions”, Proc. of the Spring Joint Computer Conference, 1971, pp. 379-385.
[13] R. Andraka, “A survey of CORDIC algorithms for FPGA based computers”, International Symposium on Field Programmable Gate Arrays, Monterey, 1998, pp. 191-200.
[14] E. Catovic, “GRFPU – High Performance IEEE-754 Floating-Point Unit”, DASIA 2004, Nice.
[15] Nallatech, “Double Precision Floating Point Core”, 2004.
[16] Digital Core Design, “DFPMU Floating Point Coprocessor”, 2007.
[17] T. Vladimirova, D. Eamey, S. Keller, M. N. Sweeting, “Floating-Point Mathematical Co-Processor for a Single Chip On-Board Computer”, Proceedings of the 6th Military and Aerospace Applications of Programmable Logic Devices and Technologies (MAPLD'2003), September 9-11, 2003, Washington DC, US.