IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009

Low-Power Multiple-Precision Iterative Floating-Point Multiplier with SIMD Support

Dimitri Tan, Member, IEEE, Carl E. Lemonds, Senior Member, IEEE, and Michael J. Schulte, Senior Member, IEEE

Abstract—The demand for improved SIMD floating-point performance on general-purpose x86-compatible microprocessors is rising. At the same time, there is a conflicting demand in the low-power computing market for a reduction in power consumption. Along with this, there is the absolute necessity of backward compatibility for x86-compatible microprocessors, which includes the support of x87 scientific floating-point instructions. The combined effect is that there is a need for low-power, low-cost floating-point units that are still capable of delivering good SIMD performance while maintaining full x86 functionality. This paper presents the design of an x86-compatible floating-point multiplier (FPM) that is compliant with the IEEE-754 Standard for Binary Floating-Point Arithmetic [12] and is specifically tailored to provide good SIMD performance in a low-cost, low-power solution while maintaining full x87 backward compatibility. The FPM efficiently supports multiple precisions using an iterative rectangular multiplier. The FPM can perform two parallel single-precision multiplies every cycle with a latency of two cycles, one double-precision multiply every two cycles with a latency of four cycles, or one extended-double-precision multiply every three cycles with a latency of five cycles. The iterative FPM also supports division, square root, and transcendental functions. Compared to a previous design with similar functionality, the proposed iterative FPM has 60 percent less area and 59 percent less dynamic power dissipation.

Index Terms—Computer arithmetic, rectangular multiplier, floating-point arithmetic, low-power, multiplying circuits, multimedia, very-large-scale integration.
1 INTRODUCTION

Ever since the introduction of SIMD extensions to general-purpose processors, there has been a rising demand for improved SIMD performance to accommodate 3D graphics, video conferencing, and other multimedia applications [1], [2], [3], [4], [5]. At the same time, the low-power computing market is demanding a reduction in power consumption despite an increase in performance. In general, these two requirements conflict, since increased performance is typically achieved with a corresponding increase in power consumption due to increased frequency, increased hardware resources, or a combination of the two.

Backward compatibility has enabled the survival of the x86 Complex Instruction Set Computer (CISC) architecture and is therefore an absolute requirement for future microprocessors. In the area of floating-point, backward compatibility includes support for the x87 floating-point instructions [6]. These instructions are used in scientific computing and are not generally used in multimedia applications [7]. In current x86 processors, the SIMD floating-point extensions include SSE, SSE2, and SSE3 [5]. These instructions are heavily used in multimedia applications, and in particular single-precision (SP) operations occur very frequently [7].

In recent x86 floating-point units, the SIMD extensions and x87 instructions are mapped onto the same hardware to save resources. In the AMD-K7™ and AMD-K8™ microprocessors and derivatives, the hardware is optimized for x87 instructions [8], [9]. An alternative approach, presented in this paper, is to optimize for the SIMD extensions and provide x87 functionality with a reduction in the performance of the latter. The advantage of this alternative approach is a reduction in hardware resources and power, and an improvement in the performance of the SIMD extensions.

This paper presents the design of an x86-compatible floating-point multiplier (FPM) that is optimized for SP SSE instructions. The FPM can perform two parallel 24-bit × 24-bit SP multiplies each cycle with a latency of two cycles, one 53-bit × 53-bit double-precision (DP) multiply every two cycles with a latency of four cycles, or one 64-bit × 64-bit extended-double-precision (EP) multiply every three cycles with a latency of five cycles. In addition to performing multiplication, the FPM is used to perform division and square root, and provides support for the x87 transcendental functions. Two internal multiplier significand precisions of 68-bit × 68-bit and 76-bit × 76-bit are required to support the divide, square-root, and transcendental functions.

The FPM is based on a rectangular significand multiplier tree that performs DP and EP multiplies through iteration. A rectangular multiplier is of the form N × M, where the multiplicand width N is greater than the multiplier width M [10]. The rectangular FPM uses significantly less hardware than a fully pipelined multiplier. Furthermore, the rectangular FPM reduces the latency of SP multiplies, and the wider multiplicand conveniently accommodates two parallel SP (packed) multiplies. The rectangular multiplier is also used to decrease the latency of divide and square-root operations, as described in [11]. The combination of these effects has the potential to reduce power dissipation for multimedia applications.

The main contribution of this paper is the presentation of an iterative rectangular FPM that is optimized for packed SP multiplies and efficiently supports DP and EP multiplies. Several of the individual techniques presented in this paper have been previously published, but, to the authors' knowledge, the manner in which they have been combined in this design has not. Specifically, this is the only multiplier that uses multiple passes for DP and EP multiplies to reduce area and power while supporting two packed SP multiplies in a single pass. This paper also presents a new rounding scheme that efficiently supports multiple iterations, multiple precisions, and multiple rounding boundaries for EP. The proposed FPM complies with the IEEE Standard for Binary Floating-Point Arithmetic [12] with some external hardware and microcode support, and it supports the SSE and x87 floating-point multiply, divide, square-root, and transcendental function instructions specified in [6]. As demonstrated in Section 7, the proposed FPM reduces area and dynamic power by roughly 60 percent compared to a previous FPM with similar functionality.

The remainder of this paper is organized as follows: Section 2 gives a brief overview of the main ideas and the theory behind the techniques used in the FPM. Section 3 presents the hardware architecture of the FPM. Section 4 describes the iterative multiplication algorithm. Section 5 describes the rounding algorithm and hardware. Section 6 gives an overview of previous x86 FPMs and iterative FPMs. Section 7 provides area and power estimates for the proposed design and compares it to a previous design with similar functionality. Section 8 gives our conclusions.

D. Tan and C.E. Lemonds are with Advanced Micro Devices Inc., PCS-3, 9500 Arboretum Blvd, Suite 400, Austin, TX 78759. E-mail: {Dimitri.Tan, Carl.Lemonds}@amd.com.
M.J. Schulte is with the University of Wisconsin-Madison, 4619 Engineering Hall, 1415 Engineering Drive, Madison, WI 53706-1691. E-mail: schulte@engr.wisc.edu.

Manuscript received 21 July 2007; revised 28 Feb. 2008; accepted 18 Sept. 2008; published online 23 Oct. 2008. Recommended for acceptance by P. Kornerup, P. Montuschi, J.-M. Muller, and E. Schwarz. For information on obtaining reprints of this article, please send e-mail to tc@computer.org, and reference IEEECS Log Number TCSI-2007-07-0339. Digital Object Identifier no. 10.1109/TC.2008.203.

0018-9340/09/$25.00 © 2009 IEEE. Published by the IEEE Computer Society.
Authorized licensed use limited to: University of Wisconsin. Downloaded on March 30, 2009 at 19:16 from IEEE Xplore. Restrictions apply.

2 MAIN IDEAS AND THEORY

According to [13], "Many FP/multimedia applications have a fairly balanced set of multiplies and adds. The machine can usually keep busy interleaving a multiply and an add every two clock cycles at much less cost than fully pipelining all the FP/SSE execution hardware." Multiplication readily lends itself to iterative algorithms and can accommodate numerous configurations which enable various area versus latency trade-offs. As noted in [7], "Most graphics multimedia applications use 32-bit floating-point operations." Therefore, a reasonable approach is to optimize for SP operations.

Before describing the multiplier architecture, it is worthwhile to briefly review some of the techniques that it uses. The multiplier presented in this paper uses both recursion and iteration to trade off performance (i.e., throughput) against area and power. A recursive multiplier algorithm divides a single wider multiplication into multiple narrower multiplications and sums the resulting products. For example,

A × B = (A_H + A_L) × (B_H + B_L)
      = (A_H × B_H) + (A_H × B_L) + (A_L × B_H) + (A_L × B_L),

where A is the multiplicand and B is the multiplier. A and B can be divided into an arbitrary number of parts of different widths. This partitioning gives different design choices and trade-offs. The maximum widths dictate the hardware requirements. The recursive algorithm can be applied iteratively by reusing the same hardware and performing each of the narrower multiplications in a different cycle. For example,

A × B = (A × B_H)_iteration 1 + (A × B_L)_iteration 2.

Typically, in an iterative-recursive multiplier algorithm, the product from the previous iteration is fed back to the current iteration in redundant form to avoid the delay of carry propagation in the critical path. The redundant product is typically merged into the partial product addition tree without adding delay.

Typically, FPMs assume normalized inputs and attempt to combine the addition and rounding stages to avoid the delay of two carry propagations in series. This is possible if rounding is performed before normalization. If we assume normalized inputs, rounding in an FPM must deal with two distinct cases: rounding overflow and no rounding overflow. Rounding overflow refers to the case in which the unrounded product is in the range [2.0, 4.0), and no rounding overflow refers to the case in which the unrounded product is in the range [1.0, 2.0). These two cases can be computed separately using dedicated adder circuits and then selected once the overflow outcome is known [8]. In this scheme, a constant is added to the intermediate product to reduce all rounding modes to round-to-zero, i.e., truncation. The constant is rounding-mode and precision dependent and thus can accommodate multiple rounding modes and precisions. Alternatively, injection-based rounding also adds (injects) a constant but then uses a compound adder to compute the sum and the sum + 1 [14]. This allows both the rounding overflow and no rounding overflow cases to be handled simultaneously with only one adder. Accommodating multiple rounding positions in injection-based rounding becomes problematic because the use of the compound adder assumes a fixed rounding position.

The multiplier presented in this paper uses recursive-iterative multiplication to perform DP and EP multiplies by taking multiple passes through a rectangular multiplier. It also has the ability to perform two SP multiplies in parallel. Rounding results to different precisions is implemented using two separate rounding paths: one that takes one cycle and is highly optimized for two parallel SP operations, and another which takes two cycles and handles higher precision operations.
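The recursion and iteration described above can be sketched behaviorally. The Python fragment below is a minimal model, not the paper's hardware algorithm: the 27-bit chunk width mirrors the 76-bit × 27-bit array, but the function name and loop structure are illustrative assumptions, and the feedback here is an ordinary integer rather than a redundant carry-save pair.

```python
# Iterative-recursive multiplication: consume the multiplier B in
# fixed-width chunks, one chunk per "cycle", accumulating the shifted
# narrow products. Mathematically this is A*B = sum_i (A * B_i) << (i*CHUNK).

CHUNK = 27  # multiplier bits consumed per iteration (as in the 76x27 array)

def iterative_multiply(a: int, b: int, b_width: int) -> int:
    """Compute a*b using repeated narrow (a x CHUNK-bit) multiplications."""
    product = 0
    shift = 0
    while shift < b_width:
        b_chunk = (b >> shift) & ((1 << CHUNK) - 1)  # next 27-bit slice of B
        product += (a * b_chunk) << shift            # merge into running product
        shift += CHUNK
    return product
```

In the hardware, the running product is never carried through a full adder between passes; it is fed back in carry-save form and merged into the compression tree, which this integer model glosses over.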
3 RECTANGULAR FLOATING-POINT MULTIPLIER ARCHITECTURE

A block diagram of our proposed FPM, illustrating the details of the significand data path, is shown in Fig. 1. To simplify Fig. 1, the additional hardware for exception processing, exponent computations, and divide/square-root support is not shown. The significand data path consists of three pipeline stages. The first pipeline stage consists of a 76-bit × 27-bit multiplier which uses modified radix-4 Booth recoding [15] and a partial product reduction tree consisting of 4-2 compressors [16]. The 76-bit × 27-bit multiplier accepts a feedback product in redundant carry-save form to facilitate iteration, and a 76-bit addend specifically to support divide and square-root operations. The addend is needed because the iterations for divide and square root use a restricted form of the multiply-add operation. The details of the Goldschmidt-based divide algorithm are explained in [11] and [17]. The operand width of 76 bits is required at the microarchitectural level to support division at the internal precision of 68 bits for transcendental functions [8].

Fig. 1. FPM significand data path.

The second and third pipeline stages consist of combined addition and rounding, followed by result selection, formatting for different precisions, and forwarding of the result to the register file and bypass networks. There are two identical copies of the SP rounding unit to support packed SP multiply operations, and a single combined DP/EP rounding unit that handles all rounding for all other precisions and for divide and square-root operations. The SP rounders take one cycle and the DP/EP rounder takes two cycles. The outputs of the two SP rounders are combined, formatted, and multiplexed with the output from the DP/EP rounder to select the final result. The final result is written to the register file and forwarded back to the inputs of the FPM and other FP units via the bypass networks to enhance the performance of dependent operations. With such a configuration, a scalar SP multiplication takes one iteration, two parallel (packed) SP multiplications take one iteration, a scalar DP multiplication takes two iterations, and a scalar EP multiplication takes three iterations. Fig. 2 shows the pipeline diagrams for each precision supported by the FPM.

Fig. 2. FPM pipeline diagrams. (a) SSE-SP scalar (one SP multiply). (b) SSE-SP packed (two SP multiplies). (c) SSE-DP. (d) x87 EP or internal precision (IP68, IP76).

The significand multiplier consists of a 76-bit × 27-bit rectangular tree multiplier, which performs 76-bit × 76-bit multiplications over multiple cycles, as shown in Fig. 3. This saves considerable area compared to a fully parallel 76-bit × 76-bit multiplier, but penalizes the performance of the higher precision (DP and EP) multiply instructions because the multiplier must stall subsequent multiply instructions. However, the multiplier is fully pipelined for SP operations. The multiplier accepts a 76-bit multiplicand input, a 76-bit multiplier input, and a 76-bit addend input. These inputs are held for the duration of the operation. The 76-bit multiplier input is supplied to alignment multiplexing which outputs two 27-bit values. Each 27-bit value is then recoded using a set of modified radix-4 Booth encoders. Two separate 27-bit multiplier values are required to support the packed SP mode.

Fig. 3. 76-bit × 27-bit rectangular multiplier.

The outputs of the Booth encoders are used to select the multiples of the multiplicand to form fourteen 81-bit partial products. One of the 27-bit multiplier values controls the generation of the upper 38 bits of each partial product, while the other 27-bit multiplier value controls the generation of the lower 38 bits of each partial product. In unpacked modes, the two 27-bit multiplier values are identical. In parallel with the partial product generation, the two 76-bit feedback terms are combined with the 76-bit addend using a 3-2 carry-save adder (CSA). The 3-2 carry-save addition is computed in parallel with the Booth encoding and multiplexing and does not add to the critical path. The 14 partial products plus the two combined terms are summed using a compression tree consisting of three levels of 4-2 compressors to produce a 103-bit product in redundant carry-save representation. The 103-bit carry-save product is then stored in two 103-bit registers.

A diagram of the partial product array for the 76-bit × 27-bit multiplication is shown in Fig. 4. This diagram also shows the alignment of the two 76-bit feedback terms and the 76-bit addend. The two feedback terms are needed to support iterations and are aligned to the right. The addend is needed to support division and square root and is aligned to the left. The division algorithm that exploits this multiplier hardware is described in [11]. To avoid unnecessary hardware, the additional terms are inserted into the unused portions of the array wherever possible. Fig. 4 shows how the partial product terms are partitioned into groups of four corresponding to the first level of 4-2 compressors shown in Fig. 3. Note that, in certain bit positions, a 4-2 compressor cell is not required since some of the inputs are zeros. In these cases, the 4-2 compressor cell can be replaced by a full-adder (FA) (i.e., 3-2 CSA) cell, a half-adder (HA) cell, or a buffer cell, depending on the number of inputs that are zero. The subsequent levels of the compression tree can also benefit from these optimizations to save area. Although the multiplier is unsigned, a sign extension term is required to accommodate the sign embedded in the uncompressed feedback terms from the previous iteration. This is an artifact of the signed nature of the Booth encoding and the use of sign encoding for each individual partial product instead of sign extension [15]. Each partial product also requires "hot-ones," which account for the increment term required when taking the two's complement for negatively weighted partial products [18]. For a given partial product, the hot-ones are appended to the subsequent partial product. For positively weighted partial products, the hot-ones are zeros. As shown in Fig. 3, the two feedback terms and the addend are compressed using a 3-2 CSA into two terms, for a total of 16 values to be summed.

Fig. 4. Radix-4 Booth-encoded 76-bit × 27-bit partial product array.

In order to support two parallel SP multiplications, the two SP multiplications are mapped onto the array simultaneously. The superposition of two 24-bit × 24-bit multiplier partial product arrays onto a 76-bit × 27-bit partial product array is shown in Fig. 5. Since the lower array ends at bit 48, the significant bits of the upper array and lower array are separated by seven bits. The reduction tree has three levels of 4-2 compressors. Therefore, the lower array can propagate a carry at most three bit positions and will not interfere with the upper array. Hence, no additional hardware is required to kill any potential carries propagating from the lower array into the upper array. However, in order to accommodate the sign encoding bits and the hot-ones, an additional multiplexer is inserted after the Booth multiplexers and prior to the 4-2 compressor tree, as indicated in Fig. 3. The multiplexing after the Booth multiplexing is only required for the sign encoding bits of the lower array and the hot-ones of the upper array, so the additional hardware required is small. This hardware, however, is on the critical path and adds the delay of a 2-1 multiplexer. An alternative to multiplexing in the sign-encoding bits and hot-one bits after the Booth multiplexing is to insert these bits into the feedback terms, which are all zeros for the first iteration.

4 ITERATIVE 76 × 27 MULTIPLICATION ALGORITHM

The iterative multiplication algorithm for the rectangular multiplier is given in Fig. 6. For each multiply iteration, the appropriate multiplier bits are selected for the high and low multiplier values, and the product is computed in redundant carry-save form. For SSE-SP multiplies and the first iteration of all other precisions, the two feedback terms are set to zero. For the second iteration of SSE-DP multiplies and the second and third iterations of EP multiplies, the two feedback terms are set to the upper 76 bits of the product from the previous iteration and are then added to the lower 76 bits of the current product. SP multiplies require only a single iteration, DP multiplies require two iterations, and EP multiplies require three iterations.

Fig. 6. Iterative multiply algorithm.

The alignment of the unrounded product and the positions of the rounding points within the 103-bit carry-save multiplier output are shown in Fig. 7. This diagram shows the position of the rounding overflow bit (V), the most-significant bit of the product (M), the least-significant bit of the product (L), the round bit (R), the remaining result significand bits, and the sticky region. For packed SP multiplies, the unrounded products are aligned such that the "high" subword product is fully left aligned and the "low" subword product is fully right aligned. To help simplify the rounding, the DP and EP multiplies align the final product such that the number of unique rounding points is reduced without adding more precision multiplexer inputs. For EP multiplies that are to be rounded to SP (EP24), the unrounded product is aligned such that the LSB of the product is in the same position as the LSBs of the DP product and of the EP product to be rounded to DP (EP53). This has the added benefit of reducing the size of the sticky region compared to its size if the product were instead fully left aligned. It is also possible to align the EP64 and IP68 rounding points, but this would require an additional precision multiplexer input in the multiplier stage. The 76-bit internal precision product (IP76) is used for intermediate results in division and square root. No rounding is needed for this mode since truncation is sufficient [11].

Fig. 7. Unrounded product alignment.

As an example, the multiplication algorithm for EP rounded to SP (EP24) is shown graphically in Fig. 8. To align the LSB of the EP24 product with the LSBs from the SSE-DP and EP53 products, the multiplicand and multiplier are aligned to the right as far as possible. For the first pass, the lower 27 multiplier bits are selected for the multiplier operand; for the second pass, the next 27 bits are selected; and for the third pass, the upper 10 bits are selected, with 17 zeros prepended to form the 27-bit multiplier operand supplied to the Booth encoders. Fig. 8 also shows the 103-bit product generated from each pass, how the 103-bit product is partitioned into feedback, sticky, and carry regions, and the final result extraction. During the first two passes, the feedback term is sent back to the multiplier, and the bits in the sticky and carry regions are sent to the DP/EP rounder discussed in Section 5. During the third pass through the multiplier, all of the product bits in carry-save format are sent to the DP/EP rounder. In this pass, the 48 lower product bits correspond to the sticky and carry regions, the next 24 product bits make up the significand if overflow does not occur, and the 29 upper product bits are discarded.

Fig. 8. EP multiply rounded to SP (EP24).

5 ROUNDING

Before describing the details of the proposed rounding scheme, the rounding scheme used in the AMD-K7™/AMD-K8™ FPM is briefly explained [8], [9]. In this rounding scheme, the product is computed using three separate 152-bit carry-propagate adders (CPAs). The first CPA computes the unrounded result for denormals and determines the significand product overflow bit. The second CPA computes a rounded result under the assumption that the unrounded result will not have an overflow, i.e., that the unrounded product is in the range [1.0, 2.0). The third CPA computes a rounded result under the assumption that the unrounded result will have an overflow, i.e., that the unrounded product is in the range [2.0, 4.0). Rounding is achieved by selecting a rounding constant which, when added to the product, reduces all rounding modes to a simple truncation with a possible LSB fix-up for round-to-nearest-even (RTNE). To avoid an extra carry-propagate addition, the rounding constant is first combined with the redundant carry-save form of the product using a 3-2 CSA before being passed to the CPA. The 3-2 CSA also provides support for the divide and square-root operations for computing the "back-mul" step [8]. For RTNE, the rounding constant consists of a single one in the round bit position (i.e., the half-ULP position). Therefore, if the round bit is one, the product is incremented. This achieves round-to-nearest-up, and in the case of a tie, the LSB is set to zero to keep the result even. For round-to-infinity, when the result is of the appropriate sign, the rounding constant consists of a string of ones starting from the round bit and ending at the LSB of the fully precise product. Therefore, any "1" located in that region causes the product to be incremented. The AMD-K7™/AMD-K8™ rounding scheme is fast and easily supports multiple rounding precisions, but it consumes a considerable amount of hardware and is therefore undesirable in low-cost and low-power systems.

The proposed rounding circuitry takes as input the product in redundant carry-save form and rounds the result according to the appropriate control word (FCW for x87 instructions or MXCSR for SSE instructions). The rounding circuitry contains separate rounding units for the SSE-SP high and SSE-SP low results, and a combined rounding unit that rounds the SSE-DP, x87-EP, and divide/square-root results. Each of the rounding units is based on a compound adder rounding scheme, which is more power and area efficient than the rounding scheme used in the AMD-K7™/AMD-K8™ multiplier [8]. It should be noted that the AMD-K8™ rounding scheme is inherently faster than the rounding scheme presented here, but at the cost of increased area and power. The microarchitecture requires that the FPM be able to produce the unrounded, normalized result to support denormalized results, as described at the end of this section. This complicates the use of injection-based rounding, described in [19], [20], and [21], which could otherwise potentially simplify the rounding units.
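The constant-based reduction of all rounding modes to truncation, as used in the AMD-K7™/AMD-K8™ scheme described above, can be modeled behaviorally. The sketch below is a deliberately simplified assumption: it operates on an unsigned fixed-point product, ignores overflow and normalization, and its function and mode names are invented for illustration, not taken from the design.

```python
def round_by_constant(product: int, frac_bits: int, mode: str,
                      negative: bool = False) -> int:
    """Round an unsigned fixed-point product with frac_bits fraction bits
    by adding a mode-dependent constant and truncating.
    Modes: 'rtne' (nearest-even), 'rtz' (truncate), 'rup', 'rdn'."""
    half = 1 << (frac_bits - 1)      # a single one in the round-bit position
    ones = (1 << frac_bits) - 1      # ones from the round bit down to the LSB
    if mode == 'rtne':
        constant = half              # increment whenever the round bit is set
    elif (mode == 'rup' and not negative) or (mode == 'rdn' and negative):
        constant = ones              # any '1' below the LSB forces an increment
    else:                            # 'rtz', or round-to-infinity of wrong sign
        constant = 0
    result = (product + constant) >> frac_bits   # add constant, then truncate
    if mode == 'rtne' and (product & ones) == half:
        result &= ~1                 # exact tie: LSB fix-up to keep result even
    return result
```

For example, with four fraction bits, the value 5.5 (88/16) under 'rtne' adds the half-ULP constant, truncates to 6, and the tie fix-up leaves 6 (already even), while 6.5 (104/16) truncates to 7 and is fixed up to 6.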
The SSE-SP rounder performs SSE-SP rounding only. It is a highly optimized and compact rounder compared to the DP/EP rounder, since it only has to deal with one precision. This unit has two identical instances: one for the lower SSE-SP result and one for the upper SSE-SP result. A block diagram of the SP rounder is given in Fig. 9. In the proposed SP rounding scheme, the upper 24 bits are passed through one level of HAs, which compresses the lower two bits to one bit (Xs[1]). The lower bits are denoted as a0 = Ps[23], b0 = Pc[23], and a1 = Xs[1]. The sum of these bits is denoted as sum[1:0] = {a1, a0} + {0, b0}. These three bits (a0, a1, b0) are passed to a set of 2-bit constant adders which compute sum[1:0] plus zero (sum0[1:0]), sum[1:0] plus one (sum1[1:0]), sum[1:0] plus two (sum2[1:0]), and sum[1:0] plus three (sum3[1:0]). The 2-bit constant adders also compute the carry-out from bit 1 into bit 2 for each summation case (c2p0, c2p1, c2p2, c2p3). The upper 23 bits are passed to a two-way compound adder that computes their sum plus zero (S0 = Xs[24:2] + Xc[24:1]) and their sum plus one (S1 = Xs[24:2] + Xc[24:1] + 1). Each of these sums is then normalized based on the significand product overflow bits (V0 for S0 and V1 for S1).

Fig. 9. SP rounder.

In parallel with the upper data path, the lower 24 bits are passed to a carry-tree and sticky-bit computation logic. The carry-tree computes the unrounded LSB (L), the round bit (R), and the carry-out from the R-bit (Rcout). In parallel, the sticky-bit computation logic performs the logical OR of the lower 22 bits to produce the sticky bit (S). Two sets of rounding selects are then determined using L, R, Rcout, S, the product's sign (sign), and the rounding mode. One set of rounding selects assumes that overflow of the product does not occur (V = 0), or equivalently, that the unrounded significand product is in the range [1.0, 2.0). The other set of rounding selects assumes that overflow of the product does occur (V = 1), or equivalently, that the unrounded significand product is in the range [2.0, 4.0). This is similar to the approach described in [22], except that all possibilities are computed in parallel to reduce delay. The two LSBs are selected for each condition (V = 0 and V = 1), and based on Rcout, the unrounded overflow bit (V) is determined. The V-bit is then used to select the appropriate rounding increment determination, which in turn selects S0 or S1. Finally, for the RTNE rounding mode, the LSB may need to be set to zero.

The rounding algorithm is described in pseudocode in Fig. 10. It should be noted that the particular ordering of the steps described was chosen for ease of description; in the actual hardware implementation, the order of each step is best determined by examining the specific timing paths and ensuring a balance between the upper path and the lower path. For instance, the order of the round-increment selection step and the normalization step can be swapped. It should also be noted that originally the SP and DP/EP rounding algorithms both used two consecutive HA rows to accommodate all rounding possibilities. However, analysis during formal verification efforts revealed that it was possible to reduce this to one HA row.

Fig. 10. SP rounding algorithm.

The combined DP/EP rounder performs rounding for SSE-DP, x87-SP, x87-DP, x87-EP, IP68 (for transcendental functions), and for divide and square-root operations. A block diagram of the DP/EP rounder is shown in Fig. 11. Due to the large number of different precisions that must be supported, the DP/EP rounder is split over two cycles, as it is in the AMD-K8™ processor. However, unlike the AMD-K8™ FPM, the combined DP/EP rounder is based on a compound adder rounding scheme that is more area and power efficient than the AMD-K8™ rounding scheme. The DP/EP rounding scheme is similar to the SP rounding scheme, except that it is necessary to perform a right shift to prealign the rounding point to the same significance prior to the compound addition, and to perform a left shift to postalign the MSB to the same significance after the compound addition. This is the cost of having to support multiple rounding points in the same data path. The second difference is that the carry-tree and sticky logic need to include the carry-out and sticky from previous iterations. The third difference is that, for each target precision, there is a pair of 2-1 multiplexers that insert the two rounded LSBs into the correct positions within the final rounded significand. The DP/EP rounder also provides a bypass path for divide and square root to allow the compound adder to be reused for other additions, such as computing the intermediate quotient ±1 ULP, instead of adding dedicated hardware. For simplicity, Fig. 11 does not show the rounding circuitry required for divide and square root.

Fig. 11. Combined DP/EP rounder.
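The essence of compound-adder rounding, around which both rounders above are built, can be captured in a short behavioral model. This sketch is an assumption-laden simplification, not the Fig. 9 circuit: it handles only RTNE, omits the overflow (V) case and normalization, and models the carry-save input with a full integer addition, whereas the hardware splits that work between a compound adder on the upper bits and a carry tree on the lower bits.

```python
def compound_round_rtne(p_sum: int, p_carry: int, frac_bits: int) -> int:
    """Round-to-nearest-even a carry-save product (p_sum + p_carry)
    that has frac_bits bits below the result LSB."""
    total = p_sum + p_carry               # behavioral shortcut for the CSA vectors
    upper = total >> frac_bits            # bits at and above the result LSB
    s0, s1 = upper, upper + 1             # compound adder outputs: sum and sum + 1
    lsb = upper & 1                                       # unrounded LSB (L)
    rnd = (total >> (frac_bits - 1)) & 1                  # round bit (R)
    sticky = (total & ((1 << (frac_bits - 1)) - 1)) != 0  # OR of lower bits (S)
    increment = rnd and (sticky or lsb)   # RTNE decision from L, R, and S
    return s1 if increment else s0        # final step is only a selection
```

The point of the structure is that the expensive additions (s0 and s1) proceed before the rounding decision is known; the decision merely selects between them, so rounding adds only multiplexer delay after the adder.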
This is fed to an external unit, which performs denormalization and rounding according to the IEEE-754 standard. To support this system, the floating-point registers are represented as normalized numbers with an extended exponent field in the register file. The internal representation is converted from memory format during loads and to memory format during stores. This approach for handling denormals is also used in the AMD-K8 processor.

6 RELATED MULTIPLIER ARCHITECTURES

Previous x86 FPMs have taken various forms. The Cyrix multiplier includes a 17-bit × 69-bit rectangular significand multiplier that uses radix-8 signed encoding, a signed-digit summation tree, and signed-digit redundant feedback [10]. This design is very area efficient. In contrast, the AMD-K7/AMD-K8 multiplier includes a fully pipelined 76-bit × 76-bit significand multiplier with a latency of four cycles and is optimized for EP operations [8]. The Intel Pentium 4 multiplier is fully pipelined for DP and takes two iterations for EP [13]. Both the AMD-K7/AMD-K8 multiplier and the Intel Pentium 4 multiplier can execute two parallel SP (packed) multiplies every clock cycle.

Iterative FPMs have also been described in the literature. For example, Anderson et al. [17] describe an iterative tree multiplier that generates only six partial products per cycle and requires five cycles to assimilate the 56-bit multiplier significand. In [14], a dual-mode iterative FPM is described that executes an SP multiply in two clock cycles with a throughput of one multiplication per clock cycle, or a DP multiply in three clock cycles with a throughput of one multiplication per two clock cycles. That multiplier consists of a 27-bit × 53-bit tree multiplier coupled with an injection-based rounder. In [18], a single-pass fused-multiply-add (FMA) floating-point unit is compared to a dual-pass FMA floating-point unit. Both FMA units support SP and DP operations.
The dual-pass FMA unit is likewise based on an iterative rectangular multiplier and executes an SP FMA operation in one pass and a DP FMA operation in two passes. None of these iterative designs supports simultaneous (packed) SP operations. Lastly, Akkas and Schulte [23] describe an iterative FPM that supports two DP multiplies without iteration or a quadruple-precision multiply using two iterations. In this design, the quadruple-precision multiply is achieved using an iterative algorithm.

Alternative methods for achieving packed integer multiplies are described in [24] and [25], and an application to packed FMA is described in [26]. A dual-mode FPM which supports one DP multiply or two parallel SP multiplies is described in [22]. This multiplier uses radix-8 Booth encoding and handles the packed multiplies in a fashion similar to the proposed design, except that the generation and compression of partial products is performed in multiple pipeline stages and EP multiplies are not supported. The multiplier is fully pipelined and operates without stalling. It therefore requires a full DP significand multiplier.

7 RESULTS, COMPARISON, AND TESTING

The proposed rectangular multiplier was implemented in a 65-nm SOI technology using static CMOS logic and a data-path-oriented, cell-based methodology. The cell library used consisted of typical static CMOS cells in addition to some specialized cells such as the 4-2 compressor, the Booth encoder, and the Booth multiplexer. To provide a point of comparison, a design similar to the AMD-K8 FPM (AMD-K8-FPM) described in [8] and [9] was also implemented in the same technology. The implementation results are shown in Tables 1, 2, and 3.

TABLE 1 Area/Power Comparison for Significand Multipliers

TABLE 2 Area/Power Comparison for Rounders

TABLE 3 Area/Power Comparison for Entire Significand Data Path

The dynamic power was measured by applying random input patterns and measuring the average current using a SPICE-like circuit simulator with the transistor netlist and extracted parasitics. Both designs were measured using the same clock frequency (f_typical) and the same supply voltage (V_typical). The proposed design consumes significantly less area and dynamic power compared to the baseline design (AMD-K8-FPM). The AMD-K8-FPM is a highly aggressive design that is specifically targeted toward high performance. In contrast, the proposed design is intended to be a low-cost and low-power solution with similar functionality. The implementation results reflect the two different design objectives.

Functional testing was performed using a mixture of random data patterns and directed data patterns by simultaneously applying the same stimulus to the proposed iterative FPM unit and the AMD-K8 reference FPM unit. The results from each unit were captured and compared.

A comparison of multiply instruction latencies and throughputs is given in Table 4. Performance modeling studies were performed to measure the estimated instructions per cycle (IPC) for a range of benchmarks. The AMD-K8 performance model configured with the original AMD-K8 FPM instruction latencies and throughputs served as the baseline model, while the AMD-K8 performance model configured with the proposed iterative FPM instruction latencies and throughputs served as the comparison model. As expected, performance studies using SSE-SP-dominated target applications demonstrated an increase in performance compared to the baseline design. For instance, a set of SSE-SP-dominant traces extracted from the SPEC CPU2006 benchmark demonstrated a range of improvements from 1.1 percent to 10.5 percent relative to the baseline design. For x87-dominant applications, there was a similar decrease in performance. However, since those applications are mainly dominated by memory throughput, the difference was not significant on average, and other microarchitectural choices such as load bandwidth and instruction window size are more important. For example, on average, the SPECfp2000 benchmark, which contains a significant percentage of x87 instructions, demonstrated a performance loss of 2.5 percent. The x87 architecture requires that the multiplication be carried out in EP and then rounded to the target precision of SP, DP, or EP. Therefore, it is necessary to perform a full EP multiply even if the operands only contain significant bits which fit within the SP region or within the DP region. To reduce the latency of some x87 multiplies, it is possible to detect the number of significant bits in the multiplier, determine whether this quantity falls within the range of SP, DP, or EP, and then only perform the multiplication to that precision. The multiplicand does not need to be examined, since it does not contribute to the number of passes through the 76-bit × 27-bit multiplier array.
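The pass-count decision above depends only on how far down from the MSB the lowest set bit of the multiplier significand lies. The sketch below models that decision; the 27-bits-per-pass figure and the 76-bit × 27-bit array come from the text, while the latency formula (passes + 2 cycles) is our assumption, chosen to match the quoted three- and five-cycle EP cases.

```python
# Sketch of the x87 latency optimization: count the significant bits of
# the multiplier operand (the multiplicand never matters) and run only as
# many passes through the rectangular array as those bits require.

EP_SIG_BITS = 64       # extended-precision significand width
BITS_PER_PASS = 27     # multiplier bits retired per pass through the array

def significant_bits(sig):
    """Leading significant bits of a normalized significand: the span
    from the MSB down to the lowest set bit, inclusive."""
    assert sig & (1 << (EP_SIG_BITS - 1)), "significand must be normalized"
    lowest_set = (sig & -sig).bit_length() - 1
    return EP_SIG_BITS - lowest_set

def multiplier_passes(sig):
    """Passes through the 76-bit x 27-bit array for this multiplier."""
    return -(-significant_bits(sig) // BITS_PER_PASS)   # ceiling division

def ep_multiply_latency(sig):
    """Assumed latency model: one pass -> 3 cycles, three passes -> 5."""
    return multiplier_passes(sig) + 2

# A multiplier whose significand is 1.0 needs only a single pass:
one_pass = multiplier_passes(0x8000_0000_0000_0000)
```

A DP-range multiplier (53 significant bits) lands on two passes, and a full EP multiplier on three, matching the latencies in Table 4.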
For instance, if the multiplier significand contains fewer than 28 leading significant bits, then only a single pass through the multiplier array is required: the latency of the EP multiply is reduced from five cycles to three cycles, and the throughput is increased from 1/3 to 1. To make use of this feature, it is necessary to use an instruction scheduler that can either accommodate data-dependent instruction latencies or keep track of the number of significant bits in the data. This feature relies on the assumption that, for certain applications, the operands have SP or DP ranges. Furthermore, if it can be arranged that the multiplier always contains fewer significant bits than the multiplicand, this will increase the extent to which the feature can be used. Using this feature can recover some of the performance loss introduced by the pipeline stalls due to the iterative nature of the EP multiplies.

TABLE 4 Latency/Throughput Comparison

8 CONCLUSION

This paper has presented an x86-compatible FPM that is based on a 76-bit × 27-bit rectangular multiplier and is optimized for packed SSE-SP multiplies. The multiplier is compared to a design with similar functionality that was optimized instead for the largest precision. The proposed design consumes significantly less area and power while achieving improved performance for the target applications and only slightly reduced performance for x87-dominated applications. The rectangular multiplier also facilitates efficient algorithms for divide and square root with a small amount of additional hardware.

ACKNOWLEDGMENTS

We would like to thank Peter Seidel for suggesting optimizations to the rounding circuitry based on analysis derived from formal verification efforts, Albert Danysh and Eric Quinnell for their excellent work on the multiplier and rounding circuitry implementation, Raj Desikan for his excellent work on the performance modeling and analysis, and the anonymous reviewers for their helpful comments.

REFERENCES

[1] P. Ranganathan, S. Adve, and N. Jouppi, "Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions," Proc. 26th Ann. Int'l Symp. Computer Architecture (ISCA '99), vol. 27, pp. 124-135, May 1999.
[2] S.K. Raman, V. Pentkovski, and J. Keshava, "Implementing Streaming SIMD Extensions on the Pentium III Processor," IEEE Micro, vol. 20, pp. 47-57, July 2000.
[3] M.-L. Li, R. Sasanka, S. Adve, Y.-K. Chen, and E. Debes, "The ALPBench Benchmark Suite for Complex Multimedia Applications," Proc. IEEE Int'l Symp. Workload Characterization (IISWC '05), pp. 34-45, Oct. 2005.
[4] H. Nguyen and L.K. John, "Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology," Proc. 13th Int'l Conf. Supercomputing (ICS '99), pp. 11-20, June 1999.
[5] Advanced Micro Devices, AMD64 Architecture Programmer's Manual, Volume 4: 128-Bit Media Instructions, rev. 3.07, Dec. 2005.
[6] Advanced Micro Devices, AMD64 Architecture Programmer's Manual, Volume 5: 64-Bit Media and x87 Floating-Point Instructions, rev. 3.06, Dec. 2005.
[7] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, third ed., ch. 2, p. 119, Morgan Kaufmann, May 2002.
[8] S. Oberman, "Floating-Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor," Proc. 14th IEEE Symp. Computer Arithmetic (ARITH '99), pp. 106-115, Apr. 1999.
[9] C. Keltcher, K. McGrath, A. Ahmed, and P. Conway, "The AMD Opteron Processor for Multiprocessor Servers," IEEE Micro, vol. 23, pp. 66-76, Mar. 2003.
[10] W. Briggs and D. Matula, "A 17 × 69 Bit Multiply and Add Unit with Redundant Binary Feedback and Single Cycle Latency," Proc. 11th IEEE Symp. Computer Arithmetic (ARITH '93), pp. 163-170, July 1993.
[11] M. Schulte, C. Lemonds, and D. Tan, "Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier," Proc. IEEE Int'l Conf. Computer Design (ICCD '07), pp. 304-310, Oct. 2007.
[12] ANSI and IEEE, IEEE-754 Standard for Binary Floating-Point Arithmetic, 1985.
[13] G. Hinton, M. Upton, D. Sager, D. Boggs, D. Carmean, P. Roussel, T. Chappell, T. Fletcher, M. Milshtein, M. Sprague, S. Samaan, and R. Murray, "A 0.18-um CMOS IA-32 Processor with a 4-GHz Integer Execution Unit," IEEE J. Solid-State Circuits, vol. 36, pp. 1617-1627, Nov. 2001.
[14] G. Even, S.M. Mueller, and P.-M. Seidel, "A Dual Mode IEEE Multiplier," Proc. Second Ann. IEEE Int'l Conf. Innovative Systems in Silicon (ISIS '97), pp. 282-289, Oct. 1997.
[15] S. Vassiliadis, E. Schwarz, and B. Sung, "Hard-Wired Multipliers with Encoded Partial Products," IEEE Trans. Computers, vol. 40, pp. 1181-1197, Nov. 1991.
[16] A. Weinberger, "4:2 Carry-Save Adder Module," IBM Technical Disclosure Bull., vol. 23, pp. 3811-3814, Jan. 1981.
[17] S. Anderson, J. Earle, R. Goldschmidt, and D. Powers, "The IBM System/360 Model 91: Floating-Point Execution Unit," IBM J. Research and Development, vol. 11, pp. 34-53, Jan. 1967.
[18] R.M. Jessani and M. Putrino, "Comparison of Single- and Dual-Pass Multiply-Add Fused Floating-Point Units," IEEE Trans. Computers, vol. 47, pp. 927-937, Sept. 1998.
[19] M.R. Santoro, G. Bewick, and M. Horowitz, "Rounding Algorithms for IEEE Multipliers," Proc. Ninth IEEE Symp. Computer Arithmetic (ARITH '89), pp. 176-183, Sept. 1989.
[20] G. Even and P.-M. Seidel, "A Comparison of Three Rounding Algorithms for IEEE Floating-Point Multiplication," IEEE Trans. Computers, vol. 49, pp. 638-650, July 2000.
[21] N.T. Quach, N. Takagi, and M. Flynn, "Systematic IEEE Rounding Method for High-Speed Floating-Point Multipliers," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 12, pp. 511-521, May 2004.
[22] A. Enriques and K. Jones, "Design of a Multi-Mode Pipelined Multiplier for Floating-Point Applications," Proc. IEEE Nat'l Aerospace and Electronics Conf. (NAECON '91), vol. 1, pp. 77-81, May 1991.
[23] A. Akkas and M. Schulte, "A Quadruple Precision and Dual Double Precision Floating-Point Multiplier," Proc. Euromicro Symp. Digital System Design (DSD '03), pp. 76-81, Sept. 2003.
[24] D. Tan, A. Danysh, and M. Liebelt, "Multiple-Precision Fixed-Point Vector Multiply-Accumulator Using Shared Segmentation," Proc. 16th IEEE Symp. Computer Arithmetic (ARITH '03), pp. 12-19, June 2003.
[25] S. Krithivasan and M.J. Schulte, "Multiplier Architectures for Media Processing," Proc. IEEE 37th Asilomar Conf. Signals, Systems, and Computers (ACSSC '03), vol. 2, pp. 2193-2197, Nov. 2003.
[26] L. Huang, L. Shen, K. Dai, and Z. Wang, "A New Architecture for Multiple-Precision Floating-Point Multiply-Add Fused Unit Design," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 69-76, June 2007.

Dimitri Tan received the BSEE degree from the University of Adelaide, Australia. He was previously with Motorola Inc. and Freescale Semiconductor Inc., where he worked on various microprocessor and SoC designs. He is currently with Advanced Micro Devices Inc., Austin, Texas, working on x86 microprocessor design. His research interests include computer architecture, computer arithmetic, and reconfigurable computing. He is a member of the IEEE.

Carl E. Lemonds received the BSEE and MSEE degrees from the University of Missouri, Columbia. He worked in corporate R&D at Texas Instruments, where he designed arithmetic circuits and algorithms for various DSP test chips. After a brief stint at Cyrix, he joined Intel in 1999. At Intel, he worked on the FPU for the Tejas project (a Pentium 4 class processor). In January of 2004, he joined Advanced Micro Devices (AMD) Inc., Austin, Texas, where he is currently a principal member of the technical staff. His interests include computer arithmetic, floating-point, and DSP. His current research is in vector floating-point processors. He is a senior member of the IEEE and a member of the ACM.

Michael J. Schulte received the BS degree in electrical engineering from the University of Wisconsin-Madison and the MS and PhD degrees in electrical engineering from the University of Texas, Austin. He is currently an associate professor at the University of Wisconsin-Madison, where he leads the Madison Embedded Systems and Architectures Group. His research interests include high-performance embedded processors, computer architecture, domain-specific systems, and computer arithmetic. He is a senior member of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.