Low-Power Multiple-Precision Iterative Floating-Point Multiplier

IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009    175

        Low-Power Multiple-Precision Iterative
      Floating-Point Multiplier with SIMD Support
                     Dimitri Tan, Member, IEEE, Carl E. Lemonds, Senior Member, IEEE, and
                                    Michael J. Schulte, Senior Member, IEEE

       Abstract—The demand for improved SIMD floating-point performance on general-purpose x86-compatible microprocessors is
       rising. At the same time, there is a conflicting demand in the low-power computing market for a reduction in power consumption.
       Along with this, there is the absolute necessity of backward compatibility for x86-compatible microprocessors, which includes
       the support of x87 scientific floating-point instructions. The combined effect is that there is a need for low-power, low-cost
       floating-point units that are still capable of delivering good SIMD performance while maintaining full x86 functionality. This paper
       presents the design of an x86-compatible floating-point multiplier (FPM) that is compliant with the IEEE-754 Standard for Binary
       Floating-Point Arithmetic [12] and is specifically tailored to provide good SIMD performance in a low-cost, low-power solution while
       maintaining full x87 backward compatibility. The FPM efficiently supports multiple precisions using an iterative rectangular
       multiplier. The FPM can perform two parallel single-precision multiplies every cycle with a latency of two cycles, one
       double-precision multiply every two cycles with a latency of four cycles, or one extended-double-precision multiply every three
       cycles with a latency of five cycles. The iterative FPM also supports division, square-root, and transcendental functions. Compared
       to a previous design with similar functionality, the proposed iterative FPM has 60 percent less area and 59 percent less dynamic
       power dissipation.

       Index Terms—Computer arithmetic, rectangular multiplier, floating-point arithmetic, low-power, multiplying circuits, multimedia,
       very-large-scale integration.



• D. Tan and C.E. Lemonds are with Advanced Micro Devices Inc., PCS-3,
  9500 Arboretum Blvd, Suite 400, Austin, TX 78759.
  E-mail: {Dimitri.Tan, Carl.Lemonds}
• M.J. Schulte is with the University of Wisconsin-Madison, 4619
  Engineering Hall, 1415 Engineering Drive, Madison, WI 53706-1691.
  E-mail:

Manuscript received 21 July 2007; revised 28 Feb. 2008; accepted 18 Sept.
2008; published online 23 Oct. 2008.
Recommended for acceptance by P. Kornerup, P. Montuschi, J.-M. Muller,
and E. Schwarz.
For information on obtaining reprints of this article, please send e-mail to:, and reference IEEECS Log Number TCSI-2007-07-0339.
Digital Object Identifier no. 10.1109/TC.2008.203.
0018-9340/09/$25.00 © 2009 IEEE. Published by the IEEE Computer Society.
Authorized licensed use limited to: University of Wisconsin. Downloaded on March 30, 2009 at 19:16 from IEEE Xplore. Restrictions apply.

1    INTRODUCTION
Ever since the introduction of SIMD extensions to general-purpose processors, there has been a
rising demand for improved SIMD performance to accommodate 3D graphics, video conferencing,
and other multimedia applications [1], [2], [3], [4], [5]. At the same time, the low-power
computing market is demanding a reduction in power consumption despite an increase in
performance. In general, these two requirements are conflicting since increased performance is
typically achieved with a corresponding increase in power consumption due to increased
frequency, increased hardware resources, or a combination of these.
   Backward compatibility of the x86 microprocessors has enabled the survival of this Complex
Instruction Set Computer (CISC) architecture and is therefore an absolute requirement for future
microprocessors. In the area of floating-point, backward compatibility includes support for
x87 floating-point instructions [6]. These instructions are used in scientific computing and are
not generally used in multimedia applications [7]. In current x86 processors, the SIMD
floating-point extensions include SSE, SSE2, and SSE3 [5]. These instructions are heavily used in
multimedia applications and, in particular, single-precision (SP) operations occur very
frequently [7].
   In recent x86 floating-point units, the SIMD extensions and x87 instructions are mapped onto
the same hardware to save resources. In the AMD-K7™ and AMD-K8™ microprocessors and
derivatives, the hardware is optimized for x87 instructions [8], [9]. An alternative approach,
presented in this paper, is to optimize for SIMD extensions and provide x87 functionality with a
reduction in the performance of the latter. The advantage of this alternative approach is a
reduction in hardware resources and power, and an improvement in the performance of the
SIMD extensions.
   This paper presents the design of an x86-compatible floating-point multiplier (FPM) that is
optimized for SP SSE instructions. The FPM can perform two parallel 24-bit × 24-bit SP multiplies
each cycle with a latency of two cycles, one 53-bit × 53-bit double-precision (DP) multiply every
two cycles with a latency of four cycles, or one 64-bit × 64-bit extended-double-precision (EP)
multiply every three cycles with a latency of five cycles. In addition to performing
multiplication, the FPM is used to perform division and square root, and provides support for the
x87 transcendental functions. Two internal multiplier significand precisions of 68-bit × 68-bit
and 76-bit × 76-bit are required to support divide, square-root, and transcendental functions.
   The FPM is based on a rectangular significand multiplier tree that performs DP and EP
multiplies through iteration. A rectangular multiplier is of the form N × M, where the
multiplicand width N is greater than the multiplier width M [10]. The rectangular FPM uses
significantly less hardware than a fully pipelined multiplier. Furthermore, the rectangular FPM
reduces the latency of SP multiplies, and the wider multiplicand conveniently accommodates
two parallel SP (packed) multiplies. The rectangular multiplier is also used to decrease the
latency of divide and square-root operations as described in [11]. The combination of these
effects has the potential to reduce power dissipation for multimedia applications.
   The main contribution of this paper is the presentation of an iterative rectangular FPM that is
optimized for packed SP multiplies and efficiently supports DP and EP multiplies. Several of the
individual techniques presented in this paper have been previously published, but the manner in
which they have been combined in this design has not been previously published to the authors’
knowledge. Specifically, this is the only multiplier that uses multiple passes for DP and EP
multiplies to reduce area and power while supporting two packed SP multiplies in a single pass.
This paper also presents a new rounding scheme that efficiently supports multiple iterations,
multiple precisions, and multiple rounding boundaries for EP. The proposed FPM complies with the
IEEE Standard for Binary Floating-Point Arithmetic [12] with some external hardware and
microcode support, and it supports the SSE and x87 floating-point multiply, divide, square-root,
and transcendental function instructions specified in [6]. As demonstrated in Section 7, the
proposed FPM reduces area and dynamic power by roughly 60 percent compared to a previous FPM
with similar functionality.
   The remainder of this paper is organized as follows: Section 2 gives a brief overview of the
main ideas and the theory behind the techniques used in the FPM. Section 3 presents the hardware
architecture of the FPM. Section 4 describes the iterative multiplication algorithm. Section 5
describes the rounding algorithm and hardware. Section 6 gives an overview of previous x86 FPMs
and iterative FPMs. Section 7 provides area and power estimates for the proposed design and
compares it to a previous design with similar functionality. Section 8 gives our conclusions.

2    MAIN IDEAS AND THEORY
According to [13], “Many FP/multimedia applications have a fairly balanced set of multiplies and
adds. The machine can usually keep busy interleaving a multiply and an add every two clock
cycles at much less cost than fully pipelining all the FP/SSE execution hardware.”
Multiplication readily lends itself to iterative algorithms and can accommodate numerous
configurations which enable various area versus latency trade-offs. As noted in [7], “Most
graphics multimedia applications use 32-bit floating-point operations.” Therefore, a reasonable
approach is to optimize for SP operations.
   Before describing the multiplier architecture, it is worthwhile to briefly review some of the
techniques that it uses. The multiplier presented in this paper uses both recursion and iteration
to trade off performance (i.e., throughput) against area and power. A recursive multiplier
algorithm divides a single wider multiplication into multiple narrower multiplications and sums
the resulting products. For example,

\[
\begin{aligned}
A \times B &= (A_H + A_L) \times (B_H + B_L) \\
           &= (A_H \times B_H) + (A_H \times B_L) \\
           &\quad + (A_L \times B_H) + (A_L \times B_L),
\end{aligned}
\]

where A is the multiplicand, and B is the multiplier. A and B can be divided into an arbitrary
number of parts of different widths. This partitioning gives different design choices and
trade-offs. The maximum widths dictate the hardware requirements. The recursive algorithm can be
applied iteratively by reusing the same hardware and performing each of the narrower
multiplications in different cycles. For example,

\[
A \times B = (A \times B_H)_{\text{iteration 1}} + (A \times B_L)_{\text{iteration 2}}.
\]

Typically, in an iterative-recursive multiplier algorithm, the product from the previous
iteration is fed back to the current iteration in redundant form to avoid the delay of carry
propagation in the critical path. The redundant product is typically merged into the partial
product addition tree without adding delay.
   Typically, FPMs assume normalized inputs and attempt to combine the addition and rounding
stages to avoid the delay of two carry propagations in series. It is possible to do this if
rounding is performed before normalization. If we assume normalized inputs, rounding in an FPM
must deal with two distinct cases: rounding overflow and no rounding overflow. Rounding overflow
refers to the case in which the unrounded product is in the range [2.0, 4.0), and no rounding
overflow refers to the case in which the unrounded product is in the range [1.0, 2.0). These two
cases can be computed separately using dedicated adder circuits and then selected once the
overflow outcome is known [8]. In this scheme, a constant is added to the intermediate product to
reduce all rounding modes to round-to-zero, i.e., truncation. The constant is rounding mode
dependent and precision dependent and thus can accommodate multiple rounding modes and
precisions. Alternatively, injection-based rounding also adds (injects) a constant but then uses
a compound adder to compute the sum and sum + 1 [14]. This allows both rounding overflow and no
rounding overflow cases to be handled simultaneously with only one adder. Accommodating multiple
rounding positions in injection-based rounding becomes problematic because the use of the
compound adder assumes a fixed rounding position.
   The multiplier presented in this paper uses recursive-iterative multiplication to perform DP
and EP multiplies by taking multiple passes through a rectangular multiplier. It also has the
ability to perform two SP multiplies in parallel. Rounding results to different precisions is
implemented using two separate rounding paths: one that takes one cycle and is highly optimized
for two parallel SP operations and another which takes two cycles and handles higher precision
operations.
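
As an illustration of the constant-addition idea described in this section, the following Python sketch rounds an integer significand product by adding a mode-dependent constant and then truncating. It is not the FPM's rounding hardware. The function names, the integer model of the significands, the mode names (with "up" standing for rounding a positive magnitude toward +infinity), and the explicit tie fix-up for round-to-nearest-even are all assumptions made for the example. Both the overflow and the no-overflow results are computed, and one is selected from the unrounded product, which mirrors the two-case structure in the text.

```python
def round_by_injection(product, drop, mode):
    """Discard the `drop` low bits of an unsigned integer after adding a
    mode-dependent constant, so that every mode ends in a plain truncation."""
    half = 1 << (drop - 1)
    mask = (1 << drop) - 1
    # Hypothetical mode names; "up" assumes a positive magnitude.
    inject = {"zero": 0, "nearest_even": half, "up": mask}[mode]
    result = (product + inject) >> drop
    if mode == "nearest_even" and (product & mask) == half:
        result &= ~1  # exact tie: clear the LSB to land on the even neighbor
    return result


def round_product(a, b, mode="nearest_even", p=24):
    """Multiply two normalized p-bit significands (MSB set) and round to p bits.

    Returns (significand, exponent_adjust).
    """
    product = a * b                    # value in [1.0, 4.0), 2p-1 or 2p bits
    overflow = product >> (2 * p - 1)  # 1 when the product is in [2.0, 4.0)
    no_ovf = round_by_injection(product, p - 1, mode)  # no-overflow case
    ovf = round_by_injection(product, p, mode)         # overflow case
    significand, adjust = (ovf, 1) if overflow else (no_ovf, 0)
    if significand >> p:               # rounding itself carried out to 2.0
        significand >>= 1
        adjust += 1
    return significand, adjust
```

For instance, `round_product(0xFFFFFF, 0xFFFFFF)` models squaring the largest SP significand. The unrounded product lies in [2.0, 4.0), so the overflow result is selected and the exponent adjustment is 1.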


3    RECTANGULAR FLOATING-POINT MULTIPLIER ARCHITECTURE
A block diagram of our proposed FPM, illustrating the details of the significand data path, is
shown in Fig. 1. To simplify Fig. 1, the additional hardware for exception processing, exponent
computations, and divide/square-root support is not shown. The significand data path consists of
three pipeline stages. The first pipeline stage consists of a 76-bit × 27-bit multiplier which
uses modified radix-4 Booth recoding [15] and a partial product reduction tree consisting of
4-2 compressors [16]. The 76-bit × 27-bit multiplier accepts a feedback product in redundant
carry-save form to facilitate iteration and a 76-bit addend specifically to support divide and
square-root operations. The addend is needed because the iterations for divide and square-root
use a restricted form of the multiply-add operation. The details of the Goldschmidt-based divide
algorithm are explained in [11] and [17]. The operand width of 76 bits is required at the
microarchitectural level to support division at the internal precision of 68 bits for
transcendental functions [8].

Fig. 1. FPM significand data path.

   The second and third pipeline stages consist of combined addition and rounding followed by
result selection, formatting for different precisions, and forwarding of the result to the
register file and bypass networks. There are two identical copies of the SP rounding unit to
support packed SP multiply operations and a single combined DP/EP rounding unit that also handles
all rounding for all other precisions and for divide and square-root operations. The SP rounders
take one cycle and the DP/EP rounder takes two cycles. The outputs of the two SP rounders are
combined, formatted, and multiplexed with the output from the DP/EP rounder to select the final
result. The final result is written to the register file and forwarded back to the inputs of the
FPM and other FP units via the bypass networks to enhance performance of dependent operations.
With such a configuration, a scalar SP multiplication takes one iteration, two parallel (packed)
SP multiplications take one iteration, a scalar DP multiplication takes two iterations, and a
scalar EP multiplication takes three iterations. Fig. 2 shows the pipeline diagrams for each
precision supported by the FPM.

Fig. 2. FPM pipeline diagrams. (a) SSE-SP scalar (one SP multiply). (b) SSE-SP packed (two SP
multiplies). (c) SSE-DP. (d) x87 EP or Internal Precision (IP68, IP76).

   The significand multiplier consists of a 76-bit × 27-bit rectangular tree multiplier, which
performs 76-bit × 76-bit multiplications over multiple cycles, as shown in Fig. 3. This saves
considerable area compared to a fully parallel 76-bit × 76-bit multiplier, but penalizes the
performance of the higher precision (DP and EP) multiply instructions because the multiplier must
stall subsequent multiply instructions. However, the multiplier is fully pipelined for SP
operations. The multiplier accepts a 76-bit multiplicand input, a 76-bit multiplier input, and a
76-bit addend input. These inputs are held for the duration of the operation. The 76-bit
multiplier input is supplied to alignment multiplexing which outputs two 27-bit values. Each
27-bit value is then recoded using a set of modified radix-4 Booth encoders. Two separate 27-bit
multiplier values are required to support the packed SP mode.

Fig. 3. 76-bit × 27-bit rectangular multiplier.

   The outputs of the Booth encoders are used to select the multiples of the multiplicand to form
fourteen 81-bit partial products. One of the 27-bit multiplier values controls the generation of
the upper 38 bits of each partial product, while the other 27-bit multiplier value controls the
generation of the lower 38 bits of each partial product. In unpacked modes, the two 27-bit
multiplier values are identical. In parallel to the partial product generation, two 76-bit
feedback terms are combined with a 76-bit addend using a 3-2 carry-save adder (CSA). The
3-2 carry-save addition is computed in parallel with the Booth encoding and multiplexing and does
not add to the critical path. The 14 partial products plus two combined terms are summed using a
compression tree consisting of three levels of 4-2 compressors to produce a 103-bit product in
redundant carry-save representation. The 103-bit carry-save product is then stored in two 103-bit
registers.
   A diagram of the partial product array for the 76-bit × 27-bit multiplication is shown in
Fig. 4. This diagram also shows the alignment of the two 76-bit feedback terms and the 76-bit
addend. The two feedback terms are needed to support iterations and are aligned to the right. The
addend is needed to support division and square root and is aligned to the left. The division
algorithm that exploits this multiplier hardware is described in [11]. To avoid unnecessary
hardware, the additional terms are inserted into the unused portions of the array wherever
possible. Fig. 4 shows how the partial product terms are partitioned into groups of four
corresponding to the first level of 4-2 compressors shown in Fig. 3. Note that, in certain bit
positions, a 4-2 compressor cell is not required since some of the inputs are zeros. In these
cases, the 4-2 compressor cell can be replaced by a full adder (FA) cell (i.e., a 3-2 CSA), a
half-adder (HA) cell, or a buffer cell, depending on the number of inputs that are zero. The
subsequent levels of the compression tree can also benefit from these optimizations to save area.
Although the multiplier is unsigned, a sign extension term is required to accommodate the sign
embedded in the uncompressed feedback terms from the previous iteration. This is an artifact of
the signed nature of the Booth encoding and the use of sign encoding for each individual partial
product instead of sign extension [15]. Each partial product also requires “hot-ones” which are
used to account for the increment term required when taking the two's complement for negatively
weighted partial products [18]. For a given partial product, the hot-ones are appended to the
subsequent partial product. For positively weighted partial products, the hot-ones are zeros. As
shown in Fig. 3, the two feedback terms and addend are compressed using a 3-2 CSA into two terms
for a total of 16 values to be summed.

Fig. 4. Radix-4 Booth-encoded 76-bit × 27-bit partial product array.

   In order to support two parallel SP multiplications, the two SP multiplications are mapped
onto the array simultaneously. The superposition of two 24-bit × 24-bit multiplier partial
product arrays onto a 76-bit × 27-bit partial product array is shown in Fig. 5. Since the lower
array ends at bit 48, the significant bits of the upper array and lower array are separated by
seven bits. The reduction tree has three levels of 4-2 compressors. Therefore, the lower array
can propagate a carry at most three bit positions and will not interfere with the upper array.
Hence, no additional hardware is required to kill any potential carries propagating from the
lower array into the upper array. However, in order to accommodate the sign-encoding bits and the
hot-ones, an additional multiplexer is inserted after the Booth multiplexers and prior to the
4-2 compressor tree as indicated in Fig. 3. The multiplexing after the Booth multiplexing is only
required for the sign-encoding bits of the lower array and the hot-ones of the upper array, so
the additional hardware required is small. This hardware, however, is on the critical path and
adds the delay of a 2-1 multiplexer. An alternative to multiplexing in the sign-encoding bits and
hot-one bits after the Booth multiplexing is to insert these bits into the feedback terms, which
are all zeros for the first iteration.

4    ITERATIVE 76 × 27 MULTIPLICATION ALGORITHM
The iterative multiplication algorithm for the rectangular multiplier is given in Fig. 6. For
each multiply iteration, the appropriate multiplier bits are selected for the high and low
multiplier values, and the product is computed in redundant carry-save form. For SSE-SP
multiplies and the first iteration of all other precisions, the two feedback terms are set to
zero. For the second iteration of SSE-DP multiplies and the second and third iterations of EP
multiplies, the two feedback terms are set to the upper 76 bits of the product from the previous
iteration and are then added to the lower 76 bits of the current product. SP multiplies require
only a single iteration, DP multiplies require two iterations, and EP multiplies require three
iterations.
   The alignment of the unrounded product and the position of the rounding points within the
103-bit carry-save multiplier output are shown in Fig. 7. This diagram shows the position of the
rounding overflow bit (V), the most-significant bit of the product (M), the least-significant
bit of the product (L), the round bit (R), the remaining result significand bits, and the sticky
region. For packed SP multiplies, the unrounded products are aligned such that the “high” subword
product is fully left aligned and the “low” subword product is fully right aligned. To help
simplify the rounding, the DP and EP multiplies align the final product such that the number of
unique rounding points is reduced without adding more precision multiplexer inputs. For EP
multiplies that are to be rounded to SP (EP24), the unrounded product is aligned such that the
LSB of the product is in the same position as the LSB of the DP product and EP product to be
rounded to DP (EP53). This has the added benefit of reducing the size of the sticky region
compared to its size if the product is instead fully left aligned. It is also possible to align
the EP64 and IP68 rounding points, but this would require an additional precision multiplexer
input in the multiplier stage. The 76-bit internal precision product (IP76) is used for


is partitioned into feedback, sticky, and carry regions, and
the final result extraction. During the first two passes, the
feedback term is sent back to the multiplier, and the bits in
the sticky and carry regions are sent to the DP/EP Rounder
discussed in Section 5. During the third pass through the
multiplier, all of the product bits in carry-save format are
sent to the DP/EP Rounder. In this pass, the 48 lower
product bits correspond to the sticky and carry regions, the
next 24 product bits make up the significand if overflow
does not occur, and the 29 upper product bits are discarded.

                                                                                              5     ROUNDING
                                                                                              Before describing the details of the proposed rounding
                                                                                              scheme, the rounding scheme used in the AMD-K7TM /
                                                                                              AMD-K8TM FPM is briefly explained [8], [9]. In this rounding
                                                                                              scheme, the product is computed using three separate 152-bit
                                                                                              carry-propagate adders (CPAs). The first CPA computes the
                                                                                              unrounded result for denormals and determines the sig-
                                                                                              nificand product overflow bit. The second CPA computes a
                                                                                              rounded result with the assumption that the unrounded
                                                                                              result will not have an overflow, i.e., the unrounded product
                                                                                              is assumed to be in the range [1.0, 2.0). The third CPA
                                                                                              computes a rounded result with the assumption that the
                                                                                              unrounded result will have an overflow, i.e., the unrounded
                                                                                              product is assumed to be in the range [2.0, 4.0).
                                                                                                 Rounding is achieved by selecting a rounding constant
                                                                                              which, when added to the product, reduces all rounding
                                                                                              modes to a simple truncation with a possible LSB fix-up for
                                                                                              round-to-nearest-even (RTNE). To avoid an extra carry-
                                                                                              propagate addition, the rounding constant is first combined
                                                                                              with the redundant carry-save form of the product using a
                                                                                              3-2 CSA before being passed to the CPA. The 3-2 CSA also
                                                                                              provides support for the divide and square-root operations
                                                                                              for computing the “back-mul” step [8]. For RTNE, the
                                                                                              rounding constant consists of a single one in the round bit
                                                                                              position (i.e., the half ULP position). Therefore, if the round
                                                                                              bit is one, the product is incremented. This achieves round-
                                                                                              to-nearest-up and in the case of a tie, the LSB is set to zero
                                                                                              to keep the result even. For round-to-infinity, when the
                                                                                              result is of the appropriate sign, the round constant consists
                                                                                              of a string of ones starting from the round bit and ending at
                                                                                              the LSB of the fully precise product. Therefore, any “1”
                                                                                              located in that region causes the product to be incremented.
                                                                                              The AMD-K7TM /AMD-K8TM rounding scheme is fast and
                                                                                              easily supports multiple rounding precisions but consumes
                                                                                              a considerable amount of hardware and is therefore
Fig. 6. Iterative multiply algorithm.                                                         undesirable in low-cost and low-power systems.
                                                                                                 The proposed rounding circuitry takes as input the
intermediate results in division and square root. No                                          product in redundant carry-save form and rounds the
rounding is needed for this mode since truncation is                                          result according to the appropriate control word (FCW
sufficient [11].                                                                              for x87 instructions or MXCSR for SSE instructions). The
   As an example, the multiplication algorithm for EP                                         rounding circuitry contains separate rounding units for
rounded to SP (EP24) is shown graphically in Fig. 8. To align                                 SSE-SP high and SSE-SP low results, and a combined
the LSB of the EP24 product with the LSBs from the SSE-DP                                     rounding unit that rounds for SSE-DP, x87-EP, and divide/
and EP53 products, the multiplicand and multiplier are                                        square-root results. Each of the rounding units is based on a
aligned to the right as far as possible. For the first pass,                                  compound adder rounding scheme, which is more power
the lower 27 multiplier bits are selected for the multiplier                                  and area efficient than the rounding scheme used in the
operand, for the second pass the next 27 bits are selected,                                   AMD-K7TM /AMD-K8TM multiplier [8]. It should be noted
and for the third pass the upper 10 bits are selected with                                    that the AMD-K8TM rounding scheme is inherently faster
17 zeros prepended to form the 27-bit multiplier operand                                      than the rounding scheme presented here but at the cost of
supplied to the Booth encoders. Fig. 8 also shows the 103-bit                                 increased area and power. The microarchitecture requires
product generated from each pass, how the 103-bit product                                     that the FPM be able to produce the unrounded, normalized


Fig. 7. Unrounded product alignment.

Fig. 8. EP multiply rounded to SP (EP24).

result for support of denormalized results, as described at the end of this section. This complicates the use of injection-based rounding, described in [19], [20], and [21], which could potentially simplify the rounding units.

The SSE-SP rounder performs SSE-SP rounding only. This is a highly optimized and compact rounder compared to the DP/EP rounder since it only has to deal with one precision. This unit has two identical instances: one for the lower SSE-SP result and one for the upper SSE-SP result. A block diagram of the SP rounder is given in Fig. 9. In the proposed SP rounding scheme, the upper 24 bits are passed through one level of HAs which compresses the lower two bits to one bit (Xs[1]). The lower bits are denoted as a0 = Ps[23], b0 = Pc[23], a1 = Xs[1]. The sum of these bits is denoted as sum[1:0] = {a1, a0} + {0, b0}. These three bits (a0, a1, b0) are passed to a set of 2-bit constant adders which compute sum[1:0] plus zero (sum0[1:0]), sum[1:0] plus one (sum1[1:0]), sum[1:0] plus two (sum2[1:0]), and sum[1:0] plus three (sum3[1:0]). The 2-bit constant adders also compute the carry-out from bit 1 into bit 2 for each summation case (c2p0, c2p1, c2p2, c2p3). The upper 23 bits are passed to a two-way compound adder that computes their sum plus zero (S0 = Xs[24:2] + Xc[24:1]) and their sum plus one (S1 = Xs[24:2] + Xc[24:1] + 1). Each of these products is then normalized based on the significand product overflow bits (V0 for S0 and V1 for S1).

In parallel to the upper data path, the lower 24 bits are passed to a carry-tree and sticky-bit computation logic. The carry-tree computes the unrounded LSB (L), the round bit (R), and the carry-out from the R-bit (Rcout). In parallel, the sticky-bit computation logic performs the logical OR of the lower 22 bits to produce the sticky-bit (S). Two sets of rounding selects are then determined using L, R, Rcout, S, the product's sign (sign), and the rounding mode. One set of rounding selects assumes overflow of the product does not occur (V = 0), or equivalently, that the unrounded


Fig. 9. SP rounder.

significand product is in the range [1.0, 2.0). The other set of rounding selects assumes that overflow of the product does occur (V = 1), or equivalently, that the unrounded significand product is in the range [2.0, 4.0). This is similar to the approach described in [22], except that all possibilities are computed in parallel to reduce delay. The two LSBs are selected for each condition (V = 0 and V = 1), and based on Rcout, the unrounded overflow bit (V) is determined. The V-bit is then used to select the appropriate rounding increment determination to select S0 or S1. Finally, for the RTNE rounding mode, the LSB may need to be set to zero.

The rounding algorithm is described in pseudocode in Fig. 10. It should be noted that the particular ordering of steps described was chosen for ease of description and, in the actual hardware implementation, the order of each step is best determined by examining the specific timing paths and ensuring a balance between the upper path and lower path. For instance, the order of the round-increment selection step and normalization step can be swapped. It should also be noted that originally the SP and DP/EP rounding algorithms both used two consecutive HA rows to accommodate all rounding possibilities. However, analysis during formal verification efforts revealed that it was possible to reduce this to one HA row.

The combined DP/EP rounder performs rounding for SSE-DP, x87-SP, x87-DP, x87-EP, IP68 (for transcendental functions), and for divide and square-root operations. A block diagram of the DP/EP rounder is shown in Fig. 11. Due to the large number of different precisions that must be supported, the DP/EP rounder is split over two cycles, as it is in the AMD-K8™ processor. However, unlike the AMD-K8™ FPM, the combined DP/EP rounder is based on a compound adder rounding scheme that is more area and power efficient than the AMD-K8™ rounding scheme. The DP/EP rounding scheme is similar to the SP rounding scheme except that it is necessary to perform a right shift to prealign the rounding point to the same significance prior to the compound addition and to perform a left shift to postalign the MSB to the same significance after the compound addition. This is the cost of having to support multiple rounding points in the same data path. The second difference is that the carry-tree and sticky logic need to include the carry-out and sticky from previous iterations. The third difference is that for each target precision there is a pair of 2-1 multiplexers that are used to insert the two rounded LSBs into the correct positions within the final rounded significand. The DP/EP rounder also provides a bypass path for divide and square root to allow the compound adder to be reused for other additions, such as computing the intermediate quotient ±1 ULP, instead of adding dedicated hardware. For simplicity, Fig. 11 does not show the rounding circuitry required for divide and square root.
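
As a rough software illustration of the rounding decision described in this section, the following Python sketch rounds a 48-bit SP significand product using the compound-adder idea: the truncated significand (S0) and its increment (S1) are both available, and L, R, S, the sign, and the rounding mode select between them. The function name, the mode strings, and the plain-integer product are assumptions made for the example. The hardware operates on carry-save inputs and evaluates the V = 0 and V = 1 cases in parallel, which this sequential sketch does not model.

```python
def round_sp_product(product: int, mode: str = "rne", sign: int = 0):
    """Round a 48-bit significand product (value in [1.0, 4.0)) to 24 bits.

    Returns (significand, exponent_adjust). Hypothetical helper for
    illustration only; it is not the paper's hardware data path.
    """
    assert (1 << 46) <= product < (1 << 48)
    v = product >> 47                 # overflow bit V: product in [2.0, 4.0)
    shift = 24 if v else 23           # number of discarded low-order bits
    upper = product >> shift          # unrounded 24-bit significand
    low = product & ((1 << shift) - 1)

    l_bit = upper & 1                                  # unrounded LSB (L)
    r_bit = (low >> (shift - 1)) & 1                   # round bit (R), half ULP
    s_bit = int(low & ((1 << (shift - 1)) - 1) != 0)   # sticky bit (S)

    s0, s1 = upper, upper + 1         # compound adder: sum and sum plus one

    if mode == "rne":                 # round to nearest, ties to even
        inc = r_bit and (s_bit or l_bit)
    elif mode == "rz":                # round toward zero (truncate)
        inc = 0
    elif mode == "rp":                # round toward +infinity
        inc = (r_bit or s_bit) and sign == 0
    elif mode == "rm":                # round toward -infinity
        inc = (r_bit or s_bit) and sign == 1
    else:
        raise ValueError(f"unknown rounding mode: {mode}")

    result = s1 if inc else s0
    exp_adjust = v
    if result == 1 << 24:             # rounding carried out of the MSB
        result >>= 1
        exp_adjust += 1
    return result, exp_adjust
```

Selecting between S0 and S1 after the fact is what lets a single compound adder serve every rounding mode, in contrast to the three separate CPAs of the AMD-K7/AMD-K8 scheme described earlier.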


Fig. 10. SP rounding algorithm.

In order to fully support the IEEE-754 standard, the FPM requires some external assistance in dealing with denormals. First, the FPM assumes denormal inputs are first normalized with the exponent sufficiently extended to accommodate the normalization shift amount. In this manner, the FPM can operate directly on the operands without needing any additional normalization hardware or correction hardware. Second, in the case of denormal results, the FPM produces the normalized, unrounded result with the exponent falling out of range (below Emin) along with sticky information. This is fed to an external unit which performs denormalization and rounding according to the IEEE-754 standard. To support this system, the floating-point registers are represented as normalized numbers with an extended exponent field in the register file. The internal representation is converted from memory format during loads and to memory format during stores. This approach for handling denormals is also used in the AMD-K8™ processor.

6 RELATED MULTIPLIER ARCHITECTURES

Previous x86 FPMs have taken various forms. The Cyrix multiplier includes a 17-bit × 69-bit rectangular significand multiplier that uses radix-8 signed encoding, a signed-digit summation tree, and signed-digit redundant feedback [10]. This design is very area efficient. In contrast, the AMD-K7™/AMD-K8™ multiplier includes a fully pipelined 76-bit × 76-bit significand multiplier with a latency of four cycles and is optimized for EP operations [8]. The Intel Pentium-4 multiplier is fully pipelined for DP and takes two iterations for EP [13]. Both the AMD-K7™/AMD-K8™ multiplier and Intel Pentium-4 multiplier can execute two parallel SP (packed) multiplies every clock cycle.

Iterative FPMs have also been described in the literature. For example, Anderson et al. [17] describe an iterative tree multiplier that generates only six partial products per cycle and requires five cycles to assimilate the 56-bit multiplier significand. In [14], a dual-mode iterative FPM is described that executes an SP multiply in two clock cycles at a throughput of one multiplication per clock cycle, or a DP multiply in three clock cycles at a throughput of one multiplication per two clock cycles. The multiplier consists of a 27-bit × 53-bit tree multiplier coupled with an injection-based rounder. In [18], a single-pass fused-multiply-add (FMA) floating-point unit is compared to a dual-pass FMA floating-point unit. Both FMA units support SP and DP operations. The dual-pass FMA unit is again based on an iterative rectangular multiplier and executes an SP FMA operation in one pass and a DP FMA operation in two passes. None of these iterative designs support simultaneous (packed) SP operations. Lastly, Akkas and Schulte [23] describe an iterative FPM that supports two DP multiplies without iteration or a quadruple-precision multiply using two iterations. In this design, the quadruple-precision multiply is achieved using an iterative algorithm.

Alternative methods for achieving packed integer multiplies are described in [24] and [25], and an application to packed FMA is described in [26]. A dual-mode FPM which supports one DP multiply or two parallel SP multiplies is described in [22]. This multiplier uses radix-8 Booth encoding and handles the packed multiplies in a fashion similar to the proposed design, except that the generation and compression of partial products is performed in multiple pipeline stages and EP multiplies are not supported. The multiplier is fully


Fig. 11. Combined DP/EP rounder.

pipelined and operates without stalling. It therefore requires a full DP significand multiplier.

7 RESULTS, COMPARISON, AND TESTING

The proposed rectangular multiplier was implemented in a 65-nm SOI technology using static CMOS logic and a data-path-orientated, cell-based methodology. The cell library used consisted of typical static CMOS cells in addition to some specialized cells such as the 4-2 compressor, the Booth encoder, and the Booth multiplexer. To provide a point of comparison, a design similar to the AMD-K8™ FPM (AMD-K8™-FPM) described in [8] and [9] was also implemented with the same technology. The implementation results are shown in Tables 1, 2, and 3. The dynamic power was measured by applying random input patterns and measuring the average current using a SPICE-like circuit simulator with the transistor netlist and extracted parasitics. Both designs were measured using the same clock frequency (f_typical) and the same supply voltage (V_typical). The proposed design consumes significantly less area and dynamic power compared to the baseline design (AMD-K8™-FPM). The AMD-K8™-FPM is a highly aggressive design that is specifically targeted toward high performance. In contrast, the proposed design is intended to be a low-cost and low-power solution with similar functionality. The implementation results reflect the two different design objectives.

Functional testing was performed using a mixture of random data patterns and directed data patterns by simultaneously applying the same stimulus to the proposed iterative FPM unit and the AMD-K8™ reference FPM unit. The results from each unit were captured and compared.

A comparison of multiply instruction latencies and throughputs is given in Table 4. Performance modeling studies were performed to measure the estimated instructions per cycle (IPC) for a range of benchmarks. The AMD-K8™ performance model configured with the original AMD-K8™ FPM instruction latencies and throughputs served as the baseline model while the AMD-K8™ performance model configured with the proposed iterative FPM instruction latencies and throughputs served as the comparison model. As expected, performance studies using SSE-SP-dominated target applications demonstrated an increase in


                       TABLE 1                                                                                      TABLE 3
      Area/Power Comparison for Significand Multipliers                                        Area/Power Comparison for Entire Significand Data Path

performance compared to the baseline design. For instance, a
set of SSE-SP-dominant traces extracted from the SPEC-
CPU20061 benchmark demonstrated a range of improve-
ments from 1.1 percent to 10.5 percent relative to the baseline
design. For x87-dominant applications, there was a similar                                 of x87 instructions, demonstrated a performance loss of
decrease in performance. However, since those applications                                 2.5 percent. The x87 architecture requires that the multi-
are mainly dominated by memory throughput, the difference                                  plication be carried out in EP and then rounded to the target
was not significant on average and other microarchitectural                                precision of SP, DP, or EP. Therefore, it is necessary to
choices such as load bandwidth and instruction window size                                 perform a full EP multiply even if the operands only contain
are more important. For example, on average, the SPECfp-                                   significant bits which fit within the SP region or within the DP
20001 benchmark, which contains a significant percentage                                   region. To reduce the latency of some x87 multiplies, it is
                                                                                           possible to detect the number of significant bits in
                                                                                           the multiplier and determine if this quantity falls within the
range of SP, DP, or EP, and then only perform the multiplication to that precision. The
multiplicand does not need to be examined since it does not contribute to the number of
passes through the 76-bit × 27-bit multiplier array. For instance, if the multiplier
significand contains fewer than 28 leading significant bits, then only a single pass through
the multiplier array is required, the latency of the EP multiply is reduced from five cycles
to three cycles, and the throughput is increased from 1/3 to 1 multiply per cycle. To make
use of this feature, the instruction scheduler must either accommodate data-dependent
instruction latencies or keep track of the number of significant bits in the data. This
feature relies on the assumption that, for certain applications, the operands have SP or DP
ranges. Furthermore, if it can be arranged that the multiplier is always the operand with
fewer significant bits, then this will increase the extent to which this feature can be
used. Using this feature can recover some of the performance lost to the pipeline stalls
caused by the iterative nature of the EP multiplies.

TABLE 2
Area/Power Comparison for Rounders
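
The pass-count reduction described above can be illustrated with a short Python sketch. It is not the detection logic of the FPM, which is not described here. The function names are hypothetical, the 64-bit width is the x87 EP significand width, and the general rule of one pass per 27 significant multiplier bits is an assumption extrapolated from the single-pass case stated in the text.

```python
SLICE_BITS = 27           # multiplier bits consumed per pass (76-bit x 27-bit array)
EP_SIGNIFICAND_BITS = 64  # x87 extended-precision significand width


def significant_bits(significand: int, width: int = EP_SIGNIFICAND_BITS) -> int:
    """Count leading significant bits: the width minus the trailing zero bits."""
    if not 0 <= significand < (1 << width):
        raise ValueError("significand does not fit in the given width")
    if significand == 0:
        return 0
    # Isolate the lowest set bit; its position is the number of trailing zeros.
    trailing_zeros = (significand & -significand).bit_length() - 1
    return width - trailing_zeros


def passes_required(multiplier_significand: int) -> int:
    """Assumed rule: one pass per 27 significant multiplier bits, at least one."""
    bits = significant_bits(multiplier_significand)
    return max(1, -(-bits // SLICE_BITS))  # ceiling division


# Hypothetical EP significands holding SP-, DP-, and EP-range data.
sp_range = 0xFFFFFF << 40          # 24 significant bits
dp_range = ((1 << 53) - 1) << 11   # 53 significant bits
ep_range = (1 << 64) - 1           # 64 significant bits

print(passes_required(sp_range))   # 1
print(passes_required(dp_range))   # 2
print(passes_required(ep_range))   # 3
```

Only the multiplier operand is examined, in keeping with the observation that the multiplicand does not affect the number of passes. A scheduler that tracked this count per operand could issue the shorter EP multiply whenever a single pass suffices.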

TABLE 4
Latency/Throughput Comparison

8     CONCLUSION
This paper has presented an x86-compatible FPM that is based on a 76-bit × 27-bit
rectangular multiplier and is optimized for packed SSE-SP multiplies. The multiplier is
compared to a design with similar functionality that was optimized instead for the largest
precision. The proposed design consumes significantly less area and power while achieving
improved performance for the target applications and only slightly reduced performance for
x87-dominated applications. The rectangular multiplier also facilitates efficient algorithms
for divide and square root with a small amount of additional hardware.

ACKNOWLEDGMENTS
We would like to thank Peter Seidel for suggesting optimizations to the rounding circuitry
based on analysis derived from formal verification efforts, Albert Danysh and Eric Quinnell
for their excellent work on the multiplier and rounding circuitry implementation, Raj
Desikan for his excellent work on the performance modeling and analysis, and the anonymous
reviewers for their helpful comments.

REFERENCES
[1]   P. Ranganathan, S. Adve, and N. Jouppi, “Performance of Image and Video Processing
      with General-Purpose Processors and Media ISA Extensions,” Proc. 26th Ann. Int’l Symp.
      Computer Architecture (ISCA ’99), vol. 27, pp. 124-135, May 1999.
[2]   S.K. Raman, V. Pentkovski, and J. Keshava, “Implementing Streaming SIMD Extensions on
      the Pentium III Processor,” IEEE Micro, vol. 20, pp. 47-57, July 2000.
[3]   M.-L. Li, R. Sasanka, S. Adve, Y.-K. Chen, and E. Debes, “The ALPBench Benchmark Suite
      for Complex Multimedia Applications,” Proc. IEEE Int’l Symp. Workload Characterization
      (IISWC ’05), pp. 34-45, Oct. 2005.
[4]   H. Nguyen and L.K. John, “Exploiting SIMD Parallelism in DSP and Multimedia Algorithms
      Using the AltiVec Technology,” Proc. 13th Int’l Conf. Supercomputing (ICS ’99),
      pp. 11-20, June 1999.
[5]   Advanced Micro Devices, AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media
      Instructions, rev. 3.07 ed., Dec. 2005.
[6]   Advanced Micro Devices, AMD64 Architecture Programmer’s Manual Volume 5: 64-Bit Media
      and x87 Floating-Point Instructions, rev. 3.06 ed., Dec. 2005.
[7]   J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, ch. 2,
      third ed. Morgan Kaufmann, p. 119, May 2002.
[8]   S. Oberman, “Floating-Point Division and Square Root Algorithms and Implementation in
      the AMD-K7 Microprocessor,” Proc. 14th IEEE Symp. Computer Arithmetic (ARITH ’99),
      pp. 106-115, Apr. 1999.
[9]   C. Keltcher, K. McGrath, A. Ahmed, and P. Conway, “The AMD Opteron Processor for
      Multiprocessor Servers,” IEEE Micro, vol. 23, pp. 66-76, Mar. 2003.
[10]  W. Briggs and D. Matula, “A 17 × 69 Bit Multiply and Add Unit with Redundant Binary
      Feedback and Single Cycle Latency,” Proc. 11th IEEE Symp. Computer Arithmetic
      (ARITH ’93), pp. 163-170, July 1993.
[11]  M. Schulte, C. Lemonds, and D. Tan, “Floating-Point Division Algorithms for an x86
      Microprocessor with a Rectangular Multiplier,” Proc. IEEE Int’l Conf. Computer Design
      (ICCD ’07), pp. 304-310, Oct. 2007.
[12]  ANSI and IEEE, IEEE-754 Standard for Binary Floating-Point Arithmetic, 1985.
[13]  G. Hinton, M. Upton, D. Sager, D. Boggs, D. Carmean, P. Roussel, T. Chappell,
      T. Fletcher, M. Milshtein, M. Sprague, S. Samaan, and R. Murray, “A 0.18-μm CMOS IA-32
      Processor with a 4-GHz Integer Execution Unit,” IEEE J. Solid-State Circuits, vol. 36,
      pp. 1617-1627, Nov. 2001.
[14]  G. Even, S.M. Mueller, and P.-M. Seidel, “A Dual Mode IEEE Multiplier,” Proc. Second
      Ann. IEEE Int’l Conf. Innovative Systems in Silicon (ISIS ’97), pp. 282-289, Oct. 1997.
[15]  S. Vassiliadis, E. Schwarz, and B. Sung, “Hard-Wired Multipliers with Encoded Partial
      Products,” IEEE Trans. Computers, vol. 40, pp. 1181-1197, Nov. 1991.
[16]  A. Weinberger, “4:2 Carry-Save Adder Module,” IBM Technical Disclosure Bull., vol. 23,
      pp. 3811-3814, Jan. 1981.
[17]  S. Anderson, J. Earle, R. Goldschmidt, and D. Powers, “The IBM System/360 Model 91:
      Floating-Point Execution Unit,” IBM J. Research and Development, vol. 11, pp. 34-53,
      Jan. 1967.
[18]  R.M. Jessani and M. Putrino, “Comparison of Single- and Dual-Pass Multiply-Add Fused
      Floating-Point Units,” IEEE Trans. Computers, vol. 47, pp. 927-937, Sept. 1998.
[19]  M.R. Santoro, G. Bewick, and M. Horowitz, “Rounding Algorithms for IEEE Multipliers,”
      Proc. Ninth IEEE Symp. Computer Arithmetic (ARITH ’89), pp. 176-183, Sept. 1989.
[20]  G. Even and P.-M. Seidel, “A Comparison of Three Rounding Algorithms for IEEE
      Floating-Point Multiplication,” IEEE Trans. Computers, vol. 49, pp. 638-650, July 2000.
[21]  N.T. Quach, N. Takagi, and M. Flynn, “Systematic IEEE Rounding Method for High-Speed
      Floating-Point Multipliers,” IEEE Trans. Very Large Scale Integration (VLSI) Systems,
      vol. 12, pp. 511-521, May 2004.
[22]  A. Enriques and K. Jones, “Design of a Multi-Mode Pipelined Multiplier for
      Floating-Point Applications,” Proc. IEEE Nat’l Aerospace and Electronics Conf.
      (NAECON ’91), vol. 1, pp. 77-81, May 1991.
[23]  A. Akkas and M. Schulte, “A Quadruple Precision and Dual Double Precision
      Floating-Point Multiplier,” Proc. Euromicro Symp. Digital System Design (DSD ’03),
      pp. 76-81, Sept. 2003.
[24]  D. Tan, A. Danysh, and M. Liebelt, “Multiple-Precision Fixed-Point Vector
      Multiply-Accumulator Using Shared Segmentation,” Proc. 16th IEEE Symp. Computer
      Arithmetic (ARITH ’03), pp. 12-19, June 2003.
[25]  S. Krithivasan and M.J. Schulte, “Multiplier Architectures for Media Processing,”
      Proc. IEEE 37th Asilomar Conf. Signals, Systems, and Computers (ACSSC ’03), vol. 2,
      pp. 2193-2197, Nov. 2003.
[26]  L. Huang, L. Shen, K. Dai, and Z. Wang, “A New Architecture for Multiple-Precision
      Floating-Point Multiply-Add Fused Unit Design,” Proc. 18th IEEE Symp. Computer
      Arithmetic (ARITH ’07), pp. 69-76, June 2007.


Dimitri Tan received the BSEE degree from the University of Adelaide, Australia. He was
previously with Motorola Inc. and Freescale Semiconductor Inc., where he worked on various
microprocessor and SoC designs. He is currently with Advanced Micro Devices Inc., Austin,
Texas, working on x86 microprocessor design. His research interests include computer
architecture, computer arithmetic, and reconfigurable computing. He is a member of the IEEE.

Carl E. Lemonds received the BSEE and MSEE degrees from the University of Missouri,
Columbia. He worked in corporate R&D at Texas Instruments, where he designed arithmetic
circuits and algorithms for various DSP test chips. After a brief stint at Cyrix, he joined
Intel in 1999. At Intel, he worked on the FPU for the Tejas project (Pentium 4 class
processor). In January of 2004, he joined Advanced Micro Devices (AMD) Inc., Austin, Texas,
where he is currently a principal member of the technical staff. His interests include
computer arithmetic, floating-point, and DSP. His current research is in vector
floating-point processors. He is a senior member of the IEEE and a member of the ACM.

Michael J. Schulte received the BS degree in electrical engineering from the University of
Wisconsin-Madison and the MS and PhD degrees in electrical engineering from the University
of Texas, Austin. He is currently an associate professor at the University of
Wisconsin-Madison, where he leads the Madison Embedded Systems and Architectures Group. His
research interests include high-performance embedded processors, computer architecture,
domain-specific systems, and computer arithmetic. He is a senior member of the IEEE.
