High-Precision BLAS on FPGA-enhanced Computers

Chuan He, Guan Qin, and Richard E. Ewing
Institute for Scientific Computation
Texas A&M University, College Station, TX 77843

Wei Zhao
Rensselaer Polytechnic Institute, Troy, NY 12180

Abstract

The emergence of high-density reconfigurable hardware devices gives scientists and engineers an option to accelerate their numerical computing applications on low-cost but powerful "FPGA-enhanced computers". In this paper, we introduce our efforts towards improving the computational performance of Basic Linear Algebra Subprograms (BLAS) with FPGA-specific algorithms/methods. Our study focuses on three BLAS subroutines: floating-point summation, matrix-vector multiplication, and matrix-matrix multiplication. They represent all three levels of BLAS functionality, and their sustained computational performance is either memory-bandwidth bounded or computation bounded. By proposing the group-alignment based floating-point summation method and applying this technique to the other subroutines, we significantly improve their sustained computational performance and reduce numerical errors, with moderate FPGA resources consumed. Compared with existing FPGA-based implementations, our designs are efficient and compact, with improved numerical accuracy and stability.

1. Introduction

In the past decade, numerical computing efforts have grown rapidly with the performance improvements of general-purpose computers and the popularity of parallel computing environments. Although the nominal frequency of commodity CPUs keeps climbing, the utilization of the CPUs' peak performance by numerical computing applications is still very poor [1]. With the emergence of the new generation of large-scale FPGA devices, scientists and engineers are being presented with a new opportunity to accelerate their specific applications with low-cost but powerful "FPGA-enhanced computers". Equipped with hardware "In-System Programmability (ISP)", FPGA-enhanced computers can achieve significant performance improvements over contemporary general-purpose computers for many computationally intensive applications. In [2], the authors predicted that FPGA devices would yield three to eight times more raw computing power than commodity CPUs by 2009.

Since the early 2000s, increasing research effort has been devoted to solving numerical PDE problems on FPGA-enhanced computer platforms. Specifically, scientists did considerable work trying to accelerate the Basic Linear Algebra Subprograms (BLAS), a widely-accepted open-source numerical computing library that performs the basic linear algebra operations for this class of scientific computations. However, a large portion of BLAS subroutines (most Level-1 and Level-2 subroutines) exhibit a low ratio of computations to operands, and so are memory-bounded operations. It is then the computer's external memory bandwidth, not the computing engine itself, that decides the sustained computational performance. For the computation-bounded Level-3 BLAS subroutines, the costly implementation of double-precision floating-point arithmetic on FPGAs is the main problem. Correspondingly, most of these research works did not demonstrate exciting performance improvements of FPGA-based systems over contemporary commercial CPUs [3] [4] [5].

We believe that most commonly-used numerical methods are well-tuned for general-purpose computers, and so in general are not ideal for FPGA-enhanced computer platforms. In order to achieve better computational performance on these new computing systems, one should design and/or customize new algorithms/methods for the specific scientific/engineering problems at hand. Furthermore, these new algorithms/methods must be flexible and robust so that a wide range of commercial FPGA-based platforms can accommodate them effectively and efficiently.

In this paper, we introduce part of our recent efforts towards accelerating representative BLAS subroutines on
FPGA-enhanced computer platform by hardware-specific
algorithms/methods. Based on our previous work
regarding high-precision floating-point summation on
FPGAs [6], we concentrate here on matrix-vector and
matrix-matrix multiplication, which are kernel Level-2
and Level-3 BLAS subroutines, respectively.
Results of this work can be applied to accelerate direct
solvers for dense/sparse linear system equations.
The rest of this paper is organized as follows. In Section 2, we briefly review the group-alignment based floating-point summation method proposed in [6]. Based on this technique, we present in Sections 3 and 4 our designs for high-precision matrix-vector and matrix-matrix multiplication on FPGAs. Their FPGA-based implementations are introduced in Section 5 with corresponding performance evaluations. We also introduce in Section 6 a reference design for accelerating the iterative refinement method using the proposed FPGA-specific high-precision BLAS technique. Finally, conclusions and a discussion of future research directions are presented in Section 7.

2. Group-alignment based floating-point summation [6]

Floating-point summation is such an important operation in numerical computations that some computer scientists have even suggested importing it into general-purpose CPUs as "the fifth floating-point arithmetic" [7]. It is also a fundamental building block for other BLAS subroutines such as dot-product, matrix-vector multiplication, and matrix-matrix multiplication, so it plays a key role in numerical computations. Unlike the other basic floating-point arithmetic operations, summation is not well-conditioned because of so-called "catastrophic cancellation" [8]: small relative errors hidden in the inputs may result in a much larger relative error in the output. Also, the intrinsic data dependency among neighbouring floating-point additions seriously restricts its sustained computational performance. To make the situation worse, the order of consecutive additions also affects the final sum, which makes it impossible to produce a unique solution for the same input data set on different computer platforms with different programming languages or software compilers [9]. On the bright side, this non-uniqueness relieves us from strictly obeying the IEEE standard in our implementation; we instead take the exact solution as the only accuracy criterion.

Given a set of n floating-point numbers s1, s2, …, sn with small relative errors εM, our goal is to design a numerical algorithm, as well as its FPGA-based hardware implementation, to compute the summation S = s1 + s2 + … + sn accurately and efficiently.

Figure 1. Structure of group-alignment based floating-point summation

Instead of following existing commonly-used high-accuracy summation algorithms [10], which are all well-tuned for commercial CPUs and so are not ideal for FPGA-based systems, we proposed in [6] a group-alignment based summation method using customized floating-point/fixed-point hybrid arithmetic. The basic idea is straightforward: as shown in Figure 1, to accumulate a group of n floating-point summands, instead of performing (n−1) standard floating-point additions, in which the exponent comparisons and mantissa shifts are scattered, we collect them together to construct a single exponent comparator in Step 2 and a barrel shifter in Step 4. All summands in the same group are thereby aligned into a consistent fixed-point format, so they can be summed up with a simple fixed-point accumulator in Step 5.

Table 1. Comparisons of numerical error bounds

                  Sequential accumulation                                Group-alignment based summation
Absolute bound    e ≤ (n−1)εM · (|s1| + |s2| + … + |sn|)                 e′ ≤ (n−1)ε′M · max{|s1|, |s2|, …, |sn|}
Relative bound    e/|S| ≤ (n−1)εM · (|s1| + … + |sn|) / |s1 + … + sn|    e′/|S| ≤ (n−1)ε′M · max{|s1|, …, |sn|} / |s1 + … + sn|

Absolute and relative numerical error bounds (Table 1) were derived there using formal error analysis methods to prove that this hardware-based algorithm always results in similar, or even better, average- and worst-case absolute and relative errors than standard floating-point arithmetic. Numerical experiments were also performed to demonstrate the accuracy and scalability of this new approach.
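The five-step structure of Figure 1 can be modeled in a few lines of software. The sketch below is only an illustration of the idea, not the hardware design from [6]; the 80-bit fixed-point window width is an assumed parameter:

```python
import math

def group_alignment_sum(xs, width=80):
    """Group-alignment summation: align every summand to the largest
    exponent once, then accumulate in one wide fixed-point register."""
    # Step 2: a single comparison pass finds the common (largest) exponent.
    e_max = max((math.frexp(x)[1] for x in xs if x), default=0)
    acc = 0  # Step 5: a wide integer models the fixed-point accumulator
    for x in xs:
        m, e = math.frexp(x)         # x = m * 2**e with 0.5 <= |m| < 1
        shift = width - (e_max - e)  # Step 4: barrel-shift amount
        if shift >= 0:               # summands far below the window drop out
            acc += round(m * (1 << shift))
    # One final rounding back to a standard floating-point result.
    return acc / (1 << width) * 2.0 ** e_max

vals = [1.0, 1e-16, 1e-16, -1.0]
print(sum(vals))                  # sequential additions lose the tiny terms: 0.0
print(group_alignment_sum(vals))  # recovers a result close to 2e-16
```

Because all shifts are taken against one shared maximum exponent, the wide accumulator commits only a single rounding at the end, which is the source of the tighter bounds in Table 1.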
3. High-precision matrix-vector multiplication on FPGAs

This Level-2 BLAS subroutine is defined as

z_i = Σ_{j=1}^{n} A_ij · x_j + y_i,   i = 1, …, n          (3.1)

where A is a dense or sparse n × n matrix and x, y, and z are n × 1 vectors. Suppose all entries of A, x, y, and z are saved in main memory (which is also the case for most realistic numerical PDE problems). The subroutine consists of 2n² floating-point operations and n² + 3n memory accesses in total. Their ratio is nearly two, which reveals the memory-bandwidth-bounded nature of this subroutine.

When executing on general-purpose computers, the observable sustained computational performance of Level-2 BLAS subroutines reaches only 30~50 percent of the theoretical values [1], which are bounded by in-system memory bandwidth. In other words, a considerable portion of the physical memory bandwidth is wasted because of the inefficiency of the memory subsystem, which normally consists of a hierarchy of data/instruction caches with fixed eviction policies. Comparatively, the utilization of external memory bandwidth for a specific problem can always be improved on an FPGA-based system by exploiting the FPGA's abundant in-chip programmable memory resources such as RAM blocks and distributed registers.

Recently, scientists proposed several FPGA-based schemes [3] [4] to accelerate matrix-vector multiplication by constructing appropriate buffering/caching subsystems. Although considerable performance improvements were achieved in those works compared with the same subroutine running on general-purpose computers, their results are still not encouraging. Specifically, the external memory bandwidth of a typical FPGA-based system can generally support a data transfer rate of millions of words per second, which is at the same level as the data throughput of a pipelined floating-point arithmetic unit. So a few parallel-running arithmetic units can saturate all available memory bandwidth and leave considerable on-board FPGA resources unused.

By calculating the matrix-vector multiplication in high precision, we try to put more on-board FPGA resources into play so that our computations benefit in an indirect way. After decomposing matrix A into n row vectors, the matrix-vector multiply can be treated as n dedicated vector dot-products. The group-alignment based floating-point summation unit we proposed in [6] could easily be applied here to construct a dot-product unit by introducing a simplified standard floating-point multiplier at the front end. This multiplier accepts two entries from matrix A and vector x at each clock cycle, normalizes them, and multiplies their mantissas. The product and the sum of the exponents are then fed to the input of the following group-alignment based summation unit, with all of the multiplier's post-processing stages eliminated.

In order to obtain a high-precision result, we cannot simply round those product terms to standard floating-point format. To reveal the problem, let us consider a simple case: the dot-product of two two-element vectors a = [a1, a2]^T and b = [b1, b2]^T. Using the formal numerical error analysis method [14], we have:

fl(fl(a1·b1) + fl(a2·b2))
  = fl((a1·b1)(1+ε1) + (a2·b2)(1+ε2))
  = ((a1·b1)(1+ε1) + (a2·b2)(1+ε2))(1+ε3)
  = a1·b1 + a2·b2 + (a1·b1)ε1 + (a2·b2)ε2 + (a1·b1)ε3 + (a2·b2)ε3 + (a1·b1)ε1ε3 + (a2·b2)ε2ε3
  ≈ a1·b1 + a2·b2 + (a1·b1)(ε1 + ε3) + (a2·b2)(ε2 + ε3)          (3.2)

After ignoring the high-order rounding error terms in this equation, we observe that the rounding errors produced by the multiplications (ε1, ε2) have magnitudes similar to that of the addition error (ε3), and so should also be taken into consideration for an accurate solution. All effective bits of the input mantissa products should be kept so that the numerical errors produced by the multiplications are removed. In the meantime, the word-widths of the barrel shifter and the fixed-point accumulator in the group-alignment based summation unit should be extended correspondingly for better numerical error bounds.

Comparatively, high-precision floating-point computations are very difficult to achieve on general-purpose computers because most commodity CPUs provide only standard-precision floating-point arithmetic. One has to emulate high-precision arithmetic in software; otherwise the results are at the risk of rounding errors coming from both the floating-point additions and the multiplications. Some CPUs do provide a so-called "Fused Multiply-Add (FMA)" unit/instruction, where a floating-point accumulator is placed adjacent to the multiplier to eliminate the rounding errors introduced by multiplications. However, most software compilers do not yet support this function because it tends to complicate instruction scheduling and may eventually slow down the execution.

Although this new high-precision approach doesn't alleviate the memory bandwidth bottleneck, it can help us evaluate z_i to higher accuracy at nearly the same speed as conventional approaches.
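The danger described by Equation (3.2) is easy to reproduce in software. In the sketch below (our illustration; the values are chosen artificially so that the two products nearly cancel), rounding each product to double precision before the addition destroys the answer, while exact rational arithmetic keeps it:

```python
from fractions import Fraction

# Two products that nearly cancel: a1*b1 = 1 - 2**-60 exactly,
# but storing it as a double rounds it up to 1.0 (an eps1-sized error).
a1, b1 = 1.0 + 2.0 ** -30, 1.0 - 2.0 ** -30
a2, b2 = -1.0, 1.0

rounded = a1 * b1 + a2 * b2  # products rounded before the addition
exact = Fraction(a1) * Fraction(b1) + Fraction(a2) * Fraction(b2)

print(rounded)       # 0.0 -- the true result is wiped out by eps1
print(float(exact))  # -2**-60 (about -8.7e-19)
```

This is exactly why the dot-product unit keeps all effective bits of the mantissa products instead of rounding them to the working format.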
High-precision arithmetic is always preferable for matrix-vector multiplication when matrix A is ill-conditioned. Scientists [11] have proposed pure software methods to achieve higher-precision (double-extended, double-double, or quadruple precision) BLAS subroutines using only standard floating-point arithmetic (single or double precision), at the cost of a 4~10 times slowdown. From this point of view, our new approach significantly improves the computational performance of high-precision matrix-vector multiplication.

Figure 2. The block diagram and data flow for matrix-vector multiplication in row order

Because the memory bandwidth limitation is always present, we need to construct an appropriate data buffering system to attain the most from the limited resources. The data dependency existing in the summations of consecutive products is the only issue that needs to be addressed by the buffering system. Starting from the simple case where all matrix entries are stored in external memory but at least one input/output vector can be accommodated in the FPGA's in-chip RAM blocks, we have two choices for constructing the data buffering system. The first is to calculate the matrix-vector multiplication in row order, just as Equation (3.1) implies. As shown in Figure 2, matrix entries are read into the FPGA row by row and multiplied by entries from the input vector x, whose elements are all buffered inside the FPGA's RAM blocks. Multiple RAM blocks can support multiple operand accesses from x at each clock cycle, so that a number of multipliers can work in parallel if the accumulative external memory bandwidth is wide enough to support simultaneous accesses of the same number of operands from the matrix. Please note that all effective bits of the input mantissa products (48 bits for single-precision multiplication and 106 bits for double-precision) are kept, and the word-widths of the barrel shifter and the fixed-point accumulator in the group-alignment based summation unit are extended correspondingly for high-precision results.

Algorithm 1: Matrix-Vector Multiply in Row Order
Input: n × (n+1) dense matrix [y; A] (an augmented matrix with the column vector y attached to the left of matrix A) saved in external memory;
       (n+1) × 1 vector [1; x] saved in s SRAM blocks inside the FPGA device; block number i contains vector entries i, i+s, i+2s, …, i + ((n+1)/s − 1)·s
Output: z = A·x + y
For i = 1 to n
    For j = 1 to (n+1)/s
        Read in matrix entries A_{i,(j−1)·s+1 : j·s} from external memory
        For k = 1 to s do parallel
            p_k = A_{i,(j−1)·s+k} · x_{(j−1)·s+k}
        End For
        q_j = Σ_{k=1}^{s} p_k
    End For
    z_i = Σ_{j=1}^{(n+1)/s} q_j
End For

The group-alignment based summation technique is applied twice here to address the intrinsic data dependency problem: once in parallel [12], followed by once in sequence [6]. Although a little costly, this approach does simplify the design of the buffering circuit as well as the internal data flow: memory accesses to matrix A and vector x are always sequential, the entries of the output z are also generated in order, and there are no local data feed-back paths in this design. Indeed, we can easily combine these two summation units into a single one. The resulting parallel-sequential group-alignment based summation unit can accept s summands at each clock cycle, align them with the same number of parallel-working mantissa shifters, and then accumulate the aligned mantissas using an s-input fixed-point binary adder tree followed by a single fixed-point accumulator. Although only limited hardware resources are saved by the combination, this new approach effectively enlarges the summation group size, and so benefits the final precision. The corresponding FPGA-specific matrix-vector multiplication method is listed in Algorithm 1.

By using s independent sequential group-alignment based summation units to evaluate s vector dot-products in parallel, another matrix-vector multiplication in row order scheme is also feasible here. Indeed, this is the conventional approach we observed in others' work [3] [4], except that there multiple standard floating-point adders were adopted for accumulation instead of the group-alignment based summation method. Obviously, this scheme would be much more expensive than our approach, so it is ignored in this work.
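For reference, the control structure of Algorithm 1 can be sketched in plain software as follows. This is only a behavioral model of ours: the inner comprehension stands for the s parallel multipliers, and math.fsum stands in for the exact parallel and sequential group-alignment summation units:

```python
import math

def matvec_row_order(A, x, y, s=4):
    """Row-order z = A*x + y, structured like Algorithm 1: each augmented
    row is consumed in strips of s products; the strip sums q_j are then
    accumulated into z_i."""
    n = len(A)
    z = []
    for i in range(n):
        row = [y[i]] + A[i]        # augmented row [y_i, A_i1, ..., A_in]
        vec = [1.0] + x            # augmented vector [1, x_1, ..., x_n]
        q = []
        for j in range(0, n + 1, s):
            p = [row[j + k] * vec[j + k]        # s parallel products
                 for k in range(min(s, n + 1 - j))]
            q.append(math.fsum(p))              # parallel summation unit
        z.append(math.fsum(q))                  # sequential summation unit
    return z

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = [0.5, 0.5]
print(matvec_row_order(A, x, y))  # [3.5, 7.5]
```

Note how the model mirrors the hardware property claimed above: accesses to A and x are purely sequential and there is no feed-back path between iterations.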

                                                                                                          A                                         x      y

                                                                                                          External Memory Space
                                                                  Figure 4. The block diagram and data flow for
                                                                  matrix-vector multiplication in column order

                                                                  Algorithm 3. Matrix-vector Multiply in column order
                                                                  Input: n × n matrix A saved in external memory
Figure 3. Blocked matrix-vector multiplication
                                                                           n × 1 vector x saved in external memory
The other data buffering scheme is based on the fact that                  n × 1 vector z are accommodated inside FPGA’s
matrix-vector multiplication can also be represented as                   in-chip RAM blocks and initialized with
the summation of n scaled column vectors of matrix A as                   corresponding y entries
the following equation shows:                                     Output: z = A ⋅ x + y
                   n −1                                           Set all entries in vector y to be zero
z = A ⋅ x + y = ∑ Aj ⋅ x j + y                           (3.3)    For j = 1 to n/s
                   j =0
                                                                      Read in vector entries x( j −1)∗s +1, j ∗s
Figure 3 demonstrates the computational scheme of this
blocked matrix-vector multiplication in column order.                For i = 1 to n
The block diagram and data flow of the corresponding                    Read in matrix entries Ai , ( j −1 )∗ s +1: j ∗ s
FPGA-based hardware implementation is depicted in                       For k =1 to s do parallel
Figure 4. Here, all entries of the resulting vector z are
accommodated inside FPGA’s in-chip RAM blocks and
                                                                                 p k = Ai ,( j −1)∗s + k ⋅ x( j −1)∗s + k
initialized as y. The input matrix A is stripped into blocks             End For
of size n × s , and s is selected carefully so that all entries                                       s

in a block row can be accessed simultaneously from                        zi = zi + ∑ pk
external memory. Correspondingly, there are also s                                                 k =1

parallel-running multipliers. All of those product values            End For
generated at the same clock cycle together with the               End For
relevant partial sum value read from output vector z are
summed up to update the same z entry. The troublesome             Compared with the previous approach, only one (s+1)-
parallel group-alignment based summation unit is               subroutine. For the ideal case where FPGA’s internal
required. However, one local data feed-back loop is            RAM blocks could accommodate all matrix elements of
introduced in order to update the output vector. This is                                     ( )
                                                               A, B, and C, there would be 3n 2 memory accesses and
not an issue here because FPGA’s in-chip SRAM blocks
consist of circuitry to support two independent access         ( ) 3
                                                                2n floating-point operations in total. So the ratio
ports.                                                         between external memory accesses and floating-point
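A matching behavioral sketch of Algorithm 3 (again ours, with math.fsum modeling the (s+1)-parallel group-alignment unit) makes the scheduling property visible: consecutive updates always target different z entries:

```python
import math

def matvec_column_order(A, x, y, s=2):
    """Column-order z = A*x + y, structured like Algorithm 3: x is
    streamed in strips of s entries; every row then updates its own z_i,
    so the feed-back loop never stalls on the same partial sum."""
    n = len(A)
    z = list(y)                    # z buffered in-chip, initialized with y
    for j in range(0, n, s):       # one strip of s vector entries
        xs = x[j:j + s]
        for i in range(n):
            ps = [A[i][j + k] * xs[k] for k in range(len(xs))]
            z[i] += math.fsum(ps)  # models the (s+1)-way group summation
    return z

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = [0.5, 0.5]
print(matvec_column_order(A, x, y))  # [3.5, 7.5]
```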
As we mentioned above, these two matrix-vector multiplication approaches both assume that the FPGA's in-chip RAM blocks can accommodate at least one input/output vector. For large matrix-vector multiplication problems in reality, where the input/output vector contains too many entries to be buffered inside the FPGA device, we are forced to save all vectors together with the matrix in external SDRAM space. (Sometimes we may have external SRAM modules integrated on board, which can be used to save the vector(s).) The main objective of the data buffering design now is to hide the access lags caused by external SDRAM modules and attain the most from the limited external memory bandwidth.

The second approach can easily be extended to deal with such cases using a stripped matrix-vector multiplication scheme similar to the one shown in Figure 3. Every partial matrix-vector multiplication related to one matrix strip requires two accesses to the output vector in order to have all of its entries updated. Correspondingly, the output vector z has to be accessed repeatedly from/to external memory, which would result in considerable performance degradation if the width of each matrix strip were narrow. For example, if we had only 4 columns in each matrix strip, an additional 50 percent of memory accesses would be introduced, and so the sustained performance of the hardwired computing engine would be degraded by one third. Adopting a wide matrix strip would amortize this overhead and significantly improve the situation. However, the more matrix strip row entries we use, the larger the resulting parallel summation unit. An effective alternative is to construct an internal RAM-block-based input buffer to accommodate up to tens of thousands of vector x entries; we then adopt the parallel-sequential group-alignment based summation unit mentioned above to accumulate the partial sums row by row, and strip by strip.

4. High-precision matrix multiplication on FPGAs

Given two n × n dense matrices A and B, the operations of the matrix-matrix multiply C = A·B are defined as:

C_ij = Σ_{k=1}^{n} A_ik · B_kj,   i, j = 1, …, n          (4.1)

This Level-3 BLAS subroutine is a typical computation-bounded subroutine. For the ideal case where the FPGA's internal RAM blocks could accommodate all matrix elements of A, B, and C, there would be 3n² memory accesses and 2n³ floating-point operations in total. So the ratio between floating-point operations and external memory accesses is proportional to n, which reveals excellent data reusability, especially when n is large. In theory, such computation-bounded subroutines can always put on-board FPGA devices into full play, and so achieve sustained computational performance comparable to contemporary commodity CPUs.

In reality, only a small portion of the matrix entries can be saved in-chip, so a blocked matrix-matrix multiplication scheme is always the most popular choice, although it may introduce considerable additional external memory accesses to deal with intermediate results. For example, if there were only enough in-chip memory space for buffering three s × s blocks of matrix entries, we would need to read in 2s² matrix entries from A and B for every 2s³ + s² floating-point computations. (Here we purposely ignore the memory accesses for saving the resulting matrix elements of C because this workload would be amortized if the matrix size were large.) So the total number of external memory accesses of this subroutine is (n/s)³ · 2s² = 2n³/s.

A successful design of matrix-matrix multiply on an FPGA-based platform implies building an appropriate internal data buffering subsystem utilizing the FPGA's abundant in-chip SRAM blocks and distributed registers. Fully exploiting the data/computation locality of the subroutine is pivotal to success. Several research works have discussed FPGA-based implementations of matrix-matrix multiplication. In [3], the authors mainly concentrated on theoretical peak performance analysis, not on detailed implementation. In [13], the authors designed a two-dimensional processor mesh based on the conventional systolic matrix-matrix multiply hardware algorithm. Each processor node has its own multiply-accumulate (MAC) unit together with local memory space, and is responsible for the computations of a consecutive block of matrix C. Elements of matrices A and B are fed into the computing engine from the boundary nodes and traverse the internal processors via in-chip interconnection paths. The disadvantage of this design is that all boundary processing nodes import/export operands from/to the outside world, so that the number of input/output ports it requires would
                                                               be large. In [4], the authors proposed a one-dimensional
         k =0
                                                               processor array to address this problem. Only the first
In contrast to Level-1 and Level-2 BLAS subroutines,
                                                               processing node read matrix A and B from external
matrix multiplication is a computation-bounded
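The access count above is easy to sanity-check in software. The following Python sketch (illustrative only; the function name and the choices of n and s are ours) walks the three block loops of the blocked algorithm, charges 2s^2 external reads per block pair, and confirms the closed form 2n^3/s:

```python
def external_reads(n, s):
    """Count external memory reads for blocked C = A*B when only
    three s-by-s blocks (one each of A, B, C) fit on chip."""
    assert n % s == 0
    nb = n // s                      # blocks per matrix dimension
    reads = 0
    for i in range(nb):              # block row of C
        for j in range(nb):          # block column of C
            for k in range(nb):      # inner block index
                reads += 2 * s * s   # fetch one A block and one B block
    return reads

# Matches the closed form (n/s)^3 * 2s^2 = 2n^3 / s from the text.
for n, s in [(256, 16), (512, 16), (512, 64)]:
    assert external_reads(n, s) == 2 * n**3 // s
```

Note how the count depends only on n and s: doubling the block size halves the external traffic, which is the motivation for the two-level caching scheme described later.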
memory. These values then traverse all inner nodes in the      small RAM pieces constitute the first level cache for
linear array to participate in all related computations.       matrix A, which is constructed with multiple dual-port in-
Once the procedure is finished, each processing node           chip SRAM blocks. One set of memory ports of these
simply transfers its local results of matrix C to its          SRAM blocks are used for providing operands to 16
neighbors. These results are finally written back to           multipliers simultaneously; the other set of memory ports
external memory via the first processing node. One             are used for updating matrix entries and are
common problem of these designs is that they all require       interconnected as cascaded FIFO with 16X16 levels. This
complex control logics for coordinating communications         input cache contains two independent 256-word matrix
and interactions among multiple processing nodes.              entry buffers, each of which can accept an entire matrix
                                                               block and working in swapping manner to hide the data
Correspondingly, the computational performance
                                                               refreshing cycles. In refresh mode, entries of a matrix A
demonstrated in these works are all less than or similar to
                                                               block are read out from the second level cache in column
contemporary commodity CPUs.
                                                               order and pushed into the FIFO from its input port at the
In our work, we proposed a new solution for matrix-            bottom. So we need a total of 256 push cycles to update
matrix multiplication, which can achieve much more             all matrix block entries. In computation mode, 16
compact hardware implementation on FPGA-enhanced               commonly-addressed matrix entries that come from the
computer platform. This new matrix-matrix multiplication       same block row, are accessed simultaneously and fed to
unit doesn’t employ multiple processing nodes, but does        16 multipliers at each clock cycle. The accesses to one
use a large computing engine with an array of standard         matrix block will repeat for 16 rounds, with 16 clock
floating-point multipliers followed by one parallel group-     cycles in each round. So in total, the computation mode
alignment based summation unit. A two-level data               also last for 256 cycles.
buffering subsystem is designed to read operands of
matrix A and B from external memory, caching them in
local memory, and distributing them to arithmetic units in
correct sequence. Two contrary requirements are
addressed satisfactorily by the data buffering scheme:
First, the block size s is required to be large enough to
ensure the computation-bounded nature of the underlying
blocked matrix-matrix multiplication. Second, because of
the parallel running of multiple arithmetic units, there are
concurrent accesses to multiple operands in one data
block. Obviously, a large-capacity RAM block provides
limited number of read/write ports, and so is inapplicable
We demonstrate our design by a 16X16 block matrix-
matrix multiplication computing engine shown in Figure
5. In this design, because s is set to 16, each matrix block
would have 16X16=256 double precision entries. The
fully-pipelined computing engine contains 16 modified
floating-point multipliers and a 17-parallel summation
unit, and so can finish 16 multiplications and 16 additions
at each clock cycle. By varying the value of s, we can
easily change the size of the computing engine as well as      Figure 5. The block diagram and data flow for
its computational performance.                                 blocked matrix-matrix multiplication
Multiple matrix C blocks are accommodated inside the in-
chip output buffer, which can be efficiently constructed       Another 16-level cascaded FIFO is employed for the first-
with FPGA’s in-chip SRAM blocks. This output buffer            level caching of 16 matrix B entries (one column of a
circuit has two independent data-accessing ports: One is       matrix B block) with two sets of 16 dedicated one-double-
connected to an input port of the parallel summation unit      word registers. It also works in swapping manner to
for feeding in previous partial sums of C entries; the other   overlap the refreshing and computation cycles. But its
is for writing back those updated partial-sums. Concurrent     switching speed is 16 times faster than matrix A caching
read and write addresses to the same memory space              circuit. Once updated, the operands remain unchanged
always have a fixed distance, and so would not introduce       during the next 16 clock cycles providing another group
any conflicts.                                                 of matrix B entries to multipliers. All 16 products together
A two-level caching circuit is employed to efficiently         with the old partial-sum read from the output buffer are
buffer operands read from matrix A and B. 16 dedicated         then fed into the group-alignment based parallel
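The buffering schedule described above can be modeled behaviorally. The Python sketch below (our own naming; a functional model of the schedule, not of the hardware) holds one B column in registers for 16 cycles while stepping through the 16 rows of the A block, so that each simulated clock cycle feeds 16 products plus the old partial sum of one C entry into a 17-input summation:

```python
S = 16  # block size of the computing engine

def block_multiply_accumulate(C, A, B):
    """Cycle-level sketch of the 16x16 engine: each clock cycle forms 16
    products A[i][k]*B[k][j] and feeds them, together with the old partial
    sum C[i][j], into a 17-input summation unit.  One B column is held in
    registers for 16 cycles; the A cache serves one block row per cycle."""
    for j in range(S):            # B column registers refresh every 16 cycles
        b_col = [B[k][j] for k in range(S)]
        for i in range(S):        # one clock cycle per C entry
            products = [A[i][k] * b_col[k] for k in range(S)]
            C[i][j] = sum(products) + C[i][j]   # 17 summands in hardware
    # 16 x 16 = 256 cycles in total, matching the cache refresh period

# Check against a direct triple loop on small integer blocks.
A = [[i + 2 * k for k in range(S)] for i in range(S)]
B = [[3 * k - j for j in range(S)] for k in range(S)]
C = [[1] * S for _ in range(S)]
ref = [[1 + sum(A[i][k] * B[k][j] for k in range(S)) for j in range(S)]
       for i in range(S)]
block_multiply_accumulate(C, A, B)
assert C == ref
```

Using exact integer blocks, the schedule reproduces the reference triple-loop product, which is the property the hardware pipeline must preserve.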
summation unit simultaneously as a group of summands. The summation result is written back to the output buffer via the other data port. No data/computation dependency exists in this implementation.

As introduced above, the computing engine needs 256 clock cycles to finish the multiplication of two 16X16 matrix blocks. During the same time interval, the 256-entry first-level matrix A cache updates its contents once, and the contents of the 16-entry matrix B cache are refreshed 16 times. In order to keep the computing engine operating at 200MHz, we need a memory channel that can afford a data transfer rate of 400M double words per second (3.2GByte/s), which corresponds to the memory bandwidth provided by a 400MHz DDR-SDRAM module. In this design, we introduced another level of large-capacity data cache to buffer multiple matrix blocks in-core. A simple block scheduling circuit is employed to coordinate the multiplication of large matrices composed of multiple 16X16 blocks; we simply follow the ordinary blocked matrix-matrix multiply algorithm shown in Figure 6. The data buffering scheme adopted here ensures that the entire block a2 and the first column of block b3 are ready in-core immediately after the multiplication of a1 ⋅ b1, so that the computation of c1 ⇐ c1 + a2 ⋅ b3 can start without any pipeline stalls. Once the final results of a block of matrix C are obtained, we have to save them back to external memory. If the matrices are large enough, this overhead is considerably amortized.

Figure 6. Blocked matrix multiplication

From a perspective outside of the buffering system, the block size of the matrix multiplication is now enlarged because of the existence of the level-2 cache. Correspondingly, the requirement for external memory bandwidth decreases proportionally. For example, it is straightforward to build an in-chip second-level cache holding 8192 matrix A or B entries. The effective block size then becomes 64, and we need only 800MByte/s of external memory bandwidth to keep the computing engine operating at its full speed, which is four times less than without the level-2 cache.

5. FPGA implementation

In [6], we designed a sequential group-alignment based summation unit on an entry-level Virtex II Pro evaluation board [14]. We compared the performance of the new design with two other approaches proposed in [15] and [16]. Results show that the new group-alignment based summation method provides much higher sustained computational performance, lower FPGA resource consumption, and shorter pipelining latency than the others. Furthermore, adopting more fraction bits in the unfolded mantissa shifter as well as in the fixed-point accumulator consumes negligible extra FPGA resources, but can significantly improve the numerical error bounds of the summation.

For the matrix-vector multiplication, because of the memory bandwidth bottleneck, all previous designs [3] [4] can saturate the on-board external memory bandwidth easily, but leave considerable FPGA resources unused. The adoption of the group-alignment based summation unit helps us improve numerical accuracy significantly by utilizing those unused hardware resources. In other words, our approach provides users with an effective way to harness otherwise idle on-board FPGA resources for useful work. In the next section, we will use a simple example to demonstrate the benefit of high-precision floating-point arithmetic for the solution of linear systems of equations.

Because of the lack of a powerful FPGA development platform, we can only evaluate the performance of our new matrix-matrix multiplication computing engine based on synthesis results reported by Xilinx ISE. We chose the Xilinx Virtex II Pro XC2VP50-7 chip as our target FPGA device. The 16X16 blocked matrix multiplication unit (as shown in Figure 5) consumes 17256 logic slices, 144 18X18 hardwired multipliers, and 64 RAM blocks. The clock rate applied to the computing engine is 200 MHz, so the aggregate computational performance is 6.4 GFLOPS. The equivalent matrix block size is 64; correspondingly, 800MByte/s of external memory bandwidth is required to keep the computing engine operating at its full speed. Of course, larger-capacity caching circuits could easily be built using in-chip RAM blocks or on-board external SRAM modules, depending on the hardware resources at hand; the requirement for external memory bandwidth would then reduce proportionally.

6. Application: Accurate iterative refinement

In this section, we use a simple example to demonstrate the benefits of applying our FPGA-specific high-precision BLAS subroutines to enhance the iterative refinement method for accurate solutions of linear systems of equations. The linear system we consider here is:

   L_n H_n x = L_n e_n                  (6.1)
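A system of this form can be reproduced in software. The sketch below (Python, standard library only; all function names are ours) builds the integer matrix L_n H_n exactly, factors it once in double precision, and then refines the solution using an exact rational residual as a software stand-in for the quadruple-precision BLAS subroutines. As a slight simplification of (6.1), the right-hand side is chosen as the row sums of A, so that the exact solution is the all-ones vector:

```python
from fractions import Fraction
from math import lcm

def scaled_hilbert(n):
    # Row i of the Hilbert matrix scaled by lcm(i, ..., i+n-1), so that
    # A = L_n * H_n has exact, rounding-error-free integer entries.
    A = []
    for i in range(1, n + 1):
        s = lcm(*range(i, i + n))
        A.append([Fraction(s, i + j - 1) for j in range(1, n + 1)])
    return A

def lu_factor(M):
    # Double-precision LU with partial pivoting, factors stored in-place.
    n = len(M)
    A = [[float(x) for x in row] for row in M]
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        piv[k], piv[p] = piv[p], piv[k]
        for r in range(k + 1, n):
            A[r][k] /= A[k][k]
            for c in range(k + 1, n):
                A[r][c] -= A[r][k] * A[k][c]
    return A, piv

def lu_solve(A, piv, b):
    # Reuse the existing factorization, as in the refinement algorithm.
    n = len(A)
    x = [float(b[p]) for p in piv]
    for k in range(n):                        # forward substitution
        for r in range(k + 1, n):
            x[r] -= A[r][k] * x[k]
    for k in reversed(range(n)):              # backward substitution
        x[k] = (x[k] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]
    return x

n = 6
A = scaled_hilbert(n)
b = [sum(row) for row in A]        # exact b for the all-ones solution
lu, piv = lu_factor(A)
x = lu_solve(lu, piv, b)
for _ in range(5):                 # refinement with an exact residual
    r = [b[i] - sum(A[i][j] * Fraction(x[j]) for j in range(n))
         for i in range(n)]
    d = lu_solve(lu, piv, r)
    x = [xi + di for xi, di in zip(x, d)]
assert max(abs(xi - 1.0) for xi in x) < 1e-13
```

Here exact rational arithmetic plays the role of the high-precision residual computation; with the residual evaluated only in doubles, the corrections stop improving well before this accuracy is reached.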
where H_n is the n-dimensional Hilbert matrix, whose entries are h_ij = 1/(i + j − 1); e_n is an n-dimensional unit vector; and L_n is a diagonal matrix that scales all entries of H_n to rounding-error-free integers. The Hilbert matrices are chosen because their condition numbers increase rapidly with the matrix dimension [17], so the corresponding linear systems are very sensitive to numerical rounding errors. Also, because the inverse of the Hilbert matrix is known analytically, the exact solution of the linear system can easily be computed and used as the reference.

Algorithm 4: Iterative refinement for accurate solutions of an ill-conditioned linear system Ax = b
Input: A = L_n H_n, b = L_n e_n
Output: the solution x
Find the LU decomposition of matrix A: A = LU
Repeat
   Compute the residual r = b − Ax
   Compute the 2-norm of r
   Solve Ad = r for d using the existing LU factorization
   Update the solution x = x + d
Until r is small enough, stops decreasing, or the number of iterations exceeds a threshold

Algorithm 4 is the conventional iterative refinement method. Its computational complexity is dominated by the LU decomposition, which is a computation-bounded numerical subroutine and can easily be accelerated utilizing the FPGA's high computational potential. In order to solve ill-conditioned linear systems accurately, this algorithm requires high-precision floating-point arithmetic in its inner loops. Because there are no hardwired quadruple or double-double precision arithmetic units available inside commodity CPUs, the computational performance of this algorithm is in general very poor when running on general-purpose computers.

In our implementation, we apply our FPGA-specific high-precision BLAS subroutines to improve the performance of the iterative refinement method. Specifically, quadruple precision is adopted in the computations of the LU decomposition, matrix-vector multiplication, vector norm, and forward/backward substitution. The corresponding high-precision results are then rounded into the standard double-precision floating-point format. We can observe in Table 2 that:

•  The condition numbers of the Hilbert matrices increase rapidly with the matrix dimension. Correspondingly, the linear systems are ill-conditioned and very sensitive to numerical rounding errors.
•  The conventional iterative refinement method doesn't work well for these problems. Even using double-precision floating-point arithmetic, the iterations finally stagnate with significant residual errors.
•  Our FPGA-specific high-precision BLAS subroutines work well for all listed problems. In most cases, the executions terminate within a couple of iterations, and the residual numerical errors are below the unit rounding error introduced by the floating-point representation.

Table 2. Convergence comparison between conventional iterative refinement using double-precision arithmetic and the FPGA-specific high-precision BLAS subroutines

n                             5       6       7       8       9       10      11
Condition number of H_n       4.8e5   1.5e7   4.8e8   1.5e10  4.9e11  1.6e13  5.2e14
Relative error of x
(conventional, double)        1e-12   1e-11   1e-10   1e-8    1e-7    1e-5    1e-3
Relative error of x-accurate
(iterations)                  0(2)    0(2)    0(2)    0(2)    0(2)    0(3)    0(4)

7. Conclusion

In this paper, we propose a series of group-alignment based algorithms/methods to improve the evaluation of representative BLAS subroutines on FPGA-enhanced computers. Compared with previous designs, our new FPGA-specific implementations are powerful, efficient, and compact, with improved numerical accuracy and stability.
Future work will concentrate on designing FPGA-specific high-precision arithmetic/function units and analyzing their positive effects on ill-conditioned numerical computing applications. We hope that high precision will also lead to more efficient implementations on FPGA-enhanced computers, so that the numerical computation of scientific and engineering problems benefits in an indirect way.

References

[1] A. J. van der Steen, The EuroBen Throughput Benchmark Framework, Utrecht University, 2006.
[2] K. D. Underwood, FPGAs vs. CPUs: Trends in Peak Floating-Point Performance, in Proceedings of the ACM/SIGDA 12th International Symposium on FPGA, 171-180, 2004.
[3] L. Zhuo, V. K. Prasanna, High Performance Linear Algebra Operations on Reconfigurable Systems, in Proceedings of Supercomputing (SC), 2005.
[4] K. D. Underwood, K. S. Hemmert, Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance, in Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2004.
[5] M. C. Smith, J. S. Vetter, and S. R. Alam, Scientific Computing Beyond CPUs: FPGA Implementations of Common Scientific Kernels, MAPLD, 2005.
[6] C. He, G. Qin, M. Lu, and W. Zhao, Accurate Floating-Point Summation with Group-Alignment Technique on FPGAs, The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2006.
[7] U. Kulisch, The Fifth Floating-Point Operation for Top-Performance Computers, Universität Karlsruhe.
[8] D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, 23(1): 5-48, 1991.
[9] N. J. Higham, The Accuracy of Floating Point Summation, SIAM Journal on Scientific Computing, 14: 783-799, 1993.
[10] D. Priest, Differences Among IEEE 754 Implementations.
[11] X. S. Li, J. W. Demmel, D. H. Bailey, G. Henry, et al., Design, Implementation and Testing of Extended and Mixed Precision BLAS, ACM Transactions on Mathematical Software, 28(2): 152-205, 2002.
[12] C. He, G. Qin, M. Lu, W. Zhao, An Efficient Implementation of High-Accuracy Finite Difference Computing Engine on FPGAs, The IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2006.
[13] J. Jang, S. Choi, and V. K. Prasanna, Area and Time Efficient Implementation of Matrix Multiplication on FPGAs, 2002.
[15] L. Zhuo, G. R. Morris, V. K. Prasanna, Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores, 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2005.
[16] Z. Luo and M. Martonosi, Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques, IEEE Transactions on Computers, 49(3): 208-218, 2000.
[17] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 2002.
