
High-Precision BLAS on FPGA-enhanced Computers

Chuan He, Guan Qin, and Richard E. Ewing
Institute for Scientific Computation, Texas A&M University, College Station, TX 77843
chuanhe@ece.tamu.edu, guan.qin@tamu.edu, Richard-ewing@tamu.edu

Wei Zhao
Rensselaer Polytechnic Institute, Troy, NY 12180
zhaow3@rpi.edu

Abstract

The emergence of high-density reconfigurable hardware devices gives scientists and engineers the option of accelerating their numerical computing applications on low-cost but powerful "FPGA-enhanced computers". In this paper, we introduce our efforts toward improving the computational performance of the Basic Linear Algebra Subprograms (BLAS) with FPGA-specific algorithms and methods. Our study focuses on three BLAS subroutines: floating-point summation, matrix-vector multiplication, and matrix-matrix multiplication. They represent all three levels of BLAS functionality, and their sustained computational performance is either memory-bandwidth bounded or computation bounded. By proposing the group-alignment based floating-point summation method and applying this technique to the other subroutines, we significantly improve their sustained computational performance and reduce numerical errors while consuming only moderate FPGA resources. Compared with existing FPGA-based implementations, our designs are efficient and compact, with improved numerical accuracy and stability.

1. Introduction

In the past decade, numerical computing has grown rapidly with the performance improvements of general-purpose computers and the popularity of parallel computing environments. Although the nominal frequency of commodity CPUs keeps skyrocketing, the utilization of CPUs' peak performance by numerical computing applications is still very poor [1].

With the emergence of the new generation of large-scale FPGA devices, scientists and engineers are being presented with a new opportunity to accelerate their specific applications on low-cost but powerful "FPGA-enhanced computers". Equipped with hardware In-System Programmability (ISP), FPGA-enhanced computers can achieve significant performance improvements over contemporary general-purpose computers for many computationally intensive applications. In [2], the authors predicted that FPGA devices would yield three to eight times more raw computing power than commodity CPUs by 2009.

Since the early 2000s, increasing research effort has been devoted to solving numerical PDE problems on FPGA-enhanced computer platforms. In particular, considerable work has gone into accelerating the Basic Linear Algebra Subprograms (BLAS), a widely accepted open-source numerical computing library that performs the basic linear algebra operations underlying this class of scientific computations. However, a large portion of BLAS subroutines (most Level-1 and Level-2 subroutines) exhibit a low ratio of computations to operands and so are memory-bounded operations: it is the computer's external memory bandwidth, not the computing engine itself, that decides the sustained computational performance. For the computation-bounded Level-3 BLAS subroutines, the costly implementation of double-precision floating-point arithmetic on FPGAs becomes the main problem. Correspondingly, most of these research works did not demonstrate exciting performance improvements of FPGA-based systems over contemporary commercial CPUs [3] [4] [5].

We believe that most commonly-used numerical methods are well-tuned for general-purpose computers and are therefore, in general, not ideal for the FPGA-enhanced computer platform. To achieve better computational performance on these new computing systems, one should design or customize new algorithms and methods for the specific scientific and engineering problems at hand. Furthermore,
these new algorithms and methods must be flexible and robust so that a wide range of commercial FPGA-based platforms can accommodate them effectively and efficiently.

In this paper, we introduce part of our recent efforts toward accelerating representative BLAS subroutines on the FPGA-enhanced computer platform with hardware-specific algorithms and methods. Based on our previous work on high-precision floating-point summation on FPGAs [6], we concentrate here on matrix-vector and matrix-matrix multiplication, which are kernel Level-2 and Level-3 BLAS subroutines, respectively. The results of this work can be applied to accelerate direct solvers for dense and sparse systems of linear equations.

The rest of this paper is organized as follows. In Section 2, we briefly review the group-alignment based floating-point summation method proposed in [6]. Based on this technique, we present in Sections 3 and 4 our designs for high-precision matrix-vector and matrix-matrix multiplication on FPGAs. Their FPGA-based implementations are introduced in Section 5 with corresponding performance evaluations. We also introduce in Section 6 a reference design that accelerates the iterative refinement method using the proposed FPGA-specific high-precision BLAS technique. Finally, conclusions and a discussion of future research directions are presented in Section 7.

2. Group-alignment based floating-point summation [6]

Floating-point summation is such an important operation in numerical computations that some computer scientists have even suggested importing it into general-purpose CPUs as "the fifth floating-point arithmetic" [7]. It is also a fundamental building block for other BLAS subroutines such as dot-product, matrix-vector multiplication, and matrix-matrix multiplication, so it plays a key role in numerical computations. Unlike the other basic floating-point arithmetic operations, summation is not well-conditioned because of so-called "catastrophic cancellation" [8]: small relative errors hidden in the inputs may result in a much larger relative error in the output. Also, the intrinsic data dependency between neighbouring floating-point additions seriously restricts its sustained computational performance. To make the situation worse, the order of the consecutive additions also affects the final sum, which makes it impossible to produce a unique solution for the same input data set on different computer platforms with different programming languages or software compilers [9]. On the bright side, this non-uniqueness relieves us from strictly obeying the IEEE standard in our implementation; instead, we take the exact solution as the only accuracy criterion.

Given a set of n floating-point numbers $s_1, s_2, \ldots, s_n$ with small relative errors $\varepsilon_M$, our goal is to design a numerical algorithm, as well as its FPGA-based hardware implementation, to compute the summation

$$S = \sum_{i=1}^{n} s_i$$

accurately and efficiently.

Instead of following the existing commonly-used high-accuracy summation algorithms [10], which are all well-tuned for commercial CPUs and so are not ideal for FPGA-based systems, we proposed in [6] a group-alignment based summation method using customized hybrid floating-point/fixed-point arithmetic. The basic idea is straightforward. As shown in Figure 1, to accumulate a group of n floating-point summands, instead of performing (n-1) standard floating-point additions, in which the exponent comparisons and mantissa shifts are scattered, we collect them together into a single exponent comparator in Step 2 and a barrel shifter in Step 4. All summands in the same group are thereby aligned into a consistent fixed-point format, so they can be summed with a simple fixed-point accumulator in Step 5.

Figure 1. Structure of group-alignment based floating-point summation

Absolute and relative numerical error bounds (Table 1) were derived in [6] using formal error analysis to prove that this hardware-based algorithm always yields similar, or even better, average- and worst-case absolute and relative errors than standard floating-point arithmetic. Numerical experiments were also performed to demonstrate the accuracy and scalability of this new approach.

Table 1. Comparisons of numerical error bounds

Sequential accumulation:
  Absolute error bound:  $e \le (n-1)\,\varepsilon_M\,(|s_1| + |s_2| + \cdots + |s_n|)$
  Relative error bound:  $\dfrac{e}{|S|} \le (n-1)\,\varepsilon_M\,\dfrac{|s_1| + |s_2| + \cdots + |s_n|}{|s_1 + s_2 + \cdots + s_n|}$

Group-alignment based summation:
  Absolute error bound:  $e' \le (n-1)\,\varepsilon'_M\,\max\{|s_1|, |s_2|, \ldots, |s_n|\}$
  Relative error bound:  $\dfrac{e'}{|S|} \le (n-1)\,\varepsilon'_M\,\dfrac{\max\{|s_1|, |s_2|, \ldots, |s_n|\}}{|s_1 + s_2 + \cdots + s_n|}$
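To make the data path concrete, here is a minimal software model of the five-step flow in Figure 1. It is an illustration under our own assumptions (IEEE-754 double inputs and a 128-bit fixed-point accumulator), not the hardware design itself; Python's arbitrary-precision integers stand in for the wide fixed-point datapath, and the integer right-shift plays the role of the barrel shifter.

```python
import math

def group_alignment_sum(summands, frac_bits=128):
    """Model of group-alignment summation: align every mantissa to the
    largest exponent in the group, then accumulate in fixed point."""
    if len(summands) == 0:
        return 0.0
    # Steps 1-2: decompose each input (s = m * 2**e) and find the group's
    # maximum exponent with a single comparator pass.
    parts = [math.frexp(float(s)) for s in summands]
    e_max = max(e for _, e in parts)
    # Steps 3-4: barrel-shift each mantissa right by (e_max - e) so that
    # all summands share one fixed-point format.
    acc = 0  # wide fixed-point accumulator
    for m, e in parts:
        fixed = int(m * (1 << frac_bits))  # mantissa as a frac_bits integer
        acc += fixed >> (e_max - e)        # alignment shift, then accumulate
    # Step 5 aftermath: a single rounding back to floating point per group.
    return math.ldexp(acc, e_max - frac_bits)

vals = [2.0 ** 53, 1.0, -2.0 ** 53, 1.0]
print(sum(vals))                  # sequential accumulation loses a 1.0 -> 1.0
print(group_alignment_sum(vals))  # group alignment keeps both 1.0s  -> 2.0
```

The example shows the qualitative difference behind Table 1: sequential accumulation rounds after every addition, while the group-alignment unit rounds only once per group.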
3. High-precision matrix-vector multiplication on FPGAs

This Level-2 BLAS subroutine is defined as

$$z_i = \sum_{j=1}^{n} A_{ij}\,x_j + y_i, \qquad i = 1, \ldots, n \qquad (3.1)$$

where A is a dense or sparse n × n matrix and x, y, and z are n × 1 vectors. Suppose all entries of A, x, y, and z are stored in main memory (which is the case for most realistic numerical PDE problems). The subroutine performs $2n^2$ floating-point operations and $(n^2 + 3n)$ memory accesses in total. Their ratio is nearly two, which reveals the memory-bandwidth-bounded character of this subroutine. When executed on general-purpose computers, the observable sustained computational performance of Level-2 BLAS subroutines reaches only 30~50 percent of the theoretical values [1], which are themselves bounded by in-system memory bandwidth.
In other words, a considerable portion of the physical memory bandwidth is wasted by the inefficiency of the memory subsystem, which normally consists of a hierarchy of data/instruction caches with a fixed eviction policy. By contrast, the utilization of external memory bandwidth for a specific problem can always be improved on an FPGA-based system by exploiting the FPGA's abundant on-chip programmable memory resources, such as RAM blocks and distributed registers.

Recently, several FPGA-based schemes [3] [4] were proposed to accelerate matrix-vector multiplication by constructing appropriate buffering/caching subsystems. Although those works achieved considerable performance improvements over the same subroutine running on general-purpose computers, their results are not especially encouraging. Specifically, the external memory bandwidth of a typical FPGA-based system can generally sustain a transfer rate of millions of words per second, which is at the same level as the data throughput of a single pipelined floating-point arithmetic unit. A few arithmetic units running in parallel can therefore saturate all the available memory bandwidth and leave considerable on-board FPGA resources unused.

By calculating the matrix-vector multiplication in high precision, we put more of the on-board FPGA resources into play so that our computations benefit in an indirect way. After decomposing matrix A into n row vectors, the matrix-vector multiply can be treated as n dedicated vector dot-products. The group-alignment based floating-point summation unit we proposed in [6] can easily be applied here to construct a dot-product unit by introducing a simplified standard floating-point multiplier at the front end. This multiplier accepts one entry each from matrix A and vector x at every clock cycle, normalizes them, and multiplies their mantissas. The product and the sum of the exponents are then fed to the input of the group-alignment based summation unit, with all of the multiplier's post-processing stages eliminated.

In order to obtain a high-precision result, we cannot simply round those product terms to standard floating-point format. To reveal the problem, consider a simple case, the dot-product of two two-element vectors $a = [a_1, a_2]^T$ and $b = [b_1, b_2]^T$. Using the formal numerical error analysis method [17], we have

$$
\begin{aligned}
fl\big(fl(a_1 b_1) + fl(a_2 b_2)\big) &= fl\big((a_1 b_1)(1+\varepsilon_1) + (a_2 b_2)(1+\varepsilon_2)\big) \\
&= \big((a_1 b_1)(1+\varepsilon_1) + (a_2 b_2)(1+\varepsilon_2)\big)(1+\varepsilon_3) \\
&= a_1 b_1 + a_2 b_2 + (a_1 b_1)\varepsilon_1 + (a_2 b_2)\varepsilon_2 + (a_1 b_1)\varepsilon_3 + (a_2 b_2)\varepsilon_3 + (a_1 b_1)\varepsilon_1\varepsilon_3 + (a_2 b_2)\varepsilon_2\varepsilon_3 \\
&\approx a_1 b_1 + a_2 b_2 + (a_1 b_1)(\varepsilon_1 + \varepsilon_3) + (a_2 b_2)(\varepsilon_2 + \varepsilon_3)
\end{aligned}
\qquad (3.2)
$$

After ignoring the high-order rounding error terms in this equation, we observe that the rounding errors produced by the multiplications ($\varepsilon_1$, $\varepsilon_2$) have magnitudes similar to the addition error ($\varepsilon_3$), and so they must also be taken into account for an accurate solution. All effective bits of the input mantissa products should be kept so that the numerical errors produced by the multiplications are removed. At the same time, the word widths of the barrel shifter and the fixed-point accumulator in the group-alignment based summation unit should be extended correspondingly for better numerical error bounds.
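The effect predicted by Equation (3.2) is easy to reproduce in software. The sketch below is our own illustration rather than part of the paper's design: it uses single-precision inputs so that a double holds each 24-bit x 24-bit mantissa product exactly, mirroring the "keep all effective product bits" argument, so that only the final addition rounds.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # the seed is arbitrary
a = rng.standard_normal(2).astype(np.float32)
b = rng.standard_normal(2).astype(np.float32)

# fl(fl(a1*b1) + fl(a2*b2)): each product is rounded to single precision
# before the addition, injecting the eps_1 and eps_2 terms of (3.2).
rounded = np.float32(np.float32(a[0] * b[0]) + np.float32(a[1] * b[1]))

# Keep all 48 product bits: a float32 x float32 product is exact in float64,
# so only the final addition rounds, as in the group-alignment dot-product.
kept = np.float64(a[0]) * np.float64(b[0]) + np.float64(a[1]) * np.float64(b[1])

print(rounded, kept, float(rounded) - kept)  # gap is at the eps_single level
```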
By comparison, high-precision floating-point computation is very difficult to achieve on general-purpose computers because most commodity CPUs provide only standard-precision floating-point arithmetic. One has to emulate high-precision arithmetic in software, or the results are at risk from rounding errors coming from both the floating-point additions and the multiplications. Some CPUs do provide a so-called "fused multiply-add (FMA)" unit and instruction, in which a floating-point accumulator is placed adjacent to the multiplier to eliminate the rounding errors introduced by multiplications. However, most software compilers do not yet support this function because it tends to complicate instruction scheduling and may eventually slow down execution.

Although this new high-precision approach does not alleviate the memory bandwidth bottleneck, it helps us evaluate $z_i$ to higher accuracy at nearly the same speed as the conventional approaches. High-precision arithmetic is always preferable for matrix-vector multiplication when matrix A is badly conditioned. Scientists [11] have proposed pure software methods that provide higher-precision (double-extended, double-double, or quadruple precision) BLAS subroutines using only standard floating-point arithmetic (single or double precision), at the cost of a 4~10 times slowdown. From this point of view, our new approach significantly improves the computational performance of high-precision matrix-vector multiplication.

Because the memory bandwidth limitation is always present, we need to construct an appropriate data buffering system to get the most out of the limited resources. The data dependency in the summations of consecutive products is the only issue that the buffering system needs to address. Starting from the simple case where all matrix entries are stored in external memory but at least one input/output vector can be accommodated in the FPGA's on-chip RAM blocks, we have two choices for constructing the data buffering system.

The first is to calculate the matrix-vector multiplication in row order, just as Equation (3.1) implies. As shown in Figure 2, matrix entries are read into the FPGA row by row and multiplied by entries from the input vector x, whose elements are all buffered inside the FPGA's RAM blocks. Multiple RAM blocks can support multiple operand accesses from x at each clock cycle, so a number of multipliers can work in parallel, provided the cumulative external memory bandwidth is wide enough to deliver the same number of matrix operands simultaneously. Note that all effective bits of the input mantissa products (48 bits for single-precision multiplication and 106 bits for double-precision) are kept, and the word widths of the barrel shifter and the fixed-point accumulator in the group-alignment based summation unit are extended correspondingly for high-precision results.

Figure 2. The block diagram and data flow for matrix-vector multiplication in row order

The group-alignment based summation technique is applied twice here to address the intrinsic data dependency problem: once in parallel [12], followed by once in sequence [6]. Although slightly costly, this approach simplifies the design of the buffering circuit as well as the internal data flow: memory accesses to matrix A and vector x are always sequential, the output z entries are generated in order, and there are no local data feedback paths in the design. Indeed, we can easily combine the two summation units into a single one. The resulting parallel-sequential group-alignment based summation unit accepts s summands at each clock cycle, aligns them with the same number of parallel mantissa shifters, and then accumulates the aligned mantissas using an s-input fixed-point binary adder tree followed by a single fixed-point accumulator. Although only limited hardware resources are saved by the combination, this new approach effectively enlarges the summation group size and so benefits the final precision. The corresponding FPGA-specific matrix-vector multiplication method is listed as Algorithm 1.

Algorithm 1: Matrix-Vector Multiply in Row Order
Input: n × (n+1) dense matrix [y; A] (the augmented matrix with the column vector y attached to the left of matrix A), stored in external memory;
(n+1) × 1 vector [1; x], stored in s SRAM blocks inside the FPGA device, where block number i contains vector entries i, i+s, i+2s, ...
Output: z = A·x + y
For i = 1 to n
  For j = 1 to (n+1)/s
    Read in matrix entries A(i, (j-1)s+1 : js) from external memory
    For k = 1 to s do parallel
      p_k = A(i, (j-1)s+k) · x((j-1)s+k)
    End For
    q_j = sum_{k=1..s} p_k
  End For
  z_i = sum_{j=1..(n+1)/s} q_j
End For
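A behavioral sketch of Algorithm 1 may help fix the data flow. It reuses the group_alignment_sum() model from the Section 2 sketch in place of both the s-input parallel unit and the sequential accumulator; it is a software stand-in for the hardware, with the strip width s as a free parameter.

```python
import numpy as np

def matvec_row_order(A, x, y, s=4):
    """z = A @ x + y, streaming each row of the augmented matrix [y; A] in
    strips of s entries, the way the hardware would consume them."""
    n = A.shape[0]
    Ay = np.hstack([y.reshape(-1, 1), A])   # augmented matrix [y; A]
    x1 = np.concatenate([[1.0], x])         # matching vector [1; x]
    z = np.empty(n)
    for i in range(n):
        partials = []
        for j in range(0, n + 1, s):        # one strip of s products
            prods = Ay[i, j:j + s] * x1[j:j + s]        # s parallel multipliers
            partials.append(group_alignment_sum(prods)) # s-input parallel unit
        z[i] = group_alignment_sum(partials)            # sequential unit
    return z

A = np.random.rand(6, 6); x = np.random.rand(6); y = np.random.rand(6)
assert np.allclose(matvec_row_order(A, x, y), A @ x + y)
```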
By using s independent sequential group-alignment based summation units to evaluate s vector dot-products in parallel, another row-order matrix-vector multiplication scheme is also feasible here. Indeed, this is the conventional approach we observed in other works [3] [4], except that those designs adopt multiple standard floating-point adders for accumulation instead of the group-alignment based summation method. This scheme would be much more expensive than our approach, so we do not consider it further in this work.

The other data buffering scheme is based on the fact that the matrix-vector multiplication can also be represented as the summation of n scaled column vectors of matrix A:

$$z = A\,x + y = \sum_{j=0}^{n-1} A_j\,x_j + y \qquad (3.3)$$

where $A_j$ denotes the j-th column of A. Figure 3 illustrates the computational scheme of this blocked matrix-vector multiplication in column order, and the block diagram and data flow of the corresponding FPGA-based hardware implementation are depicted in Figure 4. Here, all entries of the result vector z are accommodated inside the FPGA's on-chip RAM blocks and initialized with y. The input matrix A is stripped into blocks of size n × s, with s selected carefully so that all entries in a block row can be accessed simultaneously from external memory. Correspondingly, there are s multipliers running in parallel. All the products generated in the same clock cycle, together with the relevant partial-sum value read from the output vector z, are summed to update that z entry. The troublesome data dependency problem is eliminated by this simple scheduling scheme because consecutive summations now target different z entries. The resulting column-order matrix-vector multiplication method is listed as Algorithm 3.

Figure 3. Blocked matrix-vector multiplication

Figure 4. The block diagram and data flow for matrix-vector multiplication in column order

Algorithm 3: Matrix-Vector Multiply in Column Order
Input: n × n matrix A, stored in external memory;
n × 1 vector x, stored in external memory;
n × 1 vector z, accommodated inside the FPGA's on-chip RAM blocks and initialized with the corresponding y entries
Output: z = A·x + y
For j = 1 to n/s
  Read in vector entries x((j-1)s+1 : js)
  For i = 1 to n
    Read in matrix entries A(i, (j-1)s+1 : js)
    For k = 1 to s do parallel
      p_k = A(i, (j-1)s+k) · x((j-1)s+k)
    End For
    z_i = z_i + sum_{k=1..s} p_k
  End For
End For

Compared with the previous approach, only one (s+1)-input parallel group-alignment based summation unit is required. However, a local data feedback loop is introduced in order to update the output vector. This is not an issue here because the FPGA's on-chip SRAM blocks contain circuitry that supports two independent access ports.
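For comparison, here is a matching sketch of Algorithm 3, again reusing the group_alignment_sum() model from the Section 2 sketch. It makes the single (s+1)-input summation and the z feedback loop explicit.

```python
import numpy as np

def matvec_column_order(A, x, y, s=4):
    """z = A @ x + y, consuming A in strips of s columns; z stays resident,
    modeling the on-chip output buffer."""
    n = A.shape[0]
    z = y.astype(float)                      # z buffer initialized with y
    for j in range(0, n, s):                 # one strip of s columns
        xs = x[j:j + s]                      # s vector entries, read once
        for i in range(n):
            prods = A[i, j:j + s] * xs       # s parallel multipliers
            # (s+1)-input group summation: the old z_i enters as a summand
            z[i] = group_alignment_sum(list(prods) + [z[i]])
    return z

A = np.random.rand(6, 6); x = np.random.rand(6); y = np.random.rand(6)
assert np.allclose(matvec_column_order(A, x, y), A @ x + y)
```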
As noted above, both of these matrix-vector multiplication approaches assume that the FPGA's on-chip RAM blocks can accommodate at least one input/output vector. For the large matrix-vector multiplication problems encountered in practice, where the input/output vectors contain too many entries to be buffered inside the FPGA device, we are forced to store all vectors together with the matrix in external SDRAM space. (Sometimes external SRAM modules are integrated on board and can be used to hold the vectors.) The main objective of the data buffering design is then to hide the access lags caused by the external SDRAM modules and to get the most out of the limited external memory bandwidth.

The second approach can easily be extended to such cases using a stripped matrix-vector multiplication scheme similar to the one shown in Figure 3. Every partial matrix-vector multiplication related to one matrix strip accesses the output vector twice in order to have all of its entries updated. The output vector z therefore has to be read from and written back to external memory repeatedly, which results in considerable performance degradation if the matrix strips are narrow. For example, with only 4 columns per matrix strip, 50 percent more memory accesses are introduced, and the sustained performance of the hardwired computing engine is degraded by one third. Adopting a wide matrix strip amortizes this overhead and significantly improves the situation; however, the more strip-row entries we use, the larger the parallel summation unit becomes. An effective alternative is to construct an internal RAM-block-based input buffer that accommodates up to tens of thousands of vector x entries, and then adopt the parallel-sequential group-alignment based summation unit described above to accumulate the partial sums row by row and strip by strip.
4. High-precision matrix multiplication on FPGAs

Given two n × n dense matrices A and B, the matrix-matrix multiply C = A·B is defined as

$$C_{ij} = \sum_{k=1}^{n} A_{ik}\,B_{kj} \qquad (4.1)$$

In contrast to the Level-1 and Level-2 BLAS subroutines, matrix multiplication is a computation-bounded subroutine. In the ideal case where the FPGA's internal RAM blocks could accommodate all matrix elements of A, B, and C, there would be $3n^2$ memory accesses and $2n^3$ floating-point operations in total, so the ratio of floating-point operations to external memory accesses is proportional to n. This reveals excellent data reusability, especially when n is large. In theory, such computation-bounded subroutines can always put the onboard FPGA devices into full play and so achieve sustained computational performance comparable to contemporary commodity CPUs.

In reality, only a small portion of the matrix entries can be stored on chip, so a blocked matrix-matrix multiplication scheme is always the most popular choice, although it may introduce considerable additional external memory accesses for intermediate results. For example, if there is only enough on-chip memory to buffer three s × s blocks of matrix entries, we need to read in $2s^2$ matrix entries from A and B for every $2s^3 + s^2$ floating-point computations. (Here we purposely ignore the memory accesses for saving the resulting matrix elements of C because this workload is amortized when the matrix size is large.) The total number of external memory accesses of this subroutine is therefore

$$\left(\frac{n}{s}\right)^3 \cdot 2s^2 = \frac{2n^3}{s}$$
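The traffic formula is easy to sanity-check numerically; the snippet below (illustrative arithmetic only) confirms that the ratio of floating-point operations to external memory accesses of the blocked scheme equals the block size s, which is why a larger s relaxes the bandwidth requirement.

```python
def external_reads(n, s):
    """Words read from A and B with three s x s blocks held on chip."""
    return (n // s) ** 3 * 2 * s * s     # = 2 * n**3 / s

n = 1024
flops = 2 * n ** 3                       # total multiply-add operations
for s in (16, 64):
    reads = external_reads(n, s)
    print(s, reads, flops / reads)       # flops per word read equals s
```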
A successful design of matrix-matrix multiplication on an FPGA-based platform implies building an appropriate internal data buffering subsystem from the FPGA's abundant on-chip SRAM blocks and distributed registers; fully exploiting the data and computation locality of the subroutine is pivotal to success. Several research works have discussed FPGA-based implementations of matrix-matrix multiplication. In [3], the authors mainly concentrated on theoretical peak performance analysis rather than on a detailed implementation. In [13], the authors designed a two-dimensional processor mesh based on the conventional systolic matrix-matrix multiply hardware algorithm. Each processor node has its own multiply-accumulate (MAC) unit together with local memory space, and is responsible for the computation of a consecutive block of matrix C. Elements of matrices A and B are fed into the computing engine from the boundary nodes and traverse the internal processors via on-chip interconnection paths. The disadvantage of this design is that all boundary processing nodes import and export operands from and to the outside world, so the number of input/output ports required is large. In [4], the authors proposed a one-dimensional processor array to address this problem. Only the first processing node reads matrices A and B from external memory; these values then traverse all inner nodes in the linear array to participate in all related computations. Once the procedure is finished, each processing node simply transfers its local results of matrix C to its neighbors, and the results are finally written back to external memory via the first processing node. One common problem of these designs is that they all require complex control logic to coordinate the communications and interactions among multiple processing nodes. Correspondingly, the computational performance demonstrated in these works is below or merely comparable to that of contemporary commodity CPUs.

In our work, we propose a new solution for matrix-matrix multiplication that achieves a much more compact hardware implementation on the FPGA-enhanced computer platform. This new matrix-matrix multiplication unit does not employ multiple processing nodes; instead, it uses one large computing engine with an array of standard floating-point multipliers followed by a single parallel group-alignment based summation unit. A two-level data buffering subsystem is designed to read operands of matrices A and B from external memory, cache them in local memory, and distribute them to the arithmetic units in the correct sequence. Two contrary requirements are addressed satisfactorily by this data buffering scheme. First, the block size s must be large enough to ensure the computation-bounded nature of the underlying blocked matrix-matrix multiplication. Second, because multiple arithmetic units run in parallel, there are concurrent accesses to multiple operands in one data block; a single large-capacity RAM block provides only a limited number of read/write ports and so is inapplicable here.

We demonstrate our design with the 16X16 block matrix-matrix multiplication computing engine shown in Figure 5. In this design, s is set to 16, so each matrix block has 16X16 = 256 double-precision entries. The fully-pipelined computing engine contains 16 modified floating-point multipliers and one 17-input parallel summation unit, and so finishes 16 multiplications and 16 additions in each clock cycle. By varying the value of s, we can easily change the size of the computing engine as well as its computational performance.

Figure 5. The block diagram and data flow for blocked matrix-matrix multiplication

Multiple matrix C blocks are accommodated inside the on-chip output buffer, which can be efficiently constructed from the FPGA's on-chip SRAM blocks. This output buffer has two independent data-access ports: one is connected to an input port of the parallel summation unit to feed in the previous partial sums of the C entries; the other is used for writing back the updated partial sums. Concurrent read and write addresses to the same memory space always keep a fixed distance, and so introduce no conflicts.

A two-level caching circuit is employed to buffer the operands read from matrices A and B efficiently. Sixteen dedicated small RAM pieces constitute the first-level cache for matrix A, which is constructed from multiple dual-port on-chip SRAM blocks. One set of memory ports of these SRAM blocks provides operands to the 16 multipliers simultaneously; the other set of ports is used to update matrix entries and is interconnected as a cascaded FIFO with 16X16 levels. This input cache contains two independent 256-word matrix entry buffers, each of which can accept an entire matrix block; they work in a swapping (double-buffered) manner to hide the data refreshing cycles. In refresh mode, the entries of a matrix A block are read out of the second-level cache in column order and pushed into the FIFO through its input port at the bottom, so a total of 256 push cycles updates all the entries of a matrix block. In computation mode, 16 commonly-addressed matrix entries from the same block row are accessed simultaneously and fed to the 16 multipliers in each clock cycle. The accesses to one matrix block repeat for 16 rounds of 16 clock cycles each, so the computation mode also lasts 256 cycles.

Another 16-level cascaded FIFO, with two sets of 16 dedicated one-double-word registers, is employed for the first-level caching of 16 matrix B entries (one column of a matrix B block). It also works in a swapping manner to overlap the refresh and computation cycles, but it switches 16 times faster than the matrix A caching circuit. Once updated, the operands remain unchanged during the next 16 clock cycles, providing a group of matrix B entries to the multipliers. All 16 products, together with the old partial sum read from the output buffer, are then fed into the group-alignment based parallel summation unit simultaneously as one group of summands. The summation result is written back to the output buffer via the other data port. No data or computation dependency exists in this implementation.

As introduced above, the computing engine needs 256 clock cycles to finish the multiplication of two 16X16 matrix blocks. During the same interval, the 256-entry first-level matrix A cache updates its contents once, and the contents of the 16-entry matrix B cache are refreshed 16 times. To keep the computing engine operating at 200 MHz, we need a memory channel that can sustain a transfer rate of 400M double words per second (3.2 GByte/s), which corresponds to the memory bandwidth provided by a 400 MHz DDR-SDRAM module.
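A behavioral model of the Figure 5 engine (our own software rendering, with the group_alignment_sum() sketch from Section 2 standing in for the 17-input hardware unit) makes the blocked data flow explicit: for every output entry, 16 products plus the old partial sum of C enter one group summation.

```python
import numpy as np

def blocked_matmul(A, B, s=16):
    """C = A @ B with s x s blocks; n must be a multiple of s in this model."""
    n = A.shape[0]
    C = np.zeros((n, n))                       # models the on-chip output buffer
    for bi in range(0, n, s):
        for bj in range(0, n, s):
            for bk in range(0, n, s):          # stream one A block and one B block
                for i in range(s):
                    for j in range(s):
                        # 16 parallel multipliers feed one (s+1)-input summation
                        prods = A[bi + i, bk:bk + s] * B[bk:bk + s, bj + j]
                        C[bi + i, bj + j] = group_alignment_sum(
                            list(prods) + [C[bi + i, bj + j]])
    return C

A = np.random.rand(32, 32); B = np.random.rand(32, 32)
assert np.allclose(blocked_matmul(A, B), A @ B)
```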
In this design, we introduce another level of large-capacity data cache to buffer multiple matrix blocks in-core. A simple block scheduling circuit coordinates the multiplication of large matrices composed of multiple 16X16 blocks; we can simply follow the ordinary blocked matrix-matrix multiply algorithm, as shown in Figure 6. The special data buffering scheme adopted here ensures that the entire block a2 and the first column of block b3 are ready in-core immediately after the multiplication of a1·b1, so that the computation of c1 <= c1 + a2·b3 can start without any pipeline stalls. Once the final results of a block of matrix C are obtained, we have to save them back to external memory; if the matrices are large enough, this overhead is considerably amortized.

Figure 6. Blocked matrix multiplication

From the perspective outside the buffering system, the block size of the matrix multiplication is now enlarged by the existence of the level-2 cache, and the requirement for external memory bandwidth decreases proportionally. For example, it is straightforward to build an on-chip second-level cache holding 8192 matrix A or B entries. The block size then becomes 64, and we need only 800 MByte/s of external memory bandwidth to keep the computing engine operating at full speed, four times less than without the level-2 cache.

5. FPGA implementation

In [6], we designed a sequential group-alignment based summation unit on an entry-level Virtex-II Pro evaluation board [14]. We compared the performance of the new design with two other approaches proposed in [15] and [16]. The results show that the new group-alignment based summation method provides much higher sustained computational performance, lower FPGA resource consumption, and shorter pipelining latency than the others. Furthermore, adopting more fraction bits in the unfolded mantissa shifter as well as in the fixed-point accumulator consumes negligible extra FPGA resources but can significantly improve the numerical error bounds of the summation.

For the matrix-vector multiplication, because of the memory bandwidth bottleneck, all previous designs [3] [4] can easily saturate the on-board external memory bandwidth but leave considerable FPGA resources unused. The adoption of the group-alignment based summation unit helps us improve numerical accuracy significantly by utilizing those otherwise unused hardware resources. In other words, our approach provides users with an effective way to increase the utilization of on-board FPGA resources for useful work. In the next section, we will use a simple example to demonstrate the benefit of high-precision floating-point arithmetic for the solution of linear systems of equations.

For lack of a more powerful FPGA development platform, we can only evaluate the performance of our new matrix-matrix multiplication computing engine based on the synthesis results reported by Xilinx ISE, targeting a Xilinx Virtex-II Pro XC2VP50-7 chip. The 16X16 blocked matrix multiplication unit (as shown in Figure 5) consumes 17256 logic slices, 144 18X18 hardwired multipliers, and 64 RAM blocks. The clock rate applied to the computing engine is 200 MHz, so the accumulated computational performance is 6.4 GFLOPS. The equivalent matrix block size is 64; correspondingly, 800 MByte/s of external memory bandwidth is required to keep the computing engine operating at full speed. Of course, larger-capacity caching circuits could easily be built using on-chip RAM blocks or onboard external SRAM modules, depending on the hardware resources at hand; the requirement for external memory bandwidth would then reduce proportionally.
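These bandwidth figures follow directly from the 1/s words-per-flop ratio of the blocked scheme; the following short check (illustrative arithmetic only) reproduces both the 3.2 GByte/s and the 800 MByte/s numbers quoted above.

```python
flops_rate = 32 * 200e6            # 16 multipliers + 16 adders at 200 MHz = 6.4 GFLOPS
for s in (16, 64):                 # block size without / with the level-2 cache
    words_per_sec = flops_rate / s # 2n^3/s words moved for 2n^3 flops
    print(s, flops_rate / 1e9, 8 * words_per_sec / 1e9)  # GFLOPS, GByte/s
```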
6. Application: Accurate iterative refinement

In this section, we use a simple example to demonstrate the benefits of applying our FPGA-specific high-precision BLAS subroutines to enhance the iterative refinement method for the accurate solution of linear systems of equations. The linear system considered here is

$$L_n H_n x = L_n e_n \qquad (6.1)$$

where $H_n$ is the n-dimensional Hilbert matrix, whose entries are $h_{ij} = 1/(i+j-1)$; $e_n$ is an n-dimensional unit vector; and $L_n$ is a diagonal matrix that scales all $H_n$ entries to rounding-error-free integers. The Hilbert matrices are chosen because their condition numbers increase rapidly with the matrix dimension [17], so the corresponding linear systems are very sensitive to numerical rounding errors. Also, because the inverse of the Hilbert matrix is known analytically, the exact solution of the linear system can easily be computed and used as the reference.

Algorithm 4: Iterative refinement for the accurate solution of a badly conditioned linear system Ax = b
Input: A = L_n H_n, b = L_n e_n
Output: the solution x
Find the LU decomposition of matrix A: A = LU
Repeat
  Compute the residual r = b - Ax
  Compute the 2-norm of r
  Solve Ad = r for d using the existing LU factorization
  Update the solution x = x + d
Until r is small enough, or stops decreasing, or the number of iterations exceeds a threshold

Algorithm 4 is the conventional iterative refinement method. Its computational complexity is dominated by the LU decomposition, which is a computation-bounded numerical subroutine and can easily be accelerated using the FPGA's high computational potential. In order to solve badly conditioned linear systems accurately, this algorithm requires high-precision floating-point arithmetic in its inner loops. Because no hardwired quadruple or double-double precision arithmetic units are available inside commodity CPUs, the computational performance of this algorithm is in general very poor when it runs on general-purpose computers.
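The following software sketch of Algorithm 4 is only a model of the FPGA design under several assumptions of ours: scipy's LU routines stand in for the hardwired factorization, exact Fraction arithmetic stands in for the quadruple-precision residual path, we read $e_n$ as the all-ones vector, and math.lcm requires Python 3.9 or later.

```python
import math
from fractions import Fraction
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def residual(A_int, b_int, x):
    """Exact residual r = b - A x. A and b are integers and each double in x
    converts to Fraction exactly, so this models the wide (error-free)
    datapath that the design's quadruple-precision units approximate."""
    xf = [Fraction(float(v)) for v in x]
    return np.array([float(bi - sum(a * xi for a, xi in zip(row, xf)))
                     for row, bi in zip(A_int, b_int)])

def refine(A_int, b_int, max_iter=10):
    A = np.array(A_int, dtype=float)          # entries exactly representable
    b = np.array(b_int, dtype=float)
    lu, piv = lu_factor(A)                    # factor once, reuse every pass
    x = lu_solve((lu, piv), b)
    for _ in range(max_iter):
        r = residual(A_int, b_int, x)
        if np.linalg.norm(r) == 0.0:          # simplified stopping test
            break
        x = x + lu_solve((lu, piv), r)        # solve A d = r, then x <- x + d
    return x

# Scaled Hilbert system (6.1): multiplying row i of H_n by lcm(i, ..., i+n-1)
# yields exact integers, so the stored data carry no rounding error.
n = 8
l = [math.lcm(*range(i, i + n)) for i in range(1, n + 1)]
A_int = [[l[i] // (i + j + 1) for j in range(n)] for i in range(n)]
b_int = [sum(row) for row in A_int]           # right-hand side for x = ones

x_plain = lu_solve(lu_factor(np.array(A_int, float)), np.array(b_int, float))
x_ref = refine(A_int, b_int)
print(np.max(np.abs(x_plain - 1.0)))   # plain double solve: visible error
print(np.max(np.abs(x_ref - 1.0)))     # refined: near the rounding level
```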
In our implementation, we apply our FPGA-specific high-precision BLAS subroutines to improve the performance of the iterative refinement method. Specifically, quadruple precision is adopted in the computations of the LU decomposition, the matrix-vector multiplication, the vector norm, and the forward/backward substitutions; the corresponding high-precision results are then rounded to standard double-precision floating-point format. We can observe in Table 2 that:

• The condition numbers of the Hilbert matrices increase rapidly with the matrix dimension. Correspondingly, the linear systems are badly conditioned and very sensitive to numerical rounding errors.
• The conventional iterative refinement method does not work well for these problems. Even with double-precision floating-point arithmetic, the iterations eventually stagnate with significant residual errors.
• Our FPGA-specific high-precision BLAS subroutines work perfectly for all the listed problems. In most cases, execution terminates within a couple of iterations, and the residual numerical errors are below the unit rounding error introduced by the floating-point representation.

Table 2. Convergence comparison between conventional iterative refinement using double-precision arithmetic and the FPGA-specific high-precision BLAS subroutines (iteration counts in parentheses)

  n                             5       6       7       8       9       10      11
  Condition number of H_n       4.8e5   1.5e7   4.8e8   1.5e10  4.9e11  1.6e13  5.2e14
  Relative error, x-double      1e-12   1e-11   1e-10   1e-8    1e-7    1e-5    1e-3
  Relative error, x-accurate    0 (2)   0 (2)   0 (2)   0 (2)   0 (2)   0 (3)   0 (4)

7. Conclusion

In this paper, we proposed a series of group-alignment based algorithms and methods to improve the evaluation of representative BLAS subroutines on FPGA-enhanced computers. Compared with previous designs, our new FPGA-specific implementations are powerful, efficient, and compact, with improved numerical accuracy and stability.

Future work will concentrate on designing FPGA-specific high-precision arithmetic and function units and on analysing their positive effects on badly conditioned numerical computing applications. We hope that high precision will lead to more efficient implementations on FPGA-enhanced computers, so that the numerical computations of scientific and engineering problems can benefit in an indirect way.

References

[1] A. J. van der Steen, The EuroBen Throughput Benchmark Framework, Utrecht University, 2006.
[2] K. D. Underwood, FPGAs vs. CPUs: Trends in Peak Floating-Point Performance, in Proceedings of the ACM/SIGDA 12th International Symposium on FPGAs, pp. 171-180, 2004.
[3] L. Zhuo and V. K. Prasanna, High Performance Linear Algebra Operations on Reconfigurable Systems, in Proceedings of the ACM/IEEE Conference on Supercomputing (SC), 2005.
[4] K. D. Underwood and K. S. Hemmert, Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance, in Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2004.
[5] M. C. Smith, J. S. Vetter, and S. R. Alam, Scientific Computing Beyond CPUs: FPGA Implementations of Common Scientific Kernels, MAPLD, 2005.
[6] C. He, G. Qin, M. Lu, and W. Zhao, Accurate Floating-Point Summation with Group-Alignment Technique on FPGAs, in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2006.
[7] U. Kulisch, The Fifth Floating-Point Operation for Top-Performance Computers, Universität Karlsruhe, 1997.
[8] D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, 23(1):5-48, 1991.
[9] N. J. Higham, The Accuracy of Floating Point Summation, SIAM Journal on Scientific Computing, 14:783-799, 1993.
[10] D. Priest, Differences Among IEEE 754 Implementations, http://www.validgh.com/goldberg/
[11] X. S. Li, J. W. Demmel, D. H. Bailey, G. Henry, et al., Design, Implementation and Testing of Extended and Mixed Precision BLAS, ACM Transactions on Mathematical Software, 28(2):152-205, 2002.
[12] C. He, G. Qin, M. Lu, and W. Zhao, An Efficient Implementation of High-Accuracy Finite Difference Computing Engine on FPGAs, in Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2006.
[13] J. Jang, S. Choi, and V. K. Prasanna, Area and Time Efficient Implementation of Matrix Multiplication on FPGAs, 2002.
[14] Xilinx XUP Virtex-II Pro Development System, www.xilinx.com/univ/xupv2p.html
[15] L. Zhuo, G. R. Morris, and V. K. Prasanna, Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores, in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2005.
[16] Z. Luo and M. Martonosi, Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques, IEEE Transactions on Computers, 49(3):208-218, 2000.
[17] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 2002.
