An Efficient FPGA Implementation of IEEE 802.16e LDPC Encoder
Speaker: Chau-Yuan-Yu
Advisor: Mong-Kai Ku

Outline
- Introduction
- Low-Density Parity-Check Codes
- Related work
- General encoding for LDPC codes
- Efficient encoding for Dual-Diagonal matrix
- Better Encoding scheme
- LDPC Encoder Architecture
- Parallel Encoder
- Serial Encoder
- Result
- Conclusion

Low-Density Parity-Check Code
- Benefits of LDPC codes
  - Approach the Shannon limit
  - Low error floor
- LDPC codes are adopted by various standards (e.g. DVB-S2, 802.11n, 802.16e).

Low-Density Parity-Check Code
- The parity-check matrix H is sparse: very few 1’s in each row and column.
- The null space of H is the codeword space.
- Valid codeword c: H c^T = 0

Low-Density Parity-Check Code
- In an (n, k) block code, k bits of information are encoded into an n-bit codeword.
- In a systematic block code, the information bits appear unchanged in the codeword: c = [systematic part | parity part].

Low-Density Parity-Check Code
- General encoding of systematic linear block codes
  - Find the generator matrix G from H.
  - c = sG = [s | p]
- Issues with LDPC codes
  - G is very large.
  - G is generally not sparse.
  - Encoding complexity is therefore very high.

Structured LDPC Codes
- Quasi-Cyclic (QC) LDPC codes
  - In QC-LDPC, H can be partitioned into square sub-blocks of size z x z.
  - Each sub-block is either a z x z zero matrix or a cyclically shifted (permuted) identity matrix.

Structured LDPC Codes
- QC codes with dual-diagonal structure
  - In IEEE standards, QC-LDPC codes have a dual-diagonal parity structure.
  - We take the 802.16e rate-1/2 matrix as an example. An entry of 0 represents the unshifted identity matrix.
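The sub-block structure described above can be sketched in software. This is a minimal illustration, not the 802.16e matrix: the base matrix `toy_base`, the helper names, the use of -1 for a zero block, and the direction of the cyclic shift are all assumptions made for the example.

```python
def shifted_identity(z, s):
    """z x z identity matrix with every row cyclically shifted right by s."""
    return [[1 if c == (r + s) % z else 0 for c in range(z)] for r in range(z)]


def expand_base_matrix(base, z):
    """Expand a base matrix of shift values into a binary parity-check matrix.

    Assumed convention: -1 marks a z x z zero block, s >= 0 marks the
    identity shifted by s (so 0 is the plain identity).
    """
    H = []
    for base_row in base:
        blocks = [
            shifted_identity(z, s) if s >= 0 else [[0] * z for _ in range(z)]
            for s in base_row
        ]
        for r in range(z):
            # Concatenate row r of every block in this block row.
            H.append([bit for block in blocks for bit in block[r]])
    return H


# Hypothetical 2 x 4 base matrix, chosen only to keep the output small.
toy_base = [[1, -1, 0, 0],
            [-1, 2, -1, 0]]
H = expand_base_matrix(toy_base, z=4)
```

Each row of `H` has as many 1's as its block row has non-negative entries, which is why storing only the shift values is enough to describe the whole matrix.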
General Encoding for LDPC Codes
- Richardson and Urbanke (RU) algorithm
  - Partition the H matrix into several sub-matrices.
  - In H, the part T is a lower triangular matrix.

General Encoding for LDPC Codes
- Richardson and Urbanke (RU) algorithm
  - p0: O(n + g^2)
  - p1: O(n + g^2)

Efficient Encoding for Dual-Diagonal LDPC Codes
- A valid codeword c = [s | p] must satisfy H c^T = 0.
- Replace H by the dual-diagonal matrix, split into the part that multiplies the information bits and the part that multiplies the parity bits.
- Define the lambda value as λ = Hs s.
- From this equation, we obtain the parity relations [equations not preserved].

Related Work (1): Sequential Encoding
- Encoding scheme (one-way derivation)
  - Step 1: Compute the lambda values by the matrix operation λ = Hs s.
  - Step 2: Determine the parity vector P0 by adding all the lambda values.
  - Step 3: Obtain the rest of the parity vectors by exploiting the dual-diagonal matrix T.

Related Work (2): Arbitrary Bit-generation and Correction Encoding
- In [1], an alternative encoding for the standard matrices was presented.
- The matrix is modified: the weight-3 column of the parity portion is replaced with zero cyclic shifts.
- H can be sectorized into three sub-matrices:
  - A: the information-bit region
  - Q: the parity-bit region for the bit-flipping operation
  - U: the parity-bit region without bit flipping
[1] C. Yoon, E. Choi, M. Cheong, and S.-K. Lee, "Arbitrary bit generation and correction technique for encoding QC-LDPC codes with dual-diagonal parity structure," IEEE Wireless Communications and Networking Conference (WCNC 2007), pp. 662-666, March 2007.

Related Work (2): Arbitrary Bit-generation and Correction Encoding
- Encoding scheme (one-way derivation)
  - Step 1: Compute the lambda values by the matrix operation λ = A s.
  - Step 2: Set P0 to arbitrary binary values and solve the unknown parity bits.
  - Step 3: Compute the correction vector f from P0.
  - Step 4: Add the correction vector to the parity bits in region Q to correct them.

Related Work (2): Arbitrary Bit-generation and Correction Encoding
- Advantages
  - Low-complexity encoding
  - Fewer additions are required than in the RU scheme.
- Drawbacks
  - Not directly applicable to the standard codes
  - Modifying the matrix decreases code performance.

Better Encoding Scheme
- Advantages of the encoding scheme proposed in [2]
  - Low-complexity encoding
  - Directly applicable to the matrices defined in IEEE standards, without any modification
  - Achieves a higher level of parallelism
[2] C.-Y. Lin, C.-C. Wei, and M.-K. Ku, "Efficient Encoding for Dual-Diagonal Structured LDPC Code Based on Parity bits Prediction and Correction," IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 1648-1651, Dec. 2008.

Better Encoding Scheme
- Step 1: Set P0' to any binary vector.
- Step 2: Compute the lambda values by the matrix operation λ = Hs s.
- Step 3: [Forward derivation]
- Step 4: [Backward derivation]
- Step 5: Compute P0 by adding the prediction parity vectors.
- Step 6: Compute the correction vector f, f = (P0)d.
- Step 7: Correct the prediction parity vectors by adding f.
- Steps 3 and 4 form a two-way derivation, which reduces the encoding delay.
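The seven steps above can be sketched as a small software model. This is a hedged illustration under stated assumptions, not the encoder from [2]: the parity part is assumed to have a weight-3 first column with shift `d` in the top and bottom block rows and shift 0 in block row `x`, followed by a dual diagonal of identities; P0' is fixed to the all-zero vector; `Hs`, `x`, `d` and every function name are hypothetical toy choices.

```python
import random


def rot(v, s):
    """Multiply a z-bit block by the identity cyclically shifted by s."""
    z = len(v)
    return [v[(r + s) % z] for r in range(z)]


def xor(a, b):
    return [p ^ q for p, q in zip(a, b)]


def lambdas(Hs, s_blocks, z):
    """lambda_i = sum over j of P(Hs[i][j]) * s_j in GF(2); -1 is a zero block."""
    out = []
    for row in Hs:
        acc = [0] * z
        for shift, blk in zip(row, s_blocks):
            if shift >= 0:
                acc = xor(acc, rot(blk, shift))
        out.append(acc)
    return out


def encode(Hs, s_blocks, z, x, d):
    """Prediction and correction with P0' = 0; requires 1 <= x <= mb - 2."""
    mb = len(Hs)
    lam = lambdas(Hs, s_blocks, z)
    q = [None] * mb                       # predicted parity blocks P_i'
    q[1] = lam[0]                         # forward derivation: rows 0 .. x-1
    for i in range(1, x):
        q[i + 1] = xor(lam[i], q[i])
    q[mb - 1] = lam[mb - 1]               # backward derivation: rows mb-1 .. x+1
    for i in range(mb - 2, x, -1):
        q[i] = xor(lam[i], q[i + 1])
    p0 = xor(lam[x], xor(q[x], q[x + 1]))  # the middle row gives the true P0
    f = rot(p0, d)                        # correction vector
    return [p0] + [xor(q[i], f) for i in range(1, mb)]


def syndrome(Hs, s_blocks, p, z, x, d):
    """Evaluate every block row of H c^T directly from the assumed structure."""
    mb = len(Hs)
    lam = lambdas(Hs, s_blocks, z)
    rows = []
    for i in range(mb):
        acc = lam[i]
        if i in (0, mb - 1):
            acc = xor(acc, rot(p[0], d))  # shifted entries of the weight-3 column
        if i == x:
            acc = xor(acc, p[0])          # unshifted middle entry
        if i >= 1:
            acc = xor(acc, p[i])          # dual diagonal
        if i + 1 <= mb - 1:
            acc = xor(acc, p[i + 1])
        rows.append(acc)
    return rows


random.seed(0)
z, kb, mb, x, d = 8, 6, 6, 2, 3           # toy sizes, not 802.16e parameters
Hs = [[random.choice([-1, -1] + list(range(z))) for _ in range(kb)]
      for _ in range(mb)]
s_blocks = [[random.randint(0, 1) for _ in range(z)] for _ in range(kb)]
parity = encode(Hs, s_blocks, z, x, d)
check = syndrome(Hs, s_blocks, parity, z, x, d)
```

The forward and backward loops do not depend on each other, so hardware can run them at the same time. `syndrome` rebuilds the parity checks without reusing the encoder recurrences, so an all-zero `check` confirms that the toy codeword is valid.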
LDPC Encoder Architecture
- Based on the encoding scheme proposed before, we design both a parallel and a serial architecture.
- Parallel architecture
  - Achieves a higher level of parallelism
  - High speed
- Serial architecture

Parallel architecture
[Block diagram: input data register, matrix memory, divider, barrel shifters #1 to #6, lambda position, accumulator, parity prediction, correct]

Parallel architecture (Stage 1)
- In this stage, the matrix memory selects the shift values and multiplies them by a specific value according to the code length.
- Benefits:
  1. Encoding can start as soon as input data arrives, without waiting for all of the input data.
  2. The number of barrel shifters is reduced.

Shifter Value Computation
- Equation for computing the shift value
  - Normal code rates: [equation not preserved]
  - Code rate 2/3 A code: [equation not preserved]

Implementation results for the two ways of handling the matrices, with multiple rates and lengths

Design                                  Slices   FFs      LUTs     CLK (MHz)   Total gate count
One matrix + calculate IP               14,179   4,071    26,846   141.391     227,076
Using matrices to save shifter values   41,409   12,078   76,977   165.591     635,691

Parallel architecture (Stage 2)
- Matrix memory: this module stores the matrix.
- Divider: divides the data from the input data. These data are used in the barrel shifters.
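The "One matrix + calculate IP" variant computes shift values instead of storing one matrix per code length. A minimal sketch, assuming the scaling rules of the IEEE 802.16e specification as commonly stated (floor scaling from the base expansion factor 96 for most rates, modulo for the rate 2/3 A code); the function name and flag are hypothetical, and the rules should be checked against the standard before use.

```python
Z0 = 96  # expansion factor the base-matrix shift values are defined for


def scale_shift(p, z, rate_2_3_a=False):
    """Scale a base-matrix entry p to expansion factor z.

    Entries <= 0 (zero block or unshifted identity) are returned unchanged.
    """
    if p <= 0:
        return p
    if rate_2_3_a:
        return p % z            # rate 2/3 A code: modulo rule
    return (p * z) // Z0        # other rates: floor(p * z / z0)


example_normal = scale_shift(94, 24)
example_2_3_a = scale_shift(94, 24, rate_2_3_a=True)
```

Only one multiplication and one division by a constant are needed per entry, which is the trade the table above quantifies against storing every matrix.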
Parallel architecture (Stage 3)
- Barrel shifters: these modules cyclically shift the input data by the shifter values.
- Lambda position: this module records the row position of each shifter value (in the figure example, lambda position = 3, 8, and 11).

Parallel architecture (Stage 4)
- Accumulator: computes the lambda values by accumulating the shifted data; the values are complete after Kb clock cycles.
- According to the lambda position, in this clock cycle (figure example) λ1, λ2, λ5, λ8, λ9, and λ11 need to be accumulated.

Parallel architecture (Stage 5)
- Compute the prediction vectors Pi' by the equation [equation not preserved].

Parallel architecture (Stage 5)

    // forward derivation: prefix XORs of the accumulator outputs
    P_0  <= acc_out0;
    P_1  <= acc_out0 ^ acc_out1;
    P_2  <= acc_out0 ^ acc_out1 ^ acc_out2;
    P_3  <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3;
    P_4  <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4;
    P_5  <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4 ^ acc_out5;
    // backward derivation: suffix XORs of the accumulator outputs
    P_6  <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7 ^ acc_out6;
    P_7  <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7;
    P_8  <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8;
    P_9  <= acc_out11 ^ acc_out10 ^ acc_out9;
    P_10 <= acc_out11 ^ acc_out10;
    P_11 <= acc_out11;

- To save hardware area, we use one architecture to compute the prediction values for the four different code rates.
- In code rate 1/2, P_0 ~ P_11 are the prediction vectors.
- In code rate 2/3, P_0 ~ P_3 and P_8 ~ P_11 are the prediction vectors.
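The Stage 5 assignments are prefix XORs over the upper half of the accumulator outputs and suffix XORs over the lower half. A minimal software model of that logic, assuming each z-bit accumulator output is held as a Python integer; the function name `predict` is hypothetical.

```python
from itertools import accumulate
from operator import xor


def predict(acc_out):
    """Return P_0 .. P_{m-1} for m accumulator outputs (m = 12 for rate 1/2).

    The first half is built top-down (forward), the second half bottom-up
    (backward), mirroring the two groups of assignments on the slide.
    """
    half = len(acc_out) // 2
    forward = list(accumulate(acc_out[:half], xor))
    backward = list(accumulate(reversed(acc_out[half:]), xor))[::-1]
    return forward + backward


# One distinct bit per accumulator makes the XOR pattern easy to read.
acc_out = [1 << i for i in range(12)]
P = predict(acc_out)
```

With this input, bit i of each `P` entry shows whether `acc_out` i contributes to it, and the XOR of the two middle entries covers all twelve accumulator outputs.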
Parallel architecture (Stage 5)
- The same prediction logic (P_0 to P_11 computed from acc_out0 to acc_out11) serves the other code rates.
- In code rate 3/4, P_0 ~ P_2 and P_9 ~ P_11 are the prediction vectors.
- In code rate 5/6, P_0 ~ P_1 and P_10 ~ P_11 are the prediction vectors.

Parallel architecture (Stage 6)
- Step 1: Compute P0. In code rate 1/2, P0 = P5 ^ P6.
- Step 2: Correct the other Pi using the equation Pi = Pi' ^ P0.

Serial architecture (Stage 1)
[Block diagram: input data, input register, control, matrix memory, divider, barrel shifters #1 and #2, accumulator & predict, correct]
- Same as Stage 1 in the parallel architecture.
- In the first Kb clock cycles, the encoding order is from top to middle and from bottom to middle, column by column.

Serial architecture (Stage 1)
- In the last clock cycles, the encoding order is from left to right, row by row.
- Reasons:
  1. Prepare the input data.
  2. Reduce the slice count.

Serial architecture (Stage 2)
- Divider: divides the data from the matrix.
- Choose the corresponding input value for the barrel shifter (taking clock cycle #2 as an example).

Serial architecture (Stage 3)
- Shift the input data according to the shifter value chosen from the Mux.

Serial architecture (Stage 4)
- This module has three jobs:
  1. Compute λi
  2. Compute Pi'
  3. Compute P0
- Normally, this module accumulates the shifted data to compute λi. When the data is the last value in its row, it also computes Pi'.

Serial architecture (Stage 4)
- When all Pi' have been computed, compute P0 by XORing Px' and Px+1', which are the middle prediction vectors in the matrix.

Serial architecture (Stage 5)
- Correct the other Pi using the equation Pi = Pi' ^ P0.

Implementation Results
- The proposed encoder, based on IEEE 802.16e LDPC codes, supports code rates 1/2, 2/3, 3/4, and 5/6 and code lengths ranging from 576 to 2304.
- The hardware implementation was performed and verified on Xilinx Virtex-4 and Altera Stratix Field Programmable Gate Array (FPGA) devices.
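The supported configurations can be enumerated in a few lines. This sketch assumes the IEEE 802.16e parameter set as commonly described (24 block columns, expansion factors from 24 to 96 in steps of 4) and assumes the encoder covers every length in that set; the names are hypothetical.

```python
from fractions import Fraction

RATES = [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(5, 6)]
NB = 24  # block columns per base matrix, so n = 24 * z


def code_parameters():
    """Yield (z, n, rate, k) for every expansion factor and code rate."""
    for z in range(24, 97, 4):
        n = NB * z
        for rate in RATES:
            yield z, n, rate, int(n * rate)


configs = list(code_parameters())
lengths = sorted({n for _, n, _, _ in configs})
```

Every code length is a multiple of 24, so the information length k is an integer for all four rates.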
Implementation Results: parallel architecture
- Resources (identical for every code length and rate): 14,179 slices, 4,071 FFs, 26,846 LUTs, CLK 141.391 MHz

Information throughput (IT) in Gbps

Z    N      Rate 1/2   Rate 2/3   Rate 3/4   Rate 5/6
24   576    2.262      2.468      2.545      2.61
40   960    3.77       4.113      4.241      4.35
60   1440   5.656      6.17       6.363      6.526
80   1920   7.541      8.226      8.483      8.701
96   2304   9.049      9.872      10.18      10.441

- Information throughput ranges from 2.262 to 10.441 Gbps.
- The encoder area is constant for any code rate or code length.
- For a given code rate, an increase in the code length increases the throughput.

Implementation Results: serial architecture
- Information throughput ranges from 0.867 to 4.019 Gbps.
- For a given code rate, an increase in the code length increases the throughput.

Implementation Results: parallel architecture using row by row
[Figures: area comparison; IT comparison; IT/Area comparison]

Compare to Related Work
- We compare our implementation with [3].

Table 4.5a The synthesis result of [22] at code rate 1/2

[3], rate 1/2
Code length   Area (LE)   Clk (MHz)   IT (Gbps)   IT/Total area (Mb per LE)
576           3391        192.23      2.129       0.0612
960           5100        159.57      2.253       0.0648
1440          7012        164.83      2.697       0.0776
1920          8924        148.72      2.644       0.0761
2304          10339       148.41      2.758       0.0793
Total area: 34766 LE

Proposed, rate 1/2 (one encoder for all code lengths: area 20960 LE, Clk 97.58 MHz)
Code length   IT (Gbps)   IT/Total area (Mb per LE)
576           1.561       0.07447
960           2.602       0.12414
1440          3.903       0.18621
1920          5.204       0.24828
2304          6.245       0.29794

- Better throughput for longer code lengths
- Less area is used to implement multiple code lengths and code rates.
- The clock cycle is shorter than in [3].
[3] S. Kopparthi and D. M. Gruenbacher, "Implementation of a flexible encoder for structured low-density parity-check codes," IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim 2007), pp. 438-441, Aug. 2007.
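The IT/Total area columns can be rechecked from the raw numbers. This sketch assumes, as the totals suggest, that [3] needs one encoder per code length (so its total area is the sum of the five areas) while the proposed design uses a single 20960 LE encoder for every length; the variable names are hypothetical.

```python
ref_area_le = [3391, 5100, 7012, 8924, 10339]     # [3], one encoder per length
ref_it_gbps = [2.129, 2.253, 2.697, 2.644, 2.758]
prop_area_le = 20960                               # proposed, shared by all lengths
prop_it_gbps = [1.561, 2.602, 3.903, 5.204, 6.245]

ref_total_area = sum(ref_area_le)


def mb_per_le(it_gbps, area_le):
    """Throughput per logic element, in Mb/s per LE."""
    return it_gbps * 1000 / area_le


# Ratio of proposed to [3] throughput/area, one value per code length.
gains = [
    mb_per_le(p, prop_area_le) / mb_per_le(r, ref_total_area)
    for p, r in zip(prop_it_gbps, ref_it_gbps)
]
```

The smallest and largest entries of `gains` land close to the 1.216 and 3.757 factors quoted for the throughput/area comparison with [3]; small differences come from rounding in the table.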
Compare to Related Work: comparison of throughput
- The proposed encoder outperforms the work in [3] in throughput when the code length is longer than 1200.
- The proposed encoder architecture provides better throughput for longer code lengths, while the work in [3] does not have this kind of speed-up.

Compare to Related Work: comparison of throughput/area ratio
- The proposed encoder outperforms the work in [3] in throughput/area ratio by a factor of 1.216 to 3.757.
- The proposed encoder utilizes hardware resources more efficiently.

Compare to Related Work
- We compare our implementation with [2].

Compare to Related Work: comparison of throughput
- The throughput of our proposed encoder is higher than that of [2] for all code rates and code lengths.
- The proposed encoder outperforms the work in [2] in throughput by a factor of 1.237 to 1.963.

Compare to Related Work: comparison of throughput/area
- The proposed encoder outperforms the work in [2] in throughput/area ratio by a factor of 2.427 to 5.256.
- The result shows that our proposed encoder utilizes hardware resources efficiently.

Compare to Related Work (Serial)
- We compare our implementation with [4].

Design     Slices   FFs     LUTs     Block RAMs   CLK (MHz)   IT (Gbps)
[4]        4,724    1,807   8,335    81           186         3.34
Proposed   12,567   3,885   22,050   0            123.502     4.626

- Our proposed encoder achieves higher IT at a lower clock frequency.
- In our proposed encoder, the matrix information is built in, without additional block RAMs.
- The IT/Area of our serial encoder is 0.3681 Mbps per slice, and the IT/Area of [4] is 0.1768.
[4] J. K. Kim, H. Yoo, and M. H. Lee, "Efficient Encoding Architecture for IEEE 802.16e LDPC Codes," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2008.
Conclusion
- An efficient encoding architecture for IEEE 802.16e LDPC codes with multiple code lengths and code rates is implemented.
- In our design, switching between code rates or code lengths only requires changing the type in the information data.
- This architecture is also suitable for the IEEE 802.11n standard.
- Our encoder achieves higher throughput and a better throughput/area ratio than the conventional encoding scheme when the code length is longer than 1200.

Thank you!!