CHAPTER I
INTRODUCTION

1.1 Introduction to LDPC

Due to their near-Shannon-limit performance and inherently parallelizable decoding scheme, low-density parity-check (LDPC) codes have been extensively investigated in research and practical applications. Recently, LDPC codes have been considered for many industrial standards of next-generation communication systems such as DVB-S2, WLAN (802.11n), WiMAX (802.16e), and 10GBase-T (802.3an). For high-throughput applications, the decoding parallelism is usually very high. Hence, a complex interconnect network is required, which consumes a significant amount of silicon area and power.

A message broadcasting technique was proposed to reduce the routing congestion in a fully parallel LDPC decoder. Because all check nodes and variable nodes are directly mapped to hardware, the implementation cost is very high. Other reported decoders are targeted to specific LDPC codes which have very simple interconnection between check nodes and variable nodes. The constraints placed on the H matrix structure to reduce routing complexity unavoidably limit the performance of the LDPC codes. Another reported LDPC decoder is based on the two-phase message-passing (TPMP) decoding scheme. Recently, the layered decoding approach has been of great interest in LDPC decoder design because it converges much faster than the TPMP decoding approach. A previously reported 4.6 Gb/s LDPC decoder adopted the layered decoding approach. However, it is only suited for array LDPC codes, which can be viewed as a sub-class of LDPC codes. It should be noted that a shuffled iterative decoding algorithm based on vertical partitioning of the parity-check matrix can also speed up LDPC decoding in principle. In practice, LDPC codes have attracted considerable attention due to their excellent error correction performance and the regularity in their parity-check matrices, which is well suited for VLSI implementation.
In this paper, we present a high-throughput, low-cost layered decoding architecture for generic QC-LDPC codes. A row permutation approach is proposed to significantly reduce the implementation complexity of the shuffle network in the LDPC decoder. An approximate layered decoding approach is explored to increase the clock speed and hence the decoding throughput. An efficient implementation technique based on the Min-Sum algorithm is employed to minimize the hardware complexity. The computation core is further optimized to reduce the computation delay.

Low-density parity-check (LDPC) codes were invented by R. G. Gallager (Gallager 1963; Gallager 1962) in 1962. He discovered an iterative decoding algorithm which he applied to a new class of codes. He named these codes low-density parity-check (LDPC) codes since the parity-check matrices had to be sparse to perform well. Yet LDPC codes were ignored for a long time, mainly because of the high computational complexity required when very long codes are considered. In 1993, C. Berrou et al. invented the turbo codes (Berrou, Glavieux, and Thitimajshima 1993) and their associated iterative decoding algorithm. The remarkable performance observed with the turbo codes raised many questions and much interest in iterative techniques. In 1995, D. J. C. MacKay and R. M. Neal (MacKay and Neal 1995; MacKay and Neal 1996; MacKay 1999) rediscovered the LDPC codes and established a link between their iterative algorithm and Pearl's belief propagation algorithm (Pearl 1988) from the artificial intelligence community (Bayesian networks). At the same time, M. Sipser and D. A. Spielman (Sipser and Spielman 1996) used the first decoding algorithm of R. G. Gallager (algorithm A) to decode expander codes.

1.2 Objectives:

The objective of this project is a high-throughput decoder architecture for generic quasi-cyclic low-density parity-check (QC-LDPC) codes. Various optimizations are employed to increase the clock speed.
A row permutation scheme is proposed to significantly simplify the implementation of the shuffle network in the LDPC decoder. An approximate layered decoding approach is explored to reduce the critical path of the layered LDPC decoder. The computation core is further optimized to reduce the computation delay. It is estimated that a decoding throughput of 4.7 Gb/s can be achieved at 15 iterations using current technology.

Low-density parity-check (LDPC) codes, a class of linear block codes with capacity-approaching performance, were first invented by Gallager in 1962 and rediscovered by MacKay in 1996. LDPC codes show excellent error correction capability even in low signal-to-noise ratio applications. Also, the inner independence of the parity-check matrix enables parallel decoding and thus makes high-speed LDPC decoders possible. Hence, LDPC codes have been adopted in many recent wire-line and wireless communication standards such as IEEE 802.11n, DVB-S2 and IEEE 802.16e (WiMAX).

LDPC codes can be effectively decoded by the standard belief propagation (BP) algorithm, which is also called the sum-product algorithm (SPA). The min-sum algorithm (MSA) was later introduced to reduce the computational complexity of the check node processing in SPA, which makes it suitable for VLSI implementation.

VLSI implementation of LDPC decoders has attracted the attention of researchers in the past few years, covering both fully parallel and partly parallel architectures. A fully parallel architecture directly maps the standard BP algorithm into hardware by hard-wiring the connections between check nodes and variable nodes. However, the interconnections become more complex as the block length increases, which leads to large chip area and power consumption. A partly parallel architecture can effectively balance hardware complexity and system throughput by employing architecture-aware LDPC (AA-LDPC) codes that have a regularly constructed parity-check matrix.
However, decoder complexity is still a great challenge for LDPC codes that have an irregular parity-check matrix, such as the codes used in the IEEE 802.16e standard for WiMAX systems. The decoding throughput for irregular LDPC codes decreases because the irregular parity-check matrix destroys the inherent parallelism in partly parallel decoding architectures.

The layered decoding algorithm (LDA), with either horizontal or vertical partitioning, uses the newest data from the current iteration rather than data from the previous iteration and thus can double the convergence speed. Conventional LDA processes messages serially, from the first layer to the last, which limits the decoding throughput. Grouped layered decoding can improve the throughput but employs more hardware resources.

In this paper, we introduce a new parallel layered decoding architecture (PLDA) that enables different layers to operate concurrently. Precisely scheduled message passing paths among different layers guarantee that newly calculated messages are delivered to their designated locations before they are used by the next layer. By adding offsets to the permutation values of the submatrices in the base parity-check matrix, the time intervals among different layers become large enough for message passing. In PLDA, the decoding latency per iteration can be reduced greatly and hence the decoding throughput is improved.

The remainder of this paper is organized as follows. Section II introduces the code structure used in WiMAX, MSA and LDA. The corresponding hardware implementation of PLDA and the message passing network are presented in Section III. Section IV shows experimental results of the proposed decoder, including FPGA implementation results, ASIC implementation results and comparisons with existing WiMAX LDPC decoders.
CHAPTER II

2.1 Turbo codes

Turbo coding is an iterated soft-decoding scheme that combines two or more relatively simple convolutional codes and an interleaver to produce a block code that can perform to within a fraction of a decibel of the Shannon limit. Turbo codes predate LDPC codes in terms of practical application, and the two now provide similar performance.

One of the earliest commercial applications of turbo coding was the CDMA2000 1x (TIA IS-2000) digital cellular technology developed by Qualcomm and sold by Verizon Wireless, Sprint, and other carriers. It is also used for the evolution of CDMA2000 1x specifically for Internet access, 1xEV-DO (TIA IS-856). Like 1x, EV-DO was developed by Qualcomm and is sold by Verizon Wireless, Sprint, and other carriers (Verizon's marketing name for 1xEV-DO is Broadband Access; Sprint's consumer and business marketing names for 1xEV-DO are Power Vision and Mobile Broadband, respectively).

2.1.1 Characteristics of Turbo Codes

1) Turbo codes have extraordinary performance at low SNR.
   a) Very close to the Shannon limit.
   b) Due to a low multiplicity of low-weight code words.
2) However, turbo codes have a BER "floor".
   - This is due to their low minimum distance.
3) Performance improves for larger block sizes.
   a) Larger block sizes mean more latency (delay).
   b) However, larger block sizes are not more complex to decode.
   c) The BER floor is lower for larger frame/interleaver sizes.
4) The complexity of a constraint length K_TC turbo code is the same as that of a K = K_CC convolutional code, where:

\[ K_{CC} \approx 2 + K_{TC} + \log_2(\text{number of decoder iterations}) \]

2.2 Performance of Error Correcting Codes

Error correcting codes are compared with each other by referring to their gap to the Shannon limit, as mentioned in section 1.1.1. This section aims at defining exactly what the Shannon limit is, and what is actually measured when the gap to the Shannon bound is referred to.
It is important to know exactly what is measured, since many "near Shannon limit" codes have now been discovered. The results hereafter are classical in information theory and may be found in many references. The first part, however, is inspired by the work of (Schlegel 1997).

Forward Error Correction (FEC) is an important feature of most modern communication systems, including wired and wireless systems. Communication systems use a variety of FEC coding techniques to permit correction of bit errors in transmitted symbols.

2.2.1 Forward error correction

In telecommunication and information theory, forward error correction (FEC) (also called channel coding) is a system of error control for data transmission, whereby the sender adds carefully selected redundant data, also known as an error-correcting code, to its messages. This allows the receiver to detect and correct errors (within some bound) without the need to ask the sender for additional data. The advantages of forward error correction are that a back-channel is not required and retransmission of data can often be avoided (at the cost of higher bandwidth requirements, on average). FEC is therefore applied in situations where retransmissions are relatively costly or impossible. In particular, FEC information is usually added to most mass storage devices to protect against damage to the stored data.

FEC processing often occurs in the early stages of digital processing after a signal is first received. That is, FEC circuits are often an integral part of the analog-to-digital conversion process, which also involves digital modulation and demodulation, or line coding and decoding. Many FEC coders can also generate a bit-error rate (BER) signal which can be used as feedback to fine-tune the analog receiving electronics. Soft-decision algorithms, such as the Viterbi decoder, can take (quasi-)analog data in and generate digital data on output.
The maximum fraction of errors that can be corrected is determined in advance by the design of the code, so different forward error correcting codes are suitable for different conditions.

How it works:

FEC is accomplished by adding redundancy to the transmitted information using a predetermined algorithm. Each redundant bit is invariably a complex function of many original information bits. The original information may or may not appear in the encoded output; codes that include the unmodified input in the output are systematic, while those that do not are nonsystematic.

An extremely simple example would be an analog-to-digital converter that samples three bits of signal strength data for every bit of transmitted data. If the three samples are mostly zero, the transmitted bit was probably a zero, and if the three samples are mostly one, the transmitted bit was probably a one. The simplest example of error correction is for the receiver to assume the correct output is given by the most frequently occurring value in each group of three.

Triplet received    Interpreted as
000                 0
001                 0
010                 0
100                 0
111                 1
110                 1
101                 1
011                 1

This allows an error in any one of the three samples to be corrected by "democratic voting". This is a highly inefficient FEC, but it does illustrate the principle. In practice, FEC codes typically examine the last several dozen, or even the last several hundred, previously received bits to determine how to decode the current small handful of bits (typically in groups of 2 to 8 bits). Such triple modular redundancy, the simplest form of forward error correction, is widely used.

Averaging noise to reduce errors:

FEC could be said to work by "averaging noise": since each data bit affects many transmitted symbols, the corruption of some symbols by noise usually allows the original user data to be extracted from the other, uncorrupted received symbols that also depend on the same user data.
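The three-sample voting scheme described above can be sketched in a few lines of Python. This is only an illustration of the principle: the function names `encode_repetition` and `decode_majority` are hypothetical and do not come from the original text, and the samples are assumed to be hard 0/1 decisions.

```python
def encode_repetition(bits, n=3):
    """Repeat every data bit n times (n = 3 gives the triplets in the table)."""
    return [b for b in bits for _ in range(n)]


def decode_majority(received, n=3):
    """Decode each group of n received samples by majority vote."""
    if len(received) % n != 0:
        raise ValueError("received length must be a multiple of n")
    decoded = []
    for start in range(0, len(received), n):
        group = received[start:start + n]
        # More ones than zeros in the group -> decide 1, otherwise decide 0.
        decoded.append(1 if sum(group) > n // 2 else 0)
    return decoded


data = [1, 0, 1, 1]
sent = encode_repetition(data)

# Flip one sample in the first triplet and one in the last triplet.
corrupted = list(sent)
corrupted[1] ^= 1
corrupted[9] ^= 1

recovered = decode_majority(corrupted)
```

One error per triplet is corrected, but two errors in the same triplet flip the decision, which is why this code is inefficient for its rate of 1/3.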
Because of this "risk-pooling" effect, digital communication systems that use FEC tend to work well above a certain minimum signal-to-noise ratio and not at all below it. This all-or-nothing tendency (the cliff effect) becomes more pronounced as stronger codes are used that more closely approach the theoretical Shannon limit.

Interleaving FEC coded data can reduce the all-or-nothing properties of transmitted FEC codes. However, this method has limits; it is best used on narrowband data.

Most telecommunication systems use a fixed channel code designed to tolerate the expected worst-case bit error rate, and then fail to work at all if the bit error rate is ever worse. However, some systems adapt to the given channel error conditions: hybrid automatic repeat-request uses a fixed FEC method as long as the FEC can handle the error rate, then switches to ARQ when the error rate gets too high; adaptive modulation and coding uses a variety of FEC rates, adding more error-correction bits per packet when there are higher error rates in the channel, or taking them out when they are not needed.

Types of FEC:

The two main categories of FEC codes are block codes and convolutional codes. Block codes work on fixed-size blocks (packets) of bits or symbols of predetermined size. Practical block codes can generally be decoded in time polynomial in their block length. Convolutional codes work on bit or symbol streams of arbitrary length. They are most often decoded with the Viterbi algorithm, though other algorithms are sometimes used. Viterbi decoding allows asymptotically optimal decoding efficiency with increasing constraint length of the convolutional code, but at the expense of exponentially increasing complexity. A convolutional code can be turned into a block code, if desired.
There are many types of block codes, but among the classical ones the most notable is Reed-Solomon coding because of its widespread use on the Compact Disc, the DVD, and in hard disk drives. Golay, BCH, multidimensional parity, and Hamming codes are other examples of classical block codes.

Hamming ECC is commonly used to correct NAND flash memory errors. This provides single-bit error correction and 2-bit error detection. Hamming codes are only suitable for the more reliable single-level cell (SLC) NAND. Denser multi-level cell (MLC) NAND requires stronger multi-bit correcting ECC such as BCH or Reed-Solomon.

Classical block codes are usually implemented using hard-decision algorithms, which means that for every input and output signal a hard decision is made whether it corresponds to a one or a zero bit. In contrast, soft-decision algorithms like the Viterbi decoder process (discretized) analog signals, which allows for much higher error-correction performance than hard-decision decoding. Nearly all classical block codes apply the algebraic properties of finite fields.

Concatenated FEC codes for improved performance:

Classical (algebraic) block codes and convolutional codes are frequently combined in concatenated coding schemes in which a short constraint-length Viterbi-decoded convolutional code does most of the work and a block code (usually Reed-Solomon) with larger symbol size and block length "mops up" any errors made by the convolutional decoder. Concatenated codes have been standard practice in satellite and deep space communications since Voyager 2 first used the technique in its 1986 encounter with Uranus.

Low-density parity-check (LDPC):

Low-density parity-check (LDPC) codes are a class of recently rediscovered, highly efficient linear block codes. They can provide performance very close to the channel capacity (the theoretical maximum) using an iterated soft-decision decoding approach, at linear time complexity in terms of their block length.
Practical implementations can draw heavily on parallelism. LDPC codes were first introduced by Robert G. Gallager in his PhD thesis in 1960, but due to the computational effort of implementing the encoder and decoder and the introduction of Reed-Solomon codes, they were mostly ignored until recently. LDPC codes are now used in many recent high-speed communication standards, such as DVB-S2 (digital video broadcasting), WiMAX (IEEE 802.16e standard for microwave communications), high-speed wireless LAN (IEEE 802.11n), 10GBase-T Ethernet (802.3an) and G.hn/G.9960 (ITU-T standard for networking over power lines, phone lines and coaxial cable).

Channel Capacity:

Stated by Claude Shannon in 1948, the channel coding theorem describes the maximum possible efficiency of error-correcting methods versus levels of noise interference and data corruption. The theory does not describe how to construct the error-correcting method; it only tells us how good the best possible method can be. Shannon's theorem has wide-ranging applications in both communications and data storage, and it is of foundational importance to the modern field of information theory. Shannon only gave an outline of the proof. The first rigorous proof is due to Amiel Feinstein in 1954.

The Shannon theorem states that, given a noisy channel with channel capacity C and information transmitted at a rate R, if R < C there exist codes that allow the probability of error at the receiver to be made arbitrarily small. This means that, theoretically, it is possible to transmit information nearly without error at any rate below a limiting rate, C.

The converse is also important. If R > C, an arbitrarily small probability of error is not achievable. All codes will have a probability of error greater than a certain positive minimal level, and this level increases as the rate increases. So, information cannot be guaranteed to be transmitted reliably across a channel at rates beyond the channel capacity.
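To make the comparison between rate R and capacity C concrete, the following sketch evaluates the capacity of a binary symmetric channel (BSC). The BSC is an assumption chosen here only because its capacity has the simple closed form C = 1 - H2(p); the original text does not single out any particular channel, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy function H2(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bsc_capacity(crossover):
    """Capacity in bits per channel use of a BSC with the given crossover probability."""
    return 1.0 - binary_entropy(crossover)


def reliable_transmission_possible(rate, capacity):
    """Shannon's condition R < C for arbitrarily small error probability."""
    return rate < capacity


capacity = bsc_capacity(0.11)  # a fairly noisy channel
rate_one_third_ok = reliable_transmission_possible(1.0 / 3.0, capacity)
rate_nine_tenths_ok = reliable_transmission_possible(0.9, capacity)
```

With a crossover probability of 0.11 the capacity is close to one half, so a rate-1/3 code is below capacity while a rate-0.9 code is not.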
The theorem does not address the rare situation in which rate and capacity are equal.

Simple schemes such as "send the message 3 times and use a best 2 out of 3 voting scheme if the copies differ" are inefficient error-correction methods, unable to asymptotically guarantee that a block of data can be communicated free of error. Advanced techniques such as Reed-Solomon codes and, more recently, turbo codes come much closer to reaching the theoretical Shannon limit, but at a cost of high computational complexity. Using low-density parity-check (LDPC) codes or turbo codes, and with the computing power in today's digital signal processors, it is now possible to reach very close to the Shannon limit. In fact, it was shown that LDPC codes can reach within 0.0045 dB of the Shannon limit (for very long block lengths).

Mathematical statement:

Theorem (Shannon, 1948):

1. For every discrete memoryless channel, the channel capacity

\[ C = \max_{P_X} I(X;Y) \]

has the following property. For any ε > 0 and R < C, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε.

2. If a probability of bit error p_b is acceptable, rates up to R(p_b) are achievable, where

\[ R(p_b) = \frac{C}{1 - H_2(p_b)} \]

and H_2(p_b) is the binary entropy function

\[ H_2(p_b) = -\left[ p_b \log_2 p_b + (1 - p_b) \log_2 (1 - p_b) \right] \]

3. For any p_b, rates greater than R(p_b) are not achievable.

(MacKay (2003), p. 162; cf. Gallager (1968), ch. 5; Cover and Thomas (1991), p. 198; Shannon (1948) thm. 11)

Error Correction in Communication Systems

[Block diagram: binary information, encoder (redundancy added), encoded information, channel noise, noisy information, decoder (error detection and correction), corrected information]

Error correction is widely used in most communication systems.
Figure 1: Error Correction in Communication systems

2.3 Row Permutation of Parity Check Matrix of LDPC Codes

The parity check matrix of an LDPC code is an array of circulant submatrices. To achieve very high decoding throughput, an array of cyclic shifters is needed to shuffle the soft messages corresponding to multiple submatrices for check nodes and variable nodes. In order to reduce the VLSI implementation complexity of the shuffle network, the shifting structure in circulant matrices is extensively exploited. Suppose the parity check matrix H of an LDPC code is a J×C array of P×P circulant submatrices. With row permutation, it can be converted to a form as shown in Fig. 2.

Figure 2: Array of circulant sub matrices

Figure 3: Permuted Matrix

In this form, the P×P permutation matrix represents a single left or right cyclic shift, and each successive submatrix can be obtained by cyclically shifting the previous one by a single step. Ai is a J×P matrix determined by the shift offsets of the circulant submatrices in block column i (i = 1, 2, ..., C), and m is an integer such that P can be divided by m.

For example, the matrix Ha shown in Fig. 1 is a 2×3 array of 8×8 cyclically shifted identity submatrices. With the row permutation described in the following, a new matrix Hb can be obtained, which has the form shown in (1). First, the first four rows of the first block row of Ha are distributed to four block rows of Hb in a round-robin fashion (i.e., rows 1-4 of Ha are distributed to rows 1, 5, 9, and 13 of Hb). Then the second four rows are distributed in the same way. The permutation is continued until all rows in the first block row of matrix Ha are moved to matrix Hb. Then the second block row of Ha is distributed in the same way. It can be seen from Fig. 2 that Hb has the form shown in (1). In the previous example, the row distribution is started from the first row of each block row. In general, the distribution can be started from any row of a block row.
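The round-robin row distribution in the 2×3, 8×8 example can be sketched as follows. The shift offsets in `SHIFTS` are hypothetical (the original figure is not available), and the index formula in `permute_rows` is one reading of the description, assuming that each later group of four rows fills the next free row of every block row of Hb. The sketch then checks the property the text relies on: each block row of Hb equals the previous one with every 8-wide column block cyclically shifted by a single step.

```python
P = 8          # size of each circulant submatrix (from the example)
M = 4          # number of block rows of Hb (rows are dealt out in groups of 4)
SHIFTS = [[0, 1, 2],   # hypothetical shift offsets of the 2x3 block array Ha
          [3, 5, 6]]


def build_qc_matrix(shifts, p):
    """Expand a table of shift offsets into a 0/1 matrix of shifted identities."""
    rows = []
    for block_row in shifts:
        for t in range(p):
            row = []
            for s in block_row:
                row.extend(1 if col == (t + s) % p else 0 for col in range(p))
            rows.append(row)
    return rows


def permute_rows(h, m):
    """Deal the rows of h round-robin into m block rows.

    Row r (0-based) moves to row (r % m) * (len(h) // m) + r // m,
    so rows 0..3 land on rows 0, 4, 8, 12 when m = 4 and len(h) = 16.
    """
    n_rows = len(h)
    if n_rows % m != 0:
        raise ValueError("number of rows must be divisible by m")
    out = [None] * n_rows
    for r, row in enumerate(h):
        out[(r % m) * (n_rows // m) + r // m] = row
    return out


def shift_blocks(row, p):
    """Cyclically shift every p-wide column block of a row right by one position."""
    return [row[b + (c - 1) % p] for b in range(0, len(row), p) for c in range(p)]


Ha = build_qc_matrix(SHIFTS, P)
Hb = permute_rows(Ha, M)

rows_per_block = len(Hb) // M
block_rows = [Hb[k * rows_per_block:(k + 1) * rows_per_block] for k in range(M)]

# Each block row of Hb is the previous block row shifted by a single step.
single_step_structure = all(
    [shift_blocks(r, P) for r in block_rows[k]] == block_rows[k + 1]
    for k in range(M - 1)
)
```

Because consecutive block rows differ only by a one-step cyclic shift, a decoder that processes one block row at a time only needs fixed single-step shifters rather than a general shuffle network, which is the saving the section describes.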
To minimize the data dependency between two adjacent block rows, an optimum row distribution scheme is desired. For an LDPC decoder which can process all messages corresponding to the 1-components in an entire block row of matrix Hp (e.g., Hb in Fig. 2), the shuffle network for LDPC decoding can be implemented with very simple data shifters.

Message Passing

[Figure: decoding flow on the parity check matrix H below: initial value (received information from channel), row processing (α), column processing (β), error correction, parity check]

\[
H = \begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]

Row processing (Min-Sum):

\[ \alpha_{ij} = \prod_{j' \ne j,\; h_{ij'} = 1} \mathrm{sign}(\beta_{ij'}) \cdot \min_{j' \ne j,\; h_{ij'} = 1} |\beta_{ij'}| \]

Column processing:

\[ \beta_{ij} = \lambda_j + \sum_{i' \ne i,\; h_{i'j} = 1} \alpha_{i'j} \]

where λj is the received information.

Error correction (hard decision):

\[ y_j = \lambda_j + \sum_{i:\; h_{ij} = 1} \alpha_{ij}, \qquad \hat{v}_j = \begin{cases} 1 & \text{if } y_j < 0 \\ 0 & \text{if } y_j \ge 0 \end{cases} \]

Parity check:

\[ H \hat{v}^{T} = 0 \;\text{(stop decoding)}, \qquad H \hat{v}^{T} \ne 0 \;\text{(repeat decoding)} \]

LDPC Codes:

An LDPC code is defined by a binary matrix called the parity check matrix H. Rows define parity check equations (constraints) between encoded symbols in a code word, and the number of columns defines the length of the code. V is a valid code word if H·V^T = 0. The decoder in the receiver checks whether the condition H·V^T = 0 holds.
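The row processing, column processing, hard decision and parity check steps above can be put together in a small floating-point sketch using the 6×9 matrix H from the message-passing illustration. This is a plain flooding-schedule Min-Sum decoder written for clarity, not the layered hardware architecture discussed in this report. The sign convention (a positive value means bit 0), the function names and the example input values are assumptions.

```python
H = [
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
]


def syndrome(h, v):
    """Return H * v^T over GF(2); all zeros means every parity check is satisfied."""
    return [sum(hij * vj for hij, vj in zip(row, v)) % 2 for row in h]


def min_sum_decode(h, llr, max_iter=20):
    """Min-Sum decoding; llr[j] > 0 is assumed to favour bit 0.

    Every row of h is assumed to contain at least two ones.
    """
    m, n = len(h), len(h[0])
    # beta: variable-to-check messages, initialised with the channel values.
    beta = [[llr[j] if h[i][j] else 0.0 for j in range(n)] for i in range(m)]
    alpha = [[0.0] * n for _ in range(m)]
    v_hat = [1 if x < 0 else 0 for x in llr]
    for _ in range(max_iter):
        if not any(syndrome(h, v_hat)):
            break  # parity check passed: stop decoding
        # Row processing: sign product and minimum magnitude of the other inputs.
        for i in range(m):
            cols = [j for j in range(n) if h[i][j]]
            for j in cols:
                others = [beta[i][k] for k in cols if k != j]
                sign = -1.0 if sum(b < 0 for b in others) % 2 else 1.0
                alpha[i][j] = sign * min(abs(b) for b in others)
        # Column processing and hard decision.
        for j in range(n):
            rows = [i for i in range(m) if h[i][j]]
            total = llr[j] + sum(alpha[i][j] for i in rows)
            for i in rows:
                beta[i][j] = total - alpha[i][j]  # exclude the check's own message
            v_hat[j] = 1 if total < 0 else 0
    return v_hat


# All-zero codeword with one unreliable, wrongly signed sample at position 3.
received = [2.0, 2.0, -1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
decoded = min_sum_decode(H, received)
```

In this example the two checks that involve the third bit both report that it should be 0, which outweighs the weak negative channel value, so a single iteration restores the all-zero codeword.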
Example: Parity check matrix for a (9, 4) LDPC code (six parity checks, of which five are linearly independent), row weight = 3, column weight = 2:

\[
H \cdot v^{T} =
\begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \\ v_7 \\ v_8 \\ v_9 \end{bmatrix}
\quad
\begin{cases}
\ne 0 & \text{(there is an error)} \\
= 0 & \text{(there is no error)}
\end{cases}
\]

CHAPTER III
CHANNEL CODING

This chapter introduces the channel code decoding issue and the problem of optimal code decoding in the case of linear block codes. First, the main notations used in the thesis are presented, especially those related to the graph representation of linear block codes. Then optimal decoding is discussed: it is shown that, under the cycle-free hypothesis, optimal decoding can be carried out using an iterative algorithm. Finally, the performance of error correcting codes is also discussed.

3.1 Optimal decoding:

Communication over noisy channels can be improved by the use of a channel code C, as demonstrated by C. E. Shannon in his famous channel coding theorem:

"Let a discrete channel have the capacity C and a discrete source the entropy per second H. If H ≤ C there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors. If H > C it is possible to encode the source so that the equivocation is less than H."

This theorem states that below a maximum rate R, which is equal to the capacity of the channel, it is possible to find error correction codes that achieve any given probability of error. Since the theorem does not explain how to construct such a code, it was the starting point for a great deal of activity in the coding theory community. When Shannon announced his theory in the July and October issues of the Bell System Technical Journal in 1948, the largest communications cable in operation at that time carried 1800 voice conversations. Twenty-five years later, the highest capacity cable was carrying 230000 simultaneous conversations.
Today a single optical fiber as thin as a human hair can carry more than 6.4 million conversations. In the quest for capacity-achieving codes, the performance of a code is measured by its gap to capacity. For a given code, the smallest gap is obtained by an optimal decoder: the maximum a-posteriori (MAP) decoder. Before dealing with optimal decoding, some notations within a model of the communication scheme are presented hereafter.

Figure 4: Basic scheme for channel code encoding/decoding.

3.1.1 Communication model

Figure 4 depicts a classical communication scheme. The source block delivers information by means of sequences which are row vectors x of length K. The encoder block delivers the codeword c of length N, which is the coded version of x. The code rate is defined by the ratio R = K/N. The codeword c is sent over the channel and the vector y is the received word: a distorted version of c. The matched filters, the modulator and the demodulator, and the synchronization are assumed to work perfectly. Hence, the channel is represented by a discrete-time equivalent model.

The channel is a non-deterministic mapper between its input c and its output y. We assume that y depends on c via a conditional probability density function (pdf) p(y|c). We also assume that the channel is memoryless:

\[ p(y|c) = \prod_{n=1}^{N} p(y_n | c_n) \]

An example is the binary-input additive white Gaussian noise (BI-AWGN) channel with binary phase-shift keying (BPSK) modulation.

In Figure 4, two types of decoder are depicted: decoders of type 1 have to compute the best estimate x̂ of the source word x; decoders of type 2 compute the best estimate ĉ of the sent codeword c. In this case, x̂ is extracted from ĉ by post-processing (the reverse of the encoding process) when the code is non-systematic. Both decoders can perform two types of decoding:

3.2 Classes of LDPC codes:

R.
Gallager defined an (N, j, k) LDPC code as a block code of length N having a small fixed number (j) of ones in each column of the parity check matrix H, and a small fixed number (k) of ones in each row of H. This class of codes is then decoded by the iterative algorithm described in chapter 1. This algorithm computes exact a posteriori probabilities, provided that the Tanner graph of the code is cycle-free. Generally, LDPC codes do have cycles. The sparseness of the parity check matrix aims at reducing the number of cycles and at increasing their size. Moreover, as the length N of the code increases, the cycle-free hypothesis becomes more and more realistic. The iterative algorithm is run on these graphs. Although it is not optimal, it performs quite well. Since then, the class of LDPC codes has been enlarged to all sparse parity check matrices, thus creating a very wide class of codes, including the extension to codes over GF(q) and irregular LDPC codes.

Irregularity:

In Gallager's original LDPC code design, there is a fixed number of ones in both the rows (k) and the columns (j) of the parity check matrix: each bit is involved in j parity check constraints and each parity check constraint is the exclusive-OR (XOR) of k bits. This class of codes is referred to as regular LDPC codes. In contrast, irregular LDPC codes do not have a constant number of non-zero entries in the rows or in the columns of H. They are specified by the degree distributions of the bits, λ(x), and of the parity check constraints, ρ(x), where:

\[ \lambda(x) = \sum_{i=2}^{d_v} \lambda_i x^{i-1}, \qquad \rho(x) = \sum_{i=1}^{d_c} \rho_i x^{i-1} \]

and λi (respectively ρi) is the proportion of non-zero entries of H that belong to columns (respectively rows) of weight i. Similarly, denoting by ρ̃i the proportion of rows having weight i:

\[ \tilde{\rho}_i = \frac{\rho_i / i}{\sum_{j} \rho_j / j} \]

Code rate:

The rate R of LDPC codes is defined by

\[ R = \frac{K}{N} \ge R_d, \qquad R_d = \frac{N - M}{N} \]

where M is the number of rows of H and Rd is the design code rate. Rd = R if the parity check matrix has full rank. It has been shown that as N increases, the parity-check matrix is almost sure to be full rank. Hereafter, we will assume that R = Rd unless the contrary is mentioned.
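The difference between the design rate Rd and the actual rate R can be checked numerically on the small 6×9 parity check matrix used in the message-passing illustration of Section 2.3. The sketch below computes the rank of H over GF(2) by Gaussian elimination; the function name `gf2_rank` is hypothetical, and the actual rate is taken as (N - rank(H)) / N.

```python
from fractions import Fraction

H = [
    [0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
]


def gf2_rank(matrix):
    """Rank of a 0/1 matrix over GF(2), with each row packed into an integer."""
    n_cols = len(matrix[0])
    rows = [int("".join(str(bit) for bit in row), 2) for row in matrix]
    rank = 0
    for bit in reversed(range(n_cols)):
        mask = 1 << bit
        pivot = next((i for i in range(rank, len(rows)) if rows[i] & mask), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i] & mask:
                rows[i] ^= rows[rank]  # XOR is row addition over GF(2)
        rank += 1
    return rank


M, N = len(H), len(H[0])
rank = gf2_rank(H)
design_rate = Fraction(N - M, N)      # Rd, assumes H has full rank
actual_rate = Fraction(N - rank, N)   # R, uses the true rank of H
```

Here every column has weight 2, so the six rows add up to the all-zero vector and H cannot have full rank; the actual rate is therefore slightly higher than the design rate.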
The rate R is then linked to the other parameters of the class by

\[ R = 1 - \frac{M}{N} = 1 - \frac{\sum_i \rho_i / i}{\sum_i \lambda_i / i} \]

which reduces to R = 1 - j/k for a regular (N, j, k) code. Note that in general, for random constructions, when j is odd:

\[ R = \frac{N - M}{N} \]

and when j is even:

\[ R = \frac{N - M + 1}{N} \]

3.2.1 Optimization of LDPC codes:

The bounds and performance of LDPC codes are derived from their parameter set. The large number of independent parameters makes it possible to tune them so as to fit some external constraint, such as a particular channel. Two algorithms can be used to design a class of irregular LDPC codes under some channel constraints: the density evolution algorithm and the extrinsic information transfer (EXIT) charts.

Density evolution algorithm:

Richardson designed capacity-approaching irregular codes with the density evolution (DE) algorithm. This algorithm tracks the probability density function (pdf) of the messages through the graph nodes under the assumption that the cycle-free hypothesis holds. It is a kind of belief propagation algorithm with pdf messages instead of log-likelihood ratio messages. Density evolution describes the asymptotic performance of the class of LDPC codes: an infinite number of iterations is run on an infinite-length LDPC code, and if the length of the code tends to infinity, the probability that a randomly chosen node belongs to a cycle of a given length tends towards zero.

Usually, either the channel threshold or the code rate is optimized under the constraints of the degree distributions and of the SNR. The threshold of the channel is the value of the channel parameter above which the probability of error tends towards zero if the number of iterations (and the code length) is infinite. Optimization tries to lower the threshold or to raise the rate as much as possible. For example, a rate-1/2 irregular LDPC code has been designed for binary-input AWGN channels that approaches the Shannon limit very closely (up to 0.0045 dB).
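The idea of a threshold can be illustrated with the simplest case of density evolution, which is not the AWGN case discussed above: on the binary erasure channel (BEC) the message density reduces to a single number, the erasure probability, and for a regular (dv, dc) code the recursion is x ← ε·(1 - (1 - x)^(dc-1))^(dv-1). The sketch below uses this swapped-in BEC recursion and finds the largest erasure probability ε for which x is driven to zero; the function names, the iteration limit and the tolerance are assumptions. Note that for the BEC the channel parameter is a noise level, so decoding succeeds below the threshold.

```python
def de_converges(eps, dv=3, dc=6, max_iter=5000, tol=1e-9):
    """Run the BEC density evolution recursion for a regular (dv, dc) code.

    Returns True if the erasure probability of the messages falls below tol.
    """
    x = eps
    for _ in range(max_iter):
        x = eps * (1.0 - (1.0 - x) ** (dc - 1)) ** (dv - 1)
        if x < tol:
            return True
    return False


def bec_threshold(dv=3, dc=6, steps=30):
    """Bisect for the largest channel erasure probability that still converges."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if de_converges(mid, dv, dc):
            lo = mid
        else:
            hi = mid
    return lo


threshold_3_6 = bec_threshold(3, 6)
```

For the regular (3, 6) ensemble the result is close to the known value of about 0.429, compared with the capacity limit of 0.5 for a rate-1/2 code. Optimizing irregular degree distributions, as described in the text, aims at closing this kind of gap.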
Optimizations based on the DE algorithm are often carried out by means of the differential evolution algorithm when they are non-linear, for example when optimizing an irregular LDPC code for uncorrelated flat Rayleigh fading channels. The Gaussian approximation in the DE algorithm can also be used: the probability density functions of the messages are assumed to be Gaussian, and the only parameter that has to be tracked in the nodes is the mean.

EXIT chart: Extrinsic information transfer (EXIT) charts are 2D graphs on which the mutual information transfers through the 2 constituent codes of a turbo code are superposed. EXIT charts have been transposed to LDPC code optimization.

3.2.2 Regular vs. Irregular LDPC codes:
• An LDPC code is regular if the rows and columns of H have uniform weight, i.e. all columns have the same number of ones (dv) and all rows have the same number of ones (dc)
– The codes of Gallager and MacKay were regular (or as close as possible)
– Although regular codes have impressive performance, they are still about 1 dB from capacity and generally perform worse than turbo codes
• An LDPC code is irregular if the rows and columns have non-uniform weight
– Irregular LDPC codes tend to outperform turbo codes for block lengths of about n > 10^5
• The degree distribution pair (λ, ρ) for an LDPC code is defined as
$\lambda(x) = \sum_{i=2}^{d_v} \lambda_i x^{i-1}, \qquad \rho(x) = \sum_{i=1}^{d_c} \rho_i x^{i-1}$
• λi, ρi represent the fraction of edges emanating from variable (check) nodes of degree i

3.2.3 Constructing Regular LDPC Codes:
• Around 1996, MacKay and Neal described methods for constructing sparse H matrices
• The idea is to randomly generate an M × N matrix H with weight dv columns and weight dc rows, subject to some constraints
• Construction 1A: Overlap between any two columns is no greater than 1
– This avoids length 4 cycles
• Construction 2A: M/2 columns have dv = 2, with no overlap between any pair of columns. Remaining columns have dv = 3.
– As with 1A, the overlap between any two columns is no greater than 1
• Construction 1B and 2B: Obtained by deleting select columns from 1A and 2A
– Can result in a higher rate code

3.2.4 Constructing Irregular LDPC Codes:
• Luby developed LDPC codes based on irregular LDPC Tanner graphs
• Message and check nodes have conflicting requirements
– Message nodes benefit from having a large degree
– LDPC codes perform better with check nodes having low degrees
• Irregular LDPC codes help balance these competing requirements
– High degree message nodes converge to the correct value quickly
– This increases the quality of information passed to the check nodes, which in turn helps the lower degree message nodes to converge
• Check node degree kept as uniform as possible and variable node degree is non-uniform
– Code 14: Check node degree = 14, Variable node degree = 5, 6, 21, 23
• No attempt made to optimize the degree distribution for a given code rate

3.3 Constructions of LDPC codes: By constructions of LDPC codes, we mean the construction, or design, of a particular LDPC parity check matrix H. The design of H is the point at which the asymptotic constraints (the parameters of the chosen class, such as the degree distribution and the rate) have to meet the practical constraints (finite dimension, girth). Some recipes that take practical constraints into account are described hereafter. Two kinds of techniques exist in the literature: random and deterministic ones. The design compromise is the following. To increase the girth, the density of ones in H has to be decreased, which yields poor code performance due to a low minimum distance. Conversely, for a high minimum distance, the density of ones has to be increased, which creates short cycles (a small girth) because the dimensions of H are finite, and thus yields a poor convergence of the belief propagation algorithm.

3.3.1 Random based construction: The first constructions of LDPC codes were random ones.
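The column-overlap rule used by Constructions 1A and 2A can be checked mechanically: two columns that share ones in two or more rows close a cycle of length 4 in the Tanner graph. The sketch below lists such column pairs. The function name and the two toy matrices are assumptions made for illustration.

```python
from itertools import combinations


def columns_with_excess_overlap(H):
    """Return the pairs of column indices that share ones in more than one row."""
    # Represent each column by the set of rows in which it has a one.
    supports = [{r for r, row in enumerate(H) if row[c]} for c in range(len(H[0]))]
    return [
        (a, b)
        for a, b in combinations(range(len(supports)), 2)
        if len(supports[a] & supports[b]) > 1
    ]


# Columns 0 and 1 share rows 0 and 1, which closes a length-4 cycle.
H_bad = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

# Every pair of columns shares at most one row.
H_ok = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]

print(columns_with_excess_overlap(H_bad))  # [(0, 1)]
print(columns_with_excess_overlap(H_ok))   # []
```

A random construction can call such a check after drawing each new column and redraw the column when the returned list is not empty. This is one simple way of enforcing a girth of at least 6.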
The parity check matrix is the concatenation and/or superposition of sub-matrices; these sub-matrices are created by applying permutations to a particular (random or not) sub-matrix which usually has a column weight of 1. R. Gallager's construction, for example, is based on a short matrix H0. Then j − 1 matrices πi(H0) are vertically stacked below H0, where πi(H0) denotes a column permutation of H0, so that every column has weight j (see figure 5). Regular and irregular codes can also be constructed by creating the 2 sets of nodes, each node appearing as many times as its degree. A one-to-one association is then randomly drawn between the nodes of the 2 sets, as illustrated in figure 2.3. D. MacKay compares random constructions of regular and irregular LDPC codes: small girths have to be avoided, especially between low weight variables. All the constructions described above should be constrained by the girth's value. Yet, increasing the girth from 4 to 6 and above is not trivial; some random constructions specifically address this issue, generating a parity check matrix that optimizes either the girth or the rate of the code.

Figure 5: Some random constructions of regular LDPC parity check matrices based on Gallager's (a) and MacKay's constructions (b, c) (MacKay, Wilson, and Davey 1999). Example of a regular (3, 4) LDPC code of length N = 12. Girths of length 4 have not been avoided. The permutations can be either column permutations (a, b) or row permutations.

3.3.2 Deterministic based construction: Random constructions don't have too many constraints: they can fit quite well to the parameters of the desired class. The problem is that they do not guarantee that the girth will be large enough. So either post-processing or more constraints are added to the random design, sometimes at the cost of much complexity. To circumvent the girth problem, deterministic constructions have been developed.
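A Gallager-style random construction, as described in Section 3.3.1, can be sketched as follows. The base matrix H0 places k consecutive ones in each row so that every column has weight 1, and randomly column-permuted copies of H0 are stacked below it until H contains j sub-matrices in total. The function name, the seed handling and the use of Python's `random` module are choices made for this sketch. No girth constraint is enforced, so length-4 cycles may be present.

```python
import random


def gallager_parity_check(n, j, k, seed=0):
    """Build a regular (n, j, k) parity check matrix from stacked column permutations.

    n must be a multiple of k. The result has n*j/k rows, column weight j
    and row weight k. Short cycles are not avoided.
    """
    if n % k != 0:
        raise ValueError("n must be a multiple of k")
    rng = random.Random(seed)
    rows_per_block = n // k

    # Base sub-matrix H0: row r has ones in columns r*k .. r*k + k - 1.
    h0 = [[1 if c // k == r else 0 for c in range(n)] for r in range(rows_per_block)]

    H = [row[:] for row in h0]
    for _ in range(j - 1):
        perm = list(range(n))
        rng.shuffle(perm)
        # Column permutation of H0: new column c takes old column perm[c].
        H.extend([[row[perm[c]] for c in range(n)] for row in h0])
    return H


H = gallager_parity_check(n=12, j=3, k=4)
print(len(H), len(H[0]))              # 9 12
print({sum(col) for col in zip(*H)})  # {3}
print({sum(row) for row in H})        # {4}
```

The parameters reproduce the size of the regular (3, 4) code of length N = 12 used in Figure 5. The actual positions of the ones depend on the seed.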
Moreover, explicit constructions can lead to easier encoding, and can also be easier to handle in hardware. Two branches of combinatorial mathematics are involved in such designs: finite geometry and balanced incomplete block designs (BIBDs). They seem to be more efficient than a previous algebraic construction which was based on expanders. High rate LDPC codes have been designed based on Steiner systems; the conclusion was that the minimum distance was not high enough and that difference set cyclic (DSC) codes should outperform them when combined with one-step majority logic decoding. LDPC code constructions based on finite geometry have also been presented, as in (Johnson and Weller 2003a) for constructing very high rate LDPC codes. Balanced incomplete block designs (BIBDs) have also been considered. The major drawback of deterministic constructions of LDPC codes is that they exist only for a few combinations of parameters. So it may be difficult to find one that fits the specifications of a given system.

3.4 Code construction
For large block sizes, LDPC codes are commonly constructed by first studying the behaviour of decoders. As the block size tends to infinity, LDPC decoders can be shown to have a noise threshold below which decoding is reliably achieved, and above which decoding is not achieved. This threshold can be optimized by finding the best proportion of arcs from check nodes and arcs from variable nodes. An approximate graphical approach to visualizing this threshold is an EXIT chart. The construction of a specific LDPC code after this optimization falls into two main types of techniques:
• Pseudo-random approaches
• Combinatorial approaches
Construction by a pseudo-random approach builds on theoretical results that, for large block size, a random construction gives good decoding performance. In general, pseudo-random codes have complex encoders; however, pseudo-random codes with the best decoders can have simple encoders.
Various constraints are often applied to help ensure that the desired properties expected at the theoretical limit of infinite block size occur at a finite block size. Combinatorial approaches can be used to optimize properties of small block-size LDPC codes or to create codes with simple encoders.

3.4.1 Random number generation
A random number generator (often abbreviated as RNG) is a computational or physical device designed to generate a sequence of numbers or symbols that lack any pattern, i.e. appear random. The many applications of randomness have led to the development of several different methods for generating random data. Many of these have existed since ancient times, including dice, coin flipping, the shuffling of playing cards, the use of yarrow stalks (for divination) in the I Ching, and many other techniques. Because of the mechanical nature of these techniques, generating large amounts of sufficiently random numbers (important in statistics) required a lot of work and/or time. Thus, results would sometimes be collected and distributed as random number tables. Nowadays, after the advent of computational random number generators, a growing number of government-run lotteries and lottery games are using RNGs instead of more traditional drawing methods, such as using ping-pong or rubber balls. RNGs are also used today to determine the odds of modern slot machines. Several computational methods for random number generation exist, but they often fall short of the goal of true randomness, though they may meet, with varying success, some of the statistical tests for randomness intended to measure how unpredictable their results are (that is, to what degree their patterns are discernible). Only in 2010 was the first truly random computational number generator produced, relying on principles of quantum physics.
3.4.2 Pseudorandom number generator
A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers that approximates the properties of random numbers. The sequence is not truly random in that it is completely determined by a relatively small set of initial values, called the PRNG's state. Although sequences that are closer to truly random can be generated using hardware random number generators, pseudorandom numbers are important in practice for simulations (e.g., of physical systems with the Monte Carlo method), and are central in the practice of cryptography and procedural generation. Common classes of these algorithms are linear congruential generators, lagged Fibonacci generators, linear feedback shift registers, feedback with carry shift registers, and generalized feedback shift registers. Recent instances of pseudorandom algorithms include Blum Blum Shub, Fortuna, and the Mersenne twister. Careful mathematical analysis is required to have any confidence that a PRNG generates numbers that are sufficiently "random" to suit the intended use. Robert R. Coveyou of Oak Ridge National Laboratory once titled an article, "The generation of random numbers is too important to be left to chance." As John von Neumann joked, "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."

Periodicity: A PRNG can be started from an arbitrary starting state using a seed state. It will always produce the same sequence thereafter when initialized with that state. The maximum length of the sequence before it begins to repeat is determined by the size of the state, measured in bits. However, since the length of the maximum period potentially doubles with each bit of state added, it is easy to build PRNGs with periods long enough for many practical applications. If a PRNG's internal state contains n bits, its period can be no longer than 2^n results.
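The seed and period properties described above can be observed on a very small generator. The sketch below uses a 4-bit linear feedback shift register whose taps correspond, under one common convention, to the feedback polynomial x^4 + x^3 + 1. This particular register and the function names are chosen for illustration and are not part of the decoder design. Its state has n = 4 bits, so the period cannot exceed 2^4 = 16 states, and the all-zero state is excluded because it maps to itself.

```python
def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR by one step."""
    feedback = (state ^ (state >> 1)) & 1  # XOR of the two lowest bits
    return (state >> 1) | (feedback << 3)  # shift right, insert feedback at the top


def lfsr4_sequence(seed, length):
    """Return `length` successive states starting from a non-zero 4-bit seed."""
    if not 0 < seed < 16:
        raise ValueError("seed must be a non-zero 4-bit value")
    states, state = [], seed
    for _ in range(length):
        state = lfsr4_step(state)
        states.append(state)
    return states


def lfsr4_period(seed):
    """Count the steps needed to return to the seed state."""
    state, steps = lfsr4_step(seed), 1
    while state != seed:
        state = lfsr4_step(state)
        steps += 1
    return steps


print(lfsr4_sequence(1, 5))                          # [8, 4, 2, 9, 12]
print(lfsr4_sequence(1, 5) == lfsr4_sequence(1, 5))  # True: same seed, same sequence
print(lfsr4_period(1))                               # 15 = 2**4 - 1
```

The register visits all 15 non-zero states before repeating, which is the longest period a 4-bit register of this kind can reach.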
For some PRNGs the period length can be calculated without walking through the whole period. Linear Feedback Shift Registers (LFSRs) are usually chosen to have periods of exactly 2^n − 1. Linear congruential generators have periods that can be calculated by factoring. Mixes (no restrictions) have periods of about 2^(n/2) on average, usually after walking through a non-repeating starting sequence. Mixes that are reversible (permutations) have periods of about 2^(n−1) on average, and the period will always include the original internal state. Although PRNGs will repeat their results after they reach the end of their period, a repeated result does not imply that the end of the period has been reached, since the internal state may be larger than the output; this is particularly obvious with PRNGs with a 1-bit output. Most pseudorandom generator algorithms produce sequences which are uniformly distributed by any of several tests. It is an open question, and one central to the theory and practice of cryptography, whether there is any way to distinguish the output of a high-quality PRNG from a truly random sequence without knowing the algorithm(s) used and the state with which it was initialized. The security of most cryptographic algorithms and protocols using PRNGs is based on the assumption that it is infeasible to distinguish use of a suitable PRNG from use of a truly random sequence. The simplest examples of this dependency are stream ciphers, which (most often) work by exclusive or-ing the plaintext of a message with the output of a PRNG, producing cipher text. The design of cryptographically adequate PRNGs is extremely difficult because they must meet additional criteria. The size of its period is an important factor in the cryptographic suitability of a PRNG, but not the only one.

3.4.3 "True" random numbers vs. pseudorandom numbers: There are two principal methods used to generate random numbers.
One measures some physical phenomenon that is expected to be random and then compensates for possible biases in the measurement process. The other uses computational algorithms that produce long sequences of apparently random results, which are in fact completely determined by a shorter initial value, known as a seed or key. Generators of the latter type are often called pseudorandom number generators. A "random number generator" based solely on deterministic computation cannot be regarded as a "true" random number generator, since its output is inherently predictable. How to distinguish a "true" random number from the output of a pseudo-random number generator is a very difficult problem. However, carefully chosen pseudo-random number generators can be used instead of true random numbers in many applications. Rigorous statistical analysis of the output is often needed to have confidence in the algorithm.

3.5 Encoding of LDPC codes:
• A linear block code is encoded by performing the matrix multiplication c = uG
• A common method for finding G from H is to first make the code systematic by adding rows and exchanging columns to get the H matrix in the form H = [P^T I]
– Then G = [I P]
– However, the result of the row reduction is a non-sparse P matrix
– The multiplication c = [u uP] is therefore very complex
• As an example, for a (10000, 5000) code, P is 5000 by 5000
– Assuming the density of 1's in P is 0.5, then 0.5 × (5000)^2 additions are required per codeword
• This is especially problematic since we are interested in large n (> 10^5)
• An often used approach is to use the all-zero codeword in simulations.
• Richardson and Urbanke show that even for large n, the encoding complexity can be an (almost) linear function of n
– "Efficient encoding of low-density parity-check codes", IEEE Trans. Inf.
Theory, Feb. 2001
• Using only row and column permutations, H is converted to an approximately lower triangular matrix
– Since only permutations are used, H is still sparse
– The resulting encoding complexity is almost linear as a function of n
• An alternative involving a sparse-matrix multiply followed by differential encoding has been proposed by Ryan, Yang, & Li
– "Lowering the error-rate floors of moderate-length high-rate irregular LDPC codes," ISIT, 2003

The weak point of LDPC codes is their encoding process: a sparse parity check matrix does not necessarily have a sparse generator matrix. In fact, the generator matrix tends to be particularly dense. So encoding by multiplication with G has a complexity of O(N^2). A first encoding scheme deals with parity check matrices of lower-triangular shape. The other encoding schemes mainly deal with cyclic parity check matrices.

3.5.1 Lower-triangular shape based encoding: A first approach is to create a parity check matrix with an almost lower-triangular shape, as depicted in figure 2.4(a). The performance is slightly affected by the lower-triangular shape constraint. Instead of computing the product c = uG, the equation H c^T = 0 is solved, where c is the unknown variable. The encoding is systematic. The last parity bits, those not covered by the lower-triangular part, have to be solved without reduced complexity. Thus, the higher M1 is, the less complex the encoding is. T. Richardson and R. Urbanke propose an efficient encoding of a parity check matrix H. It is based on the shape depicted in figure 2.4(b). They also propose some "greedy" algorithms which transform any parity check matrix H into an equivalent parity check matrix H0 using column and row permutations, minimizing g. So H0 is still sparse. The encoding complexity scales in O(N + g^2), where g is a small fraction of N. As a particular case, parity check matrices of the same shape have been constructed with g = 0.
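For the particular case g = 0, encoding reduces to a forward substitution. The sketch below assumes a parity check matrix of the form H = [A | T], where T is square and lower triangular with ones on its diagonal. The names `A`, `T`, `encode` and `syndrome` and the 3 × 6 example are hypothetical and only illustrate the principle. The codeword is c = [u | p], and row i of H c^T = 0 yields the parity bit p_i from the information bits and the parity bits already computed.

```python
def encode(A, T, u):
    """Systematic encoding for H = [A | T], T lower triangular with unit diagonal.

    Solves H c^T = 0 over GF(2) by forward substitution and returns c = u + p.
    """
    p = []
    for i in range(len(T)):
        # Contribution of the information bits to check i.
        acc = sum(a * b for a, b in zip(A[i], u)) % 2
        # Contribution of the parity bits already computed (columns j < i of T).
        acc ^= sum(T[i][j] * p[j] for j in range(i)) % 2
        p.append(acc)  # T[i][i] == 1, so p_i equals the accumulated sum
    return list(u) + p


def syndrome(A, T, c):
    """Return H c^T over GF(2) for H = [A | T]."""
    H = [a_row + t_row for a_row, t_row in zip(A, T)]
    return [sum(h * x for h, x in zip(row, c)) % 2 for row in H]


A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
T = [[1, 0, 0],
     [1, 1, 0],
     [0, 1, 1]]

c = encode(A, T, [1, 0, 1])
print(c)                  # [1, 0, 1, 1, 0, 0]
print(syndrome(A, T, c))  # [0, 0, 0]
```

Each parity bit costs as many XOR operations as there are ones in its row of H. When H is sparse, the total encoding cost therefore grows linearly with the code length, which is the motivation for the triangular shape.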
3.5.2 Other encoding schemes:
Iterative encoding: A class of parity check codes has been derived which can be iteratively encoded using the same graph-based algorithm as the decoder. But for irregular cases, the code does not seem to perform as well as random ones.

Low-density generator matrices: The generator matrices of LDPC codes are usually not sparse, because of the inversion. But if H is constructed both sparse and systematic, H = [P^T I], then G = [I P], where G is a sparse generator matrix (LDGM). Such codes correspond to parallel concatenated codes. They seem to have high error floors (asymptotically bad codes). Yet, the error floor can be lowered by carefully choosing and concatenating the constituent codes. Note that this may be a drawback for applications with high rate codes.

Cyclic parity-check matrices: The most popular codes that can be easily encoded are the cyclic or pseudo-cyclic ones. A Gallager-like construction using cyclic shifts makes a cyclic based encoder possible. LDPC codes constructed from finite geometries or BIBDs are also cyclic or pseudo-cyclic. Table 2 gives a summary of the different encoding schemes.

Table 2: Summary of the different LDPC encoding schemes

3.6 Decoding LDPC codes:
• Like turbo codes, LDPC codes can be decoded iteratively
– Instead of a trellis, the decoding takes place on a Tanner graph
– Messages are exchanged between the v-nodes and c-nodes
– Edges of the graph act as information pathways
• Hard decision decoding
– Bit-flipping algorithm
• Soft decision decoding
– Sum-product algorithm (also known as the message passing / belief propagation algorithm)
– Min-sum algorithm (reduced complexity approximation to the sum-product algorithm)
• In general, the per-iteration complexity of LDPC codes is less than it is for turbo codes
– However, many more iterations may be required (max 100; avg 30)
– Thus, overall complexity can be higher than turbo.
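The min-sum simplification can be illustrated on a single check node. In the sum-product algorithm, the message sent back on each edge combines all the other incoming log-likelihood ratios. Min-sum approximates its magnitude by the smallest of the other magnitudes and takes its sign as the product of the other signs. The sketch below is a minimal software model under these assumptions, with illustrative function names. It tracks only the two smallest magnitudes and the overall sign, and it is not a description of the hardware architecture.

```python
def min_sum_check_update(llrs):
    """Min-sum check node update.

    llrs: incoming log-likelihood ratios, one per edge (at least two).
    Returns one outgoing message per edge, computed from all *other* inputs.
    """
    if len(llrs) < 2:
        raise ValueError("a check node needs at least two edges")
    mags = [abs(v) for v in llrs]
    # Position of the smallest magnitude, then the two smallest values.
    idx_min = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min)

    # Product of all signs (a zero input is treated as positive).
    total_sign = 1
    for v in llrs:
        if v < 0:
            total_sign = -total_sign

    out = []
    for i, v in enumerate(llrs):
        # The edge holding the minimum receives the second minimum instead.
        mag = min2 if i == idx_min else min1
        # Removing this edge's own sign leaves the product of the other signs.
        sign = -total_sign if v < 0 else total_sign
        out.append(sign * mag)
    return out


def min_sum_reference(llrs):
    """Direct definition, for comparison: loop over the other edges explicitly."""
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = -1 if sum(v < 0 for v in others) % 2 else 1
        out.append(sign * min(abs(v) for v in others))
    return out


msgs = [2.0, -0.5, 3.0, -1.5]
print(min_sum_check_update(msgs))  # [0.5, -1.5, 0.5, -0.5]
```

Keeping only the two minima, the position of the smallest one and a single sign bit per edge avoids recomputing a minimum for every edge.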
CHAPTER IV TERMINOLOGY
WiMAX (Worldwide Interoperability for Microwave Access) is a telecommunications protocol that provides fixed and fully mobile internet access. The current WiMAX revision provides up to 40 Mbit/s, with the IEEE 802.16m update expected to offer up to 1 Gbit/s fixed speeds. The name "WiMAX" was created by the WiMAX Forum, which was formed in June 2001 to promote conformity and interoperability of the standard. The forum describes WiMAX as "a standards-based technology enabling the delivery of last mile wireless broadband access as an alternative to cable and DSL".

[Figure: WiMAX base station equipment with a sector antenna and wireless modem on top]
[Figure: A pre-WiMAX CPE of a 26 km (16 mi) connection mounted 13 metres (43 ft) above the ground (2004, Lithuania)]

WiMAX refers to interoperable implementations of the IEEE 802.16 wireless-networks standard (ratified by the WiMAX Forum), in the same way that Wi-Fi refers to interoperable implementations of the IEEE 802.11 Wireless LAN standard (ratified by the Wi-Fi Alliance). The WiMAX Forum certification allows vendors to sell their equipment as WiMAX (Fixed or Mobile) certified, thus ensuring a level of interoperability with other certified products, as long as they fit the same profile.
The IEEE 802.16 standard forms the basis of WiMAX and is sometimes referred to colloquially as "WiMAX", "Fixed WiMAX", "Mobile WiMAX", "802.16d" and "802.16e". The formal names are as follows:
• 802.16-2004 is also known as 802.16d, which refers to the working party that has developed that standard. It is sometimes referred to as "Fixed WiMAX", since it has no support for mobility.
• 802.16e-2005, often abbreviated to 802.16e, is an amendment to 802.16-2004. It introduced support for mobility, among other things, and is therefore also known as "Mobile WiMAX".
Mobile WiMAX is the WiMAX incarnation that has the most commercial interest to date and is being actively deployed in many countries. Mobile WiMAX is also the basis of future revisions of WiMAX. As such, references to and comparisons with "WiMAX" in this chapter mean "Mobile WiMAX".

Uses: The bandwidth and range of WiMAX make it suitable for the following potential applications:
• Providing portable mobile broadband connectivity across cities and countries through a variety of devices.
• Providing a wireless alternative to cable and DSL for "last mile" broadband access.
• Providing data, telecommunications (VoIP) and IPTV services (triple play).
• Providing a source of Internet connectivity as part of a business continuity plan.
• Providing a network to facilitate machine to machine communications, such as for Smart Metering.

Broadband: Companies are deploying WiMAX to provide mobile broadband or at-home broadband connectivity across whole cities or countries. In many cases this has resulted in competition in markets which typically only had access to broadband through an existing incumbent DSL (or similar) operator. Additionally, given the relatively low cost of deploying a WiMAX network (in comparison to GSM, DSL or fiber-optic), it is now possible to provide broadband in places where it may not have been economically viable.
[Figure: A WiMAX USB modem for mobile internet]

There are numerous devices on the market that provide connectivity to a WiMAX network. These are known as the "subscriber unit" (SU). There is an increasing focus on portable units. This includes handsets (similar to cellular smart phones); PC peripherals (PC Cards or USB dongles); and embedded devices in laptops, which are now available for Wi-Fi services. In addition, there is much emphasis by operators on consumer electronics devices such as gaming consoles, MP3 players and similar devices. It is notable that WiMAX is more similar to Wi-Fi than to 3G cellular technologies. The WiMAX Forum website provides a list of certified devices. However, this is not a complete list of the devices available, as certified modules are embedded into laptops, MIDs (mobile internet devices), and other private labeled devices.

WiMAX Gateways: WiMAX gateway devices are available as both indoor and outdoor versions from several manufacturers. Many of the WiMAX gateways that are offered by manufacturers such as ZyXEL, Motorola, and Greenpacket are stand-alone self-install indoor units. Such devices typically sit near the customer's window with the best WiMAX signal, and provide:
• An integrated Wi-Fi access point to provide the WiMAX Internet connectivity to multiple devices throughout the home or business.
• Ethernet ports for connecting directly to a computer or DVR instead.
• One or two PSTN telephone jacks to connect a land-line phone and take advantage of VoIP.
Indoor gateways are convenient, but radio losses mean that the subscriber may need to be significantly closer to the WiMAX base station than with professionally-installed external units. Outdoor units are roughly the size of a laptop PC, and their installation is comparable to the installation of a residential satellite dish.
A higher-gain directional outdoor unit will generally result in greatly increased range and throughput, but with the obvious loss of practical mobility of the unit.

WiMAX Mobiles: HTC announced the first WiMAX enabled mobile phone, the Max 4G, on Nov 12th 2008. The device was only available to certain markets in Russia on the Yota network. HTC unveiled the second WiMAX enabled mobile phone, the EVO 4G, on March 23, 2010 at the CTIA conference in Las Vegas. The device, made available on June 4, 2010, is capable of EV-DO (3G) and WiMAX (4G) as well as simultaneous data and voice sessions. The device also has a front-facing camera enabling video conversations. A number of WiMAX mobiles are expected to hit the US market in 2010.

Technical information:
[Figure: Illustration of a WiMAX MIMO board]

4.1 WiMAX and the IEEE 802.16 Standard: The current WiMAX revision is based upon IEEE Std 802.16e-2005, approved in December 2005. It is a supplement to the IEEE Std 802.16-2004, and so the actual standard is 802.16-2004 as amended by 802.16e-2005. Thus, these specifications need to be considered together. IEEE 802.16e-2005 improves upon IEEE 802.16-2004 by:
• Adding support for mobility (soft and hard handover between base stations). This is seen as one of the most important aspects of 802.16e-2005, and is the very basis of Mobile WiMAX.
• Scaling of the Fast Fourier Transform (FFT) to the channel bandwidth in order to keep the carrier spacing constant across different channel bandwidths (typically 1.25 MHz, 5 MHz, 10 MHz or 20 MHz). Constant carrier spacing results in a higher spectrum efficiency in wide channels, and a cost reduction in narrow channels. This is also known as Scalable OFDMA (SOFDMA). Other bands that are not multiples of 1.25 MHz are defined in the standard, but because the allowed FFT subcarrier numbers are only 128, 512, 1024 and 2048, other frequency bands will not have exactly the same carrier spacing, which might not be optimal for implementations.
• Advanced antenna diversity schemes, and hybrid automatic repeat-request (HARQ)
• Adaptive antenna systems (AAS) and MIMO technology
• Denser sub-channelization, thereby improving indoor penetration
• Introducing Turbo Coding and Low-Density Parity Check (LDPC)
• Introducing downlink sub-channelization, allowing administrators to trade coverage for capacity or vice versa
• Fast Fourier Transform algorithm
• Adding an extra QoS class for VoIP applications.
SOFDMA (used in 802.16e-2005) and OFDM256 (802.16d) are not compatible; thus equipment will have to be replaced if an operator is to move to the later standard (e.g., Fixed WiMAX to Mobile WiMAX).

Physical layer: The original version of the standard on which WiMAX is based (IEEE 802.16) specified a physical layer operating in the 10 to 66 GHz range. 802.16a, updated in 2004 to 802.16-2004, added specifications for the 2 to 11 GHz range. 802.16-2004 was updated by 802.16e-2005 in 2005 and uses Scalable Orthogonal Frequency-Division Multiple Access (SOFDMA), as opposed to the fixed Orthogonal Frequency-Division Multiplexing (OFDM) version with 256 sub-carriers (of which 200 are used) in 802.16d. More advanced versions, including 802.16e, also bring multiple antenna support through MIMO. This brings potential benefits in terms of coverage, self installation, power consumption, frequency re-use and bandwidth efficiency.

MAC (data link) layer: The WiMAX MAC uses a scheduling algorithm for which the subscriber station needs to compete only once, for initial entry into the network. After network entry is allowed, the subscriber station is allocated an access slot by the base station. The time slot can enlarge and contract, but remains assigned to the subscriber station, which means that other subscribers cannot use it. In addition to being stable under overload and over-subscription, the scheduling algorithm can also be more bandwidth efficient.
The scheduling algorithm also allows the base station to control Quality of Service (QoS) parameters by balancing the time-slot assignments among the application needs of the subscriber stations.

Spectrum allocation: There is no uniform global licensed spectrum for WiMAX; however, the WiMAX Forum has published three licensed spectrum profiles: 2.3 GHz, 2.5 GHz and 3.5 GHz, in an effort to drive standardization and decrease cost. In the USA, the biggest segment available is around 2.5 GHz, and is already assigned, primarily to Sprint Nextel and Clearwire. Elsewhere in the world, the most likely bands used will be the Forum approved ones, with 2.3 GHz probably being most important in Asia. Some countries in Asia, like India and Indonesia, will use a mix of 2.5 GHz, 3.3 GHz and other frequencies. Pakistan's Wateen Telecom uses 3.5 GHz. Analog TV bands (700 MHz) may become available for WiMAX usage, but await the complete roll out of digital TV, and there will be other uses suggested for that spectrum. In the USA the FCC auction for this spectrum began in January 2008 and, as a result, the biggest share of the spectrum went to Verizon Wireless and the next biggest to AT&T. Both of these companies have stated their intention of supporting LTE, a technology which competes directly with WiMAX. EU commissioner Viviane Reding has suggested re-allocation of 500–800 MHz spectrum for wireless communication, including WiMAX. WiMAX profiles define channel size, TDD/FDD and other necessary attributes in order to have inter-operating products. The current fixed profiles are defined for both TDD and FDD. At this point, all of the mobile profiles are TDD only. The fixed profiles have channel sizes of 3.5 MHz, 5 MHz, 7 MHz and 10 MHz. The mobile profiles are 5 MHz, 8.75 MHz and 10 MHz. (Note: the 802.16 standard allows a far wider variety of channels, but only the above subsets are supported as WiMAX profiles.)
In October 2007, the Radiocommunication Sector of the International Telecommunication Union (ITU-R) decided to include WiMAX technology in the IMT-2000 set of standards. This enables spectrum owners (specifically in the 2.5-2.69 GHz band at this stage) to use WiMAX equipment in any country that recognizes the IMT-2000.

Spectral efficiency: One of the significant advantages of advanced wireless systems such as WiMAX is spectral efficiency. For example, 802.16-2004 (fixed) has a spectral efficiency of 3.7 (bit/s)/Hz, and other 3.5–4G wireless systems offer spectral efficiencies that are similar to within a few tenths of a percent. Multiple reuse and smart network deployment topologies multiply the effective spectral efficiency. The direct use of frequency domain organization simplifies designs using MIMO-AAS compared to CDMA/WCDMA methods, resulting in more effective systems.

Inherent Limitations: A commonly-held misconception is that WiMAX will deliver 70 Mbit/s over 50 kilometers. Like all wireless technologies, WiMAX can either operate at higher bitrates or over longer distances, but not both: operating at the maximum range of 50 km (31 miles) increases the bit error rate and thus results in a much lower bitrate. Conversely, reducing the range (to under 1 km) allows a device to operate at higher bitrates. A recent city-wide deployment of WiMAX in Perth, Australia, has demonstrated that customers at the cell edge with an indoor CPE typically obtain speeds of around 1–4 Mbit/s, with users closer to the cell tower obtaining speeds of up to 30 Mbit/s. As in all wireless systems, the available bandwidth is shared between users in a given radio sector, so performance could deteriorate in the case of many active users in a single sector. However, with adequate capacity planning and the use of WiMAX's Quality of Service, a minimum guaranteed throughput for each subscriber can be put in place.
In practice, most users will receive services in the range of 4–8 Mbit/s, and additional radio cards will be added to the base station as required to increase the number of users that can be served.

Silicon implementations:
A critical requirement for the success of a new technology is the availability of low-cost chipsets and silicon implementations. WiMAX has a strong silicon ecosystem, with a number of specialized companies producing baseband ICs and integrated RFICs for implementing full-featured WiMAX Subscriber Stations in the 2.3, 2.5 and 3.5 GHz bands (refer to 'Spectrum allocation' above). It is notable that most of the major semiconductor companies have not developed WiMAX chipsets of their own and have instead chosen to invest in or use the well-developed products of smaller specialists or start-up suppliers. These companies include, but are not limited to, Beceem, Sequans and picoChip. The chipsets from these companies are used in the majority of WiMAX devices. Intel Corporation is a leader in promoting WiMAX, but has limited its WiMAX chipset development and instead chosen to invest in these specialized companies producing silicon compatible with the various WiMAX deployments throughout the globe.

Comparison with Wi-Fi:
Comparisons and confusion between WiMAX and Wi-Fi are frequent because both are related to wireless connectivity and Internet access. WiMAX is a long-range system, covering many kilometers, that uses licensed or unlicensed spectrum to deliver a connection to a network, in most cases the Internet. Wi-Fi uses unlicensed spectrum to provide access to a local network. Wi-Fi is more popular in end-user devices. Wi-Fi runs on the Media Access Control's CSMA/CA protocol, which is connectionless and contention based, whereas WiMAX runs a connection-oriented MAC. WiMAX and Wi-Fi have quite different quality of service (QoS) mechanisms:
o WiMAX uses a QoS mechanism based on connections between the base station and the user device.
Each connection is based on specific scheduling algorithms.
o Wi-Fi uses contention access: all subscriber stations that wish to pass data through a wireless access point (AP) compete for the AP's attention on a random interrupt basis. This can cause subscriber stations distant from the AP to be repeatedly interrupted by closer stations, greatly reducing their throughput.

Both 802.11 and 802.16 define Peer-to-Peer (P2P) and ad hoc networks, where an end user communicates with users or servers on another Local Area Network (LAN) using its access point or base station. However, 802.11 also supports direct ad hoc or peer-to-peer networking between end-user devices without an access point, while 802.16 end-user devices must be in range of the base station.

Wi-Fi and WiMAX are complementary. WiMAX network operators typically provide a WiMAX Subscriber Unit which connects to the metropolitan WiMAX network and provides Wi-Fi connectivity within the home or business for local devices (e.g., laptops, Wi-Fi handsets, smartphones). This enables the user to place the WiMAX Subscriber Unit in the best reception area (such as a window) and still be able to use the WiMAX network from any place within the residence.

WiMAX Forum:
The WiMAX Forum is a non-profit organization formed to promote the adoption of WiMAX-compatible products and services. A major role of the organization is to certify the interoperability of WiMAX products. Those that pass conformance and interoperability testing achieve the "WiMAX Forum Certified" designation and can display this mark on their products and marketing materials. Some vendors claim that their equipment is "WiMAX-ready", "WiMAX-compliant", or "pre-WiMAX" if it is not officially WiMAX Forum Certified. Another role of the WiMAX Forum is to promote the spread of knowledge about WiMAX. To do so, it has a certified training program that is currently offered in English and French.
It also offers a series of member events and endorses some industry events.

WiMAX Spectrum Owners Alliance:
WiSOA was the first global organization composed exclusively of owners of WiMAX spectrum with plans to deploy WiMAX technology in those bands. WiSOA focussed on the regulation, commercialisation, and deployment of WiMAX spectrum in the 2.3–2.5 GHz and the 3.4–3.5 GHz ranges. WiSOA merged with the Wireless Broadband Alliance in April 2008.

CHAPTER V
Approximate Layered Decoding Approach for Pipelined Decoding

Recently, the layered decoding approach has been found to converge much faster than the conventional TPMP decoding approach. With the layered decoding approach, the parity check matrix of an LDPC code is partitioned into L layers. Each layer defines a supercode, and the original LDPC code is the intersection of all supercodes. The column weight of each layer is at most 1.

The decoder exchanges check-to-variable messages, sent from a check node c to a variable node, and variable-to-check messages, sent from the variable node to the check node c. In the kth iteration, each variable node also has a log-likelihood ratio (LLR) message that is passed from layer t to the next layer. The layered message passing with the Min-Sum algorithm can be formulated as (2)–(4). Equation (3) operates on the set of variable nodes connected to the check node c, excluding the variable node that receives the message.

In an LDPC decoder, the check node unit (CNU) performs the computation shown in (3) and the variable node unit (VNU) performs (2) and (4). When all soft messages corresponding to the 1-components in an entire block row of the parity check matrix are processed in one clock period, the computations shown in (2)–(4) are performed sequentially. The long computation delay in the CNU inevitably limits the maximum achievable clock speed. Usually, pipelining can be used to reduce the critical path in computing units.
However, due to the data dependency between two consecutive layers in layered decoding, pipelining cannot be applied directly. For instance, suppose that one pipeline latch stage is introduced into every CNU. To compute the variable-to-check messages corresponding to the third block row, the LLR messages produced by the second layer are needed, and these cannot be determined until the check-to-variable messages of the second layer are computed with (3). Due to the one-clock delay caused by the pipeline stage in the CNUs, those check-to-variable messages are not available in the required clock cycle. The data dependency between layer 3 and layer 2 occurs at columns 4, 8, 9, and 13, as marked by bold squares in Fig. 2.

To enable pipelined decoding, we propose an approximation of the layered decoding approach. Let us rewrite (3) in terms of a variable node set. The data dependency between two consecutive layers occurs in the column positions corresponding to this variable node set. For the variable nodes belonging to this set, (6) is satisfied. The additional item in (6) is the incremental change of the message corresponding to a layer in the current decoding iteration; it involves the check node in that layer that connects to the variable node. In iterative LDPC decoding, (6) can be approximated using (7) if an incremental item that is already available is used for the update in place of the one that is not yet computed. Thus, by slightly changing the updating order of the messages corresponding to the variable node set, (2) can be approximated by (7). Based on the previous consideration, an approximate layered decoding approach is formulated as (8)–(10). The parameter that appears in (8) and (10) is a small integer.

To demonstrate the decoding performance of the proposed approach, a (3456, 1728), (3, 6) rate-0.5 QC-LDPC code constructed with the progressive edge-growth (PEG) approach [12] is used. Its parity check matrix is permuted as discussed in Section II. The number of rows in each layer is 144. The parameter in (8) and (10) is set to 2 to enable two pipeline stages. The maximum iteration number is set to 15. It can be observed from Fig.
3 that the proposed approach has about 0.05 dB performance degradation compared with the standard layered decoding scheme. The conventional TPMP approach has about 0.2 dB performance loss compared with the standard layered decoding scheme because of its slow convergence speed. It should be noted that, by increasing the maximum iteration number, the performance gap among the three decoding schemes decreases. However, the achievable decoding throughput is reduced.

5.1 Decoder Architecture with Layered Decoding Approach

5.1.1 Overall Decoder Architecture:
The proposed decoder computes the check-to-variable messages, variable-to-check messages, and LLR messages corresponding to an entire block row of the parity check matrix in one clock cycle. The decoder architecture is shown in Fig. 6. It consists of the following five portions.
1) Layers of R-register arrays. Each layer is used to store the check-to-variable messages corresponding to the 1-components in a block row of the matrix. At each clock cycle, the messages in one layer are vertically shifted down to the adjacent layer.
2) A check node unit (CNU) array for generating the messages for one layer of the R-register array in a clock cycle. The dashed lines in the CNU array denote two pipeline stages.
3) LLR-register arrays. Each LLR-register array stores the messages corresponding to a block column of the matrix.
4) Variable node unit (VNU) arrays. Each VNU array is used for computing the variable-to-check messages and LLR messages corresponding to a block column of the matrix. Each VNU is composed of two adders.
5) Data shifters. The messages corresponding to a block column of the matrix are shifted one step by a data shifter array.

Figure 6: Decoder Architecture

In Fig. 6, each VNU, MUX, and data shifter represents an array of computing units. In the decoding initialization, the intrinsic messages are transferred to the LLR-register arrays via the MUX1 arrays.
During the first clock cycles, the check-to-variable messages are not available because of the pipeline stages in the CNU array. Therefore, the MUX2 arrays are needed to prevent the LLR-registers from being updated. In one clock cycle, only a portion of the LLR messages is updated. The updated LLR messages, which correspond to the 1-components in the current layer of the matrix, are sent to the data shifter via the computation path. The remaining LLR messages are sent directly to the data shifter from the LLR-register array.

5.1.2 Critical path of the Proposed Architecture:
The computation path of the proposed architecture is shown in Fig. 5. The equations shown in (8)–(10) are performed sequentially. The computation results of (8) are represented in two's complement format. It is convenient to use the sign-magnitude representation for the computation expressed in (9). Thus, two's complement to sign-magnitude conversion is needed before data are sent to the CNU. The messages from the CNU array and the R-register arrays are in a compressed form to reduce the memory requirement; more details are given below. To recover the individual messages, a data distributor is needed. The messages sent out by the data distributor are in sign-magnitude representation. Consequently, sign-magnitude to two's complement conversion is needed before data are sent to the VNU.

In this design, the computation path is divided into three segments. The implementation of the SM-to-2'S unit and the adder in segment-1 can be optimized by merging the adder into the SM-to-2'S unit to reduce the computation delay. With the Min-Sum algorithm, the critical task of a CNU is to find the two smallest magnitudes among all input data and to identify the relative position of the input with the smallest magnitude. An efficient CNU implementation approach has been proposed previously. The dataflow in a CNU is discussed only briefly here. Since six inputs are considered for this design, four computation steps are needed in a CNU. The first step is compare-and-swap.
Then, two pseudo rank order filter (PROF) stages are needed. In the last step, the two smallest magnitudes are corrected using a scaling factor (usually set to 3/4). In this way, the messages output by a CNU are in a compressed form with four elements: the smallest magnitude, the second smallest magnitude, the index of the smallest magnitude, and the signs of all messages.

It can be observed that the critical path of segment-1 is three adders and four multiplexers. The longest logic path of segment-2 includes three adders and two multiplexers. The adder in the last stage can be implemented with a [4:2] compressor and a fast adder. The data shifter can be implemented with one level of multiplexers, as detailed in the data shifter description below. Thus, the computation delay of segment-3 is less than that of either segment-1 or segment-2. By inserting two pipeline stages among the three segments, the critical path of the overall decoder architecture is reduced to three adders and four 2:1 multiplexers.

Data Shifter:
It can be seen from Fig. 2 that, after a single left cyclic shift, each block of one layer is identical to the corresponding block of the adjacent layer. Therefore, repeated single-step left cyclic-shift operations ensure the message alignment for all layers in a decoding iteration. After the messages corresponding to the last block row are processed, a reverse cyclic-shift operation is needed for the next decoding iteration. Based on this observation, only the edges of the Tanner graph for the first layer of the matrix are mapped to fixed hardware interconnections in the proposed decoder. A very simple data shifter, composed of one level of two-input one-output multiplexers, performs the shifting operation for one block column of the matrix. Fig. 7 shows the structure of a data shifter. When the value of the control signal is 1, the shifting network performs a single-step left cyclic shift. If the control signal is set to 0, the reverse cyclic shift is performed.
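The behaviour of the data shifter described above can be illustrated with a short software sketch. This is only a behavioural model under stated assumptions, not the hardware design itself: the function names `data_shifter` and `run_iteration` and the parameter `reverse_steps` are hypothetical, and the sketch assumes that the reverse shift undoes the `num_layers - 1` single-step left shifts accumulated during one iteration.

```python
def data_shifter(msgs, ctrl, reverse_steps):
    """Behavioural model of one level of 2-to-1 multiplexers.

    msgs          -- messages of one block column (list)
    ctrl          -- 1: single-step left cyclic shift, 0: reverse cyclic shift
    reverse_steps -- assumed size of the reverse shift (hypothetical parameter)
    """
    n = len(msgs)
    if ctrl == 1:
        # Each output position i selects the input at position i + 1.
        return [msgs[(i + 1) % n] for i in range(n)]
    # Reverse path: each output position i selects the input at i - reverse_steps.
    return [msgs[(i - reverse_steps) % n] for i in range(n)]


def run_iteration(msgs, num_layers):
    """Shift once per layer transition, then realign for the next iteration."""
    for _ in range(num_layers - 1):
        msgs = data_shifter(msgs, 1, num_layers - 1)
    # Assumption: the reverse shift restores the alignment of the first layer.
    return data_shifter(msgs, 0, num_layers - 1)
```

Because each output has only two possible sources (the single-step neighbour or the fixed reverse-shift source), one multiplexer per message suffices in this model, which matches the one-level multiplexer structure described in the text.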
Hardware Requirement and Throughput Estimation:
The hardware requirement of the decoder for the example LDPC code is estimated, excluding the control block and the parity check block. Table 5.1 gives the gate count for the computing blocks. Each MUX stands for a 1-bit 2-to-1 multiplexer. Each XOR represents a 1-bit two-input XOR logic unit. The register requirement is estimated in Table 5.2. Both tables are expressed in terms of the word lengths of the two kinds of stored messages. As analyzed in Section 5.1.2, the critical path of the proposed decoder is three adders and four multiplexers.

In the decoder architecture presented in [6], each soft message is represented with 4 bits. The critical path consists of an R-select unit, two adders, a CNU, a shifting unit and a MUX. The computation path of a CNU has a 2'S-to-SM unit, a two-least-minimum computation unit, an offset computation unit, an SM-to-2'S unit, and an R-selector unit. The overall critical path is longer than ten 4-bit adders and seven multiplexers. The post-routing frequency is 100 MHz with 0.13-µm CMOS technology. Because the critical path of the proposed decoder architecture is about one-third of that of the architecture presented in [6], the clock speed of the proposed decoder architecture, using 4 bits for each soft message, is estimated to be 250 MHz with the same CMOS technology.

In a decoding iteration, the required number of clock cycles is 12. To finish a decoding process of 15 iterations, 183 clock cycles are needed (12 × 15 + 1 + 2): one cycle for initialization and two cycles due to pipeline latency. Thus, the throughput of the layered decoding architecture is at least 4.7 Gb/s (3456 bits × 250 MHz / 183 cycles ≈ 4.72 Gb/s).

Because a real design using the proposed architecture has not been completed, only a rough comparison with other designs can be provided. Lin et al. [4] designed an LDPC decoder for a (1200, 720) code. The decoder achieves 3.33 Gb/s throughput with 8 iterations. Sha et al. [5] proposed a 1.8 Gb/s decoder with 20 iterations.
The decoder is targeted at an (8192, 7168) LDPC code. The decoding throughput of both decoders is lower than that of the proposed architecture with 15 iterations. Gunnam et al. [6] presented an LDPC decoder architecture for (2082, 1041) array LDPC codes. With 15 iterations, it can achieve 4.6 Gb/s decoding throughput. The numbers of CNUs and VNUs are 347 and 2082, respectively.

Figure 7: Structure of a data shifter

Table 5.1: Gate Count Estimation for Computing Blocks

Table 5.2: Storage Requirement Estimate

It can be seen from Table 5.1 that fewer than half of the computing units are needed in our pipelined architecture. The register requirement of our design is higher than that in [6] because an LDPC code with a larger block length, chosen for better decoding performance, is considered in our design. The two pipeline stages in the CNU array also require additional registers. The design in [6] is only suitable for array LDPC codes, whereas the proposed decoder architecture is for generic QC-LDPC codes. The proposed architecture is also scalable. For example, the LDPC code considered here can be partitioned into 8, 12, or 18 layers for different trade-offs between hardware cost and decoding throughput.

PLDA Architecture:
The architecture of the PLDA-based LDPC decoder is shown in Fig. 8. As described in Section III-A, the APP messages, instead of the CTV messages, are passed among different layers. Therefore, for each sub-matrix, two memory blocks are needed: one to store the APP messages and another to store the CTV messages. The memory blocks are dual-port RAMs because, at every clock cycle, the decoder must not only fetch messages for the variable node and check node processing, but also receive messages from other connected layers. The VNU performs the variable node processing to calculate the VTC messages using data from the APP memories and the CTV memories. The CNU performs the check node processing to calculate the new CTV messages.
Then, these newly updated CTV messages are stored back to the same locations in the CTV memories, while the updated APP messages are passed to the APP memories in other connected layers.

The VNU is a simple adder that updates the VTC message by subtracting the CTV value from the APP message. Several VNUs in the same row operate in parallel so that the newly calculated VTC messages can be used immediately to complete the horizontal processing. The CNU performs the check node processing as in MSA. Each CNU contains a 6-input or 7-input comparator, which finds the minimum and second minimum values. Here we adopt such a comparator. Then, the absolute value of the VTC message is compared with the minimum value from the comparator. If the absolute value of the VTC message is larger than the minimum value, the compare-and-select unit outputs the minimum value; otherwise it outputs the second minimum value.

The dedicated message passing paths among different layers are pre-determined by the modified permutation values, as described in Section III-A. Since the message passing paths are fixed, only static wiring connections are necessary to connect the APP memories among different layers, instead of an l × l switching network. Therefore, the area and power consumption of the message passing network in PLDA become minimal. In addition, mode switching in this decoder becomes much easier. The depth of the APP memories and CTV memories is designed to be 96, which accommodates the maximum size of the sub-matrices. Different modes can then be set by adjusting the operating period to the actual size of the sub-matrix.

Figure 8: Parallel layered decoding Architecture

To evaluate the performance of our proposed PLDA, a rate-1/2 WiMax LDPC decoder with 19 different modes is synthesized and implemented on both FPGA and ASIC platforms.
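As a software illustration of the VNU and CNU processing described for the PLDA architecture, the following sketch processes one check row. It is a minimal model under assumptions, not the hardware implementation: the function name `process_check_row` is hypothetical, the sign handling follows the usual min-sum rule (product of the signs of the other inputs), the factor `alpha = 0.75` is the normalization value quoted elsewhere in this report, and the APP update `VTC + new CTV` is the standard layered form, which the text does not state explicitly.

```python
def process_check_row(app, ctv, alpha=0.75):
    """Model one check row: VNU subtraction, then CNU compare-and-select.

    app -- APP messages of the variable nodes in this row
    ctv -- previous CTV messages for the same edges
    Returns (vtc, new_ctv, new_app).
    """
    # VNU: VTC = APP - CTV for every edge of the row.
    vtc = [a - r for a, r in zip(app, ctv)]

    # Comparator: minimum and second minimum of the magnitudes.
    mags = [abs(q) for q in vtc]
    min1 = min(mags)
    idx = mags.index(min1)
    min2 = min(m for i, m in enumerate(mags) if i != idx)

    # Overall sign of the row (zero is treated as positive).
    sign_all = 1
    for q in vtc:
        if q < 0:
            sign_all = -sign_all

    new_ctv = []
    for q, m in zip(vtc, mags):
        # Compare-and-select: larger than the minimum -> minimum,
        # otherwise -> second minimum.
        mag = min1 if m > min1 else min2
        # Remove this edge's own sign from the overall sign.
        sign = sign_all * (-1 if q < 0 else 1)
        new_ctv.append(alpha * sign * mag)

    # Assumed standard layered update of the APP messages.
    new_app = [q + r for q, r in zip(vtc, new_ctv)]
    return vtc, new_ctv, new_app
```

Only the two smallest magnitudes are ever needed to form every outgoing message of the row, which is why a single minimum/second-minimum comparator per CNU is sufficient.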
On a Xilinx XC2VP30 device, the maximum frequency is 66.4 MHz, which corresponds to a decoding throughput of 160 Mbps with a maximum of 10 iterations. The same architecture is implemented and synthesized using TSMC 90 nm ASIC technology. In total, 152 dual-port RAMs, each of size 96 × 6 bits, are used to store the APP messages and CTV messages. The synthesized decoder can achieve a maximum throughput of 492 Mbps for 10 decoding iterations. PLDA needs only (l + 1) × Iter clock cycles for the decoding process to converge, where l is the size of the sub-matrix. For two other designs, this number rises to l × (4 × Iter + 1) and l × (5 × Iter) + 12, respectively. Hence, for the same number of iterations, PLDA can reduce the decoding latency by approximately 75%. The core area of the decoder is 2.1 mm², and the power consumption is 61 mW at the maximum frequency of 204 MHz. Fig. 9 shows the layout of the decoder. The dual-port memories are generated by the Synopsys DesignWare tool and are thus flattened during synthesis and place and route.

Table 5.3 shows the decoder implementation results compared with other LDPC decoders for WiMax presented in the literature. Our proposed decoder can achieve significantly higher decoding throughput with comparable chip area and power dissipation. The energy efficiency is measured by the energy required to decode one bit per iteration (pJ/Bit/Iter). As shown in Table 5.3, the energy efficiency of the proposed decoder is improved by about an order of magnitude compared with other existing designs.

Table 5.3: Overall Comparisons between Proposed Decoder And Other Existing LDPC Decoders For WiMax

Figure 9: Layout of the proposed decoder chip

5.1.3 Generic Implementation of an LDPC Decoder

About Genericity:
The main specification of the LDPC decoder that will be described in this chapter is the genericity.
The meaning of genericity is twofold. First, the decoder should be generic in the sense that it should be able to decode any LDPC code, provided that the code has the same size as the one fixed by the parameters of the architecture. This means that any degree distribution of the variable and check nodes should be allowed, given N, M and P. Second, the decoder should be generic in the sense that many parameters and components can be modified: for example the parity-check matrix size, the decoding algorithm (BP, λ-min, BP-based), the dynamic range and the number of bits used for the fixed-point coding. Modifying these parameters, however, requires a new hardware synthesis. Both of these goals are very challenging, since LDPC decoders have so far always been designed for a particular class of parity-check matrix. Moreover, genericity in the architecture description increases the complexity level. Our motivation for designing such a decoder is the possibility of running simulations much faster on an FPGA than on a PC. Simulations give, in the end, the final judgment in the comparison of the codes and of their associated architectures.

5.2 Code Structure and Decoding Algorithms:

Code Structure of WiMax:
The IEEE 802.16e standard for WiMax systems uses an irregular LDPC code as the error-correction code because of its competitive bit error performance compared with regular LDPC codes. It is a quasi-cyclic LDPC code whose parity check matrix can be decomposed into several sub-matrices, each of which is either an identity matrix or a transformation of it. An Mb × Nb base parity check matrix of the rate-1/2 WiMax codes is defined, where Mb is 12 and Nb is 24. The parity check matrix H is generated by expanding each blank entry into an l × l zero matrix and each non-blank entry into an l × l circularly right-shifted identity matrix.
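The expansion of a base matrix into H described above can be sketched as follows. This is an illustrative model only: the function name `expand_base_matrix` is hypothetical, blank entries are assumed to be marked with -1, and each non-blank entry is assumed to hold the right-shift amount directly. The small base matrix in the usage note is made up for illustration and is not the WiMax base matrix.

```python
def expand_base_matrix(base, l):
    """Expand an Mb x Nb base matrix into an (Mb*l) x (Nb*l) binary matrix H.

    base -- list of rows; -1 marks a blank entry (assumed convention),
            any value >= 0 is the circular right-shift of the identity
    l    -- sub-matrix size
    """
    rows = len(base) * l
    cols = len(base[0]) * l
    H = [[0] * cols for _ in range(rows)]
    for i, base_row in enumerate(base):
        for j, shift in enumerate(base_row):
            if shift < 0:
                continue  # blank entry: leave the l x l zero block
            for r in range(l):
                # Row r of a right-shifted identity has its 1 in column r + shift.
                H[i * l + r][j * l + (r + shift) % l] = 1
    return H
```

For instance, calling `expand_base_matrix([[0, -1], [1, 2]], 3)` builds a 6 × 6 matrix whose top-left block is the identity, whose top-right block is all zeros, and whose bottom blocks are identities shifted right by one and two positions.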
As there are 19 different modes, with the sub-matrix size l ranging from 24 to 96, the shift value of each sub-matrix can be expressed as in (5.1), where p(i, j) is the permutation value defined for a reference sub-matrix size.

5.2.1 Min-Sum Decoding Algorithm:
Before presenting the MSA, we first make some definitions. Let cn denote the n-th bit of a codeword and yn denote the corresponding received value from the channel. Let rmn[k] be the check-to-variable (CTV) message and qmn[k] the variable-to-check (VTC) message between check node m and variable node n at the k-th iteration. Let N(m) denote the set of variables that participate in check m and M(n) denote the set of checks in which variable n participates. The set N(m) without variable n is denoted as N(m) \ n, and the set M(n) without check m is denoted as M(n) \ m. The detailed steps of MSA are described below.

1. Initialization: Under the assumption of equal a priori probability, compute the channel probability pn (intrinsic information) of the variable node n by (5.2). The CTV message rmn is set to zero.

2. Iterative Decoding: At the k-th iteration, for the variable node n, calculate the VTC message qmn[k] by (5.3). Meanwhile, the decoder can make a hard decision by calculating the APP (a posteriori probability) by (5.4). Decide the n-th bit of the decoded codeword as xn = 0 if the APP value of bit n is greater than 0, and xn = 1 otherwise. The decoding process terminates when the entire codeword x = [x1, x2, ..., xN] satisfies all M parity check equations Hx = 0, or when the preset maximum number of iterations is reached. If the decoding process does not stop, calculate the CTV message rmn for check node m by (5.5).

Here, a normalization factor is introduced to compensate for the performance loss of the min-sum algorithm compared with the standard BP algorithm. In this work, the factor is set to 0.75.

5.2.2 Layered Decoding Algorithm:
In the BP algorithm and MSA, the CTV messages are updated during the horizontal step using the VTC messages received from the previous iteration.
During the vertical step, all VTC messages are updated using the newly obtained CTV messages from the current iteration. In other words, these two decoding steps execute iteratively with no overlap between them. LDA enables the check node updating process to be completed layer by layer. Therefore, VTC messages can be updated using CTV messages from the current iteration instead of old values from the previous iteration. The selected H base matrix in WiMax is well suited to a horizontal LDA implementation, as it can be decomposed into 12 rows, each of which can be treated as a horizontal layer. Different layers have some vertically overlapped positions, and APP messages, instead of CTV messages, can be passed from an upper layer to a lower layer at these positions within the same iteration. Work by E. Sharon et al. [7] has proved theoretically that such a layered decoding algorithm, either horizontal or vertical, doubles the convergence speed in comparison with the BP algorithm.

CHAPTER VI
Xilinx ISE Tool for Synthesis and Place and Route

6.1 Xilinx ISE Tool Flow:
The Integrated Software Environment (ISE) is the Xilinx design software suite that allows you to take the design from design entry through Xilinx device programming. The ISE Project Navigator manages and processes the design through the following steps in the ISE design flow.

Design Entry:
Design entry is the first step in the ISE design flow. During design entry, you create the source files based on the design objectives. You can create the top-level design file using a Hardware Description Language (HDL), such as VHDL, Verilog, or ABEL, or using a schematic. You can use multiple formats for the lower-level source files in the design. If you start with a synthesized EDIF or NGC/NGO file, you can skip the design entry and synthesis steps and start with the implementation process.

Synthesis:
After design entry and optional simulation, you run synthesis.
During this step, VHDL, Verilog, or mixed-language designs become netlist files that are accepted as input to the implementation step.

Implementation:
After synthesis, you run design implementation, which converts the logical design into a physical file format that can be downloaded to the selected target device. From Project Navigator, you can run the implementation process in one step, or you can run each of the implementation processes separately. Implementation processes vary depending on whether you are targeting a Field Programmable Gate Array (FPGA) or a Complex Programmable Logic Device (CPLD).

Verification:
You can verify the functionality of the design at several points in the design flow. You can use simulator software to verify the functionality and timing of the design or a portion of the design. The simulator interprets VHDL or Verilog code into circuit functionality and displays the logical results of the described HDL to determine correct circuit operation. Simulation allows you to create and verify complex functions in a relatively small amount of time. You can also run in-circuit verification after programming the device.

Device Configuration:
After generating a programming file, you configure the device. During configuration, you generate configuration files and download the programming files from a host computer to a Xilinx device.

This project uses the Xilinx ISE tool for synthesis and FPGA implementation of the LDPC decoder. The device chosen is the Spartan3 XC3S400. Synthesis using Xilinx can be done with the following steps:
1) Create a new project and select the device family Spartan3 and then the device XC3S400.
2) Add the source code.
3) Go to Implementation.
4) Run Synthesis (XST).
5) Change to Behavioral simulation.
6) Run the source code.
The figure below shows the ISE tool flow.
Figure 7: ISE Simulation Flow

CHAPTER VII RESULTS
7.1 Simulation Results:
7.2 Schematic:
Internal Schematic:

CHAPTER VIII APPLICATIONS
In 2003, an LDPC code beat six turbo codes to become the error correcting code in the new DVB-S2 standard for the satellite transmission of digital television. In 2008, LDPC beat convolutional turbo codes as the FEC scheme for the ITU-T G.hn standard. G.hn chose LDPC over turbo codes because of its lower decoding complexity (especially when operating at data rates close to 1 Gbit/s) and because the proposed turbo codes exhibited a significant error floor at the desired range of operation. LDPC is also used for 10GBase-T Ethernet, which sends data at 10 gigabits per second over twisted-pair cables.

CHAPTER IX CONCLUSION
In this project, a high-throughput low-complexity decoder architecture for generic LDPC codes has been presented. To enable pipelining for the layered decoding approach, an approximate layered decoding approach has been explored. It has been estimated that the proposed decoder can achieve more than 4.7 Gb/s decoding throughput at 15 iterations.

9.1 Future Scope:
The IEEE 802.16m standard is the core technology for the proposed WiMAX Release 2, which enables more efficient, faster, and more converged data communications. The IEEE 802.16m standard has been submitted to the ITU for IMT-Advanced standardization[26]. IEEE 802.16m is one of the major candidates for IMT-Advanced technologies by ITU. Among many enhancements, IEEE 802.16m systems can provide four times faster data speed than the current WiMAX Release 1 based on IEEE 802.16e technology. WiMAX Release 2 will provide strong backward compatibility with Release 1 solutions. It will allow current WiMAX operators to migrate their Release 1 solutions to Release 2 by upgrading channel cards or software of their systems.
Also, subscribers who use currently available WiMAX devices can communicate with new WiMAX Release 2 systems without difficulty. It is anticipated that, in a practical deployment using 4×2 MIMO in the urban microcell scenario with only a single 20-MHz TDD channel available system-wide, the 802.16m system can support both 120 Mbit/s downlink and 60 Mbit/s uplink per site simultaneously. WiMAX Release 2 is expected to be available commercially in the 2011–2012 timeframe.

REFERENCES
[1] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inf. Theory, vol. IT-8, pp. 21–28, Jan. 1962.
[2] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity check code decoder,” IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 404–412, Mar. 2002.
[3] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “Multi-Gbit/sec low density parity check decoders with reduced interconnect complexity,” in Proc. ISCAS, May 2005, vol. 5, pp. 5194–5197.
[4] C. Lin, K. Lin, H. Chang, and C. Lee, “A 3.33 Gb/s (1200, 720) low-density parity check code decoder,” in Proc. ESSCIRC, Sep. 2005, pp. 211–214.
[5] J. Sha, M. Gao, Z. Zhang, L. Li, and Z. Wang, “Efficient decoder implementation for QC-LDPC codes,” in Proc. ICCCAS, Jun. 2006, vol. 4, pp. 2498–2502.
[6] K. K. Gunnam, G. S. Choi, and M. B. Yeary, “A parallel VLSI architecture for layered decoding for array LDPC codes,” in Proc. VLSID, Jan. 2007, pp. 738–73.
[7] E. Sharon, S. Litsyn, and J. Goldberger, “An efficient message-passing schedule for LDPC decoding,” in Proc. 23rd IEEE Convention Elect. Electron. Eng. Israel, Sep. 2004, pp. 223–226.
[8] D. E. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in Proc. IEEE Workshop Signal Process. Syst., 2004, pp. 107–112.
[9] L. Chen, J. Xun, I. Djurdjevic, and S. Lin, “Near Shannon limit quasi-cyclic low-density parity-check codes,” IEEE Trans. Commun., vol. 52, no. 7, pp. 1038–1042, Jul. 2004.
[10] Y.
Zhang, W. E. Ryan, and Y. Li, “Structured eIRA codes with low floors,” in Proc. Int. Symp. Inf. Theory, Sep. 2005, pp. 174–178.
[11] Z. Wang and Z. Cui, “A memory efficient partially parallel decoder architecture for QC-LDPC codes,” in Proc. 39th Asilomar Conf. Signals, Syst. Comput., 2005, pp. 729–733.
[12] H. Xiao-Yu, E. Eleftheriou, and D. M. Arnold, “Regular and irregular progressive edge-growth Tanner graphs,” IEEE Trans. Inf. Theory, vol. 51, no. 1, pp. 386–398, Jan. 2005.
[13] J. Zhang and M. P. C. Fossorier, “Shuffled iterative decoding,” IEEE Trans. Commun., vol. 53, no. 2, pp. 209–213, Feb. 2005.
[14] Ardakani, M., and F. R. Kschischang. July 2002. “Designing irregular LDPC codes using EXIT charts based on message error rate.” Proceedings of the IEEE International Symposium on Information Theory.
[15] Ardakani, Masoud, Terence H. Chan, and Frank R. Kschischang. May 2003. “Properties of the EXIT Chart for One-Dimensional LDPC Decoding Schemes.” Proceedings of CWIT.
[16] Bahl, L. R., J. Cocke, F. Jelinek, and J. Raviv. March 1974. “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate.” IEEE Transactions on Information Theory 20:284–287.
[17] Barry, J. R. Oct. 2001. Low-Density Parity-Check Codes. Available at http://www.ece.gatech.edu/~barry/6606/handsout/ldpc.pdf.
[18] Battail, G., and A. H. M. El-Sherbini. 1982. “Coding for Radio Channels.” Annales des Télécommunications 37:75–96.