Massively LDPC Decoding on Multicore Architectures
Presented by: fakewen
                   Authors
• Gabriel Falcao
• Leonel Sousa
• Vitor Silva
                Outline
• Introduction
• BELIEF PROPAGATION
• DATA STRUCTURES AND PARALLEL
  COMPUTING MODELS
• PARALLELIZING THE KERNELS EXECUTION
• EXPERIMENTAL RESULTS
                Outline
• Introduction
• BELIEF PROPAGATION
• DATA STRUCTURES AND PARALLEL
  COMPUTING MODELS
• PARALLELIZING THE KERNELS EXECUTION
• EXPERIMENTAL RESULTS
              Introduction
• LDPC decoding on multicore architectures
• LDPC decoders were developed on recent multicores, such as off-the-shelf general-purpose x86 processors, Graphics Processing Units (GPUs), and the CELL Broadband Engine (CELL/B.E.).
                Outline
• Introduction
• BELIEF PROPAGATION
• DATA STRUCTURES AND PARALLEL
  COMPUTING MODELS
• PARALLELIZING THE KERNELS EXECUTION
• EXPERIMENTAL RESULTS
         BELIEF PROPAGATION
• Belief propagation, also known as the Sum-Product Algorithm (SPA), is an iterative algorithm for the computation of joint probabilities
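For reference, the standard SPA update equations for binary LDPC codes, in the r_mn / q_nm notation recalled later in this deck (this is the common textbook form, not necessarily the paper's exact formulation):

  r_{mn}(0) = \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m) \setminus n} (1 - 2 q_{n'm}(1)), \quad r_{mn}(1) = 1 - r_{mn}(0)

  q_{nm}(0) = k_{nm} \, p_n(0) \prod_{m' \in M(n) \setminus m} r_{m'n}(0), \quad q_{nm}(1) = k_{nm} \, p_n(1) \prod_{m' \in M(n) \setminus m} r_{m'n}(1)

where p_n are the channel priors and k_{nm} normalizes q_{nm}(0) + q_{nm}(1) = 1.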
              LDPC Decoding
• Exploits the probabilistic relationships between nodes, imposed by the parity-check conditions, that allow inferring the most likely transmitted codeword.
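Concretely, the parity-check conditions are the standard ones: a binary vector c is a valid codeword iff

  H \, c^T = 0 \pmod{2}

where each row of H defines one check node (CN) and each column one bit node (BN).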
LDPC Decoding (cont.)

[Figure: transmission over a channel with white Gaussian noise]
LDPC Decoding (cont.)
Complexity
Forward and Backward recursions
• A reduction in the number of memory access operations is registered, which contributes to increasing the ratio of arithmetic operations per memory access.
                Outline
• Introduction
• BELIEF PROPAGATION
• DATA STRUCTURES AND PARALLEL
  COMPUTING MODELS
• PARALLELIZING THE KERNELS EXECUTION
• EXPERIMENTAL RESULTS
  DATA STRUCTURES AND PARALLEL
       COMPUTING MODELS
• Compact data structures to represent the H matrix
             Data Structures
• Separately code the information about H in two independent data streams, H_BN and H_CN
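A minimal sketch of what such compact streams can look like in C, assuming a CSR-style edge-list encoding; all identifiers here are illustrative, not the paper's actual H_BN / H_CN layouts:

/* Compact representation of a sparse binary H matrix as two edge streams.
 * h_cn lists, per check node, the bit-node index of each of its edges;
 * h_bn lists, per bit node, the check-node index of each of its edges.
 * Row/column boundaries are kept in the *_ptr arrays. */
typedef struct {
    int M, N, E;   /* check nodes, bit nodes, edges (ones in H)   */
    int *cn_ptr;   /* size M+1: edge range of each check node     */
    int *h_cn;     /* size E: edge -> bit-node index              */
    int *bn_ptr;   /* size N+1: edge range of each bit node       */
    int *h_bn;     /* size E: edge -> check-node index            */
} ldpc_graph_t;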
Reminder
• r_mn: the message sent from CN_m to BN_n
• q_nm: the message sent from BN_n to CN_m
   Parallel Computational Models
• Parallel Features of the General-Purpose
  Multicores
• Parallel Features of the GPU
• Parallel Features of the CELL/B.E.
   Parallel Features of the General-
          Purpose Multicores
• #pragma omp parallel for
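A minimal sketch of how the horizontal (check-node) kernel can be spread across cores with that pragma; the array names and CSR layout are illustrative assumptions, and the F&B optimization is omitted for clarity:

#include <omp.h>

/* Horizontal (check-node) kernel: each of the M check nodes is an
 * independent unit of work, so the outer loop parallelizes directly.
 * q and r are edge-indexed message arrays (illustrative layout). */
void horizontal_kernel(int M, const int *cn_ptr, const float *q, float *r)
{
    #pragma omp parallel for
    for (int m = 0; m < M; m++) {
        for (int e = cn_ptr[m]; e < cn_ptr[m + 1]; e++) {
            float prod = 1.0f;
            for (int k = cn_ptr[m]; k < cn_ptr[m + 1]; k++)
                if (k != e)
                    prod *= 1.0f - 2.0f * q[k];  /* exclude own edge */
            r[e] = 0.5f + 0.5f * prod;           /* r_mn(0) */
        }
    }
}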
Parallel Features of the GPU
[Figure: throughput]
Parallel Features of the CELL/B.E.
[Figure: throughput]
                Outline
• Introduction
• BELIEF PROPAGATION
• DATA STRUCTURES AND PARALLEL
  COMPUTING MODELS
• PARALLELIZING THE KERNELS EXECUTION
• EXPERIMENTAL RESULTS
     PARALLELIZING THE KERNELS
            EXECUTION
• The Multicores Using OpenMP
• The GPU Using CUDA
• The CELL/B.E.
The Multicores Using OpenMP
         The GPU Using CUDA
• Programming the Grid Using a Thread per
  Node Approach
The GPU Using CUDA (cont.)
• Coalesced Memory Accesses
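A sketch of the thread-per-node idea combined with coalesced accesses, assuming a regular code with constant check-node degree W and messages stored edge-major with stride M; names and layout are illustrative, not the paper's code:

/* One thread per check node. With q stored as q[j*M + m] (edge slot j of
 * check node m), adjacent threads m, m+1, ... in a warp read adjacent
 * words for each j, giving coalesced global-memory accesses. */
__global__ void horizontal_kernel(int M, int W, const float *q, float *r)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M) return;
    for (int j = 0; j < W; j++) {
        float prod = 1.0f;
        for (int k = 0; k < W; k++)
            if (k != j)
                prod *= 1.0f - 2.0f * q[k * M + m];
        r[j * M + m] = 0.5f + 0.5f * prod;  /* r_mn(0) */
    }
}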
              The CELL/B.E.
• Small Single-SPE Model (A, B, C)
• Large Single-SPE Model
        Why Single-SPE Model
• In the single-SPE model, the number of communications between the PPE and the SPEs is minimal, and the PPE is relieved of the costly task of reorganizing data (the sorting procedure in Algorithm 4) between data transfers to the SPE.
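As a loose illustration, here is a minimal SPU-side sketch of double-buffered DMA streaming for the single-SPE model, so computation overlaps the transfers; the buffer size, tag usage, and data layout are assumptions for illustration, not the paper's implementation:

#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per DMA transfer (illustrative) */
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

/* Stream nchunks of codeword data from main memory (effective address ea),
 * decoding one buffer while the next one is being fetched. */
void stream_decode(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);           /* prefetch chunk 0 */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                           /* prefetch chunk i+1 */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                  /* wait for chunk i */
        mfc_read_tag_status_all();
        /* ... run the decoding kernels on buf[cur] ... */
        cur = nxt;
    }
}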
                Outline
• Introduction
• BELIEF PROPAGATION
• DATA STRUCTURES AND PARALLEL
  COMPUTING MODELS
• PARALLELIZING THE KERNELS EXECUTION
• EXPERIMENTAL RESULTS
       EXPERIMENTAL RESULTS
• LDPC Decoding on the General-Purpose x86
  Multicores Using OpenMP
• LDPC Decoding on the CELL/B.E.
  – Small Single-SPE Model
  – Large Single-SPE Model
• LDPC Decoding on the GPU Using CUDA
LDPC Decoding on the General-
 Purpose x86 Multicores Using
           OpenMP
LDPC Decoding on the CELL/B.E.
LDPC Decoding on the CELL/B.E. (cont.)
LDPC Decoding on the CELL/B.E. (cont.)
LDPC Decoding on the GPU Using
            CUDA
 The end



Thank you~
                        Forward backward
•   I can do better than that. I can send you the MSc thesis of a former student of ours who graduated 5 years ago. She explains the basic concept in detail. Basically, when you are performing the horizontal processing (the same applies to the vertical one) and you have a CN updating all the BNs connected to it, the F&B optimization exploits the fact that you only have to read all the BNs' information (probabilities in the case of the SPA) once for each CN, which gives you tremendous gains in computation time since you save many memory accesses, which, as you know, are the main bottleneck in parallel computing. In short, imagine you have one CN updating 6 BNs, BN0 to BN5 (horizontal processing), and that BN0 holds information A, BN1 = B, BN2 = C, ..., BN5 = F. Then, to update the corresponding r_mn elements, for each BN you have to calculate:

•   BN0 = B × C × D × E × F
•   BN1 = A × C × D × E × F
•   BN2 = A × B × D × E × F
•   ...
•   BN5 = A × B × C × D × E,

•   where each BN contributes to the update of its neighbors, but not to the update of itself. So, the F&B optimization allows you to read the A, B, C, D, E and F data only once from memory and produce all the intermediate values necessary to update all the BNs connected to that CN. You save memory accesses (very important!) and processing too.

				