					Implementation of Digital Filters
          in FPGA’s
            Ayaz Hasan




                References
• Chi-Jui Chou, Satish Mohanakrishnan, Joseph B.
  Evans, “FPGA Implementation of Digital
  Filters,” ICSPAT 1993
• Uwe Meyer-Baese, “Digital Signal Processing
  with Field Programmable Gate Arrays,” 2003




                   Outline
• Digital Filtering
• Programmable Signal Processors vs. FPGA’s
• Multiply Accumulate Units
  – Multipliers
  – Adders
  – Xilinx XC4000 implementations
• FIR Filters
• Pipelined MAC units
                    Digital Filters
• Modification of signal attributes in frequency
  or time domain
• Linear Time-Invariant Filters

• FIR Filters
   – Finite sum per output sample instant
• IIR Filters
   – Infinite sum

                       FIR Filters
• Transfer Function


• Lth order filter
• Tapped Delay Structure
   – One of the multiplicands
     is an FIR coefficient
• Non-recursive
   – No feedback
   – Finite Response
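The tapped-delay structure can be illustrated with a minimal Python sketch. It assumes the usual non-recursive form y[n] = c[0]·x[n] + c[1]·x[n-1] + ... + c[L]·x[n-L]; the function name fir_filter and the coefficient values are illustrative and do not come from the slides.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y[n] = sum(coeffs[k] * x[n-k]); no feedback path."""
    delay_line = [0] * len(coeffs)      # tapped delay line, newest sample first
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]   # shift the new sample in
        # Each tap multiplies a delayed sample by one FIR coefficient.
        out.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return out

# The impulse response of an FIR filter is its coefficient list followed by
# zeros, which is why the response is finite.
impulse_response = fir_filter([3, 2, 1], [1, 0, 0, 0, 0, 0])
```

Feeding a unit impulse through the sketch returns the coefficients themselves and then zeros, which matches the "finite response" bullet.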

                      IIR Filters
• Transfer Function



• Recursive Filter
   – Feedback
• Canonical Filter
   – Has both recursive and
     non-recursive parts
     merged
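A minimal Python sketch of a canonical (direct form II) recursive filter, in which the recursive and non-recursive parts share one delay line. The difference equation y[n] = sum(b[k]·x[n-k]) - sum(a[k]·y[n-k]) and the sign convention for the feedback coefficients are assumptions; the slides do not give the notation.

```python
def iir_direct_form_2(b, a, samples):
    """Canonical IIR filter sketch.

    b: feed-forward (non-recursive) coefficients b[0], b[1], ...
    a: feedback (recursive) coefficients a[1], a[2], ... (a[0] is taken as 1)
    """
    order = max(len(b), len(a) + 1) - 1
    b = list(b) + [0] * (order + 1 - len(b))   # pad to order + 1 entries
    a = list(a) + [0] * (order - len(a))       # pad to order entries
    w = [0] * order                            # single shared delay line
    out = []
    for x in samples:
        # Recursive part: feedback from the delay line.
        w0 = x - sum(ak * wk for ak, wk in zip(a, w))
        # Non-recursive part: weighted sum over the same delay line.
        y = b[0] * w0 + sum(bk * wk for bk, wk in zip(b[1:], w))
        out.append(y)
        w = ([w0] + w)[:order]                 # shift the delay line
    return out

# First-order example: y[n] = x[n] + 0.5 * y[n-1]
decay = iir_direct_form_2([1], [-0.5], [1, 0, 0, 0])
```

With the feedback coefficients left empty, the same routine reduces to a plain FIR filter, which shows how the two parts are merged in one structure.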


  Programmable Signal Processors
• Based on RISC architecture
  – At least one fast array multiplier (fixed or floating
    point)
• Most algorithms MAC intensive
  – High MAC rates using multi-stage pipelined
    architecture
  – Cost effective



                     FPGA’s
• Can provide more bandwidth
  – Multiple MAC cells on a chip
  – Useful in high-bandwidth applications like wireless
    and multimedia
• More efficient in implementing certain
  algorithms
  – CORDIC
  – Number Theoretic Transforms
  – Error-correction algorithms

                FPGA vs PDSP
• PDSP
  – Complicated algorithms that contain several if-
    then-else constructs
• FPGA
  – Front-end applications
     • FIR filters
     • CORDIC algorithms
     • FFT’s


      Target Device – Xilinx XC4000
• Basic logic element – Configurable Logic Block
  – Two separate 4-input, 1-output lookup tables (LUTs)
       • General-purpose logic functions
  – Fast carry logic
  – One 3-input, 1-output LUT to combine the two 4-input LUTs
  – Two flip-flops
  – Five levels of routing
       • From CLB-to-CLB connections up to long lines spanning the entire chip
       • Important for speed
  – Can be used as 16x2 or 32x1 RAM or ROM

Xilinx XC4000 CLB




      Multiply Accumulate Units
• DSP algorithms are MAC intensive
• Several approaches
  – Array approach
  – Addition using ripple carry methods
• Linear convolution sum
  – L consecutive multiplications
  – L – 1 addition operations per sample
  – N x N-bit multipliers need to be fused together with
    an accumulator
  – Full N x N product is 2N bits wide, 2N-1 for signed numbers
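The operation counts and product widths above can be checked with a short Python sketch. The helper names are hypothetical. The signed case assumes two's-complement operands; the 2N-1 figure holds when the single product of the two most negative values is excluded, since that one product needs the full 2N bits.

```python
def mac_output(coeffs, window):
    """One output sample of the convolution sum."""
    products = [c * x for c, x in zip(coeffs, window)]   # L multiplications
    acc = products[0]
    for p in products[1:]:                               # L - 1 additions
        acc += p
    return acc

def unsigned_product_bits(n):
    """Bits needed for the largest unsigned n x n product."""
    return ((2 ** n - 1) ** 2).bit_length()

def twos_complement_bits(v):
    """Minimum two's-complement width that can hold the integer v."""
    return (v.bit_length() if v >= 0 else (-v - 1).bit_length()) + 1

def signed_product_bits(n, include_most_negative=False):
    """Width needed for every signed n x n product over the chosen range."""
    lo = -(2 ** (n - 1)) + (0 if include_most_negative else 1)
    hi = 2 ** (n - 1) - 1
    return max(twos_complement_bits(p) for p in (lo * lo, lo * hi, hi * hi))
```

For N = 8 the sketch gives 16 bits for unsigned products and 15 bits for signed products over the symmetric range -127..127.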
                     MAC Unit
• MAC Components
  – 8 x 8 bit combinatorial array
    multiplier
  – 16-bit accumulator
  – Word sizes constrained by FPGA
    density
  – Larger word sizes possible if MAC
    units per chip reduced




                  Multiplier
• One CLB per partial product bit
  – 2-input AND gate generates each partial product
  – Addition logic
  – 64 CLB’s used
  – Signed Multiplication
• Basic Cell Structure
  – Sum
  – Carry
  – xi AND ai

Multiplier Implementation


                 • ak ≠ 0
                    – Accumulation of X·2^k
                 • ak = 0
                    – No operation
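The two cases above describe a shift-and-add multiplier: each nonzero bit a_k of one operand adds a copy of the other operand shifted left by k places. A minimal unsigned Python sketch follows; the slides mention signed multiplication, which this illustration does not handle, and the function name is hypothetical.

```python
def shift_add_multiply(x, a, n_bits=8):
    """Unsigned multiply built from partial products x * 2**k."""
    product = 0
    for k in range(n_bits):
        a_k = (a >> k) & 1          # k-th bit of the multiplier operand
        if a_k:                     # a_k != 0: accumulate x * 2**k
            product += x << k
        # a_k == 0: no operation
    return product
```

In the array multiplier all of these partial products are formed in parallel by AND gates and summed by the adder cells, whereas the loop here visits them one at a time.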




            Adder Implementation
• 16-bits
  – 9 CLB’s, each configured as 2-bit adder
     • 7 for middle 14 bits
     • 1 each for MSB and LSB
• Dedicated CLB carry logic
  – Improved efficiency of adders
  – Cout of a CLB can only be connected to a CLB above or
    below it
  – Vertical array
• Delay of 20.5ns
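A word-level Python sketch of a 16-bit ripple-carry adder split into nine slices, with the carry-out of each slice feeding the slice above it. The partition used here (a 1-bit slice for the LSB, seven 2-bit slices for the middle 14 bits, a 1-bit slice for the MSB) is one reading of the CLB counts above and is an assumption, as are the function names.

```python
def add_slice(a, b, carry_in, width):
    """One adder slice: returns (sum bits, carry-out) for `width` bits."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width

def ripple_add_16(a, b):
    """Add two 16-bit values slice by slice; returns (sum, carry-out)."""
    widths = [1] + [2] * 7 + [1]    # LSB slice, seven 2-bit slices, MSB slice
    result, carry, shift = 0, 0, 0
    for w in widths:
        mask = (1 << w) - 1
        s, carry = add_slice((a >> shift) & mask, (b >> shift) & mask, carry, w)
        result |= s << shift        # place this slice's sum bits
        shift += w                  # carry ripples to the next slice up
    return result, carry
```

The sequential dependence on `carry` in the loop mirrors the vertical carry chain: each slice must wait for the slice below it, which is why the dedicated carry logic matters for adder delay.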

MAC Implementation
              • Performance
                – 100ns multiplier delay
                – 10 MHz clock rate
                – 73 CLB’s (64 for the multiplier, 9 for the adder)




               FIR Filter MAC Unit
• MAC unit with 4 multipliers and
  an adder tree
   – Pipeline registers increase clock
     speed
   – 4 terms summed every clock cycle
      • 4 taps: Sampling rate = frequency
      • 8 taps: Sampling rate = frequency/2
• Maximum sampling frequency
      • M = number of multipliers
      • T = multiplier delay
      • N = number of filter taps
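The formula for the maximum sampling frequency did not survive in this copy of the slides. The Python sketch below assumes the relation f_max = M / (N · T), which reproduces the two cases listed above when the clock rate is 1/T: four taps on four multipliers run at the clock rate, eight taps at half of it. The function name and the example delay value are illustrative.

```python
def max_sampling_rate_hz(num_multipliers, multiplier_delay_s, num_taps):
    """Assumed relation: f_max = M / (N * T)."""
    return num_multipliers / (num_taps * multiplier_delay_s)

# Example: 4 multipliers with a 100 ns multiplier delay (10 MHz clock).
rate_4_taps = max_sampling_rate_hz(4, 100e-9, 4)   # about 10 MHz
rate_8_taps = max_sampling_rate_hz(4, 100e-9, 8)   # about 5 MHz
```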



                       FIR Filters
• Performance
  – 100ns multiplier delay
  – 22.5ns adder delay
  – Routing delay may be up to 75ns
  – 10 MHz clock
  – Sampling rates of 40/N MHz


           Pipelined MAC Units
• Multiplier delay is a major limitation on
  maximum sampling rates
• Pipelined array multipliers
  – Execution of separate multiplications overlaps
  – Carry propagating addition delay in last row of
    multiplier can be minimized
     • High sampling frequency can be achieved
  – Can be applied to previously mentioned FIR filters


            Pipelined MAC Units
• Basic cells identical to unpipelined ones
• Include pipeline registers
   – To propagate multiplier and multiplicand bits to the
     destination
   – To propagate product bits that have been completed,
     done in parallel with new batch of product bits
• N x N multiplier
   – Carry propagate adder replaced with N rows of half
     adders with pipeline registers between the rows
      • Allows carry propagation of only one position between any
        two consecutive rows
      • Clock speed depends only on the delay in multiplier cells
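The idea of replacing the carry-propagate adder with rows of half adders can be modelled at word level in Python. This is a sketch of the arithmetic only, not of the CLB-level circuit: the pair (sums, carries) represents the value sums + 2·carries, and each row moves every pending carry up by exactly one bit position. The function names are hypothetical.

```python
def half_adder_row(sums, carries):
    """One row of half adders: each bit position adds its sum bit and the
    carry arriving from the position below (XOR = sum, AND = new carry)."""
    shifted = carries << 1
    return sums ^ shifted, sums & shifted

def resolve_carries(sums, carries, rows):
    """Apply `rows` half-adder rows, one pipeline stage per row."""
    for _ in range(rows):
        sums, carries = half_adder_row(sums, carries)
    return sums, carries

# sums = 0b1111 and carries = 0b0001 represent 15 + 2 * 1 = 17.
after_one_row = resolve_carries(0b1111, 0b0001, 1)
after_five_rows = resolve_carries(0b1111, 0b0001, 5)
```

Because no row has to wait for a carry to ripple across the whole word, the critical path per pipeline stage is a single cell, which is consistent with the clock speed depending only on the delay in the multiplier cells.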

           Pipelined MAC Units
• For multiple tap filters
  – Accumulation of results needed through feedback
    of past output
     • Done by a set of full adders immediately below the
       diagonal of the array, feeding back outputs of full
       adders to their inputs through a single register
• Clock rate
  – Approaches 100MHz for XC4000


4 x 4 Multiplier 6-Bit Accumulator
• 4 MSB’s of multiplier fed back for accumulation
• Output clocked out and accumulator reset after process complete
• Filter coefficients and delayed inputs fed to multiplier in synchronized data streams
   – Arrivals corresponding to basic clock rate
• N tap filter requires N+1 clock cycles for computation of one output
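The N+1 cycle count can be illustrated with a small behavioural Python model. It assumes one accumulate step per tap plus one further cycle to clock the result out and reset the accumulator. Bit widths are ignored, and the function names and the clock value in the usage lines are illustrative.

```python
def mac_filter_output(coeffs, delayed_inputs):
    """Compute one filter output and count the clock cycles used."""
    acc = 0
    cycles = 0
    # Coefficients and delayed inputs arrive as synchronized streams,
    # one pair per clock cycle.
    for c, x in zip(coeffs, delayed_inputs):
        acc += c * x
        cycles += 1
    result, acc = acc, 0     # clock the output out and reset the accumulator
    cycles += 1
    return result, cycles

def output_rate_hz(clock_hz, num_taps):
    """Output sample rate when each output takes num_taps + 1 cycles."""
    return clock_hz / (num_taps + 1)

result, cycles = mac_filter_output([1, 2, 3], [4, 5, 6])
rate = output_rate_hz(80e6, 7)   # 7 taps on an example 80 MHz clock
```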

         FPGA Implementation
• Routing delay critical
  – 3ns for output pipeline register to stabilize after
    clocking
  – Output then routed
  – Then 4.5ns delay in the next CLB
  – Total minimum delay 7.5ns
  – In addition, 3ns from pad to input
     • Some CLB’s can be used as registers between input
       pads and cells, preventing reduction of clock speed

           FPGA Implementation
• 8 x 8 multiplier and 12-bit accumulator
   – 4.6ns worst-case routing delay
   – 12.1ns worst-case logic path delay
   – 80MHz clock rate
   – 2 MAC units can be accommodated in XC4013


                  Conclusion
• FPGA approach to digital filter implementation
  – Higher sampling rates than traditional DSP chips
  – Lower costs than ASICs for moderate volume
  – More flexibility
• MAC units on a single FPGA
• FIR Filter Implementation



Questions





				