
                                       Eric Tell, Anders Nilsson, Dake Liu

              Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
                                            {erite, andni, dake}

ABSTRACT
A fully programmable radio baseband processor architecture has been developed.
The architecture is based on an application-specific DSP processor and a number
of flexible hardware accelerators, connected via a configurable network. A large
degree of hardware reuse and careful selection of accelerators together with low
memory cost allow a very area and power efficient implementation. A demonstrator
chip for 802.11a/b/g physical layer baseband processing was implemented in
0.18 µm CMOS on a 5.0 mm² die with a core area of 2.9 mm², including all
memories.

1 INTRODUCTION
The large number of emerging radio standards and the convergence of wireless
products lead to an increased interest in Software Defined Radio (SDR) and
increased flexibility requirements for baseband processors [1]. At the same time,
with new demanding applications, such as Wireless LAN, 3G/4G mobile telephony
and Digital Video Broadcasting, a high degree of parallelism is needed. Power
consumption continues to be very important.
    The proposed architecture, which is based on a specialized DSP processor core
connected to a number of flexible hardware accelerators via a configurable
network, allows a good tradeoff between flexibility and performance and a very
area efficient implementation with low control overhead. Figure 1 gives an
overview of the architecture.
    The use of accelerators improves the efficiency of the architecture through
an increased degree of parallelism. Efficiency is further improved since the
processor core can focus on tasks more suitable for DSP software implementation,
e.g. multiply-accumulate based operations.
    Other programmable solutions, such as [2], [3] and [4], are typically based
on highly complex VLIW and/or multiple processor cores. The approach described
here leads to significantly less control overhead and memory size, resulting in
reduced area and power consumption. The described approach also leads to a much
higher degree of hardware reuse and better utilization of hardware components
than a corresponding fixed function hardware implementation such as [5].

[Block diagram: DSP core with 16 bit datapath, 16x16 bit register file (RF),
control registers, a complex MAC (cADD16, cMUL16x16 and cADD40 stages,
accumulators AR0 (40+40 bits), AR1, AR2, AR3) and port interface; data memories
DM1, DM2, DM3 (256x32) and CM (2048x32); a network connecting the accelerators
ADC, FIR, rotor, packet det., walsh, demap, interleave, conv.enc./viterbi,
scramble, CRC and MAC IF, the latter towards the application processor.]
Figure 1: Overview of the baseband processor. DMx, CM and IM are data memories

2 THE DSP CORE
The core processor has a 16 bit ALU and a specialized complex MAC unit. The
instruction set can be divided into three classes of instructions:

  • Ordinary RISC-style instructions operating on 16-bit values, or on
    16+16-bit complex values.
  • Vector instructions, operating on vectors of complex data.
  • Instructions for network and accelerator configuration and control.

    All instructions are 16 bits long, providing very efficient use of program
memory.

    Normal instruction (4 different formats):
      0    subtype (2-5)     instr. (2-6)    arguments (4-11)
    Vector instruction:
      10   instruction (4)   ports (3)       vector size (7)
    Accelerator instruction:
      11   accelerator (4)   accelerator instruction (10)

Figure 2: Encoding of basic instruction types in the BBP (number of bits in
parentheses)

    The main feature of the core is the complex multiply-accumulate unit (CMAC)
and the associated vector instructions. A very significant part of the
computations in a baseband processor are operations on vectors of complex
numbers (I/Q-pairs) such as auto- and cross-correlation, FIR filtering, FFT,
vector multiplication and complex absolute maximum search. The CMAC is optimized
for these types of operations. Each clock cycle it can execute, for example, two
complex multiply-accumulate operations, a radix-2 FFT butterfly, or two complex
absolute value calculations plus max search.
    The vector instructions allow a complete vector operation (e.g. a scalar
product or one layer of butterflies in an FFT) to be executed using only one
instruction. The vector size is given explicitly in the instruction word. A
vector instruction takes several clock cycles to complete (depending on vector
size). However, other instructions, e.g. ALU instructions or accelerator
configuration, can execute in parallel with a vector instruction, as illustrated
by figure 3. This often allows the control code of an algorithm to be “hidden”
behind multi-cycle vector instructions. Table 1 shows some benchmarks for the
core.

    MAC.32   port1,port2      ; 32 element vector dot product
    ADD      R0,R1            ; R0=R0+R1
    NWC      interl,viterbi   ; setup network connection
    ACL      viterbi,0xF2     ; accelerator control instruction
    IDLE     mac              ; wait for MAC.32 to finish

Figure 3: Assembly code example. The four instructions following the vector
instruction add no cycle cost.
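
The three instruction classes of Figure 2 can be told apart by their leading bits alone. The following is a minimal Python sketch of such a decoder, not the authors' implementation. It assumes that the fields are packed most significant bit first in the order Figure 2 lists them; the names `decode`, `encode_vector` and `Decoded` are hypothetical. The four normal-instruction formats are not detailed in the text, so their 15-bit payload is returned undecoded.

```python
from dataclasses import dataclass


@dataclass
class Decoded:
    kind: str     # "normal", "vector" or "accelerator"
    fields: dict  # field name -> integer value


def decode(word: int) -> Decoded:
    """Classify a 16-bit instruction word by its prefix bits (Figure 2).

    Assumption: fields are packed MSB-first in the order shown in Figure 2.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits long")
    if (word >> 15) == 0:
        # Prefix 0: normal instruction. The four sub-formats are not
        # specified in the text, so the remaining 15 bits stay raw.
        return Decoded("normal", {"payload": word & 0x7FFF})
    if (word >> 14) == 0b10:
        # Prefix 10: instruction (4) | ports (3) | vector size (7)
        return Decoded("vector", {
            "instruction": (word >> 10) & 0xF,
            "ports": (word >> 7) & 0x7,
            "vector_size": word & 0x7F,
        })
    # Prefix 11: accelerator (4) | accelerator instruction (10)
    return Decoded("accelerator", {
        "accelerator": (word >> 10) & 0xF,
        "accel_instruction": word & 0x3FF,
    })


def encode_vector(instruction: int, ports: int, vector_size: int) -> int:
    """Pack a vector instruction word (same bit-order assumption)."""
    if not (0 <= instruction < 16 and 0 <= ports < 8 and 0 <= vector_size < 128):
        raise ValueError("field out of range")
    return (0b10 << 14) | (instruction << 10) | (ports << 7) | vector_size


if __name__ == "__main__":
    # Opcode 3 and port code 5 are made-up values for illustration.
    word = encode_vector(instruction=3, ports=5, vector_size=32)
    print(hex(word), decode(word))
```

The field widths add up to 16 bits in both fully specified formats (2+4+3+7 and 2+4+10), and the 7-bit size field bounds the vector length that a single instruction word can state explicitly.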

Table 1: DSP core benchmark examples. Cycle cost includes memory addressing
setup and other overhead. No acceleration was used.

  Function                                            Clock cycles
  64-point FFT                                        205
  40 element vector add                               24
  40 el. scalar product                               24
  40 el. vector elementwise multiplication            24
  40 sample, 16 tap FIR filter                        404
  40 element complex absolute maximum value search    22

    The processor control path is similar to that of a rather simple
micro-controller-like processor with some added DSP and other special features.
It does not suffer from the control and communications overhead found in VLIW,
superscalar and other enhanced processor types. The vector control unit, which
is responsible for execution of the parallel multi-cycle vector instructions,
adds only little extra overhead.

3 THE ACCELERATOR NETWORK
Memories, accelerators and external interfaces are connected to the core via the
interconnect network. The network behaves like a crossbar switch and is
configured by the core using dedicated assembly instructions. This eliminates
the need for an arbiter and addressing logic, thus reducing the complexity of
the network and the accelerator interfaces, while still allowing many concurrent
communications.
    Each accelerator has one read port and one write port to the network. A
connection is set up by connecting one read port to one write port. The reading
unit requests one unit of data by asserting a ReadRequest signal during one
clock cycle and the transmitter uses a DataAvailable signal to indicate that new
data is available. The requesting unit may have up to two outstanding read
requests, but must then halt if no DataAvailable signal is received. This
protocol allows a new data item to be communicated every clock cycle but still
provides sufficient flow control.
    A chain of accelerators connected to each other via the network will
automatically synchronize and communicate without any interaction by the
processor. This allows truly
concurrent operation of the core and any number of accelerators, with zero
synchronization overhead in the core. This also minimizes the number of memory
accesses since no intermediate storage is needed when sending data between
accelerators.
    Accelerators can be configured via special accelerator instructions or via a
control register space.

Table 2: Firmware implementation results.

  Task      Req. freq.    Prog. size    Data mem.
  11a Tx    155 MHz       1020 bytes    3456 bytes
  11a Rx    160 MHz       1658 bytes    2340 bytes
  11b Tx    120 MHz       476 bytes     484 bytes
  11b Rx    110 MHz       1090 bytes    304 bytes
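
The ReadRequest/DataAvailable handshake of Section 3 can be illustrated with a small cycle-by-cycle model. The Python sketch below is not taken from the chip; it only encodes the two rules stated in the text (one request per cycle, at most two outstanding requests). The one-cycle response latency, the function name `simulate` and the `producer_ready` callback are assumptions made for illustration.

```python
MAX_OUTSTANDING = 2  # limit stated in the text


def simulate(num_items, producer_ready, max_cycles=1000):
    """Model one reader port connected to one transmitter port.

    producer_ready(cycle) -> bool tells whether the transmitter can assert
    DataAvailable in that cycle (assumed: at most one item per cycle, and a
    request can be answered one cycle after it was issued at the earliest).
    Returns (cycles_used, peak_outstanding, received_items).
    """
    outstanding = 0   # requests issued but not yet answered
    requested = 0     # total ReadRequest pulses so far
    received = []
    peak = 0
    for cycle in range(max_cycles):
        # Transmitter side: answer one pending request if data is ready.
        if outstanding > 0 and producer_ready(cycle):
            received.append(len(received))  # items are numbered 0, 1, 2, ...
            outstanding -= 1
        # Reader side: pulse ReadRequest unless the limit forces a halt.
        if requested < num_items and outstanding < MAX_OUTSTANDING:
            requested += 1
            outstanding += 1
        peak = max(peak, outstanding)
        if len(received) == num_items:
            return cycle + 1, peak, received
    raise RuntimeError("transfer did not finish within max_cycles")


if __name__ == "__main__":
    # Transmitter always ready: one item per cycle after a one-cycle start-up.
    print(simulate(8, lambda cycle: True))
    # Transmitter stalled until cycle 5: the reader halts at two outstanding
    # requests and resumes as soon as data arrives.
    print(simulate(8, lambda cycle: cycle >= 5)[:2])
```

In this model a transmitter that is always ready sustains one item per clock cycle, while a stalled transmitter makes the reader stop after its second unanswered request, which is the flow control behaviour the text describes.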

Using a number of small data memories gives enough memory bandwidth to keep the
core/CMAC and accelerators fully occupied. The network always gives a unit (core
or accelerator) exclusive memory access, thereby eliminating stall cycles due to
access conflicts. After finishing a task, the entire memory containing the
output can be “handed over” to an accelerator or interface by reconfiguration of
the network. This eliminates data moves between memories.
    Each memory has its own address generator, and addresses and addressing
modes are configured using the same interface as accelerator configuration. No
addressing information needs to be sent over the network.
    Reducing memory sizes and memory accesses was a major focus in the design,
since a large part of the power consumption in a programmable architecture takes
place in the memories. The small, and thereby fast, memories and the moderate
frequency eliminate the need for caches. Thereby a lot of control overhead is
avoided, and more importantly, execution time is completely predictable, which
is a major advantage in hard real time systems.

5 ACCELERATORS
A key issue is the choice of accelerators. This has previously been discussed in
[6]. The main factors to consider are: 1) the relation between the area of the
accelerator and the cycle cost for a pure software implementation of a function,
and 2) to which extent the accelerator can be reused between standards. The
reuse factor can often be improved by adding configurability to the accelerator.
The right choice of accelerators allows us to run the processor at a relatively
low frequency, which saves power. Even more power can be saved since lower
frequency may allow us to lower the supply voltage.

6 IMPLEMENTATION FOR WLAN
Figure 1 shows an implementation of the architecture for a converged
802.11a/b/g baseband processor.
    The program memory size is 4096x16 bits. Four identical 256x32 bit data
memories for complex data are connected to the network. Each of these memories
consists of two interleaved memory banks, allowing two consecutive addresses
(vector elements) to be accessed in parallel. These memories also have FFT
addressing support. A 2048x32 bit coefficient memory connected directly to the
core is intended for FFT and filter coefficients, look-up tables, and other data
not processed by accelerators. Using dual memory banks instead of dual port
memories saves power.
    The ADC/DAC interface accelerator contains a configurable decimation filter,
a rotor for carrier frequency offset compensation and a configurable packet
detector based on autocorrelation. The packet detector will wake the core from
idle mode when an incoming frame preamble is detected. These functions can be
reused between many standards. They also have to run continuously a large part
of the time, and especially decimation is quite demanding. Other accelerators
reused between the 11a and 11b standards are the scrambler and the MAC-layer
interface.

7 RESULTS
Firmware was implemented for 802.11a and 11b transceivers; results in terms of
memory usage and required frequency for different modules can be found in
table 2. The instruction set has proven to be very efficient. Only about half of
the available program memory is required to store the entire 11a and 11b
transceiver firmware on chip. Data memory requirements are also about half of
the available data memory. (Parts of the firmware as well as most of the data
memory are shared by Rx and Tx modules, so the actual requirements are less than
the sum of the numbers in the table.)
    The 11b receiver requires a lower clock frequency (or has more idle cycles
at a fixed frequency) than the 11b transmitter, at the highest data rate, due to
the acceleration of the modified Walsh transform, which is the most complex
operation in the receiver at the highest rate.
    An 802.11a/b/g baseband processor demonstrator chip, with accelerators for
ADC/DAC interface and frontend processing, demapping, interleaving, scrambling,
CRC, Walsh transform and MAC-layer interface, was implemented and manufactured
using a 0.18 µm CMOS standard cell library. The chip features and measured
performance can be found in table 3. Fig. 4 shows a die photo.
    The chip will function correctly at least up to 220 MHz, implying that
significant power can be saved by reducing supply voltage in a converged
802.11a/b/g transceiver running at 160 MHz (the required frequency for 54 Mbit/s
reception in 802.11a/g).
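
The "about half" figures can be checked against Table 2 and the memory sizes given for the WLAN implementation. The short Python sketch below simply sums the table rows, which is an upper bound since Rx and Tx share firmware and data memory. Treating the available data memory as the four 256x32 bit data memories plus the 2048x32 bit coefficient memory is an assumption; the text does not say which memories the data figures refer to.

```python
# Table 2 rows: task -> (program size in bytes, data memory in bytes)
FIRMWARE = {
    "11a Tx": (1020, 3456),
    "11a Rx": (1658, 2340),
    "11b Tx": (476, 484),
    "11b Rx": (1090, 304),
}

# Program memory is 4096 x 16 bits.
PROGRAM_MEMORY_BYTES = 4096 * 16 // 8

# Assumption: "available data memory" means four 256x32 bit data memories
# plus the 2048x32 bit coefficient memory.
DATA_MEMORY_BYTES = 4 * (256 * 32 // 8) + 2048 * 32 // 8

program_total = sum(prog for prog, _ in FIRMWARE.values())
data_total = sum(data for _, data in FIRMWARE.values())

program_fraction = program_total / PROGRAM_MEMORY_BYTES
data_fraction = data_total / DATA_MEMORY_BYTES

if __name__ == "__main__":
    print(f"program: {program_total} of {PROGRAM_MEMORY_BYTES} bytes "
          f"({program_fraction:.1%})")
    print(f"data:    {data_total} of {DATA_MEMORY_BYTES} bytes "
          f"({data_fraction:.1%})")
```

Under these assumptions both sums come out slightly above one half of the respective capacity, in line with the statement in the text once sharing between Rx and Tx is taken into account.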
Table 3: BBP chip feature summary

  Feature            Value
  Technology         0.18 µm CMOS
  Chip area          5 mm²
  Core area          2.9 mm²
  Memory area        1.0 mm²
  Logic area         1.9 mm²
  Max frequency      220 MHz
  Package            144 pin fpBGA
  Power @160 MHz:
    Idle             44 mW
    11a Rx burst     126 mW

Figure 4: Die photo

8 CONCLUSIONS
A programmable architecture for radio baseband processing has been presented.
The architecture enables very area efficient implementations of baseband
functions for multi-standard radio systems. The accelerator architecture
together with an efficient instruction set including vector instructions
minimizes program and data memory requirements. The accelerator chaining feature
further reduces memory accesses and provides a high degree of parallelism and
low control overhead. The programmable DSP core can support both OFDM systems,
e.g. 802.11a, and spread spectrum systems such as 802.11b.
    A demonstrator chip for wireless LAN applications has been fabricated. A
clock frequency of 160 MHz is required to support 802.11a reception at the
highest data rate. The core area of 2.9 mm² is smaller than that of existing
programmable and non-programmable solutions.
    The power consumption is low considering that all logic was synthesized from
VHDL and low power design techniques such as clock gating were not used.
    The processor is flexible enough to also support standards with lower data
rate, such as GSM/GPRS and Bluetooth, and firmware for these standards is
currently being developed.

REFERENCES
[1]
[2] J. Kneip, Single Chip Programmable Baseband ASSP for 5 GHz Wireless LAN
    Applications, IEICE Trans. Electron., vol. E85-C, no. 2, February 2002.
[3] J. Glossner et al., A Software-Defined Communications Baseband Chip, IEEE
    Communications Magazine, January 2003.
[4] S. Rajagopal et al., A Programmable Baseband Processor Design for Software
    Defined Radios, Proc. Midwest Symposium on Circuits and Systems (MWSCAS
    2002), pp. III-413 - III-416.
[5] T. Fujisawa et al., A Single-Chip 802.11a MAC/PHY with a 32b RISC Processor,
    ISSCC Dig. Tech. Papers, pp. 144-145, Feb. 2003.
[6] Anders Nilsson, Eric Tell, Dake Liu, An Accelerator Structure for
    Multi-Standard Programmable Baseband Processors, Proc. of IASTED Intl.
    Multi-Conf. on Wireless and Optical Com., pp. 644-649, July 2004.
