


Cover Feature

DSP Processors Hit the Mainstream

Jennifer Eyre and Jeff Bier
Berkeley Design Technology, Inc. (BDTI)

Computer, August 1998

Increasingly affordable digital signal processing extends the functionality of embedded systems and so will play a larger role in new consumer products. This tutorial explains what DSP processors are and what they do. It also offers a guide to evaluating them for use in a product or application.

Engineering terminology has a way of creeping into the public tongue, often initially by way of product marketing. For example, the time is long gone when only a few people were familiar with the unit "megahertz." Although people are perhaps not entirely certain what a megahertz is, they are perfectly comfortable discussing and comparing the megahertz ratings of their computers. In a similar way, many people became familiar with the word "digital" when companies introduced CD players in the 1980s.

These days, the once obscure engineering term "DSP" (short for digital signal processing) is also working its way into common use. It has begun to crop up on the labels of an ever wider range of products, from home audio components to answering machines. This is not merely a reflection of a new marketing strategy, however; there truly is more digital signal processing inside today's products than ever before.

Consider this: Maxtor Corp. recently reported receiving its 10-millionth DSP processor from Texas Instruments for use in its disk drives. As further evidence, Forward Concepts, a DSP market research firm, reports that the 1997 market for DSP processors was approximately 3 billion dollars. But why is the market for DSP processors booming?

The answer is somewhat circular: As microprocessor fabrication processes have become more sophisticated, the cost of a microprocessor capable of performing DSP tasks has dropped to the point where such a processor can be used in consumer products and other cost-sensitive systems. As a result, more and more products have begun using DSP processors, fueling demand for faster, smaller, cheaper, more energy-efficient chips. These smaller, cheaper, more efficient chips in turn open the door for a new wave of products to implement signal-processing capabilities. It's like a positive feedback loop. There has always been much potential benefit to adding signal processing capabilities to products, but until recently, it's simply been too expensive to be practical in most cases.

WHAT MAKES IT A DSP PROCESSOR?

Although fundamentally related, DSP processors are significantly different from general-purpose processors (GPPs) like the Intel Pentium or IBM/Motorola PowerPC. To understand why, you need to know what is involved in signal processing. What is it about signal processing computations that spurred the development of a different type of microprocessor?

Signal filtering

As a case study, we'll consider one of the most common functions performed in the digital domain, signal filtering, which is simply manipulating a signal to improve its characteristics. For example, filtering can remove noise or static from a signal, thereby improving its signal-to-noise ratio. It may not be obvious why it is desirable to filter signals using a microprocessor rather than analog components, but consider the advantages:

• Analog filters (and analog circuitry in general) are subject to behavior variation depending on environmental factors, such as temperature. Digital filters are essentially immune to such environmental effects.
• Digital filters are easily duplicated to within very tight tolerances, since their behavior does not depend on a combination of components, each of which deviates to some degree from its nominal behavior.
• Once it is manufactured, the defining characteristics of an analog filter (such as its pass-band frequency range) are not easily changed. By implementing a filter digitally using a microprocessor, you can change filter characteristics simply by reprogramming the device.

There are several kinds of digital filters; one commonly used type is called a finite impulse response (FIR) filter, illustrated in Figure 1. The mechanics of the basic FIR filter algorithm are straightforward. The blocks labeled D in Figure 1 are unit delay operators; their output is a copy of the input sample, delayed by one sample period. A series of storage elements (usually memory locations) is used to implement a series of these delay elements (this series is called a delay line).1

[Figure 1. The finite impulse response (FIR) filter is a typical DSP algorithm. FIR filters are useful for filtering noise and other important functions. The filter computes the output sample yN = xN × c1 + xN−1 × c2 + … + x2 × cN−1 + x1 × cN.]

At any given time, N−1 of the most recently received input samples reside in the delay line, where N is the total number of input samples used in the computation of each output sample. Input samples are designated xN; the first input sample is x1, the next is x2, and so on.

Each time a new input sample arrives, the FIR filter operation shifts previously stored samples one place to the right along the delay line. It then computes a new output sample by multiplying the newly arrived sample and each of the previously stored input samples by the corresponding coefficient. In the figure, coefficients are represented as ck, where k is the coefficient number. The summation of the multiplication products forms the new output sample, yN.

We call the combination of a single delay element, the associated multiplication operation, and the associated addition operation a tap. The number of taps and the values chosen for the coefficients define the filter characteristics. For example, if the values of the coefficients are all equal to the reciprocal of the number of taps, 1/N, the filter performs an averaging operation, one form of a low-pass filter. More commonly, developers use filter design methods to determine coefficients that yield a desired frequency response for the filter.

In mathematical terms, FIR filters perform a series of dot products: They take an input vector and a vector of coefficients, perform pointwise multiplication between the coefficients and a sliding window of input samples, and accumulate the results of all the multiplications to form an output sample. This brings us to the most popular operation in DSP: the multiply-accumulate (MAC).

Handling MACs

To implement a MAC efficiently, a processor must perform multiplications efficiently. GPPs were not originally designed for multiplication-intensive tasks—even some modern GPPs require multiple instruction cycles to complete a multiplication because they don't have dedicated hardware for single-cycle multiplication. The first major architectural modification that distinguished DSP processors from the early GPPs was the addition of specialized hardware that enabled single-cycle multiplication.2

DSP processor architects also added accumulator registers to hold the summation of several multiplication products. Accumulator registers are typically wider than other registers, often providing extra bits, called guard bits, to avoid overflow.

To take advantage of this specialized multiply-accumulate hardware, DSP processor instruction sets nearly always include an explicit MAC instruction. This combination of MAC hardware and a specialized MAC instruction were two key differentiators between early DSP processors and GPPs.

Memory architectures

Another highly visible difference between DSP processors and GPPs lies in their memory structure.

von Neumann architecture. Traditionally, GPPs have used a von Neumann memory architecture,3 illustrated by Figure 2a. In the von Neumann architecture, there is one memory space connected to the processor core by one bus set (an address bus and a data bus). This works perfectly well for many computing applications; the memory bandwidth is sufficient to keep the processor fed with instructions and data.

The von Neumann architecture is not a good design for DSP, however, because typical DSP algorithms require more memory bandwidth than the von Neumann architecture can provide.

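To make the delay-line and multiply-accumulate mechanics concrete, here is a minimal C sketch of the FIR computation described above. The function name, the use of double samples, and the explicit shift loop are illustrative choices of ours, not from the article; a real DSP implementation would use fixed-point types and would avoid the shift loop entirely by using circular addressing.

```c
#include <stddef.h>

/* Compute one output sample of an N-tap FIR filter.
 * delay[0] holds the newest sample, delay[n-1] the oldest;
 * coeff[k] corresponds to coefficient c(k+1) in Figure 1. */
double fir_tap_sum(double *delay, const double *coeff, size_t n,
                   double new_sample)
{
    /* Shift the delay line one place; the oldest sample falls off the end. */
    for (size_t i = n - 1; i > 0; i--)
        delay[i] = delay[i - 1];
    delay[0] = new_sample;

    /* One multiply-accumulate (MAC) per tap. */
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += delay[i] * coeff[i];
    return acc;
}
```

With all coefficients equal to 1/N, this reduces to the averaging (low-pass) filter mentioned earlier. On a DSP processor, the inner loop would typically be a single repeated MAC instruction whose parallel data moves fetch the next sample and coefficient in the same cycle.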
[Figure 2. Many general-purpose processors use (a) a von Neumann memory architecture, with a single memory connected to the processor core by one address bus and one data bus, which permits only one access to memory at a time. While adequate for general-purpose applications, the von Neumann architecture's memory bandwidth is insufficient for many DSP applications. DSP processors typically use (b) a Harvard memory architecture, with two memories (A and B) and two bus sets, which permits multiple simultaneous memory accesses.]

For example, to sustain a throughput of one FIR filter tap per instruction cycle (the hallmark of DSP processor performance), the processor must complete one MAC and make several accesses to memory within one instruction cycle. Specifically, in the straightforward case, the processor must

• fetch the MAC instruction,
• read the appropriate sample value from the delay line,
• read the appropriate coefficient value, and
• write the sample value to the next location in the delay line, in order to shift data through the delay line.

Thus, the processor must make a total of four accesses to memory in one instruction cycle. (In practice, most processors use various techniques to reduce the actual number of memory accesses needed to three or even two per tap. Nevertheless, virtually all processors require multiple memory accesses within one instruction cycle to compute an FIR filter at a sustained rate of one tap per instruction cycle.)

In a von Neumann memory architecture, four memory accesses would consume a minimum of four instruction cycles. Although most DSP processors include the arithmetic hardware necessary to perform single-cycle MACs, they wouldn't be able to realize the goal of one tap per cycle using a von Neumann memory structure; the processor simply wouldn't be able to retrieve samples and coefficients fast enough. For this reason, instead of a von Neumann architecture, most DSP processors use some form of Harvard architecture, illustrated by Figure 2b.

Harvard architecture. In a Harvard memory architecture, there are two memory spaces, typically partitioned as program memory and data memory (though there are modified versions that allow some crossover between the two). The processor core connects to these memory spaces by two bus sets, allowing two simultaneous accesses to memory. This arrangement doubles the processor's memory bandwidth, and it is crucial to keeping the processor core fed with data and instructions. The Harvard architecture is sometimes further extended with additional memory spaces and/or bus sets to achieve even higher memory bandwidth. However, the trade-off is that extra bus sets require extra power and chip space, so most DSP processors stick with two.

The Harvard memory architecture used in DSP processors is not unlike the memory structures used in modern high-performance GPPs such as the Pentium and PowerPC. Like DSPs, high-performance GPPs often need to make multiple memory accesses per instruction cycle (because of superscalar execution and/or because of instructions' data requirements). In addition, high-performance GPPs face another problem: With clock rates often in excess of 200 MHz, it can be extremely expensive (and sometimes impossible) to obtain a memory chip capable of keeping pace with the processor. Thus, high-performance GPPs often cannot access off-chip memory at their full clock speed. Using on-chip cache memory is one way that GPPs address both of these issues.

High-performance GPPs typically contain two on-chip memory caches—one for data and one for instructions—which are directly connected to the processor core. Assuming that the necessary information resides in cache, this arrangement lets the

processor retrieve instruction and data words at full speed without accessing relatively slow, off-chip memory. It also enables the processor to retrieve multiple instructions or data words per instruction cycle.

Physically, this combination of dual on-chip memories and bus connections is nearly identical to a Harvard memory architecture. Logically, however, there are some important differences in the way DSP processors and GPPs with caches use their on-chip memory structures.

Difference from GPPs. In a DSP processor, the programmer explicitly controls which data and instructions are stored in the on-chip memory banks. Programmers must write programs so that the processor can efficiently use its dual bus sets. In contrast, GPPs use control logic to determine which data and instruction words reside in the on-chip cache, a process that is typically invisible to the programmer. The GPP programmer typically does not specify (and may not know) which instructions and data will reside in the caches at any given time. From the GPP programmer's perspective, there is generally only one memory space rather than the two memory spaces of the Harvard architecture.

Most DSP processors don't have any cache; as described earlier, they use multiple banks of on-chip memory and multiple bus sets to enable several memory accesses per instruction cycle. However, some DSP processors do include a very small, specialized, on-chip instruction cache, separate from the on-chip memory banks and inside the core itself. This cache stores instructions used in small inner loops so that the processor doesn't have to use its on-chip bus sets to retrieve instruction words. By fetching instructions from the cache, the DSP processor frees both on-chip bus sets to retrieve data words.

Unlike GPPs, DSP processors almost never incorporate a data cache. This is because DSP data is typically streaming: That is, the DSP processor performs computations with each data sample and then discards the sample, with little reuse.

Zero-overhead looping

It may not be obvious why a small, specialized instruction cache would be particularly useful, until you realize that one common characteristic of DSP algorithms is that most of the processing time is spent executing instructions contained within relatively small loops. In an FIR filter, for example, the vast majority of processing takes place within a very small inner loop that multiplies the input samples by their corresponding coefficients and adds the results. This is why the small on-chip instruction cache can significantly improve the processor's performance on DSP algorithms. It is also why most DSP processors include specialized hardware for zero-overhead looping. The term zero-overhead looping means that the processor can execute loops without consuming cycles to test the value of the loop counter, perform a conditional branch to the top of the loop, and decrement the loop counter.

In contrast, most GPPs don't support zero-overhead hardware looping. Instead, they implement looping in software. Some high-performance GPPs achieve nearly the same effect as hardware-supported zero-overhead looping by using branch prediction hardware. This method has drawbacks in the context of DSP programming, however, as discussed later.

Fixed-point computation

Most DSP processors use fixed-point arithmetic rather than floating-point. This may seem counterintuitive, given that DSP applications must pay careful attention to numeric fidelity, which is much easier to do with a floating-point data path. DSP processors, however, have an additional imperative: They must be inexpensive. Fixed-point machines tend to be cheaper (and faster) than comparable floating-point machines. To maintain numeric accuracy without the benefit of a floating-point data path, DSP processors usually include, in both the instruction set and underlying hardware, good support for saturation arithmetic, rounding, and shifting.

Specialized addressing

DSP processors often support specialized addressing modes that are useful for common signal-processing operations and algorithms. Examples include modulo (circular) addressing (which is useful for implementing digital-filter delay lines) and bit-reversed addressing (which is useful for performing a commonly used DSP algorithm, the fast Fourier transform). These highly specialized addressing modes are not often found on GPPs, which must instead rely on software to implement the same functionality.

EXECUTION TIME PREDICTABILITY

Aside from differences in the specific types of processing performed by DSPs and GPPs, there are also differences in their performance requirements. In most non-DSP applications, performance requirements are typically given as a maximum average response time. That is, the performance requirements do not apply to every transaction, but only to the overall performance.

Hard real-time constraints

In contrast, the most popular DSP applications (such as cell phones and modems) are hard real-time applications—all processing must take place within some specified amount of time in every instance.

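The modulo (circular) addressing described under "Specialized addressing" can be emulated in portable C. The sketch below is hypothetical (the type names and buffer size are our own): on a DSP processor the address-generation unit performs the wraparound in hardware, so no cycles are spent on the modulo operation.

```c
#include <stddef.h>

#define TAPS 8

/* Circular delay line: instead of shifting N-1 samples on every new input,
 * overwrite the oldest sample and wrap the index modulo the buffer size.
 * A DSP's address-generation unit performs this wrap in hardware. */
struct delay_line {
    double buf[TAPS];
    size_t head;              /* index of the newest sample */
};

void delay_push(struct delay_line *d, double sample)
{
    d->head = (d->head + 1) % TAPS;   /* modulo addressing, in software */
    d->buf[d->head] = sample;
}

/* Return the k-th most recent sample, k = 0 .. TAPS-1. */
double delay_get(const struct delay_line *d, size_t k)
{
    return d->buf[(d->head + TAPS - k) % TAPS];
}
```

Pairing this buffer with the MAC loop shown earlier replaces the per-sample shift with a single index update, which is exactly the saving that hardware circular addressing provides for digital-filter delay lines.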
applications—all processing must take place within            grammers to confidently push the chip’s perfor-
                                                                                                                           Most DSP
some specified amount of time in every instance. This          mance limits.
performance constraint requires programmers to                                                                       applications depend
determine exactly how much processing time each               FIXED-POINT DSP INSTRUCTION SETS                       on processing taking
sample will require; or at the very least, how much             Fixed-point DSP processor instruction sets            place within some
time will be consumed in the worst-case scenario. At          are designed with two goals in mind. They must
first blush, this may not seem like a particularly impor-
                                                                                                                     specified amount of
tant point, but it becomes critical if you attempt to use       • enable the processor to perform multiple               time in every
a high-performance GPP to perform real-time signal                operations per instruction cycle, thus                instance. This
processing.                                                       increasing per-cycle computational effi-              execution time
   Execution time predictability probably won’t be an             ciency (a goal also supported by endowing
issue if you plan to use a low-cost GPP for real-time             the processor with multiple execution units          predictability is
DSP tasks, because low-cost GPPs (like DSP proces-                capable of parallel operation), and                 difficult to provide
sors) have relatively straightforward architectures and         • minimize the amount of memory space                on high-performance
easy-to-predict execution times. However, most real-              required to store DSP programs (critical in
time DSP applications require more horsepower than                cost-sensitive DSP applications because
low-cost GPPs can provide, so the developer must                  memory contributes substantially to the
choose either a DSP processor or a high-performance               overall system cost).
GPP. But which is the best choice? In this context, exe-
cution-time predictability plays an important role.           To accomplish these goals, DSP processor instruction
Problems with some GPPs
   For example, some high-performance GPPs incorporate complicated algorithms that use branching history to predict whether a branch is likely to be taken. The processor then executes instructions speculatively, on the basis of that prediction. This means that the same section of code may consume a different number of instruction cycles, depending on events that take place beforehand.
   When a processor design layers many different dynamic features, such as branch prediction and caching, on top of each other, it becomes nearly impossible to predict how long even a short section of code will take to execute. Although it may be possible for programmers to determine the worst-case execution time, this may be an order of magnitude greater than the actual execution time. Assuming worst-case behavior can force programmers to be extremely conservative in implementing real-time DSP applications on high-performance GPPs. A lack of execution-time predictability also adversely affects the programmer's (or compiler's) ability to optimize code, as we will discuss later.

Easy for DSP processors
   Let's compare this to the effort required to predict execution times on a DSP processor. Some DSP processors use caches, but the programmer (not the processor) decides which instructions go in them, so it's easy to tell whether instructions will be fetched from cache or from memory. DSP processors don't generally use dynamic features such as branch prediction and speculative execution. Hence, predicting the amount of time required by a given section of code is fairly straightforward on a DSP. This execution-time predictability allows pro-

sets generally allow programmers to specify several parallel operations in a single instruction. However, to keep word size small, the instructions only permit the use of certain registers for certain operations and do not allow arbitrary combinations of operations. The net result is that DSP processors tend to have highly specialized, complicated, and irregular instruction sets.
   To keep the processor fed with data without bloating program size, DSP processors almost always allow the programmer to specify one or two parallel data moves (along with address pointer updates) in parallel with certain other operations, like MACs.
   Typical instruction. As an illustration of a typical DSP instruction, consider the following Motorola DSP56300 instruction (X and Y denote the two memory spaces of the Harvard architecture):

   MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)+N4,Y0

This instruction directs the DSP56300 to

   • multiply the contents of registers X0 and Y0,
   • add the result to a running total kept in accumulator A,
   • load register X0 from the X memory location pointed to by register R0,
   • load register Y0 from the Y memory location pointed to by R4,
   • postincrement R0 by one, and
   • postincrement R4 by the contents of register N4.

   This single instruction line includes all the operations needed to calculate an FIR filter tap. It is clearly a highly specialized instruction designed specifically for DSP applications.
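Written out as ordinary sequential code, the work packed into that one MAC line is the inner loop of an FIR filter: each iteration performs the multiply, the accumulate, the two operand fetches, and the two index updates that the DSP56300 fuses into a single instruction. The Python sketch below is purely illustrative (the function name and data layout are ours, not Motorola's):

```python
def fir(delay_line, coeffs):
    """Compute one FIR output sample as the dot product of the
    delay line with the coefficient table.

    Each loop pass mirrors the single DSP56300 MAC instruction
    shown above: one multiply, one accumulate, two operand
    fetches, and two pointer (index) updates.
    """
    acc = 0.0                   # running total, like accumulator A
    for k in range(len(coeffs)):
        x = delay_line[k]       # fetch from "X memory", index advances
        c = coeffs[k]           # fetch from "Y memory", index advances
        acc += x * c            # the multiply-accumulate itself
    return acc

# One output of a 4-tap averaging filter:
print(fir([1.0, 2.0, 3.0, 4.0], [0.25, 0.25, 0.25, 0.25]))  # → 2.5
```

On a conventional GPP each of those steps typically costs at least one instruction, which is exactly the efficiency gap the surrounding text describes.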

                                                                                                                   August 1998              55
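The parallel pointer updates in that MAC line exist to keep an FIR delay line cheap to maintain. Many DSP processors go one step further and offer modulo (circular) addressing, which wraps an address register around a buffer boundary in hardware. A plain-code sketch of the idea (illustrative only, not a model of any particular chip; the class name is ours):

```python
class DelayLine:
    """Circular buffer: the newest sample overwrites the oldest.

    The modulo update of self.head plays the role of a DSP's
    modulo-addressing hardware, which performs the wraparound
    with no extra instructions in the loop.
    """
    def __init__(self, length):
        self.buf = [0.0] * length
        self.head = 0

    def push(self, sample):
        self.buf[self.head] = sample
        self.head = (self.head + 1) % len(self.buf)  # hardware-style wrap

    def contents_newest_first(self):
        n = len(self.buf)
        return [self.buf[(self.head - 1 - k) % n] for k in range(n)]

d = DelayLine(3)
for s in [1.0, 2.0, 3.0, 4.0]:
    d.push(s)
print(d.contents_newest_first())  # → [4.0, 3.0, 2.0]
```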

The price for this efficiency is an instruction set that is neither intuitive nor easy to use (in comparison to typical GPP instruction sets).
   Difference from GPPs. GPP programmers typically don't care about a processor instruction set's ease of use because they generally develop programs in a high-level language, such as C or C++. Life isn't quite so simple for DSP programmers, because mainstream DSP applications are written (or at least have portions optimized) in assembly language. This turns out to have important implications in comparing processor performance, but we'll get to that later.
   There are two main reasons why DSP processors usually aren't programmed in high-level languages. First, most widely used high-level languages, such as C, are not well suited for describing typical DSP algorithms. Second, the complexity of DSP architectures, with their multiple memory spaces, multiple buses, irregular instruction sets, and highly specialized hardware, makes it difficult to write efficient compilers.
   It is certainly true that a compiler can take C source code and generate functional assembly code for a DSP, but to get efficient code, programmers invariably optimize the program's critical sections by hand. DSP applications typically have very high computational demands coupled with strict cost constraints, making program optimization essential. For this reason, programmers often consider the "palatability" (or lack thereof) of a DSP processor's instruction set as a key component in its overall desirability.
   Because DSP applications require highly optimized code, most DSP vendors provide a range of development tools to assist DSP processor programmers in the optimization process. For example, most DSP processor vendors provide processor simulation tools that accurately model the processor's activity during every instruction cycle. This is a valuable tool both for ensuring real-time operation and for code optimization.
   GPP vendors, on the other hand, don't usually provide this type of development tool, mainly because GPP programmers typically don't need this level of detailed information. The lack of a cycle-accurate simulator for a GPP can be a real problem for DSP application programmers. Without one, it can be nearly impossible to predict the number of cycles a high-performance GPP will require for a given task. Think about it: If you can't tell how many cycles are required, how can you tell if the changes you make are actually improving code performance?

TODAY'S DSP LANDSCAPE
   Like GPPs, the performance and price of DSP processors vary widely [4].

Low-cost workhorses
   In the low-cost, low-performance range are the industry workhorses. Included in this group are Analog Devices' ADSP-21xx, Texas Instruments' TMS320C2xx, and Motorola's DSP560xx families. These processors generally operate at around 20 to 50 native MIPS (that is, a million instructions per second, not Dhrystone MIPS) and provide good DSP performance while maintaining very modest power consumption and memory usage. They are typically used in consumer products that have modest DSP performance requirements and stringent energy consumption and cost constraints, like disk drives and digital answering machines.

Low-power midrange
   Midrange DSP processors achieve higher performance through a combination of increased clock speed and more sophisticated hardware. DSP processors like the Lucent Technologies DSP16xx and Texas Instruments TMS320C54x operate at 100 to 120 native MIPS and often include additional features, such as a barrel shifter or instruction cache, to improve performance on common DSP algorithms.
   Processors in this class also tend to have more sophisticated (and deeper) pipelines than their lower performance cousins. These processors can have substantially better performance while still keeping energy and power consumption low. Processors in this performance range are typically used in wireless telecommunications applications and high-speed modems, which have relatively high computational demands but often require low power consumption.

Diversified high-end
   Now we come to the high-end DSP processors. It is in this group that DSP architectures really start to branch out and diversify, propelled by the demand for ultrafast processing. DSP processor architects who want to improve performance beyond the gains afforded by faster clock speeds must get more useful DSP work out of every instruction cycle. Of course, architects designing high-performance GPPs are motivated by the same goal, but the additional goals of maintaining execution-time predictability, minimizing program size, and limiting energy consumption typically do not constrain their design decisions. There are several ways DSP processor architects increase the amount of work accomplished in each cycle; we discuss two approaches here.
   Enhanced conventional DSP processor. The first approach is to extend the traditional DSP architecture by adding more parallel computational units to the data path, such as a second multiplier or adder. This approach requires an extended instruction set that takes advantage of the additional hardware by encoding even more operations in a single instruction and executing them in parallel. We refer to this type of processor as an enhanced conventional DSP because the approach is an extension of the established DSP architectural style.
   The Lucent Technologies DSP16210, which has two multipliers, an arithmetic logic unit, an adder (separate from the ALU), and a bit manipulation unit, is a prime example of this approach. Lucent also equipped the DSP16210 with two 32-bit data buses, enabling it to retrieve four 16-bit data words from memory in every instruction cycle (assuming the words are retrieved in pairs). These wider buses keep the dual multipliers and other functional units from starving for data. The DSP16210, which executes at 100 native MIPS, offers a strong boost in performance while maintaining a cost and energy footprint similar to previous generations of DSP processors. It is specifically targeted at high-performance telecommunications applications, and it includes specialized hardware to accelerate common telecommunications algorithms.
   Multiple-instruction issue. Another way to get more work out of every cycle is to issue more than one instruction per instruction cycle. This is common in high-end GPPs, which are often 2- or even 4-way superscalar (they can issue and execute up to 2 or 4 instructions per cycle). It's a relatively new technique in the DSP world, however, and has mostly been implemented using VLIW (very long instruction word) rather than superscalar architectures.
   A prime example of this approach is the much-publicized Texas Instruments TMS320C6201. This VLIW processor pulls in up to 256 bits of instruction words at a time, breaks them into as many as eight 32-bit subinstructions, and passes them to its eight independent computational units. In the best case, all eight units are active simultaneously, and the processor executes eight subinstructions in parallel.
   The TMS320C6201 has a projected clock rate of 200 MHz, which translates into a peak MIPS rating of 1,600. The catch here is that each subinstruction is extremely simple (by DSP standards). Thus, it may take several TMS320C6201 instructions to specify the same amount of work that a conventional DSP can specify in a single instruction. In addition, it is often not possible to keep all eight execution units running in parallel; more typically, five or six will be active at any one time. The performance gain afforded by this VLIW approach combined with a high clock rate is substantial. However, it is not nearly as high as you might expect from comparing the 1,600 MIPS rating with the 100 MIPS rating of the Lucent DSP16210. This disparity arises because a typical DSP16210 instruction accomplishes more work than a typical TMS320C6201 instruction, a critical distinction that simple metrics such as MIPS fail to capture.
   Like most VLIW processors, the TMS320C6201 consumes much more energy than traditional DSP processors and requires a relatively large amount of program memory. For these reasons, the chip is not well suited for portable applications. TI gave up energy and memory efficiency in exchange for ultrahigh performance, producing a processor intended for line-powered applications, such as modem banks, where it can take the place of several lower performance DSP processors.

GPPS GET DSP
   In the past few years, GPP developers have begun enhancing their processor designs with DSP extensions. For example, Intel added MMX extensions to the Pentium, which specifically support DSP and other multimedia tasks. MMX also gives the Pentium SIMD (single instruction, multiple data) capabilities, significantly improving the processor's speed on DSP algorithms.
   Many low- and moderate-performance GPPs are now available in versions that include DSP hardware, resulting in hybrid GP/DSP processors. Hitachi, for example, offers a DSP-enhanced version of its SH-2 microcontroller, called the SH-DSP. Advanced RISC Machines, a vendor of licensable microcontroller cores for use in custom chips, recently introduced a licensable DSP coprocessor core called Piccolo. Piccolo is intended to augment ARM's low-end GPP, the ARM7, with signal-processing capabilities. In short, just about all of the major GPP vendors are adding DSP enhancements to their processors in one form or another, and the distinction between GPPs and DSP processors is not quite as clear as it once was.

WHICH ONE IS BETTER?
   There are a variety of metrics you can use to judge processor performance. The most often cited is speed, but other metrics, such as energy consumption or memory usage, can be equally important, especially in embedded-system applications. Like developers using GPPs, DSP engineers must be able to accurately compare many facets of processor performance so that they can decide which processor to choose.
   In light of the ever-increasing number of processor families for DSP applications, it has become more difficult than ever for system designers to choose the processor that will provide the best performance in a given application.
   In the past, DSP designers have relied on MIPS or similar metrics for a rough idea of the relative horsepower provided by various DSP chips. Unfortunately, as processor architectures have diversified, traditional metrics such as MIPS have become less relevant and often downright misleading.
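The arithmetic behind that warning is simple: a processor's useful work rate is its instruction rate times the work done per instruction. The numbers below are made up purely to illustrate the shape of the problem; they are not taken from any datasheet.

```python
# Hypothetical processors; the figures are illustrative, not vendor data.
def macs_per_second(mips, macs_per_instruction):
    """Useful work rate = instruction rate x work per instruction."""
    return mips * 1e6 * macs_per_instruction

conventional = macs_per_second(mips=100, macs_per_instruction=2.0)    # dual-MAC DSP
vliw         = macs_per_second(mips=1600, macs_per_instruction=0.25)  # simple subinstructions

print(conventional, vliw)   # 200 million vs. 400 million MACs/s
print(vliw / conventional)  # 2.0: a 16x MIPS ratio buys only 2x the work here
```

Comparing the two processors by MIPS alone (16 to 1) says nothing about the second factor, which is exactly the information a MIPS rating leaves out.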

[Figure 3 — bar chart. BDTImark scores plotted against rated speed for: ADI ADSP-2183 (52 MIPS), ARM ARM7TDMI/Piccolo (70 MIPS), Hitachi SH-DSP (66 MIPS), Intel Pentium (233 MHz), Lucent DSP1620 (120 MIPS), Lucent DSP16210 (100 MIPS), Motorola DSP56011 (47.5 MIPS), Motorola PowerPC 604e (350 MHz), TI TMS320C204 (40 MIPS), TI TMS320C6201 (1,336 MIPS), TI TMS320VC549 (100 MIPS).]
Figure 3. Comparison of selected commercial DSP, general-purpose, and hybrid GP/DSP processors by BDTImarks and MIPS (or MHz for GPPs). BDTImark is a composite measure based on measurements from a set of DSP-specific benchmarks. Note how MIPS ratios do not necessarily provide a good indication of comparative DSP performance. Scores for other processors are available at http://www.bdti.com.

Engineers must be wary of the performance claims presented in sales brochures; all MIPS are not created equal.
   The root problem with simple metrics like MIPS is that they don't actually measure performance, because performance is a function of more than just the number of instructions executed per second. Performance is also a function of how much work is accomplished in each instruction, which depends on the processor's specific architecture and instruction set. Thus, when processor vendors cite MIPS ratings, they are leaving out a crucial piece of information: the amount of work each instruction performs.
   Clearly, engineers need a way to gauge processor performance on DSP algorithms that isn't tied to a specific architecture. But what's the best way to do that?
   One possibility would be to implement complete DSP applications on each processor under consideration, and compare the amount of time each requires to complete the given task. This method is often used for benchmarking general-purpose computer systems that run applications written in a high-level language. Once developers finish an application, they can easily recompile it for different computers and measure the run time. However, for reasons discussed earlier, DSP systems are usually programmed (at least to some degree) in assembly language.
   This makes full-application benchmarks unattractive for DSP performance comparisons for two main reasons:

   • If the application is programmed in a high-level language, the quality of the compiler will greatly affect performance results. Hence, the benchmark would measure both the processor and the compiler.
   • Although it's certainly possible to develop and optimize entire applications in assembly code, it is impractical for the purposes of comparing processor performance because the application would have to be recoded and optimized on every processor under consideration.

   Fortunately, one of the characteristics of DSP applications is that the majority of the processing effort is often concentrated in a few relatively small pieces (kernels) of the program, which can be isolated and used as the basis for benchmark comparisons.
   Berkeley Design Technology Inc. (BDTI), a DSP technology analysis and software development firm, has developed a DSP benchmarking methodology based on a group of DSP algorithm kernels.
   Algorithm kernels, such as the FIR filter algorithm, form the heart of many DSP applications. They make good benchmark candidates because of their inherent relevance to DSP and because they are small enough to implement and optimize in assembly code in a reasonable amount of time.
   Over the past six years, BDTI has implemented its suite of kernel-based benchmarks, the BDTI Benchmarks, on a wide variety of processors. By looking at benchmark results, designers can see exactly which algorithm kernels a given processor performs efficiently. Given information about the relative importance of each algorithm in the overall application (we refer to this information as an application profile), system designers can accurately assess which DSP is best for the application under consideration.
   The results of BDTI's comprehensive benchmarking effort provide an extremely detailed, in-depth analysis of processors' performance on typical DSP algorithms. However, we still would like a single-number metric for quick comparisons. For this reason, BDTI introduced and trademarked a new composite speed metric, the BDTImark, last year.

The BDTImark
   The BDTImark takes the execution-time results from all of the BDTI Benchmarks and crunches them into one number.
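How might many per-kernel times be crunched into one number? The article does not give BDTI's actual formula, so the sketch below uses a generic composite-scoring technique instead: a geometric mean of per-kernel speedups against a reference processor, which is scale-free and keeps any single kernel from dominating the result. All numbers are invented.

```python
from math import prod

def composite_score(times, reference_times):
    """Fold per-benchmark execution times into one number.

    Generic illustration of a composite speed metric (not BDTI's
    published method): the geometric mean of each kernel's speedup
    relative to a reference processor.
    """
    speedups = [ref / t for t, ref in zip(times, reference_times)]
    return prod(speedups) ** (1.0 / len(speedups))

# Hypothetical kernel times (microseconds) versus a reference processor:
ref  = [10.0, 40.0, 8.0]
fast = [ 5.0, 20.0, 4.0]   # twice as fast as the reference on every kernel
print(composite_score(fast, ref))  # ≈ 2.0
```

A processor that matches the reference on every kernel scores 1.0, one that halves every kernel time scores 2.0, so the single number preserves relative ordering even though the per-kernel detail is gone.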

It provides a far more accurate assessment of a processor's DSP performance than other simplified metrics (such as MIPS) because it is a measurement of execution speed on actual DSP algorithms. In Figure 3, we present the BDTImark scores for a variety of DSP, GP, and hybrid DSP/GP processors, including those discussed earlier.
   Clearly, a quick comparison of BDTImark scores and MIPS ratings shows that a higher MIPS rating does not necessarily translate into better DSP performance [4].
   For example, consider the BDTImark scores for the 100-MIPS DSP16210 and the 120-MIPS DSP1620. The DSP16210 is about 1.5 times faster, because it has an extra multiplier and other hardware that lets it do substantially more work in every instruction cycle. The TMS320C6201, shown here at 167 MHz (1,336 MIPS), achieves an impressive BDTImark score, but is not 13 times faster than the DSP16210, as might be expected from the two processors' MIPS ratio.
   Another point of interest is that the score for the Pentium is actually higher than the scores for many of the low- to moderate-performance DSP processors, a surprising result mostly attributable to the processor's 233-MHz clock rate. The Motorola PowerPC 604e performs even better; in fact, it is faster on the DSP benchmarks than nearly all the DSP processors.
   This observation leads to a common question: Why use a DSP processor at all when the DSP capabilities of high-end GPPs such as the PowerPC 604e are becoming so strong?
   The answer is, there's more to it than raw performance. Although high-end GPPs are able to perform DSP work at a rate comparable to many DSP processors, they achieve this performance by using complicated dynamic features. For this reason, high-end GPPs are not well suited for real-time applications: dynamic features cause real problems both in terms of guaranteeing real-time behavior and optimizing code. In addition, the theoretical peak performance of a high-end GPP may never be achieved in real-time DSP programs, because the programmer may have to assume worst-case behavior and write the software accordingly.
   High-end GPPs also tend to cost substantially more money and consume more power than DSP processors, an unacceptable combination in, for example, highly competitive portable telecommunications applications. And, although software development tools for the most widely used GPPs are much more sophisticated than those of their DSP counterparts, they are not geared toward DSP software development and lack features that are essential in the DSP world.

   It will be interesting to see how well the more recent additions to the DSP world, the hybrid GP/DSP processors, can penetrate the market. These processors, such as the ARM7/Piccolo and the SH-DSP, don't suffer from the drawbacks that accompany high-end GPPs, but they also don't offer the same level of performance.
   The bottom line is, if there is a high-performance GPP in the existing system (as in the case of a PC), it may be attractive to use for signal processing and to avoid adding a separate DSP processor. And, particularly in the case of GPPs enhanced with DSP extensions, it may be possible to get good DSP performance out of the system without adding a separate DSP processor. If you are building a DSP application from scratch, however, it is likely that a dedicated DSP or hybrid GP/DSP processor will be a better choice, for reasons of economy, lower power consumption, and ease of development.
   Though high-performance GPPs have already begun to challenge DSP speed, DSP processors aren't likely to be supplanted in the near future because they are able to provide extremely strong signal processing performance with unmatched economy.

References
   1. R. Lyons, Understanding Digital Signal Processing, Addison Wesley Longman, Reading, Mass., 1997.
   2. P. Lapsley et al., DSP Processor Fundamentals, IEEE Press, New York, 1997.
   3. J. Hennessy and D. Patterson, Computer Organization and Design, Morgan Kaufmann, San Francisco, 1998.
   4. Buyer's Guide to DSP Processors, Berkeley Design Technology Inc., Berkeley, Calif., 1994, 1995, and 1997.

Jennifer Eyre is an engineer and a technical writer at Berkeley Design Technology Inc., where she analyzes and evaluates microprocessors used in DSP applications. Eyre received a BSEE and an MSEE from UCLA. She is a member of Eta Kappa Nu.

Jeff Bier is cofounder and general manager of BDTI, where he oversees DSP technology analysis and software development services. He has extensive experience in software, hardware, and design tool development for DSP and control applications. Bier received a BS from Princeton University and an MS from the University of California, Berkeley. He is a member of the IEEE Design and Implementation of Signal Processing Systems (DISPS) Technical Committee.

Contact the authors at BDTI, 2107 Dwight Way, Second Floor, Berkeley, CA 94704; {eyre, bier}@