Embedded Computer Architecture by tiny54tim


Embedded Computer Architecture
    Jakob Engblom, PhD
    Uppsala University & Virtutech Inc.
    jakob.engblom@it.uu.se
    jakob@virtutech.com

Embedded Systems
                                                                                     14 Nov 2003       Embedded Computer Architecture   2




Embedded Systems
    [Cartoon; the speech bubbles read:]
        “Now what is this elephant thing?”
        “It is a snake!”
        “No, a wall!”
        “No, a pillar!”
        “No, it is a tree trunk!”
        “You’re all wrong, it is a fan!”

Embedded Systems
    “A computer that doesn’t look like a computer”
    Interacts with world
    Primitive or no user interface
    Part of other products

Embedded Systems
    Single purpose products
        Not general purpose like desktop PCs
        Do one thing very efficiently
    Software very important:
        Gives character to product
            Used to differentiate inside a “platform”
        Can be changed late
        Processor cheaper than special HW
        Today, dominates dev cost

Processor Market
    Embedded = most processors!
        200 million PC and server
        8000 million embedded
    [Pie chart: "Embedded" 98%, "Desktop" 2%]
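
The 98% / 2% split on the pie chart follows from the two unit counts on the Processor Market slide. A quick check, as a minimal sketch; the variable names are illustrative and not from the slides:

```python
# Unit counts quoted on the slide, in millions of processors.
pc_and_server_units = 200
embedded_units = 8000

total_units = pc_and_server_units + embedded_units

# Share of all processors sold, as fractions of the total.
embedded_share = embedded_units / total_units
desktop_share = pc_and_server_units / total_units

print(f"Embedded: {embedded_share:.1%}, desktop: {desktop_share:.1%}")
```

Rounded to whole percent, the shares match the chart labels.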





Processor Market
    Processors:
        50% of all semiconductor revenue
        Explains why everyone wants to do processors
    32-bit dominant
        30% of total semiconductors
    PC processors:
        50% of CPU revenue
        15% of total semiconductors
        AMD and Intel share it
    [Bar chart: share of units vs. share of money per processor type (DSP, 4-bit, 8-bit, 16-bit, 32-bit), axis 0% to 100%]

Real-Time System
    Timing as important as result
    Hard real-time:
        Hard deadlines
        Dead if missed deadline
        Worst-case
    Soft real-time:
        Fuzzier deadlines
        Can miss some deadlines
        Average-case
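
The hard versus soft distinction on the Real-Time System slide can be sketched as two different acceptance checks on the same timing data. This is only an illustration: the function names, the sample values, and the 10% miss budget are assumptions, and a real hard real-time argument needs a guaranteed worst-case bound rather than a handful of measurements.

```python
from statistics import mean


def meets_hard_deadline(response_times_ms, deadline_ms):
    """Hard real-time view: the worst case must meet the deadline."""
    return max(response_times_ms) <= deadline_ms


def meets_soft_deadline(response_times_ms, deadline_ms, max_miss_ratio):
    """Soft real-time view: a bounded fraction of misses is tolerated."""
    misses = sum(1 for t in response_times_ms if t > deadline_ms)
    return misses / len(response_times_ms) <= max_miss_ratio


# Hypothetical response times of one task, in milliseconds.
samples = [2.0, 2.1, 1.9, 2.0, 9.5, 2.2, 2.0, 2.1, 1.8, 2.0]
deadline = 5.0

print("average:", mean(samples))
print("worst case:", max(samples))
print("hard OK:", meets_hard_deadline(samples, deadline))
print("soft OK:", meets_soft_deadline(samples, deadline, max_miss_ratio=0.10))
```

The same task passes the soft check on its average behavior and fails the hard check because of a single slow run.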
Real-Time Systems
    Embedded and Real-Time
        Synonymous?
    Most embedded systems are real-time
    Most real-time systems are embedded
    [Venn diagram: overlapping sets "embedded" and "real-time"]

Simple Embedded Systems
    8-bit Intel 8051, standard microcontroller
        Behavior, talk, IR communications
    8-bit Hitachi H8/300
        32 kB ROM, 32 kB RAM
        Standard microcontroller chip
        Byte-code machine, sensor drivers, …




Fun App: Smart Beer Glass
    Capacitive sensor for fluid level
    8-bit, 8-pin PIC processor
    Inductive coil for RF ID activation & power
    Contactless transmission of power and readings
    CPU and reading coil in the table. Reports the level of fluid in the glass, alerts servers when close to empty

No Upgrades Possible
    Once a product ships…
    …it often cannot be serviced
        No download ability
        No writable persistent storage
        No disks
        No loader
    Software is write-once
    (There are exceptions)

Consumer Electronics
    Heterogeneous multiprocessor
        8-bit Atmel AVR for UI, games, …
        16-bit fixed-point TI C54 DSP for GSM coding, radio interface, …
        32-bit ARM7 in Bluetooth module
        + maybe ARM7 in IrDA interface
    All in custom chips
    Software is large:
        16 MB of code in control part
        Plus signal processing code

Automotive
    Multiple networks
        CAN for body electronics: 30+ nodes
        CAN for engine control: few nodes
        LIN for instruments
    Many processors
        Up to 100
    Large diversity in processor types:
        8-bit CPUs (PIC, HC08) for door locks, lights, etc.
        16-bit CPUs (C167, HC11, HC12) for most functions
        32-bit CPUs (PPC, V850) for engine control, airbags
    Total amount of code: 40-50 MB




Automotive
    Form follows function
        Processing where the action is
        Architecture given by application
        Sensors and actuators distributed
    Heterogeneous systems
        Many different makes of CPUs
        Standardized at the network/bus

Timing Aspects
    Interrupt latency
        Important criterion for embedded
        A few clock cycles at most
        Measure of RTOS performance
    Real-Time = predictability
        In-order pipelines
        SRAM instead of caches
        Lockable caches
        Several small CPUs instead of one big
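
Why "SRAM instead of caches" helps predictability can be shown with a small cycle-count comparison. This is a sketch under stated assumptions: the hit rate and all cycle counts are invented for illustration, and the worst case assumes nothing is known about cache contents, so every access may miss.

```python
def cached_average_cycles(accesses, hit_rate, hit_cycles, miss_cycles):
    """Expected memory cycles when accesses go through a cache."""
    per_access = hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
    return accesses * per_access


def cached_worst_case_cycles(accesses, miss_cycles):
    """Safe bound with no knowledge of cache state: every access misses."""
    return accesses * miss_cycles


def sram_cycles(accesses, sram_access_cycles):
    """On-chip SRAM: every access costs the same, so average == worst case."""
    return accesses * sram_access_cycles


ACCESSES = 1000  # hypothetical number of memory accesses in a task

cache_avg = cached_average_cycles(ACCESSES, hit_rate=0.95, hit_cycles=1, miss_cycles=20)
cache_worst = cached_worst_case_cycles(ACCESSES, miss_cycles=20)
sram_fixed = sram_cycles(ACCESSES, sram_access_cycles=2)

print("cache, average case:", cache_avg)
print("cache, worst case:  ", cache_worst)
print("SRAM, every case:   ", sram_fixed)
```

With these made-up numbers the cache is about as fast as the SRAM on average, but its guaranteed bound is ten times worse, which is what matters for a hard deadline.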
Military Shipboard
    Standard multiprocessor UltraSparc servers for radar, target tracking, combat control, …
    Many CPUs in missiles, gun controls, engines, …

Mobile Phone Base Station
    Handle signals
        Data streams to and from phones
        Massively parallel system
        Thousands of DSP tasks
        Perfect parallel scalability
    Custom or standard DSPs
        Up to 8 DSPs on a single chip





Trends
    Hardware to software
        Increase flexibility, lower cost
        Software on fast processor can equal HW
    Software to hardware
        Better power consumption & performance
        Design custom hardware for application
    Hardware-software codesign
        Delay division HW/SW to late in project
        Obtain “optimal” HW/SW division

System-on-a-chip
    Integration extreme
        Thanks to modern semiconductors
    Entire product on a chip
    One or more processors, accelerators, …
    [Block diagram: on-chip bus connecting Bluetooth, GSM radio, LCD driver, data mem, DSP, CPU, code memory]

Embedded Processing

Microcontrollers
    Classic embedded hardware
    Standard parts
        Quite broad application domains
        Sold in large series
        Defined by hardware vendors
        As cheap as a single dollar
    Single processor + devices
    Huge number of variants
    Usually intended for control plane





Microcontroller
    A single chip:
        CPU Core
        Integrated memory
        Integrated peripherals
        Integrated services
    Goal:
        System on one chip
        No external HW
        Fit application “perfectly”
    [Block diagram: CPU Core with RAM (small), ROM (big), UART, A/D, Timer, LCD D, connected to the Outside World]

Microcontroller
    CPU Bitness: 4 to 64 bits
        Most common: 8 bit (4G units)
        32-bit growing fastest
        32/64-bit outnumbers desktop
    Frequency: DC to 2 GHz
    Memory on-chip: 0.5 kB to 5 MB
    Power: mW (and up)
    1/30 to 10 instructions per cycle

Example: PIC 12CE674
    Memory arch:              Harvard
    Program memory:           2048 x 14 (OTP/Flash)
    EEPROM:                   16 bytes
    RAM:                      128 bytes
    ADC channels:             4 (8 bits)
    I/O ports:                6
    Timers:                   One 8-bit, One WDT
    Clock:                    on-chip crystal, 10 MHz
    Package:                  8 pins (Pentium 4: 700 pins)
    Cost:                     <$1.00 (Pentium 4: >$200.00)

Example: AT91M42800A
    ARM7TDMI 32-bit core
        Static design: 0 to 33 MHz
    Memory
        8 kB SRAM on chip
        External memory interface, 8/16 bit interface
    Devices
        6 timers
        2 serial ports
    JTAG debug interface
    About 0.5 W power
    About 18 USD
    144 pin package
    One of 13 AT91 variants
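
To put the PIC 12CE674 figures in perspective, the "2048 x 14" program memory can be converted to bytes and compared with its data RAM. A small sketch using only the numbers from the slide; the variable names are illustrative:

```python
# Figures from the PIC 12CE674 slide.
program_words = 2048
bits_per_program_word = 14   # Harvard architecture: instruction width differs from data width
ram_bytes = 128
eeprom_bytes = 16

# Total program storage, expressed in bits and in 8-bit bytes.
program_bits = program_words * bits_per_program_word
program_bytes = program_bits // 8

print("program memory:", program_bits, "bits =", program_bytes, "bytes")
print("program-to-RAM ratio:", program_bytes / ram_bytes)
```

The whole program fits in about 3.5 kB, with 28 times less RAM than program storage.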




Devices on the Chip
    Interface with the world
        Digital I/O
        Analog/Digital conversion
        Digital/Analog conversion
    Communications
        CAN networks
        Ethernet networks
        Radio
        Serial ports (UART, USART)
        USB, FireWire, ...

Devices on the Chip
    Timers
        Trigger interrupts
        Watchdogs
    Graphics
        LCD drivers
        2D/3D graphics acceleration
    Buses
        On-chip: between devices: AMBA, …
        Off-chip: PCI, HyperTransport, RapidIO …
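
The watchdog entry in the timer list can be illustrated with a toy software model. This is a hedged sketch, not the behavior of any particular chip: the class, its method names, and the tick counts are all hypothetical. The idea is that software must "kick" the timer regularly, and if it stops doing so the timer expires and forces a reset.

```python
class Watchdog:
    """Toy watchdog timer: counts down each tick, flags a reset on expiry."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reset_triggered = False

    def kick(self):
        """Called by healthy software to restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Called once per timer tick (by hardware, in a real system)."""
        if self.reset_triggered:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.reset_triggered = True


def run(ticks, kick_every=None):
    """Simulate a main loop that kicks every `kick_every` ticks, or never."""
    wd = Watchdog(timeout_ticks=5)
    for t in range(1, ticks + 1):
        wd.tick()
        if kick_every and t % kick_every == 0:
            wd.kick()
    return wd.reset_triggered


print("healthy loop reset?", run(ticks=20, kick_every=3))
print("hung loop reset?   ", run(ticks=20, kick_every=None))
```

A loop that kicks more often than the timeout never triggers the reset, while a hung loop does after five ticks.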
ASIPs / ASSPs
    Application-specific integrated/standard processor
        Targeting a particular niche market
        More targeted than microcontroller
        Domain-specific accelerators
    Usually more upscale
        32-bit processors
        Multiprocessors
        Expensive peripherals
        External memory assumed
        Higher performance, includes data-plane

Example: PowerQUICC III
    Motorola
    Target market
        Communications
    Processing
        PowerPC e500
        666-1000 MHz
        256 kB L2 cache
    Networking
        CPM module, RISC-based microcode
    About 160 USD
    Features
        Serial Communications Controller (SCC)       4
        Fast Communications Controller (FCC)         3
        Multi-Channel Controller (MCC2)              2
        Serial Management Controller (SMC)           2
        Serial Peripheral Interface (SPI)            1
        I2C controller                               1
        DDR Memory controller                        1
        PCI-X/PCI controller                         1
        RapidIO controller                           1
        Ethernet 10/100/1000 controller              2
    Capabilities
        Ethernet, 10 (from SCC)                      4
        Ethernet, 10/100 (from FCC)                  3
        Ethernet 10/100/1000                         2
        Utopia II ATM (from FCC)                     2
        Multichannel HDLC (from MCC2)              256





Example: C167CS
    Infineon
    Target Market
        Automotive control
    Processing
        16-bit C16x core
        4-stage simple pipeline
        40 MHz operation
        16 MB memory space, including ROM, RAM, devices
    144 pin package
        Tolerates -40 °C to +125 °C
    About 25 USD
    Devices
        CAN 2.0b controllers                 2
        General-Purpose Timers (GPT)         5
        Watch-Dog Timer (WDT)                1
        Pulse-Width Modulator (PWM)          1
        Analog-Digital Converter Channels   24+8
        USART                                1
        Synchronous Serial Comms (SSC)       1
        Capture/Compare Channels            2x16
    External Ports
        CAN interfaces                       2
        8-bit ports from devices             8
        16-bit ports from devices            1
    Memory
        ROM                                 32 kB
        Fast General Internal RAM (IRAM)    3 kB
        Extension Internal RAM (XRAM)       8 kB

Example: Cisco Toaster3
    8 clusters of 2 processors each
    Total capacity: about 5 GOps, at around 160 MHz
    Each TMC is a VLIW machine with 74 bit instructions, 2k instructions in local memory
    Two 32-bit ALUs and three control/data movement units per TMC
    Image from Microprocessor Report, Oct 2002

Example: Cisco Toaster3
    Massive multiprocessing
        16 cores on a chip
        4 chips in serial
        Routing:
            10 Gbps @ 20 Mpackets/s
            1000 ops per packet passing through

FPGA
    Field Programmable Gate Array
        Reconfigurable hardware: “soft logic”
            “Program” is circuit layout
            Can be changed after initial load
        Kilos to Megs of “gates” available
    Competitor to ASICs
        More expensive per unit, but no start-up cost for manufacturing
        Less flexible, slightly slower
        Perfect for low-volume products





               FPGA Architecture                                                           FPGA Architecture

                                  Computation cells                             Computation cells
                                       Programmable                                  Look-Up Table
                                       function                                                     4-
                                                                                          Arbitrary 4-input,
                                             Adder, Logic funcs, ...
                                                          funcs,                          1-output function                              Config
                                             Memory, Registers, ...                  Coarse-grained                                      RAM

                                                                                          Lots of functionality
                                  Input/Output cells                                      Several LUTs
                                                                                                                                                  LUT


                                  Interconnect                                                 flip-
                                                                                          Plus flip-flops etc.

                                       Reconfigurable                                Fine-grained
                                                                                          Little functionality
                                       Programmable
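A minimal sketch of the look-up table idea, assuming the configuration RAM of one cell is a 16-bit word in which bit i holds the output for input pattern i. The names `make_lut4_config` and `lut4_eval` are illustrative and do not come from any vendor tool.

```python
def make_lut4_config(func):
    """Return the 16-bit configuration word for a 4-input boolean function."""
    config = 0
    for index in range(16):
        # Split the table index into the four input bits (a is bit 0).
        a = index & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        if func(a, b, c, d):
            config |= 1 << index
    return config


def lut4_eval(config, a, b, c, d):
    """Look up the output bit selected by the four input bits."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> index) & 1


# The same cell computes AND or XOR depending only on the configuration word.
and_config = make_lut4_config(lambda a, b, c, d: a & b & c & d)
xor_config = make_lut4_config(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Reprogramming the cell means rewriting the 16 configuration bits; the evaluation logic stays the same.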
              FPGA with CPU Cores                                                                    Soft CPUs in FPGAs
    CPU on-board FPGA                                                                      Processor in the FPGA fabric
         HW accelerate critical                                                                 ”Soft” processor
         tasks in FPGA fabric                                                                   Special design considerations
         Data pumps in FPGA
         Control in CPU                                                                    Examples
                                                                        CPU                     Altera Nios
    Cool new possibilities                                                                      Xilinx Microblaze
         Reconfigure FPGA online
                                                                                                Research projects
         Adapt to workloads
                                                                                                           Västerås ARM clone
                                                                                                     Leon processor also prototyped






                         Examples
    Altera Apex 20kC                            Altera Stratix
              “Volume”                                     “Advanced”
              30k to 1.5M gates                            10 Mbit RAM
                                                                                                               Case Study:
    Xilinx Virtex II:                                      28 DSP elements
                                                           100000 LE
                                                                                                                  ARM
                                                                                                                1026EJ-S
              “High-end”                                   1300 user I/O pins
              1-4 PPC405 cores                             Optimized for Nios
              (optional)
              10M gates                         ATMEL FPSLIC:
              Price at about $1000                         “Low-end”
                                                           AVR 8-bit CPU
                                                           50k gates



                   Overview                                  The Basics: ARM1026EJ-S
                                                              Not a stand-alone processor
                                                              For integration in your own chips
                                                              Processor package:
                                                                   CPU core
                                                                   Caches, configurable in size
                                                                   Tightly-coupled memories, configurable
                                                                   in size
                                                                   Bus interface
                                                                   MMU (supports WinCE, Symbian, etc.)





              Business Model                                                          ASICs

    Sold as an IP Core                                        Fully custom chips
         IP = “Intellectual Property”                              Custom for your application
         Not a physical chip, just a design                        As small or large as necessary
         “Source code component”                              Characteristics
         Similar in scope to classic processor                     Expensive to develop
                                                                        10s of engineers, often 100s
    For integration in ASICs                                       Large series necessary to pay off
         ASIC = Application-specific                                    At least 100 000 units necessary on average
         integrated circuit                                             Mostly for large companies
                                                                   To streamline: build from IP blocks
                           IP Blocks                                                               CPU Cores
    IP                                                                               The biggest “IP” business
         Hardware components
         Integrated on chip by customer                                              “Fabless” chip companies
    Examples:                                                                        Biggest players:
         CPU Cores                                                                        ARM (best-selling 32-bit architecture)
         Memory                                                                           MIPS (and its licensees)
         Buses
         Network interfaces                                                          Crowded field
         Accelerator circuits                                                             New companies appear monthly
                                                                                          Niche components can find a market

    [Diagram: on-chip bus connecting CPU, DSP, code memory, data memory, Bluetooth, GSM radio, LCD driver]





                Component Styles                                                      Synthesizable Vs Hard IP
    Hard IP:                                                                         Synthesizable                          Hard IP
         Tied to a particular fab process
              Like IBM 0.13u Cu, TSMC 0.18, etc.                                      + Use any process                       + Optimized layout
         Black box to customer                                                        + Use any fab                           + Small area
    Synthesizable IP:                                                                 + Customize details                     + Low power
         Source code for compilation by customer                                      + Customize chips                       + Best performance
         Offers configuration options like cache sizes, TCMs
         MIPS 24k, ARM 9S, 1026S, 1136S                                               + Add instructions                      - No flexibility
    Soft IP:                                                                          - Slower memories
         Get full source code for the component                                       - Higher power                       For best results,
         Purpose is to customize heavily                                              - Lower                              cores need to be
         ARC ARCtangent 5, Tensilica Xtensa V                                             performance                      redesigned to be
                                                                                                                             synthesizable

                    1026EJ-S Core                                                              ARM1026EJ-S Pipeline
                                                                                          Static branch
    6-stage pipeline:                                                                   prediction (75%
         Max clock, best case: 475 MHz                                                  accurate): uses
              Depends on process, synthesis used                                        less power than
                                                                                            dynamic
         Optimized for synthesis of core                                                                                     Shift/ALU        Sat
         Integer-only                                                                                                                                 Write
    Power:                                                                               Fetch      Issue    Decode             MAC1          MAC2
         Depends on process & configuration
         Quoted numbers: 0.5 mW/MHz
              With 16kB+16kB L1 caches                                                                                           LS1          LS2    LS write
              130 nm process at TSMC
                                                                                         Return stack
              (Pentium 4: >35 mW/MHz)
                                                                                        (single entry).
                                                                                          Simple but
                                                                                           effective
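As a rough worked example of the quoted power figure, the sketch below multiplies the per-MHz number by a clock frequency. Pairing the best-case 475 MHz clock with the 0.5 mW/MHz figure is an assumption made here for illustration; the two numbers may not refer to the same process or configuration. The function name is hypothetical.

```python
def core_power_mw(mw_per_mhz, clock_mhz):
    """Estimate core power as (mW per MHz) times clock frequency in MHz."""
    return mw_per_mhz * clock_mhz


# Quoted ARM1026EJ-S figure, assumed here to hold at the best-case clock.
arm_power_mw = core_power_mw(0.5, 475)

# Lower bound on how much more power per MHz the Pentium 4 figure implies.
pentium4_ratio = 35 / 0.5
```

Under these assumptions the core draws about 237.5 mW at full speed, and the Pentium 4 figure is at least 70 times higher per MHz.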




          ARM1026EJ-S Pipeline                                                                 ARM1026EJ-S Pipeline
                                                                                                     Register read,
                                                                                                   initialize memory
   ARM/Thumb/Java                                                                                       accesses
       decode
                                             Shift/ALU        Sat                                                            Shift/ALU        Sat
                                                                      Write                                                                           Write
    Fetch         Issue      Decode             MAC1          MAC2                       Fetch      Issue    Decode             MAC1          MAC2


                                                 LS1          LS2    LS write                                                    LS1          LS2    LS write

                Access to
              coprocessors                                                                            Evaluate
                                                                                                     immediates
          ARM1026EJ-S Pipeline                                                                   ARM1026EJ-S Pipeline
                                                           Handle
              Execution pipeline
                                                          saturated
               for most integer
                                                          arithmetic
                 instructions                                                                        Execution pipeline
                                                                                                        for multiply-
                                             Shift/ALU         Sat                                      accumulate                 Shift/ALU        Sat
                                                                                                        instructions
                                                                        Write                                                                               Write
    Fetch          Issue     Decode             MAC1          MAC2                         Fetch         Issue     Decode             MAC1          MAC2


                                                 LS1           LS2     LS write                                                        LS1          LS2    LS write








          ARM1026EJ-S Pipeline                                                                              Rounding Out
                                                                                           Configurable caches
                                                                                                Typically 16kB/16kB
                               2 stage memory
                              access to support                                            Optional TCMs
                              slow synthesized
                                   memory                                                  Memory interface
                                        Shift/ALU              Sat                              2 x 64 bit AMBA AHB links
                                                                          Write
    Fetch          Issue     Decode             MAC1                MAC2
                                                                                           Optional vector FP coprocessor
                                                 LS1           LS2     LS write
                                                                                           Optional vectored interrupt
                Decoupled pipeline                                                         controller
               for loads and stores
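The pipeline diagrams above can be summarized as a shared front end followed by one of three back-end paths. The sketch below is one reading of the diagram, not a description taken from ARM documentation; the instruction-class names (`alu`, `mac`, `load_store`) are made up for illustration.

```python
# Stages every instruction passes through first.
FRONT_END = ["Fetch", "Issue", "Decode"]

# Back-end path by instruction class, as read from the slide diagram.
BACK_END = {
    "alu": ["Shift/ALU", "Sat", "Write"],
    "mac": ["MAC1", "MAC2", "Write"],
    "load_store": ["LS1", "LS2", "LS write"],
}


def pipeline_path(kind):
    """Return the ordered list of stages for one instruction class."""
    return FRONT_END + BACK_END[kind]


for kind in BACK_END:
    print(kind, "->", " / ".join(pipeline_path(kind)))
```

Each path is six stages long, which matches the 6-stage pipeline quoted for the core.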
               ARM1026EJ-S System                                                                                   TCM
                                Debug port connection
                                                                                             Tightly-Coupled Memories
                   VIC10               ETM10RV                                               Alternative to caches
                                      trace/debug                  VFP10 FP
                 interrupt
                                                                  coprocessor                     As fast as caches
               coprocessor
                                                                                                  Programmer-controlled
                                   ARM1026EJ-S                                                    No automatic management                       TCM
                     I-TCM            Core                        D-TCM                           Cheaper to implement
                                                                                                  More predictable in behavior

                      64-bit         I$            D$             64-bit
                                                                                             Programming:
FLASH           AMBA/AHB                                          AMBA/AHB         RAM            In memory map
               data bus for I                                     data bus for D                  Tagged like caches
                                           BIU
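A minimal sketch of what "in memory map" means for a TCM: an access is served by the TCM purely because its address falls in a fixed range, so there is no hit-or-miss decision. The base address and size below are hypothetical; on a real chip they are chosen by the designer and the programmer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    base: int
    size: int

    def contains(self, address):
        return self.base <= address < self.base + self.size


# Hypothetical placement of a 16 kB data TCM in the address space.
D_TCM = Region("D-TCM", base=0x0040_0000, size=16 * 1024)


def route_data_access(address, tcm=D_TCM):
    """Return which memory serves a data access in this simplified model."""
    if tcm.contains(address):
        return tcm.name       # fixed latency, no miss possible
    return "D-cache/AHB"      # may hit in the cache or go out on the bus
```

Because placement is decided by address alone, the programmer (or linker script) controls what lives in the TCM, which is what makes timing more predictable than with a cache.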





       Instruction Sets for ARM                                                                The ARM Instruction Set
     Base: ARM v5                                                                            Continuous evolution
          32-bit integer-only instruction set
     T: Thumb instruction set                                                                     Add features required by market
          16-bit, for smaller code size                                                           RISC? Not anymore, if ever
     J: Jazelle extensions                                                                   Now at v6, in the ARM11 family
          Java support in hardware
          Implements 140 out of 228 JVM byte codes                                                v5, v5E in ARM9 and ARM10
     E: DSP extensions                                                                            v4 in old ARM7
          Done in regular registers                                                               Backwards compatibility!
          Saturation, some more MACs


                          T: Thumb                                                                T: Thumb
    Compressed instruction set                                                  Thumb shrinks the code:
         16-bit encoding of (parts of)
         32-bit instruction set                                                           Thumb    ARM           386        8088 68020 SPARC
         Limitations in ARM/Thumb:                                             eqntott    10608   16768       17640         19106   20542   22256
              Only access to 8 registers (16 in ARM mode)
              No system operations                                                        0.63    1.00          1.05         1.14   1.23    1.33

    Effect:                                                                    xlisp      26388   40768       28097         29401   46746   44648

         More but smaller instructions                                                    0.65    1.00          0.69         0.72   1.15    1.10
              30% more, at half size                                           espresso   72596  109923      125686        137194  131854  142752
         Usually some performance loss                                                    0.66    1.00          1.14         1.25   1.20    1.30
              (Perform better on narrow buses)
                                                                                                  Source: Microprocessor Report, March 1995
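The ratio rows in the table are each code size divided by the ARM code size for the same benchmark. The sketch below recomputes them from the sizes on the slide and also checks the "30% more instructions at half size" rule of thumb; the variable and function names are illustrative.

```python
# Code sizes copied from the slide (Microprocessor Report, March 1995).
CODE_SIZE = {
    "eqntott": {"Thumb": 10608, "ARM": 16768, "386": 17640,
                "8088": 19106, "68020": 20542, "SPARC": 22256},
    "xlisp": {"Thumb": 26388, "ARM": 40768, "386": 28097,
              "8088": 29401, "68020": 46746, "SPARC": 44648},
    "espresso": {"Thumb": 72596, "ARM": 109923, "386": 125686,
                 "8088": 137194, "68020": 131854, "SPARC": 142752},
}


def relative_sizes(sizes, baseline="ARM"):
    """Express every code size as a ratio to the baseline architecture."""
    return {isa: round(size / sizes[baseline], 2) for isa, size in sizes.items()}


# Rule of thumb from the slide: 30% more instructions, each half the size.
thumb_estimate = round(1.3 * 0.5, 2)

for benchmark, sizes in CODE_SIZE.items():
    print(benchmark, relative_sizes(sizes))
```

The estimate of 0.65 sits inside the measured Thumb range of 0.63 to 0.66.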






    T2: Doing a Better Thumb                                                                       Why T?
    ARM Thumb: fixed 16-bit size                                                Pushed by mobile phones
         Saves 28% space compared to 32-bit ARM                                      More memory = more expensive
         Runs 20% slower than 32-bit ARM
                                                                                     More memory = bigger package
    ARM Thumb 2: mixed 16/32                                                         More memory = higher power
         Brand new, arrives with ARM1156
         Saves 26% space compared to 32-bit ARM                                 More features in same memory!
         Runs 2% slower than 32-bit ARM                                         Performance is not critical
         (Introduces some new instructions)
    Conclusion: mixed length good!
                            Source: Microprocessor Report, June 2003
                   T: Competitors                                                  J: Jazelle
    Compressed instruction sets                                     Hardware Java acceleration
         MIPS16e, shrunk MIPS32 ISA                                      Pushed by mobile phones
         ARC
         Tensilica                                                  Why?
                                                                         To fix Java performance problems
    All-small instruction sets
         SH family                                                  SW JVM problems:
    Compressed code                                                      Minimal clock frequency =
                                                                         low interpreter performance
         IBM PowerPC 405 GX
         Decompress when loaded into cache                               JIT requires more memory





               E: DSP Extensions                                                     Why E?
    A few new instructions                                          Enhance DSP performance
         Saturated arithmetic
              Add, Sub,                                             Of stand-alone ARM core
         Signed multiply, MAC                                       Avoid multiprocessor solution
              2 16-bit values in one register
              16x16                                                      Hard disk controllers, for example
              32x16
         Count leading zeroes
         Load/store pairs of registers
    Fairly typical ”DSP” additions
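A behavioral sketch of three of the additions listed above: saturated addition, count leading zeroes, and a 16x16 multiply that picks its operands from the halves of two 32-bit registers. This models the arithmetic only and assumes 32-bit registers; the function names are illustrative and are not ARM mnemonics.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturating_add32(a, b):
    """Add two signed 32-bit values, clamping instead of wrapping on overflow."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def count_leading_zeros32(value):
    """Count the zero bits above the highest set bit of a 32-bit value."""
    return 32 - (value & 0xFFFFFFFF).bit_length()


def signed_half(register, top):
    """Extract the top or bottom 16 bits of a register as a signed value."""
    half = (register >> 16) & 0xFFFF if top else register & 0xFFFF
    return half - 0x10000 if half & 0x8000 else half


def multiply_16x16(reg_a, reg_b, top_a=False, top_b=False):
    """Multiply one signed 16-bit half of each register."""
    return signed_half(reg_a, top_a) * signed_half(reg_b, top_b)
```

Saturation matters for signal processing because a wrapped overflow turns a loud sample into a large value of the opposite sign, while a clamped one only clips it.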
                  E: Competition                                                                                                                          SIMD Extensions
    DSP-in-processor                                                                                                                            Heavy-weight addition
         “MAC=DSP”
         Almost all embedded processors have it                                                                                                      New functional units, registers
         No revolution in performance                                                                                                                Small vector computers
    DSP/processor hybrids                                                                                                                       Examples:
         Infineon Tricore
         Microchip DSPic                                                                                                                             ARM SIMD extensions (in v6)
         Hard to get it right, not a big success so far                                                                                              Motorola Altivec
    SIMD extensions                                                                                                                                  MIPS
         More extensive additions than v5E
         Requires new functional units                                                                                                               x86 MMX-SSE-SSE2-3Dnow!
         Major performance gain possible                                                                                                             SPARC VIS
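To illustrate the "small vector computer" idea, the sketch below adds four 8-bit lanes packed into one 32-bit word, with each lane wrapping on its own. The lane width and the wrap-around behavior are assumptions chosen for illustration; real SIMD extensions offer several widths and often saturating variants. The function name is hypothetical.

```python
def packed_add_u8(x, y):
    """Add four unsigned 8-bit lanes of two 32-bit words, lane by lane."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        # Mask the sum so a carry never spills into the neighboring lane.
        result |= ((a + b) & 0xFF) << shift
    return result


packed = packed_add_u8(0x01FF0304, 0x01010101)
plain = (0x01FF0304 + 0x01010101) & 0xFFFFFFFF
```

The packed result keeps the overflowing lane contained, while an ordinary 32-bit add lets the carry corrupt the next lane. Doing this in one instruction is why the hardware needs new functional units.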





                SIMD Extensions                                                                                                                              ARM vs DSP
    Target                                                                                                                                          Despite “E” and “SIMD”...
         Motorola PPC 7455 (G4+)
         1 GHz                                                                                                                                       Standard solution:
    EEMBC                                                                                                                                                Dual-core setup
         Telemark suite                                                                                                                                  ARM core
         Networking suite                                                                                                                                DSP core
    OOTB:
         Out-of-the-box                                                                                                                             Control vs data
    OPT:
         Manually tuned to use
         Altivec
    Overall/Average:
         3-4 times speed up
         can be expected

    [Chart: OOTB vs OPT results per benchmark; y-axis 0 to 10, one value off scale at 35.1; benchmarks: Autocorr 1, Convolution 1, Bit alloc 1, FFT 1, Viterbi 1, Route 1, OSPF 1, Packet 512]

                        Control vs Data                                                                       ARM-DSP: TI OMAP 5910
        Control plane:                                                                                      Texas Instruments
             Standard processor tasks                                                                       Target market
             Decision-making                                                                                     Data-intense real-time
             “Integer applications”                                                                              Audio, biometrics, etc.
             UI of a phone, packet routing, …                                                               Processing
                                                                                                                 Dual-core chip
                                                                                                                 ARM925T 150 MHz
                                                                                                                 TI C55 DSP 150 MHz

        [Diagram labels: C55x DSP core, 24k I$, 96k instr SRAM, 64k data SRAM; ARM925 CPU, 16k I$, 8k D$; DSP private devices, DSP shared devices, ARM shared devices, System devices; USB 1.1, LCD controller, MMC/SDcard intf, camera interface, keyboard interface, RTC, I2C]

                               Data plane:
                                                                                                                                                                           8 serial ports private
                                                                                                                                                                                      ARM
                                                                                                                                                                            8 serial ports
                                                                                                                                                                          Core
                                 Move or process data                                                       Power 230 mW                                   MMU
                                                                                                                                                                           3 UARTs
                                                                                                                                                                                        devices
                                                                                                                                                                            3 UARTs
                                 Performance is key                                                                                                                        14 GPIO pins
                                                                                                                                                                         Mem GPIO pins
                                                                                                                                                                            14
                                                                                                            Price 32 USD                                 LCD
                                                                                                                                                                       Ctrl
                                                                                                                                                                                     192k
                                 Signal processing, multimedia, …                                                                                         Ctrl
                                                                                                                                                                      75 Mhz
                                                                                                                                                                                      Shared SRAM
                                 Floating/fixed point
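
The "fixed point" arithmetic named for data-plane code, which is also the native style of DSP cores such as the C55, can be illustrated with a small sketch. This is not taken from the slides: the Q15 format, the type name `q15_t` and the function name `q15_mul` are assumptions chosen for illustration, and the code relies on the common (but implementation-defined) arithmetic right shift of negative values.

```c
#include <stdint.h>

/* ASSUMPTION: Q15 format, where a 16-bit signed integer x
 * represents the real value x / 32768 (range -1.0 to just under 1.0). */
typedef int16_t q15_t;

/* Multiply two Q15 values with rounding and saturation.
 * Hypothetical helper, not part of any vendor library. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* exact product in Q30 */
    p = (p + (1 << 14)) >> 15;            /* round, rescale to Q15 */
    if (p > INT16_MAX)                    /* only -1.0 * -1.0 gets here */
        p = INT16_MAX;
    return (q15_t)p;
}
```

For example, `q15_mul(16384, 8192)` multiplies 0.5 by 0.25 and returns 4096, which is 0.125 in Q15. A DSP typically performs the multiply, rounding and saturation in one instruction, whereas a plain integer core needs several.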




ARM Family: ARM Cores
    [Chart: performance versus time]
    1994  ARM7     3-stage pipe, unified cache, low power
    1998  ARM9     5-stage pipe, I/D caches
    2000  ARM9E    5-stage pipe, I/D caches, Java, DSP
    2000  ARM10    6-stage pipe, static BP, 64-bit BIU, FP
    2002  ARM11    8-stage pipe, dynamic BP, OOO-completion, 550 MHz

ARM Family: Intel Chips
    [Chart: performance versus time, with ARM7, ARM9, ARM9E, ARM10 and ARM11 shown for reference]
    1995  StrongARM    5-stage pipe, legendary performer
                       Intel got this from Digital in 1998.
                       A single variant, big in PDAs.
    2001  XScale       7-10-stage pipe, dynamic BP, 800 MHz
                       Intel makes chips based on the XScale;
                       does not license the core to 3rd parties
Configurable Instruction Sets

Instruction Sets: Configure
    Configurable instruction sets
        Adapt to needs of application
        User can specialize the processor
        Less waste on generality
        Fast evolution of instruction sets
    Traditionally:
        Chip manufacturers determine instruction sets aimed at some niche
        Slow evolution of instruction sets




Instruction Sets: Configure
    Subsetting
        There is a limited and predefined set of instructions available
        Easy to compile for: restrict code gen
        Remove instructions to simplify core
    Addition
        Freedom to invent instructions
        Tool chain: assembly, C compilers
        Genuine development of ISAs

Configurable Instruction Sets
    Tight integration:
        Add to regular pipeline
        Additional functional units
        Adding fine-grained instructions
    Loose integration:
        Coprocessor interface
        Slower communication
        Offloading of macro-scale tasks
        Method to invoke accelerator circuits
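
To make "adding fine-grained instructions" concrete, here is a minimal sketch of how application code might look when a tightly integrated instruction is added. Everything named here is hypothetical: `sat_add16_ref`, `sat_add16_custom` and `mix` are invented for illustration, and the choice of a saturating add is an assumption, not an example from the slides. On a real configurable core the tool chain would supply an intrinsic that maps to the new instruction; in this sketch the stand-in just calls the portable reference so the code runs on any host.

```c
#include <stdint.h>

/* Portable reference: saturating 16-bit add. On a generic integer core
 * this costs an add plus compares and branches (or selects). */
static int16_t sat_add16_ref(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* HYPOTHETICAL stand-in for a compiler intrinsic bound to an added
 * instruction. Here it only forwards to the reference version. */
static int16_t sat_add16_custom(int16_t a, int16_t b)
{
    return sat_add16_ref(a, b);
}

/* Application code is written against the intrinsic-style name, so the
 * same source works with or without the extra instruction. */
static void mix(const int16_t *x, const int16_t *y, int16_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = sat_add16_custom(x[i], y[i]);
}
```

With tight integration the call in the loop would become a single pipeline instruction. With loose integration the whole `mix` loop, rather than one add, would be the more natural unit to hand to a coprocessor, since each transfer across the coprocessor interface is slower.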
Configurability Trend
    Pioneers
        Tensilica Xtensa
        Arc Arctangent
        Configurability as key selling point
    Added to general architectures
        MIPS: “CorExtend”
        PowerPC: “BookE ASU”
        Usually less tight integration

Benefit of Configurability
    Target
        Xtensa III
        200 MHz
    EEMBC
        Telemark suite
        Networking suite
    OOTB:
        Out-of-the-box
        25k gate core
    OPT:
        Tuned code
        25k base core gates
        18k extra instr gates
        100k DSP coproc
        37k config gates

    Speedups
        Benchmark           OOTB     OPT
        Telemark overall       1      37
        Autocorr               1       9
        Convolution            1    1249
        Bit alloc              1      34
        FFT                    1      24
        Viterbi GSM            1      14
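
The slide does not say which instructions were added for the OPT results. As a rough illustration of why kernels such as Autocorr respond to tuning, the sketch below shows a plain C autocorrelation loop. It is not the EEMBC benchmark source; the function name `autocorr` and its signature are assumptions, and accumulator overflow for long or large-valued inputs is not handled.

```c
#include <stdint.h>

/* Autocorrelation of x[0..n-1] at one lag: sum of x[i] * x[i + lag].
 * Hypothetical helper written for illustration only. */
static int32_t autocorr(const int16_t *x, int n, int lag)
{
    int32_t acc = 0;
    for (int i = 0; i + lag < n; i++)
        acc += (int32_t)x[i] * x[i + lag];  /* one multiply-accumulate */
    return acc;
}
```

For the input {1, 2, 3, 4}, lag 0 gives 30 and lag 1 gives 20. Nearly all the time goes into the multiply-accumulate in the loop body, which is the kind of operation an added instruction or a DSP coprocessor can perform in fewer cycles than a base integer core.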






Configuration Tools
    [Figure: configuration tool, with callouts for instruction set choices and for gate and memory size counters]
