                Multi-Core
              System on Chip
              Design Trends 2

Presenter: Prof. Jun-Dong Cho (조준동), December 2003

                       1
     What is Software Radio?
- A transceiver in which all aspects of operation
  are determined by versatile, general-purpose
  hardware whose configuration is under
  software control
- Flexible all-purpose radios that can
  implement new and different standards or
  protocols through reprogramming.
- Same hardware for all air interfaces and
  modulation schemes
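
The "same hardware, different modulation scheme" idea can be illustrated with a minimal sketch. It assumes a hypothetical `SoftwareModem` class whose fixed datapath is reprogrammed by loading a different PSK mapping table; the class, the function names, and the choice of BPSK and QPSK are illustrative assumptions, not taken from the slides.

```python
import cmath
import math


def psk_table(bits_per_symbol):
    """Build an M-PSK table (symbol index -> unit-circle point), M = 2**bits."""
    m = 2 ** bits_per_symbol
    return {i: cmath.exp(2j * math.pi * i / m) for i in range(m)}


class SoftwareModem:
    """Hypothetical modem: one fixed datapath, scheme chosen by loaded table."""

    def __init__(self):
        self.bits_per_symbol = None
        self.table = None

    def load(self, bits_per_symbol):
        # "Reprogramming": swap the mapping table, keep the datapath as is.
        self.bits_per_symbol = bits_per_symbol
        self.table = psk_table(bits_per_symbol)

    def modulate(self, bits):
        k = self.bits_per_symbol
        symbols = []
        for i in range(0, len(bits) - k + 1, k):
            index = int("".join(str(b) for b in bits[i:i + k]), 2)
            symbols.append(self.table[index])
        return symbols


modem = SoftwareModem()
bits = [0, 1, 1, 0, 1, 1, 0, 0]
modem.load(1)                  # BPSK: 1 bit per symbol
bpsk = modem.modulate(bits)
modem.load(2)                  # QPSK: 2 bits per symbol, same modem object
qpsk = modem.modulate(bits)
print(len(bpsk), len(qpsk))
```

Only the table changes between the two runs, which is the property the definition above asks of a software radio.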


                                              2
Key Technological Constraints
• High speed wide band ADCs.
• High speed DSPs.
• Real Time Operating Systems
  (isochronous software)
• Power Consumption




                                3
             Research and
           Commercialization
• DARPA’s Adaptive computing system project
• Virginia Tech – algorithms and architecture;
  multi-user receiver based on reconfigurable
  computing; generic soft radio architecture for
  reconfigurable hardware
• UC Berkeley – Pleiades, ultra low power, high
  performance multimedia computing; high power
  efficiency by providing programmability
• Sirius Inc – Software Reconfigurable Code
  Division Multiple Access (CDMAx)

                                                        4
          Research and
        Commercialization
• Brigham Young University – Development of
  JHDL to facilitate hardware synthesis in
  reconfigurable processors
• Chameleon Systems – Reconfigurable Platform
  Architecture for wireless base station
• MorphIC Inc – Programmable hardware
  reconfigurable code using DRL
• Quicksilver Tech. Inc – Universal Wireless
  ’Ngine (WunChip) baseband algorithms


                                                    5
            Applications
• User Applications and Base Station
  Applications
• Evolve as a universal terminal
• Spectrum management:
  Reconfigurability is a big advantage
• Application updates, service
  enhancements and personalization

                                         6
  Programmable OFDM-CDMA
         Transceiver
• CDMA suffers from multiple-access
  interference and intersymbol interference (ISI).
• OFDM reduces interference, improves
  spectrum utilization, and helps attain a
  satisfactory bit error rate (BER).
• It is proposed that this could be
  implemented using SDR.
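
As a rough illustration of why OFDM copes with ISI, the sketch below sends one OFDM symbol through a two-path channel and recovers it with a one-tap equalizer per subcarrier. Everything here is an assumption made for the example: the function names, the eight subcarriers, the channel taps, and a receiver that knows the channel exactly. It is not the transceiver proposed on the slide.

```python
import cmath


def dft(x, inverse=False):
    """Plain O(n^2) DFT; the inverse includes the 1/n scaling."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def ofdm_transmit(symbols, cp_len):
    """One symbol per subcarrier -> time samples with a cyclic prefix."""
    time = dft(symbols, inverse=True)
    return time[-cp_len:] + time


def channel(signal, taps):
    """Multipath channel: linear convolution, truncated to the input length."""
    return [sum(taps[j] * signal[n - j] for j in range(len(taps)) if n - j >= 0)
            for n in range(len(signal))]


def ofdm_receive(signal, n_sub, cp_len, taps):
    """Drop the prefix, transform, then divide out the channel per subcarrier."""
    body = signal[cp_len:cp_len + n_sub]
    freq = dft(body)
    h = dft(list(taps) + [0] * (n_sub - len(taps)))
    return [y / hk for y, hk in zip(freq, h)]


symbols = [1+1j, -1+1j, -1-1j, 1-1j, 1+1j, 1-1j, -1+1j, -1-1j]
taps = [1.0, 0.5]              # hypothetical two-path channel
tx = ofdm_transmit(symbols, cp_len=2)
rx = ofdm_receive(channel(tx, taps), n_sub=8, cp_len=2, taps=taps)
max_error = max(abs(a - b) for a, b in zip(symbols, rx))
print(max_error < 1e-9)
```

Because the cyclic prefix is at least as long as the channel memory, the echo turns into a per-subcarrier scaling rather than interference between symbols.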


                                        7
                      SDR Architecture

[Figure: SDR block diagram. The RF unit holds two receive/transmit chains, each with LNA, RX, PA, EX., TX, Rx SYN, and Tx SYN blocks. The signal processing/control unit holds the Data converter, Quadrature MODEM, Baseband MODEM, Interface, and Control modules on a C-PCI bus, connected to the Input/Output HMI Terminal.]

                                       Hitachi Kokusai Electric Inc.,
                                       teshima.isao@h-kokusai.com
                                                                              8
Signal processing/control unit
• The signal processing/control unit consists
  of the following modules
  – Data converter
  – Quadrature Modem
  – Baseband Modem
  – Interface/Control
• The modules are connected to one another
  by a PCI bus, and each provides a CPU in
  addition to the FPGA and DSP devices.
                                            9
    Quadrature modem module

[Figure: the SDR architecture block diagram, repeated]

•   The quadrature modem uses FPGAs for the
    processing that produces the baseband
    sampling rate:
    –   Quadrature modulation
    –   Quadrature detection
    –   Sampling rate conversion
    –   Filtering
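
The four operations listed for the quadrature modem can be sketched in software. This is a floating-point illustration under stated assumptions, not the FPGA implementation: the function names, the carrier at one eighth of the sampling rate, and the block average standing in for the filter and rate converter are all choices made for the example.

```python
import math


def quadrature_modulate(i_samples, q_samples, fc, fs):
    """Mix baseband I/Q onto a carrier: s[n] = I*cos(w*n) - Q*sin(w*n)."""
    w = 2 * math.pi * fc / fs
    return [i * math.cos(w * n) - q * math.sin(w * n)
            for n, (i, q) in enumerate(zip(i_samples, q_samples))]


def quadrature_detect(signal, fc, fs, block):
    """Mix back down, then average each block of samples.

    The average acts as a crude low-pass filter, and keeping one output
    per block is a sampling rate conversion by a factor of `block`.
    """
    w = 2 * math.pi * fc / fs
    out = []
    for start in range(0, len(signal) - block + 1, block):
        i_acc = q_acc = 0.0
        for n in range(start, start + block):
            i_acc += 2 * signal[n] * math.cos(w * n)
            q_acc += -2 * signal[n] * math.sin(w * n)
        out.append((i_acc / block, q_acc / block))
    return out


# Two symbols, each held for 8 samples (one carrier period at fc/fs = 1/8).
i_bits = [1.0] * 8 + [-1.0] * 8
q_bits = [1.0] * 8 + [1.0] * 8
passband = quadrature_modulate(i_bits, q_bits, fc=1.0, fs=8.0)
recovered = quadrature_detect(passband, fc=1.0, fs=8.0, block=8)
print(recovered)
```

Averaging over a whole number of carrier periods cancels the double-frequency terms, so each block returns the transmitted I and Q values up to rounding error.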

                                                                                                    10
        Baseband modem module

[Figure: the SDR architecture block diagram, repeated]

•   The baseband modem processes
    –   Multi-channel modulation
    –   Multi-channel demodulation
•   It uses four floating-point DSP devices.
•   An individual DSP is assigned to each
    channel. Therefore, even while one channel
    is executing, a program can be downloaded
    to another channel.
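
The per-channel download property can be modeled in a few lines. The `ChannelDSP` class and the two toy "programs" below are hypothetical stand-ins for DSP firmware, used only to show that reloading one channel leaves the others undisturbed.

```python
class ChannelDSP:
    """Hypothetical model of one DSP dedicated to one channel."""

    def __init__(self):
        self.program = None
        self.outputs = []

    def download(self, program):
        # Replace this channel's program; other channels are untouched.
        self.program = program

    def step(self, sample):
        self.outputs.append(self.program(sample))


def scale_by_two(x):
    return 2 * x


def negate(x):
    return -x


dsps = [ChannelDSP() for _ in range(4)]     # four DSPs, one per channel
for dsp in dsps:
    dsp.download(scale_by_two)

for sample in [1, 2, 3, 4]:
    if sample == 3:
        dsps[1].download(negate)            # reload channel 1 mid-stream
    for dsp in dsps:
        dsp.step(sample)

print(dsps[0].outputs, dsps[1].outputs)
```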

                                                                                                      11
  An SDR/Multimedia Solution
W-CDMA / DAB / DVB / IEEE 802.11x; MPEG / JPEG Codecs




                                                      12
     PACT’s SDR XPP
eXtreme Processor Platform




                             13
PACT’s SDR XPP




                 14
              Architecture Goals
• Provide template for the exploration of a range of architectures
• Retarget compiler and simulator to the architecture
• Enable compiler to exploit the architecture
• Concurrency
   – Multiple instructions per processing element
   – Multiple threads per and across processing elements
   – Multiple processes per and across processing elements
• Support for efficient computation
   – Special-purpose functional units, intelligent memory,
     processing elements
• Support for efficient communication
   – Configurable network topology
   – Combined shared memory and message passing
                                                                     15
                          Architecture Template
• Prototyping template for array of processing elements
   – Configure processing element for efficient computation
   – Configure memory elements for efficient retiming
   – Configure the network topology for efficient communication

[Figure: array of processing elements built from Memory, RegFile, functional unit (FU; some PEs add DCT and HUF units), and ICache blocks. Annotations: "...configure PE..."; "...configure memory elements..."; "...configure PEs and network to match the application..."]
                                                                                                                    16
      Future Processing Element
• Specialized memory systems for efficient memory utilization
   – Multi-ported, banked, levels, and intelligent memory
• Split register file allows greater register bandwidth to FUs
   – Groups of functional units have dedicated register files
• Multiple contexts for a processing element provide latency tolerance
   – Hardware for efficient context switching to fill empty instruction
     slots
• Specialized functional units and processing elements
   – SIMD instructions
   – Re-configurable fabrics for bit-level operations
   – Re-use IP blocks for more efficient computation
   – Custom hardware for the highest performance


                                                                     17
Initial Distributed Architecture

[Figure: 3 x 3 array of PEs]

• Array of concurrent PEs and supporting network
• Malleable network topology
  – Topology matches application
• Efficient communication
• Memory organized around a PE
  – Each PE has physical memory
  – Message passing between PEs
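
A minimal sketch of this organization, assuming hypothetical `PE` and `Mesh` classes: each PE owns a private memory, data moves only as explicit messages, and links are added one by one so the topology can be shaped to the application.

```python
from collections import deque


class PE:
    """Processing element with private memory and a message inbox."""

    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.memory = {}          # local physical memory, not shared
        self.inbox = deque()      # messages delivered by the network


class Mesh:
    """Array of PEs; links are configured to match the application."""

    def __init__(self, rows=3, cols=3):
        self.pes = {(r, c): PE((r, c)) for r in range(rows) for c in range(cols)}
        self.links = set()

    def connect(self, src, dst):
        self.links.add((src, dst))

    def send(self, src, dst, key, value):
        if (src, dst) not in self.links:
            raise ValueError(f"no link {src} -> {dst}")
        self.pes[dst].inbox.append((src, key, value))

    def receive(self, dst):
        # The receiver copies the message payload into its own memory.
        src, key, value = self.pes[dst].inbox.popleft()
        self.pes[dst].memory[key] = value
        return src


mesh = Mesh()
mesh.connect((0, 0), (0, 1))
mesh.pes[(0, 0)].memory["x"] = 42
mesh.send((0, 0), (0, 1), "x", mesh.pes[(0, 0)].memory["x"])
mesh.receive((0, 1))
print(mesh.pes[(0, 1)].memory)
```

A PE that was never sent the value has no way to read it, which is the difference from the shared-memory organization described for the future architecture.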


                                            18
Future Distributed Architecture
• Multiple processing elements share a memory space
   – Shared memory communication
       • Snooping cache coherency protocol
       • Directory-based protocol required if the number of PEs in a
         shared memory space is large
• Introspective processing elements
   – Use processing elements to analyze the computation or
      communication
       • Identify dynamic bottlenecks and remove them on the fly
       • Reschedule and bind tasks as the introspective elements
         report


                                                                  19
           So What’s Different?
• Traditional application hw/sw design requires
   – Hand selection of traditional general purpose OS components
   – Hand written customization of
      • device drivers
      • memory management…
• Instead…
   – Application specific synthesis of OS components
      • scheduling
      • synchronization…
   – Automatic synthesis of hardware specific code from
     specifications
      • device drivers
      • memory management…
                                                              20
               ASIP Design
• Given a set of applications, determine the
  microarchitecture of the ASIP (i.e., the
  configuration of functional units in the
  datapaths and the instruction set).
• To accurately evaluate processor performance
  on a given application, the application program
  must be compiled onto the processor datapath
  and the object code simulated.
• The microarchitecture of the processor is a
  design parameter!
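
Treating the microarchitecture as a design parameter amounts to a search loop: pick a functional-unit configuration, schedule the application on it, simulate, and compare. The toy sketch below assumes a seven-operation dataflow graph, single-cycle operations, a greedy list scheduler, and made-up area costs and cycle budget; none of these come from the slides.

```python
from itertools import product

# Hypothetical application: (operation name, unit type, dependencies).
APP = [
    ("a", "mul", []), ("b", "mul", []), ("c", "mul", []), ("d", "mul", []),
    ("e", "alu", ["a", "b"]), ("f", "alu", ["c", "d"]), ("g", "alu", ["e", "f"]),
]


def simulate(app, units):
    """Cycle count of a greedy list schedule; every op takes one cycle.

    Assumes at least one unit of every type the application uses.
    """
    done = {}                    # op name -> cycle in which it executed
    pending = list(app)
    cycle = 0
    while pending:
        cycle += 1
        free = dict(units)       # units available in this cycle
        issued = []
        for op in pending:
            name, kind, deps = op
            ready = all(d in done and done[d] < cycle for d in deps)
            if ready and free[kind] > 0:
                free[kind] -= 1
                issued.append(op)
        for op in issued:
            done[op[0]] = cycle
            pending.remove(op)
    return cycle


AREA = {"mul": 4, "alu": 1}      # hypothetical relative area per unit
BUDGET = 4                       # hypothetical cycle budget

candidates = []
for n_mul, n_alu in product(range(1, 5), range(1, 3)):
    cycles = simulate(APP, {"mul": n_mul, "alu": n_alu})
    area = n_mul * AREA["mul"] + n_alu * AREA["alu"]
    if cycles <= BUDGET:
        candidates.append((area, n_mul, n_alu, cycles))

best = min(candidates)           # smallest area that meets the budget
print(best)
```

A real flow would replace `simulate` with a retargetable compiler plus an instruction-set simulator, but the outer loop keeps the same shape.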

                                                21
ASIP Design Flow




                   22
              Compiler Goals
• Develop a retargetable compiler infrastructure that
  enables a set of interesting applications to be
  efficiently mapped onto a family of fully
  programmable architectures and microarchitectures.
• 10 Year Vision:
  – Will have fully automatically-retargetable
    compilation, OS synthesis, and simulation for a
    class of architectures consisting of multiple
    heterogeneous processing elements with
    specialized functional units / memories
  – Compiled code size and performance will be within
    10% of hand-coding
                                                        23
     Compiler Research Issues
• Synthesis of RTOS elements in the compiler
   – On the application side: Generation of an efficient
     application-specific static/run-time scheduler and
     synchronization
   – On the hardware side: Generation of device drivers, memory
     management primitives, etc. using hardware specifications
• Automatic retargetability for family of target architectures while
  preserving aggressive optimization
• Automatic application partitioning
   – Mapping of process/task-level concurrency onto multiple PEs
     using programmer guidance in programmer’s model
• Effective visualization for family of target architectures


                                                                  24
 An Efficient Architecture Model for
 Systematic Design of Application-
    Specific Multiprocessor SoC
                          DATE’ 2001




Amer Baghdadi Damien Lyonnard Nacer-E. Zergainoh Ahmed A. Jerraya
                TIMA Laboratory, Grenoble, France


                                                                25
Efficient application-specific multiprocessor design



  • Modularity


  • Flexibility


  • Scalability




                                                       26
A multiprocessor architecture platform for
application-specific SoC design(1)




 Figure 1. A multiprocessor architecture platform




                                                    27
A multiprocessor architecture platform for application-
specific SoC design(2)


• Architecture platform parameters
  1. Number of CPUs

  2. Memory sizes for each processor

  3. I/O ports for each processor

  4. Interconnections between processors

  5. Communication protocols and the external connections
     (peripherals)
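
One way to picture these parameters is as a configuration record that an architecture generator would consume. The sketch below is an assumption for illustration only: the `Processor` and `Platform` types, the field names, and the sample values are invented here and are not the format used by the authors' tools.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Processor:
    memory_bytes: int            # parameter 2: memory size
    io_ports: int                # parameter 3: I/O ports


@dataclass
class Platform:
    processors: List[Processor]                 # parameter 1: number of CPUs
    interconnections: List[Tuple[int, int]]     # parameter 4: processor pairs
    protocol: str                               # parameter 5: protocol
    peripherals: List[str] = field(default_factory=list)

    @property
    def num_cpus(self) -> int:
        return len(self.processors)

    def validate(self) -> None:
        """Reject links that name a processor the platform does not have."""
        for a, b in self.interconnections:
            if not (0 <= a < self.num_cpus and 0 <= b < self.num_cpus):
                raise ValueError(f"link ({a}, {b}) names a missing processor")


platform = Platform(
    processors=[Processor(memory_bytes=64 * 1024, io_ports=2) for _ in range(4)],
    interconnections=[(0, 1), (1, 2), (2, 3), (3, 0)],
    protocol="handshake",        # placeholder protocol name
    peripherals=["uart"],        # placeholder peripheral
)
platform.validate()
print(platform.num_cpus)
```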


                                                            28
Application-specific multiprocessor SoC design flow (1)




 Figure 2. The Y-chart: MFSAM-based architecture generation scheme


                                                                     29
Application-specific multiprocessor SoC design flow(2)




 Figure 3. MFSAM-based architecture generation flow for multiprocessor SoC

                                                                        30
Architecture design(1)




      Figure 4. Communication Interface

                                          31
Architecture design(2)




      Figure 5. Block diagram of the packet routing switch
                        (Point to Point network)

                                                             32
Architecture validation




 Figure 6. A 4-processor cosimulation architecture of the packet routing switch

                                                                           33
Analyzing the design cycle (1)




 Figure 7. A 4-processor cosimulation architecture of the IS-95 CDMA



                                                                       34
Analyzing the design cycle (2)




 Table 1. Time needed to fit the IS-95 CDMA on the multiprocessor platform



                                                                            35
Conclusion

 1. Presented a generic architecture model for application-
   specific multiprocessor system-on-chip design

 2. The proposed model is modular, flexible and scalable.

 3. Definition of the architecture model and a systematic
   design flow that can be automated.




                                                            36
        A Single-Chip Multiprocessor

• Currently, processor designs dynamically extract parallelism
  by executing many instructions within a single,
  sequential program in parallel.

• Future performance improvements will require processors
  to be enlarged to execute more instructions per clock
  cycle.

• Two alternative micro-architectures that exploit multiple
  threads of control

   – SMT : simultaneous multithreading
   – CMP : chip multiprocessor

                                                              37
        A Single-Chip Multiprocessor

• Exploiting parallelism

   – Loop-level parallelism results when the instruction-level
     parallelism comes from data-independent loop iterations.

   – Some compilers can also divide a program into multiple threads
     of control, exposing thread-level parallelism.

   – A third form of very coarse parallelism, process-level
     parallelism, involves completely independent applications
     running in independent processes controlled by the operating
     system.
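
The distinction that matters for loop-level parallelism is whether iterations depend on one another. The sketch below contrasts a loop with independent iterations, which can be handed to workers in any order, with a loop that carries a dependence. The function names and the worker count are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def body(i):
    """Loop body that reads only its own index: iterations are independent."""
    return i * i


N = 16

# Sequential reference.
sequential = [body(i) for i in range(N)]

# Independent iterations can be distributed across workers in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(body, range(N)))

# Counter-example: each iteration needs the previous result, so the loop
# carries a dependence and cannot be split this way.
running = [0] * N
for i in range(1, N):
    running[i] = running[i - 1] + body(i)

print(parallel == sequential)
```

In CPython, threads will not speed up this arithmetic because of the global interpreter lock; the example shows the dependence structure rather than a measured speedup.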




                                                                 38
Exploiting Program Parallelism


[Figure: levels of parallelism (Instruction, Loop, Thread, Process) plotted against grain size in instructions, on a logarithmic axis from 1 to 1M]


                                                                                             39
          SMT (simultaneous
           multithreading)
• SMT processors augment wide (issuing many
  instructions at once) superscalar processors with
  hardware that allows the processor to execute
  instructions from multiple threads of control concurrently
• Dynamically selecting and executing instructions from
  many active threads simultaneously.
• Higher utilization of the processor’s execution resources
• Provides latency tolerance in case a thread stalls due to
  cache misses or data dependencies.
• When multiple threads are not available, however, the
  SMT simply looks like a conventional wide-issue
  superscalar.
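
The utilization argument can be made concrete with a deliberately crude model. It assumes each thread exposes a fixed number of ready instructions per cycle (the traces below are invented), that unissued instructions are not carried over, and that the only limit is the issue width.

```python
def superscalar_utilization(ready_per_cycle, width):
    """Fraction of issue slots filled when only one thread can issue."""
    issued = sum(min(width, r) for r in ready_per_cycle)
    return issued / (width * len(ready_per_cycle))


def smt_utilization(threads, width):
    """Fraction of issue slots filled when all threads share the slots."""
    cycles = len(threads[0])
    issued = sum(min(width, sum(t[c] for t in threads)) for c in range(cycles))
    return issued / (width * cycles)


WIDTH = 8
thread_a = [2, 0, 3, 1]     # hypothetical ready-instruction counts per cycle
thread_b = [1, 4, 0, 2]

single = superscalar_utilization(thread_a, WIDTH)
shared = smt_utilization([thread_a, thread_b], WIDTH)
alone = smt_utilization([thread_a], WIDTH)
print(single, shared, alone)
```

With two threads the shared slots are filled more often, and with only one thread the SMT figure equals the superscalar one, matching the last bullet above.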


                                                          40
         Single- vs. Multi-threaded

• Single-threaded/blocking: CPU waits for accelerator.
• Multithreaded/non-blocking: CPU continues to execute along
  with accelerator.
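
The two styles can be contrasted in a small sketch, with a worker thread standing in for the accelerator. The `accelerator` and `cpu_work` functions are placeholders invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator(block):
    """Placeholder for work offloaded to an accelerator."""
    return sum(x * x for x in block)


def cpu_work(n):
    """Placeholder for work the CPU could do in the meantime."""
    return sum(range(n))


def blocking(block, n):
    accel = accelerator(block)        # CPU waits here until the result is back
    cpu = cpu_work(n)
    return accel, cpu


def non_blocking(block, n):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(accelerator, block)   # start and return at once
        cpu = cpu_work(n)                           # CPU keeps executing
        accel = pending.result()                    # synchronize when needed
    return accel, cpu


print(blocking([1, 2, 3], 10), non_blocking([1, 2, 3], 10))
```

Both versions compute the same values; the difference is where the CPU waits, which is why the non-blocking form needs an explicit synchronization point.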




                                                 41
            Multithreading
– Multiple threads share the functional units of a
  single processor in an overlapping fashion.
– The processor must duplicate the independent
  state of each thread (register file, a separate PC,
  page table).
– Memory can be shared through the virtual memory
  mechanisms, which already support
  multiprocessing.
– Needs hardware support for changing the threads.



                                                     42
  Single-Chip Multiprocessor
• CMPs use relatively simple single-thread
  processor cores to exploit only moderate
  amounts of parallelism within any one thread,
  while executing multiple threads in parallel
  across multiple processor cores.
• If an application cannot be effectively
  decomposed into threads, CMPs will be
  underutilized.


                                              43
Basic Out-of-order Pipeline




                              44
SMT Pipeline




               45
         Instruction Issue




Reduced function unit utilization due to dependencies




                                                        46
       Superscalar Issue




Superscalar leads to more performance, but lower utilization
                                                               47
Simultaneous Multithreading




Maximum utilization of function units by independent operations
                                                                  48
Superscalar Architecture
               Issue up to 12 instructions per cycle




                                              49
        SMT Architecture
8 separate PCs; executes instructions from 8 different threads concurrently




                                                                 Multi bank
                                                                 caches




                                                                     50
Chip multiprocessor
    architecture
   Eight small 2-issue superscalar processors; depends on TLP




                                                           51
             Single-chip multiprocessor
                     Kunle Olukotun, http://www-hydra.stanford.edu

[Figure: block diagram. Centralized bus arbitration mechanisms; CPU 0 to CPU 3, each with an L1 instruction cache, an L1 data cache, and a memory controller; a write-through bus (64b) and a read/replace bus (256b) connecting them to the on-chip L2 cache, the Rambus memory interface (to DRAM main memory), and the I/O bus interface (to I/O devices).]

• Four processors
• Separate primary caches
• Write-through data caches to maintain coherence
• Shared 2nd-level cache
• Low latency interprocessor communication (10 cycles)
• Separate read and write buses
                                                                52
Characteristics of superscalar, simultaneous
  multithreading, and chip multiprocessor




                                          53
         CMP and Memory

• A 12-issue superscalar or SMT processor can
  place large demands on the memory system.

• The CMP architecture features sixteen 16-Kbyte
  caches.
   – The small cache size and tight connection
     to these caches allow single-cycle access.




                                                   54
              CMP Solution
• A short cycle time can be targeted with relatively
  little design effort, since the h/w is naturally
  clustered: each of the small CPUs is already
  a very small, fast cluster of components.
• The OS allocates a single s/w thread of
  control to each processor, so no h/w is
  required to dynamically allocate instructions
  to different clusters.
• Heavy reliance on s/w to direct instructions to
  clusters limits the amount of ILP a CMP can exploit, but
  allows the clusters within the CMP to be small
  and fast.
                                                55
        A Single-Chip Multiprocessor

• Relative performance of superscalar, simultaneous
  multithreading, and chip multiprocessor architectures




                                                          56
Multi-core SoC Platform Integration using AMBA

                     DesignCon 2002
        System on Chip and IP Design Conference

          Robert L. Veal, Levon Petrosian, Neal Stollon




                                                          57
               Outline


Overview of AMBA AHB

AMBA Application to Multiprocessor Systems(RAMA)

Summary




                                              58
               Overview of AMBA AHB
      AMBA Based Integration for SoC Platforms
• Core integration is a significant part of SoC design
   - Including both RISC and signal processing engines
   - Well-defined bus strategies make it easier

• AHB is being adopted based on both features and
   standardization
    - Low overhead for core-to-memory communication
    - Standard interface increases IP value
    - RAMA integrates RADcore and OMNIcore using AHB
      along with memory blocks, arbiters, and external
      interfaces
AMBA : Advanced Microcontroller Bus Architecture
AHB : Advanced High-performance Bus
RAMA : Reconfigurable Array Multimedia Architecture
RADcore : Infinite Technology Corporation’s proprietary cores for
        reconfigurable signal processing
OMNIcore : Infinite Technology Corporation’s proprietary core for
        general purpose RISC processing
           Overview of AMBA AHB
  Value of AMBA Interfaces in Core Integration
• Key to AHB
   - Definition of master and slave AHB components
   - Master : initiates operations by sourcing the address and
     control signals for a bus operation
   - Slave : responds and performs operations under
     the control of a master (e.g. memories and peripherals)

• Attractive key features of AMBA AHB
   - Configurable data bus size (8 to 1024 bits)
   - Dedicated request/grant and bus locking signals
   - Flexible (user-defined) arbiter-based bus control
   - State-based handshaking between master and slave
   - No tri-state buses; mux-based unidirectional
     operation
Overview of AMBA AHB
  AHB Principle of Operations
• Specific datapath structure and signaling of a multiplexed bus
   - Interconnection of multiple masters and slaves is handled
     by multiplexers
   - On-chip bussing is based on an arbitrated request/grant approach
   - Two types of interface are bussed
        • Master interfaces : initiate transactions through granted
            requests and source the address and communication
            parameters of a data transfer
        • Slave interfaces : respond to master requests and provide
            the status of requested transactions
• High-performance system bus
   - Supports multiple bussed cores and provides high-bandwidth
     operation
   - Single-edge timed, multiplexed data bus controlled by
     arbitration logic
   - All busses and signals are unidirectional, as suits an on-chip bus structure
             Overview of AMBA AHB
                           AHB Variants
• Specifics of the interconnection structure
   - Left open to the user

• Different bus structures and levels of transfer bandwidth
   - Characterized by the number of masters and bus layers
     (sub-buses)
   - Efficient customization of the architecture within the
      standardized platform framework

• Usage for multi-processor core platforms
   - Several types of busses are used concurrently for control and
     high-bandwidth data transfer in inter-core communications
Overview of AMBA AHB
Single-layer/Single-master AHB
              - Known as AHB-Lite, reduced
                complexity version
              - A single master : no contention for bus
                ownership, no arbitration
              - No arbitration : no implementation of
                request and grant signals



Single-layer/Multi-master AHB

              - Ensure that a given master gains and
                maintains access to the bus
              - Increase the performance of data
                transfers between multiple signal
                processors and memories
Overview of AMBA AHB
Multi-layer/Single-master AHB
                - Concurrently accessing common
                  slave resources
                - The number of masters determines
                  the number of bus layers
                - Each master has a dedicated bus




Multi-layer/Multi-master AHB

                - Each master has a dedicated bus
                  in multi-layer
                - Both masters and slaves access
                  a common set of bus resources
                - The number of bus layers defined
                  by the number of slaves requiring
                  concurrent data transfer
             Overview of AMBA AHB
                 Master Slave Communication




• Both the AHB master and slave have embedded (4-state) state machines
   - Allow communication of master-slave and multiple-master status
• Specifics of the FSM operation
   - Driven by the features of the processor (transfer FSM) and memory
     (response FSM) blocks being used
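
The two four-state machines line up with the four transfer types (HTRANS) and four responses (HRESP) defined in the AMBA 2.0 AHB specification. Reading the slide that way is an assumption, and `next_htrans` below is a hypothetical, heavily simplified model of a master stepping through one burst, not the RAMA implementation.

```c
#include <assert.h>

/* HTRANS[1:0] encodings (AMBA 2.0 AHB): the master-side transfer types. */
typedef enum { HTRANS_IDLE = 0, HTRANS_BUSY = 1, HTRANS_NONSEQ = 2, HTRANS_SEQ = 3 } htrans_t;

/* HRESP[1:0] encodings (AMBA 2.0 AHB): the slave-side responses. */
typedef enum { HRESP_OKAY = 0, HRESP_ERROR = 1, HRESP_RETRY = 2, HRESP_SPLIT = 3 } hresp_t;

/* Hypothetical helper: choose the master's next transfer type.
 * beats_left   - beats of the burst still to be issued
 * master_ready - nonzero if the master can supply the next beat now
 * resp         - response the slave gave to the previous beat */
htrans_t next_htrans(htrans_t cur, int beats_left, int master_ready, hresp_t resp)
{
    assert(beats_left >= 0);
    if (resp != HRESP_OKAY)      /* simplification: abandon the burst on any non-OKAY response */
        return HTRANS_IDLE;
    if (beats_left == 0)         /* nothing left to transfer */
        return HTRANS_IDLE;
    if (!master_ready)           /* BUSY is only legal in the middle of a burst */
        return (cur == HTRANS_IDLE) ? HTRANS_IDLE : HTRANS_BUSY;
    /* The first beat of a burst is NONSEQ, the remaining beats are SEQ. */
    return (cur == HTRANS_IDLE) ? HTRANS_NONSEQ : HTRANS_SEQ;
}
```

A real master would also track burst type and would re-issue the transfer after RETRY or SPLIT; that is left out here to keep the state transitions visible.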
AMBA Applications to Multiprocessor Systems
           RAMA Block Diagram
      AMBA Applications to Multiprocessor Systems
                                  RAMA
• High-performance multi-core platform for addressing datapath
  applications
   - Standardizes the on-chip bus operation by adopting AMBA AHB
• Integration of ITC’s RADcore and OMNIcore processor cores
   - Driven by the features of the processor (transfer FSM) and memory
      (response FSM) blocks being used
• RADcore
   - Signal processing engine : parallel processing, Reconfigurable
       Arithmetic Datapath (RAD) features
   - Data interface : initialization I/O EXU, memory bus interfaces,
       RADbus interface
• OMNIcore
   - 32-bit cryptographic/RISC architecture
   - High-performance RISC processor with a dual memory bus interface
   - Uses AHB as its central bus structure
• Other elements
   - Memory blocks, an external memory interface core, arbitration logic
AMBA Applications to Multiprocessor Systems
       RAMA Multi-layer using AHB

• Inter-core communication is based on two AHB busses
   - Separates control and data access and reduces their interdependence
   - Control interface : a single-layer AHB with the OMNIcore
      control/ROM port and the external memory DMA as masters
   - Slaves are the boot and local (instruction) memory and the RADcore
      control interfaces
   - Single layer, since the inter-core control and memory update operations are
      intermittent
• Data transfer AHB has up to six masters
   - OMNIcore Data/RAM port and the external memory DMA, along with
      up to four RADcore I/O ports
   - To facilitate high-bandwidth multi-core performance, the data
      transfer AHB is a multi-layer AHB structure
  AMBA Applications to Multiprocessor Systems
Interfaces of Multiple System Domains in RAMA
Key communications interfaces of RAMA
  ①   RADcore to on-chip memory array data read/write operations
  ②   OMNIcore to on-chip memory array data read/write operations
  ③   RADcore to External Memory Buffer read/write operations
  ④   OMNIcore to External Memory Buffer read/write operations
  ⑤   RADcore-to-RADcore data transfers
  ⑥   RADcore to external logic data transfers
  ⑦   External memory (DMA) to internal memory array read/write operations
  ⑧   OMNIcore to RADcore control read/write operations
  ⑨   OMNIcore to local (scratch) RAM read/write operations
  ⑩   OMNIcore to (boot) ROM read operations
        AMBA Applications to Multiprocessor Systems
                               RADcore Overview
• A high-performance reconfigurable signal processor
  with a Distributed Instruction Word (DIW) architecture
• Consists of a core controller/sequence block,
  a DIW instruction memory, a set
  of Execution Units (EXUs), data
  I/O, and an external logic interface
• The initialization busses,
  Reconfigurable Channel Bus (RCB),
  and the supporting flags
  encapsulate and interconnect each
  EXU
• Key features
    •   15-channel reconfigurable data bus based architecture
    •   Reduces register-based operations
    •   User-definable pipeline depth
    •   Distributed instruction word driven parallel operation
    •   Supports highly pipelined dataflow
    •   Configuration selectable by designer (up to 11 EXUs)
    •   AMBA-compatible memory and core-to-core busses
    •   Spreadsheet-based RADware programming environment
      AMBA Applications to Multiprocessor Systems
                         RADcore Interfaces

• Controller interface
    - Between the RADcore and host processor
• Memory interface
    - Both on-chip RAM block and off-chip memory interfaces
• RADbus interface
    - RADcore-to-RADcore, initialization I/O EXU
• External Logic Buffer
    - Co-processing with arbitrary external logic
AMBA Applications to Multiprocessor Systems
          OMNIcore Overview

• Key features
    •   32-bit RISC engine
    •   Cryptographic support
    •   AHB-compliant control and RAM busses
        → User-selectable 8- to 32-bit operation
    •   4-stage pipeline
        → Low interrupt latency
    •   Two privilege levels: user and system
        → Supports smart card applications
AMBA Applications to Multiprocessor Systems
          OMNIcore Overview
• Two primary interfaces, for instruction
  operation (Ctrl) and data read/write (RAM)
    - The RAM interface gives access to the memory bus for on-chip
      memory and external memory operations
    - The Control interface gives access to a local control bus for
      loading instruction data into the
      instruction cache and for supervisory
      and status communications with the
      RADcore control blocks
• Dual-master AHB interface to
  integrate control and data functions
    - Data output bus is shared
    - An instruction cache internal to
      the OMNIcore subsystem is used to
      avoid stalling
      AMBA Applications to Multiprocessor Systems
            OMNIcore Cryptographic Features

• Symmetric-key and public-key cryptographic algorithms
   - DES (symmetric-key); RSA, DSA and Diffie-Hellman (public-key)
   - Controlled by a set of cryptographic instructions

• Cryptographic instruction support for
   - Compression Permutation
   - Expansion Permutation
   - Initial Permutation
   - Final Permutation
   - Key Permutation
   - Key Rotation
   - P-Box Permutation
   - S-Box Permutation
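
The permutation and rotation steps listed above are the table-driven bit operations that DES is built from, which is why dedicated instructions help. As a rough software sketch of what such instructions compute (the function names and the tiny 4-bit tables are hypothetical illustrations, not the OMNIcore instruction set and not the real DES tables):

```c
#include <assert.h>
#include <stdint.h>

/* DES-style table-driven permutation. table[i] is the 1-based position,
 * counted from the most significant bit of an in_bits-wide input, of the
 * input bit that becomes output bit i+1. An output wider than the input
 * gives an expansion permutation, a narrower one a compression permutation. */
uint64_t permute(uint64_t in, const uint8_t *table, int out_bits, int in_bits)
{
    assert(in_bits >= 1 && in_bits <= 64 && out_bits >= 1 && out_bits <= 64);
    uint64_t out = 0;
    for (int i = 0; i < out_bits; i++) {
        assert(table[i] >= 1 && table[i] <= in_bits);
        uint64_t bit = (in >> (in_bits - table[i])) & 1u;
        out = (out << 1) | bit;
    }
    return out;
}

/* Key rotation: rotate a 28-bit key half left by n positions (0 < n < 28). */
uint32_t rotl28(uint32_t half, int n)
{
    assert(n > 0 && n < 28);
    half &= 0x0FFFFFFFu;
    return ((half << n) | (half >> (28 - n))) & 0x0FFFFFFFu;
}

/* Hypothetical demo tables on a 4-bit input. */
static const uint8_t DEMO_REVERSE[4] = { 4, 3, 2, 1 };        /* plain permutation */
static const uint8_t DEMO_EXPAND[6]  = { 4, 1, 2, 3, 4, 1 };  /* 4 -> 6 bit expansion */
```

For example, `permute(0xA, DEMO_EXPAND, 6, 4)` spreads the bits 1010 into 010101, in the same way the DES expansion permutation reuses edge bits of each block.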
    AMBA Applications to Multiprocessor Systems
                 RAMA Memory Subsystem

• Distributed memory block architecture, consisting of dual-port memory
  blocks
• Key features
    - Dual-port RAM blocks
    - Multi-layer AHB for simultaneous memory access
    - Dual-mode external memory interface
         → DMA interface for internal/external memory transfers (AHB
              master)
         → Buffer for processor/external memory transfers (AHB slave)
    - Multi-layer arbiter
         → Priority based
    AMBA Applications to Multiprocessor Systems
                         AHB Arbitration
• Multi-layer arbitration scheme
    - Coordinates concurrent processor-memory transfers between masters
    (OMNIcore, multiple RADcores, external memory DMA) and slaves (memory,
    external memory buffer)
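
One way to picture a priority-based multi-layer arbiter is as one small arbiter per slave, so that masters targeting different slaves are granted in the same cycle and only masters contending for the same slave are serialized. The sketch below assumes fixed priority with the lowest master index winning; that ordering, the slave count, and the function names are illustrative assumptions, since the slides only say the arbiter is priority based.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_MASTERS 6    /* OMNIcore data port, external memory DMA, up to four RADcore I/O ports */
#define NUM_SLAVES  2    /* hypothetical: memory array and external memory buffer */
#define NO_GRANT   (-1)

/* Fixed-priority arbiter for one slave: the lowest-numbered requester wins. */
int arbitrate_slave(uint32_t requests)
{
    for (int m = 0; m < NUM_MASTERS; m++)
        if (requests & (1u << m))
            return m;
    return NO_GRANT;
}

/* target[m] is the slave that master m wants this cycle, or -1 if idle.
 * grant[s] receives the master granted to slave s, or NO_GRANT. */
void arbitrate_layers(const int target[NUM_MASTERS], int grant[NUM_SLAVES])
{
    uint32_t requests[NUM_SLAVES] = { 0 };
    for (int m = 0; m < NUM_MASTERS; m++) {
        assert(target[m] >= -1 && target[m] < NUM_SLAVES);
        if (target[m] >= 0)
            requests[target[m]] |= 1u << m;
    }
    for (int s = 0; s < NUM_SLAVES; s++)
        grant[s] = arbitrate_slave(requests[s]);
}
```

With targets `{0, 1, 0, -1, 1, -1}`, masters 0 and 1 are granted concurrently on different slaves, while masters 2 and 4 wait for the next cycle.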



            A Configurable Master/Slave Port
• RADbus AMBA AHB features
    - Allows direct processor-to-processor communication
    - Hybrid (configurable) master/slave interface
    - Mode-dependent changes in AHB operation
    - All write operations in master mode
    - All read operations in slave mode
    - Uses a first-come, first-served method for arbitration
    - Low overhead ensures fast operation
AMBA Applications to Multiprocessor Systems
     A Configurable Master/Slave Port
• Structure of the RADbus AHB scheme
    - All out-ports are defined as bus masters, structured as a write-only master
    - All in-ports are defined as bus slaves, structured as a write-only slave
    - The number of RADcores connected to the RADbus determines the size of the address
    - Selection of which bus channel (A, B, C) is read into the RADcore is defined as a function of decoded address bits from the master in conjunction with the state of the slave
    - The selection algorithm is a “first-come, first-served” mechanism in the read mux, controlled by an address-decoded select signal (a, b, c) for each bus
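
The first-come, first-served selection in the read mux can be modelled in software by stamping each pending channel with an arrival ticket and always serving the oldest one. This is a behavioural sketch only; the structure, the names, and the ticket counter are assumptions, and the slides do not say how simultaneous arrivals are ordered.

```c
#include <assert.h>

#define NUM_CHANNELS 3      /* bus channels A, B, C */
#define NO_CHANNEL  (-1)

typedef struct {
    int           pending[NUM_CHANNELS];   /* 1 while a write is waiting on the channel */
    unsigned long arrival[NUM_CHANNELS];   /* ticket taken when that write arrived */
    unsigned long next_ticket;
} fcfs_mux_t;

void fcfs_init(fcfs_mux_t *mux)
{
    for (int ch = 0; ch < NUM_CHANNELS; ch++) {
        mux->pending[ch] = 0;
        mux->arrival[ch] = 0;
    }
    mux->next_ticket = 0;
}

/* A master's decoded address selects channel ch; a repeated request keeps its original ticket. */
void fcfs_request(fcfs_mux_t *mux, int ch)
{
    assert(ch >= 0 && ch < NUM_CHANNELS);
    if (!mux->pending[ch]) {
        mux->pending[ch] = 1;
        mux->arrival[ch] = mux->next_ticket++;
    }
}

/* Pick the channel whose request arrived first and mark it served. */
int fcfs_select(fcfs_mux_t *mux)
{
    int best = NO_CHANNEL;
    for (int ch = 0; ch < NUM_CHANNELS; ch++)
        if (mux->pending[ch] &&
            (best == NO_CHANNEL || mux->arrival[ch] < mux->arrival[best]))
            best = ch;
    if (best != NO_CHANNEL)
        mux->pending[best] = 0;
    return best;
}
```

Requests arriving in the order B, A, C are served in that same order, regardless of channel index.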
                           Summary

• RAMA discussed as a platform-based solution
• Uses multiple AHBs for core-to-core integration
• AHB is easily integrated into the RAMA architecture
• AHB provides well-understood, flexible interfaces
• RADbus example shows AHB can be flexible
• Combination of OMNIcore and RADcore provides
  enhanced DSP and data processing
• Extends the platform to reach emerging SoC applications


    Cores + Infrastructure + Integration = SoC Platform
    (OMNIcore + RADcore) + (RAM) + AMBA AHB = RAMA
Lightweight Implementation of the POSIX
   Threads API for an On-Chip MIPS
  Multiprocessor with VCI Interconnect




                                          81
                Contents

•   Target architecture
•   MIPS CPU properties
•   The architecture needs
•   Pthread specification
•   Implementation
•   Experimental setup
•   Conclusion
                             82
            Target architecture




               General VCI based SoC architecture


• The system consists of one or more MIPS R3000 CPUs

• VCI (Virtual Component Interface) compliant interconnect


                                                     83
       MIPS CPU properties

• Two separate caches for instructions
  and data.
• Direct-mapped caches.
• Write buffer with write-update and write-through
  policy.
• No memory management unit (MMU):
  logical addresses are physical
  addresses.
                                             84
     The architecture needs
• Protected access to shared data: use a
  spin lock
  – The spin lock is acquired using
   pthread_spin_lock
  – The spin lock is released using
   pthread_spin_unlock
• Cache coherency
  – If the interconnect is a shared bus, use
   snoopy caches.
    • Reduces main memory traffic.
                                               85
              Pthread specification
[Figure: annotated thread-creation call. Labels: unique identifier for the thread; thread attributes (stack size, stack address, scheduling policy); the ‘start’ function call that executes the thread.]


• Main kernel objects are the threads and the scheduler.
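
The three annotated pieces (thread identifier, attributes, start function) correspond to the arguments of the standard `pthread_create` call. A small sketch on a hosted POSIX system follows; the stack size, the `start` body, and `run_create_demo` are arbitrary choices for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* The 'start' function: the new thread begins executing here. */
static void *start(void *arg)
{
    int *value = arg;
    *value *= 2;
    return arg;                      /* handed back to pthread_join */
}

/* Hypothetical driver: creates one thread and returns the value it produced. */
int run_create_demo(void)
{
    pthread_t tid;                   /* unique identifier for the thread */
    pthread_attr_t attr;             /* thread attributes */
    void *result = NULL;
    int value = 21;
    int rc;

    rc = pthread_attr_init(&attr);
    assert(rc == 0);
    rc = pthread_attr_setstacksize(&attr, 256 * 1024);   /* stack size attribute */
    assert(rc == 0);

    rc = pthread_create(&tid, &attr, start, &value);
    assert(rc == 0);
    pthread_attr_destroy(&attr);

    rc = pthread_join(tid, &result);
    assert(rc == 0);
    assert(result == &value);
    return value;
}
```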




                                                                                        86
Pthread specification

           • Changing state is done
             by calling a pthread
             function on a shared
             object.

           • The transition from RUNNABLE to
             RUN is done by the
             scheduler. The reverse transition,
             from RUN to RUNNABLE, is made
             using sched_yield.

           • A thread structure
                                      87
      Pthread specification
• The scheduler manages 5 lists of
  threads.
  – Symmetric Multi-Processor (SMP) :
    the scheduler may be shared by all processors.
  – Distributed : a scheduler exists on every
    processor.
• Access to the scheduler must be
  performed in a critical section, under
  the protection of a lock.
• Other implemented objects
                                            88
              Implementation
◈ Booting sequence


                         • The scheduler_created variable
                         must be declared with the volatile
                         type qualifier to ensure that the
                         compiler will not optimize away the
                         seemingly infinite loop that waits on it.
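
A minimal sketch of the waiting loop this bullet refers to is shown below. It assumes that CPU 0 is the processor that creates the scheduler; `boot`, `scheduler_init`, and `ready_threads` are hypothetical names, and only `scheduler_created` comes from the slide.

```c
#include <assert.h>

/* Without volatile, the compiler may read the flag once, keep it in a
 * register, and turn the wait below into a truly infinite loop. */
static volatile int scheduler_created = 0;

static int ready_threads = -1;       /* hypothetical scheduler state */

static void scheduler_init(void)
{
    ready_threads = 0;               /* stand-in for building the scheduler lists */
}

/* Boot entry executed by every CPU; returns the CPU id once booting may proceed. */
int boot(int cpu_id)
{
    if (cpu_id == 0) {
        scheduler_init();
        scheduler_created = 1;       /* release the other CPUs */
    } else {
        while (!scheduler_created) {
            /* spin: the flag is re-read from memory on every iteration */
        }
    }
    assert(ready_threads >= 0);      /* the scheduler exists past this point */
    return cpu_id;
}
```

In a single-threaded test, `boot(1)` only returns if `boot(0)` has already run. Note that `volatile` constrains the compiler only; on the target, the write-through caches and coherence mechanism are what make the store visible to the other CPUs, and hosted multithreaded C code would normally use C11 atomics instead.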




                                                      89
           Implementation
• Context Switch
  – Saves the current values of the CPU registers
    into the context variable of the thread that is
    currently executing.

  – Sets the values of the CPU registers to the
    values in the context variable of the new
    thread to execute.

  – The return address of the function is a
    register of the context.
                                              90
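
A real context switch on the R3000 is a few lines of assembly, so the following is only a C model of the three steps above, operating on a simulated CPU. The `cpu_t` and `context_t` types and the function name are hypothetical; the register numbers follow the usual MIPS convention ($29 is the stack pointer, $31 the return address).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS 32
#define REG_SP   29      /* MIPS stack pointer, $sp */
#define REG_RA   31      /* MIPS return address, $ra */

typedef struct { uint32_t regs[NUM_REGS]; } context_t;            /* per-thread context variable */
typedef struct { uint32_t regs[NUM_REGS]; uint32_t pc; } cpu_t;   /* simulated CPU state */

void context_switch(cpu_t *cpu, context_t *from, const context_t *to)
{
    assert(cpu != NULL && from != NULL && to != NULL);
    /* 1. Save the CPU registers into the context of the running thread. */
    memcpy(from->regs, cpu->regs, sizeof from->regs);
    /* 2. Load the CPU registers from the context of the thread to run. */
    memcpy(cpu->regs, to->regs, sizeof cpu->regs);
    /* 3. The return address is one of the restored registers, so returning
     *    from this function resumes the new thread (models "jr $ra"). */
    cpu->pc = cpu->regs[REG_RA];
}
```

Because the return address travels with the saved registers, the function is entered by one thread and returns into another.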
             Implementation
◈ CPU Idle Loop

                        • All idle CPUs enter the
                        same idle loop.




                                                    91
        Experimental setup
• Review several types of scheduler
  – Symmetric Multiprocessor (SMP)
    • Unique scheduler shared by all processors and
      protected by a lock
    • The threads can run on any processor, and
      migrate
  – Centralized Non SMP (NON_SMP_CS)
    • Unique scheduler shared by all processors and
      protected by a lock
    • Every thread is assigned to a given processor
      and can run only on it
  – Distributed Non SMP (NON_SMP_DS)
                                                    92
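
The difference between the two centralized variants comes down to which runnable threads a given processor is allowed to pick. A hypothetical selection routine is sketched below; the types, names, and first-fit policy are assumptions for illustration, and it would be called with the scheduler lock held.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { POLICY_SMP, POLICY_NON_SMP_CS } sched_policy_t;

typedef struct {
    int id;              /* thread identifier */
    int assigned_cpu;    /* processor the thread is bound to (ignored under SMP) */
} thread_t;

/* Return the id of the first runnable thread that processor `cpu` may run,
 * or -1 if there is none. */
int pick_next(sched_policy_t policy, const thread_t *runnable, size_t count, int cpu)
{
    assert(runnable != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* SMP: any thread may run anywhere, so threads can migrate.
         * NON_SMP_CS: a thread runs only on its assigned processor. */
        if (policy == POLICY_SMP || runnable[i].assigned_cpu == cpu)
            return runnable[i].id;
    }
    return -1;
}
```

Under NON_SMP_CS a processor can sit idle while runnable threads exist for other processors, which is the cost of giving up migration.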
       Experimental setup
◈ Motion JPEG application




[Figures: execution times of the MJPEG application; cycles spent in the CPU idle loop]




                                                                               93
     Experimental setup
◈ COMM application

                •Does not exchange
                data between processors.


                • The only resource
                shared here is the bus


                • The application keeps
                the processors at close to
                full utilization.

                                          94
             Conclusion

• The implementation is a bit tricky, but
  quite compact and efficient.

• Experiments have shown that a
  POSIX-compliant SMP kernel allowing
  task migration is an acceptable solution
  in terms of generality, performance, and
  memory footprint for SoC.
                                            95

				