Diopsis Roadmap

					Scalable Software Hardware Architecture Platform
                      for
              Embedded Systems




            SHAPES at DATE 2007


            Pier Stanislao PAOLUCCI
            chief technical officer – ATMEL Roma
  & (part-time) permanent staff researcher – INFN Roma
                      for
             the SHAPES Consortium
                    Project Motivation and Final Objective

    SHAPES Acronym: Scalable Software Hardware Architecture Platform
     for Embedded Systems

    Objective: Develop a prototype of Tiled Scalable HW & SW architecture
     for embedded applications characterized by inherent parallelism

    Experiment: “Small” Tiles (<10 MGate) connected by “short wires”
     weaving a packet-switching on-chip and off-chip network
    The HW architecture should scale to upcoming deep-submicron technologies

    Challenges: how to program a tiled architecture

    Benchmarks
       Multi-loudspeaker, multi-source wave field synthesis
       Multi-microphone voice extraction from noise
       Ultrasound scanners
       Physical modelling of quantum chromodynamics




    January, 2007   Introduction to SHAPES                                 2   2
                            HW
HW Objectives
   maintain profitable average selling prices
   control NRE by IP reuse
HW Solution
   appropriate granularity: “Small” Tiles (<10 MGate) connected by
     “short (first neighbours) wires”
   Inside the typical elementary Tile:
                 Fully C programmable VLIW DSP for computing +
                 RISC for control +
                 Distributed Network Processor (a kind of generalized inter-tile DMA
                  controller) for inter-tile communication
     
          multi-tile silicon area: >40 mm², <90 mm²
          management of logic & place & route complexity through IP reuse
          multi-level network
                 Intra-tile: multi-layer bus matrix
                 Inter-tile: NoC (intra-chip) + 3DT (inter-chip)
          distributed routing fabric connects on-chip and off-chip tiles weaving
           a packet switching network

                   SW

        Communication centric, real-time aware programming
         environment
           Application description: model based with explicit
            annotation of real-time constraints
           Provide automated optimized binding of processes to
            computing resources and binding of inter-process
            communication on communication resources +
            scheduling of processes and their communication
           Provide automated generation of hardware dependent
            software support
           Retargetable compilation managing intra-tile and inter-
            tile parallelism, bandwidth and latencies
           Fast simulation
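
The binding step listed above can be pictured with a toy example. This is only a sketch and not the SHAPES tool-chain: the process names, loads, channel traffic and cost function are hypothetical, and exhaustive search stands in for whatever optimisation the real mapping tool applies.

```python
from itertools import product

# Hypothetical workload: compute load per process and traffic per channel.
loads = {"A": 4, "B": 3, "C": 5}
traffic = {("A", "B"): 2, ("B", "C"): 6}
tiles = [0, 1]  # two tiles; traffic crossing tiles is penalised


def cost(binding, inter_tile_penalty=1):
    """Cost = load of the busiest tile + penalised inter-tile traffic."""
    tile_load = {t: 0 for t in tiles}
    for proc, tile in binding.items():
        tile_load[tile] += loads[proc]
    comm = sum(w for (src, dst), w in traffic.items()
               if binding[src] != binding[dst])
    return max(tile_load.values()) + inter_tile_penalty * comm


def best_binding():
    """Try every assignment of processes to tiles and keep the cheapest."""
    procs = sorted(loads)
    candidates = (dict(zip(procs, choice))
                  for choice in product(tiles, repeat=len(procs)))
    return min(candidates, key=cost)


binding = best_binding()
print(binding, cost(binding))
```

With these made-up numbers the search keeps B and C on the same tile, because their channel carries the most traffic, and places A on the other one.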


                 Consortium Composition and
                 Roles of the Partners

System SW
ETH Zurich - Distributed Operation Layer: manages application parallelism
TIMA Lab and THALES - Hardware dependent Software Layer and RTOS
TARGET Compiler Tech. - Retargetable Compilers
RWTH Aachen Univ. – Fast Simulation of Heterogeneous Multi Proc. Systems

System HW
ATMEL Roma - Tile:
   Evolution of (Diopsis®: mAgicV VLIW DSP™ + RISC) + INFN DNP™
INFN Roma - DNP™ Distributed Network Processor + 3D Toroidal Eng.:
   Evolution of APE Massively Parallel Processors
STMicroelectronics + Univ. of Cagliari and Pisa – Network on Chip:
   Evolution of Spidergon™ Packet Switching Network on Chip

Parallel Application benchmarking
Fraunhofer IDMT – multi-loudspeaker Audio Wave Field Synthesis
ESAOTE, MedCom, Fraunhofer IGD - Ultrasound scanner
INFN - Physical Modelling
ATMEL – multi-microphone arrays for voice-extraction

                         Deep Sub-micron Architectures…

      ~160 MGate available on a 100 mm2 chip (45nm CMOS, 2008)
      Increasing GATES/CHIP → Design Complexity Management:
          embedded processors use a few million gates only, IP reuse possible and needed;
      WIRING threatens Moore’s law:
          Wiring delay increases on new CMOS silicon generations
          The full chip cannot be reached in a single clock cycle
          Classic monolithic processor architectures do not scale
          Locally Synchronous, Globally Asynchronous needed
          Communication Centric SW and HW Architecture needed
      PROPOSED SOLUTION: TILED ARCHITECTURE. By a simple geometric argument,
       if the logic complexity inside each tile is kept constant, then the
       length of intra-tile wires scales down with the tile itself, and the
       inter-tile wires (on-chip and off-chip) stay short, reaching first
       neighbours only.
      The quest is for the best tile and the best on-chip and off-chip
       interconnect. But how to program it? An explicit parallel programming
       paradigm and culture are needed.
      POWER DISSIPATION density approaches prohibitive values if higher clock
       speeds are used; operations/Watt are much better at moderate clock plus
       parallelism (the parallel architecture of the human brain does an
       excellent job at 50 Hz!... room for improvement)
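
As a rough worked example, the gate density quoted on this slide can be combined with the multi-tile silicon area (40 to 90 mm²) and the tile complexity estimate (4230 equivalent Kgate, DNP excluded) given elsewhere in this deck. The sketch assumes that the 45 nm density applies to the whole multi-tile area and ignores the DNP, the NoC and the pads, so the result is only an upper bound.

```python
# Figures quoted in this deck.
GATES_PER_CHIP_MGATE = 160.0   # ~160 MGate on a 100 mm2 chip (45 nm CMOS)
CHIP_AREA_MM2 = 100.0
TILE_KGATE = 4230.0            # tile estimate, DNP gate count excluded


def tiles_that_fit(area_mm2):
    """Upper bound on tiles fitting in area_mm2 (assumption: uniform density)."""
    density_kgate_per_mm2 = GATES_PER_CHIP_MGATE * 1000.0 / CHIP_AREA_MM2
    return int(area_mm2 * density_kgate_per_mm2 // TILE_KGATE)


for area in (40, 90):
    print(area, "mm2 ->", tiles_that_fit(area), "tiles at most")
```

Under these assumptions the bound is 15 tiles at 40 mm² and 34 tiles at 90 mm².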




                Distributed Network Processor
                DNP: a generalized DMA controller for inter-tile or
                intra-tile packet routing
                DNP interfaces:
                      BUS Slave (to receive commands from RISC & DSP)
                      BUS Master (to read from intra-tile memories)
                      BUS Master (simultaneous intra-tile memory write)
                      NoC (to forward/receive inter-tile ON-CHIP packets)
                      3DT X+, X-, Y+, Y-, Z+, Z- (to forward/receive inter-tile OFF-CHIP packets)
                      Collective communication
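
The six 3DT ports correspond to the first neighbours of a node on a three-dimensional torus. A minimal sketch of that addressing follows; the function name, the coordinate convention and the 4x4x4 lattice size are hypothetical and are not taken from the DNP design.

```python
def torus_neighbours(coord, dims):
    """Return the six first neighbours of a node on a 3D torus.

    coord is (x, y, z) and dims is (nx, ny, nz). Indices wrap around, so a
    node on one edge of a dimension is adjacent to the node on the other edge.
    """
    x, y, z = coord
    nx, ny, nz = dims
    return {
        "X+": ((x + 1) % nx, y, z),
        "X-": ((x - 1) % nx, y, z),
        "Y+": (x, (y + 1) % ny, z),
        "Y-": (x, (y - 1) % ny, z),
        "Z+": (x, y, (z + 1) % nz),
        "Z-": (x, y, (z - 1) % nz),
    }


# Hypothetical 4x4x4 lattice: the corner node wraps around in X- and Z+.
print(torus_neighbours((0, 0, 3), (4, 4, 4)))
```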

  Different Types of Tiles

  [Figure: three tile variants. In each one the processing elements, the DXM
   (labelled "DXM Mem Bus") and the POT (labelled "POT Pads") sit on a
   Multi-Layer BUS together with a DNP that carries the 3DT and NoC ports.]
     RDT: RISC + DSP Elementary Tile (RISC, DSP, DXM, POT, DNP)
     RET: RISC Elementary Tile (RISC, DXM, POT, DNP)
     DET: DSP Elementary Tile (DSP, DXM, POT, DNP)
                             mAgicV IP Architecture
                             (Fully C programmable
                             Gigaflops VLIW DSP)

 [Figure: mAgicV block diagram. Recoverable blocks:]
    External interfaces: DBG, IRQ IN, IRQ OUT, RST, CLOCKS, AHB MST, AHB SLV
    2-port, 8Kx128-bit VLIW Program Memory (DPM)
    VLIW Decompressor
    Flow Controller, VLIW Decoder: Program Counter, Condition Generation,
     Status Register, Instruction Decoder
    AHB Master: DMA Engine
    AHB Slave, e.g. DMA Target
    8R+8W 128x40 Data Register File System
    10-float ops/cycle
    4-address/cycle Multiple DSP Address Generation Unit
    16 multi-field Address Register File
    6-access/cycle Data Memory System, 2x8Kx40 (DDM)

January, 2007         WP 1.6 - RISC+ VLIW DSP + DNP Tile                                                   1010
                           Tile Complexity estimated through
                           Synthesis & Place & Route trials


     mAgicV DSP:
        915 Kgates + 1 Mbit Prog Mem + 640 Kbit Data Mem

     ARM926 & peripherals
        <2 equivalent Mgate (including 640 Kbit mem)

     Tile Complexity 
        4230 equivalent Kgate + DNP gate count
                   including on chip memories
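
The memory sizes quoted for the mAgicV DSP follow from the organisations given on the mAgicV architecture slide (8Kx128-bit program memory, 2x8Kx40 data memory). A quick check, assuming that "8K" means 8192 words:

```python
K = 1024  # binary kilo, assumed for the "8K" word counts

dpm_bits = 8 * K * 128       # 2-port, 8Kx128-bit VLIW program memory (DPM)
ddm_bits = 2 * 8 * K * 40    # 2x8Kx40 data memory system (DDM)

print(dpm_bits // K, "Kbit program memory")  # 1024 Kbit, i.e. 1 Mbit
print(ddm_bits // K, "Kbit data memory")     # 640 Kbit
```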




                      Silicon Floorplan Trial of
                      RISC + mAgicV VLIW DSP Tile

 [Figure: floorplan. Labelled regions: DSP Reg File; DSP Data Mem (DDM);
  DSP Logic; DSP Prog Mem (DPM); AMBA Multilayer; Peripherals; ARM RDM; ARM926.]

                Spidergon NoC topology

 • It’s a family of regular/symmetric topologies

 • We look for a complexity/performance trade-off
    • Low degree (router cost)
    • Low number of links (wire cost)
    • Symmetry (homogeneous building blocks; simple routing)
    • Low diameter (performance)
    • Good scalability (small network size granularity)
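
The trade-off listed above can be checked numerically. The slide does not spell out the link structure, so the sketch below assumes the commonly described Spidergon layout: an even number of nodes on a ring, each with a clockwise, a counter-clockwise and an "across" link. The function names are illustrative only.

```python
from collections import deque


def spidergon(n):
    """Adjacency of an n-node Spidergon (n even): ring links plus one across link."""
    if n % 2:
        raise ValueError("n must be even")
    return {i: [(i + 1) % n, (i - 1) % n, (i + n // 2) % n] for i in range(n)}


def diameter(adj):
    """Longest shortest path, found by breadth-first search from every node."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


net = spidergon(16)
links = sum(len(v) for v in net.values()) // 2  # each link is counted twice
print("degree 3, links:", links, "diameter:", diameter(net))
```

For 16 nodes this gives 24 links and a diameter of 4 hops, with every router having degree 3.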




                 Background: APENext (2005), a 2048-processor system with
                 VLIW processors designed by INFN and manufactured by ATMEL




                   SW challenges from
                   Tiled Architectures
     Facilitate expression of parallelism: e.g. Network of Actors
     Express real-time constraints in a formal manner, a feature missing in
      classical languages. This is a key cultural point!!!
     Avoid destroying information about available algorithm parallelism
     Compilation chain must be fully aware of key architectural parameters:
      bandwidth, computational power, pipeline and latencies
     Exploit memory locality – efficient management of Distributed
      Memories – get rid of classical caches
     Manage Long delays between distant tiles
     Reduce Hot Spots in communications
     Reduce Tiled RTOS overhead (time and memory footprint)
     Introduce Hardware dependent Software and Hardware Abstraction
      Layers
     Capture scalability in a library of characterized SW/HW components
     Support for (semi)-automation of iterative design over HW, SW, Appl
     Monitor quality and real-time constraints on real HW and Simulators
     Simulation speed of multi-tiled architectures


                                                   SW Architecture

 [Figure: SHAPES software tool-chain. Recoverable elements:]
    Inputs: application specs; hardware platform specification
    Distributed Operation Layer
    Simulator
    Labels on the connections: trace information; mapping information;
     component interaction, properties and constraints
    Model Compiler: component source code
    Mapping
    HdS Generator: HdS source code; Memory mapping
    Compiler: component binary; glue binary; HdS binary
    RTOS: OS services binary
    Link; Dispatch
    Optimised compilation on tiles and comms network

                Distributed Operation Layer –
                Application Specification

Two parts:
 Application structure @system level (.xml schema definition available)
    processes (e.g. A, B, C)
    FIFO SW channels between processes
    interconnection between processes
 Behavior of each process
    process’ internals (.c … .c)
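
The two-part specification can be pictured as a small network of processes connected by software FIFO channels. The sketch below is only an illustration of that idea and is not the DOL API or its XML schema: the `Fifo` class, the process functions and the run-to-completion scheduling are all hypothetical.

```python
from collections import deque


class Fifo:
    """Unbounded software FIFO channel between two processes."""

    def __init__(self):
        self._items = deque()

    def write(self, item):
        self._items.append(item)

    def read(self):
        return self._items.popleft()

    def empty(self):
        return not self._items


# Behaviour of each process (the part that would live in the .c files).
def process_a(out_ch):
    for sample in range(4):
        out_ch.write(sample)


def process_b(in_ch, out_ch):
    while not in_ch.empty():
        out_ch.write(in_ch.read() * 2)


def process_c(in_ch):
    results = []
    while not in_ch.empty():
        results.append(in_ch.read())
    return results


# Application structure: A -> B -> C, connected by two FIFO channels.
ab, bc = Fifo(), Fifo()
process_a(ab)
process_b(ab, bc)
print(process_c(bc))
```

Keeping the structure (who talks to whom) separate from the behaviour (what each process computes) is what leaves a mapping tool free to place A, B and C on different tiles.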


                 Virtual SHAPES Platform (VSP)

      Enable early software development
      Explore different tile configurations
      Binary compatible with the SHAPES hardware
      Debugging capability
      Export performance information
      Scalability to multiple tiles

 [Figure: software stack used by the SHAPES SW and app partners:
  Applications, DOL, HdS and RTOS, shown on top of the VSP and the HW.]

                VSP-DOL interfacing




                 TARGET Compiler

 Core related requirements
    Phase coupling: reg. allocation, SW pipelining
    Support of VLIW instruction compaction
    Support of predicated execution
    Functional unit assignment for clustered VLIWs
 Communication related requirements
    Communication latency aware scheduling
    Intra-tile multi-core on-chip debugging
    Inter-tile communication using DNP

 [Figures: array of TILEs with OFF-CHIP MEM; tile detail (ARM uP, uP MEM,
  mAgicV DSP, DSP PROG MEM, DSP DATA MEM, REG FILE, COMM I/F); mAgicV core
  datapath (PCU with instruction decoder, decompaction, program memory,
  instruction sequencer and interrupt controller; register files RF0 and RF1;
  FP/I Mul, Add, Cadd, Min/Max, Conv, Div and Sh/Log units).]

                         TIMA - HdS & RTOS - Principles

     Hardware dependent Software: software directly
      dependent on the underlying hardware

     Communication differentiation
        Intra-subsystem & inter-subsystem communications



     Networked operating system:
        [Figure: two software stacks, listed top to bottom]
        ARM Subsystem (HW): Application / HdS API / HdS (RTOS (RT Linux), COMM) / HAL ARM
        DSP Subsystem (HW): Application / HdS API / HdS (Monitor, COMM) / HAL DSP

                                  SW Architecture

 Inputs: application specification; hardware platform specification

 Tool                     Partner(s)      Work package(s)    Nearby labels in the figure
 simulation environment   RWTH            WP 1.4             trace information
 model compiler           ETHZ, RWTH      WP 1.11, WP 1.4    component source code
 mapping                  ETHZ            WP 1.11            mapping information
 HdS generator            TIMA            WP 1.10            HdS source code; Memory mapping
 RTOS                     TIMA, THALES    WP 1.10            OS services binary
 Compiler                 TARGET          WP 1.9             component binary; glue binary; HdS binary
 Link, Dispatch           TARGET          WP 1.9

 Also labelled in the figure: component interaction, properties and constraints

                    SHAPES SW Architecture: challenges

    High-level exploration, mapping, and simulation:
       What is the degree of available parallelism? How can it be exposed to
        the mapping stage? What is a suitable model-based specification
        formalism? What adaptations are necessary in order to expose the
        inherent parallelism?
       Define a common Profiling Trace Interface (PTI) over which
        information can be exchanged.
    Hardware-dependent software and operating system:
       To use the provided features of the HdS (i.e. platform abstraction) a
        generic interface API has to be defined.
    Compiler technology:
       Modeling low-latency communication interfaces in the C source code
        that is the input for the C compiler, for the computational tiles.
       Investigate how HdS can be modeled entirely in C source code, to be
        compiled by the C compiler for the computational tiles.


Tiled HW Architecture
   Communication Centric, not Processor Centric
   Homogeneous SW interface for on-chip and off-chip scalable connection and I/O
   3D first-neighbour Toroidal System Eng. (3DT) for Off-Chip communication
   Virtual tunnelling on packet switching NoC (Network on Chip) and off-chip 3DT
   Parallelism Aware System SW: manage memory distribution, capture real-time constraints
   Explicit parallel programming/Network of Actors

 [Figure: multi-chip system of Tiles, each chip with OFF-CHIP MEM, linked by
  3DT off-chip communication, with an FPGA, ADC/DAC, sensors and actuators;
  16 positions numbered 0 to 15 arranged in a ring around the NoC; tile detail:
  DNP, RISC and DSP on a Multi-Layer BUS with DXM, POT and ADC/DAC.]

                        The tile: DIOPSIS® + DNP

 [Figure: tile block diagram. Recoverable blocks:]
    RISC: RCM Instr Cache, MMU, RCM Data Cache, BIU, ICE, JTAG,
     RDM IF, RDM SRAM, ROM
    DXM: DXM Interface (AHB EBI)
    Multi-layer Bus MATRIX (Master and Slave ports): DSP AHB Master,
     DSP AHB Slave, DNP AHB Master (x2), DNP AHB Slave
    PDMA, Bridge, APB, PERIPHERALS
    mAgicV DSP™: JTAG; mAgicV™ DPM (2-port); 16-port 256x40 Data Regs;
     10-float ops/cycle; Multiple DSP Addr Gen (4-addr/cycle);
     DDM (6-access/cycle)
    DNP: ports X+, X-, Y+, Y-, Z+, Z-, C+; NoC (NI)