									EEM830 Introduction to SoC Design
Lecture 10: Advanced Topics in SoC Design (2)



    Wen-Yen Lin, Ph.D.
    Department of Electrical Engineering
    Chang Gung University
    Email: wylin@mail.cgu.edu.tw
    Nov. 2009
Some Important Subtleties of
Processor-centric SoC design
  Microprocessor pipelining
  Advanced techniques for replacing
  HW with processors
  Special issues in memory-system
  design and synchronization




         Introduction to SoC Design, 98/1, EE/CGU, W.Y. Lin   2
Optimizing Processors to Match HW
 Overcoming differences in Branch
 Architecture
   Zero-delay branches
    PowerPC: special Count register to eliminate the
     branch delay
    Zero-overhead loops
    Dynamic branch predictions
   Code organization
    Organize the code so that common or
     performance-critical sequences require few taken
     branches



Code organization to Optimize Branch
Delays




     Original Control Flow                          Optimized Control Flow
            Source: Chris Rowen, “Engineering the Complex SOC”

Optimizing Processors to Match HW
 Overcoming differences in Branch
 Architecture
   Replace branches with conditional operations




     Traditional implementation (a, b, c, and x reside in a0, a1, a2, and
     a3): 4 or 5 execution cycles, with two bubbles




     Conditional move – two instructions, two cycles



             Source: Chris Rowen, “Engineering the Complex SOC”
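
The code figures for this slide did not survive extraction. As a minimal sketch only, assume the statement being compiled has the form "if a < b then c = x" (a hypothetical choice; the original statement is not recoverable from the text). The branch-free version below selects the result with a mask instead of a taken branch, which is the idea behind a conditional-move instruction. The function names and the 32-bit width are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register width


def select_with_branch(a, b, c, x):
    """Branching form: the assignment is skipped or executed by a branch."""
    if a < b:
        c = x
    return c


def select_without_branch(a, b, c, x):
    """Branch-free form: build an all-ones or all-zeros mask from the
    comparison and use it to pick x or the old value of c."""
    mask = -int(a < b) & MASK32          # 0xFFFFFFFF if a < b, else 0
    return (x & mask) | (c & ~mask & MASK32)
```

Both functions return the same value for unsigned 32-bit inputs; on a real processor the second form maps to a compare followed by a conditional move, with no pipeline bubble from a taken branch.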

Optimizing Processors to Match HW
 Overcoming differences in Branch Architecture
    Combine or pipeline operations to save cycles
     Sometimes a taken-branch bubble is unavoidable




      Define a new operation, op123, so that it can run in parallel with
      the branch target




     Dispatch operations

              Source: Chris Rowen, “Engineering the Complex SOC”

Dispatch Operations




 4 conditional branches               1 dispatch operation based on variable address

                Source: Chris Rowen, “Engineering the Complex SOC”
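
The contrast between a chain of conditional branches and a single dispatch can be sketched in software as a jump table. This is an analogy under stated assumptions, not the hardware mechanism from the figure: the handler names and the four case codes are hypothetical.

```python
def handle_case0():
    return "case0"


def handle_case1():
    return "case1"


def handle_case2():
    return "case2"


def handle_case3():
    return "case3"


def handle_default():
    return "default"


# Table indexed by the case code; the last entry is the fall-through target.
HANDLERS = [handle_case0, handle_case1, handle_case2, handle_case3, handle_default]


def dispatch_with_branches(code):
    """Four conditional branches tested one after another."""
    if code == 0:
        return handle_case0()
    if code == 1:
        return handle_case1()
    if code == 2:
        return handle_case2()
    if code == 3:
        return handle_case3()
    return handle_default()


def dispatch_with_table(code):
    """One indexed jump: the target is computed from the case code."""
    index = code if 0 <= code < 4 else 4
    return HANDLERS[index]()
```

The table form reaches any target in one step regardless of which case is taken, while the branch chain costs more for later cases.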

Optimizing Processors to Match HW
 Overcoming limitations in Memory Access
   The number of data-memory ports may be limited,
   and smaller than the number of memory accesses
   required simultaneously
   Three techniques may apply
    Map small memories into register files
    Code rearrangement to reduce the peak number of
      simultaneous memory references
    Wider memories to increase total bandwidth




Multiple Processor Debug and Trace
  Coordinated debugging in an MP system is critical
     Tasks running in an MP system interact, whether
     intentionally or unintentionally
     H/W support for MP debugging is desirable
      Common debug-access H/W, e.g. JTAG

  Two forms of debugging information in an MP system
    Interactive examination of processor state in a source-level
    debugger.
    However, stopping a task to examine processor state shows
    no history of the computation.
      Execution trace as an alternative.
           Shows the sequence of instruction execution and state
            values




4-processor SOC with JTAG-based on-
chip debug and trace




  5-wire JTAG interface
      Clock (TCK), control (TMS), data-in (TDI), data-out (TDO), and reset (~TRST)
                 Source: Chris Rowen, “Engineering the Complex SOC”
Simulation-based vs. HW-based MP
debugging
  Similar experience but
    HW runs faster, so larger cases can be
    debugged
    HW fully implements interfaces and
    networks
    HW offers lower visibility into details
    than simulators




Multiple Processor Traces
 A trace port may consist of
   Status word
    Status of instruction in the completion stage of
      the pipeline
    Stalls
    Exceptions
    Cycle-by-cycle execution status
   Address word
    Instruction address
    Virtual data address
   Data word
    Computation results being written to register
      file



Dealing with Trace Data
 A high volume of trace data requires high output bandwidth to
 bring it off the SoC
    Example:
    20 (processors in one SoC) x 350 MHz (processor clock) x 10 bytes of
    trace data per cycle = 70 GB/sec
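
The arithmetic in this example only works out if the 10 bytes of trace data are produced per processor clock cycle. A short check, with a hypothetical helper name:

```python
def trace_bandwidth_bytes_per_sec(processors, clock_hz, bytes_per_cycle):
    """Aggregate trace output rate when every processor emits
    bytes_per_cycle of trace data on every clock cycle."""
    return processors * clock_hz * bytes_per_cycle


bandwidth = trace_bandwidth_bytes_per_sec(
    processors=20, clock_hz=350_000_000, bytes_per_cycle=10
)
print(bandwidth / 1e9, "GB/s")  # 70.0 GB/s
```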

 Two ways to handle this huge volume of trace data
    Restrict and compress the trace
       Instruction addresses are sequential for most of execution, so they compress well
       Data traces, exceptions, and jumps are hard to compress
    On-chip trace-capture buffer
       Limited trace-buffer depth
       Capture filtering according to programmer-selected criteria

 Trace data post processing on Host PC
    Decompression
    Data interpretation
    Virtual source-level debugging session if the full trace data is captured
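
Why sequential instruction addresses compress well, and what host-side decompression involves, can be illustrated with a simple run-length scheme. This is a sketch of one possible encoding, not the format used by any particular trace port; the fixed 4-byte instruction size and the function names are assumptions.

```python
INSTR_BYTES = 4  # assumed fixed instruction size, for illustration only


def compress_addresses(addresses):
    """On-chip side: store (start_address, count) for each run of
    sequential instruction addresses instead of every address."""
    runs = []
    for addr in addresses:
        if runs and addr == runs[-1][0] + runs[-1][1] * INSTR_BYTES:
            runs[-1][1] += 1           # address continues the current run
        else:
            runs.append([addr, 1])     # jump or exception starts a new run
    return [tuple(run) for run in runs]


def decompress_addresses(runs):
    """Host-PC side: expand each run back into the full address sequence."""
    addresses = []
    for start, count in runs:
        addresses.extend(start + i * INSTR_BYTES for i in range(count))
    return addresses
```

Straight-line code collapses into a single pair per run, while every jump adds a new pair, which is why control-flow changes (and data values, which have no such regularity) dominate the compressed trace.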




Issues in Memory Systems
 Configurable processors provide an
 opportunity to tune the memory system to the
 requirements

 Performance is often limited by data
 bandwidth, either within computation or in
 data movement

 Memory arrays and complex control
 functions may comprise a major portion of
 an SoC’s silicon area and operating power



Pipelining with Multiple Memory Ports
  Data-intensive applications require memory accesses
  through multiple ports

  Two basic factors affecting application throughput
    Data movement rate
    Overlapping between memory operations and
    computation

  Example:
     On a data stream, data are loaded (two cycles),
     processed by a complex operation (two cycles), and
     sent out with a store operation (two cycles)
     A single combined load.op.store instruction
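
A rough cycle count shows what overlapping buys in this example. The sketch assumes that separate load and store ports let one combined load.op.store instruction issue every cycle once the pipeline is full; that issue rate is an assumption, since the slide gives only the stage latencies.

```python
def cycles_without_overlap(items, load=2, op=2, store=2):
    """Each item finishes its load, operation, and store before the next starts."""
    return items * (load + op + store)


def cycles_with_overlap(items, load=2, op=2, store=2):
    """Fully pipelined: the first result takes the whole latency, then one
    item completes per cycle (assumes one issue per cycle)."""
    if items == 0:
        return 0
    return (load + op + store) + (items - 1)
```

For 100 stream elements this gives 600 cycles without overlap and 105 with it, so throughput is set by the issue rate rather than by the six-cycle latency.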



Pipelining with Multiple Memory Ports




         Source: Chris Rowen, “Engineering the Complex SOC”

Memory Alignment in SIMD instruction
Sets
  Performance degrades when the memory address of a vector
  crosses natural word boundaries in SIMD-type
  applications

  Memory alignment solves the issue: use an alignment
  buffer




             Source: Chris Rowen, “Engineering the Complex SOC”
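
The role of an alignment buffer can be modeled in a few lines. In this sketch the buffer holds the previously read aligned word, so each unaligned vector in a stream costs one new aligned read instead of two. The 4-element vector width, the list-based memory model, and the function names are assumptions for illustration.

```python
LANES = 4  # assumed number of elements per aligned memory word / vector


def read_aligned_word(memory, word_index):
    """The only access the memory supports: one whole aligned word."""
    return memory[word_index * LANES:(word_index + 1) * LANES]


def stream_unaligned_vectors(memory, start, count):
    """Return `count` consecutive vectors starting at element `start`,
    together with the number of aligned reads performed."""
    word, offset = divmod(start, LANES)
    buffer = read_aligned_word(memory, word)    # priming read fills the buffer
    reads = 1
    vectors = []
    for _ in range(count):
        word += 1
        next_word = read_aligned_word(memory, word)
        reads += 1
        # Combine the tail of the buffered word with the head of the new one.
        vectors.append((buffer + next_word)[offset:offset + LANES])
        buffer = next_word                      # keep the new word for the next vector
    return vectors, reads
```

Streaming four vectors from element 3 takes five aligned reads with the buffer, where re-reading both words for every vector would take eight.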

Shared Memory synchronization
 Shared memory synchronization in
 an MP system maintains data
 consistency and correctness
   Memory ordering and locks




        Source: Chris Rowen, “Engineering the Complex SOC”
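
As a software analogy for lock-protected shared memory, the sketch below uses Python threads in place of processors and a `threading.Lock` in place of a memory lock. It illustrates only the mutual-exclusion idea; it does not model hardware memory ordering, and the function name and worker counts are hypothetical.

```python
import threading


def run_locked_counters(workers=4, increments=1000):
    """Several workers update one shared counter; the lock makes each
    read-modify-write step atomic with respect to the others."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:              # acquire, update shared data, release
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter
```

With the lock held around every update, the final count is always workers x increments.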

Shared Memory synchronization
   Interrupt handlers
    Lock and semaphore implementations in memory
     generate excessive memory traffic
      A busy-waiting processor consumes memory
       bandwidth
    Interrupt-driven synchronization incurs less
     memory-bandwidth overhead and provides
     efficient communication
    Shared data still passes through global
     memory, but locking is handled via
     interrupts.
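
The difference from busy waiting can be sketched with a `threading.Event` standing in for the interrupt line: the consumer sleeps until signaled rather than polling a lock word in memory. This is an analogy only, with hypothetical names; a real SoC would use an inter-processor interrupt and a handler.

```python
import threading


def pass_value_with_signal(value):
    """Producer writes shared data, then raises the 'interrupt';
    the consumer blocks without polling until it arrives."""
    shared = {}                          # stands in for global shared memory
    data_ready = threading.Event()       # stands in for the interrupt line
    received = []

    def consumer():
        data_ready.wait()                # no memory traffic while waiting
        received.append(shared["data"])

    thread = threading.Thread(target=consumer)
    thread.start()
    shared["data"] = value               # data still goes through shared memory
    data_ready.set()                     # signal the consumer
    thread.join()
    return received[0]
```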


Shared Memory synchronization
   Interrupt-driven synchronization of
   shared-memory access




         Source: Chris Rowen, “Engineering the Complex SOC”

Shared Memory synchronization
   Hardware synchronization
    Memory-mapped hardware queues
    Push and pop directly handled by
     instructions
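
A behavioral model of such a queue, as a sketch only: a fixed-depth FIFO whose push and pop report whether they succeeded, which is where a real processor would stall or retry. The class name, depth parameter, and return conventions are assumptions, not a description of any specific hardware interface.

```python
from collections import deque


class HardwareQueueModel:
    """Fixed-depth FIFO between a producer and a consumer processor."""

    def __init__(self, depth):
        self.depth = depth
        self._entries = deque()

    def push(self, value):
        """Return False when the queue is full (producer would stall)."""
        if len(self._entries) == self.depth:
            return False
        self._entries.append(value)
        return True

    def pop(self):
        """Return None when the queue is empty (consumer would stall)."""
        if not self._entries:
            return None
        return self._entries.popleft()
```

Because ordering and full/empty status are handled by the queue itself, neither side needs a lock in shared memory to hand data over.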




Power Dissipation Optimization
 With the move to portable devices, power
 dissipation becomes the central issue in an
 SoC’s design

 Influence of configurable processors on low-power
 design
   Software-based design can save more power than
   the best lean logic design
   ASPs reduce the energy required for data-intensive
   computation compared with general-purpose
   RISC processor cores.



Core Power
 Power estimation
   $E = \alpha C V^{2} n$
   where
   α is the average fraction of circuit nodes switching
      between 0s & 1s
   C is the total capacitance of all the switched nodes
   V is the voltage
   n is the number of cycles required to execute

   A good processor configuration sharply reduces n
   while increasing C only slightly, or even reducing it.
   α can also be reduced by activating execution
   units only when necessary.
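
A worked instance of the energy estimate makes the trade-off concrete. All numbers below are hypothetical and chosen only to show the shape of the result: a configuration that cuts the cycle count n to a quarter while raising C by 20 percent, at the same α and V.

```python
def switching_energy(alpha, capacitance, voltage, cycles):
    """E = alpha * C * V^2 * n, in whatever consistent units the inputs use."""
    return alpha * capacitance * voltage ** 2 * cycles


# Hypothetical baseline core versus a configured core.
baseline = switching_energy(alpha=0.1, capacitance=1.0, voltage=1.0, cycles=1_000_000)
configured = switching_energy(alpha=0.1, capacitance=1.2, voltage=1.0, cycles=250_000)
energy_ratio = configured / baseline   # about 0.3
```

Under these assumed numbers the configured core uses roughly 30 percent of the baseline energy, because the reduction in n outweighs the increase in C.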



Memory Power
 Memory width, depth, organization and application-
 reference patterns all have significant influence on a
 processor’s power dissipation

 Some observations on power analysis
   Core power dominates total power
   RAM power dissipation depends more on memory
   access width
   Keep instruction memory as narrow as possible,
   because instruction memory is heavily used
   Data reads consume more than half of data-memory
   power
   Leakage current becomes more critical in deep-submicron
   and nanometer SoC designs




								