					    Larrabee

     Eric Jogerst
Cortlandt Schoonover
     Francis Tan
                   Larrabee
• Intel’s new approach to
  a GPU
• Considered to be a
  hybrid between a multi-
  core CPU and a GPU
• Combines functions of a
  multi-core CPU with the
  functions of a GPU

FETCH
                       Fetch
• Utilizes a hardware prefetcher
• Supports four threads of execution
  – Separate register files for each thread
  – Switches threads in order to cover cases where
    the compiler is unable to schedule code without
    stalls or if the prefetcher has not received new
    instructions
  – Inactive thread data is written to the core’s local
    L2 cache
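
The thread switching described on this slide can be illustrated with a toy model. This is a minimal sketch, assuming a plain round-robin choice among ready threads; the slides do not state the actual selection policy, and the name `pick_next_thread` and the `ready` flags are hypothetical.

```python
from typing import List, Optional

NUM_THREADS = 4  # the slide states four threads of execution per core


def pick_next_thread(ready: List[bool], last: int) -> Optional[int]:
    """Return the next ready thread after `last`, in round-robin order.

    ready[i] is True when thread i has an instruction that can issue,
    i.e. it is not stalled and the prefetcher has delivered its code.
    Returns None when every thread is stalled.
    (Round-robin is an assumption made for illustration only.)
    """
    for offset in range(1, NUM_THREADS + 1):
        candidate = (last + offset) % NUM_THREADS
        if ready[candidate]:
            return candidate
    return None


# Thread 0 issued last; thread 1 is stalled, so the core moves on to thread 2.
print(pick_next_thread([True, False, True, True], last=0))  # prints 2
```

With four threads per core, a stall in one thread only costs issue slots if the other three are stalled as well.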

PIPELINE ORGANIZATION
                    Pipeline
• Pipeline derived from the dual-issue Pentium
  processor, which uses a 5-stage design
  – Short, inexpensive execution pipeline
• Pairing rules for primary and secondary
  instruction pipes are deterministic
  – Allows compilers to perform offline analysis with a
    wide scope
                        Pipeline
• All instructions can be issued on the primary pipeline
   – Minimizes the combinatorial problems for a compiler
• Secondary pipeline can execute a large subset of the
  x86 instruction set
   – Secondary pipeline is small and cheap
   – Power wasted by failing to dual-issue on every cycle is
     minimal
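
Because the pairing rules are deterministic, a compiler can decide ahead of time which instructions will dual-issue. The following is a minimal sketch of such an offline check. The opcode set `SECONDARY_OK`, the `Instr` record, and the dependency rule are all hypothetical stand-ins; the slides do not list Larrabee's real pairing rules.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical subset of opcodes the secondary pipe accepts (illustration only).
SECONDARY_OK: FrozenSet[str] = frozenset({"load", "store", "add", "sub", "branch"})


@dataclass(frozen=True)
class Instr:
    op: str                 # opcode name
    dst: str                # destination register
    srcs: Tuple[str, ...]   # source registers


def can_pair(first: Instr, second: Instr) -> bool:
    """Static check: may `second` issue in the secondary pipe next to `first`?"""
    if second.op not in SECONDARY_OK:
        return False
    # Assumed rule: `second` must not read or overwrite what `first` writes.
    return first.dst not in second.srcs and first.dst != second.dst


def schedule(program: List[Instr]) -> List[Tuple[Instr, ...]]:
    """Greedy in-order grouping into issue slots of one or two instructions."""
    groups: List[Tuple[Instr, ...]] = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            groups.append((program[i], program[i + 1]))
            i += 2
        else:
            groups.append((program[i],))  # any instruction fits the primary pipe
            i += 1
    return groups


prog = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("add", "r4", ("r1", "r5")),   # depends on r1, cannot pair with the first add
    Instr("vmul", "v1", ("v2", "v3")),  # not in SECONDARY_OK, so it takes the primary pipe
    Instr("load", "r6", ("r7",)),       # independent, pairs with vmul
]
print([len(g) for g in schedule(prog)])  # prints [1, 1, 2]
```

The same input always yields the same grouping, which is what lets the analysis run offline.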
                   Pipeline
• Each core has its own pipeline
  – Based upon the 5-stage Pentium
  – Dual-issues instructions
  – In-order execution
• Pipeline is shared between threads
  – Hardware can switch between threads that have
    instructions ready to execute
                   Pipeline
• Software rendering pipeline is designed to
  minimize the number of locks and other
  synchronization events
• Graphics rendering pipeline is written with
  high-level languages and tools
  – Enables developers to add innovative rendering
    capabilities

SIMD ORGANIZATION
         Vector Processor Unit
• 16-wide vector processing unit (VPU)
  – Executes integer, single-precision float, and
    double-precision float instructions
  – VPU and its registers occupy approximately
    one-third of the processor core’s area
• Tradeoff
  – Wider VPUs increase computational density
  – Wider VPUs are harder to keep highly utilized
         Vector Processor Unit
• VPU instructions can be predicated by a mask
  register
• Mask controls which parts of a vector register
  or memory location are written and which are
  left untouched
• Advantages
  – Reduces branch misprediction penalties
  – Gives instruction scheduler greater freedom
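
Mask predication can be emulated in a few lines. This is a minimal sketch, assuming a 16-bit integer mask with one bit per lane; `masked_write` is a hypothetical helper written for illustration, not a Larrabee instruction. It turns a per-element `if x < 0: x = -x` branch into a single predicated write.

```python
from typing import List, Sequence

VECTOR_WIDTH = 16  # matches the 16-wide VPU described above


def masked_write(dest: Sequence[float], result: Sequence[float], mask: int) -> List[float]:
    """Return `dest` with lane i replaced by result[i] wherever mask bit i is 1.

    Lanes whose mask bit is 0 are left untouched.
    """
    assert len(dest) == len(result) == VECTOR_WIDTH
    return [
        result[lane] if (mask >> lane) & 1 else dest[lane]
        for lane in range(VECTOR_WIDTH)
    ]


x = [float(i - 8) for i in range(VECTOR_WIDTH)]      # -8.0 .. 7.0
negated = [-v for v in x]                            # computed for every lane
mask = sum(1 << lane for lane, v in enumerate(x) if v < 0)
abs_x = masked_write(x, negated, mask)               # only negative lanes are overwritten

print(hex(mask))  # prints 0xff (lanes 0-7 hold the negative values)
```

No branch is taken per element, so there is nothing to mispredict: every lane computes the result and the mask decides which lanes keep it.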
              Number of Cores
• Many-core processor
  – Planned to have 24 to 48
    cores

SYSTEM ON-CHIP COMPONENTS
    System On-Chip Components
• x86 CPU cores: dual-issue, in-order
  processors that support the x86 instruction set
  with Larrabee extensions. Each core is connected
  to the ring network and has a high-bandwidth
  connection to its adjacent L2 cache subset.
    System On-Chip Components
• L2 Cache subsets
  – High-bandwidth access to the adjacent CPU core
  – Connected directly to the ring network
  – Coherent cache that uses the ring network to
    check coherency when allocating new cache lines
   System On-Chip Components
• Ring Network Nodes
  – Simple bi-directional routers with a 512-bit data
    path in each direction (1024 bits in total)
  – Organized in rings of 8-16 cores and other devices
  – Interconnected with other rings
  – All data moved between cores and
    fixed-function units passes through the ring network
    System On-Chip Components
• Fixed function logic components
  – Provide rasterization, interpolation, and other
    commonly needed functions
  – Directly connected to the ring network
  – Will be spread among the cores to provide lower
    latency and load balancing on the ring network
   System On-Chip Components
• Memory & I/O interface
  – Provides and manages communication between
    the ring network and off-chip devices
  – Manages initial routing and tasking of cores

MEMORY HIERARCHY
Memory Interface

ON-CHIP INTERCONNECT
         On-Chip Interconnect
• Ring interconnect bus
• Similar to the ring interconnect of the Cell
  processor (Sony, Toshiba, and IBM)
Ring Bus
             Ring Bus Features
• Bi-directional
• 512 bits in each direction
• Presumably running at core speed
• Each element can accept data from one direction
  on odd clock cycles and from the other direction
  on even clock cycles
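
The ring features listed above can be modelled in a few lines. This is a minimal sketch under stated assumptions: the node count of 8 in the example, the direction names, and the choice of which direction is served on odd cycles are all illustrative, and `ring_route` and `accepts_from` are hypothetical helpers.

```python
from typing import Tuple


def ring_route(src: int, dst: int, nodes: int) -> Tuple[str, int]:
    """Return (direction, hops) for the shorter way around a bi-directional ring."""
    clockwise_hops = (dst - src) % nodes
    counter_hops = (src - dst) % nodes
    if clockwise_hops <= counter_hops:
        return ("clockwise", clockwise_hops)
    return ("counterclockwise", counter_hops)


def accepts_from(cycle: int) -> str:
    """Direction a node accepts from on a given clock cycle.

    One direction on odd cycles, the other on even cycles; which direction
    gets the odd cycles is an assumption made here for illustration.
    """
    return "clockwise" if cycle % 2 == 1 else "counterclockwise"


print(ring_route(0, 3, nodes=8))  # prints ('clockwise', 3)
print(ring_route(1, 7, nodes=8))  # prints ('counterclockwise', 2)
print(accepts_from(5))            # prints clockwise
```

Alternating the accepted direction by clock parity means a node never has to arbitrate between two arrivals in the same cycle.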
        Ring Bus Comparisons
• Compared to AMD’s R600/RV670 bus, it is half
  the bit-width.
• The higher clock speed of Larrabee’s bus
  should make up for the difference in
  bandwidth.
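
The claim that clock speed can offset width follows from peak bandwidth being width times clock rate. This is a minimal sketch of that arithmetic; the clock frequencies are placeholders chosen only to show the break-even point and are not measured figures for either product.

```python
def peak_bandwidth_bytes(width_bits: int, clock_hz: int) -> int:
    """Peak bytes per second moved by a bus of the given width and clock."""
    return width_bits * clock_hz // 8


# Placeholder clocks (assumptions): a bus half as wide needs twice the clock
# rate to match the peak bandwidth of the wider bus.
wide_bus = peak_bandwidth_bytes(1024, 1_000_000_000)
narrow_bus = peak_bandwidth_bytes(512, 2_000_000_000)
print(wide_bus == narrow_bus)  # prints True
```

At anything less than double the clock rate, the narrower bus has lower peak bandwidth.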
        Ring Bus Tradeoff Analysis

Pros:
• Straightforward, not complex
• Able to deliver high bandwidth
• Great performance if memory clients need high
  bandwidth

Cons:
• Waste of chip area if most applications don’t
  need high memory bandwidth
• That area could be spent elsewhere to increase
  performance in a different way

MULTITHREADING ORGANIZATION
      Multithreading Organization
• Superscalar
• In-order
• Four threads of execution
• Dual-issue (with a vector processing unit)
        Comparison to Out-of-Order Execution
# CPU cores:         2 out-of-order   10 in-order
Instruction issue:   4 per clock      2 per clock
VPU per core:        4-wide SSE       16-wide
L2 cache size:       4 MB             4 MB
Single-stream:       4 per clock      2 per clock
Vector throughput:   8 per clock      160 per clock
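
The vector throughput row in the comparison above is the core count multiplied by the VPU width. This is a minimal sketch of that arithmetic, assuming each core issues one full-width vector instruction per clock; `vector_throughput` is a hypothetical helper name.

```python
def vector_throughput(cores: int, vpu_width: int) -> int:
    """Vector operations per clock, assuming one vector issue per core per clock."""
    return cores * vpu_width


out_of_order = vector_throughput(cores=2, vpu_width=4)    # 2 cores, 4-wide SSE
in_order = vector_throughput(cores=10, vpu_width=16)      # 10 cores, 16-wide VPU

print(out_of_order, in_order, in_order // out_of_order)   # prints 8 160 20
```

The in-order design gives up half the single-stream issue rate (2 per clock instead of 4) in exchange for 20 times the peak vector throughput.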
        Larrabee Vector Processor
            Scheduling Policy
• Software Controlled
• Software-controlled scheduling makes it more
  flexible than a typical GPU
   Software-Controlled Scheduling
Pros:
• Flexible: can choose the scheduler to suit the
  application.
• Worst case won’t be so bad (compared to a
  hardware-encoded scheduling policy).

Cons:
• Overhead of the scheduler takes a bite out of
  performance.
• Programmer overhead of selecting the correct
  scheduler.
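
The flexibility argument can be made concrete with a pluggable scheduler. This is a minimal sketch, not Larrabee's actual runtime: the two policies, the class names, and the task names are hypothetical and only show how an application might swap schedulers without touching the rest of its code.

```python
import heapq
from collections import deque
from typing import List, Optional, Tuple


class FifoScheduler:
    """Runs tasks in submission order."""

    def __init__(self) -> None:
        self._queue: deque = deque()

    def add(self, task: str, priority: int = 0) -> None:
        self._queue.append(task)  # priority is ignored by this policy

    def next(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None


class PriorityScheduler:
    """Runs the task with the lowest priority number first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._count = 0  # tie-breaker that keeps submission order stable

    def add(self, task: str, priority: int = 0) -> None:
        heapq.heappush(self._heap, (priority, self._count, task))
        self._count += 1

    def next(self) -> Optional[str]:
        return heapq.heappop(self._heap)[2] if self._heap else None


def run(scheduler, tasks: List[Tuple[str, int]]) -> List[str]:
    """Submit all tasks, then drain the scheduler and return the run order."""
    for name, priority in tasks:
        scheduler.add(name, priority)
    order = []
    task = scheduler.next()
    while task is not None:
        order.append(task)
        task = scheduler.next()
    return order


tasks = [("shade", 2), ("rasterize", 1), ("setup", 0)]
print(run(FifoScheduler(), tasks))      # prints ['shade', 'rasterize', 'setup']
print(run(PriorityScheduler(), tasks))  # prints ['setup', 'rasterize', 'shade']
```

Both costs listed under Cons show up here: each `next()` call is scheduling work done in software, and someone has to pick the policy that suits the workload.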
                   Criticism
• NVIDIA
  – “like a GPU from 2006”
  – Unrealistic performance projections
  – Motivated by interest to retain market share
             Possible Market
• DreamWorks Animation
• Xbox / PlayStation
• Scientific research
Questions?

				