					EPIC 64-bit Architecture:
The Itanium and Itanium 2
           CSCE 380
5/10/2000, 12/11/2003, 5/7/2004,
      5/11/2005, 5/4/2006
             Draft




Disclaimer
   This slide show is intended to explain the
    EPIC architecture in a general way. It is not
    intended to give a precise description of the
    architecture.
    EPIC
   Explicitly Parallel Instruction Computing
   Earlier known as IA-64 Architecture
   Originally it was jointly defined by Intel and
    Hewlett-Packard, but it now appears to be an
    Intel product. HP produces major computers
    using the chip.
   64 bit
   It is designed to be used in clusters of up to
    128 processors (or more). Machines with over
    1900 processors and with 4048 processors exist.
   It will run binary IA-32 programs.


State-of-affairs:
Typical Computers
   Compilers typically write sequential,
    in-order code.
   Advanced CISC chips use lots of hardware
    logic to try to execute the code in parallel.
   Advanced CISC chips use lots of hardware
    logic to try to execute the code “out-of-
    order.”
   RISC chips are not designed properly
    for parallel execution

EPIC: Some design principles
   Intel’s conclusion: Let the compiler
    produce parallel, out-of-order code.
    This simplifies chip logic. The space
    saved can be used for things like more
    registers.
ftp://download.intel.com/design/Itanium/manuals/24531705.pdf
                                          page 123

(1) Registers
   General registers: 128 (64 bit) registers for
    integer computations
   Each register has a NaT bit (Not a Thing – deferred
    speculative exceptions)
   GR0-GR31 – Static general registers
   GR0 = 0
   GR32-GR127 – Stacked general registers
    Some can be renamed for loop acceleration
(2) Registers




   Floating point registers: 128 (82-bit)
   FR0-FR31 – Static floating point registers
   FR0 = 0.0
   FR1 = 1.0
   FR32-FR127 – Rotating floating-point
    registers

   Instruction pointer: Holds the address of
    the bundle which contains the current
    instruction

(3) Registers
   Predicate registers: 64 (1-bit) registers.
   Hold the results of compare functions
   PR0-PR15 – Static Predicate registers
   PR0 = 1
   PR16-PR63 – Rotating predicate registers

   Branch registers: 8 (64-bit) registers
   Hold branching information
(4) Registers
   Current frame marker: Used to keep track
    of stack frames in the general registers

   Application registers: Special-purpose data
    and control registers for special processor
    functions

   Kernel registers: 8 (64-bit) registers
   The kernel can write info that applications can then use
 (5) Registers
 Register stack configuration register:
 Controls the operations of the Register Stack
  Engine
 RSE Backing Store Pointer:
 Stores the location in memory which is the save
  location for GR32.
 Loop count Register:
 Used as a loop counter

 Epilog Count Register:
 Used for counting the final stages in loops


EPIC: Explicitly Parallel
Instruction Computing
   Has resources for parallel execution
    • Many registers
    • Many functional units
    • Inherently scalable
 Explicit parallelism
 Features
    • Predication
    • Speculation
    • …

 VLIW:
    Very Long Instruction Word
   EPIC packs 3 instructions into its 128 bit
    long instruction.
   Compiler specifies parallelism
127                                                          0
    Instruction 2   Instruction 1   Instruction 0   Template

   The three instructions together
    are called a bundle

   The assembly language looks more
    traditional and does not show the bundles.
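
A sketch of how the bundle fields fit into 128 bits, assuming the documented Itanium layout of a 5-bit template in the low bits followed by three 41-bit instruction slots. The names pack_bundle and unpack_bundle are hypothetical helpers for illustration only, and a Python integer stands in for the 128-bit word.

```python
TEMPLATE_BITS = 5   # assumed width of the template field (bits 0-4)
SLOT_BITS = 41      # assumed width of each instruction slot


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        # Slot 0 sits just above the template, slot 2 in the highest bits.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for index in range(3)
    )
    return template, slots
```

The widths add up to 5 + 3 × 41 = 128 bits, which is why exactly three instructions fit in a bundle.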
Basic instructions
Up to 3 in a bundle
   Basic Itanium instruction:
    [qp]mnemonic[.comp] dest = srcs
   qp: optional predicate register. The result is
    committed only if qp = 1
   mnemonic: unique instruction name
   comp: optional variation for the instruction
   dest: destination of the result
   srcs: one or more sources
    (Intro to programming, page 1:132)
Instruction Groups
 Instruction groups: groups of instructions that do
  not have RAW or WAW register dependencies
 Depending on the machine, 1, 2, …, all of the
  instructions in the group can be issued in parallel
 Hence: (Normally) instructions in a single
  instruction group cannot have
    • RAW or Read after Write dependencies:
      One instruction cannot read a register written by another
      instruction in the same group.
    • WAW or Write after Write dependencies:
      two instructions cannot write to the same register.

      (Software Developer’s Manual, page 1:133)


Instruction groups: Example 1
ld8 r1 = [r5] ;; // first group
add r3 = r1, r4  // second group (RAW dependency on r1)
;; is a stop: the end of an instruction group
Stops may appear inside or at the end of a
   bundle. There can be several bundles in a
   group

ld8 is a load 8 bytes from memory
st8 is a store 8 bytes to memory
Instruction groups: Example 2
ld8    r1 = [r5]
sub    r6 = r8, r9 ;;     // first group
add    r3 = r1, r4        // read after write: r1 is written in the first group
st8    [r6] = r12 ;;      // second group
add    r3 = 1, r6 ;;      // third group; write after write: r3 is written in the second group
Goals: 1. Put as many instructions in a group as
 possible to allow as much parallelism as possible.
       2. Load in advance so as to avoid waiting
 for memory.
(Adapted from Software Developer’s Manual, page 1:133)
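
A minimal sketch of how a compiler might place stops, assuming a simple greedy rule: start a new group whenever an instruction reads (RAW) or writes (WAW) a register already written in the current group. The function split_into_groups and its tuple format are hypothetical; a real compiler also reorders instructions and considers memory and predicate dependencies.

```python
def split_into_groups(instructions):
    """Greedily split instructions into groups free of RAW/WAW register hazards.

    Each instruction is (text, destination register or None, source registers).
    """
    groups = []
    current = []
    written = set()          # registers written so far in the current group
    for text, dest, sources in instructions:
        raw = any(src in written for src in sources)
        waw = dest is not None and dest in written
        if current and (raw or waw):
            groups.append(current)       # emit a stop (;;) here
            current, written = [], set()
        current.append(text)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


# The instruction sequence of Example 2.
example2 = [
    ("ld8 r1 = [r5]", "r1", ["r5"]),
    ("sub r6 = r8, r9", "r6", ["r8", "r9"]),
    ("add r3 = r1, r4", "r3", ["r1", "r4"]),
    ("st8 [r6] = r12", None, ["r6", "r12"]),
    ("add r3 = 1, r6", "r3", ["r6"]),
]
for number, group in enumerate(split_into_groups(example2), start=1):
    print(number, group)
```

With this rule the sequence falls into the same three groups shown in Example 2.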

Instruction groups:
   Using pseudo code, break the following into
    instruction groups:
      a = b - c;
      d = e + f + g;
      h = a * i;
      j = d;

      load  r1 = b
      load  r2 = c
      load  r3 = e
      load  r4 = f
      load  r5 = g
      load  r6 = i ;;
      sub   r7 = r1, r2       // r7 = b - c
      add   r8 = r3, r4 ;;    // r8 = e + f
      store a = r7
      add   r9 = r8, r5       // r9 = e + f + g
      mul   r10 = r7, r6 ;;   // r10 = a * i
      store d = r9
      store h = r10
      store j = r9 ;;

   Hopefully, additional instructions can be added to
    the groups before forming bundles




Instruction Groups
   These memory dependencies are allowed:
    •   RAW
    •   WAW
    •   WAR
    •   ALAT
 A value read in a group does not affect the value
  written
 Ordering of instructions in a group determines the
  final value




Instructions
   Standard            add r1=r2,r3
   Predicated         (p4) add r1=r2,r3
    (Converted to a no-op if p4 is false)
   With immediate     add r2=r3,1
   With completer     cmp.eq p3=r2,r4
    (.eq is the instruction option)
   Memory operations:
                       ld r1=[r4]        (load)
                       st [r4]=r1        (store)




Minimizing Memory Latency
   Latency – Time wasted while waiting for
    memory
   Problem: In normal CPUs, jumps make it
    difficult to schedule loads in advance so
    CPU stalls at cache misses

Jumps
   Example:
      if (a > b)
          x = y;
      else
           x = z;
   Pentium Pro type solution:
     CPU predicts a branch,
     Starts execution,
     Cycles are wasted if the prediction is
     wrong.


Jumps
   Intel claims that 5 to 10% mispredicts can cause a
    30-40% performance cut! How could this be?
   Assume instructions normally can be processed in
    1 cycle
   Assume 30% of instructions are branches
   Assume 10% of branches mispredicted
   So 3% of instructions are mispredicted branches
   Assume 13 cycles lost for each mispredicted
    branch
   For every 100 instructions (100 useful cycles), 3 x 13 = 39
    cycles are lost, so 39 of 139 or 28% of cycles are wasted
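
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the slide's simplified model (one cycle per instruction, a fixed penalty per mispredicted branch); the function name wasted_fraction is hypothetical.

```python
def wasted_fraction(branch_share, mispredict_rate, penalty_cycles, instructions=100):
    """Fraction of all cycles lost to mispredicted branches in the simple model."""
    mispredicted = instructions * branch_share * mispredict_rate
    lost_cycles = mispredicted * penalty_cycles
    # Useful cycles equal the instruction count because each takes one cycle.
    return lost_cycles / (instructions + lost_cycles)


# The slide's assumptions: 30% branches, 10% mispredicted, 13 cycles lost each.
print(f"{wasted_fraction(0.30, 0.10, 13):.1%}")
```

Varying the penalty or the misprediction rate shows how quickly the wasted share grows on deeply pipelined chips.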
Itanium solution: Predication
 Source code:
     if (a > b)
         x = y;
     else
          x = z;
 EPIC solution. Both branches are executed, but
  results are stored only if the associated predicate is
  true (using pseudo code)
 p1 = a>b          p2 = not(a>b) ...
  (p1) load y      (p2) load z       ...
  (p1) store x     (p2) store x      ...
 There are 64 predicate registers such as p1 and p2.
Predication

 Sample C code
    if (a)
       b = c + d;
    if (e)
       h = i - j;

 Pseudo code for Itanium using predication
  - No branching is needed.
  cmp.ne p1, p2 = a, r0 // p1 = (a != 0)
  cmp.ne p3, p4 = e, r0 // p3 = (e != 0)
  (p1)add b = c, d       // if a != 0 then add
  (p3)sub h = i, j       // if e != 0 then subtract
 Note: r0 is always 0
   (Software Developer’s Manual, page 1:135)
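
A small Python model of the predicated sequence above, offered as a sketch only. Both arithmetic operations are always computed, and the predicate merely decides whether the result is committed. The function run_predicated is hypothetical, and Python's conditional expression stands in for the hardware's commit logic.

```python
def run_predicated(a, e, c, d, i, j, b=0, h=0):
    """Model: compute both results, commit each only if its predicate is true."""
    p1 = a != 0            # cmp.ne p1, p2 = a, r0
    p3 = e != 0            # cmp.ne p3, p4 = e, r0
    add_result = c + d     # (p1) add: executed regardless of p1
    sub_result = i - j     # (p3) sub: executed regardless of p3
    b = add_result if p1 else b    # commit only when p1 is 1
    h = sub_result if p3 else h    # commit only when p3 is 1
    return b, h
```

Because no control flow depends on a or e, there is nothing for the processor to mispredict.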




EPIC memory loads
   Values are loaded as far in advance as
    possible
   Code can verify variables are loaded before
    actual use

Jumps and loads
 In an effort to reduce the latency problem, the
  compiler will try to load data as far in advance
  as possible. The code might be modified as
  follows (using pseudo code):
  XXXXX              XXXXX           load r10, y
  XXXXX              XXXXX           load r11, z
  XXXXX              XXXXX           XXXXX
  p1 = a>b           p2 = not(a>b) XXXXX
  (p1) store x, r10 (p2) store x, r11 XXXXX
 Preloading is called hoisting loads
Register Rotation



   Consider   for (i = 0; i < n; i++)
                   b[i] = a[i] +1;
 Traditional compilers might code the inner loop
  statement as
          load ax, a[i]
          inc    ax                 // ax is used in every iteration
          store b[i], ax
 Even if multiple execution units are available, the
  loop is executed sequentially because register ax is
  specified.




 Register Rotation
    In loop structures, register rotation and
     renumbering allows a compiler to specify
     one register but in reality multiple registers
     are being used.
     Cycle 1     Cycle 2      Cycle 3
r32 A[0]       r32 A[1]     r32 A[2]     load r32, a[i]

               r34 A[0]+1   r34 A[1]+1   add r34=r33, 1

                            r35 A[0]+1   store b[i], r35


Register Rotation and Software
Pipelining
   The EPIC compiler can specify that registers
    should be used on a rotational basis.
      for (i = 0; i < n; i++)
          b[i] = a[i] + 1;
    load r32, a[i]   value is rotated into r33
    add r34 = r33,1 values rotated into r34 and r35
    store b[i] = r35 values rotated into r35 and r36
   Predicate registers and prolog and epilog
    counters are used to start and stop loops
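
A rough simulation of the rotating-register loop, assuming a value written to r32 is visible as r33 one cycle later, and a value written to r34 is visible as r35 one cycle later. The function pipelined_increment is hypothetical; the three if conditions play the role of the stage predicates that switch stages on during the prolog and off during the epilog.

```python
from collections import deque


def pipelined_increment(a):
    """Compute b[i] = a[i] + 1 with a three-stage software pipeline."""
    n = len(a)
    b = [None] * n
    regs = deque([None] * 4)          # models r32, r33, r34, r35
    for cycle in range(n + 2):        # n kernel cycles plus 2 epilog cycles
        # The three stages work on three different iterations in the same cycle.
        if cycle >= 2:
            b[cycle - 2] = regs[3]    # store b[i] = r35
        if 1 <= cycle <= n:
            regs[2] = regs[1] + 1     # add r34 = r33, 1
        if cycle < n:
            regs[0] = a[cycle]        # load r32 = a[i]
        regs.rotate(1)                # rotation: r32 -> r33, r34 -> r35
    return b
```

No stage reads a register written in the same cycle, so on real hardware the load, add, and store could issue in parallel.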




Loop count and epilog registers
   Loop control requires branches and
    overhead in standard computers.
   EPIC provides loop count and epilog
    registers which together with the predicate
    registers allow taking care of the overhead
    in loop setup and cleanup without branches




Procedure calls
   Procedure calls are highly desirable
    but …
   In traditional computers, procedure calls
    require push and pops (memory operations)
        to handle parameters
    and
        to save and restore registers
   Memory is slow
   Procedure calls are slow


Procedure calls
 96 integer registers can be used like a stack
 Calling and called procedure can share some
  registers
 If the register stack overflows, registers are sent
  to memory in the background
 Register saves are unneeded

 Suppose procedure A calls procedure B(x,y)

 Integer register stack (diagram):
     Proc A's registers end with param x and param y
     Proc B's registers begin with param x and param y
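
A simplified model of the overlapping register frames, as a sketch only. It assumes the callee's frame starts at the caller's first output register, so parameters are passed without any copying. The class RegisterStack and its methods are hypothetical, and spilling to memory on overflow is not modeled.

```python
class RegisterStack:
    """Toy model of the 96 stacked registers; offsets are relative to r32."""

    def __init__(self, size=96):
        self.physical = [0] * size
        self.base = 0            # physical index that the current r32 maps to
        self.saved_bases = []

    def write(self, offset, value):
        self.physical[self.base + offset] = value

    def read(self, offset):
        return self.physical[self.base + offset]

    def call(self, first_output):
        """Enter a callee: its r32 becomes the caller's first output register."""
        self.saved_bases.append(self.base)
        self.base += first_output

    def ret(self):
        """Return to the caller: its original frame becomes visible again."""
        self.base = self.saved_bases.pop()


stack = RegisterStack()
stack.write(0, 111)      # Proc A: a local value in r32
stack.write(2, 7)        # Proc A: param x in its first output register (r34)
stack.write(3, 5)        # Proc A: param y in r35
stack.call(2)            # Proc B now sees x and y as its r32 and r33
total = stack.read(0) + stack.read(1)
stack.ret()              # Proc A's local in r32 was never saved or restored
```

The caller's locals survive the call untouched because the callee never addresses them, which is the sense in which register saves are unneeded.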




Comments

   Itaniums use a flat 64 bit addressing space
   They normally store little endian but can support
    big endian operating systems




Floating point
 Allows 32 bit, 64 bit, or 80 bit floating point values
 Registers are 82 bits long
 Uses software for division
 Square roots use looping techniques


MMX semantic equivalence used
when emulating an IA-32 chip
   Integer registers can be treated as eight 8
    bit, four 16 bit, or two 32 bit registers
   Floating point registers can be treated as
    two 32 bit registers
   This allows one instruction to process
    multiple data values (SIMD)
   Provides MMX semantic equivalence
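
A sketch of what treating a 64 bit integer register as eight 8 bit values means, assuming wraparound (modular) arithmetic in each lane. The function padd8 is a hypothetical illustration written in plain Python, not an Itanium or MMX instruction.

```python
def padd8(x, y):
    """Add two 64-bit values as eight independent 8-bit lanes (wraparound)."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        lane_sum = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)
        # Mask to 8 bits so a carry never spills into the neighbouring lane.
        result |= (lane_sum & 0xFF) << shift
    return result
```

Hardware performs all eight lane additions at once; the loop here only makes the lane boundaries explicit.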




First implementation

   Itanium was first released in 2001 after
    years of development and testing.
   Initial applications: servers and
    high-powered workstations




Review of EPIC
   64 bit data path and registers.
   Complexity of the logic needed for out-of-
    order and speculative execution is removed
    from the chip.
   Order of calculation is up to the compiler
   On board chip space can be used for
    registers.




Review of EPIC
   Intel plans to continue development and
    production of 32 bit processors (IA-32)
   Itanium 2 was released in 2002

Implementations
   Itanium versions:
    • Speeds: 733 MHz, 800 MHz
    • Cache: L3: 2MB, 4MB, L2: 96KB, L1: 32 KB
   Itanium 2 versions (Other versions available):
     All have L2 cache: 256KB, L1 cache: 32KB
    • For multi-processor and dual processor applications
      Speeds: 1.66 GHz
      Cache: L3: 9MB
    • For dual processor applications
      Speed: 1.6 GHz (for servers and workstations)
      L3 cache: 3MB
    • Lower power high density dual processor applications
      Speed: 1.3 GHz (low power)
      L3 cache: 3 MB
References:
   http://developer.Intel.com/design/ia64/index.htm
    (IA 64 home page) *
   http://intel.broadcast.com/intel/idf98/keynote1.htm
    (Real video) *
   http://www.intel.com/design/idf/archive/feb98/index.htm
    (Multiple media) *
   http://developer.intel.com/design/ia-64/architecture.htm *
   http://developer.intel.com/vtune/cbts/ia64/index.htm
    (tutorials) In particular, "Introducing the
    IA-64 Architecture" *
   *checked 5/10/00 but these links no longer work
References:
   http://www.intel.com/products/processor/itanium2/index.htm
   http://www.intel.com/design/Itanium/manuals/iiasdmanual.htm
   ftp://download.intel.com/design/Itanium/manuals/24531705.pdf
    (Intel Itanium Architecture Software Developer’s Manual,
    January 2006)
   http://www.intel.com/design/itanium2/manuals/251110.htm
    (Intel Itanium 2 Processor Reference Manual for Software
    Development and Optimization)
   http://www.intel.com/business/bss/products/server/itanium2/demo/index.htm?iid=ipp_srvr_proc_itanium2+epic_animation&
    (Simplistic video) **
Pacific Northwest National
Laboratories EMSL Supercomputer
   Built by HP
   Nearly 2000 1.5 GHz Itanium 2 processors, 2 per node
   Uses Linux
   Nov. 2003: Fifth fastest unclassified computer in the
    world (16th fastest Nov. 2004) (40th fastest Nov. 2005)
   Speed: 11.8 teraflops theoretical; rating based on 8.63
    teraflops performance running Linpack (solving dense
    linear equations)
   One-half petabyte of disk space
   Quadrics QSNet 2 interconnect that enables the
    processors to communicate in less than three
    microseconds.


Pacific Northwest National
Laboratories EMSL Supercomputer




http://mscf.emsl.pnl.gov/about/managers_report_2002.shtml

Research Users and Uses

   Users: 40 Universities and 8 government
    research labs
   Example uses:
    •   chemical transformations for catalyst design
    •    the physics and chemistry of biochips
    •    the synthesis and reactivity of nanomaterials
    •   the dynamics of damaged DNA
    •   a new approach to atmospheric global climate
        models.
   Source: http://mscf.emsl.pnl.gov/rank5.shtml


Pacific Northwest National
Laboratories EMSL Supercomputer
http://www.emsl.pnl.gov/new/spotlight/spotlight06.shtml
http://mscf.emsl.pnl.gov/rank5.shtml
http://mscf.emsl.pnl.gov/about/managers_report_2002.shtml
http://www.emsl.pnl.gov/proj/nwlinux/system_details.html
   (NWLinux description) *
http://mscf.emsl.pnl.gov/hardware/config_mpp2.shtml
   (describes the computer)
http://www.emsl.pnl.gov/using-emsl/tour/lab.php?facility=msc&lab=vr1119
   (Virtual tour)
http://mscf.emsl.pnl.gov/docs/training/mscfphase1_phase2overview.ppt
   (PowerPoint slide show)

Another PNNL computer
   128 processor Linux box from SGI
   223rd fastest unclassified computer (Nov. 2003)
   Online: Sept. 2003
   Altix 3000 system
   Used in:
    •   Fundamental science
    •   Environmental quality
    •   Energy resources
    •   National security
   Source: http://mscf.emsl.pnl.gov/rank5.shtml
Other supercomputers
    Nov. 2004, 5th fastest computer (11th fastest Nov. 2005):
    Thunder (California Digital Linux
     Cluster) Lawrence Livermore National
     Lab, built by California Digital Corp.
    4048 1.4 GHz Itanium 2 processors, 4 per
     node
    22,038 Gflops
    Sources:
     http://www.llnl.gov/computing/hpc/resources/OCF_resources.html - thunder
    http://www.top500.org/lists/2004/11/


Other supercomputers
 Fastest computer Nov. 2005:
 BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440)
  built by IBM for NNSA/Lawrence Livermore National
  Laboratory
 131072 0.7 GHz PowerPC 440 processors
 280600 Gflops
 “dedicated to exploring the frontiers in supercomputing: in
  computer architecture, in the software required to program
  and control massively parallel systems, and in the use of
  computation to advance our understanding of important
  biological processes such as protein folding”
 Prototype for NNSA/Lawrence Livermore National Lab
 2nd fastest - same type of machine (41000 processors)
 Reference: http://www.research.ibm.com/bluegene/




CSCE 380

   Fall 1998, Spring 1999, Spring 2000, Fall
    2003, Spring 2004, Spring 2005, Spring
    2006
   James Brink

    5 of the 100 fastest unclassified computers
    (Nov. 2005) are built from Itaniums
    http://www.top500.org/lists/2005/11/basic

				