PowerPoint Presentation - NCSA

Reconfigurable Application Specific Computing

Presented by:
Steve Modica
RASC Product Manager
Silicon Graphics, Inc.
SGI Proprietary
Altix 350

[Figure: two Itanium2 processors share a 6.4 GB/s front side bus to the SHUB. The SHUB connects to 4 channels of DDR SDRAM (PC2100 or PC2700), 10.8 to 12.8 GB/s; to 2 NUMAlink4 channels, 12.8 GB/s; and to the PIC, which provides 4 slots on 2 PCI-X busses (2 GB/s). Base I/O includes Ethernet and a SCSI disk.]
SGI Proprietary              |    9/20/2004     |   Page 2
SGI Altix™ 3700 Bx2 Platform Introduction:
CR-Brick - Components
[Figure: the IP57 node board (Node 0) carries two Intel Madison 9M processors and twelve ddr1 SDRAM DIMMs around the Shub1.2 ASIC. The processors connect to the ASIC at 6.4 GB/s. The ASIC provides two NL4 network ports (6.4 GB/s full duplex each) and an I/O port (2.4 GB/s full duplex). A CR-brick contains four such node boards joined by two routers, with NL4 and I/O connections.]

SGI Confidential
Slide 3
SGI Altix™ 3700 Bx2 Platform Introduction:
Building Blocks
• Itanium® 2 CR-brick: CPU and memory
• M-brick: memory
• R-brick: router interconnect
• IX-brick: base I/O module
• PA-brick, PX-brick: PCI-X expansion
• D-brick2: disk expansion
• SGI® Advanced Linux Environment with SGI ProPack

SGI Altix™ 3700 Bx2 Platform Introduction:
System Topology Example

[Figure: example system topology with two router planes, Router Plane 1 and Router Plane 2]
Reconfigurable Application Specific Computing
Accelerating Interaction

Speed up interactive analysis and modeling
     –     CPUs are often the bottleneck in computations
     –     Goal is to insert faster elements
Style 1: Traditional FPGAs
     –     Work with traditional FPGAs in PCI / PCI-X slots
                 •   Nallatech, Clearspeed, Annapolis Micro et al
     –     Development environments are relatively advanced
                 •   All driving to the same goal of “write in C, run on FPGA”
     –     Leverages other industry efforts
                 •   Cray, PCs, Clusters
Style 2: Tightly coupled
     –     Athena: FPGA + memory for computation at high bandwidth
     –     Daytona: FPGA + spigots for fast network
     –     Both are being prototyped by a few customers

Sidebar: Compute / Specialist Elements / Access to data is critical / Memory bandwidth is the key to success




The 3 Single-Paradigm Architectures



    Scalar              Vector        App-Specific
    Intel Itanium       Cray X1       Graphics - GPU
    SGI MIPS            NEC SX        Signals - DSP
    IBM Power                         Programmable - FPGA
    Sun SPARC                         Other ASICs
    HP PA




Paradigms to Applications
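[Figure: compute intensity (low to high) plotted against data locality (low to high). The application-specific paradigm sits at high compute intensity. Vector (lower data locality) and scalar (higher data locality) sit at low compute intensity.]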

Architectural Challenges


    • Ease of Use
           –      Languages
           –      Compilers
           –      Debuggers
           –      APIs


    • Performance
           – Bandwidth to/from System
           – Scalability




Ease of Use

  •Leverage 3rd Party Std Language Tools
       – Celoxica, Impulse Acceleration, Mitrion, Viva
       – In discussions with other HLL tool vendors

  •Developed an FPGA aware version of GDB
       – Capable of debugging the FPGA and System Software
       – Capable of multiple CPUs and multiple FPGAs

  •Developed RASC Abstraction Layer (RASCAL)

  •Provide for HDL modules
       – Integrated environment with debugger
       – Highest performance




Contrasting ISVs
[Figure: tools arranged from hardware-oriented to software-oriented: HDL, Handel-C, Impulse-C, VIVA, Mitrion C]




 Ease of Use v. Efficiency

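[Figure: efficiency (low to high) plotted against ease of use (easy to difficult). VHDL and Verilog sit at high efficiency toward the difficult end. The other points on the chart are unlabeled.]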

ISV Features

• Handel-C
      –     Runs on Windows only
      –     Plans to port to Linux in June of 2005
      –     Most efficient procedural language
• Starbridge VIVA
      –     Extremely easy to learn; graphical and object-oriented
      –     Develop on Windows only, execute anywhere.
      –     Easiest language to program, creates very efficient cores
      –     Large library of packaged algorithm primitives
• Mitrion C
      –     Runs natively on Altix
      –     Utilizes a processor abstraction
      –     Most useful debugging environment
• Impulse-C
      –     Runs on Windows
      –     Highly optimized for Streaming Applications
      –     Fastest language to port legacy C code



Bitstream Generation using High Level Language Tools

• On the IA-32 Linux machine
      –     HLL design entry (Handel-C, Impulse C, Mitrion C, Viva)
      –     RTL generation and integration with Core Services: .v, .vhd
      –     Design synthesis (Synplify Pro, Amplify): .edf
      –     Design implementation (ISE): .ncd, .pcf, .bin
      –     Metadata processing (Python): .cfg
• On the Altix
      –     Device programming (RASC Abstraction Layer, Device Manager, Device Driver)
• Design verification
      –     Behavioral simulation (VCS, Modelsim) of the .v / .vhd sources
      –     Static timing analysis (ISE Timing Analyzer) of the .ncd / .pcf files
      –     Real-time verification (gdb), with .c source
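The tool flow can also be written down as data. This makes it easy to check that each stage's inputs are produced by an earlier stage. What follows is a minimal sketch. The stage names, tools and file types come from the slide. The exact inputs of each stage are an interpretation of the diagram's arrows, in particular the assumption that metadata processing reads the .v/.vhd sources. `Stage` and `check_flow` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    tools: str
    consumes: FrozenSet[str]  # file types this stage reads
    produces: FrozenSet[str]  # file types this stage writes


# Interpretation of the slide's diagram; the original arrows are approximate.
FLOW: List[Stage] = [
    Stage("RTL generation and core services integration",
          "Handel-C, Impulse C, Mitrion C, Viva",
          frozenset(), frozenset({".v", ".vhd"})),
    Stage("Design synthesis", "Synplify Pro, Amplify",
          frozenset({".v", ".vhd"}), frozenset({".edf"})),
    Stage("Design implementation", "ISE",
          frozenset({".edf"}), frozenset({".ncd", ".pcf", ".bin"})),
    Stage("Metadata processing", "Python",
          frozenset({".v", ".vhd"}), frozenset({".cfg"})),
    Stage("Device programming",
          "RASC Abstraction Layer, Device Manager, Device Driver",
          frozenset({".bin", ".cfg"}), frozenset()),
]


def check_flow(stages: List[Stage]) -> Optional[Tuple[str, List[str]]]:
    """Return (stage name, missing inputs) for the first stage whose inputs
    were not produced by an earlier stage, or None if the order is valid."""
    available: set = set()
    for stage in stages:
        missing = stage.consumes - available
        if missing:
            return stage.name, sorted(missing)
        available |= stage.produces
    return None


if __name__ == "__main__":
    print(check_flow(FLOW))        # None: every input is produced upstream
    print(check_flow(FLOW[::-1]))  # device programming has no .bin/.cfg yet
```

Reversing the list makes device programming come first. In that order it has no .bin or .cfg to load, and `check_flow` reports this.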



Ease of Use

  •Leverage 3rd Party Std Language Tools
       – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva
       – In discussions with other HLL tool vendors

  •Developed an FPGA aware version of GDB
       – Capable of debugging the FPGA and System Software
       – Capable of multiple CPUs and multiple FPGAs

  •Developed RASC Abstraction Layer (RASCAL)

  •Provide for HDL modules
       – Integrated environment with debugger
       – Highest performance




FPGA Aware Debugger


• Based on the open source GNU Debugger (GDB)
• Uses extensions to the current command set
• Can debug the host application and the FPGA
•   Provides notification when the FPGA starts or stops
•   Supplies information on FPGA characteristics
•   Can “single-step” or “run N steps” of the algorithm
•   Can step by HLL source line (one step per line of C source)
•   Dumps data regarding the set of “registers” that are visible
    when the FPGA is active




Optimal Debugging Environment

Algorithm.c, executing on the COP FPGA with the debugger running in real time:

        tmp = a & b;
        d = tmp | c;

(gdb) fpgastep
(gdb) p/x $a
$6 = 0x444433
(gdb) p/x $b
$7 = 0x111122
(gdb) p/x $tmp
$8 = 0x555533
(gdb) fpgastep
(gdb) p/x $tmp
$9 = 0x555533
(gdb) p/x $c
$10 = 0x331222
(gdb) p/x $d
$11 = 0x111022

[Figure: dataflow on the COP FPGA: a and b feed the & operator, producing tmp; tmp and c feed the | operator, producing d]
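The register values in this trace can be checked with ordinary integer arithmetic. Doing so suggests that the printed values of tmp and d correspond to tmp = a | b and d = tmp & c. Those operators are swapped relative to the Algorithm.c listing and the dataflow figure. The sketch below only recomputes both variants from the printed inputs. It does not model the FPGA or the `fpgastep` command.

```python
# Register values printed by the gdb session on the slide.
a = 0x444433
b = 0x111122
c = 0x331222
tmp_reported = 0x555533
d_reported = 0x111022

# Operators as written in Algorithm.c: tmp = a & b; d = tmp | c;
tmp_as_written = a & b
d_as_written = tmp_as_written | c

# Operators swapped: tmp = a | b; d = tmp & c;
tmp_swapped = a | b
d_swapped = tmp_swapped & c

if __name__ == "__main__":
    print(f"as written: tmp={tmp_as_written:#08x} d={d_as_written:#08x}")
    print(f"swapped:    tmp={tmp_swapped:#08x} d={d_swapped:#08x}")
    print(f"reported:   tmp={tmp_reported:#08x} d={d_reported:#08x}")
```

As written, the listing gives tmp=0x000022 and d=0x331222. With the operators swapped it gives tmp=0x555533 and d=0x111022, which matches the trace.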




Ease of Use

  •Leverage 3rd Party Std Language Tools
       – Celoxica, Impulse Acceleration, Mitrion
       – In discussions with other HLL tool vendors

  •Developed an FPGA aware version of GDB
       – Capable of debugging the FPGA and System Software
       – Capable of multiple CPUs and multiple FPGAs

  •Developed RASC Abstraction Layer (RASCAL)

  •Provide for HDL modules
       – Integrated environment with debugger
       – Highest performance




Application Programming Interface Overview


• User space
      –     Application
      –     Debugger (GDB)
      –     Open|Speedshop, Pro|Speedshop
      –     Abstraction Layer Library
      –     Device Manager
      –     Download Utilities
• Linux kernel
      –     Algorithm Device Driver
      –     Download Driver
• Hardware
      –     Co-Processor FPGA (RASC Hardware)



Abstraction Layer: Algorithm API

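The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable wide scaling and deep scaling.

[Figure: wide scaling: the application sends input data to several COPs running the algorithm side by side and collects their output data]

[Figure: deep scaling: input data from the application flows through COPs placed one after another to produce the output data]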
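The two scaling modes can be sketched in software to show how data moves in each. Here a COP is modeled as a plain Python function from a list of input words to a list of output words. This is a hypothetical illustration: `run_wide`, `run_deep` and the example stages are not part of the RASC Abstraction Layer API.

```python
from typing import Callable, List, Sequence

# Hypothetical model: a COP is a function from a list of input words to a
# list of output words. This is not the RASC Abstraction Layer API.
Cop = Callable[[List[int]], List[int]]


def run_wide(cops: Sequence[Cop], data: List[int]) -> List[int]:
    """Wide scaling: split the input across COPs that all run the same
    algorithm, then concatenate their outputs in input order."""
    if not cops:
        raise ValueError("need at least one COP")
    chunk = -(-len(data) // len(cops))  # ceiling division, so no element is dropped
    out: List[int] = []
    for i, cop in enumerate(cops):
        out.extend(cop(data[i * chunk:(i + 1) * chunk]))
    return out


def run_deep(cops: Sequence[Cop], data: List[int]) -> List[int]:
    """Deep scaling: pass the data through COPs in series, each COP
    performing one stage of the algorithm."""
    for cop in cops:
        data = cop(data)
    return data


def invert16(words: List[int]) -> List[int]:
    """Example stage: invert the low 16 bits of each word."""
    return [w ^ 0xFFFF for w in words]


def add_one(words: List[int]) -> List[int]:
    """Example stage: add one to each word."""
    return [w + 1 for w in words]


if __name__ == "__main__":
    data = list(range(10))
    print(run_wide([invert16] * 3, data))
    print(run_deep([invert16, add_one], data))
```

Splitting the input this way is only valid when each chunk can be processed independently. Chaining suits algorithms that break into sequential stages.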




Ease of Use

  •Leverage 3rd Party Std Language Tools
       – Celoxica, Impulse Acceleration, Mitrion
       – In discussions with other HLL tool vendors

  •Developed an FPGA aware version of GDB
       – Capable of debugging the FPGA and System Software
       – Capable of multiple CPUs and multiple FPGAs

  •Developed RASC Abstraction Layer (RASCAL)

  •Provide for HDL modules
       – Integrated environment with debugger
       – Highest performance




Verilog / VHDL Module Support

  •Templates for Verilog
       – Fast start to algorithm coding
  •Templates for VHDL
       – Fast start to algorithm coding
  •Provide a system simulation stub
       – Allows either simulation debug or system debug
  •Provide source code for core services
       – Allows user to modify to meet special needs
  •Extractor tool supports GDB metadata
       – Application and FPGA debugging



Prototype Configuration

[Figure: Altix 350 connected to the MOATB over NUMAlink4]


Performance

    • Direct Connection to NUMAlink4
           6.4 GB/s per connection
    • Fast System Level Reprogramming of FPGA
           FPGA loads at memory speeds
    • Atomic Memory Operations
           Same set as System CPUs
    • Hardware Barriers
           Dynamic Load Balancing
    • Configurations up to 128 NUMA/FPGA Nodes
           Scalability
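The slide pairs atomic memory operations and hardware barriers with dynamic load balancing. A common pattern for combining them uses a shared fetch-and-add counter: each worker claims the next unclaimed task from the counter, and a barrier then marks the point where all workers are done. The sketch below shows that pattern on the CPU side in Python, as an assumption about typical use rather than the RASC interface. A lock stands in for the atomicity the hardware would provide. `FetchAndAdd` and `balance` are hypothetical names.

```python
import threading
from typing import Callable, List, Tuple


class FetchAndAdd:
    """Software stand-in for an atomic fetch-and-add. The lock makes the
    read-modify-write indivisible, the property a hardware atomic memory
    operation provides."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta: int = 1) -> int:
        with self._lock:
            old = self._value
            self._value += delta
            return old


def balance(num_workers: int, num_tasks: int,
            work: Callable[[int], int]) -> Tuple[List[int], List[int]]:
    """Run num_tasks tasks on num_workers threads. Each worker claims the next
    task index from a shared counter, so faster workers take more tasks."""
    counter = FetchAndAdd()
    results: List[int] = [0] * num_tasks
    tasks_per_worker = [0] * num_workers
    barrier = threading.Barrier(num_workers)

    def worker(wid: int) -> None:
        while True:
            i = counter.fetch_add(1)  # claim the next unclaimed task
            if i >= num_tasks:
                break
            results[i] = work(i)
            tasks_per_worker[wid] += 1
        barrier.wait()  # wait here until every worker has run out of tasks

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, tasks_per_worker
```

Tasks are not assigned to workers ahead of time. How many tasks each worker ends up running depends only on how quickly it returns to the counter.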



MOATB Block Diagram

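[Figure: NUMAlink connectors feed the TIO, which connects to the algorithm FPGA over 72-bit SSP paths. The algorithm FPGA drives three 2MB QDR SRAM banks, each with address and control lines and 36-bit read and write data paths. The TIO also connects over 66 MHz PCI to a loader FPGA, which programs the algorithm FPGA through the Select Map programming interface.]

NUMAlink     12.8 GB/s
SSP           6.4 GB/s
QDR SRAM      9.6 GB/s
  3 reads @   1.6 GB/s
  3 writes @  1.6 GB/s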
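The aggregate SSP and QDR SRAM figures in the legend above follow from the per-port rates. The 1.6 GB/s per bank read and write rates come from this slide. The 3.2 GB/s per SSP direction comes from the FPGA Architecture Overview slide. Treating the 6.4 GB/s SSP figure as the sum of both directions is an assumption. Working in integer MB/s avoids floating-point rounding.

```python
# Bandwidth figures labeled on the MOATB and FPGA architecture diagrams, in MB/s.
SSP_PER_DIRECTION_MBPS = 3_200   # 3.2 GB/s each way between SSP and core services
QDR_READ_PER_BANK_MBPS = 1_600   # one read port per bank
QDR_WRITE_PER_BANK_MBPS = 1_600  # one write port per bank
QDR_BANKS = 3

ssp_total_mbps = 2 * SSP_PER_DIRECTION_MBPS
qdr_total_mbps = QDR_BANKS * (QDR_READ_PER_BANK_MBPS + QDR_WRITE_PER_BANK_MBPS)

if __name__ == "__main__":
    print(f"SSP total:      {ssp_total_mbps / 1000:.1f} GB/s")  # 6.4
    print(f"QDR SRAM total: {qdr_total_mbps / 1000:.1f} GB/s")  # 9.6
```

On these labeled numbers, the three SRAM banks together offer more bandwidth than the SSP link to the system.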


 System Configuration

[Figure: the Altix 350 (two Itanium2 processors, SHUB with DDR SDRAM, PIC with PCI-X and base I/O) connected over NUMAlink to the MOATB (TIO, SSP, algorithm FPGA with 2MB QDR SRAM banks SRAM 0, SRAM 1 and SRAM 2, and a loader FPGA on 66 MHz PCI driving the Select Map programming interface)]

MOATB Data Performance


SSP System Interface Performance
       Measured performance of MOATB with the SSP test card bitstream
                  • DMA Read => 2.548 GB/s
                  • DMA Write => 2.607 GB/s

       Measured performance of MOATB with the MBCS bitstream
                  • DMA Read => 1.588 GB/s
                  • DMA Write => 1.589 GB/s
                          Limited by the 1.6 GB/s of the external SSRAMs



MOATB Core Services
       Core Clock Frequency                200MHz
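Comparing the measured DMA rates with the relevant ceilings shows how close each bitstream gets to its limit. The 1.6 GB/s limit for the MBCS bitstream is stated on the slide. Using 3.2 GB/s per SSP direction (from the FPGA Architecture Overview slide) as the ceiling for the SSP test card bitstream is an assumption.

```python
# Measured DMA rates from the slide, in GB/s.
MEASURED = {
    ("SSP test card", "read"): 2.548,
    ("SSP test card", "write"): 2.607,
    ("MBCS", "read"): 1.588,
    ("MBCS", "write"): 1.589,
}

# Assumed ceilings: 3.2 GB/s per SSP direction for the test card bitstream,
# and the 1.6 GB/s external SRAM limit stated on the slide for MBCS.
CEILING = {"SSP test card": 3.2, "MBCS": 1.6}


def efficiency(bitstream: str, direction: str) -> float:
    """Fraction of the assumed ceiling achieved by a measured DMA rate."""
    return MEASURED[(bitstream, direction)] / CEILING[bitstream]


if __name__ == "__main__":
    for bitstream, direction in MEASURED:
        print(f"{bitstream:14s} {direction:5s} {efficiency(bitstream, direction):.1%}")
```

On these assumptions, the MBCS transfers reach about 99% of the SRAM limit. The SSP test card reaches roughly 80% of one SSP direction.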



FPGA Architecture Overview

[Figure: the SSP connects to the Core Services Block at 3.2 GB/s in each direction. The Core Services Block connects to the Algorithm Block, which has read ports 0 to 2 and write ports 0 to 2, one of each for QDR-II SRAM Banks 0, 1 and 2. Each bank supports reads at 1.6 GB/s and writes at 1.6 GB/s.]


Algorithm Block as Submodule


• Algorithm controller interface
      –     alg_clk, do_step, alg_rst
      –     step_flag, alg_done
• Debug interface
      –     debug0 through debug63
• SRAM controller port (one bank shown)
      –     Read: sram_rd_req, sram_rd_gnt, sram_rd_cmd_vld, sram_rd_addr[17:0], sram_rd_data, sram_rd_dvld
      –     Write: sram_wr_req, sram_wr_gnt, sram_wr_addr[17:0], sram_wr_data[63:0], sram_wr_be[7:0], sram_wr_dvld
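The signal widths above are consistent with the memory sizes and rates given elsewhere in the deck. The calculation rests on two assumptions: addresses select 64-bit words rather than bytes, and each port moves one 64-bit word per cycle of the 200 MHz core clock. Under those assumptions, an 18-bit address gives a 2 MiB bank and a 64-bit data path gives 1.6 GB/s per port. Both figures match the 2MB QDR SRAM banks and the 1.6 GB/s read and write rates.

```python
ADDR_BITS = 18               # sram_rd_addr[17:0] / sram_wr_addr[17:0]
DATA_BITS = 64               # sram_wr_data[63:0]
BYTE_ENABLE_BITS = 8         # sram_wr_be[7:0]
CORE_CLOCK_HZ = 200_000_000  # MOATB core services clock, 200 MHz

bytes_per_word = DATA_BITS // 8
# Assumes addresses select 64-bit words, not bytes.
bank_bytes = (2 ** ADDR_BITS) * bytes_per_word
# Assumes one 64-bit word per port per core clock cycle.
port_bytes_per_second = CORE_CLOCK_HZ * bytes_per_word

if __name__ == "__main__":
    print(f"bank size:      {bank_bytes // 2**20} MiB")
    print(f"port bandwidth: {port_bytes_per_second / 1e9:.1f} GB/s")
    print(f"one byte enable per data byte: {BYTE_ENABLE_BITS == bytes_per_word}")
```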


MOATB Sample Application Performance


    Bit Manipulation (Crypto)
            79x 1.5GHz Itanium-2 (single MOATB)
            119x 1.5GHz Itanium-2 (dual MOATB)
    DOD Bit Matrix Multiply Benchmark
            TBDx 1.5GHz Itanium-2 (single MOATB)
    Graphics Edge Detection
            42x 1.5GHz Itanium-2 (single MOATB)
            (DEMO at NAB)
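The two bit manipulation figures show how well a second MOATB scales. This assumes both speedups are measured against the same 1.5 GHz Itanium 2 baseline on the same workload. The calculation below compares the dual-board result with twice the single-board result.

```python
SINGLE_MOATB_SPEEDUP = 79  # bit manipulation, single MOATB
DUAL_MOATB_SPEEDUP = 119   # bit manipulation, dual MOATB

gain_from_second_board = DUAL_MOATB_SPEEDUP / SINGLE_MOATB_SPEEDUP
scaling_efficiency = DUAL_MOATB_SPEEDUP / (2 * SINGLE_MOATB_SPEEDUP)

if __name__ == "__main__":
    print(f"second MOATB multiplies throughput by {gain_from_second_board:.2f}")
    print(f"scaling efficiency over 2 boards: {scaling_efficiency:.0%}")
```

The second board adds about 1.5x, which is roughly 75% of ideal linear scaling.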




Reconfigurable Application Specific Processing


• MOATB Proof of Concept: V2-6000
• Athena Computation Brick: V2-6000
• Abacus Computation Blade: V4 LX200, V4 FX200, Virtex 5
• Daytona Ingest/Egress Blade: V2 Pro 100, V4 FX100, Virtex 5
• System Interface: NL4 / SSP, then NL5 / SSP2
• Systems: Altix 3700/350, BX2, SHUB2, UV

Timeline: 2004 through 2008



Athena Computation Blade

[Figure: NUMAlink connectors feed the TIO, which connects over SSP to the algorithm FPGA (Virtex2 6000 -6) with four 2MB QDR SRAM banks; the TIO also connects over 66 MHz PCI to the loader FPGA]


Abacus Computation Blade

[Figure: two V4LX200 algorithm FPGAs, each connected over SSP to its own TIO with NL4 links and surrounded by SSRAM banks; a loader FPGA attached over PCI programs both FPGAs through Selmap]


    RASC 3U Chassis

[Figure: chassis with blade slots and TPS power supply slots; 5.128” high x 17.39” wide]
Investigations Underway

Additional 3rd Party Partnerships
      – Pull in additional “Best in Industry Features”
      – Help drive openFPGA.org direction
      – Pull in IO and additional scalability features
New High Level Languages
      – Matlab – Working with a RASC partner to add tool as module
        generator
Library Support for Matlab*P
C-Code Improvement Tools
      – FPGA aware Speedshop enhancements
      – Source to source code optimizer targeted at 3rd party tools



