CSCE 612: VLSI System Design
New Platforms for FPGA-Based Heterogeneous Computing

Jason D. Bakos
Heterogeneous Computing: Execution Model

Instructions executed over time:

• Initialization: 49% of code, 0.5% of run time
• "Hot" loop: 1% of code, 99% of run time (offloaded to the co-processor)
• Clean up: 49% of code, 0.5% of run time

                                                   Heterogeneous Computing 2
      Heterogeneous Computing: Performance


• Example:
   – Application requires a week of CPU time
   – One computation consumes 99% of execution time



                   Kernel speedup   Application speedup   Execution time
                               50                    34        5.0 hours
                              100                    50        3.3 hours
                              200                    67        2.5 hours
                              500                    83        2.0 hours
                             1000                    91        1.8 hours
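The table is just Amdahl's law. A minimal Python sketch (not from the slides) reproduces it, assuming the kernel accounts for 99% of a one-week (168-hour) run:

```python
# Amdahl's law sketch for the table above (assumed: 99% kernel fraction,
# one week = 168 hours of CPU time).
def app_speedup(kernel_speedup, kernel_fraction=0.99):
    """Overall speedup when only the kernel is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

week_hours = 7 * 24  # 168 hours of CPU time

for s in (50, 100, 200, 500, 1000):
    overall = app_speedup(s)
    print(f"kernel x{s:4d} -> app x{overall:2.0f}, {week_hours / overall:.1f} hours")
```

Note how the serial 1% bounds the gain: even a 1000x kernel speedup yields only about a 91x overall speedup.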




        Heterogeneous Computing with FPGAs

• Heterogeneous computing with reconfigurable logic, i.e. FPGAs




         Heterogeneous Computing with FPGAs

• Advantage of HPRC:
   –   Cost
        • FPGA card => ~ $15K
        • 128-processor cluster => ~ $150K
                + maintenance + cooling + electricity + recycling


• Challenges:
   –   Programming the FPGA
   –   Identifying kernels
   –   Optimizing accelerator design




Programming FPGAs




          Heterogeneous Computing with GPUs

• Graphics Processing Unit (GPU)
   –   Contains hundreds of small processor cores grouped hierarchically
   –   Has high bandwidth to on-board memory and to host memory
   –   Became “programmable” about two years ago
   –   Gained hardware double precision about one year ago


• Examples: IBM Cell, nVidia GeForce, AMD FireStream

• Advantage over FPGAs:
   – Easier to program
   – Less expensive (gamers drove high volumes, decreasing cost)

• Drawbacks:
   – Can’t necessarily outperform FPGAs for all types of computations
        • Characterizing this is an open research problem




Host-GPU Communication




         Heterogeneous Computing now Mainstream:
                     IBM Roadrunner
•   Los Alamos, fastest computer in the
    world
•   6,480 dual-core AMD Opteron CPUs
•   12,960 IBM PowerXCell 8i processors
•   Each blade contains 2 Opterons and 4
    Cells
•   296 racks

•   1.71 petaflops peak (1.71 × 10^15
    floating-point operations per second)
•   2.35 MW (not including cooling)
     –   Lake Murray hydroelectric plant
         produces ~150 MW (peak)
     –   Lake Murray coal plant (McMeekin
         Station) produces ~300 MW (peak)
     –   Catawba Nuclear Station near Rock
         Hill produces 2258 MW




           FPGAs vs. GPUs: Applications
• FPGAs historically better at:
   – "Symbolic computation"
       • e.g. intrusion detection, sequence alignment, string matching, combinatorial
         algorithms
   – Very deep pipelines/systolic arrays
       • e.g. financial and physics Monte Carlo methods, molecular dynamics


• GPUs historically better at:
   – SIMD-type (data-parallel) floating-point
       • e.g. numerical linear algebra, graphics
   – Programming model is based on massive multi-threading
       • e.g. map threads to the input data
       • Performance drops when threads diverge




                                Barling Bay Talk                        Oct. 28, 2009 10
                                      Our Group

• Past projects:
  – Custom FPGA accelerators and components:
      • computational biology
      • linear algebra


  – Multi-FPGA interconnection networks:
      • interface abstractions
      • adaptive routing algorithms
      • on-chip router designs



• Current projects:
  – Design tools
      •   Dynamic code analysis
      •   Semi-automatic accelerator generation


  – GPU accelerators versus FPGA accelerators for various computations
  – Accelerators for solving large systems of ODEs/PDEs



      State-of-the-Art System Architecture for HC


            Data path from host memory to on-board memory:

            • Host memory <-> CPU: ~25 GB/s
            • CPU <-> X58 chipset: ~25 GB/s (QPI)
            • X58 <-> co-processor: ~8 GB/s (PCIe x16)
            • Co-processor <-> on-board memory: ????? (~150 GB/s for a GeForce 260)

            The first three links live on the host; the last is on the add-in card.

•   In general, a co-processor can achieve 10x – 1000x the computational
    throughput of a CPU
•   But there is a penalty for transferring data between host memory and on-board
    memory
•   An add-in card can have an arbitrary amount of memory bandwidth (using a
    proprietary memory interface)
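To see why the transfer penalty matters, here is a hypothetical back-of-the-envelope sketch. The ~8 GB/s PCIe x16 figure comes from the diagram above; `effective_speedup` and all the example numbers are illustrative assumptions, not measurements:

```python
# Hypothetical sketch: how host<->board transfer time erodes a raw
# co-processor speedup (PCIe x16 assumed at ~8 GB/s, per the slide).
def effective_speedup(raw_speedup, cpu_seconds, bytes_moved, pcie_bytes_per_s=8e9):
    """Speedup after charging PCIe transfer time against the accelerated kernel."""
    accel_time = cpu_seconds / raw_speedup + bytes_moved / pcie_bytes_per_s
    return cpu_seconds / accel_time

# A 10 s CPU kernel with a 100x co-processor, moving 4 GB across the bus:
print(effective_speedup(100, 10.0, 4e9))  # ~16.7x, not 100x
```

With 0.1 s of compute and 0.5 s of transfer, the bus dominates; keeping data resident on the board between kernel invocations is what recovers the raw speedup.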



                Reconfigurable Computing
• GPU cards are designed to maximize host-device bandwidth (16/8-lane
  PCIe) and on-board memory bandwidth

• Current FPGA cards on the market cannot compete, system-wise!
   – None designed specifically for high-performance co-processing
       • Limited to 8-lane PCIe
       • Extremely low on-board memory bandwidth
       • Often include pin-consuming general-purpose features not necessary for co-processing
         (e.g. connectors, DACs, etc.)
   – Most are designed for signal processing, teaching, or logic emulation
   – Extremely expensive relative to GPU cards


• There is currently no FPGA add-in card designed to
  compete with GPU cards




                     FPGA-Based Co-Processing
• FPGA-based co-processing research within the past 3 years…
   –   "FPGA-Based Co-processor for Singular Value Array Reconciliation Tomography"
       Jack Coyne, David Cyganski and R. James Duckworth -- Worcester Polytechnic Institute
   –   "Real-Time Optical Flow Calculations on FPGA and GPU Architectures: A Comparison Study"
       Jeff Chase, Brent Nelson, John Bodily, Zhaoyi Wei and Dah-Jye Lee -- Brigham Young University
   –   "Multiobjective Optimization of FPGA-Based Medical Image Registration"
       Omkar Dandekar, William Plishker, Shuvra Bhattacharyya and Raj Shekhar -- University of Maryland, Baltimore
   –   "Credit Risk Modelling using Hardware Accelerated Monte-Carlo Simulation"
       David Barrie Thomas and Wayne Luk -- Imperial College London
   –   "Sparse Matrix-Vector Multiplication on a Reconfigurable Supercomputer"
       David DuBois, Andrew DuBois, Carolyn Connor and Steve Poole -- Los Alamos National Laboratory
   –   "An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics"
       Martin Herbordt, Francois Kosie and Josh Model -- Boston University
   –   "Accelerating Cosmological Data Analysis with FPGAs"
       Volodymyr Kindratenko and Robert Brunner -- University of Illinois at Urbana-Champaign
   –   "FPGA-based Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy"
       Jason Luu, Keith Redmond, William Chun Yip Lo, Paul Chow, Lothar Lilge and Jonathan Rose -- University of Toronto
       and Ontario Cancer Institute
   –   "A Fine-grained Pipelined Implementation of the LINPACK Benchmark on FPGAs"
       Guiming Wu, Yong Dou, Yuanwu Lei, Jie Zhou and Miao Wang -- National University of Defense Technology, China
   –   "FPGA Implementations of the Interior-Point Algorithm for Linear Programming with Applications to Collision Detection"
       Chih-Hung Wu, Seda Memik and Sanjay Mehrotra
   –   "Real-time Landmine Detection Using FPGA-Based Hidden Markov Models"
       Aaron Curry, Steven Hannah and Christopher Doss
   –   "A Hardware Framework for Fast Generation of Multiple Long-period Random Number Streams"
       Ishaan Dalal and Deian Stefan -- The Cooper Union
   –   "Efficient FPGA Implementation of QR Decomposition Using a Systolic Array Architecture"
       Xiaojun Wang and Miriam Leeser -- Northeastern University




                           Goal
• Need an FPGA-based co-processor card for HPRC
   – PCIe x16 or x8
   – Maximize on-board memory bandwidth


• Target customers are academic researchers and national
  labs




                                 Proposed Design
•   FPGAs contain a significant number of user I/O pins
     – Xilinx V6/V5: 1760 total pins, 1200 user pins
     – Altera Stratix 4: 1932 total pins, 1104 user pins


•   Dedicating most of the user I/O pins to memory interfaces allows for very
    high on-board memory bandwidth

     – Example for Virtex 6 LX 760, 1200 user IO pins:
          •   Micron DDR3 SDRAM MT41J64M16:
                 – Requires 33 pins for each 16-bit bank @ 1600 MT/s
                 – Allows for 36 banks
                 – Yields 36 banks * 2 bytes/bank * 1600 MT/s = 115.2 GB/s

          •   Samsung GDDR5 K4G10325FE:
                – Requires ~49 pins for each 32-bit bank @ 2500 MT/s
                – Allows for 24 banks
                – Yields 24 banks * 4 bytes/bank * 2500 MT/s = 240.0 GB/s
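The pin-budget arithmetic above can be sketched in a few lines. The part numbers and per-bank pin/rate figures are taken from the slide as stated, not re-verified against datasheets:

```python
# Sketch of the slide's pin-budget arithmetic: peak on-board bandwidth
# achievable from a fixed user-I/O pin budget (per-bank figures as stated above).
def board_bandwidth(user_pins, pins_per_bank, bytes_per_bank, transfers_per_s):
    """Return (bank count, peak bytes/s) for a given memory part."""
    banks = user_pins // pins_per_bank
    return banks, banks * bytes_per_bank * transfers_per_s

ddr3 = board_bandwidth(1200, 33, 2, 1600e6)   # Micron DDR3 MT41J64M16
gddr5 = board_bandwidth(1200, 49, 4, 2500e6)  # Samsung GDDR5 K4G10325FE
print(ddr3)   # 36 banks -> 115.2 GB/s
print(gddr5)  # 24 banks -> 240.0 GB/s
```

This is a peak figure: it assumes every user I/O pin can be committed to memory interfaces, which the PCIe and control pins will eat into on a real board.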




                                 Cost Estimate
•   FPGA prices on Digikey are exorbitant:
     –   Altera Stratix-4 GX-230 is $6600/ea on Digikey
     –   Xilinx Virtex-5 LX-330 is $8382/ea on Digikey


•   Less costly sources?

•   PCB cost < $1000/ea (will probably need Rogers boards to support the
    required memory bandwidth)

•   Memory components and miscellaneous parts < $500/board

•   Xilinx/Altera might donate FPGAs for prototypes

•   Develop prototype board over the next year?





				