Multi-Core Processors



      Pedro Trancoso
     Assistant Professor



          CASPER Research Group
           University of Cyprus




Multi-core Processors are everywhere!




Outline

•   The Road to the Multi-Core
•   The Facts
•   The Issues
•   Current Processors and Trends
•   Future Processor Challenges








The Road to the Multi-Core




The Road to the Multi-Core








The Road to the Multi-Core




The Road to the Multi-Core








Outline

•   The Road to the Multi-Core
•   The Facts
•   The Issues
•   Current Processors and Trends
•   Future Processor Challenges




Fact 1: CPU-Memory Speed

      [Figure: CPU vs. memory speed, 1990-2010, showing the growing gap known as the memory wall]

      Overcoming the memory wall:
      • 8KB L1 (Intel 486, 1989)
      • On-board L2 (Pentium Pro, 1995)
      • On-package L2 (Pentium II, 1997)
      • On-die L2 (Pentium III, 1999)

      Source: J. Patterson, “Modern Microprocessors”,
      www.pattosoft.com.au/Articles/ModernProcessors




Fact 2: Performance




      Source: G. Mamon, “X.1 Perspectives numériques” (“Numerical Perspectives”),
      www2.iap.fr/users/gam/M2/CT2/X_1_Perspectives_numeriques.html




Fact 3: Integration




                                                        2008
                                                        Intel six-core Xeon: 1.9 billion transistors




     Source: H. Foll, “5.4 Development and Production of a New Chip”
     www.tf.uni-kiel.de/matwis/amat/elmat_en/kap_5/backbone/r5_4_1.html and wikipedia




Fact 4: Complexity




                                COMPLEXITY




     Source: J. Stokes, “Inside the Xbox 360, Part II: the Xenon CPU”, ARS Technica,
     http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars




Fact 5: Frequency
                                                                      Power Wall




     Source: D. Risley, “A CPU History”, PCMechanic, www.pcmech.com




Fact 6: Power Consumption




       Source: S. Borkar, 1999




Microprocessor Evolution




      [Figure: evolution from the 8086 through the PIII to the P4, with rising
      performance, design complexity, and power consumption]




Microprocessor Evolution




      [Figure: from the P4 to the multi-core: performance through simple,
      power-efficient designs]
Walls


        After the Memory, Power, and Complexity walls comes the
        Programmability wall.

                    Do not attempt this at home!




Outline

•    The Road to the Multi-Core
•    The Facts
•    The Issues
•    Current Processors and Trends
•    Future Processor Challenges




The Issues: Tiling vs. Shared-Resources


      [Diagram: each core (pair) with its own private L2 cache (tiled), versus
      all cores sharing a single L2 cache]

      e.g. Intel Pentium D                                 e.g. Intel Core2 Duo






The Issues: Small vs. Large Number of Cores


      [Diagram: a few complex cores sharing an L2 cache, versus many simple
      cores sharing an L2 cache]

     e.g. Intel/AMD Multi-core                           e.g. NVIDIA/ATI Multi-core


The Issues: Symmetric vs. Asymmetric


      [Diagram: many identical cores sharing an L2 cache (symmetric), versus a
      mix of a few large cores and several small cores sharing an L2 cache
      (asymmetric)]








The Issues: Homogeneous vs.
Heterogeneous


          [Diagram: four identical CPU cores with their L1/L2 caches
          (homogeneous), versus three CPU cores plus a GPU on the same die
          (heterogeneous)]

                                                       e.g. AMD Fusion






Outline

•    The Road to the Multi-Core
•    The Facts
•    The Issues
•    Current Processors and Trends
•    Future Processor Challenges




Multi-Core Today




        Intel Quad-Core
                                     IBM Power7




     AMD Quad-Core
                          Sun Rock




(Fun) Multi-Core Today




(Fun) Multi-Core Today








Stream Feeding / Computing




Stream Computing

         GPUs are "...very powerful scientific
         computers installed in many homes... I
         think exploring that design space and its
         utilization... is one of the most exciting
         areas in computer architecture today."

         (Professor Frederick P. Brooks, Jr. 2004
         ACM/IEEE Eckert-Mauchly Award
         acceptance speech)








Technology: Stream Computing




           SIMD         Vector




Graphics Processors




       Source: http://www.ixbt.com/video3/cuda-1.shtml




General Purpose
not so easy…
     • Graphics Processor Units (GPU) were
       designed for… Graphics!




General Purpose GPU



     NVIDIA CUDA

                                       NVIDIA Fermi Die (3 billion transistors)




          NVIDIA Tesla
                                      NVIDIA Fermi Architecture (512 cores)




 Multi-core Architectures Overview:
 General-Purpose Homogeneous Multi-Cores

 ➡ Small number of identical cores
                                             AMD Quad Core Shanghai
 ➡ Complex cores able to efficiently
   exploit Instruction Level Parallelism
   (ILP)
 ➡ Multi-level hardware managed
   cache memory
 ➡ Typically hardware supports memory
   coherency
 ➡ Shared cache → efficient data
   transfer and synchronization
 ➡ Relatively easy to program using
   POSIX Threads or OpenMP directives
              09/17/09




 Multi-core Architectures Overview:
 IBM Cell/BE Heterogeneous Multi-core
     ➡ Heterogeneous Multi-core architecture (9 cores):
        •   1x general-purpose PowerPC Processor Element (PPE)
        •   8x special-purpose Synergistic Processing Elements (SPEs)
     ➡ Each SPE includes a private unified local memory of 256KB.
     ➡ PPE and SPEs communicate through the Element Interconnect Bus
       (EIB) via DMA data transfers.
     ➡ Software-managed cache → the user is responsible for efficiently
       managing the memory space.
     ➡ Programmed using the Cell/SDK (thread-based) library extensions








 Multi-core Architectures Overview:
 Graphics Processing Unit (GPU)

 ➡ Acceleration unit → heterogeneous system
 ➡ Can be used for general-purpose processing
   (GPGPU)
 ➡ Host and GPU are usually connected through
   a system bus (e.g. PCIe)
 ➡ Includes a large number of very basic
   processing cores (1000s) → highly multi-threaded
 ➡ Data is organized hierarchically in memory,
   and careful management by the user is
   crucial for best efficiency
 ➡ Recent NVIDIA GPUs can be easily
   programmed with CUDA







  Multi-core Architectures Overview:
  Summary



                        Homogeneous            Heterogeneous           Accelerators
#Cores                  10s                    10s                     1000s
Parallelism             MC + OoO               MC + SIMD               SIMT
Memory Access           Memory to cache: HW    Memory to cache: HW     System Mem. to Video Mem.: SW
                        Cache to CPU: HW       Cache to PPE: HW        Video Memory to PEs: HW
                                               Cache to SPEs: SW
Programmer Effort       low ──────────────────────────────────────────→ high








Outline

 •    The Road to the Multi-Core
 •    The Facts
 •    The Issues
 •    Current Processors and Trends
 •    Future Processor Challenges








From Multi- to Many-Core








AMD Roadmap




Intel Teraflop Processor


                                            • 80-core prototype
                                            • 2D-mesh interconnect
                                            • Performance: > 1 Tflops
                                            • Power: < 100W








Future (3)




In this project we aim to address the following relevant challenges: programmability,
reliability, and complexity of design.

… A natural way of handling concurrency efficiently will then be to combine advanced
programming models, such as transactional memory, with a multithreaded dataflow execution
model … introduce specific hardware scheduling units able to manage different levels of
thread granularity, take care of code or data migration based on information passed by the
virtual layer, and consider power, thermal, and fault information




Multi-Core

• Power-Performance Efficient Architecture
• The current use:
     – Multiprogramming / Throughput computing
• The Challenge:
     – Single application parallel execution
     – Programming parallel applications
• Future Challenge:
     – Efficiently utilize a large number of
       homogeneous/heterogeneous cores





Data-Driven Multithreading

     • Dataflow at the thread level (instead of instruction)
     • Synchronization part of the program separated from the
       computation part
     • Tolerates synchronization and communication latencies:
         – The computation processor produces useful work while an event is in
           progress
     • Efficient data prefetching through Cache-Flow
     • Data-Driven Multithreading implemented using
       off-the-shelf microprocessors
         – Extra HW requirement: Thread Synchronization Unit (TSU)
     • Data-Driven Network of Workstations (D2NOW)
      [C. Kyriacou, P. Evripidou and P. Trancoso. Data-Driven Multithreading using Conventional
      Microprocessors. IEEE Transactions on Parallel and Distributed Systems, in press, 2005]
      [C. Kyriacou. Data-Driven Multithreading using Conventional Control Flow Microprocessors.
      PhD Thesis, Dept. of Computer Science, University of Cyprus, 2005]




DDM Model of Execution
      [Diagram: a program is a set of re-entrant code blocks, each split into
      threads; a producer-consumer relationship exists among threads]

 • DDM program is a collection of re-entrant code blocks
       –   A code-block is equivalent to a function or loop body
       –   Each code-block comprises several threads
       –   A thread is a sequence of instructions equivalent to a basic block
       –   A producer-consumer relationship exists among threads
       –   Scheduling of code-blocks and threads is done dynamically
           according to data availability




Dataflow Model
      [Diagram: dataflow evaluation of Z = (X+Y)*(X-Y) with X=8, Y=4: the Add
      node fires producing 12, the Sub node fires producing 4, and the Mul
      node fires once both tokens arrive, producing Z = 48]




Example








TFlux Platform
• Pre-Processor + Compiler directive
• Runtime Support
• Implemented and Validated:
  – Software TSU:
     • 8-core Intel Core2 QuadCore Linux
     • 28-core Sparc Linux (Simics)
  – Software TSU:
     • IBM Cell
  – Hardware TSU:
     • 28-core Sparc Linux (Simics)
     • 9-core x86 Linux (Simics)
• Performance:
  – Close to linear speedup and stable across platforms




TFlux Characteristics

• Generic Runtime Support on Commodity OS
  – Supports both DDM and non-DDM applications
• Complete Toolchain: Compiler directives,
  Pre-processor and commodity compiler
  – Supports different ISAs
• Abstract TSU Design
  – Hardware or software (emulation)
    implementations






TFlux Portable Platform




      TFluxSoftCell   TFluxSoftCMP   TFluxHard








TFlux Toolchain
• DDMCPP: source-to-source Pre-processor for DDM Pragma
  directives
• User identifies Threads and Blocks as well as Thread
  dependencies
• Generated code is compiled with a commodity compiler








TFluxSoftCMP: x86 Results




• Large speedup for all apps
• Native MMULT runs with larger data set size




 Data Parallel Acceleration of Decision
Support Queries Using Cell/BE and GPUs


       Pedro Trancoso,
       Despo Othonos,
       Artemis Artemiou


              CASPER Research Group
               University of Cyprus

           Computing Frontiers 2009, May 19, 2009, Ischia, Italy




Motivation
• Decision Support Systems
     – Time-consuming workloads
• New accelerators offer:
     – High performance & low cost (Cell/BE + GPU)
• Accelerators are difficult to program:
     – Different architectures
     – Different programming/execution models

 Evaluation of a Data-Parallel Platform (RapidMind) to
 execute DSS queries on the Cell/BE and GPUs
              Portability ✔          Performance ?








Programmability (not!)








Programmability








But...








But... Data Parallel !








Rapidmind Architecture








Rapidmind
     ...
     // Arrays for stream representation
     Array<1,Value4f> l_extendedprice(LINEITEM/4), A;
     Array<1,Value4f> l_quantity(LINEITEM/4), B;
     ...
     main() {
       ...
       // Definition of the stream program
       Program addition_program = RM_BEGIN {
         In<Value4f>  l_year_v3;
         Out<Value4f> out;
         IF ((l_year_v3 >= (Value4f)(float)DATE1) && (l_year_v3 < (Value4f)(float)DATE2) &&
             ... ) {
           out = (l_extprice_v1 * l_year_v3);
         } ELSE {
           out = (Value4f)-1;
         } ENDIF;
       } RM_END;
       ...
     }


• Data-parallel program
• Backend determines target execution




Data-Parallel Sequential Scan (DPSS)








Experimental Setup                                                                                Speedup = T(Baseline) / T(x)

                                            Baseline                RM-Cell           RM-8500T      RM-8800GTS        OMP-MC8
     System                                  Generic                Sony PS3          NVIDIA           NVIDIA         IBM x3650
                                                                                       8500GT         8800GTS
     Cores                                    1 Dual                   1+6                16              96            2 Quad
     Model                                 Intel E6420              PPE+SPE           Streaming       Streaming       Intel E5320
     Freq                                   2.13GHz                  3.2GHz           900MHz           1.2GHz           1.8GHz
     Cache                                     4MB                   512KB             512MB           640MB            2x4MB
     Mem                                       2GB                   256MB               2GB             2GB             48GB

                         Scale Factor               ORDERS              CUSTOMER              LINEITEM            Approx Size [MB]
                             0.01                    15000                     1500                60175                8.7
                             0.02                    30000                     3000               120515                17.7
                             0.05                    75000                     7500               299814                44.2
                             0.1                    150000                    15000               600572                89.3
                             0.2                    300000                    30000               1199969              179.6
                             0.5                    750000                    75000               2999671              454.0
     Application: Database queries from TPC-H Benchmark
                  Q6 - Sequential Scan
                  Q12 - 1 Join Operation
                  Q3 - 2 Join Operations




Data-Parallel Nested Loop Join: Q12
                             (1 join)   RM-Cell

     [Chart: speedup relative to the PPE/SPE baseline for scale factors
     0.01, 0.02, and 0.05]

• RM-Cell:...
• RM-GPU: high speedup and increasing (poor baseline)
• OMP-MC8: high but falls for sf>0.02 (eviction of reused data)




To Probe Further




                               [Chart: Q3 SF=0.01 execution time in seconds:
                               RM: 321.2, RM-blk64: 344.6, RM-blk512: 68.1,
                               RM-blk1024: 48.2]




               • Cell/BE needs tuning:
                 – Preliminary results using blocking: 6.7x improvement
                 – Asynchronous data transmission (double buffering)…
               • Careful study of performance and programming
                 effort
                 – Comparison CUDA-vs-RM and Cell/SDK-vs-RM




Conclusions
PROGRAM         • Data-Parallel Platform (RapidMind)
                   – Sequential → Data-Parallel conversion: new DP versions of
                     algorithms need to be developed
                   – Hand-coded optimizations seem necessary for Cell/BE
                   ✓ Single program for all accelerators

ARCHITECTURE    • General-Purpose Multi-Core:
                   – Smaller scalability
                   ✓ Programmability (simple serial → OpenMP) & simpler algorithms
                   ✓ Overall best performance
                • Cell/BE & PS3: needs tuning!
                   – Limited memory, LS, in-order, ...
                • GPUs: have the potential to accelerate DSS queries
                   – Data transfers penalty
                   ✓ Good speedup (21.3x) for data- & compute-intensive algorithms
                     (e.g. Nested-Loop Join)




           Fine-grain Parallelism using Multi-core, Cell/BE,
           and GPU Systems: Accelerating the Phylogenetic
           Likelihood Function


        Frederico Pratas1, Pedro Trancoso2,
     Alexandros Stamatakis3, and Leonel Sousa1




             1 SiPS GROUP         2                  3
                                      CASPER GROUP       The Exelixis Lab
             IST, Portugal            UCY, Cyprus        TUM, Germany








 Motivation

• Modern applications have increasing
                                                         Phylogenetic Likelihood
  computational demands                                        Functions


• Systems are more powerful due to hardware
                                                         Fine-grain Parallelism
  parallelism


• But… a wide selection of parallel processors:
     – General-purpose Homogeneous Multi-core            We propose to analyze
       (e.g. quad-core CPU)                                  Performance
                                                              Scalability
     – General-purpose Heterogeneous Multi-                Programmability
       core (e.g. Cell/BE)
     – Accelerators (e.g. GPUs)





 MrBayes

       • MrBayes is a bioinformatics application that performs Bayesian
         inference of evolutionary (phylogenetic) trees based on the
         Maximum Likelihood model

                      Phylogenetic trees are used for instance in
          “Origins and evolutionary genomics of the 2009 swine-origin H1N1
                                influenza A epidemic”
                 http://www.nature.com/nature/journal/v459/n7250/full/nature08182.html



       • A current real-world phylogenomic data set study contains 1,500
         genes and requires 2,000,000 CPU hours on a BlueGene/L system1

       1 M. Hejnol, M. Obst, A. Stamatakis, M. Ott, G. Rouse, G. Edgecombe, P. Martinez, J. Baguna, U. Jondelius,
           M. Wiens, W. Mueller, E. Seaver, W. Wheeler, M. Martindale, G. Giribet, C. Dunn: "Assessing the root of
           bilaterian animals with scalable phylogenomic methods", Proceedings of the Royal Society B, in press.






 MrBayes
 Kernel Parallelization
     ➡ Homogeneous Multi-Cores – Parallelization of the outermost loop using
       thread-level parallelism between cores
     ➡ Cell/BE – Parallelization of the outermost loop between SPEs
     ➡ GPU – Parallelization of the inner loop using SIMD/SIMT,
       e.g., GTX285 #threads=21760

     [Diagram: outer-loop iterations distributed over CPU0…CPUn / SPE0…SPEn;
     inner-loop iterations mapped to SIMD/SIMT lanes]




  Implementation Aspects
  Summary



                      Homogeneous              Heterogeneous                Accelerators

 Main                 small #cores             small #cores                 large #cores
 Characteristics      HW managed memory        limited SW managed memory    SW/HW managed memory

 Disadvantages        hidden overheads         execution in smaller steps   global data transfers
                                               synchronization

 Advantages           fast communication via   concurrent communication/    high throughput
                      shared cache             computation




71




     Experimental Setup

             Baseline     | Homogeneous General-Purpose Multi-cores      | Cell/BE               | GPU
                          | 2xXeon(4)    4xOpteron(4)    8xOpteron(2)    | PS3       QS20        | 8800GT     GTX285
  System     Generic      | IBM x3650    Dell PowerEdge  Sun x4600 M2    | Sony PS3  IBM QS20    | NVIDIA     NVIDIA
                          |              M905                            |                       | 8800GT     GTX285
  #Cores     1            | 2 x Quad     4 x Quad        8 x Dual        | 1+6       2 x (1+8)   | 112        240
  Model      Intel E8400  | Intel E5320  AMD 8354        AMD 8218        | PPE+SPE   PPE+SPE     | Streaming  Streaming
  Frequency  3.0GHz       | 1.8GHz       2.2GHz          2.6GHz          | 3.2GHz    3.2GHz      | 1.5GHz     1.476GHz
  Cache      6MB          | 2x4MB        4x512KB + 2MB   2x1MB           | 512KB     2x512KB     | 256KB      480KB
  Memory     2GB          | 48GB         64GB            64GB            | 256MB     2x512MB     | 512MB      1GB

      • Baseline: general-purpose architecture used as reference system
      • The homogeneous general-purpose multi-core implementation uses
        OpenMP as provided by the Intel Compiler Suite 11, the Cell/BE
        implementation uses the Cell SDK v2.0, and the GPU implementation
        uses the CUDA API v2.1
      • MrBayes v3.1.2 with input datasets obtained from Seq-Gen v1.3.2
      • Datasets are named according to X_Y, where X is related to the number
        of PLF calls and Y is related to the size of the data set
72




                                                                                                                              36
  Experimental Results
  Total Execution Time

           • Poor performance: 1.5x - 2x speedup overall
           • Cell/BE is mainly limited by the PPE performance
           • For the GPU, data communication overhead is critical; however,
             considering only the parallel section, the GPU would be the
             most successful architecture!
           • General-purpose multi-cores efficiently handle both serial
             and parallel execution

           • Results are scaled to the frequency and to the baseline
           • Times are broken into parallel execution, serial execution,
             and communication (for GPUs)
           • Programmability: approximately 1/2 day to parallelize
             MrBayes on multi-cores, 2 days on GPU, and 10 days on
             Cell/BE
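The scaling to frequency and baseline mentioned in the first bullet can be made concrete. The sketch below is our interpretation of that normalization, not code from the study: a measured time is rescaled as if the system ran at the baseline clock, then reported as a speedup over the baseline.

```c
#include <assert.h>

/* Illustrative normalization: scale a measured time to the baseline
 * clock frequency, then report speedup relative to the baseline time.
 * (Our interpretation of "scaled to the frequency and to the baseline".) */
double scaled_speedup(double t_baseline, double f_baseline,
                      double t_system,   double f_system) {
    /* a slower-clocked system would have been proportionally faster
     * at the baseline frequency */
    double t_scaled = t_system * (f_system / f_baseline);
    return t_baseline / t_scaled;
}
```

For example, a 1.5GHz system matching the 3.0GHz baseline's raw time counts as a 2x frequency-scaled speedup.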
73




  Conclusions



            • Overall Performance:
              – Poor for GPU: data transfers
              – Poor for Cell/BE: inefficient execution of serial code
                on the PPE
              – Good for Homogeneous General-Purpose Multi-core: best
                balance between efficient parallel and serial execution
                of the code

            Future Processor Vision
            - Tightly coupled
            - Large number of small specific-purpose cores
            - Few complex general-purpose cores for serial execution
            - Efficient communication/synchronization




74




                                                                                   37
                     CASPER
                    Research
                      Group




                                             75




CASPER

• What is CASPER:
  – Computer Architecture Systems and
    Performance Evaluation Research

• Who is CASPER:
  – 1 Ph.D.: Panayiotis
  – 2 M.Sc.: Maria, George
  – 6 Undergrad: Froso, Maria, Andreas, Michael,
    Gianninos, Nikolas
  – 10+ Undergrad + 2 MSc + 1 PhD Alumni
                                             76




                                                   38
 Research Projects                          (1/2)

 • TFlux (ICPP’08)
   – TFluxHard & TFluxSoft
   – TFlux vs. OpenMP
   – FPGA implementation of Thread Synch Unit
   – PreProcessor code to Cell/BE & Distributed
     Systems
   – Data prefetching, Dynamic Scheduling, Load
     Balancing, ...
 • HelperCoreDB (IPDPS’08)
   – Dynamic Tuning of Generic Software-Managed
      Multi-core Data Prefetcher
                                                   77




 Research Projects                          (2/2)
• Accelerators
  – DSS Database Queries on Multi-core, Cell/BE,
    and GPU using Rapidmind (CF’09)
  – Fine-grain Parallelism on Multi-core, Cell/BE,
    GPU for MrBayes (ICPP’09)
  – Rapidmind Cell/BE & GPU versus native pthread
    & CUDA
  – MapReduce on Multi-core systems
  – DBMS+GPU acceleration (PGSQL-GPU)
• Virtualization
   – Execution overhead of VM execution
                                                   78




                                                     39
             Thank you !
             CASPER GROUP
             Computer Architecture Systems
             Performance Evaluation Research
             www.cs.ucy.ac.cy/carch/casper

TFlux Now Available at:
              www.cs.ucy.ac.cy/carch/casper/tflux
                                               79




                                                    40
