					            FP6-2004-IST-4 FET Proactive Initiative ACA

              SUPERcomputing on a CHIP: SUPERCHIP
                     Proposal Number 26888




              Kari Tiensyrjä, Senior Research Scientist, VTT
              Jesper Larsson Träff, Senior Principal Researcher, NEC Europe
              Ben Juurlink, Professor, Delft University of Technology
              Ian Phillips, Prof., Principal Staff Engineer, ARM




                  24 May, 2005                       SUPERCHIP Evaluation Hearing
VTT TECHNICAL RESEARCH CENTRE OF FINLAND




                                1. Paths to exploitation

• FET project with potential for application breakthroughs in a 10+ years horizon

• Industrial Partners (NEC, ARM, Intel) cover a wide spectrum of application domains
  and provide:
     • Steering of scientific and technological research
     • Transfer of knowledge and results to and interplay with company design groups

• Proposition to standardization bodies, where relevant (B.3.6)

• Active promotion of results (T6.1 and T6.2):
     • High-profile scientific and applied conferences and journals
     • Organization of workshops
     • PhD courses and summer schools, incorporation into advanced curricula
     • Links to Networks of Excellence (NoEs)

• WP6 (led by Intel): dissemination and exploitation (also: B.3.3, B.4.1.7, and B.8.2.6)
    • T6.3 for technology transfer
    • T6.4 for exploitation







                                    2. Target applications
  • Wide range of applications with high computational requirements will be considered
  • WP4 will analyse and identify applications, and selected sample applications will be
    implemented as proof-of-concept
  • An initial set of applications considered:
                             • Mobile devices (energy-efficiency)
                                 • PDA, HDTV
                                 • Games, virtual reality

                             • Desktops and servers (versatility from high-performance/single-
                               application to high-throughput application suites)
                                  • Streaming and DSP applications, e.g. video in bandwidth
                                     constrained active networks and embedded 3D graphics
                                  • Real-time speech recognition and videoconferencing
                                  • Database applications, string processing, geographical
                                     information processing

                             • Supercomputer (high-performance)
                                  • Vectorised CFD Boltzmann automata
                                  • MPI-parallelised finite element methods
                                  • Quantum Chromodynamics





                     3. Leading contenders within the proposal

• Objectives: boost performance by 2-3 orders of magnitude (at the same
  transistor count), exploit parallelism at all levels, realise an easy-to-use, strong
  model of computing, and provide scalability, a wide application area and
  power-saving techniques

• Eclipse
     • Scalable NOC with EREW PRAM model
     • Simultaneous ILP-TLP exploitation
     • Cacheless memory
     • Regular structure

• XMT
     • CMP with PRAM-like but more asynchronous model
     • SMT + synchronization mechanism
     • On-chip caches

• CMP
     • Shared memory using caches + advanced cache coherency protocols

• TTA/PISMA
     • Tiled architecture with virtual shared memory communication
     • Very simple and strongly decentralized organization

• TRIPS
     • Single chip reconfigurable processor / memory architecture
     • Grids of ALUs connected via operand networks
     • Static spatial scheduling
[Figure: block diagrams of two candidate architectures. Left: a regular mesh of
switches/routers (S) connecting processor cores (P), instruction memory modules (I),
data memory modules (M) and input/output devices (I/O). Right: a grid of alternating
P and M tiles bordered by Super-Tile Interface Units.]




               3. Leading contenders within the proposal (cont)

     • Initial choice of architectures is partially guided by application requirements:
           • Eclipse and XMT: general purpose computing, embedded computing
           • Advanced CMP: high-throughput desktop and server machines
           • TTA/PISMA: streaming/DSP
           • TRIPS: HPC, streaming/DSP, threaded servers

     • Procedure to choose the initial SUPERCHIP architecture:
          1. Develop an architecture evaluation framework (T1.1)
          2. Develop semi-analytical power/performance/cost models (T5.1)
          3. Develop/modify existing simulators for the architectures (T5.2)
          4. Design benchmark programs for the architectures (T4.1)
          5. Perform evaluation + identify strong/weak points + select (T1.1)

     • Preliminary criteria:
          • Power, performance, cost (silicon area)
          • Estimated scalability, PRAM-like model support, ease of programming
          • Estimated coverage for aimed application area, TLP-ILP co-exploitation
          • Potential for solving the rest of the problems






          4. Ensuring that HW implementation technologies impact
                 the choice of scalable architecture

  • Scalability issues are taken into account in the initial selection of candidate architectures
      • Mesh-like topologies (providing constant wire length links): Eclipse, CMP,
         TTA, TRIPS
      • Regular structures: Eclipse, CMP, TTA, TRIPS
      • No forwarding networks (Eclipse) or multistage forwarding networks (TRIPS)
      • No cache coherency mechanisms: Eclipse
      • Multithreading: Eclipse, XMT
      • Decentralized structure: Eclipse, CMP, TTA, TRIPS
  • Semi-analytical modeling of the architectures and candidate techniques (T5.1)
      • Analytical parametric power/performance/cost estimation models
      • Hardware implementation parameters are extracted from
             • Technology roadmaps e.g. ITRS
             • Pragmatic experience and knowledge of industrial partners







          4. Ensuring that HW implementation technologies impact
                 the choice of scalable architecture (cont)

     • Architectural simulation (T5.2)
          • Develop/modify existing simulators
          • Benchmarks
          • Sample applications
          • Information on execution time, resource utilization and power
            consumption is extracted
     • Modeling of the critical parts of architectures
          • Feasibility analysis of candidate architectures
          • Studies on fault tolerance, clocking schemes, on-chip/off-chip
            communication, power saving and other implementation related issues
            for the SUPERCHIP architecture (T5.3)
          • Detailed modeling and feasibility assessment of critical parts of the
            SUPERCHIP architecture (T5.4)








   5. Evolution of the PRAM model for the candidate architectures

     • For ease of programming, the SUPERCHIP programming model will be
       based on a PRAM-like model, considering
          • Relaxed synchronization (BSP-like)
          • Strong memory semantics (CRCW-like, built-in operators)
          • Potential for locality exploitation (memory, Hierarchical-PRAM)

     • SUPERCHIP will develop the necessary architectural support for this model

     • Architectural requirements:
          • Synchronization: implicit after each instruction
          • Bandwidth: high bisection to handle random communication
          • Latency: communication/memory access latency should be hidden

     • SUPERCHIP will not investigate PRAM-implementation on distributed memory
       architectures in general
     • Long-term research issue: Evolution of programming model and architecture to
       SUPERCHIP constellations
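
The requirement of implicit synchronization after each instruction can be illustrated with a small simulation. The following is a minimal sketch in Python, not part of the proposal: it sums an array in lockstep rounds under the exclusive-read, exclusive-write (EREW) rule. The function name `erew_sum` and the read-phase/write-phase split are assumptions made for illustration only.

```python
def erew_sum(values):
    """Sum `values` in lockstep rounds, checking the EREW access rule.

    Hypothetical illustration: each round stands for one PRAM instruction
    executed by all active virtual processors at once.
    """
    mem = list(values)  # simulated shared memory
    n = len(mem)
    stride = 1
    rounds = 0
    while stride < n:
        # One virtual processor per pair of cells (i, i + stride).
        active = [i for i in range(0, n, 2 * stride) if i + stride < n]

        # Read phase: every active processor reads its two operands.
        touched = []
        reads = {}
        for i in active:
            reads[i] = (mem[i], mem[i + stride])
            touched += [i, i + stride]

        # EREW check: no cell is accessed by more than one processor.
        assert len(touched) == len(set(touched)), "concurrent access"

        # Write phase: starts only after all reads have finished, which
        # models the implicit synchronization after each instruction.
        for i in active:
            mem[i] = reads[i][0] + reads[i][1]

        stride *= 2
        rounds += 1
    return (mem[0] if n else 0), rounds


total, rounds = erew_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)
```

With eight inputs the sum completes in three rounds, one per doubling of the stride. On real hardware the separation between the read and write phases would have to be provided by the synchronization mechanism of the architecture.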






5. Evolution of the PRAM model for the candidate architectures (cont)


 Candidate   Synchronization           Bisection bandwidth   Latency hiding                   Initial model

 Eclipse     synchronization wave,     P/2                   Super-pipelined multithreading   EREW PRAM
             fast barrier mechanism
 XMT         hardware synchronization  ?                     caches                           PRAM-like
 CMP         software synchronization  square root P         caches                           NUMA
 TTA/PISMA   software synchronization  square root P         caches                           NUMA
 TRIPS       software synchronization  square root P         caches                           NUMA
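
To see how far apart the two bisection bandwidth entries in the table grow, the short Python sketch below evaluates P/2 and square root P for a few machine sizes. It assumes that P denotes the number of processors and that bandwidth is counted in abstract units (for example links across the bisection); the slides define neither, and the function name `bisection_bandwidth` is hypothetical.

```python
import math


def bisection_bandwidth(p, scaling):
    """Return the bisection bandwidth for `p` processors.

    `scaling` is one of the two entries used in the table:
    "P/2" or "sqrt(P)". Units are abstract (an assumption).
    """
    if scaling == "P/2":
        return p / 2
    if scaling == "sqrt(P)":
        return math.sqrt(p)
    raise ValueError(f"unknown scaling: {scaling}")


for p in (16, 64, 256, 1024):
    linear = bisection_bandwidth(p, "P/2")
    root = bisection_bandwidth(p, "sqrt(P)")
    # The ratio simplifies to sqrt(P) / 2, so the gap widens with P.
    print(f"P={p:5d}  P/2={linear:6.0f}  sqrt(P)={root:4.0f}  ratio={linear / root:4.1f}")
```

The ratio between the two entries is sqrt(P)/2, so the gap widens as P grows.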






         6. Validation and assessment of the performance scalability
                   of the final choice of HW/SW architecture

     • Analytically through parametric power/performance/cost models

     • Empirically through simulations
         • Benchmark kernels and sample applications
               • Scalable benchmark suite for fine-grained shared memory
                  architecture
               • Standard benchmark suites
               • Sample applications
         • Parametric architecture simulations

     • By comparing to future alternative approaches (e.g. advanced CMPs) and
       theoretical machines (e.g. ideal PRAM) using the applications and
       benchmarks
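
One way such a comparison could be tabulated is sketched below in Python. It is an illustration only, not the project's evaluation framework: the function name `scalability_report`, the metric names and all numbers in the usage example are made-up placeholders. The sketch assumes that simulated cycle counts and ideal PRAM step counts are available for the same processor counts, including P = 1.

```python
def scalability_report(cycles_by_p, ideal_steps_by_p):
    """Derive simple scalability metrics from cycle counts.

    cycles_by_p:      {processor count: simulated cycles} (must include 1)
    ideal_steps_by_p: {processor count: steps on an ideal PRAM}
    Both inputs are hypothetical placeholders for simulator output.
    """
    base = cycles_by_p[1]
    report = {}
    for p in sorted(cycles_by_p):
        cycles = cycles_by_p[p]
        speedup = base / cycles
        report[p] = {
            "speedup": speedup,                 # relative to one processor
            "efficiency": speedup / p,          # 1.0 means perfect scaling
            "ideal_ratio": ideal_steps_by_p[p] / cycles,  # 1.0 matches ideal PRAM
        }
    return report


# Placeholder numbers, chosen only to exercise the function.
cycles = {1: 1000, 2: 520, 4: 280}
ideal = {1: 1000, 2: 500, 4: 250}
for p, row in scalability_report(cycles, ideal).items():
    print(p, {name: round(value, 3) for name, value in row.items()})
```

An efficiency or ideal ratio that falls as P grows would point to a scalability limit worth examining in the parametric models.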







              7. Plan for identifying the requirements for the OS
                     within the resources of the work plan

    • The goal is to identify requirements and implement core OS services that
      demonstrate the validity of the architectural approach, not to develop a
      full-fledged OS (as stated in B.4.1.5):
          • Requirements from underlying architecture and applications
          • Resource management (process, thread and memory)
          • Runtime functions and services for applications

    • Input for identifying requirements will come from several other tasks including
      T1.2, T1.3, T2.2 and T3.3
         • OS is not in charge of supporting distributed shared memory
         • Certain OS functionality will be covered by compiler’s run-time system

    • The leader of the OS task (T4.3, ULM) has developed a distributed operating
      system (Plurix), which provides an excellent basis







              7. Plan for identifying the requirements for the OS
                  within the resources of the work plan (cont)

        • Preliminary anticipated OS requirements
             • Dynamic process/thread scheduling
             • Memory management (physical and virtual)
             • Synchronization including inter-process communication
             • Support for power management and IO
        • Definition
             • A coarse-grain functional model of the OS will be developed and
               validated through simulation
             • Definition of the API in the SUPERCHIP language (or a pseudo-language
               in the early phase)
        • Implementation
             • Using the SUPERCHIP language and compiler (from T2.2 and T3.3)
             • Testing with architecture simulation tools (from T5.2)


                 Feasible with the allocated resources and partners

