SMiLE Shared Memory Programming


Wolfgang Karl, Martin Schulz
Lehrstuhl für Rechnertechnik und Rechnerorganisation, LRR
Technische Universität München

      SCI Summer School, Trinity College Dublin
      October 3rd, 2000
SMiLE Project at LRR

• Lehrstuhl für Rechnertechnik und Rechnerorganisation, Prof. Dr. Arndt Bode
  – Parallel Processing and Architectures
    · Tools and Environments for Parallel Processing
    · Applications
    · Parallel Architectures

• Computer Architecture Group (W. Karl, M. Schulz)
  – SMiLE: Shared Memory in a LAN-like Environment
  – Programming Environments and Tools
  – SCI Hardware Developments
  – http://smile.in.tum.de/
Outline

• Parallel Processing: Principles
• SMiLE Software Infrastructure
  – Focus on Communication Architecture
• SMiLE Tool Environment
• Data Locality Optimizations
• Continuation by Martin
  – Shared Memory Programming on SCI
Parallel Processing

• Parallel Computer Architectures
  – Shared Memory Machines
  – Distributed Memory Machines
  – Distributed Shared Memory

• Parallel Programming Models
  – Shared Memory Programming
  – Message Passing
  – Data Parallel Programming Model
Shared Memory Multiprocessors

[Figure: several CPUs, each with a cache, connected through an
interconnection network (bus, crossbar, or multistage) to a set of
shared memory modules.]

• Global address space
• Uniform Memory Access (UMA)
• Communication / synchronization via shared variables
• Centralized shared memory
Parallel Programming Models (1)

• Shared Memory
  – Single global virtual address space
  – Easier programming model
    · Implicit data distribution
    · Support for incremental parallelization (see the sketch below)
  – Mainly on tightly coupled systems
    · SMPs, CC-NUMA
  – Pthreads, SPMD, OpenMP, HPF
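As an illustration of incremental parallelization, a serial loop can be
parallelized by adding a single OpenMP pragma; the loop itself is a
made-up example, not taken from the talk:

    #include <omp.h>

    /* Adding one pragma distributes the iterations across threads;
     * the array stays implicitly shared through the single global
     * address space. */
    void scale(double *a, double s, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }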
Shared Memory Flow: Timing

Producer Thread:
    for (i := 0; i < num_bytes; i++)
        buffer[i] := source[i];
    flag := num_bytes;

    → write to shared buffer, then set flag

Consumer Thread:
    while (flag == 0) ;
    for (i := 0; i < flag; i++)
        dest[i] := buffer[i];

    → detect flag set and read message from buffer
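The same flag protocol spelled out in compilable C; the use of C11
atomics for the flag is an assumption of this sketch (the slide's
pseudocode leaves memory ordering implicit):

    #include <stdatomic.h>
    #include <string.h>

    #define MAX_BYTES 4096

    /* Shared between producer and consumer threads. */
    static char buffer[MAX_BYTES];
    static atomic_int flag = 0;   /* 0 = empty, >0 = message length */

    void produce(const char *source, int num_bytes)
    {
        memcpy(buffer, source, num_bytes);
        /* Release ordering makes the buffer contents visible
         * before the flag update is observed. */
        atomic_store_explicit(&flag, num_bytes, memory_order_release);
    }

    int consume(char *dest)
    {
        int n;
        /* Spin until the producer publishes the length. */
        while ((n = atomic_load_explicit(&flag, memory_order_acquire)) == 0)
            ;
        memcpy(dest, buffer, n);
        return n;
    }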
Distributed Memory, DM

[Figure: nodes, each consisting of a CPU with cache, local memory,
and a network interface, connected through an interconnection
network.]

• No remote memory access (NORMA)
• Communication: message passing
• MPP, NOWs, clusters: scalability
Parallel Programming Models (2)

• Message Passing
  – Predominant paradigm for DM machines
  – Straightforward resource abstraction
    · High-level communication libraries: PVM, MPI
    · Exploiting underlying interconnection networks
  – Complex and more difficult for the user
    · Explicit data distribution and parallelization
  – But: performance tuning more intuitive
Message Passing Flow: Timing

Producer Process:
    send(proc_i, process_i, @sbuffer, num_bytes)

Consumer Process:
    receive(@rbuffer, max_bytes)

Sender:
    Send a message
    OS call
        Protection check
        Program DMA
    DMA to NI

Receiver:
    DMA from net to system buffer
    OS interrupt and message decode
    OS copy from system buffer to user buffer
    Reschedule user process
    Receive message
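Rendered with MPI, one of the libraries named above, the same flow
looks like this; ranks and buffer sizes are illustrative:

    #include <mpi.h>
    #include <string.h>

    /* Minimal MPI rendering of the producer/consumer flow:
     * rank 0 sends, rank 1 receives. */
    int main(int argc, char **argv)
    {
        char sbuffer[256] = "message";
        char rbuffer[256];
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(sbuffer, strlen(sbuffer) + 1, MPI_CHAR,
                     1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(rbuffer, sizeof rbuffer, MPI_CHAR,
                     0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Finalize();
        return 0;
    }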
Distributed Shared Memory, DSM

[Figure: nodes with CPU, cache, and local memory connected through an
interconnection network; all memories form one shared address space.]

• Distributed memory, shared by all processors
• NUMA: non-uniform memory access, CC-NUMA, COMA
• Combines support for shared memory programming with scalability
Parallel Computer Architecture

• Trends in parallel computer architecture
  – Convergence towards a generic parallel machine organization
  – Use of commodity off-the-shelf components
  – Low-cost parallel processing
  – Comprehensive and high-level development environments
SCI-based PC clusters

[Figure: PCs with PCI-SCI adapters, connected by an SCI interconnect
and sharing a global address space.]

• NUMA architecture with commodity components
  – Hardware-supported DSM with low-latency remote memory access and
    fast message passing
  – Competitive capabilities for less money
  – But new challenges for the software environment
SCI-Based Cluster-Computing

• The SMiLE project at LRR-TUM
  – Shared Memory in a LAN-like Environment

• System architecture
  – SCI-based PC cluster with NUMA characteristics

• Software infrastructure for PC clusters
  – User-level communication architectures on top of SCI’s DSM
  – Providing message passing and transparent shared memory on a
    single platform

• Tool environment
SMiLE software layers

[Figure: the SMiLE software stack, bottom to top:]

• SCI hardware: SMiLE & Dolphin adapters, HW-Monitor
• Kernel level: NDIS driver, SCI drivers & SISCI API, SCI-VM
• Low-level SMiLE (above the user/kernel boundary): NT protocol
  stack; AM 2.0, SS-lib, and CML, together forming SCI-Messaging;
  SCI-VM lib
• High-level SMiLE: SISAL on MuSE, SISCI PVM, SPMD-style model,
  TreadMarks-compatible API
• Target applications / test suites
SMiLE software layers

[The same stack, here highlighting message passing applications /
test suites and the user-level communication layers.]
Message passing using HW-DSM

• SMiLE messaging layers
  – Active Messages
  – User-level sockets
  – Common Messaging Layer (CML) for PVM and MPI

• User-level communication
  – Remove the OS from the critical path of sending and receiving
    messages
  – Map parts of the NI into the user’s address space
  – Avoid context switches and buffering
  – Direct utilization of the HW-DSM
  – Buffered remote writes
Principles of the message engines

[Figure: a pair of ring buffers mapped between node A (sender) and
node B (receiver) via SCI. The sender writes the message into the
mapped buffer, advances its end pointer, and synchronizes the mapped
end pointer copy on the receiver; the receiver consumes data,
advances its start pointer, and synchronizes the mapped start pointer
copy back on the sender.]
Implementation Remarks

• Ring buffer setup (see the sketch below)
  – One pair of ring buffers for each connection
  – Avoids high-overhead access synchronization
  – On-demand establishment of connections

• Data transfer
  – Only pipelined, buffered remote writes
  – Avoids inefficient, blocking reads

• User-level barriers to avoid IOCTL overhead
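A minimal C sketch of the sender side of such a ring buffer; the
struct layout, the names, and the assumption that the peer's buffer
and pointer copies are directly SCI-mapped into the local address
space are illustrative, not the SMiLE implementation:

    #include <stdint.h>

    #define RING_SIZE 4096              /* illustrative size */

    struct ring {
        volatile char     *remote_buf;  /* receiver's buffer, SCI-mapped  */
        volatile uint32_t *remote_end;  /* receiver's end pointer copy    */
        uint32_t           end;         /* local end (write) pointer      */
        volatile uint32_t  start_copy;  /* local copy of the start pointer,
                                           updated remotely by receiver   */
    };

    /* Everything on the critical path is a buffered remote write;
     * no blocking remote reads are issued. */
    int ring_send(struct ring *r, const char *msg, uint32_t len)
    {
        if (r->end - r->start_copy + len > RING_SIZE)
            return -1;                  /* ring full, caller retries */

        for (uint32_t i = 0; i < len; i++)
            r->remote_buf[(r->end + i) % RING_SIZE] = msg[i];

        r->end += len;
        *r->remote_end = r->end;        /* publish the new end pointer */
        return 0;
    }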
SMiLE software layers

[The same stack, here highlighting the true shared memory programming
path: the SCI-VM with the SMiLE driver & Dolphin IRM driver at kernel
level, the SCI-VM lib, and the SPMD-style model and
TreadMarks-compatible API above it.]
DSM performance monitoring (1)

• Performance aspects of DSM systems:
  – High communication performance via hardware-supported DSM
  – Remote memory access latencies are an order of magnitude higher
    than local ones
  – Data locality should be enabled or exploited by programming
    systems or tools
DSM performance monitoring (2)

• Monitoring aspects of DSM systems:
  – Capture information about the dynamic behavior of a parallel
    program
  – Fine-grain communication
    · Communication might occur implicitly on every read or write
  – Avoid the probe effect that software instrumentation would cause
SMiLE Monitoring Approach

• Event-driven hybrid monitoring system
  – Network interface with monitoring hardware
  – Delivers information about the runtime and communication behavior
    to tools for performance analysis and debugging
  – Allows on-line steering
  – Hardware monitor exploits spatial and temporal locality of
    accesses
The SMiLE Hardware Monitor

[Figure: the SMiLE monitor sits on the PCI local bus next to the
PCI-SCI bridge. A probe taps the traffic between the bridge’s SCI in
and SCI out links; a PCI unit connects the monitor to the host.]
Features of the SMiLE Monitor

• Dynamic monitoring mode
  – Used on the whole physical address space
  – Creation of global access heuristics
  – Cache-like swap logic to save hardware resources
  – Automatic aggregation of neighboring areas

• Static monitoring mode
  – Used on predefined memory areas
  – Flexible event logic
Monitor’s dynamic mode

[Figure: memory references from the instruction stream (mov/add
instructions with memory operands) are checked against a small array
of tag/counter pairs. A hit increments the matching counter; a miss
evicts an entry through the cache-like swap logic into a ring buffer
in main memory (head/tail pointers), and an interrupt is raised when
the ring buffer is stuffed.]
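In software terms, the dynamic mode behaves roughly like the
following model; the entry count, the per-page aggregation, and the
least-used eviction policy are assumptions for illustration, not the
real hardware design:

    #include <stdint.h>

    #define NCOUNTERS 8                    /* assumed table size  */
    #define RING_LEN  1024                 /* assumed ring length */

    struct entry { uint32_t tag; uint32_t count; };

    static struct entry counters[NCOUNTERS];
    static struct entry ring[RING_LEN];
    static unsigned head, tail;            /* tail is advanced by the
                                              software draining the ring
                                              (not shown) */

    extern void raise_interrupt(void);     /* hypothetical hook */

    void monitor_reference(uint32_t phys_addr)
    {
        uint32_t tag = phys_addr >> 12;    /* aggregate per page */
        unsigned victim = 0;

        for (unsigned i = 0; i < NCOUNTERS; i++) {
            if (counters[i].tag == tag) {  /* hit: count the access */
                counters[i].count++;
                return;
            }
            if (counters[i].count < counters[victim].count)
                victim = i;                /* remember least-used entry */
        }

        /* Miss: swap the victim out into the main-memory ring buffer. */
        ring[head] = counters[victim];
        head = (head + 1) % RING_LEN;
        if (head == tail)
            raise_interrupt();             /* ring buffer stuffed */

        counters[victim].tag = tag;
        counters[victim].count = 1;
    }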
Information delivered

• All information acquired is based on SCI packets
  – Physical addresses
  – Source/target IDs

• Information cannot be used directly
  – Physical addresses are inappropriate
  – Back-translation to source code level necessary

• Need for a monitoring infrastructure
  – Access to mapping & symbol information
  – Clean monitoring interface
OMIS: Goals and Design

• Goal: flexible monitoring for distributed systems
  – Specify an interface to be used by tools
  – Decouple tools and monitor system
  – Increased portability and availability of tools

• OMIS approach
  – Interface based on the event-action paradigm
    · Events: when should something happen?
    · Actions: what should happen?
  – OMIS provides a default set of events and actions
  – Tools define relations between events and actions (sketched below)
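To make the paradigm concrete, here is a toy C model of an
event-action relation; all names are hypothetical, and this is an
illustration of the idea only, not the actual OMIS request syntax:

    #include <stdio.h>
    #include <string.h>

    typedef void (*action_fn)(int node, unsigned long addr);

    struct relation {
        const char *event;   /* when should something happen? */
        action_fn   action;  /* what should happen?           */
    };

    static void print_access(int node, unsigned long addr)
    {
        printf("node %d accessed 0x%lx\n", node, addr);
    }

    /* The tool defines the relation between event and action. */
    static struct relation tool_request = { "remote_write_observed",
                                            print_access };

    /* The monitor calls this whenever it detects an event. */
    void monitor_dispatch(const char *event, int node, unsigned long addr)
    {
        if (strcmp(event, tool_request.event) == 0)
            tool_request.action(node, addr);
    }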
Putting it all together

[Figure: the SMiLE HW-DSM monitor and the SCI-VM (global virtual
memory for clusters) combined beneath an extensible monitoring API.]
Multi-Layered SMiLE monitoring

[Figure: the OMIS/OCM monitor for DSM systems as a layered stack.
The OMIS/OCM core is extended step by step by an SCI extension, a DSM
extension, a programming model extension, and a programming
environment extension; tools sit on top. Each layer draws on its own
source of information:]

• Node-local resources (CPU counters, cache statistics, OS
  information)
• SMiLE PCI-SCI bridge and monitor (physical addresses, node IDs,
  counters, histograms)
• SCI-VM (virtual/physical address mappings, statistics)
• SyncMod (statistics of synchronization mechanisms)
• Shared memory programming model (model-specific information)
• High-level programming environment (environment-specific
  information)
Advantages

• Comprehensive DSM monitoring
  – Utilization of information from all components

• Structure of the execution environment maintained
  – Generic shared memory monitoring
  – Small model-specific extensions
  – Flexibility and extensibility

• Profit from the existing OMIS environment
  – Easy implementation
  – Utilization of the existing rich tool base
Current Status and Future Work

• SCI Virtual Memory
  – Prototype completed
  – Work on larger infrastructure in progress

• SMiLE Hardware Monitor
  – Prototype is currently being tested
  – Simulation environment available

• OMIS
  – OMIS definition and OCM core completed
  – DSM extension in development
Data locality optimizations

• Using the monitor’s static mode
  – Monitoring predefined memory sections
  – Integration of the monitoring concept into the programming model
  – Translation of the application’s data structures into physical
    addresses for the hardware monitor
  – Relating the monitoring results back to the source code
    information
  – Evaluation of the network behavior and data locality
Example application

• SPLASH benchmark suite: LU kernel
  – Implements a blocked version of an LU decomposition for dense
    matrices (a sketch of the blocked algorithm follows below)
  – Solves a system of linear equations
  – Splits the data structure into subblocks of 16x16 values

• LU decomposition
  – Split into phases, one for each block
  – For each phase: analysis of the remote memory accesses
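A minimal sketch of a blocked, right-looking LU decomposition without
pivoting, in the spirit of the SPLASH LU kernel; this is not the
SPLASH code itself, and N is assumed to be a multiple of the block
size B (16 in the kernel’s configuration):

    #define N 128
    #define B 16

    static double a[N][N];

    static void lu_blocked(void)
    {
        for (int k = 0; k < N; k += B) {
            /* 1. Factor the diagonal block (unblocked LU). */
            for (int j = k; j < k + B; j++)
                for (int i = j + 1; i < k + B; i++) {
                    a[i][j] /= a[j][j];
                    for (int l = j + 1; l < k + B; l++)
                        a[i][l] -= a[i][j] * a[j][l];
                }
            /* 2. Update the block row to the right (forward
             *    substitution with the unit lower triangle). */
            for (int j = k + B; j < N; j++)
                for (int i = k + 1; i < k + B; i++)
                    for (int l = k; l < i; l++)
                        a[i][j] -= a[i][l] * a[l][j];
            /* 3. Update the block column below (solve against the
             *    upper triangle). */
            for (int i = k + B; i < N; i++)
                for (int j = k; j < k + B; j++) {
                    for (int l = k; l < j; l++)
                        a[i][j] -= a[i][l] * a[l][j];
                    a[i][j] /= a[j][j];
                }
            /* 4. Trailing submatrix update: one such phase per
             *    diagonal block. */
            for (int i = k + B; i < N; i++)
                for (int j = k + B; j < N; j++)
                    for (int l = k; l < k + B; l++)
                        a[i][j] -= a[i][l] * a[l][j];
        }
    }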
Simulation environment

• Multiprocessor memory system simulator: LIMES
  – Shared memory system
    · Local read/write access latency: 1 cycle
    · Remote write latency: 20 cycles
  – DSM system with x86 nodes
    · Memory distribution at page granularity
Optimization results

[Figure: two 3-D bar charts of the LU matrix after phase 1, comparing
the unoptimized and the optimized version. Axes: blocks in i dim. and
blocks in j dim. (1-8 each); vertical axis: number of accesses (up to
70000).]
Summary

• SMiLE Software Infrastructure
  – Parallel Processing Principles
  – SMiLE Software Infrastructure
    · Message Passing Communication
    · Shared Memory Programming
  – SMiLE Tool Environment
    · Based on Hardware Monitoring
SMiLE software layers

[Closing slide: the software stack once more, highlighting the true
shared memory programming path.]