SMiLE Shared Memory Programming

Wolfgang Karl, Martin Schulz
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Technische Universität München

SCI Summer School, Trinity College Dublin, October 3rd, 2000

SMiLE Project at LRR
- Lehrstuhl für Rechnertechnik und Rechnerorganisation (Prof. Dr. Arndt Bode)
  - Parallel Processing and Architectures
  - Tools and Environments for Parallel Processing
  - Applications
  - Parallel Architectures
- Computer Architecture Group (W. Karl, M. Schulz)
  - SMiLE: Shared Memory in a LAN-like Environment
  - Programming Environments and Tools
  - SCI Hardware Developments
  - http://smile.in.tum.de/

Outline
- Parallel Processing: Principles
  - Focus on Communication Architecture
- SMiLE Software Infrastructure
- SMiLE Tool Environment
  - Data Locality Optimizations
- Continuation by Martin Schulz: Shared Memory Programming on SCI


Parallel Processing
- Parallel Computer Architectures
  - Shared Memory Machines
  - Distributed Memory Machines
  - Distributed Shared Memory
- Parallel Programming Models
  - Shared Memory Programming
  - Message Passing
  - Data Parallel Programming Model


Shared Memory Multiprocessors
[Figure: several CPUs, each with its own cache, connected through an interconnection network (bus, crossbar, or multistage) to a set of memory modules forming a centralized shared memory]

- Global address space
- Uniform Memory Access (UMA)
- Communication / synchronization via shared variables


Parallel Programming Models (1)
- Shared Memory
  - Single global virtual address space
  - Easier programming model
    - Implicit data distribution
    - Support for incremental parallelization
  - Mainly on tightly coupled systems
    - SMPs, CC-NUMA
  - Pthreads, SPMD, OpenMP, HPF (see the OpenMP sketch below)
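As an illustration of incremental parallelization with implicit data distribution, here is a minimal OpenMP sketch (an added example, not taken from the slides): the serial loop is parallelized by adding a single pragma, and all threads operate on the same global address space.

    /* Minimal OpenMP sketch: incremental parallelization of a serial loop.
       Compile with e.g. gcc -fopenmp; without OpenMP the pragma is ignored
       and the program remains a correct serial one. */
    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)  /* the only change vs. serial code */
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];   /* data distribution is implicit: all threads  */
            sum += a[i];         /* share the single global address space       */
        }

        printf("sum = %f\n", sum);
        return 0;
    }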


Shared Memory Flow: Timing
Producer Thread (write to the shared buffer, then set the flag):

    for (i:=0; i<num_bytes; i++)
        buffer[i] := source[i];
    flag := num_bytes;

Consumer Thread (detect that the flag is set, then read the message from the buffer):

    while (flag == 0) ;
    for (i:=0; i<flag; i++)
        dest[i] := buffer[i];
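A compilable version of this handshake is sketched below (an added example, not part of the slides; it assumes POSIX threads and C11 atomics so that setting the flag with release semantics makes the buffer contents visible to the consumer).

    /* Sketch of the flag-based producer/consumer handshake shown above.
       Assumes POSIX threads and C11 atomics; compile with e.g. -pthread. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <string.h>
    #include <stdio.h>

    #define MAX_BYTES 64

    static char buffer[MAX_BYTES];        /* shared message buffer           */
    static atomic_int flag = 0;           /* 0 = empty, >0 = bytes available */

    static void *producer(void *arg) {
        const char *source = arg;
        int num_bytes = (int)strlen(source) + 1;
        memcpy(buffer, source, (size_t)num_bytes);      /* write to shared buffer */
        atomic_store_explicit(&flag, num_bytes,
                              memory_order_release);    /* ... then set the flag  */
        return NULL;
    }

    static void *consumer(void *arg) {
        char dest[MAX_BYTES];
        int n;
        (void)arg;
        while ((n = atomic_load_explicit(&flag, memory_order_acquire)) == 0)
            ;                                           /* spin until flag is set */
        memcpy(dest, buffer, (size_t)n);                /* read the message       */
        printf("consumer received %d bytes: %s\n", n, dest);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, "hello");
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }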

Distributed Memory, DM
[Figure: distributed memory nodes, each consisting of a CPU with cache, local memory, and a network interface, connected by an interconnection network]

- No remote memory access (NORMA)
- Communication: message passing
- MPP, NOWs, clusters: scalability

Parallel Programming Models (2)
- Message Passing
  - Predominant paradigm for DM machines
  - Straightforward resource abstraction
    - High-level communication libraries: PVM, MPI
    - Exploiting underlying interconnection networks
  - Complex and more difficult for the user
    - Explicit data distribution and parallelization
  - But: performance tuning is more intuitive

Message Passing Flow: Timing
Producer Process:

    send(proc_i, process_i, @sbuffer, num_bytes)

    Sender side: send a message -> OS call -> protection check -> program DMA -> DMA to NI

Consumer Process:

    receive(@rbuffer, max_bytes)

    Receiver side: DMA from network to system buffer -> OS interrupt and message decode ->
    OS copy from system buffer to user buffer -> reschedule user process -> receive message
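For comparison, the same exchange expressed with one of the high-level libraries mentioned earlier (MPI); this is an added illustration, not taken from the slides. The library carries out the OS calls, DMA programming, and buffer copies listed in the timing diagram.

    /* Illustrative MPI version of the producer/consumer exchange above. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        int rank;
        char sbuffer[64] = "hello", rbuffer[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                                    /* producer process */
            MPI_Send(sbuffer, (int)strlen(sbuffer) + 1, MPI_CHAR,
                     1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                             /* consumer process */
            MPI_Status status;
            MPI_Recv(rbuffer, (int)sizeof(rbuffer), MPI_CHAR,
                     0, 0, MPI_COMM_WORLD, &status);
            printf("received: %s\n", rbuffer);
        }

        MPI_Finalize();
        return 0;
    }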


Distributed Shared Memory, DSM
[Figure: distributed shared memory nodes, each with a CPU, cache, and local memory, connected by an interconnection network]

- Distributed memory, shared by all processors
- NUMA: non-uniform memory access; CC-NUMA, COMA
- Combines support for shared memory programming with scalability

Parallel Computer Architecture
- Trends in parallel computer architecture
  - Convergence towards a generic parallel machine organization
  - Use of commodity off-the-shelf components
    - Low-cost parallel processing
    - Comprehensive and high-level development environments

SCI-based PC clusters
[Figure: PCs equipped with PCI-SCI adapters, connected by an SCI interconnect that provides a global address space]

- NUMA architecture with commodity components
  - Hardware-supported DSM with low-latency remote memory access and fast message passing
  - Competitive capabilities for less money
  - But: new challenges for the software environment


SCI-Based Cluster-Computing
- The SMiLE project at LRR-TUM
  - Shared Memory in a LAN-like Environment
- System architecture
  - SCI-based PC cluster with NUMA characteristics
- Software infrastructure for PC clusters
  - User-level communication architectures on top of SCI's DSM
  - Providing message passing and transparent shared memory on a single platform
- Tool environment

SMiLE software layers
[Figure: SMiLE software layers]

- Target applications / test suites
- High-level SMiLE: NT protocol stack, SISAL on MuSE, SISCI PVM, SPMD-style model, TreadMarks-compatible API
- Low-level SMiLE: AM 2.0, SS-lib, CML, SCI-VM library
- Below the user/kernel boundary: SCI messaging NDIS driver, SCI drivers & SISCI API, SCI-VM
- SCI hardware: SMiLE & Dolphin adapters, hardware monitor

SMiLE software layers
[Figure: the same software layer diagram, highlighting the message passing / user-level communication path: message passing applications and test suites run on top of the NT protocol stack, AM 2.0, SS-lib, and CML, which in turn use the SCI messaging NDIS driver and the SCI driver & SISCI API below the user/kernel boundary]

Message passing using HW-DSM
- SMiLE messaging layers
  - Active Messages
  - User-level sockets
  - Common Messaging Layer (CML) for PVM and MPI
- User-level communication
  - Remove the OS from the critical path of sending and receiving messages
  - Mapping parts of the NI into the user's address space
  - Avoid context switches and buffering
- Direct utilization of the HW-DSM
  - Buffered remote writes

Principles of the message engines
[Figure: principle of the message engines. The ring buffer is written by the sender (node A) through its SCI mapping into the receiver (node B). The sender keeps a local end pointer and updates the receiver's end-pointer copy via a mapped remote write; the receiver keeps the start pointer and updates the sender's start-pointer copy the same way, so both sides synchronize using remote writes only]

Implementation Remarks
- Ring buffer setup
  - One pair of ring buffers for each connection
  - Avoiding high-overhead access synchronization
  - On-demand establishment of connections
- Data transfer
  - Only pipelined, buffered remote writes (see the send-side sketch below)
  - Avoiding inefficient, blocking reads
- User-level barriers to avoid IOCTL overhead
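The following sketch outlines the send side of such a connection; all names, sizes, and the setup of the SCI mappings are assumptions for illustration, not the actual SMiLE implementation. Data and the end pointer are pushed to the receiver with remote writes only, and the sender learns how much space is free from the start-pointer copy the receiver writes back.

    /* Illustrative send side of one ring-buffer connection (assumed layout,
       not the real SMiLE code). remote_buf and remote_end point into memory
       that has been mapped from the receiver via SCI. */
    #include <stdint.h>

    #define RING_SIZE 4096u                       /* bytes per ring buffer (assumed)   */

    struct ring_tx {
        volatile uint8_t  *remote_buf;            /* receiver's buffer, SCI-mapped     */
        volatile uint32_t *remote_end;            /* receiver's end-pointer copy       */
        volatile uint32_t  start_copy;            /* written remotely by the receiver  */
        uint32_t           end;                   /* local end pointer (free-running)  */
    };

    /* Returns 1 if the message was written, 0 if the ring is currently full. */
    static int ring_send(struct ring_tx *tx, const uint8_t *msg, uint32_t len) {
        uint32_t used = tx->end - tx->start_copy; /* free-running counters wrap safely */
        if (used + len > RING_SIZE)
            return 0;                             /* no space: caller retries later    */
        for (uint32_t i = 0; i < len; i++)        /* pipelined, buffered remote writes */
            tx->remote_buf[(tx->end + i) % RING_SIZE] = msg[i];
        tx->end += len;
        *tx->remote_end = tx->end;                /* publish the new end pointer       */
        return 1;
    }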

SMiLE software layers
[Figure: the same software layer diagram, highlighting the true shared memory programming path: target applications and test suites use the SPMD-style model or the TreadMarks-compatible API, which build on the SCI-VM library and, below the user/kernel boundary, on the SCI-VM together with the SMiLE driver, the Dolphin IRM driver, and the SCI drivers & SISCI API]

DSM performance monitoring (1)
- Performance aspects of DSM systems:
  - High communication performance via hardware-supported DSM
  - Remote memory access latencies are an order of magnitude higher than local ones
  - Data locality should therefore be enabled or exploited by programming systems or tools


DSM performance monitoring (2)
- Monitoring aspects of DSM systems:
  - Capture information about the dynamic behavior of a parallel program
  - Fine-grain communication
    - Communication might occur implicitly on every read or write
  - Avoid the probe effect that software instrumentation would introduce


SMiLE Monitoring Approach
- Event-driven hybrid monitoring system
  - Network interface with monitoring hardware
  - Delivers information about the runtime and communication behavior to tools for performance analysis and debugging
  - Allows on-line steering
  - Hardware monitor exploits spatial and temporal locality of accesses


The SMiLE Hardware Monitor
[Figure: the SMiLE monitor is attached via a probe to the PCI unit of the PCI-SCI bridge, which sits on the PCI local bus and connects to the SCI network (SCI in / SCI out)]

Features of the SMiLE Monitor
- Dynamic monitoring mode
  - Used on the whole physical address space
  - Creation of global access heuristics
  - Cache-like swap logic to save hardware resources
  - Automatic aggregation of neighboring areas
- Static monitoring mode
  - Used on predefined memory areas
  - Flexible event logic

Monitor’s dynamic mode
[Figure: dynamic mode of the monitor. Each observed memory reference (e.g. mov/add instructions with memory operands) is checked against a small array of tag/counter pairs; on a hit the matching counter is incremented, on a miss an entry is swapped out, cache-like, into a ring buffer in main memory (head/tail pointers), and an interrupt is raised when the ring buffer fills up]
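The tag/counter behaviour can be summarized by the following small software model (an assumption for illustration, not the actual hardware logic): neighboring addresses are aggregated into one tagged area, hits only increment a counter, and misses evict an entry into the main-memory ring buffer.

    /* Software model (illustrative assumption, not the real hardware) of the
       dynamic-mode counter array: a small, fully associative set of tag/counter
       pairs with cache-like eviction into a ring buffer in main memory. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_COUNTERS 5
    #define RING_ENTRIES 1024
    #define AREA_SHIFT   6                /* aggregate addresses into 64-byte areas */

    struct entry { uint64_t tag; uint32_t count; int valid; };

    static struct entry counters[NUM_COUNTERS];
    static struct entry ring[RING_ENTRIES];        /* main-memory ring buffer          */
    static unsigned head, tail;                    /* tail advanced by software (omitted) */

    static void evict(struct entry e) {
        ring[head] = e;                            /* spill the evicted entry          */
        head = (head + 1) % RING_ENTRIES;
        if (head == tail)
            printf("ring buffer full -> interrupt\n");  /* monitor raises an interrupt */
    }

    static void observe(uint64_t phys_addr) {
        uint64_t tag = phys_addr >> AREA_SHIFT;    /* neighboring addresses share a tag */
        int victim = 0;
        for (int i = 0; i < NUM_COUNTERS; i++) {
            if (counters[i].valid && counters[i].tag == tag) {
                counters[i].count++;               /* hit: just count the access        */
                return;
            }
            if (!counters[i].valid)
                victim = i;                        /* prefer an empty slot as victim    */
        }
        if (counters[victim].valid)
            evict(counters[victim]);               /* miss: swap one entry out          */
        counters[victim] = (struct entry){ tag, 1, 1 };
    }

    int main(void) {
        observe(0x1000); observe(0x1004); observe(0x2000);   /* example references */
        printf("area 0x%x: %u accesses\n", 0x1000 >> AREA_SHIFT, counters[0].count);
        return 0;
    }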

Information delivered
- All information acquired is based on SCI packets
  - Physical addresses
  - Source/target IDs
- Information cannot be used directly
  - Physical addresses are inappropriate
  - Back-translation to the source-code level is necessary
- Need for a monitoring infrastructure
  - Access to mapping & symbol information
  - Clean monitoring interface

OMIS: Goals and Design
- Goal: flexible monitoring for distributed systems
  - Specify an interface to be used by tools
  - Decouple tools and the monitoring system
  - Increased portability and availability of tools
- OMIS approach
  - Interface based on the event-action paradigm
    - Events: when should something happen?
    - Actions: what should happen?
  - OMIS provides a default set of events and actions
  - Tools define relations between events and actions (see the sketch below)
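A minimal sketch of this event-action coupling is given below; it is a generic illustration with hypothetical names, not the actual OMIS interface: the tool registers which action to run for an event, and the monitor triggers the matching actions when the event occurs.

    /* Generic illustration of the event-action paradigm (hypothetical names,
       not the OMIS API): tools define relations, the monitor raises events. */
    #include <stdio.h>
    #include <string.h>

    typedef void (*action_fn)(const char *event, void *ctx);

    struct relation { const char *event; action_fn action; void *ctx; };

    #define MAX_RELATIONS 16
    static struct relation relations[MAX_RELATIONS];
    static int num_relations;

    /* Tool side: "whenever <event> happens, execute <action>". */
    static void define_relation(const char *event, action_fn action, void *ctx) {
        relations[num_relations++] = (struct relation){ event, action, ctx };
    }

    /* Monitor side: an event was detected, trigger all matching actions. */
    static void raise_event(const char *event) {
        for (int i = 0; i < num_relations; i++)
            if (strcmp(relations[i].event, event) == 0)
                relations[i].action(event, relations[i].ctx);
    }

    static void count_access(const char *event, void *ctx) {
        (void)event;
        (*(unsigned *)ctx)++;               /* example action: increment a counter */
    }

    int main(void) {
        unsigned remote_accesses = 0;
        define_relation("remote_page_access", count_access, &remote_accesses);
        raise_event("remote_page_access");  /* would be raised by the monitor */
        printf("remote accesses: %u\n", remote_accesses);
        return 0;
    }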

Putting it all together
[Figure: the three building blocks put together: the SMiLE HW-DSM monitor, an extensible monitoring API, and global virtual memory for clusters]

Multi-Layered SMiLE monitoring
[Figure: multi-layered SMiLE monitoring]

- Tools
- OMIS/OCM monitor for DSM systems: OCM core with the OMIS SCI-DSM extension, a programming-model extension, and a programming-environment extension
- Information sources, per layer:
  - High-level programming environment (specific information)
  - Shmem programming model (specific information)
  - SyncMod (statistics of the synchronization mechanisms)
  - SCI-VM (virtual/physical address mappings, statistics)
  - SMiLE PCI-SCI bridge and monitor (physical addresses, node IDs, counters, histograms)
  - Node-local resources (CPU counters, cache statistics, OS information)

Advantages
- Comprehensive DSM monitoring
  - Utilization of information from all components
- Structure of the execution environment is maintained
  - Generic shared memory monitoring
  - Small model-specific extensions
  - Flexibility and extensibility
- Profit from the existing OMIS environment
  - Easy implementation
  - Utilization of the existing rich tool base

Current Status and Future Work
- SCI Virtual Memory
  - Prototype completed
  - Work on the larger infrastructure in progress
- SMiLE Hardware Monitor
  - Prototype is currently being tested
  - Simulation environment available
- OMIS
  - OMIS definition and OCM core completed
  - DSM extension in development


Data locality optimizations
- Using the monitor's static mode
  - Monitoring predefined memory sections
  - Integration of the monitoring concept into the programming model
  - Translation of the application's data structures into physical addresses for the hardware monitor
  - Relating the monitoring results back to the source-code information
  - Evaluation of the network behavior and data locality

Example application
- SPLASH benchmark suite: LU kernel
  - Implements a blocked version of an LU decomposition for dense matrices
  - Solves a system of linear equations
  - Splits the data structure into sub-blocks of 16x16 values (see the layout sketch below)
- LU decomposition
  - Split into phases, one for each block
  - For each phase: analysis of the remote memory accesses
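To make the blocking concrete, here is a minimal sketch (an added illustration assuming a contiguous, block-major layout, not the actual SPLASH-2 code): each 16x16 sub-block occupies one contiguous memory area, which is exactly the kind of predefined region the monitor's static mode can observe.

    /* Illustrative block-major layout of an n x n matrix in 16x16 sub-blocks
       (assumption for illustration, not the SPLASH-2 LU source). */
    #include <stdio.h>
    #include <stdlib.h>

    #define B 16                                   /* block edge, as in the LU kernel */

    /* Flat index of element (i, j) when the 16x16 blocks are stored contiguously. */
    static size_t blocked_index(size_t n, size_t i, size_t j) {
        size_t blocks_per_row = n / B;
        size_t block  = (i / B) * blocks_per_row + (j / B);   /* which block        */
        size_t offset = (i % B) * B + (j % B);                /* position inside it */
        return block * (B * B) + offset;
    }

    int main(void) {
        size_t n = 128;
        double *a = malloc(n * n * sizeof *a);
        a[blocked_index(n, 17, 3)] = 1.0;          /* element (17,3) lies in block (1,0) */
        printf("block (1,0) starts at element %zu\n", blocked_index(n, 16, 0));
        free(a);
        return 0;
    }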

Simulation environment
- Multiprocessor memory system simulator: LIMES
- Shared memory system
  - Local read/write access latency: 1 cycle
  - Remote write latency: 20 cycles
- DSM system with x86 nodes
  - Memory distribution at page granularity


Optimization results
[Figure: two 3D bar charts of the number of remote accesses per block (blocks in i and j dimension, 1-8; up to 70000 accesses) after phase 1 of the LU kernel -- left: unoptimized version, right: optimized version]

Summary
- Parallel Processing Principles
- SMiLE Software Infrastructure
  - Message Passing Communication
  - Shared Memory Programming
- SMiLE Tool Environment
  - Based on Hardware Monitoring
