Distributed Shared Memory and Cache Coherence
Operating System Assignment #1, dated 11-11-2008

• Small processor count
– SMP machines: a single shared memory with multiple processors interconnected by a bus

• Large processor count
– Distributed Shared Memory machines: largely message-passing architectures

Programming Concerns
• Message passing
– Memory accesses involve send/request packets
– High communication costs

• Shared memory model
– Ease of programming
– But not very scalable

• Scalable and easy to program?

Distributed Shared Memory
• Physically distributed memory
• Implemented with a single shared address space
• Also known as NUMA machines, since memory access times are non-uniform
– Local access times < remote access times

DSM and Memory access
• Big difference between accessing local and remote data
• Large differences make it difficult to hide latency
• How about caching?
– In short, it is difficult
– It requires cache coherence

Cache Coherence
• Different processors may access values at the same memory location
– How do we ensure data integrity at all times?
• An update by a processor at time t must be visible to other processors at time t+1

• Two main approaches
– Snoopy protocol
– Directory-based protocol

Snoopy Coherence Protocols
• Transparent to the user
• Easy to implement
• For a read
– Data is fetched from another cache or from memory

• For a write
– All copies in other caches are invalidated
– Delayed or immediate write-back

• The Bus plays an important role


But it does not scale!
• Not feasible for machines with memory distributed across a large number of nodes
• The broadcast-on-bus approach is bad
• It leads to bus saturation
• Processor cycles are wasted snooping every cache in the system

Directory-Based Cache Coherence
• A directory tracks which processors have cached a block of memory
• The directory contains information for all cache blocks in the system
• Each cache block can be in one of three states
– Invalid
– Shared
– Exclusive

• To enter the exclusive state, all other cached copies of the same memory location are invalidated
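A single directory entry and its three states can be sketched as follows (a hypothetical model; the class and method names are invented, and the write-back an exclusive owner would perform on a downgrade is omitted):

```python
class DirectoryEntry:
    """Tracks one cache block: its state and which processors hold a copy."""
    def __init__(self):
        self.state = "Invalid"
        self.sharers = set()      # processors with a cached copy

    def read(self, proc):
        # A read adds the reader and leaves the block Shared.
        if self.state == "Exclusive":
            self.sharers = set()  # owner would write back first (omitted)
        self.sharers.add(proc)
        self.state = "Shared"

    def write(self, proc):
        # To enter Exclusive, every other cached copy must be invalidated.
        invalidated = self.sharers - {proc}
        self.sharers = {proc}
        self.state = "Exclusive"
        return invalidated        # invalidation messages go to these nodes

# Usage: two processors read the block, then processor 0 writes it.
entry = DirectoryEntry()
entry.read(0); entry.read(1)
assert entry.state == "Shared" and entry.sharers == {0, 1}
victims = entry.write(0)
assert entry.state == "Exclusive" and victims == {1}
```

Note that invalidations go only to the recorded sharers, not to every cache, which is why the directory avoids the bus broadcast of the snoopy scheme.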

Original form not popular
• Compared to snoopy protocols
– Directory systems avoid broadcasting on the bus

• But requests are served by one directory server
– May saturate the directory server

• Still not scalable
• How about distributing the directory?
– Load balancing
– Hierarchical model?

Distributed Directory Protocol
• Involves sending messages among 3 node types
– Local node
• Requesting processor node

– Home node
• Node containing memory location

– Remote node
• Node containing cache block in exclusive state

3 Scenarios
• Scenario 1
– Local node sends a request to the home node
– Home node sends the data back to the local node

• Scenario 2
– Local node sends a request to the home node
– Home node redirects the request to the remote node
– Remote node sends the data back to the local node

• Scenario 3
– Local node sends a request for exclusive state
– Home node redirects invalidation requests to the other remote nodes
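The three scenarios above can be sketched as message traces. This is a hypothetical model: the function names and the `directory` dictionary (mapping an address to the node holding it in exclusive state) are invented; only the local/home/remote roles come from the slides.

```python
def read_request(directory, local, home, addr):
    """Scenarios 1 and 2: a read arrives at the home node."""
    trace = [(local, home, "read-request")]
    owner = directory.get(addr)   # node holding the block exclusively, if any
    if owner is None:
        # Scenario 1: home node has a clean copy and answers directly.
        trace.append((home, local, "data"))
    else:
        # Scenario 2: home redirects to the remote (exclusive) node.
        trace.append((home, owner, "forwarded-request"))
        trace.append((owner, local, "data"))
    return trace

def exclusive_request(directory, sharers, local, home, addr):
    """Scenario 3: local node asks for exclusive state."""
    trace = [(local, home, "exclusive-request")]
    for node in sharers - {local}:
        trace.append((home, node, "invalidate"))  # invalidate other copies
    directory[addr] = local       # local node becomes the exclusive owner
    return trace

# Scenario 1: the block is not cached exclusively anywhere.
assert read_request({}, "L", "H", 0x10) == [
    ("L", "H", "read-request"), ("H", "L", "data")]

# Scenario 2: the block is held exclusively by remote node "R".
assert read_request({0x10: "R"}, "L", "H", 0x10)[1:] == [
    ("H", "R", "forwarded-request"), ("R", "L", "data")]

# Scenario 3: node "A" shares the block and must be invalidated.
d = {}
trace = exclusive_request(d, sharers={"A", "L"}, local="L", home="H", addr=0x10)
assert ("H", "A", "invalidate") in trace and d[0x10] == "L"
```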


Stanford DASH Multiprocessor
• The first operational multiprocessor to support a scalable coherence protocol
• Demonstrates that scalability and cache coherence are not incompatible
• Two hypotheses
– Shared-memory machines are easier to program
– Cache coherence is vital

Past experience
• From experience
– Memory access times differ widely between physical locations
– Latency and bandwidth are important for shared-memory systems
– Caching helps amortize the cost of memory access in a memory hierarchy

DASH Multiprocessor
• Relaxed memory consistency model
• Observation
– Most programs use explicit synchronization
– Sequential consistency is not necessary
– Allows the system to perform writes without waiting until all invalidations are complete

• Offers advantages in hiding memory latency
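The observation behind the relaxed model can be illustrated in software. In this sketch (plain Python `threading`, not DASH's hardware mechanism), the program's correctness rests entirely on an explicit synchronization point, so it would remain correct even if individual writes became visible out of order, which is exactly the freedom a relaxed consistency model exploits:

```python
import threading

data = 0
ready = threading.Event()   # the explicit synchronization point

def producer():
    global data
    data = 42               # this write may be buffered or reordered...
    ready.set()             # ...but the sync point orders it for consumers

def consumer(results):
    ready.wait()            # never read `data` before the sync point
    results.append(data)

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t2.start(); t1.start()
t1.join(); t2.join()
assert results == [42]      # correct regardless of write completion order
```

Because the consumer only reads after the synchronization point, the hardware is free to retire the write to `data` without first waiting for every invalidation to complete.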

DASH Multiprocessor
• Non-Binding software prefetch
– Prefetches data into the cache
– Maintains coherence
– Transparent to the user
• The compiler can issue such instructions to improve runtime performance
– If the data is invalidated, it is refreshed when next accessed

• Helps to hide latency as well
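"Non-binding" means the prefetch is only a hint: if the line is invalidated before use, the real access re-fetches coherent data rather than using a stale copy. A minimal sketch, assuming a hypothetical `CoherentCache` class (the names and methods are invented for illustration):

```python
class CoherentCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def prefetch(self, addr):
        # Non-binding: fills the cache early but does not pin the value.
        self.lines[addr] = self.memory[addr]

    def invalidate(self, addr):
        # The coherence protocol may remove the line at any time.
        self.lines.pop(addr, None)

    def load(self, addr):
        # Hit if the prefetched line survived; otherwise a normal miss.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

# Usage: a prefetch hides latency, yet a later invalidation cannot
# cause a stale read.
memory = {8: 1}
cache = CoherentCache(memory)
cache.prefetch(8)            # fetch early to hide latency
memory[8] = 2                # another processor writes the location...
cache.invalidate(8)          # ...and the protocol invalidates our copy
assert cache.load(8) == 2    # the access refreshes the data, staying coherent
```

If no invalidation arrives, the `load` hits in the cache and the prefetch has fully hidden the remote access latency.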

DASH Multiprocessor
• Remote Access Cache
– Remote accesses are combined and buffered within individual nodes
– Can be likened to a 2-level cache hierarchy

• High performance requires careful planning of remote data accesses
• Scaling applications depends on other factors
– Load balancing
– Limited parallelism
– Difficulty of scaling an application to use more processors

• Programming model?
– Model that helps programmers reason about code rather than fine-tuning for a specific machine

• Fault tolerance and recovery?
– More computers = Higher chance of failure

• Increasing latency?
– More hierarchy levels = a larger variety of latencies

Callisto
• Previously, networking gateways
– Handle a diverse set of services
– Handle 1000s of channels
– Complex designs involving many chips
– High power requirements

• Callisto is a gateway on a chip
– Used to implement communication gateways for different networks

In a nutshell
• Integrates DSPs, CPUs, RAM, and I/O channels on a chip
• Programmable multi-service platform
• Handles 60 to 240 channels per chip
• An array of Callisto chips can fit in a small space
– Power efficient
– Handles a large number of channels

Shared By:
Shah Muhammad Butt, IT professional