					Chapters 6 and 8 (selections)
Virtual Memory and Parallel Processing

        CS 271 Computer Architecture
    Indiana University – Purdue University
                 Fort Wayne

The Operating System Machine Level
 This level is also known as the OSM level
 The OSM level consists of . . .
   Conventional ISA level machine language instructions
   Additional OSML instructions
       New conventional machine instructions reserved for use by
        the operating system
       Calls to operating system service routines (API calls)
         • For example – call to support reading a file
 We will focus on three areas
   Virtual memory (Chapter 6)
   Process concept (Chapter 6)
   Parallel computer architectures (Chapter 8)
Virtual memory
 The traditional solution to the problem of not enough
 memory was overlays
    The programmer would break a program into pieces called overlays
    Each overlay was small enough to fit into memory
    The first overlay was brought in
    When done, it was responsible for reading in the next overlay
    The programmer was responsible for all the details
 Virtual memory allows the operating system to use the
 hard disk so that whatever RAM memory is installed appears
 to expand to the size of the address space allowed by the
 processor
Virtual memory
 The virtual address space of a computer is the set of
 addresses that make sense at the conventional machine level
    Typically depends on the number of bits used for addresses
   CPU     Address bits     Virtual address space
   Z80          16          0 – 65,535
   VAX          32          0 – 4,294,967,295

 The physical address space consists of the addresses of the
 RAM memory that is actually installed
   This is typically much smaller than the virtual address space
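
As a quick check on the table above, here is a minimal Python sketch of the arithmetic; the CPU names and bit widths are just the two examples from the table:

```python
# An n-bit address can name 2**n locations, numbered 0 .. 2**n - 1.
for cpu, bits in [("Z80", 16), ("VAX", 32)]:
    highest = (1 << bits) - 1          # largest addressable location
    print(f"{cpu}: {bits}-bit addresses -> 0 .. {highest:,}")
```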
Virtual memory
 With virtual memory, virtual addresses no longer necessarily
 correspond numerically to physical addresses
    Virtual addresses in a program must be mapped to corresponding
    physical addresses dynamically
        This means during run-time
 This requires a memory map
    A table relating a virtual address to the corresponding physical address
 Two common techniques are used
    Paged virtual memory (paging)
    Segmented virtual memory (segmentation)
 We will only consider paging
Paged virtual memory
 The virtual address space is divided into (as many
 as) 2^m pages of fixed size 2^n
   m + n = the number of virtual address bits
 Virtual address
         m bits                n bits

      page number      displacement within page

 Physical memory is logically divided into 2^k page
 frames of the same fixed size 2^n
   Of course, k <= m
 Any page may be loaded into any available page frame
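
For concreteness, a minimal Python sketch of splitting a virtual address into page number and displacement; the 32-bit address and 4 KB page size (n = 12) are assumed example values, not anything mandated by the slides:

```python
# Splitting a virtual address into page number (high m bits) and
# displacement (low n bits). Assumed example: 4 KB pages, so n = 12.
PAGE_OFFSET_BITS = 12                     # n = log2(page size)
PAGE_SIZE = 1 << PAGE_OFFSET_BITS         # 2**n = 4096 bytes

def split_virtual_address(va: int):
    page_number = va >> PAGE_OFFSET_BITS          # p: high m bits
    displacement = va & (PAGE_SIZE - 1)           # d: low n bits
    return page_number, displacement

# Example: address 0x12345 lies in page 0x12 at offset 0x345.
print(split_virtual_address(0x12345))             # -> (18, 837)
```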

Paged virtual memory
 The operating system maintains a page table for
 each process
   This is the memory map
 The page table consists of page table entries
   Called PTEs
   Entry n in the table is the PTE of page n
   The PTE of page n gives the page frame number where
   the page is loaded
   It also contains a present / absent bit
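
A small Python sketch of what a PTE might hold; the field names and the 16-entry table size are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PageTableEntry:
    present: bool = False     # present / absent (residence) bit: is the page in RAM?
    frame: int = 0            # page frame number, meaningful only when present is True

# The page table is indexed by page number: entry n describes page n.
page_table = [PageTableEntry() for _ in range(16)]      # assumed 16 pages
page_table[3] = PageTableEntry(present=True, frame=5)   # page 3 loaded in frame 5
print(page_table[3])
```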

The page table and the present / absent bit
 The present / absent bit indicates whether the page is loaded
    0: not loaded
    1: loaded
 Sometimes called a residence bit or a valid bit
 (Figure: an example page table in which pages 2, 4, 7, 9, 10, 12, 13,
  and 15 are not presently loaded into memory)

Address translation
 Address translation refers to mapping virtual
 addresses to physical addresses
 This is done dynamically by the memory
 management unit (MMU)
 Given a virtual address let . . .
   p be the page number
   d be the displacement within the page

Address translation
 The MMU does the following
   Uses p to index into the page table to fetch the PTE
   If the residence bit is 1, then extract the frame number f
       This is a k-bit number
   The physical address is
          k bits                    n bits

     frame number f          displacement d within page

   If the residence bit is 0, generate a page fault
       This is another type of internal interrupt similar to a trap
       It is not fatal, but just temporarily blocks the process
        (Diagram from: Stallings, Operating Systems: Internals and Design
         Principles, 4th ed., Prentice-Hall, 2001)
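
A hedged Python sketch of the translation steps above; the 4 KB page size, the (present, frame) table layout, and the PageFault exception are illustrative assumptions (a real MMU does this in hardware):

```python
PAGE_OFFSET_BITS = 12                    # n: assumed 4 KB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

class PageFault(Exception):
    """Raised when the referenced page is not resident (present bit = 0)."""

def translate(virtual_address: int, page_table) -> int:
    # page_table[p] is a (present, frame) pair for page p (illustrative layout)
    p = virtual_address >> PAGE_OFFSET_BITS        # page number indexes the table
    d = virtual_address & (PAGE_SIZE - 1)          # displacement within the page
    present, frame = page_table[p]                 # fetch the PTE
    if not present:
        raise PageFault(p)                         # OS page fault handler takes over
    return (frame << PAGE_OFFSET_BITS) | d         # physical address = frame, then d

# Example: page 1 resident in frame 7, page 0 absent.
table = [(False, 0), (True, 7)]
print(hex(translate(0x1234, table)))               # -> 0x7234
# translate(0x0234, table) would raise PageFault(0)
```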
Page fault handler
 The operating system page fault handler does the following
   Locates an empty page frame
   Finds the disk address for the missing page in a table
   maintained by the operating system
   Activates the DMA to copy the needed page from the
   disk into the empty page frame
   Calls the operating system dispatcher routine to switch
   to another process
       This allows the processor to do something useful while the
        DMA is working
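
A runnable Python sketch of these handler steps, using dictionaries and lists as stand-ins for the structures a real OS keeps; every name here (free_frames, disk_map, dma_requests, ready_queue) is an illustrative assumption, not a real OS API:

```python
# Toy stand-ins for OS bookkeeping structures.
free_frames = [3, 6]                       # assumed pool of empty page frames
disk_map = {2: 9100, 7: 9420}              # page number -> disk address (illustrative)
dma_requests = []                          # pretend DMA request queue
ready_queue = ["process B", "process C"]   # other runnable processes

def page_fault_handler(faulting_page: int, faulting_process: str) -> None:
    frame = free_frames.pop()                          # locate an empty page frame
    disk_addr = disk_map[faulting_page]                # look up the page's disk address
    dma_requests.append((disk_addr, frame))            # activate the DMA transfer
    next_process = ready_queue.pop(0)                  # dispatcher: run someone else
    print(f"{faulting_process} blocked on page {faulting_page}; now running {next_process}")

page_fault_handler(7, "process A")
```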

DMA interrupt handler
 When the DMA has completed the transfer, it issues
 an interrupt
 The interrupt handler for the DMA does the following
   Changes the residence bit in the PTE for the page to 1
   Places the new frame number in the PTE frame field
   Schedules the process that caused the page fault for
   later activation
 When the process resumes, it tries address
 translation as before and this time succeeds
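
Continuing the same simulation style, a sketch of the DMA-completion handler; the page_table, blocked, and ready_queue structures are again illustrative stand-ins:

```python
# page_table maps page number -> [residence_bit, frame_number] (illustrative layout).
page_table = {7: [0, None]}                # page 7 currently absent
ready_queue = []
blocked = {"process A": 7}                 # process A is waiting for page 7

def dma_complete_handler(page: int, frame: int, process: str) -> None:
    page_table[page][1] = frame            # place the new frame number in the PTE
    page_table[page][0] = 1                # set the residence bit to 1
    blocked.pop(process, None)
    ready_queue.append(process)            # schedule the faulting process to run again

dma_complete_handler(7, 6, "process A")
print(page_table, ready_queue)             # -> {7: [1, 6]} ['process A']
```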

Paged virtual memory
 Paging is transparent to the OSM level user
   It is implemented at the ISA level
 New ISA hardware must provide an automatic mechanism
 (the MMU hardware) to
    Either translate the m+n bit virtual address to a k+n bit physical
    address
    Or generate a page fault
 Note: An additional memory cycle is required for each
 memory reference in order to fetch the needed PTE
    More hardware is usually added to avoid this extra cycle
         A Translation Lookaside Buffer (TLB) – a high-speed cache for PTEs
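
A minimal sketch of how a TLB avoids the extra page-table fetch on a hit; the dictionary-based cache and the (present, frame) table layout are illustrative assumptions:

```python
PAGE_OFFSET_BITS = 12
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

tlb = {}                                   # tiny cache: page number -> frame number
page_table = [(True, 7), (True, 2)]        # (present, frame) per page, illustrative

def translate_with_tlb(va: int) -> int:
    p, d = va >> PAGE_OFFSET_BITS, va & (PAGE_SIZE - 1)
    if p in tlb:                           # TLB hit: no extra memory cycle needed
        frame = tlb[p]
    else:                                  # TLB miss: fetch the PTE from memory
        present, frame = page_table[p]
        assert present, "an absent page would instead cause a page fault"
        tlb[p] = frame                     # cache the translation for next time
    return (frame << PAGE_OFFSET_BITS) | d

print(hex(translate_with_tlb(0x0abc)))     # miss, PTE fetched, then cached -> 0x7abc
print(hex(translate_with_tlb(0x0def)))     # hit in the TLB                 -> 0x7def
```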
Paged virtual memory
 When a page fault occurs, all page frames are typically full
 To make room for the needed page, one of the currently
 loaded pages must be sent back to the disk
 How the unlucky page is chosen is determined by a page
 replacement policy
 The ideal choice
    Choose the page that will be needed the farthest in the future, if
    at all
 Some page replacement algorithms
    LRU – Least Recently Used
    FIFO – First In First Out

LRU page replacement
 Swap out the Least Recently Used page
    This method performs well
 To implement LRU, time stamp each page frame whenever
 it is referenced
    This requires action on every memory reference
 A practical way to do this is to have a special memory cell
 associated with each page frame
    For every reference, increment a global counter and copy it into
    the special cell of the associated frame
    When a page needs to be replaced, the page fault handler
    searches for the frame with the lowest counter
 This involves overhead and costly hardware
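
A Python sketch of the counter-based LRU scheme just described; the three-frame setup and the reference string are made-up examples:

```python
# Counter-based LRU: every reference bumps a global counter and stamps the
# referenced frame; the victim is the frame with the smallest stamp.
global_counter = 0
last_used = {0: 0, 1: 0, 2: 0}             # frame -> counter value at last reference

def reference(frame: int) -> None:
    global global_counter
    global_counter += 1
    last_used[frame] = global_counter       # time-stamp the frame on every reference

def choose_victim() -> int:
    return min(last_used, key=last_used.get)   # frame with the lowest counter

for f in (0, 1, 2, 0, 2, 0):               # a short made-up reference string
    reference(f)
print(choose_victim())                      # -> 1, the least recently used frame
```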
FIFO page replacement
 Swap out the oldest loaded page
 FIFO could be implemented by maintaining a
 queue of the loaded pages
   When a page is loaded, it is added to the tail of the queue
   When a page fault occurs, the page at the head of the
   queue is replaced
 Implementation is much simpler than LRU
   It requires action only when a page fault occurs
   But how would FIFO work in a grocery store?
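
A Python sketch of FIFO replacement using a queue of loaded pages; collections.deque stands in for the queue the OS would maintain:

```python
from collections import deque

loaded = deque()                           # head = oldest loaded page

def page_loaded(page: int) -> None:
    loaded.append(page)                    # a newly loaded page goes to the tail

def choose_victim() -> int:
    return loaded.popleft()                # the page at the head is replaced

for p in (4, 9, 2):                        # pages loaded in this order
    page_loaded(p)
print(choose_victim())                     # -> 4, the first page that was loaded
```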

Dirty pages
 If a page that is to be replaced has not been modified
 (written), it need not be copied back to disk
    The disk copy is an identical clean copy
 If the page has been modified, the disk copy is out of date (the page is dirty)
    The page in memory must be copied back to disk
    Include an extra dirty bit in the PTE
    Initialize the dirty bit to 0
    Set the bit to 1 whenever there is a write to the page
        This could be implemented by the microcode for memory writes
    After a page fault, this bit determines whether the page needs to
    be copied back to disk
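
A small sketch of how the dirty bit decides what eviction must do; the PTE-as-dictionary layout is an illustrative assumption:

```python
# A toy PTE with a dirty bit; the bit starts at 0 and is set on every write.
pte = {"present": 1, "frame": 3, "dirty": 0}

def memory_write(pte: dict) -> None:
    pte["dirty"] = 1                             # any write to the page sets the bit

def evict(pte: dict) -> None:
    if pte["dirty"]:
        print("write page back to disk (disk copy is out of date)")
    else:
        print("discard page (disk copy is an identical clean copy)")
    pte["present"] = 0

memory_write(pte)
evict(pte)                                       # -> write page back to disk ...
```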
Parallel processing
 A large problem may sometimes be solved by distributing
 the computations over many CPUs that work on the
 problem simultaneously
 The best way to organize the work is to decompose it into
 separate, independent processes
 A process can be thought of as a running program
 together with all of its state information
    A process can be interrupted at any point and resumed later
    Each process runs on only one processor at a time
        At least in the simple case of a process consisting of a single thread
    A process can jump from one processor to another

 Typically, many processes are concurrently active on a computer
 Each process gives the illusion of running on a separate
 OSML computer
 OSML instructions allow processes to . . .
    Share memory and synchronize with one another

 The operating system needs to maintain state information
 for each process
   Allocated address space (memory)
   Pending I/O activity
   Device ownership

Process “states”
 At a given time, a process may be running, ready,
 or blocked

 State transitions
    Running → Ready on a time out
    Running → Blocked on a block (event wait)
    Blocked → Ready on a wake-up (event completion)

Process “states”
 The ready state involves a queue of waiting processes
 A process makes a number of state transitions
 whenever there is a . . .
   Page fault
   I/O request
 A transition to the blocked state is the only
 transition that a process itself initiates
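
A tiny Python sketch of the three-state model, expressing the transitions above as a lookup table; the event names come from the diagram, the rest is illustrative:

```python
# (current_state, event) -> new_state, following the transitions listed above.
TRANSITIONS = {
    ("running", "time out"):           "ready",
    ("running", "block (event wait)"): "blocked",
    ("blocked", "wake-up"):            "ready",
}

def next_state(state: str, event: str) -> str:
    return TRANSITIONS[(state, event)]

print(next_state("running", "time out"))   # -> ready
print(next_state("blocked", "wake-up"))    # -> ready
```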
Concurrent processes
 Asynchronous concurrent processes . . .
   May collaborate on an application
   Need to communicate
   Need to synchronize
 Concurrent processes may run on . . .
   A single shared processor
       Simulated parallel processing
   Separate processors
       True parallel processing

Single processor execution
 Simulated parallel processing on a single processor
 is implemented using time slicing
   A time slice is the maximum amount of time a process may run
   before it is interrupted

Single processor execution
 Each time slice terminates with a timer interrupt
 The interrupt handler . . .
   Saves the state of the interrupted process
   Enqueues the interrupted process in the ready queue
   Dequeues the next process to run from the ready queue
   Loads the state of the new process
   Transfers to the new process
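
A runnable Python sketch of these timer-interrupt steps; process state is reduced to a small dictionary and the ready queue to a deque, purely for illustration:

```python
from collections import deque

# Toy process "state": just a name and a program counter.
ready_queue = deque([{"name": "B", "pc": 200}, {"name": "C", "pc": 300}])
current = {"name": "A", "pc": 100}

def timer_interrupt() -> None:
    global current
    ready_queue.append(current)        # save the interrupted process and enqueue it
    current = ready_queue.popleft()    # dequeue the next process and load its state
    print(f"now running {current['name']} at pc={current['pc']}")

timer_interrupt()                      # -> now running B at pc=200
timer_interrupt()                      # -> now running C at pc=300
```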

Multiple processor execution
 Symmetric multiprocessing (SMP)
   Multiple processors share a common memory
   Each processor is equivalent
 If there are more processes than processors, then
 the CPUs must simulate parallelism with time slicing

   (Figure: CPU1, CPU2, CPU3, and CPU4 connected to a shared memory)

Parallel computer architectures
   High-level decomposition of parallel architectures

(a) On-chip parallelism (b) A coprocessor (c) A multiprocessor
(d) A multicomputer (e) A grid
Homogeneous multiprocessors on a chip

 (a) A dual-pipeline chip (Pentium 4 “hyperthreading”)
    Allows resources (functional units) to be shared
    Does not scale up well
 (b) A chip with two cores
    A core is a complete CPU
Symmetric multiprocessors (SMP)

 (a) A multiprocessor with 16 CPUs sharing a common memory
 (b) An image partitioned into 16 sections, each being
 analyzed by a different CPU

Multicomputers

 (a) A multicomputer with 16 CPUs, each with its own
 private memory
 (b) The bit-map image of Fig. 8-17 split up among the
 16 memories
Taxonomy of parallel computers

UMA symmetric multiprocessor architectures

 (a) Without caching
 (b) With caching
 (c) With caching and private memories

UMA multiprocessors using crossbar switches

       (a) An 8 × 8 crossbar switch
       (b) An open crosspoint
       (c) A closed crosspoint
Message-passing multicomputers

      A generic multicomputer
Interconnection network topologies
                      The heavy dots represent
                      switches (the CPUs and
                      memories are not shown)
                         (a) A star
                         (b) A complete interconnect
                         (c) A tree
                         (d) A ring
                         (e) A grid
                         (f) A double torus
                         (g) A cube
                         (h) A 4D hypercube

Massively parallel processors (MPPs)
 Typical supercomputer
 Use standard CPUs
   Intel Pentium
   Intel Itanium
   Sun UltraSPARC
   IBM PowerPC
 Set apart by a very high-performance proprietary
 interconnection network

BlueGene/L MPP

        The BlueGene/L custom processor chip

 Design goals
     World’s fastest MPP (achieved in 2005)
      Most efficient in terms of teraflops/dollar and teraflops/watt
 65,536 dual-processor nodes configured as a 32 x 32 x 64 3-D torus
  Peak 360 teraflops (sustained 280.6 teraflops)
 1.5 megawatts
 2500 square feet floor space
BlueGene/L MPP

COWs (Cluster of Workstations)
 A cluster consists of dozens, hundreds, or
 thousands of PCs or workstations connected over a
 commercially-available network
  Two dominant types
        Centralized – typically all in one room, connected by a LAN
        Decentralized – machines spread out, connected by a LAN or the Internet
          • Google is a well-known example

Software metrics

 Real programs achieve less than the perfect speedup
 indicated by the dotted line
 Data from a multicomputer consisting of 64 Pentium Pro
  CPUs
