A Scalable Implementation of Virtual Memory HAT Layer
for Shared Memory Multiprocessor Machines

Balan, Kurt Gollhardt
UNIX System Laboratories
This paper describes the design and implementation of the UNIX® SVR4.2 Virtual Memory (VM) Hardware Address Translation (HAT) layer, which can be used as a model for other multiprocessor (MP) platforms in terms of scalability and MP related interfaces between the HAT layer and the machine independent layer. SVR4.2 is an SVR4.1 ES based kernel that supports shared memory multiprocessors and light weight processes in a shared address space. By implementing a fine-grained locking mechanism, a lazy Translation Lookaside Buffer (TLB) shootdown evaluation policy and other improvements over the SVR4 design, the memory management feature is made scalable in terms of the number of processors as well as the size of memory supported. By providing a small set of interfaces between the machine dependent and independent layers for TLB consistency and a simple set of locking requirements between the two layers, SVR4.2 facilitates the portability of the memory management feature to other multiprocessor platforms.

Introduction

A scalable and portable HAT layer that supports multiprocessors and multiple threads in an address space is described in this paper. The scalability of the implementation is primarily due to three reasons:
  • The TLB shootdown policy and algorithms.
  • A fine-grained locking scheme that allows memory management as a whole to be scalable with respect to the number of processors.
  • Design to support large physical memory configurations.

A small set of well defined MP related HAT interfaces is introduced for use by other layers of the kernel. The purpose of these HAT functions is to maintain TLB consistency in a multiprocessor environment. SVR4.2 does not assume hardware support for TLB consistency¹, and the support is provided by the HAT layer.

The HAT layer is the Memory Management Unit (MMU) dependent part of the memory management facility in SVR4.0 UNIX implementations. Other UNIX Virtual Memory implementations also usually contain such a machine dependent layer. In SVR4.2 (a derivative of SVR4.1 that provides support for multiprocessors and light weight processes), all but a small portion of the rest of the VM subsystem is machine independent.

Traditionally, most of the porting effort is spent on implementing the HAT layer when porting the SVR4 memory management feature to various architectures. This effort is much more complex in a multiprocessor environment. Also, scalability issues are typically not emphasized during porting efforts. By providing a well defined set of interfaces and a simple locking protocol, the porting effort becomes routine without any loss in the performance of the system.

Related Work

Previous work on providing a general interface for the hardware dependent layer of VM includes the MACH pmap layer [1] and the original SVR4 HAT layer interfaces that were derived from SunOS [2]. The TLB shootdown policy implemented in SVR4.2 is similar to the MACH policy [3]; however, kernel address space shootdowns are handled differently from the user address space. Several solutions to TLB consistency, with and without assuming hardware cache consistency, have been discussed in various papers [4]. An implementation of TLB synchronization that uses a particular TLB format (TLB ID entry) has been described in [5]. The SVR4.2 implementation does not expect the TLB to contain any fields, such as a TLB ID, other than a subset of the fields in the page table entry. However, there are two areas in which the SVR4.2 implementation of TLB shootdowns is machine dependent: one is in clearing the page table entries of other processors, which is dependent on the MMU structure, and the other is in sending inter-processor interrupts for synchronization of the processors whose TLBs are being shot down.

¹However, cache coherence is assumed to be supported by the hardware.

Summer '92 USENIX - June 8-June 12, 1992 - San Antonio, TX                                              107

Background

The SVR4.0 HAT data structures were retained for SVR4.2 because the data structures efficiently support large, sparse address spaces in terms of space and time. The principal factor behind this efficiency is the mapping chunk data structure. A mapping chunk is used to keep track of all virtual mappings to a physical page. Each page table entry has a corresponding mapping chunk entry, and each physical page has a linked list of mapping chunk entries that denotes the virtual translations to the page (the mapping chain). The size of a mapping chunk is much smaller than that of a page table². Non-active translations do not have an entry in the mapping chunk. For this reason, sparsely populated page tables waste very little space in providing the mapping chain. When operating on a large address range³, all the page table chunks that do not have a corresponding mapping chunk are skipped, and no time is spent looking at the non-existent page table entries.

²In the i386 implementation, the size of a mapping chunk is 1/32nd of the page table size.
³Such as unloading an address range or changing …

The Uniprocessor (UP) interfaces from the SVR4.0 HAT layer have also been retained. These interfaces have been found to be sufficient in supporting the different architectures that SVR4 has been ported to so far (including Intel386, SPARC®, Motorola 88000 and MIPS). The most frequently executed UP HAT functionalities in SVR4 were to load a translation to a given page (hat_memload()), unload translations for a range of addresses (hat_unload()) and to unload all translations to a given physical page (hat_pageunload()).

The reference port for SVR4.2 is on an Intel386/486 architecture, and thus the initial HAT implementation is targeted for the Intel386 MMU. The following is a list of its features that are of interest:
  • The Intel386 MMU uses a two level page table structure to define an address space. When references to the page table entries are denoted as level 1 entries or level 2 entries in the sections below, they are in regard to this structure.
  • Level 1 is the page table directory, consisting of 1024 entries, each of which points to a page table. This page table is referred to as the level 2 page table.
  • A level 2 page table consists of 1024 entries, each of which points to a physical page.
  • The physical page size is 4096 bytes.
  • The modify and reference bits are in the page table entry and are updated by the hardware.
  • The i386 also provides an interlocking facility when accessing the reference and modify bits; i.e., no other accesses to the page table entry are possible while the hardware is changing these bits for that entry.

The i386 architecture can support 4 Gigabytes of virtual address space. In many RISC architectures (such as MIPS®), the modify and reference bits are simulated in software and thus, unlike the Intel386 implementation, TLB shootdowns are not required when these bits are modified.

Multiprocessor Interfaces

Most MMUs implement a simple cache known as a Translation Lookaside Buffer for caching virtual to physical translations to avoid real memory accesses. In a multiprocessor environment the same virtual address can reside in multiple TLBs, and the coherence of these translations needs to be maintained between the TLBs. In SVR4.2, all the exported MP related HAT interfaces are used for maintaining TLB consistency. The number of active CPUs in the system for the kernel address space, and the set of CPUs a user address space (execution entity: a process, consisting of one or more light weight processes) is associated with, are recorded in order to do selective TLB flushes. This is referred to as TLB accounting. The HAT layer records the TLB accounting in a HAT data structure that is associated with each address space, including the kernel address space.

All online CPUs in the system can execute in the context of the kernel address space (kas). The kernel address space HAT accounting structure records the current set of online CPUs. Two HAT functions are provided for establishing this accounting when bringing CPUs online or offline. These functions are used in accounting which processors' TLBs will be flushed for the kernel address space.
  • hat_online(): Called when onlining an engine (CPU) in the system. Sets the active cpu count field and the processor's bit in the kas HAT structure. It also flushes the engine's TLB.
  • hat_offline(): Called when taking an engine (CPU) offline. Clears the processor's bit set in hat_online() and decrements the count of active cpus in the kas HAT structure.

The processor accounting used for shooting down TLBs of user level address spaces is done at context switching time. Since threads within an address space can be running at the same time on different CPUs, the CPUs that are executing in the context of the same address space must be known to perform selective TLB flushes. The following interfaces are used when scheduling a light weight process (LWP) on any CPU in the system.
  • hat_asload(as): Called when context switching to a new LWP. It adds the processor to the active engine (processor) accounting in the HAT structure of this address space (as) and loads this address space into the MMU (just the level 1 page table entries on the i386 architecture).
  • hat_asunload(as, flag): Called when context switching out a LWP. It unloads the MMU mappings for this process (again, just the level 1 translations on the i386) and takes the engine out of the active engine accounting of the HAT structure. The flag parameter indicates whether the caller wants a TLB flush to be done by this function after unloading the mappings⁴. Except for the CPU accounting, the rest of the functionality needs to be done on a UP platform as well⁵. Note that there is no need to call hat_asunload() if the context switch is to select a LWP in the same process.
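The accounting that these four interfaces maintain can be sketched in C roughly as follows. This is an illustrative sketch only, not the SVR4.2 source: the type and field names (hat_t, hat_cpus, hat_ncpu), the bitmask representation and the NCPU_MAX limit are all assumptions made for this sketch.

```c
#include <assert.h>

/* Per-address-space TLB accounting sketch: each HAT structure records
 * which engines (CPUs) are executing in its address space, so a
 * shootdown needs to interrupt only those CPUs. */

#define NCPU_MAX 32

typedef struct hat {
    unsigned long hat_cpus;   /* bitmask of active engines          */
    int           hat_ncpu;   /* count of active engines            */
} hat_t;

static hat_t kas_hat;         /* kernel address space (kas) accounting */

/* hat_online(): engine joins the kernel address space accounting. */
void hat_online(int cpu)
{
    kas_hat.hat_cpus |= 1UL << cpu;
    kas_hat.hat_ncpu++;
    /* the real interface also flushes the engine's TLB here */
}

void hat_offline(int cpu)
{
    kas_hat.hat_cpus &= ~(1UL << cpu);
    kas_hat.hat_ncpu--;
}

/* hat_asload()/hat_asunload(): user address space accounting done at
 * context switch time, enabling selective user-space shootdowns. */
void hat_asload(hat_t *as_hat, int cpu)
{
    as_hat->hat_cpus |= 1UL << cpu;
    as_hat->hat_ncpu++;
    /* the real interface also loads the level 1 entries on the i386 */
}

void hat_asunload(hat_t *as_hat, int cpu, int flush)
{
    as_hat->hat_cpus &= ~(1UL << cpu);
    as_hat->hat_ncpu--;
    (void)flush;   /* caller decides whether a TLB flush follows */
}
```

With bookkeeping of this shape, a shootdown for a user address space can interrupt exactly the CPUs whose bits are set in that space's mask, rather than every CPU in the system.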


The following are the HAT interfaces for lazy shootdown of TLBs, used only on the kernel address space by the kernel segment drivers. To implement lazy TLB shootdowns (details of which are explained later), an object opaque to all other layers of VM except the HAT layer, called a cookie, is maintained. The cookie reflects the age of virtual translations with respect to the TLB. In the i386 HAT implementation the cookie is a timestamp, but it could be a counter of some sort in other implementations. The state of the TLB the HAT records is the timestamp of the last TLB flush. The state of a virtual address will be explained in section 6.1.1. The following are the interfaces:
  • hat_getshootcookie(): Returns an opaque value that indicates the "age" of a TLB, which is used for lazy shootdown.
  • hat_shootdown(cookie_t cookie, u_int flag): TLB shootdown routine for the kernel address space. If any of the active CPUs in the system has an older cookie than the passed-in cookie, then the TLBs of these CPUs will be flushed. The flag argument is used by clients which do not use lazy shootdown⁶, so that all the CPUs in the system are flushed regardless of the cookie passed in.

The TLB shootdown interfaces for user level address spaces are not seen outside the HAT layer - the shootdown is immediate and is done during one of the following HAT operations: unloading a translation, changing the protection of a page (only in the case of restricting permissions), remapping a virtual address to a different physical address, and clearing the modify bit of a page table entry (architecture dependent).

Scalability Solutions

This section discusses some of the features that make the SVR4.2 implementation scalable.

TLB Shootdowns

On platforms that do not support TLB consistency in hardware, a multiprocessor kernel needs to maintain the consistency of translations that are cached in several processors' TLBs. The TLB is a common feature of present day architectures since it avoids memory accesses (two in the case of the i386 architecture) in translating a virtual address to a physical address if the address is present in the TLB cache. The shootdown algorithms depend on the existence of a hardware facility to issue cross-processor interrupts. TLBs are fully flushed⁷ as opposed to flushing single TLB lines [5]. Since the shootdown algorithms are MMU architecture dependent, they are part of the HAT layer in SVR4.2.

A lazy shootdown policy is used for the kernel address space, whereas immediate shootdowns are employed for the user address space. Since kernel virtual address usage is in the control of the kernel, a lazy evaluation of the inconsistent TLB states can be done. However, for a user address space, multiple TLBs need to be brought to a consistent state immediately, since SVR4.2 supports multiple LWPs in an address space which can concurrently execute on multiple CPUs.

Lazy Shootdowns

A lazy evaluation policy is very important for the kernel address space. When a kernel virtual address translation is unloaded, all processors' TLBs in the system need to be brought to a consistent state, because all processors in the system share the kernel address space (in a symmetric multiprocessor architecture). Delaying shootdowns may avoid doing the shootdowns entirely, since the TLBs might already have been flushed by the time the evaluation is done (due to a context switch, for example).

The kernel segment drivers essentially determine the laziness of a shootdown in the kernel address space. The two major users of this policy in SVR4.2 are the segkmem driver, which manages permanently resident kernel memory, and the segmap driver, which manages transient file mappings used by file system read and write system calls.

⁴The current context switch implementation does not request hat_asunload() to flush the TLB. This is because it has to flush the TLB after copying the page table entries for the U area of the new process it is loading. Thus it forgoes the TLB flush after unloading the mappings of the old process.
⁵The TLB flush is not necessary on some architectures whose MMUs (such as SPARC and MIPS) provide the context number as part of every TLB entry, and on those architectures where TLBs are flushed on each context switch.
⁶There is no such client in SVR4.2.
⁷The i386 does not support single line TLB flushes except through the use of an unsupported multi-instruction sequence.
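The cookie bookkeeping behind hat_getshootcookie() and hat_shootdown() can be sketched as below. This is illustrative only: the real routines also take the global spin lock and deliver cross-processor interrupts, and the variable names (lbolt stand-in, tlb_cookie, least_recent_flush) are invented for the sketch.

```c
#include <assert.h>

/* Lazy-shootdown cookie sketch: cookies are timestamps.  A TLB whose
 * last-flush cookie is newer than the caller's cookie needs no flush. */

#define NCPU 4

static long lbolt;               /* clock tick, stands in for the timestamp  */
static long tlb_cookie[NCPU];    /* when each CPU's TLB was last flushed     */
static long least_recent_flush;  /* cookie of the least recently flushed TLB */

typedef long cookie_t;

cookie_t hat_getshootcookie(void)
{
    return lbolt;                /* "age" of translations unloaded now */
}

/* Returns the number of CPUs whose TLB had to be flushed. */
int hat_shootdown(cookie_t cookie)
{
    int  cpu, nflush = 0;
    long min;

    if (cookie < least_recent_flush)
        return 0;                /* every TLB flushed since cookie was taken */

    min = ++lbolt;               /* recompute least-recently-flushed value */
    for (cpu = 0; cpu < NCPU; cpu++) {
        if (tlb_cookie[cpu] <= cookie) {
            tlb_cookie[cpu] = lbolt;  /* flushed via cross-CPU interrupt */
            nflush++;
        }
        if (tlb_cookie[cpu] < min)
            min = tlb_cookie[cpu];
    }
    least_recent_flush = min;
    return nflush;
}
```

The early-return test is what makes the policy pay off: if every TLB has been flushed for any reason since the cookie was taken (a context switch, an earlier shootdown), the whole operation costs one comparison.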


When a kernel virtual address is freed by a kernel thread, that address would typically need to be flushed from all the TLBs in the system. But the SVR4.2 segkmem driver delays this shootdown until the address is about to be reused by the kernel. The virtual space managed by the segkmem driver is represented as a bitmap, and the bitmap itself is divided into zones (the size of the zone is a tuneable; the default value is 16 bytes). Each zone has associated with it a cookie (explained in the previous section), which is set when an address in the zone is freed. At allocation time, when it is found that a page is allocated from a freed zone whose addresses have still not been flushed from the TLBs, hat_shootdown() is called with the cookie associated with the zone as an argument. What hat_shootdown() does with this cookie will be explained shortly.

Similarly, segmap manages its virtual space in fixed-size chunks (configured as 8K by default), and each chunk has an associated cookie. Unlike segkmem, however, when a segmap chunk is freed (the last reference is released), the cookie for the chunk is set through the hat_getshootcookie() interface but the translations are not unloaded. Instead, the chunk is linked onto a list; the segmap aging daemon periodically looks at this list and unloads the translations at that time, but does not perform a shootdown of the unloaded addresses. When the chunk is then reused by segmap, it calls hat_shootdown() with the associated cookie. The shootdown can be delayed after the unloading since no other context can access this file page in the meantime. This technique allows us to eliminate the shootdown entirely if the chunk is reused with the same identity (same physical pages) before it is unloaded.

Lazy Shootdown Algorithm

Inside the HAT layer, a cookie is associated with each processor that denotes when the processor's TLB was last flushed. In a separate global variable, the cookie of the least recently flushed TLB is maintained. If the cookie passed in to hat_shootdown() is older than this value, it returns immediately, since it knows that all the TLBs in the system have been flushed since the cookie was acquired. If this is not the case, the following steps are executed by the initiator (the context that is initiating the shootdown):
  1. Acquires a global spin lock. This spin lock disallows the active processor set of the system from changing underneath. It also serializes lazy shootdowns in order to set the cookie for each processor.
  2. Scans the list of all processors that have been hat_online()'ed (see the interface definitions in the last section) and selects all the processors whose cookie is "older" than the passed in cookie. While selecting the processors to be interrupted, it recomputes the least recently flushed value and sets the cookie for each processor it selects (to lbolt in our implementation).
  3. Sends cross processor interrupts to the processors.
  4. Unlocks the global spin lock it acquired earlier once all the responders have begun processing the interrupt.

The responders (the processors on the receiving end of these interrupts) then flush their own TLBs before again becoming active. The cross processor interrupt executes at the highest interrupt priority level (ipl) in the system, because no interrupts can be allowed while servicing a shootdown; otherwise, a deadlock could result if an interrupt level routine caused a shootdown itself. This interrupt level is even higher than the normal "block-all" interrupt level (splhi) to avoid latency problems; we are careful to avoid changing anything in the cross-processor interrupt service routines which could interfere with splhi-protected critical regions. Note that the responders do not wait for any synchronization with the initiator in this algorithm. All they have to do is a TLB flush, since the translations have been modified earlier by the segment drivers. The initiator does not wait for all the responders to complete their operation.

Immediate Shootdowns

The interfaces for immediate shootdowns employed for the user address space are hidden in the HAT layer and are not exported to other layers in VM. This is because immediate shootdowns are caused only by operations within the HAT layer, such as unloading a translation and changing protections for a translation.

Immediate Shootdown Algorithm

The algorithm for immediate shootdown is similar to the lazy shootdown algorithm. The following steps are executed by the initiator:
  1. Grab the same global spin lock that we acquire in the lazy algorithm, for the same reason (to keep anybody else from changing the active processor set or performing another shootdown).
  2. Send cross-processor interrupts to all the processors that share this address space (the processor list that is updated by hat_asload() and hat_asunload()). Unlike the lazy algorithm, the responders spin waiting on synchronization with the initiator.
  3. Modify the page table entries (level 2 entries) as appropriate for the operation (zero the page table entries if unloading translations, change the protection, or clear the modify bit if syncing the page table entry to the page structure).
  4. Increment the counter that the responders are spinning on. The responders perform a TLB flush and return from the interrupt.
  5. Perform a TLB flush for the initiator's processor.
  6. Unlock the global spin lock acquired earlier. The initiator again - as in the lazy case - does not wait for the responders to finish flushing their TLBs.
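The initiator/responder handshake in these steps might be sketched like this. It is a simplified, single-threaded illustration: the real code runs the responder side in a cross-processor interrupt at the highest ipl, holds the global spin lock, and uses atomic operations, all of which this sketch omits.

```c
#include <assert.h>

/* Immediate-shootdown handshake sketch: responders spin on a counter
 * until the initiator has finished editing the page tables. */

static volatile int shoot_barrier;   /* counter the responders spin on  */
static int tlb_flushes;              /* stand-in for actual TLB flushes */

static void tlb_flush_local(void) { tlb_flushes++; }

/* Stand-in for zeroing/adjusting the level 2 page table entries. */
static void demo_clear_ptes(void) { }

/* Responder side, run from the cross-processor interrupt: spin until
 * the initiator releases the barrier, then flush and return. */
void shoot_responder(int barrier_seen)
{
    while (shoot_barrier == barrier_seen)
        ;                            /* initiator still modifying PTEs */
    tlb_flush_local();
}

/* Initiator side, after interrupting the CPUs sharing this space. */
void shoot_initiator(void (*modify_ptes)(void))
{
    modify_ptes();                   /* step 3: edit level 2 entries      */
    shoot_barrier++;                 /* step 4: release spinning responders */
    tlb_flush_local();               /* step 5: initiator's own flush     */
    /* step 6: drop the global spin lock; no wait for the responders */
}
```

Because the initiator only increments the barrier and moves on, it never waits for the responders' flushes to complete, matching the note in step 6.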


This algorithm has been optimized for the i386 architecture for the case when the initiator has to modify a large range of page table entries (for example, when unloading a large range of addresses). The initiator holds the HAT resource lock (a spin lock) that is associated with the address space being modified at the outset of the algorithm. After the responders are in a spinning state, instead of changing all the page table entries the initiator just unloads the level 1 entries for the affected page tables. Thus, the initiator spends less time while all other processors are spinning. The initiator then increments the counter that releases the responders from spinning on the barrier. The responders then flush their TLBs before returning from the interrupt. If any of the LWPs running on the responders tries to access the inconsistent page table entry, it will take a fault because of the non-existent level 1 entry. The trap code will then try to acquire the HAT resource lock and will block until the initiator releases the HAT lock. This reduces the time processors spin uselessly in the shootdown algorithm.

The implementation of local working set aging for pageout in SVR4.2 also prevents shootdowns when compared to the global pageout policy in SVR4. The global pageout daemon scans all the physical pages in the system and clears the modify bit if the bit is set for a page (after calling …

  … totally avoided by this policy (see the section "Performance Data"), and thus result in very little overhead when compared to flushing individual entries.
  • The immediate shootdown interfaces (both the initiator and the responder) would change to take the address range as an argument and flush just those entries. Since these interfaces are not exported to other VM layers, changing the interfaces is acceptable.
  • There may be a point in such architectures where flushing the whole TLB is cheaper when the number of lines to be flushed in the TLB is too large. The algorithms should take this into account when deciding which is more efficient.

For architectures whose TLB entries contain a field for the context number:
  • It is unnecessary to flush the TLB on context switches.
  • The lazy shootdown algorithm would not change. If no other local TLB flushes are done by the kernel⁸, all the cookies associated with the processors would be in the same state and only one cookie would be needed.
  • For the user address space, a lazy shootdown algorithm may be possible, as implemented in [5].
  • No changes to the interfaces are necessary.

Locking Design

The locking design implemented for the VM subsystem as a whole should scale well for parallel activities (intra-process and inter-process) that occur on the system. The primary motive in arriving at the current locking model was to keep things simple and not to make the locking requirements between the VM layers (the page layer, the segment layer and the HAT layer) too complex. As a result, porting of this HAT layer to other architectures should be almost as
VOP_PUTPAGEQ the page) or clearsthe refer--
                       on                                   straightforward a Uniprocessor
                                                                             as                HAT layer.
ence bit if it is set. Both of these actionswould
                                                                 The principallocksin the VM layerare:
requireshootdowns       (sincethesebits are in the page       r PageLayer
table entry). But with the working set aging,the pro,
                                                                   O A global spin lock in the pagelayer for
cess to be øged is seized;i.e. all the LWPs in the                    protectingthe pagehashchains
processexcept the current context are brought to a
                                                                   o A per page spin lock for mutexingthe
quiescent   state.Thus, thereis no needto shootdown
                                                                      fields of the pagestructure
when modifying the page table enrries. The i386
                                                                   o A read/write sleep lock which is
context switch code flushesthe TLBs when switch-
                                                                      acquiredin readermode to ensurethat
ing back in theseLWPs.
                                                                      the page state, identity and data are
OtherArchitectures                                                    valid and remain so and acquiredin
       The abovementioned      interfacesand algorithms               writer mode if modifying any of the
provide flexibility in supporting various architec-                   above.
tures,requiring   minimalchanges them.
                                     to                       . Segment     Layer (usersegment driver)
      Architectures supporting single TLB entry                    O A reader/writer   lock per segment. This
_ -                                                                   lock is acquiredin writer mode when
   o The lazy shootdown algorithm need not                            changing the attributes of a segment
      change at all. Even though the algorithm
      flushesthe whole TLB, most shootdowns         are     ffi+.2.

Summer '92 USENIX - June 8-June 12, 1992 - San Antonio, TX                 111

       o A reader/writer lock per segment. This lock is acquired in writer
         mode when changing the attributes of a segment (such as protection)
         and in reader mode when the attributes of the segment are to remain
         valid for the duration of the operation.
       o A per segment spin lock which guards the sleep lock.
   o HAT Layer
       o There is only one spin lock associated with each address space for
         guarding the HAT resources.

     Making the HAT lock finer grained by moving it to the page table level was
considered, but it was decided that this wouldn't be much of a gain, for the
following reasons: most UNIX processes fit in one page table, and there would
be extra locking round trips for HAT functions that cross page tables. If found
necessary, other ports can move this lock to the page table level (on
architectures where the page table size is small) without any need to change
the locking requirements.

Analysis of the Locking design

     Two widely occurring system events in UNIX systems, page faults and
fork()/exit() operations, would be a good indicator of scalability in the VM
layer.

   o When generating concurrent page faults in different address spaces, the
     only lock contention will be for the global page layer spin lock that is
     guarding the page hash chains. The hold time for this lock is very low.
     There will be different instances of the HAT lock (due to different
     address spaces) and segment locks (faulting on different segments).
   o For concurrent page faults generated within a process among its LWPs,
     there would be contention for the HAT resource lock, but the lock hold
     time during loading of a translation will again be very small. Faulting on
     the same segment by various LWPs would cause contention for the per
     segment sleep lock. Some faults require the lock to be held only in reader
     mode and thus allow for parallelism between such faults at the segment
     layer.
   o When concurrent fork()/exit() operations take place in different address
     spaces, the only contention at the VM level would be for locks at the anon
     layer (which manages anonymous pages) and at the swap layer, for reserving
     anon pages and swap space for the child processes respectively. Again, the
     lock hold times during the reservation will be very small.
   o Intra-process concurrent fork()/exit() operations could cause lock
     contention at the HAT layer and the segment layer, but both locks will be
     held only while each segment is being copied. Reducing the hold time on
     the HAT resource lock by dropping the HAT lock after copying each mapping
     chunk (32 page table entries) is being considered.

Examples of the Locking requirements for HAT interfaces

     To get an idea of the locking requirements, some of the HAT interfaces are
listed here. In all these operations, the HAT resource lock is acquired by the
HAT layer.

     hat_memload(): Load a virtual address translation. Called with the
reader/writer lock for the physical page held. The caller can not hold any spin
locks.

     hat_unload(): Unload a range of virtual address translations. The caller
need not hold any spin locks. This routine acquires the spin lock associated
with the physical page structure in order to modify the mapping chain for the
page.

     hat_pageunload(): Unload all the virtual translations for a given physical
page. The spin lock for the page is held by the caller.

Physical Memory Scalability

     SVR4 had a limit on the physical memory it was able to support on the i386
platform. Changes were made in SVR4.2 to avoid this limit (a limit that was not
influenced by any MP related issues). Several kernel functions in SVR4 relied
on the fact that all of physical memory in a machine is mapped into the kernel
virtual space. These functions generally need to get a virtual address from a
given physical address in a non-blocking fashion. On the Intel386 reference
port, out of the available 4 Gigabyte virtual address space, the user address
space was given 3 Gbytes and the kernel 1 Gbyte of virtual space. The kernel
virtual space itself was divided at kernel boot time among different kernel
segment drivers (segkmem, segmap and segu, which manages simultaneous mapping
of several processes' u-areas at the same time). After this division, only 256
Mbytes of physical memory could be mapped into the kernel virtual space. The
default layout of the kernel memory map could be changed to make this limit
bigger, but there would still be a limit. All the kernel functions that expect
this non-blocking behaviour were modified in SVR4.2 to eliminate this
restriction in one of the following two ways:

   o Cache the needed virtual address.
   o Create and destroy virtual mappings to a given physical page as needed.

     The HAT layer was one of the primary users of this "magic mapping", using
it to get to the virtual address of a level 2 page table entry from the page
frame number stored in the level 1 page table entry. It was changed to cache
the virtual address in the HAT structure itself. Going into the details of all
the changes in the kernel is beyond the scope of this paper.
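The two strategies listed above can be illustrated with a small user-space
sketch. This is not the SVR4.2 source; the names here (struct hat, map_phys(),
hat_l2_vaddr()) are invented for illustration, and map_phys() is a stand-in for
a kernel primitive that would create a temporary virtual mapping for a page
frame.

```c
#include <stddef.h>

/* Hypothetical sketch: a HAT caches the kernel virtual address of
 * its level 2 page table page (strategy one), falling back to a
 * map-on-demand primitive (strategy two) only on first use. */
struct hat {
    unsigned long l2_pfn;   /* frame number from the level 1 entry */
    void         *l2_vaddr; /* cached virtual address; NULL if not yet mapped */
};

/* Stand-in for "create a virtual mapping to a given physical page".
 * A real kernel would install a temporary page table entry here. */
static char fake_frame[4096];
static void *map_phys(unsigned long pfn)
{
    (void)pfn;
    return fake_frame;
}

/* Non-blocking after the first call: later lookups of the level 2
 * page table need no further mapping work. */
static void *hat_l2_vaddr(struct hat *hat)
{
    if (hat->l2_vaddr == NULL)
        hat->l2_vaddr = map_phys(hat->l2_pfn); /* one-time cost */
    return hat->l2_vaddr;
}
```

Caching trades a small amount of kernel virtual space per HAT for a guaranteed
non-blocking lookup; mapping and unmapping as needed keeps virtual space free
but pays the mapping cost on every access.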
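The locking requirements listed earlier for hat_unload() - the HAT layer takes
the per-address-space HAT resource lock itself and acquires the per-page spin
lock only around the mapping chain update - can be sketched as below. This is a
user-space simulation using mutexes in place of spin locks; struct as_hat,
struct page as shown, and hat_unload_one() are hypothetical names, not the
SVR4.2 interfaces.

```c
#include <pthread.h>

/* Simulated per-page structure: the lock stands in for the per-page
 * spin lock that guards the page's mapping chain. */
struct page {
    pthread_mutex_t lock;
    int             nmappings; /* length of the mapping chain */
};

/* Simulated per-address-space HAT: one lock guards all HAT resources. */
struct as_hat {
    pthread_mutex_t resource_lock;
};

/* Unload one translation. The caller holds no spin locks; the HAT
 * layer takes its own resource lock first and holds the page lock
 * only for the mapping chain update, keeping both hold times short. */
static void hat_unload_one(struct as_hat *hat, struct page *pp)
{
    pthread_mutex_lock(&hat->resource_lock);
    /* ... invalidate the page table entry for this mapping ... */
    pthread_mutex_lock(&pp->lock);
    pp->nmappings--;           /* unlink this mapping from the chain */
    pthread_mutex_unlock(&pp->lock);
    pthread_mutex_unlock(&hat->resource_lock);
}
```

Because each lock is held only across the entry invalidation and the chain
update, contention on these locks stays low even under concurrent faults, which
is what the analysis above relies on.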


     Most features of SVR4.2 have been completed, and performance measurements
are beginning to be collected for the system. Thus the performance measurements
presented here are by no means the optimal figures for SVR4.2.

     Some measurements were taken on how well our shootdown algorithms (lazy
and immediate) scale with respect to the number of processors. The scalability
of the basic cost of the shootdowns, of the lazy shootdown algorithm and of the
immediate shootdown algorithm was measured. The following measurement process
was used in collecting the data:

   o The measurements were taken on a Sequent Symmetry platform which has 6
     Intel 386 processors at 20MHz.
   o Ten samples of each measurement were taken, and their mean value was used.
   o A measurement tool called casper was used to measure the time spent in
     different windows of the kernel code paths, in units of microseconds.
   o All our measurements reflect the time spent by the initiator.

     The time spent by the responders in the lazy algorithm would be a fixed
time constant (the time taken to flush the TLB). In the immediate shootdown
case, the time spent by the responders would be upper bounded by the time spent
by the initiator.

     The basic cost of shootdowns was computed on the Symmetry by calling
hat_shootdown() with HAT_NOCOOKIE as an argument, which shoots down all
processors in the system without any other computation. Figure 1 is the graph
that illustrates the results.

[Figure 1: Basic cost of shootdowns (microseconds vs. number of processors)]

     The graph shows that the measurements are not exactly linear. Two possible
reasons for this are variances in the interrupt fanout facility on the
Sequents, and the fact that even a slight disturbance on the order of tens of
microseconds in each collection of samples will perturb the linearity.

[Figure 2: Cost of the lazy shootdown algorithm, used for the kernel address
space (microseconds vs. number of processors)]

     The cost of the lazy shootdown algorithm is illustrated in Figure 2. Note
that there is a fixed cost overhead over the basic shootdown cost in the range
of 70 microseconds. The cost of the algorithm per additional processor is about
40 microseconds.

     Measurements of the immediate shootdown algorithm were analyzed next. It
was measured by running a kernel level test that handcrafted a user address
space and spawned as many threads as the number of onlined processors. After
the spawned threads were waiting, spinning on a barrier, the parent unmapped a
previously mapped page. This would generate a shootdown on all the other
processors that were spinning on the barrier. Thus the initiator touches only
one page table entry.

[Figure 3: Cost of the immediate shootdown algorithm, used for the user address
space (microseconds vs. number of processors)]

     The overhead of the algorithm over the basic shootdown cost (Figure 3) is
about 30 microseconds. The fixed cost of each additional processor for the
immediate shootdown algorithm is about 45 microseconds.

     The data collected so far indicates that we should be able to scale well
in the range of tens of processors for both the lazy shootdowns and the
immediate shootdowns. The cost of the lazy shootdown algorithm is slightly
(about 40 microseconds) more than that of the immediate shootdown.


This is due to all the accounting that is done to update the cookie for each
TLB in the lazy shootdown case. Other similar measurements [3] show that bus
contention may become a problem for algorithms that use cross-processor
interrupts when dealing with processors in the range of 15 to 20.

     Another encouraging measurement, about the effectiveness of the lazy
unload policy of the segkmem driver (discussed above), shows that it makes only
1.1 calls to hat_shootdown per 100 memory allocation requests. Actual
shootdowns will be even less frequent (as can be observed from the explanation
of the algorithm). This data was collected by running a kernel level test that
allocates and frees kernel memory repeatedly in different sizes. There was no
other activity (such as the pageout daemon) in the system when this test was
run.

Concurrent fork()/exec()/exit() measurements

     Scalability of concurrent inter-address space fork()/exec()/exit()
operations was measured through a benchmark program (called S, one of the
benchmarks at USL). The benchmark consists of the following tests:

   o fork()/exit() operations with bss size ranging from 0 to 192K.
   o fork()/exec()/exit() operations with bss size ranging from 0 to 48K.
   o fork()/sbrk()/exit() operations with bss size ranging from 0 to 192K.

[Figure 4: Scalability of fork()/exec()/exit() operations (speedup vs. number
of concurrent runs)]

     The measurement process was as follows:

   1. Scalability was measured by having a fixed processor configuration (4
      processors) and varying the workload of the tests in the benchmark. The
      workload was varied by executing the same benchmark concurrently, from 1
      up to 4 times. The speedup was measured by taking the elapsed time of
      each workload and measuring it against the unit workload (one run of the
      benchmark).
   2. Each of the above tests was repeated 10 times in a run.
   3. The measurements were done on a 4 processor (Intel386 @ 20MHz) Sequent
      Symmetry machine.

     As mentioned earlier, these measurements (Figure 4) are used only to
illustrate the scalability of SVR4.2 and should not be taken as the final
performance data for the system.

Conclusions

     A model of the HAT layer that is scalable with respect to processors and
memory has been described in this paper. This model makes the porting effort
simpler without losing sight of the scalability issues. A well defined
multiprocessor management interface between the machine independent and the MMU
dependent parts of the Virtual Memory subsystem, together with simple locking
guidelines, provides the keys to making a memory management feature portable.
The TLB shootdown policy and algorithms in SVR4.2 adapt well to different
architectures. With multiprocessor platforms becoming more common, preserving
the ease of porting a kernel to different architectures without losing sight of
scalability issues will be extremely critical.

Acknowledgements

     The design and implementation of the VM subsystem was a joint effort by
the SVR4.2 VM team members. The past and present members of this team include:
Steve Baumel (who also provided the segkmem measurements), Doshi, Mike Lazar of
Pyramid Technologies, Lee, and Dave Lennert of Sequent Computer Systems. Dick
Menninger was the initial implementor of the SVR4.2 HAT layer. Thanks also to
Mike Miracle for his support in writing this paper.

References

[1] Richard Rashid et al., Machine-Independent Virtual Memory Management for
    Paged Uniprocessor and Multiprocessor Architectures, IEEE Transactions on
    Computers, Vol. 37, No. 8, August 1988.
[2] Robert A. Gingell, Joseph P. Moran and William A. Shannon, Virtual Memory
    Architecture in SunOS, in Proc. USENIX Summer '87 Conference, Phoenix, AZ,
    June 1987.
[3] Black et al., Translation Lookaside Buffer Consistency: A Software
    Approach, December 1988.
[4] Patricia J. Teller, Translation-Lookaside Buffer Consistency, IEEE
    Computer, June 1990.
[5] Michael Y. Thompson et al., Translation Lookaside Buffer Synchronization
    in a Multiprocessor System, in Proc. USENIX Winter Conference, 1988.
[6] 80386 Programmer's Reference Manual.


Author Information

     Ramesh Balan is a Member of Technical Staff at UNIX System Laboratories in
the Kernel Development group. He received his M.S. in Computer Science in 1989
from the School of Engineering and Applied Science at Columbia University. He
can be reached via e-mail at ramesh@usl.com.

     Kurt Gollhardt is a consultant at UNIX System Laboratories in the Kernel
Development group. He received a B.S. in Computer Science and a B.S. in
Electrical Engineering at Washington University in St. Louis. He can be reached
via e-mail at
