
         Multiprocessor Operating Systems

    Vote for Peace:
    Implementation and
    Performance of a Parallel
    Operating System
    Jörg Cordsen
    GMD First
    Thomas Garnatz and Michael Sander
    Siemens Communications Test Equipment GmbH
    Anne Gerischer
    Dakosy GmbH
    Marco Dimas Gubitoso
    University of São Paulo
    Ute Haack and Wolfgang Schröder-Preikschat
    University of Magdeburg

Vote, a virtual shared-memory system and an extension to the Peace parallel operating system, provides architectural transparency and efficiency to effectively solve high-performance computing problems.

Between 1990 and 1994, the German National Research Center for Information Technology (GMD), Research Institute for Computer Architecture and Software Technology (First), conducted the Manna project, which aimed to design and develop a scalable distributed-memory parallel computer to support the execution of parallel numerical and nonnumerical applications. The Peace operating system served as the starting point for implementing a system-software platform for Manna.1 Based on previous experiences with the 320-node Suprenum system,2 we changed two fundamental approaches concerning the design of Peace. The first change was the decision to exploit object-oriented design principles, similar to the Choices approach.3 However, unlike the designers of Choices, we did not use a microkernel as the minimal basis for a family of parallel operating systems. The decision to turn away from the pure microkernel approach was the second change regarding the original Peace design.

The microkernel approach successfully managed the Suprenum system, so that all the nodes jointly did number crunching for the parallel applications quite well. However, this approach failed in the presence of well-shaped

    16                                       1063-6552/97/$10.00 © 1997 IEEE                          IEEE Concurrency

applications tuned with respect to the actual number of nodes (the degree of parallelism) provided by the hardware. With application tasks mapped in a one-to-one correspondence with the nodes, the performance—in particular, the message startup time—was not satisfactory.

Figure 1. Manna's dual-processor-node architecture.
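The message startup time criticized above is the fixed per-message software overhead, usually measured with a ping-pong microbenchmark: two parties bounce a minimal message back and forth, and half the mean round-trip time approximates the one-way startup cost. The sketch below is purely illustrative (our own code, not Peace or Suprenum software) and uses a local socket pair in place of a real interconnect.

```python
# Illustrative ping-pong microbenchmark for message startup latency.
# Two endpoints exchange 1-byte messages over a socket pair; the
# one-way startup time is approximated as half the mean round trip.
# (Local sketch only; on a parallel machine the transport would be
# the interconnect hardware, not a socket.)
import socket
import threading
import time

def _echo(sock, rounds):
    # Bounce every byte straight back to the sender.
    for _ in range(rounds):
        sock.sendall(sock.recv(1))

def startup_time_us(rounds=1000):
    a, b = socket.socketpair()
    th = threading.Thread(target=_echo, args=(b, rounds))
    th.start()
    t0 = time.perf_counter()
    for _ in range(rounds):
        a.sendall(b"x")
        a.recv(1)
    elapsed = time.perf_counter() - t0
    th.join()
    a.close()
    b.close()
    return elapsed / rounds / 2 * 1e6  # microseconds, one way

if __name__ == "__main__":
    print(f"approx. one-way startup time: {startup_time_us():.1f} us")
```

On a modern workstation this typically reports single-digit to low double-digit microseconds; the point here is the method, not the number.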
The multiuser, multitasking facilities of today's microkernels are not free. This is even true for running only a single-user, single-tasking application on a microkernel, which results in a single-user, multitasking mode of operation for a multinode (distributed-memory) parallel machine. In such a composition, the less-demanding applications must pay for (microkernel) functions that will never be used by them. For such applications, a functionally scaled-down microkernel would be better—that is, a kernel without functions such as security, address-space isolation, virtual memory, (local) task scheduling, and trap-based system calls. Thus, the minimal kernel for these multiple instruction, multiple data (MIMD) parallel applications appears to be a library that is directly linked to the respective user program.

Although the parallel computing community commonly uses a single node's (dedicated) single-user, single-tasking operating mode, it is not the only mode. Considering again a single (perhaps microkernel-controlled) node, many single-user, multitasking applications exist. Here, the degree of parallelism is dynamic, exceeding the static degree of parallelism offered by a multinode machine. In such a case, a few tasks must multiplex some, though not necessarily all, nodes. Furthermore, other applications often require a multiuser, multitasking mode of operation. Such a configuration can either be supported on the basis of different (isolated) user partitions or can follow the pattern of traditional time-sharing operating systems.

We can best resolve the obvious dichotomy by using a family of microkernels rather than a single kernel that offers only compromise solutions to the applications.1 The family consists of a collection of problem-oriented kernels, each tailored to a specific application's demands and each implementing a different operating mode of the underlying node. Depending on the use pattern of the parallel machine's application, the most appropriate kernel comes into play. The Peace parallel operating system is built atop such a microkernel family.

Vote is a virtual shared-memory system and part of the Peace operating system. Vote targets the Manna computing system, allowing execution of shared-memory-based programs on a distributed-memory architecture. Thus, Vote provides a smooth migration path for these types of programs. Following the pattern of Peace, the design and implementation of Vote was strongly influenced by the program family concept.4 Rather than providing a single consistency protocol that manages the replicated VSM data objects for all kinds of shared-memory programs, a family of consistency protocols offers users problem-oriented solutions.5

In this article, we discuss the implementation of the VSM system and Peace's message-passing kernel (nucleus). First, we briefly describe the architecture of the Manna processing node and the Peace operating system. Then we briefly share insight into the Vote system and analyze the performance of access fault handling. Finally, we illustrate the organization of the Peace message-passing kernel, particularly the nucleus, and describe various Manna node configurations.

System architecture

A Vote platform consists of two major components: the hardware and the operating-system software.

DUAL-PROCESSOR-NODE ARCHITECTURE
Basically, the Manna node defines a small-scale, shared-memory multiprocessor system. The per-node memory system allows a wait-state-free operation of the two 50-MIPS reduced-instruction-set-computing processors (i860XP) (see Figure 1). The effective memory access rate is 381.47 megabytes per second. Both processors share the memory and the I/O units attaching peripherals to the node; they also share the bidirectional communication link. This link connects the node with a byte-wide 16 × 16 crossbar switch. Reading data from and writing data into first-in, first-out (FIFO) registers takes care of sending and receiving messages through the link. The total (physical) crossbar throughput is 2 × 47.68 Mbytes/s, with an effective byte-transfer latency of four clock ticks (that is, 80 ns).

A crossbar hierarchy interconnects more than 16 nodes. For this purpose, some crossbar links establish the hierarchy by connecting a crossbar to other crossbars. The Manna interconnection structure can also include crossbar switches used only for the crossbar interconnection. Crossbars with attached nodes form a cluster. At the hardware level, internode communication within a

April–June 1997                                                                                                          17


cluster is faster than between clusters. The performance difference is in nanoseconds and will diminish as increasingly more user and system software emerges.

Various Manna research systems have different scales: 2-, 4-, 16-, 20-, and 40-node configurations are in daily operation, running parallel applications for summer smog prediction and computer animation. They also provide testbeds for parallel operating-system research.

OPERATING-SYSTEM ARCHITECTURE
Peace is an object-oriented operating-system family especially designed for, but not limited to, distributed-memory parallel machines. Each family member is constructed from three major building blocks: the nucleus, the kernel, and the Parallel Operating-System Extension (see Figure 2). The application is the fourth integral part of this architecture, in addition to the system components. It determines a family member's complexity in general and the distribution of the building blocks over the nodes of the parallel machine.

Figure 2. Peace's architecture.

The nucleus is responsible for systemwide interprocess communication and provides a runtime executive for the processing of threads. It acts as the minimal basis and is part of the kernel domain. The kernel is a multithreaded system component encapsulating minimal nucleus extensions. These extensions implement device abstractions, the propagation of exceptional events (traps and interrupts), dynamic creation and destruction of process objects, and the association of process objects with naming domains and address spaces. The Parallel Operating System Extension performs application-oriented services such as naming, process and memory management, file handling, I/O, load balancing, and internetworking to provide some host access.

The kernel and POSE services use active objects implemented by lightweight processes; service access happens on a remote-procedure-call basis. In contrast to the kernel and POSE, the nucleus is an ensemble of passive objects scheduling active objects. The kernel and POSE are multithreaded entities, and thus support concurrent service processing. A Peace entity provides a common execution domain for active objects and is considered to be the unit of distribution, whereas an active object is considered to be the unit of execution. The kernel entity's main purpose is to provide hardware abstractions, keeping (parallel) applications and POSE independent from a given processing node's physical characteristics.

From the design point of view, neither the kernel nor POSE need to be present on every node, only the nucleus. In a specific configuration, the majority of the nodes of a massively parallel machine are equipped only with the nucleus. The kernel supports some nodes, and a few nodes are allocated to POSE. All nodes can be used for application processing, but they are not all obliged to be shared by user tasks and system tasks.

The Vote system

The Vote system is part of the POSE building block. Vote highlights two ideas predominant in VSM systems. From 1986 to 1989, the VSM community argued quite aggressively that "RPC is a poor man's version of shared memory."6 By means of architectural transparency, VSM systems such as IVY (Integrated Shared Virtual Memory System, developed at Yale) promoted a gentle migration path allowing the execution of shared-memory programs on distributed-memory machines. Their low performance and lack of scalability spawned a second VSM generation, including systems such as Midway, Munin, and Dash, which tried to improve efficiency. The particular motivation for designing and implementing Vote was to support a symbiosis of architectural transparency and efficiency aspects.

By default, a shared-memory program running on Vote is executed in a multiple-reader, single-writer (MRSW) sequential consistency model.7 This provides architectural transparency to the application level. At any time, the program can change the consistency maintenance, continuing execution by virtue of a different memory-consistency definition. Vote supports several performance-enhancement techniques. These techniques help avoid sequences of read and write memory-access faults, allow pre-paging and release of address ranges, and provide support for one-sided communication to propagate data to a set of processes. A fine-grained multiple-writer model allows (within a page) modifying accesses, with a subsequent restoration scheme to unify a sequentially consistent result. Vote also supports message-passing communication functions,


which operate in parallel with the demand paging of sequential consistency.

Throughout this article, we concentrate on the basic consistency-maintenance scheme—that is, Vote's access fault-handling scheme.5

Vote distinguishes three functional units responsible for handling consistency maintenance and raised memory-access faults. These three functional units are the catcher, the actor, and the adviser (see Figure 3).

Figure 3. Vote's building blocks.

For the sake of clarity, we use the terms requesting site, knowing site, and owning site. The requesting site is the process that causes the memory-access fault. The knowing site is the process that implements consistency maintenance, and the owning site is the process that actually owns the requested memory page. In specific situations, the knowing site might also play the role of the owning site.

An application process using the global address space of Vote is associated with an exception handler (the catcher). When a memory-access fault occurs, the operating-system kernel invokes the catcher via an upcall (an event-driven invocation of higher-level system components by lower-level kernel functions).9 The catcher determines the consistency-maintenance process relevant for handling the access fault. For this purpose, the catcher uses a predetermined (user-directed) mapping from memory pages to consistency-maintenance processes. Using this information, the catcher calls the adviser at the knowing site.

The adviser process implements consistency maintenance for the requested memory page. This process maintains directory information about the distribution of the memory pages that it has taken responsibility for. A set of threads, called actor threads, supplements each adviser. When a VSM variable is declared to be maintained by a specific adviser, Vote creates an associated actor thread in the declaring application process's address space. Thus, using several adviser processes to maintain different memory pages yields the same number of actor threads associated with each application process. The adviser itself creates an actor to cache VSM pages and optimize access fault handling.

An actor's purpose is to give its adviser an interface for controlling the memory management and the movement of VSM memory pages. The adviser's functional encapsulation of consistency maintenance, on the one hand, and the actor's memory management and data transfer, on the other hand, let Vote make use of idle processors or dedicated system processors. State-of-the-art consistency maintenance operates with a dynamic, distributed, ownership-based protocol. Because of the ownership approach, this scheme can operate only on processors running application tasks. Because of this disadvantage, Vote rejects the dynamic, distributed consistency maintenance and instead uses a highly optimized fixed distributed scheme.

ACCESS FAULT HANDLING
When an adviser receives a request to handle a memory-access fault, the access fault type (that is, protection or access violation) and the directory information determine further processing. The simplest case is a protection violation, where the requesting site already owns a copy of the memory page with read-only access permission. The adviser then sends invalidation messages to the set of owning sites, updates the directory information, and replies to the requesting site to upgrade its memory-access permission.

If the access fault was an access violation, the knowing site checks whether the requested memory page has been cached. In case of a cache hit, the knowing site directly transfers the page. If the access violation was due to a write access, the knowing site invalidates the set of copy holders before the directory information is updated. In contrast, a read-access violation only requires adding the requesting site to the set of copy holders. Afterwards, the requesting site receives a message to operate its local memory-management unit (MMU) programming.

The final two cases handle access violations when the requested page is not cached at the knowing site. In such situations, the adviser calls the requesting site's actor and tells it where to get a copy of the memory page. Then the requesting site's actor asks the owning site's actor to copy the requested page (get_copy in Figure 3). At the requesting site, the actor defines the


Table 1. Times for access fault handling in Vote (t represents logarithmic time complexity for invalidations).

PROTECTION                         ACCESS VIOLATION (µs)
VIOLATION (µs)     CACHE HIT                             CACHE MISS
                   READ ACCESS   WRITE ACCESS    READ ACCESS   WRITE ACCESS

628 + t            667           794 + t         1,670         1,396 + t
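The five columns of Table 1 mirror the adviser's case analysis described above: protection versus access violation, cache hit versus miss at the knowing site, and read versus write access. As a reading aid, the sketch below (our own illustration; the function name and table encoding are invented, not Vote code) captures which cases pay the invalidation term t.

```python
# Hypothetical cost model for Vote's access fault handling (Table 1).
# Base costs are in microseconds; t is the logarithmic invalidation
# cost, which applies whenever copy holders must be invalidated.

BASE_US = {
    ("protection", None, None): 628,   # write to an already-held read-only page
    ("access", "hit", "read"): 667,
    ("access", "hit", "write"): 794,
    ("access", "miss", "read"): 1670,
    ("access", "miss", "write"): 1396,
}

def fault_cost_us(kind, cache=None, mode=None, t=0.0):
    """Return the handling time for one access fault.

    kind  -- "protection" (page present, permission too weak)
             or "access" (page not mapped at the requesting site)
    cache -- "hit" or "miss" at the knowing site (access faults only)
    mode  -- "read" or "write" (access faults only)
    t     -- invalidation cost, O(log n) in the number of copy holders
    """
    base = BASE_US[(kind, cache, mode)]
    # Invalidations are needed for protection violations and for
    # write-access faults, i.e. whenever write permission is granted.
    needs_invalidation = kind == "protection" or mode == "write"
    return base + (t if needs_invalidation else 0.0)
```

For example, a read-access fault that hits the adviser's cache costs 667 µs flat, while a write miss with an invalidation round of t = 100 µs would cost roughly 1,496 µs under this model.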
desired access permission. If the page has read-only access permission, the requesting site's actor sends a copy of the page to the knowing site as well (put_copy in Figure 3). As described above, caching these pages at the knowing site motivates prompt handling of future read-access faults with respect to the same pages. Finally, control returns to the adviser. If a write-access violation is to be handled, invalidations are sent to the set of owning sites. The adviser updates the directory information and sends a reply to the requesting site. Table 1 summarizes the performance of Vote's access fault handling.

CASE STUDY
The case study considers the handling of a read-access violation when the knowing site caches the requested memory page. Vote is particularly designed to handle this situation optimally; it takes place when the adviser has handled the first read-access fault on a page and all the other processes retrieve that particular memory page in sequence.

Figure 4 shows the timing graph of the consistency maintenance carried out by Vote. The vertical axis is labeled with five terms describing basic activities in consistency maintenance: Trap, Vote, IPC (interprocess communication), HVDT (high-volume data transfer), and MMU. These are trap handling, activities in the functional units of Vote, packet communication, the 4-Kbyte data transfer of a memory page, and the MMU programming efforts. The horizontal axis shows the time progress with a scale of a microsecond. At time 0 µs, an application process tries to read from a memory address that is not mapped with the appropriate memory-access permission.

When control shifts to the catcher at the requesting site (step 1), a memory page is allocated. Vote sends the address of that page as an argument to the adviser. At the knowing site, a cache hit for the requested page is detected (step 2). Afterward, the adviser instructs its actor to transfer the cached page to the requesting site. The actor writes the page directly into the allocated memory region at the requesting site.

When the page transfer is finished, control returns to the adviser. The directory information is updated (step 5), leaving information behind about an additional copy of the specific page. Then the adviser answers the catcher, and control returns to the requesting site. The catcher instructs the local actor to map the memory page to the necessary logical address and to assign read-only access permission. Control moves back to the catcher and finally to the operating-system kernel, which then can reactivate the application process.

Figure 4. Breakdown of activities in read-access fault handling. (IPC: interprocess communication; HVDT: high-volume data transfer; MMU: memory-management unit; Trap: trap handling; Vote: Peace's virtual shared-memory system.)

The time graph clearly shows that I/O (IPC and HVDT) is the dominating issue. However, the knowing site is involved for only about 450 µs. The overlapped execution with trap propagation and MMU programming lets the adviser handle other access faults and thus makes an important contribution to Vote's scalability.

PERFORMANCE AND RELATED WORKS
Considering the implementation of Vote and the resulting performance, the family concept proved to be the right decision for designing and developing a parallel operating system for the Manna architecture. Table 2 shows the measured performance results of access fault handling in a few VSM systems.

MEther and Munin both run on rather old hardware. Nevertheless, a comparison of Myoan and Vote is fair, because both systems have the same CPU foundation.8 Myoan runs on the Intel Paragon machine with a


Table 2. Comparison of read-fault handling in different VSM systems.

SYSTEM    PLATFORM      CPU               NETWORK (MBYTES/S)   TIME (MS)
MEther    SunOS 4.0     25-MHz MC68020            1.20         70–100
Munin     V             25-MHz MC68020            1.20         13.34–31.15
Myoan     OSF/1 + NX    50-MHz i860XP           200.00          4.068
Vote      Peace         50-MHz i860XP            47.68          0.667
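The headline claims drawn from Table 2 in the surrounding text, and the seemingly odd bandwidth figures quoted earlier for Manna, can be checked with a few lines of arithmetic. Reading 47.68 and 381.47 Mbytes/s as binary-megabyte conversions of 50 and 400 million bytes per second is our own inference, not stated in the article.

```python
# Sanity checks on the published numbers.

# Table 2: Vote vs. Myoan read-fault handling time (ms).
speedup = 4.068 / 0.667
assert speedup > 6            # "more than six times faster"

# Paragon vs. Manna network throughput (Mbytes/s).
net_ratio = 200.00 / 47.68
assert net_ratio > 4          # "more than four times better"

# Manna's odd-looking rates are round numbers in binary megabytes:
# one byte per 20-ns crossbar clock gives 50e6 bytes/s on the link,
# and the memory system delivers 400e6 bytes/s; divide by 2**20.
assert round(50e6 / 2**20, 2) == 47.68
assert round(400e6 / 2**20, 2) == 381.47

print(f"Vote speedup over Myoan: {speedup:.2f}x; "
      f"Paragon/Manna network ratio: {net_ratio:.2f}x")
```

Both quoted ratios hold, and the byte-per-clock reading is consistent with the four-tick, 80-ns byte-transfer latency given in the hardware description.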
network throughput more than four times better than the throughput of the Manna communication network. Yet Vote handles a read-access fault more than six times faster than Myoan does, although communication and data transfer are responsible for about 80% of the total costs.

One of Vote's main advantages over Myoan is the specialized operating-system kernel, which, in the case just discussed, appears to be only a communication and thread library rather than a microkernel with additional user-level and problem-oriented communication support. To partly overcome the performance problems, Myoan uses Intel's low-level NX communication library for internode communications. This reduces the communication time (for 8 bytes) from 1,909 µs, when exploiting the IPC functions of the OSF microkernel, to about 329 µs.

The Peace nucleus

The Peace nucleus acts as the minimal basis of system functions, handling networkwide communication and thread processing. The entrance to the nucleus is represented as an abstract data type with different implementations, resulting in featherweight, lightweight, and heavyweight activation patterns.

Some configurations assume only vertical isolation, or vertical and horizontal isolation, and therefore require a trap-based activation of the nucleus. Thus, there is a separation between the user and supervisor modes of operation (vertical isolation), which entails lightweight nucleus-activation patterns, and a separation between competing tasks (horizontal isolation), which entails heavyweight nucleus-activation patterns. Horizontal isolation means that user-system entities have a private address space; that is, they operate in a private protection domain.

Other configurations sacrifice complete (vertical and horizontal) isolation, which entails featherweight nucleus-activation patterns, and make the nucleus appear as a communication and threads-library package. The variants basically distinguish between single-tasking (no isolation) and multitasking (isolation) modes of operation. They implement different members of the kernel family.1

PROBLEM-ORIENTED PROTOCOL LAYERS
Networkwide interprocess communication requires three main functions:

• determining the locations of active and passive objects;
• enabling data transport between (and, thus, interconnection of) locations; and
• attaching the locations to the network interface.

Hence, Peace's communication system comprises three problem-oriented protocol layers (see Figure 5).

Figure 5. The communication-system architecture, including the communication system (COSY), the cluster bus (CLUB), and the Network Independent Communication Executive (NICE).

Interactions between the layers happen via downcalls (program-driven invocations of lower-level system functions by higher-level ones) and upcalls.9 Peace uses queues, when possible, to decouple the different flows of control. Calls in either direction are virtually asynchronous, because whether message transfer requests must be queued or can be executed immediately depends on the actual load. The layers implement the NICE-COSY-CLUB (NC2) protocol suite.

Internucleus protocol
Networkwide communication between objects is supported by the network-independent communication executive. NICE implements the internucleus protocol,

IEEE Concurrency, April–June 1997
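The downcall/upcall layering shown in Figure 5 can be sketched as a chain of layer objects. The following C++ sketch is illustrative only: the class and member names are invented, not the actual Peace interfaces, and COSY is modeled as a pure pass-through.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each layer forwards transfer requests downward (downcalls) and
// delivers incoming data upward (upcalls).
struct Layer {
    Layer* below = nullptr;   // next layer toward the network device
    Layer* above = nullptr;   // next layer toward the process
    virtual ~Layer() = default;
    virtual void down(const std::string& pdu) { if (below) below->down(pdu); }
    virtual void up(const std::string& pdu)   { if (above) above->up(pdu); }
};

// CLUB: encapsulates the network device; here the "wire" is a vector.
struct Club : Layer {
    std::vector<std::string> wire;
    void down(const std::string& pdu) override { wire.push_back(pdu); }
};

// COSY: transport layer; a pure pass-through in this sketch.
struct Cosy : Layer {};

// NICE: internucleus protocol; prepends a (fake) header on the way down
// and records delivered messages on the way up.
struct Nice : Layer {
    std::vector<std::string> delivered;
    void down(const std::string& pdu) override { Layer::down("hdr|" + pdu); }
    void up(const std::string& pdu) override { delivered.push_back(pdu); }
};

// Wire the NC2 stack: NICE on top, COSY in the middle, CLUB at the bottom.
inline void make_stack(Club& club, Cosy& cosy, Nice& nice) {
    nice.below = &cosy; cosy.above = &nice;
    cosy.below = &club; club.above = &cosy;
}
```

A downcall issued on the NICE object traverses COSY and ends at CLUB, while an upcall raised at CLUB ends at NICE, mirroring the two arrows of Figure 5.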


Internucleus protocol
Networkwide communication between objects is supported by the network-independent communication executive. NICE implements the internucleus protocol, which is responsible for global (networkwide) process and address-space control. It activates and verifies the presence of remote processes and address spaces. State transitions of processes and address spaces are controlled to logically enable end-to-end data transfers without the need for intermediate buffering.

Transport protocol
The communication system (COSY) handles data transfers. This layer encapsulates transport-protocol functions and provides an abstraction from the actual network capabilities. Depending on these capabilities, COSY is more or less complex. It provides not merely a single implementation but a protocol suite (that is, a family) covering several system configurations.
Logically, COSY takes responsibility for a secured data transport of arbitrarily sized messages. However, "logical" also implies a configuration in which COSY is not required for any network activities. That extreme situation arises when the network hardware itself is capable of transferring message streams in the manner required by parallel applications. In those cases, as happened with the Manna implementation, COSY simply forwards all requests from and to NICE, without interpretation.

Network device protocol
Abstraction from the physical network interface is handled by the cluster bus (CLUB). (This terminology comes from the Suprenum architecture, where a cluster bus interconnected up to 20 nodes to build a cluster. A second-level network system interconnected up to 16 clusters. The low-level communication protocol was known as the cluster bus driver.) Thus, the bottom layer of the communication system encapsulates the network device and physically attaches the nucleus to the network. This layer implements the network device driver.
CLUB provides the view of an abstract network device that can have several physical representations. The CLUB abstraction makes COSY independent from the network device actually used, whether this device is a physical or a logical one. Thus, CLUB supports the portability of COSY protocols.

DUAL-PROCESSOR-NODE CONFIGURATIONS
The basic idea behind the Manna dual-processor-node architecture is to have one processor in charge of application-program processing and to use the second processor for global communication (internode message passing). The Intel Paragon, for example, implements a similar node architecture. This technology's main aim is to provide architectural support to minimize the message startup time and the message latency. However, all these measures are of little value if the operating-system architecture for such a hardware organization is inappropriate. For example, both OSF/1 AD10 and Puma11 are parallel operating systems for the Paragon machine. One of the main reasons that Puma outperforms OSF/1 AD is that the former has been specifically designed to operate in a distributed-memory parallel-computer environment, whereas the latter is mainly a port of a microkernel-based distributed (timesharing) operating system. Peace falls into the same category as Puma.
The two processors of a Manna node are fully software-programmable and can be configured in various ways. Single- and multiprocessor configurations are equally supported. Depending on the application demands, symmetric or asymmetric multiprocessing might come into play. For example, one processor might play the role of an application processor (AP), and the other processor might play the role of a communication processor (CP). The idea of the CP is to relieve the AP of all functions necessary for driving the networkwide communication protocol.
In the following paragraphs, we discuss in more detail the two main Manna node configurations that the Peace kernel supports. Peace supports other configurations, such as symmetric multiprocessing; however, they have not reached the same importance for the Manna applications, so we don't describe them here.

Coprocessing
A more or less straightforward nucleus configuration is to view the CP as a message-passing coprocessor. The task of the CP is carrying out transport-protocol functions on behalf of the AP and driving the network hardware interface. This approach is shown in Figure 6. As indicated in the figure, NICE is executed by both processors. In contrast, COSY and CLUB are executed only by one processor. In other words, the NC2 processor becomes the CP, and the processor executing both NICE and the application becomes the AP.

Figure 6. The communication coprocessor: NICE runs on both sides of the request queue, while COSY and CLUB run on the communication processor, which signals the nucleus by interrupt.

The NICE portion executed by the AP is responsible for setting up and queuing message-transfer requests. Under control of the application thread that requests a networkwide message transfer, NICE initializes a protocol data unit and prepares a message descriptor. The descriptor is placed on a request queue.
The AP-to-CP connection consists of a queue used to store packet and segment transfer requests, and it implements two fundamental queuing strategies. The first strategy performs only enqueuing; to become aware of a nonempty request queue, the CP must poll the queue from time to time. The second strategy is a specialization of the first: it sends an interrupt signal if the first element has been queued and the CP is busy executing software other than polling the request queue.
The NICE portion executed by the CP is responsible for processing the request queue, delivering incoming message segments directly into the address space of the AP destination thread, and unblocking AP threads waiting for incoming messages (that is, packets or segments). To handle the latter two functions, NICE_CP must share common data structures with NICE_AP: the ready list, the per-thread message (sender) queue, and, depending on the address-space model supported, the page table. NICE_CP autonomously manipulates the ready list when AP threads must be unblocked because of the reception of messages. More specifically, the CP schedules but does not dispatch AP threads. Only the NICE portion of the NC2 suite builds a multiprocessor-critical section and so must escape parallel execution to avoid race conditions. That is, NICE_AP and COSY-CLUB_CP can execute in parallel, just as the AP threads and NC2_CP can.
(Note: in this article, acronyms such as X_Y denote an X instance of the NC2 protocol stack executed by, or bound to, the Y processor of the dual-processor node.)
Sacrificing AP interruption by the CP has the advantage of allowing the AP to be fully in charge of executing the application tasks. In this situation, from the AP point of view, the reception of messages is completely transparent. The AP idle loop, entered when no more (AP) threads are ready to run, polls the ready list to determine when to dispatch which thread. The CP has already made the scheduling decision. In other words, the AP computing resources are completely under application control.

Asymmetric multiprocessing
We can easily extend the AP/CP node configuration on behalf of the application tasks (and by exploiting a different kernel family member). This requires structuring a task into a computation thread and a communication thread, which the AP and the CP, respectively, then execute. In such a configuration, requesting the communication thread to carry out networkwide message transfers no longer involves the nucleus at the AP site. Communication between both threads is through a common, shared address-space segment and happens entirely without nucleus intervention. The AP portion of the nucleus does no thread communication, only synchronization. Figure 7 illustrates the corresponding configuration.

Figure 7. The computation coprocessor.

From the nucleus point of view, the functional dedication is more a question of the primitives being called by user-level threads. More specifically, the functional dedication depends on the tasks performed by the threads and the thread mapping (that is, processor allocation). Because the computation threads are no longer compelled to call the nucleus to request transmission or reception of messages, the AP can be relieved of any networkwide communication activity. Only the communication threads invoke the nucleus to transfer messages on behalf of the computation threads.
In general, if AP threads invoke the nucleus primitives to perform networkwide communication, the communication requests are handed over to the CP, as we described earlier. CP threads, in contrast, always go the direct way. In particular, this allows CP threads not only to communicate but also to compute and, most importantly, to use the AP as a computation coprocessor. For example, instead of migrating communication tasks from the AP to the CP, a much better solution could be splitting a computation-intensive task into two computation threads, each one processed by its own CPU. That is, a computation subtask migrates from the CP to the AP. This case establishes some sort of AP/APCP configuration: one CPU runs only in AP mode, and the other CPU runs in both AP and CP mode (that is, it performs communication as well as computation).

Table 3. Breakdown of networkwide message passing (timings in µs).

                      AP                AP/APCP
FUNCTION         1×       10×        1×       10×
Setup          11.40     6.60      23.56     13.00
Transfer       29.32    19.80      28.04     16.28
Return          0.68     0.56       7.84      4.92
Overlap           —        —        9.85      5.68
Delay          41.40    26.96      31.40     17.92
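The AP-to-CP request queue and its two queuing strategies can be sketched as follows. The names are invented for illustration, and the model is single-threaded; the real queue must be maintained atomically in memory shared by both processors.

```cpp
#include <cassert>
#include <deque>

// Hypothetical message descriptor; a real one would reference the
// protocol data unit, its length, the destination thread, and so on.
struct MessageDescriptor { int id; };

class RequestQueue {
public:
    // Strategy 1: plain enqueuing.  The CP learns of a nonempty queue
    // only by polling it from time to time.
    void enqueue(const MessageDescriptor& d) { q.push_back(d); }

    // Strategy 2: a specialization of strategy 1.  Returns true when an
    // interrupt must be sent to the CP: only if this request is the first
    // element and the CP is busy with something other than polling.
    bool enqueue_signaling(const MessageDescriptor& d, bool cp_is_polling) {
        const bool was_empty = q.empty();
        q.push_back(d);
        return was_empty && !cp_is_polling;
    }

    // CP side: fetch the next request, if any.
    bool poll(MessageDescriptor& out) {
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }

private:
    std::deque<MessageDescriptor> q;
};
```

With strategy 2, a burst of transfer requests causes at most one interrupt: only the transition from empty to nonempty signals the CP, and subsequent descriptors are picked up while the CP drains the queue.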
The application determines which configuration actually comes into play. Applications demanding a low end-to-end message latency might be better supported on a computation-coprocessor basis. In contrast, applications demanding a low (local) message startup time might call for the communication-processor solution.

Glossary
Choices—The Class Hierarchical Open Interface for Custom Embedded Systems, an object-oriented operating system developed at the University of Illinois at Urbana-Champaign.
COSY—Peace's communication system, which handles data transfers.
CLUB—Peace's cluster bus, which handles abstraction from the physical network interface.
Manna—A project conducted by GMD First to design and develop a scalable distributed-memory parallel computer to support the execution of parallel numerical and nonnumerical applications.
NICE—Peace's Network Independent Communication Executive, which supports networkwide communication between objects and implements the internucleus protocol.
Peace—An object-oriented operating-system family, designed for distributed-memory parallel machines, that solves high-performance computing problems.
POSE—The Parallel Operating System Extension to the Peace kernel, which performs application-oriented services.
Suprenum—A 320-node supercomputer for parallel numerical applications.
Vote—A virtual shared-memory system that extends the Peace operating system to allow execution of shared-memory-based programs on a distributed-memory architecture.

MESSAGE LATENCY AND STARTUP TIME
Whether both processors or only one processor of a Manna node is enabled depends on the actual Peace kernel configuration. In any case, the NC2 packet transfer is the portion that dominates performance during message startup. This phase involves programming the Manna network I/O device and copying an NC2 message packet to the communication link. Interrupt handling and NC2 upcall handling additionally dominate message latency. The former mainly involves saving and restoring the i860XP registers and pipeline, and the latter involves device programming. Peace copies the incoming NC2 message packet from the communication link (more specifically, out of the receiver FIFO registers) into a message buffer. As in the message-transfer case, this phase mainly concerns device programming. Thus, the performance-limiting factors are i860XP management and network I/O device programming.
Compared with the single-processor mode, the additional functionality provided by the AP/CP and AP/APCP modes of operation is not free. Nucleus locking and unlocking are more expensive, and an atomic AP-to-CP request queue must be maintained. On the other hand, striking facts still plead for these two models. The functional distribution of the nucleus over the two processors benefits from the doubled CPU (code/data) cache space and enables overlapped (parallel) execution of the nucleus. In this case, a larger fraction of nucleus code and data becomes cache-resident. Table 3 compares the two configurations. To identify caching effects, we ran the 80-byte asynchronous message-passing operation once and ten times.
Regarding the delay caused by an asynchronous message-passing operation in the AP case, the CP in the AP/APCP configuration takes over about 68% (noncache case) or 60% (cache case) of the AP's work. Due to the additional overhead introduced by the dual-processor mode of operation, the net improvement is about 24% or 34%, respectively. In this configuration, the overlap fraction compared with the total AP delay is approximately 31% in both the noncache and the cache case.
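The percentages quoted in this section follow directly from the Table 3 delays. A quick cross-check, assuming the 1× column is the noncache case and the 10× column the cache case:

```cpp
#include <cassert>
#include <cmath>

// True when value (a fraction) is within one percentage point of pct.
inline bool near(double value, double pct) {
    return std::fabs(value * 100.0 - pct) < 1.0;
}

inline void check_table3() {
    const double ap_delay[2]    = {41.40, 26.96};  // AP delay: 1x, 10x
    const double apcp_delay[2]  = {31.40, 17.92};  // AP/APCP delay: 1x, 10x
    const double cp_transfer[2] = {28.04, 16.28};  // transfer done by the CP
    const double overlap[2]     = { 9.85,  5.68};  // AP/CP overlap

    // "takes over about 68% (noncache case) or 60% (cache case)"
    assert(near(cp_transfer[0] / ap_delay[0], 68.0));
    assert(near(cp_transfer[1] / ap_delay[1], 60.0));

    // "the net improvement is about 24% or 34%, respectively"
    assert(near((ap_delay[0] - apcp_delay[0]) / ap_delay[0], 24.0));
    assert(near((ap_delay[1] - apcp_delay[1]) / ap_delay[1], 34.0));

    // "the overlap fraction ... is approximately 31%" in both cases
    assert(near(overlap[0] / apcp_delay[0], 31.0));
    assert(near(overlap[1] / apcp_delay[1], 31.0));
}
```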


With respect to the CP transfer times, this AP/CP overlap means that 35% of the CP work has been done when the kernel returns to the application task running on the AP. However, the amount of CP work done depends on the message size. Table 3, and the analysis based on the numbers shown, relates to the message-passing overhead of the Peace nucleus independent of the actual message size.
When issuing a message-passing operation in the AP/APCP configuration, user-level execution of the sender thread resumes significantly earlier than in the AP mode. Furthermore, both the return from the nucleus and user-level thread processing (performed by the AP) overlap with nucleus-level data transfer to the communication link (performed by the CP). In particular, setting up the next message-transfer request can overlap with the transmission of the previous message.
The AP/APCP configuration implements communication pipelining. This improves the overall runtime behavior of communication-intensive threads and makes the AP/APCP mode, in this particular case, superior to the AP mode. On the other hand, since the AP/APCP model is a software-configuration matter, computation-intensive threads might benefit from the computing power offered by the two processors. Peace supports communication and computation bursts quite well, which especially holds for applications alternating between these two periods.

To determine the overall system performance and compare the Peace-Manna approach to other approaches, we ran the Linpack (1,000 × 1,000) benchmark. Table 4 shows the speedup achieved with respect to various system configurations.

Table 4. Speedup of the Linpack 1,000 × 1,000 benchmark.

SYSTEM              2        4        8       16
Manna AP          1.97     3.75     6.90    11.83
Manna AP/APCP     3.14     5.81    10.35    17.39
Intel iPSC/860    1.68     2.42     3.31     4.22
Meiko CS2         1.88     3.20     4.79     6.12
Cray C90          1.89     3.63     6.85    11.95

The Linpack benchmark used the PVM implementation of Peace to close the gap between the object-oriented kernel and the Fortran program. In the AP/APCP case, we tuned the PVM library implementation specifically with respect to the dual-processor-node architecture. That is, we enabled the asymmetric multiprocessor configuration, with dedicated user-level PVM threads taking care of all communication activities. We implemented the PVM mailbox as a multithreaded active object.
As shown in Table 4, the performance of the Manna AP configuration was quite good, although only one processor on every node is used for computation and communication; that is, the second processor simply remained unexploited. The Manna AP/APCP case exhibits the best speedup of all configurations. Thus, although the two-processor mode is more overhead-prone at the kernel level, the user-level end-to-end performance increased.
We made similar observations for the 3D-IFS application, which we ran on an 8- and 16-node Manna, CM-5, and Cray C90, and on an 8-node IBM SP1. With 7.8 and 15.4 for the 8- and 16-node configurations, respectively, Manna showed the best speedup. Regarding the shortest total runtime, only the Cray C90 outperformed Manna.
Finally, Vote's message-passing mode outperformed the PVM implementation. Exploiting Vote abstractions to communicate 4-Kbyte-sized pages yielded better performance than did exploiting PVM primitives to communicate 4-Kbyte-sized messages. That is, in the case of page-aligned message passing, Vote is superior to PVM. The main reasons for the better Vote performance are the absence of intermediate message (page) buffering, the avoidance of local memory-to-memory copying, and the tight interlocking of low-level Vote functions with the kernel. Compared with a simple data or page transfer, these facts help compensate for the additional overhead introduced by Vote to handle page faults (traps), locate the page-holding site, analyze the access fault, and program the MMU.

Shared-memory programming is still the most common and popular way of using parallel machines for high-performance computing. This programming style is based on a well-known methodology, supported by high-quality programming environments (for example, compilers and debuggers) and mature libraries. Moreover, this style will remain dominant in the near future due to the lack of other accepted or pioneering approaches to parallel programming. Consequently, during the last decade, considerable effort went into applying the


shared-memory paradigm to distributed-memory parallel machines. This led to the development of various hardware- and software-supported VSM systems.
So far, the performance of many existing VSM systems has not been very promising. These systems' performance defects have established the myth that VSM-based programming of distributed-memory machines is not appropriate for high-performance computing. However, the Vote concepts presented here disprove this myth through a symbiosis of architectural transparency and efficiency.
Vote provides a smooth migration path for shared-memory-based programs on distributed-memory architectures. As an extension to the Peace parallel operating system, Vote benefits from the advantages offered by problem-oriented kernels tailored to particular application demands.
The program family concept strongly influenced the design and implementation of Vote and Peace. This concept is a means to an end but not an end in itself. It requires that the designer be very disciplined and precise, if not pedantic, in the construction of operating-system software. Typically, an operating-system family exhibits a highly modular and hierarchical software structure. The hierarchy is built by postponing design decisions that would restrict the already designed family. A major challenge is deciding which functions to exclude from a design. In other words, functions of both Vote and Peace are introduced only on a per-application basis (that is, if demanded by a higher layer). The new functions define an offspring of the existing family.
Always forcing oneself to reason about the necessity of a certain function at a certain level is often deemed too academic and less pragmatic, causing an overstructuring of the resulting system—but it is worth it. Self-assessment during the design (and implementation) process helps reduce the complexity—and, thus, increase the performance—of lower-level software.

ACKNOWLEDGMENTS
This work was partially supported by the Commission of the European Communities, Projects ITDC-207 and 9124 (Shared Memory on Distributed Architectures), and the German-Brazilian Cooperative Programme in Informatics, Project No. 523112/94-7 (Parallel and Flexible Environmental Program in Informatics).

REFERENCES
1. W. Schröder-Preikschat, The Logical Design of Parallel Operating Systems, Prentice Hall, Upper Saddle River, N.J., 1994.
2. W.K. Giloi, "The SUPRENUM Supercomputer: Goals, Achievements, and Lessons Learned," Parallel Computing, Vol. 20, Nos. 10–11, Nov. 1994, pp. 1407–1425.
3. R. Campbell, G. Johnston, and V. Russo, "Choices (Class Hierarchical Open Interface for Custom Embedded Systems)," Operating Systems Rev., Vol. 21, No. 3, July 1987, pp. 9–17.
4. D.L. Parnas, "Designing Software for Ease of Extension and Contraction," IEEE Trans. Software Eng., Vol. SE-5, No. 2, Mar. 1979, pp. 128–137.
5. J. Cordsen and W. Schröder-Preikschat, "On the Coexistence of Shared-Memory and Message-Passing in the Programming of Parallel Applications," Proc. HPCN Europe '97, Lecture Notes in Computer Science, Vol. 1225, Springer-Verlag, New York, Apr. 1997, pp. 718–727.
6. M. Tam, J.M. Smith, and D.J. Farber, "A Taxonomy-Based Comparison of Several Distributed Shared Memory Systems," ACM Operating Systems Rev., Vol. 24, No. 3, July 1990, pp. 40–67.
7. L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Computers, Vol. C-28, No. 9, Sept. 1979, pp. 690–691.
8. G. Cabillic, T. Priol, and I. Puaut, Myoan: An Implementation of the Koan Shared Virtual Memory on the Intel Paragon, Tech. Report 812, Irisa, Campus Universitaire de Beaulieu, Rennes, France, 1994.
9. D.D. Clark, "The Structuring of Systems Using Upcalls," Operating Systems Rev., Vol. 19, No. 5, 1985, pp. 171–180.
10. R. Esser and R. Knecht, "Intel Paragon XP/S—Architecture and Software Environment," Proc. Supercomputer '93, Lecture Notes in Computer Science, Springer-Verlag, 1993, pp. 121–141.
11. S.R. Wheat et al., "Puma: An Operating System for Massively Parallel Systems," Proc. 27th Ann. Hawaii Int'l Conf. System Sciences, Vol. II, IEEE Computer Society Press, Los Alamitos, Calif., 1994, pp. 56–65.

Jörg Cordsen is a research associate and leads the Communication Support System Group at the German National Research Center for Information Technology (GMD), Research Institute for Computer Architecture and Software Technology (FIRST) in Berlin. His main research interests are distributed, parallel operating systems; programming models for distributed, parallel systems; networks of SMPs; software-implemented multicasts; and memory-consistency models. He received his diploma degree and his PhD from the Technical University of Berlin. Readers can contact Cordsen at GMD FIRST Berlin, Rudower Chaussee 5, D-12489 Berlin, Germany.


Thomas Garnatz is a system engineer at Siemens Communications Test Equipment GmbH in Berlin. His main research interests are operating systems, computer networks, and file systems. Garnatz received his diploma degree in computer science at the Technical University of Berlin. Readers can contact Garnatz at Siemens Communications Test Equipment GmbH, Design and Development (SCTE-E1), Wernerwerkdamm 5, D-13629 Berlin, Germany;

Anne Gerischer is a system engineer at Dakosy GmbH (Data Communications Systems for the Transport Sector) in Hamburg, Germany. Her main research interests are object-oriented software design methods, database technology, and operation management systems. Gerischer received her diploma degree in computer science at the Technical University of Berlin. Readers can contact Gerischer at Dakosy (Daten-Kommunikations-Systeme GmbH), Cremon 9, D-20457 Hamburg, Germany; http://www.

Marco Dimas Gubitoso is an assistant at the University of São Paulo. His main research interests are performance analysis, parallel processing, and high-performance computing. He holds a bachelor's and a master's degree in physics and a PhD in computer science, all from the University of São Paulo. Readers can contact Gubitoso at Rua do Matao, 1010, Departamento de Ciencia da Computacao, Instituto de Matematica e Estatistica, Universidade de São Paulo, Cidade Universitaria, São Paulo, SP, Brazil, 05508-900;

Ute Haack is a research associate at the Computer Science Department of the Otto-von-Guericke University in Magdeburg, Germany. She received her diploma degree in computer science from the Technical University of Berlin. Her main research interests are in operating systems for heterogeneous distributed embedded systems. Readers can contact Haack at Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, D-39106 Magdeburg, Germany; ute@cs.

Michael Sander is a senior system engineer at Siemens Communications Test Equipment GmbH in Berlin. Sander received his diploma degree in computer science from the Technical University of Berlin. He is a member of the German Computer Society. Readers can contact Sander at Siemens Communications Test Equipment GmbH, Design and Development (SCTE-E1), Wernerwerkdamm 5, D-13629 Berlin, Germany;

Wolfgang Schröder-Preikschat is a full professor of computer science (operating systems and distributed systems) at the Otto-von-Guericke University in Magdeburg, Germany. His main research interests are embedded and distributed-parallel operating systems, object-oriented software construction, communications systems, and computer architecture. He received his diploma degree, PhD, and venia legendi (appointment as university lecturer), all from the Technical University of Berlin. Schröder-Preikschat is a member of the ACM, the Forum of Computer Professionals for Peace and Social Responsibility, the German Computer Society, the IEEE, and the Association of German Electrical Engineers. Readers can contact Schröder-Preikschat at Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, D-39106 Magdeburg, Germany; wosch@cs.

                                  Next Issue

                      Software Engineering for
                    Parallel & Distributed Systems

This special issue will cover advances in software engineering that facilitate the construction of inherently parallel and distributed systems with complex coordination, communication, and concurrency requirements. The articles highlight problems and solutions through significant case studies or commercial and industrial application development. In particular, they present the results of recent research, report experiences in the application of new techniques, or review current collaborative projects and software tools.

The articles include

• "Hypersequential Programming: A New Paradigm for Concurrent Program Development," by Naoshi Uchihira, Shinichi Honiden, and Toshibumi Seki.
• "Improving a Parallel Embedded Industry Application with Trapper," by Frank Heinze, Lorenz Schäfers, Christian Scheidler, and Wolfgang Obelöer.
• "Model-Driven Distributed Systems," by Ian A. Coutts and John M. Edwards.
• "Software Engineering Methods for Parallel Applications in Scientific Computing: Project Sempa," by P. Luksch, U. Maier, S. Rathmayer, M. Weidmann, and Friedemann Unger.
• "Survey and Market Structure of HPCN Software Tools," by Daron G. Green, Chris J. Scott, Adrian Colbrook, and Mike Surridge.

The special issue also includes a Virtual Roundtable, featuring Alan Fekete and John Potter, Douglas C. Schmidt, Jeffrey Kramer, John A. Stankovic, and Hassan Gomaa. (See http://

April–June 1997                                                                                                                 27
