Proceedings of the 1997 Winter Simulation Conference
ed. S. Andradóttir, K. J. Healy, D. H. Withers, and B. L. Nelson

Execution-Driven Simulators for Parallel Systems Design

Anand Sivasubramaniam

Department of Computer Science and Engineering
The Pennsylvania State University
University Park, Pennsylvania 16802, U.S.A.

ABSTRACT

Evaluating, analyzing and predicting the performance of a parallel system is challenging due to the complex interplay between the application characteristics and architectural features. The overheads in a parallel system that limit its scalability have to be identified and separated in order to enable performance-conscious parallel application design and the development of high-performance parallel machines. We have developed an evaluation framework that uses a combination of experimentation, simulation and analytical modeling to quantify these parallel system overheads. At the heart of this framework is an execution-driven simulation testbed called SPASM which uses a suite of real applications as the workload. We discuss our experiences in using this simulator in a wide range of architectural projects in this paper.

1 INTRODUCTION

High Performance Computing is becoming increasingly important to scientific advancement and economic development and is at the point of significantly improving our standard of living. With the inherent limitations of sequential computing, parallel machines have been proposed as the solution for high-performance computing. Despite their promise and attractiveness to the research community, parallel machines have not been very successful in the commercial world due to two main reasons. First, their delivered performance often falls short of the projected peak performance. Second, the cost of these machines is high compared to their sequential counterparts. For the success of parallel computation, we should build machines that bridge the gap between projected and delivered performance over a spectrum of important real-world applications in a cost-effective manner. Performance evaluation of parallel systems plays a crucial role towards this goal.

Applications exhibit different characteristics, thus imposing diverse demands on the underlying hardware, while parallel machines also come in several flavors. To find out how good a job a machine does of meeting an application’s demands, we need a way of evaluating the match between an application and an architecture. Evaluating the performance of an application-architecture combination has widespread applicability in parallel systems research. The results from such an evaluation may be used to: select the best architecture platform for an application domain, select the best algorithm for solving the problem on a given hardware platform, predict the performance of an application on a larger configuration of an existing architecture, predict the performance of large application instances, identify application and architectural bottlenecks in a parallel system to suggest application restructuring and architectural enhancements, and evaluate the cost vs. performance trade-offs in important architectural design decisions. But evaluating and analyzing the performance of parallel systems pose several problems due to the complex interaction between application characteristics and architectural features.

Performance evaluation techniques have to grapple with several more degrees of freedom exhibited by parallel systems compared to their sequential counterparts. Experimentation and measurement on actual hardware, analytical modeling and simulation are three well-known performance evaluation techniques. But each technique has its own limitations. Experimentation requires the hardware to be built, analytical models often make unreasonable assumptions about the underlying system to keep the modeling tractable, and simulation requires immense resources in terms of storage and time.

In this paper, we summarize our previous and ongoing effort in developing a framework for evaluating the performance of parallel systems and using this framework to develop cost-effective platforms that meet the demands of numerous real-world applications. First, we identify performance metrics which are essential to understand the intrinsic algorithmic and architectural artifacts that impact the performance of a parallel system. Next, we outline an evaluation framework that we have developed to quantify these metrics. The framework uses all three performance evaluation techniques to alleviate their individual limitations. At the heart of this framework lies SPASM (Simulator for Parallel Architectural Scalability Measurements) which provides detailed performance profiles for applications on a range of parallel hardware platforms. This simulator helps identify, isolate and quantify the algorithmic and architectural bottlenecks in an execution, which can be used for application restructuring and to suggest architectural enhancements. In the rest of the paper, we illustrate the utility of an execution-driven simulator such as SPASM in several architectural projects.

The rest of this paper is organized as follows. In Section 2, we identify performance metrics that we require from evaluating a parallel system and discuss different evaluation techniques. Section 3 outlines our evaluation framework and the SPASM simulator. Section 4 summarizes our experience and results in using execution-driven simulators for several architectural projects. Finally, Section 5 presents concluding remarks.

2 EVALUATING PARALLEL SYSTEMS

In conducting any evaluation, we need to identify a set of performance metrics that we would like to measure and the techniques and tools that will be used to gather these metrics.

2.1 Performance Metrics

Metrics which capture the “available” compute power (MFLOPS, MIPS, etc.) are often not a true indicator of the performance actually “delivered” by a parallel system. Metrics for parallel system performance evaluation should quantify this gap between available and delivered compute power since understanding application and architectural bottlenecks is crucial for application restructuring and architectural enhancements. Many performance metrics, such as speedup, scaled speedup and isoefficiency, have been proposed to quantify the match between the application and architecture in a parallel system. While these metrics are useful for tracking overall performance trends, they provide little additional information about where performance is lost. Some of these metrics attempt to identify the cause (the application or the architecture) of the problem when the parallel system does not scale as expected. However, it is essential to find the individual application and architectural artifacts that lead to these bottlenecks and quantify their relative contribution towards limiting the overall scalability of the system. Traditional metrics do not help further in this regard.

Parallel system overheads may be broadly classified into a purely algorithmic component (algorithmic overhead), a component arising from the interaction of the application with the system software (software interaction overhead), and a component arising from the interaction of the application with the hardware (hardware interaction overhead). Algorithmic overheads arise from the inherent serial part in the application, the work-imbalance between the executing threads of control, any redundant computation that may be performed, and additional work introduced by the parallelization. Software interaction overheads such as overheads for scheduling, message-passing, and software synchronization arise due to the interaction of the application with the system software. Hardware slowdown due to network latency (the transmission time for a message in the network), network contention (the amount of time spent in the network waiting for availability of network resources), synchronization and cache coherence actions contributes to the hardware interaction overhead. Each of these components would cause the performance to deteriorate from the available compute power (potential peak performance) of the hardware. To fully understand the scalability of a parallel system, it is important to isolate and quantify the impact of each of these components on the overall execution. In our earlier research, we have proposed the notion of an overhead function (Sivasubramaniam et al. 1994) that tracks the growth of a particular system overhead with respect to a specific system parameter.

2.2 Evaluation Techniques

Experimentation, analytical modeling and simulation are three well-known techniques for evaluating parallel systems. Experimentation involves implementing the application on the actual hardware and measuring its performance. Analytical models abstract hardware and application details in a parallel system and capture complex system features by simple mathematical formulae. These formulae are usually parameterized by a limited number of degrees of freedom so that the analysis is kept tractable. Simulation is a valuable technique which exploits computer resources to model and imitate the behavior of a real system in a controlled manner. Each technique has its own limitations. The amount of statistics that can be gleaned by experimentation (to quantify the overhead functions) is limited by the monitoring/instrumentation support provided by the underlying system. Additional instrumentation can sometimes perturb the evaluation. Analytical models are often criticized for the unrealistic and simplifying assumptions made in expressing the complex interaction between the application and the architecture. Simulation of realistic computer systems demands considerable resources in terms of space and time.

3 THE FRAMEWORK

Figure 1: The Framework (diagram labels: Experimentation, Analytical, Simulation, Kernels, Speedup Simulation, Refine Models, Results)

We have developed an evaluation framework that uses a combination of the three techniques to avoid some of their individual drawbacks. Experimentation is used to implement real-world applications on parallel machines, to understand their behavior and extract interesting kernels (abstractions of applications that capture representative phases of the execution) that occur in them. These kernels are fed to an execution-driven simulator called SPASM which faithfully models the details of the parallel system interactions. The statistics that are drawn from the simulation are used to develop new analytical models or to validate and refine existing models. Simulation is used for detailed study of smaller systems in a non-intrusive manner. Analytical models are used to complement the simulation results to project the performance and overheads for larger systems (than those that can be simulated). When an analytical model is sufficiently validated/refined, it may be possible to use this model in the simulator itself to abstract details in the simulation to ease resource requirements. Using this approach, we have illustrated (Sivasubramaniam et al. 1995a) how the details of cache simulation and the details of interconnection network simulation may be abstracted by suitable models to gain substantial savings in the simulation time.

At the heart of our evaluation framework lies a simulation platform called SPASM which is used to identify, isolate and quantify the individual parallel system overheads.

SPASM is an execution-driven simulator written in CSIM used for simulating the execution of a parallel program on a parallel machine. As with other recent simulators, the bulk of the instructions in the parallel program is executed at the speed of the native processor (SPARC in our studies) and only instructions such as LOADs/STOREs on a shared memory platform, and SENDs/RECEIVEs on a message passing platform, that may potentially involve a network access are simulated. The rationale behind this approach is that since uniprocessor architecture is getting standardized with the advent of RISC technology, we can fix most of the processor characteristics (such as instruction sets, clocks per instruction, floating point capabilities, pipelining) by using a commodity processor as the baseline for each processor in our parallel system. A detailed simulation of the processor architecture is not likely to contribute significantly to our understanding of the scalability of a parallel system. The input to the simulator is parallel applications written in C. On a message passing system, the calls (SENDs/RECEIVEs) which trap to the simulator are inserted into the application program explicitly by the programmer. On a shared memory system, a pre-processor inserts code into the application program to trap to the simulator on a shared memory reference. On both systems, the compiled assembly code is augmented with cycle counting instructions which are used to keep track of the time spent in the application program since the last trap to the simulator. Finally, the assembled binary is linked with the rest of the simulator code.

A simulation platform like SPASM allows us to vary a wide range of hardware parameters such as the number of processors, the CPU clock speed, the network topology, the bandwidth of the links in the network, the network switching delays, and the cache parameters (the block size, cache size, associativity, etc.). SPASM gives a wide range of statistics that isolate and quantify the contribution of each parallel system overhead on the overall execution time of the application. Further, these overheads can be quantified for different phases of the execution, which can help in performance debugging for application restructuring and for suggesting architectural enhancements.

4 PROJECTS USING THE FRAMEWORK

We have used the above framework in a wide spectrum of architectural projects that are summarized below. Even though SPASM has been used to model and study message passing systems, the projects discussed here have used only its shared memory capabilities.

4.1 Validating Abstractions

Abstracting features of parallel systems is a technique often employed in performance analysis and algorithm development. For instance, abstracting parallel machines by theoretical models like the PRAM has facilitated algorithm development and analysis. Such models try to hide hardware details from the programmer, providing a simplified view of the machine. Similarly, analytical models used in performance evaluation abstract complex system interactions with simple mathematical functions, parameterized by a limited number of degrees of freedom so that they remain tractable. Abstractions are also useful in execution-driven simulators where details of the hardware and the application can be captured by abstract models in order to ease the demands on resource (time and space) usage in simulating large parallel systems. Some simulators already abstract details of instruction-set simulation, since such a detailed simulation is not likely to contribute significantly to the performance analysis of parallel systems.

An important question that needs to be addressed in using abstractions is their validity. Our framework serves as a convenient vehicle for evaluating the accuracy of these abstractions using real applications. In (Sivasubramaniam et al. 1995a), we have illustrated the use of the framework to evaluate the validity and use of abstractions in simulating the interconnection network and locality properties of parallel systems. An outline of the evaluation strategy and results are presented below.

For abstracting the interconnection network, we have used the recently proposed LogP model that incorporates the two defining characteristics of a network, namely, latency and contention. For abstracting the locality properties of a parallel system, we have modeled a private cache at each processing node in the system to capture data locality. Shared memory machines with private caches usually employ a protocol to maintain coherence. With a diverse range of cache coherence protocols, our abstraction would become too specific if it were to model any particular protocol. Further, memory references (locality) are largely dictated by application characteristics and are relatively independent of cache coherence protocols. Hence, instead of modeling any particular protocol, we have chosen to keep the caches coherent in our abstraction without modeling the overheads associated with maintaining the coherence. Such an abstraction would represent an ideal coherent cache that captures the true inherent locality in an application. Furthermore, if our abstraction closely models the behavior of a machine with a simple cache coherence protocol, then it would even more closely model the behavior of a machine with a fancier cache coherence protocol.

We have used our simulation framework for evaluating these abstractions. We have compared the results from simulating the five applications on a machine incorporating these abstractions with the results from an exact simulation of the actual hardware. Our results show that the latency overhead modeled by LogP is fairly accurate. On the other hand, the contention overhead modeled by LogP can become pessimistic for some applications since the model does not capture communication locality. The pessimism gets amplified as we move to networks with lower connectivity. With regard to data locality, results show that our ideal cache, which does not model any coherence protocol overheads, is a good abstraction for capturing locality over the chosen range of applications.

Apart from evaluating these abstractions in the context of real applications, the isolation and quantification of parallel system overheads has helped us validate the individual parameters used in each abstraction. The simulation of the system which incorporates these two abstractions is around 250-300% faster than the simulation of the actual machine. Using a similar approach, one may also use this framework to refine existing models (like reducing the pessimism in LogP in modeling contention), or even develop new models for accurately capturing parallel system behavior.

4.2 Synthesizing Network Requirements

For building a general-purpose parallel machine, it is essential to identify and quantify the architectural requirements necessary to assure good performance over a wide range of applications. Such a synthesis of requirements from an application view-point can help us make cost vs. performance trade-offs in important architectural design decisions. Our framework provides a convenient platform to study the impact of hardware parameters on application performance and use the results to project architectural requirements. We have conducted such a study in (Sivasubramaniam et al. 1995b) towards synthesizing the network requirements of the applications mentioned earlier, and the experimental strategy along with interesting results from our study are summarized here.

To quantify link bandwidth requirements for a particular network topology, we have simulated the execution of the applications on such a topology and varied the bandwidth of the links in the network. As
the bandwidth is increased, the network overheads (latency and contention) decrease, yielding a performance that is close to the ideal execution. From these results, we have arrived at link bandwidths that are needed to limit network overheads (latency and contention) to an acceptable level of the overall execution time. We have also studied the impact of the number of processors, the CPU clock speed and the application problem size on bandwidth requirements. The computation to communication ratio tends to decrease when the number of processors or the CPU clock speed is increased, making the network requirements more stringent. An increase in problem size improves the computation to communication ratio, lowering the bandwidth needed to maintain an acceptable efficiency. Using regression analysis and analytical techniques, we have extrapolated requirements for systems built with a larger number of processors.

The results from the study suggest that existing link bandwidth of 200-300 MBytes/sec available on machines like Intel Paragon and Cray T3D can easily sustain the requirements of some applications even on high-speed processors of the future. For the other applications studied, one may be able to maintain network overheads at an acceptable level if the problem size is increased commensurate with the processing speed.

The separation of the parallel system overheads plays an important role in synthesizing the communication requirements of applications. For instance, an application may have an algorithmic deficiency due to either a large serial part or work-imbalance, in which case 100% efficiency is impossible regardless of other architectural parameters. The separation of overheads enables us to quantify bandwidth requirements as a function of acceptable network overheads (latency and contention). The framework may also be used for synthesizing requirements of other architectural features such as synchronization primitives and locality capabilities from an application perspective.

4.3 Deriving Architectural Mechanisms

The single most important overhead limiting performance of parallel applications is the communication overhead. One solution is to make the network as fast as possible so that even though the application does not make any fewer network accesses, the overheads will not manifest as a significant component of the total execution time. But the resources to sustain the necessary bandwidth may simply not be available in some cases. The second approach is to reduce the network accesses incurred in the execution or to tolerate the communication overhead if these accesses are unavoidable. Cache coherence protocols, weak memory consistency models, prefetch, poststore, and multithreading are some of the proposed latency reducing and tolerating techniques in the context of shared memory architectures. It has been shown that no one technique is universally applicable for all applications. On the other hand, a close examination of the communication behavior of a range of applications can help derive a set of architectural mechanisms that may prove beneficial, and we have conducted such a study in (Ramachandran et al. 1995) using our evaluation framework. By examining the communication properties of applications, we have proposed a set of explicit communication primitives that are generalizations of the poststore and prefetch mechanisms.

Cache coherence protocols broadly fall into two categories: write-invalidate and write-update. Invalidation-based schemes are more suited to migratory data and can become inefficient when the producer-consumer relationship for shared data remains relatively unchanged during the course of execution. On the other hand, update-based protocols can result in significant overheads due to repeated updates to the same data before they are used by another processor, as well as redundant updates when there are changes to the sharing pattern of a data item. The update and invalidation based schemes thus have their relative advantages and disadvantages, and based on application characteristics one may be preferable over the other. Invalidations are useful when an application changes its sharing pattern, and updates are useful to effect direct communication once a sharing pattern is established.

By examining the communication properties of a spectrum of applications, we have derived a set of explicit communication primitives that use sender-initiated communication within the context of an underlying invalidation-based protocol. The three proposed primitives intelligently propagate the data items to one or more consumers as soon as the data items are produced. The first primitive is intended for applications with static communication behavior where the consumer set of a data item is available at compile time. As a result, this set can be directly supplied to the hardware when the data item is produced. The second primitive is intended for variables governed by locks and it uses the lock structure to propagate data items to the processor next in line for the lock. The third primitive is for applications with dynamic communication behavior; it detects the arrival of a new consumer to a current sharing pattern, and uses this information to intelligently mix invalidates with updates.

The execution-driven simulation of real applications has played an important role in this exercise. It
has helped us identify and isolate typical communi-           SPASM is perhaps the first execution-driven simu-
cation scenarios in applications and derive a set of      lator that has been used to integrate both these view-
mechanisms that can optimize these scenarios. It          points into a single evaluation framework. It has been
has also helped us evaluate the cost-effectiveness of      used extensively to study performance over a wide
these primitives, and the benefits of these primitives     range of real applications and network parameters.
over alternate mechanisms. A related study (Shah,         For instance, in (Vaidya, Sivasubramaniam, and Das
Singla, and Ramachandran 1995) develops a realis-         1997a) we have used it to study the performance of a
tic model for a shared memory machine, and using          2-dimensional mesh network for 5 shared memory ap-
SPASM shows that for a spectrum of applications al-       plications. The specific aim in this study is to verify
most all the inherent communication in them may           whether the promised performance improvement (for
be overlapped with computation. This serves as the        synthetic workloads) using recently proposed network
motivation for further research in developing explicit    enhancements, such as virtual channels and adaptive
sender-initiated communication mechanisms.                routing, is indeed obtained for real applications, and
                                                          if so do these benefits override the cost of providing
4.4    Evaluating Network Designs                         these enhancements.
                                                              The performance results show that there is a mod-
The complex interaction between a parallel architec-      est performance benefit with these enhancements in
ture and an application makes it essential to use real-   the average network latency for the messages. How-
istic workloads for evaluating parallel systems. Per-     ever, with respect to the overall execution time, this
formance analyses of processors, caches, memory and       improvement is dwarfed in comparison to the other
I/O subsystems have therefore been conducted with         components which constitute the execution time. When
parallel benchmarks. However, unlike other subsys-        considered in the context of application scalability
tems, the design and analysis of the interconnection      in terms of the number of processors and the prob-
network, which is perhaps the most crucial hardware       lem considered, even though many of the considered
component in a parallel machine, has rarely used the      applications inject a large number of messages into
knowledge of workloads generated by parallel appli-       the network, their arrival into the network does not
cations.                                                  seem to generate any significant contention for net-
    There are two differing perspectives of viewing        work resources. Consequently, virtual channels and
the multiprocessor interconnection network. From          adaptive routing algorithms, which attempt to lower
the viewpoint of a software designer or an applica-       the network contention and not the raw network la-
tion programmer, it helps to make certain simplify-       tency, do not show substantial saving in execution
ing assumptions about the interconnection network         time. Further, our results suggest that the perfor-
such as assuming a constant delay or a simple model       mance rewards may not justify the cost of these en-
which does not take into account the details of mes-      hancements unless an application is highly commu-
sage traversal within the actual network. These as-       nication intensive and potentially scaling poorly. On
sumptions are sufficiently accurate when the objec-         the other hand, if any of these enhancements were to
tive is to minimize the communication required. By        slow down the network router, then there is a signif-
making these assumptions, performance evaluation of       icant degradation in performance.
the system can be simplified and speeded-up. Inter-            This study (Vaidya, Sivasubramaniam, and Das
connection network designers have a more network-         1997a) has served has the motivation for yet another
centric viewpoint. From this viewpoint, improving         project (Vaidya, Sivasubramaniam, and Das 1997b)
the network performance is critical. Network topol-       where we are trying to develop better routers for
ogy, switching mechanism, routing, flow control, and       interconnection networks. In this project, we have
communication workload, together determine the net-       formalized a pipelined model for the network router,
work performance. Until recently, network research        and we have evaluated the trade-offs between differ-
has primarily focussed on the first four parameters to     ent router designs using our simulator. We have also
optimize network latency and throughput. Network          proposed and evaluated dynamically adaptable selec-
designers have traditionally used synthetic benchmarks    tion functions within the router to route messages
to evaluate their designs. At best, these benchmarks      along less congested paths.
try to mimic some typical communication behavior in
applications. The performance results derived from
                                                          4.5    Characterizing Communication Behavior
synthetic workloads can provide a general guideline
or bounding values, while it may be difficult to make       Characterization of the communication in parallel ap-
cost-performance architectural design decisions using     plications is essential in understanding their interplay
these results.                                            with parallel architectures, to maximize the perfor-
                             Execution-Driven Simulators for Parallel Systems Design                          1027

mance of existing architectures and to design better        destination distribution.
architectures in the future. The communication traf-            The results obtained from the analysis of the ap-
fic of a parallel application can be captured by three       plication traces show that the inter-arrival times of
attributes namely the temporal, spatial and volume          all applications except one can be fitted to known
components. Temporal behavior is captured by the            probability distribution functions, which are varia-
message generation rate, spatial behavior is expressed      tions of exponential distribution. Also, the average
in terms of the message distribution or traffic pattern,      message generation rate can be obtained for the un-
and volume of communication is specified by the num-         derlying distribution. For the spectrum of applica-
ber of messages and the message length distribution.        tions considered, the message generation distribution
These three attributes together define the communi-          can be expressed in terms of exponential, hypoex-
cation workload and have been used extensively in           ponential or weibull distributions. Our results also
many types of architectural evaluations. In particu-        confirm that the spatial distributions of parallel ap-
lar, one of the most extensively studied areas of re-       plications can be captured mathematically. For the
search in parallel architectures is the interconnection     applications considered, the spatial distributions are
networks. A plethora of network topologies that sup-        uniform, bimodal uniform and univariate polynomial.
port various types of switching mechanism and mes-          The sensitivity of these results to different application
sage routing algorithms have been proposed to design        and hardware parameters has also been studied. We
scalable parallel machines. Performance analyses of         have found that only the means of the distributions
all these networks either via simulation or analysis        change as we vary many of the parameters. These
require the above three communication attributes.           results lead us closer to the belief that it is possible
    In the previous subsection, we discussed two stud-      to abstract the communication properties of parallel
ies that have studied the network for real applications     applications in convenient mathematical forms that
using execution-driven simulation. But, such a de-          have wide applicability.
tailed simulation of the network makes the evaluation
exceedingly slow. Mathematical models, on the other         5   CONCLUDING REMARKS
hand, do not suffer from this drawback. However,
most of these models for interconnection networks           Performance evaluation is an integral part of any sys-
have been accused of making unrealistic assumptions         tems design process to: evaluate the cost-effectiveness
about the communication workload. It is not clear           of a given design, compare different designs, and de-
what different traffic patterns are generated by par-          rive alternate designs. This process is particularly
allel applications and how these traffic patterns can         made more difficult for parallel systems where the
be captured by a distribution function for subsequent       complex interaction between application and archi-
study. Therefore, the credibility of many model-based       tecture introduces several more degrees of freedom
performance results has been questioned frequently.         compared to their sequential counterparts.
    It is thus crucial to develop some formal tech-             Performance evaluation techniques should clearly
niques to capture the communication properties of           isolate and quantify the different overheads in a paral-
parallel applications. The novelty of such a charac-        lel system execution that limit its scalability. Exper-
terization is that these attributes can be useful for       imentation on the actual system, analytical modeling
many divergent studies: a system architect can use          and simulation are three well-known techniques. But
the communication information for better architec-          each has its own limitations. Execution-driven simu-
tural design; an algorithm developer can use the com-       lation offers the most promise because of its ability to
munication cost for better algorithm design and anal-       study the parallel system accurately and in great de-
ysis; and a system analyst can develop more accurate        tail in a non-intrusive manner. However, we need to
performance models using realistic workloads.               confine ourselves to smaller systems with this tech-
    In (Chodnekar et al. 1997) and (Seed, Sivasubra-        nique, and complement the evaluation with mathe-
maniam, and Das 1997), we have embarked on char-            matical models and experimentation to extrapolate
acterizing the communication traffic generated by a           performance for larger systems. In this paper, we
spectrum of applications using SPASM. We conduct            have described one such simulator called SPASM that
a detailed execution-driven simulation on a chosen          has been used extensively to study parallel architec-
network configuration for each application. The net-         tures over a spectrum of applications. We have also
work logs the arrival of messages along with the time       briefly discussed five architectural projects that have
of arrival, length and destination information. These       used this simulator.
logs are then presented to a statistical package (SAS)          Recent trends show that a Network of Worksta-
for regression analysis to calculate the message gen-       tions (NOW) is a cost-effective solution for high per-
eration rate, the message length distribution, and the      formance computing. There are a wide range of ar-
1028                                          Sivasubramaniam

chitectural issues that need to be addressed if this        1994 Conference on Measurement and Modeling
platform is to become more prevalent. Our ongo-             of Computer Systems, 171–180.
ing research is focusing on architectural projects to-   Sivasubramaniam, A., A. Singla, U. Ramachandran,
wards this goal. We have recently implemented an            and H. Venkateswaran. 1995. Abstracting net-
execution-driven simulator called pSNOW (Kasbekar,          work characteristics and locality properties of par-
Nagar, and Sivasubramaniam 1997) to specifically study       allel systems. In Proceedings of the First Inter-
hardware and system software issues for NOW plat-           national Symposium on High Performance Com-
forms. We intend to use this simulator to design and        puter Architecture, 54–63.
evaluate architectural innovations concurrently with     Sivasubramaniam, A., A. Singla, U. Ramachandran,
the development of an actual prototype in our labo-         and H. Venkateswaran. 1995. On characteriz-
ratory.                                                     ing bandwidth requirements of parallel applica-
                                                            tions. In Proceedings of the ACM SIGMETRICS
ACKNOWLEDGMENTS                                             1995 Conference on Measurement and Modeling
                                                            of Computer Systems, 198–207.
This research is supported in part by a NSF Career       Vaidya, A., A. Sivasubramaniam, and C. Das. 1997.
Award MIP-9701475 and equipment grants from NSF             Performance benefits of virtual channels and adap-
and IBM.                                                    tive routing: An application-driven study. In Pro-
                                                            ceedings of the ACM 1997 International Confer-
REFERENCES                                                  ence on Supercomputing, 140–147.
                                                         Vaidya, A., A. Sivasubramaniam, and C. Das. 1997.
Chodnekar, S., V. Srinivasan, A. Vaidya, A. Sivasub-        The PROUD pipelined routers for high perfor-
   ramaniam, and C. Das. 1997. Towards a commu-             mance networks. Technical Report CSE-97-007,
   nication characterization methodology for parallel       Department of Computer Science and Engineer-
   applications. In Proceedings of the Third Inter-         ing, The Pennsylvania State University.
   national Symposium on High Performance Com-
   puter Architecture, 310–319.                          AUTHOR BIOGRAPHY
Kasbekar, M., S. Nagar, and A. Sivasubramaniam.
   1997. pSNOW: A tool to evaluate architectural         ANAND SIVASUBRAMANIAM is an Assistant
   issues for NOW environments. In Proceedings of        Professor in the Department of Computer Science and
   the ACM 1997 International Conference on Su-          Engineering at The Pennsylvania State University.
   percomputing, 100–107.                                He received his B.Tech in Computer Science from the
Ramachandran, U., G. Shah, A. Sivasubramaniam,           Indian Institute of Technology, Madras, in 1989, and
   A. Singla, and I. Yanasak. 1995. Architectural        the MS and Ph.D. degrees in Computer Science from
   mechanisms for explicit communication in shared       the Georgia Institute of Technology in 1991 and 1995
   memory multiprocessors. In Proceedings of Super-      respectively. His research interests are in architec-
   computing ’95.                                        ture, operating systems, performance evaluation and
Seed, D., A. Sivasubramaniam, and C. Das. 1997.          application aspects of high performance computing.
   Communication in Parallel Applications: Charac-
   terization and Sensitivity Analysis. To appear in
   Proceedings of the 1997 International Conference
   on Parallel Processing.
Shah, G., A. Singla, and U. Ramachandran. 1995.
   The quest for a zero overhead shared memory par-
   allel machine. In Proceedings of the 1995 Inter-
   national Conference on Parallel Processing, 194–
Sivasubramaniam, A. 1997. Reducing the communi-
   cation overhead of dynamic applications on shared
   memory multiprocessors. In Proceedings of the
   Third International Symposium on High Perfor-
   mance Computer Architecture, 194–203.
Sivasubramaniam, A., A. Singla, U. Ramachandran,
   and H. Venkateswaran. 1994. An approach to
   scalability study of shared memory parallel sys-
   tems. In Proceedings of the ACM SIGMETRICS
