Performance Evaluation of OmniRPC in a Grid Environment

Yoshihiro Nakajima
Graduate School of Systems & Information Engineering, University of Tsukuba
yoshihiro@hpcs.is.tsukuba.ac.jp

Mitsuhisa Sato, Taisuke Boku, Daisuke Takahashi
Institute of Information Sciences and Electronics, University of Tsukuba
{msato,taisuke,daisuke}@is.tsukuba.ac.jp

Hitoshi Gotoh
Knowledge-based Information Engineering, Toyohashi University of Technology
gotoh@cochem2.tutkie.tut.ac.jp


Proceedings of the 2004 International Symposium on Applications and the Internet Workshops (SAINTW'04)
0-7695-2050-2/04 $20.00 © 2004 IEEE

Abstract

OmniRPC is a Grid RPC system for parallel programming in a grid environment. In order to understand the performance characteristics of OmniRPC, we executed a synthetic benchmark program which varies the execution time in remote nodes and the amount of communication on several configurations of our grid environment. The results show that application performance improves if RPC data transmissions are less than 10 KB, the job time in remote nodes is more than 4 seconds, and RPCs are called more than 256 times. Our results also show a small performance degradation when the communication multiplexing feature is used. We also measured the performance of the EP application from the NAS parallel benchmark suite. In EP, whether SSH or the Globus GRAM is used as the method of agent invocation, performance is almost the same. As a practical application, we parallelized the CONFLEX molecular conformation search program using OmniRPC. In a comparison of CONFLEX-G with the CONFLEX MPI version, CONFLEX-G achieves efficiency comparable to the MPI version and increases speed by using two or more clusters.

1. Introduction

Recently, the concept of computational grids has begun to attract significant interest in the field of high-performance network computing. Rapid advances in wide-area networking technology and infrastructure have made it possible to construct large-scale, high-performance distributed computing environments, or computational grids, that provide dependable, consistent and pervasive access to enormous computational resources.

OmniRPC [6] is a thread-safe RPC (remote procedure call) system for cluster and grid environments. In this paper, we report the performance of OmniRPC in our grid environment, which connects several clusters geographically distributed in a wide-area network.

OmniRPC is a thread-safe implementation of Ninf RPC [5, 7], which is a Grid RPC facility for wide-area networks. Several systems adopt the concept of RPCs as the basic model of computation, including Ninf, NetSolve [2] and CORBA [1]. The RPC-style system provides an easy-to-use, intuitive programming interface, allowing users of the grid system to easily build grid-enabled applications. In order to support parallel programming, an RPC client can issue asynchronous call requests to different remote computers to exploit network-wide parallelism via OmniRPC.

A typical application of OmniRPC is parametric execution, in which the same function is executed with different input parameters. Once a remote executable is invoked, the client attempts to use it for subsequent RPC calls to the same remote functions, in order to eliminate the invocation cost of each call.

One objective of OmniRPC is the support of seamless programming environments from a cluster to a grid. With OmniRPC, the user can execute the same program for both a cluster and a grid without changing the source code: local environments are supported by the use of rsh (remote shell), grid environments by Globus [4], and remote hosts by ssh (secure shell). Furthermore, a typical grid resource is regarded as a cluster of geographically distributed clusters. Under this system, the user designates a local job scheduler to execute remote executables. For a cluster within a private network, the OmniRPC agent process running on the host server relays the communications between the client and the remote hosts.

In order to make effective use of OmniRPC for grid programming, it is useful for the programmer to examine performance characteristics such as the invocation cost of remote programs and the overhead of communication. To understand these characteristics, we executed a synthetic benchmark program which varies the execution time in remote nodes and the amount of communication on several configurations of our grid environment. We measured the performance of the EP (embarrassingly parallel) application from the NAS parallel benchmark suite. As a practical application, we parallelized the CONFLEX molecular conformation search program using OmniRPC. In this paper, we also report the performance of the grid-enabled CONFLEX.

An overview of the OmniRPC system is presented in the next section. The performance evaluation in wide-area networks is described in Section 3. Section 4 demonstrates a practical grid application with CONFLEX-G and discusses its performance. In Section 5 the current status and future work are described.

2. The OmniRPC System

OmniRPC is a Grid RPC system which allows seamless parallel programming from a cluster to a grid environment. OmniRPC inherits its API and basic architecture from Ninf. A client and the remote computational hosts which execute the remote procedures may be connected via a network. The remote libraries are implemented as an executable program which contains a network stub routine as its main routine. We call this executable program a remote executable program.

When the OmniRPC client program starts, the OmniRPC initialization function invokes the OmniRPC agent program omrpc-agent on the remote hosts listed in the host file. To invoke the agent, the user uses the remote shell command rsh in a local-area network, the GRAM (Globus Resource Allocation Manager) API of the Globus toolkit in a grid environment, or the secure remote shell command ssh. The user can switch configurations simply by changing the host file.

OmniRpcCall is a simple client programming interface for calling remote functions. When OmniRpcCall makes a remote procedure call, the call is allocated to an appropriate remote host. When the client issues the RPC request, it asks the agent in the selected host to submit the job of the remote executable with the local job scheduler specified in the host file. If no job scheduler is specified, the agent executes the remote executable on the same node via the fork system call. The client sends the input-argument data to the invoked remote executable, and receives the results on the return of the remote function. Once a remote executable is invoked, the client attempts to use it for subsequent RPC calls to eliminate the cost of invoking it again.

When the agent and the remote executables are invoked, the programs obtain the client address and port from the argument list and connect back to the client by direct TCP/IP or Globus-IO for subsequent data transfer. Because the OmniRPC system does not use any fixed service ports, the client program allocates unused ports dynamically to wait for connections from the remote executables. This avoids possible security problems, and allows the user to install the OmniRPC system without requiring a privileged account.

For parallel programming, the programmer can use asynchronous remote procedure calls, allowing the client to issue several requests while continuing with other computations. The requests are dispatched to different remote hosts to be executed in parallel, and the client waits for or polls the completed requests. In such a programming model, the programmer must handle outstanding requests explicitly. Because OmniRPC is thread-safe, a number of remote procedure calls may be outstanding at any time for multi-threaded programs written in OpenMP.

OmniRPC efficiently supports typical master/worker parallel grid applications such as parametric execution programs. Parametric search applications often require a large amount of identical data for each call, so OmniRPC supports a limited persistence model, implemented by the automatic-initializable remote module. The user can define an initialization procedure in the remote executable to send and store data automatically in advance of the actual remote procedure calls. Since the remote executable may accept requests for subsequent calls once it is invoked, the data set by the initialization procedure is reused, resulting in efficient execution and a reduction in the amount of data transmitted.

We assume that a typical grid resource is regarded as a cluster of geographically distributed clusters. For clusters in a private network, in which the master node has a global IP address, an OmniRPC agent process running on the host server functions as a proxy to relay communications between the client and the remote executables by multiplexing the communications into one connection to the client. This feature allows a single client to use even 1,000 remote computing hosts. We call this feature multiplex IO (MXIO). When the cluster is inside a firewall, the port forwarding of SSH enables the node to communicate to the outside with MXIO.

For more detailed information, refer to OmniRPC [6].
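The asynchronous, master/worker programming style described above can be modeled in a few lines. The sketch below is illustrative only: `remote_work` and `parametric_execution` are hypothetical names standing in for a remote procedure and for OmniRPC's asynchronous call interface, and a local thread pool stands in for dispatch to remote hosts.

```python
import concurrent.futures

def remote_work(param):
    """Hypothetical stand-in for a remote function: in a real OmniRPC
    client this computation would run in a remote executable."""
    return param * param

def parametric_execution(params, n_workers=4):
    # The client issues all requests asynchronously, then collects the
    # results -- the master/worker pattern used for parametric execution.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(remote_work, p) for p in params]
        return [f.result() for f in futures]
```

In the real system each outstanding request would be bound to a different remote executable, so the calls overlap in time rather than merely overlapping I/O as threads do here.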




          Site                        Cluster    Machine                  # of Nodes    # of CPUs    OS              Agent Invoker
          Univ. of Tsukuba            Dennis     Dual P4 Xeon 2.4 GHz         14            28       Linux 2.4.18    SSH, Globus
                                      Alice      Dual Athlon 1800+            20            40       Linux 2.4.19    SSH, Globus
          Toyohashi Univ. of Tech.    Toyo       Dual Athlon 2600+             8            16       Linux 2.4.18        SSH
          Tokushima Univ.             Toku       Pentium3 1.0 GHz              8             8       Linux 2.4.2         SSH
          AIST                        Ume        Dual P3 1.4 GHz              32            64       Linux 2.4.20    SSH, Globus

                                        Table 1. Machine configurations in our grid testbed.



3. Performance Characterization of OmniRPC Using the Synthetic Benchmark Program

To understand the performance characteristics of OmniRPC, we used a synthetic benchmark program which models a typical parallel program. As a typical OmniRPC program, we also measured the performance of EP from the NAS parallel benchmark suite on several configurations of our grid environment.

3.1. Synthetic Benchmark Program

The synthetic benchmark program transmits character data of a specified length as an argument from a client program to a worker program, and makes the worker program sleep for a specified time in the remote host. The benchmark thus simulates a worker program that takes a given amount of data and executes for a given time. On return from the remote worker, the program sends the same data back to the client program. The client makes RPC calls to these workers in parallel, and the RPC calls are performed asynchronously.

In this experiment, we varied the execution time in the remote nodes and the amount of communication on several configurations of our grid environment. We can thus examine how much benefit can be obtained depending on the amount of data transmitted and the job execution time in the workers.

3.2. Our Grid Testbed

Our grid testbed was constructed from computing resources installed at the University of Tsukuba, Toyohashi University of Technology (TUT), Tokushima University and the National Institute of Advanced Industrial Science and Technology (AIST). Table 1 shows the computing resources used on our grid. The worker nodes at TUT and Tokushima University are in a private network, and the master node of each cluster has a global IP address.

The University of Tsukuba and AIST are connected by a 1 Gbps Tsukuba WAN, and the other clusters are connected by SINET. Table 2 shows the performance of the measured network between the master node of the Dennis cluster and the master node of each cluster in our grid testbed. The communication throughput was measured with the netperf command, and the round-trip time was measured with the ping command.

Table 2. Network performance between the master node of the Dennis cluster and the master node of each cluster.

  Cluster   Round-Trip Time (ms)   Throughput (Mbps)
  Alice             0.18                 94.12
  Toyo             11.27                  1.53
  Toku             23.09                  8.36
  Ume               1.07                373.33

This experiment was conducted from September 10, 2003 to October 5, 2003. Due to some broken computational nodes during the experiment, the number of workers changed, and we could not use all of the computing resources in this grid environment.

3.3. Results of the Synthetic Benchmark Program

This experiment was performed on three configurations:

  • a combined cluster of the Dennis and Alice clusters, between which the network performance is high and the latency is small,
  • a combined cluster of the Dennis and Ume clusters, between which the throughput is high and the network latency is small,
  • a combined cluster of the Dennis and Toyo clusters, between which the network performance is poor.

Since 16 worker programs are executed in each cluster, a total of 32 worker programs are executed across the two clusters.

We changed the execution time of the workers in each call. For "short," "middle," and "long" workers, the computing time in the remote host is 1 second, 4 seconds and
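The benefit to expect from a given configuration can be estimated with a simple per-call cost model: communication takes roughly the round-trip time plus the time to ship the argument data out and back, while the worker contributes its sleep time as useful work. The sketch below is a back-of-envelope model built from Table 2, not the paper's measured results; `rpc_efficiency` is a hypothetical helper.

```python
def rpc_efficiency(work_s, payload_bytes, rtt_s, throughput_bps):
    """Crude per-call efficiency: useful work over work plus the time
    to send the arguments to the worker and echo them back."""
    comm_s = rtt_s + 2.0 * payload_bytes * 8 / throughput_bps
    return work_s / (work_s + comm_s)

# Table 2 figures for the Dennis--Toyo link: 11.27 ms RTT, 1.53 Mbps.
toyo = dict(rtt_s=0.01127, throughput_bps=1.53e6)
# A "middle" (4 s) worker with a 10 KB payload stays efficient,
# while a 1 MB payload on the same slow link does not.
eff_small = rpc_efficiency(4.0, 10_000, **toyo)
eff_large = rpc_efficiency(4.0, 1_000_000, **toyo)
```

This model reproduces the qualitative thresholds reported in the abstract: with payloads of 10 KB or less and several seconds of work per call, the communication term is a small fraction of the total.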




16 seconds, respectively. The clients were executed on the master node of the Dennis cluster, and SSH was used for authentication.

We found that, even in the case of short workers which execute in 1 second, good performance can be obtained when the throughput of the network between the clusters is high. For "middle" and "long" workers, performance remains good in all cases in which the network performance is high.

We found a small performance degradation when MXIO is used, because of the overhead incurred when the OmniRPC agent relays the communication between the client and the workers. However, relative to a "long" computation time the MXIO overhead is small, so its effect on performance is also small. When the network performance is poor, as in a wide-area network environment, an application which requires a large amount of data to be transmitted (more than 1 MB) cannot be executed efficiently.

According to the results, when the amount of data transmitted is 10 KB or less, applications using OmniRPC perform well in almost all cases. We also found that the throughput of the network is more important than its latency: when the throughput is low, the result is large overhead and performance degradation. It is noted that the performance improvement achieved in clusters connected remotely by high-performance networks is better than that in clusters connected locally.

3.4. Performance of the NAS Parallel Benchmark EP

We implemented the NAS Parallel Benchmark EP using OmniRPC, replacing the process that calculates each batch of random numbers with an RPC. In this program, more than 16,000 RPCs are called, and less than 1 KB of data is transmitted for each RPC.

We used all of the nodes in the clusters, and the client program was executed on the master node of the Dennis cluster. Table 3 shows the elapsed time and speedup for Class B on different sets of clusters.

The experimental results indicate little performance effect from the authentication method used (SSH or Globus Toolkit). In this experiment, as the number of agents increases, the elapsed time for invoking the agents becomes larger. When two clusters are connected between which the network performance is not poor, performance improves over the single-cluster case, since jobs are efficiently allocated to the computing resources.

When the latency of the network is small, the communication multiplexing is acceptable. However, when communication multiplexing is used via SSH with port forwarding, we found that performance becomes worse when the latency of the network is large.

4. Performance Evaluation of a Practical Application, CONFLEX-G

4.1 CONFLEX-G

CONFLEX [3] is an efficient conformational space search program, which can predominantly and exhaustively search the conformers in the lower-energy regions. Applications of CONFLEX include the elucidation of the reactivity and selectivity of drugs and possible drug materials with regard to their conformational flexibility.

The basic strategy of CONFLEX is the exhaustive search of only the low-energy regions. The original CONFLEX performs the following four major steps:

1. Selection of an initial structure from among the unique conformers already found and sorted in a conformational database (at the beginning of a search, the input structure is used as the first initial structure),

2. Generation of trial structures by applying local perturbations to the selected initial structure,

3. Geometry optimization of those trial structures,

4. Comparison of the successfully optimized structures with the other conformers stored in the conformational database, and preservation of newly found unique conformers in the database (Figure 7).

Figure 7. The procedure of CONFLEX-G.

In these repeated procedures, two unique strategies are incorporated. The first involves the local perturbations: corner flapping, edge flipping, and stepwise rotation, which are highly efficient in producing several good trial structures. In effect, perturbations of an initial structure correspond to a precise exploration of the conformational space around that structure.

The second strategy is "Lowest-Conformer-First," the selection rule for the initial structure, which directs the conformational search toward the low-energy regions.

In the CONFLEX search algorithm, the most time-consuming part is the geometry optimization procedure,
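The four-step loop of Figure 7 maps naturally onto a master/worker task pool: the master perturbs the initial structure, farms the geometry optimizations out in parallel, and then compares and stores the results. The sketch below is schematic only; `optimize_geometry` and `conflex_round` are hypothetical names, and the "energy" is a mock value, not a molecular-mechanics calculation.

```python
import concurrent.futures

def optimize_geometry(trial):
    """Hypothetical stand-in for the geometry-optimization worker that
    CONFLEX-G runs remotely; returns a mock (structure, energy) pair."""
    return trial, sum(trial)  # pretend "energy" is the coordinate sum

def conflex_round(initial, perturb, database, n_workers=4):
    """One round of the Figure 7 loop: perturb the initial structure,
    optimize all trial structures in parallel (the task pool), then
    store any conformer not already in the database."""
    trials = perturb(initial)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        for structure, energy in pool.map(optimize_geometry, trials):
            key = tuple(structure)
            if key not in database:        # comparison & store step
                database[key] = energy
    return database
```

Because step 3 dominates the running time and the trial structures are independent, this is exactly the parametric, master/worker workload that OmniRPC targets.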




                            30                                                                                  30

                            25                                                                                  25

                            20                                                                                  20
                 Speed Up




                                                                                                     Speed Up
                            15                                                                                  15

                            10                                                                                  10

                             5                                                                                   5

                             0                                                                                   0
                                 32     64           128         256            512           1024                   32     64           128         256            512           1024
                                                   Number of RPC Calls                                                                 Number of RPC Calls
                                      Dennis+Ume 1KB                     Dennis+Alice 100KB                               Dennis+Ume 1KB                     Dennis+Alice 100KB
                                                 10KB                               1000KB                                           10KB                               1000KB
                                                100KB                      Dennis+Toyo 1KB                                          100KB                      Dennis+Toyo 1KB
                                              1000KB                                   10KB                                       1000KB                                   10KB
                                      Dennis+Alice 1KB                                100KB                               Dennis+Alice 1KB                                100KB
                                                 10KB                               1000KB                                           10KB                               1000KB



                            Figure 1. The speedup on short workers                                              Figure 2. The speedup on short workers
                            (1 s) with communication multiplexing                                               (1 s) without communication multiplex-
                            in the synthetic program.                                                           ing in the synthetic program.

                            30                                                                                  30

                            25                                                                                  25

                            20                                                                                  20
                 Speed Up




                                                                                                     Speed Up
                            15                                                                                  15

                            10                                                                                  10

                             5                                                                                   5

                             0                                                                                   0
                                 32     64           128         256            512           1024                   32     64           128         256            512           1024
                                                   Number of RPC Calls                                                                 Number of RPC Calls
                                      Dennis+Ume 1KB                     Dennis+Alice 100KB                               Dennis+Ume 1KB                     Dennis+Alice 100KB
                                                 10KB                               1000KB                                           10KB                               1000KB
                                                100KB                      Dennis+Toyo 1KB                                          100KB                      Dennis+Toyo 1KB
                                              1000KB                                   10KB                                       1000KB                                   10KB
                                      Dennis+Alice 1KB                                100KB                               Dennis+Alice 1KB                                100KB
                                                 10KB                               1000KB                                           10KB                               1000KB



                            Figure 3. The speedup on middle work-                                               Figure 4. The speedup on middle work-
                            ers (4 s) with communication multiplex-                                             ers (4 s) without communication multi-
                            ing in the synthetic program.                                                       plexing in the synthetic program.

                  [Figure 5. The speedup on long workers (16 s) with communication
                  multiplexing in the synthetic program. Speedup is plotted against
                  the number of RPC calls (32-1024) for each cluster pair
                  (Dennis+Ume, Dennis+Alice, Dennis+Toyo) and data size (1KB-1000KB).]

                  [Figure 6. The speedup on long workers (16 s) without communication
                  multiplexing in the synthetic program. Same axes, cluster pairs,
                  and data sizes as Figure 5.]




Proceedings of the 2004 International Symposium on Applications and the Internet Workshops (SAINTW’04)
0-7695-2050-2/04 $20.00 © 2004 IEEE
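Figures 3-6 and the MXIO columns of Table 3 below concern OmniRPC's communication multiplexing, which carries the traffic of many workers over a shared connection. As a rough illustration of the general idea only (the framing here is hypothetical and is not OmniRPC's actual wire protocol), logical streams can share one connection by tagging each message with a stream ID:

```python
import struct

def mux(stream_id: int, payload: bytes) -> bytes:
    """Frame a payload: 4-byte stream ID, 4-byte length, then the bytes."""
    return struct.pack(">II", stream_id, len(payload)) + payload

def demux(buffer: bytes):
    """Split a multiplexed byte stream back into (stream_id, payload) frames."""
    frames = []
    offset = 0
    while offset < len(buffer):
        stream_id, length = struct.unpack_from(">II", buffer, offset)
        offset += 8
        frames.append((stream_id, buffer[offset:offset + length]))
        offset += length
    return frames

# Two workers' results interleaved on one physical connection.
wire = mux(1, b"result-from-worker-1") + mux(2, b"result-from-worker-2")
print(demux(wire))
```

With multiplexing, one authenticated SSH connection per cluster suffices, which is why the MXIO ON columns avoid the per-worker connection cost visible in the MXIO OFF columns.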
                                                                  SSH, MXIO ON            SSH, MXIO OFF          Globus, MXIO OFF
                                      Cluster (# of workers)  Elapsed time(s) Speedup Elapsed time(s) Speedup Elapsed time(s)  Speedup
                            Dennis (30)                        28.23   28.56             29.27   27.54             27.35     29.48
                            Alice (36)                         28.66   28.13           227.79     3.54             34.46     23.39
                            Toyo (16)                           90.8    8.88           438.02     1.84                 -         -
                            Toku (8)                         302.25     2.67           744.91     1.08                 -         -
                            Ume (54)                           21.85   36.89           113.82     7.08             34.56     23.33
                            Dennis (30)+Alice (36)             22.46    35.9             43.63   18.48             24.13     33.41
                            Dennis (30)+Toyo (16)              63.43   12.71             71.72   11.24                 -         -
                            Dennis (30)+Toku (8)               29.98   26.89              38.9   20.72                 -         -
                            Dennis (30)+Ume (54)               18.12   44.49             43.21   18.66             25.18     32.02
                            Alice (36)+Toyo (16)             183.83     4.39             63.09   12.78                 -         -
                             Alice (36)+Toku (8)                 32.07   25.13           167.58     4.81                 -         -
                            Alice (36)+Ume (54)                25.85   31.18             95.27    8.46             28.34     28.44
                            Toyo (16)+Toku (8)                 78.56   10.26           293.84     2.74                 -         -
                            Toyo (16)+Ume (54)                 55.19   14.61           125.32     6.43                 -         -
                            Toku (8)+Ume (54)                  26.64   30.26           123.41     6.53                 -         -


           Table 3. Elapsed time and speedup of NAS Parallel Benchmark EP Class B in our grid, relative to the
           sequential version on the Dennis cluster.
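Since every row of Table 3 reports speedup relative to the same sequential EP Class B run on a Dennis node, each row's elapsed time multiplied by its speedup should recover roughly the same sequential baseline. A quick consistency check on three of the SSH/MXIO-ON rows:

```python
# Each (elapsed, speedup) pair from Table 3 should imply the same
# sequential baseline: elapsed * speedup ~ constant.
rows = {
    "Dennis (30)":     (28.23, 28.56),
    "Ume (54)":        (21.85, 36.89),
    "Dennis+Ume (84)": (18.12, 44.49),
}
for name, (elapsed, speedup) in rows.items():
    print(f"{name}: implied sequential time ~ {elapsed * speedup:.0f} s")
# Each row implies a sequential time of roughly 806 s.
```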



              Molecular    # of      # of trial       Degree of
                  code     atoms     structures       parallelism
               AlaX04        181            360                45
               AlaX16        191            480               160

            Table 4. Outline of the molecules used in the
            performance evaluation.




        which always takes 90% of the elapsed time of the search
        execution. Because we obtained good performance in our grid
        environment, where huge computational resources can be used,
        we implemented CONFLEX-G, a grid-enabled version of CONFLEX.
        Figure 7 shows the processing flow of CONFLEX-G. The worker
        programs are initialized by the Initialize method, which is
        provided by OmniRPC's automatic-initialization module when
        the worker programs are invoked. After an RPC call, the
        initialized state is reused on the remote host; in other
        words, the client program can omit the initialization for
        each RPC call and can optimize trial structures efficiently.
        CONFLEX-G allocates trial-structure optimization jobs to the
        computational nodes of each cluster in the grid environment.

                  [Figure 8. The performance of CONFLEX-G, the
                  CONFLEX MPI version, and the original CONFLEX
                  on the Dennis cluster.]

        4.2 Performance of CONFLEX-G

           To examine the performance of CONFLEX-G, we used the two
        molecules listed in Table 4. First, we compared the
        performance of CONFLEX-G with the CONFLEX MPI version and
        the original sequential version of CONFLEX. Figure 8 shows
        the comparison between the MPI version and CONFLEX-G on a
        local PC cluster. We found that CONFLEX-G achieves
        efficiency comparable to that of the MPI version; with 28
        workers, the speedup was 18.00 over the CONFLEX sequential
        version. As shown in the table, some overhead was found in
        the case of OmniRPC. This is because, in the case of MPI,
        all workers are initialized before the optimization phase
        begins, whereas in the case of OmniRPC a worker is invoked
        on demand when the RPC call is actually issued; this
        on-demand initialization incurs the overhead.
           Table 5 shows the elapsed time and speedup of the
        conformational search of AlaX04 and AlaX16 with CONFLEX-G
        in our grid environment.
           In the case of AlaX16, speedup is achieved even when two
        or more clusters are used across the wide-area network.
        This is because the overhead, such as network delay and
        program invocation, becomes relatively small and can be
        concealed, since the calculation times on the remote hosts
        are long. The best performance was obtained when 64 workers
        were used on the Dennis and Alice clusters. The achieved
        performance
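The on-demand behavior just described — a worker pays its initialization cost only when the first RPC call reaches it, and the initialized state is then reused on the remote host — can be modeled with a small sketch. The class and method names below are illustrative only, not the OmniRPC API:

```python
class Worker:
    """Toy model of a remote worker whose initialized state
    persists across RPC calls, as OmniRPC reuses it on the host."""
    def __init__(self):
        self.initialized = False
        self.init_count = 0  # counts how often initialization runs

    def rpc_optimize(self, structure: str) -> str:
        # On-demand start: only the first call pays the
        # initialization (worker invocation) overhead.
        if not self.initialized:
            self.init_count += 1
            self.initialized = True
        return f"optimized({structure})"

worker = Worker()
results = [worker.rpc_optimize(s) for s in ["t1", "t2", "t3"]]
print(worker.init_count)  # 1: initialization ran once, not per call
```

This is why the MPI version, which initializes every worker up front, shows no such overhead during the optimization phase, while CONFLEX-G pays it at the first call to each worker.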
          Cluster                           Total # of   Structures   Total optimization    Optimization time     Elapsed
          (# of workers)                     workers       / worker              time (s)       / structure (s)    time (s)   Speedup
          Dennis (1)1                                1         480             74027.80                154.22     74027.80       1.00
          Dennis (28)                              28          17.1            74027.80                154.22      3375.60      21.93
          Alice (36)                               36          13.3            90047.27                187.60      3260.41      22.71
          Toyo (16)                                16          30.0            70414.21                146.70      4699.15      15.75
          Ume (56)                                 56           8.6           123399.38                257.08      2913.63      25.41
          Dennis (28)+Alice (36)                   64           7.5            87571.30                182.44      2051.50      36.08
          Dennis (28)+Toyo (16)                    44          10.9            76747.74                159.89      2762.10      26.80
          Dennis (28)+Ume (56)                     84           5.7           102817.90                214.20      2478.93      29.86
          Alice (36)+Toyo (16)                     52           9.2            82700.44                172.29      2246.73      32.95
          Toyo (16)+Ume (56)                       72           6.6           109671.32                228.48      2617.85      28.28
          Dennis (28)+Ume (56)+Toyo (16)          100           4.8            98238.07                204.66      2478.93      29.86
           1
               Estimated from the measurement on the Dennis cluster with 28 workers.

                             Table 5. Elapsed time and speedup of AlaX16 with CONFLEX-G in our grid environment.
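The speedup column in Table 5 is the estimated sequential elapsed time divided by the parallel elapsed time. The load imbalance noted in Section 4.2 also puts a hard floor on the elapsed time: no schedule can finish before the single longest optimization job does. Both points can be checked with a short script; the task times in the makespan model are illustrative, not measured data:

```python
import heapq

# Speedup = sequential elapsed time / parallel elapsed time (Table 5):
assert round(74027.80 / 2051.50, 2) == 36.08  # Dennis+Alice, 64 workers

def makespan(task_times, n_workers):
    """Greedy (list) scheduling: always hand the next task to the
    worker that frees up first; return the overall finish time."""
    workers = [0.0] * n_workers  # current finish time of each worker
    heapq.heapify(workers)
    for t in sorted(task_times, reverse=True):
        heapq.heappush(workers, heapq.heappop(workers) + t)
    return max(workers)

# Illustrative workload echoing the paper's observation: 479 short
# optimizations plus one job 24x longer than the others.
tasks = [10.0] * 479 + [240.0]
span = makespan(tasks, 100)
assert span >= 240.0  # elapsed time is bounded below by the longest job
```

Even with 100 workers, the elapsed time cannot drop below the 24x-longer job, which is why adding clusters stops improving the AlaX16 search.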


        was a speedup of 36.08 in the case of AlaX16.
           However, when two or more sets of clusters are used, the
        speedup is hampered by the load imbalance of the
        optimization of the trial structures. The longest time for
        optimizing a trial structure was nearly 24 times longer than
        the shortest, and the other workers must wait until the
        longest job has finished; therefore, the entire execution
        time cannot be improved beyond that bound.
           In the CONFLEX MPI version, the initialization of all
        remote executable programs (worker programs) finishes
        before the optimization of the trial structures begins. In
        contrast, in CONFLEX-G the remote executable program is
        initialized on demand at the first RPC call, which causes
        additional overhead in CONFLEX-G. A possible solution is to
        provide an API that starts the stub programs before the RPC
        calls are issued.

        5. Conclusions

           We have measured the performance of grid applications
        written with OmniRPC and evaluated OmniRPC's basic
        performance in our grid, which consists of geographically
        distributed clusters. This study revealed that application
        performance improves when RPC data transmissions are
        smaller than 10 KB, jobs on the remote nodes take more than
        4 seconds, and RPCs are called more than 256 times. As a
        practical application, we implemented CONFLEX-G, a
        grid-enabled molecular conformational search program built
        on OmniRPC, and obtained good performance in the wide-area
        network environment.
           Our future work is to develop deployment tools and to
        support fault tolerance. In the current OmniRPC, the
        registration of an execution program on remote hosts and
        the deployment of worker programs are set manually;
        deployment tools become necessary as the number of remote
        hosts increases. In grid environments, where conditions
        change dynamically, fault tolerance is also necessary,
        especially for large-scale applications that run for a long
        time in a wide-area network.

        Acknowledgment

           This research is partly supported by a Grant-in-Aid of
        the Ministry of Education, Culture, Sports, Science and
        Technology in Japan, No. 14019011, 2002, and the Program of
        Research and Development for Applying Advanced
        Computational Science and Technology of the Japan Science
        and Technology Corporation ("Research on the Grid computing
        platform for drug design").

        References

         [1] Object Management Group. http://www.omg.org/.
         [2] D. Arnold, S. Agrawal, S. Blackford, J. Dongarra,
             M. Miller, K. Seymour, K. Sagi, Z. Shi, and
             S. Vadhiyar. Users' Guide to NetSolve V1.4.1.
             Innovative Computing Dept. Technical Report
             ICL-UT-02-05, University of Tennessee, Knoxville, TN,
             June 2002.
         [3] H. Goto and E. Osawa. An efficient algorithm for
             searching low-energy conformers of cyclic and acyclic
             molecules. J. Chem. Soc., Perkin Trans. 2:187-198,
             1993.
         [4] I. Foster and C. Kesselman. Globus: A metacomputing
             infrastructure toolkit. In Workshop on Environments
             and Tools, 1996. http://www.globus.org/.
         [5] Ninf Project. http://ninf.apgrid.org/.
         [6] M. Sato, T. Boku, and D. Takahashi. OmniRPC: a Grid
             RPC System for Parallel Programming in Cluster and
             Grid Environment. In Proc. of CCGrid 2003, pages
             219-229, 2003.
         [7] M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka,
             U. Nagashima, and H. Takagi. Ninf: A Network Based
             Information Library for Global World-Wide Computing
             Infrastructure. In HPCN Europe, pages 491-502, 1997.

				