

Scientific codes on a cluster of Itanium-based nodes
Joseph Pareti, Hewlett-Packard Corporation

At Hewlett-Packard the strategic importance of IA-64 is indisputable. This is particularly true for High Performance Technical Computing (HPTC). As a co-developer of the Intel® Itanium® architecture, HP deploys Microsoft Windows, HP-UX, and Linux on the platform, while the roadmap includes OpenVMS and an enhanced version of HP-UX incorporating clustering and file-system features from Tru64 UNIX. Clusters are built out of Intel® Itanium® 2 ("Madison")-based two- and four-CPU HP Integrity rx2600 and rx5670 servers. The former is powered by one or two Intel® Itanium® 2 processors (up to 1.5 GHz, with up to 6 MB of L3 cache) and up to 24 GB of memory. These servers employ the HP Scalable Processor Chipset zx1, a modular three-component chipset designed to provide the best performance for demanding applications that do not fit within the processor cache. For 1- to 2-way systems the HP zx1 chipset enables 8.5 GB/s of aggregate memory bandwidth; for 4-way systems it enables 12.8 GB/s. Open-page latency is an outstanding 80 ns in a two-CPU system. The rx2600 is only 3.5 inches high, and hence suitable for dense rack-mounted, low-priced clusters. The HP Integrity rx5670 server is powered by two to four Intel® Itanium® 2 processors and up to 96 GB of memory, and is also built around the zx1 chipset.
For HP-UX-based clusters of Itanium-based systems, hptc/ClusterPack is available. It provides a complete hardware and software solution, including the cluster interconnect (Gigabit Ethernet and HyperFabric 2) and the software stack. Bundled with HyperFabric 2 is HP's patented Hyper Messaging Protocol (HMP), a messaging-based protocol that significantly enhances the performance of parallel and technical applications by optimizing the processing of various communication tasks. ClusterPack also includes several utilities that aid both in administrative tasks and in workload management.
In the Linux space, HP offers a fully supported cluster solution, the HP XC6000, consisting of interconnected HP Integrity rx2600 servers, the Quadrics Ltd. interconnect, and the software stack. The software enables system administrators to manage and operate the XC system as a single entity, provides a single file name space and multiprocessor execution environment, and enables users to easily execute single jobs across multiple servers. XC systems scale from two to thousands of processors on applications. XC System Software products are licensed and distributed by HP and consist of components developed by third-party suppliers as well as by HP. A brief review of the Intel® Itanium® architecture features for performance is followed by an outline of the HP XC6000, which is particularly relevant to the high end of the HPTC market and covers system management as well as software development capabilities. Some case studies involving development tools are presented as well.

Intel ® Itanium ® architecture features for performance
This is a non-exhaustive discussion; it focuses on the performance-influencing features described in [1]:
• Instruction Level Parallelism
• Predication
• Speculation
• Software Pipelining and Rotating Registers
• Stacked Registers
• Three-level Cache Memory


Instruction Level Parallelism (ILP)
ILP is the ability to execute multiple instructions at the same time. The Intel® Itanium® architecture allows independent instructions to be issued in bundles of three for parallel execution, and can issue multiple bundles per clock. This is achieved through large parallel resources, such as the register files and the several execution units, and through compiler features such as predication and speculation, discussed below.

Predication
In traditional RISC and CISC architectures, code branches are a significant limiter to performance. A mispredicted branch causes the execution pipeline to flush and stall until the instructions and data for the proper branch are loaded. Many branches can be predicted with a very high degree of accuracy, but the small number that cannot will cause the processor to stall for hundreds of cycles. The Intel® Itanium® processor can avoid this problem using predication and control speculation. Predication allows both sides of a branch to be executed; the results from the invalid side are then discarded. In the example below, the sequence of instructions can be mapped to three bundles and will execute in two cycles. There are only three instructions, so one might think they could all be placed in one bundle and execute in one cycle. However, the result of the compare must be known before the following instructions are executed. The ";;" characters at the end of a bundle are called "stop bits"; they force instruction dispersal to stop for the current cycle. The predicate registers p1 and p2 are set depending on the result of the compare, and the validity of each result is determined by its predicate register. The invalid result is ignored.

Source code:

    if (r1)
        r2 = r3 + r4
    else
        r7 = r6 - r5
    end if

Non-optimized code (5 cycles, including 2 cycles of potential branch misprediction, i.e. 30% of 10 cycles):

    cmp.eq p1,p2=r1,r0 ;;
    (p1) br.cond else_clause
         add r2=r3,r4
         br end_if
    else_clause:
         sub r7=r6,r5
    end_if:

Itanium® predicated code (2 cycles):

    cmp.ne p1,p2=r1,0 ;;
    (p1) add r2=r3,r4
    (p2) sub r7=r6,r5

Data Speculation
The example below shows how a load that might conflict with a store can still be hoisted above the store. A check instruction is placed at the original location of the load to validate it. While it might seem that data speculation would allow software developers to completely forget about pointer aliasing, this is not true. There are a limited number of resources available to support speculation, and it narrows the window over which the compiler can optimize. It is still a much bigger performance win to avoid address aliasing in performance-critical code.

    Original code                                Optimized code (speculation)
                                                 speculative load
    control or data dependency (e.g. store)      control or data dependency (e.g. store)
    original load                                check for exceptions or memory conflict
    use of load                                  use of load

Software Pipelining

    cycle:   X     X+1   X+2   X+3   X+4   X+5   X+6   X+7
    iter 1:  ld4         add   st4
    iter 2:        ld4         add   st4
    iter 3:              ld4         add   st4
    iter 4:                    ld4         add   st4
    iter 5:                          ld4         add   st4

    (prolog: cycles X to X+2; kernel: cycles X+3 and X+4; epilog: cycles X+5 to X+7)


Software pipelining is used to optimize loop code such as in the example above by (i) subdividing each iteration into its constituent instructions and (ii) interspersing independent instructions that belong to different iterations at each cycle. In the kernel part of the loop (cycles X+3 and X+4 above) the loop completes one iteration per cycle. This technique exposes large code blocks to parallelism, similar to loop unrolling, but it avoids the memory-expansion drawback of loop unrolling (with its potential instruction cache misses), and it also controls the filling and draining of the pipeline. In the Intel® Itanium® architecture, loops can be pipelined without code expansion through register rotation: registers are renamed by adding the register number to the value of a register rename base (rrb) contained in the Current Frame Marker. The rrb value is decremented when loop branches execute at the end of each iteration. Decrementing the rrb makes the value in register Y appear to move to register Y+1. Register rotation therefore avoids overwriting values that are still needed, and it also handles the prologue and epilogue of the loop.

Stacked Integer Registers
The Intel® Itanium® architecture avoids the cost of spilling and filling registers at procedure call and return interfaces through compiler-controlled renaming. At a call site, a new frame of registers is made available to the called procedure without the need for register spill and fill. Instead, a hardware feature called the Register Stack Engine (RSE) saves the content of the registers to a backing store, using otherwise unused cycles. Register access occurs by renaming the virtual register identifiers in the instructions, through a base register, into physical registers. The callee can freely use the available registers without having to spill and later restore the caller's registers.
The callee executes an alloc instruction specifying the number of registers it expects to use, in order to ensure that enough registers are available. If stack overflow occurs, the alloc stalls the processor and spills the caller's registers until the requested number of registers is available. At return, the caller's registers are restored by reversing the renaming. If some or all of the caller's registers were spilled by the hardware and not yet restored, the return stalls the processor until the RSE has restored them.


Three-level Cache Memory
The three-tiered cache organization of the Intel® Itanium® processor provides a balanced trade-off between speed and size, and it reduces the complexity of the chip design. However, for it to be used effectively, the instructions and data need to be in the smallest, fastest cache when the processor needs them. This is accomplished by pre-fetching the instructions and data into the proper cache. Lack of effective pre-fetching will quickly kill application performance. This is the one area that is significantly different from RISC processors. All superscalar, out-of-order RISC processors dedicate an enormous amount of chip real estate and logic to hiding cache misses. This is done by allowing instructions to be executed out of order, then retiring them in order when all the dependencies are known. This works well when the ratio of CPU frequency to memory frequency is relatively small, but with today's high-frequency processors, the size and complexity of this task is growing exponentially. With the Explicitly Parallel Instruction Computing (EPIC) architecture, software is responsible for making sure the data is in the proper cache at the proper time. Instructions are issued in order, so there is no hardware mechanism to hide a cache miss.

• L1I cache: 16 KB, 4-way set-associative, 64-byte lines, 1-cycle latency.

Effective pre-fetching of instructions can be relatively simple for computationally intensive applications. The primary consideration is to ensure that the inner loop of any algorithm is small enough to fit in the L1I cache. Even without explicit pre-fetching by the compiler, the entire algorithm will be in cache after the first iteration.

• L1D cache: 16 KB, 4-way set-associative, 64-byte lines, 1-cycle latency.

For floating-point data, the L1D cache is irrelevant: all loads and stores to the floating-point register file are done from the L2 cache. The L1D cache is a small, very high-speed cache capable of servicing two integer load and store requests in one cycle. Since it is small, effective cache management is critical to achieving maximum performance. Making sure data structures are laid out so that sequentially accessed data falls in the same cache line is the best way to maximize L1D cache usage.

• L2 cache: 256 KB, unified instruction and data (integer and floating-point) cache, 8-way set-associative, 128-byte lines, 16 banks, 11-cycle latency.

The L2 cache has sufficient bandwidth to support streaming of floating-point data to the floating-point register file. This means the Intel® Itanium® 2 processor can support four loads, two stores, and two Fused Multiply-Add (FMA) instructions per cycle, for a theoretical maximum rate of 4000 MFLOPS (at a 1 GHz clock). However, most useful code will loop over a series of data. At the maximum dispersal rate of two FMAs per cycle, 32 bytes of data are consumed per cycle, or two cache lines every eight cycles. This requires the addition of two lfetches to acquire new data, and a branch instruction, every eight cycles. The result is that nine cycles are required to execute 32 floating-point operations. The maximum useful rate is therefore 3555 MFLOPS if all the data is in the L2 cache.

• L3 cache: 1.5 to 3 MB, unified cache, 12-way set-associative, 128-byte lines, one bank, 12-cycle (integer miss) to 18-cycle (instruction miss) latency.

The L3 cache can support a data transfer rate of 32 bytes/cycle to L2. This is enough to support the peak execution rate of the processor. However, for most algorithms, some of the loads into L2 will initiate a write-back as a modified line is chosen for eviction. For heavily pipelined loops such as daxpy, each load initiates a write-back. This reduces the maximum transfer rate to 16 bytes/cycle, which is not enough to sustain the peak execution rate of the processor. Therefore, maximum performance can be achieved if the data is blocked, or packed, to fit in the L2 cache. Blocking will provide significant benefit only if the stride is small and there is significant data reuse. Data access should be set up so that sequential data accesses are in contiguous cache lines. For Fortran this means accessing arrays in column-major order, and for C it means accessing them in row-major order. For applications that do not sequentially access data, the data should be accessed in the way that minimizes the stride.

HP-UX development tools (CALIPER)
Caliper is a configurable general-purpose profiler for HP-UX compiled code. It does not require changes to the compilation, and it supports C/C++ and Fortran applications; all levels of optimization; archive or shared binding; single- and multi-threaded code; and applications or shell scripts that spawn additional processes. It uses a combination of dynamic code instrumentation and the Itanium processor Performance Monitoring Unit (PMU) to collect data for analysis. Caliper does not change the behavior of the target application and includes the following features:
• 13 different pre-configured performance measurements, each customizable through config files
• All reports are available in text format; most reports are also available in HTML for easy browsing
• Performance data is correlated to your source by line number
• Easy inclusion/exclusion of specific modules during measurement
• Per-thread and aggregated thread reports
• Sample data reported by function, sorted to show hot spots

Caliper is invoked on the command line in the following way:

    caliper config_file [caliper_options] program [program_args]

It has the following capabilities:
• global application performance parameters (CPU cycles, instruction/data cache misses for the whole program)
• sampling at regular intervals (correlates hardware counter measurements to code addresses)
• exact measurement using program instrumentation (precise counts for individual functions, e.g. gprof functionality, used by the compiler for profile-based optimizations)

As an example, Caliper was run on Gaussian (l906 and l502.exe). Comparison of the cycle counts on different platforms for the same problem shows good agreement, with the only exception of the mlib modules, present on HP-UX only (math library, further discussed below).
    HP-UX/Caliper                               Tru64UNIX/DCPI
    %cycles  routine            cumul.%         %cycles  routine   cumul.%
    49.1     fqtril_             49.1           52.67    fqtril_    52.67
    10.7     mlib_dhb_rect_32    59.8            9.20    trn34w_    61.87
    10.3     trn34w_             70.1            6.19    docont_    68.06
     4.3     docont_             74.4            5.31    dotran_    73.37
     2.6     dovr1_              77.0            4.83    dovr1_     78.20
     2.6     dotran_             79.6            2.59    gobcmo_    80.79
     2.3     __milli_divI        81.9            2.46    dotrn_     83.25
     2.0     mlib_dhsc2_8        83.9            2.31    bnstr1_    85.56
     1.8     bnstr1_             85.7            1.90    calc0m_    87.46
     1.7     gobcmo_             87.4            1.68    doshuf_    89.14
     1.4     doshuf_             88.8            1.49    ipopvc_    90.63
     1.2     loadsx_             90.0            1.17    trani2_    91.80
     1.2     calc0m_             91.2            1.13    domd2_     92.93


HP XC systems are integrated Linux solutions consisting of interconnected HP servers and software.

High Performance Interconnect
HP XC uses the Quadrics Ltd. interconnect, consisting of three hardware components: an Elan adapter card that plugs into the XC compute node, a hierarchical interconnect switch, and an ultra-high-frequency interconnect link cable. A description of the interconnect is provided in [2,3]; here the main points are summarized with reference to the third-generation switch (Elan/Elite III):
• leading-edge message-passing and one-sided communications performance: 340 MB/s per rail peak bandwidth and 2 µs latency (5 µs with MPI), with low process overhead (less than 2% CPU utilization for MPI communications through the Elan adapter card)
• support of industry-standard PCI (currently 66 MHz, 64-bit)
• multiple rails connecting each node's PCI host bus adapters to the switch, providing redundancy or added performance through message striping across rails
• switch packaging suitable for rack mounting and air cooling
• scalability to thousands of compute nodes through additional switching layers, and through the use of a federated switch for configurations beyond 128 nodes, without sacrificing the low latency and while scaling the bisection bandwidth of the cluster

Administrative LAN
This is a switched Gigabit Ethernet LAN that connects to each node and protects against interconnect failure. It also offloads management data traffic from the interconnect.
System Software Overview
The HP XC system architecture is designed as a distributed system with single-system traits. This architecture achieves a single-system view for both users and administrators, which enables HP XC to support features such as:
• Single user login
• Single file-system name space
• Single integrated view of system resources
• Integrated program development environment
• Integrated job submission system

The HP XC architecture employs node specialization, where subsets of nodes are assigned to specialized roles. The distributed system is organized into a scalable hierarchy of application and service nodes. The former are committed exclusively to running user applications and incur minimal operating-system overhead, thus delivering more cycles of useful work. The latter are used for administration, file I/O, and so on, by distributing appropriate services such as user login, job submission and monitoring, I/O, and administrative tasks. Already in HP XC Version 1, service nodes are duplicated for high availability, though a more comprehensive high-availability implementation will ship in future releases incorporating Lustre [4].

The management node is the central point of administration for the entire cluster and supports the following functionality:
• monitoring of all system resources
• querying of cluster configuration data
• (remote) shut-down and reboot of a subset of nodes
• attributes management
• remote login
• software management (RPM, ISV packages, etc.)
• start-up and shut-down of system services
• set-up of users' environments
• monitoring of the System Management Data Base (SMDB)
• back-up of the XC system disk
• accounting

Disk I/O
Future releases of HP XC will support Lustre [4] as a scalable high-performance cluster file system, which will also improve the high-availability features of HP XC. The current release is based on NFS services provided by (i) the service node with the administrator role and (ii) the file-server nodes. The HP XC cluster assumes that most file systems are shared and exported through NFS to the other nodes, either by means of the administrative network or across the system interconnect. This means that all files can be equally accessed from any node and the storage is not considered volatile. Shared file systems are mounted on service nodes. The disk I/O path starts at the channel connected to the server and is managed by disk drivers and logical-volume layers. The data is passed through to the file system, usually to buffer cache. The mount point determines the file system chosen for the I/O request, and specifies whether that file system is local or global. A local request is allowed to continue through the local file system. A request for I/O from a file system that is mounted globally is packaged and shipped directly to the service node where the file system is mounted. All processing of the request takes place on this service node, and upon completion the results are passed back to the requesting node and to the requesting process.

Integrated Resource and Job Management Sub-systems
In the HP XC environment, resource management is defined as the low-level maintenance and control of hardware resources and software (user and system) attributes. The goal is to deliver complete information about the cluster instantaneously to the job-management system. The job manager uses this information to formulate its queue optimizations. Resources are managed by a single resource manager and are made available to higher-level services, such as a queue manager, which in the HP XC environment is provided by the Load Sharing Facility (LSF) from Platform Computing. HP XC service nodes with the command role act as LSF server hosts, on which the LSF daemons run and LSF control commands must be entered. Also, LSF jobs can only be submitted on server hosts. From the perspective of LSF, the server hosts represent the entire HP XC cluster. They are viewed as symmetric multiprocessor (SMP) nodes, where the total number of processors on the LSF server hosts is considered to be the same as the total number of application-node processors that are available for batch scheduling. One of the LSF server hosts is designated the master host, where some of the primary LSF daemons run.

Software Development Tools
• HP software development tools and compilers, plus tools from ISVs such as Pallas Vampir, the Etnus TotalView debugger, and Intel VTune
• HP MLIB

Direct access to local volumes on a node is supported. All local disk devices can be used for swap on their respective local nodes. The service and application nodes can use local disk for temporary storage on the HP XC cluster. The purpose of this local temporary storage is to provide higher performance for private I/O than can be provided across the interconnect. Because the local disk holds only temporary files, the amount of local disk space does not typically need to be large.


HP MLIB has four components, VECLIB, LAPACK, ScaLAPACK, and SuperLU_DIST, which have been optimized for use on Intel® Itanium® 2 processors (e.g. 95% of peak performance on DGEMM). HP MLIB incorporates many algorithmic improvements, and several tunable parameters have been adjusted to maximize execution performance. VECLIB uses a highly efficient implementation of the Basic Linear Algebra Subprograms (BLAS) Levels 1, 2, and 3, as well as a subset of the newly defined BLAS Standard. LAPACK fully conforms with public-domain LAPACK version 3.0, ScaLAPACK with public-domain ScaLAPACK version 1.7, and SuperLU_DIST with public-domain SuperLU_DIST version 2.0. The key computational kernels in MLIB are optimized to take full advantage of HP's parallel architectures. MLIB optimizations include:
• instruction scheduling to maximize the number of machine instructions executed per clock cycle
• algorithm restructuring to increase the number of computations per memory access
• cache management to minimize data cache misses
• multi-threaded parallel processing on multiprocessor systems

HP MLIB contains libraries for VECLIB, LAPACK, ScaLAPACK, and SuperLU_DIST which support:
• 32-bit addressing
• 64-bit addressing
• 64-bit addressing with 64-bit integer arguments (Fortran INTEGER*8 or C/C++ long long); these components are referred to as VECLIB8, LAPACK8, ScaLAPACK8, and SuperLU_DIST8, respectively

Sample Code
N3D is an MPI-parallel, 3D, time-transient, incompressible Navier-Stokes code from the Institute of Aerodynamics (IAG) of the University of Stuttgart [5] that was originally developed on the NEC SX5 and the CRAY T3E. The I/O module, called EAS3IOMOD, provides a Fortran interface to buffered C I/O (fopen, fwrite, etc.). Each problem size is described by the dimension parameters MMAX and NMAX (number of grid points), KMAX (KMAX+1 being the number of modes that are computed), and KEXP (related to the number of support points for the modes calculation). Profiling shows that the most CPU-intensive routines are "pendimod" (penta-diagonal matrix solver), "tridimod" (tri-diagonal solver), and "uvwmod" (multi-grid algorithm).

Porting
The default data model on the Intel® Itanium® platform is 32-bit int and long int and 64-bit pointers. As explained in a very useful porting guide [6], porting the data structures to Linux did not cause any problems. Some code changes were, however, required to deal with the data that EAS3IOMOD stores in IEEE format (little-endian on IA64/Linux and big-endian on IA64/HP-UX). Furthermore, the data types were set to 64-bit floating-point operands, at the users' request, and 32-bit integer operands, in order to avoid further complications with the MPI library.

Compiler options
A parameter study was carried out to gauge the effectiveness of optimizing compiler flags, both for the HP-UX and for the Linux platform. In order to efficiently scan several combinations, the entire process was automated in a shell script that essentially loops through a table of compiler options, generates a modified makefile, builds the executable, and runs one common test case. The best performance was attained on HP-UX with the +O4 compiler option in effect. On the Linux platform the options -O3 -tpp2 were used, but with lower performance than on HP-UX, as shown in the table below.


IAG -- Sequential (100 time-steps) optimized version of n3dlib/pendimod.f90 and n3d/uvwmod.f90

Itanium 2, 900 MHz; L3 Unified: size = 1536 KB; sysname = HP-UX; node = lp13; release = B.11.22

    F90FLAGS                                                                            real time
    +DD64 +FPD +r8 +DSnative +O4                                                        34:22.5
    +DD64 +FPD +r8 +DSnative +O4 +Odataprefetch +Ovectorize                             34:22.6
    +DD64 +FPD +r8 +DSnative +O4 +fast +Ovectorize                                      34:24.7
    +DD64 +FPD +r8 +DSnative +O4 +Odataprefetch +Onofltacc +Onolimit +fastallocatable   34:47.9
    +DD64 +FPD +r8 +DSnative +O4 +Ovectorize                                            34:22.4
    +DD64 +FPD +r8 +DSnative                                                            4:44:46.6
    +DD64 +FPD +r8                                                                      5:08:30.9
    +DD64 +r8                                                                           5:16:34.5
    +DD64 +FPD +r8 +DSnative +O4 +Odataprefetch +Onofltacc +Onolimit +fastallocatable   34:46.2
    +DD64 +FPD +r8 +DSnative +O3 +Onofltacc +Onolimit +fastallocatable                  40:26.1
    +DD64 +FPD +r8 +DSnative +O3 +Odataprefetch +Onolimit +fastallocatable              40:25.4
    +DD64 +FPD +r8 +DSnative +O3 +Odataprefetch +Onofltacc +fastallocatable             41:57.0
    +DD64 +FPD +r8 +DSnative +O4 +fast                                                  34:25.6
    +DD64 +FPD +r8 +DSnative +O4 +Odataprefetch +Onofltacc +Onolimit                    34:46.7
    +DD64 +FPD +r8 +DSnative +O4 +Odataprefetch +Onofltacc                              34:20.9
    +DD64 +FPD +r8 +DSnative +O4 +Odataprefetch                                         34:23.4

Itanium 2, cpu MHz: 900; Red Hat Linux 7.2 (2.96-112.7.2)

    F90FLAGS                                                                            real time
    -O3 -tpp2 -r8 -i8 -ftz -IPF_fma -IPF_fltacc -IPF_fp_speculation fast                45m20.889s
    -O3 -tpp2 -r8 -i8 -ftz -IPF_fma -IPF_fp_speculation fast                            47m3.460s
    -O3 -tpp2 -r8 -i8 -IPF_fma -IPF_fltacc -IPF_fp_speculation fast                     45m59.564s
    -O3 -tpp2 -r8 -i8                                                                   45m18.897s

Parallel performance
The table below indicates that a particular problem size that does not exceed the available physical memory scales almost linearly when one CPU per node is in use.

    MMAX   NMAX   KEXP   KMAX   #cpus   #nodes   elapsed seconds
    481    2730   7      29     10      10       2598
    481    2730   7      29     15      15       1748
    481    2730   7      29     15      10       2494
    481    2730   7      29     30      30       899

pfmon (measuring FLOPs and bus cycles)
pfmon [7] is a profiling tool for Linux on Intel® Itanium®. For Intel® Itanium® 2 it provides access to all the features of the Performance Monitoring Unit, similar to HP Caliper on HP-UX. The basic questions we wanted pfmon to answer for this code were, first, the rate of floating-point operations per second, and second, the memory-bus utilization on the current systems. The latter parameter was intended to support judgments on how far future Intel® Itanium® implementations could increase the application performance without memory-bandwidth restrictions. Below is an excerpt of the script employed to automate the runs, explaining how pfmon is used.
    /usr/bin/time prun -tvs -B2 -N $NNODES -n $MPI_NCPU ./mem_bw_script |& tee log.$$   # starts the MPI-parallel application

where "mem_bw_script" is as follows:

    #!/bin/csh -f
    unlimit cputime
    set X=/usr/users/pareti/lkg_ia64/IAG_FINAL/IAG/N3D/n3d/n3d   # this is the executable file
    pfmon -u -k -e CPU_CYCLES,IA64_INST_RETIRED,FP_OPS_RETIRED,BUS_DATA_CYCLE --outfile=stats.$RMS_RANK $X

Here prun is the flavor of mpirun on HP XC; pfmon is invoked on the executable, and at completion there are as many "stats" files as there are processes, each tagged by the environment variable $RMS_RANK. A post-processor loops across the "stats" files and parses them to report the FLOP rate and the percentage of peak bus utilization on a per-process basis.

Performance discussion
pfmon reports 8% of peak floating-point performance when one CPU per rx2600 is used, and 30% of peak bus use. While the relatively low bus utilization suggests there is headroom for performance growth through clock frequency and architectural features, the percentage of peak performance indicates that not all optimizations have been exhausted. In particular, suitable pre-fetching via compiler directives in the most critical constructs, such as a strided-array access in the solver module, might improve the performance, as suggested in [8]. To this end it appears important to evaluate the cycles needed between memory accesses in that code segment and relate them to the load-use latency of the memory system, in order to determine a suitable number of cache lines to be pre-fetched ahead of the current one.

[1] Intel Itanium Architecture Software Developer's Manual
[2] Quadrics Ltd.
[3] Petrini, F., et al., "The Quadrics Network: High-Performance Clustering Technology," IEEE Micro, 2002
[4] Braam, P.J., "The Lustre Storage Architecture," Cluster File Systems, Inc.
[5] IAG
[6] Maroney, M., "Porting to Linux for the Intel® Itanium® Architecture," Trimar Communications
[7] pfmon-2.0 Itanium 2-specific documentation
[8] Intel Itanium Processor Family Performance Tuning Guide


Note: the runs described here used an experimental version of HP XC, still employing the Quadrics Ltd. Resource Management System (RMS), which has been replaced in the current release of the HP XC software.

