Advanced Scientific Computing at Jefferson Lab
September 18, 2000 Document prepared by Thomas Jefferson National Accelerator Facility Contacts: Ian Bird (firstname.lastname@example.org), Chip Watson (email@example.com), R. Roy Whitney (firstname.lastname@example.org)
In two previous papers (“Scientific Discovery through Advanced Computing” (1) and “Addressing the Need for Advanced Scientific Computing in the DOE Office of Science” (2)), the case was made for an advanced scientific computing program within the DOE Office of Science that would reach out into the wider scientific community. The present paper discusses scientific research programs at Thomas Jefferson National Accelerator Facility (Jefferson Lab) that could benefit from significant high-end computing resources configured as application-specific centers, as discussed in (1) and (2).

The national program proposed in (2) has three basic elements:

1. One or two major national supercomputing centers along the lines of the National Energy Research Scientific Computing Center (NERSC);
2. 10-15 application-specific centers, each with about 5-10% of the resources of a major center, devoted to the computational tasks of a specific science problem;
3. Research and development of enabling technologies, in particular high-speed networking.

The use of very large high-end computing facilities such as NERSC is essential for some problems. However, smaller, application-specific centers permit optimization of the computer architecture for specific science applications. In addition, not all applications are currently able to make optimal use of a very large high-end machine. Application-specific centers also bring access to high-end computing within reach of the wider scientific community, benefiting that community and encouraging the development of ideas and expertise in essential computing techniques and software development.

Currently, Jefferson Lab’s theoretical nuclear physics program requires significant high-end computing resources in order to calculate theoretical values at a level of precision equivalent to that of the experimental program.
The simulation demands of the Lattice QCD research program require the acquisition of a massively parallel computer of 1 Tera-op capacity as soon as possible, growing to 5-10 Tera-ops within 5 years. We could also anticipate a dedicated machine for the materials science program associated with the Free Electron Laser facility as that facility and its user community grow. It is expected that such a program would be 2-3 years behind the LQCD program in development, and would eventually require resources of a similar scale. These demands provide a good example of how Jefferson Lab could benefit from one or two of the application-specific computing centers proposed above. Locating such centers at Jefferson Lab would leverage existing infrastructure, facilities, and expertise, in addition to enabling the necessary advances in science. Since all the science programs at Jefferson Lab involve nationwide and international collaborations of physicists and computer scientists, the aim of bringing high-end computing facilities within reach of the wider university community is also achieved. In line with the program cost outlined in the previous paper, it is envisaged that the cost for an LQCD center would be $3-5M per year, leveraged on top of the existing infrastructure.
Jefferson Lab Science programs
The Thomas Jefferson National Accelerator Facility (Jefferson Lab) operates a high-intensity continuous-wave electron accelerator and hosts experiments aimed at understanding the structure of nuclei in the energy regime that overlaps traditional nuclear and high-energy physics. The physics is described by the theory of strong interactions between quarks and gluons –
Quantum Chromodynamics (QCD). The need for theoretical understanding that goes hand in hand with the experimental program is driving an active research effort to model QCD. The technique, in which the interactions are modeled on a discrete multi-dimensional grid or lattice (Lattice QCD), enables the prediction of fundamental parameters from first principles, and it is eminently suited to implementation on a massively parallel supercomputer. The precision that can be reached in these calculations depends largely on the amount of compute power available. It is important to match the precision of current experiments, which requires a machine capable of multiple Tera-ops. A powerful enough machine, together with sufficiently precise models and predictions, will enable future experiments to be designed that focus on the most significant physics problems. At present there are no computational resources available to the national nuclear physics theory community on the scale required to undertake world-class calculations of hadronic structure. A single large-scale LQCD calculation requires approximately 1 Tera-op-month of dedicated resources. Thus, to achieve the required level of detail and to be competitive with initiatives in Europe and Japan, a system needs to be of the scale of at least 1 Tera-op, expanding to 5-10 Tera-ops within 5 years. Jefferson Lab is participating in a coherent national plan to begin to address the need for computational resources for LQCD in High Energy and Nuclear Physics. The initiatives at Jefferson Lab concentrate on those areas relevant to the unique experimental capabilities of the laboratory – namely hadronic physics, including the quark and gluon structure of hadrons, the spectroscopy of conventional and exotic hadronic states, and interactions between hadrons. An additional such center could be used to support materials science research at the Free Electron Laser.
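As a rough illustration (the conversion below is ours, not the paper's), the "1 Tera-op-month" figure can be turned into an absolute operation count, and compared with a commodity cluster of the kind discussed later; the 256-node machine and the assumed ~1.2 Gflops sustained per node are illustrative numbers:

```python
# Back-of-envelope sizing for the "1 Tera-op-month" figure quoted above.
# The per-node rate is an assumption chosen to reproduce the 0.3 Tera-op
# (256 node) machine mentioned elsewhere in the paper.

TERA = 1.0e12                        # operations per second at 1 Tera-op sustained
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.6 million seconds

total_ops = TERA * SECONDS_PER_MONTH
print(f"1 Tera-op-month = {total_ops:.3e} operations")

nodes = 256
per_node_gflops = 1.2                # assumed sustained rate per node
cluster_teraops = nodes * per_node_gflops * 1e9 / TERA
print(f"{nodes} nodes x {per_node_gflops} Gflops = {cluster_teraops:.2f} Tera-ops")

# Months such a cluster would need for one large-scale calculation:
months = 1.0 / cluster_teraops
print(f"one 1 Tera-op-month calculation takes about {months:.1f} months")
```

Under these assumptions a single large-scale calculation corresponds to roughly 2.6 × 10^18 operations, which makes clear why a dedicated multi-Tera-op machine, rather than shared time on a general-purpose center, is the natural resource for this program.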
The Free Electron Laser (FEL) operated by Jefferson Lab is the world’s most powerful high-average-power infrared laser by a factor of 100, with upgrades under construction to increase its power to 10 kW and to extend its capabilities into the ultraviolet. This power will be a factor of 1000 greater than that of any other tunable infrared laser. The FEL program is the basis of a potentially large program of materials science research, whose computer modeling and simulation requirements are expected to be significant. As this program grows and develops its user research community, a second application-specific facility to support it is anticipated, on a timescale somewhat (2-3 years) behind that of the LQCD system. Both the LQCD and the FEL programs have significant off-site user communities, as does the experimental program. An advanced networking infrastructure is therefore important to enable access to the facilities by the user community. The significant computational requirements described above are typical of such science-driven computing needs, and there are a multitude of other compute-bound problems within the DOE Office of Science programs that would benefit from the availability of this type of center.
Leverage existing facilities and expertise
Jefferson Lab is presently operating a small LQCD development system, and is a member of a collaboration that proposes to build a 0.3 Tera-op (256 node) machine at Jefferson Lab and similar machines at other collaborating sites. As noted, the physicists engaged in this work require a system of several Tera-ops to match current experimental accuracy, growing to of order 5-10 Tera-ops within 5 years. On that timescale we would anticipate the major national supercomputing centers to be operating machines of close to 100 Tera-ops. Such an installation would thus be an application-specific computing center devoted to the science of the laboratory. The architecture would be tailored to the needs of Lattice QCD (LQCD), which requires high memory access speeds and fast floating-point processors but relatively little memory or storage. The experimental nuclear physics program at Jefferson Lab and its associated computing and network infrastructure are large enough that the infrastructure requirements of the proposed LQCD facility could be readily absorbed without significant impact.
Currently the experimental program collects data at the rate of 100 TB per year, which is stored in a robotic tape silo with a capacity of 300 TB, supplemented by an online disk storage capacity of 10 TB. The data are analyzed on a farm of 250 processors; the present data analysis is a trivially parallel application in which one data set is processed on each node. Anticipated upgrades to the accelerator will support experiments with data rates close to an order of magnitude larger on a 5-year timescale, and appropriate increases in storage, processing power, and networking will be implemented to handle the new experiments.

A program of Lattice QCD computing on the scale of a few Tera-ops will require systems built from a number of discrete processors similar to that of the experimental physics processing farms. Data storage requirements, however, will be significantly less than those of the experimental program and can easily be absorbed by the planned facilities. The wide-area networking bandwidth required for the experimental program will be significant and will provide sufficient capacity for the simulation efforts. Thus building a massively parallel supercomputer at Jefferson Lab can leverage the existing and planned mass storage and network infrastructure, as well as existing expertise in managing large system clusters. Jefferson Lab is already collaborating in Lattice QCD initiatives and is able to provide the access to computing facilities needed to put this computational power into the hands of the scientific community. The laboratory is also a member of the Particle Physics Data Grid collaboration, which is building the wide-area software infrastructure needed to manipulate Petabyte-scale datasets for High Energy and Nuclear Physics experiments, and so is able to leverage that effort for a program of LQCD- and FEL-driven simulations that must reach a distributed user community.
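The "trivially parallel" analysis model described above, in which each node processes one data set with no communication between nodes, can be sketched in miniature. The `analyze` function and the per-run event counts below are hypothetical stand-ins for the real reconstruction job:

```python
# Minimal sketch of a trivially parallel analysis farm: independent data
# sets are dispatched to a pool of workers, and no worker communicates
# with any other.  "analyze" is an illustrative stand-in; the real farm
# would stage a run's file from the tape silo and write summary output.
from multiprocessing import Pool

def analyze(run_number: int) -> int:
    # Fake per-run event count, standing in for real reconstruction work.
    return 1000 + run_number

if __name__ == "__main__":
    with Pool(processes=4) as pool:      # the real farm used ~250 processors
        totals = pool.map(analyze, range(8))
    print(f"processed {len(totals)} runs, {sum(totals)} events")
```

Because runs are independent, throughput scales almost linearly with the number of workers, which is why a commodity farm suits this workload; the proposed LQCD machine differs in that its nodes must exchange lattice boundary data, so it additionally needs a fast interconnect.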
While there are some scientific problems that require the very highest-end computing resources, Jefferson Lab science programs have an urgent need for Tera-scale dedicated computing facilities, to allow the investment in the experimental program to be fully realized through a complementary theoretical program. Given the rapid and dramatic increases in microprocessor and networking technologies expected in the next few years, building a massively parallel computer from commercial components is an undertaking similar in magnitude to the processor farms that Jefferson Lab and other national laboratories and universities are already operating in support of their existing programs. The ancillary mass storage and networking requirements of modeling and simulation studies are considerably less than those of the experimental programs, and can readily be handled by Jefferson Lab, which currently manipulates several Terabytes a day and is planning for an order of magnitude greater data rates for the anticipated upgrades to the experimental program. The expertise needed to operate clusters built from hundreds of processors is already available. In addition, Jefferson Lab has the infrastructure and mechanisms necessary to make the facilities available and accessible to the wider scientific community. Thus building such an application-specific center at Jefferson Lab would significantly benefit the science objectives in a cost-efficient way, and would also achieve the objective of delivering high-end computing capabilities to a broader community of both physicists and computer scientists.
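To put the "several Terabytes a day" figure into sustained-bandwidth terms (taking "several" to mean 3 TB/day, an assumption of ours, not a number from the paper):

```python
# Illustrative conversion of a daily data volume into sustained bandwidth.
# The 3 TB/day value is an assumed reading of "several Terabytes a day".
TB = 1e12          # bytes per Terabyte (decimal)
DAY = 24 * 3600    # seconds per day

daily_volume_tb = 3
rate_mb_s = daily_volume_tb * TB / DAY / 1e6
print(f"{daily_volume_tb} TB/day = {rate_mb_s:.0f} MB/s sustained")

# The order-of-magnitude upgrade anticipated for the experimental program:
print(f"{10 * daily_volume_tb} TB/day = {10 * rate_mb_s:.0f} MB/s sustained")
```

Even the upgraded experimental rate works out to a few hundred MB/s sustained, well within what a laboratory-scale mass storage and networking plant can provide, which supports the claim that the simulation program's smaller data needs can be absorbed.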
References

1. “Scientific Discovery through Advanced Scientific Computing”, Department of Energy, Office of Science, March 2000; available at http://www.sc.doe.gov/production/octr
2. “Addressing the Need for Advanced Scientific Computing in the DOE Office of Science”, Thomas Jefferson National Accelerator Facility, September 2000.