					TeraGrid Extension: Bridging to XD




1 TeraGrid Extension: Bridging to XD

                                 Submitted to the National Science Foundation as an invited proposal.


Principal Investigator
Ian Foster
Director, Computation Institute
University of Chicago
5640 S. Ellis Ave, Room 405
Chicago, IL 60637
Tel: (630) 252-4619
Email: foster@mcs.anl.gov


Co-Principal Investigators
John Towns
TeraGrid Forum Chair
Director, Persistent Infrastructure, NCSA/Illinois

Matthew Heinzel
Deputy Director, TeraGrid GIG
U Chicago


Senior Personnel
Phil Andrews
Project Director, NICS/U Tennessee

Jay Boisseau
Director, TACC/U Texas Austin

John Cobb
???, ORNL

Nick Karonis
Professor and Acting Chair, Department of Computer Science, NIU

Daniel S. Katz
Director for Cyberinfrastructure Development, CCT/LSU

Rich Loft
Director of Technology Development, CISL/NCAR

Richard Moore
Deputy Director, SDSC/UCSD

Ralph Roskies
Co-Scientific Director, PSC

Carol X. Song
Senior Research Scientist, RCAC/Purdue

Craig Stewart
Associate Dean, Research Technologies, Indiana




                            TeraGrid Principal Investigators (GIG and RPs)
         Ian Foster (GIG)            University of Chicago/Argonne National Laboratory (UC/ANL)
         Phil Andrews                University of Tennessee (UT-NICS)
         Jay Boisseau                Texas Advanced Computing Center (TACC)
         John Cobb                   Oak Ridge National Laboratory (ORNL)
         Michael Levine              Pittsburgh Supercomputing Center (PSC)
         Rich Loft                   National Center for Atmospheric Research (NCAR)
         Charles McMahon             Louisiana Optical Network Initiative/Louisiana State University
                                     (LONI/LSU)
         Richard Moore               San Diego Supercomputer Center (SDSC)
         Carol Song                  Purdue University (PU)
         Rick Stevens                University of Chicago/Argonne National Laboratory (UC/ANL)
         Craig Stewart               Indiana University (IU)
         John Towns                  National Center for Supercomputing Applications (NCSA)


TeraGrid Senior Personnel
Grid Infrastructure Group
Matt Heinzel (UC) Deputy Director of the TeraGrid GIG
Tim Cockerill (NCSA) Project Management Working Group
Kelly Gaither (TACC) Data Analysis and Visualization
David Hart (SDSC) User Facing Projects
Daniel S. Katz (LSU) GIG Director of Science
Scott Lathrop (UC/ANL) Education, Outreach and Training; External Relations
Elizabeth Leake (UC) External Relations
Lee Liming (UC/ANL) Software Integration and Scheduling
Amit Majumdar (SDSC) Advanced User Support
J.P. Navarro (UC/ANL) Software Integration and Scheduling
Mike Northrop (UC) GIG Project Manager
Tony Rimovsky (NCSA) Networking, Operations and Security
Sergiu Sanielevici (PSC) User Services and Support
Nancy Wilkins-Diehr (SDSC) Science Gateways







2 Project Summary
The TeraGrid is an advanced, nationally distributed, open cyberinfrastructure (CI) comprised of
supercomputing, storage, and visualization systems, data collections, and science gateways,
connected by high-bandwidth networks, integrated by coordinated policies and operations, and
supported by computing and technology experts, that enables and supports leading-edge
scientific discovery and promotes science and technology education. TeraGrid's three-part
mission is summarized as “deep, wide, and open”: supporting the most advanced computational
science in multiple domains; expanding usage and impact, and empowering new communities
of users; and providing resources and services that can be extended to a broader CI, enabling
researchers and educators to use TG resources in concert with personal, local/campus, and
other large-scale CI resources.
TeraGrid has enabled innumerable scientific achievements in almost all fields of science and
engineering. The project is driven by user projects, which are reviewed for
potential impact based on their need for advanced CI resources and support. Through the
coordinated capabilities of its staff, resources, and services, TeraGrid enables deep impact
through cutting-edge, even transformative, science and engineering by experts and teams of expert
users making highly skilled use of TeraGrid resources. TeraGrid also supports a wider
community of much larger, domain-focused groups of users that may not possess specific high-
performance computing skills but who are addressing important scientific research and
education problems.
Advanced user support joins users with TeraGrid experts to increase research efficiency and
productivity, define best practices, and create a vanguard of early adopters of new capabilities.
Advanced scheduling and metascheduling services support use cases that require or benefit
from cross-site capabilities. Advanced data services provide a consistent, high-level approach to
multi-site data management and analysis. Support for science gateways provides community-
designed interfaces to TeraGrid resources and extends access to data collections, community
collaboration tools, and visualization capabilities to a much wider audience of users.
Transformative science and engineering on the TeraGrid also depends on its resources working
in concert, which requires a coordinated user support system, centralized mechanisms for user
access and information, a common allocations process and allocations management, and a
coordinated user environment. Underlying this user support environment, TeraGrid maintains a
robust, centrally managed infrastructure for networking, security and authentication, and
operational services.
TeraGrid will continue to support several resources into 2011 under this proposal. These include
the three Track 2 systems; Pople, a shared memory system particularly useful to a number of
newer users; and four IA32-64 clusters. The Track 2 systems provide petascale computing
capabilities for very large simulations; the clusters provide nearly 270 Tflops of computing power
to support large-scale interactive, on-demand, and science gateway use. Allocated as a single resource,
these latter systems will permit metascheduling and advanced reservations; allow high-
throughput, Open Science Grid-style jobs; enable exploration of interoperability and technology
sharing; and provide a transition platform for users coming from university- or departmental-
level resources. TeraGrid will also support unique compute platforms and massive storage
systems, and will integrate new systems from additional OCI awards.
TeraGrid will also provide vigorous efforts in training, education and outreach. Through these
efforts, TeraGrid will engage and retain larger and more diverse communities in advancing
scientific discovery. TeraGrid will engage under-represented communities, in which
under-representation includes race, gender, disability, discipline, and institution, and continue to build
strong partnerships in order to offer the best possible HPC learning and workforce development
programs and increase the number of well-prepared STEM researchers and educators.







3 Table of Contents
1 TeraGrid Extension: Bridging to XD
2 Project Summary
3 Table of Contents
4 Project Description
  4.1 Introduction
    4.1.1 TeraGrid Organization and Management
    4.1.2 Advisory Groups
  4.2 TeraGrid Science
    4.2.1 Geosciences – SCEC, PI Tom Jordan, USC
    4.2.2 Social Sciences (SIDGrid) – PI Rick Stevens, University of Chicago; TeraGrid Allocation PI, Sarah Kenny, University of Chicago
    4.2.3 Astronomy – PI Mike Norman, UCSD, Tom Quinn, U. Washington
    4.2.4 Biochemistry/Molecular Dynamics – Multiple PIs (Adrian Roitberg, U. Florida, Tom Cheatham, U. Utah, Greg Voth, U. Utah, Klaus Schulten, UIUC, Carlos Simmerling, Stony Brook, etc.)
    4.2.5 CFD – PI Krishnan Mahesh, U. Minnesota
    4.2.6 Structural Engineering – Multiple NEES PIs
    4.2.7 Biosciences – PI George Karniadakis, Brown University
    4.2.8 Neutron Science – PI John Cobb, ORNL
    4.2.9 Chemistry (GridChem) – Project PI John Connolly, University of Kentucky; TeraGrid Allocation PI Sudhakar Pamidighantam, NCSA
    4.2.10 Astrophysics – PI Erik Schnetter, LSU, Christian D. Ott, Caltech, Denis Pollney, and Luciano Rezzolla, AEI
    4.2.11 Biosciences (Robetta) – PI David Baker, University of Washington
    4.2.12 GIScience – PI Shaowen Wang, University of Illinois
    4.2.13 Computer Science: Solving Large Sequential Two-person Zero-sum Games of Imperfect Information – PI Tuomas Sandholm, Carnegie Mellon University
    4.2.14 Nanoscale Electronic Structures/nanoHUB – PI Gerhard Klimeck, Purdue University
    4.2.15 Atmospheric Sciences (LEAD), PI Kelvin Droegemeier, University of Oklahoma
  4.3 Advanced Capabilities Enabling Science
    4.3.1 Advanced User Support
    4.3.2 Advanced Scheduling and meta scheduling
    4.3.3 Advanced Data Services
    4.3.4 Visualization and Data Analysis
    4.3.5 Science Gateways
  4.4 Supporting the User Community
    4.4.1 User Information and Access Environment
    4.4.2 User Authentication and Allocations
    4.4.3 Frontline User Support
    4.4.4 Training
  4.5 Integrated Operations of TeraGrid
    4.5.1 Packaging and maintaining CTSS Kits
    4.5.2 Information Services
    4.5.3 Supporting Software Integration and Information Services
    4.5.4 Networking
    4.5.5 Security
    4.5.6 Quality Assurance
    4.5.7 Common User Environment
    4.5.8 Operational Services
    4.5.9 RP Operations
  4.6 Education, Outreach, Collaboration, and Partnerships
    4.6.1 Education
    4.6.2 Outreach
    4.6.3 Enhancing Diversity
    4.6.4 External Relations (ER)
    4.6.5 Collaborations and Partnerships
  4.7 Project Management and Leadership
    4.7.1 Project and Financial Management
    4.7.2 Leadership







4 Project Description
4.1      Introduction
TeraGrid's three-part mission is to support the most advanced computational science in
multiple domains, to empower new communities of users, and to provide resources and
services that can be extended to a broader cyberinfrastructure.
The TeraGrid is an advanced, nationally distributed, open cyberinfrastructure comprised of
supercomputing, storage, and visualization systems, data collections, and science gateways,
integrated by software services and high bandwidth networks, coordinated through common
policies and operations, and supported by computing and technology experts, that enables and
supports leading-edge scientific discovery and promotes science and technology education.
Accomplishing this vision is crucial for the advancement of many areas of scientific discovery,
ensuring US scientific leadership, and increasingly, for addressing critical societal issues.
TeraGrid achieves its purpose and fulfills its mission through a three-pronged focus:
         deep: ensure profound impact for the most experienced users, through provision of the
         most powerful computational resources and advanced computational expertise;
         wide: enable scientific discovery by broader and more diverse communities of
         researchers and educators who can leverage TeraGrid’s high-end resources, portals
         and science gateways; and
         open: facilitate simple integration with the broader cyberinfrastructure through the use of
         open interfaces, partnerships with other grids, and collaborations with other science
         research groups delivering and supporting open cyberinfrastructure facilities.
The TeraGrid’s deep goal is to enable transformational scientific discovery through
leadership in HPC for high-end computational research. The TeraGrid enables high‐end
science utilizing powerful supercomputing systems and high‐end resources for the data
analysis, visualization, management, storage, and transfer capabilities required by large‐scale
simulation and analysis. All of this requires an increasingly diverse set of leadership‐class
resources and services, and deep intellectual expertise.
The TeraGrid’s wide goal is to increase the overall impact of TeraGrid’s advanced
computational resources to larger and more diverse research and education
communities through user interfaces and portals, domain specific gateways, and
enhanced support that facilitate scientific discovery by people without requiring them to
become high performance computing experts. The complexity of using TeraGrid’s high‐end
resources continues to grow as systems increase in scale and evolve with new technologies.
TeraGrid broadens the scientific user base of its resources via the development and support of
simple and powerful interfaces, ranging from common user environments to Science Gateways
and portals, through more focused outreach and collaboration with science domain research
groups, and by educational and outreach efforts that will help inspire and educate the next
generation of America’s leading‐edge scientists.
TeraGrid’s open goal is twofold: to ensure the expansibility and future viability of the
TeraGrid by using open standards and interfaces; and to ensure that the TeraGrid is
interoperable with other, open-standards-based cyberinfrastructure facilities. TeraGrid
must enable its high-end cyberinfrastructure to be more accessible from, and even integrated
with, cyberinfrastructure of all scales. That includes not just other grids, but also campus
cyberinfrastructures and even individual researcher labs/systems. The TeraGrid leads the
community forward by providing an open infrastructure that enables, simplifies, and even
encourages scaling out to its leadership-class resources by establishing models in which
computational resources can be integrated both for current and new modalities of science. This
openness includes interfaces and APIs, but goes further to include appropriate policies, support,
training, and community building.
This proposal is to extend the TeraGrid program of enabling transformative scientific research
through a sixth year, from April 2010 to March 2011, until the revised start date of April 2011 for
the “XD” follow-on program. This includes many of the integrative activities of the Grid
Infrastructure Group (GIG) as well as an extension of the highest-value resources not separately
funded under the Track 2 or Track 1 solicitations. TeraGrid will operate resources that would
otherwise not be available to users, such as Cobalt and Pople, shared memory systems that are
particularly useful to a number of newer users in areas such as game theory, web analytics, and
machine learning, as well as being a key part of the workflow in a number of more
established applications; and IA32-64 clusters, providing nearly 270 Tflops of computing power,
that will support large-scale interactive, on-demand, and science gateway use. Allocated as a
single resource, these clusters will permit metascheduling and advanced reservations; allow
high-throughput, Open Science Grid-style jobs; enable exploration of interoperability and
technology sharing; and provide a transition platform for users coming from university- or
departmental-level resources. TeraGrid will provide networking both within the TeraGrid and to
other resources, such as on campuses or in other cyberinfrastructures. It will provide common
grid software to enable easy use of multiple TeraGrid and non-TeraGrid resources, including
expanding the wide-area filesystems Lustre and GPFS. It will provide and enhance the TeraGrid
user portal (enabling single-sign-on, common access to TeraGrid resources and information),
services such as metascheduling (automated selection of specific resources), co-scheduling
(use of multiple resources for a single job), reservations (use of a resource at a specific time),
workflows (use of single or multiple resources for a set of jobs), and gateways (interfaces to
resources that hide complex features or usage patterns, or tie TeraGrid resources to additional
datasets and capabilities). It will provide an extensive set of user support services to enable the
scientific community to make best use of the resources in creating transformative research. It
will provide a vigorous education, training and outreach program to make more people aware of
TeraGrid’s potential for scientific discovery and make them more proficient in exploiting that
potential.
4.1.1 TeraGrid Organization and Management
The coordination and management of the TeraGrid partners and resources require organizational
and collaboration mechanisms that are different from a classic organizational structure. The
existing structure and practice have evolved from many years of collaboration, many predating
the TeraGrid.
The TeraGrid team (Figure 1) is comprised of eleven resource providers (RPs) and the Grid
Infrastructure Group (GIG). The GIG provides user support coordination, software integration,
operations, management and planning. GIG area directors (ADs) direct project activities
involving staff from multiple partner sites, coordinating and maintaining TeraGrid central services.
Figure 1: TeraGrid Facility Partner Institutions
TeraGrid policy and governance rest with the TeraGrid Forum (TG Forum), comprised of the RP
PIs and the GIG PI. The TG Forum is led by a Chairperson, a GIG-funded position filled by
Towns this past year as the result of an election within the TG Forum. The Chairperson
facilitates the functioning of the TG Forum on behalf of the overall collaboration. The decision
to fund this position (at 50%) was made collectively, in recognition of the substantial time
commitment required.
TeraGrid management and planning is coordinated via a series of regular meetings, including
weekly project-wide Round Table meetings (held via Access Grid), weekly TeraGrid AD and
biweekly TG Forum teleconferences, and quarterly face-to-face internal project meetings. This
past year saw the first execution of a fully integrated annual planning process developed over
the past several years. Coordination of project staff in terms of detailed technical analysis and
planning is done through two types of technical groups: working groups and Requirement
Analysis Teams (RATs). Working groups are persistent coordination teams and in general have
participants from all RP sites; RATs are short-term (6-10 weeks) focused planning teams that
are typically small, with experts from a subset of both RP and GIG. Both groups make
recommendations to the TG Forum or, as appropriate, to the GIG management team.
4.1.2 Advisory Groups
The NSF/TeraGrid Science Advisory Board (SAB) consists of 14 people from a wide spectrum
of disciplines. The SAB provides advice to the TG Forum and the NSF TeraGrid Program
Officer on a wide spectrum of scientific and technical activities within or involving the TeraGrid.
The SAB considers the progress and quality of these activities, their balance, and the
TeraGrid’s interactions with the national and international research community, with the ultimate
aim of building a more unified TeraGrid and enhancing the progress of those aspects of
academic research and education that require high-end computing. The SAB advises on future
TeraGrid plans, identifies synergies between TeraGrid activities and related efforts in other
agencies, promotes the TeraGrid mission and its activities in the national and international
community, and provides help in building and expanding the TeraGrid community.
The SAB members are: Chair: James Kinter, Director of Center for Ocean-Land-Atmosphere
Studies; Bill Feiereisen, New Mexico; Thomas Cheatham, Utah; Gwen Jacobs, Montana State;
Dave Kaeli, Northeastern; Michael Macy, Cornell; Phil Maechling, USC; Alex Ramirez, HACU;
Nora Sabelli, SRI; Pat Teller, UTexas, El Paso; P. K. Yeung, Georgia Tech; Cathy Wu,
Georgetown; Eric Chassignet, Florida State; and Luis Lehner, Louisiana State.
4.2      TeraGrid Science
The TeraGrid aims to support, enable, and accelerate scientific research and education that
requires the high-end capabilities offered by a national cyberinfrastructure of high-end resources
and expert support. This comprehensive cyberinfrastructure enables many usage modalities
and levels of users, with an emphasis on breakthrough, even transformative, results for all
projects. The usage of TeraGrid and the resulting impact can generally be categorized
according to the TeraGrid mission’s focusing principles: deep, wide, and open. The TeraGrid is
oriented towards user-driven projects, with each project being led by a PI who applies for an
allocation to enable transformative scientific discovery through advanced computing. A project
can consist of a set of users identified by the PI, or a community represented by the PI. In
general, the TeraGrid’s deep focus represents projects that are usually small, established
groups of expert users making highly-skilled use of TeraGrid resources, and the TeraGrid’s
wide focus represents projects that are either new or established science communities using
TeraGrid resources for both research and education, without requiring specific high-performance
computing skills, even for users who are domain science experts.
In both cases, various capabilities of the TeraGrid’s open focus can be needed, such as
networking (both within the TeraGrid and to other resources, such as on campuses or in other
cyberinfrastructures), common grid software (to enable easy use of multiple TeraGrid and non-
TeraGrid resources), the TeraGrid user portal (enabling single-sign-on, common access to
TeraGrid resources and information), services such as metascheduling (automated selection of
specific resources), co-scheduling (use of multiple resources for a single job), reservations (use
of a resource at a specific time), workflows (use of single or multiple resources for a set of jobs),
and gateways (interfaces to resources that hide complex features or usage patterns, or tie
TeraGrid resources to additional datasets and capabilities). On the other hand, a number of the
most experienced TeraGrid users simply want the low-overhead access to a single machine that
best matches their needs. Even in this category, the variety of architectures of the TeraGrid
enables applications that would not run well on simple clusters, including those that require the
lowest latency and microkernel operating systems to scale well, and those that require large
amounts of shared memory. While the Track 2 systems (Ranger and Kraken) will continue to be
supported even if this proposal is not funded (albeit more individually), neither the four terascale
x86-64 systems currently in heavy use nor the shared memory systems (Pople and Cobalt) will
continue to be available to the national user community without this proposed work. The
former systems have great potential for enabling much greater interactive usage and science
gateway support, and the latter systems are particularly useful to a number of newer users in
areas such as game theory, web analytics, and machine learning, as well as being a key part of the
workflow in a number of more established applications, as described below.
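The metascheduling service named above (automated selection of specific resources) can be illustrated with a minimal sketch. It is not TeraGrid's metascheduler: the `Resource` fields, the resource names, and all numbers are hypothetical, and the selection rule (shortest estimated queue wait among resources that satisfy the request) is an assumed policy chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of one compute resource."""
    name: str
    free_cores: int
    shared_memory_gb: int      # memory addressable by a single job
    est_wait_minutes: float    # assumed queue-wait estimate


def select_resource(resources: Sequence[Resource], cores: int,
                    memory_gb: int = 0) -> Optional[Resource]:
    """Return the eligible resource with the shortest estimated wait, or None."""
    eligible = [r for r in resources
                if r.free_cores >= cores and r.shared_memory_gb >= memory_gb]
    return min(eligible, key=lambda r: r.est_wait_minutes, default=None)


# Illustrative numbers only; they do not describe real system state.
pool = [
    Resource("cluster-a", free_cores=2048, shared_memory_gb=16, est_wait_minutes=45.0),
    Resource("cluster-b", free_cores=512, shared_memory_gb=16, est_wait_minutes=5.0),
    Resource("shared-memory-c", free_cores=256, shared_memory_gb=1024, est_wait_minutes=90.0),
]

print(select_resource(pool, cores=1024).name)                # cluster-a
print(select_resource(pool, cores=128, memory_gb=512).name)  # shared-memory-c
```

The second request shows why architectural variety matters in this sketch: only the large shared memory system is eligible for the job, even though its estimated wait is the longest.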
4.2.1    Geosciences – SCEC, PI Tom Jordan, USC
The Southern California Earthquake Center (SCEC) is an inter-disciplinary research group that
includes over 600 geoscientists, computational scientists and computer scientists from about 60
institutions, including the United States Geological Survey. Its goal is to develop an
understanding of earthquakes, and to mitigate risks of loss of life and property damage through
this understanding. SCEC is an exemplar for using the distributed resources of TeraGrid in an
integrated manner to achieve transformative geophysical science results. SCEC simulations
consist of highly scalable runs, mid-range core count runs and embarrassingly parallel small-
core count runs, and they require high bandwidth data transfer and large storage for post
processing and data sharing. These science results directly impact everyday life by contributing
to new building codes (used for construction of buildings in a city, hospitals, and nuclear
reactors), emergency planning, etc., and could potentially save billions of dollars through
proactive planning and construction.
For high core-count runs, SCEC researchers use highly scalable codes (AWM-Olsen,
Hercules, AWP-Graves) on many tens of thousands of processors of the largest TeraGrid
systems (TACC Ranger and NICS Kraken) to improve the resolution of dynamic rupture
simulations by an order of magnitude and to study the impact of geophysical parameters. These
codes are also used to run high-frequency (currently 1.0 Hz, and higher in the
future) wave propagation simulations of earthquakes on systems at SDSC, TACC and PSC.
Validating the results requires running different codes on different machines and checking that
the ground motions projected by the simulations match. Systems are also chosen
based on the memory requirements of mesh and source partitioning, which needs large-memory
machines; PSC’s Pople has been used for this.
For mid-range core count runs, SCEC researchers are carrying out full 3D tomography (called
Tera3D) data-intensive runs on NCSA Abe and other clusters using a few hundred to a few
thousand cores. SCEC researchers are also studying “inverse” problems that require
running many forward simulations, while perturbing the ground structure model and comparing
against recorded surface data. As this inverse problem requires hundreds of forward runs, it is
necessary to recruit multiple platforms to distribute this work.
Another important aspect of SCEC research is the CyberShake project, which uses 3D
waveform modeling (Tera3D) to calculate probabilistic seismic hazard analysis (PSHA) curves for sites in
Southern California. A PSHA map provides estimates of the probability that the ground motion
at a site will exceed some intensity measure over a given time period. For each point of interest,
the CyberShake platform includes two large-scale MPI calculations and approximately 840,000
data-intensive pleasingly-parallel post-processing jobs. The required complexity and scale of
these calculations have impeded production of detailed PSHA maps; however, through the
integration of hardware, software and people in a gateway-like framework, these techniques can
now be used to produce large numbers of research results. Grid-based workflow tools are used
to manage these hundreds of thousands of jobs on multiple TeraGrid clusters. Over 1 million
CPU hours were consumed in 2008 through this usage model.
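As a minimal sketch of what one point on a PSHA curve represents, the Python snippet below combines per-rupture annual rates with conditional exceedance probabilities to estimate the probability of exceeding an intensity threshold over a time period. The Poisson occurrence model, the function name, and all numerical values are illustrative assumptions and are not taken from the CyberShake platform.

```python
import math

def exceedance_probability(ruptures, years):
    """Probability that ground motion exceeds a threshold at least once in `years`.

    `ruptures` is a list of (annual_rate, p_exceed) pairs, where p_exceed is the
    probability that the rupture, if it occurs, exceeds the intensity threshold.
    Assumes independent Poisson occurrences (an illustrative assumption).
    """
    annual_exceedance_rate = sum(rate * p for rate, p in ruptures)
    return 1.0 - math.exp(-annual_exceedance_rate * years)

# Hypothetical ruptures: (annual rate of occurrence, P(exceed | rupture))
ruptures = [(0.01, 0.5), (0.002, 0.9), (0.05, 0.02)]
p50 = exceedance_probability(ruptures, years=50)
```

Repeating this calculation over a range of intensity thresholds would trace out a hazard curve for one site; in a full study the conditional probabilities would come from the simulated ground motions rather than from hand-picked values.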
The high core-count simulations can produce 100-200 TB of output data. Much of this output
data is registered on the digital library on file systems at NCSA and SDSC’s GPFS-wan. In total,
SCEC requires close to half a petabyte of archival storage every year. Efficient data transfer
and access to large files for the Tera3D project are a high priority. To ensure the datasets are safely
archived, redundant copies of the datasets are kept at multiple locations. The collection of
Tera3D simulations includes more than a hundred million files, with each simulation organized
as a separate sub-collection in the iRODS data grid.
The distributed CI of TeraGrid, with a wide variety of HPC machines (with different numbers of
cores, memory/core, varying interconnect performance, etc.), high bandwidth network, large
parallel and wide-area file systems, and large archival storage, is needed to allow SCEC
researchers to carry out scientific research in an integrated manner.
4.2.2 Social Sciences (SIDGrid) – PI Rick Stevens, University of Chicago; TeraGrid
      Allocation PI, Sarah Kenny, University of Chicago
SIDGrid is a social science team using the TeraGrid to develop the only science gateway in this
field, providing some unique capabilities for social science researchers. Social scientists make
heavy use of “multimodal” data, streaming data that change over time. For example, a human
subject views a video while a researcher collects heart rate and eye movement data. Data
are collected many times per second and synchronized for analysis, resulting in large datasets.
Sophisticated analysis tools are required to study these datasets, which can involve multiple
datasets collected at different time scales. Providing these analysis capabilities through a
gateway has many advantages. Individual laboratories do not need to recreate the same
sophisticated analysis tools. Geographically distant researchers can collaborate on the analysis
of the same data sets. Social scientists from any institution can be involved in analysis,
increasing the opportunity for science impact by providing access to the highest quality data and
resources to all social scientists.
SIDGrid uses TeraGrid resources for computationally-intensive tasks such as media
transcoding (decoding and encoding between compression formats), pitch analysis of audio
tracks and functional Magnetic Resonance Imaging (fMRI) image analysis. These often result in
large numbers of single node jobs. Current platforms in use include TeraGrid roaming platforms
and TACC’s Spur and Ranger systems. Workflow tools such as SWIFT have been very useful in
job management.
Active users of the SIDGrid system include a human neuroscience group and linguistic research
groups from the University of Chicago and the University of Nottingham, UK. TeraGrid is
providing support to make use of the resources more effectively. Building on experiences with
OLSGW, the same team will address similar issues for SIDGrid. A new application framework
has been developed to enable users to easily deploy new social science applications in the
SIDGrid portal.
4.2.3 Astronomy – PI Mike Norman, UCSD, Tom Quinn, U. Washington
ENZO is a multi-purpose code (developed by Norman’s group at UCSD) for computational
astrophysics. It uses adaptive mesh refinement to achieve high temporal and spatial resolution,
and includes a particle-based method for modeling dark matter in cosmology simulations, and
state-of-the-art PPML algorithms for MHD. A version that couples radiation diffusion and
chemical kinetics is in development.
ENZO consists of several components that are used to create initial conditions, evolve a
simulation in time, and then analyze the results. Each component has quite different
computational requirements, and the requirements of the full set of components cannot be met at
any single TeraGrid site. For example, the current initial conditions generator for cosmology
runs is an OpenMP-parallel code that requires a large shared memory system; NCSA Cobalt is
the primary platform that runs this code at production scale. The initial conditions data can be
very large; the initial conditions for a 2048^3 cosmology run contain approximately 1 TB of data.
Production simulation runs are done mainly on NICS Kraken and TACC Ranger, so the initial
conditions generated on Cobalt must be transferred to these sites over the TeraGrid network
using GridFTP. Similarly, the output from an ENZO simulation must generally be transferred to a
site with suitable resources for analysis and visualization, both of which typically require large
shared memory systems similar to PSC’s Pople. Furthermore, some sites are better equipped to
provide long-term archival storage of a complete ENZO simulation (of the order of 100 TB) for a
period of several months to several years. Thus, almost every ENZO run at large scale is
dependent on multiple TeraGrid resources and the high-speed network links between the
TeraGrid sites.
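The data volumes quoted above can be checked with a short back-of-the-envelope calculation. The Python sketch below is illustrative only: it interprets 1 TB as 2^40 bytes, and the 10 Gb/s sustained transfer rate is a hypothetical figure, not a measured TeraGrid or GridFTP number.

```python
def bytes_per_cell(total_bytes, cells_per_side):
    """Average storage per grid cell for a cubic grid."""
    return total_bytes / cells_per_side ** 3

def transfer_hours(total_bytes, gigabits_per_second):
    """Idealized transfer time, ignoring protocol overhead and contention."""
    seconds = total_bytes * 8 / (gigabits_per_second * 1e9)
    return seconds / 3600

ONE_TB = 2 ** 40                              # assumption: binary terabyte
per_cell = bytes_per_cell(ONE_TB, 2048)       # storage per cell of a 2048^3 grid
hours = transfer_hours(ONE_TB, 10)            # hypothetical 10 Gb/s sustained rate
```

Under these assumptions the initial conditions amount to 128 bytes per cell and would take roughly a quarter of an hour to move between sites; a complete simulation archive of the order of 100 TB scales the transfer time accordingly.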
Quinn (University of Washington) is using the N-body cosmology code GASOLINE for analyzing
N-body simulation of structure formation in the universe. This code utilizes the TeraGrid
infrastructure in a fashion similar to the ENZO code. Generation of the initial conditions, done
using a serial code, requires several hundred GB of RAM and is optimally done on the NCSA Cobalt
system. Since the highly scalable N-body simulations are performed on PSC BigBen, the initial
condition data has to be transferred over the high bandwidth TeraGrid network. The total output,
especially when the code is run on Ranger and Kraken, can reach a few petabytes and
approximately one thousand files. The researchers use visualization software that allows
interactive steering, and they are exploring the TeraGrid global filesystems to ease data staging
for post processing and visualization.
4.2.4 Biochemistry/Molecular Dynamics – Multiple PIs (Adrian Roitberg, U. Florida,
      Tom Cheatham, U. Utah, Greg Voth, U. Utah, Klaus Schulten, UIUC, Carlos
      Simmerling, Stony Brook, etc.)
Many of the Molecular Dynamics users use the same codes (such as AMBER, NAMD,
CHARMM, LAMMPS, etc.) for their research, although they are looking at different research
problems, such as drug discovery, advanced materials research, and advanced enzymatic
catalyst design impacting areas such as bio-fuel research. The broad variation in the types of
calculations needed to complete various Molecular Dynamics workflows (including both
quantum and classical calculations), along with large scale storage and data transfer
requirements, defines a requirement for a diverse set of resources coupled with high bandwidth
networking. The TeraGrid, therefore, offers an ideal resource for all researchers who conduct
Molecular Dynamics simulations.
Quantum calculations, which are an integral part of the parameterization of force fields and are
often used for the Molecular Dynamics runs themselves in the form of hybrid QM/MM
calculations, require large shared memory machines like NCSA Cobalt or PSC Pople. The latest
generation of machines that feature large numbers of processors interconnected by high
bandwidth networks do not lend themselves to the extremely fine grained parallelization needed
for the rapid solving of the self consistent field equations necessary for QM/MM MD simulations.
(It should be noted that a number of MD research groups are working toward running
advanced quantum calculations on distributed core machines such as Ranger and Kraken.
The availability of those types of machines as testbeds and for future production use is very valuable
for these MD researchers.) Classical MD runs using the AMBER and NAMD packages (as well
as other commonly available MD packages) use the distributed memory architectures present in
Kraken, Abe, and Ranger very efficiently for running long time scale MD simulations. These
machines allow simulations that were not possible as recently as two years ago, and they
are having enormous impact on the field of MD. Some MD researchers use QM/MM techniques,
and these researchers benefit from the existence of machines with nodes that have different
amounts of memory per node, as the large memory nodes are used for the quantum
calculations, and the other nodes for the classical part of the job.
The reliability and predictability of biomolecular simulation is increasing at a fast pace and is
fueled by access to the NSF large-scale computational resources across the TeraGrid.
However, researchers are now entering a realm where they are becoming deluged by the data
and its subsequent analysis. More and more, large ensembles of simulations, often loosely
coupled, are run together to provide better statistics, sampling and efficient use of large-scale
parallel resources. Managing these simulations, performing post-processing/visualization, and
ultimately steering the simulations in real-time currently has to be done on local machines. The
TeraGrid Advanced User Support program is working with the MD researchers to address some
of these limitations. Although most researchers currently are bringing data back to their local
sites to do analysis, this is quickly becoming impractical and is limiting scientific discovery.
Access to large persistent analysis space linked to the various computational resources on the
TeraGrid by the high-bandwidth TeraGrid network is therefore essential to enabling
groundbreaking new discoveries in this field.
4.2.5 CFD – PI Krishnan Mahesh, U. Minnesota
Access to HPC resources with different system parameters is important for many CFD users.
Here we describe the use case scenario of a particular CFD user to show how the distributed
infrastructure of TeraGrid is needed and utilized by this user, representing many other CFD
projects and users. Mahesh uses an unstructured grid computational fluid dynamics code for
modeling the very complex geometries of real life engineering problems. For example, his code
has been used to conduct large eddy simulations of incompressible mixing in the exceedingly
complex geometry of gas-turbines. The unstructured grid approach has also been extended to
compressible flow solvers and used for studying jets in supersonic crossflow.
This code has been run at large scale, using up to 2048 cores and 50 million control volumes,
on multiple TG systems. The code shows very good weak scaling and the communication
pattern is largely localized to nearest neighbors. The code has been ported to Ranger and
Kraken and the PI is planning simulations on these machines at larger scales than possible on
previous TeraGrid clusters. These larger simulations will provide the capability to reach
resolution at a scale of Reynolds numbers observed only experimentally today. This will allow
him to solve engineering turbulence problems such as the flow around marine propellers
(simulating crashback where the propeller is suddenly spun in the reverse direction from its
normal direction). A critical component that is needed before the many thousand core-count
simulations can be done is the grid generation and initial condition generation needed by the
main runs. This generation step is serial and requires many hundreds of GB of
shared memory for large cases. This can only be done on large shared memory machines such
as Cobalt or Pople. The initial data then needs to be transferred to sites such as TACC and
NICS for the large scale simulations. After the simulation, either further data access is needed for
in-situ post-processing and visualization, or the output data is transferred back to the local site, at
the University of Minnesota, for post-processing and visualization. The high bandwidth network of
the TeraGrid is essential for both of these scenarios. Thus, even for this seemingly traditional
CFD user, the distributed infrastructure of TeraGrid with highly scalable machines, large shared
memory machines and the fast network is essential to carry out new engineering simulations.
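The weak-scaling behavior mentioned for this code can be quantified with a simple efficiency measure. The Python sketch below is a generic illustration, not the PI's benchmarking procedure; the timings are hypothetical values chosen only to show the calculation.

```python
def weak_scaling_efficiency(base_time, scaled_time):
    """Weak-scaling efficiency.

    The work per core is held fixed while the core count grows, so the ideal
    run time stays constant and efficiency = base_time / scaled_time.
    """
    if base_time <= 0 or scaled_time <= 0:
        raise ValueError("timings must be positive")
    return base_time / scaled_time

# Hypothetical wall-clock seconds per time step, with a fixed number of
# control volumes per core at every core count.
timings = {256: 1.00, 512: 1.02, 1024: 1.05, 2048: 1.10}
base = timings[256]
efficiency = {cores: weak_scaling_efficiency(base, t) for cores, t in timings.items()}
```

An efficiency that stays close to 1.0 as the core count grows is what "very good weak scaling" means in practice, and it is consistent with a communication pattern dominated by nearest-neighbor exchanges.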
4.2.6 Structural Engineering – Multiple NEES PIs
The Network for Earthquake Engineering Simulation (NEES) project is an NSF-funded MRE
project that seeks to lessen the impact of earthquake and tsunami-related disasters by providing
revolutionary capabilities for earthquake engineering research. A state-of-the-art network links
world-class experimental facilities around the country, making it possible for researchers to
collaborate remotely on experiments, computational modeling, data analysis and education.
NEES currently has about 75 users spread across about 15 universities. These users use
TeraGrid HPC and data resources for various kinds of structural engineering simulations using
both commercial codes and research codes based on algorithms developed by academic
researchers. Some of these simulations, especially those using commercial FEM codes such as
Abaqus, Ansys, Fluent, and LS-Dyna, require moderately large shared memory nodes, such as
the large memory nodes of Abe and Mercury, but scale to only a few tens of processors using
MPI. Large memory is needed so that the whole mesh structure can be read in to a single node
and this is necessary due to the basic FEM algorithm applied for some simulation problems.
Many of these codes have OpenMP parallelization, in addition to MPI parallelization, and users
mainly utilize shared memory nodes using OpenMP for pre/post processing. On the other hand,
some of the academic codes, such as the OpenSees simulation package tuned for specific
material behavior, have utilized many thousands of processors on machines including Kraken
and Ranger, scaling well at these high core counts. Due to the geographically distributed
locations of NEES researchers and experimental facilities, high bandwidth data transfer and data
access are vital requirements.
NEES researchers also perform “hybrid tests” where multiple geographically distributed
structural engineering experimental facilities (e.g., shake tables) perform structural engineering
experiments simultaneously in conjunction with simulations running on TeraGrid resources.
Some complex pseudo real-life engineering test cases can only be captured by having multiple
simultaneous experiments coupled with complementary simulations, as they are too complex to
perform by either experimental facilities or simulations alone. These “hybrid tests” require close
coupling and data transfer in real time between experimental facilities and TeraGrid compute
and data resources using the fast network. NEES as a whole is dependent on the variety of
HPC resources of TeraGrid, the high bandwidth network and data access and sharing tools.
4.2.7 Biosciences – PI George Karniadakis, Brown University
High-resolution, large-scale simulations of blood flow in the human arterial tree require
solution of flow equations with billions of degrees of freedom. In order to perform such
computationally demanding simulations, tens or even hundreds of thousands of computer
processors must be employed. Use of a network of distributed computers (TeraGrid) presents
an opportunity to carry out these simulations efficiently; however, new computational methods
must be developed.
The Human Arterial Tree project has developed a new scalable approach for simulating large
multiscale computational mechanics problems on a network of distributed computers or grid.
The method has been successfully employed in cross site simulations connecting SDSC,
TACC, PSC, UC/ANL, and NCSA.
The project considers 3D simulation of blood flow in the intracranial arterial tree using NEKTAR
- the spectral/hp element solver developed at Brown University. It employs a multi-layer
hierarchical approach whereby the problem is solved on two layers. On the inner layer,
solutions of large tightly coupled problems are performed simultaneously on different
supercomputers, while on the outer layer, the solution of the loosely coupled problem is
performed across distributed supercomputers, involving considerable inter-machine
communication. The heterogeneous communication topology (i.e., both intra- and inter-machine
communication) was handled initially by MPICH-G2 and later by the recently developed MPIg
libraries. MPIg's multithreaded architecture provides applications with an opportunity to overlap
computation and inter-site communication on multicore systems. Cross-site computations
performed on the TeraGrid's clusters demonstrate the benefits of MPIg over MPICH-G2. The
multi layer communication interface implemented in NEKTAR permits efficient communication
between multiple groups of processors. The developed methodology is suitable for solution of
multi-scale and multi-physics problems both on distributed systems and on modern petaflop
supercomputers.
4.2.8 Neutron Science – PI John Cobb, ORNL
The Neutron Science TeraGrid Gateway (NSTG) project is an exemplar for the use of CI for
simulation and data analysis that are coupled to an experiment. The unique contributions of
NSTG are the connection of national user facility instrument data sources to the integrated CI of
the TeraGrid and the development of a neutron science gateway that allows neutron scientists
to use TeraGrid resources to analyze their data, including comparison of experiment with
simulation. The NSTG is working in close collaboration with the Spallation Neutron Source
(SNS) at Oak Ridge as their principal facility partner. The SNS is a next-generation neutron
source, which has completed construction at a cost of $1.4 billion and is ramping up operations.
The SNS will provide an order of magnitude greater flux than any other neutron scattering
facility in the world and will be available to all of the nation's scientists, independent of funding
source, on a reviewed basis. With this new capability, the neutron science community is facing
orders of magnitude larger data sets and is at a critical point for data analysis and simulation.
They recognize the need for new ways to manage and analyze data to optimize both beam time
and scientific output. The TeraGrid is providing new capabilities in the gateway for simulations
using McStas and for data analysis by the development of a fitting service. Both run on
distributed TeraGrid resources, at ORNL, TACC and NCSA, to improve turnaround. NSTG is
also exploring archiving experimental data on the TeraGrid. As part of the SNS partnership, the
NSTG provides gateway support, cyberinfrastructure outreach, community development, and
user support for the neutron science community, including not only SNS staff and users, but
extending to all five neutron scattering centers in North America and several dozen worldwide.
4.2.9 Chemistry (GridChem) – Project PI John Connolly, University of Kentucky;
      TeraGrid Allocation PI Sudhakar Pamidighantam, NCSA
Computational chemistry is foundational not only to chemistry but also to materials
science and biology. Understanding molecular structure and function is beneficial in
the design of materials for electronics, biotechnology and medical devices and also in the
design of pharmaceuticals. GridChem, an NSF Middleware Initiative (NMI) project, provides a
reliable infrastructure and capabilities beyond the command line for computational chemists.
GridChem, one of the most heavily used TeraGrid science gateways in 2008, requested and is
receiving advanced support resources from the TeraGrid. This advanced support work will
address a number of issues, many of which will benefit all gateways. These issues include
common user environments for domain software, standardized licensing, application
performance characteristics, gateway incorporation of additional data handling tools and data
resources, fault tolerant workflows, scheduling policies for community users, and remote
visualization. This collaboration with TeraGrid staff is ongoing in 2009.
4.2.10 Astrophysics - PI Erik Schnetter, LSU, Christian D. Ott, Caltech, Denis Pollney,
       and Luciano Rezzolla, AEI
Cactus <http://www.cactuscode.org> is an HPC software framework enabling parallel
computation across different architectures and collaborative code development between
different groups. Cactus originated in the academic research community, where it was
developed and used over many years by a large international collaboration of physicists and
computational scientists. Cactus is now mainly developed at LSU with major contributions from
the AEI in Germany, and is predominantly used in computational relativistic astrophysics where
it is employed by several groups in the US and abroad.
An application that is based on Cactus consists of a set of individual modules (“thorns”) that
encapsulate particular physical, computational, or infrastructure algorithms. A special “driver”
thorn provides parallelism, load balancing, memory management, and efficient I/O. One such
driver is Carpet <http://www.carpetcode.org>, which supports both adaptive mesh refinement
(AMR) with subcycling in time and multi-block methods, offering a hybrid parallelisation
combining MPI and OpenMP. An Einstein Toolkit provides a common basic infrastructure for
relativistic astrophysics calculations. Cactus, Carpet, the Einstein Toolkit, and many other thorns
are available as open source, while most cutting-edge physics thorns are developed privately by
individual research groups.
Current significant users of Cactus outside LSU include AEI (Germany), Caltech, GA Tech,
KISTI (South Korea), NASA GSFC, RIT, Southampton (UK), Tübingen (Germany), UI, UMD,
Tokyo (Japan), and WashU. Ongoing development is funded among others via collaborative
grants from NASA (ParCa, with partners LSU, GSFC, and company Decisive Analytics
Corporation) and NSF (XiRel, with partners LSU, GA Tech, RIT).
The LSU-AEI-Caltech numerical relativity collaboration uses Cactus-based applications to study
binary systems of black holes and neutron stars as well as stellar collapse scenarios. Numerical
simulations are the only practical way to study these systems, which require modeling the
Einstein equations, relativistic hydrodynamics, magnetic fields, nuclear microphysics, and
effects of neutrino radiation. This results in a complex, coupled system of non-linear equations
describing effects that span a wide range of length and time scales, which are addressed with
high-order discretization methods, adaptive mesh refinement with up to 12 levels, and multi-
block methods. The resulting applications are highly portable and have been shown to scale up
to 12k cores, with currently up to 2k cores used in production runs.
Production runs are mainly performed on Queen Bee at LONI, Ranger at TACC, and on
Damiana at the AEI, and it is expected that Kraken will soon also be used for production. These
applications prefer to have 2 GByte of memory per core available due to the parallelization
overhead of the higher order methods, but can run with less memory if OpenMP is used, though
combining OpenMP and MPI does not always increase performance. Initial configurations are
typically either calculated at the beginning of the simulation or imported from
one-dimensional data. The simulations may involve a large number of time steps, leading to wall-clock times of
20 days or more for a high-resolution run.
4.2.11 Biosciences (Robetta) – PI David Baker, University of Washington
Protein structure prediction is one of the more important components of bioinformatics. The
Rosetta code, from the David Baker laboratory, has performed very well at CASP (Critical
Assessment of Techniques for Protein Structure Prediction) competitions and is available for
use by any academic scientist via the Robetta server – a TeraGrid science gateway. Robetta
developers were able to use TeraGrid’s gateway infrastructure, including community accounts
and Globus, to allow researchers to run Rosetta on TeraGrid resources through the gateway.
This very successful group did not need any additional TeraGrid assistance to build the Robetta
gateway; it was done completely by using the tools TeraGrid provides to all potential gateway
developers. Google Scholar reports 601 references to the Robetta gateway, including many
PubMed publications. Robetta has made extensive use of a TeraGrid roaming allocation and
will be investigating additional platforms such as Purdue’s Condor pool and the NCSA/LONI
Abe-QueenBee systems.
4.2.12 GIScience – PI Shaowen Wang, University of Illinois
The GIScience gateway, a geographic information systems (GIS) gateway, has over 60 regular
users and is used by undergraduates in coursework at UIUC. GIS is becoming an increasingly
important component of a wide variety of fields. The GIScience team has worked with
researchers in fields as distinct as ecological and environmental research, biomass-based
energy, linguistics (linguist.org), coupled natural and human systems and digital watershed
systems, hydrology and epidemiology. The team has allocations on resources in TeraGrid
ranging from TACC’s Ranger system to NCSA’s shared memory Cobalt system to Purdue’s
Condor pool and Indiana’s BigRed system. Most usage to date has been on the NCSA/LONI
Abe-QueenBee systems. The GIScience gateway may also lead to collaborations with the
Chinese Academy of Sciences through the work of the PI.
4.2.13 Computer Science: Solving Large Sequential Two-person Zero-sum Games of
       Imperfect Information – PI Tuomas Sandholm, Carnegie Mellon University
Professor Sandholm’s work in game theory is internationally recognized. While many games
can be formulated mathematically, the formulations for those that best represent the challenges
of real-life human decision making (in national defense, economics, etc.) are huge. For
example, two-player poker has a game-tree of about 10^18 nodes. In the words of Sandholm's
Ph.D. student Andrew Gilpin, “To solve that requires massive computational resources. Our
research is on scaling up game-theory solution techniques to those large games, and new
algorithmic design.”
The most computationally intensive portion of Sandholm and Gilpin's algorithm is a matrix-
vector product, where the matrix is the payoff matrix and the vector is a strategy for one of the
players. This operation accounts for more than 99% of the computation, and is a bottleneck to
applying game theory to many problems of practical importance. To drastically increase the size
of problems the algorithm can handle, Gilpin and Sandholm devised an approach that exploits
massively parallel systems of non-uniform memory-access architecture, such as Pople, PSC’s
SGI Altix. By making all data addressable from a single process, shared memory simplifies a
central, non-parallelizable operation performed in conjunction with the matrix-vector product.
Sandholm and Gilpin are running experiments to learn how the shared-memory code performs
and to identify areas for further algorithmic improvement.
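The core operation described above, multiplying the payoff matrix by one player's strategy vector, can be illustrated with a toy example. The Python sketch below uses a dense 2x2 matrix for the game of matching pennies; the function name and the dense representation are illustrative assumptions and say nothing about how Sandholm and Gilpin's code stores or traverses a poker-sized payoff matrix.

```python
def expected_payoffs(payoff_matrix, column_strategy):
    """Return the matrix-vector product A @ y.

    Entry i is the row player's expected payoff for pure strategy i when the
    column player plays the mixed strategy y.
    """
    if any(len(row) != len(column_strategy) for row in payoff_matrix):
        raise ValueError("each row must have one entry per column strategy")
    return [sum(a * y for a, y in zip(row, column_strategy))
            for row in payoff_matrix]

# Toy zero-sum game (matching pennies), not a poker abstraction.
A = [[1, -1],
     [-1, 1]]
y = [0.5, 0.5]                     # column player mixes uniformly
payoffs = expected_payoffs(A, y)   # row player's expected payoff per pure strategy
```

Each output entry depends only on one row of the matrix, so the rows can be processed independently; that independence is what makes the product a natural target for massively parallel systems.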
4.2.14 Nanoscale Electronic Structures/nanoHUB – PI Gerhard Klimeck, Purdue
       University
Gerhard Klimeck’s lab is tackling the challenge of designing microprocessors and other devices
at a time when their components are dipping into the nanoscale – a billionth of a meter. The
new generation of nano-electronic devices requires a quantum-mechanical description to
capture properties of devices built on an atomic scale. This is required to study quantum dots
(spaces where electrons are corralled into acting like atoms, creating in effect a tunable atom for
optical applications), resonant tunneling diodes (useful in very high-speed circuitry), and tiny
nanowires. The simulations in this project look two or three generations down-the-line as
components continue to shrink, projecting their physical properties and performance
characteristics under a variety of conditions before they are fabricated. The codes also are used
to model quantum computing.
Klimeck’s team received an NSF Petascale Applications award for its NEMO 3-D and OMEN
software development projects, aimed at efficiently using the petascale systems that are being
made available by the TeraGrid. They have already employed the software in multimillion-atom
simulations matching experimental results for nanoscale semiconductors, and have run a
prototype of the new OMEN code on 32,768 cores of TACC’s Ranger system. They also use
TeraGrid resources at NCSA, PSC, IU, ORNL and Purdue. Their simulations involve millions to
billions of interacting electrons, and thus require highly sophisticated and optimized software to
run on the TeraGrid’s most powerful systems. Different code and machine characteristics may
be best suited to different specific research problems, but it is important for the team to plan and
execute their virtual experiments on all these resources in a coordinated manner, and to easily
transfer data between systems.
This project not only pursues direct research but is also creating modeling and simulation tools
that other researchers, educators, and students can use through nanoHUB, a TeraGrid Science
Gateway designed to make doing research on the TeraGrid easier. The PI likens the situation
to making computation as easy as making phone calls or driving cars, without being a telephone
technician or an auto mechanic. Overall, nanoHUB.org is hosting more than 90 simulation tools,
with more than 6,200 users who ran more than 300,000 simulations in 2008. The hosted codes
range in computational intensity from very lightweight to extremely intensive, such as NEMO 3-
D and OMEN. The nanoHUB.org site has more than 68,000 users in 172 countries, with a
system uptime of more than 99.4 percent. More than 44 classes used the resource for teaching.
According to Klimeck, it has become an infrastructure people rely on for day-to-day operations.
nanoHUB plans on being among the early testers for the metascheduling capabilities currently
being developed by the TeraGrid, since interactivity and reliability are high priorities for
nanoHUB users. The Purdue team is also looking at additional communities that might benefit
from the use of HUB technology and TeraGrid. The Cancer Care Engineering HUB is one such
community.
4.2.15 Atmospheric Sciences (LEAD) – PI Kelvin Droegemeier, University of Oklahoma
In preparation for the spring 2008 Weather Challenge, involving 67 universities, the LEAD team
and TeraGrid began a very intensive and extended “gateway-debug” activity involving Globus
developers, TeraGrid resource provider (RP) system administrators and the TeraGrid GIG
software integration and gateway teams. Extensive testing and evaluation of GRAM, GridFTP,
and RFT were conducted on an early CTSS V4 testbed especially tuned for stability. The
massive debugging efforts laid the foundation for improvements in reliability and scalability of
TeraGrid’s grid middleware for all gateways. A comprehensive analysis of job submission
scenarios simulating multiple gateways will be used to conduct a scalability and reliability
analysis of WS GRAM. The LEAD team also participated in the NOAA Hazardous Weather
Testbed Spring 2008 severe weather forecasts. High-resolution, on-demand, and urgent
computing weather forecasts will enable scientists to study complex weather phenomena in near
real time.




A pilot program is under way with Campus Weather Service (CWS) groups from atmospheric science
departments at universities across the country. Millersville University and University of
Oklahoma CWS users have been predicting local weather in 3 shifts per day with 5km, 4km and
2km forecast resolutions, computing on Big Red and archiving on the IU Data Capacitor.
Development of reusable LEAD tools continues. The team is supporting the OGCE released
components – Application Factory, Registry Services and Workflow Tools. TeraGrid supporters
have generalized, packaged and tested the notification system and personal metadata catalog
to prepare for an OGCE release to be used by the gateway community and will provide workflow
support to integrate with the Apache ODE workflow enactment engine.
4.3      Advanced Capabilities Enabling Science
The transformative science examples in §D2 are enabled by the coordinated efforts of the
TeraGrid (TG) project. The TG advanced capabilities (those delivered to the user community
above simple access to computer cycles) are developed based on existing and expected user
needs. These needs are determined from direct contact with users, surveys, discussions with
potential users, and collaborations with other CI projects. TG uses projects that express interest
in a CI need as test users of that capability. In other cases, such as for advanced user support,
where there is more need for a capability than we can deliver, we use the allocations process to
receive recommendations on where we should apply our efforts. All the projects described in
§D2 are driving or using TeraGrid advanced capabilities, and will continue to do so during the
extension period, in order to obtain the best possible science results. We will support new data
capabilities as a central component of whole new forms of data-intensive research, especially in
combination with advanced visualization and community interfaces such as those supported by
the gateways efforts. We will continue to enhance these advanced capabilities in a collaborative
context, with TG staff bringing their expertise to bear on user needs to improve the experiences
of current users of TeraGrid, and to help develop the new generation of XD users.
4.3.1 Advanced User Support
Advanced User Support (AUS) plays a critical role in enabling science in the TeraGrid,
particularly with regard to less traditional users of CI, such as users whose research focuses
primarily on data analysis or visualization, and users in areas such as the social sciences who
may not have strong backgrounds in computational methods. AUS staff (computational scientists
from all RP sites, with Ph.D.-level expertise in various domain sciences, HPC, CS, visualization,
and workflow tools) will be responsible for the highest level of support for TeraGrid users. The
overall advanced support efforts under the AUS area will consist of three sub-efforts:
Advanced Support for TeraGrid Applications (ASTA), Advanced Support for Projects (ASP), and
Advanced Support for EOT (ASEOT).

“I consider the user support people to be the most valuable aspect of the TeraGrid because the
infrastructure is only as good as the people who run and support it.” – Martin Berzins,
University of Utah

AUS operations will be coordinated by the AUS Area Director jointly with the AUS Points of
Contact (POCs) from the RP sites; together they will handle the management and coordination
issues associated with ASTA, ASP and ASEOT. They have created an environment of
cooperation and collaboration among AUS technical staff from across the RP sites where AUS
staff benefit from each other’s expertise and work jointly on ASTA, ASP and ASEOT projects.
4.3.1.1 Advanced Support for TeraGrid Applications (ASTA)
ASTA projects allow AUS staff to work with a PI for a period of a few months to a year, so that the
project is able to optimally use TeraGrid resources for science research. As has been shown in
the past, ASTA efforts will be vital for many of the ground-breaking simulations performed by
TeraGrid users. ASTA work will include porting applications, transitioning them from outgoing to
incoming TeraGrid resources, implementing algorithmic enhancements, implementing parallel
programming methods, incorporating mathematical libraries, improving the scalability of codes
to higher core counts, optimizing codes to utilize specific resources, enhancing scientific
workflows, and tackling visualization and data analysis tasks. To receive ASTA support,
TeraGrid users submit a request as part of their allocation request. Next, allocations reviewers
provide a recommendation score, AUS staff work with the user to define an ASTA work plan,
and finally, AUS staff provide ASTA support to the user. The AUS effort optimally matches
TeraGrid-wide AUS staff to ASTA projects, taking into account the reviewers’ recommendation,
AUS staff expertise in relevant domain science/HPC/CI, the ASTA project work plan, and the
site where the user has an allocation. ASTA projects provide long-term benefits to the user team,
other TG users, and the TeraGrid project. ASTA project results provide insights and exemplars
for the general TG user community; they are included in documentation, training and outreach
activities. ASTA efforts also allow us to bring in new user communities, such as those from the
social sciences and humanities, and enable them to use TG resources. Finally, ASTA insights help
us understand the need for new TG capabilities.
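As a rough illustration of the matching step described above, the sketch below combines the listed factors (reviewers' recommendation, staff expertise, work plan needs, and allocation site) in a simple greedy assignment. The proposal does not say how AUS actually performs the match, so the class names, the skill tags, the 0.25 site bonus, and the one-project-per-person rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Staff:
    name: str
    site: str
    expertise: set  # hypothetical skill tags, e.g. {"mpi", "visualization"}


@dataclass
class Project:
    title: str
    review_score: float  # reviewers' recommendation; a 0..1 scale is assumed
    needs: set           # skills called for in the ASTA work plan
    allocation_site: str


def fit(project, staff, site_bonus=0.25):
    """Fraction of needed skills covered, plus an assumed bonus for a shared site."""
    covered = len(project.needs & staff.expertise) / max(len(project.needs), 1)
    return covered + (site_bonus if staff.site == project.allocation_site else 0.0)


def assign(projects, staff_pool):
    """Greedy matching: highest-rated projects choose first, one project per person."""
    available = list(staff_pool)
    assignment = {}
    for project in sorted(projects, key=lambda p: p.review_score, reverse=True):
        if not available:
            break
        chosen = max(available, key=lambda s: fit(project, s))
        available.remove(chosen)
        assignment[project.title] = chosen.name
    return assignment
```

A real process would also weigh staff availability and allow one person to support several projects; the greedy rule is only the simplest way to show the listed factors interacting.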
4.3.1.2 Advanced Support for Projects (ASP)
The complex, ever-changing, and leading-edge nature of the TeraGrid infrastructure
necessitates identifying and undertaking advanced projects that will have significant impact on
large groups of TeraGrid users. ASPs are identified based on the broad impact they will have on
the user community, by processing input from users, experienced AUS and frontline support
staff and other TeraGrid experts. AUS staff expertise in various domain sciences and
experience in HPC/CI, along with deep understanding of users’ needs, play an important role in
identifying such projects. ASP work includes (1) porting, optimizing, benchmarking and
documenting widely-used domain science applications from outgoing to incoming TeraGrid
machines; (2) addressing the issues in scaling these applications to tens of thousands of cores;
(3) investigating and documenting optimal use of the data-centric, high-throughput, Grid
research, and experimental Track-2D systems; (4) demonstrating feasibility and performance of
new programming models (PGAS, hybrid MPI/OpenMP, MPI one-sided communication etc.); (5)
providing technical documentation on effective use of profiling and tracing tools on TeraGrid
machines for single processor and parallel performance optimization; and (6) providing usage-based
visualization, workflow, and data analysis/transfer use cases.
4.3.1.3 Advanced Support for EOT
In this area, AUS staff provide their expertise in support of education, outreach and training.
AUS staff will contribute to advanced HPC/CI training (both synchronous and asynchronous)
and teach such topics. AUS staff will provide outreach to the TeraGrid user community about
the transition to new resources in 2010/2011, and on the process for requesting support through
ASTA and ASP. In this regard, AUS staff will reach out to the NSF program directors that fund
computational science and CI research projects. AUS staff will be involved in planning and
organizing TG10, SC2010 and other workshops and attending and presenting at these
workshops, BOFs, and panels. We will provide outreach to other NSF CI programs (e.g.,
DataNet, iPlant, etc.) and enable them to use TeraGrid resources. We will pay special attention
to broadening participation of underrepresented user groups and provide advanced support as
appropriate and under the guidance of the allocation process.
4.3.2 Advanced Scheduling and Metascheduling
TeraGrid systems have traditionally been scheduled independently, with each system’s local
scheduler optimized to meet the needs of local users. Feedback from TeraGrid users, user
surveys, review panels, and the science advisory board has indicated emerging user needs for
coordinated scheduling capabilities. In PY2, our scheduling and metascheduling requirements
analysis teams (RATs) identified advance reservation, co-scheduling (aka co-allocation),
automatic resource selection (aka metascheduling), and on-demand (aka urgent) computing as
the most needed capabilities. We formed a scheduling working group (WG) in late PY2/early
PY3. In PY3 and PY4, the WG defined several TeraGrid-wide capability definitions and
implementation plans that are now being used to finalize production support in the remainder of
PY4. Maintenance of these capabilities is described in §D.5.3.
We are currently moving three of these capabilities into production: on-demand/urgent
computing, advance reservation, and co-scheduling. Automatic resource selection is available,
but only for the two IA64 systems at SDSC and NCSA, with another four systems to be added in
PY4 and PY5. Although it is not yet clear what the level of demand for these services will be, we
have ample evidence that they will be used by some TeraGrid users (such as was described in
examples in §D2) for innovative, high-impact scientific explorations. The first two years of use
will reveal unanticipated requirements and limitations of the technology.
We propose to allow user needs over the next two years of operation to drive the work in this
area and to allocate a modest budget to meeting these needs. It seems likely at this time that at
least two priorities will be evident in PY6: the need to extend our advanced scheduling
capabilities to new resources as they are added, and the need to establish standard
mechanisms with peer systems (e.g., OSG, UK National Grid Service, LHC Computing Grid)
that allow users to integrate their scientific activities on these systems. The existing IA32-64
architectures (Abe, Lonestar, QueenBee, Steele) that would not continue to be available to TG
users without this TeraGrid extension will be used to production-test these services
under load (§D.5.9.1).
4.3.3 Advanced Data Services
Data requirements of the scientific community have been increasing at a rapid rate, both in size
and complexity. With the HPC systems increasing in both capacity and capability, and the
generation and use of experimental and sensor data also increasing, this trend is unlikely to
change. This means that we must continue TeraGrid’s efforts to provide reliable data transfer,
management and archival capabilities. The data team has studied the data movement and
management patterns of TeraGrid’s current user community, and developed a data architecture
plan that is being implemented by the RPs and the Data working group. Further effort to
implement the data architecture and its component pieces will help users with their current
concerns and provide an approach that will persist into XD. High-performance data transfers,
more sophisticated metadata and data management capabilities, global file systems for data
access, and archival policies are all essential parts of the plan, and we will work to integrate
them into production systems and operations. A consistent, high-level approach to data
movement and management in the TeraGrid is necessary to respond to ongoing feedback from
TG users and to support their needs.
4.3.3.1 Global Wide Area File Systems
Global Wide Area File Systems always rank at or near the top of user requirements
within the TeraGrid; significant strides have been made in their implementation, but several
more are needed before they can become ubiquitous.
We have committed to a project-wide implementation of Lustre-WAN as a wide area file system,
available in PY5 at a minimum of three RP sites. The IU Data Capacitor WAN file system (984
TB capacity) is mounted on two resources now and we are continuing efforts to expand the
number of production resources with direct access to this file system. Future development plans
focus on increasing security and performance through the provision of distributed storage
physically located near HPC resources. New effort in the TG Extension provides $100k for
additional hardware at PSC, NCSA, IU, NICS, and TACC to expand this distributed storage
resource, and also provides 0.50 FTE at PSC, NCSA, IU, NICS, TACC, and SDSC for support in
deployment. This will deploy additional Lustre-WAN disk resources as part of a wide area file
system to be available on all possible resources continued in the TG Extension. We will also
deploy wide-area file systems on Track-2D and XD/Remote Visualization awardee resources as
appropriate.
This effort will also support SDSC’s GPFS-WAN (700 TB capacity), which will continue to be
available and will support data collections. It may be used within the archival replication project
as a wide-area file system or high-speed data cache for transfers. If appropriate, hardware
resources could be redirected to participate in a TG-wide Lustre-WAN solution.
pNFS is an extension to the NFS standard that allows for wide area parallel file system support
using an interoperable standard. If pNFS clients are provided by system vendors, pNFS could
obviate issues with licensing and compatibility that currently present an obstacle to global
deployment of wide-area file systems. More development and integration with vendors is
necessary before pNFS can be seen as a viable technology for production resources within the
TeraGrid, but these developments are highly likely to occur within the timeframe of this extension.
We will also continue investigating other alternatives (e.g. ReddNet, PetaShare).
4.3.3.2 Archive Replication
Archival replication services are an area of recognized need, and a separate effort will be
undertaken to provide software to support replication of data across multiple TG sites. Ongoing
effort will be required, however, to support users and applications accessing the archives and
replication services. In addition, management of data and metadata in large data collections,
across both online and archive resources, is an area of growing need. The data architecture
team will work with the archival replication team to ensure smooth interaction between existing
data architecture components and the archival replication service, and to study and document
archival practices, patterns and statistics regarding usage by the TG user community.
4.3.3.3 Data Movement Performance
The data movement performance team has been instrumental in mapping and instrumenting the
use of data movement tools across TG resources and from them to external locations. This
team is implementing scheduled data movement tools including interfaces to the TG User
Portal. After these tools are in place by the end of PY5, we will incorporate performance
and reliability enhancements in data movement technologies into the QA effort.
4.3.4 Visualization and Data Analysis
Visualization and data analysis services funded by the TeraGrid in PY6 will be focused
exclusively on visualization consulting and user support through the deployment and
development of tools required by the user community. Both visualization and data analysis at
the petascale continue to present significant challenges to the user community and require
collaboration with visualization and data analysis experts. Additionally, the need for the
deployment of more sophisticated data analysis capabilities is becoming more apparent as
shown by user requests. Data analysis often benefits from large shared memory, such as will
be available on Pople under this extension. Visualization and data analysis have traditionally
relied heavily on the value-added resources and services at RP sites, and these services
continue to be a critical need identified by the user community. Building upon the work at the RP
sites and the anticipated introduction of two new data and visualization analysis resources, PY6
efforts will continue to focus on an integrated, documented visualization services portfolio
created with two goals: 1) to provide the user community with clarity in terms of where to turn for
visualization needs; and 2) to effectively define a set of best practices with respect to providing
such services, enabling individual campuses to harvest the experience of the TeraGrid RP sites.
Additionally, visualization consulting is a growing need for the TG user community, particularly
at the high end. We will leverage existing capabilities at the RP sites in addition to the new XD
remote visualization resource sites to provide consistent, knowledgeable visualization consulting
to the TG user community.
4.3.4.1 Visualization Gateway
TG Vis gateway development will expand the capabilities and provide the ability for additional
services and resources to be included. With community access to the TG Vis Gateway via
community allocations and dynamic accounts now available, we will emphasize educating the
user community on the use of this capability, and providing a uniform interface for visualization
and data analysis capabilities. Providing centralized information about and access to such
capabilities will benefit users. We will also build upon the work at the RP sites to expose these
resources and services through the TG Vis Gateway.
4.3.5 Science Gateways
Gateways provide community-designed interfaces to TeraGrid resources, extending the
command line experience to include access to datasets, community collaboration tools,
visualization capabilities, and more. TeraGrid provides resources and support for 35 such
gateways with additional gateways anticipated. Section D.2 above illustrates the transformative
impact that TeraGrid gateways have already made on computational science across multiple
domains: of the 15 examples described there, nine science programs (SCEC, SIDGrid, NEES,
NSTG, GridChem, Robetta, GIScience, nanoHUB and LEAD) operate and develop gateways.
They allow researchers, educators and students in Geosciences, Social Sciences, Astronomy,
Structural Engineering, Neutron Science, Chemistry, Biosciences and Nanotechnology to
benefit from leading-edge resources without having to master the complexities of programming,
adapting, testing and running leading-edge applications.
The Science Gateways program works to identify common needs across projects and works with
the other TG Areas to prioritize meeting these needs. Goals for PY6 include a smoothly
functioning, flexible and effective gateway targeted support program, streamlined access to
community accounts and production use of attribute-based authentication.
4.3.5.1 Gateway Targeted Support Activities
The gateway targeted support program, perhaps the most successful part of the gateway program,
provides assistance to developers wishing to integrate TeraGrid resources into their gateways.
Targeted support is available to any researcher, and requests are submitted through the TeraGrid
allocation process. As diverse requests are received, a flexible team of staff members is
ready to support approved requests. Requests can come from any discipline and can vary
widely between gateways. One gateway may be interested in adding fault tolerance to a complex,
existing workflow. Another may not have used any grid computing software previously and
need help getting started. A third may be interested in using sophisticated metascheduling
techniques. Outreach will be conducted to make sure that underrepresented communities are
aware of the targeted support program.

Figure 2: Southern California hazard map, showing probability of ground motion exceeding 0.1g
in next 50 years.

PY6 targeted support projects will be chosen through the TeraGrid’s planning process which
starts with an articulation of objectives to reflect both the progress achieved in PY5 and the
need for a smooth transition to XD.
To illustrate the type of projects included in targeted support and to describe the work upon
which PY6 activities will build, we describe here some of the targeted support projects planned
for PY5:
     • Assist the GridChem gateway in the areas of common chemistry software access across
       RP sites, data management, improved workflows, visualization and scheduling.
     • Assist the PolarGrid team with TeraGrid integration. May include realtime processing of
       sensor data, support for parallel simulations, GIS integration and EOT components.
     • Prototype creation of an OSG cloud on TeraGrid resources via NIMBUS. Work with OSG
       science communities to resolve issues.
     • Augment SIDGrid with scheduling enhancements, improved security models for
       community accounts, data sharing capabilities and workflow upgrades. Lessons learned
       will be documented for other gateways and projects.
     • Develop and enhance the simple gateway framework SimpleGrid. Within this effort, we
       plan to augment online training service for building new science gateways, develop
       prototyping service to support virtualized access to TeraGrid, develop a streamlined
       packaging service for new gateway deployment, develop a user-level TeraGrid usage
       service within SimpleGrid based on the community account model and attributes-based
       security services, work with potential new communities to improve the usability and
       documentation of the proposed gateway support services, and conduct education and
       outreach work using the SimpleGrid online training service.
     • Adapt the Earth System Sciences to use the TeraGrid via a semantically enabled
       environment that includes modeling, simulated and observed data holdings, and
       visualization and analysis for climate as well as related domains. Build upon synergistic
       community efforts including the Earth System Grid (ESG), Earth System Curator (ESC),
       Earth System Modeling Framework (ESMF), the Community Climate System Model
       (CCSM) Climate Portal (developed at Purdue University), and NCAR’s Science Gateway
       Framework (SGF) development effort. Extend the Earth System Grid-Curator (ESGC)
       Science Gateway so that Community Climate System Model runs can be initiated on
       TeraGrid.
     • Extend Computational Infrastructure for Geodynamics (CIG) gateway to support running
       parameter sweeps through regions of the input parameter space on TeraGrid. For
       example, the SPECFEM3D code computes a simulation of surface ground motion at
       real-world seismological recording stations according to a whole-earth model of
       seismological wave propagation. Multiple parameter sweep runs produce 'synthetic
       seismograms' that are compared with measured ground motions.
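
The parameter sweeps mentioned for the CIG gateway can be pictured as enumerating every combination of values over a chosen region of the input parameter space, with one simulation run per combination. The sketch below shows only that enumeration step; the function name and the parameter names in the usage note are hypothetical and are not actual SPECFEM3D inputs or part of any CIG gateway interface.

```python
from itertools import product


def parameter_sweep(grid):
    """Yield one dict of parameter values per point in the Cartesian grid.

    `grid` maps a parameter name to the list of values to try for it.
    """
    names = sorted(grid)  # fixed ordering so runs are reproducible
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))
```

For instance, `list(parameter_sweep({"depth_km": [10, 20], "magnitude": [6.0, 6.5, 7.0]}))` yields six run configurations, each of which a gateway could submit as a separate job and whose synthetic seismograms could then be compared against measured ground motions.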

We do not know yet which PY6 projects we will select, but some groups who have expressed
interest in the gateway program with whom we have not yet worked extensively include:
     • Center for Genomic Sciences (CGS), Allegheny-Singer Research Institute, Allegheny
       General Hospital is interested in using the TeraGrid for genome sequencing via a
       pyrosequencing platform from Roche. Computing would run on the TeraGrid rather than
       on local clusters that are required by the Roche platform now and seen as a barrier to
       entry for some users.
     • The Center for Analytical Ultracentrifugation of Macromolecular Assemblies, University
       of Texas Health Science Center at San Antonio runs a centrifuge and maintains analysis
       software. They would like to port the analysis software to the TeraGrid and incorporate
       access into a gateway.
     • The director of Bioinformatics Software at the J. Craig Venter Institute is interested in
       developing a portal to National Institute of Allergy and Infectious Diseases (NIAID)
       Bioinformatics Resource Centers.
     • San Diego State University (SDSU) is interested in developing a TeraGrid Gateway for a
       NASA proposal entitled “Spatial Decision Support System for Wildfire Emergency
       Response and Evacuation”. The gateway would automate the data collection, data input
       formatting, GIS model processing, and rendering of model results on 2D maps and 3D
       globes and run the FARSITE (Fire Area Simulator) code developed by the US Forest
       Service.
     • CIPRES (Cyberinfrastructure for Phylogenetic Research) is interested in incorporating
       TeraGrid resources into their portal in order to serve an increasing number of
       researchers.
     • The minimally-funded gateway component of the TeraGrid Pathways program could be
       expanded via a targeted support project.
     • SDSU is also interested in developing a TeraGrid gateway to provide a Web-enhanced
       Geospatial Technology (WGT) Education program through the geospatial
       cyberinfrastructure and virtual globes. High school students at 5 schools would be
       involved in gateway development.

4.3.5.2 Gateway Support Services
In addition to supporting individual gateway projects, TeraGrid staff provide and develop general
services that benefit all projects, including gateways. These activities include helpdesk support
(answering user questions, routing user requests to appropriate gateway contacts, and tracking
user responses), documentation, providing relevant input for the TeraGrid Knowledgebase,
SimpleGrid for basic gateway development and teaching, gateway hosting services, a gateway
software registry, and security tools (including the Community Shell, credential management
strategies, and attribute-based authentication).
While community accounts increase access, they also obscure the number of researchers using
the account and therefore using the TeraGrid. In order to capture this information automatically,
in PY5 we are implementing attribute-based authentication, through the use of GridShib SAML
tools. This allows gateways to send additional attributes via the credentials used to submit a job.
These attributes are stored in the TeraGrid central database, allowing TeraGrid to query the
database for the number of end users of each gateway using community accounts. Additional
capabilities include the ability to blacklist individual gateway users or IPs so the gateway can
continue to operate in the event of a security breach. TeraGrid can also provide per-user
accounting information for gateways. GridShib SAML tools and GridShib for Globus Toolkit have
been released for the CTSS science gateway capability kit. The release includes extensive
documentation for gateway developers and resource providers.
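To make the accounting idea above concrete, the sketch below records the end-user attribute that a gateway attaches to each job submitted under a community account, then queries for the number of distinct end users per gateway and refuses jobs from blocked users. It is not the GridShib SAML tooling or the TeraGrid central database: the SQLite schema, table names, and function names are assumptions chosen only to show the kind of query that per-user attributes make possible.

```python
import sqlite3


def open_accounting_db():
    """Create an in-memory database with a hypothetical accounting schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE gateway_jobs ("
        " job_id INTEGER PRIMARY KEY,"
        " gateway TEXT NOT NULL,"
        " community_account TEXT NOT NULL,"
        " end_user TEXT NOT NULL)"  # attribute sent along with the credential
    )
    db.execute(
        "CREATE TABLE blocked_users ("
        " gateway TEXT NOT NULL,"
        " end_user TEXT NOT NULL,"
        " PRIMARY KEY (gateway, end_user))"
    )
    return db


def block_user(db, gateway, end_user):
    """Deny one gateway user without disabling the whole community account."""
    db.execute("INSERT OR IGNORE INTO blocked_users VALUES (?, ?)", (gateway, end_user))


def record_job(db, gateway, community_account, end_user):
    """Store a job's attributes; return False if the end user is blocked."""
    blocked = db.execute(
        "SELECT 1 FROM blocked_users WHERE gateway = ? AND end_user = ?",
        (gateway, end_user),
    ).fetchone()
    if blocked:
        return False
    db.execute(
        "INSERT INTO gateway_jobs (gateway, community_account, end_user) VALUES (?, ?, ?)",
        (gateway, community_account, end_user),
    )
    return True


def end_users_per_gateway(db):
    """Count distinct end users behind each gateway's community account."""
    rows = db.execute(
        "SELECT gateway, COUNT(DISTINCT end_user) FROM gateway_jobs GROUP BY gateway"
    )
    return dict(rows.fetchall())
```

Without the `end_user` attribute, every job would appear to come from the single community account, which is exactly the undercounting problem the paragraph describes.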
In PY6, we will standardize the implementation and documentation of community accounts
across the RPs. Maintaining and updating these standards will make the integration of new
systems into TeraGrid straightforward, which directly supports gateways in their use of
community accounts.




4.4      Supporting the User Community
The TeraGrid user community as a whole, including the users contributing to the transformative
science examples in §D2, depends on and benefits tremendously from the range of support
services that TeraGrid provides. In the 2008 TeraGrid user survey, “the helpfulness of TeraGrid
user support staff” (84%) and “the promptness of user support ticket resolution” (82%) received
the highest satisfaction ratings of all TeraGrid resources. Transformative science on the
TeraGrid is possible due to the resources and services working together in concert, which
requires a coordinated user support system comprised of centralized mechanisms for user
access, the TeraGrid-wide allocations management process, a comprehensive user information
infrastructure, the production of user information content, a frontline user support system and
user training efforts. Each of these functions supports TeraGrid’s focus on delivering deep,
wide, and open cyberinfrastructure to address the diverse user needs and requirements.
We propose to continue and further improve this user support system, which requires
interlocking activities from several project areas: user information and access environment
(§D.4.1); user authentication and allocations (§D.4.2); frontline user support (§D.4.3) and
training (§D.4.4), as well as the advanced support capabilities proposed in section §D.3 above.
All of these activities will continue to be coordinated by the User Interaction Council (UIC) for
day-to-day collaboration among the Area Directors, with the GIG Director of Science
participating in the UIC. This interplay of strategic and operational perspectives will be essential
in ensuring continued user success as the resource mix changes and the TeraGrid program
transitions to the XD awardee(s). The User Facing Projects and Core Services (UFP) area
oversees the activities described in sections §D.4.1 and §D.4.2 below; the User Services area
coordinates the activities of Frontline User Support described in section §D.4.3; multiple areas
(User Services, Advanced User Support, and Education, Outreach and Training) are
working in unison to coordinate, develop and deliver HPC training on topics requested by users.
4.4.1 User Information and Access Environment
User access to the resources at RP sites is supported through a coordinated environment of
user information and remote access capabilities. This objective ensures that users are provided
with current, accurate information from across the TeraGrid in a dynamic environment of
resources, software and services.
Building on a common Internet backend infrastructure, the UFP team maintains and updates the
TeraGrid web site, the TeraGrid User Portal (TGUP), and the KnowledgeBase. In 2008, these
sites delivered more than half a million web, portal, and KnowledgeBase hits each month.
The TGUP is the central user environment that allows users to access and use resources
across RP sites. The TGUP provides single-sign-on capability for RP resources, a multi-site file
manager, remote visualization, queue prediction services, and training events and resources. In
2009, the portal plans to expand its interactive services by deploying job submission and
metascheduling capabilities. Furthermore, the TGUP plans to expand its customization features
to give users a personalized TeraGrid experience that caters to their requirements and scientific
goals. This includes presenting information in domain views, listing domain-related software,
enabling user forums, and allowing users to share information with other TeraGrid users in their
field of science.
To help individual users as well as the providers of community gateways, UFP develops and
operates a suite of resource and software catalogs, system monitors, and TeraGrid’s central
user news service. Such services are essential to providing up-to-the-minute information about
a dynamic resource environment. UFP services, such as the Resource Description Repository,
leverage TeraGrid’s investment in Information Services wherever possible to minimize the RP
effort needed to integrate resources.
The team also produces and delivers TeraGrid-wide documentation from a central
content management system and maintains the Knowledgebase, which answers frequently
asked questions. The UFP team also maintains processes to provide quality assurance to the
user information we deliver. This includes managing web pages, posting new and updated
documentation, working with the External Relations group and all relevant subject-matter
experts, and continually developing and updating Knowledgebase articles.
During the TeraGrid Extension period, we will continue to update and enhance the current set of
user access and information offerings, prioritizing based on user requests and the evolving
TeraGrid resource environment.
4.4.2 User Authentication and Allocations
The UFP team operates and manages the procedures and processes—adapting and updating
its services to evolving TeraGrid policies as necessary—for bringing users into the TeraGrid
environment and establishing their identity; and making allocations and authorization decisions
for use of TeraGrid resources. During this extension to TeraGrid, these procedures and
processes will need to support users through the changes to the TeraGrid resource portfolio
resulting from the transition to the XD awardee(s).
By providing a common access and authentication point, UFP supports TeraGrid's common
user environment, simplifies multi-site access and usage, and hides the complexity of working
with multiple RP sites through such capabilities as single sign-on and Shibboleth integration.
The integration of the community-adopted Shibboleth will allow TeraGrid to scale to greater
numbers of users with the same staffing levels and permit users to authenticate once to access
both TeraGrid and local campus resources. The central authorization and allocation
mechanisms supported by UFP make cross-site activities possible with minimal effort, make it
easier for PIs to share allocations with students and colleagues, eliminate duplication of effort
among RPs, and reduce RP costs.
Through the TGUP, a user will create his or her TeraGrid identity and authenticate using either
TeraGrid- or campus-provided credentials. Once a TeraGrid identity is established, any eligible
user can then request allocations (as the PI on a TeraGrid project) or be authorized to use
resources as part of an existing project. In 2008, current UFP processes added 1,862 new users
to the TeraGrid community (148 more than in 2007), and the TGUP, web site, and
Knowledgebase recorded thousands of unique visitors each month.
The TeraGrid allocations processes are a crucial operational function within UFP. In particular,
the UFP area implements the TeraGrid policies for accepting, reviewing, and deciding requests
for Startup, Education, Research and ASTA allocations. These procedures include managing
the quarterly meetings of the TeraGrid Resource Allocations Committee (TRAC), and
coordinating an impartial, multidisciplinary panel of nearly 40 computational experts. To ensure
appropriate and efficient use of resources, the TRAC reviews Research and ASTA requests and
recommends allocation amounts for PIs who wish to use significant fractions of the available
resources. In 2008, this review process covered more than 300 requests for hundreds of
millions of HPC core-hours and about fifty ASTA projects. In addition to the quarterly TRAC
process, the UFP team is responsible for the ongoing processing of Startup and Education
requests, Research project supplements and TRAC appeals, as well as extensions, transfers,
and advances. More than 750 Startup and Education requests were submitted and processed in
2008.
During the TeraGrid Extension period, we will continue to develop improvements to the
authentication and allocations interfaces and processes. These will encompass enhancements
to the submission interface of POPS (the System for TeraGrid Allocation Requests) based on
user feedback and policy changes, further integration of POPS and TGUP, and reducing the
time it takes from a user’s first encounter with TeraGrid to his or her first access to resources.
4.4.3 Frontline User Support
We propose to continue and further improve the frontline user support structure that has made
the TeraGrid a successful enabler of breakthrough science. This will comprise the TeraGrid
Operations Center (TOC) at NCSA and the user services working group, which assembles user
consulting staff from all the RP sites under the leadership of the Area Director for User Support.
Users will submit problem reports to the TOC via email to help@teragrid.org, by web form from
the TeraGrid User Portal, or via phone (866.907.2383). Working 24/7, the TOC will create a
trouble ticket for each problem reported, and track its resolution until it is closed. The user will
be automatically informed that a ticket has been opened and advised of the next steps.
If a ticket cannot be resolved within one hour at the TOC itself, it is assigned to a member of the
user services working group, who begins by discussing the matter with the user. The consultant
may request the assistance of other members of the working group, advanced support staff,
systems, or vendor experts. The immediate goal is to ensure that the user can resume his or
her scientific work as soon as possible, even if addressing the root cause requires a longer-term
effort. When a proposed root-cause solution becomes available, we contact the affected users
again and request their participation in its testing. Strategies identified along the way that
will benefit other users are incorporated into the documentation, Knowledgebase, and training
materials.
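The escalation rule described above can be sketched in a few lines. This is a minimal illustration only: the Ticket record, the needs_escalation and escalate helpers, and the consultant label are hypothetical names introduced here, and they do not describe the actual TeraGrid Ticket System. Only the one-hour window is taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# One-hour window taken from the text; every name below is hypothetical.
TOC_RESOLUTION_WINDOW = timedelta(hours=1)


@dataclass
class Ticket:
    opened_at: datetime
    closed_at: Optional[datetime] = None
    assignee: str = "TOC"


def needs_escalation(ticket: Ticket, now: datetime) -> bool:
    """True if the ticket is still open at the TOC past the one-hour window."""
    still_open = ticket.closed_at is None
    at_toc = ticket.assignee == "TOC"
    return still_open and at_toc and (now - ticket.opened_at) > TOC_RESOLUTION_WINDOW


def escalate(ticket: Ticket, consultant: str) -> Ticket:
    """Assign the ticket to a member of the user services working group."""
    ticket.assignee = consultant
    return ticket
```

In this sketch a ticket is escalated at most once: after assignment to a consultant it no longer matches the TOC condition, which mirrors the hand-off from the TOC to the working group.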
TeraGrid frontline support will also continue to take a personal, proactive approach to
preventing issues from arising in the first place, and to improve the promptness and quality of
ticket resolution. This will be done by continuing the User Champions program, in which RP
consultants are assigned to each TRAC award by discussion in the user services working
group, taking into account the distribution of an allocation across RP sites and machines, and
the affinity between the group and the consultants based on expertise, previous history, and
institutional proximity. The assigned consultant contacts the user group as their champion within
the TeraGrid, and seeks to learn about their plans and issues.
We will continue to leverage the EOT area's Campus Champions program to fulfill this same
contact role with respect to users on their campuses, especially for Startup and Education
grants. Campus Champions are enrolled as members of the user services working group, and
thus are being trained to become "on site consultants" extending the reach of TeraGrid support.
We propose user engagement and sharing and maintaining best practices as the ongoing focus
of user support coordination. This will allow us to effectively assist the user community in the
transition to a new TeraGrid resource mix and organizational structure through the XD program.
4.4.3.1 User Engagement
The user support team will provide the TeraGrid with ongoing feedback by means of surveys as
well as day-to-day personal interaction. The 2011 TeraGrid user survey will be designed and
administered by a professional evaluator selected by the GIG. Topics to be included in the
survey, the population to be surveyed, and the analysis of the results will be iterated between
the evaluator and the TeraGrid ADs, working groups, and Forum, with feedback from the SAB,
with the US area director functioning as the process driver. The final report on the 2011 user
survey will be complete by March 15, 2011.
Personal interaction between users and the TeraGrid consultants will continue to be essential in
providing us with feedback on a day-to-day basis. This process will be coordinated in the user
services working group, via the User Champions and Campus Champions programs. The
nature of the problems encountered will inform the selection of Advanced Support for Projects
activities (§D.3.1.2). The Campus Champions programs will be employed to enlist appropriate
users as testers for proposed new TeraGrid resources and CTSS capabilities that specifically
address these users' priority needs and interests. In particular, we will support the Software
Integration (§D.5.1), Quality Assurance (§D.5.6) and Common User Environment (§D.5.7)
teams' work.
We will work with the UFP area to realize the potential of social networking mechanisms for user
engagement. Our experiences will populate the TeraGrid repository documenting user
suggestions obtained by various methods, and how they are followed up.
4.4.3.2 Share and Maintain Best Practices for Ticket Resolution
In the user services working group, the US area director and coordinators will continue to focus
on helping the consultants at all the RPs to ensure that the time to suggesting a solution to the
user is minimized, and that progress in resolving a ticket is communicated to the user at least
once a week. The discussion of pending tickets and lessons learned from recently closed ones
will continue to be a standing item at every meeting of the working group. The working
document outlining Ticket Resolution Guidelines will continue to be refined based on the real life
operational experiences encountered, with the ever more complex user workload and TeraGrid
resource menu. The guidelines provide for lessons learned from each problem to be fed into the
TeraGrid's documentation, training, and user feedback processing systems. They show how to
recognize user problems that may require advanced support and how to help the user apply for
advanced support.
4.4.4 Training
The training goal is to prepare users to make effective use of the TeraGrid resources and
services to advance scientific discovery. The training objectives include:

        • Regular assessment of users' needs and requirements
        • Development of HPC training materials that allow the research community to make
          effective use of current and emerging TeraGrid resources and services
        • Delivery of HPC training content through live, synchronous and asynchronous
          mechanisms to reach current and potential users of TeraGrid across the country
        • Providing high quality reviewed HPC learning materials
        • Leveraging the work of others to avoid duplication of effort
The EOT team in collaboration with AUS and User Services conducts an annual HPC training
survey, separate from the annual TeraGrid User Survey, to assess community needs for training
in more depth. There will also be surveys of participants during each training session. Survey
results are used to identify areas for improvement and to identify topics for new content
development. The training that is offered in response to the identified needs will focus on
expanding the learning resources and opportunities for current and potential members of the
TeraGrid user community by providing a broad range of live, synchronous and asynchronous
training opportunities. The topics will span the range from introductory to advanced HPC topics,
with an emphasis on petascale computing.
While continuing to deliver the existing training content that users request, the
TeraGrid will continue to develop and deliver new HPC training content to address community
needs for making effective use of TeraGrid resources and services. The development efforts will
involve multiple working groups as appropriate including the User Services, AUS, Science
Gateways, and DVI teams. The training teams will build on the lessons learned and successes
from past efforts to make more training available through synchronous delivery mechanisms to
reach more users across the country. There will be an increased level of effort in PY6 directed
towards accelerating the pace of making more quality training content available via
asynchronous tutorials. The team will augment its effort from PY5 with 0.2 FTE, 2 graduate
students, and 3 undergraduate students to develop the on-line tutorials in collaboration with
the AUS ASEOT staff.
The training materials will be reviewed before they are made available to the community
through HPC University for broad dissemination. The quality review team will be augmented in
PY6 by 0.25 FTE and 1 graduate student to review submissions to HPC University and ensure
that all of the materials made available are of the highest quality.
The training team will work with external organizations to identify existing training materials to
add to the HPC University portal, and to avoid duplication of effort in developing new content.
HPC University will expand to include reference materials such as books and journals,
computational science competencies, and a complete calendar of events.
The EOT team will document the challenges, effective strategies, and lessons learned from the
efforts to date to share with the XD awardee(s).
4.5      Integrated Operations of TeraGrid
The integrated operations of TeraGrid encompass a range of activities spanning software
integration and support, operational responsibilities across the project and at the Resource
Provider sites, and efforts to maintain quality, usability and security of the distributed
environment for the user community. These are activities found at any computational facility, but
in TeraGrid are distributed and coordinated across the breadth of the project in order to provide
users with a coherent view of a collection of resources beyond what any single facility could
offer. Specific activities include the 7x24 TeraGrid Operations Center (TOC), networking
interconnect between the RP sites, providing phone and email user support and issue tracking,
resource and service monitoring, user management and authentication, production security and
incident response, monitoring and instrumentation, and the integration and maintenance of a
common software state and consistent computing environment. In PY6, TeraGrid will continue
to maintain these activities and make advancements as described in the subsequent section.
[[a sentence on the SIIS activities]] [[provisioning of the Common TeraGrid Software Services
(CTSS), ]]
Our focus on the direct operational aspects of TeraGrid is important for the science community.
[[words on SIIS]] The services support common user environments and software, making cross-
or multi-site usage easier. The network provides capacity well beyond what users would have
available at their home institutions and new security services are rapidly bringing us to the point
where users will be able to simply use certificates across administrative domains using
gateways, portals or grid applications.
4.5.1 Packaging and maintaining CTSS Kits
Software components are a critical element of TeraGrid’s common user environment. Significant
effort is required to satisfy the critical user need for uniform interfaces in the face of great
diversity of hardware/OS platforms on TeraGrid and the ongoing discovery of bugs and security
flaws. The software packaging team generates: (1) rebuilt software components for TeraGrid
resources to address security vulnerabilities and functionality issues; (2) new builds of software
components across all TeraGrid resources to implement new CTSS kits; and (3) new builds of
software components to allow their deployment and use on new TeraGrid resources. This work
is strictly demand-driven. New resources during this project period will include the Track 2d
resource, XD visualization resource(s), and may include new data archive systems. Finally, the
packaging team reuses and contributes to the NSF OCI Virtual Data Toolkit (VDT) production
effort.
This team also responds to help desk tickets concerning existing CTSS capability kits and
assists both resource providers and software providers with debugging software issues,
including but not limited to defects.
4.5.2 Information Services
TeraGrid’s integrated information service (IIS) is the means by which TeraGrid resource
providers publicize availability of their services, including compute queues, software services,
local HPC software, data collections, and science gateways. By the end of TeraGrid’s PY5,
most of the descriptive data about TeraGrid that is (or formerly was) stored in a myriad of
independently operated databases will be accessible in one place via the IIS. The IIS combines
distributed publishing with centralized aggregation: each data provider publishes its own data
independently of others, while users see a coherent combined view of all data. The IIS is used
throughout the TeraGrid system—in user documentation, automated verification and validation
systems, automatic resource selection tools, and even in project plans—to provide up-to-date
views of system capabilities and their status.
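The combination of distributed publishing and centralized aggregation can be illustrated with a short sketch. It assumes that each provider publishes a list of simple records; the provider names, record fields, and the aggregate and query functions are hypothetical and do not reflect the actual IIS schema or interfaces.

```python
from typing import Dict, List

# Hypothetical per-provider documents, each published independently.
published: Dict[str, List[dict]] = {
    "rp-a": [{"kind": "queue", "name": "normal"}],
    "rp-b": [{"kind": "queue", "name": "debug"},
             {"kind": "software", "name": "example-lib"}],
}


def aggregate(sources: Dict[str, List[dict]]) -> List[dict]:
    """Merge provider records into one combined view, tagging each with its source."""
    combined = []
    for provider in sorted(sources):
        for record in sources[provider]:
            # Copy the record so the provider's own data is left untouched.
            combined.append({**record, "provider": provider})
    return combined


def query(view: List[dict], kind: str) -> List[dict]:
    """Return all records of one kind from the combined view."""
    return [r for r in view if r["kind"] == kind]
```

Each provider's data stays under its own control, while a consumer of the combined view can ask one question (for example, all queues) without knowing which provider published which record.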
During this project period, the new Track 2d resource, new XD visualization resource(s), and
possible new data archive systems will need to be integrated with the IIS. There will also be
several new capabilities that will need to be tracked by IIS, including WAN file systems and
advanced scheduling capabilities. We also anticipate significant growth in the use of the IIS—
both by humans and by automated systems—that may require capacity/scalability
improvements for the central indices. Finally, the IIS will be prepared for transition to the new
XD CMS awardee.
4.5.3 Supporting Software Integration and Information Services
Several advanced user capabilities on TeraGrid rely on centralized services for their day-to-day
operation. These include: automatic resource selection, co-allocation, queue prediction, on-
demand/urgent computation, the integrated information service (IIS), and our multi-platform
software build and test capability. We will maintain these centralized services and ensure their
high availability (99.5% availability) to the TeraGrid user community. High availability requires
redundant servers in continuous operation in distributed locations with a design that includes
automatic, user-transparent failover. We are able to provide this at a low cost using virtual
machine (VM) hosting technology at multiple RP and commercial locations and a dynamic DNS
system operated by the TeraGrid operations team.
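As a rough illustration of what the availability target implies, a 99.5% target corresponds to a downtime budget of about 43.8 hours per year. The sketch below shows that arithmetic, together with the idealized effect of redundant servers. The function names are introduced here for illustration, and the redundancy formula assumes independent failures and instantaneous failover, which real deployments only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - availability) * period_hours


def combined_availability(per_server: float, replicas: int) -> float:
    """Availability of N redundant servers, ASSUMING independent failures
    and instantaneous failover (an idealization, not a measured figure)."""
    return 1.0 - (1.0 - per_server) ** replicas
```

Under these assumptions, two replicas that are each available 95% of the time (a hypothetical figure) would together be available 99.75% of the time, which is why redundancy across distributed locations can meet the target at low cost.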
4.5.4 Networking
The goal of the TeraGrid network is to facilitate high-performance data movement between the
TeraGrid Resource Provider (RP) sites. As such, this network is exclusive to TeraGrid
applications, such as file access/transfer via Global File Systems, data archive, and GridFTP.
Users at non-TeraGrid institutions access TeraGrid resources through the site’s normal research
and education networks.
The TeraGrid network connects all TeraGrid RP sites and resources at 10 Gb/s or more. The
backbone network is comprised of hub routers in Chicago, Denver and Los Angeles that are
maintained by the GIG. The three routers are connected with two 10 Gb/s links, one primary and
one backup. The configuration provides for redundancy in case of circuit failures. The RPs
connect to one of these hub routers, and maintain local site routers to connect to their local
network and resources.
In PY6, the TeraGrid networking group will continue to operate the TeraGrid network in its current
configuration, which includes the maintenance contracts for the backbone network hardware. The
working group will continue to provide the same support it has for the first five years of the project,
which includes troubleshooting, performance monitoring, and tuning.
In addition, this project will fund connectivity for sites with computational resources not provided
under Track 2. These sites are LONI/LSU, SDSC, UI, Purdue, ORNL, and PSC. PSC’s funding
will be for three months of connectivity support for Pople in advance of their Track 2 system
coming online.
4.5.5 Security
Incident response
Security of resources and data is a top priority for the TeraGrid partnership. The TeraGrid
Incident Response (IR) team will continue to operate, coordinating and tracking incident
information at the RP sites. The IR team has members from all sites and coordinates via weekly
conference calls and over secured email lists. The team develops and executes response plans
for current threats, and coordinates reporting to NSF regarding security events. The GIG will
fund sites providing resources not funded by Track 2 to provide security for those resources,
including day-to-day security maintenance and incident response. Track 2 sites will provide for
their security from their operational awards.
User-Facing Security Services
The TeraGrid provides a single sign-on mechanism to its user community, giving them a
standard method for authenticating to access any TeraGrid resources to which they have
access. This service depends on a set of core services including a Java-based PKI-enabled
SSH application in the TeraGrid User Portal (TGUP), a provision for single sign-on across
resources provided by the MyProxy service, and the TeraGrid-wide Kerberos realm. The
MyProxy and Kerberos services are deployed at both NCSA and PSC for fault tolerance. All of
these services will be supported and maintained during PY6 in order to continue to facilitate
simplified authentication for TeraGrid users.
Additional PY6 activities will include supporting the advanced access services for science
gateways (the community accounts and Science Gateway capability kit described in §D.3.5.2)
and supporting the Shibboleth capability integrated into the TeraGrid User Portal (§D.4.2). We
will also continue to advance these services by providing for integration with the instrumentation
work (§D.5.8.3) to better track and analyze usage and continue to expand the user base of
these services in preparation for XD.
4.5.6 Quality Assurance
As the benefits of using Grid services are being recognized by both
individual users and science gateways (SGW), the increase in demand for
Grid services has resulted in unexpected behavior and unanticipated
instability. One outcome of a concerted effort by many TG staff to
address these problems was the recognition of the need to formalize and
establish a quality assurance (QA) effort for TG systems and services. A QA group
was formed in late 2008 to address this issue. The near term task of
the group was to develop a plan to improve the availability of grid
services as quickly as possible. Through PY5 work will continue to
reach this goal to improve grid service availability and reliability.
Looking ahead, the QA working group will continue to work toward
improving the quality of all aspects of using TG grid services. In
PY6, the team will collaborate with the TAIS group to transition their
work to XD to ensure that the XD environment and services are of the
same high quality as those that will have been established for the TG
project.
4.5.7 Common User Environment
A goal of the TeraGrid is to allow users to move between systems with
relative ease. This is difficult to achieve since it requires a high
degree of coordination between sites with diverse resources. This
diversity of resources is a strength of the TeraGrid and imposing
unnecessary uniformity can be an obstacle to scaling and to using each
resource's specific abilities to the fullest. In 2008, the Common User
Environment (CUE) group was established as a forum for strengthening
TeraGrid’s efforts to provide a common environment and to strike the
right balance of commonality and diversity across the project’s
resources. The group quickly undertook extensive gathering of user
requirements, identified a series of recommendations, and has begun
creating implementation plans based on those recommendations. In
PY6, the CUE group will continue to work with TeraGrid operations and
the QA groups to establish and refine the common environment and
evaluate the effectiveness of elements for the user community.
4.5.8 Operational Services
4.5.8.1 TOC Services
The TeraGrid Operations Center (TOC) will continue to provide 24x7 help desk services for the
user community. The TOC is accessible via toll-free telephone, email and the web. As an initial
global point of contact and triage center for the TeraGrid community, the TOC solves problems,
connects users to groups and individuals for problem resolution, and maintains the TeraGrid
Ticket System (TTS). The TTS is used both to ensure issues receive appropriate follow up and
to collect data on the types of issues the users are facing in order to better focus project support
resources.
4.5.8.2 UFP Operational Infrastructure and RP Integration
The User-Facing Projects (UFP) team operates a suite of services for providing users access to
TeraGrid resources and information. This set of services is geographically distributed and
encompasses the:
    • TeraGrid User Portal
    • TeraGrid Web Site
    • TeraGrid Wiki and Content Management System, critical for internal project
      communications
    • POPS, the system for TeraGrid allocation requests
    • TeraGrid Central Database (TGCDB) and Account Management Information Exchange
      (AMIE) servers for accounting, allocations, and RP integration
    • Resource Description Repository
    • TeraGrid Knowledgebase
    • TeraGrid allocations and accounting monitoring tools
    • Suite of resource catalogs, monitors, and news applications

In PY6, these services will continue to be operated and coordinated as production services by
the UFP team. UFP strives for a better than 99% uptime for all of these components to ensure a
productive and satisfying user experience.
4.5.8.3 Operational Instrumentation (device tracking)
The TeraGrid developed and supports a suite of operational instrumentation software that is
used to monitor grid and network usage. In PY6, this instrumentation will continue to be
developed to provide better integration of the different instrument platforms to simplify reporting
and provide integrated data views for the user community. New resources will be incorporated,
including LONI’s final network connection and the Track 2c and 2d platforms.
In order to facilitate adoption in the XD program and benefit the broader community, the
reporting system will be released as an open source tool for use by other organizations utilizing
the Globus monitoring system.
4.5.8.4 Inca Monitoring
Inca provides monitoring of TeraGrid resources and services with the goal of identifying
problems for correction before they hamper users. The Inca team at SDSC, who developed the
software, will continue to maintain its deployment on TeraGrid, including writing and updating
Inca reporters (test scripts), configuring and deploying reporters to resources, archiving test
results in a Postgres database, and displaying and analyzing reporter data in Web status pages.
The Inca team will work with administrators to troubleshoot detected failures on their resources
and make improvements to existing tests and/or their configuration on resources. In addition,
the team will implement new tests identified by TeraGrid working groups or CTSS kit
administrators and deploy them to TeraGrid resources as part of the suite of Inca reporters. The
team will modify Web status pages as CTSS and other working group requirements change.
SDSC will continue to upgrade the Inca deployment on TeraGrid with new versions of Inca (as
new features are often driven by TeraGrid) and optimize performance as needed.
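The run, archive, and display cycle described above can be sketched generically. This is not the Inca reporter API: the run_reporter, archive, and failing functions and the result fields are hypothetical, and Python's built-in sqlite3 module stands in for the Postgres archive purely to keep the example self-contained.

```python
import sqlite3
import subprocess
import sys
from datetime import datetime, timezone


def run_reporter(name: str, command: list) -> dict:
    """Run one test command and summarize the outcome as a result record."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "name": name,
        "passed": proc.returncode == 0,
        "output": proc.stdout.strip(),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    }


def archive(conn: sqlite3.Connection, result: dict) -> None:
    """Store a result record; sqlite3 stands in here for a Postgres archive."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(name TEXT, passed INTEGER, output TEXT, ran_at TEXT)"
    )
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (result["name"], int(result["passed"]), result["output"], result["ran_at"]),
    )
    conn.commit()


def failing(conn: sqlite3.Connection) -> list:
    """Names of reporters whose archived results include a failure."""
    rows = conn.execute(
        "SELECT DISTINCT name FROM results WHERE passed = 0 ORDER BY name"
    )
    return [row[0] for row in rows]
```

A status page in this sketch would simply render the output of failing, so that problems are visible to administrators before users encounter them.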
4.5.9 RP Operations
In developing plans for this proposal, the TeraGrid team clearly needed to take a strategic view
on how best to allocate less funding than has been available to this program to date. With
respect to current resources at RP sites, it was clear that we could not simply
continue “business as usual.” Resources to be continued must provide a clearly defined benefit
to the user community either through direct provision of important capabilities or by providing a
resource for developing/enabling important new capabilities. With this in mind we came to
consensus on a subset of resources to continue to support.
4.5.9.1 Compute Resources
While it is clear that the dearth of computational resources curbed the growth curve of use by
the community, the deployment of Ranger and subsequently Kraken has spurred a new surge in
requests and usage from the community. As shown in Figure 3, the growth in requests and
awards of resources has already matched the currently available resources. Given that there is
little growth in the available resources, even when the Track 2c system becomes available some
time in 1H2010, user demand clearly outstrips the availability of new resources. Still, given the
budget available, the ratio of impact to cost was considered.

[Figure 3: Requested, Allocated and Available Resources for TeraGrid Large Resources (BigBen,
Abe, Ranger, Kraken). Vertical axis: Millions of NUs, with ticks at 100, 1,000, 10,000 and
100,000. Series: Requested NUs, Awarded NUs, Available NUs, Available NUs (Projected).]

System      Peak Performance   Memory    Nodes   Disk     Manufacturer
Abe         90 TF              14.4 TB   1200    400 TB   Dell
Lonestar    62 TF              11.6 TB   1460    107 TB   Dell
Steele      66 TF              15.7 TB    893    130 TB   Dell
QueenBee    51 TF               5.3 TB    668    192 TB   Dell
Table 1: TeraGrid IA32-64 Cluster Systems
During the TeraGrid Extension period, we will retain the four primary IA32-64 cluster resources
shown in Table 1. While these clusters will provide capacity (collectively approximately half of
Ranger), they will be focused on supporting four additional efforts:
Large scale interactive and on-demand (including science gateway) use: We have been
given clear indications from the user community, the Science Advisory Board and review panels
that more effort is needed in this area. Often researchers need an interactive resource in order to be
able to effectively develop models and debug applications at scale. In some cases this will be in
preparation for longer-running execution on Ranger, Kraken or another large-scale system. In
other cases it is the best mode of use to conduct science. Further, many science gateways
need access to resources with short response times to provide a useful experience for the
gateway users. We will make use of reservations and pre-emptive scheduling to satisfy the
needs of such gateways.
Transition platform to Track 2 systems: These systems will provide a transition platform for
those coming from a university- or departmental-level resource and moving out into the larger
national CI. Typically, such researchers are accustomed to using an Intel-based cluster, and
these resources will provide a familiar platform with which to expand their usage and to work on
scalability and related issues. Researchers will not be restricted to taking this path and could
jump straight to the Track 2 systems, but many have asked for this type of capability. By making
use of these platforms in this way, we also alleviate the pressure of smaller jobs on the larger
systems that have been optimized in their configuration and operational policies to favor highly
scalable applications.
Metascheduling and Job Affinity: These resources will have the metascheduling CTSS Kit
installed and will be allocated as a single resource. Given that there is some variation amongst
these systems (e.g. Steele has a mixture of GigE and high-performance interconnects, Lonestar
is configured to support more memory bandwidth per CPU core), we will preferentially schedule
jobs needing certain characteristics to the appropriate particular resources. This will maximize
the efficiency and effectiveness of the collective resource.
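The affinity idea above can be sketched as follows. This is a minimal illustration, not the CTSS metascheduling kit: the `Cluster`, `Job`, and `pick_cluster` names are hypothetical, and the trait labels and their assignment to clusters are assumptions loosely based on the examples in the text (Steele's GigE portion, Lonestar's memory bandwidth).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    traits: frozenset  # characteristics the cluster advertises (assumed labels)


@dataclass(frozen=True)
class Job:
    name: str
    needs: frozenset = frozenset()  # characteristics the job requires


# Illustrative trait assignments only; not an official description of the systems.
CLUSTERS = [
    Cluster("Abe", frozenset({"hp_interconnect"})),
    Cluster("Lonestar", frozenset({"hp_interconnect", "high_mem_bandwidth"})),
    Cluster("Steele", frozenset({"hp_interconnect", "gige"})),
    Cluster("QueenBee", frozenset({"hp_interconnect"})),
]


def pick_cluster(job, clusters, queued):
    """Prefer clusters that advertise every trait the job needs.

    Among the candidates, choose the one with the fewest queued jobs.
    If no cluster matches, fall back to the least-loaded cluster overall.
    """
    matches = [c for c in clusters if job.needs <= c.traits]
    pool = matches or clusters
    return min(pool, key=lambda c: queued.get(c.name, 0))


# Hypothetical queue depths for the demonstration.
queued = {"Abe": 12, "Lonestar": 30, "Steele": 5, "QueenBee": 8}
print(pick_cluster(Job("osg-single-node", frozenset({"gige"})), CLUSTERS, queued).name)
print(pick_cluster(Job("bandwidth-bound", frozenset({"high_mem_bandwidth"})), CLUSTERS, queued).name)
print(pick_cluster(Job("generic-mpi"), CLUSTERS, queued).name)
```

In this sketch a job with no stated needs simply goes to the least-loaded cluster, while a job with a requirement is steered to a matching system even when that system's queue is longer.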
Support for OSG Jobs: As noted in §D.X.Y, there has been effort to further develop the
relationship between TeraGrid and OSG. As part of our work during the TeraGrid Extension
period, we will support running of "traditional" OSG jobs (high-throughput, single-node
execution) on TG resources, in addition to our efforts to support the less common parallel jobs
from OSG users. These jobs at a minimum can backfill the schedule of jobs across the set of
resources, but we will also want to allow them to have "reasonable" priority, as opposed to how
low-level parallel jobs are typically handled by scheduling policies on large systems today.
Making use of the job affinity scheduling already mentioned, we can schedule these jobs to
appropriate resources (e.g. the GigE-connected portions of Steele).
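As a rough illustration of how single-node OSG jobs could backfill idle capacity, the sketch below greedily places jobs into gaps in the schedule. It is an assumption-laden toy, not a description of any TeraGrid scheduler: the `backfill` function, the gap tuples, and the job list are hypothetical, and each free node is assumed to host at most one backfilled job per gap.

```python
def backfill(gaps, jobs):
    """Place single-node jobs into idle gaps.

    gaps: list of (resource, free_nodes, minutes_until_next_reservation)
    jobs: list of (job_id, walltime_minutes), each needing one node
    Returns a dict mapping job_id -> resource for every job that fits.
    """
    remaining = [[resource, nodes, minutes] for resource, nodes, minutes in gaps]
    placements = {}
    # Longest jobs first, so the large gaps are not consumed by short jobs.
    for job_id, walltime in sorted(jobs, key=lambda j: -j[1]):
        fits = [g for g in remaining if g[1] > 0 and g[2] >= walltime]
        if not fits:
            continue  # the job waits for a later scheduling pass
        gap = min(fits, key=lambda g: g[2])  # tightest gap that still fits
        gap[1] -= 1
        placements[job_id] = gap[0]
    return placements


# Hypothetical gaps and jobs for the demonstration.
gaps = [("Steele-GigE", 2, 60), ("Abe", 1, 240)]
jobs = [("a", 45), ("b", 200), ("c", 50), ("d", 30)]
print(backfill(gaps, jobs))
```

Here job `b` takes the only gap long enough for it, `a` and `c` fill the two free GigE nodes, and `d` is left for a later pass because no free node remains.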
4.5.9.1.1 Unique Computing Resource
The Lincoln cluster provides a unique GPU-based computing resource at scale. With 192
compute nodes and 96 S1070 Tesla units, this system represents the first GPU-based resource
at scale to be available to the academic research community. Initial allocation requests have
already overwhelmed this machine and early applications work has shown it to be effective for a
subset of important applications (NAMD, WRF …). While it will not be easily used for a broad
range of applications, it will provide a powerful capability for a set of important applications.
4.5.9.2 Supporting Virtual Machines
An emerging need and very interesting area for investigation and evaluation is the use of VMs
to support scientific calculations. There are some groups doing this now and Quarry at IU
already provides a VM hosting service that is increasingly widely used and unique within the
TeraGrid. (Currently Quarry supports more than 18 VMs for 16 distinct users, many of which
host gateway front-end services.) This also has connections to supporting OSG users, and we
plan an effort in this area. We believe this is another viable usage modality for the four
cluster resources noted above along with Quarry at IU (7.1 TF, 112 HS21 blades in an IBM e1350
BladeCenter cluster with 266 TB of GPFS disk).
4.5.9.3 Supporting the Track 2c Transition
PSC's Pople (768-processor Altix 4700, 1.5 TB shared memory) and Cobalt at NCSA
represent the large shared memory resources in the TeraGrid. They are in great demand and
consistently oversubscribed at TRAC meetings. PSC has exploited the availability of shared
memory resources to attract new communities to the TeraGrid, including researchers in game
theory and machine translation, parallel Matlab users, etc. The Track 2c system will deliver
substantially more shared memory resources to the national community. However, since the onset
of Track 2c will be somewhat delayed compared to what was originally proposed,
funds are requested for a three-month period of Pople operation in PY6 to assure continued
production access to these valuable resources.
4.6 Education, Outreach, Collaboration, and Partnerships
Work in education, outreach, collaboration, and partnerships is driven by both community
requirements and the desire to advance the science, technology, engineering and mathematics
(STEM) fields of education and research. TeraGrid regularly assesses requirements and needs
in these areas through the annual TeraGrid User Survey, an annual HPC training and education
survey, surveys completed at the end of training and education events throughout the year,
interviews and discussions with community members, discussions with the Science Advisory
Board (SAB), and discussions with our external partners.
TeraGrid’s Education, Outreach, and Training (EOT) area seeks to engage and retain larger
and more diverse communities in advancing scientific discovery, emphasizing communities that
are under-represented in terms of race, gender, disability, discipline, and institution. EOT will
continue to build strong internal and external partnerships, leveraging opportunities to offer the
best possible HPC learning and workforce development programs, and increasing the number
of well-prepared STEM researchers and educators. EOT will continue to conduct formative and
summative evaluations of all programs and activities. The evaluations allow TeraGrid to improve
the offerings to best address community needs and requirements, to identify best practices, and
to identify transformative impact among the target communities.
For PY6, the TeraGrid Forum has determined that an increased level of EOT and ER effort
above the PY5 level of effort is necessary to further the goals of TeraGrid, and those areas of
increased emphasis are highlighted in section D.4.4 and the remainder of this section. The EOT
and ER teams will work closely with the TeraGrid Forum and the TeraGrid ADs to plan this
increased level of work, including the relevant WBS elements, scopes of work, and budgets, to
be shared among the GIG and the RPs using the same mechanisms and timelines used to
develop similar details for other TeraGrid activities. The plans will be vetted with the SAB prior
to being finalized.
In PY6, we are planning an increased level of support to involve undergraduate and graduate
students. Through these positions and internships, we will mentor these students to encourage
them to pursue STEM education and careers. Every student will be provided with travel support
to attend the TeraGrid Conference, where they can learn from and share with other students
and other conference attendees. The students will be encouraged to submit papers and posters
and to enter student competitions to showcase their knowledge and skills.
All EOT activities in PY6 will take into account the need to transition from TeraGrid to XD, and
the EOT team will document all activities in preparation for hand-off to the XD awardees.
4.6.1 Education
TeraGrid has established a strong foundation in learning and workforce development efforts,
which are focused around computational thinking, computational science, and quantitative
reasoning skills, to motivate and prepare larger and more diverse generations to pursue
advanced studies and professional careers in STEM fields. RPs have led, supported, and
directly contributed to K-12, undergraduate, and graduate education programs across the
country. Activities focus on:
     Providing professional development for K-12 teachers and undergraduate faculty;
     Supporting curriculum development efforts by K-12 teachers and undergraduate faculty;
     Collecting and disseminating high-quality reviewed curricular materials, resources, and
        activities for broad dissemination and use; and
     Engaging students to excite, motivate, and retain them in STEM careers.

The education team will provide professional development and support educators developing
computational science and HPC curriculum materials through local, regional, and national
programs and through the five-year SC07-SC11 Education Program. Workshops, institutes and
tutorials will be offered to engage and support teachers and faculty throughout the year.
Computational science and HPC curricular materials developed by educators will be reviewed
and disseminated through the Computational Science Education Reference Desk, a Pathways
project of the National Science Digital Library.
TeraGrid will provide students with internships, research experiences, professional
development, competitions, and numerous learning opportunities, to recruit, excite and motivate
many more students to pursue STEM education and STEM careers. Particular emphasis will be
placed on engaging under-represented students. The internships and research experiences will
include summer experiences and yearlong involvement at RPs.
In PY6, the education effort will include an increased level of support for two complementary
components: development of undergraduate education materials and student engagement
through competitions.
The first effort is focused on working with faculty to develop undergraduate HPC materials
including modules, teacher activities, and student activities for use in four different disciplinary
areas, which will be identified in an initial meeting of the faculty and TeraGrid staff. The effort
will build on the expertise of the TeraGrid AUS and Science Gateways teams and
transformational science efforts from among the TeraGrid user base. Following a faculty
application process, we will select from among the qualified applicants in consultation with the
SAB to ensure appropriate disciplinary representation. The faculty participants will require
institutional commitments to support their efforts. The team will re-convene halfway through the
project to present materials, receive constructive suggestions for improvements, and then pilot
the materials in each other’s classrooms during the second half of the year as a demonstration
of re-usability by others. The final materials will be reviewed one last time and then posted to
HPC University for broad dissemination. We plan to add 0.2 FTE to coordinate this effort, 1
graduate student to assist the faculty throughout the process, and a $10,000 stipend for each
faculty member.
The materials will also be disseminated through the Computational Science Education Reference
Desk and presented at the subsequent SC Education Program. Workshops, institutes and tutorials
will continue to be offered to engage and support teachers and faculty throughout the year.
The second effort will build on the work of the faculty and on the "Computational Science
Problem of the Week" effort that is starting in March 2009 to focus on engaging students from
middle school through college in STEM challenges. Many of the challenges will come from the
curriculum materials developed by the faculty teams. The challenges are intended to empower
students to unleash their minds to solve challenging problems and to be recognized for their
accomplishments. We will build on this foundation of student excitement to engage national
programs that foster student engagement through local, regional, and national competitions
such as the TeraGrid Conference competitions and SC Education Program competitions. We
will have an additional 0.2 FTE for a coordinator and 2 undergraduate students to develop
challenge problems and review student submissions.
We are working with the National Science Olympiad (http://soinc.org/), which has for 25 years
been engaging over 5,300 teams of middle school and high school students from 48 states, to
introduce computational science challenges into their national effort. This is intended to excite,
engage, and empower students across the country to pursue STEM education and careers and
to advance science through the use of computational methods. We will also explore
opportunities to work with the ACM Student Programming Contest, the National College Bowl,
and the Siemens Competition in Math, Science & Technology as other possible venues to
engage thousands more students across the country. We will use emerging youth-oriented
collaboration spaces (Facebook, MySpace, etc.) to reach out to and engage students where
they live and communicate with one another in cyberspace.
The team will document the challenges, effective strategies, and lessons learned from these
efforts and share them publicly. The team will emphasize strategies for professional
development, curriculum development, dissemination of quality reviewed materials, student
engagement, recruiting in under-represented communities, and strategies for working with other
organizations to sustain and scale up successful education programs.
4.6.2 Outreach
TeraGrid has been conducting an aggressive outreach program to engage new communities in
using TeraGrid resources and services. The impact of this can be seen in the number of new
DAC (and now Start-up and Education) accounts that have been established over the last few
years. In 2007, there were 736 requests for DACs of which 684 were approved. In 2008, there
were 762 requests of which 703 were approved. There were 17 education accounts approved in
the last quarter of 2008. TeraGrid has also been working to increase the number of new users.
In 2007 there were 1,714 new users, and in 2008 there were 1,862 new users.
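The rates implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the function names `approval_rate` and `growth` are ours, introduced for illustration.

```python
# Counts quoted in the text: year -> (DAC requests, DAC approvals)
dac = {2007: (736, 684), 2008: (762, 703)}
# New users per year, as quoted in the text
new_users = {2007: 1714, 2008: 1862}


def approval_rate(requested, approved):
    """Fraction of requests that were approved."""
    return approved / requested


def growth(old, new):
    """Relative year-over-year change."""
    return (new - old) / old


for year, (requested, approved) in dac.items():
    print(f"{year}: {approval_rate(requested, approved):.1%} of DAC requests approved")
print(f"New-user growth 2007 to 2008: {growth(new_users[2007], new_users[2008]):.1%}")
```

Run as written, this reports approval rates of 92.9% for 2007 and 92.3% for 2008, and new-user growth of 8.6% from 2007 to 2008.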
TeraGrid has been proactive about meeting people "where they live" on their campuses, at their
professional society meetings, and through sharing examples of successes achieved by their
peers in utilizing TeraGrid resources. Outreach programs include Campus Champions,
Professional Society Outreach, EOT Highlights, and EOT Newsletter. These activities focus on:
        Raising awareness of TeraGrid resources and services among administrators,
         researchers and educators across the country;
        Building human capacity among larger and more diverse communities to broaden
         participation in the use of TeraGrid; and
        Expanding campus partnerships.
Based on the current level of interest, we plan to rapidly expand the Campus Champions
program. The June 2008 launch of the Campus Champions program resulted in a
groundswell of interest from campuses across the country. What began as a start-up effort to
recruit 12 campuses has now reached 30 campuses with another 30+ in discussions about
joining. The Campus Champions representatives (Champions) have been providing great ideas
for improving TeraGrid services for both people new to TeraGrid and "old hands". The
TeraGrid User Survey shows that more campus assistance would be valuable to users, and that
more start-up assistance, documentation, and training are needed for users and the
Champions. We plan to continue to support this effort through PY6, but because of the high
level of interest, we plan to increase the level of effort above PY5 levels to organize the program
and support the Champions in supporting current and future TeraGrid users. We plan to invest
in an additional 0.5 FTE to coordinate the program and an additional 0.5 FTE to provide
technical support, training, and documentation that will directly benefit the Champions, as well
as help all new TeraGrid users become long-term users of TeraGrid. We also will add an
undergraduate student to assist the professional staff working with the Champions. TeraGrid will
work with the CI Days team (Open Science Grid, Internet2, NLR, EDUCAUSE, and MSI-CIEC)
to reach more campuses and to enlist more Campus Champions.
We will also continue to raise awareness of TeraGrid through participation in professional
society meetings, emphasizing reaching under-represented disciplines and under-represented
people. TeraGrid will present papers, panels, posters, workshops, tutorials, and exhibits to
reach as many people as possible to encourage them to utilize TeraGrid. TeraGrid will continue
to host the TeraGrid conference in June and participate in the annual SC Conference.
Through Campus Champions, CI Days, and professional society outreach, TeraGrid will identify
new users and potential users that may benefit from support from the User Services and AUS
teams to become long-term users of TeraGrid. We will work with the Science Director, the SAB,
and external partners to identify these candidate users. A concerted effort will be made to reach
out to areas of the country that have traditionally been under-represented among TeraGrid
users, including the EPSCoR states.
We will document challenges, effective strategies, and lessons learned from current efforts to
share with the XD awardees and the public, emphasizing strategies for identifying additional
outreach opportunities, identifying and engaging new users, and nurturing strong campus
partnerships to broaden the TeraGrid and XD user bases.
4.6.3 Enhancing Diversity
Through both its education and outreach efforts, TeraGrid will continue to target under-
represented disciplines with the goal of enhancing the racial and ethnic diversity of the TeraGrid
user community. We will engage industry, international communities, and other organizations on
activities of common interest and provide community forums for sharing the impact of TeraGrid
on society. We will continue to work with organizations representing under-represented
individuals, including organizations in the Minority Serving Institution Cyberinfrastructure
Empowerment Coalition (MSI-CIEC): the American Indian Higher Education Consortium
(AIHEC), the Hispanic Association of Colleges and Universities (HACU), and the National
Association for Equal Opportunity in Higher Education (NAFEO). We will continue to reach out to EPSCoR
institutions by recruiting more Campus Champions from their institutions. TeraGrid will also
continue to engage larger numbers of students, with an emphasis on activities targeting under-
represented students.
4.6.4 External Relations (ER)
To meet NSF, user, and public expectations, information about TeraGrid success stories—
including science highlights, news releases, and other news stories—will be made accessible
via the TeraGrid website and distributed to news outlets that reach the scientific user community
and the public, including iSGTW, HPCwire, and NSF through the Office of Cyberinfrastructure
(OCI) and the Office of Legislative and Public Affairs (OLPA). We also design and prepare
materials for the TeraGrid website, conferences, and other outreach activities.
While TeraGrid is yielding more and more success stories, the ER team cannot document all of
them due to lack of resources. Further, as we enter PY6, considerable time and attention is
needed to document lessons learned to assist with the transition to XD. We will increase the
level of effort over PY5 by 0.75 FTE (mixed between students and RP
staff) for recording science and EOT successes and for documenting lessons learned to share
with the XD awardee(s) and the public. In addition, we will augment our PY5 efforts with two
undergraduate students (majoring in communications) to assist with literature searches,
interviewing users and staff, and writing the information to be shared with the community. The
team will use a variety of multimedia venues to broadly disseminate the news, including
podcasts, Facebook, and professional society newsletters.
The ER working group, with representatives from every RP, will regularly share information,
strategize plans, and coordinate activities to communicate TeraGrid news and information. The
ER team will continue to convey information about TeraGrid to the national and international
communities, via press releases, science impact stories, EOT impact stories, news stories, and
updates on TeraGrid resources and services. The team will produce the Science Highlights
publication to highlight science impact and will work with the EOT team to produce the EOT
Highlights publication. The ER team will continue to work closely with the NSF OCI public
information experts to ensure TeraGrid information is effectively communicated.
The ER team will collaborate extensively with the User Facing Projects team and others on the
development of an enhanced TeraGrid web presence. The ER working group will investigate
ways that Web 2.0 and multimedia tools can dynamically disseminate TeraGrid news and
information and engage the 18-35 year-old demographic, which uses online social networking
tools and portal-based communication.
The ER team will continue to support TeraGrid involvement in professional society meetings,
including the annual TeraGrid and SC conferences, and help develop promotional pieces for
use at conferences and meetings. The team will document challenges and successful strategies
in working with the TeraGrid staff and the users to capture success stories, news and other
information of value for sharing with the community.
4.6.5 Collaborations and Partnerships
In addition to the EOT collaborations just described (with MSI-CIEC and other organizations),
the TeraGrid intends to remain a technology leader in the broader national and international
computational science community, and all participating sites regularly collaborate with
universities and organizations, both domestic and overseas, in advancing the state of the
art in cyberinfrastructure. These RP-directed collaborations range from the Partnership for
Advanced Computing in Europe (PRACE) and the Distributed European Infrastructure for
Supercomputing Applications (DEISA) to the Chinese Academy of Sciences and the Universidad
del CEMA in Buenos Aires.
In PY6, the TeraGrid will collectively work to further develop current and identify new domestic
and international collaborations through TeraGrid users, participation in professional society
meetings, RP activities, and recommendations from the SAB and elsewhere. In the US, for
example, TeraGrid will continue to extend its connections to the Open Science Grid (OSG). The
TeraGrid and OSG infrastructures both provide scientific users with access to a variety of
resources using similar infrastructures and services. TeraGrid users have access to NSF-
funded HPC systems, but OSG users normally only have access to less powerful, more widely
distributed resources. Depending on the application, some OSG users could benefit from using
significantly more powerful, tightly coupled clusters that are part of the pool of TeraGrid
compute resources. Additionally, we know that some TeraGrid users have components of their
workflow that are better suited to a blend of OSG and TeraGrid resources.
As described in §D.5.9.1, current work to make TeraGrid resources available to OSG users will
advance in PY6, with the RP resources supported by this proposal enabled to support OSG
users in running not only traditional OSG-style jobs (i.e. single-node execution) but also larger-
scale jobs not possible on OSG systems. The TeraGrid’s IA32-64 clusters will also be used to
further explore interoperability and technology sharing.
As an international example, the TeraGrid will continue to build on its partnerships with DEISA
and advance the distributed, international use of both computational and data resources. DEISA
has adopted the TeraGrid’s Inca system for resource monitoring, and TeraGrid is collaborating
on efforts to have projects use both DEISA and TeraGrid resources, including the ability to co-
schedule resources across both organizations for large science users. Science applications
serving as drivers for these DEISA collaborations include climate research (with the Global
Monitoring for Environment and Security effort), the life sciences (with the Virtual Physiological
Human project), and astrophysics (with LIGO, GEO600, and the Sloan Digital Sky Survey).
4.7 Project Management and Leadership
4.7.1 Project and Financial Management
The Project Management Working Group (PM-WG) is responsible for building, tracking, and
reporting on the Integrated Project Plan (IPP), and for managing a change management process.
Central project management coordination provides tighter activity integration across the TG
partner sites. Building a single IPP for the project enhances cross-site collaboration and reduces
duplication of efforts. Tracking a single IPP provides for a more transparent view of progress.
Reporting against a single IPP significantly reduces the complexity of integrating many
disparate stand-alone RP reports. Managing a change process provides a visible and controlled
method for modifying the IPP.
Financial Management is the responsibility of the University of Chicago. Subaward management
will be straightforward since contracts are already in place from the current TeraGrid award.
4.7.2 Leadership
Ian Foster is Director of the Computation Institute and Arthur Holly Compton Distinguished
Service Professor of Computer Science, Argonne National Laboratory and The University of
Chicago. He has led computer science projects developing advanced distributed computing
("Grid") technologies, computational science efforts applying these tools to problems in areas
ranging from the analysis of data from physics experiments to remote access to earthquake
engineering facilities, and the Globus open source Grid software project. The objective of the
Global Grid Forum is to promote and develop Grid technologies and applications via the
development and documentation of "best practices," implementation guidelines, and standards
with an emphasis on rough consensus and running code. Some of his major projects are:

Globus: This project provides a unifying framework for work on high-performance distributed
computing; it includes investigations of security, resource management, communication
protocols, data management mechanisms, and other issues. It is funded by a number of sources,
in particular the DOE Office of Science MICS, the NSF PACI program, NASA IPG, IBM, and
Microsoft, with early support provided by DARPA.

GriPhyN (Grid Physics Network) and PPDG (Particle Physics Data Grid): These projects,
funded under the NSF ITR and DOE SciDAC programs, respectively, plan to implement the first
Petabyte-scale computational environments for data-intensive science in the 21st century.

iVDGL (International Virtual Data Grid Laboratory): This project is creating an international
Data Grid infrastructure.

Earth System Grid: This project, funded under the DOE SciDAC program, is creating technology
for the collaborative and distributed analysis of environmental data.

GRIDS Center: Part of the NSF Middleware Initiative, focused on integrating, deploying, and
supporting Grid middleware.

Honors and Awards: R&D Magazine "Innovator of the Year" Award, 2003; Fellow, American
Association for the Advancement of Science, 2003; InfoWorld Innovator, 2003; University of
Chicago Distinguished Performance Award, 2003; Silicon.com Top 50 Agenda Setter, 2003;
Federal Laboratory Consortium Technology Transfer Award, 2002; Lovelace Medal, 2002;
Fellow, British Computer Society, 2002; R&D100 "Most Promising New Technology" Award,
2002; Gordon Bell Award, 2001; Global Information Infrastructure "Next Generation" Award,
1997; Best Paper Award, 1995 Supercomputing Conference; Society Award for Technical
Innovation, 1989.







John Towns is Director of the Persistent Infrastructure Directorate at the National Center for
Supercomputing Applications (NCSA) at the University of Illinois. He is PI on the NCSA
Resource Provider/HPCOPS award for the TeraGrid, and serves as Chair of the TeraGrid
Forum, which provides overall leadership for the TeraGrid project. He has gained a broad view
of the computational science needs and researchers through his key role in the policy
development and implementation of the resource allocations processes of the TeraGrid and
preceding NSF-funded resources. He is co-PI on the Computational Chemistry Grid project led
by the University of Kentucky. His background is in computational astrophysics, making use of a
variety of computational architectures with a focus on application performance analysis. At
NCSA, he provides leadership and direction in the support of an array of computational science
and engineering research projects that use advanced resources. Towns plays significant roles
in the deployment and operation of computational, data and visualization resources, and grid-
related projects deploying technologies and services supporting distributed computing
infrastructure.
J. Towns: Leadership Class Scientific and Engineering Computing: Breaking Through the
Limits, OCI 07-25070, $208M, 10/07-10/12; NLANR/DAST, OCI 01-29681, $2.5M, 7/02-6/06;
National Computational Science Alliance, OCI 96-19019, $249.1M, 10/97-9/05; The TeraGrid:
Cyberinfrastructure for 21st Century Science and Engineering, SCI 01-22296 and SCI 03-
32116, $44.0M, 10/01- 9/05; Cyberinfrastructure in Support of Research: A New Imperative,
OCI 04-38712, $41.1M, 7/06-8/08; ETF Early Operations-NCSA, OCI 04-51538, $1.9M, 3/05-
9/06; ETF Grid Infrastructure Group (U of Chicago lead), OCI 05-03697, $14.1M, 9/05-2/10;
TeraGrid Resource Partner-NCSA, OCI 05-04064, $4.2M, 9/05-2/10; Empowering the TeraGrid
Science and Engineering Communities, OCI 05-25308, $17.8M, 10/07-9/08; Critical Services for
Cyberinfrastructure: Accounting, Authentication, Authorization and Accountability Services (U of
Chicago lead), OCI 07-42145, $479k, 10/07-9/08.
Matt Heinzel is the Deputy Director of the TeraGrid Grid Infrastructure Group (GIG) and Director
of TeraGrid Operations at The University of Chicago, Computation Institute. As Deputy Director,
he is responsible for TeraGrid coordination, overall architecture, planning, software integration,
and operations. The GIG manages operation process and improvement projects. The Director
of TeraGrid Operations manages a nation-wide team that provides operational monitoring of all
TeraGrid Infrastructure Services and also operates the TeraGrid help desk.
4.7.2.1 Other Senior Personnel
As described in §D.7.1, the overall TeraGrid project is led
by the TG Forum membership, which includes the RP and GIG PIs. This arrangement gives the
RP and GIG PIs equal decision-making influence in the project. Due to limitations on the number of
co-PIs on NSF proposals, the RP PIs are included on this proposal as Senior Personnel.



