How to Build a Grid


SURA Cyberinfrastructure Workshop: Grid Application Planning & Implementation
January 5 – 7, 2005, Georgia State University

Notes from breakout sessions on: How to Build a Grid/Different Grid Technologies
(January 7, a.m.)

Facilitated by: Phil Emer, MCNC
Scribe: Mary Trauner, Georgia Institute of Technology

A number of key questions were posed throughout the breakout session. The notes below
are organized by question, each followed by the discussion it prompted.

A grid can be considered as an access method or platform.

1) What are considered reasonable platforms for compute and storage?
      Anything with a power supply?
      Anything in a controlled environment?
      Anything in a controlled lab?
      State-wide resources

   Phil stated that at MCNC, the decision was:
        Not a desktop
        Something in a controlled environment
        Something that peers across universities had built in a consistent manner.

   Amy Apon (U of Arkansas) worked with students to build an X-Grid (Apple) that
   included some Linux systems. This led to a course on building grids and tools for the

2) Is anyone considering grids as a way to harness unused cycles and/or to avoid a
   [HPC] purchase?

   One person has done so with Avaki; they are currently looking at United Devices.
   The Avaki grid has paid off.
       Avaki software is now basically free to universities.
       Support for it [the fee for support] is reasonable.

3) What middleware is reasonable?
   <Discussed later>

4) What manpower is needed?
   <Not discussed>

5) Should grids support heterogeneity?

       If you support multiple sites, campuses, virtual organizations, it [heterogeneity]
       will happen. For example, MCNC supports a grid comprised of:
            Duke: Linux with Sun Grid Engine and Ethernet
            NC State: A Sun SMP with Grid Engine
            UNC: Linux with Myrinet and LSF

6) Does Globus work in a heterogeneous environment?

       One site has implemented this with GT2 [Globus Toolkit 2]. There is uncertainty
       and concern about the next version since it uses web services. Some in the group
       were uncertain whether this was a step forward.

       NMI provided a stable stack. Now there is a bifurcation of those building
       infrastructures from those at virtual organizations based on the application
       interfaces. Where do those interfaces tie in? At this stage, end users present an
       application to their infrastructure people for a grid solution.

7) Who has built a grid?

       Jorg Schwarz (Sun) responded from two perspectives:
              Within an institution, you just need:
                  Resource manager
                  Directory structure (LDAP, Punch, etc.)
                  Accounting
              Across universities, add:
                  GT2
                  A portal
                  A command line interface

8) Is a grid an HPC solution or a network piece?

       Several responded that they considered it as HPC. One person commented that he
       considered it HPC because it appeared to be an outgrowth from the NSF centers
       on how to do research.

       We discussed the terms Grid versus grid (big versus little “g”.) What must a
       network provider provide for a grid beyond what an ISP might provide?
           Earlier availability
           Something only a university will consume
           A grid gateway
           Beyond bandwidth
                 o Network to compute
                 o An access tier to resources
       Application specific grids tend to want a portal (veneer) whereas general grid
       users tend to prefer lower level access like command line interfaces.

       This launched into a discussion of “point of view”. Utility and presentation
       become important based on point of view when deciding access methods.

       One example posed was a backup service. This still leverages middleware, but
       the use of a backup service is more “grid” than “Grid”.

9) How does one build a grid file system or data grid while immersed in an AFS
   <Discussions of this tended to be spread out across several subsequent questions.>

10) How long does it take to build a grid?

       This depends on how many pieces you need. (Kerberos, AFS, etc.?) You must
       start with:
        Authentication
        Data access
        Local load sharing environment

       This led to the question “Is a grid a cluster or vice versa?”

       Authentication, data access, and load sharing help answer this. Whether your
       users are from a single administrative domain or multiple domains is also part of
       the answer.

       Certificate authorities were discussed. Multiple CAs within a grid (i.e., different
       policies for each compute resource) are problematic. MCNC got around this
       [campus access/use policies] by purchasing the equipment that was placed at each
       university, thus providing unified policies, authentication, etc.
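
       The multiple-CA problem can be illustrated with a minimal sketch. The resource
       and CA names below are hypothetical, and the check is reduced to a set lookup;
       real certificate validation involves far more than this.

```python
def usable_resources(user_ca, trusted_cas):
    """Return the resources that accept certificates issued by `user_ca`.

    `trusted_cas` maps a resource name to the set of CA names it trusts.
    All names here are hypothetical placeholders.
    """
    return sorted(name for name, cas in trusted_cas.items() if user_ca in cas)


# Each campus trusts its own CA: users only see part of the grid.
trusted_cas = {
    "cluster-a": {"CampusA-CA"},
    "cluster-b": {"CampusB-CA"},
    "cluster-c": {"CampusA-CA", "CampusB-CA"},
}
print(usable_resources("CampusA-CA", trusted_cas))

# One unified CA and policy: every resource accepts every grid user.
unified = {name: {"Grid-CA"} for name in trusted_cas}
print(usable_resources("Grid-CA", unified))
```

       With per-campus CAs, a user's reach depends on which CA issued the certificate;
       with a single unified CA, the question disappears.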

11) What should I do in a closed environment with multiple OS platforms, where the
    applications have dependencies on certain libraries?

       What middleware and DRM [Distributed Resource Manager] will you use? DRM
       options include
            Condor-G
            PBS
            Grid Engine
            LSF
            Maui
            LoadLeveler

       The DRM provides:
           Execution-host clients to monitor loads (memory, CPU, etc.)
           Submit-host/master client to collect execution-host information, build
           Does not allow interactive access to execution nodes (interactive nodes are
           generally provided separately.)
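
       The load-collection role described above can be sketched as follows. This is
       an illustration only: the host names, load fields, and selection rule are
       assumptions, not the interface of any of the DRMs listed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostLoad:
    """Load report sent by a (hypothetical) execution-host client."""
    host: str
    cpu_load: float    # load average on the host; lower is better
    free_mem_mb: int   # memory currently available on the host


def pick_host(reports: List[HostLoad], needed_mem_mb: int) -> Optional[str]:
    """Master-side choice: the least-loaded host with enough free memory."""
    eligible = [r for r in reports if r.free_mem_mb >= needed_mem_mb]
    if not eligible:
        return None  # nothing fits; a real DRM would keep the job queued
    return min(eligible, key=lambda r: r.cpu_load).host


reports = [
    HostLoad("node01", cpu_load=3.2, free_mem_mb=8000),
    HostLoad("node02", cpu_load=0.4, free_mem_mb=1000),
    HostLoad("node03", cpu_load=1.1, free_mem_mb=4000),
]
print(pick_host(reports, needed_mem_mb=2000))
```

       Here node02 is the least loaded but lacks memory, so the job would go to node03.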

       Note that Globus does not provide a DRM. GRAM is the Globus Toolkit
       interface between Globus and the DRM.

       Another thing to note is that GridFTP is not needed in a local environment; AFS
       [or scp] or another local mechanism would be used instead.

       And scheduling across clusters is not handled by Globus out of the box. Meta-
       schedulers can be built to handle this. It usually requires some building or
       scripting. Grid Engine (Sun) will do this.

12) Are there some good grid terminology resources?

       Several mentioned the IBM Redbook as a good source [title and/or ISBN?]

13) What layers do you need to connect clusters into a grid?

       First, you need to identify an initial application to exercise the grid, so this
       question is hard to answer completely without knowing the application.

       Jorg (Sun) disagreed, saying an authentication platform that could verify user info
       was enough to do a general-purpose grid. He proposed a diagram [not reproduced in
       these notes] with the following legend:
       LDRM: Local resource manager
       A or B: SMP system, cluster, set of clusters, particular applications, etc.

       (Phil noted that this is a compute-centric grid. A data grid may be different.)

       For example, someone could log onto the portal at A to submit a Blast job. B
       may be the Blast server. B will get the job, know how to get data from A and
       return the results to A. Something like GridLab metascheduling was
       recommended for review.
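
       The Blast example above can be sketched as a toy job flow. Everything here is
       hypothetical: the Site class, the in-memory "storage", and the stand-in for
       running the application are assumptions for illustration, not part of any
       toolkit mentioned in the session.

```python
class Site:
    """A grid site with local storage and a set of installed applications."""

    def __init__(self, name, apps=()):
        self.name = name
        self.apps = set(apps)
        self.storage = {}  # file name -> contents


def submit(portal_site, sites, app, input_name):
    """Route a job from the portal site to a site that offers `app`."""
    target = next((s for s in sites if app in s.apps), None)
    if target is None:
        raise LookupError(f"no site offers {app}")
    data = portal_site.storage[input_name]  # the target pulls input from the portal site
    # Stand-in for actually running the application on the target site.
    result = f"{app}@{target.name} processed {len(data)} characters"
    output_name = input_name + ".out"
    portal_site.storage[output_name] = result  # results return to the portal site
    return output_name


site_a = Site("A")
site_b = Site("B", apps=["blast"])
site_a.storage["query.fasta"] = "ACGT"
output = submit(site_a, [site_a, site_b], "blast", "query.fasta")
print(site_a.storage[output])
```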

14) How do you move data? Who gets to access input and output? How does it [data]
    get in and out?

       This is not necessarily complicated, but it requires some choices for middleware.

       When is the data replicated? How does data location affect performance? When
       do you need to add data access servers?

       Data grids have similar authentication issues but add:
           Data access method
           Replication
           Tools
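
       One way to reason about how data location affects performance is a rough
       transfer-time estimate per replica. The sketch below assumes a simple model
       (fixed latency plus size over bandwidth) and made-up site names and link
       figures; it is not drawn from any data grid tool discussed here.

```python
def transfer_seconds(size_mb, bandwidth_mbps, latency_s):
    """Rough estimate: fixed latency plus size (in megabits) over bandwidth."""
    return latency_s + (size_mb * 8) / bandwidth_mbps


def best_replica(size_mb, replicas):
    """Pick the replica site with the lowest estimated transfer time.

    `replicas` maps a site name to (bandwidth in Mbit/s, latency in seconds).
    """
    return min(replicas, key=lambda site: transfer_seconds(size_mb, *replicas[site]))


# Hypothetical replica locations and link characteristics.
replicas = {
    "local-campus": (1000.0, 0.001),
    "remote-campus": (100.0, 0.020),
}
print(best_replica(500, replicas))
```

       Under this model a 500 MB input takes about 4 seconds from the local replica
       and about 40 seconds from the remote one, which is the kind of gap that
       motivates replication or additional data access servers.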

       Virtual LAN and Cluster on Demand were mentioned. (??)

15) How do you deploy a consistent image across multiple OS platforms?

       Consistency is difficult. Adding the middleware complicates it. Application
       domain specificity will simplify it.

       Maytal Dahan (TACC) described their hub and spoke trust philosophy. Phil
       (MCNC) asked whether this wouldn’t implode into forcing homogeneity. Victor Bolet
       (GSU) said it would implode to the use of standards, not of an OS or platform.

       Maytal went on to say that they are dealing with standards and middleware. She
       added that researchers think grids are still just too hard to use, so portals are
       important. As portals become standardized, they can interoperate, so which one
       you choose matters less. You don’t have to adopt a particular one.

16) Are grids good? How? Where are we going or trying to go?

       We should eventually be able to treat grids supporting science and research the
       same way we treat the black box that the Internet has become for us today.
       Scheduling, Certificate Authorities, etc. should be transparent.
      So how do we build the black box? Visualization and workflow aspects are
      important to how the end-user interfaces to the results (and the speed/quantity at
      which they receive them.) As infrastructure people, we may not have those skills.
      Maytal (TACC) said some middleware may be heading in this direction.

      Visualization widgets, engines, and instruments may be something we need to
      think more about.

      Kazaa and peer-to-peer applications are grid-like things. Grid is like another
      version or “second coming”. Grids need to “spew” services that just work, in a
      similar manner as the peer-to-peer applications.

      But the fundamental requirement remains: run an executable that reads data and
      returns results.

17) Are there any automatic methods?

      There is no “grid in a box.” The NMI toolkit is a good place to start.

      Choosing the Globus API (2, 3, or 4) is a big question. (See below)

      Implementing a Metascheduler: Rudimentary scheduling, like round-robin, isn’t
      too difficult if you have some sort of access control.
           CSF: Community Scheduler Framework
           VGRS (VeriSign Global Registry Services)
           LSF has a multi-cluster solution, but it is very expensive
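
      A rudimentary round-robin metascheduler with access control, as described
      above, might look like the following sketch. The class, cluster names, and
      user lists are hypothetical; this is not how CSF or LSF multi-cluster work,
      only an illustration of the round-robin idea.

```python
from itertools import cycle


class RoundRobinMetascheduler:
    """Hand jobs to clusters in rotation, skipping clusters the user may not use."""

    def __init__(self, clusters, allowed_users):
        self._clusters = list(clusters)
        self._rotation = cycle(self._clusters)
        # Access control: cluster name -> set of users permitted to run there.
        self._allowed = {c: set(allowed_users.get(c, ())) for c in self._clusters}

    def dispatch(self, user):
        """Return the next cluster in rotation that `user` is allowed to use."""
        for _ in range(len(self._clusters)):
            cluster = next(self._rotation)
            if user in self._allowed[cluster]:
                return cluster
        raise PermissionError(f"{user} has no access to any cluster")


scheduler = RoundRobinMetascheduler(
    clusters=["cluster-a", "cluster-b", "cluster-c"],
    allowed_users={
        "cluster-a": ["alice", "bob"],
        "cluster-b": ["alice"],
        "cluster-c": ["alice", "bob"],
    },
)
for user in ["alice", "bob", "alice"]:
    print(user, "->", scheduler.dispatch(user))
```

      The sketch ignores load entirely; it only shows that the rotation itself is
      simple once some form of access control answers "may this user run here?".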

      Globus Toolkit Review:
          Pre “web services” components are pretty safe and stable. 2.4 is the most
            stable for infrastructure.
          Those looking for things like grid applications have gone with 3.0. There
            is concern about 4.0 which is using WSRF (delayed until April.) If 4.0 is
            not rock solid, the results (difficulties?) will be dramatic.
          Today, for production services, run 2.4 (or some 2.x.) Run 3.9.4 if you
            want to play with the WSRF beta. (Comments that it crashes a lot?)

      Other grid toolkits:
          .net: OGSI is going away
          WSRF <missed those comments>
          SRM/SRB: Jefferson Lab
          Unicore: Grid toolkit developed in Europe

      Phil (MCNC) mentioned that they wrote an initial “kick start” guide for an
      enterprise grid. He listed a diagram [not reproduced in these notes].

18) Should we ask SURA to host a mailing list on building grids?

   If interested, let Mary Fran know (

19) Afterthoughts from the facilitator (Phil Emer/MCNC)

I think that these notes do a great job of capturing the flow of the conversation and the
types of questions that were on the table (nice, Mary!). I am not quite sure what to do
with these notes though…some workshops are probably in order…My gut feeling based
on listening to the types of questions folks were asking is that this is what people
want/need to do vis-à-vis building grids:

a) Provide "elegant" access to centralized HPC resources. So grid as a front end access
   method to HPC resources. At MCNC we call this the enterprise grid. Several
   attendees mentioned wanting to provide such an interface to a collection of
   heterogeneous resources (though in most cases the resources were not distributed).
   You don't really need grid to do this as you can apply DRMs like LSF and Grid
   Engine. Adding a grid interface here allows applications and users to access the
   resources in a more transparent and potentially cross-domain kind of way. IMO
   adding the grid access method makes high throughput computing accessible to folks
   who are not traditional command-line driven scientists.

b) Save money and increase user happiness by more effectively using resources. There
   is a bit of momentum building around the notion of cluster or resource on demand.
   So for instance having a pool of resources and say imaging a system on the fly to
   support a particular application for a particular user and releasing the resource when
   done. Some (including folks at NC State) believe that it may be cheaper and easier to
   build this kind of a system than the land of milk and honey (grid) where you
   somehow can apply middleware to any combination of hardware and software
   platform and get a consistent, deterministic result. So for instance some applications
   may simply run cheaper or faster or whatever on a particular OS - so deal with that
   and image an optimized system on the fly and release the resources when you're done.
   Only problem is this is almost anti-grid in that it punts on the notion of being able to
   build the perfect middleware stack.

c) Build more effective Virtual Organizations that share data, applications, tools, gear,
   etc. The first two examples above are more from the point of view of an organization
   that is in the business of running infrastructure – a service provider. Here the
   perspective is a user group. So the Florida example comes to mind where a Biologist
   is providing services to Biologists and some grid tools make sense for managing
   computation, data management, application support and the like. Defining workflows
   and building portals that approximate those workflows while maintaining the notion
   of access control and "ownership" comes to mind here.
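
The cluster-on-demand idea in item b) can be sketched as a lease on a shared pool:
take a free node, image it for the application, and release it when done. The
NodePool class, node names, and image name are hypothetical, and recording the
image in a dictionary stands in for actually re-imaging a machine.

```python
from contextlib import contextmanager


class NodePool:
    """A pool of nodes that are imaged on demand and released after use."""

    def __init__(self, nodes):
        self.free = list(nodes)
        self.image_of = {}  # node -> image most recently installed on it

    @contextmanager
    def lease(self, image):
        if not self.free:
            raise RuntimeError("no free nodes in the pool")
        node = self.free.pop()
        self.image_of[node] = image  # stand-in for imaging the system on the fly
        try:
            yield node
        finally:
            self.free.append(node)  # release the resource when done


pool = NodePool(["node1", "node2"])
with pool.lease("linux-blast-image") as node:
    print(f"running on {node} with image {pool.image_of[node]}")
print(len(pool.free), "nodes free after release")
```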
