Grid Operations Issues and
Open Science Grid Operations
December 1, 2004
Many sites still experience installation and resource certification problems.
Lack of step-by-step installation instructions for most clusters.
Effort on “generalized” rather than “specialized” install process.
Many configuration steps are required after “pacman” command completes.
Need more tests of real-world usage; a common complaint is “The site status is
green but I can’t use it…”
More enhancements for operational verification software (site_verify)
Lack of sufficient monitoring and publication of available storage
No policy publication and little policy enforcement
No redundancy for “single point of failure” services, i.e., VOMS server,
documentation services, installation caches
Avoid centrality and hopefully avoid catastrophic grid failures
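One way to picture the redundancy point above is a client that walks a list of replicas and uses the first one that responds. The sketch below is only an illustration under stated assumptions: the host names, the first_available helper, and the fake_probe health check are all hypothetical and are not part of VOMS or any grid middleware.

```python
from typing import Callable, Iterable, Optional


def first_available(endpoints: Iterable[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint for which probe() succeeds, or None."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except Exception:
            # A failing replica must not stop the search; try the next one.
            continue
    return None


# Hypothetical replica names, for illustration only.
VOMS_REPLICAS = ["voms-primary.example.org", "voms-backup.example.org"]


def fake_probe(host: str) -> bool:
    # Stand-in for a real health check; here the primary is treated as down.
    return host != "voms-primary.example.org"


selected = first_available(VOMS_REPLICAS, fake_probe)
print(selected)
```

With a real deployment the probe would be replaced by an actual service check, and the same pattern would apply to documentation servers and installation caches.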
Problems keeping communication channels open with resource providers.
Problem solving with resource providers stalls.
Infrastructure updates and configuration management are done as “best effort”.
This results in simple problems hanging around for extended periods of time.
Lack of consistency in error reporting and inaccessible log data increases
The absence of documented remote diagnostic procedures makes problem solving harder.
No procedure for bug tracking which would allow feedback to software
Lack of required personnel to research and correct all problems
What is the correct ratio of sites to support personnel?
How much should be invested?
Thus far focused almost exclusively on the resource
providers. Typical of early adoption of internet
Much planning and focus on reacting to critical
problems. Other problems go unreported or
No training activities. An important part of good
Informal channels of support abound.