

Scalable Systems Software
for Terascale Computer Centers
Al Geist, Narayan Desai, Rusty Lusk, Brett Bode, Paul Hargrove
SciDAC ISIC Review March 12, 2003

Research sponsored by MICS Office of DOE

Scalable Systems Software
for Terascale Computer Centers
Coordinator: Al Geist

Participating Organizations
Includes DOE Labs, NSF Supercomputer Centers, Vendors



PSC, Cray, SGI, SDSC, Intel, HP, IBM, Unlimited Scale
Open to all, like the MPI forum
Copy of these slides

The Problem Today
System administrators and managers of terascale computer centers are facing a crisis: Computer centers use incompatible, ad hoc sets of systems tools

Present tools are not designed to scale to multi-Teraflop systems

Do nothing and …
Each computer center rewrites its own home-grown software

Redundant effort and delayed availability
End result leaves the community no better off, and science suffers.

Example: The San Diego Supercomputer Center has no automatic way of having its queue manager interact with its accounting system.
Date: Tue, 18 Feb 2003 10:22:59 -0800
From: Donald Frederick <>
To:
Subject: UChicago ASCI Account on SDSC BH

Dear Anshu - Please ask your colleagues not to submit any new jobs until we can replenish the Chicago ASCI account. Thanks. -Don F

Our Three Goals
Collectively (with industry) agree on and specify standardized interfaces between system components
An MPI-like process to promote interoperability, portability, and long-term usability.

Produce a fully integrated suite of systems software and tools

Reference Implementation for the management and utilization of terascale computational resources.
Research and development of more advanced versions of the components

To support the scalability, fault tolerance, and performance requirements of large science applications. Up to 10,000 nodes.

Scope of the Effort
Resource & Queue Management

Allocation management

Accounting & user mgmt
Fault Tolerance

Security

System Monitoring

Checkpoint restart

System Build & Configure

Job management

Fundamentally change the way future high-end systems software is developed and distributed

Reduced facility management costs
• reduce need to support ad hoc software
• better systems tools available
• able to get machines up and running faster and keep them running

More effective use of machines by scientific applications
• scalable launch of jobs and checkpoint/restart
• job monitoring and management tools
• allocation management interface


A Common Integrated Interface Framework
Vendor-optimized, highly scalable version

Easy to swap components

Common pool of attributes
Attribute database: User, Host, Node, Job, Allocation, etc.

XML format for attributes
Standardized request protocol
Choose an existing transfer protocol: TCP
Every component uses the same framework
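The bullets above describe the framework's wire model: attributes carried as XML, a standardized request protocol, and TCP as the transfer protocol. A minimal sketch of what such a request might look like in Python follows; the element names, attribute layout, and framing are illustrative assumptions, not the project's actual schemas, which were defined by the working groups.

```python
import socket
import xml.etree.ElementTree as ET

def build_request(component, verb, **attrs):
    """Build an illustrative XML request for a named component.

    The <request>/<attr> element names are invented for this sketch.
    """
    req = ET.Element("request", component=component, verb=verb)
    for key, value in attrs.items():
        ET.SubElement(req, "attr", name=key, value=str(value))
    return ET.tostring(req)

def send_request(host, port, payload):
    """Send one request over plain TCP and return the raw reply."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
        sock.shutdown(socket.SHUT_WR)   # signal end of request
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks)

msg = build_request("queue-manager", "job-status", jobid=1042)
print(msg.decode())
```

Because every component speaks the same XML-over-TCP convention, components written in different languages can interoperate and be swapped out independently.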

System Software Architecture
Security manager
Access control
Interacts with all components

Meta Scheduler

Meta Monitor

Meta Manager

Allocation Management


System Monitor

Node Configuration & Build Manager

User DB

Queue Manager

Job Manager & Monitor

Data Migration

Usage Reports

User utilities

High Performance Communication & I/O

Checkpoint/ restart

File System

Components not in our mission

Three Phase Plan
Project is on schedule

18 months
- Agree on and create an initial integrated suite of conforming components (subset of architecture) that will be released and updated
- Specify standardized interfaces between system components through a series of open meetings, organized and run like the meetings that defined the MPI standard

24 months
- Research and creation of components that have no existing versions
- Continuing improvements in efficiency and scalability of the most critical system components
- Explore long-term support for the distribution through the industry participants
18 months
- Starting in 2005, focus on the next generation of computers and the creation of system software components needed for this future technology
- Getting vendors to support the Scalable Systems Software distribution on their platforms

Project Management
Quarterly Face-to-Face Meetings
To discuss and vote on interface proposals

Four different Working Groups
1. Node build, configuration, and information service
2. Resource management, scheduling, and allocation
3. Process management, system monitoring, and checkpointing
4. Validation and integration

Web-based Project Notebooks (over 200 pages and growing)
A main notebook for general information & mtg notes And individual notebooks for each working group

Voting and Consensus
Same voting rules used by MPI forum

We Have Active Vendor Participation
In discussing and voting on interface proposals

Sample of proposals voted on:
1. Basic wire protocol
2. Cluster BIOS support
3. XML schema formalization
4. Single-method monitoring XML interface
5. Authentication support
6. Add Event Manager to architecture
7. Support HTTP protocol

And so on…

Four Working Groups
Overall Leader: Al Geist ORNL

Node build, configuration working group
Leader: Narayan Desai ANL

Process management working group
Leader: Paul Hargrove LBNL

Resource management working group
Leader: Scott Jackson, PNNL

Validation and testing working group
Leader: Eric DeBenedictis, SNL

Each working group is composed of members from several organizations Leaders of the working groups come from across the organizations Working groups are cross-fertilized with members from other groups

Build and Configure WG
Hardware manager
manages all physical node services, including basic node identification, hardware inventory functions, power controller access, and management hardware topology information.

Build and Configuration manager
handles all aspects of software installation and configuration management on all nodes in the system.

Node state manager
an administrative control panel for the cluster. It keeps track of and allows direct control of all individual node states and availability.

Compatible with OSCAR, Rocks, Chiba City, Cplant, and Scyld-like systems

Integrating Components
SSSlib communication library
Supplies five wire protocols and is extensible
Supplies basic authentication
Compatible with C, C++, Java, Perl, and Python components

Service Directory
Stores location and protocol information for all components
Provides this information when components need to communicate
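The register/lookup pattern the Service Directory provides can be sketched as a toy in-memory registry. The real component persists registrations and speaks the project's XML wire protocol; the class and method names below are invented for illustration.

```python
class ServiceDirectory:
    """Toy registry: components announce themselves, others look them up."""

    def __init__(self):
        self._services = {}

    def register(self, name, host, port, protocol):
        """A component announces where and how it can be reached."""
        self._services[name] = {"host": host, "port": port,
                                "protocol": protocol}

    def lookup(self, name):
        """Another component asks how to contact `name`."""
        try:
            return self._services[name]
        except KeyError:
            raise LookupError(f"no such service registered: {name}")

sd = ServiceDirectory()
sd.register("event-manager", "node0", 5150, "challenge")
print(sd.lookup("event-manager"))
```

Centralizing location and protocol information this way is what lets components be relocated or swapped without reconfiguring every peer.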

Event Manager
Scalable solution for asynchronous events between components
Also provides a framework for system monitoring
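The decoupling an event manager provides can be shown with a minimal in-process publish/subscribe core: producers emit named events and consumers register callbacks, so neither needs to know about the other. The real Event Manager does this across the wire between components; the names here are invented for the sketch.

```python
from collections import defaultdict

class EventManager:
    """Toy pub/sub hub: route named events to registered callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        for callback in self._subscribers[event_type]:
            callback(payload)

em = EventManager()
seen = []
em.subscribe("node-down", seen.append)    # e.g. the scheduler listens
em.publish("node-down", {"node": "n17"})  # e.g. the monitor emits
print(seen)
```

The same mechanism carries monitoring data: monitors publish state events, and any interested component subscribes.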

Process Management WG
Process manager
scalable startup
runtime communication establishment
knows where all the processes of a parallel job have been started

System and Job Monitors
unified framework for collecting data across systems
provides real-time state data of components and jobs
focus on scalability and extensibility into new environments

Checkpoint manager
Checkpoint: a LAM/MPI or serial job, on demand
Suspend: a job, to temporarily cease use of resources
Migrate: a running job to different (overlapping) nodes
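The three operations above compose: suspend is checkpoint plus releasing resources, and migrate is suspend plus restarting from the saved image on other nodes. The sketch below models only that control flow on a job record; real checkpointing serializes process state to disk, and all field and state names here are invented.

```python
class CheckpointManager:
    """Control-flow sketch of checkpoint / suspend / migrate."""

    def __init__(self, jobs):
        self.jobs = jobs    # jobid -> {"state": ..., "nodes": [...]}

    def checkpoint(self, jobid):
        # Save an image; the job keeps running.
        self.jobs[jobid]["checkpointed"] = True

    def suspend(self, jobid):
        # Checkpoint, then release the job's resources.
        self.checkpoint(jobid)
        self.jobs[jobid]["state"] = "suspended"

    def migrate(self, jobid, new_nodes):
        # Suspend, then restart from the image on different nodes.
        self.suspend(jobid)
        self.jobs[jobid]["nodes"] = new_nodes
        self.jobs[jobid]["state"] = "running"

cm = CheckpointManager({7: {"state": "running", "nodes": ["n1", "n2"]}})
cm.migrate(7, ["n3", "n4"])
print(cm.jobs[7])
```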

Resource Management WG
Scheduler
based on Maui with Scalable Systems interfaces
extensive advance reservation support and policy control
support for LoadLeveler, PBS, SGE, LSF, and BProc systems

Queue Manager
maintains information about jobs, both present and past
handles job submission and deletion in the system
node setup, data staging, and teardown
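The bookkeeping side of those bullets — assigning job ids at submission, tracking state changes, and keeping finished jobs around for historical queries — can be sketched as follows. The states and record fields are invented for illustration; the real component also drives node setup, data staging, and teardown.

```python
import itertools

class QueueManager:
    """Toy job bookkeeping: submit, delete, and retain history."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.jobs = {}    # jobid -> job record, present and past

    def submit(self, user, nodes, command):
        jobid = next(self._ids)
        self.jobs[jobid] = {"user": user, "nodes": nodes,
                            "command": command, "state": "queued"}
        return jobid

    def delete(self, jobid):
        # Records are marked, not removed, so past jobs stay queryable.
        self.jobs[jobid]["state"] = "deleted"

    def finish(self, jobid):
        self.jobs[jobid]["state"] = "done"

qm = QueueManager()
jid = qm.submit("anshu", 64, "a.out")
qm.finish(jid)
print(jid, qm.jobs[jid]["state"])
```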

Accounting & Allocation manager
supports management features for accounts, users, machines, allocations, jobs, resources, usage, and charging, with a simple GUI

Meta Scheduler
based on Silver with Scalable Systems interfaces and Grid interfaces
distributes workload across HPC systems at the computer center

Validation and Testing WG
unit-test driver for black-box testing components through their communications interfaces

Coupled-system tester
whole-system test framework
applications are run on the machine in various manners to find bugs and give confidence that the system will not crash during production use


[Diagram: QMTest interprets raw test output against expected results and reports the results]

Scriptable test driver for whole-system tests
Tests defined with XML
HTTP interface
Matches stdout, stderr, and exit codes

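The driver described above can be sketched in miniature: each test is defined in XML, the driver runs the command, and stdout and the exit code are matched against the expectations. The element names are invented for this sketch; the project uses QMTest rather than this code.

```python
import subprocess
import xml.etree.ElementTree as ET

# One illustrative test definition (invented schema).
TEST_XML = """
<test name="echo-works">
  <command>echo hello</command>
  <expect stdout="hello" exitcode="0"/>
</test>
"""

def run_test(xml_text):
    """Run the command from an XML test spec and check its output."""
    spec = ET.fromstring(xml_text)
    expect = spec.find("expect")
    result = subprocess.run(spec.findtext("command"), shell=True,
                            capture_output=True, text=True)
    ok = (result.stdout.strip() == expect.get("stdout")
          and result.returncode == int(expect.get("exitcode")))
    return spec.get("name"), ok

print(run_test(TEST_XML))   # ('echo-works', True)
```

Driving tests from declarative XML specs is what makes the framework scriptable: new whole-system tests are written as data, not code.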

System Software Components
Presently under construction
Strong Emphasis on multi-lab cooperation and team effort
Meta Scheduler: D. Jackson
Meta Manager: S. Scott
Node Manager: T. Naughton
Package Services: J. Mugler
System/Job Monitors: M. Showerman

Scheduler D. Jackson

Job Manager B. Bode
C-Plant XML interface: E. DeBenedictis
Process Manager: R. Lusk
Checkpoint/Restart: P. Hargrove

Accounting S. Jackson

Service Directory N. Desai

Allocation Management S. Jackson

Information Services J.P. Navarro

Queue Manager B. Bode

Authentication & Communication N. Desai, A. Lusk

Resource Mgmt Working Group

Build & Configure Working Group

Process Mgmt Working Group

Progress on Integrated Suite
Grid Interfaces

Working Components and Interfaces (bold)

Meta Services

Meta Scheduler

Meta Monitor

Meta Manager


System & Job Monitor

Service Directory

Node State Manager

Standard XML interfaces
Usage Reports

Authentication & communication

Allocation Management

Event Manager

Node Configuration & Build Manager

Job Queue Manager

Checkpoint / Restart

Process Manager

Hardware Infrastructure Manager

Validation & Testing

Components written in any mixture of C, C++, Java, Perl, and Python.

Interactions and Users
Vendors Interest
They consider these tools commodity and would like to use ours
They are asking what license we'll use (we told them BSD)
Particular interest in the validation & testing framework

Early use in DOE Computer Centers
• Process Manager, Event Manager, and Service Directory in production use on Chiba City at ANL
• XML-wrapped Cplant tools used at Sandia
• Allocation Management component under evaluation by CCS at ORNL
• Adrian Wong from NERSC is an advisor to Scalable Systems on the NERSC requirements

Leveraging Other Work
Packaging technology
Incorporating Scalable Systems components
Large user base to spread the Scalable Systems interfaces

Developing and incorporating Scalable Systems components
Production-scale testing

Micro-kernel clusters
Validation and testing framework
Incorporating Scalable Systems interfaces

Science Appliance
BProc-based clusters compatible with the Scalable Systems Architecture
Supermon monitoring component

To Learn More – Five Project Notebooks
A main notebook for general information And individual notebooks for each working group
• Allows groups to keep track of other groups' progress and comment on the items of overlap
• Allows Center members and interested parties to see what is being defined and implemented
• Over 200 total pages

Get to all notebooks through the main web site
Click the side bar, or "project notebooks" at the bottom of the page

Component Demonstration
System Monitor

Allocation Manager

Event Manager

Service Directory

Process Manager


Queue Manager


Node Monitor

Checkpoint Manager
