Scalable Systems Software Enabling Technology Center


Scalable Systems Software for Terascale Computer Centers
Al Geist, Narayan Desai, Rusty Lusk, Brett Bode, Paul Hargrove
SciDAC ISIC Review, March 12, 2003
Research sponsored by MICS Office of DOE

Scalable Systems Software for Terascale Computer Centers
Coordinator: Al Geist
Participating Organizations
Includes DOE Labs, NSF Supercomputer Centers, Vendors
ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, NCSA, PSC, Cray, SGI, SDSC, Intel, HP, IBM, Unlimited Scale
Open to all, like the MPI forum
www.scidac.org/ScalableSystems
Copy of these slides

The Problem Today
System administrators and managers of terascale computer centers are facing a crisis:
• Computer centers use incompatible, ad hoc sets of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Do nothing and …
• Each computer center rewrites its home-grown software
• Redundant effort and delayed availability
• End result leaves the community no better off and science suffers

Example
San Diego Supercomputer Center has no automatic way of having their queue manager interact with their accounting system.
  Date: Tue, 18 Feb 2003 10:22:59 -0800
  From: Donald Frederick <frederik@sdsc.edu>
  To: dubey@tagore.uchicago.edu
  Subject: UChicago ASCI Account on SDSC BH

  Dear Anshu -
  Please ask your colleagues not to submit any new jobs until we can replenish the Chicago ASCI account. Thanks.
  - -Don F

Our Three Goals
• Collectively (with industry) agree on and specify standardized interfaces between system components
  - MPI-like process to promote interoperability, portability, and long-term usability.
• Produce a fully integrated suite of systems software and tools
  - Reference implementation for the management and utilization of terascale computational resources.
• Research and development of more advanced versions of the components
  - To support the scalability, fault tolerance, and performance requirements of large science applications. Up to 10,000 nodes.

Scope of the Effort
• Resource & Queue Management
• Allocation management
• Accounting & user mgmt
• Fault Tolerance
• Security
• System Monitoring
• Checkpoint restart
• System Build & Configure
• Job management

Impact
Fundamentally change the way future high-end systems software is developed and distributed
Reduced facility management costs
• reduce need to support ad hoc software
• better systems tools available
• able to get machines up and running faster and keep running
More effective use of machines by scientific applications
• scalable launch of jobs and checkpoint/restart
• job monitoring and management tools
• allocation management interface

Vision: A Common Integrated Interface Framework
• Vendor optimized highly scalable version
• Easy to swap components
• Common pool of attributes (diagram labels: User, Attribute, Host, database, Node, Job, Allocation, Etc…)
• XML format for attributes
• Standardized request protocol
• Choose an existing transfer protocol: TCP
• Every component uses the same framework

System Software Architecture
[Architecture diagram. Labels: Access control / Security manager (interacts with all components), Meta Scheduler, Meta Monitor, Meta Manager, Accounting, Allocation Management, Scheduler, System Monitor, Node Configuration & Build Manager, User DB, Queue Manager, Job Manager & Monitor, Data Migration, Usage Reports, User utilities, High Performance Communication & I/O, Checkpoint/restart, File System. Legend: Components not in our mission.]

Three Phase Plan
Project is on schedule
18 months
- Agree on and create an initial integrated suite of conforming components (subset of architecture) that will be released and updated
- Specify standardized interfaces between system components through a series of open meetings organized and run like the meetings that defined the MPI standard
24 months
- Research and creation of components that have no existing versions
- Continuing improvements in efficiency and scalability of the most critical system components.
- Explore long-term support for the distribution through the industry participants
18 months
- Starting in 2005, focus on the next generation of computers and the creation of system software components needed for this future technology
- Getting vendors to support the Scalable System Software distribution on their platforms

Project Management
Quarterly Face-to-Face Meetings
• To discuss and vote on interface proposals
Four different Working Groups
1. Node build, configuration, and information service
2. Resource management, scheduling, and allocation
3. Process management, system monitoring, and checkpointing
4. Validation and Integration
Web-based Project Notebooks (over 200 pages and growing)
• A main notebook for general information & mtg notes
• And individual notebooks for each working group
www.scidac.org/ScalableSystems

Voting and Consensus
Same voting rules used by MPI forum
We Have Active Vendor Participation
• In discussing and voting on interface proposals
Sample of proposals voted on:
1. Basic Wire Protocol
2. Cluster BIOS support
3. XML schema formalization
4. Single method monitoring XML interface
5. Authentication support
6. Add Event Manager to architecture
7. Support HTTP protocol
And so on…

Four Working Groups
Overall Leader: Al Geist, ORNL, gst@ornl.gov
• Node build, configuration working group. Leader: Narayan Desai, ANL
• Process management working group. Leader: Paul Hargrove, LBNL
• Resource management working group. Leader: Scott Jackson, PNNL
• Validation and testing working group. Leader: Eric DeBenedictis, SNL
Each working group is composed of members from several organizations
Leaders of the working groups come from across the organizations
Working groups are cross-fertilized with members from other groups

Build and Configure WG
• Hardware manager: manages all physical node services, including basic node identification, hardware inventory functions, power controller access, and management hardware topology information.
• Build and Configuration manager: handles all aspects of software installation and configuration management on all nodes in the system.
• Node state manager: an administrative control panel for the cluster. It keeps track of and allows direct control of all individual node states and availability.
Compatible with OSCAR, Rocks, City, Cplant, and Scyld-like systems

Integrating Components
SSSlib communication library
• Supplies five wire protocols and is extensible
• Supplies basic authentication
• Compatible with C, C++, Java, Perl, and Python components
Service Directory
• Stores location and protocol information for all components
• Provides this information when components need to communicate
Event Manager
• Scalable solution to asynchronous events between components
• Also provides framework for system monitoring

Process Management WG
Process manager
• scalable startup
• runtime communication establishment
• knows where all the processes of a parallel job have been started.
System and Job Monitors
• unified framework for collecting data across systems
• provides real-time state data of components and jobs
• focus on scalability and extensibility into new environments
Checkpoint manager
• Checkpoint: an MPI-LAM or serial job on demand
• Suspend: a job to temporarily cease use of resources
• Migrate: a running job to different (overlapping) nodes

Resource Management WG
Scheduler
• based on Maui with Scalable Systems interfaces
• extensive advance reservation support and policy control
• support for LoadLeveler, PBS, SGE, LSF, and BProc systems
Queue Manager
• maintains information about jobs, both present and past
• handles job submission and deletion in the system
• node setup, data staging, and teardown
Accounting & Allocation manager
• supports management features for accounts, users, machines, allocations, jobs, resources, usage and charging w/ simple GUI
Meta-scheduler
• based on Silver with Scalable Systems interfaces and Grid interfaces
• distribute workload across HPC systems at computer center

Validation and Testing WG
APITEST
• unit-test driver for black-box testing components through their communications interfaces
Coupled-system tester
• whole-system test framework
• applications are run on the machine in various manners for the purpose of finding bugs, and giving confidence that the system will not crash during production use
[Diagram labels: QMTest, Report Results, Interpret, RAW Test Output, Expected Results, Test Packages ...]
• Scriptable test driver for whole-system tests
• Tests defined with XML
• Make http interface
• Matches stdout, stderr, and exit codes

System Software Components
Presently under construction
Strong emphasis on multi-lab cooperation and team effort

Component                        Lead
Meta Scheduler                   D. Jackson
Meta Manager                     S. Scott
Node Manager                     T. Naughton
Package Services                 J. Mugler
System/Job Monitors              M. Showerman
Scheduler                        D. Jackson
Job Manager                      B. Bode
C-Plant XML interface            E. DeBenedictis
Process Manager                  R. Lusk
Checkpoint / Restart             P. Hargrove
Accounting                       S. Jackson
Service Directory                N. Desai
Allocation Management            S. Jackson
Information Services             J.P. Navarro
Queue Manager                    B. Bode
Authentication & Communication   N. Desai, A. Lusk

Legend: Resource Mgmt Working Group, Build & Configure Working Group, Process Mgmt Working Group

Progress on Integrated Suite
[Diagram. In the original slide, working components and interfaces are shown in bold. Labels: Grid Interfaces, Accounting, Meta Scheduler, Meta Services, Meta Monitor, Meta Manager, Scheduler, System & Job Monitor, Service Directory, Node State Manager, Standard XML interfaces, Usage Reports, authentication, communication, Allocation Management, Event Manager, Node Configuration & Build Manager, Job Queue Manager, Checkpoint / Restart, Process Manager, Hardware Infrastructure Manager, Validation & Testing.]
Components written in any mixture of C, C++, Java, Perl, and Python.
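
The Validation and Testing WG slide describes a scriptable test driver whose tests are defined with XML and which matches stdout, stderr, and exit codes. The slides do not show the actual test format, so the following is only a minimal Python sketch of that idea: the element names (`test`, `arg`, `expect`, `stdout`, `stderr`), the `exit` attribute, and the `run_test` function are hypothetical and are not taken from APITEST or QMTest.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Hypothetical test definition; the schema is illustrative only and is
# not the format used by the project's real test tools.
TEST_XML = """
<test name="echo-check">
  <arg>-c</arg>
  <arg>print('hello')</arg>
  <expect exit="0">
    <stdout>hello</stdout>
    <stderr></stderr>
  </expect>
</test>
"""


def run_test(xml_text, program=sys.executable):
    """Run one XML-defined test and compare stdout, stderr, and exit code."""
    test = ET.fromstring(xml_text)
    args = [a.text or "" for a in test.findall("arg")]
    expect = test.find("expect")

    # Launch the program under test and capture both output streams as text.
    proc = subprocess.run([program, *args], capture_output=True, text=True)

    # Each check is recorded separately so a report can say what mismatched.
    checks = {
        "exit": proc.returncode == int(expect.get("exit", "0")),
        "stdout": proc.stdout.strip() == (expect.findtext("stdout") or "").strip(),
        "stderr": proc.stderr.strip() == (expect.findtext("stderr") or "").strip(),
    }
    return test.get("name"), all(checks.values()), checks
```

Calling `run_test(TEST_XML)` runs the current Python interpreter with the listed arguments and returns the test name, an overall pass flag, and the per-check results. Keeping the three comparisons separate mirrors the slide's "stdout, stderr, and exit codes" wording and makes it easy to see which part of a run diverged.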

Interactions and Users
Vendor Interest
• They consider these tools commodity and would like to use ours
• They are asking what license we’ll use – told BSD
• Particular interest in validation & testing framework
Early use in DOE Computer Centers
• Process Manager, Event Manager, and Service Directory in production use on Chiba City at ANL
• XML wrapped Cplant tools used at Sandia
• Allocation Management component under evaluation by CCS at ORNL
• Adrian Wong from NERSC is an advisor to Scalable Systems on the NERSC requirements.

Leveraging Other Work
OSCAR
• Packaging technology
• Incorporating Scalable Systems components
• Large user base to spread the Scalable Systems interfaces
City
• Developing and incorporating Scalable Systems components
• Production scale testing
CPlant
• Micro-kernel clusters
• Validation and testing framework
• Incorporating Scalable Systems interfaces
Science Appliance
• Bproc based clusters compatible with Scalable Systems Architecture
• Supermon monitoring component

To Learn More – Five Project Notebooks
A main notebook for general information
And individual notebooks for each working group
• Allows groups to keep track of other groups' progress and comment on the items of overlap
• Allows Center members and interested parties to see what is being defined and implemented
• Over 200 total pages
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on side bar or on "project notebooks" at bottom of page

Component Demonstration
[Diagram labels: System Monitor, Allocation Manager, Event Manager, Service Directory, Process Manager, Scheduler, Queue Manager, Batch, Node Monitor, Checkpoint Manager]
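
The demonstration includes the Service Directory, which the Integrating Components slide describes as storing location and protocol information for all components and providing it when components need to communicate. As a rough illustration of that role only, here is an in-memory Python sketch that answers lookups with a small XML reply. The class name, method names, XML element and attribute names, and the sample host, port, and protocol values are all assumptions made for this example; they are not the project's real interface or wire format.

```python
import xml.etree.ElementTree as ET


class ServiceDirectory:
    """Toy registry mapping a component name to its location and protocol."""

    def __init__(self):
        self._entries = {}

    def register(self, component, host, port, protocol):
        # Store everything as strings so values can be used as XML attributes.
        self._entries[component] = {
            "host": host,
            "port": str(port),
            "protocol": protocol,
        }

    def lookup_xml(self, component):
        """Return an XML reply: <location .../> if known, <error .../> if not."""
        entry = self._entries.get(component)
        if entry is None:
            reply = ET.Element("error", {"component": component})
        else:
            reply = ET.Element("location", {"component": component, **entry})
        return ET.tostring(reply, encoding="unicode")


# Example: a scheduler registers itself, then another component looks it up.
directory = ServiceDirectory()
directory.register("scheduler", "node0.example", 7000, "basic")
print(directory.lookup_xml("scheduler"))
```

A component that wants to talk to the scheduler would parse the reply and open a connection to the returned host and port using the named protocol. A real directory would also need authentication and a network transport, which this sketch leaves out.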