Scalable Systems Software Enabling Technology Center


									A View from the Top
   End of Year 1

      Al Geist
    October 10-11
     Houston TX
       Participating Organizations

       Coordinator: Al Geist

ORNL      SNL        PSC         Cray
ANL       LANL       SDSC        Intel
LBNL      Ames       IBM         Unlimited Scale

        Main Web Site
Review of Last Meeting
 Scalable Systems Software Center
   June 13-14
   Houston TX

                                 Details in
                    Main project notebook
Progress Reports at June mtg

Al Geist – working groups, notebooks, telecoms

Working Group Leaders –
  What areas their working group is addressing
  Progress report on what their group has done
  Present problems being addressed
  Next steps for the group
  Discussion items for the larger group to consider

Demonstrations of Prototype Components
  One big inter-component demo

    Slides can be found in Main Notebook page 22
  Consensus and Voting:

Event Manager Proposal:
Much discussion. Revised the proposal to say that Event Management
is an important feature of our Software Suite, independent of whether
it lives in a central component or inside individual components,
and that the proposed tuple API is the initial starting point.
 Passed strawvote 13 for / 0 against / 0 abstain
Proposal to adopt HTTP POST (byte count) as standard:
 Passed strawvote 10 for / 0 against / 1 abstain
Adopt W3C standard for XML signature syntax and processing:
 Long discussion. Decided more discussion is needed before a vote
Bugzilla site now up and running
 Link is on the ScalableSystems home page.
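
As a rough illustration of the HTTP POST (byte count) framing voted on above, the Python sketch below wraps an XML payload in a POST request whose Content-Length header carries the byte count, then recovers the payload on the receiving side. The path `/SSS`, the host name, the function names, and the sample payload are assumptions made for illustration; they are not taken from the project's wire protocol documents.

```python
def frame_message(payload: str, path: str = "/SSS", host: str = "localhost") -> bytes:
    """Wrap an XML payload in an HTTP POST with a byte-count header.

    The path and host defaults are hypothetical placeholders.
    """
    body = payload.encode("utf-8")
    header = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: text/xml; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return header + body


def unframe_message(data: bytes) -> str:
    """Read exactly Content-Length bytes of body from a framed message."""
    head, separator, rest = data.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("header terminator not found")
    length = None
    # Skip the request line, then scan the header lines for the byte count.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    if len(rest) < length:
        raise ValueError("body shorter than declared byte count")
    return rest[:length].decode("utf-8")
```

Because the receiver knows the byte count up front, it can read one complete message from a stream without scanning for a terminator, and any trailing bytes belong to the next message.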
Progress Since Last Meeting
 Scalable Systems Software Center

Five Project Notebooks filling up

A main notebook for general information
And individual notebooks for each working group
• Over 200 total pages – 34 added since last meeting
• A lot of new material in Resource Management
  notebook (way to go)

 Get to all notebooks through main web site

 Click on side bar or at “project notebooks” at bottom of page
Four Bi-weekly Working Group Telecoms
Less talk more work

Resource management, scheduling, and accounting
  Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”

Validation and Testing
  Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157

Process management, system monitoring, and checkpointing
  Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910

Node build, configuration, and information service
  Friday 3:00 pm (Eastern) 1-888-469-1934 mtg code 58145 (changes)
Scalable Systems Integrated Component Demonstration (done June 2002)

[Figure: message flow among prototype components]
  Components: Job Submission Client, Queue Manager, Allocation Manager,
    Local Scheduler, Node Monitor, Process Manager, Discovery Service
  Labeled steps: 1 Submit-Job, 4 Create-Reservation, 6 Exec-Process,
    9 Withdraw-Allocation
  Color key by working group: Resource Management and Accounting;
    Process Management and Monitoring; Node Configuration and
    Build Infrastructure

[Figure: components and their owners, grouped by working group]
  Meta Manager – S. Scott
  Meta Scheduler – D. Jackson
  System/Job Monitors – M. Showerman
  Manager – T. Naughton
  Allocation Management – S. Jackson
  Package Services – J. Mugler
  Job Manager – B. Bode
  Accounting – S. Jackson
  C-Plant XML interface – E. Debenedictis
  Service – N. Desai
  Scheduler – D. Jackson
  Process Manager – R. Lusk
  Information – JP Navaro
  Checkpoint / Restart – P. Hargrove
  Authentication & Communication – R. Lusk
  Queue – B. Bode
  SSSlib – used by all
  Working groups: Build & Configure, Resource Mgmt, Process Mgmt
This Meeting
 Scalable Systems Software Center

                   October 10-11, 2002
SciDAC Booth
SciDAC Systems Poster
SciDAC Systems Poster (2)
Agenda – October 10
  8:00   Breakfast
  8:30   Al Geist – Project Status. Getting ready for SC 2002
  9:00   External Project review – February (start planning)

  Working Group Reports
  9:30 Scott Jackson – Resource Management
 10:30 Break
 11:00 Erik Debenedictis – Validation and Testing
 12:00 Lunch (on own but go somewhere as group)
  1:00 Paul Hargrove – Process Management
  2:00 Narayan Desai – Node Build, Configure
  3:00 Break
  3:30 SC Demos and Hacking
        big multi-component demo
  5:00 Open Discussion
  5:30 Adjourn
  Working groups may wish to get together in evening
Agenda – October 11
  8:00   Breakfast
  8:30   Discussion, proposals, strawvotes
         THANKS to Airport Security Meeting
         for sharing their internet access!
         meatball GUI (who?)
         Chiba City for SC demos (Nov 4?)
         cross group issues
         test packaging?

 10:30   Break
 11:00   Al Geist – Summary
                    SC Booth, demos, theater, software, handout (Brett)
                    February review – reviewers, advisor, talks
                    next meeting date: day before review
 12:00   meeting ends
External SciDAC Review mtg

Late February 2003 – may bubble over to early March
18 month checkup by MICS

Each SciDAC Project is reviewed separately –
Scalable Systems is the only thing on the agenda

Full two days of detailed presentations
   So many of us will have to give presentations

External review panel (different for each ISIC)
   We can suggest names
   Can’t be from our organizations or affiliated
   They will have been given our proposal beforehand
External SciDAC Review metrics

I asked Fred and McGraw about Metrics:

1. How have we helped SciDAC Apps?
    Can we show use in CCS and NERSC and others?
2. Put Advisory Panel into place.
    Apps and Computer Center personnel
    I’ve asked Drake (Climate),
    Mezzacappa (Astro), Bland (CCS), Nichols (Chemistry)
      we need NERSC rep and others?
3. Show short term successes and use
External Review Panel Suggestions

External review panel (different for each ISIC)
   We can suggest names - who?

   Barney McCabe
   Russ Miller
   Bart Miller
   Jose M (IBM)
   Someone from Cray
   Someone from Etnus – John Delsignore
   Someone from Unlimited Scale?
   Walt Ligon
   Andrew Lumsdaine
   Jim Garlick
   Steve Chapin
Meeting Notes

Scott Jackson – rm progress
Scope: queue manager, job manager, scheduler, allocation, & meta
A CCS, NERSC, and Chiba meta-schedule demo would be good
Scheduler: enhance internal scalability to 64K nodes, add support for
  HTTP framing protocol. Qbank security enhanced
  Interface to PBS, LSF, LL for suspend/resume and requeue mgt
Queue Manager: conforms to SSSRMAP XML spec, full wire protocol
  compatibility, new interface to Event Manager
Allocation Manager: survey of 15 sites for requirements. Implemented
  HTTP framing, SHA1-HMAC security working with Qbank/Maui
  reframed bank objects (accounts, users, allocations) as dynamic
  object actions defined in metadata cache
  creation of dynamic web-GUI using PHP and JavaScript
Meta scheduler: interoperates with Grid (Globus), fault tolerance –
  global jobID tracking, scheduler reconnection. Improved user interface
Current issues – job state mgt, data staging, job signaling, job steps
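
The SHA1-HMAC security mentioned in the notes above can be sketched with Python's standard `hmac` module. This is only an illustration of the general technique (a keyed digest computed over the message and checked by the receiver); the function names, the shared key, and the sample message are hypothetical, and how Qbank/Maui actually carry the digest on the wire is not described in these notes.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Return the HMAC-SHA1 digest of a message as a hex string."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def verify(message: bytes, key: bytes, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = sign(message, key)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    shared_key = b"example-shared-secret"            # hypothetical key
    request = b"<Request action='Query'/>"          # hypothetical payload
    tag = sign(request, shared_key)
    print(verify(request, shared_key, tag))          # untampered message
    print(verify(request + b" ", shared_key, tag))   # altered message
```

Both sides must hold the same key; a message altered in transit, or signed with a different key, fails verification.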
Meeting Notes

Scott Jackson – rm progress (cont.)
Next work: prepare for SC demos, scalability testing,
 BIG thing is the v1.0 release of the RM system.
 Documentation, security authentication,
 extend suspend/resume schema beyond what PBS, LL do today
 Discussion of the need for a scalability testbed.
Erik Debenedictis – validation progress
Create machine-independent tests for testing supercomputers
 Infrastructure: QMTest
 Tests (from all sources)
 Value: improved method to execute the “SSS Standard Test body”
Recent Activity – QMTest on SNL SciDAC cluster, test package definition
Will McClendon – test architecture (diagram in slides)
QMTest is scriptable test driver in Python
HTTP based interface – Zope
Running at SNL and PSC
Requires exact match on STDOUT/STDERR
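
The exact-match requirement on STDOUT/STDERR noted above can be illustrated with a small stand-in written against Python's `subprocess` module. This is not QMTest's own API; the helper name and the sample command are assumptions chosen only to show what "exact match" implies for a test.

```python
import subprocess
import sys


def run_exact_match_test(command, expected_stdout, expected_stderr=""):
    """Run a command and pass only if both output streams match exactly."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout == expected_stdout and result.stderr == expected_stderr


if __name__ == "__main__":
    # Hypothetical test command: a tiny Python program standing in for a
    # real system utility.
    cmd = [sys.executable, "-c", "print('nodes up: 4')"]
    print(run_exact_match_test(cmd, "nodes up: 4\n"))  # trailing newline included
    print(run_exact_match_test(cmd, "nodes up: 4"))    # newline omitted
```

Exact comparison is strict: a missing trailing newline or a stray warning on STDERR is enough to fail a test, so tests whose output varies from run to run (timestamps, host names) would need their output normalized first.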
Meeting Notes

Will McClendon – test architecture (cont.)
QMTest Screenshot and discussion of how tests are done.
Raw results need to be interpreted to determine pass or fail
Mike ??? – goes over the “package” details
How to add a test package to the suite – Package File Layout
Will present as a proposal tomorrow
Paul Hargrove – pm group
Progress – prototyping and development continue
 how to interface to something we can’t imagine
 validating schema for process manager
 node monitor schema created
Checkpoint Manager – types
 serial checkpoints (independent but potentially multithreaded), done
 parallel checkpoints (MPI)
 scalable systems XML interfaces
Meeting Notes

Rusty Lusk – process manager (see diagram in his slides)
MPD1 (C) overview – added capabilities required by pmWG
MPD is one prototype for SSS Process Manager
MPD2 (python) diagram in slides for new design
Python about 5X slower with this untuned version
Mike Showerman – system monitoring component
Craig Steffen is full time on this project, plus a student
Using new XML schema defined by
Need to write graphical display that uses this new XML interface
Run a small cluster in NCSA booth with SSS software stack
Discussion – how about an animated meatball diagram
Paul returns – Data migration meatball removed
Next steps – interfaces continue to stabilize chkpt, PM, monitors
  Monitoring data. . . Details need defining
Meeting Notes

Narayan Desai – Build and configure update
Components –
  service directory (solid and on Chiba now),
  event manager completely rewritten, stable XML,
  SSSlib robust (bindings for C++, Java, Python, Perl)
          (wire protocol modules, basic, challenge, http, http-rm)
Build and Config Management (third try at the abstraction)
  cluster HW
  build system (OSCAR module for this one in the works)
  node state manager
Issues – abstraction problems with second try.
  Multiple implementations important to validate abstraction

[Figure: component architecture diagram; refined picture on next slide]
  Meta Services: Meta Scheduler, Meta Monitor, Meta Manager
  Event Manager, Accounting, Scheduler, System & Job Monitor,
  Configuration & Build, Job Queue Manager, Process Manager, User DB,
  Usage Reports, User Utilities, Performance Communication & I/O,
  Checkpoint / Restart, Testing & Validation
  Application Environment
  Blue text – uses ssslib
  Red text – talks ssslib protocol

[Figure: refined component architecture diagram]
  Grid Interfaces
  Meta Services: Meta Scheduler, Meta Monitor, Meta Manager
  Accounting, Scheduler, System & Job Monitor,
  Node Configuration & Build, Allocation, File, User DB,
  Job Queue, Manager, Performance Communication & I/O,
  Checkpoint / Restart, User Utilities
  Service Directory, Event Manager (these interface to all)
  Application Environment
