Scalable Systems Software Enabling Technology Center

Progress on Release, API Discussions,
Vote on APIs, and Quarterly Report


               Al Geist
            May 6-7, 2004
             Chicago, IL
                Coordinator: Al Geist
              Participating Organizations

         ORNL      SNL        PSC           Cray
         ANL       LANL       SDSC          Intel
         LBNL      Ames       IBM
         PNNL      NCSA       SGI


How do we position ourselves for the DOE Ultrascale facility
winner to be announced May 12

Regardless of who is chosen we should try to be in a position
to help with the system software needs of the facility.
Scalable Systems Software

Participating Organizations
  ORNL   SNL     IBM     NCSA
  ANL    LANL    Cray    PSC
  LBNL   Ames    Intel   SDSC
  PNNL           SGI

 Problem
  • Computer centers use incompatible, ad hoc sets of systems tools
  • Present tools are not designed to scale to multi-Teraflop systems

 Goals
  • Collectively (with industry) define standard interfaces between
    systems components for interoperability
  • Create scalable, standardized management tools for efficiently
    running our large computing centers

 [Diagram labels: Resource Management; Accounting & user mgmt; System
  Monitoring; System Build & Configure; Job management]

 To learn more visit www.scidac.org/ScalableSystems
Scalable Systems Software Suite

Components written in any mixture of C, C++, Java, Perl, and Python can be
integrated into the Scalable Systems Software Suite.

[Diagram: Scalable Systems Software Suite (updates to this diagram)
  Grid Interfaces / Meta Services: Meta Scheduler, Meta Monitor, Meta Manager
  Components: Accounting, Scheduler, System & Job Monitor, Node State Manager,
    Service Directory, Event Manager, Node Configuration & Build Manager,
    Allocation Management, Usage Reports, Job Queue Manager, Process Manager,
    Packaging & Install, Validation & Testing, Checkpoint / Restart,
    Hardware Infrastructure Manager
  Common to all: Standard XML interfaces, authentication, communication]
Review of Last Meeting
 Scalable Systems Software Center
  January 15-16
  Argonne

  Details in main project notebook
Highlights from Jan. mtg
Craig – the 1280-node dual-Xeon cluster “Titanium” is available this evening
to test the scalability of the SSS suite. One node will be used as the
head node to install our suite and run on the entire cluster.
Could build everything but Bamboo and ssslib, due to Xerces.
The cluster will begin to be available at 6pm.

Late night session on 1280 node testbed
PM ran at 1280, worked at 4000, hung at 6000
Warehouse had a problem at 1280 and took out the head node
RM components ran on the head node OK until Warehouse crashed it

Scott Jackson – Gold running on 11 TF PNNL cluster

Thomas Naughton – 2nd release in March. Discussion of how many
orgs in our group could shake down the tarball. Group feels it is better
to have a few very reliable components than all components.
Highlights from Jan. mtg (cont.)
Rusty Lusk – Process Manager Spec for first vote
Presentation and discussion…
Who is responsible for limit enforcement, PM or QM? I.e.
must use a certain amount of memory, must not execute OS
commands
(in general - things that happen after fork)
Rusty says the question is good and he needs to think about
how this may affect the interface.
Other items to think about
 - use of wildcard as “to be returned” operator – OK
 - Inclusion but don’t show me.
 - Dynamic jobs and PM.
 - improve readability

Delay vote until we have a written proposal.
Highlights from Jan. mtg (cont.)

Discussion of having two XML syntax styles (functional, object)
Al says he would like to see one common style across the suite;
he doesn’t care which one, as long as the whole group can agree.

Narayan – Restriction Syntax Overview. An issue of uniqueness was
brought up, to be taken into consideration by Narayan.

Rusty Lusk – Restriction Syntax on Chiba City
David would like to see a paper on the requirements of the Chiba
effort.

Andrew, Paul, and Craig offer to investigate a prototype translator
to see how / if it is possible.

Investigate standardization of tokens across the two syntaxes
Progress Since Last Meeting
 Scalable Systems Software Center
  January-May
SciDAC PI mtg – March 22-24, 2004

In Charleston, SC, with several
attending for Scalable Systems
2 page project summary report
Annual report for Fred
20 minute talk – presented by Rusty
Fred asked each ISIC to use a new speaker

Poster Presentation – by Stephen/John
Systems Software Suite 2nd Release

Target Date March ’04 – so we could announce it at the
  PI meeting. Real status?

SSS-OSCAR – will hear more in next talk
Need a way to test that the suite is installed correctly
Five Project Notebooks

A main notebook for general information
And individual notebooks for each working group
• Over 300 total pages
• BC and PM groups need to get specs into their notebooks
• Add Telecom meeting notes even if short (Kudos to RM group)



 Get to all notebooks through main web site
 www.scidac.org/ScalableSystems

 Click on side bar or at “project notebooks” at bottom of page
Bi-Weekly Working Group Telecoms
RM notes are the only ones I see in the notebook

Resource management, scheduling, and accounting
  Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”


Process management, monitoring, and checkpointing
  Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910


Node build, configuration, and information service
  Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)
This Meeting
 Scalable Systems Software Center
  May 6-7, 2004
Major Topics this Meeting

Stability of Systems Software Suite – second release is
out. Are we ready for outside users?

Quarterly Report Due – would like to get one to Fred by
end of May. Will need text from WG leaders.

Formal API presentations and voting - we left several
things hanging last meeting

MICS PI Mtg - August 9-12 at Argonne. A good time to
have a highlight of outside user(s)

SC04 Mtg - November in Pittsburgh. Talks? Tutorial? Birds
of a feather?
Agenda – May 6
 8:30 Al Geist – Project Status.
 9:15 Thomas Naughton – SSS OSCAR software suite release
 Working Group Reports
   Progress report on what their group has done
   API Proposals for adoption by the group
   Progress on software suite improvements
 9:30   Narayan Desai – Node Build, Configure
10:30   Break
11:30   Will McClendon – Validation and Testing
12:30   Lunch (on own – cafeteria)
 1:30   Ron Oldfield – ASAP testing, and formalism issues
 2:00   Paul Hargrove – Process Management
        Craig and Rusty
 3:00   Scott Jackson – Resource Management
 4:00   Paul/Craig – findings about trying to build a syntax translator
 4:30   Group Discussion on getting outside users of 2nd release
 5:00   Al – Discussion on SC04, other conferences, papers, etc.
 5:30   Adjourn
Agenda – May 7

  8:30   Discussion, proposals, votes
         Craig – discussion
         Paul – straw vote on two syntax
         Rusty - Process Manager proposal (deferred)
         Scott – Allocation Manager proposal (deferred)
         Al - Quarterly report, papers, SC04, other meetings.

 10:30   Break
 11:00   Al Geist – Release 2 and outside users (Jazz? Ram? NCSA? SNL?)
         MICS PI Mtg August at Argonne (news to come)
         next meeting date: August 26-27, 2004
         location: Argonne

 12:00   meeting ends
  Meeting notes
Al Geist – presents project overview and goals for this meeting

Thomas Naughton – SSS-OSCAR: the tarball contains
Bamboo, BLCR, Gold, LAM/MPI, MAUI-SSS, SSSLib, Warehouse, MPD2
SSSLib contains SD, EM, PM, BCM, NSM, NHw, plus communication
Todo: bug tracker, test sss-oscar-v2a6-v3.0 for pre-release
Documentation – use the SciDAC review 1-pager, add license-sss to directory
Need: a test suite and a few test machines to test on
Discussion on APITest and who creates tests, etc. Each does their own individually.
Establish release schedule through SC04
Add an easier way for authors to “test just their stuff”
SC04 – fully tested release v1.0 with all SSS components
        code freeze Friday, September 3
  Meeting notes
Narayan Desai – Build Configure
Library improvements – bugfixes, testing of Java support, SSL testing
Infrastructure improvements – sss Python library improvements, EM bugfixes
BCM component usage experience
         Hardware infrastructure – still seeking purpose
Restriction Syntax examples given and discussed
         Craig is thankful that !d (don’t display this field) now works
Uniqueness issue – default is to return all duplicates
         new flag “unique=true” to remove duplicates
         much discussion. Rusty suggests removing only duplicate lines
         Paul brings up the problem of “action” commands, i.e. killing jobs twice
Al says the problem is not solvable in general in restriction syntax
Scott asks whether RMAP syntax can handle this.
Much work on the board, and the question of
atomicity of queries which require multiple SQL queries to complete.
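
The uniqueness behavior discussed above can be illustrated with a minimal Python sketch. It assumes query results arrive as a list of rows (tuples); the function name `apply_unique` and the row layout are hypothetical and are not part of the restriction syntax itself. The sketch keeps the default of returning everything, and with the flag set it follows Rusty's suggestion of dropping only rows that are exact duplicates.

```python
def apply_unique(rows, unique=False):
    """Hypothetical helper: return all rows by default; with unique=True,
    drop rows that exactly duplicate an earlier row, preserving order."""
    if not unique:
        return list(rows)
    seen = set()
    result = []
    for row in rows:
        key = tuple(row)  # the whole row is the identity, not a single field
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Illustrative data only: two identical rows for node1.
rows = [("node1", "up"), ("node2", "down"), ("node1", "up")]
print(apply_unique(rows))               # default: duplicates are returned
print(apply_unique(rows, unique=True))  # exact duplicate rows removed
```

Filtering of this kind only covers read-only queries. It does not address Paul's concern about "action" commands, where the same target could still be acted on twice before any de-duplication of results takes place.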
  Meeting notes
Will McClendon – Component Interface Testing
APITest v0.1.2
Now available by FTP, released under the GPL Cplant license
ftp://ftp.sandia.gov/outgoing/apitest (also in notebook)
Not integrated back into ssslib
HTTP interface development
“Twisted Python” framework Info and www.effbot.org
Scott helped find a bug in Python popen3 – now uses Twisted SpawnProcess
Better support for browsing test data within a session
Batch and test data stored in memory in XML file format
           writing out data to file available soon
Shows an XML example that runs a test. Several questions answered.
Shows an XML batch file example.
Runs live demo – works fine. Discussion follows.

Ron Oldfield – replacing Eric DeBenedictis who is moving to other SNL jobs
-ORNL help set up a testing environment
-Testing for correct installation and individual tests, then whole suite test
  Meeting notes
Ron Oldfield (cont) – simulating real workloads
       performance and scalability testing needed in the future
       portability is important for our reference implementation
       discussion of code portability vs. feature portability
       authorization also needs testing
What are the issues in a lightweight OS?
Standard naming conventions for both format and semantics
       someone really needs to go through the existing schemas
       RMAP dictionary makes a good starting point

Paul Hargrove – process management
Development continues on all three components
Syntax translation effort to be discussed later today.
Checkpoint
- pre-emption (suspend and resume) works
- checkpointing (ckpt works, restart in progress)
Todo: migration, checkpoint file management so as not to overflow disks (list, delete)
Query – “can I restart here?”
  Meeting notes
Paul Hargrove – process management (cont)
Suspend/resume works with Bamboo, SD, EM, OM, PM components
Still need to design restart-time interactions with RM group
Open files support under testing
Bug fix releases as needed.
Checkpoint manager outstanding issues
Implement full interface
         using restriction syntax, event generation, error reporting
Must implement file management
         think ls and rm, expiration
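
The "ls and rm, expiration" idea for checkpoint files can be sketched in a few lines of Python. This is an illustration only, not the checkpoint manager's interface: the function names, the flat directory of checkpoint files, and the use of file modification time as the age are all assumptions made for the example.

```python
import os
import tempfile
import time


def list_checkpoints(directory):
    """'ls': names of the regular files in the checkpoint directory, sorted."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def delete_checkpoint(directory, name):
    """'rm': remove one checkpoint file."""
    os.remove(os.path.join(directory, name))


def expire_checkpoints(directory, max_age_seconds, now=None):
    """Expiration: delete files whose modification time is older than
    max_age_seconds and return the names that were removed."""
    now = time.time() if now is None else now
    removed = []
    for name in list_checkpoints(directory):
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Small demonstration in a throwaway directory (hypothetical file names).
    demo_dir = tempfile.mkdtemp()
    for name in ("job1.ckpt", "job2.ckpt"):
        open(os.path.join(demo_dir, name), "w").close()
    print(list_checkpoints(demo_dir))
    delete_checkpoint(demo_dir, "job1.ckpt")
    print(list_checkpoints(demo_dir))
```

An age-based policy is only one option for keeping disks from overflowing; a size quota per job or per user would be an equally plausible design.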

Craig Steffen – no slides
Tried to run on 1280 nodes on Tungsten and failed; did run on 128
Can now run on 1024 nodes. Being stopped by the limit on the number of sockets
Harvesting of other info can now be done, e.g. Myrinet HW
Next: adding support for “job” management
      start interfacing with Build group
      help to get it on Chiba
  Meeting notes
Rusty Lusk – process manager update
PM component – added “limits” interface, dynamic jobs (mpi_comm_spawn)
  can spawn lots of nodes and then use “unused” ones as needed
  show limits spec
MPD2
  improvements found by production use on chiba
  support for limits
  support for mpi_comm_spawn
  interactive debugging via mpigdb – allows control of stdin, stderr, stdout
Future: need to work more closely with QM
  QM interface for requesting dynamic jobs
  Meeting notes
Scott Jackson – resource manager update
Diagram on board: Multi-step job, Job group, Job, Task group, Task (T)
Released SSSRMAPv3 spec
New things
 - wire protocol
 - message format
 - job groups
Latest software release (in OSCAR) uses SSSRMAP v2
Second release of Bamboo in March w/ epilogue and prologue support
Gold now fully SSSRMAP v2 - second alpha release due June
 - which will be in Perl (first release in Java ran into memory size limits)
 - user guide done
 - first release running on PNNL’s SGI Altix
Testing using APITest begun
Silver – several improvements in XML
Future work: implement SSSRMAP v3 in the components
 - merger of Maui 3.2 and SSS. Integrate chkpt/restart. Limit enforcement
 - now SSS affects all Maui users. Ability to handle dynamic jobs
  Meeting notes
Paul – translator report (no slides)
Looked at the two syntaxes to see if we could automate
translation between SSSRMAP and restriction syntax (RS)

Found: SSSRMAP can say 4<proc<16, but RS cannot
RS band-aid – special operators to handle ranges
For multiple-table queries, nested RS syntax doesn’t have the
information (primary data type) to know how to combine
multiple SQL results
There is no way to translate between these cases.

Paul discourages the implementation of a translator.
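
The range case 4<proc<16 can be made concrete with a short Python sketch. It is written in neither SSSRMAP nor restriction syntax; it only shows the semantics a translator would have to preserve, namely that a range is a conjunction of two comparisons on the same field. The condition tuples, the `matches` helper, and the sample job records are all hypothetical.

```python
import operator

# Comparison operators a condition may name (illustrative set).
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def matches(record, conditions):
    """True if the record satisfies every (field, op, value) condition."""
    return all(OPS[op](record[field], value) for field, op, value in conditions)


# 4 < proc < 16 written as two conditions on the same field.
range_query = [("proc", ">", 4), ("proc", "<", 16)]

jobs = [
    {"id": "a", "proc": 2},
    {"id": "b", "proc": 8},
    {"id": "c", "proc": 16},
]
selected = [job["id"] for job in jobs if matches(job, range_query)]
print(selected)  # ['b']
```

A syntax that allows each field to appear only once in a query cannot express the two conditions directly, which is one way to read the need for special range operators noted above.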
  Meeting notes – Day 2
Craig – General thoughts on official V1.0 (no slides)
Released at SC04, this will be the first time many people will see it
Our orthogonal directions in syntax are damaging
If we don’t make a decision soon, project progress towards V1.0 is at risk
Brett, who works with both, favors SSSRMAP
He likes its more descriptive nature and its OO nature.
Rusty says that we need two written proposals for a component
that we can compare and vote on; otherwise we are just all talk.
Paul says that one is better, but two is not too bad.
Scott doesn’t think we can reconcile them
Paul asks for a straw vote on preference; Scott seconds
   SSSRMAP – 7 votes, 5 institutions (but one is Al)
   Restriction Syntax – 3 votes, all ANL
   Abstain – 3 votes, 2 institutions
Craig says he will do whatever it takes to make either work.
   he is going to make ssslib SSSRMAP work
Neil says “users” are the guiding factor and RMAP is better there
Paul says understandability and acceptability are key and RMAP is better
Both say that RS is more compact and elegant.
  Meeting notes – Day 2 (cont)
Narayan – asks whether it just needs documentation and tutorials
Paul says no. There is a closer match for SOAP et al.
  OO was not a factor in his choice, but it is more popular today.
Neil says potential users won’t have a Narayan to figure this out.
Components are both client and server, so the developer has to know the syntax.
Rusty – what if something were added to RS that made it easier to
  use or understand? He is not sure it is a good idea.
Will – documentation is better in RMAP, and he has looked at RMAP more
  Would all this stuff be more abstracted? Users do as little as they can and
  read the manual only after they get stuck. Doesn’t care as long as we pick ONE!
  Need to have the same look and feel across the project.
Rick – I don’t care which. I don’t like XML. What about the SD and EM
  that are already accepted?
Al – says that he feels RMAP would be more acceptable to vendors
 and that this would be critical to the long-term success of the project.

Paul says that Process manager document is not complete enough to vote
on at this time.
  Meeting notes – Day 2 (cont)
Discussion -

				