August 2005, Oak Ridge, TN SSS Face-to-face meeting
SSS Deployment using OSCAR
John Mugler, Thomas Naughton & Stephen Scott
Oak Ridge National Laboratory -- U.S. Department of Energy
OSCAR: Cluster Toolkit
• Framework for cluster management
– simplifies installation, configuration and operation
– reduces time/learning curve for cluster build
• requires: pre-installed headnode with a supported Linux distribution
• thereafter: wizard guides user through setup/install of entire cluster
• Package-based framework
– Content: Software + Configuration, Tests, Docs
– Types:
• Core: SIS, C3, Switcher, ODA, OPD, (Support Libs)
• Non-core: selected & third-party
– Access: repositories accessible via OPD/OPDer
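The package breakdown above (content plus core/non-core type) can be sketched as a small data model. This is an illustrative sketch only: the `Package` class, its field names, and `split_by_type` are hypothetical and do not reflect OSCAR's actual package metadata format. The core names are the ones listed on this slide.

```python
from dataclasses import dataclass, field

# Core package names as listed on the slide; anything else is treated as non-core.
CORE_PACKAGES = {"sis", "c3", "switcher", "oda", "opd"}


@dataclass
class Package:
    """Hypothetical model of an OSCAR package (not OSCAR's real metadata format)."""
    name: str
    software: list = field(default_factory=list)        # e.g. RPM file names
    config_scripts: list = field(default_factory=list)  # configuration steps
    tests: list = field(default_factory=list)
    docs: list = field(default_factory=list)

    @property
    def is_core(self) -> bool:
        # Assumption: classification is by name only, using the slide's core list.
        return self.name.lower() in CORE_PACKAGES


def split_by_type(packages):
    """Return (core, non_core) lists of package names, preserving input order."""
    core = [p.name for p in packages if p.is_core]
    non_core = [p.name for p in packages if not p.is_core]
    return core, non_core


packages = [Package("C3"), Package("SIS"), Package("Gold"), Package("BLCR")]
core, non_core = split_by_type(packages)
print(core, non_core)
```

Under this assumption, the SSS components added by sss-oscar would fall on the non-core side alongside other selected and third-party packages.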
OSCAR Wizard
* OSCAR-3.0 release
Using OSCAR for SSS
Problem: Helping users obtain and install SSS software.
Solution: Leverage OSCAR framework to package and distribute the SSS suite, sss-oscar.
sss-oscar: A release of OSCAR containing all SSS software in a single downloadable bundle.
OSCAR-ized SSS Components
• Bamboo – Queue/Job Manager
• BLCR – Berkeley Checkpoint/Restart
• Gold – Accounting & Allocation Management System
• LAM/MPI (w/ BLCR) – Checkpoint/Restart enabled MPI
• MAUI-SSS – Job Scheduler
• SSSLib – SSS Communication library
– Includes: SD, EM, PM, BCM, NSM, NWI
• Warehouse – Distributed System Monitor
• MPD2 – MPI Process Manager
* As of Aug 2005
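As a quick reference, the component list can be captured as a lookup table. The names and roles below are copied from this slide; the dictionary layout and the `find_by_role` helper are hypothetical conveniences, not part of sss-oscar.

```python
# Component names and roles are taken from the slide; the layout is illustrative.
SSS_COMPONENTS = {
    "Bamboo": "Queue/Job Manager",
    "BLCR": "Berkeley Checkpoint/Restart",
    "Gold": "Accounting & Allocation Management System",
    "LAM/MPI (w/ BLCR)": "Checkpoint/Restart enabled MPI",
    "MAUI-SSS": "Job Scheduler",
    "SSSLib": "SSS Communication library",
    "Warehouse": "Distributed System Monitor",
    "MPD2": "MPI Process Manager",
}

# Sub-components bundled in SSSLib, abbreviated as on the slide.
SSSLIB_PARTS = ["SD", "EM", "PM", "BCM", "NSM", "NWI"]


def find_by_role(keyword):
    """Return component names whose role mentions keyword (case-insensitive)."""
    key = keyword.lower()
    return [name for name, role in SSS_COMPONENTS.items() if key in role.lower()]


print(find_by_role("checkpoint"))
```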
Current Status
• Released v1.0 at SC’04
– Based on oscar-3.0 (using Red Hat 9/x86)
– All SSS components represented
• Released v1.1 on July 8, 2005
– Based on oscar-3.0 (using Red Hat 9/x86)
– Misc. fixes and minor package updates
• Working on v1.2
– Based on oscar-4.1 (using Fedora Core 2/x86)
– Primary features being new oscar & newer distro
• Then to v2.0
– Based on oscar-4.x (using Fedora Core 2/x86)
– Note, if v2.0 is the SC’05 release, should this be FC4?
– Focus to be the Less-Restrictive Syntax (LRS) changeover
Goals for SC’05 release
• Release sss-oscar v2.0/2.0.1 at SC’05
• Compatible with oscar-5.0
• Support more current Linux distribution(s)
– Fedora Core 4
• Improved testing
– Supply thorough installation/validation/performance tests
• Documentation
– Specifications for component interfaces (schemas), etc.
• Improve interoperability with standard OSCAR
– Track more closely to SC’05 release
– Post SC’05 – “Package Sets”, SSS OPD Repository
SSS-OSCAR Release Schedule
(updated Aug’05)
Version   SSS Freeze Date   Release Date/Target   Based on OSCAR
v1.0      ???               Nov 10, 2004          v3.0
v1.1      Feb 15            Jul 8, 2005           v3.0
v1.2      Jun 15            July Aug              v4.1
v2.0      Aug 15 26?        Sept                  v4.x
v2.0.1    Oct 15            Nov - SC’05           v5.0
Roadmap
(updated Aug’05)
• 1.2 (frz: jun, rel: jul aug)
– Fedora Core 2 / Pkg rebuild
• BLCR upgrade to linux-2.6
– Improved install/validation tests
– oscar-4.1 opkg modifications (updates)
• Updates to HOWTO as needed
• Simplify XML meta file
– Close (most) open tracker issues
• 2.0 (frz: aug, rel: sep)
– LRS changeover
– Fedora Core 4 / Pkg rebuild
# Do we want to use Fedora Core 2?
# If not FC4, what for SC’05???
– Improved install/validation tests
– Add performance/stress tests?
– oscar-4.x opkg modifications (updates)
• Updates to HOWTO as needed
– Meta-scheduler (Silver)?
• 2.0.1 (frz: oct, rel: nov) [SC’05]
– Any bugfixes/minor updates
• 2.0.2 or 2.1
– SSS oscar-pkg set
TODO / Hackerfest (1)
• SecurID tokens
– Cindy’s working on this…
• Warehouse
– Testing/integration w/ Dave
– Integration of new version into suite release?
• BLCR
– Oops…
• Todd’s tests/analysis
– Look at any particular (directed) testing – Maui, Gold, etc.
– Jobs in queue do not restart (hold state?)
• Gold
– RPMS into sss-oscar tree for 1.2!
• Extend SSS component tests
– Installation, Validation
– Durability/Stress, Performance
TODO / Hackerfest (2)
• Testing v1.2beta
– Work on testing for release
• Testing v2.0 features
– Less-Restrictive-Syntax changeover
• Documentation
– Update OPkg Howto (if needed)
– Update v1.2 release notes
– SSS schemas/component specs.
• SSS-OSCAR Tracker
– clean up / close out bugs
• OSCAR stuffo
– API script ordering via XML
– Draft plan/prototype “Package Sets”
• RH EL 4 system / testing
SSS-OSCAR CVS
• Testing v1.2beta
– Work on testing for release
• Testing v2.0 features
– Less-Restrictive-Syntax (LRS) changeover
Q: Should we create a 1.2 branch now and let folks start checking in LRS changeover stuff during “hackerfest” time?
– How close is v1.2? Looking at this now…
– How active is the LRS work in coming days?
Resources
• ORNL test clusters
– Systems: xtorc-sss, test1
– Access via ORNL SSH Login Server
– Must do reservations/coordinate use (Note, no remote power mgmt)
• SSS-OSCAR Project page
– Hosted at http://sourceforge.net/projects/sss-oscar/
• OSCAR Homepage
– http://www.OpenClusterGroup.org/OSCAR/
– Includes “HOWTO: Create an OSCAR Package” document