									What’s new in Condor?
 Condor Week 2006

         Todd Tannenbaum
  Computer Sciences Department
  University of Wisconsin-Madison
     condor-admin@cs.wisc.edu
   http://www.cs.wisc.edu/condor
So Todd… where is v6.8?

 Well, v6.7 has been a
       challenge…




                          2
3
                                   Changes Per Condor Version
[Chart: counts of Bugs Fixed and New Features (vertical axis 0 to 60) per release: 6.7.19, 6.7.16, 6.7.13, 6.7.10, 6.7.7, 6.7.3, 6.7.0, 6.6.10, 6.6.7, 6.6.4, 6.6.1, 6.5.4, 6.5.1, 6.4.7, 6.4.2, 6.3.3, 6.3.0, 6.2.0]

                                                                                                                                               4
 Around since the 80’s




80’s Mullet Boy

                         6
100 people surveyed!
  Favorite “ility” ?


                Deployability!




                          8
                   Existing Ports
• Digital UNIX 4.0        Alpha
• AIX 5.2 (clipped) PowerPC
• Tru64 5.1 (clipped)      Alpha
• HP UNIX 10.20 PA RISC
• HP UNIX 11.00 (clipped using hpux10.20 32 bit) PA RISC
• Irix 6.5 (clipped) SGI
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) Alpha
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 Intel x86
• Linux 2.4.x (glibc 2.2) - Red Hat 8    Intel x86
• Linux 2.4.x (glibc 2.3) - Red Hat 9    Intel x86
• Enterprise Server 8.1 Intel Itanium
• Solaris 8      Sparc
• Solaris 9      Sparc
• Microsoft Windows 2000 or XP (clipped) Intel x86



                                                                    9
                      New Ports
› Introduced in v6.6.x
       MacOSX (“clipped”) PowerPC
       Debian Linux 3.1 Intel x86         Sigh…
       Fedora Core 1 Intel x86
       Red Hat Enterprise Linux 3 Intel x86
       SuSE Linux Enterprise Server 8.1 Intel Itanium
› Introduced in v6.7.x
       AIX 5.1 (“clipped”) PowerPC
       Fedora Core 2 on x86
       Fedora Core 3 on x86
       SuSE 8.0 (“clipped”) on AMD64
       Solaris 10 (“clipped”) on Sparc
       Scientific Linux (Release 303) on x86
  “Psilord”: the Condor porting doctor. Talk to him in person tomorrow.
› Still to be introduced in v6.7.x (before
    v6.8.0)
     HPUX 11i 64-bit pa-risc
     RHEL 4 on x86
     “native” 64 bit AMD Linux


                                                                    10
                 Porting Table
› See
 http://www.cs.wisc.edu/condor/porting/port_table.html

› Highlights
    Almost every 32-bit Linux flavor as “full”
    Every other Unix, MacOS and Windows available as “clipped”
    Solaris 10 and HP-UX 11.x now “clipped”
    FreeBSD 4 contribution from Yahoo!, added 5 and 6
    X86_64 Linux: “full” running in the lab



                                                            11
                     Backfill Jobs
 › Execute machines will run a locally
     staged executable when otherwise idle.
 ›   Currently designed for BOINC.
# Turn on backfill functionality, and use BOINC
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC
# Spawn a backfill job if we've been Unclaimed for more than 5 minutes
START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
# Evict a backfill job if the machine is busy (based on keyboard
# activity or cpu load)
EVICT_BACKFILL = $(MachineBusy)



                                                                         12
        Joining Condor’s
    Einstein@Home Compute
              Team
› If you’re running BOINC backfill jobs in
  Condor and want to use your cycles to
  help another UW project, please join the
  Einstein@Home computation
› Join the “Condor Backfill” team:
  http://einstein.phys.uwm.edu/team_display.php?teamid=5994
  http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994
                                           13
      More “deployability”
› “Personal” Condor Support on Win32
   LocalSystem not required
› MSI installer on Win32 (thanks Micron!)
› New tools
   condor_cold_start and condor_cold_stop
   Safe, dynamic Condor service deployment.
   More info @ Research BOF 9am Rm219


                                             14
100 people surveyed!
  Favorite “ility” ?


                 Availability!




                           16
                 Condor with
             Firewalls and NATs:
                GCB in v6.8.0!
[Diagram: a client app (connect) and a server app (listen, accept) each sit on a GCB layer above TCP/IP; the diagram also labels a “translate” step and a relay point between the two sides.]




                                                              17
    Job Progress continues if
    connection is interrupted
› Now for Vanilla, Java, and Grid universe jobs,
  Condor supports reestablishment of the
  connection between the submitting and executing
  machines.
    After a network outage between the execute and submit machines
    After the submit machine restarts
    Grid Universe was tricky…
› To take advantage of this feature, put the
  following line into your job’s submit description
  file:
     job_lease_duration = <N seconds>
For example:
      job_lease_duration = 1200
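
The lease idea behind this setting can be sketched in a few lines of Python. This is only an illustration of the general lease pattern, not Condor's implementation: the function name `lease_is_valid` and the single renewal timestamp are assumptions made for the example.

```python
def lease_is_valid(last_renewal: float, now: float, lease_duration: float) -> bool:
    """Return True while a lease renewed at last_renewal has not yet run out."""
    return (now - last_renewal) <= lease_duration


# Hypothetical timeline, in seconds, for a job submitted with a 1200 s lease.
LEASE = 1200
last_renewal = 0.0

# 10 minutes without contact: the lease still holds, so in this sketch
# the job keeps running and the connection can be reestablished.
print(lease_is_valid(last_renewal, now=600.0, lease_duration=LEASE))

# 30 minutes without contact: the lease has run out in this sketch.
print(lease_is_valid(last_renewal, now=1800.0, lease_duration=LEASE))
```

A longer lease tolerates longer outages, at the cost of waiting longer before a job is treated as orphaned.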

                                                            18
   Job Progress continues if
     submit machine fails
› Condor can now support a submit
 machine “hot spare” (schedd failover)
  If your submit machine A is down for
   longer than N minutes, a second machine
   B can take over
  Requires shared filesystem between
   machines A and B




                                          19
  Central Manager Failover
› Condor Central Manager has two services
› condor_collector
   Now a list of collectors is supported
› condor_negotiator (matchmaker)
   If it fails, an election process picks another to take over
   Accounting state is periodically replicated
   Contributed technology from Technion




                                                     20
        Reliability, cont.
› Time shifts
› Quill
› Closing windows of vulnerability




                                     21
100 people surveyed!
  Favorite “ility” ?


                 Lightweight?




                         23
100 people surveyed!
  Favorite “ility” ?


                Functionality!




                          26
                Security
› Common Authentication Methods
 between Condor on Unix and Win32
  Kerberos 1.4
    • Additional hopeful benefit: Authentication
      against MS Active Directory!
  SSL
  Password (shared secret)
› Starter only runs known executables
› More powerful, unified map file(s)
› GSI credentials delegated
                                                   27
    With Condor on Win32, it would be
             nice if …
› My jobs could access my files just like the
    condor_shadow can
›   I didn’t have to tie my execute machines to
    a single account
›   I didn’t have to run condor_store_cred
    from every machine where my credential is
    needed
          (thank you Optena)


                                                28
        The Windows CredD
› A centralized repository
 for user passwords

 C:\>condor_store_cred add
 Account: gquinn@CROW
 Enter password:

 Operation succeeded.

[Diagram: condor_store_cred sends a “store password” request with <password> to the credd, which holds the stored passwords (myp4sswd, y0urs).]




                                                     29
The Windows CredD

[Diagram: schedd and shadow on the submit machine; the schedd sends “fetch password” to the credd (holding myp4sswd, y0urs) and receives <password>.]

Submit machines can use the CredD to impersonate the user in the shadow




                                            30
   The Windows CredD

[Diagram: starter and condor_exec.exe on the execute machine; the starter sends “fetch password” to the credd (holding myp4sswd, y0urs) and receives <password>.]

Execute machines can use the CredD to run jobs as the submitting user!




                                                 31
Running Jobs as Submitting User
› In submit file:
  Run_job_as_owner = true
› In config file on submit and execute
 nodes:
  CREDD_HOST = vault.cs.wisc.edu

  STARTER_ALLOW_RUNAS_OWNER = True

  CREDD_CACHE_LOCALLY = True



                                         32
         Some Condor APIs
› Command Line tools
     condor_submit, condor_q, etc
     -format, -constraint, -xml
›   Condor Perl Module
›   Chirp
›   Checkpoint Library API
›   MW --- improved!
›   DRMAA (Works w/ Win32, on SourceForge)
›   Condor Grid ASCII Protocol (GAHP)
›   Web Service Interface



                                             33
                     DRMAA
› Distributed Resource Management
 Application API (DRMAA)
   GGF Working Group
   An API specification for the submission and
    control of jobs to one or more Distributed
    Resource Management (DRM) systems
› An API with C and Java bindings
   not a protocol
› Scope
   Does: job submission, monitoring, control, final
    status
   Does not: file staging, reservations, security, …

                                                       34
              Condor GAHP

› The Condor GAHP is a relatively low-level protocol
    based on simple ASCII messages through stdin and
    stdout
›   Supports a rich feature set including two-phase
    commits, transactions, and optional asynchronous
    notification of events




                                                   35
   GAHP, cont
Example:

        R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
        S: GRAM_PING 100 vulture.cs.wisc.edu/fork
        R: E
        S: RESULTS
        R: E
        S: COMMANDS
        R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST
  GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE
  QUIT RESULTS VERSION
        S: VERSION
        R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
        S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
        R: S
        S: GRAM_PING 100 vulture.cs.wisc.edu/fork
        R: S
        S: RESULTS
        R: S 0
        S: RESULTS
        R: S 1
        R: 100 0
        S: QUIT
        R: S
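
A client reading replies like those in the transcript above might split them as follows. This Python sketch assumes only what the transcript shows: replies are space-separated fields, the first field is `S` (success) or `E` (error), and a backslash escapes a space inside a field. The helper names `split_gahp_line` and `parse_reply` are hypothetical, and this is not the Condor GAHP client code.

```python
from typing import List, Tuple


def split_gahp_line(line: str) -> List[str]:
    """Split on unescaped spaces; a backslash keeps the next character literal."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_reply(line: str) -> Tuple[bool, List[str]]:
    """Treat the first field as the status flag and return (succeeded, remaining fields)."""
    fields = split_gahp_line(line.strip())
    return fields[0] == "S", fields[1:]


# Replies taken from the transcript above.
print(parse_reply("S 0"))
print(parse_reply("E"))
print(split_gahp_line(r"$GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $"))
```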




                                                           36
    Web Service Interfaces
› SOAP over http or https to
     the Condor daemons
›    Use any language or
     platform (where you can find
     a decent SOAP library)
›   Functionality Exposed
    in current release
     Submit jobs
     Retrieve job output
     Remove/hold/release jobs
     Query machine status (fetch ads from collector)
     Query job status (fetch ads from the schedd)


                                                        37
    Getting machine status via
      SOAP (in Java with Axis)
locator = new CondorCollectorLocator();

collector = locator.getcondorCollector(new
        URL("http://machine:port"));

ads = collector.queryStartdAds("Memory>512");


   Because we give you WSDL information you don’t
        have to write any of these functions.


                                                    38
    More Functionality changes..
› FINALLY, clean/consistent cross-platform quoting
  rules for arguments and environment variables
  (see condor_submit man page)
› Schedd can run HawkEye modules, just like the
  Startd
     Enables monitoring on the submit machine
› condor_history : now faster than a snail, and
    cleans up droppings.
›   DeferralTime, DeferralWindow
     Coordinated starts
› BIND_ALL_INTERFACES in config file
› WANT_REMOTE_IO in job ClassAd

                                                  39
 ClassAd Functions in Condor!
› Conditionals
   IfThenElse(condition,then,else)
› String functions
   strcat(), strcmp(), toUpper(), etc.
› StringList functions
   Example of a “string list” (CSV style)
     • Mylist = “Joe, Jon, Jeff, Jim, Jake”
   StrListContains(), StrListAppend(),
    StrListRemove(), etc.
› Others
   Regular expressions, arithmetic, etc…
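
To make the “string list” idea concrete, here is a rough Python stand-in for the functions named on this slide. It is only a sketch of the intended behaviour on a CSV-style string; the Python names mirror the slide but are hypothetical, and the exact semantics inside Condor (whitespace handling, case sensitivity, argument order) may differ.

```python
from typing import List


def parse_string_list(value: str) -> List[str]:
    """Split a CSV-style string into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def str_list_contains(value: str, item: str) -> bool:
    """True if item is one of the entries in the string list."""
    return item in parse_string_list(value)


def str_list_append(value: str, item: str) -> str:
    """Return a new string list with item added at the end."""
    return ", ".join(parse_string_list(value) + [item])


def str_list_remove(value: str, item: str) -> str:
    """Return a new string list without any entry equal to item."""
    return ", ".join(x for x in parse_string_list(value) if x != item)


def if_then_else(condition: bool, then, otherwise):
    """Stand-in for IfThenElse(condition, then, else)."""
    return then if condition else otherwise


my_list = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(my_list, "Jeff"))
print(str_list_remove(my_list, "Jim"))
print(if_then_else(str_list_contains(my_list, "Todd"), "member", "not a member"))
```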


                                              40
      Accounting Groups and
      Group Quota Support
› Account Group (w/ CORE Feature Animation)
› Account Group Quota (inspiration CDF @ Fermi)
   Sample Problem: Cluster w/ 500 nodes, Chemistry Dept
    purchased 100 of them, Chemistry users must always be
    able to use them
   Could use Machine Rank…
     • but this ties to specific machines
   Or   could use new group support
     •   Each group can be given a quota in config file
     •   Job ads can specify group membership
     •   Group quotas are satisfied first
     •   Accounting by user and by group
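
The “group quotas are satisfied first” rule can be illustrated with a toy allocator for the sample problem above. This is a simplified sketch, not the negotiator's algorithm: the function `allocate`, the group names, and the way leftover nodes are handed out in dictionary order are all assumptions made for the example.

```python
from typing import Dict


def allocate(total_nodes: int, quotas: Dict[str, int], demand: Dict[str, int]) -> Dict[str, int]:
    """Give each group up to its quota first, then hand out what is left."""
    allocation: Dict[str, int] = {}
    remaining = total_nodes

    # Pass 1: satisfy group quotas, capped by what each group actually wants.
    for group, quota in quotas.items():
        granted = min(quota, demand.get(group, 0), remaining)
        allocation[group] = granted
        remaining -= granted

    # Pass 2: leftover nodes go to unmet demand, in dictionary order
    # (a simplification; a real scheduler would use priorities here).
    for group, wanted in demand.items():
        extra = min(wanted - allocation.get(group, 0), remaining)
        allocation[group] = allocation.get(group, 0) + extra
        remaining -= extra

    return allocation


# 500-node cluster, Chemistry guaranteed 100 nodes, both groups oversubscribed.
print(allocate(500, {"chemistry": 100}, {"physics": 600, "chemistry": 150}))
```

Unlike Machine Rank, nothing here names specific machines: the quota is a count, so any 100 nodes can serve Chemistry.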



                                                            41
100 people surveyed!
  Favorite “ility” ?


                Universability!




                          43
               Grid Universe
› With new Grid Universe, always specify a
    ‘gridtype’. So the old “globus” Universe is now
    declared as:
      universe = grid
      gridtype = gt2
›   Other gridtypes?
     GT2 (Globus Toolkit 2)
     GT3 (Globus Toolkit 3.2)        ‘Condor-G’
     GT4 (Globus Toolkit 3.9.5+)
     UNICORE
     Nordugrid
     PBS (OpenPBS, PBSPro – technology from INFN)
     LSF (Platform LSF – technology from INFN)
     CONDOR (thanks gLite!)         ‘Condor-C’

                                                      44
            Other Grid Universe
               improvements
› Condor-G has support for credential refresh via the
    MyProxy Online Credential Management in NMI
     http://grid.ncsa.uiuc.edu/myproxy
     (both GT2 and GT4)
›   GT4 : we start a GridFTP server behind the scenes
     GridFTP server bundled w/ Condor nowadays
› Some functionality present in Condor-G added to
    Condor-C
     Forwarding of refreshed credentials (EGEE)
     GSI authentication support
     Cleaner ClassAd representation (URL)



                                                        45
        Parallel Universe
› Replaces the “MPI” universe
› Allows running arbitrary programs
 that need to gang-schedule multiple
 machines
  MPICH, LAM, …
  FT-MPICH (Seoul National Univ)
  Great for testing environments


                                       46
Hey Jobs! We’re watching you!
› Local Universe
  Just like Scheduler Universe, but there is a condor_starter
  All advantages of the starter

[Diagram: Submit column: schedd, starter, job. Execute column: startd, starter, job. Caption: “Hey, job, behave or else!”]
                                                 47
100 people surveyed!
  Favorite “ility” ?


                 Scalability!




                          49
         Faster Negotiation
› SIGNIFICANT_ATTRIBUTES determined
    automatically
     Job attributes AutoClusterId and
      AutoClusterAttributes
     Rounding of Attributes
› Schedd uses non-blocking TCP connects to the
    startd
›   Negotiator caching
›   Collector Forks for queries
›   More coming…
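
The auto-clustering idea above (group jobs whose significant attributes match, so each group only needs to be considered once) can be sketched as follows. This is an illustration only: the attribute keys `memory_mb` and `opsys`, the round-up-to-a-multiple scheme, and the function names are assumptions, not Condor's actual attributes or rounding rules.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def round_up(value: int, step: int) -> int:
    """Round value up to the next multiple of step."""
    return ((value + step - 1) // step) * step


def auto_cluster(jobs: List[dict], significant: List[str],
                 round_steps: Dict[str, int]) -> Dict[Tuple, List[int]]:
    """Group job ids by their (rounded) significant attribute values."""
    clusters = defaultdict(list)
    for job in jobs:
        key = tuple(
            round_up(job[attr], round_steps[attr]) if attr in round_steps else job[attr]
            for attr in significant
        )
        clusters[key].append(job["id"])
    return dict(clusters)


# Hypothetical job ads: without rounding, jobs 1 and 2 would land in different clusters.
jobs = [
    {"id": 1, "memory_mb": 500, "opsys": "LINUX"},
    {"id": 2, "memory_mb": 510, "opsys": "LINUX"},
    {"id": 3, "memory_mb": 2000, "opsys": "LINUX"},
]
print(auto_cluster(jobs, ["memory_mb", "opsys"], {"memory_mb": 128}))
```

Rounding trades a little precision in the request for fewer distinct clusters, which means fewer match attempts per negotiation cycle.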

                                          50
          Scalability, cont.
› Knobs
   GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE,

   GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE,

   GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE
› One instance of gridmanager handles
  multiple jobs (all from a given user)
› One instance of condor_dagman can run
  multiple dags
  Is the Shadow next?
› Buffered I/O read on schedd restart
 (thanks Yahoo!)
                                             51
                     Quill
› Job ClassAds information mirrored into an RDBMS
› Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility

[Diagram: Master, Startd, …, Schedd, and Quill daemons; labels show the Schedd's Job Queue log, Quill, and an RDBMS holding Queue + History Tables.]



                                                    52
Version 6.9.x




                53
   What’s brewing for after
           v6.8.0?
› More data, data, data
  Stork now distributed in v6.7.x, incl. DAGMan
   support; next it is NeST’s turn.
  NeST to manage Condor spool files, ckpt
   servers
    • GridFTP used to move the bits
  Quill++ and CondorDB goodness
› Virtual Machines (and the future of
 Standard Universe)
  Research BOF w/ Jaeyoung Moon, rm219
   9am
                                          54
             SOAP API
› First focus will be to finish
 interfaces used by all command-line
 tools
  condor_userprio, condor_cod, …
› Explore message-based security
  Ian Alderman’s work w/ signed ClassAd
    attributes


                                           55
     Privilege Separation
› No more root in the Condor daemons!
› Instead, a small component will be
  responsible for privileged operations
› Initial exploratory work w/ GNU
  userv (Cambridge)
› Now focusing on integration w/ glexec
  (gLite / NIKHEF)


                                        56
 “The Year of the Schedd”
› Schedd is juggling too many tasks
   Break it down into smaller pieces, more modular
› Scalability
   All non-blocking I/O
   Hierarchy of schedds
› Schedd-on-the-side
   “Scheduler booster”
   Transform & delegate job classads to different
    grids
   A “job router” for a grid



                                                  57
Thank you!




             58

								