BOINC and Condor – Scavenging the Scavenger
Derek Wright, wright@cs.wisc.edu, UW-Madison Condor Project (www.condorproject.org)


What is Condor?
Condor is a system for managing high-throughput computing -- huge numbers of tasks over long periods of time, not short bursts of very fast computation. Condor was started in the early 1980s primarily as a way to scavenge and harness wasted computing cycles on idle desktop workstations. It has evolved over the years into a system for managing all sorts of computing resources (workstations, "big iron" SMP machines, dedicated clusters, computational grids, etc.), data movement, authentication, and more. In many ways, Condor could be seen as an ancestor of BOINC, and they share some common roots at the University of Wisconsin, Madison.

One of the novel features of the Condor system is its generic scheduling based on match-making. Every entity in the Condor system advertises itself with a classified ad, much like the ones in the newspaper. Each machine describes itself, what it has to offer, and what it is looking for. Similarly, jobs describe themselves and what kinds of machines they are looking for. Each party can define requirements that must be met, and can rank their preferences across all entities that meet their requirements.
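
For example, the job's half of a match might be described in a condor_submit file along the lines of the following sketch (the executable name is only illustrative; Memory and KFlops are standard attributes from the machine's classified ad, but the values worth asking for depend on the pool):

    universe     = vanilla
    executable   = analyze_data
    # Only match machines running Linux with at least 512 MB of RAM...
    requirements = (OpSys == "LINUX") && (Memory >= 512)
    # ...and among those, prefer the ones with the best benchmark score.
    rank         = KFlops
    queue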

Resource owners define their own policy for when jobs should run, whether they should be suspended or evicted, etc. The machine's requirements are a central part of this policy – what kinds of jobs should be allowed to run under what conditions.
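
This policy is written as configuration expressions that the startd evaluates against the machine's (and the job's) classified ad. A simplified sketch, with arbitrary thresholds, might look like:

    # Only start jobs once the keyboard has been idle for 15 minutes
    # and the machine is not busy with the owner's own work.
    START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
    # Suspend the job as soon as the owner comes back.
    SUSPEND  = (KeyboardIdle < 60)
    CONTINUE = (KeyboardIdle > 5 * 60)
    # Evict the job if it has been suspended for more than ten minutes.
    PREEMPT  = (Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) > 600)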

One part of the Condor system, the negotiator, periodically goes through all the requests and finds machines that match jobs (both sides have to want each other). Because the attributes in the classified ads are dynamic and customizable, as are the policy expressions themselves, this system provides enormous flexibility in scheduling jobs and resources.

Can BOINC and Condor work together?
One of the problems that Condor faces is that there can be periods when there are no jobs in the system. This results in idle (and therefore wasted) resources. One of BOINC's many strengths is that by participating in multiple projects, there's always work to be done. To take advantage of this never-ending supply of useful work, we modified Condor to have knowledge of the BOINC client. Administrators can now configure machines such that if there are no Condor jobs available, Condor will hand control over the resource to BOINC.

How does this integration work?
The Condor daemon that manages compute nodes is the startd. The startd enforces the policy for when jobs should run and controls their execution. Each compute slot managed by a startd has a state, and the machine's policy expressions control when certain state transitions happen.

When the startd has an idle compute slot, it evaluates an expression that controls whether the slot should enter a backfill state. Upon entering this state, the startd spawns the command-line BOINC client (as defined in the Condor configuration file). This client is the standard, unmodified BOINC client, and is configured to join whatever computing projects the machine owner wants. Effectively, the Condor startd serves the same purpose as the BOINC screensaver: it decides when the resource is otherwise idle, and tells BOINC to put it to use.
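
As a rough sketch of what this looks like in the Condor configuration file (the exact knob names and defaults should be checked against the manual for the Condor version in use; the paths and account name below are only placeholders):

    # Let this startd run backfill computations, and use BOINC to do so.
    ENABLE_BACKFILL = TRUE
    BACKFILL_SYSTEM = BOINC

    # Enter the Backfill state whenever the machine would otherwise be
    # willing to run a Condor job; leave it when the owner comes back.
    START_BACKFILL  = $(START)
    EVICT_BACKFILL  = (KeyboardIdle < 60)

    # How to spawn the stock command-line BOINC client.
    BOINC_Executable = /usr/local/bin/boinc_client
    BOINC_InitialDir = /var/lib/boinc
    BOINC_Owner      = boinc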

While a Condor resource is in the backfill state, if an interactive user comes back to the machine, or if a Condor job is matched with it, the startd will kill the BOINC client and any processes it has spawned.

In this way, we are now using BOINC to scavenge unused cycles from Condor, itself a scavenger of wasted compute power.

Possible Areas of Future Work

Dynamically sharing compute slots
With dual-core CPUs becoming the norm, multiple compute slots per machine are the wave of the future. Currently, when either BOINC or Condor is running, unless specifically told in advance, each thinks it should discover all physical compute slots on the machine and control everything. Ideally, Condor would be able to add and remove resources (including RAM) from BOINC's control, so that instead of killing the BOINC client entirely (all-or-nothing) the two could coexist more peacefully.
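
Today the closest approximation is a static split, set up by the administrator in advance: Condor can be told to manage fewer slots than the machine physically has (and BOINC's own preferences can be limited similarly). The value here is arbitrary:

    # Pretend this dual-core machine has only one CPU, leaving the
    # other core for BOINC (or anything else) to use.
    NUM_CPUS = 1

The future work described above would replace this kind of fixed partitioning with a dynamic hand-off of slots between the two systems.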

Identifying BOINC tasks within Condor
If a BOINC work unit running on a compute slot is about to miss a deadline, and there's another resource working on a work unit that has a lot of time before it expires, Condor should prefer to preempt the task that's not going to expire, and allow the one under a tight schedule to complete (if possible). The challenge here is getting data about the expiration date of the current work unit out of the BOINC client and giving it to the startd so that this information can be included in the eviction policy expressions. Condor provides a hook for dynamically adding attributes into the machine classified ads, so this task might involve no changes to the Condor source, and only minor changes to BOINC's code.
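
One way this could be wired up, sketched here with hypothetical pieces (the boinc_deadline_probe script and the BoincDeadline attribute do not exist today, and the cron-style hook syntax varies between Condor versions): a small probe periodically asks the local BOINC client for its earliest report deadline and prints it as a classified-ad attribute, which the startd merges into the slot's ad and which the eviction policy can then consult.

    # Run a (hypothetical) probe every five minutes; it prints a line
    # such as "BoincDeadline = 1167609600" on its standard output.
    STARTD_CRON_JOBLIST = BOINCINFO
    STARTD_CRON_BOINCINFO_EXECUTABLE = /usr/local/libexec/boinc_deadline_probe
    STARTD_CRON_BOINCINFO_PERIOD = 5m

    # Only evict backfill work for a new Condor job if the current work
    # unit still has more than an hour before its (hypothetical) deadline.
    EVICT_BACKFILL = $(EVICT_BACKFILL) && ((BoincDeadline =?= UNDEFINED) || ((BoincDeadline - CurrentTime) > 3600))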

Identifying Condor resources in BOINC
Because Condor compute slots are often more tightly administered and managed than random screensavers on the Internet, these resources could be identified to the BOINC servers as such. This would allow BOINC to more effectively schedule work units as a given computation is nearing completion. A huge computation that is 98% done can be held up for many days, waiting for the last few work units to return their results. By identifying more reliable Condor slots within the BOINC scheduling algorithm, we could shorten these periods of "draining" the jobs.

Condor's Master-Worker (MW) system
MW is a system built on top of Condor for managing a pool of work. There is a C++ API for defining work tasks, and a master that knows how to claim Condor resources, create a pool of slots, and allocate the work. MW attempts to ensure it has enough slots at all times, will retry a task if it hasn't gotten an answer back, and takes care of most of the complication for application writers facing similar problems. In these respects, MW is very similar to BOINC.

BOINC and MW could be modified to talk directly to each other, to the benefit of each:
* MW could hand tasks (aka work units) over to BOINC, to gain access to a much wider (though potentially less reliable) pool of resources.
* BOINC could hand work units to MW for an even more guaranteed and reliable way to drain a computation as it nears completion. Instead of running on Condor nodes as backfill jobs, the final work units that are blocking a whole computation could run as first-class Condor jobs via MW.

Better packaging and installation
Currently, BOINC must be downloaded, installed, and configured separately from Condor. We'd like to ship a pre-configured copy of the BOINC client directly inside Condor release packages.

Collaborations on portability issues
* We currently run nightly builds of the latest development version of BOINC on our automated build and test infrastructure (http://nmi.cs.wisc.edu), to verify BOINC's portability on over 20 different platforms.
* The BOINC client and the Condor startd both have to discover similar attributes of the machines they run on (RAM, # of CPUs, keyboard activity, etc.). Condor has a library to abstract these details and provide a single portability layer. Perhaps we could share code and work together to maintain it.
