BOINC and Condor – Scavenging the Scavenger
Derek Wright, firstname.lastname@example.org, UW-Madison Condor Project (www.condorproject.org)

What is Condor?

Condor is a system for managing high-throughput computing: huge numbers of tasks over long periods of time, not short bursts of very fast computation. Condor was started in the early 1980s primarily as a way to scavenge and harness wasted computing cycles on idle desktop workstations. It has evolved over the years into a system for managing all sorts of computing resources (workstations, “big iron” SMP machines, dedicated clusters, computational grids, etc.), data movement, authentication, and more. In many ways, Condor could be seen as an ancestor of BOINC, and they share some common roots at the University of Wisconsin, Madison.

One of the novel features of the Condor system is its generic scheduling based on matchmaking. Every entity in the Condor system advertises itself with a classified ad, much like the ones in the newspaper. Each machine describes itself, what it has to offer, and what it is looking for. Similarly, jobs describe themselves and what kinds of machines they are looking for. Each party can define requirements that must be met, and can rank its preferences across all entities that meet those requirements.

Resource owners define their own policy for when jobs should run, whether they should be suspended or evicted, etc. The machine’s requirements are a central part of this policy: what kinds of jobs should be allowed to run under what conditions.

One part of the Condor system, the negotiator, periodically goes through all the requests and finds machines that match jobs (both sides have to want each other). Because the attributes in the classified ads are dynamic and customizable, as are the policy expressions themselves, this system provides enormous flexibility in scheduling jobs and resources.

Can BOINC and Condor work together?

One of the problems that Condor faces is that there can be periods when there are no jobs in the system. This results in idle (and therefore wasted) resources. One of BOINC’s many strengths is that by participating in multiple projects, there’s always work to be done. To take advantage of this never-ending supply of useful work, we modified Condor to have knowledge of the BOINC client. Administrators can now configure machines such that if there are no Condor jobs available, Condor will hand control over the resource to BOINC.

How does this integration work?

The Condor daemon that manages compute nodes is the startd. The startd enforces the policy for when jobs should run and controls their execution. Each compute slot managed by a startd has a state, and the machine’s policy expressions control when certain state transitions happen.

When the startd has an idle compute slot, it evaluates an expression that controls whether the slot should enter a backfill state. Upon entering this state, the startd spawns the command-line BOINC client (as defined in the Condor configuration file). This client is the standard, unmodified BOINC client, and is configured to join whatever computing projects the machine owner wants. Effectively, the Condor startd serves the same purpose as the BOINC screen-saver: it decides when the resource is otherwise idle, and tells BOINC it should put it to use.

While a Condor resource is in the backfill state, if an interactive user comes back to the machine, or if a Condor job is matched with it, the startd will kill the BOINC client and any processes it has spawned.

In this way, we are now using BOINC to scavenge unused cycles from Condor, itself a scavenger of wasted compute power.
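
The backfill behavior described under “How does this integration work?” can be pictured as a small state machine. The following Python sketch is only an illustration under stated assumptions: it is not Condor code, the state names and the `Slot.step` method are hypothetical, and the real startd evaluates configurable policy expressions rather than plain booleans.

```python
from dataclasses import dataclass
from enum import Enum


class SlotState(Enum):
    # Hypothetical state names, chosen for this illustration only.
    IDLE = "idle"          # nothing is running on the slot
    BACKFILL = "backfill"  # the BOINC client is running
    CLAIMED = "claimed"    # a Condor job has been matched to the slot
    OWNER = "owner"        # an interactive user is using the machine


@dataclass
class Slot:
    state: SlotState = SlotState.IDLE
    boinc_running: bool = False

    def _stop_boinc(self) -> None:
        # Stand-in for killing the BOINC client and any processes it spawned.
        self.boinc_running = False

    def step(self, user_active: bool, job_matched: bool,
             start_backfill: bool) -> SlotState:
        """Advance the slot by one policy evaluation.

        start_backfill stands in for the result of the policy expression
        that decides whether an idle slot may enter the backfill state.
        """
        if user_active or job_matched:
            # A returning user or a matched Condor job always wins over BOINC.
            self._stop_boinc()
            self.state = SlotState.OWNER if user_active else SlotState.CLAIMED
        elif self.state is SlotState.BACKFILL:
            # Already backfilling and nothing else wants the slot: keep going.
            pass
        elif start_backfill:
            # Idle slot and the policy allows backfill: start the BOINC client.
            self.boinc_running = True
            self.state = SlotState.BACKFILL
        else:
            self.state = SlotState.IDLE
        return self.state
```

For example, calling `Slot().step(user_active=False, job_matched=False, start_backfill=True)` moves a fresh slot into the backfill state, and a later call with `job_matched=True` stops the simulated BOINC client and marks the slot as claimed.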
Possible Areas of Future Work

Dynamically sharing compute slots

With dual-core CPUs becoming the norm, multiple compute slots per machine are the wave of the future. Currently, when either BOINC or Condor is running, unless specifically told otherwise in advance, each assumes it should discover all physical compute slots on the machine and control everything. Ideally, Condor would be able to add and remove resources (including RAM) from BOINC’s control, so that instead of killing the BOINC client entirely (all-or-nothing) they could more peacefully co-exist.

Identifying BOINC tasks within Condor

If a BOINC work unit running on a compute slot is about to miss a deadline, and there’s another resource working on a work unit that has a lot of time before it expires, Condor should prefer to preempt the task that’s not going to expire, and allow the one under a tight schedule to complete (if possible). The challenge here is getting data about the expiration date of the current work unit out of the BOINC client and giving it to the startd so that this information can be included in the eviction policy expressions. Condor provides a hook for dynamically adding attributes into the machine classified ads, so this task might involve no changes to the Condor source, and only minor changes to BOINC’s code.

Identifying Condor resources in BOINC

Because Condor compute slots are often more tightly administered and managed than random screensavers on the Internet, these resources could be identified to the BOINC servers as such. This would allow BOINC to more effectively schedule work units as a given computation is nearing completion. A huge computation that is 98% done can be held up for many days, waiting for the last few work units to return their results. By identifying more reliable Condor slots within the BOINC scheduling algorithm, we could shorten these periods of “draining” the jobs.

Condor’s Master-Worker (MW) system

MW is a system built on top of Condor for managing a pool of work. There is a C++ API for defining work tasks, and a master that knows how to claim Condor resources, create a pool of slots, and allocate the work. MW attempts to ensure it has enough slots at all times, will retry a task if it hasn’t gotten an answer back, and takes care of most of the complication for application writers facing similar problems. In these respects, MW is very similar to BOINC.

BOINC and MW could be modified to talk directly to each other, to the benefit of each:
* MW could hand tasks (aka work units) over to BOINC, to gain access to a much wider (though potentially less reliable) pool of resources.
* BOINC could hand work units to MW for an even more guaranteed and reliable way to drain a computation as it nears completion. Instead of running the final work units that are blocking a whole computation as backfill jobs on Condor nodes, these critical work units could run as first-class Condor jobs via MW.

Better packaging and installation

Currently, BOINC must be downloaded, installed, and configured separately from Condor. We’d like to ship a pre-configured copy of the BOINC client directly inside Condor release packages.

Collaborations on portability issues

* We currently run nightly builds of the latest development version of BOINC on our automated build and test infrastructure (http://nmi.cs.wisc.edu), to verify BOINC’s portability on over 20 different platforms.
* The BOINC client and the Condor startd both have to discover similar attributes of the machines they run on (RAM, # of CPUs, keyboard activity, etc.). Condor has a library to abstract these details and provide a single portability layer. Perhaps we could share code and work together to maintain it.
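
The preemption preference described under “Identifying BOINC tasks within Condor” might be sketched as follows. This is a hypothetical illustration, not an existing Condor or BOINC interface: the `BackfillSlot` fields, the notion of “slack”, and the function name are all assumptions, and the text above notes that getting deadline data out of the BOINC client is itself the open problem.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BackfillSlot:
    # All fields are assumed inputs; the source does not define how the
    # startd would obtain them from the BOINC client.
    name: str
    seconds_to_deadline: float  # time left before the work unit expires
    seconds_remaining: float    # estimated compute time still needed

    @property
    def slack(self) -> float:
        # Spare time the work unit has before it would miss its deadline.
        return self.seconds_to_deadline - self.seconds_remaining


def choose_slot_to_preempt(slots: Sequence[BackfillSlot]) -> Optional[BackfillSlot]:
    """Pick the backfill slot whose work unit can best afford an interruption.

    The slot with the most slack is preempted first, so a work unit that is
    close to its deadline is allowed to finish if possible.
    """
    if not slots:
        return None
    return max(slots, key=lambda s: s.slack)
```

With one slot holding a work unit due in an hour and another holding one due in a day, this sketch would select the second slot for the incoming Condor job, leaving the tightly scheduled work unit running.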