An Introduction to the Computational Grid
Jeff Linderoth
Dept. of Industrial and Systems Engineering Univ. of Wisconsin-Madison linderot@cs.wisc.edu
Steve Wright
Computer Sciences Dept. Univ. of Wisconsin-Madison swright@cs.wisc.edu
Second International Conference on Continuous Optimization McMaster University Hamilton, Ontario, Canada
Linderoth (UW-Madison) An Introduction to the Computational Grid ICCOPT II
Outline
What is “The Grid?”
Grid Software: Condor
Large-scale Grid resources: Teragrid, Open Science Grid
Using Condor – Some Hands on Demos
The Grid
Richard Dawson Rules!
Come on Let’s Play the Feud
“100 People Surveyed. Top 5 answers are on the board. Here’s the question...”
Name one common use of the Internet
The Big Board
1 email
2 Looking up answers to homework problems
3 YouTube
4 Updating personal information at myspace
5 Looking at pictures of Anna Kournikova
Strike!
Doing Computations
Building a Grid
People envision a “Computational Grid” much like the national power grid
Users can seamlessly draw computational power whenever they need it
Many resources can be brought together to solve very large problems
Gives application experts the ability to solve problems of unprecedented scope and complexity, or to study problems which they otherwise would not.
Large funded initiative in the US: NSF Office of Cyberinfrastructure
Types of Grids
Computational grids
Focus on computationally-intensive operations.
This includes CPU Scavenging Grids, which are our focus today
Data grids
Help control, share, and manage large quantities of (distributed) data
Equipment grids
Associated with a piece of expensive equipment (telescope, earthquake shake table, advanced photon source)
Grid software used to access and control equipment remotely
Access grid
Used to support group-to-group interactions
Consists of multimedia large-format displays, presentation and interactive environments, interfaces to Grid middleware and visualization environments.
Grid Contrasts
(Source: IBM Web Site)
Grid Vs. Web
Like the Web, the Grid keeps complexity hidden: multiple users enjoy a single, unified experience.
Unlike the Web, which mainly enables communication, grid computing enables full collaboration toward common business or scientific goals.
Grid Vs. P2P
Like peer-to-peer, grid computing allows users to share files.
Unlike peer-to-peer, grid computing allows many-to-many sharing of not only files but other resources as well.
Grid Vs. Clusters
Like clusters and distributed computing, grids bring computing resources together.
Unlike clusters and distributed computing, which need physical proximity and operating homogeneity, grids can be geographically distributed and heterogeneous.
Grid Vs. Virtualization
Like virtualization technologies, grid computing enables the virtualization of IT resources.
Unlike virtualization technologies, which virtualize a single system, grid computing enables the virtualization of vast and disparate IT resources.
This ain’t easy!
User access and security
Who should be allowed to tap in?
Interfaces
How should they tap in?
Heterogeneity
Different hardware, operating systems, and software
Dynamic
Participating Grid resources may come and go
Fault-Tolerance is very important!
Communicationally challenged
Machines may be very far apart ⇒ slow communication.
Grid Computing Tools: Globus
Globus: Widely-used grid computing toolkit
Globus Services/Libraries: Security, Information infrastructure, Resource management, Data management, Communication, Fault detection, Portability.
It is packaged as a set of components that can be used either independently or together to develop applications.
Building a Grid
Even with wonderful tools like Globus providing these services, there is still a fundamental obstacle to creating computational grids available to all scientists
GREED!
Most people don’t want to contribute “their” machine!
How to induce people to contribute their machine to the Grid?
Screensaver – BOINC, seti@home
Social Welfare – fightaids@home
Offer frequent flyer miles – company went bankrupt
Let the people keep control over their machine
Give donors a chance to use the Grid
Condor
Condor
Peter Couvares, Alan DeSmet, Peter Keller, Miron Livny, Erik Paulsen, Marvin Solomon, Todd Tannenbaum, Greg Thain, Derek Wright
http://www.cs.wisc.edu/condor
Condor:
www.cs.wisc.edu/condor
Manages collections of “distributively owned” workstations
User need not have an account or access to the machine
Workstation owner specifies conditions under which jobs are allowed to run
All jobs are scheduled and “fairly” allocated among the pool
How does it do this?
Scheduling/Matchmaking
Jobs can be checkpointed and migrated
Remote system calls provide the originating machine’s environment
Matchmaking
    MyType = Job
    TargetType = Machine
    Owner = ferris
    Cmd = cplex
    Args = seymour.d10.mps
    HasCplex = TRUE
    Memory ≥ 64
    Rank = KFlops
    Arch = x86_64
    OpSys = LINUX

    MyType = Machine
    TargetType = Job
    Name = nova9
    HasCplex = TRUE
    Arch = x86_64
    OpSys = LINUX
    Memory = 256
    KFlops = 53997
    RebootedDaily = TRUE
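The two ads above can be read as a job looking for a machine and a machine offering itself to jobs. The following Python sketch illustrates the idea only; it is not Condor's ClassAd language or its negotiator. It assumes the job's HasCplex, Memory, Arch, and OpSys lines act as requirements on the machine and that Rank = KFlops means "prefer faster machines". The machines named hypo-slow and hypo-nocplex are hypothetical, added so the ranking has something to choose from.

```python
# Simplified, one-sided matchmaking over ClassAd-like dictionaries.
# Assumption: the job's HasCplex/Memory/Arch/OpSys lines are requirements
# on the machine, and Rank = KFlops expresses a preference for speed.

job_ad = {
    "MyType": "Job",
    "TargetType": "Machine",
    "Owner": "ferris",
    "Cmd": "cplex",
    "Args": "seymour.d10.mps",
    # Requirements: a predicate evaluated against a machine ad.
    "Requirements": lambda m: (
        m.get("HasCplex") is True
        and m.get("Memory", 0) >= 64
        and m.get("Arch") == "x86_64"
        and m.get("OpSys") == "LINUX"
    ),
    # Rank: larger is better.
    "Rank": lambda m: m.get("KFlops", 0),
}

machine_ads = [
    # The machine ad from the slide.
    {"MyType": "Machine", "TargetType": "Job", "Name": "nova9",
     "HasCplex": True, "Arch": "x86_64", "OpSys": "LINUX",
     "Memory": 256, "KFlops": 53997, "RebootedDaily": True},
    # Hypothetical machines, not from the slide.
    {"MyType": "Machine", "TargetType": "Job", "Name": "hypo-slow",
     "HasCplex": True, "Arch": "x86_64", "OpSys": "LINUX",
     "Memory": 128, "KFlops": 10000},
    {"MyType": "Machine", "TargetType": "Job", "Name": "hypo-nocplex",
     "HasCplex": False, "Arch": "x86_64", "OpSys": "LINUX",
     "Memory": 512, "KFlops": 90000},
]


def match(job, machines):
    """Return the acceptable machine with the highest rank, or None."""
    candidates = [
        m for m in machines
        if m.get("MyType") == job["TargetType"] and job["Requirements"](m)
    ]
    return max(candidates, key=job["Rank"], default=None)


if __name__ == "__main__":
    best = match(job_ad, machine_ads)
    print(best["Name"] if best else "no match")  # prints: nova9
```

Real matchmaking is two-sided: the machine's owner also states conditions a job must meet, which this sketch leaves out.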
Checkpointing/Migration
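[Figure: timeline of a job being checkpointed and migrated among the Professor’s Machine, the Checkpoint Server, and the Grad Student’s Machine. Labeled events: Professor Arrives, Grad Student Arrives, Grad Student Leaves. Labeled times: 5am, 8am, 8:10am, 12pm, with two intervals marked 5 min.]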
Other Condor Features
Pecking Order
Users are assigned priorities based on the number of CPU cycles they have recently used. If someone with higher priority wants a machine, your job will be booted off.
Flocking
Condor jobs can negotiate to run in other Condor pools.
Glide-in
Globus provides a “front-end” to many traditional supercomputing sites.
Submit a Globus job which creates a temporary Condor pool on the supercomputer, on which users’ jobs may run.
Condor + Operations Research
GAMS (www.gams.com) has added Grid Computing Language Extensions
This allows regular GAMS optimization models to be submitted to job schedulers like Condor!
    mymodel.solvelink=3;
    loop(scenario,
        demand=sdemand(scenario);
        cost=scost(scenario);
        solve mymodel min obj using minlp;
        h(scenario)=mymodel.handle);
Ferris and Bussieck use this strategy, in combination with some “manual branching” and the CPLEX MIP solver, to solve three previously unsolved MIPLIB2003 instances “overnight”
More this afternoon!
Condor Daemons
condor_master: Controls all daemons
condor_startd: Controls executing jobs
condor_starter: Helper for starting jobs
condor_schedd: Controls submitted jobs
condor_shadow: Submit-side helper for running jobs
condor_collector: Collects system information; only on Central Manager
condor_negotiator: Assigns jobs to machines; only on Central Manager
A Typical Condor Pool
How Condor Starts Up Your Job
Flocking and Glide-in
Flocking
Collector on central manager shark.ie.lehigh.edu is allowed to negotiate with the central manager of a different pool, condor.cs.wisc.edu
shark’s condor_config: FLOCK_TO = condor.cs.wisc.edu
condor’s condor_config: FLOCK_FROM = shark.ie.lehigh.edu
Beware firewalls! (schedd on submit machine must be able to make direct socket connection to executing machine)
There is a tool, GCB (Generic Connection Broker), that can get around this limitation
Glide-in
Resource request made to gatekeeper
Often on high-performance computing resource
Gatekeeper makes request to batch-scheduled resource.
Personal Condor—A Computational Grid
Grid-Enabling Algorithms
Condor and a growing number of interconnection mechanisms give us the infrastructure from which to build a grid (the spare CPU cycles).
We still need a mechanism for controlling algorithms on a computational grid
No guarantee about how long a processor will be available.
No guarantee about when new processors will become available
To make parallel algorithms dynamically adjustable and fault-tolerant, we could (should?) use the master-worker paradigm
What is the master-worker paradigm, you ask?
More in the next talk!
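As a small preview, here is one common reading of master-worker, sketched in Python as a single-process simulation. It is not the framework from the next talk. The function name run_master, the failure_rate parameter, and the toy task (squaring a number) are all assumptions made for illustration. The point is only that the master owns the task list, so a worker that disappears costs a requeue rather than the whole computation.

```python
import random
from collections import deque


def run_master(tasks, n_workers=4, failure_rate=0.3, seed=0):
    """Simulate a master handing tasks to unreliable workers.

    tasks is a list of (task_id, value) pairs. A "worker" here is just a
    function call that may fail at random, standing in for a machine that
    is reclaimed by its owner. Failed tasks go back on the queue.
    """
    rng = random.Random(seed)
    pending = deque(tasks)
    results = {}
    attempts = 0
    while pending:
        # Hand out at most n_workers tasks in this round.
        batch = [pending.popleft() for _ in range(min(n_workers, len(pending)))]
        for task_id, value in batch:
            attempts += 1
            if rng.random() < failure_rate:
                # Worker vanished before reporting: requeue the task.
                pending.append((task_id, value))
            else:
                # Worker finished: record its result (toy work: squaring).
                results[task_id] = value * value
    return results, attempts


if __name__ == "__main__":
    results, attempts = run_master([(i, i) for i in range(20)])
    print(len(results), "tasks completed in", attempts, "attempts")
```

Because workers hold no state the master cannot recreate, the number of workers can change between rounds without affecting the final results.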
Distributed Resources
The TeraGRID
The Teragrid
http://www.teragrid.org
Consortium of traditional high-performance computing centers
> $150M of NSF funding behind it!
Over 100 TeraFLOPS total CPU power!
Dozens of Petabytes of online and archival storage
30Gbps backbone

Site      #        Type
IU        712      PowerPC, Itanium, Xeon
NCAR      1024     Blue Gene
SDSC      3612     Itanium, Power-4, Blue Gene
NCSA      4381     Itanium, Altix, Xeon
UC/ANL    316      Itanium, Xeon
CACR      104      Itanium
PSC       5248     Alpha
Purdue    5012     Xeon
TACC      5256     Xeon, Ultra-Sparc
          21,284
Open Science Grid
Open Science Grid
A distributed computing infrastructure for large-scale scientific research, built and operated by a consortium of universities and national laboratories
“Virtual Organizations”: Compact Muon Solenoid; CompBioGrid; Genome Analysis and Database Update; Grid Laboratory of Wisconsin; nanoHUB Network for Computational Nanotechnology
Computing Resources: 85 participating institutions; ≈ 25,000 computers; 175 TB of storage
Putting it all together
Distributed Resources
The Teragrid: http://www.teragrid.org
Open Science Grid: http://www.opensciencegrid.org
The Upshot
You can put all of these components together to solve BIG problems in operations research
You can use byproducts (software tools) of this research
We still need to use our OR expertise to engineer the algorithms for the computational platform
Installing and Setting up (Personal) Condor
    ./condor_configure --install=/home/jtl3/tmp/condor-6.8.5/release.tar \
        --install-dir=/home/jtl3/condor \
        --local-dir=/home/jtl3/condor/local \
        --make-personal-condor
Set environment variable CONDOR_CONFIG to point to $HOME/condor/etc/condor_config
Edit CONDOR_CONFIG to have HOSTALLOW_WRITE = *: Anyone can join your pool
    export PATH=$HOME/condor/bin:$HOME/condor/sbin:$PATH
    condor_master
Other CPU Grid Building Tools
Condor is not the only way to build a CPU grid
Free: Sun Grid Engine: http://gridengine.sunsource.net/
Commercial: Platform LSF: www.platform.com
Let’s Run Condor!
My First Condor Job
Let’s Run Condor!
You have been set up with a temporary account at UW-Madison, from which we can run some Condor jobs. These accounts will stay active for a week or so.
To get started, ssh to chopin.cs.wisc.edu, and log in using the usernames and passwords distributed earlier.
If you don’t have ssh and are running Windows, you can get it from
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Now run the following commands, replacing your_name with some unique identifier:
    cd /scratch
    mkdir your_name
    cd your_name
    cp -r /scratch/ICCOPT/* .
    source ./setit
Mmmmmmmmmmmmmm. Pie
Our first computational task will be to estimate π by numerical integration. Everyone knows...
\[ \int_0^1 \frac{1}{1+x^2}\,dx = \arctan(x)\Big|_{x=0}^{1} = \arctan(1) = \frac{\pi}{4}. \]
The Rectangle Rule
[Figure: plot of 4/(1+x*x) for x from 0 to 1; vertical axis from 0 to 4.]
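The rule can be sketched in a few lines of Python. This is an illustration, not the pi1.c used in the demo, whose source is not shown here. The function name estimate_pi and the choice to evaluate each rectangle at its midpoint are assumptions. It integrates 4/(1+x*x) over [0, 1], which equals π by the identity on the previous slide.

```python
import math


def estimate_pi(n):
    """Approximate pi by summing n rectangles under 4/(1+x*x) on [0, 1].

    Each rectangle has width h = 1/n and height equal to the integrand
    evaluated at the midpoint of its base (an assumed variant of the rule).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of the i-th subinterval
        total += 4.0 / (1.0 + x * x)
    return h * total


if __name__ == "__main__":
    n = 1000
    approx = estimate_pi(n)
    print(f"pi is about {approx:.16f}")
    print(f"Error is {abs(approx - math.pi):.3e}")
```

More rectangles give a better estimate at proportionally more work, which is why the argument to the program (the number of rectangles) is a convenient knob for making a job run longer.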
A Program to Estimate π
We’ve written a π-calculator for you.
    cd src
    gcc pi1.c -lm -o pi1
    ./pi1 1000
This is not a parallel program. Just a simple (one process) program.
Condor Universes
Condor jobs run in a specific Condor Universe
Standard: Has cool features like checkpointing and migration of jobs
Requires special linking of your program
Vanilla: No cool Condor features (regular)
MPI/PVM/Java/Grid
Not mentioned here today, but they exist.
Compiling for Condor
Standard Universe
Put the command condor_compile in front of your normal link line.
    [jtl3@fire1 condor]$ condor_compile gcc pi1.c -o pi1-standard -lm
Vanilla Universe
Do nothing
Condor submission is like other resource management software
Describe your job in a job submission file
Submit and monitor your job with command line programs
Sample Condor Submission Files
    universe = standard
    executable = pi1-standard
    arguments = 1000000000
    output = pi1.out
    error = pi1.err
    notification = Never
    notify_user = swright@cs.wisc.edu
    getenv = True
    rank = kflops
    queue

    universe = vanilla
    executable = pi1
    arguments = 666
    output = pi1.out
    requirements = (OpSys != WINNT51)
    error = pi1.err
    getenv = True
    rank = Memory
    queue

man condor_submit
http://www.cs.wisc.edu/condor/manual/v6.8/condor_submit.html
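When the same program has to run with several different arguments, it can be convenient to generate the submission file rather than type it. The Python sketch below is one way to do that, not something from the workshop materials. The function name make_submit and the numbered output file names (pi1.0.out, pi1.1.out, ...) are assumptions. It uses only submit commands that appear in the samples above, and relies on the fact that each queue statement submits one job with the settings in effect at that point.

```python
def make_submit(executable, arg_values, universe="vanilla"):
    """Build a submit description with one job per argument value.

    Settings that are common to all jobs are written once at the top.
    For each argument value, the per-job settings are rewritten and a
    "queue" line submits a job with the settings current at that point.
    Output and error file names are numbered so jobs do not overwrite
    each other (the naming scheme is an assumption for this sketch).
    """
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        "getenv = True",
    ]
    for k, value in enumerate(arg_values):
        lines += [
            f"arguments = {value}",
            f"output = pi1.{k}.out",
            f"error = pi1.{k}.err",
            "queue",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a submit description for three runs with increasing work.
    print(make_submit("pi1", [1000, 1000000, 1000000000]), end="")
```

The printed text can be redirected to a file and handed to condor_submit in the same way as the hand-written samples.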
The Big Four
condor_submit
Submit a job to the Condor scheduler
condor_q
Check the status of the queue of Condor jobs
condor_status
Check the status of the Condor pool
condor_rm
Delete a Condor job
Let’s Do It!
    [jtl3@fire1 condor]$ condor_submit run.condor
    Submitting job(s).
    1 job(s) submitted to cluster 16.
    [jtl3@fire1 condor]$ condor_q

    -- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
     ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
     16.0    jtl3     8/4 11:22   0+00:00:16 R  0   3.4  pi1-standard 1000000000

    [jtl3@fire1 condor]$ cat pi1.out
    pi is about 3.1415926555921398488635532
    Error is 2.0023467328655897e-09

I could do condor_rm 16.0
Any Condor questions?
Parallel Job on Condor: Statistical Bootstrapping
Condor Parallel Example: Statistical Bootstrapping
[Figure: bootstrapping data flow, with labels as they appear in the diagram]
Distribution {z1, z2, z3, z4, z5, ...}
Sample {z2, z2, z5, ...}
Resamp {z2, z5, z7, ...} Analyze
Resamp {z5, z7, z9, ...} Analyze
Resamp {z7, z7, z9, ...} Analyze
Coalesce
Statistical Bootstrapping
driver.m
    dist_size = 100000;
    d = rand(dist_size, 1) .* 500;
    % ceil rather than floor: Octave indices start at 1, and floor can yield 0
    subset = d(ceil(rand(1000,1) .* 100000));
    save "subset" subset;
Driver creates distribution.
Driver creates submit file.
submit
    universe = vanilla
    executable = worker.m
    transfer_files = true
    when_to_transfer_output = on_exit
    transfer_input_files = subset
    output = mean.$(PROCESS)
    log = log
    queue 5
Running the example
Shell prompt
    $ ./driver.m
    Submitting job(s).....
    Logging submit event(s).....
    5 job(s) submitted to cluster 565262.
    5 minutes later...
    All jobs done.
    mean of mean is 161.014978
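For readers who want to see the whole sample, resample, analyze, coalesce pipeline in one place, here is a single-process Python sketch. It is not a translation of the workshop files: worker.m is not shown on these slides, so the assumption that each worker resamples the subset with replacement and reports the resample's mean is a guess based on the output name mean.$(PROCESS). The function names are hypothetical, and the five workers run one after another here rather than as Condor jobs.

```python
import random
import statistics


def make_subset(dist_size=100_000, subset_size=1000, scale=500.0, rng=None):
    """Driver step: build a distribution and draw a sample from it."""
    rng = rng or random.Random()
    dist = [rng.random() * scale for _ in range(dist_size)]
    # Sample with replacement by picking random valid indices.
    return [dist[rng.randrange(dist_size)] for _ in range(subset_size)]


def worker(subset, rng):
    """Assumed worker step: resample with replacement, then analyze (mean)."""
    resample = [rng.choice(subset) for _ in subset]
    return statistics.fmean(resample)


def bootstrap(subset, n_workers=5, seed=0):
    """Coalesce step: average the per-worker means."""
    rng = random.Random(seed)
    means = [worker(subset, rng) for _ in range(n_workers)]
    return statistics.fmean(means)


if __name__ == "__main__":
    subset = make_subset(rng=random.Random(1))
    print(f"mean of mean is {bootstrap(subset):.6f}")
```

Each call to worker depends only on the subset, which is what makes the real version easy to farm out: the submit file ships the same input file to five independent jobs and the driver only has to combine five small outputs.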
Let’s run it!
Go to your directory /scratch/your_name on chopin.cs.wisc.edu, and look at the files submit, driver.m, and worker.m
You’ll see that they contain the material from the previous slides. The two .m files have a line at the start indicating that they are to be run using Octave (a free Matlab-like language available from www.octave.org).
Now run it by typing “./driver.m” on the command line!
driver.m creates the file “subset” and then invokes submit. Five instances of worker.m are submitted into the Condor pool.
You can check on the status of these by typing condor_q (in another window).
When the five jobs are finished, driver.m does the final computation and prints a message to the screen.