Interactive Condor Tutorial
Carey Kireyev
ckireyev@cs.wisc.edu
Boulder ACM Seminar
June 12, 2004
Creating a Grid with Condor
In this tutorial we will walk through the complete process of setting up a real computing Grid with Condor, including:
    Installing the Condor software on your machine
    Starting Condor on your machine
    Connecting to the pool
    Using Condor tools
    Submitting a job and watching it execute remotely
Oct 16, 2004
Interactive Condor Tutorial
2
Creating a Grid with Condor
Requirements:
    Computer(s) running Windows, Linux, or MacOS
    Network / DNS
    Windows users: unless you have SP2, stop the Firewall service:
        Control Panel -> Admin Tools -> Services -> Internet Connection Firewall (ICF) / Internet Connection Sharing (ICS) -> Stop
    Condor software
Downloading Condor
Go to the Condor website:
    http://www.condorproject.org
Click on “Download Condor Software”
Choose “Condor 6.6.7”
    Latest stable release
Subscribe to “condor-world/condor-users”
For UNIX/Linux, pick the dynamic executable (2nd column)
Fill in user information
Pick the correct operating system
Installing Condor on Windows
On Windows XP/2000/NT:
1. Run the downloaded InstallShield executable
2. Enter your name and company name
3. Pick “Joining an existing pool”
4. Enter the name of the central manager: central-manager.gridtutorial.com
5. Answer “yes” to:
    Submit jobs to the Condor pool
    Allow jobs to run on this machine
6. Pick an installation directory, e.g.: C:\Condor
Installing Condor on Windows, cont.
7. Java Universe: Leave blank
    Can always configure later
8. Email/SMTP: Leave blank
9. Domain: gridtutorial.com
10. Access:
    Read (who can query jobs?): *.gridtutorial.com
    Write (who can submit jobs?): *.gridtutorial.com
    Admin (who can reconfigure?): Leave blank
11. Pick “Always run Condor jobs”
12. Pick “Leave job in memory”
13. After installation, go to Control Panel -> Admin Tools -> Services and make sure the Condor service is started
Installing Condor on Linux
1. Untar the tarball:
    tar xzf condor-6.6.7-linux-x86-glibc23-dynamic.tar.gz
2. Install:
    ./condor_configure --install --install-dir=/opt/condor --central-manager=central-manager.gridtutorial.com --type=submit,execute
3. Set $CONDOR_CONFIG to /opt/condor/etc/condor_config
4. Add /opt/condor/bin and /opt/condor/sbin to your $PATH
5. Start Condor:
    /opt/condor/sbin/condor_master
Condor daemons
Look at the processes running on your computer
    Linux: type ps -x
    Windows: Ctrl-Alt-Del -> Task Manager -> Processes
You should see the following processes:
    condor_master: manages other Condor daemons
    condor_schedd: keeps track of your job queue
    condor_startd: runs jobs on your machine
The SchedD allows you to submit jobs; the StartD allows jobs to be executed. You can configure which capabilities you want for each machine in your pool.
Configuration files
Condor has many “knobs”
On Windows:
    C:\Condor\condor_config
    C:\Condor\condor_config.local
On Linux:
    /opt/condor/etc/condor_config
    /opt/condor/local.myhost/condor_config.local
For a pool, the common configuration entries are usually kept in a shared file, and the machine-specific entries in a local file
Logs
Daemons log all their events
Logs are useful for troubleshooting
    On Windows: C:\Condor\log\...
    On Linux: /opt/condor/local.myhost/log/…
Logs rotate automatically after they reach a certain size
Log detail can be changed by config entries, e.g.:
    SCHEDD_DEBUG = D_FULLDEBUG
    See the Condor manual for more debug flags
Using Condor tools
Go to a shell prompt
    Windows: Start -> Run -> cmd
Check your job queue: condor_q
    condor_q is in the Condor bin directory

    C:\>condor_q
    -- Submitter: IBM-F9D9C420761 : <128.105.48.190:3292> : IBM-F9D9C420761
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

    0 jobs; 0 idle, 0 running, 0 held

There are currently no jobs in your queue
    There will be when you submit them
Using Condor tools
Look at your pool: condor_status

    C:\> condor_status
    Name               OpSys  Arch   State    Activity  LoadAv  Mem   ActivityTime
    purple.gridtutori  LINUX  INTEL  Claimed  Busy      0.990   500   0+01:00
    central-manager.g  LINUX  INTEL  Claimed  Busy      1.000   1004  0+00:34
    yellow.gridtutori  WINNT  INTEL  Owner    Idle      0.000   500   0+06:40

This information is collected by the Collector
condor_status has many options
    condor_status -long to see detailed information
    condor_status -total to see totals
Submitting a job to Condor pool
Prepare your job
    Must be able to run in the background: no interactive input, windows, GUI, etc.
    Can still use stdout, stderr, stdin (the keyboard and the screen), but files are used for these instead of the actual devices
    Make sure dynamic libraries are available or don’t use them at all!
    Scripts are OK, but make sure the interpreter is installed!
    Organize data files
Simple Job (Windows)
    @echo off
    echo Hello, I'm a grid job
    echo I am running on a machine:
    hostname
    echo The time is:
    date /T
Save to disk, e.g.:
C:\Condor\submit\my_job.bat
Simple Job (Linux)
    #!/bin/sh
    echo "Hello, I'm a grid job"
    echo "I am running on a machine:"
    hostname
    echo "The time is:"
    date

Save to disk, e.g.:
    /opt/condor/submit/my_job.sh
Make it executable:
    chmod +x my_job.sh
Creating a Submit Description File
A plain ASCII text file
Tells Condor about your job:
    executable
    universe
    input, output and error files
    user log
    command-line arguments
    environment variables
    any special requirements or preferences
    custom attributes
Useful macros for submitting multiple runs of different datasets
Creating a Submit Description File
Let’s create it in:
    C:\Condor\submit\my_job.submit (Windows)
    /opt/condor/submit/my_job.submit (Linux)

    Universe = vanilla
    # On Windows use: Executable = my_job.bat
    Executable = my_job.sh
    Output = my_job.output
    Error = my_job.error
    Log = my_job.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    Queue
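The “useful macros” for multiple runs can be illustrated with a small sketch. The Python helper below is hypothetical (it is not part of Condor); it only writes out a submit description in which Condor's $(Process) macro gives each queued run its own argument and output files. The file names and the run count are assumptions chosen for illustration.

```python
def build_submit_description(executable, runs):
    """Return submit-description text that queues `runs` copies of one job.

    Condor expands $(Process) to 0, 1, ..., runs-1, so every run gets
    its own argument and its own output/error files.
    """
    lines = [
        "Universe = vanilla",
        f"Executable = {executable}",
        "Arguments = $(Process)",
        "Output = my_job.$(Process).output",
        "Error = my_job.$(Process).error",
        "Log = my_job.log",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"Queue {runs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example: three runs of the tutorial's Linux job.
    print(build_submit_description("my_job.sh", 3), end="")
```

Saving the printed text as a submit file and passing it to condor_submit would queue three jobs in one cluster, each writing to its own output file.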
Windows users: storing credentials
Make sure your account is password-protected
Windows users must first run:
    condor_store_cred add
    Enter the password for your account
Why? condor_store_cred stores the password of a user/domain pair securely in the Windows registry. Using this stored password, Condor is able to run jobs with the user ID of the submitting user when running scheduler universe jobs and DAGMan. In addition, Condor uses this password to acquire the submitting user's credentials when writing output or log files. The password is stored in the same manner as the system does when setting or changing account passwords.
Running condor_submit
You give condor_submit the name of the submit file you have created:
    condor_submit my_job.submit
condor_submit does the following:
    parses the file
    checks for errors
    creates a “ClassAd” that describes your job(s)
    sends your job’s ClassAd(s) and executable to the SchedD, which stores the job in its queue
The SchedD reports the job ad(s) to the Central Manager, which tries to match them with a resource
See your job in the queue
condor_q

    C:\>condor_q
    -- Submitter: IBM-F9D9C420761 : <128.105.48.190:3292> : IBM-F9D9C420761
     ID   OWNER     SUBMITTED    RUN_TIME   ST PRI SIZE  CMD
     2.0  ckireyev  10/16 12:59  0+0:00:00  I  0   372.3 my_job.ba

    1 jobs; 1 idle, 0 running, 0 held

You can see your job in the job queue
The job is in the “I” (idle) state: it has not started running yet
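If you want a script to watch the queue rather than reading it by eye, the summary line is the easiest part to pick apart. The sketch below is a hypothetical helper (not a Condor tool) and assumes the summary has the form shown on this slide, "N jobs; N idle, N running, N held".

```python
import re

# Pattern for the summary line printed at the end of condor_q output.
SUMMARY = re.compile(
    r"(?P<jobs>\d+) jobs; (?P<idle>\d+) idle, "
    r"(?P<running>\d+) running, (?P<held>\d+) held"
)


def parse_queue_summary(condor_q_output):
    """Return the job counts from condor_q's summary line, or None."""
    match = SUMMARY.search(condor_q_output)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print(parse_queue_summary("1 jobs; 1 idle, 0 running, 0 held"))
```

In a real script the text would come from running condor_q and capturing its output; that step is left out here so the example runs without a Condor installation.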
See the job’s ClassAd
condor_q -l (long view)

    MyType = "Job"
    Owner = "ckireyev"
    Cmd = "/opt/condor-6.6.7/submit/my_job.sh"
    UserLog = "/opt/condor-6.6.7/submit/my_job.log"
    In = "/dev/null"
    Out = "my_job.output"
    Err = "my_job.error"
    Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
    . . .
Job Requirements
Notice the Requirements clause:
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") …
Automatically generated, to ensure that the job is run on a compatible OS/architecture
User can specify custom requirements in the submit file
    e.g. HasJava == true, Disk > 10000, Name == "m1.grid.com"
    e.g. “Don’t run large executables”
Machines can have requirements/preferences too
Job user log
Specified in submit file.
Optional but highly recommended!
Documents major milestones in job’s lifetime
    [/opt/condor-6.6.7/submit] cat my_job.log
    000 (002.000.000) 10/12 15:50:47 Job submitted from host: <128.105.121.21:45261>
    ...
Well-defined format
Can be “monitored” by a script
Condor can email user log to submitter when job completes or fails
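Because the user log has a well-defined format, a short script can follow a job's milestones. The sketch below is a hypothetical monitor, not a Condor tool; it assumes each event starts with a header line like the ones shown in this tutorial (a three-digit event code, the job ID in parentheses, a date and a time) and it names only the three event codes that appear in these slides.

```python
import re

# Event codes that appear in this tutorial's log excerpts.
EVENT_NAMES = {"000": "submitted", "001": "executing", "005": "terminated"}

# Matches an event header such as:
# 000 (002.000.000) 10/12 15:50:47 Job submitted from host: <...>
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) (?P<text>.*)$"
)


def parse_user_log(text):
    """Return a list of (job_id, event_name, timestamp) tuples."""
    events = []
    for line in text.splitlines():
        match = EVENT_HEADER.match(line)
        if match is None:
            continue  # detail lines and "..." separators are skipped
        job_id = f"{int(match['cluster'])}.{int(match['proc'])}"
        name = EVENT_NAMES.get(match["code"], "other")
        events.append((job_id, name, f"{match['date']} {match['time']}"))
    return events


SAMPLE_LOG = """\
000 (002.000.000) 10/12 15:50:47 Job submitted from host: <128.105.121.21:45261>
...
001 (002.000.000) 10/12 15:55:55 Job executing on host: <128.105.121.21:45260>
...
"""

if __name__ == "__main__":
    for job_id, name, when in parse_user_log(SAMPLE_LOG):
        print(job_id, name, when)
```

A monitoring script could re-read my_job.log periodically and stop once it sees a "terminated" event for the job it is waiting on.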
Meanwhile, the gears are turning…
Meanwhile:
    SchedD reports jobs to Collector
    Negotiator retrieves the jobs from Collector
    … analyzes all the job ClassAds,
    … compares them with all machine ads
    … finds machines to run jobs on (“matchmaking”)
    … notifies SchedDs about StartDs available to them
To see the “matching process” look in the log
e.g. /opt/condor-6.6.7/local./log/NegotiatorLog on Central Manager machine
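The idea behind matchmaking can be shown with a toy sketch. This is not how the Negotiator is implemented: it checks only the job's side of the match, ignores machine requirements and user priorities, and uses plain Python dictionaries and functions in place of ClassAds. The machine names and attribute values are hypothetical.

```python
def find_matches(job_requirements, machine_ads):
    """Return names of machines whose ad satisfies the job's requirements.

    job_requirements is a function taking one machine ad (a dict) and
    returning True or False, standing in for a Requirements expression.
    """
    return [ad["Name"] for ad in machine_ads if job_requirements(ad)]


# Hypothetical machine ads with a few of the attributes seen in condor_status.
MACHINES = [
    {"Name": "m1.grid.com", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 500},
    {"Name": "m2.grid.com", "Arch": "INTEL", "OpSys": "WINNT", "Memory": 500},
]


def linux_intel(ad):
    # Mirrors the expression (Arch == "INTEL") && (OpSys == "LINUX")
    return ad["Arch"] == "INTEL" and ad["OpSys"] == "LINUX"


if __name__ == "__main__":
    print(find_matches(linux_intel, MACHINES))
```

Only the Linux machine satisfies the job here, which is the same kind of filtering the automatically generated Requirements expression performs for a job submitted from a Linux host.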
See job run…
condor_q: notice the job in the “R” state:

    [/opt/condor-6.6.7/submit] condor_q
    ID   OWNER     SUBMITTED    RUN_TIME    ST PRI SIZE  CMD
    2.0  ckireyev  10/16 12:59  0+21:29:33  R  0   372.3 my_job.sh
    1 jobs; 0 idle, 1 running, 0 held

Notice the new event in the user log:

    [/opt/condor-6.6.7/submit] cat my_job.log
    000 (002.000.000) 10/12 15:50:47 Job submitted from host: <128.105.121.21:45261>
    ...
    001 (002.000.000) 10/12 15:55:55 Job executing on host: <128.105.121.21:45260>
    ...
If all goes well…
Eventually you can see that the job is finished:
    [/opt/condor-6.6.7/submit] cat my_job.log
    ...
    005 (002.000.000) 10/12 15:55:55 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
Job output…
Notice the output file (job’s stdout):

    [/opt/condor-6.6.7/submit] cat my_job.output
    Hello, I'm a grid job
    I am running on a machine:
    blue.gridtutorial.com
    The time is:
    Tue Oct 12 16:16:36 CDT 2004

Similarly, the job’s stderr is in my_job.error
If the job had created any other output files during its execution on the remote machine, they would also have been transferred back when the job completed
What if your job doesn’t run?
Ideally, everything works great. But in the Grid world, problems are common.
Your job may not run for many reasons:
    There are no available machines in the pool
        They’re running other Condor jobs or are being used by their owners
        They don’t fit your job’s requirements
    Network communication errors (can’t connect to Central Manager, remote host)
    Authentication errors
    File permission errors
condor_q -analyze
Try condor_q -analyze

    [/opt/condor-6.6.7/submit] condor_q -analyze
    004.000: Run analysis summary. Of 7 machines,
        7 are rejected by your job's requirements
        0 reject your job because of their own requirements
        0 match but are serving users with a better priority in the pool
        0 match but reject the job for unknown reasons
        0 match but will not currently preempt their existing job
        0 are available to run your job
    WARNING: Be advised:
        No resources matched request's constraints
        Check the Requirements expression below:
    Requirements = ((Disk >= 1000000000)) && (Arch == "INTEL") && (OpSys == "LINUX") && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
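One way to read the analysis is to look only at the counts that are not zero, since those say where the machines went. The sketch below is a hypothetical helper, not a Condor tool; it assumes each count sits on its own line in the form "N reason text", as in Condor's output for the example on this slide.

```python
import re

# A count line looks like: "  7 are rejected by your job's requirements"
COUNT_LINE = re.compile(r"^\s*(\d+)\s+(\D.*)$")


def nonzero_reasons(analysis_text):
    """Map each reason with a non-zero machine count to that count."""
    reasons = {}
    for line in analysis_text.splitlines():
        match = COUNT_LINE.match(line)
        if match and int(match.group(1)) > 0:
            reasons[match.group(2).strip()] = int(match.group(1))
    return reasons


SAMPLE = """\
004.000: Run analysis summary. Of 7 machines,
  7 are rejected by your job's requirements
  0 reject your job because of their own requirements
  0 are available to run your job
"""

if __name__ == "__main__":
    print(nonzero_reasons(SAMPLE))
```

Here every machine is rejected by the job's own requirements, which points at the Requirements expression (in this example, the very large Disk value) rather than at the pool.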
Check the SchedD log
Next, try looking at the SchedD log for clues about what the SchedD was able to do with this job
    .../local.myhost/log/SchedLog
If necessary, increase the debug level
    Set SCHEDD_DEBUG = D_FULLDEBUG in the local config file (condor_config.local)
    Run condor_reconfig (to make daemons re-read the config files)
Potential causes:
    SchedD can’t contact Collector
    Can’t authenticate with Collector
    Collector won’t allocate any machines for user
Check the Shadow log
Check the Shadow log for information about what happens after a successful match has been made
    .../local.myhost/log/ShadowLog
    Can increase debug detail with SHADOW_DEBUG = D_FULLDEBUG
Possible causes of problems:
    Shadow can’t connect to the Starter assigned to it
    Shadow can’t read the executable or input file(s)
If all else fails…
Check Condor manual
http://www.cs.wisc.edu/condor/manual/v6.6
Post your questions on the condor-users mailing list
Send your question/problem to our support system:
    condor-admin@cs.wisc.edu
    Question will be answered by a Condor developer
    Paid “VIP” support also available