Lab Exercise: Condor
Purpose:
During this lab the user will become familiar with using Condor to run jobs while SSHed into the head node of a cluster.
Note: Replace the username and jobID keywords in the following command prompts with your own values.
First: obtain a proxy credential with your username and password, then log in to the Condor head node:

$ myproxy-logon -l username
$ gsissh tg-condor.purdue.teragrid.org
1. b. On TeraGrid, you need to specify a project number in order to submit jobs to Condor. To do that, create a file called .tg_default_project in your home directory and enter the project ID TG-CCR090020 in this file.
$ cd
$ nano .tg_default_project
TG-CCR090020
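Equivalently, you can create the file in a single step with shell redirection (this is just standard shell, not a Condor feature; note that it overwrites any existing .tg_default_project):

$ echo TG-CCR090020 > ~/.tg_default_project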
Displaying Condor Information
$ condor_version
$CondorVersion: 6.7.3 Dec 28 2004 $
$CondorPlatform: I386-LINUX_RH9 $
$
$ condor_status

Name          OpSys    Arch   State    Activity LoadAv Mem  ActvtyTime

vm1@ldas-grid LINUX    INTEL  Owner    Idle     0.040   700 11+06:25:28
vm2@ldas-grid LINUX    INTEL  Owner    Idle     0.000  1200 11+06:25:29
vm1@node10.li LINUX    INTEL  Claimed  Busy     1.000   700  0+00:01:51
[..snip..]
vm2@node8.lig LINUX    INTEL  Claimed  Busy     1.010  1200  0+00:13:00
vm1@node9.lig LINUX    INTEL  Claimed  Busy     1.000   700  0+00:11:42
vm2@node9.lig LINUX    INTEL  Claimed  Busy     1.000  1200  0+00:24:28

              Machines Owner Claimed Unclaimed Matched Preempting

 INTEL/LINUX      138     2     136         0       0          0

       Total      138     2     136         0       0          0
$
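If you only want to see machines that are currently available to run jobs, condor_status also accepts an -avail option; the output uses the same columns as above:

$ condor_status -avail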
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID       OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
61122.0   dbrown  3/7  16:38  0+00:00:00 I  0   16.3 lalapps_inspiral -
61123.0   dbrown  3/7  16:38  0+00:00:00 I  0   16.3 lalapps_inspiral -
61124.0   dbrown  3/7  16:38  0+00:00:00 I  0   16.3 lalapps_inspiral -
[..snip..]
61140.0   kipp    3/7  16:45  0+00:06:35 R  0    2.4 condor_dagman -f -
61141.0   kipp    3/7  16:45  0+00:06:28 R  0    0.0 dagdbUpdator -j 13
61143.0   kipp    3/7  16:45  0+00:06:07 R  0   18.0 lalapps_power --wi

988 jobs; 820 idle, 168 running, 0 held
$
If you're logged into the server and want to see just your jobs, you can specify your username as follows:
$ condor_q username
-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID       OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
28098.0   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.1   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.2   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.3   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.4   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.5   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.6   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.7   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.8   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
28098.9   username  2/24 15:50  0+00:00:00 I  0    0.0 hostname
29105.0   username  2/25 15:44  0+00:00:00 I  0    0.0 condor_simple.sh E

11 jobs; 11 idle, 0 running, 0 held
$
Complete documentation on the condor_q command is available in the Condor manual (http://www.cs.wisc.edu/condor/manual/v6.6/).
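A couple of other condor_q options you may find handy (replace jobID with an ID reported by condor_q):

$ condor_q -long jobID    # print the full ClassAd for one job
$ condor_q -run           # show running jobs and the machines they matched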
$ cd
$ mkdir Condor_lab
$ cd Condor_lab
This new Condor_lab directory should be used to contain any files we create during the remainder of this lab exercise.
6. Create a file called lab.submit and copy the following into it:
Universe   = vanilla
Executable = /bin/hostname
Log        = lab.log
output     = lab.output
error      = lab.error
Queue
Looking at the submit file, you should note several tags. The Executable tag tells Condor the name of the program to run, and the Queue tag tells Condor to submit the job definition to the Condor queue.
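As an aside, submit files can also pass command-line arguments to the executable with an Arguments tag. A minimal sketch, with hypothetical filenames that are not part of this lab:

Universe   = vanilla
Executable = /bin/echo
Arguments  = hello condor
Log        = echo.log
output     = echo.output
error      = echo.error
Queue

For this exercise, though, submit lab.submit exactly as written above.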
$ condor_submit lab.submit
Submitting job(s)...............
1 job(s) submitted to cluster 29.
$
$ condor_q username
-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID       OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E

194 jobs; 40 idle, 154 running, 0 held
$
output = lab.output
The output file will contain the output of the executable. The error file will contain any error output that the program might direct to stderr. The log file is Condor's log of the job. Look at each file in turn.
$ ls -la
total 98
drwxrwxr-x  2 username username 776 Mar  7 16:56 .
drwx------  7 username username 464 Mar  7 16:51 ..
-rw-rw-r--  1 username username  31 Mar  7 16:57 lab.error
-rw-rw-r--  1 username username 150 Mar  7 16:57 lab.log
-rw-rw-r--  1 username username 256 Mar  7 16:57 lab.output
-rw-rw-r--  1 username username 241 Mar  7 16:55 lab.submit
$ cat lab.error
$ cat lab.log
000 (015.000.000) 12/15 10:38:06 Job submitted from host: <141.142.96.174:33149>
...
017 (015.000.000) 12/15 10:38:19 Job submitted to Globus
    RM-Contact: ldas-grid.ligo-la.caltech.edu/jobmanager-condor
    JM-Contact: https://ligo-server.ncsa.uiuc.edu:38307/24309/1103128689/
    Can-Restart-JM: 1
...
001 (015.000.000) 12/15 10:38:19 Job executing on host: ldas-grid.ligo-la.caltech.edu
...
005 (015.000.000) 12/15 10:40:11 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
$ cat lab.output
cms-010.rcac.purdue.edu
$
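Rather than repeatedly polling condor_q, you can also block until a job completes with the condor_wait tool, which watches the job's log file:

$ condor_wait lab.log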
Single Job Submission with Requirements
Create a file: lab2.submit
Copy the following into the file.
Universe     = vanilla
Requirements = Memory >= 320 && OpSys == "LINUX" && Arch == "X86_64"
Executable   = /bin/hostname
Log          = lab2.log
output       = lab2.output
error        = lab2.error
Queue
Requirements for the job are defined by the Requirements tag. In this case we have told Condor that we need a minimum of 320 megabytes of memory, that the operating system has to be Linux, and that the processor needs to be x86_64-based. You can find a full listing of the attributes that can be specified in the Condor manual:
http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html
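Before submitting, you can check how many machines in the pool currently satisfy a Requirements expression by handing the same expression to condor_status (shown here with the expression from lab2.submit):

$ condor_status -constraint 'Memory >= 320 && OpSys == "LINUX" && Arch == "X86_64"'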
$ condor_submit lab2.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 30.
$ condor_q username
-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID       OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  30.0    btest   12/17 09:44  0+00:00:03 R  0    0.0 lab6.sh E

1 jobs; 0 idle, 1 running, 0 held
$
Comprehensive documentation on submitting jobs can be found at http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html
You can ask Condor why a job is still idle by analyzing it against the machines in the pool:

$ condor_q -analyze jobID

 51 reject your job because of their own requirements
478 match but are serving users with a better priority in the pool
928 match but reject the job for unknown reasons
  2 match but will not currently preempt their existing job
 91 are available to run your job
13. Create a new file: lab3.submit, and copy the following to this file:
Universe     = vanilla
Requirements = Memory >= 320 && OpSys == "LINUX" && Arch == "INTEL"
Executable   = /bin/hostname
Log          = lab3.log
output       = lab3.output
error        = lab3.error
Queue
Now, repeat steps 11 and 12 with this new file, and observe the output.
Multiple Job Submissions within the same Submit file

Create a new file called lab4.submit and copy the following into it:
Universe   = vanilla
Executable = /bin/date
Log        = lab4.log
output     = lab4.output.$(Process)
error      = lab4.error.$(Process)
Queue 5
$(Process) is a value that Condor provides to identify each job within a cluster; the jobs queued from a single submit file are numbered 0 through N-1.
The Queue tag tells Condor how many instances of the executable to run; in this case, 5 instances will be queued at once. One thing to keep in mind when telling Condor to run multiple instances of an executable is what will happen to the output. In the submit file above we have appended the process number to the end of each file name, so Condor will create 5 different files, each unique because of that number. If we had not done this, Condor would have used the same file for all 5 processes.
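$(Process) can be used anywhere in the submit file, not just in file names. A minimal sketch, with hypothetical filenames, that also passes the process number to the executable as an argument:

Universe   = vanilla
Executable = /bin/echo
Arguments  = I am job number $(Process)
Log        = sweep.log
output     = sweep.output.$(Process)
error      = sweep.error.$(Process)
Queue 5

This lab, though, uses lab4.submit as written above.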
$ condor_submit lab4.submit
Submitting job(s)...............
5 job(s) submitted to cluster 29.
$
$ condor_q username
-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID       OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E
   9.1    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E
   9.2    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E
   9.3    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E
   9.4    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E

194 jobs; 40 idle, 154 running, 0 held
$
17. After the job has finished, you can check the log file and all 5 output files generated by this job.
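Since the output files share a common prefix, a shell glob will show them all at once:

$ cat lab4.output.*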
Job Submission with Condor-G

Create a file called example.condor and copy the following into it:
# Submissions over the Grid must use the "globus" universe.
universe = globus

# The executable to run. Need the full path. ~/ does not work.
executable = /bin/hostname

# Command-line arguments to the executable.
#arguments = 1 2 3

# false: The executable is already on the remote machine.
# true:  Copy the executable from the local machine to the remote one.
transfer_executable = false

# Where to submit the job. See the "Resources" page for local jobmanagers.
globusscheduler = queenbee.loni-lsu.teragrid.org/jobmanager-pbs

# Filenames for standard output, standard error, and the Condor log.
output = example.out
error  = example.err
log    = example.log

# The following line is always required. It is the command to submit the above.
queue
Notice the universe and globusscheduler keywords.
$ condor_submit example.condor
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 13.
$
$ condor_q username
-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID       OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0    username  3/7  00:02  0+00:00:00 I  0    0.0 lab6.sh E

194 jobs; 40 idle, 154 running, 0 held
$
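For Condor-G jobs, condor_q also has a -globus option that summarizes the Globus-specific job state; this is documented for Condor versions of this era, but verify with condor_q -help on your installation:

$ condor_q -globus username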
21. When the job finishes executing (this may take some time), check the output.
$ cat example.out
Running PBS prologue script
--------------------------------------
User and Job Data:
--------------------------------------
Job ID:    137113.qb2
Username:  skalayci
Group:     tgusers
Date:      20-Jul-2009 20:04
Node:      qb511 (16854)
--------------------------------------
PBS has allocated the following nodes:

qb511

A total of 8 processors on 1 nodes allocated
---------------------------------------------
Check nodes and clean them of stray processes
---------------------------------------------
qb511 starting 1 nodes
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - 20-Jul-2009 20:04:00
------------------------------------------------------
Mon Jul 20 20:04:01 CDT 2009
------------------------------------------------------
Running PBS epilogue script - 20-Jul-2009 20:04:01
------------------------------------------------------
Checking node qb511 (MS)
Checking node qb511 ok
------------------------------------------------------
Concluding PBS epilogue script - 20-Jul-2009 20:04:02
------------------------------------------------------
Exit Status:
Job ID:           137113.qb2
Username:         skalayci
Group:            tgusers
Job Name:         scheduler_pbs_job_script
Session Id:       16853
Resource Limits:  ncpus=1,nodes=1:ppn=8,walltime=12:00:00
Resources Used:   cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:00
Queue Used:       workq
Account String:   TG-CCR090020
Node:             qb511
Process id:       16916
$
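Finally, if a job hangs or you simply want to clear your queue, condor_rm removes jobs (jobID as reported by condor_q):

$ condor_rm jobID      # remove a single job
$ condor_rm username   # remove all of your jobs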