Lab Exercise: Condor

Purpose:

During this lab, you will become familiar with using Condor to run jobs while logged in via SSH to the head node of a cluster.

  1. SSH into the Server
  2. Displaying Condor Information
  3. Submitting Local Condor Jobs
  4. Single Job Submission with Requirements
  5. Multiple Job Submissions within the same Submit file
  6. Job Submission with Condor-G

 

SSH into the Server

Note: Replace the username and jobID placeholders in the following commands with your own values.

 

  1. All of the work we will be doing during this lab exercise will be done on the login node of the Purdue Condor cluster.

     a. First, log in to NCSA Abe with your username and password; then, from Abe, obtain a grid proxy and connect to the Purdue Condor login node:

 

$ myproxy-logon -l username

 

$ gsissh tg-condor.purdue.teragrid.org


     b. On TeraGrid, you must specify a project number in order to submit jobs to Condor. To do that, create a file called .tg_default_project in your home directory and enter the keyword TG-CCR090020 in it:

 

$ cd

 

$ nano .tg_default_project

   TG-CCR090020
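If you prefer not to open an editor, the same file can be created with a single shell command and verified with cat:

$ echo "TG-CCR090020" > ~/.tg_default_project

$ cat ~/.tg_default_project
TG-CCR090020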


 

Displaying Condor Information

  2. The condor_version command is a good starting point.

$ condor_version


$CondorVersion: 6.7.3 Dec 28 2004 $

$CondorPlatform: I386-LINUX_RH9 $

$

  3. The condor_status command will show the status of the nodes in the Condor pool.

$ condor_status

 

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

 

vm1@ldas-grid LINUX       INTEL  Owner      Idle       0.040   700  11+06:25:28

vm2@ldas-grid LINUX       INTEL  Owner      Idle       0.000  1200  11+06:25:29

vm1@node10.li LINUX       INTEL  Claimed    Busy       1.000   700  0+00:01:51

[..snip..]

vm2@node8.lig LINUX       INTEL  Claimed    Busy       1.010  1200  0+00:13:00

vm1@node9.lig LINUX       INTEL  Claimed    Busy       1.000   700  0+00:11:42

vm2@node9.lig LINUX       INTEL  Claimed    Busy       1.000  1200  0+00:24:28

 

                     Machines Owner Claimed Unclaimed Matched Preempting

 

         INTEL/LINUX      138     2     136         0       0          0

 

               Total      138     2     136         0       0          0

$
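condor_status accepts optional flags that narrow this listing; a few that may be useful here (the exact output depends on the current state of the pool):

$ condor_status -avail                          # only machines currently willing to run jobs

$ condor_status -constraint 'Arch == "INTEL"'   # only machines matching a ClassAd expression

$ condor_status -total                          # just the summary totals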

  4. The condor_q command will display the job queues.

$ condor_q

 

 

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

61122.0   dbrown          3/7 16:38    0+00:00:00 I  0   16.3 lalapps_inspiral -

61123.0   dbrown          3/7 16:38    0+00:00:00 I  0   16.3 lalapps_inspiral -

61124.0   dbrown          3/7 16:38    0+00:00:00 I  0   16.3 lalapps_inspiral -

[..snip..]

61140.0   kipp            3/7 16:45    0+00:06:35 R  0    2.4 condor_dagman -f -

61141.0   kipp            3/7 16:45    0+00:06:28 R  0    0.0 dagdbUpdator -j 13

61143.0   kipp            3/7 16:45    0+00:06:07 R  0   18.0 lalapps_power --wi

 

988 jobs; 820 idle, 168 running, 0 held

$

If you're logged in to the server and want to see just your jobs, you can specify your username as follows:

$ condor_q username

 

 

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu

 

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

28098.0   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.1   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.2   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.3   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.4   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.5   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.6   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.7   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.8   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

28098.9   username       2/24 15:50    0+00:00:00 I  0    0.0 hostname

29105.0   username       2/25 15:44    0+00:00:00 I  0    0.0 condor_simple.sh E

 

11 jobs; 11 idle, 0 running, 0 held

$

Complete documentation on the condor_q command can be found in the Condor manual at http://www.cs.wisc.edu/condor/manual/v6.6/.
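To see every attribute Condor stores for a particular job, you can ask for its full ClassAd (jobID is the cluster.process number shown in the ID column):

$ condor_q -long jobID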

 

 

Submitting Local Condor Jobs

  5. Start by creating a directory called Condor_lab in your home directory on the server and cd into it:

 

$ cd

 

$ mkdir Condor_lab

 

$ cd Condor_lab

This new Condor_lab directory will hold all of the files we create during the remainder of this lab exercise.

  6. Create a file called lab.submit and copy the following into it:

Universe   = vanilla

Executable = /bin/hostname

Log        = lab.log

output = lab.output

error = lab.error

Queue

Looking at the submit file, you should note several keywords.  The Universe line selects the vanilla universe, Condor's basic environment for running ordinary executables.  The Executable line tells Condor the name of the program to run.  The Queue line tells Condor to submit the job definition to the Condor queue.
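Submit files accept more keywords than the ones shown here. As an optional sketch (the file names and argument below are illustrative and not part of this lab), an Arguments line passes command-line arguments to the executable:

Universe   = vanilla
Executable = /bin/echo
Arguments  = hello condor
Log        = hello.log
output     = hello.output
error      = hello.error
Queue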

  7. Now submit the job to Condor.  This is done using the condor_submit command.

$ condor_submit lab.submit

Submitting job(s)...............
Logging submit event(s)...............

1 job(s) submitted to cluster 29.

$

  8. Once the job has been submitted, we can look at its status with the condor_q utility.  This application gives us information about the Condor job queue: which jobs are in the queue and what their status is.  By running condor_q several times you can follow the progress of the submitted job; you should see output similar to what is shown below.  First the job is entered into the queue, then it begins to run, and finally it completes and is removed from the queue.

$ condor_q username

 

 

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

   9.0   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

 

194 jobs; 40 idle, 154 running, 0 held

$
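Rather than re-running condor_q by hand, the standard Linux watch utility can refresh it for you every few seconds (press Ctrl-C to stop):

$ watch -n 10 condor_q username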

  9. Looking back at our submit file, you will note that several files were defined:

output=lab.output
error=lab.error
log=lab.log

The output file will contain the output of the executable.  The error file will contain any error output that the program might direct to stderr.  The log file is Condor's log of the job.  Look at each file in turn.

$ ls -la

total 98

drwxrwxr-x   2 username username 776 Mar  7 16:56 .

drwx------   7 username username 464 Mar  7 16:51 ..

-rw-rw-r--   1 username username 31 Mar  7 16:57 lab.error

-rw-rw-r--   1 username username 150 Mar  7 16:57 lab.log

-rw-rw-r--   1 username username 256 Mar  7 16:57 lab.output

-rw-rw-r--   1 username username 241 Mar  7 16:55 lab.submit

 

$ cat lab.error

 

 

$ cat lab.log

000 (015.000.000) 12/15 10:38:06 Job submitted from host: <141.142.96.174:33149>

...

017 (015.000.000) 12/15 10:38:19 Job submitted to Globus

    RM-Contact: ldas-grid.ligo-la.caltech.edu/jobmanager-condor

    JM-Contact: https://ligo-server.ncsa.uiuc.edu:38307/24309/1103128689/

    Can-Restart-JM: 1

...

001 (015.000.000) 12/15 10:38:19 Job executing on host: ldas-grid.ligo-la.caltech.edu

...

005 (015.000.000) 12/15 10:40:11 Job terminated.

        (1) Normal termination (return value 0)

                Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage

                Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage

                Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage

                Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

        0 - Run Bytes Sent By Job

        0 - Run Bytes Received By Job

        0 - Total Bytes Sent By Job

        0 - Total Bytes Received By Job

...

 

$ cat lab.output

cms-010.rcac.purdue.edu

$
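If you would rather block until a job finishes than poll the queue repeatedly, condor_wait (part of the standard Condor distribution, assuming it is installed on the login node) watches the job's log file and returns when the job completes:

$ condor_wait lab.log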

 

Single Job Submission with Requirements

  10. Condor also allows us to define requirements that must be met before a job is run.  These requirements tell Condor what type of machine the job needs to run on.

Create a file: lab2.submit

Copy the following into the file.

 

Universe   = vanilla

Requirements = Memory >= 320 && OpSys == "LINUX" && Arch == "X86_64"

Executable = /bin/hostname

Log        = lab2.log

output = lab2.output

error = lab2.error

Queue

Requirements for the job are defined by the Requirements line.  In this case we have told Condor that we need a minimum of 320 MB of memory, that the operating system has to be Linux, and that the processor needs to be x86_64 based.  You can find a full listing of the requirements that can be specified in the Condor manual:

http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html
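One way to see how many machines in the pool could currently satisfy these requirements (the result depends entirely on the state of the pool, so yours will differ) is to hand the same expression to condor_status:

$ condor_status -constraint 'Memory >= 320 && OpSys == "LINUX" && Arch == "X86_64"'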

  11. Submit the job and watch it run.  Then verify the output.

$ condor_submit lab2.submit

 

Submitting job(s).

Logging submit event(s).

1 job(s) submitted to cluster 30.

 

$ condor_q username

 

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              

  30.0   btest          12/17 09:44   0+00:00:03 R  0   0.0 lab6.sh E

 

1 jobs; 0 idle, 1 running, 0 held

$

Comprehensive documentation on submitting jobs can be found at http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html

  12. Once you see the job become idle, use condor_q with the -analyze option to look at what is going on.  The -analyze option explains how the job matches (or fails to match) against the machines in the pool.  Use the job number that is assigned to your job.

$ condor_q -analyze jobID

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
031.000:  Run analysis summary.  Of 1562 machines,
     12 are rejected by your job's requirements

     51 reject your job because of their own requirements

    478 match but are serving users with a better priority in the pool

    928 match but reject the job for unknown reasons

      2 match but will not currently preempt their existing job

     91 are available to run your job
     

$

  13. Create a new file called lab3.submit and copy the following into it:

 

Universe   = vanilla

Requirements = Memory >= 320 && OpSys == "LINUX" && Arch == "INTEL"

Executable = /bin/hostname

Log        = lab3.log

output = lab3.output

error = lab3.error

Queue

Now, repeat steps 11 and 12 with this new file, and observe the output.
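If one of these test jobs sits idle longer than you care to wait (for example, because no machine in the pool matches its requirements), it can be removed from the queue with condor_rm:

$ condor_rm jobID        # remove a single job by its cluster.process ID

$ condor_rm -all         # remove every job you own (use with care)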

Multiple Job Submissions within the same Submit file

  14. Create a file called lab4.submit and copy the following into it:

Universe   = vanilla

Executable = /bin/date

Log        = lab4.log

output = lab4.output.$(Process)

error = lab4.error.$(Process)

Queue 5

$(Process) is a value that Condor provides, referring to the job's process number within a cluster (0 through 4 in this example); despite the name, it is not an operating-system process ID.

The Queue line tells Condor how many instances of the executable to run; in this case 5 instances will be queued and may run simultaneously.  One thing to keep in mind when telling Condor to run multiple instances of an executable is what will happen to the output.  In the submit file above we have appended the process number to the end of each file name, so Condor will create 5 different files, each unique because of that number.  If we had not done this, Condor would have used the same file for all 5 processes.
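Condor also provides a $(Cluster) macro that expands to the cluster number assigned at submit time; including it as well keeps the output of separate submissions of the same file from colliding. A minimal variation of the two output lines:

output = lab4.output.$(Cluster).$(Process)
error  = lab4.error.$(Cluster).$(Process)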

  15. Now submit the jobs to Condor.  This is done using the condor_submit command.

$ condor_submit lab4.submit

Submitting job(s)...............
Logging submit event(s)...............

5 job(s) submitted to cluster 29.

$

  16. Once the jobs have been submitted, use condor_q again to follow their progress.  Running condor_q several times you should see output similar to what is shown below: the jobs are entered into the queue, begin to run, and finally complete and are removed from the queue.

$ condor_q username

 

 

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

   9.0   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

   9.1   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

   9.2   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

   9.3   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

   9.4   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

 

194 jobs; 40 idle, 154 running, 0 held

$

  17. After the jobs have finished, check the log file and all 5 output files generated by this submission, as shown below.
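For example (the listing simply reflects the five process numbers 0 through 4; each output file should contain a single line produced by /bin/date):

$ cat lab4.log

$ ls lab4.output.*
lab4.output.0  lab4.output.1  lab4.output.2  lab4.output.3  lab4.output.4

$ cat lab4.output.*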

 

Job Submission with Condor-G

  18. Create a file called example.condor and copy the following into it:

# Submissions over the Grid must use the "globus" universe.

universe = globus

 

# The executable to run.  Need the full path.  ~/ does not work.

executable = /bin/hostname

 

# Command-line arguments to the executable.

#arguments = 1 2 3

 

# false:  The executable is already on remote machine.

# true:   Copy the executable from the local machine to the remote.

transfer_executable = false

 

# Where to submit the job.  See the "Resources" page for local jobmanagers.

globusscheduler = queenbee.loni-lsu.teragrid.org/jobmanager-pbs

 

# Filenames for standard output, standard error, and Condor log.

output = example.out

error = example.err

log = example.log

 

# The following line is always required.  It is the command to submit the above.

queue

Notice the universe and globusscheduler keywords; they are what route this job through Condor-G to the remote resource.
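Condor-G authenticates to the remote jobmanager with the grid proxy obtained earlier via myproxy-logon, so it is worth confirming the proxy has not expired before submitting. grid-proxy-info is part of the standard Globus client tools; if it is missing from your path or reports little time left, simply re-run myproxy-logon:

$ grid-proxy-info -timeleft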

  19. Now submit the job to Condor.  This is done using the condor_submit command.

$ condor_submit example.condor

 

Submitting job(s).

Logging submit event(s).

1 job(s) submitted to cluster 13.

 

$

  20. Check the status of the job.

$ condor_q username

 

 

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

   9.0   username         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

 

194 jobs; 40 idle, 154 running, 0 held

$

  21. When the job finishes executing (this may take some time), check the output.

 

$ cat example.out

Running PBS prologue script

--------------------------------------

User and Job Data:

--------------------------------------

Job ID:    137113.qb2

Username:  skalayci

Group:     tgusers

Date:      20-Jul-2009 20:04

Node:      qb511 (16854)

--------------------------------------

PBS has allocated the following nodes:

 

qb511

 

A total of 8 processors on 1 nodes allocated

---------------------------------------------

Check nodes and clean them of stray processes

---------------------------------------------

qb511 starting 1 nodes

Done clearing all the allocated nodes

------------------------------------------------------

Concluding PBS prologue script - 20-Jul-2009 20:04:00

------------------------------------------------------

Mon Jul 20 20:04:01 CDT 2009

------------------------------------------------------

Running PBS epilogue script    - 20-Jul-2009 20:04:01

------------------------------------------------------

Checking node qb511 (MS)

Checking node qb511 ok

------------------------------------------------------

Concluding PBS epilogue script - 20-Jul-2009 20:04:02

------------------------------------------------------

Exit Status:

Job ID:          137113.qb2

Username:        skalayci

Group:           tgusers

Job Name:        scheduler_pbs_job_script

Session Id:      16853

Resource Limits: ncpus=1,nodes=1:ppn=8,walltime=12:00:00

Resources Used:  cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:00

Queue Used:      workq

Account String:  TG-CCR090020

Node:            qb511

Process id:      16916

$