Lab Exercise: Condor DAGMan

 

Note: Replace the username and jobID placeholders in the following commands with your own values.

 

Part 1: Standard DAG Submission and Execution with DAGMan

  1. All of the work in this lab exercise will be done on the login node of the Purdue Condor cluster.

First, log in to NCSA Abe with your username and password:

 

$ myproxy-logon -l username

 

$ gsissh tg-condor.purdue.teragrid.org

  1. b. In TeraGrid, you need to specify a project number to be able to submit jobs to Condor. To do that, create a file called .tg_default_project in your home directory, and enter the project number TG-CCR090020 in this file.

 

$ cd

 

$ nano .tg_default_project

   TG-CCR090020

 

  2. Create a directory “DAGMan” in your home directory, and switch to that directory

$ mkdir DAGMan

$ cd DAGMan

 

 

  3. Copy the example files from the Condor distribution to your directory

$ cp /opt/condor/examples/dagman/* .

 

 

  4. Explore each file in your directory.

$ cat diamond.dag                                            

Job  A  A.submit

Job  B  B.submit

Job  C  C.submit

Job  D  D.submit

PARENT A CHILD B C

PARENT B C CHILD D

 

 

This is the .dag file, which describes a simple diamond-shaped DAG: job A runs first, jobs B and C each depend on A, and job D depends on both B and C. Each job has an associated Condor submit file: A.submit,…
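To make the PARENT/CHILD lines concrete, here is a minimal sketch that expands them into individual parent-child pairs and prints one valid execution order with the standard tsort utility. The function names dag_edges and dag_order are hypothetical helpers for this illustration; they are not part of Condor, and DAGMan does not work this way internally.

```shell
# dag_edges FILE: expand each "PARENT p1 p2 CHILD c1 c2" line into
# "parent child" pairs, one per line (the input format tsort expects).
dag_edges() {
    awk 'toupper($1) == "PARENT" {
        child_at = 0
        for (i = 2; i <= NF; i++) if (toupper($i) == "CHILD") child_at = i
        for (p = 2; p < child_at; p++)
            for (c = child_at + 1; c <= NF; c++)
                print $p, $c
    }' "$1"
}

# dag_order FILE: print the jobs in one order that respects the dependencies.
dag_order() {
    dag_edges "$1" | tsort
}

# Recreate the diamond.dag shown above in a scratch file.
dag_file=$(mktemp)
cat > "$dag_file" <<'EOF'
Job  A  A.submit
Job  B  B.submit
Job  C  C.submit
Job  D  D.submit
PARENT A CHILD B C
PARENT B C CHILD D
EOF

dag_edges "$dag_file"   # the four edges: A B, A C, B D, C D
dag_order "$dag_file"   # A comes first and D last; B and C may appear in either order
rm -f "$dag_file"
```

The two middle jobs have no dependency on each other, which is why their relative order is not fixed.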

 

 

$ cat B.submit

Universe   = standard

Executable = half.condor

input      = A.out

output     = B.out

log        = diamond.log

Queue

 

$ cat C.submit

Universe   = standard

Executable = half.condor

input      = A.out

output     = C.out

log        = diamond.log

Queue

 

$ cat D.submit

Universe   = standard

Executable = sum.condor

arguments  = B.out C.out

output     = D.out

error      = D.err

log        = diamond.log

Queue

 
The fourth Condor submit file, A.submit, belongs to the first job in the DAG:

$ cat A.submit

Universe   = standard

Executable = random.condor

output     = A.out

log        = diamond.log

Queue

 

 

 

Notice that all the jobs use the “standard” universe to get Condor's fault-tolerance support (checkpointing). Also, notice the data dependencies between jobs. For example, the output generated by A.submit (A.out) is used as the input in B.submit and C.submit.

The random.condor, half.condor, and sum.condor executables used in these submit files are built from 3 simple C programs. random.condor generates a random number, half.condor divides an input number by 2, and sum.condor calculates the sum of two input numbers.
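The data flow between the four jobs can be replayed locally without Condor. The following is only a sketch under assumptions: the shell functions half, sum_files, and run_diamond are hypothetical stand-ins for the C programs, they use integer arithmetic, and they assume that sum.condor reads the two files named in its arguments. The real sources may differ.

```shell
# half: read one integer on standard input and print it divided by 2
# (integer division here; the real half.c may behave differently).
half() {
    read -r n
    echo $(( n / 2 ))
}

# sum_files FILE1 FILE2: add the integers stored in the two files.
sum_files() {
    echo $(( $(cat "$1") + $(cat "$2") ))
}

# run_diamond DIR NUMBER: replay the A -> (B, C) -> D data flow in DIR,
# using NUMBER in place of the random value that job A would generate.
run_diamond() {
    echo "$2" > "$1/A.out"
    half < "$1/A.out" > "$1/B.out"    # job B: input = A.out, output = B.out
    half < "$1/A.out" > "$1/C.out"    # job C: input = A.out, output = C.out
    sum_files "$1/B.out" "$1/C.out" > "$1/D.out"   # job D: arguments = B.out C.out
    cat "$1/D.out"
}

workdir=$(mktemp -d)
run_diamond "$workdir" 42    # prints 42 (21 + 21)
rm -rf "$workdir"
```

Each redirection mirrors an input or output line of the corresponding submit file, which is where the dependencies in diamond.dag come from.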

Finally, the Makefile looks like this:

$ cat Makefile

all: random.condor half.condor sum.condor

random.condor: random.c

        condor_compile gcc -o random.condor random.c

half.condor: half.c

        condor_compile gcc -o half.condor half.c

sum.condor: sum.c

        condor_compile gcc -o sum.condor sum.c

clean:

        rm -f *.out *.err *.log diamond.dag.* *.condor *~

 

This Makefile builds the executables from the C source files. Notice that each compilation goes through condor_compile, which relinks the program with the Condor libraries, as the standard universe requires.

  5. Issue the “make” command to compile the C programs.

$ make

 

 

This command will produce the random.condor, half.condor, and sum.condor executables.

  6. Submit the “.dag” file to Condor

$  condor_submit_dag diamond.dag

 

 

 

7.      Now, watch the Condor queue.

$ condor_q username

 

You should see DAGMan itself running as a job. DAGMan releases the jobs in the DAG for execution according to the dependencies among them.
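The release rule can be illustrated with a small sketch: a job is ready once every one of its parents has finished. The ready_jobs function below is a hypothetical helper written for this handout, not a Condor command. It takes a DAG file followed by the names of the jobs that have already finished.

```shell
# ready_jobs DAGFILE [FINISHED_JOB ...]: print the jobs that are not yet
# finished and whose parents have all finished.
ready_jobs() {
    dag=$1
    shift
    awk -v finished="$*" '
        BEGIN {
            n = split(finished, names, " ")
            for (i = 1; i <= n; i++) is_done[names[i]] = 1
        }
        toupper($1) == "JOB" { jobs[++njobs] = $2 }
        toupper($1) == "PARENT" {
            child_at = 0
            for (i = 2; i <= NF; i++) if (toupper($i) == "CHILD") child_at = i
            if (child_at == 0) next
            # An unfinished parent blocks every child named on this line.
            for (p = 2; p < child_at; p++)
                if (!($p in is_done))
                    for (c = child_at + 1; c <= NF; c++) blocked[$c] = 1
        }
        END {
            for (j = 1; j <= njobs; j++)
                if (!(jobs[j] in is_done) && !(jobs[j] in blocked)) print jobs[j]
        }
    ' "$dag"
}

dag_file=$(mktemp)
cat > "$dag_file" <<'EOF'
Job  A  A.submit
Job  B  B.submit
Job  C  C.submit
Job  D  D.submit
PARENT A CHILD B C
PARENT B C CHILD D
EOF

ready_jobs "$dag_file"          # nothing finished yet: prints A
ready_jobs "$dag_file" A        # prints B and C
ready_jobs "$dag_file" A B      # prints C (D still waits for C)
ready_jobs "$dag_file" A B C    # prints D
rm -f "$dag_file"
```

This matches what you should observe in the queue: A alone, then B and C together, then D.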

  8. When the DAG finishes executing, explore the files generated by Condor, especially:

$ vi diamond.dag.condor.sub

 

$ vi diamond.dag.dagman.out

 

 

 diamond.dag.condor.sub is generated automatically by Condor and serves as a Condor submit file. Remember that DAGMan itself runs as a Condor job. The content of the file will look something like this:

$ cat diamond.dag.condor.sub

 

universe        = scheduler

executable      = /opt/condor/bin/condor_dagman

getenv          = True

output          = diamond.dag.lib.out

error           = diamond.dag.lib.err

log             = diamond.dag.dagman.log

queue

 

Notice that it uses the “scheduler” universe, and the executable refers to the location of DAGMan.

 

Part 2: Fault-Tolerance in DAGMan with Rescue DAGs

9.      To observe the fault-tolerance behavior of DAGMan and demonstrate the use of Rescue DAGs, we first need to break one of the job submit files (C.submit in this case). Before that, if you want a clean execution environment, remove the data products generated in the first part (*.out, *.log, *.err, and diamond.dag.* files).
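One way to do the cleanup is sketched below. The function name clean_part1_outputs is hypothetical; the file patterns are the ones listed in this step. Note that the pattern diamond.dag.* does not match diamond.dag itself, so the DAG description and the submit files are kept.

```shell
# clean_part1_outputs [DIR]: remove the Part 1 data products from DIR
# (default: the current directory). -f keeps rm quiet when a pattern
# matches nothing.
clean_part1_outputs() {
    ( cd "${1:-.}" && rm -f ./*.out ./*.err ./*.log ./diamond.dag.* )
}
```

For example, running clean_part1_outputs ~/DAGMan would clean the directory created earlier in this lab.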

 

So, modify the C.submit file as follows:

$ cat C.submit

Universe   = standard

Executable = h.condor

input      = A.out

output     = C.out

log        = diamond.log

Queue

 

 

Notice that we have changed the Executable name in the submit file. Condor will not be able to find the h.condor executable, so job C should fail.

 

10.  Submit the DAG again:

$  condor_submit_dag diamond.dag

 

 

11.  Now, watch the Condor queue:

$ condor_q username

 

If everything goes as expected, random.condor (job A) should execute successfully first. Then, DAGMan will try to submit half.condor (job B) and h.condor (job C) to the Condor queue at the same time. Job B should execute successfully, whereas job C will not be able to execute, since Condor cannot find the executable h.condor.

 

12.  At this point, DAGMan cannot proceed any further, so it creates a Rescue DAG file and exits. The content of the Rescue file will look something like this:

$ cat diamond.dag.rescue001

# Rescue DAG file, created after running  the diamond.dag DAG file

# Created 7/28/2009 12:58:58 UTC

# Total number of Nodes: 4

# Nodes premarked DONE: 2

# Nodes that failed: 1

#   C,<ENDLIST>

JOB A A.submit DONE

JOB B B.submit DONE

JOB C C.submit

JOB D D.submit

PARENT A CHILD B C

PARENT B CHILD D

PARENT C CHILD D

 

 

As you can see, DAGMan marked job A and job B as DONE, so the next time this rescue file is submitted these jobs will not be executed again. As long as the data products generated by job A and job B are still available to the later jobs, there is no need to re-run them.
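To check quickly which nodes a Rescue DAG of this form will skip and which it still has to run, the JOB lines can be filtered on the trailing DONE keyword. This is a minimal sketch; done_nodes and pending_nodes are hypothetical helper names, and the sample file repeats only the non-comment lines of the listing above.

```shell
# done_nodes FILE: names of the nodes whose JOB line ends in DONE.
done_nodes() {
    awk 'toupper($1) == "JOB" && toupper($NF) == "DONE" { print $2 }' "$1"
}

# pending_nodes FILE: names of the nodes that still have to run.
pending_nodes() {
    awk 'toupper($1) == "JOB" && toupper($NF) != "DONE" { print $2 }' "$1"
}

rescue_file=$(mktemp)
cat > "$rescue_file" <<'EOF'
JOB A A.submit DONE
JOB B B.submit DONE
JOB C C.submit
JOB D D.submit
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
EOF

echo "done:    $(done_nodes "$rescue_file" | xargs)"      # done:    A B
echo "pending: $(pending_nodes "$rescue_file" | xargs)"   # pending: C D
rm -f "$rescue_file"
```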

 

13.  As you can also see from the Rescue file, the DAG failed because of an error in job C. So, at this point, we will restore the C.submit file to its original version and submit the Rescue DAG to DAGMan.

First, fix C.submit:

 

$ cat C.submit

Universe   = standard

Executable = half.condor

input      = A.out

output     = C.out

log        = diamond.log

Queue

 

 

14.  Now, submit the Rescue DAG:

$  condor_submit_dag diamond.dag.rescue001

 

 

15.  Now, watch the Condor queue:

$ condor_q username

 

You should see that DAGMan first submits job C (half.condor) and, after it finishes, submits job D (sum.condor) to the Condor queue. After job D finishes, the whole DAG is complete and DAGMan exits.

 

Now you can check the output and log files and explore the results…