Lab Exercise: Condor DAGMan
Note: Replace the username and jobID keywords in the following command prompts with the proper ones.
Part 1: Standard DAG Submission and Execution with DAGMan
1.
a. First, log in to NCSA Abe with your username and password:

$ myproxy-logon -l username
$ gsissh tg-condor.purdue.teragrid.org

b. In TeraGrid, you need to specify the Project Number to be able to submit jobs to Condor. To do that, create a file called .tg_default_project in your home directory, and enter the keyword TG-CCR090020 in it:

$ cd
$ nano .tg_default_project
2. Create a working directory:

$ mkdir DAGMan
$ cd DAGMan
3. Copy the DAGMan example files into it:

$ cp /opt/condor/examples/dagman/* .
4. Examine the DAG file:

$ cat diamond.dag
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
PARENT A CHILD B C
PARENT B C CHILD D

This is the .dag file, which represents a simple diamond-shaped DAG. Each job has a Condor submit file associated with it: A.submit, …
5. The following are the 4 Condor submit files:

$ cat A.submit
Universe = standard
Executable = random.condor
output = A.out
log = diamond.log
Queue

$ cat B.submit
Universe = standard
Executable = half.condor
input = A.out
output = B.out
log = diamond.log
Queue

$ cat C.submit
Universe = standard
Executable = half.condor
input = A.out
output = C.out
log = diamond.log
Queue

$ cat D.submit
Universe = standard
Executable = sum.condor
arguments = B.out C.out
output = D.out
error = D.err
log = diamond.log
Queue
Notice that all the jobs use the "standard" universe to get the fault-tolerance support of Condor. Also, notice the data dependencies between jobs: for example, the output generated by A.submit (A.out) is used as input in both B.submit and C.submit.
The random.condor, half.condor, and sum.condor executables used in these submit files are three simple C programs: random.condor generates a random number, half.condor divides an input number by 2, and sum.condor calculates the sum of two input numbers.
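The C sources for these programs ship alongside the submit files in the examples directory. As an illustration only, here is a minimal sketch of what the core of sum.c might look like; the function names and error handling below are assumptions, not the actual example source:

```c
/* Hypothetical sketch of sum.c: read one number from each of two files
 * (e.g. B.out and C.out, passed on the command line) and print their sum.
 * The real source is in /opt/condor/examples/dagman/ and may differ. */
#include <stdio.h>
#include <stdlib.h>

/* Parse the first integer found in the named file; exit on failure. */
static long read_number(const char *path)
{
    FILE *fp = fopen(path, "r");
    long n;

    if (fp == NULL || fscanf(fp, "%ld", &n) != 1) {
        fprintf(stderr, "sum: cannot read a number from %s\n", path);
        exit(EXIT_FAILURE);
    }
    fclose(fp);
    return n;
}

/* Sum the numbers stored in two files; main() would print
 * sum_files(argv[1], argv[2]) to stdout (captured as D.out). */
long sum_files(const char *a, const char *b)
{
    return read_number(a) + read_number(b);
}
```

half.c would follow the same pattern, except that it reads its number from standard input (the `input = A.out` line in its submit file makes Condor present A.out on stdin) and prints the value divided by 2.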
Finally, the Makefile looks like this:

$ cat Makefile
all: random.condor half.condor sum.condor

random.condor: random.c
	condor_compile gcc -o random.condor random.c

half.condor: half.c
	condor_compile gcc -o half.condor half.c

sum.condor: sum.c
	condor_compile gcc -o sum.condor sum.c

clean:
	rm -f *.out *.err *.log diamond.dag.* *.condor *~
Basically, this Makefile generates the executables from the C source files. Notice that compilation is done via condor_compile, so that the programs are linked against the Condor libraries.
$ make

This command will produce the random.condor, half.condor, and sum.condor executables.
6. Submit the DAG:

$ condor_submit_dag diamond.dag
7. Now, watch the Condor queue:

$ condor_q username
You should see DAGMan running as a job itself. DAGMan will release the jobs in the DAG for execution based on the dependencies among them.
8. Examine the files DAGMan generated:

$ vi diamond.dag.condor.sub
$ vi diamond.dag.dagman.out

diamond.dag.condor.sub is automatically generated by Condor and serves as a Condor submit file itself; remember that DAGMan itself runs as a Condor job. The content of the file will look something like this:
$ cat diamond.dag.condor.sub
universe = scheduler
executable = /opt/condor/bin/condor_dagman
getenv = True
output = diamond.dag.lib.out
error = diamond.dag.lib.err
log = diamond.dag.dagman.log
queue
Notice that it uses the "scheduler" universe, and that the executable refers to the location of DAGMan.
Part 2: Fault-Tolerance in DAGMan with Rescue DAGs
9. To observe the fault-tolerance behavior of DAGMan and demonstrate the usage of Rescue DAGs, we first need to break one of the job submit files (C.submit in this case). Before that, if you want to have a clean execution environment, remove the data products generated in the first part (the *.out, *.log, *.err, and diamond.dag.* files). Then modify the C.submit file as follows:
$ cat C.submit
Universe = standard
Executable = h.condor
input = A.out
output = C.out
log = diamond.log
Queue
Notice that we have changed the Executable name in the submit file. Condor will not be able to find an h.condor executable, so job C should fail.
10. Submit the DAG again:

$ condor_submit_dag diamond.dag
11. Now, watch the Condor queue:

$ condor_q username
If everything goes as expected, random.condor (job A) should first execute successfully. Then DAGMan will submit half.condor (job B) and h.condor (job C) to the Condor queue at the same time. Job B should execute successfully, whereas job C will fail, since the executable h.condor is unknown to Condor.
12. At this point, DAGMan will not be able to proceed any further, so it will create a Rescue DAG file and exit. The content of the Rescue file will look something like this:
$ cat diamond.dag.rescue001
# Rescue DAG file, created after running the diamond.dag DAG file
# Created 7/28/2009 12:58:58 UTC
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
#   C,<ENDLIST>

JOB A A.submit DONE
JOB B B.submit DONE
JOB C C.submit
JOB D D.submit
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
As you can see, DAGMan marked job A and job B as DONE, so the next time this rescue file is submitted for execution these jobs will not be executed again. As long as the data products generated by jobs A and B are available to the later jobs, there is no need to re-run them.
13. As you can also see from the Rescue file, the failure in the DAG occurred due to an error in job C. So, at this point, we will restore the C.submit file to its original version and submit the Rescue DAG to DAGMan. First, fix C.submit:
$ cat C.submit
Universe = standard
Executable = half.condor
input = A.out
output = C.out
log = diamond.log
Queue
14. Now, submit the Rescue DAG:

$ condor_submit_dag diamond.dag.rescue001
15. Now, watch the Condor queue:

$ condor_q username
You should see that DAGMan first submits job C (half.condor), and after it finishes execution DAGMan submits job D (sum.condor) to the Condor queue. After job D finishes execution, the whole DAG is complete and DAGMan exits.
Now you can check the output files and log files and explore the results.