dextrose
dextrose is the head node for a cluster of 34 compute nodes, all running the Linux operating system. You log into dextrose by using your favourite ssh client program to connect to dextrose.lcd.uregina.ca.
You can perform interactive work, such as editing or compiling files, on the head node, dextrose, itself, but intensive computation should take place on the compute nodes. The compute nodes are managed by a job queuing system called LSF (for “Load Sharing Facility”). This page is a bare-bones introduction to submitting jobs to LSF on dextrose. See LSF's extensive documentation for full information.
To oversimplify, you submit a job to the queue by prefixing the command line you wish to execute with the word bsub. For example, prefixing an ls command with bsub asks the queuing system to run ls on the next compute node with resources to spare. Since the compute nodes mount the same home directory for your account as the head node does, there is not much point to such a command, but it does illustrate one thing about the batch queue: it is for non-interactive jobs.*
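The example command stripped from the paragraph above was presumably just bsub in front of ls:

```shell
bsub ls
```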
If there are many jobs queued up, your job may not run immediately, and in any event the compute node that runs the job will not have access to your terminal. So if you ran the above command, the output from ls would not show up immediately on your screen. Instead, the batch system would email you* the output from ls after the command runs on some compute node.
You can modify how the batch system handles your job by inserting arguments to the bsub command between the word “bsub” and the command you are submitting to the queue.
For example, bsub has options to redirect the standard output and standard error streams of your command to files rather than emailing them to you: one pair of options writes the standard output stream to a file named stdoutFile and the standard error stream to stderrFile, overwriting any existing copies of stdoutFile and stderrFile; a second pair appends to existing files instead.
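The commands stripped from the paragraph above were presumably along the following lines. In current LSF, -oo and -eo overwrite the output and error files, while -o and -e append to them (the exact overwrite-versus-append behaviour can depend on the LSF version and site configuration, so check man bsub on your system; myCommand is a stand-in for your program):

```shell
# Overwrite stdoutFile and stderrFile:
bsub -oo stdoutFile -eo stderrFile myCommand

# Append to stdoutFile and stderrFile instead:
bsub -o stdoutFile -e stderrFile myCommand
```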
See the bsub manual page (also readable by typing “man bsub” when you're logged into dextrose) for details on bsub's many options.
Here's a real-world example. To submit a job that will run Gaussian 09 on the input file ng1.gjf, save the output stream into ng1.out, and save the error stream into ng1.err, while assigning the job the name “NG job” (the job name shows up in some of the monitoring commands described below):
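The submission command stripped here was presumably of roughly this form (g09 is the usual name of the Gaussian 09 executable; whether the input file is passed as an argument or on standard input depends on the local Gaussian setup):

```shell
bsub -J "NG job" -o ng1.out -e ng1.err g09 ng1.gjf
```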
Some programs use a Message Passing Interface (MPI) library to spread their work over processors on multiple compute nodes. Starting such programs is complicated by the need to use an MPI launcher to start the program on each of the nodes it ends up using, and by the need to ask the queuing system to allocate multiple compute nodes.
This primer will only illustrate the idea with a single, albeit complicated, example: a command that asks the queuing system to allocate 24 processors, with up to 6 processors per node, to run a job called “Big Vasp Job”, saving the standard output and error streams to stdout and stderr, respectively. The queuing system then executes the submitted mpirun command, which in turn starts the program on each assigned compute node.
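Reconstructed from the descriptions above and below, the stripped submission command was presumably along these lines (-n requests 24 slots, the span[ptile=6] resource requirement caps the slots per node at 6, and everything after the bsub options is the command the queuing system runs):

```shell
bsub -J "Big Vasp Job" -n 24 -R "span[ptile=6]" -o stdout -e stderr \
    mpirun -lsf -e MKL_NUM_THREADS=2 vasp5
```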
The mpirun command is the launcher for Platform Computing's MPI library. mpirun's “-lsf” argument tells it to get the description of which hosts to communicate with from the queuing system, while “-e MKL_NUM_THREADS=2” specifies a variable setting that mpirun should make in the environment of all the processes it starts to run the job. The actual command to be run is vasp5 (vasp5, in its turn, will get its job parameters from files with standard names in the directory where it starts, which is just how vasp5 happens to work).
The MKL_NUM_THREADS environment variable is a directive to the Math Kernel Library, which our build of vasp5 is linked to. It essentially tells the library how many threads it should start up to parallelize array operations. In the example invocation, we've combined “-R span[ptile=6]” and “-e MKL_NUM_THREADS=2” in the hope that the combination will produce 6x2=12 active threads on each of 4 compute nodes (each of our compute nodes has 12 processor cores).
See the “RESOURCE REQUIREMENT STRINGS” section of the output of “man lsfintro” for details about what can go after bsub's -R option.
The bjobs command, run with no arguments, lists your currently running jobs.
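The command, as described, is presumably just bjobs with no arguments:

```shell
bjobs
```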
There are a number of options to the bjobs command that specify what it should tell you about currently running jobs. For example, one invocation lists the jobs of other users as well as yours, and widens the output lines so that their contents are not abbreviated.
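The invocation stripped here was presumably something like this (-u all reports every user's jobs, and -w produces wide, unabbreviated output):

```shell
bjobs -u all -w
```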
Another invocation lists extra detail about the job numbered 3838 (the number likely coming from the output of one of the other invocations of bjobs).
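Presumably this was bjobs's long-format option applied to a single job ID:

```shell
bjobs -l 3838
```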
You can get some data about jobs that have terminated with bhist; for example, a long-format request shows the full history of a single job.
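The stripped example was presumably along these lines (-l asks for long-format detail; 3838 is the job ID from the earlier bjobs example):

```shell
bhist -l 3838
```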
To check the state of the compute nodes, try running one of LSF's host-status commands.
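The two commands named here were presumably LSF's standard host-status commands, bhosts (batch-server status per host) and lsload (load levels per host):

```shell
bhosts
lsload
```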
All of these commands take options that affect how they work. See “man bjobs”, “man bkill”, et cetera.
To terminate a running job that you decide you do not need, use bkill, giving it the jobid of the job you wish to terminate.
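The stripped example was presumably of this form, where 1234 stands for the jobid:

```shell
bkill 1234
```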
* Actually, mechanisms do exist for submitting interactive jobs to the batch queue, but that's an unusual special case which won't be discussed in this primer. See the complete LSF documentation, and in particular bsub's manual page, if you need to be able to interact with your batch jobs while they are running.
* As this primer is being written, the destination to which the email should be directed is set by the contents of a file called .forward in your home directory on dextrose, so you can change the destination address by editing this file (or creating it if it does not exist). In the absence of a .forward file, email generated within the dextrose cluster is delivered on dextrose, where you can read it with the mail command, but you'd probably rather have it delivered to where you are accustomed to reading mail.