With 72 processors sharing access to 144 gigabytes of main memory, entropy is a large computer able to support many simultaneous users. To date, it has not been necessary to impose formal resource limits in order to regulate the sharing of entropy's resources; instead, we have relied on the courtesy and good sense of the user community.
This tutorial introduces some basic methods for monitoring the overall load on entropy and the resources consumed by your own work, together with some hints for balancing your requirements against the potential needs of others.
The attempt to keep this tutorial self-contained and comprehensible has also made it longer than one might hope. For people who lack the time to read the whole thing, here's the bottom line:
If you do not already know what a load average is, or how to set priorities, then there's no avoiding the rest of the tutorial.
Programmers tend to talk about the “processes” running on a computer, while number crunching scientists tend to talk about the “jobs” running there. This tutorial will employ the two terms more or less synonymously, but there is a distinction that purists can make between them. If you are not a purist, you can skip to the next section.
A process is a running instance of a program. For a programmer, this has a precise meaning, encompassing the state of the machine's registers as well as a copy of the program's data. The precision gains nuance if you throw “threads” or lightweight processes into the mix: threads are mini-processes which share a single copy of most of the program's data, but have individual copies of the registers and the program stack.
The computing sense of the term “job” has its roots in batch processing: think of the deck of punched cards that you might submit to a mainframe back in the 1970s. The deck might direct the computer to execute several programs in turn which, strictly speaking, means that the job is implemented by a series of processes, not by one.
On entropy, where many people use scripts of shell commands to direct their computations, you could claim that such a script describes a job comprising the processes started to run the individual commands. When the script starts multiple long-running processes in parallel, though, entropy's users tend to describe each of those processes as separate jobs.
Maintaining a distinction between jobs and processes might still be meaningful at installations where long-running computations have to be submitted to a queuing system, instead of executed directly from an interactive login, since the queuing system would deal in jobs, which it would implement by starting multiple processes.
Installing such a queuing system on entropy, incidentally, might reduce the need for this tutorial, since such systems schedule each submitted job for execution when all the resources the job needs can be guaranteed to be available, rather than parceling out resources to all processes moment-by-moment. Since submitting jobs to a queuing system involves more bookkeeping than simply starting a program from the shell prompt, we have not imposed one as yet.
The easiest way to see how heavily entropy is being used at any given moment is to run the top command, which provides a list of the processes currently consuming the largest share of processor time. Just type “top” from the command line, and you'll see something similar to this:
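(The snapshot below is a mock-up: the numbers, user names, and program names are invented for illustration, but the layout is typical of what top prints.)

    top - 09:41:07 up 87 days, 14:02, 11 users,  load average: 34.18, 33.97, 33.52
    Tasks: 815 total,  35 running, 780 sleeping,   0 stopped,   0 zombie
    Cpu(s): 46.8%us,  1.1%sy,  0.0%ni, 52.0%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  150994944k total, 92110676k used, 58884268k free,   812404k buffers
    Swap:   8388604k total,    24016k used,  8364588k free, 30218740k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     8401 jsmith    20   0 2467m 1.1g  944 R  100  0.8 812:33.04 simulate
     8407 jsmith    20   0 2467m 1.1g  944 R  100  0.8 811:57.88 simulate
     9122 janedoe   25   5 1833m 760m  812 R  100  0.5  96:12.40 analyse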
The three “load averages” on the top line are the average number of threads which were actually running, or ready to run, or waiting for disk I/O, during the previous 1, 5, and 15 minutes.
In this example, the second line shows the total number of processes that existed at the instant that top sampled the system. top can toggle between displaying processes or individual threads within the processes, so it uses the word “task” as a blanket term for both; try typing capital-H while running top and see the “Tasks” field switch between showing the total number of processes and the total number of threads.
The tasks which are “sleeping” are waiting for some event—such as a user's keystroke or the arrival of data over the network—before they will be ready to run.
If the load average is less than the number of CPUs (72 on entropy), and if the memory fields “free”, “buffers”, and “cached” are not all near zero (meaning that all the processes fit easily into main memory), then entropy is coasting, with capacity to spare.

Run “man top” to see more detail about the top command and its output.
Entropy will continue to operate properly if the number of active threads exceeds the number of CPUs. Essentially, the available CPUs will be shared among the threads by leaving some threads inactive for brief intervals.
Entropy's recent workload does not strain its memory capacity, so the rest of this tutorial concentrates on the issue of having more threads to run than processors to run them on.
The scheduling algorithms which choose which thread to activate next favour threads that do lots of I/O over threads that use lots of CPU cycles, with the result that interactive users and system maintenance tasks are unlikely to notice much slowdown when there are more ready threads than CPUs to run them. The top program itself, for example, continues to refresh its display promptly even when entropy's load average rises into the mid-90s.
While entropy continues to operate under heavy load, there are still consequences to sharing the available processors and memory among many competing uses. The throughput for CPU-intensive processes drops, and processes with low priorities may be starved for CPU time.
Throughput—essentially the amount of work finished per unit of time—drops because there is some overhead involved in sharing resources. With the current gap between processor speed and memory access time, a particularly important component of the overhead is slowed memory access for an interval after a CPU switches from one process to another, while data for the new process replaces data for the old process in the CPU's local cache.
As a result of the overhead incurred by switching processors between threads, if you start 144 identical single-threaded processes at once on our 72-processor machine, it will take longer for them all to finish than it would if you started 72 of them to begin with, then started the remaining processes as replacements for the original 72, as the original processes exit.
The preceding example leads naturally into a description of how some processes can be starved for CPU time as a result of low process priorities. If you knew you would be the only person using entropy for the time it takes to run your 144 processes, there is a way you could start them all at once without throughput suffering: set the process priorities such that half of the processes run to completion before the other half run at all.
Look again at the top snapshot at the beginning of the tutorial. The PR column lists the priority of each process. Other factors being equal, the scheduler will run processes with numerically lower priorities ahead of processes with higher priorities.
So if you were entropy's only user, you could start 144 single-threaded processes at once without having more than 72 of them actually consume CPU time by setting their priorities such that processes 73 through 144 only get to run after their predecessors have exited. Here's one way to do so:
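Sketched with placeholder names (job1 through job144 stand for whatever commands start your own programs), the commands might look like this:

    job1 &
    job2 &
    # ... job3 through job71, each started the same way ...
    job72 &
    nice -n 1 job73 &
    nice -n 2 job74 &
    # ... the requested niceness growing by one for each later job ...
    nice -n 72 job144 &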
(Actually these commands won't quite work, for a reason we'll get to. And remember that this is just an illustration of how priorities operate; you are not the only person using entropy, so you should not actually start 72 processes of your own at the default priority.)
As you probably know, the “&” at the end of each command starts the command “in the background”, allowing you to start another command before this one finishes.
“nice -n n command” says to start command with a priority n lower than the default priority (the name comes from the fact that you are being “nice” to other users). So job1 through job72 run with the default priority, job73 with a slightly lower priority, job74 with priority slightly lower yet, et cetera. As a result, jobs 1 through 72 will begin consuming CPU cycles right away, while jobs 73 through 144 remain idle. When one of jobs 1 through 72 exits, job73 will begin to consume CPU cycles. When another of jobs 1 through 73 exits, job74 will begin consuming CPU cycles, and so on. At no time will more than 72 processes be active, yet there will not be fewer than 72 processes active so long as 72 or more remain to run.
Now for the explanation of why these commands will not quite work as advertised. On entropy, the maximum niceness level is 19, so job92 through job144 will all have the same priority, meaning that all 53 of them will start receiving CPU cycles at once after 19 of the previous processes have terminated, causing the load average to jump from 72 to 124.
To illustrate how different users can interfere with each other's plans, let's continue the thought experiment where you start 144 single-thread processes, nicing half of them as shown above. But this time, before any of your processes exits, another user, janedoe, starts six processes of her own at the default priority. For a while janedoe's 6 processes and your job1 through job72, all with the default priority, will share the 72 CPUs, sending the load average to 78. That means your job73 will not start to run until seven of your original 72 jobs exit (bringing the number of default-priority processes below 72), rather than just one. If janedoe or other users keep starting more processes at the default priority in the meantime, your job73 may never get any CPU time at all.
So the difficulty with using nice factors to influence the order in which your own processes get resources is that other users may inadvertently undermine your plans. Given that entropy has 72 processors, the problem becomes acute when there are more than 72 long-running CPU-intensive processes and, among those processes, there are one or more whose priority is lower than that of 72 of the others. In such a situation, the low-priority processes will get virtually no CPU time until the load average falls back below 72.
If you notice that somebody else's processes are being starved because some of your processes are running at a higher priority, you should reduce the priority of your processes using the renice command, or, equivalently, with top's r command.
A renice command of the following form will reduce the priority of four of your processes by 15 (that's reduce in the sense of giving the processes less priority; the integer that top reports in its PR column would increase in magnitude). The four numbers after the 15 are process identifiers, taken from top's PID column.
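For example (the four process identifiers here are made up for illustration; you would substitute real PIDs from your own top listing):

    renice -n 15 8401 8407 8411 8416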
If you are feeling particularly magnanimous, you could instead give renice your user name, reducing the priorities of all of your processes without typing their individual identifiers.
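For example (with jsmith standing in for your own login name):

    renice -n 15 -u jsmith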
You can only change the priority of your own processes (and only decrease it). So if your processes are the ones being starved, you need to email either the other users or the system administrator (trouble@lcd.uregina.ca) to ask them to address the situation.
We motivated the discussion of process priorities with an example where they served as a (flawed) mechanism for starting a lot of processes at one time without having all of those processes actually consume resources at once. Here's a more realistic situation where you might want to schedule processes to run in the future, together with more practical methods of doing so.
(In passing, note that there are two mechanisms for scheduling processes to start in the future that are not discussed here: at and crontab. Both of these can start processes for you at particular times. What we are looking for, though, is a method for starting process B at whatever time process A is finished, so we do not have to know in advance how long A will run in order to schedule B.)
Let's say you have nine processes to execute that you expect to run roughly two hours each, and that top is currently showing a load average of about 69. With a load average of 69, entropy has 3 idle processors, so you would like to start 3 of your processes right away, and set things up so that the remaining 6 will start in two waves later on, without your having to stay up until the wee hours of the morning to begin them by hand.
Here's a simple approximation of what you want:
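A sketch of one such arrangement, with job1 through job9 standing in for your real commands:

    (job1; job4; job7) &
    (job2; job5; job8) &
    (job3; job6; job9) &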
where each “job n” stands for the shell command that starts one of your processes. The semi-colon separating job n from job n+3 says “run job n, then when it finishes, run job n+3”. The parentheses around each triplet of jobs start a subshell to run that series of shell commands, while the & at the end of each line puts the subshell into the background, so that the next subshell can start running before the previous one terminates.
In practice, since each job n may stand for quite a bit of typing, it might be more convenient to put the commands for each subshell into a separate shell script file, and then start each of those shell scripts in the background, something like this:
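For instance, with wave1, wave2, and wave3 as hypothetical names for the three script files:

    sh wave1 &
    sh wave2 &
    sh wave3 &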
(assuming that your script files can be run by the default shell, /bin/sh; otherwise you might need to substitute csh, tcsh, or bash).
The use of separate subshells, each executing a series of processes in sequence, leaves open the possibility that we might leave processors idle unnecessarily. Say that job7 terminates while job5 and job6 are still running. Assuming that job8 doesn't depend on the results of job5, we would like to be able to start it up early, on the processor freed by the completion of the first subshell.
You can set up such a scheme, but it involves more sophisticated shell programming. For the adventurous, here's a shell script that reads the commands to start new processes from a file (such as this one) and then executes those commands as required to keep a given number of processes running. The script is not polished or robust enough to serve as a tool for general use, but it may be a useful model for people comfortable with shell programming.
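As a rough sketch of the idea (this is not the linked script itself; it assumes bash and a file named commands.txt listing one command per line):

    #!/bin/bash
    # Keep at most MAXJOBS of the commands listed in commands.txt running
    # at once, starting a new one whenever an old one finishes.
    MAXJOBS=3

    while IFS= read -r cmd; do
        # Wait until fewer than MAXJOBS background jobs are still running.
        while [ "$(jobs -pr | wc -l)" -ge "$MAXJOBS" ]; do
            sleep 10
        done
        eval "$cmd" &
    done < commands.txt

    wait    # let the last few jobs finish before the script exits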
Ideally, we would like a tool that would allow all users to submit jobs to a single shared queue, and which would start jobs from that queue as required to keep entropy's load average at 72. The difficulty is that simple home-grown tools might not deal properly with the security implications of starting different users' jobs from a single queue, while full-blown queuing systems like OpenPBS and Grid Engine have features for resource scheduling and distributed computing that go beyond our needs. So, at least in the near term, we are continuing to rely on users' courtesy and good sense.
top(1), ps(1), nice(1) and renice(1) commands