Process scheduling for entropy users

With 72 processors sharing access to 144 gigabytes of main memory, entropy is a large computer able to support many simultaneous users. To date, it has not been necessary to impose formal resource limits in order to regulate the sharing of entropy's resources; instead, we have relied on the courtesy and good sense of the user community.

This tutorial introduces some basic methods for monitoring the overall load on entropy and the resources consumed by your own work, together with some hints for balancing your requirements against the potential needs of others.

Executive Summary

The attempt to keep this tutorial self-contained and comprehensible has also made it longer than one might hope. For people who lack the time to read the whole thing, here's the bottom line:

  1. Try to keep entropy's load average from rising above 72 for long periods of time.
  2. Do not set priorities such that some low-priority processes get no CPU time at all.

If you do not already know what a load average is, or how to set priorities, then there's no avoiding the rest of the tutorial.

Initial clarification of terminology

Programmers tend to talk about the “processes” running on a computer, while number crunching scientists tend to talk about the “jobs” running there. This tutorial will employ the two terms more or less synonymously, but there is a distinction that purists can make between them. If you are not a purist, you can skip to the next section.

A process is a running instance of a program. For a programmer, this has a precise meaning, encompassing the state of the machine's registers as well as a copy of the program's data. The picture gains some nuance if you throw “threads”, or lightweight processes, into the mix: threads are mini-processes which share a single copy of most of the program's data, but have individual copies of the registers and the program stack.

The computing sense of the term “job” has its roots in batch processing: think of the deck of punched cards that you might submit to a mainframe back in the 1970s. The deck might direct the computer to execute several programs in turn which, strictly speaking, means that the job is implemented by a series of processes, not by one.

On entropy, where many people use scripts of shell commands to direct their computations, you could claim that such a script describes a job comprising the processes started to run the individual commands. When the script starts multiple long-running processes in parallel, though, entropy's users tend to describe each of those processes as separate jobs.

Maintaining a distinction between jobs and processes might still be meaningful at installations where long-running computations have to be submitted to a queuing system, instead of executed directly from an interactive login, since the queuing system would deal in jobs, which it would implement by starting multiple processes.

Installing such a queuing system on entropy, incidentally, might reduce the need for this tutorial, since such systems schedule each submitted job for execution when all the resources the job needs can be guaranteed to be available, rather than parceling out resources to all processes moment-by-moment. Since submitting jobs to a queuing system involves more bookkeeping than simply starting a program from the shell prompt, we have not imposed one as yet.

Monitoring Usage

The easiest way to see how heavily entropy is being used at any given moment is to run the top command, which provides a list of the processes currently consuming the largest share of processor time. Just type “top” from the command line, and you'll see something similar to this:

top - 19:54:39 up 216 days,  9:23,  6 users,  load average: 50.33, 49.88, 48.70
Tasks: 1011 total,  48 running, 955 sleeping,   4 stopped,   4 zombie
Cpu(s): 65.2%us,  1.5%sy,  2.8%ni, 30.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  130966000k total, 24044628k used, 106921372k free,   679528k buffers
Swap:        0k total,        0k used,        0k free, 17256400k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 1255 xerxes    20   0  239m 228m  10m R 100.0  0.2  232:33.15 hanoi_12_4_mv2.
 1282 xerxes    20   0  228m 213m 5852 R 100.0  0.2  231:16.61 hanoi_12_4_mv2.
 5817 seome20s  20   0 1155m 141m  51m S 100.0  0.1    1311:04 MATLAB
 2780 ohara     20   0  7256  416  336 R  99.7  0.0  117557:16 wolE
 4554 ohara     20   0  7252  416  336 R  99.7  0.0   23318:22 eulerE
 5816 seome20s  20   0 1145m 137m  51m S  99.7  0.1    1596:21 MATLAB
 5822 seome20s  20   0 1157m 135m  51m S  99.7  0.1    1310:11 MATLAB
 5837 ohara     20   0  7256  420  336 R  99.7  0.0  100186:33 wolE
 6246 ohara     20   0  7256  420  336 R  99.7  0.0  102421:46 wolE
17464 ohara     20   0  7256  420  336 R  99.7  0.0  114406:05 wolE
23337 ohara     20   0  7252  416  336 R  99.7  0.0   39495:52 eulerE
23582 ohara     20   0  7252  416  336 R  99.7  0.0   26223:52 eulerE
24668 xerxes    20   0  298m 280m 1800 R  99.7  0.2     9:06.05 rubiks_3x3.eps
26009 ohara     20   0  7256  416  336 R  99.7  0.0   11431:23 wolE
27341 xerxes    20   0  131m 121m 1156 R  99.7  0.1    3174:21 rubiks_3x3.prd
27509 ohara     20   0  7252  416  336 R  99.7  0.0   13227:18 eulerE
27513 ohara     20   0  7252  420  336 R  99.7  0.0   12952:57 eulerE
30143 xerxes    20   0  115m 105m 1164 R  99.7  0.1    6970:50 rubiks_3x3.prd
30655 xerxes    20   0  115m 105m 1164 R  99.7  0.1    6995:29 rubiks_3x3.prd
32690 ohara     20   0  7256  416  336 R  99.7  0.0  102037:07 wolE
 4551 ohara     20   0  7252  420  336 R  99.4  0.0   23017:50 eulerE
 6875 ohara     20   0  7256  420  336 R  99.4  0.0  107636:55 wolE
 6889 ohara     20   0  7256  416  336 R  99.4  0.0  110238:22 wolE
 9116 ohara     20   0  7256  420  336 R  99.4  0.0  110218:32 wolE
12095 ohara     20   0  7256  420  336 R  99.4  0.0  115298:08 wolE
12751 ohara     20   0  7252  416  336 R  99.4  0.0   76321:13 eulerE
12767 ohara     20   0  7252  416  336 R  99.4  0.0   75374:05 eulerE
12777 ohara     20   0  7252  416  336 R  99.4  0.0   73183:10 eulerE
12984 ohara     20   0  7256  420  336 R  99.4  0.0  104132:09 wolE
14591 ohara     20   0  7256  416  336 R  99.4  0.0  113617:38 wolE
15536 ohara     20   0  7252  412  336 R  99.4  0.0    8974:44 eulerE
16193 ohara     22   2  7252  416  336 R  99.4  0.0   31049:24 eulerE
16984 ohara     20   0  7256  420  336 R  99.4  0.0  106247:46 wolE
18818 ohara     20   0  7256  420  336 R  99.4  0.0  114613:29 wolE
21747 ohara     20   0  7256  420  336 R  99.4  0.0  112254:37 wolE
23180 ohara     20   0  7252  416  336 R  99.4  0.0   14504:23 eulerE
23183 ohara     20   0  7252  416  336 R  99.4  0.0   14612:29 eulerE
23585 ohara     20   0  7252  416  336 R  99.4  0.0   28167:56 eulerE
24092 xerxes    20   0  293m 275m 1800 R  99.4  0.2    14:42.92 rubiks_3x3.eps
24656 xerxes    20   0  309m 291m 1800 R  99.4  0.2     9:35.74 rubiks_3x3.eps
25836 ohara     22   2  7252  416  336 R  99.4  0.0   30888:20 eulerE
27197 ohara     20   0  7256  420  336 R  99.4  0.0  107705:47 wolE
27817 xerxes    20   0  115m 105m 1164 R  99.4  0.1    4463:32 rubiks_3x3.prd
32292 ohara     20   0  7256  416  336 R  99.4  0.0  108845:11 wolE
 6243 ohara     20   0  7256  416  336 R  99.0  0.0  102914:36 wolE
15466 ohara     20   0  7256  420  336 R  99.0  0.0  112240:59 wolE
16343 ohara     20   0  7256  420  336 R  99.0  0.0   99299:15 wolE
19765 ohara     20   0  7256  416  336 R  99.0  0.0   99826:23 wolE
24198 xerxes    20   0  336m 318m 1808 R  99.0  0.2    12:25.31 rubiks_3x3.eps
24275 xerxes    20   0  302m 284m 1804 R  99.0  0.2    11:06.80 rubiks_3x3.eps
25522 plond     20   0 13416 1860  832 R   1.0  0.0     0:00.16 top
 8487 seome20s  20   0 1401m 280m  52m S   0.7  0.2    12:04.29 MATLAB
  160 root      15  -5     0    0    0 S   0.3  0.0     0:04.16 ksoftirqd/52
 5535 seome20s  20   0 1272m 260m  52m S   0.3  0.2    12:38.30 MATLAB
 5820 seome20s  20   0 1148m 139m  51m S   0.3  0.1    1252:20 MATLAB
 5823 seome20s  20   0 1160m 137m  51m S   0.3  0.1    1208:42 MATLAB

The three “load averages” on the top line are the average number of threads which were actually running, or ready to run, or waiting for disk I/O, during the previous 1, 5, and 15 minutes.
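
If you just want those three numbers without the rest of top's display, the uptime command prints the same first line (the figures below simply repeat the ones from the snapshot above), and the raw values are also available in the file /proc/loadavg:

uptime
 19:54:39 up 216 days, 9:23, 6 users, load average: 50.33, 49.88, 48.70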

In this example, the second line shows the total number of processes that existed at the instant that top sampled the system. top can toggle between displaying processes or individual threads within the processes, so it uses the word “task” as a blanket term for both; try typing capital-H while running top and see the “Tasks” field switch between showing the total number of processes and the total number of threads.

The tasks which are “sleeping” are waiting for some event—such as a user's keystroke or the arrival of data over the network—before they will be ready to run.

If the load average is less than the number of CPUs (72 on entropy), and if the memory fields “free”, “buffers”, and “cached” are not all near zero (meaning that all the processes fit easily into main memory), then entropy is coasting, with capacity to spare.
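
If memory is your main concern, the free command reports the same totals on their own, without the process listing. The exact column names vary a little between versions of free, but something like the following shows the figures in megabytes:

free -m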

Run “man top” to see more detail about the top command and its output.

Entropy will continue to operate properly if the number of active threads exceeds the number of CPUs. Essentially, the available CPUs will be shared among the threads by leaving some threads inactive for brief intervals.

Entropy's recent workload does not strain its memory capacity, so the rest of this tutorial concentrates on the issue of having more threads to run than processors to run them on.

The scheduling algorithms which choose which process to activate next favour threads that do lots of I/O over threads that use lots of CPU cycles, with the result that interactive users and system maintenance tasks are unlikely to notice much slowdown when there are more ready threads than CPUs to run them. The top program itself, for example, continues to refresh its display promptly even when entropy's load average rises into the mid-90s.

Consequences of high system load

While entropy continues to operate under heavy load, there are still consequences to sharing the available processors and memory among many competing uses. The throughput for CPU-intensive processes drops, and processes with low priorities may be starved for CPU time.

Throughput

Throughput—essentially the amount of work finished per unit of time—drops because there is some overhead involved in sharing resources. With the current gap between processor speed and memory access time, a particularly important component of the overhead is slowed memory access for an interval after a CPU switches from one process to another, while data for the new process replaces data for the old process in the CPU's local cache.

As a result of the overhead incurred by switching processors between threads, if you start 144 identical single-threaded processes at once on our 72-processor machine, it will take longer for them all to finish than it would if you started 72 of them to begin with, then started the remaining processes as replacements for the original 72, as the original processes exit.
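
If each of your processes can be started by its own small script, one simple way to get this "replacement" behaviour is to let xargs cap the number of simultaneous processes. This is only a sketch: the jobs/ directory of per-process scripts is a made-up layout, and the -P option assumes a reasonably modern (GNU or BSD) xargs:

ls jobs/*.sh | xargs -n 1 -P 72 sh    # start the scripts one per process, at most 72 at a time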

Process priorities

The preceding example leads naturally into a description of how some processes can be starved for CPU time as a result of low process priorities. If you knew you would be the only person using entropy for the time it takes to run your 144 processes, there is a way you could start them all at once without throughput suffering: set the process priorities such that half of the processes run to completion before the other half run at all.

Look again at the top snapshot at the beginning of the tutorial. The PR column lists the priority of each process. Other factors being equal, the scheduler will run processes with numerically lower priorities ahead of processes with higher priorities.

So if you were entropy's only user, you could start 144 single-threaded processes at once without having more than 72 of them actually consume CPU time by setting their priorities such that processes 73 through 144 only get to run after their predecessors have exited. Here's one way to do so:

(Actually these commands won't quite work, for a reason we'll get to. And remember that this is just an illustration of how priorities operate; since you are not the only person using entropy, you should not actually start 144 processes this way.)

job1 &
job2 &
job3 &
job4 &
job5 &
job6 &
job7 &
. . .
job70 &
job71 &
job72 &
nice -n 1 job73 &
nice -n 2 job74 &
nice -n 3 job75 &
. . .
nice -n 70 job142 &
nice -n 71 job143 &
nice -n 72 job144 &

As you probably know, the “&” at the end of each command starts the command “in the background”, allowing you to start another command before this one finishes.

“nice -n n command” says to start command with a priority n lower than the default priority (the name comes from the fact that you are being “nice” to other users). So job1 through job72 run with the default priority, job73 with a slightly lower priority, job74 with a priority slightly lower yet, et cetera. As a result, jobs 1 through 72 will begin consuming CPU cycles right away, while jobs 73 through 144 remain idle. When one of jobs 1 through 72 exits, job73 will begin to consume CPU cycles. When another of jobs 1 through 73 exits, job74 will begin consuming CPU cycles, and so on. At no time will more than 72 processes be active, yet there will never be fewer than 72 processes active so long as 72 or more remain to run.

Now for the explanation of why these commands will not quite work as advertised. On entropy, the maximum niceness level is 19, so job91 through job144 all end up with the same priority. That means that once 19 of the higher-priority processes have terminated, all 54 of them will start receiving CPU cycles at once, causing the load average to jump from 72 to 125.
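
You can see the clamping for yourself by starting a throwaway process with an oversized niceness and asking ps what value it actually received (sleep here is just a stand-in for a real job, and whether nice also prints a warning about the out-of-range value depends on the version installed):

nice -n 72 sleep 600 &
ps -o pid,ni,comm -p $!     # the NI column shows 19, not 72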

Recognizing starved low-priority processes

To illustrate how different users can interfere with each other's plans, let's continue the thought experiment where you start 144 single-threaded processes, nicing half of them as shown above. This time, though, before any of your processes exits, another user, janedoe, starts six processes of her own at the default priority. For a while janedoe's six processes and your job1 through job72, all at the default priority, will share the 72 CPUs, sending the load average to 78. That means your job73 will not start to run until six of the original 72 jobs exit, rather than just one. If janedoe or other users keep starting more processes at the default priority in the meantime, your job73 may never get any CPU time at all.

So the difficulty with using nice factors to influence the order in which your own processes get resources is that other users may inadvertently undermine your plans. Given that entropy has 72 processors, the problem becomes acute when there are more than 72 long-running CPU-intensive processes and, among them, one or more whose priority is lower than that of 72 others. In that situation, the low-priority processes will get virtually no CPU time until the load average falls back below 72.

If you notice that somebody else's processes are being starved because some of your processes are running at a higher priority, you should reduce the priority of your processes using the renice command, or, equivalently, with top's r command.

Reducing the priority of running processes

The command

renice 15 822179 758066 822178 680590

will set the niceness of four of your processes to 15, reducing their priority (that's reduce in the sense of giving the processes a weaker claim on the CPU; the integer that top reports in its PR column would actually get larger). The four numbers after the 15 are process identifiers, taken from top's PID column.
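
If you don't have top open, ps can supply the same identifiers; for example, the following lists your own processes together with their current niceness (substitute your username for yourname):

ps -u yourname -o pid,ni,time,comm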

If you are feeling particularly magnanimous, you could use

renice 15 -u you

to set the niceness of all of your processes to 15 at once (substitute your own username for you), without typing their individual identifiers.

You can only change the priority of your own processes (and only decrease it). So if your processes are the ones being starved, you need to email either the other users or the system administrator (trouble@lcd.uregina.ca) to ask them to address the situation.

Scheduling jobs to start in the future

We motivated the discussion of process priorities with an example where they served as a (flawed) mechanism for starting a lot of processes at one time without having all of those processes actually consume resources at once. Here's a more realistic situation where you might want to schedule processes to run in the future, together with more practical methods of doing so.

(In passing, note that there are two mechanisms for scheduling processes to start in the future that are not discussed here: at and crontab. Both of these can start processes for you at particular times. What we are looking for, though, is a method for starting process B at whatever time process A is finished, so we do not have to know in advance how long A will run in order to schedule B.)

Let's say you have nine processes to execute that you expect to run roughly two hours each, and that top is currently showing a load average of about 69.

With a load average of 69, entropy has 3 idle processors, so you would like to start 3 of your processes right away, and set things up so that the remaining 6 will start in two waves later on, without your having to stay up until the wee hours of the morning to begin them by hand.

Here's a simple approximation of what you want:

( job1 ; job4 ; job7 ) &
( job2 ; job5 ; job8 ) &
( job3 ; job6 ; job9 ) &

where each “jobn” stands for the shell command that starts one of your processes. The semi-colon separating jobn from jobn+3 says “run jobn, then when it finishes, run jobn+3”. The parentheses around each triplet of jobs starts a subshell to run that series of shell commands, while the & at the end of each line puts the subshell into the background, so that the next subshell can start running before the previous one terminates.
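
One small variation worth knowing about: the semi-colon runs jobn+3 whether or not jobn succeeded. If a later job is only worth running when the earlier one exits successfully, separate the commands with && instead, for example:

( job1 && job4 && job7 ) &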

In practice—since each jobn may stand for quite a bit of typing—it might be more convenient to put the commands for each subshell into a separate shell script file, and then start each of those shell scripts in the background, something like this:

sh scriptA &
sh scriptB &
sh scriptC &

(assuming that your script files can be run by the default shell, /bin/sh; otherwise you might need to substitute csh, tcsh, or bash).
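
For example, scriptA might contain nothing more than the commands of the first subshell, one per line (job1, job4, and job7 again stand for whatever commands start your processes):

#!/bin/sh
job1
job4
job7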

The use of separate subshells, each executing a series of processes in sequence, leaves open the possibility that we might leave processors idle unnecessarily. Say that job7 terminates while job5 and job6 are still running. Assuming that job8 doesn't depend on the results of job5, we would like to be able to start it up early, on the processor freed by the completion of the first subshell.

You can set up such a scheme, but it involves more sophisticated shell programming. For the adventurous, here is a sketch of a shell script that reads the commands to start new processes from a file, one command per line, and then executes those commands as required to keep a given number of processes running. It is not polished or robust enough to serve as a tool for general use, but it may be a useful model for people comfortable with shell programming.
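
The sketch assumes bash (it relies on the wait -n builtin, available since bash 4.3), takes the file of commands and the number of processes to keep running as its two arguments, and uses keepbusy.sh and commands.txt purely as placeholder names:

#!/bin/bash
# keepbusy.sh -- read one command per line from a file and keep at most
# a fixed number of them running at any one time.
# usage: bash keepbusy.sh commands.txt 72

cmdfile=$1
max=${2:-72}                        # how many processes to keep running at once
running=0

while IFS= read -r cmd; do
    [ -z "$cmd" ] && continue       # skip blank lines
    sh -c "$cmd" &                  # start the next command in the background
    running=$((running + 1))
    if [ "$running" -ge "$max" ]; then
        wait -n                     # block until any one background job exits
        running=$((running - 1))
    fi
done < "$cmdfile"

wait                                # let the last few jobs finish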

Ideally, we would like a tool that would allow all users to submit jobs to a single shared queue, and which would start jobs from that queue as required to keep entropy's load average at 72. The difficulty is that simple home-grown tools might not deal properly with the security implications of starting different users' jobs from a single queue, while full-blown queuing systems like OpenPBS and Grid Engine have features for resource scheduling and distributed computing that go beyond our needs. So, at least in the near term, we are continuing to rely on users' courtesy and good sense.

John Jorgensen