Process scheduling for entropy users

With 72 processors sharing access to 144 gigabytes of main memory, entropy is a large computer able to support many simultaneous users. To date, it has not been necessary to impose formal resource limits in order to regulate the sharing of entropy's resources; instead, we have relied on the courtesy and good sense of the user community.

This tutorial introduces some basic methods for monitoring the overall load on entropy and the resources consumed by your own work, together with some hints for balancing your requirements against the potential needs of others.

Executive Summary

The attempt to keep this tutorial self-contained and comprehensible has also made it longer than one might hope. For people who lack the time to read the whole thing, here's the bottom line:

  1. Try to keep entropy's load average from rising above 72 for long periods of time.
  2. Do not set priorities such that some low-priority processes get no CPU time at all.

If you do not already know what a load average is, or how to set priorities, then there's no avoiding the rest of the tutorial.

Initial clarification of terminology

Programmers tend to talk about the “processes” running on a computer, while number crunching scientists tend to talk about the “jobs” running there. This tutorial will employ the two terms more or less synonymously, but there is a distinction that purists can make between them. If you are not a purist, you can skip to the next section.

A process is a running instance of a program. For a programmer, this has a precise meaning, encompassing the state of the machine's registers as well as a copy of the program's data. The picture gains some nuance if you throw “threads”, or lightweight processes, into the mix: threads are mini-processes which share a single copy of most of the program's data, but have individual copies of the registers and the program stack.

The computing sense of the term “job” has its roots in batch processing: think of the deck of punched cards that you might submit to a mainframe back in the 1970s. The deck might direct the computer to execute several programs in turn which, strictly speaking, means that the job is implemented by a series of processes, not by one.

On entropy, where many people use scripts of shell commands to direct their computations, you could claim that such a script describes a job comprising the processes started to run the individual commands. When the script starts multiple long-running processes in parallel, though, entropy's users tend to describe each of those processes as separate jobs.

Maintaining a distinction between jobs and processes might still be meaningful at installations where long-running computations have to be submitted to a queuing system, instead of executed directly from an interactive login, since the queuing system would deal in jobs, which it would implement by starting multiple processes.

Installing such a queuing system on entropy, incidentally, might reduce the need for this tutorial, since such systems schedule each submitted job for execution when all the resources the job needs can be guaranteed to be available, rather than parceling out resources to all processes moment-by-moment. Since submitting jobs to a queuing system involves more bookkeeping than simply starting a program from the shell prompt, we have not imposed one as yet.

Monitoring Usage

The easiest way to see how heavily entropy is being used at any given moment is to run the top command, which provides a list of the processes currently consuming the largest share of processor time. Just type “top” from the command line, and you'll see something similar to this:

top - 19:54:39 up 216 days,  9:23,  6 users,  load average: 50.33, 49.88, 48.70
Tasks: 1011 total,  48 running, 955 sleeping,   4 stopped,   4 zombie
Cpu(s): 65.2%us,  1.5%sy,  2.8%ni, 30.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  130966000k total, 24044628k used, 106921372k free,   679528k buffers
Swap:        0k total,        0k used,        0k free, 17256400k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 1255 xerxes    20   0  239m 228m  10m R 100.0  0.2  232:33.15 hanoi_12_4_mv2.
 1282 xerxes    20   0  228m 213m 5852 R 100.0  0.2  231:16.61 hanoi_12_4_mv2.
 5817 seome20s  20   0 1155m 141m  51m S 100.0  0.1    1311:04 MATLAB
 2780 ohara     20   0  7256  416  336 R  99.7  0.0  117557:16 wolE
 4554 ohara     20   0  7252  416  336 R  99.7  0.0   23318:22 eulerE
 5816 seome20s  20   0 1145m 137m  51m S  99.7  0.1    1596:21 MATLAB
 5822 seome20s  20   0 1157m 135m  51m S  99.7  0.1    1310:11 MATLAB
 5837 ohara     20   0  7256  420  336 R  99.7  0.0  100186:33 wolE
 6246 ohara     20   0  7256  420  336 R  99.7  0.0  102421:46 wolE
17464 ohara     20   0  7256  420  336 R  99.7  0.0  114406:05 wolE
23337 ohara     20   0  7252  416  336 R  99.7  0.0   39495:52 eulerE
23582 ohara     20   0  7252  416  336 R  99.7  0.0   26223:52 eulerE
24668 xerxes    20   0  298m 280m 1800 R  99.7  0.2     9:06.05 rubiks_3x3.eps
26009 ohara     20   0  7256  416  336 R  99.7  0.0   11431:23 wolE
27341 xerxes    20   0  131m 121m 1156 R  99.7  0.1    3174:21 rubiks_3x3.prd
27509 ohara     20   0  7252  416  336 R  99.7  0.0   13227:18 eulerE
27513 ohara     20   0  7252  420  336 R  99.7  0.0   12952:57 eulerE
30143 xerxes    20   0  115m 105m 1164 R  99.7  0.1    6970:50 rubiks_3x3.prd
30655 xerxes    20   0  115m 105m 1164 R  99.7  0.1    6995:29 rubiks_3x3.prd
32690 ohara     20   0  7256  416  336 R  99.7  0.0  102037:07 wolE
 4551 ohara     20   0  7252  420  336 R  99.4  0.0   23017:50 eulerE
 6875 ohara     20   0  7256  420  336 R  99.4  0.0  107636:55 wolE
 6889 ohara     20   0  7256  416  336 R  99.4  0.0  110238:22 wolE
 9116 ohara     20   0  7256  420  336 R  99.4  0.0  110218:32 wolE
12095 ohara     20   0  7256  420  336 R  99.4  0.0  115298:08 wolE
12751 ohara     20   0  7252  416  336 R  99.4  0.0   76321:13 eulerE
12767 ohara     20   0  7252  416  336 R  99.4  0.0   75374:05 eulerE
12777 ohara     20   0  7252  416  336 R  99.4  0.0   73183:10 eulerE
12984 ohara     20   0  7256  420  336 R  99.4  0.0  104132:09 wolE
14591 ohara     20   0  7256  416  336 R  99.4  0.0  113617:38 wolE
15536 ohara     20   0  7252  412  336 R  99.4  0.0    8974:44 eulerE
16193 ohara     22   2  7252  416  336 R  99.4  0.0   31049:24 eulerE
16984 ohara     20   0  7256  420  336 R  99.4  0.0  106247:46 wolE
18818 ohara     20   0  7256  420  336 R  99.4  0.0  114613:29 wolE
21747 ohara     20   0  7256  420  336 R  99.4  0.0  112254:37 wolE
23180 ohara     20   0  7252  416  336 R  99.4  0.0   14504:23 eulerE
23183 ohara     20   0  7252  416  336 R  99.4  0.0   14612:29 eulerE
23585 ohara     20   0  7252  416  336 R  99.4  0.0   28167:56 eulerE
24092 xerxes    20   0  293m 275m 1800 R  99.4  0.2    14:42.92 rubiks_3x3.eps
24656 xerxes    20   0  309m 291m 1800 R  99.4  0.2     9:35.74 rubiks_3x3.eps
25836 ohara     22   2  7252  416  336 R  99.4  0.0   30888:20 eulerE
27197 ohara     20   0  7256  420  336 R  99.4  0.0  107705:47 wolE
27817 xerxes    20   0  115m 105m 1164 R  99.4  0.1    4463:32 rubiks_3x3.prd
32292 ohara     20   0  7256  416  336 R  99.4  0.0  108845:11 wolE
 6243 ohara     20   0  7256  416  336 R  99.0  0.0  102914:36 wolE
15466 ohara     20   0  7256  420  336 R  99.0  0.0  112240:59 wolE
16343 ohara     20   0  7256  420  336 R  99.0  0.0   99299:15 wolE
19765 ohara     20   0  7256  416  336 R  99.0  0.0   99826:23 wolE
24198 xerxes    20   0  336m 318m 1808 R  99.0  0.2    12:25.31 rubiks_3x3.eps
24275 xerxes    20   0  302m 284m 1804 R  99.0  0.2    11:06.80 rubiks_3x3.eps
25522 plond     20   0 13416 1860  832 R   1.0  0.0     0:00.16 top
 8487 seome20s  20   0 1401m 280m  52m S   0.7  0.2    12:04.29 MATLAB
  160 root      15  -5     0    0    0 S   0.3  0.0     0:04.16 ksoftirqd/52
 5535 seome20s  20   0 1272m 260m  52m S   0.3  0.2    12:38.30 MATLAB
 5820 seome20s  20   0 1148m 139m  51m S   0.3  0.1    1252:20 MATLAB
 5823 seome20s  20   0 1160m 137m  51m S   0.3  0.1    1208:42 MATLAB

The three “load averages” on the top line are the average number of threads which were actually running, or ready to run, or waiting for disk I/O, during the previous 1, 5, and 15 minutes.
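
If you just want those three numbers without the rest of top's display, the uptime command prints the same first line (the figures below simply repeat the ones from the snapshot above), and the raw values are also available in the file /proc/loadavg:

uptime
 19:54:39 up 216 days, 9:23, 6 users, load average: 50.33, 49.88, 48.70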

In this example, the second line shows the total number of processes that existed at the instant that top sampled the system. top can toggle between displaying processes or individual threads within the processes, so it uses the word “task” as a blanket term for both; try typing capital-H while running top and see the “Tasks” field switch between showing the total number of processes and the total number of threads.

The tasks which are “sleeping” are waiting for some event—such as a user's keystroke or the arrival of data over the network—before they will be ready to run.

If the load average is less than the number of CPUs (72 on entropy), and if the memory fields “free”, “buffers”, and “cached” are not all near zero (meaning that all the processes fit easily into main memory), then entropy is coasting, with capacity to spare.
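
If memory is your main concern, the free command reports the same totals on their own, without the process listing. The exact column names vary a little between versions of free, but something like the following shows the figures in megabytes:

free -m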

Run “man top” to see more detail about the top command and its output.

Entropy will continue to operate properly if the number of active threads exceeds the number of CPUs. Essentially, the available CPUs will be shared among the threads by leaving some threads inactive for brief intervals.

Entropy's recent workload does not strain its memory capacity, so the rest of this tutorial concentrates on the issue of having more threads to run than processors to run them on.

The scheduling algorithms which choose which process to activate next favour threads that do lots of I/O over threads that use lots of CPU cycles, with the result that interactive users and system maintenance tasks are unlikely to notice much slowdown when there are more ready threads than CPUs to run them. The top program itself, for example, continues to refresh its display promptly even when entropy's load average rises into the mid-90s.

Consequences of high system load

While entropy continues to operate under heavy load, there are still consequences to sharing the available processors and memory among many competing uses. The throughput for CPU-intensive processes drops, and processes with low priorities may be starved for CPU time.

Throughput

Throughput—essentially the amount of work finished per unit of time—drops because there is some overhead involved in sharing resources. With the current gap between processor speed and memory access time, a particularly important component of the overhead is slowed memory access for an interval after a CPU switches from one process to another, while data for the new process replaces data for the old process in the CPU's local cache.

As a result of the overhead incurred by switching processors between threads, if you start 144 identical single-threaded processes at once on our 72-processor machine, it will take longer for them all to finish than it would if you started 72 of them to begin with, then started the remaining processes as replacements for the original 72, as the original processes exit.
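
If each of your processes can be started by its own small script, one simple way to get this "replacement" behaviour is to let xargs cap the number of simultaneous processes. This is only a sketch: the jobs/ directory of per-process scripts is a made-up layout, and the -P option assumes a reasonably modern (GNU or BSD) xargs:

ls jobs/*.sh | xargs -n 1 -P 72 sh    # start the scripts one per process, at most 72 at a time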

Process priorities

The preceding example leads naturally into a description of how some processes can be starved for CPU time as a result of low process priorities. If you knew you would be the only person using entropy for the time it takes to run your 144 processes, there is a way you could start them all at once without throughput suffering: set the process priorities such that half of the processes run to completion before the other half run at all.

Look again at the top snapshot at the beginning of the tutorial. The PR column lists the priority of each process. Other factors being equal, the scheduler will run processes with numerically lower priorities ahead of processes with higher priorities.

So if you were entropy's only user, you could start 144 single-threaded processes at once without having more than 72 of them actually consume CPU time by setting their priorities such that processes 73 through 144 only get to run after their predecessors have exited. Here's one way to do so:

(Actually these commands won't quite work, for a reason we'll get to. And remember that this is just an illustration of how priorities operate; since you are not the only person using entropy, you should not actually start 144 processes this way.)

job1 &
job2 &
job3 &
job4 &
job5 &
job6 &
job7 &
. . .
job70 &
job71 &
job72 &
nice -n 1 job73 &
nice -n 2 job74 &
nice -n 3 job75 &
. . .
nice -n 70 job142 &
nice -n 71 job143 &
nice -n 72 job144 &

As you probably know, the “&” at the end of each command starts the command “in the background”, allowing you to start another command before this one finishes.

“nice -n n command” says to start command with a priority n lower than the default priority (the name comes from the fact that you are being “nice” to other users). So job1 through job72 run with the default priority, job73 with a slightly lower priority, job74 with a priority slightly lower yet, et cetera. As a result, jobs 1 through 72 will begin consuming CPU cycles right away, while jobs 73 through 144 remain idle. When one of jobs 1 through 72 exits, job73 will begin to consume CPU cycles. When another of jobs 1 through 73 exits, job74 will begin consuming CPU cycles, and so on. At no time will more than 72 processes be active, yet there will never be fewer than 72 processes active so long as 72 or more remain to run.

Now for the explanation of why these commands will not quite work as advertised. On entropy, the maximum niceness level is 19, so job91 through job144 all end up with the same priority. That means that once 19 of the higher-priority processes have terminated, all 54 of them will start receiving CPU cycles at once, causing the load average to jump from 72 to 125.
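
You can see the clamping for yourself by starting a throwaway process with an oversized niceness and asking ps what value it actually received (sleep here is just a stand-in for a real job, and whether nice also prints a warning about the out-of-range value depends on the version installed):

nice -n 72 sleep 600 &
ps -o pid,ni,comm -p $!     # the NI column shows 19, not 72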

Recognizing starved low-priority processes

To illustrate how different users can interfere with each other's plans, let's continue the thought experiment where you start 144 single-threaded processes, nicing half of them as shown above. This time, though, before any of your processes exits, another user, janedoe, starts six processes of her own at the default priority. For a while janedoe's six processes and your job1 through job72, all at the default priority, will share the 72 CPUs, sending the load average to 78. That means your job73 will not start to run until six of the original 72 jobs exit, rather than just one. If janedoe or other users keep starting more processes at the default priority in the meantime, your job73 may never get any CPU time at all.

So the difficulty with using nice factors to influence the order in which your own processes get resources is that other users may inadvertently undermine your plans. Given that entropy has 72 processors, the problem becomes acute when there are more than 72 long-running CPU-intensive processes and, among them, one or more whose priority is lower than that of 72 others. In that situation, the low-priority processes will get virtually no CPU time until the load average falls back below 72.

If you notice that somebody else's processes are being starved because some of your processes are running at a higher priority, you should reduce the priority of your processes using the renice command, or, equivalently, with top's r command.

Reducing the priority of running processes

The command

renice 15 822179 758066 822178 680590

will set the niceness of four of your processes to 15, reducing their priority (that's reduce in the sense of giving the processes a weaker claim on the CPU; the integer that top reports in its PR column would actually get larger). The four numbers after the 15 are process identifiers, taken from top's PID column.
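
If you don't have top open, ps can supply the same identifiers; for example, the following lists your own processes together with their current niceness (substitute your username for yourname):

ps -u yourname -o pid,ni,time,comm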

If you are feeling particularly magnanimous, you could use

renice 15 -u you

to set the niceness of all of your processes to 15 at once (substitute your own username for you), without typing their individual identifiers.

You can only change the priority of your own processes (and only decrease it). So if your processes are the ones being starved, you need to email either the other users or the system administrator (trouble@lcd.uregina.ca) to ask them to address the situation.

Scheduling jobs to start in the future

We motivated the discussion of process priorities with an example where they served as a (flawed) mechanism for starting a lot of processes at one time without having all of those processes actually consume resources at once. Here's a more realistic situation where you might want to schedule processes to run in the future, together with more practical methods of doing so.

(In passing, note that there are two mechanisms for scheduling processes to start in the future that are not discussed here: at and crontab. Both of these can start processes for you at particular times. What we are looking for, though, is a method for starting process B at whatever time process A is finished, so we do not have to know in advance how long A will run in order to schedule B.)

Let's say you have nine processes to execute that you expect to run roughly two hours each, and that top is currently showing a load average of about 69.

With a load average of 69, entropy has 3 idle processors, so you would like to start 3 of your processes right away, and set things up so that the remaining 6 will start in two waves later on, without your having to stay up until the wee hours of the morning to begin them by hand.

Here's a simple approximation of what you want:

( job1 ; job4 ; job7 ) &
( job2 ; job5 ; job8 ) &
( job3 ; job6 ; job9 ) &

where each “jobn” stands for the shell command that starts one of your processes. The semi-colon separating jobn from jobn+3 says “run jobn, then when it finishes, run jobn+3”. The parentheses around each triplet of jobs starts a subshell to run that series of shell commands, while the & at the end of each line puts the subshell into the background, so that the next subshell can start running before the previous one terminates.
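
One small variation worth knowing about: the semi-colon runs jobn+3 whether or not jobn succeeded. If a later job is only worth running when the earlier one exits successfully, separate the commands with && instead, for example:

( job1 && job4 && job7 ) &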

In practice—since each jobn may stand for quite a bit of typing—it might be more convenient to put the commands for each subshell into a separate shell script file, and then start each of those shell scripts in the background, something like this:

sh scriptA &
sh scriptB &
sh scriptC &

(assuming that your script files can be run by the default shell, /bin/sh; otherwise you might need to substitute csh, tcsh, or bash).
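
For example, scriptA might contain nothing more than the commands of the first subshell, one per line (job1, job4, and job7 again stand for whatever commands start your processes):

#!/bin/sh
job1
job4
job7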

The use of separate subshells, each executing a series of processes in sequence, leaves open the possibility that we might leave processors idle unnecessarily. Say that job7 terminates while job5 and job6 are still running. Assuming that job8 doesn't depend on the results of job5, we would like to be able to start it up early, on the processor freed by the completion of the first subshell.

You can set up such a scheme, but it involves more sophisticated shell programming. For the adventurous, here is a sketch of a shell script that reads the commands to start new processes from a file, one command per line, and then executes those commands as required to keep a given number of processes running. It is not polished or robust enough to serve as a tool for general use, but it may be a useful model for people comfortable with shell programming.
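
The sketch assumes bash (it relies on the wait -n builtin, available since bash 4.3), takes the file of commands and the number of processes to keep running as its two arguments, and uses keepbusy.sh and commands.txt purely as placeholder names:

#!/bin/bash
# keepbusy.sh -- read one command per line from a file and keep at most
# a fixed number of them running at any one time.
# usage: bash keepbusy.sh commands.txt 72

cmdfile=$1
max=${2:-72}                        # how many processes to keep running at once
running=0

while IFS= read -r cmd; do
    [ -z "$cmd" ] && continue       # skip blank lines
    sh -c "$cmd" &                  # start the next command in the background
    running=$((running + 1))
    if [ "$running" -ge "$max" ]; then
        wait -n                     # block until any one background job exits
        running=$((running - 1))
    fi
done < "$cmdfile"

wait                                # let the last few jobs finish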

Ideally, we would like a tool that would allow all users to submit jobs to a single shared queue, and which would start jobs from that queue as required to keep entropy's load average at 72. The difficulty is that simple home-grown tools might not deal properly with the security implications of starting different users' jobs from a single queue, while full-blown queuing systems like OpenPBS and Grid Engine have features for resource scheduling and distributed computing that go beyond our needs. So, at least in the near term, we are continuing to rely on users' courtesy and good sense.

John Jorgensen