Compute Jobs

Where can I run my resource intensive jobs?

The CS Department network has been carefully configured to serve as a single integrated resource consisting of many powerful workstations physically distributed across all the labs and individual desktops. This open distributed configuration gives desktop users access to state-of-the-art hardware while simultaneously providing a powerful compute cluster for background jobs. Generally speaking, users may run resource intensive jobs on any machine which they can log onto, provided the jobs don’t interfere with other users.

The privilege to run background jobs, however, comes with a specific set of rules that must be followed and responsibilities that must be upheld. All long-running and/or background jobs should be run in a way that minimizes their impact on other users, and interactive users in particular. Jobs should be run at reduced (niced) scheduling priority, memory constraints should be respected, and network and disk I/O loads should be kept within reason. If your jobs in any way annoy an interactive user then you are doing something wrong, and, right or wrong aside, you must take action to correct the situation. This includes terminating some or all of the jobs within a reasonable time frame, and determining how to eliminate the interference before attempting to restart the jobs.

One of the best ways to avoid annoying another user is communication. Within the department there is a long tradition of using informal communication and coordination to mitigate resource competition, and this generally works quite well. If a user feels that attempts to work out a problem with another user have not been productive, that user should refer the matter to the systems administration staff. While rare, this may happen, for instance, when there is a dispute over whose jobs are at fault, or when the owner of interfering jobs does not respond to email. Jobs that are interfering may be terminated by systems administration without notice. Repeat offenders may face enforced limits on their use of machines on the CS Department network.

See the following section for hints on analyzing or minimizing the impact of your jobs.


What are some commands that can help determine the resources that my job is using?

top and sar are general-purpose tools for viewing process resource utilization. free and vmstat are useful for viewing memory utilization, iostat for viewing I/O activity, and vnstat for monitoring network throughput (see its --live option).
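For example, the following invocations (a sketch; the sampling intervals are arbitrary choices, not department recommendations) give a quick snapshot of a machine's CPU, memory, disk, and network activity:

```shell
top -b -n 1 | head -n 15   # one batch-mode snapshot of the busiest processes
free -h                    # memory and swap usage, human-readable
vmstat 5 3                 # memory/CPU statistics, 3 samples at 5-second intervals
iostat -x 5 3              # extended per-device I/O statistics, same sampling
vnstat --live              # live network throughput (Ctrl-C to stop)
```

Note that iostat and sar come from the sysstat package and vnstat from its own package, so they may not be installed on every machine.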


What CPU priority should I use for my resource intensive jobs?

It is mandatory social etiquette that all long-running and background jobs be run with a nice value of 19. To do this:

nice -n 19 COMMAND

or, using the shell built-in version of nice in tcsh:

nice +19 COMMAND

Or, on an already-running job (where PID is the process ID):

renice +19 PID

For further information, consult the renice and nice man pages.

Note: If your job is multi-threaded, don’t forget to renice the individual threads.
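As a sketch of how to do this (assuming a Linux /proc filesystem, with PID standing in for your job's process ID):

```shell
# Nice values are per-thread on Linux, so renice each thread ID
# listed under /proc/PID/task rather than just the main process.
for tid in /proc/PID/task/*; do
    renice -n 19 -p "$(basename "$tid")"
done
```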


How can I kill a process?

You may use the kill command to kill a process. The command accepts various signal values which can be sent to the process. The default signal is SIGTERM (terminate the process). So, to kill a process gracefully:

kill PID

To kill a process forcibly with SIGKILL, which cannot be caught or ignored (use this only if SIGTERM fails):

kill -9 PID
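Putting the two together, a common pattern (a sketch; myjob is a hypothetical process name) is to try SIGTERM first and fall back to SIGKILL only if the process survives:

```shell
pid=$(pgrep -u "$USER" myjob)        # find your job's PID by name
kill "$pid"                          # ask it to terminate (SIGTERM)
sleep 5                              # give it time to clean up
if kill -0 "$pid" 2>/dev/null; then  # still alive?
    kill -9 "$pid"                   # force it (SIGKILL)
fi
```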

How can I prevent a process from consuming too much memory?

Typically, jobs should not use more physical memory than is available on the machine. Once a machine begins to swap pages to disk because of a shortage of memory, problems will result. In calculating the fit of your job to available memory, accommodation should also be made for memory to be used by other users. The amount of physical RAM available on each machine is listed in the file ~info/machines.

A resource limit on virtual memory size can be set with the ulimit command in bash or with limit in tcsh.

For example, in bash:

ulimit -v 2000000

or in tcsh:

limit vmemoryuse 2000000

will prevent a job from using more than about 2G of virtual memory (the units are kilobytes). The limit applies to the current login session only, and a process that exceeds it will fail to allocate memory, which typically causes it to crash.
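Since ulimit otherwise affects the whole session, one way (a sketch, with ./big_job as a hypothetical command) to cap a single job in bash without constraining your login shell is to set the limit inside a subshell:

```shell
# The limit applies only inside the parentheses; the parent
# shell's limits are untouched.
( ulimit -v 2000000; ./big_job )
```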


How can I prevent a process from over-loading the file servers?

The CS department Linux clients share several NFS file servers. If a resource intensive job is performing many reads or writes to the NFS file servers, then performance will slow down for everyone.

If your job is performing many reads and writes or uses a large dataset, then you should consider keeping a copy on the local disk until the job has completed and then copying the data back to NFS. See the temporary disk space section for more information on using local, temporary disk space.
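As a sketch of this staging pattern (the dataset path and ./myjob command are hypothetical, and the scratch location should be whatever local space the temporary disk space section describes):

```shell
scratch=$(mktemp -d)                         # private local scratch directory
cp -r ~/dataset "$scratch/dataset"           # stage input from NFS to local disk
./myjob "$scratch/dataset" "$scratch/out"    # run against the local copy
cp -r "$scratch/out" ~/results               # copy results back to NFS
rm -rf "$scratch"                            # clean up local scratch
```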


What I/O priority should I use for jobs that perform lots of local reads and writes?

If your job is performing many reads and/or writes to the local disks, the I/O priority should be set to 7 with the ionice command. To do this:

ionice -n 7 COMMAND

Or, on an already-running job (where PID is the process ID):

ionice -n 7 -p PID

For further information, consult the ionice man pages.
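The CPU and I/O priorities from the two sections above can be combined on one command line (a sketch; -c 2 names the best-effort scheduling class, which is what a bare -n 7 implies):

```shell
nice -n 19 ionice -c 2 -n 7 COMMAND
```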


What is the significance of the Linux load average?

The load average is an important indicator of how loaded (or overloaded) a machine is. It is shown as the three numbers in the upper right-hand corner of the info displayed by the Linux top or w commands, and is defined as the average length of the kernel process run queue over a period of time (specifically 1, 5, and 15 minutes in the case of top and w). If the load average exceeds the number of cores on a machine, then runnable processes are waiting their turn for CPU time. The larger this excess, the more per-process performance will drop off, impacting everyone using the machine. As a general rule, background jobs should be configured so as not to cause the system load average to exceed the number of cores.

You can find the number of cores on a machine with the following command: nproc
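For example, a quick check (a sketch) comparing the 1-minute load average against the core count:

```shell
cores=$(nproc)
load=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
echo "1-minute load: $load on $cores cores"
```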