Monitoring Jobs

Job Progress

After you submit a job, it passes through several states. The most common are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. The list below gives these and a few other frequently seen states with their short codes; it is not exhaustive:

PD     Pending. Job is waiting for resource allocation
R      Running. Job has an allocation and is running
S      Suspended. Job has an allocation, but execution has been suspended and its CPUs have been released for other jobs
CA     Cancelled. Job was explicitly cancelled by the user or the system administrator
CG     Completing. Job is in the process of completing. Some processes on some nodes may still be active
CD     Completed. Job has terminated all processes on all nodes with an exit code of zero
F      Failed. Job has terminated with non-zero exit code or other failure condition
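
When post-processing monitoring output in a script, it can help to translate the short codes back into full state names. The following is a minimal sketch covering only the codes listed above; the function name state_name is hypothetical and not part of Slurm.

```shell
# Translate a short state code into its full state name.
# Only the codes listed above are covered; anything else prints UNKNOWN.
state_name() {
    case "$1" in
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        CA) echo "CANCELLED" ;;
        CG) echo "COMPLETING" ;;
        CD) echo "COMPLETED" ;;
        F)  echo "FAILED" ;;
        *)  echo "UNKNOWN" ;;
    esac
}
```

For example, state_name CG prints COMPLETING.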

Slurm provides commands that you can use to monitor your jobs. You can also use the Live Cluster Status web page for a quick glance at all jobs. To be alerted by email at specific job events, specify your email address in your job script; see the Sample Job Scripts section for help configuring email alerts.

Monitoring Commands

squeue

The command squeue provides an overview of jobs in the scheduling queue (state information, allocated resources, runtime, etc.).

Syntax

squeue [options]

Common options

--user=<user[,user[,...]]>          Display jobs owned by the users in a comma-separated list
--jobs=<job_id[,job_id[,...]]>      Display only the jobs in a comma-separated list of job IDs
--partition=<part[,part[,...]]>     Display jobs from a comma-separated list of partitions
--states=<state[,state[,...]]>      Display jobs in specific states. Comma-separated list or "all". Default: "PD,R,CG"

The default output format is as follows:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

where

JOBID              Job or step ID. For array jobs, the job ID format will be of the form <job_id>_<index>
PARTITION          Partition of the job/step
NAME               Name of the job/step
USER               Owner of the job/step
ST                 State of the job/step. See above for a description of the most common states
TIME               Time used by the job/step. Format is days-hours:minutes:seconds
                   (days,hours only printed as needed)
NODES              Number of nodes allocated to the job or the minimum number of nodes required
                   by a pending job
NODELIST(REASON)   For pending jobs: Reason why pending.
                   For failed jobs: Reason why failed.
                   For all other job states: List of allocated nodes.
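
The TIME column is convenient to read but awkward to compare in scripts. The sketch below converts a value in the days-hours:minutes:seconds format described above into seconds. The function name time_to_seconds is hypothetical, and the sketch assumes that a value with two fields means minutes:seconds and a value with three fields means hours:minutes:seconds.

```shell
# Convert a squeue TIME value ([days-][hours:]minutes:seconds) to seconds.
time_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        # Split off the optional "days-" prefix.
        dash = index(rest, "-")
        if (dash > 0) { days = substr(rest, 1, dash - 1); rest = substr(rest, dash + 1) }
        # The remainder has one, two, or three colon-separated fields.
        n = split(rest, p, ":")
        if (n == 3)      { h = p[1]; m = p[2]; s = p[3] }
        else if (n == 2) { h = 0;    m = p[1]; s = p[2] }
        else             { h = 0;    m = 0;    s = p[1] }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}
```

For example, time_to_seconds 1-02:03:04 prints 93784.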

Examples

List all jobs of user foo that are currently in the queue (pending, running, or completing):

squeue --user=foo

List all jobs of user foo in partition bar that are in the running state:

squeue --user=foo --partition=bar --states=R
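
For a quick summary, the squeue output can be piped into a small helper that counts jobs per state. This is only a sketch: the function name count_states is hypothetical, and it expects one state code per line, which squeue can produce with its --noheader and --format=%t options.

```shell
# Read one state code per line on stdin and print "<state> <count>",
# sorted by state code. Blank lines are ignored.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster (user foo is the placeholder from the examples above):
#   squeue --user=foo --states=all --noheader --format=%t | count_states
```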

sfqueue

The sfqueue command shows the queue status, including the number of CPUs and GPUs in use. The output is similar to that displayed on the Live Cluster Status web page.

Examples

sfqueue

sgpu

To see the GPU memory usage of a job, pass its job ID to the sgpu command:

sgpu <your-jobid-here>

scontrol

The scontrol command provides detailed information about jobs and job steps.

Syntax

scontrol [options] [command]

Examples

Show detailed information about job with ID 1536:

scontrol show jobid 1536

Show even more detailed information about job with ID 1396 (including the job script):

scontrol -dd show jobid 1396
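
The scontrol output consists of Key=Value pairs, so a single field can be pulled out with standard text tools. The helper below is a rough sketch: the name job_field is hypothetical, and it assumes the value contains no spaces, which holds for fields such as JobState or TimeLimit but not necessarily for paths.

```shell
# Print the value of the first Key=Value pair on stdin whose key matches $1.
# Assumes values contain no whitespace.
job_field() {
    tr -s ' \t' '\n' | awk -v key="$1" '
        !seen && index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            seen = 1
        }'
}

# Usage on the cluster:
#   scontrol show jobid 1536 | job_field JobState
```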

sstat

The command sstat provides detailed usage information about running jobs.

Syntax

sstat [options] -j <job(.stepid)>

Examples

Show usage information for the running job with ID 1536:

sstat -j 1536

Show usage information for job with ID 1396 with verbose output:

sstat -v -j 1396
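
To find the job step with the highest memory use, the sstat output can be restricted to selected fields and post-processed. The following is a sketch under stated assumptions: the function name peak_rss_kib is hypothetical, the input is expected in the JobID|MaxRSS form that sstat prints with --parsable2 --noheader --format=JobID,MaxRSS, suffixes K, M, and G are taken to be powers of 1024, and a bare number is assumed to be in bytes. Check the units on your system before relying on the result.

```shell
# Read "JobID|MaxRSS" lines on stdin and print the step with the largest
# MaxRSS as "<step> <KiB>". Suffix handling is an assumption (see above).
peak_rss_kib() {
    awk -F'|' '
        NF >= 2 && $2 != "" {
            v = $2
            unit = substr(v, length(v))
            num = substr(v, 1, length(v) - 1)
            if (unit == "K")      kib = num
            else if (unit == "M") kib = num * 1024
            else if (unit == "G") kib = num * 1024 * 1024
            else                  kib = v / 1024   # assumption: bare number is bytes
            kib += 0                               # force numeric comparison
            if (kib > max) { max = kib; step = $1 }
        }
        END { if (step != "") printf "%s %d\n", step, max }
    '
}

# Usage on the cluster:
#   sstat --allsteps --parsable2 --noheader --format=JobID,MaxRSS -j 1536 | peak_rss_kib
```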