Job Progress
After you submit a job, it passes through several states. The most common are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. The list below gives these and a few other frequently seen states, with their short codes:
PD Pending. Job is waiting for resource allocation
R Running. Job has an allocation and is running
S Suspended. Execution has been suspended and resources have been released for other jobs
CA Cancelled. Job was explicitly cancelled by the user or the system administrator
CG Completing. Job is in the process of completing. Some processes on some nodes may still be active
CD Completed. Job has terminated all processes on all nodes with an exit code of zero
F Failed. Job has terminated with non-zero exit code or other failure condition
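When post-processing monitoring output in a script, it can help to translate the short codes back into their long names. The following is a minimal sketch covering only the states listed above; the function name state_name is hypothetical and not part of Slurm.

```shell
# Map a Slurm short state code to its long name.
# Covers only the states listed above; any other input is echoed back unchanged.
state_name() {
    case "$1" in
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        CA) echo "CANCELLED" ;;
        CG) echo "COMPLETING" ;;
        CD) echo "COMPLETED" ;;
        F)  echo "FAILED" ;;
        *)  echo "$1" ;;
    esac
}

state_name PD   # prints PENDING
```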
Slurm provides commands for monitoring your jobs. You can also use the Live Cluster Status web page for a quick overview of all jobs. To be alerted at specific job events, specify your email address in your job script; see the Sample Job Scripts section for help configuring email alerts.
Monitoring Commands
squeue
The squeue command provides an overview of jobs in the scheduling queue (state information, allocated resources, runtime, etc.).
Syntax
squeue [options]
Common options
--user=<user[,user[,...]]> Display jobs owned by a comma-separated list of users
--jobs=<job_id[,job_id[,...]]> Display specific jobs, given as a comma-separated list of job IDs
--partition=<part[,part[,...]]> Display jobs from a comma-separated list of partitions
--states=<state[,state[,...]]> Display jobs in specific states, given as a comma-separated list or "all". Default: "PD,R,CG"
The default output format is as follows:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
where
JOBID Job or step ID. For array jobs, the job ID format will be of the form <job_id>_<index>
PARTITION Partition of the job/step
NAME Name of the job/step
USER Owner of the job/step
ST State of the job/step. See above for a description of the most common states
TIME Time used by the job/step. Format is days-hours:minutes:seconds (days and hours are printed only as needed)
NODES Number of nodes allocated to the job, or the minimum number of nodes required by a pending job
NODELIST(REASON) For pending jobs: the reason the job is pending. For failed jobs: the reason the job failed. For all other job states: the list of allocated nodes.
Examples
List all jobs of user foo that are in the default states (pending, running, or completing):
squeue --user=foo
List all running jobs of user foo in partition bar:
squeue --user=foo --partition=bar --states=R
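The options above can be combined with squeue's --noheader and --format options to produce output that is easy to process in a script. The sketch below counts jobs per state; it assumes the "%t" format specifier (compact job state), and the helper name count_by_state is hypothetical. The demonstration uses made-up input so that it runs without a scheduler.

```shell
# Count jobs per state. Reads one short state code per line on stdin
# and prints "<state> <count>", sorted by state code.
count_by_state() {
    sort | uniq -c | awk '{print $2, $1}'
}

# Intended use on a cluster (not run here):
#   squeue --noheader --format="%t" --user=foo --states=all | count_by_state

# Demonstration with made-up input instead of real squeue output:
printf 'R\nPD\nR\nCG\n' | count_by_state
```

Printing only the state column avoids splitting the default output on whitespace, which can misalign if a job name contains spaces.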
sfqueue
The sfqueue command provides the queue status, including the number of CPUs and GPUs used. The output is similar to that displayed on the live status web page.
Examples
sfqueue
sgpu
To see the GPU memory usage of a job, use the sgpu command:
sgpu <your-jobid-here>
scontrol
The scontrol command provides detailed information about jobs and job steps.
Syntax
scontrol [options] [command]
Examples
Show detailed information about job with ID 1536:
scontrol show jobid 1536
Show even more detailed information about job with ID 1396 (including the job script):
scontrol -dd show jobid 1396
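scontrol prints job details as Key=Value pairs, so a single field can be picked out with standard text tools. The following is a minimal sketch, assuming the value of interest contains no spaces (true for a field such as JobState, but not for every field); the helper name job_field is hypothetical. The demonstration uses a made-up line in the same layout.

```shell
# Extract the value of one Key=Value field from "scontrol show jobid" output.
# Assumes the value contains no spaces.
job_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Intended use on a cluster (not run here):
#   scontrol show jobid 1536 | job_field JobState

# Demonstration with a made-up line in scontrol's Key=Value layout:
printf '   JobState=RUNNING Reason=None Dependency=(null)\n' | job_field JobState
```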
sstat
The sstat command provides detailed usage information about running jobs.
Syntax
sstat [options] -j <job(.stepid)>
Examples
Show usage information about the running job with ID 1536:
sstat -j 1536
Show usage information about the running job with ID 1396, with verbose output:
sstat -v -j 1396
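The default sstat output is wide. To the best of our knowledge, sstat accepts --format to select columns and --noheader with --parsable2 to emit pipe-delimited lines without a header; check the sstat manual page on your system before relying on these. The sketch below labels three such columns; the helper name label_usage is hypothetical and the input line is made up.

```shell
# Turn pipe-delimited "JobID|MaxRSS|AveCPU" lines into labelled output.
label_usage() {
    awk -F'|' '{ printf "step=%s maxrss=%s avecpu=%s\n", $1, $2, $3 }'
}

# Intended use on a cluster (not run here):
#   sstat --noheader --parsable2 --format=JobID,MaxRSS,AveCPU -j 1536 | label_usage

# Demonstration with a made-up line instead of real sstat output:
printf '1536.0|20480K|00:01:30\n' | label_usage
```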