Job Handling

This section describes how the scheduler is configured on the Falcon cluster and how to use it.

Scheduler

A scheduler, also known as a task manager or a batch-queuing system, acts as a resource manager that provides users with fair and efficient access to the cluster's resources (such as CPUs, GPUs, and memory). Many different resource managers are available; on Falcon we use SLURM.
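Once logged in, you can query SLURM directly from the command line. For example, the standard sinfo command lists the cluster's partitions and node states (the column headers below are sinfo's defaults; the actual rows depend on Falcon's configuration):

    $ sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    ...

Similarly, the standard squeue command shows the jobs currently queued and running.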

Jobs

The way a program is run on an HPC cluster differs from how it is run on a typical workstation. When you log into the cluster, you interact only with the login node(s). However, since the compute nodes are where the cluster's actual computing power is located, programs need to run on them, not on the login nodes.

Since users cannot log in to the compute nodes directly, you must ask the cluster's scheduling system to run your program there. To do this, you submit a job script containing instructions on how to run your program. The scheduler then executes this script on the compute nodes as a job.
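A minimal job script is sketched below. The resource values and the program name (my_program) are placeholders, not Falcon-specific defaults; adjust them to your own requirements:

    #!/bin/bash
    #SBATCH --job-name=my_job      # a name for the job
    #SBATCH --ntasks=1             # number of tasks (processes)
    #SBATCH --cpus-per-task=4      # CPU cores per task
    #SBATCH --mem=8G               # memory for the whole job
    #SBATCH --time=01:00:00        # wall-clock time limit (HH:MM:SS)

    # commands to run on the compute node
    ./my_program

Saved as, say, my_job.sh, the script is submitted with sbatch, which replies with the assigned job ID (the ID 123456 here is illustrative):

    $ sbatch my_job.sh
    Submitted batch job 123456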

Partitions

Partitions group nodes into logical sets of resources. They can be thought of as job queues, each with its own set of resource limits. Users submit their jobs to the partition that best suits their job's resource requirements.
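A partition is requested with the --partition option, either as a directive inside the job script:

    #SBATCH --partition=short      # hypothetical partition name

or directly on the sbatch command line:

    $ sbatch --partition=short my_job.sh

The partition name short here is hypothetical; run sinfo to see the partitions actually available on Falcon.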

Refer to the sub-sections on the left for detailed information on partitions, QoS, submitting jobs, and more.