PyTorch

In this example, we’ll use the PyTorch MNIST example. Get the source code from https://github.com/pytorch/examples/tree/main/mnist and save the Python code as mnist.py.

We will now use the following SLURM script, pytorch-gpu.sh, to run the code:

#!/bin/bash

#SBATCH --job-name="PyTorch-GPU-Demo"    # job name
#SBATCH --partition=peregrine-gpu        # partition to which the job is submitted
#SBATCH --qos=gpu_debug                  # QOS type
#SBATCH --nodes=1                        # node count
#SBATCH --ntasks=1                       # total number of tasks across all nodes
#SBATCH --cpus-per-task=1                # CPU cores per task
#SBATCH --mem=4G                         # total memory per node
#SBATCH --gres=gpu:nvidia_a100_3g.39gb:1 # request one A100 MIG 3g.39gb instance
#SBATCH --time=00:05:00                  # wall time

module purge
module load python/anaconda

python mnist.py --epochs=3
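The --gres line above requests a specific MIG-partitioned slice of an A100, so the device name is cluster-specific. On systems that expose whole GPUs, a generic request is common; this is a sketch only, and you should check your site's documentation (or the GRES types listed by `scontrol show node`) for what is actually available:

```shell
#SBATCH --gres=gpu:1    # request one GPU of any available type (site-dependent)
```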

Submit the job with:

sbatch pytorch-gpu.sh

The output will be saved in a file named slurm-####.out (where #### is the job ID) and should look like the following:

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.299824
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.733667
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.933156
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.623502
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.357575
Train Epoch: 1 [3200/60000 (5%)]	Loss: 0.315663


-----------------------------------------------------------
-------------------TRUNCATED-------------------------------
-----------------------------------------------------------

Train Epoch: 3 [55680/60000 (93%)]	Loss: 0.009016
Train Epoch: 3 [56320/60000 (94%)]	Loss: 0.241464
Train Epoch: 3 [56960/60000 (95%)]	Loss: 0.004863
Train Epoch: 3 [57600/60000 (96%)]	Loss: 0.004337
Train Epoch: 3 [58240/60000 (97%)]	Loss: 0.109445
Train Epoch: 3 [58880/60000 (98%)]	Loss: 0.038164
Train Epoch: 3 [59520/60000 (99%)]	Loss: 0.014446

Test set: Average loss: 0.0333, Accuracy: 9887/10000 (99%)
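Each training line in the log follows a fixed pattern: epoch number, samples seen out of the dataset total, percent complete, and the current loss. If you want to post-process a run (e.g. to plot the loss curve), the lines can be parsed with the standard library. This is a minimal sketch assuming the exact format shown above; `parse_train_line` is a hypothetical helper, not part of the PyTorch example:

```python
import re

# Matches lines like: "Train Epoch: 3 [57600/60000 (96%)]\tLoss: 0.004337"
LINE_RE = re.compile(
    r"Train Epoch: (?P<epoch>\d+) "
    r"\[(?P<seen>\d+)/(?P<total>\d+) \((?P<pct>\d+)%\)\]\s+"
    r"Loss: (?P<loss>[0-9.]+)"
)

def parse_train_line(line):
    """Return (epoch, samples_seen, total, loss), or None if the line
    is not a training-progress line (e.g. the final test summary)."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return (int(m.group("epoch")), int(m.group("seen")),
            int(m.group("total")), float(m.group("loss")))

if __name__ == "__main__":
    sample = "Train Epoch: 3 [57600/60000 (96%)]\tLoss: 0.004337"
    print(parse_train_line(sample))  # (3, 57600, 60000, 0.004337)
```

Feeding every line of slurm-####.out through this helper and keeping the non-None results gives the full loss history for the run.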