In this example, we'll use the PyTorch MNIST example.
Get the source code from https://github.com/pytorch/examples/tree/main/mnist
and save the Python code as mnist.py.
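If you prefer to fetch the script from the command line, something like the following should work (a minimal sketch; it assumes the training script is the file main.py in the repository's mnist directory on the main branch):
# Download the MNIST example and save it locally as mnist.py
wget -O mnist.py https://raw.githubusercontent.com/pytorch/examples/main/mnist/main.py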
We will then use the following SLURM script, pytorch-gpu.sh,
to run the code:
#!/bin/bash
#SBATCH --job-name="PyTorch-GPU-Demo" # job name
#SBATCH --partition=peregrine-gpu # partition to which job should be submitted
#SBATCH --qos=gpu_debug # qos type
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task
#SBATCH --mem=4G # total memory per node
#SBATCH --gres=gpu:nvidia_a100_3g.39gb:1 # request one 3g.39gb MIG slice of an A100 GPU
#SBATCH --time=00:05:00 # wall time
module purge
module load python/anaconda
python mnist.py --epochs=3
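Before submitting, it can be useful to confirm that the loaded Python environment actually provides PyTorch with CUDA support. A quick interactive check (note that on a GPU-less login node torch.cuda.is_available() may print False even though the batch job will see a GPU):
module load python/anaconda
# Print the PyTorch version and whether a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"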
Submit the job as
sbatch pytorch-gpu.sh
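While the job is queued or running, you can check its status with the standard SLURM commands:
# List your pending and running jobs
squeue -u $USER
# Cancel a job if needed (replace <jobid> with the ID printed by sbatch)
scancel <jobid>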
The output will be saved in a file named slurm-####.out,
where #### is the job ID assigned by SLURM, and should look like:
Train Epoch: 1 [0/60000 (0%)] Loss: 2.299824
Train Epoch: 1 [640/60000 (1%)] Loss: 1.733667
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.933156
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.623502
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.357575
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.315663
-----------------------------------------------------------
-------------------TRUNCATED-------------------------------
-----------------------------------------------------------
Train Epoch: 3 [55680/60000 (93%)] Loss: 0.009016
Train Epoch: 3 [56320/60000 (94%)] Loss: 0.241464
Train Epoch: 3 [56960/60000 (95%)] Loss: 0.004863
Train Epoch: 3 [57600/60000 (96%)] Loss: 0.004337
Train Epoch: 3 [58240/60000 (97%)] Loss: 0.109445
Train Epoch: 3 [58880/60000 (98%)] Loss: 0.038164
Train Epoch: 3 [59520/60000 (99%)] Loss: 0.014446
Test set: Average loss: 0.0333, Accuracy: 9887/10000 (99%)
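After the job completes, basic accounting information is available through SLURM's sacct command (replace <jobid> with your job's ID):
# Show elapsed time, final state, and exit code for the finished job
sacct -j <jobid> --format=JobID,JobName,Elapsed,State,ExitCode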