Run Jobs on Slurm systems with GPUs
We will first use the UChicago Midway 2 cluster as an example. Later we will also introduce setups for ACCESS machines. As a Ph.D. student, you can apply for an EXPLORE account on ACCESS for free and get access to multiple computing clusters in the US.
Run a GPU task on Midway 2
First of all, when you are on a login node, you usually have no access to a GPU. If you want to test PyTorch in a GPU environment, it is better to enter a GPU session first. There are two important settings: (1) the partition -p; (2) the --gres option, which specifies the number and type of GPUs. The partition and GPU names usually differ between clusters, so we need to refer to the UChicago RCC User Guide for the right settings. From the user guide, we know there is an rcchelp qos command on midway2 that shows the available partitions. There is a gpu2 partition in there, and we will use that for GPU tasks.
I found that the domain name is not that obvious in the user guide. It is there, but I’ll put it down here for ease of use: midway2.rcc.uchicago.edu.
You can connect to the Midway 2 cluster with ssh CNETID@midway2.rcc.uchicago.edu. It will ask for your CNETID password and two-factor authentication, which is usually a Duo push for UChicago students. To get access to a GPU on Midway, run the following command after logging in.
srun --gres=gpu:1 -p gpu2 --pty /bin/bash
It may take some time (usually a few minutes) to enter the interactive session. We will need a Python environment that has PyTorch with CUDA. The easiest way is to use the module load command to get a pre-installed environment. For instance, we can see pytorch/1.2 after running module avail pytorch. The only issue is that pre-installed libraries are often outdated. We will address this problem later when we install a recent PyTorch 2.x, but for now, we can just load that module.
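In a shell on the cluster, that looks like the following (module names differ between clusters and change over time):
module avail pytorch   # list the pre-installed PyTorch modules
module load pytorch/1.2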
To see whether the module provides the environment we need, we can start an interactive Python session and run the following code. If you are following along, you should see the same output. If CUDA is not available, make sure you are inside the GPU session started by the srun command above.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.__version__
'1.12.0+cu102'
Now that we are certain PyTorch is working properly, we can exit the GPU session and run our GPU task as a batch job. I’ll show you an example SBATCH script that trains a simple convolutional neural network (CNN) with PyTorch. Make sure you are using the right account.
#!/bin/bash
#SBATCH --job-name=simplecnn
#SBATCH --output=simplecnn.out
#SBATCH --error=simplecnn.err
#SBATCH --account=pi-foster
#SBATCH --time=00:10:00
#SBATCH --partition=gpu2
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=2000
module load pytorch/1.2
python train_simplecnn.py
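To submit and monitor the job, the usual Slurm commands apply (the script filename below is my own choice; name yours however you like):
sbatch train_simplecnn.sbatch
squeue -u $USER          # check whether the job is pending or running
tail -f simplecnn.out    # follow the job's stdout once it starts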
The example GPU training task
The Python script is shown below. We train a simple CNN on the CIFAR-10 dataset. I added a simple logger module to the code so you can verify that the GPU is utilized during the task.
Note that on supercomputers, the compute nodes sometimes have no access to the Internet. It is better to prepare the datasets beforehand and store them in your supercomputer’s filesystem.
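For example, you can download CIFAR-10 once from a login node (which does have Internet access) before submitting the job; the ./data path here is just my assumption and must match the root used in the training script:
python -c "import torchvision; torchvision.datasets.CIFAR10(root='./data', train=True, download=True); torchvision.datasets.CIFAR10(root='./data', train=False, download=True)"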
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time
import logging
import os
from datetime import datetime
from pathlib import Path

formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')

def setup_logger(name, log_file, level=logging.INFO):
    """Set up as many loggers as you want."""
    handler = logging.FileHandler(log_file)
    handler.setFormatter(formatter)
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger

current_time = datetime.now()
log_prefix = f"logs/{current_time.month}-{current_time.day}-{current_time.year}_{current_time.hour}-{current_time.minute}-{current_time.second}"
os.makedirs(log_prefix, exist_ok=True)
logger = setup_logger('app_logger', f'{log_prefix}/app.log')

# Check if a GPU is available
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
print(device)
logger.info(f"Using device: {device}")

# Define the transformation for the dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load the CIFAR-10 dataset (download it beforehand; compute nodes may be offline)
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=False, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=32, shuffle=True, num_workers=2)

classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

# Define a simple convolutional neural network
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize the model
model = SimpleCNN().to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Train the model
start_time = time.time()
for epoch in range(20):  # More epochs can be added to extend training time
    running_loss = 0.0
    epoch_loss = 0.0
    model.train()
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        epoch_loss += loss.item()
        if i % 200 == 199:  # Print every 200 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0  # Reset the window so each printout is a fresh average

    # Evaluate on the test set after each epoch
    model.eval()
    with torch.inference_mode():
        correct = 0
        total = 0
        for data in testloader:
            images, labels = data
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        accuracy = 100 * correct / total
        print(f"epoch: {epoch + 1}, accuracy: {accuracy}, total: {total}, correct: {correct}")
        logger.info(f"epoch: {epoch + 1}, loss: {epoch_loss / len(trainloader):.3f}, accuracy: {accuracy}, total: {total}, correct: {correct}")

print("Training finished. Total training time:", time.time() - start_time, "seconds")
logger.info(f"Training finished. Total training time: {time.time() - start_time} seconds")

MODEL_PATH = "trained_model.pth"
# Save the trained model
torch.save(model.state_dict(), MODEL_PATH)
print("Trained model saved.")
logger.info(f"Trained model saved to {Path(os.getcwd()) / MODEL_PATH}")
How to use the latest PyTorch?
We can load a relatively new Anaconda module and install PyTorch ourselves. We’ll use conda to create a new virtual environment; it will be stored in our home directory, so we can activate it quickly next time.
conda create --name tgpu python=3.11
After creating the conda environment, enter a GPU session and run the following commands to install PyTorch with CUDA support.
conda activate tgpu
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Then you can use torch.cuda.is_available() to test whether the CUDA toolkit is properly installed, and check that torch.__version__ is now the latest. You should see the following output.
>>> import torch
>>> torch.__version__
'2.2.1+cu118'
>>> torch.cuda.is_available()
True
Run GPU jobs on ACCESS machines
You may have different user IDs on different ACCESS machines. You should always check your user ID on the ACCESS Allocation Management Page. I save all my usernames and login domain names in my ssh config file for convenience (shared at the end of this post).
Johns Hopkins University - Rockfish Cluster
The User Guide is always the starting point. The login node’s domain name is login.rockfish.jhu.edu. This cluster only needs a password to log in and does not require two-factor authentication. It does not support public/private key authentication by default; if you want to use RSA authentication, you have to contact the support team.
There is a pyTorch/1.8.1-cuda-11.1.1 module pre-installed, which we can use via module load.
According to their documentation, we can use the following script to submit a batch job. But as of now, I still have a QOS problem: if I specify qos_gpu, it says “salloc: error: Job submit/allocate failed: Invalid qos specification”; if I don’t specify it, it says “Job’s QOS not permitted to use this partition (a100 allows qos_gpu,qos_gpu_condo,urgent not normal)”.
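One way to debug this kind of mismatch is to list the account/partition/QOS associations Slurm actually grants your user (a sketch; the %N suffixes just widen the output columns):
sacctmgr show assoc user=$USER format=Account%20,Partition%12,QOS%40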
#!/bin/bash
#SBATCH --job-name=traincnn
#SBATCH --qos=qos_gpu
#SBATCH --account yliu4_gpu
#SBATCH --output=traincnn.out
#SBATCH --error=traincnn.err
#SBATCH --time=00:20:00
#SBATCH --partition=a100
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
module load pyTorch/1.8.1-cuda-11.1.1
python train_simplecnn.py
SDSC Expanse
For SDSC, as always, check their Official User Guide first. I had a problem with conda init: it seems that when you load the Anaconda module on SDSC Expanse, the conda environment is not initialized properly. I found that the node simply did not load the .bashrc file in the home folder. After adding one line to the job script to manually source .bashrc, everything looks normal.
One thing to note: when you use the gpu partition, even if you don’t use all the resources on a node, you will still be charged for the whole node. Therefore I’d stick to gpu-shared most of the time (when using fewer than 4 GPUs), unless I can utilize all the resources on one node.
We already know how to get into a GPU session on Midway 2; let’s enter one here and install the latest PyTorch. Most things are similar, but remember to change the account to your own and change the partition accordingly. Some machines require --gres, while others require --gpus; refer to the user guide to decide how to enter a GPU session.
srun --partition=gpu-debug --pty --account=chi151 --nodes=1 --ntasks-per-node=4 --mem=8G -t 00:30:00 --wait=0 --gpus=1 /bin/bash
Run the following commands in the GPU session.
conda activate tgpu
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
It will install the latest PyTorch with CUDA 11.8 support. Then we can submit the batch job on a regular login node with the following sbatch script.
#!/bin/bash
#SBATCH --job-name=traincnn
#SBATCH --account chi151
#SBATCH --output=traincnn.out
#SBATCH --error=traincnn.err
#SBATCH --time=00:20:00
#SBATCH --partition=gpu-shared
#SBATCH --gpus=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
module load anaconda3/2021.05/q4munrg
source /home/yliu4/.bashrc
conda activate tgpu
python train_simplecnn.py
NCSA Delta
Check the User Guide as always! The system already has a conda environment loaded by default.
Create a virtual environment.
conda create --name mygpu python=3.11
You might see an error as shown below.
CondaValueError: You have chosen a non-default solver backend (libmamba) but it was not recognized. Choose one of: classic
If you see this error, run the following command to change the conda solver backend from libmamba to classic. Then you should be able to create the virtual environment as before.
conda config --set solver classic
We can then install the latest PyTorch.
conda activate mygpu
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Start a GPU session to verify the PyTorch + CUDA installation. To know which account you should use, run the accounts command, which is unique to the NCSA Delta machine.
srun -A bcnl-delta-gpu --time=00:30:00 --nodes=1 --ntasks-per-node=16 --partition=gpuA40x4 --gpus=1 --mem=16g --pty /bin/bash
Make sure the .bashrc file in your home folder contains the conda initialization lines (running conda init bash once will add them), and remember to change the account to your own. Then you can submit the job with the following script.
#!/bin/bash
#SBATCH --job-name=traincnn
#SBATCH --account bcnl-delta-gpu
#SBATCH --output=traincnn.out
#SBATCH --error=traincnn.err
#SBATCH --time=00:20:00
#SBATCH --partition=gpuA40x4
#SBATCH --gpus=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
module load anaconda3_cpu/23.7.4
source /u/yliu4/.bashrc
conda deactivate
conda activate mygpu
python train_simplecnn.py
My SSH Config File
At the end, I want to share my ssh config file to ease your life. The machines with an IdentityFile property support public/private key authentication; the others require a password plus two-factor authentication. I use the ssh config file to reduce the number of characters I need to type to connect to a host. For instance, I can run ssh sdsc and it will take me directly into the SDSC Expanse machine.
Host sdsc
    User yliu4
    HostName login.expanse.sdsc.edu
    IdentityFile ~/.ssh/id_ed25519

Host anvil
    User x-yliu4
    HostName anvil.rcac.purdue.edu
    IdentityFile ~/.ssh/id_ed25519

Host midway2
    User yuanjian
    HostName midway2.rcc.uchicago.edu

Host rockfish
    User yliu4
    HostName login.rockfish.jhu.edu

Host ncsa-delta
    User yliu4
    HostName login.delta.ncsa.illinois.edu

Host darwin
    User xsedeu3007
    HostName darwin.hpc.udel.edu
    IdentityFile ~/.ssh/id_ed25519
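One optional addition of my own (not part of the config above): for the hosts that need a password plus two-factor authentication, ssh connection multiplexing lets later sessions reuse the first authenticated connection, so you only do the Duo dance once. Create the sockets directory first with mkdir -p ~/.ssh/sockets.

Host midway2 rockfish ncsa-delta
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600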