Running a PyTorch Example on Turing
Gompei is an on-campus student at WPI who wants to run a PyTorch example on the Turing High-Performance Computing (HPC) cluster. They use a Windows laptop and have little experience with Linux or HPC systems. This guide walks through the steps they take to run a PyTorch script on Turing, keeping it simple and straightforward, with links to related documentation for deeper understanding.
1. Request Access to Turing
Before starting, Gompei needs an account on Turing.
- Action: Complete the Turing Account Request Form; wait for an email confirming the account is ready.
- More Information: See the Getting Started Guide.
2. Connect to Turing from Windows
Gompei connects to Turing using the built-in SSH client on their Windows laptop.
- Open Command Prompt: Press the Windows key, type Command Prompt, and press Enter.
- Connect via SSH: In the Command Prompt window, type the following and press Enter:
ssh gompei@turing.wpi.edu
- Handle Security Prompt: When prompted about the server's authenticity, type yes and press Enter.
- Authenticate: Enter the WPI password and press Enter.
- More Information: For a deeper understanding of Linux basics, refer to Linux on Turing.
3. Set Up the Python Environment
Now connected to Turing, Gompei sets up the Python environment to run PyTorch.
- Load the Python Module:
module load python/3.11.10
- Learn more about modules in the Modules Documentation.
- Create a Project Directory:
mkdir pytorch_example
cd pytorch_example
- Set Up a Virtual Environment:
python3 -m venv pytorch_example_env
source pytorch_example_env/bin/activate
- More about Python virtual environments: Python on Turing.
- Install Necessary Packages:
pip3 install numpy
pip3 install torch
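Before moving on, it can help to confirm that the virtual environment is actually active. The following is a minimal sketch using only the Python standard library. The function name `in_virtualenv` is illustrative and does not come from the Turing documentation.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python install.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter:", sys.executable)
```

Run with python3 after activating pytorch_example_env, this should report that the environment is active and show an interpreter path inside the pytorch_example_env directory.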
4. Prepare the PyTorch Script
Gompei uses an example from the PyTorch tutorials.
- Get the Example Code: Visit the PyTorch Tensors Tutorial, copy the example code, and uncomment the 5th line to use the GPUs available on Turing.
- Create the Python Script: Open a new file in the nano editor and paste the copied code into it:
nano pytorch_example.py
- Save and Exit: Press Ctrl + O, then Enter to save; press Ctrl + X to exit the editor.
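To check whether PyTorch can see a GPU, a short script along the following lines may be useful. It is a sketch, not the tutorial's code. The helper name `pick_device_name` is hypothetical, and the import is guarded so the script still runs where PyTorch is not installed. A GPU may only be visible from inside a job that requested one, not from the login session.

```python
def pick_device_name():
    """Return "cuda" if PyTorch sees a GPU, "cpu" if not, None without PyTorch."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing; the virtual environment is probably not active.
        return None
    # torch.cuda.is_available() reports whether a usable CUDA device exists.
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    name = pick_device_name()
    if name is None:
        print("PyTorch is not installed in this environment.")
    else:
        print(f"PyTorch would use: {name}")
```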
5. Create the SLURM Submission Script
To run the job on Turing, Gompei creates a SLURM submission script.
- Create the Script:
nano run_pytorch.sh
- Add the Following Content:
#!/bin/bash
#SBATCH -N 1 # allocate 1 compute node
#SBATCH -n 1 # total number of tasks
#SBATCH --mem=1g # allocate 1 GB of memory
#SBATCH -J "pytorch example" # name of the job
#SBATCH -o pytorch_example_%j.out # name of the output file
#SBATCH -e pytorch_example_%j.err # name of the error file
#SBATCH -p short # partition to submit to
#SBATCH -t 01:00:00 # time limit of 1 hour
#SBATCH --gres=gpu:1 # request 1 GPU
module load python/3.11.10 # load Python; these versions were chosen for compatibility with PyTorch
module load cuda/12.4.0/3mdaov5 # load CUDA (adjust if necessary)
python3 -m venv pytorch_example_env # create virtual environment
source pytorch_example_env/bin/activate # activate virtual environment
pip3 install numpy # install NumPy
pip3 install torch # install PyTorch
python3 pytorch_example.py # run Python script
- Save and Exit: Press Ctrl + O, then Enter to save; press Ctrl + X to exit the editor.
- More Information: Refer to the SLURM Submission Guide.
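To review which resources a submission script requests before submitting it, the #SBATCH lines can be listed with their trailing comments removed. The sketch below uses only the Python standard library. The names `sbatch_options` and `EXAMPLE` are illustrative, and the parsing is a simplification for reading the script, not a reproduction of how SLURM itself interprets these lines.

```python
import shlex

PREFIX = "#SBATCH"


def sbatch_options(script_text):
    """Return the option tokens of each #SBATCH line, comments removed."""
    options = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        # comments=True drops the trailing "# ..." explanation on each line.
        tokens = shlex.split(line[len(PREFIX):], comments=True)
        if tokens:
            options.append(tokens)
    return options


# A shortened copy of the submission script, used here as sample input.
EXAMPLE = '''#!/bin/bash
#SBATCH -N 1 # allocate 1 compute node
#SBATCH --mem=1g # allocate 1 GB of memory
#SBATCH -J "pytorch example" # name of the job
#SBATCH --gres=gpu:1 # request 1 GPU
python3 pytorch_example.py
'''

if __name__ == "__main__":
    for tokens in sbatch_options(EXAMPLE):
        print(tokens)
```

To inspect the real script, the text of run_pytorch.sh can be read from disk and passed in place of `EXAMPLE`.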
6. Submit and Monitor the Job
Gompei submits the job and monitors its progress.
- Submit the Job:
sbatch run_pytorch.sh
- After submitting, a message like Submitted batch job 123456 appears, indicating the job ID.
- Check Job Status: The following command shows whether the job is running or queued:
squeue --me
- Monitor Output in Real Time: Run the following, replacing 123456 with the job ID reported by sbatch; press Ctrl + C to stop monitoring:
tail -f pytorch_example_123456.out
- Ensure the Job Has Finished: When the job no longer appears in the squeue --me output, it has finished.
- View the Output File:
cat pytorch_example_123456.out
- Check for Errors: An error file that is empty or contains only non-critical warnings indicates the job ran smoothly:
cat pytorch_example_123456.err
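The output and error file names follow from the job ID, because %j in the submission script is replaced by that ID. The sketch below shows the mapping using only the Python standard library. The function names and the `stem` parameter are illustrative, and the sketch assumes the confirmation message has the form shown above.

```python
import re


def job_id_from_sbatch(message):
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", message)
    if match is None:
        raise ValueError(f"no job ID found in: {message!r}")
    return match.group(1)


def job_log_names(job_id, stem="pytorch_example"):
    """Return the (.out, .err) file names produced by the %j patterns."""
    return f"{stem}_{job_id}.out", f"{stem}_{job_id}.err"


if __name__ == "__main__":
    job_id = job_id_from_sbatch("Submitted batch job 123456")
    out_name, err_name = job_log_names(job_id)
    print(out_name)
    print(err_name)
```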
7. Transfer Files Between Turing and the Local Computer
Gompei wants to copy the output files to their Windows laptop.
- Open Command Prompt: Press the Windows key, type Command Prompt, and press Enter.
- Transfer Files Using scp: Run the following command:
scp gompei@turing.wpi.edu:/home/gompei/pytorch_example/pytorch_example_123456.out C:\Users\gompei\Downloads\
- Replace C:\Users\gompei\Downloads\ with the desired local directory.
- Enter the password when prompted.