Writing a SLURM Submission Script
TLDR
Create a submission script with #SBATCH directives to specify resources, and include the commands to run your job.
Submitting jobs to the compute nodes on Turing requires a SLURM submission script. This script tells the scheduler what resources your job needs and what commands to execute.
📝 Understanding Submission Scripts
A submission script has two main parts:
- Resource Specifications: #SBATCH directives that describe the resources and properties required for your job.
- Job Commands: The actual commands or scripts that will be executed on the compute nodes.
🔹 Basic Submission Script
In the simplest case, you could omit all #SBATCH options, but it's recommended to include some basic directives to ensure your job runs effectively.
Here is an example of a basic submission script:
#!/bin/bash
#SBATCH -N 1 # (1)
#SBATCH -n 2 # (2)
#SBATCH --mem=8g # (3)
#SBATCH -J "Hello World Job" # (4)
#SBATCH -p short # (5)
#SBATCH -t 12:00:00 # (6)
echo "Hello World" # (7)!
1. #SBATCH -N 1: Request 1 node for the job.
2. #SBATCH -n 2: Request 2 CPU cores.
3. #SBATCH --mem=8g: Request 8 GiB of memory.
4. #SBATCH -J "Hello World Job": Set the job name to "Hello World Job".
5. #SBATCH -p short: Submit the job to the short partition.
6. #SBATCH -t 12:00:00: Set the maximum runtime to 12 hours. If the job hasn't completed within this time, it will be terminated.
7. echo "Hello World": The script content that will be executed on the compute node.
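Once the script is saved to a file (hello_world.sh below is just a placeholder name), submit it with sbatch and monitor it with squeue. A minimal sketch using standard SLURM commands:

```bash
# Submit the script to the scheduler; sbatch prints the assigned job ID.
sbatch hello_world.sh

# List your queued and running jobs.
squeue -u $USER

# Cancel a job if necessary, replacing <jobid> with the ID printed by sbatch.
scancel <jobid>
```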
Using Turing in a Class
If you are using Turing as part of a class, you must submit your jobs to the academic partition. Jobs submitted as part of a class are also limited to one GPU at a time.
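For example, the directives for a class job might look like the following sketch (the time limit and script name are placeholders, not course requirements):

```bash
#!/bin/bash
#SBATCH -p academic       # class jobs must use the academic partition
#SBATCH --gres=gpu:1      # class jobs are limited to one GPU at a time
#SBATCH -t 02:00:00       # placeholder time limit; adjust to your assignment
python assignment.py      # hypothetical script name
```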
Available Partitions
- short: For jobs requiring less than 24 hours of runtime. This should be your default unless you need to use academic for a class. If your job can't run in 24 hours, consider requesting more resources; if the required resources would be too large, use the long partition.
- long: For jobs requiring more than 24 hours, with a default runtime of 3 days. Can be extended up to a maximum of 1 week (see the sketch after this list).
- academic: Reserved for students using Turing as part of a class.
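If your job needs the long partition, request it and the extended runtime explicitly. A minimal sketch, assuming a 5-day limit (within the 1-week maximum); the days-hours:minutes:seconds time format is standard SLURM syntax:

```bash
#!/bin/bash
#SBATCH -p long              # submit to the long partition
#SBATCH -t 5-00:00:00        # 5 days, in days-hours:minutes:seconds format
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --mem=16g            # example values; size these to your workload
./run_long_job.sh            # hypothetical job command
```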
🔹 Submission Script for GPU Use
If your job requires GPUs, you'll need to include additional directives in your submission script.
Example GPU submission script:
#!/bin/bash
#SBATCH -N 1 # (1)
#SBATCH -n 8 # (2)
#SBATCH --mem=8g # (3)
#SBATCH -J "Example GPU Job" # (4)
#SBATCH -p short # (5)
#SBATCH -t 12:00:00 # (6)
#SBATCH --gres=gpu:2 # (7)
#SBATCH -C "A100|V100" # (8)
module load python # (9)
module load cuda/12.2 # (10)
python my_script_name.py # (11)
1. #SBATCH -N 1: Request 1 node for the job.
2. #SBATCH -n 8: Request 8 CPU cores.
3. #SBATCH --mem=8g: Request 8 GiB of memory.
4. #SBATCH -J "Example GPU Job": Set the job name to "Example GPU Job".
5. #SBATCH -p short: Submit the job to the short partition.
6. #SBATCH -t 12:00:00: Set the maximum runtime to 12 hours.
7. #SBATCH --gres=gpu:2: Request 2 GPUs.
8. #SBATCH -C "A100|V100": [Optional] Constrain the job to specific GPU types, here A100 or V100.
9. module load python: Load the latest stable version of Python. For more information, see Software.
10. module load cuda/12.2: Load the CUDA 12.2 toolkit, providing access to the required GPU drivers.
11. python my_script_name.py: Run your Python script.
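You can also confirm which GPUs were allocated before starting your workload. A short sketch that could be appended to the script above (SLURM typically sets CUDA_VISIBLE_DEVICES for GPU jobs, and nvidia-smi is assumed to be available on the GPU nodes):

```bash
# SLURM typically restricts the job to its allocated GPUs via CUDA_VISIBLE_DEVICES.
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"

# List the GPU devices visible to this job (assumes nvidia-smi is installed on GPU nodes).
nvidia-smi
```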
Available GPUs
H200, A100-80G, H100, L40S, A100, V100, P100, A30
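To target a specific model from this list, use the same -C constraint shown in the example above. A minimal sketch, assuming you want a single H100:

```bash
#SBATCH --gres=gpu:1      # request one GPU
#SBATCH -C H100           # constrain the job to nodes with H100 GPUs
```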
⚠️ Important Notes
- Resource Requests: Be mindful of the resources you request. Overestimating can lead to longer wait times; underestimating can cause your job to fail. See the sketch after these notes for checking what a finished job actually used.
- Time Limits: Setting the --time (-t) directive helps the scheduler optimize resource allocation.
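To right-size future requests, review what a completed job actually used. A sketch using SLURM's accounting tools (seff is a contributed utility and may not be installed on every system):

```bash
# Summarize CPU and memory efficiency for a completed job (if seff is available).
seff <jobid>

# Query the accounting records for elapsed time and peak memory use.
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,ReqMem,State
```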
🤔 Why Use Submission Scripts?
- Automation: Scripts allow you to run complex jobs without manual intervention.
- Reproducibility: Easily rerun jobs with consistent settings.
- Resource Management: Specify exactly what your job needs, helping the scheduler optimize cluster usage.