Writing a SLURM Submission Script

TLDR

Create a submission script with #SBATCH directives to specify resources, and include the commands to run your job.

Submitting jobs to the compute nodes on Turing requires a SLURM submission script. This script tells the scheduler what resources your job needs and what commands to execute.


📝 Understanding Submission Scripts

A submission script has two main parts:

  1. Resource Specifications: Using #SBATCH directives to describe the resources and properties required for your job.
  2. Job Commands: The actual commands or scripts that will be executed on the compute nodes.

🔹 Basic Submission Script

In the simplest case, you could omit all #SBATCH options, but it's recommended to include some basic directives to ensure your job runs effectively.

Here is an example of a basic submission script:

#!/bin/bash
#SBATCH -N 1          # (1)
#SBATCH -n 2          # (2)
#SBATCH --mem=8g      # (3)
#SBATCH -J "Hello World Job"  # (4)
#SBATCH -p short      # (5)
#SBATCH -t 12:00:00   # (6)

echo "Hello World"    # (7)

  1. #SBATCH -N 1: Request 1 node for the job.
  2. #SBATCH -n 2: Request 2 tasks, which by default gives the job 2 CPU cores.
  3. #SBATCH --mem=8g: Request 8 GiB of memory per node.
  4. #SBATCH -J "Hello World Job": Set the job name to "Hello World Job".
  5. #SBATCH -p short: Submit the job to the short partition.
  6. #SBATCH -t 12:00:00: Set the maximum runtime to 12 hours. If the job hasn't completed within this time, it will be terminated.
  7. echo "Hello World": The script content that will be executed on the compute node.
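
Once the script is saved to a file, it is handed to the scheduler with sbatch. The sketch below writes the basic script to hello.sh (a hypothetical file name, not one this page prescribes) and submits it. The Slurm commands are guarded so the snippet is harmless on a machine where Slurm is not installed.

```shell
#!/bin/bash
# Hypothetical file name used for illustration.
script=hello.sh

# Write the basic submission script to a file.
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 2
#SBATCH --mem=8g
#SBATCH -J "Hello World Job"
#SBATCH -p short
#SBATCH -t 12:00:00

echo "Hello World"
EOF

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print only the job ID (plus ";cluster" if set).
    job_id=$(sbatch --parsable "$script")
    echo "Submitted job ${job_id}"
    # Show the job's current state in the queue.
    squeue -j "${job_id%%;*}"
else
    echo "sbatch not found; run this on a machine with Slurm installed"
fi
```

Unless you set an output file yourself, Slurm writes the job's output to slurm-<jobid>.out in the directory you submitted from.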

Using Turing in a Class

If you are using Turing as part of a class, you must submit your jobs to the academic partition. Jobs submitted as part of a class are also limited to one GPU at a time.
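
As a rough sketch of what a class job might look like, the script below targets the academic partition and requests a single GPU. The job name, core count, memory, and one-hour time limit are illustrative assumptions; adjust them to your assignment.

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 2
#SBATCH --mem=8g
#SBATCH -J "Class Job"       # illustrative job name
#SBATCH -p academic          # class jobs go to the academic partition
#SBATCH -t 01:00:00          # illustrative time limit
#SBATCH --gres=gpu:1         # class jobs are limited to one GPU at a time

# SLURM_JOB_PARTITION is set by Slurm inside a running job;
# outside a job it is unset, so fall back to "unknown".
partition=${SLURM_JOB_PARTITION:-unknown}
echo "Running in partition: ${partition}"
```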


Available Partitions

  • short   For jobs requiring less than 24 hours of runtime. This should be your default unless you need to use academic for a class. If your job can't finish within 24 hours, consider requesting more resources so that it completes sooner. If the resources required would be too large, use the long partition instead.

  • long   For jobs requiring more than 24 hours. The default runtime is 3 days, which can be extended up to a maximum of 1 week.

  • academic   Reserved for students using Turing as part of a class.
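
The choice between short and long can be expressed as a small helper. This is only a sketch of the rules listed above: pick_partition is a hypothetical function, not a Turing or Slurm command, and it ignores the academic partition, which class users must use regardless of runtime.

```shell
#!/bin/bash
# pick_partition HOURS: print the partition suggested by the rules above.
# Hypothetical helper for illustration only.
pick_partition() {
    local hours=$1
    if [ "$hours" -lt 24 ]; then
        echo "short"
    elif [ "$hours" -le 168 ]; then   # 168 hours = 1 week, the stated maximum
        echo "long"
    else
        echo "no partition allows ${hours} hours; split the job up" >&2
        return 1
    fi
}

pick_partition 12
pick_partition 72
```

For runtimes longer than a day, Slurm also accepts the days-hours:minutes:seconds form, so a five-day request on long can be written as -t 5-00:00:00.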


🔹 Submission Script for GPU Use

If your job requires GPUs, you'll need to include additional directives in your submission script.

Example GPU submission script:

#!/bin/bash
#SBATCH -N 1                   # (1)
#SBATCH -n 8                   # (2)
#SBATCH --mem=8g               # (3)
#SBATCH -J "Example GPU Job"   # (4)
#SBATCH -p short               # (5)
#SBATCH -t 12:00:00            # (6)
#SBATCH --gres=gpu:2           # (7)
#SBATCH -C "A100|V100"         # (8)

module load python             # (9)
module load cuda/12.2          # (10)

python my_script_name.py       # (11)

  1. #SBATCH -N 1: Request 1 node for the job.
  2. #SBATCH -n 8: Request 8 tasks, which by default gives the job 8 CPU cores.
  3. #SBATCH --mem=8g: Request 8 GiB of memory per node.
  4. #SBATCH -J "Example GPU Job": Set the job name to "Example GPU Job".
  5. #SBATCH -p short: Submit the job to the short partition.
  6. #SBATCH -t 12:00:00: Set the maximum runtime to 12 hours.
  7. #SBATCH --gres=gpu:2: Request 2 GPUs.
  8. #SBATCH -C "A100|V100": [Optional] Restrict the job to nodes with A100 or V100 GPUs.
  9. module load python: Load the latest stable version of Python. For more information, see Software.
  10. module load cuda/12.2: Load the CUDA 12.2 toolkit, providing the CUDA libraries and tools that GPU code needs.
  11. python my_script_name.py: Run your Python script.
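
To confirm that a job actually received the GPUs it asked for, a few lines like the following can go in the submission script before the main command. This is a sketch that assumes Slurm on Turing exports CUDA_VISIBLE_DEVICES for --gres GPU allocations (the usual behavior, but not confirmed by this page) and that nvidia-smi is available on the GPU nodes.

```shell
#!/bin/bash
# Lines to place in the job script before the main command.

# Slurm typically sets CUDA_VISIBLE_DEVICES to the GPUs allocated via --gres.
visible_gpus=${CUDA_VISIBLE_DEVICES:-none}
echo "GPUs visible to this job: ${visible_gpus}"

# List the GPU model and memory if the NVIDIA tool is available on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv \
        || echo "nvidia-smi could not query any GPU"
else
    echo "nvidia-smi not found on this machine"
fi
```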

Available GPUs

H200, A100-80G, H100, L40S, A100, V100, P100, A30


⚠️ Important Notes

  • Resource Requests: Be mindful of the resources you request. Overestimating can lead to longer wait times; underestimating can cause your job to fail.
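  • Time Limits: Set the --time (-t) directive as close to your job's real runtime as you safely can. An accurate limit helps the scheduler optimize resource allocation, and jobs with shorter limits can often start sooner.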
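
One way to tune future requests is to compare what a finished job asked for with what it used. The sketch below wraps sacct, Slurm's accounting query tool, in a hypothetical helper named report_usage. It assumes job accounting is enabled on Turing, which this page does not state.

```shell
#!/bin/bash
# report_usage JOB_ID: show requested vs. used time and memory for a job.
# Hypothetical helper name; sacct and these --format fields are standard Slurm.
report_usage() {
    local job_id=$1
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found" >&2
        return 1
    fi
    # Elapsed vs. Timelimit shows how much of the time request was used;
    # MaxRSS vs. ReqMem shows how much of the memory request was used.
    sacct -j "$job_id" --format=JobID,JobName,Elapsed,Timelimit,ReqMem,MaxRSS,State
}

# Usage (replace 123456 with a real job ID):
# report_usage 123456
```

If Elapsed and MaxRSS are consistently far below Timelimit and ReqMem, the next submission can likely request less.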

🤔 Why Use Submission Scripts?

  • Automation: Scripts allow you to run complex jobs without manual intervention.
  • Reproducibility: Easily rerun jobs with consistent settings.
  • Resource Management: Specify exactly what your job needs, helping the scheduler optimize cluster usage.