Submitting a SLURM Job to a Cluster: A Guide
Running Python scripts for machine learning tasks on a computing cluster managed by SLURM (Simple Linux Utility for Resource Management) is an effective way to tackle complex, resource-intensive projects. This guide helps users run basic Python scripts on a powerful cluster, as demonstrated during a research internship at Cambridge University.
To begin, prepare a SLURM batch script that specifies the job requirements and runs your Python code. Key SLURM directives include:
1. `#SBATCH --account=your_account` to specify your account.
2. `#SBATCH --partition=gpu_partition` to select a partition with GPUs (or another relevant partition).
3. `#SBATCH --gres=gpu:number` to request GPU resources.
4. `#SBATCH --mem=memory_amount` to allocate memory.
5. `#SBATCH --time=HH:MM:SS` to specify the maximum run time.
6. `#SBATCH -o output_file` and `#SBATCH -e error_file` for logs.
Next, load Python and any other software your ML environment needs inside the script using environment modules, for example `module load python` and `module load cuda`.
Once the environment is set up, run the Python script inside the SLURM job script, for example:
```bash
python your_ml_script.py
```
For multiple jobs on GPUs, you can use GNU Parallel combined with `CUDA_VISIBLE_DEVICES` to run several single-GPU Python jobs concurrently without GPU conflicts.
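A minimal sketch of this pattern, assuming a four-GPU node; the `--config` flag and the `cfg*.yaml` filenames are illustrative, not part of any real script:

```bash
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4           # Four GPUs on one node
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00

module load python/3.x

# {%} is GNU Parallel's 1-based job-slot number; slots 1-4 are pinned
# to GPUs 0-3, so the four Python processes never share a GPU.
parallel -j 4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) python your_ml_script.py --config {}' \
    ::: cfg1.yaml cfg2.yaml cfg3.yaml cfg4.yaml
```

Pinning each job slot to a fixed GPU index via `CUDA_VISIBLE_DEVICES` is what prevents two processes from landing on the same device.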
Here's an example minimal SLURM batch script to run a Python ML task on one GPU:
```bash
#!/bin/bash
#SBATCH --account=def-myuser   # Your project/account
#SBATCH --partition=gpu        # GPU partition
#SBATCH --gres=gpu:1           # Request 1 GPU
#SBATCH --mem=16G              # Memory
#SBATCH --time=02:00:00        # Max run time 2 hours
#SBATCH -o myjob_%j.out        # Stdout with job ID
#SBATCH -e myjob_%j.err        # Stderr with job ID

module load python/3.x         # Load Python module
module load cuda               # Load CUDA if needed

python train_model.py          # Run your ML Python script
```
When working with SLURM, remember to use `salloc` or `srun` for an interactive session with GPU(s) to test code before batch submission, and check cluster documentation for GPU types, available partitions, and environment modules.
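For instance, an interactive test session might look like this (account and partition names are placeholders; substitute your cluster's values):

```bash
# Request an interactive allocation with one GPU for 30 minutes:
salloc --account=def-myuser --partition=gpu --gres=gpu:1 --time=00:30:00

# Inside the allocation, run commands on the compute node:
srun python your_ml_script.py
```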
To monitor the job's progress in the queue, use the `squeue` command. For detailed information about a specific job, use `scontrol show job <jobid>`.
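A few common monitoring commands (the job ID `123456` is illustrative):

```bash
squeue -u $USER            # your pending and running jobs
squeue -j 123456           # queue status of one job
scontrol show job 123456   # full details: resources, node, state, file paths
sacct -j 123456            # accounting info once the job has finished
```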
It's important to note that while SLURM copies the batch script itself at submission, any files it references, such as your Python source, are read when the job actually starts: if you modify the source code while the job is queued, the latest version will be used at execution time, not the one from the time of submission. Be patient and keep learning to optimize your workflows.
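If you want a queued job to be immune to later edits, one workaround is to snapshot the code at submission time. A minimal sketch, where the directory layout and filenames are illustrative:

```bash
# Copy the source into a per-submission directory, then submit from there,
# so further edits to the working copy no longer affect the queued job.
SNAP="$HOME/job_snapshots/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$SNAP"
cp train_model.py myjob.sh "$SNAP/"
sbatch --chdir="$SNAP" "$SNAP/myjob.sh"
```

`--chdir` makes the job run inside the snapshot directory, so the copied `train_model.py` is the one executed.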
SLURM manages, schedules, and supervises the execution of jobs, ensuring each job gets its required resources without conflicts or wasted capacity. For large distributed ML training, you may integrate MPI or use frameworks like PyTorch's distributed API inside your SLURM job.
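As one hedged sketch of the PyTorch route, a two-node job can combine `srun` with the `torchrun` launcher; this assumes four GPUs per node and that `train_model.py` initializes `torch.distributed`:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1    # One torchrun launcher per node
#SBATCH --gres=gpu:4           # Four GPUs per node
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00

# Use the first allocated node as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# torchrun spawns one worker process per GPU on each node.
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train_model.py
```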
Lastly, always respect the cluster's guidelines and resource limits. To list all jobs with a given name, use `squeue --name=<jobname>` (the `-j` flag filters by job ID instead).
In short, SLURM gives you a reliable way to manage, schedule, and supervise resource-intensive Python ML work on a shared cluster: SLURM directives declare the resources a job needs, and environment modules load Python and the required ML libraries inside the job script.