SLURM is short for 'Simple Linux Utility for Resource Management'. It helps you to make use of a cluster by giving you a command line interface to add jobs to a queue. That means, you and other users can specify program calls that get executed as soon als all conditions are met.
I currently need to work with this, because I have some pretty heavy tasks that occur periodically, but don't need to be executed immediately. More concrete, I am working in the field of ASR - Automatic Speech Recognition. In ASR you have models that include probabilities for ... well, let's don't get into detail for that. The important part is that those models need to be trained. And they can get adapted to speakers.
As soon as you have more recordings, you can try to improve the model. And this training involves a lot of computation and data processing. As the training job is executed periodically, but not always when a new recording is present, you might also have recordings for many different speakers. So you can improve many models when this script is started. So this can be done on different computers in parallel.
SLURM basic usage
A simple call could look like this:
srun python test.py -aflag -param value
Other notable commands are:
squeue: Gives a list of all running / pending jobs
sbatch: Run a batch script in background
scancel: Stop a job
Important parameters for
--mem=[X in MB]
--job-name=[Name of your SLURM job]
--dependency: Start this job when all dependencies are met. This could be time or other jobs
Another important command is
squeue. It allows you to list all jobs in
command line (if you have a GUI:
You can get your jobs only with
squeue -u mthoma, where
mthoma should be
replaced by your user name.
$ sbatch --help Usage: sbatch [OPTIONS...] executable [args...] Parallel run options: -A, --account=name charge job to specified account --begin=time defer job until HH:MM MM/DD/YY -c, --cpus-per-task=ncpus number of cpus required per task --comment=name arbitrary comment -d, --dependency=type:jobid defer job until condition on jobid is satisfied -D, --workdir=directory set working directory for batch script -e, --error=err file for batch script's standard error --export[=names] specify environment variables to export --get-user-env load environment from local cluster --gid=group_id group ID to run job as (user root only) --gres=list required generic resources -H, --hold submit job in held state -i, --input=in file for batch script's standard input -I, --immediate exit if resources are not immediately available --jobid=id run under already allocated job -J, --job-name=jobname name of job -k, --no-kill do not kill job on node failure -L, --licenses=names required license, comma separated -m, --distribution=type distribution method for processes to nodes (type = block|cyclic|arbitrary) -M, --clusters=names Comma separated list of clusters to issue commands to. Default is current cluster. Name of 'all' will submit to run on all clusters. --mail-type=type notify on state change: BEGIN, END, FAIL or ALL --mail-user=user who to send email notification for job state changes -n, --ntasks=ntasks number of tasks to run --nice[=value] decrease secheduling priority by value --no-requeue if set, do not permit the job to be requeued --ntasks-per-node=n number of tasks to invoke on each node -N, --nodes=N number of nodes on which to run (N = min[-max]) -o, --output=out file for batch script's standard output -O, --overcommit overcommit resources -p, --partition=partition partition requested --propagate[=rlimits] propagate all [or specific list of] rlimits --qos=qos quality of service -Q, --quiet quiet mode (suppress informational messages) --requeue if set, permit the job to be requeued -t, --time=minutes time limit --time-min=minutes minimum time limit (if distinct) -s, --share share nodes with other jobs --uid=user_id user ID to run job as (user root only) -v, --verbose verbose mode (multiple -v's increase verbosity) Constraint options: --contiguous demand a contiguous range of nodes -C, --constraint=list specify a list of constraints -F, --nodefile=filename request a specific list of hosts --mem=MB minimum amount of real memory --mincpus=n minimum number of logical processors (threads) per node --reservation=name allocate resources from named reservation --tmp=MB minimum amount of temporary disk -w, --nodelist=hosts... request a specific list of hosts -x, --exclude=hosts... exclude a specific list of hosts Consumable resources related options: --exclusive allocate nodes in exclusive mode when cpu consumable resource is enabled --mem-per-cpu=MB maximum amount of real memory per allocated cpu required by the job. --mem >= --mem-per-cpu if --mem is specified. Affinity/Multi-core options: (when the task/affinity plugin is enabled) -B --extra-node-info=S[:C[:T]] Expands to: --sockets-per-node=S number of sockets per node to allocate --cores-per-socket=C number of cores per socket to allocate --threads-per-core=T number of threads per core to allocate each field can be 'min' or wildcard '*' total cpus requested = (N x S x C x T) --ntasks-per-core=n number of tasks to invoke on each core --ntasks-per-socket=n number of tasks to invoke on each socket --cpu_bind= Bind tasks to CPUs (see "--cpu_bind=help" for options) --hint= Bind tasks according to application hints (see "--hint=help" for options) --mem_bind= Bind memory to locality domains (ldom) (see "--mem_bind=help" for options) Help options: -h, --help show this help message -u, --usage display brief usage message Other options: -V, --version output version information and exit
squeue --help Usage: squeue [OPTIONS] -A, --account=account(s) comma separated list of accounts to view, default is all accounts -a, --all display jobs in hidden partitions -h, --noheader no headers on output --hide do not display jobs in hidden partitions -i, --iterate=seconds specify an interation period -j, --job=job(s) comma separated list of jobs IDs to view, default is all -l, --long long report -M, --clusters=cluster_name cluster to issue commands to. Default is current cluster. cluster with no name will reset to default. -n, --nodes=hostlist list of nodes to view, default is all nodes -o, --format=format format specification -p, --partition=partition(s) comma separated list of partitions to view, default is all partitions -q, --qos=qos(s) comma separated list of qos's to view, default is all qos's -s, --step=step(s) comma separated list of job steps to view, default is all -S, --sort=fields comma separated list of fields to sort on --start print expected start times of pending jobs -t, --states=states comma separated list of states to view, default is pending and running, '--states=all' reports all states -u, --user=user_name(s) comma separated list of users to view -v, --verbose verbosity level -V, --version output version information and exit Help options: --help show this help message --usage display a brief summary of squeue options