SLURM

SLURM is short for 'Simple Linux Utility for Resource Management'. It helps you to make use of a cluster by giving you a command line interface to add jobs to a queue. That means, you and other users can specify program calls that get executed as soon als all conditions are met.

I currently need to work with this, because I have some pretty heavy tasks that occur periodically, but don't need to be executed immediately. More concrete, I am working in the field of ASR - Automatic Speech Recognition. In ASR you have models that include probabilities for ... well, let's don't get into detail for that. The important part is that those models need to be trained. And they can get adapted to speakers.

As soon as you have more recordings, you can try to improve the model. And this training involves a lot of computation and data processing. As the training job is executed periodically, but not always when a new recording is present, you might also have recordings for many different speakers. So you can improve many models when this script is started. So this can be done on different computers in parallel.

SLURM basic usage

A simple call could look like this:

srun python test.py -aflag -param value

Other notable commands are:

squeue: Gives a list of all running / pending jobs
sbatch: Run a batch script in background
scancel: Stop a job

Important parameters for srun are:

--mem=[X in MB]
--job-name=[Name of your SLURM job]
--dependency: Start this job when all dependencies are met. This could be time or other jobs

Another important command is squeue. It allows you to list all jobs in command line (if you have a GUI: sview).

You can get your jobs only with squeue -u mthoma, where mthoma should be replaced by your user name.

Help texts

sbatch

$ sbatch --help
Usage: sbatch [OPTIONS...] executable [args...]

Parallel run options:
  -A, --account=name          charge job to specified account
      --begin=time            defer job until HH:MM MM/DD/YY
  -c, --cpus-per-task=ncpus   number of cpus required per task
      --comment=name          arbitrary comment
  -d, --dependency=type:jobid defer job until condition on jobid is satisfied
  -D, --workdir=directory     set working directory for batch script
  -e, --error=err             file for batch script's standard error
      --export[=names]        specify environment variables to export
      --get-user-env          load environment from local cluster
      --gid=group_id          group ID to run job as (user root only)
      --gres=list             required generic resources
  -H, --hold                  submit job in held state
  -i, --input=in              file for batch script's standard input
  -I, --immediate             exit if resources are not immediately available
      --jobid=id              run under already allocated job
  -J, --job-name=jobname      name of job
  -k, --no-kill               do not kill job on node failure
  -L, --licenses=names        required license, comma separated
  -m, --distribution=type     distribution method for processes to nodes
                              (type = block|cyclic|arbitrary)
  -M, --clusters=names        Comma separated list of clusters to issue
                              commands to.  Default is current cluster.
                              Name of 'all' will submit to run on all clusters.
      --mail-type=type        notify on state change: BEGIN, END, FAIL or ALL
      --mail-user=user        who to send email notification for job state
                              changes
  -n, --ntasks=ntasks         number of tasks to run
      --nice[=value]          decrease secheduling priority by value
      --no-requeue            if set, do not permit the job to be requeued
      --ntasks-per-node=n     number of tasks to invoke on each node
  -N, --nodes=N               number of nodes on which to run (N = min[-max])
  -o, --output=out            file for batch script's standard output
  -O, --overcommit            overcommit resources
  -p, --partition=partition   partition requested
      --propagate[=rlimits]   propagate all [or specific list of] rlimits
      --qos=qos               quality of service
  -Q, --quiet                 quiet mode (suppress informational messages)
      --requeue               if set, permit the job to be requeued
  -t, --time=minutes          time limit
      --time-min=minutes      minimum time limit (if distinct)
  -s, --share                 share nodes with other jobs
      --uid=user_id           user ID to run job as (user root only)
  -v, --verbose               verbose mode (multiple -v's increase verbosity)

Constraint options:
      --contiguous            demand a contiguous range of nodes
  -C, --constraint=list       specify a list of constraints
  -F, --nodefile=filename     request a specific list of hosts
      --mem=MB                minimum amount of real memory
      --mincpus=n             minimum number of logical processors (threads) per node
      --reservation=name      allocate resources from named reservation
      --tmp=MB                minimum amount of temporary disk
  -w, --nodelist=hosts...     request a specific list of hosts
  -x, --exclude=hosts...      exclude a specific list of hosts

Consumable resources related options:
      --exclusive             allocate nodes in exclusive mode when
                              cpu consumable resource is enabled
      --mem-per-cpu=MB        maximum amount of real memory per allocated
                              cpu required by the job.
                              --mem >= --mem-per-cpu if --mem is specified.

Affinity/Multi-core options: (when the task/affinity plugin is enabled)
  -B  --extra-node-info=S[:C[:T]]            Expands to:
       --sockets-per-node=S   number of sockets per node to allocate
       --cores-per-socket=C   number of cores per socket to allocate
       --threads-per-core=T   number of threads per core to allocate
                              each field can be 'min' or wildcard '*'
                              total cpus requested = (N x S x C x T)

      --ntasks-per-core=n     number of tasks to invoke on each core
      --ntasks-per-socket=n   number of tasks to invoke on each socket
      --cpu_bind=             Bind tasks to CPUs
                              (see "--cpu_bind=help" for options)
      --hint=                 Bind tasks according to application hints
                              (see "--hint=help" for options)
      --mem_bind=             Bind memory to locality domains (ldom)
                              (see "--mem_bind=help" for options)


Help options:
  -h, --help                  show this help message
  -u, --usage                 display brief usage message

Other options:
  -V, --version               output version information and exit

squeue

squeue --help
Usage: squeue [OPTIONS]
  -A, --account=account(s)        comma separated list of accounts
                  to view, default is all accounts
  -a, --all                       display jobs in hidden partitions
  -h, --noheader                  no headers on output
      --hide                      do not display jobs in hidden partitions
  -i, --iterate=seconds           specify an interation period
  -j, --job=job(s)                comma separated list of jobs IDs
                  to view, default is all
  -l, --long                      long report
  -M, --clusters=cluster_name     cluster to issue commands to.  Default is
                                  current cluster.  cluster with no name will
                                  reset to default.
  -n, --nodes=hostlist            list of nodes to view, default is
                  all nodes
  -o, --format=format             format specification
  -p, --partition=partition(s)    comma separated list of partitions
                  to view, default is all partitions
  -q, --qos=qos(s)                comma separated list of qos's
                  to view, default is all qos's
  -s, --step=step(s)              comma separated list of job steps
                  to view, default is all
  -S, --sort=fields               comma separated list of fields to sort on
      --start                     print expected start times of pending jobs
  -t, --states=states             comma separated list of states to view,
                  default is pending and running,
                  '--states=all' reports all states
  -u, --user=user_name(s)         comma separated list of users to view
  -v, --verbose                   verbosity level
  -V, --version                   output version information and exit

Help options:
  --help                          show this help message
  --usage                         display a brief summary of squeue options

SLURM

SLURM basic usage

Help texts

sbatch

squeue

More information

Published

Category

Tags

Contact