With the move to SLURM, we are experimenting with preemption for jobs submitted with the
long-low-prio QOS. Preemption allows SLURM to cancel and requeue large jobs submitted with the
long-low-prio QOS in order to run small jobs submitted with a higher-priority QOS.
This allows the scheduler to fill up the grid with long jobs while still being able to run small jobs immediately, without making them wait for days. If you intend to submit your job using the
long-low-prio QOS, you must implement some kind of regular checkpointing of your job (e.g. hourly) and be able to resume your job from it. Furthermore, you must indicate that your job supports checkpointing by adding
#SBATCH --requeue to your job definition. If you do not and your job is preempted, it will be killed by SLURM and all your unsaved work will be lost.
If your job is preempted, SLURM will signal it by sending a SIGTERM to your script and will allow 2 minutes for your job to save its current state before killing it.
In addition, if your job supports requeuing, SLURM will automatically resume it after a server crash or loss of power, once the fault has been repaired.
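For example, one way to use the grace period is to trap the SIGTERM in your batch script and ask your program to write a last checkpoint before exiting. The following is only a sketch: my_long_job and the SIGUSR1 checkpoint convention are hypothetical placeholders for your own program and whatever checkpoint mechanism it provides.

#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --requeue

# Hypothetical sketch: on SIGTERM, ask the application to write a last
# checkpoint and exit before the two-minute grace period runs out.
save_and_exit() {
    echo "Preemption notice received, saving state..."
    kill -USR1 "$APP_PID"   # assumes your program checkpoints on SIGUSR1
    wait "$APP_PID"
    exit 0
}
trap save_and_exit TERM

# Run the real work in the background so the trap can fire while it runs.
./my_long_job &             # placeholder for your actual command
APP_PID=$!
wait "$APP_PID"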
For instance, a Caffe task can be checkpointed by regularly saving a
.caffemodel (and its matching .solverstate), then adapting your job script to resume from it if your job has been requeued:
#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu:1
#SBATCH -p long
#SBATCH --requeue

module load cuda/8.0

OUTPUT_DIR=output/
PRETRAINED_MODEL=pretrained.caffemodel

# On a requeued run, resume from the most recent solver state snapshot;
# on the first run, start from the pretrained weights.
# (caffe does not accept --weights and --snapshot together.)
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    SNAPSHOT="--snapshot=$(ls -Art ${OUTPUT_DIR}*.solverstate | tail -n 1)"
    WEIGHTS=
else
    SNAPSHOT=
    WEIGHTS="--weights=$PRETRAINED_MODEL"
fi

caffe train --solver=solver.prototxt $WEIGHTS $SNAPSHOT --gpu=0
In TensorFlow, your work can be regularly saved using a Saver.
For instance, you can use the following code structure:
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

# Command line flags
flags.DEFINE_string('checkpoint', '', "Checkpoint file (.ckpt)")
flags.DEFINE_boolean('restore', False, "Restore from a checkpoint")


def main(_):
    # Build your graph
    # (init_op comes from the graph, e.g. tf.global_variables_initializer())
    # ...

    # Add ops to save and restore all the variables.
    saver = tf.train.Saver()

    # Later, launch the model, initialize the variables, do some work and
    # save the variables to disk.
    with tf.Session() as sess:
        sess.run(init_op)

        # Restore variables from disk.
        if FLAGS.restore:
            saver.restore(sess, FLAGS.checkpoint)
            print("Model restored.")

        # Do some work with the model
        # ...
        if step % 1000 == 0:
            # Save the variables to disk every 1000 iterations
            save_path = saver.save(sess, "model.ckpt")
            print("Model saved in file: %s" % save_path)


# Parses flags and calls the `main` function above
if __name__ == '__main__':
    tf.app.run()
and define a batch script such as:
#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu:1
#SBATCH -p long
#SBATCH --requeue

module load cuda/8.0

# Pass --restore only if the job has been requeued at least once.
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    RESTORE=--restore
else
    RESTORE=
fi

python test.py --checkpoint=mycheckpoint.ckpt $RESTORE