With the move to SLURM, we are experimenting with preemption for jobs submitted with the
long-low-prio QOS. Preemption allows SLURM to cancel and requeue large jobs submitted with the
long-low-prio QOS in order to run small jobs submitted with a higher-priority QOS.
This allows the scheduler to fill up the grid with long jobs while still being able to run small jobs immediately, without making them wait for days. If you intend to submit your job using the
long-low-prio QOS, you must implement some kind of regular checkpointing of your job (e.g. hourly) and be able to resume your job from it. Furthermore, you must indicate that your job supports checkpointing by adding
#SBATCH --requeue to your job definition. If you do not and your job is preempted, it will be killed by SLURM and all your unsaved work will be lost.
If your job is preempted, SLURM will signal it by sending a SIGTERM to your script and will allow 2 minutes for your job to save its current state before killing it.
In addition, if your job supports requeuing, SLURM will automatically resume it after a server crash or loss of power, once the fault has been repaired.
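For example, one way to use the grace period is to trap the SIGTERM in your batch script and ask your program to write a last checkpoint before exiting. The following is only a sketch: my_long_job and the SIGUSR1 checkpoint convention are hypothetical placeholders for your own program and whatever checkpoint mechanism it provides.

#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --requeue

# Hypothetical sketch: on SIGTERM, ask the application to write a last
# checkpoint and exit before the two-minute grace period runs out.
save_and_exit() {
    echo "Preemption notice received, saving state..."
    kill -USR1 "$APP_PID"   # assumes your program checkpoints on SIGUSR1
    wait "$APP_PID"
    exit 0
}
trap save_and_exit TERM

# Run the real work in the background so the trap can fire while it runs.
./my_long_job &             # placeholder for your actual command
APP_PID=$!
wait "$APP_PID"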
For instance, a Caffe task can be checkpointed by regularly saving a
.caffemodel (and its matching .solverstate), then adapting your job script to resume from it if your job has been requeued:
#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu:1
#SBATCH -p long
#SBATCH --requeue

module load cuda/8.0

OUTPUT_DIR=output/
PRETRAINED_MODEL=pretrained.caffemodel

# On a requeued run, resume from the most recent solver state snapshot;
# on the first run, start from the pretrained weights.
# (caffe does not accept --weights and --snapshot together.)
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    SNAPSHOT="--snapshot=$(ls -Art ${OUTPUT_DIR}*.solverstate | tail -n 1)"
    WEIGHTS=
else
    SNAPSHOT=
    WEIGHTS="--weights=$PRETRAINED_MODEL"
fi

caffe train --solver=solver.prototxt $WEIGHTS $SNAPSHOT --gpu=0
In TensorFlow, your work can be regularly saved using a Saver.
For instance, you can use the following code structure:
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

# Command line flags
flags.DEFINE_string('checkpoint', '', "Checkpoint file (.ckpt)")
flags.DEFINE_boolean('restore', False, "Restore from a checkpoint")


def main(_):
    # Build your graph
    # (init_op comes from the graph, e.g. tf.global_variables_initializer())
    # ...

    # Add ops to save and restore all the variables.
    saver = tf.train.Saver()

    # Later, launch the model, initialize the variables, do some work and
    # save the variables to disk.
    with tf.Session() as sess:
        sess.run(init_op)

        # Restore variables from disk.
        if FLAGS.restore:
            saver.restore(sess, FLAGS.checkpoint)
            print("Model restored.")

        # Do some work with the model
        # ...
        if step % 1000 == 0:
            # Save the variables to disk every 1000 iterations
            save_path = saver.save(sess, "model.ckpt")
            print("Model saved in file: %s" % save_path)


# Parses flags and calls the `main` function above
if __name__ == '__main__':
    tf.app.run()
and define a batch script such as:
#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu:1
#SBATCH -p long
#SBATCH --requeue

module load cuda/8.0

# Pass --restore only if the job has been requeued at least once.
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    RESTORE=--restore
else
    RESTORE=
fi

python test.py --checkpoint=mycheckpoint.ckpt $RESTORE