Preemption
With the move to SLURM, we are experimenting with preemption for jobs submitted with the long-low-prio QOS. Preemption allows SLURM to cancel and requeue large jobs submitted with the long-low-prio QOS in order to run small jobs submitted with the debug and short QOSes. This lets the scheduler fill up the grid with long jobs while still being able to run small jobs immediately, without making them wait for days.
If you intend to submit your job with the long-low-prio QOS, you must implement some kind of regular checkpointing (e.g. hourly) and be able to resume your job from it. You must also indicate that your job supports checkpointing by adding #SBATCH --requeue to your job definition. If you do not, and your job is preempted, it will be killed by SLURM and all your unsaved work will be lost.
If your job is preempted, SLURM signals it by sending a SIGTERM to your script and then allows 2 minutes for your job to save its current state before killing it.
In addition, if your job supports requeuing, SLURM will automatically resume it after a server crash or power loss, once the fault has been repaired.
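One way to make use of that signal from a Python training loop is to install a handler for SIGTERM, finish the current iteration, and write one last checkpoint before exiting. The following is only a minimal sketch; the print statements stand in for whatever checkpointing mechanism your job already uses:
import signal
import time

preempted = False

def handle_sigterm(signum, frame):
    # Remember that SLURM asked us to stop; roughly 2 minutes remain.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, handle_sigterm)

step = 0
while not preempted:
    time.sleep(1)           # stands in for one training iteration
    step += 1
    if step % 3600 == 0:    # regular (e.g. hourly) checkpoint
        print("saving periodic checkpoint at step %d" % step)

# SIGTERM received: write one last checkpoint before the job is killed.
print("saving final checkpoint at step %d" % step)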
Caffe Checkpoints
For instance, a Caffe task can be checkpointed by regularly saving a .solverstate and .caffemodel (via the snapshot settings in your solver), and then adapting your job script so that it resumes from them if the job has been requeued:
#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu:1
#SBATCH -p long
#SBATCH --requeue

module load cuda/8.0

# Directory where the solver writes its snapshots (snapshot_prefix in solver.prototxt)
OUTPUT_DIR=output
PRETRAINED_MODEL=pretrained.caffemodel

# SLURM_RESTART_COUNT is only set once the job has been requeued at least once
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    # Resume from the most recent solver state; Caffe does not accept
    # --snapshot and --weights at the same time
    SOLVERSTATE="--snapshot=$(ls -t $OUTPUT_DIR/*.solverstate | head -n1)"
    WEIGHTS=
else
    # First run: start from the pretrained weights
    SOLVERSTATE=
    WEIGHTS="--weights=$PRETRAINED_MODEL"
fi

caffe train --solver=solver.prototxt $WEIGHTS $SOLVERSTATE --gpu=0
TensorFlow Checkpoints
In TensorFlow, your model variables can be saved regularly using a tf.train.Saver.
For instance, you can use the following code structure:
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

# command line flags
flags.DEFINE_string('checkpoint', '', "Checkpoint file (.ckpt)")
flags.DEFINE_boolean('restore', False, "Restore from a checkpoint")


def main(_):
    # Build your graph
    ...
    # Add an op to initialize the variables and ops to save and restore them
    init_op = tf.global_variables_initializer()
    saver = tf.train.Saver()

    # Later, launch the model, initialize the variables, do some work, and
    # save the variables to disk
    with tf.Session() as sess:
        sess.run(init_op)
        # Restore variables from disk if the job has been requeued
        if FLAGS.restore:
            saver.restore(sess, FLAGS.checkpoint)
            print("Model restored.")
        # Do some work with the model
        ...
        # Inside your training loop:
        if step % 1000 == 0:
            # Save the variables to disk every 1000 iterations
            save_path = saver.save(sess, FLAGS.checkpoint)
            print("Model saved in file: %s" % save_path)


# parses flags and calls the `main` function above
if __name__ == '__main__':
    tf.app.run()
and define a batch script such as:
#!/bin/sh
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --gres=gpu:1
#SBATCH -p long
#SBATCH --requeue

module load cuda/8.0

# SLURM_RESTART_COUNT is only set once the job has been requeued at least once
RESTORE=
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    RESTORE=--restore
fi

python test.py --checkpoint=mycheckpoint.ckpt $RESTORE
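As an alternative to passing a --restore flag from the batch script, the Python code could read SLURM_RESTART_COUNT directly from the environment. This is only a sketch; note that SLURM leaves the variable unset on the first run:
import os

# SLURM only sets SLURM_RESTART_COUNT once the job has been requeued,
# so a missing variable means this is the first run.
restart_count = int(os.environ.get('SLURM_RESTART_COUNT', '0'))

if restart_count > 0:
    print("Job was requeued %d time(s): restoring from the last checkpoint" % restart_count)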