Learning rate decay 是深度学习训练过程中常用的手段之一。随训练的进展减小学习率可以定位到loss更深但是更窄的部分。

文献[2]提供了三种学习率递减的方法,并且表示 setp decay 是一个可行性较高的方案。TensorFlow官方模型中也提供了学习率递减的实现,摘抄如下。

def learning_rate_with_decay(
    batch_size, batch_denom, num_images, boundary_epochs, decay_rates,
    base_lr=0.1, warmup=False):
  """Get a learning rate that decays step-wise as training progresses.
  Args:
    batch_size: the number of examples processed in each training batch.
    batch_denom: this value will be used to scale the base learning rate.
      `0.1 * batch size` is divided by this number, such that when
      batch_denom == batch_size, the initial learning rate will be 0.1.
    num_images: total number of images that will be used for training.
    boundary_epochs: list of ints representing the epochs at which we
      decay the learning rate.
    decay_rates: list of floats representing the decay rates to be used
      for scaling the learning rate. It should have one more element
      than `boundary_epochs`, and all elements should have the same type.
    base_lr: Initial learning rate scaled based on batch_denom.
    warmup: Run a 5 epoch warmup to the initial lr.
  Returns:
    Returns a function that takes a single argument - the number of batches
    trained so far (global_step)- and returns the learning rate to be used
    for training the next batch.
  """
  initial_learning_rate = base_lr * batch_size / batch_denom
  batches_per_epoch = num_images / batch_size

  # Reduce the learning rate at certain epochs.
  # CIFAR-10: divide by 10 at epoch 100, 150, and 200
  # ImageNet: divide by 10 at epoch 30, 60, 80, and 90
  boundaries = [int(batches_per_epoch * epoch) for epoch in boundary_epochs]
  vals = [initial_learning_rate * decay for decay in decay_rates]

  def learning_rate_fn(global_step):
    """Builds scaled learning rate function with 5 epoch warm up."""
    lr = tf.train.piecewise_constant(global_step, boundaries, vals)
    if warmup:
      warmup_steps = int(batches_per_epoch * 5)
      warmup_lr = (
          initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(
              warmup_steps, tf.float32))
      return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
    return lr

  return learning_rate_fn

核心的部分其实在函数tf.train.piecewise_constant中,看文档就能看懂。不过我们关心的是这里的warm up部分。

if warmup:
    warmup_steps = int(batches_per_epoch * 5)
    warmup_lr = (initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(warmup_steps, tf.float32))
    return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)

启用warm up,当step小于warm up setp时,学习率等于基础学习率×(当前step/warmup_step),由于后者是一个小于1的数值,因此在整个warm up的过程中,学习率是一个递增的过程!当warm up结束后,学习率开始递减。

Warm up 的作用是什么?从名字看就是热身的意思。个人猜测这个过程是对最佳学习率的一个搜索。如果在整个过程中出现loss先减后增,则说明最开始设置的学习率可能过大,而热身过程中的最佳学习率可供参考。


  1. Cover photo by Samuel Fyfe on Unsplash
  2. http://cs231n.github.io/neural-networks-3/
  3. https://github.com/tensorflow/models/blob/57e075203f8fba8d85e6b74f17f63d0a07da233a/official/resnet/resnet_run_loop.py#L225