Learning Rate "Warm Up" in TensorFlow

From the official learning rate decay implementation.
Learning rate decay is one of the most common techniques in deep-learning training: shrinking the learning rate as training progresses lets the optimizer settle into deeper but narrower regions of the loss surface.
Reference [2] presents three learning-rate decay schemes and argues that step decay is a highly practical choice. The official TensorFlow models also ship an implementation of step-wise decay, excerpted below.
```python
import tensorflow as tf


def learning_rate_with_decay(
    batch_size, batch_denom, num_images, boundary_epochs, decay_rates,
    base_lr=0.1, warmup=False):
  """Get a learning rate that decays step-wise as training progresses.

  Args:
    batch_size: the number of examples processed in each training batch.
    batch_denom: this value will be used to scale the base learning rate.
      `0.1 * batch size` is divided by this number, such that when
      batch_denom == batch_size, the initial learning rate will be 0.1.
    num_images: total number of images that will be used for training.
    boundary_epochs: list of ints representing the epochs at which we
      decay the learning rate.
    decay_rates: list of floats representing the decay rates to be used
      for scaling the learning rate. It should have one more element
      than `boundary_epochs`, and all elements should have the same type.
    base_lr: Initial learning rate scaled based on batch_denom.
    warmup: Run a 5 epoch warmup to the initial lr.

  Returns:
    Returns a function that takes a single argument - the number of batches
    trained so far (global_step) - and returns the learning rate to be used
    for training the next batch.
  """
  initial_learning_rate = base_lr * batch_size / batch_denom
  batches_per_epoch = num_images / batch_size

  # Reduce the learning rate at certain epochs.
  # CIFAR-10: divide by 10 at epoch 100, 150, and 200
  # ImageNet: divide by 10 at epoch 30, 60, 80, and 90
  boundaries = [int(batches_per_epoch * epoch) for epoch in boundary_epochs]
  vals = [initial_learning_rate * decay for decay in decay_rates]

  def learning_rate_fn(global_step):
    """Builds scaled learning rate function with 5 epoch warm up."""
    lr = tf.train.piecewise_constant(global_step, boundaries, vals)
    if warmup:
      warmup_steps = int(batches_per_epoch * 5)
      warmup_lr = (
          initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(
              warmup_steps, tf.float32))
      return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
    return lr

  return learning_rate_fn
```
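To see how the returned closure is meant to be used, here is a rough usage sketch. The concrete numbers are illustrative ImageNet-like values echoing the comment in the code, not copied from the official repo, and the optimizer choice is likewise just an example:

```python
# Illustrative ImageNet-like configuration (assumed values, not from the repo):
learning_rate_fn = learning_rate_with_decay(
    batch_size=256, batch_denom=256, num_images=1281167,
    boundary_epochs=[30, 60, 80, 90],
    decay_rates=[1.0, 0.1, 0.01, 1e-3, 1e-4],  # one more entry than boundaries
    warmup=True)

global_step = tf.train.get_or_create_global_step()
learning_rate = learning_rate_fn(global_step)  # a tensor, re-evaluated each step
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
```

Because `learning_rate_fn` receives the `global_step` tensor rather than a Python integer, the schedule is evaluated inside the graph and the learning rate updates automatically as training advances.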
The core of the schedule is really the tf.train.piecewise_constant call, which the documentation explains well enough; a minimal sketch of its behavior follows.
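As an illustration only (the boundary and value numbers below are made up, not taken from the post), tf.train.piecewise_constant simply returns whichever constant corresponds to the interval that global_step currently falls in:

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# lr = 0.1 while step < 1000, 0.01 while 1000 <= step < 2000, 0.001 afterwards.
lr = tf.train.piecewise_constant(global_step,
                                 boundaries=[1000, 2000],
                                 values=[0.1, 0.01, 0.001])
```

What we actually care about here, though, is the warm-up branch: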
```python
if warmup:
  warmup_steps = int(batches_per_epoch * 5)
  warmup_lr = (
      initial_learning_rate * tf.cast(global_step, tf.float32) / tf.cast(
          warmup_steps, tf.float32))
  return tf.cond(global_step < warmup_steps, lambda: warmup_lr, lambda: lr)
```
With warm up enabled, whenever the current step is smaller than warmup_steps the learning rate equals the initial learning rate × (current step / warmup_steps). Since that ratio is below 1, the learning rate ramps up linearly over the whole warm-up phase; once warm up ends, the schedule hands over to the piecewise-constant decay.
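As a tiny sanity check of that formula (plain Python, with made-up numbers: 100 batches per epoch, hence warmup_steps = 500, and an initial learning rate of 0.1):

```python
initial_learning_rate = 0.1  # assumed result of base_lr * batch_size / batch_denom
warmup_steps = 500           # 5 epochs * 100 batches per epoch (made-up)

for step in (0, 100, 250, 499, 500):
    lr = (initial_learning_rate * step / warmup_steps
          if step < warmup_steps
          else initial_learning_rate)  # piecewise schedule takes over here
    print(step, round(lr, 4))
# 0 0.0 | 100 0.02 | 250 0.05 | 499 0.0998 | 500 0.1
```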
What is warm up actually for? As the name suggests, it is a warm-up phase. My personal guess is that it doubles as a search for a good learning rate: if the loss first falls and then rises during this stage, the configured base learning rate is probably too large, and the learning rate that worked best during warm up can serve as a reference point.