Machine Learning: Learning Rates

by David Mack

How to pick the best learning rate for your machine learning project

A common problem we all face when working on deep learning projects is choosing a learning rate and optimizer (the hyper-parameters). If you’re like me, you find yourself guessing an optimizer and learning rate, then checking if they work (and we’re not alone).

To better understand the effect of optimizer and learning rate choice, I trained the same model 500 times. The results show that the right hyper-parameters are crucial to training success, yet can be hard to find.

In this article, I’ll discuss solutions to this problem using automated methods to choose optimal hyper-parameters.

Experimental setup

I trained the basic convolutional neural network from TensorFlow’s tutorial series, which learns to recognize MNIST digits. This is a reasonably small network, with two convolutional layers and two dense layers, a total of roughly 3,400 weights to train. The same random seed is used for each training.

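For reference, here is a minimal sketch of a comparably sized model in tf.keras. The filter counts and layer widths are assumptions chosen to land near the roughly 3,400-weight scale described above, not the tutorial's exact code:

```python
import tensorflow as tf

tf.random.set_seed(0)  # fixed seed so each run starts identically (illustrative value)

def build_model():
    """Small MNIST classifier: two conv layers + two dense layers (~3,700 weights in this sketch)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(4, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
```
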
It should be noted that the results below are for one specific model and dataset. The ideal hyper-parameters for other models and datasets will differ.

(If you’d like to donate some GPU time to run a larger version of this experiment on CIFAR-10, please get in touch).

Which learning rate works best?

The first thing we’ll explore is how learning rate affects model training. In each run, the same model is trained from scratch, varying only the optimizer and learning rate.

The model was trained with 6 different optimizers: Gradient Descent, Adam, Adagrad, Adadelta, RMS Prop, and Momentum. For each optimizer, it was trained with 48 different learning rates, from 0.000001 to 100 at logarithmic intervals.

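Here is a sketch of how such a sweep might be set up. The log-spaced endpoints match the figures above; the optimizer classes are tf.keras equivalents of the ones named, and the momentum value of 0.9 is an assumption rather than a figure from the article:

```python
import numpy as np
import tensorflow as tf

# 48 learning rates, logarithmically spaced from 1e-6 to 100.
learning_rates = np.logspace(-6, 2, num=48)

# The six optimizer families used in the experiment (tf.keras equivalents).
optimizers = {
    "gradient_descent": lambda lr: tf.keras.optimizers.SGD(learning_rate=lr),
    "momentum": lambda lr: tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),  # 0.9 assumed
    "adam": lambda lr: tf.keras.optimizers.Adam(learning_rate=lr),
    "adagrad": lambda lr: tf.keras.optimizers.Adagrad(learning_rate=lr),
    "adadelta": lambda lr: tf.keras.optimizers.Adadelta(learning_rate=lr),
    "rmsprop": lambda lr: tf.keras.optimizers.RMSprop(learning_rate=lr),
}

# 6 optimizers x 48 learning rates = 288 (optimizer, learning rate) runs.
runs = [(name, lr) for name in optimizers for lr in learning_rates]
```
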
In each run, the network is trained until it achieves at least 97% train accuracy. The maximum time allowed was 120 seconds. The experiments were run on an Nvidia Tesla K80, hosted by FloydHub. The source code is available for download.

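The stopping rule (reach 97% training accuracy, give up after 120 seconds) could be expressed with Keras callbacks along these lines. This is an illustrative reconstruction, not the original training loop:

```python
import time
import tensorflow as tf

class StopAtAccuracy(tf.keras.callbacks.Callback):
    """Stop once training accuracy reaches the target (97% here)."""
    def __init__(self, target=0.97):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        # Assumes the model was compiled with metrics=["accuracy"].
        if logs and logs.get("accuracy", 0.0) >= self.target:
            self.model.stop_training = True

class TimeLimit(tf.keras.callbacks.Callback):
    """Abort training after a fixed wall-clock budget (120 s here)."""
    def __init__(self, seconds=120):
        super().__init__()
        self.seconds = seconds

    def on_train_begin(self, logs=None):
        self.start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        if time.time() - self.start > self.seconds:
            self.model.stop_training = True

# Usage: model.fit(x_train, y_train, epochs=1000,
#                  callbacks=[StopAtAccuracy(0.97), TimeLimit(120)])
```
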
Here is the training time for each choice of learning rate and optimizer:

The above graph is interesting. We can see that:

  • For every optimizer, the majority of learning rates fail to train the model.

  • There is a valley shape for each optimizer: too low a learning rate never progresses, and too high a learning rate causes instability and never converges. In between, there is a band of “just right” learning rates that successfully train.

  • There is no learning rate that works for all optimizers.

  • Learning rate can affect training time by an order of magnitude.

To summarize the above, it’s crucial that you choose the correct learning rate. Otherwise your network will either fail to train, or take much longer to converge.

To illustrate how each optimizer differs in its optimal learning rate, here is the fastest and slowest model to train for each learning rate, across all optimizers. Notice that the slowest time hits the 120-second maximum (meaning the network failed to train) across the whole graph; there is no single learning rate that works for every optimizer:

Check out the wide range of learning rates (from 0.001 to 30) that achieve success with at least one optimizer from the above graph.

Which optimizer performs best?

Now that we’ve identified the best learning rates for each optimizer, let’s compare the performance of each optimizer training with the best learning rate found for it in the previous section.

Here is the validation accuracy of each optimizer over time. This lets us observe how quickly, accurately, and stably each performs:

A few observations:

  • All of the optimizers, apart from RMSProp (see final point), manage to converge in a reasonable time.

  • Adam learns the fastest.

  • Adam is more stable than the other optimizers, and it doesn’t suffer any major decreases in accuracy.

  • RMSProp was run with the default arguments from TensorFlow (decay rate 0.9, epsilon 1e-10, momentum 0.0), and it could be that these do not work well for this task; a constructor sketch follows this list. This is a good use case for automated hyper-parameter search (see the last section for more about that).

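For concreteness, here is how those defaults appear when constructing the optimizer with the TF1-style API, which exposes them directly; the learning rate shown is illustrative:

```python
import tensorflow.compat.v1 as tf1

# TF1-style RMSProp with the defaults quoted above. In tf.keras.optimizers.RMSprop
# the corresponding knobs are rho, momentum, and epsilon (whose default is 1e-07).
optimizer = tf1.train.RMSPropOptimizer(
    learning_rate=0.001,  # illustrative, not a value from the article
    decay=0.9,
    momentum=0.0,
    epsilon=1e-10,
)
```
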
Adam also had a relatively wide range of successful learning rates in the previous experiment. Overall, Adam is the best choice of our six optimizers for this model and dataset.

How does model size affect training time?

Now let's look at how the size of the model affects how it trains.

We’ll vary the model size by a linear factor. That factor will linearly scale the number of convolutional filters and the width of the first dense layer, thus approximately linearly scaling the total number of weights in the model.

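One possible reading of that scaling, reusing the hypothetical layout from the earlier model sketch (the base filter counts and widths remain assumptions):

```python
import tensorflow as tf

def build_scaled_model(scale=1):
    """Scale the conv filter counts and the first dense layer's width by a linear factor."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(4 * scale, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(8 * scale, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(16 * scale, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# build_scaled_model(1) through build_scaled_model(10) gives the 1x to 10x family.
```
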
There are two aspects we’ll investigate:

  1. How does the training time change as the model grows, for a fixed optimizer and training rate?

  2. Which learning rate trains fastest on each size of model, for a fixed optimizer?

How does training time change as the model grows?

Below is the time taken to achieve 96% training accuracy on the model as its size is increased from 1x to 10x. We've used one of our most successful hyper-parameters from earlier:

  • The time to train grows linearly with the model size.

  • The same learning rate successfully trains the network across all model sizes.

(Note: the following results can only be relied upon for the dataset and models tested here, but could be worth testing for your experiments.)

This is a nice result. Our choice of hyper-parameters was not invalidated by linearly scaling the model. This may hint that hyper-parameter search can be performed on a scaled-down version of a network, to save on computation time.

This also shows that, as the network gets bigger, it doesn’t incur any O(n²) work in converging the model (the linear growth in time can be explained by the extra operations incurred for each weight’s training).

This result is further reassuring, as it shows our deep learning framework (here TensorFlow) scales efficiently.

Which learning rate performs best for different sizes of model?

Let’s run the same experiment for multiple learning rates and see how training time responds to model size:

  • Learning rates 0.0005, 0.001, 0.00146 performed best — these also performed best in the first experiment. We see here the same “sweet spot” band as in the first experiment.

  • Each learning rate’s time to train grows linearly with model size.

  • Learning rate performance did not depend on model size. The same rates that performed best for 1x size performed best for 10x size.

  • Above 0.001, increasing the learning rate increased the time to train and also increased the variance in training time (as compared to a linear function of model size).

  • Time to train can roughly be modeled as c + kn for a model with n weights, a fixed cost c, and a learning constant k = f(learning rate); a rough fitting sketch follows this list.

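As a quick illustration of that model, c and k can be recovered from (model size, training time) pairs with a straight-line fit. The data below is synthetic, generated purely to demonstrate the fit, and is not taken from the experiments:

```python
import numpy as np

# Synthetic demo data (NOT measurements from the article): times generated from
# t = c + k*n with c = 5 s and k = 0.002 s/weight, plus a little noise.
rng = np.random.default_rng(0)
n = np.linspace(3_400, 34_000, 10)               # model sizes (weights), 1x to 10x
t = 5.0 + 0.002 * n + rng.normal(0, 1, n.size)   # simulated training times (s)

# Fit t ~ c + k*n; np.polyfit returns [k, c] for a degree-1 polynomial.
k, c = np.polyfit(n, t, 1)
print(f"fixed cost c = {c:.1f} s, learning constant k = {k:.5f} s/weight")
```
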
In summary, the best performing learning rate for size 1x was also the best learning rate for size 10x.

Automating choice of learning rate

As the earlier results show, it's crucial for model training to have a good choice of optimizer and learning rate.

Manually choosing these hyper-parameters is time-consuming and error-prone. As your model changes, the previous choice of hyper-parameters may no longer be ideal. It is impractical to continually perform new searches by hand.

There are a number of ways to automatically pick hyper-parameters. I’ll outline a couple of different approaches here.

Grid search is what we performed in the first experiment — for each hyper-parameter, create a list of possible values. Then for each combination of possible hyper-parameter values, train the network and measure how it performs. The best hyper-parameters are those that give the best observed performance.

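A minimal sketch of that procedure, reusing the hypothetical optimizers and learning_rates from the earlier sweep sketch and an assumed train_and_time helper that trains a fresh model and returns the seconds needed to reach the target accuracy:

```python
import itertools

def grid_search(optimizers, learning_rates, train_and_time):
    """Try every (optimizer, learning rate) combination and keep the fastest one."""
    best = None
    for name, lr in itertools.product(optimizers, learning_rates):
        seconds = train_and_time(optimizers[name](lr))  # train a fresh model, measure time
        if best is None or seconds < best[2]:
            best = (name, lr, seconds)
    return best  # (optimizer name, learning rate, seconds to target accuracy)
```
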
Grid search is very easy to implement and understand. It's also easy to verify that you've searched a sufficiently broad section of the parameter space. It's very popular in research for these reasons.

Population-based training

Population-based training (DeepMind) is an elegant implementation of a genetic-algorithm approach to hyper-parameter choice.

In PBT, a population of models is created. They are all continuously trained in parallel. When any member of the population has trained for long enough to show improvement, its validation accuracy is compared to the rest of the population. If its performance is in the lowest 20%, then it copies and mutates the hyper-parameters and variables of one of the top 20% of performers.

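A schematic sketch of that exploit-and-explore step. This is not DeepMind's implementation; the member objects, with their validation_accuracy, copy_from, and learning_rate attributes, are hypothetical stand-ins for real weight copying and hyper-parameter mutation:

```python
import random

def pbt_step(population, bottom_frac=0.2, top_frac=0.2):
    """One exploit/explore pass: the bottom performers copy and mutate the top performers."""
    ranked = sorted(population, key=lambda m: m.validation_accuracy)
    bottom = ranked[:max(1, int(len(ranked) * bottom_frac))]
    top = ranked[-max(1, int(len(ranked) * top_frac)):]
    for member in bottom:
        parent = random.choice(top)
        member.copy_from(parent)  # exploit: copy weights and hyper-parameters
        # explore: perturb the copied learning rate (0.8/1.2 is a common choice, assumed here)
        member.learning_rate = parent.learning_rate * random.choice([0.8, 1.2])
```
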
In this way, the most successful hyper-parameters spawn many slightly mutated variants of themselves and the best hyper-parameters are likely discovered.

Next steps

Thanks for reading this investigation into learning rates. I began these experiments out of my own curiosity and frustration around hyper-parameter tuning, and I hope you enjoy the results and conclusions as much as I have.

If there is a particular topic or extension you’re interested in seeing, let me know. Also, if you’re interested in donating some GPU time to run a much bigger version of this experiment, I’d love to talk.

These writings are part of a year-long exploration of AI architecture topics. Follow this publication (and give this article some applause!) to get updates when the next pieces come out.

Translated from: https://www.freecodecamp.org/news/how-to-pick-the-best-learning-rate-for-your-machine-learning-project-9c28865039a8/
