How to Choose an Optimal Learning Rate for Gradient Descent

One of the challenges of gradient descent is choosing the optimal value for the learning rate, eta (η). The learning rate is perhaps the most important hyperparameter (i.e. the parameters that need to be chosen by the programmer before executing a machine learning program) that needs to be tuned (Goodfellow 2016).

If you choose a learning rate that is too small, the gradient descent algorithm might take a really long time to find the minimum value of the error function. This defeats the purpose of gradient descent, which was to use a computationally efficient method for finding the optimal solution.

On the other hand, if you choose a learning rate that is too large, you might overshoot the minimum value of the error function, and may even never reach the optimal solution. 

Before we get into how to choose an optimal learning rate, it should be emphasized that there is no value of the learning rate that will guarantee convergence to the minimum value of the error function (assuming global) value of a function. 

Algorithms like logistic regression are based on gradient descent and are therefore what is known as “hill climber.” Therefore, for non-convex problems (where the graph of the error function contains local minima and a global minimum vs. convex problems where global minimum = local minimum), they will most likely get stuck in a local minimum. 

convex-vs-nonconvex-graphs

So with that background, we are ready to take a look at some options for selecting the optimal learning rate (eta) for gradient descent.

Choose a Fixed Learning Rate

The standard gradient descent procedure uses a fixed learning rate (e.g. 0.01) that is determined by trial and error. For example:

“Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than 10-6 but these should not be taken as strict ranges and greatly depend on the parametrization of the model. A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value. If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.”

Bengio (2013)

Use Learning Rate Annealing

Learning rate annealing entails starting with a high learning rate and then gradually reducing the learning rate linearly during training. The learning rate can decrease to a value close to 0.

The idea behind this method is to quickly descend to a range of acceptable weights, and then do a deeper dive within this acceptable range.

Use Cyclical Learning Rates

Cyclical learning rates have been proposed: 

“Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values.”

Leslie (2017)

Use an Adaptive Learning Rate

Another option is to use a learning rate that adapts based on the error output of the model. Here is what the experts say about adaptive learning rates.

Reed (1999) notes on page 72 of his book Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks that an adaptable learning rate is preferred over a fixed learning rate:

“The point is that, in general, it is not possible to calculate the best learning rate a priori. The same learning rate may not even be appropriate in all parts of the network. The general fact that the error is more sensitive to changes in some weights than in others makes it useful to assign different learning rates to each weight”

Reed (1999)

Melin et al. (2010) writes:

“First, the initial network output and error are calculated. At each epoch new weights and biases are calculated using the current learning rate. New outputs and errors are then calculated.

As with momentum, if the new error exceeds the old error by more than a predefined ratio (typically 1.04), the new weights and biases are discarded. In addition, the learning rate is decreased. Otherwise, the new weights, etc. are kept. If the new error is less than the old error, the learning rate is increased (typically by multiplying by 1.05).”

Melin et al. (2010)

References

Bengio, Y. Practical Recommendations for Gradient-based Training of Deep Architectures. In K.-R. Muller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade. Springer, 2013.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge, Massachusetts: The MIT Press, 2016. Print.

Melin, Patricia, Janusz Kacprzyk, and Witold Pedrycz. Soft Computing for Recognition Based on Biometrics. Berlin Heidelberg: Springer, 2010. Print.

Reed, Russell D., and Robert J. Marks. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. Cambridge, Mass: MIT Press, 1999. Print.

Smith, Leslie N. Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, pages 464–472.

Propositional Rules, Logical Decision Trees, and the Satisfiability Problem

Here is a question I received the other day: 

Can the propositional rules generated by logical decision trees be recast as a satisfiability problem, making decision trees NP-complete?

I’m going to break my answer down into segments.

What are Propositional Rules?

In order to understand what proposition rules are, it is important to understand what a proposition is. A proposition is the fundamental building block of logic. 

A proposition is a sentence that is either true or false, but not both. If a proposition is true, it is denoted by the capital letter T. If a proposition is false, it can be denoted by the capital letter F. Below are some examples of propositions:

  • The sky is blue. …. T
  • 10 + 10 = 20. ….T
  • The letter “p” is a vowel…..F

Given two propositions p and q, we can create a “if p then q” statement that is known as an implication. Implications can be written as p → q. Here is an example of an implication:

  • p = The relative humidity is greater than 70%
  • q = It is raining tonight
  • p → q = “If the relative humidity is greater than 70% then it is raining tonight.”

As seen in the real-world example above, propositional rules can be created by combining propositions in different ways.

What is the Satisfiability Problem?

If I have a propositional formula denoted as a vector of boolean variables, for example x = (x1 = F, x2 = T, x3 = F, x4 = T, x5 = F), can we find an assignment of values for these boolean variables that would evaluate the propositional formula to True? This statement denotes that satisfiability problem.

Consider, for example. The following formula:

  • F = x1 AND x2 AND x3 AND x4 AND x5

The formula denoted as F above will evaluate to TRUE if all the variables in the vector x are assigned true. Therefore, the formula is satisfiable.

This satisfiability problem was proven to be NP-complete by Stephen A. Cook.

What does NP-Complete Mean?

Computational problems that belong to the class of NP-complete problems are those that have no known efficient (polynomial time) solution algorithms. You can beat your head against a wall trying to find a fast solution to these sorts of problems. However, given solution to these NP-complete problems, you can verify that solution in polynomial time (i.e. quickly). 

An example of an NP-complete problem is the Vertex Cover Problem some of you might have seen in an algorithms or data structures class.

Can the Propositional Rules Generated by Logical Decision Trees Be Recast as a Satisfiability Problem, Making Decision Trees NP-Complete?

Yes they can. Once we have the propositional rules, we then need to either find a variable assignment that makes it satisfiable, or we can report that no such assignment exists.

Let’s suppose we are dealing with binary decision trees, and we have three variables, a, b, and c. There are 2^n possible truth assignments that would need to be verified.

decision-tree

By the way, there is a famous paper by Hyafil and Rivest proving that optimal construction of decision trees is NP-complete.

References

Cook, S. A. and Mitchell, D. G. “Finding Hard Instances of the Satisfiability Problem: A Survey.” In Satisfiability Problem: Theory and Applications (Piscataway, NJ, 1996) (Ed. D. Du, J. Gu, and P. M. Pardalos). Providence, RI: Amer. Math. Soc., pp. 1-17, 1997.

My Master’s Thesis

Straight from the vault, my Master’s thesis: An Analysis of the Robustness and Relevance of Meteorological Triggers for Catastrophe Bonds.

Abstract

Each year weather-related catastrophes account for an estimated United States dollars (USO) $40 billion in damage across the world, although only a fraction of this risk of loss is insured. Losses from hurricanes in the United States have increased over the past several years to the extent that many insurance companies have become increasingly reluctant to insure in certain locations along the coast. Several insurance companies have become insolvent as a result of the active hurricane seasons of 2004 and 2005. In order to cope with this hurricane risk, some insurance and reinsurance firms have shifted part of their risk to the capital markets in the form of catastrophe bonds.

Two problems are observed with catastrophe bonds based on parametric triggers (e.g. Saffir-Simpson scale rating of a hurricane at landfall). First, the trigger mechanisms are measured imprecisely, with the degree of imprecision depending on the choice of trigger mechanism, the available sensor systems, and the methods by which meteorologists analyze the resulting observations. Second, the trigger mechanisms might not relate well to the economic harm caused by the weather phenomena, suggesting that they were not selected on the basis of adequate understanding of relevant meteorology and its relationship to storm damage. Both problems are documented, and perhaps ameliorated in part, by a thorough study of the relevant meteorology and meteorological practices.

Development of a set of robust and relevant triggers for catastrophe bonds for hurricanes is the objective of this study. The real-time and post-landfall accuracy of measured hurricane parameters such as minimum central pressure and maximum sustained surface wind speed were analyzed. Linear regression and neural networks were then employed in order to determine the predictability of storm damage from these measurable hurricane parameters or combination thereof. The damage data set consisted of normalized economic losses for hurricane landfalls along the United States Gulf and Atlantic coasts from 1900 to 2005. The results reveal that single hurricane parameters and combinations of hurricane parameters can be poor indicators of the amount of storm damage.

The results suggest that modeled-loss type catastrophe bonds may be a potentially superior alternative to parametric-type bonds, which are highly sensitive to the accuracy of the measurements of the underlying storm parameters and to the coastal bathymetry, topography, and economic exposure. A procedure for determining the robustness of a risk model for use in modeled-loss type catastrophe bonds is also presented.