Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is an iterative method for optimizing a differentiable objective function; it is a stochastic approximation of gradient descent optimization. A recent article implicitly credits Herbert Robbins and Sutton Monro for developing SGD in their 1951 article titled "A Stochastic Approximation Method"; see Stochastic approximation for more information. It is called stochastic because samples are selected randomly (or shuffled) instead of as a single group (as in standard gradient descent) or in the order they appear in the training set.


Background

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),

where the parameter w which minimizes Q(w) is to be estimated. Each summand function Q_i is typically associated with the i-th observation in the data set (used for training).

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. In this case, Q_i(w) is the value of the loss function at the i-th example, and Q(w) is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w),

where η is a step size (sometimes called the learning rate in machine learning).

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.
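
As a rough illustration (not part of the original article), the following Python/NumPy sketch compares the cost of the full-batch gradient of a least-squares objective with a cheaper estimate computed from a random subset of summands; the data, function names, and batch size are illustrative assumptions.

import numpy as np

def full_batch_gradient(w, X, y):
    """Average of all summand gradients: (1/n) * sum_i grad Q_i(w)."""
    residuals = X @ w - y                      # requires a pass over all n examples
    return X.T @ residuals * 2.0 / len(y)

def sampled_gradient(w, X, y, batch_size, rng):
    """Unbiased estimate of the full gradient from a random subset of examples."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    residuals = X[idx] @ w - y[idx]
    return X[idx].T @ residuals * 2.0 / batch_size

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))               # hypothetical training inputs
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)
print(full_batch_gradient(w, X, y))            # costs O(n) summand-gradient evaluations
print(sampled_gradient(w, X, y, 32, rng))      # costs only O(batch_size) evaluations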


Iterative method

In stochastic (or "on-line") gradient descent, the true gradient of Q(w) is approximated by a gradient at a single example:

w := w - \eta \nabla Q_i(w).

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.

In pseudocode, stochastic gradient descent can be presented as follows:

  • Choose an initial parameter vector w and learning rate η.
  • Repeat until an approximate minimum is obtained:
    • Randomly shuffle the examples in the training set.
    • For i = 1, 2, ..., n, do:
      • w := w - η ∇Q_i(w).

A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a "mini-batch") at each step. This can perform significantly better than the "true" stochastic gradient descent described above, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step is averaged over more training examples.
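
A minimal mini-batch SGD loop in Python/NumPy might look like the sketch below; the quadratic loss, batch size, and learning rate are illustrative assumptions rather than part of the original text.

import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Least-squares fit of w via mini-batch SGD: w := w - lr * (mean gradient over the batch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle each pass to prevent cycles
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                         # one vectorized update per mini-batch
    return w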

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates η decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.


Example

Let's suppose we want to fit a straight line y = w_1 + w_2 x to a training set with observations (x_1, x_2, ..., x_n) and corresponding estimated responses (ŷ_1, ŷ_2, ..., ŷ_n) using least squares. The objective function to be minimized is:

Q(w) = \sum_{i=1}^{n} Q_i(w) = \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 = \sum_{i=1}^{n} \left(w_1 + w_2 x_i - y_i\right)^2.

The last line in the above pseudocode for this specific problem will become:

\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1} (w_1 + w_2 x_i - y_i)^2 \\ \frac{\partial}{\partial w_2} (w_1 + w_2 x_i - y_i)^2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.

The key difference compared to standard (batch) gradient descent is that only one piece of data from the dataset is used to calculate the step, and the piece of data is picked randomly at each step.
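
For concreteness, a small Python/NumPy sketch of this straight-line example is shown below; the synthetic data, learning rate, and number of passes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)            # hypothetical inputs
y = 3.0 + 2.0 * x + 0.1 * rng.normal(size=200)  # noisy line with true w1 = 3, w2 = 2

w1, w2 = 0.0, 0.0
eta = 0.05
for _ in range(20):                              # several passes over the training set
    for i in rng.permutation(len(x)):            # one randomly chosen example per step
        err = w1 + w2 * x[i] - y[i]
        w1 -= eta * 2.0 * err                    # d/dw1 of (w1 + w2*x_i - y_i)^2
        w2 -= eta * 2.0 * x[i] * err             # d/dw2 of (w1 + w2*x_i - y_i)^2

print(w1, w2)                                    # should end up close to (3, 2)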


Applications

Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models. When combined with the backpropagation algorithm, it is the de facto standard algorithm for training artificial neural networks. Its use has been also reported in the Geophysics community, specifically to applications of Full Waveform Inversion (FWI).

Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used. Stochastic gradient descent has been used since at least 1960 for training linear regression models, originally under the name ADALINE.

Another popular stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.


Extensions and variants

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function η_t of the iteration number t, giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on k-means clustering.
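
As a small illustration in Python, one possible decreasing schedule is sketched below; the specific decay rule and constants are assumptions, one choice among many.

def learning_rate(t, eta0=0.1, decay=0.01):
    """A decreasing learning-rate schedule: large steps early, fine-tuning later."""
    return eta0 / (1.0 + decay * t)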

Momentum

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' seminal paper on backpropagation learning. Stochastic gradient descent with momentum remembers the update Δw at each iteration, and determines the next update as a linear combination of the gradient and the previous update:

\Delta w := \alpha \Delta w - \eta \nabla Q_i(w)
w := w + \Delta w

that leads to:

w := w - \eta \nabla Q_i(w) + \alpha \Delta w

where the parameter w which minimizes Q(w) is to be estimated, η is a step size (sometimes called the learning rate in machine learning), and α is an exponential decay factor between 0 and 1 that controls the relative contribution of the current gradient and of earlier updates.

The name momentum stems from an analogy to momentum in physics: the weight vector w, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades.
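
A minimal sketch of the momentum update in Python; the gradient function grad_Q_i and the hyperparameter values are illustrative assumptions.

import numpy as np

def sgd_momentum(w, grad_Q_i, examples, eta=0.01, alpha=0.9, epochs=5, seed=0):
    """SGD with momentum: delta_w := alpha*delta_w - eta*grad; w := w + delta_w."""
    rng = np.random.default_rng(seed)
    delta_w = np.zeros_like(w)
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            delta_w = alpha * delta_w - eta * grad_Q_i(w, examples[i])
            w = w + delta_w
    return w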

Averaging

Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of

\bar{w} = \frac{1}{t} \sum_{i=0}^{t-1} w_i.

When optimization is done, this averaged parameter vector takes the place of w.
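
A sketch of this averaging layered on plain SGD in Python; grad_Q_i, the step size, and the number of passes are illustrative assumptions.

import numpy as np

def averaged_sgd(w, grad_Q_i, examples, eta=0.01, epochs=5, seed=0):
    """Ordinary SGD updates, plus a running average of the iterates w_0, ..., w_{t-1}."""
    rng = np.random.default_rng(seed)
    w_bar, t = np.zeros_like(w), 0
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            w_bar = (t * w_bar + w) / (t + 1)    # running mean of the iterates seen so far
            t += 1
            w = w - eta * grad_Q_i(w, examples[i])
    return w_bar                                  # the averaged vector takes the place of w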

AdaGrad

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, first published in 2011. Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate η, but this is multiplied with the elements of a vector {G_{j,j}}, which is the diagonal of the outer product matrix

G = \sum_{\tau=1}^{t} g_\tau g_\tau^{\mathsf{T}}

where g_τ = ∇Q_i(w), the gradient at iteration τ. The diagonal is given by

G_{j,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2.

This vector is updated after every iteration. The formula for an update is now

w := w - \eta \, \mathrm{diag}(G)^{-\frac{1}{2}} \circ g

or, written as per-parameter updates,

w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.

Each G_{i,i} gives rise to a scaling factor for the learning rate that applies to a single parameter w_i. Since the denominator in this factor, \sqrt{G_{i,i}} = \sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}, is the ℓ2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.
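
An AdaGrad-style per-parameter update might be sketched in Python as follows; grad_Q_i, the base rate, and the small stabilizing constant eps (not part of the formula above) are illustrative assumptions.

import numpy as np

def adagrad(w, grad_Q_i, examples, eta=0.1, eps=1e-8, epochs=5, seed=0):
    """Per-parameter learning rates: divide eta by the root of the summed squared gradients."""
    rng = np.random.default_rng(seed)
    G_diag = np.zeros_like(w)                    # running sum of squared gradients per parameter
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            g = grad_Q_i(w, examples[i])
            G_diag += g ** 2
            w = w - eta * g / (np.sqrt(G_diag) + eps)
    return w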

RMSProp

RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. First, the running average is calculated in terms of the mean square,

v(w, t) := \gamma \, v(w, t-1) + (1 - \gamma) \, (\nabla Q_i(w))^2

where γ is the forgetting factor.

The parameters are updated as

w := w - \frac{\eta}{\sqrt{v(w,t)}} \nabla Q_i(w)

RMSProp has shown excellent adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop, and is capable of working with mini-batches as well, as opposed to only full batches.
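
A sketch of the RMSProp update in Python; grad_Q_i, the value of γ, and the stabilizing constant eps are illustrative assumptions.

import numpy as np

def rmsprop(w, grad_Q_i, examples, eta=0.001, gamma=0.9, eps=1e-8, epochs=5, seed=0):
    """Divide the learning rate by a running RMS of recent gradient magnitudes per weight."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(w)                         # running mean of squared gradients
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):
            g = grad_Q_i(w, examples[i])
            v = gamma * v + (1.0 - gamma) * g ** 2
            w = w - eta * g / (np.sqrt(v) + eps)
    return w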

Adam

Adam (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters w^{(t)} and a loss function L^{(t)}, where t indexes the current training iteration (indexed at 0), Adam's parameter update is given by:

m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}
v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) (\nabla_w L^{(t)})^2
\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{\,t+1}}
\hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{\,t+1}}
w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}

where ε is a small scalar used to prevent division by 0, and β_1 and β_2 are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting is done elementwise.
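
A sketch of these updates in Python; the gradient function grad_L and the values β1 = 0.9, β2 = 0.999, ε = 1e-8 are illustrative assumptions (common defaults, not stated in the text above).

import numpy as np

def adam(w, grad_L, steps, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Running averages of gradients (m) and squared gradients (v), with bias correction."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):                       # t indexed at 0, as in the text
        g = grad_L(w, t)                         # gradient of the loss at iteration t
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** (t + 1))     # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** (t + 1))     # bias-corrected second moment
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w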

Natural Gradient Descent and kSGD

Kalman-based Stochastic Gradient Descent (kSGD) is an online and offline algorithm for learning parameters in statistical problems from quasi-likelihood models, which include linear models, non-linear models, generalized linear models, and neural networks with squared error loss as special cases. For online learning problems, kSGD is a special case of the Kalman Filter for linear regression problems, a special case of the Extended Kalman Filter for non-linear regression problems, and can be viewed as an incremental Gauss-Newton method. Moreover, because of kSGD's relationship to the Kalman Filter and natural gradient descent's relationship to the Kalman Filter, kSGD is a rigorous improvement over the popular natural gradient descent method.

The benefits of kSGD, in comparison to other methods, are (1) it is not sensitive to the condition number of the problem, (2) it has a robust choice of hyperparameters, and (3) it has a stopping condition. The drawbacks of kSGD are that the algorithm requires storing a dense covariance matrix between iterations, and requires a matrix-vector product at each iteration.

To describe the algorithm, suppose Q_i(w), where w ∈ ℝ^p, is defined by an example (Y_i, X_i) ∈ ℝ × ℝ^d such that

\nabla_w Q_i(w) = \frac{Y_i - \mu(X_i, w)}{V(\mu(X_i, w))} \nabla_w \mu(X_i, w)

where μ(X_i, w) is the mean function (i.e. the expected value of Y_i given X_i), and V(μ(X_i, w)) is the variance function (i.e. the variance of Y_i given X_i). Then, the parameter update, w(t+1), and covariance matrix update, M(t+1), are given by the following:

p = \nabla_w \mu(X_{t+1}, w(t))
m = \mu(X_{t+1}, w(t))
v = M(t) \, p
s = \min\{\gamma_1, \max\{\gamma_2, V(m)\}\} + v^{\mathsf{T}} p
w(t+1) = w(t) + \frac{Y_{t+1} - m}{s} \, v
M(t+1) = M(t) - \frac{1}{s} \, v v^{\mathsf{T}}

where γ_1, γ_2 are hyperparameters. The M(t) update can result in the covariance matrix becoming indefinite, which can be avoided at the cost of a matrix-matrix multiplication. M(0) can be any positive definite symmetric matrix, but is typically taken to be the identity. As noted by Patel, for all problems besides linear regression, restarts are required to ensure convergence of the algorithm, but no theoretical or implementation details were given. In a closely related, off-line, mini-batch method for non-linear regression analyzed by Bertsekas, a forgetting factor was used in the covariance matrix update to prove convergence.
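
For the special case of linear regression (μ(X, w) = Xᵀw and V ≡ 1), the updates above might be sketched in Python as follows; the hyperparameter values γ_1, γ_2 are illustrative assumptions, and M(0) is taken to be the identity as the text suggests is typical.

import numpy as np

def ksgd_linear(X, Y, gamma1=1.0, gamma2=0.1):
    """kSGD updates for linear regression: mu(x, w) = x @ w, V(mu) = 1."""
    n, d = X.shape
    w = np.zeros(d)
    M = np.eye(d)                                  # M(0): identity, as is typical
    for t in range(n):
        x = X[t]
        p = x                                      # gradient of mu with respect to w
        m = x @ w                                  # predicted mean
        v = M @ p
        s = min(gamma1, max(gamma2, 1.0)) + v @ p  # V(m) = 1 for linear regression
        w = w + (Y[t] - m) / s * v
        M = M - np.outer(v, v) / s
    return w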


See also

  • Coordinate descent - changes one coordinate at a time, rather than one example
  • Linear classifier
  • Online machine learning

Further reading

  • Bertsekas, Dimitri P. (1999), Nonlinear Programming (2nd ed.), Cambridge, MA.: Athena Scientific, ISBN 1-886529-00-0.
  • Bertsekas, Dimitri (2003), Convex Analysis and Optimization, Athena Scientific.
  • Bottou, Léon (2004), "Stochastic Learning", Advanced Lectures on Machine Learning, LNAI, 3176, Springer, pp. 146-168, ISBN 978-3-540-23122-6.
  • Davidon, W.C. (1976), "New least-square algorithms", Journal of Optimization Theory and Applications, 18 (2): 187-197, doi:10.1007/BF00935703, MR 0418461.
  • Duda, Richard O.; Hart, Peter E.; Stork, David G. (2000), Pattern Classification (2nd ed.), Wiley, ISBN 978-0-471-05669-0.
  • Kiwiel, Krzysztof C. (2004), "Convergence of approximate and incremental subgradient methods for convex optimization", SIAM Journal on Optimization, 14 (3): 807-840, doi:10.1137/S1052623400376366, MR 2085944. (Extensive list of references)
  • Snyman, Jan A.; Wilke, Daniel N. (2018), Practical Mathematical Optimization - Basic Optimization Theory and Gradient-Based Algorithms, Springer Optimization and Its Applications Vol. 133 (2 ed.), Springer, pp. xxvi+372, ISBN 978-3-319-77585-2. (Python module pmo.py)
  • Spall, James C. (2003), Introduction to Stochastic Search and Optimization, Wiley, ISBN 978-0-471-33052-3.

Software

  • sgd: an LGPL C++ library which uses stochastic gradient descent to fit SVM and conditional random field models.
  • CRF-ADF A C# toolkit of stochastic gradient descent and its feature-frequency-adaptive variation for training conditional random field models.
  • Vowpal Wabbit: BSD licence, fast scalable learning by John Langford and others. Includes several stochastic gradient descent variants. Source repository on github

External links

  • Using stochastic gradient descent in C++, Boost, Ublas for linear regression
  • Machine Learning Algorithms
